Nvidia Memory Corruption

Has anyone experienced GPU memory corruption on a dual-GPU system? We’ve seen both texture buffers and streaming VBOs become corrupted, but only on dual-GPU systems. I think the cards are 780s. We have yet to see corruption on a single-GPU system. The temperatures seem OK on both cards.

Does it only relate to buffer objects, or does it also happen to (non-buffer) textures?

Only buffer objects. Other textures seem OK. It doesn’t seem to matter whether or not the buffer objects use bindless access.

With Nvidia’s help, I finally figured out the source of the problem. We had faulty code in our application that was mixing sampler and texture types. In some cases, if a texture did not exist, our app created a small, default 2D texture. However, associating the 2D texture with a samplerBuffer, a sampler2DMS, or a sampler1D, for example, was enough to cause VBO and texture buffer corruption. Also, according to Nvidia, rendering corruption may occur if a non-depth texture is mapped to a shadow sampler. We also saw that glXSwapBuffers could take up to 12 seconds.
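To make the mismatch concrete, here’s a rough reconstruction of the faulty fallback path (names like u_lookup and lookupUnit are hypothetical; assumes a GL loader header such as GLEW or glad is already included):

[code]
// The shader declares, say, a buffer sampler:
//
//     uniform samplerBuffer u_lookup;   // expects a GL_TEXTURE_BUFFER
//
// but when the real texture was missing, the fallback bound a plain 2D
// texture to that same unit:
GLuint fallbackTex;
glGenTextures(1, &fallbackTex);
glBindTexture(GL_TEXTURE_2D, fallbackTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1, 1, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, nullptr);   // 1x1 default texture

glActiveTexture(GL_TEXTURE0 + lookupUnit);   // lookupUnit feeds u_lookup
glBindTexture(GL_TEXTURE_2D, fallbackTex);   // 2D target, buffer sampler:
                                             // no GL error, but undefined
[/code]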

Apparently, the corruption does not happen on Fermi cards. It can happen on Keplers. We aren’t sure why we saw it only on systems with two Keplers.

Mixing texture types and samplers is clearly “undefined”. However, the driver did not report any OpenGL errors. When running in a debug context, the only warning that was printed was for mixing non-depth textures and shadow samplers. So, please be careful!
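For anyone else chasing this: the debug-context warning above comes through the standard GL 4.3 / KHR_debug callback. A minimal setup looks like this (assumes the context was created with the debug flag and a GL loader header is included):

[code]
#include <cstdio>

// Route GL debug messages to stderr. GLAPIENTRY comes from the GL headers.
static void GLAPIENTRY debugCallback(GLenum source, GLenum type, GLuint id,
                                     GLenum severity, GLsizei length,
                                     const GLchar* message, const void* user)
{
    std::fprintf(stderr, "GL debug: %s\n", message);
}

void installDebugOutput()
{
    glEnable(GL_DEBUG_OUTPUT);
    glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS);   // report at the offending call
    glDebugMessageCallback(debugCallback, nullptr);
}
[/code]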

You can run glValidateProgram() when debugging, just before the draw call, to check for problems like this. It should report that your sampler type and the bound texture type don’t match.
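Something like this in a debug build (a minimal sketch; assumes a GL loader header is included):

[code]
#include <cstdio>
#include <vector>

// Validate the program against the *current* GL state, just before drawing.
void validateBeforeDraw(GLuint program)
{
    glValidateProgram(program);

    GLint ok = GL_FALSE;
    glGetProgramiv(program, GL_VALIDATE_STATUS, &ok);
    if (ok != GL_TRUE) {
        GLint len = 0;
        glGetProgramiv(program, GL_INFO_LOG_LENGTH, &len);
        std::vector<GLchar> log(len > 0 ? len : 1);
        glGetProgramInfoLog(program, (GLsizei)log.size(), nullptr, log.data());
        std::fprintf(stderr, "glValidateProgram: %s\n", log.data());
    }
}
[/code]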

That’s a very useful tip. I’ll give that a shot. Thanks so much.

Really? No GL error but actual GPU memory corruption caused by reading from a texture of the wrong type? Wow. I’m going to have to remember that one.

By undefined, I’d expect bogus reads but not memory corruption.

Let us know if glValidateProgram catches it at least.

[QUOTE]the corruption does not happen on Fermi cards … It can happen on Keplers.[/QUOTE]

I wonder why the difference?

[QUOTE=Dark Photon;1264909]

Let us know if glValidateProgram catches it at least.

I wonder why the difference?[/QUOTE]

glValidateProgram did not catch any errors on my 780.

Might be due to the fact that OpenGL allows multiple textures to be bound to the same unit as long as they’re of different texture types. In that case, validation may pass because the driver looks at the 2D binding for that unit, sees nothing, and accepts it (though it should complain that no 2D texture is attached to a sampler2D; that seems like a bug). The TBO binding for that unit may not be checked at all. Per-target texture bindings are one of the most bizarre features brought to the table by GL’s legacy, IMO.
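To make that concrete (hypothetical names): both of these bindings can coexist on unit 0 with no error at all.

[code]
// Each texture unit keeps an independent binding per target, so this is
// perfectly legal on its own:
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D,     tex2D);    // unit 0, 2D target
glBindTexture(GL_TEXTURE_BUFFER, texBuf);   // unit 0, buffer target

// Which binding a shader reads depends on the sampler type declared in
// GLSL: sampler2D pulls the 2D binding, samplerBuffer the buffer binding.
// Sampling unit 0 through two *different* sampler types in one program is
// what validation is supposed to flag.
[/code]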

I fixed all the sampler issues so the debug context is happy, but we still have rendering corruption. The corruption happens only on dual-780 systems when we enable “prerendering”. Our prerendering code cycles through all of our shader programs and renders a quad with each of them; the goal is to prevent first-frame render time spikes that can happen when uniforms change and the driver recompiles shaders. I need to scrutinize our code more carefully, but it’s frustrating that there are no GL errors.
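For reference, the warm-up pass is essentially this (simplified sketch, hypothetical names):

[code]
#include <vector>

// Draw one throwaway quad per program so the driver finishes any deferred
// shader recompiles before the first real frame.
void prewarmPrograms(const std::vector<GLuint>& programs, GLuint quadVao)
{
    glBindVertexArray(quadVao);
    for (GLuint program : programs) {
        glUseProgram(program);
        // ...set the uniform combinations used at runtime...
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    }
    glFinish();   // force the driver to do the work now, not mid-frame
}
[/code]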

Does the corruption correlate to prerendering with a specific shader program or set of shader programs?

Is it possible that some of the shaders are pulling from resources (vertex attribs, uniforms, UBOs, texture units, textures (if using bindless textures), GPU addresses, etc.), or worse, pushing to resources (e.g. transform feedback, atomics, store/load on image objects, GPU addresses, etc.) that aren’t being provided to the shader with the quad rendering?
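One way to narrow that down would be to wrap the warm-up draws like this (a rough sketch; it rules out fragment-stage writes, though not vertex-stage image stores or transform feedback):

[code]
glEnable(GL_RASTERIZER_DISCARD);    // primitives are discarded before
                                    // rasterization, so no fragment shader runs
// ...prewarm draws here; also bind small dummy buffers anywhere a shader
// could write, e.g.:
//     glBindBufferBase(GL_SHADER_STORAGE_BUFFER, i, dummyBuf);
//     glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, i, dummyBuf);
glDisable(GL_RASTERIZER_DISCARD);
[/code]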