NV_primitive_restart performance

Has anyone tried this extension and gotten better performance?

What’s the trick? All I’ve seen so far is a large performance loss.

If I draw 1000 indexed instances of a single model with it enabled, frame time jumps up 21X and GPU Idle leaps to ~40%.

Even more puzzling, if I draw a basic scene completely devoid of indexed primitives and enable it (you’d think that would be a no-op), frame time jumps up 15X and, again, GPU Idle leaps to ~40%.

This looks like a CPU slow path. But how? Feel free to slap me around and tell me what I’m doing wrong.

FWIW, this is with the NVIDIA 1.0-9640 drivers, but I’ve tried older drivers with the same result.

 
  // Primitive restart is always on, and always set to 0xFFFF
  glEnableClientState( GL_PRIMITIVE_RESTART_NV );
  glPrimitiveRestartIndexNV( 0xFFFF );
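
For reference, the indexed draws themselves look roughly like this (just a sketch; the index values, count, and the vertexData array are made up for illustration, and everything is a plain client-side array, no VBOs):

  // Strips concatenated into one index array, separated by the 0xFFFF restart index
  GLushort indices[] = { 0, 1, 2, 3,  0xFFFF,  4, 5, 6, 7 };

  glEnableClientState( GL_VERTEX_ARRAY );
  glVertexPointer( 3, GL_FLOAT, 0, vertexData );   // client-side vertex array, no VBO

  // 9 indices total, including the restart marker
  glDrawElements( GL_TRIANGLE_STRIP, 9, GL_UNSIGNED_SHORT, indices );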

Are you using VBOs (vertex buffer objects)?

VBO + primitive restart == good
non-VBO + primitive restart == not so good

Primitive restart is a hardware feature that’s basically “free” when hardware is pulling vertices directly from a buffer that you provide.

If you are using vertex arrays in system memory, the driver has to assemble the vertices and transfer them to the hardware, which won’t pull directly from your application’s memory space. Primitive restart complicates the logic needed to copy vertex data from your buffers. If you have primitive restart enabled, the NVIDIA OpenGL driver ends up using a path that is somewhat slower than what you’d get with primitive restart disabled, but one that handles it correctly.
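
If it helps, here’s the rough shape of the VBO path I’m describing (a minimal sketch; the buffer name, sizes, and counts are illustrative):

  // Store the vertex data in a buffer object so hardware can pull it directly.
  GLuint vbo;
  glGenBuffers( 1, &vbo );
  glBindBuffer( GL_ARRAY_BUFFER, vbo );
  glBufferData( GL_ARRAY_BUFFER, vertexBytes, vertexData, GL_STATIC_DRAW );

  // With a buffer bound, the "pointer" argument becomes a byte offset into the VBO.
  glEnableClientState( GL_VERTEX_ARRAY );
  glVertexPointer( 3, GL_FLOAT, 0, (const GLvoid*)0 );

  glEnableClientState( GL_PRIMITIVE_RESTART_NV );
  glPrimitiveRestartIndexNV( 0xFFFF );

  glDrawElements( GL_TRIANGLE_STRIP, indexCount, GL_UNSIGNED_SHORT, indexData );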

A couple other gotchas:

  • “Using VBOs” means all enabled arrays are stored in VBOs. If you have any enabled array that isn’t in a VBO, the hardware won’t pull any of them directly.

  • Current NVIDIA hardware doesn’t directly pull all vertex array formats supported by the OpenGL API. If you use a goof-ball format, a very large stride, or do something else odd, we will end up back in the same software logic. The safe path here is to use floats of any component count and 4-component unsigned bytes (e.g., for color); see the sketch after this list. There may be a couple other formats that work, but I can’t remember which ones.
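
For example, an interleaved layout along these lines should stay on the hardware path (a hypothetical struct; the offsets are just for illustration, and the pointer arguments are assumed to be byte offsets into a bound VBO):

  // Hypothetical interleaved vertex: 3 floats position, 3 floats normal, 4 ubytes color
  struct Vertex { GLfloat pos[3]; GLfloat nrm[3]; GLubyte rgba[4]; };   // 28-byte stride

  glVertexPointer( 3, GL_FLOAT,         sizeof(struct Vertex), (const GLvoid*)0  );  // floats: fine
  glNormalPointer(    GL_FLOAT,         sizeof(struct Vertex), (const GLvoid*)12 );  // floats: fine
  glColorPointer(  4, GL_UNSIGNED_BYTE, sizeof(struct Vertex), (const GLvoid*)24 );  // 4 ubytes: fine
  // Something like a 3-component unsigned-byte color, or a huge stride, could
  // drop you back onto the software logic described above.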

Hope this helps,
Pat

Thanks, Pat. You nailed it. As a first test, I wasn’t using VBOs. I wouldn’t have guessed that would flip the driver into a slow path, since I assumed it simply DMAed any CPU-side index and vertex attribute arrays to the GPU before each glDrawRangeElements call.

However, as you said, this case does render correctly.

Also, by “using VBOs”, do you mean the index array should be in a VBO as well?