VBO Performance

I’ve noticed that using VBOs instead of traditional vertex arrays causes a severe performance drop: the frame rate falls by about 25% on average.

I’m not sure why this happens. Is it:

  1. The driver that’s screwed up? Then why do vertex arrays work just fine? Is it that hard for an IHV to figure out how its cards handle vertex data internally? Then again, how did they get it right for VAs?

  2. The demo that’s screwed up? Could it be GL command ordering? But then why does anything show on the screen at all, if commands have to be issued in a certain order?

  3. Or does there have to be at least a BIG number of vertices before you see any performance difference, so that otherwise it’s a loss rather than a gain?

Thanks.

Regarding #3: according to this thread http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=274549#Post274549
the limit seems to be somewhere around 1000-5000 polygons before the speedup in rendering overcomes the extra setup time compared to vertex arrays.

The first value in those tables is the number of “objects” rendered, the second value is the number of vertices per (buffer) object.
In that app it needed about 18 vertices per buffer object for it to become more efficient to use VBOs than traditional vertex arrays.
That was with position (4 floats) + color (4 unsigned bytes) interleaved in one buffer object.

So what is the best way to render geometry that consists of an extremely large number of draw calls (10000-20000), each having only up to 10 vertices?

On NVidia, immediate mode seems to be the best overall option by far, but I’m having problems with this on ATI. My test program works great on anything from the lowest GeForce 8xxx to the GTX 280, but the supposedly mega-powerful ATI HD 5870 barely matches a GeForce 8600.

I’m also having problems here with VBOs. On NVidia I get no performance boost, and on ATI it even gets slower. So, any idea what I could do to make ATI work better? I already tried some grouping so that I can merge some stuff, but the maintenance overhead and CPU cache misses I take far outweigh the improvements on the GPU side.

Are there any state changes between these draw calls? I’ll assume not. But if so that can easily become the limiting factor beyond draw call submission overhead. Be sure to do state sorting so that you minimize the number of state changes required, and maximize the number of times you have no state changes (or at least only cheap changes) between batches.

Also, consider whether this is one of those cases where GPU instancing makes sense (draw_instanced or instanced_arrays). That’ll reduce the number of draw calls you have to make, and thus the draw call submission overhead.

But if not, or if that’s not practical for some reason…

On NVidia, immediate mode seems to be the best overall option by far, but I’m having problems with this on ATI. My test program works great on anything from the lowest GeForce 8xxx to the GTX 280, but the supposedly mega-powerful ATI HD 5870 barely matches a GeForce 8600.

For large numbers of batches, batch submission (CPU overhead) can easily become the bottleneck. Vertex arrays can definitely be faster than VBOs in such circumstances.

But for the absolute fastest batch submission (on NVidia), try geometry-only display lists. That is, compile each one of your batches into its own display list (i.e. buffer binds, vtx attrib ptr sets, vtx attrib enables, and batch draw call). Then when rerendering the scene, just call the display list to submit the batch. This gets rid of all the overhead of VBO buffer binds, make-resident calls, and a lot of needless VBO-handle-to-GPU-address translation, which (per NVidia’s bindless docs) thrashes the CPU caches and leaves you blocked waiting on CPU memory accesses a lot. This change is also crazy-simple to implement. See what your perf is with this.

If you like it, on NVidia at least, you can get this perf or darn close without using display lists merely by switching to VBOs+bindless (specifically NV_vertex_buffer_unified_memory). Essentially all this does over pure VBOs is allow you to feed OpenGL the GPU addresses of VBOs for vertex attribute and index lists directly, rather than make the driver do VBO handle-to-address lookups for every single vtx attrib pointer or element array set you do. Really simple conceptually, but a big performance win!

I’m also having problems here with VBOs. On NVidia I get no performance boost, and on ATI it even gets slower. So, any idea what I could do to make ATI work better?

No clue; we haven’t been able to get our apps to work on ATI yet. ATI reportedly doesn’t have quite the stellar display-list performance that NVidia does, and ATI doesn’t have bindless…

So without profiling your specific application and running tests, the best I can suggest is to try to cram your problem into geometry instancing, or maybe to pack multiple batches into shared VBOs to reduce buffer binds, on the off chance that’s your bottleneck…

Also (independent of vendor) consider whether you can spatially group some of your batches together into larger batches to reduce batch submission (CPU) overhead.

So what is the best way to render geometry that consists of an extremely large number of draw calls (10000-20000), each having only up to 10 vertices?

That really depends on the usage case. There are many ways that a draw loop for such a list of objects could look. For example, suppose you have this:


for(Each Object)
{
  Bind Object.Program
  Set Object.Uniforms
  Bind Object.Textures
  Bind Object.VAO (buffer objects)
  glDraw*
}

In this case, each object has a program, a set of uniforms and textures, a VAO (or, if you’re allergic to VAOs, the set of array object state), and a draw command.

This is the worst possible way to draw, well, anything. If this is your code, the very first thing you need to do is just some state sorting. Make it into something like this:


for(Each Program)
{
  Bind Program
  for(Objects using Program1)
  {
    Set Object.Uniforms
    Bind Object.Textures
    Bind Object.VAO (buffer objects)
    glDraw*
  }
}

This is the most basic of state sorting: sorting renderable things based on what program object they use. Minimizing the number of glUniform and glBindTexture calls that are made is the next step. This will require some data analysis on your part.

After that, you’re going to want to start sorting by VAO. This will of course mean that multiple objects will be sharing the same buffer object and vertex format. This shouldn’t be hard.

It changes the code to look like this:


for(Each Set)
{
  Bind Set.Program if different from currently bound
  Bind Set.VAO if different from currently bound
  for(Objects in Set)
  {
    Set Object.Uniforms
    Bind Object.Textures
    glDraw*
  }
}

Where a “Set” is a set of objects that all use the same Program and VAO.

Now, if you’ve got things optimized to where you are changing only the absolute bare minimum state between draw calls, then you’re ready to talk about actual optimizations for drawing.

If you are drawing lots of copies of the same object (or, depending on the case, small permutations of the same object that you could choose in a vertex shader) then instancing is a tool you’ll be interested in. It would allow you to do this:


for(Each Set)
{
  Bind Set.Program if different from currently bound
  Bind Set.VAO if different from currently bound
  for(Objects in Set)
  {
    Add per-object state to the state buffer object or texture
  }
  Bind state buffer object or texture
  glDraw*Instanced(#Objects in Set)
}

One draw call would draw all of the objects.

There are a lot of caveats when using instancing. If you want to go that route, do some investigating.