VBO vs vertex arrays for particles

My particles are rendering 30-40% faster with vertex arrays than with VBOs, so I’m guessing I’m doing something wrong. I draw up to 20 particle systems containing up to 400 particles each. Under the heaviest load, I’m using around 20-250 particles in each of the 20 systems for a total of about 2000 particles.

To draw each particle system with vertex arrays:

glInterleavedArrays(GL_T4F_C4F_N3F_V4F, 0, (GLvoid*)mVertData);
glDrawArrays(GL_QUADS, 0, num_particles * 4);

To draw with VBOs:

mVBOArrayId and mVBOIndexId are appropriately set up as GL_DYNAMIC_DRAW and GL_STATIC_DRAW, respectively.

glBindBuffer(GL_ARRAY_BUFFER, mVBOArrayId);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer(4, GL_FLOAT, PARTICLE_VERT_SIZE, BUFFER_OFFSET(0));
glEnableClientState(GL_COLOR_ARRAY);
glColorPointer(4, GL_FLOAT, PARTICLE_VERT_SIZE, BUFFER_OFFSET(16));
glEnableClientState(GL_NORMAL_ARRAY);
glNormalPointer(GL_FLOAT, PARTICLE_VERT_SIZE, BUFFER_OFFSET(32));
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(4, GL_FLOAT, PARTICLE_VERT_SIZE, BUFFER_OFFSET(48));
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, mVBOIndexId);
glDrawElements(GL_TRIANGLES, num_particles * 6, GL_UNSIGNED_SHORT, BUFFER_OFFSET(0));
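For reference, the index buffer behind the glDrawElements call above needs 6 indices per particle. Here is a minimal sketch of how it could be filled, assuming each particle occupies 4 consecutive vertices wound around the quad. The helper name build_quad_indices is hypothetical and not from the original post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes 6 indices per particle so that each
   4-vertex quad is drawn as two triangles (0,1,2) and (0,2,3).
   Assumes the quad's vertices are stored consecutively, in perimeter order.
   Returns the number of indices written. */
size_t build_quad_indices(unsigned short *indices, size_t num_particles)
{
    /* GL_UNSIGNED_SHORT can address at most 65536 vertices (16384 quads). */
    assert(num_particles <= 65536 / 4);

    size_t n = 0;
    for (size_t p = 0; p < num_particles; ++p) {
        unsigned short base = (unsigned short)(p * 4);
        indices[n++] = base;
        indices[n++] = (unsigned short)(base + 1);
        indices[n++] = (unsigned short)(base + 2);
        indices[n++] = base;
        indices[n++] = (unsigned short)(base + 2);
        indices[n++] = (unsigned short)(base + 3);
    }
    return n;
}
```

Since the pattern never changes, one such array sized for the largest system (400 particles, 2400 indices) could be uploaded once, which fits the GL_STATIC_DRAW usage mentioned for mVBOIndexId.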

So far I’ve also tried using GL_QUADS and GL_UNSIGNED_INT for the indices, and I’ve tried putting all the vertex data in one buffer and drawing all particles from all particle systems with one call to glDrawElements. None of this helps. I have tested on Linux with a GeForce 6800GT and 8800GTS with current drivers. I really expect VBOs to be faster, but they aren’t yet. Anyone have any suggestions? Is it possible that my hardware is just too old?

Don’t bet on it. “Vertex arrays” typically beat out “classic VBO” batches with smaller batches, on NVidia at least. Binding buffer objects isn’t cheap!

Another thing you can try is put your batches in display lists. That’ll give you the fastest perf for a given set of batches.

The only way I’ve made batches fly like NVidia display lists, with VBOs or otherwise, is to use NV bindless extensions with VBOs, but that is unfortunately still NVidia-only (more details on this below). Hoping for something in OpenGL 4.2 along these lines, to get rid of a bunch of the CPU-side memory access inefficiency in the driver when submitting small VBO batches.

One other option for static VBOs is to use VAOs. That’ll speed you up some, but it won’t get you to the performance of display lists or bindless+VBOs (the latter two are pretty much equals).

I draw up to 20 particle systems containing up to 400 particles each. Under the heaviest load, I’m using around 20-250 particles in each of the 20 systems for a total of about 2000 particles.

In addition to the above, another option you can try is putting all of your batches in one VBO. Then when you need to draw another batch, there’s no need to bind another buffer with the VBO path.
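A rough sketch of the bookkeeping for the one-VBO approach, assuming the systems are packed back to back in a single vertex buffer. The BatchRange type, the pack_batches function, and the constants are hypothetical names chosen for illustration; the limits of 20 systems and 400 particles come from the original post.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYSTEMS              20   /* limit from the original post */
#define MAX_PARTICLES_PER_SYSTEM 400  /* limit from the original post */
#define VERTS_PER_PARTICLE       4    /* one quad per particle */

/* Hypothetical record: where one particle system lives in the shared buffer. */
typedef struct BatchRange {
    size_t first_vertex;  /* index of the system's first vertex */
    size_t vertex_count;  /* number of vertices the system uses */
} BatchRange;

/* Packs the systems back to back and records each one's vertex range.
   Returns the total number of vertices the shared buffer must hold. */
size_t pack_batches(const size_t *particle_counts, size_t num_systems,
                    BatchRange *ranges)
{
    assert(num_systems <= MAX_SYSTEMS);

    size_t next = 0;
    for (size_t i = 0; i < num_systems; ++i) {
        assert(particle_counts[i] <= MAX_PARTICLES_PER_SYSTEM);
        ranges[i].first_vertex = next;
        ranges[i].vertex_count = particle_counts[i] * VERTS_PER_PARTICLE;
        next += ranges[i].vertex_count;
    }
    return next;
}
```

With the buffer bound and the pointers set once, each system could then be drawn with glDrawArrays(GL_QUADS, first_vertex, vertex_count) and no further binds between batches.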

Another option of course is to use larger batches, but don’t get too big as this hits your culling efficiency. There’s a balancing act here depending on the CPU and GPU horsepower you’ve got to work with.

To draw each particle system with vertex arrays:

To draw with VBOs:

I find it odd that you are rendering one with DrawArrays and one with DrawElements. I also find it odd that you are rendering one with a dedicated interleaved array call and in another case rendering interleaved arrays by making the appropriate pointer and enable calls.

If you are trying to do an apples-to-apples comparison here, you should be using the same batch registration technique, same type of batch, and same batch data with both vertex arrays (client arrays) and VBOs (server arrays).

I have tested on Linux with a GeForce 6800GT and 8800GTS with current drivers. I really expect VBOs to be faster, but they aren’t yet. Anyone have any suggestions? Is it possible that my hardware is just too old?

It’s not necessarily that, though it is getting old, especially that 6800.

It’s typical to see this when you’re CPU bound, which is more likely with smaller batches. A fast GPU paired with a slow CPU aggravates the problem.

On NVidia, try display lists: one display list per batch, containing only the batch itself and no state changes (that is, only your buffer binds, pointer calls, pointer enables, and batch calls). Baseline all your performance measurements as a percentage of that. Then try client arrays and VBOs+bindless. You’ll likely find you can get the display list performance from VBOs+bindless, without the display list compile times.

I’ve posted on my batch perf experiences here in various posts, but here’s one thread where I give some example code showing how to render batches with bindless VBOs side-by-side with plain “classic” VBOs. Just change the #ifdef to switch from one to the other.

Bindless of course currently isn’t a good option unless you can presume an NVidia G80+ GPU (GeForce 8 or better), or are open to a run-time switch on which draw path to use based on the available OpenGL extensions. Your GeForce 6800 wouldn’t support this path, but it would support display lists. Also, VBOs were even slower on older GPUs (IIRC from years back), though this may have been primarily due to CPUs and CPU memory being slower then.
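The run-time switch could hinge on a check like the following. This is a minimal sketch, assuming the extension list is available as one space-separated string, as glGetString(GL_EXTENSIONS) returns in a compatibility context. The function name has_gl_extension is hypothetical. It matches whole tokens only, so a longer extension name that merely starts with the one requested does not count.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` appears as a complete,
   space-delimited token in `ext_list`, and 0 otherwise. */
int has_gl_extension(const char *ext_list, const char *name)
{
    assert(ext_list != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0)
        return 0;

    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        /* A real match must start and end on a token boundary. */
        int starts_token = (p == ext_list) || (p[-1] == ' ');
        int ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return 1;
        p += len;  /* skip this partial match and keep searching */
    }
    return 0;
}
```

The bindless path would be selected only when both GL_NV_vertex_buffer_unified_memory and GL_NV_shader_buffer_load are reported, with plain VBOs or client arrays as the fallback.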

Thank you for the incredibly detailed reply and sanity check, Mr. Photon. I think I’ve tried just about everything now. Unfortunately, bindless won’t help me because I want this to run everywhere, even on pretty old hardware. In the past I’ve had better luck with VBOs, but that was with much larger batches of vertices.

My final solution is to use vertex arrays with pointer calls instead of glInterleavedArrays, and with my data padded to 64 bytes per vertex (each vertex only really requires 60 bytes of storage). Byte alignment of the arrays didn’t help. And there doesn’t appear to be any difference between GL_QUADS and GL_TRIANGLES or glDrawArrays and glDrawElements. However, I’ll probably test these details again after I come up with more test cases and maybe get an AMD graphics card in the mix.
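One possible layout for the 64-byte padded vertex, as a sketch only. The struct and field names are hypothetical. Placing the 4-byte pad right after the normal is an assumption, chosen because it matches the 0/16/32/48 offsets used in the pointer calls from the first post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex layout: 60 bytes of data plus 4 bytes of padding. */
typedef struct ParticleVertex {
    float texcoord[4];  /* byte offset 0  */
    float color[4];     /* byte offset 16 */
    float normal[3];    /* byte offset 32 */
    float pad;          /* byte offset 44, unused (assumed placement) */
    float position[4];  /* byte offset 48 */
} ParticleVertex;

/* Compile-time checks that the struct matches the intended 64-byte stride. */
static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(sizeof(ParticleVertex) == 64, "vertex should be 64 bytes");
static_assert(offsetof(ParticleVertex, position) == 48,
              "position should start at byte 48");
```

With a layout like this, sizeof(ParticleVertex) can serve as the stride and offsetof as the per-attribute offset in the pointer calls, instead of hand-written constants.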