Well, I ran a few tests.
All this was done on an ASUS laptop with an Intel Centrino 1.8 GHz, 512 MB DDR RAM and an ATI Radeon 9700 Mobility (64 MB) using Catalyst 5.6. And WinXP, of course.
To test the throughput I used a particle-system. Blending off, alpha test off, depth writes on, no shaders, no lighting, etc.: plain old
textured quads.
The particle-system itself was pretty simple. At one point a big bunch of particles was emitted and simply flew up. No complex math behind
it. The particles were screen-aligned billboards, and the billboarding was done on the CPU.
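For readers who haven't done CPU billboarding: a minimal sketch of how the four corners of a screen-aligned quad can be built from the camera's right and up vectors. This is not the code from the test; `Vec3`, `billboardCorners` and `halfSize` are hypothetical names, and it assumes you already have the camera axes in world space (for example taken from the modelview matrix).

```cpp
#include <array>

struct Vec3 { float x, y, z; };

inline Vec3 add(Vec3 a, Vec3 b)    { return Vec3{a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 sub(Vec3 a, Vec3 b)    { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 scale(Vec3 v, float s) { return Vec3{v.x * s, v.y * s, v.z * s}; }

// Returns the four corners of a quad centered at `center` that faces the
// camera, in the order bottom-left, bottom-right, top-right, top-left.
// `right` and `up` are assumed to be the camera's unit axes in world space.
std::array<Vec3, 4> billboardCorners(Vec3 center, Vec3 right, Vec3 up, float halfSize) {
    const Vec3 r = scale(right, halfSize);
    const Vec3 u = scale(up, halfSize);
    return {{
        sub(sub(center, r), u),  // bottom-left
        sub(add(center, r), u),  // bottom-right
        add(add(center, r), u),  // top-right
        add(sub(center, r), u),  // top-left
    }};
}
```

Since `right` and `up` are the same for every particle in a frame, the per-particle cost is just a handful of adds and multiplies.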
Every vertex consisted of a position (3 floats = 12 bytes), a color (4 bytes) and a texcoord (3 floats = 12 bytes), so 28 bytes per vertex. With 4 vertices per quad that is 112 bytes per particle.
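To make the size arithmetic concrete, here is a sketch of an interleaved vertex struct with that layout. The struct and constant names are made up for illustration, and the size check assumes a typical platform with 4-byte floats and no extra padding.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical interleaved vertex matching the sizes given above.
struct ParticleVertex {
    float        position[3];  // 12 bytes
    std::uint8_t color[4];     //  4 bytes (RGBA)
    float        texcoord[3];  // 12 bytes
};

constexpr std::size_t kVerticesPerParticle = 4;  // one quad per particle
constexpr std::size_t kBytesPerVertex      = sizeof(ParticleVertex);
constexpr std::size_t kBytesPerParticle    = kBytesPerVertex * kVerticesPerParticle;

// Fails to compile if the compiler pads the struct differently.
static_assert(kBytesPerVertex == 28, "expected 28 bytes per vertex");
static_assert(kBytesPerParticle == 112, "expected 112 bytes per particle");
```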
I used VBOs. I tried both STREAM and DYNAMIC usage and didn’t see a difference. Even STATIC seemed to make no difference. To upload the data I mapped the VBOs and wrote the data directly into the buffer without temporarily storing it in RAM. Every particle-system had its own vertex buffer; I didn’t share them.
I used glDrawArrays to render the particles, so no indexing, but that wouldn’t make a difference anyway.
Now to the interesting results:
First I used half a million particles. That worked with interleaved arrays: when no particles were rendered I got 31 fps, when particles were
rendered I got 15 fps. With non-interleaved arrays (3 arrays) I ALWAYS got 3 to 6 fps, even if no particles were updated/rendered. I tracked
it down to a threshold at around 480000 particles. With 48xxxx particles I got 15 fps, with 48xxxx+1 particles performance collapsed. Seems
to be a memory issue.
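A rough back-of-the-envelope check of the memory suspicion, using the 28 bytes per vertex and 4 vertices per particle from above. The function names are hypothetical, and whether the driver really keeps all of this in the card's 64 MB is an assumption I can't verify.

```cpp
#include <cstdint>

// Sizes taken from the post: 28 bytes per vertex, 4 vertices per particle.
constexpr std::uint64_t kBytesPerVertex      = 28;
constexpr std::uint64_t kVerticesPerParticle = 4;

// Raw vertex data size for a given particle count, in bytes.
constexpr std::uint64_t vertexDataBytes(std::uint64_t particleCount) {
    return particleCount * kVerticesPerParticle * kBytesPerVertex;
}

// Same value expressed in MiB (1 MiB = 1024 * 1024 bytes).
constexpr double toMiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

`vertexDataBytes(480000)` is 53760000 bytes (about 51 MiB) and `vertexDataBytes(500000)` is 56000000 bytes (about 53 MiB). That is vertex data alone, before textures and the framebuffer, so on a 64 MB card a memory limit looks at least plausible.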
So for further tests I used 4 particle-systems with 100000 particles each = 400000 particles = 1600000 vertices.
The result: interleaved and non-interleaved were about equally fast (31 fps with no particles, down to 9-11 fps with particles).
Surprisingly, immediate mode achieved 11 to 15 fps.
I also checked CPU usage. When updating the particles it rose from 50% to 75%-85%. There were no big differences between the methods, but immediate mode always
consumed a few % more.
Well, my conclusion is this: I don’t think that the gfx card was the limiting factor here. Filling the vertex buffer was quite an easy task,
not many operations per particle, but still the CPU seemed to be the bottleneck. So, in general, it doesn’t seem to make a difference which
method you use.
However, I ran into a few other issues. One big disadvantage of non-interleaved arrays is that, as far as I could tell, one cannot map more than one buffer at a
time. This means I am not able to update all vertex attributes in one loop, but have to do one loop per attribute. Since I needed to
calculate some per-particle temporary results, I had to do this 3 times as often as with interleaved arrays.
That makes non-interleaved arrays very cumbersome to work with and for more complex particle-systems it might be very inefficient.
So, interleaved arrays are my preferred choice.
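To illustrate the extra work, here is a stripped-down sketch, not my actual update code. Each attribute is reduced to a single float, and `Particle`, `fadeFactor` and both update functions are made-up names. In the real thing every loop in `updateSeparate` would sit between mapping and unmapping that attribute's buffer.

```cpp
#include <cstddef>
#include <vector>

struct Particle { float age; float lifetime; };

// Simplified vertex: one float per attribute instead of the real layout.
struct Vertex { float position; float color; float texcoord; };

// Hypothetical per-particle temporary; counts how often it is evaluated.
inline float fadeFactor(const Particle& p, std::size_t& evaluations) {
    ++evaluations;
    return 1.0f - p.age / p.lifetime;
}

// Interleaved: one buffer, one pass, the temporary is computed once per particle.
std::size_t updateInterleaved(const std::vector<Particle>& ps, Vertex* dst) {
    std::size_t evaluations = 0;
    for (std::size_t i = 0; i < ps.size(); ++i) {
        const float f = fadeFactor(ps[i], evaluations);
        dst[i] = Vertex{1.0f - f, f, f};
    }
    return evaluations;
}

// Non-interleaved, one buffer written at a time: one pass per attribute,
// and the temporary is recomputed in every pass.
std::size_t updateSeparate(const std::vector<Particle>& ps,
                           float* position, float* color, float* texcoord) {
    std::size_t evaluations = 0;
    for (std::size_t i = 0; i < ps.size(); ++i)
        position[i] = 1.0f - fadeFactor(ps[i], evaluations);
    for (std::size_t i = 0; i < ps.size(); ++i)
        color[i] = fadeFactor(ps[i], evaluations);
    for (std::size_t i = 0; i < ps.size(); ++i)
        texcoord[i] = fadeFactor(ps[i], evaluations);
    return evaluations;
}
```

Both functions return the number of times the temporary was evaluated: once per particle for the interleaved version, three times per particle for the separate arrays. Caching the temporaries in a side array would avoid the recomputation, at the price of extra memory traffic.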
Now, one advantage of vertex arrays over immediate mode is that you can update the vertex arrays only every few frames, whereas with immediate mode you have to
resubmit all vertices every frame. So vertex arrays can be more efficient.
I THOUGHT!
My code only updates the vertex-arrays if 40 milliseconds have passed and some particles are active.
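The update condition boils down to something like the following sketch. `UpdateThrottle` and its members are hypothetical names, and the time source is left to the caller.

```cpp
// Decides whether the vertex arrays should be refilled this frame:
// only if at least `intervalMs` have passed since the last refill
// and at least one particle is active.
struct UpdateThrottle {
    double intervalMs   = 40.0;  // minimum time between refills
    double lastUpdateMs = 0.0;   // timestamp of the last refill

    bool shouldUpdate(double nowMs, int activeParticles) {
        if (activeParticles <= 0) return false;                  // nothing to draw
        if (nowMs - lastUpdateMs < intervalMs) return false;     // too soon
        lastUpdateMs = nowMs;
        return true;
    }
};
```

With a 40 ms interval the buffers get refilled at most 25 times per second, no matter how high the framerate is.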
Now something strange happened. If I updated the vertex arrays every frame, no matter if particles were active or not,
then I got 31 fps with no rendered particles and 20 fps with particles rendered. When I updated the vertex arrays only
if at least one particle was active, then I got 31 fps with no particles rendered and 15 fps when particles were rendered.
I tried it both with STREAM and DYNAMIC usage, no difference.
So I changed my code to map the vertex buffer every frame, even if no particles are active or no change is necessary.
That means I map it and immediately unmap it. This brought the fps back to 20 when particles were rendered.
I am pretty sure this is a driver bug! I cannot organize my engine to map and unmap all unused buffers every frame to
get a good framerate!
However, my conclusion is that interleaved arrays are the best choice. They are easier to work with, give good performance
and (if it weren’t for that bug) are a bit more efficient than immediate mode. Still, the speed of immediate mode really
surprised me. Seems to be well optimized internally.
Phew, what a long post. Hope it is interesting to read.
Jan.