Immediate mode faster than using VBOs?

I recently started adding VBO support to a program I’ve been developing (an old 2.5D FPS engine) and noticed some results I found very puzzling. (All of the listed results were measured on a GeForce 8600 with recent drivers, but some were confirmed on a GTX 280.)

  • when using a streaming VBO containing more than vertex, texture coordinate and color information, the VBO version is noticeably slower than using immediate mode.
  • when reducing the VBO to the 3 above-mentioned entries, the speed is approximately the same.
  • using a static VBO containing only vertex and texture coordinates, the CPU load measurably decreases, but the overall frame rate only increases by 1-2 fps when vertical sync is off and the frame rate is far above 200. At lower frame rates there’s absolutely no measurable difference. In fact, due to the maintenance overhead of the VBO, the resulting code is slower than a version without any VBOs!
  • even when using a very large number of vertices I can’t see any improvement from using VBOs at all.

Given these results I don’t really understand the recent developments. There are many situations where a VBO is just cumbersome to use, and yet the only viable alternative has been deprecated in OpenGL. Recoding such applications will probably yield the same results as my tests…

You are obviously doing something very wrong, because VBOs are usually MUCH faster than immediate mode. In fact when using immediate mode the driver is forced to rearrange things into a VBO-like structure before rendering, so that can’t be faster than proper VBO usage.

Granted, there are some pitfalls that you need to know to get good performance out of VBOs. The most extreme slowdown I have measured on nVidia cards is when you use 3-byte colors, e.g. GL_UNSIGNED_BYTE for colors with 3 components (RGB). That means your memory layout is not aligned to 4 bytes, and that will bring your app to a crawl.

So all your data needs to be aligned to 4 bytes (no problem with floats). Also, nVidia does not like GL_BYTE data at all (ATI has no problems). So ALWAYS use GL_UNSIGNED_BYTE and, if necessary, unpack in a shader. I don’t know why nVidia’s hardware does not like signed bytes.
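For illustration, one way to avoid the 3-byte color layout described above is to expand RGB to RGBA on the CPU before uploading, so that each color takes exactly 4 bytes. This is only a minimal sketch in C; the function name `expand_rgb_to_rgba` is made up for this example and does not come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed RGB bytes to RGBA so every color occupies 4 bytes.
   `count` is the number of colors (not bytes); `dst` must hold count * 4 bytes.
   Hypothetical helper, named for this example only. */
static void expand_rgb_to_rgba(const uint8_t *src, uint8_t *dst, size_t count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];  /* red   */
        dst[4 * i + 1] = src[3 * i + 1];  /* green */
        dst[4 * i + 2] = src[3 * i + 2];  /* blue  */
        dst[4 * i + 3] = 255;             /* opaque alpha fills the 4th byte */
    }
}
```

The expanded array would then be specified with 4 components and GL_UNSIGNED_BYTE (for example via glColorPointer) instead of 3.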

Then there’s the 32 / 64 byte alignment case. If possible make your vertex-data-set aligned to 32 bytes. E.g. position (3 floats = 12 bytes) + texture coordinates (2 floats = 8 bytes) + color (RGBA = 4 bytes) makes 24 bytes, so you can add 2 floats to get to 32 bytes. In my tests there was a minor speed-up but nothing to really care about usually.
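The 24-to-32-byte padding from the example above could be written down roughly like this. It is a sketch, assuming 4-byte floats and a C11 compiler; the struct name `PaddedVertex` and its field names are invented for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Interleaved vertex padded from 24 to 32 bytes.
   Assumes sizeof(float) == 4, which holds on common desktop platforms. */
typedef struct {
    float   position[3];  /* 12 bytes, offset 0  */
    float   texcoord[2];  /*  8 bytes, offset 12 */
    uint8_t color[4];     /*  4 bytes, offset 20 (RGBA) */
    float   padding[2];   /*  8 bytes, offset 24: unused, rounds size up to 32 */
} PaddedVertex;

/* Fail at compile time if the layout is not what the comments claim. */
static_assert(sizeof(PaddedVertex) == 32, "vertex must be 32 bytes");
static_assert(offsetof(PaddedVertex, texcoord) == 12, "texcoord offset");
static_assert(offsetof(PaddedVertex, color) == 20, "color offset");
static_assert(offsetof(PaddedVertex, padding) == 24, "padding offset");
```

With such a layout, sizeof(PaddedVertex) would be the stride and the offsetof values the buffer offsets passed to glVertexPointer, glTexCoordPointer and glColorPointer.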

Btw: If you really only need a replacement for immediate mode, you can check out my alternative, glim (the link is in my signature). I haven’t done any performance measurements though; it would be interesting to know how it compares to GL’s immediate mode.

Jan.

Why am I not surprised to get such an answer… :wink:

>>Granted, there are some pitfalls that you need to know to get good performance out of VBOs. The most extreme slowdown I have measured on nVidia cards is when you use 3-byte colors, e.g. GL_UNSIGNED_BYTE for colors with 3 components (RGB). That means your memory layout is not aligned to 4 bytes, and that will bring your app to a crawl.

Not doing that.

>>So all your data needs to be aligned to 4 bytes (no problem with floats). Also, nVidia does not like GL_BYTE data at all (ATI has no problems). So ALWAYS use GL_UNSIGNED_BYTE and, if necessary, unpack in a shader. I don’t know why nVidia’s hardware does not like signed bytes.

>>Then there’s the 32 / 64 byte alignment case. If possible make your vertex-data-set aligned to 32 bytes. E.g. position (3 floats = 12 bytes) + texture coordinates (2 floats = 8 bytes) + color (RGBA = 4 bytes) makes 24 bytes, so you can add 2 floats to get to 32 bytes. In my tests there was a minor speed-up but nothing to really care about usually.

My current format consists of 8 floats: 3 for position, 2 for texture coordinates, 2 for custom fields, plus one reserved for later use. So alignment should be fine.


I did some further tests with quite interesting results:

  • mixing VBO with immediate mode is faster than forcing a VBO update each time something changes. It works best if I only update the VBO when something permanently changes to a different value.

  • I noticed that even if I double the number of vertices in immediate mode without doubling the number of polygons, the speed remains the same even though the rendering loop takes almost twice as long to execute. Again, no major difference between using VBOs and immediate mode.

I think my program is limited elsewhere so that all the optimizations here won’t help at all.

L1+L2 cache: you’re probably exceeding it. Keep 2-3 streaming VBOs of 4 kB each; map/write/unmap/draw, and rotate through them in round-robin fashion.
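The rotation part of that suggestion might look something like the sketch below. It only covers the round-robin bookkeeping, without any GL calls, so that it stays self-contained; `StreamRing`, `stream_ring_acquire` and `stream_ring_capacity` are hypothetical names, and the buffer names are assumed to have been created elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_BUFFER_COUNT 3      /* "2-3" buffers, as suggested above */
#define STREAM_BUFFER_BYTES 4096   /* 4 kB each */

/* Small ring of streaming buffers (hypothetical helper type). */
typedef struct {
    unsigned buffer_ids[STREAM_BUFFER_COUNT]; /* GL buffer names created elsewhere */
    size_t   next;                            /* slot to hand out next */
} StreamRing;

/* Return the buffer to use for the next batch and advance the ring.
   The caller would then bind it, map, write vertices, unmap and draw. */
static unsigned stream_ring_acquire(StreamRing *ring)
{
    assert(ring != NULL);
    unsigned id = ring->buffer_ids[ring->next];
    ring->next = (ring->next + 1) % STREAM_BUFFER_COUNT;
    return id;
}

/* Number of whole vertices of the given size that fit into one buffer. */
static size_t stream_ring_capacity(size_t vertex_bytes)
{
    assert(vertex_bytes > 0);
    return STREAM_BUFFER_BYTES / vertex_bytes;
}
```

Per batch, the returned name would go to glBindBuffer, followed by glMapBuffer, the vertex writes, glUnmapBuffer and the draw call. With 32-byte vertices, each 4 kB buffer holds 128 of them.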

>>I recently started adding VBO support to a program I’ve been developing (some old 2.5D FPS engine)

By 2.5D I assume there aren’t many polygons that you’re drawing; in that case I doubt there will be a big difference between immediate mode and VBOs.
VBOs benefit most when you’re drawing objects with lots of vertices (e.g. 10,000).
In your case you’re probably fill or pixel limited.
I believe there is a slight overhead with each VBO, so if your 2.5D app is often drawing a single quad, then perhaps immediate mode is quicker than a VBO.

>>I think my program is limited elsewhere so that all the optimizations here won’t help at all.

Well, that’s the most important thing: find what the actual bottleneck is first.

The biggest scene I tested had 100,000 vertices, of which I could put 70,000 in the VBO.

>>In your case you’re probably fill or pixel limited.

Most likely. I discovered that for each millisecond I save in the rendering loop, the same amount of time gets added to the glFinish call, so it always evens out.
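That observation fits a very simplified timing model, sketched here in C. The model is an assumption, not a measurement: it pretends the GPU needs a fixed `gpu_ms` per frame and works in parallel with the CPU, which needs `cpu_ms` to submit the frame before calling glFinish. The function names are made up for this example.

```c
#include <assert.h>

/* Simplified model (assumption): glFinish waits for whatever GPU time
   is left over once the CPU has finished submitting the frame. */
static double finish_wait_ms(double cpu_ms, double gpu_ms)
{
    assert(cpu_ms >= 0.0 && gpu_ms >= 0.0);
    return gpu_ms > cpu_ms ? gpu_ms - cpu_ms : 0.0;
}

/* Total frame time under the same model: submission plus the wait. */
static double frame_ms(double cpu_ms, double gpu_ms)
{
    return cpu_ms + finish_wait_ms(cpu_ms, gpu_ms);
}
```

Under this model, with the GPU needing 10 ms, cutting the rendering loop from 6 ms to 5 ms raises the wait from 4 ms to 5 ms and leaves the frame at 10 ms. Only once the CPU side exceeds the GPU time does the rendering loop determine the frame time.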

Why do you need to have a finish call after your render loop? It is true that if you are GPU bound, then glFinish() will wait until the GPU idles.

However, I would recommend that you don’t do this, since it prevents CPU and GPU work from overlapping across multiple frames.

Pierre B.
AMD fellow

Of course I tried that, but it makes the application become very subtly jerky. And it only works on NVidia. On ATI this didn’t bring me any speed gains at all. In fact I ended up disabling the VBO on ATI again after doing some benchmarking. It turned out that mixing VBO and immediate mode isn’t handled well by ATI’s drivers, and staying with one method is preferable.