Omitting Vertex/Normal/TexCoord pointers with VBOs

Hi everyone,

Is there a way to omit repeated glVertexPointer, glNormalPointer and glTexCoordPointer calls when binding Vertex Buffer Objects, in order to reduce CPU/GPU overhead?

The whole idea is that when you bind a VBO using glBindBufferARB, the pointer arguments become byte offsets into the buffer’s data instead of pointers to client-side arrays. Hence, you end up repeatedly doing the following:

glVertexPointer(3, GL_FLOAT, 32, (const GLvoid*)0);
glNormalPointer(GL_FLOAT, 32, (const GLvoid*)12);
glTexCoordPointer(2, GL_FLOAT, 32, (const GLvoid*)24);

(using a 32-byte vertex footprint: 3 position coords, 3 normal coords, 2 tex coords)

I’m rendering hundreds of thousands of objects (some are instanced, some aren’t), all of them using VBOs, and each time I have to specify those identical offsets for the separate arrays. At that scale, of course, this adds up to very heavy overhead.

My guess is that when you’re using VBOs and specify the pointers as offsets into the buffer, those offsets are resolved into absolute addresses and saved as client state. So if you bind another VBO for the next object and omit the calls above, the actual pointers the primitive data is read from stay the same regardless of the new binding, and you end up reading from a stale memory region if you render more primitives than the previous VBO contained.

Given all that, my question is this: is there a way to tell the driver that I’ll be using constant vertex/normal/texcoord pointer offsets and just bind different VBOs? That way, I would call glVertexPointer/glNormalPointer/glTexCoordPointer just once per frame. Even with frustum culling, occlusion and all, I would still shave off around 1e5-1e6 unnecessary calls and get a huge framerate boost.

Any ideas? (At least who to turn to :))

Muchas Gracias,
Janko Jerinic

Yes. At least 3 options: NVidia bindless gets rid of nearly all of that overhead, as do geometry-only display lists (i.e. put only the batches in display lists). VAOs help some, but don’t get you all the way there (in terms of potential perf gain).
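
For reference, a minimal sketch of the VAO route, assuming a compatibility context with ARB_vertex_array_object available (where the VAO also captures the fixed-function array state) and the 32-byte layout from your first post; vbo and vertexCount are placeholders:

/* Setup, once per object: the *Pointer calls latch the currently bound VBO,
   and the VAO records that pointer/enable state.                            */
GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);                 /* vbo assumed already created and filled */
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(3, GL_FLOAT, 32, (const GLvoid*)0);
glNormalPointer(GL_FLOAT, 32, (const GLvoid*)12);
glTexCoordPointer(2, GL_FLOAT, 32, (const GLvoid*)24);
glBindVertexArray(0);

/* Draw, per frame: one bind replaces the bind-plus-three-pointer-calls sequence. */
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
glBindVertexArray(0);

So you still pay one call per object, but not four.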

Thanks, I guess that covers what I was looking for. Ok, I still need to make a few calls after I’ve bound a VBO, but from what I’ve read, it should be much quicker. I’ve implemented those extensions in C# and now I’ll give them a test ride.

One question, though: are there any known alternatives to bindless graphics on ATI cards?

Thanks a lot, :)
Janko Jerinic

If you are really rendering 100000 objects per frame and need to change the VBO for each, ‘bindless’ won’t give you much of a speedup. Even if you could render each object in just 1 call, that still would be 100000 calls. Try to render these objects more efficiently, i.e. group them together in the VBOs, batch more drawcalls together etc. You should reduce the number of drawcalls to well under 10k per frame.

If all of your objects use the same vertex layout, why not put them all into a single big VBO and use glDraw(Range)Elements?
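
To illustrate, a rough sketch assuming everything shares the 32-byte layout and has been packed into one vertex buffer and one index buffer at load time; the Object bookkeeping struct and the visible[] list are hypothetical:

typedef struct {
    GLuint     firstVertex, lastVertex;   /* lowest/highest index this object references */
    GLsizei    indexCount;
    GLsizeiptr indexByteOffset;           /* where its indices start in the shared IBO   */
} Object;

/* Bind the shared buffers and set the pointers once... */
glBindBuffer(GL_ARRAY_BUFFER, sharedVBO);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, sharedIBO);
glVertexPointer(3, GL_FLOAT, 32, (const GLvoid*)0);
glNormalPointer(GL_FLOAT, 32, (const GLvoid*)12);
glTexCoordPointer(2, GL_FLOAT, 32, (const GLvoid*)24);

/* ...then each visible object is just an index-range selection, with no rebinds. */
for (int i = 0; i < visibleCount; ++i) {
    const Object *o = visible[i];
    glDrawRangeElements(GL_TRIANGLES, o->firstVertex, o->lastVertex,
                        o->indexCount, GL_UNSIGNED_INT,
                        (const GLvoid*)o->indexByteOffset);
}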

If all of your objects use the same vertex layout

Problem is, they don’t. For example, some are textured/multitextured, some aren’t. Some of them are transparent and hence must be rendered in a separate pass, using back-to-front sorting to achieve correct transparency. Visibility, color and render type (shaded, shaded & edges, wireframe) can also change per element. Octree-based frustum culling is used, so you’re never rendering the same set of elements from frame to frame.

If you are really rendering 100000 objects per frame and need to change the VBO for each, ‘bindless’ won’t give you much of a speedup

I am. I’m rendering huge, highly detailed construction projects of millions of square feet of area (architecture, construction, piping, plumbing, ventilation, miscellaneous elements etc. Figure it out).

You should reduce the number of drawcalls to well under 10k per frame.

I am very much aware of that. It’s just that, as I’ve mentioned, every object (wall, window, etc.) can be colored/textured differently, or hidden/shown. Batching these into static VBOs doesn’t work because each object may be transparent, and the order in which objects are drawn changes since they must be treated separately and sorted.

Even if I pimped up the framerate until it would allow me to use some multi-pass OIT algorithm, I’d still have problems with frustum culling & occlusion queries since these would also affect my batched VBOs.

I am very interested in other people’s solutions to these types of problems, but I’m afraid that going deeper into all of this would slightly exceed the scope of this thread :)

Thanks!

Do you really have a lot of transparent objects?
Otherwise it can work reasonably well with just two passes: one opaque, then one transparent. Or only perform exact sorting for the nearest stuff in front.
Or use something like this for single-pass order-independent transparency, if your hardware allows it:
http://blog.icare3d.org/2010/06/fast-and-accurate-single-pass-buffer.html
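
For what it’s worth, a bare-bones sketch of the two-pass approach (drawObject, the object lists and the back-to-front comparator are all placeholders):

/* Pass 1: opaque geometry, any order, depth writes on, no blending. */
glDepthMask(GL_TRUE);
glDisable(GL_BLEND);
for (int i = 0; i < opaqueCount; ++i)
    drawObject(opaque[i]);

/* Pass 2: transparent geometry, sorted back to front, depth test still on
   but depth writes off so transparent surfaces don't mask each other.     */
qsort(transparent, transparentCount, sizeof(transparent[0]), compareBackToFront);
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
glDepthMask(GL_FALSE);
for (int i = 0; i < transparentCount; ++i)
    drawObject(transparent[i]);
glDepthMask(GL_TRUE);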

Au contraire! With this use case, it can give you dramatic speedups! The issue is that with VBOs you’re wasting a lot of time in the GL driver on CPU memory cache misses, and the GPU is underutilized. Bindless gets rid of a lot of those cache misses and dramatically speeds up your batch setup and submission. I and others have seen 2X speedups with real-world examples (non-contrived). It won’t solve all your problems, but it’s an easy technique to apply and will only increase your draw throughput. It’s still NVidia-only though. Hope we see this in EXT/ARB form at some point.
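
In case it helps, here’s roughly what the bindless path looks like for the layout in the first post, going by the GL_NV_shader_buffer_load / GL_NV_vertex_buffer_unified_memory specs (a sketch, not tested code; vbo, vboAddr, vboSize and vertexCount are placeholders):

/* Once per buffer: make it resident and grab its GPU address. */
GLuint64EXT vboAddr;
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vboAddr);
glBindBuffer(GL_ARRAY_BUFFER, 0);

/* Once per frame: switch vertex pulling to unified-memory mode and declare the formats. */
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glVertexFormatNV(3, GL_FLOAT, 32);
glNormalFormatNV(GL_FLOAT, 32);
glTexCoordFormatNV(2, GL_FLOAT, 32);

/* Per object: no glBindBuffer, no *Pointer calls, just raw GPU address ranges. */
glBufferAddressRangeNV(GL_VERTEX_ARRAY_ADDRESS_NV,        0, vboAddr +  0, vboSize -  0);
glBufferAddressRangeNV(GL_NORMAL_ARRAY_ADDRESS_NV,        0, vboAddr + 12, vboSize - 12);
glBufferAddressRangeNV(GL_TEXTURE_COORD_ARRAY_ADDRESS_NV, 0, vboAddr + 24, vboSize - 24);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);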

Try to render these objects more efficiently, i.e. group them together in the VBOs, batch more drawcalls together etc

Definitely agree, and it’s cross-vendor as well. However, you have to balance batch size (and, in general, GPU speed) against culling efficiency and state-change needs (and, in general, CPU speed). LOD requirements also play in here.

True, but you typically bin things together that need “similar GL state” to render, to avoid a bunch of needless state changes when rendering. For instance, sounds like you already bin your translucent objs together so you can render them in their own pass. You can also bin opaque objects by which shader they need. You can force all objects to use texture so you don’t have to deal with the “no texture” case. And you can combine textures into texture atlases or arrays so you can render all objects using them in the same batch where it makes sense. Also, if you have similar repeated instances of things you’re drawing, you can use geometry instancing to combine a bunch of them into a single draw call. But again, this fights against culling efficiency and LOD granularity.
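
As one example of the instancing point, a sketch using ARB_draw_instanced / ARB_instanced_arrays, with a vertex shader (not shown) assumed to read the per-instance attribute; instanceVBO, INSTANCE_ATTRIB, meshIndexCount and instanceCount are made-up names:

/* Mesh vertex/index buffers assumed already bound and their pointers set.
   Per-instance data (e.g. a position offset or packed transform) lives in its own VBO. */
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
glEnableVertexAttribArray(INSTANCE_ATTRIB);
glVertexAttribPointer(INSTANCE_ATTRIB, 4, GL_FLOAT, GL_FALSE, 0, (const GLvoid*)0);
glVertexAttribDivisorARB(INSTANCE_ATTRIB, 1);   /* advance once per instance, not per vertex */

/* One call draws every placement of the mesh (pipe segment, light fixture, ...). */
glDrawElementsInstancedARB(GL_TRIANGLES, meshIndexCount, GL_UNSIGNED_INT,
                           (const GLvoid*)0, instanceCount);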

Main thing is, don’t hand objects to GL as soon as you cull them in. Even some crude binning of culled-in batches can radically reduce the number of state changes and greatly increase your draw throughput.
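
A very crude version of that binning, just to show the flow; StateKey, the bucket array and the helper functions are all hypothetical:

typedef struct { GLuint shader, texture; int blendMode; } StateKey;   /* whatever defines "similar GL state" */

typedef struct {
    StateKey key;
    Object  *objects[4096];
    int      count;
} Bucket;

/* Cull pass: don't draw an object the moment the octree culls it in; bucket it instead. */
for (int i = 0; i < culledInCount; ++i) {
    Bucket *b = findOrCreateBucket(stateKeyOf(culledIn[i]));   /* placeholder helpers */
    b->objects[b->count++] = culledIn[i];
}

/* Submit pass: one state change per bucket instead of one per object. */
for (int j = 0; j < bucketCount; ++j) {
    applyState(&buckets[j].key);                                /* placeholder: shader, texture, blend */
    for (int i = 0; i < buckets[j].count; ++i)
        drawObject(buckets[j].objects[i]);
}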

It’s just that, as I’ve mentioned, every object (wall, window, etc.) can be colored/textured differently, or hidden/shown

Your app so far is sounding like a great use case for bindless. But for other parallel options to increase batch size…

Color can be stored in a vertex attribute so that doesn’t prevent batching multiple things together. Textures can be stored in atlases/arrays, so that doesn’t prevent batching. Hidden/shown can be managed with a vtx attrib or instance attribute that you populate dynamically, so that doesn’t prevent batching either.
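
To make the color part concrete, here’s a sketch with a second, non-interleaved color buffer so a per-object recolor (or hide flag) becomes one small buffer update instead of a reason for a separate draw call; geometryVBO, colorVBO and the offsets are made-up names:

/* Static geometry stays in the interleaved VBO... */
glBindBuffer(GL_ARRAY_BUFFER, geometryVBO);
glVertexPointer(3, GL_FLOAT, 32, (const GLvoid*)0);
glNormalPointer(GL_FLOAT, 32, (const GLvoid*)12);
glTexCoordPointer(2, GL_FLOAT, 32, (const GLvoid*)24);

/* ...while per-vertex RGBA colors live in their own tightly packed buffer. */
glBindBuffer(GL_ARRAY_BUFFER, colorVBO);
glEnableClientState(GL_COLOR_ARRAY);
glColorPointer(4, GL_UNSIGNED_BYTE, 0, (const GLvoid*)0);

/* Changing one object's color (or a "hidden" flag encoded in it) touches only its slice. */
glBindBuffer(GL_ARRAY_BUFFER, colorVBO);
glBufferSubData(GL_ARRAY_BUFFER, objectColorByteOffset, objectColorByteCount, newColors);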

If you have different ways of applying textures, then that can sometimes be a good reason to break batches. But you can use dynamic branches in shaders to combine some of this where it makes performance sense.

Batching these into static VBOs doesn’t work because each object may be transparent, and the order in which objects are drawn changes since they must be treated separately and sorted.

Well, if all or most of the objects are typically opaque, I would think what you’d want to do is have most of your scene in client arrays/VBOs/display lists/whatever in a form where you can just rip them all as efficiently as possible in whatever order and let the Z-buffer sort it out. To handle transparency, each part can have a flag (if you use instancing) or a set of vertex flags (if you use vtx attribs) that lets you “turn off” rendering that part for opaque rendering purposes. Then after rendering the opaque stuff, you’d rip through your translucent objs, ideally using the same copy of the geometry you used when rendering it opaque, if possible.

Even if I pimped up the framerate until it would allow me to use some multi-pass OIT algorithm, I’d still have problems with frustum culling & occlusion queries since these would also affect my batched VBOs.

Yeah, this is what sometimes gets forgotten when folks say to just batch like crazy. It fights against culling efficiency. It’s a balancing act: CPU vs. GPU. An intermediate ground is to do coarse-grained culling on the CPU and fine-grained culling on the GPU, but that’s most useful when your fine-grained objects are still fairly complex.

Watch out for occlusion query though! That can majorly kill your performance if you’re not providing sufficient time between issuing the query and using its result. Reading it back right after ending it is the worst thing you can do! Conditional render, which keeps the query on the GPU, can help some. But it might be even faster to collect your opaque Z-buffer into a texture and feed that (or a copy) back into subsequent passes as a 2D texture sampled in the shader.
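
One way to keep the query result off the CPU’s critical path, sketched with GL 3.0-style conditional rendering (query, drawBoundingBox and drawObject are placeholders):

/* Earlier in the frame (or the previous frame): run the occlusion test on cheap proxy geometry. */
glBeginQuery(GL_SAMPLES_PASSED, query);
drawBoundingBox(object);
glEndQuery(GL_SAMPLES_PASSED);

/* Much later: let the GPU decide whether to draw, without a CPU readback stall. */
glBeginConditionalRender(query, GL_QUERY_NO_WAIT);
drawObject(object);
glEndConditionalRender();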

I and others have seen 2X speedups with real-world examples (non-contrived).

So, yeah, you’re then at the speed of 50,000 drawcalls per frame. That won’t be much more interactive than now :)

He should definitely try to cut his batch count down considerably before trying to make the individual calls faster.