Au contraire! With this use case, it can give you dramatic speedups! The issue is that with VBOs you waste a lot of time in the GL driver on CPU memory cache misses, and the GPU is underutilized. Bindless gets rid of a lot of those cache misses and dramatically speeds up your batch setup and submission. I and others have seen 2X speedups with real-world (non-contrived) examples. It won’t solve all your problems, but it’s an easy technique to apply and will only increase your draw throughput. It’s still NVIDIA-only though. Hope we see this in EXT/ARB form at some point.
Try to render these objects more efficiently, i.e. group them together in the VBOs, batch more draw calls together, etc.
Definitely agree. Cross vendor as well. However, you have to balance batch size (and in general GPU speed) with culling efficiency and state change needs (and in general CPU speed). LOD requirements also play in here.
True, but you typically bin things together that need “similar GL state” to render, to avoid a bunch of needless state changes when rendering. For instance, it sounds like you already bin your translucent objs together so you can render them in their own pass. You can also bin opaque objects by which shader they need. You can force all objects to use a texture so you don’t have to deal with the “no texture” case. And you can combine textures into texture atlases or arrays so you can render all objects using them in the same batch where it makes sense. Also, if you have similar repeated instances of things you’re drawing, you can use geometry instancing to combine a bunch of them into a single draw call. But again, this fights against culling efficiency and LOD granularity.
Main thing is, don’t hand objects to GL as soon as you cull them in. Even some crude binning of culled-in batches can radically reduce the number of state changes and greatly increase your draw throughput.
It’s just that, as I’ve mentioned, every object (wall, window, etc.) can be colored/textured differently, or hidden/shown
Your app so far is sounding like a great use case for bindless. But for other parallel options to increase batch size…
Color can be stored in a vertex attribute so that doesn’t prevent batching multiple things together. Textures can be stored in atlases/arrays, so that doesn’t prevent batching. Hidden/shown can be managed with a vtx attrib or instance attribute that you populate dynamically, so that doesn’t prevent batching either.
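A sketch of what those attributes might look like on the CPU side (layouts and names are illustrative assumptions, not a prescribed format): per-vertex color keeps differently colored objects in one VBO, and a dynamically updated per-instance flag handles hidden/shown without ever rebuilding the batch.

```cpp
#include <array>
#include <vector>

// Hypothetical interleaved vertex: storing color as an attribute means
// differently colored objects can live in the same VBO and batch.
struct Vertex {
    std::array<float, 3> pos;
    std::array<float, 4> color;  // RGBA, fed to the shader as a vtx attrib
};

// Hypothetical per-instance data, re-uploaded when it changes. The
// vertex shader can collapse instances with visible == 0 (e.g. scale
// to zero) so hiding an object never breaks the batch.
struct InstanceData {
    std::array<float, 3> translation;
    float visible;  // 1.0 = draw, 0.0 = skip
};

void hideInstance(std::vector<InstanceData>& instances, std::size_t i) {
    instances[i].visible = 0.0f;  // no VBO rebuild, no batch break
}
```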
If you have different ways of applying textures, then that can sometimes be a good reason to break batches. But you can use dynamic branches in shaders to combine some of this where it makes performance sense.
Batching these into static VBOs doesn’t work because each object may be transparent, and the order in which objects are drawn changes, since they must be treated separately and sorted.
Well, if all or most of the objects are typically opaque, I would think what you’d want to do is have most of your scene in client arrays/VBOs/display lists/whatever in a form where you can just rip them all as efficiently as possible in whatever order and let the Z-buffer sort it out. To handle transparency, each part can have a flag (if you use instancing) or set of vertex flags (if you use vtx attribs) that lets you “turn off” rendering that part for opaque rendering purposes. Then after rendering the opaque stuff, you’d rip through your translucent objs, ideally using the same copy of the geometry you used when rendering it opaque if possible.
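The two-pass split above can be sketched like this (a minimal illustration with made-up struct names): opaque parts go out in whatever order, translucent parts get sorted back-to-front so blending composes correctly.

```cpp
#include <algorithm>
#include <vector>

struct Part {
    int meshId;
    bool translucent;
    float viewDepth;  // distance from the eye, used only for the translucent sort
};

// Pass 1 collects opaque parts in arbitrary order (Z-buffer sorts them).
// Pass 2 collects translucent parts and sorts them back-to-front.
void splitPasses(const std::vector<Part>& parts,
                 std::vector<int>& opaquePass,
                 std::vector<Part>& translucentPass) {
    for (const Part& p : parts) {
        if (p.translucent)
            translucentPass.push_back(p);
        else
            opaquePass.push_back(p.meshId);
    }
    std::sort(translucentPass.begin(), translucentPass.end(),
              [](const Part& a, const Part& b) { return a.viewDepth > b.viewDepth; });
}
```

Note both passes can walk the same geometry; only the submission order and blend state differ.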
Even if I pimped up the framerate until it would allow me to use some multi-pass OIT algorithm, I’d still have problems with frustum culling & occlusion queries since these would also affect my batched VBOs.
Yeah, this is what sometimes gets forgotten when folks say just batch like crazy. It fights against culling efficiency. It’s a balancing act: CPU vs. GPU. An intermediate ground is to do coarse-grain culling on the CPU and fine-grained culling on the GPU, but that’s most useful when your fine-grain objects are still fairly complex.
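For the coarse-grain CPU side, the usual trick is one cheap bounding-volume test per *group* of objects, only descending into a group’s members if it survives. A minimal sphere-vs-frustum sketch (plane convention and names are my own, not from the thread):

```cpp
#include <array>
#include <vector>

struct Plane { std::array<float, 3> n; float d; };  // inside when n·x + d >= 0
struct Sphere { std::array<float, 3> c; float r; }; // bounds a whole group

// Coarse-grain cull: reject the group if its bounding sphere is fully
// outside any frustum plane; otherwise keep the whole group and let
// finer-grained (or GPU-side) culling handle its members.
bool sphereInFrustum(const Sphere& s, const std::vector<Plane>& frustum) {
    for (const Plane& p : frustum) {
        float dist = p.n[0] * s.c[0] + p.n[1] * s.c[1] + p.n[2] * s.c[2] + p.d;
        if (dist < -s.r)
            return false;  // fully outside this plane
    }
    return true;  // intersecting or inside: conservatively keep
}
```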
Watch out for occlusion query though! That can majorly kill your performance if you’re not providing sufficient time between issuing the query and reading back its result. Reading it right after ending it is the worst thing you can do! Conditional render to keep the query result on the GPU can help some. But it might be even faster to collect your opaque Z-buffer into a texture and feed that (or a copy) back into subsequent executions as a 2D texture sampled in the shader.
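One common way to provide that time is to keep a small ring of query objects and read back results a fixed number of frames late. Here’s a sketch of just the bookkeeping (the GL calls themselves — glBeginQuery, glGetQueryObjectuiv, etc. — are omitted; the ids would come from glGenQueries):

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <utility>

// Ring of query ids: issue into this frame's slot, read back the result
// that was issued Latency frames ago, so the GPU has had time to finish
// it and the CPU never stalls waiting on a just-ended query.
template <std::size_t Latency = 2>
struct QueryRing {
    std::array<std::uint32_t, Latency> ids{};  // e.g. filled by glGenQueries
    std::size_t frame = 0;

    // Returns {id safe to read back now (if any), id to issue this frame}.
    // Read the old result *before* reissuing the same id.
    std::pair<std::optional<std::uint32_t>, std::uint32_t> advance() {
        std::size_t slot = frame % Latency;
        std::optional<std::uint32_t> read;
        if (frame >= Latency)
            read = ids[slot];  // issued Latency frames ago, result ready
        ++frame;
        return {read, ids[slot]};
    }
};
```

With Latency = 2 you act on one-or-two-frame-old visibility, which is usually an acceptable trade for never blocking on the query.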