What is a more straightforward way to do instance culling?

IronTau · May 20, 2018, 3:44pm

I have a scene with a few low-poly models (foliage) instanced many times across the scene. Currently I do a single glDrawElementsInstanced without culling or LOD. I want to improve performance in the most typical way possible with culling and LOD.

I read and understood “Instance culling using geometry shaders” (RasterGrid), which I’ve seen referenced here many times. That solution sounds great to me: I get to keep my precomputed instance positions and perform culling to prevent most of them from actually getting drawn.

This article seems to imply it is a more advanced way to do this task. I am more of a beginner, so rather than skipping right to this step I would want to consider the precursor ways to do the same task, perhaps on the CPU.

My questions:

1. Am I correct in thinking this article presents a GPU workflow that is roughly equivalent to some CPU solution?
2. If I cull on the CPU and generate some array of all the instances (positions) I want to draw, I would have to bind some newly generated buffer (maybe a Uniform Buffer Object?). I need a new draw call to handle some variable amount of positions. What I have now is not dynamic, but I imagine the consistency of data gives some performance boost. What draw would you do for a dynamic list of positions? How would you get that information to the GPU?
3. A very simple solution using only techniques I’ve done would be to have a separate VAO for each ‘chunk’ of foliage, cull based on chunks, and call glDrawElementsInstanced for each static chunk that was not culled. This doesn’t sound entirely bad to me, but it wouldn’t teach me how to deal with culling the individual instances within a chunk. What I hope to do is cull by chunks first and do some very simple LOD draw, then implement a solution to my question #2 for the remaining instances.

Thank you for reading

IronTau · May 20, 2018, 7:21pm

Here is some example code for my current solution, which simply rebuilds the buffer each frame from the instances I want drawn (after culling). Is this considered a particularly slow way to do it? There’s a chance this is all the performance boost I need, but I’d still appreciate any insight into a more proper way to deal with buffering some data per frame.

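// Upload only the transforms that survived culling. The byte size is taken
// from the culled vector so it always matches the data being read.
glBindBuffer(GL_ARRAY_BUFFER, buffer3);
glBufferData(GL_ARRAY_BUFFER, tallGrassTransformsCulled.size() * sizeof(glm::mat4), tallGrassTransformsCulled.data(), GL_STREAM_DRAW);

// Draw one instance per culled transform.
glDrawElementsInstanced(
    GL_TRIANGLES, grassModel->meshes[i].indices.size(), GL_UNSIGNED_INT, 0, tallGrassTransformsCulled.size()
);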
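
For reference, the CPU-side culling step that would fill a list like tallGrassTransformsCulled could look roughly like the sketch below. It is only an illustration under assumptions: Vec3, Plane, Frustum, sphereVisible and cullInstances are hypothetical names (a real project would probably use glm types), each instance is approximated by a bounding sphere of one shared radius, and the six frustum planes are assumed to have unit-length normals pointing inward.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal math types; glm::vec3 / glm::vec4 would work as well.
struct Vec3 { float x, y, z; };

// Plane stored as dot(n, p) + d = 0, with n pointing into the frustum.
struct Plane { Vec3 n; float d; };

using Frustum = std::array<Plane, 6>;

inline float signedDistance(const Plane& pl, const Vec3& p) {
    return pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d;
}

// A sphere is kept unless it lies completely behind at least one plane.
// Assumes unit-length plane normals so distances are in world units.
inline bool sphereVisible(const Frustum& f, const Vec3& center, float radius) {
    assert(radius >= 0.0f);
    for (const Plane& pl : f) {
        if (signedDistance(pl, center) < -radius) return false;
    }
    return true;
}

// Returns the indices of the instances that survive culling, in their
// original order. The caller can copy the matching transforms into the
// per-frame instance buffer.
std::vector<std::size_t> cullInstances(const Frustum& f,
                                       const std::vector<Vec3>& positions,
                                       float radius) {
    std::vector<std::size_t> visible;
    visible.reserve(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i) {
        if (sphereVisible(f, positions[i], radius)) visible.push_back(i);
    }
    return visible;
}
```

The test is conservative: a sphere near a frustum corner can pass even though it is slightly off-screen, which only costs a little extra drawing and never drops a visible instance.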

Dark_Photon · May 21, 2018, 5:47am

[QUOTE=IronTau;1291542]My questions:

am I correct in thinking this article presents a GPU workflow that is roughly equivalent to some CPU solution? [/QUOTE]

Yes, it’s one implementation of an algorithm that prefers GPU-side reduction to CPU-side reduction.

[QUOTE]If I cull on the CPU and generate some array of all the instances (positions) I want to draw, I would have to bind some newly generated buffer (maybe a Uniform Buffer Object?).[/QUOTE]

Not necessarily “newly generated”, but yes.

[QUOTE]I need a new draw call to handle some variable amount of positions. What I have now is not dynamic, but I imagine the consistency of data gives some performance boost. What draw would you do for a dynamic list of positions? How would you get that information to the GPU?[/QUOTE]

An indirect draw call (e.g. glDrawElementsIndirect) is typically used to provide the variable number of “instances”. The number of positions (and other per-instance vertex attributes) is the same.
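
To make the indirect-draw idea concrete, here is a minimal sketch of the command record that glDrawElementsIndirect reads from the buffer bound to GL_DRAW_INDIRECT_BUFFER. The five-field layout follows the OpenGL specification; the helper makeIndirectCommand and the buffer name indirectBuffer in the comments are hypothetical and only for illustration. In a CPU-culling setup the instance count would be written by the application; in a GPU-culling setup a shader pass could write it instead.

```cpp
#include <cassert>
#include <cstdint>

// One indirect command: five tightly packed 32-bit values, in this order.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices per instance
    std::uint32_t instanceCount; // number of instances to draw (the variable part)
    std::uint32_t firstIndex;    // starting offset into the index buffer, in indices
    std::int32_t  baseVertex;    // value added to each index before fetching vertices
    std::uint32_t baseInstance;  // first instance for instanced attributes (GL 4.2+)
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "indirect command must be five packed 32-bit values");

// Hypothetical helper: builds a command that draws the whole mesh once per
// visible instance, starting at the beginning of the index buffer.
inline DrawElementsIndirectCommand makeIndirectCommand(std::uint32_t indexCount,
                                                       std::uint32_t visibleInstances) {
    assert(indexCount % 3 == 0); // GL_TRIANGLES consumes indices in groups of three
    return DrawElementsIndirectCommand{indexCount, visibleInstances, 0u, 0, 0u};
}

// Possible usage with a live GL context (not executed in this sketch):
//   DrawElementsIndirectCommand cmd = makeIndirectCommand(indexCount, visibleCount);
//   glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
//   glBufferSubData(GL_DRAW_INDIRECT_BUFFER, 0, sizeof(cmd), &cmd);
//   glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr);
```

Because the instance count lives in a buffer rather than in a function argument, the draw call itself stays identical from frame to frame.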

[QUOTE]A very simple solution using only techniques I’ve done would be to have a separate VAO for each ‘chunk’ of foliage and cull based on chunks and glDrawElementsInstanced for each static chunk that was not culled. This doesn’t sound entirely bad to me, but this wouldn’t teach me how to deal with culling the individual instances within a chunk. What I hope to do is cull by chunks first and do some very simple LOD draw, then implement a solution to my question #2 for the remaining instances.[/QUOTE]

Yes, this is called coarse-grain culling (aka broad-phase culling), and every decent large-world engine does this. What the article’s GPU-based culling provides is the capability to do fairly cheap fine-grain culling (aka narrow-phase culling). That said, there’s no reason you can’t do coarse-grain culling on the GPU instead or as well. In fact, some of the vendor advice in recent years suggests that you do exactly that to maximize your GPU throughput in some cases, particularly when there are a lot of batches.
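
A coarse-grain pass over chunks might be sketched as follows. This is an assumption-laden illustration, not a prescribed method: Aabb, Chunk, aabbVisible and visibleChunks are hypothetical names, each chunk is assumed to carry a precomputed axis-aligned bounding box, and the frustum is given as six inward-facing planes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 n; float d; };   // a point p is inside when dot(n, p) + d >= 0
struct Aabb { Vec3 min, max; };

// Hypothetical chunk record: bounds plus the range of instances it owns.
struct Chunk {
    Aabb bounds;
    std::size_t firstInstance;
    std::size_t instanceCount;
};

// Conservative box test: the box is rejected only if it lies entirely
// behind at least one plane.
inline bool aabbVisible(const std::array<Plane, 6>& frustum, const Aabb& box) {
    assert(box.min.x <= box.max.x && box.min.y <= box.max.y && box.min.z <= box.max.z);
    for (const Plane& pl : frustum) {
        // Pick the box corner that lies furthest along the plane normal.
        const Vec3 p{pl.n.x >= 0.0f ? box.max.x : box.min.x,
                     pl.n.y >= 0.0f ? box.max.y : box.min.y,
                     pl.n.z >= 0.0f ? box.max.z : box.min.z};
        if (pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d < 0.0f) return false;
    }
    return true;
}

// Returns indices of chunks worth drawing (or worth passing on to a
// finer per-instance culling step).
std::vector<std::size_t> visibleChunks(const std::array<Plane, 6>& frustum,
                                       const std::vector<Chunk>& chunks) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (aabbVisible(frustum, chunks[i].bounds)) result.push_back(i);
    }
    return result;
}
```

Each surviving chunk could then be drawn with its own glDrawElementsInstanced call, or have its instance range handed to a fine-grain pass.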

Before you ship with GPU-based culling, though, do a sanity check to ensure that the GPU work you prevent by culling on the GPU actually justifies the cost of doing the culling. It tends to pay off, for instance, if you have fairly expensive per-instance vertex shaders that you don’t want to take the hit for unless you “know” that the instance is on-screen, if you’re pushing so many triangles that you may be triangle-setup limited, or if you’re using geometry shaders for actually rendering your instances and they’re the bottleneck.

Where GPU-side culling can really shine though is when you also dynamically determine which geometry LOD to render for each instance at the same time, dynamically generating separate bins of “instance data” for each geometry LOD. That is, for instances which are going to be small on the screen (e.g. “far away”), choose to render a low-detail instance of the model. And for instances which are going to be large on the screen (e.g. “close up”), choose to render a higher-detail instance of the model. All this can happen on the GPU without loading down your CPU or PCIe bus with per-frame per-instance culling computations and data.
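
The binning idea can be illustrated on the CPU first. The sketch below is only an approximation of what is described above: it uses distance to the camera as a stand-in for on-screen size, and binByLod and lodDistances are hypothetical names. A GPU version would do the same classification in a shader and write each bin to its own buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Sorts instances into one bin per geometry LOD. lodDistances holds the
// ascending switch distances, so N thresholds produce N + 1 bins:
// bin 0 is the most detailed model, the last bin the least detailed.
std::vector<std::vector<std::size_t>> binByLod(const Vec3& camera,
                                               const std::vector<Vec3>& positions,
                                               const std::vector<float>& lodDistances) {
    assert(std::is_sorted(lodDistances.begin(), lodDistances.end()));
    std::vector<std::vector<std::size_t>> bins(lodDistances.size() + 1);
    for (std::size_t i = 0; i < positions.size(); ++i) {
        const float dx = positions[i].x - camera.x;
        const float dy = positions[i].y - camera.y;
        const float dz = positions[i].z - camera.z;
        const float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
        // Advance to the first LOD whose switch distance is not yet reached.
        std::size_t lod = 0;
        while (lod < lodDistances.size() && dist >= lodDistances[lod]) ++lod;
        bins[lod].push_back(i);
    }
    return bins;
}
```

Each bin would then feed one instanced draw of the matching LOD mesh, so the number of draw calls per model type equals the number of LOD levels rather than the number of instances.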

Be sure to see Rastergrid’s follow-on articles in that series here:

[ul]
[li]Instance Cloud Reduction Reloaded[/li]
[li]GPU based dynamic geometry LOD[/li]
[/ul]
Also keep in mind that these posts and the one you referenced are 8 years old, and there are more efficient ways to do some of this nowadays that completely bypass any CPU involvement between the cull+LOD and render passes.

IronTau · May 23, 2018, 1:07pm

Right, I really liked the fact that I would get LOD for each instance. I think my basic culling by chunks will apply the same LOD to a whole chunk for simplicity. At least at first. Thanks for the reply!