Instance Shader

We really could use an “instance shader” for rendering massive numbers of objects like vegetation, with the culling controlled on the GPU. It really doesn’t make sense to perform this step on the CPU when the number of instances runs into the thousands. For example, our vegetation system renders a randomized grid of instances around the player. An instance shader could quickly decide which objects in the grid are actually visible, and discard an instance before the vertex shader runs if it isn’t.

uniform vec4 cameraplane0;
uniform vec4 cameraplane1;
uniform vec4 cameraplane2;
uniform vec4 cameraplane3;
uniform vec4 cameraplane4;
uniform vec4 cameraplane5;

uniform float objectradius;

uniform mat4 instancematrix[MAX_INSTANCES];

// Signed distance from an outward-facing plane to the far side of the sphere;
// a positive result means the sphere lies entirely outside the plane.
float PlaneDistanceToSphere(in vec4 plane, in vec3 point, in float radius)
{
    return dot(plane.xyz, point) + plane.w - radius;
}

void main()
{
    mat4 mat = instancematrix[gl_InstanceID];
    vec3 pos = mat[3].xyz;
    float radius = objectradius * max(max(length(mat[0].xyz), length(mat[1].xyz)), length(mat[2].xyz));
    if (PlaneDistanceToSphere(cameraplane0, pos, radius) > 0.0) discard; // and likewise for the other five planes
}
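For reference, the same plane/sphere math can be mirrored on the CPU. This is only a sketch, assuming planes are stored as (nx, ny, nz, d) with outward-facing unit normals, so a positive signed distance beyond the radius means the sphere is fully outside:

```c
/* Signed distance from plane (n, d) to the far side of the sphere.
   Positive => sphere entirely outside this outward-facing plane. */
static float plane_distance_to_sphere(const float plane[4],
                                      const float center[3], float radius)
{
    return plane[0]*center[0] + plane[1]*center[1] + plane[2]*center[2]
         + plane[3] - radius;
}

/* A sphere is visible only if it is not fully outside any of the 6 planes. */
static int sphere_in_frustum(const float planes[6][4],
                             const float center[3], float radius)
{
    for (int i = 0; i < 6; ++i)
        if (plane_distance_to_sphere(planes[i], center, radius) > 0.0f)
            return 0;
    return 1;
}
```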

This would save time on vertex shaders, which are not free when you get high volumes of geometry, and it would save the CPU from having to iterate through each instance.

I tried doing something like this with geometry shaders, but the hardware limits on geometry shader output are ridiculously low, and you’re also performing culling for every single vertex, which is really non-optimal.

Why not use a compute shader to build the instance data, then use an indirect rendering call to render the computed instance data?

I see no need here for a specialized hardware feature. And compute shaders would almost certainly be faster, since you’re not conditionally culling out entire groups of vertices that were sent in a rendering command. And thanks to multi-draw indirect, compute shaders would not be limited to rendering copies of the same mesh; each mesh could be different. And it’s a lot more parallel-friendly, since you don’t have to stall the pipeline to wait for instance shaders to figure out whether to render something or not. And you can do it as a pre-process each frame, with some other work in between (like rendering everything else in the scene), so that you don’t incur any pipeline stalls.

So not only can it be done now, what you’re asking for would be inferior to what we have currently.

Consider using transform feedback and a geometry shader with selective emission to do your instance culling on the GPU. With this approach, your geometry shader basically becomes the “instance shader” you’re talking about.

The idea is you render POINTS. Each point grabs the instance data for one instance, generates the bounding primitive, culls it against the frustum, and, if the instance culls in, the shader outputs the instance data for that single instance to a VBO using transform feedback. If it culls out, it emits nothing. This can be very, very efficient (I have implemented this with thousands of instances per frame)!
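The emit-or-skip step is just stream compaction. Here’s a hedged CPU model of what the transform feedback pass computes, with `survives[]` standing in for whatever per-instance frustum test the shader runs:

```c
/* CPU model of the geometry-shader culling pass: one invocation per
   "point" (one per instance); an instance that passes the test has its
   index appended to the output, mimicking EmitVertex() into a transform
   feedback buffer.  survives[] stands in for the per-instance cull test. */
static int cull_instances(const int *survives, int count, int *out_indices)
{
    int written = 0;                    /* primitives-written counter  */
    for (int i = 0; i < count; ++i)
        if (survives[i])
            out_indices[written++] = i; /* EmitVertex() equivalent     */
    return written;                     /* what a primitive query reports */
}
```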

See the archives for details – a number of folks have already done this and tell you how. Web-search “instance culling using geometry shaders” for starters, and then go from there.

Further, you can use multi-stream output to automatically do LOD binning while you’re culling using the very same geometry shader executions! You’d use this for cases where your vegetation has multiple discrete LODs to save blasting too many polys far from the eyepoint.
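The binning itself is just a distance-to-bin mapping; in the geometry shader it decides which output stream (via EmitStreamVertex) receives the instance. A minimal sketch, where the thresholds are hypothetical values, not anything from the thread:

```c
/* Pick a discrete LOD bin from eye distance.  thresholds[] holds the
   (num_lods - 1) cutover distances, nearest first; past the last
   threshold the instance falls into the coarsest bin.  In a multi-stream
   geometry shader, the returned bin selects the EmitStreamVertex() stream. */
static int lod_bin(float distance, const float *thresholds, int num_lods)
{
    int bin = num_lods - 1;            /* coarsest LOD by default */
    for (int i = 0; i < num_lods - 1; ++i) {
        if (distance < thresholds[i]) { bin = i; break; }
    }
    return bin;
}
```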

Or… you can use a compute shader, but you have to exercise more care and knowledge to ensure that your solution works efficiently …and reliably! The reliability part isn’t really so much of an issue with traditional (non-compute) GPU shaders if you avoid side effects, as traditional shaders protect you from the sharp corners and rough edges that can get you with general GPU programming. That said, if you’ve done OpenCL, CUDA, or compute shader programming before and know what you’re getting into, go for it!

RasterGrid’s description of the problem is exactly what I am running into, particularly with batch size vs. vertex shader load.

The asynchronous query in his technique is a weak point, but other than that this looks good.

My request still stands. Data should flow in one direction, not back from the GPU to the CPU. This is an okay technique for the capabilities we have right now, but it’s a non-optimal hack. Furthermore, my rendering technique works without passing any instance matrices around in uniform buffers, and putting all that data into a uniform buffer introduces new constraints.

[QUOTE]Data should flow in one direction, not back from the GPU to the CPU. This is an okay technique for the capabilities we have right now, but it’s a non-optimal hack.[/QUOTE]

Note that if you’re talking specifically about this technique, that’s rather outdated. The asynchronous query isn’t necessary; with multidraw indirect rendering, you can avoid it by making your multidraw buffer large and “rendering” commands whose instance count is zero (which cause nothing to be rendered) for instances that weren’t filled in.
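The indirect command layout (fixed by the GL spec for GL_DRAW_INDIRECT_BUFFER) makes this easy to see: a command whose instanceCount is zero draws nothing, so unfilled slots in a large multidraw buffer are harmless.

```c
/* Layout of one glDrawElementsIndirect command, as specified by OpenGL.
   Note the last field must be zero before GL 4.2 (no baseInstance yet). */
typedef struct {
    unsigned int count;         /* index count for this draw            */
    unsigned int instanceCount; /* 0 => this command renders nothing    */
    unsigned int firstIndex;
    int          baseVertex;
    unsigned int baseInstance;  /* reserved/zero prior to GL 4.2        */
} DrawElementsIndirectCommand;
```

A zero-initialized slot is therefore a valid no-op draw; the GPU pass only has to fill in the slots that survive culling.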

So whether you use compute shaders or geometry shaders to generate your buffers, you don’t need to read anything from the CPU. So there is no GPU-CPU-GPU synchronization; the only sync that happens is that the rendering call needs to wait for the generation process to have generated the commands (and flush the appropriate caches).

[QUOTE]Furthermore, my rendering technique works without passing any instance matrices around in uniform buffers, and putting all that data into a uniform buffer introduces new constraints.[/QUOTE]

Well… where is your per-instance data coming from? If it’s coming from vertex arrays, that works just fine with the above algorithm.

If you’re generating it based on the instance count, that… becomes problematic. But that’s only because gl_InstanceID doesn’t include the baseinstance from the rendering command; it always starts at instance 0. In that case, you need to use an instanced vertex array to provide the proper instance index. It’d be a very simple array, just monotonically increasing GLushorts.
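A minimal sketch of that index array, assuming a GLushort attribute with a divisor of 1; the GL setup calls only make sense inside a context, so they appear as comments, and the attribute location is a hypothetical one:

```c
enum { MAX_INSTANCES = 1024 };

/* Fill a monotonically increasing per-instance index array.  Uploaded to
   a VBO and bound on a hypothetical attribute slot INSTANCE_INDEX_ATTRIB
   with a divisor of 1, instance n then reads the value n:

     glVertexAttribIPointer(INSTANCE_INDEX_ATTRIB, 1, GL_UNSIGNED_SHORT, 0, 0);
     glVertexAttribDivisor(INSTANCE_INDEX_ATTRIB, 1);
*/
static void fill_instance_indices(unsigned short *indices, int count)
{
    for (int i = 0; i < count; ++i)
        indices[i] = (unsigned short)i;
}
```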

Or you can use ARB_shader_draw_parameters, which passes the shader the base instance. Sadly, it’s not available on all 4.x hardware.

The only way that the location of the uniform data would be a relevant issue would be if you are using CPU commands between rendering calls to set program uniform state. Program uniform state is a bad idea if performance is a goal. That’s why Vulkan et al. don’t have it.

My request still stands.

They’re not going to introduce an entirely new programmable stage, not for something that you could easily and efficiently do yourself. The closest you’ll get to that is being able to fetch the primcount parameter of an indirect rendering call from a buffer object (which removes any need for a CPU-sync), or being able to have GPU processes write arbitrary rendering commands.

The latter is more for something like Vulkan.

[QUOTE=Alfonse Reinheart;1271792]Well… where is your per-instance data coming from? If it’s coming from vertex arrays, that works just fine with the above algorithm.

If you’re generating it based on the instance count, that… becomes problematic. But that’s only because gl_InstanceID doesn’t include the baseinstance from the rendering command; it always starts at instance 0. In that case, you need to use an instanced vertex array to provide the proper instance index. It’d be a very simple array, just monotonically increasing GLushorts.[/QUOTE]
Yeah, I am basically rendering a subsection of an n x n grid, so if you know the instance ID and have a starting offset you can figure out the X and Z position on the grid. Then the X and Z positions are used as a lookup for a small tiling grid of 4x4 matrices.
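That lookup is plain integer math; here’s a hedged sketch of it, where the grid width and starting offsets are placeholder parameters rather than the poster’s actual values:

```c
/* Recover grid coordinates from a flat instance index.  start_x/start_z
   are the offsets of the rendered subsection within the full n x n grid;
   the resulting (x, z) pair indexes the tiling grid of 4x4 matrices. */
static void instance_to_grid(int instance_id, int grid_width,
                             int start_x, int start_z,
                             int *out_x, int *out_z)
{
    *out_x = start_x + instance_id % grid_width;
    *out_z = start_z + instance_id / grid_width;
}
```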

Unfortunately, I can’t raise our application system requirements past 4.0.

Yep, absolutely! Rastergrid’s articles on this topic (link and link) are very useful to get your mind into the concept and introduce the basic technique. But you can use other GL features to improve this even further.

You can avoid any GPU->CPU readbacks completely by having the GPU write the serialized instance count (i.e. the number of primitives written by the transform feedback run) in GPU memory so that subsequent GPU batches (e.g. indirect instanced draw calls) can read it from there directly.

One way is to serialize the instance count into buffer object(s) using atomics, as described (with code) in these threads (mainly the first):

If you prepopulate these buffers with indirect instanced draw call data, then after the transform feedback run they are ready to rip through your “culled and LODed” instanced batches (including with correct instance counts). I’ve implemented this on NVidia GPUs and it works very well.
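To picture the shape of that trick, here’s a CPU model with a plain counter standing in for the GPU atomic: every field of the indirect command is prepopulated except instanceCount, and the culling pass bumps the counter that lives at the instanceCount offset.

```c
/* Same layout the GL spec defines for glDrawElementsIndirect commands. */
typedef struct {
    unsigned int count, instanceCount, firstIndex;
    int          baseVertex;
    unsigned int baseInstance;
} DrawElementsIndirectCommand;

/* Model of the GPU pass: each surviving instance increments the
   instanceCount field of its prepopulated command -- on the GPU this
   would be an atomicCounterIncrement() aliased over that offset. */
static void cull_into_command(DrawElementsIndirectCommand *cmd,
                              const int *survives, int instance_count)
{
    cmd->instanceCount = 0;       /* reset at the start of each frame */
    for (int i = 0; i < instance_count; ++i)
        if (survives[i])
            cmd->instanceCount++; /* atomic increment on the GPU */
}
```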

There was an extension announced a while back that gave you this capability without needing to use atomics. Ah, here we go:
  • ARB_query_buffer_object

[QUOTE=Dark Photon;1271794]Yep, absolutely! Rastergrid’s articles on this topic (link and link) are very useful to get your mind into the concept and introduce the basic technique. But you can use other GL features to improve this even further.

You can avoid any GPU->CPU readbacks completely by having the GPU write the serialized instance count (i.e. the number of primitives written by the transform feedback run) in GPU memory so that subsequent GPU batches (e.g. indirect instanced draw calls) can read it from there directly.

One way is to serialize the instance count into buffer object(s) using atomics, as described (with code) in these threads (mainly the first):

If you prepopulate these buffers with indirect instanced draw call data, then after the transform feedback run they are ready to rip through your “culled and LODed” instanced batches (including with correct instance counts). I’ve implemented this on NVidia GPUs and it works very well.

There was an extension announced a while back that gave you this capability without needing to use atomics. Ah, here we go:

  • ARB_query_buffer_object[/QUOTE]
Indirect rendering requires OpenGL 4.5, right? If that’s the case, I can’t use it since my product is already released.

I did get the technique working in one day. Here are the results:

glDrawElementsIndirect and GL_DRAW_INDIRECT_BUFFER are in OpenGL 4.0.

GL_ATOMIC_COUNTER_BUFFER requires OpenGL 4.2.

Sweet, thanks for the tips.

And, for what it’s worth, multi-draw-indirect was core in 4.3.

Hmmm, can this be done with only OpenGL 4.0 support? I can’t drop support for the Intel 4000s.

No need to drop support; it would appear that they support the ARB_multi_draw_indirect extension. At least, on Windows.

Even so, you can still use non-multidraw functionality to make it work, even without querying the primitive count. Just use a loop of glDraw*Indirect calls, where the loop counter is a fixed value. You’ll send needless (empty) drawing commands, but the cost of a draw call itself is pretty minimal (it’s the state changes between draws that hurt).
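The fixed-count loop looks like this on the host side; the actual glDrawElementsIndirect call needs a GL context, so it appears as a comment, and the byte offsets into GL_DRAW_INDIRECT_BUFFER are the real point:

```c
enum { MAX_DRAWS = 64, CMD_SIZE = 20 }; /* 5 x 4-byte fields per command */

/* Issue a fixed number of indirect draws; slots the GPU didn't fill keep
   instanceCount == 0 and render nothing.  Records the byte offsets used,
   purely for illustration. */
static void issue_fixed_indirect_draws(long offsets[MAX_DRAWS])
{
    for (int i = 0; i < MAX_DRAWS; ++i) {
        offsets[i] = (long)i * CMD_SIZE;
        /* glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                  (const void *)offsets[i]); */
    }
}
```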

[QUOTE=Alfonse Reinheart;1271811]No need to drop support; it would appear that they support the ARB_multi_draw_indirect extension. At least, on Windows.

Even so, you can still use non-multidraw functionality to make it work, even without querying the primitive count. Just use a loop of glDraw*Indirect calls, where the loop counter is a fixed value. You’ll send needless (empty) drawing commands, but the cost of a draw call itself is pretty minimal (it’s the state changes between draws that hurt).[/QUOTE]
Thanks, I’m just going to keep using the query and then implement additional optimizations later for hardware that supports it.