Geometry shader view frustum culling

Hi everyone,

I am currently rendering between 1,000 and 10,000 instances of a 1000+ triangle object using instancing (via a texture buffer object), and this works just fine. However, at any one time only a fraction of these instances are visible, so I would now like to experiment with transform feedback and GPU frustum culling to generate a list of only the visible instances.
My initial thought is to render each instance as a point (GL_POINTS) and use transform feedback to store the gl_InstanceID of each visible instance in a buffer object. By emitting the gl_InstanceID into a (texture) buffer object I can integrate this new step into my existing pipeline and render the scene efficiently no matter how many instances are drawn.

I have no experience with transform feedback, but that is not the area which I foresee as the main stumbling block. It’s geometry shaders, which are completely new to me; I just don’t have a clue how to write one. The goal of the geometry shader is to frustum cull the points and emit the gl_InstanceID. Every object instance has the same radius, which could be passed as a uniform to aid the culling.

I’ve Googled a lot and can’t really find any decent tutorials on Geometry shaders and/or Geometry view frustum culling. Transform feedback examples are on the net as someone has nicely written the OpenGL Samples pack series of GL 3/4 examples, and there’s the Particle system transform feedback link on the front page of OpenGL.org.

So my question is: can anyone help explain how to implement geometry shader view frustum culling with instanced rendering in mind, perhaps with some examples?
It would also help if someone could point to some reader-friendly geometry shader articles to help idiots like me with the syntax and rules (apart from the geometry_EXT spec).

OpenGL 3.3 compatibility profile
GLSL 1.5/330 compatibility shaders

You are lucky, as I wrote a demo that does what you are looking for and there are also accompanying articles.

First of all, I would suggest using instanced arrays instead of a texture buffer object, because on DX10 class hardware they are slightly faster than texture buffer object usage (though this difference is not present on DX11 class hardware).

Here you can find the necessary information:

And if you are even interested in further improving the technique with Hi-Z occlusion culling and/or geometry LOD then here are some further articles and another demo:

The latter uses GL4 tech, but it can also be implemented using GL3-only features.

@aqnuep
Actually I have seen your demos and articles; very nice and well done to you! You seem to be the only person I can find implementing something close to what I need.

Thinking back, your articles motivated me to introduce instanced rendering in my engine, and I subsequently posted my (disappointing) results on these forums - “instancing sucks”.

Regarding instanced arrays: I felt these were much more cumbersome to implement than I had originally thought. There were two issues. First, a 4*4 matrix is required in my case, and this means 4 additional vertex attribute streams to set up. My models already have their own vertex attribute streams, so this is over and above the usual vertex, normal, tangent and texture. Second, although my engine supplied an integer list of all the instances to draw, I felt the additional CPU time to fetch each instance 4*4 matrix and then pack it into a buffer object was too much, so I ended up sending all instance data to the GPU.

Ultimately I ended up using a TBO instead. Texture buffer objects can hold massive amounts of (static) instance data which never needs changing. Each frame I only need to dynamically update a separate integer texture buffer object containing the visible gl_InstanceIDs based on some culling method as before, or just send all of them, knowing that at least the amount of data being uploaded is smaller than the entire 4*4 transform matrix per instance in the case of instanced arrays.

I’ll be interested to hear your views on that…

That was then, and now I have a slightly different case. I have no CPU culling available and many times the number of instances to draw.
My concern with instanced arrays is as before: I would be sending all the instance attribute data, but letting the GPU cull this time. However, sending 50,000+ sets of instance data just to ultimately discard most of them is not very efficient. So I’m looking to TBOs as the source of the instance data, with a single large static TBO holding the transform data and an integer TBO holding the list of visible instance IDs.
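
To put rough numbers on that comparison, here is a small sketch of the arithmetic. The helper names are made up for illustration, and it assumes 32-bit floats and 32-bit instance IDs.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for `count` instances that each carry a 4*4 float matrix. */
size_t matrix_bytes(size_t count)
{
    return count * 16 * sizeof(float);
}

/* Bytes needed for `count` 32-bit instance IDs. */
size_t id_bytes(size_t count)
{
    return count * sizeof(uint32_t);
}
```

For 50,000 instances that is 3,200,000 bytes of matrices against 200,000 bytes of IDs, so the per-frame ID list is 16 times smaller than re-sending the transforms.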

In your articles which you point to, you said you performed visibility testing in the Vertex shader. I’ve just looked at your cull.vert shader and can see that you construct a BBOX from the MVP and instance position. That seems like a lot of extra matrix multiplies and if my models contain between 1000-2000+ triangles, that’s a lot of vertex work to perform. Is there some advantage over doing the same in the Geometry shader?

Well, that’s a good point, but if the total number of vertex attributes is below 16 then I wouldn’t worry about it. Also, you can use quaternions and then reduce the 4x4 floats to 4x2 floats.
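
As a rough illustration of the quaternion idea, here is a sketch in plain C. The struct layout (quaternion, translation, uniform scale) and the names PackedInstance and transform_point are assumptions made for this example, not something taken from the demo; in practice the same arithmetic would live in the vertex shader.

```c
/* Hypothetical packed instance: 8 floats (two vec4 attributes) instead of 16. */
typedef struct {
    float rot[4];   /* unit quaternion, stored as (x, y, z, w) */
    float pos[3];   /* translation */
    float scale;    /* uniform scale */
} PackedInstance;

/* Transform a model-space point: scale, rotate by the quaternion, translate. */
void transform_point(const PackedInstance *inst, const float v[3], float out[3])
{
    const float *q = inst->rot;
    float s[3] = { v[0] * inst->scale, v[1] * inst->scale, v[2] * inst->scale };

    /* t = cross(q.xyz, s) + q.w * s */
    float t[3] = {
        q[1] * s[2] - q[2] * s[1] + q[3] * s[0],
        q[2] * s[0] - q[0] * s[2] + q[3] * s[1],
        q[0] * s[1] - q[1] * s[0] + q[3] * s[2]
    };

    /* out = s + 2 * cross(q.xyz, t) + pos */
    out[0] = s[0] + 2.0f * (q[1] * t[2] - q[2] * t[1]) + inst->pos[0];
    out[1] = s[1] + 2.0f * (q[2] * t[0] - q[0] * t[2]) + inst->pos[1];
    out[2] = s[2] + 2.0f * (q[0] * t[1] - q[1] * t[0]) + inst->pos[2];
}
```

This only covers rotation, translation and uniform scale; a non-uniform scale or shear would need extra data.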

Although my engine supplied an integer list of all the instances to draw I felt the additional CPU time to fetch each instance 4*4 matrix and then pack into a buffer object was too much, so I ended up sending all instance data to the GPU.

I think you are doing something wrong here. When culling, you should emit the actual instance data that was fed to the culling pass. This way you can avoid the additional indirection that is needed to load the instance data index first. Of course, the amount of data emitted by the culling pass increases this way, but culling is performed on a per-object basis, so you shouldn’t worry about that. The indirection, on the other hand, slows down every vertex shader invocation during the actual rendering, and that can be a visible performance hit.

In your articles which you point to, you said you performed visibility testing in the Vertex shader. I’ve just looked at your cull.vert shader and can see that you construct a BBOX from the MVP and instance position. That seems like a lot of extra matrix multiplies and if my models contain between 1000-2000+ triangles, that’s a lot of vertex work to perform. Is there some advantage over doing the same in the Geometry shader?

Well, you misunderstood something here as well. That vertex shader is the one executed before the geometry shader that culls the instances, so it is executed only once per instance, not for all the vertices of the actual objects. It therefore doesn’t matter how complex the actual scene geometry is.

Okay, the BBOX MV multiplication is a bit costly but don’t forget that this is done per object, not per actual vertex of the geometry. If you wish, you can of course use a sphere as the bounding volume, and then you don’t need that many multiplications. However, as a general rule, don’t try to optimize the culling shader; rather, make it more sophisticated. The cost of the culling pass is so small that optimizing it won’t change the overall rendering performance, whereas simplifying the culling may leave you with more potentially visible objects and thus decrease the overall rendering performance.
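
For reference, a bounding-sphere test can be sketched like this. It is written as plain C so the arithmetic can be checked on the CPU; the same plane tests could be done in the culling shader. The name sphere_visible, the column-major matrix layout and the OpenGL clip-space convention (-w <= x, y, z <= w) are assumptions for this sketch, and it is not the code from the demo.

```c
#include <stdbool.h>

/* Returns true if the sphere may be visible, false if it is certainly outside.
   `m` is a column-major 4*4 matrix (OpenGL layout) that maps the space the
   sphere is given in to clip space; `radius` is in that same space. */
bool sphere_visible(const float m[16], const float center[3], float radius)
{
    for (int i = 0; i < 6; ++i) {
        int axis = i / 2;                            /* 0 = x, 1 = y, 2 = z */
        float sign = (i % 2 == 0) ? 1.0f : -1.0f;    /* +: left/bottom/near */

        /* Plane coefficients = row 3 +/- row `axis` of the matrix. */
        float p[4];
        for (int c = 0; c < 4; ++c)
            p[c] = m[c * 4 + 3] + sign * m[c * 4 + axis];

        /* Unnormalized signed distance of the center from the plane. */
        float d = p[0] * center[0] + p[1] * center[1] + p[2] * center[2] + p[3];
        float len2 = p[0] * p[0] + p[1] * p[1] + p[2] * p[2];

        /* Outside when d / sqrt(len2) < -radius; compared squared to avoid sqrt. */
        if (d < 0.0f && d * d > radius * radius * len2)
            return false;
    }
    return true;
}
```

The test is conservative: a sphere just outside a corner of the frustum can still be reported as visible, which only costs a few extra instances in the render pass.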

The original reason I performed the actual culling in the vertex shader was a bug in the geometry shader compiler of the AMD drivers, and this was the only workaround. In practice, however, you should most probably do the same, because on early geometry shader implementations (most DX10 class hardware, especially NVIDIA) geometry shaders couldn’t be executed in parallel, so the more complex the geometry shader is, the more it becomes the bottleneck of the pipeline. But as I said, the culling pass will not determine the overall performance, so do it in whatever way best fits your design.

When culling, you should emit the actual instance data that was fed to the culling pass. This way you can avoid the additional indirection that is needed to load the instance data index first

Yes, agreed that may save all the messing about with fetching instance data and packing into buffer and/or creating an index list.

Okay, the BBOX MV multiplication is a bit costly but don’t forget that this is done per object

Yes, I thought about this after I had posted. I forgot I’d be sending GL_POINTS and not the actual geometry, so the vertex shader with the culling code will only be executed once per instance.
As a side note, if the TBO contains all the instance data when I render the cull pass with GL_POINTS, I don’t have a vertex array containing the position - that’s all in the TBO. Is there an easy way to reuse the TBO for the points rendering or do I have to create another buffer object containing just the point positions?
Perhaps I could re-use the TBO as the source for the vertex array and specify a stride to offset the fact that I’m sending in 4*4 matrix data?

Exactly as you say. I’ve done it in the very same way in the Nature and Mountains demo. You can use the same buffer object as both a VBO and a TBO at any time; only the vertex attrib specifications have to be correct, as you said.
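
A small sketch of the stride and offset involved, assuming each instance is stored as one column-major 4*4 float matrix with the translation in the last column (that layout and the helper names are assumptions for this example):

```c
#include <stddef.h>

enum { FLOATS_PER_INSTANCE = 16 };   /* one 4*4 matrix per instance */

/* Byte stride between consecutive instances in the shared buffer. */
size_t instance_stride(void)
{
    return FLOATS_PER_INSTANCE * sizeof(float);
}

/* Byte offset of the translation (4th column) inside one matrix. */
size_t position_offset(void)
{
    return 12 * sizeof(float);
}

/* CPU-side equivalent of what the attribute fetch reads for instance `i`.
   On the GL side the two values above would be passed as
   glVertexAttribPointer(index, 3, GL_FLOAT, GL_FALSE, stride, (const void *)offset)
   while the buffer is bound to GL_ARRAY_BUFFER. */
const float *instance_position(const float *buffer, size_t i)
{
    const char *bytes = (const char *)buffer;
    return (const float *)(bytes + i * instance_stride() + position_offset());
}
```

With 32-bit floats that gives a stride of 64 bytes and an offset of 48 bytes, so the point pass reads one position per matrix without a second buffer.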

Actually you’ll have to make vertex attribs out of your instance data anyway (or at least from part of it, unless you do attrib-less rendering), so instanced arrays are trivial to use in the very same fashion.

Given the following VS and GS snippets, how can I make the GS emit more instance data than the original input instance data?
In other words, I input a vec4 as instance data and would like to emit 2 * vec3 as output. How do I do that?

VS snippet:


#version 330 compatibility
	
uniform vec3 objectextent;		//BBOX extents of model
uniform vec3 origin;			//to add to every instances' position; eg Sun's position
uniform mat4 modelmatrix;		//model scale
uniform mat4 cameraviewmatrix;


in vec4 instanceposition;			// X=orbital angle (radians); A=orbital distance
flat out vec4 emit_InstancePosition;		//pass through to Geometry shader
flat out int objectVisible;			//geometry shader cull flag

void main()

GS snippet:


#version 330 compatibility

layout(points) in;
layout(points, max_vertices = 1) out;
	

flat in vec4 emit_InstancePosition[1];
flat in int objectVisible[1];

out vec4 vertex_out;	//emit data to buffer object

void main()

The same way you have your vertex shader emit more than one value. Or your fragment shader emit more than one value. You declare multiple output values.

Now, if you’re doing transform feedback, you need to associate those outputs with buffers using the TF mechanisms.

Right, I am using transform feedback.
If I declare two OUT variables in the geometry shader, how do these get written to the buffer object? Do they get written sequentially or into separate buffers? When I compiled the GS I specified separate buffers, but now I’m thinking I should be using interleaved buffers instead, as I want to pack both OUT variables as two consecutive vec3.

http://www.opengl.org/sdk/docs/man3/xhtml/glTransformFeedbackVaryings.xml
It’s quite easy.


// geometry shader
out vec3 x1;
out vec3 x2;
....


const char* vars[2]={"x1","x2"};
glTransformFeedbackVaryings(prog,2,vars, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(prog);

...
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER,0,mybuffer);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(....); // bind some VBO beforehand for these
glEndTransformFeedback();

Done, the “mybuffer” buffer has your data. With a GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN query, you can get the number of primitives (here, points) written.
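
To make the interleaved layout concrete: with the two vec3 outputs above, each captured point occupies 6 consecutive floats, x1 first and then x2. A minimal sketch of reading that layout back on the CPU (for example after glGetBufferSubData or glMapBuffer) could look like this; CapturedVertex and read_captured_vertex are names made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* Mirrors one captured point: x1 immediately followed by x2, both vec3. */
typedef struct {
    float x1[3];
    float x2[3];
} CapturedVertex;

/* Copy captured point `i` out of the raw feedback data. */
CapturedVertex read_captured_vertex(const float *buffer, size_t i)
{
    CapturedVertex v;
    memcpy(&v, buffer + i * 6, sizeof v);
    return v;
}
```

With GL_SEPARATE_ATTRIBS instead, x1 and x2 would each go to their own buffer binding point and each buffer would hold 3 floats per point.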

Thanks I’ll give that a try.

Many thanks Llian, that worked.
Instead of two OUT variables (which I did check can be written to two separate buffer objects if I so wish), what if there is just one OUT which is a struct containing two vec4?
Would that mean the buffer object is updated with 8 floats using just the one stream (interleaved format), or is it a case that each set of 4 floats uses a separate stream, or is it invalid?

I’ll reply to my own post.
I tried a struct as output from the GS which did compile properly.
However, after linking the shader and reading back the transform feedback varyings, there was 1 TFB varying of type GL_INT!

I’ve given up on that idea and will stick to two OUT variables to an interleaved buffer.

My results of comparing different instancing APIs and buffer sizes. Comments welcome!

Deferred Rendering, 1 point light infinite length
Asteroid field between 1,000 and 45,000 instances.

Intel Q6600 quad core cpu @ 2.8 GHz
AMD Radeon 4850, OpenGL 3.3 compatibility profile context

Input instance data = vec4, supplied as vertex attribute.

key:
TBO = texture buffer object
TFB = transform feedback
culling = vertex shader/Geometry shader instance culling via Transform Feedback
count = number of instances rendered (num visible after culling)



Configuration   | (1) TBO, without TFB            | (2) TFB, TBO used for 2nd pass  | (3) TFB, instanced arrays
Instances       |       0   1,000  15,000  45,000 |       0   1,000  15,000  45,000 |       0   1,000  15,000  45,000
----------------+---------------------------------+---------------------------------+--------------------------------
FPS             |     145     124      45      19 |     140      97      44      20 |     144      99      45      21
FPS + culling   |     ---     ---     ---     --- |     140     105      92      73 |     144     106      75      47
  (num visible) |                                 | (did not measure)               |             290   4,600  13,000
----------------+---------------------------------+---------------------------------+--------------------------------
Draw time (ms)  |       5       5       5       5 |       5       9      21      47 |       5       9      21      47
ms + culling    |       -       -       -       - |       5       8      10      12 |       5       8      12      20
  (num visible) |                                 | (did not measure)               |             290   4,600  13,000
----------------+---------------------------------+---------------------------------+--------------------------------

(1) No transform feedback. Just 1 pass and 1 Texture Buffer Object used to render. No culling available - all instances drawn.
(2) Transform feedback emitted 8 floats into (texture) buffer object used to render in 2nd pass. Did not record num visible instances.
(3) Transform feedback emitted between 5-8 floats into buffer object used to render in 2nd pass.

Geometry shader emits 2 OUT variables into a single buffer object.
None of the measured draw times or FPS changed as a result of packing into an interleaved buffer or two separate buffers.
Similarly, transform feedback writing 1 * vec4, 1 * vec4 + 1 float, 3 * vec3 + 1 * vec2 or 2 * vec4 makes no difference to measured performance.

A GL query was issued to retrieve the number of samples passed during TFB. I could not make use of GL_ARB_TRANSFORMFEEDBACK3 because the model is rendered via glDrawElementsInstanced and TFB3 only handles glDrawArraysInstanced.