DrawXInstancedTransformFeedback

Since ARB_transform_feedback_instanced, it is possible to draw multiple instances of transform feedback data without using a query and the resulting round trip from server to client. The primcount must be specified by the client, while the count is read from the transform feedback object. Being able to do it the other way round would be a nice addition, especially for instance cloud reduction algorithms: we know what we need to render (a mesh), but we don’t know the number of instances for the current frame (because we’re doing per-instance view frustum culling on the GPU, for example).

So here’s what I quickly came up with: two new instanced drawing functions which use the result of a transform feedback object as the primcount parameter:

  • DrawArraysInstancedTransformFeedback(enum mode, int first, sizei count, uint id);
  • DrawElementsInstancedTransformFeedback(enum mode, sizei count, enum type, const void* indices, uint id);

I think what they do is pretty explicit, so I’m not giving any details. The parameters are the same as those of the standard functions, except that the primcount parameter is replaced by the name of the TF object.

What you’re asking for doesn’t make sense. You want to use transform feedback to somehow produce a count of instances to render. How would that work? What would your shader have to look like to generate a count?

1/ You have an array of matrices; each matrix is an instance of a mesh.
2/ Use transform feedback to perform culling in a geometry shader: you get another array of matrices. You also have the number of generated primitives stored in the transform feedback object.
3/ Draw your meshes with instancing, using the culled matrix array as per-instance data (glVertexAttribDivisor), and use the number of generated primitives stored in the transform feedback object as the primcount in one of the functions I suggested.

Currently the only solution is to use a query to get the result. With my suggestion, we can do this asynchronously.
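
As a rough CPU-side reference for steps 1/ to 3/ above, the following C sketch shows what the cull pass is expected to produce: a compacted array of visible matrices plus a count. The names `Mat4`, `Plane`, `is_visible` and `cull_instances` are hypothetical, and the visibility test (instance origin against six planes with a bounding radius) is only a stand-in for whatever test the real shader would use.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float m[16]; } Mat4;        /* column-major, like a GLSL mat4 */
typedef struct { float a, b, c, d; } Plane;  /* a*x + b*y + c*z + d >= 0 means inside */

/* Hypothetical visibility test: the instance origin (column 3 of the matrix)
   must be no further than `radius` outside any of the six planes. */
static int is_visible(const Mat4 *model, const Plane planes[6], float radius)
{
    float x = model->m[12], y = model->m[13], z = model->m[14];
    for (int i = 0; i < 6; ++i) {
        float dist = planes[i].a * x + planes[i].b * y + planes[i].c * z + planes[i].d;
        if (dist < -radius)
            return 0;
    }
    return 1;
}

/* Steps 1/ and 2/: copy the visible matrices to `out`, keeping input order,
   and return how many were written. That return value is the number the
   transform feedback object holds on the GPU side. */
static size_t cull_instances(const Mat4 *in, size_t count,
                             const Plane planes[6], float radius, Mat4 *out)
{
    assert(in != NULL && out != NULL);
    size_t written = 0;
    for (size_t i = 0; i < count; ++i)
        if (is_visible(&in[i], planes, radius))
            out[written++] = in[i];
    return written;
}
```

Step 3/ then draws `written` instances of the mesh. With a query, that number has to travel through the client before the draw call can be issued; the suggestion is to let the draw call read it from the transform feedback object directly.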

OK, that makes sense.

That wouldn’t work that way. I know it because I’m the author of the article you’ve linked. Transform feedback renders the captured data as the primitive type you specify. The problem is that the result of the transform feedback is the instance data buffer and you don’t want to feed it back that way. You cannot even use indexed triangles (DrawElements*) this way.

What we need in order to implement the algorithm you described, and what I also investigated, is the ability to take an instanced draw command’s num_instances field from a buffer object. That would naturally be an extension to the existing indirect drawing functionality: a MultiDrawElementsIndirect-style command that takes its num_instances parameter from a buffer filled previously by the culling phase using atomic counters. I’ve already proposed such a development idea to NVIDIA and AMD. AMD implemented some of the proposal via AMD_multi_draw_indirect; however, even though the latter provides MultiDrawElementsIndirect for executing multiple indirect draw commands, the num_instances parameter is still taken from the client side.

Actually I do want to feed it back that way.

The solution you’re talking about is not what I’m describing in my suggestion.

To make it clear once and for all, here’s the code I’d like to be able to write:

void init()
{
	glBindVertexArray(VERTEX_ARRAY_PER_INSTANCE_DATA);
		glEnableVertexAttribArray(0); // per instance matrix column 0
		glEnableVertexAttribArray(1); // per instance matrix column 1
		glEnableVertexAttribArray(2); // per instance matrix column 2
		glEnableVertexAttribArray(3); // per instance matrix column 3
		glBindBuffer(GL_ARRAY_BUFFER, BUFFER_PER_INSTANCE_DATA);
		glVertexAttribPointer(0, 4, GL_FLOAT, 0, sizeof(mat4), BUFFER_OFFSET(0));
		glVertexAttribPointer(1, 4, GL_FLOAT, 0, sizeof(mat4), BUFFER_OFFSET(  sizeof(vec4)));
		glVertexAttribPointer(2, 4, GL_FLOAT, 0, sizeof(mat4), BUFFER_OFFSET(2*sizeof(vec4)));
		glVertexAttribPointer(3, 4, GL_FLOAT, 0, sizeof(mat4), BUFFER_OFFSET(3*sizeof(vec4)));
	glBindVertexArray(VERTEX_ARRAY_RENDER);
		glEnableVertexAttribArray(0); // vertex position of the instanced mesh
		glEnableVertexAttribArray(1); // per instance matrix column 0
		glEnableVertexAttribArray(2); // per instance matrix column 1
		glEnableVertexAttribArray(3); // per instance matrix column 2
		glEnableVertexAttribArray(4); // per instance matrix column 3
		glBindBuffer(GL_ARRAY_BUFFER, BUFFER_MESH_VERTICES);
		glVertexAttribPointer(0, 3, GL_FLOAT, 0, 0, BUFFER_OFFSET(0));
		glBindBuffer(GL_ARRAY_BUFFER, BUFFER_PER_INSTANCE_DATA_CULLED);
		// attributes 1..4 hold the culled per instance matrix (attribute 0 is the mesh vertex)
		glVertexAttribPointer(1, 4, GL_FLOAT, 0, sizeof(mat4), BUFFER_OFFSET(0));
		glVertexAttribPointer(2, 4, GL_FLOAT, 0, sizeof(mat4), BUFFER_OFFSET(  sizeof(vec4)));
		glVertexAttribPointer(3, 4, GL_FLOAT, 0, sizeof(mat4), BUFFER_OFFSET(2*sizeof(vec4)));
		glVertexAttribPointer(4, 4, GL_FLOAT, 0, sizeof(mat4), BUFFER_OFFSET(3*sizeof(vec4)));
		glVertexAttribDivisor(1, 1);
		glVertexAttribDivisor(2, 1);
		glVertexAttribDivisor(3, 1);
		glVertexAttribDivisor(4, 1);
		glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, BUFFER_MESH_INDEXES);
	glBindVertexArray(0);
}

void cullPass()
{
	glUseProgram(PROGRAM_CULL);
	glBindTransformFeedback(GL_TRANSFORM_FEEDBACK, TRANSFORM_FEEDBACK_CULL);
	glBeginTransformFeedback(GL_POINTS);
		glBindVertexArray(VERTEX_ARRAY_PER_INSTANCE_DATA);
		glDrawArrays(GL_POINTS, 0, INSTANCE_COUNT);
	glEndTransformFeedback();
}

void renderPass()
{
	glUseProgram(PROGRAM_RENDER);
	glBindVertexArray(VERTEX_ARRAY_RENDER);
	// NEW : use the tf's counter to specify primcount
	glDrawElementsInstancedTransformFeedback( GL_TRIANGLES,
	                                          mesh.count, 
	                                          GL_UNSIGNED_SHORT, 
	                                          BUFFER_OFFSET(0), 
	                                          TRANSFORM_FEEDBACK_CULL );
}

And while I’m at it some shader code
Cull:

#version 430 core

// usual culling stuff
uniform vec3 u_instanceBoxMin;
uniform vec3 u_instanceBoxMax;
layout(std140) uniform FrustumPlanes {
	vec4 u_frustumPlanes[6]; // view frustum planes
};

/////////////////////////////////////////////////
// Vertex Shader
layout(location = 0) in vec4 i_perInstanceMeshMatrixCol0;
layout(location = 1) in vec4 i_perInstanceMeshMatrixCol1;
layout(location = 2) in vec4 i_perInstanceMeshMatrixCol2;
layout(location = 3) in vec4 i_perInstanceMeshMatrixCol3;

layout(location = 0) out vec4 o_perInstanceMeshMatrixCol0;
layout(location = 1) out vec4 o_perInstanceMeshMatrixCol1;
layout(location = 2) out vec4 o_perInstanceMeshMatrixCol2;
layout(location = 3) out vec4 o_perInstanceMeshMatrixCol3;
layout(location = 4) flat out int o_isVisible;

void main()
{
	mat4 modelMatrix = mat4( i_perInstanceMeshMatrixCol0,
	                         i_perInstanceMeshMatrixCol1,
	                         i_perInstanceMeshMatrixCol2,
	                         i_perInstanceMeshMatrixCol3 );

	// set varyings
	o_perInstanceMeshMatrixCol0 = i_perInstanceMeshMatrixCol0;
	o_perInstanceMeshMatrixCol1 = i_perInstanceMeshMatrixCol1;
	o_perInstanceMeshMatrixCol2 = i_perInstanceMeshMatrixCol2;
	o_perInstanceMeshMatrixCol3 = i_perInstanceMeshMatrixCol3;
	
	// compute AABB and test against view frustum planes
	vec3 aabbVertices[8];
	// ...

	// if the AABB is intersecting or inside the view frustum
	o_isVisible = 1;
}

/////////////////////////////////////////////////
// Geom Shader

layout(points) in;
layout(location = 0) in vec4 i_perInstanceMeshMatrixCol0[1];
layout(location = 1) in vec4 i_perInstanceMeshMatrixCol1[1];
layout(location = 2) in vec4 i_perInstanceMeshMatrixCol2[1];
layout(location = 3) in vec4 i_perInstanceMeshMatrixCol3[1];
layout(location = 4) flat in int i_isVisible[1];

layout(points, max_vertices = 1) out;
layout(location = 0, stream = 0) out vec4 o_perInstanceMeshMatrixCol0;
layout(location = 1, stream = 0) out vec4 o_perInstanceMeshMatrixCol1;
layout(location = 2, stream = 0) out vec4 o_perInstanceMeshMatrixCol2;
layout(location = 3, stream = 0) out vec4 o_perInstanceMeshMatrixCol3;

void main()
{
	if(1 == i_isVisible[0])
	{
		o_perInstanceMeshMatrixCol0 = i_perInstanceMeshMatrixCol0[0];
		o_perInstanceMeshMatrixCol1 = i_perInstanceMeshMatrixCol1[0];
		o_perInstanceMeshMatrixCol2 = i_perInstanceMeshMatrixCol2[0];
		o_perInstanceMeshMatrixCol3 = i_perInstanceMeshMatrixCol3[0];

		EmitVertex();
		EndPrimitive();
	}
}

Render:

#version 430 core

uniform mat4 u_viewMatrix;
uniform mat4 u_projectionMatrix;

/////////////////////////////////////////////////
// Vertex shader
layout(location = 0) in vec3 i_meshVertex;
layout(location = 1) in vec4 i_perInstanceMeshMatrixCol0;
layout(location = 2) in vec4 i_perInstanceMeshMatrixCol1;
layout(location = 3) in vec4 i_perInstanceMeshMatrixCol2;
layout(location = 4) in vec4 i_perInstanceMeshMatrixCol3;

void main()
{
	mat4 modelMatrix = mat4( i_perInstanceMeshMatrixCol0,
	                         i_perInstanceMeshMatrixCol1,
	                         i_perInstanceMeshMatrixCol2,
	                         i_perInstanceMeshMatrixCol3 );
	mat4 modelViewProjection = u_projectionMatrix * (u_viewMatrix * modelMatrix);
	gl_Position = modelViewProjection * vec4(i_meshVertex, 1.0);
}

/////////////////////////////////////////////////
// Fragment shader
layout(location = 0) out vec4 o_color;

void main()
{
	o_color = vec4(1.0);
}

In the end, I’m suggesting using the counter of a transform feedback object for something other than just the number of vertices in a GL draw call, more specifically as the number of instances in an instanced rendering scenario. It seems feasible to me and would offer more asynchronous behaviour for instanced rendering algorithms.

@aqnuep The multi draw arrays solution you talk about comes in handy when you have different geometry/meshes to instantiate. My scenario assumes that we’re using multiple instances of one single mesh.

Ah, now I know what you mean. But this is something that is already possible via ARB_draw_indirect and ARB_shader_atomic_counters.

Since with indirect drawing the primcount parameter already comes from a buffer object, the only thing you have to do is make the primcount field of the indirect draw command buffer the backing store of the atomic counter, and simply increment the atomic counter in the geometry shader.
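
To make the buffer aliasing concrete, here is a minimal C sketch of the indexed indirect command layout from ARB_draw_indirect and of the byte offset at which the atomic counter would have to sit. The helper names `prim_count_offset` and `make_command` are hypothetical; only the struct layout comes from the extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout of one indexed indirect draw command, as given in ARB_draw_indirect. */
typedef struct {
    uint32_t count;              /* number of indices per instance */
    uint32_t primCount;          /* number of instances: the field the counter drives */
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t reservedMustBeZero;
} DrawElementsIndirectCommand;

static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "the command is five tightly packed 32-bit words");

/* Byte offset, inside the command, where the atomic counter has to live. */
static size_t prim_count_offset(void)
{
    return offsetof(DrawElementsIndirectCommand, primCount);
}

/* Command as the client would upload it before the cull pass: primCount
   starts at zero and is only ever incremented on the GPU. */
static DrawElementsIndirectCommand make_command(uint32_t index_count)
{
    DrawElementsIndirectCommand cmd = { index_count, 0u, 0u, 0, 0u };
    return cmd;
}
```

Under these assumptions, the same buffer object would be bound both as the GL_DRAW_INDIRECT_BUFFER and, at a 4-byte offset, as the GL_ATOMIC_COUNTER_BUFFER (for example with glBindBufferRange, or with an `offset = 4` layout qualifier on the counter). A glMemoryBarrier with GL_COMMAND_BARRIER_BIT between the cull pass and the indirect draw would presumably also be needed so that the draw sees the final counter value.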

Actually, you don’t even need transform feedback and a geometry shader: you can do everything using ARB_shader_image_load_store and implement an append buffer using a read/write image and an atomic counter. This is even more efficient than using a geometry shader and transform feedback, because geometry shaders must ensure that primitives are emitted in the same order as they were received as input. The hardware has to ensure this, and it has a negative effect on performance. As we simply store an unordered array of instance data, we have no ordering requirements, so it is faster to implement the whole thing with an append buffer.
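
The append buffer idea can be modelled on the CPU with C11 atomics. This is only a sketch of the slot reservation pattern, not of the GPU implementation; `AppendBuffer`, `append_init`, `append` and `append_count` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { float m[16]; } Mat4;

typedef struct {
    Mat4        *slots;     /* storage, plays the role of the read/write image */
    size_t       capacity;  /* number of matrices that fit in `slots` */
    atomic_uint  count;     /* plays the role of the atomic counter */
} AppendBuffer;

static void append_init(AppendBuffer *buf, Mat4 *storage, size_t capacity)
{
    assert(buf != NULL && storage != NULL);
    buf->slots = storage;
    buf->capacity = capacity;
    atomic_init(&buf->count, 0u);
}

/* Reserve the next free slot with one atomic increment, then fill it.
   Callers may run concurrently: each gets a distinct slot, but the order
   of the slots says nothing about the order of the inputs.
   Returns the slot index, or -1 when the buffer is full. */
static long append(AppendBuffer *buf, const Mat4 *value)
{
    assert(buf != NULL && value != NULL);
    unsigned slot = atomic_fetch_add(&buf->count, 1u);
    if (slot >= buf->capacity)
        return -1;  /* the counter still advanced; append_count() clamps it */
    buf->slots[slot] = *value;
    return (long)slot;
}

/* Number of valid entries, i.e. the instance count for the later draw. */
static unsigned append_count(AppendBuffer *buf)
{
    unsigned n = atomic_load(&buf->count);
    return n < buf->capacity ? n : (unsigned)buf->capacity;
}
```

The single fetch-and-add is the only point of contention. Nothing forces slot i to hold input i, which is exactly the ordering guarantee that the transform feedback path has to pay for.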

Very interesting! I’m going to try to lay out what you mean; would you mind telling me if I understood you correctly?

The vertex shader would look something like this (actually we only need a vertex stage):

#version 420 core

layout(binding = 0) uniform atomic_uint atomic_primCount;                 // number of instances
layout(rgba32f) writeonly uniform imageBuffer image_perInstanceData; // texture buffer

// usual culling stuff
uniform vec3 u_instanceBoxMin;
uniform vec3 u_instanceBoxMax;
layout(std140) uniform FrustumPlanes {
	vec4 u_frustumPlanes[6]; // view frustum planes
};

layout(location = 0) in vec4 i_perInstanceMeshMatrixCol0;
layout(location = 1) in vec4 i_perInstanceMeshMatrixCol1;
layout(location = 2) in vec4 i_perInstanceMeshMatrixCol2;
layout(location = 3) in vec4 i_perInstanceMeshMatrixCol3;

void main()
{
	mat4 modelMatrix = mat4( i_perInstanceMeshMatrixCol0,
	                         i_perInstanceMeshMatrixCol1,
	                         i_perInstanceMeshMatrixCol2,
	                         i_perInstanceMeshMatrixCol3 );

	// compute AABB and test against view frustum planes
	vec3 aabbVertices[8];
	// ...

	// result of the frustum test elided above (placeholder value)
	int isVisible = 1;

	// if the AABB is visible
	if(1 == isVisible)
	{
		// reserve four consecutive texels for this instance's matrix
		int perInstanceOffset = 4 * int(atomicCounterIncrement(atomic_primCount));
		imageStore(image_perInstanceData, perInstanceOffset  , modelMatrix[0]);
		imageStore(image_perInstanceData, perInstanceOffset+1, modelMatrix[1]);
		imageStore(image_perInstanceData, perInstanceOffset+2, modelMatrix[2]);
		imageStore(image_perInstanceData, perInstanceOffset+3, modelMatrix[3]);
	}
}

Where :

  • the atomic counter atomic_primCount is the primCount field of a DRAW_INDIRECT_BUFFER (also bound as an ATOMIC_COUNTER_BUFFER)
  • and image_perInstanceData is my ‘BUFFER_PER_INSTANCE_DATA_CULLED’ (given in my previous post), bound as a TEXTURE_BUFFER to an image.

Yes, I meant exactly what you’ve presented.

I planned to update my Nature and Mountains demo to use this new technique as well, but I’ve been quite busy lately, and GL 4.2 drivers are not mature enough yet, so I figured there was no hurry.

Yes, I’m pretty curious about the performance (writing to an image with synchronization doesn’t sound very GPU friendly; I guess I’ll have to benchmark it to find out). I’ll also be looking forward to seeing your updated demo on your blog. Thanks for sharing the algorithm!

The whole point of this method is that there is no synchronization. Everything is done by the GPU, so there is no need to stall the pipeline, as happens when you query the number of primitives written during transform feedback.

I meant synchronization amongst the GPU variables. I agree the server and the client run asynchronously with this algorithm.

There is no need for synchronization amongst the GPU variables; it is ensured by the fact that OpenGL performs the operations one after the other on the server side.

Each vertex is processed in parallel in a vertex shader, right? If you have, say, an atomic counter in a shader, and you write to it, there must be some sort of resource locking management (confirmed by things such as the ‘coherent’ keyword, or memoryBarrier() in GLSL), unless there’s some sort of magic operating in GPUs which allows multiple threads to write to a shared variable and get a coherent result. This is why I’m curious to see how the shaders will perform with such things. The CPU is NOT involved in any of this, I know ;).

Yes, there is resource locking, however, there is dedicated hardware for handling coherency on multiple levels (SIMD core wide, core group wide and device-wide).

Actually, geometry shaders and transform feedback are much worse from this point of view. In a similar fashion, there is a buffer that is accessed by all shader instances, and there is an atomic counter as well, since the shaders have to know where to store the next item. On top of that, there is a need for logic that ensures that the ordering of the output primitives matches the ordering of the input primitives.

I don’t really see why you think there would be any more synchronization overhead than with transform feedback…

_blitz,

I didn’t read the whole thread, but what you describe in your first message is possible on some (if not all) DX10 and later GPUs.

It is not, but it is possible on all DX11 GPUs.

I was talking about hardware, not APIs (or wherever you got that from…)

No, it is not supported by hardware. Atomic counters and in general any programmable atomic operations came with DX11 hardware (Radeon HD5000 series and GeForce 400 series).

Previous hardware did have support for read/write buffers (without atomic ops) like the Radeon HD3000 and HD4000 series, and there were some hard-wired counters (like those of occlusion queries and of transform feedback), but you did not have programmable atomic counters in previous hardware! It is not just the lack of API support.

As I said, you think in terms of API features, again.

I would implement it in the following way:

  1. Store the query result, containing how many primitives have been written, into a buffer object (that is usually done when a query is ended anyway).

  2. We might need to process the query result in the buffer in case the hardware stored it in different units or there is more than one result (e.g. one query result per engine, other data for other kinds of queries intermixed, etc.). I would use a compute shader to get the number of vertices written and store them into another buffer.

  3. There are 2 ways to implement the draw command:

a) We can copy the final value from the buffer into the state register that should contain the number of instances to render, assuming the GPU has such a register. Then just do what we would do in glDraw{Arrays,Elements}, but don’t set the number of instances to 1. This is the easy way.

b) If we can’t do (a), we have to create the hardware command for the draw call, setting the number of instances to our computed value, and storing the command into another buffer object using a compute shader. (Yes, generating commands on the GPU is possible) Then you just ‘execute’ that buffer.
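
Steps 2 and 3(b) can be modelled on the CPU as follows. This is a sketch under loud assumptions: the per-engine counter array is invented to illustrate the "more than one result" case, and the ARB_draw_indirect command layout stands in for the real, vendor-specific hardware command, which is not public. `resolve_primitives_written` and `write_draw_command` are hypothetical names for what a compute shader would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a hardware draw command (layout borrowed from ARB_draw_indirect). */
typedef struct {
    uint32_t count;
    uint32_t primCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t reservedMustBeZero;
} DrawElementsIndirectCommand;

/* Step 2: reduce the raw query data to a single number. The assumption here
   is a hypothetical GPU that keeps one "primitives written" counter per engine. */
static uint32_t resolve_primitives_written(const uint32_t *per_engine, size_t engines)
{
    assert(per_engine != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < engines; ++i)
        total += per_engine[i];
    return total;
}

/* Step 3(b): build the draw command in a buffer, using the resolved value
   as the instance count. */
static void write_draw_command(DrawElementsIndirectCommand *cmd,
                               uint32_t index_count, uint32_t instances)
{
    assert(cmd != NULL);
    cmd->count = index_count;
    cmd->primCount = instances;
    cmd->firstIndex = 0u;
    cmd->baseVertex = 0;
    cmd->reservedMustBeZero = 0u;
}
```

On a real GPU both functions would run on the device, so the instance count never has to be read back by the client.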

So there you have it.