Since ARB_transform_feedback_instanced, it is possible to draw multiple instances of transform feedback data without using a query and the resulting round trip from server to client. The primcount must be specified by the client while the count is read from the transform feedback object. Being able to do it the other way round would be a nice addition, especially for instance cloud reduction algorithms: we know what we need to render (a mesh), but we don’t know the number of instances for the current frame (because we’re doing per-instance view frustum culling on the GPU, for example).
So here’s what I quickly came up with: two new instanced drawing functions which use the result of a transform feedback object as the primcount parameter:
DrawArraysInstancedTransformFeedback(enum mode, int first, sizei count, uint id);
I think what they do is pretty explicit so I’m not giving any detail. The parameters are the same as the standard functions, but the primcount parameter is replaced by the name of the TF object.
What you’re asking for doesn’t make sense. You want to use transform feedback to somehow produce a count of instances to render. How would that work? What would your shader have to look like to generate a count?
1/ You have an array of matrices, each matrix is an instance of a mesh.
2/ Use transform feedback to perform culling in a geometry shader: you get another array of matrices. You also have the number of generated primitives stored in the transform feedback object.
3/ Draw your meshes with instancing using the culled matrix array as per instance data (vertexAttribDivisor) and use the number of generated primitives stored in the transform feedback object as the primcount in one of the functions I suggested.
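To make the three steps concrete, here is a minimal CPU-side sketch of what the culling pass in step 2 computes. It is only a reference model, not the GPU implementation: the `Mat4` and `Plane` types, the bounding-sphere test and the function name `cull_instances` are all assumptions made for illustration. The return value is the number that step 3 would like to use as primcount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not taken from any GL header. */
typedef struct { float m[16]; } Mat4;        /* column-major instance transform */
typedef struct { float a, b, c, d; } Plane;  /* a*x + b*y + c*z + d >= 0 is inside */

/* CPU model of step 2: copy every instance whose bounding sphere touches the
 * frustum into `out`, tightly packed, and return how many were kept. */
size_t cull_instances(const Mat4 *in, size_t n, float radius,
                      const Plane *planes, size_t plane_count, Mat4 *out)
{
    assert(in != NULL && out != NULL && planes != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        /* The translation sits in elements 12..14 of a column-major matrix. */
        float x = in[i].m[12], y = in[i].m[13], z = in[i].m[14];
        int visible = 1;
        for (size_t p = 0; p < plane_count; ++p) {
            float dist = planes[p].a * x + planes[p].b * y
                       + planes[p].c * z + planes[p].d;
            if (dist < -radius) {   /* sphere lies fully behind this plane */
                visible = 0;
                break;
            }
        }
        if (visible)
            out[kept++] = in[i];
    }
    return kept;
}
```

On the GPU the loop body would run once per input instance, and `kept` is exactly the value that ends up in the transform feedback object without the client ever seeing it.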
Currently the only solution is to use a query to get the result. With my suggestion, we can do this asynchronously.
That wouldn’t work that way. I know it because I’m the author of the article you’ve linked. Transform feedback renders the captured data as the primitive type you specify. The problem is that the result of the transform feedback is the instance data buffer and you don’t want to feed it back that way. You cannot even use indexed triangles (DrawElements*) this way.
What we need in order to implement the algorithm you described, and what I also investigated, is the ability to take an instanced draw command’s num_instances field from a buffer object. That would naturally be an extension to the already existing indirect drawing functionality, with a MultiDrawElementsIndirect-style command that takes its num_instances parameter from a buffer filled previously by the culling phase using atomic counters. Actually I’ve already proposed such a development idea to NVIDIA and AMD. AMD actually implemented some of the proposal via AMD_multi_draw_indirect; however, even though the latter provides MultiDrawElementsIndirect for executing multiple indirect draw commands, the num_instances parameter is still taken from the client side.
In the end, I’m suggesting to use the counter of a transform feedback object for something else than just the number of vertices in a gl draw call, more specifically as the number of instances in an instanced rendering scenario. It seems feasible to me and would offer more async behaviour for instanced rendering algorithms.
@aqnuep The multi draw arrays solution you talk about comes in handy when you have different geometry/meshes to instantiate. My scenario assumes that we’re using multiple instances of one single mesh.
Ah, now I know what you mean. But this is something that is already possible via ARB_draw_indirect and ARB_shader_atomic_counters.
With draw indirect the primcount parameter already comes from a buffer object, so the only thing you have to do is set the backing buffer of the atomic counter to the primcount field of the indirect draw command buffer and simply increment the atomic counter in the geometry shader.
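As a rough sketch of the buffer layout this relies on: the struct below follows the command layout given in the ARB_draw_indirect specification, while the helper names `prim_count_offset` and `reset_command` are made up for illustration. No GL context is needed to run it; the GL calls that would consume these values are only mentioned in comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command layout as given in ARB_draw_indirect. */
typedef struct {
    uint32_t count;               /* vertices per instance */
    uint32_t primCount;           /* number of instances */
    uint32_t first;               /* first vertex */
    uint32_t reservedMustBeZero;
} DrawArraysIndirectCommand;

/* Byte offset at which the atomic counter storage has to start so that
 * counter increments land in primCount. With a real context the same buffer
 * would be bound to GL_DRAW_INDIRECT_BUFFER and, through glBindBufferRange
 * with the GL_ATOMIC_COUNTER_BUFFER target and this offset, to the counter. */
size_t prim_count_offset(void)
{
    return offsetof(DrawArraysIndirectCommand, primCount);
}

/* Hypothetical helper: prepare the command before the culling pass.
 * primCount starts at zero; the shader increments it once per kept instance. */
void reset_command(DrawArraysIndirectCommand *cmd, uint32_t mesh_vertex_count)
{
    assert(cmd != NULL);
    cmd->count = mesh_vertex_count;
    cmd->primCount = 0;
    cmd->first = 0;
    cmd->reservedMustBeZero = 0;
}
```

After the culling pass, the draw is issued with glDrawArraysIndirect and the instance count never travels back to the client.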
Actually you don’t even need transform feedback and geometry shader, but you can do everything using ARB_shader_image_load_store and implement an append buffer using a read/write image and an atomic counter. This is even more efficient than using geometry shader and transform feedback because geometry shaders must ensure that the order of the primitives emitted is in the same order as those received as input. The hardware has to ensure this and it has a negative effect on performance. As we simply store an unordered array of instance data, we don’t have requirements related to the ordering, so it is faster to implement the whole thing with an append buffer.
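The append buffer idea can be modelled on the CPU with C11 atomics. This is only an analogy under stated assumptions: `AppendBuffer`, `append_init`, `append` and `append_count` are hypothetical names, the `atomic_uint` stands in for the GLSL atomic counter and the plain float array stands in for the read/write image.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical CPU stand-in for an append buffer. */
typedef struct {
    atomic_uint counter;   /* plays the role of the GLSL atomic counter */
    float      *slots;     /* plays the role of the read/write image */
    unsigned    capacity;
} AppendBuffer;

void append_init(AppendBuffer *b, float *storage, unsigned capacity)
{
    assert(b != NULL && storage != NULL);
    atomic_init(&b->counter, 0u);
    b->slots = storage;
    b->capacity = capacity;
}

/* Reserve the next free slot atomically, then write into it.
 * Returns 1 on success and 0 if the buffer is already full.
 * Callers never wait for each other: the only shared operation is the
 * fetch-and-add, so the output order is whatever order the adds happen in. */
int append(AppendBuffer *b, float value)
{
    unsigned slot = atomic_fetch_add(&b->counter, 1u);
    if (slot >= b->capacity)
        return 0;
    b->slots[slot] = value;
    return 1;
}

/* Number of valid entries; clamped because failed appends still bump the counter. */
unsigned append_count(AppendBuffer *b)
{
    unsigned n = atomic_load(&b->counter);
    return n < b->capacity ? n : b->capacity;
}
```

Nothing here preserves the input order, which is exactly the freedom the geometry shader path does not have.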
I planned to update my Nature and Mountains demo to use this new technique as well, but I have been quite busy lately, and GL 4.2 drivers are not mature enough yet, so I thought there was no need to hurry.
Yes, I’m pretty curious about the performance (writing to an image with synchronization doesn’t sound very GPU friendly, guess I’ll have to bench to find out). I’ll also be looking forward to seeing your updated demo on your blog, thanks for sharing the algorithm!
The whole point of this method is that there is no synchronization. Everything is done by the GPU thus no need to stall the pipeline as it is done in case you query the amount of primitives written during transform feedback.
No need for synchronization amongst the GPU variables, it is ensured by the fact that OpenGL performs the operations one after the other on the server side.
Each vertex is processed in parallel in a vertex shader, right? If you have, say, an atomic_counter in a shader, and you write to it, there must be some sort of resource locking management (confirmed by things such as the ‘coherent’ keyword, or memoryBarrier() in GLSL), unless there’s some sort of magic operating in GPUs which allows multiple threads to write to a shared variable and get a coherent result. This is why I’m curious to see how the shaders will perform with such things. The CPU is NOT involved in any of this, I know ;).
Yes, there is resource locking, however, there is dedicated hardware for handling coherency on multiple levels (SIMD core wide, core group wide and device-wide).
Actually geometry shaders and transform feedback are much worse from this point of view. In a similar fashion, there is a buffer that is accessed by all shader instances, and there is also an atomic counter, as the shaders have to know where to store the next item. On top of that, there is a need for logic that ensures that the ordering of the output primitives matches the ordering of the input primitives.
I don’t really see why you think it would be any more synchronization overhead compared to transform feedback…
I was talking about hardware, not APIs (or wherever you got that from…)
No, it is not supported by hardware. Atomic counters and in general any programmable atomic operations came with DX11 hardware (Radeon HD5000 series and GeForce 400 series).
Previous hardware did have support for read/write buffers (without atomic ops) like the Radeon HD3000 and HD4000 series, and there were some hard-wired counters (like those of occlusion queries and of transform feedback), but you did not have programmable atomic counters in previous hardware! It is not just the lack of API support.
As I said, you think in terms of API features, again.
I would implement it in the following way:
Store the query result containing how many primitives have been written, into a buffer object. (that is usually done when a query is ended anyway)
We might need to process the query result in the buffer in case the hardware stored it in different units or there is more than one result (e.g. one query result per engine, other data for other kinds of queries intermixed, etc.). I would use a compute shader to get the number of primitives written and store it into another buffer.
There are 2 ways to implement the draw command:
a) We can copy the final value from the buffer into the state register that should contain the number of instances to render, assuming the GPU has such a register. Then just do what we would do in glDraw{Arrays,Elements}, but don’t set the number of instances to 1. This is the easy way.
b) If we can’t do (a), we have to create the hardware command for the draw call, setting the number of instances to our computed value, and storing the command into another buffer object using a compute shader. (Yes, generating commands on the GPU is possible) Then you just ‘execute’ that buffer.
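A CPU-side sketch of the resolve step and option (b) might look like this. Everything in it is an assumption for illustration: real hardware commands are vendor-specific, so the `DrawCommand` struct simply borrows the ARB_draw_indirect field layout as a stand-in, and summing one partial result per engine is just one possible reading of "more than one result". The names `resolve_query` and `write_draw_command` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a hardware draw command; field layout borrowed from
 * ARB_draw_indirect purely for illustration. */
typedef struct {
    uint32_t count;
    uint32_t primCount;
    uint32_t first;
    uint32_t reservedMustBeZero;
} DrawCommand;

/* Hypothetical resolve step: assume each engine stored a partial
 * "primitives written" result and the final value is their sum. */
uint32_t resolve_query(const uint32_t *per_engine, size_t engines)
{
    assert(per_engine != NULL || engines == 0);
    uint32_t total = 0;
    for (size_t i = 0; i < engines; ++i)
        total += per_engine[i];
    return total;
}

/* Option (b): build the draw command with the computed instance count and
 * store it in a buffer that would later be "executed". On the GPU this
 * would be a single compute shader invocation. */
void write_draw_command(DrawCommand *cmd, uint32_t mesh_vertex_count,
                        uint32_t instances)
{
    assert(cmd != NULL);
    cmd->count = mesh_vertex_count;
    cmd->primCount = instances;
    cmd->first = 0;
    cmd->reservedMustBeZero = 0;
}
```

Option (a) would skip `write_draw_command` entirely and copy the resolved value straight into the instance count register.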