Query buffer object

Not sure if this has been suggested before, but:

Query buffer object - a buffer object that asynchronously stores the results of a given type of query carried out once per primitive or a given number (divisor) of primitives on a single rendering call.

I’m suggesting this mostly with occlusion queries in mind, though timer and sync queries might be useful too…

Pseudo-code:

CreateMultiQueryBufferObject(MQBO,OcclusionQuery)
BeginMultiQuery(MQBO,1)
DrawSomePrimitives()
EndMultiQuery()

Query buffer object - a buffer object that asynchronously stores the results of a given type of query carried out once per primitive or a given number (divisor) of primitives on a single rendering call.

Why do you need a new object type for this? Query objects are effectively asynchronous.

Also, what does this mean?

What I’m suggesting is not the asynchronous-ness, but the multiple queries in one draw call, stored into a buffer object.

Just as an example with numbers, say I want to draw ten thousand triangles, and have an occlusion query for every hundred of them, without needing to start and end a new query for each group of a hundred triangles. So I’d have a 100 unit buffer object, and the GL would take the divisor of 100 that I would give it, and start a new query for each group, and then store the results consecutively in the buffer object.

As another (possibly contrived) example, if I wanted to find out the screen space area in pixels of multiple triangles, I’d have a buffer object the same length as the list of triangles, I’d set the depth and stencil to always pass, the colour buffer not to update, etc and render with a divisor of 1. Then, each slot in the query buffer object would give me the number of (onscreen) pixels covered by a given triangle, even if some of them overlapped.
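To make the intended semantics of the two examples concrete, here is a small CPU-side model in Python (not GL code). It assumes the suggested multi-query would sum the samples passed over each consecutive group of `divisor` primitives and store the sums consecutively. The name `multi_query_results` and the per-triangle sample counts are made up for illustration.

```python
def multi_query_results(samples_per_primitive, divisor):
    """Model of the proposed buffer contents: one samples-passed total
    per consecutive group of `divisor` primitives (hypothetical semantics)."""
    if divisor < 1:
        raise ValueError("divisor must be at least 1")
    results = []
    for start in range(0, len(samples_per_primitive), divisor):
        group = samples_per_primitive[start:start + divisor]
        results.append(sum(group))
    return results


# Ten thousand triangles, each assumed to cover 3 samples, one query per 100.
per_triangle = [3] * 10000
grouped = multi_query_results(per_triangle, 100)   # 100 slots in the buffer

# Divisor of 1: one slot per triangle, as in the screen-space-area example.
areas = multi_query_results([5, 0, 12], 1)
```

A last group shorter than the divisor simply gets its own slot in this model; what a real implementation would do there is an open question.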

Whether this is possible, and how practical or efficient it might be is debatable, of course, and I suspect that just emulating it in the driver in a loop that does BeginQuery, Draw [Divisor] Primitives, EndQuery, would be pointless; I could just do that myself…

Also, would a buffer object not allow you to avoid routing the results back through the CPU if you wanted to do something more than draw/not draw based on the result of a query? Or am I wrong in thinking that you can’t use the numerical result of a query without bringing it back to the CPU?

Just as an example with numbers, say I want to draw ten thousand triangles, and have an occlusion query for every hundred of them, without needing to start and end a new query for each group of a hundred triangles. So I’d have a 100 unit buffer object, and the GL would take the divisor of 100 that I would give it, and start a new query for each group, and then store the results consecutively in the buffer object.

So, you call glDrawElements(…, 10000, …, 0); Or whatever. And you want OpenGL to automatically break this up into 100 triangle groups, even though this would require breaking the primitive. Which OpenGL would otherwise never do.

I don’t see a way for this to be reasonably performant, even if it were a good idea.

It’s a nice idea and it neatly resolves one of the drawbacks with queries (having to break batches to begin/end each query) but if it’s not supported in hardware (and being supported by an API is not the same as being supported in hardware) then it’s of very limited real world value.

I’ve found myself wanting this. Maybe it would make more sense to use instancing, and have a separate query result per instance.

I think it would be better to call it multiple counters per query instead of multiple queries.

IMHO the best way would be to define a new fragment shader output like gl_QueryCounterIndex that specifies which of the counters of the query object is updated.

That way you could define counters not only per instance or primitive ID, but also per screen coords, pixel depth, pixel color, etc.

You can use the GL_EXT_shader_image_load_store extension to implement this with an array of atomic counters in a buffer object, but I don’t know if this is really performant. Atomic operations usually don’t scale very well.
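A rough CPU-side model of the "multiple counters per query" idea, in Python rather than GLSL: each fragment picks the counter it increments, which is the role the suggested gl_QueryCounterIndex output (a proposed name from this thread, not an existing built-in) or an array of atomic counters would play. The fragment records and `count_fragments` are invented for illustration.

```python
def count_fragments(fragments, counter_index, num_counters):
    """Increment one counter per fragment; `counter_index` stands in for
    the fragment shader choosing which counter to update."""
    counters = [0] * num_counters
    for frag in fragments:
        idx = counter_index(frag)
        if 0 <= idx < num_counters:   # ignore out-of-range indices
            counters[idx] += 1
    return counters


# Made-up fragments: screen position plus the instance that produced them.
fragments = [
    {"x": 0, "y": 0, "instance": 0},
    {"x": 1, "y": 0, "instance": 0},
    {"x": 5, "y": 2, "instance": 1},
    {"x": 6, "y": 3, "instance": 0},
]

# One counter per instance.
per_instance = count_fragments(fragments, lambda f: f["instance"], 2)

# One counter per 4-pixel-wide screen column: same mechanism, different key.
per_column = count_fragments(fragments, lambda f: f["x"] // 4, 2)
```

The sequential loop hides the contention that makes real atomic operations scale poorly, so this only shows the bookkeeping, not the cost.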

Having a query index is a nice idea too.

I don’t know if it would work for occlusion queries though - does the fragment shader have that information already available at the time of its execution?

My main doubt in suggesting the idea is whether the hardware permits the SamplesPassed register (and any other registers that support querying) to be read by user defined programs, or just the driver (CPU).

  1. Create a 4kB buffer, fill it with zeros. // int visible[1024];
  2. Bind it via NV_shader_buffer_store or EXT_shader_image_load_store
  3. Draw 12*1024 triangles, where each group of 12 describes a cube and shares a “flat int var_ObjectIdx;” (identical among the 12 tris), so basically you’ll be drawing/querying 1024 objects
  4. In the frag shader:

void main() {
    // Mark the object this fragment belongs to as visible.
    visible[var_ObjectIdx] = 1;
}

  5. Either read from the buffer to sysram (and get exactly what you’re currently asking for), or use a shader/transform-feedback to prepare data for ARB_draw_indirect
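The steps above can be modelled on the CPU as follows. This is a Python sketch, not shader code, and it rests on assumptions the post does not state: depth was already laid down by an earlier pass, depth writes are off while the test cubes are drawn, and only fragments that pass the depth test perform the store. `mark_visible` and the sample data are hypothetical.

```python
def mark_visible(fragments, depth_buffer, num_objects):
    """fragments: iterable of (x, y, depth, object_idx) tuples.
    depth_buffer: dict mapping (x, y) to the nearest occluder depth.
    Returns the zero-filled `visible` array after the draw."""
    visible = [0] * num_objects                    # zero-filled buffer
    for x, y, depth, object_idx in fragments:
        occluder = depth_buffer.get((x, y), 1.0)   # 1.0 = far plane
        if depth < occluder:                       # assumed early depth test
            visible[object_idx] = 1                # the frag shader's store
    return visible


depth_buffer = {(0, 0): 0.2, (1, 0): 0.9}   # made-up occluder depths
fragments = [
    (0, 0, 0.5, 0),   # object 0 is behind the occluder at this pixel
    (1, 0, 0.5, 1),   # object 1 is in front of the occluder at this pixel
    (2, 0, 0.5, 1),   # no occluder at this pixel
]
visible = mark_visible(fragments, depth_buffer, 3)

# Reading the buffer back gives the list of objects worth drawing.
visible_ids = [i for i, flag in enumerate(visible) if flag]
```

Because every store writes the same value, the order in which fragments of one object arrive does not matter, which is why plain stores rather than atomics are enough for a visible/not-visible flag.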

You can draw on a lower-res FBO, and optionally use 4x MSAA, so that the rotated-grid samples reduce the chance of missing objects of which only a thin line is visible.

If neither NV_shader_buffer_store nor EXT_shader_image_load_store is available, and you want to support the above codepath while also supporting older GPUs, replace 4) with:

4.1) The frag-shader outputs an int (instead of gl_FragColor) with a value of “var_ObjectIdx+1”.
4.2) Then render one point per pixel of the above int-framebuffer (count = width*height, i.e. around a million points) into a 32x32 px rendertarget (which has 1024 pixels, one per object). Each point’s vertex shader reads its pixel from the int-framebuffer texture and sets gl_Position to the coordinates that match that int-value, or to somewhere outside the viewport if the pixel had a value of 0, which means “no object there”. The points’ frag-shader outputs an int value of “1”.
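The fallback in 4.1 and 4.2 amounts to a scatter from the ID framebuffer into a tiny flag image, which can be modelled in Python as below. The row-major mapping from object index to pixel in the 32x32 target is my assumption (the post only says the coordinates "match" the int value), and `scatter_ids` is a hypothetical name.

```python
TARGET_W = 32   # 32x32 target: one pixel per object, 1024 objects


def scatter_ids(id_framebuffer):
    """id_framebuffer: 2D list of ints holding object index + 1, 0 = empty.
    Returns the 32x32 target as a 2D list with 1 where an object was seen."""
    target = [[0] * TARGET_W for _ in range(TARGET_W)]
    for row in id_framebuffer:
        for value in row:            # one "point" per source pixel
            if value == 0:
                continue             # point is moved outside the viewport
            idx = value - 1          # undo the +1 from step 4.1
            target[idx // TARGET_W][idx % TARGET_W] = 1
    return target


# Tiny made-up ID framebuffer: object 0 covers two pixels, object 33 one.
id_framebuffer = [
    [0, 1, 1],
    [0, 0, 34],
]
target = scatter_ids(id_framebuffer)
```

Storing index + 1 in step 4.1 is what lets a cleared framebuffer value of 0 mean "no object there" without colliding with object 0.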

Implemented my idea above:
multiqueries.png: http://dl.dropbox.com/u/1969613/openglForum/multiqueries.png

AWESOME!

I was researching lately various techniques to do batched visibility determination and culling on the GPU. So far I was concentrating on geometry shader based techniques (or any other shader based techniques if EXT_shader_image_load_store is available) that do view frustum culling and Hi-Z map based occlusion culling on the GPU.

This MultiQuery technique would be a great addition to the list of GPU based culling algorithms and it even looks pretty efficient. Many thanks!

Slick!