NV_OCCLUSION_QUERY extension for occlusion culling

Hi,

I want to use the NV_OCCLUSION_QUERY extension for occlusion culling in an octree (with front-to-back rendering).
This is the source code for rendering the AABB containing the triangles, and testing if they are visible.

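/* Draw the cell's bounding box with colour and depth writes off,
   counting the pixels that pass the depth test. */
glDisable(GL_CULL_FACE);   /* so the box still counts when the camera is inside it */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);
glBeginOcclusionQueryNV(cull_box2);
glPushMatrix();
glTranslatef(bb.mid[0], bb.mid[1], bb.mid[2]);
glScalef(bb.len[0], bb.len[1], bb.len[2]);
glutSolidCube(1.0);
glPopMatrix();
glEndOcclusionQueryNV();

/* Restore state for normal rendering. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);
glEnable(GL_CULL_FACE);

/* Read the result straight away. */
GLuint pixelCount = 0;
glGetOcclusionQueryuivNV(cull_box2, GL_PIXEL_COUNT_NV, &pixelCount);

if (pixelCount > pixel_threshold) {
    /* render the triangles of this AABB with its display list */
}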

The problem is that the code is slower with occlusion culling than without it (just rendering with the display list), even though far fewer triangles are rendered with occlusion culling. It slows down a lot, especially as the depth of the octree grows (>4). I guess this is because of the begin/end occlusion query calls; I used some timing queries to find out where the bottleneck was. Note that with a depth of 2 or 3 (not too many occlusion queries) there is a significant win over rendering without occlusion culling, e.g. when the camera is inside a model, but with larger octree depths the overhead of the occlusion queries and the octree seems to dominate, and the frame rate drops.

The NVIDIA slides (http://www.nvidia.com/dev_content/gdc2002/GDC2002_occlusion_files/frame.htm) contain this note: "Do other CPU computation while queries are being made".

Should I use a separate thread to perform the occlusion queries, or do the occlusion queries already run in a separate thread?
Is it normal that the queries become a bottleneck when used frequently? With a large octree depth, rendering the octree without culling is even faster than with culling (with queries). With depth 5 there are thousands to tens of thousands of occlusion queries. Can this be a bottleneck?

How can I fix this problem?

TIA, and merry Christmas

Roel Martens

I’m not sure it was intended for octrees. The occlusion query isn’t that fast, as you have found out. It’s useful for testing whether a single high-poly object is hidden from view, for example a 5k vehicle model hidden by the terrain.

Why are you disabling cull face? Don’t you only need to render the front-facing sides of the occlusion cube?

I don’t think you need to set up another thread. Between your glEndOcclusionQueryNV() and glGetOcclusionQueryuivNV() calls you can do other work while the occlusion query is running, until you actually need the result.

There’s more info here: http://www.nvidia.com/dev_content/nvopenglspecs/GL_NV_occlusion_query.txt

[This message has been edited by Adrian (edited 12-24-2002).]

Cross posted! http://www.flipcode.com/cgi-bin/msg.cgi?showThread=00006887&forum=3dtheory&id=-1

I disable face culling because otherwise the pixel count is 0 when my camera is inside a bounding box (which obviously has to be rendered). For the actual rendering I enable face culling again.
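One possible alternative to disabling culling is to check on the CPU whether the camera is inside the box and, if it is, treat the cell as visible without issuing a query at all. The sketch below assumes bb.mid is the box centre and bb.len the full edge length per axis (which is what the glTranslatef/glScalef calls above suggest). The function name and the margin parameter are made up for illustration; the margin is meant to cover the case where the near plane clips the front faces of a box the camera is only just outside of.

```c
#include <assert.h>

/* Returns 1 if eye lies inside the box grown by margin on every side,
   0 otherwise. mid = box centre, len = full edge lengths (assumption). */
static int camera_inside_box(const float mid[3], const float len[3],
                             const float eye[3], float margin)
{
    int axis;
    assert(margin >= 0.0f);
    for (axis = 0; axis < 3; ++axis) {
        float half = 0.5f * len[axis] + margin;  /* half extent on this axis */
        float d = eye[axis] - mid[axis];         /* offset from the centre */
        if (d > half || d < -half)
            return 0;                            /* outside on this axis */
    }
    return 1;
}
```

If this returns 1 the cell can be rendered directly; otherwise the box can be drawn with back-face culling left on, since at least one front face is then in front of the camera.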

The problem is that I can’t do much while the queries are running: rendering one bounding-box cell can affect the visibility of other cells, so I can’t issue queries for the other cells in the meantime.

I hope the query was also intended for such purposes; it’s supported in hardware (GF3, GF4, …), so it should be fast?!

About the cross-post: since it isn’t cross-posted within the OpenGL forums, and because I think different people visit these sites, I don’t consider it that bad.

[This message has been edited by Sigmund (edited 12-24-2002).]

When you test a query, you are serializing the pipeline up to the point where the query was ended. If you test it right after you finish (which you’re doing) then you’re basically starting and stopping the pipeline all the time, leading to much lower throughput.

The best use of the occlusion query extension is to begin/end it around certain geometry (say, a mesh) during one frame. If the mesh draws 0 pixels during that frame, then during the next frame you draw a much simpler shape, such as a single billboarded quad, about where the mesh is supposed to be, but in a way that doesn’t touch the framebuffer (say, with framebuffer writes turned off). Once that quad passes, you render the mesh again the next frame.

Note that you run the query one frame, but TEST the query the next frame. This means you won’t be stalled waiting for GPU results. Actually, you may still stall if you’re running more than one frame ahead, but that’s OK.
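The per-object bookkeeping for this scheme could look roughly like the sketch below. The type and function names are hypothetical, and the GL calls are only indicated in comments: each frame the caller wraps whatever it draws (mesh or proxy) in glBeginOcclusionQueryNV/glEndOcclusionQueryNV and reads GL_PIXEL_COUNT_NV at the start of the following frame.

```c
#include <assert.h>

typedef enum { DRAW_MESH, DRAW_PROXY } DrawMode;

/* Per-object state carried from one frame to the next (hypothetical). */
typedef struct {
    int      has_result;   /* 0 until the first query has been read back */
    unsigned pixel_count;  /* result of the query issued last frame */
} QueryState;

/* Decide what to draw this frame from last frame's query result. */
static DrawMode choose_draw_mode(const QueryState *s)
{
    assert(s != 0);
    if (!s->has_result || s->pixel_count > 0)
        return DRAW_MESH;   /* unknown or visible: draw the real mesh */
    return DRAW_PROXY;      /* hidden: draw the cheap proxy, writes off */
}

/* Store the count read back at the start of the next frame, e.g. from
   glGetOcclusionQueryuivNV(id, GL_PIXEL_COUNT_NV, &count). */
static void store_result(QueryState *s, unsigned pixel_count)
{
    assert(s != 0);
    s->has_result = 1;
    s->pixel_count = pixel_count;
}
```

Because the decision lags one frame behind, an object that becomes visible shows up one frame late; that is the price for not waiting on the GPU.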

Frustum culling is certainly not a good use of occlusion queries. Do that yourself; math is wonderful.

That’s ok if stalling is the problem.

But what if glBeginOcclusionQuery and glEndOcclusionQuery eat up CPU time? This happened in one of our implementations as well… Adrian, did you time these calls separately?

Michael

I would suppose GetOcclusionQuery would eat CPU time, because it might busy-wait on the CPU until the query is completed. I.e., you’re not only stalling the GPU, but also stalling the CPU until the GPU catches up to that point in the stream.

Of course, I know nothing of the implementation; perhaps there’s only a FIFO of some fixed depth for these queries, and queueing more queries will implicitly wait for some previous query. That would be just one of many reasons why running a new query might stall, even without getting the result.

Originally posted by wimmer:
Adrian, did you time these calls separately?

No, I haven’t timed them. I was just saying how I understood it to work.

The queries are useful as an extension to the normal frustum/occlusion culling if you draw in multiple passes. You could check whether an object is visible in the first pass and skip the following passes if the pixel count for that object is zero.
Another use could be to see whether an object is in the shadow of another, so that you don’t need to draw the shadow volume and light pass for that object. This extension could be a replacement for the beam trees typically used for this problem.

Occlusion queries can be slower than just rendering the full geometry blindly. There is extra work involved in setting up and rasterizing the bounding box. The payback is that you save transformation and rasterization of the bounded geometry if the bounding box is “invisible”. There are a few things to worry about in the tradeoff:

(1) Performance loss in bounding box rendering. I see that you have culling turned off. Keep in mind that if you have any fancy texturing modes enabled, they will be applied to the bounding box, which could slow down bounding box rendering.

(2) Synchronization overhead: Waiting for your bounding box results will keep you from doing other work. It will also effectively idle the graphics engine – it won’t have any work between the time where it’s done rendering the bounding box and the time the next commands you generate hit the hardware. The ideal situation is to render the bounding box, do other stuff (graphics or necessary computation), then come back and check the result (which is hopefully ready).

(3) How often is the geometry fully occluded? This is an easy point to forget – if the bounding box isn’t fully occluded, you get absolutely no benefit from using the extension if you’re only using it to conditionally render complex geometry.
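The tradeoff in the points above can be put into a very rough cost model. This is only a back-of-the-envelope sketch: the function names are made up, the costs are whatever you measure (in any one consistent unit), and box_cost is assumed to include the query overhead and any time lost waiting for the result.

```c
#include <assert.h>

/* Expected cost per frame when the mesh is guarded by a query:
   the box is always drawn, the mesh only when the box is visible.
   p_occluded = fraction of frames in which the box is fully occluded. */
static double expected_cost_with_query(double box_cost, double mesh_cost,
                                       double p_occluded)
{
    assert(p_occluded >= 0.0 && p_occluded <= 1.0);
    return box_cost + (1.0 - p_occluded) * mesh_cost;
}

/* The query pays off only if it beats drawing the mesh unconditionally,
   i.e. if box_cost < p_occluded * mesh_cost. */
static int query_pays_off(double box_cost, double mesh_cost,
                          double p_occluded)
{
    return expected_cost_with_query(box_cost, mesh_cost, p_occluded)
           < mesh_cost;
}
```

For example, with a box that costs 1 unit and a mesh that costs 4, the query only wins if the box is occluded in more than a quarter of the frames. Deep octree cells hold little geometry, so mesh_cost shrinks while box_cost stays about the same.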

If you’re going to do multiple dependent bounding boxes without much in between, you’ll keep running into the idling issue (2) above.

I haven’t had anything to do with the implementation in our driver, so I’m no expert. In either a busy-wait or a wait-for-event scenario, you won’t get the CPU back until the query on the GPU is complete. The PIXEL_COUNT_AVAILABLE_NV query is there to let you check whether a result is ready before you do a potentially blocking query of PIXEL_COUNT_NV.

Note that the occlusion query results don’t have to be 100% “perfect” to be effective. For example, you might draw a bunch of static objects, then some dynamic ones, and then the complex and possibly occluded geometry that you only want to draw if the occlusion query says to. In theory, the dynamic objects might occlude your bounding box, so you might want to wait until after drawing the dynamic geometry to render the bounding boxes. But it may be more effective to draw the bounding boxes right after the static geometry, since the results are more likely to be ready by the time you need them.

I have written a quick and dirty ‘proposal’ for a couple of extensions that could solve parallelism problems like this one, and that offer a more general and elegant solution (in my opinion, of course). They are just thoughts, but I would like to know what you guys think about them. Any comments?
(I hope you can understand them in my Spanglish.)

I have posted it in ‘Suggestions for the next release of OpenGL’ (but I would like to see something similar in current OpenGL):
www.opengl.org/discussion_boards/ubb/Forum7/HTML/000343.html

In most realistic app scenarios, I’d expect there to be little or no synchronization overhead: if you issue more than a few queries, the HW probably won’t go idle.

  • Matt