Occlusion Culling with FBO+PBO+glReadPixels_async

Hello friends, I’ve been developing a graphics engine for more than a year now, and I’ve tested quite a few ways of doing occlusion culling.

Testing environment:

- nVidia Quadro FX 360M (notebook)
- Intel Centrino DualCore 2.5GHz, 4GB RAM
- only frustum culling for objects outside the frustum (6 planes, on the CPU)
- no octrees or other hierarchical ordering
- no physics or other stuff… just simple rendering using VBOs and GLSL 3.30 shaders
- tests done with the same initial camera position
- 50 objects (40,000 vertices, 16MB of textures loaded)
- double buffering active and VSync disabled (using the SwapBuffers function)

Using the ARB_occlusion_query extension

First pass:

1. sort objects front to back (Z-depth was retrieved during frustum culling)
2. disable glColorMask(…)
3. for each 3D object:
   1. glBeginQuery(…)
   2. glDraw() the object
   3. glEndQuery(…)
   4. glGetQueryObject() the result
   5. if (visiblePixels < threshold) set the object’s variable “occluded = true”
4. enable glColorMask(…)

Second pass:

1. for each 3D object:
   1. if (occluded == true) skip the object
   2. glDraw() the object

With ARB_occlusion_query I get 120fps.
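To make this concrete, here is a minimal C sketch of that first pass; objects[], queryID, occluded, drawObject() and THRESHOLD are placeholder names of mine, not my engine’s real API. Note that the result is fetched immediately after each query, inside the loop:

/* First pass (sketch): one GL_SAMPLES_PASSED query per object,
   with the result read back immediately, which stalls the pipeline. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);  /* depth writes stay ON */
for (int i = 0; i < objectCount; ++i) {
    glBeginQuery(GL_SAMPLES_PASSED, objects[i].queryID);
    drawObject(&objects[i]);
    glEndQuery(GL_SAMPLES_PASSED);

    GLuint visiblePixels = 0;
    glGetQueryObjectuiv(objects[i].queryID, GL_QUERY_RESULT, &visiblePixels);
    objects[i].occluded = (visiblePixels < THRESHOLD);
}
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);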

Using glReadPixels (with the same technique as color picking):
Pre-phase: I was already using a color picking technique for mouse selection, so I tried to use the same technique for occlusion. The color picking technique renders all objects in a single pass, each object with a different color. This way, if one pixel is “158,0,0,0” (B,G,R,A format during glReadPixels), I’m sure at least one pixel of that object (ID=158) is visible, so I can draw it during rendering.
Tricks & tips: the trick is to use a FrameBufferObject (for off-screen rendering) and two PBOs (for async glReadPixels readback) to get the pixels, but to give all those buffers a different SIZE (width*height) than the real rendering pass.
Example: the scene is rendered at 1900x1200 (as on my notebook), but the FBO and PBOs are only 20% of that resolution.
Concept: during occlusion testing I’m not really interested in knowing whether ONE pixel of an object is visible; I’m interested in whether a few pixels are visible. So it’s useless to render at the same resolution as the final image (because of the fragment shader running over more pixels/fragments, and the glReadPixels readback over millions of pixels). The solution is to use a proportionally reduced resolution for the FBO and PBOs and render to that smaller size. Less precision, but very good speed!
First pass:

1. create a FrameBufferObject (FBO) for off-screen rendering, smaller than the rendering viewport
2. create two PBOs (one for the color_attachment0 and one for the depth_attachment component) with the same size as the FBO
3. enable rendering (color and depth) to our FBO
4. glClear(…color…depth…)
5. change the current viewport size to fit the FBO’s dimensions
6. render the objects as usual (or just in bounding-box mode for a super fast pass)
7. execute glReadPixels() to get an array of B,G,R,A values from our FBO
8. parse the array (only a few thousand pixels at this reduced resolution) and check the BGRA values to identify object_IDs
9. if a valid object_ID is found, set that object’s variable “occluded = false” (every object starts the frame with “occluded = true”)
10. disable rendering to the FBO and return to the window’s back buffer
11. change the current viewport size back to its previous values (full render size)

Second pass:

1. for each 3D object:
   1. if (occluded == true) skip the object
   2. glDraw() the object

With this procedure I get 144fps.

This test demonstrates that ARB_occlusion_query can be worse than other half-hardware methods.
With the second technique I don’t need to sort objects front to back, and that’s not bad at all!
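For those wondering how the async readback is wired up, below is a minimal sketch of the reduced-resolution pass with two ping-ponged PBOs. All the names (pickFBO, pbo[], renderSceneWithPickingShader(), markVisibleObjects(), w/h, windowW/windowH) are placeholders, and I assume the PBOs were created once with glBufferData(GL_PIXEL_PACK_BUFFER, w*h*4, NULL, GL_STREAM_READ):

/* Reduced-resolution picking pass: render object IDs as flat colors
   into pickFBO (w,h = ~20% of the window), then read back async.   */
static unsigned frame = 0;            /* pbo[0]/pbo[1] alternate per frame */
glBindFramebuffer(GL_FRAMEBUFFER, pickFBO);
glViewport(0, 0, w, h);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
renderSceneWithPickingShader();       /* one flat color per object ID */

/* Start copying this frame's pixels into one PBO: returns immediately. */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[frame % 2]);
glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, 0);

/* Map the OTHER PBO: it holds the pixels requested last frame, so the
   transfer has (usually) finished and mapping doesn't stall the GPU.  */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[(frame + 1) % 2]);
GLubyte *pixels = (GLubyte *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (pixels) {
    markVisibleObjects(pixels, w * h);  /* parse BGRA values -> object IDs */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
glBindFramebuffer(GL_FRAMEBUFFER, 0);
glViewport(0, 0, windowW, windowH);
frame++;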

Hope this helps someone with their occlusion or color picking procedures ;)

Most interesting; what did you find as an acceptable buffer size for the reduced rendering?

I’m not surprised your other way was faster. You killed performance by immediately getting the query result for each object, rather than using conditional rendering. The glReadPixels way only has one GPU stall instead of many.

Also, generally speaking, occlusion culling isn’t used that way. You render your terrain, always. Then, you occlusion cull each object against that, not against each other.

With occlusion queries I was using 10 pixels as the threshold, to be sure a sufficient part of the model is visible before rendering it, so I reduced the “buffer” size (the FBO’s size) to 1/5 (20%). This is somewhat like creating a full-size FBO/PBO and checking every 5th pixel (well, not 100% the same, since it depends on which glTexParameter() settings you use), but the delay to “download” millions of pixels from VRAM to CPU/RAM is reduced thanks to the smaller PBO!

Thanks for pointing this approach out to me; I didn’t know about it :)
I’ve just read up on it and it seems a good way to do occlusion culling!
Anyhow, with your solution I’d still need a depth pass if I wanted to know whether Object_1 is behind or in front of Object_2. With my method you already have a (reduced) depth PBO attached to its parent FBO.

Second: the many function calls of ARB_occlusion_query (BeginQuery…EndQuery) force you to create a separate render procedure for the occlusion check, with no way to reuse the common one. With my approach the rendering stays exactly the same (obviously with different shaders during the culling pass). PS: many function calls also add more overhead than one simple draw call.

Third: with my approach you can free the main (rendering) thread from the culling calculations, because you can hand the pixel array retrieved by glReadPixels to another thread and use its results a few frames later to decide which objects to hide. See the sketch below.
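A minimal sketch of that idea, assuming C with pthreads; the CullJob type, the one-byte object ID in the B channel (as in my “158,0,0,0” example above), and the occluded[] flag array are simplifications of mine:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *pixels;    /* private copy of the BGRA readback            */
    int            count;     /* number of pixels in the copy                 */
    unsigned char *occluded;  /* one flag per object ID, reset to 1 each frame */
} CullJob;

static void *cullWorker(void *arg) {
    CullJob *job = (CullJob *)arg;
    for (int i = 0; i < job->count; ++i) {
        unsigned id = job->pixels[i * 4];   /* B channel carries the object ID */
        job->occluded[id] = 0;              /* any ID we see is visible        */
    }
    free(job->pixels);
    free(job);
    return NULL;
}

/* On the render thread, after mapping the PBO: copy the pixels out,
   unmap immediately, and let the worker parse them in parallel.     */
void startCullJob(const unsigned char *mapped, int count, unsigned char *occluded) {
    CullJob *job = malloc(sizeof *job);
    job->pixels = malloc(count * 4);
    memcpy(job->pixels, mapped, count * 4);
    job->count = count;
    job->occluded = occluded;
    pthread_t t;
    pthread_create(&t, NULL, cullWorker, job);
    pthread_detach(t);                      /* results get used a few frames later */
}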

Fourth: with the FBO+PBO technique you can know whether an object is near the edges of the viewport (and know WHICH edge!), or in which part/region of the screen it will actually be rendered. (In most perspective cases, objects rendered at the “top” should be farther away than those at the “bottom”.)

Fifth: while parsing the pixel array you can check which pixel the mouse is over and change the cursor according to the BGRA “picking” value. This way a raycast is unnecessary. I like this way of changing the cursor shape based on the mouse position.

Sixth: no need to order objects from front to back

However, I’m curious to test your solution ASAP! :D

(sorry for my English :P)

Anyhow, with your solution I’d still need a depth pass if I wanted to know whether Object_1 is behind or in front of Object_2.

Is that useful information?

Also, you seem to forget that your method is imperfect. It is not sample-accurate, since you’re rendering at reduced resolutions. So there will be times when the user will see an object pop into being. The conditional rendering method is always sample-accurate.

the many function calls of ARB_occlusion_query (BeginQuery…EndQuery) force you to create a separate render procedure for the occlusion check, with no way to reuse the common one. With my approach the rendering stays exactly the same (obviously with different shaders during the culling pass). PS: many function calls also add more overhead than one simple draw call.

I don’t know what a “Render process” is, but unless you actually profiled it, you have no way of knowing if the “overhead” is significant enough to be worth considering, compared to the savings of a readback, not to mention the GPU/CPU synchronization, cache pollution due to the need to render each object twice, etc.

with my approach you can free the main (rendering) thread from the culling calculations, because you can hand the pixel array retrieved by glReadPixels to another thread and use its results a few frames later to decide which objects to hide.

You still had to provoke a GPU/CPU synchronization to read the data back.

with the FBO+PBO technique you can know whether an object is near the edges of the viewport (and know WHICH edge!), or in which part/region of the screen it will actually be rendered. (In most perspective cases, objects rendered at the “top” should be farther away than those at the “bottom”.)

You can know that by doing some basic work on the CPU too. Indeed, I imagine such computations will generally be cheaper than doing a GPU/CPU synchronization.

while parsing the pixel array you can check which pixel the mouse is over and change the cursor according to the BGRA “picking” value. This way a raycast is unnecessary. I like this way of changing the cursor shape based on the mouse position.

And some people aren’t. You never stated that this was important.

no need to order objects from front to back

The method I outlined didn’t require that either.

LevelOfDetails and all similar things…

Nope, I don’t forget that.
Previously, with ARB_occlusion_query, I used a threshold of 10 fragments to determine when an object is visible enough to be rendered (to avoid fully rendering very distant objects), so using a 1/5 render size to “merge” pixels 5-by-5 is another way to achieve something similar. If at full size there are more than 5 neighboring pixels with the same BGRA value, then in the reduced FBO (20% of full size) I’m quite sure at least 1 pixel contains that BGRA value. I must still parse ALL the pixels (or use a heuristic algorithm), but thanks to the reduced pixel count I can discard too-small objects at the end without problems.

My renderer is a procedure with the object loop inside it, something like: “for each Object do Render();”.
To execute occlusion queries (even with conditional rendering) I need to create another loop with “BeginQuery()…EndQuery()” at the begin/end of EACH iteration, so I cannot reuse the same main render loop procedure; I have to maintain the code twice (or use many IF statements inside the shared function).

Queries need:

- a front-to-back sorting algorithm. Sorting objects further is difficult, because I also want to sort by shader or texture to avoid too many OGL/shader function calls
- a different render procedure (as I wrote a few rows above), with the problem of keeping 2 similar but different functions in sync, or of using many IF statements to skip the queries and their results during the final Render() call
- asking the GPU something extra for every object: “how many fragments are visible for object X?”. Conditional rendering could solve this because it stops at the first fragment, but then I’m unable to know how large a portion of the object is visible (I’m not interested in rendering a 100,000-vertex object if only 5 pixels of it are visible at a very far distance / Z-depth)
- ARB_occlusion_query is limited in number. On my video card I can run only 32 queries at the same time, and every loop cycle (or at a different rate if I want to change it) I have to interrogate the GPU to try to get query results

The glReadPixels() readback is executed asynchronously with 2 PBOs, at frame N+1.

Of course that’s true, but to do it you need to execute additional calculations just for that purpose.
With my solution you already have the data while reading the pixel array, without doing anything more.

But this is a “built-in”, nearly CPU-free extra feature that costs nothing more: just fetch the pixel value with “array[((mouseY*width)+mouseX)*RGBA_size]”, where “RGBA_size = 4” if I requested all 4 components from the resized FBO/PBO. See the sketch below.
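One detail the formula glosses over: since the FBO is smaller than the window, the mouse coordinates have to be scaled down (and the row flipped, since GL counts rows from the bottom). A small sketch, where scale, w, windowH and setCursorForObject() are my assumptions:

/* Cursor picking straight from the reduced readback (sketch). */
int px = (int)(mouseX * scale);                   /* scale = fboW / windowW, e.g. 0.2 */
int py = (int)((windowH - 1 - mouseY) * scale);   /* GL rows start at the bottom      */
unsigned char *p = &pixels[((py * w) + px) * 4];  /* RGBA_size == 4                   */
unsigned objectID = p[0];                         /* B channel carries the object ID  */
setCursorForObject(objectID);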

In another discussion (2009) you said something different: Conditional rendering - OpenGL: Advanced Coding - Khronos Forums
It is impossible to use occlusion queries, or conditional rendering, without rendering objects in the right order. For example, if I sort back to front, ALL in-frustum objects will pass the occlusion/conditional rendering test with “visible fragments >= 1”.

PS: using FBO+PBO+glReadPixels and some additional technique, I could also determine whether an object is visible behind glass (not fully transparent, but a bit opaque, like in a bathroom), thanks to the BGRA multiplication between the pixels of overlapped objects. (This is just an idea, because it needs a lot of additional work to merge textures and read back the merged BGRA values with the texture’s GL_MODULATE and its various alternatives.)

LevelOfDetails and all similar things…

Which is more easily done by simply testing the object’s distance from the camera.

a front-to-back sorting algorithm. Sorting objects further is difficult, because I also want to sort by shader or texture to avoid too many OGL/shader function calls

You don’t have to sort front to back. As stated, you can just render your terrain, then render the objects on the terrain in whatever order you want, doing occlusion tests in whatever order you want. Yes, non-terrain objects won’t necessarily occlude objects behind them, but that’s relatively minor compared to the number of objects occluded by terrain.

Depending on your camera angle and terrain construction, of course. If all you have is a generalized bag of “objects” (none of which could be called “terrain”), this doesn’t help. But generally speaking, if you have a generalized bag of objects, you’re probably not going to get much consistent performance gain from occlusion tests of any kind, whether queries or your method.

And consistent performance is generally better than spiky performance. You want to minimize the times when minor changes in position or orientation cause massive changes in how much stuff you render. That’s one of the reasons to do occlusion tests against terrain and not against other objects.

To execute occlusion queries (even with conditional rendering) I need to create another loop with “BeginQuery()…EndQuery()” at the begin/end of EACH iteration, so I cannot reuse the same main render loop procedure; I have to maintain the code twice (or use many IF statements inside the shared function).

In general, when doing occlusion queries, you don’t render the actual object to test the query. You render a test object, which is generally a cube or something. So it isn’t the same code.

Furthermore, the code is just rendering with one mesh vs. a different mesh. It shouldn’t be any different code-wise; the only difference is where it gets the mesh from.

(I’m not interested in rendering a 100,000-vertex object if only 5 pixels of it are visible at a very far distance / Z-depth)

Why would you want to render a 100,000-vertex object at a very far distance, regardless of how many pixels are shown? It’s not about the number of pixels; it’s about the fact that it is far away. You should be LOD’ing it down regardless of whether it’s partially occluded.

ARB_occlusion_query is limited in number. On my video card I can run only 32 queries at the same time, and every loop cycle (or at a different rate if I want to change it) I have to interrogate the GPU to try to get query results

The specification doesn’t allow that limitation anywhere, to my knowledge. What happens when you try to use more queries?

The glReadPixels() readback is executed asynchronously with 2 PBOs, at frame N+1.

Thus compounding the inaccuracies by also adding a frame latency; this is exactly what conditional rendering is intended to avoid. For me, the most important thing in any visibility culling algorithm is not having objects visibly pop into place. Your method doesn’t provide that.

But if that inaccuracy is fine for your needs, you can do it. I just don’t suggest it.

Yes, but each of those would have to be done in a separate function created specifically for that purpose.
Everything can be done with a single, specific function… I’m just saying that all of it can be “grouped” into one single technique, without creating extra pre/post loops. Reading the pixels back is already a requirement, so all the data that can be extracted from them comes almost “CPU/GPU-free”, without additional object loops, overhead, function calls, rendering, etc… Just one “simple” async glReadPixels() to have everything in hand, without doing anything else.

This solution is imperfect too, just like you said before about my FBO+PBO idea.
Considering only the terrain as the main occluder is not as precise as the reduced-resolution readback test…
Imagine standing in front of (and near the door of) a big house: I don’t want to render and process the invisible city behind the house.

Sometimes the bounding box is too big, even if only a few pixels are really visible.
However, I was never that interested in using bounding boxes, because I prefer the real bounding shape for extra precision, given the nature/greenery (trees, vegetation, rocks, etc.) target of my engine.

I have a “Render()” function which is the complete rendering step, and I cannot easily surround it with “BeginQuery…EndQuery…GetObject…” because the object loop is inside it. So I’d have to create another “RenderWithQueries()” function which uses those calls during its loop cycles… or use IF statements to enable those calls only when the pass is “test occlusion”. I don’t think that’s a very clean and manageable solution.
Maybe you use an “object.render()” approach, but that requires re-creating the main loop for every pass you add: another unclean solution, with perhaps yet more limitations.

It was just an example.
Example 2: if I’ve got a big 100,000-vertex object in the right corner, but only 5 of its pixels are shown, I’m not interested in rendering that object, given how little it adds to the scene. With your method it will be rendered.

My GPU can use 32 queries at the same time (each on a different VBO), and each additional attempt to add another one returns an OpenGL error, as expected. In this situation I need to call the result function (glGetQueryObjectuiv with GL_QUERY_RESULT_AVAILABLE) many, many times to get a result. But I can’t know HOW MANY times in advance… so I’d have to call it after every EndQuery(). And if the CPU is very fast (or the GPU very slow), the query queue quickly becomes full and an additional BeginQuery() returns an error. My notebook has a “slow” GPU, and I sometimes received this error before I manually created a queue of queries which blocks the current thread until the queries are done. To reproduce it, it’s sufficient to “fake render” a big model of a few million triangles and watch the queue. I know this is an extreme case, but it could happen on some old HW configurations.

I “suggest” it just to show another approach, for when absolute single-pixel precision is not strictly required.
Rendering a model that sits behind a million dense leaves, just because conditional rendering returns “true” for a single visible pixel, is not something I’m interested in handling with this technique… and I try to avoid that case.

My GPU can use 32 queries at the same time (each on a different VBO), and each additional attempt to add another one returns an OpenGL error, as expected.

As expected? There’s nothing in the specification that says that you cannot have more than some implementation-defined number of query objects at the same time. So you should not be getting an error. Also, there is no connection between buffer objects and queries.

What function is producing this error?

ARB_occlusion_query has GPU limits: you get the max number of simultaneous queries supported by the GPU with glGetQueryiv(GL_SAMPLES_PASSED, GL_QUERY_COUNTER_BITS, &maxCount).

Occlusion test (pseudo-code):


glGenQueries(maxCount, arrayIDs);        /* one ID per in-flight query      */
for (/* each object; x = index of a free query ID */) {
    glBeginQuery(GL_SAMPLES_PASSED, arrayIDs[x]);
    render();                            /* the object, or its bounding box */
    glEndQuery(GL_SAMPLES_PASSED);
}

arrayIDs is an array containing “maxCount” numbers. In my case I get an array of 32 consecutive GLuint IDs (arrayIDs[1…32]).
But in the glBeginQuery() call you need to provide an “x” pointing at a valid, free ID. And how can you know whether an ID is free or busy? Simple: you have to check the query results as often as possible, to free query resources quickly: glGetQueryObjectuiv(one_of_those_IDs, GL_QUERY_RESULT_AVAILABLE, &boolean_output).
If “boolean_output == true”, you can consider those queries done and reuse their “x” values in new glBeginQuery() calls.
But if the CPU is many times faster than the GPU (or the GPU is old and slow), or the rendered model is very complex (millions of vertices, if you don’t want to use bounding boxes), all of my 32 initially free query IDs can end up busy, waiting for results.
In that case, for an additional glBeginQuery() you don’t know which “x” to use, because all the available values (32 in my case) are busy. I tried calling the function anyway, and I received an exception due to an invalid driver/OpenGL state.
In fact, for glBeginQuery() it is written:

If the query target’s count exceeds the maximum value representable in the number of available bits, as reported by glGetQueryiv with target set to the appropriate query target and pname GL_QUERY_COUNTER_BITS, the count becomes undefined

and “undefined” means “unknown state”, so getting results for an exceeded “x” produces an error.
However, if the “x” is already busy, glBeginQuery(…, arrayIDs[x]) returns a GL_INVALID_OPERATION error, as expected.

I’ve avoided this behavior with a queue/array of free “x” values: when I call glBeginQuery(…, arrayIDs[x]) I remove that “x” from the queue, until a query result becomes available (which frees the busy “x” values again). And if I call glBeginQuery() before at least one “x” is free (i.e. present in the queue/array), I block the thread in a loop waiting for results. This trick solved my problem, but it’s just a trick, and I don’t like too many manual tricks in a piece of software. It looks roughly like the sketch below.
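For clarity, a minimal sketch of that free-ID queue; pollFinishedQueries() is a hypothetical helper that checks GL_QUERY_RESULT_AVAILABLE on the busy IDs and hands the finished ones back via releaseQueryID():

GLuint freeIDs[32];
int    freeCount = 0;

void initQueryPool(void) {
    glGenQueries(32, freeIDs);
    freeCount = 32;
}

GLuint acquireQueryID(void) {
    while (freeCount == 0)        /* all 32 busy: block until one finishes        */
        pollFinishedQueries();    /* GL_QUERY_RESULT_AVAILABLE -> releaseQueryID() */
    return freeIDs[--freeCount];
}

void releaseQueryID(GLuint id) {
    freeIDs[freeCount++] = id;
}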

That’s not what that means. GL_QUERY_COUNTER_BITS is the number of bits in the sample counter within a query object; that is, how many samples a single query can count. If it returned 24, then a particular occlusion query object’s sample count can only go up to 16,777,216 samples. So if you render more samples than that within a single query run, the SAMPLES_PASSED count will be undefined.

It has nothing to do with how many query objects you can get, nor does it affect how many query objects can be active at any one time. As long as you don’t try to begin a query while that query is active, you’re fine.

I’ve tried conditional render… but the result is worse.
The logic: with conditional render I don’t need a “first pass” to loop through the objects, wait for query results, set the “occluded” variable to true, and later check it during the rendering pass… so I suppose the queries can be issued in the same “final rendering pass”. Is that right?

I’ve done this test with the SAME world/objects/camera as the other tests on the first page of this thread.

1. for each 3D object:
   1. disable glColorMask(…)
   2. disable glDepthMask(…)
   3. glBeginQuery(…)
   4. glDraw() the object
   5. glEndQuery()
   6. enable glColorMask(…)
   7. enable glDepthMask(…)
   8. glBeginConditionalRender(…, GL_QUERY_WAIT)
   9. glDraw() the object
   10. glEndConditionalRender()
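In code, the loop above looks roughly like this (a sketch; objects[], queryID and drawObject() are placeholder names):

for (int i = 0; i < objectCount; ++i) {
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glBeginQuery(GL_SAMPLES_PASSED, objects[i].queryID);
    drawObject(&objects[i]);                 /* occlusion test draw            */
    glEndQuery(GL_SAMPLES_PASSED);

    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);
    glBeginConditionalRender(objects[i].queryID, GL_QUERY_WAIT);
    drawObject(&objects[i]);                 /* real draw, discarded if the    */
    glEndConditionalRender();                /* query found no visible samples */
}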

With NV_conditional_render (in core since OpenGL 3.0) I get 98fps, vs. 120fps for ARB_occlusion_query, vs. 144fps for my PBO+FBO+glReadPixels technique.

I’ve also tested other glBeginConditionalRender parameters: GL_QUERY_NO_WAIT, and even WAIT/NO_WAIT combinations with GL_QUERY_BY_REGION_*. Same results.

Maybe I’m using it the wrong way… please confirm.

However, with conditional render I’m not able to set any “occluded” variable to exclude hierarchical object children (which could be contained INSIDE this object via an octree), because my application gets no result at all from the occlusion test; everything is delegated to the GPU, which simply skips rendering after a failed query. From the application we’re unable to know whether a parent is occluded and then exclude its children…

Unless I overlooked something, you seem to neglect the fact that you can also batch multiple bounding boxes (or whatever test geometry you’re using) into a single cumulated object and test that first. If this doesn’t yield any passed samples, you can throw away the complete batch all at once. Unless your logic for computing the box minimum and maximum is expensive, this can substantially improve OC time with hardware occlusion queries, especially in scenes with high depth complexity. In your case, you could throw out a whole town with a single occlusion query. Not to mention loads and loads of vegetation and all the other good stuff in a virtual world.

EDIT: The algorithm then becomes:

a) frustum cull the scene
b) render large occluders to lay down some depth
c) group potentially visible objects into spatially coherent areas
d) for all groups:

1. render the biggest bounding volume
2. if any samples pass, split the volume into subvolumes
3. render the subvolumes
4. goto 2
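A rough sketch of step d) as a recursive test, to make the idea concrete; the Group type, drawBoundingBox() and markGroupVisible() are assumed helpers, not code from this thread:

void testGroup(Group *g) {
    GLuint q, samples = 0;
    glGenQueries(1, &q);

    glBeginQuery(GL_SAMPLES_PASSED, q);
    drawBoundingBox(g->min, g->max);   /* color/depth writes disabled here */
    glEndQuery(GL_SAMPLES_PASSED);
    glGetQueryObjectuiv(q, GL_QUERY_RESULT, &samples);
    glDeleteQueries(1, &q);

    if (samples == 0)
        return;                        /* the whole batch dies with ONE query  */

    if (g->childCount == 0)
        markGroupVisible(g);           /* leaf: its objects get rendered later */
    else
        for (int i = 0; i < g->childCount; ++i)
            testGroup(&g->children[i]);   /* refine into subvolumes */
}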

Of course, the worst-case complexity is higher, since you’d have to test every single object anyway while introducing additional queries for the enclosing bounding volumes. However, you can track the number of tests per frame, and if the number of queries surpasses the number of objects for multiple frames, you can switch to your current approach dynamically. Just don’t forget to switch back every once in a while to see if you benefit.

BTW, if an object doesn’t consist of many faces and its shader complexity is low enough, consider not testing it at all, as rendering it will not incur significant cost, neither for vertex nor for fragment processing.

All in all, occlusion culling isn’t an easy task, as it depends on many factors, like depth complexity, camera movement speed/direction/orientation, shader and mesh complexity (cost/benefit ratio), changes in scene geometry (i.e. frame 0 is cool, frame 10 isn’t, because something big moved relative to the camera), and so on.

Perhaps you should consider just doing an end-run around explicit hardware occlusion queries, conditional-render predicated or not, and read from a depth texture directly for your occlusion test. There are nice benefits there for batching (no more shuffling around a bazillion little IDs). Check out the mipmapped depth buffer (Hi-Z) approach used in March of the Froblins and on rastergrid’s site, among other places.

In this case I’d need to waste many CPU cycles doing matrix multiplications CPU-side.
I’m using a UniformBufferObject to pass the matrices to the shader once (which then does all the multiplications, GPU skinning, etc.), but this way I hand all that work back to the CPU to build a VBO which, on top of that, I cannot reuse as-is for the final render.
Anyhow, “merging” VBOs for this purpose is not a solution, because a character inside a house is “never” visible from outside (assuming no windows or open doors), so merging both VBOs is useless: the character just needs to be excluded from the occlusion query and the rendering phase, not computed during the occlusion test.
I’ve got a hierarchy like the one below:

- Terrain
  - House
    - Saloon
      - Character 1 watching TV
      - Sofa
      - Television
      - Stairs
    - Bedroom
      - Bed
      - Lamp
      - Character 2 taking a bath

And if the House test fails, I’m not interested in testing its children.
Another example could be a dungeon inside/behind a mountain wall: I can group the whole dungeon in a single hierarchical structure and decide not to test its children if the parent object is behind the mountain in front of me.
Your way, I’d have to merge all the VBOs into one big one and test occlusion on that whole VBO… and I don’t think that’s a good solution.

Yes, I understand your idea… but merging many objects into one big VBO is not a solution. Imagine a big house with lots of furniture and characters inside it: should I merge all of them into the test VBO? What a waste of CPU…
And that merged VBO is not reusable during the real render phase, for obvious reasons I don’t think I need to explain.

And if the single big test passes, then I’d have to render all those objects… or execute an additional occlusion query for each of its children. It doesn’t seem to be the best idea.

How can I know, CPU-side, whether the object will be BIG on screen once rendered? A small stone could be bigger (in rendered pixels) than a house, if the house is 10 miles beyond the stone. To know whether an object will be big or small in screen pixels, I need too many calculations involving its zDepth, its bounding coords, and how much of the object remains on screen.
And I’d still need to sort objects front to back… which is not simple in many cases.

You are aware that you only need to compute the min and max of all the enclosed objects, right? You need exactly one unit cube in a buffer object, which you can then scale in the vertex shader to cover the space designated by your min/max. No uploads, no nothing. That’s very cheap compared to a multitude of occlusion queries. Where did you read that you should merge VBOs for this purpose? I never mentioned that you should move any of the real objects anywhere. You only need to set a few uniforms and render the scaled unit cubes.
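To illustrate (a sketch only; the uniform names and the [0,1] cube convention are my assumptions): the vertex shader stretches the one shared cube to any AABB, so each test is just two glUniform3f calls and a draw.

const char *vsSource =
    "#version 330 core\n"
    "layout(location = 0) in vec3 cubePos;       /* unit cube in [0,1]^3 */\n"
    "uniform vec3 boxMin, boxMax;                /* AABB of the batch    */\n"
    "uniform mat4 viewProj;\n"
    "void main() {\n"
    "    vec3 p = mix(boxMin, boxMax, cubePos);  /* scale + translate    */\n"
    "    gl_Position = viewProj * vec4(p, 1.0);\n"
    "}\n";

/* Per batch: update the AABB uniforms and draw the same cube again. */
glUniform3f(locBoxMin, mn.x, mn.y, mn.z);
glUniform3f(locBoxMax, mx.x, mx.y, mx.z);
glDrawArrays(GL_TRIANGLES, 0, 36);               /* the one shared unit cube */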

That’s the whole point. The house’s bounding box naturally encloses what’s inside the house. You test the house, and if it fails, everything inside is out. However, there are vast areas in your scene which are spatially coherent, i.e. close to each other, but disjoint in nature. A tree will not contain another tree. Still, if they are close, you can simply add up their bounding volumes, check this single volume, and maybe cull both objects with a single occlusion query.

Is your terrain not big? If there’s only the terrain functioning as a large occluder, so be it. Nobody’s talking about spatially small objects.

It doesn’t? Well, if you can reduce the number of occlusion queries by 50%, even if not by 100%, isn’t that a good idea? As I already said above, you may well get worse performance if you don’t have a lot of occlusion. Otherwise this should almost always be a win. You don’t go directly from the big occlusion test object to all the enclosed objects. You test one object, then 2 (or 4, or 8…), and so on. If you enclose 100 objects and only need to do 20 queries instead of a full hundred, that’s a big win.

EDIT: BTW, you know what bounds multiple disjoint objects in a nice way? A spatial data structure like an octree or quadtree. This is actually the cheapest way of testing large sections. Using what you already have, you can start some levels below the root (because smaller objects most likely won’t intersect multiple big cells) and hierarchically refine the queries until you reach your maximum depth. If you actually reach maximum depth, you’d need to test the contents of the whole cell, or manually group the objects in the cell and proceed hierarchically again, as described above.

The topic is very deep and fun to play with. However, I just remembered that you don’t use any spatial data structure in your application, and that’s not good. It ruins some perfect optimization opportunities, especially when doing frustum and occlusion culling.

You said: “batch multiple bounding boxes into a single cumulated object”. To me that reads as “merge all VBOs into a single big VBO” (a VBO formed of bounding boxes).
However, to scale that cube I need special calculations to transform a cube (with dimensions [1,1,1] at position [0,0,0]) into one with dimensions [1.2, 0.5, 4.7] at position [x1, y1, z1]. Yes, it’s “simple”, and quicker than what I had understood (merging into one single VBO), but it feels like a dirty technique.

In my case I’m using interleaved VBOs whose first 8 vertices (as an offset) hold the bounding box min/max: “12345678VNTP,VNTP,VNTP,…”, where 12345678 are vertices V1,V2,V3,…V8 of the bounding box and VNTP is the interleaved layout Vertex/Normal/Texcoord/Pickid.
This way I’ve already stored the bounding boxes and the picking values (for my method) in one single upload to the GPU.
When I want to draw the bounding box I just call glDrawArrays(GL_QUADS, 0, 8), and when I want to draw the complete object I call glDrawArrays(GL_TRIANGLES, 8, verticesCount).
So I’ve already got the full bounding box coords stored on the GPU, and I just need to pass the translation/rotation matrix, which is the same one, stored and pre-calculated on the CPU, that the original object reuses.

(PS: if I want to do a picking render pass (using my technique with glReadPixels), I just enable the VBO’s “Color0” attribute and switch to a different shader (which assigns colors according to the Pickid), without any additional upload to the GPU.)

But with NV_conditional_render you are unable to do this.
The application knows nothing about whether the occlusion test found the object visible or completely occluded, so you can’t use an IF statement (on what “occluded == true” variable?) to process or discard its children.
Using ARB_occlusion_query is the closest working procedure for avoiding this conditional-render limit.

Is a house a big or a small object? It depends on how close you are to it.
If I put my face right up against a wall, I can’t see other houses, trees, or anything else, because the biggest object is the house, not the others. So it’s all relative. In this example the house could be the biggest occluder, not the terrain.
Imagine a close-quarters FPS: most of the rendered pixels belong to objects, not to terrain.
I’m trying to handle all those cases, not to solve just one simple, specific situation.

Yes, I know, and I prefer to test the differences between techniques in this environment.
Using a sorting algorithm, an octree, or some other spatial arrangement, I might be unable to really compare many different approaches.
In fact, if I sort objects front to back and use octrees, I’m sure ARB_occlusion_query is more precise, and maybe quicker, than my FBO+PBO+glReadPixels method; but sorting and octrees require initial CPU work.

Done the normal way, how many things would you have to do to…

1. know which object the cursor is over
2. get the zDistance from the camera for many rendered pixels of that object
3. discard objects whose visible pixels are below a threshold (5px, for example, could mean an object too small to be worth rendering)
4. hand the occlusion culling calculations to another thread (impossible with occlusion queries, which have to run in the main thread)
5. know whether a rendered object is near a viewport edge (“who is interested in this?”: well, it’s not my problem)

…?

I’ll try to answer:

1. raycast using the mouse coords: convert them to world space and intersect with the objects to find the first one by its zDepth
2. zDistances are calculated during frustum culling using bounding box or sphere coords/radius, not for every single pixel
3. for this point you need an additional ARB_occlusion_query call; NV_conditional_render cannot help
4. maybe possible with multiple OpenGL contexts, but I haven’t tested it, so I’ll consider it impossible
5. maybe with additional calculations using the frustum planes (am I right?)… but I’m not sure, nor am I much interested in knowing

And something more is required:

- sorting objects front to back
- a zDepth pass to intersect with the mouse raycast

I think it depends on what you need.
I started out using color picking selection and quickly grew its efficiency into this FBO+PBO+glReadPixels method.
Single-pixel precision is not that important to me, and having all the data within a single readback is sweeter than implementing 10 different tricks to end up with the same information.

You don’t get it. I never, ever said anything about moving any data. I said you should batch-process multiple objects, which simply means you don’t look at single objects but at a collection, and try to make a decision for that collection. In the case of bounding boxes this only means determining the minimum and maximum extents in space. That alone is enough to compute a bounding box. So far, this is all CPU stuff. If you allocate a single unit cube at application startup, you can render that cube for any bounding box in your application. All you need to do is transform it properly in the vertex shader. You don’t have to allocate any new buffers and you don’t transfer any data, at all. Got it now?

Who said anything about conditional rendering? I was always talking about hardware occlusion queries only, which makes sense, since in my suggested approach you still need round trips to the CPU. However, you can minimize the work the application has to do, and the time it waits for query results, by batching things effectively.

First of all, the case you depict is very simple: you can use a bounding sphere as a coarse estimate and simply make your decision a function of distance and sphere radius. This way you can determine whether an object is a large occluder or not. It is merely an augmentation of your frustum culling step, and it handles even the smallest objects properly. (A sketch follows below.)
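A minimal sketch of such a test; Vec3, vecSub and vecLength are assumed helpers, and the 25%-of-FOV threshold is an arbitrary choice of mine, not a recommendation from this thread:

#include <math.h>

int isLargeOccluder(Vec3 center, float radius, Vec3 camPos, float fovY) {
    float dist = vecLength(vecSub(center, camPos));
    if (dist <= radius)
        return 1;                      /* camera is inside the sphere */
    /* angular diameter of the sphere vs. the vertical field of view */
    float angular = 2.0f * asinf(radius / dist);
    return angular > 0.25f * fovY;     /* "large" if it spans >25% of fovY */
}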

Second, in the close-quarters scenario you depict, a spatial data structure immediately comes in handy. One simple approach is to render only the cell the camera is currently in, which naturally discards many, many other cells. And this doesn’t even require sophisticated batching, as long as the cells in your data structure are sufficiently small. So, again, theoretically not a problem.

Good luck with that. You can’t handle every possible case. I would go as far as to suggest that perfect (as in 100% correct) occlusion culling and high performance are mutually exclusive.