Part of the Khronos Group
OpenGL.org

Thread: Occlusion Culling with FBO+PBO+glReadPixels_async

  1. #1
    Intern Contributor
    Join Date
    Nov 2011
    Posts
    50

    Occlusion Culling with FBO+PBO+glReadPixels_async

    Hello friends, I've been developing a graphics engine for more than a year, and I've tested quite a few ways to do occlusion culling.

    Testing environment:

    • nVidia Quadro FX 360M for notebook
    • Intel Centrino DualCore 2.5GHz, 4GB RAM
    • only frustum culling for objects outside frustum (with 6 planes in CPU)
    • no octrees or other hierarchical order
    • no physics or other stuff...just simple rendering using VBOs and GLSL 3.30 shaders
    • tests done with the same initial camera position
    • 50 objects (40,000 vertices, 16 MB of textures loaded in total)
    • double buffer active and VSync disabled (using SwapBuffers function)


    Using ARB_occlusion_query extension
    First pass:

    1. sort objects in front-to-back order (Z-depth was retrieved during the frustum calculation)
    2. disable color writes with glColorMask(...)
    3. for each 3D object:
      1. BeginQuery(...)
      2. glDraw() object
      3. EndQuery(...)
      4. read the result with glGetQueryObject()
      5. if (visiblePixels < threshold) set the object's variable "occluded = true"

    4. re-enable color writes with glColorMask(...)


    Second pass:
    1. for each 3D object:
      1. if (occluded == true) skip object
      2. glDraw() object


    With ARB_occlusion_query I get 120fps.


    Using glReadPixels (with the same technique as color picking):
    Pre-phase: I was already using a color picking technique for mouse selection, so I tried to use the same technique for occlusion. Color picking renders all objects in a single pass, each object with a different color. This way, if one pixel is "158,0,0,0" (B,G,R,A format in glReadPixels), I know that at least one pixel of that object (ID=158) is visible, so I can draw it in the rendering pass.
    Tips & tricks: the trick is to use a framebuffer object (for off-screen rendering) and two PBOs (for asynchronous glReadPixels readback) to get the pixels, with all of those buffers at a different SIZE (width*height) from the real rendering phase.
    Example: the scene is rendered at 1900x1200 (as on my notebook), but the FBO and PBOs are only 20% of that resolution.
    Concept: for occlusion I'm not really interested in whether ONE pixel of an object is visible, but in whether a few pixels are visible. So it's pointless to render at the same resolution as the final image (because of the fragment shader work over more pixels/fragments, and the glReadPixels readback over millions of pixels). The solution is to use a proportionally reduced resolution for the FBO and PBOs and render at that smaller size. Less precision, but very good speed!
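
    As a rough worked example of the saving described above, the sketch below compares the readback size at full resolution with the reduced one. It assumes that "20%" means each axis is divided by 5 (the thread later speaks of "1/5") and that each BGRA pixel takes 4 bytes; the function name `readback_bytes` is made up for illustration.

```python
def readback_bytes(width: int, height: int, divisor: int = 1,
                   bytes_per_pixel: int = 4) -> int:
    """Bytes moved by one readback of a buffer shrunk by `divisor` on each axis."""
    w = width // divisor
    h = height // divisor
    return w * h * bytes_per_pixel


full = readback_bytes(1900, 1200)                # full-size BGRA readback
reduced = readback_bytes(1900, 1200, divisor=5)  # 380x240 buffer
print(full, reduced, full // reduced)
```

    Under these assumptions the reduced buffer is 25 times smaller than the full one, because the saving applies on both axes.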

    First pass:
    1. create a framebuffer object (FBO) for off-screen rendering, smaller than the rendering viewport
    2. create two PBOs (one for the color_attachment0 and one for the depth_attachment component) with the same size as that FBO
    3. enable rendering (color and depth) to our FBO
    4. glClear(...color...depth...)
    5. change the current viewport size to fit the FBO's dimensions
    6. render objects as usual (or just in bounding_box mode for a super fast pass)
    7. execute glReadPixels() to get an array of B,G,R,A values from our FBO
    8. parse the array (a few thousand pixels at this reduced resolution) and check the BGRA values to identify object_IDs
    9. if an object_ID is valid, look up that object and set its variable "occluded = false" (every object starts the pass as occluded)
    10. disable rendering to the FBO and return to the window's back_buffer
    11. restore the viewport to its previous size (full render size)
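
    The parsing step of this pass (checking BGRA values to identify object IDs) could look roughly like the sketch below. It is only an illustration: it assumes the ID is packed into the B, G and R bytes with B as the low byte (so "158,0,0,0" decodes to 158) and that ID 0 is the clear color; the name `visible_object_ids` and the `min_pixels` parameter are hypothetical.

```python
from collections import Counter


def visible_object_ids(pixels: bytes, min_pixels: int = 1) -> set:
    """Return the IDs that cover at least `min_pixels` pixels of a BGRA buffer.

    Assumption: the ID is packed little-endian into B, G, R (B = low byte)
    and ID 0 means "background / no object".
    """
    counts = Counter()
    for i in range(0, len(pixels) - 3, 4):
        b, g, r = pixels[i], pixels[i + 1], pixels[i + 2]
        object_id = b | (g << 8) | (r << 16)
        if object_id != 0:
            counts[object_id] += 1
    return {oid for oid, n in counts.items() if n >= min_pixels}


# A 4x1 test buffer: background, object 158 twice, object 7 once.
buffer = bytes([0, 0, 0, 0,  158, 0, 0, 0,  158, 0, 0, 0,  7, 0, 0, 0])
visible = visible_object_ids(buffer)
```

    Every object whose ID is missing from the returned set stays marked as occluded for the second pass. Raising `min_pixels` plays the same role as the fragment threshold used with occlusion queries.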


    Second pass:
    1. for each 3D object:
      1. if (occluded == true) skip object
      2. glDraw() object


    With this procedure I get 144fps


    This test demonstrates that ARB_occlusion_query can be worse than other half-hardware methods.
    With the second technique I don't need to sort objects from front to back, and it is not so bad!

    I hope this helps someone with occlusion culling or color picking.

  2. #2
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,098
    Most interesting; what did you find as an acceptable buffer size for the reduced rendering?

  3. #3
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    I'm not surprised your other way was faster. You killed performance by immediately getting the query result for each object, rather than using conditional rendering. The glReadPixels way only has one GPU stall instead of many.

    Also, generally speaking, occlusion culling isn't used that way. You render your terrain, always. Then, you occlusion cull each object against that, not against each other.

  4. #4
    Intern Contributor
    Join Date
    Nov 2011
    Posts
    50
    Quote Originally Posted by tonyo_au View Post
    what did you find as an acceptable buffer size for the reduced rendering?
    With occlusion queries I was using 10 pixels as the threshold to be sure that a sufficient part of the model is rendered, so I've reduced the "buffer" size (the FBO's size) to 1/5 (20%). This is similar to creating a full-size FBO/PBO and checking every 5th pixel (well, it's not 100% the same, because it depends on which glTexParameter() settings you use), but the delay to "download" millions of pixels (from VRAM to CPU/RAM) is reduced thanks to the smaller PBO size!

    Quote Originally Posted by Alfonse Reinheart
    conditional render
    Thanks for pointing this out to me. I didn't know about this solution.
    I've just read up on it, and it seems a good way to do occlusion culling!
    Anyhow, with your solution I still need to do a depth pass if I want to know whether Object_1 is behind or in front of Object_2. With my method you already have one (reduced) depth PBO attached to its parent FBO.

    Second: using many function calls for ARB_occlusion_queries (BeginQuery...EndQuery) requires a different render process (for the occlusion check), without the possibility of reusing the common one. With my approach the rendering stays completely the same (obviously with different shaders during the culling pass). PS: many function calls also add more overhead than one simple draw call.

    Third: with my approach you can free the main thread (the rendering one) from the culling calculation, because you can pass the array of pixels (retrieved with glReadPixels) to another thread and use its results a few frames later to choose which objects to hide.

    Fourth: using the FBO+PBO technique you can tell whether an object is near the edges of the viewport (and WHICH edge!) or in which part/region of the screen it will actually be rendered (in most perspective cases, objects rendered at the "top" should be farther away than those at the "bottom").

    Fifth: while parsing the array of pixels you can check which pixel the mouse is over and change the cursor according to the BGRA "picking" values. This way a raycast is unnecessary. I like this way of changing the cursor shape based on the mouse position.

    Sixth: no need to sort objects from front to back.
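
    The third point (handing the pixel array to another thread) can be sketched as below. This is a minimal illustration under stated assumptions, not the engine's code: the ID is taken from the B byte only, 0 means background, and `culling_worker`, `jobs` and `results` are hypothetical names. A real renderer would keep the worker alive and read `results` a frame or more later instead of joining immediately.

```python
import queue
import threading


def culling_worker(jobs: queue.Queue, results: dict, lock: threading.Lock):
    """Consume (frame, pixels) jobs and publish the visible-ID set for each frame."""
    while True:
        job = jobs.get()
        if job is None:            # sentinel: shut the worker down
            jobs.task_done()
            break
        frame, pixels = job
        # Assumption: ID stored in the B byte of each BGRA pixel, 0 = background.
        visible = {pixels[i] for i in range(0, len(pixels), 4)} - {0}
        with lock:
            results["frame"] = frame
            results["visible"] = visible
        jobs.task_done()


jobs = queue.Queue()
results = {"frame": -1, "visible": set()}
lock = threading.Lock()
worker = threading.Thread(target=culling_worker, args=(jobs, results, lock))
worker.start()

# Render thread: hand over the pixels read back for frame 0, then stop the worker.
jobs.put((0, bytes([158, 0, 0, 0,  0, 0, 0, 0])))
jobs.put(None)
worker.join()
```

    The worker only removes the parsing cost from the render thread; the readback itself still has to reach CPU memory first.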

    However, I'm curious to test your solution asap!

    (sorry for my English)
    Last edited by tdname; 02-04-2013 at 05:14 AM.

  5. #5
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    Anyhow with your solution I still need to do a Depth Pass if I would want to know if Object_1 is behind/in-front of Object_2.
    Is that useful information?

    Also, you seem to forget that your method is imperfect. It is not sample-accurate, since you're rendering at reduced resolutions. So there will be times when the user will see an object pop into being. The conditional rendering method is always sample-accurate.

    using many function calls during ARB_occlusion_queries (BeginQuery...End Query) needs to create a different Render process (while checking occlusion) without the possibility to use the common one. Instead with my approach the rendering remains completely the same (obviously with different Shaders during culling pass). PS: and many function calls adds more overhead than just one simple draw call.
    I don't know what a "Render process" is, but unless you actually profiled it, you have no way of knowing if the "overhead" is significant enough to be worth considering, compared to the savings of a readback, not to mention the GPU/CPU synchronization, cache pollution due to the need to render each object twice, etc.

    with my approach you could have the possibility to resource-free main thread (which is the rendering one) from culling calculation because you can pass array of pixels (retrieved with glReadPixels function) to another thread and use its results few frames later to choose what objects to hide.
    You still had to provoke a GPU/CPU synchronization to read the data back.

    using FBO+PBO technique you could know if one Object is near the edges of the viewport (an know to WHICH edge!) or in what part/region of the screen it will be effectively rendered. (in most of perspective cases, objects rendered at "top" should be far than others at "bottom")
    You can know that by doing some basic work on the CPU too. Indeed, I imagine such computations will generally be cheaper than doing a GPU/CPU synchronization.

    during parse array of pixels you could check on which pixel your mouse is over and change its cursor according to BGRA "picking" values. In this way a raycast is unnecessary. I like this way to change cursor shape based on mouse position.
    And some people aren't. You never stated that this was important.

    no need to order objects from front to back
    The method I outlined didn't require that either.

  6. #6
    Intern Contributor
    Join Date
    Nov 2011
    Posts
    50
    Quote Originally Posted by Alfonse Reinheart View Post
    Is that useful information?
    LevelOfDetails and all similar things...

    Quote Originally Posted by Alfonse Reinheart View Post
    you seem to forget that your method is imperfect
    Nope, I haven't forgotten that.
    Previously, with ARB_occlusion_queries, I was using a threshold of 10 fragments to determine when an object is visible enough to be rendered (to avoid fully rendering very distant objects), so rendering at 1/5 size to "merge" pixels 5-by-5 is another way to do a similar thing. If the full-size rendering has more than 5 neighboring pixels with the same BGRA value, then in the reduced FBO (20% of full size) I'm quite sure that at least 1 pixel contains the same BGRA value. But I must also parse ALL pixels (or use a heuristic algorithm), so that in the end I can discard objects that are too small based on their total count of reduced pixels.

    Quote Originally Posted by Alfonse Reinheart View Post
    I don't know what a "Render process" is
    It's a procedure with the object loop, similar to: "for each Object do Render();"
    To execute occlusion queries (even with conditional rendering) I need to create another loop with "BeginQuery()...EndQuery()" at the beginning/end of EACH iteration, so I cannot reuse the same main render loop procedure; instead I have to maintain the code twice (or use many IF statements inside the shared function).

    Quote Originally Posted by Alfonse Reinheart View Post
    you have no way of knowing if the "overhead"
    Queries need:
    • a sorting algorithm from front to back. Additional sorting of objects is difficult, because I also want to sort by shaders or textures to avoid too many OGL/shader function calls
    • a different render procedure (as I wrote a few lines above), with the problem of keeping 2 similar, but different, functions in sync, or of using many IF statements to skip the queries and their results during the final Render() call.
    • asking the GPU something extra for every object: "how many fragments are visible for object X?". Conditional rendering could solve this because it stops at the first fragment, but then I'm unable to know how much of the object is visible (I'm not interested in rendering a 100,000-vertex object if only 5 of its pixels are visible at a very far distance (ZDepth)).
    • ARB_occlusion_queries are limited in number. On my video card I can run only 32 queries at the same time, and every loop cycle (or at a different rate if I want to change it) I have to interrogate the GPU to try to get the query results.


    Quote Originally Posted by Alfonse Reinheart View Post
    compared to the savings of a readback
    The glReadPixels() readback is executed asynchronously with 2 PBOs, and its result is read at frame N+1.
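
    One common reading of "async with 2 PBOs at N_frame+1" is a ping-pong scheme: each frame starts a readback into one PBO and maps the other, which was filled during the previous frame. The sketch below is a CPU-side model of that bookkeeping only, with the GL calls indicated in comments; the class name `PingPongReadback` is hypothetical, and whether the poster's engine alternates the buffers exactly this way is an assumption.

```python
class PingPongReadback:
    """CPU-side model of two pixel-pack buffers used in alternation."""

    def __init__(self):
        self.pbo = [None, None]   # stand-ins for two GL buffer objects
        self.frame = 0

    def end_of_frame(self, rendered_pixels):
        write = self.frame % 2          # PBO that receives this frame's readback
        read = (self.frame + 1) % 2     # PBO that was filled one frame earlier
        # Real code would bind pbo[write] to GL_PIXEL_PACK_BUFFER and call
        # glReadPixels with a buffer offset, so the copy can run in the background.
        self.pbo[write] = rendered_pixels
        # Real code would now map pbo[read]; its copy has had a frame to finish.
        result = self.pbo[read]
        self.frame += 1
        return result


readback = PingPongReadback()
outputs = [readback.end_of_frame(f"frame{i}") for i in range(4)]
```

    The first call has nothing to return yet, and every later call returns the pixels of the previous frame, which is the one-frame latency discussed in this thread.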

    Quote Originally Posted by Alfonse Reinheart View Post
    You can know that by doing some basic work on the CPU too
    Of course that is true, but to do it I need to execute additional calculations just for that purpose.
    With my solution you already have the data while reading the array of pixels, without doing anything more.
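
    As an illustration of getting the screen region "for free" while reading the pixel array, the sketch below collects a bounding rectangle per object ID and reports which viewport edges it touches. It assumes the ID sits in the B byte of each BGRA pixel, that 0 is background, and that rows are stored bottom-up as glReadPixels returns them; `object_bounds` and `touched_edges` are hypothetical helper names.

```python
def object_bounds(pixels: bytes, width: int, height: int) -> dict:
    """Map each object ID to (min_x, min_y, max_x, max_y) in buffer coordinates."""
    bounds = {}
    for y in range(height):
        for x in range(width):
            oid = pixels[(y * width + x) * 4]   # B byte holds the ID (assumption)
            if oid == 0:
                continue
            x0, y0, x1, y1 = bounds.get(oid, (x, y, x, y))
            bounds[oid] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return bounds


def touched_edges(box, width: int, height: int) -> set:
    """Return the viewport edges that a bounding rectangle reaches."""
    x0, y0, x1, y1 = box
    edges = set()
    if x0 == 0:
        edges.add("left")
    if x1 == width - 1:
        edges.add("right")
    if y0 == 0:
        edges.add("bottom")   # row 0 is the bottom row of a glReadPixels result
    if y1 == height - 1:
        edges.add("top")
    return edges


def px(object_id: int) -> bytes:
    return bytes([object_id, 0, 0, 0])


# 3x2 buffer, bottom row first: object 5 fills the left column, object 9 the top-right pixel.
buffer = px(5) + px(0) + px(0) + px(5) + px(0) + px(9)
boxes = object_bounds(buffer, 3, 2)
```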

    Quote Originally Posted by Alfonse Reinheart View Post
    And some people aren't
    But this is a "built-in" and "CPU-free" additional feature that needs nothing more: just get the pixel value with "array[((mouseY*width)+mouseX)*RGBA_size]", where "RGBA_size = 4" if I requested all 4 components from the resized FBO/PBO.
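
    A small sketch of that lookup follows. Two details are added here as assumptions rather than taken from the post: the mouse position is scaled from window size down to the reduced buffer, and the Y coordinate is flipped, because glReadPixels returns rows bottom-up while window systems usually report the mouse with y=0 at the top. The function name `picked_id` is hypothetical, and the ID is read from the B byte only.

```python
def picked_id(pixels: bytes, fbo_w: int, fbo_h: int,
              mouse_x: int, mouse_y: int, win_w: int, win_h: int) -> int:
    """Return the B-byte ID under the mouse, or 0 for background."""
    # Scale window coordinates down to the reduced buffer and clamp to its size.
    x = min(mouse_x * fbo_w // win_w, fbo_w - 1)
    y = min(mouse_y * fbo_h // win_h, fbo_h - 1)
    # Flip Y: mouse y grows downward, readback rows start at the bottom.
    y = fbo_h - 1 - y
    return pixels[(y * fbo_w + x) * 4]


# 2x2 buffer, bottom row first: IDs 1, 2 on the bottom row and 3, 4 on the top row.
buffer = bytes([1, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0,  4, 0, 0, 0])
top_left = picked_id(buffer, 2, 2, mouse_x=0, mouse_y=0, win_w=10, win_h=10)
bottom_right = picked_id(buffer, 2, 2, mouse_x=9, mouse_y=9, win_w=10, win_h=10)
```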

    Quote Originally Posted by Alfonse Reinheart View Post
    The method I outlined didn't require that either
    In another discussion (2009) you said something different: http://www.opengl.org/discussion_boa...=1#post1184753
    It is impossible to use occlusion queries, or conditional rendering, without rendering objects in the right order. For example, if I sort back to front, all in-frustum objects will pass occlusion/conditional rendering with "visible fragments >= 1".
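
    The effect of draw order on per-object query results can be shown with a toy one-dimensional depth buffer. This is a software model written for illustration only (the names `samples_passed`, `wall` and `tree` are made up); it counts, for each object, how many samples pass a "less than" depth test when objects test against each other in the given order.

```python
def samples_passed(objects, width=8):
    """Draw 1-D objects (name, x0, x1, depth) in order with a LESS depth test.

    Returns how many samples of each object passed the test, which is what
    a samples-passed query wrapped around each draw would count.
    """
    depth = [float("inf")] * width
    passed = {}
    for name, x0, x1, z in objects:
        count = 0
        for x in range(x0, x1):
            if z < depth[x]:
                depth[x] = z
                count += 1
        passed[name] = count
    return passed


wall = ("wall", 0, 8, 1.0)    # near object covering the whole buffer
tree = ("tree", 2, 5, 5.0)    # far object, entirely behind the wall

front_to_back = samples_passed([wall, tree])
back_to_front = samples_passed([tree, wall])
```

    Drawn front to back, the tree reports zero samples; drawn back to front, it reports three even though none of them survive in the final image. This model does not cover the alternative raised in the thread of drawing a fixed occluder such as terrain first and testing everything else against it.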

    PS: using FBO+PBO+glReadPixels and some additional techniques, I could also determine whether an object is visible behind glass (not fully transparent but slightly opaque, like in a bathroom) thanks to the BGRA multiplication of pixels between overlapping objects. (This is just an idea, because it needs a lot of additional work to merge textures and read back the merged BGRA values with the texture's GL_MODULATE and its various alternatives.)
    Last edited by tdname; 02-04-2013 at 03:13 PM.

  7. #7
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    LevelOfDetails and all similar things...
    Which is more easily done by simply testing the object's distance from the camera.

    sorting algorithm from front to back. Additional sorting objects is difficult because I want to sort even by Shaders or Textures to avoid too much OGL/Shaders function calls
    You don't have to sort back to front. As stated, you can just render your terrain, then render the objects in the terrain in whatever order you want, doing occlusion tests in whatever order you want. Yes, non-terrain objects won't necessarily occlude objects behind them, but that's relatively minor compared to the number of objects occluded by terrain.

    Depending on your camera angle and terrain construction, of course. If all you have is a generalized bag of "objects" (none of which could be called "terrain"), this doesn't help. But generally speaking, if you have a generalized bag of objects, you're probably not going to get much consistent performance gain from occlusion tests of any kind, whether queries or your method.

    And consistent performance is generally better than spiky performance. You want to minimize the times when minor changes in positions or orientations cause massive changes in how much stuff you render. That's one of the reasons to do occlusion tests based on terrain and not other objects.

    To execute OcclusionQueries (even with conditional rendering) I need to create another loop with "BeginQuery()...EndQuery()" at begin/end of EACH loop, so I cannot reuse the same main Render Loop Procedure but I need to manage twice the code (ore use many IF statements inside the shared function).
    In general, when doing occlusion queries, you don't render the actual object to test the query. You render a test object, which is generally a cube or something. So it isn't the same code.

    Furthermore, the code is just rendering with one mesh vs. a different mesh. It shouldn't be any different code-wise; the only difference is where it gets the mesh from.

    (I'm not interested to render a 100.000 vertices objects if it has visible only 5 pixels at very far distance (ZDepth).
    Why would you want to render a 100,000 vertex object at a very far distance regardless of how many pixels are shown? It's not about the number of pixels; it's about the fact that it is far away. You should be LOD'ing it down regardless of whether it's partially occluded.

    ARB_occlusion_queries are number limited. In my video card I can run only 32 queries at the same time and every loop cycle (or at different rate if I want to change it) I should interrogate GPU to try to get queries results.
    The specification doesn't allow that limitation anywhere, to my knowledge. What happens when you try to use more queries?

    glReadPixels() readback is executed async with 2 PBOs at N_frame+1.
    Thus compounding the inaccuracies by also adding a frame latency; this is exactly what conditional rendering is intended to avoid. For me, the most important thing in any visibility culling algorithm is not having objects visibly pop into place. Your method doesn't provide that.

    But if that inaccuracy is fine for your needs, you can do it. I just don't suggest it.

  8. #8
    Intern Contributor
    Join Date
    Nov 2011
    Posts
    50
    Quote Originally Posted by Alfonse Reinheart View Post
    Which is more easily done by simply testing the object's distance from the camera
    Yes, but that has to be done in another function created specifically for that purpose.
    Everything could be done with its own specific function... I'm just saying these things can all be "grouped" into one single technique without creating external pre/post loops. Reading the pixels is required anyway, so all the data that can be retrieved from them is more or less "CPU/GPU-free", with no additional object loops, overhead, function calls, rendering, etc... Just a "simple" async glReadPixels() to have everything in hand without doing anything else.

    Quote Originally Posted by Alfonse Reinheart View Post
    that's relatively minor compared to the number of objects occluded by terrain
    This solution is imperfect, just as you said earlier about my FBO+PBO idea.
    Considering only the terrain as the main occluder is not precise or perfect either, just like the reduced-size readback tests...
    Imagine standing in front of (and near the door of) a big house: I don't want to render and process the invisible city behind the house.

    Quote Originally Posted by Alfonse Reinheart View Post
    when doing occlusion queries, you don't render the actual object to test the query. You render a test object, which is generally a cube
    Sometimes the bounding box is too big even though the actually visible pixels are just a few.
    However, I was not so interested in using bounding boxes, because I prefer the real bounding shape to increase precision, given the nature/greenery (trees, vegetation, rocks, etc...) target of my engine.

    Quote Originally Posted by Alfonse Reinheart View Post
    It shouldn't be any different code-wise; the only difference is where it gets the mesh from
    I have a "Render()" function that is the complete rendering step, and because the object loop is inside it, the "BeginQuery...EndQuery...GetObject..." calls would have to go inside that loop. So I would have to create another "RenderWithQueries()" function that uses those calls in its loop... or use IF statements to enable those calls only when the pass is "test occlusion". I don't think that is a very clean or maintainable solution.
    Maybe you use an "object.render()" approach, but that means re-creating the main loop in every pass you add: another unclean solution, and perhaps with other limitations.

    Quote Originally Posted by Alfonse Reinheart View Post
    Why would you want to render a 100,000 vertex object
    It was just an example.
    Example 2: if I have a big 100,000-vertex object in the right corner, but only 5 of its pixels are shown, I'm not interested in rendering that object because it adds so little to the scene. With your method it will be rendered.

    Quote Originally Posted by Alfonse Reinheart View Post
    The specification doesn't allow that limitation anywhere, to my knowledge. What happens when you try to use more queries?
    My GPU can use 32 queries at the same time (each on a different VBO), and each additional attempt to add another one returns an OpenGL error, as expected. In this situation I need to poll the result function (glGetQueryObjectuiv with GL_QUERY_RESULT_AVAILABLE) many, many times to get a result. But I can't know HOW MANY times in advance... so I should call it after every EndQuery(). But if the CPU is very fast (or the GPU very slow), the query queue quickly becomes full and an additional BeginQuery() will return an error. My notebook has a "slow" GPU, and I sometimes received this error before I manually created a queue of queries that blocks the current thread until the queries are done. To test it, it is sufficient to "fake render" a big model of a few million triangles and test the queue. I know this is an extreme case, but it could happen on some old HW configurations.

    Quote Originally Posted by Alfonse Reinheart View Post
    But if that inaccuracy is fine for your needs, you can do it. I just don't suggest it.
    I "suggest" it just to show another approach for when absolute single-pixel precision is not strictly required.
    Rendering a model that sits behind 1 million close leaves, just because conditional rendering returns "true" for a single visible pixel, is not what I want from this technique... and I try to avoid that case.

  9. #9
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    My GPU can use 32 queries at the same time (each on different VBO) and each additional try to add another one returns an OpenGL error as expected.
    As expected? There's nothing in the specification that says that you cannot have more than some implementation-defined number of query objects at the same time. So you should not be getting an error. Also, there is no connection between buffer objects and queries.

    What function is producing this error?

  10. #10
    Intern Contributor
    Join Date
    Nov 2011
    Posts
    50
    ARB_occlusion_query has GPU limits. To get the maximum number of simultaneous queries supported by the GPU: glGetQueryiv(GL_SAMPLES_PASSED, GL_QUERY_COUNTER_BITS, maxCount)

    Occlusion test (pseudo-code):
    Code :
    glGenQueries(maxCount, arrayIDs)    // reserve maxCount query names
    for each object do
      // x must select a query that is not still waiting for its result
      glBeginQuery(GL_SAMPLES_PASSED, arrayIDs[x]);
      render();
      glEndQuery(GL_SAMPLES_PASSED);
    end for;

    arrayIDs is an array containing "maxCount" numbers. In my case I get an array of 32 consecutive GLuint numbers (arrayIDs[1..32]).
    But in the glBeginQuery() call you need to provide an "x" that refers to a valid and free number/ID, and how can you know whether a number is free or busy? It's simple: you need to check the query results as often as possible to free query resources quickly: glGetQueryObjectuiv(one_of_those_numbers, GL_QUERY_RESULT_AVAILABLE, boolean_output).
    If "boolean_output == true", it means you can consider all queries as done and you can reuse those "x" values in new glBeginQuery() calls.
    But if the CPU is many times faster than the GPU (or the GPU is old and slow), or the rendered model is very complex (millions of vertices, if you don't want to use bounding boxes), all 32 of my initially free query numbers could be busy waiting for query results.
    In this case, for an additional glBeginQuery() you don't know which "x" to use, because all available values (32 in my case) are busy. I also tried calling the function anyway, but I received an exception due to an invalid driver/OpenGL state.
    In fact, the documentation for glBeginQuery() says:
    If the query target's count exceeds the maximum value representable in the number of available bits, as reported by glGetQueryiv with target set to the appropriate query target and pname GL_QUERY_COUNTER_BITS, the count becomes undefined
    and "undefined" means "unknown state", so getting the results for an exceeding "x" produces an error.
    However, if the "x" is already busy, glBeginQuery(..., arrayIDs[x]) returns a GL_INVALID_OPERATION error, as expected.

    I've avoided this behavior with a queue/array of free "x" values: when I call glBeginQuery(..., arrayIDs[x]) I remove that "x" from the queue until a query result is available (which frees all busy "x" values). And if I call glBeginQuery() before at least one "x" value is free (= present in the queue/array), I block thread execution in a loop, waiting for results. This trick has solved my problem, but it's just a trick, and I don't much like manual tricks in software.
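
    The queue of free "x" values described above can be modelled as a small free list. The sketch below is an assumption-laden illustration, not the poster's code: `QueryPool` is a made-up name, `poll` stands in for a check of GL_QUERY_RESULT_AVAILABLE on one query ID, and each busy ID is checked individually instead of releasing all of them at once.

```python
from collections import deque


class QueryPool:
    """Free list of query IDs; `poll(qid)` reports whether that query's result is ready."""

    def __init__(self, ids):
        self.free = deque(ids)
        self.busy = []

    def acquire(self, poll):
        # Block (spin) until at least one query ID is free again.
        while not self.free:
            self.reclaim(poll)
        qid = self.free.popleft()
        self.busy.append(qid)
        return qid

    def reclaim(self, poll):
        # Move every finished query back to the free list.
        still_busy = []
        for qid in self.busy:
            if poll(qid):
                self.free.append(qid)
            else:
                still_busy.append(qid)
        self.busy = still_busy


pool = QueryPool(range(1, 33))            # 32 IDs, the count reported in the post
first = pool.acquire(lambda qid: True)    # returns ID 1 and marks it busy
```

    If `poll` never reports a finished query while the free list is empty, `acquire` spins forever, which is the blocking behaviour the post describes.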
