Part of the Khronos Group
OpenGL.org

Thread: Deferred Shading Performance Problems

  1. #1
    Junior Member Regular Contributor
    Join Date
    Mar 2012
    Posts
    115

    Deferred Shading Performance Problems

    Hello,

    I wrote a deferred renderer (deferred shading), with which I first attempted to use light stencil volumes in order to restrict the drawing region. This is done using a stencil buffer attached to the GBuffer FBO, using the format GL_DEPTH24_STENCIL8.
    However, using stenciling as opposed to just rendering the light volume with depth testing (which still overdraws in many cases) cuts the frame rate in half instead of increasing it. Is the GL_DEPTH24_STENCIL8 format particularly slow, or is such performance normal for stenciling? Stenciling only half of the light volume and then using depth testing (no overdraw) was about as slow as stenciling everything. I tried not clearing the stencil buffer and just setting the stencil op to reset stenciled fragments, but still no dice. I also made sure that it does not write to the stencil buffer when testing against it, and I unbound the lighting shader when rendering the light volume. I also tried grouping light types in order to prevent shaders from being bound/unbound too often, but that didn't affect the frame rate.

    This is pretty vague, but what constitutes "good" performance in a deferred renderer? I am trying to use light maps ATM, since I only get 70 fps in a scene with ~150 lights without any other effects on a Radeon HD6700M. The lights are frustum culled using an octree. With SSAO and edge detection AA, it goes to ~40 at times, which is unacceptable. Scene geometry is not a problem: without lighting it runs at > 450 fps. The lights fill entire rooms in the scene (not good for deferred, but it is usually only 1 light per room).
    This leads to another question: how can I best integrate light maps in a deferred renderer? I tried using an emissivity buffer in the GBuffer, which the light shaders use to reject fragments that are fully emissive, but this results in high memory usage and was just as slow as rendering all lights dynamically. With this technique, the light shaders still have to touch every fragment a light may affect, emissive or not. I also tried using stenciling to mask out emissive geometry so the lighting shaders would not render to it, but I abandoned this when simply writing to the stencil buffer while rendering the level geometry already brought the frame rate down to ~40 fps.

    Thanks for any help you can provide!

  2. #2
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    941
    Sounds strange. Stenciling should be fast and GL_DEPTH24_STENCIL8 is a native format.
    Sorry, I have no particular idea what could cause the slowdown in your case; I think we will need a few more details to figure it out.
    Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
    Technical Blog: http://www.rastergrid.com/blog/

  3. #3
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    941
    One idea: can you check whether the time is really spent on the GPU side, or whether it is the CPU cost that is high?

    You can confirm this easily by using timer queries to measure the GPU time with and without stenciling.

  4. #4
    Junior Member Regular Contributor
    Join Date
    Mar 2012
    Posts
    115
    Thank you for the quick response!

    The half-stencil version takes ~17 ms (which alone takes the fps < 60) and the unstenciled one (depth only) takes ~10 ms. So it is all on the GPU side. I only timed the lighting. Everything else is almost instant.

  5. #5
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    941
    How do you actually perform the stenciling? It shouldn't be done at the time of G-Buffer construction.

  6. #6
    Junior Member Regular Contributor Kopelrativ
    Join Date
    Apr 2011
    Posts
    212
    Originally posted by cireneikual:
    Thank you for the quick response!

    The half-stencil version takes ~17 ms (which alone takes the fps < 60) and the unstenciled one (depth only) takes ~10 ms. So it is all on the GPU side. I only timed the lighting. Everything else is almost instant.
    Just to make sure, did you time it using queries, or with real time on the CPU? The latter can be wildly misleading. Timed on the CPU, my own deferred shader appears to execute in 0.02 ms, but when using queries I find it really takes 10 ms.

  7. #7
    Senior Member OpenGL Pro Ilian Dinev
    Join Date
    Jan 2008
    Location
    Watford, UK
    Posts
    1,261
    Could you post frame rates/ms when you resize the FBOs to half width/height, and then to quarter width/height? That would show how VRAM bandwidth scales.
    Also, I'd try manually tiling the render by using 128x128 px scissor rectangles (this will cause many draw calls, but those could be very cheap).
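One way the scissor tiling suggested above could be set up is sketched below. This is only an illustration: `ScissorRect` and `makeScissorTiles` are hypothetical names, not part of OpenGL, and the sketch only computes the rectangles. The actual drawing (shown in the trailing comment) assumes a current GL context and a lighting pass of your own.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical rectangle type; its fields match the arguments of
// glScissor(x, y, width, height).
struct ScissorRect { int x, y, width, height; };

// Split a framebuffer of fbWidth x fbHeight pixels into tiles of at most
// tileSize x tileSize pixels. Tiles on the right/top edge are clipped so
// that no rectangle extends past the framebuffer.
std::vector<ScissorRect> makeScissorTiles(int fbWidth, int fbHeight, int tileSize = 128) {
    std::vector<ScissorRect> tiles;
    if (fbWidth <= 0 || fbHeight <= 0 || tileSize <= 0) return tiles;
    for (int y = 0; y < fbHeight; y += tileSize) {
        for (int x = 0; x < fbWidth; x += tileSize) {
            tiles.push_back({x, y,
                             std::min(tileSize, fbWidth - x),
                             std::min(tileSize, fbHeight - y)});
        }
    }
    return tiles;
}

// Possible use inside the lighting pass (requires a GL context):
//   glEnable(GL_SCISSOR_TEST);
//   for (const ScissorRect& t : makeScissorTiles(width, height)) {
//       glScissor(t.x, t.y, t.width, t.height);
//       drawLights();  // hypothetical: your existing light rendering
//   }
//   glDisable(GL_SCISSOR_TEST);
```

Whether this helps depends on what the renderer is bound on; comparing GPU timings with and without the tiling is the only way to tell.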

  8. #8
    Junior Member Regular Contributor
    Join Date
    Mar 2012
    Posts
    115
    I used timer queries as well as normal timing, and both gave the same results.
    At half GBuffer width/height (1/4 the number of pixels) it ran at 3.5 ms, and at 1/4 width/height (1/16 the number of pixels) it ran at 1 ms. So, it scaled pretty linearly.
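As a quick sanity check on "pretty linearly", the two reduced-resolution timings can be compared against the change in pixel count. The helper below is a hypothetical illustration, not something from the renderer being discussed: a result of 1.0 would mean the lighting time grows exactly in proportion to the number of pixels.

```cpp
// Ratio of the measured slowdown to the increase in pixel count.
// 1.0 means the cost is exactly proportional to the number of pixels
// shaded; values below 1.0 mean the cost grows more slowly than the
// pixel count.
double fillScalingRatio(double msSmall, double msLarge, double pixelFactor) {
    return (msLarge / msSmall) / pixelFactor;
}

// With the timings quoted above: going from 1/16 to 1/4 of the pixels
// is a 4x increase, and the time goes from 1 ms to 3.5 ms, so
//   fillScalingRatio(1.0, 3.5, 4.0)
// evaluates to 3.5 / 4 = 0.875.
```

A ratio this close to 1 is consistent with the lighting pass being dominated by per-pixel cost (fill or bandwidth) rather than by a fixed per-draw overhead.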
    I am not sure how to tile the render, and I don't know what benefits that would provide. Could you elaborate on that a bit more?

    Thanks for the help so far.

  9. #9
    Senior Member OpenGL Guru Dark Photon
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    2,882
    Originally posted by cireneikual:
    I wrote a deferred renderer (deferred shading), with which I first attempted to use light stencil volumes in order to restrict the drawing region. This is done using a stencil buffer attached to the GBuffer FBO, using the format GL_DEPTH24_STENCIL8.

    However, using stenciling as opposed to just rendering the light volume with depth testing (which still overdraws in many cases) cuts the frame rate in half instead of increasing it.
    I'm not too surprised by that. I have seen several talks/discussions that mention that the extra CPU/GPU effort involved in trying to limit the lighting fill by drawing light volumes (to limit fill in screen X/Y) via two-pass stencil (to limit fill with a double-ended light-volume-accurate screen Z test) can end up costing you more than just using simpler techniques. For instance, using the depth bounds test (a coarser double-ended Z test) or a single-ended depth test. Or just rendering screen-space aligned quads circumscribing the light bounding volumes, and batching those together so you're less likely to be CPU submission or state change bound.

    It just depends on what you're bound on. What's better for your situation depends on a number of things including the lighting calc complexity and fill you're trying to protect against, along with your scenes, their light source arrangement, possible viewpoints, GPU memory bandwidth, etc.

    Of course if you might potentially have a lot of overlap between light volumes, then it probably makes sense to take the screen-space quads idea a bit further and go tile-based deferred (as Ilian Dinev suggested). Essentially this means taking your screen-space aligned quads per light, and binning them by the screen-space aligned rectangular subregions of the screen (e.g. each 16x16 region of pixels) that they touch. Once binned, you render all of the lights affecting one subregion (one region of pixels) all in one go (or with a minimal set of batches), which essentially reduces your G-buffer read fill to at most 1 (or a very small number) per pixel/sample, and your lighting buffer blend fill to 1 (or a very small number) per pixel/sample. This can greatly reduce bandwidth. And some engines (like DICE's Frostbite engine IIRC) do this binning on the GPU side rather than the CPU (google for the presentation). Wouldn't recommend jumping straight to the GPU binning approach first unless you're a GPGPU master -- would try CPU first to get a feel for what it can do for you.
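The CPU-side binning step described above could look roughly like the sketch below. It is an illustration under assumptions, not code from any of the engines or presentations mentioned: `LightRect`, `TileGrid` and `binLights` are hypothetical names, and each light is assumed to have already been reduced to a screen-space pixel rectangle.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical screen-space rectangle covered by one light, in pixels.
// (x0, y0) is inclusive, (x1, y1) is exclusive.
struct LightRect { int x0, y0, x1, y1; };

struct TileGrid {
    int tilesX = 0, tilesY = 0;
    // For each tile (row-major), the indices of the lights touching it.
    std::vector<std::vector<int>> lightsPerTile;
};

// Bin each light's rectangle into every tileSize x tileSize tile it overlaps.
TileGrid binLights(const std::vector<LightRect>& lights,
                   int screenW, int screenH, int tileSize = 16) {
    TileGrid grid;
    if (screenW <= 0 || screenH <= 0 || tileSize <= 0) return grid;
    grid.tilesX = (screenW + tileSize - 1) / tileSize;  // round up
    grid.tilesY = (screenH + tileSize - 1) / tileSize;
    grid.lightsPerTile.resize(static_cast<std::size_t>(grid.tilesX) * grid.tilesY);

    for (std::size_t i = 0; i < lights.size(); ++i) {
        // Clip the rectangle to the screen; skip lights that are fully off-screen.
        int x0 = std::max(lights[i].x0, 0);
        int y0 = std::max(lights[i].y0, 0);
        int x1 = std::min(lights[i].x1, screenW);
        int y1 = std::min(lights[i].y1, screenH);
        if (x0 >= x1 || y0 >= y1) continue;

        // Inclusive range of tiles overlapped by the clipped rectangle.
        int tx0 = x0 / tileSize, tx1 = (x1 - 1) / tileSize;
        int ty0 = y0 / tileSize, ty1 = (y1 - 1) / tileSize;
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                grid.lightsPerTile[static_cast<std::size_t>(ty) * grid.tilesX + tx]
                    .push_back(static_cast<int>(i));
    }
    return grid;
}
```

Each tile's list can then be handed to a single lighting pass for that tile, so the G-buffer is read once per pixel and the results for all of the tile's lights are accumulated before writing. How the lists reach the shader (uniform arrays, a texture, a buffer) is a separate design choice that this sketch does not cover.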

    I'd suggest reading Andrew Lauritzen's SIGGRAPH 2010 Deferred Shading presentation, including the notes. Here it is:

    * Deferred Rendering for Current and Future Rendering Pipelines (Intel Page)
    * SIGGRAPH 2010 Beyond Programmable Shading course notes (see link to his presentation there)

    This is one of the better Deferred Shading write-ups out there that I've seen. The code is there too (see ZIP file).

    Deferred techniques have been out there for ~10 years (maybe more?), and there's a lot of info out there that's pretty dated. Lauritzen's talk helps cut through a lot of this and put it in perspective.

    Originally posted by cireneikual:
    Is the GL_DEPTH24_STENCIL8 format particularly slow...?
    No, not in my experience. I'd try an apples-to-apples perf comparison with it and the system FB and compare (drawing lots of well-batched geometry without state changes so that you'll be fill bound; resize the window to be sure). Stencil use with Deferred Shading (at least that I've seen) isn't really optimal -- too many state changes for too little work, increasing the chance that you end up bound on something else besides lighting fill.
    Last edited by Dark Photon; 06-15-2012 at 08:18 PM.

  10. #10
    Junior Member Regular Contributor
    Join Date
    Mar 2012
    Posts
    115
    Thanks for the info!

    I have a lot of light overlap, so I will try the tile-based method. I assume that the lights are supposed to be passed to the shaders using fixed-size arrays or the built-in forward rendering lighting stuff.
