Deferred Shading Performance Problems

Hello,

I wrote a deferred renderer (deferred shading), with which I first attempted to use light stencil volumes in order to restrict the drawing region. This is done using a stencil buffer attached to the GBuffer FBO, using the format GL_DEPTH24_STENCIL8.
However, using stenciling as opposed to just rendering the light volume with depth testing (which still overdraws in many cases) cuts the frame rate in half instead of increasing it. Is the GL_DEPTH24_STENCIL8 format particularly slow, or is this performance normal for stenciling? Things I have tried so far:

- Stenciling only half of the light volume and then using depth testing (no overdraw): about as slow as stenciling everything.
- Not clearing the stencil buffer and instead setting the stencil op to reset stenciled fragments: still no dice.
- Making sure nothing writes to the stencil buffer while testing against it, and unbinding the lighting shader when rendering the light volume.
- Grouping light types so shaders are bound/unbound less often: no effect on the frame rate.
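In case it helps diagnose, the stencil pass is basically the standard two-sided volume approach. A simplified sketch of what I mean (drawLightVolume() and drawLightVolumeLit() stand in for my actual submission code):

[code]
glEnable(GL_STENCIL_TEST);

// Pass 1: mark pixels inside the light volume. No color or depth writes.
glDrawBuffer(GL_NONE);
glDepthMask(GL_FALSE);
glEnable(GL_DEPTH_TEST);
glDisable(GL_CULL_FACE);
glStencilMask(0xFF);
glStencilFunc(GL_ALWAYS, 0, 0xFF);
glStencilOpSeparate(GL_BACK,  GL_KEEP, GL_INCR_WRAP, GL_KEEP); // back faces: ++ on depth fail
glStencilOpSeparate(GL_FRONT, GL_KEEP, GL_DECR_WRAP, GL_KEEP); // front faces: -- on depth fail
drawLightVolume();

// Pass 2: shade only where the stencil is non-zero; stencil writes disabled.
glDrawBuffer(GL_COLOR_ATTACHMENT0); // or wherever the light accumulation goes
glStencilMask(0x00);
glStencilFunc(GL_NOTEQUAL, 0, 0xFF);
glDisable(GL_DEPTH_TEST);
glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT); // front faces may be clipped by the near plane
drawLightVolumeLit();
glCullFace(GL_BACK);
// stencil is cleared (or zeroed by pass 2's ops) before the next light
[/code]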

This is pretty vague, but what constitutes “good” performance in a deferred renderer? I am trying to use light maps ATM, since I only get 70 fps in a scene with ~150 lights, without any other effects, on a Radeon HD 6700M. The lights are frustum culled using an octree. With SSAO and edge-detection AA it drops to ~40 at times, which is unacceptable. Scene geometry is not the problem; without lighting it runs at > 450 fps. The lights fill entire rooms in the scene (not good for deferred, but it is usually only 1 light per room).
This leads to another question: how can I best integrate light maps into a deferred renderer? I tried using an emissivity buffer in the GBuffer, which the light shaders use to reject fragments that are fully emissive, but this uses a lot of memory and was just as slow as rendering all lights dynamically; with this technique the lighting shaders still have to touch every fragment a light may affect, emissive or not. I also tried using stenciling to mask out emissive geometry so the lighting shaders never render to it, but I abandoned this when simply writing to the stencil buffer while rendering the level geometry already brought the frame rate down to ~40 fps.

Thanks for any help you can provide!

Sounds strange. Stenciling should be fast, and GL_DEPTH24_STENCIL8 is a native format.
Sorry, I don't have any particular ideas about what could cause the slowdown in your case, but I think we will need a bit more detail to figure it out.

One idea: can you check whether the time is really being spent on the GPU side, or whether the CPU cost is that high?

You can confirm this easily by using timer queries to measure the GPU time with and without stenciling.
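A minimal sketch, assuming a GL 3.3 context (or ARB_timer_query) and a loader like GLEW; the blocking readback is fine for quick profiling:

[code]
#include <GL/glew.h> // or your preferred loader
#include <cstdio>

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
// ... issue the lighting pass draw calls here ...
glEndQuery(GL_TIME_ELAPSED);

// Blocks until the GPU has finished the enclosed commands; fine for
// profiling, not for shipping code.
GLuint64 elapsed = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed);
printf("lighting pass: %.3f ms\n", elapsed / 1.0e6);

glDeleteQueries(1, &query);
[/code]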

Thank you for the quick response!

The half-stencil version takes ~17 ms (which alone pushes the fps below 60) and the unstenciled one (depth only) takes ~10 ms. So it is all on the GPU side. I only timed the lighting; everything else is almost instant.

How do you actually perform the stenciling? It shouldn’t be done at G-Buffer construction time.

[QUOTE=cireneikual;1238871]Thank you for the quick response!

The half-stencil version takes ~17 ms (which alone pushes the fps below 60) and the unstenciled one (depth only) takes ~10 ms. So it is all on the GPU side. I only timed the lighting; everything else is almost instant.[/QUOTE]

Just to make sure, did you time it using queries, or real (wall-clock) time on the CPU? The latter can be wildly misleading. CPU timing says my own deferred shader executes in 0.02 ms, but with queries I find it really takes 10 ms.

Could you post framerates/ms when you resize the FBOs to half width/height, and then to quarter width/height? That would show how VRAM bandwidth scales.
Also, I’d try manually tiling the render using 128x128 px scissor rectangles (it will cause many draw calls, but each could be very cheap).
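Roughly like this (fbWidth/fbHeight and drawLightsTouchingTile() are placeholders for your own framebuffer size and light submission):

[code]
// Clip each lighting draw to a 128x128 window, so the GPU works on a
// small, cache-friendly region at a time.
glEnable(GL_SCISSOR_TEST);
for (int y = 0; y < fbHeight; y += 128)
    for (int x = 0; x < fbWidth; x += 128) {
        glScissor(x, y, 128, 128);
        drawLightsTouchingTile(x, y); // draw only the lights hitting this tile
    }
glDisable(GL_SCISSOR_TEST);
[/code]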

I used timer queries as well as normal timing, and both gave the same results.
At half GBuffer width/height (1/4 the pixels) it took 3.5 ms, and at 1/4 width/height (1/16 the pixels) it took 1 ms. So it scaled pretty linearly with resolution.
I am not sure how to tile the render, and I don’t know what benefits that would provide. Could you elaborate on that a bit more?

Thanks for the help so far.

[QUOTE=cireneikual;1238860]I wrote a deferred renderer (deferred shading), with which I first attempted to use light stencil volumes in order to restrict the drawing region. This is done using a stencil buffer attached to the GBuffer FBO, using the format GL_DEPTH24_STENCIL8.

However, using stenciling as opposed to just rendering the light volume with depth testing (which still overdraws in many cases) cuts the frame rate in half instead of increasing it.[/QUOTE]

I’m not too surprised by that. I have seen several talks/discussions mention that the extra CPU/GPU effort involved in trying to limit the lighting fill by drawing light volumes (to limit fill in screen X/Y) via two-pass stencil (to limit fill with a double-ended, light-volume-accurate screen Z test) can end up costing you more than just using simpler techniques. For instance, using the depth bounds test (a coarser double-ended Z test) or a single-ended depth test. Or just rendering screen-space aligned quads circumscribing the light bounding volumes, and batching those together so you’re less likely to be CPU submission or state-change bound.
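For example, the depth bounds option looks something like this (EXT_depth_bounds_test; availability varies, so check for it; zMin/zMax are the light volume's window-space depth extents, assumed computed on the CPU, and drawScreenSpaceLightQuad() is a placeholder):

[code]
if (GLEW_EXT_depth_bounds_test) {
    glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
    // Discards fragments whose *stored* depth-buffer value lies outside
    // [zMin, zMax]: a coarse double-ended Z test, no stencil passes needed.
    glDepthBoundsEXT(zMin, zMax);
    drawScreenSpaceLightQuad(); // quad circumscribing the light's bounds
    glDisable(GL_DEPTH_BOUNDS_TEST_EXT);
}
[/code]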

It just depends on what you’re bound on. What’s better for your situation depends on a number of things including the lighting calc complexity and fill you’re trying to protect against, along with your scenes, their light source arrangement, possible viewpoints, GPU memory bandwidth, etc.

Of course, if you might potentially have a lot of overlap between light volumes, then it probably makes sense to take the screen-space quads idea a bit further and go tile-based deferred (as Ilian Dinev suggested). Essentially this means taking your screen-space aligned quads per light and binning them by the screen-space rectangular subregions of the screen (e.g. each 16x16 region of pixels) that they touch. Once binned, you render all of the lights affecting one subregion (one region of pixels) in one go (or with a minimal set of batches), which reduces your G-buffer read fill to at most 1 (or a very small number) per pixel/sample, and your lighting buffer blend fill to 1 (or a very small number) per pixel/sample. This can greatly reduce bandwidth. Some engines (like DICE’s Frostbite engine, IIRC) do this binning on the GPU side rather than the CPU (google for the presentation). I wouldn’t recommend jumping straight to the GPU binning approach unless you’re a GPGPU master; try CPU binning first to get a feel for what it can do for you.
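To make the CPU binning concrete, a minimal sketch (ScreenRect and the 16px tile size are just for illustration; the per-light pixel-space AABBs are assumed to be computed beforehand, e.g. by projecting each light's bounding sphere):

[code]
#include <vector>
#include <algorithm>

struct ScreenRect { int minX, minY, maxX, maxY; }; // pixel-space AABB of a light volume

// Bin each light into every 16x16 tile its screen-space AABB touches.
// bins[t] ends up holding the indices of the lights affecting tile t.
std::vector<std::vector<int>> binLights(const std::vector<ScreenRect>& rects,
                                        int screenW, int screenH)
{
    const int TILE = 16;
    const int tilesX = (screenW + TILE - 1) / TILE;
    const int tilesY = (screenH + TILE - 1) / TILE;
    std::vector<std::vector<int>> bins(tilesX * tilesY);

    for (int i = 0; i < (int)rects.size(); ++i) {
        const ScreenRect& r = rects[i];
        int x0 = std::max(r.minX / TILE, 0), y0 = std::max(r.minY / TILE, 0);
        int x1 = std::min(r.maxX / TILE, tilesX - 1), y1 = std::min(r.maxY / TILE, tilesY - 1);
        for (int ty = y0; ty <= y1; ++ty)
            for (int tx = x0; tx <= x1; ++tx)
                bins[ty * tilesX + tx].push_back(i); // light i affects this tile
    }
    return bins;
}
// Then, per tile: upload bins[t] (uniform array, UBO, ...) and run the
// lighting shader once over that tile's pixels.
[/code]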

I’d suggest reading Andrew Lauritzen’s SIGGRAPH 2010 Deferred Shading presentation, including the notes. Here it is:

This is one of the better Deferred Shading write-ups out there that I’ve seen. The code is there too (see ZIP file).

Deferred techniques have been out there for ~10 years (maybe more?), and there’s a lot of info out there that’s pretty dated. Lauritzen’s talk helps cut through a lot of this and put it in perspective.

Is the GL_DEPTH24_STENCIL8 format particularly slow…?

No, not in my experience. I’d do an apples-to-apples perf comparison between it and the system framebuffer (drawing lots of well-batched geometry without state changes so that you’re fill bound; resize the window to be sure). Stencil use with Deferred Shading (at least what I’ve seen) isn’t really optimal: too many state changes for too little work, increasing the chance that you end up bound on something other than lighting fill.

Thanks for the info!

I have a lot of light overlap, so I will try the tile-based method. I assume that the lights are supposed to be passed to the shaders using fixed-size arrays or the built-in forward-rendering lighting stuff.

Uniform arrays, UBOs, textures, …whatever works.
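E.g. a UBO version might look roughly like this (GL 3.1+ / ARB_uniform_buffer_object; MAX_LIGHTS, program, numLights, and lightArray are just illustrative names):

[code]
// GLSL side assumed to be:
//   struct Light { vec4 posRadius; vec4 colorIntensity; };
//   layout(std140) uniform LightBlock { Light lights[MAX_LIGHTS]; };
const int MAX_LIGHTS = 64; // pick to taste

struct LightData {              // matches the std140 layout above (two vec4s, stride 32)
    float posRadius[4];         // xyz = position, w = radius
    float colorIntensity[4];    // rgb = color,    w = intensity
};

// init
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(LightData) * MAX_LIGHTS, NULL, GL_DYNAMIC_DRAW);
glUniformBlockBinding(program, glGetUniformBlockIndex(program, "LightBlock"), 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);

// per tile/batch: upload just the lights you need, then draw
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, numLights * sizeof(LightData), lightArray);
[/code]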

I am nearly done coding the tiled version, but I ran into a problem: I decide which lights are assigned to which tiles by rendering the light volumes to an FBO with the same resolution as the tile grid, but if a volume is completely encompassed by a single tile, it disappears (the resolution is too low for it to cover any fragment). Is there some way I can render to the FBO such that a fragment is written as long as any geometry touches it?

Maybe push the corners away from the center toward the next pixel boundary in the vertex shader? Just tweak the clip-space position before you output it.
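Something like this, roughly (uniform names are made up; pushing by a full grid cell is simpler than snapping to the next boundary and still guarantees coverage; shader compile/link code omitted, and the volume is assumed fully in front of the camera):

[code]
const char* expandVolumeVS = R"(
    #version 120
    uniform vec2 uCenterNdc;  // light volume center, in NDC
    uniform vec2 uCellSize;   // 2.0 / gridResolution: NDC size of one grid cell

    void main()
    {
        vec4 clip = gl_ModelViewProjectionMatrix * gl_Vertex;
        vec2 ndc  = clip.xy / clip.w;
        // push each vertex away from the volume's center by one grid cell,
        // so even a volume smaller than a cell still covers a fragment
        ndc += sign(ndc - uCenterNdc) * uCellSize;
        gl_Position = vec4(ndc * clip.w, clip.z, clip.w);
    }
)";
[/code]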

Alright, I got the tiled deferred shading to work, but it actually runs significantly slower than the normal deferred shading. Using timer queries, I found that the problem wasn’t on the GPU side; the CPU was taking a lot of time to set up the lights after the light grid had been generated. Is glLightfv particularly slow? I have to call it hundreds of times each frame to set up the lights for each grid cell.

What? You use glLightfv? For deferred rendering? :confused:

Yes, since I am using tile-based deferred shading: http://visual-computing.intel-research.net/art/publications/deferred_rendering/
It groups lights together based on tiles in order to reduce g-buffer look-ups.

[QUOTE=cireneikual;1239081]Yes, since I am using tile-based deferred shading: http://visual-computing.intel-research.net/art/publications/deferred_rendering/
It groups lights together based on tiles in order to reduce g-buffer look-ups.[/QUOTE]
But why do you use glLightfv? Why don’t you use uniform buffers, or at least plain uniforms?

It was the easiest to implement. Why not use it? Is it slow?

I can barely believe that it is in any way easier to implement than using uniforms.

Maybe because it’s deprecated, or because it’s not really general purpose, or because you are bound to 8 lights per draw call?

I would bet it is, though I don’t know; I haven’t called that function since the advent of programmable shaders.

Though, after this, I have doubts whether you actually implemented deferred rendering, as that requires shaders, and I (fortunately) haven’t met anybody yet who used glLightfv with shaders.

I betcha you have and have just forgotten :wink:

Remember GLSL pre-1.2 and gl_LightSource[0].diffuse, etc. (and the similar Cg syntax in the ARB profiles)? Training wheels that let you implement the fixed-function pipeline and variants in shaders.