Deferred Shading Performance Problems



cireneikual
06-15-2012, 08:25 AM
Hello,

I wrote a deferred renderer (deferred shading), with which I first attempted to use light stencil volumes in order to restrict the drawing region. This is done using a stencil buffer attached to the GBuffer FBO, using the format GL_DEPTH24_STENCIL8.
However, using stenciling as opposed to just rendering the light volume with depth testing (which still overdraws in many cases) cuts the frame rate in half instead of increasing it. Is the GL_DEPTH24_STENCIL8 format particularly slow, or is such performance normal for stenciling? Stenciling only half of the light volume and then using depth testing (no overdraw) was about as slow as stenciling everything. I tried not clearing the stencil buffer and just setting the stencil op to reset stenciled fragments, but still no dice. I also made sure that it does not write to the stencil buffer when testing against it, and I unbound the lighting shader when rendering the light volume. I also tried grouping light types in order to prevent shaders from being bound/unbound too often, but that didn't affect the frame rate.
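For reference, the two-pass stencil setup I'm describing is roughly the standard one, something along these lines (just a sketch, not my exact code; drawLightVolume() stands in for the light-volume draw call):

// Pass 1: mark the pixels where scene geometry falls inside the light volume.
glEnable(GL_STENCIL_TEST);
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_FALSE);                                          // leave depth untouched
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glClear(GL_STENCIL_BUFFER_BIT);
glDisable(GL_CULL_FACE);                                        // need both faces of the volume
glStencilFunc(GL_ALWAYS, 0, 0xFF);
glStencilOpSeparate(GL_BACK,  GL_KEEP, GL_INCR_WRAP, GL_KEEP);  // back faces: ++ on depth fail
glStencilOpSeparate(GL_FRONT, GL_KEEP, GL_DECR_WRAP, GL_KEEP);  // front faces: -- on depth fail
drawLightVolume();                                              // depth/stencil only, no lighting shader

// Pass 2: run the lighting shader only where the stencil ended up non-zero.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT);                                           // still works when the camera is inside the volume
glDisable(GL_DEPTH_TEST);
glStencilFunc(GL_NOTEQUAL, 0, 0xFF);
glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);                         // read-only stencil in this pass
drawLightVolume();                                              // lighting shader bound

glCullFace(GL_BACK);
glDepthMask(GL_TRUE);
glDisable(GL_STENCIL_TEST);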

This is pretty vague, but what constitutes "good" performance in a deferred renderer? I am trying to use light maps ATM, since I only get 70 fps in a scene with ~150 lights, without any other effects, on a Radeon HD6700M. The lights are frustum culled using an octree. With SSAO and edge-detection AA, it drops to ~40 at times, which is unacceptable. Scene geometry is not a problem; without lighting it runs at >450 fps. The lights fill entire rooms in the scene (not good for deferred, but it is usually only 1 light per room).
This leads to another question: how can I best integrate light maps into a deferred renderer? I tried adding an emissivity buffer to the GBuffer, which the light shaders use to reject fragments that are fully emissive, but this uses a lot of memory and was just as slow as rendering all lights dynamically; the lighting pass still has to touch every fragment a light may affect, emissive or not. I also tried stenciling out emissive geometry to keep the lighting shaders from touching it, but I abandoned that when merely writing to the stencil buffer while rendering the level geometry already brought the frame rate down to ~40 fps.

Thanks for any help you can provide!

aqnuep
06-15-2012, 08:43 AM
Sounds strange. Stenciling should be fast and GL_DEPTH24_STENCIL8 is a native format.
Sorry, I don't have any particular ideas about what could cause the slowdown in your case, but I think we will need a bit more detail to figure it out.

aqnuep
06-15-2012, 08:58 AM
One idea: can you check that the time is really spent on the GPU side or the CPU cost is that high?

You can confirm this easily by using timer queries (http://www.opengl.org/registry/specs/ARB/timer_query.txt) to measure the GPU time with and without stenciling.
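Something like this is enough to measure a single pass (just a sketch; lightingPass() is a placeholder for your own code, and reading the result right away forces a sync, so in practice you'd read last frame's query instead):

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
lightingPass();                                              // the work being measured
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);   // waits for the GPU to finish
printf("lighting pass: %.3f ms\n", elapsedNs / 1.0e6);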

cireneikual
06-15-2012, 10:56 AM
Thank you for the quick response!

The half-stencil version takes ~17 ms (which alone takes the fps < 60) and the unstenciled one (depth only) takes ~10 ms. So it is all on the GPU side. I only timed the lighting. Everything else is almost instant.

aqnuep
06-15-2012, 11:47 AM
How do you actually perform the stenciling? It shouldn't be done at the time of G-Buffer construction.

Kopelrativ
06-15-2012, 01:49 PM
Thank you for the quick response!

The half-stencil version takes ~17 ms (which alone takes the fps < 60) and the unstenciled one (depth only) takes ~10 ms. So it is all on the GPU side. I only timed the lighting. Everything else is almost instant.

Just to make sure, did you time it using queries, or wall-clock time on the CPU? The latter can be wildly misleading. My own deferred shader appears to execute in 0.02 ms when timed on the CPU, but when using queries I find it really takes 10 ms.

Ilian Dinev
06-15-2012, 03:10 PM
Could you post framerates/ms when you resize the FBOs to half width/height, and then to quarter width/height? That would show how VRAM bandwidth scales.
Also, I'd try manually tiling the render using 128x128 px scissor rectangles (it will cause many draw calls, but those could be very cheap).
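Something along these lines, as a rough sketch (drawLightsForRegion() is a placeholder for whatever issues the lighting draws for that tile):

const int tileSize = 128;
glEnable(GL_SCISSOR_TEST);
for (int y = 0; y < screenHeight; y += tileSize)
{
    for (int x = 0; x < screenWidth; x += tileSize)
    {
        // Clamp the last row/column of tiles to the screen edge.
        int w = (x + tileSize > screenWidth)  ? (screenWidth  - x) : tileSize;
        int h = (y + tileSize > screenHeight) ? (screenHeight - y) : tileSize;

        glScissor(x, y, w, h);                // all rasterization is clipped to this rect
        drawLightsForRegion(x, y, w, h);      // placeholder: only the lights touching this tile
    }
}
glDisable(GL_SCISSOR_TEST);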

cireneikual
06-15-2012, 03:49 PM
I used timer queries as well as normal timing, and they gave the same results.
At half GBuffer width/height (1/4 the number of pixels) it ran at 3.5 ms, and at 1/4 width/height (1/16 the pixels) it ran at 1 ms. So it scaled pretty linearly with resolution.
I am not sure how to tile the render, and I don't know what benefits that would provide. Could you elaborate on that a bit more?

Thanks for the help so far.

Dark Photon
06-15-2012, 07:22 PM
I wrote a deferred renderer (deferred shading), with which I first attempted to use light stencil volumes in order to restrict the drawing region. This is done using a stencil buffer attached to the GBuffer FBO, using the format GL_DEPTH24_STENCIL8.

However, using stenciling as opposed to just rendering the light volume with depth testing (which still overdraws in many cases) cuts the frame rate in half instead of increasing it.

I'm not too surprised by that. I've seen several talks/discussions mention that the extra CPU/GPU effort involved in trying to limit the lighting fill by drawing light volumes (to limit fill in screen X/Y) via two-pass stencil (to limit fill with a double-ended, light-volume-accurate screen Z test) can end up costing you more than just using simpler techniques. For instance, using the depth bounds test (a coarser double-ended Z test) or a single-ended depth test. Or just rendering screen-space-aligned quads circumscribing the light bounding volumes, and batching those together so you're less likely to be CPU-submission or state-change bound.
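The depth bounds test in particular is cheap to try, since it's only a couple of calls once you've computed the light's window-space Z range (this sketch assumes EXT_depth_bounds_test is exposed and that lightZMin/lightZMax come from projecting your own bounding volume):

if (GLEW_EXT_depth_bounds_test)                // or however you check for extensions
{
    glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
    glDepthBoundsEXT(lightZMin, lightZMax);    // window-space depths in [0,1]
    drawLightQuadOrVolume();                   // placeholder for the lighting draw
    glDisable(GL_DEPTH_BOUNDS_TEST_EXT);
}

Fragments whose stored depth falls outside [lightZMin, lightZMax] get rejected before your lighting shader runs.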

It just depends on what you're bound on. What's better for your situation depends on a number of things including the lighting calc complexity and fill you're trying to protect against, along with your scenes, their light source arrangement, possible viewpoints, GPU memory bandwidth, etc.

Of course, if you might potentially have a lot of overlap between light volumes, then it probably makes sense to take the screen-space quads idea a bit further and go tile-based deferred (as Ilian Dinev suggested). Essentially this means taking your screen-space aligned quads per light and binning them by the screen-space aligned rectangular subregions of the screen (e.g. each 16x16 region of pixels) that they touch. Once binned, you render all of the lights affecting one subregion (one region of pixels) in one go (or with a minimal set of batches), which essentially reduces your G-buffer read fill to at most 1 (or a very small number) per pixel/sample, and your lighting buffer blend fill to 1 (or a very small number) per pixel/sample. That can greatly reduce bandwidth. And some engines (like DICE's FrostBite engine IIRC) do this binning on the GPU side rather than the CPU (google for the presentation). I wouldn't recommend jumping straight to the GPU binning approach unless you're a GPGPU master -- I'd try CPU first to get a feel for what it can do for you.
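The CPU-side binning itself can be pretty simple -- conceptually something like this sketch (projectToTileRect() and the light list are placeholders for your own data):

#include <vector>

struct TileRect { int x0, y0, x1, y1; };                       // inclusive, in tile coordinates

const int tileSize = 16;
const int tilesX = (screenWidth  + tileSize - 1) / tileSize;
const int tilesY = (screenHeight + tileSize - 1) / tileSize;
std::vector< std::vector<int> > tileLights(tilesX * tilesY);   // light indices per tile

for (int i = 0; i < (int)lights.size(); ++i)
{
    // Placeholder: conservative screen rect of the light's bounding volume,
    // clamped to the screen and converted to tile coordinates.
    TileRect r = projectToTileRect(lights[i], tileSize);

    for (int ty = r.y0; ty <= r.y1; ++ty)
        for (int tx = r.x0; tx <= r.x1; ++tx)
            tileLights[ty * tilesX + tx].push_back(i);
}

// Each tile then gets shaded once, with tileLights[tile] uploaded as that
// tile's light list (uniform array, UBO, texture, whatever).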

I'd suggest reading Andrew Lauritzen's SIGGRAPH 2010 Deferred Shading presentation, including the notes. Here it is:

* Deferred Rendering for Current and Future Rendering Pipelines (http://software.intel.com/en-us/articles/deferred-rendering-for-current-and-future-rendering-pipelines/) (Intel Page)
* SIGGRAPH 2010 Beyond Programmable Shading course notes (http://bps10.idav.ucdavis.edu/) (see link to his presentation there)

This is one of the better Deferred Shading write-ups out there that I've seen. The code is there too (see ZIP file).

Deferred techniques have been out there for ~10 years (maybe more?), and there's a lot of info out there that's pretty dated. Lauritzen's talk helps cut through a lot of this and put it in perspective.


Is the GL_DEPTH24_STENCIL8 format particularly slow...?
No, not in my experience. I'd do an apples-to-apples perf comparison between it and the system FB (drawing lots of well-batched geometry without state changes so that you'll be fill bound; resize the window to be sure). Stencil use with Deferred Shading (at least that I've seen) isn't really optimal -- too many state changes for too little work, increasing the chance that you end up bound on something else besides lighting fill.

cireneikual
06-16-2012, 07:42 AM
Thanks for the info!

I have a lot of light overlap, so I will try the tile based method. I assume that the lights are supposed to be passed to the shaders using fixed size arrays or the built in forward rendering lighting stuff.

Dark Photon
06-16-2012, 02:09 PM
I have a lot of light overlap, so I will try the tile based method. I assume that the lights are supposed to be passed to the shaders using fixed size arrays or the built in forward rendering lighting stuff.
Uniform arrays, UBOs, textures, ...whatever works.
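If you go the UBO route, the setup is roughly like this sketch (the LightsBlock layout, binding point 0, and buildLightsBlock() are all placeholders -- the std140 layout just has to match whatever block you declare in the shader):

// CPU mirror of a std140 block such as:
//   layout(std140) uniform Lights { vec4 posRange[64]; vec4 colorIntensity[64]; int count; };
struct LightsBlock
{
    float posRange[64][4];        // xyz = position, w = range
    float colorIntensity[64][4];  // rgb = color,    w = intensity
    int   count;
    int   pad[3];                 // keep the block 16-byte aligned
};

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(LightsBlock), 0, GL_DYNAMIC_DRAW);

// Per frame (or per batch of tiles): refill and re-upload.
LightsBlock block = buildLightsBlock(visibleLights);            // placeholder
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(block), &block);

// Attach to binding point 0 and point the shader's block at it (once is enough).
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);
glUniformBlockBinding(program, glGetUniformBlockIndex(program, "Lights"), 0);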

cireneikual
06-16-2012, 06:31 PM
I am nearly done coding the tiled version, but I ran into a problem: I decide which lights are assigned to which tiles by rendering the light volumes to an FBO with the same resolution as the tile grid, but if a volume is completely contained within a single tile, it disappears (the resolution is too low). Is there some way I can render to the FBO such that it will render to fragments as long as geometry is touching them?

Dark Photon
06-16-2012, 07:01 PM
Is there some way I can render to the FBO such that it will render to fragments as long as geometry is touching them?
Maybe in the vtx shader push the corners away from the center toward the next pixel boundary? Just tweak the clip-space position you'd output before you're done.

cireneikual
06-20-2012, 10:52 AM
Alright, I got the tiled deferred shading to work, but it actually runs significantly slower than the normal deferred shading. Using timer queries, I found that the problem wasn't on the GPU side, but the CPU was taking a lot of time to set up the lights after the light grid has been generated. Is glLightfv particularly slow? I have to call it hundreds of times each frame in order to set up the lights for each grid cell.

aqnuep
06-20-2012, 11:46 AM
Alright, I got the tiled deferred shading to work, but it actually runs significantly slower than the normal deferred shading. Using timer queries, I found that the problem wasn't on the GPU side, but the CPU was taking a lot of time to set up the lights after the light grid has been generated. Is glLightfv particularly slow? I have to call it hundreds of times each frame in order to set up the lights for each grid cell.
What? You use glLightfv? For deferred rendering? :confused:

cireneikual
06-20-2012, 12:11 PM
Yes, since I am using tile-based deferred shading: http://visual-computing.intel-research.net/art/publications/deferred_rendering/
It groups lights together based on tiles in order to reduce g-buffer look-ups.

aqnuep
06-20-2012, 12:41 PM
Yes, since I am using tile-based deferred shading: http://visual-computing.intel-research.net/art/publications/deferred_rendering/
It groups lights together based on tiles in order to reduce g-buffer look-ups.
But why do you use glLightfv? Why don't you use uniform buffers or at least plain uniforms?

cireneikual
06-20-2012, 01:15 PM
It was the easiest to implement. Why not use it? Is it slow?

aqnuep
06-20-2012, 01:48 PM
It was the easiest to implement.
I can hardly believe that it is in any way easier to implement than using uniforms.

Why not use it?
Maybe because it's deprecated, or because it's not really general purpose, or because you are bound to 8 lights per draw call?

Is it slow?
I would bet it is, though I don't know; I haven't called that function since the advent of programmable shaders.

Though after this, I have doubts about whether you actually implemented deferred rendering, as that requires shaders, and I (fortunately) haven't met anybody yet who used glLightfv with shaders.

Dark Photon
06-20-2012, 04:28 PM
...I (fortunately) haven't met anybody yet who used glLightfv with shaders.
I betcha you have and have just forgotten ;)

Remember GLSL pre-1.2 and gl_LightSource[0].diffuse, etc. (and similar Cg syntax in the arb profiles?) -- training wheels that let you implement fixed function pipeline and variants in shaders?

Dark Photon
06-20-2012, 04:34 PM
Alright, I got the tiled deferred shading to work...Is glLightfv particularly slow? I have to call it hundreds of times each frame in order to set up the lights for each grid cell.
If you've got a bunch of calls to an API which does so very, very little? Oh yeah! You want to minimize your state change API calls.

Plus as aqnuep pointed out, being limited to only 8 light sources per batch is not good if you're talking potentially a lot of lights.

Plus some of those fixed-function light related calls potentially direct the driver to recompile, relink, and reupload shaders under-the-hood while you're rendering -- not pretty. You're at the mercy of the driver for how many shader permutations it keeps track of and avoids rebuilding.

Shader-based lighting gives you a lot more control and efficiency potential.

aqnuep
06-20-2012, 06:04 PM
I betcha you have and have just forgotten ;)

Remember GLSL pre-1.2 and gl_LightSource[0].diffuse, etc. (and similar Cg syntax in the arb profiles?) -- training wheels that let you implement fixed function pipeline and variants in shaders?

No, I mean I know it's possible, but I've never heard of anybody using those in a deferred renderer. That's kind of new to me.

cireneikual
06-22-2012, 07:01 AM
I tried uniform arrays, but they were even slower than the gl_LightSource[...]... stuff. So now I am trying UBOs: the shader compiles just fine (no errors), but it fails to validate and returns a garbage error log length. Any ideas as to what may cause this?

The shader:



// G buffer
uniform sampler2D gPosition;
uniform sampler2D gDiffuse;
uniform sampler2D gSpecular;
uniform sampler2D gNormal;

// Specularity info
uniform vec3 viewerPosition;
uniform float shininess;

uniform Light
{
    vec3 position;
    vec3 color;
    float range;
    float intensity;
} lightData[16];

uniform int numLights;

uniform vec3 lightAttenuation;

void main()
{
    vec3 worldPos = texture2D(gPosition, gl_TexCoord[0].st).xyz;
    vec3 worldNormal = texture2D(gNormal, gl_TexCoord[0].st).xyz;

    vec3 sceneDiffuse = texture2D(gDiffuse, gl_TexCoord[0].st).rgb;
    vec3 sceneSpecular = texture2D(gSpecular, gl_TexCoord[0].st).rgb;

    vec4 finalColor = vec4(0.0, 0.0, 0.0, 0.0);

    for(int i = 0; i < numLights; i++)
    {
        vec3 lightDir = lightData[i].position - worldPos;
        float dist = length(lightDir);

        float lightRange = lightData[i].range;

        if(dist > lightRange)
            continue;

        lightDir /= dist;

        float lambert = dot(lightDir, worldNormal);

        if(lambert <= 0.0)
            continue;

        float fallOff = max(0.0, (lightRange - dist) / lightRange);

        float attenuation = clamp(fallOff * lightData[i].intensity * (1.0 / (lightAttenuation.x + lightAttenuation.y * dist + lightAttenuation.z * dist * dist)), 0.0, 1.0);

        // Specular
        vec3 lightRay = reflect(normalize(-lightDir), worldNormal);
        float specularIntensity = attenuation * pow(max(0.0, dot(lightRay, normalize(viewerPosition - worldPos))), shininess);
        specularIntensity = max(0.0, specularIntensity);

        finalColor += vec4(sceneDiffuse * attenuation * lambert * lightData[i].color + sceneSpecular * specularIntensity * lightData[i].color, 0.0);
    }

    gl_FragColor = finalColor;
}


Shader validation:



bool Shader::Finalize(unsigned int id)
{
    glLinkProgram(id);
    glValidateProgram(id);

    // Check if validation was successful
    int result;

    glGetProgramiv(id, GL_VALIDATE_STATUS, &result);

    if(result == GL_FALSE)
    {
        // Not validated, print out the log
        int logLength;

        glGetShaderiv(id, GL_INFO_LOG_LENGTH, &logLength);

        if(logLength <= 0)
        {
            std::cerr << "Unable to validate shader: Error: Invalid log length \"" << logLength << "\": Could not retrieve error log!" << std::endl;

            return false;
        }

        // Allocate the string
        char* log = new char[logLength];

        glGetProgramInfoLog(id, logLength, &result, log);

        std::cerr << "Unable to compiler program: " << log << std::endl;

        delete log;

        return false;
    }

    return true;
}


Sorry about jumping from one problem to the next! The information you have given me so far has been very helpful!

Dark Photon
06-22-2012, 07:31 AM
I tried uniform arrays, but they were even slower than the gl_LightSource[...]... stuff.

I'm not sure what you're doing but in my experience uniform arrays are very fast. Also, I wasn't trying to suggest this is potentially a uniform arrays vs. gl_LightSource issue (GLSL side) but rather a uniform arrays vs. glLightfv issue (CPU side). In the uniform array case, you can set up all your light attributes in one call (per light attribute, or for all). Whereas with glLightfv you set up each attribute for each light with its own call. But again, it goes back to what you are bound on. And we don't know that yet.
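To make the call-count difference concrete, the comparison I meant is roughly this (a sketch with placeholder variable names, assuming light data packed as vec4s):

// glLightfv path: several driver calls per light, per tile, capped at 8 lights.
for (int i = 0; i < numLightsInTile; ++i)
{
    glLightfv(GL_LIGHT0 + i, GL_POSITION, lightPos[i]);
    glLightfv(GL_LIGHT0 + i, GL_DIFFUSE,  lightColor[i]);
    // ...and so on for every attribute of every light.
}

// Uniform-array path: one call per attribute for the whole tile.
glUniform4fv(posRangeLoc,       numLightsInTile, &packedPosRange[0][0]);        // count = number of vec4s
glUniform4fv(colorIntensityLoc, numLightsInTile, &packedColorIntensity[0][0]);
glUniform1i(numLightsLoc, numLightsInTile);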

If I were you I'd do some profiling on the CPU and GPU side. How much time are you actually using for culling, light rebining, drawing, etc? How many state changes are you doing for how many batches? What's your min/max batch size? Use gDEBugger (or other tool) to dump all the GL calls you're making in a frame and give it a scan for clues.

cireneikual
06-22-2012, 10:39 AM
I didn't try using gDEBugger yet, but I'll look into it :). I did some profiling using timers (timer queries and normal timers), and found that:
- Culling/tiling lights takes 0 ms (below timer resolution) to 2 ms. I tried various tile sizes (16x16, 32x32, 64x64, 128x128), and I get pretty much the same times.
- Rendering lights takes 8 ms on GPU side
- The rendering on the CPU side takes 10 - 40 ms (unacceptable!!!)
Lights are batched in groups of 1-16 (depending on how many are on a tile). I tried 32 once, but it just crashed (too many uniforms). Eventually, I will probably query this to make the batches as large as the particular machine can handle. If no lights are in a tile, the tile is just ignored.
So the issue is in setting up the uniforms, since that is pretty much all the CPU does when rendering the lights. It loops through all the tiles, sets the uniforms, and draws a quad for each. The quad drawing is fast.

For array uniforms, I just accumulated values for each pass in an array and then set the uniform array at the end. While there are only 2 API calls in this version (setting the array and giving the number of lights used in the pass), it runs as bad as 12 fps when a lot of lights are in view. Using glLightfv dropped as low as 30. The non-tiled version never really dropped below 60 (this is all with about 150 lights), but I am running this on a pretty high-spec machine.

Since UBOs bind so quickly, I tried using an array of them and having each light keep its own UBO that it binds when rendering the tile. Almost all of the lights are static, so they don't even need to be updated that often.

I set the uniforms like this:



void Shader::SetShaderParameter4fv(const std::string &name, const std::vector<float> &params)
{
    int paramLoc;

    std::unordered_map<std::string, int>::iterator it = m_attributeLocations.find(name);

    if(it == m_attributeLocations.end())
        m_attributeLocations[name] = paramLoc = glGetUniformLocationARB(m_progID, name.c_str());
    else
        paramLoc = it->second;

    // If location was not found
#ifdef DEBUG
    if(paramLoc == -1)
        std::cerr << "Could not find the uniform " << name << "!" << std::endl;
    else
        glUniform4fvARB(paramLoc, params.size(), &params[0]);
#else
    glUniform4fvARB(paramLoc, params.size(), &params[0]);
#endif
}


Only the array uniform runs slowly, the others are fine. The only thing different between the array and single value forms is the glUniform...ARB(...) call.

EDIT 1: I found that if I purposely error out the shader, it shows the warning "warning(#312) uniform block with instance is supported in GLSL 1.5". However, if I request #version 150, it complains about gl_TexCoord, and still fails to validate the program. Using #version 400 gets rid of all warnings, but it again fails to validate.

EDIT 2: I tried TBOs as well, since they allow me to submit ALL lights at once :)! However, I ran into the same problem I did with UBOs: the shader does not validate. What could be the cause of this?

cireneikual
06-23-2012, 11:57 AM
I solved the shader issue, it was a stupid mistake :p. I am now uploading all light data in one go using a TBO. However, something weird is happening: according to the normal timer, it takes 0 - 1 ms to do everything lighting related, and 4.5 ms according to the timer queries. That seems pretty unlikely, since I am still getting 30 - 140 fps. It cannot be anything besides the lighting, since without it I get really high frame rates.

Also, is there a way I can get the minimum/maximum depth of a tile region (for depth culling) without using something like OpenCL? That would probably help boost performance a lot, since the lights are all in a maze-like indoor environment with lots of occluders.

Dark Photon
06-23-2012, 01:20 PM
I solved the shader issue, it was a stupid mistake :p. I am now uploading all light data in one go using a TBO.

Good deal.


However, something weird is happening: according to the normal timer, it takes 0 - 1 ms to do everything lighting related, and 4.5 ms according to the timer queries. That seems pretty unlikely, since I am still getting 30 - 140 fps.

Hmm, OK. By "normal timer" I assume you mean CPU-based timer. By timer query I assume you mean GPU-based timer. That's very possible. In the former you're timing how long it takes you to "queue" the work, and in the latter you're timing how long it takes you to "do" the work. If you want the former to be closer to the latter, than put a glFinish() right before you stop the CPU timer. That forces the CPU to wait until all the queue GPU work is done before it returns. This introduces a large pipelining bubble, so in practice you'd never do this except possibly at end-of-frame.


That seems pretty unlikely, since I am still getting 30 - 140 fps.

I have no idea what you're implying here. 30-140fps = 7-33ms, which even in the best case covers either timing.


Also, is there a way I can get the minimum/maximum depth of a tile region (for depth culling) without using something like OpenCL? That would probably help boost performance a lot, since the lights are all in a maze-like indoor environment with lots of occluders.
You can do the depth buffer reduction with a GLSL shader instead. Try that first. That should still be pretty fast. Ping-pong reduction is sometimes used here.

The main thing that OpenCL/CUDA bring to the table is use of the shared memory on the compute units (GPU multiprocessors). There are cases like reduction where your shader/kernel can execute more quickly if your algorithm takes advantage of that.

cireneikual
06-23-2012, 01:41 PM
You can do the depth buffer reduction with a GLSL shader instead. Try that first. That should still be pretty fast. Ping-pong reduction is sometimes used here.

I'll try that! However, how do I make sure it always renders the maximum depth? If it just interpolates all the depths, it will give an average instead of a maximum, so it may cull a light when it shouldn't be culled.

Dark Photon
06-24-2012, 06:07 PM
I'll try that! However, how do I make sure it always renders the maximum depth? If it just interpolates all the depths, it will give an average instead of a maximum...
Why would it interpolate?

You're writing the shader. You can make it do whatever you want. Such as (for instance) read all the input depth values via texelFetch(), compute the min() and max() values across those input depth values, and output those min and max values on 2 MRTs (render targets), 0 and 1.
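The driver side of that single-pass reduction is small too -- roughly this sketch (the tile-resolution FBO with two single-channel float color attachments, the reduction program, and drawFullscreenQuad() are all assumed to exist on your side):

// Reduce the full-res depth buffer to one min and one max per 16x16 tile.
// The bound fragment shader texelFetch()es its 16x16 block of depths and
// writes min to output 0 and max to output 1.
glBindFramebuffer(GL_FRAMEBUFFER, tileDepthBoundsFBO);      // tilesX x tilesY in size
glViewport(0, 0, tilesX, tilesY);

GLenum bufs[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
glDrawBuffers(2, bufs);                                     // MRT 0 = min, MRT 1 = max

glUseProgram(depthReduceProgram);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, gBufferDepthTexture);          // full-resolution depth
glUniform1i(glGetUniformLocation(depthReduceProgram, "gDepth"), 0);

drawFullscreenQuad();                                       // one fragment per tile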