Fast Stencil Light Volumes for Deferred Shading

Hello,

A while back I posted about some deferred shading performance problems, and how using stencil light volumes actually slowed things down instead of speeding them up.
This has probably been thought of before, but I'll share it here anyway in case it hasn't.
I managed to batch light stencil tests into groups of 8 (one per stencil bit) by using glStencilMask as an OR operation: while depth testing the front faces of the light volumes, each light writes its own bit to mark which pixels it affects.
In a second pass over the back faces of the light volumes, I switch the depth test to GL_GREATER and set the stencil func (via its ANDed mask parameter) to render each light only where its bit was set earlier.

With this system, there is no overdraw, and the stencil test is fast. Overall, the system is faster than without the stenciling.

This is what the light rendering looks like:


// ---------------------------- Render Lights ----------------------------

    // Query visible lights
    std::vector<OctreeOccupant*> result;
    m_lightSPT.Query_Frustum(result, pScene->GetFrustum());

    glEnableClientState(GL_VERTEX_ARRAY);
    glEnable(GL_STENCIL_TEST);
    glClearStencil(0);

    for(unsigned int i = 0, size = result.size(); i < size;)
    {
        // The stencil write mask applies to clears too, so open all bits
        // before clearing (it is left on a single bit after each batch)
        glStencilMask(0xff);
        glClear(GL_STENCIL_BUFFER_BIT);

        glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);

        // Batch up to 8 lights together, one stencil bit per light
        unsigned int firstLightIndex = i;

        for(unsigned int j = 0; j < 8 && i < size; j++, i++)
        {
            glStencilFunc(GL_ALWAYS, 0xff, 0xff);

            // Restrict writes to this light's bit, so the batch ORs together
            glStencilMask(m_lightIndices[j]);

            Light* pLight = static_cast<Light*>(result[i]);

            if(!pLight->m_enabled)
                continue;

            pLight->SetTransform(pScene);
            pLight->RenderBoundingGeom();
        }

        i = firstLightIndex;

        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

        // Now render the back faces with reversed depth testing, and only
        // to the stenciled regions
        glCullFace(GL_FRONT);
        glDepthFunc(GL_GREATER);
        glEnable(GL_BLEND);
        glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);

        for(unsigned int j = 0; j < 8 && i < size; j++, i++)
        {
            // Pass only where this light's bit was set in the first pass
            glStencilFunc(GL_EQUAL, 0xff, m_lightIndices[j]);

            Light* pLight = static_cast<Light*>(result[i]);

            if(!pLight->m_enabled)
                continue;

            // If the camera is inside the light volume, skip the stencil
            // test: the front faces are clipped away by the near plane, so
            // this light's bit was never set
            if(pLight->Intersects(pScene->m_camera.m_position))
            {
                glDisable(GL_STENCIL_TEST);

                pLight->SetTransform(pScene);
                pLight->SetShader(pScene);
                pLight->RenderBoundingGeom();

                glEnable(GL_STENCIL_TEST);
            }
            else
            {
                pLight->SetTransform(pScene);
                pLight->SetShader(pScene);
                pLight->RenderBoundingGeom();
            }
        }

        glCullFace(GL_BACK);
        glDepthFunc(GL_LESS);
        glDisable(GL_BLEND);

        Shader::Unbind();
    }

    // Re-enable stencil writes to all bits
    glStencilMask(0xff);

    glDisableClientState(GL_VERTEX_ARRAY);
    glDisable(GL_STENCIL_TEST);

    GL_ERROR_CHECK();

I managed to batch light stencil tests into groups of 8 (one per stencil bit)… Overall, the system is faster than without the stenciling.

That’s interesting. Thanks!

Say, curious question: did you happen to bench tile-based deferred with your system? That is, batching lights by screen tile, and then for each tile: read the G-buffer, apply the lights for that tile all-in-one-go, write the lighting buffer?

One thing you can do is use instancing for the batched lights; that may improve things a bit. One downside of this implementation of deferred shading is that you talk to the API a lot, and you also clear the stencil buffer again and again.
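For example, the vertex shader side of an instanced batch might look something like this (the block name, light layout, and batch size below are all made up, just a sketch):

#version 330 core
// One instanced draw renders all batched light volumes: each instance
// scales and positions a unit sphere from its per-instance light record.
// "LightBlock", "lights" and MAX_BATCH are invented names/values.
#define MAX_BATCH 256

layout (location = 0) in vec3 vertex;   // unit-sphere vertex

layout (std140) uniform LightBlock
{
    vec4 lights[MAX_BATCH];             // .xyz = position, .w = radius
};

uniform mat4 viewProjection;
flat out int lightIndex;                // which light this volume belongs to

void main()
{
    vec4 l = lights[gl_InstanceID];
    lightIndex = gl_InstanceID;
    gl_Position = viewProjection * vec4(vertex * l.w + l.xyz, 1.0);
}

The application side then issues a single glDrawElementsInstanced call per batch instead of one draw per light volume.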

As Dark Photon mentioned, tile-based deferred shading is probably the way to go. I've recently implemented it and the performance advantage was INSANE. Now I can render 500 point lights (limited by the UBO size at the moment) without any significant impact. Before, the bottleneck was the lighting stage; now it is by far the material stage (where you create the G-buffer). I am planning to write an article about the implementation I used, so if someone is interested I can rush things a bit.
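Until the article is out, here is the shape of the idea, not my exact code. The names, the 32x32 tile size, and the tile-list texture encoding below are placeholders, and a prior CPU (or shader) pass is assumed to have binned the light indices per screen tile:

#version 330 core
// Tiled deferred lighting without compute: one texture row per 32x32
// screen tile holds [count, idx0, idx1, ...] of the lights touching it.
#define MAX_LIGHTS 500   // 500 lights * 2 vec4s fits the 16 KB UBO minimum

struct Light { vec4 posRadius; vec4 color; };
layout (std140) uniform LightBlock { Light lights[MAX_LIGHTS]; };

uniform isampler2D tileList;     // row y = tile index; x = 0 holds the count
uniform sampler2D gbufDiffuse;   // G-buffer: diffuse color
uniform sampler2D gbufPosition;  // G-buffer: view-space position
uniform int tilesPerRow;

in vec2 texCoord;
out vec4 fragColor;

void main()
{
    ivec2 pix = ivec2(gl_FragCoord.xy);
    int tile  = (pix.y / 32) * tilesPerRow + (pix.x / 32);
    int count = texelFetch(tileList, ivec2(0, tile), 0).r;

    vec3 diffuse = texture(gbufDiffuse, texCoord).rgb;
    vec3 pos     = texture(gbufPosition, texCoord).xyz;

    // One G-buffer read, then every light for this tile in one go
    vec3 total = vec3(0.0);
    for (int i = 0; i < count; ++i)
    {
        Light l   = lights[texelFetch(tileList, ivec2(1 + i, tile), 0).r];
        float d   = distance(l.posRadius.xyz, pos);
        float att = max(0.0, 1.0 - d / l.posRadius.w);  // crude falloff
        total    += diffuse * l.color.rgb * att;
    }
    fragColor = vec4(total, 1.0);
}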

@Dark Photon: I tried the tiling method a while back, but it performed worse due to how I implemented it. I did it without compute shaders or OpenCL, since I wanted to support older hardware and because I have no experience with either. All the tiling happened on the CPU, so it came out very CPU-limited.
@Godlike: I couldn't get tiled deferred to work properly myself, so I would love to see that article! I want more lights :smiley:

I updated a deferred shader with tile-based lights. For each light, I created a quad that precisely covers the light. It is positioned correctly in z to make use of depth culling; the position is in front of the light, not at the z of the light source. There are some tricks to be aware of when the camera is inside the light (where my implementation still has some problems). The same technique can be used for a lot of nice effects, like adding spherical fogs, local color-coded markers, etc. The vertex shader is as follows. The performance increased a lot, especially for lamps farther away or hidden by objects, and that is the usual case, except for a few near lamps.

uniform vec4 Upoint;            // A light. .xyz is the coordinate, and .w is the strength (reach)
layout (location = 0) in vec2 vertex;
out vec2 screen;                // Screen coordinate
void main(void)
{
    float strength = Upoint.w;
    // Relative bounding (2D) box around the point light
    vec3 box = vec3(vertex*2-1, 0)*strength;
    vec4 viewPos = UBOViewMatrix * vec4(Upoint.xyz, 1);
    vec3 d = normalize(viewPos.xyz);
    // We want to move the quad towards the player, so that it ends up just
    // outside the range of the lamp. This is needed because depth culling
    // will remove parts of the quad that are hidden.
    float l = min(strength, -viewPos.z-1); // Correction if camera inside light
    // modelView is one of the corners of the quad in view space.
    vec4 modelView = -vec4(d, 0)*l + vec4(box, 0) + viewPos;
    vec4 pos = UBOProjectionMatrix * modelView;
    pos /= pos.w;
    gl_Position = pos;
    // Copy the position to the fragment shader. Only x and y are needed.
    // Scale from the interval -1 .. 1 to the interval 0 .. 1.
    screen = pos.xy/2+0.5;
}

The fragment shader does the usual thing (adds light intensity as a function of the distance from the lamp to the pixel).
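A minimal version of it could look like this (the G-buffer sampler names and the linear falloff here are placeholders; the real shader is more involved):

#version 330 core
uniform vec4 Upoint;           // Same light as in the vertex shader
uniform sampler2D positionTex; // G-buffer: world-space position (placeholder)
uniform sampler2D diffuseTex;  // G-buffer: diffuse color (placeholder)
in vec2 screen;                // Screen coordinate from the vertex shader
out vec4 fragColor;

void main()
{
    vec3 pixelPos = texture(positionTex, screen).xyz;
    float dist    = distance(pixelPos, Upoint.xyz);
    // Light intensity as a function of distance, reaching zero at the
    // lamp's strength (reach); additively blended into the light buffer
    float atten = max(0.0, 1.0 - dist / Upoint.w);
    fragColor = vec4(texture(diffuseTex, screen).rgb * atten, 1.0);
}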

The system I use is… funky. It goes like this: I have a render target that is essentially RGBA_8/16/32UI, depending on how many lights I wish to support in a single call. For each light I render a “light volume”, in the same fashion as one does stencil shadows, but rather than incrementing, I just flip that light’s bit in the integer buffer (I get away with this because the light volume is essentially the “shadow” of a planar polygon). This way all the lights get drawn to the integer buffer, and in the final pass each bit of that integer buffer indicates whether a light is active.

In my system, part of the G-buffer is an offset into a range of a texture buffer object that “lists” the lights the mesh worries about (done via a CPU computation), and the lighting pass iterates over that range, so that only those lights whose “bounding-whatever” intersects the bounding box of a mesh are added.

The main things I get out of this: I avoid a lot of extra lookups (doing a pass per light means the G-buffer needs to be read again for each light), and I avoid blending to add the lights (and the icky choice between FP16/FP32 blending or fixed8 blending with banding). This system also lets me detect when any set of lights is active on one pixel by using a mask, opening the door for drawing weird stuff (like changing things in a funky way if light A and light B are hitting the same thing).

Sounds like Light Indexed Deferred Rendering (also in ShaderX7)

@kRogue: please paragraphs :slight_smile:

I’ve seen opening posts in forums that are a page long without spaces, and it’s a wonder anyone replies. Spread the word!

Since this thread seems to have become a general discussion, I just wanted to ask: I want to implement a similar shader framework, and I am wondering whether it counts as “deferred” or not…

It’s really more about shadows than lights. In short, the static (deterministic) elements of the scene have shadow geometry precomputed. The goal is accurate (soft umbra/penumbra, etc.) go-anywhere shadows that do not have jagged/saw-toothed qualities. The scene is sectioned into chunks that each have up to 4 shadow-generating lights. The shadows are then drawn to an RGBA buffer, one component per shadow, using colour masks and a depth buffer. A second (MRT) buffer probably generates a depth texture for later lookup, as I suspect copying the depth buffer into a texture would not fly.

At this point shadows can be generated for the non-deterministic elements of the scene via some real-time algorithm; I haven’t given it much thought, but it is complicated by the chunking. And shadows for blended geometry can be accumulated without writing to the depth buffer.

Then the same depth buffer is used to draw the scene as usual, discarding pixels that are behind the shadows, and the lights are modulated by the greyscale values in the shadow buffer. If a pixel is in front of (or on top of) a shadow, its depth must be compared with the saved depth texture to determine whether it is shadowed or not. Blended geometry can skip the depth comparison.
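In shader terms, the modulation step I have in mind would be roughly this (the names are invented, and this shows only the 4-lights-per-chunk case):

#version 330 core
// Each of the chunk's four lights is scaled by its own channel of the
// screen-space shadow buffer (1.0 = fully lit, 0.0 = fully shadowed).
uniform sampler2D shadowBuffer;   // one shadow term per RGBA component
uniform vec3 lightColor[4];       // the chunk's shadow-generating lights
in vec2 screen;                   // screen-space coordinate of this pixel
out vec4 fragColor;

void main()
{
    vec4 shadow = texture(shadowBuffer, screen);
    vec3 total  = vec3(0.0);
    for (int i = 0; i < 4; ++i)
        total += lightColor[i] * shadow[i];
    fragColor = vec4(total, 1.0);
}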

Is this deferred lighting? There is no G-buffer, but it’s kind of flipped around. Also, the shadows can be light instead (more technically, the inverse of shadow) if a scene is more dark than light.

Sounds like Light Indexed Deferred Rendering (also in ShaderX7)

It is sort-of-ish like it, but not quite the same. Basically the system has the following g-buffer:

- Material ID (just a GL_R16UI)
- diffuse and specular color packed into one GL_RGBA16UI
- normal + depth
- light bitfield buffer
- some other buffers for FX and transparency

The Material ID is an offset into a texture buffer object that stores a header consisting of:

- shader ID
- lightBegin, lightEnd, which give a range of indices into another texture buffer object saying which lights to worry about for the fragment
- a pair of ranges for “custom” float data for that fragment’s mesh, in another texture buffer object
- a pair of ranges for “custom” uint data for that fragment’s mesh, in another texture buffer object

The data for the lights are all stored in one texture buffer object, and each light has:

- position of light
- direction of light
- color of light
- radial attenuation coefficients (linear and quadratic)
- angular attenuation range (cosine of angles stored instead of actual angle)
- “light bit mask”

The standard lighting shader then loops over the lights in the range [lightBegin, lightEnd); a light is considered active if its light bit mask logical-ANDed with the light bitfield buffer equals the light bit mask. The use I have for this was having a light go through a portal: the face of the portal was always planar, so the light volume it cast was always safe to handle with just bit flipping. You can see demos of this pet project at: http://www.youtube.com/playlist?list=PL2322715E8A420CCD … sighs it has been a long time since I have had the time to do that project :frowning:
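In shader terms the loop is roughly this (the buffer layouts are simplified, and none of these names are the real ones):

#version 330 core
// Each fragment looks up its material header, walks its light range, and
// skips lights whose bit was not flipped into the per-pixel bitfield.
uniform usampler2D materialIdTex;  // G-buffer: material ID
uniform usampler2D lightBits;      // G-buffer: per-pixel light bitfield
uniform usamplerBuffer headers;    // per-material: shaderId, lightBegin, lightEnd, ...
uniform usamplerBuffer lightMasks; // per-light: its "light bit mask"
uniform samplerBuffer  lightData;  // per-light: position, color, attenuation, ...

out vec4 fragColor;

void main()
{
    ivec2 pix  = ivec2(gl_FragCoord.xy);
    int matId  = int(texelFetch(materialIdTex, pix, 0).r);
    uint bits  = texelFetch(lightBits, pix, 0).r;
    int lightBegin = int(texelFetch(headers, matId * 4 + 1).r);
    int lightEnd   = int(texelFetch(headers, matId * 4 + 2).r);

    vec3 total = vec3(0.0);
    for (int i = lightBegin; i < lightEnd; ++i)
    {
        uint mask = texelFetch(lightMasks, i).r;
        if ((bits & mask) != mask)   // this light's volume missed the pixel
            continue;
        vec3 color = texelFetch(lightData, i * 2 + 1).rgb;
        total += color;              // attenuation terms omitted here
    }
    fragColor = vec4(total, 1.0);
}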

Sure doesn’t sound like it.

With Deferred Shading, you sample your materials into a screen-sized buffer and then go back and apply lighting to it.

With Deferred Lighting, you reverse it: sample your lighting (irradiance) into a screen-sized buffer and then go back and apply materials to it.

In both cases, you’re sampling at the nearest opaque fragment within each pixel (or sample, if doing MSAA). Thus the complication with translucents.

Neither of these necessarily requires that shadowing be handled for any/all light sources.

…the shadows are then drawn to an RGBA buffer, one component per shadow, using colour masks and a depth buffer.

What this does sound like is what I’ve seen called “Deferred Shadows”, “Shadow Collector”, or “Screen-space Shadow Mask”. The idea is you sample your shadow term at the nearest opaque fragment within each pixel (or sample, if doing MSAA), and then just apply the shadowing term to each pixel (or sample) in your final pass when generating the composite radiance/luminance for each pixel (or sample).

…discarding pixels that are behind the shadows, and the lights are modulated by the greyscale values in the shadow buffer.

I’m guessing by this you don’t mean behind the shadows, but behind the nearest opaque fragment, which is where the occlusion field (shadows) is sampled. (?)

Pretty slick! Thanks for sharing.

Thanks. I understood what deferred shading/lighting means, but I was not sure how narrow the distinction is. I think if a technique doesn’t have the same drawbacks as these two, maybe it doesn’t make sense to apply the terminology. But the same idea of filling a full-screen buffer of sorts for later sampling applies. So I wasn’t sure.

What this does sound like is what I’ve seen called “Deferred Shadows”, “Shadow Collector”, or “Screen-space Shadow Mask”.

Good, good. I was fishing for some jargon terms to search with :slight_smile:

I remember searching around for antialiasing techniques for ages until I found the right name for what I wanted (I think it’s called morphological antialiasing?), but unfortunately I could not find any public code (for preprocessing purposes), and the one or two papers were really terse, so it would have been a major project to roll from scratch. I’d hoped GIMP would have a filter, but its AA was just a mindless kernel. My actual goal there is to generate perfect alpha-test contours, which I am pretty sure is possible, but I’ve never seen it in a new game (the grass billboards and such are always bumpy up close).

Point is knowing the right code words goes a long way.

I’m guessing by this you don’t mean behind the shadows, but behind the nearest opaque fragment, which is where the occlusion field (shadows) is sampled. (?)

Right. Except I said shadows because the Z buffer is already initialized wherever a shadow exists going into the second phase. And if a pixel is already in front of (or on) the shadow then the test is moot but would probably have to be done anyway.

Out of curiosity. Is the portal here like a magic portal? Or is it like how light enters a room through a doorway even though the source is around the corner?

Out of curiosity. Is the portal here like a magic portal? Or is it like how light enters a room through a doorway even though the source is around the corner?

Just look at the videos… but in a nutshell, a portal is a connection between two places, à la the game Portal. The thing I made had the light travel through the portals, casting light volumes to get the lighting correct…