Optimizing cascaded shadow mapping

Hi!

I’m currently trying to optimize the cascaded shadow mapping implementation in my 3D engine by reducing the CPU overhead of sending uniform data and submitting draw calls. To that end, I wrote a simple geometry shader that takes a single triangle as input and generates NUM_FRUSTUM_SPLITS triangles from it. Source code:


#version 330
layout (triangles) in;
layout (triangle_strip, max_vertices = 9) out; // 3 splits * 3 vertices

uniform mat4 wtl[3]; // Transform from world space to light space, one per split

in vec2 uv_vs[];
out vec2 uv_out;

void main()
{
  // Emit one copy of the input triangle per cascade layer.
  for (int i = 0; i < 3; i++) {
    gl_Layer = i;
    for (int j = 0; j < 3; j++) {
      uv_out = uv_vs[j];
      gl_Position = wtl[i] * gl_in[j].gl_Position;
      EmitVertex();
    }
    EndPrimitive();
  }
}

As expected, CPU time went from ~4 ms to less than 1 ms to render all the meshes (~12,000) that are visible from the light source’s point of view. Unfortunately, the overall frame rate decreased from ~40 fps to ~29 fps with this approach, so it seems that using a geometry shader is much slower than the standard approach of switching depth attachments and rendering the meshes multiple times. Do you guys have any ideas how to optimize this? It’s a heavy scene with a total triangle count of ~50,000,000 (there is no level of detail yet). By the way, I don’t switch VAOs between the draw calls: all the geometry in the scene shares a single VAO, VBO and IBO and is drawn with glDrawElementsBaseVertex() with the corresponding offsets and index type.

Thanks in advance!

Frames/sec aren’t good for comparison, as they are non-linear in time. I suggest you use ms/frame instead.

So total frame time was increased from 25ms per frame to 34.5ms, a net increase of 9.5ms, while your CPU time was reduced 3ms.
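The conversion is just the reciprocal; a tiny sketch of the arithmetic:

```cpp
// Convert frames-per-second to milliseconds-per-frame.
double fpsToMs(double fps) { return 1000.0 / fps; }

// 40 fps -> 25.0 ms/frame, 29 fps -> ~34.5 ms/frame,
// i.e. a net increase of ~9.5 ms even though CPU time dropped by ~3 ms.
```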

So it seems that using a geometry shader is much slower than the standard approach of switching depth attachments and rendering the meshes multiple times. Do you guys have any ideas how to optimize this?

It sounds like you’re throwing all the objects at all the splits. This is typically not an efficient use of the GPU, particularly when you’re using a geometry shader: they’re hard for the pipeline to optimize, and they cost you whether or not the primitives you’re processing happen to be in or out of the frustum. And of course if they’re out, the time you spend doing all that vertex and geometry shading is a complete waste.

I suggest you cull objects on the CPU based on the bounds of your light-space shadow frusta, one per shadow map split. And there are all kinds of things you can do to optimize that.

Also, do some playing with your code, enabling and disabling certain aspects of it, to determine where your performance is going and what your biggest bottleneck is.

Once you locate that, post specifically what you’re doing to get ideas on how you can do it differently. For instance, if it’s uniform updates, show us what you’re doing.

Yes, I do this because an object that intersects split frustum 1 can still cast a shadow onto a fragment that samples from split 2 when the shadow is computed. When I compute the shadow, I simply use the fragment’s view-space depth to select the right split. I also enable GL_DEPTH_CLAMP when the shadow maps are rendered to ensure that the objects end up in all of the splits.
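In the shader this selection is just a comparison chain against the split far distances; a CPU-side sketch of the same logic (the distances here are made-up examples):

```cpp
#include <vector>

// Pick the first cascade whose far distance lies beyond the fragment's
// view-space depth (splitFar is sorted ascending). Falls back to the
// last split for depths beyond the final distance.
int selectSplit(const std::vector<float>& splitFar, float viewDepth) {
    for (int i = 0; i < (int)splitFar.size(); ++i)
        if (viewDepth <= splitFar[i])
            return i;
    return (int)splitFar.size() - 1;
}
```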

I do that already, but based on a bounding box that encloses all the frustum splits, for the reason given above. I’m using a bounding volume hierarchy (BVH), currently a quadtree, to discard nodes and objects that are either outside the view frustum or too small to be rendered.

The code is too complex to post here, but if you’re motivated you can check out my repository on GitHub (still a lot of work in progress, and only static geometry):
https://github.com/fleissna/flyEngine
I tried to write a generic renderer that doesn’t know anything about OpenGL. It knows about textures, render targets, shaders, draw calls etc., and handles all of that by communicating with an abstract “API” object.

Here’s what I basically do each frame:
1.) Query visible objects from BVH
2.) Group visible objects by shader and material (to minimize state changes)
3.) Iterate through nested list of shaders, materials and objects and submit draw calls

This is done two times, shadow map + scene forward pass.
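The grouping in step 2.) could be sketched roughly like this (ShaderId/MaterialId are stand-ins for my actual handle types, not the real engine code):

```cpp
#include <map>
#include <vector>

// Hypothetical handle types standing in for the engine's abstractions.
using ShaderId = int;
using MaterialId = int;

struct DrawItem { ShaderId shader; MaterialId material; int meshIndex; };

// Nested grouping: shader -> material -> mesh indices. Iterating this
// structure binds each shader and each material at most once per frame.
using DrawList = std::map<ShaderId, std::map<MaterialId, std::vector<int>>>;

DrawList groupByState(const std::vector<DrawItem>& visible) {
    DrawList out;
    for (const DrawItem& d : visible)
        out[d.shader][d.material].push_back(d.meshIndex);
    return out;
}
```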
Here’s a video that should give you an idea how the culling works in my engine:
https://www.youtube.com/watch?v=HQ-qh0RPoFE
My quadtree implementation:
https://github.com/fleissna/flyEngine/blob/master/engine/include/Quadtree.h

Sure. That’s typical.

Sure. But that doesn’t explain why you don’t cull per split frustum.

If you have an object whose shadow will cast into both shadow split 1 and split 2, you render it into the shadow maps for both splits. That does not mean you need to render it into the shadow maps for splits 0, 3, 4, etc.

I’m not surprised that adding a geometry shader to this tech approach resulted in slowdown. Like you, I’ve also tried using geometry shaders for geometry amplification (creating duplicate geometry destined for multiple layers/viewports), and been completely underwhelmed by the performance – particularly when there is a lot of out-of-frustum geometry. The GPU frustum culling occurs too late in the pipeline to save you a lot of wasted cycles and frame time. This puts the burden back on the application to be smarter about what it sends down the pipeline to start with.

Thanks very much for your suggestions. I now have an idea how I could optimize the culling process for shadow maps; I just hadn’t figured out the math needed to do this yet. I’m currently using axis-aligned bounding boxes (AABBs) to wrap the objects in the scene. To test a single object, it is not enough to just check whether its AABB intersects the light frustum; you have to check whether it can potentially cast a shadow into that frustum.

So here’s what I would do:

  • Transform the 8 vertices of the object’s AABB into light space and compute a new AABB from that.
  • Extrude this AABB such that its maximum z-value matches the one of the light frustum’s AABB, something like this:

object_bb_max.z = std::max(object_bb_max.z, light_frustum_bb_max.z); // where light_frustum_bb_max is later used to construct the orthographic projection matrix that is used to render the shadow map.

  • If those two AABBs intersect, there is a chance that the object will cast a shadow into that split -> render it into the shadow map for that split.
  • Do all of that hierarchically on a scene graph / BVH for each individual frustum split.
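The steps above could be sketched roughly like this (Aabb/Affine are made-up helper types for illustration; I only handle the affine case of the world-to-light matrix, no perspective):

```cpp
#include <algorithm>
#include <array>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; };

// Minimal affine transform (linear part + translation), enough for a
// world-to-light-space matrix of a directional light.
struct Affine {
    float m[3][3]; // linear part
    Vec3 t;        // translation
    Vec3 apply(const Vec3& p) const {
        return { m[0][0]*p.x + m[0][1]*p.y + m[0][2]*p.z + t.x,
                 m[1][0]*p.x + m[1][1]*p.y + m[1][2]*p.z + t.y,
                 m[2][0]*p.x + m[2][1]*p.y + m[2][2]*p.z + t.z };
    }
};

// Transform the 8 corners of a world-space AABB into light space and
// compute a new AABB over the results.
Aabb transformAabb(const Aabb& box, const Affine& wtl) {
    std::array<Vec3, 8> corners;
    for (int i = 0; i < 8; ++i)
        corners[i] = { (i & 1) ? box.max.x : box.min.x,
                       (i & 2) ? box.max.y : box.min.y,
                       (i & 4) ? box.max.z : box.min.z };
    Vec3 p = wtl.apply(corners[0]);
    Aabb out = { p, p };
    for (int i = 1; i < 8; ++i) {
        p = wtl.apply(corners[i]);
        out.min = { std::min(out.min.x, p.x), std::min(out.min.y, p.y),
                    std::min(out.min.z, p.z) };
        out.max = { std::max(out.max.x, p.x), std::max(out.max.y, p.y),
                    std::max(out.max.z, p.z) };
    }
    return out;
}

// Extrude the object's light-space AABB up to the split's far z, then test
// for overlap with the split's light-space bounds.
bool mayCastIntoSplit(Aabb objLight, const Aabb& splitLight) {
    objLight.max.z = std::max(objLight.max.z, splitLight.max.z);
    return objLight.min.x <= splitLight.max.x && objLight.max.x >= splitLight.min.x &&
           objLight.min.y <= splitLight.max.y && objLight.max.y >= splitLight.min.y &&
           objLight.min.z <= splitLight.max.z && objLight.max.z >= splitLight.min.z;
}
```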

Any suggestions, improvements, performance considerations?

I would suggest you not call that the light frustum. The light space frustum is what you cull objects into for rasterizing into that shadow map split’s depth map. It is also the same volume of space over which your 0…1 depth values in your shadow split depth map run.

You have:

  1. The full camera eye-space frustum
  2. The camera eye-space sub-frustum for each split (#1 split up by Z planes).
  3. The light-space frustum for each split (which is derived from the camera eye-space sub-frustum for that split)

Any suggestions, improvements, performance considerations?

Just a couple thoughts for you to consider.

General boxInFrustum() checks for full boxes can be expensive. sphereInFrustum() checks are much cheaper. Consider protecting against the need to do boxInFrustum() checks for everything by doing a sphereInFrustum() check for your objects first. You might consider doing only the sphereInFrustum() checks for starters as it’ll make it easier for you to get something up quickly (the math is easy; transform one point and you’re ready to test). Then for the perf++ you can look at adding boxes as a sub-test for objects that pass the cheap sphere cull test.

Consider having higher-level grouping nodes in your scene graph just do sphereInFrustum() checks as they’re cheap.
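A minimal sphereInFrustum() sketch, assuming the frustum planes are stored with normals pointing into the frustum (the conventions are mine, not from your code):

```cpp
// Plane in ax + by + cz + d = 0 form, normal (a,b,c) pointing into the frustum.
struct Plane { float a, b, c, d; };
struct Vec3 { float x, y, z; };

float signedDistance(const Plane& p, const Vec3& v) {
    return p.a * v.x + p.b * v.y + p.c * v.z + p.d;
}

// Cheap sphere-vs-frustum test: the sphere is culled as soon as its center
// lies farther than `radius` behind any single plane. Conservative: may
// report "in" for spheres just outside a frustum corner.
bool sphereInFrustum(const Plane* planes, int numPlanes,
                     const Vec3& center, float radius) {
    for (int i = 0; i < numPlanes; ++i)
        if (signedDistance(planes[i], center) < -radius)
            return false; // fully behind this plane -> outside
    return true; // intersects or is inside
}
```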

In case you haven’t already, you might look at some of the cascaded shadow map literature out there (IIRC ShaderX* and GPU Gems have some useful stuff). There are a number of techniques for optimizing the bounds of your light-space frustum.

Unlike the normal camera frustum, the light frustum for a split is not defined just by the bounds of the camera eye-space sub-frustum for that split. It’s defined by the entire volumetric region of space which may contain objects that can cast into that split’s eye-space sub-frustum. Think of taking the split’s eye-space sub-frustum and translating it toward the light to mark off a region of space. It runs: 1) from the back side of that shadow split’s eye-space sub-frustum as viewed from the light source, all the way up 2) to the light source. #2 is impractical though, so instead you often stop at some distance N toward the light source past the front side of the shadow-map split’s eye-space sub-frustum.

In general, this yields a frustum that doesn’t have the usual 6 sides (L/R/B/T/N/F), but one that might have IIRC 11-12 sides. I think there’s an article in GPU Gems about defining and culling to this frustum. You have to decide whether it’s worth taking advantage of this to cull away more objects before throwing them down the pipe.

Re transforming all 8 verts of an AABB and recomputing a new AABB: I wouldn’t do either of these, since you can often cull the whole box after transforming and testing just one or two vertices. Look at some boxInFrustum() tests for OBBs for details.

[QUOTE=Dark Photon;1291608]
General boxInFrustum() checks for full boxes can be expensive. [/QUOTE]
That’s why I considered computing AABBs and performing AABB/AABB intersections (object vs light space frustum), which are rather cheap. But yeah, you still have to pay the price for the 8 vertex transforms.
Never mind, I ended up using the method proposed in Real-Time Rendering, Third Edition: frustum planes are extracted from the view-projection matrix, then I do AABB/frustum intersection tests with the plane equations in world space. The advantage of this approach is that I can use the exact same culling code for both the main scene and the shadow map splits. For shadow culling, I use a projection matrix that includes everything in front of the light and behind the back plane by setting the near clip plane to zero.
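A simplified sketch of that extraction (the well-known Gribb/Hartmann approach: each clip plane is a sum or difference of matrix rows) plus the AABB test via the “positive vertex”. Row-major matrices and OpenGL clip conventions are assumed here, not necessarily what my engine uses internally:

```cpp
#include <array>

// Row-major 4x4 matrix; clip = M * world, OpenGL clip space (-w <= z <= w).
struct Mat4 { float m[4][4]; };
struct Plane { float a, b, c, d; }; // ax + by + cz + d >= 0 means "inside"

// Gribb/Hartmann extraction; normals point into the frustum. Normalization
// can be skipped if only the sign of the distance is needed.
std::array<Plane, 6> extractFrustumPlanes(const Mat4& M) {
    auto row = [&](int i) { return Plane{ M.m[i][0], M.m[i][1], M.m[i][2], M.m[i][3] }; };
    auto add = [](Plane p, Plane q, float s) {
        return Plane{ p.a + s*q.a, p.b + s*q.b, p.c + s*q.c, p.d + s*q.d };
    };
    Plane r0 = row(0), r1 = row(1), r2 = row(2), r3 = row(3);
    std::array<Plane, 6> planes = {{
        add(r3, r0,  1.f),   // left
        add(r3, r0, -1.f),   // right
        add(r3, r1,  1.f),   // bottom
        add(r3, r1, -1.f),   // top
        add(r3, r2,  1.f),   // near
        add(r3, r2, -1.f) }}; // far
    return planes;
}

// AABB vs. frustum: per plane, pick the box corner farthest along the plane
// normal ("positive vertex"); if even that corner is behind the plane, the
// whole box is outside. Conservative near frustum corners.
bool aabbInFrustum(const std::array<Plane, 6>& planes,
                   const float bbMin[3], const float bbMax[3]) {
    for (const Plane& p : planes) {
        float x = p.a >= 0.f ? bbMax[0] : bbMin[0];
        float y = p.b >= 0.f ? bbMax[1] : bbMin[1];
        float z = p.c >= 0.f ? bbMax[2] : bbMin[2];
        if (p.a * x + p.b * y + p.c * z + p.d < 0.f)
            return false;
    }
    return true;
}
```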

Doing the culling per split gave me a nice performance boost: currently 16 million triangles end up in the splits instead of 50 million. Culling itself is still rather expensive; it takes about 5 ms for the main scene and 2.5 ms for the shadow splits in the worst case, selecting from a total of 4 million meshes, of which about 21,000 (main scene) and 11,000 (shadow maps) end up being processed by the GPU.

I can probably optimize this using multithreading. Multithreading the BVH traversal itself doesn’t make much sense because of the high execution divergence when nodes are traversed. But I could launch a thread at the beginning of the frame that performs culling for the main scene, then cull and render the shadow maps, then synchronize and render the main scene.

I’m going to check out your references and consider using spheres instead of boxes when I have time to do that. Thanks so far!

Sounds like you’re making great progress!

Re culling cost, 4 million meshes, and 50 million tris culled into your view frustum: it sounds like your batch size may be fairly small. You might look at batching your meshes together based on spatial locality (i.e. merging draw calls). This’ll let you save on both culling cost and draw dispatch cost, at the slight expense of coarser culling. There’s also LOD, which you mentioned you’re not supporting yet.

[QUOTE=Dark Photon;1291622]Sounds like you’re making great progress!

Re culling cost, 4 million meshes, and 50 million tris culled into your view frustum: it sounds like your batch size may be fairly small. You might look at batching your meshes together based on spatial locality (i.e. merging draw calls). This’ll let you save on both culling cost and draw dispatch cost, at the slight expense of coarser culling. There’s also LOD, which you mentioned you’re not supporting yet.[/QUOTE]

Coarse culling/merging meshes can be problematic: consider a giant 3D model that consists of just a single mesh. If only a tiny part of this giant mesh intersects the view frustum, you still have to draw it, and you’ll waste lots of GPU cycles just to clip all those triangles that aren’t rasterized anyway. I guess in practice one must find a tradeoff depending on the scene you’re working with.

I currently do not really batch the meshes. I just ensure that each shader that’s used in a frame is bound exactly once; the same holds for materials. Within each material I loop over the visible meshes, send the model matrix + transpose inverse model matrix (to transform the normals) via glUniformMatrix4fv(), and submit the draw call. The scene I’m testing just consists of 1000 copies of the Sponza model uniformly placed on a grid; I’m not a good 3D modeler yet :stuck_out_tongue:

Now I’m struggling with shadowing artifacts, especially self-shadowing on curved surfaces:
https://raw.githubusercontent.com/fleissna/flyEngine/master/screenshots/screenshot4.png
https://raw.githubusercontent.com/fleissna/flyEngine/master/screenshots/screenshot3.png
I already use PCF and glPolygonOffset when rendering the shadow maps. I also know how to do Poisson disk filtering, which results in noisy but less jagged shadow edges, but I’m looking for a way to prevent or at least lessen those artifacts up front.

Man, those are sure big shadow texels. You might want to look at tightening your shadow bounds and/or increasing shadow map resolution.

The shadow acne on the sides of those spheres (and the lit surface next to them) looks much too bright. Typically you only shadow diffuse and specular, which typically fade to zero intensity along light-space tangent surfaces. So why is the surface so bright here? That seems to be part of your problem. The big shadow texels are another part. And the acne itself you can probably fix with polygon-offset or normal-offset techniques.

Thanks for the hint, I went through the source code and saw that shadows were applied to the final, lit color instead of just diffuse + specular. Fixed it.

I increased my shadow factor just to demonstrate the artifacts.

I also used a bounding sphere to compute the frustum splits (to avoid shimmering edges on camera rotations), but that resulted in bounds that were too large in terms of (culling) performance and shadow map resolution. For now, I resolved that by using more splits and increasing the shadow map resolution. Here’s a demo of my current state:
https://www.youtube.com/watch?v=QivCx4WEwcM

Good deal!

I also used a bounding sphere to compute the frustum splits (to avoid shimmering edges on camera rotations), but that resulted in bounds that were too large in terms of (culling) performance and shadow map resolution.

That’s interesting. When I used that trick, I didn’t have that problem (and I did it for the same reason: to get rid of shadow edge flickering and keep the shadow edge appearance static with static light source positions). How are you choosing your split distances? IIRC, I was using the “practical split” strategy which is basically a blend of the pure linear and pure log split distances. That helps pull in the split distances, reducing the size of your bspheres. And also, what are your camera FOVs? Of course, the larger those are, the bigger the sphere has to be to circumscribe the splits. …just wondering what might be different about your situation where you didn’t get sufficient shadow res.
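The “practical split” scheme is just a blend of the logarithmic and the uniform split distances (this is the Zhang et al. PSSM formulation; lambda is the blend weight, names are mine):

```cpp
#include <cmath>
#include <vector>

// "Practical" split scheme: blend of logarithmic and uniform split
// distances, controlled by lambda in [0,1]. lambda = 1 is pure log,
// lambda = 0 is pure uniform. Returns numSplits + 1 boundary depths.
std::vector<float> practicalSplits(float nearZ, float farZ,
                                   int numSplits, float lambda) {
    std::vector<float> splits(numSplits + 1);
    for (int i = 0; i <= numSplits; ++i) {
        float s = float(i) / float(numSplits);
        float logSplit = nearZ * std::pow(farZ / nearZ, s);
        float uniSplit = nearZ + (farZ - nearZ) * s;
        splits[i] = lambda * logSplit + (1.f - lambda) * uniSplit;
    }
    return splits;
}
```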

For now, I resolved that by using more splits and increasing the shadow map resolution. Here’s a demo of my current state:
https://www.youtube.com/watch?v=QivCx4WEwcM

Looks great!

For now I’m just doubling the distances for each consecutive split (16, 32, 64, 128, …), nothing sophisticated. I’ll probably look into more advanced solutions later. Bounding spheres can be significantly larger than the actual split bounds, so there can be lots of wasted space, depending on the split distances and FOV of course. I guess this is what caused the worse performance and resolution, because a sphere is not as tight as a box.

It’s currently fairly small at 45 degrees.

I guess I should also take potential shadow receivers into account when computing the splits. This would result in tighter boxes at the expense of more culling work on the CPU. Cases like the camera looking straight at the ground would also be much more performant then. (Only if I remove that giant ground plane that currently covers the entire scene in the demo :p)

Anyway, the main bottleneck in my engine is polygon count due to the lack of LOD (except for instancing). I’m going to look for mesh-simplification techniques or libraries (MeshLab?) that can do that for me.

None of these links work for me :frowning: Is the GitHub repository maybe private?
The images say they cannot be displayed because they contain errors.
GitHub just says 404.
YouTube shows mostly nothing.

[QUOTE=tsojtsoj;1292640]None of these links work for me :frowning: Is the GitHub repository maybe private?
The images say they cannot be displayed because they contain errors.
GitHub just says 404.
YouTube shows mostly nothing.[/QUOTE]

Hi!

I’m sorry, but I decided to make my repository private some time ago for personal reasons. Here is a video that shows the current state of my renderer (don’t take the description too seriously ;-)): https://www.youtube.com/watch?v=8GoGrkCP3FY

If you want to know something specific, just ask.