Are there efficient optimizations for "Cascaded Shadow Mapping"?

Hi, guys!

In my app, “Cascade Shadow Mapping” is eating more than 50% of the frame rendering time.

Which techniques can you recommend for optimizing? :confused:

CPU-based frustum culling is already implemented.

Thank you for your answers!

There are all kinds of optimizations that are possible here.

You need to profile your implementation and see where your time is going. That will tell you what area of your rendering you need to optimize. Once you know that, feel free to post back here for ideas on optimizing that specific area.

The following loop takes most of the time:



    for (GameItem gameItem : items) {
      Matrix4f modelMatrix = transformation.buildModelMatrix(gameItem); // the matrix is cached internally, so it is actually computed only once
      shaderProgram.setUniform("modelMatrix", modelMatrix);
      glDrawElements(GL_TRIANGLES, getVertexCount(), GL_UNSIGNED_INT, 0);
    }


How many draw calls (i.e. items)?
How many vertices per draw call?
Really no other state changes between draw calls except modelMatrix?
Are the vertex attributes and index lists for these draw calls all pulling from the same set of VBO(s), or are the VBO bindings being changed between draw calls?
Which GPU and GL drivers are you testing on?

cascades:

  • near: 4739 items
  • middle: 5237 items
  • far: 5439 items

Each item is just a cube: 8 vertices.

yes.

They all use the same set of VBOs.

Driver: amdgpu (OS Debian)
GPU: Radeon R9 Nano

Average profiling times, in milliseconds. Mesh.renderList is the function containing the loop described above.
Look at the DirectLightShadowRenderer.render section.


  L--48668: Renderer.render
    L--569: Renderer.cull
    L--8902: PointLightShadowRenderer.render
      L--8860: PointLightShadowRenderer.renderLight
        L--1883: PointLightShadowRenderer.renderCubeFace[0]
          L--872: PointLightShadowRenderer.cull
          L--812: Mesh.renderList
        L--1635: PointLightShadowRenderer.renderCubeFace[1]
          L--836: PointLightShadowRenderer.cull
          L--642: Mesh.renderList
        L--797: PointLightShadowRenderer.renderCubeFace[2]
          L--715: PointLightShadowRenderer.cull
          L--13: Mesh.renderList
        L--766: PointLightShadowRenderer.renderCubeFace[3]
          L--647: PointLightShadowRenderer.cull
          L--68: Mesh.renderList
        L--927: PointLightShadowRenderer.renderCubeFace[4]
          L--738: PointLightShadowRenderer.cull
          L--59: Mesh.renderList
        L--2760: PointLightShadowRenderer.renderCubeFace[5]
          L--754: PointLightShadowRenderer.cull
          L--1892: Mesh.renderList
    L--29681: DirectLightShadowRenderer.render
      L--9149: DirectLightShadowRenderer.renderCascade[0]
        L--802: DirectLightShadowRenderer.cull
        L--8267: Mesh.renderList
      L--9986: DirectLightShadowRenderer.renderCascade[1]
        L--760: DirectLightShadowRenderer.cull
        L--9115: Mesh.renderList
      L--10322: DirectLightShadowRenderer.renderCascade[2]
        L--818: DirectLightShadowRenderer.cull
        L--9393: Mesh.renderList
    L--1188: SpotLightShadowRenderer.render
      L--1053: SpotLightShadowRenderer.renderLight
        L--909: SpotLightShadowRenderer.cull
        L--69: Mesh.renderList
    L--191: ReprojectionRenderer.render
      L--62: ReprojectionRenderer.renderReprojectedFrame
      L--112: ReprojectionRenderer.renderMipmap
    L--7457: Renderer.renderScene
      L--7201: Mesh.renderList


[QUOTE=nimelord;1291047]cascades:

  • near: 4739 items
  • middle: 5237 items
  • far: 5439 items

it just cube: 8 vertices.

They are the same set of VBO.

Driver: amdgpu (OS Debian)
GPU: Radeon R9 Nano[/QUOTE]

Really? A separate draw call per cube? I suspect you’re probably driving your CPU pretty hard and under-utilizing your GPU, due to feeding the GPU very inefficiently. Looking at your timings, your triangle throughput is pretty low, even if you meant to say your timings are in usec, not msec.

You should at least look at improving your batching (e.g. applying one of the forms of geometry instancing: LINK1, LINK2) when rasterizing to your shadow cascades.

Before you do though, as a performance test, do your 3 cascade culls up-front in init, and pre-batch your shadow cascade cubes (i.e. group cubes together in shared draw calls). Then at runtime, just render to your shadow cascades from your pre-batched data. Time this and compare with what you had before.

But this is but one possible optimization. How are you currently doing your culling? For instance, describe the boundary of your shadow split light-space frustum.

That’s because I’m a beginner. :slight_smile:

Ok, will do that. I think it is the bottleneck.

Also I found some optimizations for shadow map building:

  • For closed models, cull back faces. (What does that mean?)
  • Turn off shading and color writes. (I already turn off color writes, but how do I turn off shading?)
  • Only send vertex positions. (Already done.)
  • Draw roughly front-to-back. (Does this mean sorting the objects by depth from the point of view? I have an idea about that.)

The idea is not to waste CPU time on sorting the cubes. I can use the reprojected depth map from the previous frame (the sun moves very slowly, so the reprojection result should be very good).
That way I can use the reprojected result as the depth buffer to reject rasterization of hidden fragments.
What do you think about it?

Every frame I calculate the frustum planes and test each object against them: if the object’s center lies outside a plane by more than its bounding radius, the object is rejected.
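For reference, the test described above can be sketched roughly like this in plain Java. This is only an illustration, not the poster's actual code: the class name, the plane layout (a, b, c, d) with unit-length normals pointing into the frustum, and the `boxPlanes` helper (an axis-aligned box standing in for an orthographic cascade volume) are all assumptions.

```java
final class SphereFrustumCuller {

    /**
     * Each plane is {a, b, c, d} for a*x + b*y + c*z + d = 0.
     * Assumption: (a, b, c) is unit length and points INTO the frustum,
     * so a negative signed distance means "outside this plane".
     */
    static boolean isSphereVisible(float[][] planes, float cx, float cy, float cz, float radius) {
        for (float[] p : planes) {
            float signedDistance = p[0] * cx + p[1] * cy + p[2] * cz + p[3];
            if (signedDistance < -radius) {
                return false; // the whole sphere is behind this plane
            }
        }
        return true; // inside or intersecting the volume
    }

    /** Hypothetical helper: six inward-facing planes of an axis-aligned box. */
    static float[][] boxPlanes(float minX, float maxX, float minY, float maxY, float minZ, float maxZ) {
        return new float[][] {
            { 1, 0, 0, -minX}, {-1, 0, 0, maxX},
            { 0, 1, 0, -minY}, { 0, -1, 0, maxY},
            { 0, 0, 1, -minZ}, { 0, 0, -1, maxZ},
        };
    }
}
```

Note that this test is conservative: a sphere near a frustum corner can pass all six plane tests while still being outside, which is usually acceptable for culling.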

glEnable(GL_CULL_FACE)
glCullFace(GL_BACK)

This requires that faces have a consistent vertex ordering, which should be specified via glFrontFace().

If faces have a consistent vertex ordering then after projection all front faces will be counter-clockwise and all back faces will be clockwise, or vice versa. For a closed mesh, you’ll never see back faces (they will always be occluded by a closer front face), so there’s no need to draw them.
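To make the winding idea concrete, here is a small plain-Java sketch of the sign test that decides whether a projected triangle is counter-clockwise. It is only an illustration of the principle (the GPU does this for you once face culling is enabled); the class and method names are made up, and it assumes 2D coordinates with y pointing up, as in OpenGL window space. With the default glFrontFace(GL_CCW), triangles for which this returns false would be the ones removed by glCullFace(GL_BACK).

```java
final class Winding {

    /**
     * Twice the signed area of the 2D triangle (a, b, c).
     * Positive: vertices run counter-clockwise (y-up coordinates).
     * Negative: vertices run clockwise.
     */
    static float signedArea2(float ax, float ay, float bx, float by, float cx, float cy) {
        return (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);
    }

    /** True if the projected triangle is counter-clockwise. */
    static boolean isCounterClockwise(float ax, float ay, float bx, float by, float cx, float cy) {
        return signedArea2(ax, ay, bx, by, cx, cy) > 0f;
    }
}
```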

In the core profile, don’t have a fragment shader. In the compatibility profile, have a fragment shader with an empty main() function and “layout(early_fragment_tests) in;”. The implementation will calculate depth by itself, and the colour doesn’t matter if colour writes are disabled. However, this won’t work if you’re using [var]discard[/var] e.g. for alpha testing. If that’s the case, render any surfaces which actually depend upon alpha testing separately.

Yes. Although this doesn’t matter if you’re using the core profile and don’t have a fragment shader. If you render front-to-back, any primitives which are occluded will fail the depth test (because the occluding primitives will already have been rendered into the depth buffer), so the fragment shader won’t be invoked.
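If you do want to try a rough front-to-back order for the shadow pass, one possible sketch is to sort by the projection of each object's center onto the light direction. Everything here is an assumption for illustration: the `Item` class is a minimal stand-in (not the `GameItem` from the thread), and (dx, dy, dz) is taken to be the direction the light travels, so smaller dot products are nearer to the light.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FrontToBackSorter {

    /** Hypothetical minimal item: a name and a world-space center. */
    static final class Item {
        final String name;
        final float x, y, z;

        Item(String name, float x, float y, float z) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /**
     * Returns a new list ordered by increasing depth along the light
     * direction (dx, dy, dz), i.e. items nearest to the light come first.
     */
    static List<Item> sortFrontToBack(List<Item> items, float dx, float dy, float dz) {
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingDouble((Item i) -> i.x * dx + i.y * dy + i.z * dz));
        return sorted;
    }
}
```

Since only a rough order is needed, sorting coarse groups of objects (or sorting only occasionally) may be enough; whether the sort pays for itself is something to measure.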

This only works if none of the shadow casters move. If the shadow casters are simple (and you optimise their rendering sufficiently), re-projection might be more expensive than rendering. And re-projection is inaccurate, so you can’t keep doing it indefinitely; you will need to render occasionally to prevent errors from accumulating.

For optimizations:

  1. For closed models, cull back faces
  2. Turn off shading, color writes
  3. Only send vertex positions
  4. Draw roughly front-to-back

I had already implemented 1 and 3, and the color-write disabling from 2.

About “Turn off shading”:
I just removed the only statement in the fragment shader, “gl_FragDepth = gl_FragCoord.z;”, so the shader body became empty.
After that, my directional light shadow rendering sped up from 29.7 ms to 23.7 ms. (On an Intel GPU the effect is much larger.)

Now I’m trying to implement batching for drawing.
There are three ways to send per-instance data to the GPU: “Buffer Texture”, “Uniform Buffer Object” and “attribute divisor”.
Which one should I prefer?

I want to implement it for the depth maps and also for scene rendering.
My depth shader is very simple:


#version 330

layout (location=0) in vec3 position;
layout (location=1) in vec2 texCoord;
layout (location=2) in vec3 vertexNormal;

uniform mat4 modelMatrix;
uniform mat4 projectionMatrix;
uniform mat4 projLightViewMatrix;

void main() {
    gl_Position = projLightViewMatrix * modelMatrix * vec4(position, 1.0f);
}

But my scene shader is more complicated:


#version 330

const int DL_NUM_CASCADES = 3;

layout (location=0) in vec3 position;
layout (location=1) in vec2 texCoord;
layout (location=2) in vec3 vertexNormal;
layout (location=3) in float boundingRadius;

out vec2 gTexCoord;
out vec3 gMvVertexNormal;
out vec3 gMvVertexPos;
out vec4 gMlightviewVertexPos[DL_NUM_CASCADES];
out mat4 gModelViewMatrix;
out vec4 gWPosition;


uniform mat4 viewMatrix;
uniform mat4 modelMatrix;
uniform mat4 projectionMatrix;
uniform mat4 lightViewMatrix[DL_NUM_CASCADES];
uniform mat4 orthoProjectionMatrix[DL_NUM_CASCADES];
uniform bool occlusionCulling;

void main() {

    gTexCoord = texCoord;
    gWPosition = modelMatrix * vec4(position, 1.0);

    mat4 modelViewMatrix =  viewMatrix * modelMatrix;
    vec4 mvPos = modelViewMatrix * vec4(position, 1.0);
    gl_Position = projectionMatrix * mvPos;
    gMvVertexNormal = normalize(modelViewMatrix * vec4(vertexNormal, 0.0)).xyz;
    gMvVertexPos = mvPos.xyz;

    for (int i = 0 ; i < DL_NUM_CASCADES ; i++) {
        gMlightviewVertexPos[i] = orthoProjectionMatrix[i] * lightViewMatrix[i] * gWPosition;
    }
    gModelViewMatrix = modelViewMatrix;
}

[QUOTE=nimelord;1291062]Now I’m trying to implement batch operation for drawing.
There are three variants of send “per instance data” to GPU: “Buffer Texture” , “Uniform Buffer Object” and “attribute divisor”.
Which way I should prefer?[/QUOTE]

Actually, if you just want to get something up quickly to test, just append all of the vertices for the culled-in primitives into a single shared VBO (or a small group of them), and then launch one (or a few) large draw calls, instead of the 15,000+ draw calls you were doing. (**) Initially I’d just make the assumption/simplification that their vertex positions are all located in a shared OBJECT SPACE coord frame, so you can use the same ModelViewProj transform for all of them.

The goal here isn’t to engineer the final solution, but rather to establish that batching your data better up-front will improve your shadow rasterization performance.
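As a rough idea of what that pre-batching test could look like on the CPU side, here is a plain-Java sketch that bakes each model matrix into the vertex positions and rebases the indices, so the result can be uploaded once and drawn with a single glDrawElements call. The class and method names are invented for illustration, and the matrices are assumed to be column-major float[16] arrays (the layout OpenGL expects).

```java
final class StaticBatcher {

    /** Merged geometry: one position array and one index array. */
    static final class Batch {
        final float[] positions; // x, y, z per vertex, already in the shared space
        final int[] indices;

        Batch(float[] positions, int[] indices) {
            this.positions = positions;
            this.indices = indices;
        }
    }

    /**
     * Copies one mesh (e.g. a cube) once per model matrix, transforming the
     * positions on the CPU and offsetting the indices for each copy.
     */
    static Batch merge(float[] meshPositions, int[] meshIndices, float[][] modelMatrices) {
        int vertsPerMesh = meshPositions.length / 3;
        float[] positions = new float[modelMatrices.length * meshPositions.length];
        int[] indices = new int[modelMatrices.length * meshIndices.length];

        for (int m = 0; m < modelMatrices.length; m++) {
            float[] mat = modelMatrices[m];
            int vertexBase = m * vertsPerMesh;

            for (int v = 0; v < vertsPerMesh; v++) {
                float x = meshPositions[3 * v];
                float y = meshPositions[3 * v + 1];
                float z = meshPositions[3 * v + 2];
                int o = 3 * (vertexBase + v);
                // Column-major 4x4 times (x, y, z, 1); the w row is ignored (affine assumed).
                positions[o]     = mat[0] * x + mat[4] * y + mat[8]  * z + mat[12];
                positions[o + 1] = mat[1] * x + mat[5] * y + mat[9]  * z + mat[13];
                positions[o + 2] = mat[2] * x + mat[6] * y + mat[10] * z + mat[14];
            }

            int indexBase = m * meshIndices.length;
            for (int i = 0; i < meshIndices.length; i++) {
                indices[indexBase + i] = meshIndices[i] + vertexBase;
            }
        }
        return new Batch(positions, indices);
    }
}
```

This only makes sense for geometry that does not move, and it trades memory for fewer draw calls, which is exactly what the experiment is meant to measure.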

If you see an improvement, then perhaps you might want to explore whether you can get even better performance by using instancing on the GPU.

As far as which form, there are really only two:

[ol]
[li]Ones that use attribute divisor (which “push” the per-instance data into the shader), and [/li]
[li]Ones that use gl_InstanceID in the shader to go lookup the instance data from someplace else (which “pull” the per-instance data into the shader). [/li]
[/ol]

Examples of “someplace else” for #2 include textures, buffer objects, standard uniforms, images, …wherever.
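Whichever form is chosen, the CPU side ends up packing the per-instance data into one contiguous buffer. A minimal plain-Java sketch of that packing step, assuming one column-major mat4 (float[16]) per instance, might look like the following. The class and method names are made up; the GL calls are mentioned only in comments because they need a live context.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class InstanceDataPacker {

    static final int FLOATS_PER_MAT4 = 16;
    static final int BYTES_PER_MAT4 = FLOATS_PER_MAT4 * Float.BYTES; // 64

    /** Packs the matrices back to back into a direct, native-order buffer. */
    static FloatBuffer pack(float[][] modelMatrices) {
        FloatBuffer buffer = ByteBuffer
                .allocateDirect(modelMatrices.length * BYTES_PER_MAT4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (float[] m : modelMatrices) {
            if (m.length != FLOATS_PER_MAT4) {
                throw new IllegalArgumentException("expected 16 floats per matrix");
            }
            buffer.put(m);
        }
        buffer.flip(); // ready for reading, e.g. by glBufferData
        return buffer;
    }

    /**
     * Byte offset of one matrix column inside a packed matrix.
     * For the "push" form, a mat4 attribute occupies four consecutive
     * attribute locations; each column would be set up with
     * glVertexAttribPointer(location + column, 4, GL_FLOAT, false,
     * BYTES_PER_MAT4, columnOffsetBytes(column)) followed by
     * glVertexAttribDivisor(location + column, 1).
     */
    static long columnOffsetBytes(int column) {
        return (long) column * 4 * Float.BYTES;
    }
}
```

For the "pull" form the same buffer could back a buffer texture, UBO or SSBO, with the shader indexing it by gl_InstanceID.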

(**) If you’re targeting NVidia GPUs, I recommend using their bindless vertex attributes capability. This allows you to make your VBOs GPU-resident and to dispatch draw calls on VBOs using their GPU addresses rather than indirectly via VBO handles and offsets. The more batches you have to render, the more this pays big dividends. Years ago when I was working with it, I saw gains of up to 2X by dispatching lots of batches per frame this way. That said, the more general and cross-GPU-vendor solution to this performance problem is to batch your data better so that you’re not so at the mercy of per-draw-call overhead in the driver.

I implemented it via vertex shader attributes:


#version 430

layout (location=0) in vec3 position;
layout (location=5) in mat4 modelMatrix;

uniform mat4 projLightViewMatrix;


void main() {
    gl_Position = projLightViewMatrix * modelMatrix * vec4(position, 1.0f);
}

It sped up about 2x, from 20.285 ms to 9.293 ms.
That is good, I think.

Now I’m trying to rework it to use a UBO:


#version 430

layout (location=0) in vec3 position;

layout (std140) uniform vertexUniformBlock {
  mat4 modelMatrix;
} vb;

uniform mat4 projLightViewMatrix;


void main() {
  gl_Position = projLightViewMatrix * vb.modelMatrix * vec4(position, 1.0f);
}


    int uniformBlockBinding = 0;
    String uniformBlockName = "vertexUniformBlock";
    int uniformBlockIndex = shaderProgram.getUniformBlockIndex(uniformBlockName);
    if (uniformBlockIndex == GL_INVALID_INDEX) {
      throw new IllegalStateException("Invalid uniform block name: " + uniformBlockName);
    }
    glUniformBlockBinding(shaderProgram.getProgramId(), uniformBlockIndex, uniformBlockBinding);
    int bufferId = glGenBuffers();
    glBindBuffer(GL_UNIFORM_BUFFER, bufferId);
    glBindBufferBase(GL_UNIFORM_BUFFER, uniformBlockBinding, bufferId);
    glBufferData(GL_UNIFORM_BUFFER, buff, GL_DYNAMIC_DRAW);
    glDrawElementsInstanced(GL_TRIANGLES, getVertexCount(), GL_UNSIGNED_INT, 0, items.size());
    glBindBuffer(GL_UNIFORM_BUFFER, 0);

It looks like the VS has not received the UBO data.

The buff is definitely filled.
Can you spot the problem?
Maybe I skipped some mandatory GL call?

Thank you for any help.

I found out that a UBO cannot be used here, because a uniform block can hold only 65536 bytes on my hardware.
That is too small for my case.
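The arithmetic behind that limit, as a small plain-Java sketch (names are made up; 64 bytes is the std140 size of one mat4, and 65536 is the limit quoted above, which would normally be queried via GL_MAX_UNIFORM_BLOCK_SIZE):

```java
final class UniformBlockBudget {

    /** std140: a mat4 is four vec4 columns of 16 bytes each. */
    static final int MAT4_BYTES_STD140 = 64;

    /** How many model matrices fit into one uniform block of the given size. */
    static int maxMatricesPerBlock(int maxUniformBlockSizeBytes) {
        return maxUniformBlockSizeBytes / MAT4_BYTES_STD140;
    }

    /** How many block-sized batches are needed to cover all instances. */
    static int batchesNeeded(int instanceCount, int maxUniformBlockSizeBytes) {
        int perBlock = maxMatricesPerBlock(maxUniformBlockSizeBytes);
        return (instanceCount + perBlock - 1) / perBlock; // ceiling division
    }
}
```

With 65536 bytes this gives 1024 matrices per block, so the far cascade (5439 items) would need at least 6 separate instanced draws if a UBO were used.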

What about an SSBO, can I use it instead of a UBO?

Should be able to. Max size is one of the key differences between them: SSBO (OpenGL Wiki). Just keep in mind that it may be slower to access than UBO.

I have implemented “instanced rendering” instead of “100500 draw calls”.

It sped up rendering: about +50% FPS.

The profiling values are in microseconds (10^-6 s).

Many draw calls:


  L--49402: Renderer.render
    L--6742: PointLightShadowRenderer.render
      L--6726: PointLightShadowRenderer.renderLight
        L--1267: PointLightShadowRenderer.renderCubeFace[0]
        L--1039: PointLightShadowRenderer.renderCubeFace[1]
        L--727: PointLightShadowRenderer.renderCubeFace[2]
        L--704: PointLightShadowRenderer.renderCubeFace[3]
        L--897: PointLightShadowRenderer.renderCubeFace[4]
        L--2035: PointLightShadowRenderer.renderCubeFace[5]
    L--21655: DirectLightShadowRenderer.render
      L--36: DirectLightShadowRenderer.update
      L--2010: DirectLightShadowRenderer.renderCascade[0]
        L--1046: DirectLightShadowRenderer.renderMeshes
      L--10837: DirectLightShadowRenderer.renderCascade[1]
        L--8509: DirectLightShadowRenderer.renderMeshes
      L--8695: DirectLightShadowRenderer.renderCascade[2]
        L--7118: DirectLightShadowRenderer.renderMeshes
    L--1045: SpotLightShadowRenderer.render
      L--986: SpotLightShadowRenderer.renderLight
    L--18627: Renderer.renderScene

Instanced rendering:


  L--31372: Renderer.render: [23432 - 132218]
    L--18981: PointLightShadowRenderer.render: [8523 - 32116]
      L--18948: PointLightShadowRenderer.renderLight: [8490 - 31421]
        L--3062: PointLightShadowRenderer.renderCubeFace[0]: [1306 - 9701]
        L--2264: PointLightShadowRenderer.renderCubeFace[1]: [1258 - 6681]
        L--2991: PointLightShadowRenderer.renderCubeFace[2]: [1031 - 6027]
        L--4238: PointLightShadowRenderer.renderCubeFace[3]: [1095 - 7913]
        L--3800: PointLightShadowRenderer.renderCubeFace[4]: [1057 - 5932]
        L--2436: PointLightShadowRenderer.renderCubeFace[5]: [1105 - 8315]
    L--7855: DirectLightShadowRenderer.render: [5803 - 30812]
      L--42: DirectLightShadowRenderer.update: [12 - 688]
      L--1895: DirectLightShadowRenderer.renderCascade[0]: [1193 - 4354]
        L--216: DirectLightShadowRenderer.renderMeshes: [106 - 693]
      L--2875: DirectLightShadowRenderer.renderCascade[1]: [2114 - 24756]
        L--1152: DirectLightShadowRenderer.renderMeshes: [728 - 11375]
      L--2968: DirectLightShadowRenderer.renderCascade[2]: [2343 - 6624]
        L--1179: DirectLightShadowRenderer.renderMeshes: [783 - 4185]
    L--1143: SpotLightShadowRenderer.render: [950 - 2457]
      L--1093: SpotLightShadowRenderer.renderLight: [916 - 2228]
    L--1003: Renderer.renderScene: [682 - 11126]

So, frame rendering sped up from 49.4 to 31.3 msec.

Thank you guys!

PS: With instanced rendering I saw no performance difference between using VS attributes and an SSBO.

After the instanced rendering optimization was implemented, the next bottlenecks I found were in the “cull” calls.
The “cull” is frustum culling on the CPU side, running in a single thread.

Performance profile, in microseconds:


  L--30858: Renderer.render
    L--1346: Renderer.cull
    L--18618: PointLightShadowRenderer.render
      L--18577: PointLightShadowRenderer.renderLight
        L--3105: PointLightShadowRenderer.renderCubeFace[0]
          L--2574: PointLightShadowRenderer.cull
        L--2325: PointLightShadowRenderer.renderCubeFace[1]
          L--1888: PointLightShadowRenderer.cull
        L--2986: PointLightShadowRenderer.renderCubeFace[2]
          L--2870: PointLightShadowRenderer.cull
        L--3672: PointLightShadowRenderer.renderCubeFace[3]
          L--3415: PointLightShadowRenderer.cull
        L--3433: PointLightShadowRenderer.renderCubeFace[4]
          L--2933: PointLightShadowRenderer.cull
        L--2868: PointLightShadowRenderer.renderCubeFace[5]
          L--2345: PointLightShadowRenderer.cull
    L--7320: DirectLightShadowRenderer.render
      L--71: DirectLightShadowRenderer.update
      L--1620: DirectLightShadowRenderer.renderCascade[0]
        L--1298: DirectLightShadowRenderer.cull
        L--266: DirectLightShadowRenderer.renderMeshes
      L--2754: DirectLightShadowRenderer.renderCascade[1]
        L--1516: DirectLightShadowRenderer.cull
        L--1197: DirectLightShadowRenderer.renderMeshes
      L--2790: DirectLightShadowRenderer.renderCascade[2]
        L--1550: DirectLightShadowRenderer.cull
        L--1199: DirectLightShadowRenderer.renderMeshes
    L--1014: SpotLightShadowRenderer.render
      L--955: SpotLightShadowRenderer.renderLight
        L--889: SpotLightShadowRenderer.cull
    L--128: ReprojectionRenderer.render
      L--44: ReprojectionRenderer.renderReprojectedFrame
      L--69: ReprojectionRenderer.renderMipmap
    L--1034: Renderer.renderScene

And that pushes me to do some optimization here.

My shadow mapping has no fragment shader (it just uses the default one from the pipeline).
Does it make sense to perform frustum/occlusion culling on the GPU side for shadow mapping?

Core profile or compatibility profile? The core profile doesn’t have a default fragment shader, but that doesn’t matter if you only need depth. If it’s the compatibility profile, you should provide a fragment shader with an empty main() function and “layout(early_fragment_tests) in;”

The GPU will perform frustum culling on individual primitives. If much of the scene is outside the lighting frustum, there will be some benefit to culling larger groups of geometry; the main question is whether you can do that without the cost outweighing the benefits.

As GClements said, it depends. It makes sense to try it at least. I’ve applied it to a similar case when rendering thousands of objects using instanced rendering to both cull and LOD-bin the instances into draw batches, and it resulted in a huge frametime savings. You just have to decide if you can make the per-instance culling cheap enough to do efficiently in a shader (e.g. bounding sphere-frustum checks). Transform feedback with a geometry shader works well here.
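To give an idea of the per-instance work such a GPU pass would do, here is a CPU reference model in plain Java: a bounding sphere-frustum check followed by binning the survivors into LOD lists by distance. This is not GPU code and not the implementation referred to above, just a sketch under assumed conventions (inward-facing unit-length planes as {a, b, c, d}, spheres packed as x, y, z, radius, ascending LOD distance thresholds); all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CullAndLodBinner {

    /**
     * Returns one list of instance indices per LOD level.
     * Instances outside the frustum, or farther than the last threshold,
     * end up in no list.
     */
    static List<List<Integer>> bin(float[] spheres, float[][] planes,
                                   float eyeX, float eyeY, float eyeZ,
                                   float[] lodDistances) {
        List<List<Integer>> bins = new ArrayList<>();
        for (int k = 0; k < lodDistances.length; k++) {
            bins.add(new ArrayList<>());
        }
        int count = spheres.length / 4;
        for (int i = 0; i < count; i++) {
            float x = spheres[4 * i];
            float y = spheres[4 * i + 1];
            float z = spheres[4 * i + 2];
            float r = spheres[4 * i + 3];
            if (!isVisible(planes, x, y, z, r)) {
                continue; // culled
            }
            float dx = x - eyeX, dy = y - eyeY, dz = z - eyeZ;
            float distance = (float) Math.sqrt(dx * dx + dy * dy + dz * dz);
            for (int k = 0; k < lodDistances.length; k++) {
                if (distance < lodDistances[k]) {
                    bins.get(k).add(i); // first matching (nearest) LOD wins
                    break;
                }
            }
        }
        return bins;
    }

    /** Sphere is rejected if it lies entirely behind any plane. */
    private static boolean isVisible(float[][] planes, float x, float y, float z, float r) {
        for (float[] p : planes) {
            if (p[0] * x + p[1] * y + p[2] * z + p[3] < -r) {
                return false;
            }
        }
        return true;
    }
}
```

Each resulting list corresponds to one instanced draw batch; on the GPU the same decision would be made per instance in a shader and the surviving instance data written out to buffers.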

Will try it.
Thank you, guys!