Early Z rejection

Hi,

I’m trying to implement early Z rejection in my application to discard unnecessary fragment computations.

As a test I’ve created a slow convolution fragment shader and inserted the following four lines just before rendering:

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glClearDepth(0.4f);
glClear(GL_DEPTH_BUFFER_BIT);

I draw at depth 0.5 so this should reject the entire rectangle. This correctly omits rendering in the generated image but it seems to take exactly the same amount of time as it did when not rejecting, shaders included.

I can only conclude that early Z rejection is not taking place but I can’t see why. I’m testing this on a GeForce 6800GT with NVIDIA’s 77.77 drivers. Any ideas?

A bit naive, but maybe you modify the Z value within your shader?

No, that’d just be too easy. :smiley:

Did you try doing the z-test inside your fragment shader? That should disable the early z-test, since the early test is done before the fragment shader runs; then you can compare the two timings.
Also, maybe because you only draw a rectangle, the difference isn’t really noticeable.
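
For illustration, a manual test in the shader might look something like this (a sketch only; depthThreshold is a made-up uniform standing in for the 0.4 clear value, and the discard itself tends to disable the hardware’s early test):

uniform float depthThreshold;  // hypothetical uniform, e.g. set to the 0.4 clear value

void main() {
  // Manual "late" depth test: the fragment is only rejected after it has been shaded.
  if (gl_FragCoord.z > depthThreshold)
    discard;

  // ...expensive per-fragment work would go here...
  gl_FragColor = vec4(1.0);
}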

Hope that helps.

EDIT:

Also check for that:

http://home.dataparty.no/kristian/school/ntnu/masterthesis/blog-week39.php

There’s no depth testing in the shader. Here is the shader, if you’re curious:

uniform sampler2DRect imageTex, kernelTex;

void main() {
  half3 accum = half3(0.0);

  // 13x13 convolution: one image tap and one kernel tap per iteration
  for(int y = 0; y < 13; ++y) {
    for(int x = 0; x < 13; ++x) {
      half3 sample = texture2DRect(imageTex, gl_TexCoord[0].st - half2(x - 6, y - 6)).rgb;
      accum += sample * texture2DRect(kernelTex, half2(y * 13 + x, 1)).rgb;
    }
  }

  // write the result so the loops aren't optimised away
  gl_FragColor = half4(accum, 1.0);
}

(and yes, I’m aware it’s not a terribly efficient convolution; that’s the point, I wanted something slow enough to see the early rejection taking place :wink: )

The difference certainly should be visible with a rectangle. The fragment shader takes more than a second to run on a 2K image, and the difference with the shader disabled is immense.

I’ve checked the link and my code already satisfies all the conditions, as far as I’m aware. This is roughly what the code does:

glUseProgramObjectARB(convolveProgram);
(...set some parameters...)

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glClearDepth(0.4f);
glClear(GL_DEPTH_BUFFER_BIT);
glDepthMask(GL_FALSE);

glViewport(0, 0, 2048, 1536);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(-1, 1, 1, -1, -10.0f, 10.0f);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();

glBegin(GL_QUADS);
  glTexCoord2f(0.0f, 1536.0f);
  glVertex3f(-1.0f, -1.0f, 0.0f);

  glTexCoord2f(2048.0f, 1536.0f);
  glVertex3f(1.0f, -1.0f, 0.0f);

  glTexCoord2f(2048.0f, 0.0f);
  glVertex3f(1.0f, 1.0f, 0.0f);

  glTexCoord2f(0.0f, 0.0f);
  glVertex3f(-1.0f, 1.0f, 0.0f);
glEnd();
glFinish();

(not cut and pasted, so minor errors might have crept in)

Nothing there that should prevent early z kicking in, surely?

I recently worked on and developed a new terrain rendering algorithm. One of the optimizations I came up with was to sort the terrain patches front to back, so as to get some benefit from early z-rejection. However, I noticed that there was very little performance gain on nVidia hardware (I used a GFX 5700 Ultra): from an average of around 55 MTris/s to around 56 MTris/s. Although my fragment shader was not very expensive, it should theoretically still have given me better throughput considering the number of fragments I was sending down the pipeline. ATi hardware showed a bit more improvement, but still nothing satisfactory. This has led me to believe that early z-rejection is not as hot as most vendors claim it to be, even when the fragment part of the pipeline is the bottleneck (which it is in your case).
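
For reference, that kind of sort is just an ordering of the patch draw calls by distance to the eye before submission. A minimal sketch in C (the Patch structure and the function names are made up for illustration):

#include <stdlib.h>

/* Hypothetical patch record: centre position plus whatever is needed to draw it. */
typedef struct {
    float cx, cy, cz;   /* patch centre in world space */
    unsigned int vbo;   /* handle used by the draw call */
} Patch;

static float g_eye[3];  /* camera position for the current frame (qsort has no context argument) */

static float patch_dist2(const Patch *p) {
    float dx = p->cx - g_eye[0], dy = p->cy - g_eye[1], dz = p->cz - g_eye[2];
    return dx * dx + dy * dy + dz * dz;
}

/* qsort comparator: nearest patches first, so they are drawn first and
   fill the depth buffer before the farther (potentially occluded) ones. */
static int cmp_front_to_back(const void *a, const void *b) {
    float da = patch_dist2((const Patch *)a);
    float db = patch_dist2((const Patch *)b);
    return (da > db) - (da < db);
}

void sort_patches_front_to_back(Patch *patches, size_t count, const float eye[3]) {
    g_eye[0] = eye[0]; g_eye[1] = eye[1]; g_eye[2] = eye[2];
    qsort(patches, count, sizeof(Patch), cmp_front_to_back);
}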

Originally posted by Jay Cornwall:
I draw at depth 0.5 so this should reject the entire rectangle. This correctly omits rendering in the generated image but it seems to take exactly the same amount of time as it did when not rejecting, shaders included.
How are you measuring your time? Are you sure that you’re not synchronized to your monitor’s vertical refresh signal?
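
For what it’s worth, vsync can also be switched off explicitly to rule it out; a sketch assuming Windows and the WGL_EXT_swap_control extension:

#include <windows.h>

/* WGL_EXT_swap_control: a swap interval of 0 disables waiting for vblank. */
typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

void disable_vsync(void) {
    PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
        (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
    if (wglSwapIntervalEXT)
        wglSwapIntervalEXT(0);  /* present as fast as possible, no vblank wait */
}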

Time is measured by wrapping the entire section of code presented above in timing calls. It’s certainly not limited by vsync, as an example:

Without shader: 278ms
With shader, without depth testing: 1305ms
With shader, with depth testing: 1336ms

(variation in the higher figures is +/- 40ms, so as far as I can measure they are the same)

I’d have expected the third figure to be not so far off 300ms, in all honesty, but I’m not even seeing improvements of 100ms or so.

Zulfiqar,

Thanks for the info. I’ve been testing on a 6800GT and it appears that the situation isn’t much improved here either, unless I’m missing a condition necessary for early Z.

Perhaps an NVIDIA engineer could comment on the apparent lack of effect in this relatively simple case?

I had a project with an option to render to an FP texture. When FP was off, early-z worked just fine, around an 80% speed gain for a serious shader with 2x overdraw, but as soon as I switched to the path that used an FP renderbuffer for added HDR, early-z stopped working…

Hi Jay,

This sounds like a potential driver bug. If instead of clearing the depth to 0.4, you clear it to 1.0 and then draw (depth only) a fullscreen quad at 0.4, does this change the results?
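
In code, that variant would look roughly like this (a sketch only; note that with the glOrtho above, eye-space z maps to window depth 0.5 - z/20, so a vertex z of 2.0 is what lands the pre-pass quad at depth 0.4, in front of the expensive quad’s 0.475):

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glClearDepth(1.0f);                 /* conventional clear to the far plane */
glClear(GL_DEPTH_BUFFER_BIT);

/* Depth-only pre-pass: no colour writes, no expensive shader bound. */
glUseProgramObjectARB(0);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glBegin(GL_QUADS);
  glVertex3f(-1.0f, -1.0f, 2.0f);   /* eye z = 2.0 -> window depth 0.4 under the glOrtho above */
  glVertex3f( 1.0f, -1.0f, 2.0f);
  glVertex3f( 1.0f,  1.0f, 2.0f);
  glVertex3f(-1.0f,  1.0f, 2.0f);
glEnd();

/* Restore state and draw the convolution quad at z = 0.5 exactly as before;
   its fragments should now fail the LEQUAL test against the pre-pass depth. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);
glUseProgramObjectARB(convolveProgram);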

If you could email me your example program, we could have a look at what’s really happening in the driver and hardware.

Thanks -
Cass

I’ve replied off-forum with a test case.

I ran the same tests a year ago and had the same results… keep us updated!

Originally posted by Jay Cornwall:
(and yes, I’m aware it’s not a terribly efficient convolution; that’s the point, I wanted something slow enough to see the early rejection taking place :wink: )

Maybe your shader is just too expensive. Your shader has 2*169 indirect texture accesses, which seems to be more than a challenge for hardware acceleration. So if the shader falls back to software it will probably omit early z-rejection too (that’s executed only in hardware!?).

Repeat your test with a less expensive shader (just 8 indirect texture accesses).
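
Something along these lines would do as a cheaper probe (a sketch reusing imageTex from Jay’s shader, with only a handful of taps):

uniform sampler2DRect imageTex;

void main() {
  // Cheap probe: 8 fixed taps, nowhere near any instruction limit, so it
  // certainly stays on the hardware path; if early-z kicks in, enabling the
  // depth test should still show a measurable drop in fill time.
  vec3 accum = vec3(0.0);
  for (int i = 0; i < 8; ++i)
    accum += texture2DRect(imageTex, gl_TexCoord[0].st + vec2(float(i), 0.0)).rgb;
  gl_FragColor = vec4(accum / 8.0, 1.0);
}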

Originally posted by ashaman:
Maybe your shader is just too expensive. Your shader has 2*169 indirect texture accesses, which seems to be more than a challenge for hardware acceleration. So if the shader falls back to software it will probably omit early z-rejection too (that’s executed only in hardware!?).

No, NVIDIA chips have no restriction on number of texture accesses other than the maximum number of executed instructions.
I wouldn’t call the texture accesses in the given shader indirect, because the result of one texture lookup is not used to lookup another.
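
To illustrate the distinction (illustrative snippets only, not from the thread; lookupTex is a made-up sampler):

uniform sampler2DRect imageTex, lookupTex;

void main() {
  // Independent reads: every coordinate is known before any texturing happens,
  // so the lookups can be scheduled freely (this is what Jay's shader does).
  vec3 a = texture2DRect(imageTex, gl_TexCoord[0].st).rgb;
  vec3 b = texture2DRect(imageTex, gl_TexCoord[0].st + vec2(1.0, 0.0)).rgb;

  // Dependent read: the first lookup's result feeds the second one's coordinate.
  vec2 offset = texture2DRect(lookupTex, gl_TexCoord[0].st).rg * 16.0;
  vec3 c = texture2DRect(imageTex, gl_TexCoord[0].st + offset).rgb;

  gl_FragColor = vec4(a + b + c, 1.0);
}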

They’re called ‘dependent texture reads’, not ‘indirect texture reads’, aren’t they? Or have the semantics changed again?

Originally posted by Jay Cornwall:
Time is measured by wrapping the entire section of code presented above in timing calls. It’s certainly not limited by vsync, as an example:

Without shader: 278ms
With shader, without depth testing: 1305ms
With shader, with depth testing: 1336ms
Without shader 278ms??? For one frame? That’s less than 4 frames a second! Sounds like you’re falling off the hardware path into software rendering…

Originally posted by Zulfiqar Malik:
However i noticed that there was very little performance gain on nVidia hardware (i used GFX 5700 Ultra) from an average of around 55MTris/s to around 56MTris/s.
At those triangle loads I would expect you to be limited by vertex shading, not fragment shading, so I’m not surprised if you don’t see much of an increase.

Originally posted by Jay Cornwall:
Time is measured by wrapping the entire section of code presented above in timing calls.
Do you call glFinish() before and after that section of code? You can’t rely on the numbers without it. Like this:

glFinish();
int startTime = timer();

// Code here ...

glFinish();
int endTime = timer();

Originally posted by Humus:

At those triangle loads I would expect you to be limited by vertex shading, not fragment shading, so I’m not surprised if you don’t see much of an increase.

My vertex shader is not heavy at all and I shouldn’t theoretically be limited by vertex shader performance considering the number of vertices current GPUs can process (my GPU is not exactly new :slight_smile: but it CAN theoretically process more than 250 million vertices per second). Secondly, I have also tried using the simplest vertex shader, which only does the modelview-projection transformation on the incoming vertex and passes it down the pipeline. I did not get any performance increase from that either! I also tested the code on a Radeon 9700 Pro, in which case I got approximately 75 MTris/s. I am also doing proper batching, and I have done that after testing on both nVidia and ATi hardware, so I don’t “think” batching is the bottleneck. But I am happy nonetheless because my terrain rendering algorithm is performing better than any other (including clipmaps), and I am currently writing a research paper on it. I hope to make it public in a few months’ time.

Originally posted by Zulfiqar Malik:
My vertex shader is not heavy at all and I shouldn’t theoretically be limited by vertex shader performance considering the number of vertices current GPUs can process (my GPU is not exactly new :slight_smile: but it CAN theoretically process more than 250 million vertices per second).

I’ve found with both the GeForce FX and GeForce 6 series that when pushing 1,000,000 triangles per frame, the rasterization of those triangles is more of a bottleneck than the actual vertex program. I still get over 100 million triangles a second, so I can’t complain too much :slight_smile:

The scene I’ve tested isn’t fill limited, as it’s a single texture with a depth complexity of 1. The pointer to rasterisation being the bottleneck is that when I move the geometry off the screen the frame rate steadily goes up. The vertices are all still passed into the vertex processor, and all the polygons are still clipped to the frustum, so it’s not clipping either. Since I’m not fill limited, the only explanation is a bottleneck between clipping and fragment processing, and the only candidate left is the rasteriser.

Back on topic… I’ve also struggled to see much improvement in tests with early z. From the card manufacturers I’d love to see a proper explanation of how the early z tests are implemented and how to make use of them.

Robert.