Early Z rejection



Jay Cornwall
08-31-2005, 05:09 AM
Hi,

I'm trying to implement early Z rejection in my application to discard unnecessary fragment computations.

As a test I've created a slow convolution fragment shader and inserted the following four lines just before rendering:

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glClearDepth(0.4f);
glClear(GL_DEPTH_BUFFER_BIT);

I draw at depth 0.5 so this should reject the entire rectangle. This correctly omits rendering in the generated image but it seems to take exactly the same amount of time as it did when not rejecting, shaders included.

I can only conclude that early Z rejection is not taking place but I can't see why. I'm testing this on a GeForce 6800GT with NVIDIA's 77.77 drivers. Any ideas?

ZbuffeR
08-31-2005, 05:33 AM
A bit naive, but maybe you modify the Z value within your shader?

Jay Cornwall
08-31-2005, 06:07 AM
No, that'd just be too easy. :D

jide
08-31-2005, 07:21 AM
Did you try doing the z-test inside your fragment shader? This should disable the early z-test, since the early z-test is done before the fragment shader. Then you can compare the two again.
But also, since you only draw a rectangle, the difference may not really be visible.

Hope that helps.

EDIT:

Also check for that:

http://home.dataparty.no/kristian/school/ntnu/masterthesis/blog-week39.php

Jay Cornwall
09-01-2005, 12:20 AM
There's no depth testing in the shader. Here is the shader, if you're curious:

uniform sampler2DRect imageTex, kernelTex;

void main() {
    half3 accum = half3(0.0);

    // 13x13 convolution: one image tap and one kernel tap per iteration.
    for(int y = 0; y < 13; ++y) {
        for(int x = 0; x < 13; ++x) {
            half3 sample = texture2DRect(imageTex, gl_TexCoord[0].st - half2(x - 6, y - 6)).rgb;
            accum += sample * texture2DRect(kernelTex, half2(y * 13 + x, 1)).rgb;
        }
    }

    gl_FragColor = vec4(accum, 1.0);   // write the result so the loop isn't optimized away
}

(and yes, I'm aware it's not a terribly efficient convolution; that's the point, I wanted something slow enough to see the early rejection taking place ;) )

The difference certainly should be visible with a rectangle. The fragment shader takes more than a second to run on a 2K image, and the difference with the shader disabled is immense.

I've checked the link and my code already satisfies all the conditions, as far as I'm aware. This is roughly what the code does:

glUseProgramObjectARB(convolveProgram);
(...set some parameters...)

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glClearDepth(0.4f);
glClear(GL_DEPTH_BUFFER_BIT);
glDepthMask(GL_FALSE);

glViewport(0, 0, 2048, 1536);
glMatrixMode(GL_PROJECTION);
glOrtho(-1, 1, 1, -1, -10.0f, 10.0f);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();

glBegin(GL_QUADS);
glTexCoord2f(0.0f, 1536.0f);
glVertex3f(-1.0f, -1.0f, 0.0f);

glTexCoord2f(2048.0f, 1536.0f);
glVertex3f(1.0f, -1.0f, 0.0f);

glTexCoord2f(2048.0f, 0.0f);
glVertex3f(1.0f, 1.0f, 0.0f);

glTexCoord2f(0.0f, 0.0f);
glVertex3f(-1.0f, 1.0f, 0.0f);
glEnd();
glFinish();

(not cut and pasted, so minor errors might have crept in)

Nothing there that should prevent early z kicking in, surely?

Zulfiqar Malik
09-01-2005, 08:11 PM
I recently developed a new terrain rendering algorithm. One of the optimizations I came up with was to sort the terrain patches front to back so as to get some benefit from early z-rejection. However, I noticed that there was very little performance gain on nVidia hardware (I used a GFX 5700 Ultra), from an average of around 55MTris/s to around 56MTris/s. Although my fragment shader was not very expensive, it should theoretically have given me better throughput considering the number of fragments I was sending down the pipeline. ATi hardware showed a bit more improvement, but still not a satisfactory one. This has led me to believe that early z-rejection is not as hot as most vendors claim it to be, even when the fragment part of the pipeline is the bottleneck (which it is in your case).
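The front-to-back sort itself is nothing fancy; here's a minimal sketch of the idea, where Patch, drawPatch and the camera position are all made-up names, not my actual code:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical patch record; only the pieces the sort needs.
struct Patch { float center[3]; /* ... mesh handles ... */ };

void drawPatch(const Patch &patch);   // placeholder: issues the GL calls for one patch

struct NearerTo {
    const float *cam;
    explicit NearerTo(const float *c) : cam(c) {}
    float distSq(const Patch &p) const {
        float dx = p.center[0] - cam[0], dy = p.center[1] - cam[1], dz = p.center[2] - cam[2];
        return dx * dx + dy * dy + dz * dz;
    }
    bool operator()(const Patch &a, const Patch &b) const { return distSq(a) < distSq(b); }
};

void drawFrontToBack(std::vector<Patch> &patches, const float camPos[3])
{
    // Nearest patches first: they fill the depth buffer, so early Z can
    // reject occluded fragments of the farther patches drawn afterwards.
    std::sort(patches.begin(), patches.end(), NearerTo(camPos));
    for (std::size_t i = 0; i < patches.size(); ++i)
        drawPatch(patches[i]);
}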

al_bob
09-01-2005, 08:15 PM
Originally posted by Jay Cornwall:
I draw at depth 0.5 so this should reject the entire rectangle. This correctly omits rendering in the generated image but it seems to take exactly the same amount of time as it did when not rejecting, shaders included.

How are you measuring your time? Are you sure you're not synchronized to your monitor's vertical refresh signal?

Jay Cornwall
09-01-2005, 11:52 PM
Time is measured by wrapping the entire section of code presented above in timing calls. It's certainly not limited by vsync, as an example:

Without shader: 278ms
With shader, without depth testing: 1305ms
With shader, with depth testing: 1336ms

(variation of the higher figures is +/- 40ms, so to a measurable degree they are the same)

I'd have expected the third figure to be not so far off 300ms, in all honesty, but I'm not even seeing improvements of 100ms or so.

Zulfiqar,

Thanks for the info. I've been testing on a 6800GT and it appears that the situation isn't much improved here either, unless I'm missing a condition necessary for early Z.

Perhaps an NVIDIA engineer could comment on the apparent ineffectiveness in this relatively simple case?

M/\dm/\n
09-02-2005, 01:03 AM
I had a project with an option to render to an FP texture. When FP was off, early-z worked just fine, around an 80% speed gain for a serious shader and 2x overdraw, but as soon as I switched to the path that used an FP renderbuffer for added HDR, early z stopped working...

cass
09-02-2005, 04:39 AM
Hi Jay,

This sounds like a potential driver bug. If instead of clearing the depth to 0.4, you clear it to 1.0 and then draw (depth only) a fullscreen quad at 0.4, does this change the results?
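Roughly, in code, the variant I mean (just a sketch; drawQuadAtDepth() is a stand-in for however you draw your quads):

// Clear to the far plane instead of 0.4.
glClearDepth(1.0);
glClear(GL_DEPTH_BUFFER_BIT);

// Depth-only pass: lay down 0.4 across the screen.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
drawQuadAtDepth(0.4f);      // stand-in: fullscreen quad, no expensive shader bound

// Now the expensive pass at 0.5; every fragment should fail 0.5 <= 0.4 early.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);
glUseProgramObjectARB(convolveProgram);
drawQuadAtDepth(0.5f);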

If you could email me your example program, we could have a look at what's really happening in the driver and hardware.

Thanks -
Cass

Jay Cornwall
09-02-2005, 07:14 AM
I've replied off-forum with a test case.

divide
09-03-2005, 12:00 AM
I ran the same tests a year ago and had the same results... keep us updated!

ashaman
09-05-2005, 08:08 AM
Originally posted by Jay Cornwall:
(and yes, I'm aware it's not a terribly efficient convolution; that's the point, I wanted something slow enough to see the early rejection taking place ;) )
Maybe your shader is just too expensive. Your shader has 2*169 indirect texture accesses; that seems to be more than a challenge for hardware acceleration. So if the shader falls back to software, it will probably omit early z-rejection too (executed only in hardware!?).

Repeat your test with a less expensive shader (just 8 indirect texture accesses).

Relic
09-05-2005, 09:59 PM
Originally posted by ashaman:
Maybe your shader is just too expensive. Your shader has 2*169 indirect texture accesses; that seems to be more than a challenge for hardware acceleration. So if the shader falls back to software, it will probably omit early z-rejection too (executed only in hardware!?).
No, NVIDIA chips have no restriction on number of texture accesses other than the maximum number of executed instructions.
I wouldn't call the texture accesses in the given shader indirect, because the result of one texture lookup is not used to look up another.

knackered
09-06-2005, 11:54 AM
they're called 'dependent texture reads' not 'indirect texture reads', aren't they? Or have the semantics changed again?

andras
09-06-2005, 05:14 PM
Originally posted by Jay Cornwall:
Time is measured by wrapping the entire section of code presented above in timing calls. It's certainly not limited by vsync, as an example:

Without shader: 278ms
With shader, without depth testing: 1305ms
With shader, with depth testing: 1336ms
Without shader 278ms??? For one frame? That's less than 4 frames a second! Sounds like you're falling off the hardware path into software rendering...

Humus
09-06-2005, 07:48 PM
Originally posted by Zulfiqar Malik:
However, I noticed that there was very little performance gain on nVidia hardware (I used a GFX 5700 Ultra), from an average of around 55MTris/s to around 56MTris/s.

At those triangle loads I would expect you to be limited by vertex shading, not fragment shading, so I'm not surprised if you don't see much of an increase.

Humus
09-06-2005, 07:51 PM
Originally posted by Jay Cornwall:
Time is measured by wrapping the entire section of code presented above in timing calls.

Do you call glFinish() before and after that section of code? You can't rely on the numbers without it. Like this:


glFinish();
int startTime = timer();

// Code here ...

glFinish();
int endTime = timer();

Zulfiqar Malik
09-06-2005, 09:31 PM
Originally posted by Humus

At those triangle loads I would expect you to be limited by vertex shading, not fragment shading, so I'm not surprised if you don't see much of an increase.

My vertex shader is not heavy at all, and theoretically I shouldn't be limited by vertex shader performance considering the number of vertices current GPUs can process (my GPU is not exactly new :) but it CAN theoretically process more than 250 million vertices per second). Secondly, I have also tried using the simplest vertex shader, which only does the modelview-projection transformation on the incoming vertex and passes it down the pipeline. I did not get any performance increase from that either! I also tested the code on a Radeon 9700 Pro, in which case I got approx 75MTris/s. I am also doing proper batching, and I tuned that after testing on both nVidia and ATi hardware, so I don't "think" that batching is the bottleneck. But I am happy nonetheless because my terrain rendering algorithm is performing better than any other (including clipmaps), and I am currently writing a research paper on it. Hope to make it public in a few months' time.

Robert Osfield
09-07-2005, 12:12 AM
Originally posted by Zulfiqar Malik:
My vertex shader is not heavy at all, and theoretically I shouldn't be limited by vertex shader performance considering the number of vertices current GPUs can process. [...]

I've found with both the GeForce FX and GeForce 6 series that, when pushing 1,000,000 triangles per frame, the rasterization of those triangles is more of a bottleneck than the actual vertex program. I get over 100 million triangles a second still, so I can't complain too much :-)

The scene I've tested isn't fill limited, as it's a single texture with a depth complexity of 1. The pointer to rasterisation being the bottleneck is that when I move the geometry off the screen the frame rate steadily goes up. The vertices are all still being passed into the vertex processor, and all the polygons are still being clipped to the frustum, so it's not clipping either. Since I'm not fill limited, the only explanation is a bottleneck between clip and fragment processing - the only candidate left is the rasterizer.

Back on topic... I've also struggled to see much improvement in tests with early Z. From the card manufacturers I'd love to see a proper explanation of how the early Z tests are implemented and how to utilize them.

Robert.

dorbie
09-08-2005, 06:56 AM
Not necessarily, there are setup overheads if a triangle isn't trivially rejected and that has nothing to do with pixel fill limitations.

cppguru
09-08-2005, 10:04 PM
First, there's nothing strange about 4 FPS in this test. I don't know the screen resolution; let it be 1024x768 (just for estimating). A GeForce 6800 GT has 6 pipelines x 350 MHz, so the 169*4 = 676 instructions per pixel give 350000000*6/(1024*768)/676 = 3.95 FPS. A software renderer might be 1000 times slower =).

Further, I guess this test is generally incorrect.
I don't know whether it's a driver bug or someone else's ;) , but early Z culling should be used differently. Try the following: render some full-screen quad with a lesser Z, with depth write on and color write off, and afterwards (in a second pass) render it with depth test GL_EQUAL, depth write off and color write on. glClearDepth isn't even needed.
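In other words, something like this (untested sketch; drawQuad() is just a placeholder for your fullscreen quad):

// Pass 1: write depth only, no color, cheap (or no) shader.
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glUseProgramObjectARB(0);
drawQuad();                 // placeholder: the fullscreen quad

// Pass 2: color only, depth test GL_EQUAL against the depth laid down above.
glDepthFunc(GL_EQUAL);
glDepthMask(GL_FALSE);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glUseProgramObjectARB(convolveProgram);
drawQuad();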

PS: sorry for my english.

andras
09-11-2005, 06:39 AM
Originally posted by cppguru:
Originally posted by cppguru:
First, there's nothing strange about 4 FPS in this test. [...] A software renderer might be 1000 times slower =).

Maybe I misunderstood, but I thought the 4 FPS was measured for the shader-disabled test.

Dirk
09-11-2005, 03:13 PM
Originally posted by Robert Osfield:
Back on topic... I've also struggled to see much improvement in tests with early Z. From the card manufacturers I'd love to see a proper explanation of how the early Z tests are implemented and how to utilize them.

I disagree with Robert on many things :) , but with this one I can only agree. Some good examples of how to utilize early Z would be VERY much appreciated. I'm not asking how it's implemented (all the secrecy around HW implementations will prevent that from happening, even though it would be interesting), but this is a potentially extremely beneficial feature that is apparently not as trivial to utilize as it seems.

For example, the comment above that you have to turn off depth writes caught me completely by surprise. I haven't verified it yet, but something like that definitely needs to be documented better.

Simon, Humus, how about writing a little demo for this? Please!

A related question: does the early z (if you can get it to work ;) ) also accelerate the occlusion query or only actual rendering? Does anybody have any experience with this?

zed
09-11-2005, 09:45 PM
Originally posted by Dirk:
Simon, Humus, how about writing a little demo for this? Please!

There is a demo on Humus's site.


Originally posted by Dirk:
A related question: does the early z (if you can get it to work ;) ) also accelerate the occlusion query or only actual rendering? Does anybody have any experience with this?

This is mentioned in the occlusion query spec: draw a depth pass first with occlusion queries enabled, and then read back the results to see which meshes are visible.
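Something along these lines, as a rough sketch with ARB_occlusion_query (drawSceneDepthOnly, drawBoundingBox and the meshes array are placeholders, not from anyone's actual code):

// 1) Cheap depth-only pass over the occluders.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
drawSceneDepthOnly();                               // placeholder

// 2) Test each mesh's bounding box against that depth buffer.
GLuint query;
glGenQueriesARB(1, &query);
glDepthMask(GL_FALSE);                              // don't disturb the depth buffer
for (int i = 0; i < numMeshes; ++i) {               // numMeshes/meshes: placeholders
    glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
    drawBoundingBox(meshes[i]);                     // placeholder
    glEndQueryARB(GL_SAMPLES_PASSED_ARB);

    GLuint samples = 0;
    glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &samples);
    meshes[i].visible = (samples > 0);
}

// 3) Restore state and render only the visible meshes for real.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);
glDeleteQueriesARB(1, &query);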

Dirk
09-12-2005, 06:18 PM
Originally posted by zed:
There is a demo on Humus's site.
Ah, great! I hadn't seen that before. It took some hacking to make it work on my Linux/nVidia system (is there really no way to set the X visual for GDK???), but it seems to work fine now (at least the ftob numbers are much bigger than the btof ones ;) ) Thanks for the hint!


Originally posted by zed:
This is mentioned in the occlusion query spec: draw a depth pass first with occlusion queries enabled, and then read back the results to see which meshes are visible.

Hm, I can't find any mention of early or hierarchical Z in the spec. Given that it's going to be system-specific, I didn't expect to find any, really.

alphan
09-16-2005, 05:20 AM
I've implemented a ray casting algorithm on an NV40. I use early depth testing and depth bounds testing for computation masking. Rendering is done to a PBuffer. It works, though with some strange restrictions:

-If multiple pbuffers are used, it works only for the first created pbuffer (strange; maybe it's my fault, but I could not find a way to make it work for the second pbuffer).
-It breaks down after context switching. Even avoid making the same context active again; calling glActiveContext() kills the optimization.
-And some other known things: the depth func must be LESS or LEQUAL...

When depth bounds testing is activated, early depth culling works faster.
Has anyone tested early depth testing with FBOs? I wonder if it works correctly there.
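For reference, the depth bounds part is just this (sketch only; the bounds are example window-space depth values, not the ones from my app):

// EXT_depth_bounds_test: discard fragments whose *stored* depth at that
// pixel falls outside [zmin, zmax], before the fragment shader runs.
glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
glDepthBoundsEXT(0.0, 0.4);        // example bounds; pick them to bracket the region you care about
// ... draw the masked computation here ...
glDisable(GL_DEPTH_BOUNDS_TEST_EXT);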

dorbie
09-16-2005, 01:37 PM
These don't seem that strange when you consider that a coarse Z scheme requires chip-resident cache, estimation of the farthest (and possibly nearest) Z, and additional hardware-based tile-level comparisons (source tile nearest vs. coarse destination farthest), and so may support only a limited subset of operations for early rejection. That chip-resident fast coarse-Z memory would be a scarce resource and may only be available for one buffer. The usage could be primitive, with no paging & management etc., and so only a single buffer may be supported. Context switching may not back it up & restore it (that may not even be possible given the hardware).

On top of all this, it's gonna be optimized to hit the benchmark cases, and even where it's possible a driver may not offer the coverage you would like for exotic comparison, multi-buffer and context-switching modes.

zeckensack
09-19-2005, 07:07 AM
Originally posted by Zulfiqar Malik:
My GPU is not exactly new :) but it CAN theoretically process more than 250 million vertices per second.

Your 5700 Ultra? Frankly, no.
I can squeeze 142Mverts/s through my vanilla 5700 (425MHz core, 275MHz memory), but there's a setup limitation at 70MTris/s. I.e. that's the peak for strips/fans, and the absolute maximum triangle rate the chip can support under whatever circumstances, at that clock speed.

I'd conclude that you're doing pretty well on vertex performance with little room for improvement, if at all.

divide
09-19-2005, 07:55 AM
Seems that Humus' Early Z rejection test program also works on nVidia hardware :)

Zulfiqar Malik
09-20-2005, 12:27 AM
Originally posted by zeckensack

Your 5700Ultra? Frankly, no.
I can squeeze 142Mverts/s through my vanilla 5700 (425MHz core, 275MHz memory), but there's a setup limitation at 70MTris/s. I.e. that's the peak for strips/fans, and the absolute maximum triangle rate the chip can support under whatever circumstances, at that clock speed.

I'd conclude that you're doing pretty well on vertex performance with little room for improvement, if at all.

Well, I did achieve a triangle rate of around 92MTris/s (triangle LISTS), but a lot of those were getting rejected at the rasterization stage. Still, that gives a good estimate of the number of vertices the hardware can process. I read somewhere that the peak theoretical vertex rate for the 5700 Ultra is around 240 MVerts/s.

Btw, I recently got a 6800 GT and I crossed the sweet 100MTris/s spot using my algorithm. I was easily getting around 105MTris/s. A few more optimizations and I can increase that even further!