Early Z-buffer test with fragment programs/shaders

Hi all!

How is early Z-buffer testing done if a fragment program/shader that changes the z-value of each pixel is enabled? Will there still be an early Z-buffer test? Or is Z-buffer testing done after fragment processing? Or is only the Z-value operation done before the test?

Thanks a lot
Corrail

If the fragment program alters z, the early z test does not happen.

Okay…

Would the performance then increase if I save the z-buffer (if it is available, e.g. with stencil shadows) in a rectangular texture and do the z-testing in the fragment program/shader on my own?
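For illustration, the manual test I have in mind would be something like this ARB_fragment_program sketch (just a sketch: it assumes the saved depth buffer has been copied into a RECT depth texture bound to unit 0, and the expensive shading would follow the KIL):

```
!!ARBfp1.0
# fragment.position.xy are window coordinates, .z is this fragment's depth
TEMP saved;
# fetch the saved depth for this pixel from the rectangle texture
TEX saved, fragment.position, texture[0], RECT;
# saved - current: negative when this fragment lies behind the saved depth
SUB saved.x, saved.x, fragment.position.z;
# kill occluded fragments (in practice a small bias is needed for precision)
KIL saved.x;
# ... the expensive shading would go here ...
MOV result.color, fragment.color;
END
```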

Don't think so, since you are doing the z-test (in the fragment program) only shortly before the real test is done. The performance will therefore certainly decrease because of the extra work you are doing in the shader.

Jan.

Yes, I know the depth sampling is extra work, but if I use a very complex pixel shader (with loops and so on), it should be faster, shouldn't it?

Fragment shaders don't support early-outs or loops. So even if the first line of your fragment shader was “kill”, the entire rest of the shader would have to run anyway (unless the compiler noticed this and got rid of the rest through dead-code elimination, but that’s a compile-time optimization).

The only way your plan is actually an improvement is if fragment programs actually terminate execution of the pipe on a ‘kill’ statement.

FYI.

I tried to save some fillrate by using a simple depth writing shader pass before my very expensive shader pass on NVIDIA FX hardware but never saw any performance improvement. It seemed like early z rejection was not available in subsequent passes if the initial pass wrote to the depth buffer using a shader. Several other people on CgShaders.org had similar experiences. This was about 6 months ago.
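For reference, the prepass setup I used was essentially this (a minimal sketch, assuming a current GL context; draw_scene is a placeholder for the actual geometry):

```c
#include <GL/gl.h>

void draw_scene(void);  /* placeholder for the actual geometry */

void render_with_depth_prepass(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glEnable(GL_DEPTH_TEST);

    /* Pass 1: depth only -- no color writes, cheap (or no) fragment program. */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    draw_scene();

    /* Pass 2: the expensive shader; test against the laid-down depth. */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_LEQUAL);
    /* ...bind the expensive fragment program here... */
    draw_scene();
}
```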

Same for me. Never got early-Z rejects running with NVIDIA GeForce FX hardware. I wrote a simple test program and filed a bug-report at the NVIDIA developer site. Did not get any feedback (yet) …

Early-Z works fine with ATI R3xx hardware …

- Klaus

Hmm, it seems I ran into quite a similar problem… I just had to present my realtime volume visualisation demo on a GeForce FX 5900 and it was about 5 times slower than on my ATI card. It seems early-Z isn't enabled unless some special circumstances are met. I, for example, do one pass using a very simple fragment program (it has just one instruction, fetching the alpha from a 3D texture and writing it to the output alpha) front to back with the alpha test enabled and the depth func set to LEQUAL, so I just get the parts of my volume that are above a certain iso value. In the next pass I do the shading for the pixels that have to be shaded, with the depth func set to EQUAL. On the ATI card this works just fine and I get a very nice speed improvement (no fragment program changes the depth of the fragment, btw). On the other hand, the NVIDIA card didn't seem to discard the fragments that failed the z-test before running the fragment program…
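Roughly, the state setup for my two passes looks like this (a sketch; draw_volume_slices is a placeholder for my slice rendering, and the fragment program binding is omitted):

```c
#include <GL/gl.h>

void draw_volume_slices(void);  /* placeholder for the slice rendering */

void render_iso_volume(float iso_value)
{
    /* Pass 1: trivial fragment program writes the volume's alpha; the
       alpha test keeps only fragments at or above the iso value. */
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LEQUAL);
    glDepthMask(GL_TRUE);
    glEnable(GL_ALPHA_TEST);
    glAlphaFunc(GL_GEQUAL, iso_value);
    draw_volume_slices();        /* front to back */

    /* Pass 2: the expensive shading program, only where depth matches. */
    glDisable(GL_ALPHA_TEST);
    glDepthFunc(GL_EQUAL);
    glDepthMask(GL_FALSE);
    draw_volume_slices();
}
```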

In the end I'd really like to know when the z-test is actually performed before the fragment programs run, so I can optimize my application for this on NV3x hardware…

Oh, and btw, doesn't Doom 3 do one rendering pass without shading and then one with shading in order to save bandwidth via the early-z test? Just wondering what they are doing there.

[This message has been edited by Chuck0 (edited 01-28-2004).]

Originally posted by Chuck0:
Oh, and btw, doesn't Doom 3 do one rendering pass without shading and then one with shading in order to save bandwidth via the early-z test? Just wondering what they are doing there.

They do multiple passes over the same geometry because stencil shadowing requires it. The initial ambient pass also doubles as a nice way to lay down depth information so that early depth tests on subsequent passes can eliminate occluded fragment program execution. But the primary reason is simply that it is needed for stencil shadowing.
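Schematically, the pass structure is something like this (a rough sketch of the approach described, not Doom 3's actual code; the draw_* helpers are placeholders, and the shadow-volume stencil op setup is omitted):

```c
#include <GL/gl.h>

void draw_scene_ambient(void);       /* placeholder */
void draw_shadow_volume(int light);  /* placeholder */
void draw_scene_lit(int light);      /* placeholder */

void render_frame(int num_lights)
{
    int light;

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

    /* Ambient/depth pass: fills the depth buffer, which is what lets
       early depth tests reject occluded fragments in the later passes. */
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);
    glDepthMask(GL_TRUE);
    draw_scene_ambient();

    glDepthMask(GL_FALSE);           /* depth is final after the first pass */
    glEnable(GL_STENCIL_TEST);
    for (light = 0; light < num_lights; ++light) {
        glClear(GL_STENCIL_BUFFER_BIT);

        /* Stencil pass: mark shadowed pixels with the shadow volumes. */
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glStencilFunc(GL_ALWAYS, 0, ~0u);
        draw_shadow_volume(light);

        /* Lighting pass: add this light where the stencil says "unshadowed". */
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        glStencilFunc(GL_EQUAL, 0, ~0u);
        glDepthFunc(GL_EQUAL);
        glEnable(GL_BLEND);
        glBlendFunc(GL_ONE, GL_ONE);
        draw_scene_lit(light);
        glDisable(GL_BLEND);
        glDepthFunc(GL_LESS);
    }
    glDisable(GL_STENCIL_TEST);
    glDepthMask(GL_TRUE);
}
```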

Chuck,
can you use GL_LEQUAL instead of GL_EQUAL? I seem to remember ATI once recommended this because their (older?) hardware needed it for full early Z rejection goodness. This may or may not help with NVIDIA hardware, but it sure can’t hurt to try it.

Another thing I’ve found out the hard way is that some ATI hardware needs you to explicitly clear the depth buffer at least once. Just overwriting the Z buffer with glDepthFunc(GL_ALWAYS) will not give you early Z rejection for subsequent passes.
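In other words (a minimal sketch of the two suggestions above):

```c
#include <GL/gl.h>

void setup_depth_for_early_z(void)
{
    /* Clear explicitly -- overwriting the Z buffer with
       glDepthFunc(GL_ALWAYS) is not enough on some hardware. */
    glClear(GL_DEPTH_BUFFER_BIT);

    /* ...render the depth-fill pass here... */

    /* Use GL_LEQUAL rather than GL_EQUAL for the shading passes. */
    glDepthFunc(GL_LEQUAL);
}
```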

Originally posted by zeckensack:
Chuck,
can you use GL_LEQUAL instead of GL_EQUAL? I seem to remember ATI once recommended this because their (older?) hardware needed it for full early Z rejection goodness. This may or may not help with NVIDIA hardware, but it sure can’t hurt to try it.

Another thing I’ve found out the hard way is that some ATI hardware needs you to explicitly clear the depth buffer at least once. Just overwriting the Z buffer with glDepthFunc(GL_ALWAYS) will not give you early Z rejection for subsequent passes.

I just tried changing my algorithm to use GL_LEQUAL, but that way it's far too slow even on the R300 I have (without GL_EQUAL there is much more overdraw than before, since I'm now doing the shading pass back to front). Btw, I think GL_EQUAL isn't optimal for the hierarchical-z stuff, but it still seems to work for early-z rejection (on R3xx hardware, that is)… I guess I simply have to live with the fact that my app won't run fast on NV hardware :P

Btw, are there any official statements about when early z is enabled and when it is not? That would really be some interesting information, since applications like real-time raycasters using fragment programs simply have to rely on early-z to avoid huge amounts of needlessly executed fragment programs.

[This message has been edited by Chuck0 (edited 01-29-2004).]

There are a few cases when early Z must be disabled:

  • Pixel shader outputs depth
  • Shader contains a “texkill” instruction
  • Alpha test is enabled

There may be more depending on your particular piece of hardware, but certainly you have to disable early Z in these cases.
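Restated as GL state, staying early-Z-friendly means something like this (a sketch, assuming a current GL context):

```c
#include <GL/gl.h>

/* Avoids the cases listed above: no alpha test, and it assumes the bound
   fragment program neither writes result.depth nor contains a
   KIL/texkill instruction. */
void set_early_z_friendly_state(void)
{
    glDisable(GL_ALPHA_TEST);   /* alpha test forces the late Z path */
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LEQUAL);
    glDepthMask(GL_TRUE);
}
```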

Why would an early Z test be disabled if a shader contains a KIL, as long as it doesn’t write to the depth buffer? Wouldn’t an early Z test just kill the fragment earlier and therefore give more chance for optimization?

Early depth means that the depth buffer is updated before the “texkill” is executed. This means if the fragment passes the Z test, but fails the “texkill” then the depth buffer will contain the wrong data if early Z is enabled.

You must perform “texkill” and alpha test before depth test as shown in the OpenGL pipeline.
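In pseudocode, the ordering conflict looks like this (a C-style sketch; Fragment and the stage functions are hypothetical stand-ins for fixed-function stages):

```c
#include <stdbool.h>

/* Hypothetical stand-ins for pipeline stages and framebuffer state. */
typedef struct { float z; float color[4]; bool killed; } Fragment;
bool depth_test(float z);
bool alpha_test(const Fragment *f);
void write_depth(float z);
void write_color(const float color[4]);
void run_shader(Fragment *f);   /* may set f->killed via texkill */

/* Spec order: texkill and alpha test run before the depth test. */
void spec_order(Fragment f)
{
    run_shader(&f);
    if (f.killed)          return;
    if (!alpha_test(&f))   return;
    if (!depth_test(f.z))  return;
    write_depth(f.z);
    write_color(f.color);
}

/* Early-Z order: depth is tested and written before the shader, so a
   fragment that passes Z but is then killed leaves wrong depth behind. */
void early_z_order(Fragment f)
{
    if (!depth_test(f.z))  return;
    write_depth(f.z);      /* premature if the shader kills this fragment */
    run_shader(&f);
    if (f.killed)          return;  /* too late: depth already written */
    if (!alpha_test(&f))   return;
    write_color(f.color);
}
```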

Doubtful, but it’s highly implementation-specific and we don’t get told the recipe for the secret sauce. Early z does not mean the z is written early, at least at the fragment level. It means blocks of occluded fragments are rejected early.

Early Z is completely hidden from the user but the intent is to reject fragments efficiently before most fragment processing. About the only thing that should stop early z is the fragment modification of z output.

It may be that on a block basis the hardware does something like store one depth value & derivatives (or min & max) and can only do this where it knows the fragments are guaranteed to be written, but all kinds of things would complicate this, not merely texkill. This may be a weakness of some schemes, if early z only operates on contiguous blocks of constant depth gradient from single primitives.

It would explain the texkill thing but it would be more complex than simply early fragment z writes. I’d still have issues with that explanation, for example it might prevent subsequent early z reject but not this reject. It may just boil down to a lack of optimization or a quirk in the hardware design. Obviously there’s stuff we mere punters don’t know.

Texkill should not, on the face of things, prevent early z rejection of fragments on a block-by-block basis. So maybe something like coarse block z writes is going on in some hardware. I’ll bet they won’t come out and tell us what the heck is going on, though. Trade secrets and fear of litigation prevent the graphics vendors from explaining their algorithms in useful detail. It’s a shame really.

[This message has been edited by dorbie (edited 01-30-2004).]

Originally posted by dorbie:
Doubtful, but it’s highly implementations specific and we don’t get told the recipe to the secret sauce.
If you look closely, you’ll find that OpenGL guy must be expected to know quite a bit about a particular brand of special sauce.

I did notice before I posted. There’s more than one OpenGL implementation out there, and the primary objective of coarse z (early z, hyper z etc.) is still early REJECTION of fragments. I don’t care what anyone else claims about it.

As I said in my post, even if early coarse z writes were prevented by texkill that still shouldn’t prevent early reject of the same fragments. If you think carefully about the problem this is glaringly obvious. Maybe it’s just a design issue, hardware isn’t like software but at least w.r.t. the algorithm something doesn’t add up, and saying early z writes z early so texkill puts the kybosh on it is at best an incomplete explanation. It frankly doesn’t make a whole lot of sense.

[This message has been edited by dorbie (edited 01-30-2004).]

Originally posted by dorbie:
I did notice before I posted. There’s more than one OpenGL implementation out there, and the primary objective of coarse z (early z, hyper z etc.) is still early REJECTION of fragments. I don’t care what anyone else claims about it.

Who said anything about hyper Z? Early Z and hyper Z can be two different things entirely. Hyper Z can work on a block level to allow trivial acceptance or rejection of a block of pixels. Early Z means just that: Perform the Z test as early as possible (usually before the shader). The point of having both of these is that hyper Z won’t trivially reject complete blocks all the time, so having an early Z check means that you can still save on shader computations in these cases.

As I said in my post, even if early coarse z writes were prevented by texkill that still shouldn’t prevent early reject of the same fragments.

Again, early Z and hyper Z are not the same thing.

It frankly doesn’t make a whole lot of sense.

Makes good sense to me.

Once again, the purpose of the early z test is to reject fragments early, not merely to write fragments early. If you’re writing a fragment then you’ve passed the depth test and you save nothing. On the other hand, if you have failed the depth test then you can reject irrespective of texkill.

So it doesn’t make perfect sense at all. The claim that you cannot early-z reject because texkill may also reject later is completely nonsensical at one level. Or, to use the exact scenario: that you cannot early-reject a fragment because you want to write it (NO YOU DON’T, it won’t be written) before a kill.

Given what you’ve said, I can only figure that the whole depth operation must be done early, including the write. In theory at least, the hardware could hold the depth value from a passing test and defer the write until the fragment shader had executed, still allowing the kill to prevent it, but maybe that’s naive of me. Depth-test reject, then shade, then depth write makes obvious sense, and it is exactly why everyone asks how kills break early z schemes (in all their forms).

You say that early z and coarse z etc. are not the same thing, and I’ll grant you that I understand the distinction and why one is an improvement on the other (if both are implemented), but broadly speaking they serve a similar purpose, and each could reject early irrespective of later rejects. When developers talk about early z reject in a context like this discussion, I immediately assume they are referring to the whole bag of tricks that might save fill through early occlusion (and I think that’s a correct assumption): coarse z reject, hyper z, superduper z and early z. If no savings are apparent after a zfill first-pass trick with KIL in the shader, then all of the above must be disabled. ALL of them, and that again makes no sense for the reasons already stated; the same goes for alpha test, unless again it’s a design quirk, which I keep saying.

When posters here question why kill instructions inexplicably disable the effectiveness of the z occlusion optimizations they’ve tried to implement, I can guarantee they’re not drawing any distinction between block reject and early fragment z (it’s all hidden and never really addressed in detail that I’ve seen). Nor does there seem to be a difference in practice w.r.t. kill (at least on some hardware); no performance improvement is no performance improvement. And it doesn’t make sense, at least to a dumb software guy like me.

[This message has been edited by dorbie (edited 01-31-2004).]