Early Z-buffer test with fragment programs/shaders



Corrail
01-22-2004, 08:07 AM
Hi all!

How is early Z-buffer testing done when fragment programs/shaders are enabled that change the z-value of each pixel? Will there still be an early Z-buffer test? Or is Z-buffer testing done after fragment processing? Or will only the z-value operation be done before the test?

Thanks a lot
Corrail

dorbie
01-22-2004, 08:11 AM
With fragment z alteration, early z test does not happen.

Corrail
01-22-2004, 08:44 AM
Okay...

Would performance then increase if I saved the z-buffer (if it is available, e.g. with stencil shadows) to a rectangular texture and did the z-testing in the fragment program/shader on my own?
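For reference, the capture step would look roughly like this (just a sketch, assuming ARB_depth_texture and a rectangle-texture extension are available; winWidth/winHeight are placeholders, and the comparison in the fragment program itself is not shown):

/* Create a rectangle texture to receive the depth buffer
   (assumes ARB_depth_texture + NV_texture_rectangle). */
GLuint depthTex;
glGenTextures(1, &depthTex);
glBindTexture(GL_TEXTURE_RECTANGLE_NV, depthTex);
glTexParameteri(GL_TEXTURE_RECTANGLE_NV, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_RECTANGLE_NV, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

/* ... render the pass that lays down depth (e.g. the stencil shadow fill) ... */

/* Copy the current depth buffer into the texture, one depth value per pixel. */
glCopyTexImage2D(GL_TEXTURE_RECTANGLE_NV, 0, GL_DEPTH_COMPONENT24,
                 0, 0, winWidth, winHeight, 0);

/* The fragment program would then sample this texture at the fragment's
   window position and KIL fragments that lie behind the stored depth. */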

Jan
01-22-2004, 08:54 AM
Don't think so, since you are doing the z-test (in the fragment program) only shortly before the real test is done anyway. Therefore the performance will certainly decrease because of the extra work you are doing in the shader.

Jan.

Corrail
01-24-2004, 02:42 AM
Yes, I know the depth sampling is extra work, but if I use a very complex pixel shader (with loops and so on) it should be faster, shouldn't it?

Korval
01-24-2004, 11:11 AM
Yes, I know the depth sampling is extra work, but if I use a very complex pixel shader (with loops and so on) it should be faster, shouldn't it?

No fragment shaders support early outs or loops. So, even if the first line of your fragment shader was "kill", the entire rest of the shader would have to be run anyway (unless the compiler noticed this and got rid of the rest through dead-code elimination, but that's a compile-time optimization).

The only way your plan would actually be an improvement is if fragment programs terminated the execution of a pipe due to a 'kill' statement.

roffe
01-24-2004, 01:41 PM
FYI.

I tried to save some fillrate by using a simple depth writing shader pass before my very expensive shader pass on NVIDIA FX hardware but never saw any performance improvement. It seemed like early z rejection was not available in subsequent passes if the initial pass wrote to the depth buffer using a shader. Several other people on CgShaders.org had similar experiences. This was about 6 months ago.
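For context, in GL-state terms the pattern was roughly this (a simplified sketch, not the actual code; drawScene() just stands in for the geometry):

/* Pass 1: lay down depth only, no colour writes, cheapest possible shading. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
drawScene();                       /* placeholder for the geometry */

/* Pass 2: the expensive shader pass, re-using the depth laid down above. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);             /* depth is already correct */
glDepthFunc(GL_LEQUAL);
/* bind the expensive fragment program here */
drawScene();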

Klaus
01-24-2004, 01:47 PM
Same for me. Never got early-Z rejects running with NVIDIA GeForce FX hardware. I wrote a simple test program and filed a bug-report at the NVIDIA developer site. Did not get any feedback (yet) ...

Early-Z works fine with ATI R3xx hardware ...

- Klaus

Chuck0
01-28-2004, 10:59 AM
Hmm, it seems that I ran into a quite similar problem... I just had to present my realtime volume visualisation demo on a GeForce FX 5900 and it was about 5 times slower than on my ATI card. It seems that early-z isn't enabled unless some special circumstances are met. For example, I do one pass with a very simple fragment program (it has just one instruction, fetching the alpha from a 3D texture and writing it to the output alpha), drawn front to back with the alpha test enabled and the depth func set to lequal, so I get just the parts of my volume that are above a certain iso value. In the next pass I do the shading for the pixels that have to be shaded, by doing a second pass with the depth func set to equal. On the ATI card this seems to work just fine and I get a very nice speed improvement (no fragment program changes the depth of the fragment, btw). On the other hand, the NVIDIA card didn't seem to discard the fragments that failed the z-test before running the fragment program...
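In GL-state terms the two passes look roughly like this (a simplified sketch of what I just described, not the actual code; isoValue and the draw calls are placeholders):

/* Pass 1: front to back, cheap one-instruction program that outputs the volume's alpha. */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glDepthMask(GL_TRUE);
glEnable(GL_ALPHA_TEST);
glAlphaFunc(GL_GEQUAL, isoValue);   /* keep only fragments at or above the iso value */
drawSlicesFrontToBack();            /* placeholder */

/* Pass 2: shade only the fragments that survived pass 1. */
glDepthFunc(GL_EQUAL);
glDepthMask(GL_FALSE);
glDisable(GL_ALPHA_TEST);
/* bind the expensive shading fragment program here */
drawSlicesFrontToBack();            /* placeholder */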

In the end I would really like to know when the z-test is actually performed before the fragment programs run, in order to optimize my application for this on an NV3x card...

Oh, and btw, doesn't Doom3 do one rendering pass without shading and then one with shading in order to save bandwidth with the early-z test?... just wondering what they are doing there :)

Korval
01-28-2004, 11:37 AM
Oh, and btw, doesn't Doom3 do one rendering pass without shading and then one with shading in order to save bandwidth with the early-z test?... just wondering what they are doing there

They do multiple passes over the same geometry because stencil shadowing requires it. The initial ambient pass also doubles as a nice way to lay down depth information so that early depth tests on subsequent passes can eliminate occluded fragment program execution. But the primary purpose of this is simply that it is needed for stencil shadowing.

zeckensack
01-28-2004, 11:39 AM
Chuck,
can you use GL_LEQUAL instead of GL_EQUAL? I seem to remember ATI once recommended this because their (older?) hardware needed it for full early Z rejection goodness. This may or may not help with NVIDIA hardware, but it sure can't hurt to try it.

Another thing I've found out the hard way is that some ATI hardware needs you to explicitly clear the depth buffer at least once. Just overwriting the Z buffer with glDepthFunc(GL_ALWAYS) will not give you early Z rejection for subsequent passes.
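In code, that advice boils down to something like this (minimal sketch):

glClear(GL_DEPTH_BUFFER_BIT);   /* a real clear, not a full-screen GL_ALWAYS overwrite */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);         /* rather than GL_EQUAL for the later passes */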

Chuck0
01-29-2004, 05:23 AM
Originally posted by zeckensack:
Chuck,
can you use GL_LEQUAL instead of GL_EQUAL? I seem to remember ATI once recommended this because their (older?) hardware needed it for full early Z rejection goodness. This may or may not help with NVIDIA hardware, but it sure can't hurt to try it.

Another thing I've found out the hard way is that some ATI hardware needs you to explicitly clear the depth buffer at least once. Just overwriting the Z buffer with glDepthFunc(GL_ALWAYS) will not give you early Z rejection for subsequent passes.

I just tried changing my algorithm to use GL_LEQUAL, but the way it is now it's far too slow even on the R300 I have (without GL_EQUAL there is much more overdraw than before, since I'm now doing the shading pass back to front). Btw, I think GL_EQUAL isn't optimal for the hierarchical-z stuff, but it still seems to work for the early-z rejection (on R3xx hardware, that is)... I guess I simply have to live with the fact that my app won't run fast on NV hardware :p

Btw, are there any official statements about when early z is enabled and when not? This would really be some interesting information, since applications like real-time raycasters using fragment programs simply have to rely on early-z in order to save huge amounts of needlessly executed fragment programs.

OpenGL guy
01-30-2004, 04:07 PM
There are a few cases when early Z must be disabled:
- Pixel shader outputs depth
- Shader contains a "texkill" instruction
- Alpha test is enabled

There may be more depending on your particular piece of hardware, but certainly you have to disable early Z in these cases.
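In app-visible terms, keeping an expensive pass early-Z friendly would then look something like this (just a sketch; the first two conditions live inside the shader itself, so they can only be avoided, not toggled from the API):

/* The fragment program bound for this pass must not write a depth output
   and must not contain a "texkill"/KIL instruction. */
glDisable(GL_ALPHA_TEST);        /* alpha test would also disable early Z */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glDepthMask(GL_FALSE);           /* assuming depth was laid down in an earlier pass */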

endash
01-30-2004, 04:24 PM
Why would an early Z test be disabled if a shader contains a KIL as long as it doesn't write to the depth buffer? Wouldn't an early Z test just kill the fragment earlier and therefore give more chance for optimization?

OpenGL guy
01-30-2004, 05:04 PM
Early depth means that the depth buffer is updated before the "texkill" is executed. This means if the fragment passes the Z test, but fails the "texkill" then the depth buffer will contain the wrong data if early Z is enabled.

You must perform "texkill" and alpha test before depth test as shown in the OpenGL pipeline.

dorbie
01-30-2004, 06:02 PM
Doubtful, but it's highly implementation specific and we don't get told the recipe to the secret sauce. Early z does not mean the z is written early, at least at the fragment level. It means blocks of occluded fragments are rejected early.

Early Z is completely hidden from the user but the intent is to reject fragments efficiently before most fragment processing. About the only thing that should stop early z is the fragment modification of z output.

It may be that on a block basis the hardware does something like store one depth value & derivatives (or min & max) and can only do this where it knows the fragments are guaranteed to be written, but all kinds of things would complicate this, not merely texkill. This may be a weakness of some schemes, if early z only operates on contiguous blocks of constant depth gradient from single primitives.

It would explain the texkill thing but it would be more complex than simply early fragment z writes. I'd still have issues with that explanation, for example it might prevent subsequent early z reject but not this reject. It may just boil down to a lack of optimization or a quirk in the hardware design. Obviously there's stuff we mere punters don't know.

Texkill should not, on the face of things, prevent early z rejection of fragments on a block by block basis. So maybe something like block coarse z writes is going on on some hardware; I'll bet they won't come out and tell us what the heck is going on though. Trade secrets and fear of litigation prevent the graphics vendors explaining their algorithms in useful detail. It's a shame really.

zeckensack
01-30-2004, 06:33 PM
Originally posted by dorbie:
Doubtful, but it's highly implementation specific and we don't get told the recipe to the secret sauce.
If you look closely, you'll find that OpenGL guy must be expected to know quite a bit about a particular brand of special sauce ;)

dorbie
01-30-2004, 06:38 PM
I did notice before I posted. There's more than one OpenGL implementation out there and the primary objective of coarse z (early z, hyper z, etc.) is still early REJECTION of fragments. I don't care what anyone else claims about it.

As I said in my post, even if early coarse z writes were prevented by texkill that still shouldn't prevent early reject of the same fragments. If you think carefully about the problem this is glaringly obvious. Maybe it's just a design issue, hardware isn't like software but at least w.r.t. the algorithm something doesn't add up, and saying early z writes z early so texkill puts the kybosh on it is at best an incomplete explanation. It frankly doesn't make a whole lot of sense.

OpenGL guy
01-30-2004, 11:02 PM
Originally posted by dorbie:
I did notice before I posted. There's more than one OpenGL implementation out there and the primary objective of coarse z (early z, hyper z, etc.) is still early REJECTION of fragments. I don't care what anyone else claims about it.
Who said anything about hyper Z? Early Z and hyper Z can be two different things entirely. Hyper Z can work on a block level to allow trivial acceptance or rejection of a block of pixels. Early Z means just that: Perform the Z test as early as possible (usually before the shader). The point of having both of these is that hyper Z won't trivially reject complete blocks all the time, so having an early Z check means that you can still save on shader computations in these cases.

As I said in my post, even if early coarse z writes were prevented by texkill that still shouldn't prevent early reject of the same fragments.
Again, early Z and hyper Z are not the same thing.

It frankly doesn't make a whole lot of sense.
Makes good sense to me ;)

dorbie
01-31-2004, 12:12 AM
Once again the purpose of early z test is to reject fragments early, not merely write fragments early. If you're writing fragments then you've passed the depth test and you save nothing. On the other hand if you have failed the depth test then you can reject irrespective of texkill.

So it doesn't entirely make perfect sense at all. The claim that you cannot early z reject because texkill may also reject later is completely nonsensical at one level. Or, to use the exact scenario: that you cannot early reject a fragment because you want to write it (NO YOU DON'T, it won't be written) before a kill.

Given what you've said I could figure that the *whole* depth operation must be done early including the write, whereas a depth pass result could in theory at least store the depth value, deferring the write until the fragment shader had executed, still allowing the kill to prevent the write, but maybe that's naive of me. A depth test, reject, shade, then depth write makes obvious sense and is the exact reason everyone asks why kills break early z schemes (in all their forms).

You say that early z and coarse z etc are not the same thing and I'll grant you that I can understand the distinction and why one is an improvement on the other (if both are implemented), but broadly speaking they serve a similar purpose and each could reject early irrespective of later rejects. When developers talk about early z reject in a context like this discussion I immediately assume they are generally referring to the whole bag of tricks that might save fill due to early occlusion (and I think that's a correct assumption), that includes coarse z reject, hyper z, superduper z and early z. If no savings are apparent after a zfill first pass trick with KIL on a shader then all of the above must be disabled. ALL of them, and that again makes no sense for the reasons already stated, same goes for alpha test, unless again it's a design quirk, which I keep saying.

When posters here are questioning why kill instructions are inexplicably disabling the effectiveness of z occlusion optimizations they've tried to implement, I can guarantee they're not drawing any distinctions between block reject and early fragment z (it's all hidden and never really addressed in detail that I've seen). Nor does there seem to be a difference in practice w.r.t. kill (at least on some hardware): no performance improvement is no performance improvement. And it doesn't make sense :) at least to a dumb software guy like me.


OpenGL guy
01-31-2004, 01:53 AM
Originally posted by dorbie:
Once again the purpose of early z test is to reject fragments early, not merely write fragments early. If you're writing fragments then you've passed the depth test and you save nothing. On the other hand if you have failed the depth test then you can reject irrespective of texkill.
If you are doing early fragment depth test, as opposed to trivial block rejection, why not update the Z buffer at this point? It makes more sense to move the whole Z block ahead in the pipeline than just a portion of it.

So it doesn't entirely make perfect sense at all. The claim that you cannot early z reject because texkill may also reject later is completely nonsensical at one level.
That wasn't what I said at all. Early Z means just that: Early Z. Z test comes early. If you want to talk about trivial rejection, that is something else. Even the title of this thread implied early Z test.

Given what you've said I could figure that the whole depth operation is done early including the write, whereas a depth pass result could in theory at least store the depth value, deferring the write until the fragment shader had executed, still allowing the kill to prevent the write, but maybe that's naive of me.
Sounds naive to me and I am just a software person ;)

You say that early z and coarse z etc are not the same thing and I'll grant you that I can understand the distinction and why one is an improvement on the other (if both are implemented), but broadly speaking they serve a similar purpose and each could reject early irrespective of later rejects. When developers talk about early z reject in a context like this discussion I immediately assume they are generally referring to the whole bag of tricks that might save fill due to early occlusion (and I think that's a correct assumption), that includes coarse z reject, hyper z, superduper z and early z. If no savings are apparent after a zfill first pass trick with KIL on a shader then all of the above must be disabled. ALL of them, and that again makes no sense for the reasons already stated, same goes for alpha test, unless again it's a design quirk, which I keep saying.
Maybe it's a difference in semantics. In any event, you don't need two methods of early rejection, that's why you have hyper Z and early Z. They are two different features with two different goals.

When posters here are questioning why kill instructions are inexplicably disabling the effectiveness of z occlusion optimizations they've tried to implement, I can guarantee they're not drawing any distinctions between block reject and early fragment z (it's all hidden and never really addressed in detail that I've seen). Nor does there seem to be a difference in practice w.r.t. kill (at least on some hardware): no performance improvement is no performance improvement. And it doesn't make sense :) at least to a dumb software guy like me.
Sure it makes sense. You can't do depth test if the fragment is going to be killed. You can trivially reject the fragment, but that's it.

Maybe my viewpoint is different because I have seen how some hardware actually works...

dorbie
01-31-2004, 02:31 AM
I know you can trivially reject, obviously you can (wait a sec, isn't that what I'VE been saying?), but apparently it doesn't work on some hardware. That's partly what people are saying when they say it doesn't save them anything under certain circumstances; they cannot measure each method in isolation.

In addition you CAN in theory early z test even with a kill; the issue is whether you can subsequently stop the write with a kill or alpha test after a z pass. You're intrinsically tying z test to z write. Lots of things make sense if you consider a single design, but there are other options.

Once again, people aren't posting here because of one method or one aspect of an implementation. The "feature" is the thing at issue, the feature being occluded fragments getting significantly faster fill. It doesn't happen with certain shader instructions, therefore the whole thing is broken, trivial block reject included (on some hardware).

I can understand what you're saying, but at the same time I'm not exactly making a complex or convoluted point here and I get the impression I'm just not getting through to you.

Once again, and I did edit so you probably missed it, the claim that you cannot early reject a fragment because you want to write it (NO YOU DON'T, it won't be written) before a kill is not a compelling one; if it's going to fail z it will never be written. It can be safely rejected.

As for not needing two methods of early rejection, I'm not the one who started drawing distinctions in a thread discussing the overall performance of the combined system. The purpose of each method is perfectly clear and I'm not going down a semantic rathole with you over how similar or different they are; let's just agree they are what they are.

Jan
01-31-2004, 02:43 AM
nVidia wrote in a paper that to save fillrate one should use the alpha test as often as possible, for example when doing a specular pass or so.
However, now I have an ATI card. Does that mean that to get the most out of it I should DISABLE the alpha test as often as possible? And that using the KIL instruction hurts performance instead of improving it?

It would be very nice if those two companies could just create a paper where they explain what to do and what not to do to get the best performance out of their products. Such information is very rare.

I think most people (including me) still use the intuitive and most naive way (i.e. using the alpha test, or putting as much into one pass as possible, instead of doing multiple passes). But how should we know better? I couldn't even get an extension list for ATI's cards from ATI themselves. That much information should be possible to publish, shouldn't it??

Jan.

Korval
01-31-2004, 11:16 AM
It would be very nice if those two companies could just create a paper where they explain what to do and what not to do to get the best performance out of their products.

They did. You don't expect to find ATi's performance paper on nVidia's site, do you? It's on the ATi site.


I couldn't even get an extension list for ATI's cards from ATI themselves. That much information should be possible to publish, shouldn't it??

It keeps changing with every driver revision (which, for ATi, happens once a month). So, there's no real point. People who are interested likely have the card already and can just ask.

Jan
01-31-2004, 12:29 PM
"They did. You don't expect to find ATi's performance paper on nVidia's site, do you? It's on the ATi site."

What kind of idiot do you think I am? Of course I searched ATI's AND nVidia's pages for such papers. nVidia has some old papers, but they usually only talk about their new features and how to use them properly, not about the general architecture and how to use standard stuff properly.

On ATI's page I couldn't find such stuff yet, but I have to admit I didn't look too closely.

But if you have some links, I would like to know them.

Jan.

OpenGL guy
01-31-2004, 12:46 PM
Originally posted by Jan2000:
nVidia wrote in a paper that to save fillrate one should use the alpha test as often as possible, for example when doing a specular pass or so.
I don't see how alpha test can save fillrate. The fragment still has to run through the shader so you can see what the alpha value is! You can save bandwidth, however.

However, now I have an ATI card. Does that mean that to get the most out of it I should DISABLE the alpha test as often as possible? And that using the KIL instruction hurts performance instead of improving it?
If you can use a clip plane in place of a "kill", then that's probably the better way to go.
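For the common case where the kill is just clipping against a plane, the fixed-function route is something like this (a minimal sketch; the plane equation is made up):

/* Clip against a plane instead of using KIL in the shader.
   The plane is given in object coordinates at the time of this call. */
GLdouble plane[4] = { 0.0, 1.0, 0.0, 0.0 };   /* Ax + By + Cz + D >= 0 passes */
glClipPlane(GL_CLIP_PLANE0, plane);
glEnable(GL_CLIP_PLANE0);
/* ... draw ... */
glDisable(GL_CLIP_PLANE0);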

It would be very nice if those two companies could just create a paper where they explain what to do and what not to do to get the best performance out of their products. Such information is very rare.
It's not that rare... ATI has a whole SDK for developers. Unfortunately, many people don't seem to read it...

OpenGL guy
01-31-2004, 12:50 PM
Originally posted by dorbie:
In addition you CAN in theory early z test even with a kill; the issue is whether you can subsequently stop the write with a kill or alpha test after a z pass. You're intrinsically tying z test to z write. Lots of things make sense if you consider a single design, but there are other options.
Just where would you store this Z value that you need to "stop from writing"? You're talking about increasing the complexity quite a bit.

Once again, people aren't posting here because of one method or one aspect of an implementation. The "feature" is the thing at issue, the feature being occluded fragments getting significantly faster fill. It doesn't happen with certain shader instructions, therefore the whole thing is broken, trivial block reject included (on some hardware).
I can't speak for all hardware (or drivers), but I can tell you how some hardware works and what one IHV recommends.

Once again, and I did edit so you probably missed it, the claim that you cannot early reject a fragment because you want to write it (NO YOU DON'T, it won't be written) before a kill is not a compelling one; if it's going to fail z it will never be written. It can be safely rejected.
Maybe I misspoke, but all I said was that early Z rejection is safe, but not necessarily early Z test. Early Z test implies resolving the Z buffer before the fragment color is computed; this does not always give the proper result.

Anyway, I've stated my point of view.

dorbie
01-31-2004, 01:12 PM
Granted it may increase the complexity, but perhaps not, hence my earlier comment that perhaps I was being naive. Hardware has registers per fragment during processing, but the real issue is pipelining and hiding the latency of memory fetches and other nastier things like dependent reads for z pass fragments, or in this case keeping the z around during that delay, which may require cache & fragment recirculation beyond mere registers (or whatever the hardware details are), but I expect that kind of thing already exists for other reasons. After the early z test, an early depth write could be stored in ooohhh.... let's see..... an fbuffer, but that's just one implementation feature that's been published; it may still not be practical for all I know. The devil's in the details I don't have. This scheme need only be implemented when a kil shader is in place, not as a general solution. You could still early z write unless there was a kill shader, in which case you could store the z. Is that better than brute forcing every fragment? Dunno, but it will depend on the ratio of z pass to z fail if it isn't free.

jwatte
01-31-2004, 09:12 PM
I believe graphics cards use highly custom memory controllers. For example, framebuffer blend is probably done in the memory controller, rather than a more remote "blend unit". (*)

It seems plausible that early Z is also implemented very close to the DRAM, as a read-modify-write operation, for speed. If that's the case, you can't read/test, then run a program with possible KIL, then write; that wouldn't utilize the early Z if the early Z was implemented that close to the memory.


(*) this is likely the reason we don't get blending for floating-point framebuffers on current hardware.

zeckensack
02-01-2004, 02:49 AM
We're going round in circles.
For all I know, ATI hardware does have the ability to correctly resolve the depth buffer, even when there's "hyper z incompatible" fragment processing going on.

So, OpenGL guy, you can selectively write (or selectively commit) depending on the outcome of combined Z test*, alpha test, KIL, whatever. You can.

So I assume - until I'm corrected - that you have some sort of fallback per-fragment culling silicon that's active whenever the "Z block" can't do its work. I assume that because if you don't, and all fallback visibility testing is indeed resolved by the "Z block", there would be absolutely no issue.

I'd imagine that the reason for all of this is that you can't keep the "Z block" data in sync with the results at the end of fragment processing due to pipelining reasons. Visibility results would come back an unknown amount of cycles later, even though you need correct Z at all times. Otherwise you'd risk rejecting perfectly visible fragments.

We can all go around poking our noses and wondering what's going on. So if you can, please do tell, or at least dismiss my ramblings as nonsense.

*I'm referring to "Z test" as test only. This is different from GL semantics, where active depth testing automatically implies active depth writes (barring the depth mask).
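In GL terms that test-only configuration is just this (for reference):

glEnable(GL_DEPTH_TEST);   /* fragments are still compared against the depth buffer */
glDepthFunc(GL_LEQUAL);
glDepthMask(GL_FALSE);     /* ...but never update it */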

V-man
02-01-2004, 10:54 AM
Originally posted by jwatte:
I believe graphics cards use highly custom memory controllers. For example, framebuffer blend is probably done in the memory controller, rather than a more remote "blend unit". (*)

(*) this is likely the reason we don't get blending for floating-point framebuffers on current hardware.

I don't understand.

It seems simple enough to me. If you can read a certain buffer that is in a certain format, and then do the blending, then it shouldn't be too hard to support another format (i.e. float).

Isn't it possible that NV and ATI do blending using integers, and adding support for floats would be complicated and expensive?

These chips are expensive enough as is.

jwatte
02-01-2004, 08:53 PM
Doing blending in unsigned byte format would take reasonably few gates, and would probably execute in a clock cycle or less. Putting a full floating point unit (actually, 4) in the memory controller seems like it would be another ball of wax entirely; I don't think that you could make it run fast enough to be on the straight path to/from memory.

Of course, this is only speculation -- the closest I've ever come to hardware is register level. Would be awesome if someone who actually knew could comment.

V-man
02-02-2004, 10:54 AM
Probably all FPUs in the GPU are SIMD capable. So if the card can do a 256-bit memory fetch and write, then that means 8 floats at a time.

Currently, the Volari can do 512 bit (4 controllers, 128 bit each) memory access.

I guess we'll see when this feature comes out.
Anybody know if the next generation will have this?

OpenGL guy
02-05-2004, 04:57 PM
Originally posted by zeckensack:
We're going round in circles.
For all I know, ATI hardware does have the ability to correctly resolve the depth buffer, even when there's "hyper z incompatible" fragment processing going on.

So, OpenGL guy, you can selectively write (or selectively commit) depending on the outcome of combined Z test*, alpha test, KIL, whatever. You can.

So I assume - until I'm corrected - that you have some sort of fallback per-fragment culling silicon that's active whenever the "Z block" can't do its work. I assume that because if you don't, and all fallback visibility testing is indeed resolved by the "Z block", there would be absolutely no issue.

I'd imagine that the reason for all of this is that you can't keep the "Z block" data in sync with the results at the end of fragment processing due to pipelining reasons. Visibility results would come back an unknown amount of cycles later, even though you need correct Z at all times. Otherwise you'd risk rejecting perfectly visible fragments.

We can all go around poking our noses and wondering what's going on. So if you can, please do tell, or at least dismiss my ramblings as nonsense.

*I'm referring to "Z test" as test only. This is different from GL semantics, where active depth testing automatically implies active depth writes (barring the depth mask).

I can't make sense of what you're saying. What's your question?

JanHH
02-05-2004, 07:08 PM
"Maybe my viewpoint is different because I have seen how some hardware actually works..."

You must be Neo from The Matrix ;).

It seems to me that the main point of this discussion (and probably of the misunderstandings) is whether a depth TEST automatically implies a depth WRITE. If it does, a KIL statement in the shader will in fact not work correctly for fragments that pass the depth test but then get KILled in the shader; if not, it will. This is really all this is about, isn't it?

I once compared using alpha test to not using it in terms of performance, but am not sure if I remember the results correctly... I think there was no difference (on a GF4 Ti). But at least it looks different ;).

But I think that the GF FX s*cks anyway, for other reasons. The ARB_fragment_program performance is nothing but a shame, and although I do not like Direct3D or Microsoft in general at all, D3D still is the standard in gaming, and the GF FX shader performance there is, if I recall correctly, as bad as it is with ARB_fragment_program. So I wonder if Nvidia will be successful or even exist (think of 3dfx) in the future if they don't change this *soon*. I just read that the next generation of chipsets coming from them will still have the same problem...

Funny idea btw to post Nvidia performance optimization papers on ATI's web site.. maybe they should do it, for marketing reasons *g*.

OK this is off topic, sorry ;).

Jan



V-man
02-05-2004, 11:02 PM
The conversation between OpenGL guy and Dorbie got a little confusing.

GL guy said this

>>Early depth means that the depth buffer is updated before the "texkill" is executed. This means if the fragment passes the Z test, but fails the "texkill" then the depth buffer will contain the wrong data if early Z is enabled.

You must perform "texkill" and alpha test before depth test as shown in the OpenGL pipeline.<<<

In essence, we are talking about updating the depth buffer as soon as the depth test passes.

It is my understanding that Z testing occurs very early (the earliest fragment test). Is there a benefit to updating the depth buffer at this point? Can't you just update the depth at the end of the pipe, at the same time as you update the color and stencil?

Can a subsequent fragment, which is about to overwrite a fragment ahead of it that is still executing the FP, cause an abort of that FP?

The world may never know! :)

Mezz
02-06-2004, 12:44 AM
Wouldn't updating the Z buffer at the end lock you out of early Z testing entirely?

Suppose you did the Z test at the beginning for fragment A, then started running some heavy processing on it. Then fragment B comes along at the same pixel location as fragment A, and it also passes the Z test (against what's in the Z buffer at the time). When fragment A is done and writes to the Z/colour/stencil buffer, it now turns out that fragment B ought to have failed the Z test. So when fragment B is done, you either overwrite everything blindly, resulting in an incorrect image, or you have to do another Z test. Which happens?

Is this right/wrong, or am I just off base entirely?

-Mezz

MZ
02-06-2004, 06:49 AM
Originally posted by V-man:
The conversation between OpenGL guy and Dorbie got a little confusing.
Same feeling here, I'm barely able to track what the two guys *disagree* about.

OpenGL guy, I think you brought both a bit of confusion and a bit of enlightenment to this thread.

The confusion:
=============

endash asked:
Why would an early Z test be disabled if a shader contains a KIL as long as it doesn't write to the depth buffer? Wouldn't an early Z test just kill the fragment earlier and therefore give more chance for optimization?

OpenGL guy replied:
Early depth means that the depth buffer is updated before the "texkill" is executed. This means if the fragment passes the Z test, but fails the "texkill" then the depth buffer will contain the wrong data if early Z is enabled. You must perform "texkill" and alpha test before depth test as shown in the OpenGL pipeline.
It seems like you ignored the assumption about depth writes being disabled. This caused a lot of confusion in this thread.

Quote from "Radeon 9700 OpenGL Programming and Optimization Guide.doc" from ATI OGL SDK:
To enable the optimization, the app only needs to ensure that the shader is not writing a depth value and that pixels are not being killed by the shader or by the alpha test when the depth values are being updated.
According to this, the state of the depth mask changes a lot. Confirmation would be welcome.

The enlightenment:
==================
Your rationale about HW design in which early-depth-culling requires early-depth-write (IF depth write is enabled) was very valuable, thanks for that.

Now knowing this, I think there is a related issue with glStencilOp: when it is set to update the stencil buffer (depth+stencil actually) and KIL/alpha test is in use, then this all might turn off early-Z culling despite depth writes being disabled. Am I right?



V-man
02-06-2004, 09:01 PM
Originally posted by Mezz:
Wouldn't updating the Z buffer at the end lock you out of early Z testing entirely?

Suppose you did the Z test at the beginning for fragment A, then started running some heavy processing on it. Then fragment B comes along at the same pixel location as fragment A, and it also passes the Z test (against what's in the Z buffer at the time). When fragment A is done and writes to the Z/colour/stencil buffer, it now turns out that fragment B ought to have failed the Z test. So when fragment B is done, you either overwrite everything blindly, resulting in an incorrect image, or you have to do another Z test. Which happens?

Is this right/wrong, or am I just off base entirely?

-Mezz

Early z-test means that instead of processing the fragment (texturing, color sum or your FP), it will first do a depth test, and if your fragment fails it gets culled, leaving a hole for another fragment.

I don't want to talk about stencil testing here cause I think this complicates the matter.

If I remember correctly, on NVidia it's depth test, then stencil, then alpha. Of course, you have to enable the testing so that they will occur.

Writing to the depth buffer during testing would give you better performance, I guess.
I'm not sure how big the gain would be, though.
The GPU could cache a section of the buffer as well. I'm assuming that textures aren't the only thing that gets cached.
There is always the issue of comparing fragments against each other, because I think multiple fragments could be depth tested in parallel.

The problem comes in if you are doing depth replace. In this case, the fragment must be evaluated, then depth testing can occur.
This is an obvious performance killer.