Full support for early-z would be nice

As it stands now, using fragment kill (Cg discard) or modifying the depth value in a fragment program will disable early-z.

I’d like to see this changed so that it’s possible to set early-z to tolerate these things when one is simply using it as a mask and is only interested in setting the depth value for the next pass.

Granted, to a certain extent that’s what the stencil buffer is for; but it would be nice to have the option, especially since it would allow more complicated behavior to be specified, essentially using two layers of stencil. It might even make glStencilOp’s “zpass/zfail” distinction more useful.

I’d like to see this changed so that it’s possible to set early-z to tolerate these things
It’s physically impossible to do that. Well, for depth modifying programs, that is. After all, you can’t test something early until you have a value to test, and you don’t have a value to test until you process the fragment and get its result.

However, I wasn’t aware that discard actually turns off early-z. Are you sure this is the case?

But, there’s one show-stopper in all of this: for OpenGL to do something about it, it needs to specify early-z as being something real. Currently, early-z is an optimization that the specification allows to happen whenever the hardware can allow it. So really, if discard is stopping early-z, it’s either an un-optimized driver or a full-on hardware impossibility (i.e., for whatever reason, the hardware can’t allow it). And the spec can’t do anything about the latter.

Originally posted by Korval:

However, I wasn’t aware that discard actually turns off early-z. Are you sure this is the case?

See this post

See this post
That’s only for ATi hardware, yes?

Might just as well factor for the GCD, unless a plethora of shader permutations x 2 is acceptable :wink:

Anyway, just skimming through GPU Gems 2 I couldn’t find anything definitive on discard, and I haven’t sat down to test this stuff myself.

P.S. Kinda wondering if the G80 is doing things differently (where’s GPUG 3?!).

Originally posted by Korval:
[quote]See this post
That’s only for ATi hardware, yes?
[/QUOTE]Yes, however it is similar ( GPU Guide ) on Nvidia hardware. I do not know what the current status is with the G80, but I think most of the issues will still apply.

We now have a document in the ATI SDK that explains all the peculiarities of HyperZ:
http://ati.amd.com/developer/SDK/AMD_SDK_Samples_May2007/Documentations/Depth_in-depth.pdf

Originally posted by Lindley:
[b] As it stands now, using fragment kill (Cg discard) or modifying the depth value in a fragment program will disable early-z.

I’d like to see this changed so that it’s possible to set early-z to tolerate these things, if one is simply using it as a mask, and you’re only interested in setting the depth value for the next pass.[/b]
This is a hardware issue and not an OpenGL one. As for fragment kill, just get an HD 2900XT if that’s what you’re concerned about, since that hardware can now use EarlyZ for fragment kill.
For the modified depth, that will never be optimized by any hardware since it’s impossible as Korval pointed out. Using depth-out is not recommended in general, since it kills pretty much every opportunity to optimize. Plus it may slow down later passes too, because it easily trashes depth compression etc.
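
To make “depth-out” concrete: it is the mere presence of a depth output in the fragment program that forces the test to happen after shading, regardless of the value written. A minimal, hypothetical sketch of the construct being warned against (GLSL shown as a C string literal; the same applies to a Cg program declaring a DEPTH output):

[code]
/* Hypothetical illustration only: simply declaring a depth write in the
 * fragment program is enough to disable early-Z on the hardware discussed
 * here, even if the value written equals the interpolated depth. */
const char *depth_out_fs =
    "void main() {\n"
    "    gl_FragColor = vec4(1.0);\n"
    "    gl_FragDepth = gl_FragCoord.z; /* depth-out: forces a late Z test */\n"
    "}\n";
[/code]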

As for fragment kill, just get an HD 2900XT if that’s what you’re concerned about, since that hardware can now use EarlyZ for fragment kill.
I might if you could get ATi to actually start making good, performant hardware again. You know, something that can win in price/performance benchmarks against nVidia cards :wink:

Great article, Humus. Great list of do’s and don’ts, and the whys and wherefores are really appreciated.

For the modified depth, that will never be optimized by any hardware since it’s impossible as Korval pointed out.
For the sake of argument I could suggest that an application-specified maximum deviation could be tested to allow for a bounded modification to fragment Z, but I won’t.

Originally posted by Korval:
[quote]I’d like to see this changed so that it’s possible to set early-z to tolerate these things
It’s physically impossible to do that. Well, for depth modifying programs, that is. After all, you can’t test something early until you have a value to test, and you don’t have a value to test until you process the fragment and get its result.[/QUOTE]What if you’d simply like to set the depth for the next pass based on the result of this one? There really should be some way to write something into the depth buffer which isn’t actually used for depth-testing in the current iteration.

For example: I’ve got a full 2048x2048 image, and I’d like to do 3x3 nonmax suppression. One could simply do 9 texture lookups at each pixel and hope that cache locality keeps the speed up. Or, one could do two passes of 3 texture lookups each.

Obviously, the first pass could set an output depth value for the maximums in that direction, so that the second pass would only have to run the fragment program on those particular pixels, and the rest could be culled with early-Z.
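
As a rough sketch of the intent (pass1_program, pass2_program, bind_program and draw_fullscreen_quad are hypothetical helpers, not anything that exists here), the two passes might be set up like this:

[code]
/* Pass 1: horizontal 3-tap pass. Its fragment program exports depth:
 * 1.0 (far) for a local maximum, 0.0 (near) otherwise. Exporting depth
 * disables early-Z for this pass, which is acceptable because every
 * pixel has to be visited here anyway. */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_ALWAYS);
glDepthMask(GL_TRUE);
bind_program(pass1_program);       /* hypothetical helper */
draw_fullscreen_quad(0.0f);        /* hypothetical helper; quad z is overridden */

/* Pass 2: vertical 3-tap pass drawn at z = 0.5 with GL_LESS. Fragments
 * over non-maxima (stored depth 0.0) fail the test and can be rejected
 * by early-Z; only the maxima from pass 1 get shaded. */
glDepthFunc(GL_LESS);
glDepthMask(GL_FALSE);             /* preserve the mask for any later pass */
bind_program(pass2_program);
draw_fullscreen_quad(0.5f);
[/code]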

Stenciling wouldn’t be quite appropriate: you can’t just “discard” nonmaxes on the first pass, because you have to write max information for use in the second pass, and Cg currently has no way to output to the stencil buffer directly.

That all works fine up until you have a 3-pass algorithm. Suddenly, the middle pass has to both cull a number of pixels using early-Z, and further restrict the number of pixels tested in pass 3. It may be possible to achieve this with the current software (z-setting-only passes, etc), but it’s pretty difficult, and seems inefficient. It’d be nice if there was some sort of “delayed depth setting” option or something.

There really should be some way to write something into the depth buffer which isn’t actually used for depth-testing in the current iteration.
… What?

So you want to depth test based on the polygon, but write something fake to the depth buffer? That breaks the integrity of the depth buffer.

What does the next polygon do, the one that happens to be slightly in front of the previous one? The depth in the buffer doesn’t match the geometry’s depth, so that’s going to look a bit odd.

One could simply do 9 texture lookups at each pixel, and hope that cache locality keeps the speed up. Or, one could do two passes of 3 texture lookups each.
Um, I’m pretty sure that the 9 lookups will be faster because they all happen in one pass. No need for vertex T&L just to get another set of textures (or vertex T&L feedback, thus taking up memory).

Originally posted by Korval:
[quote]There really should be some way to write something into the depth buffer which isn’t actually used for depth-testing in the current iteration.
… What?

So you want to depth test based on the polygon, but write something fake to the depth buffer? That breaks the integrity of the depth buffer.

What does the next polygon do, the one that happens to be slightly in front of the previous one? The depth in the buffer doesn’t match the geometry’s depth, so that’s going to look a bit odd.
[/QUOTE]You’re not really thinking in GPGPU terms here, whereas I am. I’ve got a multi-step computation, and each step determines which values in the next step need to be evaluated further, a whittling-down process. I’d like to be able to use early-z to both control which fragments are computed upon entering a pass, and set the depth buffer up to control which fragments will be computed in the next pass. Think of each pass as masking more and more of the image (e.g., the computation grid).

An alternate means of accomplishing this would be to discard passing fragments, and simply render a series of screen-size quads with greater and greater depth for each pass. That might keep depth buffer “integrity” a bit more intact, but it introduces other hassles.
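
Assuming discard does not itself defeat early-Z on the target hardware, that alternative might be sketched like this (pass_program, bind_program and draw_fullscreen_quad are again hypothetical):

[code]
/* Clear depth to the far plane; "live" pixels keep this value. */
glClearDepth(1.0);
glClear(GL_DEPTH_BUFFER_BIT);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);

/* Each pass draws a full-screen quad slightly deeper than the last.
 * The fragment program discards fragments that should stay live, so they
 * keep depth 1.0; finished fragments write the quad's depth and fail the
 * depth test (ideally via early-Z) in every later pass. */
for (int pass = 0; pass < num_passes; ++pass) {
    bind_program(pass_program[pass]);         /* hypothetical helper */
    draw_fullscreen_quad(0.1f * (pass + 1));  /* hypothetical helper */
}
[/code]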

Obviously introducing false values into the depth buffer shouldn’t be easy; but I would like it to be possible, given some glEnable call. It’s not like it’d do anything worse than what glDepthMask(GL_FALSE) lets you do in the normal 3D context…

You’re not really thinking in GPGPU terms here, whereas I am.
And therein lies the problem. We should not be extending a rendering API with features that don’t help with rendering tasks.

Oh well. I’m sure you could come up with a graphics example if you wanted to, I’m just not doing that right now.

(Giving the programmer more direct control over the hardware, suitably dummy-protected of course, is never a bad thing.)

At the very least, fixing stencil buffers so they actually work with FBOs is a must. That would allow you to do some of the above. Right now, best I can tell, even the packed_depth_stencil extension doesn’t work with FBOs unless you examine it with a glReadPixels first. (It’s Schroedinger’s stencil buffer!)
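
For reference, the intended usage under discussion, EXT_packed_depth_stencil combined with EXT_framebuffer_object, would look roughly like the sketch below (width, height and the bound FBO are assumed to exist elsewhere; whether this actually works without the glReadPixels workaround is exactly what’s in question):

[code]
/* One renderbuffer stores both depth and stencil (DEPTH24_STENCIL8) and is
 * attached to both the depth and stencil attachment points of the FBO. */
GLuint depth_stencil_rb;
glGenRenderbuffersEXT(1, &depth_stencil_rb);
glBindRenderbufferEXT(GL_RENDERBUFFER_EXT, depth_stencil_rb);
glRenderbufferStorageEXT(GL_RENDERBUFFER_EXT, GL_DEPTH24_STENCIL8_EXT,
                         width, height);

glFramebufferRenderbufferEXT(GL_FRAMEBUFFER_EXT, GL_DEPTH_ATTACHMENT_EXT,
                             GL_RENDERBUFFER_EXT, depth_stencil_rb);
glFramebufferRenderbufferEXT(GL_FRAMEBUFFER_EXT, GL_STENCIL_ATTACHMENT_EXT,
                             GL_RENDERBUFFER_EXT, depth_stencil_rb);
[/code]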

If you need more direct control for GPGPU purposes maybe you should use Nvidia CUDA or ATI “Close To the Metal” APIs.

I’d love to. Unfortunately, the people I’m writing code for don’t have an 8800 yet.

Functionally, OpenGL specifies the right thing; it is essential functionality, since specifying early-z in the fragment flow would cause occlusion issues (it would be broken).

Early Z means z writes may have to be deferred pending pixel kills later in the pipe. This is the central problem in supporting early Z, and it takes hardware to solve it.

When the shader result can alter the Z (and stencil) write, you have a problem that hardware must solve. This is not just about modification of the depth write value (which some HW does), but also about killing the framebuffer pixel write.

Hardware designers are intimately familiar with this and make their choices based on what they see as the best performance tradeoffs.

Originally posted by Korval:
[quote]You’re not really thinking in GPGPU terms here, whereas I am.
And therein lies the problem. We’ll not have any extending of a rendering API with features that don’t help with regard to rendering tasks.
[/QUOTE]This anti-GPGPU bigotry is getting annoying.

Is there no way to make the hardware a bit more flexible, e.g. allow the programmer to decide when early-z should be attempted?

It’s extremely simple to get a depth buffer into a state you want it to be in. It’d be awesome if you could do that any way you like (rendering, glDrawPixels, attaching a previously set up buffer to an FBO) and then request that the values in that buffer be used for early-Z on the next render pass.
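
As a rough sketch of what that could look like with today’s API (fbo, precomputed_depth_tex and draw_fullscreen_quad are hypothetical; the missing piece is any guarantee that the test is actually performed early):

[code]
/* Attach a depth texture that was filled earlier by any means (rendering,
 * glTexSubImage2D, a previous pass) and use it purely as a read-only mask. */
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_DEPTH_ATTACHMENT_EXT,
                          GL_TEXTURE_2D, precomputed_depth_tex, 0);

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_FALSE);        /* test against the mask without overwriting it */
draw_fullscreen_quad(0.5f);   /* the quad's depth selects which texels survive */
[/code]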

The only thing that would need to change, I think, is the definition of when a frame ends. Currently, as I understand it, early-z and a bunch of other things are reset on every glClear(GL_DEPTH_BUFFER_BIT). It’d be wonderful if there was a glEndFrame() or somesuch that reset all those things without actually clearing anything.

This anti-GPGPU bigotry is getting annoying.
It’s not bigotry. It’s wanting my rendering API to not be co-opted into something else.

I don’t care what you do with the hardware. I don’t mind things like CUDA or CTTM. These are exactly as they should be: specialized APIs for specialized tasks.

This is better for both parties. Graphics developers don’t have to worry about their APIs getting bogged down by unnecessary baggage, and GP-GPU guys get an API that is designed specifically for their needs.

It’s win-win.

Currently, as I understand it, early-z and a bunch of other things are reset on every glClear(GL_DEPTH_BUFFER_BIT).
Early-z is not a buffer. It is not a thing that is cleared and uncleared. It is simply performing a calculation ahead of time.

Hi-Z, hierarchical Z-buffers, or whatever other techniques ATi/nVidia come up with for accelerating z-tests are entirely separate things from early-z. These enhanced Z-buffer mechanics exist to cull large segments of fragments, not just one. And even they are merely a different way of interpreting the z-buffer itself.

The use of early-z depends only on what the fragment shader and post-fragment shader settings are. The state of the z-buffer (the values stored in it) is ultimately irrelevant to early-z. Now, the other z-buffer techniques like Hi-Z do depend on the state of a particular section of the z-buffer.