Data dependent Early Fragment Rejection on nVidia FX (NV3X) hardware (Impossible?)

I’m trying to implement a GPU raytracer on FX 5950U hardware. For performance, it is crucial to reject selected fragments early, before they reach fragment processing.
The selection is made by computing a value in a fragment program in one or more previous passes. Some papers call this a “computation mask”. Because the mask is computed on the GPU, I call it “data dependent”.

nVidia claims they support early-z culling, early stencil rejection, etc. etc.

That is true, but it seems to be true only for the non-data-dependent case (e.g. stencil shadowing, where z writes are done only by rendering primitives sent from the app).

On the other hand, it seems that ATI (Radeon 9800) supports the data-dependent case.

There were discussions on this topic in the past at shadertech.com, gpgpu.org and OpenGL.org.

Some GPGPU papers also refer to this feature: Purcell, Chan & Durand, and others.

But nobody has been able to give a clear answer. Is it possible to configure the OpenGL machine to do data-dependent early fragment rejection somehow on NV38 hardware?

There are some very expensive workarounds: one by Warlock (http://www.shadertech.com/forums/viewtopic.php?t=2044), using readback to the CPU and rendering to depth with GL_POINTS, and one by me, implementing the same idea with VBO/PBO.

It would be nice if some independent nVidia guru could confirm that there is no more efficient way to do data-dependent fragment rejection on FX hardware.

I don’t expect the nVidia guys to do so, because they would in effect be saying “our FX-class GPUs are not suitable for many GPGPU tasks because they lack this feature; ATI beats us here”.

Yes, that is in fact true!

There is no way to implement data-dependent early-z on the GeForce FX. If you write data-dependent z-values with the GeForce FX, early-z gets disabled until the depth-buffer is cleared.

Some people told me that the problem persists with the GeForce 6 (I don’t have a GeForce 6 yet to try it out).

  • Klaus
    BTW: ATI also supports data-dependent early-stencil tests - very nice …

I was under the impression that the same is true for ATI cards. Or was it only hierZ that gets disabled?

Originally posted by Klaus:
There is no way to implement data-dependent early-z on the GeForce FX. If you write data-dependent z-values with the GeForce FX, early-z gets disabled until the depth-buffer is cleared.
Note that this is not a problem; this is a basic fact of any early-z check architecture.

With early-z checks, these checks happen before the fragment program. If the fragment program is going to change the outcome of the test (by changing the depth value), then the test clearly must happen after the fragment program. Otherwise, you may cull fragments that shouldn’t be culled based on the results of the fragment program operation.

There is no way to do early-z checks if the fragment program changes the z-depth.
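The ordering argument can be illustrated with a small CPU-side sketch in Python. This is not GPU code, and every name in it (`early_z`, `late_z`, the two toy shaders) is hypothetical; it only models a GL_LESS depth test applied before or after a per-fragment function.

```python
def late_z(fragments, depth_buffer, shader):
    """Reference behaviour: run the shader, then depth-test the depth it outputs."""
    survivors = []
    for pixel, interp_z in fragments:
        out_z = shader(interp_z)
        if out_z < depth_buffer[pixel]:  # GL_LESS against the stored depth
            survivors.append(pixel)
    return survivors


def early_z(fragments, depth_buffer, shader):
    """Test the interpolated depth first; only survivors reach the shader."""
    survivors = []
    for pixel, interp_z in fragments:
        if interp_z < depth_buffer[pixel]:
            shader(interp_z)  # depth output is ignored, the test already happened
            survivors.append(pixel)
    return survivors


depth_buffer = {0: 0.5, 1: 0.5}
fragments = [(0, 0.8), (1, 0.2)]  # (pixel, interpolated depth)


def keep_depth(z):
    return z  # shader leaves the depth untouched


def replace_depth(z):
    return 1.0 - z  # shader writes a different depth


# Without depth replacement both orderings agree, so the early test is safe.
same = early_z(fragments, depth_buffer, keep_depth) == late_z(fragments, depth_buffer, keep_depth)
# With depth replacement the early test culls the wrong fragments.
differs = early_z(fragments, depth_buffer, replace_depth) != late_z(fragments, depth_buffer, replace_depth)
```

In this toy setup the two orderings agree only while the shader leaves depth alone, which is why a depth-replacing program forces the test to run after it.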

Korval:
I’m sorry, but you are confusing the topic a little. We are not talking about a depth-replacing fragment program that wants to benefit from early-z culling. That is not possible in principle, and I hope we all agree on it. You are absolutely right on this point.

I’m talking about two subsequent passes.

In the first pass, FP#1 computes the state of each pixel and writes it into the Z buffer as one of a set of predefined values (for example, with three states I can write Z=0, Z=0.5 or Z=1.0 to distinguish them). Yes, this program does depth replacement, and no early-z culling is expected. This pass is cheap, so I’m happy.

In the second pass, FP#2 performs some expensive computation (e.g. tracing a ray). This program does no depth replacement, alpha test, texkill or the like. It just computes intensively and writes the result into the output color register. I want to cull all fragments that were not marked by the previous pass to save computational power. The computation is expensive, so I want to cull as many fragments as possible before they reach the fragment processor.

This is possible on ATI and not possible on nVidia according to my information. See links I have posted for more details in previous threads.
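The two-pass scheme can be sketched as a CPU-side simulation in Python. It models the desired behaviour only; whether a given GPU really applies the depth test before FP#2 is the open question here. The names `pass1_state`, `expensive_trace` and `QUAD_Z`, the checkerboard rule, and the choice of two states (Z=1.0 for "needs work", Z=0.0 for "skip") are assumptions made for illustration.

```python
WIDTH, HEIGHT = 4, 4
QUAD_Z = 0.5  # assumed depth of the full-screen quad drawn in pass 2


def pass1_state(x, y):
    """Hypothetical cheap FP#1: does pixel (x, y) still need the expensive pass?"""
    return (x + y) % 2 == 0  # stand-in rule: a checkerboard


def expensive_trace(x, y, counter):
    """Hypothetical expensive FP#2; counts how often it is invoked."""
    counter["calls"] += 1
    return x * 10 + y


# Pass 1: depth replacement writes the mask (1.0 = needs work, 0.0 = skip).
depth = [[1.0 if pass1_state(x, y) else 0.0 for x in range(WIDTH)]
         for y in range(HEIGHT)]

# Pass 2: quad at QUAD_Z, depth func GL_LESS, depth writes off, no depth replacement.
counter = {"calls": 0}
color = [[None] * WIDTH for _ in range(HEIGHT)]
for y in range(HEIGHT):
    for x in range(WIDTH):
        if QUAD_Z < depth[y][x]:  # the test runs before the expensive program
            color[y][x] = expensive_trace(x, y, counter)
```

With this mask only half of the 16 pixels reach `expensive_trace`; without early rejection the program would run for every pixel and the masked results would be thrown away afterwards.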

Originally posted by Korval:
If the fragment program is going to change the outcome of the test (by changing the depth value), then the test clearly must happen after the fragment program.

That’s obvious - however, nobody was talking about changing the depth value in a fragment program.

Applications with a lot of semi-transparent compositing operations can greatly benefit from terminating further processing of individual pixels by writing z-values in an intermediate pass. The z-value is not changed in this intermediate pass.
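As a rough illustration of the compositing case, here is a CPU-side Python sketch of front-to-back blending in which an intermediate step marks saturated pixels so that later layers skip them. The threshold `ALPHA_CUTOFF`, the per-pixel layer opacities and the `terminated` list (a stand-in for the z mask) are all assumptions, not values from any real application.

```python
ALPHA_CUTOFF = 0.95        # assumed opacity at which a pixel counts as finished
LAYER_ALPHA = [0.6, 0.3]   # assumed per-layer opacity: pixel 0 dense, pixel 1 thin
NUM_LAYERS = 8

accum = [0.0, 0.0]             # accumulated opacity per pixel
terminated = [False, False]    # stand-in for the z mask written in the intermediate pass
shaded = [0, 0]                # how many layers were actually shaded per pixel

for layer in range(NUM_LAYERS):
    # Compositing pass: masked pixels are rejected before any shading work.
    for p in range(2):
        if terminated[p]:
            continue
        shaded[p] += 1
        accum[p] += (1.0 - accum[p]) * LAYER_ALPHA[p]  # front-to-back "over"
    # Intermediate pass: mark pixels whose opacity has saturated.
    for p in range(2):
        if accum[p] >= ALPHA_CUTOFF:
            terminated[p] = True
```

In this toy run the dense pixel stops after four layers while the thin one is shaded for all eight, which is the saving that early rejection is meant to deliver.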

  • Klaus

Originally posted by Klaus:
That’s obvious - however, nobody was talking about changing the depth value in a fragment program.

It was strongly implied by the original post:

Originally posted by woid:
That is true, but it seems to be true only for the non-data-dependent case (e.g. stencil shadowing, where z writes are done only by rendering primitives sent from the app).

Originally posted by dorbie:
It was strongly implied by the original post
Well, I understood woid :)

Can anyone with a GeForce6 confirm that early-z only works with opaque polygons (without “holes”) on that architecture?

  • Klaus

Yup, now that it has been made clear, the rest of us can keep up :) This is probably a symptom of early z vs. coarse z. Coarse z probably relies on some block destination cache on the chip, and that is defeated when the earlier replacement pass writes depth instead of vanilla primitive writes. In the absence of an early z test, coarse z alone shows its Achilles’ heel.

Originally posted by Klaus:
Can anyone with a GeForce6 confirm that early-z only works with opaque polygons (without “holes”) on that architecture?

  • Klaus

I have a 6800 GT. If you can send me an app, I’ll test it for you.

Originally posted by Zeno:
I have a 6800 GT. If you can send me an app, I’ll test it for you.
Thanks, Zeno - I finally got a 6800 GT today. Same behaviour as on the old FX cards.

I’ll wait for the Quadro card and try again (there’s a driver setting for Quadro cards to enable early-z).

[Edit]No wait. Had to change driver settings from “multi-display performance mode” to “single display mode” (using a two monitor setup here). Now data-dependent early-z works on the 6800. Great! No early-stencil though …

  • Klaus

Originally posted by Klaus:
[Edit]No wait. Had to change driver settings from “multi-display performance mode” to “single display mode” (using a two monitor setup here). Now data-dependent early-z works on the 6800. Great! No early-stencil though …

Awesome, glad to hear early z works :) About the stencil - there was a big discussion at Beyond3D a while back about one of Humus’ recent demos. He relied on early stencil rejection to avoid a lot of shading calculation. For a while the speedup wasn’t working on NVIDIA cards, but (if I recall correctly) someone noticed that a small change in the stencil test setup fixed the problem and early stencil did work. I’ll look it up at home if no one else chimes in before then.

Early stencil works as long as you don’t write stencil values.