Early z and discard

Hello everyone,

I am a little confused about the conditions for early fragment rejection based on depth.
I read that in most drivers early depth tests (between the VS and FS) are only used if:
- no alpha-to-coverage (GL_SAMPLE_ALPHA_TO_COVERAGE)
- no discard in the FS
- no changes to the z position in the FS
While the last condition (no z position changes) is quite obvious, I am not so sure about the reasoning for the other two.
In my mind early rejection should be fine either way, as long as the depth buffer is only written after the fragment shader runs.
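To make those conditions concrete, here is a minimal GLSL sketch (the texture and inputs are made up for illustration) of the kind of fragment shader I mean, one that would typically prevent early depth testing because it both writes gl_FragDepth and uses discard:

```glsl
#version 450

layout(location = 0) in vec2 vUV;               // hypothetical interpolated UV
layout(location = 0) out vec4 fragColor;

layout(binding = 0) uniform sampler2D uAlbedo;  // hypothetical texture

void main()
{
    // Condition 3: writing gl_FragDepth changes the fragment's depth, so the
    // depth produced by the rasterizer cannot be tested before the FS runs.
    gl_FragDepth = gl_FragCoord.z * 0.5;

    vec4 color = texture(uAlbedo, vUV);

    // Condition 2: 'discard' means the fragment might never reach the depth
    // write at all, so a combined test-and-write cannot be done up front.
    if (color.a < 0.5)
        discard;

    fragColor = color;
}
```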
I could only come up with two possible explanations:

  1. Something similar to imageAtomicMin is used on the depth buffer only once, which somehow mitigates the impact of the in-order-rasterization guarantee.
  2. The memory bandwidth cost of having both a z-test during rasterization and another one after the FS is too high.

In case 1 it would be interesting whether there are proposals for disabling this guarantee (AMD offers this for Vulkan). I have my doubts about case 2, because that sounds like a small memory bandwidth cost compared to the chance to reduce the number of fragment shader executions. Does anyone know more details about what is actually going on?

[QUOTE=MaVo159]Something similar to imageAtomicMin is used on the depth buffer only once, which somehow mitigates the impact of the in-order-rasterization guarantee.[/QUOTE]

Depth testing requires the “in-order rasterization guarantee” you’re talking about. Remember that OpenGL is specified to work as if all previous triangles had already done their rasterization. So any depth test must be able to read from depth values written by previous triangles.

So the operation is more like a fragment shader interlock. Conceptually, we can think of both the depth test and the depth write as interlock operations: both need ordering guarantees, and both need to ensure that no other fragment is trying to modify that memory while they run.

That means that the read/test/conditional-write must be done within one interlock. The interlock starts right before reading the depth value, and the interlock ends only when the depth write is performed.
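As an illustration of that read/test/conditional-write pattern, here is a GLSL sketch that emulates a depth test by hand using GL_ARB_fragment_shader_interlock, with an r32f image (a made-up binding) standing in for the depth buffer. This is purely conceptual; real hardware does it in its fixed-function depth unit, not in the shader:

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Per-pixel, primitive-order interlock: matches the "as if previous triangles
// had already rasterized" requirement described above.
layout(pixel_interlock_ordered) in;

// Hand-rolled "depth buffer" in an r32f image, for illustration only.
layout(binding = 0, r32f) coherent uniform image2D manualDepth;
layout(location = 0) out vec4 fragColor;

void main()
{
    ivec2 px = ivec2(gl_FragCoord.xy);
    bool passed;

    beginInvocationInterlockARB();                // interlock begins: read...
    float stored = imageLoad(manualDepth, px).r;
    passed = gl_FragCoord.z < stored;             // ...test (GL_LESS)...
    if (passed)
        imageStore(manualDepth, px, vec4(gl_FragCoord.z)); // ...conditional write
    endInvocationInterlockARB();                  // ...and only then release it.

    if (!passed)
        discard;
    fragColor = vec4(1.0);
}
```

Moving beginInvocationInterlockARB() to the very top of main() and endInvocationInterlockARB() to the very bottom is exactly the “wrap the interlock around the entire fragment shader” situation described next.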

In order for discard to not affect whether early tests are performed, this conceptual interlock would have to be wrapped around the entire fragment shader. No other fragment shaders for that pixel could even begin executing until that interlock completes. Is that a good idea? Of course not; even if it’s possible, it would kill performance. Holding interlocks for an indeterminate period of time is extremely bad for parallelism.

Oh, and don’t forget: the optimization is not “early depth tests”. It’s “early tests of anything that could cause the fragment to be discarded”. That includes stencil, multisample coverage, etc. It also includes occlusion queries.

As further evidence for this, look at what happens when you [i]explicitly force[/i] early fragment tests. When your FS does that, the integrity of the discard operation is compromised.

A normal discard operation will stop depth, stencil, and so forth from being written. Every element of a fragment is discarded. Whereas with a forced early test, the depth/stencil/etc is always updated (if it passes), whether you discard the fragment or not. The only thing discard stops in those cases is side-effects and color writes. Why?

Because the interlock for the other updates has already completed before the fragment shader even started. That’s what early tests mean.
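A minimal GLSL sketch of that behavior, assuming a made-up mask texture and a made-up counter image for the side effect:

```glsl
#version 450

// Force early tests: depth/stencil are tested AND written before main() runs.
layout(early_fragment_tests) in;

layout(location = 0) in vec2 vUV;                        // hypothetical input
layout(location = 0) out vec4 fragColor;
layout(binding = 0) uniform sampler2D uMask;             // hypothetical texture
layout(binding = 1, r32ui) uniform uimage2D uCounter;    // hypothetical side-effect target

void main()
{
    if (texture(uMask, vUV).r < 0.5)
    {
        // The depth/stencil values for this fragment were already tested and
        // written before main() started. 'discard' still prevents the color
        // write and the image atomic below, but it cannot undo that update.
        discard;
    }

    imageAtomicAdd(uCounter, ivec2(gl_FragCoord.xy), 1u); // side effect
    fragColor = vec4(1.0);
}
```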


Thanks for the detailed explanation. I understand that doing the whole read-test-write early would be wrong if the fragment could be discarded. However, I am having some trouble with your 4th paragraph:

[QUOTE=Alfonse_Reinheart]In order for discard to not affect whether early tests are performed, this conceptual interlock would have to be wrapped around the entire fragment shader. No other fragment shaders for that pixel could even begin executing until that interlock completes.[/QUOTE]

I don’t see why the interlock would be longer or more difficult to time than in any late-test scenario, regardless of an additional early test (which only reads). Am I missing something here?

Maybe I should rephrase the scenario I am imagining:
Let’s assume we implicitly or explicitly deactivate all tests other than depth (stencil etc.) and also assume we are in a situation that still requires late depth testing (discard in the FS).
Theoretically, the driver could still do an additional early z-test (without a write, just don’t execute the FS if the test fails).
The early test wouldn’t necessarily prevent all FS executions that would fail the late z-test, but it could prevent some. The late test would also still require the in-order guarantee, which shouldn’t be any more difficult than without the early test. The early test would also need to ensure that there are no writes going on, depending on whether the read is atomic or not.

Is such a redundant early test something some drivers actually do, or is the idea silly? Apart from the additional read (and potential locking), I still fail to see the downside.

[QUOTE=MaVo159;1282573]Thanks for the detailed explanation. I understand that doing the whole read-test-write early would be wrong if the fragment could be discarded. However, I am having some trouble with your 4th paragraph:

I don’t see why the interlock would be longer or more difficult to time than in any late-test scenario, regardless of an additional early test (which only reads). Am I missing something here?

Maybe I should rephrase the scenario I am imagining:
Let’s assume we implicitly or explicitly deactivate all tests other than depth (stencil etc.) and also assume we are in a situation that still requires late depth testing (discard in the FS).
Theoretically, the driver could still do an additional early z-test (without a write, just don’t execute the FS if the test fails).[/quote]

The driver can’t do squat; it’s the hardware that has to do it. Which means the hardware has to be able to:

1: Perform two separate depth tests, one before the FS and one after.

2: Perform a depth test without writing depth, but only during one of those tests.

Being able to do #1 is non-trivial. Early testing is not a matter of turning on one piece of circuitry and turning off another. It’s doing the same processing, just at a different point in the pipeline. Being able to do two tests would require having multiple pieces of comparison hardware, or doing some odd loop-back gymnastics on a per-fragment basis.

To what end? So that someone can write discard in their shader with no loss of early-z? Is discard really so prevalent in shaders as to be worth this?

[QUOTE=MaVo159]Apart from the additional read (and potential locking), I still fail to see the downside.[/QUOTE]

You say that like the extra (atomic) read is some minor issue. You’re talking about potentially doubling the memory bandwidth for the depth buffer. That is not something you should just gloss over.

Thanks, that gives me a better idea of the constraints.
My general intuition about the flexibility and trade-offs outside of the programmable parts of the pipeline isn’t great yet. For anything inside shaders, playing around with the shader compiler and reading the ISA documentation helps, but for stuff like this resources are sparse.

The GDC 2013 presentation “DX11 Performance Reloaded” has slides that show what NVIDIA and AMD do when discard (clip) is involved.

Clearly implementers agree with Alfonse_Reinheart that it’s not worth it to add a separate hardware path to support discard.


