ARB_FP execution: before or after depth test?

The specs are not clear about this:
Is depth (or stencil) testing performed before or after fragment processing through the FP? Since fragment programs can replace the z component of the fragment, depth testing should be done after the execution of the FP. Does this mean that per-pixel lighting calculations, for example, are performed for every fragment, whether it's visible or not? Or is there a distinction between depth-replacing programs and regular fragment programs?

Conceptually, the depth test always happens after the fragment shader. However, most modern hardware has some kind of early depth-test that can discard lots of hidden fragments early in the pipeline. This early depth-test is disabled if the shader writes out depth.

Originally posted by harsman:
Conceptually, the depth test always happens after the fragment shader. However, most modern hardware has some kind of early depth-test that can discard lots of hidden fragments early in the pipeline. This early depth-test is disabled if the shader writes out depth.
What do you mean, lots of hidden fragments? If this early depth-test uses the z value provided by the fixed pipeline, it should discard ALL hidden fragments (if, of course, the z-buffer is filled with the correct values after a quick z-buffer-only pass).
So, if I use a depth-replacing program, is there no way to use the z-buffer to reduce overdraw, since the fragments are discarded only after the fragment shader?

There is a huge slowdown when replacing the depth value. Would it be possible to give a range to the hw and to guarantee that the difference between the default and the new depth value lies within this range? When using z-correct bump mapping the depth is shifted only by a small amount. If the hw knows that the new depth, calculated by the shader, is in the range DefaultDepth-MinRange…DefaultDepth+MaxRange, most of these fragments could also be culled before shader execution.

Originally posted by LarsMiddendorf:
There is a huge slowdown when replacing the depth value. Would it be possible to give a range to the hw and to guarantee that the difference between the default and the new depth value lies within this range? When using z-correct bump mapping the depth is shifted only by a small amount. If the hw knows that the new depth, calculated by the shader, is in the range DefaultDepth-MinRange…DefaultDepth+MaxRange, most of these fragments could also be culled before shader execution.
Funny… I started this thread after reading your suggestion in the Opengl2x forum! It's a good suggestion, although I think it would make depth-testing a little more complex. How should GL_EQUAL, for example, work in this case?

I meant it just as a hint for the hw occlusion culling. The depth test should continue to work as expected. Pseudo Code:

 
// Early (pre-shader) test using the application-supplied bias range.
// The shader's final depth is guaranteed to lie in
// [Default+DepthBiasMin, Default+DepthBiasMax], so fragments that can
// never pass are culled before the shader runs.
DepthBuffer = ReadDepthBuffer();
Default = gl_FragCoord.z;
switch (gl_DepthFunc)
{
 case GL_EQUAL:
  if (DepthBuffer < Default+DepthBiasMin) discard;
  if (DepthBuffer > Default+DepthBiasMax) discard;
  break;
 case GL_LEQUAL:
  if (DepthBuffer < Default+DepthBiasMin) discard;
  break;
 case GL_LESS:
  if (DepthBuffer <= Default+DepthBiasMin) discard;
  break;
 ...
}

// The regular depth test still runs after the shader, on the real output depth.
ExecuteFragmentShader();
StandardDepthTest(DepthBuffer, gl_FragDepth, gl_DepthFunc);

I hope there is a way to configure the early z test this way to make z-correct bumpmapping fast.

A very good suggestion, Lars. I guess it depends on how many popular games are going to change the depth value in fragment programs, as to whether it's viable in transistor space.

Originally posted by mikeman:
What do you mean, lots of hidden fragments? If this early depth-test uses the z value provided by the fixed pipeline, it should discard ALL hidden fragments (if, of course, the z-buffer is filled with the correct values after a quick z-buffer-only pass).
So, if I use a depth-replacing program, is there no way to use the z-buffer to reduce overdraw, since the fragments are discarded only after the fragment shader?

Sorry, that was kind of unclear. All hidden fragments are of course discarded, there are no rendering artifacts. What I meant was that not every fragment that won't contribute to any pixel will necessarily be culled by the early depth-test. Think about what happens when you render all your objects in near-to-far order, for example.

I suppose the driver knows whether a fragment program changes FragDepth (while compiling the FP code). In that case the driver should disable the early z-test. But if the FP doesn't change FragDepth, the early z-test should stay enabled. If it works as I expect, we can get a huge speed-up if we first render only to the z-buffer and after that render the image with shaders. Something like deferred shading. It can save a lot of GPU clocks. :)

Someone should make a benchmark app for that. One big quad with depth = 0.0 and lots of quads with an FP at depth > 0.0, once with an FP that changes FragDepth and once without. I can do that but my card doesn't support FP (GF Ti-4800SE) :(

yooyo
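
For what it's worth, a rough sketch of the benchmark yooyo describes could look like the following. This is only an outline, assuming an extension loader has set up the ARB_fragment_program entry points; expensiveFP is a hypothetical program object the app would have to create and load itself, once as a normal FP and once as a depth-replacing FP.

#include <GL/gl.h>
#include <GL/glext.h>

/* Draws a full-screen quad at the given clip-space z (identity matrices assumed). */
static void draw_quad(float z)
{
    glBegin(GL_QUADS);
    glVertex3f(-1.0f, -1.0f, z);
    glVertex3f( 1.0f, -1.0f, z);
    glVertex3f( 1.0f,  1.0f, z);
    glVertex3f(-1.0f,  1.0f, z);
    glEnd();
}

/* Render one occluder in front, then many hidden quads with an expensive FP.
   Time this once with a normal FP and once with a depth-replacing FP. */
static void run_benchmark(GLuint expensiveFP)
{
    int i;

    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    draw_quad(-1.0f);                      /* occluder at window depth 0.0 */

    glEnable(GL_FRAGMENT_PROGRAM_ARB);
    glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, expensiveFP);
    for (i = 0; i < 100; ++i)
        draw_quad(0.5f);                   /* fully hidden behind the occluder */
    glDisable(GL_FRAGMENT_PROGRAM_ARB);

    glFinish();                            /* make the GPU work measurable */
}

If early z works for the non-depth-replacing FP, the hidden quads should cost almost nothing; with the depth-replacing FP the frame time should jump.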

Originally posted by yooyo:
If it works as I expect, we can get a huge speed-up if we first render only to the z-buffer and after that render the image with shaders. Something like deferred shading. It can save a lot of GPU clocks. :)
yooyo

I think this is common ground for developers of applications using fragment shading. Doom3 does the exact same thing.
One quick pass fills the z-buffer with correct values, then you turn z-writing (not z-testing) off and shade only the visible fragments. If the rendering is done otherwise, any amount of overdraw literally kills performance, since fragment processing is too expensive to waste on invisible fragments. In my applications I see a 30-40% gain in FPS, and I don't even use geometry as complex as real games do. That's why it is absolutely necessary to come up with a solution for discarding fragments early when replacing the depth, otherwise techniques like z-correct bump mapping are useless for real-time rendering.
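
As a minimal sketch of that pass ordering (drawScene() and drawSceneWithShaders() are hypothetical helpers standing in for the application's geometry submission):

/* Pass 1: depth only - fill the z-buffer, write no color. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
glEnable(GL_DEPTH_TEST);
drawScene();                    /* cheapest possible shading, ideally near-to-far */

/* Pass 2: shade only the fragments that survived pass 1. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);          /* z-buffer is already correct, don't rewrite it */
glDepthFunc(GL_LEQUAL);         /* depths equal to pass 1 must still pass */
drawSceneWithShaders();         /* expensive fragment programs run only where visible */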

I've got a Radeon 9800 Pro and, surprisingly, it makes nearly no difference whether there are one or two screen-filling quads behind each other with a complex shader when the z-buffer is filled previously. Areas in shadow that are stenciled out are also drawn faster, so it seems there is also some kind of early stencil test. Unfortunately the whole speedup goes away when replacing the depth value.

Yes, that is exactly how early Z is documented to work. If you first lay down the Z buffer, and then render with LEQUAL Z testing, you will get the best performance. When laying down the Z buffer, you should draw near-to-far as well.

And, yes, modifying the Z value means that early Z cannot be used (because the hardware can’t know what you’re going to change it to).

There are some idiosyncrasies; for example, I believe that enabling alpha test, or turning off depth writes, will disable the early Z test on some cards, so try not doing that if you want the performance boost.

Originally posted by jwatte:
There are some idiosyncrasies; for example, I believe that enabling alpha test, or turning off depth writes, will disable the early Z test on some cards, so try not doing that if you want the performance boost.
That's not true. As long as the hardware knows the fragment's z value before fragment processing, the early z-test should stay enabled. It doesn't matter whether that z value will be written to the buffer in the end; the z-test works as always.
In fact, after the first quick z-buffer-only pass, you MUST turn off z-writes, otherwise you'll end up updating the z-buffer with the same values over and over again.
Alpha test happens, of course, after fragment processing, but that was always the case, even with the fixed pipeline; alpha is part of the color output.

Just so we don't forget… When mixing the fixed-function pipeline in the first pass with vertex+fragment programs in the other passes, the early z-test can only work if position invariance (OPTION ARB_position_invariant) is enabled in the vertex program.

yooyo
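
A minimal sketch of such a vertex program and how it might be loaded (assuming the ARB_vertex_program entry points are available; the program only passes the color through, since with the position-invariant option the program must not write the position itself):

#include <string.h>
#include <GL/gl.h>
#include <GL/glext.h>

/* With ARB_position_invariant the position is computed exactly as in the
   fixed-function pipeline, so depth values match those from a fixed-function
   z-only pass and early z can keep working. */
static const char vpSource[] =
    "!!ARBvp1.0\n"
    "OPTION ARB_position_invariant;\n"
    "MOV result.color, vertex.color;\n"
    "END\n";

static GLuint loadVertexProgram(void)
{
    GLuint prog;
    glGenProgramsARB(1, &prog);
    glBindProgramARB(GL_VERTEX_PROGRAM_ARB, prog);
    glProgramStringARB(GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                       (GLsizei)strlen(vpSource), vpSource);
    return prog;
}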

After some thought, I believe the simplest solution to our problem (the slowdown when replacing depth) is to add a special fragment shader variable that allows us to read the current value of the z-buffer. Such a variable is mentioned in the GLSL spec but is not implemented yet.
Using that, we could simply disable the fixed z-test and perform our own depth test at the beginning of the fragment shader, killing the fragments that fail our test and preventing further processing. Any form of depth test, including Lars' suggestion, could be implemented this way.
Also, I was thinking: we could simulate this with current hardware by saving the z-buffer into a texture (à la shadow mapping), reading the value in the fragment shader and performing our depth test as I mentioned above. It could speed things up for depth-replacing programs, but I'm not sure; I haven't tried it yet.
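
A rough sketch of the copy step mikeman suggests, assuming a depth texture created via ARB_depth_texture (the shader-side comparison and kill would still have to be written by hand):

#include <GL/gl.h>
#include <GL/glext.h>

/* Copy the current depth buffer into a depth texture after the z-only pass.
   A fragment program can then sample it, compare against its own depth, and
   kill fragments that would fail the depth test. */
static void copyDepthToTexture(GLuint depthTex, int width, int height)
{
    glBindTexture(GL_TEXTURE_2D, depthTex);
    glCopyTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT24,
                     0, 0, width, height, 0);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
}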

Originally posted by mikeman:

Also, I was thinking: we could simulate this with current hardware by saving the z-buffer into a texture (à la shadow mapping), reading the value in the fragment shader and performing our depth test as I mentioned above. It could speed things up for depth-replacing programs, but I'm not sure; I haven't tried it yet.

Terminating a fragment program with the kill instruction, for example, won't speed up fragment processing on current hardware… the kill instruction does work, and prevents the fragment from producing a pixel, but it doesn't really stop the FP execution and thus doesn't save time…

btw, there are really some cases where the early z-test doesn't work, even though it theoretically should (witnessed on NV3x hardware… I'm afraid I didn't experiment enough on FX cards, but search for some posts by klaus, who did quite some research on this issue).

Originally posted by Chuck0:
Terminating a fragment program with the kill instruction, for example, won't speed up fragment processing on current hardware… the kill instruction does work, and prevents the fragment from producing a pixel, but it doesn't really stop the FP execution and thus doesn't save time…
Right. I did a little test on this and you're right: the fragment shader keeps executing, but the results are discarded. It's just a waste of GPU cycles and I think it should be fixed.

If by fixed you mean that it should really terminate fragment processing, I'm quite sure present hardware just isn't able to do that, since the fragment processing units are highly pipelined and still not that flexible… as far as I have heard, only next-generation hardware (like the NV40) will be able to really gain speed from the kill instruction.
This will certainly improve today's hardware raytracers and raycasters greatly, since they won't have to rely on hacks that use multiple passes and the z-test to prevent FP execution :)

Originally posted by mikeman:
Originally posted by jwatte:
There are some idiosyncrasies; for example, I believe that enabling alpha test, or turning off depth writes, will disable the early Z test on some cards, so try not doing that if you want the performance boost.
That's not true. As long as the hardware knows the fragment's z value before fragment processing, the early z-test should stay enabled. It doesn't matter whether that z value will be written to the buffer in the end; the z-test works as always.

Whatever you think about it, it happens to be a fact on ATI hardware. What is logical for software isn't always logical in hardware. If you need to juggle many fragments in parallel and avoid synchronisation issues, one solution is to put the z-test and write logic very close to the memory controller. In that case, doing early z-testing would require you to keep the z value of each fragment around for a long time during its travel down the pipeline, which might not be desirable.

I'm not saying that it's the best possible solution, but from a hardware view it probably has some benefits. I have no idea how Nvidia or other IHVs handle early z, though.

Originally posted by harsman:
I'm not saying that it's the best possible solution, but from a hardware view it probably has some benefits. I have no idea how Nvidia or other IHVs handle early z, though.
I have a GeForce FX 5200. The only things that disable the early z-test are running depth-replacing fragment programs and, of course, glDisable(GL_DEPTH_TEST). Alpha test and z-writes have no impact.