Can early KIL (ARB FP) improve fillrate?

Consider two theoretical fragment programs: A and B.
Program A is ~20 instructions long, and its 3rd instruction is KIL.
Program B is almost the same as A, but the KIL instruction is moved to the end of the program.
Both programs compute the same effect, and they kill exactly the same fragments.
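
For concreteness, the two programs might look roughly like this (a hypothetical sketch: ARB_fragment_program assembly held in C strings, as they would be passed to glProgramStringARB; the instruction count and the particular "cheap test" are made up):

    /* Program A: the KIL sits right after the cheap test, near the top. */
    static const char *programA =
        "!!ARBfp1.0\n"
        "TEMP t;\n"
        "PARAM half = {0.5, 0.5, 0.5, 0.5};\n"
        "TEX t, fragment.texcoord[0], texture[0], 2D;\n"  /* 1: cheap test */
        "SUB t, t, half;\n"                               /* 2 */
        "KIL t;\n"                                        /* 3: kill early */
        "# ... ~16 more 'expensive' instructions ...\n"
        "MOV result.color, t;\n"
        "END\n";

    /* Program B: identical work, same fragments killed, but the KIL is moved
     * to the very end of the program. */
    static const char *programB =
        "!!ARBfp1.0\n"
        "TEMP t;\n"
        "PARAM half = {0.5, 0.5, 0.5, 0.5};\n"
        "TEX t, fragment.texcoord[0], texture[0], 2D;\n"
        "SUB t, t, half;\n"
        "# ... ~16 more 'expensive' instructions ...\n"
        "MOV result.color, t;\n"
        "KIL t;\n"                                        /* same kill, now last */
        "END\n";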

Should I expect A to be faster than B?

Additional assumption: these programs are used in a situation where the majority of fragments are killed (for a mental picture, compare it to “invisible fillrate” consumption, as in shadow volumes).

I don’t have ARB FP capable HW yet, so I can’t test it myself.

Come on, this is one of the most interesting questions to get asked around here in ages and nobody answers it. Where are you IHV guys when you’re needed?

[This message has been edited by dorbie (edited 03-17-2003).]

There was a short discussion about this on the cgshaders.org forums a while back. If I remember correctly it was Gary King of NVIDIA (correct me if I’m wrong) who said that on NV30 the entire fragment program is always evaluated, so the position of the KIL instruction in the program has no influence on performance.

Interesting thread, http://www.cgshaders.org/forums/viewtopic.php?t=901

Hmm, I read something about that for DX9 shaders a while ago, but can’t remember where, so don’t take my word on it.

I think hardware up to ps_2_0 evaluates the whole fragment program independently of where the texkill instruction is placed.

However, I think ps_3_0 hardware is planned to do it properly and exit the program as soon as it encounters a texkill instruction, giving a speed improvement if the texkill is done early.

Y.

Is it that hard to move the kill upwards? BTW, you CAN check that out if you have a GeForce, just use NVEMU. And the effect will be even more noticeable, especially if you drop 10 instructions, since every fragment is pushed through the CPU, and that’s a lot slower!

Originally posted by M/\dm/\n:
Is it that hard to move the kill upwards?

This might break parallelism (where it exists) and introduce the need for a synchronization mechanism, i.e., more silicon.

Julien.

deepmind, he meant that for the original poster: is it that hard to just try it out and find out?

It doesn’t actually hurt parallelism that much, but until now, yep, it’s always been done at the end… it looks like all the test does is set an additional bit that gets checked automatically in the alpha test, or something like that.

This is from the ARB_fragment_program spec:

“KIL does not sample from a texture, but rather prevents further processing of the current fragment if any component of its 4-tuple vector is less than or equal to zero.”

The above gives the impression that as soon as KIL is encountered, processing is stopped.

Here’s some more

“Subsequent stages of the GL pipeline will be skipped for this fragment.”

Stages, not instructions. Everything that follows after the fragment program will be skipped.

But yes, it does sound misleading (the first one).

The general rule of thumb on things like this is to allow any early-outs you can. Such an optimization may not buy anything today but may in the future.

I would consider this similar to depth optimizations: I don’t think you will be able to measure any meaningful gain from doing the right thing today, but you should do it so that future hardware can benefit where applicable.

-Evan

That’s too bad; my emerging algorithm will have to wait in the fridge for a few years.
I tailored my question to be as precise as possible. In reality, it is not about the placement of KIL, but about the ability to skip expensive calculations for large groups of pixels.
About NV30 emulation: it emulates the feature set of NV30, not its implementation, so any performance comparison is of no use to me.

I had an idea to utilize early Z culling for such “cheap test + KIL + expensive calculations” fragment programs. The program should be split into 3 passes (a rough sketch follows the list):

  1. Run the “cheap test + KIL” part, rendering depth only and writing a near-Z value into the framebuffer where KIL lets you.
  2. Now run the “expensive” part, utilizing early Z culling to quickly skip the pixels rejected by the KIL in the previous pass.
  3. Render a flat quad to restore the contents of the depth buffer (from RTT).
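
An OpenGL-side sketch of those passes (the program handles and draw helpers here are placeholders, the ARB_fragment_program entry points are assumed to be loaded, and the depth range / comparison details are deliberately left out, as noted below):

    #include <GL/gl.h>
    #include <GL/glext.h>

    /* Placeholder handles and helpers, assumed to be set up elsewhere. */
    extern GLuint cheapTestProg, expensiveProg;
    extern void drawGeometry(void);
    extern void drawFlatQuadRestoringDepth(void);

    void renderWithEarlyZKill(void)
    {
        /* Pass 1: depth only.  cheapTestProg runs the cheap test, KILs the
         * rejected fragments, and writes a near-Z value for the survivors. */
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, cheapTestProg);
        drawGeometry();

        /* Pass 2: the expensive program; early Z culling rejects the pixels
         * that were KILled in pass 1. */
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, expensiveProg);
        drawGeometry();

        /* Pass 3: flat quad restores the original depth buffer from the RTT copy. */
        drawFlatQuadRestoringDepth();
    }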

I skipped details (such as depth range), but I hope you get the idea. In my algorithm this won’t work, because polygons overlap.

Thanks for replies, especially your support, dorbie.

Note that just because one fragment got to the KIL doesn’t mean that the next fragment will. There can be no hierarchical gains based on KIL output.

Rendering depth first (with no color/alpha writes), and doing it from front to back, is the recommended way to go. Then render with GL_EQUAL depth mode, assuming you can get depth-invariant vertex processing, or LEQUAL with possibly a small bit of polygon offset if you can’t. This will guarantee that you touch each pixel exactly once (per multisample), and thus the cost of using 50-instruction shaders should be “close to” 50 instructions, times the number of pixels on the screen, divided by the number of parallel fragment pipes.
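
In code, that looks roughly like this (a minimal sketch; the two draw helpers stand in for “render the scene”):

    #include <GL/gl.h>

    extern void drawSceneFrontToBack(void);          /* depth-only geometry pass */
    extern void drawSceneWithFragmentPrograms(void); /* expensive shading pass   */

    void renderDepthFirst(void)
    {
        /* Pass 1: lay down depth only, roughly front to back, with the
         * cheapest possible shading. */
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glDepthFunc(GL_LESS);
        drawSceneFrontToBack();

        /* Pass 2: full shading.  GL_EQUAL (given depth-invariant vertex
         * processing) guarantees each pixel is shaded exactly once. */
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        glDepthMask(GL_FALSE);
        glDepthFunc(GL_EQUAL);
        drawSceneWithFragmentPrograms();
    }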

Interesting information; the only disappointment is the naive assumption by some that this is somehow an oversight that someone is to blame for. SIMD parallelism just doesn’t accommodate this sort of thing, if that’s the hardware. It takes a radically different and more complex hardware implementation to exploit these opportunities for performance gains, and even then other issues may still limit what you can do.


I imagine that, once you get hardware that can do looping in fragment programs, you’ll see a performance benefit from early-outs.