What GL_RASTERIZER_DISCARD actually does

Hi All,

The title of this thread is a little bit cryptic, but I’ll try to explain it immediately.

According to all sources, glEnable(GL_RASTERIZER_DISCARD) should disable rasterization and prevent the fragment shader from being invoked at all.
Here are citations:

Everything is quite clear, isn’t it?

But it doesn’t work that way. At least not with NV 337.88 drivers (and probably not with any others either, but I currently have neither the time nor the will to check).

I wanted to find out what exactly is happening with the ROPs on Fermi (a discussion started in another thread), so I carried out a series of experiments:

Rendering without lighting:

  • 9.78ms - normal rendering;
  • 8.75ms - with trivial FS outputting just black pixels;
  • 6.28ms - with rasterization discard and FS outputting just black pixels.

With lighting and blending (the problem defined in another thread):

  • 12.91ms - normal rendering;
  • 9.31ms - with rasterization discard and full FS;
  • 8.78ms - with trivial FS outputting just black pixels;
  • 6.27ms - with rasterization discard and FS outputting just black pixels.

The conclusion is obvious.
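For clarity, the variants look roughly like this (a sketch, not the exact test code):

[code]
// Variant "trivial FS": the fragment shader only outputs black, e.g.
//
//   #version 330 core
//   out vec4 fragColor;
//   void main() { fragColor = vec4(0.0, 0.0, 0.0, 1.0); }
//
// Variant "rasterization discard": same draw calls, but wrapped in
glEnable(GL_RASTERIZER_DISCARD);   // no fragments should be generated at all
// ... draw the scene ...
glDisable(GL_RASTERIZER_DISCARD);
[/code]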

An interesting test on this could be to use the FS to increment an atomic counter (or write to an output shader buffer, but an atomic is easier to set up) and check the final result with rasterization turned off.
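Something along these lines, I mean (an untested sketch; the shader and buffer names, and indexCount, are just placeholders):

[code]
// Fragment shader: bump an atomic counter once per invocation.
//
//   #version 420 core
//   layout(binding = 0) uniform atomic_uint fragCount;
//   out vec4 fragColor;
//   void main() {
//       atomicCounterIncrement(fragCount);
//       fragColor = vec4(0.0);
//   }

// Host side: zero the counter, draw with discard enabled, read it back.
GLuint acb  = 0;
GLuint zero = 0;
glGenBuffers(1, &acb);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, acb);
glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), &zero, GL_DYNAMIC_READ);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, acb);

glEnable(GL_RASTERIZER_DISCARD);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, nullptr);  // indexCount: placeholder
glDisable(GL_RASTERIZER_DISCARD);

glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);
GLuint count = 0;
glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), &count);
// count should remain 0 if the FS really was never invoked with discard on.
[/code]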

The spec does seem to be quite specific about primitives not going into the rasterization stage, but the spec only has to appear to be “in effect”, as I’m sure you know. It’s perfectly valid for the implementation to process the complete pipeline and just ensure the result is discarded without affecting any of the output, since performance is not guaranteed in any way by the specs.

But I see your point. This is quite a trap.

When you link a program without separable shaders, the Nvidia compiler flags any input varying data that the fragment shader doesn’t use, and culls whatever computation they require from the previous stage (taking transform feedback into consideration, if TF varyings are set). So the compiler could be greatly simplifying your vertex shader on you. I’ve seen the result of this through program introspection - some vertex shader inputs may even disappear as they become inactive.
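As a rough illustration (hypothetical code, not your actual program; “program” is a placeholder for the linked monolithic program):

[code]
// Suppose the vertex shader computes a lit colour into an output:
//
//   out vec3 vLitColor;
//   ...
//   vLitColor = doLighting(normal, lightDir);   // expensive VS work
//
// If the fragment shader of the (non-separable) program never reads
// vLitColor, the linker may cull that lighting math and the attributes
// feeding it. You can observe this through introspection after linking:

GLint activeAttribs = 0;
glGetProgramiv(program, GL_ACTIVE_ATTRIBUTES, &activeAttribs);
for (GLint i = 0; i < activeAttribs; ++i)
{
    GLchar name[256];
    GLint  size = 0;
    GLenum type = 0;
    glGetActiveAttrib(program, (GLuint)i, sizeof(name), nullptr, &size, &type, name);
    printf("active attribute %d: %s\n", i, name);
    // Attributes used only by the culled lighting code (the normal, for
    // example) may no longer show up in this list.
}
[/code]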

But in the rasterizer discard case, it’s not re-linking your program, so you still get the vertex shader running all your lighting computations even with discard on. Based on your “trivial FS outputting just black pixels without rasterization discard” cases, it appears to be doing just that (8.75ms without lighting, vs 8.78ms with lighting).

You can test this by using all the fragment shader inputs in some trivial way (summing them, for example).
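Something like this, for example (hypothetical input names, not your actual shader):

[code]
// Fragment shader that still outputs black, but touches every input so the
// linker cannot mark the varyings (and the VS work feeding them) as unused.
#version 330 core

in vec3 vNormal;
in vec3 vLightDir;
in vec2 vTexCoord;

out vec4 fragColor;

void main()
{
    float keepAlive = vNormal.x + vNormal.y + vNormal.z
                    + vLightDir.x + vLightDir.y + vLightDir.z
                    + vTexCoord.x + vTexCoord.y;
    fragColor = vec4(0.0, 0.0, 0.0, 1.0) + 1e-9 * keepAlive;
}
[/code]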

Yes, it could be interesting to see what’s happening, but I’d rather try something that malexander proposed…

[QUOTE=malexander;1261701]But in the rasterizer discard case, it’s not re-linking your program, so you still get the vertex shader running all your lighting computations even with discard on. Based on your “trivial FS outputting just black pixels without rasterization discard” cases, it appears to be doing just that (8.75ms without lighting, vs 8.78ms with lighting).

You can test this by using all the fragment shader inputs in some trivial way (summing them, for example).[/QUOTE]
Very useful tip! Thanks!

I haven’t had time today to carry out all intended experiments, but this is something very interesting:

  • 11.10ms - full VS with lighting and an FS that outputs the value of the normal.

Compared to 8.78ms for the “pitch black FS”, this is a much longer time. That leads to the conclusion that the optimization can probably go beyond the scope of a single shader. Very interesting indeed! But I shouldn’t draw conclusions before some more experimenting.

If the optimization really goes beyond the scope of a single shader, the previous rasterization-discard results are completely valid. But one question still remains: why isn’t gl_Position also optimized out when rasterization discard is active?

Here are some new results with the same datasets on GTX470 and GTX850M:

What can we conclude from these results?

  [b]1. Shader optimization is going beyond shader boundaries in monolithic programs (without separate shader objects).

  2. GL_RASTERIZER_DISCARD works correctly.

  3. gl_Position is not optimized out even if GL_RASTERIZER_DISCARD is in effect.[/b]

What I still don’t understand is why rasterization is so expensive: 1.78ms just to fill an HD screen with black pixels on the GTX470, and as much as 5.41ms on the GTX850M. A 1:3 ratio.
The GTX470 has 40 ROPs while GM107 has 16 (8 per memory controller/channel). A 2.5:1 ratio. Of course, the clocks are different, but the number of ROP units seems to correspond to the speed of rasterization.
Anyway, it seems a little bit slow, doesn’t it?

The first point is dependent on the compiler. Certainly it seems that the Nvidia compiler does these sorts of optimizations. I guess there’s nothing stopping it from doing those optimizations on separable programs between consecutive stages, if a varying between them isn’t used. For point #3, gl_Position can’t be optimized out, because doing so would require a re-link of the shader dependent on the GL state. And linking a program is very expensive, so no one wants that :slight_smile:

[QUOTE]What I still don’t understand is why rasterization is so expensive: 1.78ms just to fill an HD screen with black pixels on the GTX470, and as much as 5.41ms on the GTX850M. A 1:3 ratio.[/QUOTE]

1.8ms seems about right from what I’ve seen for a desktop part. It still needs to run your trivial shader and do 2M framebuffer writes, possibly more with z-writes and depth testing (framebuffer reads) for 1080p. There will also be some fixed per-frame swapbuffer/OS overhead when you hand off the frame. 5.4ms does seem high to me, but it fits with the number of ROPs and their likely slower mobile clockspeed.
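Rough numbers, under the (big) assumption of a single full-screen 1080p pass with no overdraw, so treat them only as ballpark figures:

[code]
#include <cstdio>

int main()
{
    // Back-of-envelope fill rate, assuming one full-screen 1080p pass with
    // no overdraw (a lower bound on the real fragment count).
    const double pixels   = 1920.0 * 1080.0;   // ~2.07M fragments
    const double gtx470ms = 1.78;
    const double gtx850ms = 5.41;

    printf("GTX 470 : %.2f Mpix/ms\n", pixels / 1e6 / gtx470ms);   // ~1.17
    printf("GTX 850M: %.2f Mpix/ms\n", pixels / 1e6 / gtx850ms);   // ~0.38
    return 0;
}
[/code]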

Of course, I was speaking about NV implementation. :wink:

I’m not sure I understand this. Could you explain it a little bit more?

No, the time I displayed is GPU time, without swapbuffer/OS overhead. After that point there is more drawing and a variety of other work to execute before the buffers are swapped.

[QUOTE]I’m not sure I understand this. Could you explain it a little bit more?[/QUOTE]

Sure. When a shader is linked, its executable code is set in stone. This includes the code to produce gl_Position. If you change certain shader-state parameters then you have to re-link the program (e.g. transform feedback varyings, fragment data locations). However, GL_RASTERIZER_DISCARD is part of the global GL state and not tied to a program, so it can’t affect the shader’s executable code.
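To illustrate the difference (a rough sketch, not anyone’s actual code; “program” is just a placeholder):

[code]
// Program state that affects the generated executable: a change only takes
// effect after another glLinkProgram().
const GLchar* tfVaryings[] = { "gl_Position" };
glTransformFeedbackVaryings(program, 1, tfVaryings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);   // required for the new TF varyings to apply

// Global GL state: toggled freely, no relink involved, so the driver cannot
// "bake" it into the program's executable code.
glEnable(GL_RASTERIZER_DISCARD);
// ... draw for transform feedback only ...
glDisable(GL_RASTERIZER_DISCARD);
[/code]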

Hypothetically, in order for it to do so, the compiler would have to put conditionals in the program on some builtin GL-state variable (glRasterizeDiscard, say). Then the shader could become slower overall due to the conditionals, and it would mean a lot more work for the compiler and shader setup. Extending this hypothetical discussion to other OpenGL global states that might see some improvement if they were considered by the shader, I think you can see how such optimization would quickly become a real rat’s nest of conditionals in the shader code. So it’s best for the compiler not to do it :slight_smile:

In the situation where a shader is used with GL_RASTERIZER_DISCARD both off and on (perhaps for transform feedback), it would be much better for the user to create two different shader programs if it becomes a performance issue.

NVidia (and as far as I know most other vendors) recompiles shaders automatically depending on GL settings. It even issues a performance warning to the debug log when it does this.
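To actually see those warnings you can hook up the debug output callback, e.g. (a minimal sketch; requires a debug context or GL 4.3+, and the exact message text is driver-specific):

[code]
#include <cstdio>

// Print performance-type debug messages (NVIDIA reports recompiles there).
static void APIENTRY debugCallback(GLenum source, GLenum type, GLuint id,
                                   GLenum severity, GLsizei length,
                                   const GLchar* message, const void* userParam)
{
    if (type == GL_DEBUG_TYPE_PERFORMANCE)
        fprintf(stderr, "GL perf: %s\n", message);
}

void enableGLDebugOutput()
{
    glEnable(GL_DEBUG_OUTPUT);
    glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS);
    glDebugMessageCallback(debugCallback, nullptr);
}
[/code]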

However, this happens only once: the driver keeps multiple compiled program variants per shader program id, and it even stores all those variants in the binary blob if you use glGetProgramBinary. Thus it is recommended to write the program binary at the end of your application, when all possible uses of the program have happened, and not directly after linking.
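Just for reference, retrieving the blob at shutdown could look roughly like this (a minimal sketch with error handling omitted; “program” is a placeholder for your linked program object):

[code]
#include <vector>

// Hint, before linking, that we intend to retrieve the binary later.
glProgramParameteri(program, GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE);
glLinkProgram(program);

// ... run the application so every variant of the program gets used ...

// At shutdown, fetch the driver-specific blob (which may contain all variants).
GLint length = 0;
glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);

std::vector<char> blob(length);
GLenum binaryFormat = 0;
glGetProgramBinary(program, length, nullptr, &binaryFormat, blob.data());
// Persist both binaryFormat and the blob; restore later with glProgramBinary().
[/code]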