What does GL_RASTERIZER_DISCARD actually do?



Aleksandar
09-16-2014, 03:33 AM
Hi All,

The title of this thread is a little bit cryptic, but I'll try to explain it immediately.

According to all sources, glEnable(GL_RASTERIZER_DISCARD) should disable rasterization and prevent the fragment shader from being invoked at all.
Here are some citations:



Primitives can be optionally discarded before rasterization by calling Enable and Disable with target RASTERIZER_DISCARD. When enabled, primitives are discarded immediately before the rasterization stage, but after the optional transform feedback stage (see section 13.2). When disabled, primitives are passed through to the rasterization stage to be processed normally. When enabled, RASTERIZER_DISCARD also causes the Clear and ClearBuffer* commands to be ignored.
The state required to control primitive discard is a bit indicating whether discard is enabled or disabled. The initial value of primitive discard is FALSE.



- GL_RASTERIZER_DISCARD for advanced rendering control while doing transform feedback.
- If no fragment shader is present, rasterization can even be turned off by calling glEnable() with the parameter GL_RASTERIZER_DISCARD. This makes transform feedback the end of the pipeline and it can be used in this mode when only the captured vertex data is of interest and the rendering of primitives is not required.
- When rendering the sorting pass, we will not be rasterizing any polygons, and so our first pass program has no fragment shader. To disable rasterization we will call glEnable(GL_RASTERIZER_DISCARD). If an attempt is made to render with a program object that does not contain a fragment shader and rasterization is not disabled, an error will be generated.
- GL_RASTERIZER_DISCARD - Discard primitives before rasterization.



As transform feedback logically sits right before rasterization in the OpenGL pipeline, we can ask OpenGL to turn off rasterization (and therefore anything after it) by calling
glEnable(GL_RASTERIZER_DISCARD);
This stops OpenGL from processing primitives any further after transform feedback has been executed. The result is that our vertices are recorded into the output transform feedback buffers, but nothing is actually rasterized.


Everything is quite clear, isn't it?
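In other words, the canonical usage pattern should be roughly this (a minimal sketch; prog, tfBuffer and vertexCount are placeholder names, and glTransformFeedbackVaryings() is assumed to have been called before linking):

// Capture vertex shader outputs into a buffer without rasterizing anything.
glUseProgram(prog);
glEnable(GL_RASTERIZER_DISCARD);                 // nothing should reach the rasterizer
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, tfBuffer);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, vertexCount);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);                // restore normal rendering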

But it doesn't work that way. At least not on NV 337.88 drivers (and probably all others, but I currently have neither the time nor the will to check).

I wanted to find out what exactly happens with the ROPs on Fermi (a discussion started in another thread (http://www.opengl.org/discussion_boards/showthread.php/184779-NVIDIA-Maxwell-texturing-performance)), and carried out a series of experiments:

Rendering without lighting:
- 9.78ms - normal rendering;
- 8.75ms - with trivial FS outputting just black pixels;
- 6.28ms - with rasterization discard and FS outputting just black pixels.


With lighting and blending (the problem defined in the other thread (http://www.opengl.org/discussion_boards/showthread.php/184779-NVIDIA-Maxwell-texturing-performance)):
- 12.91ms - normal rendering;
- 9.31ms - with rasterization discard and full FS;
- 8.78ms - with trivial FS outputting just black pixels;
- 6.27ms - with rasterization discard and FS outputting just black pixels.

The conclusion is obvious.

Ed Daenar
09-16-2014, 03:52 AM
An interesting test on this could be to use the FS to increment an atomic counter (or write to an output shader buffer, but an atomic is easier to set up) and check the final result with rasterization off.
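Roughly like this (just a sketch; the binding point and names are arbitrary, and it needs GL 4.2 for atomic counters):

// FS: bump a counter so we can tell whether any fragment was ever launched.
#version 420
layout(binding = 0) uniform atomic_uint fragCount;
out vec4 color;
void main() {
    atomicCounterIncrement(fragCount);
    color = vec4(0.0);
}

// App side: zero the counter, draw with discard on, read it back.
GLuint acb;
glGenBuffers(1, &acb);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, acb);
GLuint zero = 0;
glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), &zero, GL_DYNAMIC_READ);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, acb);
glEnable(GL_RASTERIZER_DISCARD);
// ... draw ...
glDisable(GL_RASTERIZER_DISCARD);
GLuint count = 0;
glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), &count);
// count == 0 would mean no fragments ever ran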

The spec does seem to be quite specific about primitives not going into the rasterization stage, but the spec only describes behavior "in effect", as I'm sure you know. It's perfectly valid for the implementation to process the complete pipeline and just ensure the result is discarded without affecting any of the output, since performance is not a guarantee of any kind in the specs.

But I see your point. This is quite a trap.

malexander
09-16-2014, 11:30 AM
When you link a program without separable shaders, the Nvidia compiler flags any input varying data that the fragment shader doesn't use, and culls whatever computation it requires from the previous stage (taking transform feedback into consideration, if TF varyings are set). So the compiler could be greatly simplifying your vertex shader on you. I've seen the result of this through program introspection - some vertex shader inputs may even disappear as they become inactive.
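(For example, a GL 4.3-style query along these lines will list which vertex inputs survived linking; prog is a placeholder:)

// Enumerate the program inputs that are still active after linking.
GLint numInputs = 0;
glGetProgramInterfaceiv(prog, GL_PROGRAM_INPUT, GL_ACTIVE_RESOURCES, &numInputs);
for (GLint i = 0; i < numInputs; ++i) {
    char name[128];
    glGetProgramResourceName(prog, GL_PROGRAM_INPUT, i, sizeof(name), NULL, name);
    printf("active input: %s\n", name);   // culled inputs won't show up here
}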

But in the rasterizer discard case, it's not re-linking your program, so you still get the vertex shader running all your lighting computations even with discard on. Based on your "trivial FS outputting just black pixels without rasterization discard" cases, it appears to be doing just that (8.75ms without lighting vs. 8.78ms with lighting).

You can test this by using all the fragment shader inputs in some trivial way (summing them, for example).
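For instance, something like this (a sketch; the input names are placeholders):

// "Trivial" FS that still references every input, so the compiler can't
// cull the vertex-stage work that feeds them.
#version 330
in vec3 vNormal;
in vec3 vLightDir;
in vec2 vTexCoord;
out vec4 color;
void main() {
    float keep = dot(vNormal, vLightDir) + vTexCoord.x + vTexCoord.y;
    color = vec4(0.0, 0.0, 0.0, keep * 1e-6);  // near-black, but inputs stay "used"
}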

Aleksandar
09-17-2014, 08:53 AM
An interesting test on this could be to use the FS to increment an atomic counter (or write to an output shader buffer, but an atomic is easier to set up) and check the final result with rasterization off.
Yes, it could be interesting to see what's happening, but I'd rather try something that malexander proposed...


But in the rasterizer discard case, it's not re-linking your program, so you still get the vertex shader running all your lighting computations even with discard on. Based on your "trivial FS outputting just black pixels without rasterization discard" cases, it appears to be doing just that (8.75ms without lighting vs. 8.78ms with lighting).

You can test this by using all the fragment shader inputs in some trivial way (summing them, for example).
Very useful tip! Thanks!

I haven't had time today to carry out all the intended experiments, but this is something very interesting:

- 11.10ms for the full VS with lighting and an FS that outputs the value of the normal.

Compared to 8.78ms for the "pitch black FS", this is a much longer time. That leads to the conclusion that the optimization can probably go beyond the scope of the shader. Very interesting indeed! But I shouldn't draw conclusions before some more experimenting.

If the optimization really goes beyond the scope of a single shader, the previous rasterization-discard results are completely valid. But one question remains: why isn't gl_Position also optimized out when rasterization discard is active?

Aleksandar
09-18-2014, 04:10 AM
Here are some new results with the same datasets on GTX470 and GTX850M:



GTX470:

With lighting:
- 8.37 ms - Full FS
- 6.89 ms - FS out normal
- 5.51 ms - Full FS + discard
- 5.49 ms - FS out black
- 3.71 ms - FS black + discard

Without lighting:
- 6.42 ms - Full FS
- 3.93 ms - Full FS + discard
- 5.49 ms - FS out black
- 3.71 ms - FS black + discard




GTX850M:

With lighting:
- 11.12 ms - Full FS
- 9.61 ms - FS out normal
- 4.92 ms - Full FS + discard
- 9.46 ms - FS out black
- 4.05 ms - FS black + discard

Without lighting:
- 10.84 ms - Full FS
- 4.21 ms - Full FS + discard
- 9.46 ms - FS out black
- 4.05 ms - FS black + discard


What can we conclude from these results?

1. Shader optimization goes beyond shader boundaries in monolithic programs (those without separate shader objects).

2. GL_RASTERIZER_DISCARD works correctly.

3. gl_Position is not optimized out even if GL_RASTERIZER_DISCARD is in effect.

What I still don't understand is why rasterization is so expensive: 1.78ms just to fill an HD screen with black pixels on the GTX470, and even 5.41ms on the GTX850M - a 1:3 ratio.
The GTX470 has 40 ROPs while the GM107 has 16 (8 per memory controller/channel) - a 2.5:1 ratio. Of course, the clocks differ, but the number of ROP units seems to correspond to the speed of rasterization.
Anyway, it seems a little bit slow, doesn't it?

malexander
09-19-2014, 08:34 AM
The first point is dependent on the compiler. Certainly it seems that the Nvidia compiler does these sorts of optimizations. I guess there's nothing stopping it from doing those optimizations between consecutive stages of separable programs, if a varying between them isn't used. For point #3, gl_Position can't be optimized out, because doing so would require a re-link of the shader that depends on the GL state. And linking a program is very expensive, so no one wants that :)


What I still don't understand is why rasterization is so expensive: 1.78ms just to fill an HD screen with black pixels on the GTX470, and even 5.41ms on the GTX850M - a 1:3 ratio.

1.8ms seems about right from what I've seen for a desktop part. It still needs to run your trivial shader and do 2M framebuffer writes for 1080p (1920x1080 = 2,073,600 pixels), possibly more with z-writes and depth testing (framebuffer reads). There will also be some fixed per-frame swapbuffer/OS overhead when you hand off the frame. 5.4ms does seem high to me, but it fits with the number of ROPs and their likely slower mobile clock speed.

Aleksandar
09-20-2014, 06:35 AM
The first point is dependent on the compiler. Certainly it seems that the Nvidia compiler does these sorts of optimizations.
Of course, I was speaking about the NV implementation. ;)


For point #3, gl_Position can't be optimized out, because doing so would require a re-link of the shader that depends on the GL state.
I'm not sure I understand this. Could you explain it a little bit more?


There will also be some fixed per-frame swapbuffer/OS overhead when you hand off the frame.
No, the time I displayed is GPU time, without swapbuffer/OS overhead. After that point there is more drawing and a variety of other stuff to execute before swapping buffers.
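(For reference, GPU time of this kind is usually taken with a timer query - a minimal sketch, roughly:)

// Time a block of draw calls on the GPU (GL 3.3 / ARB_timer_query).
GLuint q;
glGenQueries(1, &q);
glBeginQuery(GL_TIME_ELAPSED, q);
// ... draw calls being measured ...
glEndQuery(GL_TIME_ELAPSED);
GLuint64 ns = 0;
glGetQueryObjectui64v(q, GL_QUERY_RESULT, &ns);   // waits until the GPU is done
printf("GPU time: %.2f ms\n", ns / 1.0e6);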

malexander
09-22-2014, 01:45 PM
I'm not sure I understand this. Could you explain it a little bit more?

Sure. When a shader is linked, its executable code is set in stone. This includes the code that produces gl_Position. If you change some shader-state parameters (e.g. transform feedback varyings, fragment data locations), you have to re-link the program. However, GL_RASTERIZER_DISCARD is part of the global GL state and not tied to a program, so it can't affect the shader's executable code.

Hypothetically, in order for it to do so, the compiler would have to put conditionals in the program on some built-in GL-state variable (glRasterizeDiscard, say). Then the shader could become slower overall due to the conditionals, and it would mean a lot more work for the compiler and shader setup. Extending this hypothetical discussion to other OpenGL global states that might see some improvement if they were considered by the shader, I think you can see how optimizing would quickly become a real rat's nest of conditionals in the shader code. So it's best for the compiler not to do it :)

In the situation where a shader is used with GL_RASTERIZER_DISCARD both off and on (perhaps for transform feedback), it would be much better for the user to create two different shader programs if it becomes a performance issue.
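For example (a sketch; the shader object names are placeholders):

// One program for the capture pass: VS only, linked with TF varyings.
GLuint tfProg = glCreateProgram();
glAttachShader(tfProg, vsSimple);          // computes only what gets captured
const char* varyings[] = { "outPosition" };
glTransformFeedbackVaryings(tfProg, 1, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(tfProg);                     // used only with GL_RASTERIZER_DISCARD on

// A second, full program for normal rendering.
GLuint renderProg = glCreateProgram();
glAttachShader(renderProg, vsFull);
glAttachShader(renderProg, fsFull);
glLinkProgram(renderProg);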

mbentrup
10-06-2014, 07:25 AM
NVidia (and, as far as I know, most other vendors) recompiles shaders automatically depending on GL settings. It even issues a performance warning to the debug log when it does this.

However, this happens only once: the driver keeps multiple compiled program variants per shader program id, and it even stores all those variants in the binary blob if you use glGetProgramBinary. Thus it is recommended to write the program binary at the end of your application, when all possible uses of the program have happened, and not directly after linking.
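For example (a sketch; file I/O and error handling omitted):

// At shutdown, after every variant has been baked in by actual use.
// (GL_PROGRAM_BINARY_RETRIEVABLE_HINT should be set to GL_TRUE before linking.)
GLint len = 0;
glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &len);
std::vector<char> blob(len);
GLenum fmt = 0;
glGetProgramBinary(prog, len, NULL, &fmt, blob.data());
// write fmt and blob to disk; next run, restore with glProgramBinary(prog, fmt, blob.data(), len)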