subroutines and multiple render targets

I am wondering how to use a shader with subroutines and multiple render targets. I have a fragment shader which contains a few subroutines (I have posted only the relevant pieces of the code).


#version 420

in vec3 data1;
in vec3 data2;

layout (location = 0) out vec4 fragColor;
layout (location = 1) out vec3 target1;
layout (location = 2) out vec3 target2;


subroutine void RenderPassType();
subroutine uniform RenderPassType RenderPass;

subroutine (RenderPassType)
void first()
{
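	// depth-only pass: intentionally writes no color outputs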
    
}

subroutine (RenderPassType)
void second()
{
	target1 = data1;
	target2 = data2;
}

subroutine (RenderPassType)
void third()
{
	//some computations
	fragColor = result;
}

I created two FBOs which contain the render targets (textures).

First:


glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, depthTex, 0);

Second:


glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex1, 0);
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT1, GL_TEXTURE_2D, tex2, 0);
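The complete setup is roughly like this (fbo1 and fbo2 are just placeholder names here; depthTex, tex1 and tex2 are the textures mentioned above, and error handling is omitted):

// first, depth-only FBO
GLuint fbo1;
glGenFramebuffers(1, &fbo1);
glBindFramebuffer(GL_FRAMEBUFFER, fbo1);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, depthTex, 0);
glDrawBuffer(GL_NONE);   // no color attachments on this FBO
glReadBuffer(GL_NONE);

// second, MRT FBO
GLuint fbo2;
glGenFramebuffers(1, &fbo2);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo2);
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex1, 0);
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT1, GL_TEXTURE_2D, tex2, 0);

// both should report GL_FRAMEBUFFER_COMPLETE
if (glCheckFramebufferStatus(GL_DRAW_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
    ; // handle the error

glBindFramebuffer(GL_FRAMEBUFFER, 0);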

The data for glDrawBuffers:


GLenum drawBuffers1[] = {GL_NONE, GL_NONE, GL_NONE};
GLenum drawBuffers2[] = {GL_NONE, GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1};

Rendering:


//bind the first fbo
//bind the first subroutine
glDrawBuffers(3, drawBuffers1);
...
//unbind the first fbo


//bind the second fbo 
//bind the second subroutine
glDrawBuffers(3, drawBuffers2);
...
//unbind the second fbo

...
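Concretely, the subroutine selection here means something like the following (program is a placeholder for the linked program object, fbo1 and fbo2 for the two FBOs). Note that the subroutine uniform state is lost whenever glUseProgram is called, so glUniformSubroutinesuiv has to be issued again after every program change:

GLuint firstIndex  = glGetSubroutineIndex(program, GL_FRAGMENT_SHADER, "first");
GLuint secondIndex = glGetSubroutineIndex(program, GL_FRAGMENT_SHADER, "second");

glUseProgram(program);

// pass 1: depth-only FBO, first subroutine
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo1);
glDrawBuffers(3, drawBuffers1);
// the count must equal the number of active subroutine uniform locations (1 here)
glUniformSubroutinesuiv(GL_FRAGMENT_SHADER, 1, &firstIndex);
// ... draw calls ...

// pass 2: MRT FBO, second subroutine
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo2);
glDrawBuffers(3, drawBuffers2);
glUniformSubroutinesuiv(GL_FRAGMENT_SHADER, 1, &secondIndex);
// ... draw calls ...

glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);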

The above code works, but I wonder whether it is the only right way to use subroutines together
with multiple render targets. Is there something I can do better (more efficiently)?

Do I have to use (location = 0) for the default framebuffer output?

When I first bind the second FBO and subroutine, and then the first FBO and subroutine, glDrawBuffers
clears all the textures. What can I do about that?

If you have multiple color outputs then you have to write a value to all of them; otherwise, the ones that you don’t write any value to become undefined.
Think about it this way: if you don’t output any color to a particular color output, a write to that color output will still happen, just with an implementation-dependent value.
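For example, the first subroutine from the original post could simply write dummy values to every declared output to keep them all defined:

subroutine (RenderPassType)
void first()
{
	fragColor = vec4(0.0);
	target1   = vec3(0.0);
	target2   = vec3(0.0);
}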

Ok, but I want to use my fragment shader with subroutines for deferred rendering. The render loop then looks like this:

  1. G-buffer stage
  2. For each light (additive blending):
     - shadow map stage
     - shading stage

The first subroutine destroys the render targets (GL_NONE). How can I solve that?

First, why would you want to use the same shader for all of these? I think it’s fine if you switch shaders three times in a frame. You are being overzealous about batching things together.
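In other words, something along these lines (the program names are made up):

glUseProgram(gbufferProg);        // fills the G-buffer MRTs
// ... render the scene geometry ...

// for each light:
glUseProgram(shadowProg);         // depth-only shadow map pass
// ... render the shadow casters into the shadow map ...

glUseProgram(shadingProg);        // shading pass with additive blending
// ... render the light volume / fullscreen quad ...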

In fact, for the shadow map rendering stage you don’t even need a fragment shader, so if you have one, it will actually cost you quite a bit of performance.

In fact, for the shadow map rendering stage you don’t even need a fragment shader, so if you have one, it will actually cost you quite a bit of performance.

Not unless you’re using a compatibility profile.

The fragment depth is part of the results of fragment shader execution. And thus it is undefined. And having an empty fragment shader doesn’t cost you any performance.

[QUOTE=Alfonse Reinheart;1247091]If the current fragment stage program object has no fragment shader, or no fragment program object is current for the fragment stage, the results of fragment shader execution are undefined.[/QUOTE]Exactly, fragment shader execution is undefined, but per-fragment operations are not, thus depth testing and depth writing ARE defined.

You don’t need a fragment shader for depth-only rendering, not even in core profile.

The fragment depth output by a fragment shader, of course, makes no sense without a fragment shader, but you don’t need a fragment shader to output a depth; you have a fixed-function one.

Also, having an empty fragment shader DOES cost you performance. Depth testing and depth writing are done by fixed-function hardware which can have a throughput of many pixels per clock, especially with hierarchical Z, not to mention that no shader cores need to be used. Whereas if you have a fragment shader, you’ll have to launch shaders on your shader engines, and even if they do nothing, it will still cost you several clocks per tile (unless the driver is smart enough to simply ignore your empty fragment shader, in which case you’ll get the same results).

The fragment depth output by a fragment shader, of course, makes no sense without a fragment shader, but you don’t need a fragment shader to output a depth; you have a fixed-function one.

If you’re right, please point to the part of the OpenGL 4.3 specification that states that the input to the depth comparison does not have to come from the fragment shader. Because it clearly says:

Those “fragments resulting from fragment shader execution” contain undefined data, as previously stated. If you’re right, you should be able to show me where the resulting fragments will get defined data from.

Also, having an empty fragment shader DOES cost you performance. Depth testing and depth writing are done by fixed-function hardware which can have a throughput of many pixels per clock, especially with hierarchical Z, not to mention that no shader cores need to be used. Whereas if you have a fragment shader, you’ll have to launch shaders on your shader engines, and even if they do nothing, it will still cost you several clocks per tile (unless the driver is smart enough to simply ignore your empty fragment shader, in which case you’ll get the same results).

You’re making some pretty big assumptions here. Like the assumption that fragment processing can be skipped by the hardware at all. That it can copy data directly from the rasterizer to the ROPs without some kind of per-fragment shader happening to intervene.

I’d like to see proof that this is true. Preferably in the form of performance tests on the difference between an empty fragment shader and not having one. On multiple different kinds of hardware.

The description of per-fragment operations starts as follows:

Thus you can see that fragments are produced by the rasterization, not by fragment shaders.

Also, what chapter 15 actually says is:

Thus, although the results of fragment shader execution are undefined, most of the data required for per-fragment operations is not affected by the fragment shader, namely the pixel ownership and scissor tests, multisample operations, the depth and stencil tests (unless depth or stencil export is used, which is obviously not the case if there is no fragment shader) and occlusion queries.

Once again, the results of fragment shader execution are undefined, not the fragments. By default the results of fragment shader execution are the color outputs, unless another explicit mechanism is used, like depth or stencil export.

ROPs deal with blending, sRGB conversion and logic op. Obviously those will get undefined data, thus you cannot expect anything good to be in your color buffers after all. But depth/stencil is not handled by the same piece of hardware. Neither are e.g. scissor and pixel ownership tests.

The fact that a lack of a fragment shader doesn’t result in an error, but merely makes the results of its execution undefined, is already a good enough reason, I believe. Don’t you think this wasn’t an oversight by the ARB, but that they defined it this way intentionally?

Not to mention that in most cases the depth and stencil tests happen before the fragment shader is executed (if there is one); these are the so-called “early tests”, now even explicitly mentioned in the spec, and if the early tests fail then no fragment shader is executed, even if there is one. So if you think hardware cannot avoid the execution of fragment shaders, then why do you think the ARB bothered to write about it in the spec, and why did they introduce a mechanism to force/disable early depth testing in the fragment shader?
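For reference, the “force early tests” mechanism mentioned here is a single layout qualifier in the fragment shader (ARB_shader_image_load_store, core since GL 4.2):

// forces the depth/stencil tests to run before this fragment shader executes
layout (early_fragment_tests) in;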

So how does discard work?

Fragments are produced by the rasterizer, and modified by the fragment shader. Just like vertices are produced by Vertex Specification and modified by the vertex shader. Later stages work based on the vertices output by the vertex shader, just as later stages work based on fragments output by the fragment shader.

[QUOTE=aqnuep;1247099]
Thus, although the results of fragment shader execution are undefined, most of the data required for per-fragment operations is not affected by the fragment shader, namely the pixel ownership and scissor tests, multisample operations, the depth and stencil tests (unless depth or stencil export is used, which is obviously not the case if there is no fragment shader) and occlusion queries.[/quote]

The fragment shader does not output the X or Y position. It does output the depth value, and therefore it will output an undefined value. You seem to have missed this important part of what you quoted:

“The processed fragments resulting from fragment shader execution” have undefined data. The depth output from the fragment shader is part of that fragment. And it has an undefined value.

The fact that a lack of a fragment shader doesn’t result in an error, but merely makes the results of its execution undefined, is already a good enough reason, I believe. Don’t you think this wasn’t an oversight by the ARB, but that they defined it this way intentionally?

How does that prove anything? Lack of a vertex shader also doesn’t produce an error, but there’s almost nothing useful you can do with that.

these are the so-called “early tests”, now even explicitly mentioned in the spec

Ahem:

They are explicitly mentioned solely for the Image Load/Store feature of being able to force early tests so that you can get more guaranteed behavior. And you need a fragment shader to activate it.

So if you think hardware cannot avoid the execution of fragment shaders, then why do you think the ARB bothered to write about it in the spec, and why did they introduce a mechanism to force/disable early depth testing in the fragment shader?

You seem to be misunderstanding the difference between “discarding the fragment before the fragment shader” and “processing the fragment without a fragment shader and getting defined results”. The latter is what you’re alleging that OpenGL allows; the former is what OpenGL actually allows.

Also, you haven’t provided any evidence that not providing a fragment shader is faster in any way than providing an empty one. Which is what you claimed and what I asked you to provide.

Well, guess what: if discard is used by the shader, then it is very likely that the early depth/stencil tests are disabled automatically, because otherwise you might not get correct results. Or actually, the early tests can still happen, but the depth/stencil writes cannot, as the fragment might get discarded.

This is the important part: they are only modified.

It just optionally outputs a depth. Just because these are all transparent from the user’s point of view, it doesn’t mean it doesn’t matter. If you output depth in your fragment shader, once again, those early depth/stencil tests will be disabled, unless you force early tests using the functionality introduced by ARB_shader_image_load_store, or you use ARB_conservative_depth properly.
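The ARB_conservative_depth mechanism mentioned above is likewise just a redeclaration of gl_FragDepth with a depth layout qualifier, for example:

// promise that the shader only ever increases the depth value,
// so the implementation can keep a conservative early depth test
layout (depth_greater) out float gl_FragDepth;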

No, I didn’t miss it. Once again, fragment shader only modifies some data, namely it outputs color values and optionally modifies depth.

The lack of a vertex shader DOES produce an INVALID_OPERATION error at draw time in the core profile.

Come on, if you don’t have to run fragment shaders on the shader cores, more vertex shaders can be in flight at once. How would that not be faster? Think about it.

You get defined results for depth and stencil, even without a fragment shader. The spec is unfortunately pretty vague on this, but you can try it out anytime if you don’t believe me. Just create a core profile context, set up a vertex shader-only program, set the draw buffers to none, attach a depth texture to your framebuffer and let it go. I bet you it will work.
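A minimal sketch of that experiment might look like this (the shader source and the names are made up for illustration; fbo1 is a depth-only FBO as in the first post):

const char *vsSource =
    "#version 420\n"
    "layout (location = 0) in vec3 position;\n"
    "uniform mat4 mvp;\n"
    "void main() { gl_Position = mvp * vec4(position, 1.0); }\n";

GLuint vs = glCreateShader(GL_VERTEX_SHADER);
glShaderSource(vs, 1, &vsSource, NULL);
glCompileShader(vs);

GLuint prog = glCreateProgram();
glAttachShader(prog, vs);                      // note: no fragment shader attached
glLinkProgram(prog);

glUseProgram(prog);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo1);  // depth-only FBO
glDrawBuffer(GL_NONE);
glEnable(GL_DEPTH_TEST);
glClear(GL_DEPTH_BUFFER_BIT);
// ... draw the shadow casters; only the depth buffer is written ...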

Also, if you don’t believe in driver behavior, you can always ask the vendors for their opinion; they are the ARB, they can tell you for sure. I’m not willing to continue arguing about facts.

The lack of a vertex shader DOES produce an INVALID_OPERATION error at draw time in the core profile.

Does it?

It doesn’t seem that way. Or if it does, the GL_INVALID_OPERATION must come from something else.

Come on, if you don’t have to run fragment shaders on the shader cores, more vertex shaders can be in flight at once. How would that not be faster? Think about it.

You’re effectively alleging that shader compilers and implementations that are smart enough to inline functions, remove dead code, and various other optimizations are too stupid to optimize an empty fragment shader into the same thing as not having one. So as I said before, where is your evidence that any real systems will exhibit this performance difference?

In short, show me the driver/compiler which cannot optimize away an empty fragment shader.

Just create a core profile context, set up a vertex shader-only program, set the draw buffers to none, attach a depth texture to your framebuffer and let it go. I bet you it will work.

Yes, and so will this fragment shader:


#version 330

out vec4 someColor;

void main()
{
  someColor = vec4(1.0f, 0.0f, 0.0f, 1.0f);
}

If you use this fragment shader without a call to glBindFragDataLocation, it will in virtually all cases assign someColor to output number 0. But the OpenGL standard does not require this behavior.
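To make it well-defined, you would either bind the output explicitly before linking, or give it an explicit location in the shader (prog is a placeholder program handle):

// option 1: bind the output name to color number 0 before linking
glBindFragDataLocation(prog, 0, "someColor");
glLinkProgram(prog);

// option 2 (GLSL 3.30+): declare the location in the shader instead
// layout (location = 0) out vec4 someColor;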

My point is that the standard itself doesn’t provide coverage for the behavior that you suggest it does, not whether or not it will “work” on some or even all implementations. You can rely on undefined behavior if you want, but you shouldn’t suggest that others do so.

Okay, I admit I was wrong about this; in fact, not having a vertex shader only makes the results of vertex processing undefined but doesn’t generate an error. At least one of us can admit when he’s wrong…

The spec clearly states in chapter 14

Furthermore, section 15.2 clearly states:

Now, I think there may be room for interpretation here, although I have to agree that such an important fact should be made clear as day. So here’s mine: depth values are initially defined during rasterization. If a fragment shader is active and writes to gl_FragDepth, then the value is modified and set to the specified new value. Furthermore, it has to do so statically, i.e. not base the write to gl_FragDepth on any dynamic branching. Otherwise, the initial value, produced during rasterization and not during fragment shading, is used. Not having a fragment shader, i.e. fragment shader execution leading to undefined results, automatically results in gl_FragDepth not being set and thus the rasterized depth value not being modified. The depth value is defined - after rasterization.

In conclusion, I agree with aqnuep. I have yet to see a case where generating a correct depth map without a fragment shader has not worked cross-platform and cross-vendor, and I have working examples of such renderings on AMD, NVIDIA and Intel hardware.

Furthermore, why on earth would the ARB mandate that a fragment shader be present to actually define a valid depth value? What about depth pre-passes? Nobody gives a damn about fragment shader outputs there. What about occlusion queries? Fragment shading? Why? I think this would be an oversight no one in the ARB could justify.

Thanks, thokra, for digging that up for me. Apparently, I was being lazy when trying to find the relevant spec statements.

Another thing to note, which is actually related to the original question:

While the driver may be able to optimize out an empty fragment shader, it cannot optimize out a shader that calls subroutines in a dynamically uniform fashion, where only one of those subroutines happens not to emit any colors.

In this particular case the driver cannot optimize out the fragment shader, as the decision is made at run time (even if the decision is dynamically uniform), so you’ll have to pay the cost of fragment shader execution for sure.