If statements in shaders, confused by results.



Patrikwa
06-04-2015, 05:33 AM
Hi!
I have some questions regarding performance that I'm quite confused about. Wherever I read, I get the impression that "if-statements" are a big no-no unless you really need them.
As I understand it, this is because the GPU simply evaluates both branches of an "if-else" and just discards the result of the one on the losing side of the condition.
For example if I do something like:



float someValue = value1 * value2;
if(someValue > someOthervalue)
... do some calculations "c1"...
else
... do some other calculations "c2"...


...both calculations "c1" and "c2" get evaluated, thus performance takes a hit. (At least that's what I thought.)

What I found was that if the calculation "c1" was a very expensive one and "c2" was a very cheap one, having the condition there to stop some fragments entering "c1" calculation actually had a positive effect on performance.
How is that possible? Have I completely misunderstood something?
My scene took about 9.2 ms per frame to render before I started optimizing, and after removing all branches from the code I ended up with 10 ms per frame, which is really not what I wanted :)

The code running (excluding implementation of functions) was the following. Removing these if-statements "should" not have a negative impact on the performance right?



void main()
{
    vec3 color = vec3(0,0,0);

    E = normalize(E_in);
    L = normalize(L_in);
    H = normalize(H_in);

    float spotFactor = dot(L, -lightDirection_fs);

    // Inside spot cone
    if(spotFactor > lightCutoff)
    {
        vec2 UVs = GetTextureCoords(E);

        vec3 N = FindNormal(E, UVs);
        float visibility = GetShadowValue();
        float lightAmount = GetAttenuationValue();

        vec3 diffuse = vec3(0,0,0);
        vec3 specular = vec3(0,0,0);

        if(lightAmount >= 0.01)
        {
            vec3 diffuse = CalculateDiffuseLight(N, UVs);
            if(visibility == 1.0)
                specular = CalculateSpecularLightBlinnPhong(N, UVs);
            color = (diffuse + specular) * visibility * lightAmount;
            color *= (1.0 - (1.0 - spotFactor)/(1.0 - lightCutoff));
        }
    }
    gl_FragColor = vec4(color, alpha);
}

GClements
06-04-2015, 07:34 AM
As I understand it, this is because the GPU simply evaluates both branches of an "if-else" and just discards the result of the one on the losing side of the condition.
Not necessarily.

Modern hardware can perform an actual branch if the condition has the same value for all elements in a SIMD block (what NVIDIA calls a "warp" and AMD a "wavefront").

Patrikwa
06-04-2015, 08:08 AM
Not necessarily.

Modern hardware can perform an actual branch if the condition has the same value for all elements in a SIMD block (what NVIDIA calls a "warp" and AMD a "wavefront").

Thanks for the information!
Could you elaborate on this statement?
If I, for example, use a uniform and compare it to a static number (I used a lot of "if(useTexture == 1)" and similar, which I tried removing), would that count as something modern hardware can actually branch on?
Are there any specific scenarios that cannot be warped?
What is "modern" hardware in this sense? I need to support "older" hardware, back to at least around 2007.
I'm sitting at a GTX970, so I suppose it can handle this "Warp" technology and that's why I get these unexpected results? (Well, unexpected for me at least.)

__bob__
06-04-2015, 08:11 AM
Why not remove the last if?
If your lightAmount is small, you don't need to compute illumination...

Or remove all the ifs... if your lightAmount is that small, the result will be nearly vec3(0)... depending on the cost of FindNormal(E, UVs), GetShadowValue() and GetAttenuationValue().

I use fragment shaders with a lot of "if"s in a "while" loop on a GTX 980, without any slowdown... maybe there is another problem...

Alfonse Reinheart
06-04-2015, 08:24 AM
Thanks for the information!
Could you elaborate on this statement?
If I, for example, use a uniform and compare it to a static number (I used a lot of "if(useTexture == 1)" and similar, which I tried removing), would that count as something modern hardware can actually branch on?

Yes, it would count as a branch. That doesn't mean it's bad; all of the different instances will take the same branch, so the cost is minimal.

Remember: the only problem with branching in a shader is if different instances executing on the same computational unit have to take different paths.


Are there any specific scenarios that cannot be warped?

Just look at it from the perspective of the hardware. The individual instances will need to be broken up if, and only if, the conditional expression can evaluate differently for different instances in the same rendering command and a particular computation actually results in neighboring instances taking different paths.

For example, all fragment shader instances generated from the same primitive get the same gl_PrimitiveID value (as well as the same `flat in` interpolated values). Conditions based on those will not be statically uniform, but they will be uniform within each "warp/wavefront". Therefore, neighboring instances will always take the same path, so conditions based on them will be reasonably fast.

Even if a condition is based on interpolated input parameters, that alone doesn't mean that your rendering will be slow. You only pay the price performance-wise for those specific instances where the runtime condition forces "warp/wavefronts" to actually be broken up. So if you were rendering a full-screen quad, and you're doing a condition based on being on the left half of the quad rather than the right, then only those "warp/wavefronts" in the middle will be slower.
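
As a rough illustration of the full-screen quad case, here is a small CPU-side Python model (not shader code). It assumes a hypothetical 1920x1080 target tiled into 8x4-fragment blocks standing in for warps/wavefronts; the real block shape and size are implementation-specific. It simply counts how many blocks contain fragments on both sides of a vertical split.

```python
def count_divergent_blocks(width, height, block_w, block_h, split_x):
    """Count blocks whose fragments disagree on the condition `x < split_x`.

    Returns (divergent_blocks, total_blocks). The block size is an assumed
    stand-in for a warp/wavefront, not a value queried from any GPU.
    """
    divergent = 0
    total = 0
    for by in range(0, height, block_h):
        for bx in range(0, width, block_w):
            total += 1
            x_end = min(bx + block_w, width)
            # The condition depends only on x, so a block diverges exactly
            # when the split falls strictly inside its x range.
            if bx < split_x < x_end:
                divergent += 1
    return divergent, total


# Split that cuts through one column of blocks: only that column diverges.
print(count_divergent_blocks(1920, 1080, 8, 4, 963))  # (270, 64800)
# Split that happens to line up with a block edge: nothing diverges.
print(count_divergent_blocks(1920, 1080, 8, 4, 960))  # (0, 64800)
```

Under these assumptions fewer than half a percent of the blocks ever see both sides of the condition, which is the sense in which only the blocks "in the middle" pay for the branch.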

Don't be afraid of conditions. Be aware of them, and use them judiciously, but don't assume that conditions are always (or even usually) terrible.


What is "modern" hardware in this sense? I need to support "older" hardware, back to at least around 2007.

I would say that, for the purposes of this discussion, modern would be anything DX10 or better. That's about 2008 or so. That was the point when unified shader architectures became the norm. Even so, older hardware had similar properties, at least with respect to static/uniform branching.


I'm sitting at a GTX970, so I suppose it can handle this "Warp" technology

It's not "technology"; it's "terminology". That's just what NVIDIA calls the group of individual instances that are operating on the same computational unit.

Patrikwa
06-04-2015, 08:38 AM
Why not remove the last if?
If your lightAmount is small, you don't need to compute illumination...

Or remove all the ifs... if your lightAmount is that small, the result will be nearly vec3(0)... depending on the cost of FindNormal(E, UVs), GetShadowValue() and GetAttenuationValue().

I use fragment shaders with a lot of "if"s in a "while" loop on a GTX 980, without any slowdown... maybe there is another problem...

Yeah, as I said, I'm trying to optimize the code by removing if-statements, but when I do, I get worse performance, which is the exact opposite of what I thought I knew about the graphics pipeline :)
I'm not sure what you mean otherwise. If I remove the last "if" then I would do all the lighting calculations despite the fragment being in complete darkness, which seems quite unnecessary.

Patrikwa
06-04-2015, 08:52 AM
Thanks a lot for the elaboration on the subject, Alfonse Reinheart, it really helps!
I thought for a minute there that I had done something wrong with my testing, since most of the conditions did not affect performance in either direction.

In the light of this new (for me) information, in the case of multiple shaders vs. a single shader with conditions, what would actually be preferred performance-wise? Let's say a basic scenario where I need to either calculate the TBN matrix in the vertex shader or not, depending on whether displacement mapping is active. I could of course test the different scenarios myself and compare the actual render time, but it would be interesting to hear someone's theories on the matter.
So basically: binding 2 different shaders, or using 1 shader and letting a uniform control whether some code is executed by using conditions. I've seen this debated before, but without any final result.

GClements
06-04-2015, 09:25 AM
If I, for example, use a uniform and compare it to a static number (I used a lot of "if(useTexture == 1)" and similar, which I tried removing), would that count as something modern hardware can actually branch on?
That's something that even old hardware can handle: the driver can compile different versions of the shader for useTexture==1 and useTexture!=1, and select the appropriate version for each draw call.

Expressions fall into three basic types:
1. Statically-uniform, where the expression's value is constant for an entire draw call.
2. Dynamically-uniform, where the expression's value is constant for all "threads" within a work group (warp, wavefront). In the fragment shader, this includes variables with "flat" interpolation, gl_PrimitiveID, gl_Layer, etc.
3. Non-uniform, where the value is different for different vertices or fragments within the same work group.

All hardware can optimise branches (i.e. only evaluate the branch actually taken) for case 1. More modern hardware can also do so for case 2. Case 3 requires that both branches are executed, with results being discarded within the branch not taken.
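
To make the consequence concrete, here is a deliberately simplified Python cost model (a sketch, not a description of any particular GPU). It assumes a work group runs in lock-step and must execute every branch that at least one of its fragments takes; the instruction counts are made up for illustration.

```python
def group_cost(conditions, cost_if, cost_else):
    """Cost of one work group under a simple lock-step execution model.

    `conditions` holds one boolean per fragment in the group. Every branch
    that at least one fragment takes is executed by the whole group.
    """
    cost = 0
    if any(conditions):
        cost += cost_if
    if not all(conditions):
        cost += cost_else
    return cost


EXPENSIVE, CHEAP = 100, 1                 # hypothetical instruction counts

all_lit = [True] * 32                     # whole group takes the "if"
all_dark = [False] * 32                   # whole group takes the "else"
mixed = [True] * 5 + [False] * 27         # group straddles the boundary

print(group_cost(all_lit, EXPENSIVE, CHEAP))   # 100
print(group_cost(all_dark, EXPENSIVE, CHEAP))  # 1
print(group_cost(mixed, EXPENSIVE, CHEAP))     # 101
```

In this model, removing the condition makes every group pay the expensive cost, so a scene where many groups are entirely "dark" gets slower without the branch, which is consistent with the timings reported at the start of the thread.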

To get better results in all cases, ensure that any common calculations are lifted out of the conditional, so you don't end up performing essentially the same computation twice in the case both branches are executed. If you have simple and complex cases, lifting common subexpressions may result in the branch for the simple case being empty, in which case the issue of "executing both branches" doesn't arise.

Aside from performance, certain operations are undefined within non-uniform control flow, specifically derivatives (the dFdx(), dFdy() and fwidth() functions), as well as sampling mipmapped textures (as that implicitly uses derivatives). So if you're accessing textures with those functions, the call needs to be outside of any conditional statement even if only one branch actually needs the texture data.

Patrikwa
06-05-2015, 01:16 AM
Great stuff GClements!
This is one helpful place for sure :)


...where the expression's value is constant for all "threads" within a work group (warp, wavefront)

What defines a work group here? Is each triangle a work group? Or each array of vertices in a VAO?


...as well as sampling mipmapped textures (as that implicitly uses derivatives)

I'm generating mipmaps for my textures and access them through standard texture2D in the shader. I assume this means it's using derivatives to sample then?



2. Dynamically-uniform, where the expression's value is constant for all "threads" within a work group (warp, wavefront). In the fragment shader, this includes variables with "flat" interpolation, gl_PrimitiveID, gl_Layer, etc.
3. Non-uniform, where the value is different for different vertices or fragments within the same work group.

In the case of a fragment shader, would this condition fall into category 1, 2 or 3? The value "distanceToPoint" here obviously changes between each vertex, however the "lightRange" is a static uniform that changes between draws (since I'm running forward rendering, this is executed for every light per fragment).



float distanceToPoint = length(lightPosition - vertexWorldSpace);
if(distanceToPoint <= lightRange)
{
    float sqDist = pow(lightPosition.x - vertexWorldSpace.x, 2.0) + pow(lightPosition.y - vertexWorldSpace.y, 2.0) + pow(lightPosition.z - vertexWorldSpace.z, 2.0);
    float cAttenuation = lightAttenuation.x + (lightAttenuation.y * 0.001 * sqrt(sqDist)) + (lightAttenuation.z * 0.00001 * sqDist);
    return min(5.0, 1.0 / cAttenuation);
}
else
    return 0.0;

GClements
06-05-2015, 04:34 AM
What defines a work group here? Is each triangle a work group? Or each array of vertices in a VAO?
In the fragment shader, it's typically a rectangular "block" of fragments, with the size determined by the implementation (32 or 64 is typical). All of the fragments will belong to the same primitive (i.e. gl_PrimitiveID will be the same for all fragments in the work group, as will any "flat"-qualified inputs).

In the vertex shader, it's some number of vertices, again with the number determined by the implementation. All vertices will correspond to the same draw call.



I'm generating mipmaps for my textures and access them through standard texture2D in the shader. I assume this means it's using derivatives to sample then?

Yes. The texture functions with "Lod" or "Grad" in the name take an explicit level-of-detail or explicit derivatives from which the level-of-detail is calculated. The other functions are equivalent to calling the corresponding "Grad" function with the derivatives obtained using dFdx() and dFdy(), so these are undefined within non-uniform control flow (dFdx() and dFdy() calculate the difference between the value for the current fragment and the value for a horizontally- or vertically-adjacent fragment; within non-uniform control flow, the value for adjacent fragments may be garbage if those fragments take a different branch).
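
To show why a garbage neighbour value matters, here is a small Python sketch of the level-of-detail calculation. It follows the idealised formula (log2 of the largest screen-space footprint in texels); real implementations are allowed to approximate it, and the function name and the sample numbers are illustrative assumptions.

```python
import math


def mip_level(du_dx, dv_dx, du_dy, dv_dy, tex_width, tex_height):
    """Approximate mip level from screen-space UV derivatives.

    The derivatives are what dFdx()/dFdy() would return for the texture
    coordinate, i.e. differences between adjacent fragments.
    """
    # Scale the UV derivatives to texel units.
    px = math.hypot(du_dx * tex_width, dv_dx * tex_height)
    py = math.hypot(du_dy * tex_width, dv_dy * tex_height)
    rho = max(px, py)
    return max(0.0, math.log2(rho)) if rho > 0.0 else 0.0


# Neighbouring fragments 1/64 apart in UV on a 256x256 texture:
# 4 texels per pixel, so mip level 2.
print(mip_level(1 / 64, 0.0, 0.0, 1 / 64, 256, 256))  # 2.0

# If the horizontal neighbour skipped the branch and holds an unrelated
# value, the difference (here 0.9) selects a far smaller mip level.
print(mip_level(0.9, 0.0, 0.0, 1 / 64, 256, 256) > 7.0)  # True
```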


In the case of a fragment shader, would this condition fall into category 1, 2 or 3? The value "distanceToPoint" here obviously changes between each vertex, however the "lightRange" is a static uniform that changes between draws (since I'm running forward rendering, this is executed for every light per fragment).



float distanceToPoint = length(lightPosition - vertexWorldSpace);
if(distanceToPoint <= lightRange)
{


This falls between cases 2 and 3. Formally, it's non-uniform control flow (case 3), as different fragments within the same work group can have different values for the comparison. Consequently, the result of texture() would be undefined within the branches.

However, the values will typically be highly correlated, i.e. adjacent fragments will often have similar values for distanceToPoint. In many cases, all fragments within a work group will have the same value for the result of the comparison, and thus a modern GPU will only execute one branch. It's only in the case where the block of fragments forming the work group lies on the boundary that the GPU will need to execute both branches (which doesn't really matter anyhow, as the "else" branch is trivial).

Patrikwa
06-05-2015, 05:02 AM
This all sounds really great, GPUs are clearly a lot more clever than the internet in general had me believe :)

So based on all of this, where does the mantra "do not use conditions in shaders!" come from? To me it seems very harmless unless we do some really crazy stuff.

So to sum up and make sure I got this straight:
* Comparing uniforms to static values is completely harmless and the GPU will compile a set of different shaders based on the possible branches.
* Comparing values calculated in the fragment shader where the result is the same for almost all/all fragments in a work group is basically harmless.
* Having trivial code in "else" is basically harmless (as in previous example with just a return 0.0)

Huge amounts of thanks for all the help you guys have given me here, really appreciate it! :)

GClements
06-05-2015, 05:58 AM
So based on all of this, where does the mantra "do not use conditions in shaders!" come from?
On older hardware, any condition that isn't statically uniform (i.e. based solely on uniform variables and constants) will result in both branches being executed. For conditions which are statically uniform, the implementation must compile multiple versions of the shader and select the correct one for each draw call.

On newer hardware, the GPU can only branch if the condition has the same value for all items in a work group. For conditions which depend upon non-uniform expressions, the probability of this occurring depends upon many factors, including how the implementation organises work groups. Many of the factors will depend upon the actual data which is being processed rather than being an intrinsic property of the shader.

So the mantra is partly out of date, and partly an exaggeration (i.e. you need to be aware that conditional expressions are more problematic for a GPU than for conventional CPU code).


So to sum up and make sure I got this straight:
* Comparing uniforms to static values is completely harmless and the GPU will compile a set of different shaders based on the possible branches.

It may do so. For newer hardware, it needn't bother, as it will always be able to branch in such cases. For older hardware, if you have N such conditions, it could require 2^N different versions of the shader. For large N, that may not be practical.
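
The growth is easy to see by enumerating the combinations. The Python sketch below is only an illustration of the counting argument; the uniform names other than useTexture are hypothetical.

```python
from itertools import product


def shader_variants(flags):
    """List every on/off combination of the given boolean uniform names."""
    return [dict(zip(flags, values))
            for values in product((False, True), repeat=len(flags))]


# Three statically-uniform conditions already need 2**3 = 8 specialised shaders.
variants = shader_variants(["useTexture", "useNormalMap", "useShadows"])
print(len(variants))  # 8

# Ten such conditions would need 2**10 versions.
print(2 ** 10)        # 1024
```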


* Comparing values calculated in the fragment shader where the result is the same for almost all/all fragments in a work group is basically harmless.
Newer hardware will use branches where it can. For conditions which aren't statically uniform, common subexpressions should still be moved out of the conditional for the cases where both branches end up being executed (the compiler may be able to do this itself, but it's best not to rely upon it).

Patrikwa
06-05-2015, 06:28 AM
It may do so. For newer hardware, it needn't bother, as it will always be able to branch in such cases. For older hardware, if you have N such conditions, it could require 2^N different versions of the shader. For large N, that may not be practical.

But what would the alternative be in the case of, for example, "fetch the normals from a normal map or not"? Either I use 2 different shaders or I use conditionals (which in the worst case would generate 1 additional shader on its own), or is there a third option in these cases?
As of now my N is in the region of <10 in total over all my shaders, so as it stands my options are to keep it that way or start using a lot more shaders, which would introduce the expense of binding/unbinding more shaders. Who knows, maybe that wins performance-wise in the end, but that's an entirely different discussion :)

GClements
06-05-2015, 07:00 AM
But what would the alternative be in the case of, for example, "fetch the normals from a normal map or not"? Either I use 2 different shaders or I use conditionals (which in the worst case would generate 1 additional shader on its own), or is there a third option in these cases?
The other option is that the implementation doesn't optimise it and always executes both branches.

You can always resort to explicitly generating multiple shader versions yourself (e.g. using #if/#else/#endif). But I wouldn't consider doing so unless you find cases where that actually makes a difference in practice.

You can sometimes eliminate branches altogether by the use of "unit" values. E.g. rather than using or not using a normal map, you can write a shader which always uses a normal map and just use a 1x1 texture containing the vector (0,0,1) for the case where you don't need a normal map. Even if this results in redundant calculations, if it allows you to coalesce more data into a single draw call, it may be faster overall.
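
As a sketch of what such a 1x1 texel would hold, the Python below packs the vector (0,0,1) into 8-bit RGB. It assumes the common convention of storing n * 0.5 + 0.5 and decoding with texel * 2.0 - 1.0 in the shader; if your normal maps use a different encoding, the bytes will differ.

```python
def encode_normal(n):
    """Pack a unit vector with components in [-1, 1] into 8-bit RGB."""
    return tuple(int((c * 0.5 + 0.5) * 255.0 + 0.5) for c in n)


def decode_normal(rgb):
    """Inverse mapping, as a shader would do with `texel * 2.0 - 1.0`."""
    return tuple(c / 255.0 * 2.0 - 1.0 for c in rgb)


flat = encode_normal((0.0, 0.0, 1.0))
print(flat)                 # (128, 128, 255)
print(decode_normal(flat))  # x and y come out near 0.004, not exactly 0
```

Because 0.5 is not representable in 8 bits, the decoded x and y are slightly off zero; normalizing the decoded vector in the shader keeps the result unit length.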

Patrikwa
06-05-2015, 02:32 PM
The other option is that the implementation doesn't optimise it and always executes both branches.

You can always resort to explicitly generating multiple shader versions yourself (e.g. using #if/#else/#endif). But I wouldn't consider doing so unless you find cases where that actually makes a difference in practice.

You can sometimes eliminate branches altogether by the use of "unit" values. E.g. rather than using or not using a normal map, you can write a shader which always uses a normal map and just use a 1x1 texture containing the vector (0,0,1) for the case where you don't need a normal map. Even if this results in redundant calculations, if it allows you to coalesce more data into a single draw call, it may be faster overall.

Yeah I actually considered creating a vec3(1,1,1) 1x1 texture for color and just always multiply this in the calculations and do something similar for normal maps, specular maps, displacement maps, cube maps, etc. Not sure how that would affect performance but I guess testing is the only way forward.

On an entirely different subject, now that I seem to have some real OpenGL pros on "the line", is there somewhere I can read up on the differences between GPUs and how they handle shader code?
I have found some really strange behaviors on cheaper GPUs where the following code...



if(staticUniform == 1)
doSomething...
else
doSomethingElse...


...did not work when I set the "staticUniform" to 1, however if I wrote "staticUniform >= 0.5" it evaluated correctly.
Also, when I used a uniform in both the vertex shader AND the fragment shader, only one of them worked and I had to add affixes like "someUniform_vs" so they were different.
I'm writing my code for PC and Mac, and on the Mac I get all sorts of crazy stuff going on.
Is there like an "OpenGL syntax" wiki or something somewhere? I get the feeling that some manufacturers allow stuff that according to the OpenGL specs is not allowed, and thus I write code for my GPU where everything just seems to work but on other GPUs I run into issues.
A small example is the limit on "varying" variables, where the minimum according to the OpenGL standard is 8, and thus I need to keep my usage down to 8 to prevent running into issues on low-end machines. (This I obviously already found, but I haven't found a wiki page listing all these things.)

GClements
06-05-2015, 03:48 PM
I have found some really strange behaviors on cheaper GPUs where the following code...



if(staticUniform == 1)
doSomething...
else
doSomethingElse...


...did not work when I set the "staticUniform" to 1, however if I wrote "staticUniform >= 0.5" it evaluated correctly.

If staticUniform is a float, all bets are off. Comparing floats with == is always a risky proposition; the slightest rounding error at any point in the process and it fails.
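
The effect is easy to reproduce on the CPU. The Python sketch below uses double precision rather than GLSL floats, so the exact values differ, but the principle is the same; is_enabled is a hypothetical helper mirroring the "staticUniform >= 0.5" workaround.

```python
def is_enabled(value, threshold=0.5):
    """Treat a float used as an on/off flag as 'on' above a threshold."""
    return value >= threshold


flag = sum([0.1] * 10)          # mathematically 1.0

print(flag == 1.0)              # False: accumulated rounding error
print(is_enabled(flag))         # True: the threshold tolerates the error
print(abs(flag - 1.0) < 1e-6)   # True: explicit tolerance also works
```

Another option, if the value is only ever a flag, is to declare the uniform as an int or bool in the shader so that no floating-point comparison is involved.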



Is there like an "OpenGL syntax" wiki or something somewhere? I get the feeling that some manufacturers allow stuff that according to the OpenGL specs is not allowed, and thus I write code for my GPU where everything just seems to work but on other GPUs I run into issues.

The definitive answers are in the specifications (https://www.opengl.org/registry/).


A small example is the limit on "varying" variables, where the minimum according to the OpenGL standard is 8, and thus I need to keep my usage down to 8 to prevent running into issues on low-end machines. (This I obviously already found, but I haven't found a wiki page listing all these things.)
Implementations are always free to provide higher limits, and the standard-mandated limits generally increase over time (e.g. GL_MAX_TEXTURE_SIZE wasn't required to be any more than 64 until OpenGL 3.0, but I've never used an implementation with such a low limit).