How deadly is branching?

BenFoppa · September 2, 2014, 7:55pm

I’ve been led to believe that branching in a shader is just about the worst thing you can do. So does that mean I should opt for this:


#version 330 core

void main() {
  int x = 1 - (gl_VertexID & 2);
  int y = 1 - ((gl_VertexID - 1) & 2);
 
  gl_Position = vec4(float(x), float(y), 0, 1);
}

instead of:


#version 330 core
 
void main() {
  if(gl_VertexID == 0) {
    gl_Position = vec4(1, -1, 0, 1);
  } else if(gl_VertexID == 1) {
    gl_Position = vec4(1, 1, 0, 1);
  } else if(gl_VertexID == 2) {
    gl_Position = vec4(-1, 1, 0, 1);
  } else {
    gl_Position = vec4(-1, -1, 0, 1);
  }
}

Thanks!

Disclaimer: I didn’t actually try to run these shaders. Might not be correct.

carsten_neumann · September 3, 2014, 1:11am

This shader is going to execute a whopping 4 times on a processor that (typically) can have hundreds of threads of execution in flight. Even if you render lots of passes you are long going to be fill rate limited by all the fragment shader work before these vertex shaders become even noticeable.
In general: when optimizing measure where your bottleneck is/where you are spending significant amounts of time, then figure out how to improve that. Everything else will just get you tangled up in minutiae that end up having no visible effect after all.

BenFoppa · September 3, 2014, 1:23am

[QUOTE=carsten neumann;1261399]This shader is going to execute a whopping 4 times on a processor that (typically) can have hundreds of threads of execution in flight. Even if you render lots of passes you are long going to be fill rate limited by all the fragment shader work before these vertex shaders become even noticeable.
In general: when optimizing measure where your bottleneck is/where you are spending significant amounts of time, then figure out how to improve that. Everything else will just get you tangled up in minutiae that end up having no visible effect after all.[/QUOTE]

Gooooood points. Thanks!

Yandersen · September 3, 2014, 3:19am

Heh, I know what you are doing!

Here is the better solution:


const vec4 Corners[4] = {
vec4(-1.0, -1.0, 0.0, 1.0),
vec4( 1.0, -1.0, 0.0, 1.0),
vec4( 1.0,  1.0, 0.0, 1.0),
vec4(-1.0,  1.0, 0.0, 1.0)
};

void main() {
gl_Position = Corners[gl_VertexID];
}

Firadeoclus · September 3, 2014, 4:10am

That advice probably comes from a time when GPUs had poor branching support, but that’s no longer the case.

Branching has a cost. You obviously need to evaluate the branch condition. If the branch is divergent (work items which are processed in parallel take different paths) the GPU will evaluate both sides for all the work items. And the branch will generally increase code size. But not using branching comes with a cost, too, as you’re no longer able to skip the execution of instructions which don’t contribute to the final result. It’s a trade-off, but using conditionals is often the right choice.

mhagain · September 3, 2014, 6:51am

In most of these cases I wouldn’t be surprised if the shader compiler translated them to the same instructions.

Specifically, you shouldn’t assume that the high-level GLSL you write has any direct relationship to the actual low-level GPU instructions that the shader compiler generates. For many cases of branching, the compiler is unlikely to actually generate a branch instruction at all, but more likely to emulate it via something like a lerp or step instruction: hence the old advice that a GPU may execute both sides of the branch!

Yandersen · September 3, 2014, 3:07pm

Hm, I thought the compiler evaluates quantity of cycles for each branch and pad the fastest branch with no-ops to ensure that no matter which branch will be actually taken the total execution time will be the same… But what about the branched texture accesses? The memory operations’ latency couldn’t be estimated as it depends on the cache-hits, right? Would be interesting to know how the actual hardware works, but I am afraid the info of this type is classified.

MtRoad · September 3, 2014, 5:15pm

Awesome discussion all. Does anyone have any recommended reading further on GPU architecture and instructions?

Firadeoclus · September 4, 2014, 7:39am

I’m not aware of any GPU architecture which does that. As for information, a lot is publicly available. For example:
http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf

GClements · September 4, 2014, 3:42pm

The basic concept is known as “Single Instruction, Multiple Data” or “SIMD”, also known as “vector processing”.

Each value in your program is actually an array (vector) of many such values. Any operation which you perform on a value is performed element-wise on the array. Consequently, it isn’t possible to “branch” such that different instructions are performed on different elements of the array. The same concept can be found in e.g. C++'s std::valarray and Python’s NumPy library.

Conditionals are implemented using conditional execution. The condition of an “if” statement is, like everything else, an array of values (in this case, an array of boolean values forming a “mask”; this is the EXEC register in the AMD document referenced by Firadeoclus).

Instructions are only performed for elements where the mask is true. An “else” statement simply inverts the mask, so that subsequent instructions are only performed for elements where the (original) mask is false. IOW, both branches are “executed”, but the instructions within each branch only affect a subset of the elements; for the other elements, the result is as if all instructions are no-ops.

Aside: the earliest GPUs didn’t even have conditional execution. The original NV_vertex_program extension stated that multiplication by zero and one must be invariant, so that conditionals could be implemented via multiplication and addition:

t ? a : b <=> t*a + (1-t)*b

Modern GPUs can perform genuine branches in the case of (dynamically-) uniform control flow, i.e. where the condition is either true for all elements or false for all elements. Earlier GPUs could only perform this optimisation when this situation could be detected in advance (statically-uniform control flow). E.g. a condition which only referenced uniform variables could be evaluated on the CPU, which would then instruct the GPU to execute one of two different shaders (one only includes the “if” branch, the other the “else” branch").