Part of the Khronos Group
OpenGL.org


Thread: How deadly is branching?

  1. #1
    Junior Member Newbie
    Join Date
    Aug 2014
    Posts
    26

    How deadly is branching?

    I've been led to believe that branching in a shader is just about the worst thing you can do. So does that mean I should opt for this:

    Code :
    #version 330 core
     
    void main() {
      int x = 1 - (gl_VertexID & 2);
      int y = 1 - ((gl_VertexID - 1) & 2);
     
      gl_Position = vec4(float(x), float(y), 0, 1);
    }

    instead of:

    Code :
    #version 330 core
     
    void main() {
      if(gl_VertexID == 0) {
        gl_Position = vec4(1, -1, 0, 1);
      } else if(gl_VertexID == 1) {
        gl_Position = vec4(1, 1, 0, 1);
      } else if(gl_VertexID == 2) {
        gl_Position = vec4(-1, 1, 0, 1);
      } else {
        gl_Position = vec4(-1, -1, 0, 1);
      }
    }

    Thanks!

    Disclaimer: I didn't actually try to run these shaders. Might not be correct.
    Last edited by BenFoppa; 09-03-2014 at 01:18 AM.
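
    Since the poster had not run either shader, the integer arithmetic can be checked outside the GPU. The following is a minimal sketch in Python rather than GLSL; the function names `bit_trick` and `if_chain` are made up for illustration, and it assumes `gl_VertexID` only takes the values 0 to 3.

```python
def bit_trick(vertex_id):
    # Same integer arithmetic as the first shader. Python's & on negative
    # ints follows two's-complement rules, so (-1) & 2 == 2, as for a GLSL int.
    x = 1 - (vertex_id & 2)
    y = 1 - ((vertex_id - 1) & 2)
    return (x, y)


def if_chain(vertex_id):
    # Same constants as the second shader (x and y components only).
    if vertex_id == 0:
        return (1, -1)
    elif vertex_id == 1:
        return (1, 1)
    elif vertex_id == 2:
        return (-1, 1)
    else:
        return (-1, -1)


for vid in range(4):
    print(vid, bit_trick(vid), if_chain(vid))
```

    Under that assumption both versions produce the same corner for each of the four vertex IDs.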

  2. #2
    Advanced Member Frequent Contributor
    Join Date
    Apr 2010
    Posts
    787
    This shader is going to execute a whopping 4 times on a processor that (typically) can have hundreds of threads of execution in flight. Even if you render lots of passes, you will be fill-rate limited by all the fragment shader work long before these vertex shaders even become noticeable.
    In general: when optimizing, measure where your bottleneck is (where you are spending significant amounts of time), then figure out how to improve that. Everything else will just get you tangled up in minutiae that end up having no visible effect after all.

  3. #3
    Junior Member Newbie
    Join Date
    Aug 2014
    Posts
    26
    Quote Originally Posted by carsten neumann View Post
    This shader is going to execute a whopping 4 times on a processor that (typically) can have hundreds of threads of execution in flight. Even if you render lots of passes, you will be fill-rate limited by all the fragment shader work long before these vertex shaders even become noticeable.
    In general: when optimizing, measure where your bottleneck is (where you are spending significant amounts of time), then figure out how to improve that. Everything else will just get you tangled up in minutiae that end up having no visible effect after all.
    Gooooood points. Thanks!

  4. #4
    Junior Member Regular Contributor
    Join Date
    Dec 2010
    Location
    Oakville, ON, CA
    Posts
    107
    Heh, I know what you are doing!

    Here is the better solution:

    Code :
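    #version 330 core
     
    // Array constructor syntax; brace initializer lists need GLSL 4.20 or later.
    const vec4 Corners[4] = vec4[4](
      vec4(-1.0, -1.0, 0.0, 1.0),
      vec4( 1.0, -1.0, 0.0, 1.0),
      vec4( 1.0,  1.0, 0.0, 1.0),
      vec4(-1.0,  1.0, 0.0, 1.0)
    );
     
    void main() {
      // gl_VertexID is expected to be in the range 0..3 here.
      gl_Position = Corners[gl_VertexID];
    }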

  5. #5
    Junior Member Newbie
    Join Date
    Sep 2014
    Posts
    8
    Quote Originally Posted by BenFoppa View Post
    I've been led to believe that branching in a shader is just about the worst thing you can do.
    That advice probably comes from a time when GPUs had poor branching support, but that's no longer the case.

    Branching has a cost. You obviously need to evaluate the branch condition. If the branch is divergent (work items which are processed in parallel take different paths) the GPU will evaluate both sides for all the work items. And the branch will generally increase code size. But not using branching comes with a cost, too, as you're no longer able to skip the execution of instructions which don't contribute to the final result. It's a trade-off, but using conditionals is often the right choice.

  6. #6
    Senior Member OpenGL Pro
    Join Date
    Jan 2007
    Posts
    1,215
    In most of these cases I wouldn't be surprised if the shader compiler translated them to the same instructions.

    Specifically, you shouldn't assume that the high-level GLSL you write has any direct relationship to the actual low-level GPU instructions that the shader compiler generates. For many cases of branching, the compiler is unlikely to actually generate a branch instruction at all, but more likely to emulate it via something like a lerp or step instruction: hence the old advice that a GPU may execute both sides of the branch!
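
    As a rough illustration of the "lerp or step" idea, here is a sketch in Python that mimics the GLSL built-ins `step` and `mix` and compares a branchy select with a branchless one. The threshold 0.5 and the names `branchy` and `branchless` are assumptions for the example; what a real shader compiler emits is not shown here.

```python
def step(edge, x):
    # Mimics GLSL step(): 0.0 if x < edge, otherwise 1.0.
    return 0.0 if x < edge else 1.0


def mix(x, y, a):
    # Mimics GLSL mix(): linear blend x*(1-a) + y*a.
    return x * (1.0 - a) + y * a


def branchy(v, a, b):
    # What the source code says: pick a or b with an if.
    if v >= 0.5:
        return a
    return b


def branchless(v, a, b):
    # Same result without control flow: both a and b are always evaluated,
    # and the 0/1 weight from step() picks one of them.
    return mix(b, a, step(0.5, v))


for v in (0.0, 0.49, 0.5, 1.0):
    print(v, branchy(v, 2.0, 7.0), branchless(v, 2.0, 7.0))
```

    Note that both inputs are computed in the branchless form, which is one sense in which "both sides" get executed.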

  7. #7
    Junior Member Regular Contributor
    Join Date
    Dec 2010
    Location
    Oakville, ON, CA
    Posts
    107
    Hm, I thought the compiler evaluates the number of cycles for each branch and pads the faster branch with no-ops, so that no matter which branch is actually taken the total execution time is the same... But what about branched texture accesses? The latency of memory operations can't be estimated, as it depends on cache hits, right? It would be interesting to know how the actual hardware works, but I am afraid info of this type is classified.

  8. #8
    Intern Newbie
    Join Date
    Apr 2014
    Posts
    47
    Awesome discussion, all. Does anyone have any recommended further reading on GPU architecture and instructions?

  9. #9
    Junior Member Newbie
    Join Date
    Sep 2014
    Posts
    8
    Quote Originally Posted by Yandersen View Post
    Hm, I thought the compiler evaluates the number of cycles for each branch and pads the faster branch with no-ops, so that no matter which branch is actually taken the total execution time is the same... But what about branched texture accesses? The latency of memory operations can't be estimated, as it depends on cache hits, right? It would be interesting to know how the actual hardware works, but I am afraid info of this type is classified.
    I'm not aware of any GPU architecture which does that. As for information, a lot is publicly available. For example:
    http://developer.amd.com/wordpress/m...chitecture.pdf

  10. #10
    Member Regular Contributor
    Join Date
    Jun 2013
    Posts
    495
    Quote Originally Posted by Yandersen View Post
    Would be interesting to know how the actual hardware works
    The basic concept is known as "Single Instruction, Multiple Data" or "SIMD", also known as "vector processing".

    Each value in your program is actually an array (vector) of many such values. Any operation which you perform on a value is performed element-wise on the array. Consequently, it isn't possible to "branch" such that different instructions are performed on different elements of the array. The same concept can be found in e.g. C++'s std::valarray and Python's NumPy library.

    Conditionals are implemented using conditional execution. The condition of an "if" statement is, like everything else, an array of values (in this case, an array of boolean values forming a "mask"; this is the EXEC register in the AMD document referenced by Firadeoclus).

    Instructions are only performed for elements where the mask is true. An "else" statement simply inverts the mask, so that subsequent instructions are only performed for elements where the (original) mask is false. IOW, both branches are "executed", but the instructions within each branch only affect a subset of the elements; for the other elements, the result is as if all instructions are no-ops.
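
    The masking described above can be mimicked in a few lines. This is a toy model in Python, not real hardware behaviour: a "wavefront" is just a list, the function name `run_if_else` is hypothetical, and the shader being modelled is assumed to be `if (v < 0) r = -v; else r = v * 2;`.

```python
def run_if_else(values):
    # One value per lane; all lanes step through the same instructions.
    result = [0] * len(values)

    # The condition is evaluated for every lane and becomes the mask.
    mask = [v < 0 for v in values]

    # "if" side: issued to all lanes, but only lanes with a true mask write.
    for lane, active in enumerate(mask):
        if active:
            result[lane] = -values[lane]

    # "else": invert the mask, then issue the other side the same way.
    mask = [not m for m in mask]
    for lane, active in enumerate(mask):
        if active:
            result[lane] = values[lane] * 2

    return result


print(run_if_else([-3, 4, -1, 0]))  # [3, 8, 1, 0]
```

    Both loops always run over every lane, which corresponds to both branches being "executed" while each one only affects its own subset of elements.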

    Aside: the earliest GPUs didn't even have conditional execution. The original NV_vertex_program extension stated that multiplication by zero and one must be invariant, so that conditionals could be implemented via multiplication and addition:
    Code :
    t ? a : b <=> t*a + (1-t)*b
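
    A small numeric check of that identity, sketched in Python with ordinary IEEE 754 floats. The function name `select` is made up for the example, and `t` is assumed to be exactly 1.0 or 0.0. The last call is my own illustration of why a rule about multiplying by zero would matter: with plain IEEE arithmetic, 0.0 times infinity is NaN, so the unselected side can leak into the result.

```python
import math


def select(t, a, b):
    # t is 1.0 for "true" and 0.0 for "false".
    return t * a + (1.0 - t) * b


print(select(1.0, 5.0, 9.0))  # 5.0
print(select(0.0, 5.0, 9.0))  # 9.0

# With IEEE floats the discarded operand still matters: 0.0 * inf is NaN.
leaked = select(1.0, 5.0, float("inf"))
print(math.isnan(leaked))  # True
```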

    Modern GPUs can perform genuine branches in the case of (dynamically) uniform control flow, i.e. where the condition is either true for all elements or false for all elements. Earlier GPUs could only perform this optimisation when this situation could be detected in advance (statically uniform control flow). E.g. a condition which only referenced uniform variables could be evaluated on the CPU, which would then instruct the GPU to execute one of two different shaders (one including only the "if" branch, the other only the "else" branch).
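
    To make the difference between uniform and divergent control flow concrete, here is a toy cost model in Python. The function name `issued_instructions` and the instruction counts are hypothetical, and the model ignores the cost of evaluating the condition itself; it only counts which sides of the branch have to be issued for a given mask.

```python
def issued_instructions(mask, if_cost, else_cost):
    # mask holds one boolean per lane; costs are made-up instruction counts.
    issued = 0
    if any(mask):
        # At least one lane needs the "if" side, so it must be issued.
        issued += if_cost
    if not all(mask):
        # At least one lane needs the "else" side, so it must be issued.
        issued += else_cost
    return issued


print(issued_instructions([True, True, True, True], 10, 30))      # 10
print(issued_instructions([False, False, False, False], 10, 30))  # 30
print(issued_instructions([True, False, True, True], 10, 30))     # 40
```

    In this model a uniform condition pays for only one side, while a single divergent lane makes the whole group pay for both.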
