Part of the Khronos Group
OpenGL.org

The Industry's Foundation for High Performance Graphics

from games to virtual reality, mobile phones to supercomputers


Thread: Dynamic subroutines vs switch/if

  1. #1
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,104

    Dynamic subroutines vs switch/if

    I have done some benchmarking of dynamic subroutines versus a set of if tests controlled by a uniform that I modify as needed. My initial tests, with 2 subroutines (1 in the vertex shader and another in the fragment shader), show that using a uniform is about 40% faster on an nVidia 580.

    Has anyone else compared these?

  2. #2
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,124
    I haven't, though wouldn't a more apples-to-apples test be comparing subroutines against switch/if based on a constant value rather than a uniform? Ideally this should result in no run-time conditional branching in either case.

    (In practice, the constant value would be set by your shader generator prior to compilation.)

  3. #3
    Senior Member OpenGL Pro
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    1,099
    Dark Photon, aren't you suggesting to compare compile-time to run-time performance in that case? If I get your drift, you suggest having your generator define some constant depending on the usage scenario and have all unreachable branches thrown out by dead code elimination, correct? How would you do that with subroutines? AFAIK, the latter are pure run-time constructs which aren't branched in the shader anyway but selected with API calls. I'm a little confused but I tend to think that tonyo actually compares apples to apples here.

    tonyo, could you post your shaders? It's hard to speculate without some hints.

  4. #4
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    985
    In theory, switch statements could also be implemented using jump tables on newer GPUs, much as compilers do on the CPU. In that case they could be as fast as subroutines, but I'm not sure how common this is in drivers.

    Anyway, your best bet is to use subroutines for the most consistent performance across implementations. They are really nothing more than function pointers, so you get constant-time jumps to the appropriate code. With switch statements on dynamically uniform selectors you get either constant-time jumps (if the GLSL compiler builds jump tables) or linear-time jumps (if the case statements are evaluated like any other conditional statement).

    In your case, considering you only had a choice between 2 subroutines (if I understand correctly), the difference might not have shown up, but when you have many more options, constant time will beat linear time for sure.
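    The constant-time versus linear-time point can be sketched as a CPU-side analogy in plain C. This is not GLSL and says nothing about what any driver actually does; the mode functions, the `dispatch_*` names and the comparison counter are all hypothetical, made up for illustration.

```c
#include <assert.h>

/* Hypothetical colour modes; stand-ins for shader subroutines. */
enum { MODE_COUNT = 4 };

typedef int (*colour_fn)(int texel);

static int mode_diffuse(int texel)  { (void)texel; return 10; }
static int mode_texture(int texel)  { return texel; }
static int mode_inverted(int texel) { return 255 - texel; }
static int mode_halved(int texel)   { return texel / 2; }

/* Linear dispatch: the selector is compared against each case in turn,
 * so the number of comparisons grows with the position of the match. */
static int dispatch_linear(int mode, int texel, int *comparisons)
{
    *comparisons = 0;
    ++*comparisons; if (mode == 0) return mode_diffuse(texel);
    ++*comparisons; if (mode == 1) return mode_texture(texel);
    ++*comparisons; if (mode == 2) return mode_inverted(texel);
    ++*comparisons; if (mode == 3) return mode_halved(texel);
    return -1; /* unknown mode */
}

/* Table dispatch: one indexed load and one indirect call,
 * regardless of how many modes exist. */
static int dispatch_table(int mode, int texel)
{
    static const colour_fn table[MODE_COUNT] = {
        mode_diffuse, mode_texture, mode_inverted, mode_halved
    };
    assert(mode >= 0 && mode < MODE_COUNT);
    return table[mode](texel);
}
```

    With only two modes the if chain costs at most two comparisons, which may be why a small test shows no benefit from the table form.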
    Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
    Technical Blog: http://www.rastergrid.com/blog/

  5. #5
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,124
    Originally posted by thokra:
    Dark Photon, aren't you suggesting to compare compile-time to run-time performance in that case?
    Not quite. They are more both compile time I think. If my perception of how subroutines work is correct, that wouldn't be a "run-time" evaluation (in the sense of being re-evaluated every time the shader is run -- e.g. every vertex or every fragment), but rather pre-runtime when you bind the shader and configure uniforms. I'm thinking it just pokes in an unconditional jump point as a preprocess, and then there's no per-vertex or per-fragment run-time overhead (save dead code elimination that might have otherwise been possible across the subroutine interface). I could be all wet though.

    If I get your drift, you suggest having your generator define some constant depending on the usage scenario and have all unreachable branches thrown out by dead code elimination, correct?
    Right.

    How would you do that with subroutines? AFAIK, the latter are pure run-time constructs which aren't branched in the shader anyway but selected with API calls.
    Yeah, but those API calls set uniforms, right? These uniforms (code stubs which are precompiled) can be plugged in at shader bind/configure time at the very latest. My impression is there isn't a conditional evaluation going on in the shader for every single execution of that shader to make that happen. You're telling it in advance which one to use, so it just plugs in a jump point (I think). Unconditional jumps are cheap.

    It's conditionals (based on non-constant expressions) that can cause run-time execution divergence and thus can get expensive. Conditionals based on uniform expressions are cheaper, but you still have to evaluate the condition and do a branch. Conditionals based on constant expressions are the best, because then the whole condition evaluation and branch just gets ripped out by the compiler, and then the compiler can optimize code across the conditional as if it wasn't even there.

    All that said, I'm not a driver developer and don't know exactly how subroutines are implemented in each of the vendors' OpenGL drivers.
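    As a rough CPU-side analogy in C (not GLSL), the difference between a selector known at compile time and one only known at run time might look like the sketch below. `COLOUR_MODE`, the `MODE_*` values and both function names are hypothetical; a shader generator would emit the constant before compilation.

```c
#include <assert.h>

/* Hypothetical mode values for this sketch. */
#define MODE_DIFFUSE 0
#define MODE_TEXTURE 1

/* Assumed default; a generator would normally define COLOUR_MODE itself. */
#ifndef COLOUR_MODE
#define COLOUR_MODE MODE_TEXTURE
#endif

/* Run-time selector (like a uniform): the comparison and branch
 * are evaluated on every call. */
static float shade_runtime(int mode, float diffuse, float texel)
{
    assert(mode == MODE_DIFFUSE || mode == MODE_TEXTURE);
    if (mode == MODE_DIFFUSE)
        return diffuse;
    return texel;
}

/* Compile-time selector: only one branch survives preprocessing,
 * so no condition is left to evaluate at run time. */
static float shade_constant(float diffuse, float texel)
{
#if COLOUR_MODE == MODE_DIFFUSE
    (void)texel;
    return diffuse;
#else
    (void)diffuse;
    return texel;
#endif
}
```

    The price of the constant version is one compiled variant per mode, which is the trade-off against a single program switched by a uniform or subroutine.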

  6. #6
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,104
    This is the gist of my test code, which I run 1000 times, alternating the colour selection mode.
    Code :
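    #version 410 core

    // SWITCH selects the test variant and is assumed to be defined (as 0 or 1)
    // by the application before compilation: 1 = uniform + if, 0 = subroutine.

    in vec2 v_uv;

    uniform vec4      u_Diffuse;
    uniform sampler2D u_DiffuseTexture;

    uniform int       u_switch;   // colour mode for the if-based variant

    // Subroutine type and the uniform that selects an implementation of it.
    subroutine vec4 fetchColour_Type(vec2 p_ST);
    subroutine uniform fetchColour_Type u_fetchColour;

    // Solid colour: ignores the texture coordinate.
    subroutine (fetchColour_Type)
    vec4 fetchDiffuseColour(vec2 p_ST)
    {
      return u_Diffuse;
    }

    // Textured colour. texture2D() is not available in the 4.10 core
    // profile, so texture() is used instead.
    subroutine (fetchColour_Type)
    vec4 fetchTexture(vec2 p_ST)
    {
      return texture(u_DiffuseTexture, p_ST.st);
    }

    layout(location = 0, index = 0) out vec4 fFragColour;

    void main()
    {
      vec4 colour;
    #if SWITCH
      // Variant 1: branch on a plain integer uniform.
      if (u_switch == 0)
        colour = fetchDiffuseColour(v_uv);
      else
        colour = fetchTexture(v_uv);
    #else
      // Variant 2: call through the subroutine uniform.
      colour = u_fetchColour(v_uv);
    #endif
      fFragColour = colour;
    }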

    My actual code has several other choices for colour selection, then whether to apply lighting or not, wireframe the edges and some other things; each is selected via a subroutine. These all depend on the material being rendered and on options selected by the user at runtime, for example whether they wish to see a raster image draped on the mesh, a solid colour or procedural texturing.

    The code all works fine; I was just starting to look at optimising my shaders when I got this result. I had not considered aqnuep's comment that the test I am doing may be too simple to show a benefit from subroutines.
    Certainly from a maintenance viewpoint subroutines are much nicer.

  7. #7
    Senior Member OpenGL Pro
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    1,099
    First of all, excuse my late reply.

    Originally posted by Dark Photon:
    Not quite. They are more both compile time I think. If my perception of how subroutines work is correct, that wouldn't be a "run-time" evaluation[..]
    Originally posted by Dark Photon:
    Yeah, but those API calls set uniforms, right? These uniforms (code stubs which are precompiled) can be plugged in at shader bind/configure time at the very latest. My impression is there isn't a conditional evaluation going on in the shader for every single execution of that shader to make that happen. You're telling it in advance which one to use, so it just plugs in a jump point (I think). Unconditional jumps are cheap.
    The uniform indices are, as with any other uniform, bound at link time and stay fixed, so you're definitely right there. However, there is a run-time cost, and that is a function call that most likely cannot be inlined. That's what I was getting at. As aqnuep stated, subroutines are implemented as function pointers, and actually I wouldn't know how to do run-time switching otherwise. I didn't express my suspicion because, like you, I'm not a driver dev and cannot claim anything for sure.

    It would be nice to hear from aqnuep again on the inlining, or on whether function call overhead can somehow be avoided by the driver. Also, it would be nice to know when to expect jump tables to be built and in which cases you're at the mercy of conditional evaluation and branch prediction.

    EDIT: While we're talking about code generation, do drivers on different OSs produce the same binaries for the same GPU? Furthermore, except for AMD's ShaderAnalyzer, is there any way to disassemble shader binaries like an objdump for GLSL? Would be awesome.
    Last edited by thokra; 01-28-2013 at 03:43 AM.

  8. #8
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,124
    For NVidia, I've been using the Cg compiler for years to dump the NVidia assembly for a GLSL shader:

    Code :
    cgc -oglsl -strict -glslWerror -profile gpu_vp vert.glsl
    cgc -oglsl -strict -glslWerror -profile gpu_fp  frag.glsl
    ...

    Has been very useful in the past when diagnosing some shader performance problems.

    To what degree it supports every possible GLSL feature I don't know, but it supports quite a bit up through GLSL 3.x.

  9. #9
    Senior Member OpenGL Pro
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    1,099
    Thanks, I haven't used Cg before, so I forgot about NV's offline compiler. In general, something like that would be very advantageous, even if it meant that 3 major companies had to offer 3 compilers with 3 differing binaries. I wouldn't care.

    I just checked what the cgc can do nowadays and the reference manual shows that it supports NV_gpu_program5 with -ogp5 on GF400 or higher.

  10. #10
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    985
    Originally posted by Dark Photon:
    Not quite. They are more both compile time I think. If my perception of how subroutines work is correct, that wouldn't be a "run-time" evaluation (in the sense of being re-evaluated every time the shader is run -- e.g. every vertex or every fragment), but rather pre-runtime when you bind the shader and configure uniforms. I'm thinking it just pokes in an unconditional jump point as a preprocess, and then there's no per-vertex or per-fragment run-time overhead (save dead code elimination that might have otherwise been possible across the subroutine interface). I could be all wet though.
    No, they are not compile time. A uniform can come from a buffer object (run-time), and an array of subroutine uniforms can be indexed using dynamically uniform (run-time) values, so both are run-time data (even though they could theoretically be unrolled at compile time in some cases).

    Originally posted by Dark Photon:
    Yeah, but those API calls set uniforms, right? These uniforms (code stubs which are precompiled) can be plugged in at shader bind/configure time at the very latest. My impression is there isn't a conditional evaluation going on in the shader for every single execution of that shader to make that happen. You're telling it in advance which one to use, so it just plugs in a jump point (I think). Unconditional jumps are cheap.
    Once again, your theory fails: with uniform buffers it's highly unlikely that the driver will parse your buffer data, recompile your shaders using the uniform data in the buffer and then launch the draw. If that happened, you would have horrible performance.

    To sum it up, uniforms and subroutines are run-time.
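    A small C model of that argument, offered only as an analogy: the function bodies are compiled once, the selection made through the API (for example with glUniformSubroutinesuiv) is modelled as a plain data write, and each invocation indexes an array of slots with a run-time value. The struct, the slot layout and every name below are hypothetical and do not describe how any driver is implemented.

```c
#include <assert.h>

enum { FN_COUNT = 2, SLOT_COUNT = 2 };

typedef float (*fetch_fn)(float uv);

static float fetch_diffuse(float uv) { (void)uv; return 0.5f; }
static float fetch_texture(float uv) { return uv; }

/* Compiled once; never rebuilt when the selection changes. */
static const fetch_fn compiled_functions[FN_COUNT] = {
    fetch_diffuse, fetch_texture
};

/* Models an array of subroutine uniforms: plain data holding
 * function indices. */
struct program_state {
    int slots[SLOT_COUNT];
};

/* Models the API-side selection: a data write, no recompilation. */
static void set_slot(struct program_state *p, int slot, int fn_index)
{
    assert(slot >= 0 && slot < SLOT_COUNT);
    assert(fn_index >= 0 && fn_index < FN_COUNT);
    p->slots[slot] = fn_index;
}

/* Models one shader invocation: the slot is picked by a run-time
 * value, then the call goes through the stored function index. */
static float run_invocation(const struct program_state *p, int slot, float uv)
{
    assert(slot >= 0 && slot < SLOT_COUNT);
    return compiled_functions[p->slots[slot]](uv);
}
```

    In this model, changing a slot between draws costs one store, while every invocation pays for an indexed load plus an indirect call.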
