Dynamic subroutines vs switch/if

I have done some benchmarking comparing dynamic subroutines against a set of if tests controlled by a uniform that I modify as needed. My initial tests, with 2 subroutines (1 in the vertex shader and another in the fragment shader), show that using a uniform is about 40% faster on an NVIDIA 580.

Has anyone else compared these?

I haven’t, though wouldn’t a more apples-to-apples test be comparing subroutines against switch/if based on a constant value rather than a uniform? Ideally, this should result in no run-time conditional branching in either case.

(In practice, the constant value would be set by your shader generator prior to compilation.)
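For illustration, the generator-injected constant approach might look something like this sketch (the `COLOUR_MODE` define and the uniform names are hypothetical, not from the code posted in this thread):

```glsl
// The shader generator prepends a line such as "#define COLOUR_MODE 1"
// to the source string before handing it to glShaderSource/glCompileShader.
#if COLOUR_MODE == 0
    colour = u_Diffuse;                        // solid-colour variant
#else
    colour = texture(u_DiffuseTexture, v_uv);  // textured variant
#endif
// The untaken branch never reaches the compiled binary, so the compiler
// is also free to optimise across the branch point as if it weren't there.
```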

Dark Photon, aren’t you suggesting comparing compile-time to run-time performance in that case? If I get your drift, you suggest having your generator define some constant depending on the usage scenario and have all unreachable branches thrown out by dead code elimination, correct? How would you do that with subroutines? AFAIK, the latter are pure run-time constructs which aren’t branched in the shader anyway but selected with API calls. I’m a little confused, but I tend to think that tonyo actually compares apples to apples here.

tonyo, could you post your shaders? It’s hard to speculate without some hints.

In theory, switch statements could also be implemented using jump tables on newer GPUs, much as compilers do on the CPU. In that case, they could be as fast as subroutines, but I’m not sure how common this practice is in drivers.

Anyway, your best bet is to use subroutines for the most consistent performance across implementations. They are really nothing more than function pointers, so you get constant-time jumps to the appropriate code, while with switch statements on dynamically uniform selectors you get either constant-time jumps (if the GLSL compiler builds jump tables) or linear-time jumps (if the case statements are evaluated like any other conditional statement).
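As a rough CPU-side analogy of the two dispatch strategies (illustrative names only, nothing here is OpenGL API): a subroutine behaves like a function pointer fixed before the inner loop runs, one unconditional indirect jump per call, while an if/else chain on a selector is walked linearly on every invocation.

```c
#include <assert.h>

/* Three hypothetical "shading" variants standing in for subroutines. */
static int shade_flat(int x)     { return x + 1; }
static int shade_textured(int x) { return x * 2; }
static int shade_lit(int x)      { return x - 3; }

typedef int (*shade_fn)(int);

/* "Subroutine" style: the pointer is chosen once, before execution;
 * each call is a single indirect jump -- constant time regardless of
 * how many variants exist. */
static int run_subroutine(shade_fn fn, int x) {
    return fn(x);
}

/* "switch/if on a selector" style: each invocation walks the chain,
 * so cost grows linearly with the number of cases (unless the
 * compiler builds a jump table). */
static int run_chain(int selector, int x) {
    if (selector == 0)
        return shade_flat(x);
    else if (selector == 1)
        return shade_textured(x);
    else
        return shade_lit(x);
}
```

With only two or three cases the difference is negligible, which matches the observation below about small case counts.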

In your case, considering you only had a choice between 2 subroutines (if I understand correctly), the difference might not have come out, but when you have many more options, constant time will beat linear time for sure.

Not quite. They are both more like compile time, I think. If my perception of how subroutines work is correct, it wouldn’t be a “run-time” evaluation (in the sense of being re-evaluated every time the shader is run – e.g. every vertex or every fragment), but rather pre-run-time, when you bind the shader and configure uniforms. I’m thinking it just pokes in an unconditional jump point as a preprocess, and then there’s no per-vertex or per-fragment run-time overhead (save dead-code elimination that might otherwise have been possible across the subroutine interface). I could be all wet, though.

If I get your drift, you suggest having your generator define some constant depending on the usage scenario and have all unreachable branches thrown out by dead code elimination, correct?

Right.

How would you do that with subroutines? AFAIK, the latter are pure run-time constructs which aren’t branched in the shader anyway but selected with API calls.

Yeah, but those API calls set uniforms, right? These uniforms (code stubs which are precompiled) can be plugged in at shader bind/configure time at the very latest. My impression is there isn’t a conditional evaluation going on in the shader for every single execution of that shader to make that happen. You’re telling it in advance which one to use, so it just plugs in a jump point (I think). Unconditional jumps are cheap.
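Concretely, those API calls look like this sketch (GL 4.0+ core; both entry points are real, but `prog` and the subroutine name are hypothetical placeholders, and all error checking is omitted):

```c
/* After glUseProgram(prog): query the index of the compiled stub once. */
GLuint idx = glGetSubroutineIndex(prog, GL_FRAGMENT_SHADER, "fetchTexture");

/* Plug it into the active subroutine uniform(s) for the bound program:
 * one index per active subroutine uniform location, in location order.
 * This is the "plug in a jump point" step -- done at bind/configure
 * time, not per fragment. */
glUniformSubroutinesuiv(GL_FRAGMENT_SHADER, 1, &idx);

/* Caveat: subroutine uniform state is not stored in the program object;
 * it resets on every glUseProgram, so it must be re-specified each bind. */
```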

It’s conditionals (based on non-constant expressions) that can cause run-time execution divergence and thus can get expensive. Conditionals based on uniform expressions are cheaper, but you still have to evaluate the condition and do a branch. Conditionals based on constant expressions are the best, because then the whole condition evaluation and branch just gets ripped out by the compiler, and then the compiler can optimize code across the conditional as if it wasn’t even there.
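The three classes of conditionals can be seen side by side in a sketch like this (names hypothetical):

```glsl
const int   MODE = 1;   // constant expression
uniform int u_mode;     // (dynamically) uniform expression
in float    v_alpha;    // varying -- may differ per fragment

void main()
{
    if (MODE == 1)     { /* ... */ }  // folded away at compile time;
                                      // only the taken side survives
    if (u_mode == 1)   { /* ... */ }  // condition evaluated and branched
                                      // at run time, but every invocation
                                      // takes the same path
    if (v_alpha > 0.5) { /* ... */ }  // can diverge between fragments:
                                      // the potentially expensive case
}
```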

All that said, I’m not a driver developer and don’t know exactly how subroutines are implemented in each of the vendors’ OpenGL drivers.

This is the gist of my test code, which I call 1000 times, alternating the colour selection mode:


#version 410 core

in vec2    v_uv;

uniform vec4            u_Diffuse;
uniform sampler2D   u_DiffuseTexture; 

uniform int               u_switch;


subroutine vec4 fetchColour_Type(vec2 p_ST);
subroutine uniform fetchColour_Type u_fetchColour;

subroutine (fetchColour_Type)
vec4  fetchDiffuseColour(vec2 p_ST)
{
  return u_Diffuse;
}

subroutine (fetchColour_Type)
vec4  fetchTexture(vec2 p_ST)
{
  return texture(u_DiffuseTexture, p_ST);  // texture2D is unavailable in 410 core
}

layout(location = 0, index = 0) out vec4 fFragColour;


void    main()
{
  vec4 colour;
#if SWITCH
  if (u_switch == 0)
    colour = fetchDiffuseColour(v_uv);
  else
    colour = fetchTexture(v_uv);
#else
  colour = u_fetchColour(v_uv);
#endif
  fFragColour = colour;
}

My actual code has several other choices for colour selection, then whether to apply lighting or not, whether to wireframe the edges, and some other things; each is selected via a subroutine. These are all dependent on the material
being rendered and on options selected by the user at runtime, for example whether they wish to see a raster image draped on the mesh, a solid colour, or procedural texturing.

The code all works fine; I was just starting to look at optimising my shaders when I got this result. I had not considered aqnuep’s comment that the test I am doing may be too simple to show a benefit for subroutines.
Certainly from a maintenance viewpoint, subroutines are much nicer.

First of all, excuse my late reply.

The uniform indices are, as with any other uniform, bound at link time and stay fixed, so you’re definitely right there. However, there is a run-time cost, and that is a function call that most likely cannot be inlined. That’s what I was getting at. As aqnuep stated, subroutines are implemented as function pointers, and I actually wouldn’t know how to do run-time switching otherwise. I didn’t express my suspicion because, like you, I’m not a driver dev and cannot claim anything for sure.

It would be nice to hear from aqnuep again on the inlining, or on whether function call overhead can somehow be avoided by the driver. Also, it would be nice to know when to expect jump tables to be built, and in which cases you’re at the mercy of conditional evaluation and branch prediction.

EDIT: While we’re talking about code generation, do drivers on different OSs produce the same binaries for the same GPU? Furthermore, except for AMD’s ShaderAnalyzer, is there any way to disassemble shader binaries like an objdump for GLSL? Would be awesome.

For NVidia, I’ve been using the Cg compiler for years to dump the NVidia assembly for a GLSL shader:


cgc -oglsl -strict -glslWerror -profile gpu_vp vert.glsl
cgc -oglsl -strict -glslWerror -profile gpu_fp  frag.glsl
...

Has been very useful in the past when diagnosing some shader performance problems.

To what degree it supports every possible GLSL feature I don’t know, but it supports quite a bit up through GLSL 3.x.

Thanks, I haven’t done Cg before, so I forgot about NV’s offline compiler. In general, something like that would be very advantageous. Even if it meant that the 3 major companies had to offer 3 compilers with 3 differing binaries, I wouldn’t care.

I just checked what cgc can do nowadays, and the reference manual shows that it supports NV_gpu_program5 with -ogp5 on GF400 or higher.

No, they are not compile time. A uniform can come from a buffer object (run-time) and an array of subroutine uniforms can be indexed using dynamically uniform (run-time) values thus both are run-time data (even though they could be theoretically unrolled at compile time in some cases).

Once again, your theory fails: for uniform buffers it’s highly unlikely that the driver will parse your buffer data, re-compile your shaders using the uniform data in the buffer, and only then launch the draw. If that happened, you would have horrible performance.

To sum it up, uniforms and subroutines are run-time.
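The array-of-subroutine-uniforms case mentioned above can be sketched like this (a hypothetical fragment modelled on the thread’s naming, not tonyo’s actual code):

```glsl
#version 410 core

in vec2 v_uv;

subroutine vec4 fetchColour_Type(vec2 p_ST);

// An array of subroutine uniforms; each element is assigned an index
// via glUniformSubroutinesuiv like any single subroutine uniform.
subroutine uniform fetchColour_Type u_fetch[4];

// Selector; being a uniform, it may be sourced from a buffer object,
// i.e. it is genuinely run-time data.
uniform int u_index;

out vec4 fFragColour;

void main()
{
    // Indexing with a dynamically uniform expression: the jump target
    // is only known at execution time, so in general it cannot be a
    // compile-time substitution.
    fFragColour = u_fetch[u_index](v_uv);
}
```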

1) The case where you use conditionals based on constant expressions is definitely compile time (on NVidia at least). 2) The case where you have subroutines is almost certainly pre-execution time. (I misworded in my original, sorry; I meant to say “They are both more like compile time, I think”, but I go on to qualify that I mean plugged in before the shader is executed. Sort of like a post-link fix-up.) Those were the two cases I had mentioned.

The point being that there are very likely no conditionals being evaluated during shader execution for the latter (and definitely none for the former). But there are for the “if on uniform value” case.

…it’s highly unlikely that the driver will parse your buffer data, re-compile your shaders using the uniform data in the buffer and then launch the draw.

I never meant that (I meant pre-execution), but I see now that my miswording could have implied it. My bad. And yes, I agree that would be a very inefficient implementation of subroutines, and I’d hope that no one implements them like that (we ditched that behavior 7 generations of GPU ago!).

Still, “pre-execute” is not the appropriate expression either. When you have conditionals based on uniform values, or when you call subroutines, that happens at shader execution time (in at least 99% of cases): the value gets dynamically evaluated, or a dynamic jump happens. This dynamism is merely uniform, ensuring that each shader invocation on a single compute unit takes the same path, but it is completely a run-time decision.
