Obviously the question was posted in an attempt to avoid implementing all three approaches and testing across a huge range of hardware, which is expensive in both time and equipment. This is an insanely small team.
The use of subroutines (which I already do in my shaders for code reuse) does not answer the fundamental question: is branch support now mature enough across a significant share of GPUs that one should feel free to use it for minor additions? This is really a question of shader state-change cost versus a branch that evaluates to the same result over an entire block of fragments.
Why so? Are 1x1 fetches that fast, a single op at this point? Once the texel has been moved into the texture cache, I would imagine, so after the first sample you are good to go?
Small teams shouldn’t concern themselves with this level of performance. You shouldn’t be relying on low-level optimizations like this. Even on a large team, I wouldn’t suggest bothering with this kind of optimization unless there was actual profiling data showing that a significant improvement could be made.
Small teams should be focused on the biggest bang for their buck. This isn’t it.
Hmmm… while it is true that during optimization I will deal with the bottlenecks (this is FAR from my first rodeo; 14 years doing game graphics for a living), it doesn’t hurt to think about the little things before implementing a feature.
Furthermore, this does not actually answer the question, which would be useful information not only for myself and this project, but the community in general, I suspect.
The bottleneck for my renderer is fill rate, which is directly tied to fragment shader execution, so every little bit helps.
In my experience on modern hardware (GeForce 8 and later), conditionals are not very performance-critical. Often there is no measurable performance difference at all. So unless you want to squeeze out the last bit of performance, it is usually not that important to avoid them at all costs.
Also, in your case the condition will always evaluate identically across all fragments. GPUs usually need to evaluate the same branch within a 2x2 (or 4x4?) pixel block, so if the condition evaluated differently on some fragments, the shader would need to compute BOTH branches and, per fragment, discard the result of one branch. In your case that won’t happen.
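To illustrate the uniform-condition case, here is a minimal GLSL sketch; the uniform and sampler names (`u_useDetail`, `u_baseMap`, `u_detailMap`) are hypothetical:

```glsl
uniform bool u_useDetail;       // set per draw call, identical for every fragment
uniform sampler2D u_baseMap;
uniform sampler2D u_detailMap;

varying vec2 v_uv;

void main()
{
    vec4 color = texture2D(u_baseMap, v_uv);
    // Because u_useDetail is a uniform, every fragment in a pixel block takes
    // the same branch, so the hardware never has to execute both sides.
    if (u_useDetail)
        color.rgb *= texture2D(u_detailMap, v_uv).rgb * 2.0;
    gl_FragColor = color;
}
```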
In the long run it would be most useful to use a preprocessor to #ifdef the code in the shader and generate different variants of the same base shader.
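A sketch of that approach, assuming the application prepends `#define USE_DETAIL` to the source when compiling the variant that needs it (the define and sampler names are hypothetical):

```glsl
// Compiled twice from the same source: once with USE_DETAIL defined, once without.
uniform sampler2D u_baseMap;
#ifdef USE_DETAIL
uniform sampler2D u_detailMap;
#endif

varying vec2 v_uv;

void main()
{
    vec4 color = texture2D(u_baseMap, v_uv);
#ifdef USE_DETAIL
    // This code simply does not exist in the variant compiled without the
    // define, so there is no branch cost at all in either variant; the price
    // is an extra shader switch on the CPU side when materials alternate.
    color.rgb *= texture2D(u_detailMap, v_uv).rgb * 2.0;
#endif
    gl_FragColor = color;
}
```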
Thanks Jan, yeah, it certainly helps. From my understanding, the latency from larger branches on modern GPUs comes from the pipeline: a block of fragments is executed simultaneously and the results are synced, so the whole block pays the worst-case cost of any fragment in it.
On older cards that did not have “real” branching, just conditional moves, the card would execute both branches and then discard one result during the mov.
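That both-sides evaluation is essentially what you can still write by hand as a branchless select; a minimal sketch in GLSL (variable names are hypothetical, and this would sit inside a larger shader):

```glsl
// Branchless equivalent of `result = cond ? withDetail : withoutDetail`:
// both expressions are evaluated, then one is discarded by the lerp, which
// is what the old conditional-move hardware did implicitly.
float sel = float(u_useDetail);                       // 0.0 or 1.0
vec3 withoutDetail = baseColor.rgb;
vec3 withDetail    = baseColor.rgb * detailColor.rgb * 2.0;
vec3 result = mix(withoutDetail, withDetail, sel);
```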
That’s my understanding of things at the moment, which may be flawed. For now I’ve decided to go with the 1x1 texture lookup. This is mostly because the potential latency is easily hidden by the other samples in most of the shaders, so I’m really just paying for one extra one-cycle mul.
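For what that decision looks like in practice, here is a hedged sketch; the sampler names are hypothetical, and the idea is that the feature flag lives in a 1x1 texture swapped per material, so the fetch folds into a multiply:

```glsl
uniform sampler2D u_baseMap;
uniform sampler2D u_extraMap;
uniform sampler2D u_flagTex;   // 1x1 texture: red channel is 1.0 when the
                               // feature is on, 0.0 when it is off

varying vec2 v_uv;

void main()
{
    vec4 color = texture2D(u_baseMap, v_uv);
    // After the first fragment the 1x1 texel sits in the texture cache, so
    // this fetch is effectively free; the flag costs one extra mul.
    float flag = texture2D(u_flagTex, vec2(0.5)).r;
    // Note the extra map is still sampled even when flag is 0.0; the win is
    // avoiding both a branch and a shader switch.
    color.rgb += flag * texture2D(u_extraMap, v_uv).rgb;
    gl_FragColor = color;
}
```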
Towards the end of the project I may clean this up, but I suspect I won’t really NEED to.