tips for performance? (request)

Hello.
I’m making my first shader, inspired by the easy-to-find shockwave shader on the web (yeah, it’s the same one on many pages with very little variation). So I figured I’d do something accessible for my entry level (that is, level 0.5 :D). Once I get it to work, I’ll post it so I have something to laugh at in a few months to a year, when I’ve mastered the shading techniques. Just kidding (half).
OK, generally I’m self-taught, but when I get too deep into the mud I ask somebody else for help. From what I can see, shader coding has some very strange aspects compared to CPU programming, so I feel a little lost.

My first concern is performance. I read everywhere that code should be moved as “high” as possible (CPU, then vs, then fs, in this order). I’ve also read that many short/fast shaders can actually slow things down compared to fewer, longer ones.
I’d like some opinions from people who have coded enough shaders to be able to tell: roughly, which of the following are faster or slower, so I can build a coding style?

  1. create new variables in the shader
  2. extra operations (to avoid intermediate variables)
  3. function calls (built-in or otherwise, nested)
  4. branching (even nested)
  5. iterating (even nested)

Now, from what I’ve read, opinions about branching are split: some say avoid it at all costs, some say that hasn’t been necessary for years thanks to newer hardware. I find it very cumbersome to avoid ifs and fors.
I know a feel for this develops over time, but a rough indication would help a lot.
Any other tips you could share?

There are different types of branches. First, branches (ifs, loops, etc.) based on “constant” expressions are typically compiled right out of your shader by the shader compiler and never make it to the GPU as actual run-time evaluated branches. So those are free (actually better than free because the compiler typically does dead code elimination and folds the code together as tightly as possible). For instance in this case:


const int   NUM_SLICES = 5;

if ( NUM_SLICES < 10 )
  A();
else
  B();

In this case, the whole “if” likely gets ripped out and what actually runs on the GPU is simply “A();”. That same “constant folding” and dead code elimination is applied recursively to A() and to the rest of your program.

Then there are branches in fragment shaders which are “coherent”. In other words, where neighboring pixels/fragments take the same path. These are relatively efficient because the path that was not chosen by any neighboring pixels/samples doesn’t need to be executed at all. An example would be where the branch condition is based completely on uniform values.

Then there are branches in fragment shaders which are “incoherent”. That is, where neighboring pixels/fragments take different paths. These aren’t as efficient because both paths (or, in general, multiple paths) need to be executed for each fragment in the group, even though the results of many instructions are just “thrown away”, which is wasteful.
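
To make the coherent/incoherent difference concrete, here is a minimal sketch in C of a toy cost model. It is not how any particular GPU works; it only assumes that a group of neighboring fragments runs in lockstep and therefore pays for every path that at least one fragment in the group takes. The function name group_branch_cost and the cost numbers are hypothetical, chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model (an assumption, not real GPU behaviour): fragments in a group
   run in lockstep, so the group pays the cost of path A if any fragment
   takes A, plus the cost of path B if any fragment takes B. */
int group_branch_cost(const bool *takes_a, size_t count, int cost_a, int cost_b)
{
    bool any_a = false;
    bool any_b = false;

    for (size_t i = 0; i < count; ++i) {
        if (takes_a[i]) {
            any_a = true;
        } else {
            any_b = true;
        }
    }

    /* A path that no fragment chose is skipped entirely. */
    return (any_a ? cost_a : 0) + (any_b ? cost_b : 0);
}
```

With path costs of 10 and 30, a group where every fragment takes path A costs 10 in this model, while a group where even one fragment takes path B costs 40.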

And then there are branches in shaders other than fragment shaders. I’m not sure how these are mapped to the GPU’s SIMD, so I’m not really sure about the potential inefficiencies associated with branch execution in neighboring shader invocations. Anyone know more details here?

Thanks for the structured info. So it depends on the compiler on one hand and on the hardware on the other.
Is there a way to test the performance of shaders? Like benchmark software that “plays” shaders and outputs high-precision statistics, mainly FPS.
All I could find on Google is shader benchmark software that tests the GPU.
This way I’d be learning to fish instead of continually asking performance questions.
My software running the shaders displays FPS but it varies too much, like from 50 to 1000 FPS.

Benchmarking is what you want, but FPS (frames per second) is not the metric to use, because it is non-linear. Use “ms per frame” (the reciprocal), which is linear. Read this:

My software running the shaders displays FPS but it varies too much, like from 50 to 1000 FPS.

Converting to ms/frame (which is linear), that’s 1ms to 20ms per frame. Now you need to start isolating what you’re doing to see what is consuming the most time. Optimize iteratively until it is fast enough. Post questions on specifics of what you’re doing if you want optimization ideas from others.
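
As a rough illustration of the conversion, here is a small C sketch. The helper names fps_to_ms and average_frame_ms are hypothetical, and averaging over a window of frames is only one assumed way of steadying a reading that jumps around.

```c
#include <stddef.h>

/* Convert a frames-per-second reading to milliseconds per frame. */
double fps_to_ms(double fps)
{
    return 1000.0 / fps;
}

/* Average a window of per-frame times (in ms) to smooth out a jittery
   reading. Returns 0.0 for an empty window. */
double average_frame_ms(const double *frame_ms, size_t count)
{
    if (count == 0) {
        return 0.0;
    }

    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        total += frame_ms[i];
    }
    return total / (double)count;
}
```

fps_to_ms(1000.0) gives 1 ms and fps_to_ms(50.0) gives 20 ms, matching the figures above. It also shows why FPS misleads: dropping from 1000 to 500 FPS costs only 1 ms per frame, while dropping from 60 to 30 FPS costs about 16.7 ms.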

For some reason, you can get a lot of performance from changing this:

for(int i = 0; i < soManyLoops; ++i)

…to this:

// start at soManyLoops - 1 so the iteration count matches the forward loop
for(int i = soManyLoops - 1; i >= 0; --i)

I use this in a fragment shader that calculates soManyLoops at the beginning. This gives me a ~27% performance gain in one specific workload!
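
One thing worth checking when reversing a loop like this is that both forms run the same number of iterations, otherwise the timing comparison is not like for like. A minimal sketch in C (the function names are hypothetical) that counts iterations for the forward loop, a reversed loop starting at n - 1, and a reversed loop starting at n:

```c
/* Forward form: i runs over 0 .. n-1. */
int count_forward(int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        ++count;
    }
    return count;
}

/* Reversed form starting at n - 1: i runs over n-1 .. 0,
   the same number of iterations as the forward form. */
int count_reverse(int n)
{
    int count = 0;
    for (int i = n - 1; i >= 0; --i) {
        ++count;
    }
    return count;
}

/* Reversed form starting at n: i runs over n .. 0,
   which is one iteration more than the forward form. */
int count_reverse_from_n(int n)
{
    int count = 0;
    for (int i = n; i >= 0; --i) {
        ++count;
    }
    return count;
}
```

For n = 5 the first two functions return 5 and the third returns 6.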

Thank you both. Dark Photon, that’s a real revelation there. Funny how a misconception sometimes gets to be adopted by a majority as a standard.
I’d be more than happy if there was a way to measure the duration of an arbitrary code block.

Osbios, I looked it over like 4 times and couldn’t see a difference. The 5th time I saw it; it’s so subtle that I didn’t imagine it would have such an impact: the first loop reads the value of soManyLoops for the condition check on every iteration, while the second compares against a numeric constant directly.

So, please, tell me how to measure the speed of a piece of code.