Lots of small shaders or one big one?

At present I’m automatically generating lots of different shader programs, one for each combination of factors such as the number of lights on a model, whether it’s bump-mapped, how many textures it has, etc.
However, shader compilation times are becoming very large, so I’m wondering if I should just write one big shader which conditionally executes the different bits of code dependent on uniform values.
Obviously the monolithic shader will be slower, but in your experience do conditionals really slow down shader execution that much?

Well, more shaders means more compilation time, and AFAIK a single shader with a lot of uniform bools for the switches will still slow things down. I think on SM 3.0 cards this performance penalty would be smaller than on SM 2.0 cards, but I believe it would still be noticeable.

I like the way Valve did it in Half-Life 2. They have one big shader, but with static const bools as switches, and they generate specialized shaders from this ubershader by modifying the values of those bools. In OpenGL you could create a similar system with #defines or const bools in your shader code.
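
For illustration, a minimal sketch of that kind of specialization (NUM_LIGHTS, USE_BUMP and perturbNormal are made-up names here). The application prepends the #define block to the ubershader source, e.g. as an extra string passed to glShaderSource, and compiles one program per combination it actually needs:

// Prepended by the application, one combination per specialized program:
//   #define NUM_LIGHTS 2
//   #define USE_BUMP 1

vec3 n = normalize(normal);
#if USE_BUMP
n = perturbNormal(n);                 // perturbNormal: hypothetical bump function
#endif

for (int i = 0; i < NUM_LIGHTS; ++i)  // constant bound, so it can be unrolled
    AddLight(i);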

[ www.trenki.net | vector_math (3d math library) | software renderer ]

Perhaps a silly question, but could one generate lots of shaders, each containing a function for one thing, and then just attach them as needed in a program object? I.e., say two shaders which modify the normal: one a no-op (no bump), one regular bump mapping. Same for texture mapping coordinates: one no-op, one parallax.

You’d still end up with a bunch of shaders, but they should be smaller and far less numerous than the combinations you can make from them.
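
Something like this, maybe (a sketch; all the names here are invented). The main shader only declares the function, and you attach whichever shader object defines it before linking:

// main shader object: declares the hook, the definition comes from elsewhere
vec3 surfaceNormal(vec3 n);

// "no bump" shader object: the no-op variant
vec3 surfaceNormal(vec3 n) { return n; }

// "bump" shader object: the perturbing variant (perturbNormal is hypothetical)
vec3 surfaceNormal(vec3 n) { return perturbNormal(n); }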

In OpenGL you could create a similar system with #defines or const bools in your shader code.
Yes, but how exactly would this deal with the shader compile-time issue? You’re still having to compile and link all of those shaders.

Whether it’s done with #defines or with compile+link+buildmain, the build time of a particular shader program is ridiculously slow. Hopefully it’s something that will be addressed in GL 3.1 with binary chunks, or maybe the driver could use the registry and CRCs to cache previously built shaders.
As a workaround, I store persistence information next to the scene file listing all the shader module combinations I’ve required in previous runs with that scene, then read this file and pre-build all those shader programs at start-up. Yes, it slows start-up down, but it means the user doesn’t experience quite as many stop-start scenarios.
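
The persistence file could be as simple as one line per program combination, for example (a made-up format, just to illustrate):

# shader combos used with this scene in previous runs (made-up format)
lights=4 bump=1 textures=2
lights=2 bump=0 textures=1
lights=1 bump=1 textures=1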
And yes, using uniform controlled conditionals has a dramatic effect on performance even on pretty new cards - so I wouldn’t recommend using them in your bulk shaders.
By the way, have a read of this Crytek paper, where amongst other things they ponder the same question.

In my experience, conditionals don’t significantly slow down execution if they depend only on uniforms. After all, the slowdown primarily comes from needing to execute every encountered control path for each pixel in a block.

If every pixel on the screen (and thus every pixel in every block) follows the same control path, it shouldn’t be much slower than if the code were written without conditionals.

That’s what I observed on the 7-series cards, anyway.
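
In other words, a branch like this (a sketch; useBump and perturbNormal are placeholders) should be close to free, because the uniform has the same value for every fragment in the draw call and no pixel block ever diverges:

uniform bool useBump;

vec3 n = normalize(normal);
if (useBump)              // same outcome for all fragments in the call,
    n = perturbNormal(n); // so every pixel block takes a single path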

You’re right, the 7 series improved some of these things dramatically over the 6 series, but there’s still a measurable cost. Plus, lots of my customers are still on 6 series cards at the moment.
If you don’t mind taking a frame-time performance hit to avoid a longer start-up time, then by all means use uniform branching. Personally, I (and my customers) would rather pay a one-off price at start-up than take a drop in frame rate. Horses for courses.

Thanks for the link, knackered. Looks like the answer is that I’ll have to come up with a compromise between dynamic branching and multiple shaders.
I’ll do some tests to see how badly uniform-dependent branching slows things down on an 8800. Lindley, fingers crossed I’ll get the same not-much-slower result as you!

@Lord crc: I had the same idea, but unfortunately it’s the shader program linking step that takes the time, not the compilation of individual shaders.

Keep in mind that changing uniforms used for static conditions can force a shader recompilation inside the driver (much like changing fixed-function state on hardware that doesn’t actually support the FFP on-chip and has the driver emulate it with shaders).

Some results for you.

With the following code snippet:

for (int i = 0; i < lightCount; ++i)
{
    AddLight(i); // simple Blinn lighting
}

If lightCount is a const int equal to 4, I got 102 fps.
If lightCount is a uniform equal to 4, I got 70 fps.
Ouch.

However, I tried the following

if (lightCount < 4)
{    
    if (lightCount < 2)
    {
        if (lightCount == 1)
        {
            AddLight(0);
        }
    }
    else if (lightCount == 2)
    {
        AddLight(0);
        AddLight(1);
    }
    else
    {
        AddLight(0);
        AddLight(1);
        AddLight(2);
    }
}
else if (lightCount < 6)
etc.

With a const int the frame rate was the same 102 fps, but with a uniform the frame rate only dropped to 97 fps!
So as I interpret it, the lesson is that ifs cause a tolerable loss of speed, but fors are to be avoided at all costs.

@Nikolai: I timed the frames during which the uniform value was changed (from 0 up to 8), but couldn’t detect any difference, so it seems that (in my drivers at least - ForceWare 158.19) the shader wasn’t being recompiled.

So as I interpret it, the lesson is that ifs cause a tolerable loss of speed, but fors are to be avoided at all costs.

Not quite. Yes, loops are expensive, but drivers unroll loops if they know the exact number of iterations at compile time, which is the case when using a const.
So it’s not loops that we have to avoid, but loops with a variable number of iterations.
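
To make the contrast explicit (a sketch; both variants ask for four lights and differ only in the type of the bound):

// variant 1: constant bound - the driver unrolls this to four AddLight calls
const int lightCount = 4;
for (int i = 0; i < lightCount; ++i)
    AddLight(i);

// variant 2: uniform bound - this stays a real loop on the GPU
uniform int lightCount;
for (int i = 0; i < lightCount; ++i)
    AddLight(i);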

When using if’s you “unroll” the loop yourself.

An alternative solution could look like this:

for (int i = 0; i < maxLights; ++i)  // maxLights: compile-time constant
{
  if (i >= lightCount) break;        // lightCount: uniform
  AddLight(i);
}

This way the driver can unroll the loop, and it will execute only one “if” instruction for one light, two “if”s for two lights, and so on. So it should be faster for a small number of lights. Theoretically :wink:

Back to the topic of many shaders vs. one shader.

The best solution would be to put functions into separate shader objects, compile them, and link them in different combinations. The problem is that you can pretty much assume the driver will recompile the entire shader when you perform the link.
If only shader compilation really compiled into a form that could be quickly linked with other shader objects (which I believe was the intent of the GLSL spec’s authors), such discussions would never take place - there would simply be no problem.

But we have to deal with what we have.
For some applications it’s the compile time that matters, and for some it’s the final rendering performance.

A third solution is to have no conditionals at all and use dummy values instead. For example, if you want to use 3 out of 8 lights, you set lights 3-7 to a color of (0.0, 0.0, 0.0). They will be processed but will not affect the rendered image.
This will probably be slower than using if’s when your shader supports maxLights=8. On the other hand, if your shader only needs to support maxLights=2, it can be faster than if’s.
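
A sketch of the shader side (assuming the application has set the colors of lights 3-7 to black):

// Constant bound, so the driver can unroll the loop; the black lights
// are processed like any other but contribute nothing to the result.
for (int i = 0; i < 8; ++i)
    AddLight(i);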

What’s my point?
Why use just one solution when you can use a hybrid of two or even more? Why limit yourself?

An example:

if (lightCount <= 4)
{
  if (lightCount <= 2)  // one or two lights
  {
    AddLight(0);
    AddLight(1);
  }
  else                  // three or four lights
  {
    AddLight(0);
    AddLight(1);
    AddLight(2);
    AddLight(3);
  }
}

Now we process:
for 1-2 lights - we process 2 lights
for 3-4 lights - we process 4 lights

So on average we process half a light too many, but we always execute one less ‘if’ instruction. It’s a trade. On some GPUs an ‘if’ instruction may be as expensive as processing one light - in that case this is a clear win. Again, theoretically.

Sigh, what a crazy situation we GL programmers have been forced into. We deserve medals for sticking with this bloody API.

If only shader compilation really compiled into a form that could be quickly linked with other shader objects (which I believe was the intent of the GLSL spec’s authors), such discussions would never take place - there would simply be no problem.
I think you’re underestimating what needs to happen during the link process.

Program linking was, almost certainly, never expected to be really fast. Now yes, nVidia’s implementation of glslang does basically stick all the shader files together and compile them again, which is clearly unnecessary. But linking was never going to be fast, because it simply has too much to do.

After the compile step, you have some functions and so forth that are perhaps in a pseudo-assembly form, or maybe an expression tree or something. What matters is that you do not have anything remotely like real assembly. That’s because during linking you still need to decide on things like inlining, optimization, and so on. Because linking is the final step, it’s your last chance to make the decisions that determine whether a large program fits into the available space. And you can’t do things like uniform assignment until link time either.

So, in short, shader compiling really buys you the up-front parsing and the building of a symbol table. Much of the heavy lifting of assembly generation has to happen later. Linking was never going to be a particularly fast step, because it has to do too much.

Another thought on the topic… Would it be possible to compile the shader in a different thread? In that case, you could initially compile a few “uber” shaders which use uniforms, etc. Then, in a background thread, compile the optimized versions and start using them as they become ready…

Just a wild idea :slight_smile:

There are other defects in the language:
Look here under “Search for the most complex shader”

So much for a platform-independent API.
I think these things really need to be addressed.
Almost no one uses fixed function anymore, so this IS an issue.

I think you’re underestimating what needs to happen during the link process.
Yeah, I guess it sounded that way.

On the CPU, linking usually takes less time than compilation, but on the GPU it’s the other way around.

Still, some optimizations can be performed during compilation (local optimizations), while some have to wait for linking (loop unrolling, inlining, throwing away inactive uniforms). Doing more at compile time would probably save a noticeable amount of time, but I don’t see anyone willing to put that much effort into it.

loop unrolling in the link stage??

loop unrolling in the link stage??
The final decision on whether to unroll a loop cannot be made before all shader objects are attached to the program object and the instruction counts in both cases are known. So it must be done during the LinkProgram call. That’s what I meant. I didn’t mean the linking operation itself, although you could call attaching shader objects to a program object the first step of the linking process :slight_smile:

Sigh, what a crazy situation we GL programmers have been forced into. We deserve medals for sticking with this bloody API.
Yes, we’re the best :smiley:

Would it be possible to compile the shader in a different thread?
Not really. You shouldn’t access one OpenGL context from many threads. And it’s of no use to compile shaders in another context.
Yeah, but that was some creative thinking :slight_smile:

The real question about NVIDIA’s optimisation strategy is whether all the time they spend rebuilding shaders in the driver is worth it - if it takes more time to do that than to just use a less-optimised shader, then why bother?

As far as the GPU/CPU differences in compiling go: a compiler optimizes each file on its own, inlining when it can, then links. What NVIDIA (and ATI?) is(are) doing is half-compiling each file, globbing the pieces together, then optimizing this giant file. That made sense when GPUs didn’t have branching hardware and branches kicked you to software - if you could get rid of loops/branches, you wanted to. Now, the overhead of recompiling all the time is probably much higher than the aggregate overhead of branches. NVIDIA just still hasn’t changed their GLSL compiler to do more work in the compile stage and less in the link stage.

Also, as for uniform-based branching, the G80, at least, should be OK with it - as long as every thread (actually, every 8-thread block) does the same thing, there should be no performance penalty. This is based on the CUDA docs, but it’s the same processor whether you’re using it for CUDA or graphics. IIRC, the GeForce 7xxx and R500 aren’t too bad at branching, either. I don’t know about R600, though - I haven’t had a chance to play with one.

There are other defects in the language:
That’s not a defect in the language. That’s a defect in driver development.

And it’s of no use to compile shaders in another context.
That’s what shared objects are for. You can share them between contexts.

Now, the overhead of recompiling all the time
See, that’s the thing. You compile once. The fact that nVidia feels the need to recompile a shader just because you changed a uniform into a magic number is their problem; it isn’t something that OpenGL can solve. Also, this is a different issue from the compile vs. link thing.

I suspect that nVidia will not be doing that in GL 3.0, since 3.0 has a mechanism that allows you to specify that certain uniforms are constant (and what their values are) at link time. As such, there’s no need for this dynamic nonsense that nVidia does, since the user can just compile two separate programs themselves.