Shader length vs speed

It’s fairly obvious that a longer fragment shader makes the GPU process more operations, and will therefore take longer to execute.

I’ve been surprised lately to notice that a shader is also slower if it has some code present but not executed.

In my case, I have a for loop with a condition that is always false (controlled through uniforms), but that shader is a lot slower than the same shader with the non-executed code removed.

for (int a = 0; a < 8; a++)
{
    if (x[a]) // always false
    {
        // […] bunch of operations
    }
}

is much slower than
for (int a = 0; a < 8; a++)
{
    if (x[a]) // always false
    {
    }
}

Does anybody understand how this works?

Unless you have a GF6800, all code is executed, and the results from false branches are simply ignored.

I have a GF6800 ultra…

The compiler detects the empty statement and removes the conditional altogether (in turn, it can remove the loop if there is nothing else inside it).

That is strange.
I made a test with a GF6800GT:

Fragment1 (fragment shader):

float fDiffuse = dot(normal, vcLightDir);
// … rest of the shader: bump mapping with diffuse, specular with gloss map, ...

Fragment2:

float fDiffuse = dot(normal, vcLightDir);
if (fDiffuse < 0.1)    // 0.1 = reference value for testing
    discard;
// … rest of the shader

Fragment 3:

float fDiffuse = dot(normal, vcLightDir);
if (fDiffuse > 0.1) {
    // … rest of the shader
}
else {
    gl_FragColor = vec4(0.0, 0.0, 0.0, 0.0);
}

Fragment1 is faster than Fragment2, and Fragment2 is faster than Fragment3.

I know that on pre-GF6800 hardware the whole fragment shader is always executed, so this behaviour is correct there. But the GF6800 claims (IIRC) dynamic flow control (Shader Model 3), so Fragment3 should be faster than Fragment1 (there is a large number of operations inside the if statement). I’m using ForceWare 71.20.
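For completeness, here is a self-contained sketch of the Fragment3 pattern I’m testing (the lighting math is reduced to a stand-in, and names like normalMap, lightDir and vTexCoord are placeholders, not my real shader):

uniform sampler2D normalMap;
uniform vec3 lightDir;          // tangent-space light direction (placeholder)
varying vec2 vTexCoord;

void main()
{
    vec3 normal = normalize(texture2D(normalMap, vTexCoord).xyz * 2.0 - 1.0);
    vec3 vcLightDir = normalize(lightDir);
    float fDiffuse = dot(normal, vcLightDir);

    if (fDiffuse > 0.1) {
        // stand-in for the expensive bump/specular path
        gl_FragColor = vec4(vec3(fDiffuse), 1.0);
    } else {
        gl_FragColor = vec4(0.0, 0.0, 0.0, 0.0);
    }
}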

I think I heard that performance gains with dynamic branching only appear if the conditional blocks are quite big, i.e. try shader 3 with at least 10 instructions in the ‘if’ and ‘else’ blocks.
… to be verified.

EDIT: silly me, I did not read the previous post carefully…

But maybe the case with ‘else’ is not frequent enough?

Maybe their glSlang support is not that far along yet; maybe it’ll work as expected if you use NV_fragment_program2.

Just a guess.

Jan.

>>In my case, I have a for loop with a condition that is always false (controlled through uniforms), but that shader is a lot slower than the same shader with the non-executed code removed.<<

It’s not possible to optimize on uniforms at compile time. The code must be generated, and if there is a branch, it is not free.
As said, the empty-body case will of course not generate any code.
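A minimal sketch of the difference (the names here are made up for illustration): the constant condition can be folded at compile time, the uniform one cannot:

uniform bool x[8];        // value unknown when the shader is compiled
const bool y = false;     // known at compile time

void main()
{
    vec4 color = vec4(0.0);
    for (int a = 0; a < 8; a++)
    {
        if (x[a])                 // code must be generated for this branch
            color += vec4(0.1);
        if (y)                    // provably dead: the compiler removes it
            color += vec4(0.1);
    }
    gl_FragColor = color;
}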

Jan, GLSL on NVIDIA is using NV_vertex_program* as the backend. Get the NVemulate tool on the developer.nvidia.com site and look at assembly dumps.

You’re absolutely right, Relic. The code needs to remain after it’s compiled. I still don’t understand, though, why code that exists but isn’t executed slows down the process. On a CPU it surely wouldn’t change the performance.

I have noticed the same thing on my 6800 card. I am hoping that this will get better with future drivers, but I worry that it is a hardware limitation. My guess is that a branch is executed by stepping through all of the instructions one by one but treating them as no-ops. Just a guess. We may have to wait for future hardware generations; these programmable units are still pretty immature compared to a CPU.

I was just reviewing the release notes associated with the version 60 drivers and found the following:

"NVIDIA’s GLSL implementation does not (currently) allow control flow to depend on uniform parameters in the fragment domain. Some control flow dependent on uniform parameters is allowed in the vertex domain (except for NV1x and NV2x GPUs) but this is not recommended due to poor performance.

In general, control flow dependent on uniform parameters is not recommended because it may well require the expensive recompilation of a shader at run-time. Instead, you should compile and link a stable of program objects for the uniform values you expect to often use where the uniform value is instead handled as a constant."
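In other words, instead of branching on a uniform you compile one program object per case, with the value baked in as a constant. A sketch (the names are made up), where the application prepends the #define line before compiling each variant:

// Variant A is compiled with this line; variant B is compiled without it.
#define USE_SPECULAR 1

uniform vec4 diffuseColor;
uniform vec4 specularColor;

void main()
{
    vec4 color = diffuseColor;
#ifdef USE_SPECULAR
    color += specularColor;   // present only in variant A’s source
#endif
    gl_FragColor = color;
}

The application then switches between the program objects instead of changing a uniform.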

hdg, those notes applied to the Release 60 drivers. Release 65 drivers include support for fragment level branching/looping and vertex texture fetches.

Yes, but I think those release notes are for the first versions of the 6x.yy drivers. With the first versions, if you run the GLSL Shading Language Demo (from 3DLabs: http://developer.3dlabs.com/openGL2/downloads/index.htm) you will notice that the Mandelbrot and Julia shaders run very slowly (they use a loop based on a uniform value). But with the latest 6x.yy drivers, or with 70.xx, branching and looping are implemented. This is part of the ‘object assembly’ generated code (using NVemulate) for the Julia shader:

...
ADDR  R3.x, R0.y, c[2];
LOOP  c[3].yxxw;
SLTR  H0.x, R2, c[3].z;
SLTR  H0.w, R1.x, c[4].x;
MULXC HC.x, H0.w, H0;
BRK   (EQ.x);
MOVR  R2.x, R3;
MOVR  R1.w, R4.x;
MULR  R0.w, R1, R2.x;
...
MADR  R2.x, R1.w, R1.w, R0.w;
ENDLOOP;
MULR  R0.w, R1.x, c[7].y;
...

I have also noted that in the test I made, the ‘object assembly’ generated shader does not include the branching (inside the if statement it has the half-light vector normalization, access to the decal texture, the specular computation, and the final sum/multiply of all the calculated and uniform lighting parameters). It calculates the whole shader and, when doing the final sum, multiplies the result conditionally to get one value or the other:

SGTRC HC.w, R1, c[0].z;
...
MADR  result.color.xyz(NE.w), R0, R1, R2;

This is why it is slower.
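In GLSL terms, what the compiler produced is roughly this (a reconstruction for illustration; litColor is a stand-in for the full lighting result, and diffuseColor/fDiffuse come from the surrounding shader):

// Both paths are evaluated unconditionally; the comparison only
// selects which value is written (predication, not a real jump).
vec4 litColor = diffuseColor * fDiffuse;   // always computed in full
gl_FragColor  = (fDiffuse > 0.1) ? litColor : vec4(0.0, 0.0, 0.0, 0.0);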

It is strange; I presumed that with branching it would be faster.
Another surprise is that it doesn’t use the NV_fragment_program2 normalize instruction (I have three normalize calls inside my shader). It is still using the DP3/RSQ pair to normalize. Maybe the ‘unified compiler’ will convert them…

As I mentioned in an earlier post, some of these performance issues are due to immature drivers, while others are due to immature hardware. Driver improvements will come along about every 3 months or so, while hardware advances will be much slower.

The branching capability makes code development much easier and more fun, but the performance is pretty bad. I suspect it will take a few years to get highly optimized branching hardware within the GPU.
