shader length vs speed



vince
12-13-2004, 11:07 AM
It's kind of obvious a longer fragment shader will cause the GPU to process more operations, and will therefore take longer to execute.

I've been surprised lately to notice the shader is also slower if it has some code present that is never executed.

In my case, I have a for loop with a condition that is always false (controlled through uniforms), but that shader is a lot slower than the same shader with the non-executed code removed.

for (a=0; a < 8; a++)
{
if (x[a]) // always false
{
[...] bunch of operations
}
}

is much slower than
for (a=0; a < 8; a++)
{
if (x[a]) // always false
{
}
}

Does anybody understand how this works?

gmeed
12-13-2004, 11:55 AM
Unless you have a GF6800, _all_ code is executed, and the results from false branches are simply ignored.
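
A minimal sketch of what "all code is executed, and the results from false branches are simply ignored" means in practice. This is a CPU-side Python model, not GLSL or driver code, and every name in it (shade_branching, shade_predicated, expensive_body) is made up for illustration. It counts how often the body of the if runs under each scheme.

```python
# Toy model of predicated execution. All names are hypothetical.


def expensive_body(value):
    """Stand-in for the 'bunch of operations' inside the if block."""
    return value * 2.0 + 1.0


def shade_branching(flags, value):
    """CPU-style control flow: the body is skipped when the flag is false.

    Returns (result, number of times the body actually ran).
    """
    result, body_runs = value, 0
    for flag in flags:
        if flag:
            result = expensive_body(result)
            body_runs += 1
    return result, body_runs


def shade_predicated(flags, value):
    """Predicated style: the body always runs, the flag only selects the result.

    Returns (result, number of times the body actually ran).
    """
    result, body_runs = value, 0
    for flag in flags:
        candidate = expensive_body(result)      # computed unconditionally
        body_runs += 1
        result = candidate if flag else result  # select, do not jump
    return result, body_runs
```

With eight flags that are all false, both functions return the input unchanged, but the predicated version still runs the body eight times. That matches the slowdown described in the first post.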

vince
12-13-2004, 12:04 PM
I have a GF6800 ultra...

kehziah
12-13-2004, 12:11 PM
The compiler detects the empty statement and removes the conditional altogether (in turn, it can remove the loop if there is nothing else inside).
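
A rough sketch of that kind of clean-up pass, written in Python over a toy statement tree. The node format and the prune function are invented for this example and say nothing about how the actual driver compiler is built. The sketch also assumes that conditions and loop headers have no side effects, which is what makes dropping them safe.

```python
# Toy dead-code pass. Node shapes (hypothetical):
#   ("op", name)                 a plain instruction
#   ("if", condition, [body])    a conditional
#   ("loop", count, [body])      a counted loop


def prune(nodes):
    """Return a copy of `nodes` with empty ifs and empty loops removed."""
    kept = []
    for node in nodes:
        kind = node[0]
        if kind in ("if", "loop"):
            body = prune(node[2])        # clean the inside first
            if not body:
                continue                 # nothing left inside: drop the construct
            kept.append((kind, node[1], body))
        else:
            kept.append(node)
    return kept


# The second shader from the first post: the loop only holds an empty if.
empty_case = [("loop", 8, [("if", "x[a]", [])])]

# The first shader: the if has a body, so nothing can be removed.
full_case = [("loop", 8, [("if", "x[a]", [("op", "bunch of operations")])])]
```

Here prune(empty_case) yields an empty program, while prune(full_case) comes back unchanged, because the condition depends on a uniform and cannot be resolved when the shader is compiled.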

Cab
12-13-2004, 12:32 PM
It is strange.
I ran a test on a GF6800GT.

Fragment1 (fragment shader):

float fDiffuse=dot(normal, vcLightDir);
… rest of the shader computing bumpmapping with diffuse, specular with gloss map, ...

Fragment2:

float fDiffuse=dot(normal, vcLightDir);
if(fDiffuse<0.1) //0.1=reference value for testing
discard;
… rest of the shader

Fragment3:

float fDiffuse=dot(normal, vcLightDir);
if(fDiffuse>0.1){
… rest of the shader
}
else{
gl_FragColor=vec4(0.0,0.0,0.0,0.0);
}

Fragment1 is faster than Fragment2, and Fragment2 is faster than Fragment3.

I know that on pre-GF6800 hardware the whole fragment shader has to be computed, so this behaviour would be correct there. But the GF6800 claims (iirc) dynamic flow control (Shader Model 3), so Fragment3 should be faster than Fragment1 (there is a large number of operations inside the if statement). I'm using Forceware 71.20.

ZbuffeR
12-13-2004, 01:08 PM
I think I heard that performance gains with dynamic branching will only appear if the conditional blocks are quite big, i.e. try shader 3 with at least 10 instructions in the 'if' and 'else' blocks.
... to be verified.

EDIT : silly me, did not read carefully the previous post...

But maybe the case with 'else' is not frequent enough ?

Jan
12-13-2004, 02:13 PM
Maybe their glSlang support is not that far along yet; maybe it'll work as expected if you use NV_fragment_program2.

Just a guess.

Jan.

Relic
12-14-2004, 02:16 AM
>>In my case, I have a for loop with a condition that is always false (controlled through uniforms), but that shader is a lot slower than the same shader with the non execute code removed.<<

It's not possible to optimize on uniforms at compile time. The code must be generated, and if there is a branch, it is not free.
As said, the empty body case will of course not generate any code.

Jan, GLSL on NVIDIA is using NV_vertex_program* as the backend. Get the NVemulate tool on the developer.nvidia.com site and look at assembly dumps.

vince
12-14-2004, 05:30 AM
You're absolutely right, Relic. The code needs to remain after it's compiled. I still don't understand, though, why code that exists but isn't executed slows things down. On a CPU it surely wouldn't change the performance.

hdg
12-14-2004, 09:11 AM
I have noticed the same thing on my 6800 card. I am hoping that this will get better with future drivers, but I worry that it is a hardware limitation. My guess is that a branch is executed by stepping through all of the instructions one by one but treating them as no-ops. Just a guess. We may have to wait for future hardware generations - these programmable units are still pretty immature compared to a CPU.

hdg
12-14-2004, 01:47 PM
I was just reviewing the release notes associated with the version 60 drivers and found the following:

"NVIDIA's GLSL implementation does not (currently) allow control flow to depend on uniform parameters in the fragment domain. Some control flow dependent on uniform parameters is allowed in the vertex domain (except for NV1x and NV2x GPUs) but this is not recommended due to poor performance.

In general, control flow dependent on uniform parameters is not recommended because it may well require the expensive recompilation of a shader at run-time. Instead, you should compile and link a stable of program objects for the uniform values you expect to often use where the uniform value is instead handled as a constant."
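
A minimal host-side sketch of the "stable of program objects" idea from the notes above, in Python. The template, the ENABLE_EXTRA name, build_variant, and VariantCache are all hypothetical. The actual compile and link step is passed in as a function, since it depends on the GL binding in use (with the GL 2.0 entry points it would typically go through glShaderSource, glCompileShader and glLinkProgram).

```python
# One GLSL source per expected value, with the former uniform baked in
# as a constant. All names here are made up for illustration.

FRAGMENT_TEMPLATE = """\
const bool ENABLE_EXTRA = {enable};   // was a uniform, now a compile-time constant
void main()
{{
    vec4 color = vec4(1.0);
    if (ENABLE_EXTRA) {{
        color *= 0.5;   // placeholder for the expensive block
    }}
    gl_FragColor = color;
}}
"""


def build_variant(enable):
    """Return fragment shader source specialised for one value of the flag."""
    return FRAGMENT_TEMPLATE.format(enable="true" if enable else "false")


class VariantCache:
    """Compile each variant once and reuse it afterwards."""

    def __init__(self, compile_fn):
        self._compile = compile_fn   # caller-supplied: source -> program handle
        self._programs = {}

    def get(self, enable):
        key = bool(enable)
        if key not in self._programs:
            self._programs[key] = self._compile(build_variant(key))
        return self._programs[key]
```

At draw time the application would pick the program for the current value instead of setting a uniform. Because the flag is now a constant, the compiler is free to drop the dead block entirely, at the cost of one program object per value.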

jra101
12-14-2004, 01:57 PM
hdg, those notes applied to the Release 60 drivers. Release 65 drivers include support for fragment level branching/looping and vertex texture fetches.

Cab
12-14-2004, 02:22 PM
Yes. But I think that those release notes are for the first versions of the 6x drivers. With the first versions, if you run the GLSL Shading Language Demo (from 3DLabs: http://developer.3dlabs.com/openGL2/downloads/index.htm) you will notice that the Mandelbrot and Julia shaders run very slowly (they use a loop based on a uniform value). But with the latest 6x.yy drivers or with 70.xx they have implemented branching and looping. This is part of the 'object assembly' generated code (using nvemulate) for the Julia shader:


...
ADDR R3.x, R0.y, c[2];
LOOP c[3].yxxw;
SLTR H0.x, R2, c[3].z;
SLTR H0.w, R1.x, c[4].x;
MULXC HC.x, H0.w, H0;
BRK (EQ.x);
MOVR R2.x, R3;
MOVR R1.w, R4.x;
MULR R0.w, R1, R2.x;
...
MADR R2.x, R1.w, R1.w, R0.w;
ENDLOOP;
MULR R0.w, R1.x, c[7].y;
...

I have also noted that in the test I made, the 'object assembly' generated shader does not include the branching (inside the if statement it has the half-light vector normalization, access to the decal texture, the specular computation and the final sum/multiply of all the calculated and uniform lighting parameters). It computes the whole shader and, when doing the final sum, performs the last multiply-add conditionally to get one value or the other.

SGTRC HC.w, R1, c[0].z;
...
MADR result.color.xyz(NE.w), R0, R1, R2;

This is why it is slower.

It is strange. I would expect it to be faster with real branching.
Another surprise is that it doesn't use the NV_fragment_program2 normalize instruction (I have three normalize calls inside my shader). It is still using the DP3/RSQ pair to normalize. Maybe the 'unified compiler' will convert them...

hdg
12-14-2004, 02:53 PM
As I mentioned in an earlier post, some of these performance issues are immature drivers, while others are immature hardware. Driver improvements will come along about every 3 months or so, while hardware advances will be much slower.

The branching capability makes code development much easier and more fun, but the performance is pretty bad. I suspect it will take a few years to get highly optimized branching hardware within the GPU.