For loops in GLSL and REP Instruction



LarsMiddendorf
04-10-2005, 03:59 AM
I want to iterate over the lights in the fragment shader, and a static REP loop seems to be a good solution. The other idea would be to use a #define and recompile the shader for each light count, but I've read that with ps30 it's possible to create uber shaders.
The problem is that the GLSL compiler (GF6800, driver version 76.41) always creates a LOOP/ENDLOOP loop with the maximum iteration count of 255 and a break instruction. What should a for loop look like in order to use static branching? Thanks.



uniform int lightcount;
for(int i=0;i<lightcount;i++)
{
}

nv_fragment_program2:


LOOP c[2];
SLTRC HC.x, R1.y, c[3];
BRK (EQ.x);
ADDR R1.y, R1, c[1].x;
ENDLOOP;

I want the compiler to generate something similar to:



REP program.local[0];

ENDREP;

Korval
04-10-2005, 12:06 PM
That's up to nVidia's compiler. You've done all you can to let it know what kind of loop it should build; it's now up to them to make their compiler better.

LarsMiddendorf
04-10-2005, 12:45 PM
Thanks for your reply. I hoped there was a special pattern for this kind of loop. There is a significant speed difference between "#define lightcount" and "uniform lightcount" with full dynamic branching. In D3D there are constant integer registers for loops even with ps20, and the driver unrolls the loop and recompiles the shader internally. Hopefully this behaviour will also be implemented for uniform variables in GLSL, because compiling and linking a GLSL shader manually with "#define lightcount X" is very slow.
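For readers unfamiliar with the "#define lightcount X" approach mentioned above, here is a minimal C sketch of the application side, assuming one shader variant is built per light count. The names build_variant_source, kShaderBody and LIGHTCOUNT are hypothetical, and the GLSL body is only a placeholder, not the poster's actual shader.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical GLSL body. LIGHTCOUNT is a compile-time constant here,
   so the GLSL compiler is free to unroll the loop or use a static one. */
static const char *kShaderBody =
    "void main()\n"
    "{\n"
    "    vec4 sum = vec4(0.0);\n"
    "    for (int i = 0; i < LIGHTCOUNT; i++)\n"
    "    {\n"
    "        sum += vec4(1.0); // per-light work would go here\n"
    "    }\n"
    "    gl_FragColor = sum;\n"
    "}\n";

/* Writes "#define LIGHTCOUNT <n>" followed by the shader body into out.
   Returns 0 on success, -1 if the buffer is too small. */
int build_variant_source(char *out, size_t out_size, int lightcount)
{
    int written = snprintf(out, out_size, "#define LIGHTCOUNT %d\n%s",
                           lightcount, kShaderBody);
    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

The resulting string would then be handed to glShaderSource (or its ARB equivalent) and compiled. Since compiling and linking is the slow part, an application would probably cache one program object per light count rather than rebuild it every frame.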

LarsMiddendorf
04-12-2005, 03:29 AM
I did some tests and it seems that LOOP and REP are nearly equally fast when the index register is not used. It's surprising how many instructions can be executed without too much performance loss if the loop iteration count is fixed. But in fact it is the unnecessary dynamic-branching BRK instruction that really slows the shader down, by a factor of several. Does anyone know if this will be fixed in one of the next driver releases?

V-man
04-12-2005, 04:27 AM
I guess the penalty comes in when different fragments are processed differently due to dynamic branching.

Try this one


uniform int lightcountMinusOne;
for(int i=lightcountMinusOne; i>=0; i--)
{
}

Good luck!

LarsMiddendorf
04-12-2005, 05:55 AM
That could have been the special pattern, but it generated nearly the same code; only the SLTRC is replaced with SGERC. All fragments run through the same path of the shader.
I've also checked the output of the Cg compiler with profile fp40 and it also used the BRK instruction.
But the DirectX shader compiler (fxc.exe) generates the correct result using profile ps_3_0, even if the loop starts at zero.


void main(out float4 color0:COLOR0, const uniform int lightcount)
{
float temp=0;
for(int i=0;i<lightcount;i++)
{
temp=temp+i;
}
color0=float4(1,1,1,1)*temp;
}

ps_3_0
def c0, 0, 1, 0, 0
mov r0.z, c0.x
mov r0.w, c0.x
rep i0
add r0.z, r0.z, r0.w
add r0.w, r0.w, c0.y
endrep
mov oC0, r0.z

V-man
04-12-2005, 09:49 PM
fxc is smarter!

How is i0 set?

LarsMiddendorf
04-13-2005, 12:06 AM
The parameter lightcount is assigned to i0.
I compiled only the pixel shader, without an .fx file. The shader itself is meaningless, but the interesting part is the loop. Here is the output:

//
// Generated by Microsoft (R) D3DX9 Shader Compiler 9.06.168.0000
//
// fxc /Tps_3_0 fragmentprogram.cg
//
//
// Parameters:
//
// int $lightcount;
//
//
// Registers:
//
// Name Reg Size
// ------------ ----- ----
// $lightcount i0 1
//

ps_3_0
def c0, 0, 1, 0, 0
mov r0.z, c0.x
mov r0.w, c0.x
rep i0
add r0.z, r0.z, r0.w
add r0.w, r0.w, c0.y
endrep
mov oC0, r0.z

// approximately 7 instruction slots used

A better GLSL implementation could detect that the result depends only on uniforms and constants and run this loop on the CPU, reducing the shader to one mov instruction.
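To make the last point concrete, here is a small C sketch of what "running the loop on the CPU" could look like for the test shader above. The function name loop_sum is hypothetical, and this only illustrates the idea. It does not describe what any driver actually does.

```c
/* Mirrors the loop of the test shader: temp = temp + i
   for i = 0 .. lightcount - 1. The result depends only on the
   uniform lightcount, so it is the same for every fragment. */
float loop_sum(int lightcount)
{
    float temp = 0.0f;
    for (int i = 0; i < lightcount; i++)
    {
        temp += (float)i;
    }
    return temp;
}
```

The application (or, ideally, the driver) would evaluate this once whenever lightcount changes and upload the result as a float uniform, leaving the shader with a single mov into the output colour.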