For loops in GLSL and REP Instruction

I want to iterate over the lights in the fragment shader, and a static REP loop seems to be a good solution. The other idea would be to use a #define and recompile the shader for each light count, but I’ve read that with ps30 it’s possible to create uber shaders.
The problem is that the GLSL compiler (gf6800, driver version 76.41) always creates a LOOP/ENDLOOP loop with the maximum iteration count of 255 and a break instruction. What should a for loop look like in order to use static branching? Thanks.

  
uniform int lightcount;
for (int i = 0; i < lightcount; i++)
{
    // loop body (per-light shading) omitted
}

nv_fragment_program2:

 
LOOP c[2];
SLTRC HC.x, R1.y, c[3];
BRK   (EQ.x);
ADDR  R1.y, R1, c[1].x;
ENDLOOP;

I want the compiler to generate something similar to:

 
REP program.local[0];

ENDREP;

That’s up to nVidia’s compiler. You’ve done all you can to let it know what kind of loop it should build; it’s now up to them to make their compiler better.

Thanks for your reply. I had hoped there was a special pattern for this kind of loop. There is a non-trivial speed difference between “#define lightcount” and “uniform lightcount” with full dynamic branching. In D3D there are constant integer registers for loops even with ps20, and the driver unrolls the loop and recompiles the shader internally. Hopefully this behaviour will also be implemented for uniform variables in GLSL, because manually compiling and linking a GLSL shader with “#define lightcount X” is very slow.
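One way to soften the cost of the “#define lightcount X” approach is to build each variant only once and reuse it. The following is a minimal host-side sketch in C, not code from this thread: `shader_body`, `build_variant_source`, `get_variant`, `MAX_LIGHTS` and the stand-in `fake_compile` are all hypothetical names, and `fake_compile` would have to be replaced with the real compile-and-link calls.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LIGHTS 8      /* assumed upper bound, not taken from the thread */
#define MAX_SOURCE 4096   /* assumed maximum size of one shader variant */

/* Hypothetical shader body that treats lightcount as a compile-time constant. */
static const char *shader_body =
    "void main()\n"
    "{\n"
    "    for (int i = 0; i < lightcount; i++)\n"
    "    {\n"
    "    }\n"
    "}\n";

/* Writes "#define lightcount N" followed by the body into out.
   Returns the number of characters written, or -1 if out is too small. */
static int build_variant_source(int lightcount, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "#define lightcount %d\n%s",
                     lightcount, shader_body);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}

/* Stand-in for the real compile-and-link step. It only counts how often it
   is called and returns a fake non-zero handle. */
static int compile_calls = 0;
static unsigned int fake_compile(const char *source)
{
    (void)source;
    return (unsigned int)(++compile_calls);
}

/* One cached handle per light count; 0 means "not compiled yet". */
static unsigned int variant_cache[MAX_LIGHTS + 1];

/* Returns the handle for the given light count, compiling it on first use.
   Returns 0 if the light count is out of range or the source does not fit. */
static unsigned int get_variant(int lightcount)
{
    char source[MAX_SOURCE];
    if (lightcount < 0 || lightcount > MAX_LIGHTS)
        return 0;
    if (variant_cache[lightcount] == 0) {
        if (build_variant_source(lightcount, source, sizeof source) < 0)
            return 0;
        variant_cache[lightcount] = fake_compile(source);
    }
    return variant_cache[lightcount];
}
```

Calling `get_variant(3)` twice triggers only one compile in this sketch, so the slow step is paid once per light count rather than once per change. Whether that is acceptable depends on how many distinct light counts actually occur.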

I did some tests, and it seems that LOOP and REP are nearly equally fast when the index register is not used. It’s surprising how many instructions can be executed without much performance loss if the loop iteration count is fixed. In fact, it is the unnecessary dynamic-branching BRK instruction that really slows the shader down, by a factor of several. Does anyone know if this will be fixed in one of the next driver releases?

I guess the penalty comes in when different fragments are processed differently due to dynamic branching.

Try this one

uniform int lightcountMinusOne;
for (int i = lightcountMinusOne; i >= 0; i--)
{
    // loop body (per-light shading) omitted
}

Good luck!

That could have been the special pattern, but it generated nearly the same code; only the SLTRC is replaced with SGERC. All fragments run through the same path of the shader.
I’ve also checked the output of the Cg compiler with profile fp40 enabled, and it also used the BRK instruction.
But the DirectX shader compiler (fxc.exe) generates the correct result using profile ps_3_0, even if the loop starts at zero.

void main(out float4 color0:COLOR0, const uniform int lightcount)
{
  float temp=0;
  for(int i=0;i<lightcount;i++)
  {
    temp=temp+i;
  }
  color0=float4(1,1,1,1)*temp;
}
  
  ps_3_0
   def c0, 0, 1, 0, 0
   mov r0.z, c0.x
   mov r0.w, c0.x
   rep i0
     add r0.z, r0.z, r0.w
     add r0.w, r0.w, c0.y
   endrep
   mov oC0, r0.z

fxc is smarter!

How is i0 set?

The parameter lightcount is assigned to i0.
I compiled only the pixel shader, without an .fx file. The shader itself does nothing meaningful; the interesting part is the loop. Here is the output:

//
// Generated by Microsoft (R) D3DX9 Shader Compiler 9.06.168.0000
//
//   fxc /Tps_3_0 fragmentprogram.cg
//
//
// Parameters:
//
//   int $lightcount;
//
//
// Registers:
//
//   Name         Reg   Size
//   ------------ ----- ----
//   $lightcount  i0       1
//

    ps_3_0
    def c0, 0, 1, 0, 0
    mov r0.z, c0.x
    mov r0.w, c0.x
    rep i0
      add r0.z, r0.z, r0.w
      add r0.w, r0.w, c0.y
    endrep
    mov oC0, r0.z

// approximately 7 instruction slots used

A better GLSL implementation could detect that the result depends only on uniforms and constants and run this loop on the CPU, reducing the shader to a single mov instruction.
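To illustrate that point, here is a small C sketch, under the assumption that the listing above is read literally: `emulate_rep_loop` mirrors the ps_3_0 instructions one by one, and `folded_result` is the closed form 0 + 1 + ... + (n-1) that a host or compiler could compute once per uniform update. Both function names are made up for this example.

```c
/* Mirrors the ps_3_0 listing: r0.z accumulates temp, r0.w plays the role of i. */
static float emulate_rep_loop(int lightcount)
{
    float r0_z = 0.0f;                            /* mov r0.z, c0.x */
    float r0_w = 0.0f;                            /* mov r0.w, c0.x */
    for (int rep = 0; rep < lightcount; rep++) {  /* rep i0 */
        r0_z = r0_z + r0_w;                       /* add r0.z, r0.z, r0.w */
        r0_w = r0_w + 1.0f;                       /* add r0.w, r0.w, c0.y */
    }                                             /* endrep */
    return r0_z;                                  /* mov oC0, r0.z */
}

/* Closed form of 0 + 1 + ... + (n-1); depends only on the uniform value. */
static float folded_result(int lightcount)
{
    if (lightcount <= 0)
        return 0.0f;
    return 0.5f * (float)lightcount * (float)(lightcount - 1);
}
```

For small light counts both functions return the same value, so the per-fragment loop could in principle be replaced by a single constant that is uploaded whenever lightcount changes. A real lighting loop reads per-fragment data and cannot be folded this way, so this applies only to the test shader.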
