Shader with loop vs multipass

I’m getting some strange results on GeForce 6 / GeForce 7.
I have a shader that computes a Perlin noise value, and I use it in a volume renderer that renders multiple planes. Works OK.
I decided to switch from multipass rendering to just one plane with a shader that contains a loop.

Now there is a performance problem here. Typically, multipass rendering will cost:
n * texture access
n * lighting
n * framebuffer blend
And shader with loop will cost:
n * texture access
n * lighting
1 * framebuffer blend
It should be faster - the loop saves n-1 framebuffer blends - but it turns out to be ~2x slower.
When I add some optimizations to the shader that make it skip empty space, then I get the expected performance boost.
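The skipping works roughly like this (a sketch, not my exact code - the coarse step size and the way the threshold is used here are made up):

float layer = 0.0;
while (layer < 16.0)
{
  float value = 0.0; // noise computations, as in the full shader
  if (value > 0.45)
  {
    // blending and lighting, then advance by the fine step
    layer = layer + 0.01;
  }
  else
  {
    layer = layer + 0.05; // empty space: advance by a coarser step
  }
}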
Final results are as follows:

Multipass (always 1600 planes):            197-200ms
Single pass with loop (always 1600 steps): 382-387ms
Single pass with loop (varying step):      59-64ms

I’m rather interested in a general discussion / exchange of experience. What could cause such a severe performance drop with 1600 steps in a loop compared to 1600 passes?

Each step/pass accesses one texture 6 times (to generate the Perlin noise), and perhaps that’s the cause. The texture is aligned parallel to the planes, so it could allow better caching in the multipass approach. What do you think?

Have you examined the disassembled code with the NV/ATI tools (or tried the DX10 HLSL fx compiler - you would only need to change the shader very slightly)? That will probably reveal some bad branching.

Some time ago I had a similar problem. I had to use the textureXXXXXLOD (explicit LOD) instructions, because it seems some hardware can’t process complex texture fetches inside a branched loop and the compiler goes mad. Also make sure you don’t use derivatives inside the loop, or it will be unrolled. Also take into consideration that current SM3 hardware lacks fragment shader constant indexing.
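For clarity, an explicit-LOD fetch looks like this (the sampler name is hypothetical, and whether the fragment stage accepts texture2DLod depends on the driver/extension, e.g. ARB_shader_texture_lod - treat it as a sketch):

uniform sampler2D noiseTex; // hypothetical name
// explicit LOD: no implicit derivatives are involved, so the fetch
// stays well-defined inside a dynamically branched loop
vec4 s = texture2DLod(noiseTex, uv, 0.0); // uv = whatever coordinates you use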

hope it helps

I’m using just texture2D, without LOD/derivatives. I don’t use constant indexing, but thanks for the tip - it will probably be helpful in the future.
OK, just to give some idea of the shader structure (as I said, I’m rather interested in the general discussion):

// (...) some computations
float layer = 0.0;
while (layer < 16.0)
{
  // 6x texture2D - accessing the same 128x128 GL_LUMINANCE texture
  // update texture coordinates (6 sets held in gl_TexCoord[0] - [2])
  float value = /* computations based on the fetched values */;
  if (value > 0.45)
  {
    // blending layer and lighting computations
  }
  layer = layer + 0.01;
}
// write the fragment's color

Please don’t give me advice on how to improve this shader by adding new functionality - I’ve already implemented a better shader. I’m only interested in why this one - being equivalent to drawing 1600 planes using a shader with identical code (except for the loop) - runs nearly 2x slower.

In the meantime I’ll probably have a look at the disassembled shader.

If I am not mistaken, the maximal loop iteration count on NVIDIA hardware is 256.

In the meantime I’ll probably have a look at the disassembled shader.
Shader disassembly looks ok - loop body and multipass shader are very similar.

If I am not mistaken, the maximal loop iteration count on NVIDIA hardware is 256.
Well, it works with 1600 steps and gives correct results. And with 160 steps or 16 steps, the difference between the shader with a loop and the multipass approach is always the same (~2x).

I also tried a 64x64 GL_LUMINANCE texture (so a 4 KB texture) - no performance gain in either approach. I guess that makes my caching theory less probable.

Only a guess:
The hardware simultaneously operates on blocks of fragments that execute in sync. I do not know the exact numbers, but on GF6/GF7 hardware those blocks are relatively big. Because they operate in sync, if one fragment inside the block needs to do more work, the remaining fragments in the block have to wait for it. For example, if the fragments in a block need {100, 10, 10, 10} iterations respectively, every fragment in the block effectively pays max(100, 10, 10, 10) = 100.

It is possible that the hardware can manage its resources better in the case where processing has fully completed for some fragments (multipass) than in the case where they have to idle while waiting for other fragments to finish their work (loop).

Could you perhaps try a regular for loop with a constant iteration count instead of the while loop?
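That is, something like this (sketch):

for (int i = 0; i < 1600; ++i) // iteration count is a compile-time constant
{
  // same loop body as in the while version
}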

if one fragment inside the block needs to do more work, the remaining fragments in the block have to wait for it.
Yeah, I was thinking of that too. The thing is that with such an unoptimized shader, the number of operations per pixel in a block is nearly equal.
The only difference in execution is that the multipass shader has a sync at every step and the single pass has one sync at the end. This should actually be in favor of the single pass, and this is why. Due to the sync you have:
Multipass cost = max(3, 5) in the first pass and max(6, 2) in the second pass = 11
Single pass cost = max(3+6, 5+2) = 9

It is possible that the hardware can manage its resources better in the case where processing has fully completed for some fragments
This seems more likely.

I was thinking about texture access, because it’s the one resource constantly used by all fragment shaders. I removed the texture access from the shader and it now uses sin() to compute the value, so it is a pure math shader. Still similar results:
130ms in multipass
240ms in single pass
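(The replacement was along these lines - the exact expression here is invented, just to show there is no texture fetch left:)

// pure ALU work instead of the 6 texture fetches
float value = 0.5 + 0.5 * sin(gl_TexCoord[0].x * 13.0 + layer)
                        * sin(gl_TexCoord[0].y * 7.0);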

Originally posted by k_szczech:

float layer = 0.0;
while (layer < 16.0)
{
  // 6x texture2D - accessing the same 128x128 GL_LUMINANCE texture
  // update texture coordinates (6 sets held in gl_TexCoord[0] - [2])
  float value = /* computations based on the fetched values */;
  if (value > 0.45)
  {
    // blending layer and lighting computations
  }
  layer = layer + 0.01;
}

Don’t use floating point loops. PS3.0 implements loops using the aL register (which is integer based). Use this code instead:

int layer = 0;
while (layer < 128)
{
  // 6x texture2D - accessing the same 128x128 GL_LUMINANCE texture
  // update texture coordinates (6 sets held in gl_TexCoord[0] - [2])
  float value = /* computations based on the fetched values */;
  if (value > 0.45)
  {
    // blending layer and lighting computations
  }
  ++layer;
}

And yes, be careful with the maximum number of iterations. Some hardware can do only 128 (you are trying 1600). If you use more, the OpenGL shader will be emulated in software and performance will drop suddenly (check the shader compile log that is returned: for example, ATI always states there whether the shader will run in HW or SW mode; NVIDIA only complains on errors).

If this still doesn’t work, try replacing the “while” with a “for”. The GLSL dynamic shader compiler inside some drivers tends to be a bit stupid sometimes.

On the other hand, GL_LUMINANCE textures can be problematic. Better try a standard RGBA8 texture.

Don’t use floating point loops
Switched to int - no difference.

And yes, be careful with the maximum number of iterations
I’ll keep that in mind, but since it works I assume that’s not the case here. I’m observing similar results with 160 and 16 layers (GeForce 7800 GT).

try replacing the “while” with a “for”
Tried both - equal speed.

GL_LUMINANCE textures can be problematic
Tried RGBA - it worked a bit slower, so it seems my GPU handles GL_LUMINANCE textures pretty nicely. I also mentioned that I tried removing all texture access from the shader, with no effect.

Although your advice hasn’t solved this little mystery, I find it very useful in general. You have gained recognition :slight_smile:

The only difference in execution is that the multipass shader has a sync at every step and the single pass has one sync at the end. This should actually be in favor of the single pass, and this is why. Due to the sync you have:

Multipass cost = max(3, 5) in the first pass and max(6, 2) in the second pass = 11
Single pass cost = max(3+6, 5+2) = 9
Before any fragment in the single-pass shader can start the second loop iteration, all fragments within the block must complete the first one, because all fragments execute exactly the same instruction even if the results of that instruction are ignored for some fragments.
Because of this, the time for one loop iteration is the maximum over the times of all fragments in the block, and in your example the direct cost should be 11 in both cases.

Before any fragment in the single-pass shader can start the second loop iteration, all fragments within the block must complete the first one
I wasn’t aware of that. Thanks.
Could it be that the loop instruction itself is so expensive?

Originally posted by k_szczech:
Could it be that the loop instruction itself is so expensive?
Must be that; with the changes you made, all should be OK now, hehe. No idea what else it can be…

The only other thing you can do is to put the loop body into an auxiliary function and call it repeatedly, unrolling the loop manually… Something like:

void myAuxFunction()
{
  // 6x texture2D - accessing the same 128x128 GL_LUMINANCE texture
  // update texture coordinates (6 sets held in gl_TexCoord[0] - [2])
  float value = /* computations based on the fetched values */;
  if (value > 0.45)
  {
    // blending layer and lighting computations
  }
}

myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction();
myAuxFunction(); // 16 calls: manual loop unrolling, for example

Just a note: I made some experiments with Cg, and while the maximal loop count on NVIDIA hardware is 256, you can nest loops to emulate a larger iteration count.
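The nesting is simply this (a sketch - e.g. 32x32 gives 1024 iterations, well past the 256 limit):

for (int i = 0; i < 32; ++i)
{
  for (int j = 0; j < 32; ++j)
  {
    // loop body; the overall step index is i * 32 + j
  }
}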

I’ve been working on volume rendering and observed another funny thing - a nested 16x16 loop executes faster than a single loop with 128 iterations, even though it performs 2x more steps.
This allowed me to gain quality and performance at the same time. Seems like the way to go.

Some results:

Single loop:
64 iterations: 50FPS
128 iterations: 13FPS

Nested loops:
8x8 iterations: 87FPS
16x8 iterations: 48FPS
16x16 iterations: 27FPS
32x16 iterations: 13FPS
32x32 iterations: 5FPS

I also got similar results with nested loops. With a single loop of 64 iterations the compiler does not unroll the loop, not even partially.

On a GeForce 6800 Go:
A single loop with 64 iterations uses 32 instructions, 3 R-regs, 1 H-reg.
A nested 16x16 loop uses 128 instructions, 7 R-regs, 1 H-reg.
A nested 16x16 loop with the inner loop manually unrolled produces exactly the same asm as the plain 16x16 version.

The 16x16 versions run twice as fast as the other one! I thought unrolling should definitely result in a speedup, but increasing the number of temporary registers costs some performance, especially in a program that uses a lot of texture fetches.

Yes, that was my first thought after comparing single vs. nested loops. So my previous guess, that the loop instruction is very expensive, seems to be correct.
Note the difference between 64 and 128 iterations in the single loop - it’s nearly 4x. And the difference between 8x8 and 16x8 is 2x.
So the next question is: why such a difference in the single loop?
Or perhaps we shouldn’t care and should just use nested loops? :slight_smile:

I just updated my NVIDIA drivers to 97.92 and guess what … now the nested 64x1 loop runs at the same speed as 8x8! :slight_smile: 64x1 runs at the same speed as a single 64 loop, but 1x64 runs at less than half the speed!! 64x1 generates the same code as 64, and the output asm does not contain an unrolled loop! Though it could still be unrolled just before the shader is used. I would say there are two phases of optimization here: one at compile/link time, and one just before draw time that specializes the shader to the specific uniform values used for the current batch of geometry.

Now I guess the second one, the shader specializer, can’t afford to spend too much time optimizing based on the modified uniform values, which makes it a bit unreliable. I had a function with a static branch in it that is called many times per fragment, and I thought the static branch would be eliminated, but it doesn’t appear so :frowning: . I gained a 2x speedup when I removed the branch! All this time I was thinking the branch had been eliminated.

NVIDIA, can you please give us a way to get the final assembly of the code that goes to the GPU? This would be very helpful and would eliminate the need for us to second-guess what the driver is doing. I know this asm is going to be GPU-specific, but at least some kind of info would be good. Considering that ATI has exposed their GPU instruction set with CTM, could you do the same for NV40-based GPUs?

Or someone should clone Mr. Eric Lengyel to reverse engineer it and make a lib to dump the asm :smiley:

Originally posted by tarantula:
I had a function with a static branch in it that is called many times per fragment, and I thought the static branch would be eliminated, but it doesn’t appear so :frowning: . I gained a 2x speedup when I removed the branch! All this time I was thinking the branch had been eliminated.

It depends on whether it was really a static branch or not.
Did you have if(0) or something like that in your code? Or:

const bool value = false;
if (value)

Both branches in the examples you gave should be eliminated at compile time. The compiler also eliminates the branch in the absence of the const qualifier, as long as value is never written to.
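That is, even this form should fold away at compile time (a sketch):

bool value = false; // not const, but never written to afterwards
if (value)
{
  // dead code - should be removed by the compiler
}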

I remember reading, and observing, that shaders are reoptimized using the current uniform values. However, now I am not able to observe that optimization happening :frowning:

Just to be clear, I’m talking about the branch being eliminated in this case:

uniform bool value;
if (value)

This is done after the uniform values are updated / just before rendering the primitives.

Hasn’t anyone observed this case being optimized? :frowning: