ATI fixed bug but...



V-man
11-27-2005, 11:47 AM
This code is for testing static loops:


const vec2 const0=vec2(0.01, 0.025);
const int loopcount=XXX;

void main()
{
    gl_Position = ftransform();

    int i;
    vec2 texCoord=vec2(0.0, 0.0);
    for(i=0; i<loopcount; i++)
    {
        texCoord+=const0;
    }

    gl_TexCoord[0].xy=texCoord;
}

When I used to set loopcount to 30, I got:

Link successful. The GLSL vertex shader will run in software - available number of
temporary registers exceeded. The GLSL fragment shader will run in hardware.

Now it works, but if loopcount >= 250, weird things happen.
Polygons are rendered all over the place.

Can you fix it?
Thanks

execom_rt
11-28-2005, 01:56 AM
Since you are compiling GLSL code, I assume that you have at least a Radeon 9600 and a recent version of ATI Catalyst.
The question is whether this code is really outside the hardware specification.

Here is a comparison between different implementations (based on your code).
First, a Cg conversion of your GLSL program. This is the way the nVidia GLSL compiler will compile your code:


!!ARBvp1.0
#const c[0] = 0.3 0.7499998
PARAM c[5] = { { 0.29999998, 0.74999982 },
program.local[1..4] };
MOV result.texcoord[0].xy, c[0];
DP4 result.position.w, vertex.position, c[4];
DP4 result.position.z, vertex.position, c[3];
DP4 result.position.y, vertex.position, c[2];
DP4 result.position.x, vertex.position, c[1];
END
# 5 instructions, 0 R-regs

We see that the nVidia implementation is pretty clever, because it has precomputed the values for you.
So it will be very fast, and it will work even on a GeForce 2 MX, for example.
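
To see where the precomputed constant in c[0] comes from, here is a minimal C sketch of what the compiler effectively does when it evaluates the static loop ahead of time. The names vec2f, fold_static_loop and approx_equal are hypothetical helpers made up for this illustration, and the assumption is that the folding is done with single-precision additions, which would explain why the listing shows 0.29999998 rather than exactly 0.3.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical 2-component float vector, standing in for GLSL's vec2. */
typedef struct { float x, y; } vec2f;

/* Hypothetical helper: evaluates the shader's static loop on the CPU,
   adding const0 = (0.01, 0.025) loopcount times in single precision.
   A compiler that folds the loop can emit the result as one constant. */
vec2f fold_static_loop(int loopcount)
{
    assert(loopcount >= 0);
    const vec2f const0 = { 0.01f, 0.025f };
    vec2f texCoord = { 0.0f, 0.0f };
    for (int i = 0; i < loopcount; i++) {
        texCoord.x += const0.x;
        texCoord.y += const0.y;
    }
    return texCoord;
}

/* Hypothetical helper: compares two floats within a tolerance, since
   repeated float additions do not land exactly on 0.3 or 0.75. */
int approx_equal(float a, float b, float tol)
{
    return fabsf(a - b) <= tol;
}
```

Calling fold_static_loop(30) gives a value close to (0.3, 0.75), which is what the single MOV from c[0] writes to the texture coordinate. The loop itself never reaches the hardware in that case.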

HLSL version (DX9)

Here is the converted code:


const float2 const0=float2(0.01, 0.025);
const int loopcount=30;

void main(uniform float4x4 ModelViewMatrixProj, in float4 gl_Vertex:POSITION, out float4 gl_Position:POSITION, out float2 TexCoord:TEXCOORD0)
{
    gl_Position = mul( ModelViewMatrixProj, gl_Vertex);
    int i;
    float2 texCoord=float2(0.0, 0.0);
    for(i=0; i<loopcount; i++)
    {
        texCoord+=const0;
    }
    TexCoord.xy=texCoord;
}

Now converted to Vertex Shader 2.0:


// Default values:
//
// loopcount
// i0 = { 30, 0, 1, 0 };
//
// const0
// c4 = { 0.01, 0.025, 0, 0 };
//

vs_2_0
def c5, 0, 0, 0, 0
dcl_position v0
mul r0, v0.y, c1
mad r0, c0, v0.x, r0
mad r0, c2, v0.z, r0
mad oPos, c3, v0.w, r0
mov r0.xy, c5.x
rep i0
add r0.xy, r0, c4
endrep
mov oT0.xy, r0

// approximately 9 instruction slots used

In vertex shader 2.0, the for loop over i/loopcount is compiled to the rep i0/endrep instructions,
and it fits within the vertex shader 2.0 specification.

So there is indeed a 'problem' with the loop implementation in GLSL on ATI/PC (this code would work on MacOS 10.4.3 on ATI).

Jan
11-28-2005, 02:47 AM
In a GLSL shader on ATI, if I use a const int as the loop count, the loop does get unrolled.

If I use a const float, it doesn't.

And, well, using a variable instead of a constant doesn't work at all, but that's a known issue :(

Jan.

V-man
11-28-2005, 04:23 AM
The loop should not get unrolled, as the R300 (9500 to 9800) can do a loop, especially if the loop count is above 256.
We are supposed to be able to execute >65000 instructions on VS 2.0 hw, so this is what my test is for.


"This is the way the nVidia / GLSL will compile your code:"

Yes, that's one of the things to watch out for. Plug in a large loopcount this time.

execom_rt
11-28-2005, 05:20 AM
The previous ARB vertex program I posted was using the standard ARB_vertex_program profile, which is not the default on nVidia.

With the vp30 (GeForce FX) or vp40 (GeForce 6 here) profiles, it uses a real loop. Here is what the GLSL code compiles to for a GeForce 6 or better. No unrolling this time (the code for GeForce FX is different but similar, using NV_vertex_program).

Note that it's 12 instructions, even with a larger value of loopcount. The limit is 65535 instructions in the shader, not 65535 executions of an instruction.


#var float2 const0 : : c[6] : -1 : 1
#var int loopcount : : c[5] : -1 : 1
#const c[4] = 0 1
#default const0 = 0.01 0.025
#default loopcount = 30

!!ARBvp1.0
OPTION NV_vertex_program3;

PARAM c[7] = { program.local[0..3],
{ 0, 1 },
program.local[5..6] };
TEMP R0;
TEMP CC;
BB1:
MOV R0.xy, c[4].x;
DP4 result.position.w, vertex.attrib[0], c[3];
DP4 result.position.z, vertex.attrib[0], c[2];
DP4 result.position.y, vertex.attrib[0], c[1];
DP4 result.position.x, vertex.attrib[0], c[0];
MOV R0.z, c[4].x;
BB2:
SLTC CC.x, R0.z, c[5];
BRA BB4 (EQ.x);
BB3:
ADD R0.xy, R0, c[6];
ADD R0.z, R0, c[4].y;
BRA BB2;
BB4:
MOV result.texcoord[0].xy, R0;
END
# 12 instructions, 1 R-regs
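
To make the difference between instructions in the program and instructions actually executed concrete, here is a rough C sketch that walks the basic blocks of the listing above and counts both. The function name executed_instructions and the per-block costs are assumptions for this illustration: each assembly line is counted as one instruction and END is not counted, which matches the "12 instructions" reported by the compiler but says nothing about how real hardware accounts for them.

```c
#include <assert.h>

/* Static size of the listing, one per assembly line:
   BB1 = 6 (MOV, 4x DP4, MOV), BB2 = 2 (SLTC, BRA),
   BB3 = 3 (ADD, ADD, BRA),    BB4 = 1 (MOV). END is not counted. */
enum { STATIC_INSTRUCTIONS = 6 + 2 + 3 + 1 };

/* Hypothetical model: counts how many instructions run for a given
   loopcount by following the same control flow as BB1..BB4. */
long executed_instructions(int loopcount)
{
    assert(loopcount >= 0);

    long executed = 6;      /* BB1 runs once */
    float i = 0.0f;         /* the loop counter kept in R0.z */

    for (;;) {
        executed += 2;      /* BB2: SLTC sets the condition, BRA tests it */
        if (!(i < (float)loopcount)) {
            break;          /* BRA BB4 (EQ.x) taken: leave the loop */
        }
        executed += 3;      /* BB3: accumulate, increment, BRA BB2 */
        i += 1.0f;
    }

    executed += 1;          /* BB4: write the texture coordinate */
    return executed;
}
```

Under these assumptions the program stays at 12 instructions whatever loopcount is, while the executed count grows as 9 + 5 * loopcount: 159 for loopcount = 30 and 1259 for loopcount = 250.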