CG and ARBFP trouble

Hi!

I have trouble with the ARBFP profile in Cg.

The following code works:
total += tex2D( diffuse1, texCoords[0]);
total += tex2D( diffuse1, texCoords[1]);
total += tex2D( diffuse1, texCoords[2]);

but if I add the next line:
total += tex2D( diffuse1, texCoords[3]);

I get a Cg error: the program can’t be loaded.
But the program compiles OK, without any errors!
I get the error when calling LoadProgram.

I can’t have more than 3 texture reads!

Can you help me?
Thanks!
P.S.
I have a Radeon 9700 Pro.

As far as I remember, R3xx had problems with repeated texture reads. Try binding the same texture to another unit and see if that helps. I guess this code will work just fine on an FX.

Are there any other tricks to avoid this problem? I need more than 3 texture reads!
Why doesn’t ATI want to fix this bug?

Even with the same texture bound to different texture channels I can’t have more than 3 texture reads. Not from different texture channels, and not from different textures in different texture channels either. If I have more than 3 texture reads I get “program can’t be loaded”. Radeon 9700 Pro, Cg. Please help me!

Thanks!

It seems to be a Cg problem.
The code generated by Cg exceeds the PS 2.0
limit of 4 phases of texture sampling((
I fixed similar problems by rearranging
the generated assembly code.

OK, try to generate a listing and post it here.

cgc.exe -profile arbfp1 -l zzz.txt -o out.txt test.cg

Hi!

This is my Cg program, and it does not want
to work on my Radeon 9700 Pro.
It uses different texture channels with the same texture. This hack does not help me.

uniform samplerRECT scr0:TEXUNIT0;
uniform samplerRECT scr1:TEXUNIT1;
uniform samplerRECT scr2:TEXUNIT2;
uniform samplerRECT scr3:TEXUNIT3;
uniform samplerRECT scr4:TEXUNIT4;

float4 main(
float2 TEX0 : TEXCOORD0
):COLOR
{
float4 out_color;
out_color=0;
out_color += texRECT(scr0, TEX0+float2(-1.0,-1.0));
out_color += texRECT(scr1, TEX0+float2(1.0,-1.0));
out_color += texRECT(scr2, TEX0+float2(1.0,1.0));
out_color += texRECT(scr3, TEX0+float2(-1.0,1.0));
out_color/=4.0;
return out_color;
}

Resulting code:

!!ARBfp1.0

# ARB_fragment_program generated by NVIDIA Cg compiler
# cgc version 1.1.0003, build date Jul 7 2003 11:55:19
# command line args: -profile arbfp1

#vendor NVIDIA Corporation
#version 1.0.02
#profile arbfp1
#program main
#semantic scr0 : TEXUNIT0
#semantic scr1 : TEXUNIT1
#semantic scr2 : TEXUNIT2
#semantic scr3 : TEXUNIT3
#var samplerRECT scr0 : TEXUNIT0 : texunit 0 : -1 : 1
#var samplerRECT scr1 : TEXUNIT1 : texunit 1 : -1 : 1
#var samplerRECT scr2 : TEXUNIT2 : texunit 2 : -1 : 1
#var samplerRECT scr3 : TEXUNIT3 : texunit 3 : -1 : 1
#var float2 TEX0 : $vin.TEXCOORD0 : TEXCOORD0 : 0 : 1
#var float4 main : $vout.COLOR : COLOR : -1 : 1
PARAM c0 = {0.25, -1, -1, 0};
PARAM c1 = {1, -1, 1, 1};
TEMP R0;
TEMP R1;
ADD R0.xy, fragment.texcoord[0], c0.yzyy;
TEX R0, R0, texture[0], RECT;
ADD R1.xy, fragment.texcoord[0], c1;
TEX R1, R1, texture[1], RECT;
ADD R1, R0, R1;
ADD R0.xy, fragment.texcoord[0], c1.zwzz;
TEX R0, R0, texture[2], RECT;
ADD R0, R1, R0;
ADD R1.xy, fragment.texcoord[0], c1.yzyy;
TEX R1, R1, texture[3], RECT;
ADD R1, R0, R1;
MUL result.color, R1, c0.x;
END

# 12 instructions, 2 R-regs, 0 H-regs.
# End of program


I am using the Cg runtime.
After calling cgLoadProgram I get the error
“Program could not load”.

Please help me. Thanks.

[This message has been edited by DSA (edited 01-05-2004).]

hi!
As I said before, this is a Cg & ATI bug((
Cg must generate code according to the card caps!
ATI must rearrange the code to reduce fetching phases! Or supply a Cg compiler or a Cg->glHlsl converter!
The problem is similar to the register combiner vs ps1.4 war(

// Phase 0
ADD R0.xy, fragment.texcoord[0], c0.yzyy;
// Phase 1
TEX R0, R0, texture[0], RECT;
ADD R1.xy, fragment.texcoord[0], c1;
// Phase 2
TEX R1, R1, texture[1], RECT;
ADD R1, R0, R1;
ADD R0.xy, fragment.texcoord[0], c1.zwzz;
// Phase 3
TEX R0, R0, texture[2], RECT;
ADD R0, R1, R0;
ADD R1.xy, fragment.texcoord[0], c1.yzyy;
// Phase 4
TEX R1, R1, texture[3], RECT;
ADD R1, R0, R1;
MUL result.color, R1, c0.x;

ATI PS 2.0 hardware can do only 4 phases (texture fetching groups); the listing above needs 5 (phases 0 to 4).

The correct code must be:

// Phase 0
ADD R0.xy, fragment.texcoord[0], c0.yzyy;
ADD R1.xy, fragment.texcoord[0], c1;
// Phase 1
TEX R0, R0, texture[0], RECT;
TEX R1, R1, texture[1], RECT;

ADD R1, R0, R1;
ADD R0.xy, fragment.texcoord[0], c1.zwzz;
// Phase 2
TEX R0, R0, texture[2], RECT;
ADD R0, R1, R0;
ADD R1.xy, fragment.texcoord[0], c1.yzyy;
// Phase 3
TEX R1, R1, texture[3], RECT;
ADD R1, R0, R1;
MUL result.color, R1, c0.x;

Using more temporary registers you can reduce
the number of phases down to 2.
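
For anyone who wants to check a listing offline before loading it: below is a rough Python sketch of the phase counting, following the rule as I understand it from the ARB_fragment_program spec (a TEX starts a new phase if its coordinate register was already written in the current phase, or if its result register was already touched by an ALU instruction in the current phase). The function name `count_indirections` and the simple line parser are my own assumptions, it ignores KIL and other special cases, and the driver's "native" count may differ from what this reports.

```python
TEX_OPS = {"TEX", "TXP", "TXB"}
DECLS = {"TEMP", "PARAM", "ATTRIB", "OUTPUT", "ALIAS", "OPTION"}


def _temps(operand, temps):
    """Temporary named by an operand, ignoring negation, swizzle and write mask."""
    name = operand.strip().lstrip("-").split(".")[0]
    return {name} & temps


def count_indirections(program):
    """Count texture indirections (phases) in simple ARBfp1.0 source text."""
    temps, written, alu_touched = set(), set(), set()
    nodes = 1  # a program always has at least one phase
    for raw in program.splitlines():
        line = raw.split("#")[0].strip().rstrip(";")
        if not line or line.startswith(("!!", "//")) or line == "END":
            continue
        op, _, rest = line.partition(" ")
        if op in DECLS:
            if op == "TEMP":
                temps |= {t.strip() for t in rest.split(",")}
            continue
        operands = rest.split(",")
        dst = _temps(operands[0], temps)
        if op.split("_")[0] in TEX_OPS:
            src = _temps(operands[1], temps)
            # New phase if the coordinate was written in this phase, or the
            # result register was already touched by an ALU op in this phase.
            if (src & written) or (dst & alu_touched):
                nodes += 1
                written, alu_touched = set(), set()
        else:
            for o in operands[1:]:
                alu_touched |= _temps(o, temps)
            alu_touched |= dst
        written |= dst
    return nodes


# The listing generated by cgc in this thread (ADD/TEX interleaved).
COMPILED = """!!ARBfp1.0
PARAM c0 = {0.25, -1, -1, 0};
PARAM c1 = {1, -1, 1, 1};
TEMP R0;
TEMP R1;
ADD R0.xy, fragment.texcoord[0], c0.yzyy;
TEX R0, R0, texture[0], RECT;
ADD R1.xy, fragment.texcoord[0], c1;
TEX R1, R1, texture[1], RECT;
ADD R1, R0, R1;
ADD R0.xy, fragment.texcoord[0], c1.zwzz;
TEX R0, R0, texture[2], RECT;
ADD R0, R1, R0;
ADD R1.xy, fragment.texcoord[0], c1.yzyy;
TEX R1, R1, texture[3], RECT;
ADD R1, R0, R1;
MUL result.color, R1, c0.x;
END"""

# Same math with four temporaries: all coordinates first, then all fetches.
BATCHED = """!!ARBfp1.0
PARAM c0 = {0.25, -1, -1, 0};
PARAM c1 = {1, -1, 1, 1};
TEMP R0, R1, R2, R3;
ADD R0.xy, fragment.texcoord[0], c0.yzyy;
ADD R1.xy, fragment.texcoord[0], c1;
ADD R2.xy, fragment.texcoord[0], c1.zwzz;
ADD R3.xy, fragment.texcoord[0], c1.yzyy;
TEX R0, R0, texture[0], RECT;
TEX R1, R1, texture[1], RECT;
TEX R2, R2, texture[2], RECT;
TEX R3, R3, texture[3], RECT;
ADD R0, R0, R1;
ADD R0, R0, R2;
ADD R0, R0, R3;
MUL result.color, R0, c0.x;
END"""

print(count_indirections(COMPILED))  # 5, one more than the limit of 4
print(count_indirections(BATCHED))   # 2
```

Under the same rule, the hand-reordered listing with two temporaries comes out at 4, which is why it just fits.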

Ok! Thanks!

Why don’t the Cg developers fix this glitch?
Or does nobody on ATI hardware use Cg?

Why don’t the Cg developers fix this glitch?

It’s really simple.

nVidia hardware that has support for ARB_fp works fastest when the fewest number of temporaries is used. ATi hardware doesn’t care about the number of temporaries used, but it does have a limitation on the number of texture dependencies that is allowed. In your case, and probably quite a few others, these two are mutually exclusive.

The CG compiler is made by nVidia. Therefore, it will compile to ARB_fp code that works fast for nVidia hardware, and they don’t care that the fp code breaks on ATi cards. End of story.

I always thought that Cg was targeted as a cross-platform and cross-hardware language.
But… as far as I know, Cg is now open source…
I do not see any problem with adapting Cg to any hardware platform…
Or at least with adding a “no optimize” switch for converting Cg code directly into shaders without any optimizations.

I think the problem might be the “+=” part which may cause false dependencies.

Try re-writing your code something like:

t1 = tex2D( texture, coords[0] );
t2 = tex2D( texture, coords[1] );
t3 = tex2D( texture, coords[2] );
t4 = tex2D( texture, coords[3] );
result = t1+t2+t3+t4;

This may be enough to break the implicit dependency.

First of all I want to thank everyone who’s been involved with this thread. I am so happy.
Thank you Yuri, DSA and Korval.

I was able to get my simple 3x3 blur filter to run perfectly; it has only 2 texture indirections along with 10 texture instructions.

I tried Jwatte’s suggestion and it makes no difference. The Cg compiler still outputs:
ADD
TEX
ADD
TEX

The trick to getting around ATI’s texture indirection limit is just to do all the texture accesses (TEX) at once.

Here’s my fragment program code. Originally it was output by the Cg compiler, but then it was reorganized (and ultimately rewritten) by me.

!!ARBfp1.0
MOV R0.xy, fragment.texcoord[0];
ADD R1.xy, R0, c0.yzyy;
ADD R2.xy, R0, c1.xyxx;
ADD R3.xy, R0, c1.zwzz;
ADD R4.xy, R0, c2.xyxx;
ADD R5.xy, R0, c2.yzyy;
ADD R6.xy, R0, c2.zwzz;
ADD R8.xy, R0, c3.zwzz;
ADD R7.xy, R0, c3.xyxx;
TEX R0, R0, texture[0], 2D;
TEX R1, R1, texture[0], 2D;
TEX R2, R2, texture[0], 2D;
TEX R3, R3, texture[0], 2D;
TEX R4, R4, texture[0], 2D;
TEX R5, R5, texture[0], 2D;
TEX R6, R6, texture[0], 2D;
TEX R7, R7, texture[0], 2D;
TEX R8, R8, texture[0], 2D;
ADD R0,R0,R1;
ADD R0,R0,R2;
ADD R0,R0,R3;
ADD R0,R0,R4;
ADD R0,R0,R5;
ADD R0,R0,R6;
ADD R0,R0,R7;
ADD R0,R0,R8;
MUL result.color, R0, c0.x;

Any instruction between the TEXs that writes a register a later TEX reads its coordinate from just increases the indirection count. This isn’t all the code of the program, just enough to help some grasp the idea, I think.
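
If you would rather generate this layout than reorder it by hand, here is a small Python sketch that emits a program in the same three stages (all coordinate ADDs, then all TEXs, then the sum). It is only an illustration: `batched_blur`, the `off0`, `off1`, ... parameter names and the `scale` constant are made up for this example, and it assumes one texture on unit 0 with fewer taps than the hardware has temporaries.

```python
def batched_blur(offsets, target="2D"):
    """Emit an ARBfp1.0 box filter with every TEX grouped in one block."""
    n = len(offsets)
    out = ["!!ARBfp1.0", "PARAM scale = {%s, 0, 0, 0};" % (1.0 / n)]
    out += ["PARAM off%d = {%s, %s, 0, 0};" % (i, dx, dy)
            for i, (dx, dy) in enumerate(offsets)]
    out += ["TEMP R%d;" % i for i in range(n)]
    # Stage 1: compute every texture coordinate up front.
    out += ["ADD R%d.xy, fragment.texcoord[0], off%d;" % (i, i)
            for i in range(n)]
    # Stage 2: issue all fetches back to back, nothing in between.
    out += ["TEX R%d, R%d, texture[0], %s;" % (i, i, target)
            for i in range(n)]
    # Stage 3: accumulate into R0 and normalize.
    out += ["ADD R0, R0, R%d;" % i for i in range(1, n)]
    out += ["MUL result.color, R0, scale.x;", "END"]
    return "\n".join(out)


# 3x3 taps with a step of 0.0025 in texture coordinates (value from the thread).
step = 0.0025
taps = [(dx * step, dy * step) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
print(batched_blur(taps))
```

The cost is one temporary per tap, which is the trade-off discussed earlier in the thread: more registers in exchange for fewer indirections.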

This is just for information purposes. I have been watching this thread since DSA’s first post. I had the exact same problem and had been searching for the past 3 days for a solution. So, may it serve those in the future. Maybe I just missed it, but for me this little bit of information was just nowhere to be found 'cept here.

Thanks guys, very much,
Jes

Bad. Very bad.
In this case, IMHO, the R9700 does not fully support the ARBFP extension. Why did the program not work?
The syntax of the program is OK.
If hardware supports an extension, the extension must be supported fully. In all cases.
This is my first big disappointment in ATI’s hardware.

As I said: Cg is now open source, so why not change it a little bit?

Is there any hero?

That’s a real headache. Nvidia’s hardware is slower than ATI’s, but more flexible. It means that if you write a Radeon-friendly shader, it will be slow on Nvidia, and an Nvidia-friendly shader probably won’t run on a Radeon (if it has indirections). The Unified Compiler from Nvidia makes things a bit better, but not perfect. I’m afraid that if you want to achieve the best performance on both cards, you must use NV_fp for Nvidia, leaving Radeon-optimised ARB code for ATI.

Sorry for being off-topic, these are just my thoughts…

UNMODIFY

How did you resolve this problem?
Did you not use Cg at all?
Or do you use Cg for NV and assembler for ATI?

Does this answer your question?
The short answer is that I reorganized the ARB assembler instructions of my fragment program using my own newly acquired ideas and knowledge, which eliminated the texture indirection error.

Oh, before that: in my earlier response to Jwatte I forgot to mention that I only checked with the Cg compiler. I know next to zip about GLSL, which I think his code was. The Cg compiler outputs the same thing whether the HLSL is written in a temporary-heavy fashion or an indirection-heavy fashion; it favored the indirection-heavy organization in its fragment output code either way.

I load up my fragment and vertex programs on my own and use the ARB extensions to control them in OGL, just the way they do it in the ATI Simple Shader demo, the one with the cute little elephant.

You should be able to copy this code and use it straight away with Cg or the ARB_ extensions. My hardware is a Radeon 9800. I have no idea what the result would be on something else, but it works famously on mine.

!!ARBfp1.0

# ARB_fragment_program generated by NVIDIA Cg compiler
# cgc version 1.1.0003, build date Jul 7 2003 11:55:19
# command line args: -profile arbfp1 -entry ps_main

#vendor NVIDIA Corporation
#version 1.0.02
#profile arbfp1
#program ps_main
#semantic Texture0
#var sampler2D Texture0 : : texunit 0 : -1 : 1
#var float4 inDiffuse : $vin.COLOR0 : COLOR0 : 0 : 1
#var float2 tex : $vin.TEXCOORD0 : TEXCOORD0 : 1 : 1
#var float4 ps_main : $vout.COLOR0 : COLOR0 : -1 : 1
PARAM c0 = {0.11111111, 0.0024999999, 0, 0};
PARAM c1 = {-0.0024999999, 0, 0, 0.0024999999};
PARAM c2 = {0, -0.0024999999, -0.0024999999, 0.0024999999};
PARAM c3 = {0.0024999999, 0.0024999999, 0.0024999999, -0.0024999999};
TEMP R0;
TEMP R1;
TEMP R2;
TEMP R3;
TEMP R4;
TEMP R5;
TEMP R6;
TEMP R7;
TEMP R8;
MOV R0.xy, fragment.texcoord[0];
ADD R1.xy, R0, c0.yzyy;
ADD R2.xy, R0, c1.xyxx;
ADD R3.xy, R0, c1.zwzz;
ADD R4.xy, R0, c2.xyxx;
ADD R5.xy, R0, c2.yzyy;
ADD R6.xy, R0, c2.zwzz;
ADD R8.xy, R0, c3.zwzz;
ADD R7.xy, R0, c3.xyxx;
TEX R0, R0, texture[0], 2D;
TEX R1, R1, texture[0], 2D;
TEX R2, R2, texture[0], 2D;
TEX R3, R3, texture[0], 2D;
TEX R4, R4, texture[0], 2D;
TEX R5, R5, texture[0], 2D;
TEX R6, R6, texture[0], 2D;
TEX R7, R7, texture[0], 2D;
TEX R8, R8, texture[0], 2D;
ADD R0,R0,R1;
ADD R0,R0,R2;
ADD R0,R0,R3;
ADD R0,R0,R4;
ADD R0,R0,R5;
ADD R0,R0,R6;
ADD R0,R0,R7;
ADD R0,R0,R8;
MUL result.color, R0, c0.x;
END

# 26 instructions, 2 R-regs, 0 H-regs.
# End of program

btw, ignore the #comments as this is modified from the original compiler output, they are all pretty irrelevant anyway.

Jes

hi, i also did some blurring filters with Cg. a hint from me is to do the texture coordinate offsetting on the vertex processor (you can get 16 texcoord sets to the fragment processor: two in every one of the eight sets supplied with TEXCOORDx).

looks like this:
(gauss filters are separable, so this is the horizontal pass, the vertical pass looks similar)

#define FLOAT 1
#define HALF 0
#define FIXED 0

void gauss3x3FilterP1VP(in float4 position_in : POSITION,
/*horizontal pass*/ in float2 texcoord_in : TEXCOORD0,

                   out float4   position_out   : POSITION,
                 
                   out float2   texcoord00_out : TEXCOORD0,
                   out float2   texcoord10_out : TEXCOORD1,
                 
               uniform float4x4 viewproj_mat)

{
position_out = mul(viewproj_mat, position_in);

texcoord00_out = texcoord_in + float2(-0.5f, 0);
texcoord10_out = texcoord_in + float2( 0.5f, 0);

}

float4 gauss3x3FilterP1FP(in float2 texcoord00_in : TEXCOORD0,
in float2 texcoord10_in : TEXCOORD1,

                 uniform samplerRECT tex_in     : TEXUNIT0) : COLOR

{
/*
|1 2 1| |p00 p10 p20|
*/
#if FLOAT
float4 p00 = texRECT(tex_in, texcoord00_in);
float4 p10 = texRECT(tex_in, texcoord10_in);

return float4(p00 + p10)/2.0f;

#endif

#if HALF
half4 p00 = h4texRECT(tex_in, texcoord00_in);
half4 p10 = h4texRECT(tex_in, texcoord10_in);

return (p00 + p10)/2.0h;

#endif

#if FIXED
fixed4 p00 = x4texRECT(tex_in, texcoord00_in);
fixed4 p10 = x4texRECT(tex_in, texcoord10_in);

//return float4(p00 + p10)/2.0f;
return half4(p00 + p10)/2.0h;

#endif
}

i do also use bilinear filtering to get weighted sums (saves texture fetches).
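
To make the arithmetic behind that explicit, here is a small 1D Python sketch (not Cg). It assumes linear filtering is enabled on the texture and that texel i is centred at coordinate i + 0.5, as with RECT textures; `fetch_linear` is a made-up stand-in for what the texture unit does. Two filtered fetches at -0.5 and +0.5 each average two neighbouring texels, and averaging those two results gives the 1-2-1 kernel with 2 fetches instead of 3.

```python
import math


def fetch_linear(row, x):
    """Linearly filtered 1D fetch; texel i is centred at coordinate i + 0.5."""
    u = x - 0.5
    i = math.floor(u)
    frac = u - i
    i0 = min(max(i, 0), len(row) - 1)      # clamp to edge
    i1 = min(max(i + 1, 0), len(row) - 1)
    return row[i0] * (1.0 - frac) + row[i1] * frac


def gauss3_two_fetches(row, centre):
    """1-2-1 filter around texel `centre` using two filtered fetches."""
    x = centre + 0.5                       # coordinate of the texel centre
    p00 = fetch_linear(row, x - 0.5)       # (row[centre-1] + row[centre]) / 2
    p10 = fetch_linear(row, x + 0.5)       # (row[centre] + row[centre+1]) / 2
    return (p00 + p10) / 2.0


row = [0.0, 4.0, 8.0, 2.0, 6.0]
direct = (row[1] + 2.0 * row[2] + row[3]) / 4.0
print(gauss3_two_fetches(row, 2), direct)  # both are 5.5
```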

my question to this thread is: what are texture indirections? (the problem with the radeons)

[This message has been edited by Chris Lux (edited 01-08-2004).]

In this case, IMHO, the R9700 does not fully support the ARBFP extension. Why did the program not work?

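I think the ATI hardware follows the spec just fine. The spec talks explicitly about dependent read limits, and what causes them. There’s a queryable value you can use to test how deep your dependencies go: glGetProgramivARB with GL_PROGRAM_NATIVE_TEX_INDIRECTIONS_ARB, compared against GL_MAX_PROGRAM_NATIVE_TEX_INDIRECTIONS_ARB.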

What’s BAD is that the CG compiler compiles to ARB_fp without checking this limit, and it chooses lower register usage rather than lower dependency depth. Clearly, that’s an NVIDIA optimization, but if anyone’s to use it for real, they just HAVE to work well with ATI hardware, too.

Between the inability to use the CG DCC plug-ins, and choices like this in the CG compiler, I just don’t think CG is viable as a run-time solution for any commercial product. That’s a REAL SHAME! If these problems were fixed, I’d love to use it.

just for your information, there is a NoDependentReadLimit option for the ARBfp1 profile, so you can force the Cg compiler to restrict the texture indirections…

Scotty, I tried the Cg compiler flag you mentioned and I would still need to rewrite the output code to get it to run (ATI R9800). I didn’t check for 1:1 correspondence in the outputs, but when I queried the program’s stats in OGL after the load error, it had the same number of texture indirections as the output with the default NoDependentReadLimit value.

Chris, I computed the texcoords in the VProgram, following your suggestion. No really noticeable difference in speed; however, you are right, you can send all coords to the FP via the VP, packing 2 into 1 coord. It works.

To answer your question, “what are texture indirections?”: they are simply using a texture coord to access a texel in a texture. Indirection just means to reference something via some index. However, this ATI texture indirection limit thing is kind of a misnomer. ATI hardware can do texture indirection wonderfully! I was able to get 9 texture reads in, so far, in the FP that I posted. The problem is the relationship between the way the hardware prefers the FP code to be and the code output by the Cg compiler v1.1 for ARB_FRAGMENT_PROGRAM (arbfp1). The problem presents itself when loading an ARB_FRAGMENT_PROGRAM: the error message back is that the program exceeded the number of “native texture indirections”, and so the program fails, but if you rearrange the instructions and put all the texture accesses together it won’t fail. (So far, anyway... I haven’t tried it with other, more complex FPs.)

Recently, after I got my own FP to work, I found a PDF on ATI’s site about optimizing shaders for their hardware, in the Mojo Day section. It covers this aspect of instruction ordering, but had I found it before I got my FP to work I don’t think I would have understood it right away. I still glaze over most of it, but if I set to it I bet I could have gotten the same idea from it that I got from Yuri and others.

Jwatte, or anyone who knows for sure, Do the Cg DCC plug-ins fail on all ATIR9x cards? The Maya one doesn’t work on mine.

I would love to use the SLs in Maya, but I don’t mind so much. I am happy with RenderMonkey, although I would love to be able to do feedback effects with it (is there a way? I fiddled and fiddled, but finally decided to learn how to do it on my own). I am grateful for Cg too; presently it’s one of the tools I use, but I think Cg will be eclipsed by GLSL, and this whole issue will be a thing of the past.

I am so glad there are so many people who love computers.
Jes