Texture indirection on 9700 Pro

I’ve started work on a little benchmark of my own and keep getting “Too many texture indirections at xxx” errors on my 9700 during my Perlin noise test.

The GeForce FX 5200 runs this fragment program just fine (although “run” may be a generous word) and I only count 3 texture indirections in the program.

Humus mentioned a bug here: http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/009439.html
related to CAT 3.2. I’m using CAT 3.4 and it still seems to be counting indirections incorrectly.

I emailed ATI devrel with no response. How can I get past this?

Thanks,
Zeno

Why don’t you post the program?

I haven’t made up my mind yet about whether I want to make this an open source project, but since I’m in need of help, I guess this shader will at least be open source.

// 3DPerlinNoise fragment shader
//
// This shader is a complete implementation
// of Perlin’s improved 3D noise algorithm.
//
// See http://mrl.nyu.edu/~perlin/noise/ for details
//
// Created by Wade Lutgen June 8, 2003
//

!!ARBfp1.0

OPTION ARB_precision_hint_fastest;
ATTRIB position = fragment.texcoord[0];
ATTRIB color = fragment.color;
PARAM const = { 1.0, 0.0, 0.0, 0.0 };
PARAM texFactor = 0.00390625;
TEMP uvw;
TEMP t0, t1;
TEMP A, AA, BB;
TEMP p;
TEMP gradx, gradxm1;
TEMP frac, int, dir;

// Find unit cube that contains point
FLR int, position;
// Scale for OGL texture lookup
MUL int, int, texFactor.x;
// Relative xyz of point in cube
FRC frac, position;

// Perform lookups into “permutation” texture
// Actually performs 2 lookups at once, one for p[x]
// which goes into the r component of output and one
// for p[x+1] which goes into the g component

// A.x = p[x], A.y = p[x+1]
TEX A, int.x, texture[0], 1D;
// A.x = p[x]+y, A.y = p[x+1]+y
ADD A, A, int.y;

// AA.x = p[A.x], AA.y = p[A.x+1]
TEX AA, A.x, texture[0], 1D;
// AA.x = p[A.x]+z, AA.y = p[A.x+1]+z
ADD AA, AA, int.z;

// BB.x = p[A.y], BB.y = p[A.y+1]
TEX BB, A.y, texture[0], 1D;
// BB.x = p[A.y]+z, BB.y = p[A.y+1]+z
ADD BB, BB, int.z;

// 8 lookups into RGBA gradient texture
// This particular algorithm has a built-in lookup into
// the p texture as well, so instead of g(x), this is actually
// g(p(x)). Saves four texture lookups.
// After lookup, rescale gradient and dot with current location
TEX p, AA.x, texture[1], 1D;
MAD p, p, 2.0, -1.0;
DP3 gradx.x, frac, p;

TEX p, BB.x, texture[1], 1D;
MAD p, p, 2.0, -1.0;
ADD dir, frac, -const.xwww;
DP3 gradxm1.x, dir, p;

TEX p, AA.y, texture[1], 1D;
MAD p, p, 2.0, -1.0;
ADD dir, frac, -const.wxww;
DP3 gradx.y, dir, p;

TEX p, BB.y, texture[1], 1D;
MAD p, p, 2.0, -1.0;
ADD dir, frac, -const.xxww;
DP3 gradxm1.y, dir, p;

// Add ‘1’ to each texture coordinate
// for the next 4 lookups
ADD AA, AA, texFactor;
ADD BB, BB, texFactor;

TEX p, AA.x, texture[1], 1D;
MAD p, p, 2.0, -1.0;
ADD dir, frac, -const.wwxw;
DP3 gradx.z, dir, p;

TEX p, BB.x, texture[1], 1D;
MAD p, p, 2.0, -1.0;
ADD dir, frac, -const.xwxw;
DP3 gradxm1.z, dir, p;

TEX p, AA.y, texture[1], 1D;
MAD p, p, 2.0, -1.0;
ADD dir, frac, -const.wxxw;
DP3 gradx.w, dir, p;

TEX p, BB.y, texture[1], 1D;
MAD p, p, 2.0, -1.0;
ADD dir, frac, -const.xxxw;
DP3 gradxm1.w, dir, p;

// Compute fade curve s(t) = (6t^5 - 15t^4 + 10*t^3)
MAD t0, frac, 6.0, -15.0;
MAD t1, t0, frac, 10.0;
MUL t0, t1, frac;
MUL t1, t0, frac;
MUL uvw, t1, frac;

// linear interpolate result
LRP t0, uvw.x, gradxm1, gradx;
LRP t1.x, uvw.y, t0.y, t0.x;
LRP t1.y, uvw.y, t0.w, t0.z;
LRP t0, uvw.z, t1.y, t1.x;

MUL result.color, t0, color;

END

I’d also appreciate any optimization tips anyone might have.

You’re computing

gvec = g[k+p[j+p[i]]]

which fits just fine, but
adding the “k+1” puts you over the
limit.

You need to declare new variables
(AAA and BBB ?) instead of trying
to reuse AA and BB.

Rather than wait for a driver fix,
you could write a program-to-program
“optimizer” that inserts new temps
to eliminate false dependencies.

Thanks -
Cass

Good suggestion, Cass. Unfortunately the ATI drivers are smarter than you might think, and they see through this trick.

If I understood you correctly, you were suggesting that I replace this:

ADD AA, AA, texFactor;
ADD BB, BB, texFactor;

and all subsequent texture instructions which depend on AA or BB with AAA or BBB, like this:

ADD AAA, AA, texFactor;
ADD BBB, BB, texFactor;

It didn’t work. Same error.

In order to get the card to accept the program, I have to remove every TEX instruction after the first 4, which gives me only 1 lookup into g (I need 8). So, apparently, it sees EVERY texture instruction up there as being dependent on each other in a series (which they aren’t).

Did I misunderstand what you’re saying? If so, could you leave a little snippet of modified shader code?

Thanks for your help

[This message has been edited by Zeno (edited 06-11-2003).]

I count 8 indirections in that program… R300 can only handle 7.

Not sure I would call it “smarter”.

The other obvious false dependence is “p”.
If you declare separate temporaries p0…p7, that should remove it.

The spec defines this as a dependence too:
TEX R0, …;
MUL R0, R0, …;
MUL R1, …;
TEX R0, …; # new level of indirection

The last TEX can’t be moved up and paired with the previous TEX even if they’re really independent - without adding more temp registers, which the hardware may not be able to support. In this case, however, I think you’re well within the maximum register limit, but the driver’s not required to help you out here.

Thanks -
Cass
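
For anyone who wants to check a program before handing it to the driver, the counting rule described above can be sketched in Python. This is a rough sketch, assuming my reading of the ARB_fragment_program rule is right: the count starts at 1, and a texture instruction opens a new indirection when its coordinate is a temporary already written in the current node, or when its result temporary was already read or written by an ALU instruction in the current node. The parser is deliberately simplified, the names `count_indirections`, `parse` and `reg` are made up for this example, and the operands filled into the four-instruction example are placeholders. A driver may count differently.

```python
import re

TEX_OPS = {"TEX", "TXP", "TXB"}

def reg(token):
    """Strip negation, swizzle and write mask: '-frac.xyz' -> 'frac'."""
    return token.lstrip("-").split(".")[0]

def parse(program):
    """Return (temps, instructions); each instruction is (op, dest, sources)."""
    temps, instrs = set(), []
    text = re.sub(r"(#|//).*", "", program)      # drop comments
    for stmt in text.split(";"):
        tokens = stmt.replace(",", " ").split()
        if not tokens:
            continue
        op = tokens[0].split("_")[0]             # ignore a _SAT suffix
        if op == "TEMP":
            temps.update(tokens[1:])
        elif re.fullmatch(r"[A-Z]{3}", op) and len(tokens) >= 3:
            dest = reg(tokens[1])
            # For texture instructions only the coordinate is a register source.
            raw = tokens[2:3] if op in TEX_OPS else tokens[2:]
            instrs.append((op, dest, [reg(t) for t in raw]))
    return temps, instrs

def count_indirections(program):
    """Count texture indirections under the simplified rule stated above."""
    temps, instrs = parse(program)
    count = 1
    written = set()      # temps written (TEX or ALU) in the current node
    alu_touched = set()  # temps read or written by ALU in the current node
    for op, dest, srcs in instrs:
        if op in TEX_OPS:
            if srcs[0] in written or dest in alu_touched:
                count += 1                       # start a new node
                written, alu_touched = set(), set()
            if dest in temps:
                written.add(dest)
        else:
            alu_touched.update(s for s in srcs if s in temps)
            if dest in temps:
                written.add(dest)
                alu_touched.add(dest)
    return count

# The four-instruction pattern from the post, with placeholder operands.
example = """
TEMP R0, R1, R2;
TEX R0, fragment.texcoord[0], texture[0], 2D;
MUL R0, R0, fragment.color;
MUL R1, R0, R0;
TEX R0, fragment.texcoord[1], texture[1], 2D;   # new level of indirection
"""
print(count_indirections(example))
```

Changing the last TEX to write R2 instead of R0 brings the count back to 1, which is the whole point of declaring extra temporaries.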

Did this last suggestion work, Zeno?

Sorry I didn’t respond earlier. This is a home project so I can only work on it at night.

Yes! Your suggestion worked. After I added new temporaries, p1-p7, to replace p and re-ordered where I placed AAA and BBB, I got a new message telling me that I had exceeded another limit - it counted 75 ALU instructions (which is about 30 more than there are in the program). Luckily, I remembered something in the back of my head about ATI hardware not having fully general swizzles. Sure enough, those “const.xxwx” instructions were actually costing me 4 ALU instructions. Eek.

After I replaced all of those with pre-set constants, the program loaded up and ran.

Unfortunately, it’s got some strange visual glitches in the form of a few pixel-wide vertical stripes on the 9700. The new program works exactly the same as the old one on GFFX. My next goal is to track down the cause of this.

So would you call the dependency thing a driver bug, or simply lack of a driver feature? The program is functionally equivalent now to what it was before.

Thanks again for your help.

– Zeno

[This message has been edited by Zeno (edited 06-12-2003).]

Zeno,

Glad to help.

It’s certainly within the letter of the spec to not resolve “false” indirections, so I wouldn’t consider it a bug.

The spec definitely allows for the driver to eliminate false indirections if it chooses to. You’ll have to ask someone at ATI whether they have plans to do this.

And, of course, you could write an ARBfp->ARBfp translator that found and resolved these kinds of issues. You don’t really need the driver to solve the problem if you know how to work around it. This would make a good open source project for OpenGL developers.

Thanks -
Cass
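
As a starting point for such a translator, here is a minimal Python sketch of one pass. It assumes one instruction per line and handles only the reused-destination case (the “p” problem in this thread): whenever a texture instruction fully overwrites a temporary that was written earlier, it gets a fresh temporary and later reads are redirected to it. It does not reorder ALU instructions, so it would not fix the AA/BB adjustment on its own. The function name `split_tex_temps` and the `name_N` naming scheme are inventions for this example, and name collisions with existing temporaries are not checked.

```python
import re

INSTR = re.compile(r"(\s*)([A-Z]{3}(?:_SAT)?)\s+([\w.]+)\s*,(.*)")
NAME = re.compile(r"(?<![\w.])[A-Za-z_]\w*")   # identifiers, but not swizzles
TEX_OPS = {"TEX", "TXP", "TXB"}

def split_tex_temps(program):
    """Give each unmasked TEX result that reuses a temp its own temporary."""
    alias = {}    # original temp -> temp that currently holds its value
    seen = set()  # temps written by any earlier instruction
    fresh = []    # newly invented temporaries, to be declared
    out, first_instr = [], None

    def rename(match):
        return alias.get(match.group(), match.group())

    for line in program.splitlines():
        m = INSTR.match(line)
        if not m:
            # KIL has no destination but still reads a register.
            if line.lstrip().startswith("KIL"):
                line = NAME.sub(rename, line)
            out.append(line)
            continue
        if first_instr is None:
            first_instr = len(out)
        indent, op, dest, rest = m.groups()
        # Sources refer to the values held before this instruction executes.
        rest = NAME.sub(rename, rest)
        name, dot, mask = dest.partition(".")
        if op[:3] in TEX_OPS and not mask and name in seen:
            new = f"{name}_{len(fresh) + 1}"
            alias[name] = new
            fresh.append(new)
        seen.add(name)
        out.append(f"{indent}{op} {alias.get(name, name)}{dot}{mask},{rest}")
    if fresh:
        # Declare the new temps just before the first instruction.
        out.insert(first_instr, "TEMP " + ", ".join(fresh) + ";")
    return "\n".join(out)

sample = "\n".join([
    "!!ARBfp1.0",
    "TEMP p, a, b;",
    "TEX p, fragment.texcoord[0], texture[0], 2D;",
    "MAD a, p, 2.0, -1.0;",
    "TEX p, fragment.texcoord[1], texture[0], 2D;",
    "MAD b, p, 2.0, -1.0;",
    "ADD result.color, a, b;",
    "END",
])
print(split_tex_temps(sample))
```

The renaming is safe because fragment programs are straight-line code: once a TEX overwrites every component of a temporary, the old value is dead, so moving the new value to a different register changes nothing but the dependency the driver sees.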