ARB_vertex_program address register woes under Catalyst 3.7 / Radeon 9700

I’m currently trying to get matrix palette skinning working in my application.

Using the address register either makes the driver return an error when I upload the program string or drops my framerate from ~100fps to ~2fps.

Just to get things going I had the following code transforming everything by an identity matrix (this is not the full program just a fragment).

PARAM bones[] = { program.local[0..3] };

TEMP bonepos;
TEMP skinnedpos;

ADDRESS addr;

# Transform vertices by first bone matrix

ARL addr.x, boneindex.x;
DP4 bonepos.x, bones[0], position;
DP4 bonepos.y, bones[1], position;
DP4 bonepos.z, bones[2], position;
MUL skinnedpos, weight.x, bonepos;

Adding a single use of address register indirection drops the framerate from 100fps to 2fps:

PARAM bones[] = { program.local[0..3] };

TEMP bonepos;
TEMP skinnedpos;

ADDRESS addr;

# Transform vertices by first bone matrix

ARL addr.x, boneindex.x;
DP4 bonepos.x, bones[addr.x], position;
DP4 bonepos.y, bones[1], position;
DP4 bonepos.z, bones[2], position;
MUL skinnedpos, weight.x, bonepos;
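
For comparison, here is a rough CPU-side sketch in C of what a fully indirect version of this fragment would compute. It relies on ARL loading floor() of its operand, and it assumes (this is not in the fragment above) that each bone takes three consecutive palette rows and that the per-vertex index has already been multiplied by three, so the rows are read as bones[addr.x + 0], bones[addr.x + 1] and bones[addr.x + 2]. The names arl_floor, dp4 and skin_one_bone are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mimics ARL: the address register receives floor(value). */
int arl_floor(float value)
{
    int truncated = (int)value;               /* C truncates toward zero */
    return ((float)truncated > value) ? truncated - 1 : truncated;
}

/* Four-component dot product, the CPU equivalent of DP4. */
float dp4(const float a[4], const float b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/*
 * Hypothetical helper: skin one position by one bone.
 * rows      - flat palette, 4 floats per row, 3 rows per bone (assumption)
 * row_count - number of rows in the palette
 * row_index - per-vertex value fed to ARL, assumed pre-multiplied by 3
 */
void skin_one_bone(const float *rows, size_t row_count, float row_index,
                   float weight, const float position[4], float out[4])
{
    int addr = arl_floor(row_index);
    assert(addr >= 0 && (size_t)addr + 2 < row_count);

    out[0] = weight * dp4(rows + 4 * (addr + 0), position);
    out[1] = weight * dp4(rows + 4 * (addr + 1), position);
    out[2] = weight * dp4(rows + 4 * (addr + 2), position);
    out[3] = weight * position[3];  /* w passed through; an assumption,
                                       the fragment never writes bonepos.w */
}
```

Running the same vertices through something like this is a cheap way to check that the indices and palette are sane before blaming the driver.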

Is there a ‘proper’ way to do things, one I’m not aware of, that won’t drop me down to a software path?

I’ve also noticed the following changes will cause the program to return errors when trying to load.

Changing the PARAM of bones from { program.local[0..3] } to anything not starting with a 0 (like program.local[16..18]) when using the address register.

Trying to inline program.local as in DP4 bonepos.x, program.local[addr], position;

Additionally, using too large a range in a PARAM array cuts my framerate by about 1/3 even without using address register indirection, e.g. PARAM bones[] = { program.local[0..47] };

Driver bugs? My not using the commonly accepted syntax? Gremlins? Any insight would be appreciated.

[This message has been edited by Pop N Fresh (edited 09-23-2003).]

Here is what I use (on NVIDIA NV3x):

##########################################

ATTRIB VtxPos = vertex.attrib[0];
ATTRIB VtxClr = vertex.attrib[1];
ATTRIB VtxNrm = vertex.attrib[2];
ATTRIB VtxBID = vertex.attrib[3];
ATTRIB VtxUVW = vertex.attrib[4];

PARAM matBone[64] = { program.env[32..95] };
ADDRESS Offset;

##########################################

Note that I use .env scope instead of .local scope (perhaps ATI ‘initializes’ every constant per draw call when using .local scope?).

also I specify the register count (64) in the declaration.

Not sure if this will help, but I can say it works great on GF4MX to GFFX.

mtm

Switching from local to env and specifying the register count in the declaration has no effect. Just specifying a PARAM with a large range chops 30fps off the framerate, and actually using the address indirection drops me down to 2fps, same as with local.

Thanks. It seems more and more likely this is some sort of driver problem. sigh

I just tested my own code and I seem to agree. I get a framerate drop of about 50% when doing the address instructions. (from ~360 fps to 160fps)

Using the D3D backend on the same app produced almost the same results (~360fps) whether or not the address lookup was done.

It could be that the slowdown comes not from the address instruction but from mixing generic attributes with standard attributes (i.e. the skinning path uses generic attributes for weights etc., but I am still passing in the position/normal/texcoord via vertex.position/vertex.normal etc.).

Setting the matrix offset to values other than 0 did not seem to have any real effect. (BTW: I use program.local)

If I manage to figure it out I’ll let you know.

Thanks for trying that out for me. I use all generic attributes so I don’t think that’s it.

Are you also using the 3.7 drivers? I’m wondering if I should try rolling back to 3.6.

Yes, I am using the Cat3.7 drivers (reasonably certain).

OK, I just tried using all standard attributes as input for my program, like this:

ATTRIB iPos = vertex.position;
ATTRIB iNormal = vertex.normal;
ATTRIB iTex0 = vertex.texcoord[0];

//ATTRIB iAttr6 = vertex.attrib[6];
//ATTRIB iWeight = vertex.attrib[1];

ATTRIB iAttr6 = vertex.color;
ATTRIB iWeight = vertex.texcoord[1];

(As you can see, I am now using all standard input attributes, with weights from tex1 and indices from the color.)

Doing this I got a massive framerate boost, from 160fps to 300fps (still not quite as fast as the D3D backend at 360fps, but much better).

So you may want to try something similar. (and perhaps submit a bug report to ATI)

Damn, using standard attributes means messing up my nice beautiful generic mesh class :P

Ah well, it’s better than no workaround at all. I’ll post a bug to devrel@ati.com and hope it’s fixed in 3.8 (well, more likely 3.9 since 3.8 is probably in beta).

Thanks again.

Alright. Got this running at a reasonable framerate now. Still not quite as high as it should be, but better than 2fps.

It seems there are two different sources of slowdown. The first is defining an array PARAM with a large range. PARAM m[] = { program.env[0..2] }; is fine but having PARAM m[] = { program.env[0..47] }; takes about a quarter to a third off my framerate.

The second depends on the type of the data I’m using for my address indices. Let’s say I have

ATTRIB	index	= vertex.attrib[5];
ARL		addr.x, index.x;
DP4		bonepos.x, bones[addr.x], position;

If I pass vertex.attrib[5] as GL_UNSIGNED_BYTE I get the slowdown to 2fps. Change to GL_FLOAT and no slowdown except the PARAM array slowdown mentioned above. I have all my data stored in a VBO in case anyone is wondering. It’s possible this bug only affects data in a VBO but I don’t know either way.
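
In case it helps anyone hitting the same thing, this is roughly what the GL_FLOAT workaround looks like on the CPU side: expand the byte indices to floats before filling the VBO, then submit that attribute as GL_FLOAT instead of GL_UNSIGNED_BYTE. This is only a sketch; expand_bone_indices and the rows_per_bone parameter are hypothetical names (pass 1 to copy the indices unchanged, or 3 if you want the value to be a row offset into a three-rows-per-bone palette).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: widen unsigned-byte bone indices to floats.
 * src           - count bone indices as stored in the mesh
 * dst           - receives count floats, ready to upload as GL_FLOAT data
 * rows_per_bone - scale applied to each index (1 leaves it unchanged)
 */
void expand_bone_indices(const unsigned char *src, float *dst,
                         size_t count, unsigned rows_per_bone)
{
    assert(src != NULL && dst != NULL && rows_per_bone > 0);

    for (size_t i = 0; i < count; ++i) {
        dst[i] = (float)(src[i] * rows_per_bone);
    }
}
```

The cost is 12 extra bytes per vertex for four indices, which is the trade-off between the two vertex layouts being compared in this thread.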

[This message has been edited by Pop N Fresh (edited 09-24-2003).]

Pop,
Have you tried passing it as secondary color (as GL_UNSIGNED_BYTE, I mean)? Just curious.

One additional idea, Pop: would you mind posting the vertex structure that you are feeding this vertex program with? For example, mine is usually one interleaved stream with:
Position : 3 floats
Normal : 3 floats
Texture coord : 2 floats
Matrix indices: 4 ubytes
Matrix weights: 1-4 floats

(note that even if 1 weight is used, I still pass all 4 indices for 32 byte alignment)

Massive slowdown ( ~2fps ) vertex structure:
position : 3 floats
normal : 3 floats
texcoord : 2 floats
matrix indices : 4 ubytes
weights : 4 floats

Non-massive slowdown vertex structure:
position : 3 floats
normal : 3 floats
texcoord : 2 floats
matrix indices : 4 floats
weights : 4 floats
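
Written out as C structs, the two layouts look something like the sketch below. The struct and field names are made up, and the sizes in the comments assume 4-byte floats and no compiler padding, which check_vertex_layouts verifies with asserts.

```c
#include <assert.h>
#include <stddef.h>

/* Layout that hit the ~2fps path: indices as 4 unsigned bytes. */
typedef struct {
    float         position[3];
    float         normal[3];
    float         texcoord[2];
    unsigned char bone_indices[4];
    float         weights[4];
} VertexUbyteIndices;

/* Layout without the massive slowdown: indices as 4 floats. */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
    float bone_indices[4];
    float weights[4];
} VertexFloatIndices;

/* Checks the strides and offsets you would hand to the attribute pointers. */
void check_vertex_layouts(void)
{
    assert(sizeof(float) == 4);

    assert(offsetof(VertexUbyteIndices, bone_indices) == 32);
    assert(offsetof(VertexUbyteIndices, weights) == 36);
    assert(sizeof(VertexUbyteIndices) == 52);   /* stride in bytes */

    assert(offsetof(VertexFloatIndices, bone_indices) == 32);
    assert(offsetof(VertexFloatIndices, weights) == 48);
    assert(sizeof(VertexFloatIndices) == 64);   /* stride in bytes */
}
```

So the only difference the driver sees is the type and size of the index attribute at byte offset 32, plus the stride going from 52 to 64 bytes.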

Pop,

What specific attribute do you pass the indices through? If you pass them through the secondary color channel, will it still slow down? The theory here is that the hardware can only un-pack ubytes from certain hard-wired channel kinds (namely, colors).

Another theory might be that the hardware can’t unpack ubytes at all. It’d be interesting to see which theory is closer to reality.

I was using generic attributes. Specifically vertex.attrib[5]. The newest dev drivers seem to fix the problem though and now it works without slowdown passing ubytes through vertex.attrib[5]. I still get slowdown from using a large range in my PARAM array however.

I’m pretty sure Radeon 9500+ unpacks bytes. At least ATI’s OpenGL performance FAQ says it does, and I can’t see why they’d put incorrect info in it.

On second thought, secondary color isn’t a good option anyway. Sec color isn’t specified to have an alpha channel, so you’d lose the fourth index.