Hardware Skinning Question

I wonder what the best choice is
for performing hardware skinning for
real-time character animation:
the vertex blend extension or a vertex program?

In my opinion the second one seems
more flexible (number of matrices
per vertex, number of bones),
better supported on current graphics cards,
and more appropriate for performing
lighting computations as well.

Drawbacks: less efficient and more
complicated to implement.

Is this right?
Does anyone have an idea of
what the best choice would be?

Thanks in advance.

Luc

I’ve just implemented hardware vertex skinning in my program, and I’ve found that ARB_vertex_program (through Cg in my case) was very effective.

A couple of tips: you're limited in the number of uniforms (locals) you can use, so it's best to send the bone matrices as 4x3 matrices (three 4-element vectors per matrix).

If you limit the number of bones affecting each vertex to 4 then you can send the vertex weights and indices as just 2 vertex attributes.

Your skeletons need to be stored so that you have a matrix for each bone that describes the transformation from the initial pose (so that the identity matrix corresponds to the initial pose). That way you only need to send the bone matrix to the shader, and you don't need additional per-vertex information such as offset vectors.

It’s a good idea to have a software fallback in case vertex programs aren’t supported, or you need to use more bones than you have registers available to store them.

[This message has been edited by bunny (edited 02-13-2004).]

Thanks for your answer.
Your explanation confirms the way I want to do the job.

Do you have any performance information about software and hardware skinning?
Source code?

Moreover, I have 25 bones; which card models (ATI/NVIDIA) have the needed number of registers?

Luc.

The minimum requirement for ARB_vertex_program is 96 parameter registers = 32 bones (and nothing else). 25 bones will take up 75 registers.
With newer cards the number of parameters is much larger. GFFX can handle 256…

You can also use a unit quaternion more or less directly instead of a rotation matrix. This will save space for bone representation but it will cost computation and temporaries. This may or may not be a worthwhile tradeoff, depending on vertex count vs bone count.

I have found some very old code of mine flying around, I’m sure it can be adapted to ARB_vp:

void
transform(Vector3& target,const Vector3& src,const uQuaternion& xform)
{
	// These depend on the quaternion passed in, so they must be
	// recomputed on every call -- they cannot be 'static'.
	const float xx=xform.x*xform.x;
	const float xy=xform.x*xform.y;
	const float xz=xform.x*xform.z;
	const float xw=xform.x*xform.w;
	const float yy=xform.y*xform.y;
	const float yz=xform.y*xform.z;
	const float yw=xform.y*xform.w;
	const float zz=xform.z*xform.z;
	const float zw=xform.z*xform.w;

	// Write to temporaries first so that an aliased target/source
	// vector works without a separate copy.
	const float tx=src.x+2.0f*(-src.x*(yy+zz)+src.y*(xy-zw)+src.z*(xz+yw));
	const float ty=src.y+2.0f*( src.x*(xy+zw)-src.y*(xx+zz)+src.z*(yz-xw));
	const float tz=src.z+2.0f*( src.x*(xz-yw)+src.y*(yz+xw)-src.z*(xx+yy));
	target.x=tx;
	target.y=ty;
	target.z=tz;
}

Originally posted by claustre:
[b]Thanks for your answer.
Your explanation confirms the way I want to do the job.

Do you have any performance information about software and hardware skinning?
Source code?

Moreover, I have 25 bones; which card models (ATI/NVIDIA) have the needed number of registers?

Luc.[/b]

I haven’t done any formal benchmarks, but skinning 100 birds with 10 bones in each, roughly 125 vertices per mesh, I got about an 80% improvement in frame rate over software skinning, although the software skinning isn’t optimised for vector operations (no SSE etc). The frame rate with hardware skinning was almost the same as with skinning disabled altogether.

Gamasutra.com has an article taken from the Cg tutorial with better descriptions and a Cg code implementation. See http://www.gamasutra.com/features/20030325/fernando_05.shtml

Uhm, about the limitation of input parameters: I used state.matrix.program[x] to get the matrices into the vertex program. I don't know if this limits the usable parameters, but it doesn't seem to. (Another way would be using the state.matrix.modelview[x] parameters if the vertex blend extension is supported.)

I couldn’t get access to more than 8 program matrices in Cg. The problem with matrix blend is that one or other of the major chipset makers (I forget which) has decided to no longer support it. Perhaps someone can confirm/clarify this?

The problem with matrix blend is that one or other of the major chipset makers (I forget which) has decided to no longer support it.

Well, I’m pretty sure nVidia never actually supported the extension. That extension came out right around the time of NV_vertex_program, which nVidia declared superior (rightly so, considering that it did so much more). And, since ATi wasn’t nearly as big a player, they alone couldn’t carry the functionality.

Hmm, on my R300 I use 32 program matrices… haven't tested it on NV hardware though.

Originally posted by Chuck0:
[b]hmm on my r300 i use 32 program matrices… havent tested it on nv hardware though[/b]

It’s probably a Cg compiler bug. It gives me an array index out of range error if I go above 8. If you can access 32 with ARB_VP then that’s probably the best way to go.

I submit my matrices as program locals, not sure if there’s a better way to do it or not. My vertex program looks like this:

!!ARBvp1.0

# Purpose: Vertex lighting for 3-weight skeleton

ATTRIB iPos = vertex.position;
ATTRIB iNormal = vertex.normal;
ATTRIB iTexCoord = vertex.texcoord;
ATTRIB iBoneIdx = vertex.color;
ATTRIB iWeights = vertex.attrib[6];
PARAM wvp[4] = { state.matrix.mvp };
PARAM ambientCol = program.local[0];
PARAM diffuseCol = program.local[1];
PARAM lightDir = program.local[2];
PARAM bones[%i] = { program.local[%i..%i] };
OUTPUT oPos = result.position;
OUTPUT oColor = result.color;
OUTPUT oTexCoord = result.texcoord;
ADDRESS a;
TEMP mv0, mv1, mv2, modelNormal, dot, tempPos, boneIdx;

# Pass through the texcoords

MOV oTexCoord, iTexCoord;

# Extract the bone indices that were submitted as uchars

MUL boneIdx, iBoneIdx, 255.9;

# Compute the modelview matrix by summing up the weighted bone matrices

ARL a, boneIdx.x;
MUL mv0, bones[a.x+0], iWeights.x;
MUL mv1, bones[a.x+1], iWeights.x;
MUL mv2, bones[a.x+2], iWeights.x;

ARL a, boneIdx.y;
MAD mv0, bones[a.x+0], iWeights.y, mv0;
MAD mv1, bones[a.x+1], iWeights.y, mv1;
MAD mv2, bones[a.x+2], iWeights.y, mv2;

ARL a, boneIdx.z;
MAD mv0, bones[a.x+0], iWeights.z, mv0;
MAD mv1, bones[a.x+1], iWeights.z, mv1;
MAD mv2, bones[a.x+2], iWeights.z, mv2;

# Transform the normal to model coordinates

DP3 modelNormal.x, mv0, iNormal;
DP3 modelNormal.y, mv1, iNormal;
DP3 modelNormal.z, mv2, iNormal;

# Compute the diffuse lighting

DP3 dot.x, modelNormal, lightDir;
LIT dot, dot;

# Calculate final vertex color

MAD oColor, dot.y, diffuseCol, ambientCol;
MOV oColor.w, 1.0;

# Calculate the final vertex position

DP4 tempPos.x, mv0, iPos;
DP4 tempPos.y, mv1, iPos;
DP4 tempPos.z, mv2, iPos;
MOV tempPos.w, 1.0;

DP4 oPos.x, wvp[0], tempPos;
DP4 oPos.y, wvp[1], tempPos;
DP4 oPos.z, wvp[2], tempPos;
DP4 oPos.w, wvp[3], tempPos;

END

If someone tries zeckensack’s way, can you post the number of instructions your VP compiles to?

The vertex blend extension never performed very well, and NVIDIA has removed support for it from their version 50 and up drivers.

Note that with vertex blend, ATI gave you 4 matrices, but NVIDIA only gave you two. What’s worse – you only get two matrices PER TRIANGLE, not per vert, when you do this. Then you have to cut your model up into zillions of little itty-bitty pieces, each of which only references two matrices. It’s ludicrously bad in state changes, restricts your art tremendously, and doesn’t perform well at all.

If you don’t want to use ARB_vertex_program, then you should do it on the CPU, in my opinion.