Skinning on the GPU vs the CPU

So recently I have delved into modern OpenGL. I am learning about shaders (vertex and fragment for now) and wondering if I should use them for skinning.
The reason I switched to modern OpenGL and started thinking about this is that my comp has 32MB of VRAM, and using vertex arrays with the fixed pipeline was not working. I was getting some weird behaviors (vertices appeared to be sticking in place and popping back randomly).

So I switched to VBOs and just went down the rabbit hole.

SO. A vertex shader can have attributes and uniforms passed to it. But if I were to do skinning in a vertex shader, I would need to pass the transforms as uniforms for each bone associated with a vert, every frame. The reason I was considering vertex shaders was to reduce the CPU->GPU overhead, but it still appears to be there anyway…

If I am not uploading the pre-posed verts, I am uploading the pose transforms every frame instead; the overhead still exists, no?

As it stands now: I have quat & trans pairs that I use to transform verts and send those verts to the GPU every frame.
Is there a cleaner way to do this that is faster and more efficient, making more use of the GPU?

Define “the overhead”.

There are two costs associated with sending data to the GPU. One cost is fixed; every send, no matter how small, has some fixed cost. The other cost varies based on how much data you’re sending.

You must pay the fixed cost, since you have to be sending something to the GPU, either post-transformed vertices or the pose. So the only cost that varies is the size-based one. Unless your model is very tiny, or has a truly massive set of bones, the pose will likely be smaller. So, all things being equal, the pose should be faster.
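
For a rough sense of scale (numbers picked purely for illustration): a 30-bone pose at 12 floats per 4x3 matrix is about 1.4 KB per upload, while even a modest 5,000-vertex mesh at 32 bytes per vertex is around 160 KB per upload.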

Then again, how much faster depends on where you are losing performance. It may not make an actual improvement at all.

I’m more curious as to what kind of hardware you have that supports vertex shaders, yet only has 32MB of VRAM.

Right. There is always a cost: either the pose or the model, and the size is the only variable factor, but they appear to be the same.

Generally I will be sending about 3 bones' worth of data per vertex per frame. Wouldn't that level out to the same as just sending the pre-calculated mesh? Equal cost?
It's one of those moments where I have to make the call based on the data that I am rendering, I guess.

BTW I'm using a budget $200 Asus laptop from Amazon. It runs Blender3D just fine with OpenGL 4, but it has only 32MB of VRAM and, I believe, 4GB of RAM.

Why would the number of bones be “per vertex”? Yes, each vertex uses a set of bones, but you wouldn’t make the bones per-vertex data. The set of bones is per-model data, from which each vertex references one or more bones to use.

The bones should be a uniform array (or even better, a uniform buffer), since they don’t change per-draw command.
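
Roughly speaking, the declaration could look like this (the block name, the std140 layout, and the 64-bone cap are just placeholder choices):

#define MAX_BONES 64

// One transform per bone for the whole model, shared by every vertex in the draw call.
layout(std140) uniform BoneData
{
    mat4x3 bones[MAX_BONES];   // indexed by the per-vertex bone indices
};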

Okay. Point taken. Appreciated. But still, all things considered, if we were to define a buffer it would fail for linear interpolation…
Unless we can/should linearly interpolate inside the shader? Then it's game over… & Alfonse Reinheart; thanks for playing, you've been a big help.

You don’t want linear interpolation as done by a texture sampler - you need to do that yourself. Generally you have a list of bones and a list of weights per vertex, and you look up the matrix for each bone, multiply it by its weight, and add the matrices together. Having a limit of four bones is the easiest to start with, as you can send, per vertex, ‘ivec4 boneIndex; vec4 boneWeight;’.
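
Something along these lines, as a rough sketch (the attribute names, the 64-bone cap, and the plain mat4 type are placeholder choices):

in vec3  vertexPosition;
in ivec4 boneIndex;    // up to four bone indices per vertex
in vec4  boneWeight;   // matching weights, expected to sum to 1

uniform mat4 bones[64];   // one matrix per bone in the model

void main()
{
    // Blend the four bone matrices by their weights, then transform the vertex once.
    mat4 skin = bones[boneIndex.x] * boneWeight.x
              + bones[boneIndex.y] * boneWeight.y
              + bones[boneIndex.z] * boneWeight.z
              + bones[boneIndex.w] * boneWeight.w;
    gl_Position = skin * vec4(vertexPosition, 1.0);   // projection/view omitted for brevity
}

Note that integer attributes like ivec4 need to be specified with glVertexAttribIPointer (not glVertexAttribPointer) on the application side.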

All things considered if we were to define a buffer it would fail for linear interpolation…

I don’t know what “linear interpolation” means in the context of skinning.

OK, some confusion here.

For skinning on the GPU you use an array of bone matrices (which are friendlier for GPUs than quaternions) - one per bone in the model. I’ll use standalone uniforms here rather than a UBO for the sake of code clarity. A 4x3 (or 3x4 depending on your chosen flavour of poison) matrix is sufficient.

uniform mat4x3 boneMatrices[MAX_BONE_MATRICES]; // define MAX_BONE_MATRICES to the maximum you need to support

So if your model has 30 bones you only need to send 30 bone matrices; the per-vertex data is and will remain static.

Calculate the bone matrices on the CPU and send them.

Each vertex has bone indices as part of its attributes; one integer per index and typically 4 indices. Each vertex also has a blend weight (if you’re using them). This is totally static data and lives in a static VBO.

To run the skinning, in your vertex shader you do something like:

position = (boneMatrices[boneIndices.x] * vertexPosition) *  blendWeights.x +
    (boneMatrices[boneIndices.y] * vertexPosition) *  blendWeights.y +
    (boneMatrices[boneIndices.z] * vertexPosition) *  blendWeights.z +
    (boneMatrices[boneIndices.w] * vertexPosition) *  blendWeights.w;

So, the only data you’re sending each frame is the bone matrices, and you only need to send as many as the current model has bones, up to your pre-defined maximum. Everything else is static data. Using glUniformMatrix you can send them all in one go (rather than one at a time) which your GPU and driver will also love you more for.
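
For example, assuming the matrices are packed contiguously in a float array (12 floats per 4x3 matrix, column-major; the variable names here are placeholders):

// Send every bone matrix for this model in a single call.
GLint loc = glGetUniformLocation(program, "boneMatrices");
glUniformMatrix4x3fv(loc, boneCount, GL_FALSE, boneMatrixData);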

As well as performance measurable by how much data you send (which is not always a reliable metric) you also should be measuring performance by how you distribute work between the CPU and the GPU. Different workloads are differently suited to each processor, and most machines should easily be able to run skinning on the GPU - even a relatively weak one like yours (Intel graphics?) - much faster than on the CPU, because (1) most data can remain static, and (2) the GPU is just faster for this kind of calculation.

Very clear. My confusion was that I was thinking you could send the WHOLE animation and just let it SIT on the GPU: aka matrices[frames][bone].
Then on the GPU you could linearly interpolate between the values and never leave the GPU. Based on the replies I'm doubting it's possible; is it possible? (Probably not highly recommended though, or one of you guys would have mentioned it, I'm guessing.)

So now it all makes sense. It ties all the way back to the 1st reply. Sending and updating the matrices is faster because:

the pose will likely be smaller. So, all things being equal, the pose should be faster
and

(1) most data can remain static, and (2) the GPU is just faster for this kind of calculation.

Very clear. My confusion was that I was thinking you could send the WHOLE animation and just let it SIT on the GPU: aka matrices[frames][bone].

You could, but for a single character you likely don’t have to, as passing such a small number of matrices is unlikely to be the bottleneck.

If you were animating dozens of characters in a crowd via instancing though, you’d want something similar to that approach (with frames replaced by instance). We use that approach for rendering background crowd animation.

Very clear. My confusion was that I was thinking you could send the WHOLE animation and just let it SIT on the GPU: aka matrices[frames][bone].
Then on the GPU you could linearly interpolate between the values and never leave the GPU. Based on the replies I'm doubting it's possible; is it possible? (Probably not highly recommended though, or one of you guys would have mentioned it, I'm guessing.)

You could, but it wouldn’t give you very good results.

First, this would have to be in a uniform block. Second, there are very strict limits on uniform block sizes, and longer animations can exceed those limits. Now, you can get around this problem by passing only the two adjacent frames' matrices in the uniform block. Effectively, when you call glBindBufferRange for the uniform buffer, you would be binding a range containing only the two neighboring frames' worth of animation data, rather than the animation's entire data. Or you can use an SSBO, which has much larger limits.
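
If you go that route, the relevant limits and the required range alignment can be queried at runtime, roughly like this:

GLint maxBlockSize = 0, offsetAlignment = 0;
glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, &maxBlockSize);              // maximum size of a uniform block
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &offsetAlignment);  // glBindBufferRange offsets must be multiples of this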

Third, you would have to do per-frame interpolation on the GPU. For each vertex, you would have to blend together the sets of matrices used by that vertex. That’s expensive, and needlessly so.

Fourth, your animation’s data must be completely decompressed. Obviously for simple systems, this is minor. But in actual production code, where memory is at a premium, compressing animation data is not optional. Though I suppose nowadays you could use a GPU compute process to decompress it.

But there’s an even more important reason not to: you can’t use this to interpolate between animations. Normally, there is blending between animations, so that when you switch animations there isn’t a pop. You also would be unable to d

Again, GPU compute could be employed to do this kind of animation work. That way, the vertex shader for that object would only need one frame’s worth of bone data.

I haven’t tested this, but I think it may be possible to glBindBufferRange two different ranges of the same UBO to two different binding points. The man page text for glBindBufferRange certainly doesn’t preclude it (and it’s not listed among the error conditions), but a closer review of the spec would be needed to establish if it is actually allowed.

In other words:

glBindBufferRange (GL_UNIFORM_BUFFER, 0, matricesUBO, frame1offset, framesize);
glBindBufferRange (GL_UNIFORM_BUFFER, 1, matricesUBO, frame2offset, framesize);
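
On the shader side that could pair with two identically laid-out uniform blocks, one per binding point. As a rough sketch (block and variable names here are just placeholders; the blocks would be mapped to binding points 0 and 1 with glUniformBlockBinding):

#define MAX_BONES 64

layout(std140) uniform Frame0 { mat4x3 bonesA[MAX_BONES]; };   // bound to binding point 0
layout(std140) uniform Frame1 { mat4x3 bonesB[MAX_BONES]; };   // bound to binding point 1

uniform float frameBlend;   // 0..1, how far we are between the two keyframes

mat4x3 blendedBone(int i)
{
    // Naive linear blend of the two keyframes' matrices for bone i.
    return bonesA[i] * (1.0 - frameBlend) + bonesB[i] * frameBlend;
}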

I did a pass through the spec surrounding glBindBufferRange. I didn’t see anything that would prohibit that from working. And I don’t see how it would make sense for it to fail.

The only thing I can think of is overlapping ranges, but that’s more a case of me looking for excuses for why it might fail rather than a serious objection to it.

You can also get around it by using a texture.

Alternatively, you could use a compute shader for this step.

And if you’re performing interpolation, you really would want to use quaternions rather than matrices.

But as malexander suggests, there isn’t much benefit unless you’re animating multiple models with the same skeleton. Otherwise, you’re probably better off just uploading the bone matrices each frame. Particularly if you have a shortage of VRAM.

Yes. That’s by far the best way to go. For basic single animation track skeletal playback, you can animate thousands of skeletally animated characters this way – much, much more cheaply than you can do it on the CPU.

The reason I switched to modern OpenGL and started thinking about this is that my comp has 32MB of VRAM, and using vertex arrays with the fixed pipeline was not working. I was getting some weird behaviors (vertices appeared to be sticking in place and popping back randomly).

So I switched to VBOs and just went down the rabbit hole.

I can relate. VBOs require some reading and experience to learn how to use them fast. If you have questions on that, I’d suggest starting a different thread rather than mixing that in here. I and others will be happy to help with tips and suggestions.

SO. A vertex shader can have attributes and uniforms passed to it. But if I were to do skinning in a vertex shader, I would need to pass the transforms as uniforms for each bone associated with a vert, every frame. The reason I was considering vertex shaders was to reduce the CPU->GPU overhead, but it still appears to be there anyway…

For basic skeletal animation (basic, single animation track playback), you don’t need to upload anything per frame except a tiny uniform that defines the current time and which animation track the character is on. You let the GPU handle everything else (specifically, in a vertex shader you write).

You can pre-upload all your bind-pose skeletal meshes to a VBO, and pre-upload all your skinning transforms (as matrices, quaternion-translation pairs, or [better] dual quaternions :) ) to a texture. Then come render time, in the vertex shader you just sample the appropriate joint transforms from adjacent keyframes based on the current time (if you want to support keyframe interpolation) and also based on joint weights (if you support smooth skinning), compute the aggregate transform, and then use that to transform the vertex position. Conceptually very simple (but with skeletal, there’s lots of fun in the details!)
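
As a rough sketch of what that lookup could look like in the vertex shader, assuming one keyframe per texture row and three RGBA32F texels per joint (the rows of its 3x4 matrix); all the names and the layout here are just placeholder choices:

uniform sampler2D jointTex;    // keyframes in rows, 3 texels per joint (rows of a 3x4 matrix)
uniform float     animTime;    // current time, already mapped to a fractional keyframe index

in ivec4 jointIndex;           // up to four joints per vertex
in vec4  jointWeight;          // matching weights

// Fetch the 3x4 transform of one joint at one keyframe.
mat4x3 fetchJoint(int joint, int keyframe)
{
    vec4 r0 = texelFetch(jointTex, ivec2(joint * 3 + 0, keyframe), 0);
    vec4 r1 = texelFetch(jointTex, ivec2(joint * 3 + 1, keyframe), 0);
    vec4 r2 = texelFetch(jointTex, ivec2(joint * 3 + 2, keyframe), 0);
    // GLSL matrices are column-major, so transpose the three fetched rows into a mat4x3.
    return transpose(mat3x4(r0, r1, r2));
}

// Interpolate one joint between the two keyframes surrounding animTime.
mat4x3 sampleJoint(int joint)
{
    int   k0 = int(floor(animTime));
    float t  = fract(animTime);
    return fetchJoint(joint, k0) * (1.0 - t) + fetchJoint(joint, k0 + 1) * t;
}

Each vertex then blends the sampled joint transforms by its joint weights, exactly as in the matrix examples above; with dual quaternions you would fetch two vec4s per joint instead and normalize after blending.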

If I am not uploading the pre-posed verts, I am uploading the pose transforms every frame instead; the overhead still exists, no?

For this basic single-animation-track playback, you are pre-uploading the pre-posed (bind pose) verts and pre-uploading the full skinning transforms (for all tracks/keyframes/joints). So that overhead doesn’t exist.

However, when you get further along and want to support more complex animations (feathering, blending, IK, etc.) then – for characters that require more complex animation – you may decide to compute your skinning transforms on-the-fly on the CPU, and then upload them per-frame to the GPU (for those characters which can’t use the simple single-track method), but of course that’s a little more expensive. This’ll probably reduce the number of characters you can animate at the same frame rate by an order of magnitude or two. You’re still skinning on the GPU, but you’re now computing pose transforms dynamically on the CPU (whereas before the per-joint pose transforms were all precomputed and preuploaded, requiring no per-frame “compute” cost on the CPU). Alternatively, you might enhance your implementation to put this dynamic pose transform generation on the GPU.
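
For that dynamic case, the per-frame upload itself is small; a minimal sketch (buffer and array names are placeholders, assuming 12 floats per 4x3 pose matrix):

// Upload the CPU-computed pose for this frame into a uniform buffer,
// then attach it to the binding point the skinning shader reads from.
glBindBuffer(GL_UNIFORM_BUFFER, boneUBO);
glBufferSubData(GL_UNIFORM_BUFFER, 0, boneCount * 12 * sizeof(float), poseMatrices);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, boneUBO);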

As it stands now: I have quat & trans pairs…

Sounds like you’re on the right track. But read up on Dual Quaternions! Your joints will thank you:

The first link is a good CG intro to DQ; the second is the “meaty” stuff (shader code, technical papers, etc.)

…that I use to transform verts and send those verts to the GPU every frame.
Is there a cleaner way to do this that is faster and more efficient, making more use of the GPU?

Yep! :) And we’re just scratching the surface here!

(By the way, none of this requires a compute shader, and all the complexity that brings)

Just ask if you have more questions!