Display lists and vertex skinning

I’m currently doing vertex skinning on the CPU. Can anyone approximate what the performance difference will be if I do the skinning in a vertex shader and use a display list instead of a vertex array to render the model?

Everything from slower, to faster, depending on where you’re currently bottlenecked.

What kind of vertex arrays? Plain vertex arrays? ARB_vertex_buffer_object? NV_vertex_array_range? On what card?

What kind of skinning? How many bones? How many bones per vertex? Is the CPU version heavily optimized?

Where does a profiler say you’re spending your time? Do you have a full scene load, or is this rendering a single character? Do you have reason to believe that your current measurements will transfer over to the actual load seen in your final program?

I’m using plain vertex arrays. The card is a 9800 Pro with 128 MB of RAM.

The skinning is rigid, four bones per vertex. The number of bones varies from model to model but is never more than the vertex shader can handle (considering uniform parameter limits).

The CPU version is straightforward, not optimized.

The application is a model viewer, so there is only one model rendered, nothing else.

I would also like to ask if there is a way to query or calculate the amount of memory a display list will take on video memory.

Well, implement and measure. I’d suggest optimizing the skinning code on the CPU, and using an optimal vertex transfer method (such as mapped stream-draw vertex buffer objects). If you have to slice the mesh in many pieces for shader skinning, that’s likely to cause significant overhead.

There is no way of knowing how much memory a display list will take on a card – or whether it will go on the card at all. It might very well just sit in AGP memory, leaving the VRAM for high-bandwidth access for frame buffers and textures.

In the end, though: does your current version run fast ENOUGH? If so, why worry?

Originally posted by jwatte:
If you have to slice the mesh in many pieces for shader skinning, that’s likely to cause significant overhead.

What do you mean by that? Why would I need to change the mesh?

Originally posted by jwatte:
In the end, though: does your current version run fast ENOUGH? If so, why worry?
I guess I’m asking out of curiosity. I don’t really need it right now. If it was critical for me to know, I would have implemented it and measured the performance myself. I just thought someone here might have already implemented something similar and could tell me what the performance differences were.

I haven’t really used vertex buffer objects yet, but shouldn’t a display list (assuming it is stored in VRAM) be at least as fast as a VBO?

When you say to optimize the skinning code, do you mean using SSE instructions and such, or are there optimizations to be done to the skinning process itself? Could these optimizations be done in the vertex shader code?

I got CPU skinning for 4-bone vertices down to 100 cycles per vertex with SSE.

Take a 2 GHz CPU, for example: that buys you 20 M vertices/s, which is actually faster than some older cards can render with fixed function.

In practice, I am sending 32 bytes/vertex over the bus and hit a glass ceiling at 530 MB/s, so that’s about 15 M vertices/s.

So to see any speed gain from switching to HW skinning, your GFX card must be able to render a 50+ instruction vertex shader at more than 15 M verts/s. Currently, that’s an ATI 9800 Pro and not much else.

If you have a mesh with more bones than shader constant space, then you have to submit the mesh in multiple chunks.

But it doesn’t end there! Because the mesh may index bones with higher numbers than there are constant registers, you also have to renumber the bones within each chunk.

That’s what I mean by “slice up the mesh”.

When skinning, it’s usually faster to calculate a blended matrix, and then run the vertex through that, than to calculate vertex by matrix four times and then blend.

For best quality, and to support non-uniform, varying scale on bones, you might instead want to do this on the CPU, calculating vertices per bone instead of bones per vertex. I.e.:

clear vertex array
foreach bone
    foreach vertex affected by bone
        transform vert by bone, sum into array
draw array

Blending the matrix instead of the vertex sounds like a good optimization; I’ll try that.