Rendering instances of animated mesh

Hi,

Which approach is preferable for rendering a lot of instances of a relatively simple (~3000 triangles) skinned mesh? The animation of each instance is unique. Unfortunately, the method described here is not feasible for my meshes, because I typically have more than 20 bones per mesh. As I understand it, NVIDIA hardware does not have constant registers; instead, constants are baked into the shader microcode. Thus, each time I change the constants, the driver has to recompile the shader, reset the vertex pipeline, etc.

I’ve done some tests and the results are confusing, IMHO. The problem is that I don’t see a performance difference between “pseudo-instancing” and changing the shader constants.

I’d really appreciate your ideas on which rendering method to use. I was considering switching to frame-based animation (as in Quake 1-3), but I don’t see a fast way to do that either.

Something’s getting lost in your post here. From the context of the paper, you seem to be claiming that specifying a uniform requires a complete shader recompile (you said globals); that isn’t correct. If that’s not what you meant by globals, then in this context it’s not clear what you are saying. The technical paper is merely minimizing the per-instance matrix work by using attributes to feed transformation information to the shader cheaply. The theory is that this avoids CPU modelview matrix work, whether you use uniforms or coordinates. The further claim is that coordinates are cheaper than a uniform for the matrix because they eliminate the additional validation forced by a uniform update. None of these techniques needs a shader recompile.
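To put a number on what’s actually being saved, here is a minimal C sketch (my own illustration, not from the paper) of the per-instance CPU work in question: composing modelview = view * model for every instance. Pseudo-instancing instead sends the model matrix rows as vertex attributes and lets the GPU do this multiply.

```c
#include <assert.h>

/* Row-major 4x4 multiply: out = a * b.
   This is the per-instance CPU work (view * model) that
   pseudo-instancing moves onto the GPU by passing the model
   matrix rows in as vertex attributes. */
static void mat4_mul(const float a[16], const float b[16], float out[16])
{
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a[r * 4 + k] * b[k * 4 + c];
            out[r * 4 + c] = s;
        }
}
```

With N instances, the CPU-side path does N of these multiplies (plus N matrix uploads) per frame; the attribute path does none of them on the CPU.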

The absolute limit is not the number of bones in a mesh but the number of transformations you need to apply to a batch of vertices.
You seem to assume that every vertex needs to see every bone, and that’s just not the case. In fact, your chain with lots of bones (lots of zero weights, I assume) would be ridiculous and the very antithesis of this paper’s approach.

A shader doesn’t need to be recompiled when attributes change, but it depends on what you’re talking about w.r.t. constants. The bone matrices, and the weights for those matrices, can be set for each group of vertices without recompiling a shader (a horrific prospect if you ever had to do it at such a fine granularity).

So, batch your verts to limit the bone count, and specify matrices between batches and weights per vertex. Whether these matrices count as constants depends on your definition of constants; either way, they don’t need a shader recompile. Trying to instance your bone matrices is a secondary consideration; limiting bones per vertex batch should be the primary concern.
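For what the per-vertex work looks like once bones are limited, here is a hypothetical C sketch of the two-bone blend (the row-major matrix layout is my own assumption): the skinned position is just w0*(M0*v) + w1*(M1*v), so a batch only needs the matrices for the bones its vertices actually reference, not every bone in the skeleton.

```c
#include <assert.h>

/* Transform a point by a row-major 4x4 matrix (w assumed 1). */
static void mat4_xform(const float m[16], const float v[3], float out[3])
{
    for (int r = 0; r < 3; ++r)
        out[r] = m[r*4+0]*v[0] + m[r*4+1]*v[1] + m[r*4+2]*v[2] + m[r*4+3];
}

/* Two-bone skinning: out = w0*(M0*v) + w1*(M1*v), with w0 + w1 = 1.
   Only the two matrices referenced by this vertex need to be bound
   for the batch containing it. */
static void skin2(const float m0[16], const float m1[16],
                  float w0, float w1, const float v[3], float out[3])
{
    float p0[3], p1[3];
    mat4_xform(m0, v, p0);
    mat4_xform(m1, v, p1);
    for (int i = 0; i < 3; ++i)
        out[i] = w0 * p0[i] + w1 * p1[i];
}
```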

The lesson to take away from this is that you have a blank sheet: you can use any incoming shader attributes to send any information you like. You’re not at all bound by the labels the fixed-function pipeline attaches to things. And of course, you can exploit the persistence of some per-vertex attributes in horrible ways to eliminate a matrix update here or there :-/.

(Edit: misspoke here, the ‘instancing’ is purely eliminating the validation for the uniform update)

So, understand what you’re saving by using this: a matrix uniform update in the shader and/or modelview maintenance overhead on the CPU. What YOU save in your app depends on things that aren’t clear from your post.

The trick is to use dangling vertex attributes instead of constants.

Just like in

glTexCoord2f()
glBegin()
glVertex3f()
glVertex3f()
glVertex3f()
glVertex3f()
glEnd()

each vertex gets the same texcoord, because if it is not changed, the last value stays in the hardware register.

This is not a vertex shader constant.
Also, Nvidia boards have true constants in the vertex shader, just not in the pixel shader, IMHO.

Yup, I got that part. I slightly misunderstood: the value of attributes vs. uniforms is purely in eliminating the flush/validation for uniform updates (the introduction calls these constants and I missed that). I’ve corrected my post in a few places.

dorbie

Something’s getting lost in your post here
I’m sorry for my English :( And thanks for trying to understand my question.

Uniform specification requires a complete shader recompile (you said globals)
That’s my guess. As far as I understand, the point of that paper is as follows: NVIDIA doesn’t really have constant registers (this is true for pixel shaders, but I’m not sure about vertex shaders). The driver puts the value of a constant (by constant I mean program.env[n]) directly into the shader microcode. Thus, each time I update the constant, the driver has to patch the microcode. This is where “pseudo-instancing” helps: instead of changing the microcode, the driver merely issues a command to change a vertex attribute (i.e. vertex.attrib[n]). It is important that the latter command can be queued, so no pipeline flush occurs.

I wrote a test app. It renders 1600 instances of a mesh at a decent framerate. What’s bothering me is that I don’t see any difference in performance between:

  1. Setting per-instance constants via glTexCoord
  2. Setting per-instance constants via glProgramLocalParameter
  3. Using glVertexPointer (also per instance) to set up morphing-based animation

I suspect that one of these should be preferable; I just don’t know how to figure out which one.

Christian Schüler

Also, Nvidia boards have true constants in the vertex shader, just not in the pixel shader, IMHO.
These guys say that pixel shaders do not have constant registers.

One of the reasons for such a loss is that the NV30 stores constants as additional instructions in the shader’s code. Presumably, the instructions used for constants cost both memory and clocks.
Edit: Sorry, at first I didn’t understand your post. You were saying that the vertex processor has constant registers, and the pixel processor does not. Thanks; if that’s true, it explains some of my results.

It doesn’t require a recompile. The tech report is very clear on this: the uniforms cause a revalidation & flush, which sounds more like register mapping than a recompile. The preference probably depends on what your application is doing. One principle is offloading additional transformation to the GPU (the extra/separate model matrix instead of a single modelview) in exchange for reducing the per-instance overhead and/or CPU matrix management. It’ll depend heavily on the complexity and number of instances, how loaded your CPU is (and the GPU on the vertex side), among other things.

It’s still not clear what your tradeoffs are.

dorbie

It doesn’t require a recompile. The tech report is very clear on this, the uniforms cause a revalidation & flush
OK, it may not be a full recompile, but flushing the pipeline doesn’t sound good either.

The preference probably depends on what your application is doing.
Unfortunately, I don’t understand the tradeoffs myself. I was just hoping that maybe someone had already run experiments for a similar case and figured out which method is fastest. I’m trying to render ~1000 instances of the mesh. I’m using geometric LOD, so each instance contains 1-3000 triangles. I guess the CPU overhead for computing the transformation matrix is negligible. The vertex shader is simple (just skinning with 2 bones per vertex).

Well, it also depends on the details of the shaders; you keep mentioning your bone skinning (now we’re down to a couple of bones, whew! :-)).

One thing the instancing demo does is introduce an additional matrix * vertex transformation and eliminate the per-instance modelview multiplication. That’s a saving on the CPU side right off the bat, uniform or not.

There’s no indication of what you do in detail. I assume your shader looks pretty much the same for two of your options, but a heck of a lot simpler for the CPU-driven skinning. That will only start to bite when you run some real application code, but it could be a performance win. It may simply depend on how you send your vertices.

I would expect some difference between your vertex skinning and data skinning, but if you’re not optimizing how you send the vertices for your memory usage patterns, you’re wasting your time measuring anything.

In your case there is a tradeoff between dispatch from CPU-writeable system memory vs. instruction count. If you’re not using display lists, then you should be using a VBO to allocate the vertex data, changing the usage flags for the different shader & CPU skinning combos, and hoping the driver does the smart thing.

Originally posted by Muromec:
Unfortunately, I don’t understand tradeoffs myself. I was just hoping that maybe someone has already made experiments for the similar case and figured out which method is the fastest.
I just did a little test in our engine:

300 characters, ~700-1300 tris
2 passes (600 skeleton updates)
~20 bones/skeleton (updated with a single glUniform4fv)

30 fps on a GeForce FX 5900. IIRC, a 6800 reaches ~550 characters (again at 30 fps, in a similar scene). And, believe me, updating the uniforms is not the bottleneck.
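For scale, a rough C sketch of the upload traffic those numbers imply; the three vec4s per bone (a 3x4 matrix) is my own assumption, not stated above:

```c
#include <assert.h>

/* Rough estimate of floats uploaded per frame for skeletal data:
   skeleton updates * bones per skeleton * vec4s per bone * 4 floats.
   The 3 vec4s per bone (a 3x4 matrix) is an assumption. */
static unsigned floats_per_frame(unsigned updates, unsigned bones,
                                 unsigned vec4s_per_bone)
{
    return updates * bones * vec4s_per_bone * 4;
}
```

With 600 skeleton updates of ~20 bones each, that is on the order of 144K floats (~576 KB) per frame, which is small next to the vertex traffic and consistent with the uniform updates not being the bottleneck.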

I’m trying to render ~1000 instances of the mesh.
I don’t think that’s a reasonable goal for skinned meshes. A couple hundred, yes, but not a thousand. Especially if the average model size is 1500 triangles; that’s 1,500,000 tris per frame.

dorbie

I would expect some difference between your vertex skinning and data skinning, but if you’re not optimizing how you sent the vertices for your memory useage patterns you’re wasting your time measuring anything.
Sorry, I didn’t make it clear that I haven’t written a correct shader for frame-based animation. I just added a call to glVertexPointer per instance (I’d have to do this for frame-based animation). I noticed that there’s no performance hit from this call, which was confusing.
I’m not quite sure about the optimizations you mention. Do you mean optimizing for the pre-/post-TnL caches?

spasi
Thanks for info! And where’s your bottleneck? (just curious)

Korval
Well, why not? I guess rendering 1.5M triangles/frame doesn’t scare my GF6800 =)

Have you tried profiling your code? How much time is spent calculating/setting the modelview matrix, calling the drawing functions, etc.?

nik_bg
On a GF5600 the program spends 1-5% of the time setting constants, glDrawElements takes 80-95% of the time, and SwapBuffers takes 1%.
Inserting 20 additional glTexCoord or glProgramEnvParameter calls per instance does not change anything.

Well, why not? I guess rendering 1.5M triangles/frame doesn’t scare my GF6800 =)
Because 1.5M triangles per frame equates, at 75 fps, to 112.5M triangles per second, which is well beyond any real-world achievable triangle throughput.
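The arithmetic behind that figure, as a trivial C sketch:

```c
#include <assert.h>

/* Triangle throughput: instances * triangles per instance * frames per
   second. 1000 instances * 1500 tris * 75 fps = 112,500,000 tris/s. */
static unsigned long long tris_per_second(unsigned instances,
                                          unsigned tris, unsigned fps)
{
    return (unsigned long long)instances * tris * fps;
}
```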

Dunno; in my app I often have ~250,000 polygons (tris + quads) onscreen, and ~500,000 onscreen with shadows enabled. This runs in real time on a GeForce FX 5900, all very unoptimized, e.g. rendered with vanilla VAs, heaps of error checking, etc.