How to structure data for HT&L vs Texture Sorting

How expensive are texture changes, assuming I
can keep most textures on the card across
frames?

The reason I’m asking is that with a skeletal
mesh animation system, it seems better to
store/render meshes on a per-bone basis than
in the traditional per-texture order, so as
to get the most benefit out of HT&L.

Assuming I have a medium-size mesh, with
perhaps 50 vertexes per bone and ten bones,
would the HT&L gain from doing 50 vertexes at
a time outweigh the penalty of switching
textures a few times per bone?

Ummm, I really hope you’re not planning on switching textures a few times every 50 vertices…

If you’re doing skeletal animation, can’t you use the same set of textures for the entire model?

  • Matt

You should follow Matt’s advice and keep all the textures in one so you never need to switch. Have you seen Unreal or Q3 skins? That’s how you should make the textures for your models.

Correction Gorg: Q3 skins are generally not kept in a single texture. Q3 skins are made up of one or more shaders. Each shader could use as many as 8 textures, though most use 1, 2, or 3. Q2 skins however were just single textures.

Unfortunately, I will not be able to use
just one texture per mesh. Perhaps one
texture SET (I’m going to require two texture
units, because this is special-purpose and
I’m lazy :slight_smile: ).

Anyway, what about the transformation and
representation? Does it pay off to send
about 50 vertexes at a time to the transform
engine, or would I be better off just coding it
up in assembly myself? And what about the
joints, which will have a few weighted
vertexes?

Originally posted by DFrey:
[b]Correction Gorg: Q3 skins are generally not kept in a single texture. Q3 skins are made up of one or more shaders. Each shader could use as many as 8 textures, though most use 1, 2, or 3. Q2 skins however were just single textures.[/b]

Thank you DFrey. To tell you the truth, I never bothered to have a look; I just saw on some website that the base texture was in one file only. Anyway, I still made my point!

Hardware T&L is basically free, so long as you don’t burden it by sending a new matrix every primitive or something ridiculous like that. You should not bother writing your own T&L routines without a really good reason to do so. Your app is still responsible for frustum and visibility-based culling, though.

Anyone who tells you SW T&L is faster than HW T&L doesn’t know what they’re talking about… either they’ve cheated (the SW isn’t doing as much work as the HW) or they’ve done a bad job with their HW T&L implementation. Among the benchmarks that have been quoted to “prove” this, I can classify pretty much every one of them into one of these categories.

State changes are still bad, and you should still minimize them, regardless of their type. Don’t send redundant state changes, minimize texture switches, and play with order of drawing at least a bit. Sometimes front-to-back is fastest, sometimes other methods are the way to go.
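
For example, filtering out redundant binds on the app side can be as simple as something like this minimal sketch (the cached-state variable and helper name are made up for illustration, and it assumes a single GL_TEXTURE_2D unit):

static GLuint current_texture = 0;

void bind_texture_cached(GLuint tex)
{
    // Only touch the driver when the binding actually changes;
    // redundant glBindTexture calls are pure overhead.
    if (tex != current_texture)
    {
        glBindTexture(GL_TEXTURE_2D, tex);
        current_texture = tex;
    }
}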

Batches of 50 vertices are on the small side. Remember that there is API call overhead every time you even call into the driver. Not a ton, but MS does a bit, and then we have to do boilerplate stuff for each entry point – for example, error checking of arguments. Batches should be large enough to make the API call overhead irrelevant but small enough to fit in the cache, if at all possible.

  • Matt

>Batches should be large enough to make the
>API call overhead irrelevant but small
>enough to fit in the cache, if at all
>possible.

I understand. However, my modeling skills
are such that getting even 50 vertexes in
something like a “lower leg” would be quite
an accomplishment :slight_smile:

What I’m hearing is that this loop is
preferable, though (because T&L is “free”):
(please excuse the blatant simplifications)

void transform_and_draw_bone(bone & b)
{
    transform_bone(b);
    draw_bone(b);
}

void draw_texture_geometry(texture & t)
{
    bind_textures(t);
    for_each(t.bones.begin(), t.bones.end(),
             transform_and_draw_bone);
}

for_each(used_textures.begin(), used_textures.end(),
         draw_texture_geometry);

as opposed to:

void draw_bone_texture(texture & t)
{
    bind_textures(t);
    draw_bone(b);   // b = the bone currently being drawn; real code would pass it in
}

void draw_bone_geometry(bone & b)
{
    transform_bone(b);
    for_each(b.textures.begin(), b.textures.end(),
             draw_bone_texture);
}

for_each(geometry_bones.begin(), geometry_bones.end(),
         draw_bone_geometry);

I’m currently looking into the fence and
vertex array range extensions, to see if they
are what I need (ideally, I’d like to pre-
transform various sections of the model, then
bind/draw textures on indexed vertexes from
that transformed set).
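
For reference, the basic setup I have in mind
is roughly the sketch below (extension entry
points from wglGetProcAddress, error handling,
and the actual sizes are all omitted or made
up):

// Rough sketch of NV_vertex_array_range plus NV_fence,
// assuming the entry points have already been fetched
// with wglGetProcAddress. Sizes are placeholders.
const int VAR_SIZE = 256 * 1024;

// Ask the driver for AGP memory suitable for the range
// (priority near 0.5 tends to mean AGP, near 1.0 video memory).
void *var_memory = wglAllocateMemoryNV(VAR_SIZE, 0.0f, 0.0f, 0.5f);

glVertexArrayRangeNV(VAR_SIZE, var_memory);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

// ... write vertices into var_memory, point glVertexPointer
// at it, and draw as usual ...

// NV_fence tells the app when the GPU has finished reading a
// region, so it is safe to overwrite it for the next batch.
GLuint fence;
glGenFencesNV(1, &fence);
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);
// later, before rewriting that region:
glFinishFenceNV(fence);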

Oh, and desperately trying to figure out how
to get all this to be interactive enough on
non-HT&L systems. With SSE or 3DNow!, it’ll
probably be good enough. Besides, I can save
that for later, and maybe it’ll be moot by
the time I get there :slight_smile:

I’m not exactly clear on what “transform_bone” means… does it mean you’re sending a new modelview matrix, or a pair of new modelview matrices (vertex weighting), or that you’re doing transforms on the app’s side?

You mention pretransforming vertices… it doesn’t exactly work that way. No T&L HW I am familiar with is capable of dumping the results out into a buffer and then reading them back in future rendering passes. Instead, almost all T&L HW transforms each vertex as it gets to it, unless it already has that vertex in its vertex cache. It stores the result in the cache, discarding an old vertex. If you want to pretransform, you have to do that on the CPU and then set up an identity projection and modelview matrix. This saves the GPU some work, but not much work – again, unless you have high polygon counts, you can consider T&L free.
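
If you really did want to pretransform on the CPU, it would look something like the sketch below; the array names and vertex format are placeholders, and the skinning/lighting math itself is up to you.

// Sketch of fully pretransformed drawing: the app has already
// transformed (and lit) the vertices, so both matrices are set
// to identity and the HW pipeline just passes them through.
void draw_pretransformed(const float *clip_xyzw,
                         const unsigned short *indices,
                         int index_count)
{
    glMatrixMode(GL_PROJECTION);
    glPushMatrix();
    glLoadIdentity();
    glMatrixMode(GL_MODELVIEW);
    glPushMatrix();
    glLoadIdentity();

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(4, GL_FLOAT, 0, clip_xyzw);
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, indices);
    glDisableClientState(GL_VERTEX_ARRAY);

    glPopMatrix();                  // restore modelview
    glMatrixMode(GL_PROJECTION);
    glPopMatrix();                  // restore projection
    glMatrixMode(GL_MODELVIEW);
}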

With the numbers of vertices you are talking about, I can virtually guarantee that you will not be T&L-limited to any large extent.

  • Matt

>You mention pretransforming vertices… it
>doesn’t exactly work that way. No T&L HW I
>am familiar with is capable of dumping the
>results out into a buffer and then reading
>them back in future rendering passes.
>Instead, almost all T&L HW transforms each
>vertex as it gets to it, unless it already
>has that vertex in its vertex cache.

I’m slowly grokking this.

transform_bone() would basically do a
glMultMatrix(). So, HT&L would apply that
matrix for each vertex it’s using, rather
than applying it while it’s snarfing the data
out of the vertex buffer. That’s good,
because then the cost of changing the matrix
is “only” the call into the ICD, not any
“re-transform” of any buffer (-segments).
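
So the per-bone loop I have in mind ends up
looking roughly like this (the structure and
field names are simplified, like before):

// Simplified per-bone draw loop: one matrix change plus one
// indexed draw per bone. The bone_draw structure is made up
// for illustration.
struct bone_draw
{
    float matrix[16];                // this bone's transform
    const unsigned short *indices;   // indices into the shared vertex array
    int index_count;
};

void draw_skeleton(const bone_draw *bones, int bone_count)
{
    glMatrixMode(GL_MODELVIEW);
    for (int i = 0; i < bone_count; ++i)
    {
        glPushMatrix();
        glMultMatrixf(bones[i].matrix);   // just a call into the ICD
        glDrawElements(GL_TRIANGLES, bones[i].index_count,
                       GL_UNSIGNED_SHORT, bones[i].indices);
        glPopMatrix();
    }
}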

Speaking of which, is the nVidia ICD memory
mapped so that you don’t have to enter the
kernel to send a new matrix, or do you have
to take the overhead of a DeviceIoControl()
(or similar) for each state change? (well,
in that case you could at least defer them
until you needed to push them to the hardware
but it’s still icky if you have to involve
Windows at all :stuck_out_tongue: )

I have enough to go on, I think. I’ve read up
on the Fence and VertexArrayRange stuff, and
it seems reasonably straightforward.

Back from reading through most of the extra
docs at nVidia’s site, it feels like.

Basically, I’m trying to get HT&L to do my
model animation for me “for free”. However,
there appears to be no way to specify a
different transform matrix for different
parts of a vertex_array_range for the same
call to glDraw{Range}Elements().

Under Direct3D, I can use the vertex
blending stuff to render one “bone” in my
skeleton, plus the joint to the next bone.
Then I can switch the matrixes, render the
next bone and its joint to the next bone,
etc until I’m done.

Can we have this functionality under OpenGL,
too, please? By Monday? :wink:

A DeviceIoControl for every state change? Ugh, you must be thinking of D3D or something. OpenGL renders in user mode (it’s a user-mode DLL) and uses kernel-mode stuff relatively infrequently. D3D, on the other hand, runs in user mode, but in a different process, in Win9x, and in the kernel (inside the display driver) on Win2K. The runtime batches up commands from the user and then calls the D3D driver to execute them. But that’s a different story, the story of how MS screwed up D3D… in short, with OpenGL, you don’t have to worry about user->kernel transition overhead much.

That doesn’t mean state changes don’t have a real cost, though.

There is no way, in either D3D or OGL, to get a state change inside a primitive. In D3D, you set your state and call Draw[Indexed]Primitive. In OGL, you set your state and do a Begin/End or a DrawElements. The difference here would seem to be that D3D doesn’t have a matrix stack, while OGL does…

You have to break up your list of indices into the part for each bone. This goes with both APIs…

  • Matt

>You have to break up your list of indices
>into the part for each bone. This goes with
>both APIs…

Yes, I came to that conclusion.

What I was missing was EXT_vertex_weighting.
I was looking for the SEARCH feature on the
Nvidia web site, and couldn’t find it, so in
the end, Google came to the rescue.
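
For anyone finding this thread later, the
per-joint blending ends up looking roughly
like the sketch below with EXT_vertex_weighting
(the names are my own simplifications, and
the entry point comes from wglGetProcAddress):

// Rough sketch of two-matrix blending with EXT_vertex_weighting.
// Each weighted vertex is transformed by
//   weight * MODELVIEW0 + (1 - weight) * MODELVIEW1.
// Matrix data and array pointers are placeholders.
glEnable(GL_VERTEX_WEIGHTING_EXT);

glMatrixMode(GL_MODELVIEW0_EXT);     // alias for the normal modelview
glLoadMatrixf(bone_matrix_a);
glMatrixMode(GL_MODELVIEW1_EXT);     // second modelview for the neighbouring bone
glLoadMatrixf(bone_matrix_b);

// per-vertex weights supplied alongside the positions
glEnableClientState(GL_VERTEX_WEIGHT_ARRAY_EXT);
glVertexWeightPointerEXT(1, GL_FLOAT, 0, joint_weights);

glDrawElements(GL_TRIANGLES, joint_index_count,
               GL_UNSIGNED_SHORT, joint_indices);

glDisableClientState(GL_VERTEX_WEIGHT_ARRAY_EXT);
glDisable(GL_VERTEX_WEIGHTING_EXT);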

Nice to hear OGL doesn’t need to go into the
kernel (much) to talk to the card. I suppose
at a minimum “swap” might need to (at least
indirectly) however :slight_smile:

Okay, I have enough to chew on for… uh…
quite some time now. Thanks, all!