Push/Pop vs. LoadMatrix?

I suppose that this is somewhat driver related, but I thought I’d ask your opinion…

Would it be faster to replace a tight set of glPushMatrix/glPopMatrix calls with simply restoring the earlier matrix by loading it with glLoadMatrix?
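Something like this is what I mean (a minimal sketch in legacy fixed-function GL; the glGetFloatv read-back is just one obvious way of keeping a copy of the earlier matrix around):

#include <GL/gl.h>

/* Variant 1: save/restore via the matrix stack. */
void draw_with_push_pop(void)
{
    glMatrixMode(GL_MODELVIEW);
    glPushMatrix();                  /* save the current modelview */
    glTranslatef(1.0f, 0.0f, 0.0f);  /* some local transform */
    /* ... draw the object ... */
    glPopMatrix();                   /* restore the saved matrix */
}

/* Variant 2: save/restore by reloading the matrix explicitly. */
void draw_with_load(void)
{
    GLfloat saved[16];
    glMatrixMode(GL_MODELVIEW);
    glGetFloatv(GL_MODELVIEW_MATRIX, saved); /* keep a copy */
    glTranslatef(1.0f, 0.0f, 0.0f);
    /* ... draw the object ... */
    glLoadMatrixf(saved);                    /* restore by reloading */
}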

You have to check that yourself, but pushing/popping is very fast. Loading is not quite the same procedure; it is something more like glTranslate/glRotate.

Yeah, I’ll run some tests on my own, I just hoped someone could give me a quick insight before I knock up a test bed in vain.

Thanks, Madman!

Push/Pop is done directly in hardware with a hardware matrix stack (just make sure that you do not overflow it). Load is done from system memory over the AGP bus => NOT at all as fast. I even think the GL spec says “use glLoadIdentity() instead of glLoadMatrix( my_identity_matrix ), since the latter may be slower”, or something similar, which may give you a clue.

That’s exactly what I wanted to hear, thanks, Marcus!

Marcus,

Are you sure the push and pop are in a hardware stack? I don’t have any benchmarking data that would indicate that it’s done in hardware. Seems to be a thing done in the driver, as far as I can tell.

What hardware are you thinking about?

jwatte,

I don’t have any hard proof. I think I have read it in (at least) one place before, but I may be wrong. It just seems odd to me to limit the matrix stack depth unless it’s implemented in hardware. Also, it makes perfect sense to do it in hardware, since it really only requires a quite small memory buffer to implement, and push/pops can be quite frequent in certain applications. You probably need a mirrored software stack too, in order to do fast glGets (without having to stall the pipe or go over the bus).

Anyway, that’s my 2 cents. I haven’t designed any GL hardware/drivers myself.

It would make sense for a T&L card to use a hardware-based stack.
On the other hand, the driver has to issue a command over the AGP bus anyway.

Assuming such a command would be 32 bits, it could be sent in one AGP cycle. If you attach the matrix to the command (in case the stack is software-based), you would have to transfer 32 bits + (16 * 32 bits) = 68 bytes, or 17 AGP cycles, for every push or pop.

So we are speaking of a difference of 3.76e-9 seconds vs. 6.39e-8 seconds at AGP4x.
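Put into code, that back-of-envelope estimate looks like this (assuming AGP 4x manages roughly 266 million 32-bit transfers per second; the rate is approximate):

#include <stdio.h>

int main(void)
{
    /* Rough AGP 4x figure: ~266 million 32-bit transfers per second. */
    const double transfers_per_sec = 266.0e6;
    const int command_bytes = 4;        /* one 32-bit command word */
    const int matrix_bytes  = 16 * 4;   /* 16 floats of 4 bytes each */

    int cycles_cmd_only = command_bytes / 4;                  /* 1 cycle   */
    int cycles_with_mat = (command_bytes + matrix_bytes) / 4; /* 17 cycles */

    printf("command only : %d cycle(s), %.2e s\n",
           cycles_cmd_only, cycles_cmd_only / transfers_per_sec);
    printf("with matrix  : %d cycle(s), %.2e s\n",
           cycles_with_mat, cycles_with_mat / transfers_per_sec);
    return 0;
}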

Of course I don’t know “how it’s done”; that’s only what my math tells me.


For the decision between glPush/PopMatrix and glLoadMatrix, it doesn’t really matter if Push/Pop is implemented in hardware. I would use glPush/PopMatrix because it COULD be implemented in hardware and save bandwidth, whereas glLoadMatrix will transfer the matrix for sure.

I am pretty sure that even on T&L cards the matrix stack is done by the driver in software, but who knows when the first card that can do matrix manipulation in hardware will appear?

This topic is very similar to the discussion about doing T&L yourself vs. letting OpenGL do it, knowing that both methods end up using CPU power. Now, a few years later, it is obvious which method to prefer, because we have hardware T&L cards.

The point is, if OpenGL can do something for you, DON’T try to implement it in software, because the worst case is that the driver does it equally fast/slow, but it could be that it is faster.

Originally posted by Overmind:
The point is, if OpenGL can do something for you, DON’T try to implement it in software, because the worst case is that the driver does it equally fast/slow, but it could be that it is faster.

Exactly my opinion (it goes for many other things than OpenGL too, but OpenGL is a very good example of this philosophy).

There is usually very little point in trying to optimize something in software that can be done by drivers/APIs. If something is done in software in the drivers, chances are very good that the driver writers are very competent and likely to do a better job than you anyway. Even IF you can do a better job, that marginal performance gain is usually wiped out in 6 months or so due to improved drivers, better hardware, new PC configurations etc.

…and if people ask more from the drivers, HW vendors will be forced to do better drivers.

Just remember that doing your own matrix multiplications will enable you to do more optimizations in combination with a scene graph…

As long as you don’t need the result of two multiplications you could use the GL version of push/pop etc., but when optimizing for separate passes, state changes, LODs etc., you would gain from doing your own transforming…
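A rough sketch of what I mean, with column-major matrices as GL expects (the helper names are made up for illustration):

#include <GL/gl.h>

/* out = a * b, all 4x4 column-major as OpenGL stores them. */
static void mat4_mul(const GLfloat a[16], const GLfloat b[16], GLfloat out[16])
{
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            out[c * 4 + r] = a[0 * 4 + r] * b[c * 4 + 0]
                           + a[1 * 4 + r] * b[c * 4 + 1]
                           + a[2 * 4 + r] * b[c * 4 + 2]
                           + a[3 * 4 + r] * b[c * 4 + 3];
}

/* Compose during the scene-graph walk, hand GL only the final result,
 * and keep 'world' around for culling, sorting, LOD selection etc. */
void draw_node(const GLfloat parent[16], const GLfloat local[16])
{
    GLfloat world[16];
    mat4_mul(parent, local, world);
    glMatrixMode(GL_MODELVIEW);
    glLoadMatrixf(world);
    /* ... draw the node's geometry ... */
}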

Assuming such a command would be 32 bits, it could be sent in one AGP cycle. If you attach the matrix to the command (in case the stack is software-based), you would have to transfer 32 bits + (16 * 32 bits) = 68 bytes, or 17 AGP cycles, for every push or pop.

I don’t understand the math.

If you push, but don’t change the matrix, then there is no need to tell the hardware anything.

If you push, and then MultMatrix or LoadMatrix then you have to send the new argument matrix to the hardware. The only difference is whether you send the matrix pre- or post-composition with the previously existing matrix.

Seeing as there is no bandwidth difference, I would assume it’s cheaper to not put matrix multiply and memory buffers in the hardware, but instead use the CPU for that, and just make sure to update the value of the matrix in-line with other geometry parameters and state changes, if and when actual geometry is submitted to the card (i.e., as part of an imaginary DMA command queue).
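A toy illustration of that lazy scheme (purely hypothetical; real drivers are obviously far more involved than this):

#include <string.h>

typedef struct {
    float current[16];   /* CPU-side copy of the current matrix */
    float stack[32][16]; /* software matrix stack */
    int   depth;
    int   dirty;         /* does the card still hold an older matrix? */
} MatrixState;

/* glPushMatrix: just a local copy, no bus traffic at all. */
void state_push(MatrixState *s)
{
    memcpy(s->stack[s->depth++], s->current, sizeof s->current);
}

/* glPopMatrix: restore locally, mark the card's copy stale. */
void state_pop(MatrixState *s)
{
    memcpy(s->current, s->stack[--s->depth], sizeof s->current);
    s->dirty = 1;
}

/* Draw call: only now does the matrix go over the bus, together with
 * the geometry commands. */
void state_draw(MatrixState *s /* , geometry ... */)
{
    if (s->dirty) {
        /* ... append s->current to the DMA command queue ... */
        s->dirty = 0;
    }
    /* ... submit geometry ... */
}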

I’m sure that back in the '80s, when CPU matrix multiplies hurt a bit, there was (expensive) hardware that would do it for you, and it could be that that’s where the limited stack depth comes from. But, as I also have never written an OpenGL driver, I don’t know for sure, either.

Just a comment…

I can see that there is a large gain in speed from combining transforms on a defined number of geometries. These premultiplied geometries share the same modelview transform.

These shared geometries can then be drawn without any load/push/pop etc. and will be much faster because they do not need to recalculate the driver’s internal T&L states.
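Roughly like this: bake each object’s local transform into its vertices once, then draw the whole group under a single shared modelview matrix (the helper names and the packed-xyz vertex layout are just assumptions for the sketch):

#include <GL/gl.h>

/* Apply a column-major 4x4 matrix to a point (x, y, z, 1). */
static void transform_point(const GLfloat m[16], const GLfloat in[3], GLfloat out[3])
{
    out[0] = m[0] * in[0] + m[4] * in[1] + m[8]  * in[2] + m[12];
    out[1] = m[1] * in[0] + m[5] * in[1] + m[9]  * in[2] + m[13];
    out[2] = m[2] * in[0] + m[6] * in[1] + m[10] * in[2] + m[14];
}

/* Pre-multiply an object's vertices by its local transform once, so the
 * object can afterwards be drawn with the group's shared matrix only. */
void bake_transform(const GLfloat local[16], GLfloat *verts, int count)
{
    for (int i = 0; i < count; ++i) {
        GLfloat tmp[3];
        transform_point(local, &verts[i * 3], tmp);
        verts[i * 3 + 0] = tmp[0];
        verts[i * 3 + 1] = tmp[1];
        verts[i * 3 + 2] = tmp[2];
    }
}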

Originally posted by jwatte:
If you push, but don’t change the matrix, then there is no need to tell the hardware anything.

You are absolutely right, looks like my thoughts got mixed up.
A ‘push’ would only require a local memcpy and no AGP transfer at all.

I almost forgot this too:
“As long as it’s fast enough, don’t worry about it…”

Originally posted by jwatte:
I’m sure that back in the '80s, when CPU matrix multiplies hurt a bit, there was (expensive) hardware that would do it for you, and it could be that that’s where the limited stack depth comes from. But, as I also have never written an OpenGL driver, I don’t know for sure, either.

But why are nVidia’s drivers limited, for instance? The maximum projection matrix stack depth is 4 on my Ti4200. Mesa 5.x has a corresponding depth of 32. Of course, it can still be in software, just assuming that people will never use the projection matrix stack anyway (or at most one or two depth levels).
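For reference, those numbers come from queries like these (legacy GL; the spec’s required minimums are 32 for the modelview stack and 2 for the projection stack):

#include <GL/gl.h>
#include <stdio.h>

void print_stack_limits(void)
{
    GLint modelview = 0, projection = 0;
    glGetIntegerv(GL_MAX_MODELVIEW_STACK_DEPTH, &modelview);
    glGetIntegerv(GL_MAX_PROJECTION_STACK_DEPTH, &projection);
    printf("max modelview stack depth : %d\n", (int)modelview);
    printf("max projection stack depth: %d\n", (int)projection);
}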

I don’t think matrix multiplications are done in hardware. Modern CPUs can do that very well (SSE/3DNow! etc), and you probably still want the matrix in client memory to minimize stalls when you do “get” or context switches etc. Push/pop could still be in hardware, as a POP would only require a buffer pointer update in HW.

Regarding AGP vs HW speeds, Push/Pop vs Load etc., don’t forget that OpenGL can run over a network connection, which is NOT as fast as an AGP bus (when I tried GL over a LAN for the first time I noticed how much faster display lists actually can be than immediate mode/vertex arrays).

Ok, I agree with the previous post: “As long as it’s fast enough, don’t worry about it…” Not that I care about how it’s done for performance reasons, I’m just curious.

Originally posted by marcus256:
But why are nVidia’s drivers limited, for instance? The maximum projection matrix stack depth is 4 on my Ti4200.

Because you have to think of the future and make the best decisions, and not explain them to the public, to make sure the competition doesn’t catch on.

My guess is that even if it was possible to have a super large stack (back in the old days and even now), it wasn’t done because the engineers believed that one day the entire matrix stack might be stored on board with some hardwired optimization.

If this optimization has become reality today, then it’s time to celebrate.