Driver transform speed without HW T&L

I’m wondering if there’s any point in doing your own transform code if the card doesn’t support HW T&L. This project I’m starting work on won’t be finished for quite some time, so I could go HW T&L only, but that seems so unclean. Doing your own lighting instead of using SW OpenGL is probably worthwhile, since OpenGL’s lighting model is pretty costly and my own simplified model would probably be faster. But many drivers use SSE or 3DNow! for transforms, so they’re probably lightning fast. The real question is how well optimised they are for vertex re-use. My own transform code would transform each vertex exactly once, but I don’t know how good the drivers are at vertex re-use. If I use vertex arrays, strips and glDrawElements, will vertices being transformed multiple times be a problem? If vertex re-use is good in the driver, my own code probably won’t offer a performance benefit and I can just skip doing any transforms. Does anyone know how good different drivers are at this?
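
To be concrete, this is the kind of indexed setup I mean (a minimal, untested sketch with made-up data):

```c
#include <GL/gl.h>

/* Four corners of a quad, each stored exactly once. */
static const GLfloat verts[] = {
    0.0f, 0.0f, 0.0f,   /* vertex 0 */
    1.0f, 0.0f, 0.0f,   /* vertex 1 */
    1.0f, 1.0f, 0.0f,   /* vertex 2 */
    0.0f, 1.0f, 0.0f,   /* vertex 3 */
};

/* Two triangles; vertices 0 and 2 are referenced twice but stored once.
 * A driver with good re-use handling should transform them only once. */
static const GLushort indices[] = { 0, 1, 2,   0, 2, 3 };

void draw_quad(void)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, indices);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```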

Joakim

[This message has been edited by harsman (edited 01-18-2001).]

The thing is that HW T&L is much faster than software, even if the HW has to transform some vertices multiple times. There are exceptional cases where you can get faster SW T&L: if your app caches vertex transforms and lighting in such a way that you only transform a small portion of the scene each frame, for example. Maybe doing SW lighting makes sense.
The thing is that the only way to know which one is faster is to implement both solutions and have benchmark code in your app that tests which one is the fastest.
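
Something along these lines would do, as a rough hypothetical sketch; the draw callback stands in for whichever of the two paths you’re testing:

```c
#include <GL/gl.h>
#include <time.h>

/* Time a number of frames through one path; call it once with your own
 * transform code and once with the plain GL pipeline, then compare. */
double time_path(void (*draw)(void), int frames)
{
    clock_t start = clock();
    int i;

    for (i = 0; i < frames; i++) {
        draw();
        glFinish();   /* make sure the GL work is actually counted */
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```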

Originally posted by coco:
The thing is that HW T&L is much faster than software, even if the HW has to transform some vertices multiple times. There are exceptional cases where you can get faster SW T&L: if your app caches vertex transforms and lighting in such a way that you only transform a small portion of the scene each frame, for example. Maybe doing SW lighting makes sense.
The thing is that the only way to know which one is faster is to implement both solutions and have benchmark code in your app that tests which one is the fastest.

I wasn’t talking about doing SW transforms or lighting on a HW T&L card; that would be stupid and much slower than doing it in hardware. What I wanted to know was whether it’s worthwhile to do your own transforms, as opposed to using the driver’s transform code, when running on a rasterisation-only device (anything except a GeForce or Radeon). Will the driver’s software transforms be faster than my software transforms? That’s what I was wondering. I guess this depends very much on the driver, how much vertex re-use it gets and how fast its transform code is (my guess is faster than my puny efforts, since e.g. NVIDIA’s drivers use SSE or 3DNow!).

I would guess that you would have to optimize your code a lot to beat a well-written driver. And if you pass pre-transformed vertices, you’d better make sure the driver doesn’t transform them again with the identity matrix. Some drivers might be clever enough to notice this, but I don’t know.

You can also optimize your transformations for your particular application. For example, if you know you’re only going to look left/right, you don’t have to bother with pitch and roll angles.
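
A sketch of what I mean: with no pitch or roll, the view rotation collapses to a single rotate about the Y axis (column-major, as glLoadMatrixf expects; untested):

```c
#include <math.h>

/* Build a rotation about the Y axis only; with no pitch or roll there
 * is just one sine/cosine pair to compute. Column-major, OpenGL style. */
void yaw_only_view(float yaw, float m[16])
{
    float c = (float)cos(yaw);
    float s = (float)sin(yaw);

    m[0] =  c; m[4] = 0; m[8]  = s; m[12] = 0;
    m[1] =  0; m[5] = 1; m[9]  = 0; m[13] = 0;
    m[2] = -s; m[6] = 0; m[10] = c; m[14] = 0;
    m[3] =  0; m[7] = 0; m[11] = 0; m[15] = 1;
}
```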

I don’t think you should have to do your own transformations. There are quite a few other things that need optimization more.

Sorry then. Still, my last sentence remains correct:
“The thing is that the only way to know which one is faster is to implement both solutions and have benchmark code in your app that tests which one is the fastest.”
I prefer to leave it to the driver:
-driver writers spend huge amounts of time optimizing it specifically for their cards
-leaving it to the driver makes your code cleaner and lets you focus on higher-order algorithms (PVS, BSP, etc.)
-you have to think about the future more than the past, especially if you’re beginning to write your engine. T&L cards are spreading fast and, as you said, HW T&L is faster than SW 99.9% of the time.

Originally posted by coco:
Sorry then. Still, my last sentence remains correct:
“The thing is that the only way to know which one is faster is to implement both solutions and have benchmark code in your app that tests which one is the fastest.”
I prefer to leave it to the driver:
-driver writers spend huge amounts of time optimizing it specifically for their cards
-leaving it to the driver makes your code cleaner and lets you focus on higher-order algorithms (PVS, BSP, etc.)
-you have to think about the future more than the past, especially if you’re beginning to write your engine. T&L cards are spreading fast and, as you said, HW T&L is faster than SW 99.9% of the time.

You’re right, of course, but to benchmark the difference I need to implement my own transforms, which is what I was trying to avoid. Imagine implementing a complete pipeline up until rasterization, only to find out that it would have been faster to let the driver do it. I think I’ll go with the OpenGL transform pipeline though; I hear Quake 3 does that, so drivers should be pretty good at it, since they’re so optimised for Quake 3 benchmarks.

One thing that bothers me about Quake 3: surely using GL_TRIANGLE_STRIP is going to be far quicker than what Q3 uses, i.e. GL_TRIANGLES (though I grant you strips aren’t as flexible as plain triangles)?
So why did they use triangles instead of triangle strips?
(edit)
Also, while I’m on the subject: from what I understand, Q3 does a lot of the maths itself, i.e. no glMultMatrix calls, just glLoadMatrix. Surely with hardware T&L it’s better to do the opposite and offload the calculations onto the GPU? OK, to sum up: I’m thinking Q3 can/could be a lot quicker than it is on today’s GPUs, or is this a false assumption?
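
For reference, the two styles I’m contrasting (a hypothetical sketch; multiply_4x4() stands in for your own CPU matrix multiply, and you’d pick one style, not both):

```c
#include <GL/gl.h>

/* Hypothetical CPU-side 4x4 multiply, defined elsewhere. */
extern void multiply_4x4(float out[16], const float a[16], const float b[16]);

/* Style A (Q3-like): concatenate on the CPU and hand the driver one
 * finished matrix. */
void set_modelview_cpu(const float view[16], const float model[16])
{
    float mv[16];

    multiply_4x4(mv, view, model);
    glMatrixMode(GL_MODELVIEW);
    glLoadMatrixf(mv);
}

/* Style B: let the driver do the concatenation; its matrix code may
 * well be SSE/3DNow! optimized. */
void set_modelview_driver(const float view[16], const float model[16])
{
    glMatrixMode(GL_MODELVIEW);
    glLoadMatrixf(view);
    glMultMatrixf(model);
}
```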

[This message has been edited by zed (edited 01-19-2001).]

Quake 3 does not use triangle strips???

I think the reason Q3 uses triangles instead of triangle strips is that it was easier for him to batch everything up together without having to stripify it or break things into two groups.

If you look at http://www.quake3arena.com/news/glopt.html
John talks a bit about the techniques he used in Q3. He mentions there that although the triangles are in list rather than strip form, they are indexed and arranged so that triangles that COULD be made into strips appear sequentially in the list. He mentions that drivers could take advantage of this if they wanted to, although I don’t know that any do. This seems like an uncommon case and a lot of extra work. However, seeing as how everyone likes to optimize their drivers for whatever John does, it is very possible.

How about it, Matt? Anything like this in your drivers?

Regarding harsman’s:
“Imagine implementing a complete pipeline up until rasterization, only to find out that it would have been faster to let the driver do it.”
If you do a really good job it will be slower on some cards (especially on T&L cards), but on some older cards it will be faster, so both paths of your code get used if you are planning on shipping your product to the masses.
Anyway, I also choose to leave it all to OpenGL, as I said before, especially because the software path would only be useful on older cards.

From the Quake 3 GL optimizations doc:

“Ideally, drivers should support EXT_compiled_vertex_arrays, which allows us to explicitly tell you that we aren’t going to change the vertex values after we have specified them, so you can batch process the entire load. There are two levels of benefit from this: shared vertexes in a single DrawElements call and shared vertexes across multiple rendering passes on the same geometry. Some drivers get the first benefit even without the compiled vertex arrays by scanning the indexes before processing the triangles, but to save the work across multiple rendering passes the extension is necessary.”

So good drivers will go through the indices before transforming and only transform each vertex once, especially if I use EXT_compiled_vertex_array. In that case multiple transformations won’t be much of an issue, and my own transformations probably won’t offer a performance gain, considering most drivers probably do this since it makes Quake 3 benchmarks better. As an added bonus, I don’t have to develop a transformation pipeline :) Any low-level SIMD optimizations in premium drivers (e.g. NVIDIA’s) will make it run even faster than my own code, so I’ll definitely go with GL transforms.
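
Something like this is what I have in mind (a rough sketch; it assumes the extension string has been checked and the two entry points fetched at startup, e.g. via wglGetProcAddress):

```c
#include <GL/gl.h>

/* Entry points for EXT_compiled_vertex_array, fetched elsewhere;
 * declarations shown here just for the sketch. */
extern void (*glLockArraysEXT)(GLint first, GLsizei count);
extern void (*glUnlockArraysEXT)(void);

void draw_two_passes(const GLfloat *verts, GLsizei num_verts,
                     const GLushort *indices, GLsizei num_indices)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);

    /* Promise the driver the arrays won't change, so shared vertices
     * need only be transformed once across both passes. */
    glLockArraysEXT(0, num_verts);

    glDrawElements(GL_TRIANGLES, num_indices, GL_UNSIGNED_SHORT, indices);
    /* ...change blend/texture state for the second pass here... */
    glDrawElements(GL_TRIANGLES, num_indices, GL_UNSIGNED_SHORT, indices);

    glUnlockArraysEXT();
    glDisableClientState(GL_VERTEX_ARRAY);
}
```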