Hi! As I’ve stated in previous posts, I’ve been working on a game engine for some time now, and I’m trying to optimize some of it… So here are a couple of quick questions:
I find that I’m performing cross and dot products quite often in my engine (upwards of 1000 times per frame). Could this be a problem? Since these are so critical to my program, are there assembly routines (or anything faster than C/C++ multiplies and additions) that perform them faster?
I store my vectors like this:
[code]
float direction[3];
/* direction[0] => x component
   direction[1] => y component
   direction[2] => z component */
[/code]
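For reference, the plain C versions of the two operations in question might look like this (a minimal sketch based on the float[3] layout above; the names dot() and cross() are just placeholders):

[code]
/* [0] = x, [1] = y, [2] = z */
float dot(const float a[3], const float b[3])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

void cross(float out[3], const float a[3], const float b[3])
{
    out[0] = a[1]*b[2] - a[2]*b[1];
    out[1] = a[2]*b[0] - a[0]*b[2];
    out[2] = a[0]*b[1] - a[1]*b[0];
}
[/code]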
3DNow! and SSE enhanced assembly can speed up such calculations. In fact, if I’m not mistaken, with SSE2 you can calculate a dot product in just five instructions (maybe fewer). Also, I think you may want to add a fourth float to your vector structure to make it an even 16 bytes long, so that your vectors stay 16-byte aligned in memory.
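For example, a padded, aligned vector and an SSE dot product might look something like this (a rough sketch, not tested code: the vec4 type, the dot3_sse name, and the GCC-style alignment attribute are all assumptions; MSVC would use __declspec(align(16)) instead):

[code]
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical padded vector: the extra w component makes it 16 bytes. */
typedef struct {
    float x, y, z, w;
} __attribute__((aligned(16))) vec4;

/* 3-component dot product via SSE shuffles; the w lane is simply ignored.
   Counting them up: one multiply, two shuffles, two adds = five instructions. */
static inline float dot3_sse(const vec4 *a, const vec4 *b)
{
    __m128 m  = _mm_mul_ps(_mm_load_ps(&a->x), _mm_load_ps(&b->x)); /* xx yy zz ww */
    __m128 sy = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));      /* broadcast yy */
    __m128 sz = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2));      /* broadcast zz */
    float r;
    _mm_store_ss(&r, _mm_add_ss(_mm_add_ss(m, sy), sz));            /* xx+yy+zz */
    return r;
}
[/code]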
GPSnoopy: I didn’t check it rigorously, but I timed other asm routines that replace “old C” functions, and they are faster… You can check it yourself by disassembling the two functions (the asm and the C versions) and counting clock cycles. But of course, on an NV20 dot products are even faster: you can do one in only two instructions (see the NV_vertex_program PDF on NVIDIA’s site)…
Originally posted by Arath:
[b]GPSnoopy: I didn’t check it rigorously, but I timed other asm routines that replace “old C” functions, and they are faster… You can check it yourself by disassembling the two functions (the asm and the C versions) and counting clock cycles. But of course, on an NV20 dot products are even faster: you can do one in only two instructions (see the NV_vertex_program PDF on NVIDIA’s site)…

Arath[/b]
And of course, using the gfx-card for your general math routines will result in your app running at the speed of a turtle nailed to the floor.
Originally posted by DFrey: 3DNow! and SSE enhanced assembly can speed up such calculations. In fact, if I’m not mistaken, with SSE2 you can calculate a dot product in just five instructions (maybe fewer). Also, I think you may want to add a fourth float to your vector structure to make it an even 16 bytes long, so that your vectors stay 16-byte aligned in memory.
Pardon my ignorance, but what is the advantage of being aligned on 16 bytes?
It has to do with the way the cache fetches data from main memory, and with the fact that modern RAM isn’t really “random access” at all: it’s fastest when read in aligned bursts. It also matters for SIMD: SSE registers are 128 bits (16 bytes) wide, and the fast aligned load/store instructions (movaps and friends) require 16-byte aligned addresses, which is also what lets OpenGL drivers apply certain SIMD optimizations to your data. Pages are typically 4 KB and are handled by the VMM, though, so page alignment isn’t the issue here.
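If you allocate vectors on the heap, note that plain malloc doesn’t guarantee 16-byte alignment. A minimal sketch of one way around that, assuming the _mm_malloc/_mm_free helpers that ship with the SSE intrinsics headers:

[code]
#include <xmmintrin.h>

/* Hypothetical example: a heap-allocated vertex array whose start is
   16-byte aligned, so each 4-float vector in it can be loaded with the
   fast aligned load (_mm_load_ps, i.e. movaps). */
void example(void)
{
    float *verts = (float *)_mm_malloc(1024 * 4 * sizeof(float), 16);
    if (!verts)
        return;
    /* ... fill verts and process it with _mm_load_ps ... */
    _mm_free(verts);  /* must pair with _mm_malloc, not free() */
}
[/code]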
Yep, all those strange words and acronyms get really confusing sometimes (I just finished a course on computer architecture and communication, and I now know more acronyms than I will ever need for the rest of my life. How about “IEEE 802.3 uses CSMA/CD as MAC and typically uses TP cable with RJ45 connectors as the transfer medium”?). Anyway, linear memory access and structure alignment are very effective optimization tricks, typically much more worthwhile than low-level math stuff. Since main memory is so damn slow, optimizing for cache hits really makes a difference.
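To make that concrete, here’s a contrived sketch (the names and sizes are made up): both functions sum the same array, but the linear loop streams through memory and reuses every fetched cache line, while the strided loop touches a new line on almost every access and will typically run several times slower on a large array:

[code]
#include <stddef.h>

#define N (1 << 20)  /* 4 MB of floats: much larger than the cache */

float sum_linear(const float *a)   /* sequential access: cache friendly */
{
    float s = 0.0f;
    for (size_t i = 0; i < N; i++)
        s += a[i];
    return s;
}

float sum_strided(const float *a)  /* stride-16 (64-byte) access: cache hostile */
{
    float s = 0.0f;
    for (size_t j = 0; j < 16; j++)
        for (size_t i = j; i < N; i += 16)
            s += a[i];
    return s;
}
[/code]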