Quick Cross Product/Dot Product

Hi! As I’ve stated in previous posts, I’ve been working on a game engine for some time now, and I’m trying to optimize some of it… So here’s a quick couple of questions:

I find that I’m performing cross and dot products quite often in my engine (Upwards of 1000 times per frame). Could this be a problem? Since these are so critical to my program, are there assembly routines (or anything faster than C/C++ multiplies and additions) that perform these faster?

I store my vectors like this:

float direction[3];

direction[0]  (=> x component)
direction[1]  (=> y component)
direction[2]  (=> z component)

-Thanks in advance!

3DNow! and SSE enhanced assembly can speed up such calculations. In fact if I’m not mistaken, in SSE2, you can calculate a dot product in just 5 instructions (maybe less). Also I think you may want to add a fourth float to your vector structure to make it an even 16 bytes long, so that they can remain aligned in memory.
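A sketch of that padded vector in modern C++ syntax (the struct name is mine; on VC6 you'd use __declspec(align(16)) instead of alignas):

```cpp
// Four floats = 16 bytes, aligned to a 16-byte boundary so SIMD loads
// (e.g. SSE movaps) can read the whole vector in one aligned access.
struct alignas(16) Vec4
{
    float x, y, z, w;  // w is padding (or 1.0f if you treat it as a point)
};

static_assert(sizeof(Vec4) == 16, "vector should be exactly 16 bytes");
static_assert(alignof(Vec4) == 16, "vector should be 16-byte aligned");
```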

Hi,

here's what I use to calculate the dot product; it's fast even though it doesn't use SSE or 3DNow! instructions.

__forceinline float __cdecl DotProduct(const float _v1[3], const float _v2[3])
{
	float dotret;

	__asm
	{
		mov ecx, _v1
		mov eax, _v2

		;optimized dot product		;15 cycles
		fld dword ptr   [eax+0]     ;starts & ends on cycle 0
		fmul dword ptr  [ecx+0]     ;starts on cycle 1
		fld dword ptr   [eax+4]     ;starts & ends on cycle 2
		fmul dword ptr  [ecx+4]     ;starts on cycle 3
		fld dword ptr   [eax+8]     ;starts & ends on cycle 4
		fmul dword ptr  [ecx+8]     ;starts on cycle 5
		fxch            st(1)       ;no cost
		faddp           st(2),st(0) ;starts on cycle 6, stalls for cycles 7-8
		faddp           st(1),st(0) ;starts on cycle 9, stalls for cycles 10-12
		fstp dword ptr  [dotret]    ;starts on cycle 13, ends on cycle 14
	}

	return dotret;
}

Hope it can help !

Arath

Arath, is it really faster than if you were doing in Visual C++ 6.0:
DotProduct = V1[0]*V2[0] + V1[1]*V2[1] + V1[2]*V2[2] ;

?

'cause Visual C++ really optimizes the code very well. And for such a short piece of code there is no need to create a function.
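For reference, here are the plain C++ versions of both operations the original poster asked about (a sketch; the function names are mine). With inlining enabled, a decent compiler turns these into a handful of multiplies and adds, much like the hand-written asm above:

```cpp
// Straightforward scalar dot and cross products on float[3] vectors.
// No asm, no intrinsics; the compiler is free to inline and schedule these.
inline float Dot(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

inline void Cross(const float a[3], const float b[3], float out[3])
{
    out[0] = a[1] * b[2] - a[2] * b[1];
    out[1] = a[2] * b[0] - a[0] * b[2];
    out[2] = a[0] * b[1] - a[1] * b[0];
}
```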

Plus you’re using “__asm”, which backs up the registers on the stack and restores them at the end, costing a few more cycles.

Though, the fastest would be to buy an NV20 and use the GL_vertex_program extension Grin

VC6 does NOT take advantage of 3DNow!/SSE instructions, and maybe also not MMX (don’t quote me on that last one).

GPSnoopy: I didn’t check it properly, but I timed other asm routines that replace “old C” functions and they are faster… You can check it by disassembling the two functions (asm and C versions) and counting clock cycles. But of course, with an NV20 dot products are faster; you can do it with only two instructions (see the NV_vertex_program pdf on nvidia’s site)…

Arath

Originally posted by Arath:
[b]GPSnoopy: I didn’t check it properly, but I timed other asm routines that replace “old C” functions and they are faster… You can check it by disassembling the two functions (asm and C versions) and counting clock cycles. But of course, with an NV20 dot products are faster; you can do it with only two instructions (see the NV_vertex_program pdf on nvidia’s site)…

Arath[/b]

And of course, using the gfx-card for your general math routines will result in your app running at the speed of a turtle nailed to the floor.

Metrowerks CodeWarrior for Windows can optimize for MMX and/or 3DNow!

Originally posted by DFrey:
3DNow! and SSE enhanced assembly can speed up such calculations. In fact if I’m not mistaken, in SSE2, you can calculate a dot product in just 5 instructions (maybe less). Also I think you may want to add a fourth float to your vector structure to make it an even 16 bytes long, so that they can remain aligned in memory.

Pardon my ignorance, but what is the advantage of being aligned on 16 bytes?

I’m no hardware guru, but from what I understand, memory reads are faster if they do not cross a page boundary.

It has to do with the way the cache fetches stuff from main memory, and the fact that modern RAM actually isn’t “random access” at all. It also enables certain SIMD optimizations in OpenGL drivers. Pages are typically 4k though, and are handled by the VMM, so that’s not it.

Ah that’s right, I confused a paragraph of memory with a page of memory.

Yep, all those strange words and acronyms get really confusing sometimes (I just finished a course on computer architecture and communication, and I now know more acronyms than I will ever need for the rest of my life. How about “IEEE 802.3 uses CSMA/CD as MAC and typically uses TP cable with RJ45 connectors as transfer media”?). Anyway, linear memory accesses and structure alignment are great optimization tricks that are very effective, typically much more worthwhile than low-level math stuff. Since main memory is so damn slow, optimizing for cache hits really makes a difference.
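To illustrate the linear-access point (a sketch; function names are mine): the same matrix summed row-by-row walks consecutive addresses, while summing column-by-column jumps by a whole row each step. Both give the same answer, but for large matrices the row-major walk is typically much faster because it hits the cache instead of main memory.

```cpp
#include <cstddef>
#include <vector>

// Sum an n-by-n row-major matrix two ways.
// Row order touches consecutive addresses (cache-friendly);
// column order strides n floats per step (cache-hostile for large n).
float sum_rows(const std::vector<float>& m, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            s += m[r * n + c];   // consecutive addresses
    return s;
}

float sum_cols(const std::vector<float>& m, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            s += m[r * n + c];   // stride of n floats per step
    return s;
}
```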