View Full Version : Quick Cross Product/Dot Product



DLink
02-19-2001, 07:22 PM
Hi! As I've stated in previous posts, I've been working on a game engine for some time now, and I'm trying to optimize some of it.... So here's a quick couple of questions:

I find that I'm performing cross and dot products quite often in my engine (Upwards of 1000 times per frame). Could this be a problem? Since these are so critical to my program, are there assembly routines (or anything faster than C/C++ multiplies and additions) that perform these faster?

I store my vectors like this:

float direction[3];

direction[0] (=> x component)
direction[1] (=> y component)
direction[2] (=> z component)

-Thanks in advance! :)

DFrey
02-19-2001, 09:38 PM
3DNow! and SSE enhanced assembly can speed up such calculations. In fact if I'm not mistaken, in SSE2, you can calculate a dot product in just 5 instructions (maybe less). Also I think you may want to add a fourth float to your vector structure to make it an even 16 bytes long, so that they can remain aligned in memory.

Arath
02-19-2001, 11:36 PM
Hi,

here's what I use to calculate a dot product; it's fast even though it doesn't use SSE or 3DNow! instructions.

__forceinline float __cdecl DotProduct(const float _v1[3], const float _v2[3])
{
    float dotret;

    __asm
    {
        mov ecx, _v1
        mov eax, _v2

        ; optimized dot product: 15 cycles
        fld  dword ptr [eax+0]    ; starts & ends on cycle 0
        fmul dword ptr [ecx+0]    ; starts on cycle 1
        fld  dword ptr [eax+4]    ; starts & ends on cycle 2
        fmul dword ptr [ecx+4]    ; starts on cycle 3
        fld  dword ptr [eax+8]    ; starts & ends on cycle 4
        fmul dword ptr [ecx+8]    ; starts on cycle 5
        fxch st(1)                ; no cost
        faddp st(2), st(0)        ; starts on cycle 6, stalls for cycles 7-8
        faddp st(1), st(0)        ; starts on cycle 9, stalls for cycles 10-12
        fstp dword ptr [dotret]   ; starts on cycle 13, ends on cycle 14
    }

    return dotret;
}

Hope it can help!

Arath

GPSnoopy
02-20-2001, 07:14 AM
Arath, is it really faster than doing this in Visual C++ 6.0:
DotProduct = V1[0]*V2[0] + V1[1]*V2[1] + V1[2]*V2[2] ;

?

'cause Visual C++ really optimizes the code very well. And for such short stuff there's no need to create a function.

Plus you're using "__asm", which backs up registers on the stack and restores them at the end, costing a few more cycles.


Though, the fastest would be to buy an NV20 and use the GL_vertex_program extension *Grin*

zed
02-20-2001, 04:55 PM
VC6 does NOT take advantage of 3DNow! or SSE instructions, and maybe not MMX either (don't quote me on that last one).

Arath
02-20-2001, 11:39 PM
GPSnoopy : I didn't check it properly, but I timed other asm routines that replace "old C" functions and they are faster... You can check it by disassembling the two functions (asm and C versions) and counting clock cycles. But of course, with an NV20 dot products are faster; you can do it with only two instructions (see the NV_vertex_program PDF on nvidia's site)...

Arath

harsman
02-21-2001, 12:28 AM
Originally posted by Arath:
GPSnoopy : I didn't check it properly, but I timed other asm routines that replace "old C" functions and they are faster... You can check it by disassembling the two functions (asm and C versions) and counting clock cycles. But of course, with an NV20 dot products are faster; you can do it with only two instructions (see the NV_vertex_program PDF on nvidia's site)...

Arath

And of course, using the gfx-card for your general math routines will result in your app running at the speed of a turtle nailed to the floor.

JoeMac
02-21-2001, 04:50 AM
Metrowerks CodeWarrior for Windows can optimize for MMX and/or 3DNow!

EricK
02-22-2001, 06:16 AM
Originally posted by DFrey:
3DNow! and SSE enhanced assembly can speed up such calculations. In fact if I'm not mistaken, in SSE2, you can calculate a dot product in just 5 instructions (maybe less). Also I think you may want to add a fourth float to your vector structure to make it an even 16 bytes long, so that they can remain aligned in memory.

Pardon my ignorance but what is the advantage of being aligned on 16 bytes?

DFrey
02-22-2001, 07:15 AM
I'm no hardware guru, but from what I understand, memory reads are faster if they do not cross a page boundary.

harsman
02-23-2001, 03:50 AM
It has to do with the way the cache fetches data from main memory and the fact that modern RAM actually isn't "random access" at all. It also enables certain SIMD optimizations in OpenGL drivers. Pages are typically 4K, though, and are handled by the VMM, so that's not it.

DFrey
02-23-2001, 03:55 AM
Ah, that's right; I confused a paragraph of memory with a page of memory.

harsman
02-23-2001, 04:59 AM
Yep, all those strange words and acronyms get really confusing sometimes. (I just finished a course on computer architecture and communication, and I now know more acronyms than I will ever need for the rest of my life. How about "IEEE 802.3 uses CSMA/CD as MAC and typically uses TP cable with RJ45 connectors as the transfer medium"?) Anyway, linear memory accesses and structure alignment are a great optimization trick and very effective, typically much more worthwhile than low-level math stuff. Since main memory is so damn slow, optimizing for cache hits really makes a difference.