Multiplies are significantly faster than divides. Of course, the slowest part of the above code is the square root. Nvidia has some fastmath routines on its site that include a faster square root. There are also other “tricks” like Nvidia’s floating around the web. They’re not as accurate or as universal as just calling sqrt(), so be aware of this.
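void __forceinline __fastcall normalizeASM(float* v)
{
    // Static scratch slot for intermediate results (makes the function non-reentrant).
    static float f=0;
    static const float one=1.0f;
    __asm{
        mov eax,dword ptr[v]        // eax = pointer to the vector
        fld dword ptr[eax]          // f = x*x
        fmul dword ptr[eax]
        fstp f
        fld dword ptr[eax+4]        // f = x*x + y*y
        fmul dword ptr[eax+4]
        fadd f
        fstp f
        fld dword ptr[eax+8]        // st(0) = x*x + y*y + z*z
        fmul dword ptr[eax+8]
        fadd f
        fsqrt                       // f = length
        fstp f
        fld one                     // f = 1/length (the only divide)
        fdiv f
        fstp f
        fld dword ptr[eax]          // x *= 1/length
        fmul f
        fstp dword ptr[eax]
        fld dword ptr[eax+4]        // y *= 1/length
        fmul f
        fstp dword ptr[eax+4]
        fld dword ptr[eax+8]        // z *= 1/length
        fmul f
        fstp dword ptr[eax+8]
    }
}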
That’s the one I wrote. It could be faster without the temporary f, but I don’t know how to store from a floating-point register into a general-purpose register (something like fstp ebx, which isn’t allowed). Anyone?
Yeah, I’d do it in asm. It would be quite a short function. I haven’t done any 3DNow! stuff for some time, so I don’t have the instruction names in my head (it was something like PFI2D I think … or something else weird). But the 3DNow! docs are freely available on the AMD homepage, and the SSE equivalent is free to download from Intel too.
[This message has been edited by Humus (edited 02-12-2001).]
Cool. So now all I need to do is download tens of megabytes of Intel and AMD processor specs, study them, learn assembly, and maybe come up with a mathematical trick for approximating a square root (if such instructions aren’t built in). That shouldn’t take long.
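For what it's worth, one widely circulated trick of this kind can be sketched in a few lines of C. This is not the Nvidia routine or anything from the SDKs mentioned in this thread. It is the well-known bit-level initial guess (magic constant 0x5f3759df) followed by Newton-Raphson steps, and it assumes float is a 32-bit IEEE 754 value and x is positive. The function name and the iterations parameter are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: approximates 1/sqrt(x) for x > 0.
   Assumes float is a 32-bit IEEE 754 single. */
float approx_rsqrt(float x, int iterations)
{
    uint32_t bits;
    float y;

    /* Reinterpret the float's bits as an integer (memcpy avoids aliasing problems). */
    memcpy(&bits, &x, sizeof bits);

    /* Crude initial guess: halve and negate the exponent via integer arithmetic. */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);

    /* Newton-Raphson refinement for f(y) = 1/(y*y) - x.
       Each pass roughly squares the relative error. */
    for (int i = 0; i < iterations; ++i)
        y = y * (1.5f - 0.5f * x * y * y);

    return y;
}
```

Multiplying each component by approx_rsqrt(len2, n) normalizes the vector with no divide and no sqrt() call, at the cost of accuracy. Fewer iterations means a faster but rougher result, which is the same tradeoff the earlier warning about these tricks describes.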
I believe I simply copied that routine from the 3DNow! SDK. To get maximum performance when working on a bunch of vectors, it is much better to inline most of the above code and use prefetch (or prefetchw) to fetch the next vector into cache while working on the current one. Oh, and you can lose the pfrsqit1, pfrcpit2, and one pfmul instruction if 15-bit precision is good enough.
[This message has been edited by DFrey (edited 02-13-2001).]
DFrey:
While your code probably works, I don’t understand why you start with FEMMS. It should only be at the end. Aren’t you signalling that you’re exiting the multimedia state before you’ve even entered it?
No, that’s ok.
Read the 3DNow! specs that come with the SDK:
“Like the EMMS instruction, the FEMMS instruction can be used to clear the MMX state following the execution of a block of MMX instructions. Because the MMX registers and tag words are shared with the floating-point unit, it is necessary to clear the state before executing floating-point instructions. Unlike the EMMS instruction, the contents of the MMX/floating-point registers are undefined after a FEMMS instruction is executed. Therefore, the FEMMS instruction offers a faster context switch at the end of an MMX routine where the values in the MMX registers are no longer required. FEMMS can also be used prior to executing MMX instructions where the preceding floating-point register values are no longer required, which facilitates faster context switching.”
That’s how it was in the 3DNow! SDK. From my understanding, they tacked it onto the beginning just to put the MMX registers into a known (undefined) state. I understand perfectly why it is at the end, and I also thought it odd when I first saw it at the start. But the white paper says the FEMMS instruction is there to facilitate “Faster Enter/Exit of MMX or floating-point state”.