
View Full Version : How is GL's multmatrix so fast?



knackered
04-26-2002, 06:17 AM
If I do this:-
glLoadMatrixf(matrix1);
glMultMatrixf(matrix2);

instead of this:-
MatMult(matrix1, matrix2, matprod);
glLoadMatrixf(matprod);

I get a *significant* performance boost.
(I've shown my mult matrix code below).

Why is this?

The reason I ask is because I gathered from Matt and Cass that all matrix mults are done on the CPU, and the resultant matrix uploaded to the GPU... so surely some mad assembly version couldn't be *that* much faster than this simple bit of C?




outputmatrix[0]=matrix1[0]*matrix2[0]+matrix1[1]*matrix2[4]+matrix1[2]*matrix2[8]+matrix1[3]*matrix2[12];
outputmatrix[1]=matrix1[0]*matrix2[1]+matrix1[1]*matrix2[5]+matrix1[2]*matrix2[9]+matrix1[3]*matrix2[13];
outputmatrix[2]=matrix1[0]*matrix2[2]+matrix1[1]*matrix2[6]+matrix1[2]*matrix2[10]+matrix1[3]*matrix2[14];
outputmatrix[3]=matrix1[0]*matrix2[3]+matrix1[1]*matrix2[7]+matrix1[2]*matrix2[11]+matrix1[3]*matrix2[15];

outputmatrix[4]=matrix1[4]*matrix2[0]+matrix1[5]*matrix2[4]+matrix1[6]*matrix2[8]+matrix1[7]*matrix2[12];
outputmatrix[5]=matrix1[4]*matrix2[1]+matrix1[5]*matrix2[5]+matrix1[6]*matrix2[9]+matrix1[7]*matrix2[13];
outputmatrix[6]=matrix1[4]*matrix2[2]+matrix1[5]*matrix2[6]+matrix1[6]*matrix2[10]+matrix1[7]*matrix2[14];
outputmatrix[7]=matrix1[4]*matrix2[3]+matrix1[5]*matrix2[7]+matrix1[6]*matrix2[11]+matrix1[7]*matrix2[15];

outputmatrix[8]=matrix1[8]*matrix2[0]+matrix1[9]*matrix2[4]+matrix1[10]*matrix2[8]+matrix1[11]*matrix2[12];
outputmatrix[9]=matrix1[8]*matrix2[1]+matrix1[9]*matrix2[5]+matrix1[10]*matrix2[9]+matrix1[11]*matrix2[13];
outputmatrix[10]=matrix1[8]*matrix2[2]+matrix1[9]*matrix2[6]+matrix1[10]*matrix2[10]+matrix1[11]*matrix2[14];
outputmatrix[11]=matrix1[8]*matrix2[3]+matrix1[9]*matrix2[7]+matrix1[10]*matrix2[11]+matrix1[11]*matrix2[15];

outputmatrix[12]=matrix1[12]*matrix2[0]+matrix1[13]*matrix2[4]+matrix1[14]*matrix2[8]+matrix1[15]*matrix2[12];
outputmatrix[13]=matrix1[12]*matrix2[1]+matrix1[13]*matrix2[5]+matrix1[14]*matrix2[9]+matrix1[15]*matrix2[13];
outputmatrix[14]=matrix1[12]*matrix2[2]+matrix1[13]*matrix2[6]+matrix1[14]*matrix2[10]+matrix1[15]*matrix2[14];
outputmatrix[15]=matrix1[12]*matrix2[3]+matrix1[13]*matrix2[7]+matrix1[14]*matrix2[11]+matrix1[15]*matrix2[15];




[This message has been edited by knackered (edited 04-26-2002).]

Nutty
04-26-2002, 07:07 AM
Of course an assembler version can be loads faster. Have you any idea of the kind of code that C is going to get turned into?

Also, MMX and 3DNow! will help loads too!

Zeno
04-26-2002, 07:18 AM
Yeah, I'm sure it's because they use assembly routines and SSE to do theirs.

You could get a significant speed-up out of your routine without going that far, though. A friend of mine and I had to do a lot of 4x4 matrix multiplies for our work in grad school, so our prof had us compete to come up with the fastest C solution to the problem that we could. The one we ended up with matched the prof's hand-written assembly routine (which didn't use anything processor-specific like SSE, or we wouldn't have stood a chance).

The whole key is to lay things out so that you're being very cache-friendly. I don't have the code with me, but let me know if you're interested and I can post it for you later.

One last thing - I've gotten significant performance boosts on my math routines by using the Intel C++ compiler instead of the Visual C++ one.

-- Zeno

Devulon
04-26-2002, 07:29 AM
Look at the expanded matrix multiplication. See how many repetitious loads there are on each line. That C code will most likely thrash the cache something awful. Just being careful about cache lines and cache hits/misses could make it several times faster. As you guys have said, the only way to really make sure the compiler isn't doing something dumb is assembly. Assembly code lets you load the arrays in a much more efficient manner. SSE surely helps out by combining the multiplications together. You would be amazed how few instructions you could really do that in. Actually, I'm sort of intrigued to see how many SSE instructions I can do it in.

Will have to look into it.

Devulon

knackered
04-26-2002, 08:08 AM
I'm sure I got that code out of the Mesa GL library... maybe someone should point these things out to them.
I'm still a little stunned that there is *such* a performance difference - we're talking at least 16 milliseconds over only a few dozen mults.

Yes, zeno, I would be interested in seeing your code.

Thanks for the replies.

j
04-26-2002, 08:29 AM
Another possibility could be that the driver just copies over the matrices and waits to do the multiplication until it actually needs to use the multiplied matrix values.

Maybe.

j

V-man
04-26-2002, 08:43 AM
They use SSE for sure, because when you install the driver, Control Panel->Display says "GeForce xxx/AGP/SSE", and you can get the same string with

glGetString(GL_RENDERER);

You should go to the Intel site for that. They have ready-to-compile code in their documents and a huge math library.

The docs say that multiplying two 4x4 matrices together takes 172 cycles with pure FPU instructions and 90 cycles with SSE.

But do you find that there is reason to do your own matrix mult and other such operations?

V-man

nickels
04-26-2002, 09:35 AM
I would be surprised if hand-coded assembly could beat the C compiler's assembly for something simple like the code above. These compilers are way better at local optimization like this than any hand-coders are (except the Graphics Programming Black Book guys at id Software).

But I guess that doesn't answer your question, it just makes it more perplexing.

This is, of course, provided you have turned on your compiler's optimization. In an unoptimized debug build it will do things just as you said.

Another possibility is that the OpenGL code is compiled with the assumption that pointers don't alias, and yours isn't. Search your compiler's docs for 'restrict' support and see if this makes a difference.

[This message has been edited by nickels (edited 04-26-2002).]

knackered
04-26-2002, 10:18 AM
Originally posted by V-man:
But do you find that there is reason to do your own matrix mult and other such operations?
V-man

Yes, the reason I would prefer to do my own mults etc. is because my engine can create both OpenGL and Direct3D views, therefore I'd like to keep matrix mults under my control, rather than passing the job to whichever context I'm rendering into (GL/D3D).
I'll look into the Intel math lib - is it free for commercial use? I'll check...
I really should get more familiar with assembly... haven't done it for years (Z80 was my last brush with it). I never look at the disassembly... perhaps I should.

One other thing - my maths library is in a separate DLL, so maybe there's a performance penalty - but OpenGL is in a separate DLL too... so maybe not.

Nickels, I am a bit perplexed, yes :)
So much other stuff to do other than optimising rendering...

knackered
04-26-2002, 10:34 AM
Mmm, the Intel math library seems to require me to give them some money...not interested.
So, I trotted over to the AMD site, and they have a 3dnow optimised maths lib for free. I downloaded it, and look at how they multiply 2 4x4 matrices together (look familiar?) :-




void glMul_4x4 (float *r, const float *a, const float *b)
{
    float tmp[16];

    tmp[0] = b[0] * a[0] + b[4] * a[1] + b[8] * a[2] + b[12] * a[3];
    tmp[1] = b[1] * a[0] + b[5] * a[1] + b[9] * a[2] + b[13] * a[3];
    tmp[2] = b[2] * a[0] + b[6] * a[1] + b[10] * a[2] + b[14] * a[3];
    tmp[3] = b[3] * a[0] + b[7] * a[1] + b[11] * a[2] + b[15] * a[3];
    tmp[4] = b[0] * a[4] + b[4] * a[5] + b[8] * a[6] + b[12] * a[7];
    tmp[5] = b[1] * a[4] + b[5] * a[5] + b[9] * a[6] + b[13] * a[7];
    tmp[6] = b[2] * a[4] + b[6] * a[5] + b[10] * a[6] + b[14] * a[7];
    tmp[7] = b[3] * a[4] + b[7] * a[5] + b[11] * a[6] + b[15] * a[7];
    tmp[8] = b[0] * a[8] + b[4] * a[9] + b[8] * a[10] + b[12] * a[11];
    tmp[9] = b[1] * a[8] + b[5] * a[9] + b[9] * a[10] + b[13] * a[11];
    tmp[10] = b[2] * a[8] + b[6] * a[9] + b[10] * a[10] + b[14] * a[11];
    tmp[11] = b[3] * a[8] + b[7] * a[9] + b[11] * a[10] + b[15] * a[11];
    tmp[12] = b[0] * a[12] + b[4] * a[13] + b[8] * a[14] + b[12] * a[15];
    tmp[13] = b[1] * a[12] + b[5] * a[13] + b[9] * a[14] + b[13] * a[15];
    tmp[14] = b[2] * a[12] + b[6] * a[13] + b[10] * a[14] + b[14] * a[15];
    tmp[15] = b[3] * a[12] + b[7] * a[13] + b[11] * a[14] + b[15] * a[15];

    r[0] = tmp[0];
    r[1] = tmp[1];
    r[2] = tmp[2];
    r[3] = tmp[3];
    r[4] = tmp[4];
    r[5] = tmp[5];
    r[6] = tmp[6];
    r[7] = tmp[7];
    r[8] = tmp[8];
    r[9] = tmp[9];
    r[10] = tmp[10];
    r[11] = tmp[11];
    r[12] = tmp[12];
    r[13] = tmp[13];
    r[14] = tmp[14];
    r[15] = tmp[15];
}


Now I'm even more perplexed.

Elixer
04-26-2002, 11:12 AM
Hey knackered, it really does depend on the compiler, debug build or release, and some other things. Take VectorC, for example - they have a little demo where you can plug in source and it spits out the optimized version. ( http://www.codeplay.com/vectorc/demo-compiler.html )

They also have a demo showing off results between Intel, MS and their own compiler (I wonder who won? :) ) http://www.codeplay.com/vectorc/demo-int.html

Now if only I could afford the $$$ for the VectorC compiler. :( Around $800 or so.
I think the Intel compiler does SSE stuff but, AFAIK, the MS compiler doesn't do SSE or 3DNow! (automatically produce the SSE/3DNow! code, I mean).

zed
04-26-2002, 01:46 PM
go over to www.flipcode.com (http://www.flipcode.com) and have a look at the code of the day. There's an entry from a guy from Microsoft showing the various speeds of mat*mat routines; amusingly, he asserted that the D3DX matrix multiplication is the slowest :)

dorbie
04-26-2002, 01:55 PM
glMultMatrix can be accelerated in hardware, but this is implementation-dependent. Also, the whole point of SSE and 3DNow! is to support SIMD-style vectorized code. Compilers don't tend to output that kind of code; you need to hand-code it in assembly, although Intel's new compiler apparently does a good job. I doubt it'll have a code path for 3DNow! assembler though.

Gorg
04-26-2002, 09:26 PM
Originally posted by zed:
go over to www.flipcode.com (http://www.flipcode.com) and have a look at the code of the day. There's an entry from a guy from Microsoft showing the various speeds of mat*mat routines; amusingly, he asserted that the D3DX matrix multiplication is the slowest :)

I am not sure if you read the whole text, but he said that the version he had at home (dx8.0a) was the slowest, but at work with dx8.1 it was the fastest, because the D3DX team implemented 3DNow! optimisations.

Zeno
04-26-2002, 10:47 PM
Well it's going to be a bit long, but here it is.

The function multiplies two matrices and puts the answer in a third, i.e. ma * mb = result. The matrices are stored as arrays of 16 floats in column-major (OpenGL) order.

Here are the things we found sped up our routine, in order of importance. Of course, your mileage may vary depending on your compiler.

1. When CPUs read from memory, they pull in a whole cache line of consecutive locations. Thus, there is a HUGE speed advantage if you can walk through your arrays in the same order they are laid out in memory. Notice that all three arrays involved are walked in order at all times. Notice, too, that doing this makes the function look like asm :)

2. Using temporary variables and copying them to the result at the end is more efficient than assigning them to the result as you go. I'm not sure why this is...perhaps because result is a pointer the compiler can't do some optimizations? Anyone know for sure?

3. Assigning a value to the temporaries when they're created instead of later can help a bit.

Without further ado...




float m00 = ma[ 0] * mb[ 0];
float m10 = ma[ 1] * mb[ 0];
float m20 = ma[ 2] * mb[ 0];
float m30 = ma[ 3] * mb[ 0];
m00 += ma[ 4] * mb[ 1];
m10 += ma[ 5] * mb[ 1];
m20 += ma[ 6] * mb[ 1];
m30 += ma[ 7] * mb[ 1];
m00 += ma[ 8] * mb[ 2];
m10 += ma[ 9] * mb[ 2];
m20 += ma[10] * mb[ 2];
m30 += ma[11] * mb[ 2];
m00 += ma[12] * mb[ 3];
m10 += ma[13] * mb[ 3];
m20 += ma[14] * mb[ 3];
m30 += ma[15] * mb[ 3];

float m01 = ma[ 0] * mb[ 4];
float m11 = ma[ 1] * mb[ 4];
float m21 = ma[ 2] * mb[ 4];
float m31 = ma[ 3] * mb[ 4];
m01 += ma[ 4] * mb[ 5];
m11 += ma[ 5] * mb[ 5];
m21 += ma[ 6] * mb[ 5];
m31 += ma[ 7] * mb[ 5];
m01 += ma[ 8] * mb[ 6];
m11 += ma[ 9] * mb[ 6];
m21 += ma[10] * mb[ 6];
m31 += ma[11] * mb[ 6];
m01 += ma[12] * mb[ 7];
m11 += ma[13] * mb[ 7];
m21 += ma[14] * mb[ 7];
m31 += ma[15] * mb[ 7];

float m02 = ma[ 0] * mb[ 8];
float m12 = ma[ 1] * mb[ 8];
float m22 = ma[ 2] * mb[ 8];
float m32 = ma[ 3] * mb[ 8];
m02 += ma[ 4] * mb[ 9];
m12 += ma[ 5] * mb[ 9];
m22 += ma[ 6] * mb[ 9];
m32 += ma[ 7] * mb[ 9];
m02 += ma[ 8] * mb[10];
m12 += ma[ 9] * mb[10];
m22 += ma[10] * mb[10];
m32 += ma[11] * mb[10];
m02 += ma[12] * mb[11];
m12 += ma[13] * mb[11];
m22 += ma[14] * mb[11];
m32 += ma[15] * mb[11];

float m03 = ma[ 0] * mb[12];
float m13 = ma[ 1] * mb[12];
float m23 = ma[ 2] * mb[12];
float m33 = ma[ 3] * mb[12];
m03 += ma[ 4] * mb[13];
m13 += ma[ 5] * mb[13];
m23 += ma[ 6] * mb[13];
m33 += ma[ 7] * mb[13];
m03 += ma[ 8] * mb[14];
m13 += ma[ 9] * mb[14];
m23 += ma[10] * mb[14];
m33 += ma[11] * mb[14];
m03 += ma[12] * mb[15];
m13 += ma[13] * mb[15];
m23 += ma[14] * mb[15];
m33 += ma[15] * mb[15];

result[ 0] = m00;
result[ 1] = m10;
result[ 2] = m20;
result[ 3] = m30;
result[ 4] = m01;
result[ 5] = m11;
result[ 6] = m21;
result[ 7] = m31;
result[ 8] = m02;
result[ 9] = m12;
result[10] = m22;
result[11] = m32;
result[12] = m03;
result[13] = m13;
result[14] = m23;
result[15] = m33;


Knackered - if you could post how this benches compared to your other routines, I'd appreciate it :)

-- Zeno

Edit: Doh! The code tags took out the spaces I had between segments in the code above. I also wanted to add that if anyone can improve this in any way, I'd love to see it :)

[This message has been edited by Zeno (edited 04-27-2002).]

epajarre
04-27-2002, 08:03 AM
Originally posted by knackered:
Mmm, the Intel math library seems to require me to give them some money...not interested.
So, I trotted over to the AMD site, and they have a 3dnow optimised maths lib for free. I downloaded it, and look at how they multiply 2 4x4 matrices together (look familiar?) :-

Now I'm even more perplexed.

Notice that your original code did not have the tmp array. That might be important, because it gives the compiler a hint that the input and output arrays are not the same during the computation. This can have a significant effect on the speed of the code.

Eero

knackered
04-27-2002, 08:31 AM
Yes, I was thinking along those lines.
The output matrix could be anywhere in memory, whereas the temp matrix is created in the function, but the input matrices could be anywhere in memory too, surely?
Oh, I don't know - I'm going to try Zeno's code, and if that doesn't give me the same performance as GL, then I'll have to pass the matrix mults over to the context, which will mean Direct3D doing its own mults too...

knackered
04-27-2002, 09:10 AM
Zeno, I wasn't able to discern a difference between your mult and my mult as far as milliseconds go - they gave identical performance... but I'm at home, and my home machine is an Athlon 1.2GHz, whereas my work machine is a dual P3 700MHz - so I'm getting much better performance on my home machine anyway.
I'll try it on the work machine on Monday to run a more revealing test.

V-man
04-27-2002, 11:29 AM
Originally posted by knackered:
Yes, I was thinking along those lines.
The output matrix could be anywhere in memory, whereas the temp matrix is created in the function, but the input matrices could be anywhere in memory too, surely?
Oh, I don't know - I'm going to try Zeno's code, and if that doesn't give me the same performance as GL, then I'll have to pass the matrix mults over to the context, which will mean Direct3D doing its own mults too...


If the 3 matrices (A, B and the result) are in different memory locations, but are arrays of, say, 16 elements each, then accessing each will bring them into the L2 cache, then the L1 cache. PC caches (some are external, some are on-chip L2) are pretty much n-way set associative nowadays. I think PPro and up are 4-way, except for the server class, which are 8-way.

So it is important to work out whether the arrays are stepping on each other's feet in the cache. I haven't really done this myself, but I've seen code from people who analyze cache performance.
A couple of things you should know before you use 3DNow! or SSE:

1. Check that the CPU supports it by using the CPUID opcode.
2. Check that the OS supports it by executing a random SSE or 3DNow! opcode and using a try/catch block to see if it causes an exception.

Someone's benchmark, GLVulpine, executed SSE instructions, and it crashed even though I had a P3. Reading the Intel manuals gave me the answer!

PS: if someone has links as to how to code intelligently in C++ or C so that the compiler can do better optimizations, then please post it here.

V-man

zed
04-27-2002, 01:00 PM
>>I am not sure if you read the whole text, but he said that the version he had at home (dx8.0a) was the slowest, but at work with dx8.1, it was the fastest because the D3DX team implemented 3dnow optimisations<<

yeah, I was wrong. Sorry, I was talking out of my ass; unfortunately I have a problem of just focusing on the numbers on a page.

take a look at this

/* generated code, do not edit. */

#include "ode/matrix.h"

dReal dDot (const dReal *a, const dReal *b, int n)
{
  dReal p0,q0,m0,p1,q1,m1,sum;
  sum = 0;
  n -= 2;
  while (n >= 0) {
    p0 = a[0]; q0 = b[0];
    m0 = p0 * q0;
    p1 = a[1]; q1 = b[1];
    m1 = p1 * q1;
    sum += m0;
    sum += m1;
    a += 2;
    b += 2;
    n -= 2;
  }
  n += 2;
  while (n > 0) {
    sum += (*a) * (*b);
    a++;
    b++;
    n--;
  }
  return sum;
}

Taken from that excellent physics library ODE. It's a dot product - notice the top line '/* generated code, do not edit. */'.
It looks like this code was generated by a machine, e.g. grown with a neural net perhaps - give it a billion generations and see what code it evolves. The same thing would be possible for a mat*mat multiplication.

btw, the above dot product runs slower on my machine than the plain A[0]*B[0] + A[1]... version.

davepermen
04-27-2002, 01:06 PM
BeginTiming();
glLoadMatrixf(a);
glMultMatrixf(b);
glGetFloatv(GL_MODELVIEW_MATRIX, c);
EndTiming();

Have you tested it like this, or did you stop timing before getting the result back? You know GL can run asynchronously...

Should be logical, but we're never sure ;)

Zeno
04-27-2002, 10:44 PM
Well, I'll be darned. It's time for me to eat crow.

Knackered - I didn't believe that there was no difference between my algorithm and yours, so I wrote up a little benchmark program to compare them.

To my surprise (and embarrassment), my algorithm came out last (no matter the compiler). Yours was second place and the OGL routines were indeed the fastest.

Here are the results of the program from my Athlon 1.3 using the .exe from the Intel compiler:

Zeno: 9.6 million mults/sec
Knackered: 13.5 million mults/sec
OGL: 22.8 million mults/sec

If anyone else wants to try (or check my stuff) I put the source code and two .exe files up here:
www.sciencemeetsart.com/wade/temp/benchmark.zip (http://www.sciencemeetsart.com/wade/temp/benchmark.zip)

Learn something new every day :)

-- Zeno



[This message has been edited by Zeno (edited 04-28-2002).]

davepermen
04-28-2002, 01:42 AM
Just wanted to note that you don't create any OpenGL context... I'm surprised it didn't crash :)

And did you check that the results are OK? Possibly you just get an error back from the GL function because there is no GL context... dunno ;)

Elixer
04-28-2002, 10:11 AM
Here is a 3DNow! 4x4 matrix multiply routine you can test, BUT I do not recall where it came from. :( I just happened to find a file called 3dnow4x4.asm in my "other" directory.
I sure hope the formatting is correct, since I don't have a web page to post it on.




;******************************************************************************
; Routine: void _glMul_4x4 (float *r, const float *a, const float *b)
; Input: r - matrix (4x4) address
; a - matrix (4x4) address
; b - matrix (4x4) address
; Output: r = a * b, using standard matrix multiplication rules
; Uses: eax, ecx, edx, mm0 - mm7
; UPDATED 01/21/00
;******************************************************************************
ALIGN 32
PUBLIC __glMul_4x4
__glMul_4x4 PROC

r = 4 ;stack offset for result address
a = 8 ;stack offset for matrix a address
b = 12 ;stack offset for matrix b address

a_11 = 0 ;local stack frame layout
a_12 = 4
a_13 = 8
a_14 = 12

a_21 = 16
a_22 = 20
a_23 = 24
a_24 = 28

a_31 = 32
a_32 = 36
a_33 = 40
a_34 = 44

a_41 = 48
a_42 = 52
a_43 = 56
a_44 = 60

femms
mov eax,[esp+a] ;source a
mov ecx,[esp+b] ;source b
mov edx,[esp+r] ;result r

sub esp,64 ;T_ local work space to store temp results

movd mm0,[eax + a_21] ; | a_21
movd mm1,[eax + a_11] ; | a_11
movd mm6,[eax + a_12] ; | a_12
punpckldq mm1,mm0 ; a_21 | a_11
movd mm5,[eax + a_22] ; | a_22
pfmul mm1,[ecx] ; a_21 * b_12 | a_11 * b_11
punpckldq mm6,mm5 ; a_22 | a_12
movd mm7,[eax + a_32] ; | a_32
movd mm5,[eax + a_42] ; | a_42
pfmul mm6,[ecx] ; a_22 * b_12 | a_12 * b_11
movd mm2,[eax + a_31] ; | a_31
punpckldq mm7,mm5 ; a_42 | a_32
movd mm0,[eax + a_41] ; | a_41
pfmul mm7,[ecx+8] ; a_42 * b_14 | a_32 * b13
punpckldq mm2,mm0 ; a_41 | a_31
pfadd mm6,mm7 ; a_42 * b_14 + a_22 * b_12 | a_32 * b13 + a_12 * b_11
pfmul mm2,[ecx+8] ; a_41 * b_14 | a_31 * b13
pfacc mm6,mm6 ; | a_12 * b_11 + a_22 * b_12 + a_32 * b_13 + a_42 * b_14
pfadd mm1,mm2 ; a_21 * b_12 + a_41 * b_14 | a_11 * b_11 + a_31 * b13
movd [esp+4],mm6 ; T_12
pfacc mm1,mm1 ; | a_21 * b_12 + a_41 * b_14 + a_11 * b_11 + a_31 * b13
movd [esp],mm1 ; T_11

movd mm0,[eax + a_23] ; | a_23
movd mm1,[eax + a_13] ; | a_13
movd mm6,[eax + a_14] ; | a_14
punpckldq mm1,mm0 ; a_23 | a_13
movd mm5,[eax + a_24] ; | a_24
pfmul mm1,[ecx] ; a_23 * b_12 | a_13 * b_11
punpckldq mm6,mm5 ; a_24 | a_14
movd mm7,[eax + a_34] ; | a_34
movd mm5,[eax + a_44] ; | a_44
pfmul mm6,[ecx] ; a_24 * b_12 | a_14 * b_11
movd mm2,[eax + a_33] ; | a_33
punpckldq mm7,mm5 ; a_44 | a_34
movd mm0,[eax + a_43] ; | a_43
pfmul mm7,[ecx+8] ; a_44 * b_14 | a_34 * b_13
punpckldq mm2,mm0 ; a_43 | a_33
pfadd mm6,mm7 ; a_44 * b_14 + a_24 * b_12 | a_34 * b_13 + a_14 * b_11
pfmul mm2,[ecx+8] ; a_43 * b_14 | a_33 * b_13
pfacc mm6,mm6 ; | a_14 * b_11 + a_24 * b_12 + a_34 * b_13 + a_44 * b_14
pfadd mm1,mm2 ; a_43 * b_14 + a_23 * b_12 | a_33 * b_13 + a_13 * b_11
movd [esp+12],mm6 ; T_14
pfacc mm1,mm1 ; | a_13 * b_11 + a_23 * b_12 + a_33 * b_13 + a_43 * b_14
movd [esp+8],mm1 ; T_13

movd mm0,[eax + a_21] ; | a_21
movd mm1,[eax + a_11] ; | a_11
movd mm6,[eax + a_12] ; | a_12
punpckldq mm1,mm0 ; a_21 | a_11
movd mm5,[eax + a_22] ; | a_22
pfmul mm1,[ecx+16] ; a_21 * b_22 | a_11 * b_21
punpckldq mm6,mm5 ; a_22 | a_12
movd mm7,[eax + a_32] ; | a_32
movd mm5,[eax + a_42] ; | a_42
pfmul mm6,[ecx+16] ; a_22 * b_22 | a_12 * b_21
movd mm2,[eax + a_31] ; | a_31
punpckldq mm7,mm5 ; a_42 | a_32
movd mm0,[eax + a_41] ; | a_41
pfmul mm7,[ecx+24] ; a_42 * b_24 | a_32 * b_23
punpckldq mm2,mm0 ; a_41 | a_31
pfadd mm6,mm7 ; a_42 * b_24 + a_22 * b_22 | a_32 * b_23 + a_12 * b_21
pfmul mm2,[ecx+24] ; a_41 * b_24 | a_31 * b_23
pfacc mm6,mm6 ; | a_42 * b_24 + a_22 * b_22 + a_32 * b_23 + a_12 * b_21
pfadd mm1,mm2 ; a_41 * b_24 + a_21 * b_22 | a_31 * b_23 + a_11 * b_21
movd [esp+20],mm6 ; T_22
pfacc mm1,mm1 ; |a_41 * b_24 + a_21 * b_22 + a_31 * b_23 + a_11 * b_21
movd [esp+16],mm1 ; T_21

movd mm0,[eax + a_23] ; | a_23
movd mm1,[eax + a_13] ; | a_13
movd mm6,[eax + a_14] ; | a_14
punpckldq mm1,mm0 ; a_23 | a_13
movd mm5,[eax + a_24] ; | a_24
pfmul mm1,[ecx+16] ; a_23 * b_22 | a_13 * b_21
punpckldq mm6,mm5 ; a_24 | a_14
movd mm7,[eax + a_34] ; | a_34
movd mm5,[eax + a_44] ; | a_44
pfmul mm6,[ecx+16] ; a_24 * b_22 | a_14 * b_21
movd mm2,[eax + a_33] ; | a_33
punpckldq mm7,mm5 ; a_44 | a_34
movd mm0,[eax + a_43] ; | a_43
pfmul mm7,[ecx+24] ; a_44 * b_24 | a_34 * b_23
punpckldq mm2,mm0 ; a_43 | a_33
pfadd mm6,mm7 ; a_24 * b_22 + a_44 * b_24 | a_14 * b_21 + a_34 * b_23
pfmul mm2,[ecx+24] ; a_43 * b_24 | a_33 * b_23
pfacc mm6,mm6 ; |a_24 * b_22 + a_44 * b_24 + a_14 * b_21 + a_34 * b_23
pfadd mm1,mm2 ; a_43 * b_24 + a_23 * b_22 | a_33 * b_23 + a_13 * b_21
movd [esp+28],mm6 ; T_24
pfacc mm1,mm1 ; | a_13 * b_21 + a_23 * b_22 + a_33 * b_23 + a_43 * b_24
movd [esp+24],mm1 ; T_23

movd mm0,[eax + a_21] ; | a_21
movd mm1,[eax + a_11] ; | a_11
movd mm6,[eax + a_12] ; | a_12
punpckldq mm1,mm0 ; a_21 | a_11
movd mm5,[eax + a_22] ; | a_22
pfmul mm1,[ecx+32] ; a_21 * b_32 | a_11 * b_31
punpckldq mm6,mm5 ; a_22 | a_12
movd mm7,[eax + a_32] ; | a_32
movd mm5,[eax + a_42] ; | a_42
pfmul mm6,[ecx+32] ; a_22 * b_32 | a_12 * b_31
movd mm2,[eax + a_31] ; | a_31
punpckldq mm7,mm5 ; a_42 | a_32
movd mm0,[eax + a_41] ; | a_41
pfmul mm7,[ecx+40] ; a_42 * b_34 | a_32 * b33
punpckldq mm2,mm0 ; a_41 | a_31
pfadd mm6,mm7 ; a_42 * b_34 + a_22 * b_32 | a_32 * b33 + a_12 * b_31
pfmul mm2,[ecx+40] ; a_41 * b_34 | a_31 * b33
pfacc mm6,mm6 ; |a_42 * b_34 + a_22 * b_32 + a_32 * b33 + a_12 * b_31
pfadd mm1,mm2 ; a_41 * b_34 + a_21 * b_32 | a_31 * b33 + a_11 * b_31
movd [esp+36],mm6 ; T_32
pfacc mm1,mm1 ; |a_41 * b_34 + a_21 * b_32 + a_31 * b33 + a_11 * b_31
movd [esp+32],mm1 ; T_31

movd mm0,[eax + a_23] ; | a_23
movd mm1,[eax + a_13] ; | a_13
movd mm6,[eax + a_14] ; | a_14
punpckldq mm1,mm0 ; a_23 | a_13
movd mm5,[eax + a_24] ; | a_24
pfmul mm1,[ecx+32] ; a_23 * b_32 | a_13 * b_31
punpckldq mm6,mm5 ; a_24 | a_14
movd mm7,[eax + a_34] ; | a_34
movd mm5,[eax + a_44] ; | a_44
pfmul mm6,[ecx+32] ; a_24 * b_32 | a_14 * b_31
movd mm2,[eax + a_33] ; | a_33
punpckldq mm7,mm5 ; a_44 | a_34
movd mm0,[eax + a_43] ; | a_43
pfmul mm7,[ecx+40] ; a_44 * b_34 | a_34 * b_33
punpckldq mm2,mm0 ; a_43 | a_33
pfadd mm6,mm7 ; a_44 * b_34 + a_24 * b_32 | a_34 * b_33 + a_14 * b_31
pfmul mm2,[ecx+40] ; a_43 * b_34 | a_33 * b_33
pfacc mm6,mm6 ; | a_14 * b_31 + a_24 * b_32 + a_34 * b_33 + a_44 * b_34
pfadd mm1,mm2 ; a_43 * b_34 + a_23 * b_32 | a_33 * b_33 + a_13 * b_31
movd [esp+44],mm6 ; T_34
pfacc mm1,mm1 ; | a_13 * b_31 + a_23 * b_32 + a_33 * b_33 + a_43 * b_34
movd [esp+40],mm1 ; T_33

movd mm0,[eax + a_21] ; | a_21
movd mm1,[eax + a_11] ; | a_11
movd mm6,[eax + a_12] ; | a_12
punpckldq mm1,mm0 ; a_21 | a_11
movd mm5,[eax + a_22] ; | a_22
pfmul mm1,[ecx+48] ; a_21 * b_42 | a_11 * b_41
punpckldq mm6,mm5 ; a_22 | a_12
movd mm7,[eax + a_32] ; | a_32
movd mm5,[eax + a_42] ; | a_42
pfmul mm6,[ecx+48] ; a_22 * b_42 | a_12 * b_41
movd mm2,[eax + a_31] ; | a_31
punpckldq mm7,mm5 ; a_42 | a_32
movd mm0,[eax + a_41] ; | a_41
pfmul mm7,[ecx+56] ; a_42 * b_44 | a_32 * b_43
punpckldq mm2,mm0 ; a_41 | a_31
pfadd mm6,mm7 ; a_42 * b_44 + a_22 * b_42 | a_32 * b_43 + a_12 * b_41
pfmul mm2,[ecx+56] ; a_41 * b_44 | a_31 * b_43
pfacc mm6,mm6 ; |a_42 * b_44 + a_22 * b_42 + a_32 * b_43 + a_12 * b_41
pfadd mm1,mm2 ; a_41 * b_44 + a_21 * b_42 | a_31 * b_43 + a_11 * b_41
movd [esp+52],mm6 ; T_42
pfacc mm1,mm1 ; | a_41 * b_44 + a_21 * b_42 + a_31 * b_43 + a_11 * b_41
movd [esp+48],mm1 ; T_41


movd mm0,[eax + a_23] ; | a_23
movd mm1,[eax + a_13] ; | a_13
movd mm6,[eax + a_14] ; | a_14
punpckldq mm1,mm0 ; a_23 | a_13
movd mm5,[eax + a_24] ; | a_24
pfmul mm1,[ecx+48] ; a_23 * b_42 | a_13 * b_41
punpckldq mm6,mm5 ; a_24 | a_14
movd mm7,[eax + a_34] ; | a_34
movd mm5,[eax + a_44] ; | a_44
pfmul mm6,[ecx+48] ; a_24 * b_42 | a_14 * b_41
movd mm2,[eax + a_33] ; | a_33
punpckldq mm7,mm5 ; a_44 | a_34
movd mm0,[eax + a_43] ; | a_43
pfmul mm7,[ecx+56] ; a_44 * b_44 | a_34 * b_43
punpckldq mm2,mm0 ; a_43 | a_33
pfadd mm6,mm7 ; a_44 * b_44 + a_24 * b_42 | a_34 * b_43 + a_14 * b_41
pfmul mm2,[ecx+56] ; a_43 * b_44 | a_33 * b_43
pfacc mm6,mm6 ; | a_14 * b_41 + a_24 * b_42 + a_34 * b_43 + a_44 * b_44
pfadd mm1,mm2 ; a_43 * b_44 + a_23 * b_42 | a_33 * b_43 + a_13 * b_41
movd [esp+60],mm6 ; T_44
pfacc mm1,mm1 ; | a_13 * b_41 + a_23 * b_42 + a_33 * b_43 + a_43 * b_44
movd [esp+56],mm1 ; T_43


movq mm3,[esp] ;MOVE FROM LOCAL TEMP MATRIX TO ADDRESS OF RESULT
movq mm4,[esp+8]
movq [edx],mm3
movq [edx+8],mm4

movq mm3,[esp+16]
movq mm4,[esp+24]
movq [edx+16],mm3
movq [edx+24],mm4

movq mm3,[esp+32]
movq mm4,[esp+40]
movq [edx+32],mm3
movq [edx+40],mm4

movq mm3,[esp+48]
movq mm4,[esp+56]
movq [edx+48],mm3
movq [edx+56],mm4

add esp,64
femms

ret
__glMul_4x4 ENDP

jwatte
04-28-2002, 11:04 AM
Daveperman wrote:

> have you tested it like this or before the
> getting of the result? you know gl runs in a
> parallel thread..

If it did, pretty much anything you did would take milliseconds because of the synchronization overhead. On most current hardware, it runs just as a library linked into your process space, talking to the hardware directly, and the "second entity" is the GPU running DMA.

Speaking of knackered's "milliseconds"; I can see no way that you can spend MILLISECONDS on a simple matrix multiply. Not even on a hundred of them. There's one million cycles in a millisecond (give or take). Benchmarks of routines at this level should be measured in CYCLES, and should specify the hardware used, as well as where source and destination reside before each iteration of the benchmark (RAM, L2, L1).

Saying that a matrix mult will "thrash the cache" is similarly out of whack with reality. An Athlon or a Pentium 4 fits a 4x4 float matrix in a single cache line, assuming it's aligned; otherwise it's two. A Pentium III needs two cache lines, or three in the unaligned case. Thus, if you really wanted to, you could conceivably fit all three matrices in line fetch buffers (write combiners) on a P-III!

Zeno
04-28-2002, 11:28 AM
just wanted to note that you don't create any opengl context.. wondered that it did not crashed

Yea, so was I, but it didn't crash and did give the same result as the others, so I figured no context is needed for that particular call.

-- Zeno

davepermen
04-28-2002, 11:32 AM
funny ;)

knackered
04-28-2002, 12:38 PM
Originally posted by jwatte:
Speaking of knackered's "milliseconds"; I can see no way that you can spend MILLISECONDS on a simple matrix multiply. Not even on a hundred of them. There's one million cycles in a millisecond (give or take). Benchmarks of routines at this level should be measured in CYCLES, and should specify the hardware used, as well as where source and destination reside before each iteration of the benchmark (RAM, L2, L1).


I don't know much about caches and such things, jwatte - I appreciate you educating me. The reason I'm talking in milliseconds is just because I'm measuring the time before rendering anything, then measuring it again after the SwapBuffers - in between those two measurements my scenegraph gets traversed, during which something like 70 to 80 matrix mults happen. Now, if I use GL to multiply the matrices, the time spent is 16 milliseconds less than if I do the mults myself.
Hope that clears some things up.

jwatte
04-28-2002, 01:23 PM
Knackered,

That seems un-intuitive, if that's the only difference. 80 matmuls should never take 16 milliseconds, no way. Are you measuring over many frames and averaging? Which timing function do you use? On Windows, timeGetTime() or GetTickCount() are notoriously unreliable; they drop ticks under heavy load and they only give, at best, millisecond accuracy.

How about tracing through your matmul in the debugger and see if it goes off in a 100-times loop or something? How about trying it with a profiler? (try getting the demo version of VTune from Intel's developer site)

knackered
04-28-2002, 01:29 PM
The thing is, the frame rate drops dramatically too - it's physically apparent that it's slower. I'm using GetTickCount; yes, I know it's not as accurate as QueryPerformanceCounter, but I just banged it in to give me a quick measurement. As I say, though, it's a dramatic frame rate difference anyway.
No, I'm not going into some loop or other; the code is just as I detailed.
A mystery?

Michael Steinberg
04-28-2002, 03:12 PM
I gave the stuff a little benchmark.
I used QueryPerformanceCounter and switched optimizations off for the calling loop.
80 matmults with knackered's original matmult function took about 0.8 microseconds, but one has to respect the overhead of the non-optimized loop. I also found that in any case, if one copies the a and b matrices to two temporaries and calculates with those, it gets a bit faster.
However, I only have a P2 350, so these results are probably irrelevant.
I also learnt that the FPU time depends on the values you put in. When I didn't give the matrices an initial value, it was 10 or even 100 times slower.

[This message has been edited by Michael Steinberg (edited 04-28-2002).]

2nd Edit:
I guess the whole benchmark is irrelevant. When I set initial values, the CPU probably fetches the three matrices into the cache, and it probably only works on them there.

[This message has been edited by Michael Steinberg (edited 04-28-2002).]

ffish
04-28-2002, 05:50 PM
Long time no post http://www.opengl.org/discussion_boards/ubb/smile.gif

Here's a link to another algorithm for performing matrix mults. Might be interesting to bench against some of these implementations:
http://lib-www.lanl.gov/numerical/bookcpdf/c2-11.pdf

Apologies if this alg. has already been covered above. I didn't read through all the code in detail (esp. the assembly version). BTW, great thread.

Regards.

mcraighead
04-29-2002, 05:53 PM
I'm a little surprised that OGL doesn't lose simply because of all the API overhead. After all, you've passed in two matrices now, and they have to be all copied around and stuff...

Hint 1: The value of MatrixMode may affect OGL matrix performance. Some modes are probably faster/slower than others.

Hint 2: Nah, I won't tell you, this should be too obvious. http://www.opengl.org/discussion_boards/ubb/smile.gif

- Matt

mcraighead
04-29-2002, 05:58 PM
I didn't believe the results it gave me, so I looked at the app. (This has nothing to do with my Hint 2, BTW. That was something else.)

You haven't set up a GL context or anything. Those entry points probably just point to a "RET" instruction!

Also, in a fair comparison, glGetFloatv is going to absolutely destroy the GL driver because of, e.g., big switch statement overhead.

- Matt

Zeno
04-29-2002, 07:42 PM
I need to stop posting on this thread pretty soon...I keep looking like an idiot http://www.opengl.org/discussion_boards/ubb/smile.gif

Yes, Matt, the lack of context must have made those functions NULL. And, of course, the reason that the right answer was appearing anyway is that I did the OGL test LAST using the same arrays, so the answer was already done. *sigh*.

I put up a new main and two new .exe files. Here are the results when I create a context using GLUT:

Zeno: 10.2 Million mults/sec
Knackered: 12.0 Million mults/sec
OGL: 1.7 Million mults/sec

Sorry for all the mistakes here guys http://www.opengl.org/discussion_boards/ubb/smile.gif At least we're getting at the truth.

-- Zeno

knackered
04-30-2002, 01:32 AM
Then why am I getting this drop in frame rate, if my method is the fastest? I've told you the whole story, there's nothing more I can add...the mult is inline'd too http://www.opengl.org/discussion_boards/ubb/smile.gif

Thaellin
04-30-2002, 04:01 AM
Very interesting thread.

Knackered: Is it possible you're getting a processor stall due to a write-read pairing? Try dropping a /small/ operation or two in between the MatMult and glLoadMatrix calls.

I've never run into a clear case of a dependent read stall, so I have no idea if this is what's actually slowing you down.

Depends on what ASM the compiler is generating, I guess.

I might play with this today. It's a very curious problem. 8)
-- Jeff

Edit: I found the spaced version to be slightly slower than the tightly executed version. My bench (quick&dirty) also shows Knackered's code outperforming the OpenGL version, at 150% of the OpenGL speed (nVidia's 28.32 detonators on a GF2MX).

[This message has been edited by Thaellin (edited 04-30-2002).]

lgrosshennig
04-30-2002, 04:03 AM
Zeno, no offense, but your benchmark is a little unfair.

I noticed a few things that are worth mentioning.

1.) The custom benchmarks don't upload the results to OpenGL.

2.) The custom benchmarks always work on the same matrices, making them L1-cache-local after the first call (OK, an interrupt or a process context switch will kick them out once in a while).

3.) The OpenGL version reads the results back (most likely over the AGP bus). Why?

4.) glLoadMatrixf and glMultMatrixf make copies of the data, so it is unlikely to get L1 cache hits for successive calls.

Since I know it's a lot easier to criticise someone else's work than to do it better, I'll put a new benchmark together when I get home.

Regards,

LG http://www.opengl.org/discussion_boards/ubb/biggrin.gif

Zeno
04-30-2002, 07:40 AM
Zeno no offense but your benchmark is a little unfair.

None taken. Benchmarks are difficult to make fair and I don't really have any experience.

1) That's true. I forgot that the idea was to eventually give these matrices to opengl, not just get the answer.

2) Yes, that, actually, was on purpose. I wanted them to be in cache so I could see which one was more efficient without worrying as much about memory issues.

3) The opengl one reads results back for the same reason as I mentioned in number 1). For whatever reason, I had it in my head that we wanted the answer on the CPU.

4) True...but there's no way around this, is there? I guess I could load once, then push and pop and mult many times.

Anyway, thanks for the comments. Feel free to use my timer code if you put a benchmark together. I think that part is right at least http://www.opengl.org/discussion_boards/ubb/wink.gif

-- Zeno

lgrosshennig
05-01-2002, 12:39 PM
Okidoki, I wrote a new benchmark that works on uncached data and uploads the resulting matrices to OpenGL.

On a 1.5Ghz P4 running W2K & latest Detonators I get the following (using the MS compiler):

Zenos: 1.92 Million iterations/s
Knackered: 1.31 Million iterations/s
OpenGL: 1.95 Million iterations/s

Looks different eh? http://www.opengl.org/discussion_boards/ubb/smile.gif

Knackered, maybe you have vsync enabled and the extra cycles make you miss the next retrace? I mean, 16ms is really a bummer and I don't think the matrix code alone can cause that (and 16ms smells like a 60Hz refresh rate).

Oh you can grab the benchmark & source here (http://www.lutzgrosshennig.de/MatMult.zip)

EDIT:Fixed the URL

Regards,

LG http://www.opengl.org/discussion_boards/ubb/biggrin.gif

[This message has been edited by lgrosshennig (edited 05-01-2002).]

knackered
05-01-2002, 01:04 PM
Thanks for investigating it.
No, vsync is disabled.

I've noticed something else, though (unrelated): if I have a Sleep(1) in my render thread, I can get nothing more than 100 fps; without it I get > 1000 fps. Why is this? Sleep(1) should only sleep for 1 millisecond, surely?
I've lowered the priority of my render thread and removed the sleep, and now I get somewhere over 500 fps.

Zeno
05-01-2002, 01:13 PM
Thanks for the program http://www.opengl.org/discussion_boards/ubb/smile.gif Much more refined than mine was.

It's also nice to see that my routine pulled ahead a little bit. On my 1.7 Ghz pIV Xeon at work, it was about 33% faster (though still slightly slower than the OGL one).

-- Zeno

lgrosshennig
05-01-2002, 01:16 PM
Hmmm, yes, Sleep(1) should only sleep for 1ms.

BUT using Sleep() also puts the thread at the end of the scheduler's list of running threads.

For example, Sleep(0) is usually used to give a different thread the remaining CPU time reserved for the current thread.

So using Sleep(1) can cause your thread to be suspended for much longer (check the documentation).

Hope that helps,

LG http://www.opengl.org/discussion_boards/ubb/biggrin.gif

lgrosshennig
05-01-2002, 01:21 PM
Thx Zeno http://www.opengl.org/discussion_boards/ubb/smile.gif

I think the OpenGL version is faster because the glLoadMatrixf call makes some of the memory used by the following glMultMatrixf call level 1 cache local.

Regards,

LG http://www.opengl.org/discussion_boards/ubb/biggrin.gif

ToolTech
05-01-2002, 01:26 PM
just a comment. Sleep(1) gives approx 16-18 milliseconds delay. The system cannot switch faster..

Another comment.

If you know how the matrix was built (e.g. rotations, translations, etc.) you do not need all the adds and muls...

kehziah
05-01-2002, 01:27 PM
My experience with Sleep() :
on my Win2K system
Sleep(n) with n = 1 to 10 --> effective sleep time is 10 ms
Sleep(n) with n = 11 to 20 --> effective sleep time is 20 ms
Sleep(n) with n = 21 to 30 --> effective sleep time is 30 ms
...
Note : program running with default priority.
Yes, the doc says that n should be the time the thread is suspended...

lgrosshennig
05-01-2002, 01:34 PM
I think Knackered gets his matrices from quaternions, so using specialized multiplication code (which indeed can be a lot faster) won't work too well.

As for Sleep(), the time to resume the thread depends on various factors and is hard to predict. So far I have seen almost everything between 10ms and 50ms (on a heavily loaded system).

Regards,

LG http://www.opengl.org/discussion_boards/ubb/biggrin.gif

ToolTech
05-01-2002, 01:39 PM
Info... Standard code on my PIII 600 MHz gives 2.9 million 4x4 matrix mults per second, where the matrices have random numbers in every position -> the full 4x4 multiply is needed. Plain C++, VC 6.0, Release.

If I copy to local temporary storage I get lower numbers, probably because most of the time all my matrices are already in the cache anyway.