How is GL's multmatrix so fast?

If I do this:-
glLoadMatrixf(matrix1);
glMultMatrixf(matrix2);

instead of this:-
MatMult(matrix1, matrix2, matprod);
glLoadMatrixf(matprod);

I get a significant performance boost.
(I’ve shown my mult matrix code below).

Why is this?

The reason I ask is that I gathered from Matt and Cass that all matrix mults are done on the CPU, and the resultant matrix uploaded to the GPU…so surely some mad assembly version couldn’t be that much faster than this simple bit of C?

void MatMult(const float *matrix1, const float *matrix2, float *outputmatrix)
{
/* outputmatrix = matrix1 * matrix2, as called above */
outputmatrix[0]=matrix1[0]*matrix2[0]+matrix1[1]*matrix2[4]+matrix1[2]*matrix2[8]+matrix1[3]*matrix2[12];
outputmatrix[1]=matrix1[0]*matrix2[1]+matrix1[1]*matrix2[5]+matrix1[2]*matrix2[9]+matrix1[3]*matrix2[13];
outputmatrix[2]=matrix1[0]*matrix2[2]+matrix1[1]*matrix2[6]+matrix1[2]*matrix2[10]+matrix1[3]*matrix2[14];
outputmatrix[3]=matrix1[0]*matrix2[3]+matrix1[1]*matrix2[7]+matrix1[2]*matrix2[11]+matrix1[3]*matrix2[15];

outputmatrix[4]=matrix1[4]*matrix2[0]+matrix1[5]*matrix2[4]+matrix1[6]*matrix2[8]+matrix1[7]*matrix2[12];
outputmatrix[5]=matrix1[4]*matrix2[1]+matrix1[5]*matrix2[5]+matrix1[6]*matrix2[9]+matrix1[7]*matrix2[13];
outputmatrix[6]=matrix1[4]*matrix2[2]+matrix1[5]*matrix2[6]+matrix1[6]*matrix2[10]+matrix1[7]*matrix2[14];
outputmatrix[7]=matrix1[4]*matrix2[3]+matrix1[5]*matrix2[7]+matrix1[6]*matrix2[11]+matrix1[7]*matrix2[15];

outputmatrix[8]=matrix1[8]*matrix2[0]+matrix1[9]*matrix2[4]+matrix1[10]*matrix2[8]+matrix1[11]*matrix2[12];
outputmatrix[9]=matrix1[8]*matrix2[1]+matrix1[9]*matrix2[5]+matrix1[10]*matrix2[9]+matrix1[11]*matrix2[13];
outputmatrix[10]=matrix1[8]*matrix2[2]+matrix1[9]*matrix2[6]+matrix1[10]*matrix2[10]+matrix1[11]*matrix2[14];
outputmatrix[11]=matrix1[8]*matrix2[3]+matrix1[9]*matrix2[7]+matrix1[10]*matrix2[11]+matrix1[11]*matrix2[15];

outputmatrix[12]=matrix1[12]*matrix2[0]+matrix1[13]*matrix2[4]+matrix1[14]*matrix2[8]+matrix1[15]*matrix2[12];
outputmatrix[13]=matrix1[12]*matrix2[1]+matrix1[13]*matrix2[5]+matrix1[14]*matrix2[9]+matrix1[15]*matrix2[13];
outputmatrix[14]=matrix1[12]*matrix2[2]+matrix1[13]*matrix2[6]+matrix1[14]*matrix2[10]+matrix1[15]*matrix2[14];
outputmatrix[15]=matrix1[12]*matrix2[3]+matrix1[13]*matrix2[7]+matrix1[14]*matrix2[11]+matrix1[15]*matrix2[15];
}


Of course an assembler version can be loads faster. Have you any idea what kind of code that C is going to get turned into?

Also, MMX and 3DNow! will help loads too!

Yeah, I’m sure it’s because they use assembly routines and SSE to do theirs.

You could get a significant speedup out of your routine without going that far, though. A friend of mine and I had to do a lot of 4x4 matrix multiplies for our work in grad school, so our prof had us compete to come up with the fastest C solution to the problem that we could. The one we ended up with matched the prof’s hand-written assembly routine (which didn’t use anything processor-specific like SSE, or we wouldn’t have stood a chance).

The whole key is to lay things out so that you’re being very cache friendly. I don’t have the code with me but let me know if you’re interested and I can post it for you later.

One last thing - I’ve gotten significant performance boosts on my math routines by using the intel C++ compiler instead of the visual C++ one.

– Zeno

Look at the expanded matrix multiplication. See how many repetitious loads there are on each line. That C code will most likely thrash the cache something awful. Just being careful about cache lines and cache hits/misses could make that several times faster. As you guys have said, the only way to really make sure the compiler isn’t doing something dumb is assembly. Assembly lets you load the arrays in a much more efficient manner. SSE surely helps by combining the multiplications together. You would be amazed how few instructions you could really do it in. Actually, I am sort of intrigued to see how many SSE instructions I can do it in.

Will have to look into it.

Devulon

I’m sure I got that code out of the Mesa GL library…maybe someone should point these things out to them.
I’m still a little stunned that there is such a performance difference - we’re talking at least 16 milliseconds over only a few dozen mults.

Yes, zeno, I would be interested in seeing your code.

Thanks for the replies.

Another possibility could be that the driver just copies the matrices over and waits to do the multiplication until it actually needs the multiplied matrix values.
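
Something like this dirty-flag scheme, maybe (pure speculation on my part, not actual driver code - MatMult here is just the routine from the top of the thread):

#include <string.h>

void MatMult(const float *matrix1, const float *matrix2, float *matprod); /* from above */

/* hypothetical sketch: glMultMatrix only records the operand, and the
   composite matrix is computed the first time a draw call needs it */
typedef struct {
    float top[16];      /* last fully resolved matrix */
    float pending[16];  /* product of the mults issued since then */
    int   dirty;
} LazyMatrix;

void LazyMult(LazyMatrix *s, const float *m)
{
    float tmp[16];
    if (s->dirty) {
        MatMult(s->pending, m, tmp);  /* fold into the pending product */
        memcpy(s->pending, tmp, sizeof tmp);
    } else {
        memcpy(s->pending, m, sizeof s->pending);
        s->dirty = 1;
    }
}

const float *LazyResolve(LazyMatrix *s)
{
    float tmp[16];
    if (s->dirty) {
        MatMult(s->top, s->pending, tmp);  /* the deferred work happens here */
        memcpy(s->top, tmp, sizeof tmp);
        s->dirty = 0;
    }
    return s->top;
}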

Maybe.

j

They use SSE for sure, because when you install the driver, Control Panel->Display says GeForce xxx /AGP/SSE, and you can get the same thing with

glGetString(GL_RENDERER);
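
For example, a quick check might look like this (just a sketch - the exact renderer string varies from driver to driver, and you need a current GL context):

#include <stdio.h>
#include <string.h>
#include <GL/gl.h>

/* prints the renderer string and whether it advertises SSE */
void CheckRenderer(void)
{
    const char *renderer = (const char *) glGetString(GL_RENDERER);
    if (renderer != NULL)
        printf("%s -> %s\n", renderer,
               strstr(renderer, "SSE") ? "SSE advertised" : "no SSE tag");
}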

You should go to the Intel site for that. They have ready-to-compile code in their documents and a huge math library.

The docs say that multiplying two 4x4 matrices takes 172 cycles with pure FPU instructions and 90 cycles with SSE.

But do you find that there is reason to do your own matrix mult and other such operations?

V-man

I would be surprised if hand-coded assembly could beat the C compiler’s output for something simple like the code above. These compilers are way better at local optimizations like this than almost any hand-coder (except maybe the Graphics Programming Black Book guys at id Software).

But I guess that doesn’t answer your question, it just makes it more perplexing.

This is, of course, provided you have turned on your compiler’s optimizations. With them off (a -g debug build), it will do things just as you said.

Another possibility is that the OpenGL code turns off pointer aliasing and you don’t. Look up the ‘restrict’ keyword (or your compiler’s no-aliasing switch) and see if this makes a difference.
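
For example, something like this sketch, using the C99 ‘restrict’ qualifier (older compilers spell it __restrict, or have a command-line no-aliasing switch):

/* promising the compiler that the three arrays never overlap lets it
   keep each running sum in a register instead of reloading memory
   after every store */
void MatMult(const float * restrict a,
             const float * restrict b,
             float * restrict out)
{
    int i, j, k;
    for (i = 0; i < 4; i++)        /* column of the result */
        for (j = 0; j < 4; j++) {  /* row of the result */
            float sum = 0.0f;
            for (k = 0; k < 4; k++)
                sum += a[k*4 + j] * b[i*4 + k];
            out[i*4 + j] = sum;
        }
}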


Originally posted by V-man:
But do you find that there is reason to do your own matrix mult and other such operations?
V-man

Yes, the reason I would prefer to do my own mults etc. is that my engine can create both OpenGL and Direct3D views, so I’d like to keep matrix mults under my control rather than passing the job to whichever context I’m rendering into (GL/D3D).
I’ll look into the Intel math lib - is it free for commercial use? I’ll check…
I really should get more familiar with assembly…haven’t done it for years (Z80 was my last brush with it). I never look at the disassembly…perhaps I should.

One other thing - my maths library is in a separate DLL, so maybe there’s a performance penalty - but OpenGL is in a separate DLL too…so maybe not.

Nickels, I am a bit perplexed, yes
So much other stuff to do other than optimising rendering…

Mmm, the Intel math library seems to require me to give them some money…not interested.
So I trotted over to the AMD site, and they have a 3DNow!-optimised maths lib for free. I downloaded it, and look at how they multiply two 4x4 matrices together (look familiar?):-

void glMul_4x4 (float *r, const float *a, const float *b)
{
float tmp[16];

tmp[0]  = b[0] * a[0]  + b[4] * a[1]  + b[8]  * a[2]  + b[12] * a[3];
tmp[1]  = b[1] * a[0]  + b[5] * a[1]  + b[9]  * a[2]  + b[13] * a[3];
tmp[2]  = b[2] * a[0]  + b[6] * a[1]  + b[10] * a[2]  + b[14] * a[3];
tmp[3]  = b[3] * a[0]  + b[7] * a[1]  + b[11] * a[2]  + b[15] * a[3];
tmp[4]  = b[0] * a[4]  + b[4] * a[5]  + b[8]  * a[6]  + b[12] * a[7];
tmp[5]  = b[1] * a[4]  + b[5] * a[5]  + b[9]  * a[6]  + b[13] * a[7];
tmp[6]  = b[2] * a[4]  + b[6] * a[5]  + b[10] * a[6]  + b[14] * a[7];
tmp[7]  = b[3] * a[4]  + b[7] * a[5]  + b[11] * a[6]  + b[15] * a[7];
tmp[8]  = b[0] * a[8]  + b[4] * a[9]  + b[8]  * a[10] + b[12] * a[11];
tmp[9]  = b[1] * a[8]  + b[5] * a[9]  + b[9]  * a[10] + b[13] * a[11];
tmp[10] = b[2] * a[8]  + b[6] * a[9]  + b[10] * a[10] + b[14] * a[11];
tmp[11] = b[3] * a[8]  + b[7] * a[9]  + b[11] * a[10] + b[15] * a[11];
tmp[12] = b[0] * a[12] + b[4] * a[13] + b[8]  * a[14] + b[12] * a[15];
tmp[13] = b[1] * a[12] + b[5] * a[13] + b[9]  * a[14] + b[13] * a[15];
tmp[14] = b[2] * a[12] + b[6] * a[13] + b[10] * a[14] + b[14] * a[15];
tmp[15] = b[3] * a[12] + b[7] * a[13] + b[11] * a[14] + b[15] * a[15];

r[0]  = tmp[0];
r[1]  = tmp[1];
r[2]  = tmp[2];
r[3]  = tmp[3];
r[4]  = tmp[4];
r[5]  = tmp[5];
r[6]  = tmp[6];
r[7]  = tmp[7];
r[8]  = tmp[8];
r[9]  = tmp[9];
r[10] = tmp[10];
r[11] = tmp[11];
r[12] = tmp[12];
r[13] = tmp[13];
r[14] = tmp[14];
r[15] = tmp[15];

}

Now I’m even more perplexed.

Hey knackered, it really does depend on the compiler, debug build or release, and some other things. Take for example VectorC: they have a little demo where you can plug in the source, and they spit out the optimized version. ( http://www.codeplay.com/vectorc/demo-compiler.html )

They also have a demo showing off results between Intel, MS and their own compiler (I wonder who won? ) http://www.codeplay.com/vectorc/demo-int.html

Now if only I could afford the $$$ for the VectorC compiler. Around $800 or so.
I think the Intel compiler does SSE stuff, but, AFAIK, the MS compiler doesn’t automatically produce SSE or 3DNow! code.

Go over to www.flipcode.com and have a look at the code of the day; there’s an entry from a guy at Microsoft showing the various speeds of mat*mat routines. Amusingly, he asserted that the D3DX matrix multiplication is the slowest.

glMultMatrix can be accelerated in hardware, but this is implementation-dependent. Also, the whole point of SSE and 3DNow! is to support SIMD-style vectorized code. Compilers don’t tend to output that kind of code; you need to hand-code it in assembly, although Intel’s new compiler apparently does a good job. I doubt it’ll have a code path for 3DNow! assembler though.
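
For what it’s worth, a hand-vectorized 4x4 multiply in SSE intrinsics looks something like this (a sketch, not anyone’s actual driver code; it assumes 16-byte-aligned, column-major arrays):

#include <xmmintrin.h>

/* out = a * b, column-major; each __m128 holds one column of four floats */
void MatMultSSE(float *out, const float *a, const float *b)
{
    __m128 a0 = _mm_load_ps(a);       /* column 0 of a */
    __m128 a1 = _mm_load_ps(a + 4);   /* column 1 */
    __m128 a2 = _mm_load_ps(a + 8);   /* column 2 */
    __m128 a3 = _mm_load_ps(a + 12);  /* column 3 */
    int i;

    for (i = 0; i < 4; i++) {
        /* broadcast each element of column i of b, multiply, accumulate */
        __m128 c = _mm_mul_ps(a0, _mm_set1_ps(b[4*i + 0]));
        c = _mm_add_ps(c, _mm_mul_ps(a1, _mm_set1_ps(b[4*i + 1])));
        c = _mm_add_ps(c, _mm_mul_ps(a2, _mm_set1_ps(b[4*i + 2])));
        c = _mm_add_ps(c, _mm_mul_ps(a3, _mm_set1_ps(b[4*i + 3])));
        _mm_store_ps(out + 4*i, c);
    }
}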

Originally posted by zed:
Go over to www.flipcode.com and have a look at the code of the day; there’s an entry from a guy at Microsoft showing the various speeds of mat*mat routines. Amusingly, he asserted that the D3DX matrix multiplication is the slowest.

I am not sure if you read the whole text, but he said that the version he had at home (DX8.0a) was the slowest, but at work with DX8.1 it was the fastest, because the D3DX team implemented 3DNow! optimisations.

Well it’s going to be a bit long, but here it is.

The function multiplies two matrices and puts the answer in a third, i.e. ma * mb = result. The matrices are stored as arrays of 16 floats in column-major (OpenGL) order.

Here are the things we found sped up our routine, in order of importance. Of course, your mileage may vary depending on your compiler.

  1. When CPUs read from memory, they pull in a whole cache line of consecutive locations. Thus, there is a HUGE speed advantage if you can walk through your arrays in the same order they are laid out in memory. Notice that all three arrays involved are walked in order at all times. Notice, too, that doing this makes the function look like asm

  2. Using temporary variables and copying them to the result at the end is more efficient than assigning to the result as you go. I’m not sure why this is…perhaps because result is a pointer, the compiler can’t do some optimizations? Anyone know for sure?

  3. Assigning a value to the temporaries when they’re created instead of later can help a bit.

Without further ado…

void MatMult(const float *ma, const float *mb, float *result)
{
/* result = ma * mb, column-major (signature assumed from the description above) */
float m00 = ma[ 0] * mb[ 0];
float m10 = ma[ 1] * mb[ 0];
float m20 = ma[ 2] * mb[ 0];
float m30 = ma[ 3] * mb[ 0];
m00 += ma[ 4] * mb[ 1];
m10 += ma[ 5] * mb[ 1];
m20 += ma[ 6] * mb[ 1];
m30 += ma[ 7] * mb[ 1];
m00 += ma[ 8] * mb[ 2];
m10 += ma[ 9] * mb[ 2];
m20 += ma[10] * mb[ 2];
m30 += ma[11] * mb[ 2];
m00 += ma[12] * mb[ 3];
m10 += ma[13] * mb[ 3];
m20 += ma[14] * mb[ 3];
m30 += ma[15] * mb[ 3];

float m01 = ma[ 0] * mb[ 4];
float m11 = ma[ 1] * mb[ 4];
float m21 = ma[ 2] * mb[ 4];
float m31 = ma[ 3] * mb[ 4];
m01 += ma[ 4] * mb[ 5];
m11 += ma[ 5] * mb[ 5];
m21 += ma[ 6] * mb[ 5];
m31 += ma[ 7] * mb[ 5];
m01 += ma[ 8] * mb[ 6];
m11 += ma[ 9] * mb[ 6];
m21 += ma[10] * mb[ 6];
m31 += ma[11] * mb[ 6];
m01 += ma[12] * mb[ 7];
m11 += ma[13] * mb[ 7];
m21 += ma[14] * mb[ 7];
m31 += ma[15] * mb[ 7];

float m02 = ma[ 0] * mb[ 8];
float m12 = ma[ 1] * mb[ 8];
float m22 = ma[ 2] * mb[ 8];
float m32 = ma[ 3] * mb[ 8];
m02 += ma[ 4] * mb[ 9];
m12 += ma[ 5] * mb[ 9];
m22 += ma[ 6] * mb[ 9];
m32 += ma[ 7] * mb[ 9];
m02 += ma[ 8] * mb[10];
m12 += ma[ 9] * mb[10];
m22 += ma[10] * mb[10];
m32 += ma[11] * mb[10];
m02 += ma[12] * mb[11];
m12 += ma[13] * mb[11];
m22 += ma[14] * mb[11];
m32 += ma[15] * mb[11];

float m03 = ma[ 0] * mb[12];
float m13 = ma[ 1] * mb[12];
float m23 = ma[ 2] * mb[12];
float m33 = ma[ 3] * mb[12];
m03 += ma[ 4] * mb[13];
m13 += ma[ 5] * mb[13];
m23 += ma[ 6] * mb[13];
m33 += ma[ 7] * mb[13];
m03 += ma[ 8] * mb[14];
m13 += ma[ 9] * mb[14];
m23 += ma[10] * mb[14];
m33 += ma[11] * mb[14];
m03 += ma[12] * mb[15];
m13 += ma[13] * mb[15];
m23 += ma[14] * mb[15];
m33 += ma[15] * mb[15];

result[ 0] = m00;
result[ 1] = m10;
result[ 2] = m20;
result[ 3] = m30;
result[ 4] = m01;
result[ 5] = m11;
result[ 6] = m21;
result[ 7] = m31;
result[ 8] = m02;
result[ 9] = m12;
result[10] = m22;
result[11] = m32;
result[12] = m03;
result[13] = m13;
result[14] = m23;
result[15] = m33;
}

Knackered - if you could post how this benches compared to your other routines, I’d appreciate it

– Zeno

Edit: I also wanted to add that if anyone can improve this in any way, I’d love to see it


Originally posted by knackered:
Mmm, the Intel math library seems to require me to give them some money…not interested.
So I trotted over to the AMD site, and they have a 3DNow!-optimised maths lib for free. I downloaded it, and look at how they multiply two 4x4 matrices together (look familiar?):-

Now I’m even more perplexed.

Notice that your original code did not have the tmp array. It might be important, because it may give the compiler the hint that, during the computation, the input and output arrays are not the same. This has a significant effect on code speed.

Eero

Yes, I was thinking along those lines.
The output matrix could be anywhere in memory, whereas the temp matrix is created in the function - but the input matrices could be anywhere in memory too, surely?
Oh, I don’t know - I’m going to try Zeno’s code, and if that doesn’t give me the same performance as GL, then I’ll have to pass the matrix mults over to the context, which will mean Direct3D doing its own mults too…

Zeno, I wasn’t able to discern a difference between your mult and my mult as far as milliseconds go - they gave identical performance…but I’m at home, and my home machine is an Athlon 1.2GHz, whereas my work machine is a dual P3 700MHz - so I’m getting much better performance on my home machine anyway.
I’ll try it on the work machine on Monday to run a more revealing test.

Originally posted by knackered:
Yes, I was thinking along those lines.
The output matrix could be anywhere in memory, whereas the temp matrix is created in the function - but the input matrices could be anywhere in memory too, surely?
Oh, I don’t know - I’m going to try Zeno’s code, and if that doesn’t give me the same performance as GL, then I’ll have to pass the matrix mults over to the context, which will mean Direct3D doing its own mults too…

If the 3 matrices (A, B, and the result) are in different memory locations, but are each an array of, say, 16 elements, then accessing each will bring them into L2 cache, then L1 cache. PC caches (some are external, some are on-chip L2) are pretty much n-way set associative nowadays. I think PPro and up are 4-way, except for the server class, which are 8-way.

So it is important to calculate and see if the arrays are stepping on each other’s feet in the cache. I haven’t really done this myself, but I’ve seen code from people who analyze cache performance.
A couple of things you should know before you use 3DNow! or SSE:

  1. Check that the CPU supports it, using the cpuid opcode (a sketch is below).
  2. Check that the OS supports it, by executing an SSE or 3DNow! opcode and using an exception handler to see if it faults.

Someone’s benchmark (GLVulpine) executed SSE instructions, and even though I had a P3, it crashed. Reading the Intel manuals gave me the answer!
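
For step 1, here is a minimal cpuid sketch (assuming a GCC-style compiler with the <cpuid.h> helper; the relevant bits are SSE = EDX bit 25 on leaf 1, and 3DNow! = EDX bit 31 on extended leaf 0x80000001):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* leaf 1: standard feature flags */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("SSE:    %s\n", (edx & (1u << 25)) ? "yes" : "no");

    /* extended leaf 0x80000001: AMD feature flags */
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        printf("3DNow!: %s\n", (edx & (1u << 31)) ? "yes" : "no");

    return 0;
}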

PS: if someone has links on how to code intelligently in C or C++ so that the compiler can do better optimizations, please post them here.

V-man

>>I am not sure if you read the whole text, but he said that the version he had at home (DX8.0a) was the slowest, but at work with DX8.1 it was the fastest, because the D3DX team implemented 3DNow! optimisations<<

Yeah, I was wrong - sorry, I was talking outta my ass. Unfortunately I have a problem of just focusing on the numbers on a page.

Take a look at this

/* generated code, do not edit. */

#include "ode/matrix.h"

dReal dDot (const dReal *a, const dReal *b, int n)
{
dReal p0,q0,m0,p1,q1,m1,sum;
sum = 0;
n -= 2;
while (n >= 0) {
p0 = a[0]; q0 = b[0];
m0 = p0 * q0;
p1 = a[1]; q1 = b[1];
m1 = p1 * q1;
sum += m0;
sum += m1;
a += 2;
b += 2;
n -= 2;
}
n += 2;
while (n > 0) {
sum += (*a) * (*b);
a++;
b++;
n--;
}
return sum;
}

Taken from that excellent physics library ODE. It’s a dot product - notice the top line: ‘/* generated code, do not edit. */’
It looks like this code was generated by a machine, e.g. grown with a neural network perhaps - give it a billion generations and see what code it evolves. The same thing would be possible for a mat*mat multiplication.

BTW, the above dot product runs slower on my machine than A[0]*B[0] + A[1] …
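
That is, the fully unrolled form for the 4-element case, something like this (dDot4 is just a name I made up; dReal is ODE’s scalar type):

/* straight-line 4-element dot product: no loop or pointer overhead */
dReal dDot4 (const dReal *a, const dReal *b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}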