matrix optimize

02-21-2002, 11:20 AM

I am rethinking which internal representation to choose for matrices.
I prefer the way dGraphics does it:
Object _11, _12, _13, _14;
Object _21, _22, _23, _24;
Object _31, _32, _33, _34;
Object _41, _42, _43, _44;

Is multiplication with a doubly nested loop faster?

I mean, is m[4][4] the better way to store the data?

I also have another question: how can I load SIMD-optimized vector classes dynamically? Is there a good way to choose at run time which class to use (standard or SIMD-optimized)?
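One common way to approach the run-time choice asked about above is to pick a function pointer once at startup and route all calls through it. The following is only a sketch under assumptions: Vec4, AddFn, add_scalar, add_sse and select_add are hypothetical names, and the CPU check relies on the GCC/Clang builtin __builtin_cpu_supports, so other compilers would need their own CPUID test.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics
#endif

// Hypothetical 4-float vector; 16-byte alignment makes aligned SSE loads valid.
struct alignas(16) Vec4 { float v[4]; };

// Signature shared by every implementation.
using AddFn = void (*)(const Vec4&, const Vec4&, Vec4&);

// Portable fallback.
inline void add_scalar(const Vec4& a, const Vec4& b, Vec4& out) {
    for (int i = 0; i < 4; ++i) out.v[i] = a.v[i] + b.v[i];
}

#if defined(__SSE__)
// SSE version: one aligned load per operand, one packed add, one aligned store.
inline void add_sse(const Vec4& a, const Vec4& b, Vec4& out) {
    assert(reinterpret_cast<std::uintptr_t>(a.v) % 16 == 0);
    _mm_store_ps(out.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
}
#endif

// Picks an implementation once; callers keep the returned pointer.
inline AddFn select_add() {
#if defined(__SSE__) && defined(__GNUC__)
    // GCC/Clang builtin; an assumption of this sketch, not a portable API.
    if (__builtin_cpu_supports("sse")) return add_sse;
#endif
    return add_scalar;
}
```

Usage would look like AddFn add = select_add(); followed by add(a, b, result);. In a real build the SIMD version usually lives in its own source file compiled with the SIMD flags, so the rest of the program still runs on CPUs without that instruction set.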



02-21-2002, 02:15 PM
AFAIK the fastest way would be to use separate variable names (no array).

That's because with an array, the position of an element is computed by multiplying its index by the size of each element.

For a two-dimensional array, two multiplies are required to compute the position.
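The index arithmetic described here can be written out explicitly. This is a small sketch, assuming a hypothetical Mat4 with the m[4][4] layout from the question; offset_of and measured_offset are made-up helper names. It also shows that with constant indices the whole offset is a compile-time constant, so no multiply has to happen at run time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical matrix type using the m[4][4] layout from the question.
struct Mat4 { float m[4][4]; };

// Byte offset of m[row][col] from the start of the matrix: one multiply for
// the row stride and one for the element size.
constexpr std::size_t offset_of(std::size_t row, std::size_t col) {
    return (row * 4 + col) * sizeof(float);
}

// With literal indices the expression folds to a constant at compile time.
static_assert(offset_of(1, 2) == 6 * sizeof(float), "m[1][2] is the 7th float");

// Measures where the compiler actually placed an element.
inline std::size_t measured_offset(const Mat4& a, std::size_t row, std::size_t col) {
    assert(row < 4 && col < 4);
    const char* base = reinterpret_cast<const char*>(&a);
    const char* elem = reinterpret_cast<const char*>(&a.m[row][col]);
    return static_cast<std::size_t>(elem - base);
}
```

Only indices that are unknown until run time need the arithmetic to be done by the CPU, and even then it is typically folded into the addressing mode of the load.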

02-21-2002, 06:37 PM

I'm guessing you haven't actually examined the code emitted by a compiler with optimizations turned on.


There's no difference in the code emitted between:

// Three separately named members.
struct _1 {
    int a, b, c;
};
int sum( _1 * p ) {
    return p->a + p->b + p->c;
}

// The same data stored as an array.
struct _2 {
    int e[3];
};
int sum( _2 * p ) {
    return p->e[0] + p->e[1] + p->e[2];
}

02-21-2002, 08:47 PM
As you can see in the following piece of assembly, generated by VC6, there's indeed no difference.

; 30 : int t1 = b1.a + b1.b + b1.c;

mov eax, DWORD PTR _b1$[ebp]
add eax, DWORD PTR _b1$[ebp+4]
add eax, DWORD PTR _b1$[ebp+8]
mov DWORD PTR _t1$[ebp], eax

; 31 : int t2 = b2.m[0] + b2.m[1] + b2.m[2];

mov ecx, DWORD PTR _b2$[ebp]
add ecx, DWORD PTR _b2$[ebp+4]
add ecx, DWORD PTR _b2$[ebp+8]
mov DWORD PTR _t2$[ebp], ecx

So just check the assembly to find out what's faster..
(you can do that by going to menu 'Project -> Settings -> C/C++ tab -> Listing Files -> Assembly with Source Code')

Also, using for loops is slower, although the compiler might unroll them (I don't think it does, but I haven't actually checked).
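To compare the two forms in an assembly listing, something like the following could be compiled with optimizations on. It is a sketch only: Mat4, mul_loop, mul_unrolled and check_same are hypothetical names, and whether the looped version ends up unrolled depends on the compiler and its flags.

```cpp
#include <cassert>

// Hypothetical matrix type using the m[4][4] layout from the question.
struct Mat4 { float m[4][4]; };

// Straightforward triple loop: out = a * b.
inline Mat4 mul_loop(const Mat4& a, const Mat4& b) {
    Mat4 out{};  // zero-initialized accumulator
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                out.m[i][j] += a.m[i][k] * b.m[k][j];
    return out;
}

// Same product with the innermost loop written out by hand.
inline Mat4 mul_unrolled(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            out.m[i][j] = a.m[i][0] * b.m[0][j] + a.m[i][1] * b.m[1][j]
                        + a.m[i][2] * b.m[2][j] + a.m[i][3] * b.m[3][j];
    return out;
}

// Debug helper: asserts that two matrices hold identical elements.
inline void check_same(const Mat4& x, const Mat4& y) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            assert(x.m[i][j] == y.m[i][j]);
}
```

Both functions compute the same product, so any difference in the listing comes from how the compiler treats the loops rather than from the arithmetic.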

[This message has been edited by richardve (edited 02-21-2002).]

02-21-2002, 10:04 PM
as far as I know
mov eax, DWORD PTR _b1$[ebp]
is as fast as
mov eax, DWORD PTR _b1$[ebp+4]

because the address is calculated in the CPU pipeline before the move is executed.

You can optimize the code by forcing your compiler to align matrices and vectors to a 32-byte boundary, because the cache stores blocks of that size: if your vector straddles two blocks, the cache loads two 32-byte blocks even if you only need 16 bytes (4 floats).
So you should store your vectors in arrays to get more cache hits (with ISSE or 3DNow the CPU will wait while the memory works :D )
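A sketch of the alignment idea, using the C++11 alignas specifier (which postdates this thread; compilers of the time used vendor-specific attributes instead). Vec4, Mat4, kLineSize and lines_touched are hypothetical names, and the 32-byte line size is taken from the post; many later CPUs use 64-byte lines, so the constant is an assumption to tune.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: cache-line size as stated in the post.
constexpr std::size_t kLineSize = 32;

// A 16-byte vector aligned to 16 bytes can never straddle a 32-byte line.
struct alignas(16) Vec4 { float x, y, z, w; };

// A 64-byte matrix aligned to the line size occupies exactly two lines.
struct alignas(kLineSize) Mat4 { float m[16]; };

// Counts how many cache lines the byte range [p, p + size) touches.
inline std::size_t lines_touched(const void* p, std::size_t size) {
    assert(size > 0);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t first = addr / kLineSize;
    const std::uintptr_t last = (addr + size - 1) / kLineSize;
    return static_cast<std::size_t>(last - first + 1);
}
```

With these types, lines_touched(&v, sizeof v) is 1 for any Vec4 v, while a 16-byte read starting 24 bytes into a line touches two lines, which is the straddling case the post warns about.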

You can also use the prefetch instructions; I got 30% more performance in some parts of the code.
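A minimal sketch of software prefetching, assuming the GCC/Clang builtin __builtin_prefetch (on other compilers the helper below does nothing). prefetch_read, sum_with_prefetch and kPrefetchDistance are hypothetical names, and the distance of 16 elements is an arbitrary starting point rather than a recommended value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tuning constant: how many elements ahead to request.
constexpr std::size_t kPrefetchDistance = 16;

// Issues a read-prefetch hint where the compiler offers one.
inline void prefetch_read(const void* p) {
#if defined(__GNUC__)
    __builtin_prefetch(p);  // a hint only; the CPU may ignore it
#else
    (void)p;                // no-op on other compilers in this sketch
#endif
}

// Sums an array while requesting data a fixed distance ahead of the read.
inline float sum_with_prefetch(const float* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        if (i + kPrefetchDistance < count)
            prefetch_read(data + i + kPrefetchDistance);
        total += data[i];
    }
    return total;
}
```

Gains depend heavily on the access pattern and the CPU. A simple linear scan like this one is often handled well by hardware prefetchers, so it is worth measuring before and after.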

And don't do stuff like:

do something
use ISSE

Instead, collect the work into batches (try to fill the L1 cache size):

do something

use ISSE

hope this helps

02-21-2002, 10:53 PM
I use a union with anonymous structures; it's very comfortable to work with :)

union {
    struct {
        float a00, a10, a20, a30,
              a01, a11, a21, a31,
              a02, a12, a22, a32,
              a03, a13, a23, a33;
    };
    float array[16];
};


02-21-2002, 11:52 PM
Lev, that may be comfortable, but I don't think it's very portable..

(try compiling it with (VC) language extensions disabled.. ouch..)
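Anonymous structs inside a union are a compiler extension in C++, which is why disabling extensions breaks the code. One portable alternative, sketched here under assumptions, keeps a single array and provides named access through member functions. Mat4 and at are hypothetical names, and the column-major indexing is an assumption of this sketch, not something the thread specifies.

```cpp
#include <cassert>

// Hypothetical portable alternative: one array, element access by function.
struct Mat4 {
    float array[16];

    // Assumption: column-major storage, so element (row, col) is
    // array[col * 4 + row].
    float& at(int row, int col) {
        assert(row >= 0 && row < 4 && col >= 0 && col < 4);
        return array[col * 4 + row];
    }

    // Read-only overload for const matrices.
    float at(int row, int col) const {
        assert(row >= 0 && row < 4 && col >= 0 && col < 4);
        return array[col * 4 + row];
    }
};
```

With optimizations on, a call such as m.at(1, 2) with constant arguments is normally inlined down to a single indexed load, so the accessor should not cost more than a named member.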