OT: SSE implementation in C++

I’m working on C++ classes that wrap the SSE instructions (I didn’t like Intel’s implementation too much), so that they can easily be used in ongoing OpenGL programs.

I wanted to know if anyone would be interested in such classes; if so, I would post them on a website.

I’m asking just in case someone actually cares about it, so that I don’t waste hours putting it on a website with a FAQ for nothing.

I don’t want to sound offensive, but I am not really interested, largely because of its limited usefulness. In my experience SSE hasn’t made much of a difference for individual operations: dot products, cross products, vector operations, matrix multiplications. In general I use SSE for those operations when they come in a bunch, for example when processing a mesh: doing linear interpolation along with a dot product to determine direction with respect to the light, and generating a normal. It’s great when I need to do all of those operations together, because I don’t have to make a lot of temporary stores to memory. I can keep everything in registers and just do one final write of the output.

In response to your post: please do post the code on the web. The more help people get, the better. Although SSE is well documented, it tends to be difficult to find concise and useful information on how to use it. I am sure many people will find it quite useful.

Devulon

I’m interested in seeing your logic too.

SSE really only helps when you do “vector” processing where the same thing is done to lots of similar data in a vector.

In addition, assembly functions don’t get inlined, so the function call overhead of calling a function to use an SSE instruction will far outweigh the potential gain of using a single SSE instruction.

The reason the Intel compiler can get faster using SSE-like “functions” is that those are intrinsics that the compiler knows about. Thus, there’s no function call overhead, and the compiler knows how to emit and schedule these functions. Basically, these “functions” are special operators added to the language that the compiler recognizes, and these “operators” happen to use function-call-like syntax.
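To make the point about intrinsics and "vector" processing concrete, here is a rough sketch. The helper name mul_add_arrays is made up for illustration; the intrinsics themselves (_mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps) come from <xmmintrin.h>, and the sketch assumes an x86 compiler that ships that header.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ...

// Hypothetical helper (not from any library): out[i] = a[i] * b[i] + c[i].
// The same operation is applied to many floats, which is where SSE pays off.
void mul_add_arrays(const float* a, const float* b, const float* c,
                    float* out, std::size_t n)
{
    std::size_t i = 0;

    // Main loop: four floats per iteration. The intermediate product stays
    // in a register; only the final result is written back to memory.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // unaligned loads, so any pointer works
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }

    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

Each _mm_* call looks like a function call, but the compiler treats it as an operator it knows how to emit and schedule, so there is no call overhead inside the loop.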

The classes were written with performance in mind. They actually use Intel’s intrinsics (they work with Intel’s compiler and MS VC 6.0 or higher; I don’t know about other compilers).
They should be nearly as fast as a pure assembler version (and could even be faster in some cases, thanks to high-level language optimisations).

Here is a small and simple example of C++ code using those classes:

float4 A = 1.2;
float4 B = 5.1;

A *= (A + B);

This means that the two vectors are initialised as follows:
float A[4] = { 1.2, 1.2, 1.2, 1.2 };
float B[4] = { 5.1, 5.1, 5.1, 5.1 };

A *= (A + B) gives A[4] = { 7.56, 7.56, 7.56, 7.56 }
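
For readers wondering how a class like this can sit on top of the intrinsics, here is a minimal sketch. It is only a stand-in written for illustration: the member names, the store() function and the layout are assumptions, and the real float4 class may well look different.

```cpp
#include <xmmintrin.h>  // __m128, _mm_set1_ps, _mm_add_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical stand-in for the float4 class used in the example above.
class float4 {
public:
    // Broadcast one scalar into all four lanes, so "float4 A = 1.2f;" works.
    float4(float s) : v_(_mm_set1_ps(s)) {}

    // Wrap an existing SSE register.
    explicit float4(__m128 v) : v_(v) {}

    // Lane-wise multiply-assign.
    float4& operator*=(const float4& rhs)
    {
        v_ = _mm_mul_ps(v_, rhs.v_);
        return *this;
    }

    // Lane-wise addition.
    friend float4 operator+(const float4& a, const float4& b)
    {
        return float4(_mm_add_ps(a.v_, b.v_));
    }

    // Copy the four lanes out to memory (assumed helper, unaligned store).
    void store(float* out) const { _mm_storeu_ps(out, v_); }

private:
    __m128 v_;  // four packed single-precision floats
};
```

Because every member is defined inline and only forwards to an intrinsic, a compiler that understands the intrinsics can reduce A *= (A + B) to one add and one multiply on SSE registers.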

To test the performance of these classes I wrote several programs. If you want, you can download one of them here: http://users.pandora.be/tfautre/Files/SSETest2.exe
You’ll see how much faster the SSE version of the test is compared to a conventional C/C++ version using the FPU. (Btw, a friend tested it on his P4-2GHz. Man, this CPU IS fast!)

While I would tend to agree that for a single vector operation the function call overhead would probably offset the gains (or come very close to it), I would speculate that for something such as a matrix multiplication there probably would be a net gain by using SSE math in a function.
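
As a rough illustration of the matrix case, here is a sketch of a 4x4 multiply done with SSE inside ordinary functions. The function names are made up, and the column-major (OpenGL-style) layout is an assumption; whether it actually beats the FPU version on a given machine would need to be measured.

```cpp
#include <xmmintrin.h>  // __m128, _mm_loadu_ps, _mm_set1_ps, _mm_mul_ps, _mm_add_ps

// Hypothetical helper: out = M * v, where m points to a 4x4 column-major
// matrix (16 floats, OpenGL layout) and v, out point to 4 floats each.
void mat4_mul_vec4(const float* m, const float* v, float* out)
{
    // Each column is scaled by the matching component of v, then summed.
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m + 0), _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4),  _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8),  _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), _mm_set1_ps(v[3])));
    _mm_storeu_ps(out, r);  // single write of the finished result
}

// Hypothetical helper: out = A * B for column-major 4x4 matrices.
// out must not overlap a or b, since columns are written as they are computed.
void mat4_mul_mat4(const float* a, const float* b, float* out)
{
    for (int col = 0; col < 4; ++col) {
        mat4_mul_vec4(a, b + 4 * col, out + 4 * col);
    }
}
```

One call here does 16 multiplies and 12 adds for a vector, or four times that for a full matrix, so the cost of the call is spread over much more work than in the single-instruction case.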

Hmmm, have you ever tried the Microsoft Processor Pack?

It’s free and you can grab it here:
http://msdn.microsoft.com/vstudio/downloads/ppack/default.asp

Just make sure you download the one that matches the service pack level of your DevStudio.

Regards,

LG

I tried the Processor Pack very briefly when it was in beta, but all it does is provide you with support for the assembler instructions (so you don’t have to emit the opcode bytes by hand). At the time I tested it, it had no prebuilt functions for vector and matrix math.

Test 1 (A *= B). Please wait…
FPU: 126112 msec.
SSE: 221438 msec. (4294967252% faster)

haha…

Now THAT’S a boost!

Well… I only have 192 MB of RAM, so all the test does is benchmark how fast my HD can stream data in and out…

LOL. Nice result.
Btw, you found a bug in the test app due to an unsigned int.
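
For anyone curious how an unsigned int can print a number like that, here is one formula that happens to reproduce the exact figure from the timings above. It is only a guess for illustration; the function names are made up and the test app’s real code is not shown in this thread.

```cpp
#include <cstdint>

// Assumed formula, NOT taken from the test app: integer "percent faster".
// With unsigned arithmetic the subtraction wraps around modulo 2^32
// whenever the SSE run is slower than the FPU run.
std::uint32_t percent_faster_unsigned(std::uint32_t fpu_ms, std::uint32_t sse_ms)
{
    return fpu_ms * 100u / sse_ms - 100u;
}

// Same formula with signed integers: a slower SSE run gives a negative value.
std::int32_t percent_faster_signed(std::int32_t fpu_ms, std::int32_t sse_ms)
{
    return fpu_ms * 100 / sse_ms - 100;
}
```

With FPU = 126112 and SSE = 221438, the integer division gives 56, and 56 - 100 is -44. As an unsigned 32-bit value that is 4294967296 - 44 = 4294967252, the number in the output above; the signed version returns -44.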

Now, I’ve made my decision: I will post the C++ implementation, with some explanation, on a website. (Look right below.)

[This message has been edited by GPSnoopy (edited 04-27-2002).]

There, I’ve posted everything on a website, with a small FAQ and all.

Please tell me what you think about it.

http://users.pandora.be/tfautre/SIMD/