Questions about SIMD

frink · February 11, 2002, 5:14am

I’m using SIMD to transform my vertices, but I’m not seeing very much improvement in performance. I expected a much bigger gain, since I can do calculations on four pieces of data at a time. How much of a speedup should I be getting? I was thinking my problem might be with the cache. Is there any way to have data remain in the cache and never have it written to memory? I don’t want to have to wait for the cache to write to memory all the time. Any help or suggestions will be greatly appreciated.

richardve · February 11, 2002, 5:35am

Are we talking about software or hardware?
If it’s hardware, then why are you transforming those vertices yourself instead of letting the GPU do that?
If you’re using a hardware renderer, stop transforming them yourself; you’d be better off using those SIMD instructions for other purposes.
If you’re using a software renderer, wait for another answer, because I can’t help you with SIMD (what else would you expect on an OGL board?)

Julien_Cayzac · February 11, 2002, 5:36am

Originally posted by frink:
I’m using SIMD to transform my vertices, but I’m not seeing very much improvement in performance.

I’m a bit puzzled about using CPU extensions… As I remember, the old MMX extension used the same registers as the FPU and switching between them was costly, so using both MMX and the FPU was an excellent way to slow everything down. I don’t know if it’s the same with SIMD/SSE/SSE2/3DNow/3DNow2/…

Nutty · February 11, 2002, 8:13am

The switch between FPU and MMX was one of the first things that AMD optimized.

There is a feature on Athlons (and presumably on Pentium IIIs and 4s) called Fast Save and Restore, used for swapping between the two. It shouldn’t cost that much nowadays.

Nutty

imported_jwatte · February 11, 2002, 11:50am

He’s saying 4 elements at a time, so this must mean Pentium III SSE (on floats).

First, if you compare to letting OpenGL do the transform, then it’s likely that the driver and/or the card is as optimized as possible, and you’ll have a hard time beating it in the first place with your own code (though it’s possible, for special cases).

Second, if your card supports hardware transform, then you should be using that, as it’s “free” and lets the CPU work on other things. Some exceptions are when you’re doing things like matrix palette skinning.

Third, you have to design your SIMD code to run well. Align all buffers on cache lines. Make sure you use MOVAPS etc. Make sure you read and write large chunks at a time, to take advantage of already-open DRAM pages. Try using pre-fetching intelligently (make sure you’ve read a byte from each page so there’s a TLB entry before you pre-fetch). Make sure you don’t blow your L1 cache – it’s only 8 kB on the P4, and 16 kB on the P3.
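As a rough illustration of the alignment and MOVAPS advice, here is a minimal sketch using the SSE intrinsics from `<xmmintrin.h>`. The helper names (`alloc_aligned_floats`, `scale_offset_sse`) and the 32-byte alignment constant are assumptions made for this example, not anything from the posts above, and `_mm_malloc` is assumed to be reachable through `<xmmintrin.h>` as it is with GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_store_ps */

/* Hypothetical helper: allocate `count` floats on a 32-byte boundary
   (assumed cache-line size; it also satisfies the 16-byte alignment
   that MOVAPS requires). Release the buffer with _mm_free(). */
float *alloc_aligned_floats(size_t count)
{
    return (float *)_mm_malloc(count * sizeof(float), 32);
}

/* dst[i] = src[i] * scale + offset, four floats per iteration.
   Both pointers must be 16-byte aligned and count a multiple of 4. */
void scale_offset_sse(float *dst, const float *src, size_t count,
                      float scale, float offset)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(count % 4 == 0);

    const __m128 vscale  = _mm_set1_ps(scale);
    const __m128 voffset = _mm_set1_ps(offset);

    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(src + i);                 /* aligned load (MOVAPS)  */
        v = _mm_add_ps(_mm_mul_ps(v, vscale), voffset);  /* MULPS then ADDPS       */
        _mm_store_ps(dst + i, v);                        /* aligned store (MOVAPS) */
    }
}
```

The asserts only check the preconditions; an unaligned pointer passed to `_mm_load_ps` or `_mm_store_ps` would otherwise fault at run time.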

Last, beware that the P3 decoder can only decode/issue one SSE instruction per clock cycle. Further, it takes 2 clock cycles to execute an add or a mul, although there’s one add unit and one mul unit, so if you interleave them, you can get one instruction through per cycle. I believe the “shuffle” instruction is even more expensive, on the order of 3 cycles per shuffle, and it doesn’t pipeline well. If you don’t structure your data well, you will probably be drowned in shuffle overhead.
http://developer.intel.com/ has more information if and when you need it. It might be useful to buy a copy of VTune and run it (with 0.1 ms or smaller sample time) on 10,000 invocations of your assembly code, to see where the stalls are.
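One common way to keep shuffles out of the inner loop is to store vertices as separate x, y and z arrays (structure of arrays), so that four vertices line up in one register without any rearranging. The sketch below assumes that layout and a row-major 3x4 affine matrix; the function name `transform_soa_sse` and its parameters are hypothetical, and this is not necessarily how the original poster's data is organized.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Transform `count` points stored as separate x, y, z arrays by a
   row-major 3x4 affine matrix m:
       ox = m[0]*x + m[1]*y + m[2]*z  + m[3]
       oy = m[4]*x + m[5]*y + m[6]*z  + m[7]
       oz = m[8]*x + m[9]*y + m[10]*z + m[11]
   All arrays must be 16-byte aligned and count a multiple of 4. */
void transform_soa_sse(const float m[12],
                       const float *x, const float *y, const float *z,
                       float *ox, float *oy, float *oz, size_t count)
{
    assert(count % 4 == 0);

    /* Broadcast each matrix element once, outside the hot loop. */
    __m128 c[12];
    for (int k = 0; k < 12; ++k)
        c[k] = _mm_set1_ps(m[k]);

    for (size_t i = 0; i < count; i += 4) {
        const __m128 vx = _mm_load_ps(x + i);
        const __m128 vy = _mm_load_ps(y + i);
        const __m128 vz = _mm_load_ps(z + i);

        /* Each output row is a mix of multiplies and adds, which gives
           the scheduler the chance to overlap the two kinds of work. */
        const __m128 rx = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(vx, c[0]), _mm_mul_ps(vy, c[1])),
            _mm_add_ps(_mm_mul_ps(vz, c[2]), c[3]));
        const __m128 ry = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(vx, c[4]), _mm_mul_ps(vy, c[5])),
            _mm_add_ps(_mm_mul_ps(vz, c[6]), c[7]));
        const __m128 rz = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(vx, c[8]), _mm_mul_ps(vy, c[9])),
            _mm_add_ps(_mm_mul_ps(vz, c[10]), c[11]));

        _mm_store_ps(ox + i, rx);
        _mm_store_ps(oy + i, ry);
        _mm_store_ps(oz + i, rz);
    }
}
```

With interleaved x, y, z, w vertices the same math would need a transpose or horizontal adds per vertex, which is where the shuffle cost tends to come from.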

Humus · February 11, 2002, 12:07pm

Also, using MOVNTQ (I think it’s called something like that) to write to memory bypassing the caches can improve performance significantly.

imported_jwatte · February 11, 2002, 2:57pm

movntq == move quad-word non-temporal (without writing to cache). It moves an MMX register (64 bits), but is an SSE instruction.

movntps == move non-temporal packed single-precision. It moves an SSE register (128 bits).

These are great if you’re only going to be working on the data once, and not touch it again, as you can load your cache up with matrices and code, and don’t need to pollute it with data that you will bang once and then forget about.
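For illustration, a write-once output buffer could be filled with MOVNTPS through the `_mm_stream_ps` intrinsic, roughly as below. The function name `scale_stream_sse` is made up for this sketch, and whether streaming stores actually help depends on the buffer size and on whether the data is read back soon afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Write src[i] * scale to dst using non-temporal stores, so the output
   does not displace matrices and other hot data from the cache.
   Both pointers must be 16-byte aligned and count a multiple of 4. */
void scale_stream_sse(float *dst, const float *src, size_t count, float scale)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(count % 4 == 0);

    const __m128 vscale = _mm_set1_ps(scale);

    for (size_t i = 0; i < count; i += 4) {
        const __m128 v = _mm_mul_ps(_mm_load_ps(src + i), vscale);
        _mm_stream_ps(dst + i, v);  /* MOVNTPS: store with non-temporal hint */
    }

    /* Non-temporal stores are weakly ordered; SFENCE makes them visible
       before any later stores. */
    _mm_sfence();
}
```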

prefetch comes in a similar model (prefetchnta) although that may still kick out older cache lines from your L1.
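A minimal sketch of prefetchnta via the `_mm_prefetch` intrinsic follows. The function name `sum_with_prefetch` and the look-ahead distance `PREFETCH_AHEAD` are assumptions for the example; the right distance has to be found by measurement on the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical tuning value: how many floats ahead of the current
   position to prefetch (64 floats = 256 bytes). */
#define PREFETCH_AHEAD 64

/* Sum a 16-byte-aligned float buffer, reading it once and hinting
   upcoming data with PREFETCHNTA. count must be a multiple of 8. */
float sum_with_prefetch(const float *src, size_t count)
{
    assert(count % 8 == 0);

    __m128 acc = _mm_setzero_ps();

    for (size_t i = 0; i < count; i += 8) {
        /* One prefetch per 32 bytes consumed; skipped near the end so
           the pointer never leaves the buffer. */
        if (i + PREFETCH_AHEAD < count)
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD), _MM_HINT_NTA);

        acc = _mm_add_ps(acc, _mm_load_ps(src + i));
        acc = _mm_add_ps(acc, _mm_load_ps(src + i + 4));
    }

    /* Add the four lanes together in scalar code. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```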

Also, make sure you don’t suffer overly from things like partial cacheline evictions; that can be killer!

[This message has been edited by jwatte (edited 02-11-2002).]