SIMD data structures to OpenGL VBOs



Nicholas Bishop
12-28-2004, 11:38 AM
Recently, I've been attempting to take some of my code over to SIMD (specifically, SSE). SSE is perfect, as I have a great many functions doing math on all the vertices in a mesh.
My problem is how to organize the data. The Intel documentation (http://developer.intel.com/technology/itj/q21999/articles/art_5.htm) I've been reading indicates that the obvious array-of-structures (AOS) data format is not the most efficient one to use with SSE; a structure-of-arrays (SOA) format is better, or even a hybrid format. The problem is that the suggested SOA and hybrid formats are very different from the AOS format that VBOs use. So, what's the recommended route to take?
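For reference, the two layouts under discussion look roughly like this in C (a sketch only; the field names are illustrative, not from any particular codebase):

/* Array-of-structures (AOS): what a VBO naturally wants --
   one interleaved record per vertex. */
typedef struct {
    float x, y, z, w;    /* padded to 16 bytes so each vertex is SSE-aligned */
} VertexAOS;

/* Structure-of-arrays (SOA): what the Intel paper recommends for SSE --
   all x components contiguous, all y components contiguous, and so on,
   so one aligned load grabs four x values at once. */
typedef struct {
    float *x;            /* num_verts floats, 16-byte aligned */
    float *y;
    float *z;
} MeshSOA;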

jwatte
12-28-2004, 05:11 PM
You can't store x components separately from y components with vertex arrays, so your output needs to be interleaved (AoS). Intel might wish the world to be SoA, but by and large, it isn't.

You can still use SSE. Just put enough padding (or other data) in so that all the important bits (vertex data, normal data, etc.) are 16-byte aligned.

A vertex-by-matrix multiply with the appropriately swizzled data is really quite simple:

1) move matrix into emm2, emm3, emm4 and emm5
2) move vertex into emm0
3) xor emm6,emm6

4) move emm0 to emm1
5) swizzle emm1 to broadcast the "x" component to all 4 components
6) multiply emm1, emm2
7) add emm6, emm1

8) move emm0 to emm1
9) swizzle emm1 to broadcast the "y" component to all 4 components
10) multiply emm1, emm3
11) add emm6, emm1

12) move emm0 to emm1
13) swizzle emm1 to broadcast the "x" component to all 4 components
14) multiply emm1, emm2
15) add emm6, emm1

16) add emm6, emm5 (defaulting to "1" for w)

Note that swizzles have a 3 instruction latency, and multiplies and adds each have a 2 instruction latency, and the multiply unit can get through one emm multiply per 2 cycles, and the adder can get through one emm add per 2 cycles. With the right scheduling, the inner loop (4-16, plus a store) should run in something like 25 cycles. The hardest thing is breaking the dependency on the swizzles -- you can use an extra register to do it. This is assuming it runs out of L1 cache -- pre-fetching and streaming stores ought to make sure of that.

The rules I described are Pentium III rules, but result in good performance for other architectures, too.
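In compiler-friendly terms, the same broadcast-multiply-add idea looks roughly like this with SSE intrinsics (just a sketch, assuming the matrix rows m[0]..m[3] are stored so that m[0] holds the coefficients that multiply x, m[3] the translation, and everything is 16-byte aligned):

#include <xmmintrin.h>

/* out = M * (x, y, z, 1): broadcast each component of v, multiply it by the
   corresponding matrix row, and accumulate.  w defaults to 1 by simply
   adding m[3] (the translation row) at the end. */
static inline __m128 transform_point(const __m128 m[4], __m128 v)
{
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)); /* xxxx */
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); /* yyyy */
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)); /* zzzz */

    __m128 r = _mm_mul_ps(x, m[0]);
    r = _mm_add_ps(r, _mm_mul_ps(y, m[1]));
    r = _mm_add_ps(r, _mm_mul_ps(z, m[2]));
    return _mm_add_ps(r, m[3]);
}

A loop over the mesh would load vertices with _mm_load_ps and write results with _mm_stream_ps to get the streaming stores mentioned above.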

Nicholas Bishop
12-28-2004, 06:41 PM
Thank you for all the detailed information; swizzling it is...

jwatte
12-29-2004, 08:45 PM
Points for whoever spots the typo :-)

(Hint: it's line 14)

zeckensack
12-30-2004, 12:39 AM
Originally posted by jwatte:
Points for whoever spots the typo :-)

(Hint: it's line 14)
It's full of typos. Should be xmm instead of emm. But you probably mean the 2 vs 4 copy&paste glitch ;)

Humus
12-30-2004, 10:17 AM
Speaking of swizzles, one thing I really miss in SSE and SSE2 is horizontal adds. That's the most obvious way to accelerate very common stuff like dot products, but it only made it in with SSE3 for some reason. The lack of horizontal adds makes coding SSE much more inconvenient for most tasks, and you're forced to swizzle way more than you'd like to. It should have been there from the start, like it was in 3DNow.
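To make the swizzle overhead concrete, here's roughly what a 4-component dot product looks like with and without the SSE3 horizontal add (a sketch in intrinsics; the SSE3 variant needs the appropriate compiler switch, e.g. -msse3 on gcc):

#include <xmmintrin.h>   /* SSE */
#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

/* SSE/SSE2 style: multiply, then fold the partial products together
   by hand with movehl + shuffle. */
static inline float dot4_sse(__m128 a, __m128 b)
{
    __m128 m  = _mm_mul_ps(a, b);                               /* x, y, z, w products */
    __m128 s  = _mm_add_ps(m, _mm_movehl_ps(m, m));             /* x+z, y+w, ... */
    __m128 s2 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));  /* y+w into lane 0 */
    return _mm_cvtss_f32(_mm_add_ss(s, s2));
}

/* SSE3 style: two horizontal adds and done. */
static inline float dot4_sse3(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);
    m = _mm_hadd_ps(m, m);   /* x+y, z+w, x+y, z+w */
    m = _mm_hadd_ps(m, m);   /* x+y+z+w in every lane */
    return _mm_cvtss_f32(m);
}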

jwatte
12-30-2004, 01:07 PM
The reason horizontal adds weren't in SSE or SSE2 is that the Intel CPUs internally use 64-bit buses, forwarding paths, and register files. This means that operating on an XMM register takes two cycles! (Well, insofar as "cycles" are defined in that architecture... it gets hairy at that level :-)

Anyway, this means that they didn't have the necessary interconnect between the two halves of the XMM registers to do a correct dot product efficiently. (On this topic: I think the swizzles are very special -- and they take longer because of this)

I wholeheartedly agree that a crosswise add was long overdue. Note that 3DNow only does 2 elements at a time, so they didn't have the forwarding problem, but instead you quickly run out of available registers; it's not SIMD enough to be worth it IMO.

V-man
12-30-2004, 04:40 PM
>>> it's not SIMD enough to be worth it IMO.<<<<

I have done benchmarking with vec3 dot products. FPU was faster than SSE. SSE wastes time with the shuffles.
For vec4, it was just a tiny bit faster.

Have any of you tried this?

jwatte
12-30-2004, 09:03 PM
When you multiply by more than one matrix (such as when CPU skinning), the shuffles amortize over more operations. The actual case I was timing (starting four years ago now!) was multi-matrix blending and both vertices and normals, doing streaming stores to AGP VAR memory. At the time, we got a nice improvement over all the other mechanisms. (We used a slightly different shuffle, and swizzled the matrix instead when generating, btw, which saved one shuffle)

Also, on Pentium hardware, there's more of a difference between SSE and x87; the Athlon line is known for being quite good at x87 instruction execution. Which hardware were you using?

Humus
12-31-2004, 03:43 PM
Originally posted by V-man:
>>> it's not SIMD enough to be worth it IMO.<<<<

I have done benchmarking with vec3 dot products. FPU was faster than SSE. SSE wastes time with the shuffles.
For vec4, it was just a tiny bit faster.

Have any of you tried this?
Well, I don't get FPU beating SSE, but I often get 3DNow beating SSE, even when I'm working on vec4 stuff and don't need many shuffles, such as in a vec4 lerp that you'd think SSE would be faster on.
That's on an Athlon64, don't know if it's any different on Athlon-xp.
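For what it's worth, the lerp in question is about as SSE-friendly as it gets -- just a broadcast of t and three vector ops. Roughly, and only as an illustration, not my actual code:

#include <xmmintrin.h>

/* result = a + (b - a) * t for a whole float4; the only shuffling is the
   broadcast of the scalar t. */
static inline __m128 lerp4(__m128 a, __m128 b, float t)
{
    __m128 tt = _mm_set1_ps(t);
    return _mm_add_ps(a, _mm_mul_ps(_mm_sub_ps(b, a), tt));
}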

V-man
01-01-2005, 11:13 AM
Originally posted by jwatte:
When you multiply by more than one matrix (such as when CPU skinning)
I did the test 2 weeks ago, and of course, the SSE version was actually vec4 with w = 0 (otherwise I'd have to find a way to discard w).
Since I was using 10 million vertices:
XYZ = 114 MB
XYZW = 152 MB

and it was an Athlon XP.
I know the P4 FPU is weak and probably still is with the Prescotts. Intel wants everyone to use SSE.

So FPU was about 6% faster.

I have a certain algorithm that does a lot of vec3 dot products so I wanted to see if it's worth going to SSE.

>>>I often get 3DNow beating SSE<<<

Sounds good. By how much does it beat it?

jwatte
01-01-2005, 05:21 PM
Discarding the "w" means that your source data is not 16-byte aligned, and thus you have to fetch with MOVUPS instead of MOVAPS. That can eat up a noticeable part of your speed, once you're at the point that memory throughput matters.

We stored a "1" for w for the position, and "0" for w for the normal, and disallowed non-uniform scale in our animations, and interleaved normals and position in one vertex array -- I think you can see why :-)

Christian Schüler
01-01-2005, 05:37 PM
I did animation playback (Hermite interpolation) and skinning palette computation, in C, SSE and 3DNow.

The biggest benefit of switching to any of the SIMD instruction sets, on any processor, was being able to use streaming stores.

The biggest downer when using SSE was that palette matrices had to be transposed after the SSE computation to fit into 3 shader constants each.

Some results from my experience

straight C ~700 cycles/joint
SSE, Pentium4 ~550 cycles/joint
SSE, Athlon64 ~450 cycles/joint
3DNow, Athlon64 ~400 cycles/joint

The work to be done per joint was about the equivalent of 500 SSE instructions.
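That transpose step looks roughly like this with the xmmintrin transpose macro (just a sketch, assuming column-major palette matrices whose last row is 0,0,0,1; the names are hypothetical):

#include <xmmintrin.h>

/* Turn a column-major 4x4 palette matrix (four XMM columns) into the three
   float4 rows that get uploaded as shader constants; the fourth row is
   always 0,0,0,1 and is not sent. */
static void palette_to_constants(const __m128 col[4], float out[3][4])
{
    __m128 r0 = col[0], r1 = col[1], r2 = col[2], r3 = col[3];
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   /* columns become rows */
    _mm_storeu_ps(out[0], r0);
    _mm_storeu_ps(out[1], r1);
    _mm_storeu_ps(out[2], r2);
    (void)r3;                            /* dropped: it's 0,0,0,1 */
}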

Humus
01-02-2005, 07:20 PM
Originally posted by V-man:
Sounds good. By how much does it beat it?
Depends on what I'm doing, but something like 20% is not uncommon. I just released a demo (http://www.humus.ca/index.php?page=3D&ID=57) that does some 3DNow/SSE/FPU stuff, and with 3DNow it runs about 15% faster than with SSE.

Omaha
01-03-2005, 08:03 PM
Jesus Christ! I know most of these terms but the processes discussed are like.

( stuff here )

( my head way down here )

I just got my degree a couple weeks ago, but I've been programming since grade school, and working with OpenGL for three to four years now. Should I know all this stuff already? I'm always paranoid about being caught way behind on knowledge, since staying with the game is so important in this industry.

I have the full set of architecture manuals from AMD and Intel, covering the AMD64 and IA32 architectures; the AMD book kind of subsumes the 32-bit stuff too. I should probably brush up on my CPU extensions. And my basic assembly too... Ugh, I have so much to learn...

Oh yeah I start my first programming job on Monday too. Bwa ha. Cheers!

jwatte
01-04-2005, 06:31 PM
Yeah, you're basically doomed. Spreadsheet macros only for you from now on.

;-)

Omaha
01-05-2005, 04:53 PM
Don't tell me that!

/me hyperventilates

Borderline freakout here... Have you seen the screenshots of AOE3? I'm still working with multitexturing and plain blending... I really need to write some vertex and fragment programs, and then LEARN how to do GLSL shaders because that's the way things are going.

And then there's that other API I should probably learn...

ZbuffeR
01-05-2005, 05:06 PM
For GLSL, play with Shader Designer at http://www.typhoonlabs.com/
GLSL is much easier than vertex and fragment programs IMHO.

jwatte
01-05-2005, 08:40 PM
I was kidding. Here's some career advice:

The career path of "graphics guru" is actually fairly narrow, has a lot of competition, and changes every three years.

The career path of "engineer with depth in many areas, ability to focus on necessities, and ability to figure out what's needed to deliver" actually looks a lot better in many cases.

So, don't sweat the details. Write the code you think is fun. If you're puttering around with your own hobby stuff, you WILL catch up with the people who do things "for real" because that takes 10 times longer. Once you're at parity, THEN you do something for real to show that you're not just all experiments.

l_belev
01-09-2005, 05:13 AM
Originally posted by Humus:
Speaking of swizzles, one thing I really miss in SSE and SSE2 is horizontal adds.
Here's one potentially faster way to do a horizontal add. It employs the fact that the *ss instructions don't have special alignment requirements/penalties for memory operands:


<source in xmm0.xyzw>
movaps membuf, xmm0
addss xmm0, membuf+4
addss xmm0, membuf+8
addss xmm0, membuf+16
<result in xmm0.x>
note that the last addition is not needed for 3-component operation
of course the membuf would stay in the cache all the time, or better, be forwarded
there is an irony in that working with memory is more flexible than working with registers in this case; surely Intel's designers could have done a better job

l_belev
01-09-2005, 05:33 AM
Noticed the typo? :-)

zeckensack
01-09-2005, 05:38 AM
Originally posted by l_belev:
Noticed the typo? :-)
Yes :p
+12

l_belev
01-09-2005, 06:14 AM
right :D

Zengar
01-09-2005, 10:12 AM
Athlon XP supports SSE as so-called vector-path opcodes, which are fairly inefficient; that's why your FPU was faster than SSE. On Athlon XP, 3DNow! should be used.
However, Athlon 64 has native support for SSE and SSE2, so it would be faster to use those opcodes instead of FPU/3DNow!

At least i think so ;)

Humus
01-09-2005, 01:24 PM
Well, that's not what I'm seeing. 3DNow consistently beats SSE on Athlon64. Even when you're working with vec4s and there are many more instructions in the 3DNow path.

jwatte
01-09-2005, 05:37 PM
I'm not surprised that Athlon 64 runs 3DNow fast. It's their instructions -- for sure, nobody else will do it!

The real question is in what cases the Athlon 64 with 3DNow out-runs a Pentium 4 EE with SSE3, and vice versa. You can bet Intel implements SSE as well as they can.

Humus
01-10-2005, 09:00 PM
Yup. In my experience SSE on P4 is roughly the same speed as 3DNow! on Athlon64 when comparing similar CPU speeds (like 3.2 GHz vs 3200+). In my MetaBalls demo, which relies heavily on SSE and 3DNow for performance, my 3.2 GHz P4 laptop runs at 130 fps using the SSE path, while my Athlon64 3200+ runs at 125 fps using 3DNow and 110 fps using SSE. So it's more a case of the Athlon64 being slow (well, slower anyway) on SSE than of it being particularly fast on 3DNow.

Tzupy
01-11-2005, 01:18 AM
Hi, I have two points to make:

1) When talking about the Athlon 64 3200, it would be nice to specify which of the three flavors it is: 2.0 GHz, 1 MB, single-channel; 2.2 GHz, 512 KB, single-channel; 2.0 GHz, 512 KB, dual-channel.
2) Even if the integrated memory controller of the Athlon64 is great, one shouldn't rely heavily on it: prefetch techniques should still be used, especially for the dual-channel Athlon64s; in theory, one would then get the most performance by using both SSE(2) and 3DNow.

One more thing: in 64-bit mode there are 8 more XMM registers available, so an SSE(2) performance increase can be expected. Does anyone know anything specific about this?

V-man
01-11-2005, 06:53 AM
(like 3.2 GHz vs 3200+)
The AMD is clocked lower, hence it does more work per cycle. :D
(you didn't OC, did you?)

And the exact system spec will matter, because your MetaBalls demo may benefit from the cache and such.

I would like to know by what % one is superior to the other (factoring out cache and memory performance).

Humus
01-11-2005, 08:28 PM
Originally posted by Tzupy:
Hi, I have two points to make:

1) When talking about the Athlon 64 3200, it would be nice to specify which of the three flavors it is: 2.0 GHz, 1 MB, single-channel; 2.2 GHz, 512 KB, single-channel; 2.0 GHz, 512 KB, dual-channel.
2) Even if the integrated memory controller of the Athlon64 is great, one shouldn't rely heavily on it: prefetch techniques should still be used, especially for the dual-channel Athlon64s; in theory, one would then get the most performance by using both SSE(2) and 3DNow.

One more thing: in 64-bit mode there are 8 more XMM registers available, so an SSE(2) performance increase can be expected. Does anyone know anything specific about this?
I'm using the 2.2 GHz version. In my demo I'm not so much dependent on memory performance but rather on raw computation performance. Extra registers would only help if you need more of them. I also believe the use of the extra 8 registers creates larger code because of a prefix byte, but I'll have to verify that.

Tzupy
01-12-2005, 05:48 AM
Humus, you are correct: there's a REX prefix involved in the use of the new registers, but I doubt it will have a significant impact on performance. Here is my reason for needing more registers: it is possible to write ASM code that processes both step n and step n+1, interleaved. The purpose is to try to hide instruction latency when your algorithm has step n+1 immediately dependent on step n. The drawback is that you need double the number of registers you needed without this interleaving. I have been using this technique since I had a 486, many years ago, and was running into memory limitations on some crude texturing code. More recently, a population count code implemented with MMX benefited about 15% from the interleaving, compared with the AMD implementation in the x86 Code Optimisation Guide (I think I should have done better :D ).
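A plain-C caricature of that interleaving, with a naive bit counter standing in for the real MMX kernel (illustration only):

#include <stddef.h>

static unsigned popcount8(unsigned char v)   /* naive stand-in for the MMX code */
{
    unsigned n = 0;
    while (v) { n += v & 1u; v >>= 1; }
    return n;
}

/* Two independent accumulator chains: while the work for element n is still
   in flight, the work for element n+1 can already issue.  The price is
   needing twice the registers (a and b instead of just a). */
unsigned count_bits(const unsigned char *data, size_t len)
{
    unsigned a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        a += popcount8(data[i]);       /* step n   */
        b += popcount8(data[i + 1]);   /* step n+1 */
    }
    if (i < len)
        a += popcount8(data[i]);
    return a + b;
}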

zeckensack
01-12-2005, 07:31 AM
Originally posted by Humus:
Extra registers would only help if you need more of them. I also believe the use of the extra 8 registers creates larger code because of a prefix byte, but I'll have to verify that.
It's still better than spilling to memory from a code-density POV. Memory references make opcodes longer by at least one byte.

Zengar
01-12-2005, 09:22 AM
Typical SIMD instructions with a REX prefix will likely be 4 bytes (if my memory doesn't trick me): one REX byte, two opcode bytes and a ModRM byte. So they have a good chance of fitting in the decode window of the Athlon. Pentiums will have more problems, I guess, as they have only one decode unit (again, if my memory doesn't trick me ;-) ).

I'm just a hobby assembly programmer (writing compilers and such stuff), so don't rely on my words.

l_belev
01-13-2005, 01:02 PM
On the Pentium 4 the prefixes are no longer a concern, since the processor "re-compiles" the incoming code stream into its internal representation and stores it in that form in the trace cache. Since most of the time-consuming code in real-life situations is located in loops, which usually fit completely in the trace cache, the original machine code (including prefixes, etc.) does not matter at all.
That is with the P4, which is Intel. I don't know what the case is with AMD, but I suppose they would go the same way. The only obstacle I can think of could be some patent-related problems, but I guess AMD has enough money to work around those.
Generally there's no need to worry about the prefixes. IMO the extra registers are the best thing in AMD64, not the 64 bits.

Humus
01-13-2005, 09:54 PM
Yeah, extra registers are definitely a good thing, though the need isn't as dramatic for SIMD as for regular ALU instructions. Back in the old days when I wrote a raycasting engine for the 486, I remember the hell of trying to squeeze everything into the registers. Not only did you have just 8, you also lost the stack and frame pointers, so you essentially had 6. When you do SSE and the like you can use the xmm registers for the math and the ALU registers for pointers and counters, so you have a lot more freedom than when you have to do both the math and increment counters and pointers with only 6 registers (or 7 if you compile with frame pointer omission). So it's not that often (at least for the stuff I'm doing) that I really need more registers. For compiled code, though, the extra ALU registers will probably do wonders for performance.

xanatose
01-14-2005, 07:21 PM
SSE is off-topic for OpenGL, but since the topic is already being discussed here, I wondered if anyone knows a link to a good tutorial on SSE and SSE2.

I know how to use asm up to the Pentium, but I don't know my way around MMX, SSE and SSE2. I downloaded the manuals from Intel, but I would really need a good tutorial on the subject to get up to speed.

Tzupy
01-14-2005, 11:25 PM
I'm not sure about a tutorial on SSE, but if you download the AMD x86 Code Optimisation Guide you'll find examples of code implemented with MMX and 3DNow. There are also several Intel papers, like 'Application tuning for SSE', 'Antialiasing implemented using SSE', etc.

jwatte
01-16-2005, 09:09 PM
Humus: you don't need the frame pointer, because you can get all your locals off the stack pointer. So we're up to 7 registers.

Then, you don't need the stack pointer, if you store it in a global variable, and turn off interrupts :-)

Humus
01-22-2005, 10:40 AM
Yeah, that's what I said ("7 if you compile with frame pointer omission").