Part of the Khronos Group
OpenGL.org


Thread: SIMD data structures to OpenGL VBOs

  1. #1
    Junior Member Newbie
    Join Date
    Dec 2004
    Location
    USA
    Posts
    6

    SIMD data structures to OpenGL VBOs

    Recently, I've been porting some of my code to SIMD (specifically, SSE). SSE is a good fit, since I have a great many functions doing math on all the vertices in a mesh.
    My problem is how to organize the data. The Intel documentation I've been reading indicates that the obvious array-of-structures (AoS) format is not the most efficient one for SSE; a structure-of-arrays (SoA) format, or even a hybrid format, works better. However, the suggested SoA and hybrid formats are very different from the AoS format of VBOs. So, what's the recommended route to take?
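
    To make the mismatch concrete, here is a minimal sketch of the two layouts in C. The type names, the fixed `MESH_VERTS` size, and the `soa_to_aos` helper are all hypothetical, chosen only for illustration; the conversion loop shows one possible route (compute in SoA, interleave before upload), not a recommendation.

```c
#include <assert.h>
#include <stddef.h>

#define MESH_VERTS 8  /* hypothetical mesh size, kept a multiple of 4 */

/* AoS: one struct per vertex, the interleaved layout a VBO expects. */
typedef struct { float x, y, z, w; } VertexAoS;

/* SoA: one array per component, the layout suggested for SSE math. */
typedef struct {
    float x[MESH_VERTS];
    float y[MESH_VERTS];
    float z[MESH_VERTS];
    float w[MESH_VERTS];
} MeshSoA;

/* Interleave SoA working data into AoS form, e.g. before a VBO upload. */
static void soa_to_aos(const MeshSoA *src, VertexAoS *dst, size_t count)
{
    assert(count <= MESH_VERTS);
    for (size_t i = 0; i < count; ++i) {
        dst[i].x = src->x[i];
        dst[i].y = src->y[i];
        dst[i].z = src->z[i];
        dst[i].w = src->w[i];
    }
}
```

    The extra copy is the cost of keeping SoA on the CPU side; whether it pays off depends on how much math is done per vertex between uploads.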

  2. #2
    Senior Member OpenGL Guru
    Join Date
    Mar 2001
    Posts
    2,704

    Re: SIMD data structures to OpenGL VBOs

    You can't store x components separately from y components with vertex arrays, so your output needs to be interleaved (AoS). Intel might wish the world to be SoA, but by and large, it isn't.

    You can still use SSE. Just make sure to put enough padding (or other data) in to make sure all the important bits (vertex data, normal data, etc) are 16-byte aligned.
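
    As a rough illustration of the padding idea, the sketch below lays out an interleaved vertex so that position and normal each start on a 16-byte boundary. The `PaddedVertex` type, its field set, and the helper names are assumptions made for this example, and the compile-time checks rely on C11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex, padded for aligned SSE loads. */
typedef struct {
    float position[4];  /* x, y, z, plus one float of padding (or w) */
    float normal[4];    /* x, y, z, plus one float of padding */
    float texcoord[2];  /* u, v */
    float pad[2];       /* rounds the struct up to 48 bytes */
} PaddedVertex;

/* C11 compile-time checks on the layout. */
_Static_assert(offsetof(PaddedVertex, normal) % 16 == 0,
               "normal must start on a 16-byte boundary");
_Static_assert(sizeof(PaddedVertex) % 16 == 0,
               "stride must keep every vertex 16-byte aligned");

static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Debug-time check that an array is safe for aligned loads. */
static void check_vertex_array(const PaddedVertex *verts, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        assert(is_aligned16(verts[i].position));
        assert(is_aligned16(verts[i].normal));
    }
}
```

    The struct layout only guarantees alignment relative to the start of the array, so the array itself must also begin on a 16-byte boundary (for example via `_Alignas(16)` or an aligned allocator).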

    A vertex-by-matrix multiply with the appropriately swizzled data is really quite simple:

    1) move matrix into emm2, emm3, emm4 and emm5
    2) move vertex into emm0
    3) xor emm6,emm6

    4) move emm0 to emm1
    5) swizzle emm1 to broadcast the "x" component to all 4 components
    6) multiply emm1, emm2
    7) add emm6, emm1

    8) move emm0 to emm1
    9) swizzle emm1 to broadcast the "y" component to all 4 components
    10) multiply emm1, emm3
    11) add emm6, emm1

    12) move emm0 to emm1
    13) swizzle emm1 to broadcast the "x" component to all 4 components
    14) multiply emm1, emm2
    15) add emm6, emm1

    16) add emm6, emm5 (defaulting to "1" for w)
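
    The steps above can be sketched with SSE intrinsics roughly as follows. This is an illustration under stated assumptions, not a tuned implementation: it reads steps 12-14 as broadcasting the "z" component and multiplying by the third matrix register, assumes the matrix is stored as four 16-byte-aligned groups of four floats in OpenGL column-major order, and leaves register allocation and scheduling to the compiler. The function name `transform_points_sse` is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Transform `count` points (x, y, z, pad) by a 4x4 matrix, treating w as 1.
 * All three pointers must be 16-byte aligned. */
static void transform_points_sse(const float *matrix, const float *in,
                                 float *out, size_t count)
{
    assert(((uintptr_t)matrix & 15u) == 0);
    assert(((uintptr_t)in & 15u) == 0);
    assert(((uintptr_t)out & 15u) == 0);

    /* Step 1: keep the four matrix columns in registers. */
    const __m128 c0 = _mm_load_ps(matrix + 0);
    const __m128 c1 = _mm_load_ps(matrix + 4);
    const __m128 c2 = _mm_load_ps(matrix + 8);
    const __m128 c3 = _mm_load_ps(matrix + 12);

    for (size_t i = 0; i < count; ++i) {
        /* Step 2: load one vertex. */
        const __m128 v = _mm_load_ps(in + 4 * i);

        /* Broadcast x, y and z to all four lanes. */
        const __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
        const __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
        const __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));

        /* Multiply-accumulate against the first three columns. */
        __m128 acc = _mm_mul_ps(x, c0);
        acc = _mm_add_ps(acc, _mm_mul_ps(y, c1));
        acc = _mm_add_ps(acc, _mm_mul_ps(z, c2));

        /* Step 16: add the last column, i.e. w defaults to 1. */
        acc = _mm_add_ps(acc, c3);

        _mm_store_ps(out + 4 * i, acc);
    }
}
```

    Starting the accumulator from the first product replaces the explicit xor-to-zero in step 3; the result is the same.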

    Note that swizzles have a 3-instruction latency, while multiplies and adds each have a 2-instruction latency; the multiply unit can get through one emm multiply per 2 cycles, and the adder one emm add per 2 cycles. With the right scheduling, the inner loop (steps 4-16, plus a store) should run in something like 25 cycles. The hardest part is breaking the dependency on the swizzles; you can use an extra register to do it. All of this assumes the data is in L1 cache, which pre-fetching and streaming stores ought to ensure.

    The rules I described are Pentium III rules, but result in good performance for other architectures, too.
    "If you can't afford to do something right,
    you'd better make sure you can afford to do it wrong!"

  3. #3
    Junior Member Newbie
    Join Date
    Dec 2004
    Location
    USA
    Posts
    6

    Re: SIMD data structures to OpenGL VBOs

    Thank you for all the detailed information; swizzling it is...

  4. #4
    Senior Member OpenGL Guru
    Join Date
    Mar 2001
    Posts
    2,704

    Re: SIMD data structures to OpenGL VBOs

    Points for whomever spots the typo :-)

    (Hint: it's line 14)

  5. #5
    Senior Member OpenGL Pro
    Join Date
    Feb 2002
    Location
    Bonn, Germany
    Posts
    1,652

    Re: SIMD data structures to OpenGL VBOs

    Originally posted by jwatte:
    Points for whomever spots the typo :-)

    (Hint: it's line 14)
    It's full of typos: it should be xmm instead of emm. But you probably mean the 2 vs 4 copy-and-paste glitch.

  6. #6
    Senior Member OpenGL Guru Humus's Avatar
    Join Date
    Mar 2000
    Location
    Stockholm, Sweden
    Posts
    2,444

    Re: SIMD data structures to OpenGL VBOs

    Speaking of swizzles, one thing I really miss in SSE and SSE2 is horizontal adds. That's the most obvious way to accelerate very common stuff like dot products, but it only made it into SSE3 for some reason. The lack of horizontal adds makes coding SSE much more inconvenient for most tasks, and you're forced to swizzle way more than you'd like to. It should have been there from the start, like it was in 3DNow.
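
    For illustration, this is roughly what a 4-component dot product looks like with only SSE1 instructions: two shuffle-and-add pairs stand in for the missing horizontal add. The function name `dot4_sse` is hypothetical, and unaligned loads are used only to keep the sketch self-contained.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Dot product of two 4-float vectors without a horizontal add. */
static float dot4_sse(const float *a, const float *b)
{
    assert(a != NULL && b != NULL);

    /* Lane-wise products: (ax*bx, ay*by, az*bz, aw*bw). */
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* Swap neighbours within each pair, then add:
     * lanes become (x+y, x+y, z+w, z+w). */
    __m128 s = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));
    p = _mm_add_ps(p, s);

    /* Swap the two halves, then add: every lane holds x+y+z+w. */
    s = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 0, 3, 2));
    p = _mm_add_ps(p, s);

    /* Extract the lowest lane. */
    return _mm_cvtss_f32(p);
}
```

    Two of the five arithmetic instructions here are shuffles, which is the overhead being complained about.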

  7. #7
    Senior Member OpenGL Guru
    Join Date
    Mar 2001
    Posts
    2,704

    Re: SIMD data structures to OpenGL VBOs

    The reason horizontal adds weren't in SSE or SSE2 is that the Intel CPUs internally use 64-bit buses, forwards, and register files. This means that operating on an XMM register takes two cycles! (Well, insofar as "cycles" are defined in that architecture... it gets hairy at that level :-)

    Anyway, this means that they didn't have the necessary interconnect between the two halves of the XMM registers to do a correct dot product efficiently. (On this topic: I think the swizzles are very special -- and they take longer because of this)

    I wholeheartedly agree that a crosswise add was long overdue. Note that 3DNow was only 2 elements wide, so it didn't have the forwarding problem, but instead you quickly run out of available registers; it's not SIMD enough to be worth it IMO.

  8. #8
    Super Moderator OpenGL Guru
    Join Date
    Feb 2000
    Location
    Montreal, Canada
    Posts
    4,421

    Re: SIMD data structures to OpenGL VBOs

    >>> it's not SIMD enough to be worth it IMO.<<<<

    I have done benchmarking with vec3 dot products. FPU was faster than SSE. SSE wastes time with the shuffles.
    For vec4, it was just a tiny bit faster.

    Have any of you tried this?
    ------------------------------
    Sig: http://glhlib.sourceforge.net
    an open source GLU replacement library. Much more modern than GLU.
    float matrix[16], inverse_matrix[16];
    glhLoadIdentityf2(matrix);
    glhTranslatef2(matrix, 0.0, 0.0, 5.0);
    glhRotateAboutXf2(matrix, angleInRadians);
    glhScalef2(matrix, 1.0, 1.0, -1.0);
    glhQuickInvertMatrixf2(matrix, inverse_matrix);
    glUniformMatrix4fv(uniformLocation1, 1, FALSE, matrix);
    glUniformMatrix4fv(uniformLocation2, 1, FALSE, inverse_matrix);

  9. #9
    Senior Member OpenGL Guru
    Join Date
    Mar 2001
    Posts
    2,704

    Re: SIMD data structures to OpenGL VBOs

    When you multiply by more than one matrix (such as when CPU skinning), the shuffles amortize over more operations. The actual case I was timing (starting four years ago now!) was multi-matrix blending of both vertices and normals, doing streaming stores to AGP VAR memory. At the time, we got a nice improvement over all the other mechanisms. (We used a slightly different shuffle, and swizzled the matrix instead when generating it, btw, which saved one shuffle.)
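
    A small sketch of how the shuffles amortize: the three broadcasts are done once per vertex and then reused for every matrix that influences it. This is not the code described above (it uses plain broadcasts and ordinary stores); the two-bone limit, the column-major matrix layout, and the names `transform_broadcast` and `skin_point2_sse` are all assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* x*c0 + y*c1 + z*c2 + c3 for one column-major 4x4 matrix (w taken as 1). */
static __m128 transform_broadcast(__m128 x, __m128 y, __m128 z, const float *m)
{
    __m128 acc = _mm_mul_ps(x, _mm_load_ps(m + 0));
    acc = _mm_add_ps(acc, _mm_mul_ps(y, _mm_load_ps(m + 4)));
    acc = _mm_add_ps(acc, _mm_mul_ps(z, _mm_load_ps(m + 8)));
    return _mm_add_ps(acc, _mm_load_ps(m + 12));
}

/* Blend one point between two matrices with weights w0 and w1.
 * All pointers must be 16-byte aligned. */
static void skin_point2_sse(const float *m0, float w0,
                            const float *m1, float w1,
                            const float *in, float *out)
{
    assert(((uintptr_t)m0 & 15u) == 0 && ((uintptr_t)m1 & 15u) == 0);
    assert(((uintptr_t)in & 15u) == 0 && ((uintptr_t)out & 15u) == 0);

    const __m128 v = _mm_load_ps(in);

    /* Three shuffles per vertex, shared by both matrix multiplies. */
    const __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    const __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    const __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));

    const __m128 r0 = _mm_mul_ps(_mm_set1_ps(w0), transform_broadcast(x, y, z, m0));
    const __m128 r1 = _mm_mul_ps(_mm_set1_ps(w1), transform_broadcast(x, y, z, m1));

    _mm_store_ps(out, _mm_add_ps(r0, r1));
}
```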

    Also, on Pentium hardware, there's more of a difference between SSE and x87; the Athlon line is known for being quite good at x87 instruction execution. Which hardware were you using?

  10. #10
    Senior Member OpenGL Guru Humus's Avatar
    Join Date
    Mar 2000
    Location
    Stockholm, Sweden
    Posts
    2,444

    Re: SIMD data structures to OpenGL VBOs

    Originally posted by V-man:
    >>> it's not SIMD enough to be worth it IMO.<<<<

    I have done benchmarking with vec3 dot products. FPU was faster than SSE. SSE wastes time with the shuffles.
    For vec4, it was just a tiny bit faster.

    Have any of you tried this?
    Well, I don't get FPU beating SSE, but I often get 3DNow beating SSE, even when I'm working on vec4 stuff and don't need many shuffles, for instance in a vec4 lerp that you'd think SSE would be faster on.
    That's on an Athlon 64; I don't know if it's any different on an Athlon XP.
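
    For reference, a vec4 lerp of the kind mentioned here might look like the sketch below in SSE intrinsics. It is not the benchmarked code; the name `lerp4_sse` and the use of unaligned loads and stores are assumptions made to keep the example short.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* out = a + t * (b - a), computed on all four components at once. */
static void lerp4_sse(const float *a, const float *b, float t, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);

    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    const __m128 vt = _mm_set1_ps(t);  /* broadcast t to all lanes */

    const __m128 r = _mm_add_ps(va, _mm_mul_ps(vt, _mm_sub_ps(vb, va)));
    _mm_storeu_ps(out, r);
}
```

    Apart from broadcasting `t`, which can be hoisted out of a loop over many vectors, no shuffles are needed.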
