
Thread: my SSE code is slower than normal. why?

  1. #1
    Intern Newbie
    Join Date
    Jan 2004
    Location
    Budapest, Hungary
    Posts
    34

    my SSE code is slower than normal. why?

    Hello!

    I would like to ask somebody to tell me why my SSE code for vector and matrix operations is slower. I must be doing something wrong, but I don't know what. My vector addition, for example:
    (I'm using intrinsics, xmmintrin.h, with VC++ 6 SP5)


    data type:

    typedef union
    {
        __m128 data;
        float elements[4];
    } vector4d;

    with SSE (using intrinsics):

    inline vector4d add(vector4d a, vector4d b)
    {
        vector4d c;
        c.data = _mm_add_ps(a.data, b.data);
        return c;
    }

    without:

    inline vector4d add_sisd(vector4d a, vector4d b)
    {
        return set(a.elements[0]+b.elements[0], a.elements[1]+b.elements[1], a.elements[2]+b.elements[2], 0);
    }

    (set() simply returns a vector4d built from its parameters.)
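    A minimal sketch of such a set() (assuming it just fills the four elements; the original was not posted) would be:

    inline vector4d set(float x, float y, float z, float w)
    {
        // Hypothetical reconstruction: fill the four elements and return.
        vector4d v;
        v.elements[0] = x;
        v.elements[1] = y;
        v.elements[2] = z;
        v.elements[3] = w;
        return v;
    }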
    Knowledge is no more expensive than ignorance, but at least as satisfying...

  2. #2
    Junior Member Regular Contributor
    Join Date
    May 2003
    Location
    New York
    Posts
    194

    Re: my SSE code is slower than normal. why?

    Originally posted by orbano:
    Hello!

    I would like to ask somebody to tell me why my SSE code for vector and matrix operations is slower.
    a) How did you do your timing? Knowing how much slower it is would give a clue as to what might be wrong.

    b) The SSE functions are slow when using unaligned data. The __m128 data type should implicitly align your union, but I'd double-check that. You may need a __declspec(align(...)) token to help the compiler realize what you're doing.
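    For instance, a minimal sketch of forcing the alignment explicitly (assuming MSVC's __declspec syntax):

    // Sketch: make the 16-byte alignment explicit (MSVC syntax).
    // __m128 should already imply this, so the declspec is just a safety net.
    typedef __declspec(align(16)) union
    {
        __m128 data;
        float elements[4];
    } vector4d;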

    c) In your SSE function, you create a temporary vector (plus the input vectors aren't const references), which involves memory allocation, if only on the stack, and a lot of reading/writing using non-SSE instructions. Optimized math will generally be of the form { a = b; a += c; } rather than { a = b + c; }
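    A minimal sketch of that in-place style (not from the original post) could look like:

    // Sketch: in-place add avoids the temporary that { a = b + c; } needs.
    inline vector4d& operator+=(vector4d &a, const vector4d &b)
    {
        a.data = _mm_add_ps(a.data, b.data);
        return a;
    }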

    More importantly, to get the benefits of SSE, it's best to write the code so that the compiler can leave as much of your data in SSE registers for as long as possible to avoid any copies to main memory.
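    For example (a sketch, assuming a multiply-add sequence), composing intrinsics in one expression keeps the intermediate in an xmm register:

    // Sketch: the _mm_mul_ps result stays in a register and feeds
    // _mm_add_ps directly; nothing is spilled to memory in between.
    inline __m128 mul_add(__m128 a, __m128 b, __m128 c)
    {
        return _mm_add_ps(_mm_mul_ps(a, b), c);
    }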

    The compiler's optimization pass should be able to group multiple SSE instructions that wind up near each other this way (check the optimization level), but it's not guaranteed.

    The only guaranteed way, IMO, is to use inline SSE assembly, where you manage the registers yourself. That's more useful for complex math operations, but that's where you'll see the biggest benefit.

    Hope that helps

    Avi



    [This message has been edited by Cyranose (edited 01-15-2004).]

  3. #3
    Junior Member Newbie
    Join Date
    Dec 2003
    Location
    Germany
    Posts
    4

    Re: my SSE code is slower than normal. why?

    Greetings!

    The problem in your code is rather trivial (once one understands it)! You do a lot of really UNNECESSARY copy operations. In general, parallel addition is faster than sequential addition; well, that's not news. What I mean is, the problem is not within '_mm_add_ps'. It is the overhead you create every time you call your 'add' function.

    The reason your 'add_sisd' function is faster is that it does fewer copy operations. Since you have written the call to 'set(...)' directly in the return statement, the compiler can do a so-called 'return value optimization', which means that the result of 'set(...)' is created directly in the return value without any copying. Even further, no temporaries are created inside 'add_sisd' if 'set(...)' really does only an assignment (as you have said) and is also inlined (to avoid touching the stack). So everything in 'add_sisd' happens 'in place', because of the way you have laid out the code. Two things still get copied when you call 'add_sisd', namely the arguments of the function; but this is the same overhead you have in 'add'. However, this overhead is also UNNECESSARY and can be removed.

    As Cyranose said, memory alignment can be an issue, but not really in your piece of code; VC++ 6 does usually align to a good amount. If you have doubts about alignment, use '__declspec(align(16))' in your variable declaration, or check the assembler code to look for it. However, unnecessary copying of data does not gain you anything, whether the data is perfectly aligned or not. But don't get me wrong: alignment is very important.

    So what happens in your 'add' function? Every time you call this function you create the temp variable 'c'. Then you copy the result to 'c.data', and thereafter 'c' is copied to the place in memory where the return value goes. All of this can be removed.

    As a side note: you not only add in parallel with the intrinsic '_mm_add_ps', you also implicitly move the result in parallel, unlike in 'add_sisd'. If you do a memory transfer within "SSE bounds", the compiler will use the appropriate copy/move instructions; in this case it is 'movaps'.

    So the naked power in this example (for the parallel version) comes from the following two assembler instructions: 'addps' and 'movaps'. The deal is now to give 'addps' the raw source data and to let 'movaps' copy the result (located in a register) directly to the location in memory where your variable lives.

    If this did not speed up the code, the entire SSE unit would be useless!

    What does an improved version look like? Here is a
    trivial solution to the problem (for the parallel
    version first).

    inline void ADD_SIMD(const vector4d &a, const vector4d &b, vector4d &c)
    {
        c.data = _mm_add_ps(a.data, b.data);
    }

    Phew, rather simple! This function does what I explained before. What? You want proof? OK, let us look at the assembler instructions:

    {
    movaps -72(%ebp), %xmm1     # load one source vector into xmm1
    addps  -88(%ebp), %xmm1     # add the other source vector in parallel
    movaps %xmm1, -104(%ebp)    # store the result to c
    }

    Wow!
    This is all of the function body of 'ADD_SIMD'!

    Using 'ADD_SIMD' is simple:

    vector4d a,b,c;
    ADD_SIMD(a, b, c);

    'c' holds the result.

    Of course, this is somewhat different from

    c = add(a, b)

    but it is also possible to code a version that way if you want. The small "problem" in doing so comes from having different types involved, since the return value should be 'vector4d' whereas '_mm_add_ps' returns '__m128'.
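    A sketch of such a by-value version (with const references for the inputs; only the return copy remains, which the compiler may optimize away):

    inline vector4d add(const vector4d &a, const vector4d &b)
    {
        vector4d c;
        c.data = _mm_add_ps(a.data, b.data);  // store the __m128 into the union member
        return c;
    }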

    Wanna see your 'add' function in assembler?
    No? Sorry...

    {
    movl -56(%ebp), %esi
    movl -52(%ebp), %ecx
    movl -48(%ebp), %edx
    movl %esi, -120(%ebp)
    movl %ecx, -116(%ebp)
    movl %edx, -112(%ebp)
    movl -44(%ebp), %esi
    movl -72(%ebp), %ecx
    movl -68(%ebp), %edx
    movl %esi, -108(%ebp)
    movl %ecx, -136(%ebp)
    movl %edx, -132(%ebp)
    movl -64(%ebp), %esi
    movl -60(%ebp), %ecx
    movaps -120(%ebp), %xmm1
    movl %esi, -128(%ebp)
    movl %ecx, -124(%ebp)
    addps -136(%ebp), %xmm1
    movaps %xmm1, -104(%ebp)
    movl -104(%ebp), %edx
    movl -100(%ebp), %esi
    movl -96(%ebp), %ecx
    movl %edx, -88(%ebp)
    movl %esi, -84(%ebp)
    movl %ecx, -80(%ebp)
    movl -92(%ebp), %edx
    movl %edx, -76(%ebp)
    }

    Phew! Can you see all the moves? Now look back at the assembly code from 'ADD_SIMD'.

    Well, the 'movl's above the first 'movaps' are the penalty for copying the parameters of the function. The 'movl's below the second 'movaps' are the penalty for copying the temp vector 'c' to the memory location where the result should be stored.

    Let's look at your 'add_sisd' function. It, too, can trivially be improved:

    inline void ADD_SISD(const vector4d &a, const vector4d &b, vector4d &c)
    {
        c.elements[0] = a.elements[0] + b.elements[0];
        c.elements[1] = a.elements[1] + b.elements[1];
        c.elements[2] = a.elements[2] + b.elements[2];
        c.elements[3] = a.elements[3] + b.elements[3];
    }

    Of course, listing the elements is rather trivial, but also note that we have changed the arguments (note the '&'). We now call by reference and not by value, just as we did in 'ADD_SIMD'.

    Assembler code? For sure!

    {
    flds -88(%ebp)
    flds -84(%ebp)
    flds -80(%ebp)
    flds -76(%ebp)
    fxch %st(3)
    fadds -72(%ebp)
    fxch %st(2)
    fadds -68(%ebp)
    fxch %st(1)
    fadds -64(%ebp)
    fxch %st(3)
    fadds -60(%ebp)
    fxch %st(2)
    fstps -104(%ebp)
    fstps -100(%ebp)
    fxch %st(1)
    fstps -96(%ebp)
    fstps -92(%ebp)
    }

    Well, sweet. Here you can see how the code is processed in a sequential manner: sequential loads, sequential adds, and sequential stores.

    What about the speed?

    I have clocked both functions, ADD_SIMD and
    ADD_SISD, down to cycles.
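    (The measuring harness was not posted; a rough sketch of cycle counting with the Pentium's rdtsc instruction, in gcc inline assembly, would be:)

    // Sketch: read the CPU's time-stamp counter (cycle count).
    inline unsigned long long read_tsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    // Usage: take the difference of two readings around the call under test.
    // unsigned long long t0 = read_tsc();
    // ADD_SIMD(a, b, c);
    // unsigned long long t1 = read_tsc();  // t1 - t0 ~ cycles (plus rdtsc overhead)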

    Result:
    'ADD_SIMD' takes ~ 16 cycles.
    'ADD_SISD' takes ~ 28 cycles.

    (Cycles can vary a little due to the OS, but these are the smallest I was able to measure.)

    Speed-up of 'ADD_SIMD': ~42% fewer cycles (16 vs. 28)

    Now let's do it with 2^20 = 1048576 calls in a row!

    2^20 'ADD_SIMD' ~ 2097316 cycles
    2^20 'ADD_SISD' ~ 13720816 cycles

    Hence, in this case, 'ADD_SIMD' is at least 6 TIMES FASTER than 'ADD_SISD'! That's a lot!

    For fun, let's measure the cycles of your 'add' function.

    'add' takes ~ 164 cycles

    2^20 'add' ~ 105118568 cycles

    The 'add' function is ABOUT 50 TIMES SLOWER than 'ADD_SIMD' when you make 2^20 calls in a row, and ABOUT 10 TIMES SLOWER within a single call! Both use the SIMD '_mm_add_ps' intrinsic.

    Well, using 'ADD_SIMD' will give you at least a 10-fold speed-up!

    Make your choice...

    I think this should help.


    Notes:
    Stay away from memory copying! Memory transfer is a slow operation (like searching) because of the large gap between CPU performance and memory bandwidth. Shuffling memory around efficiently will be one of the main difficulties in the years to come; Intel has already published some material on that issue. It is possible to use SSE (esp. SSE2) to make memory transfers fast. Small temporaries can also improve your code: small temps in connection with pointers can remove what is known as 'pointer aliasing', which helps the compiler with optimization.
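    A small sketch of that aliasing point (a hypothetical example, not from the thread): copying a pointed-to value into a local temp tells the compiler the value cannot change through another pointer inside the loop:

    // Sketch: without the temp, *s might alias dst[i], forcing a reload
    // of *s on every iteration; the local copy removes that dependency.
    void scale(float *dst, const float *src, const float *s, int n)
    {
        const float factor = *s;
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * factor;
    }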


    Some comments on the system I used:
    Hardware: PIV 2.4GHz 1GB PC2100 DDR-SDRAM
    OS : Linux (kernel 2.4.*)
    Compiler: gcc 3.3 [-march=pentium4 -msse2 -s -O3 -fomit-frame-pointer]

    cu,
    m i s s i l e

    [This message has been edited by m i s s i l e (edited 01-25-2004).]

  4. #4
    Intern Newbie
    Join Date
    Jan 2004
    Location
    Budapest, Hungary
    Posts
    34

    Re: my SSE code is slower than normal. why?

    THANK YOU! This is the best answer I have ever gotten! If I could, I would thank you in verse...
    Knowledge is no more expensive than ignorance, but at least as satisfying...

  5. #5
    Intern Newbie
    Join Date
    Jan 2004
    Location
    Budapest, Hungary
    Posts
    34

    Re: my SSE code is slower than normal. why?

    The only thing I don't yet understand is the vector dot product. It is also slower than the SISD counterpart. The code is the following:

    inline float Dot_SIMD(const TVector4d &a, const TVector4d &b)
    {
        TVector4d c;
        c.data = _mm_add_ps(a.data, b.data);
        return (c.v[0]+c.v[1]+c.v[2]+c.v[3]);
    }

    Do you have any ideas?
    Knowledge is no more expensive than ignorance, but at least as satisfying...

  6. #6
    Junior Member Newbie
    Join Date
    Dec 2003
    Location
    Germany
    Posts
    4

    Re: my SSE code is slower than normal. why?

    Originally posted by orbano:
    THANK YOU! This is the best answer I have ever gotten! If I could, I would thank you in verse...
    Well, for a 10-fold speed-up people
    would kill.

    m i s s i l e

  7. #7
    Junior Member Newbie
    Join Date
    Dec 2003
    Location
    Germany
    Posts
    4

    Re: my SSE code is slower than normal. why?

    Originally posted by orbano:
    The only thing I don't yet understand is the vector dot product. It is also slower than the SISD counterpart. The code is the following:

    inline float Dot_SIMD(const TVector4d &a, const TVector4d &b)
    {
        TVector4d c;
        c.data = _mm_add_ps(a.data, b.data);
        return (c.v[0]+c.v[1]+c.v[2]+c.v[3]);
    }

    Do you have any ideas?
    First of all, you missed the definition of the scalar product (dot product). But I guess it is a typo in your post, since we multiply the components of the two vectors and then add the results together to get the dot product. Hence, use _mm_mul_ps instead of _mm_add_ps:

    inline float Dot_SIMD(const TVector4d &a, const TVector4d &b)
    {
        TVector4d c;
        c.data = _mm_mul_ps(a.data, b.data);
        return (c.v[0]+c.v[1]+c.v[2]+c.v[3]);
    }

    Speeding up?

    Well, let's analyse a bit. The _mm_mul_ps intrinsic is OK. But what about this sequential addition? Of course, this is where a lot of people get stuck, since there is no (what is known as) "horizontal" addition for __m128 data types within the SSE2 unit. This is one of the drawbacks of SSE2, whereas 3DNow! has such an instruction, called 'pfacc'. Well, the problem will be solved with the upcoming Intel chip, called Prescott, which will provide horizontal instructions.
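    (For the curious: with those horizontal instructions, i.e. SSE3's _mm_hadd_ps from pmmintrin.h, the dot product would reduce to something like the following sketch; this of course assumes a Prescott-class chip.)

    #include <pmmintrin.h>  // SSE3 intrinsics (Prescott and later)

    // Sketch: multiply, then two horizontal adds collapse all four lanes.
    inline void DOT_SSE3(const vector4d &a, const vector4d &b, float &dot)
    {
        __m128 r = _mm_mul_ps(a.data, b.data);  // (a0*b0, a1*b1, a2*b2, a3*b3)
        r = _mm_hadd_ps(r, r);                  // (r0+r1, r2+r3, r0+r1, r2+r3)
        r = _mm_hadd_ps(r, r);                  // every lane holds the full sum
        _mm_store_ss(&dot, r);
    }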

    However, you can emulate such horizontal operations in SSE/SSE2: multiply two __m128 values together, and then add the low and high parts using some (depending on your experience, annoying) shuffling techniques.

    Want a faster version? Let's look at this:

    inline void DOT_SIMD(const vector4d &a, const vector4d &b, float &dot)
    {
        __m128 r;

        r = _mm_mul_ps(a.data, b.data);              // (a0*b0, a1*b1, a2*b2, a3*b3)
        r = _mm_add_ps(_mm_movehl_ps(r, r), r);      // low two lanes: (r0+r2, r1+r3)
        r = _mm_add_ss(_mm_shuffle_ps(r, r, 1), r);  // lane 0: (r1+r3) + (r0+r2)

        _mm_store_ss(&dot, r);                       // write the scalar sum to 'dot'
    }
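    Usage mirrors ADD_SIMD:

    vector4d a, b;  // assume these hold your input vectors
    float dot;
    DOT_SIMD(a, b, dot);  // 'dot' now holds the scalar product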

    Read the manuals about '_mm_shuffle_ps' and '_mm_movehl_ps'! As always, read as much as you can and play around with it. This is the only way you can improve your skills, especially on such a topic. Using intrinsics is very close to the metal; at this level you must already understand some assembler coding, otherwise you will blow your head off. Intrinsics are a thin layer above the raw assembly instructions, most useful to those who already have experience with SSE2 at the native level. Used this way, intrinsics are easier and also more portable (to newer Intel microprocessors). So intrinsics should NOT be understood as an easy way in for programmers who are not into assembler and the like.

    If you *really* care about speed, then you have to go down, learn some concepts at the assembly level, and come back up again. It will definitely pay off! Even further, if new assembly instructions appear, you can trivially use them to your advantage where most compilers get stuck! Well, if you don't want to spend the time, then it is better to use a good compiler for such a topic. In this case, the Intel icc C++ compiler is one such compiler (if not the best). This compiler has an automatic vectorizer implemented; it would automatically transform the sequence (c.v[0]+c.v[1]+c.v[2]+c.v[3]) into a vectorized form. However, a very nice combination is having knowledge of the lower level and a powerful compiler at your side. That way, you can give the compiler the right hints and it can optimize the hell out of your code! For example, you can give the gcc compiler a hint to generate instructions for prefetching (in advance!) a block of data into the cache. Placed at the right position, your code must step on the brakes to avoid crossing the light-speed barrier; placed at the wrong position, you will trash your cache and lose speed. It's for you to decide.
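    A sketch of such a prefetch hint (using gcc's __builtin_prefetch; the distance of 16 vectors ahead is just a guess that would need tuning):

    // Sketch: prefetch a block ahead of use while adding vector arrays.
    void add_arrays(vector4d *dst, const vector4d *a, const vector4d *b, int n)
    {
        for (int i = 0; i < n; ++i)
        {
            __builtin_prefetch(&a[i + 16]);  // hint only; a bad distance trashes the cache
            __builtin_prefetch(&b[i + 16]);
            dst[i].data = _mm_add_ps(a[i].data, b[i].data);
        }
    }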

    OK, now back to the speed test. I will measure 'Dot_SIMD' (your function), 'DOT_SISD', and 'DOT_SIMD'.

    For completeness, here is the code for 'DOT_SISD':

    inline float DOT_SISD(const vector4d &a, const vector4d &b)
    {
        return a.elements[0] * b.elements[0] + a.elements[1] * b.elements[1] +
               a.elements[2] * b.elements[2] + a.elements[3] * b.elements[3];
    }

    Results:

    1 'DOT_SISD' ~ 84 cycles
    1 'Dot_SIMD' ~ 88 cycles
    1 'DOT_SIMD' ~ 52 cycles


    Well, the speed-up of 'DOT_SIMD' over 'Dot_SIMD' is about 70% (88 / 52 ~= 1.7).

    These timings are from my machine using gcc 3.3 with
    options -march=pentium4 -msse2 -s -O3 -fomit-frame-pointer.

    The performance of DOT_SIMD also depends on how your vectors flow around in the system. It is very easy to outperform 'DOT_SIMD' with 'DOT_SISD', for example if your data is not aligned. So take care, and also watch your compiler settings.


    cu,
    m i s s i l e



    [This message has been edited by m i s s i l e (edited 01-25-2004).]
