Fast float -> 1010102 converter

Hi everyone

I work on some graphics software that uses the 1010102 vertex format a lot, and every frame I need to perform many float -> 1010102 conversions.
Profiling shows my current converter is a performance bottleneck.
I'm not a bit-shifting wizard at all, but I know there are a lot of you out there who are. : )

So, my question: does anyone have a faster float -> 1010102 conversion routine?

Please note: I'm on an Intel chipset.

Thanks very much!

Current routine:


#include <cstdint>
#include <cmath>

//  converts XYZ floats in the range -1.0 -> 1.0 to 1010102 format
inline uint32_t signedUnitFloat1010102( float x, float y, float z, uint32_t w = 0 )
{
    // scale the magnitude to 9 bits and round to nearest
    int32_t ix = fabs( x ) * 511.0f + 0.5f;
    int32_t iy = fabs( y ) * 511.0f + 0.5f;
    int32_t iz = fabs( z ) * 511.0f + 0.5f;
    // clamp out-of-range input
    ix = ( ix > 511 ) ? 511 : ix;
    iy = ( iy > 511 ) ? 511 : iy;
    iz = ( iz > 511 ) ? 511 : iz;
    // restore the sign
    ix = ( x < 0.0f ) ? -ix : ix;
    iy = ( y < 0.0f ) ? -iy : iy;
    iz = ( z < 0.0f ) ? -iz : iz;
    // pack into 10:10:10:2
    return ( ix & 0x3FF ) | ( ( iy & 0x3FF ) << 10 ) | ( ( iz & 0x3FF ) << 20 ) | ( ( w & 0x3 ) << 30 );
}

If your vector is always normalized, then there should be no reason to clamp it, right?
Just cast those values straight to int. I see no reason to use fabs and 0.5f.

Why don’t you just stick it in a loop and throw a couple thousand random normals at it?

I hold the same view as sevenfold:

[QUOTE=sevenfold;1281753]If your vector is always normalized, then there should be no reason to clamp it, right?
Just cast those values straight to int. I see no reason to use fabs and 0.5f.
[/QUOTE]

The data comes from user plugins, so although it should be between -1.0 and 1.0, often it is not, which is why I need to clamp.
Adding 0.5 is necessary to round to the nearest integer; otherwise the data will be off by 0.5/512 (on average).
I could forgo the 0.5 rounding, as it could be argued that nobody will notice a 0.5/512 error, and that if they would, 1010102 shouldn't be used in the first place.
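(For example, 0.9 * 511 = 459.9: plain truncation gives 459, off by 0.9 LSB, while adding 0.5 first gives 460.4, which truncates to the correctly rounded 460, off by only 0.1 LSB.)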

But the clamp does need to happen.

So without adding 0.5 I end up with this…

#include <cstdint>

//  converts XYZ floats in the range -1.0 -> 1.0 to 1010102 format
//  (truncates instead of rounding; no w bits)
inline uint32_t signedUnitFloat1010102( float x, float y, float z )
{
    // scale and truncate toward zero
    int32_t ix = x * 511.0f;
    int32_t iy = y * 511.0f;
    int32_t iz = z * 511.0f;
    // clamp out-of-range input
    ix = ( ix > 511 ) ? 511 : ( ( ix < -511 ) ? -511 : ix );
    iy = ( iy > 511 ) ? 511 : ( ( iy < -511 ) ? -511 : iy );
    iz = ( iz > 511 ) ? 511 : ( ( iz < -511 ) ? -511 : iz );
    // pack into 10:10:10
    return ( ix & 0x3FF ) | ( ( iy & 0x3FF ) << 10 ) | ( ( iz & 0x3FF ) << 20 );
}

Is casting from float to int still slow on Intel machines?
I've been trying all the old tricks and they aren't speeding it up, so I guess that's been fixed by now? : )
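
For reference, this is the sort of trick I mean -- the SSE scalar convert instead of a plain cast (just a sketch; note it rounds to nearest under the default rounding mode rather than truncating):

#include <cstdint>
#include <xmmintrin.h>  // SSE

// float -> int via the SSE converter; rounds to nearest
// instead of truncating like a plain C cast
inline int32_t floatToIntSSE( float f )
{
    return _mm_cvtss_si32( _mm_set_ss( f ) );
}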

thanks!

Do you call this function many times with the same data?
Then fix your data. Make sure all of it is normalized, one time only.

There is no need to clamp normalized data.
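
Something like this, run once over the data when it arrives (just a sketch):

#include <cmath>

// normalize a 3-vector in place; leaves zero-length input untouched
inline void normalize3( float& x, float& y, float& z )
{
    const float len = std::sqrt( x * x + y * y + z * z );
    if ( len > 0.0f )
    {
        const float inv = 1.0f / len;
        x *= inv;
        y *= inv;
        z *= inv;
    }
}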

[QUOTE=sevenfold;1281785]Do you call this function many times with the same data?
Then fix your data. Make sure all of it is normalized, one time only.
There is no need to clamp normalized data.[/QUOTE]

No, it is not called many times with the same data, only once per data set.
(e.g., a user plugin re-skins a character per frame on the CPU, and the renderer needs to re-upload it to the GPU each frame)

I've decided to go with SIMD.
On Linux with GCC or the Intel compiler, if four 32-bit floats sit on a 16-byte boundary they can be fed to SIMD directly (no need for explicit unaligned loads, etc.).
So I simply walk the head of the array, processing elements scalar-wise until I hit a 16-byte boundary, then process 16 bytes at a time via SIMD, and finally handle the leftover bytes at the end scalar-wise again.
I'm getting a good speedup from this, and I'm able to clamp and round as in the original function.
A rough sketch of the per-vertex conversion is below.
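
Roughly the shape of the SIMD path (a minimal sketch, assuming SSE2 and each vertex stored as a 16-byte-aligned float4; the scalar head/tail loops are omitted):

#include <cstdint>
#include <emmintrin.h>  // SSE2

inline uint32_t signedUnitFloat1010102SSE2( const float* v )  // v = 16-byte-aligned x,y,z,w
{
    const __m128 scale = _mm_set1_ps( 511.0f );
    // load, scale to 9 bits of magnitude, clamp to [-511, 511]
    __m128 f = _mm_mul_ps( _mm_load_ps( v ), scale );
    f = _mm_min_ps( _mm_max_ps( f, _mm_set1_ps( -511.0f ) ), scale );
    // _mm_cvtps_epi32 rounds to nearest under the default rounding
    // mode, so no explicit +0.5 is needed
    alignas( 16 ) int32_t r[4];
    _mm_store_si128( reinterpret_cast< __m128i* >( r ), _mm_cvtps_epi32( f ) );
    // pack into 10:10:10 as before
    return ( r[0] & 0x3FF ) | ( ( r[1] & 0x3FF ) << 10 ) | ( ( r[2] & 0x3FF ) << 20 );
}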

Cheers!
Ren