The Industry's Foundation for High Performance Graphics

1. HW implemetation of trigonometry

Can anyone give a reference to some official document about trigonometric function implementation in modern GPUs?

There are some unofficial claims on the net (forums at first place), but I need some firm document to cite.

I have already carried out experiments and the implementations differ from vendor to vendor.
Currently, Intel has the most precise trigonometry, then NVIDIA, and AMD is definitely the worst.
Probably the density of lookup tables differ (if the implementations use lookup tables at all).

For my purposes I had to reimplement all trigonometric functions in GLSL. That solution is certainly slower (since both sine and cosine are executed in a single cycle of the clock), but the overall performances are almost the same (since shaders are not trivial) and the precision is raised incredibly.

Thank you for the help!

2. Hi,

I have make a test about linear interpolation + renormalization of cos/sin with a table of 360 steps and this seem to always give an approximed value very near to the true

Code :
```#include <stdio.h>
#include <stdlib.h>
#include <math.h>

float Cosinus_table[361];
float Sinus_table[361];

#define M_PI 3.14159265358979323846

#define STEPS 1000

void InitTables()
{
int i;

for( i = 0 ; i < 361 ; i++ )
{
radians = ( (float)(i) * M_PI ) / 180.0f;
printf("SinCos[%d]=(%f, %f) \n",
i, Sinus_table[i], Cosinus_table[i] );
}
}

void Renormalize( float *x, float *y)
{
float nx, ny, dist;

nx = *x;
ny = *y;

dist = sqrt(nx*nx + ny*ny);

nx /= dist;
ny /= dist;

*x = nx;
*y = ny;
}

{
float degrees, interp, c0, c1;

degrees = fmod(radians * (180.0f / M_PI), 360.0f);

c0 = Cosinus_table[(int)degrees];
c1 = Cosinus_table[(int)(degrees+1)];

interp = degrees - floor(degrees);

return c1 * interp + c0 * (1.0f - interp);
}

{
float degrees, interp, s0, s1;

degrees = fmod(radians * (180.0f / M_PI), 360.0f);

s0 = Sinus_table[(int)degrees];
s1 = Sinus_table[(int)(degrees+1)];

interp = degrees - floor(degrees);

return s1 * interp + s0 * (1.0f - interp);
}

int main( int argc , char **argv)
{
int i;
float radians, sreal, creal, sfast, cfast;
float sdiff, cdiff,  mindiff, maxdiff;

maxdiff = 0.0f;
mindiff = 1.0f;

InitTables();

for( i = 0 ; i < 360 * STEPS ; i++)
{
radians = ((float)(i) * M_PI) / (180.0f * STEPS);

Renormalize(&cfast, &sfast);

sdiff = sreal - sfast;
cdiff = creal - cfast;

//printf("step %d (%f degrees, %f radians) : (sin,cos)=(%f,%f) (fsin,fcos)=(%f,%f) diff=(%f,%f) \n",
//    i, (float)(i)/(float)STEPS, radians, sreal, creal, sfast, cfast, sdiff, cdiff);

if ( cdiff > maxdiff) maxdiff = cdiff;
if ( cdiff < mindiff) mindiff = cdiff;
if ( sdiff > maxdiff) maxdiff = sdiff;
if ( sdiff < mindiff) mindiff = sdiff;
}

printf("mindiff=%f maxdiff=%f \n", mindiff, maxdiff);
}```

With 360 000 steps (something like an angular step of 0,00001746 radians), this give only mindiff=-0.000001 maxdiff=0.000001
=> the interpolation + renormalization seem to make a relatively precise sin/cos reconstruction for intermediates values
(without the Renormalize() step, this give mindiff=-0.000038 maxdiff=0.000038 => in this case, the approximation is 38x more bad )
[but the computation is certainly relatively more speed on the other side ]

Note that I have only test angle between 0 and 360 degrees but I think that this work too with biggers or négatives values because the use of fmod() function
(internaly, this use degrees for to be "a little more table friendly" but externaly we have always to use radians values for angles)

PS : I don't know is my fastSinus() and fastCosinus() are really as fast as this because I have only make them for to test the idea

PS2 : I have too find this http://lab.polygonal.de/?p=205 and this http://devmaster.net/forums/topic/46...te-sinecosine/ but, as for my example, it's only a software approximation of sinus/cosinus, not a faster GPU hardware implementation scheme of sin/cos as you seem to want => you want to find infos about how is implemented FSINCOS for example, it's good this ?

3. I have found at http://blog.makezine.com/2008/12/04/...se-square-roo/ how to compute rapidly the Inverse Square Root

Code :
```float InvSqrt(float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f3759df - (i>>1); // gives initial guess y0
x = *(float*)&i; // convert bits back to float
x = x*(1.5f-xhalf*x*x); // Newton step, repeating increases accuracy

return x;
}```

This can to be used into my Renormalize() func for to accellerate it for exemple
(this can eliminate the sqrt() func ands replace the two divisions by a multiplication of the InvSqrt() return value)

On the other side, I think about using the GPU texture hardware for to handle the interpolation part and using a paletted texture
for to remplace the Sinus_table[] and Cosinus_table[] association array, limited to 256 entry instead 360 because paletted texture can only handle 256 colors/entries
(but this can perhaps really to be less fast than to simply use actuals RCP, RSQ or others EXP shader's intructions ?)

In all cases, using Taylor/MacLaurin, Lauren/Newton or likes series/steps expansions seem to be generally a good solution for a lot of fast trigonometrics approximations

But it seem than hardware implementations use differents algorithms and I have found too links that speak about CORDIC and BKM here http://en.wikipedia.org/wiki/Cordic and here http://en.wikipedia.org/wiki/BKM_algorithm
=> they seem specially adapted for FPGAs because they don't use multipliers

Here is a PDF link that explain how to use CORDIC into FPGAs http://www.andraka.com/files/crdcsrvy.pdf
(but what is explained into this PDF seem to be relatively complex to understand => after one or two days of reading, I could perhaps begin to understand what it is explained in it )
[this link http://web.archive.org/web/200008231...v/sidebar.html seem to be very more simple to understand => I think to use it for to begin to understand what is really the CORDIC algorithm ]

4. I have found one CORDIC software implementation at this link http://www.dcs.gla.ac.uk/~jhw/cordic/

I have used the 32 bits version of the header (that I have named corbic32.h)
Code :
```//Cordic in 32 bit signed fixed point math
//Function is valid for arguments in range -pi/2 -- pi/2
//for values pi/2--pi: value = half_pi-(theta-half_pi) and similarly for values -pi---pi/2
//
// 1.0 = 1073741824
// 1/k = 0.6072529350088812561694
// pi = 3.1415926536897932384626
//Constants
#define cordic_1K 0x26DD3B6A
#define half_pi 0x6487ED51
#define MUL 1073741824.000000
#define CORDIC_NTAB 32
int cordic_ctab [] = {0x3243F6A8, 0x1DAC6705, 0x0FADBAFC, 0x07F56EA6, 0x03FEAB76, 0x01FFD55B,
0x00FFFAAA, 0x007FFF55, 0x003FFFEA, 0x001FFFFD, 0x000FFFFF, 0x0007FFFF, 0x0003FFFF,
0x0001FFFF, 0x0000FFFF, 0x00007FFF, 0x00003FFF, 0x00001FFF, 0x00000FFF, 0x000007FF,
0x000003FF, 0x000001FF, 0x000000FF, 0x0000007F, 0x0000003F, 0x0000001F, 0x0000000F,
0x00000008, 0x00000004, 0x00000002, 0x00000001, 0x00000000, };

void cordic(int theta, int *s, int *c, int n)
{
int k, d, tx, ty, tz;
int x=cordic_1K,y=0,z=theta;
n = (n>CORDIC_NTAB) ? CORDIC_NTAB : n;
for (k=0; k<n; ++k)
{
d = z>>31;
//get sign. for other architectures, you might want to use the more portable version
//d = z>=0 ? 0 : -1;
tx = x - (((y>>k) ^ d) - d);
ty = y + (((x>>k) ^ d) - d);
tz = z - ((cordic_ctab[k] ^ d) - d);
x = tx; y = ty; z = tz;
}
*c = x; *s = y;
}```
(warning : the values theta, c and s values are used in a fixed point MUL format)

And I have only make some minors modifications to the .c part for to see how the approximation is big
Code :
```#include "cordic32.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h> // for testing only!

#define STEPS 360
#define ITERS 32

//Print out sin(x) vs fp CORDIC sin(x)
int main(int argc, char **argv)
{
int s, c;
float sinus,cosinus;
int i;

for(i=0;i<STEPS;i++)
{
deg = ((float)(i)/(float)(STEPS)) * 90.0f;
rad = ((float)(i)/(float)(STEPS)) * M_PI / 2.0f;

sinus = s / MUL;
cosinus = c / MUL;

//these values should be nearly equal
printf("Step %i (%f degrees, %f radians) : cordic=(%f,%f) vs (sin,cos)=(%f,%f) => diff=(%f,%f)) \n",
}
}```

This give only cos/sin values between 0 to PI/2 radians (cf. 0 to 90 degrees in my french country ), so we have to use quadrants for to handle all the 0 to 2 PI range of the circle

And the approximation seem near perfect because this give always the sames values, cf. diff is always (0,0)

I think to benchmark it the next week because I have a big doubt about the fact that this cordic function can have a chance to be to really more speed than the sin() or cos() implementation on my Linux box ...
(I can't make/test a true hardware implementation of it because I'm not an hardware engineer so don't know how, and don't have the material too, for to use/test/devellop something on a FPGA)

(this is in french language)

But me too, I don't find any documentation about "trigonometric function implementation in modern GPUs" on the Net

5. Thank you, Little Body, for so many examples!

I don't need an implementation. I already have very precise one.
It is a combination of Taylor series, built-in functions and trigonometric identities.
For my application it is not important to be extremely fast, but extremely precise.

What I'm looking for is HW implementation that I can use for a citation.

CORDIC is a very interesting approach, but I really doubt it is used in modern GPU since it is an iterative method. That way they cannot achieve one cycle execution, or I missed something.

Thus far, I have discovered that NV implementation has 1.07288363581976e-5 step in sine calculation. It would require 8M-entry table for such precision, which I really doubt can be afforded. On the other hand, it is a step-function.

Did you try to implement your method in GLSL? What precision do you get?
Don't compare with float versions of sin/cos from the math lib. They are far less accurate than its double counterparts.
What do you get for sin(0.5e-5), or smaller arguments? 0.5e-5 degrees (not radians)! Or 1e-6? HW versions retrieve always 0, which is a 100% wrong value. That's what I'm talking about.

6. I have only begin to study about how is computed trigonometrics functions after your post, this is only my firsts tests

I have effectively read that they are problems with very little values
=> perhaps that using something like a pseudo-logarithmic scale (or hyperbolic ?) for to index the association array, where two consecutives entries was used for to make the interpolation (linear for the instant, but this can extended with an interpolation sheme that use more than two entries for to be more precise) + renormalisation, with [a lot ?] more values at the beginning that at the end can resolve a part of this problem ?
(but I have doubt about the fact that this can be enough precise for to work with doubles )
[are double version of sin/cos/atan and other trigonometric functions, on Linux and C/C++, enough precises for to serve as a reference ?]

Or replace the renormalisation step by a refinement step ???
(or add a refinenement step instead to only replace the renormalization step ?)

I think to implement/test this in GLSL after to have find something that is enough precise (and speed if possible) with a standard C code
(and I have test nothing with doubles, only with float ... but this is something that I have planned to make the next week)

Me too, I think that only a very precise implementation can really to be interesting to improve in GLSL, due to the fact that they seem already very fast on somes GLSL implementations (make better than 1 cycle seem me very hard ) **BUT** that they don't seem to be very precise
=> improve the precision seem to me the principal thing that can now to be added to them in GLSL

I think too test/improve the code from this link http://lab.polygonal.de/?p=205

7. I have make a little benchmark between standard C/C++ float and double versions of sin()/cos() and an adapted version of the code from http://lab.polygonal.de/?p=205

Code :
```#include <chrono>
#include <iostream>
#include <unistd.h>
#include <time.h>
#include <math.h>

#define PRECISION double

// Chronometer functions
using Clock = std::chrono::high_resolution_clock;
using std::chrono::milliseconds;
using std::chrono::nanoseconds;
using std::chrono::duration_cast;

auto start = Clock::now();
auto end = Clock::now();
auto delai = duration_cast<milliseconds>(end-start).count();

inline void BeginTimer()
{
start = Clock::now();
}

inline void EndTimer()
{
end = Clock::now();
}

void PrintTimer(char *title)
{
// std::cout << duration_cast<nanoseconds>(end-start).count() << " ns\n";
std::cout << title << " : "<< duration_cast<milliseconds>(end-start).count() << " ms\n";
}

// Fast approximation functions
inline PRECISION fast_sin(PRECISION x)
{
PRECISION sin;

//always wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

//compute sine
if (x < 0)
sin = 1.27323954 * x + .405284735 * x * x;
else
sin = 1.27323954 * x - 0.405284735 * x * x;

return sin;
}

inline PRECISION fast_cos( PRECISION x)
{
PRECISION cos;

//always wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

// cos(x) = sin(x + PI/2)
x += 1.57079632;

if (x >  3.14159265)
x -= 6.28318531;

if (x < 0)
cos = 1.27323954 * x + 0.405284735 * x * x;
else
cos = 1.27323954 * x - 0.405284735 * x * x;

return cos;
}

// Fast/ and precises approximation functions
inline PRECISION fast_precise_sin(PRECISION x)
{
PRECISION sin;

//always wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

//compute sine
if (x < 0)
{
sin = 1.27323954 * x + .405284735 * x * x;

if (sin < 0)
sin = .225 * (sin *-sin - sin) + sin;
else
sin = .225 * (sin * sin - sin) + sin;
}
else
{
sin = 1.27323954 * x - 0.405284735 * x * x;

if (sin < 0)
sin = .225 * (sin *-sin - sin) + sin;
else
sin = .225 * (sin * sin - sin) + sin;
}

return sin;
}

inline PRECISION fast_precise_cos( PRECISION x)
{
PRECISION cos;

//always wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

// cos(x) = sin(x + PI/2)
x += 1.57079632;

if (x >  3.14159265)
x -= 6.28318531;

if (x < 0)
{
cos = 1.27323954 * x + 0.405284735 * x * x;

if (cos < 0)
cos = .225 * (cos *-cos - cos) + cos;
else
cos = .225 * (cos * cos - cos) + cos;
}
else
{
cos = 1.27323954 * x - 0.405284735 * x * x;

if (cos < 0)
cos = .225 * (cos *-cos - cos) + cos;
else
cos = .225 * (cos * cos - cos) + cos;
}

return cos;
}

// Tests standards functions
void TestStandardSinus(char *title)
{
int i;
PRECISION val, sinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
sinus    = sin(val);
}

EndTimer();

PrintTimer(title);
}

void TestStandardCosinus(char *title)
{
int i;
PRECISION val, cosinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
cosinus    = cos(val);
}

EndTimer();

PrintTimer(title);
}

// Tests fast functions
void TestFastSinus(char *title)
{
int i;
PRECISION val, sinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
sinus    = fast_sin(val);
}

EndTimer();

PrintTimer(title);
}

void TestFastCosinus(char *title)
{
int i;
PRECISION val, cosinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
cosinus    = fast_cos(val);
}

EndTimer();

PrintTimer(title);
}

// Tests fast and precise functions
void TestFastPreciseSinus(char *title)
{
int i;
PRECISION val, sinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
sinus    = fast_precise_sin(val);
}

EndTimer();

PrintTimer(title);
}

void TestFastPreciseCosinus(char *title)
{
int i;
PRECISION val, cosinus;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
cosinus    = fast_precise_cos(val);
}

EndTimer();

PrintTimer(title);
}

// Tests ultra fast versions
void TestUltraFastSinus(char *title)
{
int i;
PRECISION val, sinus, x;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
x = val;

if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

//compute sine
if (x < 0)
sinus = 1.27323954 * x + .405284735 * x * x;
else
sinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);
}

void TestUltraFastCosinus(char *title)
{
int i;
PRECISION val, cosinus, x;

BeginTimer();

for( i = 0; i < 1000000 ; i++ , val += 0.0000001f)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;

//always wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;

if (x < 0)
cosinus = 1.27323954 * x + 0.405284735 * x * x;
else
cosinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);
}

// main
int main()
{
TestStandardSinus(     "Standard    C/C++ sin() func  ");
TestStandardCosinus(   "Standard    C/C++ cos() func  ");

TestFastPreciseSinus(  "FastPrecise C/C++ sin() func");
TestFastPreciseCosinus("FastPrecise C/C++ cos() func");

TestFastSinus(         "Fast        C/C++ sin() func");
TestFastCosinus(       "Fast        C/C++ cos() func");

TestUltraFastSinus(    "UltraFast   C/C++ sin() func");
TestUltraFastCosinus(  "UltraFast   C/C++ cos() func");

}```

=> the integrated/optimised versions of sin/cos presents in TestUltraFastSinus() and TestUltraFastCosinus() can to be more than 4x more speed than the standards C/C++ sin()/cos() function

With the float version :
Code :
```Standard    C/C++ sin() func   : 90 ms
Standard    C/C++ cos() func   : 96 ms
FastPrecise C/C++ sin() func : 38 ms
FastPrecise C/C++ cos() func : 57 ms
Fast        C/C++ sin() func : 27 ms
Fast        C/C++ cos() func : 36 ms
UltraFast   C/C++ sin() func : 15 ms
UltraFast   C/C++ cos() func : 21 ms```
(the code make 1 000 000 sinus (or cosinus) computations at each benchmark)

With the double version :
Code :
```Standard    C/C++ sin() func   : 87 ms
Standard    C/C++ cos() func   : 95 ms
FastPrecise C/C++ sin() func : 65 ms
FastPrecise C/C++ cos() func : 73 ms
Fast        C/C++ sin() func : 51 ms
Fast        C/C++ cos() func : 62 ms
UltraFast   C/C++ sin() func : 17 ms
UltraFast   C/C++ cos() func : 20 ms```

Is normal that the standard and ultrafast versions seem equivalents with float and double ?
(my chronometer is certainly not the best in the world but the double versions of sin/cos seem however more speed than the float versions ...)

I think than the the Fast and FastPrecise versions are less speed in the double version because they have a bigger size of arguments to handle in input and output
(cf. this is the function push/pop of values that make the différence)

I think add speedly the check of differences between the reference value returned by sin/cos() and the various optimised computation of them using fast_cos/sin() , fastprecise_sin/cos() and the integrated computation of fast_sin/cos() included into TestUltraFastSinus/Cosinus()
(I can now work without any problem with doubles instead floats since computations with doubles seem as fast as computations with float on my Linux box => is it a normal thing ???)

8. I have make/debugged a new version that compute difference between goods values given by sin/cos and approximed/optimised versions of them

Code :
```#include <chrono>
#include <iostream>
#include <unistd.h>
#include <time.h>
#include <math.h>

#define PRECISION float
#define BENCHSIZE 10000000
#define WRAPPING  1

// #define BENCHINC  0.000001f
PRECISION BENCHINC   = (6.28318531 / BENCHSIZE) * 2.0f;
PRECISION BENCHSTART = -BENCHSIZE * BENCHINC;

///////////////////////////
// Chronometer functions //
///////////////////////////
using Clock = std::chrono::high_resolution_clock;
using std::chrono::milliseconds;
using std::chrono::nanoseconds;
using std::chrono::duration_cast;

auto start = Clock::now();
auto end = Clock::now();
auto delai = duration_cast<milliseconds>(end-start).count();

inline void BeginTimer()
{
start = Clock::now();
}

inline void EndTimer()
{
end = Clock::now();
}

void PrintTimer(char *title)
{
// std::cout << duration_cast<nanoseconds>(end-start).count() << " ns\n";
std::cout << title << " : "<< duration_cast<milliseconds>(end-start).count() << " ms ";
}

///////////////////////////////
// Tests standards functions //
///////////////////////////////
void TestStandardSinus(char *title)
{
int i;
PRECISION val, sinus;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
sinus    = sin(val);
}

EndTimer();

PrintTimer(title);
}

void TestStandardCosinus(char *title)
{
int i;
PRECISION val, cosinus;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
cosinus    = cos(val);
}

EndTimer();

PrintTimer(title);
}

//////////////////////////////////
// Fast approximation functions //
//////////////////////////////////
inline PRECISION fast_sin(PRECISION x)
{
PRECISION sin;

//always wrap input angle to -PI..PI
#ifdef WRAPPING
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute sine
if (x < 0)
sin = 1.27323954 * x + 0.405284735 * x * x;
else
sin = 1.27323954 * x - 0.405284735 * x * x;

return sin;
}

inline PRECISION fast_cos( PRECISION x)
{
PRECISION cos;

// cos(x) = sin(x + PI/2)
x += 1.57079632;

//always wrap input angle to -PI..PI
#ifdef WRAPPING
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
// else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
cos = 1.27323954 * x + 0.405284735 * x * x;
else
cos = 1.27323954 * x - 0.405284735 * x * x;

return cos;
}

////////////////////////////////////////
// Tests fast approximation functions //
////////////////////////////////////////
void TestFastSinus(char *title)
{
int i;
PRECISION val, sinus;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
sinus    = fast_sin(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = sin(val);
sinus    = fast_sin(val);

diff = sinus - ref;
moy  += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

// std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestFastCosinus(char *title)
{
int i;
PRECISION val, cosinus;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
cosinus    = fast_cos(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = cos(val);
cosinus    = fast_cos(val);

diff = cosinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

///////////////////////////////
// Tests ultra fast versions //
///////////////////////////////
void CheckUltraFastSinus()
{
int i;
PRECISION val, sinus, x;
PRECISION  ref, diff, moy = 0, variance=0, mini = 1.0f , maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
ref = sin(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute sine
if (x < 0)
sinus = 1.27323954 * x + .405284735 * x * x;
else
sinus = 1.27323954 * x - 0.405284735 * x * x;

// check diff
diff = sinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestUltraFastSinus(char *title)
{
int i;
PRECISION val, sinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531 ;
#endif

//compute sine
if (x < 0)
sinus = 1.27323954 * x + 0.405284735 * x * x;
else
sinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);

CheckUltraFastSinus();
}

void CheckUltraFastCosinus()
{
int i;
PRECISION val, cosinus, x;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f, maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;
ref = cos(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute cosine
if (x < 0)
cosinus = 1.27323954 * x + 0.405284735 * x * x;
else
cosinus = 1.27323954 * x - 0.405284735 * x * x;

// check diff
diff = cosinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestUltraFastCosinus(char *title)
{
int i;
PRECISION val, cosinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;

#ifdef WRAPPING
// wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
// else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
cosinus = 1.27323954 * x + 0.405284735 * x * x;
else
cosinus = 1.27323954 * x - 0.405284735 * x * x;
}

EndTimer();

PrintTimer(title);

CheckUltraFastCosinus();

}

///////////////////////////////////////////////
// Fast but precises approximation functions //
///////////////////////////////////////////////
inline PRECISION fast_precise_sin(PRECISION x)
{
PRECISION sin;

//always wrap input angle to -PI..PI
#ifdef WRAPPING
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute sine
if (x < 0)
{
sin = 1.27323954 * x + .405284735 * x * x;

if (sin < 0)
sin = .225 * (sin *-sin - sin) + sin;
else
sin = .225 * (sin * sin - sin) + sin;
}
else
{
sin = 1.27323954 * x - 0.405284735 * x * x;

if (sin < 0)
sin = .225 * (sin *-sin - sin) + sin;
else
sin = .225 * (sin * sin - sin) + sin;
}

return sin;
}

inline PRECISION fast_precise_cos( PRECISION x)
{
PRECISION cos;

// cos(x) = sin(x + PI/2)
x += 1.57079632;

// always wrap input angle to -PI..PI
#ifdef WRAPPING
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
// else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
{
cos = 1.27323954 * x + 0.405284735 * x * x;

if (cos < 0)
cos = .225 * (cos *-cos - cos) + cos;
else
cos = .225 * (cos * cos - cos) + cos;
}
else
{
cos = 1.27323954 * x - 0.405284735 * x * x;

if (cos < 0)
cos = .225 * (cos *-cos - cos) + cos;
else
cos = .225 * (cos * cos - cos) + cos;
}

return cos;
}

////////////////////////////////////////////////////
// Tests fast but precise approximation functions //
////////////////////////////////////////////////////
void TestFastPreciseSinus(char *title)
{
int i;
PRECISION val, sinus;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
sinus    = fast_precise_sin(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = sin(val);
sinus    = fast_precise_sin(val);

diff = sinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestFastPreciseCosinus(char *title)
{
int i;
PRECISION val, cosinus;
PRECISION  ref, diff, moy = 0, variance=0, mini = 1.0f , maxi = -1.0f;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
cosinus    = fast_precise_cos(val);
}

EndTimer();

PrintTimer(title);

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
ref = cos(val);
cosinus    = fast_precise_cos(val);

diff = cosinus - ref;
moy += diff;
variance += diff*diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

///////////////////////////////////////////
// Tests ultra fast and precise versions //
///////////////////////////////////////////
void CheckUltraFastPreciseSinus()
{
int i;
PRECISION val, sinus, x;
PRECISION  ref, diff, moy = 0, variance=0, mini = 1.0f , maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
ref = sin(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute sine
if (x < 0)
{
sinus = 1.27323954 * x + .405284735 * x * x;

if (sinus < 0)
sinus = .225 * (sinus *-sinus - sinus) + sinus;
else
sinus = .225 * (sinus * sinus - sinus) + sinus;
}
else
{
sinus = 1.27323954 * x - 0.405284735 * x * x;

if (sinus < 0)
sinus = .225 * (sinus *-sinus - sinus) + sinus;
else
sinus = .225 * (sinus * sinus - sinus) + sinus;
}

// check diff
diff = sinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestUltraFastPreciseSinus(char *title)
{
int i;
PRECISION val, sinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531 ;
#endif

//compute sine
if (x < 0)
{
sinus = 1.27323954 * x + .405284735 * x * x;

if (sinus < 0)
sinus = .225 * (sinus *-sinus - sinus) + sinus;
else
sinus = .225 * (sinus * sinus - sinus) + sinus;
}
else
{
sinus = 1.27323954 * x - 0.405284735 * x * x;

if (sinus < 0)
sinus = .225 * (sinus *-sinus - sinus) + sinus;
else
sinus = .225 * (sinus * sinus - sinus) + sinus;
}
}

EndTimer();

PrintTimer(title);

CheckUltraFastPreciseSinus();
}

void CheckUltraFastPreciseCosinus()
{
int i;
PRECISION val, cosinus, x;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f, maxi = -1.0f;

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;
ref = cos(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
// else
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
{
cosinus = 1.27323954 * x + 0.405284735 * x * x;

if (cosinus < 0)
cosinus = .225 * (cosinus *-cosinus - cosinus) + cosinus;
else
cosinus = .225 * (cosinus * cosinus - cosinus) + cosinus;
}
else
{
cosinus = 1.27323954 * x - 0.405284735 * x * x;

if (cosinus < 0)
cosinus = .225 * (cosinus *-cosinus - cosinus) + cosinus;
else
cosinus = .225 * (cosinus * cosinus - cosinus) + cosinus;
}

// check diff
diff = cosinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
mini = diff;

if ( diff > maxi )
maxi = diff;
}

moy /= i;
variance = sqrt(variance / i);

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.9f/%1.9f/%1.9f/%1.9f)", mini, moy, variance, maxi);
}

void TestUltraFastPreciseCosinus(char *title)
{
int i;
PRECISION val, cosinus, x;

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
// cos(x) = sin(x + PI/2)
x = val + 1.57079632;

#ifdef WRAPPING
// wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
// else
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

if (x < 0)
{
cosinus = 1.27323954 * x + 0.405284735 * x * x;

if (cosinus < 0)
cosinus = .225 * (cosinus *-cosinus - cosinus) + cosinus;
else
cosinus = .225 * (cosinus * cosinus - cosinus) + cosinus;
}
else
{
cosinus = 1.27323954 * x - 0.405284735 * x * x;

if (cosinus < 0)
cosinus = .225 * (cosinus *-cosinus - cosinus) + cosinus;
else
cosinus = .225 * (cosinus * cosinus - cosinus) + cosinus;
}
}

EndTimer();

PrintTimer(title);

CheckUltraFastPreciseCosinus();

}

//////////
// main //
//////////
int main()
{
std::cout << "\nBenchmark cos()/sin() optimization with " << BENCHSIZE << " iterations" << "\n";
std::cout << "(from " << BENCHSTART << " radians to " << -BENCHSTART << " radians with a stepsize of " << BENCHINC << ")\n\n";

TestStandardSinus(             "Standard C/C++ sin()       "); std::cout << "\n";
TestStandardCosinus(           "Standard C/C++ cos()       "); std::cout << "\n\n";

TestFastPreciseSinus(          "inlined fast_precise_sin() "); std::cout << "\n";
TestFastPreciseCosinus(        "inlined fast_precise_cos() "); std::cout << "\n\n";

TestUltraFastPreciseSinus(        "UltraFastPreciseSinus()    "); std::cout << "\n";
TestUltraFastPreciseCosinus(      "UltraFastPreciseCosinus()  "); std::cout << "\n\n";

TestFastSinus(                 "inlined fast_sin()         "); std::cout << "\n";
TestFastCosinus(               "inlined fast_cos()         "); std::cout << "\n\n";

TestUltraFastSinus(           "UltraFastSinus()           "); std::cout << "\n";
TestUltraFastCosinus(          "UltraFastCosinus()         "); std::cout << "\n\n";
}```

This give this with the float version :
Code :
```Benchmark cos()/sin() optimization with 10000000 iterations

Standard C/C++ sin()        : 745 ms
Standard C/C++ cos()        : 774 ms

inlined fast_precise_sin()  : 598 ms (min/moy/variance/max)=(-0.001090974/0.000029710/0.000597704/0.001090437)
inlined fast_precise_cos()  : 695 ms (min/moy/variance/max)=(-0.001091599/-0.000001502/0.000588869/0.001091063)

UltraFastPreciseSinus()     : 520 ms (min/moy/variance/max=(-0.001090974/0.000029710/0.000597704/0.001090437)
UltraFastPreciseCosinus()   : 702 ms (min/moy/variance/max)=(-0.001091599/-0.000001502/0.000588869/0.001091063)

inlined fast_sin()          : 506 ms (min/moy/variance/max)=(-0.056010097/0.009407033/0.035580181/0.056009740)
inlined fast_cos()          : 589 ms (min/moy/variance/max)=(-0.056010574/-0.000039861/0.035341624/0.056010306)

UltraFastSinus()            : 382 ms (min/moy/variance/max=(-0.056010097/0.009407033/0.035580181/0.056009740)
UltraFastCosinus()          : 461 ms (min/moy/variance/max)=(-0.056010574/-0.000039861/0.035341624/0.056010306)```

And with the double version :
(only change the "#define PRECISION float" by "#define PRECISION double" into the source code)
Code :
```Benchmark cos()/sin() optimization with 10000000 iterations

Standard C/C++ sin()        : 737 ms
Standard C/C++ cos()        : 765 ms

inlined fast_precise_sin()  : 903 ms (min/moy/variance/max)=(-0.001090303/0.000000000/0.000596678/0.001090319)
inlined fast_precise_cos()  : 956 ms (min/moy/variance/max)=(-0.001090311/-0.000000000/0.000596679/0.001090326)

UltraFastPreciseSinus()     : 526 ms (min/moy/variance/max=(-0.001090303/0.000000000/0.000596678/0.001090319)
UltraFastPreciseCosinus()   : 571 ms (min/moy/variance/max)=(-0.001090311/-0.000000000/0.000596679/0.001090326)

inlined fast_sin()          : 839 ms (min/moy/variance/max)=(-0.056009604/0.000000000/0.035836168/0.056009589)
inlined fast_cos()          : 801 ms (min/moy/variance/max)=(-0.056009610/-0.000000000/0.035836167/0.056009594)

UltraFastSinus()            : 403 ms (min/moy/variance/max=(-0.056009604/0.000000000/0.035836168/0.056009589)
UltraFastCosinus()          : 462 ms (min/moy/variance/max)=(-0.056009610/-0.000000000/0.035836167/0.056009594)```

Only the UltraFast[Precise]Sinus/Cosinus() implementations are faster than standards cos/sin() functions in the double version
(the fast_sin/cos() and fast_precise_sin/cos() funcs are more slow in the double version)
[but they are alls more speed on the float version ]

PS : I don't see really an amelioration with the double version compared to the float version => this is normal ???
(the float version is more fast but on the other side the double version don't seem to be more precise ...)

9. Originally Posted by The Little Body
are double version of sin/cos/atan and other trigonometric functions, on Linux and C/C++, enough precises for to serve as a reference ?
Yes, they are. Especially for the single precision calculated counterparts.
Only if you could use more precise variables than doubles you should worry about precision of C-math function implementation. Assume 15 decimal positions are absolutely correct.

Originally Posted by The Little Body
I think to implement/test this in GLSL after to have find something that is enough precise (and speed if possible) with a standard C code (and I have test nothing with doubles, only with float ... but this is something that I have planned to make the next week)
Don't wait till then just to find out something that works on CPU does not work the same way on GPU. The precision is not the same (although vendors claim full IEEE754 compatibility) as well as HW function implementations.

My suggestion is not to try to find ultra-fast but ultra-precise implementation using single precision arithmetic. There are algorithms for double precision simulation with single precision, but they are not trivial and require more computation time that is necessary, unless we really need more that 10 decimal digits precision. If your FP implementations generate error above 1e-3 they are not useful. Built-in functions are several orders of magnitude more precise (except for very small arguments, or borders of the argument range).

PS : I don't see really an amelioration with the double version compared to the float version => this is normal ???
(the float version is more fast but on the other side the double version don't seem to be more precise ...)
No, that is not normal. DP version should have more precise results. At least 9 orders of magnitude. Just be aware that coefficients for the interpolation polynomials are not the same for SP and DP version, as well as power. You need polynomials of higher order for DP.

The following is almost the best we can get for SP:
Code :
```float GLRenderer::cos_52s(float x)
{
const float c1= 0.9999932946;
const float c2=-0.4999124376;
const float c3= 0.0414877472;
const float c4=-0.0012712095;
float x2; // The input argument squared
x2=x * x;
return (c1 + x2*(c2 + x2*(c3 + c4*x2)));
}```
This implementation has 5 exact decimal positions.

For DP, the following works OK:
Code :
```double cos_121s(double x)
{
const double c1= 0.99999999999925182;
const double c2=-0.49999999997024012;
const double c3= 0.041666666473384543;
const double c4=-0.001388888418000423;
const double c5= 0.0000248010406484558;
const double c6=-0.0000002752469638432;
const double c7= 0.0000000019907856854;
double x2; // The input argument squared
x2=x * x;
return (c1 + x2*(c2 + x2*(c3 + x2*(c4 + x2*(c5 + x2*(c6 + c7*x2))))));
}```
The author claims 12 decimal position precision, but I would agree with just 11.
Error is above 4e-12.

Lets back to GPU HW implementation. Here is a short summary what I found this weekend:
1. sin/cos functions are HW based
2. Unit that calculate those functions are called Special Function Unit (SFU) in NV terminology
3. Based on High-Performance Area-Efficient Multifunction Interpolator (quadratic interpolation algorithm) - A table lookup is used to retrieve optimized coefficients for a quadratic polynomial approximation, followed by a high-speed evaluation of the polynomial.
4. Accuracy – 22.47 (the number of good bits in the prenormalized results)
5. Kepler uses the same logic as Fermi
6. Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.

In short, they use quadratic interpolation with coefficient lookup tables.

No, we can continue to discuss more precise software implementations.

10. Thank, Alexandar

In short, they use quadratic interpolation with coefficient lookup tables.
Where the lookup table of coefficients is indexed by the angle for example ?

I have tested your first formula but my check find than this approximation seem to give bad result around PI (cf. 180 degrees)

I have using this :
Code :
```void CheckCosinus_52s()
{
int i;
PRECISION val, cosinus, x, smaller, bigger;
PRECISION  ref, diff, moy = 0, variance = 0, mini = 1.0f, maxi = -1.0f;

const float c1= 0.9999932946;
const float c2=-0.4999124376;
const float c3= 0.0414877472;
const float c4=-0.0012712095;
float x2; // The input argument squared

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
ref = cos(val);

#ifdef WRAPPING
// wrap input angle to -PI..PI
if ( x < -3.14159265 )
while ( x < -3.14159265)
x += 6.28318531 ;
#ifdef WRAPELSE
else
#endif
if ( x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute cosine
x2 = x * x;
cosinus = c1 + x2*(c2 + x2*(c3 + c4*x2));

// check diff
diff = cosinus - ref;
moy += diff;
variance += diff * diff;

if ( diff < mini )
{
mini = diff;
smaller = val;
}

if ( diff > maxi )
{
maxi = diff;
bigger = val;
}
}

moy /= i;
variance = sqrt(variance / i);
smaller  = smaller - (6.28318531 * floor(smaller/6.28318531));
bigger   = bigger  - (6.28318531 * floor(bigger /6.28318531));

//std::cout << "(min/moy/max=" << mini << "/" << moy << "/" << maxi << ")";
printf("(min/moy/variance/max)=(%1.6f/%1.6f/%1.6f/%1.6f)", mini, moy, variance, maxi);
printf(" [smaller=%3.1f degrees, bigger=%3.1f degrees]", smaller * (360.0f/6.28318531),  bigger * (360.0f/6.28318531) );

}

void TestCosinus_52s(char *title)
{
int i;
PRECISION val, cosinus, x;

const float c1= 0.9999932946;
const float c2=-0.4999124376;
const float c3= 0.0414877472;
const float c4=-0.0012712095;
float x2; // The input argument squared

BeginTimer();

for( i = 0, val = -BENCHSTART; i < BENCHSIZE ; i++ , val += BENCHINC)
{
x = val;
#ifdef WRAPPING
// wrap input angle to -PI..PI
if (x < -3.14159265)
while ( x < -3.14159265)
x += 6.28318531;
#ifdef WRAPELSE
else
#endif
if (x >  3.14159265)
while ( x >  3.14159265)
x -= 6.28318531;
#endif

//compute cosine
x2  =x * x;
cosinus = c1 + x2*(c2 + x2*(c3 + c4*x2));
}

EndTimer();

PrintTimer(title);

CheckCosinus_52s();

}```

And Add this test of the end of the main()
Code :
`TestCosinus_52s(          "FastCosinus_52s()          "); std::cout << "\n\n";`

And the result is this :
Code :
```Benchmark cos()/sin() optimization with 1000000 iterations
[with a stepsize of 0.00072 degrees]

Standard C/C++ sin()        : 85 ms
Standard C/C++ cos()        : 86 ms

inlined fast_precise_sin()  : 74 ms (min/moy/variance/max)=(-0.001091/-0.000025/0.000605/0.001090) [smaller=168.9 degrees, bigger=348.9 degrees]
inlined fast_precise_cos()  : 78 ms (min/moy/variance/max)=(-0.001092/0.000010/0.000588/0.001091) [smaller=78.9 degrees, bigger=258.9 degrees]

UltraFastPreciseSinus()     : 60 ms (min/moy/variance/max=(-0.001091/-0.000025/0.000605/0.001090) [smaller=168.9 degrees, bigger=348.9 degrees]
UltraFastPreciseCosinus()   : 69 ms (min/moy/variance/max)=(-0.001092/0.000010/0.000588/0.001091) [smaller=78.9 degrees, bigger=258.9 degrees]

inlined fast_sin()          : 55 ms (min/moy/variance/max)=(-0.056010/0.002068/0.036154/0.056010) [smaller=333.0 degrees, bigger=27.0 degrees]
inlined fast_cos()          : 63 ms (min/moy/variance/max)=(-0.056011/0.000306/0.035148/0.056010) [smaller=117.0 degrees, bigger=297.0 degrees]

UltraFastSinus()            : 43 ms (min/moy/variance/max=(-0.056010/0.002068/0.036154/0.056010) [smaller=333.0 degrees, bigger=27.0 degrees]
UltraFastCosinus()          : 53 ms (min/moy/variance/max)=(-0.056011/0.000306/0.035148/0.056010) [smaller=117.0 degrees, bigger=297.0 degrees]

FastCosinus_52s()           : 52 ms (min/moy/variance/max)=(-0.114787/-0.010663/0.025695/0.000007) [smaller=180.0 degrees, bigger=83.1 degrees]```
(this is the last line that use your first formula, no tested the second formula)
[cf. a very good precision around PI/2, cf 83.1 degrees, but very bad at PI cf. 180.0 degrees]
=> the formula isn't usable between -PI to PI but between 0 to PI/2 or something like this ?

We can perhaps use NURBS or something like this instead an indexed lookup of quadratics curve's coefficients ?
(or interpolate quadratics coefficients, instead "only" values computed by them, between two or more lookup entries ?)
[the current NV hardware implementation seem to use something like a piecewyse quadratic function where each piece, cf. quadratic's coefficients, is given by an index/angle lookup]

=> is the quadratic's coefficients table lookups used by NV strictly private or it is accessible somewhere ?
(in alls cases, I think that we can recompute ourselves the quadratic coefficients of each piece if we have values and their derivatives at the begin and the end of each "step/index lookup" [values can easily be computed using cos(x)/sin(x) and their derivative as easily by -sin(x)/cos(x) ... But OK all the difficulty is to compute the goods quadratic's coefficients from them ])

Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•