Builtin math function execution cost: Issues with accuracy of builtins



damian
09-30-2016, 02:14 AM
Hi,

I have run into a problem with accuracy of sin/cos on an Intel HD4600.

Because of this I have had to implement a minimax polynomial sin/cos in GLSL.

What I really need to know is how costly my implementations are so that I can play accuracy off against performance.

Can anyone provide me with the cycle cost for the builtin math functions for the HD4600 GPUs?

Thanks
Damian

Aleksandar
09-30-2016, 05:45 PM
It is impossible to give a general answer to this question. Modern GPUs are massively parallel processors with deep pipelines. Different compilers can also make a difference in execution speed, even on the same hardware. The only way to tell whether something is faster or not is to benchmark it. That gives the correct answer, but only for that particular application and environment.

I have very little knowledge about Intel graphics. According to the scarce documentation, Intel has implemented transcendental functions in the Execution Units (EUs) starting with the HD 3000. Previously they used a shared math box. This change tripled the performance of transcendental functions (also according to Intel's documentation).

What I know for sure is how it is implemented in NVIDIA GPUs. Transcendental functions are implemented in the Special Function Units (SFUs), and a new sine and cosine result (both at the same time) can be read on each clock. That doesn't necessarily mean a value takes just a single clock to calculate, but the pipeline allows a new value to be fetched on each clock tick. Unfortunately, several cores may share a single SFU (4 cores in GP104 (GTX 1080)), and the SFU can operate only on single-precision floating-point arguments. That is the problem with precision: double precision is not supported in the SFU. However, for visualization, single precision is more than acceptable.

Why do you think you have a problem with the accuracy of the transcendental functions? I'm asking because just a couple of years ago I had the same problem and tried to solve it with Taylor series in shaders. Fortunately, I realized that the problem should be solved by changing the algorithm, not the accuracy.

damian
10-03-2016, 05:17 AM
I found your description of the NVIDIA GPU and transcendental function implementations interesting and enlightening.



Originally posted by Aleksandar:
"Why do you think you have a problem with the accuracy of the transcendental functions? I'm asking because just a couple of years ago I had the same problem and tried to solve it with Taylor series in shaders. Fortunately, I realized that the problem should be solved by changing the algorithm, not the accuracy."

I am implementing map projections on the GPU. The inverse and forward projections work fine on NVIDIA and AMD according to my testing. Sadly my target GPU is the Intel HD4600.

We need a very high degree of accuracy because we are displaying vector and raster maps which may be used for navigation. When the projections go wrong it is fairly obvious. In this case we are seeing tears in the rasters.

I have eliminated the use of sin/cos as much as I can, but due to the nature of the algorithms there is a limit to how much I can reorganise or simplify the maths.

NVIDIA sin/cos is accurate to at least 6 decimal places (Quadro K620), which is sufficient most of the time.

My testing implies that there are problems with parts of the sin/cos range on the Intel GPU. The program I used to test the NVIDIA GPU does not appear to work with the Intel GPU. I have had to park this investigation for now due to time pressures.

I have implemented a minimax sin/cos polynomial which is accurate to at least 6 decimal places with float and ~12 with double over the ±360 degree range, which is more than sufficient for this work. I'm just concerned that this may affect performance too much when compared to hardware sin/cos.
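
For readers who want to see the general shape of such a replacement, here is a minimal sketch in Python rather than GLSL. It is not the implementation described in this post: it uses plain Taylor coefficients instead of minimax-fitted ones, and the names `f32`, `poly_sin` and `max_abs_error` are made up for illustration. Single precision is emulated by rounding every intermediate result through `struct`, on the assumption that this roughly mirrors what a GLSL `float` would do.

```python
import math
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Taylor coefficients for x^3, x^5, x^7, x^9, x^11 (NOT minimax-fitted).
COEFFS = [f32(c) for c in (-1 / 6, 1 / 120, -1 / 5040, 1 / 362880, -1 / 39916800)]
PI = f32(math.pi)
TWO_PI = f32(2 * math.pi)


def poly_sin(x):
    """Single-precision polynomial sine, intended for roughly [-2*pi, 2*pi]."""
    x = f32(x)
    # Range reduction: bring x into about [-pi, pi] ...
    k = float(round(x / TWO_PI))
    r = f32(x - k * TWO_PI)
    # ... then fold into [-pi/2, pi/2] using sin(pi - r) == sin(r).
    if r > PI / 2:
        r = f32(PI - r)
    elif r < -PI / 2:
        r = f32(-PI - r)
    r2 = f32(r * r)
    # Horner evaluation of c3 + c5*r2 + c7*r2^2 + c9*r2^3 + c11*r2^4.
    p = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        p = f32(p * r2 + c)
    # sin(r) ~= r + r^3 * p
    return f32(r + f32(r * f32(r2 * p)))


def max_abs_error(samples=20001):
    """Largest absolute error against math.sin over [-2*pi, 2*pi]."""
    worst = 0.0
    for i in range(samples):
        x = f32(-2 * math.pi + 4 * math.pi * i / (samples - 1))
        worst = max(worst, abs(poly_sin(x) - math.sin(x)))
    return worst


if __name__ == "__main__":
    print("max abs error over +-360 degrees:", max_abs_error())
```

After folding to [-pi/2, pi/2] the truncation error of this polynomial is below 1e-7, so in this sketch the remaining error comes mostly from single-precision rounding in the range reduction. Minimax coefficients would let a lower-degree polynomial reach the same accuracy, which is where the performance trade-off mentioned above comes in.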

I will just have to figure out how to do some sensible benchmarking. I'm just not too sure how at the moment.

Aleksandar
10-04-2016, 06:15 AM
Inspired by your question about the precision of the implemented transcendental functions, I have carried out some experiments. The results are really amazing!

Although I prefer NVIDIA cards, I have to admit that AMD has the most precise implementation. When I carried out similar experiments 5 or more years ago, the results were different.
In the tests, AMD is represented by R280, while NV and Intel are represented by GTX850 and HD4600, respectively. The same results were obtained for all NV cards (from Fermi to Maxwell).

The figure (https://a60d8deb-a-62cb3a1a-s-sites.googlegroups.com/site/opengltutorialsbyaks/events/glslprecision041016/Sin-Precision.jpg) speaks louder than words. AMD has by far the best precision. I'm not talking about a difference of a few percent, or even a dozen percent. The difference is orders of magnitude!
The figure shows the precision (relative error) of the sine function used in GLSL. The top chart shows errors for small arguments (x<1e-5). The middle chart shows errors for arguments in the range 1e-5 to 1e-2 radians, while the bottom chart is for large arguments.

The conclusions are the following:

1. The relative error of the GLSL sine function on R280 is always less than or equal to 5.3e-7. Absolutely amazing! Even for extremely small arguments (e.g. 1e-18). For the other two, the error is 100% for arguments less than 1e-7.

2. For small arguments, Intel has slightly better precision than NVIDIA.

3. For larger arguments, Intel's implementation is a disaster.
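
To make the metric concrete, here is a small sketch of how a relative-error comparison like this can be set up. It is not the test program used for the figure. The two candidate functions are hypothetical stand-ins written in Python: `rounded_sin` returns a correctly rounded single-precision sine, and `flushed_sin` returns zero for arguments below 1e-7, purely to show what a 100% relative error looks like numerically. Neither is a claim about what any driver actually does. The three ranges follow the three charts described above, with 1e2 as an assumed upper end for "large" arguments.

```python
import math
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def relative_error(approx, exact):
    """|approx - exact| / |exact|; 1.0 corresponds to a 100% error."""
    return abs(approx - exact) / abs(exact)


def max_relative_error(candidate, xs):
    """Worst relative error of candidate(x) against the double-precision sine."""
    return max(relative_error(candidate(x), math.sin(x)) for x in xs)


def rounded_sin(x):
    """Hypothetical stand-in: an ideally rounded single-precision sine."""
    return f32(math.sin(x))


def flushed_sin(x):
    """Hypothetical stand-in: gives up (returns 0) for tiny arguments."""
    return 0.0 if abs(x) < 1e-7 else f32(math.sin(x))


def log_spaced(lo, hi, n):
    """n points spaced evenly in log10 between lo and hi (inclusive)."""
    a, b = math.log10(lo), math.log10(hi)
    return [10 ** (a + (b - a) * i / (n - 1)) for i in range(n)]


# The three argument ranges, mirroring the three charts in the figure.
RANGES = {
    "small (x < 1e-5)": log_spaced(1e-18, 1e-5, 200),
    "medium (1e-5 .. 1e-2)": log_spaced(1e-5, 1e-2, 200),
    "large (x > 1e-2)": log_spaced(1e-2, 1e2, 200),
}

if __name__ == "__main__":
    for name, xs in RANGES.items():
        print(name,
              "rounded:", max_relative_error(rounded_sin, xs),
              "flushed:", max_relative_error(flushed_sin, xs))
```

In a real test the candidate values would be read back from the GPU, for example through a buffer or a floating-point render target, and compared on the CPU in the same way.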




Originally posted by damian:
"We need a very high degree of accuracy because we are displaying vector and raster maps which may be used for navigation. When the projections go wrong it is fairly obvious. In this case we are seeing tears in the rasters."


You shouldn't use a "standard" projection/reprojection function at all scales. I had the same problem 5-6 years ago. That's why I introduced a topocentric coordinate system and used linear interpolation for shorter distances. This enabled subpixel-precise visualization of the whole Earth.
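
As a rough illustration of the linear-interpolation half of that idea (the topocentric transform itself is not shown), the sketch below evaluates a projection exactly only at the two ends of a short latitude span and interpolates in between. Spherical Mercator is used purely as an assumed example projection, and the function names are hypothetical. In a real renderer the end points would be computed on the CPU in double precision and only the interpolation would run in the shader.

```python
import math


def mercator_y(lat_deg):
    """Spherical Mercator northing on a unit sphere (example projection only)."""
    lat = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + lat / 2))


def make_linear_patch(lat0_deg, lat1_deg):
    """Project the two end points exactly, then interpolate linearly between them."""
    y0, y1 = mercator_y(lat0_deg), mercator_y(lat1_deg)

    def interp(lat_deg):
        t = (lat_deg - lat0_deg) / (lat1_deg - lat0_deg)
        return y0 + t * (y1 - y0)  # no transcendental functions needed here

    return interp


def max_patch_error(lat0_deg, lat1_deg, samples=101):
    """Worst difference between the interpolated and exact projection on the patch."""
    interp = make_linear_patch(lat0_deg, lat1_deg)
    worst = 0.0
    for i in range(samples):
        lat = lat0_deg + (lat1_deg - lat0_deg) * i / (samples - 1)
        worst = max(worst, abs(interp(lat) - mercator_y(lat)))
    return worst


if __name__ == "__main__":
    earth_radius_m = 6378137.0  # scale unit-sphere error to metres
    for span in (0.01, 1.0, 10.0):
        err = max_patch_error(45.0, 45.0 + span)
        print(f"{span:5.2f} degree patch: {err * earth_radius_m:.4f} m")
```

The interpolation error shrinks with the square of the patch size, so for a 0.01 degree patch at 45 degrees latitude it is on the order of centimetres, while a 10 degree patch is off by kilometres. That is why this only works for short distances.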



Originally posted by damian:
"I will just have to figure out how to do some sensible benchmarking. I'm just not too sure how at the moment."


Just measure the frame drawing time using timer_query (https://www.opengl.org/registry/specs/ARB/timer_query.txt).