Performance: sqrt vs acos vs multiplies

Dear All,

I have a performance-intensive fragment shader and want to optimise its speed. I have various ways of performing some of the maths, and was wondering whether the trade-offs are obvious (before I go and try it myself).

Does anybody have any information about the relative time cost of sqrt() vs acos()? I presume they are both table lookups, so don’t have much of a hit.

Similarly, does anybody have any information about the relative time cost of sin/cos/sqrt etc. vs adds, muls and the like? How many muls does one sqrt cost?

Lastly, does anybody have any information about the relative cost of texture lookups? Precomputing functions into textures for lookup is often done to increase performance, but how complex do the functions have to be before this is worth doing?

Thanks in advance,

David

i’ve been playing with nvidia’s “fx composer” (very cool), which has a shader perf window with an asm dump, plus instruction and cycle counts. not for glsl directly, but nvidia uses the cg compiler for glsl so it might give you a good ballpark for isolated functions like this.

you can use nvidia performance tools without composer, but it’s kinda handy :)

as for glsl, there’s nothing in the spec that i can see that would make such an analysis possible. perhaps i missed it.

nvidia has traditionally favored the lut, while ati prefers the math. alas, it’s an implementation thing.

the standalone nVidia shader performance tool http://developer.nvidia.com/object/nvshaderperf_home.html supports GLSL.

The cost of an instruction can only be evaluated in the context of the entire shader, on specific hardware and with a specific driver. For example, while some operation may take a long time, the compiler may be able to schedule instructions so that the calculation runs in an otherwise unused unit, simultaneously with another necessary calculation, or it may use some fast path in the hardware. Part of that instruction's time cost may then disappear, or the instruction may even be free.

The same goes for texture sampling. If your shader is already texture-heavy, then even a complex calculation may be better than storing the function in a texture, while if your shader uses only a small number of textures, the fetch may be better than the calculation. This also depends on how your function parameters change across the rendered primitive. If they vary very randomly, the sampling may thrash the texture cache and the calculation may be better, even if the sampling would win for smoothly varying parameters. And of course different hardware may behave in exactly the opposite way for your shader.
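To get a feel for whether a function is even a good candidate for a lookup texture, here is a minimal CPU-side sketch in Python. It emulates a 1D texture with linear filtering and measures the worst-case error against the real function. The helper names (`build_lut`, `sample_lut`, `max_lut_error`) are made up for this illustration, and the emulation ignores texel-centre offsets and fixed-point filtering precision, so treat the numbers as a rough guide only.

```python
import math


def build_lut(func, lo, hi, size):
    """Sample func at `size` evenly spaced points on [lo, hi], like filling a 1D texture."""
    return [func(lo + (hi - lo) * i / (size - 1)) for i in range(size)]


def sample_lut(lut, lo, hi, x):
    """Clamped, linearly interpolated lookup (a simplified stand-in for a filtered fetch)."""
    t = (min(max(x, lo), hi) - lo) / (hi - lo) * (len(lut) - 1)
    i = min(int(t), len(lut) - 2)      # index of the left-hand texel
    frac = t - i                       # blend weight towards the right-hand texel
    return lut[i] * (1.0 - frac) + lut[i + 1] * frac


def max_lut_error(func, lo, hi, size, probes=10001):
    """Largest absolute difference between the table lookup and func over [lo, hi]."""
    lut = build_lut(func, lo, hi, size)
    worst = 0.0
    for k in range(probes):
        x = lo + (hi - lo) * k / (probes - 1)
        worst = max(worst, abs(sample_lut(lut, lo, hi, x) - func(x)))
    return worst


if __name__ == "__main__":
    print("sin,  256 texels:", max_lut_error(math.sin, 0.0, math.pi, 256))
    print("acos, 256 texels:", max_lut_error(math.acos, -1.0, 1.0, 256))
```

With a uniformly sampled table, a smooth function like sin comes out far more accurate than acos, whose slope blows up near -1 and 1. So accuracy requirements can decide the question before performance does.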

Originally posted by David Spilling:
Does anybody have any information about the relative time cost of sqrt() vs acos()? I presume they are both table lookups, so don’t have much of a hit.

Similarly, does anybody have any information about the relative time cost of sin/cos/sqrt etc. vs adds, muls and the like? How many muls does one sqrt cost?
sqrt costs two instructions if it’s a single scalar. For two components it’s 3 instructions and so on.
sin/cos cost one cycle each on R5xx. The R4xx only supports them natively in the vertex shader, not in the pixel shader, so there they need 8-11 instructions (depending on whether the argument needs to be clamped to the [-PI, PI] range). On R3xx that is the case in the vertex shader as well.
acos, asin, atan, atan2 etc. don’t have native hardware support on any hardware, so they need roughly 10-20 instructions.
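To illustrate why an inverse trig function expands into a handful of instructions, here is a sketch in Python of one well-known polynomial approximation of acos (Abramowitz & Stegun, formula 4.4.45). This is not necessarily what any particular driver or compiler emits; it is only meant to show the kind of arithmetic involved. The function name `approx_acos` is made up for this example.

```python
import math

# Coefficients from Abramowitz & Stegun, formula 4.4.45.
_A0, _A1, _A2, _A3 = 1.5707288, -0.2121144, 0.0742610, -0.0187293


def approx_acos(x):
    """Approximate acos(x) for x in [-1, 1] using a cubic polynomial and one sqrt."""
    ax = abs(x)
    # Horner form: three multiply-adds.
    poly = ((_A3 * ax + _A2) * ax + _A1) * ax + _A0
    # One subtract, one sqrt, one multiply.
    r = math.sqrt(1.0 - ax) * poly
    # Mirror the result for negative inputs: acos(-x) = pi - acos(x).
    return r if x >= 0.0 else math.pi - r


if __name__ == "__main__":
    worst = max(abs(approx_acos(k / 1000.0) - math.acos(k / 1000.0))
                for k in range(-1000, 1001))
    print("max abs error:", worst)
```

Counting the abs, the multiply-adds, the sqrt and the sign fix-up gives something around ten operations, which is at least in the same ballpark as the 10-20 instructions quoted above.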

Originally posted by David Spilling:
Lastly, does anybody have any information about the relative cost of texture lookups? Precomputing functions into textures for lookup is often done to increase performance, but how complex do the functions have to be before this is worth doing?
Texture instructions are a bit more complex because of bandwidth and so on, but if we assume everything is always in cache, a bilinear fetch takes 1 cycle if the texel is 32 bits or less. Trilinear takes 2 cycles. For anisotropic filtering, the cost is multiplied by the level of anisotropy used for the fetch. A 64-bit texture costs twice as much and a 128-bit texture 4x. Texture instructions can execute in parallel with ALU instructions, so if they are not the bottleneck, they may even be free.
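Those rules of thumb can be written down as a tiny back-of-the-envelope model. The Python sketch below only encodes the figures from this post and assumes every fetch hits the cache. The function names are made up, and the "perfect overlap" in `estimate_shader_cycles` is an idealisation added for illustration, not something real hardware guarantees.

```python
def texture_fetch_cycles(filter_mode="bilinear", bits_per_texel=32, aniso_level=1):
    """Best-case cycles for one fetch, assuming it always hits the texture cache."""
    base = {"bilinear": 1, "trilinear": 2}[filter_mode]
    if bits_per_texel <= 32:
        width = 1
    elif bits_per_texel <= 64:
        width = 2
    elif bits_per_texel <= 128:
        width = 4
    else:
        raise ValueError("texel formats wider than 128 bits are not covered")
    # Anisotropic filtering multiplies the cost by the anisotropy level of the fetch.
    return base * width * max(1, aniso_level)


def estimate_shader_cycles(alu_cycles, fetch_cycles):
    """Idealised overlap (assumption): ALU and texture work run fully in parallel."""
    return max(alu_cycles, sum(fetch_cycles))


if __name__ == "__main__":
    fetches = [texture_fetch_cycles(), texture_fetch_cycles("trilinear", 64)]
    # 12 ALU cycles hide 1 + 4 texture cycles in this model.
    print(estimate_shader_cycles(12, fetches))
```

In this model a lookup only costs something once the summed fetch cycles exceed the ALU cycles, which is the sense in which a fetch can be "free". Cache misses and bandwidth limits are exactly what the model leaves out.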

As Humus pointed out, sampler lookups can often be free, so it can be a good idea to store complex functions as sampler lookups. You’ll have to evaluate your bottlenecks (keeping the target hardware in mind) before resorting to that.
Other than that, I have noticed that a lot of the time you can get away with interpolated results in the fragment shader (values computed per vertex and interpolated across the primitive) with little or no loss in accuracy. NVIDIA’s GPU Programming Guide is a good reference for the general rules of thumb when doing shader optimizations.

Thanks for all of that - very useful information and much appreciated advice.
