fast math on a vectors components



zed
02-01-2005, 04:04 PM
I have a vector, say
vec4 a = vec4(1.0, 2.0, 3.0, 4.0);

What's the best way of finding the product/sum of its parts, i.e.

float product = a.x * a.y * a.z * a.w;
float sum = a.x + a.y + a.z + a.w;

jwatte
02-01-2005, 07:09 PM
a.x+a.y+a.z+a.w is the same as a dot4 between a and vec4(1.0).

a.x*a.y*a.z*a.w is harder; you probably need to swizzle and multiply twice, i.e. a.xy = a.xz*a.yw; a.x *= a.y; It's unclear whether this is any faster than just writing out the expression.
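
A minimal GLSL sketch of both suggestions. The function names componentSum and componentProduct are made up for illustration and are not part of GLSL or any library:

```glsl
// Sum of the four components: a single dot product against a vector of ones.
float componentSum(vec4 a)
{
    return dot(a, vec4(1.0));
}

// Product of the four components using two multiplies and swizzles.
// The parameter is an "in" copy, so modifying it does not affect the caller.
float componentProduct(vec4 a)
{
    a.xy = a.xz * a.yw;   // a.x = x*y, a.y = z*w
    a.x *= a.y;           // a.x = (x*y) * (z*w)
    return a.x;
}
```

For a = vec4(1.0, 2.0, 3.0, 4.0) these give 10.0 and 24.0. Whether the swizzled product beats the written-out expression depends on the compiler and the hardware.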

zed
02-01-2005, 11:55 PM
Doh, yeah, should have gotten the dot product one.

Cheers, this is quicker (5 instructions less from a quick check):
a.xy = a.xz*a.yw; a.x *= a.y

[edit] Hmm, I thought 5 was a bit much at the time. It seems I forgot to multiply by another result, and since I wasn't using that result it was getting optimized away.

StefanG
02-02-2005, 12:59 AM
This reminds me of C compiler technology
from 20 years ago, when (a*b)*(c*d) could
compile to significantly faster code than
a*b*c*d on some platforms.

The GLSL compiler will probably never be very
good at optimising expressions, I guess, because
it has to be simple and quick enough to execute
entirely at application runtime.

There could be a need here for a code optimiser
to transform human authored GLSL code to some
more optimal GLSL code to hand feed the
compiler. Assembly should be a thing of the
past now when GLSL is here, but we still end
up exchanging ideas on how to hand feed the
compiler to trim down the number of assembly
level instructions for specific targets, so
there is definitely a need for better
optimisation tools here.

I thought I'd never say this, but in this
particular respect, the precompilation of
HLSL does seem like a better platform for
more complicated expression optimisations.

Humus
02-02-2005, 02:45 PM
Originally posted by zed:
doh, yeah should have gotten the dot product one

cheers, this is quicker (5 instructions less from a quick check)
a.xy = a.xz*a.yw; a.x *= a.y

I'd recommend this instead to make the swizzles more friendly with ATI cards:

a.xy *= a.wz;
a.x *= a.y;

Humus
02-02-2005, 02:59 PM
Originally posted by StefanG:
This reminds me of C compiler technology
from 20 years ago, when (a*b)*(c*d) could
compile to significantly faster code than
a*b*c*d on some platforms.

a*b*c*d is essentially a*(b*(c*d)). There's no parallelism possible there without breaking the C standard (I guess some compiler flag could allow that though). (a*b)*(c*d), on the other hand, allows a*b and c*d to be computed in parallel on superscalar FPUs, which could be up to 50% faster.


Originally posted by StefanG:
The GLSL compiler will probably never be very
good at optimising expressions, I guess, because
it has to be simple and quick enough to execute
entirely at application runtime.

Don't know about that. It's pretty good already. Yes, there are some corner cases where you need to tweak the code a bit for the compiler to see optimization opportunities, but most of the time the GLSL compiler does a very good job already.


Originally posted by StefanG:
I thought I'd never say this, but in this
particular respect, the precompilation of
HLSL does seem like a better platform for
more complicated expression optimisations.

Actually, HLSL precompilation is a problem. If HLSL just dumped raw unoptimized code, many shaders would actually run faster, as that would leave the work to the driver's optimizer, which knows more about what's optimal for the underlying hardware. When HLSL tries to optimize, it often means the real intent of the original shader is hidden from the driver.

Relic
02-02-2005, 11:22 PM
Originally posted by Humus:
a*b*c*d is essentially a*(b*(c*d)).

Sorry for nitpicking ;). It's ((a * b) * c) * d because "*" has left-to-right associativity.
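
To make the two groupings concrete, here is a small GLSL sketch. The function names are hypothetical and only serve to label the two forms:

```glsl
// Parsed as ((a * b) * c) * d: three multiplies,
// each one waiting on the result of the previous one.
float productChained(float a, float b, float c, float d)
{
    return a * b * c * d;
}

// Explicit pairing: a * b and c * d do not depend on each other,
// so only two dependent steps remain if the hardware can issue both at once.
float productPaired(float a, float b, float c, float d)
{
    return (a * b) * (c * d);
}
```

Both forms are equal mathematically, but floating-point rounding can differ slightly between them, which is one reason a strict compiler will not regroup the expression on its own.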

A good example for GLSL user optimizations is this:
vector = Matrix * Matrix * vector; // slow
vector = Matrix * (Matrix * vector); // fast

See the different instruction counts?
The first needs 20, the second only 8!
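
One plausible accounting for those numbers, assuming each output component costs one 4-component dot product (an assumption about the target, not something stated in the thread): Matrix * Matrix produces 16 components and the following matrix * vector adds 4 more, giving 20, while the parenthesised form is just two matrix * vector products, 4 + 4 = 8. A sketch with hypothetical function names:

```glsl
// Parsed as (m * m) * v: a full mat4 * mat4 product first,
// then a mat4 * vec4 product on the result.
vec4 transformTwiceSlow(mat4 m, vec4 v)
{
    return m * m * v;
}

// Two mat4 * vec4 products; no intermediate matrix is ever built.
vec4 transformTwiceFast(mat4 m, vec4 v)
{
    return m * (m * v);
}
```

The two functions compute the same transform mathematically, though the results may differ in the last bits because the floating-point operations happen in a different order.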

Humus
02-03-2005, 09:06 AM
Originally posted by Relic:
Sorry for nitpicking ;). It's ((a * b) * c) * d because "*" has left-to-right associativity.

Duh! *smacks forehead* :)

V-man
02-04-2005, 06:57 AM
Originally posted by Humus:
Actually, HLSL precompilation is a problem. If HLSL just dumped raw unoptimized code, many shaders would actually run faster, as that would leave the work to the driver's optimizer, which knows more about what's optimal for the underlying hardware. When HLSL tries to optimize, it often means the real intent of the original shader is hidden from the driver.

Are you sure about that?
It probably detects the hardware and does its best to optimize, which should be enough.
I don't really know, but D3D may even flag the shader to the driver as being already optimized.


Originally posted by Relic:
vector = Matrix * Matrix * vector; // slow
vector = Matrix * (Matrix * vector); // fast

The thought had crossed my mind. I assume the driver is or will be smart enough to reduce instructions.

Humus
02-04-2005, 06:37 PM
Originally posted by V-man:
Are you sure about that?

That's what the people working closely on this are saying. MS has been specifically asked not to try to optimize the shader for this reason, but they don't listen. The whole problem arises from the fact that DirectX uses assembly targets. If you compile against, say, ps2.0, the resulting shader must fit within the limits of that model. Unoptimized code for shaders of decent length can easily grow well past hardware limits, but after the driver optimizer has done its job it's a different story. But since there's no software rendering in DirectX and all shaders that compile are guaranteed to run in hardware, the compiler itself must optimize to try to make the shader fit, so that it can make that guarantee.
Now this isn't the only problem with targets either. Another problem is that functionality is lost. The X800 for instance supports the vFace register. It can't be used at all in DirectX because that goes under ps3.0. Fortunately, GLSL can use it with gl_FrontFacing. IMHO, it will become increasingly clear that the GLSL model is vastly superior to the HLSL model as we move forward.
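
For reference, a minimal fragment shader sketch using gl_FrontFacing. The two colours are arbitrary placeholders chosen for illustration:

```glsl
void main()
{
    // gl_FrontFacing is a built-in bool available in the fragment shader:
    // true when the fragment belongs to a front-facing primitive.
    vec4 frontColor = vec4(1.0, 0.0, 0.0, 1.0);  // placeholder: red
    vec4 backColor  = vec4(0.0, 0.0, 1.0, 1.0);  // placeholder: blue

    gl_FragColor = gl_FrontFacing ? frontColor : backColor;
}
```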

Relic
02-06-2005, 11:17 PM
Originally posted by V-man:
The thought had crossed my mind. I assume the driver is or will be smart enough to reduce instructions.

The compiler can't (shouldn't?) optimize this M*M*v because multiplication associates left to right, and there is no precedence based on the underlying data types, like "multiply M*v first", AFAIK.