Fast math on a vector's components

I have a vector, say
vec4 a = vec4(1.0, 2.0, 3.0, 4.0);

What's the best way of finding the product/sum of its parts, i.e.

float answer = a.x * a.y * a.z * a.w;
float answer = a.x + a.y + a.z + a.w;

a.x + a.y + a.z + a.w is the same as a 4-component dot product between a and vec4(1.0).

a.x * a.y * a.z * a.w is harder; you probably need to swizzle and multiply twice, i.e. a.xy = a.xz * a.yw; a.x *= a.y. It's unclear whether this is any faster than just writing out the expression.
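Putting the two tricks together, a minimal GLSL sketch (assuming a is the vec4 from above):

// Sum of components: dot product with vec4(1.0)
float sum = dot(a, vec4(1.0));

// Product of components: two swizzled multiplies
vec4 t = a;
t.xy = t.xz * t.yw;           // t.x = a.x*a.y, t.y = a.z*a.w
float product = t.x * t.y;    // (a.x*a.y) * (a.z*a.w)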

Doh, yeah, I should have gotten the dot product one.

Cheers, this is quicker (5 fewer instructions from a quick check):
a.xy = a.xz*a.yw; a.x *= a.y;

[edit] Hmmm, I thought 5 was a bit much at the time; it seems I forgot to multiply by another result, which, since I wasn't using it, was getting optimized away.

This reminds me of C compiler technology from 20 years ago, when (a*b)*(c*d) could compile to significantly faster code than a*b*c*d on some platforms.

The GLSL compiler will probably never be very good at optimising expressions, I guess, because it has to be simple and quick enough to execute entirely at application runtime.

There could be a need here for a code optimiser to transform human-authored GLSL code into more optimal GLSL code to hand-feed the compiler. Assembly should be a thing of the past now that GLSL is here, but we still end up exchanging ideas on how to hand-feed the compiler to trim down the number of assembly-level instructions for specific targets, so there is definitely a need for better optimisation tools here.

I thought I'd never say this, but in this particular respect, the precompilation of HLSL does seem like a better platform for more complicated expression optimisations.

Originally posted by zed:
Doh, yeah, I should have gotten the dot product one.

Cheers, this is quicker (5 fewer instructions from a quick check):
a.xy = a.xz*a.yw; a.x *= a.y;
I’d recommend this instead to make the swizzles more friendly with ATI cards:

a.xy *= a.wz;
a.x *= a.y;

Originally posted by StefanG:
This reminds me of C compiler technology from 20 years ago, when (a*b)*(c*d) could compile to significantly faster code than a*b*c*d on some platforms.

a*b*c*d is essentially a*(b*(c*d)). There's no parallelism possible there without breaking the C standard (I guess some compiler flag could allow that, though). (a*b)*(c*d), on the other hand, allows a*b and c*d to be computed in parallel on superscalar FPUs, which could be up to 50% faster.
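Written out explicitly, just as a sketch of the dependency chains (assuming a, b, c, d are plain floats):

// ((a*b)*c)*d: every multiply waits for the previous one (a chain of 3)
float serial = ((a * b) * c) * d;

// (a*b)*(c*d): a*b and c*d are independent and can issue in parallel (a chain of 2)
float parallel = (a * b) * (c * d);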

Originally posted by StefanG:
The GLSL compiler will probably never be very good at optimising expressions, I guess, because it has to be simple and quick enough to execute entirely at application runtime.

Don’t know about that. It’s pretty good already. Yes, there are some corner cases where you need to tweak the code a bit for the compiler to see optimization opportunities, but most of the time the GLSL compiler does a very good job already.

Originally posted by StefanG:
I thought I'd never say this, but in this particular respect, the precompilation of HLSL does seem like a better platform for more complicated expression optimisations.

Actually, HLSL precompilation is a problem. If HLSL just dumped raw unoptimized code, many shaders would actually run faster, as that would leave the work to the driver's optimizer, which knows more about what's optimal for the underlying hardware. When HLSL tries to optimize, it often means the real intent of the original shader is hidden from the driver.

a*b*c*d is essentially a*(b*(c*d)).
Sorry for nitpicking :wink: . It's ((a * b) * c) * d because "*" has left-to-right associativity.

A good example for GLSL user optimizations is this:
vector = Matrix * Matrix * vector; // slow
vector = Matrix * (Matrix * vector); // fast

See the difference in instruction count?
The first needs 20, the second only 8!
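In a vertex shader the same idea looks roughly like this (the uniform names are made up for illustration; the instruction counts are the ones from the comparison above):

uniform mat4 projMatrix;
uniform mat4 modelViewMatrix;

void main()
{
    // mat4 * (mat4 * vec4): two matrix-vector multiplies (~8 instructions)
    // instead of mat4 * mat4 followed by mat4 * vec4 (~20 instructions).
    gl_Position = projMatrix * (modelViewMatrix * gl_Vertex);
}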

Originally posted by Relic:
Sorry for nitpicking :wink: . It’s ((a * b) * c) * d because “*” has left-to-right associativity.
Duh! smacks forehead :slight_smile:

Actually, HLSL precompilation is a problem. If HLSL just dumped raw unoptimized code, many shaders would actually run faster, as that would leave the work to the driver's optimizer, which knows more about what's optimal for the underlying hardware. When HLSL tries to optimize, it often means the real intent of the original shader is hidden from the driver.
Are you sure about that?
It probably detects the hardware and does its best to optimize, which should be enough.
I don't really know, but D3D may even flag the shader to the driver as being already optimized.

vector = Matrix * Matrix * vector; // slow
vector = Matrix * (Matrix * vector); // fast

The thought had crossed my mind. I assume the driver is or will be smart enough to reduce instructions.

Originally posted by V-man:
Are you sure about that?
That's what the people working closely on this are saying. MS has been specifically asked not to try to optimize the shader for this reason, but they don't listen. The whole problem arises from the fact that DirectX uses assembly targets. If you compile against, say, ps2.0, the resulting shader must fit within the limits of that model. Unoptimized code for shaders of decent length can easily grow well past the hardware limits, but after the driver optimizer has done its job it's a different story. But since there's no software rendering in DirectX and all shaders that compile are guaranteed to run in hardware, the compiler must do the optimization itself to be able to make that guarantee.
Now this isn’t the only problem with targets either. Another problem is that functionality is lost. The X800 for instance supports the vFace register. It can’t be used at all in DirectX because that goes under ps3.0. Fortunately, GLSL can use it with gl_FrontFacing. IMHO, it will become increasingly clear that the GLSL model is vastly superior to the HLSL model as we move forward.
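For reference, a minimal fragment shader sketch using gl_FrontFacing (the uniform names are made up):

uniform vec4 frontColor;
uniform vec4 backColor;

void main()
{
    // gl_FrontFacing is a built-in bool, true for front-facing fragments
    gl_FragColor = gl_FrontFacing ? frontColor : backColor;
}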

Originally posted by V-man:
The thought had crossed my mind. I assume the driver is or will be smart enough to reduce instructions.
The compiler can't (shouldn't?) optimize this M*M*v because the evaluation order is left-to-right and there is no precedence based on the underlying data types, like "multiply M*v first", AFAIK.

Super old thread, but it comes up sometimes. I just want to mention here that the optimizer may not opt to perform the Matrix * vector part first. Mathematical matrix-matrix and matrix-vector multiplications are associative, but floating-point operations on a physical processor generally aren't, due to limited precision, so the result will be different. Whether it's an unnoticeable difference or something that breaks the experience depends on the specific numbers and ranges involved, and the tolerance for artifacts. One place where it would surely break the experience is if the programmer implemented some workaround to increase numeric precision, e.g. combining two floating-point vectors/arrays to take care of range (large values) and precision (small values); multiple such approaches exist. These rely on specific operation orders, and it would be bad if the optimizer reordered things on the pretense that 32-bit GPU FP registers are a sufficient representation of real numbers.
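A concrete example of such an order-dependent trick is Kahan-style compensated summation, sketched here in GLSL (the function name is made up); if an optimizer reassociated these additions as if they were exact real-number arithmetic, the correction term would algebraically cancel to zero and the extra precision would be lost:

float kahanSum(vec4 v)
{
    float sum = 0.0;
    float c = 0.0;               // running compensation for lost low-order bits
    for (int i = 0; i < 4; ++i)
    {
        float y = v[i] - c;      // apply the correction from the previous step
        float t = sum + y;       // low-order bits of y can be lost here
        c = (t - sum) - y;       // recover what was lost; exactly zero in real arithmetic
        sum = t;
    }
    return sum;
}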
