The MAD instruction

I’m not sure if anyone will be able to answer this one.

Does the MAD instruction take 1 cycle to execute?
How come, and what hardware is used?

Will the result of the MAD instruction be perfectly identical to MUL followed by an ADD?
MAD will be more precise, correct?

Although the MAD instruction comes in handy, why aren’t there other combined instructions, such as

ADM - add and multiply
MPOW - multiply and compute power

and so on.

I would like to know why MAD is available for “our” programmable GPUs, but not available for the x86?

I’m aware MAD is available for other platforms.

*** Basically, I want to know what is so special about it? ***

Well, on most GPUs (if not all) a multiplication and an addition can be done in one cycle. AFAIK this is because every shader unit (I don’t know what to call it) consists of two units, one for additions and one for multiplications. Or so I understood from NVIDIA’s and ATI’s slides.
So, to take advantage of this, there is this MAD instruction.

If you simply use a MUL and an ADD the compiler might be able to optimize it into a MAD, but of course the compiler has to be a bit more intelligent for that.

Therefore the result of MAD will be exactly the same precision as a MUL and an ADD.

There are no other combinations, since it wouldn’t be possible to speed them up this way. If you do a MUL and then an ADD, that’s possible to do in one cycle; other stuff is not possible. That’s the only reason for MAD to exist.

It’s not available for the x86, since the GPU is a parallel processor and can therefore do this one special operation in one cycle, whereas the x86 isn’t and therefore is unable to take advantage of such operations.

Jan.

Multiply and add is a common operation in many digital signal processing algorithms. It is used so frequently that it makes sense to optimise it.

Other combinations of instructions aren’t used anything like as much, so it’s not worth optimising them.

It’s not just useful for graphics either - old style DSPs and some general CPUs, e.g. ARM, also have similar instructions.

dave j

It is common, but so are others. It’s all I’m saying!

It’s also interesting to note that in D3D’s ps 1.x versions, there were opportunities to combine operations by adding modifiers to instructions: “mult by 2”, “mult by 4”, “mult by 8”, “div by 2”, “div by 4”, “div by 8”.
These were removed in ps 2.0 and above.

Some of the texture instructions could do complex operations as well. Those were also removed in ps 2.0 and above, I guess because they were of little use.

Can we force the compiler to accept these? Kind of the way you can add assembly to C programs with VC? I’m currently working more with Cg than GLslang for the moment… But there are indeed more candidates than MAD. Things like TEXC doing a texture lookup and compare in one cycle

If it is supported in hardware it’s possible that a vendor’s compiler would optimize it into one instruction (or a few pipelined instructions). Generally I wouldn’t count on it (multiple aligned shading units are just coming into existence now).

Plus if you want to do sampling then compare, you should just use a shadow map (restricted to only one channel I believe…), which there are functions for.

Originally posted by vmh5:
Can we force the compiler to accept these? Kind of the way you can add assembly to C programs with VC? I’m currently working more with Cg than GLslang for the moment… But there are indeed more candidates than MAD. Things like TEXC doing a texture lookup and compare in one cycle
Inlining assembly would raise the question of what happens to the registers across the boundary.
I think it would complicate (Nvidia’s) compiler further.

Originally posted by V-man:
[b]I’m not sure if anyone will be able to answer this one.

Does the MAD instruction take 1 cycle to execute?
How come, and what hardware is used?[/b]
I don’t know about current GPU implementations, but for CPUs it is a big win, especially from a latency-oriented perspective. On a PowerPC 970, an fmadd (in one of its four forms: ax+b, -ax+b, ax-b, -ax-b) takes 6 cycles; compare that with the 10 cycles required for a dependent mul/add pair on a P4.


Will the result of the MAD instruction be perfectly identical to MUL followed by an ADD?
MAD will be more precise, correct?

The MAD will be more precise. The purpose of the MAD instruction is to skip the normalize/round step otherwise needed between a dependent mul/add pair, which lowers the latency. This in turn increases the accuracy of the instruction, since the full-precision result of the multiplication is added to the third operand, and only then is the combined result normalized and rounded.
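
As a rough illustration of the single-rounding point (a sketch only, not how any particular GPU or CPU implements it), the Python below emulates a fused MAD by forming a*b+c exactly with `fractions.Fraction` and rounding once, then compares it with a separate multiply and add on ordinary doubles. The function names `mul_then_add` and `fused_mad` and the sample operands are made up for the example.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Separate MUL and ADD: the product is rounded, then the sum is rounded."""
    product = a * b          # first rounding
    return product + c       # second rounding


def fused_mad(a, b, c):
    """Emulated fused MAD: a*b+c is formed exactly, then rounded once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # single rounding


# Hypothetical operands: the exact product is 1 - 2**-60, which does not fit in a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = mul_then_add(a, b, c)
fused = fused_mad(a, b, c)
print(separate, fused)
```

With these operands the separate version rounds the product to 1.0 before the add and returns 0.0, while the single-rounding version keeps the low bits and returns -2**-60.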

[b]
Although the MAD instruction comes in handy, why aren’t there other combined instructions, such as

ADM - add and multiply
MPOW - multiply and compute power

and so on.

[/b]

Because MAD instructions are used everywhere you use linear algebra, linear approximations and so on, which probably accounts for 90% of the code, so it is the only one worth a hardware implementation.
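
To make the linear-algebra point concrete, here is a small Python sketch. The helper names `mad`, `dot`, `horner` and `lerp` are invented for the example, and `mad` is only a stand-in for what would be one MAD instruction on hardware that has it. A dot product, a polynomial evaluated with Horner's rule, and a linear interpolation all reduce to chains of a*b+c:

```python
def mad(a, b, c):
    """Stand-in for one hardware MAD: returns a*b + c."""
    return a * b + c


def dot(xs, ys):
    """Dot product as one MAD per component, accumulating into acc."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = mad(x, y, acc)
    return acc


def horner(coeffs, x):
    """Evaluate a polynomial (highest-degree coefficient first), one MAD per coefficient."""
    acc = 0.0
    for k in coeffs:
        acc = mad(acc, x, k)
    return acc


def lerp(a, b, t):
    """Linear interpolation a + t*(b - a): one subtract and one MAD."""
    return mad(t, b - a, a)


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 1*4 + 2*5 + 3*6
print(horner([2.0, 0.0, 1.0], 3.0))            # 2*x**2 + 1 at x = 3
print(lerp(10.0, 20.0, 0.25))                  # a quarter of the way from 10 to 20
```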

[b]
I would like to know why MAD is available for “our” programmable GPUs, but not available for the x86?

I’m aware MAD is available for other platforms.
[/b]
MAD instructions need four operands: three sources and one destination (this can be reduced to three by overwriting one of the source operands with the result). Historically, x86 instructions have had, and still have, a two-operand format.


*** Basically, I want to know what is so special about it? ***

It is widely used in fp code and easy to implement in hardware.

I have sometimes wished for an ADM instruction though, at least back when I was still doing my shaders in asm. Even though a * b + c is more common, the (a + b) * c case is pretty common too.
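
One partial workaround, sketched here in Python with made-up names: when b and c are constants, (a + b) * c can be expanded to a*c + b*c and the product b*c folded into a single constant ahead of time, so the remaining per-value work fits one MAD. This is an assumption about how one might restructure the expression by hand, not something any particular compiler is known to do, and the two forms can round differently in floating point.

```python
def adm(a, b, c):
    """The wished-for ADM: (a + b) * c, i.e. an ADD followed by a MUL."""
    return (a + b) * c


def make_adm_as_mad(b, c):
    """Precompute b*c once so that each later call is a single a*c + bc."""
    bc = b * c               # folded constant, computed ahead of time

    def apply(a):
        return a * c + bc    # one MAD per call

    return apply


# Hypothetical bias = 1.0 and scale = 0.5, chosen so both forms are exact.
scale_bias = make_adm_as_mad(1.0, 0.5)
print(adm(3.0, 1.0, 0.5), scale_bias(3.0))
```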

Btw, x86 has a few instructions that have more than two operands. There is for instance a three-operand IMUL instruction.

Thanks for all the responses.
I was asking this because I see a lot of research has been happening on this, and still is.
In fact, I’m wondering whether some of these x86 CPUs have it and are silently performing MAD behind our backs.

Originally posted by V-man:
Thanks for all the responses.
I was asking this because I see a lot of research has been happening on this, and still is.
In fact, I’m wondering whether some of these x86 CPUs have it and are silently performing MAD behind our backs.

Unfortunately they aren’t, for a few reasons. Fusing a mul/add pair into a madd instruction can change (though slightly) the precision of the result and can potentially break code relying on exact precision (this is why, even on architectures supporting fmadd, the compiler doesn’t generate it unless you ask for it). Also, if the mul part of the instruction causes an exception, you wouldn’t be able to see it until the end of the madd, i.e. after the add, which would lead to imprecise exceptions.

Originally posted by crystall:
[quote]Originally posted by V-man:
Thanks for all the responses.
I was asking this because I see a lot of research has been happening on this, and still is.
In fact, I’m wondering whether some of these x86 CPUs have it and are silently performing MAD behind our backs.

Unfortunately they aren’t, for a few reasons. Fusing a mul/add pair into a madd instruction can change (though slightly) the precision of the result and can potentially break code relying on exact precision (this is why, even on architectures supporting fmadd, the compiler doesn’t generate it unless you ask for it).
[/QUOTE]It’s getting off-topic, but I’d like to add that the “too much precision” problem already happens on x87: if you don’t spill to memory, the FPU always works internally at full 80-bit precision, and the compiler won’t do anything to prevent it unless you compile in “floating point safe” mode (“preserve floating point consistency” in VC++, for example).

There’s a precision control field on x87, but it doesn’t shorten the exponent field, only the number of significant mantissa bits, so you may have overflows/underflows and not notice until/unless you spill.

For any fp-related topic, check Dr. Kahan’s page. The Lecture Notes on IEEE754 even have an entry on MAC (Multiply and Accumulate, or “fused multiply and add”). I recommend that the original poster read it for the whole story.

A similar case where loss of precision happens is for reciprocal divisions, where to divide by something, the compiler normally does a multiply by the reciprocal of the divisor (this optimization is also disabled when you enable fp consistency).
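
A tiny Python sketch of that reciprocal case (the function names are invented for the example, and Python floats are IEEE doubles rather than whatever precision a given shader uses): dividing directly rounds once, while multiplying by a precomputed reciprocal rounds twice, and the results can differ in the last bit.

```python
def divide(x, d):
    """Plain division: the quotient is rounded once."""
    return x / d


def divide_via_reciprocal(x, d):
    """What the optimization does: round 1/d, then round the product."""
    r = 1.0 / d      # rounded once here
    return x * r     # and rounded again here


print(divide(3.0, 10.0))                  # 0.3
print(divide_via_reciprocal(3.0, 10.0))   # 0.30000000000000004
```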

The bottom line is, leave those optimisations to your compiler (assembler for shaders is a misnomer, they are compiled & optimised as well) and never expect exact results from fp calculations.
