EX2 instruction in ARB_vertex_program

In the ARB_vertex_program spec, regarding interaction with NV_vertex_program, it says: “EX2: exponential base2. On VP1.0 and VP1.1 hardware, this instruction will be emulated using EXP and a number of additional instructions.”
So this is apparently the same functionality as the exp macro in the DirectX vertex shader assembly language.

Can anyone tell me what exactly this emulation of EX2 looks like in the form of an NV_vertex_program program (or how exp is emulated using expp in DirectX)?
Thanks.

I think you’ll find it on nvidia.com under the subject “emulating missing instructions” or something like that…

Thanks, great tip. I found a paper by Matthias Wloka, “How to Implement Missing Vertex Shader Instructions”, and it has this DirectX vertex shader snippet about a high-precision EXP2:

DEF c0, 1.00000000, -6.93147182e-1, 2.40226462e-1, -5.55036440e-2
DEF c1, 9.61597636e-3, -1.32823968e-3, 1.47491097e-4, -1.08635004e-5

EXP r0.xy, r1.z
DST r1, r0.y, r0.y
MUL r1, r1.xzyw, r1.xxxy ; 1,x,x^2,x^3
DP4 r0.z, r1, c0
DP4 r0.w, r1, c1
MUL r1.y, r1.z, r1.z ; compute x^4
MAD r0.w, r0.w, r1.y, r0.z ; add the first to the last 4 terms
RCP r0.w, r0.w
MUL r0.z, r0.w, r0.x ; multiply by 2^I

However, as can be seen, it uses the EXP macro instruction in DirectX (which is apparently a mistake…shouldn’t that be EXPP?).

Anyway, I guess there’s no way of emulating EX2 without using constant registers to store the coefficients of the Taylor series, is there?
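
For anyone who wants to check what the snippet above computes, here is a minimal Python sketch that follows it step by step in double precision. It assumes EXP returns 2^floor(x) in .x and the fractional part in .y, and that the eight constants form a polynomial approximating 2^(-f) on [0, 1), which the final RCP inverts. The function name `exp2_emulated` is made up for this sketch, and real hardware works in single precision, so its error will be larger than what this shows.

```python
import math

# Coefficients copied from the DEF c0 / DEF c1 lines of the snippet above.
C0 = (1.00000000, -6.93147182e-1, 2.40226462e-1, -5.55036440e-2)
C1 = (9.61597636e-3, -1.32823968e-3, 1.47491097e-4, -1.08635004e-5)


def exp2_emulated(x):
    """Mirror the shader's steps for a scalar x (hypothetical helper)."""
    i = math.floor(x)                  # EXP .x corresponds to 2^floor(x)
    f = x - i                          # EXP .y is the fraction, 0 <= f < 1
    powers = (1.0, f, f * f, f * f * f)            # DST + MUL: 1, f, f^2, f^3
    low = sum(c * p for c, p in zip(C0, powers))   # DP4 with c0
    high = sum(c * p for c, p in zip(C1, powers))  # DP4 with c1
    poly = high * f ** 4 + low         # MAD: degree-7 polynomial for 2^(-f)
    return (1.0 / poly) * 2.0 ** i     # RCP, then MUL by 2^floor(x)
```

Comparing `exp2_emulated(x)` against `2.0 ** x` for a few positive and negative inputs is a quick way to confirm that the constants really are a fit for 2^(-f) and not for 2^f.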

This instruction is hardly unique in that respect. Another such instruction is XPD; I doubt any vendor can implement XPD without at least 2 internal instructions and possibly an extra temporary.

Now, I’m of the opinion that it’s better to stick to a RISC-like model here and to thus not provide any such “composite” instructions or “macros”, and that if you want the convenience, you should just be using a HLL in the first place.

I suppose that’s just how all these (major) ARB specs end up: design by committee leaves no one 100% satisfied with the result.

  • Matt

Matt,
I agree and I’d also prefer a model that’s as close to the hardware as possible.

But I have another question. If DirectX or the OpenGL drivers have to emulate EX2 or LG2 using constant registers to store the coefficients of the Taylor series, what happens if I have a vertex program that uses all 96 program registers? Will that work? Or are there some additional hardware program registers that the driver can use? Or will the driver have to insert two MOVs at the beginning of the program to move the constants to temporaries? But then, what happens with a vertex program that also uses all 12 temporary regs? I’d really like to know how this works internally.
Thanks.

If we can’t fit the program, we need to emulate it.

A similar issue is that ARB_v_p lets you use more than one program parameter in a given instruction, and more than one vertex attribute. NV_v_p has always restricted you to using only one of each of these registers per instruction.

The reason? It’s very, very expensive to build multiport RAMs in hardware. We can afford to have several ports on the (temporary) register file, but to put multiple ports on the constant and input RAMs would be grossly wasteful of silicon.

The analogy I would make here is again RISC. There are CPUs out there that give you instruction sets where you can get both sources from memory, and write the output to memory also. But essentially all CPUs designed in recent years don’t support any such memory-to-memory operations natively; they decompose them to load instructions, ALU instructions, and store instructions. (On x86, add [eax], ebx effectively becomes 3 such instructions. add eax, [ebx] is 2.)

Again, I would have preferred that ARB_v_p try to keep those restrictions in place, since all HW vendors face the same issues with multiport RAMs being expensive, but I was overruled.

So, in such cases, again, our driver will break up those “CISC” instructions into “RISC” instructions.

  • Matt
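
The “CISC to RISC” splitting described above can be sketched roughly as follows. This is only an illustration of the idea, not how any actual driver is written: the tuple format, the `c[n]` / `R<n>` spellings, and the `first_free_temp` parameter are all assumptions made up for this sketch. It copies every additional constant into a temporary with a MOV so that each instruction reads at most one distinct constant register.

```python
def split_multi_constant(instructions, first_free_temp=0):
    """Rewrite instructions so each reads at most one distinct constant.

    Each instruction is (opcode, dest, [sources]). Constants are spelled
    'c[n]' and temporaries 'R<n>'; these conventions are hypothetical.
    """
    out = []
    temp = first_free_temp
    for opcode, dest, sources in instructions:
        seen = None     # the one constant this instruction may read directly
        copies = {}     # extra constant -> temporary holding its copy
        new_sources = []
        for src in sources:
            if src.startswith("c["):
                if seen is None or src == seen:
                    seen = src
                else:
                    if src not in copies:
                        copies[src] = f"R{temp}"
                        out.append(("MOV", copies[src], [src]))
                        temp += 1
                    src = copies[src]
            new_sources.append(src)
        out.append((opcode, dest, new_sources))
    return out
```

For example, a MAD that reads c[0], c[1] and c[2] comes out as two MOVs followed by a MAD that reads c[0] and two temporaries, which is also why such instructions cost extra slots and temporaries.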

I still don’t understand how EX2 emulation works in the driver.
Say I use relative addressing in my vertex program. When the driver compiles the program, it can determine which parameter registers the program uses, except for the ones I access via relative addressing. I can load anything into the address register, so I could reach all 96 registers simply by loading the address register from, for example, c[0], which I change for each vertex from 0 to 95.
So how can the driver know which program parameter registers are safe to put the required constants in?

Originally posted by mcraighead:
[b]This instruction is hardly unique in that respect. Another such instruction is XPD; I doubt any vendor can implement XPD without at least 2 internal instructions and possibly an extra temporary.

Now, I’m of the opinion that it’s better to stick to a RISC-like model here and to thus not provide any such “composite” instructions or “macros”, and that if you want the convenience, you should just be using a HLL in the first place.

I suppose that’s just how all these (major) ARB specs end up: design by committee leaves no one 100% satisfied with the result.

  • Matt[/b]

It is a non-goal to make any individual 100% satisfied.

FWIW, the EX2 instruction is not a macro on several implementations. (Including one recently documented by NVIDIA.)

-mr. bill

Asgard,

The case you describe with the constants is exactly why array declarations were added to the spec. That allows implementations to know how many temps you intend to use. It allows the driver to alloc extra memory it might need or use resources optimally to improve the transform or state transition rate.

In other words, don’t ask for everything unless you need it. You wouldn’t alloc 2GB worth of memory in your app just because windows says you can.

-Evan
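
To make the point about declarations concrete, here is a small sketch of how declared ranges could tell a driver which parameter slots are free for its own emulation constants. It is purely a hypothetical model: the function name, the inclusive `(first, last)` range format, and the idea of returning `None` when nothing is free are assumptions for illustration, not a description of any real driver.

```python
MAX_PARAMS = 96  # parameter register count mentioned earlier in the thread


def free_param_slots(declared_ranges, needed):
    """Return `needed` slot indices outside every declared range, or None.

    declared_ranges is a list of inclusive (first, last) pairs standing in
    for the program's parameter array declarations (hypothetical model).
    """
    used = set()
    for first, last in declared_ranges:
        used.update(range(first, last + 1))
    free = [i for i in range(MAX_PARAMS) if i not in used]
    return free[:needed] if len(free) >= needed else None
```

With a declaration covering slots 0 to 63, two free slots are easy to find even if the program uses relative addressing, because the addressing is bounded by the declared array. If the program declares all 96, nothing is left and the driver has to fall back to some other strategy.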

Of course EX2 need not be a macro. I have no problem with EX2. I just have an expectation that certain sorts of instructions are unlikely to ever run in one native instruction…

  • Matt

Bill,

I’m not suggesting that the ARB’s goal should be to make one person totally satisfied, or to satisfy everyone.

However, I do have a general distaste for design by committee and the outcomes that result from it. It’s my general opinion that the best designs are done by a single person. That’s not to say that the designer should be me, or that designers should consider no input, or that there should be no discussions.

A common scenario is that there are two decisions A and B, which can each be decided as options 1 or 2. If you pick A1, B1 makes more sense than B2; A2 goes better with B2 than B1; and so on, i.e., there are only two design alternatives that make sense. The committee proceeds to vote for adopting A1 and B2.

In my particular case, “RISC, not VAX” was one of the principles I would have liked to have seen upheld. It wasn’t; oh well.

  • Matt

Matt,

You are right that certain instructions may never be implemented in a single HW instruction. This seems out of sync with a RISC design. On the other hand, IIRC from my asm courses, most RISC assemblers have pseudo-instruction macros and hidden branch delay slots built in. (I am 90% certain the MIPS ones do.) These serve to help programmers with overly common tasks that might otherwise be error-prone.

-Evan

Originally posted by mcraighead:
Now, I’m of the opinion that it’s better to stick to a RISC-like model here and to thus not provide any such “composite” instructions or “macros”, and that if you want the convenience, you should just be using a HLL in the first place.

So you think a general high-level language, which gets optimally compiled by the driver, is actually good? So you like GL2?

Evan,

Thanks for your explanation. I’m still not too familiar with ARB_v_p and therefore didn’t think of the declarations. It actually makes good sense having declarations (and is cleaner too, IMO).

This might be the wrong forum, but does somebody know how DirectX handles the EXP macro then (with regard to the scenario I mentioned in my last post above)? The EXP macro seems to be the same thing as EX2, and in DirectX there are no declarations as in ARB_v_p. So how can DirectX expand the EXP macro?

Just to clear up why I’m so persistent about this matter: I’m writing (or have already finished) various cross-translators that, for example, translate NV_v_p programs to ARB_v_p, or DirectX to NV_v_p/ARB_v_p, or NV_v_p to DirectX…
So far everything works fine, except for when translating DirectX vertex shaders to NV_v_p because I’m not sure how I can, in a general way, translate the EXP macro (except for maybe using the “regular” EXP instruction and possibly replicating the .z component into the destination with an additional instruction).

Thanks.

[This message has been edited by Asgard (edited 09-11-2002).]

The EXP macro/instruction in DX is documented to take only instruction slots, not constants. I can’t remember the number of instructions off the top of my head.

-Evan

Something like 12 or so instructions. But what does that expansion look like using only regular instructions? I did lots of searching on Google already, but all I can find are solutions that use the Taylor series, which obviously requires constants.

Is it correct that EX2 will compile to one instruction on the 8500, whereas EXP will compile to more than one? For GeForce3/4Ti this appears to be the opposite.

Edit: Assuming you don’t use two or more unique constants with the instruction.

[This message has been edited by PH (edited 09-11-2002).]

Yes, this is true. The full precision version does not have side effects. The reason the 8500 can generate an extra instruction is ensuring that the side effects are all correct. I believe the driver presently does not expand it out in all cases as sometimes the other components are not needed.

-Evan

Originally posted by ehart:
[b]Yes, this is true. The full precision version does not have side effects. The reason the 8500 can generate an extra instruction is ensuring that the side effects are all correct. I believe the driver presently does not expand it out in all cases as sometimes the other components are not needed.

-Evan[/b]

If you only need the estimate, use

EXP dst.z, src.x;

I would assume that ATI or any other vendor doesn’t generate any extra instructions for EXP if you only use the component containing what EX2 would emit.

That’s probably your best bet for best performance on both pre-NV30 NVIDIA cards and on ATI.
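
As a quick illustration of why masking to .z is enough, here is a small model of what EXP writes to each component and how a write mask selects among them. It uses exact math for the z component, whereas the hardware only produces an approximation there; the helper names are made up for this sketch.

```python
import math


def exp_instruction(s):
    """Model the four EXP result components for a scalar input s.

    x = 2^floor(s), y = s - floor(s), z = 2^s (exact here, approximate
    on hardware), w = 1.
    """
    i = math.floor(s)
    return (2.0 ** i, s - i, 2.0 ** s, 1.0)


def apply_write_mask(dest, result, mask):
    """Copy only the components named in mask (e.g. 'z') into dest."""
    return tuple(r if n in mask else d for d, r, n in zip(dest, result, "xyzw"))
```

With a mask of `"z"`, only the 2^s estimate reaches the destination, so an implementation has no need to produce the x, y and w side results at all.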

I know there is a lot of optimization done with EXT_vertex_shader depending on how the output is masked. Also, several instructions are indeed generated for instructions that source more than one unique constant (and vertex attrib, I believe). The component extraction has room for some more optimizations (useful for EXP-type instructions that work on floats rather than vec4s). I’m not sure if this is relevant to the ARB extension, but it’s still interesting, I think.