In the future, the hardware will do automatic load balancing between fragment shaders & vertex shaders (see the WinHEC WGF spec by David Blythe from MS), so it will be even more difficult to predict the performance of a shader beforehand.
Kinda off-topic, but that requires a (fairly stupid, in my opinion) hardware implementation where vertex and fragment shader units are actually the same physical pieces of hardware. Microsoft is not going to force hardware vendors to do this; they are simply allowing for the possibility of such an implementation.
- There doesn’t need to be a native microcode LIT instruction. The microcode instructions LIT translates into may depend on scheduling opportunities across the whole shader.
- There doesn’t need to be a native microcode EXP instruction. The microcode instructions EXP translates into may depend on scheduling opportunities across the whole shader.
- Even if there are native LIT & EXP instructions, the compiler will change your LITs to EXPs if that’s what’s better for the hardware, or vice versa.
- The running time of LIT & EXP may depend on external factors, like how many registers your program consumes or the load balancing between vertex & fragment shaders.
1: Irrelevant. In general, there is a minimum speed that can be expected from any particular sequence of operations.
2: See above
3: Also irrelevant. The question is not, “In the context of shader X, what is the performance?” The question is, “In general, what is the expected performance?”
4: External factors are irrelevant. When you want to know the general idea of the running time of, say, fcos(), you aren’t interested in the time spent waiting on a possible cache miss for the input parameter. The user has no control over that. What the user has control over is whether they use fcos() or something else.
So it doesn’t matter whether you use LIT or EXP. And even if it does, it may not matter for the next driver rev. with a better optimizing compiler.
The question remains valid, however. Observe.
I have two functions in the C standard library: sqrt and cos. Which one is, in general, faster?
The correct answer is sqrt. Yes, it is entirely possible that cos could be faster (for example, a CPU that has a native COS opcode, but no native SQRT equivalent, or a stdlib implementation that doesn’t use the native SQRT but does use the native COS). However, that doesn’t nullify the question, because the question isn’t interested in a specific case.
[edit]
“trust your driver’s optimizations”
As an aside, I would like to point out that we can barely trust OpenGL drivers to work correctly at all, let alone have a decent compiler/optimizer. There are still cases in ATi’s ARB_fp where it will add more dependent reads than the shader specifies.