In the future, the hardware will do automatic load balancing between fragment shaders & vertex shaders (see the WinHEC WGF spec by David Blythe from MS), so it will be even more difficult to predict the performance of a shader beforehand.
Kinda off-topic, but that requires a (fairly stupid, in my opinion) hardware implementation where vertex and fragment shader units are actually the same physical pieces of hardware. Microsoft is not going to force hardware vendors to do this; they are simply allowing for the possibility of such an implementation.
- There doesn’t need to be a native microcode LIT instruction. The microcode instructions LIT translates into may depend on scheduling opportunities across the whole shader.
- There doesn’t need to be a native microcode EXP instruction. The microcode instructions EXP translates into may depend on scheduling opportunities across the whole shader.
- Even if there are native LIT & EXP instructions, the compiler will change your LITs to EXPs if that’s what’s better for the hardware, or vice versa.
- The running time of LIT & EXP may depend on external factors, like how many registers your program consumes or the load balancing between vertex & fragment shaders.
1: Irrelevant. In general, there is a minimum speed that can be expected from any particular sequence of operations.
2: See above
3: Also irrelevant. The question is not, “In the context of shader X, what is the performance?” The question is, “In general, what is the expected performance?”
4: External factors are irrelevant. When you want to know the general idea of the running time of, say, fcos(), you aren’t interested in the time spent waiting on a possible cache miss for the input parameter. The user has no control over that. What the user has control over is whether they use fcos() or something else.
So it doesn’t matter whether you use LIT or EXP. And even if it does, it may not matter for the next driver rev. with a better optimizing compiler.
The question remains valid, however. Observe.
I have two functions in the C standard library: sqrt and cos. Which one is, in general, faster?
The correct answer is sqrt. Yes, it is entirely possible that cos could be faster (for example, a CPU that has a native COS opcode, but no native SQRT equivalent, or a stdlib implementation that doesn’t use the native SQRT but does use the native COS). However, that doesn’t nullify the question, because the question isn’t interested in a specific case.
[edit]
“trust your driver’s optimizations”
As an aside, I would like to point out that we can barely trust OpenGL drivers to work correctly at all, let alone have a decent compiler/optimizer. There are still cases in ATi’s ARB_fp where it will add more dependent reads than the shader specifies.