VAR: FF vs Shader Performance

Hello everyone. While adding support for NV_vertex_array_range to my application, I noticed that vertex programs were slower (2-3x) than the fixed-function pipeline on a GeForce4 Ti 4400 (40.72 drivers). The vertex program computes a single diffuse/specular light, the same as the fixed-function setup. In both cases the geometry is static with no use of fences or FlushVertexArrayRangeNV, I follow the documented requirements for VAR, and the vertex array range is valid.
Anybody find similar results? Thanks in advance.

Grell

The fixed-function pipeline goes through specialized hardware designed to run as fast as possible. The programmable pipeline can do the same stuff, but it is not going to be as fast. nVidia did it this way so that they could still be really fast on existing apps, but be able to run new cool effects in newer applications. You have to do both for a video card to sell well.

It is well-documented and well-known that you should not write a vertex program for anything that can be done in the fixed-function pipeline: in the worst case the driver implements the fixed function on the programmable pipeline anyway, and in the best case there is custom hardware that runs it even faster.

Yes, I am aware of the onboard implementation of the fixed-function vs. programmable pipeline on the GeForce series of cards. I would not expect much difference on the Radeon 9700, since it emulates the fixed-function pipeline with shaders.
The reason I used such a simple vertex program was to have some level of parity for my benchmarking. I guess I should clarify my question. I knew there would be a performance difference, but should it be of such a great magnitude? I could see maybe 10-15%, but 2-3x seems large.

I could see maybe 10-15% but 2-3x seems large.

Sure, that’s entirely possible.

The fixed function pipeline may take as little as 2-3 cycles per vertex.

Let’s say your shader program was 12 instructions long. On a GeForce4, that’s approximately 6 cycles per vertex. Hence, the fixed function path can be 2x-3x faster.

Because of the longer setup times, if you aren’t using VAR, you won’t notice as much of a speed difference.

Ah! That makes total sense. Thank you.

Grell

Fixed function cheats really badly. Basically it does the two big transforms (the vertex and the normal with the modelview) at the same time, then lighting pretty much in one big operation. Specialized hardware is always faster. There really isn’t any form of “pipes” with vertex programs; by pipes I mean being able to do multiple instructions at the same time (assuming, of course, that the instructions are independent of each other). This is for simplicity. Current x86 processors do a ton of branch prediction, instruction reordering and register renaming; the instruction fetch and decode units are very complicated. It simply isn’t worth the time or effort to do that for a vertex program. In general, in a vertex program most operations are dependent. About the only time this doesn’t occur is when implementing the FF in the program or doing skeletal character animation, which is largely independent short of the weighted summing, which could be done at the end. Not enough temp registers.

Devulon

[This message has been edited by Devulon (edited 12-17-2002).]

I’d guess that your vertex program isn’t as optimized as it could be… this is a “30%” sort of effect, not a “2x-3x” effect.

  • Matt

This is from the DX mailing list:

"On the latest DirectX 9 hardware you’ll find it’s generally faster to use
shaders. The difference may be small, but it’s there, and that’s a clear
indication of the way to run future hardware.

Plus you usually know things that the fixed function dare not assume (like
uniform scales etc). This means that you can often beat the fixed function
route for efficiency by quite a good margin.

My general recommendation is to use the programmable pipeline for
everything. Don’t switch back…

Thanks,

Richard “7 of 5” Huddy
European Developer Relations Manager, ATI"

And

"On the R300 series the FFP is implemented by a vertex shader that the driver
manages. I think this is a strong indication of the way future hardware
will be designed.

Hence my comment that I recommend always using the programmable pipeline.

Thanks,

Richard “7 of 5” Huddy
European Developer Relations Manager, ATI"

Hope this helps.

Originally posted by Devulon:
There really isn’t any form of “pipes” with vertex programs; by pipes I mean being able to do multiple instructions at the same time (assuming, of course, that the instructions are independent of each other). This is for simplicity. Current x86 processors do a ton of branch prediction, instruction reordering and register renaming; the instruction fetch and decode units are very complicated. It simply isn’t worth the time or effort to do that for a vertex program. In general, in a vertex program most operations are dependent. About the only time this doesn’t occur is when implementing the FF in the program or doing skeletal character animation, which is largely independent short of the weighted summing, which could be done at the end. Not enough temp registers.

>>>Simply isn’t worth the time of effort to do that in a vertex program.<<<

The current NV and ATI offerings don’t need this, but when branching and looping come into play, the GPU will have to implement some CPU-like technologies.

Also, vertex programs could be implemented in a pipelined fashion, since basically all you have is a lot of vertices, normals, texture coordinates, etc., and they will all be manipulated by your program. The only issue is how you have ordered your instructions. If your program requires putting an output back into some part of the pipe, then a secondary pipe could be put to work while the first is busy. There will be best cases and worst cases.

I’m sure there are plenty of opportunities to do parallel processing, even in a vp.
A programmable GPU will be slower than a conventional one, but it’s better than the x86 (the current ones and previous).

V-man

Mmm, like multiple vertex shader units. Oh, there already are…

No, I’d definitely recommend that you use the fixed function if it happens to be precisely what you want – that is, unless you want to throw away a significant amount of lighting performance. It’s an even bigger deal, of course, on hardware that only has fixed-function T&L and no programmability – which is a huge number of chips.

Of course, there are still various cases where you can do a better job optimizing, due to assumptions you can make about your own app’s data and behavior. But if all you want is fixed function and nothing else, …

It’s not particularly surprising that ATI would advocate you use a vertex program 100% of the time. Even if it didn’t help your app running on ATI hardware, it’d hurt your app running on NVIDIA hardware!

  • Matt

Just like nVidia encourages the use of glCopyTexSubImage() instead of WGL_ARB_render_texture I suppose.

I’ve only ever written vp’s that deal with one light at a time. How could you tell in a vertex program which of the 8 supported fixed-function lights were enabled (I mean, if you pass their attributes in as program parameters)? There are no conditional statements allowed on standard cards…

Ah, answered in this thread:- http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/008255.html

Originally posted by mcraighead:
It’s not particularly surprising that ATI would advocate you use a vertex program 100% of the time. Even if it didn’t help your app running on ATI hardware, it’d hurt your app running on NVIDIA hardware!

Come on, Matt, that doesn’t seem like you to post something like that. Not very nice of you.

Knackered,

Evan Hart from ATI said (a few months ago) that the ARB was working on ARB_vertex_program2 for flow control. This should work on the GeForceFX, the 9700 and future cards.

Someone posted Matt’s response on the DX mailing list.
Here is a reply from one guy at ATI:

"This is very interesting! I always thought that NVIDIA’s current hardware
emulated FFP with vertex shaders and that the vertex shaders benefited them
more than ATI. Radeon 8500/9000 chips have fixed function implemented in the
silicon, which in most cases is faster than shader-based equivalent TnL, so
I’m rather surprised to learn that I and other ATI guys for a long time
evangelized shaders for the “wrong” reasons. Reading between the lines, I
also get a feeling that Matt acknowledges that vertex shaders run faster on
ATI hardware. If that’s the case I’d like to thank Matt from NVIDIA for
saying it publicly.

Guennadi Riguer
ATI Technologies Inc."

I think it’s pretty clear-cut: if you want to use programmability, use it. But if you don’t need it, don’t use it.

Given that there are tens of millions of graphics chips out there that don’t support any programmable T&L, and millions and millions more that have dedicated fixed-function lighting hardware, using fixed-function when it makes sense to do so will make your app run better on a lot of systems.

  • Matt

Originally posted by PH:
Not very nice of you.

No one’s ever accused me of being a nice person…

Business is business. Any company that doesn’t promote the things they’re good at, or promotes the things their competitors are good at, isn’t doing its shareholders much of a favor…

As for the comment about performance: please feel free to perform all the augury you wish on my remarks (if the goat entrails coil clockwise, ATI is faster, counterclockwise, NVIDIA is faster), but, nope, there are so many products out there these days I can barely remember which ones are faster and slower at what myself.

  • Matt

I don’t think it’s a big secret or a big deal that nvidia cards are slower than ATI in some cases, and faster in other cases.
Same situation with many other companies.

I think we all know what the purpose of these so-called benchmarks is.

Personally, I prefer stability and trouble-free operation. And the continuous updating of drivers makes me want to kiss those fine nvidia people. No, make that shake hands.

V-man

Sorry about my question; it may be a bit OT.

I was just asking myself what the future of the fixed pipe is. I mean, with GL2.0, will it be relegated to ‘emulation’ instead of a pure hw implementation?

Ok, I would certainly understand the point -> number of transistors vs. programmable-pipeline speed with the next gen of GPUs (anyhow, it seems that ATI has already made that choice, and you can really feel the problem on a low-end system).

Well, what I think here… is that it could be a big mistake to throw away the fixed-pipe hw implementation when it has taken so many years to make it 100% hardwired.