Quick question: why are simple vertex programs slower than the fixed pipeline?

In Direct3D, you are encouraged to use vertex shaders (the equivalent of vertex programs), as they should give a performance increase. I’ve done a quick experiment comparing the fixed pipeline with vertex programs, and I notice that the fixed pipeline is faster even with more going on (the vertex program simply transforms the vertex and passes through the tex coords, whereas I’ve got auto-texcoord generation on in the fixed pipeline).
Is there a reason for this?
Setup: GeForce3 Ti 500, with the latest Detonator drivers.
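
For reference, the "simple" vertex program described above amounts to four dot products and one move per vertex. The sketch below emulates that work on the CPU in Python, purely to show how little is being computed. The actual program used in the test was not posted, so the NV_vertex_program listing in the comments and the names `dp4`, `run_vertex_program` and `mvp_rows` are assumptions made for illustration.

```python
# CPU-side emulation of a minimal "transform + texcoord pass-through"
# vertex program. Assumption: on a GeForce3 the NV_vertex_program
# version would look roughly like
#   !!VP1.0
#   DP4 o[HPOS].x, c[0], v[OPOS];
#   DP4 o[HPOS].y, c[1], v[OPOS];
#   DP4 o[HPOS].z, c[2], v[OPOS];
#   DP4 o[HPOS].w, c[3], v[OPOS];
#   MOV o[TEX0], v[TEX0];
#   END
# with c[0]..c[3] holding the rows of the modelview-projection matrix.


def dp4(a, b):
    """Four-component dot product (what a DP4 instruction computes)."""
    return sum(x * y for x, y in zip(a, b))


def run_vertex_program(mvp_rows, position, texcoord):
    """Return (clip_position, texcoord) for a single vertex.

    mvp_rows: the four rows of a 4x4 modelview-projection matrix
              (hypothetical name, stands in for c[0]..c[3]).
    position: object-space position as (x, y, z, w).
    texcoord: texture coordinate, passed through unchanged.
    """
    clip = tuple(dp4(row, position) for row in mvp_rows)
    return clip, tuple(texcoord)


if __name__ == "__main__":
    # A plain translation by (1, 2, 3) stands in for a real MVP matrix.
    mvp = [
        (1.0, 0.0, 0.0, 1.0),
        (0.0, 1.0, 0.0, 2.0),
        (0.0, 0.0, 1.0, 3.0),
        (0.0, 0.0, 0.0, 1.0),
    ]
    print(run_vertex_program(mvp, (1.0, 1.0, 1.0, 1.0), (0.25, 0.75)))
```

Since the per-vertex arithmetic is this small, any slowdown against the fixed pipeline would have to come from somewhere other than the instruction count itself, which is what the question is getting at.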

Direct3D sucks. Sorry, but it had to be said. From the little I know about vertex shaders in Direct3D, they are messed up. They work, but the extra abstraction that was added makes managing them and getting them onto the card quite slow. The fixed pipeline simply doesn’t have this problem. In general, a fixed pipeline will run faster because it’s more optimized for a specific set of tasks. Recreating a fixed pipeline in a vertex program is generally going to be slower. But of course that’s not what vertex and pixel programs were meant for.

My small vertex programs (say, with only transform and color/texture pass-through) easily beat slightly more complex fixed-function T&L (say with one light enabled).

Using OpenGL, it is definitely not slower in all cases. I’ve only got a regular GF3 too…

– Zeno

Sorry if I wasn’t clear. I only mentioned D3D because I program in that too, and the DX8 SDK docs state that vertex shaders are faster than the fixed pipeline. I did my tests using OpenGL, comparing the fixed pipeline against vertex programs.
Don’t be so hard on D3D; it’s not that bad these days. It’s not like supporting a football team, you know. You can like more than one API.

The only reason I don’t like DirectX is that it’s not cross-platform. That’s why I use OpenGL.

I found the same effect (also for vertex programs imitating the fixed pipeline), but blamed it on my GF3 Ti 200. The most disappointing part was that the bump mapping setup was quicker when done on the CPU (but only with geometry in video memory; using AGP, both approaches were equal).
I think I heard somewhere that the GeForce 3 has two T&L units in hardware: one for the fixed pipeline and one for vertex programs. Does anyone know more about this than I do?
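
To make the bump mapping comparison above concrete, here is a rough Python sketch of the kind of per-vertex work that could run either on the CPU or in a vertex program. It assumes "bump mapping setup" means the common step of expressing the light vector in each vertex's tangent space; the post does not say which setup was actually used, and the function and parameter names are hypothetical.

```python
def dot3(a, b):
    """Three-component dot product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def tangent_space_light(position, tangent, bitangent, normal, light_pos):
    """Light vector in the vertex's tangent space (not normalized).

    Assumes tangent, bitangent and normal form an orthonormal basis
    given in the same space as position and light_pos.
    """
    # Vector from the vertex to the light.
    to_light = tuple(l - p for l, p in zip(light_pos, position))
    # Project it onto the three basis vectors.
    return (
        dot3(tangent, to_light),
        dot3(bitangent, to_light),
        dot3(normal, to_light),
    )


if __name__ == "__main__":
    print(tangent_space_light(
        (0.0, 0.0, 0.0),   # vertex position
        (0.0, 1.0, 0.0),   # tangent
        (-1.0, 0.0, 0.0),  # bitangent
        (0.0, 0.0, 1.0),   # normal
        (1.0, 2.0, 3.0),   # light position
    ))
```

Under this assumption the setup is one subtraction and three dot products per vertex, so the observed difference is more likely about where the vertex data lives and how it reaches the card than about the math.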