Nvidia's Unified Compiler

I’ve got several questions about this:

  • does it optimise vertex programs (just curious)?
  • does it optimise NV_fragment_program, or is it bound to ARB and DX9?

Why am I asking: I would really like the option of using fixed point in my fragment programs. But NV_fragment_program has an unwieldy syntax, so I thought about extending the ARB assembly and writing a parser that would convert it to NV_fp. (Don’t tell me to use Cg.)
The only problem is that I am not keen on building a series of optimisation passes into the parser. If NV_fp isn’t optimised by the driver, I can clearly give the idea up.
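To show the kind of mapping such a converter would perform, here is a minimal sketch; the opcode spellings are from the two extension specs as I remember them, so treat the exact names as assumptions:

```
# ARB_fragment_program input (precision-less opcodes):
!!ARBfp1.0
TEMP r0;
ADD r0, fragment.texcoord[0], fragment.color;
MOV result.color, r0;
END

# ...would become NV_fragment_program output, rewritten to use
# fx12 fixed point via the 'X' instruction suffix and H registers:
!!FP1.0
ADDX H0, f[TEX0], f[COL0];
MOVH o[COLH], H0;
END
```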

OK, I did some tests, and I got some very interesting results indeed.

I wrote two identical fragment programs, one for ARB and one for NV. They perform additions with twenty temporary registers; each register is the sum of the previous two. Yes, the example is somewhat silly, but it is representative of a highly unoptimised program on Nvidia’s hardware (the temporary-register sickness).
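Reconstructed from memory, the ARB version looked essentially like this (same structure, if not the exact code):

```
!!ARBfp1.0
# 20 temporaries; each is the sum of the previous two, so every
# ADD depends on the two results before it (worst case for the
# FX register file).
TEMP r0, r1, r2, r3, r4, r5, r6, r7, r8, r9;
TEMP r10, r11, r12, r13, r14, r15, r16, r17, r18, r19;
MOV r0, fragment.color;
MOV r1, fragment.texcoord[0];
ADD r2, r0, r1;
ADD r3, r1, r2;
ADD r4, r2, r3;
ADD r5, r3, r4;
ADD r6, r4, r5;
ADD r7, r5, r6;
ADD r8, r6, r7;
ADD r9, r7, r8;
ADD r10, r8, r9;
ADD r11, r9, r10;
ADD r12, r10, r11;
ADD r13, r11, r12;
ADD r14, r12, r13;
ADD r15, r13, r14;
ADD r16, r14, r15;
ADD r17, r15, r16;
ADD r18, r16, r17;
ADD r19, r17, r18;
MOV result.color, r19;
END
```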

  1. The framerate was identical to four digits after the decimal point (120 FPS on an FX5600, in a 1024x768 32 bpp window). So I guess the driver first converts the ARB fragment program to an NV fragment program and then performs the optimisation.
  2. I tried to boost the performance by converting the NV program to use halfs. The performance didn’t change at all. That means either fp16 is indeed no faster than fp32, or the driver uses fp16 everywhere. I’ve read somewhere that fp16 is used only to reduce temporary-register usage, and I tend to agree. If the driver isn’t cheating with fp32, this also shows the uselessness of the partial precision of DX9 shaders. This is borne out by various benchmarks, in which partial precision almost never brings any performance boost. ForceWare optimises register usage heavily, so fp16 almost loses its value. (A sketch of the half and fixed-point variants follows this list.)
  3. Fixed-point performance… that’s the monster: 270 FPS (OK, the fragment program wasn’t very useful, but it demonstrates the raw power). More than a 100% boost. That’s the true FX power. It’s a shame that fixed point is underestimated by developers. From the GLSL reference (more or less, quoting from memory): “we decided not to add support for fixed point because this advanced feature would probably not be supported by modern generations of hardware”.
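For points 2 and 3, the NV variants differed only in the instruction suffix and register file. A sketch from memory of the NV_fragment_program syntax; the exact input and output names are assumptions:

```
!!FP1.0
# fp16 variant: H registers with the 'H' instruction suffix.
MOVH H0, f[COL0];
MOVH H1, f[TEX0];
ADDH H2, H0, H1;
ADDH H3, H1, H2;
ADDH H4, H2, H3;
ADDH H5, H3, H4;
ADDH H6, H4, H5;
ADDH H7, H5, H6;
ADDH H8, H6, H7;
ADDH H9, H7, H8;
ADDH H10, H8, H9;
ADDH H11, H9, H10;
ADDH H12, H10, H11;
ADDH H13, H11, H12;
ADDH H14, H12, H13;
ADDH H15, H13, H14;
ADDH H16, H14, H15;
ADDH H17, H15, H16;
ADDH H18, H16, H17;
ADDH H19, H17, H18;
MOVH o[COLH], H19;
END
```

The fixed-point variant simply replaces every ADDH with ADDX (fx12 precision, which clamps values to roughly [-2, 2); that doesn’t matter for a throughput test).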

So, as far as my empirical fragment program can be considered a benchmark, I can formulate the following statements:

  1. Nvidia is indeed telling the truth when it assures us that the optimiser doesn’t change the precision of the operands.
  2. 16-bit floating point is NOT faster than 32-bit. It only serves to optimise register usage.
  3. Fixed point is indeed a very powerful feature.

I would appreciate hearing your remarks.

Originally posted by Zengar:
3. Fixed point is indeed a very powerful feature.

True, but only on the GeforceFX 5200s, 5600s, and 5800s. As long as you’re satisfied with fixed-point precision, the ability to perform 2 integer ops plus a floating-point op is extremely powerful. However, since the 5900 and the newly arrived 5700 replaced those fixed-point units with mini floating-point units, that performance advantage won’t be seen across the entire GeforceFX line.
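In NV_fragment_program terms, that’s the sort of mix those units could chew through together. An illustrative sketch only; whether these exact instructions actually co-issue is my assumption, not verified scheduling:

```
!!FP1.0
MOVH H0, f[TEX0];
MOVH H1, f[COL0];
MOVR R0, f[TEX1];
ADDX H2, H0, H1;    # fixed-point (fx12) op #1
MULX H3, H0, H1;    # fixed-point (fx12) op #2
ADDR R1, R0, R0;    # floating-point op that could run in parallel
MOVR o[COLR], R1;
END
```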

Originally posted by Ostsol:
However, since the 5900 and the newly arrived 5700 replaced those fixed-point units with mini floating-point units…

How do you know that? I thought they had replaced the register combiner units. I would like to get access to a 5900 to test it… Ostsol, can I assume that you have this card? If so, have you done any tests?

Sorry, I should have posted a reference.
http://www.beyond3d.com/forum/viewtopic.php?t=8005

At Beyond3D there are quite a few people who -really- know what they’re talking about when it comes to 3D hardware, and who have done extensive tests and analysis of its capabilities. It was in those forums that I first heard about the register limitations of the GeforceFX and the extent to which those limitations affect performance. Here’s that thread:
http://www.beyond3d.com/forum/viewtopic.php?t=5150

I can’t find the thread where the theory that the NV35 has only floating-point units originated, though. However, it seems to have become the general consensus over there.

>> I thought they had replaced the register combiner units.
Looks like the change in the NV_fp spec (the !!RC1.0 target, which no longer exists) confirms this conclusion.

>> 2. 16-bit floating point is NOT faster than 32-bit. It only serves to optimise register usage.
Possibly.
But the difference between FP16 and FP32 has changed from one driver to another
(I saw up to 25% under 43.xx and about 5% under 52.xx).

It looks like the 52.xx drivers have some additional optimisation layer (in both D3D and OGL) that can change the order of instructions, etc.
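For example, a reordering layer could interleave independent dependency chains so that neither stalls. An illustrative sketch only; I have not confirmed the driver does exactly this:

```
!!FP1.0
MOVR R0, f[TEX0];
MOVR R1, f[COL0];
# As written: two dependent chains back to back, so the second
# ADD of each pair waits on the first.
ADDR R2, R0, R1;
ADDR R3, R2, R1;
ADDR R4, R0, R0;
ADDR R5, R4, R0;
# A reordering optimiser could emit instead:
#   ADDR R2, R0, R1;
#   ADDR R4, R0, R0;
#   ADDR R3, R2, R1;
#   ADDR R5, R4, R0;
ADDR R5, R3, R5;
MOVR o[COLR], R5;
END
```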

Originally posted by ayaromenok:
But the difference between FP16 and FP32 has changed from one driver to another
(I saw up to 25% under 43.xx and about 5% under 52.xx).

It looks like the 52.xx drivers have some additional optimisation layer (in both D3D and OGL) that can change the order of instructions, etc.

Yes. You can use two 16-bit registers at the cost of one 32-bit register. The 52.xx drivers optimise the usage of 32-bit registers, so this aspect now has minimal influence. I’ve tested several shaders so far, and the performance is nearly the same with 32-bit and 16-bit.
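My mental model of that packing, sketched with the physical-register cost in the comments (the two-halfs-per-fp32-slot layout is an assumption based on what I’ve read, not something I can verify directly):

```
!!FP1.0
MOVH H0, f[TEX0];   # H0 and H1 can share one physical
MOVH H1, f[COL0];   # fp32-sized register slot...
ADDH H2, H0, H1;    # ...while H2 starts a second slot.
MOVH o[COLH], H2;
END
```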

Originally posted by Zengar:
Yes. You can use two 16-bit registers at the cost of one 32-bit register.

I thought that it had always been like that. . .

I just wanted to point that out.

A 160-instruction fragment program still does 17.5 FPS on my 5600 non-Ultra.
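(Assuming the same 1024x768 window as in my earlier test, that is about 1024 × 768 × 17.5 ≈ 13.8 M fragments/s, times 160 instructions ≈ 2.2 G shader instructions per second.)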

I’m having difficulty believing what’s being said there.
How do they evaluate the number of cycles a shader takes? I think it is mostly guesswork with some known facts thrown in.

It’s not even clear how much the programmer should optimize.

They should describe the means by which they came to their conclusions, not just the conclusions.

Everything I’ve said is based on experiment. If you want, I can send you the code and the screenshots. There’s no speculation in my results. If you have difficulties with any of my results, please post them and I’ll try to check them. Feel free.
But, to say it once more: I wrote only what I’ve seen.

Edit: my benchmarks could be somewhat synthetic, as I have pointed out several times already, since I used mainly series of ADDs.
