5200 again: performance vs. precision

My impressions, as a couple-of-days-old owner of a GFFX 5200:

Strangely, so far I haven’t felt the slightest difference in performance between the X/H/R precision modes (except when switching between fp16 and fp32 pbuffer RTT, but that is a memory bandwidth cost, I suppose).

Some examples:

TEX H0, f[TEX0], TEX0, CUBE;
DP3X H0, H0, {0.2, 0, 0.9797, 0};
POW o[COLH], H0.x, 120;

TEX H0, f[TEX0], TEX0, CUBE;
DP3H H0, H0, {0.2, 0, 0.9797, 0};
POW o[COLH], H0.x, 120;

TEX R0, f[TEX0], TEX0, CUBE;
DP3R R0, R0, {0.2, 0, 0.9797, 0};
POW o[COLR], R0.x, 120;

These three programs all run at the same speed; however, the result quality is vastly different in each case (TEX0 is a HILO16 cubemap).

Another one:

TEX H0, f[TEX0], TEX0, CUBE;
TEX H1, f[TEX1], TEX1, 2D;
DP3X_SAT o[COLH], H0, H1;

This program uses fixed-precision data only, but it runs twice as slowly as the equivalent NV_RC (register combiners) program.
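For reference, a register-combiners setup computing the same saturated dot3 might look roughly like this (a sketch, not necessarily the exact program used in the comparison; it assumes a current GL context with the NV_register_combiners entry points resolved):

/* One general combiner: spare0.rgb = tex0 . tex1 (dot product), then the
   final combiner passes spare0 through.  The unsigned mapping on the final
   combiner input clamps to [0,1], playing the role of the _SAT modifier. */
glEnable(GL_REGISTER_COMBINERS_NV);
glCombinerParameteriNV(GL_NUM_GENERAL_COMBINERS_NV, 1);

glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_A_NV,
                  GL_TEXTURE0_ARB, GL_SIGNED_IDENTITY_NV, GL_RGB);
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_B_NV,
                  GL_TEXTURE1_ARB, GL_SIGNED_IDENTITY_NV, GL_RGB);
glCombinerOutputNV(GL_COMBINER0_NV, GL_RGB,
                   GL_SPARE0_NV, GL_DISCARD_NV, GL_DISCARD_NV,
                   GL_NONE, GL_NONE, GL_TRUE, GL_FALSE, GL_FALSE);

/* Final combiner: out.rgb = A*B + (1-A)*C + D = spare0 * 1 + 0 + 0 */
glFinalCombinerInputNV(GL_VARIABLE_A_NV, GL_SPARE0_NV,
                       GL_UNSIGNED_IDENTITY_NV, GL_RGB);
glFinalCombinerInputNV(GL_VARIABLE_B_NV, GL_ZERO,
                       GL_UNSIGNED_INVERT_NV, GL_RGB);   /* = 1 */
glFinalCombinerInputNV(GL_VARIABLE_C_NV, GL_ZERO,
                       GL_UNSIGNED_IDENTITY_NV, GL_RGB);
glFinalCombinerInputNV(GL_VARIABLE_D_NV, GL_ZERO,
                       GL_UNSIGNED_IDENTITY_NV, GL_RGB);
glFinalCombinerInputNV(GL_VARIABLE_G_NV, GL_ZERO,
                       GL_UNSIGNED_INVERT_NV, GL_ALPHA); /* alpha = 1 */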

I’ve tried longer programs too, but I’ve never been able to achieve better performance by using fixed precision. Only the instruction count and instruction complexity (POW, RFL, etc., and interpolant accesses) seemed to matter.
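For concreteness, the kind of loop used to compare these programs is essentially the following (a simplified sketch, not my actual test app; time_fragment_program and now_seconds are made-up names, and a real test would also bind the textures and supply both sets of texture coordinates):

#define GL_GLEXT_PROTOTYPES 1   /* or load the NV entry points manually */
#include <GL/gl.h>
#include <GL/glext.h>
#include <string.h>

extern double now_seconds(void);        /* hypothetical wall-clock helper */

/* Load a !!FP1.0 program and time N viewport-filling quads drawn with it. */
double time_fragment_program(const char *src, int iterations)
{
    GLuint id;
    glGenProgramsNV(1, &id);
    glLoadProgramNV(GL_FRAGMENT_PROGRAM_NV, id,
                    (GLsizei)strlen(src), (const GLubyte *)src);
    glBindProgramNV(GL_FRAGMENT_PROGRAM_NV, id);
    glEnable(GL_FRAGMENT_PROGRAM_NV);

    glFinish();                          /* drain pending work first */
    double t0 = now_seconds();
    for (int i = 0; i < iterations; ++i) {
        glBegin(GL_QUADS);               /* big quad = lots of fragments */
        glTexCoord2f(0, 0); glVertex2f(-1, -1);
        glTexCoord2f(1, 0); glVertex2f( 1, -1);
        glTexCoord2f(1, 1); glVertex2f( 1,  1);
        glTexCoord2f(0, 1); glVertex2f(-1,  1);
        glEnd();
    }
    glFinish();                          /* wait until the GPU is done */
    double elapsed = now_seconds() - t0;

    glDisable(GL_FRAGMENT_PROGRAM_NV);
    glDeleteProgramsNV(1, &id);
    return elapsed;
}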

Has anyone ever noticed a speed benefit from using “fixed” on a 5200? Or a 5600?

Maybe the “fixed” thing is really a 5800-specific feature?

Originally posted by MZ:
My impressions, as a couple-of-days-old owner of a GFFX 5200:

Strangely, so far I haven’t felt the slightest difference in performance between the X/H/R precision modes (except when switching between fp16 and fp32 pbuffer RTT, but that is a memory bandwidth cost, I suppose).
[…]
Has anyone ever noticed a speed benefit from using “fixed” on a 5200? Or a 5600?

Maybe the “fixed” thing is really a 5800-specific feature?

Fixed should really run faster than fp16/fp32. There were some rumours about fixed-point support being removed on some version of NV3x, but I think they referred to NV35.

In fp16 vs. fp32, the performance increase comes from doubling the number of available registers rather than from being able to perform operations faster.

GFFX is very sensitive to the number of registers you use; that’s why Nvidia’s Cg compiler optimises the way it does.

Peruse Beyond3D for Uttar’s and AnteP’s GFFX performance tests:

MDolenc:
It’s not that fp16 is significantly faster than fp32 (or fp32 significantly slower)… The problem on NV3x is that it only has 2 fp32 registers (or 4 fp16 registers) free, and after that performance goes down. I’m sure you won’t see much of a difference between code with 2 registers and 32 fp32 instructions and the same code using 32 fp16 instructions. FX12 is, however, two times faster than floating point on NV30.

http://www.beyond3d.com/forum/viewtopic.php?t=6072&postdays=0&postorder=asc&start=20
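As a toy illustration of the register-count point (hypothetical programs, not taken from those tests; the names are made up): both fragment programs below compute tex^15 with one TEX and six MULs, but the first never keeps more than two temporaries live, while the second needs four alive at once. On NV3x the explanation above predicts that the second one is the one that slows down when the registers are fp32.

static const char *pow15_two_regs =
    "!!FP1.0\n"
    "TEX R0, f[TEX0], TEX0, 2D;\n"
    "MULR R1, R0, R0;\n"          /* x^2  */
    "MULR R1, R1, R0;\n"          /* x^3  */
    "MULR R1, R1, R1;\n"          /* x^6  */
    "MULR R1, R1, R0;\n"          /* x^7  */
    "MULR R1, R1, R1;\n"          /* x^14 */
    "MULR o[COLR], R1, R0;\n"     /* x^15; only R0 and R1 ever live */
    "END\n";

static const char *pow15_four_regs =
    "!!FP1.0\n"
    "TEX R0, f[TEX0], TEX0, 2D;\n"
    "MULR R1, R0, R0;\n"          /* x^2  */
    "MULR R2, R1, R1;\n"          /* x^4  */
    "MULR R3, R2, R2;\n"          /* x^8; R0..R3 all needed from here */
    "MULR R1, R1, R0;\n"          /* x^3  */
    "MULR R2, R2, R3;\n"          /* x^12 */
    "MULR o[COLR], R1, R2;\n"     /* x^15 */
    "END\n";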


In order to find a speed difference, your program should actually do something. That is, something more significant than 2 ALU instructions and a texture access.

Try using something other than POW. That instruction alone likely accounts for most of your time in the shader.

@evanGLizr:
I lurk around the B3D forums occasionally, and assuming thepkrl is right, I’m aware of what you wrote. However, some parts of his analysis of the 5800 don’t apply to my 5200 - like the fx12 performance. I’ve just tried MDolenc’s test, and it showed my 5200 is 4.8 times slower in PS2.0 than a 5800 (which is reasonable), but it doesn’t cover fx12 (the whole site is strongly DX-biased), which is what I’m concerned about.

@al_bob:
I used POW to be able to see the precision I was really getting - whether it was what I requested (yes, it was). Additionally, I was a little paranoid about the possibility of the compiler detecting a series of MULs and replacing it with a single POW. Replacing POW with a couple of MULHs or MULXs didn’t show any difference between fp16 and fx12 either. (BTW, cost of POW == 3 * cost of MUL.)
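For reference, the kind of MULX chain that can stand in for the POW is shown below (a sketch, and the chain actually tried may have differed; pow120_as_muls is a made-up name). It is just repeated squaring of H0.x up to the 120th power, loaded from C the same way as the other programs:

static const char *pow120_as_muls =
    "!!FP1.0\n"
    "TEX H0, f[TEX0], TEX0, CUBE;\n"
    "DP3X H0, H0, {0.2, 0, 0.9797, 0};\n"
    "MULX H1.x, H0.x, H0.x;\n"        /* x^2   */
    "MULX H1.x, H1.x, H0.x;\n"        /* x^3   */
    "MULX H1.x, H1.x, H1.x;\n"        /* x^6   */
    "MULX H1.x, H1.x, H0.x;\n"        /* x^7   */
    "MULX H1.x, H1.x, H1.x;\n"        /* x^14  */
    "MULX H1.x, H1.x, H0.x;\n"        /* x^15  */
    "MULX H1.x, H1.x, H1.x;\n"        /* x^30  */
    "MULX H1.x, H1.x, H1.x;\n"        /* x^60  */
    "MULX o[COLH], H1.x, H1.x;\n"     /* x^120 */
    "END\n";

Swapping MULX for MULH gives the fp16 variant used for the comparison.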

@Korval:
I said I had tested longer programs too. Actually, I started with a normal, real-world shader, then noticed the issue (no benefit from fx12), and then dumbed down my app to isolate shader performance as much as possible and explore it further.

I stand by my observation:
No matter whether I use dumb or real shaders, shorter or longer ones, minimize the bandwidth influence or leave it as in real rendering - I always see ZERO difference in speed between fp16 and fx12 in NV_fragment_program computations. I’m puzzled.

Here’s my code for very simple bump mapping with two light sources:

!!FP1.0

TEX H0.xyz, f[TEX1].xyxx, TEX1, 2D; # Sample bump map
TEX H1.xyz, f[TEX0].xyxx, TEX0, 2D; # Sample diffuse map
MADX H0.xyz, H0.xyzx, {2}.x, {-1}.x; # Expand bump [0,1] → [-1,1]
DP3X_SAT H2.w, H0.xyzx, f[TEX3].xyzx; # light2Pos . bump
MULX H2.xyz, H2.w, p[2].xyzx; # (light2Pos . bump) * light2Color
DP3X_SAT H3.x, H0.xyzx, f[TEX2].xyzx; # light1Pos . bump (H0 holds the expanded bump)
MADX H2.xyz, H3.x, p[1].xyzx, H2.xyzx; # + (light1Pos . bump) * light1Color
ADDX H2.xyz, H2.xyzx, p[0].xyzx; # + ambient
MULX o[COLH].xyz, H1.xyzx, H2.xyzx; # * diffuse color

END

To me (and my FX 5200) it also doesn’t matter whether I replace all operations with their respective H or R variants, or whether I replace all H variables with Rs (though I’m then exceeding the 2/4 register limit in the fp32 case). At least in my small demo there’s absolutely no significant difference in performance.


Actually, it’s possible that, to save transistor space on their really low-end chip (the 5200), nVidia took out the extra fixed-point processing unit. That would make integer just as slow as floating-point.

Originally posted by Korval:
Actually, it’s possible that, to save transistor space on their really low-end chip (the 5200), nVidia took out the extra fixed-point processing unit. That would make integer just as slow as floating-point.

Wouldn’t that mean that there is simply no FX12 support at all, and any reference to FX12 is interpreted as FP16?

It was my understanding that the major speed gains from using fixed point came from parallel execution: fixed and floating point could be executed simultaneously, stalling whenever an interdependency was hit. Using straight floating point or straight fixed point won’t give this gain.

Wouldn’t that mean that there is simply no FX12 support at all, and any reference to FX12 is interpreted as FP16?

The performance gain of fixed-point operations over floating-point comes from having two functional units rather than one. Hence twice the speed. However, if nVidia removed one of those functional units, then the performance of a fixed-point operation is the same as that of a floating-point one. Hence no speed improvement.

It was my understanding that the major speed gains from using fixed point came from parallel execution.

According to investigations at Beyond3D, this is not the case.

Originally posted by Korval:
The performance gain of fixed-point operations over floating-point comes from having two functional units rather than one. Hence twice the speed. However, if nVidia removed one of those functional units, then the performance of a fixed-point operation is the same as that of a floating-point one. Hence no speed improvement.

Ah... sorry, I misunderstood your post. I thought you were implying the total removal of integer units.