Normalization cube map

Normalizing a vector in a fragment program takes 3 ops (DP3, RSQ, MUL), while using a cube map takes two (TEX, MAD), the second one for range unflating (expanding the fetched value from [0, 1] back to [-1, 1]).

On the other hand, using a cube map also takes up a texture unit. Is one op all you gain by using a cube map?

SeskaPeel.

Actually a cubemap is used on hardware that does not support fragment programs. On nVidia that's everything below a GFFX!
I'm quite sure that doing it in a fragment program might be better, but for all the “older” hardware you have to use cubemaps.

Jan.

On GeForceFX you can use signed textures so that a TEX is all you need to do cube map normalization.

DP3/RSQ/MUL is a good bit more expensive.

Thanks -
Cass

I heard a rumour that doing a normalize in a fragment program gets automatically switched to a normalization cubemap on the GF-FX.

Originally posted by Nutty:
I heard a rumour that doing a normalize in a fragment program gets automatically switched to a normalization cubemap on the GF-FX.

I really don’t think so, but it should be easy to verify by writing a demo to test it.

Have you noticed any precision problems / artifacts with DP3/RSQ/MUL on GeForceFX?

I wasn’t so sure on it meself, but it’s something I read a while back. They claimed the precision wasn’t very good, due to it using a normalization cube map internally.

I don’t have a GF-FX tho, so no worries there.

Current high-end hardware usually has quite impressive texturing hardware, so using a normalization cube map may be a good idea.

However, future hardware, including “Built-in Ultra Realistic Graphics ™” integrated controllers, may have a totally different balance between texturing and fragment operation execution. Built-in graphics parts tend to use a shared memory bus with the CPU, so all framebuffer and texture access is likely to be much slower than on a dedicated part.

So I guess the question is whether you want to optimize for the current monsters, or possible future cheap implementations, which are likely NOT going to have as good fill rate as the current high-end cards.

The divide keeps growing :frowning:

I almost always do the normalization mathematically. I look at it this way:

  1. It’s more elegant
  2. It’s more precise
  3. It requires no setup outside the program
  4. It requires no video memory
  5. It’s likely to be as fast or faster than a tex lookup in the future…do you still use sqrt tables in your regular code?

Anyway, it’s just an opinion. If speed is your primary concern for today’s hardware, use the lookup.

do you still use sqrt tables in your regular code?

If I was doing 307,200 (the equivalent of 640x480 pixels) or more every frame, yes I would!

uhm, no, nutty, you wouldn’t. i have an approx sqrt in 2 cycles, and a “good” approx sqrt in 5 cycles. and reverse square roots in 2 and 3 cycles. better than any lookup table

Originally posted by davepermen:
uhm, no, nutty, you wouldn’t. i have an approx sqrt in 2 cycles, and a “good” approx sqrt in 5 cycles. and reverse square roots in 2 and 3 cycles. better than any lookup table

Now, I would be interested to see the code of that for sure. If possible of course.

Originally posted by Zeno:

If speed is your primary concern for today’s hardware, use the lookup.

At this year’s GDC, NVIDIA said that it is currently faster to do normalization using cubemaps on their hw (because access to textures is highly optimized).
But in the same tutorial, ATI said that this is not true for their hw, and that they prefer to do it using the fragment program instructions.
I agree with the points that you presented. And with those arguments, some time ago, I removed the cubemap version from my code and have just left the normalization using fragment code. It is really more elegant. Even in all the NVIDIA examples that you can see in their Cg Browser, they use the normalize function (which is later compiled to the dp3/rsq/mul code).

Hope this helps

>Now, I would be interested to see the code
>of that for sure. If possible of course

IIRC these are the timings for the corresponding 3DNow! instructions (which, incidentally, use lookup tables built into the CPU for the first approximation).

in my case, they are p4 sse timings, but yeah, amd 3dnow is amazing at doing that fast, too… can’t wait for my athlon64…

btw, the ultrafast (ultra-imprecise one… up to 5% away from the expected value in the worst case) version is integer-based. too bad we can’t directly mix fpu/mmx code, as the mmx registers are aliased onto the fpu registers. else, it would be… “hell fast”

anyways. 3dnow/sse provide you enough speed to forget about lookuptables. at least, in case of sqrt/rsq.

i don’t actually know of a good lookup-table approach which is fast for them.

using plain c/c++, with the fast rsq from the q3 source, tuned and optimized for higher precision and more speed, i can get it to 15-20 cycles. that was on a p3. was nice, too…
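For anyone who hasn’t seen it, this is the Quake III-style fast reciprocal square root being referred to (a sketch, not davepermen’s tuned version): an integer “magic constant” gives the first guess, and one Newton-Raphson step refines it to within roughly 0.2%.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Fast approximate 1/sqrt(x), Quake III style. The bit trick exploits
 * the float representation: shifting the exponent right halves it
 * (negated via the subtraction), and the magic constant corrects the
 * mantissa. One Newton-Raphson iteration then sharpens the estimate. */
float fast_rsq(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);       /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);      /* initial integer approximation */
    memcpy(&x, &i, sizeof x);
    x = x * (1.5f - half * x * x);  /* one Newton-Raphson refinement */
    return x;
}
```

A second Newton-Raphson line brings it to roughly full single precision, at the cost of the extra multiplies.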

oh, and, yeah. on gpus, dp3-rsq-mul could easily get faster, as rsq is a very cheap instruction actually (look at sse rsq, or similar), and the math to do a cubemap texlookup is actually not that simple at all! it contains “normalizing” of a vector, too… in this case, a division.
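To see that hidden division, here is a sketch in C of the per-fragment arithmetic a cube-map lookup performs, following the face/coordinate selection rules from the OpenGL spec (the function name is illustrative):

```c
#include <assert.h>
#include <math.h>

/* Map a direction vector to a cube face index (0..5 for +X,-X,+Y,-Y,
 * +Z,-Z) and face (s, t) coordinates in [0, 1]: pick the major axis,
 * then divide the two minor components by its magnitude. That divide
 * is why the lookup is "not that simple at all". */
void cube_coords(float rx, float ry, float rz,
                 int *face, float *s, float *t)
{
    float ax = fabsf(rx), ay = fabsf(ry), az = fabsf(rz);
    float sc, tc, ma;
    if (ax >= ay && ax >= az) {            /* +/-X is the major axis */
        *face = rx >= 0 ? 0 : 1;
        sc = rx >= 0 ? -rz : rz;  tc = -ry;  ma = ax;
    } else if (ay >= ax && ay >= az) {     /* +/-Y is the major axis */
        *face = ry >= 0 ? 2 : 3;
        sc = rx;  tc = ry >= 0 ? rz : -rz;  ma = ay;
    } else {                               /* +/-Z is the major axis */
        *face = rz >= 0 ? 4 : 5;
        sc = rz >= 0 ? rx : -rx;  tc = -ry;  ma = az;
    }
    *s = (sc / ma + 1.0f) * 0.5f;          /* the implicit division */
    *t = (tc / ma + 1.0f) * 0.5f;
}
```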

SSE eh, so will that work on a GameCube? No? Thought not, so I’d still use a lookup.

If it’s so fast to do on a gpu, why does JC still say cubemaps are faster, eh?

Athlon64? Thought you were an intel boy? Preliminary benchmarks are awesome tho I admit…

carmack states it’s faster on the gfFX but not on radeons.

i’m not “intel”, i’m “good”. and athlon64 is “good”, as p4 with good configuration is “good”.

you need a fast rsq on gamecube? i can help you. i only need the asm papers, if you can use asm, and some other info. a portable fast rsq doesn’t really exist, but a portable fast rsq algo does exist. how to implement it, i don’t know. but i know i can have a very fast sqrt on a gba with 1 instruction (yes): an integer add with an integrated bitshift would sqrt every float. too bad the gba doesn’t have floats at all, hehe…

Four reasons to prefer a cube map on nv3x:

  1. nv3x has the HILO texture format, which gives you 16-bit precision for a normalization cube map. This is comparable precision to doing DP3/RSQ/MUL on a R300, where it is done with a 16-bit mantissa (and hence the final result is an arbitrarily oriented vector of unit length; the exponent part doesn’t improve precision very much). The drawback is that the sphere degenerates to a hemisphere with HILO, but by doing lighting in tangent space and with a geometric self-shadowing term, this error can be completely hidden.

  2. If you do DP3/RSQ/MUL for use in high-exponent specular lighting, you must do it in full precision, in an R register, or else it will look like crap. Conversely, using a HILO sample stored to an H register looks fine.

  3. At least on my poor little 5200, RSQ is considerably more expensive than other simple instructions (like MOV, ADD, MUL, DP3, RCP, LG2, EX2, etc.). 3 times more, to be exact.

  4. Using an interpolant register in an instruction other than TEX or MOV incurs a penalty comparable to the cost of one additional simple instruction. For example, compare these 2 pieces of code:

MOVR R0, f[TEX0];
DP3R R0.w, R0, R0;
RSQR R0.w, R0.w;
MULR R0, R0, R0.w;

DP3R R0.w, f[TEX0], f[TEX0];
RSQR R0.w, R0.w;
MULR R0, f[TEX0], R0.w;

The 2nd sequence is “one cycle” slower than the 1st (despite being one instruction shorter), because it accesses the interpolant twice. Tested myself on a 5200, and by thepkrl@B3D on a 5800.

Taking all this into account, it would be quite unwise of a programmer to use DP3/RSQ/MUL on nv3x.

SSE eh, so will that work on a gamecube?

GameCube is PowerPC, right? PowerPC had fast rsqrt from the beginning. (And that yummy rlwimi - Mmm…)

Uhh… how’s that OpenGL support for gamecube working out, anyway?

I just wanted to clarify a couple things regarding the Radeon on this thread.

First, if it was me that gave the impression at GDC that it is always better to use an ALU normalize, I apologize. I didn’t mean to.

On the Radeon cards, neither way is guaranteed to provide the best performance. It really depends on the shader. I have seen shaders that work best doing one normalize with a cube map and another with the ALU. The rule of thumb I suggest for Radeons is that if the program is ALU-heavy, it is best to use a texture fetch, and if it is texture-heavy, it is best to use ALU instructions.

-Evan