POW, RCP, RSQ

I’m not sure if it’s just me, but I can’t find what I’m looking for in the ARB_vp and ARB_fp specs.

pow(0, 0) seems to give 1. This is correct, right?
I think pow(0, 0) is actually supposed to be undefined, but I guess there is an agreement that it should give 1.

For RCP and RSQ, it seems that division by 0 gives 0?

I need to do an if() inside ARB_fp, but since we can’t, I wrote a series of instructions that does what I want (at the cost of a few GPU cycles), and it has to rely on 0/0 = 0!

BTW, NV_fp says that pow(0, 0) = NaN and 0/0 = NaN, but I’m using ARB_fp.

I think the results for all of these are just “undefined”.

There’s a conditional-set instruction.

You can select between two values by using the conditional set (SGE or SLT, which write 1 or 0) and then feeding that result to the LRP instruction as the blend weight.
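A minimal sketch of that select trick, written in Python rather than ARB assembly so the arithmetic is easy to check. It assumes scalar operands and the usual definitions SGE(a, b) = (a >= b) ? 1 : 0 and LRP(t, x, y) = t*x + (1 - t)*y; the function names are made up for illustration.

```python
def sge(a, b):
    """Emulate SGE on scalars: 1.0 if a >= b, else 0.0."""
    return 1.0 if a >= b else 0.0


def lrp(t, x, y):
    """Emulate LRP on scalars: t*x + (1 - t)*y."""
    return t * x + (1.0 - t) * y


def select_ge(a, b, if_true, if_false):
    """Branch-free stand-in for: if_true if a >= b else if_false."""
    # The 0/1 mask from SGE becomes the blend weight for LRP.
    return lrp(sge(a, b), if_true, if_false)
```

Both candidate values are always computed; the mask only decides which one survives the blend.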

I found a paragraph that says it follows the core GL specification, so the following applies:

pow(0, 0) = 1
n/0 = undefined, but the program must not terminate.

jwatte, you mean CMP? OK, that’s not bad.
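For what it’s worth, here is a rough Python sketch of how a CMP-style select could guard a reciprocal instead of relying on 0/0 = 0. It assumes CMP(a, x, y) = (a < 0) ? x : y on scalars. The `rcp` stand-in returning infinity for a zero input is purely an assumption for the sketch (the real result is undefined), and `safe_rcp` is a made-up helper name.

```python
import math


def cmp_select(a, x, y):
    """Emulate CMP on scalars: x if a < 0, else y."""
    return x if a < 0.0 else y


def rcp(x):
    """Stand-in for RCP. Returning inf for 0 is an assumption, not spec behaviour."""
    return math.inf if x == 0.0 else 1.0 / x


def safe_rcp(x, fallback=0.0):
    """Return 1/x, or `fallback` when x is exactly zero."""
    # -|x| is negative for every non-zero x, so CMP picks rcp(x) there
    # and falls through to `fallback` only when x == 0.
    return cmp_select(-abs(x), rcp(x), fallback)
```

A true select matters here: blending with LRP would multiply the unwanted value by 0, and under IEEE arithmetic 0 * inf is NaN, so the bad result could still leak through.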

And what’s with “native limits”?
I’m getting an error because it says I’m exceeding the native limit of 64 ALU instructions, yet the non-native limit is 96 or something.

What about the f-buffer? Is this for vertex programs only? It says the R300 can do 65000 instructions (non-native), but for fragment programs the maximums for ALU, TEX, and TEX indirections are really low.

It sounds like we are limited by the native limits.

The f-buffer is used for fragment programs too (in the ATI Ashli demo, up to 6 times).

Just keep in mind that for complex fragment programs NV3x hardware is much better: 1024 native ALU/TEX instructions.

The “f-buffer” is a “fragment buffer” that stores intermediate fragment values, which lets the hardware run another physical shader on the intermediate outputs of the first one. This can be chained to support arbitrarily long fragment shaders. The concept is a little bit similar to deferred shading, except that it happens inside the hardware, transparently to the application.

Allegedly, the Radeon 9800 hardware supports f-buffer operation, although I haven’t seen it exposed. An f-buffer can, in theory, support arbitrarily long fragment shaders with reasonably efficient use of hardware.
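To make the chaining idea concrete, here is a purely conceptual Python sketch. It is not how any driver or chip actually does it: each "instruction" is just a Python callable, `native_limit` stands in for the per-pass hardware limit, and `buffer` plays the role of the f-buffer carrying intermediate values from one pass to the next.

```python
def run_multipass(instructions, fragment, native_limit=64):
    """Apply `instructions` to `fragment` in chunks of at most `native_limit`.

    Each chunk stands in for one physical shader pass; the value left in
    `buffer` after a chunk is what the next pass would read back.
    """
    buffer = fragment
    passes = 0
    for start in range(0, len(instructions), native_limit):
        for op in instructions[start:start + native_limit]:
            buffer = op(buffer)
        passes += 1
    return buffer, passes
```

With 150 one-instruction steps and a limit of 64, this runs three passes (64 + 64 + 22) and gives the same result as one long program would.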

Is this accurate?

for R9800 http://www.delphi3d.net/hardware/viewreport.php?report=804

Max. ALU instructions = 96
Max. native ALU instructions = 64

There doesn’t actually seem to be much of a difference between the 9700 and the 9800, or the 9500 and 9600 for that matter.

I was expecting the f-buffer to push those numbers up.
I think that Max. ALU instructions is meaningless. As soon as the native limit is exceeded, it complains.
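A tiny Python sketch of the two separate checks involved, using the 96 and 64 figures from the report above as defaults. The function and key names are made up; in a real program the counts would come from the program queries (the ARB program extensions define glGetProgramivARB and a GL_PROGRAM_UNDER_NATIVE_LIMITS_ARB query for this), which the sketch does not call.

```python
def check_limits(alu, native_alu, max_alu=96, max_native_alu=64):
    """Compare a program's ALU counts against the API and native limits.

    `alu` is the instruction count as written; `native_alu` is the count
    after the driver has translated the program for the hardware.
    """
    return {
        "within_api_limit": alu <= max_alu,
        "within_native_limit": native_alu <= max_native_alu,
    }
```

For example, a program with 70 ALU instructions that compiles to 72 native ones passes the first check and fails the second, which matches the behaviour described here.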

What makes it worse is that some instructions are converted to 2 or 3 native ones for the GPU.

It’s kind of weird. I have code like this:

DP4 …
DP4 …
DP4 …
DP4 …

DP3 …
DP3 …
DP3 …

DP4 …
DP4 …
DP4 …
DP4 …

That makes a total of 11 ALU instructions, but it compiles to 13 native ALU instructions.

Where do the extra 2 come from?

strange …

“I was expecting the f-buffer to push those numbers up.”

The F-buffer is not implemented on these cards.

“Where do the extra 2 come from?”

Remember, the hardware doesn’t natively support all that swizzling. So, if you do some swizzling, extra opcodes will be generated.

[This message has been edited by Korval (edited 10-14-2003).]

###Remember, the hardware doesn’t natively support all that swizzling. So, if you do some swizzling, extra opcodes will be generated###

I didn’t. I only do masking on the destination, and I’m sure that doesn’t need extra cycles.

The instructions I have in there are quite similar.
All those lines that start with DP4 are like this

DP4 THING.x, THING, THING;
DP4 THING.y, THING, THING;
DP4 THING.z, THING, THING;
DP4 THING.w, THING, THING;

and all those lines that start with DP3 are like this

DP3 THING.x, THING, THING;
DP3 THING.y, THING, THING;
DP3 THING.z, THING, THING;

(Sorry, actual code not at hand)

I’ve noticed this sort of thing both in VP and FP.

In one FP I had, I was not doing any tex instructions, but natively, it says I’m doing 1 TEX and 1 TEX indirection.

It’s nice to have the native queries.

Originally posted by V-man:
[b]The instructions I have in there are quite similar.
All those lines that start with DP4 are like this

DP4 THING.x, THING, THING;
DP4 THING.y, THING, THING;
DP4 THING.z, THING, THING;
DP4 THING.w, THING, THING;

and all those lines that start with DP3 are like this

DP3 THING.x, THING, THING;
DP3 THING.y, THING, THING;
DP3 THING.z, THING, THING;

(Sorry, actual code not at hand)

I’ve noticed this sort of thing both in VP and FP.

In one FP I had, I was not doing any tex instructions, but natively, it says I’m doing 1 TEX and 1 TEX indirection.[/b]

Did you use a KIL in that case?

The two additional instructions might be necessary because of the R300’s limited swizzle support. In PS2.0, DP4 only writes to the w component. I think the reason for this is that the R300 pipeline is split into a vec3 part and a scalar part, and consists of a “full” ALU and a mini ALU. So the vector ALU might do a DP3, the scalar ALU a MUL, and the scalar mini ALU then adds both results together, outputting the DP4 result in the w component.
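As a sanity check on the arithmetic only, here is a small Python sketch of a DP4 split that way. The mapping of each step to a particular ALU is the guess from the paragraph above, not something confirmed by ATI, and the function names are made up.

```python
def dp3(a, b):
    """Three-component dot product over x, y, z."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dp4_split(a, b):
    """Four-component dot product built from DP3 + MUL + ADD."""
    vec_part = dp3(a, b)           # guessed: vector ALU handles xyz
    scalar_part = a[3] * b[3]      # guessed: scalar ALU multiplies w
    return vec_part + scalar_part  # guessed: mini ALU adds the two
```

For a = (1, 2, 3, 4) and b = (5, 6, 7, 8) this gives 38 + 32 = 70, the same as a direct four-component dot product.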

“Did you use a KIL in that case?”

Nope, just plain ALU instructions (MOV, DP3, RSQ, MUL, ADD, …)

I was thinking that the R3XX has a fixed requirement.

“In PS2.0 DP4 only writes to the w component.”

I think you mean it can’t, so the operation is split into 3.
*I guess it counts as 2 opcodes.

This explanation sounds familiar.

Is there an official source for this?

Originally posted by V-man:
I think you mean it can’t, so the operation is split into 3.
*I guess it counts as 2 opcodes.

The PS Spec is a bit confusing here.

DP4 is described as:

dest.w = (src0.x * src1.x) + (src0.y * src1.y) +
         (src0.z * src1.z) + (src0.w * src1.w);
dest.x = dest.y = dest.z = unused;

DP3 has exactly the same description (obviously a copy-and-paste error), but writes the result to xyz.

The DP4 documentation also says that it takes two instruction slots in PS1.2/1.3, which makes sense, as NV25 will turn it into a DP3 followed by a MAD in the alpha channel (which is probably the reason why DP4 is defined as writing to the w component).

Maybe you can just try dropping some of the DP4s and observe how the native counts change.