How costly is a sqrt in a fragment shader?



Mukund
10-20-2011, 03:02 AM
Hello all,

This is my fragment shader:



varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    float dis = sqrt((fragPos.y - lightPos.y) *
                     (fragPos.y - lightPos.y)
                   + (fragPos.x - lightPos.x) *
                     (fragPos.x - lightPos.x));

    if (dis <= 5.0)
        gl_FragColor = color * vec4(1.0, 1.0, 1.0, 1.0);
    else
        gl_FragColor = color * vec4(0.5, 0.5, 0.5, 1.0);
}

Can anyone please tell me how costly this is?
I just want to check for a circular region and assign colors. Please let me know if there is a better way to do the same.

Thanks a lot!

aqnuep
10-20-2011, 03:55 AM
It shouldn't be that expensive. Reciprocal square root is a single instruction on modern GPUs, which means sqrt() should not take more than two; however, it requires a division, which may be a bit expensive.

You may use inversesqrt() instead and change your comparison, but I don't think you should be afraid of the square root calculation. Actually, the branching (if) is much more costly.

Mukund
10-20-2011, 04:17 AM
Thanks for the reply aqnuep.

>>Actually the branching (if) is much more costly.

Well, I need to do the check for every fragment. Any idea how I can make it better?

Thanks!

mbentrup
10-20-2011, 04:23 AM
aqnuep is absolutely correct, except sqrt() is usually implemented as inversesqrt() + multiplication, not inversesqrt() + division, so it should be quick.

One thing you should keep in mind though is that sqrt is not vectorized on most GPUs, so computing a sqrt(vec4) requires 4*2 instructions.

aqnuep
10-20-2011, 04:35 AM
>>aqnuep is absolutely correct, except sqrt() is usually implemented as inversesqrt() + multiplication, not inversesqrt() + division, so it should be quick.

Yes, you are right. Usually, if the GLSL compiler is smart enough, it may figure out that there is no need for a division or multiplication at all, or that a single multiplication is enough. It is also true that sqrt(vec4) will most probably require 4 instructions for the reciprocal square root, but it may not require 4 instructions for the multiplication.

V-man
10-20-2011, 05:16 AM
This is 4 subtractions, 2 multiplications, 1 addition, 1 inversesqrt, and 1 reciprocal (because sqrt might be an inversesqrt followed by a 1/x).

TOTAL = 9 clock cycles


float dis = sqrt((fragPos.y - lightPos.y) *
                 (fragPos.y - lightPos.y)
               + (fragPos.x - lightPos.x) *
                 (fragPos.x - lightPos.x));


This is 1 subtraction, 1 dot product, 1 inversesqrt.

TOTAL = 3 clock cycles



vec2 result = fragPos.xy - lightPos.xy;
float result2 = dot(result, result);
float dis = inversesqrt(result2); // note: dis now holds 1.0 / distance


and then you change your "if (dis <= 5.0)" to match, since dis now holds the reciprocal of the distance.

Mukund
10-20-2011, 05:29 AM
Thanks V-man, aqnuep, mbentrup.

@V-man

>> and then you change your "if (dis <= 5.0)"

I didn't quite get you. Change that to what?
Thanks!
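
One possible reading of V-man's suggestion, as a sketch rather than his confirmed intent: because inversesqrt() returns 1/distance, the comparison flips direction and the threshold becomes 1.0 / 5.0. The names d, d2, invDis and INV_RADIUS are illustrative, and the shader assumes GLSL 1.20 as used in this thread.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;
    float d2 = dot(d, d);             // squared distance
    float invDis = inversesqrt(d2);   // 1.0 / distance; undefined when d2 == 0.0

    const float INV_RADIUS = 1.0 / 5.0;

    // distance <= 5.0  <=>  1.0 / distance >= 1.0 / 5.0  (for distance > 0.0)
    if (invDis >= INV_RADIUS)
        gl_FragColor = color * vec4(1.0, 1.0, 1.0, 1.0);
    else
        gl_FragColor = color * vec4(0.5, 0.5, 0.5, 1.0);
}
```

The awkward case is the fragment exactly at the light position (d2 == 0.0), where inversesqrt() is undefined; comparing d2 directly against 25.0 avoids both the problem and the square root.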

aqnuep
10-20-2011, 06:05 AM
Maybe you can play with some math like min/max/clamp/ceil/floor to get the values 0.5 or 1.0 based on whether dis is greater than 5.0 or not.

Most probably even multiple ALU instructions will be faster than a conditional.
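
A minimal sketch of the kind of arithmetic aqnuep describes, assuming GLSL 1.20; the variable names d, outside and col are illustrative. Whether this actually beats the conditional depends on the hardware and compiler, so it is worth measuring.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;
    float dis = sqrt(dot(d, d));

    // ceil(dis - 5.0) is <= 0.0 inside the circle and >= 1.0 outside it,
    // so the clamp yields exactly 0.0 (inside) or 1.0 (outside).
    float outside = clamp(ceil(dis - 5.0), 0.0, 1.0);

    // 1.0 inside the circle, 0.5 outside; alpha is left untouched.
    float col = 1.0 - 0.5 * outside;
    gl_FragColor = color * vec4(col, col, col, 1.0);
}
```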

Mukund
10-20-2011, 06:27 AM
Hmm, I came up with this one:


float val = step(dis, 5.0);
gl_FragColor = mix(color * vec4(0.5, 0.5, 0.5, 1.0),
                   color * vec4(1.0, 1.0, 1.0, 1.0),
                   val);

Is this better? step would internally have to do a comparison right? So, is it inevitable that there is a loss of cycles or is it in any way avoided?
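
How step() is lowered is up to the driver's compiler, so whether it saves cycles is hardware-dependent. As a sketch under that caveat, step() can also be applied to the squared distance, which removes the sqrt() as well. The names d, d2, lit and col are illustrative, and GLSL 1.20 is assumed.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;
    float d2 = dot(d, d);            // squared distance, no sqrt needed

    // step(edge, x) returns 0.0 if x < edge and 1.0 otherwise,
    // so lit is 1.0 exactly when d2 <= 25.0 (distance <= 5.0).
    float lit = step(d2, 25.0);

    float col = mix(0.5, 1.0, lit);  // 0.5 outside the circle, 1.0 inside
    gl_FragColor = color * vec4(vec3(col), 1.0);
}
```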

aqnuep
10-20-2011, 06:28 AM
FYI, Groovounet just posted on twitter about an ALU technique that can be used for conditional elimination: http://developer.amd.com/documentation/articles/pages/New-Round-to-Even-Technique.aspx

Groovounet
10-20-2011, 06:41 AM
Ahah: many people seem to have missed that the built-in function mix also has a version with a bool type.

genType mix(genType, genType, genBType);

So this is enough:


gl_FragColor = color * mix(
    vec4(0.5, 0.5, 0.5, 1.0),
    vec4(1.0, 1.0, 1.0, 1.0),
    bvec4(dis <= 5.0)); // genBType must match the vec4 operands, so widen the bool

Mukund
10-20-2011, 07:23 AM
@aqnuep: Thanks!

@Groovounet: Well, GLSL version 1.2 (the one I'm using) doesn't seem to support it. But yeah, I didn't know we could use mix that way from 4.0 onwards.
Thanks!

Groovounet
10-20-2011, 08:15 AM
Ahhh, I didn't realize that you were using GLSL 1.20.
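
For GLSL 1.20, a rough equivalent of the bool-selecting mix is to convert the comparison result to a float and use the ordinary float-weighted mix. This is a sketch, not something proposed in the thread; the names d and inside are illustrative.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;
    float dis = sqrt(dot(d, d));

    // float(bool) converts true to 1.0 and false to 0.0.
    float inside = float(dis <= 5.0);

    // mix(x, y, a) returns x when a == 0.0 and y when a == 1.0.
    gl_FragColor = color * mix(vec4(0.5, 0.5, 0.5, 1.0),
                               vec4(1.0, 1.0, 1.0, 1.0),
                               inside);
}
```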

Ilian Dinev
10-20-2011, 11:21 AM
But then again, GPUs generally have predicated execution :)

So, fastest version should be:



varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 tmp = fragPos.xy - lightPos.xy;
    float disSq = dot(tmp, tmp);
    float col = 0.5;
    if (disSq <= 5.0 * 5.0) col = 1.0;

    gl_FragColor = vec4(color.xyz * vec3(col), 1.0);
}

Ilian Dinev
10-20-2011, 11:39 AM
And if some gpus can do single-cycle compare to 0.0f and conditionally move, then:



varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

// for scalar-ISA gpus
void main(void)
{
    vec2 tmp = fragPos.xy - lightPos.xy; // 2 fsub = 2 cycles
    float col = 0.5; // mov, 1 or 0 cycles, see below
    float disSq = tmp.x*tmp.x + (tmp.y*tmp.y - 25.0); // fmad, fmad = 2 cycles.

    if (disSq <= 0.0) col = 1.0; // 1 cycle. Some gpus might merge-in the above "col = 0.5" execution in here.

    gl_FragColor = vec4(color.xyz * vec3(col), 1.0); // 3 fmul, 1 mov = 4 cycles.
}


Things get funny when some gpus can do an fmul and an fmad together in a single cycle, though :)

V-man
10-21-2011, 03:54 AM
"gl_FragColor = vec4(color.xyz * vec3(col), 1.0); // 3 fmul, 1 mov = 4 cycles. "

That should be 1 MUL and 1 MOV
gl_FragColor.xyz = color.xyz * col.xxx;
gl_FragColor.w = 1.0;

and modern hw supports direct multiple writes to gl_FragColor.
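
Note that col is a float, and swizzling a scalar (col.xxx) is, as far as I know, only accepted from GLSL 4.20 onwards. A sketch of the same two-statement write for GLSL 1.20, relying on the vec3 * float operator instead, could look like this; the surrounding shader is Ilian Dinev's squared-distance version.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 tmp = fragPos.xy - lightPos.xy;
    float disSq = dot(tmp, tmp);
    float col = 0.5;
    if (disSq <= 5.0 * 5.0) col = 1.0;

    // vec3 * float scales every component, so no scalar swizzle is needed.
    gl_FragColor.xyz = color.xyz * col;
    gl_FragColor.w = 1.0;   // forces alpha to 1.0 rather than keeping color.a
}
```

How many hardware instructions this becomes still depends on whether the GPU has a vector or a scalar ISA, which is the point of the reply that follows.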

Ilian Dinev
10-21-2011, 11:10 AM
>> // for scalar-ISA gpus
>> void main(void)


:)

yuriks
10-28-2011, 07:47 AM
No one seems to have mentioned it, but op can simply square 5 and compare with 25 instead, doing away with the sqrt entirely. :P

sqrt[-1]
10-28-2011, 05:42 PM
>>No one seems to have mentioned it, but op can simply square 5 and compare with 25 instead, doing away with the sqrt entirely. :P

I think Ilian Dinev's code above does this.