How costly is a sqrt in a fragment shader?

Hello all,

This is my fragment shader:


varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    float dis = sqrt((fragPos.y - lightPos.y) *
                     (fragPos.y - lightPos.y)
                   + (fragPos.x - lightPos.x) *
                     (fragPos.x - lightPos.x));

    if (dis <= 5.0)
        gl_FragColor = color * vec4(1.0, 1.0, 1.0, 1.0);
    else
        gl_FragColor = color * vec4(0.5, 0.5, 0.5, 1.0);
}

Can anyone please tell me how costly this is?
I just want to check for a circular region and assign colors. Please let me know if there is a better way to do the same.

Thanks a lot!

It shouldn't be that expensive. Reciprocal square root is only a single instruction on modern GPUs, which means sqrt() should take no more than two. However, the second step is a division, which may be a bit expensive.

You could use inversesqrt() instead and change your comparison accordingly, but I don't think you need to be afraid of the square root calculation. The branching (if) is actually much more costly.

Thanks for the reply aqnuep.

>>Actually the branching (if) is much more costly.

Well, I need to do the check for every fragment. Any idea how I can make it better?

Thanks!

aqnuep is absolutely correct, except sqrt() is usually implemented as inversesqrt() + multiplication, not inversesqrt() + division, so it should be quick.
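
To make the "inversesqrt() + multiplication" point concrete, here is a minimal GLSL 1.20 sketch of the identity sqrt(x) = x * inversesqrt(x), dropped into the original shader. The helper name sqrtViaRsq is made up for illustration, and this is only an assumption about how a driver might lower sqrt(), not the output of any particular compiler.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

// Hypothetical helper: sqrt(x) written as x * (1 / sqrt(x)).
// inversesqrt() is undefined for x <= 0.0, so unlike sqrt(0.0)
// this form is not guaranteed to give 0.0 when x is exactly 0.0.
float sqrtViaRsq(float x)
{
    return x * inversesqrt(x);
}

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;
    float dis = sqrtViaRsq(dot(d, d));

    // Same test as the original shader: full brightness inside radius 5.
    float shade = (dis <= 5.0) ? 1.0 : 0.5;
    gl_FragColor = color * vec4(vec3(shade), 1.0);
}
```

In real code you would just call sqrt() and let the compiler pick the lowering. The sketch only shows why the cost is roughly one reciprocal square root plus one multiply.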

One thing you should keep in mind though is that sqrt is not vectorized on most GPUs, so computing a sqrt(vec4) requires 4*2 instructions.

Yes, you are right. As usual, if the GLSL compiler is smart enough, it may figure out that no division/multiplication is needed at all, or that a single multiplication is enough. It is also true that sqrt(vec4) will most probably require 4 instructions for the reciprocal square root, but it may not require 4 instructions for the multiplication.

This is 4 subtractions, 2 multiplications, 1 addition, 1 inverse square root, and 1 reciprocal (because sqrt might be an inversesqrt followed by a 1/x).

TOTAL = 9 clock cycles

    float dis = sqrt((fragPos.y - lightPos.y) *
                     (fragPos.y - lightPos.y)
                   + (fragPos.x - lightPos.x) *
                     (fragPos.x - lightPos.x));

This is 1 subtraction, 1 dot product, 1 inversesqrt.

TOTAL = 3 clock cycles


	vec2 result = fragPos.xy - lightPos.xy;
	float result2 = dot(result, result);
	float dis = inversesqrt(result2);

and then you change your “if (dis <= 5.0)”

Thanks V-man, aqnuep, mbentrup.

@V-man

>> and then you change your “if (dis <= 5.0)”

I didn’t quite get you. Change that to what?
Thanks!
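
One possible reading of that suggestion, as a sketch rather than V-man's own answer: after the switch to inversesqrt(), the variable holds 1/distance instead of distance, so the comparison flips direction and the threshold becomes 1/5. The name invDis is introduced here for clarity and is not from the thread.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;
    float invDis = inversesqrt(dot(d, d)); // 1 / distance

    // distance <= 5.0  is equivalent to  1/distance >= 1/5  (for distance > 0).
    // inversesqrt() is undefined when dot(d, d) is 0.0, i.e. for a fragment
    // lying exactly on the light position.
    if (invDis >= 1.0 / 5.0)
        gl_FragColor = color * vec4(1.0, 1.0, 1.0, 1.0);
    else
        gl_FragColor = color * vec4(0.5, 0.5, 0.5, 1.0);
}
```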

Maybe you can play with some math like min/max/clamp/ceil/floor to get the values 0.5 or 1.0 based on whether dis is greater than 5.0 or not.

Most probably even multiple ALU instructions will be faster than a conditional.
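
As a rough illustration of the floor/clamp idea, the sketch below builds a 0.0-or-1.0 mask without an if and uses it to pick 0.5 or 1.0. The names inside and shade are made up for this example, and whether it actually beats a conditional depends on the GPU and compiler.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;
    float dis = sqrt(dot(d, d));

    // dis <= 5.0: (5.0 - dis) >= 0, so floor(...) + 1.0 >= 1.0, clamped to 1.0
    // dis >  5.0: (5.0 - dis) <  0, so floor(...) + 1.0 <= 0.0, clamped to 0.0
    float inside = clamp(floor(5.0 - dis) + 1.0, 0.0, 1.0);

    // Map the mask to the two brightness levels: 0.5 outside, 1.0 inside.
    float shade = 0.5 + 0.5 * inside;

    gl_FragColor = color * vec4(vec3(shade), 1.0);
}
```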

Hmm, I came up with this one:


float val    = step(dis, 5.0); 
gl_FragColor = mix( color *  vec4(0.5, 0.5, 0.5, 1.0), 
                    color *  vec4(1.0, 1.0, 1.0, 1.0), 
                    val);

Is this better? step would internally have to do a comparison right? So, is it inevitable that there is a loss of cycles or is it in any way avoided?

FYI, Groovounet just posted on twitter about an ALU technique that can be used for conditional elimination: http://developer.amd.com/documentation/articles/pages/New-Round-to-Even-Technique.aspx

Ahah: many people seem to have missed that the built-in function mix also has a version that takes a bool type.

genType mix(genType, genType, genBType);

So this is enough:


gl_FragColor = color * mix(
  vec4(0.5, 0.5, 0.5, 1.0), 
  vec4(1.0, 1.0, 1.0, 1.0), 
  dis <= 5.0);

@aqnuep: Thanks!

@Groovounet: Well, GLSL version 1.20 (the one I'm using) doesn't seem to support it. But yeah, I didn't know we could use mix that way from 4.0 onwards.
Thanks!
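
For GLSL 1.20, a close equivalent can be sketched by converting the comparison result with the float(bool) constructor (true becomes 1.0, false becomes 0.0) and passing it to the float version of mix(). The sketch also compares squared distances, so no sqrt is needed. The name inside is made up here, and whether the driver emits a branch for this is not something the shader source can guarantee.

```glsl
#version 120

varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;

    // 1.0 inside the circle of radius 5, 0.0 outside (squared-distance test).
    float inside = float(dot(d, d) <= 5.0 * 5.0);

    gl_FragColor = color * mix(vec4(0.5, 0.5, 0.5, 1.0),
                               vec4(1.0, 1.0, 1.0, 1.0),
                               inside);
}
```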

Ahhh, I didn't realize that you were using GLSL 1.20.

But then again, GPUs generally have predicated execution :)

So, fastest version should be:



varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    vec2 tmp = fragPos.xy - lightPos.xy;
    float disSq = dot(tmp, tmp);
    float col = 0.5;
    if (disSq <= 5.0 * 5.0) col = 1.0;

    gl_FragColor = vec4(color.xyz * vec3(col), 1.0);
}


And if some GPUs can do a single-cycle compare against 0.0 followed by a conditional move, then:


varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

// for scalar-ISA gpus
void main(void)
{
	vec2 tmp = fragPos.xy - lightPos.xy; // 2 fsub = 2 cycles
	float col = 0.5; // mov, 1 or 0 cycles, see below
	float disSq = tmp.x*tmp.x + (tmp.y*tmp.y  - 25.0); // fmad, fmad = 2 cycles.

	if (disSq <= 0.0) col = 1.0; // 1 cycle . Some gpus might merge-in the above "col = 0.5" execution in here. 
	
	gl_FragColor = vec4(color.xyz * vec3(col), 1.0); // 3 fmul, 1 mov = 4 cycles. 
}

Things get funny when some GPUs can do an fmul and an fmad together in a single cycle, though :)

"gl_FragColor = vec4(color.xyz * vec3(col), 1.0); // 3 fmul, 1 mov = 4 cycles. "

That should be 1 MUL and 1 MOV:
gl_FragColor.xyz = color.xyz * col; // vector * scalar; a float cannot be swizzled as col.xxx in GLSL 1.20
gl_FragColor.w = 1.0;

and modern hw supports direct multiple writes to gl_FragColor.

:)

No one seems to have mentioned it, but the OP can simply square 5 and compare the squared distance with 25 instead, doing away with the sqrt entirely. :P

I think Ilian Dinev’s code above does this.
