GLSL performance with per-pixel bump mapping = slow?

Hello folks,

I know this was discussed in another thread, but I thought it would be better to start a new post.

Per-pixel lighting obviously means more calculations and thus a performance hit, but when I apply this bump-mapping shader my framerate drops from about 150 fps to 15 fps… that seems like an awful lot! :confused:

Here is the code:

VS:

varying vec4 lightDir;
varying vec3 normal, halfVector, spotDir;

void main()
{	
	vec3 tempLight;
	// computing TBN matrix for tangent space ****************************************************
	vec3 v_Normal = normalize(gl_NormalMatrix*gl_Normal); // normal to eye space
	vec3 v_Tangent = normalize(gl_NormalMatrix*gl_MultiTexCoord2.xyz); // tangent to eye space
	vec3 v_Binormal = normalize(gl_NormalMatrix*gl_MultiTexCoord3.xyz); // binormal to eye space
	
	mat3 tangentBasis = mat3( // constructor takes columns, so this lays T, B, N out as rows: the eye-to-tangent rotation
	v_Tangent.x, v_Binormal.x, v_Normal.x,
	v_Tangent.y, v_Binormal.y, v_Normal.y,
	v_Tangent.z, v_Binormal.z, v_Normal.z);
		
	// compute light vector ***********************************************************************
	vec4 ecPos;
	vec3 aux;
	
	ecPos = gl_ModelViewMatrix * gl_Vertex;
	aux = vec3(gl_LightSource[0].position-ecPos);
	tempLight = aux;
	
	// compute normal and half vector **************************************************************
	normal = normalize(gl_NormalMatrix * gl_Normal);// vertex to eye coordinates
	halfVector = normalize(gl_LightSource[0].halfVector.xyz);
	
	// convert coordinates to tangent space ********************************************************
	tempLight = tangentBasis * tempLight;
	halfVector = tangentBasis * halfVector;
	spotDir = tangentBasis * gl_LightSource[0].spotDirection;
	
	// pass texture coords to fragment shader ******************************************************	
	gl_TexCoord[0] = gl_MultiTexCoord0;
	
	// putting distance in w component of light ****************************************************
	lightDir = vec4(tempLight, 0.0);
	lightDir.w = length(aux);
	
	// convert vertex position *********************************************************************	
	gl_Position = ftransform();
}

FS:

varying vec4 lightDir;
varying vec3 normal, halfVector, spotDir;

uniform sampler2D decalMap;
uniform sampler2D normalMap;

void main()
{
	vec3 n,l,halfV;
	vec4 texel;
	float NdotL,NdotHV;
	float att;
	float spotEffect;
	float dist;
	
	// retrieve material parameters ***************************************************************
	vec4 color = gl_FrontLightModelProduct.sceneColor;
	vec4 ambient = gl_FrontLightProduct[0].ambient;
	vec4 diffuse = gl_FrontLightProduct[0].diffuse;
	vec4 specular = gl_FrontLightProduct[0].specular;
	
	// compute normals from normal map ************************************************************
	vec2 tuv = vec2(gl_TexCoord[0].s, -gl_TexCoord[0].t);
	n = 2.0 * (texture2D(normalMap, tuv).rgb - 0.5);
	n = normalize(n);
	
	// compute light ******************************************************************************	
	l = normalize(lightDir.xyz);
	dist = lightDir.w;
	
	NdotL = max(dot(n,l),0.0);

	if (NdotL > 0.0)
	{
		spotEffect = dot(normalize(spotDir), normalize(-l));
		if (spotEffect > gl_LightSource[0].spotCosCutoff)
		{
			spotEffect = pow(spotEffect, gl_LightSource[0].spotExponent);
			att = spotEffect / (gl_LightSource[0].constantAttenuation +
						gl_LightSource[0].linearAttenuation * dist +
						gl_LightSource[0].quadraticAttenuation * dist * dist);
						
			color += att * (diffuse * NdotL + ambient);
			
			halfV = normalize(halfVector);
			NdotHV = max(dot(n,halfV),0.0);
			color += att * specular * pow(NdotHV, gl_FrontMaterial.shininess);
		}
	}

	// apply texture ******************************************************************************	
	texel = texture2D(decalMap,gl_TexCoord[0].st);
	color *= texel;

	// set fragment color *************************************************************************	
	gl_FragColor = color;
}

I know this is not optimal code, but suffering such a great performance hit makes me wonder whether I did miss something…

Another point to mention is that I can’t compute the lighting entirely in the VS, because I’m not using a directional light but a spotlight, whose vectors need to be interpolated and evaluated per-fragment. Anybody got a clue?

Thanks fellows!

HardTop

You neglected to mention what hardware/drivers you’re using.

In any case, my guess would be the if-statements. Remember that on most hardware both sides of a conditional are always executed, and what’s going on inside those conditions looks pretty heavy for certain hardware.

Yeah, you’d need to mention what hardware/drivers you’re using. But just off the top of my head, that normalization of -l isn’t required, because you normalized l already :slight_smile: It won’t account for the massive FPS drop though… Give us some more info and maybe we can help!

Yeah, you’re right; I noticed it right after posting the message :slight_smile:

About my hardware, I test on 2 different configurations:

AMD Barton 3000+ / 512 MB DDR / ATI Radeon 9800 Pro
P4 2.4 GHz / 512 MB DDR / GeForce FX 5200 crap

On both PCs I get the same proportional hit: 5 FPS on the 5200 where I had 50 without the shaders, and 15 FPS on the Radeon where I had 150.

Cheers

HardTop

First of all, there is no need for the redundant assignments (and yes, they might be costing you GPU cycles; use the original variable names):

	vec4 color = gl_FrontLightModelProduct.sceneColor;
	vec4 ambient = gl_FrontLightProduct[0].ambient;
	vec4 diffuse = gl_FrontLightProduct[0].diffuse;
	vec4 specular = gl_FrontLightProduct[0].specular;
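
In other words, just use the built-ins in place. A sketch of the same two lines from the fragment shader with the temporaries dropped:

	color += att * (gl_FrontLightProduct[0].diffuse * NdotL + gl_FrontLightProduct[0].ambient);
	color += att * gl_FrontLightProduct[0].specular * pow(NdotHV, gl_FrontMaterial.shininess);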

Secondly, if you are using uncompressed textures then there is usually no need to normalize the normal fetched from the normal map.

vec2 tuv = vec2(gl_TexCoord[0].s, -gl_TexCoord[0].t);

Do this negation in the vertex shader, so that these three lines

	vec2 tuv = vec2(gl_TexCoord[0].s, -gl_TexCoord[0].t);
	n = 2.0 * (texture2D(normalMap, tuv).rgb - 0.5);
	n = normalize(n);

will become

	n = 2.0 * (texture2D(normalMap, gl_TexCoord[0].st).rgb - 0.5);
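
with the flip moved to the vertex shader, for example like this (a sketch; note that the decal lookup shares gl_TexCoord[0], so either both textures must accept the flipped t, or better, flip the normal map image once at load time instead):

	gl_TexCoord[0] = vec4(gl_MultiTexCoord0.s, -gl_MultiTexCoord0.t, 0.0, 1.0);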

Likewise, the light direction can usually be normalized in the vertex shader; there is usually no need to do this per-fragment (normalization is EXPENSIVE!).
I also fail to understand why spotDir is passed as a varying. It’s constant, so it should be passed as a uniform, and then there is no need to normalize it per-fragment either.
Pre-SM3.0 hardware cannot really handle branch statements, so the ifs are pretty much useless there; you can zero out results by multiplying with, say, NdotL. For SM3.0-capable hardware, use “discard”.
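
For instance, a branch-free version of the conditional block could look like this (a sketch reusing the names from the shader above; denom stands in for the attenuation denominator, and max() guards pow() against a negative base):

	// 1.0 when the fragment faces the light AND lies inside the cone, else 0.0
	float lit = sign(NdotL) * step(gl_LightSource[0].spotCosCutoff, spotEffect);
	att = lit * pow(max(spotEffect, 0.0), gl_LightSource[0].spotExponent) / denom;
	color += att * (diffuse * NdotL + ambient);
	color += att * specular * pow(max(dot(n, normalize(halfVector)), 0.0), gl_FrontMaterial.shininess);

On hardware that executes both sides of a branch anyway, this costs about the same as the original but compiles to straight-line code.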

			att = spotEffect / (gl_LightSource[0].constantAttenuation +
						gl_LightSource[0].linearAttenuation * dist +
						gl_LightSource[0].quadraticAttenuation * dist * dist);

There is no need to calculate the denominator per-pixel. The only variable it contains is “dist”, which is a varying. You can calculate the entire denominator in the vertex shader, pass it as a varying to the fragment shader, and the result will look the same!

There are a few other optimizations that can be done, but I will leave those up to you.

Like I told you earlier, read the GPU Programming Guide on nVidia’s website.

Great! Thanks a lot for the info. I don’t always figure out what can be:

- computed in the VS and interpolated to the FS
- calculated for each fragment in the FS
- handled as a uniform, staying the same all the way

I’ll give that a try.
Thanks!

So much so that you can actually compute
1.0 / (denominator term)
in the vertex shader and pass it as a varying to the fragment shader, so that the expensive division happens per-vertex and only a relatively inexpensive multiplication remains per-fragment.
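
In shader terms, something like this (a sketch; invAtten is a hypothetical new varying):

VS:

	float d = length(aux); // the same distance already stored in lightDir.w
	invAtten = 1.0 / (gl_LightSource[0].constantAttenuation +
				gl_LightSource[0].linearAttenuation * d +
				gl_LightSource[0].quadraticAttenuation * d * d);

FS:

	att = spotEffect * invAtten;

Strictly speaking, linearly interpolating the reciprocal across a triangle is not identical to evaluating it per-fragment (the quadratic term makes it non-linear), but on reasonably tessellated geometry the difference is invisible.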

Originally posted by Zulfiqar Malik:
First of all, there is no need for redundant assignments (and yes they might be causing you GPU cycles, use the original variable names).
The compiler should handle that gracefully. I doubt this would ever cost any extra.

Originally posted by Zulfiqar Malik:
For SM3.0 capable hardware use “discard”.
Don’t! Discard kills HiZ optimizations and doesn’t actually early-out the shader. For SM3.0 hardware, it’s better to use if-statements to branch to the end of the shader.
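
For example (a sketch, not a quote from this thread), instead of discarding you would write the cheap base color and jump past the expensive part, reusing the names from hardtop’s shader:

	if (NdotL <= 0.0)
	{
		gl_FragColor = color * texture2D(decalMap, gl_TexCoord[0].st);
		return; // on SM3.0 hardware a dynamic branch can genuinely skip the remaining instructions
	}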

Originally posted by hardtop:
NdotL = max(dot(n,l),0.0);
Use this instead:
NdotL = clamp(dot(n,l),0.0,1.0);

That should map to a single DP3_SAT instruction, while the first one maps to DP3 and MAX (unless the compiler is smart and figures out that both vectors are normalized so a dot product can’t return values above 1.0).

Originally posted by Zulfiqar Malik:
the expensive division
A scalar division isn’t that expensive. It’s a RCP and a MUL. Both are single cycle. A vector by vector division is expensive though as all components need their own RCP, so the total cost for a vec4 is five instructions.

Originally posted by Humus

The compiler should handle that gracefully. I doubt this would ever cost any extra.

Oh really? Or should I say, an nVidia compiler would handle it gracefully! I have seen ATI’s compiler behave worse than this scenario. In one particular case I declared a few CONSTANTS in a fragment shader. The nVidia compiler generated proper code, but ATI’s compiler did not inline the value of those constants, resulting in a huge shader that brought my 9700pro to its knees. I don’t know the situation right now, as I haven’t done any shader-related stuff for a couple of months.

Originally posted by Humus

Don’t! Discard kills HiZ optimizations and doesn’t actually early-out the shader. For SM3.0 hardware, it’s better to use if-statements to branch to the end of the shader.

Thanks for clearing that up. I have never as such tested the use of discard, because I didn’t have SM3.0-capable hardware until recently. But I read in some presentations that discard can help save fragment shader instructions. Can you tell me why discard won’t early-out from the shader? The spec states that it should, or is it driver dependent?

Originally posted by Humus

A scalar division isn’t that expensive. It’s a RCP and a MUL. Both are single cycle. A vector by vector division is expensive though as all components need their own RCP, so the total cost for a vec4 is five instructions.

True! But isn’t just one MUL better than a MUL and an RCP :slight_smile: ? I might sound too primitive, but I have written shaders for early hardware, and for large scenes I had to literally take each and every clock cycle into account. Although I agree that in this particular case it might not result in a spectacular increase in performance :slight_smile: .

Don’t! Discard kills HiZ optimizations and doesn’t actually early-out the shader.
I find this disappointing (about the HiZ deactivation; the other part is expected). Is this only with discard, or do other discard-like effects (alpha test, etc.) also deactivate HiZ?

Can you tell me why discard won’t early-out from the shader? The spec states that it should, or is it driver dependent?
It doesn’t early-out because it can’t.

Each fragment is not processed independently; 4-fragment blocks are processed simultaneously, each block running the same program and executing the same opcode at the same time. So if you have a conditional discard, it is more efficient to simply set a flag saying not to write that fragment and continue processing, because the other fragments in the block may not have taken the discard. It is also because of these fragment quads, so to speak, that conditional branching is difficult.

Originally posted by Zulfiqar Malik:
Oh really? Or should I say, an nVidia compiler would handle it gracefully! I have seen ATI’s compiler behave worse than this scenario. In one particular case I declared a few CONSTANTS in a fragment shader. The nVidia compiler generated proper code, but ATI’s compiler did not inline the value of those constants, resulting in a huge shader that brought my 9700pro to its knees. I don’t know the situation right now, as I haven’t done any shader-related stuff for a couple of months.
Well, I’ve not seen anything like that happen since pretty much the first driver release to support GLSL, so I don’t know how you managed.

Originally posted by Zulfiqar Malik:
But I read in some presentations that discard can help save fragment shader instructions. Can you tell me why discard won’t early-out from the shader? The spec states that it should, or is it driver dependent?
The spec doesn’t say how it should be done. As long as it’s functionally equivalent it’s within spec, and I don’t think any hardware really early-outs at discard. It will kill the fragment, but the entire shader will still be executed. On the R520, if all pixels within a quad are killed, it will at least stop sampling textures, unlike previous generations. To really early-out you have to use dynamic branching.

True! But isn’t just one MUL better than a MUL and RCP :slight_smile: ?
Yes, obviously. :slight_smile: Just pointing out that it’s not that expensive. Lots of people are just used to it being very slow on the CPU, so they assume it’s slow on GPUs too.

Originally posted by Korval:
I find this disappointing (about the HiZ deactivation; the other part is expected). Is this only with discard, or do other discard-like effects (alpha test, etc.) also deactivate HiZ?
Alpha test also deactivates HiZ; depth and stencil test don’t.
Note that it is only disabled for passes that use it, so if you later turn alpha test off, HiZ will be enabled again. You don’t have to be paranoid about using it at all; just keep in mind that things will be slower when you do.

Originally posted by Humus

Well, I’ve not seen anything like that happen since pretty much the first driver release to support GLSL, so I don’t know how you managed.

Sigh. This was just one example, my friend; ATI drivers have given me countless sleepless nights, with my colleagues always on my a** telling me to make the shaders more efficient on their machines as well. Hours and hours of debugging, only to find out that the compiler was screwing around :frowning: . It was barely usable in the beginning, but eventually got better. No offense intended, just something I experienced first hand over several months.

Most of the time my cheap 5700 Ultra was giving twice the performance of a 9700pro.

Originally posted by Humus

The spec doesn’t say how it should be done. As long as it’s functionally equivalent it’s within spec, and I don’t think anyway hardware really early-outs at discard. It will kill the fragment, but the entire shader will still be executed. On the R520, if all pixels within a quad are killed, it will at least stop sampling textures, unlike previous generations. To really early-out you have to use dynamic branching.

Thanks for the info. So this means it would actually be better to use branch statements: on older hardware the code gets executed anyway, and on modern hardware with branch support it will perform better? That would be pretty good, since one could then ship a single shader that performs reasonably optimally on all sorts of hardware.

Hello folks,

I’ve just tested the hints collected here, and the code looks much cleaner, but I don’t notice much of a performance increase…

To the question “why do you pass spotDir as a varying?”: I do the same for lightDir and lightPos. I pass them as varyings because I must transform them into tangent space before passing them to the FS. I can’t figure out what would be cheaper GPU-wise (not because there is no way, but because I’m a n00b at shader programming :wink: )

Thanks for the clue

HardTop

I don’t understand why lightPos and spotDir need to be in tangent space. The only thing you need for per-pixel lighting is a light direction vector in tangent space, so that it can be dotted with the normal from the normal map. The attenuation can be done in world/eye space.
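
A sketch of that split (hypothetical varying names; aux and tangentBasis are the ones from the vertex shader above, and only the vector that gets dotted with the normal map is rotated into tangent space, while the spot cone test stays in eye space, where gl_LightSource[0].spotDirection already lives):

VS:

	varying vec3 tsLight; // tangent space, for N.L against the normal map
	varying vec3 esLight; // eye space, for the spot cone test
	...
	esLight = aux;                // vertex-to-light vector in eye space
	tsLight = tangentBasis * aux; // same vector rotated into tangent space

FS:

	float NdotL = max(dot(n, normalize(tsLight)), 0.0);
	float spotEffect = dot(normalize(gl_LightSource[0].spotDirection), normalize(-esLight));

With this split, spotDir no longer needs to be a varying at all.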

Originally posted by hardtop

I’ve just tested the hints collected here, and the code seems much cleaner, but I don’t notice much performance increase…

That’s strange! Such a compact fragment shader (at least the one I have in mind :slight_smile: ) should give you plenty of performance. Maybe your vertex shader is choking the vertex processor? Can you post the entire source code, i.e. the vertex and fragment shaders? On another note, you could grab some per-pixel lighting code (there is plenty of lighting-related source code available online) and test your application’s performance with that.

I convert spotDir, lightDir and lightPos to tangent space because I need dot products between spotDir and lightDir (for the spot cutoff effect) and between the half vector and the normal (for the specular component), but I might be doing something wrong. Here are the full original shaders (for bump mapping):

(These are the same vertex and fragment shaders as posted at the top of the thread, unchanged.)

Please note these shaders are not yet optimized according to your advice; I am still testing the whole thing, so I’m posting what I know runs all right on my machine.

Instead of computing spotDir in tangent space in the VS and then passing it to the FS, I tried this:

spotE = dot(normalize(gl_LightSource[0].spotDirection), normalize(-tempLight));

in the VS. tempLight is basically the vertex-to-light vector in eye space. My FS remains the same except that the following lines:

spotEffect = dot(normalize(spotDir), normalize(-l));
if (spotEffect > gl_LightSource[0].spotCosCutoff)

become this:

if (spotE > gl_LightSource[0].spotCosCutoff)

where spotE is a varying which was calculated in the VS.

This works (in a way), but I notice no performance increase, and worse: the bump is correct BUT the spot term is now computed per-vertex and interpolated. So if I have a big quad in front of me and illuminate its center, I see no light; I would have to “illuminate a vertex”. No good.

Shader development is definitely not that easy :eek: