
Shader performance/complexity problem.



Nasa Gvis
08-05-2010, 02:20 PM
Hello,

I'm an experienced graphics programmer, but am very new to GLSL. I'm writing a number of shaders for visual effects in a project I'm coding via OpenSceneGraph.

My problem is that this particular fragment shader is killing my performance. I get around 40-50 fps without it, but when adding this shader, I'm knocked down to about 10-15 fps. Being new to GLSL, I really have no frame of reference for how many operations a shader can contain and still be manageable. The problem occurs even when this is the only shader active.

There are only a few thousand polygons in the scene, and the geometry that's being shaded covers only a small fraction of the pixels, I'd say less than 1%.

I've listed the code below. Is this really too expensive a shader, or is something else going on here, like the shader recompiling constantly (which is my hope, since then I could find the fix)? When I hard-code a value into "offset" near the bottom, just before assigning to gl_FragColor (as a test), performance is fine, but I'm guessing the compiler is optimizing out all the code above it since it's then not needed, right?

Any help would be appreciated, as the performance is unusable as is.

Thanks.



varying vec4 fragpos;
uniform float timeSU;
uniform float cyclepos;

void main()
{
    // Varyings are read-only in a fragment shader, so work on a local copy.
    vec4 pos = fragpos;
    pos.z += 3.96;

    vec4 green = vec4(0.0, 0.2, 0.0, 0.5);

    float pulsealpha = abs(sin(cyclepos));

    vec2 dir = pos.xy;
    vec2 dirNorm = normalize(dir);
    vec3 dirNorm3 = vec3(dirNorm.x, dirNorm.y, 0.0);

    vec3 basepoint = dirNorm3 * 2.12;
    vec3 fragpointer = normalize(pos.xyz - basepoint);

    // Avoid naming a variable "dot": it shadows the built-in function.
    float d = dot(fragpointer, dirNorm3);
    float minangle = degrees(atan(fragpointer.z, d));

    float sgn = 1.0;
    if (mod(cyclepos, 6.28) > 3.14)
        sgn = -1.0;

    float offset = sgn * minangle * 0.0055555556; // == 1.0 / 180.0
    offset = offset + 0.5;
    offset = mod(offset + timeSU, 0.25) * 4.0;
    offset *= 0.75;

    green[3] = (0.75 - offset) * pulsealpha * pulsealpha;
    gl_FragColor = green;
}

DmitryM
08-05-2010, 02:29 PM
Well, I see the following heavy instructions:
* 2 normalizations - nothing can be done about those
* 1 'if' instruction - this can be fixed:


float sgn = sign(3.14 - mod(cyclepos, 6.28));

* 2 trigonometry instructions - these can be reduced to one by pre-calculating 'pulsealpha' and passing it in as a uniform.

Please let me know if any of that helps.
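As a quick sanity check that the branchless form matches the original 'if', here is a small numerical sketch in Python (the thread's shaders are GLSL; math.fmod and copysign stand in for GLSL's mod and sign):

```python
import math

def sgn_branch(cyclepos):
    # Original shader logic: sgn = -1.0 when mod(cyclepos, 6.28) > 3.14
    return -1.0 if math.fmod(cyclepos, 6.28) > 3.14 else 1.0

def sgn_branchless(cyclepos):
    # Suggested replacement: sign(3.14 - mod(cyclepos, 6.28))
    m = 3.14 - math.fmod(cyclepos, 6.28)
    return 0.0 if m == 0.0 else math.copysign(1.0, m)

# The two agree everywhere except exactly at the half-period boundary,
# where sign() returns 0.0; skip that single point in the check.
for i in range(1256):
    x = i * 0.01
    if abs(math.fmod(x, 6.28) - 3.14) < 1e-9:
        continue
    assert sgn_branch(x) == sgn_branchless(x)
```

Note that sign() yields 0.0 exactly at the boundary, where the original branch yields 1.0; for a pulsing alpha that single point is unlikely to matter.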

qzm
08-05-2010, 03:24 PM
I just added another post which could also apply here: perhaps NVidia is recompiling your shader each time you change a uniform?

I am guessing your uniforms are things that change per frame?

I am currently looking for an answer myself, but don't have one as yet.

Dark Photon
08-05-2010, 05:32 PM
I just added another post which could also apply here: perhaps NVidia is recompiling your shader each time you change a uniform?
GF7 or earlier -- very possible.
GF8+ -- not sure, but I think that trick died with GF8.

Nasa Gvis
08-06-2010, 01:08 PM
DmitryM,

Thank you for your suggestions. Those are nice pointers and appreciated. However, they had no impact on my performance. And it's now occurred to me that an earlier version of this shader actually had MORE trig and normalizations in it, and ran just fine.

I think I've found my problem, however. The geometry I'm shading is a geode instanced 15 times. I.e., I have the shader attached to a geode node (in OpenSceneGraph); that node has 15 parents (transform nodes), which are parented by a group node.

It appears the shader is loading separately for each of the 15 instances. Or at least, when I eliminate the instancing (reduce the parent transforms to one instead of 15, as a test), the problem goes away. The other 14 certainly don't add enough polygons or pixels to account for such a performance hit, so I conclude it's treating them like 15 separate shaders.

I tried attaching the shader program to the group node ABOVE the transforms, but the performance hit still occurs. So this is probably a question to post on an OpenSceneGraph forum, unless someone has insight here.

As a temporary fix, I can create the instances in a modeler and import it all as one single geode. But I shouldn't have to do that. I should be able to propagate the shader to child nodes without it replicating/reloading this way. (Of course, I may be wrong about the diagnosis, but it's the best theory so far.)

I'm working on an Nvidia Quadro FX 3400, if that's of any importance.

DmitryM
08-06-2010, 01:57 PM
I don't see a problem in calling glUseProgram 15 times per frame. Even if it were 150 times, it shouldn't be a problem, especially in your case, where the program is the same (the driver can simply cache the currently activated program and skip the call the other 14 times).

From the experiments you made, I think it's either some weird OSG issue or exactly that NV issue 'qzm' warned you about: each instance has its own world transformation, causing the uber-smart driver to rebuild the shader...

You can share the program with me to test on ATI, if you want.
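The caching idea can be sketched as follows. This is an illustrative Python mock (ProgramBinder and the stubbed bind function are hypothetical, not the OSG or GL API), showing why 14 of the 15 binds can be skipped when the program is the same:

```python
# Hypothetical sketch: skip redundant glUseProgram calls by remembering
# the last bound program id, the way a driver (or an app layer) can.

class ProgramBinder:
    def __init__(self, use_program):
        self._use_program = use_program  # stand-in for the real glUseProgram
        self._current = None
        self.calls = 0                   # binds that actually went through

    def use(self, program_id):
        if program_id == self._current:
            return                       # same program already bound: skip
        self._use_program(program_id)
        self._current = program_id
        self.calls += 1

binder = ProgramBinder(lambda pid: None)  # stub in place of glUseProgram
for _ in range(15):                       # 15 instances, same shader program
    binder.use(42)
assert binder.calls == 1                  # only the first bind reaches the driver
```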

Nasa Gvis
08-06-2010, 02:24 PM
Hmm, interesting. Being new to GLSL, and working at the OSG level of abstraction, I'm not at all familiar with glUseProgram or what exactly it does. So I'm intrigued that you say it shouldn't be a problem.

I did complete my temporary fix of creating the model inclusive of all the geometry, so there is no instancing involved, and it did indeed fix the problem.

It may well be a weird OSG issue, as I run into a lot of those. haha. I'll post on the OSG forum and see what they say.

As for NV rebuilding, yes, that's where my money would go. Yes, qzm, my uniforms do change per frame (sorry not to respond earlier), but it appears it's rebuilding per instance.

Thanks for the ATI offer. The project includes model files that are proprietary and protected, so I can't distribute them. But I should be able to find an ATI card around here to test on. Good suggestion.

Thanks.

Dark Photon
08-07-2010, 06:52 PM
I'm working on an Nvidia Quadro FX 3400, if that's of any importance.
Oh yeah. That's a GeForce 6800 class card. That is "way" old. Why don't you upgrade?

So yes, this old card should be subject to the recompile/reoptimize on shader uniform change behavior in the NVidia OpenGL driver.

Some threads you might want to read:
* glUniform is slow? (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=249667) (Nov 2008)
* nVidia FP uniforms driver optimization lags (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Board=3&Number=169125) (Jan 2007)
* FragProgBottleNeck.zip (http://localhost/sgglwiki/images/e/e9/FragProgBottleNeck.zip)
* Is the driver doing run-time shader re-compilation? (http://developer.nvidia.com/forums/index.php?showtopic=2350) (Oct 2008)

Then again, it could be something else, maybe something you're doing.

If you insist on using this old card, try doing what Nigel @ NVidia suggests and create a separate shader program for every uniform permutation that you are using, and see if that helps, especially if you are using a uniform in a conditional expression, which you are:


I would suggest compiling a separate shader for each combination of uniforms, provided that the total number of combinations is within reason.
There is a good rationale for omitting as many conditional blocks of code as possible, but excessive recompilation and optimisation isn't good for performance either.

On more recent cards (GF8+, which is now 4 years and what -- 6 or 7 generations old) you don't have to be so finicky about this stuff.
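Nigel's suggestion amounts to keying a program cache on the discrete uniform combinations and compiling each variant once. A hypothetical Python sketch (compile_program and ShaderPermutationCache are illustrative names, not a real API):

```python
# Hypothetical sketch: one compiled program per uniform permutation,
# looked up instead of branching on a uniform at run time.

def compile_program(defines):
    # Stand-in for real GLSL compilation; returns a fake program handle.
    body = ",".join(f"{k}={v}" for k, v in sorted(defines.items()))
    return "program<" + body + ">"

class ShaderPermutationCache:
    def __init__(self):
        self._programs = {}

    def get(self, **defines):
        key = tuple(sorted(defines.items()))
        if key not in self._programs:       # compile once per permutation
            self._programs[key] = compile_program(defines)
        return self._programs[key]

cache = ShaderPermutationCache()
a = cache.get(NEGATE_SIGN=0)   # e.g. the 'sgn = 1.0' variant
b = cache.get(NEGATE_SIGN=1)   # e.g. the 'sgn = -1.0' variant
assert a is not b
assert cache.get(NEGATE_SIGN=0) is a  # cached, not recompiled
```

As noted later in the thread, this only works for uniforms with a small discrete set of values; continuous per-frame floats can't be enumerated this way.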

kyle_
08-08-2010, 02:31 AM
On more recent cards (GF8+, which is now 4 years and what -- 6 or 7 generations old) you don't have to be so finicky about this stuff.
Did they brag about that in some specific presentations you can link?
I'm curious about this. Did they do some tricks in their conditionals, or are they using a subroutine-like feature in their shaders internally?

Dark Photon
08-08-2010, 05:44 AM
On more recent cards (GF8+, which is now 4 years and what -- 6 or 7 generations old) you don't have to be so finicky about this stuff.
Did they brag about that in some specific presentations you can link?
Brag? Not that I recall. My impression is this is something that was just done to make the most of the pre-GF8 hardware. GF8 got them "shared memory" for the GPU cores, where uniforms could presumably be stored. Also, IIRC, it got them real branching. Check your shaders, and if you're doing branching on uniforms or expressions based on uniforms, suspect that as the most likely culprit.


I'm curious about this. Did they do some tricks in their conditionals, or are they using a subroutine-like feature in their shaders internally?

I think what I've read on this said that on pre-GF8 HW they tried hard to compile all the conditionals out of the shaders. See those threads I linked you to.

Nasa Gvis
08-08-2010, 11:07 AM
I'm working on an Nvidia Quadro FX 3400, if that's of any importance.
Oh yeah. That's a GeForce 6800 class card. That is "way" old. Why don't you upgrade?
I'm not in control of the hardware configuration and purchasing for this machine. However, I will make the recommendation.



So yes, this old card should be subject to the recompile/reoptimize on shader uniform change behavior in the NVidia OpenGL driver.

Some threads you might want to read:
* glUniform is slow? (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=249667) (Nov 2008)
* nVidia FP uniforms driver optimization lags (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Board=3&Number=169125) (Jan 2007)
* FragProgBottleNeck.zip (http://localhost/sgglwiki/images/e/e9/FragProgBottleNeck.zip)
* Is the driver doing run-time shader re-compilation? (http://developer.nvidia.com/forums/index.php?showtopic=2350) (Oct 2008)

Then again, it could be something else, maybe something you're doing.
I do want to understand this issue better, so I'll look at those links; thanks for the pointers. However, it's already been determined that my problem is linked to the instancing issue, as reported in the posts above. I.e., I've resolved the instancing issue, and my performance is fine with the uniforms as is. Although perhaps it is recompiling due to uniform changes, that alone doesn't incur a noticeable hit. Or maybe the instancing and uniform-change issues confound each other somehow.




If you insist on using this old card, try doing what Nigel @ NVidia suggests and create a separate shader program for every uniform permutation that you are using, and see if that helps, especially if you are using a uniform in a conditional expression, which you are:


I would suggest compiling a separate shader for each combination of uniforms, provided that the total number of combinations is within reason.
There is a good rationale for omitting as many conditional blocks of code as possible, but excessive recompilation and optimisation isn't good for performance either.


It would be impossible to create separate shaders in this way, as my uniforms are continuous-valued floats that do indeed change per frame.

As for the conditional, again, it's already been determined and reported above that the conditional expression has been removed and tested, and it does not impact my performance at all.

But thanks for the pointers. Looks like I need to push for a hardware upgrade.

Dark Photon
08-09-2010, 04:24 AM
I've resolved the instancing issue, and my performance is fine with uniforms as is.
Good deal. That's the main thing. I retract my suggestions.

Nasa Gvis
08-12-2010, 07:57 AM
Ok, a little foot in mouth disease for me here.

Instancing is not the problem at all. There was a small oversight in my testing. There are two different versions of the geometry I've been considering using. They are almost identical, with a very small difference amounting to only a few extra polygons.

I neglected to control which version of the geometry I was testing with. Using the version with the few extra polygons causes the performance hit. Using the version without them is fine, even when instancing. It's quite a mystery why those few polygons bring the GPU to its knees. I'm guessing it's the nature of the polygons, and what their position requires of my shader, rather than a simple polygon count.

Anyway, I said the card I've been using is a Quadro FX 3400, but it's actually a 3400/4400. However, I tried it on another machine with a 4800, and it doesn't bat an eye; it runs like a dream. I even upped the instances to 130 (with the problem geometry), and saw no hit whatsoever.

So it is indeed an Nvidia hardware issue. It's not entirely clear what exactly is happening and why (a shader rebuild per uniform change doesn't quite cover it), but it doesn't matter. I need to upgrade either way.

Thanks for everyone's input. It was very helpful.