How to disable GLSL uniform variable optimization

The following happens on my NVIDIA NV40 board with driver 91.63 under WinXP32 SP2.

My application draws a primitive with a quite large fragment shader. After that I change a uniform variable (vec4) from zero to something nonzero (in one of the 4 components). Then another primitive gets drawn, BUT with a delay of 400ms. I figured out that this delay is introduced when a uniform value gets changed for the first time since its initial set (I believe when it’s set to nonzero and was zero before).
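A stripped-down version of the sequence looks like this (the uniform name “weight” and drawScene() are just placeholders for my real code):

    GLint loc = glGetUniformLocation(program, "weight"); /* placeholder name */

    glUseProgram(program);
    glUniform4f(loc, 0.0f, 0.0f, 0.0f, 0.0f);  /* initial set: all components zero */
    drawScene();                               /* draws fine */

    glUniform4f(loc, 1.0f, 0.0f, 0.0f, 0.0f);  /* first nonzero value for x */
    drawScene();                               /* this draw stalls for ~400ms */

    glUniform4f(loc, 2.0f, 0.0f, 0.0f, 0.0f);  /* later changes to x: no stall */
    drawScene();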

My assumption is that the driver performs some “optimization” on the shader based on the values of the uniforms. So if a uniform is set to zero, some computations might become obsolete and can therefore be omitted from the shader.

But if the uniform gets changed afterwards, the shader needs to be re-compiled (re-optimized), introducing that delay. If another uniform register gets changed to nonzero (for example, another component of that vector), the delay occurs again. Further changes to those uniform registers don’t introduce another delay. It seems that the driver marks these uniform registers as “changeable” and doesn’t “optimize” for them again.

What I want is to mark those uniform registers as changeable from the beginning, or to somehow turn off this kind of optimization.

Does anyone know more about this?
Any help is appreciated.

I don’t think the compiler will optimize based on uniform values; you can verify this by looking at the generated assembly with the “NVemulate” tool from developer.nvidia.com.

How long does it take for the primitive to render before you change the uniform?
Make sure you benchmark correctly: add a glFinish before the uniform update and measure the time there.
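Something along these lines (timer() stands in for whatever high-resolution timer you use, e.g. QueryPerformanceCounter on WinXP; loc and drawPrimitive() are placeholders):

    glFinish();                    /* drain all pending work first */
    double t0 = timer();           /* timer() = your high-resolution timer */

    glUniform4f(loc, 1.0f, 0.0f, 0.0f, 0.0f);
    drawPrimitive();               /* placeholder for your draw call */
    glFinish();                    /* force completion before stopping the clock */

    double t1 = timer();
    printf("uniform update + draw: %.1f ms\n", (t1 - t0) * 1000.0);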

If your shader is that complex, the uniform update might need to wait until the pipeline has finished rendering before it can take place, which will make it lose asynchronicity.

You’d need to provide the shader to get more feedback. For example, it’s unclear what effect the uniform you change has on the shader’s execution, e.g. on loops etc.

Did this happen with previous drivers?
There is a newer driver as well:
http://www.nvidia.com/object/winxp_2k_93.71.html

I think the optimization is done on the low-level shader code inside the driver, not on the GLSL / Cg generated assembly. I guess that during program bind, the driver checks for constant expressions that do not contribute to the output and re-compiles the shader if needed. I don’t know of any API way to disable this.

I guess that during program bind, the driver checks for constant expressions that do not contribute to the output and re-compiles the shader if needed.
Uniforms can only be changed while the program is bound; there should be no additional bind happening as long as you only change uniforms.
Of course constant expressions can be optimized, but uniforms are inherently dynamic.
You don’t want to recompile a shader at any level when changing a uniform.
Uniform changes happen often; every OpenGL state accessible from GLSL is such a uniform.
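Just think of the built-in state uniforms, e.g. in a vertex shader:

    // Built-in state uniforms like these change all the time; recompiling
    // the shader on every change would be unusable:
    void main()
    {
        gl_Position   = gl_ModelViewProjectionMatrix * gl_Vertex;
        gl_FrontColor = gl_LightSource[0].diffuse * gl_Color;
    }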

Relic, it seems to be a fact that changing uniforms can provoke a shader re-compile and re-upload (at least on NVIDIA).

Here’s a topic that discussed a similar case:
http://www.gpgpu.org/forums/viewtopic.php?t=2257

The current status (guesses) of my investigation:

  • It’s not a matter of zero/nonzero, it’s a matter of constant/non-constant.
  • The NVIDIA driver seems to cache multiple compiled/optimized instances of the shader for specific uniform parametrizations. Switching between those is fast; introducing a new one is, of course, slow.
  • There seems to be a non-trivial heuristic in the driver that decides whether a uniform is currently constant or not, based on the sequence of values the application has sent to the uniform in the past.

I’ve still found no way to convince the driver that my uniform remains non-constant to prevent those lags.

I’m looking forward to finding a solution.
Too bad if StafanB is right.

Oh, and the driver version is 91.36, not 91.63.

Since I have a Quadro board, the GeForce driver doesn’t install.

Thanks

I can confirm that uniforms on NV40 and lower are hard-coded, so changing a uniform will trigger a shader recompile. Although 400ms is a bit too long… You can’t do anything about it - this is just the way NVIDIA’s hardware works…

But when I change the uniform to a new value for each primitive, the lag does not appear. If necessary, I’ll try to create a minimal test case.
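Based on that, a workaround I might try is to “warm up” the uniform at load time by cycling it through a few distinct values with dummy draws in between, hoping the heuristic then marks it as non-constant. No idea yet whether the driver can actually be fooled this way:

    /* Speculative warm-up at load time; loc and drawDummyPrimitive() are
       placeholders. The hope is that the heuristic sees a "changing" uniform. */
    glUseProgram(program);
    for (int i = 1; i <= 4; ++i) {
        glUniform4f(loc, (float)i, (float)i, (float)i, (float)i);
        drawDummyPrimitive();  /* actually draw, so the value is really consumed */
    }
    glFinish();                /* absorb any recompile stalls here, up front */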

For your info, the lag does not appear on an 8800GTX with driver 97.02.

Originally posted by Kubis:
I’ve still found no way to convince the driver that my uniform remains non-constant to prevent those lags.

I’m looking forward to finding a solution.
Too bad if StafanB is right.
I think there are only two solutions to this:

  • NVIDIA adding some kind of enable/disable switch
  • Ensuring that all code paths of your shader contribute to the result, e.g. by using a color output channel to write out the part of the computation that could be removed by a constant uniform (see the sketch below). I think the optimizer only generates a specialized shader if a certain cost metric is met (e.g. if a bunch of texture lookups or expensive arithmetic instructions can be safely removed).
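A sketch of the second idea (envMap and the exact terms are made up; the only point is that the costly path feeds an output channel even when ks is 0.0):

    uniform float ks;          // the uniform that might get "specialized" to 0.0
    uniform sampler2D envMap;  // placeholder for whatever the costly path samples

    void main()
    {
        vec3 diffuse = vec3(0.5);                                 // placeholder
        vec3 spec    = texture2D(envMap, gl_TexCoord[0].xy).rgb;  // stands in for
                                                                  // the expensive part
        gl_FragColor.rgb = diffuse + ks * spec;
        // Route a visually negligible function of 'spec' into alpha, so the
        // optimizer cannot prove the specular path is dead when ks == 0.0:
        gl_FragColor.a = 1.0 - 1e-6 * dot(spec, vec3(1.0));
    }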

Originally posted by Kubis:
For your info, the lag does not appear on an 8800GTX with driver 97.02.
I still see the lag with 97.02 on a GeForce Go 7900 GS
(around 300ms for a pretty complex shader with lots of possible ‘multiply by zero’ uniform parameter cases)

Maybe they didn’t have the time to port this kind of low-level optimization to the new hardware architecture…

8800GTX is not “NV40 or lower”, it is a new core and will be able to access uniforms in memory

Originally posted by Zengar:
8800GTX is not “NV40 or lower”, it is a new core and will be able to access uniforms in memory
Well, it doesn’t matter if the uniforms can be updated separately or not, since this is not part of the optimization. Given something like

 result = t*f(x) + (1-t)*g(x) 

where ‘t’ is a uniform parameter,
the idea is simply to check whether f and/or g are ‘costly’ functions and whether it’s OK to generate two specialized shaders for the cases t=0 and t=1 (plus one general shader), switching between them depending on some heuristic. Since GPU programs are pretty much side-effect free, it’s OK to do this kind of optimization. So I think the lag is due to the generation of a new machine-code shader when we hit one of the ‘specialized’ cases, and not due to the upload/patching of uniform parameters.
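In GLSL terms, something like this (volumeTex and the lookups are just placeholders for some heavy computation):

    uniform float t;              // the parameter the driver may specialize on
    uniform sampler3D volumeTex;  // placeholder data source for the 'heavy' work

    vec4 f(vec3 p) { return texture3D(volumeTex, p); }        // imagine many
    vec4 g(vec3 p) { return texture3D(volumeTex, p + 0.5); }  // lookups here

    void main()
    {
        vec3 x = gl_TexCoord[0].xyz;
        // With t == 0.0 the whole f() term is dead code; the driver could
        // compile a specialized shader without it (and recompile when t changes).
        gl_FragColor = t * f(x) + (1.0 - t) * g(x);
    }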

What makes you conclude that some sort of optimization is taking place? What I’m saying is simply: uniforms are actually constants on NV40 and lower, so the shader gets recompiled every time you change one (recompiled, not patched!). There is no way to disable it. On G80 there is no penalty, because the uniforms may be updated dynamically (as Kubis stated with his new card).

Well, it is pretty easy to test this:
Write a shader that does the linear interpolation described in the previous post. Implement f and g as really ‘heavy’ functions, e.g. sampling a large 3D texture 64 times like

   for (int i = 0; i < 64; ++i)
       r0 += texture3D(volumeTex, r0.xyz);   // each lookup depends on the previous one

so that all operations contribute to the result.
Now measure the performance with t=0 versus t=1 and you’ll see a big difference on NV40 hardware.

I first noticed this behaviour when I had a rather complex material shader for the diffuse and specular parts. As usual, the result of this shader is weighted by kd and ks parameters. When setting these two to 0.0, I noticed a rather long lag, and suddenly the frame rate increased by an order of magnitude.

As I said before, this optimization has nothing to do with the upload mechanism for shader uniforms etc. It’s just an optimization technique.

Zengar, could you reveal the source of your information about the NV40 behaviour you describe?!

Originally posted by Zengar:
What I’m saying is simply: uniforms are actually constants on NV40 and lower, so the shader gets recompiled every time you change one (recompiled, not patched!).
If that were the case, how could it be that the delay only occurs when changing the uniform from 0 to non-0, but not when changing from one non-0 value to a different non-0 value?

Originally posted by eyebex:
Originally posted by Zengar:
What I’m saying is simply: uniforms are actually constants on NV40 and lower, so the shader gets recompiled every time you change one (recompiled, not patched!).
If that were the case, how could it be that the delay only occurs when changing the uniform from 0 to non-0, but not when changing from one non-0 value to a different non-0 value?
Well, both can be true: the change of a uniform may require a shader upload on certain cards, which shouldn’t be a big deal in terms of performance (no optimizations or changes to the control flow, just ‘patching’ the old values to the new values). But if a completely new shader is generated (the 0 vs. non-0 case), you have the performance penalty of compiling a brand-new shader. I guess that NVIDIA has more than one optimization stage (high-level to low-level, just like every other compiler). The constant-uniform optimization is a high-level optimization, so the new shader has to go through all the other stages. Also, the state can change (the number of used interpolators or texture units).
That’s my guess for the time lag.

Originally posted by Kubis:
Zengar, could you reveal the source of your information about the NV40 behaviour you describe?!
I don’t really know… But someone from Nvidia said it once, either here on the forums or on the Nvidia developer forums…

Originally posted by eyebex:
If that were the case, how could it be that the delay only occurs when changing the uniform from 0 to non-0, but not when changing from one non-0 value to a different non-0 value?
Oh, I didn’t know that… Sorry, I thought the penalty was similar in every case. In that case, what I said is wrong (at least partly :) )…

I still know of no way to disable it… Maybe you could PM cass or pbrown?

The change of a uniform may require a shader upload on certain cards, which shouldn’t be a big deal in terms of performance (no optimizations or changes to the control flow, just ‘patching’ the old values to the new values)
What if the driver splits the branching code ahead of time, then simply selects a path based on the value of a uniform boolean, sort of like what I might do myself with separate shaders, only in this case it’s automagic? I wonder if any assumptions in this area are truly safe with respect to runtime optimizations.
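I mean something like this (useSpecular and both paths are hypothetical):

    uniform bool useSpecular;   // hypothetical path-selection uniform
    uniform sampler2D base;

    void main()
    {
        vec4 color = texture2D(base, gl_TexCoord[0].xy);  // the cheap path
        // Uniform condition: the same outcome for every fragment, so the
        // driver could pre-split this into two specialized shaders:
        if (useSpecular)
            color += pow(color, vec4(16.0));
        gl_FragColor = color;
    }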

But as you say, since these are only modifications to constant data, it’s (perhaps) a good bet that no real validation is necessary, which I expect would otherwise account for a good chunk of the latency.