Cost of using functions instead of a variable

Hi, I am making a little OpenGL “script”. By “script” I mean that the shader code is generated based on the needs (materials) of a model.
So when some code for, let’s say, diffuse color is generated, would it be OK (for performance) to do:


out vec4 pixelColor;
vec4 calcDiffuse(){
	//gencode
}
void main(void){
	pixelColor=calcDiffuse();
}

instead of directly setting pixelColor to a value produced by the generated code.
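
That is, instead of something like this, where vec4(1.0) is just a stand-in for whatever the generator emits:


out vec4 pixelColor;
void main(void){
	// generated diffuse-color code emitted straight into main(),
	// assigned without going through a helper function
	pixelColor = vec4(1.0);
}
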

My actual pixel shader template (deferred shading) is posted on Pastebin.

Not a simple question… or answer. A couple of observations.

If you ask C/C++ compilers for maximum optimization (which you usually do only when building or testing a release version), they will often put the function code “inline”, without any function call/return overhead… IF… the code is short enough and not problematic in any way from the compiler’s point of view.

Sometimes functions execute faster than inline code (though not for something as simple as assigning a value to a variable, which the compiler will inline). Why is this so? If you call a very common function extremely often, it will be in L1 cache (very rare, because L1 is very tiny), or it might be in L2 cache, in which case it will execute blazing fast. One “old time” mistake some compilers used to make (maybe some bad ones still do) was to inline functions that were longer than they should be. What this did was increase the total size of the often-executing code. Why? Because 100 copies of a section of code are longer than 1 copy (in the function). But I don’t think you’ll run into this mistake any more.

A more reliable or helpful analysis would require a lot of attention. And the fact is, it is usually easier to figure out the answer by benchmarking the various options in the context of the actual program. However, the huge mistake here is… programmers often do this long before the program is finished, when the nature and size of the program are not what they will be when completed.

The simple lesson that is usually correct is… don’t optimize too early. Don’t worry about such things too early. Perhaps the only optimizations that you should consider “early” are… those that are an absolute nightmare to change later. Most are not. Anything you can change in small bits and pieces is not. Don’t worry, be happy == get it working first!

One trivial but important point to remember about short functions: since functions are called very often, and since function local variables are accessed extremely often, the near portion of the stack is almost always in L1 or L2 cache (or L3 at worst). Which means the time overhead to call a small function is usually insanely small, much smaller than you’d imagine looking at the code executed. Do recall that any access of memory that is out there in the dynamic DRAM chips is 10x, 20x, 50x slower than L1 [or L2] cache (depending on what code executed immediately before).

Yes, you are absolutely right. This is a case of “premature optimization is the root of all evil”, but I was wondering if this is an okay thing to do in a general sense, as I am not too familiar with OpenGL compiler optimization. Is it up to the GPU driver?

Oh, also, I have just benchmarked it and there is no observable difference. Kind of expected, as it was benchmarked on a GTX 980; maybe older hardware with older drivers would show some tiny difference.
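
In case anyone wants to reproduce the measurement, one way to time a draw on the GPU is a timer query. A minimal sketch, assuming a GL 3.3+ context with function pointers loaded (e.g. via GLAD); drawScene() here is a hypothetical stand-in for rendering with the shader variant under test:


#include <stdio.h>
#include <glad/glad.h>   /* assumption: GLAD loader; any GL loader works */

/* hypothetical helper: times one draw on the GPU with a timer query */
void time_draw(void (*drawScene)(void))
{
    GLuint query;
    GLuint64 elapsedNs = 0;

    glGenQueries(1, &query);
    glBeginQuery(GL_TIME_ELAPSED, query);
    drawScene();                     /* the work being measured */
    glEndQuery(GL_TIME_ELAPSED);

    /* GL_QUERY_RESULT waits for the GPU to finish, fine for a benchmark */
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
    printf("draw took %llu ns\n", (unsigned long long)elapsedNs);
    glDeleteQueries(1, &query);
}
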

[QUOTE=LapisSea;1288603]Yes, you are absolutely right. This is a case of “premature optimization is the root of all evil”, but I was wondering if this is an okay thing to do in a general sense, as I am not too familiar with OpenGL compiler optimization. Is it up to the GPU driver?

Oh, also, I have just benchmarked it and there is no observable difference. Kind of expected, as it was benchmarked on a GTX 980; maybe older hardware with older drivers would show some tiny difference.[/QUOTE]

GLSL function calls, AFAIK, are very, very fast. The compiler will tend to optimize as much as it can.

One thing you can do is compare both versions by retrieving the binary generated at compile/link time with glGetProgramBinary.
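
A minimal sketch of that retrieval (assuming a GL 4.1+ context with a loader such as GLAD; dump_program_binary is just a hypothetical helper name, and error checking is omitted):


#include <stdio.h>
#include <stdlib.h>
#include <glad/glad.h>   /* assumption: GLAD loader; any GL loader works */

/* hypothetical helper: dumps a linked program's driver binary to a file */
void dump_program_binary(GLuint program, const char *path)
{
    /* note: setting GL_PROGRAM_BINARY_RETRIEVABLE_HINT to GL_TRUE via
       glProgramParameteri before glLinkProgram is recommended */
    GLint length = 0;
    glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);

    void *binary = malloc((size_t)length);
    GLenum format = 0;
    glGetProgramBinary(program, length, NULL, &format, binary);

    /* dump the blob so the two shader versions can be diffed */
    FILE *f = fopen(path, "wb");
    fwrite(binary, 1, (size_t)length, f);
    fclose(f);
    free(binary);
}
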

Yes. This should be fine.

In fact, on some GL drivers (e.g. NVidia’s), the underlying generated assembly will have no subroutine call or return in it at all (even though NVidia’s ASM shader assembly language has CAL / RET instructions). I just re-verified this by piping your shader through cgc. On NVidia GL drivers, another way to get access to this shader assembly is to use the glGetProgramBinary call Silence mentioned. With either, you can see exactly what the NVidia GL driver is generating behind the scenes when you give it a specific GLSL shader, and assess the result yourself.

That said, even on drivers which may not do aggressive constant folding and the resulting dead-code elimination (as NVidia’s does), a subroutine call and return are each coherent branches, so they should be pretty efficient on modern GPUs.

To answer a subsequent question you might have, this is also very, very efficient to do in GLSL, on NVidia GL drivers at least:


const int RENDERSTATE = <value2>;

if ( RENDERSTATE == <value1> )
{
    <bunch of complex stuff>
}
else
{
    <another bunch of complex stuff>
}

This is the core of a classic “ubershader” where you have some constant “Renderstate” values at the top of your shader that select which shader permutation you’re compiling, and you have logic in your shader to choose the correct branch(es) of logic to use for that Renderstate combination.

In the above example, the NVidia GLSL compiler (and possibly those of other vendors too; not sure) will, during compilation, plug the value of RENDERSTATE into all references in the shader, pre-evaluate any constant expressions (e.g. the “if” condition), and determine which paths of your shader are “dead code” that will never be executed and throw them out (e.g. the first block of code under the “if” statement, and even the evaluation of the “if” condition itself).

So for the above example, after compilation (including constant folding and dead code elimination), what’s left of your shader will look like this:


<another bunch of complex stuff>

As you can see, all of the logic that selects code branches based on the values of constant expressions is “free”, in the sense that it’s never evaluated on the GPU; it’s evaluated in the compiler at compile time. This technique makes for a much more readable ubershader than if you’d done all this with preprocessor #if/#else/#endif statements (been there, done that).

Also, as I recall, you don’t really even need the “const” on the renderstate variable(s) on NVidia drivers. The NVidia GLSL compiler will look at your usage and figure that out for itself.
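
To make the pattern concrete, here is a small made-up sketch; the RS_LIGHTING renderstate and both branches are invented purely for illustration:


#version 400 core

// generated per-permutation constant; the compiler folds this
const int RS_LIGHTING = 1;   // hypothetical renderstate: 0 = unlit, 1 = lit

in vec4 diffuse;
in vec3 normal;
out vec4 pixelColor;

void main(void)
{
    if (RS_LIGHTING == 0)
    {
        // dead code for this permutation; eliminated at compile time
        pixelColor = diffuse;
    }
    else
    {
        // the surviving branch; a stand-in lighting computation
        float ndotl = max(dot(normalize(normal), vec3(0.0, 0.0, 1.0)), 0.0);
        pixelColor = vec4(diffuse.rgb * ndotl, diffuse.a);
    }
}
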

[QUOTE=Silence;1288604]GLSL function calls, AFAIK, are very, very fast. The compiler will tend to optimize as much as it can.

One thing you can do is compare both versions by retrieving the binary generated at compile/link time with glGetProgramBinary.[/QUOTE]

Oooh wow I didn’t know about that function! Thank you! I will play around with that.