Offline GLSL compilation

Hi

I was thinking about this offline compilation for GLSL. Now that we have ARB_get_program_binary, 'all' that's needed is for the format to be specified.

My question is, what should that format look like? Should it resemble the HLSL compiled format? Something more like a binary version of ARB_{fragment,vertex}_program? A stream of GLSL tokens?

A couple of features I think we would need of this are:
[ul]
[li]The ability to store multiple shaders for different parts of the pipeline, e.g. a VS and an FS.[/li]
[li]The ability to compile HLSL to this format. Although OpenGL is extremely close to DirectX now, the biggest difference is the shader language, requiring you to use Cg if you want cross-API shaders. There is no reason OpenGL can't support HLSL, either directly or via a tool.[/li]
[/ul]
Does anyone have any suggestions or criticisms?
Regards
elFarto

Um:

There are other issues too… the point of the jazz is not to deliver pre-compiled shaders to a customer, but that once a customer has your application, the application saves the binaries on the first run and subsequent runs use the binary blob instead of recompiling the shaders and relinking the shader program.
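In rough terms the scheme looks like the sketch below (not anyone's actual code; it assumes the GL entry points are already loaded and the context supports GL_ARB_get_program_binary, and build_program_from_source() is a hypothetical helper that compiles/links the GLSL text and sets GL_PROGRAM_BINARY_RETRIEVABLE_HINT before linking):

/* Sketch of the caching scheme described above. */
#include <stdio.h>
#include <stdlib.h>
#include <GL/glcorearb.h>   /* or your extension loader's header */

extern GLuint build_program_from_source(void);   /* hypothetical helper */

GLuint load_program(const char *cache_path)
{
    FILE *f = fopen(cache_path, "rb");
    if (f) {                                /* later runs: reuse the blob */
        GLenum format = 0;
        fseek(f, 0, SEEK_END);
        long size = ftell(f) - (long)sizeof format;
        fseek(f, 0, SEEK_SET);
        fread(&format, sizeof format, 1, f);
        void *blob = malloc(size);
        fread(blob, 1, size, f);
        fclose(f);

        GLuint prog = glCreateProgram();
        glProgramBinary(prog, format, blob, (GLsizei)size);
        free(blob);

        GLint ok = GL_FALSE;
        glGetProgramiv(prog, GL_LINK_STATUS, &ok);
        if (ok)
            return prog;                    /* cache hit: no recompile/relink */
        glDeleteProgram(prog);              /* driver/GPU changed: rebuild below */
    }

    /* first run (or stale blob): compile from GLSL source, then save the binary */
    GLuint prog = build_program_from_source();
    GLint size = 0;
    GLenum format = 0;
    glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &size);
    void *blob = malloc(size);
    glGetProgramBinary(prog, size, NULL, &format, blob);

    f = fopen(cache_path, "wb");
    fwrite(&format, sizeof format, 1, f);
    fwrite(blob, 1, size, f);
    fclose(f);
    free(blob);
    return prog;
}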

Also, as for HLSL vs GLSL there is a tool: hlsl2glsl (HLSL2GLSL on SourceForge), open source too.

also:

HLSL is MS-Windows only, whereas GLSL is Unix (including Linux and BSD), Mac OS X (if we stay at GL 2.1), and MS-Windows. It is HLSL that is not cross-platform, not GLSL.

I was thinking about this offline compilation for GLSL.

I’m curious as to exactly what it is you expect such a thing to accomplish.

Whatever format it ultimately compiles to will itself have to be compiled by the drivers. You might be able to get rid of a few driver bugs by unifying the GLSL front-ends, but that’s about it. The ATI looping bug for example is likely a part of the lower-level code generation, so unifying the GLSL front-end wouldn’t fix it.

You’d also have to have two shading language specifications: GLSL itself and the low-level form. That’s twice the work for the ARB. Every new shader-based feature that comes along would require updates to both specs.

Lastly, GLSL really doesn’t have that many high-level constructs in it. It has functions (but no recursion) and structs, but that’s about it. We’re not talking about a particularly hard language to compile here.

The point of the program binary extension, as kRogue points out, is to allow you to get most of the benefits of offline compilation (faster load times, etc) without these downsides. You simply use an OpenGL program as an offline compiler. Sure, you still need the strings just in case, but you only use them if you need them.

There are other issues too… the point of the jazz is not to deliver pre-compiled shaders to a customer, but that once a customer has your application, the application saves the binaries on the first run and subsequent runs use the binary blob instead of recompiling the shaders and relinking the shader program.

They could define a specific format enumeration (or a range of them) and then say that ProgramBinary must succeed with these formats (or provide an error like compiling/linking does). This would require a new extension, but that's about it.
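For context, the current extension already exposes the driver-defined formats as queryable enums; a sketch of the query (assuming loaded GL 4.1 entry points), just to show where such a guaranteed format would slot in:

#include <stdio.h>
#include <stdlib.h>
#include <GL/glcorearb.h>

/* List the program binary formats the current driver will accept. A
 * hypothetical ARB-guaranteed format would simply be a reserved enum that
 * every implementation must include here and accept in glProgramBinary. */
void list_binary_formats(void)
{
    GLint count = 0;
    glGetIntegerv(GL_NUM_PROGRAM_BINARY_FORMATS, &count);

    GLint *formats = malloc(count * sizeof *formats);
    glGetIntegerv(GL_PROGRAM_BINARY_FORMATS, formats);

    for (GLint i = 0; i < count; ++i)
        printf("program binary format: 0x%04X\n", (unsigned)formats[i]);
    free(formats);
}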

HLSL is MS-Windows only, whereas GLSL is Unix (including Linux and BSD), Mac OS X (if we stay at GL 2.1), and MS-Windows. It is HLSL that is not cross-platform, not GLSL.

“Platform” in this context means “API”, not hardware or OS.

Also, I’d be interested to see, if they define such a low-level language, how they could possibly prevent cross-compiling. You can compile HLSL to GLSL now; I don’t see how they could stop you from doing so with a lower-level language.

For what it is worth, there is an assembly GL language, and NVIDIA updates their assembly interface for new GL features. NVIDIA's Cg compiler can be used to generate this assembly (from both Cg and GLSL), and it will use NVIDIA-made extensions to expose newer (read: GL3 or newer) features. The catch is, I don't think any GL driver except NVIDIA's has been keeping its GL assembly up to date in terms of features.

Lastly, there are two big reasons why people want binary shaders:
(1) To cut down on start-up times (which GL_ARB_get_program_binary does, except at first load or after a driver/hardware change).

(2) Prevent 3rd parties from reading your shader code.

Though even in the Direct3D world, many games supply HLSL to Direct3D and a little poking lets you see the shaders.

HLSL is just a language, it can’t be OS specific.

As Alfonse says, the extension that specifies this hypothetical format can specify that for this format, the glProgramBinary call cannot fail.

The wiki describes offline shader compilation as:

So, having an offline version allows for better optimisations to be applied, and obfuscation. Obfuscation is not really possible; any format the driver can read can be reverse-engineered into a more human-friendly format.

So that leaves better optimisations. However, that could be solved with a program hint in the same vein as PROGRAM_BINARY_RETRIEVABLE_HINT, such as an OPTIMIZE_PROGRAM_HINT telling the GL to favour optimisation over compilation time. The binary can then be retrieved in the usual way.
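Something like this, say (sketch only: GL_PROGRAM_BINARY_RETRIEVABLE_HINT is real, while GL_OPTIMIZE_PROGRAM_HINT is the made-up enum I'm proposing and exists in no GL header):

#include <GL/glcorearb.h>

void link_for_offline_reuse(GLuint prog)
{
    /* Real enum from GL_ARB_get_program_binary: tells the GL we intend to
       read the compiled binary back out. */
    glProgramParameteri(prog, GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE);

    /* Hypothetical enum (does not exist today): ask the compiler to favour
       optimisation over compile time. */
    /* glProgramParameteri(prog, GL_OPTIMIZE_PROGRAM_HINT, GL_TRUE); */

    glLinkProgram(prog);   /* program hints take effect at the next link */

    /* ...and the binary is then pulled with glGetProgramBinary as usual. */
}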

Mobile platforms are a bit tricky. Any platform constrained enough to not have a decent compiler is probably running OpenGL ES, not regular OpenGL.

The HLSL compilation can probably be taken care of by an external library such as AMD/ATI’s HLSL2GLSL library above.

Regards
elFarto

I do not believe any offline tool can do valuable optimization. GPUs are so different today, even from a single vendor; NVIDIA's Fermi, for example, is a totally different GPU than the older ones. The real optimization is GPU-specific.

I know having an offline tool is tempting.

So, having an offline version allows for better optimisations to be applied, and obfuscation.

Just because someone on the Wiki wrote that doesn't make it true. Because of what mfort has pointed out, the most that offline compilation can do (without favoring one kind of hardware over another) is very simple things like dead-code elimination. Which drivers already do.

Until you can make some assumptions about what your hardware can actually do, you have to leave at least some of the high-level semantics in place in your off-line compiler. You have to leave structs in place (to some extent). You have to leave function calls in place, so that the low-level driver can decide whether or not to inline them. You have to leave standard functions in place, so that the low-level driver can decide how best to implement them.

At that point, your offline compiler is essentially reduced to converting a human-readable C-style language into something slightly less complex. It basically converts human-readable infix expressions into an explicit assembly-like sequence. And does dead-code elimination.

I don’t see how this would help compile times very much if optimization is the main drawback of compiling.

In this thread I argue that simple dead-code elimination and constant folding would be a huge win for typical large-scale shader code (where you are including libraries of functionality, you are wasting "tens of seconds" compiling each shader):
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=270130#Post270130

Obviously, this does not need to be done via the ARB; a separate entity could just release this type of parser, but the amount of effort would be huge (unless Nvidia/ATi decide to be kind).

It’s worth remembering that the current binary shaders implementation, together with separate shader objects, greatly reduces the need for fast shader compilation.

Assuming you are using the two facilities effectively, most of the time you will be loading cached binary shaders and when you do need to compile something, you will be using smaller compilation units.

This means the hit of providing better optimization, but with slower compilation, in drivers will be much less.

Given all this, and the limitations on optimization that offline shaders could provide mentioned by previous posters, I can’t help thinking that many of the requests for offline shaders are from people who would really rather not distribute GLSL source but know that argument has already been had and rejected.

That’s a bit of a hand-wave, and reality depends on the application.

At issue is whether you can “guaranteed” compile a shader permutation in a background thread/GL context on a separate core in parallel with running a GL thread whose performance MUST NOT be affected at all by that compilation, and just share the program or transfer the binary shader over to the foreground thread and load it up without a perf hit. Can you? On any platform with anybody’s driver?

If not, the subtle details behind the issue include: what's the most time-consuming single shader an application must compile or link; does the app have a hard 60Hz-or-better frame rate requirement (or can it just "blow frames" every once in a while for this kind of thing); does it have to dynamically load content with materials defining shader permutations, or change environment-related shader permutations at render time; can it know all possible permutations beforehand and precompile all of them for that specific GPU make/model/minor model/driver version/etc. (or is the permutation space just too large for that); …

And I agree with sqrt[-1]. Dead code elimination and constant folding is likely a large part of what eats “compile/link” time, for our app anyway. That and just high-level language parsing. If just that could be pre-done, I suspect we’d see a pretty large speed-up. Ideally, this would be an explicitly CPU-only task operating on local data, so that you’re guaranteed to parallelize easily without any odd GL thread-lock or reentrancy issues.
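For concreteness, the kind of setup I mean looks roughly like the sketch below. The create_shared_context()/make_current()/build_program() helpers are stand-ins for the platform-specific (WGL/GLX/EGL) and app-specific bits; whether the driver actually does the heavy lifting on the worker thread, or quietly defers it to first use on the render thread, is exactly the question.

#include <pthread.h>
#include <GL/glcorearb.h>

/* Hypothetical wrappers around the platform-specific context calls. The
 * worker's context must be created sharing objects with the render context. */
extern void  *create_shared_context(void);
extern void   make_current(void *ctx);
extern GLuint build_program(const char *vs_src, const char *fs_src);  /* hypothetical compile+link */

struct compile_job {
    const char  *vs_src, *fs_src;
    GLuint       program;   /* program objects are shared between shared contexts */
    volatile int done;      /* a real app would use a fence/condvar, not a flag */
};

static void *compile_worker(void *arg)
{
    struct compile_job *job = arg;

    make_current(create_shared_context());
    job->program = build_program(job->vs_src, job->fs_src);
    glFinish();              /* try to ensure the driver has really finished */
    job->done = 1;
    return NULL;
}

/* Render thread: pthread_create(&tid, NULL, compile_worker, &job), keep
 * rendering with the old program, and only switch to job.program once
 * job.done is set. Whether any of this stays off the render thread's
 * critical path is driver-specific. */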

At issue is whether you can “guaranteed” compile a shader permutation in a background thread/GL context on a separate core in parallel with running a GL thread whose performance MUST NOT be affected at all by that compilation, and just share the program or transfer the binary shader over to the foreground thread and load it up without a perf hit. Can you? On any platform with anybody’s driver?

OK, let's say you had offline compilation. Will this ensure that you can "'guaranteed' compile a shader permutation in a background thread/GL context on a separate core in parallel with running a GL thread whose performance MUST NOT be affected at all by that compilation, and just share the program or transfer the binary shader over to the foreground thread and load it up without a perf hit", on any platform with anybody's driver?

I think not. Not if it’s going to be an intermediate form that still retains enough high-level constructs (structs, functions, loops, arrays, etc) to allow the back-end code access to the constructs that it can use to properly optimize as needed.

And if you can't guarantee this, everywhere, then everything you just said about compiling GLSL now applies to compiling this "precompiled" form.

And I agree with sqrt[-1]. Dead code elimination and constant folding is likely a large part of what eats “compile/link” time, for our app anyway. That and just high-level language parsing.

I can’t speak to what kind of shaders you’re running through the compiler, but how do you know that high-level language parsing is a performance-determining step (ignoring the issue of feeding the parser code you don’t intend to run)? I’d imagine that the actual optimization of the final code is far more performance critical.

Furthermore, there's something else with separate shader objects. Right now, as it currently stands, the compiler in some implementations must execute twice: once for each of the individual shader objects, and once for the program as a whole. The ability to go straight from text to program object alone could give you a 2x speedup. Add to that the fact that you no longer have to link a program for every permutation of shaders, and you may not even need get program binary.
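To be clear about what "straight from text to program object" means, here is a minimal sketch with GL_ARB_separate_shader_objects (assuming a GL 4.1 context with entry points loaded; the shader strings are whatever you feed it):

#include <GL/glcorearb.h>

/* Build two separable programs directly from source and combine them in a
 * pipeline; stages can be mixed and matched at bind time, with no
 * per-permutation link step. Error checking omitted. */
GLuint build_pipeline(const char *vs_src, const char *fs_src)
{
    GLuint vs_prog = glCreateShaderProgramv(GL_VERTEX_SHADER,   1, &vs_src);
    GLuint fs_prog = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fs_src);

    GLuint pipeline;
    glGenProgramPipelines(1, &pipeline);
    glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT,   vs_prog);
    glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, fs_prog);
    return pipeline;   /* glBindProgramPipeline(pipeline) to render with it */
}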

I second that. For example, register allocation is NP-complete and should be performed as one of the last compiler passes, on the native code representation. There are polynomial-time algorithms as well, but the resulting code might end up being slower. Having get_program_binary return native, optimized code instead of IR seems reasonable to me.

With explicitly specified uniform, sampler, input, and output locations, the linking can be done at runtime. For example, after a vertex shader you may choose which VS outputs you want to rasterize and pass to a fragment shader (in hardware, not in GL), so ideally the shaders don't really need to know about each other.
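As an illustration of that rendezvous-by-location idea (a made-up pair of shaders, assuming GL_ARB_separate_shader_objects and GL_ARB_explicit_attrib_location): the stages match their interface purely by location numbers, so neither one needs to know the other's variable names.

/* Two independently compiled separable stages; the vec3 written at
 * location 0 by the VS is read at location 0 by the FS, regardless of name. */
static const char *vs_src =
    "#version 410 core\n"
    "layout(location = 0) in vec4 position;\n"
    "layout(location = 0) out vec3 vsColor;\n"        /* written to location 0 ... */
    "out gl_PerVertex { vec4 gl_Position; };\n"
    "void main() {\n"
    "    gl_Position = position;\n"
    "    vsColor     = position.xyz * 0.5 + 0.5;\n"
    "}\n";

static const char *fs_src =
    "#version 410 core\n"
    "layout(location = 0) in vec3 anyNameWeLike;\n"   /* ... read from location 0 */
    "layout(location = 0) out vec4 outputColor;\n"
    "void main() { outputColor = vec4(anyNameWeLike, 1.0); }\n";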

I personally have two reasons for wanting the features to support offline GLSL compilation, though I don’t necessarily care about that specific request in particular.

Supporting offline GLSL compilation basically just means revivifying the ARB assembly syntax, updated to modern shader features. It could be taken a step further and allow a binary representation of that assembly language.

The advantages of this are twofold:

(1) Smaller program sizes. Yes, this really matters. We've recently measured that disk seek and read time is the big killer in loading times for even simple games. Cutting a file down from several KB to a few hundred bytes makes a big difference, especially if you have a lot of files like that. Loading a PNG and decompressing it on the CPU, for instance, is way faster than loading a TGA; all the time the CPU 'wastes' in decompressing the image data is more than made up for by how much faster the image is loaded from disk, even for small images.

(2) An assembly language makes it easier to write new shading languages. HLSL could be more easily supported, for instance. More importantly, it allows easier experimentation with shading languages. Right now there are basically two (GLSL and HLSL/Cg), but maybe there's some paradigm that works even better and nobody has found it yet. Or, if nothing else, it makes it easier to experiment with GLSL extensions that offer HLSL/Cg-like effects frameworks and combined shaders. It also makes it possible for a well-known GLSL compiler library to exist to help minimize all those pesky NVIDIA/AMD GLSL compiler differences by simply removing them from the high-level language part of compilation. In short, it lets people write better language tools with less effort.

Supporting offline GLSL compilation basically just means revivifying the ARB assembly syntax, updated to modern shader features.

Your use of “just” there suggests that this is a simple undertaking. Everyone understands what would have to be done. The question is whether it would be worthwhile.

smaller program sizes. Yes, this really matters. We’ve recently measured that disk seek and read time is the big killer in loading times for even simple games.

Simple games don’t care about loading times; they’re not spending a lot of time loading things because they don’t have a lot to load. They are, as you say, “simple games.”

Cutting a file down from several KB to a few hundred bytes makes a big difference, especially if you have a lot of files like that.

I agree in principle, but the problem with your assertion is that you haven’t demonstrated that GLSL shaders are significantly larger than ARB assembly shaders that do the same thing. Note that ARB assembly already allows you to use reasonable variable names (no r1, r21, etc nonsense), so reasonably-written code will use reasonable variable names.

For example, here’s a fairly lengthy GLSL shader of mine:

#version 330

in vec4 diffuseColor;
in vec3 vertexNormal;
in vec3 cameraSpacePosition;

out vec4 outputColor;

uniform vec3 modelSpaceLightPos;

uniform vec4 lightIntensity;
uniform vec4 ambientIntensity;

uniform vec3 cameraSpaceLightPos;

uniform float lightAttenuation;

const vec4 specularColor = vec4(0.25, 0.25, 0.25, 1.0);
uniform float shininessFactor;


float CalcAttenuation(in vec3 cameraSpacePosition, out vec3 lightDirection)
{
	vec3 lightDifference =  cameraSpaceLightPos - cameraSpacePosition;
	float lightDistanceSqr = dot(lightDifference, lightDifference);
	lightDirection = lightDifference * inversesqrt(lightDistanceSqr);
	
	return (1 / ( 1.0 + lightAttenuation * sqrt(lightDistanceSqr)));
}

void main()
{
	vec3 lightDir = vec3(0.0);
	float atten = CalcAttenuation(cameraSpacePosition, lightDir);
	vec4 attenIntensity = atten * lightIntensity;
	
	vec3 surfaceNormal = normalize(vertexNormal);
	float cosAngIncidence = dot(surfaceNormal, lightDir);
	cosAngIncidence = clamp(cosAngIncidence, 0, 1);
	
	vec3 viewDirection = normalize(-cameraSpacePosition);
	
	vec3 halfAngle = normalize(lightDir + viewDirection);
	float blinnTerm = dot(surfaceNormal, halfAngle);
	blinnTerm = clamp(blinnTerm, 0, 1);
	blinnTerm = cosAngIncidence != 0.0 ? blinnTerm : 0.0;
	blinnTerm = pow(blinnTerm, shininessFactor);

	outputColor = (diffuseColor * attenIntensity * cosAngIncidence) +
		(specularColor * attenIntensity * blinnTerm) +
		(diffuseColor * ambientIntensity);
}


The ARB-assembly version would still use the same long variable names. The ARB-assembly version would have “MUL” in place of every *, “ADD” in place of every +, etc. It would also have to have a lot of temporary variables that I don’t need to explicitly state.

The main difference would be the lack of a function call. And that’s assuming that you disallow functions in this advanced version of ARB-assembly.

I’d bet that the ARB version would be larger in terms of byte size than the GLSL version. And I can make the GLSL version smaller by using larger expressions. I could compute “blinnTerm” in a single line rather than 4 (though this would make the code less readable). You can’t do that in the ARB-assembly version.

Now, let’s say you found a way of encoding your ARB-assembly shaders into some kind of smaller, binary format. That’s basically just employing a compression algorithm. There’s no reason you couldn’t do something similar with GLSL shaders. And there’s no reason to expect that shaders wouldn’t zip-compress well, particularly if you stick them in a TAR (or something similar) before compressing them. That way, the same names in different shader files will be compressed with the same bitpattern. You’ll get smaller files, and you can decompress them all-at-once.

While ARB-assembly shaders might benefit from compression more than GLSL shaders (the repeated use of certain instructions), I doubt that they will benefit from it so much more that the final size difference will be substantial.

Loading a PNG and decompressing it on the CPU, for instance, is way faster than loading a TGA; all the time the CPU 'wastes' in decompressing the image data is more than made up for by how much faster the image is loaded from disk, even for small images.

Perhaps (depending on the CPU in question), but the fact that this is a TGA instead of an S3TC or BPTC-compressed texture means that you’re losing quite a bit of memory. Yes, your load-times may be faster, but you’re using up ~4x the memory compared to compressed textures. PNG doesn’t work with texture compression, and good S3TC compression algorithms take longer than the read time of their output to work. So you’re not going to be compressing it on the fly.

So either your data is PNG compressed on disk, or it’s S3TC compressed on disk. You need to make a decision: runtime performance in both memory and actual texture access speed, or load-time performance?

For most applications, you’re going to use a texture far more than you’re going to load it. And you’re really going to want the extra texture space you get from compressed texture formats.

It also makes it possible for a well-known GLSL compiler library to exist to help minimize all those pesky NVIDIA/AMD GLSL compiler differences by simply removing them from the high-level language part of compilation.

And that’s where you’re going to run afoul of two simple facts.

1: OpenGL is, and always will be, backwards-compatible. GLSL is core now. Therefore, it will always be core. NVIDIA and AMD will have to support it. And support it in addition to whatever enhanced ARB-assembly language you suggest. Even if they stopped extending GLSL and froze it at 4.1, the IHVs would still have to support it.

You will not make drivers more conformant by giving IHVs more test cases and a larger specification to conform to.

2: Most of the “pesky compiler differences” would still remain in an ARB-assembly language. Sure, you would lose some, like NVIDIA’s slightly more lax GLSL front-end. But the most egregious would remain, as those tend to be in the lower-level optimizing routines.

Let’s look at some GLSL-based driver bugs reported on these very forums.

AMD: struct array in UBO: This is based on the internal accessing code of UBOs. It has nothing to do with the compilation of GLSL code, and it would have occurred in AMD’s implementation of an enhanced ARB-assembly language that supported UBO-like constructs.

AMD: textureOffset flickering and texelFetch crash: It's not clear if the second part of this bug (the glCompileShader crash) would be solved by your suggestion, but the first part would certainly be unaffected. The enhanced ARB-assembly would have similar texturing functions, and AMD would be just as capable of screwing up this code in ARB-assembly form as in GLSL form.

AMD: The craziest bug I have ever seen! Oo: This one would be solved, since ARB-assembly probably wouldn't have function overloading.

So that’s not many bugs fixed.

I didn’t read the entire thread, so I don’t know if this was mentioned. I use GL_ARB_get_program_binary. The first time my program is run, the compiled shader folder is empty or doesn’t exist, so I load the GLSL text version of the shaders, which takes a long time. I “get” the binary blob and save it to the folder.

The next times my program is loaded, I notice a big reduction of load time because I load the binary blobs.

As for assembly shaders (ARB), face it, they are dead. nVidia continues to update and create their assembly extensions so that is your only option if you want to create a tool that does HLSL->OpenGL assembly.

I would just manually convert an HLSL to GLSL since they are not too different. All high level languages are similar. (yes I know, some say that you can use Cg to convert HLSL to GLSL).

On CPU architectures, yes.

On GPU architectures, no. There is no register spilling to the stack on a GPU because GPUs do not have stacks. GPUs also have a crapload of registers compared to a CPU. Register allocation on a GPU is (relatively) simple and easily solvable: either the code uses too many registers and can’t be compiled, or it doesn’t and there’s a simple and obvious mapping of registers to variables/temporaries.

This is precisely why the D3D standards dictate the number of “temporary value registers” hardware must support for each Shader Model generation. It guarantees that a given program will be able to compile and execute on all conforming hardware. The only reason problems arise there is because the optimization passes can eliminate some registers, allowing programs that are “too big” to run on one vendor’s driver but not the other.

And that's a problem. Vendor drivers should only be providing the low-level IR-to-machine-code translation. All of the higher-level parsing, semantic analysis, and high-level optimizations should be performed by a single API-vendor-provided library, e.g. a libglsl. This should produce a standardized high-level IR that can be saved to disk offline, loaded up at runtime by the API, and fed to the drivers for low-level code generation.

The added bonus of such a model is that it makes it waaaaay easier to write high-level effects libraries that compose low-level shader code from GLSL snippets and data-driven code generation from UI-based effects builders, which at least in the game industry is incredibly common. It also allows for more experimentation in shading languages, which would be a good thing as we pretty much only have three hardware-supported shading languages left today (GLSL, HLSL, and NVIDIA’s assembly language extensions), but for all we know there’s some super awesome paradigm out there that the big vendors haven’t thought of yet, and is just waiting for a small company or individual to discover. Needing to compile into GLSL just to feed into the drivers is horrible, and imposes all the limitations of GLSL onto the new language that wouldn’t be necessary with standard lower-level representation that can more easily match actual hardware capabilities. (e.g., GLSL only recently gained bind-by-semantic capabilities in 3.3/4.1, using an incredibly awkward and obtuse syntax, despite that being the way the hardware has actually worked for passing values between shader stages since the introduction of programmable GPUs…)

Separate shader objects is super awesome, yes. It’s pathetic how long GL/GLSL took to gain that support.

This is though another point in favor of decoupling the shader language from the API as much as possible and going for assembly shaders (be they in text format or some serialized binary IR format, doesn’t matter a lot). The more tightly they’re tied together, the more often we’re going to end up waiting for obvious performance features like this. It’s silly that every OpenGL API change needs to wait on a complete language specification for GLSL to get anywhere, when the shader compiler and the raw GPU access APIs are really totally unrelated to each other in concept (and are only tied together because of OpenGL’s backwards design).

The trick to designing a good GL assembly language (rather than the old ARB mess) is to realize that you really want an IR, not a "real" assembly language like what we normally think of with x86/ARM/PPC assembly code. We don't need a format that is meant to be just a textual representation of machine code. We need a format that is meant to describe high-level instructions and data structures, with as much semantic information as possible; e.g., instead of implementing a loop using JMP instructions, actually encode the loop in the IR. It's a balancing act between going so low-level as to be tied to a specific implementation model that doesn't reflect how all the real hardware works, and going so high-level as to be tied to an API's semantics like GLSL is. It's tricky to get right, but it is possible to get right, and hence should be done.