Suggestions for OpenGL 5

Hello, here are a few suggestions of things I would really like to see in OpenGL.

Offline GLSL Compilation and Official Bytecode
Okay, this is something that’s been requested before, but it has been ignored. There needs to be an official bytecode format with a fully specified binary encoding. Why? Offline compilation with ALL optimizations on, and third-party compilers. As it stands, you must distribute the GLSL with your app (not a big deal) and then waste the user’s time (on top of the time the install itself takes) recompiling shaders that could perfectly well be compiled offline. This has two detrimental effects. 1: Slower start-up time whenever hardware or drivers change. 2: You don’t get the best optimizations you could. Maybe the driver implementation of the compiler is flawed. Maybe there’s a bug in the compiler implementation. Maybe you simply don’t want to use GLSL. Yeah, you can use Cg, but that’s no better. You’re locked to one company to provide updates to a closed source compiler. Not good for me.

Offline GLSL compilation could be accomplished by more than just implementors of OpenGL if an official bytecode existed and were maintained. Make it an extension at first, then make it part of the official spec. It doesn’t matter how the bytecode looks, how complicated it is, or anything like that, as long as it’s functional. Once you have the bytecode, just use the existing APIs to apply the binary; maybe it’s exposed as a GL_BYTECODE_FORMAT token or something similar. Direct3D has this feature. I can implement my own HLSL compiler and not be locked in to Microsoft’s implementation if I want to. I can’t do that with GLSL. It’s kind of funny, actually: I can compile GLSL to HLSL bytecode. (Which is documented in the WDK, btw.)
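
As a rough sketch of what loading such a blob could look like (reusing the existing program-binary entry point; GL_BYTECODE_FORMAT_ARB is a made-up token, and a GL function loader such as GLEW is assumed):

[code]
/* Sketch: load an offline-compiled shader blob through the existing
 * program-binary entry point. GL_BYTECODE_FORMAT_ARB does not exist today;
 * it stands in for the proposed vendor-neutral format token. */
#include <stdio.h>
#include <stdlib.h>

GLuint load_bytecode_program(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *blob = malloc((size_t)size);
    fread(blob, 1, (size_t)size, f);
    fclose(f);

    GLuint prog = glCreateProgram();
    /* Same call GL_ARB_get_program_binary uses, just with a standardized,
     * hardware-independent format token instead of a driver-specific one. */
    glProgramBinary(prog, GL_BYTECODE_FORMAT_ARB /* hypothetical */, blob, (GLsizei)size);
    free(blob);

    GLint linked = GL_FALSE;
    glGetProgramiv(prog, GL_LINK_STATUS, &linked);
    if (!linked) { glDeleteProgram(prog); return 0; }
    return prog;
}
[/code]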

Some people will argue that the GL_ARB_get_program_binary extension already allows you to produce binaries that are optimized for the hardware instead of just in general. Yes, that’s true. However, there’s no reason the bytecode format couldn’t be converted to a hardware-optimized binary by the GL in the same way. It would probably be easier, too, since everything would already be in a bytecode form that is easy to optimize and interpret.
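
For comparison, this is roughly the GL_ARB_get_program_binary round trip as it exists today; the format token is driver-chosen, which is exactly the portability problem (sketch, error handling omitted):

[code]
/* Cache a linked program's driver-specific binary and reload it later.
 * The format is whatever the driver chooses, so the cached blob is only
 * valid on the same driver/hardware combination that produced it. */
#include <stdlib.h>

void *get_binary(GLuint prog, GLsizei *len, GLenum *fmt)
{
    GLint size = 0;
    /* Hint before linking:
     * glProgramParameteri(prog, GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE); */
    glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &size);
    void *blob = malloc((size_t)size);
    glGetProgramBinary(prog, size, len, fmt, blob);
    return blob;
}

GLuint load_binary(const void *blob, GLsizei len, GLenum fmt)
{
    GLuint prog = glCreateProgram();
    glProgramBinary(prog, fmt, blob, len);  /* may be rejected after a driver update */
    return prog;
}
[/code]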

This extension should be propagated back all the way to the first hardware that’s capable of executing shaders. Why? Compatibility. D3D worked with bytecode all the way back to when shaders were first introduced. GL can too.

Here’s a comparison between D3D9 bytecode shader loads and OpenGL shader loads (among other things).

Official Bindless Objects Extension
NVIDIA and AMD offer separate but (as far as I can tell) equivalent solutions for accessing textures without binding them, avoiding a lot of unnecessary driver overhead. These are the two big players in the game. Then there’s Intel. I’ve heard they offer a bindless solution as well, but I haven’t bothered verifying this. In this day and age, where virtual texturing is becoming commonplace (at least in one form or another), the ability to just directly access texture memory is becoming increasingly important.

For reference, see GL_AMD_pinned_memory, GL_NV_bindless_texture, and GL_NV_vertex_buffer_unified_memory.
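
As a rough sketch of the NVIDIA flavour, assuming GL_NV_bindless_texture is advertised (the AMD path is different):

[code]
/* GL_NV_bindless_texture: turn a texture into a 64-bit handle, make it
 * resident, and hand the handle to the shader instead of binding a unit.
 * Residency is now managed by the application, not the driver. */
GLuint64 make_resident_handle(GLuint texture)
{
    GLuint64 handle = glGetTextureHandleNV(texture);
    glMakeTextureHandleResidentNV(handle);
    return handle;
}

/* On the GLSL side (with "#extension GL_NV_bindless_texture : require")
 * a sampler uniform can be fed the handle directly: */
void set_bindless_sampler(GLint location, GLuint64 handle)
{
    glUniformHandleui64NV(location, handle);
}
[/code]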

API To Determine What’s Supported By Hardware
D3D9 and below had GetDeviceCaps(). D3D10+ has “feature levels” and “feature sets” that you can check; these are useful for determining what the driver and hardware actually support.

Now, believe me, I like that OpenGL forces the driver to support some things, even if it has to fall back to software to do so. However, I also want to find out what falls back to software. That way, in my code, I can choose to avoid that particular feature and perform a work-around. Maybe this can be done through queries. e.g., issue an “is hardware” query for a set of OpenGL commands using the current state, then check the query for a boolean value. If true, the hardware can execute the GL commands within the query scope; if false, it can’t, so do something else. This seems like it would be easy to implement. At least, it would be easy to add to the specs. (No new entry points, one added enum value, and some description.)
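
Something like this, say, where GL_IS_HARDWARE_ACCELERATED is a made-up token purely for illustration, riding on the existing query-object machinery:

[code]
/* Hypothetical "does this run in hardware?" check, reusing query objects.
 * GL_IS_HARDWARE_ACCELERATED does not exist; it is the proposed enum. */
int commands_run_in_hardware(void (*issue_commands)(void))
{
    GLuint q;
    GLuint result = GL_FALSE;

    glGenQueries(1, &q);
    glBeginQuery(GL_IS_HARDWARE_ACCELERATED /* hypothetical */, q);
    issue_commands();                         /* the state/commands to test */
    glEndQuery(GL_IS_HARDWARE_ACCELERATED);   /* hypothetical */

    glGetQueryObjectuiv(q, GL_QUERY_RESULT, &result);
    glDeleteQueries(1, &q);
    return result == GL_TRUE;                 /* false: take the work-around path */
}
[/code]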

Programmable Blending
Why is this still fixed function? I’ve noticed that in OpenGL ES, on NVIDIA Tegra hardware, the shader bytecode is regenerated whenever you change the fixed-function alpha blending state. (Someone at Unity discovered this, actually. I don’t remember where I read it specifically, though, sorry.) So I assume that means the hardware is capable of doing blending in the shader. The D.I.C.E. Frostbite 2 (Battlefield 3) guys seem to want this as well. I’ll leave this section short since AAA developers already want this feature. (Sorry for not providing more in terms of citation here.)

Sampler Shaders
Samplers seem quite fixed-function to me, as they rely on various state settings and so on. Perhaps sampler shaders could be implemented? Basically, when a sampler shader is applied, it would let you program exactly what kind of filtering happens when a texture image is fetched. This can already be accomplished to some degree by using point/nearest filtering (that is, no filtering at all) and taking several samples within the pixel shader. However, it is my understanding that the hardware that handles sampling can make a few optimizations based on the areas sampled. If that is correct, perhaps a sampler shader could serve as a more optimized way of performing that sampling. Here are two examples:

  1. Virtual textures and atlas textures have page boundaries that cannot be crossed. You usually have to add padding around each page (which is unfortunate), or put the data in volume textures (which can cause other issues, or may not be feasible given certain hardware limits). The third alternative is to implement the filtering yourself.
  2. Implement elliptical texture filtering (a higher quality form of filtering) and use it across multiple shaders without using subroutines. With programmable samplers it would be easier to implement other shaders: you wouldn’t have to use subroutines, just generate a new sampler and then access the texture data like normal within the shader.
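
For the “implement the filtering yourself” route, a sketch of manual bilinear filtering built on texelFetch (GLSL given as a C string; the clamp is where per-page/atlas handling would go):

[code]
/* Manual bilinear filtering on top of texelFetch, so page borders of an
 * atlas or virtual texture can be clamped explicitly. Illustrative only. */
static const char *manual_bilinear_glsl =
    "#version 330 core\n"
    "uniform sampler2D tex;\n"
    "vec4 bilinearFetch(vec2 uv) {\n"
    "    ivec2 size = textureSize(tex, 0);\n"
    "    vec2  st   = uv * vec2(size) - 0.5;\n"
    "    ivec2 i0   = ivec2(floor(st));\n"
    "    vec2  f    = fract(st);\n"
    "    // clamp per page/tile here instead of per texture to avoid bleeding\n"
    "    ivec2 lo = clamp(i0,            ivec2(0), size - 1);\n"
    "    ivec2 hi = clamp(i0 + ivec2(1), ivec2(0), size - 1);\n"
    "    vec4 t00 = texelFetch(tex, ivec2(lo.x, lo.y), 0);\n"
    "    vec4 t10 = texelFetch(tex, ivec2(hi.x, lo.y), 0);\n"
    "    vec4 t01 = texelFetch(tex, ivec2(lo.x, hi.y), 0);\n"
    "    vec4 t11 = texelFetch(tex, ivec2(hi.x, hi.y), 0);\n"
    "    return mix(mix(t00, t10, f.x), mix(t01, t11, f.x), f.y);\n"
    "}\n";
[/code]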

Overall, this isn’t something that’s hugely necessary, but I haven’t heard of anything like it before and, at least for me, it would be fairly convenient. Also, with AMD’s introduction of partially resident textures, I think this is probably unnecessary anyway.

Official OpenGL Support for Partially Resident Textures
I like AMD’s implementation, but it’s just limited to AMD. This should be a core requirement for GL 5~.

[HR][/HR]
If there are ARB extensions available for anything I’ve mentioned above, I would love to know their names. :slight_smile:

I’m not looking for alternatives or work-arounds to any of the above suggestions. I know what I’m doing. These are just some things I’d fancy seeing in an official GL specification.

Cheers,
Aaron

[QUOTE=AaronMiller;1238968]You’re locked to one company to provide updates to a closed source compiler. Not good for me.[/quote]That’s actually the current situation. The GLSL compiler provided by your vendor is already closed-source, and you’re locked to your hardware vendor. Currently it’s worse, because each vendor must provide their own compiler, which introduces bugs and divergence (as well as potentially conformance-breaking “optimization”). Specifying the bytecode is actually a great idea though; it would force vendors to have some measure of consistency, but it would need to be a sensible spec (and I don’t really have much faith in the ARB in that regard).

[QUOTE=AaronMiller;1238968]Official Bindless Objects Extension
API To Determine What’s Supported By Hardware
Programmable Blending[/quote]
Yes, please - bring them on. Especially this part: “However, I also want to find out what falls back to software” (can I add: “and why it fell back” - i.e. is it not supported in hardware, did you exceed some hardware limit, was the phase of the moon wrong, or whatever?)

Programmable blending could actually be achieved on current hardware/drivers with multiple FBOs but at some fillrate cost; adding true programmability to it would be a great simplification of the API (traditional blending could then go to deprecated status) as well as bring OpenGL ahead of D3D in this regard. Would need hardware support though.

You can mostly already do this with some texelFetch calls; the idea of being able to predefine some behaviour and have the driver optimize it, then reuse it at will, is nice though. Would need hardware support too.

You wrote up a pretty interesting list of suggested features so I’ll try to respond accordingly.

Offline GLSL Compilation and Official Bytecode
This is something that has been requested a lot of times. Personally, I would go for it, though I’m pretty sure designing such a bytecode would be a non-trivial task.

Official Bindless Objects Extension
Personally, I think that the NVIDIA bindless extensions shouldn’t become core. Mainly because they introduce pointers which would make OpenGL programming, well, let’s say, less safe.
Also, the bindless extensions require the application developer to control which resources are resident, which is currently handled by the drivers themselves. So what really changes is that the overhead of managing which resources are resident for a draw call (which depends on what resources that draw call wants to use) moves from the driver side to the application side. The overhead doesn’t magically go away. Of course, applications that can make all their resources resident would definitely get a huge speedup, but I’m not sure how many real-life applications fall into that category.
Finally, while GL_AMD_pinned_memory also sounds like a “bindless” API, what it actually provides is a way to use client memory as buffer storage. In the case of APUs this might not be an issue, but on dedicated GPUs I wouldn’t use pinned memory for e.g. vertex buffers, as dedicated GPUs can access their local memory far faster than client memory (i.e. system memory). Of course, pinned memory can still be very useful, especially for frequently changed uniform buffers or for pixel unpack/pack buffers.
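
For completeness, a rough sketch of the pinned-memory usage pattern as I read the extension spec, so treat the details as approximate:

[code]
/* Rough sketch of GL_AMD_pinned_memory: page-aligned client memory becomes
 * the storage of a buffer object; the memory must stay allocated for the
 * buffer's lifetime and is visible to the GPU without glMapBuffer. */
#include <stdlib.h>

GLuint create_pinned_buffer(size_t size, void **client_mem)
{
    void *mem = NULL;
    /* POSIX; on Windows something like _aligned_malloc would be used instead */
    if (posix_memalign(&mem, 4096, size) != 0)
        return 0;

    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, buf);
    glBufferData(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, (GLsizeiptr)size, mem, GL_STREAM_DRAW);
    glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, 0);

    *client_mem = mem;  /* CPU writes here; synchronize with fences before reuse */
    return buf;
}
[/code]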

API To Determine What’s Supported By Hardware
If you use core OpenGL and respect e.g. the alignment restrictions (see MIN_MAP_BUFFER_ALIGNMENT, UNIFORM_BUFFER_OFFSET_ALIGNMENT, etc.) then you should be fine. I think OpenGL applications already have enough hassle handling all the different OpenGL versions and extensions. Btw, it is not a coincidence that D3D10 removed device caps and uses “feature levels” instead. Analogously, you can think of your OpenGL version as “feature levels”. The only things you might need to handle specially are the hardware-independent extensions like GL_ARB_explicit_attrib_location or GL_ARB_separate_shader_objects.
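
For example, respecting the uniform buffer offset alignment just means querying the limit once and rounding your offsets up to it; a minimal sketch:

[code]
/* Respect UNIFORM_BUFFER_OFFSET_ALIGNMENT when sub-allocating ranges out of
 * one large uniform buffer. In real code you would place the data at aligned
 * offsets up front; this just shows the query and the rounding. */
static GLintptr align_up(GLintptr offset, GLint alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}

void bind_ubo_range(GLuint buffer, GLuint binding_index, GLintptr offset, GLsizeiptr size)
{
    GLint align = 1;
    glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
    glBindBufferRange(GL_UNIFORM_BUFFER, binding_index, buffer,
                      align_up(offset, align), size);
}
[/code]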

Programmable Blending
Yes, this is another long-awaited feature.
You’ve mentioned the NVIDIA Tegra hardware, but you forgot that the Tegra is a fundamentally different architecture, even compared to NVIDIA’s desktop GPUs. Thus the fact that on the Tegra the blending is done within the fragment shader doesn’t mean that it can be done in the same way on a desktop GPU.
In fact, the blending stage is still more or less fixed-function hardware on all current desktop hardware afaik, and probably the main reason for this is performance.
However, with GL_ARB_shader_image_load_store and/or with GL_NV_texture_barrier (both of them are supported on NVIDIA and AMD) you could pretty much implement it.
Finally, I think image load/store based order-independent transparency is way more useful than programmable blending, as blending is order dependent and thus relies on the CPU to do expensive sorting.
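
For illustration, a very rough sketch of the image load/store route (a custom blend operator done in the fragment shader); it is only well-defined if you can guarantee no overlapping fragments between barriers, which is exactly where the grief comes from:

[code]
/* Custom "blend" via GL_ARB_shader_image_load_store: read the destination
 * pixel, combine it with the source color, write it back. Only safe when no
 * two in-flight fragments touch the same pixel between barriers (see the
 * extension's memory ordering rules and GL_NV_texture_barrier). */
static const char *custom_blend_fs =
    "#version 420 core\n"
    "layout(rgba8) coherent uniform image2D dst;\n"
    "in vec4 srcColor;\n"
    "void main() {\n"
    "    ivec2 p = ivec2(gl_FragCoord.xy);\n"
    "    vec4  d = imageLoad(dst, p);\n"
    "    // any operator you like, not just the fixed-function blend set:\n"
    "    vec4  o = vec4(srcColor.rgb + d.rgb * (1.0 - srcColor.a), 1.0);\n"
    "    imageStore(dst, p, o);\n"
    "}\n";
[/code]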

Sampler Shaders
The idea sounds great, though I’m not sure whether we need one more shader stage just for this (if that’s what you meant).
Rather, you should be able, more or less, to do it already using subroutines. Maybe the only thing that’s missing is a programmable way to set sampler parameters within the shader.
If what you meant is really a way to set sampler parameters programmatically, then I second your suggestion.

Official OpenGL Support for Partially Resident Textures
This one obviously depends on a single thing: whether NVIDIA can/will implement it. Period.

P.S.: Happy to see some to-the-point suggestions, as I haven’t seen that many lately (of course there were some, like buffer storage, multi-queries, fragment depth mask, but those are not common).

Yes, and one more thing about the “API To Determine What’s Supported By Hardware”:

If we get more built-in debug output messages in the drivers, I think we won’t have to worry that much about figuring out when we fall back to a software path.
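
For example, just hooking the existing GL_ARB_debug_output callback gives the driver a place to report such things; a minimal sketch (assuming a debug context and a GL function loader):

[code]
/* GL_ARB_debug_output: ask the driver to report performance warnings,
 * which is where software-fallback notices would naturally surface. */
#include <stdio.h>

static void APIENTRY debug_cb(GLenum source, GLenum type, GLuint id,
                              GLenum severity, GLsizei length,
                              const GLchar *message, const void *userParam)
{
    (void)source; (void)id; (void)severity; (void)length; (void)userParam;
    if (type == GL_DEBUG_TYPE_PERFORMANCE_ARB)
        fprintf(stderr, "GL performance warning: %s\n", message);
}

void enable_debug_output(void)
{
    glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS_ARB);  /* deliver on the offending call */
    glDebugMessageCallbackARB(debug_cb, NULL);
}
[/code]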

You don’t get the best optimizations you could. Maybe the driver implementation of the compiler is flawed. Maybe there’s a bug in the compiler implementation.

And if there is, there’s not a thing you can do about it.

What, do you think that this bytecode will magically transform itself into hardware-specific optimized shaders? No. It’s going to have to go through a compiler, just like everything else. And that compiler can be flawed and/or buggy, just like GLSL.

The only bugs you avoid with bytecode are those relating to the transformation of the C-like GLSL to your bytecode (ie: language parsing). All other bugs (such as optimizations, etc) are still available to be encountered. Not to mention, you’d open yourself up to all new bugs, because this “bytecode” will have to be parsed, errors emitted when it’s faulty, etc. All you’re doing is exchanging one compiler front-end for another.

You’re still “locked to one company to provide updates to a closed source compiler.”

This extension should be propagated back all the way to the first hardware that’s capable of executing shaders.

Well that’s just unrealistic. A hefty portion of that hardware is simply not being supported. It would take a lot of effort to define feature levels for 2.1-level functionality, all to no real purpose when over half the hardware doesn’t receive driver updates anymore.

Why is this still fixed function?

Because we have shader_image_load_store. If you want programmable blending, go write it yourself. It’s not like this proposed shader stage would be able to run on pre-4.x hardware anyway.

@mhagain

Yes, please - bring them on. Especially this part: “However, I also want to find out what falls back to software” (can I add: “and why it fell back” - i.e. is it not supported in hardware, did you exceed some hardware limit, was the phase of the moon wrong, or whatever?)

Haha, yes. I like the idea of finding out why too. But I’m personally more interested in finding out what the hardware does support to avoid the fall-back altogether. e.g., D3D has that ability (though, the API defines minimum limits now).

@aqnuep

However, with GL_ARB_shader_image_load_store and/or with GL_NV_texture_barrier (both of them are supported on NVIDIA and AMD) you could pretty much implement it.
Finally, I think image load/store based order-independent transparency is way more useful than programmable blending, as blending is order dependent and thus relies on the CPU to do expensive sorting.

Ah, GL_ARB_shader_image_load_store can do that then? Sweet. I don’t remember reading that extension’s documentation, so I’ll have a look. You have a good point on the OIT too. Thanks!

Rather, you should be able, more or less, to do it already using subroutines.

I had similar thoughts. I figure the sampler units could be emulated on current hardware using the existing subroutines functionality. The only reason I would want this is to reuse the existing GLSL texture lookup routines to make the code cleaner and easier to read/write/modify.

Analogously, you can think of your OpenGL version as “feature levels”.

That’s a good point. Though, I still prefer the idea of being able to find out what my limits are specifically, so I can work within those bounds. For example, imagine OpenGL defined the minimum limit of simultaneously bound textures to be 64. Let’s say I had good reason to bind 73 for a nice quality boost. I would like to find out whether I can bind those 73 before actually trying and risking a software fallback. (I can’t actually think of a situation where you would need that many textures bound simultaneously, but the same applies to other things.)

@Alfonse Reinheart

What, do you think that this bytecode will magically transform itself into hardware-specific optimized shaders? No. It’s going to have to go through a compiler, just like everything else. And that compiler can be flawed and/or buggy, just like GLSL.

I’ve considered your argument prior to posting. I disagree that what you suggest would be an issue. Bugs are inevitable in software development, no doubt. However, this model seems to work for D3D with most bugs being silly driver side things. It also offers several benefits.

[ul]
[li]Faster Installations.
[/li][li]Offline Optimization.
[/li][li]Use Any Frontend.
[/li][/ul]
Just like many users, I too complain of how long it takes to install certain software. (Though mostly, just to myself.) Why waste your customer’s (or your end-user’s) time at any point? One of my core principles is improving quality for the end-user. Making the installation faster, even just by a couple of seconds, is worth the “effort.”

You aren’t guaranteed any form of optimization for your shaders. You can mitigate this by running the shaders through an offline “optimizer” that basically just moves text around… That idea isn’t the best if you can just avoid the text distribution altogether. If you had your own shader bytecode generator, that you were in control of, you could implement any optimizations you like. (e.g., you could build your own work atop of systems like LLVM.) Not that you couldn’t technically do that already, but the bytecode solution is a bit more “workable.”

Maybe I don’t want to write my shaders in GLSL. Maybe I want to use my own shading language to implement them. Maybe I want Cg to be able to spit out binaries that don’t need the Cg runtime. (I’m stretching here, but I think you see my point.) It’s more efficient to output bytecode than text.

I realize, of course, that all of the features I just mentioned can be emulated using GLSL. I personally still prefer a uniform bytecode back-end.

You’re still “locked to one company to provide updates to a closed source compiler.”

True!

Well that’s just unrealistic. A hefty portion of that hardware is simply not being supported. It would take a lot of effort to define feature levels for 2.1-level functionality, all to no real purpose when over half the hardware doesn’t receive driver updates anymore.

You misunderstand me here. What I meant was that there’s no reason this shader bytecode requires newer hardware. (Obviously, different versions of the bytecode may; e.g., geometry shader bytecode won’t work on GL 2.x.) What I envision is an extension (of the ARB variety, of course) that can specify its version. This feature would be really useful on mobile devices too. With that in mind, I can see a version of the bytecode being supported for “lower-end” (SM2/SM3) devices. Though, it’s true that supporting SM1.x devices just isn’t worth the effort (we’ve already got vendor-neutral (mostly) assembly shaders for that). However, SM2 (which is a large chunk of the target market for indie developers currently) should be supported. (That also corresponds roughly to the feature set available on mobile devices currently, if I’m not mistaken.)

It’s not like this proposed shader stage would be able to run on pre-4.x hardware anyway.

It probably could with intrinsics, so to speak.

[HR][/HR]

Is there currently a vendor/driver-agnostic method by which I can choose how much optimization effort a driver puts into outputting its code? Being able to specify this would also be helpful, I imagine.

I should note that I dislike the idea of a GLSL to GLSL compiler. We don’t have (or at least, don’t commonly see) that with “proper” languages, such as C.

Cheers,
Aaron

The texture example you gave is a bad one, for two reasons:

  1. Texture units are unlikely to be software emulated; it’s a hard limit, so you shouldn’t have issues there.
  2. You can use texture arrays and thus have a much larger number of images that you can dynamically fetch from (see the sketch below).
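
A minimal sketch of the texture array route (2D array textures have been available since GL 3.0 / GL_EXT_texture_array):

[code]
/* Put many same-sized images into one GL_TEXTURE_2D_ARRAY and index layers
 * dynamically in the shader (sampler2DArray, texture(tex, vec3(uv, layer))),
 * instead of binding dozens of separate textures. */
GLuint create_texture_array(GLsizei w, GLsizei h, GLsizei layers)
{
    GLint max_layers = 0;
    glGetIntegerv(GL_MAX_ARRAY_TEXTURE_LAYERS, &max_layers);
    if (layers > max_layers)
        return 0;                              /* hard limit, queryable up front */

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8, w, h, layers, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    /* fill each slice with glTexSubImage3D(..., zoffset = layer, depth = 1, ...) */
    return tex;
}
[/code]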

Not to mention that besides software fallbacks, there are also slow hardware paths. One example might be an improperly aligned vertex array setup. This could hurt your performance by a decent amount, yet it is still a fully hardware-based path, just slower than the optimal one.

I believe debug output and better coverage of performance warnings in the driver are a better approach. Not to mention that you have to test your application on at least the most common hardware you want it to run on anyway, so software fallbacks and suboptimal hardware paths shouldn’t be that difficult to identify.

The core profile was mainly introduced to expose through the API only those features that have direct hardware support (on all hardware out there, across the supported generations and vendors, of course). While there may still be a couple of features or feature combinations that hit a software fallback, I strongly believe that in the long run simply sticking to core features will keep you on the hardware path.

The question of fast versus slow hardware paths is a more subtle issue, but no “feature level” kind of mechanism could solve that. The lesson remains the good old “test & benchmark” practice, with the added help of debug output.

That’s a really good point. I never thought of slower hardware paths (with the exception of alignment specifically). Thank you!

Cheers,
Aaron

[QUOTE=aqnuep;1238975]
However, with GL_ARB_shader_image_load_store and/or with GL_NV_texture_barrier (both of them are supported on NVIDIA and AMD) you could pretty much implement it.[/QUOTE]

load_store requires texture objects, texture_barrier has a whole bunch of restrictions attached, and there are still cases where blending is done without either. Sure, you could emulate it, with a bunch of grief and/or performance impact attached, but why go the half-assed and messy route when programmable blending would offer a cleaner and more performant alternative?

First, how do you know that programmable blending would offer a more performant alternative? How do you know that there weren’t earlier GPUs that actually used the same hardware for blending and image load/store?
Also, how does programmable blending solve the issue of order-independent transparency (which I suppose would be the primary use)?

Blending is itself a “half-assed and messy route” when it comes to rendering transparency, due to its order-dependent nature. Sure, blending can be used for a lot of other things where programmable blending might be useful. But how many order-independent (i.e. commutative) operators are out there that application developers would like to use? Add, multiply, min, max, etc. are all supported by current hardware.

Further, GL_NV_texture_barrier does have restrictions, but the scenarios it restricts wouldn’t work out well with blending either, unless you use an order-independent (i.e. commutative) operator, as mentioned before.

However, this model seems to work for D3D with most bugs being silly driver side things.

You’re conflating two different issues. The fact that D3D has fewer evident driver bugs is not because of how it compiles shaders. It’s due to several factors:

1: Writing D3D drivers is simpler than writing GL drivers. Simpler code means less bugs.

2: D3D is more heavily used than OpenGL. Because of that, more bugs are found. And because D3D software is quite popular, they are more quickly responded to than GL bugs. The best way to find and squash bugs is to use something, and code that doesn’t get used is more likely to be buggy.

Changing the language that gets compiled will change very little about how many bugs you will encounter. Indeed, you’ll likely get more bugs because driver developers will have to maintain their GLSL compilers too, for backwards compatibility reasons.

If you want to decrease compiler bugs, then put together a real test suite for GLSL. Then find a way to make driver developers test and fix bugs based on it.

You aren’t guaranteed any form of optimization for your shaders. You can mitigate this by running the shaders through an offline “optimizer” that basically just moves text around… That idea isn’t the best if you can just avoid the text distribution altogether. If you had your own shader bytecode generator, that you were in control of, you could implement any optimizations you like. (e.g., you could build your own work atop of systems like LLVM.) Not that you couldn’t technically do that already, but the bytecode solution is a bit more “workable.”

What kind of optimizations are you talking about? Loop unrolling? Function inlining? Dead code removal? That’s not very much in the grand scheme of shader logic; most of the real optimizations will have to be done by the driver.

One hardware’s optimization is another’s pessimization. Unrolling a loop on one piece of hardware can give a performance boost; on another, it can make things slower. The driver knows which is better because it’s hardware-specific. Better to rely on the driver to do the right thing than to rely on your personal hope that you can out-think the people who actually know their hardware.

However, SM2 (which is a large chunk of the target market for indie developers currently) should be supported. (That also corresponds roughly to the feature set available on mobile devices currently, if I’m not mistaken.)

If hardware isn’t being supported, it won’t get new OpenGL APIs. New APIs like this shader language of yours. Therefore, even if it could run it, it won’t because the IHV isn’t supporting the hardware anymore.

The “large chunk of the target market for indie developers” is primarily hardware that isn’t being supported: integrated Intel chips and any of AMD’s pre-HD hardware. NVIDIA is still supporting the GeForce 6xxx and 7xxx lines, but outside of that, you’ve got nothing.

Thus, any effort in this regard is going to help less than half of the “target market for indie developers.” So why bother?

It probably could with intrinsics, so to speak.

At which point, you simply have a more cumbersome way of specifying the blending equation. That’s not particularly helpful.

I think programmable blending will open the “forbidden” feature of reading from and writing to specific pixels inside the fragment shader.

I think programmable blending will open the “forbidden” feature of reading from and writing to specific pixels inside the fragment shader.

What do you think image_load_store allows you to do? :confused:

Yeah but you cannot directly read and write the front buffer.

You cannot load/store the window-system-provided color buffers, that’s true, but that’s because you cannot access the default framebuffer’s color buffers as textures in general (like you can in D3D). That’s definitely an issue with OpenGL and has been requested for a long time. But that’s another story.
For programmable blending, the fact that you have to work with an FBO and then copy the results to the default framebuffer won’t cause you any problems; it would be ultra-fast.
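
The copy itself is just a single blit; a minimal sketch:

[code]
/* Present the off-screen result to the default framebuffer. */
void present_fbo(GLuint fbo, GLint width, GLint height)
{
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);          /* default framebuffer */
    glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);
}
[/code]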

@Alfonse Reinheart

You’re conflating two different issues. The fact that D3D has fewer evident driver bugs is not because of how it compiles shaders. It’s due to several factors:

1: Writing D3D drivers is simpler than writing GL drivers. Simpler code means less bugs.

2: D3D is more heavily used than OpenGL. Because of that, more bugs are found. And because D3D software is quite popular, they are more quickly responded to than GL bugs. The best way to find and squash bugs is to use something, and code that doesn’t get used is more likely to be buggy.

Changing the language that gets compiled will change very little about how many bugs you will encounter. Indeed, you’ll likely get more bugs because driver developers will have to maintain their GLSL compilers too, for backwards compatibility reasons.

Okay. Have a look here: Implementing fixed function T&L in vertex shaders · Aras’ website. I ran into a similar need to emulate fixed-function support in my code, between different states (at the contractor’s request). These sorts of cases are few and far between, but having bytecode support made things much easier. I could generate the bytecode and have it recompiled dynamically. Not something you normally need to do. Compare that to GLSL, which takes much longer (link provided in a prior post within this thread). Feasible for Direct3D. Not for OpenGL.

What kind of optimizations are you talking about? Loop unrolling? Function inlining? Dead code removal? That’s not very much in the grand scheme of shader logic; most of the real optimizations will have to be done by the driver.

One hardware’s optimization is another’s pessimization. Unrolling a loop on one piece of hardware can give a performance boost; on another, it can make things slower. The driver knows which is better because it’s hardware-specific. Better to rely on the driver to do the right thing than to rely on your personal hope that you can out-think the people who actually know their hardware.

Mobile devices come to mind. Not all mobile devices have an offline machine code compiler for deploying with your project. For the devices that do not, the code has to be generated on the device, then sent back to you for later deployment (to avoid initial compilation). Additionally, there are cases where you may be able to implement a bytecode compiler that executes faster than the default GLSL one. So, not just optimization in the sense of code produced, but in time executed. You can control exactly which is more important. Keep in mind that driver writers must choose a balance between the two and may make a choice completely opposite of what you like.

Optimizations that are specific to the hardware can still be done. For example, D3D bytecode has a “rep” and “endrep” instruction pair. These could be unrolled by the driver if it determines it’s a good idea to do so. Likewise, instructions that were generated in an unrolled state could be “rolled up again.” Also, don’t trust the driver writers… Especially if they’re from Intel. So, in some cases, yes, I can get more optimized results for the hardware than the driver if I have the ability to do so. With GLSL it’s a free for all and you don’t know what you’re going to get back. The same is still true for the bytecode variant, but you at least have more control of it then.

A benefit of having a standardized bytecode (which would likely represent how the underlying hardware ISAs work anyway) is that you would know what that bytecode is. For some people that’s not important: they don’t care how something gets done as long as it gets done. Those people aren’t affected by this proposed extension. Then there are the people who like GLSL specifically; the presence of an IR binary format wouldn’t affect them either. A lot of people seem to want a standardized bytecode format, myself included.

Your argument, to me, seems mostly like “Java shouldn’t exist because I don’t use it,” in a manner of speaking. For better communication between the two of us, I request that you respond to the following inquiries in great detail.
  1. Why do you prefer that GLSL not have a standardized IR?
  2. What is your ideal communication mechanism between OpenGL shader representations and the GPU?
  3. What is it that you think I’m suggesting (in terms of this bytecode extension), exactly?

And, just a footnote for you. If I had to write an OpenGL and Direct3D driver for some GPU (that supports shaders), this is how I would do it.

  1. Convert GLSL source to D3D bytecode. (Possibly marked with a “special version token” to indicate that GL extensions can be supported.)
  2. Optimize the bytecode.
  3. Convert to binary form for whichever GPU backend I’m supporting.

If I had the same job, but with my proposed extension, this is how I would do it.

  1. Two frontends. One for GLSL. One for D3D bytecode.
  2. Generate to the IR (which could support per-vendor/driver extensions, just as GLSL does).
  3. Convert to binary form for whichever GPU backend I’m supporting.

The steps don’t change that much, and they both support some codebase sharing. (Yes, I realize that drivers are separated between D3D and GL. Source code can still be shared.) The benefit is that two stages of the pipeline no longer have to be done at runtime, but the IR can still be generated at runtime, dynamically. As mentioned above, this is for certain code injection techniques for shaders.

If OpenGL were designed to use the bytecode IR from the start, would you disagree with the pipeline even with GLSL support?
If so, please provide a strict example as to why my version is less optimal, or why it could not possibly help any developer.

Personally, I feel that GLSL should be handled offline anyway. You said it yourself, “1: Writing D3D drivers is simpler than writing GL drivers. Simpler code means less bugs.” I would push for Khronos to release an offline GLSL compilation kit if it were possible for them to do so. Then the drivers would be simpler if only for the fact that there’s a common IR to share and no need for language compilation. Reference drivers could be written that target software specifically using the IR. Interpreting an IR is simpler than interpreting almost free-form text. The driver no longer has to maintain decisions like register allocation for variables. (Though, they would have to do so from the IR. They would have to anyway because of D3D’s presence. Or they could use LLVM.) Again, IR compilation is simpler than language compilation. This would make writing drivers for GL simpler, which would introduce less bugs.

@aqnuep, Janika
I found the link mentioning programmable blending: Bending The Graphics Pipeline (SIGGRAPH 2010), see page 13. GL_ARB_shader_image_load_store is more recent, so it may be usable to emulate such support. That said, I do think a separate shader stage would allow for more elegant code. I haven’t used the extension, so it’s not something I can really comment on.

Cheers,
Aaron

Your argument, to me, seems mostly like “Java shouldn’t exist because I don’t use it,” in a manner of speaking.

No, my argument is, “some of your arguments don’t make sense.”

Would it be nice to have an intermediate language? Yes. Should we have one because it makes it easier to optimize the code? No, because it doesn’t make optimizing code easier; it only gives the illusion of that. Should we have one because it makes the driver less buggy? No, because it doesn’t make the driver less buggy.

See the difference? I’m not arguing against your position per se; I’m arguing that some of your arguments for this are of dubious merit.

Okay. Have a look here: Implementing fixed function T&L in vertex shaders · Aras’ website. I ran into a similar need to emulate fixed-function support in my code, between different states (at the contractor’s request). These sorts of cases are few and far between, but having bytecode support made things much easier. I could generate the bytecode and have it recompiled dynamically. Not something you normally need to do. Compare that to GLSL, which takes much longer (link provided in a prior post within this thread). Feasible for Direct3D. Not for OpenGL.

That’s nice but… how does this have anything to do with what I said? This has nothing to do with compiler bugs, which was what the part you quoted from me was talking about.

You seem to be arguing points that I’m not making. I never said that it wouldn’t make this situation easier. I said that it wouldn’t cause fewer compiler bugs.

Optimizations that are specific to the hardware can still be done. For example, D3D bytecode has a “rep” and “endrep” instruction pair. These could be unrolled by the driver if it determines it’s a good idea to do so. Likewise, instructions that were generated in an unrolled state could be “rolled up again.”

That would seem to work against the whole “making the compiler faster” issue, since now every time you load the bytecode, it has to scan it and decide to re-optimize things that you mistakenly de-optimized for it. Not only that, but because it’s in assembly form, it has much weaker semantics to work with than if it were GLSL.

Also, don’t trust the driver writers… Especially if they’re from Intel. So, in some cases, yes, I can get more optimized results for the hardware than the driver if I have the ability to do so. With GLSL it’s a free for all and you don’t know what you’re going to get back. The same is still true for the bytecode variant, but you at least have more control of it then.

First, what cases are you talking about? Do you have any specific examples?

Secondly and more importantly, you seem to be arguing against yourself here. On the one hand, you’re saying that you can optimize better than the driver. But you just said that the driver can basically override anything you were doing. It can re-roll loops you unrolled.

So really, you have no more control either way. You’re ultimately trusting the compiler not to do something stupid. Personally, if I’m going to put my faith in a compiler, I’d rather give it more semantics and information to work with rather than less.

If I had to write an OpenGL and Direct3D driver for some GPU (that supports shaders), this is how I would do it.

  1. Convert GLSL source to D3D bytecode.

Quite frankly, that’s a horrible idea.

The whole point of shoving GLSL down the driver’s throat is so that we can provide more semantic information to the compiler. And with greater semantic information comes more chances for hardware-specific optimization. Real data structures, functions, parameter passing, etc, all are crucial bits of information that can be used when making hardware-specific optimizations.

By forcing this two-stage compilation model on the system, you’re basically throwing vital information away. You’re taking the advantages that native GLSL provides and just pretending they don’t exist just to make your code slightly easier to write.

I hope, for the sake of hardware optimizations, that you aren’t hired to write GL drivers anywhere.

If OpenGL were designed to use the bytecode IR from the start, would you disagree with the pipeline even with GLSL support?

If the ARB had just kept using and updating ARB assembly, they would never have created GLSL in the first place. Odds are, HLSL would have simply become a universal standard, and there’d be some SourceForge project containing an HLSL-to-ARB-assembly compiler that most people who don’t want to code to the assembly would use.

So your hypothetical question is moot. Indeed, if they kept up with ARB assembly, and someone suggests GLSL now, I’d tell them to take a hike.

OpenGL only should have one shading language. And quite frankly, that’s my biggest argument against this:

Personally, I feel that GLSL should be handled offline anyway. You said it yourself, “1: Writing D3D drivers is simpler than writing GL drivers. Simpler code means less bugs.”

Yes, making drivers simpler would lead to less bugs. But the ARB is clearly reluctant to make backwards incompatible changes to OpenGL. Even getting rid of immediate mode rendering was the equivalent of pulling teeth, and it’s not like most GL implementations don’t still support all the old junk that was ostensibly ripped out. So the only way this proposal would actually be implemented is if you now have two compilers in the driver.

Two compilers is, pretty much by definition, less simple than one. Even if you internally make your GLSL compiler go to the bytecode, that’s still two compilers you have to support. That means two places where a failure can happen.

Intel struggles with supporting one compiler; how can you expect them to work with two?

The only way this could make things simpler is if you rewound time and made the ARB stick with ARB assembly instead of using 3D Labs’ asinine GLSL proposal. However, given that we’re already in this mess, and we can’t suddenly magic ourselves out of this mess, what you’re suggesting isn’t helping.

Ultimately, the best course of action is to just live with it. OpenGL is imperfect, and trying to make it perfect is only going to make the imperfections worse.

Oh, and let’s not forget a simple, practical fact: your proposal is nothing the ARB hasn’t heard dozens of times before. Go ahead; search this forum. It’s been suggested over and over since GLSL was adopted. It hasn’t happened in almost 10 years. The arguments for it haven’t changed a bit.

And yet, it still hasn’t been done. It took almost a decade to get separate shaders and program binaries, and those are also things people asked for even before GL 2.0. So I wouldn’t hold my breath.

No, my argument is, “some of your arguments don’t make sense.”

Would it be nice to have an intermediate language? Yes. Should we have one because it makes it easier to optimize the code? No, because it doesn’t make optimizing code easier; it only gives the illusion of that. Should we have one because it makes the driver less buggy? No, because it doesn’t make the driver less buggy.

See the difference? I’m not arguing against your position per se; I’m arguing that some of your arguments for this are of dubious merit.

Ah, okay. I had noticed you seemed to be singling out specific statements, but I didn’t think anything of it.

You seem to be arguing points that I’m not making. I never said that it wouldn’t make this situation easier. I said that it wouldn’t cause fewer compiler bugs.

I was cramming stuff into certain “sections” of the post, which is why they didn’t seem correlated. I forgot to move it all out into a separate “general purpose” section. Anyway, my point there was an argument for the overall feature. Irrelevant now.

First, what cases are you talking about? Do you have any specific examples?

Mostly mobile targets. Drivers can’t spend too much time optimizing the output bytecode or there would be a noticeable hiccup when you try using apps on the (already terribly slow) processors (CPU and GPU included), so they must find a balance. I think OpenGL may eventually exist as an “embedded” profile (instead of just GL ES), so it would make sense to have bytecode support in such cases.

Secondly and more importantly, you seem to be arguing against yourself here. On the one hand, you’re saying that you can optimize better than the driver. But you just said that the driver can basically override anything you were doing. It can re-roll loops you unrolled.

So really, you have no more control either way. You’re ultimately trusting the compiler not to do something stupid. Personally, if I’m going to put my faith in a compiler, I’d rather give it more semantics and information to work with rather than less.

I was providing separate arguments/POVs in favor of the bytecode there. They weren’t meant to go together, necessarily. In a sense, I’m arguing that the IR will already have many of the basic optimizations covered, but the driver can still do hardware-specific optimizations, loop unrolling being one of them.

Quite frankly, that’s a horrible idea.

The whole point of shoving GLSL down the driver’s throat is so that we can provide more semantic information to the compiler. And with greater semantic information comes more chances for hardware-specific optimization. Real data structures, functions, parameter passing, etc, all are crucial bits of information that can be used when making hardware-specific optimizations.

By forcing this two-stage compilation model on the system, you’re basically throwing vital information away. You’re taking the advantages that native GLSL provides and just pretending they don’t exist just to make your code slightly easier to write.

I hope, for the sake of hardware optimizations, that you aren’t hired to write GL drivers anywhere.

I should probably clarify what I meant by that a bit more. The D3D bytecode could be modified to support certain semantics if necessary, and metadata could be attached as well (just like my proposal for extensions). By baking that information in, the runtime can still make hardware-specific optimizations. I think it makes sense to only translate one IR to the equivalent hardware backend. I’m not sure what benefit GLSL (or any high-level language, for that matter) would have if I could bake “intents” into the bytecode as metadata anyway. Regardless, I’m not sure it’s a bad idea. It seems to me that OpenGL and Direct3D render shaders at about the same speed anyway. (It would be difficult to benchmark this accurately, I think.) Given that the same metadata can be included (though it is not necessarily required), do you still think this approach would be a horrible idea? I believe it to be reasonable. Any information the driver could use to improve performance can be encoded as optional metadata, and only one IR would have to be supported.

Two compilers is, pretty much by definition, less simple than one. Even if you internally make your GLSL compiler go to the bytecode, that’s still two compilers you have to support. That means two places where a failure can happen.

Intel struggles with supporting one compiler; how can you expect them to work with two?

lol! That was hilarious. And, good point!

Ultimately, the best course of action is to just live with it. OpenGL is imperfect, and trying to make it perfect is only going to make the imperfections worse.

I figured as much, but…

Oh, and let’s not forget a simple, practical fact: your proposal is nothing the ARB hasn’t heard dozens of times before. Go ahead; search this forum. It’s been suggested over and over since GLSL was adopted. It hasn’t happened in almost 10 years. The arguments for it haven’t changed a bit.

And yet, it still hasn’t been done. It took almost a decade to get separate shaders and program binaries, and those are also things people asked for even before GL 2.0. So I wouldn’t hold my breath.

… I was hoping that maybe, just maybe, the ARB might consider it. Even if just as a bolt-on. Hell, even updating the ARB assembly would be acceptable to me.

[hr][/hr]

I don’t mind GLSL (as a language). But there are a couple of things I’d like to see from its evolution, if that is what we must deal with…

  1. The ability to embed some form of ARB assembly, maybe. (I haven’t thought this one through, but it’s an interesting idea.)
  2. Better GL interfacing in terms of “info logs.” (It takes a bit of extra work to get filenames from the errors presented. It’s not difficult to hack it in, but it would be nice if I could specify how errors are formatted.) This may pose some security risks, but so do extensions like GL_AMD_pinned_memory.
  3. The ability to specify, from the GL, as well as in GLSL, how much optimization to apply to certain routines. Some form of control via “pragma” directives, or whatever, would be beneficial.
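
(For what it’s worth, the GLSL spec already defines #pragma optimize and #pragma debug, but those are only hints that drivers may ignore; point 3 above is about something with actual guarantees. A trivial sketch:)

[code]
/* The existing GLSL hints: #pragma optimize(on/off) and #pragma debug(on/off).
 * They are only hints and implementations may ignore them entirely. */
static const char *shader_prologue =
    "#version 330 core\n"
    "#pragma optimize(off)\n"   /* hint: spend less effort optimizing */
    "#pragma debug(on)\n";      /* hint: compile with debug information */
[/code]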

Cheers,
Aaron

I don’t buy it. At least not the argument that GLSL semantics enable some meaningful optimizations. GLSL is useful, but mainly for shader developers - we would be better off extending the ARB assembly (or just basing the bytecode on it) and having Khronos provide a GLSL -> ARB compiler. This would only eliminate some parsing effort, probably very minor; the bigger gain would be a standardized syntax that is the same everywhere - it’s pretty hard to screw up ARB program syntax. (But yeah, Khronos doesn’t do code, so that’s an impossible scenario.)

As far as I know, NVIDIA does exactly this with their compiler. Their ARB (NV) programs are an intermediate form they use for GLSL shaders (when you write particularly convoluted structures or hit a compiler bug in GLSL, they will often present you with the shader in that form).

Also, you can’t attach the backbuffer’s depth buffer to your FBO, which is something that has been possible in D3D since version 8.
http://www.opengl.org/wiki/Framebuffer_Object_Examples#The_main_framebuffer