Thread: Suggestions for OpenGL 5

  1. #1
    AaronMiller (Junior Member Newbie, joined May 2012, 5 posts)

    Suggestions for OpenGL 5

    Hello, here are a few suggestions of things I would really like to see in OpenGL.

    Offline GLSL Compilation and Official Bytecode
    Okay, this is something that's been requested before, but it has been ignored. There needs to be an official bytecode format that specifies the shader binary. Why? Offline compilation with ALL optimizations on, and third-party compilers. As it stands, you must distribute the GLSL source with your app (not a big deal) and then waste the user's time, on top of the time an install already takes, recompiling a shader that could perfectly well have been compiled offline. This has two detrimental effects. 1: Slower start-up time whenever hardware or drivers change. 2: You don't get the best optimizations you could. Maybe the driver implementation of the compiler is flawed. Maybe there's a bug in the compiler implementation. Maybe you simply don't want to use GLSL. Yeah, you can use Cg, but that's no better. You're locked to one company to provide updates to a closed source compiler. Not good for me.

    Offline GLSL compilation could be accomplished by more than just OpenGL implementors if an official bytecode existed and were maintained. Make it an extension at first, then make it part of the official spec. It doesn't matter how the bytecode looks, how complicated it is, or anything like that, as long as it's functional. Once you have the bytecode, just use the existing APIs to apply the binary; maybe it's exposed as a GL_BYTECODE_FORMAT enum or something similar. Direct3D has this feature: I can implement my own HLSL compiler and not be locked in to Microsoft's implementation if I want to. I can't do that with GLSL. It's kind of funny, actually: I can compile GLSL to HLSL bytecode (which is documented in the WDK, by the way).
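    Here's a rough sketch of what I mean, reusing the existing program-binary entry points. GL_SHADER_BYTECODE_FORMAT_ARB is made up, of course; it just stands in for whatever format token such an extension would define.

    Code:
    /* Hypothetical sketch only: GL_SHADER_BYTECODE_FORMAT_ARB does not exist. */
    GLuint load_bytecode_program(const void *bytecode, GLsizei size)
    {
        GLuint program = glCreateProgram();
        glProgramBinary(program, GL_SHADER_BYTECODE_FORMAT_ARB, bytecode, size);

        GLint linked = GL_FALSE;
        glGetProgramiv(program, GL_LINK_STATUS, &linked);
        if (!linked) {
            /* The driver could still reject the blob (e.g. an unsupported
               bytecode version), so a GLSL source fallback stays sensible. */
            glDeleteProgram(program);
            return 0;
        }
        return program;
    }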

    Some people will argue that the GL_ARB_get_program_binary extension already lets you produce binaries that are optimized for your specific hardware rather than just in general. Yes, that's true. However, there's no reason the bytecode format couldn't be converted to a hardware-optimized binary by the GL in the same way. It would probably be easier, too, since everything would already be in a bytecode format that is easy to optimize and interpret.
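    For reference, today's per-driver caching with GL_ARB_get_program_binary looks roughly like this; the resulting blob only loads back on the driver/hardware that produced it, which is exactly the limitation a standard bytecode would remove.

    Code:
    /* Rough sketch of the existing driver-specific caching pattern.
       Needs <stdio.h>, <stdlib.h> and a GL loader; link the program with
       GL_PROGRAM_BINARY_RETRIEVABLE_HINT set to GL_TRUE beforehand. */
    void cache_program_binary(GLuint program, FILE *out)
    {
        GLint length = 0;
        glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);

        void   *blob    = malloc(length);
        GLenum  format  = 0;
        GLsizei written = 0;
        glGetProgramBinary(program, length, &written, &format, blob);

        /* Store the format alongside the blob so it can be handed back
           to glProgramBinary later. */
        fwrite(&format, sizeof(format), 1, out);
        fwrite(blob, 1, (size_t)written, out);
        free(blob);
    }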

    This extension should be propagated back all the way to the first hardware that's capable of executing shaders. Why? Compatibility. D3D worked with bytecode all the way back to when shaders were first introduced. GL can too.

    Here's a comparison between D3D9 bytecode shader loads and OpenGL shader loads (among other things).
    https://github.com/aras-p/glsl-load-...er/results.txt

    Official Bindless Objects Extension
    NVIDIA and AMD offer separate but (as far as I can tell) equivalent solutions for accessing textures without binding them, saving a lot of unnecessary driver overhead. These are the two big players in the game. Then there's Intel: I've heard they offer a bindless solution as well, but I haven't bothered verifying this. In this day and age, where virtual texturing is becoming commonplace (at least in one form or another), the ability to directly access texture memory is becoming increasingly important.

    For reference, see GL_AMD_pinned_memory, GL_NV_bindless_texture, and GL_NV_vertex_buffer_unified_memory.
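    The bindless texture part looks roughly like this on NVIDIA (a sketch of the NV flow, not tested code; "u_diffuse" is just an example uniform name assumed to be declared as a bindless sampler2D in the shader):

    Code:
    /* Sketch of the GL_NV_bindless_texture flow. */
    void use_bindless_texture(GLuint program, GLuint texture)
    {
        GLuint64 handle = glGetTextureHandleNV(texture);
        glMakeTextureHandleResidentNV(handle);

        /* No glBindTexture / texture unit needed; the shader receives
           the 64-bit handle directly. */
        glUniformHandleui64NV(glGetUniformLocation(program, "u_diffuse"), handle);
    }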

    API To Determine What's Supported By Hardware
    D3D9 and below had GetDeviceCaps(). D3D10+ has "feature levels" and "feature sets" that you can check; these are useful for determining what the driver supports.

    Now, believe me, I like that OpenGL forces the driver to support some things, even if it has to fall back to software to do so. However, I also want to find out what falls back to software. That way, in my code, I can choose to avoid that particular feature and perform a work-around. Maybe this can be done through queries: issue an "is hardware" query for a set of OpenGL commands using the current state, then check the query for a boolean value. If true, the hardware can execute the GL commands within the query; if false, it can't, so do something else. This seems like it would be easy to implement. At least, it would be easy to add to the specs. (No new entry points, one added enum value, and some description.)
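    Something along these lines, purely to illustrate the idea; GL_IS_HARDWARE is an enum I just made up, it's not part of any real extension:

    Code:
    /* Hypothetical sketch of the proposed query. */
    GLboolean commands_run_in_hardware(void)
    {
        GLuint query;
        glGenQueries(1, &query);

        glBeginQuery(GL_IS_HARDWARE, query);
        /* ... issue the GL commands to test, with the current state ... */
        glEndQuery(GL_IS_HARDWARE);

        GLint result = GL_FALSE;
        glGetQueryObjectiv(query, GL_QUERY_RESULT, &result);
        glDeleteQueries(1, &query);
        return (GLboolean)result;   /* GL_FALSE -> take another code path */
    }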

    Programmable Blending
    Why is this still fixed function? I've noticed that in OpenGL ES, on NVIDIA Tegra hardware, the shader bytecode is regenerated whenever you change the fixed-function alpha blending state. (Someone at Unity discovered this, actually; I don't remember where I read it specifically, sorry.) So I assume that means the hardware is capable of performing blending in the shader. The D.I.C.E. Frostbite 2 (Battlefield 3) developers seem to want this as well. I'll leave this section short since AAA developers already want this feature. (Sorry for not providing more in terms of citation here.)

    Sampler Shaders
    Samplers seem quite fixed-function to me, as they rely on various state settings and so on. Perhaps sampler shaders could be implemented? Basically, applying a sampler shader would let you program exactly what kind of filtering happens when a texture image is fetched. This can be accomplished to some degree already by using point/nearest filtering (that is, no filtering at all) and taking several samples within the pixel shader. However, it is my understanding that the hardware that handles sampling can make a few optimizations based on the areas sampled. If that is correct, perhaps a sampler shader could be a more optimized way of performing that sampling. Two examples:

    1. Virtual textures and atlas textures have page boundaries that cannot be crossed. You usually have to add padding around each page (which is unfortunate), or put the data within volume textures (which can cause other issues, or may not be feasible given certain hardware limits). The third alternative is to implement the filtering yourself.
    2. Implement elliptical texture filtering (a higher-quality form of filtering) and use it across multiple shaders without using subroutines. With programmable samplers it would be easier to implement other shaders: you wouldn't have to use subroutines, just generate a new sampler and then access the texture data like normal within the shader.

    Overall, this isn't something that's hugely necessary, but I haven't heard of anything like it before and, at least for me, it would be fairly convenient. Also, with AMD's introduction of partially resident textures, I think this is probably unnecessary anyway.

    Official OpenGL Support for Partially Resident Textures
    I like AMD's implementation, but it's limited to AMD hardware. This should be a core requirement for GL 5.



    If there are ARB extensions available for anything I've mentioned above, I would love to know their names.

    I'm not looking for alternatives or work-arounds to any of the above suggestions. I know what I'm doing. These are just some things I'd fancy seeing in an official GL specification.

    Cheers,
    Aaron

  2. #2
    mhagain (Senior Member OpenGL Pro, joined Jan 2007, 1,789 posts)
    Quote Originally Posted by AaronMiller:
    You're locked to one company to provide updates to a closed source compiler. Not good for me.
    That's actually the current situation. The GLSL compiler provided by your vendor is already closed-source, and you're locked to your hardware vendor. Currently it's worse because each vendor must provide their own compiler, which introduces bugs and divergence (as well as potentially conformance-breaking "optimization"). Specifying the bytecode is actually a great idea though; it would force vendors to have some measure of consistency, but it would need to be a sensible spec (and I don't really have much faith in the ARB in that regard).

    Quote Originally Posted by AaronMiller:
    Official Bindless Objects Extension
    API To Determine What's Supported By Hardware
    Programmable Blending
    Yes, please - bring them on. Especially this part: "However, I also want to find out what falls back to software" (can I add: "and why it fell back" - i.e. is it not supported in hardware, did you exceed some hardware limit, was the phase of the moon wrong, or whatever?)

    Programmable blending could actually be achieved on current hardware/drivers with multiple FBOs but at some fillrate cost; adding true programmability to it would be a great simplification of the API (traditional blending could then go to deprecated status) as well as bring OpenGL ahead of D3D in this regard. Would need hardware support though.

    Quote Originally Posted by AaronMiller:
    Sampler Shaders
    You can mostly already do this with some texelFetch calls; the idea of being able to predefine some behaviour and have the driver optimize it, then reuse it at will, is nice though. Would need hardware support too.
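    For example, manual bilinear via texelFetch looks roughly like this (edge clamping omitted, so treat it as a sketch rather than production code):

    Code:
    /* Fragment shader doing its own bilinear filter with texelFetch,
       embedded the usual way as a C string. */
    static const char *manual_bilinear_fs =
        "#version 330 core\n"
        "uniform sampler2D u_tex;\n"
        "in vec2 v_uv;\n"
        "out vec4 o_color;\n"
        "void main() {\n"
        "    vec2 size = vec2(textureSize(u_tex, 0));\n"
        "    vec2 pos  = v_uv * size - 0.5;\n"
        "    ivec2 p0  = ivec2(floor(pos));\n"
        "    vec2 f    = fract(pos);\n"
        "    vec4 a = texelFetch(u_tex, p0,               0);\n"
        "    vec4 b = texelFetch(u_tex, p0 + ivec2(1, 0), 0);\n"
        "    vec4 c = texelFetch(u_tex, p0 + ivec2(0, 1), 0);\n"
        "    vec4 d = texelFetch(u_tex, p0 + ivec2(1, 1), 0);\n"
        "    o_color = mix(mix(a, b, f.x), mix(c, d, f.x), f.y);\n"
        "}\n";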

  3. #3
    aqnuep (Advanced Member Frequent Contributor, Hungary, joined Dec 2007, 989 posts)
    You wrote up a pretty interesting list of suggested features so I'll try to respond accordingly.

    Offline GLSL Compilation and Official Bytecode
    This is something that has been requested a lot of times. Personally, I would go for it, though I'm pretty sure designing such a bytecode would be a non-trivial task.

    Official Bindless Objects Extension
    Personally, I think that the NVIDIA bindless extensions shouldn't become core. Mainly because they introduce pointers which would make OpenGL programming, well, let's say, less safe.
    Also, the bindless extensions require the application developer to control which resources are resident, which is currently handled by the drivers themselves. So I believe what really changes is that the overhead of managing which resources are resident for a draw call (which depends on what resources that draw call wants to use) moves from the driver side to the application side. The overhead doesn't magically go away. Of course, applications that can make all their resources resident would definitely get a huge speedup, but I'm not sure how many real-life applications fall into this category.
    Finally, while GL_AMD_pinned_memory also sounds like a "bindless" API, what it actually provides is a way to use client memory as buffer storage. While this might not be an issue on APUs, on dedicated GPUs I wouldn't use pinned memory for e.g. vertex buffers, as dedicated GPUs can access their local memory much faster than client memory (i.e. system memory). Of course, pinned memory can still be very useful, especially for frequently updated uniform buffers or for pixel pack/unpack buffers.
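    For reference, the pinned-memory usage pattern is roughly the following (a sketch with error handling omitted; page-aligned allocation shown with POSIX posix_memalign, and the exact alignment requirement depends on the platform):

    Code:
    /* Sketch of GL_AMD_pinned_memory: the client allocation itself becomes
       the buffer's storage, so it must outlive the buffer object. */
    #include <stdlib.h>

    GLuint create_pinned_buffer(size_t size, void **out_mem)
    {
        void *mem = NULL;
        posix_memalign(&mem, 4096, size);   /* page-aligned client memory */

        GLuint buf;
        glGenBuffers(1, &buf);
        glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, buf);
        glBufferData(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, size, mem, GL_STREAM_COPY);

        *out_mem = mem;   /* the caller keeps writing into this directly */
        return buf;
    }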

    API To Determine What's Supported By Hardware
    If you use core OpenGL and respect e.g. the alignment restrictions (see MIN_MAP_BUFFER_ALIGNMENT, UNIFORM_BUFFER_OFFSET_ALIGNMENT, etc.) then you should be fine. I think OpenGL applications already have enough hassle handling all the different OpenGL versions and extensions. By the way, it is not a coincidence that D3D10 removed device caps in favor of "feature levels". Analogously, you can think of your OpenGL version as "feature levels". The only thing you might need to handle specially is hardware-independent extensions like GL_ARB_explicit_attrib_location or GL_ARB_separate_shader_objects.
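    Those limits are simple queries, e.g. (round your offsets up to multiples of these values before calls like glBindBufferRange):

    Code:
    void query_alignment_limits(GLint *mapAlign, GLint *uboAlign)
    {
        glGetIntegerv(GL_MIN_MAP_BUFFER_ALIGNMENT, mapAlign);          /* GL 4.2+ */
        glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, uboAlign);   /* GL 3.1+ */
    }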

    Programmable Blending
    Yes, this is another long-awaited feature.
    You've mentioned the NVIDIA Tegra hardware, but you forgot that the Tegra is a fundamentally different architecture, even compared to NVIDIA's desktop GPUs. Thus the fact that on the Tegra the blending is done within the fragment shader doesn't mean that it can be done in the same way on a desktop GPU.
    In fact, the blending stage is still more or less fixed-function hardware on all current desktop hardware afaik, and probably the main reason for this is performance.
    However, with GL_ARB_shader_image_load_store and/or with GL_NV_texture_barrier (both of them are supported on NVIDIA and AMD) you could pretty much implement it.
    Finally, I think load/store image based order independent transparency is way more useful than programmable blending as blending is order dependent, thus relies on the CPU to do expensive sorting.
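    Just to illustrate, a very rough sketch of what "blending in the fragment shader" via image load/store could look like (classic source-over; overlapping fragments within a single draw can still race without extra synchronization, so treat this as a sketch rather than a drop-in replacement):

    Code:
    /* Sketch only: source-over blending through image load/store instead
       of the fixed-function blender. */
    static const char *image_blend_fs =
        "#version 420 core\n"
        "layout(rgba8) coherent uniform image2D u_target;\n"
        "in vec4 v_color;\n"
        "void main() {\n"
        "    ivec2 p  = ivec2(gl_FragCoord.xy);\n"
        "    vec4 dst = imageLoad(u_target, p);\n"
        "    vec4 src = v_color;\n"
        "    imageStore(u_target, p, vec4(src.rgb * src.a + dst.rgb * (1.0 - src.a), 1.0));\n"
        "}\n";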

    Sampler Shaders
    The idea sounds great, though I'm not sure whether we need one more shader stage just for this (if that's what you meant).
    Rather, you should be able, more or less, to do it already using subroutines. Maybe the only thing that's missing is a programmable way to set sampler parameters within the shader.
    If what you meant is really a way to set sampler parameters programmatically, then I second your suggestion.
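    Roughly what I mean by using subroutines: the "sampling policy" becomes a subroutine uniform that you select from the API with glUniformSubroutinesuiv (just a sketch):

    Code:
    /* Sketch: a subroutine uniform acting as a programmable sampling policy. */
    static const char *subroutine_fs =
        "#version 400 core\n"
        "subroutine vec4 SampleFunc(vec2 uv);\n"
        "subroutine uniform SampleFunc u_sample;\n"
        "uniform sampler2D u_tex;\n"
        "in vec2 v_uv;\n"
        "out vec4 o_color;\n"
        "subroutine(SampleFunc) vec4 sampleNearest(vec2 uv) {\n"
        "    return texelFetch(u_tex, ivec2(uv * vec2(textureSize(u_tex, 0))), 0);\n"
        "}\n"
        "subroutine(SampleFunc) vec4 sampleFiltered(vec2 uv) {\n"
        "    return texture(u_tex, uv);\n"
        "}\n"
        "void main() { o_color = u_sample(v_uv); }\n";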

    Official OpenGL Support for Partially Resident Textures
    This one obviously depends on a single thing: whether NVIDIA can/will implement it. Period.

    P.S.: Happy to see some to-the-point suggestions, as I haven't seen many lately (of course there have been some, like buffer storage, multi-queries, and fragment depth mask, but those are not common).
    Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
    Technical Blog: http://www.rastergrid.com/blog/

  4. #4
    aqnuep (Advanced Member Frequent Contributor, Hungary, joined Dec 2007, 989 posts)
    Yes, and one more thing about the "API To Determine What's Supported By Hardware":

    If we get more built-in debug output messages in the drivers, I think we won't have to worry that much about figuring out when we fall back to a software path.
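    Hooking that up is already straightforward with ARB_debug_output / KHR_debug; shown here roughly in its GL 4.3 core form as a sketch (the calling-convention macro comes from your GL headers):

    Code:
    /* Sketch: route driver performance warnings (including software
       fallbacks, where the driver reports them) to stderr. */
    #include <stdio.h>

    static void GLAPIENTRY debug_cb(GLenum source, GLenum type, GLuint id,
                                    GLenum severity, GLsizei length,
                                    const GLchar *message, const void *userParam)
    {
        (void)source; (void)id; (void)severity; (void)length; (void)userParam;
        if (type == GL_DEBUG_TYPE_PERFORMANCE)
            fprintf(stderr, "GL perf warning: %s\n", message);
    }

    void install_gl_debug_output(void)   /* call on a debug context */
    {
        glEnable(GL_DEBUG_OUTPUT);
        glDebugMessageCallback(debug_cb, NULL);
    }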

  5. #5
    Alfonse Reinheart (Senior Member OpenGL Lord, joined May 2009, 6,064 posts)
    You don't get the best optimizations you could. Maybe the driver implementation of the compiler is flawed. Maybe there's a bug in the compiler implementation.
    And if there is, there's not a thing you can do about it.

    What, do you think that this bytecode will magically transform itself into hardware-specific optimized shaders? No. It's going to have to go through a compiler, just like everything else. And that compiler can be flawed and/or buggy, just like GLSL.

    The only bugs you avoid with bytecode are those relating to the transformation of the C-like GLSL to your bytecode (i.e. language parsing). All other bugs (optimization bugs and the like) can still be encountered. Not to mention, you'd open yourself up to all-new bugs, because this "bytecode" will have to be parsed, errors emitted when it's faulty, and so on. All you're doing is exchanging one compiler front-end for another.

    You're still "locked to one company to provide updates to a closed source compiler."

    This extension should be propagated back all the way to the first hardware that's capable of executing shaders.
    Well that's just unrealistic. A hefty portion of that hardware is simply not being supported. It would take a lot of effort to define feature levels for 2.1-level functionality, all to no real purpose when over half the hardware doesn't receive driver updates anymore.

    Why is this still fixed function?
    Because we have shader_image_load_store. If you want programmable blending, go write it yourself. It's not like this proposed shader stage would be able to run on pre-4.x hardware anyway.

  6. #6
    AaronMiller (Junior Member Newbie, joined May 2012, 5 posts)
    @mhagain
    Yes, please - bring them on. Especially this part: "However, I also want to find out what falls back to software" (can I add: "and why it fell back" - i.e. is it not supported in hardware, did you exceed some hardware limit, was the phase of the moon wrong, or whatever?)
    Haha, yes. I like the idea of finding out why too. But I'm personally more interested in finding out what the hardware does support, to avoid the fallback altogether. D3D has that ability, for example (though the API defines minimum limits now).

    @aqnuep
    However, with GL_ARB_shader_image_load_store and/or with GL_NV_texture_barrier (both of them are supported on NVIDIA and AMD) you could pretty much implement it.
    Finally, I think load/store image based order independent transparency is way more useful than programmable blending as blending is order dependent, thus relies on the CPU to do expensive sorting.
    Ah, GL_ARB_shader_image_load_store can do that then? Sweet. I don't remember reading that extension's documentation, so I'll have a look. You have a good point on the OIT too. Thanks!

    Rather, you should be able, more or less, to do it already using subroutines.
    I had similar thoughts. I figure the sampler units could be emulated on current hardware using the existing subroutines functionality. The only reason I would want this is to reuse the existing GLSL texture lookup routines to make the code cleaner and easier to read/write/modify.

    Analogously, you can think of your OpenGL version as "feature levels".
    That's a good point. Though, I still prefer the idea of being able to find out what my limits are specifically, so I can work within those bounds. For example, imagine OpenGL defined the minimum number of simultaneously bound textures as 64. Let's say I had good reason to bind 73 for a nice quality boost. I would like to find out whether I can bind those 73 before actually trying and risking a software fallback. (I can't actually think of a situation where you would need that many textures simultaneously bound, but the same situation can apply to other things.)
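    What I'd like is being able to do something like this up front (73 and 64 are just the made-up numbers from my example), and also know the result won't silently fall back:

    Code:
    /* Sketch: check the real implementation limit before committing to it.
       This catches the hard limit, but today says nothing about whether a
       particular combination ends up on a software path. */
    int choose_texture_count(void)
    {
        GLint maxUnits = 0;
        glGetIntegerv(GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS, &maxUnits);
        return maxUnits >= 73 ? 73 : 64;
    }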

    @Alfonse Reinheart
    What, do you think that this bytecode will magically transform itself into hardware-specific optimized shaders? No. It's going to have to go through a compiler, just like everything else. And that compiler can be flawed and/or buggy, just like GLSL.
    I considered your argument prior to posting, and I disagree that what you suggest would be an issue. Bugs are inevitable in software development, no doubt. However, this model seems to work for D3D, with most bugs being silly driver-side things. It also offers several benefits:
    • Faster Installations.
    • Offline Optimization.
    • Use Any Frontend.

    Just like many users, I too complain about how long it takes to install certain software (though mostly just to myself). Why waste your customers' (or your end-users') time at any point? One of my core principles is improving quality for the end user. Making the installation faster, even just by a couple of seconds, is worth the "effort."

    You aren't guaranteed any form of optimization for your shaders. You can mitigate this by running the shaders through an offline "optimizer" that basically just moves text around... but that idea isn't great when you could avoid distributing text altogether. If you had your own shader bytecode generator that you were in control of, you could implement any optimizations you like (e.g., you could build your own work atop systems like LLVM). Not that you couldn't technically do that already, but the bytecode solution is a bit more workable.

    Maybe I don't want to write my shaders in GLSL. Maybe I want to use my own shading language to implement them. Maybe I want Cg to be able to spit out binaries that don't need the Cg runtime. (I'm stretching here, but I think you see my point.) It's more efficient to output bytecode than text.

    I realize, of course, that all of the features I just mentioned can be emulated using GLSL. I personally still prefer a uniform bytecode back-end.

    You're still "locked to one company to provide updates to a closed source compiler."
    True!

    Well that's just unrealistic. A hefty portion of that hardware is simply not being supported. It would take a lot of effort to define feature levels for 2.1-level functionality, all to no real purpose when over half the hardware doesn't receive driver updates anymore.
    You misunderstand me here. What I meant was that there's no reason this shader bytecode requires newer hardware. (Obviously, different versions of the bytecode may; e.g., geometry shader bytecode won't work on GL 2.x.) What I envision is an extension (of the ARB variety, of course) that can specify its version. This feature would be really useful on mobile devices too. With that in mind, I can see a version of the bytecode being supported for "lower-end" (SM2/SM3) devices. It's true that supporting SM1.x devices just isn't worth the effort (we've already got mostly vendor-neutral assembly shaders for that). However, SM2, which is currently a large chunk of the target market for indie developers, should be supported. (That also corresponds roughly to the feature set available on mobile devices currently, if I'm not mistaken.)

    It's not like this proposed shader stage would be able to run on pre-4.x hardware anyway.
    It probably could with intrinsics, so to speak.




    Is there currently a vendor/driver-agnostic method by which I can choose how much optimization effort a driver puts into outputting its code? Being able to specify this would also be helpful, I imagine.

    I should note that I dislike the idea of a GLSL-to-GLSL compiler. We don't have (or at least don't commonly see) that with "proper" languages, such as C.

    Cheers,
    Aaron

  7. #7
    aqnuep (Advanced Member Frequent Contributor, Hungary, joined Dec 2007, 989 posts)
    Quote Originally Posted by AaronMiller:
    That's a good point. Though, I still prefer the idea of being able to find out what my limits are specifically, so I can work within those bounds. For example, imagine OpenGL defined the minimum number of simultaneously bound textures as 64. Let's say I had good reason to bind 73 for a nice quality boost. I would like to find out whether I can bind those 73 before actually trying and risking a software fallback. (I can't actually think of a situation where you would need that many textures simultaneously bound, but the same situation can apply to other things.)
    The texture example you gave is a bad one, for two reasons:
    1. Texture units are unlikely to be software emulated; it's rather a hard limit, so you shouldn't have issues there.
    2. You can use texture arrays and thus have a much larger number of images that you can dynamically fetch from.

    Not to mention that besides software fallbacks, there are also slow hardware paths. One example might be an improperly aligned vertex array setup. This could hurt your performance by a decent amount, yet it is still a fully hardware-based path, just slower than the optimal one.
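    To illustrate the vertex array point, the safe habit is to keep every attribute offset and the overall stride at least 4-byte aligned; the layout below is just an example, not a universal rule:

    Code:
    /* Sketch: interleaved layout whose offsets and stride stay 4-byte aligned. */
    #include <stddef.h>   /* offsetof */

    typedef struct {
        float         position[3];   /* offset 0,  12 bytes */
        unsigned char color[4];      /* offset 12,  4 bytes; stride ends up 16 */
    } Vertex;

    void setup_vertex_layout(void)
    {
        glVertexAttribPointer(0, 3, GL_FLOAT,         GL_FALSE, sizeof(Vertex),
                              (const void *)0);
        glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE,  sizeof(Vertex),
                              (const void *)offsetof(Vertex, color));
        glEnableVertexAttribArray(0);
        glEnableVertexAttribArray(1);
    }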

    I believe debug output and better coverage of performance warnings in the driver are a better approach. Not to mention that you have to test your application on at least the most common hardware you want it to run on anyway, so software fallbacks and suboptimal hardware paths shouldn't be that difficult to identify.

    The core profile was mainly introduced to expose through the API only those features that have direct hardware support (on all hardware out there, including the different supported generations and different vendors, of course). While there may still be a couple of features or feature combinations that hit a software fallback, I strongly believe that in the long run simply relying on core-only features will keep you on the hardware path.

    The question of fast versus slow hardware paths is a more subtle issue, but no "feature level" style mechanism could solve that either. The lesson remains the good old "test and benchmark" practice, with the added help of debug output.

  8. #8
    AaronMiller (Junior Member Newbie, joined May 2012, 5 posts)
    That's a really good point. I never thought of slower hardware paths (with the exception of alignment specifically). Thank you!

    Cheers,
    Aaron

  9. #9
    mhagain (Senior Member OpenGL Pro, joined Jan 2007, 1,789 posts)
    Quote Originally Posted by aqnuep:
    However, with GL_ARB_shader_image_load_store and/or with GL_NV_texture_barrier (both of them are supported on NVIDIA and AMD) you could pretty much implement it.
    load_store requires texture objects, texture_barrier has a whole bunch of restrictions attached, and there are still cases where blending is done without either. Sure, you could emulate it, with a bunch of grief and/or performance impact attached, but why go the half-assed and messy route when programmable blending would offer a cleaner and more performant alternative?

  10. #10
    aqnuep (Advanced Member Frequent Contributor, Hungary, joined Dec 2007, 989 posts)
    Quote Originally Posted by mhagain:
    load_store requires texture objects, texture_barrier has a whole bunch of restrictions attached, and there are still cases where blending is done without either. Sure, you could emulate it, with a bunch of grief and/or performance impact attached, but why go the half-assed and messy route when programmable blending would offer a cleaner and more performant alternative?
    First, how do you know that programmable blending would offer a more performant alternative? How do you know that there weren't earlier GPUs that actually used the same hardware for blending and image load/store?
    Also, how does programmable blending solve the issue of order-independent transparency (which would be the primary use, I suppose)?

    Blending is itself a "half-assed and messy route" when it comes to rendering transparency, due to its order-dependent nature. Sure, blending can be used for a lot of other things where programmable blending might be useful. But how many order-independent (i.e. commutative) operators are out there that application developers would like to use? Add, multiply, min, max, etc. are all supported by current hardware.

    Further, GL_NV_texture_barrier does have restrictions, but the scenarios it restricts wouldn't work out well with blending either, unless you use an order-independent (i.e. commutative) operator, as mentioned before.
