Last chance for glNext predictions!

With a GDC presentation on glNext in less than two weeks, I thought it would be good to get final predictions on the nature of the API. So here are mine:

BTW: “Has command buffers” doesn’t count; everyone has those.

Automatic synchronization

Yeah, you can pretty much kiss that goodbye.

OpenGL offers a lot of automated synchronization. If you render to images in a texture, then unbind that FBO, bind those textures to the context, and render something else, OpenGL has your back. It'll issue the command to flush those lines out of the ROP's write cache once the first rendering operation is done. It'll issue the command to halt the GPU's command processing until the first rendering operation and the cache flush have finished. And so forth.

This is fine for a state-machine driven world. But in a command-buffer-based world, that’s not tenable, especially if you want to be able to build up command buffers in advance. The user needs to be able to decide explicitly when synchronization happens and which domain is affected. So Image Load/Store’s incoherent memory access is going to be the standard way to work, along with something not entirely unlike sync objects. I also expect that there will be some way to send a GPU command that ensures that commands issued next will only start executing when commands issued before it have fully completed.
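
For a sense of what that style looks like in practice, here is a rough sketch using pieces OpenGL already has today: a memory barrier for incoherent image writes plus a fence sync object for ordering against both the GPU and the CPU. It's an illustration of the style, not a prediction of glNext's actual function names.

    // Make earlier incoherent image stores visible to later texture fetches.
    glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

    // Insert a fence after the producing commands...
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    // ...either make later GPU commands wait on it without stalling the CPU...
    glWaitSync(fence, 0, GL_TIMEOUT_IGNORED);

    // ...or make the CPU wait before reading results back.
    GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                     16 * 1000 * 1000);   // timeout in nanoseconds (~16 ms)
    if (status == GL_TIMEOUT_EXPIRED) {
        // Not done yet; wait longer or go do other work.
    }
    glDeleteSync(fence);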

Also, I seriously doubt that in-order processing will be guaranteed for non-command-buffer operations, like reads and writes to buffer and texture objects. So you're going to have to know which commands are using which objects, so that you can synchronize with them.

So I hope you like synchronization.

Explicit multi-GPU support

Um, duh!

I expect textures, buffers, and command queues will be created on, and live within, a specific GPU device (or perhaps devices if certain devices allow sharing). Any other kinds of objects should likewise be GPU-local. There could be some kind of sharing system, but that sounds too high-level.

If an implementation wants to support something like automatic SLI, I’d expect it would be exposed as a specific GPU type. So on a multi-GPU SLI-type system, you’ll see GPU1, GPU2, and GPU-sli.
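
A minimal sketch of how that might look. Every gnx* name below is invented purely for illustration; it is not anything Khronos has announced.

    // Hypothetical API: enumerate physical GPUs, then create everything
    // against a specific one.
    gnxDeviceId gpus[8];
    uint32_t gpuCount = 8;
    gnxEnumerateDevices(&gpuCount, gpus);      // might report GPU1, GPU2, GPU-sli

    gnxDevice dev = gnxCreateDevice(gpus[0]);

    // Queues, buffers, and textures are created on, and live on, that
    // specific device, rather than floating in a global context.
    gnxCommandQueue queue = gnxCreateCommandQueue(dev);
    gnxBuffer       vbo   = gnxCreateBuffer(dev, 64 * 1024);
    gnxTexture      color = gnxCreateTexture2D(dev, 1920, 1080, GNX_FORMAT_RGBA8);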

Displays

Well, the current OpenGL situation is simply untenable on Windows. It’s tied into Win32, which Microsoft has been making increasingly clear they’d love to ditch if they could get away with it (FYI: OpenGL is not why they can’t get away with it). Plus, it’s just not a good way to interact with a display.

So I predict we’ll be seeing something more like the Apple iOS way of handling displays. You render to some form of image that you create and allocate on your own, then shove that image off on something that represents a viewable surface. This also pushes towards being more low-level than OpenGL’s default framebuffer.
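
Something like this hypothetical flow, with the gnx* names again being purely illustrative:

    // Allocate your own presentable image; there is no "default framebuffer".
    gnxImage backbuffer = gnxCreateImage(dev, 1920, 1080, GNX_FORMAT_RGBA8);

    // ...record and submit rendering commands that target backbuffer...

    // Then hand the finished image to whatever represents a viewable surface.
    gnxSurface surface = gnxCreateSurfaceForWindow(dev, nativeWindowHandle);
    gnxPresent(surface, backbuffer);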

Objects

80% chance of immutability post-creation. This is what the ARB wanted back in Longs Peak, and there are even more reasons for it now. I would guess that they'd use Apple Metal's style of object creation (because it was originally LP's style): you make a mutable descriptor object, then pass it off to a constructor function that creates the actual, immutable object.
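
A sketch of that descriptor-then-create pattern, with invented gnx* names standing in for whatever the real API calls things:

    // Fill out a mutable descriptor...
    gnxSamplerDescriptor desc = {};
    desc.minFilter = GNX_FILTER_LINEAR;
    desc.magFilter = GNX_FILTER_LINEAR;
    desc.wrapS     = GNX_WRAP_REPEAT;
    desc.wrapT     = GNX_WRAP_REPEAT;

    // ...then bake it into an object whose state can never change afterwards.
    gnxSampler sampler = gnxCreateSampler(dev, &desc);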

Obviously bulk data like a texture/buffer’s contents will still be mutable.

I can certainly see the existence of state objects for certain collections of state that are often grouped together: vertex formats (just the format, not the buffers), blending state, framebuffers, etc. Then again, Apple Metal is rather inconsistent about this. Some state is set via direct commands in the command buffer (vertex format state), while other state lives in objects, even though users will very likely want to make several vertex-format calls whenever they do change formats.

Immutable state objects make creating new command buffers much easier. Rather than making dozens of API calls to set up the initial state for a rendering, you just make one: create command buffer starting from this state. Granted, most state change and draw calls should have minimal overhead since they don’t execute until later, but whatever.
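
Something along these lines, with hypothetical names again, where the queue and a descriptor-built pipelineState object are assumed to exist as in the earlier sketches:

    // Start a command buffer from an immutable, pre-built state object
    // instead of issuing dozens of individual state-setting calls.
    gnxCommandBuffer cmd = gnxBeginCommandBuffer(queue, pipelineState);
    gnxCmdDraw(cmd, /*firstVertex*/ 0, /*vertexCount*/ 36);
    gnxEndCommandBuffer(cmd);
    gnxSubmit(queue, cmd);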

Low level memory access

By this, I mean anything from being able to have real GPU pointers and raw GPU malloc to the ability to control whether a buffer/texture is in GPU memory or not. Things like that.

My guess? Not on your life.

I highly doubt that memory access will be much more low-level than it currently is in OpenGL/D3D. While handle residency for bindless textures may be a thing, that's much more about keeping a handle pointing at the current location of a texture's images than about explicit texture memory management. I rather doubt the system will let you get into the details of texture/buffer management. Or at least, not much more than you can right now.

I don’t see the IHVs wanting to give users that kind of low-level control. Not on a GPU that needs to share memory among processes.

Indeed, we'll probably see invalidation go bye-bye. That's normally a trick used to make the driver internally reallocate a buffer object's storage if it's still in use, so that you don't have to wait for the GPU to finish with it. Well, if you want to do that in glNext, there will be a way: keep track of how you used the buffer, use the explicit sync features to see whether the commands that use it are still in progress, and allocate a new one if they are.
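
That do-it-yourself approach is already possible with today's sync objects. A sketch, using real GL calls and an illustrative helper struct of my own invention:

    // Track the fence for the last submission that used this buffer.
    struct TrackedBuffer {
        GLuint buffer;
        GLsync lastUse;   // fence inserted right after the commands that used it
    };

    static bool StillInUse(const TrackedBuffer& tb) {
        if (!tb.lastUse)
            return false;
        // Timeout of zero: poll the fence without blocking.
        return glClientWaitSync(tb.lastUse, 0, 0) == GL_TIMEOUT_EXPIRED;
    }

    // "Invalidation", done by hand: reuse the storage if the GPU is finished
    // with it, otherwise allocate a fresh buffer and write into that instead.
    static GLuint GetWritableBuffer(TrackedBuffer& tb, GLsizeiptr size) {
        if (!StillInUse(tb))
            return tb.buffer;
        GLuint fresh;
        glGenBuffers(1, &fresh);
        glBindBuffer(GL_COPY_WRITE_BUFFER, fresh);
        glBufferData(GL_COPY_WRITE_BUFFER, size, nullptr, GL_DYNAMIC_DRAW);
        return fresh;
    }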

In short: do it yourself. glNext is almost certainly not going to require implementations to provide these kinds of strategies, especially if synchronization is completely explicit.

Not only is this lower level, it allows the user to determine the right “invalidation” strategy for their particular use case. glNext ought to give you the tools needed to implement it yourself, without imposing it upon you. It would also make drivers a lot less complex, while simultaneously making their performance more uniform across drivers (you don’t have to worry about whether one driver does invalidation better).

That’s about as low-level as we’re going to get.

Oh sure, we'll still have mapping, and likely persistent/coherent mapping and such. But I highly doubt we'll have mapping of texture storage or any form of direct GPU pointer manipulation.

Error checking

When it comes to error checking, there are easy kinds of errors to check, and there are hard kinds of errors. Easy errors include (assuming immutable object creation):

  • Invalid parameters to object descriptor field setting functions (ie: calling glSamplerParameter with invalid parameter values).
  • When creating objects, invalid/inconsistent fields to the object creation function (ie: using the wrong object type, not specifying all of the parameters, two sets of parameters that are inconsistent).
  • Invalid parameters when uploading/downloading data to/from buffer/texture objects.
  • Compile errors for programs.

These are effectively problems that the API would need to catch anyway. Basic sanity checking for parameters is not optional these days, and program compilation failure has to be reported somehow. And since most of these functions are not used in performance-critical code (at least relative to the cost of the function's real work), I would expect them to be error-checked no matter what.

The harder part is the things that are used in performance-critical code: the main functions for building command buffers and submitting them to the GPU.

At the same time, being able to debug errors is absolutely critical. Therefore, I would predict that glNext handles errors in a more explicit way than OpenGL.

First, I would expect all of the easy cases to be error-checked always. I don't know which style of error reporting would work best, but I know that the D3D style is terrible.

Second, there will be a debug setting chosen at GPU interface creation. When in debug mode, the system will make a concerted effort to detect improper use of the API. For example, if a program has an active UBO binding point but no UBO associated with it, that's a rendering-time error. It will be caught in debug mode (preventing the addition of the rendering command), but not in non-debug interfaces.

Note that this error checking happens when building a command buffer, not when transmitting it.

However, it is sometimes exceptionally useful to be able to check for errors in non-debug mode. Therefore, there should be a validation interface to ask, “given the current command buffer state, is it OK to render, and if not, why not?”
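
Put together, I'd imagine something like this, with gnx* names invented for illustration:

    // Debug vs. non-debug is decided when the GPU interface is created.
    gnxDevice dev = gnxCreateDevice(gpuId, GNX_CREATE_DEBUG);

    // In debug mode, building this command would fail loudly if, say, an
    // active UBO binding point has no buffer attached.
    gnxCmdDraw(cmd, 0, 36);

    // In non-debug mode you can still ask explicitly before committing:
    const char* reason = nullptr;
    if (!gnxValidateDraw(cmd, &reason)) {
        // "given the current command buffer state, why can't I render?"
    }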

Minimal state programs

Right now in OpenGL, program objects have a lot of state in them. Namely, uniforms.

I see no reason for glNext to have honest-to-God uniform state values. Programs should only have the following interface state:

  • Local memory storage buffers
  • Large memory storage buffers
  • Textures (I know, I know. See the next two items)

“Local memory” represents transferring a range of data from a buffer object into shader-local memory before executing that shader (aka: UBOs). “Large memory” represents the shader accessing buffer object storage, and can be read/write (aka: SSBOs). Apple Metal effectively does this (though it lacks SSBOs).
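
For reference, this is how those two kinds of storage get attached in current OpenGL; the binding indices are just examples, and uboBuffer/ssboBuffer are assumed to be already-created buffer objects.

    // "Local memory": a range of a buffer object exposed as a UBO.
    glBindBufferRange(GL_UNIFORM_BUFFER, 0, uboBuffer, 0, 256);

    // "Large memory": the shader reads/writes buffer storage directly (SSBO).
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, ssboBuffer);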

Also, I suspect that we won’t see shader subroutines. I’m sure you’ll be disappointed.

Cross-hardware unification

Perhaps not as unified as you might think.

AMD’s Mantle, as I understand it, is only supported on their “GCN”-based hardware. Apple Metal is focused specifically on Apple’s mobile GPUs. Both of these approaches limit how flexible the API needs to be. They’re each designed for a particularly narrow stretch of hardware.

glNext (and D3D12) don’t really have that luxury. glNext in particular is intended to support a fairly broad range of hardware, across both mobile and desktop platforms. That’s going to be hard, because despite advances in mobile GPUs, most modern mobile GPUs are still inferior to modern desktop GPUs.

OpenGL has generally used extensions for dealing with this kind of thing. The 4.x line of OpenGL is an example: core 4.x can be implemented on any DX11-class GPU, while features common to more advanced DX11-class GPUs are exposed via extensions.

However, I rather suspect that extensions in glNext will be… highly discouraged if they’re even allowed at all. Instead, I would expect a more formalized mechanism that represents the separation of concerns.

Basically, I expect there to be formal profiles. One of the big problems in OpenGL with extensions is that they’re always on (except for GLSL). You can’t force an implementation to turn off its support for some extra behavior.

So instead of using extensions for this, I would expect to see the use of profiles, each of which represents a particular level of greater functionality. I’ll even predict what profiles are available:

  • base: Essentially, ES 3.0 functionality.
  • mobile: ES 3.1. So no tessellation or GS, but compute shaders, image load/store and SSBOs.
  • desktop: DX11-class.
  • advance: post-DX11 features, available currently as 4.5 extensions.

This works because each profile can be defined purely in terms of functionality additions to the base profile, rather than modifying or removing APIs. A profile might add shader stages to a pipeline, but it won’t take away any base functionality. And the general structure of the API will be consistent; how you use objects at one profile level will not be altered at a different one. You might add new functions, but you wouldn’t make existing APIs obsolete.
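
In use, I'd expect something along these lines (hypothetical gnx* names): you ask for a profile up front, and nothing outside it is silently available the way extensions are.

    gnxDevice dev = gnxCreateDevice(gpuId, GNX_PROFILE_MOBILE);

    // Tessellation sits above the mobile profile, so the query just says no;
    // there's no maybe-present extension to sniff for.
    if (!gnxSupportsStage(dev, GNX_STAGE_TESS_CONTROL)) {
        // take the non-tessellated path
    }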

But there is one flaw in this plan:

Bindless textures

The problem here is that bindless texture completely obsoletes the old way of handling textures (and images). Pre-bindless, programs have specific, named slots that textures are associated with (indirectly via the context). In such an API, you need functions for querying slots in programs and functions for associating textures/images with them.

In a bindless API, you need neither; you only need some way to shove a 64-bit integer at the shader. Therefore, if you were writing an API that only needed to serve modern desktop hardware, you would have no such APIs.
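
This is essentially what ARB_bindless_texture already looks like on desktop GL: no slots, just a 64-bit handle that must be made resident and then handed to the shader. Here tex, program, and samplerLocation are assumed to already exist.

    GLuint64 handle = glGetTextureHandleARB(tex);
    glMakeTextureHandleResidentARB(handle);

    // The shader-side uniform is a sampler backed by that handle.
    glProgramUniformHandleui64ARB(program, samplerLocation, handle);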

But if glNext is going to serve modern mobile hardware, it can’t function in a bindless-only way. D3D 12 has the luxury of saying “you must be this tall to ride this ride.” glNext is supposed to be inclusive. Which means that some sort of API is going to have to be made superfluous in the profiles that support bindless.

The way to make the best of a bad situation is to minimize the API damage. We recognize that programs need state of various kinds. Even programs that rely on bindless textures will need UBOs and/or SSBOs. So there needs to be a way to query which UBO/SSBO slots are available, and an API to attach buffers to those slots.

So the API to non-bindlessly connect textures/images to the system will use the same interface. Textures and images are simply another kind of “slot” that you can stick objects in. So there would be a single API function for setting resource objects (textures&buffers), and it takes a particular slot type to use.

So for non-bindless-texture profiles, you would set UBO, SSBO, and texture/image slots. For profiles with bindless textures, you can still use slot-based textures (for backwards compatibility), but you can also use the bindless path. You still need that API function for setting UBO and SSBO resources either way. So it's not an entire API function that has been rendered moot; just one use of it.
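
So the whole thing could collapse to one entry point, something like this sketch with invented names:

    // One function, parameterized by slot type; on bindless profiles the
    // texture/image slot types simply go unused.
    gnxSetResource(cmd, GNX_SLOT_UBO,     0, uniformBuffer);
    gnxSetResource(cmd, GNX_SLOT_SSBO,    1, storageBuffer);
    gnxSetResource(cmd, GNX_SLOT_TEXTURE, 0, diffuseTextureHandle);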

The other thing is that the API should behave in a “bindless” way towards texture/image use at all times. Namely, you should not be putting texture objects into those slots; you’re putting bindless texture handles there. And things won’t work if that handle is not resident (assuming glNext’s bindless needs such a limitation).

Intermediate shader representation

When I initially started writing up this list, on this particular subject, I wrote: “80% chance of trainwreck, with a 50% chance of epic fail.”

My biggest issue with intermediate shader representations was mainly with semantics and optimization. Compilers can, and usually do, use higher-level semantics to make effective optimization choices. But ARB-style assembly languages lose those semantics. They have no data structures, functions, or similar constructs, which makes optimization trickier. Even modern assembly-level optimizers make some assumptions based on how the assembly was generated, and a shader intermediate representation will not have that luxury.

That was before I actually looked at OpenCL's SPIR and the LLVM IR on which it is built. LLVM IR in particular is not really an assembly language, at least not with the assumptions that usually entails. It's really just C, only without those pesky constructs that make C reasonable for humans to use.

You still have structures and you still have functions in LLVM. You don’t really work at the level of registers, and the variables you declare are typed. You don’t have to deal with the specifics of argument passing and so forth. What LLVM doesn’t have that C does are convenience/sanity features like looping constructs, human-readable expressions, and so forth. It’s basically a more rigidly specified (in terms of the sizes of data types), easier to parse form of C.

Now, SPIR is not really a language or a specification for a language. It’s mainly defined in terms of a process. SPIR is the expected means for translating OpenCL’s language into LLVM’s intermediate representation. Or rather, a subset of that, with a few minor additions to the language.

I would hope that glNext's IR is not just a minor variation of LLVM. It could borrow LLVM's look and feel, but I don't think Khronos can take the SPIR route with glNext. Part of the reason comes down to why SPIR is defined as a process: by defining it in terms of how OpenCL C is translated into LLVM IR, each driver's IR compiler knows exactly what a matrix multiply looks like in LLVM. That way, it knows to look for specific sequences of LLVM opcodes and convert them into, for example, a real matrix-multiply opcode.

That only works well in a world where people get their LLVM IR from an OpenCL C compiler that generates it in accord with SPIR. That’s possible in the OpenCL world, where there is exactly one such language. But in the shader world, there’s GLSL and HLSL, along with a myriad of in-house languages (or even stitching shaders together from other short IR fragments) that could be supported. People may even write directly in glIR. The details of how to translate them effectively will differ, and suddenly IR compilers have to figure out what a matrix multiply looks like.

I expect to see a language that is similar to LLVM in structure, but retains more of the common shading language constructs (explicit vector/matrix types and operations, shader-stage inputs/outputs as explicit declarations, etc). I would really hate for them to have to say, “declare 3-element vectors using this sequence of declarations. Issue a matrix multiply using exactly this sequence of opcodes.”

Also, retaining shading language constructs makes useful introspection easier. I know that’s kind of a high-level thing, but I would really hope that glNext still lets us query stuff from the shader, like input and output indices, buffer locations, and so forth.

Incidentals

I would expect that glNext would cut down on some of the needless texture types. For example, I was rather surprised that Apple Metal retains a distinction between array and non-array texture types. That seems rather pointless when all your hardware supports array types for the non-array base types. Sure, your shaders would have to explicitly specify array layer 0 for 1-length arrays. But it would simplify a lot if everything just worked this way. And the fact that Metal allows you to create array views of non-array types proves that the distinction is entirely unneeded.
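
Current GL already lets you allocate the degenerate case, which is part of why the unification seems so cheap. For example, with real GL calls:

    // A "2D texture" expressed as a one-layer 2D array texture.
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    glTexStorage3D(GL_TEXTURE_2D_ARRAY, 1, GL_RGBA8, 256, 256, /*layers*/ 1);
    // The shader samples it as a sampler2DArray, always at layer 0.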

Multisample is another matter, since that probably has explicit requirements that can’t easily be overcome. A multisample texture with 1 sample per pixel may well use different internal allocation and access patterns than a non-multisample texture.

I hope you’re not attached to being able to force the implementation to convert your pixel data for you. Because I would be absolutely shocked if glNext let you do that.

Of course, there’s a problem here. Mobile and desktop platforms differ in many cases as to what the right pixel format is. So there must be some API for querying how pixel data should appear. OpenGL already has that with internalformat_query; it just needs to be made binding.
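
OpenGL's existing query (ARB_internalformat_query2) already covers this; making it binding would just mean the driver has to honor its own answer.

    // Ask the implementation what upload format/type it prefers for this
    // internal format.
    GLint preferredFormat = 0, preferredType = 0;
    glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8,
                          GL_TEXTURE_IMAGE_FORMAT, 1, &preferredFormat);
    glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8,
                          GL_TEXTURE_IMAGE_TYPE, 1, &preferredType);
    // Supply pixel data in exactly that (format, type) pair and no conversion
    // should be needed behind your back.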

Those are some well thought out and argued predictions, I just have a few comments.

Vertex format information is immutable in an MTLRenderPipelineState. There are only a few states set directly, such as blend colour, stencil reference value, cull mode, scissor rect, and viewport.

“Local memory” represents transferring a range of data from a buffer object into shader-local memory before executing that shader (aka: UBOs). “Large memory” represents the shader accessing buffer object storage, and can be read/write (aka: SSBOs). Apple Metal effectively does this (though it lacks SSBOs).

I’m not sure in what way Metal “lacks SSBOs”. Do you mean that it has no such distinction on the API side (while there is “constant” and “device” on the shader side)?

  • base: Essentially, ES 3.0 functionality.
  • mobile: ES 3.1. So no tessellation or GS, but compute shaders, image load/store and SSBOs.
  • desktop: DX11-class.
  • advance: post-DX11 features, available currently as 4.5 extensions.

The base profile doesn’t seem necessary, given the time required to roll out glNext on mobile and that most GPUs initially released with ES3.0 drivers are actually ES3.1 capable.

I would expect that glNext would cut down on some of the needless texture types. For example, I was rather surprised that Apple Metal retains a distinction between array and non-array texture types. That seems rather pointless when all your hardware supports array types for the non-array base types. Sure, your shaders would have to explicitly specify array layer 0 for 1-length arrays. But it would simplify a lot if everything just worked this way. And the fact that Metal allows you to create array views of non-array types proves that the distinction is entirely unneeded.

Metal lets you create views of uncompressed textures that differ in format only. If there are other ways to create views, I haven’t seen them yet.
But you’re probably right that Metal could have dropped this distinction (with 0 as default parameter in the shader that can be optimised away), assuming the memory layout of a non-array texture matches that of an array slice.

Multisample is another matter, since that probably has explicit requirements that can’t easily be overcome. A multisample texture with 1 sample per pixel may well use different internal allocation and access patterns than a non-multisample texture.

What would be the purpose of a single-sample multisample texture?

I’m not sure in what way Metal “lacks SSBOs”. Do you mean that it has no such distinction on the API side (while there is “constant” and “device” on the shader side)?

I’d never really looked at the shading language part of Metal; I assumed that if the distinction existed, it would be on the API side. But there it is, purely done in shaders. That’s certainly… an odd way to represent it.

But I suppose it simplifies the API a bit. The user of the API doesn’t have to care whether the shader accesses the buffer UBO-style or SSBO-style.

assuming the memory layout of a non-array texture matches that of an array slice.

My point was that it must match, because if it didn’t, you couldn’t create a view of a non-array texture that was equivalent to a 1-element array texture. And vice-versa.

What would be the purpose of a single-sample multisample texture?

You give the user the choice of how many samples to use. And you don’t want to have to decide whether to use a completely different texture type (along with other requisite shader changes) just because they picked 1.

It’s not the most useful thing in the world, but so long as the hardware supports the idea, I see no reason to forbid it.

And my point was that you can’t. :wink:
AFAIK the only way to create texture views in Metal changes the pixelFormat, not the textureType.

Need to backtrack on immutable state objects. Just say no.

Totally! It would be easier to just forget about regular textures the moment a multisample codepath has to be supported. After all, most of the time a multisample implementation would work perfectly if GL_TEXTURE_nD were just GL_TEXTURE_nD_MULTISAMPLE with a sample count of 1. Right now, supporting both turns into crazy binding, shader, and shader-function management. Or is there a better way?