Work groups with priorities

thomas.d · May 22, 2013, 7:31am

Problem:
Currently, you have no way of telling/hinting the GL (or the driver) which of your commands are important (real time critical) and which ones are not. You want to keep the command queue full, but at the same time you do not want to compete for resources (GPU time, memory, bandwidth) with time critical tasks.

Examples:
1. You are running a complicated physical simulation (in an old ping-pong pixel shader, in a compute shader, or as an OpenCL kernel) which is calculated at 10Hz. The renderer runs at a typical 60Hz, limited by vsync, and interpolates the simulation results over several frames. The simulation takes considerable time (say, 5-10ms), but it suffices if the result is ready 5 to 6 frames (= 83 to 100ms) in the future.
2. You are calculating a histogram of the previous frame to do tonemapping. The calculation could start as soon as the frame is available as a texture (at the same time as tonemapping/displaying it) and could execute while the GPU is not doing anything (such as during vsync), but it should not compete with tonemapping/blitting the previous frame or delay swapping buffers.
3. You are doing a non-trivial amount of render-to-texture (say, to display a “page of text” out of a book in an e-reader, or in a game). The frame rate should be constant, as it would be disturbing to see all other animations “freeze” for a moment when one opens a book. On the other hand, nobody would notice if 2-3 frames passed before the book is opened or a page is flipped – as long as everything stays “smooth”.
4. Your worker thread has just finished loading a texture from disk into a mapped buffer object. Now you would like to use it (next frame). So you unmap the buffer and call glTexImage to allocate storage and define the texture’s contents. You want to do this early to give the driver a chance to asynchronously upload the data, but you do not want to compete with the rendering (frame time budget!) for PCIe bandwidth or GPU memory. You certainly do not want to stall for half a millisecond while the GL or driver is doing memory allocator work (and maybe even kick a texture that is still needed later this frame!) to make room for the new texture.

You have no way of telling the GL that you don’t need the physics immediately. You have no way of telling the GL to start calculating the histogram but not to compete with the rendering – or worse, wait for histogram calculation to complete before swapping buffers. You have no way of telling the GL to allocate and upload the texture whenever there is time (i.e. generally as soon as possible), but not at the cost of something that must finish this frame.

Yes, swapping buffers likely won’t be delayed by “unimportant” tasks since most implementations render 2-3 frames ahead anyway, so there is no clean-cut end of frame. But still, you cannot be certain of this implementation detail, and you do not even have a way of hinting as to what’s intended.
The driver must assume that anything you pass to GL (or… CL) is equally important, and that anything you submit should be ready as soon as possible. At the same time, you want to push as many tasks to the GL as fast as you can, so as to prevent the GPU from going idle.

With some luck, the driver is smart enough (or lucky enough) to get it just right, but ideally you would be able to hint it, so it can do a much better job.

Proposal:
Commands submitted to the GL are grouped into work groups (name it differently if you like). There is a single default workgroup with “normal” priority to accommodate programs that are not workgroup-aware.
A subset of commands can be enclosed in a different workgroup with a different (lower) priority using a begin/end command pair (say, glBeginGroup(GLenum priority); and glEndGroup();). Implementations that are unwilling to implement the feature simply treat the begin/end function calls as no-ops.

(As a more complicated alternative, one could consider “workgroup objects” much like query objects or buffer objects. This would allow querying the workgroup’s status and/or synchronizing with its completion, and one might change a workgroup’s priority at a later time, or even cancel the entire workgroup. However, the already present synchronization mechanisms in OpenGL are actually entirely sufficient, and it’s questionable whether changing priorities and cancelling workgroups are really advantageous features. They might add more complexity than they are worth.)

An elaborate system of priorities (with dozens/hundreds of priorities as offered by operating systems) is needlessly complex and has no real advantage – a simple system with fewer than half a dozen possible levels, maybe only 2 or 3, would be more than enough.

For example:
GL_PRIORITY_NORMAL --> default, want to see this ready as soon as possible
GL_PRIORITY_END_OF_FRAME --> not immediately important, best start when done with this frame (or when main task is stalled)
GL_PRIORITY_NEXT_FRAME --> don’t care if this is ready now or the next frame (or in 2 frames), but still want result in “finite time”
GL_PRIORITY_LOW --> rather than going idle, process this task – otherwise do a higher priority one
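
To make the begin/end idea concrete, here is a minimal mock in C. It does not call OpenGL at all: `begin_group`, `end_group`, `submit` and the `PRIORITY_*` names are hypothetical stand-ins for the proposed glBeginGroup/glEndGroup and the levels listed above, and the "no nesting" rule is an assumption of this sketch, not something the proposal specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical priority levels, named after the list in the proposal. */
typedef enum {
    PRIORITY_NORMAL,
    PRIORITY_END_OF_FRAME,
    PRIORITY_NEXT_FRAME,
    PRIORITY_LOW
} Priority;

#define MAX_COMMANDS 64

typedef struct {
    const char *name;
    Priority priority; /* tag the driver would use for scheduling */
} Command;

typedef struct {
    Command commands[MAX_COMMANDS];
    size_t count;
    Priority current; /* priority applied to newly submitted commands */
} CommandStream;

void stream_init(CommandStream *s) {
    s->count = 0;
    s->current = PRIORITY_NORMAL; /* the single default work group */
}

/* Stand-in for the proposed glBeginGroup(priority). */
void begin_group(CommandStream *s, Priority p) {
    assert(s->current == PRIORITY_NORMAL); /* assumption: groups do not nest */
    s->current = p;
}

/* Stand-in for the proposed glEndGroup(): back to the default group. */
void end_group(CommandStream *s) {
    s->current = PRIORITY_NORMAL;
}

/* Records a command together with the priority of the enclosing group. */
void submit(CommandStream *s, const char *name) {
    assert(s->count < MAX_COMMANDS);
    s->commands[s->count].name = name;
    s->commands[s->count].priority = s->current;
    s->count++;
}
```

A program that never calls `begin_group` sees every command tagged `PRIORITY_NORMAL`, which matches the "default workgroup" behaviour described above.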

Ideally, there’d be interop between GL and CL for this, too.

thomas.d · May 24, 2013, 3:10am

An alternative (probably easier to implement – but not what I would prefer, since it would require shared contexts, extra synchronization, more resources) might be being able to mark an entire context as “lower priority”.

That way, one could for example do lengthy tasks such as complicated simulations or render-to-texture in the lower priority context without competing for the GPU with the time-critical render thread. While the context belonging to the render thread is stalled or waiting on vsync, the lower priority context takes over. Eventually, the result will be available, and can be used.

As a side thought: Having lengthy tasks in a separate work queue with lower priority might be interesting for other things as well (think GPGPU). As long as a normal-priority lightweight task keeps performing little or no work and swapping buffers at 60fps, WDDM will not be inclined to kill the display driver because it is “not responding”, if a compute task takes, say, 45 seconds.

grumbler · May 26, 2013, 1:21am

Interesting. I would call that multithreading as there is a sense of concurrency. No idea how one would implement such a thing - looks like a synchronization can of worms to me :/. How does it report back when stuff got done for use in primary (or some other) thread?

I do not like this. Just having a glSetThread(thread) would be preferable. “thread” would be an object that contains the priority. “0” => default thread.

no need to specify what commands are eligible.
no need to repeatedly restate priority whenever something is added to thread.
no need to know whether some “workgroup” is open.

thomas.d · May 31, 2013, 2:07am

It’s not about threading, though. You can already do threading just fine using shared contexts, just not in a “healthy” way that doesn’t negatively affect you.

Changing thread priorities wouldn’t be any good. It is not the thread that submits commands that should run at lower priority. This would likely cause the pipeline to become empty, resulting in inferior performance. It is not the thread that processes items on the work queue either that should run at lower priority, and there shouldn’t normally be more than one such thread. That would make the driver a lot more complicated (and bring threading issues) and isn’t even guaranteed to work any better. Playing with thread priorities in a driver most likely will cause havoc. What if the lower priority thread is starved by a high-priority application thread?

What should happen is that the driver thread picks up commands from the work queue as usual, but anything it encounters inside a “low priority” section, it just stores on its “not immediately important” list and continues processing “normal priority” tasks. After all, the contract says “this isn’t needed immediately right now”. Eventually, there will be a stall or the queue will be empty, so the driver will process some of the lower priority tasks in the meantime.

Now what happens if you need a result from a low priority calculation? Say you are rendering your physics for the next frame (as “low priority” so it doesn’t disturb your present frame), and then during the next frame bind the buffer with the result to read from it? Problem? No.
The same happens as always. Data is not ready, so the command blocks. Now that the “normal priority” queue is blocked, the GPU is free and the lower queue is worked off – your half-finished physics calculations are finalized. As the result is ready, the “normal priority” queue is unblocked, and since it’s higher priority, it takes over the GPU again (you’ve maybe already queued the calculations for another 2-3 frames in the future, but they won’t impact you).
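
The behaviour described here can be sketched as a tiny scheduler simulation in C. Everything in it (`Cmd`, `schedule`, `NO_DEP`, the FIFO order of the deferred list) is a hypothetical model for illustration; it says nothing about how a real driver or GPU would be built.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CMDS 32
#define NO_DEP ((size_t)-1)

typedef struct {
    int low_priority;  /* 1 if submitted inside a low-priority section */
    size_t depends_on; /* index of an earlier command whose result is read, or NO_DEP */
} Cmd;

/* Writes the execution order (indices into cmds) to order[] and returns its length.
   Normal commands run in submission order; low-priority ones are parked and only
   run when a normal command blocks on one of them or when the queue is empty. */
size_t schedule(const Cmd *cmds, size_t n, size_t *order) {
    size_t deferred[MAX_CMDS]; /* the "not immediately important" list (FIFO) */
    size_t head = 0, tail = 0;
    int done[MAX_CMDS] = {0};
    size_t out = 0;

    assert(n <= MAX_CMDS);
    for (size_t i = 0; i < n; i++) {
        if (cmds[i].low_priority) {
            deferred[tail++] = i; /* park it, keep going */
            continue;
        }
        size_t dep = cmds[i].depends_on;
        if (dep != NO_DEP) {
            assert(dep < i);
            /* The normal queue is blocked: work off deferred tasks until
               the needed result exists. */
            while (!done[dep] && head < tail) {
                size_t d = deferred[head++];
                done[d] = 1;
                order[out++] = d;
            }
            assert(done[dep]);
        }
        done[i] = 1;
        order[out++] = i;
    }
    /* Queue is empty: use the idle time for whatever is still parked. */
    while (head < tail) {
        size_t d = deferred[head++];
        done[d] = 1;
        order[out++] = d;
    }
    return out;
}
```

With the submission sequence draw, physics (low), draw, read-physics, other (low), draw, this model executes the commands in the order 0, 2, 1, 3, 5, 4: the physics task runs only when the read blocks on it, and the unrelated low-priority task runs last.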

The GPU never goes idle, but also you never have less than 100% power available for your main task when it matters.

Now of course there exists the theoretical issue that a programmer might submit totally nonsensical sequences of commands/priorities. You could bind a texture with low priority (even for no good reason, maybe just because it’s possible), and immediately draw from it with normal priority. What would probably happen is that the GL would draw without a texture bound since the low priority task isn’t executed (resulting in a black screen), and later bind a texture, which is good for nothing.
But honestly, this theoretical issue is not really a problem. If someone tries to play stupid, that’s just bad luck for them. You cannot and should not expect from the GL that submitting nonsense produces something useful. It doesn’t do that as it is, either.
Also, this theoretical issue could be avoided altogether, even in theory, with the “per context” priority approach and shared contexts.

Alfonse_Reinheart · May 31, 2013, 6:25am

It’s not about threading, though.

But it is. You want one set of GPU commands to automatically interrupt another set of GPU commands if it has a higher priority. That’s the very definition of preemptive multitasking. They may not be CPU threads, but they are still separate paths of execution which all operate in the same memory space on the device.

That’s a thread.

Eventually, there will be a stall or the queue will be empty, so the driver will process some of the lower priority tasks in the meantime.

Why will there be stalls or an empty queue? Generally speaking, that is considered a bad thing in a high-performance graphics rendering system. You should be doing everything in your power to ensure that this doesn’t happen, that the GPU always has stuff to be doing.

Which means, if you do your job right, those low priority tasks will never execute. Well, not until you force them to via priority inversion. At that point, you may as well have done the priority inversion manually: execute the “unimportant” commands when you need their results.

You’ve gained nothing in this case.

You could bind a texture with low priority (even for no good reason, maybe just because it’s possible), and immediately draw from it with normal priority. What would probably happen is that the GL would draw without a texture bound since the low priority task isn’t executed (resulting in a black screen), and later bind a texture, which is good for nothing.

That makes no sense. Binding a texture, or indeed, setting any actual state, is not a “command”. It may translate into some GPU operation, but in all likelihood, it’s purely a CPU-side construct.

Furthermore, this goes against exactly what you said earlier about priority inversion: a later “command” which is dependent on the execution of an earlier “command” will still have to wait on the execution of the earlier “command”. If you consider binding a texture (or any other state change) to be just another “command”, then later operations that depend on the execution of that command must wait. It’s no different from using a buffer that some low-priority command renders to.

This is part of the reason why a separate context for command priorities makes far more sense. This way, it’s clear exactly what depends on what. It’s the state of objects that can cause priority inversions, not random CPU state like object bindings and such. OpenGL already has a well-defined model for when changes to shared state become visible to other contexts.

But honestly, this theoretical issue is not really a problem. If someone tries to play stupid, that’s just bad luck for them. You cannot and should not expect from the GL that submitting nonsense produces something useful. It doesn’t do that as it is, either.

First, there’s nothing nonsensical about that. You bind a texture, then you render with it.

Second, it’s a poor API that encourages abuse of itself. And your suggested API makes it far easier to screw it up than to do it right.

aqnuep · May 31, 2013, 8:15am

The only way I can imagine such a thing happening is in the form of assigning priorities to contexts, then doing multithreading + multicontext rendering.

Suppose, for example, you have two groups one after the other, and the first has lower priority than the second. What state will the second run in? The low priority task may run first, in which case the state it sets will affect the second task. However, it may just as well happen the other way round. This would practically make all GL state that could potentially be affected by any of the tasks undefined, which makes the whole feature useless.

thomas.d · May 31, 2013, 8:50am

That’s the very definition of preemptive multitasking. They may not be CPU threads, but they are still separate paths of execution which all operate in the same memory space on the device.

In a very general way, one could probably see it as “threads”. Though I would not like to think of it as “preemptive multitasking”. Rather, it’s some form of cooperative multitasking. The “main thread” (default) will only yield when it blocks/stalls (e.g. waiting on a result) or when it has “nothing to do” (e.g. the end of the frame). The lower priority “thread” will use what GPU time is left over, and when it’s available. It may stop “pulling” new compute tasks from the queue if a condition makes the “main thread” ready, but it certainly wouldn’t interrupt a running workgroup (that would be insanely expensive!).
Since there’s a well-defined point in time (end of frame) when the “main thread” chooses to yield (by calling SwapBuffers, or explicitly if you want), there’s not much risk of a background task never running.

That makes no sense. Binding a texture, or indeed, setting any actual state, is not a “command”. It may translate into some GPU operation, but in all likelihood, it’s purely a CPU-side construct.

Gee, I knew this would come up…
What does it matter for the example whether binding a texture to a texture unit (note that I’m not talking about glBindTexture) is or is not a GPU operation (besides, it necessarily is, only not necessarily immediately)? Regardless of whether it adjusts state or runs a compute task, it is still a “command”. Even a command that merely adjusts a selector (assuming that OpenGL will still not be direct state next revision) is a “command”. They all go into one queue.
And indeed, the point of my example was that it makes no sense to do for example such a thing as putting a command on which another one depends into another priority block, because it will very obviously not work in a meaningful way. But again, if you do deliberately stupid stuff, then undefined results will come out. Unix has very successfully worked by the “type shit in, and shit comes out” principle for 45 years. It’s not a problem.

Now you’re inevitably going to object that this cannot work anyway because for example selectors may be changed and afterwards you have no way of knowing what was what. There are at least three ways to address this. Either, since by definition, lower priority tasks are not related to the main queue (not directly, anyway), every subgroup could start with default values. Or, the present state could be “snapshotted”, for example using a copy-on-write mechanism. Developers are careful not to do too many state changes anyway, and you won’t have a hundred different priority blocks in a frame either, so the overhead should be tolerable. Or, just screw selectors altogether and go DSA, which a lot of people would welcome anyway.
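
The "snapshot" option can be illustrated with a small mock in C. This is only a sketch under assumptions: `StateMock`, `LowPriorityGroup`, `open_group` and `bind_texture` are made-up names, the state is reduced to one selector plus a few texture bindings, and the copy is a plain eager copy rather than the copy-on-write mechanism mentioned above.

```c
#include <assert.h>

#define NUM_UNITS 4

/* A drastically reduced stand-in for context state. */
typedef struct {
    int active_unit;              /* the selector */
    int bound_texture[NUM_UNITS]; /* one binding per texture unit */
} StateMock;

typedef struct {
    StateMock state; /* private copy taken when the group was opened */
} LowPriorityGroup;

/* Opening a group snapshots the current state by value. */
LowPriorityGroup open_group(const StateMock *current) {
    LowPriorityGroup g;
    g.state = *current;
    return g;
}

/* Binds to whichever unit the given state currently selects. */
void bind_texture(StateMock *s, int tex) {
    assert(s->active_unit >= 0 && s->active_unit < NUM_UNITS);
    s->bound_texture[s->active_unit] = tex;
}
```

Because the group works on its own copy, a bind inside the group does not leak into the main state, and later binds in the main state do not change what the deferred commands see, regardless of when they eventually execute.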

But yes, the “per context priority” solution would conceptually be easier to implement, that’s sure. And even this would be a big win already (even though not my preferred solution).

Why will there be stalls or an empty queue? Generally speaking, that is considered a bad thing in a high-performance graphics rendering system. You should be doing everything in your power to ensure that this doesn’t happen, that the GPU always has stuff to be doing.

Which means, if you do your job right,

The problem is, as it stands, OpenGL does not allow you to do your job right. Not in a reliable way, anyway.

Certainly, you can ensure that the pipeline never stalls and never gets empty. Nothing easier than that. But then you’ll inevitably have different tasks competing for the GPU, and not all of them must be ready at the same time. With some luck, there’s enough horsepower, so nobody notices. With some luck, some driver hack kicks in (application-profile, prerendering, EXT_swap_control_tear,…), and maybe not.

But doing it properly in a somewhat predictable, reliable way is an entirely different story.

For example, you have the choice of submitting a physics simulation for the next frame at the beginning of a frame, somewhere in the middle, at the end before SwapBuffers, or at the end after SwapBuffers. Or, you can do it in a second, shared context (or in CL). These are all your options. Which do you choose?

The first two will result in the simulation competing with drawing the current frame, and with some “luck” will cause you to just miss the frame time by 0.1ms. With vsync, the late frame then has to wait for the next refresh, causing your frame rate to drop from 60 to 30 fps. Now you wish you had submitted it later.
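
The drop from 60 to 30 is plain vsync arithmetic, which the following C sketch works through. It assumes strict vsync with double buffering and a constant per-frame GPU time; `vsynced_fps` is a hypothetical helper name, and real drivers that render ahead or tear (e.g. EXT_swap_control_tear) will behave differently.

```c
#include <assert.h>

/* Effective frame rate when every frame takes frame_ms of GPU time and a
   finished frame can only be shown at the next refresh of the display. */
double vsynced_fps(double refresh_hz, double frame_ms) {
    assert(refresh_hz > 0.0 && frame_ms > 0.0);
    double interval_ms = 1000.0 / refresh_hz;   /* 16.67ms at 60Hz */
    long intervals = (long)(frame_ms / interval_ms);
    if ((double)intervals * interval_ms < frame_ms) {
        intervals++;                            /* round up to whole refreshes */
    }
    return refresh_hz / (double)intervals;
}
```

Under these assumptions a 16.0ms frame is shown at 60 fps, while a 16.77ms frame (0.1ms over budget) needs two refresh intervals and is shown at 30 fps.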

Submitting just before swapping buffers will reduce this competition, but is likely to cause the pipeline to run empty, and it may still cause you to miss the frame time. The driver has no way of knowing (other than by application-specific driver hacks, or by prerendering 3 frames, none of which you can rely on) that you actually want to swap buffers immediately – it must assume that whatever you submit still belongs to the same frame.

Submitting the physics after SwapBuffers, on the other hand, may work or may not work as intended. Again, you have no way of knowing, and it’s nothing you could rely on. The driver might let you render ahead 1-3 frames or not. You might be blocked inside SwapBuffers for 3-4ms during which the GPU will be idle. Now you wish you had submitted it earlier (because now, running the computations would be “free”).

Using a shared context has none of the waiting-on-SwapBuffers problems, but it competes with the other context all the time, much like submitting the computation early or in the middle. The driver cannot know (without an application-profile hack) that one is more important than the other because it will exceed its frame time budget.

Alfonse_Reinheart · May 31, 2013, 1:20pm

Though I would not like to think of it as “preemptive multitasking”. Rather, it’s some form of cooperative multitasking. The “main thread” (default) will only yield when it blocks/stalls (e.g. waiting on a result) or when it has “nothing to do” (e.g. the end of the frame). The lower priority “thread” will use what GPU time is left over, and when it’s available. It may stop “pulling” new compute tasks from the queue if a condition makes the “main thread” ready, but it certainly wouldn’t interrupt a running workgroup (that would be insanely expensive!).

That’s still preemptive multitasking, because the lower priority thread’s actions must be preempted by the execution of the high priority thread. You said so yourself: an example you gave read, “it suffices if the result is ready 5 to 6 frames”. In order to make that work, your low-priority commands must be interrupted by the main thread several times. That is, when whatever “end of the frame” wait time (see below for a discussion of this fiction) finishes, the “main thread” has to take over again. This involves a full-fledged task switch.

Whether you think of it that way or not, that’s preemptive multitasking. There’s no cooperation happening there, because it is not at all clear when, during the execution of the low priority task, the high priority task can start taking over again.

Since there’s a well-defined point in time (end of frame) when the “main thread” chooses to yield (by calling SwapBuffers, or explicitly if you want), there’s not much risk of a background task never running.

You seem to have this idea that, when you call SwapBuffers, the GPU just stops or something. That later commands will have to wait to execute until some pre-determined time after this, possibly related to vsync. You can always turn vsync off, you know. And even if you can’t, that’s what multiple buffering is for: so that you don’t have to wait for the vsync.

GPUs already do this stuff for you. Really, there is no “end of frame” waiting time where GPUs are stalled. There is no “well-defined point in time” “where the ‘main thread’ chooses to yield”. GPUs don’t stall just because you call SwapBuffers.

And indeed, the point of my example was that it makes no sense to do for example such a thing as putting a command on which another one depends into another priority block, because it will very obviously not work in a meaningful way.

And my point was that the example you gave just before that specifically stated that it would work. Allow me to quote you:

You say that all OpenGL functions are commands, all commands go into a queue, and the “normal priority” queue will block if it tries to access data from the “low priority” queue that is not yet ready. Then why should calling glBindTexture in a low-priority thread then accessing it in a high-priority one behave any differently from rendering to a buffer object in a low-priority command and then accessing it in a high-priority one? They are, by your own logic, the same thing: data set by a low-priority command that is being accessed by the high-priority one, but is not yet ready.

So why does one work and the other not work?

You can’t have it both ways. You can’t have the products of low-priority tasks be accessible just like they would have been without priority, yet have glBindTexture in a low-priority task somehow not be accessible from a high-priority task. It doesn’t make sense. Either everything works under the “as if it were specified in order” rule (which is where the whole “Data is not ready, so the command blocks” comes from), or everything does not. You can’t pick and choose arbitrarily.

Or if you want to pick and choose arbitrarily, you need to explain exactly what you want to be picked and chosen.

Submitting just before swapping buffers will reduce this competition, but is likely to cause the pipeline to run empty, and it may still cause you to miss the frame time. The driver has no way of knowing (other than by application-specific driver hacks, or by prerendering 3 frames, none of which you can rely on) that you actually want to swap buffers immediately – it must assume that whatever you submit still belongs to the same frame.

I don’t understand what this means. Presumably, your “physics simulation” isn’t rendering to the framebuffer or reading from it. So the driver already knows that these commands have no relation to any of your prior rendering commands. So, assuming that there is time to complete both tasks, why exactly would this cause you to miss the frame time?

thomas.d · June 11, 2013, 6:22am

There is no “well-defined point in time”

Indeed there is. When you swap buffers, you are saying that everything before belongs to the current (or rather, last) frame, whose definition is now finished. Everything you’re telling the driver from now on is “another story”, e.g. the next frame.
Everything else is just a lot of sophistries, they don’t matter. It’s entirely irrelevant whether the driver blocks your thread, whether there is an actual pause, whether it waits for vsync or not, or whether it pre-renders another 2 frames. It just doesn’t matter. What matters is that you have a well-defined point in time that defines what must be ready whenever the driver wants to actually put something on screen.

You seem to have this idea that, when you call SwapBuffers, the GPU just stops or something. […] can turn vsync off […] multiple buffering […] GPUs don’t stall just because you call SwapBuffers.

Yes yes, very entertaining, but only more sophistries (and things I already mentioned in the original post, so no… I do not have this idea).
But as a matter of fact, you don’t know any of these and you can’t be certain about any of these. You have no idea whether your thread blocks, which unless you have already submitted some extra work necessarily means that the GPU pipeline will run empty. You just don’t know, and you do not have any way of controlling it. Maybe a driver lets you render 3 frames ahead, maybe not. If it doesn’t, you’re just lost.

The only certain way to keep the pipeline from running empty is to submit a considerable amount of work early before swapping buffers. That’s a safe thing to do. But then, you don’t know that it won’t compete with the commands that you urgently want to complete within the frame time.

Which, as I’m trying to point out, is the whole reason why a “low priority” group could be immensely helpful. With this, you could submit a group of commands early, but without interfering with (or delaying) something more time critical.

So why does one work and the other not work? You can’t have it both ways.

I’m not sure whether you’re just trying to find petty arguments against an idea that you don’t like, or whether I’m really being that unclear. Even more so as this example does not matter in any way for the proposal, it was merely a consideration of what the maximum-moronic-programmer might do. But I’ll bite anyway.

Using “something” that is not ready (maybe a query result, or a render-to-texture) is different from not having something set or defined, or having illegal state set but still accessing it. The former will block until the result is ready, OpenGL has always worked that way, and should certainly continue to do so. The latter is wrong and produces undefined behaviour. This is why one would work, and the other would not.
For example, reading from a texture unit when no texture has been bound would likely result in “why is my screen black???” behaviour. So yes, it is possible for the MMP to generate havoc, but nothing prevents you from being a jackass as it is, either. Nobody prevents you from drawing from non-existent buffer objects into non-complete framebuffers or such. Thing is, it just won’t work, but this is acceptable, since it well-deservedly doesn’t work.

So, assuming that there is time to complete both tasks, why exactly would this cause you to miss the frame time?

The assumption is fundamentally wrong. If you have a GPU that can render a scene with 3,000 frames per second, decode a video, mix some sound, and run the physics simulation, all at the same time, there is no issue.

Unluckily, compute power on real world GPUs is not infinite, and given a typical frame time of slightly over 16.6ms (or about 11.8ms and 10ms on 85Hz and 100Hz monitors), the more “typical” scenario is that the end user’s GPU (which is not necessarily the biggest available $1000 card) might be 100% busy anywhere from 10 to 16ms only rendering the current frame.
Which means that you cannot just assume that everything you submit finishes “instantly”, and you cannot afford to speculatively throw in some tasks that take another 2 or 3 milliseconds only to keep the GPU busy, and still assume that you don’t have to care because everything will be just fine. Not unless you are willing to set your minimum specs so only people with “top power gamer” rigs have a satisfying result.
Even if something “works fine” for the kick-ass GPU in your development machine, it might look quite different on the end user’s GPU which maybe has only 1/4 or 1/5 of the compute power. Whatever you do, it should still work reasonably well on the second most expensive hardware, too. Only when there is not enough power to finish everything instantly do things start to matter.

Alfonse_Reinheart · June 11, 2013, 8:02am

FYI: “Sophistry” is not a synonym for “arguments I don’t like.”

But as a matter of fact, you don’t know any of these and you can’t be certain about any of these. You have no idea whether your thread blocks, which unless you have already submitted some extra work necessarily means that the GPU pipeline will run empty. You just don’t know, and you do not have any way of controlling it. Maybe a driver lets you render 3 frames ahead, maybe not. If it doesn’t, you’re just lost.

So you’re assuming driver stupidity. Well, if driver stupidity is the assumption, why would you assume drivers would be less stupid under your model? Your model requires them to implement context switching, the ability to interrupt one series of drawing commands with another, and then return to it later to complete it. It would be much easier for them to just wait on low-priority commands and execute them when you require their contents. And if that means that your low priority commands blow your budget for that frame, too bad.

So driver stupidity can hurt your idea just as much as normal rendering. Either way, you’re relying on the driver to not be stupid and to not waste time. Unless this feature is somehow more likely to be implemented reasonably than a non-time-wasting SwapBuffers (which is unlikely, since we have non-time-wasting SwapBuffers calls right now, whereas GPU task switching would be far more complex to implement), there’s no reason to do it.

Even more so as this example does not matter in any way for the proposal, it was merely a consideration of what the maximum-moronic-programmer might do.

The example does matter, because you were using the example to explain some details about how your proposal works. If that’s not how your proposal is supposed to work, then you shouldn’t have given the example. But if it is how the proposal works, then it is part of the proposal and is therefore fair game.

Using “something” that is not ready (maybe a query result, or a render-to-texture) is different from not having something set or defined, or having illegal state set but still accessing it. The former will block until the result is ready, OpenGL has always worked that way, and should certainly continue to do so. The latter is wrong and produces undefined behaviour. This is why one would work, and the other would not.

I think you’re getting confused about what we’re discussing. The example you gave was this:

Let me pseudo-code that:


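glBeginLowPriority();                      /* pseudo-call: open the low-priority block */
glActiveTexture(GL_TEXTURE0 + i);
glBindTexture(..., texId);                 /* bind state set inside the low-priority block */
glUseProgram(progThatUsesTexture_i);
glDrawElements(...);
glEndLowPriority();                        /* pseudo-call: close the low-priority block */

glUseProgram(otherProgThatUsesTexture_i);
glDrawElements(...);                       /* normal priority: relies on the binding made above */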

You’re saying that this code will fail because the high-priority glDrawElements command will not have seen the glBindTexture call yet, and therefore will be rendering with whatever happened to be bound before those commands executed.

However, you’re saying that this code will succeed:


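glBeginLowPriority();                      /* pseudo-call: open the low-priority block */
glBindFramebuffer(..., fboThatUses_texId);
...
glDrawElements(...);                       /* low priority: renders into texId */
glEndLowPriority();                        /* pseudo-call: close the low-priority block */

glActiveTexture(GL_TEXTURE0 + i);
glBindTexture(..., texId);
glGetTexImage(...);                        /* normal priority: reads texId's contents back */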

OpenGL has a sequential execution model. This means that all commands, without exception, must execute as if they executed in the order given. The guarantee doesn’t care what kind of state a command modifies: bind state, state within an object, large data storage within an object, and so on.

In order for what you’re talking about to work, you now have to violate the sequential execution model. But that’s OK; Image Load/Store does that too. The confusing part is that you want sequential execution sometimes and not other times. You want to make an arbitrary distinction between “Using ‘something’ that is not ready” and… well, I’m not sure, because you haven’t actually specified what this execution model actually is yet. Or at least not in any consistent way.

The sequential execution model is very simple: all changes to all state are visible immediately to any command that executes later. The Image Load/Store execution model is (mostly) simple too: no changes to image contents are visible to anything unless you explicitly synchronize them. Both of these models are very self-consistent; they treat all applicable state the same way.

What you’re suggesting would draw a line between different kinds of state. The texture’s storage will follow sequential execution, but context bind state will require explicit synchronization. This is completely inconsistent with OpenGL’s object model, and needlessly introduces a third synchronization model. Plus, you never nailed down in any concrete way exactly what is sequential and what isn’t, so the idea is too nebulous to talk about. Not only that, this split synchronization model is actively dangerous.

Take the case you suggested, of binding a texture as a low-priority command and using it in a high-priority command. That low priority bind command could execute at any time. It could execute while you’re in the middle of issuing high-priority commands. It could execute immediately after a high-priority glBindTexture call, thus stomping on your context state.

Unless you’re suggesting that low-priority commands effectively have their own context state, so that none of their binding commands can ever be visible to high-priority commands. Because if that’s what you’re saying, then it’s far more reasonable to just make this priority thing a separate context (just like deferred contexts in D3D). That way, neither side can stomp on the other’s context state.

Furthermore, it would make your memory model very simple: just follow the OpenGL memory model. The GL memory model says that changes to object state in one context only become visible to another if you bind that object first. Therefore, if the high-priority context doesn’t bind objects used by the low-priority context, then no priority inversion will take place and they can execute asynchronously. Once the high-priority context binds the object, synchronization happens automatically. And thus, everything would follow the “as if” rule.

It also would make life easier for users, as they don’t have to preface their low priority commands with a bunch of basic state commands, like setting blend modes, glViewport, and so forth, then restoring all of that at the end of their command stream.
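To make the stomping hazard concrete, here is a toy model in C. It is not OpenGL: `Context`, `bind_texture`, `draw`, and the texture IDs 7 and 42 are all made up for this sketch, and the late-running low-priority bind is simply placed by hand at the worst possible moment. The first function shares one set of bind state between both priorities; the second gives each priority its own, as a separate context would.

```c
#include <assert.h>

/* Toy model only: a "context" is reduced to the texture currently bound. */
typedef struct {
    unsigned bound_texture;
} Context;

static void bind_texture(Context *ctx, unsigned tex) {
    assert(tex != 0); /* 0 means "nothing bound" in this model */
    ctx->bound_texture = tex;
}

/* A draw just reports which texture it would sample from. */
static unsigned draw(const Context *ctx) {
    return ctx->bound_texture;
}

/* Shared bind state: the deferred low-priority bind happens to execute
   between the high-priority bind and the high-priority draw. */
static unsigned shared_context_draw(void) {
    Context shared = {0};
    bind_texture(&shared, 7);  /* high-priority bind */
    bind_texture(&shared, 42); /* low-priority bind, executed late */
    return draw(&shared);      /* high-priority draw now samples 42 */
}

/* Separate bind state per priority: the low-priority bind cannot be seen. */
static unsigned separate_context_draw(void) {
    Context high = {0};
    Context low = {0};
    bind_texture(&high, 7);
    bind_texture(&low, 42);
    return draw(&high);        /* still samples 7 */
}
```

In this model `shared_context_draw()` ends up sampling the low-priority texture, while `separate_context_draw()` samples the texture the high-priority code bound itself. A real driver could of course schedule the deferred bind anywhere, which is exactly the problem.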

So what exactly is the execution model you’re proposing? It’s hard to argue for or against something that’s nebulously defined.

Only when there is not enough power to finish everything instantly do things start to matter.

Suppose you need to perform tasks X and Y, you have only 16ms to do them, and you need both results next frame. (That is exactly the example you gave, “submitting a physics simulation for the next frame”, and it is the scenario I was talking about.) If the two tasks together take more than 16ms, you drop the frame. There’s no getting around that. Threading calls aren’t going to help. There’s no trick that will make these two tasks take less than the sum of their running times, unless you run them on separate GPUs in parallel (which is not what you’re talking about). So if that sum is greater than the time you have for the frame, you drop the frame.

I made the assumption of sufficient time because, without that assumption, the example simply doesn’t make sense. The example might make sense if you got rid of the “for the next frame” bit and gave yourself 3-4 frames of time. But even then, we would have to assume that there is sufficient time to execute all of these commands within that 3-4 frames. Otherwise, you’re going to drop at least one of those frames.
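The arithmetic behind that argument is small enough to write down. The following C sketch assumes the tasks run strictly one after another on a single GPU and counts only those tasks, ignoring any other rendering work. The function names and the microsecond units are choices made for this example, not anything from the proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Total GPU time for tasks that run one after another on a single GPU. */
static long total_us(const long *task_us, int count) {
    assert(count >= 0);
    long total = 0;
    for (int i = 0; i < count; ++i) {
        total += task_us[i];
    }
    return total;
}

/* True if the serialized tasks finish inside the given time budget. */
static bool fits_budget(const long *task_us, int count, long budget_us) {
    return total_us(task_us, count) <= budget_us;
}

/* Smallest number of whole frames that can absorb the work
   (ceiling division of total time by frame time). */
static long frames_needed(const long *task_us, int count, long frame_us) {
    assert(frame_us > 0);
    return (total_us(task_us, count) + frame_us - 1) / frame_us;
}
```

With hypothetical costs of 10ms for X and 9ms for Y, `fits_budget` is false for a 16ms budget and `frames_needed` reports 2 frames. The same work fits comfortably into a budget of four frames, which is why relaxing the deadline changes the example but reordering the commands does not.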

kRogue · June 11, 2013, 9:33am

I am just wondering aloud here, but it seems to me that the core issue thomas wants to resolve is the reasonably common situation of asking the GPU to compute something without needing the result ASAP. “Later” might be next frame, or many frames away.

One idea is to introduce “tasks”. A task is just a collection of commands to send to GL to get processed. A task can depend on other tasks, i.e. the tasks it depends on need to be done before the task itself runs. Lastly, one may want the ability to cancel a task as well (for example, new data arrives while the task is still waiting, so go ahead and cancel it). With that clear, here is one way to express the idea:



/*
  Generate task objects
  \param n number of task objects to create
  \param tasks specifies an array in which to store the task object IDs
*/
GenTasks(sizei n, uint *tasks);

/*
  Delete task objects
  \param n number of task objects to delete
  \param tasks specifies an array of task objects to delete
*/
DeleteTasks(sizei n, uint *tasks);


/*
   Specify that all subsequent commands are to
   be recorded into a task that is saved for later execution
   \param task task object to make active.
*/
BeginTask(uint task);

/*
  Specifies resume of normal GL operation, i.e.
  commands are no longer added to a task.
*/
EndTask();

/*
  Specifies that a target task, target_task, may only
  be executed after a set of other tasks are completed
*/
AddTaskDependency(uint target_task, sizei n, uint *tasks);

/*
  Specifies that a target task, target_task, no longer
  needs to wait for other tasks to complete before being 
  executed
*/
RemoveTaskDependency(uint target_task, sizei n, uint *tasks);


/*
  Place a task to be executed, return immediately. An implementation
  will select when it is executed. 
*/
ExecuteTaskNonBlock(uint task);

/*
  Forces a GL implementation to finish the named task before
  executing any subsequent GL commands. Does not block. 
  The task must have had ExecuteTaskNonBlock already called.
*/
TaskBarrier(uint task);

/*
 If a task has not yet completed, but ExecuteTaskNonBlock has been called,
 cancel the task. The return code is one of
  -PARTIALLY_EXECUTED task was only partially run
  -NOT_EXECUTED task never had a chance to run at all
  -COMPLETED task finished its run anyway
*/
enum CancelTask(uint task);


/*
  Query a task
  \param task task to query
  \param prop what property to query, values may be: TASK_STATUS 
                   NUMBER_DEPENDENT_TASKS, DEPENDENT_TASKS, others?
  \param out location to which the values are written

  TASK_STATUS gives one of:
  - INACTIVE: task is either complete or not queued
  - ACTIVE: task is either running right now or queued to be run
*/
QueryTask(uint task, enum prop, uint *out);

Naturally this opens up lots of messy cans of worms: what happens when a task is cancelled but another running task relies on it, or even whether tasks should be cancellable at all. Moreover, whether or not the same task can be queued up more than once and, along those lines, whether or not a task is modifiable while it is queued up. Going further, there is the state of the objects the task relies on. These are my preferences for each of these worms:

A given task may not be queued up more than once, i.e. calling ExecuteTaskNonBlock on a task that has not yet finished but already had ExecuteTaskNonBlock called on it is either an error or ignored. Having the same task queued repeatedly is messy to specify with respect to dependent tasks (should they get rerun too, etc.). I am on the side that calling ExecuteTaskNonBlock on a task that is already queued or running is ignored.
A task is unchangeable once EndTask() is called.
If a task T relies on any GL objects, such as buffer objects and textures, it uses the state of those objects when it is run. Thus, for a task that uses a buffer object B, the values used in its run are not the values of B when the task was defined; rather, they are whatever values B has when the task is run. To get well-defined results, an application must not write to B (except in a task on which T depends) until T completes.
If a task T is cancelled where T has dependent tasks, then those dependent tasks that were made to run so that T may run are also cancelled.
If a task T depends on a task S and S is cancelled whereas T is not, then I don’t know. Error? Ignored?
Circular task dependency is an error.
Order of operations of tasks is only guaranteed along the dependency graph, i.e. the only tasks guaranteed to be done before a task T are those tasks that T depends on.
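Two of these rules, rejecting circular dependencies and running a task only after everything it depends on, can be sketched with a small dependency graph in C. This is only a CPU-side model under simplifying assumptions: tasks are plain indices, "running" a task just records its ID, and `TaskGraph`, `add_dependency`, and `run_task` are names invented for the sketch rather than part of the API proposed above. Cancellation and asynchronous execution are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 16

typedef enum { TASK_IDLE, TASK_DONE } TaskState;

typedef struct {
    bool depends_on[MAX_TASKS][MAX_TASKS]; /* depends_on[t][d]: t waits for d */
    TaskState state[MAX_TASKS];
    int count;                             /* number of tasks in use */
} TaskGraph;

/* True if `from` reaches `to` by following dependency edges.
   Terminates because add_dependency keeps the graph acyclic. */
static bool reaches(const TaskGraph *g, int from, int to) {
    if (from == to) {
        return true;
    }
    for (int d = 0; d < g->count; ++d) {
        if (g->depends_on[from][d] && reaches(g, d, to)) {
            return true;
        }
    }
    return false;
}

/* Records "task waits for dep". Returns false, and changes nothing,
   if the new edge would create a circular dependency. */
static bool add_dependency(TaskGraph *g, int task, int dep) {
    assert(task >= 0 && task < g->count);
    assert(dep >= 0 && dep < g->count);
    if (reaches(g, dep, task)) {
        return false;
    }
    g->depends_on[task][dep] = true;
    return true;
}

/* Runs every unfinished dependency first, then the task itself.
   Finished task IDs are appended to `order`; `n` is the running count. */
static void run_task(TaskGraph *g, int task, int *order, int *n) {
    if (g->state[task] == TASK_DONE) {
        return; /* already completed, nothing to do */
    }
    for (int d = 0; d < g->count; ++d) {
        if (g->depends_on[task][d]) {
            run_task(g, d, order, n);
        }
    }
    g->state[task] = TASK_DONE;
    order[(*n)++] = task;
}
```

For three tasks where 2 waits for 1 and 1 waits for 0, asking for task 2 runs 0, 1, 2 in that order, and an attempt to make 0 wait for 2 is refused. Nothing is promised about the relative order of tasks that are unrelated in the graph, which matches the last rule.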

More worms: the nature of GL state. I would want GL state to be per task, so that GL state changes are isolated to the task and each task has its own GL state vector.

This sounds like D3D’s deferred command lists.

This suggestion forces the developer to say exactly how important something is by specifying its dependency tasks. An ideal implementation of the above would interpret TaskBarrier() as moving the task and its dependencies up to the most important thing to get done. I am not too sure whether there should be a blocking version of TaskBarrier as well, though…

imported_degasus · June 17, 2013, 6:01am

Adding priorities to contexts sounds fine, but I don’t see the benefits of multiple “threads” per context.

I think the original suggestion is to split up GPU work into several prioritized command streams (glBindTexture isn’t the issue, it is mostly glDraw* that matters) and to use implicit synchronization between them. Execution then appears to be in order, but the driver gets a hint about which results aren’t needed soon.

But to use this feature wisely, explicit synchronization should be used:

We need to check whether the task has finished, so that we neither stall rendering nor wait far longer than necessary.
We also have to check for space on this command stream, else the driver will have to wait.

So the only users of this implicit synchronization are games that know they will need some offscreen rendering a fixed number of frames ahead. Everyone else would use explicit synchronization and could just as well use shared, prioritized contexts without additional overhead.

btw: EGL_IMG_context_priority

thomas.d · June 24, 2013, 1:33am

the sequential execution model is very simple […]

The sequential execution model includes, among other things, glBeginConditionalRender/glEndConditionalRender.

What is so different about conditional render? What issues exist with my proposal that conditional render does not have too? I don’t see that much of a difference. You group some commands together, logically isolate them from the “main stream”, and tell the driver to only process them based on some query. It might process the commands speculatively anyway if the result is not ready in time. Or it might start processing, stop in the middle, and discard the results. Or something else. It might ignore the conditional render commands altogether. Who knows, and who cares? But you do have the possibility to tell the driver your intent.