Compute shader interoperability with GL pipeline

Hello everyone,

I have got an interesting question for the OpenGL gurus out there.

When I make a call to glDispatchCompute, and then I execute other commands, the OpenGL pipeline queues those commands sequentially. Does that mean that on the GPU the command after glDispatchCompute waits for glDispatchCompute to finish before executing?

I’ll clarify my question: when the CPU pushes commands to the GL pipeline, those commands are not necessarily executed immediately, as the GPU might be lagging a little bit behind. The CPU and GPU work asynchronously.
At the same time, when I queue two rendering commands, I know that the GPU executes them one after the other.
Now let’s make an example.
I rendered all my geometry and I want to do a bunch of post process passes.
Let’s assume that I want to do a horizontal blur pass and then a vertical blur pass.

Pass 1: horizontal blur: pixel shader reads from scene, outputs blurred to texture object 1
Pass 2: vertical blur: pixel shader reads from texture object 1 and outputs to final scene or other texture object.

since the 2 passes are 2 different draw calls, I don’t have to put a wait between pass 1 and pass 2, because I know for a fact that the GPU won’t start any operation of pass 2 before pass 1 is done and the texture object 1 is ok.

Now, can I assume the same with Compute Shaders and the rest of the GL pipeline?

In the same example as above, let’s assume that Pass 1 is a compute shader.

Pass 1: horizontal blur: compute shader reads from scene, outputs blurred to texture object 1
Pass 2: vertical blur: pixel shader reads from texture object 1 and outputs to final scene or other texture object.

do I have to put any sync/barrier or double buffering between the compute shader and the pixel shader? That is, will the GPU wait for the compute shader to finish before executing the Pass 2 draw call and read from the texture object 1 which is the compute shader output?

The reason I ask is that in the new OpenGL SuperBible, in the compute shader chapter, they use double buffering for the flocking example (using a compute shader to do flocking on a uniform shader buffer that is then bound when drawing the geometry).

This is from the book:

“The flock positions and velocities need to be double buffered because we don’t want to partially update the position or velocity buffer while at the same time using them as a source for drawing commands”

but what you can do is execute the compute shader with the output buffer being the input buffer (copy all data to shared memory, do calculation, write out), and then bind that one.
Am I missing anything else?

Thanks for the reply in advance,

Paolo

When I make a call to glDispatchCompute, and then I execute other commands, the OpenGL pipeline queues those commands sequentially. Does that mean that on the GPU the command after glDispatchCompute waits for glDispatchCompute to finish before executing?

No, it does not. Or at least, it doesn’t have to. Part of a rendering command (see the Vertex Rendering page on the OpenGL Wiki) can still be in the pipeline while a second command starts. But please continue reading:

since the 2 passes are 2 different draw calls, I don’t have to put a wait between pass 1 and pass 2, because I know for a fact that the GPU won’t start any operation of pass 2 before pass 1 is done and the texture object 1 is ok.

No, you don’t have to do anything between them, because most of OpenGL follows a coherent, sequential memory model. The OpenGL implementation will detect that Pass 2 reads from texture object 1, which was used as a render target in a previous rendering operation. Therefore, the implementation will make sure that Pass 2 will not start executing until Pass 1 has fully completed. The implementation will also ensure that any writes to texture object 1 are fully written and caches are cleared, so that Pass 2 can successfully read all of the data written by Pass 1.
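For example, here is a minimal sketch of the two fragment-shader blur passes. The handles (sceneTex, tex1, fboBlur, fboFinal, blurHProg, blurVProg) are hypothetical and assumed to have been created elsewhere; the point is simply that no explicit synchronization appears between the two draws:

// Pass 1: horizontal blur. Renders into texture object 1 via an FBO.
glBindFramebuffer(GL_FRAMEBUFFER, fboBlur);   // color attachment 0 = tex1
glUseProgram(blurHProg);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, sceneTex);       // read the rendered scene
glDrawArrays(GL_TRIANGLES, 0, 3);             // fullscreen triangle

// Pass 2: vertical blur. Samples texture object 1 through a sampler.
glBindFramebuffer(GL_FRAMEBUFFER, fboFinal);
glUseProgram(blurVProg);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, tex1);           // read what Pass 1 rendered
glDrawArrays(GL_TRIANGLES, 0, 3);

// No barrier needed: framebuffer writes followed by texture fetches are
// covered by OpenGL's coherent, sequential memory model.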

Please note that I said “most of OpenGL” above. That will become relevant right now:

Pass 1: horizontal blur: compute shader reads from scene, outputs blurred to texture object 1
Pass 2: vertical blur: pixel shader reads from texture object 1 and outputs to final scene or other texture object.

do I have to put any sync/barrier or double buffering between the compute shader and the pixel shader? That is, will the GPU wait for the compute shader to finish before executing the Pass 2 draw call and read from the texture object 1 which is the compute shader output?

Compute shaders do not have “outputs” per se. They have some built-in input values (defining which invocation it is), but no user-defined outputs. Therefore, there are only three mechanisms you could possibly use to output data from a CS:

1. Image Load/Store
2. Shader Storage Buffer Objects
3. Atomic Counters

Since you mention writing to a texture, I’ll assume that #3 isn’t what you’re doing for output.

SSBO is basically a nicer face on Image Load/Store to buffer textures. So in effect, they’re the same thing. I say that because writes to both of them break the sequential, coherent memory model (hence the “most of OpenGL” thing). So the issue has nothing to do with Compute Shaders at all; it’s how you’re writing your data.

Image Load/Store and its equivalents follow an incoherent memory model. Therefore, it is up to you to ensure sequential and coherent access to such data.

Your compute shader uses image load/store to write to a texture. If Pass 2 reads that texture via a sampler, then you need to ensure coherency by issuing an appropriate memory barrier (ie: use the GL_TEXTURE_FETCH_BARRIER_BIT), so that Pass 2 can properly read the values written by Pass 1.

The barrier ensures memory coherency (ie: that values written can be read) as well as sequential operations (ie: that Pass 1 has actually finished before starting Pass 2).
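As a minimal sketch of what that looks like in code (the handles blurCS, blurVProg, tex1 and the 16x16 local work group size are assumptions for illustration):

// Pass 1: compute shader writes the horizontally blurred result into tex1
// via image load/store (incoherent writes).
glUseProgram(blurCS);
glBindImageTexture(0, tex1, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(width / 16, height / 16, 1);

// Make the image writes visible to subsequent texture fetches.
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

// Pass 2: fragment shader samples tex1 through a sampler.
glUseProgram(blurVProg);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, tex1);
glDrawArrays(GL_TRIANGLES, 0, 3);             // fullscreen triangle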

but what you can do is execute the compute shader with the output buffer being the input buffer (copy all data to shared memory, do calculation, write out), and then bind that one.
Am I missing anything else?

You could do that. But you’d just lose performance (probably). Why? Because of your barriers.

Because barriers ensure both coherency and sequential operation, you should only issue barriers when you absolutely need to. With your way, your renderer will have to look like this:


Pass 1: CS to read/write data.
Memory Barrier for buffer read.
Pass 2: Rendering call to read written data.
Memory Barrier for CS read/write.
Pass 1: CS to read/write data.
...

That is, every loop will need two memory barriers. You need that second one because you cannot execute the next frame’s Pass 1 while Pass 2 is still in progress. And since Pass 1 is using image load/store to both read and write the data, OpenGL won’t enforce that on its own. Therefore, while you don’t care about coherency, you do care about sequential operation. So you need a way to tell OpenGL to finish up all existing operations before moving on. And that’s the memory barrier.

If you do it the book’s way, you only need one per loop:


Pass 1: CS to write data to part of buffer.
Memory Barrier for buffer read.
Pass 2: Rendering call to read written data.
Pass 1: CS to write data to part of buffer not currently being read.
...

Note the “not currently being read” part; that’s why you don’t need the second barrier. Pass 1’s writes cannot interfere with Pass 2’s reads, so there’s no need to ensure sequential operation.
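As a sketch of that one-barrier loop (flockCS, drawProg, buf[2], vao[2], numBoids and the 256-wide work group size are made-up names; I'm assuming the CS writes through SSBO bindings and the draw sources the same buffers as vertex attributes):

GLuint dst = frame & 1;      // buffer written by the CS this frame
GLuint src = 1 - dst;        // buffer written last frame

// Pass 1: CS reads buf[src], writes buf[dst] (incoherent SSBO writes).
glUseProgram(flockCS);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buf[src]);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buf[dst]);
glDispatchCompute(numBoids / 256, 1, 1);

// The single barrier per loop: make this frame's writes visible both to the
// draw below and to the next frame's CS reads.
glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT);

// Pass 2: draw reads buf[dst]. Next frame's CS writes buf[src], which this
// draw never touches, so no second barrier is needed before that dispatch.
glUseProgram(drawProg);
glBindVertexArray(vao[dst]);
glDrawArrays(GL_POINTS, 0, numBoids);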

What a fantastic answer.
Thank you very much for the thoroughness of it, and for providing the necessary links to read more about this topic.
I do apologize since I completely skipped some pages of the OpenGL Wiki and I didn’t try to find my answer there first.

I have been working with OpenGL for years, and a few years ago I worked with OpenCL on its own. I just recently started studying the concept of using compute shaders in a graphics pipeline to achieve particular effects.
I was completely unaware that compute shaders would not respect the coherent memory model that the GL pipeline follows.
Thanks to the links, I now understand the logic of it and what the glMemoryBarrier call is for.

Just to clarify, when I call glMemoryBarrier, I am enqueuing in the GL pipeline a barrier on a specific set of states (texture fetch, framebuffer op, …) so that the GPU knows that the following commands, if they use a resource I put a barrier on, need to wait. Now, this doesn’t cause any CPU wait, right? The CPU execution will continue (unlike calling glReadPixels from the framebuffer, for example), as that barrier is not between GPU and CPU but is there for the GL pipeline to know that the next command has to wait.
This is my understanding from reading the API documentation for glMemoryBarrier. Hopefully I am not mistaken.

There are some other things that I would like to clarify, if possible.

I understand this, I think I didn’t explain it correctly.
What the book does is the following:


Even frames:

CS: read from buff 0, write to buff 1
RENDERING: read from buff 0 and draw boids

Odd Frames:
CS: read from buff 1, write to buff 0
RENDERING: read from buff 1 and draw boids

so there is no barrier between the CS and the rendering because the target is not the same.
My question then is: if I can’t be sure that the GL pipeline waits for the CS to fully write memory correctly, then what stops this from happening:



frame 0:
CS: read from buff 0, write to buff 1
RENDERING: read from buff 0 and draw boids

frame 1:
// enqueue next CS
CS: read from buff 1, write to buff 0

// CS FRAME 0 is so slow that it's still working on buff 1, so this call is now accessing memory that is not fully written
RENDERING: read from buff 1 and draw boids

Is that because between frames there is a “Swap Buffers” call?
Now, I could be mixing things up here, but it’s safe to assume that swap buffers actually calls glFinish under the hood.
As defined by the API, glFinish blocks the CPU until all previously issued commands have been executed.
I also remember the following being correct: when working on a double-buffered application, swap buffers might not block the CPU if one of the two buffers is available to be used.
(Example: the CPU is so fast at submitting commands, while the GPU is really slow, that this happens)


frame 0:
- CPU submits commands
    - GPU starts executing commands
- swap buffers: the CPU should wait, but since I am double buffering I have one buffer left, so I can continue

frame 1:
- while the GPU is still working on frame 0, the CPU submits frame 1 commands
    - GPU is still working
- swap buffers: now I really have to wait, so block the CPU until GPU frame 0 is done

Thanks,

Paolo

I was completely unaware that compute shaders would not respect the coherent memory model that the GL pipeline follows.

Compute shaders aren’t (specifically) the issue. You could have been doing those writes in a fragment shader and have the same problem. It’s the way you’re writing the data (via image load/store) that makes the memory writes incoherent.

Just to clarify, when I call glMemoryBarrier, I am enqueuing in the GL pipeline a barrier on a specific set of states (texture fetch, framebuffer op, …) so that the GPU knows that the following commands, if they use a resource I put a barrier on, need to wait.

You’re thinking too low-level. Don’t think of it in terms of what “needs to wait”. Think of it in terms of data visibility.

Some operation wrote some data. You want to do something that reads that data. So you issue a memory barrier that says, “I’m about to read data via <insert bits here> that was written incoherently. Make that work.”

Now, this doesn’t cause any CPU wait, right?

It could. OpenGL specifies behavior, not implementation. That being said, image load/store is incoherent to increase performance; that’s why you have explicit control over barriers. So I would guess that they’re primarily on-GPU operations. Though it’s stated that GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT (for persistent mapping) may cause additional stalls. I also wouldn’t be surprised if the GL_BUFFER_UPDATE_BARRIER_BIT had some unpleasant CPU behavior.

My question then is: if I can’t be sure that the GL pipeline waits for the CS to fully write memory correctly, then what stops this from happening:

As far as the OpenGL specification is concerned… nothing.

However, most implementations of OpenGL will, during the process of buffer swapping, issue the internal equivalent of a glFinish, along with whatever other memory barriers exist. So basically, they get away with it. Undefined behavior means undefined. Which means that it may appear to work.

This does not mean that you should rely on it (and shame on the writers of the book for having that bug in it).

It’s the way you’re writing the data (via image load/store) that makes the memory writes incoherent.

Right. It’s the new feature of writing memory that is incoherent, not the compute shader. The compute shader is simply using that write method. Clear.

You’re thinking too low-level. Don’t think of it in terms of what “needs to wait”. Think of it in terms of data visibility.

Yes, I do that too often :) . I strive to write applications that are correct, and, while I rely on the API, I often try to understand how the inner workings operate so that I have a deeper understanding of it. Of course, I also need to understand that there is a line between the API specification and a particular implementation.

OpenGL specifies behavior, not implementation

I get that. The programmer should follow the API logic and not how it’s implemented, since that can change from vendor to vendor. I am surprised that the API is not clear on which ops require the CPU to block and which don’t. I guess it’s in the vendor’s interest to block as little as possible, so I should trust the API and, as you mentioned before, use barriers only when necessary.

This does not mean that you should rely on it (and shame on the writers of the book for having that bug in it).

Interesting. So what should the right approach for this be? I don’t think there is a way to set a memory barrier on one specific uniform buffer object, right?



Initial setup:

// same geometry, different uniform buff

create VAO0: 
  geometrybuff0
  uniform buff0

create VAO1: 
  geometrybuff0
  uniform buff1

Rendering:

frame 0:
CS: read from buff 0, write to buff 1

// I can safely start the rendering as buffer 0 is not used
RENDERING: render VAO0, which is using buffer 0
 
frame 1:
// enqueue next CS
CS: read from buff 1, write to buff 0
 
// how do I make sure that the write on buff 1 is done?
RENDERING: render VAO1, which is using buffer 1

So if I want to make sure that I am putting a barrier at the right spot, I could do the following:



frame 1:
// enqueue next CS
CS: read from buff 1, write to buff 0

//I could put a memory barrier here
glMemoryBarrier(GL_UNIFORM_BARRIER_BIT);
// before rendering
RENDERING: render VAO1, which is using buffer 1

So the final question is: will OpenGL recognize that VAO1 is using buffer 1 and therefore wait only if the write to buffer 1 is not complete, or will it lock in any case, maybe waiting for frame 1’s CS to be done even if not necessary?

I am starting to guess that the answer to such a specific behavior is:
It’s implementation dependent and if you want to be 100% correct, you just put a barrier and hope that the implementation is smart enough to do the right and efficient thing.

Is that right?

Thanks a lot for all the clarifications, I have some thinking and studying to do now :) . I need to go back and revisit the wiki pages you linked before to make sure I have a solid understanding of this topic. Your answers helped a lot.

Paolo

So the final question is: will OpenGL recognize that VAO1 is using buffer 1 and therefore wait only if the write to buffer 1 is not complete, or will it lock in any case, maybe waiting for frame 1’s CS to be done even if not necessary?

These memory barriers are based on what operations you ask it to create a barrier for, not which specific objects you’re working with. So it’s blind to particular objects; it will make no assumptions based on the current state of OpenGL. All it knows is that you put a barrier there, so it needs to ensure that any previously-executed incoherent writes are made visible to the operations you specified in the barrier call. So whatever the call does is whatever the call always does.

I would suggest putting the barrier before the compute operations. That way, there’s some GPU time between the compute operation and the barrier for it to complete.
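In your double-VAO scheme, that would look roughly like this per frame. This is only a sketch under assumptions: flockCS, drawProg, buff[2], geomVao, numBoids and vertexCount are made-up names, the CS is assumed to read/write the boid data through SSBO bindings, and the draw is assumed to read it as a uniform buffer (hence GL_UNIFORM_BARRIER_BIT, matching your earlier call):

// Barrier at the top of the frame, before the dispatch. It covers last
// frame's incoherent writes to buff[src], which both this frame's CS (SSBO
// reads) and this frame's draw (uniform reads) are about to consume. All of
// last frame's rendering sits between those writes and this barrier, so
// there is GPU time for them to complete.
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_UNIFORM_BARRIER_BIT);

// CS: read from buff[src] (written last frame), write to buff[dst].
glUseProgram(flockCS);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buff[src]);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buff[dst]);
glDispatchCompute(numBoids / 256, 1, 1);

// RENDERING: bind buff[src] as the uniform buffer the draw reads. The CS
// above does not write that buffer, so no barrier is needed right here.
glUseProgram(drawProg);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, buff[src]);
glBindVertexArray(geomVao);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);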