Memory Model


A Memory Model defines the rules that determine when writes to data stored in OpenGL objects become visible to later reads of that data. For example, the memory model defines when writes to a Texture attached to a Framebuffer Object become visible to subsequent reads from that texture.

Under normal circumstances, OpenGL rigidly enforces a coherent memory model. In general, if you write to an object, any command you issue later will see the new value. Because OpenGL behaves "as if" all operations happened in a specific sequence, it is up to the OpenGL implementation to make sure that all prior writes have completed when you issue a read.

For example, if you issue a rendering command, OpenGL may not have finished executing that command by the time the function returns. If you then call glReadPixels to read from the framebuffer (without using an asynchronous read), it is the OpenGL implementation's job to synchronize with all outstanding write operations. OpenGL will therefore block the CPU thread issuing the read until all rendering is done, then perform the read and return.

What follows is a list of exceptions to the basic rule that "reads issued after a write will see the written data".

Multiple threads and contexts

It is possible to use multiple CPU threads in OpenGL. This raises a number of questions about synchronization and state visibility between threads.

The multithreading model that OpenGL uses is built on one fact: the same OpenGL context cannot be current within multiple threads simultaneously. While you can have multiple OpenGL contexts which are current in multiple threads, you cannot manipulate a single context simultaneously from two threads.

Of course, this raises the possibility of a race condition over which thread a context is current in. That is, one thread is manipulating a context while another thread makes it current. Who wins?

The correct answer is that you lose. You should design your application such that this is impossible. It is up to the OpenGL application to avoid race conditions on the current context. This should be done by using appropriate synchronization primitives, which should be available to you from your language or threading library of choice.
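A minimal sketch of this, assuming POSIX threads: treat "this context is current in some thread" as a resource guarded by a mutex that is held from make-current until release. The `ctx_guard` type and its functions are hypothetical helpers, and the platform make-current call (wglMakeCurrent, glXMakeCurrent, eglMakeCurrent or your windowing library's equivalent) is left to the caller.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical guard: one per OpenGL context. The mutex is held for the
   whole time the context is current in some thread. */
typedef struct {
    pthread_mutex_t lock;
} ctx_guard;

void ctx_guard_init(ctx_guard *g)    { pthread_mutex_init(&g->lock, NULL); }
void ctx_guard_destroy(ctx_guard *g) { pthread_mutex_destroy(&g->lock); }

/* Call before making the context current; blocks while another thread
   still has it current. */
void ctx_guard_acquire(ctx_guard *g) { pthread_mutex_lock(&g->lock); }

/* Non-blocking variant: returns false if any thread (including this one,
   since the mutex is not recursive) currently holds the context. */
bool ctx_guard_try_acquire(ctx_guard *g)
{
    return pthread_mutex_trylock(&g->lock) == 0;
}

/* Call only after the context has been released from this thread. */
void ctx_guard_release(ctx_guard *g) { pthread_mutex_unlock(&g->lock); }

/*
 * Intended use in each thread:
 *   ctx_guard_acquire(&guard);
 *   make the context current            (platform call)
 *   ... issue OpenGL commands ...
 *   release the context from the thread (platform call)
 *   ctx_guard_release(&guard);
 */
```

Releasing the context before unlocking is the important part: unlocking first would let another thread make the context current while it is still current here, which is exactly the race described above.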

Object content visibility

OpenGL Contexts can share objects. This means that an object can be modified in one thread while it is being used in another.

In terms of synchronization, OpenGL states that a command's changes only take effect once that command has completed. Since each context has its own command stream, you will need to use either a Sync Object or glFinish to ensure that the command has executed. But you must also let the other thread know that the data has been updated. This requires your language's or library's inter-thread communication features (mutexes, atomics, etc.).

However, this is not enough to ensure that the changes are visible in the consuming thread. The OpenGL specification has a number of complex rules (OpenGL 4.5, Section 5.3.3, page 54) that determine when updates to an object in one thread will become visible to other threads.

The general gist of the rules is that each consuming context must bind the object to its context before the change becomes visible. The binding can either be direct (with a glBind* call) or indirect, by binding a container object that references the changed object.

Even if that object is already bound in thread 2 when the change takes place in thread 1, thread 2 must rebind the object to ensure the visibility of the updated data. However, the object does not need to be rebound to every binding or attachment point it is associated with; rebinding it to any one of them makes the object's new data visible in the current context.

Note that the above is true even in a single-threaded case. This would happen if you have multiple contexts in one thread that share objects. If you change an object's data in context 1, then make context 2 current, the object's data will only become visible in the new context if you rebind it.
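The handoff described in this section can be sketched in C11 as follows. The `fence_mailbox` type is a hypothetical helper that passes a GLsync handle (stored as `void *` so the sketch compiles without GL headers) from the writing thread to the reading thread; the OpenGL calls around it are shown in comments, and `buf`, `size` and `data` are placeholders. Note the glFlush after glFenceSync: without it, the fence may never be submitted, and a wait on it from another context could block forever.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical single-slot mailbox for handing a fence between threads. */
typedef struct {
    _Atomic(void *) fence;   /* pending GLsync, or NULL */
} fence_mailbox;

void fence_mailbox_init(fence_mailbox *mb)
{
    atomic_init(&mb->fence, NULL);
}

/* Producer side. Returns any fence that was published earlier but never
   taken, so the caller can delete it. */
void *fence_mailbox_publish(fence_mailbox *mb, void *fence)
{
    return atomic_exchange_explicit(&mb->fence, fence, memory_order_acq_rel);
}

/* Consumer side. Returns the pending fence, or NULL if there is none. */
void *fence_mailbox_take(fence_mailbox *mb)
{
    return atomic_exchange_explicit(&mb->fence, NULL, memory_order_acq_rel);
}

/*
 * Producer thread (context A):
 *   glBufferSubData(GL_ARRAY_BUFFER, 0, size, data);  // modify the shared object
 *   GLsync f = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
 *   glFlush();                                         // submit the fence
 *   void *old = fence_mailbox_publish(&mb, f);
 *   if (old) glDeleteSync((GLsync)old);
 *
 * Consumer thread (context B):
 *   void *f = fence_mailbox_take(&mb);
 *   if (f) {
 *       glWaitSync((GLsync)f, 0, GL_TIMEOUT_IGNORED); // order B after A's commands
 *       glDeleteSync((GLsync)f);
 *       glBindBuffer(GL_ARRAY_BUFFER, buf);            // rebind: makes new data visible
 *   }
 */
```

The atomic exchange covers the "let the other thread know" part, the fence covers command completion, and the final glBindBuffer is the rebind that the visibility rules require.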

Sync objects

Sync Objects are the one exception, specifically when multiple contexts are blocked on the same sync object. When the sync object becomes signaled in one context, it becomes signaled in all contexts that are currently blocked on it.

Framebuffer objects

When performing a rendering operation to images attached to a Framebuffer Object, attempts to access pixels from those images via texture or image fetches will result in undefined values. Such fetches from images that are being written to are not coherent. Note that this only concerns fetches due to rendering operations; image reads via Pixel Transfer operations (even asynchronous ones) will work just fine (though non-async reads will stall the CPU to wait for the GPU to finish rendering).

OpenGL is actually quite strict, and conservative, about fetching data from attached textures. It states that the value of a texture being accessed is undefined if it is at all theoretically possible for any texture fetch executed by the shaders to access any part of any image being written in the rendering operation.

If an image is attached to a framebuffer object, then to get defined behavior when reading from that texture (or from any texture whose storage provides access to that image), you must use the texture's mipmap range parameters (GL_TEXTURE_BASE_LEVEL and GL_TEXTURE_MAX_LEVEL) to make it actively impossible to access any of the attached mipmap levels. Alternatively, you may use non-mipmapped minification filtering (GL_NEAREST or GL_LINEAR), which only fetches from the base level, so long as the base level is not one of the attached levels.
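A common case is generating a mipmap chain by rendering each level from the one above it within the same texture. A minimal sketch, assuming a 2D texture that is bound for sampling and attached to the FBO at the level being written: for every pass, clamp GL_TEXTURE_BASE_LEVEL and GL_TEXTURE_MAX_LEVEL to the source level so the attached level can never be fetched. The `downsample_pass` type and the helper functions are hypothetical; the OpenGL calls appear only in comments.

```c
#include <stdbool.h>

/* Hypothetical description of one downsampling pass. */
typedef struct {
    int dst_level;      /* level attached to the FBO for this pass */
    int base_level;     /* value for GL_TEXTURE_BASE_LEVEL during the pass */
    int max_level;      /* value for GL_TEXTURE_MAX_LEVEL during the pass */
    int width, height;  /* viewport size of the destination level */
} downsample_pass;

/* Size of a dimension at a given mipmap level (never smaller than 1). */
static int mip_dim(int size, int level)
{
    int d = size >> level;
    return d > 0 ? d : 1;
}

/* True if sampling restricted to [base_level, max_level] cannot touch
   the attached level. */
bool mip_range_excludes(int base_level, int max_level, int attached_level)
{
    return attached_level < base_level || attached_level > max_level;
}

/* Fills passes[0 .. num_levels - 2]; returns the number of passes. */
int plan_downsample(int width, int height, int num_levels, downsample_pass *passes)
{
    int n = 0;
    for (int level = 1; level < num_levels; ++level) {
        downsample_pass p = { level, level - 1, level - 1,
                              mip_dim(width, level), mip_dim(height, level) };
        passes[n++] = p;
    }
    return n;
}

/*
 * For each pass p (texture `tex` bound to GL_TEXTURE_2D, FBO bound):
 *   glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
 *                          GL_TEXTURE_2D, tex, p.dst_level);
 *   glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, p.base_level);
 *   glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, p.max_level);
 *   glViewport(0, 0, p.width, p.height);
 *   draw a full-screen triangle that samples the texture
 * Afterwards, restore the base level to 0 and the max level to its
 * previous value (1000 by default).
 */
```

Because the attached level is always outside [base, max], every pass stays within the defined-behavior rule above, even though source and destination share one texture object.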

You will get undefined fetches when using Array Textures or Cubemap Textures, even if you have attached one array layer and are fetching only from a different one. As long as the fetched and attached layers are in the same mipmap level, the behavior is undefined.

That being said, view textures can help resolve this problem. If you attach one array layer/face to an FBO and fetch from a view texture whose source is a different array layer/face, then the reads are well-defined.

Note: All that it takes to trigger undefined fetches is for the image to be attached, even if you are not rendering to it. So the draw buffers state for the framebuffer is irrelevant. If it is attached to the FBO currently being rendered to, and you try to read from it, you get undefined behavior. Similarly, using Write Masks will also not prevent undefined behavior.

Once those images are no longer being written to, rendering commands issued after that change will be able to read the values written by earlier rendering commands. Things that change what is being rendered to include:

  • Binding a new FBO that doesn't use the image(s) in question.
  • Detaching the image from the FBO altogether.

Because of this, if you want to implement programmable blending (blend functions more complex than what OpenGL's Blending provides), you will often need to "ping-pong" between two textures. The algorithm works like this:

  1. Read from texture 0, blend and write to texture 1.
  2. Bind texture 1 for reading.
  3. Change the FBO's attachment to texture 0 (remember: just calling glDrawBuffers isn't enough).
  4. Read from texture 1, blend and write to texture 0.
  5. Bind texture 0 for reading.
  6. Change the FBO's attachment to texture 1 (remember: just calling glDrawBuffers isn't enough).
  7. Repeat as needed for each blended object.
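The steps above can be sketched as a small helper that says, for each pass, which texture is sampled and which is attached. The `pingpong_pass` type and helpers are hypothetical; the OpenGL calls that would use them are shown in comments and assume `tex` holds two texture names with identical formats and sizes.

```c
/* Hypothetical description of one ping-pong pass between tex[0] and tex[1]. */
typedef struct {
    int read_tex;    /* index of the texture bound for sampling */
    int attach_tex;  /* index of the texture attached to the FBO */
} pingpong_pass;

/* Pass i reads what pass i - 1 wrote; pass 0 reads tex[0], which holds
   the image rendered so far. */
pingpong_pass pingpong_pass_for(int i)
{
    pingpong_pass p;
    p.read_tex = i & 1;
    p.attach_tex = p.read_tex ^ 1;
    return p;
}

/* Index of the texture holding the final image after num_passes passes. */
int pingpong_result(int num_passes)
{
    return num_passes & 1;
}

/*
 * Per blended object i, with the FBO bound:
 *   pingpong_pass p = pingpong_pass_for(i);
 *   glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
 *                          GL_TEXTURE_2D, tex[p.attach_tex], 0);
 *   glBindTexture(GL_TEXTURE_2D, tex[p.read_tex]);
 *   draw object i with a shader that samples, blends and writes
 */
```

One caveat the sketch does not handle: the destination texture still holds the image from two passes earlier, so each pass must also write every pixel outside the blended object (for example by drawing a full-screen quad that copies untouched pixels), or those pixels will be stale.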

Texture barrier

Core since version 4.5; core ARB extension: ARB_texture_barrier; vendor extension: NV_texture_barrier

This functionality changes the above in several ways. Note that even before it became core OpenGL, the NVIDIA extension was implemented on AMD hardware as well.

First, it expands the rules to say that it is OK to render to the same image that you read from, so long as you render to different parts of the image. So if you use the Viewport and Scissor Test to limit where you render to on the image, and use proper texture coordinates and filtering to limit where you read from, it is possible to avoid having to ping-pong between two separate textures.
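For example, splitting a texture into left and right halves gives each half its own viewport/scissor rectangle for writing and a texture coordinate range for reading. A minimal sketch, assuming GL_NEAREST filtering (or texelFetch): the read coordinates are kept at the centres of the region's edge texels so that no fetch lands on the other half; GL_LINEAR would need a larger inset because it reads a 2x2 footprint. The types and function names are hypothetical.

```c
/* Rectangle in pixels, as passed to glViewport and glScissor. */
typedef struct { int x, y, width, height; } pixel_rect;

/* Normalized texture-coordinate range [s0, s1] x [t0, t1]. */
typedef struct { float s0, t0, s1, t1; } texcoord_rect;

/* Region 0 is the left half of a tex_w x tex_h image, region 1 the right
   half (which gets the extra column when tex_w is odd). */
pixel_rect half_region(int tex_w, int tex_h, int region)
{
    int half = tex_w / 2;
    pixel_rect r;
    r.x = (region == 0) ? 0 : half;
    r.y = 0;
    r.width = (region == 0) ? half : tex_w - half;
    r.height = tex_h;
    return r;
}

/* Coordinates of the first and last texel centres of `r`. Clamping read
   coordinates to this range keeps GL_NEAREST fetches inside `r`. */
texcoord_rect inset_texcoords(pixel_rect r, int tex_w, int tex_h)
{
    texcoord_rect t;
    t.s0 = (r.x + 0.5f) / tex_w;
    t.t0 = (r.y + 0.5f) / tex_h;
    t.s1 = (r.x + r.width - 0.5f) / tex_w;
    t.t1 = (r.y + r.height - 0.5f) / tex_h;
    return t;
}

/*
 * Writing region dst while reading region 1 - dst:
 *   pixel_rect w = half_region(tex_w, tex_h, dst);
 *   glViewport(w.x, w.y, w.width, w.height);
 *   glEnable(GL_SCISSOR_TEST);
 *   glScissor(w.x, w.y, w.width, w.height);
 *   pass inset_texcoords(half_region(tex_w, tex_h, 1 - dst), tex_w, tex_h)
 *   to the shader (e.g. as a uniform) and clamp() the read coordinates to it.
 */
```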

This also allows you to use Cubemap Textures and Array Textures, where you attach one layer/face to the FBO and read from a different one. That is, undefined behavior is only triggered if the shader reads from texels that are written by rendering, whether by the current rendering command or by a prior one.

However, while the above allows you to read from one location and write to another, the restriction on reads applies across any number of rendering calls. If you wanted to implement ping-ponging as above, but between two regions of the same texture, a problem would occur as soon as you switched to reading from a region that was written by the first half of the ping-pong.

The second thing this functionality provides is a way to mitigate the aforementioned limitation. That is done via this function:

void glTextureBarrier(void);

This function states that all writes to framebuffer images due to rendering operations before this command is issued will become visible to reads from those images after this command. Therefore, if you want to ping-pong between two separate regions of the same image, the way it works is as follows:

  1. Read from region 0, blend and write to region 1.
  2. Call glTextureBarrier.
  3. Read from region 1, blend and write to region 0.
  4. Call glTextureBarrier.
  5. Repeat as needed for each blended object.

Without the barrier, this would not work.

The third thing this functionality changes is that you are permitted to perform a single read/modify/write operation between a texel fetch and a framebuffer image under the following conditions:

  • Each fragment shader reads only a single texel from the image being written to: specifically, the texel that that particular fragment shader will write to. This is easily done via texelFetch(sampler, ivec2(gl_FragCoord.xy), 0). Since gl_FragCoord is in window coordinates (the viewport transform has already been applied), this addresses the fragment's own texel even if the Viewport does not start at the 0,0 origin; if the attached image is not mipmap level 0, the LOD argument must select that level instead.
  • The read/modify/write happens only once between barrier calls. That is, you only perform one read/modify/write for each texel in the texture between calls to glTextureBarrier or other operations that ensure the visibility of writes. Note that even writing a value without reading counts, so you need a barrier before you start if you've already rendered to the image.

This generally means that you can only have a single layer of read/modify/write blending between calls to glTextureBarrier, so primitives within a single rendering command must not overlap. This is still sufficient for many complex blending algorithms.

With this functionality, you don't need to ping-pong between different regions at all. You just issue a barrier any time you are about to draw something that may overlap another blended object drawn since the last barrier. Thus, the above becomes:

  1. Render an object that reads from image, blends in the shader, and writes to image.
  2. Call glTextureBarrier, if the next object overlaps anything drawn since the last barrier.
  3. Render an object that reads from image, blends in the shader, and writes to image.
  4. Call glTextureBarrier, if the next object overlaps anything drawn since the last barrier.
  5. Repeat as needed for each blended object.
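The "only if it overlaps" decision in the steps above can be made on the CPU by tracking the window-space bounding boxes drawn since the last barrier. A minimal sketch, assuming a conservative bounding box is known for each object; `barrier_tracker` and its functions are hypothetical, and glTextureBarrier appears only in a comment. It does not check for overlap within a single object, which the rules above already forbid.

```c
#include <stdbool.h>

/* Window-space bounding box, half-open: [x0, x1) x [y0, y1). */
typedef struct { int x0, y0, x1, y1; } screen_box;

#define MAX_PENDING_BOXES 64

/* Hypothetical tracker of the boxes written since the last texture barrier. */
typedef struct {
    screen_box pending[MAX_PENDING_BOXES];
    int count;
} barrier_tracker;

static bool boxes_overlap(screen_box a, screen_box b)
{
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

/* Returns true if glTextureBarrier must be called before drawing an object
   covering `box`, i.e. `box` shares texels with something drawn since the
   last barrier. The box is then recorded as written. */
bool barrier_needed_before(barrier_tracker *t, screen_box box)
{
    bool needed = (t->count == MAX_PENDING_BOXES);  /* full: barrier, start over */
    for (int i = 0; i < t->count && !needed; ++i)
        needed = boxes_overlap(t->pending[i], box);
    if (needed)
        t->count = 0;  /* after the barrier, earlier writes are visible */
    t->pending[t->count++] = box;
    return needed;
}

/*
 * Usage (call glTextureBarrier once up front if the image was already
 * rendered to):
 *   barrier_tracker tracker = { .count = 0 };
 *   for each object:
 *       if (barrier_needed_before(&tracker, object_box))
 *           glTextureBarrier();
 *       draw the object (read own texel, blend, write)
 */
```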

Incoherent memory access

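There are a number of advanced operations that perform what we call "incoherent memory accesses":

  • Writes (atomic or otherwise) to images via Image Load Store.
  • Writes (atomic or otherwise) to buffer variables in Shader Storage Buffer Objects.
  • Atomic Counter operations.
  • Writes to shared variables in Compute Shaders.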

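These are called "incoherent" because these operations do not follow the normal OpenGL memory model, which is "coherent". If memory has been modified in an incoherent fashion, subsequent reads from that memory are not automatically guaranteed to see the changes. These reads could be from any OpenGL operation that reads from the memory. This includes, but is not limited to:

  • Other shader invocations within the same rendering command.
  • Shader invocations in later rendering commands.
  • Later OpenGL commands that read the memory, such as vertex attribute fetching from a written buffer, glReadPixels, glGetTexImage or glGetBufferSubData.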

Generally speaking, none of these operations is guaranteed to see the modified value. OpenGL will not protect you from the asynchronous nature of the OpenGL implementation.

There are ways to ensure the visibility of writes to various operations, but they all require the user to explicitly request them. It will not happen automatically.

Note: Time is not an effective method for ensuring visibility. It does not matter how many Rendering Commands have been issued since the incoherent write. You may have swapped buffers dozens of times and rendered millions of triangles. Realistically, the written value might actually become visible after enough time, if you keep the GPU busy enough. But OpenGL will not require it to be visible, and therefore, you cannot rely on such methods to achieve visibility.


Despite the above, there are some protections that OpenGL provides. What follows is a list of situations where reading memory that was written incoherently will be guaranteed to read the written value.

First, within a single shader invocation, if you perform an incoherent memory write, the value written will always be visible for reading. But only through that particular variable and only within the shader invocation that issued the write. You need not do anything special to make this happen. However, it is possible that, between writing and reading, another invocation may have stomped on that value. So long as that is not the case, reading it will produce the value you have written.

Second, if a shader invocation is being executed, then the shader invocations necessary to cause that invocation must have taken place. For example, in a fragment shader, you can assume that the vertex shaders to compute the vertices for the primitive being rasterized have completed. This is called a dependent invocation. They get to have special privileges in terms of ordering, but simply having a dependent invocation is not enough to ensure visibility.

Warning: This only applies to the shader invocations directly responsible for this shader invocation. Being in a fragment shader does not mean that all vertex shaders in a rendering command have completed. Nor does it mean that all vertex shaders for triangles issued before this particular triangle in the rendering command have completed. Only the ones needed for this particular triangle have been executed.
Note: Geometry shaders have a caveat here. A GS may write multiple vertices and primitives. Therefore, you may only assume that the GS executed just far enough to write enough vertices needed to render the fragment shader's primitive.

Third, sometimes a fragment shader is executed for the sole purpose of computing derivatives for neighboring fragment shader invocations (a so-called helper invocation). All incoherent memory writes (as well as coherent ones) made by such an invocation are ignored.

Invocation order and count

One problem with the above is what defines "subsequent reads". Sometimes this is obvious: if you issue a rendering command, any OpenGL commands you issue afterward are ordered after it. The OpenGL memory model may be incoherent with regard to data visibility, but the OpenGL commands themselves are still ordered.

When dealing with shader invocations, ordering is less certain. While OpenGL explicitly requires that commands complete in order, that does not mean that two (or more) commands cannot be executing concurrently. As such, shader invocations from one command may execute in tandem with shader invocations from other commands.

Consequently, it is unclear whether one shader invocation has taken place after another. This ordering issue only matters when one invocation writes a value and another tries to read it, since you cannot see a value that has not been written yet.

OpenGL allows drivers a lot of leeway on the ordering of shader invocations, as well as the number of invocations. Here is a list of the rules about ordering and invocations:

  1. You may not assume that a Vertex Shader will be executed exactly once for every vertex you pass it. It may be executed multiple times for the same vertex; conversely, in indexed rendering it is quite possible for re-used indices not to execute the vertex shader a second or third time.
  2. The same applies to Tessellation Evaluation Shaders.
  3. The number of Fragment Shader invocations generated from rasterizing a primitive depends on the Pixel Ownership Test, whether Early Depth Test is enabled, and whether the rendering is to a multisample buffer. When not using per-sample shading, the number of fragment shader invocations is undefined within a pixel area, but it must be between 1 and the number of samples in the buffer.
  4. Invocations of the same shader stage may be executed in any order. Even within the same draw call. This includes fragment shaders; writes to the framebuffer are ordered, but the actual fragment shader execution is not.
  5. Outside of invocations which are dependent (as defined above), invocations between stages may be executed in any order. This includes invocations launched by different rendering commands. While it is perhaps unlikely that vertex shaders from two different rendering operations run at the same time, it is possible, so OpenGL provides no guarantees.

Ensuring visibility

The term "visibility" refers to when someone (whether shader code or some other OpenGL operation) can safely access the value written incoherently by some shader invocation. There are two cases in which one can try to read incoherent data (outside of the few cases where visibility is guaranteed, stated above): visibility from within a rendering command (one part writes a value, and another part reads it), and visibility from subsequent OpenGL commands issued after the incoherent write.

Internal visibility

This visibility applies when the writing and reading invocations are in the same rendering command. In order to read such values properly, certain GLSL syntax is required.

To make writes from one invocation visible to reads from another, you must do two things. First, you must ensure that the write has actually happened. This requires following the ordering, as defined above.

Dependent shader invocations are ordered after the invocations they depend upon. Therefore, it is possible for them to read values written by invocations they depend on.

However, this is not the only possible ordering between invocations. In Compute Shaders, calling the barrier function ensures that all other invocations in a work group have reached that point in the computation. This ensures an ordering between invocations, as it requires all invocations in the work group to have executed at least that far before any of them can go farther.

After ensuring ordering, the other element that is needed for visibility is special GLSL syntax. The image or buffer variable being written to and read from must be qualified by the coherent qualifier. Note that both the writer and reader shaders must qualify their variables properly; otherwise, nothing is guaranteed.

Qualifying the variable with coherent alone is not enough, however. You also need to use a memory barrier; this lets OpenGL know that an invocation wants all previously executed writes (of some kind) to become visible to other shader invocations. The GLSL functions that do this have "memoryBarrier" (or "MemoryBarrier") in their names; they are distinct from the glMemoryBarrier API function. The various flavors of the function operate on different kinds of writes:

  • memoryBarrier: Provides a barrier for all of the below operations. This is the only function that doesn't require GL 4.3 or some 4.3 core extension.
  • memoryBarrierAtomicCounter: Provides a barrier for Atomic Counters.
  • memoryBarrierImage: Provides a barrier for image variables.
  • memoryBarrierBuffer: Provides a barrier for buffer variables.
  • memoryBarrierShared: Provides a barrier for Compute Shader shared variables.
  • groupMemoryBarrier: Provides a limited barrier. It creates visibility for all incoherent memory operations, but only within a Compute Shader work group. This can only be used in Compute Shaders.

Atomic Counter operations are functionally coherent, in that they are atomic (nothing can interfere with the read/modify/write operation). Even so, memory barriers can still be employed if you wish to ensure the ordering between two separate atomic operations. But most uses of atomic counters don't need that.

Note that atomic counters are different functionally from atomic image/buffer variable operations. The latter still need coherent qualifiers, barriers, and the like.

Note: While Compute Shaders need to use barriers to read shared variables, TCS writes to output variables are not incoherent. So there is no need for a memory barrier in a TCS if you want to read output variables, patch-qualified or not. The TCS barrier function is needed to ensure that the write has happened, and it also ensures that such writes are visible.

External visibility

Internal visibility only handles cases of shader-to-shader reading/writing where you can be certain of invocation order. If you want to establish visibility between shader invocations in two different rendering commands (which, as previously stated, have no ordering guarantees), or between one rendering command and some later OpenGL operation (such as a CPU readback via glReadPixels or glGetBufferSubData, or the GPU sourcing vertex data from the written buffer), you need to do something else.

This is external visibility: making an incoherent write from one command visible to a read in a later OpenGL command.

You might think that a Sync Object could ensure synchronization between commands. But there are two problems with that. First, it's incredibly expensive, because it means having to wait to issue the second command until the first completed. Second, it is insufficient, because data may still be in a GPU cache. Sync objects don't ensure cache coherency. So don't do that.

Instead, you must use a special OpenGL function, issued between the writing OpenGL call and the reading OpenGL call:

void glMemoryBarrier(GLbitfield barriers);

This function ensures the visibility of incoherent memory accesses to a wide variety of OpenGL operations, as listed on its documentation page. The thing to keep in mind about the various bits in the bitfield is this: they represent the operation you want to make the incoherent memory access visible to, that is, the operation that needs to see the results.

For example, if you do some image store operations on a texture, and then want to read it back onto the CPU via glGetTexImage, you would use GL_TEXTURE_UPDATE_BARRIER_BIT. If you did image load/store to a buffer, and then want to use it for vertex array data, you would use GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT. That's the idea.

Note that if you want image load/store operations from one command to be visible to image load/store operations from another command, you use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT. There are other similar bits for other incoherent memory accesses.
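Because the bits name the consumer rather than the producer, it can help to choose them from a description of how the data will be read next. A minimal sketch: the `next_read_kind` enumeration and the `barrier_bit_for`/`barrier_bits_for` functions are hypothetical, while the bit values match the standard OpenGL headers and are only defined locally when no such header has been included.

```c
/* Values as in the standard OpenGL headers; defined here only if missing. */
#ifndef GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT
#define GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT 0x00000001
#define GL_ELEMENT_ARRAY_BARRIER_BIT       0x00000002
#define GL_UNIFORM_BARRIER_BIT             0x00000004
#define GL_TEXTURE_FETCH_BARRIER_BIT       0x00000008
#define GL_SHADER_IMAGE_ACCESS_BARRIER_BIT 0x00000020
#define GL_COMMAND_BARRIER_BIT             0x00000040
#define GL_PIXEL_BUFFER_BARRIER_BIT        0x00000080
#define GL_TEXTURE_UPDATE_BARRIER_BIT      0x00000100
#define GL_BUFFER_UPDATE_BARRIER_BIT       0x00000200
#define GL_FRAMEBUFFER_BARRIER_BIT         0x00000400
#endif
#ifndef GL_SHADER_STORAGE_BARRIER_BIT
#define GL_SHADER_STORAGE_BARRIER_BIT      0x00002000
#endif

/* Hypothetical: how the incoherently written data will be read next. */
typedef enum {
    NEXT_READ_VERTEX_ATTRIBS,    /* buffer used as a vertex attribute source */
    NEXT_READ_INDEX_BUFFER,      /* buffer used as GL_ELEMENT_ARRAY_BUFFER */
    NEXT_READ_UNIFORM_BLOCK,     /* buffer used as a uniform buffer */
    NEXT_READ_TEXTURE_FETCH,     /* sampled with texture(), texelFetch(), ... */
    NEXT_READ_IMAGE_LOAD_STORE,  /* image load/store in a later command */
    NEXT_READ_BUFFER_VARIABLE,   /* shader storage buffer access, later command */
    NEXT_READ_INDIRECT_COMMAND,  /* indirect draw/dispatch parameters */
    NEXT_READ_PIXEL_BUFFER,      /* buffer bound as pixel pack/unpack buffer */
    NEXT_READ_TEXTURE_DOWNLOAD,  /* glGetTexImage, glTexSubImage*, ... */
    NEXT_READ_BUFFER_DOWNLOAD,   /* glGetBufferSubData, glMapBufferRange, ... */
    NEXT_READ_FRAMEBUFFER        /* texture later accessed as an FBO attachment */
} next_read_kind;

unsigned int barrier_bit_for(next_read_kind k)
{
    switch (k) {
    case NEXT_READ_VERTEX_ATTRIBS:   return GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT;
    case NEXT_READ_INDEX_BUFFER:     return GL_ELEMENT_ARRAY_BARRIER_BIT;
    case NEXT_READ_UNIFORM_BLOCK:    return GL_UNIFORM_BARRIER_BIT;
    case NEXT_READ_TEXTURE_FETCH:    return GL_TEXTURE_FETCH_BARRIER_BIT;
    case NEXT_READ_IMAGE_LOAD_STORE: return GL_SHADER_IMAGE_ACCESS_BARRIER_BIT;
    case NEXT_READ_BUFFER_VARIABLE:  return GL_SHADER_STORAGE_BARRIER_BIT;
    case NEXT_READ_INDIRECT_COMMAND: return GL_COMMAND_BARRIER_BIT;
    case NEXT_READ_PIXEL_BUFFER:     return GL_PIXEL_BUFFER_BARRIER_BIT;
    case NEXT_READ_TEXTURE_DOWNLOAD: return GL_TEXTURE_UPDATE_BARRIER_BIT;
    case NEXT_READ_BUFFER_DOWNLOAD:  return GL_BUFFER_UPDATE_BARRIER_BIT;
    case NEXT_READ_FRAMEBUFFER:      return GL_FRAMEBUFFER_BARRIER_BIT;
    }
    return 0xFFFFFFFF;  /* unknown kind: fall back to GL_ALL_BARRIER_BITS */
}

/* ORs together the bits for every way the data will be consumed; the
   result is what you would pass to glMemoryBarrier between the writing
   command and the reading command(s). */
unsigned int barrier_bits_for(const next_read_kind *kinds, int count)
{
    unsigned int bits = 0;
    for (int i = 0; i < count; ++i)
        bits |= barrier_bit_for(kinds[i]);
    return bits;
}
```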

Guidelines and use cases

Here are some basic use cases and how to synchronize them properly.

Read-only variables
Reading, even through Image Load Store or buffer variables, is never incoherent. So there is nothing to synchronize.
barrier invocation write/read
Use coherent and an appropriate memoryBarrier* or groupMemoryBarrier call if you use a mechanism like barrier to synchronize between invocations. You only need these in a Compute Shader, not a TCS-invoked barrier call.
Dependent invocation write/read
If you have one invocation which is dependent on another (the vertex shaders used to generate a primitive used for a fragment shader), then you need to use coherent on the variables and invoke an appropriate memoryBarrier* after you finish writing to the images of interest.
Shader writes, other shader in another rendering command reads
coherent is the wrong tool. Use glMemoryBarrier before issuing the reading rendering command, using the access bit appropriate for the reading operation.
Shader writes, other OpenGL operation(s) read
Again, coherent is the wrong tool. Use a glMemoryBarrier before performing the read, using the access bit appropriate for the reading operation.