Memory Model

From OpenGL.org
A Memory Model defines the rules under which writes to various stored object data become visible to later reads to that data. For example, the memory model defines when writes to a Texture attached to a Framebuffer Object become visible to subsequent reads from that texture.

Under normal circumstances, a coherent memory model is rigidly enforced by OpenGL. In general, if you write to an object, any command you issue later will see the new value. Because OpenGL behaves "as if" all operations happened in a specific sequence, it is up to the OpenGL implementation to make sure that all prior writes have occurred when you issue a read.

For example, if you fire off a rendering command, OpenGL will not have finished executing that command by the time the function returns. If you issue a glReadPixels operation (without doing an asynchronous read) to read from the framebuffer, it is now on the OpenGL implementation to synchronize with all outstanding write operations. OpenGL will therefore halt the CPU thread issuing the read until all rendering is done, then perform the read and return.

What follows will be a list of exceptions to the basic rule of "reads issued after a write will see the written data".

Multiple threads and contexts

It is possible to use multiple CPU threads in OpenGL. This raises a number of questions about synchronization and state visibility between threads.

The multithreading model that OpenGL uses is built on one fact: the same OpenGL context cannot be current within multiple threads simultaneously. While you can have multiple OpenGL contexts which are current in multiple threads, you cannot manipulate a single context simultaneously from two threads.

Of course, this raises the possibility of having a race condition on the currency of a context. That is, one thread is manipulating a context while another thread makes it current. Who wins?

The correct answer is that you lose. You should design your application such that this is impossible. It is up to the OpenGL application to avoid race conditions on the current context. This should be done by using appropriate synchronization primitives, which should be available to you from your language or threading library of choice.
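One way to make the race impossible is to treat each context as a resource with exactly one owner at a time. The following is a minimal sketch in C11, assuming an atomic flag is an acceptable synchronization primitive for your application. The names ctx_guard, ctx_guard_try_acquire and ctx_guard_release are hypothetical, and the platform call that actually makes the context current appears only in comments.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ownership token guarding ONE OpenGL context. */
typedef struct {
    atomic_flag owned;   /* set while some thread holds the context */
} ctx_guard;

#define CTX_GUARD_INIT { ATOMIC_FLAG_INIT }

/* Try to claim the context for the calling thread.
 * Returns true on success; only then may the caller make the context
 * current (wglMakeCurrent / glXMakeCurrent / eglMakeCurrent). */
static bool ctx_guard_try_acquire(ctx_guard *g)
{
    return !atomic_flag_test_and_set_explicit(&g->owned, memory_order_acquire);
}

/* Give the context up. The caller should first detach the context from
 * its own thread (make "no context" current), then call this. */
static void ctx_guard_release(ctx_guard *g)
{
    atomic_flag_clear_explicit(&g->owned, memory_order_release);
}
```

A thread that fails to acquire the guard must not issue any OpenGL call on that context; it can retry later or hand the work to the owning thread.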

Object content visibility

OpenGL Contexts can share objects. This means that it is possible for an object in one thread to be manipulated while it is being used in a second thread.

In terms of synchronization, OpenGL states that changes to an object only happen once the commands that make them have completed. Since each context has its own command stream, you will need to either use a Sync Object or glFinish to ensure that the command has executed. But you must also communicate with the other thread to let it know that the data has been updated. This will require using the appropriate language/library inter-thread communication features (mutexes, atomics, etc).

However, this is not enough to ensure that the changes are visible in the consuming thread. The OpenGL specification has a number of complex rules (OpenGL 4.5, Section 5.3.3, page 54) that determine when updates to an object in one thread will become visible to other threads.

The general gist of the rules is that each consuming context must bind the object to its context before the change becomes visible. The binding can either be direct binding (with a glBind*​ call) or indirect binding, by binding a container object that references the changed object.

Even if that object is already bound in thread 2 when the change takes place in thread 1, thread 2 must rebind the object to ensure the visibility of the updated data. The rebinding however does not need to rebind the object to every possible binding or attachment point it is associated with. Any one of them will cause the object's new data to be visible on the current thread.

Note that the above is true even in a single-threaded case. This would happen if you have multiple contexts in one thread that share objects. If you change an object's data in context 1, then make context 2 current, the object's data will only become visible in the new context if you rebind it.
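The CPU-side half of these rules can be sketched as a small handshake. This is only an illustration under stated assumptions: shared_object_state, producer_publish and consumer_needs_rebind are hypothetical names, a C11 atomic counter stands in for whatever inter-thread mechanism you prefer, and the OpenGL calls themselves are shown in comments because they depend on the object type.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one shared object, visible to both threads. */
typedef struct {
    atomic_uint generation;   /* bumped once per completed update */
} shared_object_state;

/* Producer side. Call only AFTER the update commands have completed in the
 * producing context, e.g. after glFinish() or after waiting on a fence
 * created with glFenceSync(). */
static void producer_publish(shared_object_state *s)
{
    atomic_fetch_add_explicit(&s->generation, 1u, memory_order_release);
}

/* Consumer side. Returns true if the object changed since *last_seen was
 * recorded. The caller must then re-bind the object in ITS context
 * (any one glBind* call, or binding a container object that references it)
 * before reading the new contents. */
static bool consumer_needs_rebind(shared_object_state *s, unsigned *last_seen)
{
    unsigned now = atomic_load_explicit(&s->generation, memory_order_acquire);
    if (now == *last_seen)
        return false;
    *last_seen = now;
    return true;
}
```

The counter only tells the consumer that a rebind is required; it does not replace the sync object or glFinish on the producing side.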

Sync objects

The only objects that don't work this way are Sync Objects, specifically when multiple contexts are blocked on the same sync object. When the sync object becomes signaled in one context, it becomes signaled in all contexts that are currently blocked on that object.

Framebuffer objects

When performing a rendering operation to images attached to a Framebuffer Object, attempts to access pixels from those images via texture or image fetches will result in undefined values. Such fetches from images that are being written to are not coherent. Note that this only concerns fetches due to rendering operations; image reads via Pixel Transfer operations (even asynchronous ones) will work just fine (though non-async reads will stall the CPU to wait for the GPU to finish rendering).

OpenGL is quite strict, and quite conservative, about fetching data from attached textures. It states that the value of a texture being accessed is undefined if it is at all theoretically possible for any texture fetch executed by the shaders to access any part of any image being written in the rendering operation.

If an image is attached to a framebuffer object, then to get defined behavior from reading from that texture (or any texture whose storage provides access to that image), you must use the texture mipmap range specifiers to make it actively impossible to access any mipmap levels that are attached. Alternatively, you may use nearest or linear minification filtering, and fetch from mipmap levels other than those which are attached.

You will get undefined fetches when using Array Textures or Cubemap Textures, even if you have attached one array layer and are fetching only from a different one. So long as the fetch and the attached layer are both in the same mipmap level, you will get undefined behavior.

That being said, view textures can help resolve this problem. If you attach one array layer/face to an FBO, and fetch from a view texture which uses a different array layer/face as its source, then the fetching and writing will work.

Note: All that it takes to trigger undefined fetches is for the image to be attached, even if you are not rendering to it. So the draw buffers state for the framebuffer is irrelevant. If it is attached to the FBO currently being rendered to, and you try to read from it, you get undefined behavior. Similarly, using Write Masks will also not prevent undefined behavior.

Once those images are no longer being written to, then rendering commands made after that change will be able to read the values written by rendering commands prior to the change. Things that change what is being rendered to include:

  • Binding a new FBO that doesn't use the image(s) in question.
  • Detaching the image from the FBO altogether.

Because of this, if you want to implement programmatic Blending (more complex blend functions than OpenGL provides) operations, you will often need to "ping-pong" between two textures. The algorithm works like this:

  1. Read from texture 0, blend and write to texture 1.
  2. Bind texture 1 for reading.
  3. Change the FBO's attachment to texture 0 (remember: just calling glDrawBuffers isn't enough).
  4. Read from texture 1, blend and write to texture 0.
  5. Bind texture 0 for reading.
  6. Change the FBO's attachment to texture 1 (remember: just calling glDrawBuffers isn't enough).
  7. Repeat as needed for each blended object.
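The bookkeeping for the steps above can be sketched as follows. This is a minimal illustration, not a full renderer: pingpong_pass and pingpong_select are hypothetical names, the two textures are assumed to be 2D textures of identical size and format, and the OpenGL calls that consume the result are shown in comments.

```c
/* Which texture to sample and which to attach for one blending pass. */
typedef struct {
    unsigned read_tex;    /* texture name to bind for reading   */
    unsigned write_tex;   /* texture name to attach to the FBO  */
} pingpong_pass;

/* Pass 0 reads tex[0] and writes tex[1]; every later pass swaps the roles. */
static pingpong_pass pingpong_select(const unsigned tex[2], unsigned pass)
{
    pingpong_pass p;
    p.read_tex  = tex[pass % 2u];
    p.write_tex = tex[(pass + 1u) % 2u];
    return p;
}

/* Per pass, with the FBO bound to GL_DRAW_FRAMEBUFFER:
 *
 *   pingpong_pass p = pingpong_select(tex, pass);
 *   glBindTexture(GL_TEXTURE_2D, p.read_tex);
 *   glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
 *                          GL_TEXTURE_2D, p.write_tex, 0);
 *   ...draw the blended object...
 *
 * Changing the attachment, not glDrawBuffers, is what makes the previous
 * pass's writes readable.
 */
```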

Texture barrier

Core since version 4.5. Core ARB extension: ARB_texture_barrier. Vendor extension: NV_texture_barrier.

This functionality changes the above in several ways. Note that, even before this functionality was core OpenGL, the NVIDIA extension was implemented even on AMD hardware.

First, it expands the rules to say that it is OK to render to the same image that you read from, so long as you render to different parts of the image. So if you use the Viewport and Scissor Test to limit where you render to on the image, and use proper texture coordinates and filtering to limit where you read from, it is possible to avoid having to ping-pong between two separate textures.

This also allows you to use Cubemap Textures and Array Textures, where you attach one layer/face to the FBO and read from a different one. That is, the undefined behavior is only triggered if the shader attempts to read from texels that were written by a prior rendering call.

However, while the above allows you to read from one location and write to another, the restriction on reading texels that have already been written holds no matter how many rendering calls separate the write from the read. If you wanted to implement ping-ponging as above, but within two regions of the same texture, a problem would occur when you wanted to switch to reading from an area that was written to by the first part of the ping-pong action.

The second thing this functionality provides is a way to mitigate the aforementioned limitation. That is done via this function:

void glTextureBarrier(void);

This function states that all writes to framebuffer images due to rendering operations before this command is issued will become visible to reads from those images after this command. Therefore, if you want to ping-pong between two separate regions of the same image, the way it works is as follows:

  1. Read from region 0, blend and write to region 1.
  2. Call glTextureBarrier​.
  3. Read from region 1, blend and write to region 0.
  4. Call glTextureBarrier​.
  5. Repeat as needed for each blended object.

Without the barrier, this would not work.
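As a sketch of the region bookkeeping, assume the two regions are the left and right halves of a single texture. The names region, pingpong_region and pingpong_regions_for_pass are hypothetical, and the OpenGL calls that would use the rectangles are shown in comments.

```c
/* A rectangle in texel/window coordinates, as glViewport and glScissor expect. */
typedef struct {
    int x, y, width, height;
} region;

/* Region 0 is the left half of the texture, region 1 the right half.
 * With an odd width, the last column is simply left unused. */
static region pingpong_region(int tex_width, int tex_height, int index)
{
    int half = tex_width / 2;
    region r;
    r.x = (index == 0) ? 0 : half;
    r.y = 0;
    r.width = half;
    r.height = tex_height;
    return r;
}

/* Pass 0 reads region 0 and writes region 1; each later pass swaps them. */
static void pingpong_regions_for_pass(int tex_width, int tex_height,
                                      unsigned pass,
                                      region *read_from, region *write_to)
{
    *read_from = pingpong_region(tex_width, tex_height, (int)(pass % 2u));
    *write_to  = pingpong_region(tex_width, tex_height, (int)((pass + 1u) % 2u));
}

/* Per pass:
 *
 *   glViewport(w.x, w.y, w.width, w.height);
 *   glScissor(w.x, w.y, w.width, w.height);   // with GL_SCISSOR_TEST enabled
 *   ...draw, sampling only texels inside the read region...
 *   glTextureBarrier();
 */
```

Filtering that reaches across the seam between the two halves would read texels from the region being written, so the sampling coordinates need to stay strictly inside the read region.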

The third thing this functionality changes is that you are permitted to perform a single read/modify/write operation between a texel fetch and a framebuffer image under the following conditions:

  • Each fragment shader reads only from a single texel from the image being written to. Specifically, the texel that that particular fragment shader will write to. This is easily done via texelFetch(sampler, ivec2(gl_FragCoord.xy), 0)​, though if the Viewport is adjusted from the 0,0 origin, you may need to bias the value by the viewport.
  • The read/modify/write happens only once between barrier calls. That is, you only perform one read/modify/write for each texel in the texture between calls to glTextureBarrier​ or other operations that ensure the visibility of writes. Note that even writing a value without reading counts, so you need a barrier before you start if you've already rendered to the image.

This generally means that you can only have a single layer of read/modify/write blending between calls to glTextureBarrier, so there can be no overlapping within a single render call. This is sufficient for many complex blending algorithms.

Incoherent memory access

There are a number of advanced operations that perform what we call "incoherent memory accesses", such as writes through image load/store and through Shader Storage Buffer Objects.

When you perform any of these operations, any subsequent reads from almost anywhere are not guaranteed to see them. And by "almost anywhere", this includes (but is not limited to):

  • Image load operations to that memory location from anywhere other than this particular shader invocation, using the specific image variable used to write the data.
  • SSBO reading operations to that memory location from anywhere other than this particular shader invocation, using the specific buffer variable used to write the data.
  • Reads from the texture via glGetTexImage, or if it is attached to an FBO, glReadPixels.
  • Texture reads via samplers.
  • If the image was a buffer texture, any form of reading from that buffer, such as using it for a Vertex Buffer Object or Uniform Buffer Object.
  • Blending
  • Depth Tests
  • Stencil Tests

In short, you get almost nothing. Everything is asynchronous, and OpenGL will not protect you from this fact. All of these can be alleviated, but only specifically at the request of the user. It will not happen automatically.

Guarantees

Despite the above, there are some protections that OpenGL provides. What follows is a list of the guarantees the specification does make about when incoherently written data will be accessible.

First, within a single shader invocation, if you perform an incoherent memory write, the value written will always be visible for reading through that variable. You need not do anything special to make this happen. However, it is possible that, between writing and reading, another invocation may have stomped on that value. So long as that is not the case, reading it will produce the value you have written.

Second, if a shader invocation is being executed, then the shader invocations necessary to cause that invocation must have taken place. For example, in a fragment shader, you can assume that the vertex shaders to compute the vertices for the primitive being rasterized have completed. This is called a dependent invocation. They get to have special privileges in terms of ordering.

Warning: This only applies to the shader invocations directly responsible for this shader invocation. Being in a fragment shader does not mean that all vertex shaders in a rendering command have completed. Nor does it mean that all vertex shaders for triangles issued before this particular triangle in the rendering command have completed. Only the ones needed for this particular triangle have been executed.
Note: Geometry shaders have a caveat here. A GS may write multiple vertices and primitives. Therefore, you may only assume that the GS executed just far enough to write enough vertices needed to render the fragment shader's primitive.

Third, sometimes a fragment shader is executed for the sole purpose of computing derivatives for neighboring fragment shader invocations. All incoherent memory writes (as well as coherent memory writes) will be ignored by that invocation.

Invocation order and count

One problem with the above is what defines "subsequent invocations". OpenGL allows implementations a lot of leeway on the ordering of shader invocations, as well as the number of invocations. Here is a list of the rules:

  1. You may not assume that a Vertex Shader will be executed only once for every vertex you pass it. It may be executed multiple times for the same vertex. Conversely, in indexed rendering, a re-used index may not cause the vertex shader to execute a second or third time.
  2. The same applies to Tessellation Evaluation Shaders.
  3. The number of Fragment Shader invocations generated from rasterizing a primitive depends on the Pixel Ownership Test, whether Early Depth Test is enabled, and whether the rendering is to a multisample buffer. When not using per-sample shading, the number of fragment shader invocations is undefined within a pixel area, but it must be between 1 and the number of samples in the buffer.
  4. Invocations of the same shader stage may be executed in any order. Even within the same draw call. This includes fragment shaders; writes to the framebuffer are ordered, but the actual fragment shader execution is not.
  5. Outside of invocations which are dependent (as defined above), invocations between stages may be executed in any order. This includes invocations launched by different rendering commands. While it may seem unlikely that two vertex shaders from different rendering operations could be running at the same time, it is entirely possible, so OpenGL provides no guarantees.

Ensuring visibility

The term "visibility" represents when someone (whether shader code or something else) can safely access the value written incoherently by some shader invocation. There are two tools to ensure visibility, each intended for a different situation: the coherent qualifier and the glMemoryBarrier function.

coherent​ is used on image or buffer variables, such that writes to coherent​ qualified variables will be read correctly by coherent​ qualified variables in another invocation. Note that this requires the coherent​ qualifier on both the writer and the reader; if one of them doesn't have it, then nothing is guaranteed.

Note that coherent​ does not ignore all of the prior rules. In order for a write to become visible to an invocation, it must first have happened. Therefore, coherent​ can only really work if you know that the writing invocation has executed. Which usually means that only dependent invocations (as stated above) can read memory written by the invocations they depend on.

There are other times you can know that a write has happened. In Compute Shaders, the barrier​ function ensures that all other invocations in a work group have reached that point in the computation.

Qualifying the variable with coherent alone is not enough, however. You also need to use a memory barrier; this will let OpenGL know that an invocation wants all previously executed writes (of some kind) to become visible to another shader invocation. The GLSL functions that do this have the word "memoryBarrier" in them (no relation to the glMemoryBarrier API function). The various flavors of the function operate on different kinds of writes:

memoryBarrier​
Provides a barrier for all of the below operations. This is the only function that doesn't require GL 4.3 or some 4.3 core extension.
memoryBarrierAtomicCounter​
Provides a barrier for Atomic Counters.
memoryBarrierImage​
Provides a barrier for image variables.
memoryBarrierBuffer​
Provides a barrier for buffer variables.
memoryBarrierShared​
Provides a barrier for Compute Shader shared​ variables.
groupMemoryBarrier​
Provides a limited barrier. It creates visibility for all incoherent memory operations, but only within a Compute Shader work-group. This can only be used in Compute Shaders.
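As an illustration of how these functions are typically combined, here is a small compute shader kept as a C string, ready to be passed to glShaderSource. It is a sketch under stated assumptions: the name kReverseTileSource, the work group size of 64 and the buffer layout are all hypothetical, and the dispatch is assumed to cover the buffer exactly. Each invocation writes one shared value, then reads a value written by a different invocation in the same work group.

```c
/* GLSL compute shader source (requires GL 4.3 or ARB_compute_shader). */
static const char *const kReverseTileSource =
    "#version 430\n"
    "layout(local_size_x = 64) in;\n"
    "layout(std430, binding = 0) buffer Data { uint values[]; };\n"
    "shared uint tile[64];\n"
    "void main() {\n"
    "    uint i = gl_LocalInvocationID.x;\n"
    "    tile[i] = values[gl_GlobalInvocationID.x];\n"
    "    memoryBarrierShared(); // make this invocation's shared write visible\n"
    "    barrier();             // wait until every invocation has written\n"
    "    values[gl_GlobalInvocationID.x] = tile[63u - i];\n"
    "}\n";
```

The memory barrier handles visibility and barrier() handles ordering. The buffer itself needs no coherent qualifier here, because each invocation only touches its own element of values.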

Atomic Counter operations are functionally coherent, in that they are atomic (nothing can interfere with the read/modify/write operation). Memory barriers can still be employed if you wish to ensure the ordering between two separate atomic operations. But most uses of atomic counters don't need that.

Note that atomic counters are different functionally from atomic image/buffer variable operations. Those still need coherent​ and the above rules.

Note: Tessellation Control Shaders have their own barrier​ function, such that output variables written to by one invocation in a patch can be read by others in that same patch. These variables are effectively always coherent, and the TCS version of barrier​ also incorporates an implicit memory barrier for all output variables. As such, TCSs within a patch do not need to use the above memory barriers unless they are writing to external memory.

External visibility

coherent is only useful in cases of shader-to-shader reading/writing where you can be certain of invocation order. If you want to establish visibility between two different rendering commands (which, as previously stated, have no ordering guarantees), or if you want to establish visibility between one rendering command and some later OpenGL operation (such as a read back to the CPU via glReadPixels or glGetBufferSubData), you need to do something else.

You might think that a Sync Object could ensure synchronization between commands. But there are two problems with that. First, it's incredibly expensive, because it means having to wait to issue the second command until the first completed. Second, it does not work (because data may still be in a GPU cache; sync objects don't ensure cache coherency). So don't do that.

Instead, you must use a special OpenGL function:

void glMemoryBarrier(GLbitfield barriers);

This function is a way of ensuring the visibility of incoherent memory access operations with a wide variety of OpenGL operations, as listed on the documentation page. The thing to keep in mind about the various bits in the bitfield is this: they represent the operation you want to make the incoherent memory access visible to. That is, each bit names the operation that needs to see the results, not the operation that wrote them.

For example, if you do some image store operations on a texture, and then want to read it back onto the CPU via glGetTexImage, you would use the GL_TEXTURE_UPDATE_BARRIER_BIT​. If you did image load/store to a buffer, and then want to use it for vertex array data, you would use GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT​. That's the idea.

Note that if you want image load/store operations from one command to be visible to image load/store operations from another command, you use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT​. There are other similar bits for other incoherent memory accesses.
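The "name the reader, not the writer" rule can be sketched as a small lookup. The enum next_reader and the function barrier_bit_for are hypothetical helpers; the GL_*_BARRIER_BIT values are the ones published in the OpenGL registry and are only defined here so that the sketch compiles without the GL headers.

```c
/* Normally supplied by the OpenGL headers; values from the OpenGL registry. */
#ifndef GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT
#define GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT 0x00000001
#endif
#ifndef GL_TEXTURE_FETCH_BARRIER_BIT
#define GL_TEXTURE_FETCH_BARRIER_BIT       0x00000008
#endif
#ifndef GL_SHADER_IMAGE_ACCESS_BARRIER_BIT
#define GL_SHADER_IMAGE_ACCESS_BARRIER_BIT 0x00000020
#endif
#ifndef GL_TEXTURE_UPDATE_BARRIER_BIT
#define GL_TEXTURE_UPDATE_BARRIER_BIT      0x00000100
#endif
#ifndef GL_SHADER_STORAGE_BARRIER_BIT
#define GL_SHADER_STORAGE_BARRIER_BIT      0x00002000
#endif

/* Hypothetical description of the operation that will READ the data next. */
typedef enum {
    READER_VERTEX_ATTRIB,     /* buffer used as vertex array data      */
    READER_SAMPLER,           /* texture fetch through a sampler       */
    READER_IMAGE_LOAD_STORE,  /* image load/store in a later command   */
    READER_GET_TEX_IMAGE,     /* glGetTexImage read back to the CPU    */
    READER_SHADER_STORAGE     /* SSBO access in a later command        */
} next_reader;

/* Pick the glMemoryBarrier bit that matches the upcoming reader. */
static unsigned barrier_bit_for(next_reader reader)
{
    switch (reader) {
    case READER_VERTEX_ATTRIB:    return GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT;
    case READER_SAMPLER:          return GL_TEXTURE_FETCH_BARRIER_BIT;
    case READER_IMAGE_LOAD_STORE: return GL_SHADER_IMAGE_ACCESS_BARRIER_BIT;
    case READER_GET_TEX_IMAGE:    return GL_TEXTURE_UPDATE_BARRIER_BIT;
    case READER_SHADER_STORAGE:   return GL_SHADER_STORAGE_BARRIER_BIT;
    }
    return 0u;
}
```

A typical use would be glMemoryBarrier(barrier_bit_for(READER_GET_TEX_IMAGE)) between the command that writes and the glGetTexImage call. If several kinds of readers follow, their bits can be combined with bitwise OR.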

Guidelines and use cases

Here are some basic use cases and how to synchronize them properly.

Read-only variables
If a shader only reads, then it does not need any form of synchronization for visibility. Even if you modify objects via OpenGL commands (glTexSubImage2D, for example), OpenGL requires that reads remain properly synchronized.
barrier​ invocation write/read
Use coherent and an appropriate memoryBarrier* or groupMemoryBarrier call if you use a mechanism like barrier to synchronize between invocations. Remember that shared variables are incoherent, but Tessellation Control Shader outputs (per-vertex and per-patch) are coherent, so for those you don't need a memory barrier. You only need to use barrier to ensure that the write has actually happened.
Dependent invocation write/read
If you have one invocation which is dependent on another (the vertex shaders used to generate a primitive used for a fragment shader), then you need to use coherent​ on the variables and invoke an appropriate memoryBarrier*​ after you finish writing to the images of interest.
Shader write/read between rendering commands
One Rendering Command writes incoherently, and the other reads. There is no need for coherent​ here at all. Just use glMemoryBarrier before issuing the reading rendering command, using the appropriate access bit.
Shader writes, other OpenGL operations read
Again, coherent​ is not necessary. You must use a glMemoryBarrier before performing the read, using a bitfield that is appropriate to the reading operation of interest.