A Memory Model defines the rules under which writes to various stored object data become visible to later reads to that data. For example, the memory model defines when writes to a Texture attached to a Framebuffer Object become visible to subsequent reads from that texture.
Under normal circumstances, OpenGL rigidly enforces a coherent memory model. In general, if you write to an object, any reads you issue later will see that write. Because OpenGL behaves "as if" all operations happened in a specific sequence, it is up to the OpenGL implementation to make sure that all prior writes have completed before a read you issue is performed.
For example, if you fire off a rendering command, OpenGL will generally not have finished executing that command by the time the function returns. If you then issue a glReadPixels operation (without doing an asynchronous read) to read from the framebuffer, it falls to the OpenGL implementation to synchronize with all outstanding write operations. OpenGL will wait until all rendering is done, then perform the read and return.
What follows will be a list of exceptions to the basic rule of "reads issued after a write will see the written data".
Contexts and object state
Outputs written by Fragment Shaders to images attached to Framebuffer Objects are not coherent with reads from those same images in shaders. This holds only while those images are part of the framebuffer being written. Attempting to read from the same image that is currently being rendered to yields undefined results.
Once those images are no longer being written to, then rendering commands made after that change will be able to read the values written by rendering commands prior to the change. Things that change what is being rendered to include:
- Binding a new FBO that doesn't use the image(s) in question.
- Removing the image from the FBO altogether.
- Using glDrawBuffers to change which buffers are rendered to, so that the particular image isn't a destination for rendering.
Note that this is about a specific image in a Texture. It is perfectly valid to read from a texture in a shader that writes to the same texture, but only if the read and write are to two different images in the texture. This is useful for performing specialized mipmap filtering operations or the like; you can read from a lower (in value, larger in size) mipmap level and write to a higher one. You will of course need to ensure that you do not read from the mipmap level currently attached as the render target; sampler functions like texelFetch or textureLod will help.
Because of this, if you want to implement programmable Blending operations (blend functions more complex than the fixed-function ones), you will often need to "ping-pong" between two textures. The algorithm works like this:
- Read from texture 0, blend and write to texture 1.
- Bind texture 1 for reading.
- Switch the FBO to write to texture 0.
- Read from texture 1, blend and write to texture 0.
- Bind texture 0 for reading.
- Switch the FBO to write to texture 1.
- Repeat as needed for each blended object.
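The ping-pong steps above can be sketched in C as follows. This is a minimal sketch, not a definitive implementation: the `PingPongPass` and `PingPongHooks` types and both functions are hypothetical names introduced for illustration, and the hooks stand in for whatever real calls an application would use (for instance glBindTexture, glFramebufferTexture2D and a draw call).

```c
#include <stddef.h>

/* Which texture a given pass reads from and which it renders to. */
typedef struct {
    unsigned read_tex;
    unsigned write_tex;
} PingPongPass;

/* Even passes read tex0 and write tex1; odd passes do the reverse. */
PingPongPass ping_pong_pass(unsigned tex0, unsigned tex1, size_t pass_index)
{
    PingPongPass pass;
    if (pass_index % 2 == 0) {
        pass.read_tex = tex0;
        pass.write_tex = tex1;
    } else {
        pass.read_tex = tex1;
        pass.write_tex = tex0;
    }
    return pass;
}

/* Hypothetical hooks standing in for the real OpenGL calls. */
typedef struct {
    void (*bind_for_reading)(unsigned tex);   /* e.g. glBindTexture */
    void (*attach_for_writing)(unsigned tex); /* e.g. glFramebufferTexture2D */
    void (*draw_blended)(size_t object);      /* draw call for one object */
} PingPongHooks;

/* Blends object_count objects one at a time and returns the texture
   that holds the final result (tex0 if nothing was drawn). */
unsigned ping_pong_blend(const PingPongHooks *hooks,
                         unsigned tex0, unsigned tex1, size_t object_count)
{
    unsigned result = tex0;
    size_t i;
    for (i = 0; i < object_count; ++i) {
        PingPongPass pass = ping_pong_pass(tex0, tex1, i);
        hooks->bind_for_reading(pass.read_tex);
        hooks->attach_for_writing(pass.write_tex);
        hooks->draw_blended(i);
        result = pass.write_tex;
    }
    return result;
}
```

Because the image being read is never attached as a render target during the same pass, each pass reads only data written by earlier, already-switched passes.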
The NV_texture_barrier extension changes the above in several ways. Note that, despite being an NVIDIA extension, it is implemented on AMD hardware as well.
First, it expands the rules to say that it is OK to render to the same image that you read from, so long as you render to different parts of the image. So if you use glViewport to limit where you render to on the image, and use proper texture coordinates and filtering to limit where you read from, it is possible to avoid having to ping-pong between two separate textures. You only have to jump between two parts of the same texture.
However, the requirement that reads and writes touch different parts of the image applies across any number of rendering commands, not just within one. So if you want to ping-pong between parts of the same texture, you have to do something more than just render to different areas and switch whenever you want.
The second thing the extension changes is that it provides access to this function:
void glTextureBarrierNV(void);
This function states that all writes to framebuffer images due to rendering operations before this command executed will become visible to reads from those images after this command. Therefore, if you want to ping-pong between two separate parts of the same image, the way it works is as follows:
- Read from location 0, blend and write to location 1.
- Call glTextureBarrierNV.
- Read from location 1, blend and write to location 0.
- Call glTextureBarrierNV.
- Repeat as needed for each blended object.
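A minimal C sketch of this single-texture ping-pong, assuming the image is split into a left and a right half. The `Rect`, `BarrierPass` and `BarrierHooks` types and both functions are hypothetical names for illustration; the hooks stand in for glViewport, a draw call that restricts its texture reads to the given region, and glTextureBarrierNV.

```c
/* A rectangle in texel coordinates, laid out like glViewport arguments. */
typedef struct {
    int x, y, width, height;
} Rect;

/* The region a pass reads from and the region it renders to. */
typedef struct {
    Rect read_region;
    Rect write_region;
} BarrierPass;

/* Even passes read the left half and write the right half; odd passes
   do the reverse. For odd widths the last column is simply unused. */
BarrierPass barrier_pass(int tex_width, int tex_height, unsigned pass_index)
{
    int half = tex_width / 2;
    Rect left = {0, 0, 0, 0};
    Rect right = {0, 0, 0, 0};
    BarrierPass pass;

    left.width = half;
    left.height = tex_height;
    right.x = half;
    right.width = half;
    right.height = tex_height;

    if (pass_index % 2u == 0u) {
        pass.read_region = left;
        pass.write_region = right;
    } else {
        pass.read_region = right;
        pass.write_region = left;
    }
    return pass;
}

/* Hypothetical hooks standing in for the real OpenGL calls. */
typedef struct {
    void (*set_viewport)(Rect r);                      /* e.g. glViewport */
    void (*draw_blended)(unsigned object, Rect reads); /* draw, reading only 'reads' */
    void (*texture_barrier)(void);                     /* e.g. glTextureBarrierNV */
} BarrierHooks;

/* One draw per object, with a barrier after every pass so that the next
   pass may read what this pass wrote. */
void blend_in_place(const BarrierHooks *hooks,
                    int tex_width, int tex_height, unsigned object_count)
{
    unsigned i;
    for (i = 0; i < object_count; ++i) {
        BarrierPass pass = barrier_pass(tex_width, tex_height, i);
        hooks->set_viewport(pass.write_region);
        hooks->draw_blended(i, pass.read_region);
        hooks->texture_barrier();
    }
}
```

The barrier sits between every pair of passes because the region written by one pass becomes the region read by the next.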
Without the barrier, this would not work.
The third thing the extension changes is that you can perform a single read/modify/write operation under the following conditions:
- Each fragment shader reads from a single texel. Namely, the texel that that particular fragment shader will write to. This is easily done via texelFetch(sampler, ivec2(gl_FragCoord.xy), 0).
- It happens only once between write synchronization events. That is, you only perform one read/modify/write for each texel in the texture between calls to glTextureBarrierNV or other operations that ensure the visibility of writes. Note that simply writing without reading counts, so you need a barrier before you start if you've already rendered to the image.
This generally means that you can only have a single layer of read/modify/write blending between calls to glTextureBarrierNV. So no overlapping within a single render. In many cases, this is sufficient for more complex algorithms.
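As an illustration of the single read/modify/write case, here is a sketch of a fragment shader, embedded as a C string the way shader sources are often stored. The names `kReadModifyWriteFS`, `target`, `srcColor` and `outColor` are assumptions made for this example, and the blend equation is only a placeholder for whatever custom blend is wanted. It assumes `target` samples the same image that is attached as color attachment 0.

```c
/* GLSL fragment shader source: each invocation fetches exactly the texel
   it is about to overwrite, modifies it, and writes it back. */
static const char *const kReadModifyWriteFS =
    "#version 330 core\n"
    "uniform sampler2D target; // assumed: same image as color attachment 0\n"
    "in vec4 srcColor;\n"
    "layout(location = 0) out vec4 outColor;\n"
    "void main() {\n"
    "    // Read only the texel this fragment will write.\n"
    "    vec4 dst = texelFetch(target, ivec2(gl_FragCoord.xy), 0);\n"
    "    // Placeholder blend: source over destination.\n"
    "    outColor = srcColor * srcColor.a + dst * (1.0 - srcColor.a);\n"
    "}\n";
```

Geometry drawn with such a shader must not overlap itself within one pass; overlapping layers need a glTextureBarrierNV call between them.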
Incoherent memory access
There are a number of advanced operations that perform what we call "incoherent memory accesses":
- Writes (atomic or otherwise) via Image Load Store
- Writes (atomic or otherwise) via Shader Storage Buffer Objects
- Writes to variables declared as shared in Compute Shaders (but not writes to Tessellation Control Shader outputs, including patch outputs)
When you perform any of these operations, subsequent reads from almost anywhere are not guaranteed to see the written data. "Almost anywhere" includes (but is not limited to):
- Image load operations to that memory location from anywhere other than this particular shader invocation, using the specific image variable used to write the data.
- SSBO reading operations to that memory location from anywhere other than this particular shader invocation, using the specific buffer variable used to write the data.
- Reads from the texture via glGetTexImage, or if it is bound to an FBO, glReadPixels.
- Texture reads via samplers.
- If the image was a buffer texture, any form of reading from that buffer, such as using it for a Vertex Buffer Object or Uniform Buffer Object.
In short, you get almost nothing. Everything is asynchronous, and OpenGL will not protect you from this fact. All of these can be alleviated, but only specifically at the request of the user. It will not happen automatically.
Despite the above, there are some protections that OpenGL provides. What follows is a list of things that the specification does require incoherent memory accesses to guarantee about when data will be accessible.
First, within a single shader invocation, if you perform an incoherent memory write, a later read through the same variable in that invocation will always see it. You need not do anything special to make this happen. However, it is possible that, between the write and the read, another invocation has overwritten that value.
Second, if a shader invocation is being executed, then the shader invocations necessary to cause that invocation must have taken place. For example, in a fragment shader, you can assume that the vertex shaders that computed the vertices of the primitive being rasterized have completed. The fragment shader invocation is said to be dependent on those vertex shader invocations, and dependent invocations get special privileges in terms of ordering.
Third, sometimes a fragment shader invocation is executed for the sole purpose of computing derivatives for neighboring invocations. All incoherent memory writes (as well as coherent memory writes) performed by such an invocation will be ignored.
Invocation order and count
One problem with the above is what defines "subsequent invocations". OpenGL allows implementations a lot of leeway on the ordering of shader invocations, as well as the number of invocations. Here is a list of the rules:
- You may not assume that a vertex shader will be executed only once for every vertex you pass it. It may be executed multiple times for the same vertex. In indexed rendering scenarios, it is very possible for re-used indices to not execute the vertex shader a second or third time.
- The same applies to tessellation evaluation shaders.
- The number of fragment shader invocations generated from rasterizing a primitive depends on the pixel ownership test, whether early depth test is enabled, and whether the rendering is to a multisample buffer. When not using per-sample shading, the number of fragment shader invocations is undefined within a pixel area, but it must be between 1 and the number of samples in the buffer.
- Invocations of the same shader stage may be executed in any order, even within the same draw call. This includes fragment shaders; writes to the framebuffer are ordered, but the actual fragment shader execution is not.
- Outside of invocations which are dependent (as defined above), invocations between stages may be executed in any order. This includes invocations launched by different rendering commands. While it is perhaps unlikely that two vertex shaders from different rendering operations could be running at the same time, it is possible, so OpenGL provides no guarantees.
The term "visibility" represents when someone (whether shader code or something else) can safely access the value written incoherently by some shader invocation. There are two tools to ensure visibility, and each applies in a different situation: the coherent qualifier and the glMemoryBarrier function.
coherent is used on image or buffer variables, such that writes to coherent qualified variables will be read correctly by coherent qualified variables in another invocation. Note that this requires the coherent qualifier on both the writer and the reader; if one of them doesn't have it, then nothing is guaranteed.
Note that coherent does not override the prior rules. In order for a write to become visible to an invocation, it must first have happened. Therefore, coherent can only really work if you know that the writing invocation has executed. In practice, this usually means dependent invocations, as stated above.
There are other times you can know that a write has happened. In Compute Shaders, the barrier function ensures that all other invocations in a work group have reached that point in the computation. This works for Tessellation Control Shaders as well, for all of the invocations in a patch. So you know that all invocations in a work group/patch have reached that point, so all prior writes have been written. You still need the coherent qualifier on both the reading and writing variable, but it works.
coherent alone is not enough, however. You also need to use a memory barrier, to effectively let OpenGL know that you're finished writing a batch of things and want to make them visible to someone else. The GLSL functions that do this have the word "memoryBarrier" in them (no relation to the glMemoryBarrier API function). The particular function defines which reads or writes it operates on:
- memoryBarrier: Provides a barrier for all of the below operations. This is the only function that doesn't require GL 4.3 or some 4.3 core extension.
- memoryBarrierAtomicCounter: Provides a barrier for Atomic Counters.
- memoryBarrierImage: Provides a barrier for image variables.
- memoryBarrierBuffer: Provides a barrier for buffer variables.
- memoryBarrierShared: Provides a barrier for Compute Shader shared variables.
- groupMemoryBarrier: Provides a limited barrier. It creates visibility for all incoherent memory operations, but only within a Compute Shader work-group. This can only be used in Compute Shaders.
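To show how coherent, a GLSL memory barrier and barrier fit together within one work group, here is a sketch of a compute shader embedded as a C string. The names `kNeighborExchangeCS`, `Scratch`, `Result`, `scratch` and `result` are assumptions for this example, as are the binding points and the work-group size of 64; it also assumes both buffers hold at least one element per dispatched invocation.

```c
/* GLSL compute shader source: every invocation writes one element of a
   coherent buffer, then reads the element written by its neighbor in the
   same work group. */
static const char *const kNeighborExchangeCS =
    "#version 430\n"
    "layout(local_size_x = 64) in;\n"
    "layout(std430, binding = 0) coherent buffer Scratch { uint scratch[]; };\n"
    "layout(std430, binding = 1) writeonly buffer Result { uint result[]; };\n"
    "void main() {\n"
    "    uint gid = gl_GlobalInvocationID.x;\n"
    "    uint lid = gl_LocalInvocationID.x;\n"
    "    uint base = gl_WorkGroupID.x * 64u;\n"
    "    scratch[gid] = gid * 2u;   // incoherent write\n"
    "    memoryBarrierBuffer();     // make this invocation's buffer writes visible\n"
    "    barrier();                 // wait until the whole work group got here\n"
    "    uint neighbor = base + ((lid + 1u) % 64u);\n"
    "    result[gid] = scratch[neighbor];\n"
    "}\n";
```

The memory barrier comes before barrier so that, once every invocation has passed barrier, each one's write has already been made visible. The exchange stays inside a single work group because barrier does not order invocations across work groups.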
Atomic Counter operations are functionally coherent, in that they are atomic (nothing can interfere with the read/modify/write operation). Memory barriers can still be employed if you wish to ensure the ordering between two separate atomic operations. But most uses of atomic counters don't need that.
Note that atomic counters are different functionally from atomic image/buffer variable operations. Those still need coherent and the above rules.
coherent is only useful in cases of shader-to-shader reading/writing where you can be certain of invocation order. If you want to establish visibility between two different rendering commands (which, as previously stated, have no ordering guarantees), or if you want to establish visibility between one rendering command and some later OpenGL operation (such as a CPU readback via glReadPixels or glGetBufferSubData, etc), you need to do something else.
You might think that a Sync Object could ensure synchronization between commands. But there are two problems with that. First, it's incredibly expensive, because it means having to wait to issue the second command until the first completed. Second, it does not work. So don't do that.
Instead, you must use a special OpenGL function:
void glMemoryBarrier(GLbitfield barriers);
This function is a way of ensuring the visibility of incoherent memory access operations with a wide variety of OpenGL operations, as listed on the documentation page. The thing to keep in mind about the various bits in the bitfield is this: they represent the operation you want to make the incoherent memory access visible to. That is, you specify the operation that needs to see the results, not the operation that produced them.
For example, if you do some image store operations on a texture, and then want to read it back onto the CPU via glGetTexImage, you would use the GL_TEXTURE_UPDATE_BARRIER_BIT. If you did image load/store to a buffer, and then want to use it for vertex array data, you would use GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT. That's the idea.
Note that if you want image load/store operations from one command to be visible to image load/store operations from another command, you use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT. There are other similar bits for other incoherent memory accesses.
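The "name the consumer, not the producer" rule can be sketched as a small lookup in C. The `Consumer` enum and `barrier_bits_for` function are hypothetical helpers for illustration. The barrier bit values are redefined here only so the sketch compiles on its own; they are assumed to match the Khronos glcorearb.h header, and real code should include the OpenGL headers instead.

```c
#include <stdint.h>

/* Assumed to match glcorearb.h; include the real GL headers in real code. */
#ifndef GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT
#define GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT 0x00000001u
#define GL_TEXTURE_FETCH_BARRIER_BIT       0x00000008u
#define GL_SHADER_IMAGE_ACCESS_BARRIER_BIT 0x00000020u
#define GL_TEXTURE_UPDATE_BARRIER_BIT      0x00000100u
#define GL_BUFFER_UPDATE_BARRIER_BIT       0x00000200u
#define GL_SHADER_STORAGE_BARRIER_BIT      0x00002000u
#endif

/* Hypothetical list of operations that will consume incoherent writes. */
typedef enum {
    CONSUMER_VERTEX_ATTRIB_FETCH, /* buffer used as vertex array data */
    CONSUMER_SAMPLER_FETCH,       /* texture read through a sampler */
    CONSUMER_IMAGE_LOAD_STORE,    /* image load/store in a later command */
    CONSUMER_SSBO_ACCESS,         /* SSBO access in a later command */
    CONSUMER_TEXTURE_READBACK,    /* e.g. glGetTexImage */
    CONSUMER_BUFFER_READBACK      /* e.g. glGetBufferSubData */
} Consumer;

/* Returns the glMemoryBarrier bit for the operation that will READ the
   data, or 0 for an unknown consumer. */
uint32_t barrier_bits_for(Consumer consumer)
{
    switch (consumer) {
    case CONSUMER_VERTEX_ATTRIB_FETCH: return GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT;
    case CONSUMER_SAMPLER_FETCH:       return GL_TEXTURE_FETCH_BARRIER_BIT;
    case CONSUMER_IMAGE_LOAD_STORE:    return GL_SHADER_IMAGE_ACCESS_BARRIER_BIT;
    case CONSUMER_SSBO_ACCESS:         return GL_SHADER_STORAGE_BARRIER_BIT;
    case CONSUMER_TEXTURE_READBACK:    return GL_TEXTURE_UPDATE_BARRIER_BIT;
    case CONSUMER_BUFFER_READBACK:     return GL_BUFFER_UPDATE_BARRIER_BIT;
    }
    return 0u;
}
```

A typical use would be to issue the writing command, then call glMemoryBarrier(barrier_bits_for(CONSUMER_TEXTURE_READBACK)), then call glGetTexImage. Bits for several consumers can be combined with bitwise OR.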
Guidelines and usecases
Here are some basic use cases and how to synchronize them properly.
- Read-only variables: If a shader only reads, then it does not need any form of synchronization for visibility. Even if you modify objects via OpenGL commands (glTexSubImage2D, for example), OpenGL requires that reads remain properly synchronized.
- barrier invocation write/read: Use coherent and an appropriate memoryBarrier* or groupMemoryBarrier call if you use a mechanism like barrier to synchronize between invocations. Remember that shared variables are incoherent and so need a memory barrier, whereas Tessellation Control Shader outputs (per-vertex and per-patch) are coherent and do not; for those, you only need to use barrier to ensure that the write has actually happened.
- Dependent invocation write/read: If you have one invocation which is dependent on another (the vertex shaders used to generate a primitive used for a fragment shader), then you need to use coherent on the variables and invoke an appropriate memoryBarrier* after you finish writing to the images of interest.
- Shader write/read between rendering commands: One rendering command (or compute dispatch) writes incoherently, and another reads. There is no need for coherent here at all. Just use glMemoryBarrier with the appropriate access bit.
- Shader writes, other OpenGL operations read: Again, coherent is not necessary. You must use a glMemoryBarrier appropriate to the operation of interest.