Defined behaviour of image load after store in a different shader stage?

imported_obfuscator · July 25, 2014, 6:48am

Hi,

Imagine the following scenario:

I want to draw indexed triangle strips with a primitive restart after every fourth index (restart index: 0xffffffff) which results in quads.
Then I want to store a computed value for every quad into an image from the vertex shader. The store should only be executed for the first vertex of each quad.

Pseudo code for vertex shader:


void main()
{
  ...

  if (gl_VertexID % 4 == 0)
  {
    imageStore(image, coordinate, value); // requires 'image' to be declared coherent
    memoryBarrier();
  }

  ...
}
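
For reference, a more complete version of the producer side could look roughly like the sketch below. It is only an illustration under assumptions: the names `quadData`, `imageWidth`, `inPosition` and `quadValue` are hypothetical, the image format is assumed to be `r32f`, and the index buffer is assumed to contain 0, 1, 2, 3, restart, 4, 5, 6, 7, restart, ... so that `gl_VertexID / 4` identifies the quad. Since a vertex shader may run more than once for the same vertex, the store should write the same value every time.

```glsl
#version 420 core

// Hypothetical vertex input; the real shader may have a different layout.
layout(location = 0) in vec4 inPosition;

// 'coherent' is needed so the store can become visible to other invocations.
layout(binding = 0, r32f) coherent uniform image2D quadData;

// Hypothetical: width of quadData in texels, must be > 0.
uniform int imageWidth;

// Placeholder for whatever per-quad value is actually computed.
float quadValue(int quad)
{
    return float(quad);
}

void main()
{
    gl_Position = inPosition;

    // Assumes indices 4*q .. 4*q+3 belong to quad q (no shared vertices).
    if (gl_VertexID % 4 == 0)
    {
        int quad = gl_VertexID / 4;
        ivec2 coord = ivec2(quad % imageWidth, quad / imageWidth);

        // The vertex shader may be invoked more than once per vertex,
        // so this store must be idempotent.
        imageStore(quadData, coord, vec4(quadValue(quad), 0.0, 0.0, 0.0));

        // Order the store before any later coherent access by dependent invocations.
        memoryBarrier();
    }
}
```

Whether the fragment stage is then guaranteed to see this value for both triangles of the quad is exactly the open question.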

Now the question: Is it safe to assume (i.e. is it defined behaviour) that the stored value is visible to all fragments of the triangle strip / quad, or can it happen that it is only visible to one triangle of the strip while the other reads a stale or undefined value? Do the fragment shader invocations depend on the whole primitive type of the draw call (here: triangle strip) or only on the individual triangle they belong to?

Two important quotes from the Memory Model wiki article that my question is based on:

Second, if a shader invocation is being executed, then the shader invocations necessary to cause that invocation must have taken place. For example, in a fragment shader, you can assume that the vertex shaders to compute the vertices for the primitive being rasterized have completed. This is called a dependent invocation. They get to have special privileges in terms of ordering.

Invocations of the same shader stage may be executed in any order. Even within the same draw call. This includes fragment shaders; writes to the framebuffer are ordered, but the actual fragment shader execution is not.

What is a primitive here? Triangle or triangle strip?

EDIT: Btw every quad has its own store coordinate (independent from the others).

Dark_Photon · July 26, 2014, 6:22pm

obfuscator wrote:

I want to store a computed value for every quad into an image via vertex shader. The store should only be executed for the first vertex of a quad…

Now the question: Is it safe to assume (defined behaviour) that this stored value is visible for all fragments of the triangle strip / quad or can it happen that it is only visible for one triangle of the strip and the other uses an obsolete or undefined value?

I’m certainly no expert on side-effect synchronization, but from skimming ARB_shader_image_load_store, I don’t see anything that guarantees that side-effect writes in a vertex shader will be synchronized before reads from a resulting fragment shader. This text is relevant:

Shader Memory Access Ordering

The order in which texture or buffer object memory is read or written by shaders is largely undefined…

While a vertex … shader will be executed at least once for each unique vertex specified by the application …, it may be executed more than once for implementation-dependent reasons…

The relative order of invocations of different shader types is largely undefined. However, when executing a shader whose inputs are generated from a previous programmable stage, the shader invocations from the previous stage are guaranteed to have executed far enough to generate final values for all next-stage inputs…

Shader Memory Access Synchronization

…To permit cases where textures or buffers may be read or written in different pipeline stages without the overhead of automatic synchronization, buffer object and texture stores performed by shaders are not automatically synchronized with other GL operations using the same memory…

SHADER_IMAGE_ACCESS_BARRIER_BIT: Memory accesses using shader image load, store, and atomic built-in functions issued after the barrier will reflect data written by shaders prior to the barrier. Additionally, image stores and atomics issued after the barrier will not execute until all memory accesses (e.g., loads, stores, texture fetches, vertex fetches) initiated prior to the barrier complete.
…

The following guidelines may be helpful in choosing when to use coherent memory accesses and when to use barriers
…

Data written by one shader invocation and consumed by other shader invocations launched as a result of its execution (“dependent invocations”) should use coherent variables in the producing shader invocation and call memoryBarrier() after the last write. The consuming shader invocation should also use coherent variables.

I’d pay close attention to the last one. Also this text from the GLSL 4.4 Specification/User’s Manual seems pertinent:

8.17 Shader Memory Control Functions [aka memoryBarrier* functions]

…void memoryBarrierImage () - Control the ordering of memory transactions to images issued within a single shader invocation…

When these functions return, the results of any memory stores performed using coherent variables performed prior to the call will be visible to any future coherent access to the same memory performed by any other shader invocation. In particular, the values written this way in one shader stage are guaranteed to be visible to coherent memory accesses performed by shader invocations in subsequent stages when those invocations were triggered by the execution of the original shader invocation (e.g., fragment shader invocations for a primitive resulting from a particular geometry shader invocation).

So it sounds like, if you’re in a frag shader, you can (with some special sauce, as described above) access image information stored by one of the verts causing that fragment to be created. But if it’s not one of the prompting verts, you can’t make that assumption. Since the GPU rasterizes tris and not quads, I think you might be running afoul of the rules with what you’re trying to do.
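
To make the consumer side of that guideline concrete, here is a rough fragment shader sketch. The names `quadData`, `imageWidth` and `quadIndex` are hypothetical, and it assumes every vertex of a quad writes the same `flat` quad index so that each fragment knows which texel to read. Note that a four-vertex strip is rasterized as two triangles built from vertices (0, 1, 2) and (2, 1, 3), so the vertex that performed the store is not an input of the second triangle. By the reading above, the load below is only covered by the dependent-invocation guarantee for fragments of the first triangle.

```glsl
#version 420 core

// Must also be 'coherent' on the consuming side, per the guideline quoted above.
layout(binding = 0, r32f) coherent uniform image2D quadData;

// Hypothetical: width of quadData in texels, must be > 0.
uniform int imageWidth;

// Hypothetical: quad index written identically by every vertex of the quad.
flat in int quadIndex;

out vec4 fragColor;

void main()
{
    ivec2 coord = ivec2(quadIndex % imageWidth, quadIndex / imageWidth);

    // Reads the value stored by the quad's first vertex. Visibility is only
    // assured if that vertex invocation produced inputs for THIS triangle.
    float v = imageLoad(quadData, coord).r;

    fragColor = vec4(v, v, v, 1.0);
}
```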

A couple of alternative options come to mind. First, just generate the data in the vertex shader and send it down normally using interpolators. Or send a single point down the pipe, generate your data in the vertex shader (the data you were going to image store for the fragment shader executions), and then use a geometry shader to expand the point into a quad. All resulting vertices (and their fragments) then get access to this shared data through normal interpolators, without any image load/store hocus pocus.
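
A minimal sketch of the point-to-quad idea, under some assumptions: the quad is axis-aligned in clip space with a hypothetical half size `halfExtent`, and the vertex shader passes the per-quad value in a hypothetical output named `vsQuadValue`. A quad with arbitrary corners would need those corners supplied as extra per-point attributes instead.

```glsl
#version 420 core

layout(points) in;
layout(triangle_strip, max_vertices = 4) out;

// Hypothetical: per-quad value computed once in the vertex shader.
in float vsQuadValue[];

// Hypothetical: half size of the quad, applied as a clip-space offset.
uniform vec2 halfExtent;

// 'flat' so every fragment of both triangles receives the same value.
flat out float quadValue;

void main()
{
    vec4 center = gl_in[0].gl_Position;

    // Corner order forms a two-triangle strip.
    const vec2 corners[4] = vec2[4](
        vec2(-1.0, -1.0), vec2(1.0, -1.0),
        vec2(-1.0,  1.0), vec2(1.0,  1.0));

    for (int i = 0; i < 4; ++i)
    {
        // Outputs become undefined after EmitVertex(), so write them each time.
        quadValue = vsQuadValue[0];
        gl_Position = center + vec4(corners[i] * halfExtent, 0.0, 0.0);
        EmitVertex();
    }
    EndPrimitive();
}
```

Because the value travels through ordinary stage outputs, no `coherent` qualifier or `memoryBarrier()` is involved, and the ordering is handled by the pipeline itself.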

imported_obfuscator · July 27, 2014, 4:38am

Thanks so far, Dark Photon.

That quote from the GLSL 4.4 spec you underlined is the point which keeps me puzzled.

In particular, the values written this way in one shader stage are guaranteed to be visible to coherent memory accesses performed by shader invocations in subsequent stages when those invocations were triggered by the execution of the original shader invocation (e.g., fragment shader invocations for a primitive resulting from a particular geometry shader invocation).

What is their definition of a primitive? Is it a GPU primitive (triangle) or an OpenGL primitive (here: triangle strip)? In a geometry shader you can also specify a so-called output “primitive”, which can also be a triangle_strip. So by their definition, all fragments which result from a GS invocation’s triangle_strip (primitive) should see its coherent-qualified stores, since these fragments are dependent on the GS. Hence, the same guarantee should also apply to an OpenGL draw call with “primitive” mode GL_TRIANGLE_STRIP.

Dark_Photon · July 28, 2014, 5:07am

You’re right, that seems to be a key question. The GL 4.4 core spec describes a primitive as: “a point, line segment, patch, or polygon”. Yet primitive type includes things like triangle strips, triangle fans, line loops, and such. Also, geom shader output types include points, line strips, and tri strips, which are termed “primitive types”.

So yes, I’d agree it’s confusing. Are they trying to say that it has to be one of the vertex shader executions that produced this triangle, or any of the vertex shader executions for any vertex in the triangle strip containing this triangle? I would guess the former, but I don’t know for sure. My reason is this quote I’d already mentioned above:

The relative order of invocations of different shader types is largely undefined. However, when executing a shader whose inputs are generated from a previous programmable stage, the shader invocations from the previous stage are guaranteed to have executed far enough to generate final values for all next-stage inputs…

This seems to say there’s no guarantee that all of your vertex shader executions for a strip will have completed (in particular, the one for the first vertex of each quad that you’re interested in) before the fragment shader executions that want to read the output of that first vertex.

OTOH, if you pass input down the pipe normally between stages via interpolators, then you can guarantee this will happen in the order you want (e.g. send down points, and do a point-to-quad spread with a geom shader).

imported_obfuscator · August 1, 2014, 7:39am

Thanks again, Dark Photon. I think we should conclude that the assumption does not hold for OpenGL primitives such as triangle strips, and that relying on it is undefined behaviour unless the spec states otherwise. The wording is too imprecise, and with that conclusion we are on the safe side.
