multi-pass, SSBO + a synchronisation issue

I am trying to implement a certain multi-pass rendering algorithm, am I am getting artifacts. Debugging this issue, I have cut down my shaders to just a few lines that still reproduce the problem. So I must be staring at the bug, but a whole day of staring at the code and OpenGL ES 3.1 spec and I still don’t understand what is going on…

Both passes share the same vertex shader:

#version 310 es

in vec2 a_Position;
out vec2 v_TexCoordinate;

void main()
  {
  v_TexCoordinate = a_Position + 0.5;         // this makes it [ 0.0 , 1.0 ]
  gl_Position     = vec4(2.0*a_Position,1.0,1.0);
  }

Pass1 fragment shader:

#version 310 es
out vec4 fragColor;
in vec2 v_TexCoordinate;

uniform vec2 u_Size;     // Size of the screen (width x height)

layout (binding=0, offset=0) uniform atomic_uint u_Counter;

layout (std430,binding=1) buffer linkedlist
  {
  uint u_Records[];     // number of entries: twice the number of pixels on the screen
  };

void main()                    		
  {
  uint index = uint( (v_TexCoordinate.x + v_TexCoordinate.y * u_Size.y) * u_Size.x);     // we are rendering a quad = unique index for pixel AFAIK ?
  uint ptr = atomicCounterIncrement(u_Counter) + uint(u_Size.x*u_Size.y);                // unique location the 'real per-pixel value' is stored at 
  u_Records[index] = ptr;                                                                // store a 'pointer' to the real value (kind of 'per pixel head pointer')
  u_Records[ptr  ] = (v_TexCoordinate.x>0.5?2u:1u);                                      // store the 'real value' - 1 for all pixels on the left of the screen and 2 otherwise
  discard;
  }

Pass2 fragment shader:

#version 310 es
out vec4 fragColor;
in vec2 v_TexCoordinate;

uniform vec2 u_Size;

layout (std430,binding=1) buffer linkedlist
  {
  uint u_Records[];
  };

void main()                    		
  {
  uint index = uint( (v_TexCoordinate.x + v_TexCoordinate.y * u_Size.y)* u_Size.x);
  uint ptr = u_Records[index];                                                                  // retrieve the 'per-pixel unique pointer to the real value'

       if( u_Records[ptr]==2u ) fragColor = vec4(1.0,0.0,0.0,1.0);                             // if the real value is 2, draw a red pixel
  else if( u_Records[ptr]==1u ) fragColor = vec4(0.0,1.0,0.0,1.0);                             // if the real value is 1, draw a green pixel
  else                          fragColor = vec4(0.0,0.0,1.0,1.0);                             // this should never happen
  }

Application:


zeroOutAtomicCounter();
Pass1();
glMemoryBarrier(GL_ALL_BARRIER_BITS);
Pass2();

Now, as you can see I am using a flat table of uints in a SSBO to communicate between the Passes. In the first pass, I fill up the table with values, in the second - read it and display appropriate colours.
The table has exactly twice as many entries as there are pixels on the screen. In the first half, each entry holds an index to another location of the table, where the real per-pixel value is stored.
Pass2 follows this ‘two element per-pixel linked list’ and displays appropriate coloured pixels.

If things were working, I should be getting the left half of the screen solid GREEN and the right half - solid RED. This is what happens in case of 95% of the pixels on the screen, but about 5% are of the wrong color (some left pixels are RED and some right - GREEN). Those ‘wrong’ pixels keep dancing on the screen.

Now, I have spent the whole day thinking about this and the only possibility I see is that some invocations of Pass2 are running when some invocations of Pass1 still haven’t completed and are, in fact, done with the first write to our table (the ‘head pointer’ ) but are not done with the second write.

But how that can be - notice that in the application, between the passes, I do call ‘glMemoryBarrier(GL_ALL_BARRIER_BITS)’ which, according to the spec, is AFAIK supposed to guarantee that all memory writes of all sorts from Pass1 will be complete when Pass2 begins running?

I have tested this on 3 phones with Adreno 418, Adreno 530 and Mali T760 mobile GPUs onboard and all of them show very similar artefacts, so looks like this is my problem and not some bug in the driver :slight_smile:


  uint index = uint( (v_TexCoordinate.x + v_TexCoordinate.y * u_Size.y) * u_Size.x);     // we are rendering a quad = unique index for pixel AFAIK ?
  uint ptr = atomicCounterIncrement(u_Counter) + uint(u_Size.x*u_Size.y);                // unique location the 'real per-pixel value' is stored at 
  u_Records[index] = ptr;                                                                // store a 'pointer' to the real value (kind of 'per pixel head pointer')
  u_Records[ptr  ] = (v_TexCoordinate.x>0.5?2u:1u);                                      // store the 'real value' - 1 for all pixels on the left of the screen and 2 otherwise

index is presumably intended to take texture coordinates and convert them into an index in a 1D array, giving each 2D position within a range a 1D index. If that’s the case, then that is the wrong code. If your texture coordinates are on the range [0, width/height], then you need to do X + Y * width to get a row-major index. Notice that the width is being multipled into the Y component; that is not an accident. For each row of pixels (that is, for each Y index), you must skip width indices.

And of course, you need to convert these values to integers before you use this equation.

If the texture coordinates are normalized (on the range [0, 1]), then the code is even more wrong. You have to denormalize them first (multiplying by the width/height), then apply the previous equation.

Now that we have index sorted, let’s move on. So you seem to have decided that your 1D array contains two things: the pixels storing integers, and the actual linked list data values. That’s actually a problem. See, the way a linked list works is that you have a list of values; each data value specifies an index to another data value that represents additional data for that pixel. You aren’t doing that here. It is that linked-list which allows multiple fragments to overlap, storing multiple distinct values in the same “pixel”.

Next, we come to the assignment of u_Records[index]. This too is wrong. Why? Because if multiple fragments overlap, then they will attempt to write to the same index. And since there is no sycnhronization between shader invocations, you get a race condition. AKA: undefined behavior.

The way this is usually done is to employ an atomic swap operation. Rather than simply writing to the value, you swap it with ptr. This ensures that if fragments overlap, one will complete before the other. This is important for the linked-list aspect too, as the swap function returns the original index you stored, which you can then store in your pixel data’s “next” field. This allows you to process a list of pixels.

The final assignment need not be an atomic operation; it’s essentially OK.

You have pointed at 3 issues. Let’s answer one-by-one.

############################################################################

First, you say that ‘index’ is not computed correctly. TexCoordinate ranges from 0 to 1 (as you can see in vertex shader). Index gets computed like this

uint index = uint( (v_TexCoordinate.x + v_TexCoordinate.y * u_Size.y) * u_Size.x);    // we are rendering a quad = unique index for pixel AFAIK ?

Now, what’s exactly wrong with this? Notice that ‘index’ is computed in the very same way in Pass1 and Pass2. Pass2 should then read back exactly what was written in Pass1, which is (v_TexCoordinate.x>0.5?2u:1u). Colors in the screen say that this is not te case (in case of about 5% of pixels)

I could change it to

uint pixelX = uint(v_TexCoordinate.x * u_Size.x);
uint pixelY = uint(v_TexCoordinate.y * u_Size.y);
uint index = pixelX + pixelY * uint(u_Size.x);

which is what you want and which is how the ‘original, full’ version of this code looked like, but this doesn’t seem to change anything - still the very same, 5% of dancing pixels.

I claim that the first, ‘short’ version of computing index and the second, ‘long’ version, while not exactly numerically identical (casts to uints are done differently, so we can have off-by-one differences occasionally) can result in index being off-by-one in the worst case. Even if, I don’t see how this would result in the ‘5% of dancing pixels’ problem as Pass1 and Pass2 compute the index in exactly the same way.

############################################################################

Next, you say that my ‘linked list’ is somehow not right. In the original, full version of this code indeed there is a full-blown linked list with (width x height) ‘head pointers’ in the beginning, containing indexes to next sections of this table. Each index points to a group of uints containing per-pixel info and ending with another index possibly pointing to next element of the list.

Here, we have a much simpler data structure. We still have the first (width x height) initial ‘per-pixel head pointers’, each containing an index pointing to a single uint, which in turn contains either a ‘1’ or a ‘2’. This is a very silly data structure, yes, but it should work, right?

############################################################################

Then, you point out that the head pointers are not set atomically. I know - in the original version they are. Here I claim they don’t have to, because in both Passes we are rendering a full-screen Quad, so there’s only 1 fragment per pixel. So, if the following 2 assumptions are true (and why wouldn’t they?) :

  • there is exactly 1 fragment per pixel (so fragment shader in each pass gets executed exactly once per pixel)
  • each pixel corresponds to a unique ‘index’

then we should have no problem with lack of atomicity, right?

Update: I did another test. Rather than rendering directly to the 1080 pixels wide screen, I render to an offscreen framebuffer of configurable size, and then blit the framebuffer onto the screen.

Results: there are absolutely no artefacts when size of the (square) framebuffer is <=88. When I increase its size to 89, suddenly the artefacts become very visible.

Would’t 88 be the size of the ‘tile’ mobile GPUs have? Then wouldn’t that mean that a given ‘tile’ proceeds to rendering Pass2 even though some other ‘tiles’ are not done rendering their Pass1 yet (and are on fact in between the two writes to the SSBO).

That would explain everything, but why would that be? I thought the MemoryBarrier(ALL_BARRIER_BITS) between the Passes would prevent that from happening?

Of course: what’s happening is that a given ‘tile’ still renders frame N’s Pass2 while some other ‘tile’ moves on to rendering frame N+1 Pass1 already, writes to sections of the SSBO being read by the first tile (because of atomicCounter those change in unpredictable ways between frames) and we get artefacts.