In-shader synchronization

My situation:
I’m running lots of compute shader invocations in parallel that do scattered reads and writes to SSBOs and to several images, so workgroup-level synchronization is of no use to me. I understand how both in-group synchronization in GLSL and CPU-GPU synchronization work, and I commonly use atomic operations.

I am especially interested in the following topics:
memoryBarrierXXXXX: how does it work? Do I need to call it before reads or after writes on SSBOs/images? (The API isn’t clear on this, to be honest, since it only talks about “memory accesses”.) Is it scoped to the shader invocation only, or is it valid between shader invocations?
imageAtomicXXXX: since images are coherent, aren’t reads and writes synchronized? Why the need for atomics?
Why are there no atomic operations on floats?

Where could one reliably study in-shader synchronization? A book, a clear presentation, or a document would help a lot. What I’m really looking for isn’t just a quick answer but a solid, example-based explanation that really discusses corner cases and the like.

Thanks!

[QUOTE]Do I need to call it before reads or after writes (it’s not clear from the api tbh, as it says memory accesses) on SSBOs/Images?[/QUOTE]

You issue the barrier in the source invocation, after its writes. The barrier causes those memory accesses to become visible to subsequent invocations (or to later parts of the current one, if you just need ordering).
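As a rough CPU-side analogue (C++, not GLSL), the sketch below shows the "barrier on the writing side" pattern: the producer makes an ordinary write, issues a release fence, and only then sets a flag. The names `payload`, `ready`, `producer`, `consumer` and `run_publish_example` are hypothetical; `payload` stands in for SSBO or image data and `ready` for an atomic flag the reader polls. One difference is worth flagging: the C++ memory model also requires an acquire fence on the reading side, and in GLSL the shared data would typically also be declared `coherent` so another invocation doesn't read a stale copy.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-ins: 'payload' plays the role of SSBO/image data,
// 'ready' plays the role of an atomic flag in memory.
int payload = 0;
std::atomic<int> ready{0};

void producer() {
    payload = 42;                                         // ordinary (non-atomic) write
    std::atomic_thread_fence(std::memory_order_release);  // barrier on the writing side, after the write
    ready.store(1, std::memory_order_relaxed);            // publish: "the data is there"
}

int consumer() {
    while (ready.load(std::memory_order_relaxed) == 0) {
        // spin until the producer has published
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // C++ also needs a fence on the reading side
    return payload;                                       // guaranteed to observe 42
}

int run_publish_example() {
    int seen = 0;
    std::thread reader([&] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    return seen;
}
```

With the release/acquire fence pair in place, `run_publish_example()` returns 42 regardless of how the two threads are scheduled; without the fences, the C++ program would contain a data race on `payload`.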

[QUOTE]Is this scoped on the shader invocation only or is this valid between shader invocations.[/QUOTE]

They’re scoped by the kind of memory access. If you use memoryBarrierImage, then any writes via image load/store will become visible to subsequent readers in subsequent invocations (or later in the current one). memoryBarrierBuffer does the same for writes to buffer variables (SSBOs). If you use memoryBarrier, then writes to all kinds of memory will become visible.

The only one scoped by the type of invocation is groupMemoryBarrier, which is like memoryBarrier but only within a compute shader workgroup. So any changes you have made via any operation become visible to others in the workgroup, but not necessarily to anyone else. memoryBarrierShared is technically scoped to a workgroup too, but only because shared variables are workgroup-local.

[QUOTE]since images are coherent[/QUOTE]

Images are not coherent unless you declare them with the coherent qualifier. Furthermore:

[QUOTE]since images are coherent aren’t reads and writes synchronized?[/QUOTE]

Coherency only applies to visibility. Atomic operations perform a read/modify/write as a single, atomic action, and that requires ordering. Coherency has nothing to do with ordering; it only means that writes will become visible. It says nothing about the order in which they become visible.

If you do two writes to the same memory, even coherently, even in the same invocation, you have no guarantees that the second will happen after the first. You need a memory barrier to provide ordering.

Atomic operations require not only ordering but also locking: they have to ensure that nothing can happen to that memory after the read and before the write. Again, coherency is insufficient to ensure that.
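To make the ordering-vs-locking distinction concrete, here is a C++ sketch, a CPU analogue of GLSL's `atomicAdd` on an SSBO counter; the function names are hypothetical. The first function replays, step by step, one bad interleaving of two non-atomic read-then-write increments: both writes land and are visible, yet one increment is lost. The second uses `std::atomic::fetch_add`, a single indivisible read/modify/write, so nothing can slip in between the read and the write.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Deterministic replay of one interleaving of two non-atomic increments.
// Every write is "visible", but visibility alone cannot prevent the lost update.
int interleaved_plain_increments() {
    int counter = 0;
    int a = counter;   // invocation A reads 0
    int b = counter;   // invocation B reads 0 before A has written
    counter = a + 1;   // A writes 1
    counter = b + 1;   // B writes 1, overwriting A's result
    return counter;    // 1, not 2: one increment was lost
}

// Atomic read-modify-write: each fetch_add is indivisible, so no update is lost.
int atomic_increments(int threads, int per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                // relaxed: we need atomicity here, not ordering with other data
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Note that even with `memory_order_relaxed`, which provides no ordering with respect to other memory, `fetch_add` never loses an update: atomicity and ordering are separate guarantees.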

[QUOTE]why are there no atomic operations on floats?[/QUOTE]

1: Because some hardware doesn’t support it. Floating-point math is complex, and performing an atomic read/add/write on IEEE-754 floats is non-trivial, much more difficult than for two’s complement integers.
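On hardware without native float atomics, a common workaround is a compare-and-swap loop on the float's bit pattern; in GLSL this would use `atomicCompSwap` on a `uint` together with `floatBitsToUint` / `uintBitsToFloat`. The sketch below expresses the same idea in C++ as a CPU analogue. `atomic_add_float` and `parallel_float_sum` are hypothetical helpers, and under heavy contention the retry loop can be much slower than a native atomic.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

static std::uint32_t float_to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);  // like GLSL floatBitsToUint
    return u;
}

static float bits_to_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);  // like GLSL uintBitsToFloat
    return f;
}

// Emulated atomic float add: compute the new value from the bits we last saw
// and try to swap it in; if another thread changed the value meanwhile, retry.
float atomic_add_float(std::atomic<std::uint32_t>& target, float value) {
    std::uint32_t expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(
               expected, float_to_bits(bits_to_float(expected) + value),
               std::memory_order_relaxed)) {
        // On failure, 'expected' now holds the current bits; loop and retry.
    }
    return bits_to_float(expected);  // value before our add, like atomicAdd
}

float parallel_float_sum(int threads, int per_thread, float step) {
    std::atomic<std::uint32_t> sum{float_to_bits(0.0f)};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) atomic_add_float(sum, step);
        });
    }
    for (auto& th : pool) th.join();
    return bits_to_float(sum.load());
}
```

The test values are chosen so that every partial sum (a multiple of 0.5 up to 2000) is exactly representable. With arbitrary values the result can vary between runs even though no update is lost, because floating-point addition is not associative and the order of the adds depends on scheduling.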

2: There are extensions for hardware that can, such as NV_shader_atomic_float.

Thank you very much; your clarification of ordering vs. locking made things very clear.