Barrier() in tess control shaders

Some questions about the barrier() instruction:

(1) It implies that all control shader invocations within a given patch do not necessarily share the same program counter, which goes against my intuition. Given the 32-vertex limit on patch size, I assumed that each patch is processed in a single warp.

(2) The only reason you’d need to synchronize threads is if the shader had RW access to a shared memory space. The “patch” qualifier can be applied only to “out” variables, not to temporaries. Wouldn’t it be useful to apply “patch” to temporaries?

(3) What’s the equivalent to barrier() in D3D hull shaders?

(4) When I tried to write a highly efficient control shader that makes use of patch-level shared memory (i.e., patch out) and barrier(), I ran into driver issues with both major vendors. Has anyone out there had any better luck than me?

My post was horribly written. Let me rephrase:

What is the raison d’être of the barrier() instruction?

The answer in my head was something like this:

Given a “patch out float foo”, barrier() can provide a synchronization point between invocations that write to foo, and invocations that read from foo.

However, this answer implies that using “patch out” as scratch space is a supported operation. AMD and NVIDIA both seem to be allocating intermediate results into per-thread registers.

Here’s the spec language:

So it sounds a lot like what you said and what I’d expect. Similar to barrier( CLK_LOCAL_MEM_FENCE ) in OpenCL, __syncthreads() in CUDA, and memoryBarrier() in ARB_shader_image_load_store/OpenGL 4.2.

When you have some threads potentially operating on the data previously read/written by other threads, you need a way to ensure the data is in place (a sync point) before you let the threads continue and start using the data.
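For what it's worth, here is a minimal sketch of that write / sync / read pattern in a tessellation control shader, as I understand the intended usage. The array name edgeLen, the constant kScale, and the choice of a 4-vertex patch are my own assumptions for illustration; none of them come from the spec or from anyone's actual shader. Whether a given driver handles it correctly is a separate matter, since this thread reports trouble with exactly this kind of code.

```glsl
#version 400 core

// Assumption: a 4-vertex (quad) patch, so four invocations run per patch.
layout(vertices = 4) out;

// Hypothetical per-patch scratch array shared by all invocations of the patch.
patch out float edgeLen[4];

// Hypothetical scale factor from edge length to tessellation level.
const float kScale = 8.0;

void main()
{
    int i = gl_InvocationID;
    int j = (i + 1) % 4;

    // Pass the control point through unchanged.
    gl_out[i].gl_Position = gl_in[i].gl_Position;

    // Phase 1: each invocation writes only its own element.
    edgeLen[i] = distance(gl_in[i].gl_Position.xyz,
                          gl_in[j].gl_Position.xyz);

    // Sync point. barrier() has to be called from main(), outside of
    // any flow control, so every invocation of the patch reaches it.
    barrier();

    // Phase 2: invocation 0 reads elements written by the other invocations.
    if (i == 0) {
        float longest = max(max(edgeLen[0], edgeLen[1]),
                            max(edgeLen[2], edgeLen[3]));
        float level = clamp(longest * kScale, 1.0, 64.0);

        gl_TessLevelOuter[0] = level;
        gl_TessLevelOuter[1] = level;
        gl_TessLevelOuter[2] = level;
        gl_TessLevelOuter[3] = level;
        gl_TessLevelInner[0] = level;
        gl_TessLevelInner[1] = level;
    }
}
```

Without the barrier() call, invocation 0 could read edgeLen[1] through edgeLen[3] before the other invocations have written them, and the values would be undefined.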

Cool, thanks Dark Photon.

With AMD and NVIDIA, writing to a “patch out” variable from invocation A and reading back from invocation B does not work, even with an intermediary barrier(). The shader compiles without errors, but I see garbled rendering.

Their compilers are probably (incorrectly) allocating intermediary temporaries into a per-thread register space.
