Make NV_texture_barrier core functionality.

Both AMD and NVIDIA provide access to this extension across all of their 3.x and 4.x hardware. It’s high time this was core behavior.

And no, glMemoryBarrier doesn’t cut it. Besides the fact that it’s not available in 3.x land, it doesn’t do what glTextureBarrierNV does. Texture barrier is all about framebuffer object writes; glMemoryBarrier is only concerned with writes via atomic counters, image load/store, or shader storage buffers.
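As a sketch of the difference (assuming a context where both NV_texture_barrier and image load/store are available; the draw* helpers are hypothetical stand-ins for actual draw calls):

```c
/* Hazard 1: sampling a texture that is bound to the current FBO.
   Only glTextureBarrierNV covers this. */
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo);   /* color attachment 0 = tex */
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, tex);             /* the same texture, sampled */
drawFirstPass();                               /* writes tex through the FBO */
glTextureBarrierNV();                          /* flush FB writes, invalidate
                                                  the texture cache */
drawSecondPass();                              /* may now sample what the
                                                  first pass wrote */

/* Hazard 2: incoherent image load/store writes.
   Only glMemoryBarrier covers this (GL 4.2+). */
drawPassUsingImageStore();                     /* imageStore() into tex */
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT); /* make those stores visible
                                                  to later texture fetches */
drawPassSamplingTex();
```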

Agreed. +1

In 4.3 you can use ARB_shader_image_load_store, ARB_framebuffer_no_attachments, and glMemoryBarrier to accomplish what glTextureBarrierNV does. I don’t see the point of adding yet another feature to an already huge, complex, and very hard to implement API (NVIDIA and AMD aren’t the only vendors).

No, you can’t. ARB_shader_image_load_store only syncs image/UAV operations and atomic counter operations with any of the other accesses, but not framebuffer updates with texture accesses.

I don’t see issues in adding this to core. In the worst case it’s a glFlush() plus an added flush of the texture caches.
But for most things, I think it will be slower than image_load_store. (Could you prove otherwise?)

I don’t have any GL 4.x hardware, so I can’t run performance tests. However, recall that the principal use of this feature is to allow a form of shader-based blending: being able to perform a read/modify-in-shader/write loop on the same texel.

Now, both image load/store and texture_barrier require calling a barrier function between different read/modify-in-shader/write passes. And each pass must not overlap with itself. Fair enough.
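As a rough sketch of that multi-pass structure (drawNonOverlappingBatch is a hypothetical helper; with image load/store the barrier call would instead be the appropriate glMemoryBarrier):

```c
for (int pass = 0; pass < numPasses; ++pass) {
    /* Within a pass, no two primitives may touch the same texel. */
    drawNonOverlappingBatch(pass);
    /* Between passes, make the previous pass's writes visible to reads. */
    glTextureBarrierNV();
}
```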

However, texture_barrier has two major advantages. First, while the read/modify-in-shader/write loop is all about being able to do in-shader computations, that doesn’t mean that you can’t use hardware blending at all. You can combine your shader with a specific blend mode, so that you effectively get an additional operation for “free”. It’s not really “free”, but it’s not in the fragment shader; it’s in the ROP. So the operation is pipelined and thus more efficiently uses the hardware. If the in-shader blending operation is fairly short, maybe only an extra math operation or two that glBlendFunc couldn’t handle, odds are good that it’ll be a non-trivial performance win.
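A sketch of what that combination might look like (the shader body is an illustrative custom operation, not anything specific from this thread):

```c
/* The ROP does the final accumulate: framebuffer = src + dst. */
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);

/* The shader does only the part glBlendFunc can't express. */
const char *fs =
    "#version 330 core\n"
    "uniform sampler2D target;   // the texture the FBO renders to\n"
    "out vec4 color;\n"
    "void main() {\n"
    "    vec4 dst = texelFetch(target, ivec2(gl_FragCoord.xy), 0);\n"
    "    color = dst * dst * 0.25;  // custom math; the ROP adds it to dst\n"
    "}\n";
```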

The other advantage is more substantial: hardware multisampling. If you write a fragment shader output, the hardware will use the coverage mask and write the fragment data to each covered sample using hardware ROPs. If you write the value with Image Load/Store, you must issue an imageStore call for each sample individually. Odds are good that the latter will be more expensive: the hardware is optimized for multisampled framebuffer writes, but it must assume that each individual imageStore operation is passing different data.
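A sketch of the two shader paths (GLSL 4.2-style; computeColor() stands in for whatever the pass computes):

```c
/* Per-sample path: image load/store must write each sample explicitly. */
const char *fsImageStore =
    "#version 420 core\n"
    "layout(rgba8) writeonly uniform image2DMS dst;\n"
    "uniform int numSamples;\n"
    "void main() {\n"
    "    ivec2 p = ivec2(gl_FragCoord.xy);\n"
    "    vec4 v = computeColor();            // hypothetical\n"
    "    for (int s = 0; s < numSamples; ++s)\n"
    "        imageStore(dst, p, s, v);       // one store per sample\n"
    "}\n";

/* Framebuffer path: one output; the ROP replicates it across the
   coverage mask. */
const char *fsFramebuffer =
    "#version 420 core\n"
    "out vec4 color;\n"
    "void main() { color = computeColor(); } // hypothetical\n";
```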

I can’t imagine how Image Load/Store would ever be faster than framebuffer writes, since it’s the same algorithm either way. If that were true, then writing outputs via Image Load/Store would be cheaper than writing them to the current framebuffer. And that seems rather unlikely.

The only place where the image load/store version might be a performance win is if you’re also using it to do order-independent transparency via per-pixel linked lists. And in that case, the performance gain or loss would depend on the nature of the scene: a scene with little transparency overlap would be relatively expensive in the image load/store case, while a scene with lots of overlap would be relatively cheap.

The multisampling one seems invalid: to do any proper custom blending, you will have to do per-sample processing, i.e. fetch each individual texel sample, calculate, and write one sample. Exactly the same flow as if you used image_load_store. Still, the gains from ROP blending that you mentioned fortunately remain.

The multisampling one seems invalid: to do any proper custom blending, you will have to do per-sample processing, i.e. fetch each individual texel sample, calculate, and write one sample. Exactly the same flow as if you used image_load_store.

It depends on what you’re doing and how accurate you want to be. If you read texels that correspond to the sample mask being written, then aggregate them together (effectively performing multisample resolve), you can still write one sample out with the proper sample mask, thus still benefiting from the hardware ROP writing.
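A sketch of that aggregate-then-write approach (GLSL 4.0-style; blendOp() is a hypothetical custom operator):

```c
const char *fsResolveBlend =
    "#version 400 core\n"
    "uniform sampler2DMS dst;\n"
    "uniform int numSamples;\n"
    "out vec4 color;\n"
    "void main() {\n"
    "    ivec2 p = ivec2(gl_FragCoord.xy);\n"
    "    vec4 avg = vec4(0.0);\n"
    "    for (int s = 0; s < numSamples; ++s)\n"
    "        avg += texelFetch(dst, p, s);   // in-shader resolve\n"
    "    avg /= float(numSamples);\n"
    "    color = blendOp(avg);               // hypothetical custom blend\n"
    "    // The coverage mask is left untouched, so the ROP writes this\n"
    "    // single value to every covered sample.\n"
    "}\n";
```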

It’s not as sample-accurate as doing per-sample shading, true. But then again, regular multisample without sample shading loses some accuracy too, compared to true super-sampling. So it is an accuracy-for-performance tradeoff. One that Image Load/Store doesn’t let you make.

That being said, it would be good if Image Load/Store had some functionality for doing writes to a texel with a sample mask, such that all of the given texels will get the same value. Though that might not be possible, depending on how the hardware actually implements multisample writes.