Making the blend stage programmable?

The next hardware generation is not going to provide this kind of feature … I was really expecting it, for a few reasons:

  • I’m annoyed by the fixed blending setup (see the sketch after this list). :slight_smile:
  • I saw in it a possibility for single-pass deferred shading.
  • I expect some post-processing effects like blur to be done at the blend stage, or in a few passes.
  • It could bring massive memory bandwidth savings.
    However, we now have a new player, OpenCL, which could provide a lot on those topics.
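
To make the first point concrete, here is a minimal sketch in plain C (purely illustrative, with made-up function names, not any real API) contrasting what the fixed blend equation lets us express today with what an arbitrary user-written blend function could do:

```c
#include <stdio.h>

typedef struct { float r, g, b, a; } Color;

/* Today: all we choose is which factors/equation the ROP applies,
   e.g. classic GL_SRC_ALPHA / GL_ONE_MINUS_SRC_ALPHA alpha blending. */
static Color fixed_alpha_blend(Color src, Color dst)
{
    Color out;
    out.r = src.r * src.a + dst.r * (1.0f - src.a);
    out.g = src.g * src.a + dst.g * (1.0f - src.a);
    out.b = src.b * src.a + dst.b * (1.0f - src.a);
    out.a = src.a         + dst.a * (1.0f - src.a);
    return out;
}

/* Hypothetical "blend shader": arbitrary code that sees both the incoming
   sample and the current framebuffer value. Here, a made-up rule that
   keeps whichever color is brighter. */
static Color blend_shader(Color src, Color dst)
{
    float s = src.r + src.g + src.b;
    float d = dst.r + dst.g + dst.b;
    return (s > d) ? src : dst;
}

int main(void)
{
    Color src = { 1.0f, 0.2f, 0.2f, 0.5f };
    Color dst = { 0.1f, 0.1f, 0.8f, 1.0f };

    Color f = fixed_alpha_blend(src, dst);
    Color p = blend_shader(src, dst);
    printf("fixed:        %.2f %.2f %.2f %.2f\n", f.r, f.g, f.b, f.a);
    printf("programmable: %.2f %.2f %.2f %.2f\n", p.r, p.g, p.b, p.a);
    return 0;
}
```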

First, some definitions. I call a “sample” a fragment that has passed the tests; a fragment is a sample candidate, if you like. I then call “sampling” all the depth, stencil, etc. tests that discard fragments.

Here is a UML drawing to show how I see it. I simplified it as much as possible to keep the focus on the blend stage. If you feel it is a personal interpretation of the hardware, it is: it shows exactly what I was expecting, since there is no real truth on this topic and not all hardware is capable of programmable blending (I believe some is, however ;)).

The question: is this insane? I don’t really know where the multisample resolve should go in this model … maybe before the “sample shader”, to match the current way of doing deferred shading. However, following the OpenGL specification, it is supposed to come after, if it really is a “sample shader”. Anyway, this is handled in its own specific way on every processor, and multisampling remains a tricky topic for deferred engines.
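
To illustrate why the placement of the resolve matters, here is a toy C example (entirely my own, with a deliberately non-linear “shade” step standing in for lighting): resolving before per-sample shading and resolving after it give very different results on an edge pixel.

```c
#include <stdio.h>
#include <math.h>

#define SAMPLES 4

/* A deliberately non-linear shading function (a specular-like power). */
static float shade(float x) { return powf(x, 8.0f); }

int main(void)
{
    /* Per-sample values for one pixel sitting on a triangle edge. */
    float s[SAMPLES] = { 1.0f, 1.0f, 0.0f, 0.0f };

    /* Option A: resolve first (average the samples), then shade once per pixel. */
    float avg = 0.0f;
    for (int i = 0; i < SAMPLES; ++i) avg += s[i];
    avg /= SAMPLES;
    float resolve_then_shade = shade(avg);

    /* Option B: shade every sample, then resolve. */
    float shade_then_resolve = 0.0f;
    for (int i = 0; i < SAMPLES; ++i) shade_then_resolve += shade(s[i]);
    shade_then_resolve /= SAMPLES;

    printf("resolve then shade: %f\n", resolve_then_shade); /* ~0.0039 */
    printf("shade then resolve: %f\n", shade_then_resolve); /* 0.5     */
    return 0;
}
```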

For the “post-processing effects like blur” point, the feature would require texture access to the bound render target, possibly within a limited offset range. The main issue with this is that we would have to be sure that all fragments have actually been processed before the blending stage. That is really not obviously the case, and for quite some hardware it could be a limitation. I don’t really know, but for the other stages it never really matters whether all the data has been processed. This seems especially limiting for immediate-mode rendering devices, but maybe not for tiled rendering devices.
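
As a rough sketch of what I mean by “texture access to the bound render target within a limited offset range”, here is a toy 3x3 box blur in plain C that reads neighbours straight out of a small framebuffer array; the values it reads are only meaningful if those neighbouring fragments have already been written, which is exactly the ordering problem described above.

```c
#include <stdio.h>

#define W 4
#define H 4

/* Toy framebuffer; in the real case these would be the already-written
   neighbours of the sample being blended, hence the ordering problem. */
static float fb[H][W] = {
    { 0, 0, 0, 0 },
    { 0, 8, 8, 0 },
    { 0, 8, 8, 0 },
    { 0, 0, 0, 0 },
};

/* 3x3 box blur around (x, y), clamping the limited offset range at the edges. */
static float blur3x3(int x, int y)
{
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int cx = x + dx, cy = y + dy;
            if (cx < 0) cx = 0;
            if (cx >= W) cx = W - 1;
            if (cy < 0) cy = 0;
            if (cy >= H) cy = H - 1;
            sum += fb[cy][cx];
        }
    return sum / 9.0f;
}

int main(void)
{
    printf("blurred centre: %f\n", blur3x3(1, 1));
    return 0;
}
```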

For deferred shading, we could expect to use no render target textures at all, just varying variables carrying the values we would usually have written to the render targets. However, we would probably need to wait until every triangle has been processed before generating a sample. On tiled rendering GPUs this wouldn’t be an issue; on immediate-mode GPUs the amount of memory required would force us to store all that data in graphics memory, just as with render targets, so we would lose all the bandwidth-saving benefit of this method.

Furthermore: CUDA and OpenCL. With CUDA there is no direct way to use an OpenGL texture; you have to use a PBO to render to a buffer and then access that buffer … It is really not convenient, and I don’t believe in any gain over the current two-pass fragment approach, especially because we lose the GPU’s 2D image cache. Fortunately, OpenCL is a lot better than expected (by me at least) and we can directly access 2D images, so why not compute the lighting pass with OpenCL? For some reason it feels like the way OpenCL is specified makes this “sample shader” even more interesting, some kind of convergence.
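
For what it’s worth, here is a minimal OpenCL C sketch of such a lighting pass; the kernel name, argument list and G-buffer layout are my own assumptions, not an established interface. The point is simply that the G-buffer can be read directly as image2d_t objects:

```c
/* Hypothetical OpenCL lighting pass reading G-buffer images directly.
   Assumed layout: albedo in one image, view-space normal in another. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void lighting_pass(__read_only  image2d_t gbuf_albedo,
                            __read_only  image2d_t gbuf_normal,
                            __write_only image2d_t lit,
                            float4 light_dir)   /* normalized, w unused */
{
    int2 p = (int2)(get_global_id(0), get_global_id(1));

    float4 albedo = read_imagef(gbuf_albedo, smp, p);
    float4 normal = read_imagef(gbuf_normal, smp, p);

    /* Simple Lambert term; a real engine would do much more here. */
    float ndotl = max(dot(normal.xyz, -light_dir.xyz), 0.0f);

    write_imagef(lit, p, (float4)(albedo.xyz * ndotl, albedo.w));
}
```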

Just let me know your thoughts about this! :slight_smile:

I think one problem with real programmable blending is that ROP/blending might very well be deferred right now, with the ROP unit likely waiting until enough transactions per output “tile” are ready before it actually does the global memory access(es) and atomic operations required for blending and/or depth/stencil check/update. Current (i.e. DX10) GPUs to my knowledge don’t have anything “programmable” in the pipeline which can do this caching and vector scatter of fragment shader outputs. Without the caching and efficient vector scatter, performance is bound to be horrid; plus there are all sorts of other non-programmable things like depth and stencil updates requiring feedback to the coarse raster stage, and the mess of multisampling and compression… so the dream of programmable blending might just have to stay as OpenCL atomic operations on buffer objects for this upcoming generation of DX11 graphics cards.
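
To spell out what “OpenCL atomic operations on buffer objects” could look like in practice, here is a hedged sketch: additive blending emulated with integer atomics on a fixed-point accumulation buffer (the buffer layout and the 16.16 scale are my own choices). It only works because additive blending is order-independent; order-dependent modes like classic alpha blending would not survive this treatment.

```c
/* Hypothetical additive "blend" done via atomics on a buffer object,
   using 16.16 fixed point since float atomics aren't available. */
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void additive_blend(__global int *accum_rgb,    /* 3 ints per pixel, 16.16 fixed point */
                             __global const float4 *src, /* incoming fragment colors */
                             int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int pixel = y * width + x;

    float4 c = src[pixel];

    /* Atomic adds commute, so the result is independent of fragment order. */
    atom_add(&accum_rgb[pixel * 3 + 0], (int)(c.x * 65536.0f));
    atom_add(&accum_rgb[pixel * 3 + 1], (int)(c.y * 65536.0f));
    atom_add(&accum_rgb[pixel * 3 + 2], (int)(c.z * 65536.0f));
}
```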

I don’t expect it for D3D11 … nVidia and ATI GPUs have, I guess, some kind of FIFO of fragment outputs that masks memory latencies. When the amount of data is large enough, it is sent to the “ROP”. The biggest issue is that those outputs could come from a triangle on the upper-left side of the screen as well as from one on the lower-right side: there are no tiles in the nVidia and ATI architectures. It means that this idea of using the whole fragment output implies waiting until every fragment has been processed … and there is no way that all the render targets would fit in the FIFO cache.

I’m referring to a tile as in this: http://www.icare3d.org/blog_techno/gpu/the_froggy_fragsniffer.html. It also covers how 2x2 sample quads are distributed to the various cores in the G80 based on screen position. Memory bus transactions have to be large (say over 32/64/128 bytes, according to the CUDA spec, on the latest G2xx hardware) to be efficient, so the hardware wouldn’t be doing framebuffer blending on single 2x2 sample quads at a time. There is indeed an ROP/OM “tile”, just not the type of tile one would think about in a tile-based deferred renderer. The purpose is however the same: to reduce memory transactions to global memory.

OK, I see what you meant. It just makes me believe that nVidia GPUs are not so far from being ready for such ideas :wink:

I have always seen the blend shader as being internally attached to the end of the fragment shader. All you really need is to be able to read the fragment’s destination data, and from what I understand this is at least partially done for the early-z discard, so it wouldn’t hurt the bandwidth if the rest was read at the same time.

One would only have to reorganize a little how the ROP operates.
An upside to this would be a slimmed-down ROP, since it would no longer be doing any blending.