Ok, answering my own post. It just hit me that since the blending is order-independent, one doesn’t have to do a fullscreen pass for each particle; the following should work just as well:
1. Create two floating-point buffers.
2. Set one as the render target and bind the other as a texture.
3. Render the particles one at a time, multiplying with the fp buffer that is bound as a texture.
4. Flip the fp buffers and render the next particle.
5. Bind both fp buffers as textures and multiply them together into a third buffer.
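For a single pixel, the ping-pong scheme above can be sketched in NumPy (names and values are hypothetical; the two scalars stand in for the two fp buffers, and the shader multiply is a plain `*`). It also demonstrates why the order-independence claim holds: the accumulated product is the same whichever order the particles are drawn in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-particle transmittance values for one pixel;
# multiplicative blending accumulates their product.
particles = rng.uniform(0.5, 1.0, size=16).astype(np.float32)

def pingpong_multiply(values):
    """Emulate the two-buffer ping-pong: one buffer is the render
    target, the other is sampled as a texture, swapping each pass."""
    buffers = [np.float32(1.0), np.float32(1.0)]  # both cleared to 1
    src, dst = 0, 1
    for v in values:
        buffers[dst] = buffers[src] * v  # shader: out = tex * particle
        src, dst = dst, src              # flip the buffers
    return buffers[src]

fwd = pingpong_multiply(particles)
rev = pingpong_multiply(particles[::-1])
print(fwd, rev)  # equal up to rounding, regardless of draw order
```

On real hardware each "pass" would of course be a draw call with a render-target switch, which is exactly the cost discussed below.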
This will of course also be very slow, since we’re rendering the particles one at a time and changing the render target in between.
If fp16 blending is insufficient, then I would suggest splitting the scene up into slabs.
Each slab would become an fp16 texture with few enough blended elements for fp16 to be satisfactory.
You could accumulate the fp16 textures into an fp32 buffer at the end.
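Why slabs help can be checked offline. This NumPy sketch (slab size and value range are hypothetical) contrasts keeping a long multiplicative accumulation entirely in fp16, where the running product drifts into fp16’s tiny dynamic range, against combining per-slab fp16 products at fp32 precision:

```python
import numpy as np

rng = np.random.default_rng(1)
# Many blended elements per pixel; their full product is far below
# fp16's smallest normal value, so a pure fp16 accumulation degrades.
values = rng.uniform(0.98, 1.0, size=2048).astype(np.float16)

# Naive: keep the running product in fp16 the whole way.
naive = np.float16(1.0)
for v in values:
    naive = np.float16(naive * v)

# Slab approach: slabs small enough for fp16 to hold the partial
# product comfortably, combined into an fp32 result at the end.
slab_size = 64
final = np.float32(1.0)
for i in range(0, len(values), slab_size):
    slab = np.float16(1.0)
    for v in values[i:i + slab_size]:
        slab = np.float16(slab * v)
    final = final * np.float32(slab)

reference = np.prod(values.astype(np.float64))
print(naive, final, reference)  # naive is off by orders of magnitude
```

The slab variant still does all the per-element multiplies in fp16; the win is that each partial product stays within fp16’s range, and only the wide-range combination happens at fp32.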
If fp16 is fundamentally unacceptable, then there’s probably not a great GPU-based solution today.
Of course, one can, as the original poster suggested, blit the result back to the main buffer after each render call, at the cost of twice as many render-target changes.