Output data packing

This may be more of a GPGPU question, but anyway…

I am trying to write a shader and program that does this for each pixel:

First overdraw : Write to Red channel
Second overdraw: Write to Green channel
Third overdraw : Write to Blue channel
Fourth overdraw: Write to alpha channel

Ideally (in magic GPU land), I could increment the stencil on each render and use that count to select the output channel.

I am aware that I could accomplish this by depth peeling and using a color mask for each layer, but I would like to do it in one pass.

I assume this is probably a common GPGPU problem as I am essentially trying to store a list of output data at each pixel.

My current solutions to this problem are:

  1. Mess with the blend states
    a) Use 2 render targets cleared to (0,0,0,1) and (0,0,0,0)
    b) Output “value” from the shader as (value, 0, 0, 0) and (value, 0, 0, 1)
    c) Use separate blend modes: DST_ALPHA, ONE_MINUS_DST_ALPHA for RGB and ONE, ZERO for alpha.

This gets me two outputs, but seems really wasteful for two values and I would like more than two values in the output list.
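To convince myself the blend arithmetic above does what I claim, here is a quick Python sketch simulating one pixel (idealized floating point, ignoring 8-bit quantization):

```python
# Simulate the two-render-target blend trick from solution 1.
# RGB blend: SRC * DST_ALPHA + DST * ONE_MINUS_DST_ALPHA
# A   blend: SRC * ONE       + DST * ZERO

def blend(dst, src):
    """Apply the separate RGB/alpha blend modes to one RGBA pixel."""
    da = dst[3]
    rgb = [s * da + d * (1.0 - da) for s, d in zip(src[:3], dst[:3])]
    return rgb + [src[3]]

# Clear values for the two render targets.
rt0 = [0.0, 0.0, 0.0, 1.0]
rt1 = [0.0, 0.0, 0.0, 0.0]

# Two overdraws; the shader emits (value,0,0,0) to RT0 and (value,0,0,1) to RT1.
for value in (0.25, 0.75):
    rt0 = blend(rt0, [value, 0.0, 0.0, 0.0])
    rt1 = blend(rt1, [value, 0.0, 0.0, 1.0])

print(rt0[0], rt1[0])  # 0.25 0.75 -- first overdraw in RT0.r, second in RT1.r
```

The destination alpha acts as a one-shot gate: RT0 accepts only the first fragment (its alpha then drops to 0), while RT1 accepts nothing until its alpha is raised to 1 by that same first fragment.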

  2. Go nuts with bit logic.
    a) Break the 8-bit output value into 4×2 bits and output each pair into the RGBA channels as the high bits.
    b) Use blend mode ONE, CONSTANT_COLOR (const color being 0.25) to shift the existing bits down two places, then add the new bits at the top.

I have not tried this approach yet - not sure whether floating-point conversions will mess with it. It also involves a bit of bit logic to disassemble and re-assemble the value - something I would prefer not to do.
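In idealized integer arithmetic (setting aside the fixed-point rounding concern above), the scheme per channel works out like this Python sketch:

```python
# Simulate solution 2 for one channel: each overdraw writes 2 bits into the
# high bits (src = bits << 6), while blending ONE, CONSTANT_COLOR with a
# constant of 0.25 shifts the existing contents down 2 bits (dst >> 2).
# Idealized 8-bit integer math; real hardware rounding may differ.

def overdraw(channel, two_bits):
    assert 0 <= two_bits < 4
    return (two_bits << 6) | (channel >> 2)

def unpack(channel):
    """Recover the four 2-bit entries, oldest first."""
    return [(channel >> shift) & 0b11 for shift in (0, 2, 4, 6)]

channel = 0
for bits in (0b01, 0b10, 0b11, 0b00):  # four overdraws
    channel = overdraw(channel, bits)

print(unpack(channel))  # [1, 2, 3, 0] -- oldest entry ends up in the low bits
```

The open question is exactly the one raised above: the blend constant is stored as 64/255, not an exact 0.25, so whether the shifted bits survive the round trip on real hardware has to be tested.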

So - does anyone have any ideas?
(FYI: this is more of a research project, so I can use Geforce8 extensions etc. if need be)

Is there a reason why you’re doing it in four passes now? Are the computations dependent on each other?

Your bit logic idea is interesting. I was told that in Cg, or in GLSL without a #version header in the file, floatToRawIntBits() and intBitsToFloat() can be used with the NVidia 8xxx drivers to do your own packing manually. I haven’t tested them yet.

I would also suggest this paper, which describes a single-pass D3D10 solution to your problem:
www.sci.utah.edu/~bavoil/research/kbuffer/StencilRoutedABuffer_Sigg07.pdf

<u>Other Ideas</u>

A 2 Pass Solution Which Enables Larger IDs

This requires 2 passes, but doesn’t depend on destination alpha blending (so you can pack 4 items into your pixel list). It does require a stencil+z buffer.

Set the blend mode to MAX and draw pixels with {R=id, G=-id}, masking out writes to B and A. Set the depth function to pass on greater-than, but disable the depth test for this first pass; instead use the stencil test to allow only 2 writes into the framebuffer.

For the second pass, mask out writes to R and G, draw pixels with {B=id, A=-id}, turn the depth test on, and set up the stencil to allow only 2 more writes into the framebuffer.

Of course you have to pre-set the framebuffer so that the first blends work, and the results within the RG and BA groups are unordered, so you will have to re-sort RG and BA manually (if order is important to you).

If you use -id (as I suggested) then you need FP16 at minimum, otherwise you could use 1-id instead for INT8 IDs.
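Here is a small Python sketch of one MAX-blended pass of the scheme above, using the INT8-friendly 1-id variant (so the negated channel stays in [0,1]):

```python
# Simulate one MAX-blended pass: R accumulates max(id), G accumulates
# max(1 - id), so 1 - G recovers min(id). The stencil test is modeled by
# admitting only the first two fragments.

def max_blend_pass(ids):
    r = g = 0.0
    for frag_id in ids[:2]:        # stencil allows only the first two writes
        r = max(r, frag_id)        # R channel: src = id,     MAX blend
        g = max(g, 1.0 - frag_id)  # G channel: src = 1 - id, MAX blend
    return r, 1.0 - g              # (larger id, smaller id)

hi, lo = max_blend_pass([0.25, 0.75, 0.5])  # third fragment stencil-rejected
print(hi, lo)  # 0.75 0.25
```

This shows why the pair comes out unordered with respect to draw order: MAX blending only preserves the extremes, so the resolve step has to sort within each RG/BA group.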

However I have a feeling (though I haven’t profiled this yet) that blended FP16 and FP32 writes are slower than non-blended ones. I also know that FP16 blends are 2x as costly as INT8 on some G8xxx cards, and that FP32 blends are 2x-4x as costly as INT8.

Or Speed up Depth Peeling

So you might want to simply use transform feedback on your vertex shader pass and reuse that output for the other three color-masked depth-peeling (no blending) passes. Then profile the depth peeling against your other (non-four-pass) solution.

OOPS,

GL doesn’t have a way to do a texture read from a multisample FBO. So the Stencil Routed A Buffer is D3D10 only.

Damn!

Nice paper, thanks! :slight_smile: I’ve been trying to figure out a good way to accomplish the same, but hadn’t come up with anything that would work on current hardware. This looks like it, though. :slight_smile: I’ll have to play with this technique some!

Nice, and a strong argument for having a publicly available spec.

Woah, thanks a lot Timothy Farrar, lots of nice tips there.

However, it seems that my bit packing is working better than expected, even on my Geforce 6800, using float values to unpack the bits.

I will post a demo here in a few weeks using the technique (assuming no more problems) and you can take a look. (It is nothing too flashy, just modifying an old Humus demo to test out some techniques)

Sorry to resurrect a dead thread here, but the topic is very related to what I’m currently trying to find a way to do. In a fragment shader, I’d like to write the color output to a different FBO color attachment depending on how many times the pixel has been written in this rendering pass. Basically, I’d like to do depth peeling of multiple layers in a single pass.

I figure that if I write out the four closest color values to four different buffers, along with their depth values, then I could draw a final fullscreen quad to sort and composite the colors, giving me 4-pass depth peeling in a single pass.
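The resolve step sketched above (sort the four gathered layers by depth, then composite) can be simulated for one pixel in Python; the layer data here is made up for illustration:

```python
# Simulate the fullscreen-quad resolve for one pixel: four (depth, rgb, alpha)
# layers arrive in arbitrary order, get sorted nearest-first, and are
# composited front to back with the "over" operator.

def resolve(layers):
    out_rgb, out_a = [0.0, 0.0, 0.0], 0.0
    for depth, rgb, alpha in sorted(layers):      # nearest layer first
        weight = (1.0 - out_a) * alpha            # remaining transmittance
        out_rgb = [c + weight * s for c, s in zip(out_rgb, rgb)]
        out_a += weight
    return out_rgb, out_a

layers = [                          # arbitrary write order, as off the GPU
    (0.7, (0.0, 0.0, 1.0), 0.5),
    (0.2, (1.0, 0.0, 0.0), 0.5),
    (0.9, (1.0, 1.0, 1.0), 1.0),
    (0.4, (0.0, 1.0, 0.0), 0.5),
]
rgb, a = resolve(layers)
print(rgb, a)  # [0.625, 0.375, 0.25] 1.0
```

Front-to-back compositing with an accumulated alpha is used here so the resolve can stop early once the pixel is opaque, which matches what a fullscreen resolve shader would do.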

In order to do this, I need to read and write depth values associated with each color buffer. Unfortunately, these types of texture feedback loops cause undefined behavior, according to the spec.

I’d settle for any type of persistent per-fragment storage that I could access for both read and write, multiple times during the course of a single render pass.

Any ideas?

The only thing that might even remotely help you is nVidia’s newest extension: http://developer.download.nvidia.com/opengl/specs/GL_NV_texture_barrier.txt
It’s basically a texture-cache invalidation command. If you read and write the same pixel between two such commands, you’ll get garbage onscreen. But you can devise a spatial-partitioning scheme that eliminates such cases, at the obvious expense. I successfully used it to create custom blending.

Another idea is (of course) multi-layer depth peeling (there’s a DX10 doc on it; it’s identical in GL, iirc). Maybe this will be of most use to you, though perhaps you’ve already tried it and found it wasn’t useful.

Final idea is to go OpenCL and rasterize the fragments yourself.

Thanks. I don’t think the texture barrier will help in my case, but thanks for pointing it out, as it might be useful in some other development I’m working on.

I did stumble across a stencil routing trick for use with multisampled textures, pretty interesting. I’ve looked at several bucket-oriented solutions for depth peeling in a single pass, but in general they never guarantee a single pass, which complicates the implementation quite a bit.

Thanks again for the info.