global storage register(s)?

Hi all,

Recently, more and more computer vision algorithms are being ported from the CPU to the GPU, increasing performance by up to an order of magnitude.
The only drawback I’ve encountered so far is that inter-pixel dependencies aren’t possible, due to the parallel processing architecture of the graphics chip. This can easily be worked around by copying the framebuffer to a texture and sampling that texture at the desired positions. Since this can be done on the graphics card itself, the performance drop is acceptable.
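Roughly, the workaround looks like this (just a sketch; tex, width and height are placeholders for whatever texture and framebuffer size is already set up):

```cpp
#include <GL/gl.h>

// Sketch of the copy-to-texture workaround: after one pass has been drawn,
// copy the framebuffer into 'tex' so that the next pass can sample
// previously written pixels. 'tex', 'width' and 'height' are placeholders.
void CopyFramebufferToTexture(GLuint tex, GLsizei width, GLsizei height)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    // Copies the lower-left width x height region of the read buffer into
    // mipmap level 0 of the currently bound texture.
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);
}
```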

Right now, I’m in need of some kind of storage register that accumulates values originating from a fragment shader or even the fixed-function pipeline, i.e. the sum of all the color values that were drawn for a primitive, in 4xfp16 or 4xfp32 format (4xub8 would also be a great help).

I’ve been looking through the spec and virtually all extensions, but the only extension that comes close to this is the occlusion query. That is, however, not of much use to me…

Is there any way to accomplish this? Maybe there are some hardware implementations that do not necessarily clear the temporary parameters of the fragment shaders, or maybe there’s a much simpler solution that I overlooked…

For the moment, I’m forced to use glReadPixels, which comes with a large performance drop.
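The fallback currently looks roughly like this (width and height are placeholders):

```cpp
#include <GL/gl.h>
#include <vector>

// Sketch of the readback fallback: pull the whole color buffer across the
// bus and accumulate on the CPU. This is exactly the transfer a GPU-side
// accumulator would avoid.
float SumFramebufferRGBA(GLsizei width, GLsizei height)
{
    std::vector<float> pixels(static_cast<size_t>(width) * height * 4);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, &pixels[0]);

    float sum = 0.0f;
    for (size_t i = 0; i < pixels.size(); ++i)
        sum += pixels[i];   // sum over all channels of all pixels
    return sum;
}
```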

Any help is appreciated, as this would retain the performance gained from using the GPU instead of the CPU.

Thanks!

Nico

If there was a counter that all pixels could contribute to, then that wouldn’t be parallel anymore, and the performance gain of parallelism would be negated. Thus, no currently available hardware supports anything like this.

Now, there might be ways you could cheat in the hardware; especially if you run longer shaders, you could have a FIFO of contributions to the accumulator that are retired one per clock, so if you otherwise used 16-cycle shaders, it wouldn’t block anything. You’re not alone in asking for global counters, so at some point, I believe they’ll appear.

Btw: Exposure control is one thing that would benefit from this, for example :)

Originally posted by jwatte:
If there was a counter that all pixels could contribute to, then that wouldn’t be parallel anymore, and the performance gain of parallelism would be negated.

In the occlusion query extension, all pixels contribute to the counter. I’m not sure how big of a performance drop is caused by using this extension.
However, I don’t suspect it has much impact, because the extension was designed to achieve higher speeds by testing the bounding box of an object first. If the performance drop were, say, 8x due to an 8x2 or 8x1 pixel pipeline, then in most cases it would be faster to draw the object right away instead of using occlusion tests first.
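For reference, using the query boils down to something like this (a sketch; DrawObject() stands in for the actual draw call and the ARB_occlusion_query entry points are assumed to be loaded already):

```cpp
// Every fragment that passes the depth test increments a single counter,
// yet rasterization itself stays parallel.
GLuint query;
glGenQueriesARB(1, &query);

glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
DrawObject();                       // placeholder for the actual rendering
glEndQueryARB(GL_SAMPLES_PASSED_ARB);

GLuint samplesPassed = 0;
glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &samplesPassed);
glDeleteQueriesARB(1, &query);
```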

It’s likely that every pixel pipeline has its own counter and that the sum of these counters is computed at the end. This does not interfere with the parallel execution of the pixel pipelines.

So it seems feasible in hardware while maintaining parallel processing. Let’s hope it will appear soon :)

N.

Occlusion query is close, since it is a very simple “reduction operator” over all the fragments. I’ve been begging for other reduction operators (particularly sum, max, and min) for a while, but I’m not expecting them anytime soon. Occlusion query only requires a single bit of information per pixel to be fed into the shared accumulator; other reduction operations require the full 8/16/32-bit result.

One workaround that stays on the card is to use log(n) passes to compute the result: On each pass, write to a buffer that is 1/2 the size (in each dimension) of the one you used on the previous pass. Use the buffer from the previous pass as a texture. The fragment shader computes the sum (or whatever) of four pixels from the previous pass. You can of course reduce the size by more than 1/2 on each pass. This is still slow, but on current commodity hardware it should still be faster than glReadPixels and computing it on the CPU.
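The pass loop would look roughly like this (a sketch; BindAsRenderTarget, BindAsTexture, UseShader, DrawFullScreenQuad and the sum4 shader are placeholders for whatever render-to-texture mechanism you have available):

```cpp
#include <algorithm>   // std::swap

int size = 512;                        // assume a square power-of-two source
int src = 0, dst = 1;                  // indices of two ping-pong buffers
while (size > 1)
{
    size /= 2;                         // each pass halves both dimensions
    BindAsRenderTarget(dst, size, size);
    BindAsTexture(src);                // previous result becomes the input
    UseShader(sum4);                   // output = sum of the 2x2 source block
    DrawFullScreenQuad();
    std::swap(src, dst);               // this pass feeds the next one
}
// 'src' now refers to a 1x1 buffer holding the sum of all original pixels.
```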

Originally posted by Jesse Hall:
I’ve been begging for other reduction operators (particularly sum, max, and min) for a while.
I would love those things too.

I’m interested to hear from the driver guys: what is the difficulty in implementing this?

Originally posted by -NiCo-:
Hi all,
Right now, I’m in need of some kind of storage register that accumulates values originating from a fragment shader or even the fixed-function pipeline, i.e. the sum of all the color values that were drawn for a primitive, in 4xfp16 or 4xfp32 format (4xub8 would also be a great help).

That sounds as if it could be done with the accumulation buffer.
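Something along these lines, roughly (numFrames and DrawFrame are placeholders; note that glAccum accumulates per pixel over successive passes):

```cpp
glClear(GL_ACCUM_BUFFER_BIT);
for (int i = 0; i < numFrames; ++i)
{
    DrawFrame(i);                          // placeholder draw call
    glAccum(GL_ACCUM, 1.0f / numFrames);   // add the scaled frame, per pixel
}
glAccum(GL_RETURN, 1.0f);                  // write the accumulated image back
```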

Jan.

Originally posted by Jan:
That sounds as if it could be done with the accumulation buffer.

I don’t think it is possible to get the sum of all values in the accumulation buffer in one operation. So I would still have to read the whole accumulation buffer back to the CPU to do the adding myself.

N.

With floating-point pbuffers and shaders you can sum it up in a few rendering passes: you sum 2x2 pixels per step and go down to a 1x1 pbuffer.
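For the per-pass shader, something like this should do (assuming nearest filtering on the source texture; src and srcSize are made-up uniform names):

```cpp
// Hypothetical GLSL fragment shader for one reduction pass: each output
// pixel sums the 2x2 block of source texels it covers.
const char* kSum2x2FragmentShader =
    "uniform sampler2D src;                                          \n"
    "uniform vec2 srcSize;   // size of the source buffer in texels  \n"
    "void main()                                                     \n"
    "{                                                               \n"
    "    vec2 base  = (2.0 * floor(gl_FragCoord.xy) + 0.5) / srcSize;\n"
    "    vec2 texel = 1.0 / srcSize;                                 \n"
    "    gl_FragColor = texture2D(src, base)                         \n"
    "                 + texture2D(src, base + vec2(texel.x, 0.0))    \n"
    "                 + texture2D(src, base + vec2(0.0, texel.y))    \n"
    "                 + texture2D(src, base + texel);                \n"
    "}                                                               \n";
```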

I think I’m going to make it easy on myself and buy the next generation of graphics cards when they are available.

They have support for mipmapping of floating-point textures. So if I could enable automatic mipmapping of these floating-point textures up to the 1x1 level, all that remains is a simple multiplication by the area of the texture.
It’s basically the same as what you suggested, but performed more efficiently. Thanks a lot for the suggestion though! I will use your approach until those graphics cards are available.
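In other words, something like this (a sketch; it assumes a square power-of-two float texture whose mipmaps the driver can generate automatically, and tex/width/height are placeholders):

```cpp
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE);  // or the SGIS variant
// ...render to / update the texture so the mipmap chain is regenerated...

int topLevel = 0;                     // mipmap level with dimension 1x1
for (GLsizei s = width; s > 1; s >>= 1)
    ++topLevel;

float average[4];                     // RGBA average over the whole texture
glGetTexImage(GL_TEXTURE_2D, topLevel, GL_RGBA, GL_FLOAT, average);

float sum[4];                         // average * texel count = sum
for (int c = 0; c < 4; ++c)
    sum[c] = average[c] * float(width) * float(height);
```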

This, however, will only work if the mipmapping performs a simple box filter.
Does anyone know where I can find some info about the actual implementation of the mipmapping filters?
For instance, is the FASTEST hint a box filter and the NICEST hint some kind of Gaussian filtering?

Thanks for the replies all :)

N.

Mipmap generation really should be a linear box filter, because that’s what the mipmap filters expect to get. Results will look strange otherwise - try Ultima IX for a good example of broken mipmaps.

Regarding your initial question: averages over a full frame were what you wanted, and everything’s clear now? Your first post sounded a little different. I wanted to throw in blending and the stencil buffer (as a per-pixel integer counter) because I just didn’t understand precisely what you wanted :-/

My apologies if my first post wasn’t clear enough.

My original problem is:

Draw a polygon that writes values to the pixels it covers. Normally these are 32-bit color values, but I need floating-point precision, which could be achieved by using floating-point pbuffers. What I want is the sum of all these values, computed in an efficient way.
The stencil buffer does not provide enough accuracy and works on a per-pixel basis, i.e. I cannot retrieve the sum of the stencil buffer in an efficient way.
As for blending: as far as I know, blending is not supported for floating-point buffers, and it also works on a per-pixel basis.

So my first thought was: if there were persistent variables in the fragment shader, I could initialize such a register to zero before drawing the primitive, let the fragment shader update it as it is executed for each pixel, and read out the final value of that register once the primitive is fully drawn.
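In purely hypothetical pseudocode (none of this exists in any current shading language or extension, it’s just to illustrate the idea):

```cpp
// Purely hypothetical: a "persistent" register shared by all fragments.
//
//   persistent vec4 total;          // one register shared by all fragments
//
//   void main()
//   {
//       vec4 value = ComputeValue();
//       total += value;             // accumulated over every fragment drawn
//       gl_FragColor = value;
//   }
//
// Host side (equally hypothetical):
//   ResetAccumulator();             // total = vec4(0.0)
//   DrawPrimitive();
//   ReadAccumulator(&sum);          // final value once the primitive is drawn
```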

N.

EDIT:

Jesse –

Whoops. I missed your post. I’m saying what he said, but a little more explicitly.

/EDIT

jwatte –

Just because there are multiple contributors to a single value doesn’t mean it completely kills parallelism. Sure, it kills the rather strict SIMD parallelism that most GPUs are geared towards today, but it is certainly possible to parallelize an operation as simple as accumulating into a “global” register.

In general, the processors in a GPU map each stream element to a new stream element (fragment killing is an exception). This is a classic data parallel operation. NiCo is asking for what’s called a “reduction,” or a map from a stream to a scalar. If the reduction operation is sufficiently simple, you can transform a reduction into what’s called a “parallel prefix” operation. With enough parallelism, you can perform O(N) operations in O(log N) time.

What characterizes a “simple” operation? Certainly, accumulation is simple enough. It turns out that the operation needs to be associative: if you want to add a, b, c, d, etc., compute ab = a op b, cd = c op d, etc., then compute abcd = ab op cd. If the operator is also commutative, then you also have flexibility in the order in which you perform the operations.
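A toy CPU version, just to make the pairing explicit (with addition as the associative op):

```cpp
#include <vector>

// Each level combines pairs, so n values are reduced in O(log n) levels,
// and every operation within a level is independent of the others
// (i.e. could run in parallel).
float TreeSum(std::vector<float> values)
{
    while (values.size() > 1)
    {
        std::vector<float> next;
        for (size_t i = 0; i + 1 < values.size(); i += 2)
            next.push_back(values[i] + values[i + 1]);  // ab = a op b, cd = c op d, ...
        if (values.size() % 2 != 0)
            next.push_back(values.back());              // odd element carried forward
        values.swap(next);
    }
    return values.empty() ? 0.0f : values.front();
}
```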

Some operations also give you an “early out” (somewhat similar to the lazy evaluation of || and &&). For example, if you wanted to multiply a list of values and you ever encountered a zero, you would know that the entire result is zero without performing the entire computation.

So occlusion queries are a specific instance of a reduction operator. In fact, they’re really two: there’s the complete query, which returns the number of visible fragments (and is probably performed as a coarse parallel prefix due to hierarchical rasterization), and the simple query, which short-circuits the count and only returns true/false. In the first case the operation is an increment; in the second case it is simply ||.

So, of course you are right that no hardware REALLY supports what NiCo wants right now, but if people really wanted it, it wouldn’t be that big of a deal to see it in future hardware.

-Won