Chained FBO textures

Jay_Cornwall · August 1, 2005, 5:46am

I’m currently researching image processing techniques on the GPU for a UK visual effects plug-in developer. Thanks to NVIDIA’s work on the EXT_framebuffer_object extension all is rosy at the moment and I’m getting good performance advantages over powerful dual Xeon systems.

There is one thing that’s bugging me about the new extension, though. The standard explicitly disallows writing to a bound texture, presumably to avoid situations where one might try to access texels that have already been written. This is unfortunate for me, however, because although I can guarantee not to read texels after writing them, the driver refuses to play ball.

Is there a technical hardware reason for this? I’m having to compensate by using extra texture memory (a 2K image texture is quite large…) and swapping between textures on each pass, both of which are inefficient and cost me performance.

I’d love to ask an NVIDIA developer about this but the whole registration procedure is so painful and inhibiting that I can’t get through it. Asking questions about products and installed user bases is great if you’re that far into development, but this project is still in the early prototyping stages.

Humus · August 1, 2005, 1:08pm

You want to read the same texture as you’re rendering into? Well, that totally breaks the pipeline, so to get it to work reliably you’d need insane synchronization, and you’d therefore likely see worse performance than what you get by ping-ponging between render targets. In the special case where you’d only access the particular pixel that you will write into and only do a read-write operation it could possibly work depending on the hardware, but it would require the hardware to be built to support that kind of stuff.
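
For readers unfamiliar with the term, ping-ponging means alternating two buffers between the source and destination roles on each pass. What follows is only a CPU-side sketch in C, not actual OpenGL code: `run_pass` and `ping_pong` are hypothetical names, and the `+ 1.0f` merely stands in for whatever a real fragment shader would compute.

```c
#include <stddef.h>

/* One "pass": reads only from src and writes only to dst, like a fragment
 * shader sampling one texture while rendering into a different one.
 * The + 1.0f is a placeholder for the real per-pixel computation. */
static void run_pass(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}

/* Runs `passes` passes, swapping the roles of the two buffers after each
 * one. The initial input is expected in `a`. Returns the buffer that holds
 * the final result (which is `a` itself when passes == 0). */
static float *ping_pong(float *a, float *b, size_t n, int passes)
{
    float *src = a;
    float *dst = b;
    for (int p = 0; p < passes; ++p) {
        run_pass(src, dst, n);
        float *tmp = src;   /* swap roles for the next pass */
        src = dst;
        dst = tmp;
    }
    return src;             /* after the swap, src is the last buffer written */
}
```

With three passes starting from `a`, the result ends up in `b`. No pass ever reads the buffer it is writing, which is the property the extension enforces, at the cost of the second buffer.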

Jay_Cornwall · August 2, 2005, 1:19am

I can’t see why it would affect the pipeline. RTT with different input/output textures already sets up a stream from video memory to video memory, it wouldn’t be any different for the same input/output textures. Nor can I see any “synchronisation” problems - there are no interdependencies between different pixel calculations, since I’m only advocating writing to a single pixel in the texture after reading it in. No danger of reading a pixel after writing it, since you couldn’t calculate the output value before reading the input value anyhow. If anyone tries it, sure, just return undefined values like the way it works right now.

My only concern is with caching and register issues, and whether it adds overhead to hold cache lines which may be written to and read from. But I don’t see any other reason this support wouldn’t already be in the hardware.

Overmind · August 2, 2005, 10:26am

Even if you only read the same pixel where the current output is going, there are synchronisation issues. Imagine two fragments with the same target screen coordinate being rendered in parallel, such that the result of the first one is not yet written when the second one wants to read it, requiring a pipeline stall.

I assume in your special case there is absolutely no overdraw, making this example void. But this is not entirely true. You would draw multiple passes, so between these passes you have overdraw. Take this and a sufficiently small image area (or a sufficiently large parallel pipeline), and you have exactly this issue.

The main point is, even if you’re sure these issues don’t exist in your special case, the hardware would still have to be built for the general case. And that’s not as trivial as it might seem, you would at least need a mechanism to detect data dependencies between fragments and a mechanism to stall a fragment pipe if such a dependency exists.

The alternative is you just disallow the entire scenario by not allowing a texture to be bound as source and target for a single primitive. Then you can make sure all rendering to a texture is finished before allowing rendering that uses this texture to enter the pipeline. This is an entirely software based solution that requires no additional hardware support.
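
The two-fragment hazard described in this post can be mimicked on the CPU. This is a simplified sketch in C under the assumption that both fragments want to add a value to the same texel; the function names are hypothetical and nothing here models real GPU scheduling.

```c
/* Serial case: fragment 2 starts only after fragment 1 has written,
 * so it reads the up-to-date texel. */
static float accumulate_serial(float texel, float v1, float v2)
{
    texel = texel + v1;     /* fragment 1: read, add, write */
    texel = texel + v2;     /* fragment 2 sees fragment 1's result */
    return texel;
}

/* Overlapped case: both fragments read before either one writes, as could
 * happen on parallel pipes with no dependency detection and no stall. */
static float accumulate_overlapped(float texel, float v1, float v2)
{
    float read1 = texel;    /* fragment 1 reads */
    float read2 = texel;    /* fragment 2 reads the same, now stale, value */
    texel = read1 + v1;     /* fragment 1 writes */
    texel = read2 + v2;     /* fragment 2 overwrites; v1 is lost */
    return texel;
}
```

Starting from a texel value of 10 with contributions 1 and 2, the serial version yields 13 while the overlapped version yields 12. Avoiding the second outcome needs either the stall mechanism described in this post or the software rule that a texture is never source and target at once.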

Jay_Cornwall · August 2, 2005, 11:59pm

Perhaps it seems easier if I phrase it differently. Accumulation buffers exist, but why were they omitted for FBOs?

It’s causing headaches for image processing algorithms which very frequently need to accumulate values across passes.

ZbuffeR · August 3, 2005, 12:21am

BTW, does anybody know if accumulation buffers are actually supported in hardware on current cards?
And if so, on which ones?

Won · August 3, 2005, 5:31am

ZbuffeR – maybe this should be a different thread?

Jay – if all your pass does is a pixel-to-pixel mapping, then I think both ATI and NVIDIA have “blessed” using the same PBuffer as both texture and target. I don’t know how reliable this will be for future hardware/driver combinations, but I suppose that’s the price of exploiting undocumented behavior.

-W

Overmind · August 3, 2005, 12:46pm

Accumulation buffers are something completely different from FBOs. There are no special cases that have to be caught by the hardware, just a very limited set of full screen operations. If an accumulation buffer is what you want, you shouldn’t be using FBOs.

And if accumulating a value is all you want, I don’t see why you would need to draw to a texture that is currently bound. You can just use blending.

If you allow drawing to a texture that is currently in use, you create synchronisation issues in the pipeline (see my previous post for a concrete example). These issues don’t exist with the accumulation buffer.

Jay_Cornwall · August 4, 2005, 12:19am

My applications of the GPU are far beyond simple OpenGL blending.

All of my work has to use FBOs to support fast multiple output, multiple pass algorithms in fragment shaders. My use of identical input/output textures is indeed to accumulate values across passes, since FBOs lack accumulation buffers which would otherwise do the job quite well.

There’s little difference between this:

*(accum buffer) = *(accum buffer) + value

And this:

*(output texture) = *(input texture) + value

That’s the special case I was advocating for FBOs, at least, where output and input texture identify the same area of memory. Clearly accumulation buffers are already practical in hardware and (I’m sure) there are few architectural differences, if any, between the two implementations.

Won - thanks for the heads up on pbuffers. I’ve tested this with FBOs and the results were mostly zero or NaN; I’ll see if pbuffers behave any differently. Ideally, though, I’d like to avoid pbuffers, since FBOs incur considerably less overhead.

Overmind · August 4, 2005, 9:40am

The suggestion that you should use the accumulation buffer instead of FBOs was only rhetorical. I know there is a difference in expressiveness, but after all you brought up the comparison. I just wanted to explain why render-to-bound-texture needs more hardware features than an accumulation buffer, even in the limited case you propose, let alone the general case. So your proposal wouldn’t be possible in hardware for a while, even if the spec allowed it.

As for the accumulation buffer inside an FBO, that’s easy. Just make a second FBO and render the first one full screen to the second with an appropriate blend mode. That’s the same number of buffers as a single FBO with an accumulation buffer would need. If you need to combine two textures in a more complex way than blending equations allow, I’m afraid with current hardware you’re out of luck: you’ll need a third buffer.

Jay_Cornwall · August 5, 2005, 5:56am

I’m not sure I understand your point that FBO accumulation buffers require more hardware features than standard accumulation buffers. Perhaps you could explain exactly which functionality is missing? The equations I posted previously demonstrate that the algorithms are identical, and that both are just reading and writing to an area of video memory.

Perhaps I framed the question badly at first by not saying that I just wanted to implement accumulation buffers with FBOs. There is indeed a difference between using multiple surfaces on an FBO for an accumulation buffer and using a real accumulation buffer:

Output FB + Accum1 FB + Accum2 FB -> FBO approach
Output FB + Accum buffer -> FBO accumulation buffer approach

One requires considerably more memory than the other, particularly when you’re dealing with 2K textures. Switching render targets (even with glDrawBuffers) is also an expensive operation and one that I’d greatly like to avoid if at all possible.
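
To put a rough number on the memory difference, here is a small C sketch. The thread gives neither the exact resolution nor the texel format, so the 2048 x 2048 RGBA layout with 32-bit float channels used in the note that follows is purely an assumption, and `surface_bytes` is a hypothetical helper.

```c
#include <stddef.h>

/* Bytes needed for one uncompressed surface with no padding or mipmaps. */
static size_t surface_bytes(size_t width, size_t height,
                            size_t channels, size_t bytes_per_channel)
{
    return width * height * channels * bytes_per_channel;
}
```

Under those assumptions `surface_bytes(2048, 2048, 4, 4)` is 67,108,864 bytes (64 MiB), so the three-surface arrangement needs about 192 MiB against 128 MiB for the two-surface one.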

Jay_Cornwall · August 12, 2005, 5:48am

For the benefit of anyone who happens to pick up this thread in the future, the answer’s very simple.

glBlendFunc(GL_ONE, GL_ONE);
glEnable(GL_BLEND);

One accumulation buffer sorted.

(minor complications if you have multiple draw buffers, but I find there’s a very small performance impact to glClear() the other buffers before drawing)
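
For anyone wondering why those two calls amount to an accumulation buffer: with the default GL_FUNC_ADD blend equation, GL_ONE for both factors makes the framebuffer compute destination = source * 1 + destination * 1. The following is a CPU-side stand-in in C, not driver code; `blend_one_one` is a hypothetical name, and clamping, which depends on the render target format, is deliberately not modelled.

```c
#include <stddef.h>

/* Emulates additive blending for n values:
 * dst = src * GL_ONE + dst * GL_ONE. No clamping is applied. */
static void blend_one_one(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 1.0f + dst[i] * 1.0f;
}
```

Calling it once per pass on a zero-initialised `dst` leaves the running sum of all passes in `dst`, which is the `*(accum buffer) = *(accum buffer) + value` operation discussed earlier in the thread, done by the blending stage rather than by sampling the render target in a shader.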