Using many FBOs

Hello OpenGL gurus,

My usage: I need to have an arbitrary number of passes, where each pass renders to some kind of an arbitrary buffer which is then used as an input texture to the next pass.

Currently, each buffer is implemented by an FBO with a Texture attached to COLOR0. First I create the FBO (and the texture inside) with

glGenTextures(1, texIds, 0);
glBindTexture(GL_TEXTURE_2D, texIds[0]);
// ... here goes the usual glTexParameterf stuff ...
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, mWidth, mHeight, 0, GL_RGBA, GL_UNSIGNED_BYTE, null);
glGenFramebuffers(1, fboIds, 0);
glBindFramebuffer(GL_FRAMEBUFFER, fboIds[0]);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, texIds[0], 0);

Then whenever I need to render to the Buffer I call

glBindFramebuffer(GL_FRAMEBUFFER, fboIds[0]);

and whenever I need to set the Buffer’s contents as input to the next pass, I do

glBindTexture(GL_TEXTURE_2D, texIds[0]);

and everything works.
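Putting it together, one frame looks roughly like this (just a sketch; numPasses and renderPass() are placeholders, and fboIds/texIds are meant here as per-pass arrays):

// Sketch: fboIds[] / texIds[] hold one FBO + texture per pass,
// renderPass() stands in for the actual draw calls of that pass.
for (int pass = 0; pass < numPasses; pass++) {
    if (pass > 0) {
        // output of the previous pass becomes the input texture of this one
        glBindTexture(GL_TEXTURE_2D, texIds[pass - 1]);
    }
    glBindFramebuffer(GL_FRAMEBUFFER, fboIds[pass]);
    renderPass(pass);
}
// the last texture, texIds[numPasses - 1], is then sampled when drawing to the screen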

The problem is that I read in numerous sources that the BindFramebuffer call is very slow (flushes the whole pipeline) especially on OpenGL ES targets which I work with. Song Ho Ahn, for instance, clearly states that

“(…) Switching framebuffer-attachable images is much faster than switching between FBOs.”
(OpenGL Frame Buffer Object (FBO))

Thus I am thinking of switching to a design where there is only one Framebuffer and many Textures; each time I need to render to a buffer, I would just attach the given texture to the single Framebuffer’s COLOR0.
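Something like this (again only a sketch, using the same placeholders as above; the completeness check is there just to be safe):

// Sketch of the single-FBO variant: one framebuffer, re-attach a texture per pass.
glBindFramebuffer(GL_FRAMEBUFFER, fboIds[0]);      // the one and only FBO
for (int pass = 0; pass < numPasses; pass++) {
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, texIds[pass], 0);
    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
        // handle the error
    }
    if (pass > 0) {
        glBindTexture(GL_TEXTURE_2D, texIds[pass - 1]);   // previous pass as input
    }
    renderPass(pass);
}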

Do you think this is a good idea? Or maybe some totally different design would be best here - what about RenderBuffers?

[QUOTE=Utumno;1285256]I need to have an arbitrary number of passes, where each pass renders to some kind of an arbitrary buffer which is then used as an input texture to the next pass.


The problem is that I read in numerous sources that the BindFramebuffer call is very slow (flushes the whole pipeline) especially on OpenGL ES targets which I work with. [/QUOTE]

There’s a kernel of truth here, but as stated this isn’t correct.

Unlike desktop GPUs, which are pipelined to render the current frame as quickly as possible, embedded OpenGL ES GPUs often defer all fragment work for a frame so that it executes one frame later (tile-based deferred rendering). This allows them to radically reduce memory bandwidth and use cheap CPU DRAM and slower memory buses, reducing cost. So as you submit a frame, you always have to keep in mind that the driver isn’t executing your fragment work now; it’s just book-keeping a list of transformed vertices to rasterize and shade later.

Some embedded GPU GL drivers use the FBO object as a container for this unexecuted fragment work, and that work follows the FBO down the pipeline. If you re-render using the same FBO, you can trigger a full pipeline flush (including a fragment run), because the driver needs to “finish up” the fragment work previously queued for that FBO before it can attach new work. This is “very” expensive. You can create a ring-buffer pool of FBOs to alleviate this. However, you can’t go nuts, because there’s a memory cost per FBO (separate from the space required by its attachments). Check with your GPU driver writer for details.
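For instance, a minimal sketch of such a ring buffer, written in the same Android-style GL bindings the OP is using (pool size and names are arbitrary, and these would live in whatever renderer class owns the passes):

// Pre-create a small ring of FBOs so that back-to-back render-to-texture passes
// don't have to re-use the FBO whose fragment work may still be queued.
final int POOL_SIZE = 4;            // tune this; each FBO costs driver memory
int[] fboPool = new int[POOL_SIZE];
int nextFbo = 0;

void createFboPool() {
    glGenFramebuffers(POOL_SIZE, fboPool, 0);
}

// Call at the start of each render-to-texture pass: take the next FBO in the
// ring and attach the texture this pass should render into.
int acquireFbo(int targetTex) {
    int fbo = fboPool[nextFbo];
    nextFbo = (nextFbo + 1) % POOL_SIZE;
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, targetTex, 0);
    return fbo;
}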

And always run a GPU profiler so you can see how your rendering work is executing on the GPU. These can make it pretty easy to see when an unintended synchronization is happening in the driver.

[QUOTE=Utumno;1285256]Song Ho Ahn, for instance, clearly states that[/QUOTE]

It appears Song Ho Ahn is probably talking about desktop GPUs here. Even though GL and GL-ES provide similar interfaces, you have to keep in mind that the underlying GPUs work differently.

I only have experience with desktop GPUs, and it’s been a while since I tested FBO performance.

Basically, switching FBOs was very expensive. But you can also bind several render targets (e.g. textures) to one FBO and then just switch which attachment slot is active. Note that the maximum number of attachment slots on an FBO (GL_MAX_COLOR_ATTACHMENTS) is separate from the maximum number of simultaneously active slots (GL_MAX_DRAW_BUFFERS).
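Roughly like this (a sketch only; I’m writing it in the Android-style bindings the OP uses, where glDrawBuffers needs ES 3.0, and numPasses/pass are placeholders):

// Attach all pass outputs to one FBO once
// (numPasses must not exceed GL_MAX_COLOR_ATTACHMENTS)...
glBindFramebuffer(GL_FRAMEBUFFER, fboIds[0]);
for (int i = 0; i < numPasses; i++) {
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                           GL_TEXTURE_2D, texIds[i], 0);
}

// ...then, per pass, just select which attachment receives fragment output 0,
// instead of re-attaching any textures:
int[] activeBuf = { GL_COLOR_ATTACHMENT0 + pass };
glDrawBuffers(1, activeBuf, 0);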

The fastest way was using an array texture and switching the target layer via a uniform and a very simple geometry shader (which just writes that uniform into gl_Layer).

[QUOTE]“(…) Switching framebuffer-attachable images is much faster than switching between FBOs.”[/QUOTE]

There are multiple sources claiming the opposite.
For example, this Valve presentation about porting games from Direct3D to OpenGL on Linux: https://developer.nvidia.com/sites/default/files/akamai/gamedev/docs/Porting%20Source%20to%20Linux.pdf

[QUOTE]Do not create a single FBO and then swap out attachments on it. This causes lots of validation in the driver, which in turn leads to poor performance.[/QUOTE]

My own experience would lead me to agree with this statement.
The engine I’m working on is designed for powerful desktop GPUs and does a lot of render-to-texture. The poor performance I was seeing was caused by glFramebufferTexture.
Having a pool of FBOs that always keep their attachments solved my performance problem.

[QUOTE=Vylsain;1285296]

Do not create a single FBO and then swap out attachments on it. This causes lots of validation in the driver, which in turn leads to poor performance.

My own experience would lead me to agree with this statement.[/QUOTE]

Same here. On the mobile OpenGL ES and desktop OpenGL driver implementations I’m familiar with, reconfiguring an FBO is sometimes (or always, depending on the driver) treated as a full deletion and recreation of the FBO state, which can be very expensive.

On NVidia years ago, there was a significant performance benefit from keeping separate pools of precreated FBOs binned by render target resolution and internal format, and fetching an FBO from the appropriate pool when rendering to textures. Apparently, attaching new render targets with the same resolution and format bypassed some internal driver overhead. However, I haven’t re-checked that recently.
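A sketch of what that kind of binning could look like (the class, key format, and method names are made up for illustration; I’ve written it in the same Android-flavoured GL bindings the OP uses, even though my experience was on desktop):

// Sketch: FBO pools binned by (width, height, internal format). Purely illustrative;
// GL calls follow the OP's snippets (i.e. assume a static import of android.opengl.GLES20.*).
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

class FboPools {
    private final Map<String, ArrayDeque<Integer>> pools = new HashMap<>();

    private static String key(int w, int h, int internalFormat) {
        return w + "x" + h + "/" + internalFormat;
    }

    // Fetch an FBO that has only ever seen attachments of this size/format,
    // or create a fresh one if the bin is empty.
    int acquire(int w, int h, int internalFormat) {
        ArrayDeque<Integer> bin =
                pools.computeIfAbsent(key(w, h, internalFormat), k -> new ArrayDeque<>());
        if (!bin.isEmpty()) {
            return bin.pop();
        }
        int[] fbo = new int[1];
        glGenFramebuffers(1, fbo, 0);
        return fbo[0];
    }

    // Return the FBO to its bin once the render-to-texture pass is finished.
    void release(int fbo, int w, int h, int internalFormat) {
        pools.get(key(w, h, internalFormat)).push(fbo);
    }
}

The caller still binds the FBO and attaches the actual render target; the win seemed to be that every FBO coming out of a given bin had only ever seen attachments of that one resolution and format.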

That probably behaves very much like an array texture. I also remember that this was Valve’s approach when they implemented their OpenGL backend for the Source engine. But again, that was some years ago.