BlitFrameBuffer woes

Tzupy · March 12, 2010, 11:00am

I am trying to achieve maximum AA quality using off-screen rendering, with framebuffer objects.
I also use Brian Paul’s Tile Rendering Library, since I want unlimited rendering sizes.
The code below uses 3 framebuffer objects, each with 1 renderbuffer: the first is large and multisampled, the second is large but not multisampled, and the third is small. All are in GL_ALPHA8 format.
I am drawing to the first one, blitting to the second, which has the same size, and then downsampling from the second to the third with BlitFrameBuffer.
The maxaa and dsamp (downsampling) parameters control the quality of this operation. Maxaa is set to 8, while tile size varies between 256 and 8192 and dsamp between 1 and 8.
With dsamp = 4 I should get an overall AA = 8 * 4 * 4 = 128, or with dsamp = 2 I should get only AA = 8 * 2 * 2 = 32 (I’m not very sure that I’m getting this quality).
I discovered that I can only use combinations such as tsize = 1024 with dsamp = 2, tsize = 512 with dsamp = 4, or tsize = 256 with dsamp = 8.
A larger tsize = 2048 with dsamp = 1 works, but then there is no downsampling. Higher tsize values fail, although in my previous tests with a single multisampled buffer I used tsize values as high as 8192.
To ensure correctness of the output with various tsize and dsamp values I modified the Tile Rendering Library this way:
In trBeginTile I multiplied the glViewport and glOrtho parameters by dsamp. I also scaled my geometry by dsamp.
My system is i7-920 6GB, 4850 1GB, Vista64 HP, drivers 10.2. Please try to explain to me why my program fails with high tsize. Thank you.

trc1 = trNew() ; // tsize means tile size, trc1 is the TR context
trImageSize( trc1, width2, height2 ) ; // destination image sizes
trTileSize( trc1, tsize, tsize, 0 ) ;
trSetup( trc1 ) ;

glDisable( GL_ALPHA_TEST ) ; // geometry is 2D only, lots of quads or quad strips
glDisable( GL_DEPTH_TEST ) ;
glDisable( GL_STENCIL_TEST ) ;
glPolygonMode( GL_FRONT, GL_FILL ) ;
glEnableClientState( GL_VERTEX_ARRAY ) ;
glPixelStorei( GL_PACK_ALIGNMENT, 1 ) ;

// Create three FBOs, one large and multisampled, one large and one small
glGenFramebuffersEXT( 3, fbo ) ;
glGenRenderbuffersEXT( 3, rendbuf ) ;
glBindFramebufferEXT( GL_DRAW_FRAMEBUFFER_EXT, fbo[0] ) ;
glBindRenderbufferEXT( GL_RENDERBUFFER_EXT, rendbuf[0] ) ;
glRenderbufferStorageMultisampleEXT( GL_RENDERBUFFER_EXT, maxaa, GL_ALPHA8, tsize * dsamp, tsize * dsamp ) ;
glFramebufferRenderbufferEXT( GL_DRAW_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT, GL_RENDERBUFFER_EXT, rendbuf[0] ) ;

glBindFramebufferEXT( GL_READ_FRAMEBUFFER_EXT, fbo[1] ) ;
glBindRenderbufferEXT( GL_RENDERBUFFER_EXT, rendbuf[1] ) ;
glRenderbufferStorageEXT( GL_RENDERBUFFER_EXT, GL_ALPHA8, tsize * dsamp, tsize * dsamp ) ;
glFramebufferRenderbufferEXT( GL_READ_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT, GL_RENDERBUFFER_EXT, rendbuf[1] ) ;

glBindFramebufferEXT( GL_READ_FRAMEBUFFER_EXT, fbo[2] ) ;
glBindRenderbufferEXT( GL_RENDERBUFFER_EXT, rendbuf[2] ) ;
glRenderbufferStorageEXT( GL_RENDERBUFFER_EXT, GL_ALPHA8, tsize, tsize ) ;
glFramebufferRenderbufferEXT( GL_READ_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT, GL_RENDERBUFFER_EXT, rendbuf[2] ) ;
// we cannot read a multisampled buffer directly with ReadPixels, and a multisample resolve blit requires source & destination rectangles of the same size

cdmap1c = (uchar *)malloc( width2 * height2 ) ; // allocate destination buffer, grayscale 8bpp image

trImageBuffer( trc1, GL_ALPHA, GL_UNSIGNED_BYTE, (void *)cdmap1c ) ;

trOrtho( trc1, 0, (double)width2, 0, (double)height2, -1.0, 1.0 ) ;

glTranslatef( 0.375, 0.375, 0.0 ) ;
glColor4ub( 0, 0, 0, 0 ) ;

// compute geometry’s 2D vertices

moretiles = 1 ; // start Tile Rendering loop
while( moretiles ){
    trBeginTile( trc1 ) ; // setup
    glBindFramebufferEXT( GL_DRAW_FRAMEBUFFER_EXT, fbo[0] ) ; // fbo[0] is draw target

    // draw many quads or quad strips

    glBindFramebufferEXT( GL_READ_FRAMEBUFFER_EXT, fbo[0] ) ; // fbo[0] is now read target
    glBindFramebufferEXT( GL_DRAW_FRAMEBUFFER_EXT, fbo[1] ) ; // and fbo[1] is draw target
    glBlitFramebufferEXT( 0, 0, tsize * dsamp, tsize * dsamp, 0, 0, tsize * dsamp, tsize * dsamp, GL_COLOR_BUFFER_BIT, GL_LINEAR ) ;

    glBindFramebufferEXT( GL_READ_FRAMEBUFFER_EXT, fbo[1] ) ; // fbo[1] is now read target
    glBindFramebufferEXT( GL_DRAW_FRAMEBUFFER_EXT, fbo[2] ) ; // fbo[2] is draw target
    glBlitFramebufferEXT( 0, 0, tsize * dsamp, tsize * dsamp, 0, 0, tsize, tsize, GL_COLOR_BUFFER_BIT, GL_LINEAR ) ;
    glBindFramebufferEXT( GL_READ_FRAMEBUFFER_EXT, fbo[2] ) ; // after blitting, make fbo[2] read target

    moretiles = trEndTile( trc1 ) ; // reading pixels
}

tifgrywr( pathout, cdmap1c, width2, height2, (uint)outres ) ; // saving to a grayscale TIFF file

// freeing various resources

glDeleteRenderbuffersEXT( 3, rendbuf ) ;
glDeleteFramebuffersEXT( 3, fbo ) ;
trDelete( trc1 ) ;

PS. Please also comment on ways to improve the speed of the operations above; it’s not as fast as I hoped for.
PS2. The maximum AA given by glGetIntegerv( GL_MAX_SAMPLES, &maxaa ) is 8 for my 4850.
But I noticed in CCC that the Edge-detect AA can go up to 24. How can I reach 24?

PS3. I decided to try a lower maxaa, instead of the maximum of 8 allowed on my 4850. With maxaa = 4 I was able to use higher values for tsize and dsamp, including tsize = 512 and dsamp = 8, for a total AA of 256x. Or maxaa = 2, tsize = 512 and dsamp = 16 for a total AA of 512x (well, this is overkill).
When tsize = 512 and dsamp = 16, the large FBOs are 8192x8192, but any MSAA larger than 2 fails. Maybe my 1GB of video RAM isn’t enough?
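
To put rough numbers on the video RAM question, here is a minimal sketch that estimates the raw storage of the three renderbuffers and the nominal sample count (maxaa * dsamp * dsamp). The function names are hypothetical, and bytes_per_sample is an assumption: GL_ALPHA8 is nominally 1 byte per sample, but a driver may store it in a wider internal format. The estimate ignores alignment, compression metadata and anything else the driver allocates.

```c
#include <stdint.h>

/* Raw bytes for one square renderbuffer of side `side` pixels.
   `samples` is 1 for a non-multisampled buffer.
   `bytes_per_sample` is an assumption (1 for a true 8-bit alpha format,
   possibly 4 if the driver pads it to an RGBA8-like format). */
uint64_t renderbuffer_bytes(uint64_t side, uint64_t samples, uint64_t bytes_per_sample)
{
    return side * side * samples * bytes_per_sample;
}

/* Sum of the three buffers from the post:
   large multisampled, large single-sampled, small single-sampled. */
uint64_t pipeline_bytes(uint64_t tsize, uint64_t dsamp, uint64_t maxaa, uint64_t bytes_per_sample)
{
    uint64_t large = tsize * dsamp;
    return renderbuffer_bytes(large, maxaa, bytes_per_sample)
         + renderbuffer_bytes(large, 1, bytes_per_sample)
         + renderbuffer_bytes(tsize, 1, bytes_per_sample);
}

/* Nominal samples per output pixel, as computed in the post. */
uint64_t nominal_samples(uint64_t maxaa, uint64_t dsamp)
{
    return maxaa * dsamp * dsamp;
}
```

For example, pipeline_bytes( 512, 16, 4, 1 ) gives 335806464 bytes (about 320 MiB), and the same call with 4 bytes per sample gives 1343225856 bytes (about 1281 MiB), which would exceed 1GB. This is only a raw storage estimate; it does not by itself explain the failures seen at smaller buffer sizes, so other driver or hardware limits may also be involved.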

Pierre_Boudier · March 17, 2010, 3:17am

did you count how many blits you end up doing, and compare it to the max fill rate ?

the 24x is an edge detect filter algorithm. you can detect edges using the stencil ref output from the shader while you render, and then apply a filter by binding your msaa buffer as a texture.

on vista/win7, the memory is virtualized, so if you exceed the 1GB total allocation, the OS will page memory in/out on demand. this will be quite slow if you are doing it all the time.
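
For the blit-count question above, a minimal sketch of the arithmetic, assuming tiles without a border (as in trTileSize( trc1, tsize, tsize, 0 )) and two blits per tile as in the posted loop. The function names are hypothetical, and the result counts destination pixels only; the resolve blit additionally reads maxaa samples per pixel.

```c
#include <stdint.h>

/* Number of tiles needed to cover a width x height image with
   square tiles of side tsize (no border, partial tiles rounded up). */
uint64_t tile_count(uint64_t width, uint64_t height, uint64_t tsize)
{
    uint64_t cols = (width  + tsize - 1) / tsize;
    uint64_t rows = (height + tsize - 1) / tsize;
    return cols * rows;
}

/* Destination pixels written by the two blits of one tile:
   the resolve (large -> large) plus the downsample (large -> small). */
uint64_t blit_pixels_per_tile(uint64_t tsize, uint64_t dsamp)
{
    uint64_t large = tsize * dsamp;
    return large * large + tsize * tsize;
}

/* Destination pixels written by all blits for the whole image. */
uint64_t total_blit_pixels(uint64_t width, uint64_t height, uint64_t tsize, uint64_t dsamp)
{
    return tile_count(width, height, tsize) * blit_pixels_per_tile(tsize, dsamp);
}
```

With a hypothetical 16384 x 16384 output, tsize = 512 and dsamp = 4, this gives 1024 tiles (2048 blits) and 4563402752 destination pixels, a figure that can then be compared against the card's fill rate.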

skynet · March 17, 2010, 3:25am

you can detect edges using the stencil ref output from the shader while you render, and then apply a filter by binding your msaa buffer as a texture.

Pierre, this sounds like an interesting technique. Could you please elaborate a bit more on this?

Ilian_Dinev · March 17, 2010, 3:40am

Keyword: centroid
Code:


// vtx shader
out vec4 varCen2;
centroid out vec4 varCen;

...
   varCen = gl_Position;
   varCen2= gl_Position;

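// frag shader:
in vec4 varCen2;          // typically interpolated at the pixel center
centroid in vec4 varCen;  // interpolated at the centroid of the covered samples

float zz = dot(abs(varCen-varCen2),vec4(1));
if(zz!=0.0)zz=1.0;
gl_FragColor = vec4(zz,zz,zz,1); // for visualization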

Pierre_Boudier · March 17, 2010, 4:48am

exactly ! thanks Ilian.

with the new extension to output this value directly in the stencil buffer, you have then a quick way to apply any post processing on either the edge or inside of a polygon.

Tzupy · March 17, 2010, 1:46pm

Thank you for the answers, especially Pierre Boudier!
When I have more time, I’ll try to implement this technique, although currently it’s beyond my OpenGL knowledge.

For now I’ll stick with the method listed in the OP, but with an intermediary buffer. The OP method had a serious flaw: downsampling from a renderbuffer that was more than 2x by 2x larger resulted in undersampling and a severely reduced number of shades in the output.
So I decided to use 4 renderbuffers in all, sized (relative to the tile) 4x by 4x with 8x MSAA, 4x by 4x, 2x by 2x and 1x by 1x, for a total of 128 shades (I checked this).
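
The undersampling described above can be reproduced on the CPU. This is a minimal sketch, assuming the GL_LINEAR blit behaves like plain bilinear filtering with no minification prefilter (real drivers may differ); downsample_linear is a hypothetical helper, not part of OpenGL or the TR library. At a 4x reduction each output pixel reads only the 2x2 source texels around its center, so 12 of the 16 covered texels are ignored, while two successive 2x steps average all 16.

```c
/* Bilinear downsample of a single-channel sw x sh image by an integer
   factor. Each destination pixel center is mapped into source space and
   the four nearest source texels are blended, as plain bilinear
   filtering would do. dst must hold (sw/factor) * (sh/factor) values. */
void downsample_linear(const double *src, int sw, int sh, int factor, double *dst)
{
    int dw = sw / factor;
    int dh = sh / factor;
    for (int j = 0; j < dh; ++j) {
        for (int i = 0; i < dw; ++i) {
            /* destination pixel center in source texel coordinates */
            double x = (i + 0.5) * factor - 0.5;
            double y = (j + 0.5) * factor - 0.5;
            int x0 = (int)x;               /* x, y >= 0, so truncation is floor */
            int y0 = (int)y;
            int x1 = (x0 + 1 < sw) ? x0 + 1 : sw - 1;   /* clamp to edge */
            int y1 = (y0 + 1 < sh) ? y0 + 1 : sh - 1;
            double fx = x - x0;
            double fy = y - y0;
            double top = src[y0 * sw + x0] * (1.0 - fx) + src[y0 * sw + x1] * fx;
            double bot = src[y1 * sw + x0] * (1.0 - fx) + src[y1 * sw + x1] * fx;
            dst[j * dw + i] = top * (1.0 - fy) + bot * fy;
        }
    }
}
```

For a 4x4 source whose only non-zero texel is a corner with value 16, a single call with factor 4 yields 0 (the corner never contributes), while factor 2 applied twice yields 1, the true average of the 16 texels.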

I was also able to make this work with CMYK images, by pretending they are RGBA and color masking.

I’ll try to further modify the TR library so I don’t necessarily read back tiles. In some cases, when the output image would fit in an 8k by 8k renderbuffer, it should be faster to read back the whole final renderbuffer than the individual tiles.

ZbuffeR · March 18, 2010, 2:34am

I may be completely wrong, but as blit only supports GL_NEAREST and GL_LINEAR, what about rendering to a texture instead of a renderbuffer (a renderbuffer cannot hold mipmaps), generating mipmaps on it, and drawing it with GL_LINEAR_MIPMAP_LINEAR filtering into another fbo ?

glGenerateMipmap(GL_TEXTURE_2D);

That may or may not be faster/better in your case, but can be worth trying.