Async buffer copies from fbo->pbo, am I doing it wrong?

imported_pettersson · September 8, 2017, 2:06am

I recently implemented (or tried to implement) async buffer copies into my renderer. Usecase is copying a render target (fbo color texture) to a PBO, and subsequently copying the PBO data to main memory.
I have currently disabled the copy to main memory to focus on the FBO->PBO copy. It does actually work, I get the results I want. It just is so slow.
I am using two ‘rendertargets’, two PBOs and GLsync objects to go with it (though I can use more). Problem is it does not look like it’s doing an async copy since performance is not increasing. If I just use one fbo and one pbo, performance is the same. Omitting the copy, performanec is vastly increased. Here is a short description of the render/copy loop. I use a writeindex and a readindex, which get incremented/swapped after each frame. Using more then two FBO/PBO buffers doesn’t affect performance as well.

I use a small structure to keep track of fbo/pbo/syncobjects (maybe there is a problem in using those).


renderdata[ fbo, pbo, readback_end, readback_start ]

Here’s some pseudocode of my render loop:

draw:
  glWaitSync( renderdata[writeindex].readback_end )
  glDeleteSync( renderdata[writeindex].readback_end )

  renderdata[writeindex].fbo.bind()
  // draw stuff
  renderdata[writeindex].fbo.unbind()
  renderdata[writeindex].readback_start = glFenceSync()
copy:
  glWaitSync(renderdata[readindex].readback_start)
  glDeleteSync(renderdata[readindex].readback_start)

  renderdata[readindex].pbo.bind()
  glGetTextureImage(renderdata[readindex].fbo.tex)
  renderdata[readindex].pbo.unbind()

  renderdata[readindex].readback_end = glFenceSync()

  swap read/write

I am using a NVidia GTX970, and I have read about NVidia’s dual copy engines, available on quadro GPUs. Is this the problem? My GPU just serializes the copy?

Or maybe I just got it all wrong, and it is only possible to do async copy from a CPU view and not on the GPU itself?

The FBO I am copying is quite large (16k x 2k), the PBOs are mapped persistent so the actual copy to main memory runs from a seperate thread. Using NSight, when copying is enabled, I can see the first drawcall in the renderloop takes up most of the frametime, making it slow (maybe that is of importance)

So, it does work, I can get the data to main memory (and it looks ok), my question is regarding performance.

Thanks for any hints or clarifications,
pettersson

Dark_Photon · September 8, 2017, 6:09am

There could be a number of things going on here.

For starters, I’d recommend putting aside the async part of this, separate threads, persistent mapping, the sync objects, and especially binding and unbinding your framebuffer (which you don’t need to do) and focus on timing your readback method and your synchronous readback performance. How long does the readback take on your CPU thread, and what effective GB/sec readback bandwidth does that imply. Optimize that first. Then throw in “other things” with a careful eye on making sure that your CPU thread time goes down.

For timing purposes only, be sure to put a glFinish() right before you do the readback call to ensure that all future pipeline work is complete and you’re not timing anything but the readback. Then start a timer, do the readback to the CPU, and then stop the timer. How many msec? Now compute the effective bandwidth in GB/sec. What do you get?

As a starter, here is a short little GLUT/GLEW test program (which compiles on Windows and Linux) which does just that:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <GL/glew.h>
#include <GL/glut.h>

#define WITH_BINDLESS

const int WIDTH      = 16384;
const int HEIGHT     = 2048;

//const int WIDTH      = 3840;
//const int HEIGHT     = 2160;       // 4K/Ultra-HD = 7.91 Mpixels

//const int WIDTH      = 2560;
//const int HEIGHT     = 1440;       // Quad-HD = 3.52 Mpixels

//const int WIDTH      = 1920;
//const int HEIGHT     = 1080;       // HD = 1.98 Mpixels

const int PIXEL_SIZE = 4;          // Bytes for RGBA8, or DEPTH24_STENCIL8

const int READBACK_BYTES = ( WIDTH * HEIGHT * PIXEL_SIZE );

GLuint   Buf[2];              // Buffer Object #0 (glReadPixels target),
                              // Buffer Object #1 (glCopyBuffer target)

GLuint   Fbo;                 // Framebuffer object
GLchar  *Readback_buf;        // App CPU mem for readback target

//-----------------------------------------------------------------------

void checkGLError( const char hdr[] )
{
    GLenum err = glGetError();
    if ( err )
    {
        fprintf( stderr, "GL ERROR at '%s': %s\n", hdr, gluErrorString(err) );
        exit(1);
    }
}

//-----------------------------------------------------------------------
// Timer
//-----------------------------------------------------------------------

#ifdef WIN32
# include <windows.h>
#else
# include <sys/time.h>
#endif


class Timer
{
protected:

    double        startUsec;
#ifdef WIN32
    LARGE_INTEGER freq;
#endif

public:
    Timer()
    {
#ifdef WIN32
        LARGE_INTEGER start;
        QueryPerformanceFrequency( &freq );
        QueryPerformanceCounter  ( &start );
        startUsec = start.QuadPart * ( 1000000.0 / freq.QuadPart );
#else
        timeval       start;
        gettimeofday( &start, 0 );
        startUsec = ( start.tv_sec * 1000000.0 ) + start.tv_usec;
#endif
    }

    double getElapsedMSec()
    {
        double endUsec;

#ifdef WIN32
        LARGE_INTEGER end;
        QueryPerformanceCounter( &end );
        endUsec   = end.QuadPart   * ( 1000000.0 / freq.QuadPart );
#else
        timeval       end;
        gettimeofday( &end, NULL );
        endUsec   = ( end.tv_sec   * 1000000.0 ) + end.tv_usec  ;
#endif

        return ( endUsec - startUsec ) * 0.001;
    }
};

//-----------------------------------------------------------------------

void initFBO()
{
    GLuint texColor, texDepth;

    // Color attachment
    glGenTextures  ( 1, &texColor );
    glBindTexture  ( GL_TEXTURE_2D, texColor );
    glTexImage2D   ( GL_TEXTURE_2D, 0, GL_RGBA8, WIDTH, HEIGHT, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, 0 );
    glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );
    glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );

    // Depth attachment
    glGenTextures  ( 1, &texDepth );
    glBindTexture  ( GL_TEXTURE_2D, texDepth );
    glTexImage2D   ( GL_TEXTURE_2D, 0, GL_DEPTH24_STENCIL8, WIDTH, HEIGHT, 0,
                     GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );
    glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );
    glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );

    // FBO
    glGenFramebuffers( 1, &Fbo );
    glBindFramebuffer( GL_FRAMEBUFFER, Fbo );
    glFramebufferTexture2D( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                            GL_TEXTURE_2D, texColor, 0 );
    glFramebufferTexture2D( GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT,
                            GL_TEXTURE_2D, texDepth, 0 );

    GLenum status = glCheckFramebufferStatus( GL_FRAMEBUFFER );
    checkGLError( "FBO construct" );

    if ( status != GL_FRAMEBUFFER_COMPLETE )
    {
        fprintf( stderr, "FBO BUILD FAILURE: status = 0x%x\n", status );
        exit(1);
    }
}

//-----------------------------------------------------------------------

void initBuffers()
{
    glGenBuffers( 2, Buf );

    // Buffer #0: glReadPixels target
    GLenum target = GL_PIXEL_PACK_BUFFER;

    glBindBuffer( target, Buf[0] );
    glBufferData( target, READBACK_BYTES, 0, GL_STATIC_COPY );

#ifdef WITH_BINDLESS
    GLuint64EXT addr;
    glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
    glMakeBufferResidentNV     ( target, GL_READ_ONLY );
#endif

    // Buffer #1: glCopyBuffer target
    target = GL_COPY_WRITE_BUFFER;
    glBindBuffer( target, Buf[1] );
    glBufferData( target, READBACK_BYTES, 0, GL_STREAM_READ );
#ifdef WITH_BINDLESS
    glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT );
    glUnmapBufferARB( target );
    glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
    glMakeBufferResidentNV     ( target, GL_READ_ONLY );
#endif
}

//-----------------------------------------------------------------------

void init()
{
#ifdef WITH_BINDLESS
    // Sanity check
    if ( !( GLEW_NV_vertex_buffer_unified_memory &&
            GLEW_NV_shader_buffer_load ) )
    {
        fprintf( stderr, "Missing NVidia bindless extensions!\n" );
        exit(1);
    }
#endif

    // Build an FBO
    initFBO();

    // Build glReadPixels readback buffer objects
    initBuffers();

    // Allocate a block of app CPU memory as the destination for the
    //   readback
    Readback_buf = (GLchar *) malloc( READBACK_BYTES );

    // Always tightly pack readbacks in memory.
    glPixelStorei( GL_PACK_ALIGNMENT, 1 );
}

//-----------------------------------------------------------------------

void reshape( int width, int height )
{
    glViewport( 0, 0, width, height );
}

//-----------------------------------------------------------------------

void doReadbackSLOW()
{
    // Do a depth readback directly to app CPU memory
    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );

    glReadPixels( 0, 0, WIDTH, HEIGHT,
                  GL_RGBA, GL_UNSIGNED_INT_8_8_8_8_REV, Readback_buf );
    //glReadPixels( 0, 0, WIDTH, HEIGHT,
    //              GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, Readback_buf );
}

//-----------------------------------------------------------------------

void doReadbackFAST()
{
    // Work-around for NVidia driver readback crippling on GeForce.

    // Do a depth readback to BUF OBJ #0
    glBindBuffer( GL_PIXEL_PACK_BUFFER, Buf[0] );

    glReadPixels( 0, 0, WIDTH, HEIGHT,
                  GL_RGBA, GL_UNSIGNED_INT_8_8_8_8_REV, 0 );
    //glReadPixels( 0, 0, WIDTH, HEIGHT,
    //              GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );

    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer( GL_COPY_WRITE_BUFFER, Buf[1] );
    glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                         READBACK_BYTES );

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, READBACK_BYTES,
                        Readback_buf );

    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
}

//-----------------------------------------------------------------------

void display()
{
    printf( "--- FRAME ---\n" );

    // Clear system FBO
    glBindFramebuffer( GL_FRAMEBUFFER, 0 );
    glClear( GL_COLOR_BUFFER_BIT );

    // Bind FBO and clear
    glBindFramebuffer( GL_FRAMEBUFFER, Fbo );
    glClear( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT |
             GL_STENCIL_BUFFER_BIT );

    // Make sure we're caught up.  We only want to time the readback.
    glFinish();

    // Do depth readback to Readback_buf
    //   -- FAST way
    Timer timer1;
    doReadbackFAST();
    double time = timer1.getElapsedMSec();
    printf( "   Readback FAST time = %7.3f msec (%7.3f GBytes/sec)\n",
            time, ( READBACK_BYTES / time * 1000 / (1 << 30) ) );

    //   -- SLOW way
    Timer timer2;
    doReadbackSLOW();
    time = timer2.getElapsedMSec();
    printf( "   Readback SLOW time = %7.3f msec (%7.3f GBytes/sec)\n",
            time, ( READBACK_BYTES / time * 1000 / (1 << 30) ) );

    // Swap
    glutSwapBuffers();
    glFinish();
    glutPostRedisplay();
    checkGLError( "display()" );
}

//-----------------------------------------------------------------------

void keyboard( unsigned char key, int x, int y )
{
   switch (key) {
      case 27:
         exit(0);
         break;
   }
}

int main( int argc, char** argv )
{
  // Init GLUT
  glutInit( &argc, argv );
  glutInitDisplayMode(  GLUT_RGBA | GLUT_DOUBLE );
  glutCreateWindow( argv[0] );

  glutKeyboardFunc( keyboard );
  glutDisplayFunc( display );
  glutReshapeFunc( reshape );

  glutReshapeWindow( 400,400 );

  // Init GLEW
  GLenum err = glewInit();
  if ( err != GLEW_OK )
  {
    // Problem: glewInit failed, something is seriously wrong.
    fprintf( stderr, "Error: %s\n", glewGetErrorString(err) );
    exit(1);
  }

  printf( "GL_RENDERER = %s\n", glGetString( GL_RENDERER) );

  glClearColor( 0,0,0,0 );

  init();

  glutMainLoop();
  return 0;
}

Also, here’s a CMakeLists.txt (cmake config files) which’ll let you generate MSVS project files (on Windows) or a Makefile (on Linux).


cmake_minimum_required( VERSION 2.8 FATAL_ERROR )

# Help Windows find libaries
if ( WIN32 )
  # NOTE: You should add these directories to your PATH (for .dlls):
  #           C:\Tools\glew-2.0.0\bin\Release\Win32
  #           C:\Tools\freeglut-3.0.0\bin
  # NOTE: After installing NVidia Cg, FindGLUT was finding the GLUT in NVidia's Cg dir.
  set(CMAKE_PREFIX_PATH "C:/Tools/glew-2.0.0;C:/Tools/freeglut-3.0.0")
  set(CMAKE_LIBRARY_PATH "C:/Tools/glew-2.0.0/lib/Release/Win32;C:/Tools/freeglut-3.0.0/lib")
endif()

# Set variable
#SET( EXEC tst )

# Find packages of additional libraries
#   - GLEW
find_package( GLEW REQUIRED )
if ( GLEW_FOUND )
    include_directories( ${GLEW_INCLUDE_DIRS} )
    link_libraries     ( ${GLEW_LIBRARIES}    )
endif()

#   - GLUT
find_package( GLUT REQUIRED )
if ( GLUT_FOUND )
    include_directories( ${GLUT_INCLUDE_DIR} )
    link_libraries     ( ${GLUT_LIBRARIES}    )
endif()

#   - OpenGL
find_package( OpenGL REQUIRED )
if ( OPENGL_FOUND )
    include_directories( ${OPENGL_INCLUDE_DIRS} )
    link_libraries     ( ${OPENGL_LIBRARIES}    )
endif()

# C++11
macro(use_cxx11)
  if (CMAKE_VERSION VERSION_LESS "3.1")
    if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
      set (CMAKE_CXX_FLAGS "--std=c++11 ${CMAKE_CXX_FLAGS}")
    endif ()
  else ()
    set (CMAKE_CXX_STANDARD 11)
  endif ()
endmacro(use_cxx11)

use_cxx11()

# Specify exec_name followed by source and header files
add_executable ( readback_perf             readback_perf.cpp            )

To do so, just run:


cmake .
cmake --build . --target readback_perf --config Release

You’ll notice I’ve plugged in the image resolution you mentioned (16k x 2k).

Here on a GTX1080, here’s one of the typical readback timings I get:


--- FRAME ---
   Readback FAST time =  32.092 msec (  3.895 GBytes/sec)
   Readback SLOW time =  42.969 msec (  2.909 GBytes/sec)

Now even that is a small fraction of the ~15.75 GB/sec theoretical of the PCI Express v3 x16 bus my GPU is plugged into. But perhaps that’s good enough for you.

If you want to read more on this and why it might not be faster, see this thread.

Once you get your synchronous readback performance up as high as possible, then try mixing in other things like async, multiple buffers, etc. to try to hide some of that readback latency.

Using NSight, when copying is enabled, I can see the first drawcall in the renderloop takes up most of the frametime, making it slow (maybe that is of importance)

Don’t know for sure, but I do have some theories. One is that this may be instigated by you binding and unbinding your FBO needlessly. You should avoid changing render targets more than absolutely necessary as changing render targets is very expensive. It could be that the overhead of that is deferred until the first draw call that actually renders on a render target. Try removing the bind/unbind of your FBO.

Past that, make sure (for timing purposes) that you are doing a glFinish() before you do the readback. That isolates your timings from the other “stuff” (which you and/or the driver may be doing) which may make the rest of your frame slow.

imported_pettersson · September 8, 2017, 10:19am

Wow, thanks for your fast and extensive reply. I will try your example at home, but cannot really test in the “real world” it until monday, when I am back at my office.

Have a nice weekend and thanks again
pettersson

imported_pettersson · September 11, 2017, 12:59am

Hi,

I ran the timing tests on my project, with the following results:

16384x2048 readback
min: 39.0039 ms (3,2GB/s)
max: 49.0049 ms (2,5GB/s)

I removed all unneccessary Unbind calls on Shaders and FBOs, increasing performance by a really small margin. Now the question remains how to speed up the readback copy. I read about the dual copy engines on NVidia Quadro cards, but I only have limited access to one of those (M6000) and “only” a GTX970 in my development machine. Is it still a fact that only the Quadro cards do have this dual copy engine? I might try to use CUDA to copy the data to CPU to see if this is faster. With 39ms minimum duration for the copy and the requirement to copy every frame, even if the copy is completely async I won’t hit my target of at least 30 frames per second.

Any more ideas on how to speed up the copy? Decreasing resolution is out of the question for now

Dark_Photon · September 12, 2017, 6:12am

Are you saying you’ve already put full async readback in and it’s no faster than 39ms? On frame N, are you processing the frame data from frame N-1 or N-2 to give the readback time to migrate in the background?

Potentially useful to you:

Asynchronous Buffer Transfers (OpenGL Insights)
Asynchronous Buffer Transfers (NVidia)

As I recall, there are other related chapters from this book which would be useful to you as well.

imported_pettersson · September 12, 2017, 10:31pm

No, the 39ms are from the timing measurements I did on the readback alone. I still have to try to make that faster, currently I do a simple readback, not the nvidia workaround you did in your sample. My main concern is we have a 30fps source and readback needs to be done for every frame with no frames skipped.

Dark_Photon · February 8, 2021, 2:44pm

A post was split to a new topic: Fast Readbacks on Intel and NVIDIA

Dark_Photon · February 8, 2021, 2:49pm