
Thread: Async buffer copies from fbo->pbo, am I doing it wrong?

  1. #1
    Junior Member Newbie
    Join Date
    Mar 2015
    Posts
    8

    Async buffer copies from fbo->pbo, am I doing it wrong?

    I recently implemented (or tried to implement) async buffer copies in my renderer. The use case is copying a render target (FBO color texture) to a PBO, and subsequently copying the PBO data to main memory.
    I have currently disabled the copy to main memory to focus on the FBO->PBO copy. It does actually work and I get the results I want; it is just very slow.
    I am using two 'render targets', two PBOs and matching GLsync objects (though I can use more). The problem is that it does not look like an async copy is happening, since performance is not increasing: with just one FBO and one PBO, performance is the same, and using more than two FBO/PBO buffers makes no difference either. Omitting the copy, performance is vastly increased. Here is a short description of the render/copy loop. I use a write index and a read index, which get incremented/swapped after each frame.

    I use a small structure to keep track of the FBO/PBO/sync objects (maybe there is a problem in using those).
    Code :
    renderdata[ fbo, pbo, readback_end, readback_start ]

    Here's some pseudocode of my render loop:

    Code :
    draw:
      glWaitSync( renderdata[writeindex].readback_end )
      glDeleteSync( renderdata[writeindex].readback_end )
     
      renderdata[writeindex].fbo.bind()
      // draw stuff
      renderdata[writeindex].fbo.unbind()
      renderdata[writeindex].readback_start = glFenceSync()
    copy:
      glWaitSync(renderdata[readindex].readback_start)
      glDeleteSync(renderdata[readindex].readback_start)
     
      renderdata[readindex].pbo.bind()
      glGetTextureImage(renderdata[readindex].fbo.tex)
      renderdata[readindex].pbo.unbind()
     
      renderdata[readindex].readback_end = glFenceSync()
     
      swap read/write


    I am using an NVIDIA GTX 970, and I have read about NVIDIA's dual copy engines, which are available on Quadro GPUs. Is this the problem? Does my GPU just serialize the copy?

    Or maybe I just got it all wrong, and an async copy is only possible from the CPU's point of view, not on the GPU itself?

    The FBO I am copying is quite large (16k x 2k), and the PBOs are mapped persistently, so the actual copy to main memory runs on a separate thread. Using Nsight, I can see that when copying is enabled, the first draw call in the render loop takes up most of the frame time, making it slow (maybe that is of importance).

    So, it does work, and I can get the data into main memory (and it looks OK); my question is purely about performance.

    Thanks for any hints or clarifications,
    pettersson
    Last edited by pettersson; 09-08-2017 at 03:16 AM. Reason: added additional information

  2. #2
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    4,173
    Quote Originally Posted by pettersson View Post
    I recently implemented (or tried to implement) async buffer copies into my renderer. Usecase is copying a render target (fbo color texture) to a PBO, and subsequently copying the PBO data to main memory.
    ...
    The FBO I am copying is quite large (16k x 2k),
    ...
    I have currently disabled the copy to main memory to focus on the FBO->PBO copy. It does actually work, I get the results I want. It just is so slow.
    ...
    I am using two 'rendertargets', two PBOs and GLsync objects to go with it (though I can use more).
    There could be a number of things going on here.

    For starters, I'd recommend putting aside the async part of this (separate threads, persistent mapping, the sync objects, and especially the binding and unbinding of your framebuffer, which you don't need to do) and focusing on timing your readback method and your synchronous readback performance. How long does the readback take on your CPU thread, and what effective GB/sec of readback bandwidth does that imply? Optimize that first. Then throw in the "other things", with a careful eye on making sure that your CPU thread time goes down.

    For timing purposes only, be sure to put a glFinish() right before you do the readback call, to ensure that all prior pipeline work is complete and you're not timing anything but the readback. Then start a timer, do the readback to the CPU, and then stop the timer. How many msec? Now compute the effective bandwidth in GB/sec. What do you get?

    As a starter, here is a short GLUT/GLEW test program (which compiles on both Windows and Linux) that does just that:

    Code cpp:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <GL/glew.h>
    #include <GL/glut.h>
     
    #define WITH_BINDLESS
     
    const int WIDTH      = 16384;
    const int HEIGHT     = 2048;
     
    //const int WIDTH      = 3840;
    //const int HEIGHT     = 2160;       // 4K/Ultra-HD = 7.91 Mpixels
     
    //const int WIDTH      = 2560;
    //const int HEIGHT     = 1440;       // Quad-HD = 3.52 Mpixels
     
    //const int WIDTH      = 1920;
    //const int HEIGHT     = 1080;       // HD = 1.98 Mpixels
     
    const int PIXEL_SIZE = 4;          // Bytes for RGBA8, or DEPTH24_STENCIL8
     
    const int READBACK_BYTES = ( WIDTH * HEIGHT * PIXEL_SIZE );
     
    GLuint   Buf[2];              // Buffer Object #0 (glReadPixels target),
                                  // Buffer Object #1 (glCopyBuffer target)
     
    GLuint   Fbo;                 // Framebuffer object
    GLchar  *Readback_buf;        // App CPU mem for readback target
     
    //-----------------------------------------------------------------------
     
    void checkGLError( const char hdr[] )
    {
        GLenum err = glGetError();
        if ( err )
        {
            fprintf( stderr, "GL ERROR at '%s': %s\n", hdr, gluErrorString(err) );
            exit(1);
        }
    }
     
    //-----------------------------------------------------------------------
    // Timer
    //-----------------------------------------------------------------------
     
    #ifdef WIN32
    # include <windows.h>
    #else
    # include <sys/time.h>
    #endif
     
     
    class Timer
    {
    protected:
     
        double        startUsec;
    #ifdef WIN32
        LARGE_INTEGER freq;
    #endif
     
    public:
        Timer()
        {
    #ifdef WIN32
            LARGE_INTEGER start;
            QueryPerformanceFrequency( &freq );
            QueryPerformanceCounter  ( &start );
            startUsec = start.QuadPart * ( 1000000.0 / freq.QuadPart );
    #else
            timeval       start;
            gettimeofday( &start, 0 );
            startUsec = ( start.tv_sec * 1000000.0 ) + start.tv_usec;
    #endif
        }
     
        double getElapsedMSec()
        {
            double endUsec;
     
    #ifdef WIN32
            LARGE_INTEGER end;
            QueryPerformanceCounter( &end );
            endUsec   = end.QuadPart   * ( 1000000.0 / freq.QuadPart );
    #else
            timeval       end;
            gettimeofday( &end, NULL );
            endUsec   = ( end.tv_sec   * 1000000.0 ) + end.tv_usec  ;
    #endif
     
            return ( endUsec - startUsec ) * 0.001;
        }
    };
     
    //-----------------------------------------------------------------------
     
    void initFBO()
    {
        GLuint texColor, texDepth;
     
        // Color attachment
        glGenTextures  ( 1, &texColor );
        glBindTexture  ( GL_TEXTURE_2D, texColor );
        glTexImage2D   ( GL_TEXTURE_2D, 0, GL_RGBA8, WIDTH, HEIGHT, 0,
                         GL_RGBA, GL_UNSIGNED_BYTE, 0 );
        glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );
        glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );
     
        // Depth attachment
        glGenTextures  ( 1, &texDepth );
        glBindTexture  ( GL_TEXTURE_2D, texDepth );
        glTexImage2D   ( GL_TEXTURE_2D, 0, GL_DEPTH24_STENCIL8, WIDTH, HEIGHT, 0,
                         GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );
        glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );
        glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );
     
        // FBO
        glGenFramebuffers( 1, &Fbo );
        glBindFramebuffer( GL_FRAMEBUFFER, Fbo );
        glFramebufferTexture2D( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                                GL_TEXTURE_2D, texColor, 0 );
        glFramebufferTexture2D( GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT,
                                GL_TEXTURE_2D, texDepth, 0 );
     
        GLenum status = glCheckFramebufferStatus( GL_FRAMEBUFFER );
        checkGLError( "FBO construct" );
     
        if ( status != GL_FRAMEBUFFER_COMPLETE )
        {
            fprintf( stderr, "FBO BUILD FAILURE: status = 0x%x\n", status );
            exit(1);
        }
    }
     
    //-----------------------------------------------------------------------
     
    void initBuffers()
    {
        glGenBuffers( 2, Buf );
     
        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;
     
        glBindBuffer( target, Buf[0] );
        glBufferData( target, READBACK_BYTES, 0, GL_STATIC_COPY );
     
    #ifdef WITH_BINDLESS
        GLuint64EXT addr;
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
    #endif
     
        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer( target, Buf[1] );
        glBufferData( target, READBACK_BYTES, 0, GL_STREAM_READ );
    #ifdef WITH_BINDLESS
        glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT );
        glUnmapBufferARB( target );
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
    #endif
    }
     
    //-----------------------------------------------------------------------
     
    void init()
    {
    #ifdef WITH_BINDLESS
        // Sanity check
        if ( !( GLEW_NV_vertex_buffer_unified_memory &&
                GLEW_NV_shader_buffer_load ) )
        {
            fprintf( stderr, "Missing NVidia bindless extensions!\n" );
            exit(1);
        }
    #endif
     
        // Build an FBO
        initFBO();
     
        // Build glReadPixels readback buffer objects
        initBuffers();
     
        // Allocate a block of app CPU memory as the destination for the
        //   readback
        Readback_buf = (GLchar *) malloc( READBACK_BYTES );
     
        // Always tightly pack readbacks in memory.
        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
    }
     
    //-----------------------------------------------------------------------
     
    void reshape( int width, int height )
    {
        glViewport( 0, 0, width, height );
    }
     
    //-----------------------------------------------------------------------
     
    void doReadbackSLOW()
    {
        // Do a readback (RGBA here; depth variant commented out below) directly to app CPU memory
        glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
     
        glReadPixels( 0, 0, WIDTH, HEIGHT,
                      GL_RGBA, GL_UNSIGNED_INT_8_8_8_8_REV, Readback_buf );
        //glReadPixels( 0, 0, WIDTH, HEIGHT,
        //              GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, Readback_buf );
    }
     
    //-----------------------------------------------------------------------
     
    void doReadbackFAST()
    {
        // Work-around for NVidia driver readback crippling on GeForce.
     
        // Do a readback (RGBA here) to BUF OBJ #0
        glBindBuffer( GL_PIXEL_PACK_BUFFER, Buf[0] );
     
        glReadPixels( 0, 0, WIDTH, HEIGHT,
                      GL_RGBA, GL_UNSIGNED_INT_8_8_8_8_REV, 0 );
        //glReadPixels( 0, 0, WIDTH, HEIGHT,
        //              GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );
     
        // Copy from BUF OBJ #0 to BUF OBJ #1
        glBindBuffer( GL_COPY_WRITE_BUFFER, Buf[1] );
        glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                             READBACK_BYTES );
     
        // Do the readback from BUF OBJ #1 to app CPU memory
        glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, READBACK_BYTES,
                            Readback_buf );
     
        glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
    }
     
    //-----------------------------------------------------------------------
     
    void display()
    {
        printf( "--- FRAME ---\n" );
     
        // Clear system FBO
        glBindFramebuffer( GL_FRAMEBUFFER, 0 );
        glClear( GL_COLOR_BUFFER_BIT );
     
        // Bind FBO and clear
        glBindFramebuffer( GL_FRAMEBUFFER, Fbo );
        glClear( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT |
                 GL_STENCIL_BUFFER_BIT );
     
        // Make sure we're caught up.  We only want to time the readback.
        glFinish();
     
        // Do readback to Readback_buf
        //   -- FAST way
        Timer timer1;
        doReadbackFAST();
        double time = timer1.getElapsedMSec();
        printf( "   Readback FAST time = %7.3f msec (%7.3f GBytes/sec)\n",
                time, ( READBACK_BYTES / time * 1000 / (1 << 30) ) );
     
        //   -- SLOW way
        Timer timer2;
        doReadbackSLOW();
        time = timer2.getElapsedMSec();
        printf( "   Readback SLOW time = %7.3f msec (%7.3f GBytes/sec)\n",
                time, ( READBACK_BYTES / time * 1000 / (1 << 30) ) );
     
        // Swap
        glutSwapBuffers();
        glFinish();
        glutPostRedisplay();
        checkGLError( "display()" );
    }
     
    //-----------------------------------------------------------------------
     
    void keyboard( unsigned char key, int x, int y )
    {
       switch (key) {
          case 27:
             exit(0);
             break;
       }
    }
     
    int main( int argc, char** argv )
    {
      // Init GLUT
      glutInit( &argc, argv );
      glutInitDisplayMode(  GLUT_RGBA | GLUT_DOUBLE );
      glutCreateWindow( argv[0] );
     
      glutKeyboardFunc( keyboard );
      glutDisplayFunc( display );
      glutReshapeFunc( reshape );
     
      glutReshapeWindow( 400,400 );
     
      // Init GLEW
      GLenum err = glewInit();
      if ( err != GLEW_OK )
      {
        // Problem: glewInit failed, something is seriously wrong.
        fprintf( stderr, "Error: %s\n", glewGetErrorString(err) );
        exit(1);
      }
     
      printf( "GL_RENDERER = %s\n", glGetString( GL_RENDERER) );
     
      glClearColor( 0,0,0,0 );
     
      init();
     
      glutMainLoop();
      return 0;
    }

    Also, here's a CMakeLists.txt (a CMake config file) that will let you generate MSVS project files (on Windows) or a Makefile (on Linux).

    Code :
    cmake_minimum_required( VERSION 2.8 FATAL_ERROR )
     
    # Help Windows find libraries
    if ( WIN32 )
      # NOTE: You should add these directories to your PATH (for .dlls):
      #           C:\Tools\glew-2.0.0\bin\Release\Win32
      #           C:\Tools\freeglut-3.0.0\bin
      # NOTE: After installing NVidia Cg, FindGLUT was finding the GLUT in NVidia's Cg dir.
      set(CMAKE_PREFIX_PATH "C:/Tools/glew-2.0.0;C:/Tools/freeglut-3.0.0")
      set(CMAKE_LIBRARY_PATH "C:/Tools/glew-2.0.0/lib/Release/Win32;C:/Tools/freeglut-3.0.0/lib")
    endif()
     
    # Set variable
    #SET( EXEC tst )
     
    # Find packages of additional libraries
    #   - GLEW
    find_package( GLEW REQUIRED )
    if ( GLEW_FOUND )
        include_directories( ${GLEW_INCLUDE_DIRS} )
        link_libraries     ( ${GLEW_LIBRARIES}    )
    endif()
     
    #   - GLUT
    find_package( GLUT REQUIRED )
    if ( GLUT_FOUND )
        include_directories( ${GLUT_INCLUDE_DIR} )
        link_libraries     ( ${GLUT_LIBRARIES}    )
    endif()
     
    #   - OpenGL
    find_package( OpenGL REQUIRED )
    if ( OPENGL_FOUND )
        include_directories( ${OPENGL_INCLUDE_DIRS} )
        link_libraries     ( ${OPENGL_LIBRARIES}    )
    endif()
     
    # C++11
    macro(use_cxx11)
      if (CMAKE_VERSION VERSION_LESS "3.1")
        if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
          set (CMAKE_CXX_FLAGS "--std=c++11 ${CMAKE_CXX_FLAGS}")
        endif ()
      else ()
        set (CMAKE_CXX_STANDARD 11)
      endif ()
    endmacro(use_cxx11)
     
    use_cxx11()
     
    # Specify exec_name followed by source and header files
    add_executable ( readback_perf             readback_perf.cpp            )

    To do so, just run:
    Code :
    cmake .
    cmake --build . --target readback_perf --config Release

    You'll notice I've plugged in the image resolution you mentioned (16k x 2k).

    Here on a GTX 1080, these are typical readback timings I get:

    Code :
    --- FRAME ---
       Readback FAST time =  32.092 msec (  3.895 GBytes/sec)
       Readback SLOW time =  42.969 msec (  2.909 GBytes/sec)

    Now, even that is only a fraction of the ~15.75 GB/sec theoretical bandwidth of the PCI Express v3 x16 bus my GPU is plugged into. But perhaps that's good enough for you.

    If you want to read more on this and why it might not be faster, see this thread.

    Once you get your synchronous readback performance up as high as possible, then try mixing in other things like async, multiple buffers, etc. to try to hide some of that readback latency.

    Using NSight, when copying is enabled, I can see the first drawcall in the renderloop takes up most of the frametime, making it slow (maybe that is of importance)
    I don't know for sure, but I do have some theories. One is that this may be caused by you binding and unbinding your FBO needlessly. You should avoid changing render targets more often than absolutely necessary, as changing render targets is very expensive. It could be that this overhead is deferred until the first draw call that actually renders to the render target. Try removing the bind/unbind of your FBO.

    Past that, make sure (for timing purposes) that you are doing a glFinish() before you do the readback. That isolates your timings from the other "stuff" (which you and/or the driver may be doing) that may otherwise make the rest of your frame slow.
    Last edited by Dark Photon; 09-08-2017 at 07:09 PM.

  3. #3
    Junior Member Newbie
    Join Date
    Mar 2015
    Posts
    8
    Wow, thanks for your fast and extensive reply. I will try your example at home, but I cannot really test it in the "real world" until Monday, when I am back at my office.

    Have a nice weekend and thanks again
    pettersson

  4. #4
    Junior Member Newbie
    Join Date
    Mar 2015
    Posts
    8
    Hi,

    I ran the timing tests on my project, with the following results:

    16384x2048 readback
    min: 39.0039 ms (3.2 GB/s)
    max: 49.0049 ms (2.5 GB/s)

    I removed all unnecessary unbind calls on shaders and FBOs, which increased performance by a really small margin. Now the question remains how to speed up the readback copy itself. I read about the dual copy engines on NVIDIA Quadro cards, but I only have limited access to one of those (an M6000) and "only" a GTX 970 in my development machine. Is it still the case that only Quadro cards have dual copy engines? I might try using CUDA to copy the data to the CPU to see if that is faster. With a 39 ms minimum duration for the copy and the requirement to copy every frame, even if the copy is completely async I won't hit my target of at least 30 frames per second.

    Any more ideas on how to speed up the copy? Decreasing the resolution is out of the question for now.

  5. #5
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    4,173
    Quote Originally Posted by pettersson View Post
    With 39ms minimum duration for the copy and the requirement to copy every frame, even if the copy is completely async I won't hit my target of at least 30 frames per second.

    Any more ideas on how to speed up the copy?
    Are you saying you've already put full async readback in and it's no faster than 39ms? On frame N, are you processing the frame data from frame N-1 or N-2 to give the readback time to migrate in the background?

    Potentially useful to you:

    * Asynchronous Buffer Transfers (OpenGL Insights)
    * Asynchronous Buffer Transfers (NVidia)

    As I recall, there are other related chapters from this book which would be useful to you as well.

  6. #6
    Junior Member Newbie
    Join Date
    Mar 2015
    Posts
    8
    No, the 39 ms is from the timing measurements I did on the readback alone. I still have to try to make that faster; currently I do a simple readback, not the NVIDIA workaround you used in your sample. My main concern is that we have a 30 fps source, and the readback needs to be done for every frame, with no frames skipped.
