Thread: glMapBuffer time reduction

  1. #1
    Senior Member OpenGL Pro Aleksandar
    Join Date
    Jul 2009
    Posts
    1,144

    glMapBuffer time reduction

    Hi all,

    Recently I encountered something very interesting, and maybe a little frustrating, since it cannot be hidden on slow GPUs.
    Namely, I have discovered that glMapBuffer/glMapBufferRange calls are synchronized with the GPU frame time. This is the code my renderer uses to read values back from the transform feedback (TF) buffer.
    Code :
    void GLRenderer::CreateTFBuffer()
    {
        unsigned int size = ...;
        glGenBuffers(1, &m_TF_ID);
        glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, m_TF_ID);
        // Allocate storage only; the GPU writes it, the CPU reads it back.
        glBufferData(GL_TRANSFORM_FEEDBACK_BUFFER, size, NULL, GL_DYNAMIC_READ);
    }

    bool GLRenderer::QueryTF()
    {
        if(!m_bTFRead) return false;
        // Skip the availability check on the very first pass (no query issued yet).
        if(!m_bInitTF)
        {
            GLint available = 0;
            glGetQueryObjectiv(m_tfQuery, GL_QUERY_RESULT_AVAILABLE, &available);
            if(available == 0) return false;
        }
        //... Other code ...

        glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, m_TF_ID);
        glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, m_TF_ID);
        glEnable(GL_RASTERIZER_DISCARD);
        glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, m_tfQuery);
        glBeginTransformFeedback(GL_POINTS);

        int first = 0, count = ...;
        glDrawArrays(GL_POINTS, first, count);

        glEndTransformFeedback();
        glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);
        glDisable(GL_RASTERIZER_DISCARD);
        m_bInitTF = false;
        m_bTFRead = false;
        return true;
    }

    float GLRenderer::ReadTF()
    {
        if(m_bInitTF) return -1000.0f;
        // Read only once the query reports that its result is available.
        GLint available = 0;
        glGetQueryObjectiv(m_tfQuery, GL_QUERY_RESULT_AVAILABLE, &available);
        if(available == 0) return -1000.0f;

        glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, m_TF_ID);
        //float* ptr = (float*)glMapBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, GL_READ_ONLY); // The same as following
        float* ptr = (float*)glMapBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0, 4 * sizeof(float), GL_MAP_READ_BIT); // <= Extremely costly
        //...
    }

    As you can see, I'm reading only when the result is available. Maybe the mechanism for reporting availability is not correct, but the result is the same on various NV drivers and cards. The following relation always holds:

    GPU_frame_time - (n-1) * CPU_frame_time > MapBuffer_time > GPU_frame_time - n * CPU_frame_time

    where n is the number of frames across which ReadTF() waits for the TF buffer to be available. It's always 3. That means values can be read every third frame, but every third frame has an extremely long CPU time. For slow GPUs that may mean 30 times longer than the normal frames (since CPU time is less than 1ms).
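
    The relation above can be turned into a small worked example. This is only a sketch: the struct and function names are made up for illustration, and the timings in the usage note are arbitrary sample values, not measurements from the renderer.

```cpp
#include <cassert>

// Interval predicted for the duration of the blocking map call, in ms.
struct MapTimeBounds {
    double lower;  // GPU_frame_time - n * CPU_frame_time
    double upper;  // GPU_frame_time - (n - 1) * CPU_frame_time
};

// Hypothetical helper: n is the number of frames ReadTF() waits
// before the query reports the result as available.
inline MapTimeBounds predictMapTime(double gpuFrameMs, double cpuFrameMs, int n) {
    assert(n >= 1);
    MapTimeBounds b;
    b.lower = gpuFrameMs - n * cpuFrameMs;
    b.upper = gpuFrameMs - (n - 1) * cpuFrameMs;
    return b;
}
```

    For instance, with an assumed GPU frame time of 30 ms, a CPU frame time of 1 ms and n = 3, predictMapTime(30.0, 1.0, 3) gives an interval of 27 ms to 28 ms, which is the "about 30 times longer" frame described above.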

    Can anyone explain why this happens? And is there any way to improve performance?
    Note that this is a read from the buffer, so GL_MAP_UNSYNCHRONIZED_BIT is not applicable.
    Last edited by Aleksandar; 04-29-2014 at 03:22 AM. Reason: Problem with code formatting

  2. #2
    Intern Contributor
    Join Date
    May 2013
    Posts
    65
    Why are you using mapping and not glGetBufferSubData? Mapping on AMD/NV suffers from app/driver thread synchronization.

    Or take a look at buffer storage with persistent mapping. I'm not sure, but I suspect that buffer storage is supported on a lot of older hardware.

  3. #3
    Senior Member OpenGL Guru Dark Photon
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,209
    Another option you might look at and bench is using a bounce buffer to do the GPU->CPU readback in the background. For instance, using DSA and bindless with readpixels:

    Code :
        glBindBuffer       ( GL_PIXEL_PACK_BUFFER, buf1 );
        glReadPixels       ( 0, 0, res[0], res[1], GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );
        glBindBuffer       ( GL_PIXEL_PACK_BUFFER, 0 );
        glNamedCopyBufferSubDataEXT( buf1, buf2, 0, 0, size );
        // ...do lots of other work here...
        GLuint *p = (GLuint *) glMapNamedBufferRangeEXT( buf2, 0, size, GL_MAP_READ_BIT );

    Do the appropriate thing to force buf1 to be a GPU-mem buffer and buf2 to be a CPU-mem buffer. There's probably a better way to force this nowadays than I'm doing.

  4. #4
    Senior Member OpenGL Pro Aleksandar
    Join Date
    Jul 2009
    Posts
    1,144
    Thanks guys for the suggestions!

    Let me elaborate on what I have found in the meantime...

    1. glGetBufferSubData() suffers from the same "disease". It behaves exactly the same as glMapBuffer/glMapBufferRange, although it is faster (as a function call) than glMapBuffer.

    2. glGetBufferSubData() has the fastest function call. glMapBuffer() is about 26-36% slower, and glMapBufferRange() is about 35-43% slower than glGetBufferSubData(). The values depend on the system and drivers and range from 2us to 10us, so it is very difficult to measure precisely (the error margin is too high in such a range). Nevertheless, all those functions behave approximately equally.

    3. The only way to solve the problem is to wait until the result is really available. The availability reported by glGetQueryObjectiv() is not quite correct. After adding a countdown that starts when availability is reported and waits two additional frames, the latency of glGetBufferSubData()/glMapBuffer()/glMapBufferRange() drops by three orders of magnitude (it is removed completely). The only drawback of this solution is that the result is 5 frames old, which means inaccurate collision detection (that's what the code is used for).

    On the other hand, buffer storage is not supported in older drivers, so it is not the way to go, at least for a while, and I'm not sure whether it would help anyway. Bounce-buffer copying would probably behave the same as adding the additional wait as in (3), plus it adds an extra buffer copy. Currently I have very little CPU workload on the drawing thread, so the latency can hardly be hidden in a single frame.
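
    The countdown described in (3) could look roughly like the sketch below. The class name and interface are hypothetical and not taken from the renderer; the only GL-related assumption is that the caller passes in the flag obtained from glGetQueryObjectiv() with GL_QUERY_RESULT_AVAILABLE once per frame.

```cpp
#include <cassert>

// Hypothetical helper: delays the readback by a fixed number of frames
// after the query first reports that its result is available.
class ReadbackCountdown {
public:
    explicit ReadbackCountdown(int extraFrames) : m_extraFrames(extraFrames) {
        assert(extraFrames >= 0);
    }

    // Call when a new transform feedback pass has been issued.
    void reset() { m_remaining = -1; }

    // Call once per frame with the availability flag of the query.
    // Returns true on the frame where the buffer should be read.
    bool tick(bool queryAvailable) {
        if (m_remaining < 0) {            // still waiting for the query
            if (!queryAvailable) return false;
            m_remaining = m_extraFrames;  // availability seen: start countdown
        }
        if (m_remaining == 0) return true;
        --m_remaining;
        return false;
    }

private:
    int m_extraFrames;
    int m_remaining = -1;  // -1 means availability has not been reported yet
};
```

    With ReadbackCountdown c(2), tick() first returns true two frames after the frame on which availability was reported; the caller then maps or reads the buffer and calls reset() when the next TF pass is issued.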

  5. #5
    Member Regular Contributor malexander
    Join Date
    Aug 2009
    Location
    Ontario
    Posts
    316
    1. glGetBufferSubData() suffers from the same "disease". It behaves exactly the same as glMapBuffer/glMapBufferRange. Although it is faster (as a function call) than glMapBuffer.
    I've found exactly the same thing, whether it be mapping a buffer for read or write. I've been replacing all the glMapBuffer() calls I can find with glBufferSubData() and getting a nice little performance boost. Not exactly confidence-inspiring.

    However, I'm a few days out from testing coherent-persistent buffers. I'll let you know how that works out. Even if you're targeting GL3 hardware, you could still branch on GL_ARB_buffer_storage (as long as you don't mind maintaining two codepaths, that is).

  6. #6
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,117
    GL_ARB_buffer_storage gave me a significant improvement but I am writing to the buffer not reading.

  7. #7
    Senior Member OpenGL Pro
    Join Date
    Jan 2007
    Posts
    1,198
    The 3 frame latency should be pretty standard, but if I'm reading this right the specific problem is that it's not 3 frames, it's actually 5? And that the driver is telling you the readback is ready after 3 frames but it's really not?

    The first thing I'd do to tackle this is try putting some strategically-placed glFlush calls around the code and see if they can give the driver a hint that you'd really like it to start processing buffered-up work now, please. At the end of each frame might be a good place to start, and maybe after the code that builds the TF buffer in the first place might be another.

    If that doesn't work, then another approach may be to put a glFinish or other sync object at the end of each frame. You'll run slower overall, but at least your framerates will be consistent, which seems better overall than getting fast/fast/fast/fast/sloooooooooooooooooowwwww/fast/fast/etc.

    Also consider if you actually need the TF data each frame. I'm assuming that you're using this for per-polygon collision, but I'm not certain if you're running the collision on the GPU and reading back a result, or if you're reading back transformed meshes in order to run the collision on the CPU. In either case you may be able to cache and reuse a result. In the former, if two meshes don't move between frames then the previous result is good to reuse. In the latter, if any given mesh doesn't move then the previous result is good to reuse. You may already be doing that, of course.
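
    One way to sketch the "cache and reuse" idea is to keep a version counter per mesh and skip the TF pass when nothing has changed. Everything here is hypothetical: the class name, the integer mesh ids, and the assumption that the application bumps a counter whenever a mesh moves.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical bookkeeping: every mesh carries a counter that the
// application increments whenever the mesh moves.
class ReadbackCache {
public:
    // Returns true if the TF pass and readback must run for this mesh,
    // false if the result cached for the same version can be reused.
    bool needsUpdate(int meshId, std::uint64_t transformVersion) {
        auto it = m_lastVersion.find(meshId);
        if (it != m_lastVersion.end() && it->second == transformVersion)
            return false;
        m_lastVersion[meshId] = transformVersion;
        return true;
    }

private:
    std::unordered_map<int, std::uint64_t> m_lastVersion;
};
```

    The renderer would call needsUpdate() before issuing the TF pass and keep the previously read values when it returns false.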

    Another thing you may also be doing is coarser bounding-box tests before the finer-grained per-polygon tests. If not you should do so: getting a fast reject would mean that you don't even need to run the TF stage.
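
    A coarse test of that kind is usually an axis-aligned bounding box overlap check. The sketch below is a generic version with an assumed Aabb layout, not code from the renderer discussed here.

```cpp
#include <cassert>

// Axis-aligned bounding box; min and max are opposite corners.
struct Aabb {
    float min[3];
    float max[3];
};

// Two boxes overlap only if their intervals overlap on every axis.
inline bool overlaps(const Aabb& a, const Aabb& b) {
    for (int axis = 0; axis < 3; ++axis) {
        assert(a.min[axis] <= a.max[axis] && b.min[axis] <= b.max[axis]);
        if (a.max[axis] < b.min[axis] || b.max[axis] < a.min[axis])
            return false;  // separated on this axis: fast reject
    }
    return true;
}
```

    Only pairs for which overlaps() returns true would need the per-polygon test, and therefore the TF pass and readback.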

  8. #8
    Junior Member Regular Contributor
    Join Date
    Dec 2009
    Posts
    210
    If there is a 3 frame delay between command submission and execution, it makes sense that you get the delay twice, because when you submit the ReadBufferData command in frame 3, there are two more frames of commands in the pipeline, and the driver has to process the commands in the order they are submitted.

  9. #9
    Senior Member OpenGL Pro Aleksandar
    Join Date
    Jul 2009
    Posts
    1,144
    First, I have to apologize for this very long delay, but I was on a trip last week and unable to try anything.

    Originally posted by mhagain:
    The 3 frame latency should be pretty standard, but if I'm reading this right the specific problem is that it's not 3 frames, it's actually 5? And that the driver is telling you the readback is ready after 3 frames but it's really not?
    Yes, that is what is going on. If there is no additional waiting, two consecutive frames are about 0.88ms, while the third one is 2.2ms (CPU time). GPU average time is about 2.74ms (it depends on the scene, but is pretty steady).

    Originally posted by mhagain:
    The first thing I'd do to tackle this is try putting some strategically-placed glFlush calls around the code and see if they can give the driver a hint that you'd really like it to start processing buffered-up work now, please. At the end of each frame might be a good place to start, and maybe after the code that builds the TF buffer in the first place might be another.
    Great hint! Thanks a lot!
    glFlush after filling TF buffer actually removes additional waiting.
    It is quite interesting that putting glFlush at the end of the frame (SwapBuffers actually calls it under the hood) changes nothing. That is quite strange and proves I actually don't know how glFlush works.
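
    The ordering that worked can be sketched as follows. To keep the example checkable without a GL context, the GL calls are injected through a hypothetical TfCalls struct; in the renderer they would be the real calls named in the comments, and the query and rasterizer-discard calls from QueryTF() are left out for brevity.

```cpp
#include <cassert>
#include <functional>

// Hypothetical shim standing in for the real GL entry points.
struct TfCalls {
    std::function<void()> beginTransformFeedback;  // glBeginTransformFeedback(GL_POINTS)
    std::function<void()> draw;                    // glDrawArrays(GL_POINTS, first, count)
    std::function<void()> endTransformFeedback;    // glEndTransformFeedback()
    std::function<void()> flush;                   // glFlush()
};

// Issues the TF pass and flushes right after it, asking the driver to
// submit the capture now instead of leaving it in its command queue.
inline void issueTfPass(const TfCalls& gl) {
    assert(gl.beginTransformFeedback && gl.draw &&
           gl.endTransformFeedback && gl.flush);
    gl.beginTransformFeedback();
    gl.draw();
    gl.endTransformFeedback();
    gl.flush();  // placed here, not at the end of the frame
}
```

    The point of the sketch is only the position of the flush: directly after the transform feedback pass ends, before any other work for the frame is recorded.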

    Originally posted by mhagain:
    Also consider if you actually need the TF data each frame. I'm assuming that you're using this for per-polygon collision, but I'm not certain if you're running the collision on the GPU and reading back a result, or if you're reading back transformed meshes in order to run the collision on the CPU. In either case you may be able to cache and reuse a result. In the former, if two meshes don't move between frames then the previous result is good to reuse. In the latter, if any given mesh doesn't move then the previous result is good to reuse. You may already be doing that, of course.
    I'm doing a collision test on the CPU, but need data from the GPU since they are created on the GPU. Maybe some efficient way of reading values from the texture would be better, but I really doubt texture reading could outperform the current approach.
    Also, I'll certainly optimize TF and prevent unnecessary readings, but at the moment I'm forcing it in each frame to find the most efficient way to solve slow readings.

    Originally posted by mhagain:
    Another thing you may also be doing is coarser bounding-box tests before the finer-grained per-polygon tests. If not you should do so: getting a fast reject would mean that you don't even need to run the TF stage.
    Completely agree! Optimization will follow as soon as I find the best solution for reading.

    Originally posted by mbentrup:
    If there is a 3 frame delay between command submission and execution, it makes sense that you get the delay twice, because when you submit the ReadBufferData command in frame 3, there are two more frames of commands in the pipeline, and the driver has to process the commands in the order they are submitted.
    Probably! As the previous experiment showed, if glFlush is called right after filling the TF buffer, everything works as expected, but if it is delayed, it has no effect.
    But the question is: if there are new commands waiting for execution, why are the previous ones not already flushed?

    Originally posted by tonyo_au:
    GL_ARB_buffer_storage gave me a significant improvement but I am writing to the buffer not reading.
    I have tried buffer storage and got no improvement. Maybe there can be some, but I have to find the right combination of flags. (GL_MAP_PERSISTENT_BIT | GL_MAP_READ_BIT) gains no speedup.
