VBO access -- (GL_DYNAMIC_DRAW_ARB, GL_READ_WRITE_ARB) faster than (GL_DYNAMIC_DRAW_A

This was posted on Usenet:

"Anyone have any ideas why accessing a vertex buffer object that was
created with GL_DYNAMIC_DRAW_ARB – using the GL_READ_WRITE_ARB access
flag would be faster than using the GL_WRITE_ONLY_ARB access flag?

The code in question at the time of the vertex buffer lock is also
requesting to “discard” the old buffer by passing in NULL to the
glBufferDataARB function before calling glMapBufferARB.

Using frames per second as a “basic” metric – my current access
patterns are yielding:

~2740 FPS with GL_DYNAMIC_DRAW_ARB / GL_WRITE_ONLY_ARB combo
~4120 FPS with GL_DYNAMIC_DRAW_ARB / GL_READ_WRITE_ARB combo

Maybe I’m missing something – but it seems to me that
GL_WRITE_ONLY_ARB access flag “should” be faster than the
GL_READ_WRITE_ARB access flag.

The video card is an NVidia 7800GTX (drivers are not the latest)."

What do you think is the reason for this behaviour?

Maybe the driver does an extra copy? Or a bug?

Wow. I guess I’d ask a couple of questions to narrow things down.

  1. Is the “discard” necessary? I’d be concerned that there could be some small-but-cumulative memory issues from not reusing the buffers.

For example, maybe R/W vs W/O do different runtime checks on whether the old and new buffers are the same size? (I can imagine R/W doing more checks and avoiding the hit, though it’s a long-shot guess.)

  2. Why are you/he using old drivers? Test with the latest drivers, try not discarding if the size is unchanged, and see where it stands.

Originally posted by Cyranose:
  1. Is the “discard” necessary? I’d be concerned that there could be some small-but-cumulative memory issues from not reusing the buffers.

The goal of the discard is to notify the driver that it can allocate new buffer memory if the old memory is currently in use by the GPU. Without it, the map would need to wait until the GPU is done with the buffer.
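
To make the discard idea concrete, here is a toy model in plain C++. It is only a sketch of the behaviour described above, not how any real driver is implemented: `ToyBuffer`, `QueueDraw`, `Discard` and `Map` are hypothetical names, and "the GPU is still using the storage" is reduced to a pointer comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for a driver-side buffer object. All names are hypothetical.
class ToyBuffer {
public:
    explicit ToyBuffer(std::size_t size)
        : size_(size),
          storage_(std::make_shared<std::vector<unsigned char>>(size)) {}

    // Pretend a draw call was queued that still reads the current storage.
    void QueueDraw() { gpuStorage_ = storage_; }

    // Models glBufferDataARB(target, size, NULL, usage): if the GPU still
    // references the current storage, detach it and allocate a fresh block.
    void Discard() {
        if (gpuStorage_ == storage_) {
            storage_ = std::make_shared<std::vector<unsigned char>>(size_);
        }
    }

    // Models glMapBufferARB: counts a stall whenever the CPU would have to
    // wait for the GPU to finish with the storage being mapped.
    unsigned char* Map() {
        if (gpuStorage_ == storage_) {
            ++stalls_;
            gpuStorage_.reset();  // the wait is over; the GPU is done
        }
        return storage_->data();
    }

    int stalls() const { return stalls_; }

private:
    std::size_t size_;
    std::shared_ptr<std::vector<unsigned char>> storage_;     // CPU-visible storage
    std::shared_ptr<std::vector<unsigned char>> gpuStorage_;  // storage a queued draw reads
    int stalls_ = 0;
};
```

In this model, mapping right after `QueueDraw()` counts one stall, while calling `Discard()` in between lets `Map()` return fresh storage without waiting.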

Originally posted by rootnode:
What do you think is the reason for this behaviour?
It is possible that with WRITE_ONLY mode you are getting a pointer to write-combined, uncached memory, while with READ_WRITE mode you get ordinary cached memory. In that case WRITE_ONLY mode would be more sensitive to the way in which your application accesses the mapped memory.

I am the original poster of this message on the usenet forums.

In my ongoing attempt to look further into this issue – I made my way back here and saw this posting. Thanks rootnode for posting it up.

Just to bring you all up to speed – there was nothing said on the usenet forum that helped resolve this.

One other user also mentioned that he too observed the same issue on multiple ATI cards.

I have updated to the latest drivers – yet the results are the same (unfortunately).

A bit more background on the code. It’s actually font rendering code, and this specific bit of code is generating the quads (two tris) for each character or glyph in the string. Pretty straightforward stuff.

I am locking/mapping the vertex buffer and of course writing directly to it. But that’s the whole point of being able to map. I could try (just for test purposes) generating the quad data in system memory and memcpy-ing it into the locked vertex buffer. But that really defeats the point of buffer objects. However, if the results of that were indeed faster, then I guess it would point to cache issues.

You don’t need to generate the data in a separate buffer to efficiently write to uncached write-combined memory. You should just generate your data such that it gets written to the buffer in a contiguous linear pattern. For vertex data this generally means you want to use one of the following approaches:

  • Generate whole vertices (all vertex attributes) at a time and use interleaved vertex attrib arrays. Make sure you write the components in the order that they appear in memory.
  • Generate each attribute for all vertices at a time and use non-interleaved vertex attrib arrays.

If random access to the buffer is absolutely required for your particular app, then you should probably either generate the data in a separate buffer or map the buffer object READ_WRITE to get a cached mapping.
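
As a rough illustration of the first approach, the sketch below fills interleaved quad vertices strictly front to back, writes each field in memory order, and never reads from the destination. The `Vertex` layout, `Put` and `WriteQuads` are hypothetical names made up for this example, and the destination pointer stands in for whatever glMapBufferARB would return.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interleaved layout: position followed by texture coordinate.
struct Vertex {
    float x, y, z;
    float u, v;
};

// Stores one vertex, assigning the fields in the order they sit in memory,
// and returns the address of the next vertex slot.
static Vertex* Put(Vertex* out, float x, float y, float u, float v) {
    out->x = x;
    out->y = y;
    out->z = 0.0f;
    out->u = u;
    out->v = v;
    return out + 1;
}

// Writes four vertices per glyph in increasing address order.
// `dst` is written only, never read. Returns the number of vertices written.
std::size_t WriteQuads(Vertex* dst, std::size_t glyphCount,
                       float glyphWidth, float glyphHeight) {
    Vertex* out = dst;
    for (std::size_t i = 0; i < glyphCount; ++i) {
        const float x0 = static_cast<float>(i) * glyphWidth;
        const float x1 = x0 + glyphWidth;
        out = Put(out, x0, 0.0f,        0.0f, 0.0f);
        out = Put(out, x1, 0.0f,        1.0f, 0.0f);
        out = Put(out, x1, glyphHeight, 1.0f, 1.0f);
        out = Put(out, x0, glyphHeight, 0.0f, 1.0f);
    }
    return static_cast<std::size_t>(out - dst);
}
```

The thing to avoid on a write-combined mapping is anything that reads the destination back (for example `dst[i].x += offset`) or that jumps around between vertices while filling them in.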

Also keep in mind that if all you’re doing is generating data into a VBO as a benchmark then you may not be seeing the full benefits of using an uncached mapping; i.e. avoiding polluting the CPU’s cache with data that will only be read by the GPU doesn’t get you anything if you’re not doing much other memory access on the CPU.

Originally posted by jgennis:
You don’t need to generate the data in a separate buffer to efficiently write to uncached write-combined memory.
I understand that. I only suggested it more as a trouble shooting technique than anything else.

You should just generate your data such that it gets written to the buffer in a contiguous linear pattern.
I am doing that. Data is being written as it should appear in memory.

For vertex data this generally means you want to use one of the following approaches:

  • Generate whole vertices (all vertex attributes) at a time and use interleaved vertex attrib arrays. Make sure you write the components in the order that they appear in memory.
  • Generate each attribute for all vertices at a time and use non-interleaved vertex attrib arrays.

If random access to the buffer is absolutely required for your particular app…
Random access is not required. At the current time I only require sequential linear access.

…then you should probably either generate the data in a separate buffer or map the buffer object READ_WRITE to get a cached mapping.

Also keep in mind that if all you’re doing is generating data into a VBO as a benchmark then you may not be seeing the full benefits of using an uncached mapping; i.e. avoiding polluting the CPU’s cache with data that will only be read by the GPU doesn’t get you anything if you’re not doing much other memory access on the CPU.
My current scenario is definitely a legitimate real-world scenario. However, it just so happens to be the only thing going on in the application at the moment. That will change over time.

You mention that using READ_WRITE implies cached access? Does this mean the data is mapped to system memory rather than AGP memory (as I assume WRITE_ONLY does)?

Originally posted by Brian Lawson:
You mention that using READ_WRITE implies cached access? Does this mean the data is mapped to system memory rather than AGP memory (as I assume WRITE_ONLY does)?
READ_WRITE will likely return a pointer to cached system memory. On PCIe systems the sysmem returned with either WRITE_ONLY or READ_WRITE will probably be directly accessible to the GPU (so they both behave similarly to AGP memory on AGP systems). The difference is that WRITE_ONLY may be uncached write-combined memory so as not to pollute the CPU’s cache with data that will never be accessed by the CPU again.

Can you post the code you’re using to map and write the VBO and/or a simple binary that reproduces the problem?

The code in question is fairly heavily abstracted – so I’ve gone through and pulled out the API specific code and tossed it into a simple “sample” PrintStrings() function just for the purpose of demonstrating how I’m doing this.

struct FONT_VERT_XYZ_UV
{
    Vector3 p;
    Vector2 uv;
};

static const unsigned int MAX_NUM_TEXT_CHARS = 256;


void PrintStrings()
{
    // Init -- really only done once -- not every time the function is called.
    // Only here to demonstrate how I'm initializing the vertex buffer.
    unsigned int VertexBufferID;

    glGenBuffersARB( 1, &VertexBufferID );
    glBindBufferARB( GL_ARRAY_BUFFER_ARB, VertexBufferID );
    glBufferDataARB( GL_ARRAY_BUFFER_ARB, sizeof(FONT_VERT_XYZ_UV) * MAX_NUM_TEXT_CHARS * 4, NULL, GL_DYNAMIC_DRAW_ARB );



    // for each string
    // {

        // Discard
        glBufferDataARB( GL_ARRAY_BUFFER_ARB, sizeof(FONT_VERT_XYZ_UV) * MAX_NUM_TEXT_CHARS * 4, NULL, GL_DYNAMIC_DRAW_ARB );


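        // glMapBufferARB returns void*, so C++ needs an explicit cast.
        //FONT_VERT_XYZ_UV *LockedVerts = static_cast<FONT_VERT_XYZ_UV *>( glMapBufferARB( GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB ) );
        FONT_VERT_XYZ_UV *LockedVerts = static_cast<FONT_VERT_XYZ_UV *>( glMapBufferARB( GL_ARRAY_BUFFER_ARB, GL_READ_WRITE_ARB ) );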

        for( unsigned int i = 0; i < StringLength; i++ )
        {
            LockedVerts[ i * 4     ].p.Set( ... );
            LockedVerts[ i * 4     ].uv = ...;

            LockedVerts[ i * 4 + 1 ].p.Set( ... );
            LockedVerts[ i * 4 + 1 ].uv = ...;

            LockedVerts[ i * 4 + 2 ].p.Set( ... );
            LockedVerts[ i * 4 + 2 ].uv = ...;

            LockedVerts[ i * 4 + 3 ].p.Set( ... );            
            LockedVerts[ i * 4 + 3 ].uv = ...;
        }

        // Unlock the vertex buffer before drawing from it.
        glUnmapBufferARB( GL_ARRAY_BUFFER_ARB );

        // Draw characters / quads
        // SetTexture         ( ... );
        // SetIndexBuffer     ( ... );
        // SetPositionStream  ( ... );
        // SetTexCoordStream  ( ... );
        // DrawIndexedPrimitive(  TRIANGLE_LIST, ... );

    // }

    // Really only done at program exit -- not every time function is called.
    glDeleteBuffersARB( 1, &VertexBufferID );
}

I haven’t been able to reproduce the behavior you’re seeing with WRITE_ONLY buffer object mapping (it’s the same speed as READ_WRITE for me). Any chance you can post a link to an executable that demonstrates this?

Always remember… the compiler may reorder the memory access statements. So you should always use volatile on the pointer.

See here: http://www.xyzw.de/c170.html

I did some testing with 50 objects of 7578 vertices each (484992 bytes of vertex data per object):

  1. Mapping with:
  glBufferDataARB(GL_ARRAY_BUFFER_ARB, dwSizeInBytes, NULL, GL_STREAM_DRAW_ARB);
  pRet=glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY);
or
  pRet=glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_READ_WRITE);
  


(I’m using STREAM_DRAW because I draw each object a couple of times: once to create the shadow map and once to render the object. In my tests this is slightly faster than DYNAMIC_DRAW; the difference is in the overall draw, not in the map itself.)

Doing a FastCopy from a system memory array to the pointer returned by MapBuffer is then slightly faster with GL_READ_WRITE than with GL_WRITE_ONLY (which also surprised me). But it is only a small accumulated difference over mapping and copying all 50 objects (each one with its own MapBuffer+FastCopy). I’m talking about nearly 25MB of data.
(FYI: by FastCopy I mean something like a copy using prefetch and MMX registers. If you are interested, I think you can find a GDC presentation about that on AMD’s website.)
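
For readers who have not seen this kind of copy routine, here is a minimal sketch in the same spirit. It is an assumption on my part, not the poster’s FastCopy: it uses SSE2 non-temporal stores instead of MMX, leaves out the prefetching, and falls back to plain memcpy when SSE2 is not available at compile time. `StreamCopy` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Copies `bytes` bytes from src to dst. With SSE2, the bulk of the data is
// written with non-temporal (streaming) stores that bypass the CPU cache.
void StreamCopy(void* dst, const void* src, std::size_t bytes) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
#if defined(__SSE2__)
    // Streaming stores need a 16-byte aligned destination, so copy a short
    // head with memcpy until d is aligned.
    std::size_t head = (16 - (reinterpret_cast<std::uintptr_t>(d) & 15)) & 15;
    if (head > bytes) head = bytes;
    std::memcpy(d, s, head);
    d += head;
    s += head;
    bytes -= head;

    // Main loop: unaligned 16-byte load, aligned streaming store.
    for (; bytes >= 16; bytes -= 16, d += 16, s += 16) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
        _mm_stream_si128(reinterpret_cast<__m128i*>(d), v);
    }
    _mm_sfence();  // make the streaming stores visible before returning
#endif
    // Remaining tail (or the whole copy when SSE2 is unavailable).
    std::memcpy(d, s, bytes);
}
```

Whether this beats memcpy depends on the CPU, the C library and the kind of memory behind the mapped pointer, so it is worth measuring rather than assuming.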

  2. Before this test, I was creating the transformed vertices (skinning) in a system memory buffer (I access the memory randomly, so I can’t write directly into a buffer returned by MapBuffer) and then uploading it with:
 glBufferDataARB(GL_ARRAY_BUFFER_ARB, dwSizeInBytes, pPtr, GL_STREAM_DRAW_ARB);

To my surprise, I discovered that this approach is slower than replacing glBufferData with:

glBufferDataARB(GL_ARRAY_BUFFER_ARB, dwSizeInBytes, NULL, GL_STREAM_DRAW_ARB);
void *pRet=glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY);
FastCopy(pRet, pSysMemBuffer, dwSizeInBytes);
glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);

(again it is slightly faster using GL_READ_WRITE, but it is minimal)

The difference between the two methods (BufferData vs Map+FastCopy+Unmap) is noticeable.

Hope this helps.

(Note: I’m using a GF 8800GTX)

This turned out to be a bug in the NVIDIA driver. The driver wasn’t using the BufferData(NULL) to avoid stalling in the WRITE_ONLY case. It should be fixed in an upcoming driver release (though probably not the next release).