glBufferData variant with retained data copying

I am moving this discussion here from the “Official feedback on OpenGL 3.2 thread” to avoid spamming it further.

Here’s the proposal:

Jon Leech’s reply:

And my take:

Some people suggest that glMapBuffer works great: mapping a buffer is a pretty straightforward usage pattern, and calling glBufferData with data set to NULL is painless…

To which I disagree: mapping the buffer will most certainly lead to sync issues, unless glBufferData( …, NULL ) is used first - in which case I am forced to update the whole buffer, and this may also create a new buffer on the graphics card (and I would rather reuse the old one, as it can be very big).

Everybody agrees that in most cases glBufferSubData is useless.

I don’t see how this can avoid syncing, unless MAP_UNSYNCHRONIZED_BIT is used - but why would anyone use that?
With my proposal, the driver knows best when the buffer is available, and will only then begin copying my original data, thus avoiding a sync.

I think that glBufferData with data != NULL should be avoided as well.

The whole OpenGL API is either:

  • synchronous (glTexImage, glReadPixels, …), taking a plain C pointer
  • or asynchronous, when a buffer is bound and an offset is used instead of a pointer

I believe this is good enough.

The only API that does not fit is glBufferData/glBufferSubData with a non-NULL C pointer. With these we are back in the synchronous world: the driver must wait until the buffer is available and then copy the data.
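To illustrate the two styles with texture uploads (a minimal sketch; the texture dimensions and the pbo/pixelData names are placeholders):

/* Synchronous: the driver must read (or at least copy) pixelData before
 * returning, so the call cannot complete until the client memory has been
 * consumed. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 256, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixelData);

/* Asynchronous: with a pixel-unpack buffer bound, the last argument is an
 * offset into the buffer object, so the transfer can be queued and performed
 * later without touching client memory. */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 256, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, (const GLvoid *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);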

I think the only way is to properly use buffer mapping.
As someone suggested, use of MapBufferRange could help.

Right, but then the driver must wait. It cannot return earlier; otherwise the application could destroy the data after BufferData returns and before the data is copied.

See extension ARB_copy_buffer:

Replace BufferSubData with a non-cache-polluting update:


        glBindBuffer(GL_COPY_READ_BUFFER, tempBuffer);
        glBufferData(GL_COPY_READ_BUFFER, updateSize, NULL, GL_STREAM_DRAW);
        // this may return a write-combined mapping!
        ptr = glMapBuffer(GL_COPY_READ_BUFFER, GL_WRITE_ONLY);
        // fill ptr
        glUnmapBuffer(GL_COPY_READ_BUFFER);

        glBindBuffer(GL_COPY_WRITE_BUFFER, vtxBuffer);
        // this copy ideally requires no CPU work on the data itself.
        glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                            0, writeOffset, updateSize);

This example shows that you can modify part of the buffer without waiting at all; the modification can be enqueued using glCopyBufferSubData.

(I have no real experience with this).

No, I think you misunderstood what I proposed.
glBufferDataRetained will return immediately, and the driver will not wait for the data to be copied - it will know when the right time to copy the data comes and will copy it then, but it does not have to stop and wait until that moment.
The copying can even be done by a separate driver thread.

The application should not modify or delete the data until the copying has ended, and it knows the copying has ended when the relevant sync object (see the new ARB_sync API) is signaled. If the app still insists on modifying/deleting the data earlier, it must call ClientWaitSync first.

To recap (a sketch of this flow follows the list):

  1. Modify the data.
  2. Call glBufferDataRetained - it returns immediately; the driver will copy the data later, when the right time comes.
  3. At some later time the app will want to modify the data again:
  • check whether the sync object is signaled (i.e. the data has been copied and you can do whatever you want with it);
  • if the sync object is not signaled, you have two choices:
    • a) call ClientWaitSync and then modify the data, or
    • b) store the [range of] changed data somewhere else and use glBuffer[Sub]Data again.
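A minimal sketch of the proposed flow. To be clear, glBufferDataRetained is hypothetical - it is the entry point proposed here, not an existing GL function - and its signature (returning a sync object through the last parameter) is my assumption, as is the app-specific helper updateVertices():

/* HYPOTHETICAL: glBufferDataRetained does not exist in OpenGL; its name and
 * signature are assumptions based on the proposal above. */
GLsync copyDone;

updateVertices(vertexData, vertexCount);  /* 1. modify the data (app code) */

/* 2. returns immediately; the driver copies from vertexData later, when the
 *    buffer becomes available, and signals copyDone once it is finished */
glBufferDataRetained(GL_ARRAY_BUFFER, dataSize, vertexData, GL_DYNAMIC_DRAW,
                     &copyDone);

/* 3. later, before touching vertexData again: */
if (glClientWaitSync(copyDone, 0, 0) == GL_TIMEOUT_EXPIRED) {
    /* a) block until the driver has taken its copy ... */
    glClientWaitSync(copyDone, GL_SYNC_FLUSH_COMMANDS_BIT,
                     1000000000ull /* 1 s timeout */);
    /* b) ... or write the new data to a different location instead */
}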

This may work. It will still use another buffer on the GPU and will require more work from the driver. Probably MapBufferRange with MAP_INVALIDATE_RANGE_BIT does something similar internally.

You are asking for something like the SGIX_async extension.
I doubt the ARB will go this way.

“SGIX_async provides a way to allow certain OpenGL commands to complete out-of-order with respect to others. This extension does not by itself enable asynchrony;”

Why not? This approach fixes some much-debated gripes about updating buffer objects, is easy to use, and fits very nicely with the ARB_sync API.

Whenever a new approach is evaluated for updating buffer objects one has to consider the following:

  • does this approach have sync issues
  • does it cause the data to be copied and/or allocated more than once
  • is it easy to use
  • does it fit in the current OpenGL API
  • are there use cases where this will be helpful

My proposal fares very well on all these criteria.

I’ve never used MapBufferRange so far, but I think you might be able to do what you want with it. I’ve been reading the MapBufferRange specification, but I’m unsure about the following example:

/* Map the entire buffer for write - unsynchronized.
 * GL will not block for prior operations to complete.  The application must
 * use other synchronization techniques to ensure correct operation.
 */
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);

In what way can the application be sure that the data is synchronized? Does this just mean that the application has to make sure that writing data to the buffer has finished? Or is there some GL command that causes the mapped buffer range to synchronize again? In either case it would be possible to create some kind of wrapper around buffers with smart behaviour for deleting CPU-side data once the buffer is synchronized: if it is just a matter of making sure the data has been written to the buffer, the application can handle this itself; if synchronization is achieved by some GL command, a fence could be placed after that command, which lets you know when synchronization has completed.

I also wonder: using MapBufferRange with MAP_UNSYNCHRONIZED_BIT, is it possible to write data to the buffer from another CPU thread? Or will this cause problems?

I have no idea what these “other synchronization techniques to ensure correct operation” could be, apart from glFlush. If you already know that the buffer is synced and safe to modify, you can just map it - I doubt MAP_UNSYNCHRONIZED_BIT will make much difference if a sync is not needed anyway.

The way to use MapBufferRange for updating a range of the buffer is to first invalidate the range (MAP_INVALIDATE_RANGE_BIT) and then write the new data into that range. This way, if the buffer cannot be mapped immediately, the driver can allocate another buffer in graphics memory, store the changed data there, and copy it into the final buffer when it becomes available. It still requires an extra copy.
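Something like this minimal sketch, where vbo, rangeOffset, rangeLength, and newData are placeholders:

glBindBuffer(GL_ARRAY_BUFFER, vbo);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, rangeOffset, rangeLength,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
if (ptr) {
    memcpy(ptr, newData, rangeLength);  /* write only the new contents */
    glUnmapBuffer(GL_ARRAY_BUFFER);
}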

The “other synchronization techniques” are fences as provided by APPLE_fence, ARB_sync or OpenGL 3.2. When MAP_UNSYNCHRONIZED_BIT is set, the driver will not wait for the command queue to drain of commands referencing the buffer object and will instead let you map it immediately. It then becomes the responsibility of the application to keep track of which ranges in the buffer object may currently be in use by the GPU. Of course, to use this option effectively, you also need to be able to flush sub-ranges of the buffer object (MAP_FLUSH_EXPLICIT_BIT together with FlushMappedBufferRange).
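For example (a minimal sketch, assuming ARB_sync; the offset and length names are placeholders): place a fence after the last command that reads the range, and only rewrite that range once the fence has signaled.

/* after the last draw call that reads from the range: */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

/* later, before overwriting the same range: */
GLenum r = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                            1000000000ull /* 1 s timeout */);
if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, length,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    /* safe to write: the GPU is no longer reading this range */
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
glDeleteSync(fence);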

MAP_INVALIDATE_RANGE_BIT is a hint to the driver that it may create a new allocation or at least avoid copying any data from the on-GPU buffer object and just hand the application a pointer. And so yes, it may also allow the driver to return to the application even if the buffer object is still in use.

Note that these options are not mutually exclusive. You may for instance want to map a buffer object range as read-write and only update a sub-range of the mapping. MAP_UNSYNCHRONIZED_BIT will let you map the range without blocking while preserving the content of the range, which MAP_INVALIDATE_RANGE_BIT will not allow.

You can also use MAP_INVALIDATE_BUFFER_BIT which will behave like a BufferData(…, NULL). Drivers may have optimizations for this case, such as buffer object double-buffering, and return quickly from a subsequent MapBuffer command.
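That is, something like this sketch (vbo and bufferSize are placeholders), which per the above should behave like the classic orphaning idiom of BufferData(…, NULL) followed by MapBuffer:

glBindBuffer(GL_ARRAY_BUFFER, vbo);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, bufferSize,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
/* fill the whole buffer - the previous contents are now undefined */
glUnmapBuffer(GL_ARRAY_BUFFER);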

Using MAP_UNSYNCHRONIZED_BIT means, “I am taking full responsibility for the results of my actions. I want a pointer to this buffer. Right now.”

You can use ARB_sync, NV_fence, or other similar mechanisms to provide the synchronization guarantee yourself.

GeLeTo, can you break down in some more detail what the typical data flow is like - i.e. the amount of data being changed per draw call, and the overlap between old and new data?

For example, “I have to submit about 20-30KB of new vertex data per draw, replacing half of the previously drawn content.” or… “I have to push 2.1MB of verts every frame and the new verts are unrelated to the old ones.”

Knowing the data flow a little better might yield some clearer ideas.

IMO, the name of the game is always to make sure the GPU has its next block of work already waiting for it when it finishes the one it is working on. To some extent there is a space/time tradeoff here, since new work usually occupies space, as does the old work the GPU hasn’t finished yet; arriving at an ideal world where the storage needs are minimal and the performance is maximal is… difficult.

There are usually two copies of the data - one in system memory, one in GPU memory. The best time to copy the data from system to GPU memory is right after the GPU has finished using the buffer. And only the driver knows when that is.

It’s a modelling application. At higher subdivisions a single polygon may have >2000 triangles. Different polygons do not share vertices. The most common case is meshes with several thousand polygons (each rectangular polygon consisting of between 32 and 2048 triangles), of which only a few are changed. Sometimes the same model is drawn in more than one view.

A few ramblings about this:

  1. AFAIK graphics cards do not like very big meshes. Maybe this has changed with newer cards? So it may be a good idea to break the mesh into smaller chunks and draw them separately. But how small? This would also help with partial updates.

  2. If I use MapBuffer I can DMA-copy the computed verts directly rather than storing them in system memory first. This can be done in the last stage of the polygon tessellation, where the SSE SoA data (a separate array for each x, y, z, … component) is gathered into more GPU-friendly xyz… structs and stored. But it’s a bit more complicated than that - the tessellation is done by many threads, each worker thread being given several polygons at a time. I have two choices:

  • First map the buffer and then let each thread write directly into it; but the writes will not be quite linear, and the buffer may stay mapped for a long time because tessellation is much slower than a plain copy (see the sketch after this list).
  • First generate the SoA data for the whole mesh and only after that map the buffer and copy-gather it from a single thread. This will thrash the cache.
    And if I split the mesh into chunks I can give each chunk to a different thread, though it won’t be as fine-grained and optimal as the current approach (e.g. you can have 5 chunks and 4 threads). Just thinking aloud.
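For the first option, a minimal sketch assuming pthreads; fillChunk() is a hypothetical app-specific tessellation routine, and GL context and extension-loading setup are omitted. Map once, then give each worker thread its own disjoint sub-range of the mapping, so no locking on the mapping itself is needed:

#include <pthread.h>
#include <stddef.h>
#include <GL/gl.h>

#define NUM_CHUNKS 4

typedef struct { float *dst; int firstPoly, polyCount; } ChunkJob;

/* hypothetical app-specific tessellation of a polygon range into dst */
extern void fillChunk(float *dst, int firstPoly, int polyCount);

static void *worker(void *arg) {
    ChunkJob *job = (ChunkJob *)arg;
    fillChunk(job->dst, job->firstPoly, job->polyCount);
    return NULL;
}

void tessellateIntoMappedBuffer(int polysPerChunk, int floatsPerChunk) {
    /* map once, then hand each thread a disjoint sub-range of the mapping */
    float *base = (float *)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    pthread_t threads[NUM_CHUNKS];
    ChunkJob jobs[NUM_CHUNKS];
    for (int i = 0; i < NUM_CHUNKS; ++i) {
        jobs[i].dst = base + (size_t)i * (size_t)floatsPerChunk;
        jobs[i].firstPoly = i * polysPerChunk;
        jobs[i].polyCount = polysPerChunk;
        pthread_create(&threads[i], NULL, worker, &jobs[i]);
    }
    for (int i = 0; i < NUM_CHUNKS; ++i)
        pthread_join(threads[i], NULL);
    /* unmap only after every writer thread has finished */
    glUnmapBuffer(GL_ARRAY_BUFFER);
}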

CopyBufferSubData is basically the equivalent of PBOs for texture uploads. I’m not sure why you think it requires more work from the driver, or why it would use another buffer on the GPU.

To be clear, what you want is:

  • allocate a buffer in memory
  • fill the buffer
  • pass a pointer to the buffer to OpenGL for transfer into a buffer object
  • let GL return immediately but use a sync object to signal completion
  • do something else with the buffer after GL has finished reading it, but don’t touch it before.

However there’s no reason why you couldn’t have the GL driver allocate the memory. And that’s what buffer objects are: driver-allocated memory. Thus by creating a buffer object, mapping and filling it, then using CopyBufferSubData, you are performing the steps you want, without the need for explicit synchronisation or the possibility of errors due to the application prematurely deallocating the buffer or modifying the buffer contents.

The creation of the temporary buffer and the subsequent second copying (system mem to temp buffer to final buffer) are not no-ops.

MapBufferRange with MAP_INVALIDATE_RANGE_BIT is very similar. The driver will basically do the same thing as your approach (create another temporary buffer and copy from it later), but ONLY if the buffer cannot be mapped immediately. So it has an advantage: the temporary buffer may not have to be created.
But it has a disadvantage: when many ranges are used (which happens in my use case), it may require the creation of a temporary buffer for each range. With CopyBufferSubData I can pack all the ranges into the same staging buffer, as sketched below.
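A minimal sketch of that packing (two ranges for illustration; stagingBuffer, the sizes, and the destination offsets are placeholders):

glBindBuffer(GL_COPY_READ_BUFFER, stagingBuffer);
glBufferData(GL_COPY_READ_BUFFER, sizeA + sizeB, NULL, GL_STREAM_DRAW);
char *p = (char *)glMapBuffer(GL_COPY_READ_BUFFER, GL_WRITE_ONLY);
memcpy(p,         rangeA, sizeA);  /* pack changed ranges back to back */
memcpy(p + sizeA, rangeB, sizeB);
glUnmapBuffer(GL_COPY_READ_BUFFER);

glBindBuffer(GL_COPY_WRITE_BUFFER, vtxBuffer);
/* one enqueued copy per range, into its final position */
glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                    0,     dstOffsetA, sizeA);
glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                    sizeA, dstOffsetB, sizeB);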

Retained data copying also has disadvantages. For instance, to avoid a sync when the data has to be changed but the sync object is not yet signaled, you may have to use another location for the changed data. I still think it is the best approach for my use case.

Some may argue that with MapBuffer you don’t need a copy in system memory at all - but this depends on the use case.

malloc or new aren’t no-ops either. You don’t need a second copy if you directly use the pointer returned by MapBuffer and no client memory at all.

Yes, as I’ve mentioned above. But unfortunately it’s not always possible. For instance, I have a case where I need the generated vertex data to calculate the normals. Also, I have several threads writing the vertex data in parallel, and this will probably hurt the performance of the DMA transfers.
And you probably don’t want to keep the buffer mapped for a long time (generating the data on the fly can be much slower than just copying it). I am not sure about the last point - does having a mapped buffer stop or slow the driver’s other DMA transfers while you are copying data into it?

If nothing else, this thread certainly highlights the utter and complete confusion out there for how you use the many buffer APIs and flags to get best upload performance on various GL hardware and drivers.

Vendors, how about beating your heads together with the ARB and publishing a whitepaper on the way we should be tickling your drivers for best buffer upload perf. Clearly state, “don’t do X or Y, but do Z instead, …except on Tuesday, and only when it’s raining, then do W!”.

Personally, we got better perf using BufferData( NULL ) + BufferSubData + TexSubImage back when we last looked at it than with BufferData( NULL ) + MapBuffer, but that was before some of the latest extensions, and with no multi-buffer ping-pong, 1-N frame delays before latching, fences, fizbim, phase-of-moon testing, etc., etc.

Vendors, just tell us what buffer API usage you want us to use on your hardware which you are hyper-optimizing for, please! Before we all go nuts or incorrectly write your hardware off as just slow.

If only one of you does it, your approach becomes the de facto OpenGL standard method™. :P