
View Full Version : glBufferData variant with retained data copying



GeLeTo
08-05-2009, 02:21 AM
I am moving this discussion here from the "Official feedback on OpenGL 3.2 thread" to avoid spamming it further.

Here's the proposal:

Currently sending data to a buffer object works like this:
1. Allocate and fill the data.
2. Hand it to glBufferData/glBufferSubData.
3. glBufferData immediately makes a copy of the data (or sends it directly to the card, which may force a sync).
4. When glBufferData returns, the data is no longer needed and I can delete or modify it.

What I want is to skip #3 to avoid the extra copy (or the sync) done by glBufferData. With glBufferDataRetained the data will not be copied immediately, and it must not be changed or deleted until OpenGL signals that it is no longer needed:
1. Allocate and fill the data.
2. Give the data and a sync object to glBufferDataRetained.
3. glBufferDataRetained/glBufferSubData returns immediately without copying anything.
4. The driver waits for the GPU buffer to become available before DMA-copying the data from my original pointer.
5. The next time I need to change or delete the data, I check whether the sync object is signaled (using the new sync API) - if so, I can use it right away. If it is not signaled, I can either call ClientWaitSync (or whatever) to ensure the data is copied right away, or choose to allocate the changed data in a new place.

Currently the only way (that I know of) to avoid the extra copying is to use glMapBuffer. And this is another can of worms...

Jon Leech's reply:

I think both Sun and Apple have done vendor extensions along these lines, and we have sometimes discussed it as a future use case in the ARB. If we do something like this in a future release I hope it would use sync objects to signal the driver being done with the client buffer, but at present it's not being actively discussed in the group.
And my take:

Now that we have sync objects, the API to implement this functionality is a no-brainer, so I hope to see it implemented sooner rather than later.

Some people suggest that glMapBuffer works great: mapping a buffer is a pretty straightforward usage pattern, and calling glBufferData with data set to NULL is painless...

To which I disagree: mapping the buffer will most certainly lead to sync issues. Unless glBufferData(..., NULL) is used - in which case I am forced to update the whole buffer, and this may also create a new buffer on the graphics card (and I would rather reuse the old one, as it can be very big).

Everybody agrees that in most cases glBufferSubData is useless.

GeLeTo
08-05-2009, 02:32 AM
Are you not able to accomplish what you want to do using MapBufferRange? Pay careful attention to the extra mapping options it provides that are not available to the original MapBuffer call.
I don't see how this can avoid syncing, unless MAP_UNSYNCHRONIZED_BIT is used - but why would anyone use that?
With my proposal - the driver will know best when the buffer is available and will only then begin copying my original data, thus avoiding a sync.

mfort
08-05-2009, 02:42 AM
I think that glBufferData with data != NULL should be avoided as well.

The OpenGL API is either:
- synchronous (glTexImage, glReadPixels, ...), taking a plain C pointer
- or asynchronous, when a buffer is bound and an offset is used instead of a pointer

I believe this is good enough.

The only API that does not fit is glBufferData/glBufferSubData with non-NULL C pointers. With these we are back in the synchronous world: the driver must wait until the buffer is available and then copy the data.

I think the only way is to properly use buffer mapping.
As someone suggested, use of MapBufferRange could help.

mfort
08-05-2009, 02:45 AM
With my proposal - the driver will know best when the buffer is available and will only then begin copying my original data, thus avoiding a sync.

Right, but then the driver must wait. It cannot return earlier otherwise the application could destroy the data after the BufferData returns and before the data are copied.

mfort
08-05-2009, 02:50 AM
See extension ARB_copy_buffer:

Replace BufferSubData with a non-cache-polluting update:


BindBuffer(COPY_READ_BUFFER, tempBuffer);
BufferData(COPY_READ_BUFFER, updateSize, NULL, STREAM_DRAW);
// this may return a WriteCombined mapping!
ptr = MapBuffer(COPY_READ_BUFFER, WRITE_ONLY);
// fill ptr
UnmapBuffer(COPY_READ_BUFFER);

BindBuffer(COPY_WRITE_BUFFER, vtxBuffer);
// this copy ideally requires no CPU work on the data itself.
CopyBufferSubData(COPY_READ_BUFFER, COPY_WRITE_BUFFER,
                  0, writeoffset, updateSize);


This example shows that you can modify part of the buffer without waiting at all. The buffer modification can be enqueued using CopyBufferSubData.

(I have no real experience with this).

GeLeTo
08-05-2009, 03:09 AM
Right, but then the driver must wait.
No, I think you misunderstood what I proposed.
glBufferDataRetained will return immediately and the driver will not wait for the data to be copied - it will know when the right time to copy the data comes and then copy the data, but this does not mean that it has to stop and wait before that.
This copying can be done by a separate driver thread.


It cannot return earlier otherwise the application could destroy the data after the BufferData returns and before the data are copied.
The application should not modify or delete the data until the copying has ended. And it will know whether the copying has ended by checking whether the relevant sync object (see the new ARB_sync API) is signaled. If the app still insists on modifying/deleting the data, it must call ClientWaitSync.

To recap:
1. Modify the data
2. Call glBufferDataRetained - it will return immediately. The driver will copy the data later when the right time comes.
3. At some later time the app will want to modify the data again:
- check if the sync object is signaled (i.e. the data has been copied and you can do with it whatever you want).
- if the sync object is not signaled you have two choices:
- a) Call ClientWaitSync and then modify the data
- b) Store the [range of] changed data somewhere else and use glBuffer[Sub]Data again.
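To make the handshake concrete, here is a small CPU-only sketch in plain C (no real GL): a pthread stands in for the driver's deferred copy, and an atomic flag stands in for the ARB_sync fence. All the names here (retained_upload, driver_copy) are made up for illustration; glBufferDataRetained itself does not exist.


#include <pthread.h>
#include <stdatomic.h>
#include <string.h>
#include <stdio.h>
#include <assert.h>

enum { N = 16 };

typedef struct {
    const int  *client_data;   /* pointer handed to the "driver": must stay valid */
    int         gpu_buffer[N]; /* stands in for the buffer object's storage */
    atomic_int  signaled;      /* stands in for the ARB_sync fence */
} retained_upload;

/* "driver thread": copies from the client pointer once the GPU-side
 * buffer is available, then signals the fence */
static void *driver_copy(void *arg) {
    retained_upload *u = arg;
    memcpy(u->gpu_buffer, u->client_data, sizeof u->gpu_buffer);
    atomic_store(&u->signaled, 1);
    return NULL;
}

int main(void) {
    static int client[N];
    for (int i = 0; i < N; i++) client[i] = i;

    retained_upload u = { .client_data = client };
    atomic_init(&u.signaled, 0);

    /* step 2: hand over the pointer - returns immediately, nothing copied yet */
    pthread_t drv;
    pthread_create(&drv, NULL, driver_copy, &u);

    /* step 3a: before touching client[] again, take the ClientWaitSync
     * path (choice a in the recap) and block until the "fence" signals */
    pthread_join(drv, NULL);
    assert(atomic_load(&u.signaled));   /* copy finished: safe to modify */

    client[0] = 100;                    /* does not affect gpu_buffer */
    printf("gpu_buffer[0]=%d client[0]=%d\n", u.gpu_buffer[0], client[0]);
    return 0;
}


Choice (b) from the recap would correspond to skipping the join when the flag is still 0 and writing the changed data into a fresh allocation instead.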

GeLeTo
08-05-2009, 03:27 AM
See extension ARB_copy_buffer:
Replace BufferSubData with a non-cache-polluting update:


BindBuffer(COPY_READ_BUFFER, tempBuffer);
BufferData(COPY_READ_BUFFER, updateSize, NULL, STREAM_DRAW);
// this may return a WriteCombined mapping!
ptr = MapBuffer(COPY_READ_BUFFER, WRITE_ONLY);
// fill ptr
UnmapBuffer(COPY_READ_BUFFER);

BindBuffer(COPY_WRITE_BUFFER, vtxBuffer);
// this copy ideally requires no CPU work on the data itself.
CopyBufferSubData(COPY_READ_BUFFER, COPY_WRITE_BUFFER,
                  0, writeoffset, updateSize);
This example shows that you can modify part of the buffer without waiting at all. The buffer modification can be enqueued using CopyBufferSubData.
(I have no real experience with this).

This may work. It will still use another buffer on the GPU and will require more work from the driver. Probably MapBufferRange with MAP_INVALIDATE_RANGE_BIT does something similar internally.

mfort
08-05-2009, 03:45 AM
You are asking for something like ext. SGIX_async
I doubt ARB will go this way.

GeLeTo
08-05-2009, 03:56 AM
You are asking for something like ext. SGIX_async
"SGIX_async provides a way to allow certain OpenGL commands to complete out-of-order with respect to others. This extension does not by itself enable asynchrony;"


I doubt ARB will go this way.
Why not? This approach fixes some much-debated gripes with updating buffer objects, is easy to use, and fits very nicely with the ARB_sync API.

Whenever a new approach to updating buffer objects is evaluated, one has to consider the following:
- does this approach have sync issues
- does it cause the data to be copied and/or allocated more than once
- is it easy to use
- does it fit in the current OpenGL API
- are there use cases where it will be helpful

My proposal fares very well on all these criteria.

Heiko
08-05-2009, 05:10 AM
I've never used MapBufferRange so far, but I think you might be able to do what you want using MapBufferRange. I've been reading the MapBufferRange specification, but I'm unsure about the following example:


/* Map the entire buffer for write - unsynchronized.
* GL will not block for prior operations to complete. Application must
* use other synchronization techniques to ensure correct operation.
*/
void *ptr = glMapBufferRange( GL_ARRAY_BUFFER_ARB, 0, size, MAP_WRITE_BIT | MAP_UNSYNCHRONIZED_BIT);

In what way can the application be sure that the data is synchronized? Does this just mean that the application has to make sure that writing data to the buffer has finished? Or is there some GL command that causes the mapped buffer range to synchronize again?

In both cases it would be possible to create a wrapper around buffers with smart behaviour for deleting the CPU-side data once the buffer is synchronized: if it is just a matter of making sure the data has been written to the buffer, the application can handle this itself; if synchronization is achieved by some GL command, a fence could be placed after that command, which lets you know when synchronization has completed.

I also wonder: using MapBufferRange and the MAP_UNSYNCHRONIZED_BIT, is it possible to write data to the buffer from another cpu thread? Or will this cause problems?

GeLeTo
08-05-2009, 06:17 AM
I've never used MapBufferRange so far, but I think you might be able to do what you want using MapBufferRange.

/* Map the entire buffer for write - unsynchronized.
* GL will not block for prior operations to complete. Application must
* use other synchronization techniques to ensure correct operation.
*/
I have no idea what these "other synchronization techniques to ensure correct operation" could be, apart from glFlush. If you already know that the buffer is synced and safe to modify - you can just Map it - I doubt MAP_UNSYNCHRONIZED_BIT will make much difference if a sync is not needed anyway.

The way to use MapBufferRange for updating a range of the buffer is to first invalidate the range (MAP_INVALIDATE_RANGE_BIT) and then write the new data into it. This way, if the buffer cannot be mapped immediately, the driver can allocate another buffer in graphics memory, store the changed data there, and copy it to the final buffer when it is available. This still requires an extra copy.

Jean-Francois Roy
08-05-2009, 10:34 AM
I've never used MapBufferRange so far, but I think you might be able to do what you want using MapBufferRange.

/* Map the entire buffer for write - unsynchronized.
* GL will not block for prior operations to complete. Application must
* use other synchronization techniques to ensure correct operation.
*/
I have no idea what these "other synchronization techniques to ensure correct operation" could be, apart from glFlush. If you already know that the buffer is synced and safe to modify - you can just Map it - I doubt MAP_UNSYNCHRONIZED_BIT will make much difference if a sync is not needed anyway.

The way to use MapBufferRange for updating a range of the buffer is to first invalidate the range (MAP_INVALIDATE_RANGE_BIT) and then write the new data into it. This way, if the buffer cannot be mapped immediately, the driver can allocate another buffer in graphics memory, store the changed data there, and copy it to the final buffer when it is available. This still requires an extra copy.

The "other synchronization techniques" are fences as provided by APPLE_fence, ARB_sync or OpenGL 3.2. When MAP_UNSYNCHRONIZED_BIT is set, the driver will not wait until the command queue drains of any command referencing the buffer object and instead allow you to map it immediately. It then becomes the responsibility of the application to keep track of which ranges in the buffer object may be currently used by the GPU. Of course, to use this option effectively, you also need to be able to flush sub-ranges of the buffer object.

MAP_INVALIDATE_RANGE_BIT is a hint to the driver that it may create a new allocation or at least avoid copying any data from the on-GPU buffer object and just hand the application a pointer. And so yes, it may also allow the driver to return to the application even if the buffer object is still in use.

Note that these options are not mutually exclusive. You may for instance want to map a buffer object range as read-write and only update a sub-range of the mapping. MAP_UNSYNCHRONIZED_BIT will let you map the range without blocking while preserving the content of the range, which MAP_INVALIDATE_RANGE_BIT will not allow.

You can also use MAP_INVALIDATE_BUFFER_BIT which will behave like a BufferData(..., NULL). Drivers may have optimizations for this case, such as buffer object double-buffering, and return quickly from a subsequent MapBuffer command.
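As an illustration of the bookkeeping that MAP_UNSYNCHRONIZED_BIT pushes onto the application, here is a CPU-only sketch (plain C, no GL) of the usual streaming pattern: a ring of buffer regions, each guarded by a per-region fence recorded after the draw that last read it. The fences are modeled as plain frame numbers; real code would store GLsync objects and test them with glClientWaitSync. All names here are invented for the sketch.


#include <stdio.h>
#include <assert.h>

#define REGIONS 3

typedef struct {
    int fence_frame[REGIONS];  /* frame whose draws last read each region */
    int head;                  /* next region to hand out */
} ring;

/* returns a region the "GPU" is done with (any frame <= completed_frame),
 * or -1 if the caller would have to wait (or grow the ring) */
static int ring_acquire(ring *r, int completed_frame) {
    int idx = r->head;
    if (r->fence_frame[idx] > completed_frame)
        return -1;                       /* region still in flight */
    r->head = (idx + 1) % REGIONS;
    return idx;
}

static void ring_submit(ring *r, int idx, int frame) {
    r->fence_frame[idx] = frame;         /* "insert a fence" after the draw */
}

int main(void) {
    ring r = { { -1, -1, -1 }, 0 };

    /* frames 0..2 each grab a region without blocking */
    for (int f = 0; f < 3; f++) {
        int idx = ring_acquire(&r, /*completed_frame=*/-1);
        assert(idx == f);
        ring_submit(&r, idx, f);
    }
    /* frame 3 wraps to region 0: only safe once frame 0's draws completed */
    assert(ring_acquire(&r, -1) == -1);  /* nothing completed yet: must wait */
    assert(ring_acquire(&r, 0) == 0);    /* frame 0 done: region 0 reusable */
    printf("ok\n");
    return 0;
}


The -1 return is exactly the point where an application would either call glClientWaitSync on the region's fence or fall back to a bigger ring.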

Alfonse Reinheart
08-05-2009, 11:56 AM
Using MAP_UNSYNCHRONIZED_BIT means, "I am taking full responsibility for the results of my actions. I want a pointer to this buffer. Right now."

You can use ARB_sync, NV_fence, or other similar mechanisms to do the synchronization guarantee yourself.

Rob Barris
08-05-2009, 04:22 PM
GeLeTo, can you break down in some more detail what the typical data flow is like - i.e. the amount of data being changed per draw call, and the overlap between old and new data?

For example, "I have to submit about 20-30KB of new vertex data per draw, replacing half of the previously drawn content." or... "I have to push 2.1MB of verts every frame and the new verts are unrelated to the old ones."

Knowing the data flow a little better might yield some clearer ideas.

IMO, the name of the game is always to make sure the GPU has its next block of work already waiting for it when it finishes the one it is working on. To some extent there is a space/time tradeoff here, since new work usually occupies space, as does the old work the GPU hasn't finished - so arriving at an ideal world where storage needs are minimal and performance is maximal is difficult.

GeLeTo
08-06-2009, 03:05 AM
IMO, the name of the game is always to make sure the GPU has its next block of work already waiting for it when it finishes the one it is working on. To some extent there is a space/time tradeoff here, since new work usually occupies space, as does the old work the GPU hasn't finished - so arriving at an ideal world where storage needs are minimal and performance is maximal is difficult.
There are usually two copies of the data - one in system memory, one in GPU memory. The best time to copy the data from system to GPU memory is right after the GPU has finished using the buffer. And only the driver knows when this is.


GeLeTo, can you break down in some more detail what the typical data flow is like
It's a modelling application. On higher subdivisions a single polygon may have >2000 triangles. Different polygons do not share vertices. The most common case is to have meshes with several thousands polygons (each rectangular polygon consisting of between 32 and 2048 triangles) and only a few of them are changed. Sometimes the same model can be drawn in more than one view.

A few ramblings about this:

1. AFAIK graphics cards do not like very big meshes. Maybe this has changed with newer cards? So it may be a good idea to break the mesh into smaller chunks and draw them separately. But how small? This would also help with partial updates.

2. If I use MapBuffer I can DMA-copy the computed verts directly rather than storing them in system memory first. This can be done in the last stage of the polygon tessellation, where the SSE SoA data (a separate array for each x, y, z, .. component) is gathered into more GPU-friendly AoS xyz.. structs and stored. But it's a bit more complicated than that - the tessellation is done by many threads, each worker thread being given several polygons at a time. I have two choices:
- First map the buffer and then let each thread write directly to it; but the writes will not be quite linear, and the buffer may stay mapped for a longer time because tessellation is much slower than a plain copy.
- First generate the SoA data for the whole mesh, and after that map the buffer and copy-gather it from a single thread. This will thrash the cache.
And if I split the mesh into chunks I can give each chunk to a different thread, though it won't be as fine-grained and optimal as the current approach (e.g. you can have 5 chunks and 4 threads). Just thinking aloud.
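The chunking idea can be sketched without any GL at all: split the polygons into fixed-size chunks and let each worker write its disjoint slice of the (notionally mapped) output array. The chunk size, thread count, and verts-per-polygon below are arbitrary numbers for the sketch, not anything from the application above.


#include <pthread.h>
#include <stdio.h>
#include <assert.h>

#define NPOLY 10
#define NTHREADS 4
#define VERTS_PER_POLY 6

static float out[NPOLY * VERTS_PER_POLY]; /* stands in for the mapped buffer */

typedef struct { int first, count; } chunk;

/* each worker "tessellates" its polygons into a disjoint slice of out[] */
static void *tessellate(void *arg) {
    chunk *c = arg;
    for (int p = c->first; p < c->first + c->count; p++)
        for (int v = 0; v < VERTS_PER_POLY; v++)
            out[p * VERTS_PER_POLY + v] = (float)p;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    chunk c[NTHREADS];
    int per = (NPOLY + NTHREADS - 1) / NTHREADS; /* ceil(10/4) = 3 per chunk */

    for (int i = 0; i < NTHREADS; i++) {
        c[i].first = i * per;
        c[i].count = (c[i].first + per <= NPOLY) ? per : NPOLY - c[i].first;
        pthread_create(&t[i], NULL, tessellate, &c[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    for (int p = 0; p < NPOLY; p++)
        assert(out[p * VERTS_PER_POLY] == (float)p);
    printf("all %d polygons written\n", NPOLY);
    return 0;
}


Because every chunk owns a disjoint slice, no locking is needed; the uneven last chunk (1 polygon here) shows the granularity loss GeLeTo mentions.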

Xmas
08-07-2009, 04:58 AM
This may work. It will still use another buffer on the GPU and will require more work from the driver. Probably MapBufferRange with MAP_INVALIDATE_RANGE_BIT does something similar internally.
CopyBufferSubData is basically the equivalent of PBOs for texture uploads. I'm not sure why you think it requires more work from the driver, or why it would use another buffer on the GPU.

To be clear, what you want is:
- allocate a buffer in memory
- fill the buffer
- pass a pointer to the buffer to OpenGL for transfer into a buffer object
- let GL return immediately but use a sync object to signal completion
- do something else with the buffer after GL has finished reading it, but don't touch it before.

However there's no reason why you couldn't have the GL driver allocate the memory. And that's what buffer objects are: driver-allocated memory. Thus by creating a buffer object, mapping and filling it, then using CopyBufferSubData, you are performing the steps you want, without the need for explicit synchronisation or the possibility of errors due to the application prematurely deallocating the buffer or modifying the buffer contents.

GeLeTo
08-07-2009, 07:44 AM
...Thus by creating a buffer object, mapping and filling it, then using CopyBufferSubData, you are performing the steps you want
The creation of the temporary buffer and the subsequent second copy (system mem to temp buffer to final buffer) are not no-ops.

MapBufferRange with MAP_INVALIDATE_RANGE_BIT is very similar. The driver will basically do the same thing as your approach (create another temporary buffer and copy from it later), but ONLY if the buffer cannot be mapped immediately. So it has an advantage: the temporary buffer may not have to be created.
But it has a disadvantage - when many ranges are used (this happens in my use case) it may require the creation of a temporary buffer for each range. With CopyBufferSubData I can pack all ranges in the same buffer.
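The packing idea - one staging buffer holding all dirty ranges back-to-back, replayed as per-range copies - can be sketched CPU-side. Each memcpy below corresponds to what one CopyBufferSubData call would be asked to do; the sizes and offsets are made up for the sketch.


#include <string.h>
#include <stdio.h>
#include <assert.h>

/* one queued update: where the bytes sit in the staging buffer,
 * and where they land in the destination buffer */
typedef struct { size_t src_off, dst_off, len; } range_copy;

int main(void) {
    unsigned char dest[64] = {0};  /* stands in for the big vertex buffer */
    unsigned char staging[16];     /* ONE shared temp buffer for all ranges */

    /* pack two dirty ranges back-to-back into the staging buffer */
    memset(staging + 0, 0xAA, 8);  /* update destined for dest[8..15]  */
    memset(staging + 8, 0xBB, 8);  /* update destined for dest[40..47] */
    range_copy q[] = { {0, 8, 8}, {8, 40, 8} };

    /* replay: each entry is one CopyBufferSubData-style scatter */
    for (size_t i = 0; i < sizeof q / sizeof q[0]; i++)
        memcpy(dest + q[i].dst_off, staging + q[i].src_off, q[i].len);

    assert(dest[8] == 0xAA && dest[47] == 0xBB && dest[0] == 0);
    printf("ok\n");
    return 0;
}


The point of the sketch is only the bookkeeping: N dirty ranges need one staging allocation, not N, which is the advantage over per-range MAP_INVALIDATE_RANGE_BIT mappings.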

Retained data copy also has disadvantages. For instance, to avoid a sync when the data has to be changed but the sync object is not yet signaled - you may have to use another location for the changed data. I still think it is the best approach for my use case.

Some may argue that with MapBuffer you don't need a copy in system memory at all - but this depends on the use case.

Xmas
08-07-2009, 09:55 AM
The creation of the temporary buffer and the subsequent second copying (system mem to temp buffer to final buffer) are not no-ops.
malloc or new aren't no-ops either. You don't need a second copy if you directly use the pointer returned by MapBuffer and no client memory at all.

GeLeTo
08-07-2009, 11:16 AM
You don't need a second copy if you directly use the pointer returned by MapBuffer and no client memory at all.
Yes, as I've mentioned above. But unfortunately it's not always possible. For instance I have a case where I need the generated vertex data to calculate the normals. Also I have several threads writing the vertex data in parallel and this will probably hurt the performance of the DMA transfers.
And you probably don't want to keep the buffer mapped for a long time (generating the data on the run can be much slower than just copying it). I am not sure about the last one - does having a mapped buffer stop or slow the other driver DMA transfers while you are copying data to it?

Dark Photon
08-10-2009, 07:29 PM
If nothing else, this thread certainly highlights the utter and complete confusion out there for how you use the many buffer APIs and flags to get best upload performance on various GL hardware and drivers.

Vendors, how about beating your heads together with the ARB and publishing a whitepaper on the way we should be tickling your drivers for best buffer upload perf. Clearly state, "don't do X or Y, but do Z instead, ...except on Tuesday, and only when it's raining, then do W!".

Personally we've gotten better perf using BufferData( NULL ) + BufferSubData + TexSubImage back when we last looked at it than with BufferData( NULL ) + MapBuffer, but that's before some of the latest extensions and with no multi-buffer ping-pong, 1-N frame delays before latching, fences, fizbim, phase-of-moon testing, etcetcetc.

Vendors, just tell us what buffer API usage you want us to use on your hardware which you are hyper-optimizing for, please! Before we all go nuts or incorrectly write your hardware off as just slow.

If only one of you does it, your approach becomes the de facto OpenGL standard method (tm). :p

Rob Barris
08-10-2009, 07:48 PM
I understand the motivation for the post, but you can probably also see that there are likely as many different access/modify/draw patterns as there are applications.

There are not that many buffer APIs. There is VBO, and there are copying and non-copying APIs for putting data into buffers. (The copying ones being BufferData and BufferSubData, the non-copying one being MapBufferRange.)

In general ISTM that the best way to avoid blocking is to think ahead of time about how to structure my data deliveries so that the GPU never wants to access the thing I am modifying - a situation which usually leads to one side waiting, or the other side getting the wrong data at the wrong time (*).

As alluded to above, in the best case you have a stream of commands in flight that will allow the GPU to complete work on one buffer and switch over to a different buffer without skipping a beat - but that this requires more storage. Not all that different from ping-ponging in DMA sound hardware - you try to have buffer N+1 filled up and ready for consumption well before buffer N is consumed - the hardware is really only interested in the current buffer but it can't hop smoothly to the next chunk of work until that buffer is ready.

If that kind of storage investment is too high, then you can trade space for time again, but then you may go slower.

With the CopyBufferSubData API, you might be able to set up a cascade: the GPU is consuming data from a finished buffer/mesh while you have a separate buffer mapped and are writing new sections of data into it - followed by a series of CopyBufferSubData calls which complete in order to make the final delivery of the updates into the destination buffer.

On the other hand, if analysis shows you that you are changing 75% or more of the vertices in the buffer per draw, then just orphan it (glBufferData(NULL)) and re-fill it, the GPU can keep drawing out of the nameless orphan while you fill up the next one.
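The orphaning behaviour can be modeled CPU-side as plain double-buffering: a hypothetical orphan_and_map() below stands in for glBufferData(..., NULL, ...) followed by a map, handing the app a fresh allocation while the old one stays intact for in-flight draws. This is a sketch of one plausible driver strategy, not what any particular driver is documented to do.


#include <stdio.h>
#include <assert.h>

#define BUF_SIZE 8

/* two backing stores: the "orphaned" one the GPU keeps drawing from,
 * and the fresh one the app refills */
static int storage[2][BUF_SIZE];
static int current = 0;            /* index the app is currently filling */

static int *orphan_and_map(void) { /* hypothetical BufferData(NULL) + Map */
    current ^= 1;                  /* hand the app the other allocation */
    return storage[current];
}

int main(void) {
    /* frame N: fill and "draw" buffer 0 */
    int *old = storage[current];
    for (int i = 0; i < BUF_SIZE; i++) old[i] = i;

    /* frame N+1: orphan; the old store stays intact for in-flight draws */
    int *fresh = orphan_and_map();
    for (int i = 0; i < BUF_SIZE; i++) fresh[i] = 100 + i;

    assert(old != fresh);          /* app writes never touch the old data */
    assert(old[3] == 3 && fresh[3] == 103);
    printf("old=%d new=%d\n", old[0], fresh[0]);
    return 0;
}


The price Rob describes is visible in the sketch: twice the storage, in exchange for never making either side wait on the other.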

Alfonse Reinheart
08-11-2009, 12:10 AM
I understand the motivation for the post, but you can probably also see that there are likely as many different access/modify/draw patterns as there are applications.

I believe you missed the point of Dark Photon's post. He wants driver developers to inform users as to what the optimal usage patterns are, so that they can code their applications properly.

There are substantive questions about ill-defined portions of the specification. The specific meaning of STREAM vs. DYNAMIC vs. STATIC, for example: how much respecifying of vertex data makes something count as STREAM instead of DYNAMIC? Because the specification does not explain what the implementation does with these hints, it is otherwise impossible to know which one to use for your scenario.

There is also the issue of mapping the buffer vs. using BufferData(NULL) and BufferSubData. There is no guidance on what the correct way to do these things is.

These are all things that have an overall effect on performance. But there is little guidance on the proper way to stream vertices. There is some lore floating around, but nothing concrete.

Ilian Dinev
08-11-2009, 04:57 AM
Hmm, I always believed STREAM doesn't keep a copy in RAM after use (you know, for restoring state on GPU reset, e.g. on resolution change). DYNAMIC looks like it would keep data in RAM and allow DMA on first use, copying to VRAM after that.