Buffer Object Streaming

From OpenGL Wiki

Buffer Object Streaming is the process of updating buffer objects frequently with new data while using those buffers. Streaming works like this. You make modifications to a buffer object, then you perform an OpenGL operation that reads from the buffer. Then, after having called that OpenGL operation, you modify the buffer object with new data. Following this, you perform another OpenGL operation to read from the buffer.

Streaming is a modify/use cycle. There may be a swap buffers (or equivalent frame changing process) between one modify/use cycle and another, but not necessarily.

The problem

OpenGL puts in place all the guarantees to make this process work, but making it work fast is the real problem. The biggest danger in streaming, the one that causes the most problems, is implicit synchronization.

The OpenGL specification permits an implementation to delay the execution of drawing commands. This allows you to draw a lot of stuff, and then let OpenGL handle things on its own time. Because of this, it is entirely possible that, well after you call an operation that uses the buffer object, you start uploading new data to that buffer while the earlier operation is still pending. If this happens, the OpenGL specification requires that the thread halt until all drawing commands that could be affected by your update of the buffer object have completed.

This implicit synchronization is the primary enemy when streaming vertex data.

There are a number of strategies to solve this problem. Some implementations work better with certain ones than others. Each one has its benefits and drawbacks.

When using non-immutable buffers (those allocated with glBufferData), you should make sure that the buffer's usage hint is one of the STREAM hints, such as GL_STREAM_DRAW.

Explicit multiple buffering

This solution is fairly simple. You simply create two or more buffer objects of the same length. While you are using one buffer object, you can be modifying another. Depending on how much parallelism your implementation can provide, you may need more than two buffers to make this work.

The principal drawback to this solution is that it requires using a number of different buffer objects (separate buffer handles). So you'll need to change which buffers you're using for your GPU operations every frame.
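The handle rotation this requires can be sketched in plain C. This is only the bookkeeping side, assuming three buffers of equal size that were already created elsewhere (for example with glGenBuffers and glBufferData); `BufferRing`, `buffer_ring_current` and `buffer_ring_advance` are hypothetical names chosen for illustration, not part of OpenGL.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: three buffers in flight. The right count depends on how far
   ahead of the GPU your implementation lets the CPU run. */
#define STREAM_BUFFER_COUNT 3

typedef struct {
    unsigned int handles[STREAM_BUFFER_COUNT]; /* buffer names created elsewhere */
    size_t current;                            /* index of the buffer being written */
} BufferRing;

/* Handle of the buffer that may be modified right now. */
static unsigned int buffer_ring_current(const BufferRing *ring) {
    assert(ring->current < STREAM_BUFFER_COUNT);
    return ring->handles[ring->current];
}

/* Move on to the next buffer, wrapping around, and return its handle. */
static unsigned int buffer_ring_advance(BufferRing *ring) {
    ring->current = (ring->current + 1) % STREAM_BUFFER_COUNT;
    return buffer_ring_current(ring);
}
```

Each frame you would call `buffer_ring_advance`, bind the returned handle, upload the new data and issue the commands that read from it. Note that the rotation by itself does not prove the GPU has finished with the oldest buffer; it only makes a stall less likely.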

Buffer re-specification

This solution is to reallocate the buffer object before you start modifying it. This is termed buffer "orphaning". There are two ways to do it.

The first way is to call glBufferData with a NULL pointer, and the exact same size and usage hints it had before. This allows the implementation to simply reallocate storage for that buffer object under the hood. Since allocating storage is (likely) faster than the implicit synchronization, you gain significant performance advantages over synchronization. And since you passed NULL, if there wasn't a need for synchronization to begin with, this can be reduced to a no-op. The old storage will still be used by the OpenGL commands that have been sent previously. If you continue to use the same size over and over, it is likely that the GL driver will not be doing any allocation at all, but will just be pulling an old free block off the unused buffer queue and using it (though of course this isn't guaranteed), so it is likely to be very efficient.

You can do the same thing when using glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT. You can also use glInvalidateBufferData, where available.

All of these give the GL implementation the freedom to orphan the previous storage and allocate a new one, which is why this is called "orphaning".

Whenever you see any of these, think of it as a directive to OpenGL to 1) detach the old block of storage and 2) give you a new block of storage to work with, all behind the same buffer handle. The old block of storage will be put on a free list by OpenGL and reused once there can be no draw commands in the queue which might be referring to it (e.g. once all queued GL commands have finished executing).

Obviously, these methods detach the buffer storage from the client-accessible workspace, so they are only practical if there is no further need to read or update this specific block of storage from the GL client side. Unless you plan to use buffer updates in combination with this technique, it is best to update the whole buffer rather than parts of it, overwriting all of the data in that buffer each time.

One issue with this method is that it is implementation dependent. Just because an implementation has the freedom to do something does not mean that it will.

Buffer update

Buffer update is a form of streaming that you need to be very careful with. It is often used in combination with buffer re-specification to increase submission performance.

To implement buffer update, we call glMapBufferRange with the GL_MAP_UNSYNCHRONIZED_BIT. This tells OpenGL not to do any implicit synchronization at all. When you see this, think "OpenGL, please give me a buffer 'fast'. It's fine if you give me the same one for this buffer object that you did last time. I promise not to modify any portion of this buffer that might be in use by a GL command I've already submitted. Just trust me."

Though there is no synchronization, this does not mean that synchronization is unimportant. Indeed, you will get undefined results if you are modifying parts of the buffer that already-queued GL commands (such as draw commands) will read from on the GPU. Don't do that.

The basic use case for buffer updates is that you can progressively fill up a buffer object: map UNSYNCHRONIZED, write, unmap, issue a GL command using that buffer subregion, rinse/repeat. So long as your writes never overlap, you're safe and you don't need to think about "messing up the GPU's data" until you fill up that buffer. Once you fill it up, you can do one of two things to continue to avoid stomping on the GPU's buffer data: 1) orphan, or 2) synchronize. Orphaning is the preferred method, as avoiding synchronization usually yields higher performance (synchronization often involves waiting).

To orphan, just use the buffer re-specification technique (glBufferData(NULL), glMapBufferRange(GL_MAP_INVALIDATE_BUFFER_BIT), or glInvalidateBufferData). You then get a fresh block of storage underneath the buffer handle to scribble on that no other GL commands can be referring to, so no synchronization is needed.
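The fill-then-orphan cycle comes down to a small amount of offset bookkeeping, sketched below in plain C. `StreamCursor`, `StreamRange` and `stream_reserve` are hypothetical names; the three `GL_MAP_*` values are the ones published in the Khronos headers and are defined here only so the sketch compiles without a GL header. Alignment of the returned offsets is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

#ifndef GL_MAP_WRITE_BIT
#define GL_MAP_WRITE_BIT             0x0002
#endif
#ifndef GL_MAP_INVALIDATE_BUFFER_BIT
#define GL_MAP_INVALIDATE_BUFFER_BIT 0x0008
#endif
#ifndef GL_MAP_UNSYNCHRONIZED_BIT
#define GL_MAP_UNSYNCHRONIZED_BIT    0x0020
#endif

typedef struct {
    size_t capacity; /* size of the buffer object's storage, in bytes */
    size_t head;     /* first byte not handed out since the last orphan */
} StreamCursor;

typedef struct {
    size_t offset;       /* offset to map at */
    unsigned int access; /* access bits to map with */
} StreamRange;

/* Reserve `size` bytes for the next write. While there is room, the range
   is appended and mapped unsynchronized, so it never overlaps earlier
   writes. When the buffer is full, writing restarts at offset 0 and the
   map requests fresh storage (orphaning) instead. */
static StreamRange stream_reserve(StreamCursor *cursor, size_t size) {
    StreamRange range;
    assert(size > 0 && size <= cursor->capacity);
    if (size > cursor->capacity - cursor->head) {
        range.offset = 0;
        range.access = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT;
        cursor->head = size;
    } else {
        range.offset = cursor->head;
        range.access = GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT;
        cursor->head += size;
    }
    return range;
}
```

The returned `offset`, the requested size and `access` would be passed as the offset, length and access arguments of glMapBufferRange. Whether the invalidating map really hands back new storage is up to the implementation, as noted for re-specification above.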

Alternatively, to synchronize, use a sync object. If you put a fence after all of the commands that read from a buffer, you can check whether this fence has completed before mapping the buffer. If it has not, then you can wait to update the buffer, performing some other important task in the meantime. You can also use the fence to force synchronization if you have no other tasks to perform. Once the fence has completed, you can map the buffer freely, using the GL_MAP_UNSYNCHRONIZED_BIT just in case the implementation isn't aware that the buffer can be updated.

For more details on buffer streaming in general, see this thread. Pay particular attention to the posts by Rob Barris.

Persistent mapped streaming

Given the availability of OpenGL 4.4 or ARB_buffer_storage, the use of persistent mapping of buffers becomes a possibility.

The idea here is to allocate an immutable buffer 2-3x the size you need, so that while OpenGL is executing operations that read from one region of the buffer, you are writing to a different region. The difference from the prior mapping scheme is that you are not frequently mapping and unmapping the buffer. You map it persistently when you create the buffer, and keep it mapped until it's time to delete the buffer.

This requires using glBufferStorage with GL_MAP_WRITE_BIT and GL_MAP_PERSISTENT_BIT. It also requires using glMapBufferRange with those same bits when mapping it.

The general algorithm is as follows. The buffer is logically divided into 3 sections: the section you're writing to, and two sections that could currently be in use.

The first step is to write to section 1 of the buffer. Once you have finished writing, you must make this range of data visible to OpenGL; if you aren't mapping coherently, that means flushing the written range. Once the data is visible, you issue some number of Rendering Commands that read from that section of the buffer. After issuing all of the commands that read from the buffer, you create a fence sync object.

Next frame, you start writing to buffer section 2. You do all of the above, and create a new fence sync object. Keep each buffer section's sync objects separate.

You do the same with buffer section 3 on the next frame.

On the fourth frame, you want to start using section 1 again. However, you need to check section 1's sync object to see if it has completed before you can start. You can only start writing to a section if that section's sync object has completed.
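The section rotation described above can be modelled in plain C. In this sketch the `in_flight` flag stands in for "this section has a fence that has not yet signaled"; in a real program the fence would be created with glFenceSync and polled with glClientWaitSync. `SectionRing` and the `ring_*` functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SECTION_COUNT 3

typedef struct {
    size_t section_size;            /* bytes per section */
    size_t write_index;             /* section the CPU writes next */
    bool in_flight[SECTION_COUNT];  /* stand-in for an unsignaled fence */
} SectionRing;

/* Try to start writing the next section. Returns true and stores the
   section's byte offset when the section is free; returns false when its
   fence has not completed yet, in which case the caller must wait. */
static bool ring_begin_write(const SectionRing *ring, size_t *offset) {
    if (ring->in_flight[ring->write_index]) {
        return false;
    }
    *offset = ring->write_index * ring->section_size;
    return true;
}

/* Call after issuing every command that reads the section. This is the
   point where the real code would create the section's fence. */
static void ring_end_write(SectionRing *ring) {
    ring->in_flight[ring->write_index] = true;
    ring->write_index = (ring->write_index + 1) % SECTION_COUNT;
}

/* Call once a section's fence has been observed as signaled. */
static void ring_mark_complete(SectionRing *ring, size_t index) {
    assert(index < SECTION_COUNT);
    ring->in_flight[index] = false;
}
```

The returned offset is added to the persistently mapped pointer to find where to write. Keeping one flag (one fence) per section is what lets the fourth frame check section 1 alone, without waiting on work that only touches sections 2 and 3.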

Persistent visibility

Writing to a persistently mapped buffer does not automatically guarantee that OpenGL will see the written data. To ensure visibility, you must do one of the following.

  1. Map the buffer coherently with GL_MAP_COHERENT_BIT. This also requires allocating the buffer with GL_MAP_COHERENT_BIT. Coherently mapped buffers always ensure visibility to subsequent operations (this does not mean you get to write to something currently being read, however; you still need synchronization). While this might sound slow, there is some evidence that the performance cost is negligible, at least on some hardware.
  2. Map the buffer with GL_MAP_FLUSH_EXPLICIT_BIT and call glFlushMappedBufferRange on the written section of the buffer.
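The two options differ only in which bits are passed at allocation and at mapping time. The helpers below are a minimal sketch of that choice; `persistent_storage_flags` and `persistent_map_flags` are hypothetical names, the coherent bit is spelled `GL_MAP_COHERENT_BIT` as in the GL headers, and the enum values are copied from the Khronos headers only so the sketch compiles without a GL header.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef GL_MAP_WRITE_BIT
#define GL_MAP_WRITE_BIT          0x0002
#endif
#ifndef GL_MAP_FLUSH_EXPLICIT_BIT
#define GL_MAP_FLUSH_EXPLICIT_BIT 0x0010
#endif
#ifndef GL_MAP_PERSISTENT_BIT
#define GL_MAP_PERSISTENT_BIT     0x0040
#endif
#ifndef GL_MAP_COHERENT_BIT
#define GL_MAP_COHERENT_BIT       0x0080
#endif

/* Bits for the flags argument of glBufferStorage. */
static unsigned int persistent_storage_flags(bool coherent) {
    unsigned int flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT;
    if (coherent) {
        flags |= GL_MAP_COHERENT_BIT;
    }
    return flags;
}

/* Bits for the access argument of glMapBufferRange. A non-coherent mapping
   adds the explicit-flush bit, so each written range must afterwards be
   passed to glFlushMappedBufferRange. */
static unsigned int persistent_map_flags(bool coherent) {
    unsigned int flags = persistent_storage_flags(coherent);
    if (!coherent) {
        flags |= GL_MAP_FLUSH_EXPLICIT_BIT;
    }
    /* The explicit-flush bit is only valid together with the write bit. */
    assert((flags & GL_MAP_WRITE_BIT) != 0);
    return flags;
}
```

With `coherent` set to true no flush call is needed after writing; with false, the flush after each write is what makes the data visible. Neither choice removes the need for the fences described earlier.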

Streaming optimizations

glMapBufferRange has another flag you should know about: GL_MAP_INVALIDATE_RANGE_BIT. This is different from GL_MAP_INVALIDATE_BUFFER_BIT, which you've already been introduced to above.

According to Rob Barris, GL_MAP_INVALIDATE_RANGE_BIT in combination with the WRITE bit (but not the READ bit) basically tells the driver that the mapped range doesn't need to contain any valid buffer data, and that you promise to write the entire range you map. This lets the driver give you a pointer to scratch memory that hasn't been initialized, for instance driver-allocated write-through uncached memory. See this post for more details.