Optimal Streaming Strategy

So far, my renderer has not handled streaming of vertex and index data. That’s fine for small and mid-sized scenes where nothing too special happens. In its current state, a mesh requests to be uploaded to a buffer and stays there until program termination. The maximum buffer size is of course flexible, and if not enough memory is left for a particular mesh, the buffer manager allocates a new vertex and index buffer and puts the mesh there. To ensure minimal state changes, especially for VAOs, the sorting priorities are chosen accordingly: stuff is sorted by VAO, then by shader, and then, if possible, by textures, unless that last step incurs more sorting cost than it’s worth. In general, meshes are rendered with base vertices and assembled to be multi-drawable. However, both single instances via DrawElementsBaseVertex and multiple instances are trivially possible as well.

Now, streaming is a whole different story. I was thinking about keeping static (or very infrequently evicted) data and very dynamic data in separate buffers, which would mirror the GL’s buffer usage scenarios (although performance-wise this is not necessarily advantageous): leave the static buffers alone most of the time and only fiddle with the dynamic ones. However, this is not the real problem. The real problem is fragmentation over time, since you can’t expect meshes to be evicted in reverse order of upload. This wastes a lot of memory in small chunks if you can’t fill the gaps anymore. First of all, I’d probably want to handle free memory sort of like Alexandrescu, using a list of chunks, each holding an offset and a chunk size, with the first chunk holding 0 and the buffer size (see the sketch after the list of options below). Then there’d be multiple possibilities for handling fragmentation, so if a certain threshold of fragmentation is surpassed I could:

a) reorganize the buffer, i.e. copy data around and make it contiguous again
b) create a new buffer and upload formerly evicted meshes to that new buffer while slowly draining the old one until it is below the threshold or empty
c) track the number of successful uploads over time and then see if fragmentation is too high (or if the buffer is simply full), and do either a) or b)
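
Here’s roughly what I have in mind for the free list - a minimal sketch with illustrative names, covering only first-fit allocation (freeing would insert a chunk back into the sorted list and coalesce it with its neighbors):

#include <stdlib.h>

/* One free region inside a buffer; the list is kept sorted by offset and
   starts out as a single chunk covering the whole buffer (offset 0,
   size == buffer size). GLintptr/GLsizeiptr come from the GL headers. */
typedef struct FreeChunk {
    GLintptr   offset;
    GLsizeiptr size;
    struct FreeChunk *next;
} FreeChunk;

/* First-fit allocation: returns the offset of a free region of 'size'
   bytes, or -1 if the buffer is full or too fragmented - at which point
   options a) to c) above would kick in. */
GLintptr chunkAlloc(FreeChunk **head, GLsizeiptr size)
{
    for (FreeChunk **link = head; *link; link = &(*link)->next) {
        FreeChunk *c = *link;
        if (c->size >= size) {
            GLintptr offset = c->offset;
            c->offset += size;
            c->size   -= size;
            if (c->size == 0) {   /* chunk fully consumed: unlink it */
                *link = c->next;
                free(c);
            }
            return offset;
        }
    }
    return -1;
}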

Granted, I may be overreaching here a little bit. Please keep in mind, this is not meant for a simple demo or something, but as a streaming system for a heavy-duty renderer. What’s your experience with this? Any suggestions?

You’re hitting the same issues that puzzled me for a while. The efficiency watchdog in both of us would really like to keep the data on the GPU as long as possible and reuse it to the very last possible frame. But for performance you don’t really want to deal with full-up garbage collection, block coalescing, etc. In my experience it’s much more important to ensure your GPU streaming is as efficient as possible; then the reuse issue isn’t as critical. In the end I decided on an approach with a big streaming VBO which is easily big enough to hold multiple frames of streaming VBO data. Reuse data for as many frames as possible, but when it fills up, just orphan it and start filling the next page. That’ll generate a little bandwidth hiccup “refilling the cache”, but it doesn’t happen often, and typically it won’t be a full frame of data. If needed, that can be mitigated somewhat by keeping ping-pong streaming buffers: when you need to fill up the new buffer, copy data over from the old one if it holds a cached copy.

Recommended reading:

Thank you very much! Very informative reply by Rob Barris there.

Regarding your multi-frame VBO, do you orphan via glMapBufferRange and do an async invalidate? How big is big in your case and how much space do you allocate per frame? I assume you’re doing it ring-buffer style? In your approach, you accept throwing out static data as well, right? Do you partition completely static and dynamic data into separate buffers?

Please excuse the question storm. :wink:

I orphan using glMapNamedBufferRangeEXT with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT. Normal (non-orphaning) map is done with GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT.
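
Spelled out as calls, that’s roughly the following (a sketch; vbo, bufferSize, writeOffset and writeSize are placeholders for whatever your buffer manager tracks):

/* Orphaning map: request the whole buffer and tell the driver the
   entire previous contents can be thrown away. */
void *orphanMap(GLuint vbo, GLsizeiptr bufferSize)
{
    return glMapNamedBufferRangeEXT(vbo, 0, bufferSize,
        GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
}

/* Normal (non-orphaning) map of the next unused range: no driver-side
   synchronization, and no need to preserve the range's old contents. */
void *streamMap(GLuint vbo, GLintptr writeOffset, GLsizeiptr writeSize)
{
    return glMapNamedBufferRangeEXT(vbo, writeOffset, writeSize,
        GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT |
        GL_MAP_INVALIDATE_RANGE_BIT);
}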

How big is big in your case and how much space do you allocate per frame? I assume you’re doing it ring-buffer style?

As to size, it’s just a function of the dataset. 4-8MB or so. And yes, ring buffer.

In your approach, you accept throwing out static data as well, right? Do you partition completely static and dynamic data into separate buffers?

Static (as in known-at-startup-and-never-unloaded) VBO contents go into their own static VBOs. If you’re not using NV bindless, flipping the draw source between a bunch of different VBOs can get expensive, so you’d want to do some smart packing. But if you’re using bindless, you just don’t care. It’s all stuff on the GPU addressed with the same 64-bit handle space.
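
For reference, a rough sketch of what that bindless path looks like (NV_shader_buffer_load plus NV_vertex_buffer_unified_memory; staticVbo and vboSize are placeholders):

GLuint64EXT addr;

/* Pin the buffer in GPU memory and fetch its 64-bit GPU address. */
glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);

/* With unified vertex attribs enabled, switching the draw source is just
   pointing an attrib slot at a different GPU address - no rebinding. */
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, vboSize);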

Streamed VBO content (as in loaded and unloaded dynamically) goes to the streaming VBO.

The advantage of not blasting the static stuff into the streaming VBO as well is that it doesn’t artificially drive up the size of your streaming VBO, nor does it need to be reloaded when you invalidate the cache (i.e. orphan the streaming VBO). You load it once and leave it.

I didn’t actually expect you to orphan the whole buffer. This isn’t done regularly, right? For the sake of rotating through the buffer, you don’t need to invalidate it completely, so at what times do you invalidate the whole thing?

Isn’t that technically orphaning as well? Am I missing something about invalidation?

Thanks for your patience. :slight_smile:

That’s correct.

This one isn’t (unless you use sync objects to make sure you don’t overwrite currently used data).

Let me explain.

If you pass the unsynchronized bit, it means you don’t care whether you potentially overwrite data that the GPU might currently be using. I know that you also use the invalidate bit, but that doesn’t guarantee you’ll get a new piece of memory; it just allows the driver to potentially give you a new piece of memory. If it doesn’t, you’ll corrupt data currently in use by the GPU, unless you use sync objects or some other form of manual synchronization to ensure that you don’t overwrite data used by the GPU.

I’ve seen this lately in so many applications that I kind of feel that somebody is spreading this incorrect usage pattern.

There are practically two simple ways to stream data (both options assume here a fixed chunk size, i.e. upload granularity):

Option #1:

if (chunkIndex * size + size > bufferSize)
    chunkIndex = 0;  /* wrap around to the start of the buffer */
pointer = glMapBufferRange(GL_ARRAY_BUFFER, chunkIndex * size, size,
              GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
// fill data
...
glUnmapBuffer(GL_ARRAY_BUFFER);
chunkIndex++;

This one suffers from the problem that it relies on the assumption that the driver does renaming/orphaning internally and will give you a new piece of memory, which might not always be the case; thus you might get wildly varying performance across implementations.

Option #2:

if (chunkIndex * size + size > bufferSize)
    chunkIndex = 0;  /* wrap around to the start of the buffer */
/* Wait until the GPU has finished with the draws that last read this chunk. */
if (syncObject[chunkIndex] != NULL &&
    glClientWaitSync(syncObject[chunkIndex], 0, 0) != GL_ALREADY_SIGNALED)
{
    while (glClientWaitSync(syncObject[chunkIndex],
               GL_SYNC_FLUSH_COMMANDS_BIT, largeTimeout) == GL_TIMEOUT_EXPIRED);
}
pointer = glMapBufferRange(GL_ARRAY_BUFFER, chunkIndex * size, size,
              GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
// fill data
...
glUnmapBuffer(GL_ARRAY_BUFFER);
// issue the draw calls that source this chunk
...
/* Fence the chunk just consumed so the wait above works after wrap-around. */
glDeleteSync(syncObject[chunkIndex]);
syncObject[chunkIndex] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
chunkIndex++;

This one should perform consistently and efficiently across GL implementations. The only thing you have to make sure of is that your buffer is sufficiently large that you rarely (or never) hit the case where the tested sync object is not already signaled.

This one isn’t (unless you use sync objects to make sure you don’t overwrite currently used data).

Are you sure about that?

I only fill stream VBOs front-to-back, and with this technique I never overwrite a range of the buffer that I’ve given to the GPU to use for any batch (until after orphaning). So it should be safe without any CPU/GPU handshaking (sync/fence business).

For more details on this technique, read the two posts starting here, especially the second one by Rob:

Fairly rarely. The stream VBO is large enough to handle one frame of data plus a fair amount of extra, so orphaning doesn’t happen often.

Re the previous discussion on UNSYNCHRONIZED, the advantage to this fill/orphan/fill/orphan technique is there’s no need for any new “CPU/GPU waiting” (fence/sync queries, etc.)

…at what times do you invalidate the whole thing? Isn’t that technically orphaning as well? Am I missing something about invalidation?

It’s subtle – there are two invalidate flags. INVALIDATE_BUFFER = orphan (i.e. tear off that buffer page and give me a fresh new one; à la glBufferData NULL). INVALIDATE_RANGE says I’m gonna replace everything in this range, so the driver doesn’t need to pre-fill it with valid buffer data for the current orphan. Different concepts.

UNSYNCHRONIZED says I promise not to mess with anything in this buffer that I’ve already given the GPU to process. Just gimme some memory to write on as fast as possible! This lets the driver potentially give you the same block of memory back again and again (if it desires) for efficiency. Potential write coalescing.

I strongly recommend reading everything in this thread beginning at post 23 or so:

Well, if you “orphan” after getting to the end of the buffer with glBufferData, then it’s fine, as glBufferData must give you new storage. But if you were to orphan at the end with GL_MAP_INVALIDATE_*_BIT or with glBufferSubData, it could result in corruption.
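
For reference, the glBufferData-style orphan being discussed looks like this (a sketch; GL_ARRAY_BUFFER and GL_STREAM_DRAW are just example parameters, and the size and usage hint must match the original allocation):

/* Respecify the data store with NULL: in-flight draws keep reading the
   old storage, while the driver may detach ("rename") it and hand
   subsequent writes a fresh page. */
glBindBuffer(GL_ARRAY_BUFFER, streamVbo);
glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_STREAM_DRAW);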

INVALIDATE_RANGE says I’m gonna replace everything in this range, so the driver doesn’t need to pre-fill it with valid buffer data for the current orphan.

Why would the driver “pre-fill” any of the buffer? If you’re mapping for writing, you aren’t supposed to read it. So it doesn’t matter what’s there.

Well, if you “orphan” after getting to the end of the buffer with glBufferData, then it’s fine, as glBufferData must give you new storage. But if you were to orphan at the end with GL_MAP_INVALIDATE_*_BIT or with glBufferSubData, it could result in corruption.

Technically, glBufferData doesn’t have to give you anything new if you use NULL and the same size/hint. It’s perfectly reasonable for it not to allocate any new storage.

As for the rest, I would say that invalidating the range is completely unnecessary if you’ve never written to that storage before. If the usage pattern is:

  1. Invalidate buffer.
  2. Map some range. Write data.
  3. Map some range after the previous map. Write data.
  4. Repeat 3 until we’re out of space.
  5. Goto 1.

Then those “map some range” steps don’t need an INVALIDATE_RANGE_BIT at all. The only invalidation you need is for the buffer as a whole, so that you can get new storage (or just to let GL know that you’re not using it anymore). All pieces of an invalidated buffer are by definition no longer in use.
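
A minimal sketch of that pattern (hypothetical names, no error handling; note the absence of INVALIDATE_RANGE_BIT on the per-chunk maps):

#include <string.h>

GLuint     streamVbo;   /* the streaming buffer object    */
GLsizeiptr streamSize;  /* its total size                 */
GLintptr   streamHead;  /* first byte not yet handed out  */

/* Append 'bytes' of transient vertex data; returns its buffer offset. */
GLintptr streamAppend(const void *data, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, streamVbo);
    if (streamHead + bytes > streamSize) {
        /* Step 1/5: out of space, so orphan the whole buffer. Afterwards
           no part of the new storage is in use by the GPU. */
        glBufferData(GL_ARRAY_BUFFER, streamSize, NULL, GL_STREAM_DRAW);
        streamHead = 0;
    }
    /* Steps 2-4: unsynchronized map of the next unused range; safe
       because nothing past streamHead has been given to the GPU. */
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, streamHead, bytes,
                    GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(ptr, data, (size_t)bytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    GLintptr offset = streamHead;
    streamHead += bytes;
    return offset;
}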

INVALIDATE_RANGE_BIT is mainly so that GL can check to see if the range is already in use; if it is, then it could give you some spare memory to write to, and it’ll upload it later. That seems to clash with UNSYNCHRONIZED, since the whole point of that is to say, “Don’t bother checking; just give me the memory.” Also, we know the next range isn’t in use, because we invalidated the entire buffer already.

Oh man. I was thinking: “Why would he need to make sure that everything is processed before explicitly syncing with SwapBuffers()?” What I didn’t see, although it’s fairly obvious, is that you have to orphan the buffer when it fills up, which can happen mid-frame, when some portions of it may not have been processed yet as you try to write new stuff. That’s why there’s syncing (i.e. the absence of the UNSYNCHRONIZED bit) when you orphan the whole buffer. Damn…

However, I have to agree with Alfonse that the async invalidation is a little contradictory. If you absolutely know that a well-defined range of the buffer will not be in use when mapping, why the need to invalidate? Why not simply do an unsynchronized write mapping and overwrite the contents of said range with new stuff that fits into it? If you can’t be sure, you need to synchronize anyway if you don’t want to risk corruption.

I think you’re heading towards scenarios where you don’t want to keep filling the buffer with new stuff if there’s some place where clearly unused data resides - so you want to swap portions instead of risking an orphan when it’s not absolutely necessary. But how can you be positive about definitely unused ranges if you don’t track whether data in a buffer has actually gone unused for some time? This implies three possible solutions:

  1. go Alfonse’s way and can the whole thing, thus syncing by orphaning the complete buffer - this means data that was in use before and will be in use afterwards has to be re-uploaded to the buffer
  2. go aqnuep’s way and invalidate ranges synchronously, thus avoiding corruption of data still in flight - this may force the application to wait, which is only acceptable if you’ve got other stuff to do in the meantime, just like with occlusion queries
  3. employ an eviction strategy akin to common CPU caching and asynchronously replace unused ranges, orphaning only when really needed - this uses additional CPU cycles and memory, because you have to remember which data hasn’t been used for how many frames. There could be other schemes, however.

Is that about it?

But only with GL_MAP_UNSYNCHRONIZED_BIT, right? Sorry for being obnoxious - just want to make sure I’m not missing something.

I think you’re misunderstanding the usage pattern here. It’s a pattern of dumping stuff to a buffer and rendering that stuff. Then dumping more stuff and rendering it. Nothing is being saved; all of the data is transient, one-use only. It’s for the kind of stuff you would have used immediate mode for back in the old days. GUI elements and such.

None of this is static, and nobody really cares where it lives in memory. There are no objects, no “ranges”, nada. Just sequences of vertex data. So you effectively build a ring buffer out of a buffer object.

Each thing you draw takes up a certain amount of buffer object space. You map that space, write your data, and render with it. You now no longer care about that data, so long as OpenGL eventually gets it. You do it again with some more data, so you just slide over to the next unused space. Eventually you run out of buffer object space for new data, so you start over at the beginning.

This “requires” buffer object orphaning, lest you make the driver actually check to see if that region of the buffer is in use. Because you don’t want to be unsync mapping some part of the buffer that’s still in use. By invalidating the buffer, you ensure that none of it is in use anymore.

As well as the excellent post by Rob Barris that explains it fairly thoroughly, here’s a blog post by Fabian Giesen with some tips on using write combining (which will likely be in play when filling the mapped buffer) efficiently: Write combining is not your friend | The ryg blog
Here’s the summary:

- If it’s a dynamic constant buffer, dynamic vertex buffer or dynamic texture and mapped “write-only”, it’s probably write-combined.
- Never read from write-combined memory.
- Try to keep writes sequential. This is good style even when it’s not strictly necessary. On processors with picky write-combining logic, you might also need to use volatile or some other way to cause the compiler not to reorder instructions.
- Don’t leave holes. Always write large, contiguous ranges.
- Check the rules for your target architecture. There might be additional alignment and access width limitations.
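
As a toy illustration of the “sequential, no holes, no reads” rules, assuming a hypothetical interleaved vertex layout:

#include <string.h>

typedef struct { float pos[3]; float uv[2]; } Vertex; /* example layout */

/* 'mapped' points into a write-only (likely write-combined) mapping.
   Good: one large, contiguous, front-to-back write. */
void fillGood(Vertex *mapped, const Vertex *src, size_t n)
{
    memcpy(mapped, src, n * sizeof(Vertex));
}

/* Bad: two passes over the same range leave holes behind the write
   pointer and revisit cache lines, defeating write combining. Reading
   'mapped' back would be even worse. */
void fillBad(Vertex *mapped, const Vertex *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        memcpy(mapped[i].pos, src[i].pos, sizeof mapped[i].pos);
    for (size_t i = 0; i < n; ++i)
        memcpy(mapped[i].uv, src[i].uv, sizeof mapped[i].uv);
}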

Then we obviously had different things in mind. Also, Dark Photon mentioned that he wanted to keep streamed data in a buffer as long as possible - I could have misunderstood his intentions though.

My concern still stands, though: you have to upload data again and again even though it doesn’t differ from the previous render call. Does that perform well?

You have to upload data again and again even though it doesn’t differ from the previous render call. Does that perform well?

Compared to what? Consistency of performance is often more important than just getting good performance sometimes and bad performance other times. This method makes performance consistent, regardless of whether data changes or not. So rather than getting a performance spike when data changes, you get solid, consistent performance all the time.

If you’re not making much use of that uploading bus for anything else, there’s really no reason why you can’t do this effectively. Yes, it depends on how much stuff you’re doing this for, but it’s rarely that much stuff.

Compared to partial eviction. Sure, it’s relative to problem size. I imagine something like a flight simulator where you travel at high speed over densely crowded areas with a lot of varying geometry. In that case, not taking advantage of whatever coherency you’ve got seems, in theory, less wise than trying to swap only parts of the buffer.

EDIT: But you’re right, for most applications such high frequency probably isn’t applicable and you can just draw your static stuff and the little dynamic rest like you suggest. I’d like to see both approaches in both scenarios though - just out of curiosity.

As a simple example to illustrate the point, suppose you map 20 bytes for write but only write 3 random bytes in this 20 byte range. True, you didn’t read it. But what’s the poor driver supposed to transfer to the GPU?

That said, I’m not a driver developer.

Close, but not exactly. One “write” only. Multiple reads. Reuse the copy you put there as long as possible (i.e. until the next orphan).

So I got it right. :smiley:

Maybe we should come up with a wiki article on this topic. Who wants to look up the “slow VBO” thread again and again?

I think there’s still some misunderstanding left here. With the write-once, read-many streaming VBO approach I’m talking about, there is tons of reuse frame-after-frame-after-frame due to coherency because you’re basically just re-blasting 99.5% of the data you’ve already uploaded to the streaming VBO in previous frames. The streaming VBO “is” the cache. This is especially advantageous if you can lock that VBO on the GPU and blast those batches with bindless. That’s as cheap as it gets!

Yes, having to orphan every so often and “re-seed the cache” is a disadvantage. But the alternative is a hole-filling/coalescing “garbage collection” scheme that adds complexity and is problematic for real-time performance. Or never reusing anything and re-uploading everything every time you use it. It’s up to you though.