
Thread: Optimal Streaming Strategy

  1. #1
    Senior Member OpenGL Pro
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    1,128

    Optimal Streaming Strategy

So far, my renderer has not handled streaming of vertex and index data. This is OK for small and mid-sized scenes where nothing too special happens. In its current state, a mesh will request to be uploaded to a buffer and stay there until program termination. The maximum buffer size is of course flexible, and if not enough memory is left for a particular mesh, the buffer manager will allocate a new vertex and index buffer and put it there. To ensure minimal state changes, especially for VAOs, the sorting priorities are chosen accordingly: stuff is sorted by VAO, then by shader and then, if possible, by textures, unless that last step incurs more sorting cost than it's worth. In general, meshes are rendered with base vertices and assembled to be multi-drawable. However, rendering either single instances with DrawElementsBaseVertex or multiple instances is trivial as well.
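
    The priority order can be packed into a single radix-style sort key; a minimal sketch with made-up field widths (MakeSortKey and the ID widths are hypothetical, not the renderer's actual scheme):

    Code :
    #include <stdint.h>

    /* Hypothetical 64-bit sort key: VAO in the top bits, then shader, then
       texture, so an ordinary sort by key yields the priority order above. */
    uint64_t MakeSortKey(uint16_t vao, uint16_t shader, uint32_t texture)
    {
        return ((uint64_t)vao << 48) | ((uint64_t)shader << 32) | (uint64_t)texture;
    }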

Now, streaming is a whole different story. I was thinking about keeping static (or very infrequently evicted) data and very dynamic data in separate buffers, which would mirror the GL's buffer usage scenarios (although performance-wise this is not necessarily advantageous): leave the static buffers alone most of the time and only fiddle with the dynamic ones. However, this is not the real problem. The real problem is fragmentation over time, since you can't expect meshes to be evicted in reverse order. This wastes a lot of memory in small chunks if you can't fill the gaps anymore. First of all, I'd probably want to handle free memory sort of like Alexandrescu, using a list of chunks, each holding an offset and a size, with the first chunk holding 0 and the buffer size (a minimal sketch follows the list below). Then there'd be multiple possibilities for handling fragmentation, so if a certain fragmentation threshold is surpassed I could:

    a) reorganize the buffer, i.e. copy data around and make it contiguous again
    b) create a new buffer and upload formerly evicted meshes to that new buffer while slowly draining the old one until it is below the threshold or empty
    c) track the number of successful uploads over time and then see if fragmentation is too high (or if the buffer is simply full), and do either a) or b)
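
    For reference, a minimal first-fit version of such a chunk list might look like this (hypothetical names; eviction would insert a chunk back and merge neighbours):

    Code :
    #include <stdlib.h>

    /* One free region of the buffer: [offset, offset + size). */
    typedef struct Chunk { long offset; long size; struct Chunk *next; } Chunk;

    /* First-fit: carve `bytes` out of the first chunk that is large enough.
       Returns the allocation offset, or -1 if fragmented/full. */
    long AllocFromFreeList(Chunk **head, long bytes)
    {
        for (Chunk **c = head; *c != NULL; c = &(*c)->next)
        {
            if ((*c)->size >= bytes)
            {
                long offset = (*c)->offset;
                (*c)->offset += bytes;
                (*c)->size   -= bytes;
                if ((*c)->size == 0)        /* chunk exhausted: unlink it */
                {
                    Chunk *dead = *c;
                    *c = dead->next;
                    free(dead);
                }
                return offset;
            }
        }
        return -1;
    }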

    Granted, I may be overreaching here a little bit. Please keep in mind, this is not meant for a simple demo or something, but as a streaming system for a heavy-duty renderer. What's your experience with this? Any suggestions?
    Last edited by thokra; 02-04-2013 at 01:48 AM. Reason: Orthography and typos.

  2. #2
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,220
You're hitting the same issues that puzzled me for a while. The efficiency watchdog in both of us would really like to keep the data on the GPU as long as possible and re-use it to the very last possible frame. But for performance you don't really want to deal with full-up garbage collection, block coalescing, etc. In my experience it's much more important to ensure your GPU streaming is as efficient as possible; then the reuse issue isn't as critical. In the end I decided on an approach with a big streaming VBO which is easily big enough to hold multiple frames of streaming VBO data. Reuse data for as many frames as possible, but when it fills up, just orphan it and start filling the next page. That'll generate a little bandwidth hiccup "refilling the cache", but it doesn't happen often, and typically won't be a full frame of data. If needed, that can be mitigated somewhat by keeping ping-pong streaming buffers: when you need to fill up the new one, copy data from the old buffer if there's a cached copy there.
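
    A bare-bones sketch of that fill/orphan loop (streamHead, STREAM_VBO_SIZE and AcquireStreamRange are made-up names; the real thing also tracks alignment and per-batch offsets):

    Code :
    static GLintptr streamHead = 0;   /* next free byte in the streaming VBO */

    /* Hand back a mapped range of `bytes`; orphan and restart at the front
       when the buffer is full. */
    void *AcquireStreamRange(GLsizeiptr bytes)
    {
        GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT
                         | GL_MAP_INVALIDATE_RANGE_BIT;
        if (streamHead + bytes > STREAM_VBO_SIZE)
        {
            streamHead = 0;           /* full: orphan, start a fresh page */
            flags = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT;
        }
        void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, streamHead, bytes, flags);
        streamHead += bytes;
        return ptr;
    }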

    Recommended reading:

    * Re: VBOs strangely slow

  3. #3
    Senior Member OpenGL Pro
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    1,128
    Thank you very much! Very informative reply by Rob Barris there.

    Regarding your multi-frame VBO, do you orphan via glMapBufferRange and do an async invalidate? How big is big in your case and how much space do you allocate per frame? I assume you're doing it ring-buffer style? In your approach, you accept throwing out static data as well, right? Do you partition completely static and dynamic data into separate buffers?

    Please excuse the question storm.

  4. #4
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,220
    Quote Originally Posted by thokra View Post
    ...do you orphan via glMapBufferRange and do an async invalidate?
    I orphan using glMapNamedBufferRangeEXT with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT. Normal (non-orphaning) map is done with GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT.
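
    In call form (EXT_direct_state_access; vbo, vboSize, offset and size are placeholders):

    Code :
    /* Orphaning map (whole buffer; detaches the old storage): */
    void *p = glMapNamedBufferRangeEXT(vbo, 0, vboSize,
                  GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);

    /* Normal append map (no sync; just hand over the range): */
    void *q = glMapNamedBufferRangeEXT(vbo, offset, size,
                  GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT
                | GL_MAP_INVALIDATE_RANGE_BIT);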

    How big is big in your case and how much space do you allocate per frame? I assume you're doing it ring-buffer style?
    As to size, it's just a function of the dataset. 4-8MB or so. And yes, ring buffer.

    In your approach, you accept throwing out static data as well, right? Do you partition completely static and dynamic data into separate buffers?
    Static (as in known-at-startup-and-never-unloaded) VBO contents go into their own static VBOs. If you're not using NV bindless, flipping the draw source between a bunch of different VBOs can get expensive, so you'd want to do some smart packing. But if you're using bindless, you just don't care. It's all stuff on the GPU addressed with the same 64-bit handle space.

    Streamed VBO content (as in loaded and unloaded dynamically) goes to the streaming VBO.

The advantage of not blasting the static stuff into the streaming VBO as well is that it doesn't artificially drive up the size of your streaming VBO, nor does it need to be reloaded when you invalidate the cache (i.e. orphan the streaming VBO). You load it once and leave it.
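
    To make the bindless remark above concrete, the setup is roughly this (entry points from NV_shader_buffer_load and NV_vertex_buffer_unified_memory; staticVbo and vboSize are made-up names):

    Code :
    /* Make the static VBO resident and fetch its 64-bit GPU address. */
    GLuint64EXT addr;
    glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);

    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
    /* Attribute formats are still declared as usual (e.g. glVertexAttribFormatNV);
       then each draw just points attrib 0 at a GPU address -- no buffer binds: */
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, vboSize);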

  5. #5
    Senior Member OpenGL Pro
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    1,128
    Quote Originally Posted by Dark Photon
    I orphan using glMapNamedBufferRangeEXT with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT.
    I didn't actually expect you to orphan the whole buffer. This isn't done regularly, right? For the sake of rotating through the buffer, you don't need to invalidate it completely, so at what times do you invalidate the whole thing?

    Quote Originally Posted by Dark Photon
    Normal (non-orphaning) map is done with GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT.
    Isn't that technically orphaning as well? Am I missing something about invalidation?

    Thanks for your patience.

  6. #6
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    985
    Quote Originally Posted by Dark Photon View Post
    I orphan using glMapNamedBufferRangeEXT with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT.
    That's correct.

    Quote Originally Posted by Dark Photon View Post
    Normal (non-orphaning) map is done with GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT.
This one isn't (unless you use sync objects to make sure you don't overwrite currently used data).

    Let me explain.

If you pass the unsynchronized bit, it means you don't care whether you potentially overwrite data that the GPU might currently be using. I know that you also use the invalidate bit, but that doesn't guarantee you'll get a new piece of memory; it just allows the driver to give you one. If the driver doesn't, you'll corrupt data currently in use by the GPU unless you use sync objects or some other form of manual synchronization to ensure that you don't overwrite data used by the GPU.

I've seen this latter pattern in so many applications that I can't help feeling somebody is spreading this incorrect usage pattern.

    There are practically two simple ways to stream data (both options assume here a fixed chunk size, i.e. upload granularity):

    Option #1:
    Code :
    // Wrap to the front once the next chunk would overflow the buffer.
    if (chunkIndex * size + size > bufferSize) chunkIndex = 0;
    pointer = glMapBufferRange(target, chunkIndex * size, size,
                               GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
    // fill data
    ...
    glUnmapBuffer(target);
    chunkIndex++;

This one suffers from the problem that it relies on the assumption that the driver does renaming/orphaning internally and will give you a new piece of memory, which might not always be the case; thus you may see wildly varying performance across implementations.

    Option #2:
    Code :
    // Wrap to the front once the next chunk would overflow the buffer.
    if (chunkIndex * size + size > bufferSize) chunkIndex = 0;
    // Wait until the GPU is done with this chunk (rarely blocks if the
    // buffer is large enough).
    if (syncObject[chunkIndex] != NULL &&
        glClientWaitSync(syncObject[chunkIndex], 0, 0) != GL_ALREADY_SIGNALED)
    {
        while (glClientWaitSync(syncObject[chunkIndex], GL_SYNC_FLUSH_COMMANDS_BIT, largeTimeout) == GL_TIMEOUT_EXPIRED);
    }
    pointer = glMapBufferRange(target, chunkIndex * size, size,
                               GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    // fill data
    ...
    glUnmapBuffer(target);
    glDeleteSync(syncObject[chunkIndex]);
    // ... issue the draw calls that read from this chunk, then fence them
    // so the next pass over this chunk knows when the GPU is done with it:
    syncObject[chunkIndex] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    chunkIndex++;

This one should consistently perform efficiently across GL implementations. The only thing you have to make sure of is that your buffer is sufficiently large, so that you rarely (or never) hit the case where the tested sync object is not yet signaled.
    Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
    Technical Blog: http://www.rastergrid.com/blog/

  7. #7
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,220
    Quote Originally Posted by aqnuep View Post
    Quote Originally Posted by Dark Photon
    Normal (non-orphaning) map is done with GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT.
    This one isn't (unless you use sync objects to make sure you don't overwrite currently used data).
    Are you sure about that?

I only fill stream VBOs front-to-back, and with this technique I never overwrite a range of the buffer that I've given the GPU to use for any batch (until after orphaning). So it should be safe without any CPU/GPU handshaking (sync/fence business).

    For more details on this technique, read the two posts starting here, especially the second one by Rob:

    * Re: VBOs strangely slow

    Quote Originally Posted by Rob Barris
    The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to *rewrite* some storage that may be pending drawing...
    Last edited by Dark Photon; 02-06-2013 at 05:39 PM.

  8. #8
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,220
    Quote Originally Posted by thokra View Post
    I didn't actually expect you to orphan the whole buffer. This isn't done regularly, right?
    Fairly rarely. The stream VBO is large enough to handle one frame of data plus a fair amount of extra, so orphaning doesn't happen often.

Re the previous discussion on UNSYNCHRONIZED, the advantage of this fill/orphan/fill/orphan technique is that there's no need for any CPU/GPU waiting (fence/sync queries, etc.).

    ...at what times do you invalidate the whole thing? Isn't that technically orphaning as well? Am I missing something about invalidation?
It's subtle -- there are two invalidate flags. INVALIDATE_BUFFER = orphan (i.e. tear off that buffer page and give me a fresh new one, à la glBufferData NULL). INVALIDATE_RANGE says I'm gonna replace everything in this range, so the driver doesn't need to pre-fill it with valid buffer data for the current orphan. Different concepts.
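
    For reference, the glBufferData form of that orphan (streamVboSize is a placeholder; size and usage must match the buffer's original allocation):

    Code :
    /* Classic orphaning: detach the old storage, get a fresh page. */
    glBufferData(GL_ARRAY_BUFFER, streamVboSize, NULL, GL_STREAM_DRAW);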

UNSYNCHRONIZED says I promise not to mess with anything in this buffer that I've already given the GPU to process. Just gimme some memory to write on as fast as possible! This lets the driver potentially give you the same block of memory back again and again (if it desires) for efficiency. Potential write coalescing.

    I strongly recommend reading everything in this thread beginning at post 23 or so:

    * Re: VBOs strangely slow
    Last edited by Dark Photon; 02-06-2013 at 05:44 PM.

  9. #9
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    985
    Quote Originally Posted by Dark Photon View Post
    I only fill stream VBOs front-to-back, and I never overwrite a range of the buffer with this technique (until after orphaning)
Well, if you "orphan" after getting to the end of the buffer with glBufferData, then it's fine, as glBufferData must give you new storage. But if you orphan at the end with GL_MAP_INVALIDATE_*_BIT or with glBufferSubData, it could result in corruption.
    Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
    Technical Blog: http://www.rastergrid.com/blog/

  10. #10
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    Quote Originally Posted by Dark Photon View Post
    INVALIDATE_RANGE says I'm gonna replace everything in this range, so the driver doesn't need to pre-fill it with valid buffer data for the current orphan.
    Why would the driver "pre-fill" any of the buffer? If you're mapping for writing, you aren't supposed to read it. So it doesn't matter what's there.

    Quote Originally Posted by aqnuep View Post
    Well, if you "orphan" after getting to the end of the buffer with glBufferData, then it's fine, as glBufferData must give you new storage. But if you orphan at the end with GL_MAP_INVALIDATE_*_BIT or with glBufferSubData, it could result in corruption.
    Technically, glBufferData doesn't have to give you anything new if you use NULL and the same size/hint. It's perfectly reasonable for it to not allocate any new storage.

    As for the rest, I would say that invalidating the range is completely unnecessary if you've never written to that storage before. If the usage pattern is:

    1. Invalidate buffer.
    2. Map some range. Write data.
    3. Map some range after the previous map. Write data.
    4. Repeat 3 until we're out of space.
    5. Goto 1.

Then those "map some range" steps don't need INVALIDATE_RANGE_BIT at all. The only invalidation you need is for the buffer as a whole, so that you can get new storage (or just to let GL know that you're not using it anymore). All pieces of an invalidated buffer are by definition no longer in use.

    INVALIDATE_RANGE_BIT is mainly so that GL can check to see if the range is already in use; if it is, then it could give you some spare memory to write to, and it'll upload it later. That seems to clash with UNSYNCHRONIZED, since the whole point of that is to say, "Don't bother checking; just give me the memory." Also, we know the next range isn't in use, because we invalidated the entire buffer already.
