
Thread: VBOs strangely slow?

  1. #31
    Intern Contributor
    Join Date
    May 2008
    Posts
    99

    Re: VBOs strangely slow?

    Quote Originally Posted by Rob Barris
    The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to *rewrite* some storage that may be pending drawing, like trying to do some sort of sub-section update to a mesh, say.
    Rob, I'm not sure I understand. You still need a sync before you go to draw though, don't you? Should the application keep track of active and inactive VBOs (the GPU may be drawing with the active while the inactive VBOs are ready to be recycled)?

    Quote Originally Posted by Rob Barris
    The flip side of that is that you can do high performance dynamically generated batches of mixed sizes with no fences at all, and with low driver overhead, if you constrain your access patterns to only write/use any given segment of a buffer exactly once before the buffer is orphaned. This is a familiar pattern from the D3D playbook.
    I'm curious about your "dynamically generated batches". Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them? I'd really like to be able to put all my vertex data directly into VBOs. Unfortunately, I have multiple LODs and I don't know which LODs I need until I'm finished culling. If I put all my LODs into VBOs, I'd have hundreds of MBs of VBO data. I'm struggling with how to fill VBOs with the correct subset of vertex data while giving the GPU enough time between the upload and the draw call...

    Thanks.

  2. #32
    Senior Member OpenGL Guru Dark Photon
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,188

    Re: VBOs strangely slow?

    Quote Originally Posted by skynet
    It's time for some new whitepapers from ATI/nVidia on how to deal with updating VBOs/UBOs/PBOs quickly. Clean up some myths and get straight on the facts. I'm tired of guessing.
    Agreed! Death to the Ouija Board!

    Also, interesting blog post from Sunday on this very topic: One More On VBOs - glBufferSubData

  3. #33
    Member Regular Contributor
    Join Date
    Apr 2006
    Location
    Irvine CA
    Posts
    299

    Re: VBOs strangely slow?

    Quote Originally Posted by ViolentHamster
    Quote Originally Posted by Rob Barris
    The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to *rewrite* some storage that may be pending drawing, like trying to do some sort of sub-section update to a mesh, say.
    Rob, I'm not sure I understand. You still need a sync before you go to draw though, don't you? Should the application keep track of active and inactive VBOs (the GPU may be drawing with the active while the inactive VBOs are ready to be recycled)?

    Quote Originally Posted by Rob Barris
    The flip side of that is that you can do high performance dynamically generated batches of mixed sizes with no fences at all, and with low driver overhead, if you constrain your access patterns to only write/use any given segment of a buffer exactly once before the buffer is orphaned. This is a familiar pattern from the D3D playbook.
    I'm curious about your "dynamically generated batches". Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them? I'd really like to be able to put all my vertex data directly into VBOs. Unfortunately, I have multiple LODs and I don't know which LODs I need until I'm finished culling. If I put all my LODs into VBOs, I'd have hundreds of MBs of VBO data. I'm struggling with how to fill VBOs with the correct subset of vertex data while giving the GPU enough time between the upload and the draw call...
    I'll try to boil this down a bit. First let's define a workload, then we look at how you can feed it to GL. If your app doesn't match the workload, then this may not apply to you.

    workload: say the CPU wants to draw a series of batches where each one is based on data generated or unpacked right before issuing of the draw request. Once written, the data is not going to be modified or read back by the CPU. The goal is to efficiently let the GPU have access to the newly written data, and to avoid bogging down with excessive allocation or synchronization on a per-draw basis.

    ( As a hypothetical example, say we're using the CPU to deform and draw hundreds of falling leaves, where the leaf-shape algorithm runs on the CPU, and can be used to generate new batches of verts for each leaf at will )

    So, you can do this with one VBO and no fences, and it can run really well. The magic is hiding in the buffer-orphaning step.

    So make a VBO with glGenBuffers and glBindBuffer, and set its size with glBufferData. A few megabytes is good.

    Init a "cursor / offset" to zero.

    for each batch:
    - figure out how many bytes it will be.
    - round it up to a multiple of some nice power of two; 64 is good.
    * orphan current VBO if this batch won't fit (see below).
    - map the buffer using UNSYNCHRONIZED, at the current cursor offset, asking for the padded number of bytes to be visible. (On Apple flush-buffer-range, you can map in unsynchronized fashion, you just can't pick the range, so you always get the base address back - just add the offset to it)
    - write the data at the beginning of the mapped range.
    - unmap.
    - increment the cursor by the padded size used.
    - issue the draw call after setting vertex attrib pointers appropriately into the VBO, keeping the offset in mind.
    - repeat.
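    The padding step in the list above can be sketched as a small helper. This is only an illustration: the name pad_up is made up here, and it assumes the alignment is a power of two (64 in the post).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of OpenGL.
 * Rounds `bytes` up to the next multiple of `align`.
 * `align` must be a power of two (the post suggests 64). */
size_t pad_up(size_t bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Adding align-1 and masking off the low bits rounds up. */
    return (bytes + align - 1) & ~(align - 1);
}
```

    The padded size is what gets added to the cursor, so every batch starts on an aligned offset.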

    Note, if you are using an asynchronous or multithreaded driver, you might well get 40, 50, 100 batches written into that VBO (and draw commands enqueued) before the GPU even looks at that first byte. That's OK. You just want the client thread to get in and out of that VBO as fast as possible so it can stay busy doing work.

    At some point the cursor will have moved far enough such that the next batch of data will not fit - i.e. offset + padded size exceeds the total size of the VBO. Note the starred step above.

    When this eventuality happens, and it will vary depending on the size of the batches you've been dropping into the VBO, the response is very simple.

    - orphan current storage by doing a new glBufferData using the fixed size chosen for the VBO, and a NULL pointer.
    - rewind cursor to offset 0.
    - continue.
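    For anyone who wants the cursor bookkeeping spelled out, here is a minimal sketch in plain C. It models only the offset arithmetic and the "orphan if it won't fit" decision; the GL calls are left to the caller and appear only in comments. StreamCursor, StreamSlot and stream_reserve are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one streaming VBO; not a GL type. */
typedef struct {
    size_t capacity;  /* fixed size given to glBufferData */
    size_t cursor;    /* next free byte offset in current storage */
    size_t align;     /* padding granularity, must be a power of two */
} StreamCursor;

typedef struct {
    size_t offset;    /* where to map, and base for attrib pointers */
    size_t length;    /* padded byte count to request from the map */
    bool   orphan;    /* true: orphan the buffer before mapping */
} StreamSlot;

/* Reserve room for one batch and advance the cursor.
 * When slot.orphan is true the caller is expected to issue
 *   glBufferData(target, capacity, NULL, usage);
 * before mapping. The caller then maps [offset, offset + length)
 * with glMapBufferRange using
 *   GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT,
 * writes the data, unmaps, and draws using `offset`. */
StreamSlot stream_reserve(StreamCursor *s, size_t bytes)
{
    StreamSlot slot;
    size_t padded = (bytes + s->align - 1) & ~(s->align - 1);

    /* A single batch has to fit into an empty buffer. */
    assert(padded <= s->capacity);

    slot.orphan = (s->cursor + padded > s->capacity);
    if (slot.orphan)
        s->cursor = 0;          /* rewind onto the fresh storage */

    slot.offset = s->cursor;
    slot.length = padded;
    s->cursor += padded;
    return slot;
}
```

    Because the cursor only ever moves forward between orphans, no two reservations on the same storage overlap, which is what makes the unsynchronized map safe in this pattern.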

    The subsequent map result will look at new storage, a clean sheet. The old storage belongs to the driver, it's no longer associated with the VBO ID that you have in your code. So from one point of view there are now two buffers of storage running around, but the one you orphaned can no longer be accessed by the client code. At some point all the draw calls that are consuming data from that storage will complete - and that storage will be freed or possibly recycled automatically.

    In this model, the number of VBO's known to the client is "one". The number of floating (orphaned) *blocks of storage* could be much higher, depending on how long the GPU is taking to chew through each job and how fast the CPU can drop them off.

    So you don't have to juggle "multiple VBO's", you just need to keep blasting away at the one VBO while letting the driver swap in new chunks of storage as needed.

    Client never needs to fence, or check GPU progress, or block on map.

    Write&draw, write&draw, repeat til VBO full, orphan and rewind cursor, repeat. CPU gets to drop off all of its data and draw requests and potentially go on to do other tasks without a care as to how many orphaned buffers (storage blocks) wind up in flight or how fast the GPU is retiring them.

    So in the hypothetical example, you might completely fill one buffer with leaf shapes (and have a draw pending on each one), orphan it, start pumping leaves into the VBO again starting at zero offset, process repeats. Are you getting ahead of the GPU by one or more blocks of storage? Maybe. Do you care? No. Let the GPU and driver catch up on their own time (ideally on an alternate CPU core). Keep that client drawing thread unblocked.

    Driver only sees fixed-size VBO blocks coming and going. Its job of recycling those chunks of storage is greatly simplified. Draw events should outnumber orphan events by some healthy multiple - only you know the likely spectrum of batch sizes. Orphaning 128MB VBO's is probably too big. Orphaning 2-4MB VBO's, no big deal.
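    To get a feel for the draw-to-orphan ratio, a rough estimate can be computed from the buffer size and a typical batch size. The sketch below assumes every batch is the same size, which real workloads won't be; draws_per_orphan is a made-up name for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimate: how many equally sized batches fit into
 * one block of storage before the buffer has to be orphaned.
 * `align` is the padding granularity and must be a power of two. */
size_t draws_per_orphan(size_t buffer_bytes, size_t batch_bytes, size_t align)
{
    size_t padded;
    assert(batch_bytes > 0);
    assert(align != 0 && (align & (align - 1)) == 0);
    padded = (batch_bytes + align - 1) & ~(align - 1);
    return buffer_bytes / padded;
}
```

    For example, 10,000-byte batches padded to 64 take 10,048 bytes each, so a 4 MB buffer holds 417 of them per orphan.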

    Going back to your questions
    Quote Originally Posted by ViolentHamster
    You still need a sync before you go to draw though, don't you?
    Not in this style. You map, write the data, and unmap, you can issue a draw call on that data right away. (An async driver is just stacking up these draw requests to process in order). The key is that you get control back into your code as soon as possible so you can crank up the next batch's data. You stay disconnected from any idea of how much work the GPU has done or is about to do.

    There is a subtlety that the "next batch" will usually be mapping the same buffer/storage, but you are not going to alter or step on any data previously emplaced - the ascending cursor sees to that. The world doesn't end if the GPU reads from address A while you write to address B and they are different.

    Again if your workload doesn't fit this model, you would need to do more explicit sync effort possibly using fences to know "when" it is safe to touch any given region of storage. But if all you do is fill, fill, fill and then orphan and start over - you never need to check or sync. The juggling of multiple blocks of storage is all in the driver and not your problem. All you need to do is be careful about only writing each section of the larger VBO once and then moving on, and you're fine.

    Quote Originally Posted by ViolentHamster
    Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them?
    Not really. My thinking is usually along the lines of "what steps can I take such that the CPU can maximize its rate of work delivery, and get control back without having to wait for that work to complete?"

    If you are trying to manage the contents of a VBO such that some portions of it stay constant while other portions are changing, that's a workload where you would probably have to start using fences or other heuristics to schedule overwrites of pieces of it. (One heuristic is "has this chunk been used to draw anything in the last five frames" - if no, and you know the driver has a three frame queuing limit say, then you can actually infer when it's safe to overwrite that region without any sync effort, i.e. non blocking map, but you need to make sure you track carefully each segment and mark them in your own data structure when they were last used for draw).
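    The "last used N frames ago" heuristic could be tracked along these lines. This is a sketch under assumptions: SegmentUse and both functions are invented for the example, the frame counter is the application's own, and the idle threshold (five in the post) has to be chosen larger than whatever queuing depth you believe the driver has, which GL itself does not report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-segment record kept by the application. */
typedef struct {
    uint64_t last_draw_frame;  /* frame of the most recent draw */
    bool     ever_drawn;       /* false until the first draw */
} SegmentUse;

/* Call whenever a draw that reads this segment is issued. */
void segment_mark_drawn(SegmentUse *seg, uint64_t frame)
{
    seg->last_draw_frame = frame;
    seg->ever_drawn = true;
}

/* True if the segment has gone at least `idle_frames` frames
 * without being drawn from. This is an inference from the app's
 * own history, not a query of GPU progress. */
bool segment_safe_to_overwrite(const SegmentUse *seg,
                               uint64_t current_frame,
                               uint64_t idle_frames)
{
    if (!seg->ever_drawn)
        return true;
    assert(current_frame >= seg->last_draw_frame);
    return current_frame - seg->last_draw_frame >= idle_frames;
}
```

    When the check passes, the segment can be mapped unsynchronized and rewritten; when it fails, the app either waits, picks another segment, or falls back to an explicit sync.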

    OTOH glBufferSubData will always be orderly and safe for a partial VBO replacement, no matter what has happened recently, but you have to have the source data in copyable form, whereas with mapping you can combine decompression and delivery into the buffer.

    IMO the application usually knows more about its operational history than the driver does, and is in a better position to make clever decisions about when sync is needed, which is why MapBufferRange has the unsynchronized option.

    whew.

  4. #34
    Intern Contributor
    Join Date
    May 2008
    Posts
    99

    Re: VBOs strangely slow?

    Thanks for your response. Let me read through that... When do you sleep?

  5. #35
    Member Regular Contributor
    Join Date
    Apr 2006
    Location
    Irvine CA
    Posts
    299

    Re: VBOs strangely slow?

    I'm just wakin' up

  6. #36
    Intern Contributor
    Join Date
    May 2008
    Posts
    99

    Re: VBOs strangely slow?

    Is this approach best suited for highly dynamic objects that are rendered a few frames behind their CPU positions? With orphaning, you don't draw the same position twice. You always have a fill/draw/fill/draw?

    What if you didn't draw leaves? What if you drew static objects like terrain, or objects that needed collision detection? Would you have to use another approach?

  7. #37
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948

    Re: VBOs strangely slow?

    What if you didn't draw leaves? What if you drew static objects like terrain, or objects that needed collision detection?
    If you're drawing static terrain, you use static buffer objects. Upload once, draw many. GL_STATIC_DRAW. There isn't really an approach for that.

    This approach is for objects that you need to constantly generate data for.

  8. #38
    Intern Contributor
    Join Date
    May 2008
    Posts
    99

    Re: VBOs strangely slow?

    Quote Originally Posted by Alfonse Reinheart
    If you're drawing static terrain, you use static buffer objects. Upload once, draw many. GL_STATIC_DRAW. There isn't really an approach for that.
    If I'm creating static buffer objects at runtime, how do I ensure that they have been uploaded before I go to draw with them? I don't want the draw calls to block until the GPU receives all the data.

    I'd like to be able to say, "Hey GPU, upload this high resolution LOD. Let me know when you're done. In the meantime, can you draw with the low resolution LOD? Thanks for not blocking and causing terrible frame breaks, GPU. You're super."

  9. #39
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948

    Re: VBOs strangely slow?

    I'd like to be able to say, "Hey GPU, upload this high resolution LOD. Let me know when you're done. In the meantime, can you draw with the low resolution LOD? Thanks for not blocking and causing terrible frame breaks, GPU. You're super."
    If it's a static buffer, doesn't that mean you're uploading it at "initialization" time? And how would you know that the low resolution LOD is uploaded yet if you're not sure about the high LOD?

  10. #40
    Intern Contributor
    Join Date
    May 2008
    Posts
    99

    Re: VBOs strangely slow?

    Quote Originally Posted by Alfonse Reinheart
    If it's a static buffer, doesn't that mean you're uploading it at "initialization" time?
    No. Imagine you have more data than will fit on a GPU and you can't display a "Loading" screen as the character moves--_very_ quickly.

    Quote Originally Posted by Alfonse Reinheart
    And how would you know that the low resolution LOD is uploaded yet if you're not sure about the high LOD?
    For the terrain or model in question, you'd have to display nothing at first, then the low res. I think I can figure out when nothing is ready.
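    The "nothing, then low res, then high res" fallback can be sketched as a simple selection over per-LOD ready flags. How a flag gets set is left open here and is an assumption: one option would be a fence placed after the upload and polled without blocking. pick_ready_lod is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical selection helper.
 * ready[i] is true once LOD i is known to be usable;
 * index 0 is the lowest resolution.
 * Returns the highest ready LOD not exceeding `wanted`,
 * or -1 if nothing is ready yet (draw nothing this frame). */
int pick_ready_lod(const bool *ready, int lod_count, int wanted)
{
    int i;
    assert(lod_count > 0 && wanted >= 0 && wanted < lod_count);
    for (i = wanted; i >= 0; --i) {
        if (ready[i])
            return i;
    }
    return -1;
}
```

    The draw path only ever touches buffers whose flag is set, so it never has to wait on an upload that is still in progress.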

