VBOs strangely slow?



Baughn
02-23-2010, 09:34 AM
I have here (http://hpaste.org/fastcgi/hpaste.fcgi/view?id=22909) two short programs doing the same thing, namely drawing random points on the screen.

The first one uses VBOs; the second, OpenGL 1.1 drawArrays. There are a couple of parameters to play with (the number of VBOs to use in round-robin fashion, etc.), but it doesn't really matter what parameters are used; the results are approximately the same whatever you do.

That is, the second, VBO-less program runs roughly ten times faster than the first in all cases, except if you comment out the VBO update entirely and only initialize the buffers at startup, in which case they're the same speed.

Now, this is hardly the ideal use for VBOs - being used only once, and all - but it still seems odd to me that the difference would be that dramatic. Thus, my question: Am I doing something obviously and horribly wrong?

DmitryM
02-23-2010, 09:40 AM
I recall several discussions already claiming VBOs are slow. That hardly makes sense, because they are the only choice in pure GL3. So unless you're ignoring the future entirely or doing something wrong, you can expect them to work at least as fast as vertex arrays.

Anyway, did you try playing with the VBO usage hint parameter?

Baughn
02-23-2010, 09:42 AM
Yes, all the _DRAW permutations have been tested; they make no apparent difference. (STATIC_DRAW was no slower, even... you'd think it would be.)

I also tested bufferSubData vs. bufferData in those permutations.

Baughn
02-23-2010, 10:30 AM
Or it could be because the lowermost program actually transfers far less data. Um. Yeah.

Though even with that fixed, VBOs are /still/ 40% slower.

Alfonse Reinheart
02-23-2010, 11:15 AM
Why are you creating the buffer object with GL_DYNAMIC_DRAW in one case, and GL_STREAM_DRAW in another?

Also, what hardware are you running this on?

Baughn
02-23-2010, 11:38 AM
I was testing various combinations to see what, if anything, had an effect. So far nothing's produced any change at all.

Fixing the non-VBO case so the entire array is actually used, I'm looking at ~6,000ms for non-VBO, ~10,000ms for VBO (absolutely regardless of that setting), and ~8,000ms to upload it as a texture which I then proceed not to use.

My preference would be to have a way to do this via DMA: hand the driver a pointer, tell it I won't be changing the data until some later point (at which point I presumably need a sync call of some sort), and let the DMA engine do the memory copy. I don't suppose that's possible?

Alfonse Reinheart
02-23-2010, 11:46 AM
So, what hardware are you running this on?

Also, why are you using the ARB extension function pointers rather than the core function pointers? I don't imagine that this has any effect, but it does seem highly unnecessary.


I don't suppose that's possible?

Have you tried mapping the buffers?

Baughn
02-23-2010, 11:49 AM
In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding the buffer's contents), it's 10% faster than the vertex array.

All's well that ends well? I guess, but it's still not obvious to me why the other way of using them is in this case /slower/.

EDIT: Well, yes, MapBuffer should be able to use DMA transfers.. that makes perfect sense. That's still missing a way to do a DMA transfer from an array that needs to stay readable in main memory, but I suppose it doesn't matter; I don't need that ability right now.
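For the record, the fast path boils down to something like this (a minimal sketch, assuming GLEW for extension loading; the function and names are mine, not the paste's):

#include <GL/glew.h>
#include <string.h>

/* Upload fresh point data each frame: orphan, then map and write. */
void upload_points(GLuint vbo, const void *verts, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    /* A NULL data pointer discards the old contents ("orphaning"),
       so the driver needn't stall if the GPU is still reading them. */
    glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);

    /* Map the fresh storage and write into it; the driver is free to
       DMA it to the card later. */
    void *dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dst) {
        memcpy(dst, verts, bytes);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}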

And I was using the ARB versions because they have the exact same API and are thus supported everywhere the core versions are (as far as I've seen); the reverse is not quite the case, though I haven't personally run into that either.

Ilian Dinev
02-23-2010, 11:53 AM
Try also the glBufferSubData approach.
YMMV :(

Baughn
02-23-2010, 12:04 PM
I tried it (I think?); just uncomment the glBufferSubData line in the paste... no difference.

glfreak
02-23-2010, 12:31 PM
Try not using VBOs then ;)

The question is, why do they push the use of a new feature if it's not implemented well?

Conclusion: even with traditional glBegin/End you can get outstanding performance, as long as you optimize algorithmically rather than at the instruction/pixel/hardware level.

Personally, I only believe in the hardware rasterizer as a fast alternative to software rasterization. Other than that, try a pure shader path and see if it's slower or faster :D

Alfonse Reinheart
02-23-2010, 01:32 PM
Try not using VBOs then

The question is, why do they push the use of a new feature if it's not implemented well?

Conclusion: even with traditional glBegin/End you can get outstanding performance, as long as you optimize algorithmically rather than at the instruction/pixel/hardware level.

Did you read the thread? He said, "In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding the buffer's contents), it's 10% faster than the vertex array." In short, VBOs worked better for him once he was using the correct API. So your "conclusion" is errant nonsense.

On topic: you should use glMapBufferRange if that extension is available. With the invalidation flag, you don't even need the glBufferData(NULL) part.
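Something like this (a minimal sketch, assuming GL 3.0 or ARB_map_buffer_range; the names are illustrative):

#include <GL/glew.h>
#include <string.h>

/* One map call replaces the glBufferData(NULL) + glMapBuffer pair. */
void upload_invalidate(GLuint vbo, const void *verts, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    void *dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_INVALIDATE_BUFFER_BIT);
    if (dst) {
        memcpy(dst, verts, bytes);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}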

glfreak
02-23-2010, 03:00 PM
"Try not use VBOs then"

If it's confusing...because sometimes you change the order of commands and you get different/unexpected results on middle end hardware.

"The question is, why they push the use of a new feature if it's not implemented well?"

Because ppl talking about expected a huge perfomance gain when using VBOs.

"Conclusion, even with traditional glBegin/End you can get an outstanding performance as long as you algorithmically optimize vs. instruction/pixel/hardware optimization."

A better way to optimize software ;)

Ilian Dinev
02-23-2010, 03:05 PM
....
Try shoving 10 million tris to the GPU per frame at 60fps without VBOs, while needing flexibility that display lists don't give (and without wanting to waste VRAM on the different permutations that DLs would otherwise require).
:)

KariGordon
02-24-2010, 03:48 AM
Use GL_STATIC_DRAW and don't update your data (via glBufferData or glBufferSubData) in your for loop. I would think those calls cause the geometry to be sent over to the graphics adapter every time.
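Something along these lines (a sketch with names of my own choosing, assuming GLEW and the fixed-function vertex arrays the test programs use):

#include <GL/glew.h>

/* At init: specify the data exactly once. */
GLuint make_static_vbo(const void *verts, GLsizeiptr bytes)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_STATIC_DRAW);
    return vbo;
}

/* Per frame: bind and draw only - no glBufferData/glBufferSubData. */
void draw_static(GLuint vbo, GLsizei count)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (const void *)0); /* offset into VBO */
    glDrawArrays(GL_POINTS, 0, count);
    glDisableClientState(GL_VERTEX_ARRAY);
}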

Baughn
02-24-2010, 07:44 AM
That's pretty much the idea. I don't actually update the array here (because I just want to benchmark transfer speed, not random-number generation), but the target program writes new data every frame.

Well, mapping the buffer works very nicely.

Dark Photon
02-26-2010, 05:39 AM
In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding the buffer's contents), it's 10% faster than the vertex array.

All's well that ends well? I guess, but it's still not obvious to me why the other way of using them is in this case /slower/.
That's interesting. When I've tried Map vs. Sub, Sub was faster (with invalidate, of course, so multiple buffers are in flight in the driver [allegedly], and with the VBO max size fixed - no resizing).

But yeah, pure VBOs are odd. You'd think they'd always be faster, but some of the time they're slower (most of the time on pre-SM4 cards) - unless you play the "Ouija board" correctly per card, per driver rev:
Map vs. Sub. Invalidate vs. not. Sync vs. not. Static vs. stream vs. dynamic. Dynamic max VBO size or not. Interleaved attributes vs. separate. Multiple batches per VBO vs. not. Mixing index and vertex arrays in one buffer or not. Max VBO size X or Y. 32-byte-aligned verts or not. Ring of N buffers or one. Vertex formats X or Y for colors, normals, texcoords, etc. Latency between upload and use X or Y. Calling glVertexPointer first, last, or in between. Heck, one of our devs even found it can be faster to use a CPU-side index list with VBO vertex attributes on some cards, when the index list changes frequently.

On pre-SM4 cards, VBO perf used to be a total crapshoot, more likely to be slower than client arrays than faster, and that's without any dynamic VBO updates. (You laugh, but we still have customers in the field with these cards and thus have to support them; they're only ~3 yrs old, and our customers use lots of GPUs.) For recent-gen cards it's getting easier to be faster with VBOs, though it's still possible to find cases where VBOs lag. Batch setup seems more expensive with them than with client arrays.

VBO updates aside, though, I will say I am pleased with VBO performance on recent cards, particularly using NVidia's bindless batch data extension. With that, I can get very near the performance of their legendary display lists (it's ~2X slower without bindless), so no doubt NVidia display lists use bindless internally (of course). VBOs+bindless is definitely the future (unless they come up with something even faster :cool:)


The question is, why they push the use of a new feature if it's not implemented well?
That's a very good question. VBOs would have been a much easier sell if they hadn't positively sucked when they were first introduced, which lasted for several generations of cards. They're still a Ouija board, but the Ouija board has gotten much smaller on recent cards.

Another reason VBOs weren't such a slam-dunk sell is that the vendors did not provide guidance saying specifically "this is how you get the fastest VBO performance on our cards: use permutation A,B,C,F,M,P,R". And when a tip was dropped, if you tried it, half the time performance was worse.

Dark Photon
02-26-2010, 07:04 AM
In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding the buffer's contents), it's 10% faster than the vertex array.
Could you post your exact map code in a follow-up? I think it'd be useful/informative for a number of folks to run all three, and let you/us verify that everyone's seeing similar results on varying GPUs/vendors/drivers with exactly the same code.

zweifel
02-26-2010, 07:19 AM
Not sure if someone said it before, but what are your GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES?

Exceeding those limits will have bad results.

Dark Photon
03-01-2010, 05:20 PM
Just for kicks, and to come to some VBO upload performance conclusions on modern hardware (at least with this 8MB/upload example), I thought I'd take the original two permutations (same VBO sizes/contents/rendering), and try a few variations for comparison:

1. 2.163s - Client arrays
2. 2.801s - BufferData load/reload
3. 2.876s - BufferData NULL, BufferSubData load
4. 1.985s - BufferData NULL, MapBuffer load, UnmapBuffer
5. 2.013s - glMapBufferRange MAP_INVALIDATE load, UnmapBuffer
6. 2.078s - glMapBufferRange MAP_INVALIDATE load, UnmapBuffer with buffer load/use lag of 2

Test setup:
- NVidia GTX 285 GPU, Core i7 920 CPU, PCIe 2.0
- NVidia 190.32 Linux drivers

Option #3 used to be the fastest. But on modern hardware/drivers it's now the dead slowest :p , at least with this example.

Also, options #4 and #5 can be made ~60ms faster (3%) merely by using fewer buffers (e.g. 1 instead of 3).

It's interesting to note this nets an upload rate of ~2 GB/sec (6.4 GB/sec practical max PCIe2, 8.0 GB/sec theoretical max, 8.3 GB/sec theoretical max CPU mem).

Alfonse Reinheart
03-01-2010, 06:06 PM
Have you tried explicit synchronization with NV_fence/ARB_sync and using GL_UNSYNCHRONIZED with glMapBufferRange?

Dark Photon
03-01-2010, 06:51 PM
Have you tried explicit synchronization with NV_fence/ARB_sync and using GL_UNSYNCHRONIZED with glMapBufferRange?
No, sure hadn't. What do you envision here?

I thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED was so the GPU can have multiple buffers in flight for the same buffer handle, and thus pipeline the buffer uploads, avoiding stalls. So I'm not seeing where fences come in.

I also didn't test a technique that has been touted here for buffer upload speed-up (since this is such a trivial test app): mapping the buffer in a foreground thread, taking the potentially multi-ms hit of the memcpy in a background thread, and then unmapping in the foreground thread, with ring-buffer work queues between the threads. But that's typically only useful if you've got other (typically GL) work to do in the foreground thread. This little test app is just gonna wait on the memcpy to unmap anyway, because it has nothing better to do.

Alfonse Reinheart
03-01-2010, 09:09 PM
I thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED was so the GPU can have multiple buffers in flight for the same buffer handle, and thus pipeline the buffer uploads, avoiding stalls. So I'm not seeing where fences come in.


GL_UNSYNCHRONIZED is not the same as GL_INVALIDATE.

GL_INVALIDATE tells the implementation, "I don't care what was in the buffer before; I just want some memory!"

GL_UNSYNCHRONIZED says, "I don't care that you may currently be using the buffer, and that my attempt to modify it while in use can have horrible consequences. I will take responsibility for making sure the buffer is not in use when I modify it, so give me a pointer already!"

They're both solutions to the same basic problem (I rendered with a buffer last frame, and I want to change it and use it this frame), but with different needs. GL_INVALIDATE/glBufferData(NULL) is ultimately giving you two buffer objects: the one that's currently in use and the one you're writing to. GL_UNSYNCHRONIZED is all about using only one piece of memory to avoid the synchronization.

The idea is that you fill up a buffer object, do something with it, and then set a fence. If you want to change the buffer, you check your fence. If the fence has not passed yet, you go do something else (and therefore this only works when you have "something else" that you could be doing). When the fence has passed, you can now fill the buffer.
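In code, roughly (a sketch assuming ARB_sync; the one-fence-per-buffer pairing is just for illustration):

#include <GL/glew.h>

static GLsync fence; /* one fence guarding one buffer, for simplicity */

/* After issuing the draws that read from the buffer, drop a fence. */
void draw_and_fence(GLuint vbo, GLsizei count)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glDrawArrays(GL_POINTS, 0, count);
    fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

/* Poll before refilling: nonzero means the GPU is done with the data. */
int safe_to_refill(void)
{
    GLenum r;
    if (!fence)
        return 1;
    r = glClientWaitSync(fence, 0, 0); /* timeout of 0: just poll */
    if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
        glDeleteSync(fence);
        fence = 0;
        return 1;
    }
    return 0; /* still in flight - go do something else */
}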

Rob Barris
03-02-2010, 02:01 AM
GL_UNSYNCHRONIZED can allow for idioms where the client is generating a large number of small batches dynamically; it makes it much more efficient to stack them up one after another within a smaller number of larger VBOs. For example, you could have a 4MB VBO and be able to map/write/unmap/draw several hundred times using that storage before ever having to orphan or fence, if you are processing kilobyte-ish batches of data.

In this regard it's closer to the D3D NO_OVERWRITE hint. "Yes, I know I just wrote 512 bytes of stuff at offset 0, and maybe it hasn't been processed yet - I would like to go back in and write 1280 bytes of new stuff starting at offset 512 now in the same buffer... and I'd rather not have to wait." And so you repeat until you hit the end of the buffer - no hazards, no risks.

Concurrency goes up, especially on a multi-threaded driver, when you can use the cheap operation more frequently than the expensive one (unsync map = cheap... orphaning = less cheap).

When this style makes sense (depends on your app), you can cut way down on the driver memory management work, since it just sees one particular size of buffer being orphaned / recycled, and those events are much less frequent than maps and unmaps.

Ideally, you reach a steady state where the driver is round-robining between a few physical buffers of that one large size, allocations stop happening, and the driver need not care if you are blasting rand()-sized batches in various numbers into that storage.

The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to *rewrite* some storage that may be pending drawing, like trying to do some sort of sub-section update to a mesh, say.

The flip side of that is that you can do high performance dynamically generated batches of mixed sizes with no fences at all, and with low driver overhead, if you constrain your access patterns to only write/use any given segment of a buffer exactly once before the buffer is orphaned. This is a familiar pattern from the D3D playbook.

Baughn
03-02-2010, 04:42 AM
I'm afraid all these details are too much for this poor systems programmer. I'll play the Ouija board, figure out code that works well on my own system, and not worry too much about other systems. Still...

What I came up with in the end for the actual application (code here (http://github.com/Baughn/Dwarf-Fortress--libgraphics-/), but there's way too much of it) is to use two VBOs for the variable data, which I switch between once per frame (using glMapBufferRange to invalidate when available, glBufferData otherwise), plus a STATIC_DRAW VBO for the essentially static vertex grid. This works well enough; it's as fast as the ncurses output mode, which means about twice the speed of any other mode, even counting immutable overhead.
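The per-frame update amounts to something like this (a rough sketch, not the actual repository code; GLEW assumed for the extension check):

#include <GL/glew.h>
#include <string.h>

/* Two VBOs alternated per frame; invalidate-map when available. */
void update_frame_vbo(GLuint vbos[2], unsigned frame,
                      const void *data, GLsizeiptr bytes)
{
    GLuint vbo = vbos[frame & 1];
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    if (GLEW_ARB_map_buffer_range) {
        void *dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes,
                                     GL_MAP_WRITE_BIT |
                                     GL_MAP_INVALIDATE_BUFFER_BIT);
        if (dst) {
            memcpy(dst, data, bytes);
            glUnmapBuffer(GL_ARRAY_BUFFER);
        }
    } else {
        /* Fallback: orphan and respecify in one call. */
        glBufferData(GL_ARRAY_BUFFER, bytes, data, GL_STREAM_DRAW);
    }
}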

If you really want to see the actual code... uh, the important functions would be swap_pbos in graphics.cpp, and render_shader/init_gl (the latter on the shader branch) in enabler_sdl.cpp, but I would suggest you stay away. For one thing, the code's embarrassing and impenetrable.

I've also got ARB_sync in there, on the theory that blocking in SDL_GL_SwapBuffers is a very bad thing and I can't figure out a better way to limit the framerate to what my (8600M) GPU can handle.

But now you're saying display lists are likely to be faster? And the drivers will use multiple VBOs as appropriate if I just invalidate before mapping? Are those also true for ATI cards?

Also, is there an ATI equivalent of bindless graphics?

Pierre Boudier
03-02-2010, 04:52 AM
"But now you're saying display lists are likely to be faster?"
-> internally, the driver will convert display list to vbo. the main issue with display list is that it is hard to predict when an implementation can optimize, because there are many corner cases in opengl...

"And the drivers will use multiple VBOs as appropriate if I just invalidate before mapping? Are those also true for ATI cards?"
-> yes. the implementation will reallocate a buffer and avoid any unnecessary synchronization overhead.

"Also, is there an ATI equivalent of bindless graphics?"
-> you can use vertex_array_object.

Dark Photon
03-02-2010, 05:51 AM
Have you tried ... GL_UNSYNCHRONIZED with glMapBufferRange?
... Thought the whole purpose of MapBuffer NULL / UNSYNCHRONIZED is so the GPU can have multiple buffers in flight for the same buffer handle
My apologies. I tested (and meant) INVALIDATE, but Alfonse said UNSYNCHRONIZED, and I merely copied it and missed the distinction.

And thanks Rob and Alfonse for the detailed responses! I learned a few things, and I'm sure I'm not alone.

Dark Photon
03-02-2010, 05:55 AM
"Also, is there an ATI equivalent of bindless graphics?"
-> you can use vertex_array_object.
But on NVidia, avoid using VAOs on top of bindless. Yes, it works, but in my experience, you'll pay a little perf for doing that (but test on your setup to be sure).

Presumably bindless gives you the VAO speed-up, and without (I assume) a bazillion little VAOs floating around in the GL driver.

Baughn
03-02-2010, 07:57 AM
Naturally, trying to use display lists ran into the problem that my vertex shader uses gl_VertexID, which appears not to be set when executing display lists.

Is there a reasonable alternative? Some way of setting a per-vertex or per-primitive counter?

skynet
03-02-2010, 08:03 AM
It's time for some new whitepapers from ATI/NVidia on how to deal with updating VBOs/UBOs/PBOs quickly. Clean up the myths and get the facts straight. I'm tired of guessing.

ViolentHamster
03-02-2010, 08:18 AM
The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to *rewrite* some storage that may be pending drawing, like trying to do some sort of sub-section update to a mesh, say.


Rob, I'm not sure I understand. You still need a sync before you go to draw, though, don't you? Should the application keep track of active and inactive VBOs (the GPU may be drawing with the active ones while the inactive VBOs are ready to be recycled)?



The flip side of that is that you can do high performance dynamically generated batches of mixed sizes with no fences at all, and with low driver overhead, if you constrain your access patterns to only write/use any given segment of a buffer exactly once before the buffer is orphaned. This is a familiar pattern from the D3D playbook.


I'm curious about your "dynamically generated batches". Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them? I'd really like to be able to put all my vertex data directly into VBOs. Unfortunately, I have multiple LODs and I don't know which LODs I need until I'm finished culling. If I put all my LODs into VBOs, I'd have hundreds of MBs of VBO data. I'm struggling with how to fill VBOs with the correct subset of vertex data while giving the GPU enough time between the upload and the draw call...

Thanks.

Dark Photon
03-02-2010, 08:24 AM
Its time for some new whitepapers from ATI/nVidia on how to deal with updating VBOs/UBOs/PBOs quickly. Clean up some myths and get straight on the facts. I'm tired of guessing.
Agreed! Death to the Ouija Board! :sorrow:

Also, interesting blog post from Sunday on this very topic: One More On VBOs - glBufferSubData (http://hacksoflife.blogspot.com/2010/02/one-more-on-vbos-glbuffersubdata.html)

Rob Barris
03-02-2010, 10:47 AM
The key idea is really that careful fencing and sync efforts are only needed in the absence of orphaning and in cases where you are going back to *rewrite* some storage that may be pending drawing, like trying to do some sort of sub-section update to a mesh, say.


Rob, I'm not sure I understand. You still need a sync before you go to draw, though, don't you? Should the application keep track of active and inactive VBOs (the GPU may be drawing with the active ones while the inactive VBOs are ready to be recycled)?



The flip side of that is that you can do high performance dynamically generated batches of mixed sizes with no fences at all, and with low driver overhead, if you constrain your access patterns to only write/use any given segment of a buffer exactly once before the buffer is orphaned. This is a familiar pattern from the D3D playbook.


I'm curious about your "dynamically generated batches". Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them? I'd really like to be able to put all my vertex data directly into VBOs. Unfortunately, I have multiple LODs and I don't know which LODs I need until I'm finished culling. If I put all my LODs into VBOs, I'd have hundreds of MBs of VBO data. I'm struggling with how to fill VBOs with the correct subset of vertex data while giving the GPU enough time between the upload and the draw call...



I'll try to boil this down a bit. First let's define a workload, then we look at how you can feed it to GL. If your app doesn't match the workload, then this may not apply to you.

workload: say the CPU wants to draw a series of batches where each one is based on data generated or unpacked right before issuing of the draw request. Once written, the data is not going to be modified or read back by the CPU. The goal is to efficiently let the GPU have access to the newly written data, and to avoid bogging down with excessive allocation or synchronization on a per-draw basis.

( As a hypothetical example, say we're using the CPU to deform and draw hundreds of falling leaves, where the leaf-shape algorithm runs on the CPU, and can be used to generate new batches of verts for each leaf at will )

So, you can do this with one VBO and no fences, and it can run really well. The magic is hiding in the buffer-orphaning step.

So make a VBO with glBindBuffer, and set its size with glBufferData. A few megabytes is good.

Init a "cursor / offset" to zero.

for each batch:
- figure out how many bytes it will be.
- round it up to some nice power of two multiple, 64 is good.
* orphan current VBO if this batch won't fit (see below).
- map the buffer using UNSYNCHRONIZED, at the current cursor offset, asking for the padded number of bytes to be visible. (On Apple flush-buffer-range, you can map in unsynchronized fashion, you just can't pick the range, so you always get the base address back - just add the offset to it)
- write the data at the beginning of the mapped range.
- unmap.
- increment the cursor by the padded size used.
- issue the draw call after setting vertex attrib pointers appropriately into the VBO, keeping the offset in mind.
- repeat.

Note, if you are using an asynchronous or multithreaded driver, you might well get 40, 50, 100 batches written into that VBO (and draw commands enqueued) before the GPU even looks at that first byte. That's OK. You just want the client thread to get in and out of that VBO as fast as possible so it can stay busy doing work.

At some point the cursor will have moved far enough such that the next batch of data will not fit - i.e. offset + padded size exceeds the total size of the VBO. Note the starred step above.

When this eventuality happens (and when it happens will vary depending on the size of the batches you've been dropping into the VBO), the response is very simple; a code sketch of the full loop follows the steps below.

- orphan current storage by doing a new glBufferData using the fixed size chosen for the VBO, and a NULL pointer.
- rewind cursor to offset 0.
- continue.
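Boiled down to code, the whole loop is something like this (a sketch assuming GL 3.0 / ARB_map_buffer_range and fixed-function attribs; the batch parameters are stand-ins):

#include <GL/glew.h>
#include <string.h>

#define VBO_SIZE (4 * 1024 * 1024) /* one 4MB VBO */

static GLuint     vbo;    /* made at init: glBufferData(VBO_SIZE, NULL) */
static GLsizeiptr cursor; /* write offset, starts at 0 */

static GLsizeiptr pad64(GLsizeiptr n) { return (n + 63) & ~(GLsizeiptr)63; }

void write_and_draw_batch(const void *batch, GLsizeiptr bytes, GLsizei verts)
{
    GLsizeiptr padded = pad64(bytes);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    /* Starred step: orphan and rewind when the batch won't fit. */
    if (cursor + padded > VBO_SIZE) {
        glBufferData(GL_ARRAY_BUFFER, VBO_SIZE, NULL, GL_STREAM_DRAW);
        cursor = 0;
    }

    /* Unsynchronized map of just this batch's range: no stall, since
       we promise never to rewrite previously emplaced data. */
    void *dst = glMapBufferRange(GL_ARRAY_BUFFER, cursor, padded,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_INVALIDATE_RANGE_BIT |
                                 GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(dst, batch, bytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    /* Point the attribs at this batch's offset and draw right away. */
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (const void *)cursor);
    glDrawArrays(GL_TRIANGLES, 0, verts);
    glDisableClientState(GL_VERTEX_ARRAY);

    cursor += padded;
}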

The subsequent map result will look at new storage, a clean sheet. The old storage belongs to the driver, it's no longer associated with the VBO ID that you have in your code. So from one point of view there are now two buffers of storage running around, but the one you orphaned can no longer be accessed by the client code. At some point all the draw calls that are consuming data from that storage will complete - and that storage will be freed or possibly recycled automatically.

In this model, the number of VBOs known to the client is "one". The number of floating (orphaned) *blocks of storage* could be much higher, depending on how long the GPU is taking to chew through each job and how fast the CPU can drop them off.

So you don't have to juggle "multiple VBOs"; you just need to keep blasting away at the one VBO while letting the driver swap in new chunks of storage as needed.

Client never needs to fence, or check GPU progress, or block on map.

Write&draw, write&draw, repeat til VBO full, orphan and rewind cursor, repeat. CPU gets to drop off all of its data and draw requests and potentially go on to do other tasks without a care as to how many orphaned buffers (storage blocks) wind up in flight or how fast the GPU is retiring them.

So in the hypothetical example, you might completely fill one buffer with leaf shapes (and have a draw pending on each one), orphan it, start pumping leaves into the VBO again starting at zero offset, process repeats. Are you getting ahead of the GPU by one or more blocks of storage? Maybe. Do you care? No. Let the GPU and driver catch up on their own time (ideally on an alternate CPU core). Keep that client drawing thread unblocked.

The driver only sees fixed-size VBO blocks coming and going. Its job of recycling those chunks of storage is greatly simplified. Draw events should outnumber orphan events by some healthy multiple - only you know the likely spectrum of batch sizes. Orphaning 128MB VBOs is probably too big. Orphaning 2-4MB VBOs, no big deal.

Going back to your questions

You still need a sync before you go to draw, though, don't you?

Not in this style. You map, write the data, and unmap, you can issue a draw call on that data right away. (An async driver is just stacking up these draw requests to process in order). The key is that you get control back into your code as soon as possible so you can crank up the next batch's data. You stay disconnected from any idea of how much work the GPU has done or is about to do.

There is a subtlety: the "next batch" will usually be mapping the same buffer/storage, but you are not going to alter or step on any data previously emplaced - the ascending cursor sees to that. The world doesn't end if the GPU reads from address A while you write to address B and they are different.

Again if your workload doesn't fit this model, you would need to do more explicit sync effort possibly using fences to know "when" it is safe to touch any given region of storage. But if all you do is fill, fill, fill and then orphan and start over - you never need to check or sync. The juggling of multiple blocks of storage is all in the driver and not your problem. All you need to do is be careful about only writing each section of the larger VBO once and then moving on, and you're fine.


Do you generate the batches a frame or two in advance to ensure you have time to upload them to the GPU before you need to draw with them?

Not really. My thinking is usually along the lines of "what steps can I take such that the CPU can maximize its rate of work delivery, and get control back without having to wait for that work to complete?"

If you are trying to manage the contents of a VBO such that some portions of it stay constant while other portions are changing, that's a workload where you would probably have to start using fences or other heuristics to schedule overwrites of pieces of it. (One heuristic is "has this chunk been used to draw anything in the last five frames" - if no, and you know the driver has a three frame queuing limit say, then you can actually infer when it's safe to overwrite that region without any sync effort, i.e. non blocking map, but you need to make sure you track carefully each segment and mark them in your own data structure when they were last used for draw).

OTOH glBufferSubData will always be orderly and safe for a partial VBO replacement, no matter what has happened recently, but you have to have the source data in copyable form, whereas with mapping you can combine decompression and delivery into the buffer.

IMO the application usually knows more about its operational history than the driver does, and is in a better position to make clever decisions about when sync is needed, which is why MapBufferRange has the unsynchronized option.

whew.

ViolentHamster
03-02-2010, 10:55 AM
Thanks for your response. Let me read through that... When do you sleep?

Rob Barris
03-02-2010, 11:05 AM
I'm just wakin' up :)

ViolentHamster
03-02-2010, 11:53 AM
Is this approach best suited for highly dynamic objects that are rendered a few frames behind their CPU positions? With orphaning, you don't draw the same position twice. You always have a fill/draw/fill/draw?

What if you didn't draw leaves? What if you drew static objects like terrain, or objects that needed collision detection? Would you have to use another approach?

Alfonse Reinheart
03-02-2010, 01:03 PM
What if you didn't draw leaves? What if you drew static objects like terrain, or objects that needed collision detection?

If you're drawing static terrain, you use static buffer objects. Upload once, draw many. GL_STATIC_DRAW. There isn't really an approach for that ;)

This approach is for objects that you need to constantly generate data for.

ViolentHamster
03-02-2010, 02:08 PM
If you're drawing static terrain, you use static buffer objects. Upload once, draw many. GL_STATIC_DRAW. There isn't really an approach for that ;)


If I'm creating static buffer objects at runtime, how do I ensure that they have been uploaded before I go to draw with them? I don't want the draw calls to block until the GPU receives all the data.

I'd like to be able to say, "Hey GPU, upload this high resolution LOD. Let me know when you're done. In the meantime, can you draw with the low resolution LOD? Thanks for not blocking and causing terrible frame breaks, GPU. You're super."

Alfonse Reinheart
03-02-2010, 02:41 PM
I'd like to be able to say, "Hey GPU, upload this high resolution LOD. Let me know when you're done. In the meantime, can you draw with the low resolution LOD? Thanks for not blocking and causing terrible frame breaks, GPU. You're super."

If it's a static buffer, doesn't that mean you're uploading it at "initialization" time? And how would you know that the low resolution LOD is uploaded yet if you're not sure about the high LOD?

ViolentHamster
03-02-2010, 02:53 PM
If it's a static buffer, doesn't that mean you're uploading it at "initialization" time?
No. Imagine you have more data than will fit on the GPU, and you can't display a "Loading" screen as the character moves - _very_ quickly.



And how would you know that the low resolution LOD is uploaded yet if you're not sure about the high LOD?

For the terrain or model in question, you'd have to display nothing at first, then the low res. I think I can figure out when nothing is ready. ;)

Alfonse Reinheart
03-02-2010, 03:25 PM
Imagine you have more data than will fit on a GPU and you can't display a "Loading" screen as the character moves--_very_ quickly.

So you are streaming. Then it isn't a static buffer, is it ;)

In order to do a streaming world, you have to have some memory set aside for doing streaming into. And since you're streaming to the GPU, this would include buffer objects.

These buffer objects, just like the streaming space in main memory, are not currently in use. They're not currently being rendered from. So there's no need to orphan them. Just upload data to them, and when you need them, display them. If you need more time, then extend the boundaries of the streaming blocks.

Even across a PCIe bus, you can expect 1GB/sec transfer speeds. So in approximately 1 second, you can replace the entire contents of your GPU's memory.

So just make sure that you pad your streaming time by, say, 0.5 seconds. If you are streaming X segments from disk, and it takes on average 1.5 seconds to get that data from disk, make sure that your application has a 2 second window between the time it provoked the streaming and the time it starts using it.


I think I can figure out when nothing is ready.

How? If you think stalls are being created from a buffer object not being finished uploading, how do you know that a smaller buffer is finished uploading?

ViolentHamster
03-02-2010, 06:16 PM
Thanks Alfonse. That's a good way of looking at it. I'll have to figure out how to page ahead for my application.

Dark Photon
03-03-2010, 07:31 AM
So you don't have to juggle "multiple VBOs"; you just need to keep blasting away at the one VBO while letting the driver swap in new chunks of storage as needed.

Client never needs to fence, or check GPU progress, or block on map.

Write&draw, write&draw, repeat til VBO full, orphan and rewind cursor, repeat. CPU gets to drop off all of its data and draw requests and potentially go on to do other tasks without a care as to how many orphaned buffers (storage blocks) wind up in flight or how fast the GPU is retiring them.
Client arrays for VBOs. Slick. Thanks for the detailed write-up!

Chris Lux
03-04-2010, 01:25 AM
hi,
just so I'm understanding correctly:

does orphaning a buffer mean glBufferData(..., NULL) or mapping with invalidation?

Dark Photon
03-04-2010, 05:20 AM
does orphaning a buffer mean glBufferData(..., NULL) or mapping with invalidation?
Yes, either. Per Rob in the previous post:

...orphan current storage by doing a new glBufferData using the fixed size chosen for the VBO, and a NULL pointer.

Rob Barris
03-07-2010, 01:26 PM
There are two kinds of invalidation that MapBufferRange can do, and they have very different purposes.

One is tied to MAP_INVALIDATE_BUFFER_BIT in the access parameter to MapBufferRange. This essentially means "orphan". So in the usage I described, you could set this bit when you go back to offset 0 and get the same effect as BufferData(NULL).

The other is a bit more subtle, and is tied to MAP_INVALIDATE_RANGE_BIT. This may seem a bit redundant, but it is important. It explicitly tells the driver up front that the range you are mapping does not need to contain valid data you can read - a signal that the driver is free to replace every single byte in that range with whatever is in your CPU-visible mapped buffer area upon unmap (or explicit flush).

The freedom this provides to the driver, if you have also set the WRITE bit but not the READ bit, is that it can hand back a pointer to completely uninitialized scratch memory - which may well be driver allocated for write-through uncached access etc. By opting into invalidation of the range, you eliminate any need for the driver to put a copy of valid data in that range prior to returning the pointer. If an implementor wanted to keep system-memory images of buffers to a minimum, this would let that driver provide scratchpad memory for maps using these bits (write + invalidate-range) - and then transfer those bits to the final destination later, perhaps via DMA.

Restated more simply, think of MAP_INVALIDATE_RANGE_BIT as a "promise to write the whole range, nothing but the range, and never read from the range" bit.
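In code, the contrast looks roughly like this (a sketch assuming GL 3.0 / ARB_map_buffer_range; the buffer, offset, and size names are illustrative):

#include <GL/glew.h>

void invalidate_examples(GLuint vbo, GLsizeiptr size,
                         GLintptr off, GLsizeiptr len)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    /* MAP_INVALIDATE_BUFFER_BIT: orphan the whole store - the same
       effect as glBufferData(target, size, NULL, usage). */
    void *whole = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                   GL_MAP_WRITE_BIT |
                                   GL_MAP_INVALIDATE_BUFFER_BIT);
    /* ... write anywhere in [0, size) ... */
    (void)whole;
    glUnmapBuffer(GL_ARRAY_BUFFER);

    /* MAP_INVALIDATE_RANGE_BIT: "I will write the whole range, nothing
       but the range, and never read from it" - so the driver may hand
       back uninitialized, possibly write-combined scratch memory. */
    void *range = glMapBufferRange(GL_ARRAY_BUFFER, off, len,
                                   GL_MAP_WRITE_BIT |
                                   GL_MAP_INVALIDATE_RANGE_BIT);
    /* ... write exactly the bytes you mean to supply ... */
    (void)range;
    glUnmapBuffer(GL_ARRAY_BUFFER);
}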

Cannos
07-12-2010, 12:11 AM
Great info in this topic; thanks, everyone! Quick question: on OpenGL implementations without MapBufferRange support (e.g. OpenGL ES), are there any good alternative ways to implement the dynamic vertex ring buffer that Rob suggested? It sounds like glBufferSubData has some pitfalls in the general case (and with the style of workload described). Is it best to stick with standard non-VBO vertex arrays in this case? Thanks!