View Full Version : mega vbo, any reservations?



ugluk
09-08-2010, 12:26 AM
Once upon a time we had a discussion about having a single mega-VBO containing all the attributes and indices an app might need. The only reservation was that swaps between CPU and GPU memory might become slow. But what about mobile & embedded platforms, where only one GL app runs at a time? And with games for the usual desktop computers, usually only one game runs at a time, as you can't really play two or more at once.

I'm itching to try the mega-VBO in practice, but, as you well know, this is a major change, and I'd like to hear about any reservations before I jump into implementing it.

Dark Photon
09-08-2010, 06:56 AM
I'm itching to try the mega-VBO in practice, but, as you well know, this is a major change, and I'd like to hear about any reservations before I jump into implementing it.
If you've concentrated your scene glDraw*Elements code in one spot (i.e. one "drawable" or "draw batch" class), then the change isn't huge.

Reservations against? Not really, and there are some compelling arguments for doing it. GPU memory allocation/fragmentation = expensive. VBO switching = expensive. Can't afford to have all loaded batches on the GPU at all times (in some apps anyway). etc.

But just to clarify, what specifically are you contemplating when you say "mega VBO"? There have been a number of posts talking about packing multiple batches into one VBO. Many of them talk about pre-packing batches into VBOs (in a DB build step or something). Then there have been a few talking about dynamically streaming just the batches you need into a "mega VBO" (e.g. this one (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=273141#Post273141)). I'm talking about the latter.

The main difference being whether you are storing big chunks of batch data on the GPU whether you need it or not (i.e. whether any of them pass CULL or not), or just storing those that you need based on CULL results. Or a mix.

ugluk
09-08-2010, 01:36 PM
I meant storing multiple batches on the GPU, without dynamically changing them at runtime, in a single big VBO. Maybe this is standard practice already, but I've read in these forums that the GPU, after running out of memory, might swap some VBOs into system memory, and with a mega-VBO the swapping could prove slow. But, I guess, if you allocate a mega-VBO and the GPU can't accommodate it, then it will return an error and that will be that. You need to make sure up front that it won't be too big.

I'll have to test what happens if you try to allocate, say, a 1 GB VBO on a 512 MB GPU.

result:

02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450] (prog-if 00 [VGA controller])
Subsystem: Hightech Information System Ltd. Device 2291
Flags: bus master, fast devsel, latency 0, IRQ 30
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at feae0000 (64-bit, non-prefetchable) [size=128K]
I/O ports at e000 [size=256]
Expansion ROM at feac0000 [disabled] [size=128K]
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information <?>
Capabilities: [150] Advanced Error Reporting
Kernel driver in use: fglrx_pci
Kernel modules: fglrx

Apparently I have 512 MB on my GPU, and I do:


GLuint buffer_object;

glGenBuffersARB(1, &buffer_object);
GL_DEBUG();
glBindBufferARB(GL_ARRAY_BUFFER, buffer_object);
GL_DEBUG();
glBufferDataARB(GL_ARRAY_BUFFER, 1024 * 1024 * 1024, 0, GL_STATIC_DRAW);
GL_DEBUG();


and get:



out of memory

test_array_stored: test/test_array_stored.cpp:34: int main(int, char**): Assertion `0' failed.


I am worried that when I allocated a 600-800 MB VBO it actually succeeded, so you can apparently allocate more memory than is available.

Anyway, in my app I don't have the *Draw*Elements calls grouped, but I'll simply drop all the VBO binds that are there and assume the correct VBO is always bound.

Alfonse Reinheart
09-08-2010, 07:08 PM
I am worried that when I allocated a 600-800 MB VBO it actually succeeded, so you can apparently allocate more memory than is available.

Why? There's nothing that says the entire buffer object must be in video memory at the same time. As long as it functions as a valid buffer object, you shouldn't have a problem.

Dark Photon
09-08-2010, 08:32 PM
I meant storing multiple batches on the GPU, without dynamically changing them at runtime, into a single big VBO. ... if you try to allocate a, say, 1 GB VBO on a 512 MB GPU.
Yeah, this is the kind of thing that worries me with the "just put all the batch data we could possibly need to render with on the GPU" approach. You end up with ginormous mem req'ts, wasting lots of GPU mem on stuff that may never be used by the GPU pipeline (and even that assumes we somehow know we can allocate that big a VBO and that the GPU can keep it hot and resident).

Versus the approach of just caching the batches you need to render with, or rendered with recently, on the GPU. This amount of data actually ends up being fairly small. So instead of talking about 0.5-1 GB or something, we're maybe talking 0.5-10 MB or so. Much more reasonable.


I am worried, that when I allocated a 600-800 MB VBO, it actually succeeded, so you can apparently allocate more memory than is available.
Yes, that is a concern.


Anyway, in my app, I don't have the *Draw*Elements calls grouped, but I'll simply drop all the VBO binds that are there and assume always the correct VBO is bound.
You can do that, or just do lazy binds (i.e. only bind if the last bind wasn't the same as the current bind). This is sometimes termed the state-tracking or removing-duplicate-calls approach, and it uses thin wrapper functions around GL calls.
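Just to illustrate (hypothetical names here, not anyone's production code), such a wrapper can be as simple as:

// Thin wrapper that shadows the current GL_ARRAY_BUFFER binding
// and skips redundant binds. All binds must go through it, or the
// shadowed state goes stale.
static GLuint current_array_buffer = 0;

void lazy_bind_array_buffer(GLuint buffer)
{
    if (buffer != current_array_buffer)
    {
        glBindBuffer(GL_ARRAY_BUFFER, buffer);
        current_array_buffer = buffer;
    }
}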

ugluk
09-08-2010, 10:59 PM
Yeah, this is the kind of thing that worries me with the "just put all the batch data we could possibly need to render with on the GPU" approach. You end up with ginormous mem req'ts, wasting lots of GPU mem on stuff that may never be used by the GPU pipeline (and even that assumes we somehow know we can allocate that big a VBO and that the GPU can keep it hot and resident).

You can take the approach: either there is enough memory, or screw it. Even if GL screws my app by allocating a part of the VBO in system memory, the user will notice a delay, assume his "bad" card is at fault, and rush to buy a new one :)

Well, unless the app developer is Microsoft, I suppose you need to take the caching approach. Or make more than one VBO and put up with swapping.


Versus the approach of just caching the batches you need to render with, or rendered with recently, on the GPU. This amount of data actually ends up being fairly small. So instead of talking about 0.5-1 GB or something, we're maybe talking 0.5-10 MB or so. Much more reasonable.

So, what if the cache VBO lacks a lot of the necessary batches? A noticeable delay taken up by loading. If I understand correctly, your algorithm would be as follows:

when a batch is needed for rendering, check whether it is in the cache VBO; if not, load it into the cache VBO.

What happens when the cache VBO is full? Discard the VBO and start reloading batches into it? Otherwise fragmentation will occur. Maybe a part of the cache VBO is permanent and only parts are discardable? Anyway, do you have any links to descriptions of caching algorithms? What if you cached spatially, i.e. defined regions with lists of batches that might be needed, loaded the VBO with the batches needed in the current region as well as neighboring regions, and reloaded when crossing boundaries between regions?

Also, what about VAOs? If you cache, the location of a batch may change, rendering its VAO useless. In order to fix the VAO, you need to bind it, which takes up valuable time.


You can do that, or just do lazy binds (i.e. only bind if the last bind wasn't the same as the current bind). This is sometimes termed the state-tracking or removing-duplicate-calls approach, and it uses thin wrapper functions around GL calls.

Still, it requires an "if" and some wasted memory for state tracking, but I guess this is better than a wasted bind.

Anyway, I implemented a part of the mega-VBO approach. I will eventually have two mega-VBOs, one for attributes and another for indices, due to alignment concerns. But currently I have a single VBO for both vertex attributes and indices. The layout looks like this:

| attributes1
| indices1
| attributes2
| indices2
...

Of course, if the layout is packed it can cause attributes and indices to become unaligned, and what happens on my card in that case is completely random: even if the scene remains static between frames, GL will render random stuff. Is this allowed with vertex attributes on a BYTE boundary? My understanding was that it should only increase rendering time. As soon as vertex attributes are aligned on a DWORD boundary, at least on my card, everything renders correctly and fast. If I align vertex attributes on a WORD boundary, I notice a large delay, but everything renders correctly.

I've also noticed a speed-up, due to the mega-VBO and the now-missing binds, when aligning at the DWORD boundary.

EDIT:
Now I've done tests with two mega-VBOs, one for indices, the other for attributes, and the performance is slightly worse (a hundredth of a millisecond) than with a single mega-VBO for both indices and vertices.

Dark Photon
09-09-2010, 05:04 AM
Versus the approach of just caching the batches you need to render with, or rendered with recently, on the GPU. This amount of data actually ends up being fairly small. So instead of talking about 0.5-1 GB or something, we're maybe talking 0.5-10 MB or so. Much more reasonable.
So, what if the cache VBO lacks a lot of the necessary batches? A noticeable delay taken up by loading.
Here's the thing. First frame bringing up your level, maybe you take a little longer because you're uploading more batches. But in subsequent frames, a substantial portion of the batches you render were rendered the last frame, so they're already there. If you envision the frustum panning around, you only have potentially new batches on the leading edges of the view frustum. So the amount of new data you need to upload is small. This technique takes advantage of temporal coherence in your rendering.

But even for that first frame, is it noticeable? Not so much. You end up with something like client-arrays perf for those, which isn't too bad. Worst case, maybe you break just the first frame after you bring up a level, if you were close to breaking anyway.


If I understand correctly, your algorithm would be as follows: when a batch is needed for rendering, check whether it is in the cache VBO; if not, load it into the cache VBO.
Yes, more specifically, just append it using the Streaming VBO approach Rob Barris describes here (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=273141#Post273141).

And if it's already there, just launch the batch using NVidia bindless (http://developer.nvidia.com/object/bindless_graphics.html) (if on NVidia) to get display list perf for that batch.


What happens when the cache VBO is full? Discard the VBO and start reloading batches into it?
Yeah, basically orphan it, and the next frame you just end up reloading anything you're still using at the front of the new orphan. This works well enough. But if I ever get to a problem with frame breakage (due to the app being too close to breaking frame anyway), then I'll have to do something more slick, like having 2 stream VBOs and, over the course of multiple frames, expiring things from the old VBO and loading them packed up-front into the new VBO, rather than doing this all in one frame (the first frame after an orphan).


Otherwise fragmentation will occur.
Well, with Rob's streaming VBO technique, fragmentation can't occur. He explicitly targeted it at dynamic data where there's no reuse.

But even with static data and reuse, you just end up with parts of the VBO in the current orphan that aren't being rendered in the current frustum anymore. Once you fill up, then you orphan, and then on the subsequent frame you'd reload what you are still using to the front of the new orphan.

Essentially an orphan event + refill gives you stream compaction so you don't have long-term fragmentation to deal with.


Maybe a part of the cache VBO is permanent and only parts are discardable?
I've avoided that, because then yeah, you have to deal with fragmentation. Ugly.


Anyway do you have any links to descriptions of caching algorithms?
For what I'm doing, all you need is Rob's post. It just uses a VBO as a big ring buffer. Just orphan when you wrap.
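In sketch form (my names, simplified; assumes ARB_map_buffer_range and <cstring> for memcpy; offset alignment omitted for brevity):

struct StreamVBO
{
    GLuint     id;        // buffer object used as a ring buffer
    GLsizeiptr size;      // total capacity in bytes
    GLsizeiptr offset;    // current append cursor
    unsigned   orphan_id; // bumped every time we orphan
};

// Append a block of batch data; orphan and wrap when full.
// Returns the offset to render from (vertex/index pointer offset).
GLsizeiptr stream_append(StreamVBO& vbo, const void* data, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo.id);
    if (vbo.offset + bytes > vbo.size)
    {
        // Orphan: the driver hands us fresh storage; in-flight draws
        // keep the old storage alive until they complete.
        glBufferData(GL_ARRAY_BUFFER, vbo.size, NULL, GL_STREAM_DRAW);
        vbo.offset = 0;
        ++vbo.orphan_id;
    }
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, vbo.offset, bytes,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(dst, data, bytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    GLsizeiptr where = vbo.offset;
    vbo.offset += bytes;
    return where;
}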


What if you cached spatially, i.e. defined regions with lists of batches that might be needed and then load the vbo with batches needed in the current region, as well as neighboring regions, reloading when crossing boundaries between regions.
Yeah, there are definitely lots of possibilities once you start considering this. I like to keep it as simple as possible and only make it more complex if I have to.

Another permutation to consider is to have separate stream VBOs for static and dynamic data. That way, the dynamic data won't prompt more orphans of the static data than needed. But again, haven't needed this for perf, so I've only considered it.


Also, what about VAOs? If you cache, the location of a batch may change, rendering its VAO useless. In order to fix the VAO, you need to bind it, which takes up valuable time.
Personal taste: I don't like VAOs. A bazillion little objects in the GL driver to generate cache misses and cause slowdowns. I get everything VAOs can give, and a good bit more, from NVidia bindless (on an NVidia GPU).

And in my most recent tests, applying VAOs on top of bindless costs you a little perf, so on NVidia I only use bindless.
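For the curious, the bindless dispatch path looks roughly like this (my reading of the GL_NV_vertex_buffer_unified_memory spec, stripped down to one attribute; not a drop-in implementation):

// One-time setup (and again after every orphan, since glBufferData
// gives new storage and thus a new GPU address): make the buffer
// resident and fetch its GPU address.
GLuint64EXT vbo_addr = 0;
glBindBuffer(GL_ARRAY_BUFFER, stream_vbo_id);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV,
                            &vbo_addr);

// Switch attribute and index fetching over to GPU addresses.
// (Attribute array 0 enabled elsewhere via glEnableVertexAttribArray.)
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableClientState(GL_ELEMENT_ARRAY_UNIFIED_NV);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, stride);

// Per batch: no buffer binds at all, just address ranges + draw.
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0,
                       vbo_addr + attrib_offset, attrib_bytes);
glBufferAddressRangeNV(GL_ELEMENT_ARRAY_ADDRESS_NV, 0,
                       vbo_addr + index_offset, index_bytes);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, 0);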


Anyway, I implemented a part of the mega-vbo approach. I will eventually have two mega-vbo's, one for attributes and another for indices, due to alignment concerns.
Re alignment, I use the trick offered in that thread of just rounding offsets up to multiples of 64 or something nice. No real cost or benefit for it that I've seen, but you can if you want. And no problems here with dumping attributes and indices into the same VBO.
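FWIW the rounding itself is just the usual power-of-two align-up:

// Round 'offset' up to the next multiple of 'align' (power of two),
// e.g. align_up(cursor, 64) before writing each batch.
inline GLsizeiptr align_up(GLsizeiptr offset, GLsizeiptr align)
{
    return (offset + align - 1) & ~(align - 1);
}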


Of course, if the layout is packed it can cause attributes and indices to become unaligned, what happens on my card in that case is completely random. Even if the scene remains static between frames GL will render random stuff.
Seriously? I'm not sure, but I believe that's either a driver bug or your bug.

I ended up with a similar effect (garbage output: link (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=280280)) when I split the attribute and index blocks by an orphan. This is invalid. You can't do that. Both the attribute and index blocks must be in the same buffer orphan, as the content is "latched" at the draw call.


Is this allowed with vertex attributes on a BYTE boundary? My understanding was that it should only increase rendering time.
Think so, but I'd round each up to a multiple of 64 (latest-gen cache line size) just for kicks, to see if that fixes your problem.


As soon as vertex attributes are aligned on a DWORD boundary, at least on my card, everything renders correctly and fast.
Interesting. Which card is that? I'd like to bookmark that thought.

nickels
09-09-2010, 07:55 AM
Interesting discussion. Not sure I understand all the finer points, but as a testimonial, I have implemented my quake level viewer with one VBO and one Index buffer (GL_STATIC_DRAW, I believe). I dispatch faces separately using offsets into the index buffer. Haven't seen any issues. Static and dynamic meshes all have their own vbo and index buffers...
I hadn't thought about the potential for loading a level larger than the GPU memory by letting opengl swap buffers between GPU/CPU. Interesting...

mhagain
09-09-2010, 08:58 AM
Still, it requires an "if" and some wasted memory for state tracking, but I guess this is better than a wasted bind.
To be honest, if you're worried about the overhead of an "if" and a handful of bytes, your app has bigger problems than its VBO implementation. ;)

I wouldn't be inclined to call such memory "wasted" either; it's being "used" instead, to ensure that you don't issue redundant state changes.

ugluk
09-09-2010, 03:05 PM
To be honest, if you're worried about the overhead of an "if" and a handful of bytes, your app has bigger problems than its VBO implementation. ;)
Who says that I am worried? Whenever I see an "if" I try to think of ways around it. There are many examples of people managing to do so. To me "if"s have a taste of suboptimality to them, as I suppose they do to us all. This doesn't mean I don't use them in my code.

Yes, more specifically, just append it using the Streaming VBO approach Rob Barris describes here.

And if it's already there, just launch the batch using NVidia bindless (if on NVidia) to get display list perf for that batch.
How do you keep track of which batches are cached and which are not? I've tried using std::map<> for cache tracking before, but promptly dropped it, as it was too much of a time waster.

Also, where are the textures in all this? Do you simply upload all textures to the GPU and rely on GL for swapping?

How do you determine the size of the cache VBO(s)?

Yeah, basically orphan it, and the next frame you just end up reloading anything you're still using at the front of the new orphan. This works well enough. But if I ever get to a problem with frame breakage (due to the app being too close to breaking frame anyway), then I'll have to do something more slick, like having 2 stream VBOs and, over the course of multiple frames, expiring things from the old VBO and loading them packed up-front into the new VBO, rather than doing this all in one frame (the first frame after an orphan).
Yeah, you take what is used most often and put it in the second VBO, and when the first fills up, you use the second, where there will hopefully be fewer batches to load. What criterion do you use for expiring? Do you simply take the most often used batch in a frame and move it to the second VBO? You need to keep some of the second VBO free, though, for possible loading. How much of it do you keep free?

I've avoided that, because then yeah, you have to deal with fragmentation. Ugly.
Orphaning a VBO discards it, so the user can't really reuse any of it anymore. So keeping part of the VBO the same and changing only part of it without orphaning probably can't be done.

Re alignment, I use the trick offered in that thread of just rounding offsets up to multiples of 64 or something nice. No real cost or benefit for it that I've seen, but you can if you want. And no problems here with dumping attributes and indices into the same VBO.
Yes, you've got to try different alignments. I've used 4 and it worked. For some other card another magic number might be more appropriate.


Seriously? I'm not sure, but I believe that's either a driver bug or your bug.

I ended up with a similar effect (garbage output: link) when I split the attribute and index blocks by an orphan. This is invalid. You can't do that. Both the attribute and index blocks must be in the same buffer orphan, as the content is "latched" at the draw call.

I see, they were in different orphans. Interesting.

As for my "problem", it definitely smells like a driver bug. I didn't orphan any VBO, and I don't change anything about the scene or the viewing, yet random triangles are rendered from frame to frame. Has to be a driver bug. The card is:
02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450]

After realigning, the "problem" disappears, another indication of a driver bug.

Think so, but I'd round each up to a multiple of 64 (latest-gen cache line size) just for kicks, to see if that fixes your problem.
It works, just like the 4 byte alignment. No real performance improvement though.

mhagain
09-09-2010, 04:29 PM
If I understand Rob's approach correctly, it's almost the exact same as something I've implemented in D3D, and the approach is that you don't actually need to keep track of batches you've cached. Forget about tricks like that, just blast each batch into the VBO as it passes and either at the end of the frame or when it's going to overflow you rewind your cursor back to the start.

What you do need to keep track of is when state changes occur, as obviously you need to either record those state changes somewhere for later replay, or draw everything that's accumulated to this point before applying the changes. The approach I took in D3D was to record them in a struct (as a callback function that applies them) together with firstvertex, numvertexes, firstindex and numindexes, and then replay them when the VBO is rewound (or at the end of a frame).

Obviously both state change filtering and sorting objects by state is really going to help a lot here when it comes to keeping batches as big as possible.

I just kept a static array of structs and filled them in as required; I found 1024 structs to be excessive but it's what I went with anyway. The exact size will depend on your application's needs.

STL constructs are nice for sure, but in this case I went with the more primitive option as being simpler and with less processing overhead. Knowing when to not use powerful tools is equally as important as knowing when to use them. ;)

A test for when the struct array was going to overflow was also needed, so I added that. With hindsight, it may have been perfectly fine to draw immediately instead of recording stuff, but I did want to be able to replay an arbitrary number of times (in the end I didn't have that requirement, so I didn't need to record after all).

VBO sizes are going to be application dependent. I'd say create an initial one with a huge size, run your app, see how much space you need for a typical scene, then double it. I was using 16-bit indexes so I was necessarily constrained by that, having no more than 65536 vertexes.

The trick is to keep the initial "Lock"s (or MapBuffers) (preceded by BufferData NULL in GL) as few as possible, preferably only one per frame. Subsequent "Lock"s (or Maps) will be appending only, so they don't need to orphan (or discard, in D3D terminology) the entire buffer, just write to the cursor position. When it fills and is about to overflow, or when a new frame begins, is the only time you need to discard/orphan it again.

If you're really clever you could potentially set things up so that it doesn't even discard/orphan on a new frame, but just keeps appending. I personally found that - especially with 16 bit indexes - a discard/orphan on a new frame was the best way to go, otherwise there is a risk of needing to do so in the middle of most frames, so it was going to happen anyway, and it just kept things simpler this way.

I found no performance improvements whatsoever from double (or even quadruple) buffering, by the way. Most likely because my buffers were small enough to begin with, so the discard/orphan operation was about the same speed as switching to a new buffer.

The really nice outcome of this was that I was able to have a single DrawIndexedPrimitive (D3D-ese for glDraw(Range)Elements) for absolutely everything on-screen, as well as mix system memory vertex arrays (or the D3D equivalent) with VBOs while retaining 99% of the very same code. There's still some room for further optimization; a lot of objects in the scene would more or less reuse the same indexes, for example, so I can adjust the code to allow for that, but right now there's no pressing need to. As a technique it works very well.

ugluk
09-09-2010, 04:59 PM
If I understand Rob's approach correctly, it's almost the exact same as something I've implemented in D3D, and the approach is that you don't actually need to keep track of batches you've cached. Forget about tricks like that, just blast each batch into the VBO as it passes and either at the end of the frame or when it's going to overflow you rewind your cursor back to the start.
So the algorithm is like this:

1. get a list of the needed batches,
2. iterate over the struct array and draw the listed batches from the VBO if you find them; otherwise send them to the VBO from system memory and append the new batch state to the end of the struct array.

If the list of batches has length M and the struct array length N, this gives O(M * N) complexity.


The really nice outcome of this was that I was able to have a single DrawIndexedPrimitive (D3D-ese for glDraw(Range)Elements) for absolutely everything on-screen...
How is that possible? If you had written glMultiDrawElements() I'd understand, but glDraw(Range)Elements? You can't specify multiple regions of indices with it. Maybe you could use it as you describe if all the indices were packed together and all were of the same type, but for an entire scene they'd probably have to be 32-bit. Furthermore, what if different objects needed different shaders or different shader settings (like a different lighting model)? You'd either have to switch shaders or change some settings.

mhagain
09-09-2010, 05:21 PM
No, just a single place where all the drawing is done is what I meant, not one call for absolutely everything. Sorry for the confusion.

I wouldn't even bother with checking for batches already in the VBO; I'd just resend everything from scratch each frame. Keep it as simple as possible. The eventual draw calls can then progress sequentially through your buffers instead of having to hop around from one location to another, which is very, very important for maximum performance.

Dark Photon
09-09-2010, 06:29 PM
A post from ugluk in another thread that was obviously meant for this thread:

* http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=282998#Post282998

Dark Photon
09-09-2010, 06:57 PM
Where are the textures in all this?
With this technique we're just talking batch data. For texture data you've got a dedicated spot on the GPU you're going to upload the texture to and leave it, so it's a slightly different problem. But textures can be managed with a preallocated pool and a similar upload mechanism with PBOs.


Do you simply upload all the textures there are, and then rely on GL to swap them in and out?
Oh, definitely not! Guaranteed frame breakage whenever you flip something into view that uses a texture that was paged off the GPU by the driver (or never pre-rendered with to force it on the GPU)! I avoid overcommitting GPU memory like the plague! Letting the driver page for you at the last minute when it figures out "inside your draw call" that it needs a texture that's not on the GPU is the kiss of death! :eek:


The client arrays analogy also worries me; it means all batches need to be kept around in system memory for possible uploading.
The driver's probably gonna keep a CPU-side copy anyway (AFAIK it does this for most if not all things you upload to the GPU: textures, probably buffer objects too). At least this way, there isn't a copy of "all" of your data in the driver, just your working set. This way you get to manage the copy of "all" of your data on your side of the fence.

I confess not to be a driver writer so I don't know for sure. But since GL permits the driver to kick anything off the GPU to make room to put something else on it, at any time, it makes sense that it would keep a CPU copy because then it can just free up the space rather than have to 1) allocate CPU storage, 2) copy the GPU data off, and 3) free up the space. ... and only then, now that there's room, upload what it wants to the GPU.



Yes, more specifically, just append it using the Streaming VBO approach Rob Barris describes here.
But how do you keep track of which batches are in the cache and which aren't?
Store a link to it on your "batch" object in your scene graph. Currently my streaming VBO "link" is two things: 1) offset in VBO, 2) orphan ID.

How are these used?

When you find that you need to upload a batch to the streaming VBO, you store off the "offset" that you uploaded it to and the current "orphan ID" (this being the number of times you've orphaned the VBO).

Then the next time you need to draw the batch:

1) IFF the current stream VBO orphan ID == the link's "orphan ID" in your batch node, then your batch is still "in the cache", so you just render it from the link's "offset" in the VBO with no upload;
2) otherwise, you have to re-upload it (Map UNSYNC / memcpy / Unmap) first, store off the link (orphan ID and offset) in your batch node, and then render from that offset.
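Put together with the stream_append() sketch from earlier (again, hypothetical names; a real batch has more attributes than just positions, and Batch here is a made-up type):

struct BatchLink
{
    GLsizeiptr offset;    // where the batch lives in the stream VBO
    unsigned   orphan_id; // which orphan that offset belongs to
};

void draw_batch(StreamVBO& vbo, const Batch& b, BatchLink& link)
{
    if (link.orphan_id != vbo.orphan_id)
    {
        // Cache miss: upload into the current orphan, refresh the link.
        link.offset    = stream_append(vbo, b.vertex_data, b.vertex_bytes);
        link.orphan_id = vbo.orphan_id;
    }
    // Cache hit (or just-completed upload): draw from the cached offset.
    // Assumes vbo.id is still bound as GL_ARRAY_BUFFER and positions
    // are tightly packed.
    glVertexPointer(3, GL_FLOAT, 0, (const void*)link.offset);
    glDrawArrays(GL_TRIANGLES, 0, b.vertex_count);
}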


I've used std::map<BatchHash, void const*> for something similar before and it featured as a prominent time waster in the profiler report.
Yeah, I made sure to put my cookie-crumb link in my batch object, which lives in cache lines I'll be pulling in anyway to draw the object.

Dark Photon
09-10-2010, 05:03 AM
How do you determine the size of the cache VBO(s)?
I just make sure it's something that can easily hold at least a frame's worth of data (6-8X is very comfortable, and is not that much GPU memory anyway).

If it can, then the next frame will reuse nearly 100% of the batches, and thus in the steady state nothing or nearly-nothing will need to be uploaded to the GPU during a frame.

If it can't, then you end up reuploading lots of stuff every frame. You're effectively "thrashing the cache". And you'll drop to something more akin to client arrays perf (where all batches are reuploaded every frame).


What criterion do you use for expiring?
Like I said, right now I'm not doing anything slick with that, but have anticipated that someday I might need to. Currently when the VBO fills up, everything expires immediately. So over the next 1 frame, you'll load up everything into the new orphan that you need. After that you're good-to-go until it fills up again.

Works very well, so long as your VBO can hold at least a frame's worth of data (so you don't thrash). So I haven't implemented any slick things to progressively migrate to a 2nd stream VBO over multiple frames.

Doing so would also involve a lot of buffer binds back-and-forth while you're getting switched over to the new VBO. This is a killer for classic (bind-to-use) VBOs, but no big deal if drawing the VBOs with NV bindless.



I've avoided that, because then yeah, you have to deal with fragmentation. Ugly.
Orphaning a VBO discards it, so the user can't really reuse any of it anymore. So keeping a part of the VBO the same and changing only a part of it without orphaning can't probably be done.
Right. The question was about having some parts locked down, always there, and static and other parts dynamic. So if done in the same VBO, naturally you couldn't orphan because that'd kill off the static part and demand a reupload of it. You'd instead have to deal with the fragmentation mess inside the VBO yourself -- like I said: ugly.

Dark Photon
09-10-2010, 05:31 AM
If I understand Rob's approach correctly, it's almost the exact same as something I've implemented in D3D, and the approach is that you don't actually need to keep track of batches you've cached. Forget about tricks like that, just blast each batch into the VBO as it passes and either at the end of the frame or when it's going to overflow you rewind your cursor back to the start. ...
True, Rob's pure approach does no caching as it is designed for dynamically-generated data, where reuse makes less sense.

The pro for adding reuse (when applying this technique to static batches) is that dispatching batches that are already on the GPU in a VBO is faster (sometimes much faster) than uploading the data from the CPU first and then dispatching them.

So if your stream VBO can hold at least a frame's worth of data, you're hardly uploading anything per frame (for static batches anyway) -- only what you need to account for eyepoint motion and rotation.

The only corner case is the orphan, and the cost of moving off to another VBO can be amortized over multiple frames if needed.

Now for static batches that you know about on startup, you can just take the hit then, allocate dedicated GPU memory for VBOs, and upload to it with wild abandon, eating frame breaks as needed (or just build display lists for them). But for static batches that are loaded at run-time, when you really don't want to break frame, Rob's technique is good for getting that data on the GPU without significant overhead. And applying reuse to it (i.e. caching the batches on the GPU) gives you even better performance than always reuploading them.

nickels
09-10-2010, 06:07 AM
I'm trying to follow the thread, but I feel like I'm missing the main point.
1) What is a 'batch'? Is this the data needed to exercise one glDrawElements?
2) From the thread, should I understand that in something like an FPS the whole level likely shouldn't just be thrown onto the graphics card; instead there should be some explicit management of GPU resource allocation, aligned somehow with the PVS or such?

Ignorance was bliss... Any comments to help a newbie appreciated!

mhagain
09-10-2010, 11:08 AM
1) What is a 'batch'? Is this the data needed to exercise one glDrawElements?
More or less, yes. In the optimal scenario absolutely everything in your scene that shares the same state would all be handled by one single such call, which is the purpose behind batching. In the real world that's not always possible because sometimes objects that share the same state will need to be drawn out of order, the most obvious example being 2D HUD elements (translucent objects that need to be rendered back to front are another example).

The goals are twofold. Firstly there is the desirable situation of having the driver do less CPU work. If you only need to update vertex and index pointers 20 times per frame it's obviously going to be cheaper than doing so 2,000 times per frame.

Secondly there is the well-understood situation where very few big things are preferable to lots of small things. This can be directly measured for hard disk access, network transmissions, memory allocations and it also applies to GPUs. Performance is widely known to suffer in the "lots of small things" scenario in all of these cases.

ugluk
09-10-2010, 02:09 PM
Ok, let me see if I can make sense of all this.

From what I see, mhagain is writing about a multithreaded GL rendering-instruction creator that also handles resource loading. A "main" thread then deals with rendering, according to the instructions that the instruction creator generates.

I guess this means that a scene can be repeatedly rendered very quickly if its state does not change much. What if there is only a single core?

Dark Photon is more general. From what I understand, all indices and vertex attributes get an id, and there is an array

std::pair<vbo_id, offset_in_vbo> cache_index[number_of_indices_and_vertex_attributes_regions];

When you need a resource (indices or vertex attributes) with a given id, you look it up in the array and check whether it is in the VBO. If it is, you just render with it; if not, you upload it from system memory into the VBO, update the array with the offset and vbo_id, and render.

Both posters repeatedly mention batching for rendering in some dedicated place. I see this would require additional state data, for example a transform for the batch, what shader to use, ... So you probably have additional data structures for that. But what are the benefits of doing this? The vertex attribute pointers are bound to change across rendering a scene, as well as other things like shader state and modeling transforms, which means multiple gl*Draw* calls.

One benefit I see is a central place to do the loading of resources, as well as rendering, but I guess inlined

void const* resource_ptr = load_resource(resource_id);

calls would not be a big thing, even if strewn across the code.

If PBOs are available, you can upload a texture into the caching VBO bound as a PBO and do the glTex*Image*() call from the PBO (if not done already). Do you keep the VBOs for batches and textures separate? Do you keep textures at specific alignments in your caching VBO for textures? What if no PBOs are available?
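For reference, the PBO path I have in mind goes roughly like this (untested sketch, names made up):

// Stage the pixels in a buffer bound as a pixel unpack buffer...
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, tex_bytes, pixels, GL_STREAM_DRAW);

// ...then glTex*Image* sources from the bound PBO: the last
// argument becomes a byte offset into the PBO, not a CPU pointer.
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void*)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); // restore CPU-pointer semantics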

dedicated spot on the GPU you're going to upload the texture to and leave it
What spot is that? The algorithm I have in mind is:

Have an array

GLuint texture_index[all_textures_in_the_app];

whenever you need a texture, you check in this index for the texture object id. If there is no valid texture object id for the texture you need, you reuse some texture object id and upload the texture data onto the GPU for that texture object, invalidating the texture object id for some other texture.

In a central place for rendering one can decide better, I guess, which texture data to overwrite and which to keep.
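A rough sketch of what I mean (hypothetical names; round-robin eviction just for illustration, a real app would pick victims more cleverly):

GLuint slot_tex[NUM_SLOTS];        // GL texture objects, created up front
int    slot_owner[NUM_SLOTS];      // app texture currently in each slot, or -1
int    texture_slot[NUM_TEXTURES]; // slot of each app texture, or -1
int    next_victim = 0;

void bind_app_texture(int id, const TextureData& td)
{
    if (texture_slot[id] < 0)
    {
        // Not resident: evict a victim slot and upload into it.
        int slot = next_victim;
        next_victim = (next_victim + 1) % NUM_SLOTS;
        if (slot_owner[slot] >= 0)
            texture_slot[slot_owner[slot]] = -1; // invalidate old owner
        slot_owner[slot] = id;
        texture_slot[id] = slot;
        glBindTexture(GL_TEXTURE_2D, slot_tex[slot]);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, td.width, td.height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, td.pixels);
    }
    else
    {
        glBindTexture(GL_TEXTURE_2D, slot_tex[texture_slot[id]]);
    }
}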

Also, why use glMapBuffer() calls? Why not glBufferSubData()?

mhagain
09-10-2010, 03:00 PM
Multiple draw calls are a fact of life and you're going to have to live with it. The objective is not to reduce the number of draw calls to 1, but rather to reduce them to as low a number as possible.

One single function in your application which all draw calls come from is a highly desirable goal because it isolates your actual drawing to one place and makes eventual debugging, as well as modification of code, a lot simpler.

Of course you might be crazy to insist on sending something like a single screen-sized quad through such a framework, but it's suitable for just about everything else.

My algorithm doesn't actually involve multi-threading (but it potentially could, at least in terms of building the current set of buffers while replaying the previous set); instead it goes something like this (the replay capability removed for simplicity and clarity):

begin frame, init buffer positions, etc

for each object in the scene
{
    if (buffer is going to overflow) // this will be false first time
    {
        draw everything batched so far
        rewind/orphan/discard
    }

    if (state change needed)
    {
        draw everything batched so far
        apply new state changes
        // we're appending to the buffer, not orphaning here
    }

    add this object to the buffer
    increment positions, counters, anything else you need
}

draw anything left in the buffers
end frame
All that the replay capability does is store the needed state changes and buffer positions in the array of structs I mentioned, so you just loop through that array applying states and drawing stuff before the buffer is orphaned and at the end of the frame. Simple.

I haven't yet experimented with the caching idea that Dark Photon mentioned, but I think that it certainly sounds like it's at least worth a try. :D

ugluk
09-10-2010, 05:38 PM
You mention improvements relevant to debugging/stability, but none relevant to performance. I thought performance was the most important thing. My prime concern is that batches have vastly different states (say, the batch transform positioning/orienting the geometry of the batch), so (state_change_needed) evaluates true more often than not. If each object in the scene knew how to draw itself, it would simply set the states it needed, without checking whether that was needed.

If you didn't have the batch transform in mind as a state change, then what do you consider a state change? Maybe a shader switch or a different shader configuration?

Furthermore, I don't see the aid in debugging. Let's say something screws up my scene render. I comment out some rendering stuff, prevent some objects from rendering, and see if that helps; then I target the object classes I've found the fault in for fixing. If you have a sea of batches, you can prevent some batches from being generated, but you are left with the question of whether your batch-processing code is at fault or the batch-generating code. There's also the overhead of adding batch state to the array of structs. Maybe you can give an example where this batching technique helps out and gives good performance despite the overhead?

I see that with your technique you could probably make a snapshot of the scene at a given moment and then re-render it at a later time... But what benefit would that give, unless, of course, the scene state remained the same between the two renders?

mhagain
09-10-2010, 06:53 PM
You sort your objects by state before rendering, and draw in that order. Believe me, this is a solved problem; Quake III did it, Doom 3 does it, it's well known and well understood that this is the way to do things. A GLIntercept log from a Doom 3 frame (and this is a very worthwhile exercise to do) shows a typical batch looking something like this:
glLoadMatrixf([-0.707969,0.026333,0.705752,0.000000,0.706243,0.026398,0.707477,0.000000,0.000000,0.999305,-0.037286,0.000000,187.120117,-22.095749,-185.795578,1.000000])
glBindBufferARB(GL_ARRAY_BUFFER,2496)
glVertexPointer(3,GL_FLOAT,60,0x0000)
glTexCoordPointer(2,GL_FLOAT,60,0x000c)
glEnable(GL_ALPHA_TEST)
glColor4fv([0.000000,0.000000,0.000000,1.000000])
glAlphaFunc(GL_GREATER,0.750000)
glBindTexture(GL_TEXTURE_2D,1510)
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER,2494)
glDrawElements(GL_TRIANGLES,3690,GL_UNSIGNED_INT,0x0000)
You can use tricks such as texture atlases, matrix transforms in your application code (instead of glTranslatef and family) and so on to get the batch sizes bigger, but the important thing is that you do get them big.

Performance improvements do happen with batching, and this is also a solved problem. You won't see any improvements unless you can get the batch sizes big, and you won't see any improvements with low polycount scenes, but as the polycounts go up it gets faster and faster.

The debugging aid is that you have only one place to set breakpoints. If you have a wild pointer that causes execution to jump to a random point and your call stack to be lost, and if you suspect that it's an indexes param to a glDrawElements call, having only one such call can help a lot - set a breakpoint before the call, examine variable values, set one after, run to the next breakpoint, see what happens. Likewise consolidating all of your gl*Pointer calls into a single place is also a Good Thing to do. Wrapping the functions also lets you do things like bounds checking and other validation in debug builds.

Yes, I can snapshot the scene like you say, and I deliberately wrote it that way because I wanted to have that ability, but I ended up having no need for it. I'll probably go back and revert to simpler code at some point in time, but right now what I have works, so there's no pressing need to change it.

ugluk
09-10-2010, 10:28 PM
Ok, I think I understand now. You simply want as few state changes as possible between consecutive gl*Draw*Elements() calls. How do you select the states to sort by? Also, what weights do you assign to the state switches? They aren't equally expensive. The VBO used is always the same, so the VBO is out; I don't use VAOs; the model transformation matrices are rarely the same between draw calls; so what else is there? I guess one might use the same textures for different models, but that is rare also. The only major remaining state I see is the vertex attribute pointer when instancing, but I don't do any instancing. Also the shader or shader configuration, but what if the fixed pipeline is used?

I still don't understand what it means to get a batch as big as possible. You generally cannot render two or more objects with separate vertex attributes in a single gl*Draw*Elements() call. Do you mean finding the maximum number of objects which need the same GL state?

mhagain
09-11-2010, 06:44 AM
It's not just state changes, it's getting that 2nd parameter to glDrawElements to be a large number.

Practical example - which of these two is going to be faster?

for (i = 0; i < 10000; i++)
{
    glDrawElements (GL_TRIANGLES, 3, GL_UNSIGNED_SHORT, &ptr[i * 3]);
}

Or:

glDrawElements (GL_TRIANGLES, 30000, GL_UNSIGNED_SHORT, ptr);

ugluk
09-11-2010, 03:00 PM
Certainly the second one, if you can do it. But what about the vertex attribute pointer? There is only one for position; the others are for other attributes. AFAIK there is also only one gl_Position output variable (or equivalent) in a GLSL shader. How are you going to coalesce vertex attributes for two or more separate objects this way? You could only do it if the indices of two or more separate objects were stored adjacently in the VBO. This would also require fixing up the indices of the 2nd, 3rd, ... object relative to the vertex attribute pointers pointing into the first object. To me this seems something for design time, not runtime.

What if the separate objects had a different number of vertex attributes?

So how can you get the second parameter high and still render correctly?

From this article:
http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf

It seems to me you want to sort the per-state meshes of single objects, so you don't cross the object boundary with the indices. That is, if an entire object can be drawn with a single draw call you simply batch it; you don't try to merge the meshes of separate objects into a single draw call. Someone could have pointed me to the article :)

The only thing I'd like to know now is: what to do in the case of vertex attributes taking up less than 64 bytes per vertex? From what I've read here, I resize the space they occupy to 64 bytes, right? This means 64 * number_of_vertices bytes used for vertex attributes.

Also, does glFinish() or glFlush() have to be called after orphaning the caching VBO?

And, for textures: create texture atlases that are as large as possible, to minimize binds of different texture objects, then keep your fingers crossed that they are retained by the GPU and not swapped out to system RAM, even if PBOs are supported.

mhagain
09-11-2010, 05:29 PM
I was actually trying to find it earlier....

You're correct though about it being a design time thing, up to a point. Take the example where your world geometry is composed of a few thousand faces, and where typically 25%-75% of those faces might have the same texture. Due to frustum culling and PVS, you don't know which faces (and in what order) are going to be visible at any given point in time. Do you just draw all of them? That's not going to be efficient. So there's one clear case where you need to construct batches and batch ordering at runtime.

So a combination of design time and run time, without being particularly religious about either, is the way to go.

The situation is further complicated by multitexturing and lighting, so realistically when I say "as high as possible" it should also be read as meaning that "it might not be always possible to get it very high".

ugluk
09-11-2010, 05:41 PM
I guess with the Doom 3 level example you gave earlier you'd just pack all vertices into a VBO, followed by indices, treating the level as a single object. Of course this requires 32-bit indices. Then you'd sort out the meshes of the level for batching and blast them to the GPU with glMultiDrawElements().

Dark Photon: This would probably break the caching VBO if the level was huge.

Dark Photon
09-12-2010, 11:50 AM
1) What is a 'batch'? Is this the data needed to exercise one glDrawElements?
It's one draw call, precisely (glDraw*).


Secondly there is the well-understood situation where very few big things are preferable to lots of small things.
I think I know what you're trying to say, but it's worth clarifying this statement. Taken alone, a newbie could infer that rendering the entire world in only one or very few batches (if possible) is preferable, when in fact in many cases this would seriously kill your performance.

The bigger your batches the more poorly they frustum cull (or occlusion cull), and shipping a bunch of out-of-frustum junk down the pipe (just because your batches are very big spatially) is a great way to kill perf (how does the GPU know they're out-of-frustum? It doesn't until after the vertex shader positions everything in clip space.). So tuning batch size is a balancing act between how much time we will waste rendering out-of-frustum (or occluded) junk on the GPU vs. how much time it takes us to cull that stuff out beforehand and dispatch a few more batches down the pipe on the CPU.

Faster CPU + slower GPU --> you're better with smaller batches. Faster GPU + slower CPU --> you're better with bigger batches. Relatively speaking. It's one of those local-minima-type optimization problems. One of the nice things about bindless is that it lets you pump a lot more batches (i.e. better culling) without having to have a faster CPU or CPU memory. So you can have smaller batches, cull better, and waste less GPU time on useless work you don't care about.

But to your original point, yes, the fewer batches you have, the less time you eat on the CPU in the GL driver dispatching the batches. But if your batches are too big, the CPU time saved can be dwarfed by how much extra time you're eating on the GPU doing useless work that'll just be thrown away.

mhagain
09-12-2010, 12:08 PM
:) Yes, that put it just about perfectly.

Dark Photon
09-12-2010, 12:24 PM
...dedicated spot on the GPU you're going to upload the texture to and leave it
What spot is that? The algorithm I have in mind is:

Have an array

GLuint texture_index[all_textures_in_the_app];

whenever you need a texture you check in this index for the texture object id. If there is no valid texture object id for the texture you need, you reuse some texture object id and upload the texture data onto the GPU for that texture object, invalidating the texture object id for some other texture
Yeah, we're thinking along the same lines. The "dedicated" point being that while we can just point the GPU at some random hunk of VBO memory to render any given batch (which it may be sharing with a bunch of other batches), with textures in general we can't do that so easily and still preserve GPU-accelerated trilinear MIPmapped filtering. You need a separate, free-standing GL texture (or a slice of a texture array).

In fairness, I should note that yes, you can use an atlas and try to do your own filtering in the shader to carefully avoid bleed-over between subtextures (a la megatexture/virtual texturing), but it seems with these you always lose something (can't do large filters, such as with edge-on surfaces, resulting in noise; dependent texture lookups needed to implement this "virtual memory"; a heavier fill hit since you're not using the built-in filtering hardware as effectively; etc.). But it's still "dedicated" in the sense that it should be preallocated, not dynamically allocated.


Also, why use glMapBuffer() calls? Why not glBufferSubData()?
Well, as always, test everything and use what's fastest for your use cases and GPUs. Years ago, I benched Data(NULL) + SubData() with PBOs to be fastest for our usage and GPUs (despite "conventional wisdom" at the time), but last time I checked, it's not so anymore.

A con of Sub is that it does an immediate memcpy in the GL thread before it returns. Whereas with Map(), you can do your own memcpy in a BG thread if you want, and/or dump compressed data into the GL buffer directly so you don't have to incur another useless copy. The forum archives will have more details on the Map vs. Sub issues.

Dark Photon
09-12-2010, 03:16 PM
Both posters repeatedly mention batching for rendering in some dedicated place. I see this would require additional state data, for example a transform for the batch, what shader to use, ... So you probably have additional data structures for that.
Sure, but this "state change management" is totally orthogonal to drawing batches. Generally speaking, the core rendering loop is just:

- Apply state change group
- Draw batch
- Apply state change group
- Draw batch
...
... << rinse, repeat >>
...

We're just talking about optimizing the "Draw batch" piece here, not the "apply state change group" piece. Changing shaders, shader uniform values, etc. all fall in the latter category. Naturally you have to employ strategies to keep your "state change" overhead down or the "draw batch" performance is irrelevant. Ideally you want nearly every "apply state change group" to be no code required (radix sort draw order) or at most a single cheap "if" check that finds it can skip everything because "the current GL state is already the desired GL state for the next batch (or batch group)".
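One common way to get that radix-sort draw order (a sketch with my own names, not production code): pack the most expensive state into the most significant bits of a key and sort on it.

#include <stdint.h>
#include <algorithm>

struct DrawItem
{
    uint64_t    key;
    const void* batch; // whatever your batch handle is
};

// Expensive switches (shader) in the high bits, cheap ones (VBO) low.
inline uint64_t make_sort_key(uint16_t shader, uint16_t texture, uint16_t vbo)
{
    return (uint64_t(shader) << 32) | (uint64_t(texture) << 16) | vbo;
}

inline bool by_key(const DrawItem& a, const DrawItem& b)
{
    return a.key < b.key;
}

// std::sort(items, items + count, by_key);
// After sorting, consecutive items mostly share state, so most
// "apply state change group" steps reduce to a cheap "if" no-op.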


But what are the benefits of doing this? The vertex attribute pointers are bound to change across rendering a scene, as well as other things like shader state and modeling transforms, which means multiple gl*Draw* calls.
I'm not sure I'm following you here.

A "batch" includes vtx attribs, DrawElements idx list, and the draw call arguments.

All other GL stuff falls in the general "state changes" bucket, including shader state changes (active pgm, modeling transform uniform sets, etc.). This is the environment that a batch draws in.

Yeah, whenever you change "state", you need to break a batch so you can change that state.

But beyond that, sorry but I don't understand what you're saying (or asking). The benefits of doing what?

If you mean using Streaming VBOs for batch data, I mentioned a few before, but they include: not having to allocate dedicated storage on the GPU for each batch; the ability to dispatch a substantial portion of frame batches directly from the GPU without upload (== fast!!); no need to implement complicated GPU/VBO memory fragmentation/garbage-collection strategies; a very small GPU memory footprint; and good results even without "pre-packing" a bunch of batches in shared memory buffers.

Dark Photon
09-12-2010, 03:31 PM
One single function in your application which all draw calls come from is a highly desirable goal because it isolates your actual drawing to one place and makes eventual debugging, as well as modification of code, a lot simpler.
Definitely! Would Mod +1 if I could. :cool: It makes trying and comparing various draw techniques very, very easy, as well as supporting them side-by-side for different types of batches as the need arises.

Duplicating code is almost always bad, and this case is no exception.

Dark Photon
09-12-2010, 04:10 PM
The only thing I'd like to know now is: what to do in the case of vertex attributes taking up less than 64 bytes per vertex? From what I've read here, I resize the space they occupy to 64 bytes, right? This means 64 * number_of_vertices bytes used for vertex attributes.
I've read this, but the magic vtx alignment number touted was 32 bytes, not 64 bytes. NVidia doesn't seem to care, and my testing has confirmed that. Several folks have said this is an ATI-only thing. I'll search in a minute, and if I find the article, I'll edit in a link to it here.

EDIT: Links:
* Vertex_Buffer_Object#Sample_Code (OpenGL Wiki) (http://www.opengl.org/wiki/Vertex_Buffer_Object#Sample_Code)
* glMultiDrawElements() and VBO (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=277764&Searchpage=1&Main=53600&Words=%2Bati+%2Balignment+%2B32&Search=true#Post277764)
* Many VBOs vs. single VB for dynamic meshes (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=262445#Post262445)
* VBOs and ATI (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=236621#Post236621)


Also, does glFinish() or glFlush() have to be called after orphaning the caching VBO?
No! Definitely not!

That's the beauty of Rob's whole Map UNSYNC/INVALIDATE_RANGE, fill, then orphan-on-full strategy. There's no need for any synchronization. You're doing things just right so that you and the driver don't need to "shake hands" at any point. You just blast batches, and it rips through them just as fast as it can. That and it pipelines very well.


And, for textures: create texture atlases that are as large as possible, to minimize binds of different texture objects, then keep your fingers crossed that they are retained by the GPU and not swapped out to system RAM...
Yeah "keep your fingers crossed", or (more deterministically) just make sure the sum-total of all texture memory you've allocated does not come anywhere close to the amount of GPU memory on your card (there are a number of ways to query the system for this number nowadays -- OpenCL, GL extensions, vendor APIs, etc.). Of course you have to leave some slop for VBOs, framebuffers, padding/alignment overhead. etc. that also need vidmem.


..., even if PBOs are supported.
I don't understand that comment though... :(

Dark Photon
09-12-2010, 04:40 PM
Dark Photon: This would probably break the caching VBO if the level was huge.
I think you might be surprised!

Assuming you have all your batch dispatch code in one place (one class or function), hack in some code that adds up the size of all the batch data you use over a single frame -- excluding data that's known about on startup, since that can be preloaded on the GPU and won't be streamed. It's probably not near as big a number as you think! I know I was shocked that it wasn't larger.

Also if you use a batch multiple times during a frame (e.g. for rendering reflections/shadows/etc. and then again for rendering the scene), you should only count that batch once for your statistics because subsequent uses in a frame should always be reuse.
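
A rough sketch of that accounting hack, assuming a Batch type with an id and a byte size (names hypothetical):

#include <cstdio>
#include <set>

static std::set<unsigned> seen_this_frame;  // batch ids drawn this frame
static size_t streamed_bytes = 0;

void draw_batch(const Batch& batch)
{
    if (seen_this_frame.insert(batch.id).second)  // first use this frame?
        streamed_bytes += batch.bytes;            // reuses count only once
    // ... the actual glDrawElements() dispatch goes here ...
}

void end_frame()
{
    std::printf("batch bytes streamed this frame: %zu\n", streamed_bytes);
    seen_this_frame.clear();
    streamed_bytes = 0;
}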

Also verify that your frustum culling is working well (no out-of-frustum batches being rendered).

Also render your scene in wireframe to verify that your LOD is good (i.e. you're not rendering things down close to 1-pixel polys; GPUs are not optimized for this, and it wastes a lot of batch size on useless verts, not to mention the perf lost to likely-useless vtx shader executions).

ugluk
09-12-2010, 10:16 PM
You're right, the batch-processing code would be a perfect place to force wireframe rendering to occur. Thanks for patiently explaining caching and batching.

I have a feeling that the texturing thing is going to bite me sooner or later. I'll ask again if that happens. The ever-growing RAM capacity of GPUs doesn't guarantee memory won't run out; it just reduces the probability of it happening.

Dark Photon
09-13-2010, 06:05 PM
Sure thing! Hope it helps.

ugluk
09-18-2010, 12:06 AM
I have an additional question:

when I use glDrawElements(), I need two pointers: one to indices and another to vertex attributes. Say I do this:


glDrawElements(GL_TRIANGLE_STRIP,
               size,
               GL_UNSIGNED_INT,
               load_resource(indices_id));

It is possible that the vertex attributes end up in the orphaned VBO because loading the indices triggered an orphan. Does this make a case for separate caching VBOs, one for indices and the other for vertices? Also, if glMultiDrawElements() were called, there would have to be space in the VBO for all the index and vertex arrays being rendered, which means loading multiple sets of vertex attributes and indices and thus a bigger caching VBO. So what is the story on glMultiDrawElements() and the other multis? Are you using them when caching?

I've also found that caching works with both glBufferSubData() and glMapBuffer(), with better perf from glBufferSubData().

Dark Photon
09-20-2010, 07:10 PM
when I use glDrawElements(), I need 2 pointers, one to indices and another to vertex attributes ... It is possible that vertex attributes find themselves in the orphaned VBO due to loading of the indices.
Yes, but that use case is invalid. The buffer orphan active at the time of the draw call must contain all of the bytes that draw call needs from that buffer. You cannot orphan away bytes you're going to use and then issue the draw call.

I tripped over this myself here: link (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=280279&Searchpage=1&Main=54156&Words=orphaning&Search=true#Post280279)


Does this make a case for separate caching VBOs, one for indices, the other for vertices?
I don't think so. Using one is no big deal -- you just need to verify you've got enough bytes for all vertex attributes and indices in the active buffer before you start dumping them in the buffer. Otherwise you orphan first, and then load.

Consider that if you don't interleave vertex attributes for some batches (for whatever reason) and have N vertex attributes in your batch, then you actually have N+1 individual blocks of memory (if, say, the N attribute blocks are not contiguous in memory); between any two of those writes you could find that the buffer has filled up and you need to orphan. Doing the orphan check up front, before you start dumping any batch data in, fixes the issue in all of these use cases.
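
Continuing the earlier streaming sketch, the up-front check might look like this (total the N attribute blocks plus the index block first, then orphan at most once; 'cursor' and 'BUF_SIZE' as before):

bool ensure_space(GLsizeiptr total_bytes)  // sum of all blocks in the batch
{
    if (total_bytes > BUF_SIZE)
        return false;                      // batch can never fit; handle as an error
    if (cursor + total_bytes > BUF_SIZE)
    {
        glBufferData(GL_ARRAY_BUFFER, BUF_SIZE, NULL, GL_STREAM_DRAW);
        cursor = 0;                        // orphan once, then load every block
    }
    return true;
}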

Also, as to whether two are just as good as one... The cost of switching buffers seems to be expensive; this is why "classic" VBO rendering (i.e. VBOs w/o bindless), even with static preloaded VBOs, can sometimes be slower than client arrays. Using one VBO for both ARRAY_BUFFER and ELEMENT_ARRAY_BUFFER and doing lazy state changes (so basically you never change the binding with streaming VBOs) definitely seems to cut down on the normal overhead when you're using a lot of VBOs with classic VBO rendering. But I'm not sure without testing whether the same is true if you have two different VBOs, one bound to each bind point, with lazy state changes (so again the bindings never change with streaming VBOs). I'd guess it'd probably be a good case perf-wise too, but with VBOs you never know for sure until you try.
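
The lazy state change part is simple enough to sketch (the cached 'bound' array is this example's own bookkeeping):

static GLuint bound[2] = { 0, 0 };   // [0] = ARRAY, [1] = ELEMENT_ARRAY

void lazy_bind_buffer(GLenum target, GLuint vbo)
{
    int slot = (target == GL_ELEMENT_ARRAY_BUFFER) ? 1 : 0;
    if (bound[slot] != vbo)          // skip redundant binds; with one
    {                                // streaming VBO this fires only once
        glBindBuffer(target, vbo);
        bound[slot] = vbo;
    }
}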


Also, if glMultiDrawElements() were called, there would have to be space for all rendered index and vertex arrays in the VBO, making necessary the loading of multiple vertex attributes and indices, requiring bigger caching VBO sizes. So what is the story on glMultiDrawElements() and other multis? Are you using them when caching?
I've always read that glMulti* are just CPU-side "for" loops inside of the GL driver, so I don't use them.

This is (apparently) intuitively confirmed when you consider that glMultiDrawElements() takes an array of pointers to CPU memory. CPU memory pointers mean nothing on the GPU. So (AFAICT) you can't just point the driver to an index VBO with this draw call as you can with glDrawElements* calls.
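
In other words, the widely assumed driver implementation is nothing more than this (a sketch of the assumption, not actual driver code):

void multi_draw_elements(GLenum mode, const GLsizei* count, GLenum type,
                         const GLvoid* const* indices, GLsizei primcount)
{
    // Just a CPU-side loop over the single-draw entry point.
    for (GLsizei i = 0; i < primcount; ++i)
        if (count[i] > 0)
            glDrawElements(mode, count[i], type, indices[i]);
}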


I've also found that, caching works with both glBufferSubData() and glMapBuffer(), giving better perf with glBufferSubData().
Interesting. What GPU/driver?

I used to see better perf with Sub but last time I tried it that had changed.

ugluk
09-20-2010, 10:08 PM
Yes, but that use case is invalid. The buffer orphan active at the time of the draw call must contain all of the bytes that draw call needs from that buffer. You cannot orphan away bytes you're going to use and then issue the draw call.

It took me some time to figure it out ;)

I don't think so. Using one is no big deal -- you just need to verify you've got enough bytes for all vertex attributes and indices in the active buffer before you start dumping them in the buffer. Otherwise you orphan first, and then load.
From what I've read of your post, you seem to have a Batch class into which you upload everything, and it probably does the check just before rendering. Still, it can happen that there is not enough space for both indices and vertices in the caching buffer and the buffer is needlessly discarded, while there may be enough space for, say, just the vertices. The two-buffer approach also allows me to load all the indices I could possibly use for a level (as they typically occupy less space than vertex attributes).


Consider that if you don't interleave vertex attributes for some batches (for whatever reason) and have N vertex attributes in your batch, then you actually have N+1 individual blocks of memory (if, say, the N attribute blocks are not contiguous in memory); between any two of those writes you could find that the buffer has filled up and you need to orphan. Doing the orphan check up front, before you start dumping any batch data in, fixes the issue in all of these use cases.
I agree.

Also, as to whether two are just as good as one... The cost of switching buffers seems to be expensive; this is why "classic" VBO rendering (i.e. VBOs w/o bindless), even with static preloaded VBOs, can sometimes be slower than client arrays. Using one VBO for both ARRAY_BUFFER and ELEMENT_ARRAY_BUFFER and doing lazy state changes (so basically you never change the binding with streaming VBOs) definitely seems to cut down on the normal overhead when you're using a lot of VBOs with classic VBO rendering. But I'm not sure without testing whether the same is true if you have two different VBOs, one bound to each bind point, with lazy state changes (so again the bindings never change with streaming VBOs). I'd guess it'd probably be a good case perf-wise too, but with VBOs you never know for sure until you try.

I've tested on my GPU, and the two-buffer approach is slightly slower (by about four hundredths of a millisecond). That approach eliminates the extra "reserve" call before rendering to make sure there is enough space in the VBO, and it makes the caching less dynamic (you can, say, dump all the indices into the same VBO).

What I didn't like about the "reserve" call was the fact that the number of resources (vertex attributes and indices) you reserve can vary, making a variable-length array necessary (say, a C++ vector). The other alternative would be a plain C array (say, a GLuint[] plus its size), which would still require spelling out the array in the code. There is also some overhead associated with the "reserve" itself.

So, I think your approach is better for perf and mine for tight memory situations. I'll implement your approach and bench a little more.


I've always read that glMulti* are just CPU-side "for" loops inside of the GL driver, so I don't use them.

This is (apparently) intuitively confirmed when you consider that glMultiDrawElements() takes an array of pointers to CPU memory. CPU memory pointers mean nothing on the GPU. So (AFAICT) you can't just point the driver to an index VBO with this draw call as you can with glDrawElements* calls.

Not only that, they are sometimes buggy; apparently the for loop is hard to implement right sometimes. Still, on Windows you make just one call into opengl32.dll instead of many, and calls into DLLs are supposedly slow there.


Interesting. What GPU/driver?

I used to see better perf with Sub but last time I tried it that had changed.

I see a whole millisecond of improvement over glMapBuffer()/std::memcpy/glUnmapBuffer(). The driver is the Linux 10.7 ATI driver and the GPU is a Radeon HD 5450. As you've written many times, you've gotta bench everything, so I tested both approaches. To be honest, unless memory writes are trapped somehow, I really don't see how the memcpy path could be fast -- plus there are two calls into the GL versus one for Sub. Or maybe the gcc compiler (4.4.3) is crappy, or my compiler switches are wrong? I've never heard of gcc using MTRRs, DMA or something hw accelerated for memcpy.
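
For reference, the two paths being compared here look like this (a sketch; dst_offset, bytes, and data as in the earlier streaming examples):

#include <cstring>

// Path 1: one GL call, the driver does the copy.
glBufferSubData(GL_ARRAY_BUFFER, dst_offset, bytes, data);

// Path 2: two GL calls plus our own memcpy into the mapped pointer.
void* dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
std::memcpy(static_cast<char*>(dst) + dst_offset, data, bytes);
glUnmapBuffer(GL_ARRAY_BUFFER);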

EDIT: I've done everything as you've suggested and got ripping perf now, no measurable difference compared to the mega-vbo loading case. Everything is stored in one buffer now.

Dark Photon
09-21-2010, 05:18 AM
What I didn't like about the "reserve" call was the fact that the number of resources (vertex attributes and indices) you reserve can vary, making a variable-length array necessary (say, a C++ vector).
While I think the C++ vector might be overkill (there are at most 16 vtx attribs and 1 index list = a fixed-length array of 17), I too didn't really take to the idea of one for loop to check space and another to load. So initially I ended up tailoring this to support only one batch permutation (1 interleaved attr ptr + 1 idx ptr per batch): the one used in 99% of cases. But the for loop shouldn't be a big deal unless your memory is spread around and it causes you to touch a lot of cache lines, which is avoidable.
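
A sketch of that fixed-length alternative (17 slots: up to 16 attribute arrays plus one index array; the names are illustrative):

struct BatchBlocks
{
    const void* ptr[17];    // up to 16 attribute arrays + 1 index array
    GLsizeiptr  bytes[17];
    int         count;      // slots actually used (often just 2)
};

GLsizeiptr total_bytes(const BatchBlocks& b)
{
    GLsizeiptr total = 0;           // one pass to total space for the
    for (int i = 0; i < b.count; ++i)  // orphan check; a second pass loads
        total += b.bytes[i];
    return total;
}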


Not only that, they are sometimes buggy; apparently the for loop is hard to implement right sometimes. Still, on Windows you make just one call into opengl32.dll instead of many, and calls into DLLs are supposedly slow there.
Still? I clearly remember reading this was the case for D3D9 and earlier (e.g. Batch, Batch, Batch (NVidia) (http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf)), but I thought that D3D10 fixed that (effectively adding the user-space marshalling that OpenGL has always had IIRC).


To be honest, unless memory writes are trapped somehow, I really don't see how the memcpy path could be fast -- plus there are two calls into the GL versus one for Sub. Or maybe the gcc compiler (4.4.3) is crappy, or my compiler switches are wrong?
Definitely don't think GCC is crappy -- been using it for many years with great results.

As far as compiler switches, I'd suggest a base of "-g -O2 -Wall -Werror". That produces generally good optimization, without getting off into the weeds of optimization for specific processors or doing things that'll sometimes hurt perf.


I've never heard of gcc using MTRRs, DMA or something hw accelerated for memcpy.
As part of this, I spent a while trying various accelerated memcpy techniques, including some "MTRR"-like ones such as SSE2 non-temporal (non-cache-polluting) memcpy implementations. The built-in memcpy in GCC (on x86_64) came out on top, though the latter came pretty close. So at least from my testing with batch data, it's not too shabby. I did find messages from the past saying older versions of GCC weren't always so tight, though.

For reference, I'm currently on gcc 4.5.0, though when I did the memcpy testing it was with gcc 4.4.1.
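
For the curious, a minimal sketch of that kind of SSE2 non-temporal copy (assumes 16-byte-aligned pointers and a size that's a multiple of 16):

#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

void memcpy_nontemporal(void* dst, const void* src, size_t bytes)
{
    __m128i*       d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));  // bypass the cache
    _mm_sfence();  // make the streaming stores globally visible
}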


EDIT: I've done everything as you've suggested and got ripping perf now
Sweet! Congrats!