mega vbo, any reservations?

Once upon a time we had a discussion about having a single mega-VBO containing all attributes and indices an app might need. The only reservation was that the swaps between CPU and GPU memory might become slow. But what about mobile & embedded platforms, where only one GL app runs at a time? And with games on the usual desktop computers, usually only one game runs at a time, as you can’t really play two or more at once.

I’m itching to try the mega vbo in practice, but, as you well know, this is a major change and I’d like to hear about any reservations before I jump to implementing it.

If you’ve concentrated your scene glDraw*Elements code in one spot (i.e. one “drawable” or “draw batch” class), then the change isn’t huge.

Reservations against? Not really, and there are some compelling arguments for doing it. GPU memory allocation/fragmentation = expensive. VBO switching = expensive. Can’t afford to have all loaded batches on the GPU at all times (in some apps anyway). etc.

But just to clarify, what specifically are you contemplating when you say “mega VBO”? There have been a number of posts talking about packing multiple batches into one VBO. Many of them talking about pre-packing batches into VBOs (in DB build step or something). Then there have been a few talking about dynamically streaming just the batches you need into a “mega VBO” (e.g. this one). I’m talking about the latter.

The main difference being whether you are storing big chunks of batch data on the GPU whether you need them or not (i.e. whether any of them pass CULL or not), or just storing those that you need based on CULL results. Or a mix.

I meant storing multiple batches on the GPU in a single big VBO, without dynamically changing them at runtime. Maybe this is standard practice already, but I’ve read in these forums that the GPU, after running out of memory, might swap some VBOs into system memory, and with a mega-VBO the swapping could prove slow. But, I guess, if you allocate a mega-VBO and the GPU can’t accommodate it, then it will return an error and that will be that. You need to make sure up front that it won’t be too big.

I’ll have to test what happens if you try to allocate, say, a 1 GB VBO on a 512 MB GPU.

result:

02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450] (prog-if 00 [VGA controller])
Subsystem: Hightech Information System Ltd. Device 2291
Flags: bus master, fast devsel, latency 0, IRQ 30
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at feae0000 (64-bit, non-prefetchable) [size=128K]
I/O ports at e000 [size=256]
Expansion ROM at feac0000 [disabled] [size=128K]
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information <?>
Capabilities: [150] Advanced Error Reporting
Kernel driver in use: fglrx_pci
Kernel modules: fglrx

Apparently I have 512 MB on my GPU, and I do:


  GLuint buffer_object;

  glGenBuffersARB(1, &buffer_object);
  GL_DEBUG();
  glBindBufferARB(GL_ARRAY_BUFFER, buffer_object);
  GL_DEBUG();
  glBufferDataARB(GL_ARRAY_BUFFER, 1024 * 1024 * 1024, 0, GL_STATIC_DRAW);
  GL_DEBUG();

and get:

out of memory

test_array_stored: test/test_array_stored.cpp:34: int main(int, char**): Assertion `0’ failed.

I am worried that when I allocated a 600-800 MB VBO, it actually succeeded, so you can apparently allocate more memory than is available.

Anyway, in my app I don’t have the DrawElements calls grouped, but I’ll simply drop all the VBO binds that are there and assume the correct VBO is always bound.

I am worried that when I allocated a 600-800 MB VBO, it actually succeeded, so you can apparently allocate more memory than is available.

Why? There’s nothing that said that the entire buffer object must be in video memory at the same time. As long as it functions as a valid buffer object, you shouldn’t have a problem.

Yeah, this is the kind of thing that worries me with “just put all the batch data we could possibly need to render with on the GPU” approach. You end up with ginormous mem req’ts and wasting lots of GPU mem on stuff that may never be used by the GPU pipeline (and even that’s assuming we somehow know we can allocate that big a VBO and that the GPU can keep it hot and resident).

Versus the approach of just caching the batches you need to render with, or rendered with recently, on the GPU. This amount of data actually ends up being fairly small. So instead of talking about 0.5-1 GB or something, we’re maybe talking 0.5-10 MB or so. Much more reasonable.

I am worried that when I allocated a 600-800 MB VBO, it actually succeeded, so you can apparently allocate more memory than is available.

Yes, that is a concern.

Anyway, in my app I don’t have the DrawElements calls grouped, but I’ll simply drop all the VBO binds that are there and assume the correct VBO is always bound.

You can do that, or just do lazy binds (i.e. only bind if the last bind wasn’t the same as the current one). This is sometimes termed state tracking, or the removing-redundant-calls approach, and uses thin wrapper functions around GL calls.

Yeah, this is the kind of thing that worries me with “just put all the batch data we could possibly need to render with on the GPU” approach. You end up with ginormous mem req’ts and wasting lots of GPU mem on stuff that may never be used by the GPU pipeline (and even that’s assuming we somehow know we can allocate that big a VBO and that the GPU can keep it hot and resident).

You can take the approach: there is either enough memory, or screw it. Even if GL screws my app, by allocating a part of the VBO in system memory, the user will notice a delay, assume his “bad” card is at fault and rush to buy a new one :slight_smile:

Well, unless the app developer is Microsoft, I suppose you need to take the caching approach. Or make more than one VBO and put up with swapping.

Versus the approach of just caching the batches you need to render with, or rendered with recently, on the GPU. This amount of data actually ends up being fairly small. So instead of talking about 0.5-1 GB or something, we’re maybe talking 0.5-10 MB or so. Much more reasonable.

So, what if the cache VBO lacks a lot of the necessary batches? A noticeable delay taken up by loading. If I understand correctly, your algorithm would be as follows:

when a batch is needed for rendering check if the needed batch is in the cache VBO, if not, load it into the cache VBO.

What happens when the cache VBO is full? Discard the VBO and start reloading batches into it? Otherwise fragmentation will occur. Maybe a part of the cache VBO is permanent and only parts are discardable? Anyway, do you have any links to descriptions of caching algorithms? What if you cached spatially, i.e. defined regions with lists of batches that might be needed, then loaded the VBO with the batches needed in the current region as well as neighboring regions, reloading when crossing boundaries between regions?

Also, what about VAOs? If you cache, the location of a batch may change, rendering its VAO useless. In order to fix the VAO, you need to bind it, which takes up valuable time.

You can do that, or just do lazy binds (i.e. only bind if the last bind wasn’t the same as the current one). This is sometimes termed state tracking, or the removing-redundant-calls approach, and uses thin wrapper functions around GL calls.

Still, it requires an “if” and a little memory for state tracking, but I guess this is better than a wasted bind.

Anyway, I implemented part of the mega-VBO approach. I will eventually have two mega-VBOs, one for attributes and another for indices, due to alignment concerns. But currently I have a single VBO for both vertex attributes and indices. The layout looks like this:

| attributes1
| indices1
| attributes2
| indices2

Of course, if the layout is packed it can cause attributes and indices to become unaligned; what happens on my card in that case is completely random. Even if the scene remains static between frames, GL will render random stuff. Is this allowed with vertex attributes on a BYTE boundary? My understanding was that it should only increase rendering time. As soon as vertex attributes are aligned on a DWORD boundary, at least on my card, everything renders correctly and fast. If I align vertex attributes on a WORD boundary, I notice a large delay, but everything renders correctly.

I’ve also noticed a speed-up due to the mega-vbo and the now missing binds when aligning at the DWORD boundary.

EDIT:
Now I’ve done tests with two mega-VBOs, one for indices, the other for attributes, and the performance is slightly worse (a hundredth of a millisecond) than it was for a single mega-VBO holding both indices and attributes.

Here’s the thing. First frame bringing up your level, maybe you take a little longer because you’re uploading more batches. But in subsequent frames, a substantial portion of the batches you render were rendered the last frame, so they’re already there. If you envision the frustum panning around, you only have potentially new batches on the leading edges of the view frustum. So the amount of new data you need to upload is small. This technique takes advantage of temporal coherence in your rendering.

But even for that first frame, is it noticeable? Not so much. You end up with something like client-arrays perf for those, which isn’t too bad. Worst case, maybe you break just the first frame after you bring up a level, if you were close to breaking anyway.

If I understand correctly, your algorithm would be as follows: when a batch is needed for rendering, check if the needed batch is in the cache VBO; if not, load it into the cache VBO.

Yes, more specifically, just append it using the Streaming VBO approach Rob Barris describes here.

And if it’s already there, just launch the batch using NVidia bindless (if on NVidia) to get display list perf for that batch.

What happens when the cache VBO is full? Discard VBO and start reloading batches into it?

Yeah, basically orphan it and the next frame you just end up reloading anything you’re still using at the front of the new orphan. This works well enough. But if I ever get to a problem with frame breakage (due to the app being too close to breaking frame anyway) then I’ll have to do something more slick like have 2 stream VBOs, and over the course of multiple frames expire things from the old VBO and load them over packed up-front in the new VBO, rather than do this all in one frame (the first frame after an orphan).

Otherwise fragmentation will occur.

Well, with Rob’s streaming VBO technique, fragmentation can’t occur. He explicitly targeted it at dynamic data where there’s no reuse.

But even with static data and reuse, you just end up with parts of the VBO in the current orphan that aren’t being rendered in the current frustum anymore. Once you fill up, then you orphan, and then on the subsequent frame you’d reload what you are still using to the front of the new orphan.

Essentially an orphan event + refill gives you stream compaction so you don’t have long-term fragmentation to deal with.

Maybe a part of the cache VBO is permanent and only parts are discardable?

I’ve avoided that, because then yeah, you have to deal with fragmentation. Ugly.

Anyway do you have any links to descriptions of caching algorithms?

For what I’m doing, all you need is Rob’s post. It just uses a VBO as a big ring buffer. Just orphan when you wrap.

What if you cached spatially, i.e. defined regions with lists of batches that might be needed and then load the vbo with batches needed in the current region, as well as neighboring regions, reloading when crossing boundaries between regions.

Yeah, there are definitely lots of possibilities once you start considering this. I like to keep it as simple as possible and only make it more complex if I have to.

Another permutation to consider is to have separate stream VBOs for static and dynamic data. That way, the dynamic data won’t prompt more orphans of the static data than needed. But again, haven’t needed this for perf, so I’ve only considered it.

Also, what about VAOs? If you cache, the location of a batch may change, rendering its VAO useless. In order to fix the VAO, you need to bind it, which takes up valuable time.

Personal taste: I don’t like VAOs. A bazillion little objects in the GL driver to generate cache misses and cause slowdowns. I get everything VAOs can give and a good bit more from NVidia bindless (on an NVidia GPU).

And in my most recent tests, applying VAOs on top of bindless costs you a little perf, so on NVidia I only use bindless.

Anyway, I implemented part of the mega-VBO approach. I will eventually have two mega-VBOs, one for attributes and another for indices, due to alignment concerns.

Re alignment, I use the trick offered in that thread of just rounding offsets up to multiples of 64 or something nice. No real cost or benefit for it that I’ve seen, but you can if you want. And no problems here with dumping attributes and indices into the same VBO.

Of course, if the layout is packed it can cause attributes and indices to become unaligned, what happens on my card in that case is completely random. Even if the scene remains static between frames GL will render random stuff.

Seriously? Not sure, but I believe that’s either a driver bug or your bug.

I ended up with a similar effect (garbage output: link) when I split the attribute and index blocks by an orphan. This is invalid. You can’t do that. Both the attribute and index blocks must be in the same buffer orphan, as the content is “latched” at the draw call.

Is this allowed with vertex attributes on a BYTE boundary? My understanding was, it should only increase rendering time.

Think so, but I’d round each up to a multiple of 64 (latest-gen cache line size) just for kicks, to see if that fixes your problem.

As soon as vertex attributes are aligned on a DWORD boundary, at least on my card, everything renders correctly and fast.

Interesting. Which card is that? I’d like to bookmark that thought.

Interesting discussion. Not sure I understand all the finer points, but as a testimonial, I have implemented my quake level viewer with one VBO and one Index buffer (GL_STATIC_DRAW, I believe). I dispatch faces separately using offsets into the index buffer. Haven’t seen any issues. Static and dynamic meshes all have their own vbo and index buffers…
I hadn’t thought about the potential for loading a level larger than the GPU memory by letting opengl swap buffers between GPU/CPU. Interesting…

To be honest, if you’re worried about the overhead of an “if” and a handful of bytes, your app has bigger problems than its VBO implementation. :wink:

I wouldn’t be inclined to call such memory “wasted” either; it’s being “used” instead, to ensure that you don’t issue redundant state changes.

To be honest, if you’re worried about the overhead of an “if” and a handful of bytes, your app has bigger problems than its VBO implementation. wink

Who says that I am worried? Whenever I see an “if” I try to think of ways around it. There are many examples of people managing to do so. To me, “if”s have a taste of suboptimality, as I suspect they do to us all. This doesn’t mean I don’t use them in my code.

Yes, more specifically, just append it using the Streaming VBO approach Rob Barris describes here.

And if it’s already there, just launch the batch using NVidia bindless (if on NVidia) to get display list perf for that batch.

How do you keep track of batches that are cached and those that are not? I’ve tried using std::map<> for cache tracking before, but promptly dropped it, as it was too much of a time waster.

Also, where are the textures in all this? Do you simply upload all textures to the GPU and rely on GL for swapping?

How do you determine the size of the cache VBO(s)?

Yeah, basically orphan it and the next frame you just end up reloading anything you’re still using at the front of the new orphan. This works well enough. But if I ever get to a problem with frame breakage (due to the app being too close to breaking frame anyway) then I’ll have to do something more slick like have 2 stream VBOs, and over the course of multiple frames expire things from the old VBO and load them over packed up-front in the new VBO, rather than do this all in one frame (the first frame after an orphan).

Yeah, you take what is used most often and put it in the second VBO, and when the first fills up, you use the second, where there will hopefully be fewer batches to load. What criterion do you use for expiring? Do you simply take the most often used batch in a frame and move it to the second VBO? You need to keep some of the second VBO free, though, for possible loading. How much of it do you keep free?

I’ve avoided that, because then yeah, you have to deal with fragmentation. Ugly.

Orphaning a VBO discards it, so the user can’t really reuse any of it anymore. So keeping part of the VBO the same and changing only part of it without orphaning probably can’t be done.

Re alignment, I use the trick offered in that thread of just rounding offsets up to multiples of 64 or something nice. No real cost or benefit for it that I’ve seen, but you can if you want. And no problems here with dumping attributes and indices into the same VBO.

Yes, you’ve got to try different alignments. I’ve used 4 and it worked. For some other card, another magic number might be more appropriate.

Seriously? Not sure, but I believe that’s either a driver bug or your bug.

I ended up with a similar effect (garbage output: link) when I split the attribute and index blocks by an orphan. This is invalid. You can’t do that. Both the attribute and index blocks must be in the same buffer orphan, as the content is “latched” at the draw call.

I see, they were in different orphans. Interesting.

As for my “problem”, it definitely smells like a driver bug. I didn’t orphan any VBO, and I don’t change anything about the scene or the viewing, yet random triangles are rendered from frame to frame. Has to be a driver bug. The card is:
02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450]

After realigning, the “problem” disappears; another indication of a driver bug.

Think so, but I’d round each up to a multiple of 64 (latest-gen cache line size) just for kicks, to see if that fixes your problem.

It works, just like the 4 byte alignment. No real performance improvement though.

If I understand Rob’s approach correctly, it’s almost the exact same as something I’ve implemented in D3D, and the approach is that you don’t actually need to keep track of batches you’ve cached. Forget about tricks like that, just blast each batch into the VBO as it passes and either at the end of the frame or when it’s going to overflow you rewind your cursor back to the start.

What you do need to keep track of is when state changes occur as obviously you need to either record those state changes somewhere for later replay, or draw everything that’s accumulated to this point before applying the changes. The approach I took in D3D was to record them in a struct (just as a callback function that applies them) together with firstvertex, numvertexes, firstindex and numindexes, and then I replay them when the VBO is rewound (or at the end of a frame).

Obviously both state change filtering and sorting objects by state is really going to help a lot here when it comes to keeping batches as big as possible.

I just kept a static array of structs and filled them in as required; I found 1024 structs to be excessive but it’s what I went with anyway. The exact size will depend on your application’s needs.

STL constructs are nice for sure, but in this case I went with the more primitive option as being simpler and with less processing overhead. Knowing when to not use powerful tools is equally as important as knowing when to use them. :wink:

A test for when the struct array was going to overflow was also needed, so I added that. With hindsight, it may have been perfectly fine to draw immediately instead of recording stuff, but I did want to be able to replay an arbitrary number of times (in the end I didn’t have that requirement, so I didn’t need to record after all).

VBO sizes are going to be application dependent. I’d say create an initial one with a huge size, run your app, see how much space you need for a typical scene, then double it. I was using 16-bit indexes so I was necessarily constrained by that, having no more than 65536 vertexes.

The trick is to keep the initial "Lock"s (or MapBuffers) (preceded by BufferData NULL in GL) as few as possible, preferably only one per frame. Subsequent "Lock"s (or Maps) will be appending only, so they don’t need to orphan (or discard, in D3D terminology) the entire buffer, just write to the cursor position. When it fills and is about to overflow, or when a new frame begins, is the only time you need to discard/orphan it again.

If you’re really clever you could potentially set things up so that it doesn’t even discard/orphan on a new frame, but just keeps appending. I personally found that - especially with 16 bit indexes - a discard/orphan on a new frame was the best way to go, otherwise there is a risk of needing to do so in the middle of most frames, so it was going to happen anyway, and it just kept things simpler this way.

I found no performance improvements whatsoever from double (or even quadruple) buffering, by the way. Most likely because my buffers were small enough to begin with, so the discard/orphan operation was about the same speed as switching to a new buffer.

The really nice outcome of this was that I was able to have a single DrawIndexedPrimitive (D3D-ese for glDraw(Range)Elements) for absolutely everything on-screen, as well as mix system memory vertex arrays (or the D3D equivalent) with VBOs while retaining 99% the very same code. There’s still some room for further optimization; a lot of objects in the scene would more or less reuse the same indexes, for example, so I can adjust the code to allow for that, but right now there’s no pressing need to. As a technique it works very well.

If I understand Rob’s approach correctly, it’s almost the exact same as something I’ve implemented in D3D, and the approach is that you don’t actually need to keep track of batches you’ve cached. Forget about tricks like that, just blast each batch into the VBO as it passes and either at the end of the frame or when it’s going to overflow you rewind your cursor back to the start.

So the algorithm is like this:

  1. get a list of the needed batches,
  2. iterate over the struct array and draw the listed batches from the VBO if you find them; otherwise send them to the VBO from system memory and append the new batch state to the end of the struct array.

If the list of batches has length M and the struct array length N, this gives O(M * N) complexity.

The really nice outcome of this was that I was able to have a single DrawIndexedPrimitive (D3D-ese for glDraw(Range)Elements) for absolutely everything on-screen…

How is that possible? If you had written glMultiDrawElements(), I’d understand, but glDraw(Range)Elements? You can’t specify multiple regions of indices with it. Maybe you could use it as you describe if all the indices were packed together and all were of the same type, but for an entire scene they’d probably have to be 32-bit. Furthermore, what if different objects needed different shaders or different shader settings (like a different lighting model)? You’d either have to switch shaders or set different settings.

No, just a single place where all the drawing is done is what I meant, not one call for absolutely everything. Sorry for the confusion.

I wouldn’t even bother with checking for batches already in the VBO, I’d just resend everything from scratch each frame. Keep it as simple as possible. The eventual draw calls can then progress sequentially through your buffers instead of having to hop around from one location to another, which is very very important for maximum performance.

A post from ugluk in another thread that was obviously meant for this thread:

With this technique we’re just talking batch data. For texture data you’ve got a dedicated spot on the GPU you’re going to upload the texture to and leave it, so it’s a slightly different problem. But textures can be managed with a preallocated pool and a similar upload mechanism with PBOs.

Do you simply upload all textures there are? An then rely on GL to swap them in and out?

Oh, definitely not! Guaranteed frame breakage whenever you flip something into view that uses a texture that was paged off the GPU by the driver (or never pre-rendered with to force it on the GPU)! I avoid overcommitting GPU memory like the plague! Letting the driver page for you at the last minute when it figures out “inside your draw call” that it needs a texture that’s not on the GPU is the kiss of death! :eek:

The client arrays analogy also worries me; it means all batches need to be kept around in system memory for possible uploading.

The driver’s probably gonna keep a CPU-side copy anyway (AFAIK it does this for most if not all things you upload to the GPU, textures, probably buffer objects too). At least this way, there isn’t a copy of “all” of your data in the driver. Just your working set. This way you get to manage the copy of “all” of your data on your side of the fence.

I confess not to be a driver writer so I don’t know for sure. But since GL permits the driver to kick anything off the GPU to make room to put something else on it, at any time, it makes sense that it would keep a CPU copy because then it can just free up the space rather than have to 1) allocate CPU storage, 2) copy the GPU data off, and 3) free up the space. … and only then, now that there’s room, upload what it wants to the GPU.

[quote]Yes, more specifically, just append it using the Streaming VBO approach Rob Barris describes here.

But how do you keep track of which batches are in the cache and which aren’t?[/QUOTE]
Store a link to it on your “batch” object in your scene graph. Currently my streaming VBO “link” is two things: 1) offset in VBO, 2) orphan ID.

How are these used?

When you find that you need to upload a batch to the streaming VBO, you store off the “offset” that you uploaded it to and the current “orphan ID” (this being the number of times you’ve orphaned the VBO).

Then the next time you need to draw the batch:

  1. IFF the current stream VBO orphan ID == the link’s “orphan ID” in your batch node, then your batch is still “in the cache” so you just render it from the link’s “offset” in the VBO with no upload,
  2. Otherwise you have to reupload it (Map UNSYNC / memcpy / Unmap) first, store off the link (orphan ID and offset) in your batch node, and then render it for that offset.

I’ve used std::map<BatchHash, void const*> for something similar before and it featured as a prominent time waster in the profiler report.

Yeah, I made sure to put my cookie-crumb link in my batch object, which lives in cache lines I’ll be pulling in anyway to draw the object.

I just make sure it’s something that can easily hold at least a frame’s worth of data (6-8X is very comfortable, and is not that much GPU memory anyway).

If it can, then the next frame will reuse nearly 100% of the batches, and thus in the steady state nothing or nearly-nothing will need to be uploaded to the GPU during a frame.

If it can’t, then you end up reuploading lots of stuff every frame. You’re effectively “thrashing the cache”. And you’ll drop to something more akin to client arrays perf (where all batches are reuploaded every frame).

What criterion do you use for expiring?

Like I said, right now I’m not doing anything slick with that, but have anticipated that someday I might need to. Currently when the VBO fills up, everything expires immediately. So over the next 1 frame, you’ll load up everything into the new orphan that you need. After that you’re good-to-go until it fills up again.

Works very well, so long as your VBO can hold at least a frame’s worth of data (so you don’t thrash). So I haven’t implemented any slick things to progressively migrate to a 2nd stream VBO over multiple frames.

Doing so would also involve a lot of buffer binds back-and-forth while you’re getting switched over to the new VBO. This is a killer for classic (bind-to-use) VBOs, but no big deal if drawing the VBOs with NV bindless.

[quote]I’ve avoided that, because then yeah, you have to deal with fragmentation. Ugly.

Orphaning a VBO discards it, so the user can’t really reuse any of it anymore. So keeping part of the VBO the same and changing only part of it without orphaning probably can’t be done.[/QUOTE]
Right. The question was about having some parts locked down, always there, and static and other parts dynamic. So if done in the same VBO, naturally you couldn’t orphan because that’d kill off the static part and demand a reupload of it. You’d instead have to deal with the fragmentation mess inside the VBO yourself – like I said: ugly.

True, Rob’s pure approach does no caching as it is designed for dynamically-generated data, where reuse makes less sense.

The pro for adding reuse (when applying this technique to static batches) is that dispatching batches that are already on the GPU in a VBO is faster (sometimes much faster) than uploading the data from the CPU first and then dispatching them.

So if your stream VBO can hold at least a frame’s worth of data, you’re hardly uploading anything per frame (for static batches anyway) – only what you need to account for eyepoint motion and rotation.

The only corner case is the orphan, and the cost of moving off to another VBO can be amortized over multiple frames if needed.

Now for static batches that you know about on startup, you can just take the hit then, allocate dedicated GPU memory for VBOs, and upload to it with wild abandon, eating frame breaks as needed (or just build display lists for them). But for static batches that are loaded at run-time, when you really don’t want to break frame, Rob’s technique is good for getting that data on the GPU without significant overhead. And applying reuse to it (i.e. caching the batches on the GPU) gives you even better performance than always reuploading them.

I’m trying to follow the thread, but I feel like I’m missing the main point.

  1. What is a ‘batch’? Is this the data needed to exercise one glDrawElements?
  2. From the thread should I understand that in something like a FPS likely the whole level shouldn’t just be thrown on the graphics card, but there should be some explicit management of gpu resource allocation, aligned somehow with the PVS or such?

Ignorance was bliss… Any comments to help a newbie appreciated!

  1. What is a ‘batch’? Is this the data needed to exercise one glDrawElements?

More or less, yes. In the optimal scenario absolutely everything in your scene that shares the same state would all be handled by one single such call, which is the purpose behind batching. In the real world that’s not always possible because sometimes objects that share the same state will need to be drawn out of order, the most obvious example being 2D HUD elements (translucent objects that need to be rendered back to front are another example).

The goals are twofold. Firstly there is the desirable situation of having the driver do less CPU work. If you only need to update vertex and index pointers 20 times per frame it’s obviously going to be cheaper than doing so 2,000 times per frame.

Secondly there is the well-understood situation where very few big things are preferable to lots of small things. This can be directly measured for hard disk access, network transmissions, memory allocations and it also applies to GPUs. Performance is widely known to suffer in the “lots of small things” scenario in all of these cases.

Ok, let me see if I can make sense of all this.

From what I see mhagain is writing about a multithreaded GL rendering instruction creator, that also handles resource loading. A “main” thread then deals with rendering, according to the instructions that the instruction creator generates.

I guess this means that a scene can be repeatedly rendered very quickly if its state does not change much. What if there is only a single core?

Dark Photon is more general. From what I understand, all indices and vertex attributes get an id and there is an array

std::pair<vbo_id, offset_in_vbo> cache_index[number_of_indices_and_vertex_attributes_regions];

When you need a resource (indices or vertex attributes) with an id, you look it up in the array and check whether it is in the VBO. If it is, you just render with it; if not, you upload it from system memory into the VBO, update the array with the offset and vbo_id, and render.

Both posters repeatedly mention batching for rendering in some dedicated place. I see this would require additional state data, for example a transform for the batch, what shader to use, and so on. So you probably have additional data structures for that. But what are the benefits of doing this? The vertex attribute pointers are bound to change across rendering a scene, as are other things like shader state and modeling transforms, which means multiple glDraw calls.

One benefit I see is a central place to do the loading of resources, as well as rendering, but I guess inlined

void const* resource_ptr = load_resource(resource_id);

calls would not be a big thing, even if strewn across the code.

If PBOs are available, you can upload a texture into the caching VBO bound as a PBO and do a glTexImage() call on the PBO (if not done already). Do you keep the VBOs for batches and textures separate? Do you keep textures at specific alignments in your caching VBO for textures? What if no PBOs are available?

dedicated spot on the GPU you’re going to upload the texture to and leave it

What spot is that? The algorithm I have in mind is:

Have an array

GLuint texture_index[all_textures_in_the_app];

whenever you need a texture, you check this index for a texture object id. If there is no valid texture object id for the texture you need, you reuse some texture object id and upload the texture data onto the GPU for that texture object, invalidating the id for some other texture.

In a central place for rendering one can decide better, I guess, which texture data to overwrite and which to keep.

Also, why use glMapBuffer() calls? Why not glBufferSubData()?