Here’s the thing. The first frame after bringing up your level maybe takes a little longer because you’re uploading more batches. But in subsequent frames, a substantial portion of the batches you render were rendered the previous frame, so they’re already there. If you envision the frustum panning around, potentially new batches appear only on the leading edges of the view frustum, so the amount of new data you need to upload is small. This technique takes advantage of temporal coherence in your rendering.
But even for that first frame, is it noticeable? Not so much. You end up with something like client-arrays perf for those uploads, which isn’t too bad. Worst case, maybe you break just the first frame after you bring up a level, and only if you were close to breaking anyway.
If I understand correctly, your algorithm would be as follows: when a batch is needed for rendering, check whether it’s already in the cache VBO; if not, load it into the cache VBO.
Yes, more specifically, just append it using the Streaming VBO approach Rob Barris describes here.
And if it’s already there, just launch the batch using NVidia bindless (if on NVidia) to get display list perf for that batch.
What happens when the cache VBO is full? Discard VBO and start reloading batches into it?
Yeah, basically orphan it, and on the next frame you just end up reloading anything you’re still using at the front of the new orphan. This works well enough. But if I ever hit a problem with frame breakage (due to the app being too close to breaking frame anyway), then I’ll have to do something more slick, like keeping two stream VBOs and, over the course of multiple frames, expiring things from the old VBO and loading them packed up-front into the new VBO, rather than doing it all in one frame (the first frame after an orphan).
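To make that concrete, here’s a minimal sketch of the orphan-and-reload bookkeeping, assuming a generation counter is used to invalidate cached offsets. All the names here (`StreamVbo`, `stream_alloc`, `BatchSlot`, etc.) are made up for illustration; in real GL, the orphan step would be a `glBufferData(..., NULL, ...)` (or an invalidating map) on the actual VBO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for a streaming VBO used as a cache.
 * Offsets are handed out sequentially; when a request won't fit,
 * we "orphan" (restart at offset 0) and bump a generation counter.
 * A cached batch is still resident only if its generation matches
 * the VBO's current generation. */

typedef struct {
    size_t   capacity;   /* total VBO size in bytes            */
    size_t   head;       /* next free offset in current orphan */
    uint32_t generation; /* bumped on every orphan             */
} StreamVbo;

typedef struct {
    size_t   offset;     /* where the batch lives in the VBO   */
    uint32_t generation; /* orphan it was uploaded into        */
} BatchSlot;

/* Allocate `size` bytes; orphan first if they won't fit. */
static size_t stream_alloc(StreamVbo *v, size_t size)
{
    assert(size <= v->capacity);
    if (v->head + size > v->capacity) {
        v->head = 0;          /* orphan: old contents discarded  */
        v->generation++;      /* invalidates every cached offset */
    }
    size_t off = v->head;
    v->head += size;
    return off;
}

/* Is this batch's cached copy still in the live orphan? */
static int batch_resident(const StreamVbo *v, const BatchSlot *b)
{
    return b->generation == v->generation;
}
```

Per frame, for each visible batch: if `batch_resident` says yes, draw straight from `slot.offset`; otherwise re-upload it at `stream_alloc`’s offset and record the new offset and generation. Anything from an older orphan simply fails the generation check and gets re-uploaded, which is the stream-compaction effect described below.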
Otherwise fragmentation will occur.
Well, with Rob’s streaming-VBO technique, fragmentation can’t occur. He explicitly targeted it at dynamic data where there’s no reuse.
But even with static data and reuse, you just end up with parts of the VBO in the current orphan that aren’t being rendered in the current frustum anymore. Once you fill up, then you orphan, and then on the subsequent frame you’d reload what you are still using to the front of the new orphan.
Essentially an orphan event + refill gives you stream compaction so you don’t have long-term fragmentation to deal with.
Maybe a part of the cache VBO is permanent and only parts are discardable?
I’ve avoided that, because then yeah, you have to deal with fragmentation. Ugly.
Anyway do you have any links to descriptions of caching algorithms?
For what I’m doing, all you need is Rob’s post. It just uses a VBO as a big ring buffer. Just orphan when you wrap.
What if you cached spatially, i.e. defined regions with lists of batches that might be needed and then load the vbo with batches needed in the current region, as well as neighboring regions, reloading when crossing boundaries between regions.
Yeah, there are definitely lots of possibilities once you start considering this. I like to keep it as simple as possible and only make it more complex if I have to.
Another permutation to consider is to have separate stream VBOs for static and dynamic data. That way, the dynamic data won’t prompt more orphans of the static data than needed. But again, haven’t needed this for perf, so I’ve only considered it.
Also, what about VAOs? If you cache, the location of a batch may change, rendering its VAO useless. To fix the VAO, you need to bind it, which takes up valuable time.
Personal taste: I don’t like VAOs. A bazillion little objects in the GL driver to generate cache misses and cause slowdowns. I get everything VAOs can give and a good bit more from NVidia bindless (on an NVidia GPU).
And in my most recent tests, applying VAOs on top of bindless costs you a little perf, so on NVidia I only use bindless.
Anyway, I implemented part of the mega-VBO approach. I will eventually have two mega-VBOs, one for attributes and another for indices, due to alignment concerns.
Re alignment, I use the trick offered in that thread of just rounding offsets up to multiples of 64 or something nice. No real cost or benefit for it that I’ve seen, but you can if you want. And no problems here with dumping attributes and indices into the same VBO.
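As a sketch of that round-up trick (the helper name is mine, not from the thread; 64 is just a common cache-line size, and the bit trick works for any power-of-two alignment):

```c
#include <stddef.h>

/* Round a byte offset up to the next 64-byte boundary -- the
 * "round offsets up to something nice" trick mentioned above.
 * Works because 64 is a power of two: add (align - 1), then
 * mask off the low bits. */
static size_t align_up_64(size_t offset)
{
    return (offset + 63) & ~(size_t)63;
}
```

You’d apply this to the stream head before writing each attribute or index block, wasting at most 63 bytes per batch.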
Of course, if the layout is packed, it can cause attributes and indices to become unaligned; what happens on my card in that case is completely random. Even if the scene remains static between frames, GL will render random stuff.
Seriously? Not sure, but I believe that’s either a driver bug or your bug.
I ended up with a similar effect (garbage output: link) when I split the attribute and index blocks by an orphan. This is invalid. You can’t do that. Both the attribute and index blocks must be in the same buffer orphan, as the content is “latched” at the draw call.
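One way to enforce that rule, sketched under the assumption of a simple head-offset ring buffer (the function name is made up for illustration): reserve the attribute and index blocks as a single combined allocation, so an orphan can never fall between them.

```c
#include <stddef.h>

/* Decide whether to orphan BEFORE writing a batch, given the ring
 * buffer's current head and the combined size of the batch's
 * attribute and index blocks.  By testing the combined size, both
 * blocks always land in the same orphan, never straddling the
 * discard point.  Returns 1 if an orphan (restart at offset 0)
 * is required first. */
static int must_orphan_first(size_t head, size_t capacity,
                             size_t attrib_bytes, size_t index_bytes)
{
    return head + attrib_bytes + index_bytes > capacity;
}
```

If this returns 1, orphan, then write the attribute block and index block back to back from offset 0; otherwise write both at the current head.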
Is this allowed with vertex attributes on a BYTE boundary? My understanding was, it should only increase rendering time.
Think so, but I’d round each up to a multiple of 64 (latest-gen cache-line size) just for kicks, to see if that fixes your problem.
As soon as vertex attributes are aligned on a DWORD boundary, at least on my card, everything renders correctly and fast.
Interesting. Which card is that? I’d like to bookmark that thought.