VBO Performance Strategy

I’ve seen several people who’ve developed what amounts to a memory management system for meshes in graphics card memory. They assume, say, that they have 16MB of room to store meshes on the card, and split that up into smaller, geometrically sized VBOs, e.g. they’ll have:

11 VBOs of 1k
22 VBOs of 2k
43 VBOs of 4k
27 VBOs of 8k
33 VBOs of 16k
17 VBOs of 32k

Each time they want to render a mesh, they check whether its data is still cached there from the last frame, and if it’s not, they load the data into the oldest available VBO and overwrite the existing contents.

In effect they’ve built a memory management system by tracking which of the VBOs get used each frame. If one size class is underused and another is overused, they can split or join VBOs into smaller or larger ones; by keeping usage statistics per size class, they can adjust the VBO sizes dynamically.
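For concreteness, a cache along those lines might look something like this sketch. It is purely illustrative: the structure, function names and the LRU policy are my own, and it assumes the ARB_vertex_buffer_object entry points are already loaded.

```c
/* Hypothetical size-class VBO cache, roughly as described above. */
#include <GL/gl.h>
#include <GL/glext.h>

typedef struct {
    GLuint  vbo;        /* buffer object name                     */
    GLsizei capacity;   /* size class in bytes (1k, 2k, 4k, ...)  */
    int     meshId;     /* mesh currently cached here, -1 = none  */
    long    lastUsed;   /* frame number of last use (for LRU)     */
} CacheSlot;

/* Return a slot already holding this mesh, or evict the least
   recently used slot of a large-enough size class and refill it. */
static CacheSlot *acquireSlot(CacheSlot *slots, int count, int meshId,
                              const void *data, GLsizei bytes, long frame)
{
    CacheSlot *victim = NULL;
    for (int i = 0; i < count; ++i) {
        if (slots[i].meshId == meshId) {              /* cache hit */
            slots[i].lastUsed = frame;
            return &slots[i];
        }
        if (slots[i].capacity >= bytes &&
            (!victim || slots[i].lastUsed < victim->lastUsed))
            victim = &slots[i];       /* oldest slot that fits */
    }
    if (!victim)
        return NULL;                  /* no size class big enough */

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, victim->vbo);
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, bytes, data); /* overwrite */
    victim->meshId   = meshId;
    victim->lastUsed = frame;
    return victim;
}
```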

Can anyone explain to me why this kind of system is advantageous? Aren’t you just reproducing the memory management features of the card’s drivers?

One of the disadvantages of this is that you need to keep a copy of the mesh in local memory, while with a VBO you don’t. (The driver will still keep a copy in AGP, though, yes? Even if you map the VBO and fill it directly?)

edit - I think they’re using this kind of system when they have lots of mesh data in memory and can’t load it all onto the card at once, so they’ve developed this fancy memory manager. What I’m doing is loading meshes onto the card when they are visible, or possibly visible in the frustum, and then removing them once they are no longer ‘in range’ of the camera. Is it better to create a custom memory manager, or to just load all your locally visible meshes onto the card, like I’ve been doing, and let the drivers sort it out?


I personally allocate as much VBO space as is available and use it. This other system seems awkward. Why not just allocate one large buffer, and then assign regions of that buffer to objects based on the number of vertices, attributes, etc.? This approach mimics older memory managers used in apps and systems.
Obviously there are issues with fragmentation and prioritizing, but there is plenty of theory and code available as examples.
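A minimal sketch of that one-large-buffer approach, assuming the ARB_vertex_buffer_object entry points are loaded. The pool size, the bump allocator and all names are made up, and it deliberately ignores the fragmentation/prioritization issues mentioned above.

```c
/* Illustrative single-VBO pool with region sub-allocation. */
#include <GL/gl.h>
#include <GL/glext.h>

#define POOL_BYTES (16 * 1024 * 1024)            /* e.g. 16MB for all meshes */
#define BUFFER_OFFSET(bytes) ((const char *)0 + (bytes))

static GLuint  poolVbo;
static GLsizei poolUsed;                         /* simple bump allocator */

void poolInit(void)
{
    glGenBuffersARB(1, &poolVbo);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, poolVbo);
    /* Reserve the storage once; fill regions later with BufferSubData. */
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, POOL_BYTES, NULL, GL_STATIC_DRAW_ARB);
    poolUsed = 0;
}

/* Upload one mesh into the next free region; returns its byte offset. */
GLsizei poolUpload(const GLfloat *xyz, GLsizei bytes)
{
    GLsizei offset = poolUsed;
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, poolVbo);
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, offset, bytes, xyz);
    poolUsed += bytes;
    return offset;
}

/* Draw a mesh by pointing the vertex array at its region of the pool. */
void poolDraw(GLsizei offset, GLsizei vertCount)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, poolVbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, BUFFER_OFFSET(offset));
    glDrawArrays(GL_TRIANGLES, 0, vertCount);
}
```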

I’m trying to reconcile this with some recent posts by Cass:

Too many VBOs and you pay some (marginal) penalty for more frequent VBO state changes. Too few VBOs and you pay a (potentially very high) penalty for forcing a coherent CPU/GPU view of an unnecessarily large chunk of memory. Forcing this coherency requires either synchronization stalling or lots of in-band data copying. This is a real waste if that coherency is not essential.

Small VBOs solve the coherency problem and make driver-side memory management much easier. In the long term, I expect one or two attribs for a few hundred vertexes per VBO to be “free”. And it will never hurt (though it may not help much) to pack multiple attributes (perhaps from multiple objects) into a single VBO – if they are static or nearly static. This is probably a good idea if you have lots of static objects with very few vertices - though if you don’t render these things all at the same time, immediate mode may be better still.

He talks about large/small VBO issues in these two threads:
http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/011120.html
http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/011194.html

Can the drivers do better optimization than allocating your own VBO/VBOs and using a customized memory management scheme?

Btw, this brings up an interesting side topic. I’ve heard PCI Express has pretty good speed for reading data back from the gfx card. Once PCI Express arrives this summer, will drivers keep only one copy of the mesh and pass it back and forth between CPU and GPU, or still keep two copies? If the drivers automatically take advantage of PCI Express to do this kind of thing, people who have a custom memory management architecture like this will lose out on the benefits of PCI Express.

Why not just allocate one large buffer, and then assign regions of that buffer to objects based on the number of vertices, attributes, etc.?

Because it gives the driver no way to manage VBO’s?

VBO’s aren’t just a way to allocate video memory; you may as well be using VAR. VBO’s are a way to let the driver manage your memory for you. It can come up with the best way to store vertices/indices. It can do so because driver writers are in a better position than the user to know how the hardware works and what the hardware wants.

What management? I implemented a system with multiple vbos and it was significantly slower. Using static always, a single vbo works great.
As for streaming or write vbos, more likely than not your data will not be written to video memory and you will lose all the benefits of a vbo.

In that case, a single streaming vbo will be fine.

What management? I implemented a system with multiple vbos and it was significantly slower.

On which implementation? The behavior differs based on the implementation. It is well known that nVidia’s VBO implementation is highly complex and tries to figure out your intent.

Using static always, a single vbo works great.

Unless, of course, you have large quantities of data in memory at once (say, 32MB of vertex data and 128MB of textures). The driver is not free to page out parts of your VBO, because it must be an all-or-nothing thing, which means you’ve dropped your available texture memory down to 96MB. So, if you try to use all 128MB of textures in one frame, you’ll thrash and lose all kinds of performance.

As for streaming or write vbos, more likely than not your data will not be written to video memory and you will lose all the benefits of a vbo.

Unless you’re using nVidia’s implementation, which does indeed use video memory for streaming buffers. Check out their VBO .pdf.

Want my opinion?
Well, when VBO came out it was supposed to be a way to banish the need for specific code paths for NV and ATI hardware with their proprietary extensions.

But the peculiarities of each VBO implementation are no less restrictive than the use of two different extensions. You may even have to go back to using two different models in your application, each one reflecting the peculiarities of one implementation.

For me, it is still a mess!!!

For me, it is still a mess!!!

Stop trying, then.

The current issues with VBO stem from the fact that VBO’s can be used in different ways. For example, Maximian’s single large VBO, vs the normal method of allocating object-sized VBO’s.

It is unclear which is best because the spec doesn’t define which is best. That’s up to the driver. To this point, driver writers can decide to optimize any VBO usage pattern they want, because nobody has released a product that uses VBO’s professionally. Once Doom 3 ships with VBO’s as the standard vertex sending mechanism, then driver writers have a particular method of using VBOs that must run fast.

Until then, just do what is most convenient to you, the user. Static VBOs will beat regular vertex arrays in virtually all circumstances, so it’s still a win. Until there is a specific impetus on the part of driver writers to make certain usages of VBOs more appropriate than others, just use VBO’s as you see fit, understanding that the optimal VBO usage may change with the release of Doom3.

Originally posted by maximian:
What management? I implemented a system with multiple vbos and it was significantly slower. Using static always, a single vbo works great.

That’s to be expected. For static data, there is little management to do because it is static. Cass stated as much. By going with a single, large VBO, you’re eliminating the overhead of switching between VBOs.

As for streaming or write vbos, more likely than not your data will not be written to video memory and you will lose all the benefits of a vbo.

In that case, a single streaming vbo will be fine.
Did you try?
A driver may double buffer some VBOs and - surprise - this is most beneficial for non-static buffers. Double buffering of VBOs trades memory for synchronisation stalls. For a lot of applications that would be a good thing.

It’s not hard to explain:
If you have one huge VBO with dynamic data, every time you update a portion of that data (worst case: via glMapBuffer), all rendering that references this buffer must be finished first. If you have one VBO per dynamic mesh, the driver only needs to finish rendering that single mesh (which, in the majority of cases, happened approximately one frame earlier).
If the driver performs VBO double buffering, there is no need to finish anything at all. Rendering can continue from the “old version”, while you get a “new version” for the new data.
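One application-side way to get that “old version / new version” behaviour, if the driver cooperates, is to respecify the store with a NULL pointer before mapping, which tells the driver the old contents can be discarded. Whether any particular driver actually hands back fresh memory instead of stalling is an assumption; this is only a sketch.

```c
/* Sketch: update a per-frame dynamic mesh without forcing a sync on the
   previous frame's draw (ARB_vertex_buffer_object entry points assumed). */
#include <string.h>
#include <GL/gl.h>
#include <GL/glext.h>

void updateDynamicMesh(GLuint vbo, const GLfloat *verts, GLsizei bytes)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);

    /* Respecify the store with NULL data: the driver may "orphan" the old
       block (still being rendered from) and give us a new one, instead of
       waiting for rendering that references this buffer to finish.        */
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, bytes, NULL, GL_STREAM_DRAW_ARB);

    GLfloat *dst = (GLfloat *)glMapBufferARB(GL_ARRAY_BUFFER_ARB,
                                             GL_WRITE_ONLY_ARB);
    if (dst) {
        memcpy(dst, verts, bytes);   /* write the "new version" */
        glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
    }
}
```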

1) ‘Video memory is used for streaming vbo’.
This is questionable. While possibly true, I have found vbo performance to equal regular VAs when using streaming (nvidia + latest drivers + geforce fx + tnt2). So if it is using video memory, it is not much of a speed enhancement.

2) If your data is quite small, one large vbo may not work. I deal mainly with very large models, few models per scene. Perhaps the ARB should modify the vbo spec so that you can map only a portion of the vbo, the portion you will change. In that case, write or streaming will be best (see the sketch after this list).

3) Synchronization issues are still present with vbos. If you try to render from a vbo while writing to it (mapping it, whatever), you will get errors. I tested this thoroughly, then used semaphores to control the write/render code. Again, no synchronization on the part of the driver.

4) I agree with OldMan. VBO implementations differ from each other almost as much as VAR and vertex objects did. The API is the same for all, but you need special handling. Saying ‘stop trying’ to someone is not constructive.

5) Finally, who made Carmack the boss of the graphics universe? If the ARB had done its job, then the functionality of vbo would work similarly across all platforms. Performance would not be the same, but at least the way it worked would be uniform. If OpenGL cannot get vendors to implement standards in a platform-consistent manner, then it has a serious problem. This is of course not new: even standard OpenGL code will sometimes behave quite differently depending on the vendor.
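Regarding point 2 above: for what it’s worth, the ARB spec already lets you update a sub-range without mapping at all, via BufferSubData. Whether that avoids the synchronization cost on a given driver is implementation dependent; the offsets and names here are purely illustrative.

```c
/* Sketch: touch only the changed region of a large VBO, no full map. */
#include <GL/gl.h>
#include <GL/glext.h>

void updateRegion(GLuint vbo, GLintptrARB regionOffset,
                  GLsizeiptrARB regionBytes, const void *newData)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    /* Only the touched bytes are respecified; the rest of the buffer
       keeps whatever storage the driver already gave it. */
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, regionOffset,
                       regionBytes, newData);
}
```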

If we need to wait for D3 to have reliable and predictable VBO performance, I prefer to keep using VAR and vertex objects until D3 comes out! Because the MAIN advantage of VBO is missing!

But at least then Carmack could tell us how VBO will be used in the future, since it looks like his decision will prevail anyway…

Originally posted by maximian:
[…]
5) Finally, who made Carmack boss of the graphics universe.

don’t forget: Carmack created the first successful OpenGL game by creating his own OpenGL implementation - MiniGL for Quake.
This way he forced most HW vendors to add OpenGL support for their hardware.
By creating Quake 2/3 he pushed the hw vendors even further to advance their OpenGL development (i.e. optimization of VAs).
(not to mention all the games created by other companies using the ID engine…)
All vendors want to run his code as fast as possible on their HW - it has been their reference for fast OpenGL implementations since 1998, and therefore it’s important what he does.

This is questionable.

This is what nVidia said. If you find it “questionable”, question them. Until they say that it isn’t the case, then it is, as far as I’m concerned.

While possibly true, I have found vbo performance to equal regular va when using streaming.

Which is hardly surprising because you’re streaming. You’re generating data per-frame in some fashion, so you’re required to do a copy operation. As such, you can only go so fast.

Also, from nVidia’s document, it seems that their glMapBuffer call for streaming buffers just allocates some system memory, rather than giving you a direct pointer to the actual memory. This means that streaming can’t be faster than standard arrays since you aren’t writing directly to the buffer memory itself.

If your data is quite small, one large vbo may not work. I deal mainly with very large models, few models per scene. Perhaps the ARB should modify the vbo spec so that you can map only a portion of the vbo, the portion you will change. In that case, write or streaming will be best.

This would require them to perpetuate the idea that using VBO like VAR is a good idea. It is not. One of the primary problems VBO’s are supposed to solve is getting buffer management out of the user’s hands and into the driver’s hands.

Synchronization issues are still present with vbos. If you try to render from a vbo while writing to it (mapping it, whatever), you will get errors. I tested this thoroughly.

That’s not a sync error. That’s either an error in your code or an error in the implementation in question. In either case, it has nothing to do with the VBO spec.

The API is the same for all, but you need special handling. Saying ‘stop trying’ to someone is not constructive.

It’s far more constructive than complaining about something that is, by definition, implementation dependent.

Finally, who made Carmack boss of the graphics universe.

People who keep buying games/engines made by Id Software. Like it or not (and I don’t), that gives him clout that we don’t have. And it is his choices in how to use VBOs that we will be expected to follow.

Fortunately, he’s usually pretty reasonable in using an API, so I don’t expect his VBO usage to be totally off the beaten path.

If the ARB had done its job,
then the functionality of vbo would work similarly across all platforms. Performance would not be the same, but at least the way it worked would be uniform. If OpenGL cannot get vendors to implement standards in a platform-consistent manner, then it has a serious problem.

There is a difference between performance and functionality. Functionality is what a spec defines; not performance. The standard doesn’t prevent an implementation from being faster in certain usage patterns of the API. Just like the spec doesn’t prevent an implementation from punishing the user for pathological (though legal) use of the API.

For example, the whole VBO-as-VAR thing. This is a legitimate usage pattern for VBO’s. The behavior of this pattern is well understood, and all implementations of VBO should be able to function with it. How fast it is must always be implementation dependent. For some implementations, like nVidia’s, where bind-equivalent actions incur some overhead, VBO-as-VAR is faster than multiple VBOs. For other implementations, where bind-equivalent actions are just like any other state change, the VBO-as-VAR paradigm is, at best, no faster than normal.

The point is that expected performance is, and must always be, implementation dependent. The best the spec could ever do is give guidelines about what the “fast” path should be.

If we need to wait for D3 to have reliable and predictable VBO performance, I prefer to keep using VAR and vertex objects until D3 comes out! Because the MAIN advantage of VBO is missing!

The primary purpose of the VBO extension is to provide for a cross-platform solution to vertex transfer that gives the driver the flexibility to manage buffer memory as it needs to. VBO effectively solves that purpose. Performance is, and has always been, implementation dependent.

You can get reliable performance out of VBOs today. You may not get optimal performance out of all implementations, but if you use the VBO API reasonably, then all implementations should be reasonably fast. Maybe not as fast as possible, but decently fast.

If you want to keep using extensions like VAR and the multitude of VAO extensions, good for you. Just don’t complain when ATi pulls the plug on VAO. And don’t complain that you have to use VAR/VAO; you have a perfectly viable alternative.

You misread my post completely. I said performance is not the problem. It is the usage pattern, i.e. how you expect the vbo to work, that should be uniform across all platforms. Yes, I am right about this. It is not a standard, and it does not solve the issue of transferring vertex data, if I have to code it differently based on the chip. I said it before and I will say it again:
performance deltas between implementations are not a big issue. However, the vbo should work almost the same across all platforms.

1) Just because nvidia says so does not make it true. I have plenty of data and tests to back my assertions about the way vbos actually work. As for streaming, if it were double buffered, there is no reason why you could not get some performance improvement from the use of video memory. If it had been implemented correctly, a streaming/write vbo should also offer a performance improvement, especially if you’re not trying to rewrite the whole vbo every pass. (This alone shows their caching algorithm is fairly rudimentary, if it exists at all.)

Again, the whole point of VBO is to enhance performance. We do not need yet another VA. How much it enhances performance varies, but if a path is supposed to enhance performance, then it should. Failure to do so constitutes a failure of the standard.

2) I have nothing against Carmack. However, my criticism is correct. It is useless to have an ARB if we have to wait for a developer to specify HOW something should work. You prove this point for me.

3) I am not complaining at all. I am merely pointing out the problems. VBOs work very well for me. However, I sympathize with people who need more out of them. They should probably stick to VAR+VOB for the foreseeable future.

4) Going back to your argument about VAR not being the right way to do it: YES, you are right. But if you look at the first post, and how people are using crazy methods of managing VBO memory, it is not much more advanced than VAR. It hides only the synch issue you mentioned. Like with system memory, I should be able to allocate a block of memory and then map subregions of it. With the current method of doing things, switching between all those VBOs will cost you dearly in performance. Not to mention the hassle, and also the redundancy, of managing your vbo memory yourself, something that should be left to the driver.

If the ARB had done its job,
then the functionality of vbo would work similarly across all platforms.

Well, the spec says that binding a buffer is “lightweight”, as opposed to VAR, where it is a “heavyweight” operation.

This describes pretty well how to use VBO. A buffer change is an operation which does not hurt performance (by definition in the spec), so it is expected to be used often.

This means, one VBO per object is the expected use of VBO.

And I am quite certain that a lot of programs will use lots of smaller buffers instead of one large buffer, so vendors will be forced to optimize for that case.

Jan.

Carmack never created MiniGL. Carmack used the ‘MiniGL’ driver on 3Dfx Voodoo hardware. MiniGL was kind of a nickname for 3Dfx’s OpenGL ‘implementation’ on Voodoo because it was a separate dll that didn’t install as an ICD called through OpenGL32.dll. It wasn’t OpenGL compliant and only supported some rendering paths and functionality (which Carmack used). Some form of OpenGL was already in place when Carmack decided to port Quake over to it, and he did the initial version in about a weekend; I think he mentioned this in a .plan way back in the day.

It is important to specify memory use with VBO’s and not the actual implementation mechanism. This is essential for hardware and implementation abstraction. It is also important to see how this gets used, to allow implementors to optimize appropriately. It’s early days, but it will fall into place. It’s a new feature, it’s carefully designed to accommodate everyone’s hardware and dispatch designs, and it’s intended to be powerful and futureproof. If you don’t understand this then you just don’t get it.

Failure to do so constitutes a failure of the standard.

No, it only demonstrates a need for that particular implementation to provide better optimizations. The standard, the spec itself, is a very good one.

The VBO spec gives the driver the ability to make vertex transfer as fast as it possibly can be. If drivers don’t take advantage of it enough, or haven’t had the time to (good memory management is difficult, and usually requires a working model to determine memory usage patterns), then it is the fault of the driver/implementation, not of the spec itself.

Not even D3D Vertex Buffers are as flexible and fast as VBO’s potentially can be.

If you think the VBO spec is flawed, what suggestions would you make for changing it?

However, I sympathize with people who need more out of them. They should probably stick to VAR+VOB for the foreseeable future.

I disagree.

You never need performance so badly that you should abandon VBO for VAR/VAO. VBO is the modern API for passing vertices to hardware; it is the path that will be optimized in the future (and the near-term present).

But if you look at the first post, and how people are using crazy methods of managing VBO memory, it is not much more advanced than VAR.

That’s called “pathological use of the API”. You can do these things, and they are legal, and, on any given implementation, they might give a performance boost. But they are not suggested ways to use the API and, in general, not a good idea. Indeed, I would suggest, given the spec, that any usage pattern that attempts to do significant memory management is not the proper usage pattern for the API. Clearly, the idea is for the driver to handle memory management.

Well, the spec says that binding a buffer is “lightweight”, as opposed to VAR, where it is a “heavyweight” operation.

This describes pretty well how to use VBO. A buffer change is an operation which does not hurt performance (by definition in the spec), so it is expected to be used often.

Well, I read that slightly differently. I don’t read “lightweight” as “free”, but as relatively “low-cost”. A bind (specifically, a glBindBuffer followed by a gl*Pointer call) could require an upload of a vertex buffer object that has been paged out back into video memory. This will require going through the cache. Now, unless you are constantly thrashing, this operation should be virtually non-existent if the buffer is resident.

I do agree that, in general, the correct usage pattern is one VBO per object.
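For reference, a minimal sketch of that one-VBO-per-object pattern, assuming the ARB_vertex_buffer_object entry points are loaded (the Mesh struct and function names are illustrative):

```c
/* One static VBO per object: create once, bind per draw. */
#include <GL/gl.h>
#include <GL/glext.h>

typedef struct {
    GLuint  vbo;
    GLsizei vertCount;
} Mesh;

void meshCreate(Mesh *m, const GLfloat *xyz, GLsizei vertCount)
{
    m->vertCount = vertCount;
    glGenBuffersARB(1, &m->vbo);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, m->vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB,
                    vertCount * 3 * sizeof(GLfloat), xyz,
                    GL_STATIC_DRAW_ARB);   /* static: let the driver place it */
}

void meshDraw(const Mesh *m)
{
    /* One bind + pointer setup per object: "lightweight" per the spec,
       though not necessarily free on every implementation. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, m->vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
    glDrawArrays(GL_TRIANGLES, 0, m->vertCount);
}
```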

dorbie,
you are right when you say that there was some form of OpenGL already in place when Carmack decided to port Quake. I know this; it was already the current Microsoft dll, as we know it. But that’s not the point.
The MiniGL (actually 3dfxGL.dll) implementation was not a driver. It was a simple wrapper for OpenGL over Glide, using only Glide-compatible function calls. Carmack created it to get the Quake 1 demo running on the Voodoo 1 chipset, and it was not a creation of 3dfx (they released their OpenGL implementation around the time the Voodoo 3 came out…).
Quake 2 shipped with this MiniGL, and in the game it was possible to choose between 3dfxGL and standard OpenGL.
This forced the competitors of 3dfx (the market leader at the time) to write an OpenGL driver for their cards at all. (It is a good selling point when you can say that your hw works with games like Quake, Half-Life and Counter-Strike…)

What I am trying to say is: if Carmack hadn’t decided to use this 3dfxGL and OpenGL for his very popular game, then OpenGL drivers for consumer cards on win32 platforms wouldn’t be so important to the hw vendors.
If there were no Quake-engine-supported games around, there would not have been a single OpenGL-only AAA game for a long time.
It is only because it is important to get Quake running that the hw vendors improve their drivers. (Everyone complains about this fact…)

A single, large buffer is fine if you don’t update it after you first upload it. It could still be fine if you only update it using BufferData(), and never map it; this depends on what the driver does. The driver cannot make it efficient if you map and stream data to it (as it doesn’t know which part you want to touch).
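A hedged sketch of that BufferData-only update path (names are illustrative, ARB entry points assumed loaded): the whole store is respecified each time, so the application never holds a pointer into driver memory and there is no map/unmap synchronization point.

```c
/* Replace the entire contents of a dynamic VBO without mapping it. */
#include <GL/gl.h>
#include <GL/glext.h>

void replaceBufferContents(GLuint vbo, const GLfloat *verts, GLsizei bytes)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    /* The driver is free to schedule this copy however it likes; it knows
       the old data is dead because the whole store is being replaced. */
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, bytes, verts, GL_DYNAMIC_DRAW_ARB);
}
```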

When it comes to managing memory, that probably comes more from the DirectX world, where you have to keep the backing store for your vertex buffers (unless you mark them MANAGED, which leads to other problems). Also, when you go through VertexArrayRange(), you have to manage the “one big buffer” yourself, and keep backing store for the data that’s not in the working set.

And, by the way, there’s no way that you can just assume that ARB_VBO is going to be available on consumer machines. We support back to version 28 drivers, and there’s some pressure to go back to 6.xx, because forcing people to upgrade drivers puts them off. Similarly, there’s lots of hardware out there that doesn’t have VBO support even now.