VBO Performance Strategy



Stephen_H
01-18-2004, 07:05 PM
I've seen several people who've developed what is basically a memory management system for mesh data in graphics card memory. They assume, say, that they have 16MB of room to store meshes on the card, and split that up into smaller, geometrically sized VBOs. E.g. they'll have:

11 VBOs of 1k
22 VBOs of 2k
43 VBOs of 4k
27 VBOs of 8k
33 VBOs of 16k
17 VBOs of 32k

Each time they want to render a mesh, they check whether its data is still cached there from the last frame, and if it's not, they load the data into the oldest available VBO of a suitable size and 'overwrite' the existing contents.

They have essentially built a memory management system by tracking which of the VBOs get used each frame. If a certain size class is underused and another is overused, they can split/join the VBOs into smaller/larger VBOs. They can track usage statistics for each VBO size class and adjust the sizes of the VBOs dynamically.
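
For illustration, a rough sketch of what such a size-class pool might look like (all names here - VboSlot, pool_acquire, meshId - are made up, the buffers are assumed to have been created up front with glGenBuffersARB/glBufferDataARB, and error handling is omitted):

/* One cached slot in a size class; the pool is just an array of these. */
typedef struct {
    GLuint   buffer;    /* VBO name created at startup */
    GLsizei  capacity;  /* size class in bytes: 1k, 2k, 4k, ... */
    unsigned meshId;    /* which mesh currently lives here, 0 = empty */
    unsigned lastFrame; /* frame number of last use, for LRU eviction */
} VboSlot;

/* Return a slot that already holds meshId, or recycle the least recently
   used slot that is big enough and overwrite it with the new data. */
VboSlot* pool_acquire(VboSlot* slots, int count, unsigned meshId,
                      GLsizei bytes, unsigned frame, const void* data)
{
    VboSlot* victim = 0;
    for (int i = 0; i < count; ++i) {
        if (slots[i].capacity < bytes)
            continue;
        if (slots[i].meshId == meshId) {     /* still cached from last frame */
            slots[i].lastFrame = frame;
            return &slots[i];
        }
        if (victim == 0 || slots[i].lastFrame < victim->lastFrame)
            victim = &slots[i];
    }
    /* Not resident: overwrite the oldest suitable slot (victim could be
       null if no size class fits; a real pool would handle that case). */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, victim->buffer);
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, bytes, data);
    victim->meshId = meshId;
    victim->lastFrame = frame;
    return victim;
}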

Can anyone explain to me why this kind of system is advantageous? Aren't you just reproducing the memory management features of the card's drivers?

One of the disadvantages of this is that you need to keep a copy of the mesh in local memory, while with a VBO you don't. (The driver will still keep a copy in AGP, though, yes? Even if you map the VBO and fill it directly?)

edit - I think they're using this kind of system when they have lots of mesh data in memory and can't load it all into the card at once, so they've developed this fancy memory manager. What I'm doing is loading meshes into the card that are visible, or possibly visible in the frustum, and then removing them once they are no longer 'in range' of the camera. Is it better to create a custom memory manager, or to just load all your locally visible meshes into the card, like I've been doing, and let the drivers sort it out?

[This message has been edited by Stephen_H (edited 01-18-2004).]

maximian
01-18-2004, 07:59 PM
I personally allocate as much VBO space as is available and use it. These other systems seem awkward. Why not just allocate one large buffer, and then assign regions of that buffer to objects based on the number of vertices, attributes, etc.? This approach mimics older memory managers used in apps and systems.
Obviously there are issues with fragmentation and prioritizing, but there is plenty of theory and code available as examples.
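
A rough sketch of that sub-allocation idea, assuming a simple bump allocator and the usual buffer-offset idiom (BigVbo, bigvbo_alloc and the field names are invented for illustration):

#define BUFFER_OFFSET(i) ((char*)0 + (i))

typedef struct {
    GLuint        buffer;    /* one large VBO created at startup */
    GLintptrARB   nextFree;  /* next unused byte inside the buffer */
    GLsizeiptrARB capacity;  /* total size passed to glBufferDataARB */
} BigVbo;

/* Reserve 'bytes' inside the big buffer and upload the data there.
   Returns the byte offset, or -1 if the buffer is full. */
GLintptrARB bigvbo_alloc(BigVbo* b, GLsizeiptrARB bytes, const void* data)
{
    GLintptrARB offset = b->nextFree;
    if (offset + bytes > b->capacity)
        return -1;
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, b->buffer);
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, offset, bytes, data);
    b->nextFree = offset + bytes;
    return offset;
}

/* At draw time: bind the shared buffer once, then point into it, e.g. */
/*   glVertexPointer(3, GL_FLOAT, 0, BUFFER_OFFSET(offset));          */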

Stephen_H
01-18-2004, 08:17 PM
I'm trying to reconcile this with some recent posts by Cass:


Too many VBOs and you pay some (marginal) penalty for more frequent VBO state changes. Too few VBOs and you pay a (potentially very high) penalty for forcing a coherent CPU/GPU view of an unnecessarily large chunk of memory. Forcing this coherency requires either synchronization stalling or lots of in-band data copying. This is a real waste if that coherency is not essential.

Small VBOs solve the coherency problem and make driver-side memory management much easier. In the long term, I expect one or two attribs for a few hundred vertexes per VBO to be "free". And it will never hurt (though it may not help much) to pack multiple attributes (perhaps from multiple objects) into a single VBO -- if they are static or nearly static. This is probably a good idea if you have lots of static objects with very few vertices - though if you don't render these things all at the same time, immediate mode may be better still.

He talks about large/small VBO issues in these two threads:
http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/011120.html
http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/011194.html

Can the drivers do better optimization than allocating your own VBO/VBOs and using a customized memory management scheme?

Btw, this brings up an interesting side topic. I've heard PCI Express has a pretty good speed for getting data back from the gfx card. Once PCI Express arrives this summer, will drivers keep only one copy of the mesh and keep passing it off between CPU and GPU, or still two copies? If drivers automatically take advantage of PCI Express to do things like this, people who have a custom memory management architecture like this will lose out on the benefits of PCI Express.

Korval
01-18-2004, 08:46 PM
Why not just allocate one large buffer, and then assign regions of that buffer to objects based on the number of vertices, attributes, etc.?

Because it gives the driver no way to manage VBO's?

VBO's aren't just a way to allocate video memory; you may as well be using VAR. VBO's are a way to let the driver manage your memory for you. It can come up with the best way to store vertices/indices. It can do so because driver writers are in a better position than the user to know how the hardware works and what the hardware wants.

maximian
01-19-2004, 09:57 AM
What management? I implemented a system with multiple VBOs and it was significantly slower.
Using static always; a single VBO works great.
As for streaming or write VBOs, more likely than not your data will not be written to video memory, and you will lose all the benefits of a VBO.

In that case, a single streaming vbo will be fine.

Korval
01-19-2004, 12:10 PM
What management? I implemented a system with multiple VBOs and it was significantly slower.

On which implementation? The behavior differs based on the implementation. It is well known that nVidia's VBO implementation is highly complex and attempts to figure out your intent.


Using static always; a single VBO works great.

Unless, of course, you have large quantities of data in memory at once (say, 32MB of vertex data and 128MB of textures). The driver is not free to page out parts of your VBO, because it must be an all-or-nothing thing. Which means that you've dropped your available texture memory down to 96MB. So, if you try to use all 128MB in one frame, you'll thrash and lose all kinds of performance.


As for streaming or write VBOs, more likely than not your data will not be written to video memory, and you will lose all the benefits of a VBO.

Unless you're using nVidia's implementation, which does indeed use video memory for streaming buffers. Check out their VBO .pdf.

OldMan
01-19-2004, 11:12 PM
Want my opinion?
Well, when VBO arrived it was supposed to be a way to banish the need for specific code paths for NV and ATI hardware with their proprietary extensions.

But the peculiarities of each VBO implementation are no less restrictive than the use of two different extensions. You may even have to use two different models in your application again, each one reflecting the peculiarities of each implementation.


For me, it is still a mess!!!!

Korval
01-20-2004, 03:36 AM
For me, it is still a mess!!!!

Stop trying, then.

The current issues with VBO stem from the fact that VBO's can be used in different ways. For example, Maximian's single large VBO, vs the normal method of allocating object-sized VBO's.

It is unclear which is best because the spec doesn't define which is best. That's up to the driver. To this point, driver writers can decide to optimize any VBO usage pattern they want, because nobody has released a product that uses VBO's professionally. Once Doom 3 ships with VBO's as the standard vertex sending mechanism, then driver writers have a particular method of using VBOs that must run fast.

Until then, just do what is most convenient to you, the user. Static VBOs will beat regular vertex arrays in virtually all circumstances, so it's still a win. Until there is a specific impetus on the part of driver writers to make certain usages of VBOs more appropriate than others, just use VBOs as you see fit, understanding that the optimal VBO usage may change with the release of Doom 3.

zeckensack
01-20-2004, 05:28 AM
Originally posted by maximian:
What management? I implemented a system with multiple VBOs and it was significantly slower.
Using static always; a single VBO works great.

That's to be expected. For static data, there is little management to do, precisely because it is static. Cass stated as much. By going with a single, large VBO, you're eliminating the overhead of switching between VBOs.


As for streaming or write VBOs, more likely than not your data will not be written to video memory, and you will lose all the benefits of a VBO.

In that case, a single streaming VBO will be fine.

Did you try?
A driver may double buffer some VBOs and - surprise - this is most beneficial for non-static buffers. Double buffering of VBOs trades extra memory for fewer synchronisation stalls. For a lot of applications that would be a good thing.

It's not hard to explain:
If you have one huge VBO with dynamic data, every time you update a portion of that data (worst case: via glMapBuffer), all rendering that references this buffer must be finished first. If you have one VBO per dynamic mesh, the driver only needs to finish rendering that single mesh (which, in the majority of cases, was submitted approximately one frame earlier).
If the driver performs VBO double buffering, there is no need to finish anything at all. Rendering can continue from the "old version", while you get a "new version" for the new data.
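
The application-side version of the same idea is to respecify a dynamic buffer before writing it, so the driver can hand back fresh storage instead of waiting on draws that still reference the old contents. A minimal sketch (the mesh fields and sizes are placeholders, and <string.h> is assumed for memcpy); whether a given driver actually takes advantage of the NULL respecification is, of course, up to the driver:

/* Per-mesh dynamic update that tries not to stall on the previous frame. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, mesh->dynamicVbo);

/* "Orphan" the old storage: the driver may keep rendering from it while
   giving us a new block for this frame's data. */
glBufferDataARB(GL_ARRAY_BUFFER_ARB, mesh->bytes, NULL, GL_STREAM_DRAW_ARB);

void* dst = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if (dst) {
    memcpy(dst, mesh->cpuVertices, mesh->bytes);  /* sequential writes only */
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
}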

maximian
01-20-2004, 06:28 AM
1) 'Video memory is used for streaming VBOs.'
This is questionable. While possibly true, I have found VBO performance to equal regular VA performance when using streaming (nVidia + latest drivers + GeForce FX + TNT2). So if it is using video memory, it is not much of a speed enhancement.

2) If your data is quite small, one large VBO may not work. I deal mainly with very large models, few models per scene. Perhaps the ARB should modify the VBO spec so that you can map only a portion of the VBO, the portion you will change (see the sketch after this list). In that case, write or streaming usage will be best.

3) Synchronization issues are still present with VBOs. If you try to write to a VBO (map it, whatever) while it is being rendered from, you will cause errors. I tested this thoroughly. I then used semaphores to control the write/render code. Again, no sync on the part of the driver.

4) I agree with OldMan. VBO implementations differ from each other almost as much as VAR and ATI's vertex array object extension did. The API is the same for all, but you need special handling. Telling someone to stop trying is not constructive.

5) Finally, who made Carmack boss of the graphics universe? If the ARB had done its job, then the functionality of VBOs would work similarly across all platforms. Performance would not be the same, but at least the way it worked would be uniform. If OpenGL cannot get vendors to implement standards in a platform-consistent manner, then it has a serious problem. This is of course not new; even standard OpenGL code will sometimes behave quite differently depending on the vendor.
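
On point 2: the extension as specified does at least let you replace a sub-range without mapping the whole buffer, via BufferSubData; mapping only a sub-range is what is missing. A minimal sketch (the offset/size variables are illustrative; the data still has to come from a CPU-side copy):

/* Update only the changed region of a large VBO. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, bigBuffer);
glBufferSubDataARB(GL_ARRAY_BUFFER_ARB,
                   changedOffsetBytes,  /* where the dirty region starts */
                   changedSizeBytes,    /* how many bytes changed */
                   cpuCopyOfDirtyData); /* source in system memory */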

OldMan
01-20-2004, 06:58 AM
If we need to wait for D3 to have reliable and predictable VBO performance, I prefer to keep using VAR and vertex objects until D3 comes out! Because the MAIN advantage of VBO is missing!

But at least Carmack could tell us how VBO will be used in the future, since it looks like his decision will prevail anyway....

AdrianD
01-20-2004, 07:16 AM
Originally posted by maximian:
[...]
5) Finally, who made Carmack boss of the graphics universe.

Don't forget: Carmack created the first successful OpenGL game by creating his own OpenGL implementation - MiniGL for Quake.
This way he forced most HW vendors to add OpenGL support for their hardware.
By creating Quake 2/3 he pushed the HW vendors' OpenGL development forward even more (i.e. optimization of VAs),
not to mention all the games created by other companies using the id engine...
All vendors want to run his code as fast as possible on their HW - it has been their reference for fast OpenGL implementations since 1998, and therefore it's important what he does.

Korval
01-20-2004, 07:36 AM
This is questionable.

This is what nVidia said. If you find it "questionable", question them. Until they say that it isn't the case, then it is, as far as I'm concerned.


While possibly true, I have found vbo performance to equal regular va when using streaming.

Which is hardly surprising because you're streaming. You're generating data per-frame in some fashion, so you're required to do a copy operation. As such, you can only go so fast.

Also, from nVidia's document, it seems that their glMapBuffer call for streaming buffers just allocates some system memory, rather than giving you a direct pointer to the actual memory. This means that streaming can't be faster than standard arrays since you aren't writing directly to the buffer memory itself.


If your data is quite small, one large VBO may not work. I deal mainly with very large models, few models per scene. Perhaps the ARB should modify the VBO spec so that you can map only a portion of the VBO, the portion you will change. In that case, write or streaming usage will be best.

This would require that they perpetuate the idea that using VBO like VAR is a good idea. It is not. One of the primary problems VBOs are supposed to solve is getting buffer management out of the user's hands and into the driver's hands.


Synchronization issues are still present with VBOs. If you try to write to a VBO (map it, whatever) while it is being rendered from, you will cause errors. I tested this thoroughly.

That's not a sync error. That's either an error in your code or an error in the implementation in question. In either case, it has nothing to do with the VBO spec.


The API is the same for all, but you need special handling. Telling someone to stop trying is not constructive.


It's far more constructive than complaining about something that is, by definition, implementation dependent.


Finally, who made Carmack boss of the graphics universe?

People who keep buying games/engines made by Id Software. Like it or not (and I don't), that gives him clout that we don't have. And it is his choices in how to use VBOs that we will be expected to follow.

Fortunately, he's usually pretty reasonable in using an API, so I don't expect his VBO usage to be totally off the beaten path.


If the ARB had done its job, then the functionality of VBOs would work similarly across all platforms. Performance would not be the same, but at least the way it worked would be uniform. If OpenGL cannot get vendors to implement standards in a platform-consistent manner, then it has a serious problem.

There is a difference between performance and functionality. Functionality is what a spec defines; not performance. The standard doesn't prevent an implementation from being faster in certain usage patterns of the API. Just like the spec doesn't prevent an implementation from punishing the user for pathological (though legal) use of the API.

For example, the whole VBO-as-VAR thing. This is a legitimate usage pattern for VBOs. The behavior of this pattern is well understood, and all implementations of VBO should be able to function with it. How fast it is must always be implementation dependent. For some implementations, like nVidia's, where bind-equivalent actions incur some overhead, VBO-as-VAR is faster than multiple VBOs. For other implementations, where bind-equivalent actions are just like any other state change, the VBO-as-VAR paradigm is, at best, no faster than normal.

The point is that expected performance is, and must always be, implementation dependent. The best the spec could ever do is give guidelines about what the "fast" path should be.


If we need to wait for D3 to have reliable and predictable VBO performance, I prefer to keep using VAR and vertex objects until D3 comes out! Because the MAIN advantage of VBO is missing!

The primary purpose of the VBO extension is to provide for a cross-platform solution to vertex transfer that gives the driver the flexibility to manage buffer memory as it needs to. VBO effectively solves that purpose. Performance is, and has always been, implementation dependent.

You can get reliable performance out of VBOs today. You may not get optimal performance out of all implementations, but if you use the VBO API reasonably, then all implementations should be reasonably fast. Maybe not as fast as possible, but decently fast.

If you want to keep using extensions like VAR and the multitude of VAO extensions, good for you. Just don't complain when ATi pulls the plug on VAO. And don't complain that you have to use VAR/VAO; you have a perfectly viable alternative.

maximian
01-20-2004, 08:01 AM
You misread my post completely. I said performance is not the problem. It is the usage pattern, i.e. how you expect the VBO to work, that should be uniform across all platforms. Yes, I am right about this. It is not a standard, and does not solve the issue of transferring vertex data, if I have to code it differently based on the chip. I said it before and I say it again:
performance deltas between implementations are not a big issue. However, VBOs should work almost the same across all platforms.

1) Just because nVidia says so does not make it so. I have plenty of data and tests to back my assertions about the way VBOs actually work. As for streaming, if it were double buffered there is no reason why you could not get some performance improvement from the use of video memory. If it had been implemented correctly, a streaming/write VBO should also offer a performance improvement, especially if you're not trying to rewrite the whole VBO for every pass. (This alone suggests their caching algorithm is fairly rudimentary, if it exists at all.)

Again, the whole point of VBO is to enhance performance. We do not need yet another VA. How much it helps will vary, but if a path is supposed to enhance performance, then it should. Failure to do so constitutes a failure of the standard.

2) I have nothing against Carmack. However, my criticism is correct. It is useless to have an ARB if we have to wait for a developer to specify HOW something should work. You prove this point for me.

3) I am not complaining at all. I am merely pointing out the problems. VBOs work very well for me. However, I sympathize with people who need more out of them. They should probably stick to VAR+VAO for the foreseeable future.

4) Going back to your argument about VAR not being the right way to do it: YES, you are right. But if you look at the first post and how people are using crazy methods of managing VBO memory, then it is not much more advanced than VAR. It only hides the sync issue you mentioned. As with system memory, I should be able to allocate a block of memory and then map subregions of it. With the current way of doing things, switching between all those VBOs will cost you dearly in performance. Not to mention the hassle, and also the redundancy, of managing your VBO memory yourself, something that should be left to the driver.

Jan
01-20-2004, 09:04 AM
If the ARB had done its job, then the functionality of VBOs would work similarly across all platforms.

Well, the spec says that binding a buffer is "lightweight", as opposed to VAR, where it is a "heavyweight" operation.

This describes pretty well how to use VBO. A buffer change is an operation which does not hurt performance (by definition in the spec), so it is expected to be used often.

This means one VBO per object is the expected use of VBO.

And I am quite certain that a lot of programs will make use of lots of smaller buffers instead of one large buffer, so vendors are forced to optimize for that case.

Jan.

dorbie
01-20-2004, 10:13 AM
Carmack never created MiniGL. Carmack used the 'MiniGL' driver on 3Dfx Voodoo hardware. MiniGL was kind of a nickname for 3Dfx's OpenGL 'implementation' on Voodoo because it was a separate DLL that didn't install as an ICD called through OpenGL32.dll. It wasn't OpenGL compliant and only supported some rendering paths and functionality (which Carmack used). Some form of OpenGL was already in place when Carmack decided to port Quake over to it, and he did the initial version in about a weekend, as I think he mentioned in a .plan way back in the day.

dorbie
01-20-2004, 10:20 AM
It is important to specify memory use with VBOs and not the actual implementation mechanism. This is essential for hardware and implementation abstraction. It is also important to see how this gets used, to allow implementors to optimize appropriately. It's early days, but it will fall into place. It's a new feature, it's carefully designed to accommodate everyone's hardware and dispatch designs, and it's intended to be powerful and futureproof. If you don't understand this then you just don't get it.

Korval
01-20-2004, 02:35 PM
Failure to do so constitutes a failure of the standard.

No, it only demonstrates a need for that particular implementation to provide better optimizations. The standard, the spec itself, is a very good one.

The VBO spec gives the driver the ability to make vertex transfer as fast as it possibly can be. If drivers don't take advantage of it enough, or haven't had the time to (good memory management is difficult, and usually requires a working model to determine memory usage patterns), then it is the fault of the driver/implementation, not of the spec itself.

Not even D3D Vertex Buffers are as flexible and fast as VBO's potentially can be.

If you think the VBO spec is flawed, what suggestions would you make for changing it?


However, I sympathize with people who need more out of them. They should probably stick to VAR+VAO for the foreseeable future.

I disagree.

You never need performance so badly that you should abandon VBO for VAR/VAO. VBO is the modern API for passing vertices to hardware; it is the path that will be optimized in the future (and near-term present).


But if you look at the first post and how people are using crazy methods of managing VBO memory, then it is not much more advanced than VAR.

That's called "pathological use of the API". You can do these things, they are legal, and, on any given implementation, they might give a performance boost. But they are not suggested ways to use the API and, in general, not a good idea. Indeed, I would suggest, given the spec, that any usage pattern that attempts to do significant memory management is not the proper usage pattern for the API. Clearly, the idea is for the driver to handle memory management.


Well, the spec says that binding a buffer is "lightweight", as opposed to VAR, where it is a "heavyweight" operation.

This describes pretty well how to use VBO. A buffer change is an operation which does not hurt performance (by definition in the spec), so it is expected to be used often.


Well, I read that slightly differently. I don't read "lightweight" as "free", but as relatively "low-cost". A bind (specifically, a glBindBuffer followed by a gl*Pointer call) could require an upload of a vertex buffer object that has been paged out back into video memory. This will require going through the cache. Now, unless you are constantly thrashing, this operation should be virtually non-existent if the buffer is resident.

I do agree that, in general, the correct usage pattern is one VBO per object.
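
For reference, the per-object pattern being discussed looks roughly like this (one static VBO per mesh; the Vertex struct, offsets and mesh fields are illustrative):

#define BUFFER_OFFSET(i) ((char*)0 + (i))

/* Once, at load time. */
glGenBuffersARB(1, &mesh->vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, mesh->vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, mesh->vertexBytes, mesh->vertices,
                GL_STATIC_DRAW_ARB);

/* Every frame, per object: the "lightweight" bind plus pointer setup. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, mesh->vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glVertexPointer(3, GL_FLOAT, sizeof(Vertex), BUFFER_OFFSET(0));
glNormalPointer(GL_FLOAT, sizeof(Vertex), BUFFER_OFFSET(3 * sizeof(GLfloat)));
glDrawElements(GL_TRIANGLES, mesh->indexCount, GL_UNSIGNED_SHORT,
               mesh->indices); /* indices kept in system memory here */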

AdrianD
01-20-2004, 07:33 PM
dorbie,
you are right when you say that there was some form of OpenGL already in place when Carmack decided to port Quake. I know this; it was already the current Microsoft DLL, as we know it. But that's not the point.
The MiniGL (actually 3dfxGL.dll) implementation was not a driver. It was a simple wrapper for OpenGL over Glide, using only Glide-compatible function calls. Carmack created it to get the Quake 1 demo running on the Voodoo 1 chipset, and it was not a creation of 3dfx (they released their own OpenGL implementation around the time the Voodoo 3 came out...).
Quake 2 shipped with this MiniGL, and in the game it was possible to choose between 3dfxGL and standard OpenGL.
This forced the competitors of 3dfx (the market leader at the time) to write an OpenGL driver for their cards at all. (It is a good selling point when you can say that your hardware works with games like Quake, Half-Life and Counter-Strike... ;))

What I am trying to say is: if Carmack hadn't decided to use this 3dfxGL and OpenGL for his very popular game, then OpenGL drivers for consumer cards on win32 platforms wouldn't have been so important to the HW vendors.
If there were no Quake-engine-powered games around, there would not have been a single OpenGL-only AAA game for a long time.
Only because it is important to get Quake running do the HW vendors improve their drivers. (Everyone complains about this fact...)

jwatte
01-20-2004, 07:48 PM
A single, large buffer is fine if you don't update it after you first upload it. It could still be fine if you only update it using BufferData(), and never map it; this depends on what the driver does. The driver cannot make it efficient if you map and stream data to it (as it doesn't know which part you want to touch).

When it comes to managing memory, that probably comes more from the DirectX world, where you have to keep the backing store for your vertex buffers (unless you mark them MANAGED, which leads to other problems). Also, when you go through VertexArrayRange(), you have to manage the "one big buffer" yourself, and keep backing store for the data that's not in the working set.

And, by the way, there's no way that you can just assume that ARB_VBO is going to be available on consumer machines. We support back to version 28 drivers, and there's some pressure to go back to 6.xx, because forcing people to upgrade drivers puts them off. Similarly, there's lots of hardware out there that doesn't have VBO support even now.

dorbie
01-21-2004, 05:51 AM
Carmack didn't create MiniGL, 3Dfx did to cover the functions he was using in glQuake.

Here's a quote from his .plan 11/23/96

"GLQuake: 3DFX is writing a mini-gl driver that will run glquake. We exepect very high performance. 3Dlabs is also working with me on improving some aspects of their open-GL performance."

The implementation details of the MiniGL *driver* are not the issue. Nobody appreciates Carmack's contributions more than me. Just because I make a simple factual correction, don't assume I'm taking a position on other things you post.

[This message has been edited by dorbie (edited 01-21-2004).]

AdrianD
01-21-2004, 10:17 AM
dorbie, I am sorry - I had the wrong information.
Thanks for correcting me.

MikeC
01-21-2004, 10:34 AM
As a side note, there's something I've been wondering recently. When writing to mapped VBO memory residing on the graphics card, is there likely to be a major performance hit for scattered writes, as opposed to nice tidy writes to sequential memory locations? I'd imagine so, but don't know a lot about bus architectures.

If nobody knows offhand, I'll test it and see; I'm only asking 'cos I'm lazy...

maximian
01-21-2004, 10:59 AM
If you do test this, could you please post the results. I am also curious. Thanks.

Korval
01-21-2004, 11:53 AM
When writing to mapped VBO memory residing on the graphics card, is there likely to be a major performance hit for scattered writes, as opposed to nice tidy writes to sequential memory locations?

It is implementation dependent. However, the fastest implementations of mapping will definitely have this limitation. So, you should probably assume that sequential writes are the way to go.

Licu
01-21-2004, 11:32 PM
We are using all the vertex buffer implementations in our engine: VBO, VAR and VAO. While at the beginning VBO was somewhat slower than VAR and VAO, with time both NVIDIA and ATI improved their drivers and now VBO is somewhere around 5-10% faster. Due to our game types, we have many vertex buffers (thousands in a scene and hundreds visible per frame). The VBO design makes the update and rendering calls less expensive when you have many objects. Also, the client code path for VBO is far less complex than, for example, the memory management for VAR. Furthermore, even NVIDIA tells developers to use VBO rather than VAR (which is a rare thing considering their position regarding their own extensions vs. ARB extensions).

I expect VBO to become faster and faster in the future, while custom vendor extensions are kept in the driver only for compatibility with older applications. New hardware will be made to be fast for VBO, and the presence of this standard extension will make different vendors' hardware very similar in usage. Without this extension there would be no hope for unified geometry data management at all.

orbano
01-22-2004, 09:46 AM
Originally posted by Korval:
Well, I read that slightly differently. I don't read "lightweight" as "free", but as relatively "low-cost". A bind (specifically, a glBindBuffer followed by a gl*Pointer call) could require an upload of a vertex buffer object that has been paged out back into video memory. This will require going through the cache. Now, unless you are constantly thrashing, this operation should be virtually non-existent if the buffer is resident.

I do agree that, in general, the correct usage pattern is one VBO per object.

Binding buffers and setting pointers is so lightweight that you will never have to bother about it! I have tested it with about 12000 small objects (fewer than 100 vertices), each having a separate VBO for vertices, normals and texture coordinates. The binding took about 1/100th of the time the glDrawElements calls took. Maybe I misunderstood the profiling results, or don't know exactly what is behind these function calls, but I think VBO's driver-side management is well developed (I could even achieve maximum vertex throughput with 180 MB of model data loaded this way!!!). (I don't know if nVidia's solution is as good as ATI's, but it seemed that nVidia supports VBO better than VAR)...

MikeC
01-22-2004, 12:27 PM
Originally posted by licu:
Due to our game types, we have many vertex buffers (thousands in a scene and hundreds visible per frame). The VBO design makes the update and rendering calls less expensive when you have many objects.

Are these results for one big buffer, or one buffer per object?

I've never quite understood how VBO is better than display lists for the latter case.

orbano
01-22-2004, 02:02 PM
Originally posted by MikeC:
Are these results for one big buffer, or one buffer per object?

I've never quite understood how VBO is better than display lists for the latter case.

You should read the specs of VBO and display lists...

MikeC
01-22-2004, 03:22 PM
Originally posted by orbano:
You should read the specs of VBO and display lists...

I have. In the general case, and for dynamic data in particular, sure. But for static data, with one buffer per object, I can't see any win over DLs except maybe a slightly faster setup, and I'd imagine that the DL will compile to something very like a static VBO behind the scenes. I suppose it boils down to a tradeoff between elegance (consistent use of VBOs throughout) and compatibility with old drivers.

If I'm missing something (entirely possible) feel free to point at me and laugh, but I'd appreciate it if some kind soul could put me out of my ignorance.

stefan
01-23-2004, 04:32 AM
Originally posted by MikeC:
If I'm missing something (entirely possible) feel free to point at me and laugh, but I'd appreciate it if some kind soul could put me out of my ignorance.

If you're using LODs for your geometry where the different levels share the vertices but use different indices it may make a big difference in terms of memory usage.
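
A sketch of that layout - one shared ARRAY_BUFFER for the vertices, one ELEMENT_ARRAY_BUFFER per LOD (the model/lod fields are made up):

#define BUFFER_OFFSET(i) ((char*)0 + (i))

/* The vertex data is shared by every LOD. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, model->sharedVertexVbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, BUFFER_OFFSET(0));

/* Each LOD owns only its index buffer. */
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, model->lod[level].indexVbo);
glDrawElements(GL_TRIANGLES, model->lod[level].indexCount,
               GL_UNSIGNED_SHORT, BUFFER_OFFSET(0));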

MikeC
01-23-2004, 01:45 PM
Originally posted by stefan:
If you're using LODs for your geometry where the different levels share the vertices but use different indices it may make a big difference in terms of memory usage.

Hmm, good point. Not sure I'd want to do LOD that way - it costs the footprint of the highest-resolution LOD even if you're only using the lowest-resolution one, and doesn't sound very cache-friendly - but it's an interesting approach.

Thanks,
Mike

Korval
01-23-2004, 04:46 PM
Hmm, good point. Not sure I'd want to do LOD that way - it costs the footprint of the highest-resolution LOD even if you're only using the lowest-resolution one, and doesn't sound very cache-friendly - but it's an interesting approach.

Well, the foot-print problem is certainly liveable, as you don't want to make the driver do an upload of a new VBO (if it was paged out) just for a new LOD.

And no, it isn't cache friendly, so your lower LODs won't be as fast as they could be. But you save memory by not having multiple VBOs around, so it can definitely be worth it.

zeckensack
01-24-2004, 12:37 AM
I don't know if this example is any good.
If smaller geometrical LODs are small enough(TM), the memory usage is well bounded and won't be much of a problem. Compare this with mipmapping, where the extra memory is absolutely bounded at about a third of the base texture, regardless of how big the base texture is.

I'm a bit fuzzy on the math right now, so I don't know whether LOD n+1 needs to be strictly a quarter of LOD n in size, or if any exponential decay is fine.

MikeC
01-24-2004, 02:39 AM
Originally posted by Korval:
you don't want to make the driver do an upload of a new VBO (if it was paged out) just for a new LOD.

Really? In an ideal world, no. But if you aren't currently using an LOD, and it's hogging vidmem needed by things you *are* using, paging it out until it's needed sounds like a perfectly reasonable thing to do.

orbano
01-24-2004, 05:16 AM
Yes, any exponential decay will do:
a_n = 1/c^n, where c > 1
(yeah, I know it's not really a function, but I don't know its name in English - a geometric sequence, I think). The total is then bounded by the geometric series 1 + 1/c + 1/c^2 + ... = c/(c-1) times the size of the base LOD; for c = 4 (mipmap-style) that is at most a third extra.
And about DLs and VBOs: AFAIK DLs don't have to be in video/AGP memory; they are just compiled into OpenGL's own memory. Please kick me if I'm wrong, but that is how I understood the GL specs.

TheSillyJester
01-24-2004, 05:30 AM
Well, I have tried both techniques for my landscape (using the same VB with different LOD IBs, and geomipmapping VBs too) and the results are *totally* the same.
With the first method there was 5 MB of VBO and with the second only 350 KB (50k tris rendered with both methods). Note that if a chunk of land changes its LOD, it deletes its VBO and creates a new one. All VBOs are static.

Tested on an nVidia GeForce 4.
A screen: http://esotech.free.fr/Clipboard01.jpg

It's not the size of the VBO that matters but how much you draw from it.

V-man
01-25-2004, 07:12 PM
Originally posted by MikeC:
As a side note, there's something I've been wondering recently. When writing to mapped VBO memory residing on the graphics card, is there likely to be a major performance hit for scattered writes, as opposed to nice tidy writes to sequential memory locations? I'd imagine so, but don't know a lot about bus architectures.

If nobody knows offhand, I'll test it and see; I'm only asking 'cos I'm lazy...

On PC's, this is not possible. If you wish to write to such a VBO, then the object will be brought into system memory because on PCs, there is a risk that the buffer may be lost.

Read the part that says
"What happens to a mapped buffer when a screen resolution change or
other such window-system-specific system event occurs?"

I'm not sure though. It could be system mem or AGP.

forgottenaccount
01-26-2004, 12:59 PM
Originally posted by Korval:
nobody has released a product that uses VBO's professionally.

I know of at least one shipping application that uses VBOs. We used them on Homeworld2 as the preferred method of storing geometry. :) Homeworld2 shipped four months ago, in September 2003.

Homeworld2 made a few magazine covers and won its share of game-of-the-month awards, so I would think it is popular enough to count for something, but I won't compare its popularity/performance influence to what I expect from Doom 3. :)

A bit about HW2 for those interested, because I haven't really posted here much. HW2 uses VBOs, and if they aren't supported HW2 falls back to display lists. We detect the renderer and driver version and look up in a list of known buggy drivers whether we should or should not use display lists. Quite often we also disable display lists and fall back on (compiled) vertex arrays. Ugh... I think with VBOs we may never see bug-free display list support from certain vendors. :(

Homeworld2 uses fragment programs for all rendering on advanced cards and has a bit of use of vertex programs too. Shadows are done with shadow maps. We don't support VAR or VAO, only VBO.

We used one VBO per object which I assume is the way one should try to use them. VBOs should be getting pretty stable as they are part of the new core and the standard is a year old.

Korval
01-26-2004, 01:46 PM
We used one VBO per object which I assume is the way one should try to use them.

OK, that's one shipping game that uses the one-VBO-per-object pattern. Good.

I seem to recall that ATi's drivers at the time of HW2's release had some issues with the game. Did this have something to do with their VBO implementation at the time, and did ATi correct the problem?

forgottenaccount
01-27-2004, 09:21 AM
Originally posted by Korval:
I seem to recall that ATi's drivers at the time of HW2's release had some issues with the game. Did this have something to do with their VBO implementation at the time, and did ATi correct the problem?

The initial VBO support from ATI had some issues with dynamic buffers, but ATI fixed them before we shipped. When we shipped we weren't aware of any issues with their current drivers at the time. Still, at first, a couple users claimed they needed to run Homeworld2 with the -noVBO command line parameter to disable VBOs in order to play the game, but ATI seems to have identified and fixed most of those cases now.

MZ
01-27-2004, 10:29 AM
I have looked into the Star Wars: K.O.T.O.R. executable and found some extension names.
VBO is among them. But VAR and fence are there too...

[BTW, is this the first game for both Xbox and PC which doesn't use DX Graphics?]

jwatte
01-27-2004, 11:35 AM
We use VBO if available and well supported (doing a graphics driver version look-up); else we use VAR/fence if available; else we use vertex arrays. We previously did display lists, but there are too many problems with them, and they use more memory than vertex arrays.
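
The selection logic described above might look roughly like this (driverIsOnGoodList is a made-up placeholder for the version look-up, strstr-based extension checking is the quick-and-dirty form, and <string.h> is assumed):

/* Pick a vertex-submission path at startup. */
const char* ext = (const char*)glGetString(GL_EXTENSIONS);
if (strstr(ext, "GL_ARB_vertex_buffer_object") && driverIsOnGoodList())
    path = PATH_VBO;
else if (strstr(ext, "GL_NV_vertex_array_range") &&
         strstr(ext, "GL_NV_fence"))
    path = PATH_VAR_FENCE;
else
    path = PATH_PLAIN_VERTEX_ARRAYS;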

Regarding LOD, if you do progressive meshes with sliding windows for LOD, then you need to keep all the verts in a single buffer -- and you can't even do progressive meshes at all using display lists.

I've found that the OpenGL support for the tier 1 hardware vendors is great, for the tier 1 integrated vendor it's great, too (although the performance is ... integrated), but the OpenGL support on the bottom 20% of the market is so bad that we can't run on those chips. (They crash within 3 seconds, typically, and I've never gotten a reply from any of the e-mails I've sent to Taiwan headquarters or US sales about the problems)

Korval
01-27-2004, 01:14 PM
That's one of the unfortunate advantages of the DirectX driver methodology. By taking on so much of the code-work themselves, Microsoft makes writing relatively bug-free DX drivers easy, as long as the DX component in question is relatively bug-free. OpenGL, especially with the higher-level extensions (glslang, VBO, ARB_fp/vp), makes writing implementations orders of magnitude harder than it used to be.

AdrianD
01-28-2004, 05:18 AM
Originally posted by forgottenaccount:
The initial VBO support from ATI had some issues with dynamic buffers, but ATI fixed them before we shipped. When we shipped we weren't aware of any issues with their current drivers at the time. Still, at first, a couple users claimed they needed to run Homeworld2 with the -noVBO command line parameter to disable VBOs in order to play the game, but ATI seems to have identified and fixed most of those cases now.

Right now I have installed the new ATI drivers (Cat 4.1) over my Cat 3.9. (Because since this driver was released, our hotline has been getting strange crash reports on ATI hardware...)

And suddenly, ALL dynamic accesses to VBOs are screwed up!
It looks like the ATI drivers don't care whether a currently rendered object is mapped or not; the bugs look like the VBO memory is updated DURING rendering. (As an ex-driver developer, I know what this kind of error looks like...)
Static geometry is fine. There are also strange performance drops, every 5th frame or so.

My VBO implementation works perfectly with Cat 3.9 and all Detonators/ForceWares... so I don't think my implementation is wrong.

Until Cat 3.9 I really thought ATI's driver quality had improved over the last few years... but now: how can such a bug pass the beta test???!

skynet
01-28-2004, 06:22 AM
I have also noticed that problem; Catalyst 4.1 is screwing up my dynamic stencil shadows (implemented using VBOs) from time to time. I have mailed ATI's devrel already, but they don't seem to "believe" it, since I'm not able to send a test app proving it. (Can you?) Also, _downloading_ from a VBO is screwed up from time to time. ATI has known about this for a rather long time now, but they didn't do anything about it in the last few driver versions :-(. I think they were too busy implementing those extraordinarily useful "Smartshader Effects" ;-)

V-man
01-28-2004, 08:38 AM
Originally posted by Korval:
OpenGL, especially at the higher-level extensions (glslang, VBO, ARB_fp/vp), makes writing implementations orders of magnitude harder than it used to be.

OK, but eventually your drivers should improve and become less buggy.
But what company are we talking about? Volari is the only one that supports the new ext.

I've noticed a strange problem with Cat 3.10 and 4.1
When I use VBO for rendering one of my objects multiple times, it only renders the first time.

Did anyone else experience this?

AdrianD
01-28-2004, 12:17 PM
V-man: I have no problems with static geometry at all.

I did some tests and found that the glMapBufferARB function causes most of the problems. When I replace all glMapBufferARB calls with corresponding glBufferDataARB calls, everything looks OK, but the performance is still questionable (1/2 of the speed of my non-VBO path).

V-man
01-28-2004, 06:57 PM
AdrianD,
What problem did you see with glMapBuffer?

I have static geometry. Once I upload, I never change it.

To be more clear, my algo looks like this.

RenderObjectX();
RenderObjectX();
RenderObjectX();
RenderObjectX();
SwapBuffers();

Also, I tried to use generic attrib functions and still I get the same problem with 4.1

I have a single VBO for vertex, normal, texcoord, tangent, binormal. It looks like it can't access normal and all the rest. As if the offsets are invalid.

Has anyone else seen that one?

jwatte
01-28-2004, 08:04 PM
I've used VBO on Catalyst 4.1, using static, dynamic, and streaming draw, and I get none of the artifacts you're describing. I put all the data in one buffer for static geometry; for soft-skinned geometry I put position/normal in one (streaming) buffer and texture/color in another (static) one.

I saw no performance difference between STATIC_DRAW and DYNAMIC_DRAW, though. I'm pretty sure I'm CPU limited at that point on a Radeon 9700, but I'd expect DYNAMIC_DRAW to reduce available memory bandwidth for the CPU (coming out of AGP), whereas STATIC_DRAW might come out of VRAM and thus not load the memory bus. Oh, well, not too much to worry about; it seems to run fine. It also works fine on NVIDIA with series 5x.x drivers.

AdrianD
01-29-2004, 02:57 AM
The problem with glMapBuffer is that the driver does not check whether the buffer is currently in use (rendering) while I am locking & updating it. (According to the spec the driver should do that, or give me another valid piece of memory, i.e. by making a copy.)
Because of that, I can sometimes (depending on scene size/polygon count) see that some meshes are rendered with the vertices of the previous frame and some with the current one. (Even in a multipass algorithm where I first upload all geometry and then draw it: the pass for the first light still uses last frame's vertices while the second light draws the correct ones.)

I do not have any problems with generic attributes, but I found that you can't use just any generic attribute you want.
If you want to mix generic attributes with standard attribute bindings, you have to make sure that you don't use the generic attributes that are mapped onto standard attributes (0..5 and the texture coordinate bindings).
I.e. you can't use texcoord[0] and generic attribute #8, because they are mapped to the same data and the generic attribute overrides the texcoord.
In my app I bind my normal to the standard normal array, and the tangent and binormal are generic attributes 10 & 11. This works without any problems.
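
In code, the binding described above would look roughly like this (offsets, stride and the Vertex struct are illustrative; the entry points are the ARB_vertex_program ones):

#define BUFFER_OFFSET(i) ((char*)0 + (i))

/* Conventional normal array... */
glEnableClientState(GL_NORMAL_ARRAY);
glNormalPointer(GL_FLOAT, sizeof(Vertex), BUFFER_OFFSET(normalOffset));

/* ...plus tangent and binormal on generic attributes that alias nothing
   in use (10 and 11). Avoid also reading attribute 2 in the vertex
   program, since that slot aliases the conventional normal. */
glEnableVertexAttribArrayARB(10);
glVertexAttribPointerARB(10, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                         BUFFER_OFFSET(tangentOffset));
glEnableVertexAttribArrayARB(11);
glVertexAttribPointerARB(11, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                         BUFFER_OFFSET(binormalOffset));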

When I talk about a performance loss, I also mean compared to the previous driver version:
in some polygon-intensive demos I don't get my optimized average of 230 FPS (only around 130 FPS).

But I have also experienced some speed improvements in other parts of the driver, e.g. my vertex-program-based extreme-crowd-rendering example
(up to 10000 moving, animated objects - 20 objects are rendered at once using batching) is up to 50% faster and much more stable than before. (It seems that state changes for vertex programs are faster now...)

Jared
01-31-2004, 03:36 AM
Originally posted by AdrianD:
I do not have any problems with generic attributes, but I found that you can't use just any generic attribute you want.
If you want to mix generic attributes with standard attribute bindings, you have to make sure that you don't use the generic attributes that are mapped onto standard attributes.

At least that wasn't a surprise. If I remember the spec correctly, it even states that the vertex program should be rejected if it refers to an attribute as both standard and generic. In the end, I guess the only advantage of generic attributes is that it's less confusing than passing, say, a tertiary color as normals (that, and you have more of them).

Talking about VBOs: is anyone else getting horrible performance when using anything but floats in a vertex buffer? I could accept that it "doesn't like" vertex positions as unsigned bytes, but when it also starts to crawl when passing colors from a VB as unsigned bytes, it gets a little annoying. I can understand that floats are better, bigger, strong... eh, have better precision, and the hardware is probably doing everything with floats anyway... but why should I have a 48 MB vertex buffer if I only need 12 MB? And the extra conversion really shouldn't turn 1200 fps into 40 fps, especially since even a standard vertex array was running at 1200 fps (with bytes and floats alike).

Just when I thought the worst issues would be strange bugs like not allocating video memory below a certain size or (more understandably) above a certain size (even when there is more than 10 times as much free).

Maybe someday soon they will decide how VBOs should behave and what limitations they should have. That would be less troublesome than new VBO issues with every new driver.

V-man
01-31-2004, 08:53 AM
Jared,

how are you storing your data in the VBO?
Maybe you can put the color in another VBO. Separating the float data from the ubyte data may help... but I could be wrong. Are you saying using floats works OK, or haven't you tried yet?

Maybe we should have generic compression for everything instead of just textures.

In my case
vertex is attrib 0
normal is attrib 2
tex is attrib 8
tangent is attrib 9
binormal is attrib 10

I'm definitely not mixing generic and conventional. I did a small GLUT test and it wasn't working either.
It would be nice if someone could send me their source code or exe. I would like to see something that works.

Jared
01-31-2004, 10:20 AM
Originally posted by V-man:
Jared,

how are you storing your data in the VBO?
Maybe you can put the color in another VBO. Separating the float data from the ubyte data may help... but I could be wrong. Are you saying using floats works OK, or haven't you tried yet?


I tried pretty much every combination I could think of, but the frustrating result is: as soon as I use anything but floats, it kills performance. Vertices and colors were already separated - in two buffers, in the same buffer with an offset, interleaved, etc. That moving from VA to VBO makes such a difference is especially weird when the data itself isn't changed. Maybe I should try it with 4 unsigned bytes, just in case it's some kind of alignment problem (though in that case ints should work at least).

[This message has been edited by Jared (edited 02-01-2004).]

Korval
01-31-2004, 11:07 AM
Talking about VBOs: is anyone else getting horrible performance when using anything but floats in a vertex buffer? I could accept that it "doesn't like" vertex positions as unsigned bytes, but when it also starts to crawl when passing colors from a VB as unsigned bytes, it gets a little annoying. I can understand that floats are better, bigger, strong... eh, have better precision, and the hardware is probably doing everything with floats anyway... but why should I have a 48 MB vertex buffer if I only need 12 MB? And the extra conversion really shouldn't turn 1200 fps into 40 fps, especially since even a standard vertex array was running at 1200 fps (with bytes and floats alike).

That's somewhat normal, and the significant framerate drop is expected. Here's why it happens.

If the hardware can't handle a certain vertex format, the driver must convert the data into something the hardware can handle. But it can't do this at upload time; it has to wait until render time, and do it for each render call. Since the data may be in AGP or video memory, this provokes very slow read requests, further slowing down the process. Reading from AGP is pretty slow (uncached), but reading from video memory is excruciatingly slow, as it goes over the PCI bus (AGP goes one way: to the card; data from the card to the CPU has to go across PCI).

Now, a more important question is, what hardware are you using? My 9500 can handle unsigned bytes just fine for colors (and all other vertex attributes). Lower-end cards (say, a GeForce 2 or less) may not be able to natively handle bytes for colors, thus provoking a conversion. The same might go for low-end Radeon cards too.
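
If the hardware does handle bytes natively, the usual way to feed them looks like this (offsets and stride are illustrative; colors are padded to 4 bytes to keep attributes aligned):

#define BUFFER_OFFSET(i) ((char*)0 + (i))

/* Four unsigned bytes per color: 4 bytes per vertex instead of 16. */
glEnableClientState(GL_COLOR_ARRAY);
glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(Vertex), BUFFER_OFFSET(colorOffset));

/* With generic attributes the normalization must be requested explicitly,
   so the 0..255 bytes arrive in the program as 0.0..1.0. */
glVertexAttribPointerARB(3, 4, GL_UNSIGNED_BYTE, GL_TRUE /* normalized */,
                         sizeof(Vertex), BUFFER_OFFSET(colorOffset));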

Jared
01-31-2004, 11:37 AM
Originally posted by Korval:
Now, a more important question is, what hardware are you using? My 9500 can handle unsigned bytes just fine for colors (and all other vertex attributes). Lower-end cards (say, a GeForce 2 or less) may not be able to natively handle bytes for colors, thus provoking a conversion. The same might go for low-end Radeon cards too.

A Radeon 9800, currently with Cat 4.1. Maybe it's time for a little driver safari, or a creative method of vertex compression.

V-man
02-01-2004, 09:40 AM
Sorry sorry sorry!
I forgot to mention that this is not VBO related, because I already tested the standard VA path and I get identical results.

However, the exact same problem appears for both VBO and VA when using generic attributes.

For example, in simple GLUT code, whether I have this for the normals or comment it out, I get the same result:

glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, VertexSize, VBO_NormalAddress);


I have no idea what's going on. I'm going to do more testing I guess.

V-man
02-01-2004, 10:12 AM
OK, I did a bit more testing and this is what I found.

Using the fixed pipe with generic vertex attribs doesn't work. Is this the way it is supposed to be?
When doing that, and *then* enabling PP, it ****s up everything.

Just using PP with generic and also using generic in VP and FP works.
Using PP with generic, but using conventional with VP and FP screws it up.

This is bad. There should be a simple, clear-cut document that explains these pitfalls.

jwatte
02-01-2004, 10:21 AM
The fixed function pipe is not guaranteed to work right with generic vertex attributes. The standard says "there might be aliasing, or there might not".

This ALSO means that if you specify using VertexPointer, NormalPointer, etc, then you have to read them using the named bindings, not the generic aliases, in the vertex program.