Yay! ARB_vertex_buffer_object supported in ATI's Catalyst 3.4 drivers!

www.ati.com

Finally. . . time for some fun! Down with vendor specific extensions!

Cool. Now all I need is my new nForce2 mobo, and I’ll be all set!

Well… that Counter-Strike/Half-Life problem is STILL NOT “fixed” :( OK, the screen doesn't stay black anymore and you can switch back and forth between desktop and game, BUT those screen switches take ages, and when I get back into the game the refresh rate drops back to an annoying 60 Hz, even though the game started at the expected 100 Hz. ATI should take a look at how well nVidia handles this.

Regarding VBO: it seems ATI didn't expect people to create thousands of VBOs. Well, I do (I will reduce that some day, I promise). After allocating around 6000-6600 buffer objects, I get OUT_OF_MEMORY, even though the memory needed for all those objects fits well within 48 MB. The Radeon 9700 in use has 128 MB of onboard memory plus an additional 64 MB of AGP memory, so my request SHOULD be fulfillable. All buffer objects use either GL_STATIC_DRAW_ARB or GL_DYNAMIC_DRAW_ARB. The same code worked without any overflow on a GF2 with 32 MB (until the latest 43.51 drivers broke VBO on GF2-class cards).
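
Boiled down, the allocation pattern looks roughly like this (the uniform buffer size is just for illustration; my real buffers come in many different sizes, and I assume the ARB entry points have already been fetched via wglGetProcAddress):

    /* Sketch of the allocation pattern described above. The uniform
       buffer size is illustrative only; ~6600 x 7 KB is in line with
       the ~48 MB figure mentioned above. */
    #include <GL/gl.h>
    #include <GL/glext.h>

    #define NUM_BUFFERS  6600
    #define BUFFER_BYTES (7 * 1024)

    static GLuint buffers[NUM_BUFFERS];

    int allocate_buffers(void)
    {
        int i;
        glGenBuffersARB(NUM_BUFFERS, buffers);
        for (i = 0; i < NUM_BUFFERS; ++i) {
            glBindBufferARB(GL_ARRAY_BUFFER_ARB, buffers[i]);
            /* data uploaded later; usage is GL_STATIC_DRAW_ARB or GL_DYNAMIC_DRAW_ARB */
            glBufferDataARB(GL_ARRAY_BUFFER_ARB, BUFFER_BYTES, NULL,
                            GL_STATIC_DRAW_ARB);
            if (glGetError() == GL_OUT_OF_MEMORY)
                return i;   /* this is where the driver gives up, at roughly 6000-6600 */
        }
        return NUM_BUFFERS;
    }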

Aside from that, a question: which is better (faster) to optimize for, fewer texture switches (glBindTexture()) or fewer vertex buffer switches (glBindBuffer(), plus glXXXPointer(), plus glLoadMatrix())?

Which is better (faster) to optimize for, fewer texture switches (glBindTexture()) or fewer vertex buffer switches (glBindBuffer(), plus glXXXPointer(), plus glLoadMatrix())?

I imagine that glBindTexture will always hurt more, since you’re talking about a possible texture upload, as video cards can’t use textures in AGP directly. By contrast, video cards can directly use AGP vertex data just fine.

That being said, you’ll need to benchmark it to know for sure.
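
If it helps, one way to set such a benchmark up is to keep a small record per draw call and sort the frame's draw list by either key before submitting. Everything below is hypothetical scaffolding, not code from any particular engine:

    /* Hypothetical benchmarking scaffold: sort the frame's draw list either
       by texture or by buffer object, render, and compare frame times. */
    #include <stdlib.h>
    #include <GL/gl.h>

    typedef struct {
        GLuint  texture;   /* bound with glBindTexture()   */
        GLuint  vbo;       /* bound with glBindBufferARB() */
        GLint   first;
        GLsizei count;
    } DrawRecord;

    static int by_texture(const void *a, const void *b)
    {
        const DrawRecord *da = (const DrawRecord *)a;
        const DrawRecord *db = (const DrawRecord *)b;
        return (da->texture > db->texture) - (da->texture < db->texture);
    }

    static int by_vbo(const void *a, const void *b)
    {
        const DrawRecord *da = (const DrawRecord *)a;
        const DrawRecord *db = (const DrawRecord *)b;
        return (da->vbo > db->vbo) - (da->vbo < db->vbo);
    }

    /* one run:    qsort(draws, numDraws, sizeof(DrawRecord), by_texture);  */
    /* the other:  qsort(draws, numDraws, sizeof(DrawRecord), by_vbo);      */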

It seems ATI didn't expect people to create thousands of VBOs. Well, I do (I will reduce that some day, I promise). After allocating around 6000-6600 buffer objects, I get OUT_OF_MEMORY, even though the memory needed for all those objects fits well within 48 MB.

If each VBO has to be page-aligned (typically 4K), the minimum data taken up by 6600 VBO’s is 25.8MB. That’s hardly trivial.

Also, each VBO consumes some amount of resources, just to manage them. Let’s say that ATi’s VBO’s require a 1K-large struct for this purpose. That’s 6.6MB right there.

Lastly, it is entirely possible that ATi stores each of these VBO management structs in some kind of static array. It makes memory management a bit easier. As such, I would suspect that pathological misuse of the API (such as yours) would result in an OUT_OF_MEMORY error.

In short, I consider this to be much more your fault than theirs. You are misusing the API. Fix it.

ATI should take a look at how well nVidia handles this.

I’m sure that ATi people have seen that nVidia drivers handle this situation perfectly. Of course, that doesn’t actually help them solve this apparently non-trivial problem. For all we know, Half-Life is doing something wrong, and nVidia simply has an “if(HalfLife)” in their driver somewhere to handle this case.

I don’t care about switching to the desktop. As long as you can actually exit a game and re-enter without it dying (or other adverse effects like changing the refresh rate), I don’t care. I’ll find out when I get home.

[This message has been edited by Korval (edited 05-15-2003).]

Originally posted by skynet:
ATI should take a look at how well nVidia handles this.

Probably a proprietary fix. . . j/k!

Regarding VBO: it seems ATI didn't expect people to create thousands of VBOs. Well, I do (I will reduce that some day, I promise). After allocating around 6000-6600 buffer objects, I get OUT_OF_MEMORY, even though the memory needed for all those objects fits well within 48 MB. The Radeon 9700 in use has 128 MB of onboard memory plus an additional 64 MB of AGP memory, so my request SHOULD be fulfillable. All buffer objects use either GL_STATIC_DRAW_ARB or GL_DYNAMIC_DRAW_ARB. The same code worked without any overflow on a GF2 with 32 MB (until the latest 43.51 drivers broke VBO on GF2-class cards).

Back with ATI_vertex_array_object I was running into OUT_OF_MEMORY at 32 MB.

3.4 is great and all, but did ATI skip a version?

I thought the last released was 3.2.
They never released 3.3?

Originally posted by gator:
3.4 is great and all, but did ATI skip a version?

I thought the last released was 3.2.
They never released 3.3?

Yep, there was a WHQL test that prevented 3.3 from being certified. The test was apparently scheduled to be removed, but not soon enough. Rather than wait for it to be removed, ATI decided to skip the 3.3 release entirely and start working on 3.4. Technically it could still have been called 3.3, just for continuity, but I guess ATI decided “3.4” was more appropriate since it contained not only what 3.3 was supposed to achieve, but also what ATI had planned for the release after that.

We’ve had members report bugs with “Catalyst 3.3” drivers to us, even though we’ve never had those drivers ourselves, from which I conclude that someone leaked a build and claimed it was the real thing. Perhaps ATI just didn’t want to confuse people by re-using that name. If that was the reason, I’d consider it a smart and savvy one, but I have no idea whether that’s the case.

If they were leaked, then the person who leaked them probably won’t be getting any more betas. When ATI first introduced the Catalyst beta program, they stated that each driver would be made so that it could be traced back to the person it was originally given to. I believe there is also an NDA that beta testers must adhere to.

More likely, though, it is a set of ATI drivers released by Dell that many people claimed were the Catalyst 3.3 drivers. They are, in fact, not the 3.3s, but they are newer than the 3.2s. They were also never WHQL certified, whereas all Catalyst drivers released to the public are.

[This message has been edited by Ostsol (edited 05-15-2003).]

During the VBO design, we agreed that allocating lots of relatively small VBOs would be inexpensive.

If each VBO has to be page-aligned (typically 4K), the minimum data taken up by 6600 VBO’s is 25.8MB. That’s hardly trivial.

Also, each VBO consumes some amount of resources, just to manage them. Let’s say that ATi’s VBO’s require a 1K-large struct for this purpose. That’s 6.6MB right there.

Lastly, it is entirely possible that ATi stores each of these VBO management structs in some kind of static array. It makes memory management a bit easier. As such, I would suspect that pathological misuse of the API (such as yours) would result in an OUT_OF_MEMORY error.

In short, I consider this to be much more your fault than theirs. You are misusing the API. Fix it.

I disagree. If the app can manage data efficiently by putting many arrays into a larger VBO, then the driver could do the same sort of management with very light-weight VBOs.

Again, the intent of these buffer objects is that they would be very lightweight. That is easy for the driver to manage, and it still provides the driver the flexibility to move buffers around to different memory. If you put lots of arrays into a single buffer, the driver must keep that single buffer contiguous, even if it need not be, because it would be legal for you to access any part of it. That’s an unnecessary and useless constraint.

Thanks -
Cass

Also, the idea that each VBO requires a 1k structure for overhead is ludicrous. Where did you come up with that number?

During the VBO design, we agreed that allocating lots of relatively small VBOs would be inexpensive.

Lots of small VBOs, I can understand. 6000+? I’m sorry, but I’m willing to consider that pathological misuse of the API, and it is perfectly fine if an implementation wants to fail on that. I wouldn’t imagine that you could allocate 6000 small textures either, even if they were 32x32. If you can, that’s great. But I’m highly unwilling to call it a bug if the implementation fails to do so.

If each buffer were so much as 385 32-byte vertices, 6000 buffers would take well over 90MB of memory.

I disagree. If the app can manage data efficiently by putting many arrays into a larger VBO, then the driver could do the same sort of management with very light-weight VBOs.

Of course, light-weight VBO’s are a poor use of the API, as the glDraw* overhead starts to win out. As such, it doesn’t really make sense to encourage someone to misuse the API in such a fashion.

Not only that, it complicates the implementation quite a bit more, and unnecessarily so. This tends to lead to more bugs.

Also, the idea that each VBO requires a 1k structure for overhead is ludicrous. Where did you come up with that number?

Admittedly, 1K is a bit much, but I was giving ATi the benefit of the doubt.

Just because you’re using thousands of VBOs doesn’t necessarily mean that they’re lightweight (i.e. near-empty). You might, for example, be rendering a scene with 50 million polys. It also doesn’t mean that you intend to use all of them every frame.

Nowhere is it said that a VBO has to live in AGP or video memory. If they don’t fit in there, they should just be shifted to system memory. That’s what happens with textures, and it works just fine for those.

– Tom

Originally posted by Ostsol:
www.ati.com

Finally. . . time for some fun! Down with vendor specific extensions!

No new GNU/Linux drivers though … again

~velco

Originally posted by Korval:
Of course, light-weight VBO’s are a poor use of the API, as the glDraw* overhead starts to win out. As such, it doesn’t really make sense to encourage someone to misuse the API in such a fashion.

Not only that, it complicates the implementation quite a bit more, and unnecessarily so. This tends to lead to more bugs.

???

Small light-weight VBOs are an intentional aspect of the API design. They simplify the implementation in a number of really important ways: 1) buffer renaming, 2) buffer mapping, 3) synchronization, 4) buffer migration/duplication.

It’s not unreasonable use of the API to have several VBOs for each object in your application (e.g. one VBO for position, one VBO for tangent/binormal/normal, one VBO for texture coords, …).
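
For instance, something along these lines (the buffer names are made up, and the client-state enables and vertex program setup are omitted):

    /* One buffer object per attribute, roughly as described above.
       posVbo, tbnVbo and uvVbo are hypothetical names; the buffers are
       assumed to have been filled with glBufferDataARB() elsewhere, and
       indexCount plus the client-side index array come from the app. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, posVbo);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, tbnVbo);
    /* tangent/binormal/normal interleaved as generic attributes (ARB_vertex_program) */
    glVertexAttribPointerARB(1, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), (const GLvoid *)0);
    glVertexAttribPointerARB(2, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), (const GLvoid *)(3 * sizeof(GLfloat)));
    glVertexAttribPointerARB(3, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), (const GLvoid *)(6 * sizeof(GLfloat)));

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, uvVbo);
    glTexCoordPointer(2, GL_FLOAT, 0, (const GLvoid *)0);

    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);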

I fail to see how large heavy-weight VBOs with lots of app-managed arrays inside them make things easier for the driver or the app developer.

Thanks -
Cass

To the original poster who is getting out of memory errors.
What is your AGP aperture size?
I suggest you set it to the maximum available value.

On a side note, a similar message was posted some time ago, with the person hitting a limit of 26.x MB.

To clarify some things:
I know that allocating that many buffers can cause problems. I'm just annoyed that the card with the most memory drops out first (I tried a 128 MB AGP aperture size; it didn't help) :( Maybe it's just a little setting in the driver source, like #define MAX_VBO 6666, that prevents it from working ;)
I'm developing a per-pixel-lighting graphics "engine". One of the main goals is to store all geometry in AGP/video memory and to cache as many intermediate results as possible, like L and H vectors. I prefer switching VBOs over batching geometry (with the CPU). As Cass mentioned, my model data is "scattered" across different VBOs. There may be static data like texture coordinates and indices, dynamic data like vertices and normals, and intermediate data like L and H vectors. Additionally there is shadow-model data with its corresponding indices. I try to cache as much as possible. For instance, if neither the light nor the object has moved but the camera did, I can keep the L-vector VBO and the shadow model and only need to update the H-vector VBO.
This should explain where that many VBOs come from. Of course I do not access all of them within a single frame, but the required buffer binds can still pile up beyond 1000 binds per frame. In the future I will reduce the number of binds and the number of VBOs in general. But restricting me to, say, a few dozen VBOs would throw me back to the days of VAR (though this time I could have more than one, one for static and one for dynamic data :) ), with my own memory management of the large buffer.
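
To illustrate the camera-only case, the update path looks roughly like this (the names below are placeholders, not my actual code):

    /* Camera moved, light and object did not: the L-vector VBO and the
       shadow model stay cached; only the H-vector VBO gets refilled.
       hVecVbo, vertexCount and computeHalfVector() are placeholders. */
    if (cameraMoved && !lightMoved && !objectMoved) {
        float *h;
        int i;
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, hVecVbo);
        h = (float *)glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
        if (h) {
            for (i = 0; i < vertexCount; ++i)
                computeHalfVector(&h[3 * i], i);   /* per-vertex H = normalize(L + V) */
            glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
        }
    }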

To Cass:
I'd be interested in some kind of performance FAQ for ARB_fragment_program, ARB_vertex_program and ARB_vertex_buffer_object: recommended application behaviour, recommended program layout (should texture sample instructions be scattered across the program or not, etc.) and pitfalls to avoid.

Nowhere is it said that a VBO has to live in AGP or video memory. If they don’t fit in there, they should just be shifted to system memory. That’s what happens with textures, and it works just fine for those.

Actually, it is said somewhere. Because the spec says that glBindBuffer should be a light-weight operation, the binding of a buffer should not necessitate a copy operation. Otherwise, it is no longer light-weight, and therefore violates the spec.

As such, if an implementation were to page out to system memory, I would not consider it a well-thought-out implementation, or even a valid one. VBO’s should remain in hardware memory. Some reside in video, others in AGP. But nothing should fall back to memory where the fastest method of transfer is to copy the verts to AGP and then send them. This is simply unreasonable.

Besides, if I understand correctly, textures never get paged out to system memory. The farthest away they get paged is to AGP.

Small light-weight VBOs are an intentional aspect of the API design. They simplify the implementation in a number of really important ways: 1) buffer renaming, 2) buffer mapping, 3) synchronization, 4) buffer migration/duplication.

The point I was making was not about any fault of VBOs, but simply about how long glDraw* takes to render anything. If I recall (granted, it was a D3D document, but I assume the performance characteristics transfer over to GL on the same hardware), the idea is to use as few glDraw* calls as you can for smaller objects, as these calls tend to dominate. Obviously, with low-poly objects the cost of glDraw* takes precedence over the actual rendering time of the mesh.

It’s not unreasonable use of the API to have several VBOs for each object in your application (e.g. one VBO for position, one VBO for tangent/binormal/normal, one VBO for texture coords, …).

That’s reasonable only to the extent that you’re going to be doing some mixing and matching of these VBOs. If you aren’t, what’s the point? It’s much easier, not only for the implementation but for the person using the API, to have as few VBOs per renderable object as possible. Also, if these things ever get paged out to system memory, it is faster as well, since having fewer VBOs means the larger ones get to stick around.

I fail to see how large heavy-weight VBOs with lots of app-managed arrays inside them make things easier for the driver or the app developer.

I’m not suggesting having a lot of app-managed arrays inside a VBO. What I’m suggesting is that VBOs should be concatenated where at all reasonably possible, and that having 6000+ VBOs seems very much like a poor use of the API, and a poor use of a hardware resource.

There may be static data like texture coordinates and indices, dynamic data like vertices and normals, and intermediate data like L and H vectors. Additionally there is shadow-model data with its corresponding indices.

I have to ask: why are you computing the L&H vectors on the CPU? Isn’t that what ARB_vertex_program is for?

I don’t know what a “shadowmodel” refers to, but if this relates to stencil shadows, there are ways to use your regular meshes to do shadowing. I don’t know too much about these ways, as I prefer the shadow map approach, but they do exist.

Here’s a question. Much like glGenTextures, you create some number of buffer objects and then tell OpenGL how much space you want for them. Can you successfully generate, say, 8000 VBO’s without allocating room for them? Or is it the actual allocation phase where the implementation reports a problem?
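
Something along these lines would separate the two phases and show where the error actually appears (the counts and sizes are arbitrary, and it's only a fragment that assumes <stdio.h> and the ARB entry points are available):

    /* Phase 1: generate names only (no storage requested yet).
       Phase 2: allocate storage and see at which buffer, if any,
       GL_OUT_OF_MEMORY shows up. */
    #define NUM_NAMES    8000
    #define BUFFER_BYTES (4 * 1024)

    GLuint ids[NUM_NAMES];
    int i;

    glGenBuffersARB(NUM_NAMES, ids);                       /* names only */
    if (glGetError() != GL_NO_ERROR)
        printf("failed while generating names\n");

    for (i = 0; i < NUM_NAMES; ++i) {                      /* allocation */
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, ids[i]);
        glBufferDataARB(GL_ARRAY_BUFFER_ARB, BUFFER_BYTES, NULL, GL_STATIC_DRAW_ARB);
        if (glGetError() == GL_OUT_OF_MEMORY) {
            printf("allocation failed at buffer %d\n", i);
            break;
        }
    }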

Originally posted by Korval:

Actually, it is said somewhere. Because the spec says that glBindBuffer should be a light-weight operation, the binding of a buffer should not necessitate a copy operation. Otherwise, it is no longer light-weight, and therefore violates the spec.

The spec states in the issues section that glBindBuffer should be light-weight. It did not say (as far as I’m aware) that binding of a buffer should not necessitate a copy operation. Using a buffer (not merely binding it) may indeed require the driver to copy it somewhere. This is not a spec violation. Specifically it is an implementation detail that is invisible to the user, and would therefore never be specified.


The point I was making was not about any fault of VBOs, but simply about how long glDraw* takes to render anything. If I recall (granted, it was a D3D document, but I assume the performance characteristics transfer over to GL on the same hardware), the idea is to use as few glDraw* calls as you can for smaller objects, as these calls tend to dominate. Obviously, with low-poly objects the cost of glDraw* takes precedence over the actual rendering time of the mesh.

This is true for D3D, but not OpenGL. Draw calls are expensive in D3D because they are done in kernel mode. GL calls are in user mode, and therefore relatively lightweight.

Indeed, this is the fundamental difference between the way D3D handles vertex buffers and the way OpenGL decided to do it. Draw calls and gl*Pointer() calls in OpenGL are lightweight, so it makes sense to make VBOs lightweight. In D3D, draw calls and vertex buffer setup are expensive, so their vertex buffers are larger and heavier-weight, with lots of complex methods for “locking” (or mapping) regions and rules for what you will overwrite and what can be discarded. Most of that complication goes away if you just have small, lightweight VBOs.


That’s reasonable only to the extent that you’re going to be doing some mixing and matching of these VBOs. If you aren’t, what’s the point? It’s much easier, not only for the implementation but for the person using the API, to have as few VBOs per renderable object as possible. Also, if these things ever get paged out to system memory, it is faster as well, since having fewer VBOs means the larger ones get to stick around.

It’s not just mixing and matching. It’s also when some arrays are dynamic and some are not. You want to keep those separate. In any case, you want to give the driver the opportunity to lay these things out in memory the best possible way. The driver may decide to keep a small area in video memory for static VBOs and move them in/out based on LRU statistics. Further, if you make one large buffer that holds vertex data for many vertex arrays, then you have to be careful how you synchronize updates to that buffer with draw calls. This “accidental” synchronization burden can really hurt performance. With multiple buffer objects, this synchronization burden just goes away.
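
To make that concrete (the buffer names and counts below are hypothetical):

    /* One big shared buffer: if objectA was just drawn out of bigVbo and we
       immediately rewrite objectB's region of the same buffer, the driver may
       have to stall or copy, because the buffer could still be in use. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, bigVbo);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
    glDrawArrays(GL_TRIANGLES, objectAFirst, objectACount);
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, objectBOffset, objectBBytes, objectBData);

    /* Separate buffer objects: the update touches a buffer that no pending
       draw call references, so the accidental synchronization goes away. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, objectBVbo);
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, objectBBytes, objectBData);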


I’m not suggesting having a lot of app-managed arrays inside a VBO. What I’m suggesting is that VBOs should be concatenated where at all reasonably possible, and that having 6000+ VBOs seems very much like a poor use of the API, and a poor use of a hardware resource.

Again, I simply disagree. You’ll wind up getting better performance on NVIDIA drivers if you don’t do unnecessary concatenation. Sure, keep arrays that are “interleaved” in the same VBO, but don’t spend energy trying to make a VBO heavyweight. Takes more time for the app to do, makes life more complicated for the driver. That’s a lose/lose proposition.

Thanks -
Cass

[This message has been edited by cass (edited 05-16-2003).]

Originally posted by Korval:
As such, if an implementation were to page out to system memory, I would not consider it a well-thought-out implementation, or even a valid one. VBO’s should remain in hardware memory. Some reside in video, others in AGP. But nothing should fall back to memory where the fastest method of transfer is to copy the verts to AGP and then send them. This is simply unreasonable.

Besides, if I understand correctly, textures never get paged out to system memory. The farthest away they get paged is to AGP.

Sure they do – try it. I’ve successfully run apps that used 512 MB worth of textures, which is more than my video memory and AGP aperture combined.

It seems only natural to me that the drivers are free to put your data wherever they please. The VBO spec explicitly says so in the context of element buffers, but IMHO generalizing this to vertex buffers as well would be the logical thing to do.

If you have so much vertex data that it doesn’t fit in video/AGP memory, chances are that you won’t need all of it in the course of a single frame anyway. If that’s the case, binding a “non-resident” VBO isn’t half as bad as you make it sound. The driver could employ an LRU caching scheme just like it does for textures, and transfer the VBO to video memory just once, throwing out the least recently used one to make room.
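
In abstract terms, the bookkeeping I have in mind would look something like this. All of it is hypothetical; I'm not claiming any real driver does it this way, and the helper functions are just placeholders:

    /* Hypothetical driver-side bookkeeping for the LRU idea above. */
    #include <stddef.h>

    typedef struct {
        size_t   bytes;
        int      resident;        /* currently in video/AGP memory */
        unsigned lastUsedFrame;   /* updated whenever it is bound  */
    } VboInfo;

    /* Called before drawing from 'v': evict the least recently used resident
       buffers until 'v' fits, then upload it. */
    void make_resident(VboInfo *v, VboInfo *all, int count, unsigned frame)
    {
        v->lastUsedFrame = frame;
        if (v->resident)
            return;
        while (!fits_in_fast_memory(v->bytes)) {           /* placeholder helper */
            VboInfo *victim = least_recently_used(all, count);  /* placeholder helper */
            evict_to_system_memory(victim);                     /* placeholder helper */
            victim->resident = 0;
        }
        upload_to_fast_memory(v);                               /* placeholder helper */
        v->resident = 1;
    }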

You say that an implementation that behaves like this would be an invalid one. My opinion is the exact opposite: I think any implementation that doesn’t do this is inflicting an unnecessary burden on the developer by forcing him to fall back to a standard vertex array code path.

Note that I’m not taking sides here: I have no idea how either NVIDIA or ATI deal with this situation, so for all I know they could both just report GL_OUT_OF_MEMORY and be done with it.

– Tom