VB horror story

I’m updating a VB every frame and I need to both read it and write to it. From what I’ve understood of the VBO spec, I should be specifying my VBO with STREAM_COPY or DYNAMIC_COPY and mapping my buffer with READ_WRITE.
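In code, the per-frame pattern is roughly this (a sketch: size and initial_data stand in for my real data, and the ARB_vertex_buffer_object entry points are assumed to be loaded already):

    GLuint vbo;
    glGenBuffersARB(1, &vbo);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, size, initial_data,
                    GL_STREAM_COPY_ARB);

    /* every frame: map, read and modify in place, unmap, then draw */
    void *ptr = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_READ_WRITE_ARB);
    /* ... read existing vertices, write updated ones through ptr ... */
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);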

Horrified at the results I got with a Radeon 9800 Pro, I tested every possible mode on it and on a GeForce I plugged into the same machine. I also tried a software OGL implementation because of the dreadful Radeon perf with just VAs.

Unless I’m totally misunderstanding the VBO spec, it seems to me that NVidia is in total violation of it, and ATI… Well, see the results:

GeForce FX 5800 Ultra 128MB, Detonator 52.16:

    STATIC_*, STREAM_* or DYNAMIC_READ:      24 fps

    DYNAMIC_DRAW/COPY:
        mapped READ_WRITE or WRITE_ONLY:     57 fps
        mapped READ_ONLY:                    120 fps

    no VBO (standard vertex arrays):         57 fps


Radeon 9800 Pro 256MB, Catalyst 3.9:

Not much to say. Always 11 fps with VBO and 15 with standard vertex arrays. Wahey!


To stress the point I tried this on SGI’s software OGL from July 2000, and that gives 17 fps.

Any comments?
I have one for the IHVs: Get it together! People are actually downloading these drivers!

I have good results using GL_STATIC_DRAW and GL_WRITE_ONLY for mapping w/ 52.xx and a GF4 Ti4200. Never tested anything else.

Don’t know if using the READ and COPY usages is a good idea for VBO; they seem to be meant for other buffer objects (PBO?).

Well, I got very good results from DYNAMIC_COPY/DRAW with READ_ONLY on the GeForce, more than twice as fast as standard vertex arrays. I should be getting these results with STREAM_COPY and READ_WRITE, though, which actually result in 20% of the framerate of the former.

Well, I’ve got another horror story for you. I am working with a Radeon 9800 + Catalyst 3.9. I read back from VBOs in order to create bounding boxes for objects. Since it’s a one-time job, I don’t care about the speed implications of doing it that way.

Well, to make it short: from time to time the just-locked VBO returns wrong data in its first 64 bytes (it varies, but doesn’t seem to occur beyond that). I wrote to ATI devrel and sent a demo. They acknowledged the problem. Now I’m hoping for Catalyst 3.10, since the Call of Duty hotfix didn’t help.
The problem seems to be a driver problem only; it’s independent of chipset and CPU, and it happens on Radeon 9500 cards too (didn’t test on others yet).
Can anyone confirm similar experiences?

Madoc, your results for nVidia are what I would expect, in general.

Because you asked to read this data, the driver can’t put it in fast memory, since fast memory has slow CPU read properties. As such, it is no different from not using VBOs at all.

Your results for ATi are probably the result of their driver developers not caring about the 0.001% of people who would ever want to read back from a VBO. Which is reasonable in my opinion. If you want to read this data, feel free to keep a copy of it in system memory. You don’t need to read the data directly from the VBO.
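Something like this minimal sketch (VERT_COUNT and process_vertices are placeholders):

    #define VERT_COUNT 4096                /* placeholder size */

    static float  shadow[VERT_COUNT * 3];  /* authoritative CPU-side copy */
    static GLuint vbo;

    void update_and_upload(void)
    {
        /* read and write the cached system-memory copy freely */
        process_vertices(shadow, VERT_COUNT);

        /* then push the result into the VBO once per frame */
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
        glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, sizeof(shadow), shadow);
    }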

Unless I’m totally misunderstanding the VBO spec, it seems to me that NVidia is in total violation of it, and ATI… Well, see the results:

How is nVidia in violation of the spec? Indeed, how is any particular performance in violation of the spec? The spec doesn’t promise performance; it simply gives the drivers the tools to give that performance to you, depending on the parameters you set. You told the drivers, “I want to read back, so I don’t care about the performance of drawing.” The drivers are taking the hint, and not giving you the performance you said you didn’t care about.

I have one for the IHVs: Get it together! People are actually downloading these drivers!

Maybe developers should stop abusing the API?

I read back from VBOs in order to create bounding boxes for objects.

Wouldn’t it make more sense to do that before calling glBufferSubData/glBufferData? That way, you still have the vertices in main memory.
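Sketched out (assuming tightly packed xyz floats; vertices, vert_count and the box arrays are placeholders):

    #include <stddef.h>

    /* compute the AABB from the copy you already have in main memory */
    void compute_aabb(const float *v, size_t n, float mn[3], float mx[3])
    {
        for (int k = 0; k < 3; ++k)
            mn[k] = mx[k] = v[k];
        for (size_t i = 1; i < n; ++i)
            for (int k = 0; k < 3; ++k) {
                float c = v[i * 3 + k];
                if (c < mn[k]) mn[k] = c;
                if (c > mx[k]) mx[k] = c;
            }
    }

    /* usage: bound first, then upload the same pointer, no readback */
    compute_aabb(vertices, vert_count, box_min, box_max);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, vert_count * 3 * sizeof(float),
                    vertices, GL_STATIC_DRAW_ARB);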

[This message has been edited by Korval (edited 11-28-2003).]

Korval, I’m not complaining about the results I got with the nvidia board themselves but about how I got them. I am reading and writing to the same VBO simultaneously, therefore READ_WRITE (yes, in one map!) should give me the results I’m getting for READ_ONLY. Also, I am doing this once for every time I use the data, so STREAM_… should give the best performance, not the worst.

Anyway, did you actually read the results or understand what I’m doing? It doesn’t seem like it.

I don’t see how I am abusing the API; what are you on about?! Those hints exist precisely to specify the kind of usage I am making.
Saying that the driver doesn’t have to give good results when it can is just plain silly.

What are you saying about “not caring about performance”? Of course the VB will be used at least once for every time it’s read+written to. I really wonder whether you read my post???
I’m not doing something incredibly stupid like just using the data for different purposes. The overall performance has to be good. The drivers should allow you to optimise for any case that involves drawing.

As for ATI, using standard vertex arrays should not be slower than a software implementation on a card that otherwise does 180M actual polys/sec. Using system mem is simply hopeless; I can’t even fall back to standard VAs.

I guess I might indeed have to work on a copy of the VBO in sys mem and copy it to a VBO every frame to get good performance. However, I am not one bit happy about the additional memory cost.

The point remains that these results are pathetic and I must complain. The hints are not doing what I could have done transparently with VAR.

Madoc, are you sure you understand how VBO works?
When you specify READ_WRITE, the driver gives you a pointer to a copy of your vertex buffer in system memory, and then it needs to resynchronize this system vertex buffer with the video vertex buffer. That is exactly the same as not using VBO… (data transfers between system memory and video memory for each frame).
When you specify READ_ONLY, the driver still gives you a pointer to the system-memory vertex buffer, but it does NOT synchronize it with the video vertex buffer at the end, because normally you have not modified the vertex buffer!
So the video vertex buffer is never modified, and that’s why you run at full speed!! (no data transfers).
This is totally logical.
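In made-up pseudo-driver code (purely an illustration of the above, not real driver code):

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        void  *sysmem;        /* CPU-cached mirror handed to the app    */
        void  *vidmem;        /* the copy the GPU actually renders from */
        size_t size;
        int    map_writable;
    } Buf;

    void *Map(Buf *b, GLenum access)
    {
        b->map_writable = (access != GL_READ_ONLY_ARB);
        return b->sysmem;                /* always the sysmem mirror */
    }

    void Unmap(Buf *b)
    {
        /* READ_ONLY skips the re-upload, so the video copy is never
           touched and you render at full speed */
        if (b->map_writable)
            memcpy(b->vidmem, b->sysmem, b->size);
    }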

As for READ_WRITE and WRITE_ONLY vs READ_ONLY: maybe a read doesn’t need to lock the data, but a write operation needs the data to be stalled (don’t forget asynchronism), and so you kill your perfs. That’s why one solution is to use a double dynamic VBO: fill one, render from the second, and vice versa, as sketched below.
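Something like this (a sketch; sizes and the fill/draw details are placeholders):

    static GLuint vbos[2];
    static int    frame;

    void init_double_vbo(GLsizeiptrARB size)
    {
        glGenBuffersARB(2, vbos);
        for (int i = 0; i < 2; ++i) {
            glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbos[i]);
            glBufferDataARB(GL_ARRAY_BUFFER_ARB, size, NULL,
                            GL_DYNAMIC_DRAW_ARB);
        }
    }

    void per_frame(void)
    {
        GLuint fill = vbos[frame & 1];        /* written this frame */
        GLuint draw = vbos[(frame + 1) & 1];  /* filled last frame  */

        glBindBufferARB(GL_ARRAY_BUFFER_ARB, fill);
        void *p = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
        /* ... write the new vertices through p ... */
        glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);

        glBindBufferARB(GL_ARRAY_BUFFER_ARB, draw);
        /* ... set array pointers and issue the draw calls ... */

        ++frame;
    }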

Just some thought.

Madoc,
There’s no way you can get good simultaneous read/write performance except from system memory. AGP memory and video memory are usually uncached and write-combined. Freely mixing reads and writes to such memory regions is a recipe for disaster.

What are these fps anyway? You’re measuring rendering performance, but what you request with your usage hints isn’t rendering performance, it’s access performance. You should measure that instead, along the lines of the sketch below.
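For instance (a rough sketch; timer() stands in for any high-resolution CPU timer, touch_vertices and draw_scene for your per-frame work):

    /* glFinish drains the pipeline so asynchronous execution
       doesn't hide the cost of either half */
    glFinish();
    double t0 = timer();
    void *p = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_READ_WRITE_ARB);
    touch_vertices(p);               /* the per-frame read+write pass */
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
    double access_time = timer() - t0;

    t0 = timer();
    draw_scene();
    glFinish();
    double draw_time = timer() - t0;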

Tayo, that’s not necessarily so; the VBO could be in system memory, depending on what the driver decides to do based on the usage hints you specify. You’re not necessarily getting a copy, and my results clearly show that some hints result in a system-memory VBO (= standard VAs) that I doubt is ever copied.
Possibly the nvidia drivers are not doing such a bad job here, but I would think they are with STREAM_… as opposed to DYNAMIC; STREAM seems to behave like STATIC.
Also, the VBO is being updated when I specify READ_ONLY! It’s just a whole lot faster.

Zeckensack, I know how AGP memory behaves. Simply enough, when I specify STREAM_COPY and READ_WRITE I expect the driver to do something clever enough to let me read and write the data once for every draw, with reasonable performance. The driver should either give me system mem or apply some clever copying so as to efficiently synchronise my updates with rendering.
The fps is measuring rendering performance, of course. Measuring anything else would be absolutely pointless. I’m trying to render this data after all.

It seems clear to me that something is wrong with the hints on NVidia drivers, at least concerning STREAM_.
The ATI drivers should simply not perform so horribly; I can’t say much about the hints because none of them make any difference at all.

Funny how people can both be right and still contradict each other. It is reasonable to expect developers to read the white papers that say READ tokens may cause buffers to be in system memory and therefore be slow to draw. It’s also reasonable to expect the drivers to optimize for this case so the white papers don’t need those disclaimers. It’s a fairly common use-case in reality, but I think some HW people are still annoyed that anyone would want to do geometry-related work on the CPU…

Anyway, in this case, I’ve typically used two copies of buffers, one in system RAM and one in AGP or video memory, with a sync() function. The fast memory need only be synced once per frame, and only if the data has changed (and only the part that’s changed, if you can track bounds). Syncing is ideally done a step or two before rendering, overlapped with some other render call.
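A minimal sketch of that sync, assuming one dirty byte range per buffer (names are illustrative):

    #include <stddef.h>

    typedef struct {
        GLuint vbo;
        void  *sysmem;                  /* system RAM copy           */
        size_t dirty_min, dirty_max;    /* byte range touched so far */
    } MirroredVB;

    void mark_dirty(MirroredVB *m, size_t lo, size_t hi)
    {
        if (lo < m->dirty_min) m->dirty_min = lo;
        if (hi > m->dirty_max) m->dirty_max = hi;
    }

    void sync(MirroredVB *m)
    {
        if (m->dirty_min >= m->dirty_max)
            return;                     /* unchanged: skip the upload */
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, m->vbo);
        glBufferSubDataARB(GL_ARRAY_BUFFER_ARB,
                           (GLintptrARB)m->dirty_min,
                           (GLsizeiptrARB)(m->dirty_max - m->dirty_min),
                           (const char *)m->sysmem + m->dirty_min);
        m->dirty_min = (size_t)-1;      /* reset to an empty range */
        m->dirty_max = 0;
    }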

In the case of a mapped VBO, an implicit sync should (IMO) happen when I unmap the READ/WRITE buffer (if I mapped it with a WRITE flag, indicating the contents will likely change). There may be an OS technique for telling when a page of memory has changed, but the WRITE token is probably clear enough.

If the dual-buffer approach is actually done under the hood (unlikely, given the language about buffer demotion for reads) or is being worked on for a future version, it would be great to know. Currently, I’m looking at writing my own dual-buffer VBO on top of GL_VBO, just like I used to do with VAR and the ATI extensions, so GL_VBO really didn’t buy me much at this point.

Avi

I might be totally wrong about the dual VB being done in the drivers, but two things make me think it works like that.
First, this from the ARB spec:
Do buffer objects survive screen resolution changes, etc.?

    RESOLVED: YES.  This is not mentioned in the spec, so by
    default they behave just like other OpenGL state, like texture
    objects -- the data is unmodified by external events like
    modeswitches, switching the system into standby or hibernate
    mode, etc.

I think it means that the driver needs to keep a copy of VBOs that are put in video memory, because video memory can become unavailable on these kinds of OS events.
Actually, I have used DirectX, and with DirectX if you specify a managed VB, the spec says: “Resources are copied automatically to device-accessible memory as needed. Managed resources are backed by system memory and do not need to be re-created when a device is lost. See Managing Resources for more information. Managed resources can be locked. Only the system-memory copy is directly modified. Direct3D copies your changes to driver-accessible memory as needed.”
It seems to me this behaviour is similar to VBO (a device can be lost on a screen resolution change), but I might be totally wrong.

And to finish, for Madoc: have you seen this comment in the ARB spec?
“Note that the “copy” and “read” usage token values will become
meaningful only when pixel transfer capability is added to
buffer objects by a (presumed) subsequent extension.”

I’m also annoyed at having to do this on the CPU. It currently only exists as a VP path, but I’m currently having to support all sorts of OGL implementations. The worst thing is that we found that a lot of people with capable HW have drivers too old to support ARB_VP or VBO. Another issue is that software-emulated VPs are horrendously slow if using VBO and the VB is not in sys mem.

In the absence of both, what I describe is a pretty optimal solution. With the kind of results I’m getting it would seem that I need yet another path though. With ATI, I can’t even fall back to standard VAs.

Tayo, no, I hadn’t noticed that in the spec. That’s quite disappointing really. Yet it seems that at least READ_ONLY is doing something; I wonder what exactly. I would guess it’s giving me a copy of the VB and then copying it back. That would be pretty neat, it’s essentially what I would expect, and the results are very good. The hints to get that behaviour seem wrong though.
I should hope that drivers only copy VBs to sys mem when necessary.

I still think that these drivers are not behaving well so I thought I would report it here and maybe get some useful comments. I would say the spec still needs some clarification.
I certainly hope that ATI will make their drivers perform decently under a few more conditions.

Originally posted by Madoc:
Zeckensack, I know how AGP memory behaves. Simply enough, when I specify STREAM_COPY and READ_WRITE I expect the driver to do something clever enough to let me read and write the data once for every draw, with reasonable performance. The driver should either give me system mem or apply some clever copying so as to efficiently synchronise my updates with rendering.
The fps is measuring rendering performance, of course. Measuring anything else would be absolutely pointless. I’m trying to render this data after all.

Fair enough. It would still be helpful (and perhaps enlightening) to separately benchmark array access performance and rendering performance. You’re essentially telling the implementation that you want frequent read and write access, and that may just be what you get.

Rendering speed being terribly slow is most likely a result of data bouncing back and forth over the AGP, in which case either your buffers are too large (bad swapping granularity), or you’re not keeping to your hint (i.e. you render the VBO much more frequently than you access the data).

No use complaining unless you’ve benchmarked it more precisely. If you find out that everything is slow, given your access patterns, now that would be a good point, of course. That would also be more helpful to ATI, in case they’re interested in your feedback on the matter (which I believe they are).

I’m still somewhat puzzled why you insist on using VBO for what you’re doing. System memory clearly is the way to go, so if everything performs well, you’ll have used VBOs as a replacement for malloc.

Originally posted by zeckensack:
I’m still somewhat puzzled why you insist on using VBO for what you’re doing. System memory clearly is the way to go, so if everything performs well, you’ll have used VBOs as a replacement for malloc.

Using system mem is exactly what I’m doing, it was more a test than anything else and my post a kind of bug report, as I said, hoping for some useful comments. But although it’s the obvious thing to do, it seems clear that there must be more optimal solutions for at least ATI. I can’t believe past driver versions all performed so badly with standard VAs. In fact, I know they didn’t.

Originally posted by Madoc:
Using system mem is exactly what I’m doing, it was more a test than anything else and my post a kind of bug report, as I said, hoping for some useful comments. But although it’s the obvious thing to do, it seems clear that there must be more optimal solutions for at least ATI. I can’t believe past driver versions all performed so badly with standard VAs. In fact, I know they didn’t.

Madoc,

Can you explain exactly what you’re doing? I don’t quite understand from your description.

VBOs were designed to make these modes of vertex data transmission work: (a) cpu->gpu fast and asynchronous, (b) gpu->cpu fast and asynchronous, and (c) gpu->gpu possible (and hopefully fast).

Note that no mechanism currently exists to do (b) or (c), because there’s no render-to-vertex-array. The idea was that when PBO was defined, or some standard way to render-to-vertex-array, these modes would become useful.

You should only expect (a) cpu->gpu to be accelerated. Does this seem unreasonable?

Thanks -
Cass

Hi Cass,

In a few words: I’m trying to map a VBO, do some processing on it, which involves reading and writing data, and unmap it again.

Performance isn’t at all a problem with your drivers; I’m just puzzled by how the usage hints are working.


edit:
I guess that wasn’t a very exhaustive explanation.

The data I’m writing to the VBO is dependent on other data in the VBO; I read and write to the VBO sequentially. I do this exactly once for every time I render with it, without exception.

Because I both read and write the data, I would think to specify READ_WRITE when mapping the VBO. As my results show, I got far better performance from READ_ONLY, though the VB was still being updated.

Because I am reading and writing, I would use …_COPY in my BufferData call, which the spec would seem to claim is for both reading and writing data. This doesn’t seem to behave any differently from DRAW. I guess this might be because its usage in this context is not expected.

Because I am mapping the VBO once for every time I render with it, I would use STREAM_…, but this seems to behave like STATIC, while DYNAMIC behaves well.

I guess this usage of VBO is not expected and IHVs are unlikely to want to invest in it. At the same time, it seems the drivers are capable of doing what I would like very well (this is a guess) and adapting them to do this with the expected (at least by me) hints might be really trivial.

[This message has been edited by Madoc (edited 12-02-2003).]

Madoc,

For <usage> in BufferData,

*_DRAW implies cpu->gpu communication
*_READ implies gpu->cpu communication
*_COPY implies gpu->gpu communication

*_DRAW is the only meaningful usage in today’s implementations of VBO. It’s the only usage that corresponds to its predecessor extensions, VAR and VAO.

Copy means that the buffer is likely to be used both as a draw target and to read back into. Again, this makes more sense in the context of PixelBufferObject, where there are existing paths for DRAW/READ/COPY for pixels and texels.

Map was only intended to support one-shot STREAM_DRAW and WRITE_ONLY. This was to eliminate any superfluous copying of data into AGP or video memory.

STATIC_DRAW and STREAM_DRAW are likely to give you memory that has bad CPU read characteristics, because they’re generally put into AGP or video memory.

DYNAMIC_DRAW must assume regular updates and potentially frequent mapping and unmapping, so I would expect it to be backed by system memory.
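That one-shot pattern would look roughly like this (a sketch, assuming the usual respecify-then-map idiom; generate_vertices is a placeholder):

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    /* respecify so the driver can hand out fresh storage */
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, size, NULL, GL_STREAM_DRAW_ARB);
    void *dst = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
    generate_vertices(dst);          /* write-only, ideally sequential */
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
    /* ... draw, then repeat next frame ... */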

Does this make sense to you?

I’ll also have the implementor of this extension at NVIDIA take a look at this thread and perhaps he will want to comment further.

Thanks -
Cass

It seems I misunderstood the meaning of STREAM. I guess it’s intended for a temporary VBO and not one that is mapped once for every time it’s rendered. This explains a lot.

You say only WRITE_ONLY is supported but VAR allows you to specify read frequency. Does this mean that we’ve lost the equivalent functionality? None of the other modes have any meaning with VBO?

Well, I guess the best performing thing I can do given these restrictions is to keep a copy of the VB in sys mem and update into a DYNAMIC_DRAW VBO.
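Roughly (a sketch; process_vertices, sys_copy and buf_size are placeholders, and whether a given driver takes advantage of the full respecification is my assumption):

    /* CPU pass over the cached system-memory copy (read + write) */
    process_vertices(sys_copy, vert_count);

    /* then replace the DYNAMIC_DRAW buffer's contents once per frame */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, buf_size, sys_copy,
                    GL_DYNAMIC_DRAW_ARB);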

I would be curious to know what’s happening when I specify READ_ONLY and get such good performance.
Does VBO map/unmap actually do something to alleviate the nasty stalls that you can get updating vertex arrays?

Originally posted by Madoc:
You say only WRITE_ONLY is supported but VAR allows you to specify read frequency. Does this mean that we’ve lost the equivalent functionality?

Reading from AGP or vidmem in VAR was always a bad idea, and read frequency should always have been 0. The VAR API was overly general in its alloc scheme. WRITE_ONLY was the only high-performance path.


None of the other modes have any meaning with VBO?

They don’t have meaning without API calls to render to a VBO. They were put in there because we (foolishly?) expected PBO to follow fairly quickly after, and we didn’t want to go and add new buffer usage enumerants when we already knew what they would be.

Kurt Akeley did a lot of thinking about how PBO would work with the VBO infrastructure. For the longest time we just called the extension BufferObjects, because it wasn’t specific to vertex data. At the last minute that was changed though.

Thanks -
Cass