Performance issues w/ VBO on nvidia

I have encountered a serious problem with the ARB_vertex_buffer_object extension on nVidia hardware. Specs are as follows: GF4 Ti4600 AGPx4, 53.03 drivers on Win2k, VIA KT600, Athlon XP 2800.

The situation: I’m using a geometry cache (vertices only) to store visible parts of the scene using an LRU scheme. The cache is supposed to be stored in VRAM, as vertex arrays. The current frame is always fully in the cache (no streaming), but when moving the camera, the cache data needs to be updated at regular intervals.

Until recently, I used a vertex_array_range in VRAM to represent that cache. No problems - rendering was fast, and so was the update of the VRAM data. I’m using sequential 4-byte writes to the range.

Then I added a VBO codepath. I’m using one large VBO, initially defined as STATIC_DRAW_ARB, and offsets into it to separate the individual vertex arrays. I’m updating it using mapped memory access (glMapBufferARB with WRITE_ONLY_ARB).

Results: the rendering speed is comparable to VAR, no problems here. But the speed of updating the buffer (mapping/filling/unmapping) is more than an order of magnitude slower than with VAR, despite using the exact same filling code. With VAR, the updates were unnoticeable and introduced no visible lag. With VBO, I get lags of up to 1 second when updating a larger amount of data. The updating process is easily 30 to 40 times slower than with VAR, using the same writing code!

I tried switching to DYNAMIC_DRAW_ARB (instead of STATIC_DRAW_ARB), but could not notice any difference.

Has anyone had similar experiences?

Thanks for any help,

/ Alex

Anyone ?

Can you try glBufferSubDataARB, or better yet, try multiple smaller VBOs and use glBufferDataARB to respecify them?

I think glMapBufferARB is a hoax … haven’t tested it on the new drivers, but older results with Det 45.23 were quite discouraging.

There’s a paper on VBO performance somewhere in NVIDIA’s dev department, you should be able to find it.

PS: definitely not STATIC_DRAW. Sounds like you should be using STREAM_DRAW.

Thanks for your reply.

Originally posted by zeckensack:
Can you try glBufferSubDataARB, or better yet, try multiple smaller VBOs and use glBufferDataARB to respecify them?

Well, the idea with mapping was to avoid the redundant copy glBufferDataARB & co will introduce. The data is compressed in main memory, and is sequentially decompressed into the VBO. Using glBufferDataARB or similar needs an additional temporary buffer: decompress to temp buffer, update VBO from temp buffer. Not so good. Still, I admit I haven’t tested it yet, so it might be worth a try.
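To make the trade-off concrete, here is a minimal sketch in C. The decoder is a hypothetical stand-in (the thread does not say which compression scheme is used): it expands 16-bit quantized values to floats and writes the destination strictly front to back without ever reading it. Since it only takes a destination pointer, the same routine could write into a pointer obtained from glMapBufferARB (no extra copy) or into a temporary buffer that is handed to glBufferDataARB afterwards (one extra copy).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: expands 16-bit quantized values to floats.
   It writes dst front to back, 4 bytes at a time, and never reads it back,
   so dst may be a mapped buffer or an ordinary temporary buffer. */
void decode_vertices(const uint16_t *src, size_t count, float scale, float *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i) {
        dst[i] = (float)src[i] * scale;   /* sequential, write-only access */
    }
}
```

With the mapping path, dst would be the pointer returned by glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB); with the copy path, dst is a scratch array in main memory that is passed to glBufferDataARB once decoding is done.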

I think glMapBufferARB is a hoax … haven’t tested it on the new drivers, but older results with Det 45.23 were quite discouraging.

The VBO implementation was buggy to the extreme in 45.23, almost to the point of total unusability. It is much better in the newest drivers - but I’m still pretty sure that there is a huge performance hog hidden somewhere in the mapping functionality. I guess I could wait for the next driver release, but I hoped for some temporary workaround :)

There’s a paper on VBO performance somewhere in NVIDIA’s dev department, you should be able to find it.

Thanks, I will look for that.

PS: definitely not STATIC_DRAW. Sounds like you should be using STREAM_DRAW.

Hmm, I don’t think so - from the specs:

“STREAM_DRAW_ARB - The data store contents will be specified once by the application, and used at most a few times as the source of a GL (drawing) command”.

That’s definitely not the case. Now, I don’t know how close to the spec ‘suggestions’ nvidia’s implementation actually is, but it sounds to me like STREAM_DRAW is best implemented by a main RAM buffer. I would like to have my data in VRAM. VBO does not offer direct control over the storage mode, but as far as I interpret the specs, STATIC_DRAW would be closest. DYNAMIC_DRAW is probably streamed over AGP. Just guessing, though.

In practice, STREAM_DRAW gave me absolutely horrible rendering performance, close to standard VAs. STATIC_DRAW and DYNAMIC_DRAW both had pretty good rendering performance and very poor updating performance.

Now, we all know that writing to VRAM is not the fastest operation there is. But since VAR managed to do it at a very acceptable speed, it must be technically feasible.

/ Alex

Originally posted by AlexH:

Results: the rendering speed is comparable to VAR, no problems here. But the speed of updating the buffer (mapping/filling/unmapping) is more than an order of magnitude slower than with VAR, despite using the exact same filling code. With VAR, the updates were unnoticeable and introduced no visible lag. With VBO, I get lags of up to 1 second when updating a larger amount of data. The updating process is easily 30 to 40 times slower than with VAR, using the same writing code!
/ Alex

You need to split your arrays into multiple VBOs. By using a single composite VBO (which it sounds like you’re doing), you introduce synchronization requirements. VAR would not have these because it places the burden of synchronization on the application programmer. VBO does not, but its sole means of eliminating synchronization is via multiple VBOs and VBO versions (renaming).

Mapping a large VBO to change only part of it is “death”. BufferSubData would be a much more efficient way to do that. However, you should strive to make your buffers small enough such that you’re usually replacing the whole buffer.
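A rough sketch of what that restructuring could look like for an LRU vertex cache: each cache slot owns its own buffer object and is always respecified as a whole. Everything here is illustrative rather than taken from the thread. CacheSlot, pick_lru, replace_lru and the VboApi struct are hypothetical names, the GL_STATIC_DRAW_ARB hint is just the one the original poster started with, and the fake_* functions are stand-ins that only record their arguments so the sketch can run without a GL context. The two entry points use the signatures from the ARB_vertex_buffer_object spec.

```c
#include <assert.h>
#include <stddef.h>

/* Local declarations so the sketch compiles without GL headers; the enum
   values are the ones listed in the ARB_vertex_buffer_object spec. */
typedef unsigned int GLenum;
typedef unsigned int GLuint;
typedef ptrdiff_t    GLsizeiptrARB;
#define GL_ARRAY_BUFFER_ARB 0x8892
#define GL_STATIC_DRAW_ARB  0x88E4

/* Entry points with the extension's signatures. A real program fetches them
   at startup (for example with wglGetProcAddress on Windows). */
typedef struct {
    void (*BindBufferARB)(GLenum target, GLuint buffer);
    void (*BufferDataARB)(GLenum target, GLsizeiptrARB size,
                          const void *data, GLenum usage);
} VboApi;

/* One cache slot = one buffer object, instead of one offset into a big VBO. */
typedef struct {
    GLuint   buffer;     /* name obtained from glGenBuffersARB */
    unsigned last_used;  /* frame number of the last draw, used for LRU */
} CacheSlot;

/* Index of the least recently used slot. */
size_t pick_lru(const CacheSlot *slots, size_t count)
{
    assert(count > 0);
    size_t victim = 0;
    for (size_t i = 1; i < count; ++i)
        if (slots[i].last_used < slots[victim].last_used)
            victim = i;
    return victim;
}

/* Respecify the whole buffer of the LRU slot with new vertex data.
   No other slot's buffer is touched, so draws from them need not be waited on. */
size_t replace_lru(const VboApi *gl, CacheSlot *slots, size_t count,
                   const void *data, size_t bytes, unsigned frame)
{
    size_t victim = pick_lru(slots, count);
    gl->BindBufferARB(GL_ARRAY_BUFFER_ARB, slots[victim].buffer);
    gl->BufferDataARB(GL_ARRAY_BUFFER_ARB, (GLsizeiptrARB)bytes,
                      data, GL_STATIC_DRAW_ARB);
    slots[victim].last_used = frame;
    return victim;
}

/* Stand-in entry points (hypothetical) that only record their arguments,
   so the sketch can run without a GL context. */
GLuint        fake_bound;
GLsizeiptrARB fake_size;
static void fake_bind(GLenum target, GLuint buffer)
{
    (void)target;
    fake_bound = buffer;
}
static void fake_data(GLenum target, GLsizeiptrARB size,
                      const void *data, GLenum usage)
{
    (void)target; (void)data; (void)usage;
    fake_size = size;
}
const VboApi fake_api = { fake_bind, fake_data };
```

In a real renderer the VboApi members would point at the driver's glBindBufferARB and glBufferDataARB, and the slot sizes would be chosen so that an update normally replaces a slot completely.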

Thanks -
Cass

Cass, thanks a lot, that was exactly the information I was looking for.

Yes, I was in fact using a single large mapped and partitioned VBO, legacy from a quick’n’dirty VAR to VBO conversion. Splitting up the buffers into multiple VBOs should be easy enough.

Just one more quick question - as you probably have a lot of insider knowledge about nVidia’s implementation: assuming the entire buffer data is to be replaced, would it be more advisable to use BufferDataARB (and live with an additional copy), or to map the buffer and fill it directly (i.e., is there any hidden overhead in mapping/unmapping I should be aware of, like the sync issue you brought up)?

Thanks !

/ Alex

Originally posted by AlexH:
Just one more quick question - as you probably have a lot of insider knowledge about nVidia’s implementation: assuming the entire buffer data is to be replaced, would it be more advisable to use BufferDataARB (and live with an additional copy), or to map the buffer and fill it directly (i.e., is there any hidden overhead in mapping/unmapping I should be aware of, like the sync issue you brought up)?

Thanks!

/ Alex

Hi Alex,

The primary (initial) intent of VBO map/unmap was to allow the zero-copy mechanism for streaming vertex data.

If you’re going to replace a whole buffer by mapping, make sure to call BufferData with a null data pointer first. This allows the implementation to allocate a new one without copying the contents of the old one first.
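As a hedged illustration of that call order (bind, BufferData with a null pointer, map, fill, unmap), here is a small C sketch. refill_by_mapping and VboApi are hypothetical names, the GL_STATIC_DRAW_ARB hint is only an assumption carried over from the original post, and the memcpy stands in for whatever fill or decompression step the application really runs. The fake_* functions are a stand-in backend (heap memory plus a call log) so the sketch can run without a GL context. The four entry points use the signatures from the ARB_vertex_buffer_object spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Local declarations so the sketch compiles without GL headers; the enum
   values are the ones listed in the ARB_vertex_buffer_object spec. */
typedef unsigned int  GLenum;
typedef unsigned int  GLuint;
typedef unsigned char GLboolean;
typedef ptrdiff_t     GLsizeiptrARB;
#define GL_ARRAY_BUFFER_ARB 0x8892
#define GL_WRITE_ONLY_ARB   0x88B9
#define GL_STATIC_DRAW_ARB  0x88E4

/* Entry points with the extension's signatures, fetched at startup in a
   real program (for example with wglGetProcAddress on Windows). */
typedef struct {
    void      (*BindBufferARB)(GLenum target, GLuint buffer);
    void      (*BufferDataARB)(GLenum target, GLsizeiptrARB size,
                               const void *data, GLenum usage);
    void     *(*MapBufferARB)(GLenum target, GLenum access);
    GLboolean (*UnmapBufferARB)(GLenum target);
} VboApi;

/* Replace the whole contents of `buffer` with `bytes` bytes from `src`.
   Returns 1 on success, 0 if mapping or unmapping failed. */
int refill_by_mapping(const VboApi *gl, GLuint buffer,
                      const void *src, size_t bytes)
{
    assert(src != NULL && bytes > 0);
    gl->BindBufferARB(GL_ARRAY_BUFFER_ARB, buffer);
    /* NULL data pointer: the old contents are declared dead, so the driver
       is free to allocate a fresh store instead of preserving them. */
    gl->BufferDataARB(GL_ARRAY_BUFFER_ARB, (GLsizeiptrARB)bytes,
                      NULL, GL_STATIC_DRAW_ARB);
    void *dst = gl->MapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
    if (dst == NULL)
        return 0;
    memcpy(dst, src, bytes);  /* stands in for the real fill/decompress step */
    /* Unmap reports GL_FALSE if the store was lost while it was mapped. */
    return gl->UnmapBufferARB(GL_ARRAY_BUFFER_ARB) ? 1 : 0;
}

/* Stand-in backend (hypothetical): heap memory plus a log of call order. */
char  fake_log[16];
void *fake_store;
static void log_call(char c)
{
    size_t n = strlen(fake_log);
    if (n + 1 < sizeof fake_log)
        fake_log[n] = c;
}
static void fake_bind(GLenum t, GLuint b) { (void)t; (void)b; log_call('B'); }
static void fake_data(GLenum t, GLsizeiptrARB size, const void *data, GLenum u)
{
    (void)t; (void)u;
    free(fake_store);
    fake_store = malloc((size_t)size);
    if (data != NULL && fake_store != NULL)
        memcpy(fake_store, data, (size_t)size);
    log_call('D');
}
static void *fake_map(GLenum t, GLenum a)
{
    (void)t; (void)a;
    log_call('M');
    return fake_store;
}
static GLboolean fake_unmap(GLenum t) { (void)t; log_call('U'); return 1; }
const VboApi fake_api = { fake_bind, fake_data, fake_map, fake_unmap };
```

If the extra copy is acceptable, the map/fill/unmap part collapses into a single BufferDataARB call with the source pointer passed in place of NULL.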

If you can tolerate the single copy, favor BufferData over map/unmap.

I hesitate to give too much rock-solid VBO guidance, because what is expensive today may be much less so tomorrow. VBO implementations are pretty wet behind the ears still, and we’ll know better once we’ve had more experience optimizing the driver for them.

Thanks -
Cass

PS Happy New Year!