VBO enhanced performance

Excerpt from ARB_vertex_buffer_object

…When an application maps a buffer, it is given a pointer to the memory. When the application finishes reading from or writing to the memory, it is required to “unmap” the buffer before it is once again permitted to use that buffer as a GL data source or sink. Mapping often allows applications to eliminate an extra data copy otherwise required to access the buffer, thereby enhancing performance.

Does anyone know how to eliminate this extra copy and get enhanced performance on NVIDIA GPUs? All 90-series or later NVIDIA drivers seem to always copy the VBO between video and system memory after a map/unmap, killing any performance gains. Is there any way to keep the VBO pinned in video memory? The usage hints seem to have no effect.

You have probably already seen this, but if not this might be useful for you.

http://developer.nvidia.com/object/using_VBOs.html

Specifics might be good: what OS, what graphics card, and how are you timing this (glBeginQuery(GL_TIME_ELAPSED_EXT, …)?)

If I’ve understood things correctly, when you draw from a buffer to OpenGL (using glVertexPointer etc), the OpenGL implementation has to copy the data into an internal buffer before returning from the draw call (glDrawElements or similar). It can then upload the data from that internal buffer to the video card while your app does other things.

The alternative would be to wait for the upload to complete before returning, as the OpenGL implementation has no other way to be sure that the memory you’ve pointed to is still valid after the draw call returns.

Using a VBO, the OpenGL implementation effectively gives you direct access to this internal buffer, and so it doesn’t have to copy data into it, since you fill it directly.

I believe this is the eliminated copy the text mentions, since afaik there’s no way for an app to directly access video memory.
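To make the distinction concrete, here's a rough sketch of the two paths being compared (the function names, `verts`, and `nVerts` are mine, not from any real app, and this obviously needs a live GL context to run):

```c
#include <GL/gl.h>
#include <GL/glext.h>

/* Path 1: client-side array. The driver must copy the data out of
 * "verts" before glDrawArrays returns, since the app is free to
 * modify or free that memory immediately afterwards. */
void draw_client_array(const float *verts, int nVerts)
{
    glBindBuffer(GL_ARRAY_BUFFER, 0);           /* no VBO bound */
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glDrawArrays(GL_TRIANGLES, 0, nVerts);      /* implicit copy here */
}

/* Path 2: mapped VBO. The app writes straight into the buffer the
 * driver hands back, so there is no extra copy at draw time. */
void draw_mapped_vbo(GLuint vbo, const float *verts, int nVerts)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    float *dst = (float *)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dst) {
        for (int i = 0; i < nVerts * 3; ++i)
            dst[i] = verts[i];                  /* fill in place */
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
    glVertexPointer(3, GL_FLOAT, 0, (void *)0); /* offset into the VBO */
    glDrawArrays(GL_TRIANGLES, 0, nVerts);
}
```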

Is there any way to keep the VBO pinned in video memory? The usage hints seem to have no effect.

No.

The specification allows this behavior from a driver when you map/unmap a buffer. This lets a driver create VBOs in memory that you can’t actually map directly, or handle buffers that are currently in use (which may be your problem).

If you had read the document that Timothy Farrar referenced in his post, you would have discovered your conclusion was false. VBO is a simpler alternative to VAR. VAR allows explicit mapping of video memory, as described in the paper available on this page: http://developer.nvidia.com/object/Using_GL_NV_fence.html

LordCRC, registered in 2001, yet only 38 posts, the most recent being incredibly ignorant of basic opengl mechanisms.
What have you been doing all these years?

Hi,
From personal experience I’ve found that map/unmap is always
slower; I just use glBufferData and not even glBufferSubData.
Ido

Similar experience here on NVidia with map/unmap (slower), but I’ve found that a fixed (max) sized NULL glBufferData (to dump the old contents) followed by a glBufferSubData call each time to reload works fastest, particularly with PBOs. The intuition I read was that if you tell GL you don’t care about the old contents and then provide an update, yes, it has to copy the data to the GL driver immediately, but it doesn’t stall on previous uses of the buffer.
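For what it's worth, the pattern I'm describing reads roughly like this in code (MAX_BYTES, the buffer handle, and the usage hint are placeholders for whatever your app uses):

```c
#include <GL/gl.h>
#include <GL/glext.h>

#define MAX_BYTES (1 << 20)  /* fixed maximum size, app-specific */

/* "Orphan then refill": the NULL glBufferData tells GL the old
 * contents are dead, so it need not stall on draws still using
 * them; glBufferSubData then copies the new data into the fresh
 * storage. */
void update_vbo_orphaned(GLuint vbo, const void *data, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, MAX_BYTES, NULL, GL_STREAM_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, size, data);
}
```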

The changes planned for GL3 are going to make this much easier to get right…

http://www.opengl.org/pipeline/article/vol004_3/

Those claiming glBufferData or glBufferSubData is faster are still living in a single threaded / single CPU world. Mapping allows drawing from one VBO to occur in parallel with filling another. What could be faster?

Mapping allows drawing from one VBO to occur in parallel with filling another. What could be faster?

I don’t know, devoting an entire CPU to doing rendering, and doing everything not rendering on another CPU? It’s also a lot easier to implement and a lot harder to break.

Sure, you might get faster by taking your rendering thread and making it two threads. But you might not.

What matters for performance is that the GPU is always busy. If a dedicated rendering CPU is enough to do that, then do it.

Misunderstanding the specs, it appears…

But to really answer your question: I’ve been primarily interested in realistic/physically based rendering.

Sorry Lordcrc, I was in a bad mood.

You missed the point. Why keep the driver/GPU busy transferring vertex and index data when it could be receiving drawing commands? Mapping allows the vertex and index data transfers to occur with minimal driver/GPU activity. The thread filling the VBOs does not need an OpenGL context because it is not interacting with the driver.
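As a sketch of that scheme (the pthread plumbing and the fill loop are just illustrative), with the map and unmap happening on the thread that owns the context, as they must:

```c
#include <GL/gl.h>
#include <GL/glext.h>
#include <pthread.h>

typedef struct { float *dst; int nFloats; } FillJob;

/* Worker thread: plain memory writes through the mapped pointer.
 * No GL calls here, so no GL context is needed. */
static void *fill_thread(void *arg)
{
    FillJob *job = (FillJob *)arg;
    for (int i = 0; i < job->nFloats; ++i)
        job->dst[i] = 0.0f;   /* app-specific vertex generation */
    return NULL;
}

/* GL thread: map, hand the pointer off, keep rendering from
 * other buffers in the meantime. */
void begin_fill(GLuint vbo, pthread_t *tid, FillJob *job)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    job->dst = (float *)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    pthread_create(tid, NULL, fill_thread, job);
    /* ...issue draw calls from other VBOs here... */
}

/* GL thread again: wait for the fill, then unmap. */
void end_fill(GLuint vbo, pthread_t tid)
{
    pthread_join(tid, NULL);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
```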

No worries. Always good to get misconceptions cleared up anyway.

I must admit I never had the use for mapping a VBO, so I skipped those parts when reading about them. I just assumed the driver uploaded the data using DMA or something, and thus that it was faster for it to do that from an internal buffer instead of having the app wait for the upload to complete.

In regards to GPU waiting when using glMapBuffer(), this is probably the most important thing to glean from the NVidia doc:

To solve this conflict, you just need to call glBufferDataARB() with a NULL pointer, then call glMapBuffer(). This tells the driver that the previous data aren’t valid. As a consequence, if the GPU is still working on them, there won’t be a conflict, because we invalidated those data. glMapBuffer() returns a new pointer that we can use while the GPU is working on the previous set of data…

Basically, always ensure the GL driver doesn’t have to block waiting for the GPU to flag that it is finished with the previous frame’s VBO.
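In code, the recipe from the doc looks roughly like this (the size and usage hint are placeholders; needs a live context):

```c
#include <GL/gl.h>
#include <GL/glext.h>

/* Orphan the buffer with a NULL glBufferData, then map. The old
 * storage stays alive for the GPU to finish with; the map should
 * return immediately with a pointer to fresh storage instead of
 * blocking on the previous frame's draws. */
void *map_without_stall(GLuint vbo, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);
    return glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
}
```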

I never got glMapBuffer() to be faster than glBufferSubData() in my application.

In theory, mapping could be faster but i never saw it happen in practice… mapping caused serious slowdowns for me. Maybe it is the ATI driver?

Can I assume that the same is true for PBOs? From what I’ve understood, which I obviously can’t rely on anymore, PBOs and VBOs are essentially the same in the way the buffers are handled. Is this correct?

Why keep the driver/GPU busy transferring vertex and index data when it could be receiving drawing commands.

Because you wanted to upload data to the GPU. That requires talking to the driver.

OK, let’s say you do this two rendering thread thing, where you have a mapped pointer in one thread and you’re rendering in another. What happens if the driver in the rendering thread suddenly decides that it needs to pull your buffer out of video memory and put it into main memory to make room for a texture?

There must always be communication between the mapped buffer and the driver.

Mapping allows the vertex and index data transfers to occur with minimal driver/GPU activity. The thread filling the VBOs does not need an OpenGL context because it is not interacting with the driver.

Nonsense.

What if you’re rendering from that buffer when you decide to go mapping it? The driver has to ensure that the previous rendering command finishes before mapping it. And that requires access to the context.

Now, GL 3.0 will offer an ultimate form of mapping which provides you with absolutely no guarantees on anything; it just hands you a pointer and you’re expected to ensure that the data isn’t being read from/etc. But GL 2.1 doesn’t have any such concept.

And even in GL 3.0, it won’t be some magical process that can happen without the driver’s consent; it will still need to know about it.

Can I assume that the same is true for PBO’s?

They’re all just buffer objects. The fact that one gets bound to a gl*Pointer slot and the other gets bound to a PACK/UNPACK slot is fairly irrelevant to how you access the data.

Excerpt from ARB_vertex_buffer_object:

What happens to a mapped buffer when a screen resolution change or other such window-system-specific system event occurs?

RESOLVED: The buffer’s contents may become undefined. The application will then be notified at Unmap time that the buffer’s contents have been destroyed. However, for the remaining duration of the map, the pointer returned from Map must continue to point to valid memory, in order to ensure that the application cannot crash if it continues to read or write after the system event has been handled.

Where did you get this fact? Once the driver produces a pointer it is valid for all threads. No driver activity is involved in using the pointer.

Excerpt from ARB_vertex_buffer_object:

…The expectation is that an application might map a buffer and start filling it in a different thread, but continue to render in its main thread (using a different buffer or no buffer at all)…

Agreed. Mapping and unmapping require a context.