VBO enhanced performance



macarter
11-14-2007, 04:56 PM
Excerpt from ARB_vertex_buffer_object

...When an application maps a buffer, it is given a pointer to the memory. When the application finishes reading from or writing to the memory, it is required to "unmap" the buffer before it is once again permitted to use that buffer as a GL data source or sink. Mapping often allows applications to eliminate an extra data copy otherwise required to access the buffer, thereby enhancing performance...
Does anyone know how to eliminate this extra copy and get enhanced performance on NVIDIA GPUs? All 90 series or later NVIDIA drivers seem to always copy the VBO between video and system memory after a map/unmap, killing any performance gains. Is there any way to keep the VBO pinned in video memory? The usage hints seem to have no effect.

Timothy Farrar
11-14-2007, 05:36 PM
You have probably already seen this, but if not this might be useful for you.

http://developer.nvidia.com/object/using_VBOs.html

Specifics would help: which OS, which graphics card, and how are you timing this? With glBeginQuery(GL_TIME_ELAPSED_EXT,...)?

Lord crc
11-14-2007, 11:08 PM
If I've understood things correctly, when you draw from a buffer to OpenGL (using glVertexPointer etc), the OpenGL implementation has to copy the data into an internal buffer before returning from the draw call (glDrawElements or similar). It can then upload the data from that internal buffer to the video card while your app does other things.

The alternative would be to wait for the upload to complete before returning, as the OpenGL implementation has no other way to be sure that the memory you've pointed to is still valid after the draw call returns.

Using a VBO, the OpenGL implementation effectively gives you direct access to this internal buffer, and so it doesn't have to copy data into it, since you fill it directly.

I believe this is the eliminated copy the text mentions, since afaik there's no way for an app to directly access video memory.

Korval
11-14-2007, 11:18 PM
Is there any way to keep the VBO pinned in video memory? The usage hints seem to have no effect.

No.

The implementation allows this behavior from a driver when you map/unmap it. This allows a driver to create VBOs for memory that you can't actually map, or those that are currently in use (which may be your problem).

macarter
11-15-2007, 12:20 PM
...
I believe this is the eliminated copy the text mentions, since afaik there's no way for an app to directly access video memory.

If you had read the document that Timothy Farrar referenced in his post you would have discovered your conclusion was false. VBO is a simpler alternative to VAR. VAR allows explicit mapping of video memory, as described in the paper available on this page: http://developer.nvidia.com/object/Using_GL_NV_fence.html

knackered
11-15-2007, 01:12 PM
LordCRC, registered in 2001, yet only 38 posts, the most recent being incredibly ignorant of basic opengl mechanisms.
What have you been doing all these years?

Ido_Ilan
11-15-2007, 01:25 PM
Hi,
From personal experience I've found that map/unmap is always slower. I just use glBufferData and not even the glBufferSubData command.
Ido

Dark Photon
11-15-2007, 03:52 PM
From personal experience I've found that map/unmap is always slower. I just use glBufferData and not even the glBufferSubData command.

Similar experience here on NVidia with map/unmap (slower), but I've found that what works fastest, particularly with PBOs, is a glBufferData call with a NULL pointer and a fixed (maximum) size to dump the old contents, followed by a glBufferSubData call each time to reload. The intuition I read was that if you tell GL you don't care about the old contents and then provide an update, it still has to copy the data to the GL driver immediately, but it doesn't stall on previous uses of the buffer.

Brolingstanz
11-15-2007, 03:59 PM
The changes planned for GL3 are going to make this much easier to get right...

http://www.opengl.org/pipeline/article/vol004_3/

macarter
11-15-2007, 07:32 PM
Those claiming glBufferData or glBufferSubData is faster are still living in a single threaded / single CPU world. Mapping allows drawing from one VBO to occur in parallel with filling another. What could be faster?

Korval
11-15-2007, 08:22 PM
Mapping allows drawing from one VBO to occur in parallel with filling another. What could be faster?

I don't know, devoting an entire CPU to doing rendering, and doing everything not rendering on another CPU? It's also a lot easier to implement and a lot harder to break.

Sure, you might get faster by taking your rendering thread and making it two threads. But you might not.

What matters for performance is that the GPU is always busy. If a dedicated rendering CPU is enough to do that, then do it.

Lord crc
11-16-2007, 01:22 AM
LordCRC, registered in 2001, yet only 38 posts, the most recent being incredibly ignorant of basic opengl mechanisms.
What have you been doing all these years?

Misunderstanding the specs, it appears...

But to really answer your question: been primarily interested in realistic/physically based rendering.

knackered
11-16-2007, 04:40 AM
Sorry Lordcrc, I was in a bad mood.

macarter
11-16-2007, 08:15 AM
...What matters for performance is that the GPU is always busy. If a dedicated rendering CPU is enough to do that, then do it.

You missed the point. Why keep the driver/GPU busy transferring vertex and index data when it could be receiving drawing commands? Mapping allows the vertex and index data transfers to occur with minimal driver/GPU activity. The thread filling the VBOs does not need an OpenGL context because it is not interacting with the driver.

Lord crc
11-16-2007, 09:11 AM
Sorry Lordcrc, I was in a bad mood.

No worries. Always good to get misconceptions cleared up anyway.

I must admit I never had the use for mapping a VBO, so I skipped those parts when reading about them. I just assumed the driver uploaded the data using DMA or something, and thus that it was faster for it to do that from an internal buffer instead of having the app wait for the upload to complete.

Timothy Farrar
11-16-2007, 10:57 AM
In regards to GPU waiting when using glMapBuffer(), this is probably the most important thing to glean from the NVidia doc:


To solve this conflict you just need to call glBufferDataARB() with a NULL pointer. Then calling glMapBuffer() tells the driver that the previous data aren't valid. As a consequence, if the GPU is still working on them, there won't be a conflict because we invalidated these data. The function glMapBuffer() returns a new pointer that we can use while the GPU is working on the previous set of data.

Basically, always ensure the GL driver doesn't have to block waiting for the GPU to flag that it is finished with the previous frame's VBO.
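
The discard-then-refill pattern from the NVIDIA doc, and the glBufferSubData variant Dark Photon describes above, can be sketched in C roughly as follows. This is only a sketch under stated assumptions: BufferApi is a hypothetical table of function pointers standing in for glBufferData, glBufferSubData, glMapBuffer and glUnmapBuffer (on many platforms these are fetched as function pointers anyway), the fake_* functions are a stand-in driver so the code runs without a GL context, and GL_STREAM_DRAW is just one plausible usage hint. Only the order of calls is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Enum values as defined in the OpenGL headers; drop these if you include them. */
#ifndef GL_ARRAY_BUFFER
#define GL_ARRAY_BUFFER 0x8892
#define GL_WRITE_ONLY   0x88B9
#define GL_STREAM_DRAW  0x88E0
#endif

/* Hypothetical table of entry points. In a real program these would point at
 * glBufferData, glBufferSubData, glMapBuffer and glUnmapBuffer. */
typedef struct {
    void  (*buffer_data)(unsigned target, ptrdiff_t size, const void *data, unsigned usage);
    void  (*buffer_sub_data)(unsigned target, ptrdiff_t offset, ptrdiff_t size, const void *data);
    void *(*map_buffer)(unsigned target, unsigned access);
    unsigned char (*unmap_buffer)(unsigned target);
} BufferApi;

/* Variant 1: discard the old contents, then map and write in place.
 * Returns 1 on success, 0 if the map failed or the contents were lost. */
int refill_by_map(const BufferApi *gl, const void *src, ptrdiff_t size)
{
    assert(size > 0);
    gl->buffer_data(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW); /* NULL = discard */
    void *dst = gl->map_buffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dst == NULL)
        return 0;
    memcpy(dst, src, (size_t)size);
    return gl->unmap_buffer(GL_ARRAY_BUFFER) ? 1 : 0;
}

/* Variant 2: discard at a fixed maximum size, then upload with a sub-data call. */
void refill_by_subdata(const BufferApi *gl, const void *src,
                       ptrdiff_t size, ptrdiff_t max_size)
{
    assert(size > 0 && size <= max_size);
    gl->buffer_data(GL_ARRAY_BUFFER, max_size, NULL, GL_STREAM_DRAW);
    gl->buffer_sub_data(GL_ARRAY_BUFFER, 0, size, src);
}

/* Fake driver so the sketch runs without a GL context. It logs the call order:
 * 'O' = discard (NULL data), 'D' = full upload, 'S' = sub-data, 'M' = map, 'U' = unmap. */
static unsigned char fake_store[64];
static char fake_log[16];
static int  fake_calls;

static void fake_buffer_data(unsigned t, ptrdiff_t n, const void *p, unsigned u)
{ (void)t; (void)n; (void)u; fake_log[fake_calls++] = p ? 'D' : 'O'; }
static void fake_buffer_sub_data(unsigned t, ptrdiff_t off, ptrdiff_t n, const void *p)
{ (void)t; memcpy(fake_store + off, p, (size_t)n); fake_log[fake_calls++] = 'S'; }
static void *fake_map_buffer(unsigned t, unsigned a)
{ (void)t; (void)a; fake_log[fake_calls++] = 'M'; return fake_store; }
static unsigned char fake_unmap_buffer(unsigned t)
{ (void)t; fake_log[fake_calls++] = 'U'; return 1; }

static const BufferApi fake_gl = {
    fake_buffer_data, fake_buffer_sub_data, fake_map_buffer, fake_unmap_buffer
};

/* Runs both variants against the fake driver; returns the number of calls logged. */
int demo(void)
{
    const char verts[] = "vertex data";
    fake_calls = 0;
    refill_by_map(&fake_gl, verts, (ptrdiff_t)sizeof verts);
    refill_by_subdata(&fake_gl, verts, (ptrdiff_t)sizeof verts, (ptrdiff_t)sizeof fake_store);
    return fake_calls;
}
```

In both variants the NULL glBufferData call comes first, so the driver is free to hand back fresh storage instead of waiting for the GPU to finish with the old one.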

Nighthawk
11-16-2007, 12:34 PM
I never got glMapBuffer() to be faster than glBufferSubData() in my application.

In theory, mapping could be faster but I never saw it happen in practice... mapping caused serious slowdowns for me. Maybe it is the ATI driver?

Lord crc
11-16-2007, 12:38 PM
In regards to GPU waiting when using glMapBuffer(), this is probably the most important thing to glean from the NVidia doc

Can I assume that the same is true for PBOs? From what I've understood, which I obviously can't rely on anymore, PBO and VBO are essentially the same in the way the buffers are handled. Is this correct?

Korval
11-16-2007, 02:18 PM
Why keep the driver/GPU busy transferring vertex and index data when it could be receiving drawing commands?

Because you wanted to upload data to the GPU. That requires talking to the driver.

OK, let's say you do this two rendering thread thing, where you have a mapped pointer in one thread and you're rendering in another. What happens if the driver in the rendering thread suddenly decides that it needs to pull your buffer out of video memory and put it into main memory to make room for a texture?

There must always be communication between the mapped buffer and the driver.


Mapping allows the vertex and index data transfers to occur with minimal driver/GPU activity. The thread filling the VBOs does not need an OpenGL context because it is not interacting with the driver.

Nonsense.

What if you're rendering from that buffer when you decide to go mapping it? The driver has to ensure that the previous rendering command finishes before mapping it. And that requires access to the context.

Now, GL 3.0 will offer an ultimate form of mapping which provides you with absolutely no guarantees on anything; it just hands you a pointer and you're expected to ensure that the data isn't being read from/etc. But GL 2.1 doesn't have any such concept.

And even in GL 3.0, it won't be some magical process that can happen without the driver's consent; it will still need to know about it.


Can I assume that the same is true for PBOs?

They're all just buffer objects. The fact that one gets bound to a gl*Pointer slot and the other gets bound to a PACK/UNPACK slot is fairly irrelevant to how you access the data.

macarter
11-16-2007, 05:00 PM
...OK, let's say you do this two rendering thread thing, where you have a mapped pointer in one thread and you're rendering in another. What happens if the driver in the rendering thread suddenly decides that it needs to pull your buffer out of video memory and put it into main memory to make room for a texture?

Excerpt from ARB_vertex_buffer_object:

What happens to a mapped buffer when a screen resolution change or other such window-system-specific system event occurs?

RESOLVED: The buffer's contents may become undefined. The application will then be notified at Unmap time that the buffer's contents have been destroyed. However, for the remaining duration of the map, the pointer returned from Map must continue to point to valid memory, in order to ensure that the application cannot crash if it continues to read or write after the system event has been handled.


...
There must always be communication between the mapped buffer and the driver.

Where did you get this fact? Once the driver produces a pointer it is valid for all threads. No driver activity is involved in using the pointer.

Excerpt from ARB_vertex_buffer_object:

...The expectation is that an application might map a buffer and start filling it in a different thread, but continue to render in its main thread (using a different buffer or no buffer at all)...


...
What if you're rendering from that buffer when you decide to go mapping it? The driver has to ensure that the previous rendering command finishes before mapping it. And that requires access to the context.

Agreed. Mapping and unmapping require a context.
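
The division of labour agreed on here (map and unmap on the thread that owns the context, plain memory writes on the helper) can be sketched with POSIX threads. This is a minimal sketch, not anyone's actual application code: FillJob, start_fill and finish_fill are hypothetical names, memset stands in for real geometry generation, and it assumes a POSIX system (build with -pthread).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Work handed to the helper thread. It must stay alive until finish_fill(). */
typedef struct {
    void         *dst;    /* pointer obtained from the map call on the GL thread */
    size_t        size;   /* number of bytes to write */
    unsigned char value;  /* fill byte; a stand-in for generated vertex data */
} FillJob;

/* Helper thread body: writes through the mapped pointer, makes no GL calls. */
static void *fill_worker(void *arg)
{
    FillJob *job = arg;
    memset(job->dst, job->value, job->size);
    return NULL;
}

/* Call on the GL thread right after mapping. Returns 0 on success. */
int start_fill(pthread_t *thread, FillJob *job)
{
    assert(job != NULL && job->dst != NULL);
    return pthread_create(thread, NULL, fill_worker, job);
}

/* Call on the GL thread before unmapping. Returns 0 on success. */
int finish_fill(pthread_t thread)
{
    return pthread_join(thread, NULL);
}
```

On the GL thread the sequence would be: glMapBuffer, start_fill, draw from a different buffer, finish_fill, glUnmapBuffer. The join before the unmap is what keeps the helper from writing through a pointer that is no longer valid.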

Humus
11-16-2007, 05:29 PM
Those claiming glBufferData or glBufferSubData is faster are still living in a single threaded / single CPU world. Mapping allows drawing from one VBO to occur in parallel with filling another. What could be faster?

Drawing from one VBO and uploading new data to it can occur in parallel. Keep in mind that a VBO is an abstraction. Under the hood the driver can do buffer renaming and keep multiple buffers in flight as long as you're replacing the entire buffer. The performance advantage of mapping is that you potentially save a copy. Other than that there's nothing preventing glBufferData() from being equally fast.

Korval
11-16-2007, 06:20 PM
The expectation is that an application might map a buffer and start filling it in a different thread, but continue to render in its main thread (using a different buffer or no buffer at all)...

Yes. But notice that nowhere in there does it say that you should not have a context bound in the other thread.

macarter
11-19-2007, 09:43 AM
Drawing from one VBO and uploading new data to it can occur in parallel. Keep in mind that a VBO is an abstraction. Under the hood the driver can do buffer renaming and keep multiple buffers in flight as long as you're replacing the entire buffer. The performance advantage of mapping is that you potentially save a copy. Other than that there's nothing preventing glBufferData() from being equally fast.
Buffer renaming is a server side optimization and applies equally to mapping after nulling the buffer with glBufferData or using glBufferData() directly to fill the VBO. From page 13 of the NVIDIA VBO white paper:



The pointer returned by glMapBuffer() refers to the actual location of the data. It is possible that the GPU could be working with these data, so requesting it for an update will force the driver to wait for the GPU to finish its task.

To solve this conflict you just need to call glBufferDataARB() with a NULL pointer. Then calling glMapBuffer() tells the driver that the previous data aren't valid. As a consequence, if the GPU is still working on them, there won't be a conflict because we invalidated these data. The function glMapBuffer() returns a new pointer that we can use while the GPU is working on the previous set of data.

What I am referring to is a client-side technique. Only one glBufferData() can be active at one time on a GL context. The data transfer must complete before the function will return. While this may be the fastest way to transfer the data, it is also likely to reduce the frame rate of the drawing thread if the buffer is large. If instead a mapped pointer is supplied to a helper thread, the transfer can occur in parallel with the drawing thread on multi-CPU systems. In fact several threads could be filling VBOs in parallel if processors are available. It is limited by the number of CPUs, by write combiners (see "Hyper-Threading Technology and Write Combining Store Buffers -- Understanding, Detecting and Correcting Performance Issues", http://www.intel.com/cd/ids/developer/asmo-na/eng/20465.htm), or by saturation of the bus.

Issuing glBufferData() calls from helper threads is possible. It would require a GL context per thread and shared VBOs. Its performance advantage would likely be defeated by resource conflicts in the driver. It would be an interesting experiment.

Brolingstanz
11-19-2007, 04:11 PM
I think procedural content creation gets really interesting in the context of multiple CPU threads, even CPU LOD might find its way back into the sun, given enough "spare" cores. Makes me wonder a bit...

macarter
11-19-2007, 10:01 PM
Here are some performance testing statistics I gathered today.

Dell Inspiron 8400, 3.2 GHz Pentium 4 hyperthreaded CPU, Windows XP SP2, NVIDIA 7800 GTX, 84.21 video driver

3.3 - 3.8 GB/sec 16 MB mapped buffer filled with memset() on a helper thread
1.3 - 1.4 GB/sec 1 MB buffer filled using glBufferSubData()

Supermicro with two dual core Opteron 275 2.2 GHz CPUs, Windows XP SP2, Radeon X1950XT, 7.10 video driver

3.0 - 6.0 GB/sec 16 MB mapped buffer filled with memset() on a helper thread
0.6 - 1.0 GB/sec 1 MB buffer filled using glBufferSubData()

The mapped buffer higher number is the average with a lightly loaded drawing thread (one full screen poly). The mapped buffer lower number is with a 125,000 vertex load in the drawing thread.

Our normal procedural geometry generation averages about 300 MB per second VBO fill rate using a helper thread. At this fill rate I measured a 3% improvement in drawing thread throughput. At the 3+ GB/sec transfer rate there was a 5% to 15% drop in drawing thread performance. I expect this drop indicates PCI bus contention between the threads. I have no explanation for the 3% performance gain with the lower transfer rate.

Humus
11-20-2007, 01:57 PM
Any particular reason you didn't compare with equally large buffers, 16MB vs 1MB, or is that a typo?

macarter
11-20-2007, 04:52 PM
Any particular reason you didn't compare with equally large buffers, 16MB vs 1MB, or is that a typo?

This is not a typo. I obtained this benchmark by small modifications to our application. The mapped buffer was available for a full overwrite without disturbing any other part of the application. The glBufferSubData call was limited to the last megabyte of the VBO which is rarely used by the drawing thread.

There is an overhead to the map/unmap call which is hidden by the size of the transfer. Our 16 MB double buffered VBO map swap consumes about 130 CPU microseconds. The glBufferSubData has a very low overhead. If procedural geometry were written to local memory and then transferred via a glBufferSubData call, the crossover point for the fastest transfer would depend on the buffer size. Buffers smaller than 0.3 to 0.5 MB would transfer faster with glBufferSubData. However, for our application, the absolute transfer rate is much less important than filling the VBO without a loss of frame rate.

macarter
11-29-2007, 12:43 PM
There is an overhead to the map/unmap call which is hidden by the size of the transfer. Our 16 MB double buffered VBO map swap consumes about 130 CPU microseconds...

There was a measurement error. 12-13 microseconds are required for a 16 MB VBO mapping swap. A glBufferSubData() transfer of 17 KB took 13 microseconds. Therefore transferring more than 17 KB in the drawing thread with glBufferSubData() causes a greater performance loss than mapping a 16 MB buffer.
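
The arithmetic behind that break-even point can be checked with a few lines of C. This is only a back-of-the-envelope sketch: the function names are made up for illustration, 1 KB is assumed to mean 1024 bytes, and the model assumes the glBufferSubData() cost scales linearly with size, which real drivers need not do.

```c
#include <assert.h>

/* Sustained rate in bytes per second for `bytes` moved in `microseconds`. */
double transfer_rate(double bytes, double microseconds)
{
    assert(microseconds > 0.0);
    return bytes / (microseconds * 1e-6);
}

/* Bytes that a copy running at `rate` (bytes/sec) moves in the time that one
 * map/unmap swap costs. Larger transfers cost the drawing thread more than
 * the swap does. */
double break_even_bytes(double map_overhead_us, double rate)
{
    assert(rate > 0.0);
    return rate * map_overhead_us * 1e-6;
}
```

With the figures above, transfer_rate(17 * 1024, 13) comes to roughly 1.34e9 bytes/sec, which lines up with the 1.3 - 1.4 GB/sec measured for glBufferSubData() on the NVIDIA machine, and break_even_bytes(13, that rate) gives back the 17 KB threshold.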

macarter
01-02-2008, 03:23 PM
First the good news :D.

The NVidia 169.21 driver fixes the VBO mapping performance problem. On a GeForce 8800 there are no issues. On a GeForce 7800/7900 the driver requires either a glFinish or a buffer discard by a NULL glBufferData call prior to remapping the VBO to get consistently good performance. The completion of a query object associated with the VBO is sufficient on the 8800 but fails intermittently on the 7800/7900. Of course, the buffer discard is the recommended method.

Now the bad news :sorrow:.

The ATI OpenGL driver rewrite seems to have broken VBO mappings. On the Radeon 2900 when attempting to draw from a VBO that was mapped then unmapped I get GL_INVALID_OPERATION. If the VBO is mapped as a pixel pack or pixel unpack buffer there is no error but performance is very poor. With the Catalyst 7.12 release the problem has spread to the Radeon X1950.

jwatte
01-03-2008, 12:38 AM
glBufferData() with NULL is a really good idea before re-writing buffers, no matter what.

Regarding multi-threaded buffer filling, I don't see the performance gain, because if it's a copy, then it's a memory bus limited process, and having two threads fight for the memory bus might actually be slower than serializing the two fills (because of DRAM page open and streaming issues).

If you are generating the data (say, through skinning, or some other algorithm that is CPU bound), then multi-threaded filling might make more sense.

macarter
01-03-2008, 10:45 AM
glBufferData() with NULL is a really good idea before re-writing buffers, no matter what...


You would think so. But we are seeing a timing jitter that may be related to the glBufferData() with NULL occasionally creating a new buffer. Since it is a large buffer, it may cause the driver to reorganize memory or restart new usage heuristics. So far we get more stable timings with a glFinish() and no NULL glBufferData() call.

Yes. We are generating our VBO data with a CPU bound algorithm.

macarter
01-16-2008, 08:56 AM
:D
I found a workaround for our VBO bug with the Catalyst 7.12 driver. It appears our usage pattern confuses the driver's VBO state management. The driver thinks we are attempting to draw from a mapped buffer. However, querying the GL state shows no pointers are bound to the mapped VBO. If I make sure there are no pointers bound to the buffer object when the buffer is mapped the driver's GL state doesn't get confused. A bug report has been submitted.

knackered
01-16-2008, 09:25 AM
That's not a bug.

macarter
01-17-2008, 07:02 AM
That's not a bug.

Please provide the reason for your conclusion. I find only the following restriction.

From the OpenGL version 2.1 specification

2.9.1 Vertex Arrays in Buffer Objects
...Attempts to source data from a currently mapped buffer object will generate an INVALID_OPERATION error.

Check out the example code for mapped buffer objects at the end of the ARB_vertex_buffer_object extension. You will find the VBO is mapped with bound pointers.

knackered
01-17-2008, 01:54 PM
Yep, as I said, that's a bug. ;)
Throwing an invalid_operation error is incorrect behaviour.
Submit a bug report.