glMapBufferRange performance issue

I’m currently trying to stream vertex color information for a fairly large mesh. The layout of the vertex data is as follows:

VBO 1:

  • Static draw
  • Positions, normals, etc.

VBO 2:

  • Stream draw
  • Color data (1 float)

Rendering is done using a VAO.

I’ve tried a number of different approaches to get the best performance when streaming data into VBO 2: double-buffering VBOs, orphaning the buffer using glBufferData, and now glMapBufferRange with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT.

My intuition may be off, but when I time the map, memcpy and unmap of VBO 2 using glMapBufferRange, the whole sequence takes 16 ms of CPU time, 13 ms of which is spent in the map operation. Can a mapping operation really take this long, almost a full frame?

Uploading 2.5 MB or 10 MB of vertex data doesn’t really change the timing.

“My intuition may be off, but when I time the map, memcpy and unmap of VBO 2 using glMapBufferRange, the whole sequence takes 16 ms of CPU time, 13 ms of which is spent in the map operation.”

I can’t speak to how long the map operation takes, but I do know this: the purpose of mapping is to let you generate the buffer’s data directly into the mapped memory. So you would load your data into the mapped pointer, or generate it there with some algorithm. You shouldn’t be doing a memcpy at all; if your data is already sitting in memory behind a pointer, you should use glBuffer(Sub)Data instead.
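To make the "generate directly into the mapped memory" idea concrete, here is a minimal sketch in C. The function name write_colors and the i * scale formula are hypothetical placeholders, not anything from the original poster's code; dst stands in for the pointer that glMapBufferRange would return.

```c
#include <assert.h>
#include <stddef.h>

/* Writes one color value per vertex straight into `dst`.
 * In real use `dst` would be the pointer returned by glMapBufferRange;
 * the i * scale formula is only a placeholder for the real color source. */
static void write_colors(float *dst, size_t vertex_count, float scale)
{
    assert(dst != NULL);
    for (size_t i = 0; i < vertex_count; ++i) {
        /* Write-only and in order: never read back from mapped memory. */
        dst[i] = (float)i * scale;
    }
}
```

With a real buffer, the call would sit between glMapBufferRange and glUnmapBuffer, so no staging array and no extra memcpy are involved.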

Now, 13 ms is a long time for mapping to take. What are your hardware, drivers, etc.?

Well, doing a glBufferSubData or a memcpy into a mapped buffer SHOULD be the same, AFAIK.

I’m on a Xeon E5620 with an Nvidia GTX 470 (driver 270.61).

OK, new development. I found that mapping a buffer that is actively being used for rendering while v-sync is active causes the CPU to wait a frame (~16 ms) before the mapping operation returns a pointer. This happens EVEN when I explicitly tell OpenGL that I want to orphan the old data by passing GL_MAP_INVALIDATE_BUFFER_BIT to the mapping operation.

Can anyone explain why that is?

If you’re stalling it generally means that the portion of the buffer you’re trying to map is currently in use by the driver, yes. Now, this is coming from D3D so take it with the appropriately sized grain of salt, but the typical dynamic buffer usage pattern that works is:

  • orphan
  • append
  • draw
  • append
  • draw

and so on with append/draw until the buffer is full, at which point we orphan again.

Note that we’re always appending to the buffer during a frame, and that we always orphan the entire buffer, not just a range. This way the portion of the buffer that we’re currently writing to is guaranteed to never be in use by the driver, and we don’t stall (granted the D3D Lock/Unlock semantics are considerably clearer than OpenGL’s here, so it’s more obvious what the correct thing to do is).

So you should double-check your code to ensure that you’re following a similar usage pattern, and adjust it if necessary.
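The orphan/append/draw cycle described above comes down to a little offset bookkeeping. The following C sketch shows only that bookkeeping; StreamBuffer, StreamSlot and stream_reserve are hypothetical names, and the GL calls appear only as comments because they need a live context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one streaming VBO; not part of OpenGL. */
typedef struct {
    size_t capacity; /* total size of the buffer store in bytes */
    size_t cursor;   /* first byte not yet handed out since the last orphan */
} StreamBuffer;

typedef struct {
    size_t offset;   /* where to write, and the base offset for the draw call */
    int    orphan;   /* nonzero: orphan the whole buffer before writing */
} StreamSlot;

/* Reserve `size` bytes. Appends after the previous reservation when it fits;
 * otherwise restarts at offset 0 and asks the caller to orphan first. */
static StreamSlot stream_reserve(StreamBuffer *sb, size_t size)
{
    StreamSlot slot;
    assert(size <= sb->capacity);
    if (sb->cursor + size > sb->capacity) {
        /* Buffer full: the caller orphans the whole store, e.g.
         *   glBufferData(GL_ARRAY_BUFFER, capacity, NULL, GL_STREAM_DRAW);
         * or maps with GL_MAP_INVALIDATE_BUFFER_BIT. */
        sb->cursor = 0;
        slot.orphan = 1;
    } else {
        slot.orphan = 0;
    }
    slot.offset = sb->cursor;
    sb->cursor += size;
    return slot;
}
```

Each reservation is followed by a write at slot.offset and a draw that reads from the same offset, so a region is never rewritten until the whole buffer has been orphaned. Real code would probably also round offsets up to the vertex stride, which this sketch leaves out.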

Secondly, if you’ve got the usage pattern set up right, you shouldn’t rule out the possibility that this might be a pure driver issue. Your driver and/or hardware may not actually be able to support this kind of usage pattern with glMapBufferRange without needing to stall, in which case you’ll need to consider an alternative method of getting your data into the buffer.

This has come up recently. Any perf results you get with V-sync enabled (i.e. sync-to-vblank on) may very well be total garbage. Some drivers read ahead past SwapBuffers before V-sync and then block for a while on random OpenGL calls in subsequent frames when the driver-internal FIFOs fill up or the driver reaches some internally predefined “read-ahead” limit, rendering your timings completely useless.

It could be that this is what’s causing your map call to block.

You can try sticking a glFinish() after your SwapBuffers call and that may help. On some cards, this will force a wait for V-sync before continuing when you’ve got sync-to-vblank enabled. But I don’t think there’s anything in the spec that guarantees this behavior.

Best bet: disable sync-to-vblank when trying to do any meaningful timing.
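For the timing itself, a small helper along these lines may be enough. This is a sketch that assumes a C11 toolchain; elapsed_ms is a hypothetical name, and the swap-interval and GL calls are shown only as comments because they are platform-specific and need a live context.

```c
#include <assert.h>
#include <time.h>

/* Milliseconds between two timestamps taken with C11 timespec_get(). */
static double elapsed_ms(struct timespec start, struct timespec end)
{
    assert(end.tv_sec >= start.tv_sec);
    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1.0e6;
}

/* Intended use (GL calls shown as comments; they need a live context):
 *
 *   // vsync off first, e.g. wglSwapIntervalEXT(0) or glXSwapIntervalEXT(..., 0)
 *   struct timespec t0, t1;
 *   timespec_get(&t0, TIME_UTC);
 *   // glMapBufferRange(...); write data; glUnmapBuffer(...);
 *   timespec_get(&t1, TIME_UTC);
 *   double ms = elapsed_ms(t0, t1);
 */
```

TIME_UTC is a wall clock, so a monotonic clock is preferable where the platform offers one. Averaging over many frames also gives a steadier number than a single sample.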

And no, there’s no way in the world a Map should take anywhere near that long. You should be able to do many, many of those in that time frame.

My experience differs here. Doing pure memcpys into a mapped batch submission buffer with smart usage of UNSYNCHRONIZED/INVALIDATE performs very, very well.
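One common reading of "smart usage of UNSYNCHRONIZED/INVALIDATE" is to pick the glMapBufferRange access flags according to whether the next write still fits in the untouched tail of the buffer. The C sketch below shows that choice only; stream_map_access is a hypothetical helper, and the constants are guarded copies of the values in the standard OpenGL headers so that the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Values as defined in the OpenGL headers; guarded so this compiles
 * with or without them. */
#ifndef GL_MAP_WRITE_BIT
#define GL_MAP_WRITE_BIT             0x0002
#define GL_MAP_INVALIDATE_RANGE_BIT  0x0004
#define GL_MAP_INVALIDATE_BUFFER_BIT 0x0008
#define GL_MAP_UNSYNCHRONIZED_BIT    0x0020
#endif

/* Access flags for mapping `size` bytes at `cursor` in a store of `capacity` bytes. */
static unsigned int stream_map_access(size_t cursor, size_t size, size_t capacity)
{
    assert(size <= capacity);
    if (cursor + size > capacity) {
        /* No room left: discard the whole store and restart at offset 0. */
        return GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT;
    }
    /* Appending into a region no earlier draw call has touched, so the
     * driver does not need to wait for the GPU before returning a pointer. */
    return GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT;
}
```

The unsynchronized path is only safe because the range being mapped has not been handed to the GPU since the last invalidate. If that assumption does not hold, the flag trades the stall for a data race.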

I’d suggest this thread as required reading for anyone trying to get the max perf from VBO uploads, particularly Rob Barris’ posts, starting at this post: