The slides at http://www.slideshare.net/CassEveritt/beyond-porting stress that GL_MAP_UNSYNCHRONIZED_BIT should not be used because it causes a sync between the client and driver threads, so I dropped that flag. Next, I wanted to avoid the glMapBufferRange() and glUnmapBuffer() calls altogether, so I tried GL_MAP_PERSISTENT_BIT. I used it in two cases and ran into a problem in the second.
In the first case, I used it for the uniform buffers that hold the transform matrices, together with GL_MAP_FLUSH_EXPLICIT_BIT and calls to glFlushMappedBufferRange(). It seemed to work without any performance issue, even though I'm uploading per object before each draw call. I assume that any performance difference is small enough that other factors dominate.
In the second case, I tried it in the following context: I have shared memory into which another process draws an HD-resolution 32bpp image, on average once per frame. Whenever there's a new image, I upload it to a PBO and from there to a texture (since the PBO-to-texture transfer is asynchronous). There are actually two PBOs that I ping-pong between. When I changed from map-memcpy-unmap to a persistent mapping followed by memcpy-flush, as I had done with the uniform buffers in my first test case, the performance dropped a lot. This happened with every combination of other flags I tried. I tried flushing right after the memcpy, and alternatively right before the PBO data is used to load the texture. I tried no explicit flushing. I tried putting GL_MAP_UNSYNCHRONIZED_BIT back in. I tried GL_MAP_COHERENT_BIT. I also tried fences (one per PBO), set after the PBO is used to load the texture, with a corresponding glClientWaitSync() before the memcpy into that PBO. I tried orphaning with GL_MAP_INVALIDATE_BUFFER_BIT (though I'm not sure it makes sense for the large amount of data being transferred). I tried these in various combinations, but in the end I simply could not get the performance back to what it was with map-memcpy-unmap.
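For a sense of the data volume involved, here is a rough back-of-the-envelope sketch in C. It assumes that "HD" means 1920x1080, that rows are tightly packed, and that new images arrive at 60 per second; none of these figures are stated in the post, and the helper names frame_bytes and upload_rate_mb_per_s are made up for illustration.

```c
#include <stddef.h>

/* Assumption: "HD" is taken to mean 1920x1080; the post gives no exact size. */
enum { FRAME_WIDTH = 1920, FRAME_HEIGHT = 1080, BYTES_PER_PIXEL = 4 /* 32bpp */ };

/* Size in bytes of one tightly packed frame (no row padding assumed). */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Sustained upload rate in MB/s (1 MB = 10^6 bytes) for a given frame rate. */
static double upload_rate_mb_per_s(size_t bytes_per_frame, double frames_per_second)
{
    return (double)bytes_per_frame * frames_per_second / 1e6;
}
```

Under those assumptions, frame_bytes(FRAME_WIDTH, FRAME_HEIGHT, BYTES_PER_PIXEL) gives 8,294,400 bytes per image, so the two ping-pong PBOs together hold about 16.6 MB, and a hypothetical 60 images per second would amount to roughly 498 MB/s of memcpy traffic into the mapped buffers.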
What am I missing? I'm running this on an NVIDIA GTX680 with the 332.21 driver (Windows 7 x64).