Wow, just wow! That is an amazing trick/hack using GL_STATIC_COPY and glCopyBufferSubData.
It made a huge difference to the pixel readback performance of my program.
I think someone should put this trick into the VBO wiki.
I guess that by pairing a GPU-side buffer with a CPU-side buffer and copying between them with glCopyBufferSubData, you are emulating what CUDA does with cudaMemcpy into pinned (page-locked) host memory, which would explain why it is fast.
I guess that on a Quadro the driver does this pinned-memory copy for you, but the trick seems to work nicely on GeForce too.
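
For anyone skimming the thread, here is a rough sketch of how I understand the two-buffer readback. It is my reading of the trick, not l_hrabcak's actual code. To keep it compilable without a GL loader, the GL entry points are passed in through a made-up `GlFns` table whose members mirror glBindBuffer, glBufferData, glReadPixels, glCopyBufferSubData, glMapBuffer and glUnmapBuffer; in a real program you would call those functions directly. The function name, the GL_STREAM_READ hint on the second buffer, and the 8-bit BGRA format are my assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Enum values copied from the OpenGL headers so the sketch stands alone.
// In a real program, include your GL loader header and drop this block.
enum : unsigned {
    GL_PIXEL_PACK_BUFFER = 0x88EB,
    GL_COPY_READ_BUFFER  = 0x8F36,
    GL_COPY_WRITE_BUFFER = 0x8F37,
    GL_STATIC_COPY       = 0x88E6,
    GL_STREAM_READ       = 0x88E1,
    GL_READ_ONLY         = 0x88B8,
    GL_BGRA              = 0x80E1,
    GL_UNSIGNED_BYTE     = 0x1401
};

// Hypothetical table of GL entry points (not part of OpenGL itself).
struct GlFns {
    void  (*BindBuffer)(unsigned target, unsigned buffer);
    void  (*BufferData)(unsigned target, std::ptrdiff_t size,
                        const void* data, unsigned usage);
    void  (*ReadPixels)(int x, int y, int w, int h,
                        unsigned format, unsigned type, void* pixels);
    void  (*CopyBufferSubData)(unsigned readTarget, unsigned writeTarget,
                               std::ptrdiff_t readOffset,
                               std::ptrdiff_t writeOffset,
                               std::ptrdiff_t size);
    void* (*MapBuffer)(unsigned target, unsigned access);
    unsigned char (*UnmapBuffer)(unsigned target);
};

// gpu_pbo and cpu_pbo are buffer names already created with glGenBuffers.
// Returns true if the pixels were copied into dst.
bool read_pixels_two_stage(const GlFns& gl, unsigned gpu_pbo, unsigned cpu_pbo,
                           int width, int height, void* dst) {
    assert(dst != nullptr && width > 0 && height > 0);
    const std::ptrdiff_t bytes = std::ptrdiff_t(width) * height * 4;  // BGRA8

    // Stage 1: glReadPixels into a buffer hinted to stay in video memory.
    // (Storage is reallocated on every call for brevity; allocate once in real code.)
    gl.BindBuffer(GL_PIXEL_PACK_BUFFER, gpu_pbo);
    gl.BufferData(GL_PIXEL_PACK_BUFFER, bytes, nullptr, GL_STATIC_COPY);
    // With a pack buffer bound, the last argument is a byte offset into it.
    gl.ReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
    gl.BindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    // Stage 2: buffer-to-buffer copy into a buffer hinted for CPU reads.
    gl.BindBuffer(GL_COPY_READ_BUFFER, gpu_pbo);
    gl.BindBuffer(GL_COPY_WRITE_BUFFER, cpu_pbo);
    gl.BufferData(GL_COPY_WRITE_BUFFER, bytes, nullptr, GL_STREAM_READ);
    gl.CopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, bytes);

    // Stage 3: map the CPU-side buffer and copy the pixels out.
    void* src = gl.MapBuffer(GL_COPY_WRITE_BUFFER, GL_READ_ONLY);
    if (!src) return false;
    std::memcpy(dst, src, static_cast<std::size_t>(bytes));
    return gl.UnmapBuffer(GL_COPY_WRITE_BUFFER) != 0;
}
```

Whether the driver really keeps the first buffer in video memory and the second in pinned system memory depends on how it interprets the usage hints, so treat the comments about placement as a guess to be measured, not a guarantee.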
One more thing to test, l_hrabcak: how fast is it if you glReadPixels into a PBO and then use CUDA/OpenGL interop to map that buffer and copy it to the CPU with cudaMemcpy?