Wow, just wow! That is an amazing trick/hack using GL_STATIC_COPY and glCopyBufferSubData.

It made a huge difference to the pixel readback performance of my program.

I think someone should put this trick into the VBO wiki.

I guess that by pairing a GPU-side buffer and a CPU-side buffer with glCopyBufferSubData you are emulating the standard CUDA memcpy situation (device memory copied into pinned host memory), which would explain why it is fast.

I guess that on a Quadro (or a similar workstation card) the driver does this pinned memcpy for you, but the trick seems to work nicely on GeForce.

One more thing to test, l_hrabcak: what is the performance of glReadPixels into a PBO, followed by CUDA/GL interop to copy the result to the CPU with cudaMemcpy?