How to improve data transfer performance for GLSL when glGetTexImage is currently used

10-26-2015, 01:11 AM
Hi, everyone!

In my application (a GLSL compute shader), glGetTexImage() is used to read an image back to CPU memory. The texture is 2K*2K RGBA with one byte per channel, and my graphics card is a GTX 970.

However, NVIDIA Nsight reports that glGetTexImage() takes about 30.46 ms. That is unacceptable for my application, where I need it to take less than 2 ms.

So, how can I improve the performance? Thanks!

Dark Photon
10-26-2015, 06:35 AM
Have you estimated what that readback requires? Have you verified that those requirements are met?

2K*2K*4 Bpp = 16 MB. You want that pulled down from a discrete GPU over the PCIe bus in 0.002 seconds. If the image isn't ready when you ask for it, part of that budget is spent waiting, so you'd need even more throughput. In fact, depending on how long your GPU kernel takes to complete, you may overrun your 2 ms budget just waiting for the kernel to finish. There are some techniques you can use to hide this latency if you don't absolutely need the latest image and can use the one rendered in the previous frame. But let's go with the assumption that you've already waited for your kernel to complete and you're ready to do the readback right now.

If the image were guaranteed to be ready on the GPU side, you'd need just under 8 GB/sec of throughput in practice (not theoretical): 2K*2K*4 bytes / 0.002 sec, which is about 7.8 GiB/sec (8.4 GB/sec in decimal units).
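The arithmetic above can be sketched in C as a quick sanity check. This is only an illustration: the function names `image_bytes` and `required_bytes_per_sec` are made up for this example, and it assumes "2K" means 2048 and a tightly packed image with no row padding.

```c
#include <assert.h>

/* Size in bytes of a tightly packed width x height image
   (assumption: no row padding or alignment). */
double image_bytes(unsigned width, unsigned height, unsigned bytes_per_pixel)
{
    return (double)width * (double)height * (double)bytes_per_pixel;
}

/* Sustained throughput, in bytes per second, needed to move `bytes`
   within a time budget of `seconds`. */
double required_bytes_per_sec(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds;
}
```

For a 2048 x 2048 RGBA8 texture, `image_bytes(2048, 2048, 4)` gives 16,777,216 bytes, and `required_bytes_per_sec` with a 0.002 sec budget gives roughly 8.39e9 bytes/sec, i.e. about 7.8 GiB/sec.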

What PCIe slot is that GTX 970 plugged into? If PCIe v2 x16, you've got 8 GB/sec of theoretical throughput (not what you'll see in practice), which leaves no headroom. If PCIe v3 x16, you've got about 15.75 GB/sec theoretical, so it's looking more plausible with that.
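A rough way to compare the bus against the 2 ms budget, sketched in C. The function names are hypothetical. The line rates and encodings are assumptions taken from the published PCIe generations (5 GT/s with 8b/10b for v2, 8 GT/s with 128b/130b for v3), and the `efficiency` factor is a placeholder you would have to measure on your own system.

```c
#include <assert.h>

/* Theoretical one-direction PCIe rate in bytes/sec, before protocol overhead.
   gt_per_sec:  raw line rate per lane (one bit per transfer)
   enc_payload / enc_total: line-encoding ratio, e.g. 8/10 or 128/130 */
double pcie_peak_bytes_per_sec(double gt_per_sec, unsigned lanes,
                               unsigned enc_payload, unsigned enc_total)
{
    assert(enc_total > 0 && enc_payload <= enc_total);
    return gt_per_sec * lanes * ((double)enc_payload / enc_total) / 8.0;
}

/* Milliseconds needed to move `bytes` when only `efficiency` (0..1]
   of the theoretical peak is actually achieved. */
double transfer_ms(double bytes, double peak_bytes_per_sec, double efficiency)
{
    assert(peak_bytes_per_sec > 0.0 && efficiency > 0.0 && efficiency <= 1.0);
    return bytes / (peak_bytes_per_sec * efficiency) * 1000.0;
}
```

With these assumptions a v2 x16 link comes out at 8 GB/sec, so a 16 MiB image takes about 2.1 ms even at the full theoretical rate, already over budget. A v3 x16 link comes out at about 15.75 GB/sec, or roughly 1.07 ms at the theoretical peak, which leaves some room for real-world overhead.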

Beyond basic hardware capability issues, IIRC NVIDIA arbitrarily limits readback performance on their GeForce products, at least with glReadPixels. There was a trick posted a while back to get around this limitation in the driver. Maybe someone can help me out here with what it was. But IIRC instead of reading back directly or through one PBO, you read back to a PBO, then do a PBO-to-PBO copy, then read back from the 2nd PBO. Something like that. Adding this extra copy yielded an incredible reduction in the total readback time, which suggests that some internal speed governor in the driver was being bypassed.