Why is GPU-CPU transfer slow?

Everywhere I read that GPU-to-CPU transfers are horribly slow. Now I’ve done my fair share of GPU programming, so I know this is true, but I’m wondering why.

At first I thought it had to do with the bandwidth between the GPU and CPU. I guess bandwidth plays a role in some situations, but I ran into a particular case where it shouldn’t matter at all. I was doing a number of calculations on textures, mapping them to other textures using FBOs. After each complete calculation pass was done, I would read back 1 pixel that contained my answer. Reading 1 pixel hardly fills the bandwidth I have, and yet when I instead wrote that pixel into a different texture and read back the complete texture after a number of passes, it was faster.

My other idea was that it had something to do with the asynchronous behavior of GPU-CPU execution. Since calls to glReadPixels are synchronous, one of the two has to wait. So let’s say the CPU has to wait for the GPU because it isn’t done yet. But then the CPU would still have to wait for the GPU after 10 passes, or any number of passes, since the GPU doesn’t magically calculate faster when you give it a couple more calculations.

But where does the speedup then come from?

(I hope everything makes sense btw :slight_smile: )

As you said, the issue is not raw bandwidth (we have lots of that), but latency.

A GPU -> CPU readback introduces a “sync point” where the CPU must wait for the GPU to complete its calculations. During this time, the CPU stops feeding the GPU with data, causing it to stall.

Now, remember that a modern GPU is designed in a highly parallel manner, with thousands of threads in flight at any given moment. The sync point must wait for all of those threads to finish processing before it can read back the result of their calculations. Once the readback is complete, all of those threads must restart execution from zero… bad!
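
To make the sync point concrete, here is a minimal sketch of the blocking case (resultFbo is just an illustrative name for an FBO whose color attachment holds the answer, assuming GL 3.0-style FBO binding):

```c
/* Blocking readback -- a minimal sketch, names are illustrative.
 * glReadPixels cannot return until every queued command that writes
 * to the current read buffer has finished on the GPU, so the CPU sits
 * idle here while the GPU drains its pipeline. */
GLfloat result[4];
glBindFramebuffer(GL_READ_FRAMEBUFFER, resultFbo);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glReadPixels(0, 0, 1, 1, GL_RGBA, GL_FLOAT, result);  /* <-- sync point */
```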

Reading back the results asynchronously (after a few frames) allows the GPU to continue execution without its threads starving (the stop-and-resume issue outlined above). This improves performance tremendously - the more parallel the GPU, the bigger the improvement.

It really has nothing to do with GPU-internal parallelism. Instead it’s all about buffering commands in a command queue between the CPU and GPU. OpenGL drivers buffer up commands, sometimes for several frames, as typical applications submit their draw calls in bursts, and frame time may vary. Buffering commands is necessary to avoid stalling either CPU or GPU.

However, glReadPixels requires the OpenGL driver to finish processing all commands that affect the current read buffer (unless you do an asynchronous copy to a PBO). Thus the more time you leave between the last call that affects the read buffer and glReadPixels, the more likely it is that the GPU has already finished rendering to that buffer.
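
To illustrate that PBO path, a rough sketch (assuming GL 2.1+ pixel buffer objects; the variable names and the one-frame gap are just for illustration):

```c
/* Asynchronous readback via a pixel buffer object (PBO). With a buffer
 * bound to GL_PIXEL_PACK_BUFFER, glReadPixels only schedules a GPU-side
 * copy into the PBO and returns immediately instead of stalling the CPU. */
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, 4 * sizeof(GLfloat), NULL, GL_STREAM_READ);

/* Frame N: start the copy -- returns without waiting for the GPU. */
glReadPixels(0, 0, 1, 1, GL_RGBA, GL_FLOAT, (void*)0);  /* 0 = offset into the PBO */

/* A frame or two later: fetch the result. If the copy has finished by
 * now, mapping the buffer does not stall either. */
GLfloat* data = (GLfloat*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (data) {
    /* use data[0..3] */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```

The potentially blocking step moves from glReadPixels to glMapBuffer, which is cheap if enough time has passed for the copy to complete.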

Well, beyond the stalling issue, the transfer itself IS slower in the GPU->CPU direction than the other way around. I hope this improves in the future.

What exactly are you expecting here? GPUs are designed to render images. This process doesn’t involve CPU readback. Or at least, it doesn’t have to. So it would be expected that upload would be prioritized over download.

The large command buffer and the GPU’s internal parallelism are two sides of the same coin, though. You cannot have one without the other!

The situation is roughly analogous to a CPU instruction pipeline: the more parallel the architecture, the deeper the pipeline needs to be (== the command buffer), and the higher the cost of a pipeline flush (== glReadPixels).

GPUs have highly parallel architectures to improve throughput. The more parallel the design, the deeper the command buffer needs to be to keep the GPU well-fed - and the higher the cost of a stall.

Anyone have any good rules of thumb for the optimal readback latency? My guess is that 1 or 2 frames should be enough to ensure the data can be downloaded without stalling, but I haven’t measured this.
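
One way to avoid guessing at a fixed number of frames is to pair each readback with a fence and only map the PBO once the fence reports the copy as finished. A sketch, assuming GL 3.2+ sync objects and a two-PBO ring (all names and the ring size are illustrative, not a measured recommendation):

```c
/* Double-buffered PBO readback gated by sync objects. */
#define NUM_PBOS 2

static GLuint pbos[NUM_PBOS];
static GLsync fences[NUM_PBOS];

void init_readback(void)
{
    glGenBuffers(NUM_PBOS, pbos);
    for (int i = 0; i < NUM_PBOS; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, 4 * sizeof(GLfloat), NULL, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

/* Call once per frame, after rendering the result pixel. */
void readback_frame(int frame)
{
    int write_idx = frame % NUM_PBOS;        /* start a new copy into this PBO       */
    int read_idx  = (frame + 1) % NUM_PBOS;  /* copy started NUM_PBOS - 1 frames ago */

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[write_idx]);
    glReadPixels(0, 0, 1, 1, GL_RGBA, GL_FLOAT, (void*)0);
    fences[write_idx] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    if (fences[read_idx]) {
        GLint status = 0;
        glGetSynciv(fences[read_idx], GL_SYNC_STATUS, sizeof(status), NULL, &status);
        if (status == GL_SIGNALED) {         /* copy has finished: mapping won't stall */
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[read_idx]);
            GLfloat* data = (GLfloat*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
            if (data) {
                /* consume data[0..3] here */
                glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
            }
            glDeleteSync(fences[read_idx]);
            fences[read_idx] = 0;
        }
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
```

With two buffers this gives one frame of latency; adding buffers to the ring trades more latency for a smaller chance of ever touching an unfinished copy.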

Thanks for the information people. It’s a lot clearer now :slight_smile:

Actually, you can. GPU-internal parallelism is just a way to increase performance, like increasing clock speed. Some GPUs only have a single pipeline (think embedded), yet their drivers still buffer commands for one frame or more.

The need for buffering commands does not arise from GPU-internal parallelism, but from GPU-CPU parallelism and the fact that applications don’t submit render commands in a steady flow (and each command can cause a highly variable workload).