I’m experimenting with my GPGPU framework so that it load-balances a problem across the CPUs as well as the GPU. Each processor (or core of a processor, in this case) works on a division of the data set, with the GPU usually taking a larger chunk. I’m testing on a dual-core AMD X2 4400 and a GeForce 7900 GTX, although I’ve previously tried this on a dual Xeon 3.0GHz with a GeForce 6800 GT as well.
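To make the division concrete, here is a minimal sketch of a weighted split, not the framework's actual code. The helper name `partition_by_weight` and the example weights are hypothetical; it assumes non-negative weights, with the first entry standing for the GPU and one further entry per CPU core.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: split `total` items into contiguous chunks whose sizes
// are roughly proportional to `weights` (first weight = GPU, then one per core).
// Assumes all weights are non-negative.
std::vector<std::size_t> partition_by_weight(std::size_t total,
                                             const std::vector<double>& weights) {
    std::vector<std::size_t> sizes(weights.size(), 0);
    if (weights.empty()) return sizes;

    const double sum = std::accumulate(weights.begin(), weights.end(), 0.0);
    if (sum <= 0.0) {          // degenerate weights: give everything to chunk 0
        sizes[0] = total;
        return sizes;
    }

    std::size_t begin = 0;     // start index of the current chunk
    double cumulative = 0.0;   // running sum of weights seen so far
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cumulative += weights[i];
        // The last chunk always ends at `total`, so no item is lost to rounding.
        std::size_t end = (i + 1 == weights.size())
            ? total
            : std::min(total, static_cast<std::size_t>(total * (cumulative / sum)));
        if (end < begin) end = begin;
        sizes[i] = end - begin;
        begin = end;
    }
    return sizes;
}
```

With weights `{2, 1, 1}` and 1000 items, for example, the GPU gets half of the data set and each core a quarter.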
If I run the GPU alone on its chunk of the data set, it’s nice and fast. If I throw one core into the mix, the GPU computation time rises by a negligible amount. But when I throw the second core in as well, the GPU computation time skyrockets from about 450 ms to over 3000 ms. In both cases, the CPU computation time seems almost unaffected (which weakens my suspicions about memory bandwidth/cache saturation).
The GPU computation is a synchronous OpenGL render in the main thread, and there is one additional thread for each core. I can understand how saturating both cores reduces the CPU time available to the OpenGL thread, but why does it suffer so badly? That thread consists of only a few function calls, and the majority of its time should be spent blocked in glTexImage2D/glReadPixels.
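The thread layout, reduced to a sketch: this is not the real framework, and `process_chunk`, `RunResult` and `run_with_workers` are hypothetical names. The `main_work` callable is only a placeholder for the synchronous upload/render/readback, so the sketch shows where the main thread's time is measured rather than reproducing the slowdown itself.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Stand-in for one core's share of the work: sum a slice of the data.
double process_chunk(const std::vector<float>& data, std::size_t begin, std::size_t end) {
    return std::accumulate(data.begin() + static_cast<std::ptrdiff_t>(begin),
                           data.begin() + static_cast<std::ptrdiff_t>(end), 0.0);
}

struct RunResult {
    std::vector<double> cpu_sums;  // one partial result per worker thread
    double main_thread_ms;         // wall time of the main thread's own section
};

// Start one worker thread per core, then time the main thread's section,
// which is where the synchronous OpenGL calls would go.
template <typename MainWork>
RunResult run_with_workers(const std::vector<float>& data, unsigned workers,
                           MainWork main_work) {
    RunResult r{std::vector<double>(workers, 0.0), 0.0};
    std::vector<std::thread> pool;
    const std::size_t chunk = workers ? data.size() / workers : 0;

    for (unsigned i = 0; i < workers; ++i) {
        const std::size_t b = i * chunk;
        // The last worker also takes any leftover elements.
        const std::size_t e = (i + 1 == workers) ? data.size() : b + chunk;
        pool.emplace_back([&r, &data, i, b, e] {
            r.cpu_sums[i] = process_chunk(data, b, e);
        });
    }

    const auto t0 = std::chrono::steady_clock::now();
    main_work();  // placeholder for the texture upload, render and readback
    const auto t1 = std::chrono::steady_clock::now();

    for (auto& t : pool) t.join();
    r.main_thread_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return r;
}
```

Calling it with 0, 1 and 2 workers and comparing `main_thread_ms` corresponds to the three cases I timed.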
I hacked up an asynchronous implementation earlier with PBOs, just to see if it would make a difference, but it didn’t seem very asynchronous: the thread still blocked while initiating the texture uploads and computation, and again while reading the result back. I didn’t really gain much at all.
Any ideas?