Slow transfer speed on Fermi cards

I am getting very slow transfer speeds on the newest NVIDIA cards (GTX470/GTX480). glReadPixels() and texture upload speeds are about 55% slower than on previous-generation cards (GTX280/GTX285).

I tried different drivers, of course… same results.
The hardware specs predict better results (a narrower memory bus, but higher memory clocks).

Can anyone report different results?

What platform?

Windows XP, drivers 197.x and 257.21.
Tested on different mainboards, etc.

I have the same problem on most Windows versions (XP through Windows 7, 32- and 64-bit) with NVIDIA drivers 197 and 257.

Are you seeing this on both your GTX 480 and QuadroFX 5800?

We should post a short test prog for folks to run and compare readback rates between cards.

You guys have to come up with something more solid than “I got a 55% performance drop”. I ain't saying it's not true, but you ain't giving any of us very much to work with.

Hello,

I was able to build a small GLUT benchmark tool to show the differences in transfer speed on GTX480 cards.
The app uses 720p HD resolution and RGBA data.
I added synchronous PBO transfers and, surprisingly, got very good results without using multiple buffers, etc.
The standard glReadPixels function seems to be the slow part on GTX470/GTX480…

Some results:

GeForce GTX480 driver 258.96

glReadPixels: 8.10 ms
PBO glReadPixels: 2.64 ms (memcpy 0.55 ms) total: 3.15 ms
glTexSubImage2D: 1.25 ms
PBO glTexSubImage2D: 0.02 ms (memcpy 0.56 ms) total: 0.58 ms
glCopyTexSubImage2D: 0.04 ms

GeForce GTX280 driver 197.45

glReadPixels: 2.90 ms
PBO glReadPixels: 2.60 ms (memcpy 1.17 ms) total: 3.77 ms
glTexSubImage2D: 2.90 ms
PBO glTexSubImage2D: 0.07 ms (memcpy 1.20 ms) total: 1.27 ms
glCopyTexSubImage2D: 0.03 ms

The two tests were run on different platforms, hence the difference in memcpy speed. I will run a test with different cards on the same PC soon.

Feel free to give the benchmark a try and post your results with GTX480. (I used maximum quality settings in the driver)
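
For anyone who wants to turn these timings into bandwidth figures, here is a minimal sketch. It assumes that 720p means 1280x720, that RGBA means 4 bytes per pixel (8 bits per channel), and that 1 MB is 10^6 bytes. The helper names frame_bytes and mbytes_per_sec are made up for illustration and are not part of the benchmark.

```c
#include <stddef.h>

/* Size of one frame in bytes: width * height * bytes per pixel. */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Effective transfer rate in MB/s (assumption: 1 MB = 10^6 bytes)
   for `bytes` moved in `ms` milliseconds. */
static double mbytes_per_sec(size_t bytes, double ms)
{
    return ((double)bytes / 1.0e6) / (ms / 1000.0);
}
```

Under those assumptions a frame is 3,686,400 bytes, so the 8.10 ms glReadPixels on the GTX480 works out to roughly 455 MB/s, against roughly 1271 MB/s for the 2.90 ms measured on the GTX280.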

How about posting the source code too, so folks can run this exact same test on Linux with GTX480s and GTX285/280s?

Thanks.

Here is a cleaned-up version including source. It now uses GLEW.
Since I am using QueryPerformanceCounter() for timing calculations, it’s not completely portable yet.
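
One possible way to make the timing portable is sketched below: keep QueryPerformanceCounter() on Windows and fall back to gettimeofday() elsewhere. The name get_time_ms is hypothetical and is not the function used in the benchmark source. Note that gettimeofday() reports wall-clock time with microsecond resolution, so treat it as a rough substitute only.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Milliseconds derived from the Windows high-resolution counter. */
static double get_time_ms(void)
{
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);
    return (double)now.QuadPart * 1000.0 / (double)freq.QuadPart;
}
#else
#include <sys/time.h>

/* Milliseconds derived from the POSIX wall clock. */
static double get_time_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec * 1000.0 + (double)tv.tv_usec / 1000.0;
}
#endif
```

A measurement is then just the difference between two get_time_ms() calls taken around the transfer.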

For reference, here is my result with a Radeon HD5870 on Linux:


glReadPixels: 2.83 ms
PBO glReadPixels: 1.08 ms (memcpy: 1.68 ms) total: 2.76 ms
glTexSubImage2D: 1.23 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 1.23 ms) total: 1.28 ms
glCopyTexSubImage2D: 0.05 ms
glGetTexImage: 2.72 ms

memcpy speed: 2191 MBytes/sec

Total frame: 16.87 ms  (total transfer: 9.65 ms)


If people want the modified version of the source code that runs on Linux, I will post it.

Here’s a port that should run on both Linux and MSWin. Nice little test program!

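Results: confirmed on Linux that there appears to be a major performance slowdown here (plain glReadPixels takes 14.57 ms on the GTX480 versus 3.86 ms on the GTX285 with the 2GHz CPU). Throwing more CPU at it helps a bit, but falls far short of bringing a GTX480 up to GTX285 perf.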

Do you have an NVDeveloper account? If so, please kick in a bug report if you haven’t already.

GTX285, 256.35 drivers, 2GHz Nehalem EP CPU:

glReadPixels: 3.86 ms
PBO glReadPixels: 3.54 ms (memcpy: 1.24 ms) total: 4.78 ms
glTexSubImage2D: 4.99 ms
PBO glTexSubImage2D: 0.09 ms (memcpy: 1.43 ms) total: 1.52 ms
glCopyTexSubImage2D: 0.02 ms
glGetTexImage: 11.97 ms

memcpy speed: 2984 MBytes/sec

Total frame: 34.90 ms (total transfer: 22.15 ms)

GTX480, 256.35 drivers, 2GHz Nehalem EP CPU:

glReadPixels: 14.57 ms
PBO glReadPixels: 3.25 ms (memcpy: 1.48 ms) total: 4.72 ms
glTexSubImage2D: 2.68 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 1.43 ms) total: 1.49 ms
glCopyTexSubImage2D: 0.05 ms
glGetTexImage: 9.94 ms

memcpy speed: 2494 MBytes/sec

Total frame: 38.26 ms (total transfer: 30.77 ms)

GTX480, 256.35 drivers, 3.2GHz Nehalem CPU:

glReadPixels: 8.32 ms
PBO glReadPixels: 2.42 ms (memcpy: 0.71 ms) total: 3.13 ms
glTexSubImage2D: 1.24 ms
PBO glTexSubImage2D: 0.03 ms (memcpy: 0.61 ms) total: 0.64 ms
glCopyTexSubImage2D: 0.03 ms
glGetTexImage: 3.92 ms

memcpy speed: 5214 MBytes/sec

Total frame: 20.19 ms (total transfer: 16.05 ms)

Are any NVIDIA driver engineers reading this post? I upgraded from the GTX 275 to the GTX 465 and not only am I experiencing the same readback slowdowns discussed here, but slowdowns across the board. For instance, for a multi-view rendering application I’m writing, the GTX 275 could render 4032 views at 15fps. The GTX 465 only gets 4fps.

@trinitrotoluene:
Thanks for the HD5870 bench data.
Very interesting to do this direct comparison.
What kind of quality/AA settings did you use in the driver?

Your results look like the optimal stats of my GTX280 and GTX480 combined.

Do you have an NVDeveloper account? If so, please kick in a bug report if you haven’t already.

No, I don’t. I would be glad if somebody else with an account could do this. Thanks.

Done.

And since I just happen to have the stats handy:

GTX285, 256.38.02 drivers, 2.6GHz Nehalem CPU:

glReadPixels: 2.37 ms
PBO glReadPixels: 2.32 ms (memcpy: 0.58 ms) total: 2.90 ms
glTexSubImage2D: 2.70 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 0.60 ms) total: 0.65 ms
glCopyTexSubImage2D: 0.01 ms
glGetTexImage: 5.50 ms

memcpy speed: 6332 MBytes/sec

Total frame: 18.37 ms (total transfer: 11.44 ms)

For AA settings, the quality is super-sample, the level is 8x, and the filter is edge detect (24x). I noticed that my memcpy speed is slower than others'. I have a Phenom II X6 1090T on 64-bit Ubuntu. With the libc memcpy I get a transfer rate of ~2100 MB/s. When I use this:


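#include <stddef.h>

/* Replacement for the libc memcpy: a byte-wise copy that OpenMP splits
   across cores when built with -fopenmp (the pragma is ignored otherwise).
   The parentheses around the name suppress any macro version of memcpy.
   Caveat: newer GCC versions may turn this loop back into a memcpy call
   when optimizing; -fno-tree-loop-distribute-patterns should avoid that. */
void *(memcpy)(void *b, const void *a, size_t n)
{
    size_t i;
    char *s1 = (char *)b;
    const char *s2 = (const char *)a;

    /* The regions must not overlap; each thread copies a contiguous chunk. */
    #pragma omp parallel for shared(s1, s2, n) private(i) schedule(static)
    for (i = 0; i < n; i++) {
        s1[i] = s2[i];
    }

    return b;
}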

I get a transfer rate of ~2700 MB/s, all with the GCC compiler flags -O3 -msse4a -march=amdfam10. When I enable OpenMP (-fopenmp) I get a transfer rate of ~3700 MB/s, but now the rendering has some glitches. Maybe it's because all 6 of my cores are working 100% of the time, so the driver does not get much CPU time to push draw calls to the video card.

As Dark Photon wrote earlier, this is a nice test program. I will profile the program to see whether speeding up the memcpy call further matters.
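
For comparing memcpy variants outside the benchmark, a rough standalone measurement could look like the sketch below. This is not the code the benchmark uses to print its "memcpy speed" line; the names now_ms and memcpy_mbytes_per_sec are hypothetical, the timer is POSIX gettimeofday(), and 1 MB is taken as 10^6 bytes.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Wall-clock time in milliseconds (POSIX gettimeofday). */
static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec * 1000.0 + (double)tv.tv_usec / 1000.0;
}

/* Copy `bytes` from src to dst `reps` times and return the average rate
   in MB/s (assumption: 1 MB = 10^6 bytes). Returns -1.0 on bad input or
   if the timer was too coarse to measure the run. */
static double memcpy_mbytes_per_sec(void *dst, const void *src,
                                    size_t bytes, int reps)
{
    double t0, t1;
    int r;

    if (bytes == 0 || reps <= 0)
        return -1.0;

    t0 = now_ms();
    for (r = 0; r < reps; r++)
        memcpy(dst, src, bytes);
    t1 = now_ms();

    if (t1 <= t0)
        return -1.0;
    return ((double)bytes * (double)reps / 1.0e6) / ((t1 - t0) / 1000.0);
}
```

Calling it with two frame-sized buffers (for example 1280 * 720 * 4 bytes) and a few dozen repetitions should give a figure comparable in spirit to the numbers quoted in this thread, although caches and buffer alignment can shift the result.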

Built on Mac OS X 10.6.4
Snow Leopard Graphics Update beta (1.6.18.16 19.5.9f02)
GTX 275, Nehalem

Build command: g++ transferBench.cpp -framework OpenGL -framework GLUT -lGLEW
To build, I changed time.h to sys/time.h and used gettimeofday.
Status: Using GLEW 1.5.4

glReadPixels: 6.28 ms
PBO glReadPixels: 6.72 ms (memcpy: 0.67 ms) total: 7.40 ms
glTexSubImage2D: 0.92 ms
PBO glTexSubImage2D: 0.11 ms (memcpy: 0.82 ms) total: 0.94 ms
glCopyTexSubImage2D: 0.08 ms
glGetTexImage: 5.31 ms

memcpy speed: 5486 MBytes/sec

Total frame: 25.16 ms (total transfer: 20.00 ms)

The benchmark code for glReadPixels is not correct.

You should call glFinish before starting the timer (line 212).
Internally, glReadPixels has to wait until all the geometry is rendered. Only then does it start the transfer.

There is also a small problem with glTexSubImage2D with a PBO.
The texture is not actually loaded when glTexSubImage2D returns. The call just starts the DMA to GPU memory, and the transfer happens in the background.

I’d recommend calling glFinish() before every getTime().
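
The pattern I have in mind can be sketched roughly as follows. To keep it compilable without a GL context, the GL calls are passed in as callbacks; fake_finish and fake_readback are stand-ins I made up for glFinish() and the transfer being timed, and time_synced_ms and now_ms are hypothetical names, not functions from the benchmark.

```c
#include <stddef.h>
#include <sys/time.h>

typedef void (*void_fn)(void);

/* Wall-clock time in milliseconds (POSIX gettimeofday). */
static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec * 1000.0 + (double)tv.tv_usec / 1000.0;
}

/* Time op() in milliseconds. sync() runs once before the clock starts,
   so earlier queued work is not billed to op(), and once before the
   clock stops, so work that op() only queued is included. */
static double time_synced_ms(void_fn sync, void_fn op)
{
    double t0;

    sync();
    t0 = now_ms();
    op();
    sync();
    return now_ms() - t0;
}

/* Stand-ins that only record the call order. */
static char call_log[8];
static size_t call_count;

static void log_call(char tag)
{
    if (call_count + 1 < sizeof call_log) {
        call_log[call_count++] = tag;
        call_log[call_count] = '\0';
    }
}

static void fake_finish(void)   { log_call('F'); } /* where glFinish() would go */
static void fake_readback(void) { log_call('R'); } /* where glReadPixels() would go */
```

In the real benchmark the two callbacks would be small wrappers around glFinish() and the transfer call being measured, so each reported time covers that transfer and nothing queued before it.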

For the base glReadPixels it should be fine. glReadPixels must block until rendering completes, all the data is back, and it has been copied into the destination array at the address indicated by the “pixels” parameter. Timing is sampled after glReadPixels.

For the PBO glReadPixels, while the glReadPixels may pipeline, the glMapBuffer will block until rendering completes, all the data is in the buffer, and it has been transferred into a mappable CPU block. Timing is sampled after the glMapBuffer. So that should be fine as well.

The only thing I see wrong is the glFlush: it is superfluous in both cases. Removing it does not affect the performance of either case here.