Readback speeds of newest cards?

I am interested in high performance readback for off-screen rendering and for some FFT on GPU that I will need soon.
My 4850 gets a maximum ~1.2GB/s readback (only ~950 MB/s with large blocks) when tested with PCIe SpeedTest, and ~4.4GB/s for upload.

I found a post on the AMD Developer forums that complained about the low readback speed of a 5870; they only get ~650 MB/s.
Link: http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=130923&enterthread=y

I’d like to know which new cards have at least the readback bandwidth of my 4850, so I can recommend them to my users.
Or when will the 5xxx series receive the readback bandwidth they deserve and need for OpenCL performance?

Here are some numbers, tested under Windows 7 in OpenGL with 1 MB transfer sizes (512x512x4):

Radeon 5850
cpu -> gpu, glDrawPixels, GL_RGB : 0.73s, average: 0.36ms, 2111mb/s
cpu -> gpu, glDrawPixels, GL_RGBA : 0.90s, average: 0.44ms, 2276mb/s
cpu -> gpu, glDrawPixels, GL_BGR_EXT : 0.71s, average: 0.35ms, 2173mb/s
cpu -> gpu, glDrawPixels, GL_BGRA_EXT : 0.90s, average: 0.44ms, 2278mb/s

cpu -> gpu, glTexImage2D, GL_RGB : 1.45s, average: 0.71ms, 1060mb/s
cpu -> gpu, glTexImage2D, GL_RGBA : 1.61s, average: 0.79ms, 1268mb/s
cpu -> gpu, glTexImage2D, GL_BGR_EXT : 1.43s, average: 0.70ms, 1076mb/s
cpu -> gpu, glTexImage2D, GL_BGRA_EXT : 1.78s, average: 0.87ms, 1153mb/s

gpu -> cpu, glReadPixels, GL_RGB : 2.07s, average: 1.01ms, 744mb/s
gpu -> cpu, glReadPixels, GL_RGBA : 1.43s, average: 0.70ms, 1427mb/s
gpu -> cpu, glReadPixels, GL_BGR_EXT : 14.87s, average: 7.26ms, 103mb/s
gpu -> cpu, glReadPixels, GL_BGRA_EXT : 1.44s, average: 0.70ms, 1425mb/s

Geforce GTX 275
cpu -> gpu, glDrawPixels, GL_RGB : 2.13s, average: 1.04ms, 722mb/s
cpu -> gpu, glDrawPixels, GL_RGBA : 1.53s, average: 0.75ms, 1337mb/s
cpu -> gpu, glDrawPixels, GL_BGR_EXT : 1.70s, average: 0.83ms, 901mb/s
cpu -> gpu, glDrawPixels, GL_BGRA_EXT : 1.54s, average: 0.75ms, 1330mb/s

cpu -> gpu, glTexImage2D, GL_RGB : 2.00s, average: 0.98ms, 767mb/s
cpu -> gpu, glTexImage2D, GL_RGBA : 1.94s, average: 0.95ms, 1058mb/s
cpu -> gpu, glTexImage2D, GL_BGR_EXT : 1.60s, average: 0.78ms, 960mb/s
cpu -> gpu, glTexImage2D, GL_BGRA_EXT : 1.13s, average: 0.55ms, 1810mb/s

gpu -> cpu, glReadPixels, GL_RGB : 2.17s, average: 1.06ms, 709mb/s
gpu -> cpu, glReadPixels, GL_RGBA : 2.20s, average: 1.07ms, 931mb/s
gpu -> cpu, glReadPixels, GL_BGR_EXT : 2.04s, average: 1.00ms, 753mb/s
gpu -> cpu, glReadPixels, GL_BGRA_EXT : 1.44s, average: 0.70ms, 1423mb/s

Quadro FX 5800
cpu -> gpu, glDrawPixels, GL_RGB : 3.46s, average: 1.69ms, 444mb/s
cpu -> gpu, glDrawPixels, GL_RGBA : 2.70s, average: 1.32ms, 758mb/s
cpu -> gpu, glDrawPixels, GL_BGR_EXT : 2.93s, average: 1.43ms, 525mb/s
cpu -> gpu, glDrawPixels, GL_BGRA_EXT : 2.76s, average: 1.35ms, 741mb/s

cpu -> gpu, glTexImage2D, GL_RGB : 2.94s, average: 1.44ms, 522mb/s
cpu -> gpu, glTexImage2D, GL_RGBA : 2.70s, average: 1.32ms, 759mb/s
cpu -> gpu, glTexImage2D, GL_BGR_EXT : 2.18s, average: 1.06ms, 706mb/s
cpu -> gpu, glTexImage2D, GL_BGRA_EXT : 1.62s, average: 0.79ms, 1262mb/s

gpu -> cpu, glReadPixels, GL_RGB : 3.34s, average: 1.63ms, 460mb/s
gpu -> cpu, glReadPixels, GL_RGBA : 3.31s, average: 1.62ms, 619mb/s
gpu -> cpu, glReadPixels, GL_BGR_EXT : 3.12s, average: 1.52ms, 492mb/s
gpu -> cpu, glReadPixels, GL_BGRA_EXT : 2.93s, average: 1.43ms, 698mb/s

The first two machines have similar or the same hardware (X58 platform), while the last machine is older and slower (Core 2 Quad).
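
A minimal sketch of how the throughput column above can be reproduced from the average times. It assumes (this is not stated in the post) that the RGB and BGR formats move 3 bytes per pixel, that RGBA and BGRA move 4, and that "mb" means 1,048,576 bytes. The function name `throughput_mib_per_s` is made up for illustration.

```c
#include <assert.h>

/* Throughput in MiB/s (1 MiB = 1,048,576 bytes) for one transfer of
   width x height pixels that took avg_ms milliseconds on average. */
static double throughput_mib_per_s(int width, int height,
                                   int bytes_per_pixel, double avg_ms)
{
    assert(avg_ms > 0.0);
    double bytes = (double)width * (double)height * (double)bytes_per_pixel;
    double mib = bytes / (1024.0 * 1024.0);
    return mib / (avg_ms / 1000.0);
}
```

For the 5850 glReadPixels rows, `throughput_mib_per_s(512, 512, 4, 0.70)` lands near the reported 1427 and `throughput_mib_per_s(512, 512, 3, 1.01)` near the reported 744. The small differences are what you would expect if the printed averages are rounded to two decimals.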

How do you measure this? And are you using PBOs?

I’m measuring from the CPU’s perspective, using glFinish before and after. For texture uploads I force the upload to complete by issuing a small draw call using the texture, as well.
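
A rough sketch of a CPU-side timing loop of the kind described above. The GL calls are left out so the snippet stays self-contained: `transfer_fn`, `count_call`, `now_seconds` and `average_ms` are hypothetical names, and the callback is where glFinish, the transfer call, and the second glFinish would go. Whether the original measurements call glFinish around each transfer or around the whole loop is not stated. The clock is C11 `timespec_get`, which is wall-clock rather than monotonic, so treat it as an assumption that is good enough for runs of a second or two.

```c
#include <time.h>

/* Work to be timed. In a real test this would issue glFinish(), the
   transfer (glReadPixels, glTexImage2D, ...), then glFinish() again. */
typedef void (*transfer_fn)(void *user);

/* Placeholder transfer that only counts how often it was called. */
static void count_call(void *user)
{
    ++*(int *)user;
}

/* Wall-clock time in seconds, read from the C11 timespec_get() clock. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run `transfer` `iterations` times; return the average time per call in ms. */
static double average_ms(transfer_fn transfer, void *user, int iterations)
{
    double start = now_seconds();
    for (int i = 0; i < iterations; i++)
        transfer(user);
    double elapsed = now_seconds() - start;
    return elapsed * 1000.0 / (double)iterations;
}
```

Calling `average_ms(count_call, &n, 1000)` with `int n = 0` runs the placeholder 1000 times; swapping in a callback that performs the real transfer gives the per-transfer average in milliseconds.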

I’m not using PBOs in these measurements, though I have measured PBO times and they seem to have the same raw throughput. PBOs may be async, but they aren’t any faster… and in practice, on almost all GPUs, PBO copies happen serially on the GPU side, as opposed to overlapping with rendering. I’d suggest benchmarking transfers using PBOs vs “normal” transfers for your particular usage case, because PBOs may not always be a win, even when providing “enough” time for the PBO transfer to complete. Maybe the new GF100 handles PBOs better.

The CUDA/OpenCL/etc. benchmarks will certainly give more accurate results for available PCIe bandwidth, so these numbers are more concerned with how quickly you can move data around under OpenGL, not raw bandwidth.

AlexN, thank you for posting your measurements.

I took a few measurements of glReadPixels from RGBA renderbuffers, with sizes of 4MB, 16MB, 64MB and 256MB (where 1MB is 1,048,576 bytes), in Vista64 with drivers 10.3 and 32 bit code, using QueryPerformanceCounter for high precision timing.
Computed results are 611.183 MB/s for 4MB, 628.1 MB/s for 16MB, 631.318 MB/s for 64MB and 590.113 MB/s for 256MB (for the speeds, 1 MB is 1,000,000 bytes).
I noticed that reading from ALPHA8 renderbuffers is much slower, about 3x slower - in pixels - than from RGBA ones, so ~12x slower in bandwidth.
I also tried to set up (and read back from) RED or RG renderbuffers, but they don’t seem to work. Something like RGBA32F is accepted though, but useless for me.
I need the fastest single channel readback. If someone knows how to make RED renderbuffers work on 4850, please tell me.
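
The step from "3x slower in pixels" to "~12x slower in bandwidth" in the ALPHA8 comparison above is just pixel rate times bytes per pixel. A small sketch, with hypothetical helper names and pixel rates in arbitrary units:

```c
/* Bytes moved per second for a readback of `pixels_per_s` pixels per second
   at `bytes_per_pixel` bytes each. */
static double bytes_per_second(double pixels_per_s, int bytes_per_pixel)
{
    return pixels_per_s * (double)bytes_per_pixel;
}

/* How many times more bandwidth readback A achieves than readback B. */
static double bandwidth_ratio(double pixel_rate_a, int bpp_a,
                              double pixel_rate_b, int bpp_b)
{
    return bytes_per_second(pixel_rate_a, bpp_a) /
           bytes_per_second(pixel_rate_b, bpp_b);
}
```

With RGBA at 4 bytes per pixel and three times the pixel rate of ALPHA8 at 1 byte per pixel, `bandwidth_ratio(3.0, 4, 1.0, 1)` gives 12.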

AlexN, would you mind trying PCIe Speed Test and reporting back the results you would be getting with your 5850?
Link to AMD download page: http://developer.amd.com/GPU/ATISTREAMPOWERTOY/Pages/default.aspx

If someone knows how to make RED renderbuffers work on 4850, please tell me.

Have you informed ATI about the problem in their drivers?

No, I haven’t informed ATI yet. Should I? Maybe just a PM to an ATI staff member?

The problem is not uniform. For some formats that should be accepted by version 3.3 drivers, I’m getting error 1280 GL_INVALID_ENUM when calling glRenderbufferStorage and 1286 GL_INVALID_FRAMEBUFFER_OPERATION when calling glReadPixels.
For other formats I’m only getting 1282 GL_INVALID_OPERATION when calling glReadPixels, but glRenderbufferStorage goes through without reported errors.
I wanted to test all ~35 formats that are supposed to be valid for renderbuffers, but it would take too long, sorry.
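
When testing many formats, it can help to log error names rather than the decimal values glGetError returns. A minimal sketch: the numeric values are the standard OpenGL error codes, while the function name `gl_error_name` is made up here and is not part of OpenGL.

```c
/* Map a glGetError() return value to its symbolic name. */
static const char *gl_error_name(unsigned int code)
{
    switch (code) {
    case 0x0000: return "GL_NO_ERROR";
    case 0x0500: return "GL_INVALID_ENUM";                  /* 1280 */
    case 0x0501: return "GL_INVALID_VALUE";                 /* 1281 */
    case 0x0502: return "GL_INVALID_OPERATION";             /* 1282 */
    case 0x0503: return "GL_STACK_OVERFLOW";                /* 1283 */
    case 0x0504: return "GL_STACK_UNDERFLOW";               /* 1284 */
    case 0x0505: return "GL_OUT_OF_MEMORY";                 /* 1285 */
    case 0x0506: return "GL_INVALID_FRAMEBUFFER_OPERATION"; /* 1286 */
    default:     return "unknown GL error";
    }
}
```

A log line such as `fprintf(fp, "%u %s\n", err, gl_error_name(err))` then prints both the number and the name.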

For the 5850, under Win7 on the X58 platform, I get 4.15 GB/s upload, 750 MB/s peak readback (850 MB/s on the second GPU) and 550 MB/s average readback (larger block sizes).

Thank you for posting these measurements too. So with PCIe Speed Test my 4850 gets about 60% faster readback than your 5850.
My system is stock i7-920 with 6GB DDR3-1066 CAS7, Vista64 HP. Yours is probably a faster i7, so it’s quite strange.
Could be a bug under Windows 7 (and Linux too, see first post), but 64 bit drivers are the same for Vista and 7 AFAIK.
Even stranger is that your peak readback is only 850 MB/s with PCIe Speed Test, but you measured glReadPixels at 1427 MB/s.

hi all,

I would like to clarify some of your observations:

  • the pcie speed test is really using CAL to do the readback

  • there are several ways to read back from the ASIC with different trade-offs, and sometimes doing more work will give better performance (especially, scattered writes are bad for chipset and cpu memory controller)

  • the performance depends heavily on the chipset/cpu (more than the GPU)

  • in opengl, you get higher performance today on firepro GPU compared to radeon GPU

  • the performance depends as well on the OS

on the RED/RG issue, can you post your code ? those formats should be supported correctly in both renderbufferstorage and readpixels.

Test code:

// Create four FBOs and renderbuffers, 8x 4x 2x and 1x
i = 0 ;
tsize = 1024 ;
glGenFramebuffers( 4, fbo ) ;
glGenRenderbuffers( 4, rendbuf ) ;
QueryPerformanceFrequency( &perffreq ) ;

errorlist[i] = glGetError() ;
i++ ;

glBindFramebuffer( GL_DRAW_FRAMEBUFFER, fbo[0] ) ;
glBindRenderbuffer( GL_RENDERBUFFER, rendbuf[0] ) ;
glRenderbufferStorage( GL_RENDERBUFFER, GL_RGBA8, tsize * 8, tsize * 8 ) ;
glFramebufferRenderbuffer( GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rendbuf[0] ) ;

errorlist[i] = glGetError() ;
i++ ;

glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo[1] ) ;
glBindRenderbuffer( GL_RENDERBUFFER, rendbuf[1] ) ;
glRenderbufferStorage( GL_RENDERBUFFER, GL_ALPHA8, tsize * 4, tsize * 4 ) ;
glFramebufferRenderbuffer( GL_READ_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rendbuf[1] ) ;

errorlist[i] = glGetError() ;
i++ ;

glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo[2] ) ;
glBindRenderbuffer( GL_RENDERBUFFER, rendbuf[2] ) ;
glRenderbufferStorage( GL_RENDERBUFFER, GL_RG8UI, tsize * 2, tsize * 2 ) ;
glFramebufferRenderbuffer( GL_READ_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rendbuf[2] ) ;

errorlist[i] = glGetError() ;
i++ ;

glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo[3] ) ;
glBindRenderbuffer( GL_RENDERBUFFER, rendbuf[3] ) ;
glRenderbufferStorage( GL_RENDERBUFFER, GL_R8UI, tsize * 1, tsize * 1 ) ;
glFramebufferRenderbuffer( GL_READ_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rendbuf[3] ) ;

errorlist[i] = glGetError() ;
i++ ;

  cdmap1c = malloc( tsize * 8 * tsize * 8 * 4 ) ;
  cdmap0 = malloc( tsize * 8 * tsize * 8 ) ;
  glColor4ub( 128, 128, 128, 128 ) ; 
  glClearColor( 0.0, 0.0, 0.0, 0.0 ) ;

  glViewport( 0, 0, (GLsizei)( tsize * 8 ), (GLsizei)( tsize * 8 ) ) ;
  glOrtho( 0.0, (GLdouble)( tsize * 8 ), 0.0, (GLdouble)( tsize * 8 ), -1.0, 1.0 ) ; // left, right, bottom, top, near, far

errorlist[i] = glGetError() ;
i++ ;

  QueryPerformanceCounter( &perfcount0 ) ;

  glBindFramebuffer( GL_DRAW_FRAMEBUFFER, fbo[0] ) ; // fbo[0] is draw target
  glClear( GL_COLOR_BUFFER_BIT ) ;
  glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo[0] ) ; // fbo[0] is now read target
  glReadPixels( 0, 0, tsize * 8, tsize * 8, GL_RGBA, GL_UNSIGNED_BYTE, (void *)cdmap1c ) ; 
  glFinish() ;

errorlist[i] = glGetError() ;
i++ ;
// rgba2gray( (uint *)cdmap1c, cdmap0, tsize * 8, tsize * 8 ) ;

  QueryPerformanceCounter( &perfcount1 ) ;	

  glBindFramebuffer( GL_DRAW_FRAMEBUFFER, fbo[1] ) ; // and fbo[1] is draw target
  glClear( GL_COLOR_BUFFER_BIT ) ;
  glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo[1] ) ; // fbo[1] is now read target
  glReadPixels( 0, 0, tsize * 4, tsize * 4, GL_ALPHA, GL_UNSIGNED_BYTE, (void *)cdmap1c ) ; 
  glFinish() ;

errorlist[i] = glGetError() ;
i++ ;

  QueryPerformanceCounter( &perfcount2 ) ;

  glBindFramebuffer( GL_DRAW_FRAMEBUFFER, fbo[2] ) ; // fbo[2] is draw target
  glClear( GL_COLOR_BUFFER_BIT ) ;
  glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo[2] ) ; // fbo[2] is now read target
  glReadPixels( 0, 0, tsize * 2, tsize * 2, GL_RG, GL_UNSIGNED_BYTE, (void *)cdmap1c ) ; 
  glFinish() ;

errorlist[i] = glGetError() ;
i++ ;

  QueryPerformanceCounter( &perfcount3 ) ;	

  glBindFramebuffer( GL_DRAW_FRAMEBUFFER, fbo[3] ) ; // fbo[3] is draw target
  glClear( GL_COLOR_BUFFER_BIT ) ;
  glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo[3] ) ; // fbo[3] is now read target
  glReadPixels( 0, 0, tsize, tsize, GL_RED, GL_UNSIGNED_BYTE, (void *)cdmap1c ) ; 
  glFinish() ;

errorlist[i] = glGetError() ;
i++ ;

  QueryPerformanceCounter( &perfcount4 ) ;	

  free( cdmap1c ) ;
  free( cdmap0 ) ;

  fp = fopen( "timings2.txt", "wt" ) ;

  read8x_ms = ((double)perfcount1.QuadPart - (double)perfcount0.QuadPart ) * 1000.0 / (double)perffreq.QuadPart ;
  read4x_ms = ((double)perfcount2.QuadPart - (double)perfcount1.QuadPart ) * 1000.0 / (double)perffreq.QuadPart ;
  read2x_ms = ((double)perfcount3.QuadPart - (double)perfcount2.QuadPart ) * 1000.0 / (double)perffreq.QuadPart ;
  read1x_ms = ((double)perfcount4.QuadPart - (double)perfcount3.QuadPart ) * 1000.0 / (double)perffreq.QuadPart ;

  fprintf( fp, "64 MPixels time: %10.5f ms\n16 MPixels time: %10.5f ms\n4 MPixels time: %10.5f ms\n1 MPixels time: %10.5f ms\n",
           read8x_ms, read4x_ms, read2x_ms, read1x_ms ) ;

  for( i = 0; i < 10; i++ ){
    fprintf( fp, "Error %u : %u\n", i, errorlist[i] ) ;
  }

  fclose( fp ) ;

  glDeleteRenderbuffers( 4, rendbuf ) ;
  glDeleteFramebuffers( 4, fbo ) ;
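
One thing the test code above may be running into, independent of any driver issue: GL_RG8UI and GL_R8UI are integer internal formats, and the OpenGL specification has glReadPixels raise GL_INVALID_OPERATION when an integer color buffer is read with a non-integer format such as GL_RG or GL_RED (GL_RG_INTEGER or GL_RED_INTEGER would be the matching choices). The sketch below only illustrates that rule. The enum values are stand-ins for the GL constants, not the real ones, and all names are hypothetical.

```c
#include <stdbool.h>

/* Stand-ins for GL_RGBA8, GL_ALPHA8, GL_R8, GL_RG8, GL_R8UI, GL_RG8UI. */
typedef enum { IF_RGBA8, IF_ALPHA8, IF_R8, IF_RG8, IF_R8UI, IF_RG8UI } internal_format;

/* Stand-ins for GL_RGBA, GL_ALPHA, GL_RED, GL_RG, GL_RED_INTEGER, GL_RG_INTEGER. */
typedef enum { RF_RGBA, RF_ALPHA, RF_RED, RF_RG, RF_RED_INTEGER, RF_RG_INTEGER } read_format;

/* True for renderbuffer formats that store unnormalized integers. */
static bool is_integer_internal(internal_format f)
{
    return f == IF_R8UI || f == IF_RG8UI;
}

/* True for glReadPixels formats of the *_INTEGER kind. */
static bool is_integer_read(read_format f)
{
    return f == RF_RED_INTEGER || f == RF_RG_INTEGER;
}

/* Integer buffers need an integer read format and vice versa.
   This checks only that rule, not whether the components match. */
static bool read_format_class_matches(internal_format i, read_format r)
{
    return is_integer_internal(i) == is_integer_read(r);
}
```

Under this rule the GL_R8UI renderbuffer read with GL_RED is a mismatch, while GL_R8 read with GL_RED, or GL_R8UI read with GL_RED_INTEGER, is not.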

I also measured with PCIe Speed Test the upload and readback speeds on my backup computer.
It’s X2 5050e with 4GB DDR2-800 CAS5 on 785G chipset, 4670 with 512MB, Windows XP with drivers 10.3.
Reported upload speed is 3.28 GB/s and download speed is 2.87 GB/s! Wow, that’s a lot better!
I then tested the 4850 in the backup computer, and got 3.24 GB/s for upload and only 2.31 GB/s for readback, still very good but SLOWER than 4670.
Well, this needs further investigation. One thing is sure IMO, X58 chipset and maybe 64 bit OS cause a major drop in readback speeds with PCIe Speed Test.

there was a known issue on vista+x58 chipset that we fixed recently. it is not available in cat10.3 (but was in the gl4 beta driver)

I downloaded and (crossed fingers!) installed the beta v4.0 drivers, as advised by Mr. Boudier.
Now GL_R32F, GL_R16F and GL_R16 are accepted in glRenderbufferStorage, with GL_RED in glReadPixels.
However, GL_R8 in glRenderbufferStorage does not work. It goes through, no error is generated, but IMO nothing is drawn.
I suppose GL_R8 to GL_RED shouldn’t require any conversion and should be the fastest, also using the least GPU memory.
I found GL_R16F to be the fastest. Odd, I would have thought GL_R16 to GL_RED should have been faster.

Also, testing readback speed with PCIe Speed Test showed no improvement over 10.3 drivers.
So the X58 readback speed issue is not solved for Radeons, maybe it is only for Firepro cards…
I asked a friend of mine to test PCIe Speed Test on his 4670 and 5770, and he got upload speeds of 2.6 - 2.7 GB/s and readback speeds of 3.2 - 3.3 GB/s.
I’m starting to suspect AMD wants me to ditch the i7-920 on X58 and go for a Thuban; they launch on the 26th of April. ;)

Update: GL_R8 in glRenderbufferStorage DOES WORK for rendering and blitting. In my previous post I wrongly concluded that it doesn’t.
But when the renderbuffer is in GL_R8 format, glReadPixels messes up, returning all gray.
The workaround for me is to use GL_R16 (or GL_R16F or GL_R32F, these work) in the smallest renderbuffer, while the larger ones are in GL_R8, so I can get valid results from glReadPixels.
I believe this is a driver bug. Is there any other reason for glReadPixels to mess up when trying to read from GL_R8 to GL_RED?