slow transfer speed on fermi cards



def
06-26-2010, 09:05 AM
I am getting very slow transfer speeds on the newest NVidia cards (GTX470/GTX480). glReadPixels() and texture uploads are about 55% slower than on the previous generation of cards (GTX280/GTX285).

Tried different drivers, of course... same results.
The hardware specs predict better results (a narrower memory bus, but higher memory clocks).

Can anyone report different results?

randall
06-26-2010, 11:48 AM
What platform?

def
06-26-2010, 11:54 AM
What platform?

Windows XP, drivers 197.x and 257.21.
Tested on different mainboards, etc.

RDNXR
07-13-2010, 02:08 AM
I have the same problem on most Windows versions (XP through 7, 32- and 64-bit) with NVIDIA drivers 197 and 257.

Dark Photon
07-13-2010, 05:01 PM
I got the same problem. With most Windows OS (XP - Seven 32 and 64 Bit)and NV drivers 197 and 257.
Are you seeing this on both your GTX 480 and QuadroFX 5800?

We should post a short test prog for folks to run and compare readback rates between cards.

ruysch
07-29-2010, 05:11 AM
You guys have to come up with something more solid than "I got a 55% performance drop". I ain't saying it's not true, but you ain't giving any of us very much to work with.

def
08-06-2010, 08:45 AM
Hello,

I was able to build a small GLUT benchmark tool to show the differences in transfer speed on GTX480 cards.
The app uses 720p HD resolution and RGBA data.
I added synchronous PBO transfers and, surprisingly, got very good results without using multiple buffers etc.
The standard glReadPixels function seems to be the slow part on GTX470/GTX480...

Some results:

Geforce GTX480 driver 258.96
----------------------------
glReadPixels: 8.10 ms
PBO glReadPixels: 2.64 ms (memcpy 0.55 ms) total: 3.15 ms
glTexSubImage2D: 1.25 ms
PBO glTexSubImage2D: 0.02 ms (memcpy 0.56 ms) total: 0.58 ms
glCopyTexSubImage2D: 0.04 ms

Geforce GTX280 driver 197.45
----------------------------
glReadPixels: 2.90 ms
PBO glReadPixels: 2.60 ms (memcpy 1.17 ms) total: 3.77 ms
glTexSubImage2D: 2.90 ms
PBO glTexSubImage2D: 0.07 ms (memcpy 1.20 ms) total: 1.27 ms
glCopyTexSubImage2D: 0.03 ms

The two tests were run on different platforms, hence the difference in memcpy speed. I will run a test with different cards on the same PC soon.
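For anyone who wants to turn these per-frame timings into bandwidth, here is a minimal sketch. It only assumes what the post states (1280x720 pixels, 4 bytes per RGBA pixel); `frame_bytes` and `transfer_mib_per_s` are hypothetical helper names, not part of the benchmark.

```cpp
#include <cassert>
#include <cstddef>

// Bytes in one benchmark frame: 1280x720 pixels, 4 bytes per RGBA pixel.
constexpr std::size_t frame_bytes(std::size_t width = 1280,
                                  std::size_t height = 720,
                                  std::size_t bytes_per_pixel = 4) {
    return width * height * bytes_per_pixel;
}

// Convert a per-frame transfer time in milliseconds to MiB/s.
inline double transfer_mib_per_s(double ms, std::size_t bytes = frame_bytes()) {
    assert(ms > 0.0);
    const double seconds = ms / 1000.0;
    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / seconds;
}
```

With these numbers, the 8.10 ms glReadPixels on the GTX480 works out to roughly 434 MiB/s, and the 2.90 ms on the GTX280 to roughly 1212 MiB/s.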

Feel free to give the benchmark a try and post your results with GTX480. (I used maximum quality settings in the driver)

Dark Photon
08-07-2010, 06:36 PM
transferBench.zip:
-> transferBench.exe
-> glut32.dll
How about posting the source code too, so folks can run this exact same test on Linux with GTX480s and GTX285/280s.

Thanks.

def
08-09-2010, 01:30 PM
Here is a cleaned up version including source. Now using GLEW.
Since I am using QueryPerformanceCounter() for timing calculations it's not completely portable yet.
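One possible way around the QueryPerformanceCounter() dependency, sketched here under the assumption that a C++11 compiler is available, is a small timer built on std::chrono::steady_clock. `StopWatch` is a hypothetical name and is not taken from the benchmark source.

```cpp
#include <cassert>
#include <chrono>

// Monotonic millisecond timer; a hypothetical stand-in for the
// QueryPerformanceCounter()-based timing, usable on Windows, Linux and macOS.
class StopWatch {
    using Clock = std::chrono::steady_clock;

public:
    StopWatch() : start_(Clock::now()) {}

    // Restart the measurement from now.
    void reset() { start_ = Clock::now(); }

    // Milliseconds elapsed since construction or the last reset().
    double elapsed_ms() const {
        const Clock::time_point now = Clock::now();
        assert(now >= start_);  // steady_clock never runs backwards
        return std::chrono::duration<double, std::milli>(now - start_).count();
    }

private:
    Clock::time_point start_;
};
```

Usage would be `StopWatch sw; /* GL call */ double ms = sw.elapsed_ms();` in place of each pair of QueryPerformanceCounter() calls.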

trinitrotoluene
08-09-2010, 06:38 PM
For reference, here are my results with a Radeon HD5870 on Linux:


glReadPixels: 2.83 ms
PBO glReadPixels: 1.08 ms (memcpy: 1.68 ms) total: 2.76 ms
glTexSubImage2D: 1.23 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 1.23 ms) total: 1.28 ms
glCopyTexSubImage2D: 0.05 ms
glGetTexImage: 2.72 ms

memcpy speed: 2191 MBytes/sec

Total frame: 16.87 ms (total transfer: 9.65 ms)



If people want the modified version of the source code to run on Linux I will post it.

Dark Photon
08-10-2010, 05:20 AM
If people want the modified version of the source code to run on Linux I will post it.

Here's a port that should run on both Linux and MSWin. Nice little test program!

Dark Photon
08-10-2010, 06:18 AM
Results. Confirmed on Linux that there appears to be a major performance slowdown here. Throwing more CPU at it helps a bit, but comes nowhere near bringing a GTX480 up to GTX285 performance.

Do you have an NVDeveloper (http://nvdeveloper.nvidia.com) account? If so, please kick in a bug report if you haven't already.

GTX285, 256.35 drivers, 2GHz Nehalem EP CPU:

glReadPixels: 3.86 ms
PBO glReadPixels: 3.54 ms (memcpy: 1.24 ms) total: 4.78 ms
glTexSubImage2D: 4.99 ms
PBO glTexSubImage2D: 0.09 ms (memcpy: 1.43 ms) total: 1.52 ms
glCopyTexSubImage2D: 0.02 ms
glGetTexImage: 11.97 ms

memcpy speed: 2984 MBytes/sec

Total frame: 34.90 ms (total transfer: 22.15 ms)

GTX480, 256.35 drivers, 2GHz Nehalem EP CPU:

glReadPixels: 14.57 ms
PBO glReadPixels: 3.25 ms (memcpy: 1.48 ms) total: 4.72 ms
glTexSubImage2D: 2.68 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 1.43 ms) total: 1.49 ms
glCopyTexSubImage2D: 0.05 ms
glGetTexImage: 9.94 ms

memcpy speed: 2494 MBytes/sec

Total frame: 38.26 ms (total transfer: 30.77 ms)

GTX480, 256.35 drivers, 3.2GHz Nehalem CPU:

glReadPixels: 8.32 ms
PBO glReadPixels: 2.42 ms (memcpy: 0.71 ms) total: 3.13 ms
glTexSubImage2D: 1.24 ms
PBO glTexSubImage2D: 0.03 ms (memcpy: 0.61 ms) total: 0.64 ms
glCopyTexSubImage2D: 0.03 ms
glGetTexImage: 3.92 ms

memcpy speed: 5214 MBytes/sec

Total frame: 20.19 ms (total transfer: 16.05 ms)

soconne
08-11-2010, 05:15 PM
Are any NVidia driver engineers reading this post? I upgraded from the GTX 275 to the GTX 465 and not only am I experiencing the same readback slowdowns discussed here, but slowdowns across the board. For instance, for a multi-view rendering application I'm writing, the NVidia GTX 275 could render 4032 views at 15fps. The Nvidia GTX 465 only gets 4fps.

def
08-12-2010, 02:37 AM
@trinitrotoluene:
Thanks for the HD5870 bench data.
Very interesting to do this direct comparison.
What kind of quality/AA settings did you use in the driver?

Your results look like the optimal stats of my GTX280 and GTX480 combined.

def
08-12-2010, 08:18 AM
Do you have an NVDeveloper account? If so, please kick in a bug report if you haven't already.

No I don't. I would be glad if somebody else with an account could do this. Thanks.

Dark Photon
08-12-2010, 08:57 AM
Do you have an NVDeveloper account? If so, please kick in a bug report if you haven't already.
No I don't. I would be glad if somebody else with an account could do this. Thanks.
Done.

And since I just happen to have the stats handy:

GTX285, 256.38.02 drivers, 2.6GHz Nehalem CPU:

glReadPixels: 2.37 ms
PBO glReadPixels: 2.32 ms (memcpy: 0.58 ms) total: 2.90 ms
glTexSubImage2D: 2.70 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 0.60 ms) total: 0.65 ms
glCopyTexSubImage2D: 0.01 ms
glGetTexImage: 5.50 ms

memcpy speed: 6332 MBytes/sec

Total frame: 18.37 ms (total transfer: 11.44 ms)

trinitrotoluene
08-12-2010, 09:21 AM
For AA settings, the quality is super-sample, the level is 8x and the filter is edge detect (24x). I noticed that my memcpy speed is slower than others'. I have a Phenom II X6 1090T on Ubuntu 64-bit. When I use the libc memcpy I get a transfer rate of ~2100 MB/s. When I use this:


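#include <stddef.h>   /* size_t */

/* Replacement memcpy: a plain byte loop, split across threads when
   compiled with -fopenmp (the pragma is ignored otherwise). */
void *(memcpy)(void *b, const void *a, size_t n) {
    size_t i;
    char *s1 = (char *)b;
    const char *s2 = (const char *)a;

    #pragma omp parallel for shared(s1,s2,n) private(i) schedule(static)
    for (i = 0; i < n; i++) {
        s1[i] = s2[i];
    }

    return b;
}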


I get a transfer rate of ~2700 MB/s, all with the GCC compiler flags -O3 -msse4a -march=amdfam10. When I enable OpenMP (-fopenmp) I get a transfer rate of ~3700 MB/s, but now the rendering has some glitches. Maybe it's because all 6 of my cores are working 100% of the time and the driver does not get much CPU time to push draw calls to the video card in this case.

Like Dark Photon wrote before, this is a nice test program. I will profile the program to see whether further improving the speed of the memcpy call matters.
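To compare memcpy variants outside the benchmark, a standalone measurement along these lines could be used. This is only a sketch: `memcpy_mbytes_per_sec` is a hypothetical helper, the buffer size matches the benchmark's 1280x720 RGBA frame, and treating 1 MByte as 2^20 bytes is an assumption about how transferBench reports its "memcpy speed" line.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Time `repeats` copies of a frame-sized buffer and return MBytes/sec
// (1 MByte = 2^20 bytes here, which is an assumption).
inline double memcpy_mbytes_per_sec(std::size_t bytes = 1280 * 720 * 4,
                                    int repeats = 50) {
    assert(bytes > 0 && repeats > 0);
    using Clock = std::chrono::steady_clock;
    std::vector<unsigned char> src(bytes, 0x5A);
    std::vector<unsigned char> dst(bytes, 0);
    volatile unsigned char sink = 0;

    const auto t0 = Clock::now();
    for (int i = 0; i < repeats; ++i) {
        src[0] = static_cast<unsigned char>(i);      // change the input each round
        std::memcpy(dst.data(), src.data(), bytes);
        sink = dst[0];                               // read one byte back so the copy is not dead code
    }
    const auto t1 = Clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    if (seconds <= 0.0) seconds = 1e-9;              // guard against a zero reading on coarse clocks
    const double mbytes = static_cast<double>(bytes) * repeats / (1024.0 * 1024.0);
    return mbytes / seconds;
}
```

The first copies also pay for page faults and cold caches, so a larger `repeats` value gives a steadier number.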

oscarbg
08-17-2010, 08:36 AM
built on macos 10.6.4
snow leopard graphics update beta (1.6.18.16 19.5.9f02)
gtx 275 nehalem

g++ transferBench.cpp -framework OpenGL -framework GLUT -lGLEW
change time.h to sys/time.h and use gettimeofday
Status: Using GLEW 1.5.4

glReadPixels: 6.28 ms
PBO glReadPixels: 6.72 ms (memcpy: 0.67 ms) total: 7.40 ms
glTexSubImage2D: 0.92 ms
PBO glTexSubImage2D: 0.11 ms (memcpy: 0.82 ms) total: 0.94 ms
glCopyTexSubImage2D: 0.08 ms
glGetTexImage: 5.31 ms

memcpy speed: 5486 MBytes/sec

Total frame: 25.16 ms (total transfer: 20.00 ms)

mfort
08-17-2010, 02:18 PM
The benchmark code for glReadPixels is not correct.

You should call glFinish before starting the timer. (line 212)
Internally the glReadPixels has to wait until all the geometry is rendered. Then it starts the transfer.

There is also a small problem with glTexSubImage2D with PBO.
The texture is not actually loaded when glTexSubImage2D returns. The call just starts the DMA to GPU memory, and the transfer happens in the background.

I'd recommend calling glFinish() before every getTime().
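A minimal sketch of that recommendation: synchronize, sample the start time, run the call, synchronize again, sample the end time. The synchronization step is passed in as a parameter so the sketch compiles without a GL context; in the benchmark it would be something like `[] { glFinish(); }`. `timed_ms` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>

// Time `work` in milliseconds, calling `sync` before each time sample.
// `sync` stands in for glFinish(); `work` is the call being measured.
template <class Sync, class Work>
double timed_ms(Sync sync, Work work) {
    using Clock = std::chrono::steady_clock;
    sync();                               // drain work queued before the measurement
    const auto t0 = Clock::now();
    work();                               // e.g. glReadPixels or glTexSubImage2D
    sync();                               // wait for any transfer the call only started
    const auto t1 = Clock::now();
    assert(t1 >= t0);                     // steady_clock never runs backwards
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

The second sync is the part that matters for the PBO upload case, where the call returns before the DMA has finished.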

Dark Photon
08-17-2010, 03:02 PM
The benchmark code for glReadPixels is not correct.
For the base glReadPixels it should be fine. glReadPixels must block until rendering completes, all the data is back, and it has been copied into the destination array at the address indicated by the "pixels" parameter. Timing is sampled after glReadPixels.

For the PBO glReadPixels, while the glReadPixels may pipeline, the glMapBuffer will block until rendering completes, all the data is in the buffer, and it has been transferred into a mappable CPU block. Timing is sampled after the glMapBuffer. So that should be fine as well.

The only thing I see wrong is the glFlush -- it is superfluous in both cases. Removing it does not affect the performance of either case here.

mfort
08-18-2010, 12:44 AM
For the base glReadPixels it should be fine. glReadPixels must block until rendering completes

Yes, that's right, glReadPixels must block. Therefore you are not measuring the performance of the DMA transfer; you are measuring rendering + DMA transfer. That is the problem with the benchmark.

Dark Photon
08-18-2010, 04:17 AM
yes, that's right, glReadPixels must block. Therefore you are not measuring performance of DMA transfer but you measure performance of rendering + DMA transfer.
True, though the render time for something so trivial should be negligible overhead.

However, if there was a card nowadays where rendering a cube was really, really slow (1-5 ms), then I'd agree your point would result in a significant timing difference. But as-is the overhead should be pretty small.

And just to confirm here, I made this change (putting a glFinish() before sampling the start time) and did not see any timing difference.

def
08-18-2010, 01:54 PM
The only thing I see wrong is the glFlush -- it superfluous in both cases. Removing it does not affect the performance of either case here.
I put that in on short notice just to avoid the usual "your timing is not correct" reply. Unfortunately for me I confused it with glFinish().
And yes, theoretically it should be called just before and just after measuring a GL call. Who knows, maybe someone wants to benchmark software rendering...

Geri.
09-15-2010, 04:43 PM
your application crashes on my laptop with radeon 7500 mobility.

mhagain
09-16-2010, 09:38 AM
Assuming this isn't a spambot, of course it doesn't work with a Radeon 7500, which is an OpenGL 1.3/Direct3D 8.1 card from about the year 1437. :)

Out of curiosity I tried this on my laptop's 230M and got comparable results. However, switching format from GL_RGBA to GL_BGRA caused performance to almost double. Changing the type from GL_UNSIGNED_BYTE to GL_UNSIGNED_INT_8_8_8_8_REV gave a more subtle increase, but I'm assuming that the driver is recognising it's a 32-bit format and optimizing accordingly.

It would be interesting to see benchmark results with a changed format and type for the troublesome hardware.
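One common explanation for a gap like this, which is an assumption here and not something the thread confirms for these drivers, is that the framebuffer is stored in BGRA byte order, so a GL_RGBA readback has to swap the red and blue bytes of every pixel somewhere along the way. The CPU equivalent of that swap is sketched below; `swap_red_blue` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Swap the first and third byte of every 4-byte pixel in place,
// converting RGBA byte order to BGRA (or back again).
inline void swap_red_blue(unsigned char* pixels, std::size_t byte_count) {
    assert(pixels != nullptr && byte_count % 4 == 0);
    for (std::size_t i = 0; i < byte_count; i += 4) {
        std::swap(pixels[i], pixels[i + 2]);
    }
}
```

For a 1280x720 frame this touches about 3.7 MB per readback, which is the kind of extra pass a matching format would avoid.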

Dark Photon
09-21-2010, 06:36 AM
switching format from GL_RGBA to GL_BGRA caused performance to almost double. Changing the type from GL_UNSIGNED_BYTE to GL_UNSIGNED_INT_8_8_8_8_REV gave a more subtle increase

Attached updated transferBench with these mods (and Linux Makefile): transferBench_src4.zip (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=download&Number=38&filename=transferBench_src4.zip)

Results:

GTX285, 260.19.04b drivers, 2GHz Nehalem EP CPU:


glReadPixels: 1.88 ms
PBO glReadPixels: 0.92 ms (memcpy: 1.64 ms) total: 2.55 ms
glTexSubImage2D: 3.55 ms
PBO glTexSubImage2D: 0.07 ms (memcpy: 1.67 ms) total: 1.74 ms
glCopyTexSubImage2D: 0.02 ms
glGetTexImage: 8.60 ms

memcpy speed: 2252 MBytes/sec

Total frame: 23.69 ms (total transfer: 14.80 ms)


GTX480, 260.19.04b drivers, 2GHz Nehalem EP CPU:


glReadPixels: 4.82 ms
PBO glReadPixels: 3.47 ms (memcpy: 1.12 ms) total: 4.59 ms
glTexSubImage2D: 4.74 ms
PBO glTexSubImage2D: 4.92 ms (memcpy: 1.11 ms) total: 6.03 ms
glCopyTexSubImage2D: 0.08 ms
glGetTexImage: 9.97 ms

memcpy speed: 3303 MBytes/sec

Total frame: 37.81 ms (total transfer: 25.49 ms)


So faster than before, but still a 2.6X slowdown on GTX480 vs. GTX285 (before, was 3.8X slowdown).

mfort
09-21-2010, 08:46 AM
So faster than before, but still a 2.6X slowdown on GTX480 vs. GTX285 (before, was 3.8X slowdown).
I guess this improvement is only due to faster memcpy (which I cannot explain).

Here is proof: 2.6 / 3.8 = 2252 / 3303
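The figures in this exchange can be rechecked from the posted results. The 3.8X and 2.6X slowdowns appear to be the plain glReadPixels rows (14.57 ms vs. 3.86 ms with 256.35, then 4.82 ms vs. 1.88 ms with 260.19.04b), and 2252 and 3303 are the reported memcpy speeds of the two newer runs. A small sketch of the arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// How many times slower `slow_ms` is than `fast_ms`.
inline double slowdown(double slow_ms, double fast_ms) {
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}

// True when a and b differ by less than `tolerance`.
inline bool nearly_equal(double a, double b, double tolerance) {
    return std::fabs(a - b) < tolerance;
}
```

`slowdown(14.57, 3.86)` is about 3.77 and `slowdown(4.82, 1.88)` is about 2.56. The two sides of the equation above come out as roughly 0.684 and 0.682, so they agree to within rounding. That shows the numbers line up; it does not by itself show that memcpy speed is the cause.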

mfort
09-21-2010, 08:50 AM
Could anyone download CUDA-Z (google it) and run it on both 480 and 285 cards in the same PC? It reports memory performance statistics measured with CUDA. They should match the OpenGL numbers (with and without PBO).

slackkeymike
09-30-2010, 01:06 PM
I thought I would jump into the fray. I too have been chasing this problem for about a week or so. Bought a 465 to test with our s/w. Linux, Centos 4.7. Our 9800 and 280 GTX cards ran circles around the 465 until we stopped the glread stuff, then the render speeds made sense.

I have been hanging out at nvidia's site trying to get answers... no luck. Wanted to try CUDA-Z, but my OS libraries are out of date. So I am building a new OS disk... Anybody have anything new to report? --Mike

For what it is worth, I also tested a GTX 480 and got the same results.

Itaru
10-01-2010, 09:22 PM
My card is a GTX 460 which uses the Fermi GF104 GPU, a derivative of the Fermi GF100 GPU used in the GTX 465/470/480. This card unfortunately also suffers from slow glReadPixels speed, and seems to be a lot worse than the GF100 cards. :( In transferBench, I get 22ms speed for glReadPixels in the beginning, but the odd thing is that if I let it run for about half a minute, it will improve to 18ms but at the same time, PBO glReadPixels and glGetTexImage become slightly slower by about 2ms.

As for memcopy speeds comparison between transferBench and CUDA-Z, CUDA-Z gives slower speed (5900MB/s Pinned, 4700MB/s Pageable) compared to transferBench (6900MB/s) for my GTX 460.

Itaru
10-05-2010, 05:21 AM
Well, turns out that I had Vsync on when I ran transferbench last time and got 22ms for glReadPixels. I tried it again after turning off Vsync this time and got much better speed at around 8ms. I guess it's still slow though.

Just out of curiosity I tried installing the Quadro 260.78 beta drivers for my GTX 460 by modding the INF file, and I'm sad to say that it changed nothing at all with regards to glReadPixels speed. So, either it's a hardware limitation in the 400-series cards, or the Quadro drivers are smart enough to know that they are not running on an actual Quadro card and therefore don't enable the Quadro-specific performance boost. I think it's the latter.

Wayland Strickland
11-12-2010, 04:39 PM
GTX480, 261. dev drivers,devdriver_3.2_winvista-win7_64_261.00_general

Nvidia Corp/GeForce GTX 480/PCI SSE2 4.1.0
Card MFG: ASUS

Intel Core i7 x980 3.33GHz 12.0GB RAM Win7 64bit
glReadPixels: 7.11 ms
PBO glReadPixels: 2.32 ms (memcpy: 0.46 ms) total: 2.77 ms
glTexSubImage2D: 1.09 ms
PBO glTexSubImage2D: 0.04 ms (memcpy: 0.50 ms) total: 0.52 ms
glCopyTexSubImage2D: 0.04 ms
glGetTexImage: 3.93 ms

memcpy speed: 7923 MBytes/sec

Total frame: 18.74 ms (total transfer: 14.91 ms)

Anyone heard anything back from Nvidia about this major problem?

THANK YOU!
Wayland Strickland

realbabilu
01-10-2011, 06:10 PM
GTX 580 card result : GLreadpixel = 4.8ms
http://img15.imageshack.us/img15/6369/17406513.jpg
Any comments?
My radeon 5850 mobility has 3.8 ms..

Bullit
01-11-2011, 11:35 PM
For reference MobilityRadeon4530 W7x64 HP DV7 laptop

glReadPixels: 5.86 ms, but temporarily goes up to 6 ms
PBO glReadPixels: 5.29 ms (memcpy: 1.62 ms) total: 6.92 ms
glTexSubImage2D: 4.06 ms
PBO glTexSubImage2D: 0.18 ms (memcpy: 1.85 ms) total: 2.03 ms
glCopyTexSubImage2D: 0.05 ms
glGetTexImage: 4.84 ms

memcpy speed: 2269 MBytes/sec

Total frame: 36.90 ms (total transfer: 19.71 ms)

mfort
01-14-2011, 09:44 AM
I ran my own tests.
I tested a GeForce GTX 260 and a GeForce GTX 460 in the same computer (Xeon based). Driver 260.99.

glReadPixels is about 10 times slower on the GTX 460 than on the GTX 260.
glReadPixels with PBO is about 2.5 times slower on the GTX 460 than on the GTX 260.

glTexSubImage with PBO is about the same on both the GTX 260 and the GTX 460.

Then I tested CUDA with OpenGL. I copied renderbuffer content from GPU to CPU memory using CUDA. I used plain memory (not page-locked, a.k.a. pinned).
The CUDA performance on the GTX 260 was about the same as glReadPixels+PBO.
The CUDA performance on the GTX 460 was 2.5 times higher than ReadPixels+PBO and equal to the performance on the GTX 260 !!!
If anybody wants to see the CUDA code, I can post it here.


Conclusion
The GTX 460 is capable of transferring data from GPU to CPU at the same or higher speed than the older GTX 260. There is no HW limitation. The current OpenGL driver cannot utilize the full transfer speed.

This implies two possibilities:
A: There is a driver bug and NVIDIA will fix it one day.
B: This behaviour is on purpose (does anybody have a Fermi-based Quadro?). I think the Fermi-based Quadro will not suffer this performance loss.

NVIDIA, tell us the truth please.

CatAtWork
01-27-2011, 05:51 PM
Well, luckily I bought the cheapest Quadro Fermi yesterday, and it is much faster than a high-end GeForce Fermi for ReadPixels and GetTexImage:

i7 960 at 3.2GHz, Vsync forced off. Both of these cards are in the same machine at once.

Quadro 600 (Yes, not 6000) as a headless PCIe 2.0 at 8x

glReadPixels: 1.65 ms
PBO glReadPixels: 1.18 ms (memcpy: 0.77 ms) total: 1.95 ms
glTexSubImage2D: 2.60 ms
PBO glTexSubImage2D: 2.95 ms (memcpy: 0.78 ms) total: 3.73 ms
glCopyTexSubImage2D: 0.07 ms
glGetTexImage: 7.65 ms

memcpy speed: 4784 MBytes/sec

Total frame: 26.95 ms (total transfer: 15.05 ms)



GTX 470 as the primary monitor:

glReadPixels: 23.21 ms
PBO glReadPixels: 2.43 ms (memcpy: 0.83 ms) total: 3.26 ms
glTexSubImage2D: 2.63 ms
PBO glTexSubImage2D: 2.69 ms (memcpy: 0.67 ms) total: 3.35 ms
glCopyTexSubImage2D: 0.06 ms
glGetTexImage: 12.81 ms

memcpy speed: 4455 MBytes/sec

Total frame: 50.11 ms (total transfer: 42.70 ms)

diduke
01-31-2011, 04:56 AM
And a quote from http://developer.nvidia.com/object/opengl_driver.html



8) Will functionality marked as deprecated be slow on NVIDIA hardware?

No. NVIDIA understands that features on the deprecated list are critical to the business of a large part of our customer base. NVIDIA will provide full performance, and will support, tune, and fix any issues, for any feature on the deprecated list. This means that all the functionality in the ARB_compatibility extension and Compatibility profile will continue to operate at maximum performance.



This thing (the disabling of 'professional' features, or the crippling of the actual hardware capabilities, depending on how you want to see it) seems to be happening with every new generation of hardware...
I can understand custom profiles for some applications, but the crippling of some selected set of features is another thing completely.

I will post the results of the transferbench on a quadro 3800 in a few days.

mfort
01-31-2011, 07:47 AM
8) Will functionality marked as deprecated be slow on NVIDIA hardware?

I do not think that deprecation has anything to do with the slow transfer. Even the transfer with PBO (which is not deprecated) is slow.



I will post the results of the transferbench on a quadro 3800 in a few days.

I've already tested the Quadro FX 3800. It is fast (even faster than GeForce). But this card is not Fermi based. The problem is only with Fermi cards.

CatAtWork
02-03-2011, 01:18 PM
I've already tested Q FX 3800. It is fast (even faster then GF). But this card is not Fermi based. The problem is only with Fermi cards.


Fermi GeForce. Fermi Quadro is fine.

Chris Lux
02-14-2011, 11:47 AM
Hi,
i experimented with the source code from this thread to investigate download rates. I am on a GTX 480 with r266 drivers. I noticed a very strange and disturbing effect. After letting the benchmark run for more than a few seconds, i noticed that the PBO glReadPixels dropped in performance. The download time nearly doubled, from ~2.5ms to ~4.8ms. This happens reproducibly every time after 10-15s of runtime.

I suspect the driver moving the buffer object to another memory region after some usage analysis, which is actually worse than the first one. I came across this behavior of the nvidia drivers some time ago on another project, but never found a way around it...

Has anyone else noticed this behavior on other GPUs and drivers?

I commented out the other parts of the benchmark, this is the code i used:


// PBO glReadPixels
QueryPerformanceCounter(&start_ticks);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo[0]);
glBufferData(GL_PIXEL_PACK_BUFFER_ARB, 1280*720*4, NULL, GL_STREAM_READ);
glReadPixels(0, 0, 1280, 720, GL_RGBA, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));
//glReadPixels(0, 0, 1280, 720, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));
void* mem = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY); // blocks until data is available
glFlush();
QueryPerformanceCounter(&ende_ticks);
pbo_rp_ms = ((double) ende_ticks.QuadPart - (double) start_ticks.QuadPart) / frequenz.QuadPart * 1000.0;

QueryPerformanceCounter(&start_ticks);
memcpy( dump1, mem, 1280*720*4 );
QueryPerformanceCounter(&ende_ticks);
rp_memcpy_ms = ((double) ende_ticks.QuadPart - (double) start_ticks.QuadPart) / frequenz.QuadPart * 1000.0;

glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

mfort
02-14-2011, 01:19 PM
The performance degradation is due to the NVIDIA power saving feature (a.k.a. Powermizer). When the driver detects low load, it decreases the GPU or memory clock. You can disable it in the NVIDIA control panel. You can monitor the clocks using GPU-Z to make sure that the tests are running at max speed.

Chris Lux
02-15-2011, 01:27 AM
ah, i did not notice that before... thanks for the hint.

i still played a bit with the benchmark to maximize the throughput of the readback.

With the PBO glReadPixels I achieved:


glReadPixels(0, 0, 1280, 720, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));



PBO glReadPixels: 2.35 ms 1497.91 MiB/s (memcpy: 0.49 ms 7143.67 MiB/s) total: 2.84 ms 1238.27 MiB/s
glTexSubImage2D: 1.12 ms 3132.71 MiB/s


I think 1.2GiB/s is a joke on an x16 PCI-Express card. So i tried glGetBufferSubData instead of a map + memcpy for the readback:


glGetBufferSubData(GL_PIXEL_PACK_BUFFER, 0, 1280*720*4, mem);



PBO glReadPixels: 0.06 ms 63069.52 MiB/s (memcpy: 3.75 ms 937.86 MiB/s) total: 3.80 ms 924.11 MiB/s
glTexSubImage2D: 1.31 ms 2693.62 MiB/s

Under 1GiB/s... Is there anything that can be done to increase the readback throughput?
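To put those rates in context, here is a rough comparison against the theoretical link rate. It assumes the card sits in a PCIe 2.0 x16 slot (the post only says x16): PCIe 2.0 runs at 5 GT/s per lane with 8b/10b encoding, which is 500 MB/s per lane in each direction before protocol overhead. The helper names are hypothetical.

```cpp
#include <cassert>

// Theoretical one-direction PCIe 2.0 rate in bytes/s for `lanes` lanes:
// 5 GT/s per lane with 8b/10b encoding = 500e6 bytes/s per lane.
inline double pcie2_peak_bytes_per_s(int lanes) {
    assert(lanes > 0);
    return 500.0e6 * lanes;
}

// Fraction of that theoretical rate reached by a measured MiB/s figure.
inline double fraction_of_peak(double measured_mib_per_s, int lanes = 16) {
    const double measured_bytes_per_s = measured_mib_per_s * 1024.0 * 1024.0;
    return measured_bytes_per_s / pcie2_peak_bytes_per_s(lanes);
}
```

Under that assumption, 1238.27 MiB/s is about 16% of the theoretical 8 GB/s, and 924.11 MiB/s is about 12%. Real transfers never reach the theoretical figure, so this is an upper bound on how much is being left unused.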

Chris Lux
02-16-2011, 05:44 AM
Quadro 6000, 266.58, Windows 7 x64 (Core i7 980, 12GiB RAM)


glReadPixels: 2.06 ms
PBO glReadPixels: 0.66 ms (memcpy: 0.52 ms) total: 1.18 ms
glTexSubImage2D: 1.13 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 0.44 ms) total: 0.49 ms
glCopyTexSubImage2D: 0.06 ms
glGetTexImage: 4.42 ms

memcpy speed: 7053 MBytes/sec

Total frame: 16.70 ms (total transfer: 8.21 ms)


GeForce GTX 285, 266.58, Windows 7 x64 (Core i7 980, 12GiB RAM)

glReadPixels: 3.14 ms
PBO glReadPixels: 3.11 ms (memcpy: 0.30 ms) total: 3.41 ms
glTexSubImage2D: 2.32 ms
PBO glTexSubImage2D: 0.06 ms (memcpy: 0.49 ms) total: 0.56 ms
glCopyTexSubImage2D: 0.04 ms
glGetTexImage: 9.30 ms

memcpy speed: 12103 MBytes/sec

Total frame: 27.75 ms (total transfer: 16.44 ms)


GeForce GTX 580, 266.58, Windows 7 x64 (Core i7 980, 12GiB RAM)

glReadPixels: 8.30 ms
PBO glReadPixels: 2.36 ms (memcpy: 0.55 ms) total: 2.91 ms
glTexSubImage2D: 1.20 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 0.48 ms) total: 0.53 ms
glCopyTexSubImage2D: 0.06 ms
glGetTexImage: 4.44 ms

memcpy speed: 6762 MBytes/sec

Total frame: 28.68 ms (total transfer: 16.24 ms)

THIS is another joke brought to you by Nvidia. I do not think the dual copy engines do something here, because how this benchmark works they are not needed (no parallel transfer required).

Here another test in my own software:
http://h-4.abload.de/img/readback_bench9tgh.png
Explanation:
read: orphan buffer0, bind to PIXEL_PACK_BUFFER, glReadPixels
copy: map buffer1, memcopy image data, unmap buffer1
tex: not used
swap(buffer0, buffer1)
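The schedule above can be sketched without any GL calls. Here `read_into` and `copy_from` are hypothetical callbacks standing in for the read step (orphan, bind, glReadPixels) and the copy step (map, memcpy, unmap); each receives the index of the buffer to use. This is an illustration of the two-buffer rotation, not the code from the software shown in the screenshot.

```cpp
#include <cassert>
#include <utility>

// Two-buffer readback schedule: the GPU fills one buffer while the CPU
// drains the buffer that was filled on the previous frame.
struct PingPongReadback {
    int read_buf = 0;   // buffer the GPU writes this frame
    int copy_buf = 1;   // buffer the CPU copies out this frame

    template <class ReadFn, class CopyFn>
    void frame(ReadFn read_into, CopyFn copy_from) {
        assert(read_buf != copy_buf);
        read_into(read_buf);             // read: orphan + bind + glReadPixels
        copy_from(copy_buf);             // copy: map + memcpy + unmap
        std::swap(read_buf, copy_buf);   // swap(buffer0, buffer1)
    }
};
```

On the very first frame the copy step sees a buffer that has never been filled, so real code would skip or discard that first copy. From then on every copy is one frame behind the read.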

The gpu times are taken using timer queries, the cpu times using performance counters.

It is clear that the on-device copy from the framebuffer to the pack buffer is much slower on Fermi GeForces (read gpu time).

Edit: Maybe we have to wait and see how they optimize this for Rage and then we can go and start again to do what id does, just because the drivers do it better for them.

mfort
02-16-2011, 09:50 AM
NVIDIA is well aware of this "problem". They want to increase Quadro sales. Most games do not need reading pixels back.

The only workaround I found is the use of CUDA. I also tried OpenCL but no speedup there. CUDA can copy FBO renderbuffer to system memory at the same speed as Quadro with PBO.

Chris Lux
02-16-2011, 01:50 PM
I know that they are aware of this, but as a developer it is frustrating to work around these issues, especially if these issues were not there in the last generations. A 2.5 times slower transfer due to artificial throttling is just stupid. And introducing CUDA into the software just to get data to the host side fast is also insane... The transfer will stay fast with CUDA only until Nvidia sees us abusing this API just for that.

What i like to see is a definitive list of features cut or artificially broken just to sell Quadros.

As i said, id Software's Rage will depend on a fast readback. So maybe it will be enabled in an application profile directly for Rage (if they even use OpenGL in the release version).

Such things are so frustrating...

P.S. Do you have a small code snippet for fast FBO to PBO/host memory using CUDA or OpenCL (even if it is not faster than the current approach)?

kyle_
02-16-2011, 01:57 PM
Maybe rename your binary to 'rage.exe' and see what happens ;)

mfort
02-17-2011, 01:21 AM
CUDA workaround:

headers:


#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cuda_gl_interop.h>
#include <cudaGL.h>

#pragma comment(lib, "cudart.lib")

struct cudaGraphicsResource *cudaGfxRes;


Initialization:


// init CUDA
cudaError_t cErr = cudaGLSetGLDevice(0/*GPU number*/);


Memory allocation:


// for best performance, allocate pinned memory
void* ptr;
cudaError_t cErr = cudaMallocHost(&ptr, sizeInBytes);


Registering OpenGL RB and CUDA


// this is done only once
cudaError_t cErr = cudaGraphicsGLRegisterImage(&cudaGfxRes,
    renderBufferId, GL_RENDERBUFFER,
    cudaGraphicsMapFlagsReadOnly);


Data transfer:


cudaError_t cErr;
struct cudaArray* cArray;

cErr = cudaGraphicsMapResources(1, &cudaGfxRes);
cErr = cudaGraphicsSubResourceGetMappedArray(&cArray, cudaGfxRes, 0, 0);
cErr = cudaMemcpyFromArray(dstMemPtr, cArray, 0, 0,
    sizeInBytes, cudaMemcpyDeviceToHost);
cErr = cudaGraphicsUnmapResources(1, &cudaGfxRes);

Chris Lux
02-17-2011, 02:10 AM
Thanks!

Could you also post the OpenCL equivalent? I want to take a look at the exact performance differences.

Regards
-chris

mfort
02-20-2011, 01:10 AM
I am not an OpenCL expert. At this moment I am not quite sure about my implementation. I'd rather not post it here and give the wrong impression of OpenCL. If I find an OpenCL implementation with the same speed as CUDA, I will come back.

def
05-06-2011, 07:04 AM
I am not very familiar with CUDA. Can you register an unsigned byte FBO and do the cudaMemcpy? Is this what you have done?

It seems only FLOAT32 or unsigned INT formats are supported with cudaGraphicsGLRegisterImage().

If this really works, a glCopyTexSubImage2D() + render a textured quad to FBO + cudaMemcpy() from FBO to CPU would still be faster than using glReadPixels() on the standard framebuffer...

mfort
05-06-2011, 08:59 AM
@def - you can transfer GL_RGBA8 render buffers.