glReadPixels speed problem on nvidia card



Doomhammer
02-03-2007, 04:13 AM
Hello,

Yes, I know glReadPixels is quite slow, but I have to use it, and that is not the point :)

For my application I am interested in fast readback of my rendered scene. That's why I measured how long the transfer takes.

So I take a timestamp, read the pixels from the buffer, and take another timestamp to get the elapsed time.

For testing purposes I ran this on my laptop, equipped with an ATI Mobility Radeon X700, and got an average of 15 ms. Quite high, but this is not the problem at all!

I have done the same test on my target machine, equipped with an Nvidia Quadro FX 3500.

I got very strange results:


0.002538
0.002548
0.002540
0.002545
0.002557
0.002578
0.002558
0.002543
...
0.020149
0.034502
0.024774
0.039152
0.037828
0.028562
0.029556
0.030189
0.040288
(all times in seconds)

The first frames are copied very fast, in a few ms! But after 22 or 23 frames the copy time rises by about a factor of 10! That's double or triple the time on my mobile Radeon!
What happens here?

I have done the test with glReadPixels and glGetTexImage with a bound FBO, with glReadPixels and MapBuffer and PB/PBO, and with glReadPixels and the default framebuffer. It is always the same drop-off!

I added a sleep command between my two measurements and something else very strange happened.
For sleep times from 0 to 10 ms nothing changed, but if the application sleeps for 100 or 1000 ms the drop-off does not occur.




0.103809
0.103563
0.102490
0.102299
0.103093
0.102961
0.102787
0.102594
0.104578
0.103186
0.102908
0.103701
0.103597
0.102443
0.103125
0.103041
0.102860
0.102666
0.102475
0.102294
0.103078
0.102886
0.102705
I am very confused about why this happens...
Oh, all tests were done on Windows XP!

Is there a bug in the nvidia driver (I used the latest ForceWare driver) or in nvidia's OpenGL implementation?
Thanks for all replies!
Chris

jgennis
02-03-2007, 11:01 AM
Try putting a glFinish() before you start timing things. It looks like the GPU/driver may still be working on your previous GL commands at that point, and the glReadPixels() will have to wait for them to complete before it can return the image.
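
To make the reasoning above concrete, here is a minimal toy model in C. It is not real OpenGL: `ToyGpu`, `toy_finish`, `toy_read_pixels` and `measured_readback_us` are hypothetical names, and the model simply assumes that a synchronous readback has to wait for all queued work before it can copy anything.

```c
#include <assert.h>

/* Toy model, not real OpenGL: the "GPU" holds queued work in microseconds. */
typedef struct {
    long pending_us;   /* rendering work submitted but not yet finished */
} ToyGpu;

/* Stands in for glFinish(): blocks until the queue is empty.
   Returns how long the caller was blocked. */
static long toy_finish(ToyGpu *gpu)
{
    long waited = gpu->pending_us;
    gpu->pending_us = 0;
    return waited;
}

/* Stands in for a synchronous glReadPixels(): it cannot return the image
   until all earlier commands are done, so it pays for them as well. */
static long toy_read_pixels(ToyGpu *gpu, long copy_us)
{
    assert(copy_us >= 0);
    return toy_finish(gpu) + copy_us;
}

/* Time measured around the readback only. If finish_first is nonzero,
   the queue is drained before the timed region starts. */
static long measured_readback_us(long pending_us, long copy_us, int finish_first)
{
    ToyGpu gpu;
    gpu.pending_us = pending_us;
    if (finish_first)
        (void)toy_finish(&gpu);   /* the wait happens outside the timed region */
    return toy_read_pixels(&gpu, copy_us);
}
```

With made-up numbers of 28000 us of queued rendering and a 2500 us copy, `measured_readback_us(28000, 2500, 0)` gives 30500, while `measured_readback_us(28000, 2500, 1)` gives 2500. The total wall time is the same either way; only the part attributed to the readback changes.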

ZbuffeR
02-03-2007, 11:41 AM
About the performance, did you try different formats, such as RGBA, BGRA, RGB, etc ?

And what is the size of the readback region ?

Doomhammer
02-04-2007, 12:45 AM
Thanks jgennis, I will try it out!

ZbuffeR: yes, I've tried RGBA, RGB, BGRA, BGR and such... nothing changed!

andras
02-04-2007, 03:53 PM
I was just about to post a question about why glReadPixels is so slow, when I saw this thread. :)
Actually, I don't really care about transfer speed, but I do care about it using up 100% of my CPU, even when I'm reading into a PBO, which should be an asynchronous, CPU-free operation!

I've made a small test program, where I am reading a tiny chunk of the framebuffer into a PBO every frame. For the sake of testing, I ignore the result, so I don't even call glMapBuffer() or glGetBufferSubData(). It should not block at all, and should not use any CPU (it should copy from VRAM to VRAM without me ever touching it!).

I've tried multiple buffer object usage types, multiple PBO sizes, multiple formats.. None of which seem to make any difference.

I haven't done serious profiling yet, but that's the next step.

This is on a 7600GS, driver v93.71

andras
02-04-2007, 04:25 PM
Ok, I've made some quick measurements:
glReadPixels takes the same amount of CPU time, regardless of reading to client memory, or to PBO.
Executing glFinish before the ReadPixels reduces glReadPixels time drastically.

This makes me suspect that the driver does an implicit glFinish() in glReadPixels, even when reading into a PBO.

yooyo
02-04-2007, 04:40 PM
How did you measure the timings (are you using QueryPerformanceCounter)? Do you have a dual-core CPU? If you do, Windows XP has a bug with performance counters, because the counter exists on both CPU cores and is not synced.

Please post a code snippet (how you create the PBO, the readback code, ...). I have played with this and I got VERY nice results.

andras
02-04-2007, 07:15 PM
I have used RDTSC on a single core computer.

My original test code is using our internal wrapper, I'll rewrite it using plain GL and post it in a little bit.

andras
02-04-2007, 07:28 PM
Ok, here's the snippet:

///////////////////////////////////
// init (run only once):
GenBuffers(1, (GLuint*)&m_name);
BindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_name);
BufferData(GL_PIXEL_PACK_BUFFER_ARB, 1024, 0, GL_DYNAMIC_READ);

// runtime (every frame):
::glReadPixels(0, 0, 1, 1, GL_BGRA, GL_UNSIGNED_BYTE, 0);
///////////////////////////////////

Doomhammer
02-05-2007, 03:56 AM
Originally posted by jgennis:
Try putting a glFinish() before you start timing things. It looks like the GPU/driver may still be working on your previous GL commands at that point, and the glReadPixels() will have to wait for them to complete before it can return the image.

You're right, that works! Now the correct time is shown.
But how can I combine this with asynchronous PBO transfer?
If I have to wait for the glFinish() call (which takes about 210 ms), the 2 ms for the copy hardly count.

yooyo
02-05-2007, 04:28 AM
{
m_FrameSize = outXres * outYres * 4; // framebuffer size (xres * yres * RGBA)

glGenBuffers(1, &PBOreadback); // PBOreadback is GLuint PBOreadback;

// initialise PBO for readback
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, PBOreadback);
glBufferData(GL_PIXEL_PACK_BUFFER_EXT, m_FrameSize, NULL, GL_STATIC_READ);

glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, 0);
m_CurrentReadIndex = 0;
glPixelStorei(GL_PACK_ALIGNMENT, 4);
}
....
// this code reads back from the active FBO

glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, PBOreadback);
glReadPixels(0, 0, outXres, outYres, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));
void *Memory = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
memcpy(m_sysMemPointer, Memory, m_FrameSize);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

This code works here.

Doomhammer
02-05-2007, 04:40 AM
Thanks yooyo for the code! I will try it out!

But my point is: I have to wait for the glFinish(), which takes about 200 ms, while the readback itself only takes 2 ms... What is the advantage of an asynchronous read? Does it only save these 2 ms??

andras
02-05-2007, 05:57 AM
The asynchronous read is only useful if you don't need the results right away. In that case, you shouldn't have to wait for glFinish() to flush the rendering queue. OpenGL should only flush the queue when you actually read back the data (e.g. map the buffer), but if you do this a couple of frames later, then by that time you won't have to wait for anything.

Yooyo, I'm going to give your code snippet a try. I wonder if it has anything to do with reading from the FBO?

Doomhammer
02-05-2007, 07:12 AM
Oh OK andras, I HAVE to have the data immediately, and 2 ms is not a big deal...
Thanks to everyone who replied :)

andras
02-05-2007, 07:45 AM
Well, good for you :) I can wait for a couple frames, but definitely cannot stall the CPU with a glFinish()!

andras
02-05-2007, 08:18 AM
Ok, so I was playing around with my FBO test app, and I've noticed something strange there too: When I have VSync on, it runs at a constant 60Hz, but uses 100% CPU. Now, when I turn off VSync, the framerate goes up to 560+ FPS!! While using the same amount of CPU! This doesn't make any sense!

Note that this sample program doesn't even have ReadPixels in it, so it's something else.

andras
02-05-2007, 08:46 AM
Hah, I've figured out the problem with my FBO test app, and it turns out that it IS related to ReadPixels!:
Originally I used this app to test procedural rendering into an alpha texture. Since you can't render into a one-component texture, I had to render into an RGBA buffer first, then did a CopyPixels into the alpha texture, and then used that texture to render. When I removed the copy and changed the app to use the RGBA texture, the CPU usage dropped to around 3%!!

So the culprit is in the CopyPixels code, which is almost the same as ReadPixels, except that it should be definitely non-blocking and much faster!

EDIT: I'm sorry, I mean I used glCopyTexSubImage, not CopyPixels..

andras
02-05-2007, 01:10 PM
Ok, I've created a standalone test application:
http://www.andrasbalogh.com/gltest.zip

All the interesting parts are in the init2() and render_scene2() functions. As is, it uses up 50% CPU while running. When I comment out the ReadPixels line, it drops down to 3%.

EDIT: this is not exactly the same computer I used for testing before, so it's a dual core, and both cores are running at 50%, when using ReadPixels!

andras
02-06-2007, 02:14 PM
Could you guys run this test program (see previous post), and let me know how it works? I'd really appreciate that.

Yooyo: what do you mean when you say your code works? This one works too, and runs fairly fast, but uses up all the CPU. I just don't see why yours would behave differently. What HW and driver are you running?
Also, I'm not sure why you would use PBOs in your case, when you Map() it right after the read. Doesn't that defeat the purpose of the PBO?

Doomhammer
02-07-2007, 04:23 AM
Hi andras! I ran your program, but it crashes and Windows closes it... Did I miss something?

yooyo
02-07-2007, 05:25 AM
OK... here is an update...
Using PBOs for readback has some advantages. First, create two PBOs for readback. After rendering finishes, use the previous PBO to map the buffer and copy the result to system memory, and use the current PBO to initiate the readback (call glReadPixels). As I understand PBOs, glReadPixels will be posted to the command queue and executed when the GPU finishes rendering.


static void render_scene2()
{
// readback current frame
BindBuffer(GL_PIXEL_PACK_BUFFER_ARB, g_pbo[currentBuffer]);
GLCALL(glReadPixels(0, 0, g_width, g_height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, 0));

// copy prev frame to system memory
BindBuffer(GL_PIXEL_PACK_BUFFER_ARB, g_pbo[prevBuffer]);
void* data = MapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
// do something with data
memcpy(system_memory, data, g_width * g_height * 4);
UnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);

BindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

int tmp;
tmp = currentBuffer; currentBuffer = prevBuffer; prevBuffer = tmp;
}

I benchmarked this code. On my machine (P4 dual core 3.0 GHz + GF7600GT-PCIX (fw 96.89) + WinXP SP2) at a resolution of 1280x1024 the app can render and read back at ~60 fps, and overall CPU usage is ~60%. When I benchmark at a smaller resolution, CPU usage increases a bit but FPS gets a 100% boost. At 640x480 fps increases to ~140.
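
As a rough cross-check of those figures, the readback throughput they imply can be worked out with plain arithmetic. The sketch below assumes 4 bytes per pixel (BGRA with 8 bits per channel, as in the snippet); `frame_bytes` and `readback_bytes_per_second` are hypothetical helper names.

```c
#include <assert.h>

/* Bytes needed for one frame of w x h pixels at bytes_per_pixel each. */
static unsigned long long frame_bytes(unsigned w, unsigned h,
                                      unsigned bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return (unsigned long long)w * h * bytes_per_pixel;
}

/* Sustained readback rate if every frame is read back in full. */
static unsigned long long readback_bytes_per_second(unsigned w, unsigned h,
                                                    unsigned bytes_per_pixel,
                                                    unsigned fps)
{
    return frame_bytes(w, h, bytes_per_pixel) * fps;
}
```

For 1280x1024 at 60 fps this gives 314572800 bytes per second (300 MiB/s), and for 640x480 at 140 fps it gives 172032000 bytes per second (about 164 MiB/s).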

andras
02-07-2007, 07:40 AM
Originally posted by Doomhammer:
Hi andras! I ran your program, but it crashes and Windows closes it... Did I miss something?

Hmm, that's strange, I've tested it on multiple computers and it runs fine. It's also a really simple program. I've included the source and project files, could you build it in debug and see where it crashes? Thanks.

EDIT: I made the test app even simpler, and added some error checking, in case the PBO is not supported.

andras
02-07-2007, 07:49 AM
Originally posted by yooyo:
I did benches for this code.. on my machine P4 dual core 3.0Ghz + GF7600GT-PCIX(fw 96.89) + WinXP SP2 at resolution 1280x1024 app can render and readback ~60fps and overall CPU usage is ~60%. When I do benchs in smaller resolution CPU usage increase a bit but FPS got 100% boost. At 640x480 fps increase to ~140.

I think 60% is waay too high for such a simple app! And it stays high, even if I read one pixel. Insert this line into main, to see how it goes up to 100% on a single core:
SetThreadAffinityMask(GetCurrentThread(), 0x00000001);

BTW: I didn't know you were double buffering the PBOs, that makes sense. In my original code, I had a ring of PBOs.
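
The bookkeeping for such a ring can be sketched as pure index arithmetic, independent of any GL calls. This is only an illustration under assumptions: `RingSlots` and `ring_slots` are hypothetical names, and the oldest slot is taken to be the one that will be overwritten on the next frame.

```c
#include <assert.h>

/* Hypothetical helper: which PBO slots to use for a given frame. */
typedef struct {
    int read_slot;  /* slot to bind for this frame's glReadPixels */
    int map_slot;   /* slot to map this frame, or -1 while the ring fills */
} RingSlots;

static RingSlots ring_slots(long frame, int ring_size)
{
    RingSlots s;
    assert(frame >= 0 && ring_size >= 2);
    s.read_slot = (int)(frame % ring_size);
    /* The oldest data sits in the slot that gets overwritten next frame.
       During the first ring_size - 1 frames that slot has never been
       written, so there is nothing to map yet. */
    s.map_slot = (frame >= ring_size - 1) ? (int)((frame + 1) % ring_size) : -1;
    return s;
}
```

With `ring_size` 2 this reduces to the double-buffered swap (results arrive one frame late); with `ring_size` 3 the data mapped on frame 2 is the one read on frame 0, so the GPU gets two frames to finish the transfer before the CPU touches it.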

yooyo
02-07-2007, 03:33 PM
http://developer.nvidia.com/object/fast_texture_transfers.html

andras
02-07-2007, 04:19 PM
Originally posted by yooyo:
http://developer.nvidia.com/object/fast_texture_transfers.html

Yooyo, I'm already doing this. When I said I use a ring of PBOs, I meant the exact same technique that's at the very end of this document (page 5), titled "Map Different Frames to Different PBOs".

Also, this is the theory, and I understand how it *should* work. But in practice it doesn't! It says that reading into a PBO is asynchronous, but in practice, it is not! You can see in the test program, even when I read into a PBO, it does not return instantly, but instead stalls the CPU!

Ozo
02-08-2007, 06:38 PM
Interesting benchmarks from GPUBench:

http://graphics.stanford.edu/projects/gpubench/results/

We see that a Radeon X1900XTX can do glReadPixels at a rate of about 200MB/sec, ... while an nVidia 7900GTX or 8800GTX will sustain more than 800MB/sec. That's a 4:1 performance ratio!

I would like to hear from ATI as to why they perform so badly.

Ozo.

andras
02-08-2007, 07:29 PM
Originally posted by Ozo:
I would like to hear from ATI as to why they perform so badly.

And I would like to hear from nVidia as to why reading into PBOs blocks! :)

jeffb
02-08-2007, 08:36 PM
Originally posted by andras:
And I would like to hear from nVidia as to why reading into PBOs block! :)

Andras, please try requesting 8 AlphaBits in your pixelformat.

andras
02-08-2007, 09:06 PM
Originally posted by jeffb:
Andras, please try requesting 8 AlphaBits in your pixelformat.

Hah, that did the trick! It is *much* faster now! CPU usage is at 1% while reading back every frame!

Thanks a lot! I owe you a beer! ;)

yooyo
02-09-2007, 01:14 AM
Wow.. I never checked your pixelformat (headbang). Now, here are some rules:
pixelformat (colorbits:alphabits) and glReadPixels params
32:0 - (GL_BGR, GL_UNSIGNED_INT_8_8_8_8_REV)
32:8 - (GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV)

I'll check readback performance from an FBO later. Generally, the backbuffer or FBO should have a pixel size of 32bit*N.

andras
02-09-2007, 05:58 AM
Oh, wait! This works nice with the color buffer, but how do I read the depth efficiently?

EDIT: You also *have* to use 32 bit BGRA, reading back BGR seems to throw an OpenGL error (INVALID_OPERATION), when using GL_UNSIGNED_INT_8_8_8_8_REV (it works otherwise, but slowly).
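
The INVALID_OPERATION is what the OpenGL rules for packed pixel types predict: GL_UNSIGNED_INT_8_8_8_8_REV packs four components into one 32-bit word, so it is only legal with a four-component format. A minimal sketch of just that one rule follows; `format_components` and `read_pixels_error` are hypothetical helpers, not GL functions, and they cover only the formats and types discussed in this thread.

```c
/* Values as defined in the OpenGL headers, repeated here so the sketch
   compiles without them. Drop these if you include the real headers. */
#define GL_NO_ERROR                 0
#define GL_INVALID_OPERATION        0x0502
#define GL_UNSIGNED_BYTE            0x1401
#define GL_RGB                      0x1907
#define GL_RGBA                     0x1908
#define GL_BGR                      0x80E0
#define GL_BGRA                     0x80E1
#define GL_UNSIGNED_INT_8_8_8_8_REV 0x8367

/* Number of color components in the formats used in this thread. */
static int format_components(unsigned format)
{
    switch (format) {
    case GL_RGB:  case GL_BGR:  return 3;
    case GL_RGBA: case GL_BGRA: return 4;
    default:                    return 0;  /* not handled by this sketch */
    }
}

/* Error expected from glReadPixels for this format/type pair,
   considering only the packed-type rule. */
static unsigned read_pixels_error(unsigned format, unsigned type)
{
    if (type == GL_UNSIGNED_INT_8_8_8_8_REV && format_components(format) != 4)
        return GL_INVALID_OPERATION;
    return GL_NO_ERROR;
}
```

So `read_pixels_error(GL_BGR, GL_UNSIGNED_INT_8_8_8_8_REV)` yields GL_INVALID_OPERATION, while GL_BGRA with the same type, or GL_BGR with GL_UNSIGNED_BYTE, passes this check.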

andras
02-09-2007, 07:45 AM
I think I've just found it! This seems to work:
glReadPixels(0, 0, 1, 1, GL_DEPTH_STENCIL_EXT, GL_UNSIGNED_INT_24_8_EXT, 0);
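
Once the packed values have been mapped back to the CPU, each 32-bit word has to be split. With GL_UNSIGNED_INT_24_8 the depth value sits in the upper 24 bits and the stencil value in the lower 8. A small sketch, assuming a 24-bit fixed-point depth buffer; the function names are made up for illustration.

```c
#include <stdint.h>

/* Upper 24 bits: fixed-point depth. */
static uint32_t depth24_from_packed(uint32_t packed)
{
    return packed >> 8;
}

/* Lower 8 bits: stencil. */
static uint8_t stencil8_from_packed(uint32_t packed)
{
    return (uint8_t)(packed & 0xFFu);
}

/* Depth scaled to the [0, 1] range. */
static double depth_normalized(uint32_t packed)
{
    return depth24_from_packed(packed) / 16777215.0;  /* 2^24 - 1 */
}
```

For example, a packed value of 0x12345678 splits into depth 0x123456 and stencil 0x78.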