glReadPixels speed problem on NVIDIA card

Hello,

Yes, I know glReadPixels is quite slow, but I have to use it, and that is not the point :slight_smile:

For my application I am interested in fast readback of my rendered scene. That’s why I measured how long the transfer takes.

So I take the time, read pixels from the buffer and take the time again to estimate the elapsed time.
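
For reference, the measurement described above could be sketched roughly like this. It is only a minimal sketch, assuming a C++11 compiler; `time_seconds` and `fake_readback` are made-up names, and `fake_readback` is just a stand-in for the real glReadPixels call.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename Work>
double time_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    assert(seconds >= 0.0);  // steady_clock is monotonic, so this should hold
    return seconds;
}

// Hypothetical stand-in for the readback call that is being measured.
static void fake_readback() {
    volatile unsigned sum = 0;
    for (unsigned i = 0; i < 100000; ++i) sum = sum + i;
}
```

Calling `time_seconds(fake_readback)` once per frame and printing the result would give a list of per-frame times in seconds, like the ones in this post.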

I tested this on my laptop, which I use for testing purposes and which has an ATI Mobility Radeon X700, and got an average of 15 ms. That is quite high, but it is not the problem at all!

I have done the same test on my target machine, equipped with an NVIDIA Quadro FX 3500.

I got very strange results:

0.002538
0.002548
0.002540
0.002545
0.002557
0.002578
0.002558
0.002543

0.020149
0.034502
0.024774
0.039152
0.037828
0.028562
0.029556
0.030189
0.040288
all times in seconds.

The first frames are copied very fast, in a few ms! But after 22 or 23 frames the copy time rises by about a factor of 10! That is double or triple the time on my mobile Radeon!
What is happening here?

I have done the test with glReadPixels and glGetTexImage on a bound FBO, with glReadPixels plus MapBuffer and PB/PBO, and with glReadPixels on the default framebuffer. The drop-off is the same in all cases!

I added a sleep command between my two measurements, and something else very strange happened.
For sleep times from 0 to 10 ms nothing changed, but if the application sleeps for 100 or 1000 ms, the drop-off does not occur.

0.103809
0.103563
0.102490
0.102299
0.103093
0.102961
0.102787
0.102594
0.104578
0.103186
0.102908
0.103701
0.103597
0.102443
0.103125
0.103041
0.102860
0.102666
0.102475
0.102294
0.103078
0.102886
0.102705

I am very confused about why this happens…
Oh, and all tests were done on Windows XP!

Is there a bug in the NVIDIA driver (I used the latest ForceWare driver) or in NVIDIA’s OpenGL implementation?
Thanks for all replies!
Chris

Try putting a glFinish() before you start timing things. It looks like the GPU/driver may still be working on your previous GL commands at that point, and the glReadPixels() will have to wait for them to complete before it can return the image.
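
To illustrate the idea, here is a toy model with no real OpenGL in it. It is only a sketch under stated assumptions: `FakeGpu` and both helper functions are made-up names, time is a virtual counter in milliseconds, and the readback is assumed to be fully synchronous.

```cpp
#include <cassert>

// Toy model (not real OpenGL): time is a virtual counter in milliseconds.
struct FakeGpu {
    double now_ms = 0.0;      // virtual clock
    double pending_ms = 0.0;  // queued rendering work that has not run yet

    // Like issuing draw calls: returns immediately, the work is only queued.
    void draw(double cost_ms) {
        assert(cost_ms >= 0.0);
        pending_ms += cost_ms;
    }

    // Like glFinish(): blocks until the queue is empty.
    void finish() {
        now_ms += pending_ms;
        pending_ms = 0.0;
    }

    // Like a synchronous glReadPixels(): drains the queue, then copies.
    void read_pixels(double copy_ms) {
        finish();
        now_ms += copy_ms;
    }
};

// Timer started right after draw(): the queued rendering is billed to the read.
inline double timed_without_finish(double draw_ms, double copy_ms) {
    FakeGpu gpu;
    gpu.draw(draw_ms);
    const double start = gpu.now_ms;
    gpu.read_pixels(copy_ms);
    return gpu.now_ms - start;  // draw_ms + copy_ms
}

// Queue drained before the timer starts: only the copy is measured.
inline double timed_with_finish(double draw_ms, double copy_ms) {
    FakeGpu gpu;
    gpu.draw(draw_ms);
    gpu.finish();
    const double start = gpu.now_ms;
    gpu.read_pixels(copy_ms);
    return gpu.now_ms - start;  // copy_ms only
}
```

With made-up costs of 28 ms of queued rendering and a 2 ms copy, the first function reports 30 ms and the second reports 2 ms, even though the copy itself is the same in both cases.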

About the performance, did you try different formats, such as RGBA, BGRA, RGB, etc ?

And what is the size of the readback region ?

thanks jgennis, I will try it out !

ZbuffeR : yes I’ve tried out RGBA,RGB,BGRA,BGR and such … nothing changed !

I was just about to post a question about why glReadPixels is so slow, when I saw this thread. :slight_smile:
Actually, I don’t really care about transfer speed, but I do care about it using up 100% of my CPU, even when I’m reading into a PBO, which should be an asynchronous, CPU-free operation!

I’ve made a small test program, where I am reading a tiny chunk of the framebuffer into a PBO every frame. For the sake of testing, I ignore the result, so I don’t even call glMapBuffer() or glGetBufferSubData(). It should not block at all, and should not use any CPU (it should copy from VRAM to VRAM without me ever touching the data!)…

I’ve tried multiple buffer object usage types, multiple PBO sizes, multiple formats… None of which seem to make any difference.

I haven’t done serious profiling yet, but that’s the next step.

This is on a 7600GS, driver v93.71

Ok, I’ve made some quick measurements:
glReadPixels takes the same amount of CPU time, regardless of reading to client memory, or to PBO.
Executing glFinish before the ReadPixels reduces glReadPixels time drastically.

This makes me suspect that the driver does an implicit glFinish() in glReadPixels, even when reading into a PBO.

How did you measure the timings (are you using QueryPerformanceCounter)? Do you have a dual-core CPU? If so, Windows XP has a bug with the performance counters, because the counter exists on both CPU cores and the two are not synchronized.

Please post a code snippet (how you create the PBO, the readback code, …). I have played with this and got VERY nice results.

I have used RDTSC on a single core computer.

My original test code is using our internal wrapper, I’ll rewrite it using plain GL and post it in a little bit.

Ok, here’s the snippet:

///////////////////////////////////
// init (run only once):
GenBuffers(1, (GLuint*)&m_name);
BindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_name);
// allocate 1024 bytes of storage with no initial data
BufferData(GL_PIXEL_PACK_BUFFER_ARB, 1024, 0, GL_DYNAMIC_READ);

// runtime (every frame):
// the PBO is still bound, so the last argument is a byte offset into it
::glReadPixels(0, 0, 1, 1, GL_BGRA, GL_UNSIGNED_BYTE, 0);
///////////////////////////////////

Originally posted by jgennis:
Try putting a glFinish() before you start timing things. It looks like the GPU/driver may still be working on your previous GL commands at that point, and the glReadPixels() will have to wait for them to complete before it can return the image.
You’re right, that works! Now the correct time is shown.
But how can I combine this with an asynchronous PBO transfer?
If I have to wait for the glFinish() call (which takes about 210 ms), the 2 ms for the copy itself hardly matter.

{
	m_FrameSize = outXres * outYres * 4; // framebuffer size in bytes (xres * yres * 4 bytes per pixel)

	glGenBuffers(1, &PBOreadback); // PBOreadback is declared as GLuint PBOreadback;

	// initialise PBO for readback: allocate storage, no initial data
	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, PBOreadback);
	glBufferData(GL_PIXEL_PACK_BUFFER_ARB, m_FrameSize, NULL, GL_STATIC_READ);

	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
	m_CurrentReadIndex = 0;
	glPixelStorei(GL_PACK_ALIGNMENT, 4);
}
....
// this code reads back from the active FBO
// BUFFER_OFFSET(i) is assumed to be the usual helper macro: ((char *)NULL + (i))

glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, PBOreadback);
// with a pack buffer bound, the last argument is a byte offset into the PBO
glReadPixels(0, 0, outXres, outYres, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));
void *Memory = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
if (Memory != NULL) // glMapBuffer returns NULL on failure
{
	memcpy(m_sysMemPointer, Memory, m_FrameSize);
	glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

This code works here.

Thanks yooyo for the code! I will try it out!

But my point is: I have to wait for the glFinish(), which takes about 200 ms, and the readback itself only takes 2 ms … so what is the advantage of the asynchronous read? Only saving these 2 ms??

The asynchronous read is only useful if you don’t need the results right away. In that case, you shouldn’t have to wait for glFinish() to flush the rendering queue. OpenGL should only flush the queue when you actually read back the data (e.g. map the buffer), but if you do this a couple of frames later, then by that time you won’t have to wait for anything…
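
The "read it a couple of frames later" idea is usually done with a small ring of PBOs. Below is only the bookkeeping for that, as a rough sketch with no GL calls in it. `ReadbackRing` and its fields are made-up names, and the scheme (write into one slot, map the oldest one) is one common way to do it, not necessarily what any driver or poster here had in mind.

```cpp
#include <cassert>
#include <cstddef>

// Bookkeeping for a ring of N pixel-pack buffers (no GL calls here).
// Each frame: issue glReadPixels into write_slot, and map read_slot,
// which was last written N-1 frames ago.
struct ReadbackRing {
    std::size_t slots;      // number of PBOs in the ring
    std::size_t frame = 0;  // index of the frame about to be processed

    explicit ReadbackRing(std::size_t n) : slots(n) { assert(n >= 2); }

    struct Step {
        std::size_t write_slot;  // PBO to bind for this frame's glReadPixels
        std::size_t read_slot;   // PBO to map this frame
        bool read_valid;         // false until read_slot has been filled once
        std::size_t read_frame;  // frame whose pixels read_slot holds (if valid)
    };

    Step next() {
        Step s;
        s.write_slot = frame % slots;
        s.read_slot = (frame + 1) % slots;
        s.read_valid = frame + 1 >= slots;
        s.read_frame = s.read_valid ? frame + 1 - slots : 0;
        ++frame;
        return s;
    }
};
```

With 3 slots, the first two frames have nothing valid to map, and from the third frame on you map the pixels of the frame rendered two frames earlier. The price is that latency of N-1 frames, so this does not help if the data is needed immediately.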

Yooyo, I’m going to give your code snippet a try. I wonder if it has anything to do with reading from the FBO?

Oh OK, andras, I HAVE to have the data immediately, and 2 ms is not a big deal…
Thanks to everyone who replied :slight_smile:

Well, good for you :slight_smile: I can wait for a couple frames, but definitely cannot stall the CPU with a glFinish()!

Ok, so I was playing around with my FBO test app, and I’ve noticed something strange there too: When I have VSync on, it runs at a constant 60Hz, but uses up 100% CPU. Now, when I turn off VSync, the framerate goes up to 560+ FPS!! While using the same amount of CPU! This doesn’t make any sense!

Note that this sample program doesn’t even have ReadPixels in it, so it’s something else.

Hah, I’ve figured out the problem with my FBO test app, and it turns out that it IS related to ReadPixels!:
Originally I used this app to test procedural rendering into an alpha texture. Since you can’t render into a one-component texture, I had to render into an RGBA buffer first, then do a CopyPixels into the alpha texture, and then use that texture to render. When I removed the copy and changed the app to use the RGBA texture, the CPU usage dropped to around 3%!!

So the culprit is in the CopyPixels code, which is almost the same as ReadPixels, except that it should definitely be non-blocking and much faster!

EDIT: I’m sorry, I mean I used glCopyTexSubImage, not CopyPixels…

Ok, I’ve created a standalone test application:
http://www.andrasbalogh.com/gltest.zip

All the interesting parts are in the init2() and render_scene2() functions. As is, it uses up 50% CPU while running. When I comment out the ReadPixels line, it drops down to 3%…

EDIT: this is not exactly the same computer I used for testing before, so it’s a dual core, and both cores are running at 50%, when using ReadPixels!

Could you guys run this test program (see previous post), and let me know how it works? I’d really appreciate that.

Yooyo: what do you mean when you say your code works? This one works too, and runs fairly fast, but uses up all the CPU. I just don’t see why yours would behave different. What HW and driver are you running?
Also, I’m not sure why you would use PBOs in your case, when you Map() it right after the read. Doesn’t that defeat the purpose of the PBO?

Hi andras! I ran your program, but it crashes and Windows closes it… Did I miss something?