PBO + glReadPixels not so fast?

10-15-2008, 08:18 AM

I'm trying to speed up some code that pulls data from the framebuffer using glReadPixels.

I've created two PBOs with usage set to GL_STREAM_READ_ARB. My rendering code then alternates between the two PBOs and does the following:

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, current);
glReadPixels(..., 0);

I was under the impression that glReadPixels would return immediately, but these three lines take about 6 ms for a 1024 x 1024 framebuffer.

Currently the code isn't doing any map/unmap so the data should never leave the GPU.

I'm using a Quadro FX 3500 with the latest drivers (169.96).

Am I doing something wrong or is this to be expected?


10-16-2008, 07:38 AM
glReadPixels has to wait until the rendering is done before it can start reading, so yes, it's as expected.

10-16-2008, 11:06 AM
Use two PBOs. Bind the 1st, read pixels, bind the 2nd, map and copy data to sysmem, unmap, unbind all PBOs, then swap PBO names. Repeat this every frame.
One more thing: use the GL_BGR or GL_BGRA pixel format.

10-17-2008, 02:05 AM
Good reading on PBO upload/readback - http://www.songho.ca/opengl/gl_pbo.html

PS: Yooyo seems to be a little bit tired of answering the same question several times a month :)

10-20-2008, 12:48 AM
Use two PBOs. Bind the 1st, read pixels, bind the 2nd, map and copy data to sysmem, unmap, unbind all PBOs, then swap PBO names. Repeat this every frame.
One more thing: use the GL_BGR or GL_BGRA pixel format.

Exactly what I'm doing. But as I said, it's not as fast as I expected. I don't see much improvement over just doing an ordinary glReadPixels.


10-20-2008, 04:35 PM
Then you are doing something wrong...
1. Render frame
2. bind pbo1
3. readback
4. bind pbo2
5. copy previous frame from pbo to sysmem... map buffer, copy data to sysmem (use some fast memcpy code)
6. unmap buffer
7. unbind pbos
8. swap pbo1 and pbo2

So the frame will be in sysmem with a one-frame delay. With a PBO, the readback call is nonblocking. But map buffer can be a blocking call if there is a pending operation on the currently bound buffer. So if you call map buffer too soon, it will block. If there are no pending operations, map buffer returns very quickly.
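The one-frame delay described above can be illustrated without any OpenGL calls. The sketch below only models which buffer index each step would touch and which frame's pixels the mapped buffer would hold. `Step`, `PingPong`, `advance` and `simulate` are made-up names for illustration, not part of the GL API or of anyone's code in this thread.

```cpp
#include <cassert>
#include <vector>

// Which buffer each per-frame step would touch, and which frame's pixels
// the mapped buffer would contain (-1 = nothing has been read into it yet).
struct Step {
    int readIndex;    // buffer that glReadPixels would target this frame
    int mapIndex;     // buffer that would be mapped and copied this frame
    int mappedFrame;  // frame number whose pixels sit in the mapped buffer
};

class PingPong {
public:
    // Advance one frame: read into one buffer, map the other, then swap.
    Step advance(int frameNumber) {
        assert(frameNumber >= 0);
        const int readIndex = active_;
        const int mapIndex = 1 - active_;
        Step step{readIndex, mapIndex, contents_[mapIndex]};
        contents_[readIndex] = frameNumber;  // readback issued for this frame
        active_ = 1 - active_;               // swap roles for the next frame
        return step;
    }

private:
    int active_ = 0;
    int contents_[2] = {-1, -1};
};

// Run the schedule for a number of frames and collect the steps.
std::vector<Step> simulate(int frames) {
    PingPong pp;
    std::vector<Step> steps;
    for (int f = 0; f < frames; ++f) steps.push_back(pp.advance(f));
    return steps;
}
```

In this model the buffer mapped during frame N always holds frame N-1, so the map happens a full frame after the matching readback was issued. That gap is what gives the transfer time to finish before the map is requested.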

10-21-2008, 01:57 AM
Ok, so the question is what am I doing wrong?

The initialization code looks roughly like this:

void SomeClass::initPBO() {
    glGenBuffersARB(2, m_ids);
    for (int i = 0; i < 2; ++i) {
        glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, m_ids[i]);
        glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB, m_width * m_height * 4, 0, GL_STREAM_READ_ARB);
    }
    m_active = 0;
}

and the capture code looks like this:

void SomeClass::capture() {
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, m_ids[m_active]);
    glReadPixels(0, 0, m_width, m_height, GL_BGRA, GL_UNSIGNED_BYTE, 0);
    m_active = 1 - m_active;
}

As you can see, right now I'm not even mapping the buffers (eventually I will of course, otherwise this whole exercise would be kind of pointless), and the code in capture still takes about 6 ms. This could still stall, I guess, if one were rendering at a high enough framerate. However, my rendering is capped at 15 fps, so this shouldn't be an issue.

The values of m_width and m_height have not changed since I created the buffers, so their sizes are still valid.
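For a sense of scale, the 6 ms figure can be turned into an implied transfer rate. This is only a back-of-the-envelope sketch: it assumes 4 bytes per pixel (as in the glBufferDataARB call above) and treats the whole 6 ms as transfer time. `frameBytes` and `mibPerSecond` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of one frame with the given bytes per pixel
// (4 for GL_BGRA + GL_UNSIGNED_BYTE).
std::size_t frameBytes(std::size_t width, std::size_t height,
                       std::size_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}

// Transfer rate in MiB/s implied by moving `bytes` in `milliseconds`.
double mibPerSecond(std::size_t bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    const double seconds = milliseconds / 1000.0;
    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / seconds;
}
```

With these numbers, `mibPerSecond(frameBytes(1024, 1024, 4), 6.0)` comes to roughly 667 MiB/s for a 4 MiB frame.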


10-21-2008, 07:28 AM
Could I for some reason be getting a PBO in system memory? From what I can see in the spec:


there's really nothing preventing this, am I wrong?

For the record, I've tested Song Ho Ahn's Asynchronous Read-back example and there I see a very clear difference in read speed when using PBO. From what I can see I'm not doing anything differently in my code, except that I'm using a lot more GPU memory for other things.


10-21-2008, 12:00 PM
"Could I for some reason be getting a PBO in system memory?"
Yes.

"From what I can see in the spec: [...] there's really nothing preventing this, am I wrong?"
No.

10-22-2008, 02:45 AM
Try with GL_STATIC_READ. Check your driver control panel.. maybe you have checked some forced AA or such... can you post repro case?

Song Ho Ahn's demo shows 3.1 Mpix/sec on my laptop, but I achieved 1.6 GB/sec (same as CUDA).

10-23-2008, 03:21 AM
Try with GL_STATIC_READ. Check your driver control panel.. maybe you have checked some forced AA or such... can you post repro case?

Song Ho Ahn's demo shows 3.1 Mpix/sec on my laptop, but I achieved 1.6 GB/sec (same as CUDA).

Unfortunately the precompiled binary for Song Ho Ahn's demo uses a screen size of 256 x 256 and waits for vertical refresh. With a refresh rate of 60 Hz this means the transfer rate will cap at 3.7 Mpixels/s regardless of whether PBOs are on or off. (The figure of 3.1 Mpixels/s suggests you're using a refresh rate of 50 Hz, correct?)

You will have to recompile the project yourself and increase the buffer sizes and disable vsync. When doing this you will see a clear difference between using PBO and not using PBO.
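The vsync cap can be checked with a line of arithmetic: one 256 x 256 frame per refresh. The sketch below assumes exactly one readback per vertical refresh, and the division by 2^20 is an inference about which "mega" the quoted figures use. `vsyncCapPixelsPerSecond` and `vsyncCapMebiPixels` are hypothetical names.

```cpp
#include <cassert>

// Upper bound on pixels read per second when exactly one frame is read
// per vertical refresh.
double vsyncCapPixelsPerSecond(int width, int height, double refreshHz) {
    assert(width > 0 && height > 0 && refreshHz > 0.0);
    return static_cast<double>(width) * height * refreshHz;
}

// Same bound expressed in units of 2^20 pixels, which appears to be the
// unit behind the figures quoted in the post (an inference).
double vsyncCapMebiPixels(int width, int height, double refreshHz) {
    return vsyncCapPixelsPerSecond(width, height, refreshHz) / (1024.0 * 1024.0);
}
```

At 60 Hz this gives 3.75 and at 50 Hz about 3.13, in line with the 3.7 and 3.1 Mpixels/s mentioned.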

I'm using the exact same code in my application and I'm not seeing any improvement over not using PBOs. For lack of better theories, this leads me to believe that I'm getting system-memory PBOs because there's not enough GPU RAM left to allocate them there.

Any other theories for what could be holding glReadPixels up?


10-23-2008, 10:19 AM
This is my way...

#define valloc(size, prot) VirtualAllocEx(GetCurrentProcess(), NULL, (size), MEM_COMMIT, (prot))
#define vfree(mem) VirtualFreeEx(GetCurrentProcess(), mem, 0, MEM_RELEASE)
#define vlock(mem, size) VirtualLock((mem), (size))

#define BUFSIZE (4*1024)

// globals
GLuint m_pbos[NUMR_PBO]; // PBO pool
int vram2sys; // index of PBO used to copy from vram to sysmem
int gpu2vram; // index of PBO used to copy framebuffer to vram
unsigned char* membuffer = NULL;
unsigned char* tempbuff = NULL; // used during fast mem copy
unsigned int memsize;

// call this with size of framebuffer
void InitReadback(int xsize, int ysize)
{
    tempbuff = (unsigned char*)valloc(BUFSIZE, PAGE_READWRITE);
    vlock(tempbuff, BUFSIZE);

    memsize = xsize * ysize * 4;

    if (m_pbos[0] == 0)
        glGenBuffers(NUMR_PBO, m_pbos);

    for (int i = 0; i < NUMR_PBO; i++)
    {
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[i]);
        // allocate one frame of storage per PBO; the usage hint here is an assumption
        glBufferData(GL_PIXEL_PACK_BUFFER_ARB, memsize, NULL, GL_STREAM_READ);
    }

    vram2sys = 0;
    gpu2vram = NUMR_PBO - 1;

    if (membuffer != NULL)
    {
        vfree(membuffer); // release the previous buffer before reallocating
        membuffer = NULL;
    }

    membuffer = (unsigned char*)valloc(memsize, PAGE_READWRITE);
    vlock(membuffer, memsize);
}

// call this once per frame or slice...
void ReadBack(int xsize, int ysize)
{
    // first.. post read pixels
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[gpu2vram]);
    glReadPixels(0, 0, xsize, ysize, GL_BGRA, GL_UNSIGNED_BYTE, 0);

    // then copy previous frame from vram to sysmem (membuffer)
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[vram2sys]);

    void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
    if (data != NULL)
    {
        FastMemCopy(membuffer, data, tempbuff, BUFSIZE, memsize);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
    }

    // unbind PBO
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

    // shift names
    GLuint temp = m_pbos[0];
    for (int i = 1; i < NUMR_PBO; i++)
        m_pbos[i-1] = m_pbos[i];
    m_pbos[NUMR_PBO - 1] = temp;
}

// audiofreak tnx for this
// (x86 inline assembly; the label positions follow from the jump targets)
void FastMemCopy(void *dst, const void *src, void *buf, size_t bufsize, size_t nbytes)
{
    __asm
    {
        mov esi, src
        mov edi, dst
        mov eax, buf
        mov ebx, bufsize
        bsr ecx, ebx          // ecx = log2(bufsize)
        mov ebx, nbytes
        shr ebx, cl           // ebx = number of whole bufsize chunks
    main_loop:
        test ebx, ebx
        jz main_loop_end
        mov edx, eax
        mov ecx, bufsize
        shr ecx, 7            // 128 bytes per inner iteration
    L1_cache_loop:            // source -> temp buffer (stays in L1)
        test ecx, ecx
        jz L1_cache_loop_end
        prefetchnta [esi + 64 * 10]
        movaps xmm0, [esi]
        movaps xmm1, [esi + 16]
        movaps xmm2, [esi + 32]
        prefetchnta [esi + 64 * 11]
        movaps xmm3, [esi + 48]
        movaps xmm4, [esi + 64]
        movaps xmm5, [esi + 80]
        movaps xmm6, [esi + 96]
        movaps xmm7, [esi + 112]

        movaps [edx], xmm0
        movaps [edx + 16], xmm1
        movaps [edx + 32], xmm2
        movaps [edx + 48], xmm3
        movaps [edx + 64], xmm4
        movaps [edx + 80], xmm5
        movaps [edx + 96], xmm6
        movaps [edx + 112], xmm7

        add esi, 128
        add edx, 128

        sub ecx, 1
        jmp L1_cache_loop
    L1_cache_loop_end:
        mov edx, eax
        mov ecx, bufsize
        shr ecx, 7
    stream_loop:              // temp buffer -> destination, non-temporal stores
        test ecx, ecx
        jz stream_loop_end
        movaps xmm0, [edx]
        movaps xmm1, [edx + 16]
        movaps xmm2, [edx + 32]
        movaps xmm3, [edx + 48]
        movaps xmm4, [edx + 64]
        movaps xmm5, [edx + 80]
        movaps xmm6, [edx + 96]
        movaps xmm7, [edx + 112]

        movntps [edi], xmm0
        movntps [edi + 16], xmm1
        movntps [edi + 32], xmm2
        movntps [edi + 48], xmm3
        movntps [edi + 64], xmm4
        movntps [edi + 80], xmm5
        movntps [edi + 96], xmm6
        movntps [edi + 112], xmm7

        add edx, 128
        add edi, 128

        sub ecx, 1
        jmp stream_loop
    stream_loop_end:
        sub ebx, 1
        jmp main_loop
    main_loop_end:
    }
}
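For readers not on 32-bit x86, the same two-stage pattern (source to a small scratch buffer, then scratch buffer to destination) can be written portably. The sketch below is not a drop-in replacement for the routine above: it uses plain memcpy, so it does not reproduce the prefetching or non-temporal stores, and no performance claim is made for it. `stagedCopy` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copy nbytes from src to dst in chunks, staging each chunk through a
// small scratch buffer. The assembly routine, as posted, appears to copy
// only whole multiples of the scratch size; this version also copies the
// remaining tail.
void stagedCopy(void* dst, const void* src, void* scratch,
                std::size_t scratchSize, std::size_t nbytes) {
    assert(scratchSize > 0);
    unsigned char* out = static_cast<unsigned char*>(dst);
    const unsigned char* in = static_cast<const unsigned char*>(src);
    while (nbytes > 0) {
        const std::size_t chunk = nbytes < scratchSize ? nbytes : scratchSize;
        std::memcpy(scratch, in, chunk);   // stage 1: source -> scratch
        std::memcpy(out, scratch, chunk);  // stage 2: scratch -> destination
        in += chunk;
        out += chunk;
        nbytes -= chunk;
    }
}
```

The argument order mirrors FastMemCopy (destination, source, scratch buffer, scratch size, byte count), so it could stand in for that call in ReadBack when experimenting on other platforms.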

10-23-2008, 11:04 AM
The demo, "pboPack", does not measure the performance of glReadPixels() alone. It performs 3 things:
1. Read pixels from framebuffer with glReadPixels().
2. Modify the pixels in add().
3. Draw the modified pixels with glDrawPixels().

You will get the pure throughput of glReadPixels() + PBO by disabling steps #2 and #3 in my code.

Also, I'd like to mention that the pboPack demo does not use PBO for glDrawPixels() because of an OpenGL driver bug. Most video cards except the nVidia Quadro failed on glDrawPixels() + PBO when I released this demo, so I took it out of the code.

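The proper usage of glDrawPixels() with PBO is like this. You may get a better result by replacing glDrawPixels() in my code:

if(pboUsed) // with PBO: last argument is an offset into the bound buffer
{
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[nextIndex]);
    glDrawPixels(width, height, format, type, 0); // placeholder argument names
}
else // without PBO: last argument is a client-memory pointer
    glDrawPixels(width, height, format, type, pixels); // placeholder argument names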

I tested today on an ATI Radeon X1900 with the above changes and with the pixel modification block disabled, and I got these numbers
(combination of glReadPixels() and glDrawPixels()):

with PBO: 68.6 Mpixels/s = 274.4 MB/s
without PBO: 38.6 Mpixels/s = 154.4 MB/s

with PBO: 273.7 Mpixels/s = 1094.8 MB/s
without PBO: 63.3 Mpixels/s = 253.2 MB/s

with PBO: 568.7 Mpixels/s = 2274.8 MB/s
without PBO: 79.7 Mpixels/s = 318.8 MB/s

You will get higher numbers if you test glReadPixels() only.
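The MB/s columns above are the Mpixels/s figures multiplied by 4 bytes per pixel, and the relative gain is easier to see as a ratio. A small sketch of both calculations follows; `megabytesPerSecond` and `speedup` are hypothetical helper names, and the 4-byte pixel size is inferred from the listed numbers.

```cpp
#include <cassert>

// Convert a pixel rate to a byte rate for a fixed pixel size.
double megabytesPerSecond(double megapixelsPerSecond, int bytesPerPixel) {
    assert(bytesPerPixel > 0);
    return megapixelsPerSecond * bytesPerPixel;
}

// How many times faster the PBO path is than the non-PBO path.
double speedup(double withPbo, double withoutPbo) {
    assert(withoutPbo > 0.0);
    return withPbo / withoutPbo;
}
```

For the three result pairs listed, `speedup` gives roughly 1.8x, 4.3x and 7.1x in favour of PBO.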

10-24-2008, 03:13 AM
I'm pretty certain that there's nothing wrong with my PBO code. So I guess my question is what could make glReadPixels stall (when using PBO that is)? So far the only thing I can think of is that I may be getting a software fallback PBO because I've used up all the GPU ram on other stuff.

Other theories?


10-24-2008, 04:52 AM
Can you benchmark (profile) the following calls:
- glBindBuffer
- glReadPixels
- glMapBuffer
- glUnmapBuffer.

glBindBuffer should be instant, and glReadPixels too. If glReadPixels stalls, then something is really wrong there. glMapBuffer can stall if a pending glReadPixels is not finished.
If you have frequent glReadPixels calls, use several PBOs for that.
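One way to time each call separately is a small wall-clock wrapper. The sketch below uses only the C++ standard library; `timeMs` is a hypothetical helper, and it measures how long the calling thread spends inside the call, which is the blocking time in question, not how long the GPU takes to finish the work.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Run `call` once and return how long it took, in milliseconds.
double timeMs(const std::function<void()>& call) {
    assert(static_cast<bool>(call));
    const auto start = std::chrono::steady_clock::now();
    call();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each GL call can then be wrapped in its own lambda, for example one for the bind, one for glReadPixels, one for the map and one for the unmap, and the four durations logged per frame to see which call actually blocks.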

04-29-2009, 06:56 AM
I never managed to get glReadPixels any faster with PBOs. The 6 ms spent in glReadPixels wasn't a huge problem at the time, so I simply left the problem alone.

Now, however, the problem has become more urgent. Since we upgraded to revision 182.08 of NVIDIA's Quadro driver, the glReadPixels operation takes over 30 ms!

I've tried every combination of usage flag (GL_STREAM_READ etc.) and format (GL_BGRA etc.), but with no difference in speed.

I also tried another approach: instead of using two PBOs, I used two FBOs to which I transferred the framebuffer with glCopyTexImage2D, and then used glReadPixels on the FBO that was not currently being copied to. Unfortunately this was exactly as slow.

Does anyone have any theories what could be causing this huge stall?