Low readback performance with PBO, help!



PixIn
03-16-2008, 11:51 AM
Hi,

I use a PBO approach to grab a bitmap from my 3D scene :p The steps are the following:

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo1);

glReadPixels(0, 0,bWIDTH,bHEIGHT,GL_BGRA, GL_UNSIGNED_BYTE, 0);

copymem(glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB)^, buffer, bWIDTH * bHEIGHT * 4);

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo2);

glReadPixels(0, 0,pWIDTH,pHEIGHT ,GL_BGRA, GL_UNSIGNED_BYTE, 0);

copymem(glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB)^, bitmapbuf, bWidth * bHeight * 4);
glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);

swap(pbo1,pbo2)

With this method I get 15~25 fps less than with a direct:

glReadPixels(0, 0,pWIDTH,pHEIGHT ,GL_BGRA, GL_UNSIGNED_BYTE, bitmapbuf);

From what I have read in many forums, the PBO path should be faster than plain glReadPixels :eek:


My card: NVIDIA GeForce 7600 GS
ForceWare version: 169.21
Bus: PCI Express x16
CPU: P4 3.0 GHz


Is there anything wrong with my PBO code? How can I speed it up?

Help please.

Thanks in advance :)

Brolingstanz
03-16-2008, 11:59 AM
Is this for a screenshot?

You want to use PBOs when you have something else to do while the transfer is taking place behind the scenes, not when you need the results straight away.

Check out the PBO spec for some common usage scenarios and example code.

mfort
03-16-2008, 12:10 PM
1) Why are you reading the data twice?
2) Why are you copying the data out of the PBO? The PBO memory is just fine.
3) Do not call MapBuffer right after ReadPixels. There is no benefit from the PBO then.

PixIn
03-16-2008, 12:14 PM
Brolingstanz wrote: "Is this for a screenshot?"

Yes, it is for a screenshot.

I have just changed it to the following to achieve an asynchronous readback:

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[0]);
glReadPixels(0, 0, imagewidth, imageheight/2, GL_BGRA, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[1]);
glReadPixels(0,imageheight/2,imagewidth,imageheight/2,GL_BGRA,GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

// Process partial images. Mapping the buffer waits for
// outstanding DMA transfers into the buffer to finish.
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[0]);
pboMemory1 = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
processImage(pboMemory1);

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[1]);
pboMemory2 = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
processImage(pboMemory2);

but I still get lower performance than with plain glReadPixels.

Can someone please guide me to the correct steps? :(

Thanks

yooyo
03-16-2008, 02:28 PM
In the current frame, bind the PBO and do the readback.
In the next frame, map the PBO and copy its contents to sysmem.

Why? When the app calls glReadPixels while a PBO is bound, glReadPixels is a non-blocking call. But if you try to map the PBO soon after glReadPixels, glMapBuffer will block until the glReadPixels transfer has finished.
When to call glMapBuffer is hard to tell, because it depends on the underlying hardware, driver, screen size, chipset, ... So the best approach is to do that operation (glMapBuffer and memcpy) in the next frame.

Also, this PBO memory is not cacheable, so do not try any weird access patterns. A plain memcpy into a sysmem buffer is the best approach.
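
As a rough illustration of the schedule described above, here is a minimal sketch for the two-PBO case. It contains no OpenGL calls; it only computes which buffer would receive glReadPixels and which one would be mapped in a given frame. The names `PingPong` and `schedule` are hypothetical helpers, not part of any API, and the one-frame delay is an assumption taken from the advice above rather than a guarantee that the transfer has finished.

```cpp
#include <cassert>

// Hypothetical helper: per-frame roles of two PBOs used in ping-pong fashion.
struct PingPong {
    int readInto;     // PBO index that receives this frame's glReadPixels
    int mapFrom;      // PBO index to map and memcpy from during this frame
    int mappedFrame;  // frame number whose pixels mapFrom holds, -1 if none yet
};

// Frame f reads into PBO (f % 2) and maps the other PBO, which was
// filled during frame f - 1 and has had a whole frame to complete.
inline PingPong schedule(int frame) {
    assert(frame >= 0);
    PingPong p;
    p.readInto = frame % 2;
    p.mapFrom = (frame + 1) % 2;
    p.mappedFrame = frame - 1;  // -1 on the very first frame: nothing to copy
    return p;
}
```

On the first frame `mappedFrame` is -1, so the map and memcpy step should be skipped; from the second frame on, each frame copies out the image that was requested one frame earlier.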

PixIn
03-16-2008, 03:54 PM
Thanks for the reply :)

OK, I am not sure I got it right, but I changed my code a bit to:


index = (index + 1) % 2;
nextIndex = (index + 1) % 2;

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[index]);
glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);

GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[nextIndex]);
GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);

but still the same problem... low performance :(

Any ideas? Perhaps some code would help me better ;)

Thanks

yooyo
03-16-2008, 04:23 PM
No, that's wrong. See this:


// At the end of the frame, before the SwapBuffers call.
// To use this, just set bDoScreenShot to true.
if (bCopyToSysMem)
{
    // Second pass, one frame later: the transfer has had a frame to finish.
    bCopyToSysMem = false;
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo);
    GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
    if (src != NULL)
    {
        memcpy(sysme, src, imgsize);
        // Unmap before the PBO is used as a glReadPixels target again.
        glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);
    }
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
}

if (bDoScreenShot)
{
    // First pass: start the asynchronous readback into the PBO.
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo);
    glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
    bCopyToSysMem = true;
    bDoScreenShot = false;
}

SwapBuffers(...);


The above code snippet is just for a single screenshot!

PixIn
03-16-2008, 07:47 PM
Thanks yooyo for the hints ;)

Your code gives a faster result (10~15 fps faster),
but :s I get a black bitmap. It seems that src is empty :S

Can you tell me what sysme is?

Thanks

Jackis
03-17-2008, 01:58 AM
yooyo means that in order to benefit from a PBO, you should make the readbacks asynchronous. In the code above, yooyo advises you to call ReadPixels with the PBO bound in the first frame, but to use that memory only in the next frame (or a few frames later, maybe 2-3). Only this way can you get the benefit of the PBO.

yooyo
03-17-2008, 07:06 AM
Just insert my code before you call SwapBuffers, at end of render frame. Something like...


// this is a very basic render loop
while (bQuit == false)
{
    UpdateGame();
    RenderGame();
    // insert my code here
    SwapBuffers(); // present frame
}



PixIn wrote: "can u tell me what the sysme is ?"

sysme is a typo... it should be sysmem :)
sysmem is a pointer to a system memory buffer. The application should allocate this buffer. Its size should be SCR_WIDTH * SCR_HEIGHT * BYTES_PER_PIXEL.
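
The size rule above can be written as a small helper. This is only a sketch under stated assumptions: 1-byte components (GL_UNSIGNED_BYTE), GL_PACK_ROW_LENGTH left at 0, and GL_PACK_ALIGNMENT at its default of 4 unless another value is passed. The function names are hypothetical. For 4-byte formats such as GL_BGRA the result equals width * height * 4; for 3-byte formats each row is rounded up to the pack alignment, so the buffer can be slightly larger than the naive product.

```cpp
#include <cassert>
#include <cstddef>

// Bytes between the starts of two consecutive rows written by glReadPixels,
// assuming 1-byte components and GL_PACK_ROW_LENGTH == 0.
inline std::size_t packedRowStride(std::size_t width,
                                   std::size_t bytesPerPixel,
                                   std::size_t packAlignment = 4) {
    assert(packAlignment == 1 || packAlignment == 2 ||
           packAlignment == 4 || packAlignment == 8);
    const std::size_t raw = width * bytesPerPixel;
    // Round the row up to the next multiple of the pack alignment.
    return (raw + packAlignment - 1) / packAlignment * packAlignment;
}

// A safe size for both the PBO data store and the sysmem buffer.
inline std::size_t packedImageSize(std::size_t width,
                                   std::size_t height,
                                   std::size_t bytesPerPixel,
                                   std::size_t packAlignment = 4) {
    return packedRowStride(width, bytesPerPixel, packAlignment) * height;
}
```

For example, `packedImageSize(1024, 768, 4)` gives 3145728 bytes, while a 3-pixel-wide, 3-bytes-per-pixel row occupies 12 bytes rather than 9 with the default alignment.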

PixIn
03-17-2008, 09:58 AM
Great.
So, if I understood correctly, this uses only one PBO.
In iterations (0, 2, 4, 6, ...) it copies data from the PBO to SysMem, then in iterations (1, 3, 5, 7, ...) it copies data from SysMem to my bitmap buffer...
That's fine, but in my app I get a black result. It seems like the bitmap buffer is filled with 0 values (I already allocate my SysMem) :(

Also, if the code uses only one PBO, why should it call SwapBuffers?

Thanks

yooyo
03-17-2008, 10:27 AM
Now you are confusing me :)
You stated before that you need to read back the backbuffer just for a screenshot, not for streaming readback, so I wrote code for that usage pattern. Now, if you want to do streaming readback, the above code has to be modified. Something like:



#define NUMR_PBO 4
GLuint m_pbos[NUMR_PBO];
int vram2sys;             // index of the PBO that is copied to sysmem
int gpu2vram;             // index of the PBO that receives glReadPixels
unsigned char* membuffer; // sysmem destination buffer
unsigned int memsize;     // size of one frame in bytes

// call this once
void Init()
{
    memset(m_pbos, 0, sizeof(m_pbos));
    membuffer = NULL;
}

// call this at least once... and whenever the screen size has changed
void OnScreenSize(vec2i newsize)
{
    memsize = newsize.x * newsize.y * 4; // BGRA

    // gen PBO names
    if (m_pbos[0] == 0)
        glGenBuffers(NUMR_PBO, m_pbos);

    // create empty PBO buffers
    for (int i=0; i<NUMR_PBO; i++)
    {
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER_ARB, memsize, NULL, GL_STATIC_READ);
    }

    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

    // vram to sysmem pbo index
    vram2sys = 0;
    // backbuffer to vram pbo index
    gpu2vram = NUMR_PBO-1;

    if (membuffer != NULL)
    {
        delete [] membuffer;
        membuffer = NULL;
    }

    membuffer = new unsigned char[memsize];
}

// call this at the end of frame render
void ReadBack()
{
    // read back the current frame into a PBO
    // (m_ViewPortSize is assumed to hold the size passed to OnScreenSize)
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[gpu2vram]);
    glReadPixels(0,0,m_ViewPortSize.x, m_ViewPortSize.y, GL_BGRA, GL_UNSIGNED_BYTE, 0);

    // copy an older frame (read back NUMR_PBO-1 calls ago) from PBO to sysmem
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[vram2sys]);
    void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
    if (data != NULL)
    {
        memcpy(membuffer, data, memsize);
        // Do something with the image in membuffer..
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

    // shift names: the oldest PBO becomes the next glReadPixels target
    GLuint temp = m_pbos[0];
    for (int i=1; i<NUMR_PBO; i++)
        m_pbos[i-1] = m_pbos[i];
    m_pbos[NUMR_PBO - 1] = temp;
}
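
To see what the name rotation at the end of ReadBack() does to latency, here is a small simulation sketch with no OpenGL in it. Each slot stores the frame number that was "read back" into it instead of real pixels. `RingSim`, `kNumPbo` and `kEmpty` are hypothetical names for illustration; the slot layout (write into the last slot, map the first, rotate left by one) mirrors the code above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kNumPbo = 4;  // same role as NUMR_PBO above
constexpr int kEmpty = -1;          // slot has never been written

// Simulates the PBO ring using frame numbers in place of pixel data.
struct RingSim {
    std::array<int, kNumPbo> slots;

    RingSim() { slots.fill(kEmpty); }

    // One ReadBack() call. Returns the frame number that would be mapped
    // and copied to sysmem, or kEmpty if that slot was never written.
    int step(int frame) {
        assert(frame >= 0);
        slots[kNumPbo - 1] = frame;   // glReadPixels into the last PBO
        const int mapped = slots[0];  // glMapBuffer on the first PBO
        // Shift names left by one; the old first PBO becomes the last.
        std::rotate(slots.begin(), slots.begin() + 1, slots.end());
        return mapped;
    }
};
```

Stepping this through frames 0, 1, 2, 3, 4 returns kEmpty three times and then 0 and 1, so with four PBOs the image copied to sysmem is three frames old, and the first three calls map buffers that were never written. In a real application those first copies would contain undefined data and are probably best discarded.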


Regarding the black image: can you post your code, or at least pseudocode of your rendering loop? It is very strange that you are getting a black image.

PixIn
03-17-2008, 10:42 AM
yooyo wrote: "Now, you are confusing me :)"



:D Sorry for that.

i will try this piece of code ;)

PixIn
03-17-2008, 12:07 PM
Hi

Finally it works :) Thanks yooyo for the streaming code, you are my hero ;) I get my image perfectly.

Now back to the performance issue: what speedup should one normally expect from the PBO implementation versus the "no PBO" one? On my machine I get 2~5 fps more with the PBO approach.

My config is:
NVIDIA GeForce 7600 GS
PCI Express x16
CPU: P4 3.0 GHz

yooyo
03-17-2008, 12:25 PM
Memory bandwidth is the same in both cases, but with a PBO you can avoid a CPU/GPU stall. This means you can do something else on the CPU side (like decoding or encoding).

Using some advanced memory copy functions (MMX registers, prefetch, cache alignment) and memory buffers allocated with VirtualAlloc and locked with VirtualLock, you can increase transfer speed by 25-30%. The underlying hardware (memory controller, memory speed, memory latency) also affects transfer speed. My laptop, an HP 8710w with a Quadro 1600M, can read back 1250 MB/sec with an optimized memcpy function. A friend's machine (Penryn E8200 + 8800GT + DDR2-800+) reaches ~2 GB/sec.
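
To relate such bandwidth figures to frame rates, a back-of-the-envelope sketch: divide the sustained readback bandwidth by the size of one frame. The helper names are hypothetical, 1 MB is assumed to mean 2^20 bytes, and the 1024x768 resolution in the example is an assumption, not a figure from this thread. The result is only an upper bound on readbacks per second; it ignores rendering time and any stalls.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: "MB" means 2^20 bytes.
constexpr double kBytesPerMB = 1024.0 * 1024.0;

// Size of one tightly packed frame in MB.
inline double frameSizeMB(std::size_t width, std::size_t height,
                          std::size_t bytesPerPixel) {
    return static_cast<double>(width * height * bytesPerPixel) / kBytesPerMB;
}

// Upper bound on full-frame readbacks per second at a given bandwidth.
inline double maxReadbacksPerSecond(double bandwidthMBps, std::size_t width,
                                    std::size_t height,
                                    std::size_t bytesPerPixel) {
    assert(bandwidthMBps > 0.0 && width > 0 && height > 0 && bytesPerPixel > 0);
    return bandwidthMBps / frameSizeMB(width, height, bytesPerPixel);
}
```

For instance, a 1024x768 BGRA frame is 3 MB, so 1250 MB/sec would allow at most about 416 readbacks per second.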

-NiCo-
03-17-2008, 12:50 PM
yooyo wrote: "Using some advanced mem copy functions (using MMX registers, prefetch, cache align) and using memory buffers allocated with VirtualAlloc and locked with VirtualLock you can increase transfer speed 25-30%."


Is this the same page-locking mechanism that CUDA uses? I just ran the CUDA bandwidthTest app. Without page-locked memory I get



Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 834.4

Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 789.0


and with page-locked memory I get



Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 2124.4

Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)    Bandwidth (MB/s)
33554432                 1629.9


PS. I also ran this on an HP 8710w laptop with a Quadro FX 1600M.

yooyo
03-17-2008, 02:35 PM
CUDA is still a mystery to me. :)
I don't have time to play with CUDA right now, but I'll do that ASAP.

Hampel
03-18-2008, 04:16 AM
@Nico: can you post the code snippet used in CUDA for this page-locking thing?

-NiCo-
03-18-2008, 05:36 AM
The memory is allocated with cudaMallocHost(void **ptr, size_t size). It's implemented in their cudart dynamic library, so I don't know exactly how they do it.

Here's a related thread on the gpgpu forum: http://www.gpgpu.org/forums/viewtopic.php?t=4798

yooyo
03-19-2008, 04:35 AM
I was able to reach CUDA speed (at least on my machine, ~1650 MB/sec): readback from an FBO buffer, 512x512 RGBA (1 MB in size), with an optimized memcpy and with memory buffers created using VirtualAlloc + VirtualLock.

Anyway, the problem with the readback stall every 47.5 MB is still there.