Using Pixel Buffer Objects with glReadPixels

Hi everyone, have a question regarding pixel buffer objects.

I am running a 1.5 GHz P4 with a GeForce FX 5200 (AGP, of course) and the 80-series ForceWare drivers.

I am trying to use PBOs to increase the speed of glReadPixels. I’ve implemented a class that does just that, and can alternatively do a normal glReadPixels for comparison. I am reading back both the color and depth components. My normal glReadPixels path looks like this:

void readPixelsNormal()
{
  // allocate client-side buffers (one 32-bit value per pixel)
  unsigned int * m_pBufferColor = new unsigned int[GetWidth() * GetHeight()];
  unsigned int * m_pBufferDepth = new unsigned int[GetWidth() * GetHeight()];

  glReadPixels(0, 0, GetWidth(), GetHeight(), GL_RGBA, GL_UNSIGNED_BYTE, 
               m_pBufferColor);
  //... do stuff to the pixels ...

  glReadPixels(0, 0, GetWidth(), GetHeight(), GL_DEPTH_COMPONENT,
               GL_UNSIGNED_INT, m_pBufferDepth);
  //... do stuff to the pixels ...

  // free the client-side buffers
  delete [] m_pBufferColor;
  delete [] m_pBufferDepth;
}

I’m getting about 30 FPS with the above approach. Next is my PBO implementation:

// macro for pointing glReadPixels to ... well ... nowhere
#define BUFFER_OFFSET(i) ((char *)NULL + (i))

// PBO generated IDs
GLuint m_pPBO[2] = {0, 0};
unsigned int * m_pBuffer;

void initPBO()
{
  // init the PBOs
  glGenBuffersARB(2, m_pPBO);
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[0]);
  glBufferDataARB(GL_PIXEL_PACK_BUFFER_EXT, 
                  (GetWidth() * GetHeight() * sizeof(unsigned int)), 
                  NULL, 
                  GL_STREAM_READ);

  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[1]);
  glBufferDataARB(GL_PIXEL_PACK_BUFFER_EXT, 
                  (GetWidth() * GetHeight() * sizeof(unsigned int)), 
                  NULL, 
                  GL_STREAM_READ);
  
  // bind it to nothing so other stuff doesn't
  // think it should use the PBOs
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);
}

void readPixelsPBO()
{
  // bind buffer #1
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[0]);

  // read pixels
  glReadPixels(0, 0, GetWidth(), GetHeight(), GL_RGBA,
               GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

  // bind buffer #2
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[1]);

  // read pixels
  glReadPixels(0, 0, GetWidth(), GetHeight(), GL_RGBA,
               GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

  // map memory from card
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[0]);
  m_pBuffer = static_cast<unsigned int *>(glMapBufferARB(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY_ARB));

  //... do stuff to the pixels ...

  // map memory from card
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[1]);
  m_pBuffer = static_cast<unsigned int *>(glMapBufferARB(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY_ARB));

  //... do stuff to the pixels ...

  // unmap the memory
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[0]);
  if (!glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_EXT))
  {
    //  handle the error
  }

  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[1]);
  // unmap the memory
  if (!glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_EXT))
  {
    //  handle the error
  }

  // bind it to nothing so other stuff doesn't
  // think it should use the PBOs
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);
}

void killPBO()
{
  // kill the PBOs
  glDeleteBuffersARB(2, m_pPBO);
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);  
}

It’s weird. This second approach yields ~27 FPS. If I’m bypassing the normal readback path and accessing card memory directly, shouldn’t I be getting some kick-butt framerates? In addition, since I’m using STREAM data, shouldn’t the glReadPixels calls return immediately and behave asynchronously? Am I doing something wrong? What do you gurus think?

Hi,

I’m not a guru, but I have some ideas.

First: asynchronous ReadPixels. The PBO spec says: “…If the application issues several readbacks into several buffer objects, however, and then maps each one to process its data, then the readbacks can proceed in parallel with the data processing.” If you want asynchronous ReadPixels, you may need to split the readback across several buffer objects and only map each one when you have other work to overlap it with, like in the spec’s example (see the sketch below).
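Roughly what I mean, as a sketch only (the strip count, the stripPBO array and processStrip() are names I invented; the PBOs would be created and sized elsewhere, and BUFFER_OFFSET is your macro):

// queue all the readbacks first; none of them should block
const int STRIPS = 4;
int stripH = GetHeight() / STRIPS;
for (int i = 0; i < STRIPS; ++i)
{
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, stripPBO[i]);
  glReadPixels(0, i * stripH, GetWidth(), stripH,
               GL_RGBA, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));
}

// now map and process; while we work on strip i, the readbacks of the
// later strips can still be in flight
for (int i = 0; i < STRIPS; ++i)
{
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, stripPBO[i]);
  unsigned int * p = static_cast<unsigned int *>(
      glMapBufferARB(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY_ARB));
  if (p)
    processStrip(p, GetWidth(), stripH);  // whatever per-strip work you do
  glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_EXT);
}
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);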

Second: performance. You are accessing card memory directly, but at some point you still have to read the data back to the CPU (the map/unmap). I don’t see how that by itself gets you more performance. I think PBOs pay off more in applications like “render to vertex array”, because with a PBO the transfer stays between regions of card memory, whereas without a PBO you transfer between card memory and system RAM, which is slower.

Third: STREAM data. From an NVIDIA paper:
“But STATIC, STREAM, and DYNAMIC are simply suggestions or hints about potential usage patterns. They do not force the driver to do anything in particular, but they help us make decisions about buffer memory placement and mapping behavior.”

API

P.S.: please forgive my English.

Well, I can understand what you’re saying. I just have one burning question:

Why the heck isn’t the PBO readback performing at least as fast as the normal glReadPixels???

I don’t have direct experience with exactly the scenario mentioned above, but I have seen extreme slowdowns while using glCopyTexSubImage2D(…) with PBOs. I noticed that the CopyTexSubImage call was extremely slow with PBOs, whereas it provided pretty decent performance with the regular frame buffer. I tested the aforementioned scenario on an FX 5700 Ultra and a 5900 XT (NV3x). In the end I used the regular frame buffer, thinking that there was either some problem with the driver (probably a late 6x.xx release; I haven’t tested it with subsequent driver releases because I use FBOs now and I have upgraded to NV4x anyway :smiley: ) or perhaps the architecture of the GPU. Now that a similar problem has surfaced again, maybe we can get a good answer from someone at nVidia or someone else who has done better research.

About my personal experience: I have an NVIDIA GeForce 5900. I use PBOs to implement a “pseudo” render-to-vertex-array, because there is no direct way to do this (the FBO spec says the “superbuffers group” hasn’t decided whether to fold this into FBO or create a new extension).

I implemented it two ways, one using the default glReadPixels and one using glReadPixels with a PBO, because the NVIDIA white papers say it’s possible to get better performance that way, since the copies stay in graphics card memory. The PBO path is sketched below.
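Roughly, the PBO path looks like this (only a sketch: it assumes a floating-point color buffer holding positions, a buffer object vboPBO created elsewhere and sized for w * h * 4 floats, and the BUFFER_OFFSET macro from the first post):

// read the rendered positions straight into the buffer object,
// with no round trip through system memory
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, vboPBO);
glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, BUFFER_OFFSET(0));
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);

// reuse the same buffer object as a vertex array
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vboPBO);
glVertexPointer(4, GL_FLOAT, 0, BUFFER_OFFSET(0));
glEnableClientState(GL_VERTEX_ARRAY);
glDrawArrays(GL_POINTS, 0, w * h);
glDisableClientState(GL_VERTEX_ARRAY);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

The plain glReadPixels path does the same readback into a client-side array and then uploads it with glBufferDataARB, which is the extra trip through system RAM that the PBO path is supposed to avoid.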

Well, I haven’t noticed any difference. I get the same performance, both in FPS and in milliseconds per frame.

API

Originally posted by renaissanz:
Why the heck isn’t the PBO readback performing at least as fast as the normal glReadPixels???
I also would like to know that, because I’m going to chime in here about other performance issues I have with PBO. :confused:
I think there’s something wrong with the driver. I think it’s being too bandwidth-conservative and ends up adding so much latency that the extra DMA speed (provided there’s any extra DMA speed to gain) is lost.

This is also what Zulfiqar Malik seems to be pointing out.

At least, I know other people are complaining about the current PBO implementation.

By the way, what tests did you run? I’ve noticed PBO seems to be somewhat better under high CPU usage, but it’s not really “much” better anyway.

I tried to make async ReadPixels into a PBO perform better than regular ol’ ReadPixels into a pointer on both NVidia/Mac OS X 10.4.3 and NVidia 76xx drivers on Linux.

I used two PBOs, alternating which one was read into each frame, and not mapping a buffer until a frame had passed, roughly the scheme sketched below.
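In rough form the per-frame code looked something like this (just a sketch; the pbo[] array, width and height come from my test harness, and BUFFER_OFFSET is the macro from the first post):

static int frameIndex = 0;
int curr = frameIndex % 2;   // PBO that receives this frame's pixels
int prev = 1 - curr;         // PBO that was filled last frame

// queue this frame's readback; this is the call that should return immediately
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, pbo[curr]);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

// map the buffer filled a frame ago; by now its data should have arrived
if (frameIndex > 0)
{
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, pbo[prev]);
  void * p = glMapBufferARB(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY_ARB);
  if (p)
  {
    // ... consume last frame's pixels ...
    glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_EXT);
  }
}

glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);
++frameIndex;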

I was completely unable to engineer a situation (high CPU load, low CPU load, high GPU load, low GPU load) where ReadPixels into a PBO was actually faster than the usual path. Sometimes PBOs were marginally slower, more often there was nothing to choose between the two.

If anyone can show a situation where PBOs actually do provide asynchronous ReadPixels ('cos I’m forced to conclude they don’t), I’d be very interested to see it!

Originally posted by renaissanz:
Am I doing something wrong? What do you gurus think?
Yes. Read this article:
http://download.nvidia.com/developer/Papers/2005/Fast_Texture_Transfers/Fast_Texture_Transfers.pdf
Found on this site:
http://developer.nvidia.com/object/all_docs_by_date.html

It explains why the GL_RGBA case you used is slow.
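The short version, as I understand the paper: the card keeps the color buffer in BGRA order, so asking glReadPixels for GL_RGBA forces a per-pixel swizzle on the way out. Something along these lines should hit the faster path (a sketch reusing the names from your code; older headers may call the format GL_BGRA_EXT):

glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, m_pPBO[0]);
glReadPixels(0, 0, GetWidth(), GetHeight(),
             GL_BGRA, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);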

The only really interesting thing about this paper is that they used nForce.
I’m beginning to think the actual PBO implementation is something like

if (nForce) real PBO
else fallback path

This has some nasty consequences, but after all, it does make sense. VIA surely won’t give nVidia the chipset details needed to drive their chipset effectively now that nVidia is a competitor.

As for the rest, I can confirm BGR is faster than RGB in my tests, but I could really ‘feel’ the extra performance in only a minority of cases. Providing alpha also gives me a minor advantage, but still far from what was promised.