PBO Performance under Vista

All,

I have been using Pixel Buffer Objects to try to get an optimal data upload rate on Windows. Under XP, things behave much as the online documentation indicates. When I use Vista, however, my PBO upload performance drops dramatically (to almost exactly half the regular upload rate).

Below is some code that should almost cut-and-paste if you want to try it.
Disclaimer: I know that it looks like a relatively crude benchmark but I wanted the code to be easy to follow. Even if you make this much more sophisticated you still see the same issues.

Thanks in advance for any help on this.

Andrew Cross
NewTek, http://www.newtek.com

const int no_sources = 8;
const int no_runs = 1000;
const int xres = 1920;
const int yres = 1080;
const int image_size = xres*yres;

GLuint pbo[ 2 ];
::glGenBuffers( 2, pbo );
::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo[ 0 ] );
::glBufferData( GL_PIXEL_UNPACK_BUFFER, image_size*no_sources, 0, GL_STREAM_DRAW );
::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo[ 1 ] );
::glBufferData( GL_PIXEL_UNPACK_BUFFER, image_size*no_sources, 0, GL_STREAM_DRAW );

GLuint tex[ no_sources ];
::glGenTextures( no_sources, tex );

for( int i=0; i<no_sources; i++ )
{
	::glBindTexture( GL_TEXTURE_2D, tex[ i ] );
	::glTexImage2D( GL_TEXTURE_2D, 0, GL_LUMINANCE8, xres, yres, 0, GL_LUMINANCE, GL_UNSIGNED_BYTE, NULL );
}

DWORD start_time = ::GetTickCount();

for( int i=0; i<no_runs; i++ )
{
	// simulate filling pbo[ 0 ] while pbo[ 1 ] is being uploaded
	::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo[ 0 ] );
	void* p_mem = ::glMapBuffer( GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY );
	::glUnmapBuffer( GL_PIXEL_UNPACK_BUFFER );

	::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo[ 1 ] );
	for( int j=0; j<no_sources; j++ )
	{
		::glBindTexture( GL_TEXTURE_2D, tex[ j ] );
		// with a PBO bound, the last argument is a byte offset into the buffer
		::glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, xres, yres, GL_LUMINANCE, GL_UNSIGNED_BYTE, (void*)(image_size*j) );
	}

	std::swap( pbo[ 0 ], pbo[ 1 ] );
}

float Mb = (float)(xres*yres*no_sources)*(float)no_runs / (float)(1024*1024);
float Mb_p_S = Mb*1000.0f / (float)( ::GetTickCount() - start_time );

printf( "%f Mb/s\n", Mb_p_S );

What happens if you scale it up to more buffers? 4, 8, 128?

Rob,

Thanks for the reply. I have not actually tried with more buffers, but I will run that test. It does, however, seem strange to me that uploading from what is basically a driver-allocated block of RAM would be slower than uploading from a user-allocated one. I wonder whether the virtualization of graphics card memory is somehow causing an extra copy.

Andrew

I’m really unclear on why you are doing that map/unmap pair inside the no_runs loop. That’s a blocking point right there if there are pending reads on that buffer which have not yet completed (uploads).

This is simulating “actually filling the buffers”, but for the purposes of the benchmark I wanted it to take the minimum time possible so that it was not a factor. If the PBOs are working correctly, then in this loop one buffer should be being filled in while the other is being uploaded to the GPU. Note that the buffer being mapped is not the one being uploaded.

Andrew

OK, I see the motivation. However this is a contrived case, since you are spending zero time actually supplying data - so it would conceivably be easy for your loop to “fill” PBO A and then PBO B long before the driver has made much progress on the upload of A. So you get back to A and you block again… on reflection I guess this should be a fair benchmark in a “best case” upper bounds sense.

My goal is to find the fastest way to upload texture data to the GPU. This test is in many ways no different from filling in a user buffer (owned by me) and doing a glTexSubImage2D directly. A PBO should allocate memory in AGP space, which should be faster to DMA to the GPU than regular system memory. Under XP, I believe that this is exactly what happens; I generally find that PBO DMA speeds are about twice the speed of uploads from regular system memory (with the code above). Under Vista, however, this code runs at about half the performance of regular uploads, which makes it seem to me like something is broken somewhere. What I believe is happening is that the system virtualizes the AGP memory pool, so the driver actually maps regular system memory; when this memory is needed for a GPU upload it is copied back into AGP memory, and then the DMA can proceed. The problem, of course, is that there is a whole extra memory copy in this process.
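For reference, the “regular upload” path that I am comparing against is essentially the following (a sketch rather than the exact benchmark code; p_sys is just a plain system-memory allocation):

::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, 0 );	// make sure no PBO is bound
unsigned char* p_sys = (unsigned char*)malloc( image_size*no_sources );
for( int i=0; i<no_runs; i++ )
{
	for( int j=0; j<no_sources; j++ )
	{
		::glBindTexture( GL_TEXTURE_2D, tex[ j ] );
		// the last argument is a real client-memory pointer here, not a PBO offset
		::glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, xres, yres, GL_LUMINANCE, GL_UNSIGNED_BYTE, p_sys + image_size*j );
	}
}
free( p_sys );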

It is of course hard to be sure that this is exactly what is going on and it is unfortunately hard to get any concrete feedback on things like this from the vendors. It does however seem to highlight quite a big issue given that GPGPU processes are highly dependent on transfer speeds, and with GL 3.0 moving towards using buffer objects as the primary form of transfer this could be quite an issue with Vista performance.

Thanks a lot for your input and feedback, it has truly been appreciated. If there is anything that I can do for you, please do not hesitate to let me know.

Andrew Cross,
VP of SW Eng,
NewTek

Take a look at your last loop:


for( int i=0; i<no_runs; i++ )
{
  // bind first PBO and simulate buffer filling
  ::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo[ 0 ] );
  void* p_mem = ::glMapBuffer( GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY );
  ::glUnmapBuffer( GL_PIXEL_UNPACK_BUFFER );

  // now, bind full buffer and upload it content to texture
  ::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo[ 1 ] );
  for( int j=0; j<no_sources; j++ )
  { 
    ::glBindTexture( GL_TEXTURE_2D, tex[ j ] );
    ::glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, xres, yres, GL_LUMINANCE, GL_UNSIGNED_BYTE, (void*)(image_size*j) );
  }

  std::swap( pbo[ 0 ], pbo[ 1 ] );
}

The glTexSubImage2D call in this case is non-blocking, and the app gets control back as soon as the driver posts the command in its queue. But the pbo[1] object stays busy until all pending glTexSubImage2D operations have finished. On the next loop iteration (after the swap) you try to map that same buffer and fill in new data, so the driver has to wait…

Here is my async uploader code… I didn't benchmark it…


#pragma once
#include <vector>
#include "glextmap.h" // this is my gl ext loader... use glew if you want

#ifndef IN
#define IN
#endif

#ifndef OUT
#define OUT
#endif

enum eUPState
{
	UP_NONE = 0,   // default
	UP_MAPPED,     // mapped, accessible from other threads
	UP_LOCKED,     // locked in some thread
	UP_FULL,       // full
	UP_UNMAPPED,   // unmapped, render thread copies its content to texture
	UP_PENDING     // pending... map it in next frame 

};

class CUploadPBOPool;

class CPBO
{
public:
	CPBO();
	virtual ~CPBO();
	// size = size of single PBO
	bool Init(int size);
	// Cleanup
	bool Done();
	// get state of PBO
	eUPState GetState();
	// map and unmap PBO
	bool Map();
	bool Unmap();
	// get pointer to mapped memory
	unsigned char* Lock();
	bool Unlock(int used, void* userID);
	// get size of PBO
	unsigned int GetMaxSize();
	// Bind the PBO
	void Bind();
protected:
	friend class CUploadPBOPool;

	GLuint id;
	unsigned char* data;
	int maxsize;
	int usedsize;
	eUPState state;
	void* userID;
};

class CUploadPBOPool
{
public:
	CUploadPBOPool(void);
	virtual ~CUploadPBOPool(void);
	// create pool with num_buffers PBOs, each of size max_size
	bool CreateUploadPBOPool(IN int num_buffers, IN unsigned int max_size);

	// destroy PBO pool
	bool DeleteUploadPBOPool();

	// Lock the first available PBO… pass a pointer for the resulting pointer and a pointer for the size of the locked buffer
	// returns index of locked PBO
	// this method should be called from another thread
	int  Lock(OUT unsigned char** data, OUT unsigned int* pSize); 

	// unlock previously locked PBO.. pass return value from Lock call
	// this method should be called from another thread
	// use userID to identify your buffer later in ProcessPBOData
	bool Unlock(IN int index, IN void* userID, IN int used);

	// Update... call once per frame from render thread
	bool UpdateUploadPBOPool();
protected:
	CCritSec m_LockPBOs;
	std::vector<CPBO> m_PBOs;

	// override this to upload data to texture. use userID to find where those data belongs in your app
	virtual void ProcessPBOData(unsigned char* data, unsigned int datasize, void* userID) {};

};



#include "StdAfx.h"
#include "UploadPBOPool.h"
#include "Log.h"


#ifdef _DEBUG
#define GLCALL(a)		{(a); glerr(#a, __LINE__);}
#else
#define GLCALL(a)		a
#endif

static void glerr(char *str, int line)
{
	int err;

	err=glGetError();
	if (err!=0) 
	{
		Log.AddLine(__FILE__, "GLERR: [%5d]  %s: %s", line, str, gluErrorString(err));
	}
}

CPBO::CPBO():
id(0),
data(NULL),
maxsize(0),
usedsize(0),
state(UP_NONE),
userID(0)
{

}

CPBO::~CPBO()
{

}

bool CPBO::Init( int size )
{
	GLCALL(glGenBuffers(1, &id));
	GLCALL(glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, id));
	GLCALL(glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, size, NULL, GL_STATIC_DRAW));
	//Log.Info("CPBO::Init %d", id);
	
	maxsize = size;
	usedsize = 0;
	return true;
}

void CPBO::Bind()
{
	GLCALL(glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, id));
	//Log.Info("CPBO::Bind %d", id);

}

bool CPBO::Done()
{
	if (data != NULL)
	{
		GLCALL(glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, id));
		GLCALL(glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB));
		data = NULL;
	}
	glDeleteBuffers(1, &id);
	state = UP_NONE;
	return true;
}

eUPState CPBO::GetState()
{
	return state;
}

bool CPBO::Map()
{
	GLCALL(glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, id));
	data = (unsigned char*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY);
	//Log.Info("CPBO::Map() %d", id);
	state = UP_MAPPED;
	return true;
}

bool CPBO::Unmap()
{
	GLCALL(glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, id));
	GLCALL(glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB));
	//Log.Info("CPBO::Unmap() %d", id);

	data = NULL;
	state = UP_UNMAPPED;
	return true;
}

unsigned char* CPBO::Lock()
{
	state = UP_LOCKED;
	//Log.Info("CPBO::Lock() %d", id);

	return data;
}

bool CPBO::Unlock( int used, void* UID )
{
	usedsize = used;
	userID = UID;
	state = UP_FULL;

	//Log.Info("CPBO::Unlock() %d", id);

	return true;
}

unsigned int CPBO::GetMaxSize()
{
	return maxsize;
}

//////////////////////////////////////////////////////////////////////////
//
//////////////////////////////////////////////////////////////////////////


CUploadPBOPool::CUploadPBOPool(void)
{
}

CUploadPBOPool::~CUploadPBOPool(void)
{
}

bool CUploadPBOPool::CreateUploadPBOPool( IN int num_buffers, IN unsigned int max_size )
{
	for (int i=0; i<num_buffers; i++)
	{
		CPBO pbo;
		m_PBOs.push_back(pbo);
	}

	for (int i=0; i<(int)m_PBOs.size(); i++)
	{
		m_PBOs[i].Init(max_size);
		m_PBOs[i].Map();
	}

	return true;
}

int CUploadPBOPool::Lock(OUT unsigned char** data, OUT unsigned int* pSize )
{
	m_LockPBOs.Lock();
	for (int i=0; i<(int)m_PBOs.size(); i++)
	{
		CPBO& pbo = m_PBOs[i];
		if (pbo.GetState() == UP_MAPPED)
		{
			*data = pbo.Lock();
			*pSize = pbo.GetMaxSize();
			m_LockPBOs.Unlock();
			return i;
		}
	}

	m_LockPBOs.Unlock();
	return -1;
}

bool CUploadPBOPool::Unlock( IN int index, IN void* userID, IN int used )
{
	if (index < 0) return false;
	if (index >= (int)m_PBOs.size()) return false;

	m_LockPBOs.Lock();
	CPBO& pbo= m_PBOs[index];
	pbo.Unlock(used, userID);
	m_LockPBOs.Unlock();
	return true;
}

bool CUploadPBOPool::UpdateUploadPBOPool()
{
	m_LockPBOs.Lock();
	for (int i=0; i<(int)m_PBOs.size(); i++)
	{
		CPBO& pbo = m_PBOs[i];
		eUPState state = pbo.GetState();
		//Log.Info("pbo %d st = %d", i, state);

		switch (state)
		{
		case  UP_NONE: 
			Log.Error("CUploadPBOPool::Update invalid state %d", i);
			break;
		case  UP_FULL:	
			//Log.Info("case  UP_FULL");

			pbo.Unmap();	// PBO stays bound; pbo.data is now NULL
			m_LockPBOs.Unlock();
			// with the PBO still bound, the NULL 'data' acts as byte
			// offset 0 into the buffer for any glTexSubImage2D inside
			ProcessPBOData(pbo.data, pbo.usedsize, pbo.userID);
			m_LockPBOs.Lock();
			pbo.state = UP_PENDING; 
			break;	 
		case  UP_UNMAPPED: 
			pbo.state = UP_PENDING; 
			break;	 
		case  UP_PENDING:	 
			pbo.Map();
			break;	 
		}
	}
	GLCALL(glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0));

	m_LockPBOs.Unlock();
	return true;
}

bool CUploadPBOPool::DeleteUploadPBOPool()
{
	m_LockPBOs.Lock();
	for (int i=0; i<(int)m_PBOs.size(); i++)
	{
		CPBO& pbo = m_PBOs[i];
		pbo.Done();
	}
	m_PBOs.clear();
	m_LockPBOs.Unlock();
	GLCALL(glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0));

	return true;
}

Usage:

  1. derive your render class from CUploadPBOPool
  2. implement:
    virtual void ProcessPBOData(unsigned char* data, unsigned int datasize, void* userID) {};
    example:

void FXEngine::ProcessPBOData( unsigned char* data, unsigned int datasize, void* userID )
{
	CVideoTexture* vtex = (CVideoTexture*)userID;
	vtex->UpdateTexture(data, datasize);
}
   
  3. in some thread…

// already initialised stuff...
CVideoTexture* vtex;
unsigned char* img_data;
unsigned int img_size;

// temporary vars
unsigned char* pboptr;
unsigned int pbolen;
void* userID = (void*)vtex;

// lock one of the buffers
int index = renderer->Lock(&pboptr, &pbolen);
if ((index != -1) && (pbolen >= img_size))   // buffer must be large enough for the image
{
        // copy content to buffer
	memcpy(pboptr, img_data, img_size); // or use some faster memcpy code... 
	// unlock buffer
	renderer->Unlock(index, userID, img_size);
}
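One note: when UpdateUploadPBOPool calls ProcessPBOData, the PBO is still bound and already unmapped, so the data argument is NULL, which glTexSubImage2D treats as byte offset 0 into the bound buffer. So UpdateTexture can be as simple as this sketch (m_tex, m_xres and m_yres are whatever your texture class keeps):

void CVideoTexture::UpdateTexture( unsigned char* data, unsigned int datasize )
{
	// the pool leaves the PBO bound, so 'data' is a byte offset
	// into the buffer (NULL == offset 0), not a client-memory pointer
	glBindTexture( GL_TEXTURE_2D, m_tex );
	glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, m_xres, m_yres, GL_BGRA, GL_UNSIGNED_BYTE, data );
}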

Yooyo,

Thank you very much for your code. My concern is not primarily with getting the texture to upload asynchronously; PBOs do achieve this, and I am aware that my code would not. My goal was to get the highest possible upload rates, and based on all the online recommendations it seems that PBOs are the way to go. It turns out, however, that doing an upload from a PBO buffer on Vista is half the speed of doing it from a user memory buffer. Indeed, if you wanted asynchronous uploads you would be better off sharing your GL objects across several threads (shared contexts) and doing regular uploads (I have measured this, and it is faster than async PBOs on Vista).

On XP, none of the above applies, and PBOs do yield performance improvements over regular uploads.

Andrew

The transfer rate is limited because of syncing between the GPU and CPU.
A PBO will not speed up the transfer just by being used as the immediate object for the transfer. It is designed for async transfers and for intermediate copying on the GPU side (e.g. simulating render-to-vertex-array). The goal of PBOs is to break the synchronisation between CPU and GPU and allow the CPU to do something useful while the GPU uploads large data blocks.

Your example might work on Vista if you increase the number of PBOs and change the swap to a circular shift. Try creating 8 PBOs: map PBO 0 and fill data into it, while you bind PBO 4 and upload textures from it, as in the sketch below.
Also, try decreasing the number of images per PBO… try using 1 image per PBO.
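Something like this untested sketch, reusing the names from your benchmark (no_pbos is new; the first few uploads read garbage, which is fine for a benchmark):

const int no_pbos = 8;
GLuint pbos[ no_pbos ];
::glGenBuffers( no_pbos, pbos );
for( int i=0; i<no_pbos; i++ )
{
	::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbos[ i ] );
	::glBufferData( GL_PIXEL_UNPACK_BUFFER, image_size, 0, GL_STREAM_DRAW );	// one image per PBO
}

for( int i=0; i<no_runs*no_sources; i++ )
{
	// fill the buffer at the head of the ring
	::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbos[ i % no_pbos ] );
	void* p_mem = ::glMapBuffer( GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY );
	// (write the image into p_mem here)
	::glUnmapBuffer( GL_PIXEL_UNPACK_BUFFER );

	// upload from a buffer half the ring behind; it finished filling long ago
	::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbos[ ( i + no_pbos/2 ) % no_pbos ] );
	::glBindTexture( GL_TEXTURE_2D, tex[ i % no_sources ] );
	::glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, xres, yres, GL_LUMINANCE, GL_UNSIGNED_BYTE, 0 );
}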

There is another trick… before the map call, do ::glBufferData( GL_PIXEL_UNPACK_BUFFER, image_size*no_sources, 0, GL_STREAM_DRAW );. This tells the driver to discard the memory currently allocated for the PBO, so the next map call will not try to copy the old PBO contents back into AGP memory.
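In code, the same map from your loop with the orphan added (sketch):

::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo[ 0 ] );
// orphan the old storage: the driver can return fresh memory immediately
// instead of stalling until pending uploads from this buffer complete
::glBufferData( GL_PIXEL_UNPACK_BUFFER, image_size*no_sources, 0, GL_STREAM_DRAW );
void* p_mem = ::glMapBuffer( GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY );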

Yooyo,

I understand the purpose of PBO buffers. My point, however, still stands: even if you upload a single texture to the GPU using a PBO, it is slower than doing it from memory. Even with something like this:

  1. Fill in the PBO
  2. Sleep( 5000 ); (it MUST be synced in the driver by this point)
  3. glTexSubImage2D from the PBO

Step 3 is still a lot slower than just doing glTexSubImage2D from system memory, as in the sketch below. This happens only under Vista, where it is about half the speed of the latter option; under XP you get the same, or better, performance from a PBO.
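In code, that forced-sync test is roughly (a sketch, reusing the names from my benchmark):

::glBindBuffer( GL_PIXEL_UNPACK_BUFFER, pbo[ 0 ] );
void* p_mem = ::glMapBuffer( GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY );
// (fill p_mem with the image here)
::glUnmapBuffer( GL_PIXEL_UNPACK_BUFFER );

::Sleep( 5000 );	// any pending driver work on this PBO is long finished

::glBindTexture( GL_TEXTURE_2D, tex[ 0 ] );
// still about half the speed on Vista of the same call with a system-memory pointer
::glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, xres, yres, GL_LUMINANCE, GL_UNSIGNED_BYTE, 0 );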

Although you are correct about re-allocating the buffer to break the locking from previous loops, that is not the problem I am looking at. I am trying to achieve the fastest possible upload rates.

Andrew

Did you try changing the texture format? AFAIK, BGR or BGRA is accelerated; I'm not sure about other texture formats.

edit: Seems that GL_LUMINANCE8 is not accelerated.
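Sketch of the change (note 4 bytes per pixel, so the PBO sizes and offsets must be 4x bigger):

// allocate as BGRA instead of luminance
::glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, xres, yres, 0, GL_BGRA, GL_UNSIGNED_BYTE, NULL );
// ... later, with the PBO bound, upload with the matching format ...
::glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, xres, yres, GL_BGRA, GL_UNSIGNED_BYTE, 0 );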

GL_LUMINANCE8 is typically accelerated. But even if you do change to BGRA, it does not change the ultimate results. There is a list of the nVidia native formats here:

http://download.nvidia.com/developer/OpenGL_Texture_Formats/nv_ogl_texture_formats.pdf

On a side-note, BGR is pretty much never accelerated.

Andrew

@NVidia: an update to this document to include GF8 would be fine. :)

I have spent quite a bit of time over the past 24 hours actually disassembling the nVidia drivers to look at exactly what is going on, and a number of the things I have found could be interesting. Firstly, it is obvious, but when you do a glTexSubImage from a memory location, the driver does a memcpy into driver memory and then schedules a DMA from that location. The current shipping drivers do this in a relatively inefficient way; if you want, you can go and change the assembly to make it run about 20% faster almost trivially. It appears that they have improved this in the beta version of their GL3 32-bit drivers, although their 64-bit drivers do not have the same optimization yet.

Using the older drivers, when you upload the texture via PBOs, the driver does seem to correctly DMA the PBO to video memory. There are, however, a number of major flaws in their waiting code that make it run very slowly on some machines, particularly newer ones (which is why uploads are so slow). These can actually be patched by changing the DLL, although it is not quite as trivial as the memcpy change, and involves changing the way they sleep while waiting for DMA completion in their ring-3 code.

Their very latest beta drivers have significantly changed the way the PBO surfaces work, so I cannot be quite as conclusive about exactly what is happening in the new versions, since the time is spent in a kernel call that is harder to look into. The bad news is that these drivers actually perform far worse than even the previous ones on PBO uploads, so I hope this is an issue that gets resolved quickly.

I hope that this helps.

Andrew
