PBO performance

Hello all,
I am running a small PBO demo I wrote and see a big improvement in the texture transfer rate (relative to not using a PBO).
In both cases I modify the image and then call glTexSubImage2D.
When not using a PBO I modify the image in system memory, while when using a PBO I write directly to GPU memory via mapping.
Does the improvement come from the fact that writing to GPU memory is faster than writing to the CPU's own memory (not sure…), or from glTexSubImage2D being faster when performed within GPU memory (quite sure about that)?
I was once told in this forum that the advantage in PBOs is that glTexSubImage2D happens (relatively) asynchronously.
I don’t quite understand what that means.
I would be glad if someone can elaborate on this subject.
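For reference, here is roughly what the two upload paths in my demo look like (a minimal sketch, not my actual code; `tex`, `pbo`, `W`, `H`, `pixels`, and `modify_image` are placeholder names, and a context with ARB_pixel_buffer_object support is assumed):

```c
/* Path 1: no PBO. glTexSubImage2D sources from client memory,
 * so the data in `pixels` must be consumed before the call returns. */
modify_image(pixels);                       /* write in system memory */
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,
                GL_BGRA, GL_UNSIGNED_BYTE, pixels);

/* Path 2: PBO. Write into the mapped buffer, then glTexSubImage2D
 * sources from the bound buffer; its pointer argument is now a
 * byte offset into the PBO (0 here). */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
void *ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY);
modify_image(ptr);                          /* write in mapped memory */
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,
                GL_BGRA, GL_UNSIGNED_BYTE, (const GLvoid *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
```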

Many thanks

While I haven’t worked with PBOs, my take on “asynchronously” would be that it doesn’t “stall” the engine. That is, the command goes into the OpenGL command queue and is executed later, while mapping means that all commands need to be executed first (as with glFinish).

TexSubImage2D() doesn’t require a glFinish. It’s ReadPixels() without PBO that does…

The biggest thing that PBO should be saving you on texture uploads is an extra copy. There could be driver idiosyncrasies (there always are) that affect the perf too, though.

Thanks for the reply Cass,
But what is this extra copy that isn’t done when using a PBO?
In my app I see only two steps:

  1. Modifying the image (either on the system memory or on the pixel buffer) and
  2. Calling glTexSubImage2D

In the PBO case, you’re writing the subimage data directly into driver controlled memory.

In the non-PBO case, the driver has to copy all the data in your subimage call into driver controlled memory before the call can return.

It’s the same basic issue as with vertex arrays. Without VBO, the driver has to always copy from the app’s memory because it cannot expect the memory to remain untouched after the call is finished.

With VBO, it controls the memory, so it knows if the app tries to touch it later. (It gets a BufferSubData or Map call.)

Hi !

I am a bit worried about PBOs. On ATI hardware on OSX I get tremendous performance; it runs so much faster using PBOs.

On Win32 NVIDIA 81.xx drivers I get really crappy performance. On 91.31 I get reasonable performance, but still worse than without PBOs. I use BGRA unsigned bytes, so there should be no swizzling. What’s the problem? Why do I get better performance without PBOs?

Thanks Cass for the answer,

I don’t want to nag, but the copy the driver performs from system memory to its own memory is done in the glTexSubImage2D call (if I understand correctly). So what is the meaning of this call when using a PBO? Also a copy, but from the driver’s memory to where? Is there also an issue of converting from the texture format to the driver’s internal format?

When you’ve got a PBO bound, all glTexSubImage2D calls will reference data already in driver-controlled memory (because the pointer parameter is reinterpreted as an offset into the currently bound PBO). So glTexSubImage2D itself does nothing except instruct the driver to source its data from the specified offset within the currently bound PBO; no copy happens there. The actual copy from the app to the driver happens when you call glBufferData/glBufferSubData, and no copy happens at all if you use glMapBuffer.
It really does seem like you haven’t read the PBO spec.
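Concretely (a sketch; `pbo`, `size`, `pixels`, `W`, and `H` are made-up names), the app-to-driver copy happens at buffer-fill time, not in the TexSubImage call:

```c
/* The app->driver copy happens here, at buffer-fill time... */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBufferSubData(GL_PIXEL_UNPACK_BUFFER_ARB, 0, size, pixels);

/* ...so this call copies nothing from the app: with a PBO bound,
 * the last argument is a byte offset into the buffer, not a pointer. */
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,
                GL_BGRA, GL_UNSIGNED_BYTE, (const GLvoid *)0);
```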

Think about how a GL driver works. It has a command queue, and calling some GL function doesn’t mean it will be executed immediately; the command is stored in the command queue. When the app passes a pointer to something (for example, by calling glTexSubImage2D), the driver must allocate memory and copy the contents into some internal memory before the command returns. Later, this buffer will be processed. When the driver allocates memory it must deal with memory-management troubles.

When the app uses PBOs, it is optimal to allocate a few PBOs at the beginning, and the driver probably won’t touch that piece of memory (depending on the PBO access type). By copying data into a PBO, the app actually avoids the internal mem->vmem copy and the memory-allocation race.

In short… without PBO:
app->sysmem->driver controled mem->texture data
with PBO:
app->driver controled memory->texture data

Another nice feature of PBOs is their asynchronous operation. Any transfer that goes from a PBO to the GPU, or from the GPU to a PBO, is async. In such situations the app can use the CPU for something else instead of waiting for the function to complete.
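The async behaviour is easiest to see on the download side (a sketch; `pbo`, `W`, `H`, and the helper functions are placeholders, and a bound GL context is assumed):

```c
/* glReadPixels into a bound PIXEL_PACK buffer returns immediately;
 * the GPU->PBO transfer proceeds asynchronously. */
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
glReadPixels(0, 0, W, H, GL_BGRA, GL_UNSIGNED_BYTE, (GLvoid *)0);

do_other_cpu_work();   /* overlap CPU work with the transfer */

/* Mapping synchronizes: it blocks only if the transfer is still
 * in flight when we get here. */
void *data = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
use_pixels(data);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
```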

Thanks knackered and yooyo for your answers.
knackered - I read the spec a few times and, as you know, reading the spec is not like reading a tutorial; it is not always that clear. I would not post a question in this forum if I hadn’t read the spec first.
Yooyo, you said
“When the app passes a pointer to something (for example, by calling glTexSubImage2D), the driver must allocate memory and copy the contents into some internal memory before the command returns. Later, this buffer will be processed”…
What happens when the texture format is different from the driver’s internal texture format? The driver still has to do extra work to convert the data the user injected via mapping (probably another mem->vmem copy). Is that extra work what you meant by ‘processed’?
I think that when the user injects data via mapping, the texel data is usually not in the ‘final’ format most suitable for rendering (for example, a ‘patched’ arrangement). Maybe it involves converting a linear texture to a patched one. What I am trying to say is that the texture supplied by the user via mapping IMHO must be further processed: the driver has to read the PBO and write it out to a buffer in the ‘final’ texture format. I hope I’ve pointed out what’s bothering me.
From now on I will read the specs over and over again and avoid putting silly questions on the forum.
Thanks again for the answers guys.

internal texture format of the driver
You mean the internal texture format of the hardware?
The driver doesn’t really need a custom internal format, especially when using PBOs. If conversions have to be done, they are likely to be done directly on the hardware.

ZbuffeR is right. There are corner cases, but in general we want the hardware to handle pack/unpack and internal layout in memory.

What about glTexSubImage3D? Do PBOs improve the uploads there, too?

A couple of points.

  1. Converting from linear memory layouts to “patched” (tiled) layouts can be done by the GPU, as long as the data provided is memory-aligned for the GPU. The GPU does the conversion simply by treating the data from the glTexSubImage2D call as a linearly formatted texture and the destination texture as tiled.

  2. Conversions from the user-specified data types to the ones used by the hardware can be done by the GPU as well, again as long as memory-alignment constraints are met. You simply say that the source texture is of one type and the destination is of another, and render a quad to replace that region.

  3. When you map the PBO, depending on the flags passed in, say WRITE_ONLY, the driver can make the memory pages write-combined and, provided you write out entire cache lines, the CPU memory subsystem changes from read-modify-write to a write-only mode. If you don’t fill out entire cache lines in time, the CPU reads the whole cache line from memory and masks in what you wrote.

  4. If the PBO is in VRAM, writing directly across the bus to the card does away with the need to write to memory and then have the GPU pull that data across the bus into its local caches or VRAM.

  5. Try different alignments to get the best results on different hardware. On ATI hardware, make sure everything is aligned to 32-byte boundaries; that means pixel rows start on 32-byte-aligned addresses.

thank you akaTONE, very interesting post.

I am not sure I understand your point 3), though…

Originally posted by glRulez:
What about glTexSubImage3D? Do PBOs improve the uploads there, too?
If I understand what’s been said so far, the benefit of PBOs is that you get asynchronous data transfers from DMA-able blocks of driver controlled memory. I would think this could be beneficial to any sort of texture transfer, including 3D texture slices.

For example, the “streaming textures” example given in the spec copies the texture data to the mapped block of memory, then immediately unmaps it. This allows the driver to stream the data (DMA) asynchronously to the GPU behind the scenes, rather than having to stall and wait for the upload to the GPU to complete directly.
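In code that pattern looks roughly like this (a sketch of my reading of the spec example, with placeholder names; the buffer is re-specified each frame so the driver can hand back fresh memory instead of waiting on an in-flight transfer):

```c
/* Per frame: orphan the buffer, fill it, unmap immediately. */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, size, NULL, GL_STREAM_DRAW);
void *ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY);
fill_frame(ptr);                            /* copy this frame's texels */
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,
                GL_BGRA, GL_UNSIGNED_BYTE, (const GLvoid *)0);
/* The PBO->texture transfer can now stream behind the scenes. */
```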

As for formats, it seems to me that native (compressed) formats will have the best performance.

Somewhat unrelated, but does anyone know if it is possible to upload a portion of a DDS texture?

If by DDS texture you mean a DXT1/3/5 compressed texture, and if by portion you mean sub-loading into an existing texture…

… then yes, you can, as long as you upload a multiple-of-4-sized area into a multiple-of-4 destination offset (e.g. an 8x8 area can be subloaded into a texture at point (4,4) but not at (1,1)).