I don't get ARB_PBO right.

Last week I began playing with ARB_pixel_buffer_object. The initial experiments disoriented me quite a bit. My main goal was to check how much PBO can speed up texture uploads.
At program startup, I simply uploaded a bunch of pixels to the card. I tried many sizes, from a 64x64 texture up to 4Kx4K; I tried the standard path, EXT_pbo, then ARB_pbo; I tried both mapping the buffer and updating it with SubData.
All of this ended in a final TexSubImage call.
ARB_pbo performance on my system turned out to be simply awful in most cases.
To give PBO an advantage, I also simulated a heavy CPU load. Nothing.

So, PBO does not look to be an attractive option (right now) to speed up massive uploads. I’ll try it again in the future.

Then I realized maybe I was not giving PBO enough work, so I decided to stream multiple textures, updating them a few rows at a time.
PBO still doesn’t beat the standard path (which does not even reach optimal transfer speed).
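For reference, the "a few rows at a time" bookkeeping can be sketched roughly as below. This is not the code from my test program: `RowBand` and `row_band` are names made up for illustration, and the sketch assumes tightly packed rows (GL_UNPACK_ALIGNMENT of 1, or a row size that is already a multiple of the alignment).

```c
#include <assert.h>
#include <stddef.h>

/* One band of rows to upload in a single step (hypothetical helper type). */
typedef struct {
    int    y_offset;     /* first row of the band (yoffset for TexSubImage) */
    int    rows;         /* rows in the band (height for TexSubImage) */
    size_t byte_offset;  /* where the band starts in a tightly packed image */
    size_t byte_count;   /* size of the band in bytes */
} RowBand;

/* Band number `step` when a texture of `height` rows is updated
   `rows_per_step` rows at a time, wrapping around after the last band. */
RowBand row_band(int width, int height, int bytes_per_pixel,
                 int rows_per_step, int step)
{
    assert(width > 0 && height > 0 && bytes_per_pixel > 0);
    assert(rows_per_step > 0 && step >= 0);

    int bands = (height + rows_per_step - 1) / rows_per_step; /* ceiling */
    int index = step % bands;
    size_t row_bytes = (size_t)width * (size_t)bytes_per_pixel;

    RowBand b;
    b.y_offset = index * rows_per_step;
    /* The last band may be shorter when height is not a multiple of the step. */
    b.rows = (height - b.y_offset < rows_per_step) ? height - b.y_offset
                                                   : rows_per_step;
    b.byte_offset = (size_t)b.y_offset * row_bytes;
    b.byte_count  = (size_t)b.rows * row_bytes;
    return b;
}
```

With a buffer bound to the pixel unpack target, the last argument of TexSubImage is read as a byte offset into that buffer, so `byte_offset` is the value that would go there, with `y_offset` and `rows` as the yoffset and height arguments.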

I now believe I’m doing something really wrong.
I’m possibly not taking correct measurements, or maybe I’ve misunderstood how to use PBO or maybe I have managed to get an unrealistic scenario.

Anyway, I believe I need some feedback on other experiences with PBO. Could you give me some hints on getting it to work right?

I personally had great results with PBOs. In my case I didn’t use them to upload things faster, but rather to upload textures asynchronously and use them later on. One thing I found is that I had to use BGRA format to get good results. RGBA seemed to fall back onto a software path. BGR didn’t even give good results.
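If the source data is RGB, one way to try the BGRA path is to swizzle on the CPU before filling the buffer. The following is only a minimal sketch under that assumption: `rgb_to_bgra` is a hypothetical helper, and the extra pass has its own cost, so it may or may not pay off. The output byte order (B, G, R, A) is what GL_BGRA with GL_UNSIGNED_BYTE expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed RGB888 pixels into BGRA8888 with opaque alpha.
   `dst` must hold pixel_count * 4 bytes and must not overlap `src`. */
void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        dst[4 * i + 0] = src[3 * i + 2]; /* blue  */
        dst[4 * i + 1] = src[3 * i + 1]; /* green */
        dst[4 * i + 2] = src[3 * i + 0]; /* red   */
        dst[4 * i + 3] = 255;            /* opaque alpha */
    }
}
```

The destination could just as well be the pointer returned by mapping the PBO, which would avoid an intermediate copy.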

I don’t think PBO exists to make uploading faster. After all, whether you’re doing the upload or the driver is doing it, it is still being done. What you can do with PBO is start the upload and then go do something else.

If you are using NV3x then calling TexSubImage on a PBO RC can be very expensive. I hope that is not what’s hampering your performance?

Originally posted by MalcolmB:
…I didn’t use them to upload things faster, but rather to upload textures asynchronously and use them later on.
Initially, I wanted to try it out just to check whether it was possible to use it that way. Even if it had proven slower, I would still have considered it a win.
Those tests, however, showed that PBO on my system has very low bandwidth. So low that I hardly believe I can use it asynchronously, because sooner or later it’s going to block anyway…

What I believe is that my test case is unrealistic. I will try another test; after all, the current tests are quite far from my expected usage pattern.

Say, however, that I render at 60fps and stream a texture which is (say) a video at 20-30fps; then I only have 2-3 rendered frames to do all the work for each video frame. I’m not sure I can live with this gap.
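To put numbers on that budget, here is a small back-of-the-envelope sketch. The function names are made up for illustration, and the video size in the usage note is an assumption, not something I measured.

```c
#include <assert.h>

/* Rendered frames available for each video frame. */
double frames_per_video_frame(double render_fps, double video_fps)
{
    assert(render_fps > 0.0 && video_fps > 0.0);
    return render_fps / video_fps;
}

/* Sustained upload rate, in bytes per second, needed to keep up with the video. */
double required_bytes_per_second(int width, int height, int bytes_per_pixel,
                                 double video_fps)
{
    assert(width > 0 && height > 0 && bytes_per_pixel > 0 && video_fps > 0.0);
    return (double)width * height * bytes_per_pixel * video_fps;
}
```

At 60fps, a 30fps video leaves 2 rendered frames per video frame and a 20fps video leaves 3. For a hypothetical 640x480 video at 4 bytes per pixel and 30fps, the upload has to sustain 36,864,000 bytes per second.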

Originally posted by MalcolmB:
One thing I found is that I had to use BGRA format to get good results. RGBA seemed to fall back onto a software path. BGR didn’t even give good results.
On my first batch of tests RGBA was indeed slower than BGRA, but not by much. I’ll check whether the second test changes anything.

I still believe, however, that PBO should make the upload faster even when synchronous, because it can enforce a much stronger policy on data alignment and such. It’s only speculation, but I don’t see why it should not… at least, it should not be much slower.

Originally posted by Zulfiqar Malik:
If you are using NV3x then calling TexSubImage on a PBO RC can be very expensive. I hope that is not what’s hampering your performance?
This is an interesting detail but (un?)luckily I’m on NV4x, specifically a 6600GT.

Although I haven’t extensively tested other PCs, I believe there’s something broken in my system. I’ll check it out soon. I’ll also try a more realistic test which is somewhat closer to my final usage pattern.

I haven’t been able to test other systems because I’ve been too busy developing other tests.
Now, here are the facts: PBO is slower (at least on my system) in most cases OR I am systematically measuring incorrect values. I hope it is the latter, but I cannot figure out what I’m doing wrong.

First of all, I’ve done all the testing on RGB. I did some runs with BGR but it didn’t give me any interesting difference, so I carried on without it.

I tested again with four RGB888 textures sized 1024^2, 512^2 or 256^2. Each run uses the same texture size for all four. At worst, this should be around 38MB (I’m not using mipmaps). I wanted to run with both high and low CPU load. I also tried using a single PBO or one PBO per texture, using SubData or Map.

For low CPU load, PBO never turned out to be the winner. The measured framerate is lower and there seems to be no “perceived performance” improvement. In some cases, PBO made things even worse. Worse still, using multiple PBOs actually decreased performance.

It has been stated that PBO is good for streaming because of its parallel execution. I did get really impressive performance gains in some cases, but again, the gain only showed up in the measurements. Perceived performance stayed the same, with occasional hiccups. This means the gain is pretty useless at best.
When it comes to streaming at a high rate, PBO seems unable to keep up.

I noticed PBO needs some work (in other GL commands, I mean) to flex its muscles. This doesn’t change the fact that I’m now really out of ideas. Is there something broken with my testing? Am I measuring correctly?

At the end of this test run, what I’ve found is confirmation of my fears. Asynchronous or not, the performance is so low that it ends up lowering overall speed anyway.

Thanks to everyone who replied to my first post; you really pushed me into doing more tests. These showed my previous results were too pessimistic. PBO looks much better in this run, but it is still not an attractive option.

Just out of curiosity, how does the Nvidia PBO demo perform on your system?
http://download.developer.nvidia.com/dev…ePerformancePBO

Is there a dramatic speed increase?

You gave me The Right Idea. Why didn’t I think of it before?
Thank you very much, this is definitely welcome. Now I’ll know whether

  1. My system is broken OR
  2. My program is broken

Both alternatives are “very nice”. :wink:

By the way, I didn’t have time to bench other systems. Tomorrow I’ll be away; expect more information the day after.

EDIT 19th Dec 05:
I have been unable to run the tests so far - I cleaned up the whole system by formatting it completely, and this took some time out of my quite busy schedule. Let’s see if I can pull this off before the end of the week…