PBOs slow... Looking for the reason..

I have a fairly mature chunk of code that repeatedly updates lots of texture tiles, uploading them from client memory with glTexSubImage2D into a large texture on the GPU.

As an experiment, and to reduce CPU stalling, I modified the code so that it wrote the new tiles directly into PBOs, and then used a call to glTexSubImage2D(…,…,…,…,…,…,NULL) to effect the texture update on the GPU internally.
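Boiled down, the two paths look roughly like this (a simplified sketch, not my actual code; the GLEW include, the BGRA format and the helper names are just for illustration):

    #include <GL/glew.h>   // or whatever loads the GL extension entry points
    #include <cstring>     // memcpy

    // Old path: the tile goes straight from client memory into the texture.
    void uploadTileDirect(GLuint atlasTex, int xoff, int yoff, const void* tile)
    {
        glBindTexture(GL_TEXTURE_2D, atlasTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, xoff, yoff, 17, 17,
                        GL_BGRA, GL_UNSIGNED_BYTE, tile);
    }

    // New path: write the tile into a PBO first, then let the driver do the
    // PBO -> texture copy. With a pixel-unpack PBO bound, the last argument of
    // glTexSubImage2D is a byte offset into the buffer, not a client pointer.
    void uploadTileViaPBO(GLuint atlasTex, GLuint pbo, int xoff, int yoff, const void* tile)
    {
        const size_t tileBytes = 17 * 17 * 4;

        glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
        void* dst = glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
        std::memcpy(dst, tile, tileBytes);
        glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);

        glBindTexture(GL_TEXTURE_2D, atlasTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, xoff, yoff, 17, 17,
                        GL_BGRA, GL_UNSIGNED_BYTE, (const GLvoid*)0);
        glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
    }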

I have always used “fenced” client-space RAM in any case, so while I was aware that I would not save any memory copies, and that the upload would always run at DMA speeds, I hoped that the stalling would be reduced.

All texture formats used are natively supported. The texture tiles are four-channel, 17 x 17 texels.

I was surprised to see an overall speed drop of about 10% doing things this way (and, weirdly, no change in CPU load at all), and I'm wondering why.

My only theory is that now that the copies take place on the GPU, rather than with the CPU “assisting”, the GPU is struggling to manage them as well as the rendering, and the CPU still ends up having to wait for the GPU elsewhere…

Can anyone else make any suggestions, or do I have it about right?

One further question for anyone who knows…
If I stop using textures and only use PBOs, is it worth considering glBufferSubDataARB instead of glTexSubImage2D, so that the texture update is a one-time copy from the CPU, straight from whatever algorithm generates the tile data?

Are there any pitfalls to this idea? Internet info is sparse on this.
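Roughly what I have in mind (just a sketch; the names and the BGRA format are made up, and you'd obviously still need the glTexSubImage2D to move the data from the PBO into the texture, but the CPU-side write becomes a single glBufferSubDataARB instead of a map/memcpy/unmap):

    // Assumes the same GL headers as above; the PBO's storage is assumed to
    // have been defined earlier with glBufferDataARB.
    void uploadGeneratedTile(GLuint atlasTex, GLuint pbo, int xoff, int yoff,
                             const void* generatedTile)
    {
        const GLsizeiptrARB tileBytes = 17 * 17 * 4;

        // Copy the freshly generated tile straight into the PBO.
        glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
        glBufferSubDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0, tileBytes, generatedTile);

        // Then update the texture from the PBO (last argument is a byte offset).
        glBindTexture(GL_TEXTURE_2D, atlasTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, xoff, yoff, 17, 17,
                        GL_BGRA, GL_UNSIGNED_BYTE, (const GLvoid*)0);
        glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
    }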

In my experience with using PBOs for streaming texture updates, they are only worthwhile if you have to update a “large” part of the texture. “Large” here is perhaps on the order of 50,000–100,000 texels. The cost of setting up a PBO transfer is not insignificant, so for many small updates PBOs are likely to be slower than plain glTexSub*, which seems to have a small overhead.

It sounds like you’re bottlenecked by the large number of API calls you’re issuing to update the texture atlas. How many tiles/updates are you working with? Maybe you should experiment with preparing the whole update texture on the CPU, since it sounds like that’s where you generate the tile data anyway (maybe in a separate thread). Once you have the whole texture ready, you’ll probably see a benefit from transferring it with a PBO.

As for the texture format, I don’t think it’s all that important with the CPUs we have these days. In my experience you can expect an additional 5–10% if you organize the texels optimally (e.g. BGRA for NV cards; see the whitepapers on their website). Go for it as a last resort.

Thanks for the feedback…

I am already using native texture formats, so I’ve gone as far as I can with that.

And I have the PBOs pre-defined in a rotating stack, and call glBufferDataARB() with NULL to prevent any stalling when I map a buffer, just in case it’s still in use.
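For reference, the rotation looks something like this (a rough sketch; the counts, sizes and names are invented):

    // Assumes the same GL headers as in my earlier sketch.
    const int    kNumPBOs   = 100;                 // size of the rotating stack
    const size_t kTileBytes = 17 * 17 * 4;
    GLuint pbos[kNumPBOs];                         // created once with glGenBuffersARB
    int    nextPBO = 0;

    // Per tile: take the next PBO in the ring, orphan its old storage so the
    // map never blocks on a still-pending copy, then write the fresh tile in.
    void fillNextPBO(const void* tile)
    {
        GLuint pbo = pbos[nextPBO];
        nextPBO = (nextPBO + 1) % kNumPBOs;

        glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
        glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, kTileBytes, NULL,
                        GL_STREAM_DRAW_ARB);       // NULL data pointer = orphan
        void* dst = glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
        std::memcpy(dst, tile, kTileBytes);
        glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);
    }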

I guess I should profile it to see where exactly the time is spent. I thought setting it up that way would keep the overhead low. I guess not.

EDIT: Having said that, I just spotted a potential problem where I may be forcing the texture to live in shared memory, which I’m guessing could in theory trigger a copy from the PBO back to client space… Not good.

Removing that, I am now getting the same performance, maybe even slightly better, with the PBO method. I also found I can now reduce the number of PBO “units” I use quite considerably without dropping any tiles…

It’s hard to quantify the performance as it’s in a mature application, but “by feel” I’d say it’s certainly no slower now, and is giving perhaps a 1 or 2 fps boost. At 20 fps that is worth having, at the cost of a bit of GPU memory being traded away.

To answer your question, I am uploading between 20 and a couple of hundred tiles per frame. Assuming the transfers from PBO to texture really are asynchronous, I guess I need to find the sweet spot between PBO buffer space and throughput…

BUMP

What latency do people expect to see on an internal copy from a PBO to a texture?

The reason I ask is that my PBOs are definitely working; if they weren’t, the textures would not be visible! They are also definitely slightly faster (albeit by a very small percentage). However, I can be particularly brutal with re-using the PBOs.

What I mean by this is that I initially had about 100 in a rotating buffer, so at least 100 copies would have to be queued up before a buffer got forcibly re-used while its copy was still pending, assuming I understand the mechanics of that last part correctly.

I reduced that to 10 to see what would happen, expecting to see a few partial textures when things got busy… But I didn’t.

Reducing that to even 2 does not cause a visible problem. When I am doing up to a hundred per frame this seems weird… And in my experience when something seems too good to be true you are generally missing something!

Can anyone with a better understanding of when these copies are actually done comment? Thanks.

If you use glBufferData, the driver can orphan the existing storage and create new storage. This is what is probably happening. The driver sees that there is still a read operation queued for the storage backing the buffer object, so it allocates more memory and uses that. By doing so it doesn’t have to block. When the read operation completes on the old backing store, the driver will release it. As long as you’re not experiencing memory pressure, the driver shouldn’t ever have to block.

Thanks. That does make sense, and ties in with what seems to be happening.

Maybe a bit late to this, but did you do many tests on image size and pixel format? They have a massive effect on transfer speeds. I have a feeling that your 17 x 17 images are being padded out to 64 x 64 or so before being transferred. That would probably negate any DMA gains. Likewise if you’re not using one of the fast pixel formats.

I run a script of tests using TransferBench when I’m choosing a size and pixel format, to stay on the “fast path”. It’s a great tool.

Bruce

Thanks for that Bruce…

I’ll take a look.
Assuming I am losing out on that choice of size, I’m guessing it would be relatively easy to batch texture tiles together and then use the offset in glTexSubImage2D to distribute them internally on the GPU.

Well that was interesting…

Firstly, using glTexSubImage2D to upload the tiles directly into a texture is almost double the speed of using PBOs, according to the benchmark program.

Doubling the upload size to 34 x 34 instead of 17 x 17 roughly doubles the overall speed in both PBO and non-PBO uploads.
Using 64 x 64 gives a small increase over 34 x 34, but nothing significant.

At first glance those results seem to show a simple way for me to get a nice speed boost on texture uploads, and also stay close to a dimension that suits my app’s needs.

Increasing the texture size further, and playing with POT and non-POT sizes, increases the overall speed as expected.

(Interestingly, I could prep a batch of 17 x 17 tiles into a PBO and then distribute them all over the texture by using an offset into the PBO, with no penalty and an increase in upload speed overall. This was not as fast as direct upload, but it means I don’t have to upload batches of neighboring tiles.)
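In case it’s useful to anyone, the batched PBO test was roughly this (a simplified sketch; the names are invented):

    // Assumes the same GL headers as the earlier sketches.
    void uploadTileBatch(GLuint atlasTex, GLuint batchPBO, int numTiles,
                         const unsigned char* const* tiles,
                         const int* destX, const int* destY)
    {
        const size_t tileBytes = 17 * 17 * 4;

        // Pack the batch of 17 x 17 tiles back-to-back into one PBO.
        glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, batchPBO);
        glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, numTiles * tileBytes, NULL,
                        GL_STREAM_DRAW_ARB);
        char* dst = (char*)glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
        for (int i = 0; i < numTiles; ++i)
            std::memcpy(dst + i * tileBytes, tiles[i], tileBytes);
        glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);

        // One glTexSubImage2D per tile, each pointing at a different byte
        // offset inside the PBO, so the tiles can land anywhere in the texture.
        glBindTexture(GL_TEXTURE_2D, atlasTex);
        for (int i = 0; i < numTiles; ++i)
            glTexSubImage2D(GL_TEXTURE_2D, 0, destX[i], destY[i], 17, 17,
                            GL_BGRA, GL_UNSIGNED_BYTE,
                            (const GLvoid*)(i * tileBytes));
        glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
    }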

Those are all results from benchmarking.

However, in the real-world application, a simple hack where I batch four neighboring 17 x 17 tiles into a single 34 x 34 upload (instead of uploading the 17 x 17 tiles individually) shows an identical uSec time for each glTexSubImage2D call, but an increase in the amount of app time and GL time taken by the calls.
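The hack itself is nothing clever, roughly this (simplified, names invented; the real code pulls the four tiles from wherever they were generated):

    // Assumes the same GL headers as the earlier sketches.
    // Copy four neighboring 17 x 17 tiles into a 2 x 2 arrangement inside one
    // 34 x 34 staging block, then issue a single glTexSubImage2D for the block.
    void uploadFourTileBlock(GLuint atlasTex, int destX, int destY,
                             const unsigned char* const tiles[4])
    {
        const int T = 17, B = 2 * T;               // tile and block edge in texels
        static unsigned char block[B * B * 4];     // 34 x 34, 4 bytes per texel

        for (int q = 0; q < 4; ++q)                // quadrant 0..3
        {
            const int bx = (q % 2) * T;            // quadrant origin in the block
            const int by = (q / 2) * T;
            for (int row = 0; row < T; ++row)
                std::memcpy(block + ((by + row) * B + bx) * 4,
                            tiles[q] + row * T * 4, T * 4);
        }

        // No PBO bound here, so the last argument is a plain client pointer.
        glBindTexture(GL_TEXTURE_2D, atlasTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, destX, destY, B, B,
                        GL_BGRA, GL_UNSIGNED_BYTE, block);
    }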

So… The end result is no discernible change in the application’s speed at all.
This seems strange when texture uploading is a large portion of what the program is doing…

Still it was an interesting afternoon playing with the permutations, and I will go back and look at this again in the future…