Texture upload, multithreading, PBOs, FBOs

Hi everybody,
I’ll start by describing the scenario behind my question:
I need to optimize the rendering performance of a video surveillance application, which handles up to 16 video streams (720x576 pixels @ 25 fps).
Each decoder instance decodes its stream in its own thread and the uncompressed frames are stored in the decoder’s buffer queue.
Each decoder then notifies its OpenGL video panel that a new frame is available and tells the panel to repaint; the panel then repaints itself by executing the OpenGL code in the graphics thread: basically, it calls glTexSubImage2D(…), passing the pointer to the frame taken from the queue.
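Schematically, the per-frame upload looks roughly like this (a rough sketch; the names and the pixel format enums are hypothetical stand-ins for what the real code does):

    /* One panel’s repaint, roughly; panel->tex and frame->pixels
       are hypothetical stand-ins for the real objects. */
    glBindTexture(GL_TEXTURE_2D, panel->tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0,         /* target, mip level         */
                    0, 0, 720, 576,           /* x, y, width, height       */
                    GL_BGR, GL_UNSIGNED_BYTE, /* pixel transfer parameters */
                    frame->pixels);           /* pointer from the queue    */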

The bottlenecks here seem to be both the bus bandwidth and the CPU speed, which suggests that the frames are not being transferred using DMA (I’ve done tests without the decoding process to be sure).

I searched and read a lot of tutorials and posts, and it seems that the way to do asynchronous texture uploads is with PBOs; but then I read about FBOs, which could avoid a copy from the buffer object to the framebuffer (if I understand correctly).

So I’m a bit confused, and here’s the question:
Is it possible to use a mapped PBO (or FBO?) directly as the decoder’s output buffer, so that the decoder writes to it in its own thread, avoiding a copy from RAM to the PBO and allowing the frames to be transferred to the GPU from multiple threads rather than from the single graphics thread?

If not, what would be the solution to maximize performance?

I apologize for the long post,
and thanks a lot in advance!

Hi Matteo,

I don’t have much experience with multithreaded rendering, but mapping/unmapping a buffer should trigger an asynchronous transfer of the buffer data from/to the GPU. This means that while it is being copied, you can use the CPU for “other” things at the same time. However, I would assume that if the bottleneck is the bus transfer speed, the problem won’t disappear - all the data has to be transferred to the GPU side at some point…

Also, I wonder whether the amount of data transferred is really that much. Assuming the image data is 8-bit grayscale (surveillance camera?), you have 720×576 pixels × 25 fps × 16 streams × 1 byte/pixel ≈ 160 MB/s, if I am not mistaken.

Considering the theoretical limit of 4 GB/s for a PCIe x16 device, this should be no problem even in practice.

Have you tried benchmarking asynchronous transfer speed with a simple single-threaded test application? (There should be a few out there.)
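For what it’s worth, the test doesn’t need to be fancy; a rough sketch like this (GL context assumed current, texture already allocated via glTexImage2D, all sizes made up; GL_BGRA may need glext.h on some platforms) already gives a ballpark MB/s figure:

    #include <GL/gl.h>
    #include <sys/time.h>

    enum { W = 720, H = 576, N = 500 };
    static unsigned char frame[W * H * 4];   /* one dummy 32-bit frame */

    static double now_sec(void)              /* wall-clock seconds */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Upload N frames into an already-allocated RGBA texture and return
       MB/s; glFinish() forces the transfers to complete before the clock
       stops, otherwise you only measure the time to queue the commands. */
    double bench_upload(GLuint tex)
    {
        glBindTexture(GL_TEXTURE_2D, tex);
        double t0 = now_sec();
        for (int i = 0; i < N; ++i)
            glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,
                            GL_BGRA, GL_UNSIGNED_BYTE, frame);
        glFinish();
        return N * (double)W * H * 4 / (1024.0 * 1024.0) / (now_sec() - t0);
    }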

The bottlenecks here seem to be both the bus bandwidth and the CPU speed, which suggests that the frames are not being transferred using DMA (I’ve done tests without the decoding process to be sure).

Then that’s the first place to look. The best scenario would be for your pixel data to be in a format that the hardware can copy directly. What are the pixel transfer parameters (the last two enums) you pass to glTexSubImage2D, and what internal format do you request in glTexImage2D (its third parameter)?

I read about FBOs, which could avoid a copy from the buffer object to the framebuffer (if I understand correctly).

FBO stands for Framebuffer Object. It is an object for handling framebuffers. It has nothing to do with buffer objects.

Is it possible to use a mapped PBO (or FBO?) directly as the decoder’s output buffer, so that the decoder writes to it in its own thread, avoiding a copy from RAM to the PBO and allowing the frames to be transferred to the GPU from multiple threads rather than from the single graphics thread?

Yes, you can map a PBO and upload your data directly into it. Assuming that the decoder can write to an arbitrary memory pointer, of course. The thread doesn’t matter (so long as you don’t unmap the buffer while you’re writing to it. That would be bad).
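To make that concrete, here’s a minimal sketch of the round trip (one PBO per stream assumed, BGRA frames, GL 2.1 / ARB_pixel_buffer_object names; note the map/unmap calls still have to happen on the thread that owns the GL context - only the raw pointer crosses over to the decoder thread):

    GLuint pbo, tex;   /* created once per stream; tex allocation not shown */
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, 720 * 576 * 4, NULL, GL_STREAM_DRAW);

    /* GL thread: map the buffer and hand the pointer to the decoder. */
    void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    /* ... decoder thread writes one 720x576 BGRA frame into dst ... */

    /* GL thread, once the decoder signals the frame is complete: */
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glBindTexture(GL_TEXTURE_2D, tex);
    /* With a PBO bound, the last argument is an offset into the buffer,
       not a client pointer, so the driver can DMA out of the PBO. */
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 720, 576,
                    GL_BGRA, GL_UNSIGNED_BYTE, (const GLvoid *)0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);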

Thanks for the responses!

Also, I wonder whether the amount of data transferred is really that much. Assuming the image data is 8-bit grayscale (surveillance camera?), you have 720×576 pixels × 25 fps × 16 streams × 1 byte/pixel ≈ 160 MB/s, if I am not mistaken.

Considering the theoretical limit of 4 GB/s for a PCIe x16 device, this should be no problem even in practice.

The best scenario would be for your pixel data to be in a format that the hardware can copy directly. What are the pixel transfer parameters (the last two enums) you pass to glTexSubImage2D, and what internal format do you request in glTexImage2D (its third parameter)?

The frames are 24 bits per pixel in RGB format (the third parameter is GL_RGB), and the pixel transfer parameters are GL_BGR and GL_UNSIGNED_BYTE.

I’ve read this paper from nVidia, and I know that texture upload speed can be poor with certain pixel formats; in my case I’d be better off having my frames decoded in BGR rather than RGB format, right?

The frames are 24 bits per pixel in RGB format

If you mean that each pixel takes up 3 bytes, with no extra byte to align each pixel to 32 bits, then that’s bad.

You really want to upload this as RGBA, even if the alpha is ignored. Most implementations will use an RGBA format internally anyway, just for alignment reasons (the alpha will simply always be 1). Is there some way you could get your decoder to generate the data as 32-bit pixels rather than 24-bit ones?

Otherwise, the driver will have to walk your data (on the CPU), convert each 24-bit pixel into a 32-bit pixel, and then perform the DMA.
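For reference, a sketch of what that looks like (frame_ptr is a hypothetical pointer to the decoded 32-bit pixels; GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV is the combination usually cited as matching the native hardware layout, but verify it on your target cards):

    /* Allocate a 32-bit texture once... */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 720, 576, 0,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, NULL);

    /* ...then update it each frame with matching 32-bit pixels, so the
       driver can hand the data to the DMA engine without repacking it
       on the CPU first. */
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 720, 576,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, frame_ptr);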

That’s definitely bad!
Luckily I think I can tell the decoder to output 32-bit pixels…

I am also wondering about another thing: since I could also tell the decoder to output the frames in YUV format, do you think it would be better to offload the color space conversion to the GPU with a shader, or not? (I don’t know anything about shaders and how they work.)

YUV could be the best idea. With some shader code, you can overcome the need to pad to 32-bit, too.
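For example (a hypothetical sketch, assuming the decoder outputs planar YUV 4:2:0 and the Y, U and V planes are uploaded as three separate GL_LUMINANCE textures on units 0-2; texY/texU/texV are made-up uniform names, and the constants are the usual BT.601 ones):

    /* Fragment shader source for YUV -> RGB, embedded as a C string. */
    static const char *yuv2rgb_frag =
        "uniform sampler2D texY, texU, texV;\n"
        "void main() {\n"
        "    float y = 1.1643 * (texture2D(texY, gl_TexCoord[0].st).r - 0.0625);\n"
        "    float u = texture2D(texU, gl_TexCoord[0].st).r - 0.5;\n"
        "    float v = texture2D(texV, gl_TexCoord[0].st).r - 0.5;\n"
        "    gl_FragColor = vec4(y + 1.5958 * v,\n"
        "                        y - 0.3917 * u - 0.8129 * v,\n"
        "                        y + 2.0170 * u, 1.0);\n"
        "}\n";

The half-resolution U/V planes work with the same texture coordinates because coordinates are normalized. As a bonus, 4:2:0 data is only 12 bits per pixel, so this also cuts bus traffic compared to 32-bit RGBA.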

I’m somewhat more concerned about the 16 “panels”, and them redrawing themselves - one frame per panel??

What GPUs are you targeting (e.g. GeForce 7x00+ / Radeon X1x00, or GF8x00+ / Radeon HD2x00+)?

YUV could be the best idea. With some shader code, you can overcome the need to pad to 32-bit, too.

Thanks! That’s good news!

I’m somewhat more concerned about the 16 “panels”, and them redrawing themselves - one frame per panel??

I have a grid of 4×4 panels: every time a frame is decoded, the decoder triggers a repaint of its panel; the graphics thread then calls the panel’s actual paint method (where the OpenGL code is executed) as soon as possible.
This happens 25 times per second for each video stream.

[The application is written in Java, but the decoding library is written in C on top of ffmpeg and is only wrapped in a Java class; the same goes for the OpenGL panels, which are JOGL panels.]

What GPUs are you targeting (e.g. GeForce 7x00+ / Radeon X1x00, or GF8x00+ / Radeon HD2x00+)?

My target is to optimize performance so that on low-end or older GPUs I can manage to display at least 2-5 frames per second per panel, while on high-end GPUs I hope to display every decoded frame.