Fast transfer of video frames into the graphics card

Hi Everybody,

I am new to OpenGL, but I have written a couple of video drivers for embedded systems where I had direct access to the video frame buffer.

Right now, all I want is a fast transfer of video frames into the graphics card.

The hardware is a standard PC with a dual-core CPU and most probably an NVIDIA graphics card. The application will run on both Windows and Linux. Qt has been chosen as the user interface framework.

Due to portability requirements (Linux and Windows), I want to use OpenGL through the Qt QGLWidget classes.

I will not perform any 3D fancy stuff. All I need is:

  1. Fast transfer of video frames into the graphics card.
    (The target is 16 video windows, each fed a 640x480@25fps video stream. This means 400 frames per second in total.)

  2. The scaling of these images into smaller windows. This should be performed on the GPU (graphics processor) if possible.

So I need a fast frame transfer (bitblt) and scaling.

I tried a few things with Qt QGLWidget and PBOs, but the result is ugly: a single video window consumes almost 30-45% of the CPU.

So I need some advice maybe some article, tutorial or code samples to start with.

Unfortunately, most OpenGL tutorials concentrate on the 3D drawing features of OpenGL.

My first stab at this would be to try and do some kind of streaming…

I found this to be a nice article on that…
http://www.songho.ca/opengl/gl_pbo.html#unpack

Specifically…

The texture sources are written directly on the mapped pixel buffer every frame in the PBO modes. Then, these data are transferred from the PBO to a texture object using glTexSubImage2D(). By using PBO, OpenGL can perform asynchronous DMA transfer between a PBO and a texture object. It significantly increases the texture upload performance. If asynchronous DMA transfer is supported, glTexSubImage2D() should return immediately, and CPU can process other jobs without waiting the actual texture copy.
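For reference, a minimal sketch of that technique (not from the article verbatim; videoFrame, tex and the BGRA format are placeholders for your own frame source and texture object):

// One-time setup: create a PBO sized for one frame.
const int W = 640, H = 480, FRAME_BYTES = W * H * 4;
GLuint pbo;
glGenBuffersARB(1, &pbo);
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, FRAME_BYTES, 0, GL_STREAM_DRAW_ARB);

// Per frame: map the PBO, write the new frame into it, unmap,
// then start the asynchronous PBO -> texture transfer.
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
void* ptr = glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if (ptr) {
    memcpy(ptr, videoFrame, FRAME_BYTES);   // copy the decoded frame
    glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);
}
glBindTexture(GL_TEXTURE_2D, tex);
// With a PBO bound, the last argument is an offset into the PBO,
// not a client-memory pointer.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H, GL_BGRA, GL_UNSIGNED_BYTE, 0);
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);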

Once you get that working, the scaling part should be very easy on the GPU side.

And for better performance, draw all textures from frame i, while uploading textures for frame i+1.
If you can afford the slight latency, it will greatly improve parallelism.
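A rough sketch of that ping-pong scheme, assuming two PBOs pbo[0] and pbo[1] created as in the sketch above:

// Each frame: start the async texture copy from the PBO that was
// filled last frame, then fill the other PBO with the next frame.
static int index = 0;
int next = (index + 1) % 2;

glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[index]);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H, GL_BGRA, GL_UNSIGNED_BYTE, 0);

glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[next]);
// Re-specifying the store "orphans" the old data, so the driver
// does not have to wait for the previous transfer to finish.
glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, FRAME_BYTES, 0, GL_STREAM_DRAW_ARB);
void* ptr = glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if (ptr) {
    memcpy(ptr, nextVideoFrame, FRAME_BYTES);
    glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);
}
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
index = next;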

Thanx for the tips…

Especially the article on PBOs is a very good one. I am now trying to implement it using the QPixelBuffer class of Qt. No success yet.

By the way, another problem I ran into with OpenGL is this: the coordinate system of OpenGL is not the same as the traditional video coordinate system. If I send my images directly to OpenGL (using glDrawPixels()), the image is shown upside down and mirrored. So before displaying them, I need to convert them (using QGLWidget::convertToGLFormat()), but this consumes a lot of CPU cycles.

Instead, I am now sending the frame to the video card without any conversion, as an OpenGL texture. Then, by applying OpenGL rotations and transformations, I get the corrected image. This saved 25-30% of the CPU cycles, but is it the correct way? Are there simpler and faster solutions?

Maybe it's better if I put the previous question into another thread :slight_smile:

Your first choice (pushing pixels) was about the slowest path you could choose. :wink:

A better way IMO than the rotations and stuff (but only marginally) is to swap the texture coordinates around so that they flip the image. It's just neater, and you can then simply draw the quad (or whatever you are putting the texture on). For example:
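Something along these lines (fixed-function immediate mode; w, h and tex are placeholders for your window size and texture). The t coordinates are swapped so a top-down video frame comes out the right way up; if the image is also mirrored horizontally, swap the s coordinates (0 and 1) the same way:

// Draw a textured quad with the vertical (t) texture coordinate
// flipped: t=1 at the bottom of the quad, t=0 at the top.
glBindTexture(GL_TEXTURE_2D, tex);
glBegin(GL_QUADS);
    glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 0.0f);  // bottom-left
    glTexCoord2f(1.0f, 1.0f); glVertex2f(w,    0.0f);  // bottom-right
    glTexCoord2f(1.0f, 0.0f); glVertex2f(w,    h);     // top-right
    glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, h);     // top-left
glEnd();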

I’ve never done this kind of stuff before, but it interests me…
Are the images encoded in any way, or do you just get them as raw data, i.e. raw RGB data?
If they were encoded I’d love to see the difference between doing that on the CPU and in a shader on the GPU… :slight_smile:

Your first choice (pushing pixels) was about the slowest path you could choose.

I may be mistaken, but I would not be so sure compared with PBOs. Currently he is uploading the video frame data directly to the application-provided framebuffer. If you use a texture, as you seem to suggest, you would still have to upload the frame data to the texture and then rasterize a textured quad.

For the upside-down image issue, show us how you set up your matrices and viewport at program start.

if you set the viewport like this:

glViewport( 0, 0, w, h );

you can try this:

glViewport( w, h, 0, 0 );

Enabling my brain this time: I advise you to forget what I said about the viewport; that is not how it works. Actually, you can get the flip with glOrtho or glFrustum… :stuck_out_tongue:

But in your case, just draw a textured quad as others said and adjust the texture coordinates.
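For reference, the glOrtho flip mentioned above looks like this (w and h being the window size):

// Swapping the bottom and top arguments of glOrtho gives a y-down
// coordinate system like video memory, so (0,0) is the top-left.
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0.0, w, h, 0.0, -1.0, 1.0);  // left, right, bottom=h, top=0
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();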

Hello everybody,

I just want to inform you about my current results.

For the fast transfer of video frames into the graphics card, I have implemented pixel buffer objects together with the Qt QGLWidget, as described in http://www.songho.ca/opengl/gl_pbo.html#unpack.

For a single 320x240@25fps JPEG stream, including network capture, decompression and scaling, the program consumes 2% CPU on Linux and 7% CPU on Windows with a 2.2 GHz dual-core Intel CPU and an NVIDIA GeForce 9500M video card.

Getting it running on Windows is a bit trickier, since opengl32.dll only covers the functions included in the OpenGL 1.1 specification. Functions like glBindBufferARB() and glBufferDataARB() have to be obtained through “extensions”; see the sketch below.
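A minimal hand-rolled loader looks like this (a sketch; libraries such as GLEW automate this, and you need a current OpenGL context before calling wglGetProcAddress):

#include <windows.h>
#include <GL/gl.h>
#include <GL/glext.h>   // PFNGL...PROC typedefs

// Function pointers for the entry points missing from opengl32.dll.
PFNGLGENBUFFERSARBPROC glGenBuffersARB = 0;
PFNGLBINDBUFFERARBPROC glBindBufferARB = 0;
PFNGLBUFFERDATAARBPROC glBufferDataARB = 0;

bool loadPboExtensions()
{
    glGenBuffersARB = (PFNGLGENBUFFERSARBPROC)wglGetProcAddress("glGenBuffersARB");
    glBindBufferARB = (PFNGLBINDBUFFERARBPROC)wglGetProcAddress("glBindBufferARB");
    glBufferDataARB = (PFNGLBUFFERDATAARBPROC)wglGetProcAddress("glBufferDataARB");
    return glGenBuffersARB && glBindBufferARB && glBufferDataARB;
}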

I do not yet understand the reason for the performance difference between the Linux and Windows versions (the hardware configuration is almost the same).

Different OSes have completely different performance characteristics. Which version of Windows are you running? Is the compositor (i.e. Aero) enabled?

Edit: I should also mention that the fastest way to display video on Linux is to use Xv, not OpenGL. With NVIDIA hardware, you can even perform the decompression of the video stream directly on the GPU, which is an order of magnitude faster than doing it on the CPU.

Disregard this edit if you are using OpenGL for reasons other than video upload.

Correction:

For a single 320x240@25fps JPEG stream, including network capture, decompression and scaling, the program consumes 2% CPU on Linux and 3% CPU on Windows with a 2.2 GHz dual-core Intel CPU and an NVIDIA GeForce 9500M video card.

I had just mistakenly used the “debug” executable on Windows :slight_smile:

Stephen:

I am using OpenGL for two reasons:

  1. Portability: My app should run on both Windows and Linux.
  2. Qt: The application GUI framework has been chosen as Qt, and OpenGL is already integrated with Qt.

Another option may be to use SDL. SDL uses DirectX on Windows and DirectFB on Linux.

I will check Xv. Is it easily portable?

16 video streams at 640x480@25fps is a lot of data, unless the video is not updated every frame…
I would recommend a fast system bus with PCIe 2.0 support and a modern graphics card.

I don’t know about the Qt interface, but I suggest going for an FBO instead of a PBO. Try attaching a renderbuffer object to the FBO; see the sketch below.
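For what it’s worth, the EXT_framebuffer_object setup looks roughly like this (a sketch with placeholder sizes; note that an FBO is an off-screen render target, so it helps with the scaling/compositing side rather than with the frame upload itself):

// Create an FBO and attach a renderbuffer as its color buffer.
GLuint fbo, rbo;
glGenFramebuffersEXT(1, &fbo);
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);

glGenRenderbuffersEXT(1, &rbo);
glBindRenderbufferEXT(GL_RENDERBUFFER_EXT, rbo);
glRenderbufferStorageEXT(GL_RENDERBUFFER_EXT, GL_RGBA8, 640, 480);
glFramebufferRenderbufferEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                             GL_RENDERBUFFER_EXT, rbo);

if (glCheckFramebufferStatusEXT(GL_FRAMEBUFFER_EXT) != GL_FRAMEBUFFER_COMPLETE_EXT) {
    // handle an incomplete framebuffer
}

glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);  // back to the window framebuffer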