Where's the load? Rendering a few quads...

I am rendering a few quads onto the screen in a video decoder application. I've been working on getting this as highly optimized as possible; however, there seems to be an unjustifiable load on a per-quad basis.

This decoder is a multithreaded Linux app that allows me to play numerous streams of video, one stream per thread. The decoder threads publish YUV video data to a rendering thread, which then renders the video onto quads on the screen. I am using an NVIDIA card with their closed-source drivers. Anyway, I have found that the decoders take about 5% CPU load each, but the rendering thread takes about 3% CPU per stream. By commenting out the texture-loading portion of the code, I have traced 1% CPU per stream to texture loading; however, I am still seeing 2% CPU going to sending a single quad to the screen.
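For reference, the texture-loading path (the part I commented out to measure) is something along these lines; the one-texture-per-plane layout and the names here are simplified for illustration, not the exact code:

    /* Sketch: per-frame upload of one YUV plane into an existing
     * GL_LUMINANCE texture (legacy fixed-function GL). */
    #include <GL/gl.h>

    static void upload_plane(GLuint tex, int w, int h,
                             const unsigned char *pixels)
    {
        glBindTexture(GL_TEXTURE_2D, tex);
        /* Sub-image update so the texture storage is not reallocated
         * every frame. */
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                        GL_LUMINANCE, GL_UNSIGNED_BYTE, pixels);
    }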

I have tried plenty of tricks to get rid of that final 2%. I have switched from rendering quads to triangle strips. I have implemented vertex buffer objects and then rendered through a display list. No matter what I do, the 2% CPU load per quad seems to be invariant.
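The triangle-strip/VBO variant amounts to something like this (a simplified sketch; GL_GLEXT_PROTOTYPES is assumed so the buffer-object entry points are declared, and the interleaved layout is just for illustration):

    /* One textured quad drawn as a two-triangle strip from a VBO. */
    #define GL_GLEXT_PROTOTYPES
    #include <GL/gl.h>
    #include <GL/glext.h>

    /* Interleaved x, y, u, v for the four corners of the strip. */
    static const GLfloat quad[16] = {
        -1.f, -1.f, 0.f, 1.f,
         1.f, -1.f, 1.f, 1.f,
        -1.f,  1.f, 0.f, 0.f,
         1.f,  1.f, 1.f, 0.f,
    };

    static GLuint make_quad_vbo(void)
    {
        GLuint vbo;
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof(quad), quad, GL_STATIC_DRAW);
        return vbo;
    }

    static void draw_quad(GLuint vbo)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glVertexPointer(2, GL_FLOAT, 4 * sizeof(GLfloat), (void *)0);
        glTexCoordPointer(2, GL_FLOAT, 4 * sizeof(GLfloat),
                          (void *)(2 * sizeof(GLfloat)));
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
        glDisableClientState(GL_TEXTURE_COORD_ARRAY);
        glDisableClientState(GL_VERTEX_ARRAY);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }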

In oprofile, libGLcore.so.1.0.9746 seems to be consuming a lot of CPU time.

Anybody out there know of any other tricks I can try to get that CPU load down? It may seem trivial, but that 2% is a killer when I am running as many as 8 decoders on this machine, as well as 2 encoders plus audio encoders and decoders.

Thank you for your support

Rob

I hate it when I answer my own question. Here goes…

The load was, strangely enough, in glFinish(). I was calling it before swapping the buffers, which I suppose does its own implicit glFinish() anyway.

That 2% has now dried up.
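In case anyone hits the same thing, the change boils down to something like this (GLX assumed; dpy and drawable are whatever the render thread already owns):

    #include <GL/glx.h>

    static void end_frame(Display *dpy, GLXDrawable drawable)
    {
        /* Previously: glFinish() here blocked (and busy-polled in the
         * driver) until the GPU had drained the whole pipeline. */

        /* The swap already implies a flush, so no explicit sync call
         * is needed before it. */
        glXSwapBuffers(dpy, drawable);
    }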

swapbuffers is more likely to do a glFlush(), which forces the execution of all commands in the pipeline but returns immediately, so it is better for providing concurrency between the GPU and the CPU.
glFinish(), on the other hand, only returns once all commands have been executed, and it may poll the GPU a lot to check for that (depending on the implementation). That breaks the parallelism, too.
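A contrived illustration of the difference (the readback is just an example, not anything from the app above):

    #include <GL/gl.h>

    void submit_and_continue(void)
    {
        /* ... issue draw calls ... */
        glFlush();   /* hand the queued commands to the GPU, return immediately */
    }

    void read_back(int w, int h, void *pixels)
    {
        /* glReadPixels would synchronize on its own; glFinish() is shown
         * here only to illustrate the blocking behaviour. */
        glFinish();  /* block until every queued command has executed */
        glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    }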