I am rendering a few quads onto the screen in a video decoder application. I have been working on getting this as optimized as possible; however, there seems to be an unjustifiably high CPU load per quad.
This decoder is a multithreaded Linux app that plays numerous streams of video, one stream per thread. The decoder threads publish YUV video data to a rendering thread, which then renders the video onto quads on the screen. I am using an NVIDIA card with their closed-source drivers. Anyway, I have found that the decoders take about 5% CPU each, while the rendering thread takes about 3% CPU per stream. I have traced 1% CPU per stream to texture loading (by simply commenting out the texture-loading portion of the code); however, I am still seeing 2% CPU going to sending a single quad to the screen.
I have tried plenty of tricks to get rid of that final 2%. I switched from rendering quads to triangle strips. I implemented vertex buffer objects, and I have also tried rendering through a display list. No matter what I do, the 2% CPU load per quad seems to be invariant.
In oprofile, libGLcore.so.1.0.9746 seems to be consuming a lot of CPU time.
Anybody out there know of any other tricks I can try to get that CPU load down? It may seem trivial, but that 2% is a killer when I am running as many as 8 decoders on this machine, as well as 2 encoders and also audio encoders and decoders.
Thank you for your support
Rob