Draw call reordering
My engine works in a stateless mode at the draw-call level, i.e. I have all the information about each draw call: which render buffers it draws to, which vertex attributes it uses, what the rasterizer states are, etc. This opens the door to reordering the draw calls automatically before they are sent to OpenGL. It seems pretty tough to implement, though, given the various GL states and how they relate to each other.
For example, the client issues the draws A->B, then B->C. Then another render phase issues X->Y, then Y->Z. An unoptimized version would likely stall twice: first before B->C while waiting for A->B to complete, and again before Y->Z while waiting for X->Y. The algorithm can safely reorder these to A->B, X->Y, B->C, Y->Z, which would reduce the stalls. Note that I'm not taking state switches into account here; I'm only concerned about possible GPU stalls.
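One way to express the reordering from this example is to give every draw call a dependency level and then stable-sort by that level. The following is only a minimal sketch under assumptions: the DrawCall struct, the integer resource ids and the function names are hypothetical and not taken from any engine, and the code ignores state switches just as the example does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical description of a draw call: which resources it samples
// and which render targets it writes. Resource ids are arbitrary ints.
struct DrawCall {
    std::string name;
    std::vector<int> reads;
    std::vector<int> writes;
};

// True if the two resource lists share at least one id.
static bool overlaps(const std::vector<int>& a, const std::vector<int>& b) {
    for (int x : a)
        if (std::find(b.begin(), b.end(), x) != b.end()) return true;
    return false;
}

// A later call must stay behind an earlier one if there is any hazard:
// read-after-write, write-after-read, or write-after-write.
static bool dependsOn(const DrawCall& later, const DrawCall& earlier) {
    return overlaps(later.reads, earlier.writes)
        || overlaps(later.writes, earlier.reads)
        || overlaps(later.writes, earlier.writes);
}

// Assigns each call a level (0 = no dependencies, otherwise one more than
// the deepest call it depends on) and stable-sorts by level. Calls on the
// same level never depend on each other, so they can be issued back to back.
std::vector<DrawCall> reorderByDependencyLevel(const std::vector<DrawCall>& calls) {
    std::vector<std::size_t> level(calls.size(), 0);
    for (std::size_t j = 0; j < calls.size(); ++j)
        for (std::size_t i = 0; i < j; ++i)
            if (dependsOn(calls[j], calls[i]))
                level[j] = std::max(level[j], level[i] + 1);

    std::vector<std::size_t> order(calls.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return level[a] < level[b]; });

    std::vector<DrawCall> result;
    result.reserve(calls.size());
    for (std::size_t idx : order) result.push_back(calls[idx]);
    assert(result.size() == calls.size());
    return result;
}
```

With the four calls from the example submitted as A->B, B->C, X->Y, Y->Z, the levels come out as 0, 1, 0, 1, so the sketch returns A->B, X->Y, B->C, Y->Z. The pairwise hazard check is quadratic in the number of calls, which may or may not be acceptable for a real frame.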
I suspect that the OpenGL driver might already be doing this. Moreover, it might do so on a per-tile basis (instead of my per-call basis), which could be far more effective than my algorithm. In that case, all my reordering would be useless. I know that the best thing is to test it myself. However, it is a really big chunk of work for me, so I'd first like to hear from people who might know the answer in advance.
The OpenGL driver doesn't have time for such reordering. Run your application under NVIDIA Nsight or AMD GPU PerfStudio and you will see. But there is another thing that is pretty common: multiple draw calls can be processed in parallel...
That's what I meant when I mentioned per-tile execution. The GPU profiler still shows continuous execution of the draw calls, while the hardware may (theoretically) process tiles of different draw calls in parallel. If it does all of this well, there would be no benefit in reordering draw calls.
Personally, I would start reordering on a render-state level: try to stick to sorting on framebuffer, shaders and textures.
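A common way to sort on framebuffer, shader and texture is to pack them into one integer key, with the most expensive state to switch in the highest bits. This is a rough sketch, assuming each state object can be reduced to a small integer id; the StateKeyedDraw struct, the field widths and the function names are hypothetical. It also assumes the draws being sorted are independent of each other, since it does not look at dependencies at all.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-draw state ids; a real engine would map its
// framebuffer, program and texture objects to such ids itself.
struct StateKeyedDraw {
    std::uint16_t framebuffer;
    std::uint16_t shader;
    std::uint32_t texture;
    int id;  // identifies the draw call for the caller
};

// Framebuffer occupies the top 16 bits, shader the next 16, texture the
// low 32, so draws group by framebuffer first, then shader, then texture.
std::uint64_t makeSortKey(const StateKeyedDraw& d) {
    return (static_cast<std::uint64_t>(d.framebuffer) << 48)
         | (static_cast<std::uint64_t>(d.shader) << 32)
         | static_cast<std::uint64_t>(d.texture);
}

// Stable sort keeps the submission order of draws that share all three states.
void sortByState(std::vector<StateKeyedDraw>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const StateKeyedDraw& a, const StateKeyedDraw& b) {
                         return makeSortKey(a) < makeSortKey(b);
                     });
}
```

Putting the framebuffer in the top bits reflects the order suggested above; a different priority only requires changing the shifts.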
You can easily spend more time trying to optimize the render calls than you actually save by doing so. Compare it to doing triangle-perfect culling rather than just drawing meshes that are partially visible.