The problem does indeed seem to be the use of a single FBO. As a test, I switched to a second FBO for shadow map rendering (switching between shadow maps and the main view is my most frequent render target change), and most of the "unexpected" performance hit went away. Rendering as a whole is still some constant factor slower than on Windows & OpenGL, but it's much more consistent now.
Now I just need to implement the multiple-FBO mechanism properly and transparently to the caller.
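A rough sketch of what such a mechanism might look like, assuming one FBO is kept per distinct set of attachments and looked up on every render target change. The names `RenderTargetKey`, `FboCache` and `setRenderTarget` are made up for illustration, and the create and bind steps are passed in as callbacks so the sketch does not depend on a GL context. In a real renderer they would wrap `glGenFramebuffers` / `glFramebufferTexture2D` and `glBindFramebuffer` respectively.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical key: the texture ids that together form one render target.
struct RenderTargetKey {
    std::uint32_t colorTex;  // 0 means "no color attachment" (e.g. shadow map)
    std::uint32_t depthTex;

    bool operator<(const RenderTargetKey& other) const {
        return std::tie(colorTex, depthTex) <
               std::tie(other.colorTex, other.depthTex);
    }
};

// Keeps one FBO per attachment combination instead of re-attaching
// textures to a single shared FBO on every render target change.
class FboCache {
public:
    using CreateFn = std::function<std::uint32_t(const RenderTargetKey&)>;
    using BindFn = std::function<void(std::uint32_t)>;

    // create: builds an FBO with the given attachments and returns its id.
    // bind:   makes the given FBO current.
    FboCache(CreateFn create, BindFn bind)
        : create_(std::move(create)), bind_(std::move(bind)) {}

    // The caller only names the render target; FBO reuse is hidden here.
    void setRenderTarget(const RenderTargetKey& key) {
        std::uint32_t fbo = 0;
        auto it = fbos_.find(key);
        if (it == fbos_.end()) {
            fbo = create_(key);
            fbos_.emplace(key, fbo);
            // Creating an FBO usually changes the current binding,
            // so force a rebind below.
            haveBound_ = false;
        } else {
            fbo = it->second;
        }
        if (!haveBound_ || fbo != bound_) {
            bind_(fbo);
            bound_ = fbo;
            haveBound_ = true;
        }
    }

    std::size_t size() const { return fbos_.size(); }

private:
    CreateFn create_;
    BindFn bind_;
    std::map<RenderTargetKey, std::uint32_t> fbos_;
    std::uint32_t bound_ = 0;
    bool haveBound_ = false;
};
```

With this shape, the shadow pass and the main pass would each call `setRenderTarget` with their own key, and the second and later frames would only pay for a bind. Entries would also need to be dropped when a texture is deleted or resized, which the sketch leaves out.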
Thanks to all who replied!