Does anyone have any idea why uploading a single matrix (using a VBO) and then calling glDrawElementsInstanced() to draw one batch would be slower than calling glLoadMatrixf() followed by glDrawElements()?
Bear in mind that the shader used by both methods takes a fully formed modelview matrix, so the instanced version isn't doing two matrix multiplies. The maths in the shader is exactly the same.
Also bear in mind that both methods use the GL_NV_vertex_buffer_unified_memory extension, and the per-instance buffer is specified by its GPU address rather than by a buffer binding.
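
To make the setup concrete, here is a minimal sketch of the per-instance data layout described above. It assumes the modelview matrix reaches the shader as four vec4 attributes (one per column, column-major) with an instance divisor of 1. The struct and function names are purely illustrative, and the GL calls appear only in comments, so this shows the layout rather than the actual draw path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAT4_FLOATS 16

/* Hypothetical descriptor for one vec4 column of the per-instance matrix. */
typedef struct {
    unsigned location; /* attribute slot: base_location + column index */
    size_t   offset;   /* byte offset of this column inside one matrix */
    size_t   stride;   /* bytes between consecutive instances */
    unsigned divisor;  /* 1 = advance once per instance, not per vertex */
} InstanceColumn;

/* Describe a column-major mat4 as four consecutive vec4 attributes.
 * With plain buffer bindings each entry would feed
 * glVertexAttribPointer() + glVertexAttribDivisor(); with
 * GL_NV_vertex_buffer_unified_memory it would feed
 * glVertexAttribFormatNV() + glBufferAddressRangeNV() instead. */
void describe_instance_matrix(unsigned base_location, InstanceColumn cols[4])
{
    for (unsigned c = 0; c < 4; ++c) {
        cols[c].location = base_location + c;
        cols[c].offset   = c * 4 * sizeof(float);
        cols[c].stride   = MAT4_FLOATS * sizeof(float);
        cols[c].divisor  = 1;
    }
}

/* Copy `count` tightly packed column-major matrices into a staging array
 * that would then be uploaded to the VBO. Returns the bytes written. */
size_t pack_instance_matrices(const float *matrices, size_t count, float *dst)
{
    assert(matrices != NULL && dst != NULL);
    size_t bytes = count * MAT4_FLOATS * sizeof(float);
    memcpy(dst, matrices, bytes);
    return bytes;
}
```

In the single-batch case `count` is 1, so the instanced path uploads one matrix to the buffer where the other path makes one glLoadMatrixf() call.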

Thanks for any thoughts on the subject.

NOTE: the instanced version is faster as the number of batches increases, but I would have thought it would be at least as fast as glLoadMatrixf() even with a single batch.