Instancing in D3D is designed primarily to work around a fundamental problem in the Direct3D architecture and API. Calling DrawPrimitive (the primary array drawing function, along with DrawIndexedPrimitive) is not something to take lightly if you are at all concerned about performance. Such a call goes directly to the hardware driver (unlike an OpenGL call), and hardware drivers have to run in kernel mode, which forces a CPU switch to Ring 0. This is not a cheap operation; according to nVidia, a 1GHz CPU can only do approximately 100,000 of them per second. That’s not a lot if you’re running at 75fps: roughly 1,300 per frame.
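To put a number on that, here is a quick back-of-the-envelope sketch. The function name is just something I made up for illustration, and the 100,000 figure is the nVidia estimate quoted above, not something I measured.

```cpp
#include <cassert>

// Rough per-frame draw-call budget: how many calls fit in one frame if the
// CPU can afford `calls_per_second` of them and does nothing else.
// Integer division, so the result is rounded down.
long draw_calls_per_frame(long calls_per_second, long frames_per_second) {
    assert(calls_per_second >= 0 && frames_per_second > 0);
    return calls_per_second / frames_per_second;
}
```

With the figures above, draw_calls_per_frame(100000, 75) comes to 1,333, and that assumes the CPU spends the entire frame doing nothing but issuing draw calls.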
Through instancing, instead of drawing a mesh 1,000 times with 1,000 calls, you draw all 1,000 copies using one DrawPrimitive call (and a massive index list). That way, you have saved 999 potential DrawPrimitive calls that can be used on more interesting stuff.
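To make the "one call plus a massive index list" idea concrete, here is a minimal CPU-side sketch. It is not the D3D instancing API and not how any particular driver does it; the function name is hypothetical, and it assumes the vertex buffer holds the copies of the mesh back to back.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build one big index list that draws `instance_count` copies of a mesh
// in a single call. Assumption: copy i's vertices start at
// i * vertex_count in the (combined) vertex buffer.
std::vector<std::uint32_t> build_batched_indices(
        const std::vector<std::uint32_t>& mesh_indices,
        std::uint32_t vertex_count,
        std::uint32_t instance_count) {
    std::vector<std::uint32_t> out;
    out.reserve(mesh_indices.size() * instance_count);
    for (std::uint32_t i = 0; i < instance_count; ++i) {
        const std::uint32_t base = i * vertex_count;  // first vertex of copy i
        for (std::uint32_t idx : mesh_indices) {
            assert(idx < vertex_count);  // index must stay inside one copy
            out.push_back(base + idx);
        }
    }
    return out;
}
```

For a single triangle (indices 0, 1, 2) and 1,000 copies, this gives a 3,000-entry index list that one indexed draw call could consume, instead of 1,000 separate calls.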
Here’s the thing. The equivalent OpenGL function, glDrawElements, does not force an immediate switch to Ring 0. The GL driver will need to do one eventually, in order to feed the hardware FIFO. However, the GL driver decides when this is a good idea, not the external code. Additionally, if you have made several glDrawElements calls since the last time it had to feed the hardware FIFO, it can queue them up and drop them into the FIFO all at once. Effectively, OpenGL allows the GL driver to marshal calls to the GPU.
As a side note, glFlush is a way to control this marshalling. It tells the GL implementation to block until it has actually placed all waiting commands into the hardware FIFO. If the hardware is slower than the caller, it can take a while for the hardware FIFO to be empty enough to need refilling.
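The queuing behaviour described above can be modelled with a toy class. This is purely an illustration: ToyDriver, its batch limit, and its counters are all hypothetical, real drivers are far more involved, and the flush here just submits what is pending rather than modelling any blocking.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a user-mode GL driver that queues draw commands and only
// pays for a kernel transition when it hands a batch to the hardware FIFO.
class ToyDriver {
public:
    explicit ToyDriver(std::size_t batch_limit) : batch_limit_(batch_limit) {
        assert(batch_limit > 0);
    }

    // Stand-in for glDrawElements: queue the command, and submit only when
    // the driver itself decides the batch is big enough.
    void draw(int mesh_id) {
        pending_.push_back(mesh_id);
        if (pending_.size() >= batch_limit_) submit();
    }

    // Stand-in for glFlush: push whatever is pending right now.
    void flush() {
        if (!pending_.empty()) submit();
    }

    std::size_t kernel_transitions() const { return kernel_transitions_; }
    std::size_t commands_submitted() const { return commands_submitted_; }

private:
    void submit() {
        ++kernel_transitions_;                    // one Ring 0 switch per batch
        commands_submitted_ += pending_.size();   // whole batch goes to the FIFO
        pending_.clear();
    }

    std::size_t batch_limit_;
    std::vector<int> pending_;
    std::size_t kernel_transitions_ = 0;
    std::size_t commands_submitted_ = 0;
};
```

With a batch limit of 64, 1,000 draw() calls cost 15 transitions, plus one more for the final flush(), instead of the 1,000 you would pay if every call switched to Ring 0 on its own.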
So, clearly, OpenGL doesn’t need instancing nearly as much as D3D. However, this is not to say that instancing wouldn’t be a performance win on GL; glDrawElements is not free after all, and neither are state changes (another thing that instancing gets around by hiding them in vertex attributes). There’s been some significant debate in this forum on this very subject, with no conclusions drawn either way. Personally, I think it should be exposed, since it might give a performance win, and it already exists in D3D drivers.
However, there is a strong suggestion, though no certifiable proof, that ATi’s D3D instancing implementation is purely software based. According to some, it turns the single call into many separate calls. You still get the performance benefit of not having 999 switches to Ring 0 by this method, since the splitting happens entirely inside the hardware driver.