GL_ARB_instanced_arrays slower than glLoadMatrixf

Has anyone any idea why uploading a single matrix (via a VBO) and then drawing a batch with glDrawElementsInstanced() would be slower than calling glLoadMatrixf() followed by glDrawElements()?
Consider that the shader used by both methods takes a fully formed modelview matrix, so the instanced version isn't doing two matrix multiplies; the maths in the shader is exactly the same.
Also consider that both methods use the GL_NV_vertex_buffer_unified_memory extension, and the per-instance buffer is addressed via its GPU address rather than a buffer binding.
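For reference, this is roughly what the two paths look like (simplified to classic buffer binds rather than the bindless path I'm actually using; variable names and the attribute locations 4..7 are illustrative):

```cpp
// Path 1: fixed-function matrix upload + ordinary draw.
// modelview is the fully formed eye-space matrix, composed on the CPU.
glLoadMatrixf(modelview);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);

// Path 2: the same matrix goes into a VBO feeding a per-instance mat4
// attribute, followed by a single-instance instanced draw.
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
glBufferSubData(GL_ARRAY_BUFFER, 0, 16 * sizeof(float), modelview);
for (int i = 0; i < 4; ++i) {               // a mat4 spans 4 attribute slots
    glEnableVertexAttribArray(4 + i);
    glVertexAttribPointer(4 + i, 4, GL_FLOAT, GL_FALSE, 16 * sizeof(float),
                          (const void*)(i * 4 * sizeof(float)));
    glVertexAttribDivisorARB(4 + i, 1);     // advance once per instance
}
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, 1);
```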

Thanks for any thoughts on the subject.

NOTE: the instanced version is faster as the number of batches increases, but I would have thought it would be at least as fast as glLoadMatrixf with a single batch.

Probably because the GPU memory allocated to buffer objects is not as fast as the memory allocated to uniform storage, so the write and the read would both be quicker in the single-use glLoadMatrix case.
It's a moot point anyway, though, because instancing is only a win when the number of instances to draw is high enough to justify the extra development time to architect and implement the solution.

NOTE: the instanced version is faster as the number of batches increases, but I would have thought it would be at least as fast as glLoadMatrixf with a single batch.

No. Instancing is only worthwhile once you get many batches; you shouldn't bother with counts below 100.

Instancing just has its own overhead that becomes (moderately) measurable when batch sizes are small. There's a cutoff point (which will vary by hardware and driver) beyond which it pulls ahead, but hard-coding that point is probably unwise, as today's cutoff may shift on next year's hardware. Sending everything through an instanced path also has non-performance benefits, such as cleaner and more consistent code paths.

Worth noting that instancing has viable use cases other than just reducing draw-call counts. You can use it in a particle system to get submission down to one vertex's worth of data per particle, and it's considerably faster than a geometry shader, for example. It's also good for text rendering with a similar setup.
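Schematically it's just this (names illustrative; the vertex shader billboards each corner around the per-instance position):

```cpp
// cornerVBO: 4 corner offsets shared by every particle (per-vertex data).
glBindBuffer(GL_ARRAY_BUFFER, cornerVBO);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, 0);

// particleVBO: one position per particle (per-instance data).
glBindBuffer(GL_ARRAY_BUFFER, particleVBO);
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 0, 0);
glVertexAttribDivisorARB(1, 1);   // fetch one position per particle

// 4 vertices per instance, numParticles instances.
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, numParticles);
```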

Yes, I'm after cleaner code if the required extensions exist, but I also understood from discussions I've read that uniform changes cause a pipeline flush while attribute changes don't (hence NVIDIA pushing the pseudo-instancing approach years ago). It should therefore follow that pushing the transforms down an attribute rather than through a uniform gives better throughput, regardless of how many instances are being drawn.
I also understood that the reason instancing wasn't worthwhile on small numbers of batches was that the per-vertex overhead of doing two matrix multiplies (model-to-world, then world-to-view) swamped the batch setup saving. But as I said, I'm doing the model-to-view transform on the CPU for both the instanced and non-instanced paths, so the shader does a single matrix multiply in both cases…
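To illustrate (these aren't my exact shaders, just the shape of them): with the matrix composed on the CPU, both variants cost one matrix multiply per vertex; only where the matrix comes from differs.

```cpp
// Uniform version: the matrix is reloaded between batches.
const char* uniformVS = R"(
    #version 120
    uniform mat4 modelview;
    void main() { gl_Position = gl_ProjectionMatrix * (modelview * gl_Vertex); }
)";

// Instanced version: the matrix arrives as a per-instance attribute
// (divisor 1, bound to locations 4..7 with glBindAttribLocation).
const char* instancedVS = R"(
    #version 120
    attribute mat4 modelview;
    void main() { gl_Position = gl_ProjectionMatrix * (modelview * gl_Vertex); }
)";
```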

By the way, mhagain… moderately measurable???
Scene: 3.5 million triangles with very few instanceable batches (but some).

instanced: 709 batches = 13ms
glLoadMatrix: 1035 batches = 6ms

7ms faster when not using instancing. I don't call that 'moderately measurable', I call that a disaster!
I'm submitting 40% fewer batches using instancing, but it's rendering at less than half the speed.

Another example…
Scene: 23.5 million triangles with plenty of instanceable batches but also a lot of non-instanceable ones.

totally instanced: 3606 batches = 68ms
totally glLoadMatrix: 37399 batches = 49ms

It’s kinda hard to say whether what you’re getting is reasonable or not without seeing your actual code, rather than a description of it.

Are you switching between instanced and non-instanced shaders often when drawing? That can degrade performance quite a bit.

In your second example, do you mean that instanced drawing draws about (37399-3606) objects? And are all the instances done by a single draw call, or are there multiple draw calls of different objects? If so, how many instanced draw calls are there? How big are the objects being instanced? I've found that instancing large models (>100K points) a few times (<1000) doesn't tend to improve performance much (NVIDIA Quadro 4000).

That particular use case reads like a disaster for sure, but there are always going to be cases for anything where general guidelines are skewed.

My feeling here is that you're possibly using a non-optimal path for updating your per-instance buffer; maybe glBufferSubData per matrix? Ideally you want to calculate ahead of time how many instances you're going to need, and use glMapBufferRange on the lot, with the streaming buffer pattern (see Buffer Object Streaming - OpenGL Wiki).
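Something of this shape, in other words (buffer size and names are assumed; instanceVBO is already bound and sized for the frame):

```cpp
// Frame start: orphan, so the GPU can keep reading last frame's copy.
glBufferData(GL_ARRAY_BUFFER, BUFFER_BYTES, NULL, GL_STREAM_DRAW);
GLintptr cursor = 0;

// Then append the whole frame's matrices without stalling the pipeline:
GLsizeiptr bytes = instanceCount * 16 * sizeof(float);
void* dst = glMapBufferRange(GL_ARRAY_BUFFER, cursor, bytes,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_RANGE_BIT |
                             GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, matrices, bytes);   // matrices: float[16 * instanceCount]
glUnmapBuffer(GL_ARRAY_BUFFER);
cursor += bytes;                // batches draw from their own offset
```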

Thanks for all the replies.
Interesting! While I'd tried all the buffer submission methods under the sun (buffer streaming/orphaning, a round robin of n buffers with either map or glBufferSubData, single buffer mapping) when dealing with each batch, I'd not thought of spending the extra CPU time building a vector of ALL the transforms from ALL the batches, submitting them to a buffer in one go, and then drawing the batches with the correct offsets into this one buffer. And waddya know, it's screamingly fast now, much faster than the non-instanced path. I'm trying to do as little CPU work as possible, so this kind of thing seemed heavy duty, but it's a fairly light sysmem-to-sysmem copy and is hidden by the reduction in buffer management in the driver.
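For anyone finding this later, the rough shape of what I ended up with (the Batch fields and names are illustrative; attribute divisors for locations 4..7 are set to 1 once at init):

```cpp
#include <vector>

struct Batch {                       // assumed per-batch data
    std::vector<float> matrices;     // 16 floats per instance, eye-space
    GLsizei indexCount;
    GLsizei instanceCount;
    const void* firstIndex;          // byte offset into the index buffer
};

void drawInstancedBatches(const std::vector<Batch>& batches, GLuint instanceVBO)
{
    // 1. Gather every batch's transforms into one contiguous array.
    std::vector<float> all;
    for (const Batch& b : batches)
        all.insert(all.end(), b.matrices.begin(), b.matrices.end());

    // 2. One upload for the whole frame.
    glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
    glBufferData(GL_ARRAY_BUFFER, all.size() * sizeof(float),
                 all.data(), GL_STREAM_DRAW);

    // 3. Draw each batch with its own offset into the shared buffer.
    size_t offset = 0;
    for (const Batch& b : batches) {
        for (int i = 0; i < 4; ++i)  // mat4 = 4 attribute slots
            glVertexAttribPointer(4 + i, 4, GL_FLOAT, GL_FALSE,
                                  16 * sizeof(float),
                                  (const void*)(offset + i * 4 * sizeof(float)));
        glDrawElementsInstanced(GL_TRIANGLES, b.indexCount, GL_UNSIGNED_INT,
                                b.firstIndex, b.instanceCount);
        offset += b.instanceCount * 16 * sizeof(float);
    }
}
```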
That same second example again:
Scene: 23.5 million triangles with plenty of instanceable batches but also a lot of non-instanceable ones.

totally instanced: 3606 batches = 43ms
totally glLoadMatrix: 37399 batches = 52ms (different from the first run, probably because the camera has moved)

Thanks for the suggestion, good result!

EDIT: I'd still like to know what optimisations glLoadMatrix is doing to make it faster than a single attribute packet… it's still got to do basically the same thing in the background, surely.

Only… I get a judder every second… ffs… and only when using the instanced path.

What size is your per-instance buffer? It may well be the case that if it’s too large, the memory allocation that your driver needs to do when orphaning it becomes significant.

There's a trick in an NVIDIA publication which I've also used on AMD hardware, but with D3D rather than GL; I believe it may be of benefit to you if it matches your usage. The way you do it is to have a buffer sized for at least 3 frames' worth of data, and, when streaming, instead of orphaning and resetting the buffer cursor to 0, you just reset the cursor. As 3 frames is the normal number of frames your GPU will get ahead of your CPU, this is safe to do, and it means the driver never has to reallocate buffer memory.

A variation on it involves counting the number of frames that have passed since the cursor was last at 0; if it’s more than 3 then you can reset to 0 without orphaning, otherwise you must orphan as normal.

There may also be benefit in bumping the number of frames to higher than 3; pick maybe 5 or 6 to give yourself some headroom (useful for cases where the amount of data may vary per-frame, such as a particle system).

You can easily enough figure out the best size with a printf (or your program's equivalent) whenever a buffer orphaning occurs, then take the size up gradually until it no longer happens.
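Putting that together, roughly (constants and names assumed; append-time logic only, with the buffer already bound):

```cpp
// Buffer sized for several frames; wrap the cursor instead of orphaning.
const int        FRAMES   = 3;     // bump to 5 or 6 for headroom
const GLsizeiptr CAPACITY = FRAMES * MAX_FRAME_BYTES;

static GLintptr cursor          = 0;
static int      framesSinceZero = 0;    // frames since the cursor last wrapped

++framesSinceZero;                      // once per frame
if (cursor + bytesNeeded > CAPACITY) {
    if (framesSinceZero <= FRAMES) {    // GPU may still be reading: orphan
        glBufferData(GL_ARRAY_BUFFER, CAPACITY, NULL, GL_STREAM_DRAW);
        printf("orphaned after %d frames - grow the buffer?\n", framesSinceZero);
    }
    cursor = 0;                         // otherwise a plain wrap is safe
    framesSinceZero = 0;
}

void* dst = glMapBufferRange(GL_ARRAY_BUFFER, cursor, bytesNeeded,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, matrices, bytesNeeded);
glUnmapBuffer(GL_ARRAY_BUFFER);
cursor += bytesNeeded;
```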

To your previous point, I've never seen or heard anything to indicate that uniform sets cause a flush; that would totally kill performance if true. In fact, they pipeline very, very well (better than buffer updates, in my experience).

Also, info on these boards and Beyond3D IIRC indicates that plain old uniforms are stored in shared memory on the compute units (SMs, SIMD processors, etc.), making lookups from them very, very fast on the GPU (faster than buffer objects). The con of using ordinary uniforms for instancing state is that to set them you have to break a batch, and batch call setup/dispatch can be expensive (thus bindless, VAOs, etc.). Using larger batches also helps reduce this overhead, hence the pseudo-instancing techniques that predate "real" (GPU-side) instancing (ARB_instanced_arrays / ARB_draw_instanced). It all depends on what you're bound on. That said, ARB_instanced_arrays perf is really excellent on large batches (or with bindless regardless), particularly when you're pulling from buffers already on the GPU.

Mmmyeah, I get what you're saying, but I was rather lazily referencing this 2004 NVIDIA paper…

Unfortunately, rendering large number of instances results in a large amount of driver work. This is particularly true in GLSL where the driver must map abstract uniform variables into real physical hardware registers. From the hardware's perspective, the large number of constant updates is not ideal either. Constant updates can incur hardware flushes in the vertex processing engines.

It's almost certainly out-of-date information, of course; my experiments do seem to back that up.

Thanks mhagain, I'll give that tip a try when I next revisit the code. At the moment I can't imagine getting better performance; it's bloody stellar!

Also, Dark Photon, sorry, but the conclusion I've come to contradicts yours. Rather than instanced_arrays only performing well on large batches, they in fact give way better performance on single batches than any other method I've found, so long as all the transforms have previously been uploaded to buffer storage in a single operation (and provided you're uploading the eye-space transform instead of the world transform, otherwise you'd be hit by extra vertex shader instructions). Makes sense, doesn't it? You're batching the 'uniform' data itself. You could do the same with a uniform buffer and gl_InstanceID, I suppose, but I've not tried that so can't speak for its relative performance. I have the luxury of targeting hardware that supports instanced_arrays, with a fallback to standard batch submission.

Doubtless this technique might be slower without bindless, because it would involve a lot of buffer binds and VAO modifications (mesh attributes + instance attributes).

I take some of that back: from testing, it would seem this assertion only holds on the Fermi architecture; on older cards it's no faster at all.