Part of the Khronos Group
OpenGL.org


Thread: GL_ARB_instanced_arrays slower than glLoadMatrixf

  1. #1
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124

    GL_ARB_instanced_arrays slower than glLoadMatrixf

    Has anyone any idea why uploading a single matrix (using a VBO) and then calling glDrawElementsInstanced() to draw a batch would be slower than calling glLoadMatrixf() and then glDrawElements()?
    Consider that the shader used by both methods takes a fully formed modelview matrix, so the instanced version isn't doing 2 matrix mults. The maths in the shader is exactly the same.
    Also consider that both methods use the GL_NV_vertex_buffer_unified_memory extension, and the per-instance buffer is addressed by its GPU address instead of through a buffer binding.

    thanks for any thoughts on the subject.

    NOTE: the instanced version is faster as the number of batches increases, but i would have thought it would be just as fast if not faster than glLoadMatrixf with a single batch.
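
For readers following along, here is a minimal sketch of how a per-instance modelview matrix is usually laid out as vertex attributes with GL_ARB_instanced_arrays. It is not the poster's code: it assumes the plain buffer-binding path rather than GL_NV_vertex_buffer_unified_memory, and `ColumnAttrib`, `instanceMatrixLayout` and `baseLocation` are hypothetical names. The layout is computed in plain C++ so it can be checked without a GL context; the GL calls appear only as comments.

```cpp
#include <array>
#include <cstddef>

// One vertex attribute slot holds at most a vec4, so a mat4 attribute
// occupies four consecutive locations, one per column.
struct ColumnAttrib {
    unsigned location;   // attribute location for this column
    std::size_t offset;  // byte offset of the column inside one instance record
    std::size_t stride;  // bytes between consecutive instance records
    unsigned divisor;    // 1 = advance once per instance, 0 = once per vertex
};

// baseLocation is a hypothetical location assigned to the mat4 in the shader.
inline std::array<ColumnAttrib, 4> instanceMatrixLayout(unsigned baseLocation) {
    constexpr std::size_t kColumnBytes = 4 * sizeof(float);
    constexpr std::size_t kMatrixBytes = 4 * kColumnBytes;
    std::array<ColumnAttrib, 4> cols{};
    for (unsigned c = 0; c < 4; ++c) {
        cols[c] = ColumnAttrib{baseLocation + c, c * kColumnBytes, kMatrixBytes, 1u};
    }
    return cols;
}

// With a GL context and the instance VBO bound, each entry `a` would be
// applied roughly as follows:
//   glEnableVertexAttribArray(a.location);
//   glVertexAttribPointer(a.location, 4, GL_FLOAT, GL_FALSE,
//                         (GLsizei)a.stride, (const void*)a.offset);
//   glVertexAttribDivisorARB(a.location, a.divisor);
```

The point of the sketch is that a "single matrix" in the instanced path is four attribute streams plus a buffer update, whereas glLoadMatrixf is one call into driver-managed state.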

  2. #2
    Senior Member OpenGL Pro BionicBytes
    Join Date
    Mar 2009
    Location
    UK, London
    Posts
    1,171
    Probably because the GPU memory allocated to buffer objects is not as fast as memory allocated to uniform storage, so the write and read would be quicker in the single-use glLoadMatrix case.
    It's a moot point anyway, though, because instancing is only a win when the number of instances to draw is high enough to justify the extra development time to architect and implement the solution.

  3. #3
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,793
    Quote Originally Posted by peterfilm
    NOTE: the instanced version is faster as the number of batches increases, but i would have thought it would be just as fast if not faster than glLoadMatrixf with a single batch.
    No. Instancing is only worthwhile once you get many batches; you shouldn't bother with numbers less than 100.

  4. #4
    Advanced Member Frequent Contributor
    Join Date
    Jan 2007
    Posts
    982
    Instancing just has its own overhead that becomes (moderately) measurable when batch sizes are small. There's a cutoff point (that will vary by hardware and drivers) beyond which it pulls ahead, but hard-coding this is probably not wise, as the cutoff that is faster today may be slower in a year's time. Just sending everything through an instanced path has non-performance benefits such as cleaner and more consistent code paths.

    Worth noting that instancing has viable use cases other than just reducing draw call counts. You can use it for a particle system to get a submission of 1 vertex per particle and it's considerably faster than a geometry shader, for example. Also good for text with a similar setup.

  5. #5
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    Yes, I'm after cleaner code if the required extensions exist, but I also understood from discussions I've read that uniform changes cause a pipeline flush while attribute changes don't (hence nvidia pushing the pseudo instancing approach years ago). It should therefore follow that pushing the transforms down an attribute rather than through a uniform gives better throughput, regardless of how many instances are being drawn.
    I also understood that the reason instancing was not worthwhile on small numbers of batches was because the per-vertex overhead of doing 2 matrix mults (model-to-world then world-to-view) swamped the batch setup saving. But as I said, I'm doing the model-to-view transform on the cpu for both instancing and non-instancing, so the shader is doing a single matrix mult in both cases....
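
As an illustration of the CPU-side combination described above, the following sketch premultiplies world-to-view by model-to-world so that the shader only ever sees one matrix. The names `Mat4`, `multiply`, `translation` and `modelView` are hypothetical, and the column-major storage is an assumption chosen to match what glLoadMatrixf expects.

```cpp
#include <array>

// 4x4 matrix in column-major order: element (row, col) lives at [col * 4 + row].
using Mat4 = std::array<float, 16>;

// Returns a * b for column-major matrices.
inline Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a[k * 4 + row] * b[col * 4 + k];
            }
            r[col * 4 + row] = sum;
        }
    }
    return r;
}

// Helper used only to build example inputs.
inline Mat4 translation(float x, float y, float z) {
    return Mat4{1.0f, 0.0f, 0.0f, 0.0f,
                0.0f, 1.0f, 0.0f, 0.0f,
                0.0f, 0.0f, 1.0f, 0.0f,
                x,    y,    z,    1.0f};
}

// Model-to-view in one step: the result is what would be handed either to
// glLoadMatrixf or written into the per-instance buffer.
inline Mat4 modelView(const Mat4& worldToView, const Mat4& modelToWorld) {
    return multiply(worldToView, modelToWorld);
}
```

Because the same 16 floats feed both paths, any difference in frame time comes from how the matrix reaches the GPU, not from the shader maths.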

  6. #6
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    by the way mhagain....moderately measurable???
    scene: 3.5 million triangles with very few instanceable batches (but some)

    instanced: 709 batches = 13ms
    glloadmatrix: 1035 batches = 6ms

    7ms faster when not using instancing. I don't call that 'moderately measurable', I call that a disaster!
    I'm submitting roughly 30% fewer batches using instancing but it's rendering at less than half the speed.

    another example...
    scene: 23.5 million triangles with plenty of instanceable batches but also a lot of non-instanceables.

    totally instanced: 3606 batches = 68ms
    totally glloadmatrix: 37399 batches = 49ms
    Last edited by peterfilm; 08-09-2012 at 09:07 AM.
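
Working the two sets of measurements through, using only the batch counts and frame times listed in the post:

```latex
\begin{align*}
\text{Scene 1 batch reduction} &= \frac{1035 - 709}{1035} \approx 31.5\% \\
\text{Scene 1 frame-time ratio (instanced / glLoadMatrix)} &= \frac{13\ \text{ms}}{6\ \text{ms}} \approx 2.17 \\
\text{Scene 2 batch reduction} &= \frac{37399 - 3606}{37399} \approx 90.4\% \\
\text{Scene 2 frame-time ratio (instanced / glLoadMatrix)} &= \frac{68\ \text{ms}}{49\ \text{ms}} \approx 1.39
\end{align*}
```

In both scenes the instanced path issues fewer draw calls yet takes longer per frame, which suggests the cost lies somewhere other than the draw calls themselves.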

  7. #7
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,793
    It's kinda hard to say whether what you're getting is reasonable or not without seeing your actual code, rather than a description of it.

  8. #8
    Member Regular Contributor malexander
    Join Date
    Aug 2009
    Location
    Ontario
    Posts
    257
    Are you switching between instanced and non-instanced shaders often when drawing? That can degrade performance quite a bit.

    In your second example, do you mean that instanced drawing draws about (37399-3606) objects? And are all the instances done by a single draw call, or are there multiple draw calls of different objects? If so, how many instanced draw calls are there? How big are the objects being instanced? I've found that instancing large models (>100K points) a few times (<1000) doesn't tend to improve performance much (Nvidia Quadro 4000).

  9. #9
    Advanced Member Frequent Contributor
    Join Date
    Jan 2007
    Posts
    982
    Quote Originally Posted by peterfilm
    by the way mhagain....moderately measurable???
    That particular use case reads like a disaster for sure, but there are always going to be cases for anything where general guidelines are skewed.

    My feeling here is that you're possibly using a non-optimal path for updating your per-instance buffer; maybe glBufferSubData per-matrix? Ideally you want to calculate ahead of time how many instances you're going to need, and use glMapBufferRange on the lot, with the streaming buffer pattern (see http://www.opengl.org/wiki/Buffer_Object_Streaming).
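
A rough sketch of the offset bookkeeping behind the streaming buffer pattern suggested above. This is an assumption about one common way to implement it, not code from the thread: `StreamCursor`, `StreamRange` and `reserve` are hypothetical names, and the GL calls are shown only as comments so the logic can be run without a context.

```cpp
#include <cstddef>

struct StreamRange {
    std::size_t offset;  // byte offset at which to map the buffer
    bool orphan;         // true when the buffer should be re-specified first
};

// Tracks the write position inside one fixed-size streaming buffer.
class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacityBytes) : capacity_(capacityBytes) {}

    // Reserve `bytes` for this frame's instance data. When the request does
    // not fit in the remaining space, wrap to the start and ask the caller
    // to orphan the buffer. Requests larger than the capacity are not
    // handled in this sketch.
    StreamRange reserve(std::size_t bytes) {
        bool orphan = false;
        if (head_ + bytes > capacity_) {
            head_ = 0;
            orphan = true;
        }
        StreamRange range{head_, orphan};
        head_ += bytes;
        return range;
    }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;
};

// With a GL context and the buffer bound to GL_ARRAY_BUFFER, a range `r`
// would be used roughly like this:
//   if (r.orphan) glBufferData(GL_ARRAY_BUFFER, capacity, nullptr, GL_STREAM_DRAW);
//   void* dst = glMapBufferRange(GL_ARRAY_BUFFER, r.offset, bytes,
//       GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
//   memcpy(dst, matrices, bytes);
//   glUnmapBuffer(GL_ARRAY_BUFFER);
```

Reserving once for all instances in a frame, rather than once per matrix, keeps the number of map and unmap calls small.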

  10. #10
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    thanks for all the replies.
    Interesting! While I'd tried all the buffer submission methods under the sun (buffer streaming/orphaning, round robin of n buffers with either map or glBufferSubData, single buffer mapping) when dealing with each batch, I'd not thought of spending the extra CPU time building a vector of ALL the transforms out of ALL the batches, submitting them to a buffer all at once and then drawing the batches with the correct offsets into this one buffer. And waddya know, it's screamingly fast now, much faster than the non-instanced path. I'm trying to do as little CPU work as possible, so this kind of thing seemed heavy duty, but it's a fairly light sysmem-to-sysmem copy and its cost is hidden by the reduction in buffer management by the driver.
    that same second example again:-
    scene: 23.5 million triangles with plenty of instanceable batches but also a lot of non-instanceables.

    totally instanced: 3606 batches = 43ms
    totally glloadmatrix: 37399 batches = 52ms (different from first time probably because camera has moved)

    thanks for the suggestion, good result!

    EDIT: I'd still like to know what optimisations glLoadMatrix is doing to make it faster than a single attribute packet....it's still got to do basically the same thing in the background surely.
    Last edited by peterfilm; 08-10-2012 at 02:26 AM.
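
The fix described in this post can be sketched as follows: gather every batch's transforms into one contiguous array, remember where each batch starts, upload once, then draw each batch with its own offset. The types and names here (`Batch`, `BatchRange`, `PackedTransforms`, `packTransforms`) are hypothetical and the poster's actual data structures are not shown in the thread.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Mat4 = std::array<float, 16>;

// Hypothetical batch: one mesh plus the modelview matrix of each instance.
struct Batch {
    std::vector<Mat4> transforms;
};

struct BatchRange {
    std::size_t firstInstance;  // index of the batch's first matrix in the packed array
    std::size_t instanceCount;  // number of instances to draw for this batch
};

struct PackedTransforms {
    std::vector<Mat4> matrices;      // all transforms, batch after batch
    std::vector<BatchRange> ranges;  // one entry per batch, in input order
};

inline PackedTransforms packTransforms(const std::vector<Batch>& batches) {
    std::size_t total = 0;
    for (const Batch& b : batches) total += b.transforms.size();

    PackedTransforms out;
    out.matrices.reserve(total);  // single allocation for the whole frame
    out.ranges.reserve(batches.size());
    for (const Batch& b : batches) {
        out.ranges.push_back(BatchRange{out.matrices.size(), b.transforms.size()});
        out.matrices.insert(out.matrices.end(), b.transforms.begin(), b.transforms.end());
    }
    return out;
}

// After one upload of out.matrices, each batch would point its per-instance
// attribute at byte offset firstInstance * sizeof(Mat4) in that buffer and
// issue glDrawElementsInstanced with instanceCount instances.
```

The driver then sees one buffer update per frame instead of one per batch, which is the "reduction in buffer management" the post credits for the speed-up.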
