
Thread: GL_ARB_instanced_arrays slower than glLoadMatrixf

  1. #11
    Junior Member Regular Contributor peterfilm's Avatar
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    only....i get a judder every 1 second....ffs....only when using instanced path.

  2. #12
    Advanced Member Frequent Contributor
    Join Date
    Jan 2007
    Posts
    964
    What size is your per-instance buffer? It may well be the case that if it's too large, the memory allocation that your driver needs to do when orphaning it becomes significant.

    There's a trick in an NVIDIA publication which I've also used on AMD hardware, but with D3D rather than GL; I believe that it may be of benefit to you if this matches your usage. The way you do it is to have a buffer sized for at least 3 frames' worth of data, and - when streaming - instead of orphaning and resetting the buffer cursor to 0, you just reset the cursor. As 3 frames is the normal number of frames that your CPU will get ahead of your GPU, this is safe to do and it means that the driver never has to reallocate buffer memory.

    A variation on it involves counting the number of frames that have passed since the cursor was last at 0; if it's more than 3 then you can reset to 0 without orphaning, otherwise you must orphan as normal.

    There may also be benefit in bumping the number of frames to higher than 3; pick maybe 5 or 6 to give yourself some headroom (useful for cases where the amount of data may vary per-frame, such as a particle system).

    You can easily enough figure the best size with a printf (or equivalent for your program) when a buffer orphaning occurs, then taking the size up gradually until it no longer happens.
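
    The cursor bookkeeping described above can be sketched roughly as follows. This is only an illustration, not code from the NVIDIA publication: `StreamCursor`, `stream_init`, `stream_end_frame` and `stream_reserve` are hypothetical names, and the "3 frames in flight" figure is the assumption stated in the post. The GL calls appear only in comments, so the logic can be checked without a context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: write cursor for a streaming buffer that is sized
   for several frames' worth of per-instance data. */
typedef struct {
    size_t   capacity;          /* total buffer size in bytes */
    size_t   cursor;            /* next free byte */
    unsigned frames_since_wrap; /* frames finished since the cursor was last 0 */
    unsigned safe_frames;       /* assumed max frames in flight (3 in the post) */
} StreamCursor;

void stream_init(StreamCursor *s, size_t capacity, unsigned safe_frames)
{
    s->capacity = capacity;
    s->cursor = 0;
    s->frames_since_wrap = 0;
    s->safe_frames = safe_frames;
}

/* Call once per frame, after the frame's draw calls have been issued. */
void stream_end_frame(StreamCursor *s)
{
    s->frames_since_wrap++;
}

/* Reserve `size` bytes and return the byte offset to write at.
   *must_orphan is set when the cursor wrapped before more than
   `safe_frames` frames had passed, i.e. the GPU may still be reading
   the start of the buffer. */
size_t stream_reserve(StreamCursor *s, size_t size, bool *must_orphan)
{
    assert(size <= s->capacity);
    *must_orphan = false;
    if (s->cursor + size > s->capacity) {
        *must_orphan = (s->frames_since_wrap <= s->safe_frames);
        s->cursor = 0;
        s->frames_since_wrap = 0;
    }
    size_t offset = s->cursor;
    s->cursor += size;
    /* With a GL context one would then do something like:
         if (*must_orphan)
             glBufferData(GL_ARRAY_BUFFER, capacity, NULL, GL_STREAM_DRAW);
         glMapBufferRange(GL_ARRAY_BUFFER, offset, size,
                          GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT); */
    return offset;
}
```

    Logging every time `must_orphan` comes back true is one way to do the printf-style tuning mentioned above: grow the buffer until the message stops appearing.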

  3. #13
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    2,882
    Quote Originally Posted by peterfilm
    EDIT: I'd still like to know what optimisations glLoadMatrix is doing to make it faster than a single attribute packet....it's still got to do basically the same thing in the background surely.
    To your previous point, I have never seen or heard anything to indicate that uniform sets cause a flush -- that would totally kill performance if true. In fact, in my experience they pipeline very, very well (better than buffer updates).

    Also, info on these boards and beyond3d IIRC indicates that plain old uniforms are stored in shared memory on the compute units (SMs, SIMD processors, etc.), rendering lookups from them very, very fast on the GPU (faster than buffer objects). The con to using ordinary uniforms for instancing state is that to set them you have to break a batch, and batch call setup/dispatch can be expensive (thus bindless, VAOs, etc.). Using larger batches also helps reduce this overhead -- thus pseudo-instancing techniques prior to "real" (GPU-side) instancing (ARB_instanced_arrays / ARB_draw_instanced). It all depends on what you're bound on. That said, ARB_instanced_arrays perf is really excellent on large batches (or with bindless regardless), particularly when you're pulling from buffers already on the GPU.
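
    For anyone following along, here is a rough sketch of how a per-instance 4x4 matrix is typically laid out for ARB_instanced_arrays: a mat4 attribute takes four consecutive vec4 slots, each with a divisor of 1. `AttribSlot`, `instance_matrix_layout` and `base_location` are made-up names for illustration, and the sketch assumes tightly packed column-major float matrices. The GL calls are in comments only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one vec4 attribute slot. */
typedef struct {
    unsigned location;   /* vertex attribute index */
    int      components; /* floats per slot (4 for a matrix column) */
    size_t   stride;     /* bytes between consecutive instances */
    size_t   offset;     /* byte offset of this column inside one matrix */
    unsigned divisor;    /* 1 = advance once per instance */
} AttribSlot;

/* Fill in the four slots that a per-instance mat4 occupies, starting at
   base_location (assumed to come from the shader's attribute binding). */
void instance_matrix_layout(unsigned base_location, AttribSlot out[4])
{
    assert(out != NULL);
    for (unsigned col = 0; col < 4; ++col) {
        out[col].location   = base_location + col;
        out[col].components = 4;
        out[col].stride     = 16 * sizeof(float);
        out[col].offset     = col * 4 * sizeof(float);
        out[col].divisor    = 1;
        /* With a GL context, each slot would be applied roughly as:
             glEnableVertexAttribArray(location);
             glVertexAttribPointer(location, 4, GL_FLOAT, GL_FALSE,
                                   stride, (const void *)offset);
             glVertexAttribDivisorARB(location, 1); */
    }
}
```

    The whole batch then goes out in one glDrawElementsInstancedARB (or glDrawArraysInstancedARB) call instead of one draw per uniform update.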

  4. #14
    Junior Member Regular Contributor peterfilm's Avatar
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    mmmyeah, i get what you're saying, but i was rather lazily referencing this 2004 nvidia paper...
    http://http.download.nvidia.com/deve...instancing.pdf
    "Unfortunately, rendering large number of instances results in a large amount of driver work. This is particularly true in GLSL where the driver must map abstract uniform variables into real physical hardware registers. From the hardware’s perspective, the large number of constant updates is not ideal either. Constant updates can incur hardware flushes in the vertex processing engines."
    it's almost certainly out of date information of course, my experiments do seem to back that up.

    thanks mhagain, i'll give that tip a try when i next revisit the code. At the moment I can't imagine getting better performance - it's bloody stellar!

  5. #15
    Junior Member Regular Contributor peterfilm's Avatar
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    also Dark Photon, sorry but the conclusion I've come to is contradictory to yours. Rather than instanced_arrays having only good performance on large batches, they do in fact give way better performance on *single* batches than any other method I've found, so long as all the transforms have previously been uploaded to buffer store in a single operation (and provided you're uploading the eyespace transform instead of world, otherwise you'd be hit by extra vertex shader instructions). Makes sense doesn't it? you're batching the 'uniform' data itself. You could do the same with a uniform buffer and instance_id I suppose, but i've not tried that so can't speak for its relative performance. I have the luxury of targeting hardware that supports instanced_arrays, with a fallback to standard batch submission.

    doubtless without bindless this technique might be slower, because it would involve a lot of buffer binds and VAO modifications (mesh attributes + instance attributes).
    Last edited by peterfilm; 08-14-2012 at 06:33 AM.
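
    A minimal sketch of the "upload eye-space transforms in a single operation" idea from the post above, assuming column-major 4x4 float matrices as used by glLoadMatrixf. `mat4_mul` and `pack_eye_space` are hypothetical helpers, not peterfilm's actual code: they premultiply each model matrix by the view matrix on the CPU and write the results into one contiguous array that can go to the buffer in one call.

```c
#include <assert.h>
#include <stddef.h>

/* out = a * b for column-major 4x4 matrices (element [col*4 + row]). */
void mat4_mul(const float a[16], const float b[16], float out[16])
{
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];
            out[col * 4 + row] = sum;
        }
    }
}

/* Write view * model[i] for every instance into `out`, which must hold
   count * 16 floats and must not alias `models`. */
void pack_eye_space(const float view[16], const float *models,
                    size_t count, float *out)
{
    assert(out != models);
    for (size_t i = 0; i < count; ++i)
        mat4_mul(view, models + i * 16, out + i * 16);
    /* With a GL context, the packed array could then be uploaded once:
         glBufferSubData(GL_ARRAY_BUFFER, 0,
                         count * 16 * sizeof(float), out); */
}
```

    Doing the view multiply on the CPU is what keeps the vertex shader from needing a separate world-to-eye step per vertex.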

  6. #16
    Junior Member Regular Contributor peterfilm's Avatar
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    i take some of that back, from testing it would seem this assertion is only true on the fermi architecture - on older cards it's no faster at all.
