only....i get a judder every 1 second....ffs....only when using instanced path.
What size is your per-instance buffer? It may well be the case that if it's too large, the memory allocation that your driver needs to do when orphaning it becomes significant.
There's a trick in an NVIDIA publication which I've also used on AMD hardware, but with D3D rather than GL; I believe that it may be of benefit to you if this matches your usage. The way you do it is to have a buffer sized for at least 3 frames worth of data, and - when streaming - instead of orphaning and resetting the buffer cursor to 0, you just reset the cursor. As 3 frames is the normal number of frames that your CPU will get ahead of your GPU, this is safe to do: by the time the cursor wraps, the GPU has finished reading the data at the start of the buffer, and it means that the driver never has to reallocate buffer memory.
A variation on it involves counting the number of frames that have passed since the cursor was last at 0; if it's more than 3 then you can reset to 0 without orphaning, otherwise you must orphan as normal.
There may also be benefit in bumping the number of frames to higher than 3; pick maybe 5 or 6 to give yourself some headroom (useful for cases where the amount of data may vary per-frame, such as a particle system).
You can easily enough figure the best size with a printf (or equivalent for your program) when a buffer orphaning occurs, then taking the size up gradually until it no longer happens.
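For what it's worth, here is a rough sketch of the cursor bookkeeping described above, CPU side only. Everything in it is an assumption for illustration: the `StreamCursor` name, the byte sizes, and the default of 3 safe frames are mine, not from the NVIDIA publication, and the actual GL calls (orphaning with `glBufferData(..., NULL, ...)`, then writing at the returned offset) are left as comments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: tracks where to write next in a streaming buffer
// and decides whether wrapping back to 0 needs an orphan or not.
struct StreamCursor {
    std::size_t capacity;     // total buffer size in bytes
    std::size_t offset;       // next free byte
    int framesSinceWrap;      // frames finished since the cursor was last at 0
    int safeFrames;           // assumed max frames the CPU runs ahead of the GPU
    int orphanCount;          // how often we had to orphan (for tuning the size)

    explicit StreamCursor(std::size_t cap, int safe = 3)
        : capacity(cap), offset(0), framesSinceWrap(0),
          safeFrames(safe), orphanCount(0) {}

    // Reserve `bytes` and return the offset to write at.
    // `mustOrphan` is set when the caller has to orphan the buffer first.
    std::size_t reserve(std::size_t bytes, bool& mustOrphan) {
        assert(bytes <= capacity);
        mustOrphan = false;
        if (offset + bytes > capacity) {
            // Out of space: wrap. Only skip the orphan if more than
            // `safeFrames` frames have passed since we were last at 0.
            if (framesSinceWrap <= safeFrames) {
                mustOrphan = true;   // caller: glBufferData(target, capacity, NULL, usage)
                ++orphanCount;
                std::printf("orphaned after %d frames, capacity %zu\n",
                            framesSinceWrap, capacity);
            }
            offset = 0;
            framesSinceWrap = 0;
        }
        std::size_t at = offset;     // caller: write instance data at this offset
        offset += bytes;
        return at;
    }

    // Call once per frame, after all reservations for that frame.
    void endFrame() { ++framesSinceWrap; }
};
```

The printf inside `reserve` is the tuning aid: keep raising `capacity` until it stops firing during a normal run.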
To your previous point, I have never seen or heard anything to indicate that uniform sets cause a flush -- that would totally kill performance if true. In fact, in my experience they pipeline very, very well (better than buffer updates).
Also, info on these boards and beyond3d IIRC indicates that plain old uniforms are stored in shared memory on the compute units (SMs, SIMD processors, etc.), rendering lookups from them very, very fast on the GPU (faster than buffer objects). The con to using ordinary uniforms for instancing state is that to set them you have to break a batch, and batch call setup/dispatch can be expensive (thus bindless, VAOs, etc.). Using larger batches also helps reduce this overhead -- thus pseudo-instancing techniques prior to "real" (GPU-side) instancing (ARB_instanced_arrays / ARB_draw_instanced). It all depends on what you're bound on. That said, ARB_instanced_arrays perf is really excellent on large batches (or with bindless regardless), particularly when you're pulling from buffers already on the GPU.
mmmyeah, i get what you're saying, but i was rather lazily referencing this 2004 nvidia paper...
http://http.download.nvidia.com/deve...instancing.pdf
it's almost certainly out of date information of course, my experiments do seem to back that up.
"Unfortunately, rendering large number of instances results in a large amount of driver work. This is particularly true in GLSL where the driver must map abstract uniform variables into real physical hardware registers. From the hardware's perspective, the large number of constant updates is not ideal either. Constant updates can incur hardware flushes in the vertex processing engines."
thanks mhagain, i'll give that tip a try when i next revisit the code. At the moment I can't imagine getting better performance - it's bloody stellar!
also Dark Photon, sorry but the conclusion I've come to is contradictory to yours. Rather than instanced_arrays having only good performance on large batches, they do in fact give way better performance on *single* batches than any other method I've found, so long as all the transforms have previously been uploaded to buffer store in a single operation (and provided you're uploading the eyespace transform instead of world, otherwise you'd be hit by extra vertex shader instructions). Makes sense doesn't it? you're batching the 'uniform' data itself. You could do the same with a uniform buffer and instance_id I suppose, but i've not tried that so can't speak for its relative performance. I have the luxury of targeting hardware that supports instanced_arrays, with a fallback to standard batch submission.
doubtless without bindless this technique might be slower, because it would involve a lot of buffer binds and VAO modifications (mesh attributes + instance attributes).
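To make the "upload eyespace transforms in a single operation" part concrete, here's a minimal CPU-side sketch. It's only my guess at the shape of it: `Mat4`, `multiply`, `translation` and `buildEyeSpaceInstanceData` are hypothetical names, and matrices are assumed column-major. It premultiplies each instance's model matrix by the view matrix and packs the results into one contiguous array, so the whole lot can go to the buffer in one call.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Mat4 = std::array<float, 16>;  // column-major: element (row, col) at [col * 4 + row]

// r = a * b for column-major 4x4 matrices.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];
            r[col * 4 + row] = sum;
        }
    }
    return r;
}

// Simple translation matrix, used here only to have something to test with.
Mat4 translation(float x, float y, float z) {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    m[12] = x; m[13] = y; m[14] = z;
    return m;
}

// Pack view * model for every instance into one float array,
// ready for a single buffer upload (16 floats per instance).
std::vector<float> buildEyeSpaceInstanceData(const Mat4& view,
                                             const std::vector<Mat4>& models) {
    std::vector<float> packed;
    packed.reserve(models.size() * 16);
    for (const Mat4& model : models) {
        Mat4 eye = multiply(view, model);  // done once on the CPU, not per vertex
        packed.insert(packed.end(), eye.begin(), eye.end());
    }
    assert(packed.size() == models.size() * 16);
    return packed;
}
```

The packed array would then go up in one `glBufferData` (or `glBufferSubData`) call, with the matrix exposed as four vec4 attributes, each given `glVertexAttribDivisor(location + i, 1)`. The vertex shader then only has to apply the projection.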
Last edited by peterfilm; 08-14-2012 at 06:33 AM.
i take some of that back, from testing it would seem this assertion is only true on the fermi architecture - on older cards it's no faster at all.