Instancing sucks ?



BionicBytes
05-04-2010, 08:55 AM
I'm getting round to adding instancing support to my engine primarily because we read about it so much in the DX world and it seems like a good idea in principle. However, I'm starting to have my doubts in real world scenarios (performance related issues).

I'll try and explain my thinking below...

I'm adding instancing support to render relatively simple OBJ type models (think trees on a terrain). Ignoring the problem of LODding the models with camera distance, I want a simple technique to plumb into the engine, and as I see it there are three techniques to choose from:

1. Uniform Buffer Objects
2. Texture Buffer Objects
3. Instanced arrays (ARB_instanced_arrays)


The per-instance data I'm trying to plumb into the engine is a modelview matrix per instance and all three techniques could be used in principle to solve this.
So which technique to use?

1. Uniform Buffer Objects
This is actually harder to plumb into the engine than I first thought. I've modified my underlying shader library to support Uniform Blocks, but I have to track which shader is accessing which UBO, because if a shader is recompiled I have to issue a glUniformBlockBinding to set the block uniform binding points for that shader.
Additionally, the memory layouts are a pain and the application needs to track offsets to packed uniforms within the block. Finally, there is a limit on its size anyway - which may or may not be an issue.
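
As a rough illustration of the offset bookkeeping described above: if the block is declared with layout(std140), member offsets follow fixed alignment rules and can be computed up front instead of being queried per program. The sketch below assumes std140 and only scalar/vector members; align_up and place_member are hypothetical helper names, not part of OpenGL.

```c
#include <stddef.h>

/* Rounds `offset` up to the next multiple of `alignment`
   (alignment must be a power of two). */
static size_t align_up(size_t offset, size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Places one member with the given std140 base alignment and size in bytes.
   Typical values: float = (4, 4), vec2 = (8, 8), vec3 = (16, 12), vec4 = (16, 16).
   Advances *cursor past the member and returns the member's byte offset. */
static size_t place_member(size_t *cursor, size_t alignment, size_t size) {
    size_t offset = align_up(*cursor, alignment);
    *cursor = offset + size;
    return offset;
}
```

For a block containing a float, a vec3 and a vec4 in that order, this gives offsets 0, 16 and 32.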

I'm having difficulty coding up a suitable generic solution for Uniform Buffer Objects, so I'll have to defer this for now.


2. Texture Buffer Objects
These are a dream to work with; accessed just like a texture and simpler to create than vertex buffer objects. Dead easy to plumb into the abstract library which my engine is built upon.
Two TBOs are created: one to hold the entire set of modelview matrices for all 400 instances; the other to hold an index list of which [modelview] instance to render this frame.
The index TBO is updated each frame to include the index [into the modelview TBO] of the models which have been determined to be visible.
During rendering, glDrawElementsInstanced is called and the vertex shader performs a TexelFetch on the uSampleBuffer uniform to fetch the model index. Using this model index, 4 more TexelFetches are performed to read the complete modelview matrix.

Here's a snippet from the vertex shader:

//uniform mat4 modelmatrix; //replaced with texture buffer object - instanced rendering
uniform samplerBuffer modelmatrixbuffer; //RGBA32F
uniform usamplerBuffer renderlistbuffer; //R32UI

mat4 modelmatrix;
int offset = 4 * int(texelFetchBuffer( renderlistbuffer, gl_InstanceID).r); //get the real batch instance from the render list (supplied as an integer texture buffer)
// offset = int (gl_InstanceID * 4); //matrices are indexed as blocks of 4 RGBA floats
modelmatrix[0] = texelFetchBuffer( modelmatrixbuffer, offset);
modelmatrix[1] = texelFetchBuffer( modelmatrixbuffer, offset+1);
modelmatrix[2] = texelFetchBuffer( modelmatrixbuffer, offset+2);
modelmatrix[3] = texelFetchBuffer( modelmatrixbuffer, offset+3);
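
On the application side, the per-frame update of the index TBO could look roughly like the sketch below. It is only an outline: the visibility flags are assumed to come from whatever culling the engine already does, and build_render_list is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills `render_list` with the indices of the visible instances and returns
   how many were written. The list would then be uploaded to the R32UI buffer
   texture, and the returned count passed as the instance count of
   glDrawElementsInstanced. `render_list` must have room for `instance_count`
   entries. */
static size_t build_render_list(const unsigned char *visible,
                                size_t instance_count,
                                uint32_t *render_list) {
    size_t written = 0;
    for (size_t i = 0; i < instance_count; ++i) {
        if (visible[i]) {
            render_list[written++] = (uint32_t)i;
        }
    }
    return written;
}
```

With this arrangement gl_InstanceID runs over the compacted list, and the shader's first fetch maps it back to the real instance index.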


3. Instanced_Arrays
Part of OpenGL 3.3, but the extension has been around on ATI drivers as an ARB extension for a while. This method allows us to upload per-vertex streams to OpenGL, but these streams are incremented only once per object instead of per vertex.
My assumption is that this technique is more efficient than the other two - they have to look up 5 texels/uniforms per vertex, which is hurting performance. With this technique we are sending more per-vertex data streams (4 * RGBA floats), but there is less work per vertex - so this should result in faster rendering.

//uniform mat4 modelmatrix; //replaced with 4 per-vertex attribute streams - instanced rendering
attribute vec4 modelview1;
attribute vec4 modelview2;
attribute vec4 modelview3;
attribute vec4 modelview4;

mat4 modelmatrix;
modelmatrix[0] = modelview1;
modelmatrix[1] = modelview2;
modelmatrix[2] = modelview3;
modelmatrix[3] = modelview4;
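
For the four attribute streams above, the buffer layout on the application side can be sketched as follows, assuming each instance's matrix is stored as 16 tightly packed floats (four vec4 columns). The helper names are hypothetical; the GL calls are only mentioned in comments.

```c
#include <stddef.h>

enum { FLOATS_PER_COLUMN = 4, COLUMNS_PER_MATRIX = 4 };

/* Byte stride between consecutive instances in the instance buffer. */
static size_t instance_stride(void) {
    return COLUMNS_PER_MATRIX * FLOATS_PER_COLUMN * sizeof(float);
}

/* Byte offset of column `col` (0..3) inside one instance's matrix.
   Each column would be bound with something like
     glVertexAttribPointer(loc + col, 4, GL_FLOAT, GL_FALSE, stride, offset)
   followed by glVertexAttribDivisor(loc + col, 1) (glVertexAttribDivisorARB
   when using the extension), so that the attribute advances once per
   instance rather than once per vertex. */
static size_t column_offset(int col) {
    return (size_t)col * FLOATS_PER_COLUMN * sizeof(float);
}
```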


Results
Compared to drawing 400 instances individually:
TBO technique is slower (ATI Radeon 4850, Quad core processor 2.6GHz, OpenGL 3.3/4.0 beta drivers and also on my nVidia GT8600m Laptop) - roughly 33% slower.
Instanced_Arrays is significantly slower (can only test on ATI - since nVidia mobile drivers don't yet support ARB or GL 3.3) - roughly 75% slower.

I can't put this down to beta drivers since the ARB extension has been around on ATI drivers for some time. The only performance point I can make is that I don't cull any vertices before drawing with technique #3; in other words I just draw all 400 models whether they are in camera view or not. I intend to perform more tests whereby I cull away non-visible models and then upload to the VBO only the model matrices of visible objects. The downside to this is the extra time spent copying memory.
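
The planned cull-then-upload step amounts to compacting the visible matrices into a contiguous staging array before the buffer update. A minimal sketch, assuming 16 floats per matrix and a per-instance visibility flag (compact_visible_matrices is a hypothetical name):

```c
#include <stddef.h>
#include <string.h>

/* Copies the 4x4 matrices of visible instances to the front of `dst` and
   returns the number copied. `src` holds `count` matrices of 16 floats each;
   `dst` must have room for the same amount. The result is what would be
   uploaded to the instance VBO, and the return value is the instance count
   for the draw call. */
static size_t compact_visible_matrices(const float *src,
                                       const unsigned char *visible,
                                       size_t count,
                                       float *dst) {
    size_t copied = 0;
    for (size_t i = 0; i < count; ++i) {
        if (!visible[i]) {
            continue;
        }
        memcpy(dst + 16 * copied, src + 16 * i, 16 * sizeof(float));
        ++copied;
    }
    return copied;
}
```

The memcpy per visible instance is the "extra time spent copying memory" mentioned above: 64 bytes per instance per frame.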


Conclusion
Instanced rendering is not worth the effort and provides no benefits (real world).
I guess for specific cases where many 1000s of objects would be drawn (eg asteroids in a space simulator), there may be some benefit.

Anyone else had similar experiences they wish to share?

DmitryM
05-04-2010, 09:14 AM
Instanced_arrays are the most convenient and appropriate here.
If they are not fast enough today, they surely will be when the driver matures.

Alfonse Reinheart
05-04-2010, 09:28 AM
Compared to drawing 400 instances individually:

That's your problem right there. Instancing is for when you want to render thousands of something, not merely hundreds.

glfreak
05-04-2010, 10:13 AM
Stick to the GL 1.1 goodies mate :)

I still use traditional glBegin/End whenever rendering a dynamic scene, and for static geometry just use display lists. This is how GL was originally designed to be used.

I posted long ago about instancing in GL before it ever existed, and I got replies telling me that, unlike Direct3D, GL does not need instancing since there's not much drawing overhead. And the only reason instancing was there was because D3D imposes a lot of draw call overhead. Lies?

Now GL seems to adopt every [censored] that comes out of the D3D world, while the latter has been approaching GL.

And we still blame the drivers. Geeeze! :D

randall
05-05-2010, 01:37 AM
Stick to the GL 1.1 goodies mate :)

I still use traditional glBegin/End whenever rendering a dynamic scene, and for static geometry just use display lists. This is how GL was originally designed to be used.



Nothing to be proud of. Today we have different hardware.

BionicBytes
05-05-2010, 02:15 AM
Compared to drawing 400 instances individually:

That's your problem right there. Instancing is for when you want to render thousands of something, not merely hundreds.

This is where I have a problem with all this.

In the real world we can't just draw 1000 tree models on the terrain and be done with it - the GPU just can't cope with all those vertices and complex pixel shader calculations/lighting.
What we need is LOD calculations - but this then breaks the whole instancing thing. For example those 1000 tree objects would have to be broken down into multiple batches of: 100 at LOD=1, 300 at LOD=2, 200 at LOD=3 and 400 at LOD=4 (for example) and switch between different material shaders for each batch; thus losing the benefits of instancing in the first place.

Although we are blessed with such lovely h/w these days - they still are not fast enough to 'just draw' the scene as it was originally intended without having to setup complicated code paths to get around performance related issues.

It's all very frustrating!

I guess I'll have to go back and draw 'grass' objects. At least there's something I can use this research for so I haven't wasted my time entirely.

Ludde
05-05-2010, 02:36 AM
This is where I have a problem with all this.

In the real world we can't just draw 1000 tree models on the terrain and be done with it - the GPU just can't cope with all those vertices and complex pixel shader calculations/lighting.
What we need is LOD calculations - but this then breaks the whole instancing thing. For example those 1000 tree objects would have to be broken down into multiple batches of: 100 at LOD=1, 300 at LOD=2, 200 at LOD=3 and 400 at LOD=4 (for example) and switch between different material shaders for each batch; thus losing the benefits of instancing in the first place.


But every tree in a specific LOD could be drawn using instancing
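
One way to picture this: bucket the visible instances by LOD level first, then issue one instanced draw per bucket. The sketch below is only an outline of that grouping step, assuming four LOD levels and an integer LOD already computed per instance; the function names are hypothetical.

```c
#include <stddef.h>

enum { LOD_LEVELS = 4 };

/* Counts how many instances fall into each LOD level.
   Every lod[i] must be in the range [0, LOD_LEVELS). */
static void count_per_lod(const int *lod, size_t n, size_t counts[LOD_LEVELS]) {
    for (int l = 0; l < LOD_LEVELS; ++l) {
        counts[l] = 0;
    }
    for (size_t i = 0; i < n; ++i) {
        counts[lod[i]]++;
    }
}

/* Writes instance indices grouped by LOD into `sorted` (room for n entries).
   starts[l] is the position in `sorted` where level l begins, so level l
   occupies [starts[l], starts[l] + count of level l). Each such range would
   feed one instanced draw of that LOD's mesh. */
static void group_by_lod(const int *lod, size_t n,
                         size_t starts[LOD_LEVELS], size_t *sorted) {
    size_t counts[LOD_LEVELS];
    size_t cursor[LOD_LEVELS];
    size_t offset = 0;
    count_per_lod(lod, n, counts);
    for (int l = 0; l < LOD_LEVELS; ++l) {
        starts[l] = cursor[l] = offset;
        offset += counts[l];
    }
    for (size_t i = 0; i < n; ++i) {
        sorted[cursor[lod[i]]++] = i;
    }
}
```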

BionicBytes
05-05-2010, 05:04 AM
....yes, every tree of a certain LOD can be drawn at once with a single command - but that's the point. The batch size has now decreased from a single batch of 1000 to 4 batches of 100, 200, 300 or 400 (in my made up example). As other posts have hinted, and as Alfonse hinted at, instancing seems to require at least 1000 objects to experience any performance gains. So by breaking up the huge batch of tree objects into smaller batches by LOD, we lose all the potential performance gains!

Dark Photon
05-05-2010, 05:19 AM
That's your problem right there. Instancing is for when you want to render thousands of something, not merely hundreds.
This is where I have a problem with all this.
Yeah, I thought his statement was a bit ridiculous as well. Don't worry about it.


In the real world we can't just draw 1000 tree models on the terrain and be done with it - the GPU just can't cope with all those vertices and complex pixel shader calculations/lighting.
Yes, or restated, you could, but there's a point past which it's a net waste of frame time to do so.

And the tipping point depends on the vertex complexity of your instance (and num pix, if frag shading is complex), your GPU, and your CPU.

Alfonse's hand-wave is not helpful, in fact quite destructive IMO.


What we need is LOD calculations.
Absolutely! And this is why jamming "boatloads of instances per batch" is a dumb idea. It culls like crap, and just dogs down the GPU, wasting frame time.

It's great if these instances are all you're rendering in a toy app, or perhaps you just want "interactive" performance. But when you're rendering a "real" wide-area scene (tons of stuff besides those instances) and insisting on hard "real-time" performance (e.g. 60Hz), it's a bad idea for frame time consumption to just blindly pump huge instance groups at the GPU.

This is one reason why minimizing the existing CPU-side waste submitting draw calls (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=276463) (ala CBOs, bindless graphics, etc.) is so important. It lets you render more things, the way you want to, without having to cram them into the instancing model even when it doesn't make sense. And it lets you do finer-granularity CPU-side LOD, culling, etc. and net: render a more visually varying and interesting scene with a lot less effort, merely by reclaiming the now-wasted time that your process is blocked waiting on main memory accesses.


Although we are blessed with such lovely h/w these days - they still are not fast enough to 'just draw' the scene as it was originally intended without having to setup complicated code paths to get around performance related issues.
Yeah, well that's partly what makes interactive graphics interesting, and why we have jobs, right? ;) I mean the whole Z-buffer "smash the entire scene onto the film" thing, and most of what we do in the process (including culling and LOD), is "to get around performance-related issues." Otherwise, we'd all just do stochastic ray tracing of all our scenes and be done with it :)

glfreak
05-05-2010, 07:13 AM
Stick to the GL 1.1 goodies mate :)

I still use traditional glBegin/End whenever rendering a dynamic scene, and for static geometry just use display lists. This is how GL was originally designed to be used.



Nothing to be proud of. Today we have different hardware.

True we have different hardware, but OpenGL is not meant to be an interface to 3D accelerating hardware, it's a general purpose 3D rendering library that can be accelerated by hardware vendors.

Even with the new hardware, I cannot see a problem implementing glBegin/End for dynamic rendering using the video memory vertex buffers.

The implementation should take care of all acceleration goodies and make it transparent to the client through high level GL calls. ;)

Anyway..I went off topic

Regarding the instancing feature in "new hardware," it should make a difference in performance even with not too many objects. It does not make sense to me why it should need tons of objects to see any difference and otherwise perform slower... This is total nonsense.

The only interpretation of this is that either the hardware is broken, or drivers are faking this feature.

randall
05-05-2010, 08:24 AM
Even with the new hardware, I cannot see a problem implementing glBegin/End for dynamic rendering using the video memory vertex buffers.



I can see the problem. Thousands of function calls fill driver queue and pollute CPU cache. CPU becomes bottleneck and GPU waits for CPU doing nothing...

glfreak
05-05-2010, 09:54 AM
But we are doing it anyway when filling the hardware vertex buffers...CPU stalls.

HAL-10K
05-05-2010, 09:58 AM
But we are doing it anyway when filling the hardware vertex buffers...CPU stalls.
Go ahead with your immediate mode rendering, but I can assure you that you won't convince anyone here with this nonsense.

glfreak
05-05-2010, 10:28 AM
But we are doing it anyway when filling the hardware vertex buffers...CPU stalls.
Go ahead with your immediate mode rendering, but I can assure you that you won't convince anyone here with this nonsense.

The nonsense is indeed having a hard time making VBOs work and figuring out if it's a driver bug or programmer misuse.

I'm not against the VBO stuff but this could be done appropriately if we had something like:

glEnable(GL_VB);

glVBLayout(...)

define and fill vertexData in system memory

glBegin(GL_TRIANGLES);

glVB(vertexData);

glEnd();

;)

HAL-10K
05-05-2010, 10:40 AM
The nonsense is indeed having a hard time making VBOs work and figuring out if it's a driver bug or programmer misuse.

I'm not against the VBO stuff but this could be done appropriately if we had something like:

glEnable(GL_VB);

glVBLayout(...)

define and fill vertexData in system memory

glBegin(GL_TRIANGLES);

glVB(vertexData);

glEnd();

;)
What you are suggesting here is absolute nonsense yet again.

Why are you posting in "OpenGL coding: advanced" and even make suggestions on development strategies when you aren't even able to use an API interface that is as simple as VBOs?

randall
05-05-2010, 10:43 AM
The nonsense is indeed having a hard time making VBOs work and figuring out if it's a driver bug or programmer misuse.



What part of VBO is hard for you?

glfreak
05-05-2010, 10:45 AM
Hmmmmmmmmmm 40% performance drop?

Crashes on some hardware?

;)

HAL-10K
05-05-2010, 10:50 AM
Hmmmmmmmmmm 40% performance drop?

Crashes on some hardware?

;)
Can't you see that it is most likely that you are the one who is doing it wrong, considering that literally everyone uses VBOs nowadays?

arekkusu
05-05-2010, 11:14 AM
<serious question>

What's wrong with this?



#define BUFFER_OFFSET(i) ((char *)NULL + (i))

glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(positions), positions, GL_STATIC_DRAW);

glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(1));


What will happen when you draw with this state?

HAL-10K
05-05-2010, 12:02 PM
<serious question>

What's wrong with this?



#define BUFFER_OFFSET(i) ((char *)NULL + (i))

glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(positions), positions, GL_STATIC_DRAW);

glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(1));
No

What will happen when you draw with this state?
Nothing is going to be drawn because you haven't defined any vertex data but only attributes.

Why are you asking this (here)?

peterfilm
05-05-2010, 12:46 PM
your vertex attribute is offset by 1 byte into the currently bound VBO. Is that your intention?
are you drawing with a shader? if not, then you should be aware that generic vertex attributes do not work with the fixed function.

Alfonse Reinheart
05-05-2010, 01:04 PM
In the real world we can't just draw 1000 tree models on the terrain and be done with it - the GPU just can't cope with all those vertices and complex pixel shader calculations/lighting.

Then don't use instancing. Instancing is a simple tool for a simple problem. If your problem does not fit the problem that instancing is meant to solve, then I'm guessing instancing won't solve the problem it's not meant to solve.

If you can finesse or coerce your data to actually fit the conditions that instancing works under, good. But instancing is not a magical panacea that will solve every problem associated with drawing lots of things.

If my data were not instanceable, I'd next look to glDrawElementsBaseVertex to avoid extra buffer binds and format changes. State changes are something you'll have to live with.

Alternatively, you could accept the CPU limitations and increase your mesh and shader complexity to the point where you're GPU limited again.

Dan Bartlett
05-05-2010, 01:05 PM
Also, according to spec (and apparently AMD implementation, but not NVidia), that code should fail in an OpenGL 3.2+ core profile, since the default vertex array object (the name zero) is deprecated.

to function, it would need this at the start:


glGenVertexArrays(1, &vao);
glBindVertexArray(vao);

arekkusu
05-05-2010, 01:07 PM
your vertex attribute is offset by 1 byte into the currently bound VBO. Is that your intention?
are you drawing with a shader? if not, then you should be aware that generic vertex attributes do not work with the fixed function.
Yes-- assume all of the relevant shader etc state has been prepared, and that the positions array contains real data.

The intention was that the pointer is offset one byte into the VBO, and the attribute is 3 floats.

So-- what's going to happen when this is drawn?

What does the spec say should happen?

What do you think should happen?

I'm asking this, re: glfreak's hypothetical "glVBLayout()".

Ilian Dinev
05-05-2010, 02:46 PM
So-- what's going to happen when this is drawn?
Random floats, producing a spiky triangle hedgehog soup. Try it :) . The num-primitives will be decreased by 1 if the buffer wasn't created big enough.

It's just a contiguous chunk of memory, which the GL implementation decides where to be kept and when. Nothing is magically optimized for you - it's your task as a programmer to align and preprocess the data nicely, just like cpu-side data in any app.
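
The effect of the 1-byte offset can be reproduced on the CPU. The sketch below assumes the implementation simply reads 4 bytes starting at the given offset, as described above; what a particular driver actually does with a misaligned pointer is not covered here. read_float_at is a hypothetical helper.

```c
#include <string.h>

/* Reads a float starting `byte_offset` bytes into `data`, the way an
   attribute pointer with that offset would address the buffer. memcpy is
   used so the misaligned read is well defined in C. */
static float read_float_at(const void *data, size_t byte_offset) {
    float value;
    memcpy(&value, (const unsigned char *)data + byte_offset, sizeof value);
    return value;
}
```

With positions = {1.0f, 2.0f, 3.0f, ...}, offset 0 returns 1.0f, while offset 1 returns a value assembled from three bytes of the first float and one byte of the second, which is no longer a meaningful coordinate.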


@BionicBytes:
you never shared what your scene specs are. "33% slower" and "400 instances" tell nothing. Furthermore, just 400 instances aren't telling much. Try 140k per frame :) (I've had code do this, @60fps).

Try this: store N instances' unique data simply in uniforms, vec4[] or mat4[] arrays. Keep N*sizeof(InstanceData)<8192. Access them in a way that element offsets are easy to calculate.
You must always bear in mind that access to the per-instance data via uniforms-vs-UBOs-vs-TBOs-vs-InstancedArrays has its pros and cons - and shader-overhead per vertex. None is universal, but each can be important in certain situations.

Instancing does not suck, it's just that GL's drawcalls are already quite fast enough for most stuff.

peterfilm
05-05-2010, 03:01 PM
Also, according to spec (and apparently AMD implementation, but not NVidia), that code should fail in an OpenGL 3.2+ core profile, since the default vertex array object (the name zero) is deprecated.
is everyone assuming people are coding to the 3.2+ core profiles?
btw, is there some currently maintained resource that shows which profiles are supported on which hardware? and which extensions are supported too?
there used to be the delphi3d.net repository, but that's down.
and the glview database doesn't seem to be being updated much (maybe due to the fact that it fails to launch a webmail client to send the report!).
http://www.realtech-vr.com/glview/

BionicBytes
05-05-2010, 04:04 PM
I didn't see the point giving actual fps as they mean nothing. I was comparing two different techniques so a relative figure is all that is needed.
I can tell you that I render 400 instances in 3 different ways: normally, reflected and then shadowed - all part of the scene graph. I have tested again upping the instance count to 4096 (so that's 12000+ instances in total across 3 draw calls). In this scene there are over 130 million visible triangles! Yes, that's right - although the fps is only 14-20 whether instance arrays are used or not. The difference when not instancing via instance arrays is more CPU time performing the draw calls (as measured by the engine) but strangely the app feels more responsive to mouse input, resulting in smoother frames during camera panning and motion - despite similar draw calls.
I also tested 1600 instances (same scene engine- so 1600 instances for normal rendering and again during reflection and again shadowing for the sun light). Here no instancing was giving 50 fps for 80+ million tris and instancing ( arb arrays) was giving 33 fps.

So I disagree- instancing does suck!
Common sense suggests drawing everything in a single call should be quicker than the same with multiple draw calls - but it's not.
I also don't agree that instancing isn't suitable for my needs. What could be more appropriate or simple than supplying x models with x modelview matrices. Surely the perfect instancing case?


Someone asked about core profile - no, just the compatibility 3.3 profile on ATI Radeon 4850.

HAL-10K
05-05-2010, 04:26 PM
Try to upload the instance data with uniforms as it has already been suggested.

This gives me a significant improvement in a (game) scene with only a few dozen instances on average for a few hundred (instantiated) draw calls.

Ilian Dinev
05-05-2010, 04:56 PM
Now that's some nice data :) . The lack of mouse-input smoothness is imho related to the drawcall complexity, in my scenes I've seen it if I bake everything in several big meshes.

With the simple uniform-arrays trick, my instanced objects have absolutely the same performance as simple drawcalls _at_minimum_. And this saves a lot of cpu :). 10 mil poly/frame, 48fps , 30k instances total of (only) ~150 base meshes (encapsulated in VAOs actually). GTX275 msaa4 deferred 720p. (the 1 triangle/cycle limit is nigh). Non-synthetic scene/benchmark.
I recalculate/reupload instances' data via glUniform4fv; If I needed instance-data-size to be higher than 100 bytes, I'd try UBOs/TBOs/IAs again (were slower, but I tried them just when they were announced), but for now I'm happy with this tiny almost-universal solution :) . Plus, for those bigger chunks I can pass gl_InstanceID to the frag-shader, which often fetches instance-data fewer times than the count of vertices in bigger scenes (after rough depth prepass).

Anyway, what I meant is that maybe IAs and TBOs have too high shader and/or driver overhead, around uploading and fetching.

Dan Bartlett
05-05-2010, 05:06 PM
I was just going to test this + noticed glVertexAttribDivisor +GL_VERTEX_ATTRIB_ARRAY_DIVISOR aren't included in gl3.h at the moment, even though they are part of core now since OpenGL 3.3.

I added this report at http://www.khronos.org/bugzilla/show_bug.cgi?id=299

Alfonse Reinheart
05-05-2010, 07:05 PM
Common sense suggests drawing everything in a single call should be quicker than the same with multiple draw calls - but it's not.

It is faster; that is, calling the function is faster. However, by implementing instancing, you have made your shader/rendering system do more work. So what may have been CPU bound now becomes GPU bound.


I also don't agree that instancing isn't suitable for my needs. What could be more appropriate or simple than supplying x models with x modelview matrices. Surely the perfect instancing case?

Um, no.

Instancing is intended to remove state change and draw call overhead when drawing large numbers of objects. That is, if your rendering is CPU-bound, it should provide a speed-up if you're drawing a lot of things.

So before you can expect performance improvements, you need your rendering loop to be CPU-bound on state change and draw call overhead. This is why measuring FPS is not really the best way to test this kind of performance.

Once you've ensured that you are CPU-bound, you should then ensure that you are rendering enough instances for the draw call overhead gain to offset the loss from using less efficient means.

Ilian's uniform array trick might work, though you don't get very many uniforms to play with.

BionicBytes
05-06-2010, 02:03 AM
It is faster; that is, calling the function is faster. However, by implementing instancing, you have made your shader/rendering system do more work. So what may have been CPU bound now becomes GPU bound.

Not sure about this statement. Whether instancing or not, the system still has to draw 4000+ instances - the only difference is whether the CPU is locked into a loop whilst doing so. There is no extra work for the rendering loop to do - 12 million triangles is 12 million triangles whether rendered one at a time in a CPU loop, or as an instanced batch. On top of that, my observations show that the extra 4000+ drawcalls for each CPU instance are a lower overall overhead than via instancing with ARB_instanced_arrays or TBO. This surprised me - as I said I'd assumed this to be a perfect case for instancing.



So before you can expect performance improvements, you need your rendering loop to be CPU-bound on state change and draw call overhead.

Instancing is intended to remove state change and draw call overhead when drawing large numbers of objects.

How does instancing improve the situation if the app is CPU state change limited? Maybe I've missed something here, but I thought the point of instancing was to avoid state changes by rendering the same object over and over again. This does not usually involve any state changes by definition. Per instance you only want to vary something per object instance as a whole such as position, colour, 3rd texture coordinate or something and avoid the draw function call overhead. What I can see as a possible limitation is some architectural limit on the number of vertex attribute streams being passed in simultaneously - I currently use 7 with 4 being used to pass the per instance modelview matrix.



Ilian's uniform array trick might work, though you don't get very many uniforms to play with.

I must have missed this whole uniform arrays idea. Can someone enlighten me on this, as I only know about UBOs, which came as part of ARB_uniform_buffer_object. Did they arrive via EXT_bindable_uniform? Are they part of core (and which version)? What limitations are there?

...if they are easy to integrate then I can test straight away...

BionicBytes
05-06-2010, 02:30 AM
10 mil poly/frame, 48fps , 30k instances total of (only) ~150 base meshes (encapsulated in VAOs actually). GTX275 msaa4 deferred 720p.

OK, that sounds like a lot of instances! But if I do the maths, that means you have ~150 batches which total 30k instances. That's actually only 200 instances per batch - so actually it's quite a small #instances per batch. I found that with IAs and TBOs instancing is not worth the effort - so are you using uniform arrays? - these are something I've overlooked.

Ilian Dinev
05-06-2010, 02:50 AM
"Uniform arrays" is not some special new object. It's just this:



uniform vec4 data[512];

//
void main(){
vec4 v1 = data[gl_InstanceID];
}


Access to good ol' uniforms is immediate in the shader (unlike with the TBOs/etc), upload of them is the fastest RAM->VRAM transfer available, and the total-size limitation forces you to keep data in L1 cache (which the driver then copies quickly in aligned fashion [without memory read-back] to the first FIFO and internal buffers).

You do have to limit the number of instances per glDrawElementsInstanced() call to e.g. 150 (for a mat4x3+float), but it's a non-problem.
Meanwhile, if you were doing transform-feedback visibility culling, or the per-instance data is big and constant, then just uniform-arrays won't be enough.
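
The per-call limit follows from simple arithmetic on the uniform budget. A small sketch, assuming the earlier N*sizeof(InstanceData) < 8192 rule of thumb and tightly packed per-instance data (a mat4x3 plus a float is 13 floats, 52 bytes); the 8192 figure and the packing are assumptions, and real limits depend on the hardware and on how the compiler packs uniforms.

```c
#include <stddef.h>

/* Largest instance count N such that N * bytes_per_instance stays strictly
   below budget_bytes. Returns 0 for degenerate inputs. */
static size_t max_instances(size_t budget_bytes, size_t bytes_per_instance) {
    if (budget_bytes == 0 || bytes_per_instance == 0) {
        return 0;
    }
    return (budget_bytes - 1) / bytes_per_instance;
}
```

For 52 bytes per instance this gives 157, in line with the "about 150" figure; for a full mat4 (64 bytes) it gives 127.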

Btw
"30k instances of 150 base-meshes" - some of the meshes were 60k tris, few instances, some were 200-500 tris, 5k instances.

BionicBytes
05-06-2010, 03:49 AM
Oh I see.


uniform vec4 data[512];

So how do you allow the uniform array to be any size?
In other words, I won't know the size of the array until I load the scene data - so how does that fit with the shader having a defined array size during compile?
Is there some sort of 8K boundary for array sizes? Is that why you said keep the DrawInstanced batch count to ~150?
Is that 8K per uniform or per shader?

BionicBytes
05-06-2010, 03:52 AM
"30k instances of 150 base-meshes" - some of the meshes were 60k tris, few instances, some were 200-500 tris, 5k instances

Good stuff! Did you ever benchmark the instancing benefit versus drawing LODed models instead (non instancing) - thus reducing the vertex count and/or pixel shader instructions (LODed material shader to match model LOD)?

Ilian Dinev
05-06-2010, 05:26 AM
I pick a number e.g. MaxInstances=128 for each shader. I currently optimize only for nVidia G80 and GTXxxx, so I try to keep the 16kB register-file full, but empty enough for enough warps to fit (depends on the size of per-warp registers, which is often around 20-80 floats). So, 8kB happens to be a good middle-ground (for GTX, 4k for G80) if I do only 1-3 tex-lookups in the frag-shader.

Let's say I picked MI=128. If I have only 7 instances, I upload only 7 instances' data via glUniform4fv. If I have 300 instances, I upload the first 128, call glDrawElementsInstanced, upload the next 128, call glDrawxx, upload the remaining 300-128-128=44 instances, call glDrawxx. Simple :) . 3 calls instead of 300. Staying in L1 instead of going overboard. Not having to resize any buffers.
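
The splitting described here can be sketched as a small helper that only computes the batch sizes; in a real renderer each batch would be preceded by a glUniform4fv upload of that batch's data and followed by one glDrawElementsInstanced call. split_into_batches is a hypothetical name.

```c
#include <stddef.h>

/* Splits `total` instances into batches of at most `max_per_batch`.
   Writes the batch sizes to `sizes` (at most `capacity` entries) and
   returns the number of batches written. */
static size_t split_into_batches(size_t total, size_t max_per_batch,
                                 size_t *sizes, size_t capacity) {
    size_t batches = 0;
    if (max_per_batch == 0) {
        return 0;
    }
    while (total > 0 && batches < capacity) {
        size_t batch = total < max_per_batch ? total : max_per_batch;
        sizes[batches++] = batch;
        total -= batch;
    }
    return batches;
}
```

With total = 300 and max_per_batch = 128 this produces three batches of 128, 128 and 44, matching the example above.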


[edit]
Unfortunately, I do not do regular LOD yet, just A2C dissolve for foliage; I barely have enough time to model the LOD=0 meshes currently ^^". I had some geomorphing-LOD objects, but I haven't figured-out a way for mending the discontinuous UVs yet, so I'll tackle it later. Anyway, when those types of meshes were enumerated in the scenegraph (right after frustum and occlusion culling), they calculate their intLOD and fracLOD, and get grouped by intLOD. Instanced meshes need to have the same NumTriangles and indices, so each LOD group needs its own series of glDrawxx calls.

BionicBytes
05-06-2010, 06:32 AM
Thanks for clearing up what you do - that helps.
I think you've raised more questions now though!


I currently optimize only for nVidia G80 and GTXxxx, so I try to keep the 16kB register-file full, but empty enough for enough warps to fit (depends on the size of per-warp registers, which is often around 20-80 floats). So, 8kB happens to be a good middle-ground (for GTX, 4k for G80) if I do only 1-3 tex-lookups in the frag-shader.

1. what 16KB register? How do you know its 16KB - Have you read this somewhere or is it queryable with OpenGL?
2. per-warp registers? Do you just mean the number of inputs (max uniforms/max attributes) which the h/w supports?
3. I enumerate h/w capabilities from GL context - see snippet below. Are these what you refer to as 4K for G80 (my GFX card here is nVidia 8600GT)

OpenGL 3.0 Detected
EXT_texture_array:
MAX_ARRAY_TEXTURE_LAYERS: 512
ARB_Framebuffer_object:
MAX_COLOR_ATTACHMENTS: 8
MAX_RENDERBUFFER_SIZE: 8192
MAX_SAMPLES: 16
ARB_Texture_Buffer_Object:
MAX_TEXTURE_BUFFER_SIZE: 134217728
ARB_Uniform_Buffer_Object:
MAX_UNIFORM_BLOCK_SIZE: 65536
MAX_VERTEX_UNIFORM_BLOCKS: 12
MAX_GEOMETRY_UNIFORM_BLOCKS: 12
MAX_FRAGMENT_UNIFORM_BLOCKS: 12
MAX_COMBINED_UNIFORM_BLOCKS: 36
MAX_UNIFORM_BUFFER_BINDINGS: 36
MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS: 200704
MAX_COMBINED_GEOMETRY_UNIFORM_COMPONENTS: 198656
MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS: 198656

4. I don't quite understand what the 1-3 tex-lookups have to do with anything. Can you expand on this?

Ilian Dinev
05-06-2010, 07:43 AM
It appears the G80 and GTX2xx store uniforms in the 8k/16k regfile instead of the L1-cached constants-memory. I'm not sure at all, though. GL_MAX_VERTEX_UNIFORM_COMPONENTS=4096 on this GTX275, so 16k, which matches the regfile size. (but still, there are frag-uniforms etc which can make the program use more than 16kB constants+registers, so it makes me doubt the previous logic).
GL_MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS=200704 , which is for UBOs - is probably going through the L1, but could be a few cycles slower. (btw you really should try UBOs with your scene)

In GeForces, warps are something like threads; the more threads you can have at once, the better the latency-hiding is. Texture-fetches are high latency, so having more threads is necessary. But the number of possible threads decreases with the number of registers you use.

Anyway, test and tune :) . Even if max_instances to fit in uniform-arrays were ==2, it's a cpu-saver. Having it be ==128 is already much more than hoped for :).

BionicBytes
05-06-2010, 08:05 AM
many thanks for the update - helpful as ever!

Looks like I've got my work cut out then. First I'll try Uniform arrays - as they're easy to code in. If I get good results then we'll see about UBO (I have to alter engine to support them properly).

According to my enumeration of the Radeon 4850, it only has Max Vertex Uniform Components=1024, so I may have to use #ifdef in the shader to accommodate the h/w being used and supply a suitable max array length in either case.
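
One way to supply the array length per h/w is to generate a #define on the application side and pass it to the shader compiler ahead of the shader body. A minimal sketch in C, assuming the limit has already been queried with glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS, ...); the function name `write_max_instances_define` and the macro name `MAX_INSTANCES` are made up for illustration. Note that it hands the whole budget to the instance array, so a real shader with other uniforms would need to subtract those first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes "#define MAX_INSTANCES <n>\n" into buf, where n is how many
   per-instance records of `floats_per_instance` floats fit into
   `max_vertex_uniform_components`.
   Returns n, or -1 if buf is too small or not even one instance fits. */
int write_max_instances_define(char *buf, size_t buf_size,
                               int max_vertex_uniform_components,
                               int floats_per_instance)
{
    assert(buf != NULL && floats_per_instance > 0);
    int n = max_vertex_uniform_components / floats_per_instance;
    if (n < 1)
        return -1;
    int written = snprintf(buf, buf_size, "#define MAX_INSTANCES %d\n", n);
    if (written < 0 || (size_t)written >= buf_size)
        return -1;
    return n;
}
```

With a 16-float matrix per instance this yields MAX_INSTANCES 64 for a limit of 1024 and 256 for a limit of 4096. Since glShaderSource accepts several strings, the generated line can be passed as its own string, placed after the #version line and before the shader body.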

Ilian Dinev
05-06-2010, 08:08 AM
Edit: since I wasn't sure about that 8192 logic, I rechecked docs and tested the GLSL limits. The 8192 on G80 isn't 8kB, but 8192 32-bit registers (32kB). On GTX2xx, it's 16k registers, 64kB.

GL_MAX_VERTEX_UNIFORM_COMPONENTS = 4096 // 16kB
GL_MAX_FRAGMENT_UNIFORM_COMPONENTS = 2048 // 8kB
So, the driver reserves at least 40kB on the GTX for the thread-data of warps.

Furthermore, I checked whether raising the instance limit to use the whole 16kB costs anything, and there was no performance penalty (raising it further makes the program fail to link).

Thus, maybe we simply need to use up GL_MAX_VERTEX_UNIFORM_COMPONENTS instead of tuning.

4096/ 12 = 341 instances max, if only mat4x3 per instance.
4096/4 = 1024 instances if you use only "vec3 pos; float rotateY;"
4096/3 = 1365 instances if you use only "vec3 pos";
4096/1 = 4096 instances if you use "int StaticID;" in combination with truly-constant UBOs, which can contain nice 800kB constant data with slightly slower access.

:)
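
The divisions above, written out as a small C helper for anyone who wants to plug in their own limit. The function name `max_instances` is made up, and it assumes the per-instance data is packed tightly as in the list; a driver may pad vec3 array elements, so treat the vec3 case as an upper bound.

```c
#include <assert.h>

/* How many instances fit into a uniform budget of `max_components`
   floats when each instance needs `components_per_instance` floats. */
int max_instances(int max_components, int components_per_instance)
{
    assert(components_per_instance > 0);
    return max_components / components_per_instance;
}
```

For a budget of 4096 components this gives 341 for a mat4x3 (12 floats), 1024 for "vec3 pos; float rotateY;" (4 floats), 1365 for "vec3 pos" (3 floats) and 4096 for a single int.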

Alfonse Reinheart
05-06-2010, 03:56 PM
Not sure about this statement. Whether instancing or not, the system still has to draw 4000+ instances - the only difference is whether the CPU is locked into a loop whilst doing so. There is no extra work for the rendering loop to do - 12 million triangles is 12 million triangles whether rendered one at a time in a CPU loop, or as an instanced batch.

Normally, your shader would simply fetch its state data from a uniform. If you're using instancing, you have to read the built-in gl_InstanceID and use that to index either another uniform (either a direct array or a UBO) or access a texture in order to get its state data. Depending on the performance of UBOs or texture accesses, and the number of vertices being rendered, this could be a greater performance loss than just doing it manually.


How does instancing improve the situation if the app is cpu state change limited? Maybe I've missed something here, but I thought the point of instancing was to avoid state changes by rendering the same object over and over again. This does not usually involve any state changes by definition.

If you draw the same thing with the exact same state, nothing will change. You will get the exact same vertices input and output from your shader, and the exact same fragment data written every time. In order for instancing to work, you must have some mechanism in your shader to know what instance you are. At which point, you can then decide where to render the object based on that.

The field of trees is an obvious example. Each tree has a position and orientation; that is its state. In order to render this normally, you will have to perform at least one glUniform call between each glDraw call. If you use instancing, you build a list of state data, put it where the shader can get it, and issue a single instanced draw call (glDrawArraysInstanced or glDrawElementsInstanced).

This can only help if your application's performance is limited by state changes and draw calls. If the app is limited by something else, instancing buys you nothing.

Dark Photon
05-07-2010, 05:52 AM
what 16KB register? How do you know its 16KB?

...currently optimize only for nVidia G80 and GTXxxx, so I try to keep the 16kB register-file full ... GL_MAX_VERTEX_UNIFORM_COMPONENTS = 4096 // 16kB
GL_MAX_FRAGMENT_UNIFORM_COMPONENTS = 2048 // 8kB

To your question, BionicBytes: I think the underlying limit here is the amount of shared memory on a GPU SM (i.e. streaming multiprocessor, in NVidia lingo, which is a cluster of shader cores: 8 on G80/GT200, 32 on GF100/Fermi). This SM shared memory is likely being used to store uniform values for shading threads (makes sense). This memory is "super fast" relative to texture because it's local to the cores.

On G80/GT200, that's 16kB per SM. On the new GF100/Fermi (GTX480), it's 64kB.

Ilian's insight/discovery here is excellent, and when you stop and think about it it clicks and just makes good sense (assuming that uploading uniforms to shared memory is fast)!

And thanks for passing on your thoughts and experience, Ilian. Very interesting stuff.

Xmas
05-07-2010, 09:04 AM
Once you've ensured that you are CPU-bound, you should then ensure that you are rendering enough instances for the draw call overhead gain to offset the loss from using less efficient means.
Except there should never be a need to use less efficient means with instancing.

Alfonse Reinheart
05-07-2010, 05:37 PM
Ilian's insight/discovery here is excellent, and when you stop and think about it it clicks and just makes good sense (assuming that uploading uniforms to shared memory is fast)!

Actually, looking over the original post, I realized something. He never actually got the UBO-based instancing working. Something about not being able to track the uniform offsets. I'd be curious to see what would happen if he got UBOs working.


Except there should never be a need to use less efficient means with instancing.

Instancing requires that, on some level, the shader will fetch information of some kind based on what instance it is. Instanced arrays do this via one or more vertex attributes. Or you can index an array/texture based on gl_InstanceID. Both of these are slower than reading from a single, scalar uniform; even accessing a uniform array requires a bit of indirection, so it is (slightly) slower than just reading a single uniform.

Thus, you have to make sure that you're rendering enough instances so that the losses due to this inefficiency can be made up by gains due to less state change overhead.

Ilian Dinev
05-07-2010, 06:00 PM
No, I stated that I found them slower to __update__ and use, in the _first_ _beta_ drivers they were exposed in. And months ago I had stated that nVidia's compilers at that time were calculating and using offsets in an inefficient way. (I was using the nv-asm generated from cgc back then, the asm was obviously inefficient).

BionicBytes
05-08-2010, 10:19 AM
I'm working on adding all methods to the engine - so I can finally answer the question - which is faster (for rendering n instances of models of ~10,000 tris).

Currently I have added Texture Buffer Objects, Instanced Arrays, Uniform Arrays and, still in progress, Uniform Buffer Objects (they are added - I just need to do it properly).

I'll post a table of the results on my laptop GeForce 8600GT and my Radeon 4850 desktop. Should be interesting reading!

One of the annoyances is that the nVidia (laptop) driver does not support IA, and the Radeon is not running TBO very well (it just locks up when the objects are rendered). Making a direct comparison is never easy!

peterfilm
05-08-2010, 12:14 PM
I found the doubling of matrix mults per vertex was a big cost with instancing. The state fetches didn't seem to be as significant with any of the methods. So my conclusion was to only instance meshes with low vertex counts. Of course, this is with basic shading - obviously with more complicated shaders it becomes less of a factor.

BionicBytes
05-09-2010, 02:19 PM
HELP!
Anyone know how to use these commands...I'm having issues with UBO - it's drawing spikey hedgehog soup at the moment.

glBindBufferBase(GL_UNIFORM_BUFFER,0, UBO);
glUniformBlockBinding(glslProgram, blockindex, 0);

Ideally they should be issued after compiling the shader(s) and issued when creating the Uniform Buffer Object storage. However, my engine is a little more complex... I have multiple shaders and the instance data for the scene objects is read quite independently from shaders - hence there can never be any chance of knowing what the glslProgram_ID is at the time I issue the glGenBuffers command.

Therefore, during the drawmodels function, I issue those two commands just before I enable the shader and upload uniform values. Is this the correct method?

Alfonse Reinheart
05-09-2010, 08:39 PM
Ideally they should be issued after compiling the shader(s) and issued when creating the Uniform Buffer Object storage.

Um, no; that's not how it works at all.

These commands should be used when, and only when, you are intending to render something with the UBO right then. There is no other purpose in issuing these commands.

So your rendering would look like:

1: Bind program.
2: Use glUniform to set uniforms on that program, as needed.
3: Use glBindTexture/glActiveTexture to bind textures, as needed.
4: Use glBindBufferRange/Base and glUniformBlockBinding to bind UBOs, as needed.
5: Bind "VAO" (vertex array state), if needed.
6: Call glDraw*.

To create the storage for a buffer object, use the standard syntax. To upload data to it, again use the same commands you've always used to upload data (glBufferSubData/glMapBufferRange).

Just because you use a buffer object for uniforms doesn't make it special. All buffer objects are interchangeable. Though you should use the GL_UNIFORM_BUFFER target when you bind the object and call glBufferData on it.

BionicBytes
05-10-2010, 01:51 AM
OK, thanks for the clarification.

Just to be more clear: I don't need to call glBindBuffer (GL_UNIFORM_BUFFER, UBO) during rendering (assuming no updates to buffer object) - I just 'bind' the UBO to the active shader via:

glBindBufferRange/Base and
glUniformBlockBinding

This is what I'm actually doing - but I have spikey soup hell at the moment (may be something wrong with the contents of the buffer objects/wrong UBO id) or wrong commands?

One more question - I can't seem to define unsized uniform arrays for the uniform block.
Now that the uniform is backed by a buffer object, I had expected that unsized arrays would be allowed and that they would not take up more 'slots' in the vertex shader.
What seems to happen is the compiler complains and wants the array to be sized and accessed with a const integer, and the size of the array is limited to the MAX_VERTEX_UNIFORM_COMPONENTS size - just like regular uniform arrays. Is this right?

Alfonse Reinheart
05-10-2010, 09:55 AM
One more question - I can't seem to define unsized uniform arrays for the uniform block.

I wasn't aware that you could define unsized uniform arrays for anything, whether in a uniform block or not. It's simply not allowed.


What seems to happen is the compiler complains and wants the array to be sized and accessed with a const integer, and the size of the array is limited to the MAX_VERTEX_UNIFORM_COMPONENTS size - just like regular uniform arrays. Is this right?

All arrays must be of a fixed, compile-time size.

Every element of a uniform block contributes to that block's overall size. The number of components in a uniform block may not exceed MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS. This is usually the same size or larger than MAX_VERTEX_UNIFORM_COMPONENTS. For example, ATI's pre-5xxx line advertises 1024 vertex components, but 4096 combined vertex components.

BionicBytes
05-11-2010, 07:13 AM
Upon reading the GL spec on Uniform Buffers again in more detail, it seems there are two implementation limits we need to be aware of - both of which affect how many instances can be catered for in the solution.

MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS (max floats in all uniform blocks) = MAX_VERTEX_UNIFORM_BLOCKS * (MAX_UNIFORM_BLOCK_SIZE/4) + MAX_VERTEX_UNIFORM_COMPONENTS

So, on the nVidia 8600GT we have:

200704 = 12 * (65536/4) + 4096; where MAX_UNIFORM_BLOCK_SIZE is the max size in bytes of the Buffer Object store.
This equates to 16,384 components per uniform block (or put another way, 16384 floats @ 4 bytes = 65536 bytes)



Therefore a solution looking to utilise UBO for instancing needs to factor in these limits with regard to the number of per-instance attributes to pack into the Buffer Object (64KB limit) and the number of array element 'slots' in the shader uniform block (16384).

It does say in the spec that MAX_VERTEX_UNIFORM_COMPONENTS only applies to the default uniform block, therefore by definition we should be able to use up to a maximum of 16,384 uniform buffer array elements in a UBO block for instancing (nVidia example). For my engine, I use a 16-float modelmatrix per instance - so this allows for 1024 instances.
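
The same sums as a small C sketch, for checking against whatever a given driver reports. The function names are made up for illustration; the inputs are the values returned by glGetIntegerv for the limits named in the comments.

```c
#include <assert.h>

/* Floats that fit in one uniform block of `max_uniform_block_size` bytes
   (GL_MAX_UNIFORM_BLOCK_SIZE), at 4 bytes per float. */
int floats_per_block(int max_uniform_block_size)
{
    return max_uniform_block_size / 4;
}

/* Expected GL_MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS according to the
   formula above: all vertex uniform blocks plus the default block. */
int combined_vertex_components(int max_vertex_uniform_blocks,
                               int max_uniform_block_size,
                               int max_vertex_uniform_components)
{
    return max_vertex_uniform_blocks * floats_per_block(max_uniform_block_size)
         + max_vertex_uniform_components;
}

/* Instances that fit in a single uniform block when each instance
   needs `floats_per_instance` floats (16 for a 4x4 matrix). */
int ubo_max_instances(int max_uniform_block_size, int floats_per_instance)
{
    assert(floats_per_instance > 0);
    return floats_per_block(max_uniform_block_size) / floats_per_instance;
}
```

Plugging in the 8600GT values (12 blocks, 65536 bytes, 4096 components) reproduces the 200704 reported by the driver and the 1024-instance figure for a 16-float matrix.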

I have found the cause of my UBO buffer overrun problem, so this is what I'll be trying tomorrow...and hopefully post some results on nVidia and ATI hardware.

BionicBytes
05-12-2010, 01:39 PM
After much testing and stress, here are some results of implementing instancing with my Deferred Rendering engine.


Unfortunately, the ATI Radeon 4850 OpenGL 4.0/3.3 beta drivers are exhibiting strange behaviour: Texture Buffer Objects hang the application, and with Uniform Buffer Objects the shader compiler generates an 'unknown link error' when the Buffer Object size is at its maximum (GL_MAX_UNIFORM_BLOCK_SIZE). Secondly, GL is unable to 'find' the uniform blocks.



GeForce 8600GT (laptop)

2410 visible instances (from 4096 in total). Each instance ~3000 triangles

Method   FPS   drawtime
TBO      2     300
UBO      2     2400
UA       2     3260
IA       -     -
none     2     6000

371 visible instances (from 400 in total). Each instance ~3000 triangles

Method   FPS   drawtime
TBO      8     55
UBO      8     290
UA       9     400
IA       -     -
none     9     900

Key
====
None = no instancing technique - draw each object one at a time
IA = ARB Instanced Arrays
UA = Uniform Arrays - ie using glUniform*v
UBO = Uniform Buffer Object
TBO = Texture Buffer Object


Summary:
Inconclusive due to lack of numbers from the IA technique and the laptop is not fast enough to use high instance numbers.
Very disappointed not to get the Radeon to work - as this machine has the horsepower to really test instancing.

Alfonse Reinheart
05-12-2010, 03:47 PM
Inconclusive due to lack of numbers from the IA technique and the laptop is not fast enough to use high instance numbers.

Remember: instancing is intended to deal with CPU-based performance issues. If you're not CPU performance bound, then you're not going to get anything from it.