Instancing sucks?

I’m getting round to adding instancing support to my engine, primarily because we read about it so much in the DX world and it seems like a good idea in principle. However, I’m starting to have my doubts about real-world scenarios (performance-related issues).

I’ll try and explain my thinking below…

I’m adding instancing support to render relatively simple OBJ type models (think trees on a terrain). Ignoring the problem of LODding the models with camera distance, I want a simple technique to plumb into the engine, and as I see it there are three techniques to choose from:

  1. Uniform Buffer Objects
  2. Texture Buffer Objects
  3. ARB_instanced_arrays

The per-instance data I’m trying to plumb into the engine is a modelview matrix per instance, and in principle all three techniques could be used to provide it.
So which technique to use?

1. Uniform Buffer Objects
This is actually harder to plumb into the engine than I first thought. I’ve modified my underlying shader library to support uniform blocks, but I have to track which shader is using which UBO, because if a shader is recompiled I have to re-issue glUniformBlockBinding to set that shader’s block binding points.
Additionally, the memory layouts are a pain and the application needs to track the offsets of packed uniforms within the block. Finally, there is a limit on its size anyway (GL_MAX_UNIFORM_BLOCK_SIZE) – which may or may not be an issue.
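For reference, the plumbing I mean boils down to something like this – a rough sketch rather than my actual engine code, where prog, ubo, the block name "InstanceData" and binding point 0 are made-up names for illustration:

/* Every time a shader is (re)linked its block index can change, so the
   block-to-binding-point mapping has to be re-established.             */
GLuint blockIndex = glGetUniformBlockIndex(prog, "InstanceData");
if (blockIndex != GL_INVALID_INDEX) {
    glUniformBlockBinding(prog, blockIndex, 0);   /* engine-chosen binding point 0 */

    /* How big the backing buffer must be for this program's layout. */
    GLint blockSize = 0;
    glGetActiveUniformBlockiv(prog, blockIndex, GL_UNIFORM_BLOCK_DATA_SIZE, &blockSize);
}

/* Attach the UBO itself to the same binding point. */
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);

/* With layout(std140) the member offsets are fixed by the spec; otherwise the
   application has to query GL_UNIFORM_OFFSET for every packed member.        */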

I’m having difficulty coding up a suitable generic solution for Uniform Buffer Objects, so I’ll have to defer on this for now.

2. Texture Buffer Objects
These are a dream to work with; they are accessed just like a texture and are simpler to create than vertex buffer objects. Dead easy to plumb into the abstract library my engine is built upon.
Two TBOs are created: one holds the entire set of modelview matrices for all 400 instances; the other holds an index list of which [modelview] instances to render this frame.
The index TBO is updated each frame with the indices [into the modelview TBO] of the models which have been determined to be visible.
During rendering, glDrawElementsInstanced is called and the vertex shader performs a texelFetch on the renderlistbuffer sampler to fetch the model index. Using this model index, four more texelFetches are performed to read the complete modelview matrix.

Here’s a snippet from the vertex shader:

//uniform mat4 modelmatrix; // replaced with texture buffer objects - instanced rendering
uniform samplerBuffer  modelmatrixbuffer;  // RGBA32F
uniform usamplerBuffer renderlistbuffer;   // R32UI

mat4 modelmatrix;
// get the real batch instance from the render list (supplied as an integer texture buffer)
int offset = 4 * int(texelFetchBuffer(renderlistbuffer, gl_InstanceID).r);
//offset = int(gl_InstanceID * 4);  // matrices are indexed as blocks of 4 RGBA texels
modelmatrix[0] = texelFetchBuffer(modelmatrixbuffer, offset);
modelmatrix[1] = texelFetchBuffer(modelmatrixbuffer, offset + 1);
modelmatrix[2] = texelFetchBuffer(modelmatrixbuffer, offset + 2);
modelmatrix[3] = texelFetchBuffer(modelmatrixbuffer, offset + 3);
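
For completeness, the application-side setup looks roughly like this – a sketch rather than my actual engine code, with matrices, visibleIndices, visibleCount, indexCount and program standing in for my real variables, and the mesh’s VAO/element buffer assumed already bound:

/* TBO #1: modelview matrices for all 400 instances, one RGBA32F texel per matrix column. */
GLuint matrixTBO, matrixTex;
glGenBuffers(1, &matrixTBO);
glBindBuffer(GL_TEXTURE_BUFFER, matrixTBO);
glBufferData(GL_TEXTURE_BUFFER, 400 * 16 * sizeof(GLfloat), matrices, GL_STATIC_DRAW);
glGenTextures(1, &matrixTex);
glBindTexture(GL_TEXTURE_BUFFER, matrixTex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, matrixTBO);

/* TBO #2: per-frame render list, one R32UI texel per visible instance. */
GLuint indexTBO, indexTex;
glGenBuffers(1, &indexTBO);
glBindBuffer(GL_TEXTURE_BUFFER, indexTBO);
glBufferData(GL_TEXTURE_BUFFER, visibleCount * sizeof(GLuint), visibleIndices, GL_STREAM_DRAW);
glGenTextures(1, &indexTex);
glBindTexture(GL_TEXTURE_BUFFER, indexTex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_R32UI, indexTBO);

/* Bind both buffer textures, point the samplers at them, then draw all visible instances. */
glUseProgram(program);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_BUFFER, matrixTex);
glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_BUFFER, indexTex);
glUniform1i(glGetUniformLocation(program, "modelmatrixbuffer"), 0);
glUniform1i(glGetUniformLocation(program, "renderlistbuffer"), 1);

glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, visibleCount);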

3. Instanced_Arrays
Part of OpenGL 3.3, but the functionality has been around on ATI drivers as the ARB_instanced_arrays extension for a while. This method lets us upload additional attribute streams to OpenGL which are advanced only once per instance instead of once per vertex.
My assumption is that this technique is more efficient than the other two – they have to look up 5 texels/uniforms per vertex, which is hurting performance. With this technique we are sending more attribute data per instance (4 × RGBA floats for the whole matrix), but there is less work per vertex – so this should result in faster rendering.

//uniform mat4 modelmatrix; // replaced with 4 per-instance attribute streams - instanced rendering
attribute vec4 modelview1;
attribute vec4 modelview2;
attribute vec4 modelview3;
attribute vec4 modelview4;

mat4 modelmatrix;
modelmatrix[0] = modelview1;
modelmatrix[1] = modelview2;
modelmatrix[2] = modelview3;
modelmatrix[3] = modelview4;
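
The application side of this is just the attribute setup plus the divisor – roughly the following sketch, assuming an instanceVBO holding one mat4 per instance, the four modelview attributes bound to locations 4–7 with glBindAttribLocation before linking, and the mesh’s element buffer already bound:

/* One mat4 per instance, stored as four consecutive vec4 columns. */
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
glBufferData(GL_ARRAY_BUFFER, instanceCount * 16 * sizeof(GLfloat), matrices, GL_STREAM_DRAW);

for (int i = 0; i < 4; ++i) {
    glEnableVertexAttribArray(4 + i);
    glVertexAttribPointer(4 + i, 4, GL_FLOAT, GL_FALSE,
                          16 * sizeof(GLfloat),
                          (const GLvoid *)(i * 4 * sizeof(GLfloat)));
    /* Advance this attribute once per instance instead of once per vertex. */
    glVertexAttribDivisor(4 + i, 1);   /* glVertexAttribDivisorARB with the ARB extension */
}

glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, instanceCount);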

Results
Compared to drawing 400 instances individually:
The TBO technique is slower – roughly 33% slower (ATI Radeon 4850, quad-core 2.6GHz processor, OpenGL 3.3/4.0 beta drivers, and also on my nVidia GT8600m laptop).
Instanced_Arrays is significantly slower – roughly 75% slower (I can only test on ATI, since the nVidia mobile drivers don’t yet support the ARB extension or GL 3.3).

I can’t put this down to beta drivers, since the ARB extension has been around on ATI drivers for some time. The only performance caveat I can make is that I don’t cull anything before drawing with technique #3; in other words I just draw all 400 models whether they are in camera view or not. I intend to perform more tests whereby I cull away non-visible models and then upload to the VBO only the model matrices of the visible objects. The downside to this is the extra time spent copying memory.
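
That cull-then-upload test would look something like this each frame – a sketch only, with isVisible(), instances[] and visibleMatrices[] as placeholder names; whether the extra copying pays for itself is exactly what I want to measure:

/* Gather the modelview matrices of only those instances that survive culling. */
GLsizei visible = 0;
for (GLsizei i = 0; i < instanceCount; ++i) {
    if (!isVisible(&instances[i], &frustum))   /* placeholder frustum test */
        continue;
    for (int c = 0; c < 16; ++c)
        visibleMatrices[visible * 16 + c] = instances[i].modelview[c];
    ++visible;
}

/* Orphan the old instance buffer, then upload just the visible matrices. */
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
glBufferData(GL_ARRAY_BUFFER, instanceCount * 16 * sizeof(GLfloat), NULL, GL_STREAM_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, visible * 16 * sizeof(GLfloat), visibleMatrices);

glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, visible);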

Conclusion
Instanced rendering is not worth the effort and, in the real world, provides no benefit.
I guess for specific cases where many thousands of objects are drawn (e.g. asteroids in a space simulator), there may be some benefit.

Anyone else had similar experiences they wish to share?

Instanced_arrays are the most convenient and appropriate here.
If they are not fast enough today, they surely will be when the driver matures.

Compared to drawing 400 instances individually:

That’s your problem right there. Instancing is for when you want to render thousands of something, not merely hundreds.

Stick to the GL 1.1 goodies mate :slight_smile:

I still use traditional glBegin/End whenever I render a dynamic scene, and for static geometry I just use display lists. This is how GL was originally designed to be used.

I posted long ago about instancing in GL, before it ever existed, and I got replies telling me that, unlike Direct3D, GL does not need instancing since there isn’t much drawing overhead – the only reason instancing existed was that D3D imposes a lot of draw-call overhead. Lies?

Now GL seems to adopt every [censored] thing that comes out of the D3D world, while the latter has been approaching GL.

And we still blame the drivers. Geeeze! :smiley:

Nothing to be proud of. Today we have different hardware.

This is where I have a problem with all this.

In the real world we can’t just draw 1000 tree models on the terrain and be done with it - the GPU just can’t cope with all those vertices and the complex pixel shader calculations/lighting.
What we need is LOD calculations - but this then breaks the whole instancing thing. For example, those 1000 tree objects would have to be broken down into multiple batches of, say, 100 at LOD=1, 300 at LOD=2, 200 at LOD=3 and 400 at LOD=4, switching between different material shaders for each batch, and thus losing the benefits of instancing in the first place.

Although we are blessed with such lovely h/w these days, it still isn’t fast enough to ‘just draw’ the scene as originally intended without having to set up complicated code paths to get around performance-related issues.

It’s all very frustrating!

I guess I’ll have to go back to drawing ‘grass’ objects. At least that’s something I can use this research for, so I haven’t wasted my time entirely.

This is where I have a problem with all this.

In the real world we can’t just draw 1000 tree models on the terrain and be done with it - the GPU just can’t cope with all those vertices and the complex pixel shader calculations/lighting.
What we need is LOD calculations - but this then breaks the whole instancing thing. For example, those 1000 tree objects would have to be broken down into multiple batches of, say, 100 at LOD=1, 300 at LOD=2, 200 at LOD=3 and 400 at LOD=4, switching between different material shaders for each batch, and thus losing the benefits of instancing in the first place.

But every tree in a specific LOD could be drawn using instancing.

…yes, every tree at a certain LOD can be drawn at once with a single command - but that’s the point. The batch size has now decreased from a single batch of 1000 to 4 batches of 100, 200, 300 or 400 (in my made-up example). As other posts, and Alfonse, have hinted, instancing seems to require at least 1000 objects before you see any performance gains. So by breaking up the huge batch of tree objects into smaller batches by LOD, we lose all the potential performance gains!
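
To make the trade-off concrete, the per-LOD path ends up as one instance upload and one instanced draw per LOD bucket, something like this sketch (NUM_LODS, lodBucket[], lodMesh[] and lodMaterial[] are hypothetical names):

/* One instanced draw per LOD bucket instead of one big 1000-instance batch. */
for (int lod = 0; lod < NUM_LODS; ++lod) {
    if (lodBucket[lod].count == 0)
        continue;

    glUseProgram(lodMaterial[lod]);   /* possibly a different material shader per LOD */

    /* Upload this bucket's matrices into the per-instance attribute buffer. */
    glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
    glBufferData(GL_ARRAY_BUFFER, lodBucket[lod].count * 16 * sizeof(GLfloat),
                 lodBucket[lod].matrices, GL_STREAM_DRAW);

    glBindVertexArray(lodMesh[lod].vao);
    glDrawElementsInstanced(GL_TRIANGLES, lodMesh[lod].indexCount,
                            GL_UNSIGNED_INT, 0, lodBucket[lod].count);
}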

This is where I have a problem with all this.
Yeah, I thought his statement was a bit ridiculous as well. Don’t worry about it.

In the real world we can’t just draw 1000 tree models on the terrain and be done with it - the GPU just can’t cope with all those vertices and the complex pixel shader calculations/lighting.

Yes, or restated, you could, but there’s a point past which it’s a net waste of frame time to do so.

And the tipping point depends on the vertex complexity of your instance (and the number of pixels it covers, if frag shading is complex), your GPU, and your CPU.

Alfonse’s hand-wave is not helpful, in fact quite destructive IMO.

What we need is LOD calculations.

Absolutely! And this is why jamming “boatloads of instances per batch” is a dumb idea. It culls like crap, and just dogs down the GPU, wasting frame time.

It’s great if these instances are all you’re rendering in a toy app, or perhaps you just want “interactive” performance. But when you’re rendering a “real” wide-area scene (tons of stuff besides those instances) and insisting on hard “real-time” performance (e.g. 60Hz), it’s a bad idea for frame time consumption to just blindly pump huge instance groups at the GPU.

This is one reason why minimizing the existing CPU-side waste in submitting draw calls (a la CBOs, bindless graphics, etc.) is so important. It lets you render more things, the way you want to, without having to cram them into the instancing model even when it doesn’t make sense. And it lets you do finer-granularity CPU-side LOD, culling, etc., and, net, render a more visually varied and interesting scene with a lot less effort, merely by reclaiming the time your process currently wastes blocked on main memory accesses.

Although we are blessed with such lovely h/w these days, it still isn’t fast enough to ‘just draw’ the scene as originally intended without having to set up complicated code paths to get around performance-related issues.

Yeah, well that’s partly what makes interactive graphics interesting, and why we have jobs, right? :wink: I mean the whole Z-buffer “smash the entire scene onto the film” thing, and most of what we do in the process (including culling and LOD), is “to get around performance-related issues.” Otherwise, we’d all just do stochastic ray tracing of all our scenes and be done with it :slight_smile:

Nothing to be proud of. Today we have different hardware.

True, we have different hardware, but OpenGL is not meant to be an interface to 3D acceleration hardware; it’s a general-purpose 3D rendering library that can be accelerated by hardware vendors.

Even with the new hardware, I cannot see a problem with implementing glBegin/End for dynamic rendering on top of video-memory vertex buffers.

The implementation should take care of all acceleration goodies and make it transparent to the client through high level GL calls. :wink:

Anyway…I went off topic

Regarding the instancing feature of this “new hardware”: it should give a performance difference even without huge numbers of objects. It makes no sense to me that you should need tons of objects to see any gain, and that otherwise it performs slower… This is total nonsense.

The only interpretation of this is that either the hardware is broken or the drivers are faking the feature.

Even with the new hardware, I cannot see a problem with implementing glBegin/End for dynamic rendering on top of video-memory vertex buffers.

I can see the problem. Thousands of function calls fill the driver queue and pollute the CPU cache. The CPU becomes the bottleneck and the GPU sits waiting for the CPU, doing nothing…
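
To put numbers on it: immediate mode costs at least one GL call per vertex, whereas the buffered path submits the same geometry with a single draw call – a sketch, assuming verts/vbo are filled elsewhere and the attribute pointers into the VBO are already set up:

/* Immediate mode: one function call per vertex, thousands of calls per mesh. */
glBegin(GL_TRIANGLES);
for (int i = 0; i < vertexCount; ++i)
    glVertex3fv(&verts[3 * i]);
glEnd();

/* Buffered path: the same geometry submitted with one call
   (vertex attribute pointers into the VBO set up beforehand). */
glDrawArrays(GL_TRIANGLES, 0, vertexCount);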

But we are doing that anyway when filling the hardware vertex buffers… the CPU stalls either way.

Go ahead with your immediate mode rendering, but I can assure you that you won’t convince anyone here with this nonsense.

Go ahead with your immediate mode rendering, but I can assure you that you won’t convince anyone here with this nonsense.

The nonsense is indeed having a hard time making VBOs work and figuring out whether it’s a driver bug or programmer misuse.

I’m not against the VBO stuff, but it could be exposed more appropriately if we had something like:

glEnable(GL_VB);
glVBLayout(…);                     // describe the vertex layout

// define and fill vertexData in system memory

glBegin(GL_TRIANGLES);
glVB(vertexData);
glEnd();

:wink:

What you are suggesting here is absolute nonsense yet again.

Why are you posting in “OpenGL coding: advanced” and even making suggestions on development strategies when you aren’t even able to use an API interface as simple as VBOs?

The nonsense is indeed having a hard time making VBOs work and figuring out whether it’s a driver bug or programmer misuse.

What part of VBO is hard for you?

Hmmmmmmmmmm 40% performance drop?

Crashes on some hardware?

:wink:

Can’t you see that it is most likely you who is doing it wrong, considering that literally everyone uses VBOs nowadays?

<serious question>

What’s wrong with this?


#define BUFFER_OFFSET(i) ((char *)NULL + (i))

glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(positions), positions, GL_STATIC_DRAW);
	
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(1));

What will happen when you draw with this state?

Nothing is going to be drawn, because you haven’t defined any vertex data, only attributes.

Why are you asking this (here)?