Instanced Drawing

Hi,

Since I have a serious performance problem when rendering many objects, I want to use instancing. (The “bad” functions are glUseProgram/glUniform*/glDrawElements. With instancing I could draw many more objects.)

I’m using the function glDrawElementsInstanced. My question now is how to get my object matrices into the vertex shader. I tried it with arrays, but even my GeForce GTX 460 can only take a mat4[255]. I can only imagine how low this limit would be on older cards.

Somewhere I read that one could use textures to pass the matrices. Is that really a good idea? (It seems more like a bad hack to me, but if this is the only possibility…)
So if there is no alternative, how can I access the parts of a sampler2D, so that I can read the matrices?

(I’m creating the texture like this:)


    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, object_status_texture_id);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, 512, 512, 0, GL_RGBA, GL_FLOAT, &mats[0]);
    // Sampler uniforms take the texture unit index (0 for GL_TEXTURE0), not the texture object id.
    glUniform1i(shader_object_status_, 0);
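
One possible way to address the matrices, assuming a layout the post does not specify: each mat4 takes four consecutive RGBA32F texels (one per column), packed row by row across the 512-texel-wide texture. The C helper below (`matrix_texel` and its constants are hypothetical names) computes the integer texel coordinates. The same arithmetic could run in the vertex shader with `gl_InstanceID` as the instance index and `texelFetch` on the `sampler2D`, which takes integer coordinates and does no filtering.

```c
/* Assumed layout (not from the original post): each mat4 occupies four
   consecutive RGBA32F texels, one texel per column, packed row by row
   across a TEX_WIDTH-wide texture. */
enum { TEX_WIDTH = 512, TEXELS_PER_MATRIX = 4 };

typedef struct {
    int x;
    int y;
} TexelCoord;

/* Texel that holds column `column` (0..3) of the matrix for `instance`. */
static TexelCoord matrix_texel(int instance, int column)
{
    int linear = instance * TEXELS_PER_MATRIX + column;
    TexelCoord tc;
    tc.x = linear % TEX_WIDTH;
    tc.y = linear / TEX_WIDTH;
    return tc;
}
```

With this layout a 512x512 texture holds 512 * 512 / 4 = 65536 matrices, and a matrix never straddles two texture rows because 512 is a multiple of 4.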

Would a uniform buffer help?
http://www.opengl.org/wiki/Uniform_Buffer_Object
What are the limits for your hardware?

Why:
http://rastergrid.com/blog/2010/01/uniform-buffers-vs-texture-buffers/

How:
http://www.jotschi.de/?p=427

UBO limits are generally 64 kB, or 16 kB IIRC, just like with regular uniforms.
UBOs can be slower to read than registers (but can be made just as fast through driver optimisations). TBOs should be slower to read than UBOs.

My benchmarks concluded: use registers and draw at most ~128 instances at once (compute the number by dividing the max register space by (2 * sizeof(perInstanceData))).
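
As a rough worked example of that rule of thumb: assuming the driver reports 4096 float components for `GL_MAX_VERTEX_UNIFORM_COMPONENTS` (an illustrative value, query your own hardware), that is 16384 bytes of register space, and a full mat4 per instance is 64 bytes. The function name below is hypothetical.

```c
#include <stddef.h>

/* Instances per draw call: register space divided by twice the
   per-instance data size, as suggested in the post above. */
static size_t max_instances_per_batch(size_t register_bytes,
                                      size_t per_instance_bytes)
{
    if (per_instance_bytes == 0) {
        return 0; /* avoid dividing by zero on bad input */
    }
    return register_bytes / (2 * per_instance_bytes);
}
```

With the assumed numbers, `max_instances_per_batch(4096 * sizeof(float), 16 * sizeof(float))` gives 128, which matches the figure quoted above.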
If your per-instance matrix doesn’t contain a perspective projection (i.e. it is MV rather than MVP), make those matrices 4x3 instead of 4x4 and restore the 0,0,0,1 data in the shader.
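
A minimal CPU-side sketch of that packing, assuming OpenGL's usual column-major mat4 storage and assuming the 12 floats are stored as three 4-float rows (one register each); the post does not say which layout was used, and both function names are hypothetical. The unpack function mirrors what the shader would do when it restores the constant row.

```c
/* Pack a column-major 4x4 affine matrix (bottom row 0,0,0,1) into
   three 4-float rows, dropping the constant bottom row. */
static void pack_affine_rows(const float m[16], float out[12])
{
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 4; ++c) {
            out[r * 4 + c] = m[c * 4 + r];
        }
    }
}

/* Rebuild the column-major 4x4 matrix, restoring the 0,0,0,1 row. */
static void unpack_affine_rows(const float in[12], float m[16])
{
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 3; ++r) {
            m[c * 4 + r] = in[r * 4 + c];
        }
        m[c * 4 + 3] = (c == 3) ? 1.0f : 0.0f;
    }
}
```

This cuts the per-instance data from 64 to 48 bytes, so more instances fit into the same register space.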

Thanks for the fast answers! I’ll probably need a while to read those articles.

Just for more information: I’m currently rendering simple cubes. One cube is described by 24 vertices (f32) and 36 indices (u16). (I’m using GL_TRIANGLES for the cubes; both vertices and indices are stored in an OpenGL buffer.) The color is hard-coded in the fragment shader. I’m passing one mat4 for each cube. Each mat4 describes the rotation and position of a cube. (The vertex shader multiplies this matrix with the camera matrix and the vertex.) I’m passing 255 matrices for each glDrawElementsInstanced() call, since mat4[255] is the maximum my graphics card can take. (I’ll probably need to lower this, since my personal hardware is more or less high-end.)
My system: Q6600 @ 2.4 GHz, GeForce GTX 460 1024 MB, 4 GB DDR2-800, Windows 7 32-bit.
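
The batching described above can be sketched as follows. This is only the bookkeeping part, with hypothetical helper names; in a real renderer each batch would upload its matrices (for example with `glUniformMatrix4fv`) and then issue one `glDrawElementsInstanced` call with the batch size as the instance count.

```c
/* Number of draw calls needed for `total` instances at `per_batch` each. */
static int batch_count(int total, int per_batch)
{
    return (total + per_batch - 1) / per_batch;
}

/* Instances in batch `index`; the last batch may be partial. */
static int batch_size(int total, int per_batch, int index)
{
    int remaining = total - index * per_batch;
    if (remaining <= 0) {
        return 0;
    }
    return remaining < per_batch ? remaining : per_batch;
}
```

For example, 5000 cubes at 255 per call need 20 draw calls, the last one covering the remaining 155 cubes.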

My framework can currently render 5000 cubes at 150 FPS with 50% CPU usage and ~40% GPU usage (Windows Task Manager, GPU-Z).

Is this an acceptable result for my hardware, or should I try to optimize this even more? (Besides GL_CULL_FACE I haven’t implemented any culling yet.)
Sorry if this seems like a stupid question, but these are my first steps with OpenGL and I’m very unsure. I don’t want to have to rework my whole framework later on.

Edit: When I do this with a mat4[64] instead of mat4[255], the CPU/GPU usage remains the same, while the FPS drops to 45. So I probably need more optimization here. I will try to use a texture.

This article has a few approaches to instancing that may be helpful.

The article Dan mentioned is really great. I got a huge speedup using the last technique shown.
