What is best practice for batch drawing objects with different transformations?

I’m trying to conceptualise a good approach to rendering as many disjoint pieces of geometry as possible with a single draw call in OpenGL, and the wall I’m up against is the best way to do so when each piece has a different translation and maybe rotation, since you don’t have the luxury of updating the model-view uniform between single-object draws. I’ve read a few other questions here and elsewhere, and the directions people are pointed in seem quite varied. It would be nice to list the main methods of doing this and attempt to isolate what is most common or recommended. Here are the ideas I’ve considered:

  1. Instancing: a new attribute is sent and updated per object, rather than per vertex. I could then pass varied transformation data efficiently, within one draw call. The drawback of this technique is that my code would be less portable, supporting desktop GL only, since most mobile platforms do not yet seem to support this feature in OpenGL ES 2.0.

  2. Creating matrix transformations in the shader: here I’d send a translation vector, or maybe a rotation angle or quaternion, as part of the attributes. The advantage is that it would work cross-platform, including mobile. But it seems a bit wasteful to send the exact same transformation data for every single vertex of an object as an attribute. Without instancing, I’d have to repeat these identical vectors or scalars many, many times in a VBO as part of the interleaved array, right? The other drawback is that I’m relying on the shader to do the math; I don’t know whether this is wise or not.

  3. Similar to 2), but instead of relying on the shader to do the matrix calculations, I do them on the client side and still send the final model-view matrix as a stream of 16 floats in the VBO. As far as I can tell, without instancing I’d have to repeat this identical stream for every single vertex in the VBO, right? It just seems wasteful. The trade-off with 2) above is that I’d be sending more data in the VBO per vertex (16 floats rather than a 3-float translation vector and maybe a 4-float quaternion) while requiring the shader to do less work.

  4. Skip all the above limitations and instead compromise with a separate draw call for each object. This is what is typically “taught” in the books I’m reading, no doubt for simplicity’s sake.

Are there other common methods besides these?

As an academic question, I’m curious whether all of the above are feasible and “acceptable”, or if one of them is a clear winner over the others. If I were to use desktop GL exclusively, is instancing the primary way of achieving this?

Instancing: a new attribute is sent and updated per object, rather than per vertex. I could then pass varied transformation data efficiently, within one draw call. The drawback of this technique is that my code would be less portable, supporting desktop GL only, since most mobile platforms do not yet seem to support this feature in OpenGL ES 2.0.

Um, no. The biggest drawback of instancing is that it only supports drawing the same mesh. Instancing loops through the same per-vertex data repeatedly, each time with a different gl_InstanceID value in the vertex shader and/or a different set of per-object attributes.

Whether instancing is or is not supported is irrelevant if it simply can’t do what you need. If you’re drawing different objects, instancing just isn’t going to help you.
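For what it’s worth, here is a minimal sketch of what instancing does buy you when the mesh is the same (desktop GL 3.3+; the buffer names and attribute index are made up for illustration):

// One mesh, many transforms: with glVertexAttribDivisor, a per-instance
// attribute advances once per instance instead of once per vertex.
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);    // holds one vec4 offset per object
glEnableVertexAttribArray(3);
glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE, 4 * sizeof(float), (void*)0);
glVertexAttribDivisor(3, 1);                   // attribute 3 steps per instance

// Draws the same mesh numObjects times in one call; the vertex shader reads
// the per-instance offset from attribute 3 (or keys off gl_InstanceID).
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0, numObjects);

But it is always the same vertex data being looped over.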

Here I’d send a translation vector or maybe a rotation angle or quaternion as part of the attributes.

I seriously doubt that this could be faster than multiple draw calls in virtually any situation. The two main problems are the added vertex shader input data, and the fact that you’re now streaming vertex data for what might otherwise have been static models.

The first problem persists even if you use shorts for the quaternion + translation; that’s still no less than 16 bytes per vertex. The absolute best you could hope for is to pass an index (perhaps as a byte, though even then it’s a good idea to align attributes to 4 bytes, so that’s still an extra 4 bytes per vertex), which you use to look something up in a buffer texture or uniform buffer.

The second problem causes a number of issues. If you’ve got half-static and half-streamed data, then you’re going to have to split your vertex data (one buffer object for static data, one for streamed). This is almost certainly going to be less performance-friendly just in terms of upload time. On top of that, you’re going to need to do buffer object streaming of some form. This is certainly doable, but non-trivial.

If you use an index rather than the actual data, you might have a functional solution (especially if you can hide that index in some other attribute; for example, if you only use the RGB of the color, you can hide the index in the alpha). This would in effect be matrix palette skinning, just with only one index per vertex and no blending between matrices. It can be a workable solution, but generally it’s for objects that are hierarchically linked already, not an arbitrary cloud of stuff.
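To illustrate, that lookup might look something like this in the vertex shader (a sketch only; the names and the buffer texture layout are assumptions, with one column-major mat4 stored as four RGBA32F texels per object):

const char* vs = R"(
#version 330
layout(location = 0) in vec3 position;
layout(location = 1) in vec4 color;    // object index hidden in the alpha
uniform samplerBuffer palette;         // four RGBA32F texels per object matrix
uniform mat4 viewProj;
void main()
{
    // recover the object index from a normalized ubyte alpha, then fetch
    // the four columns of that object's model matrix
    int base = int(color.a * 255.0) * 4;
    mat4 model = mat4(texelFetch(palette, base + 0),
                      texelFetch(palette, base + 1),
                      texelFetch(palette, base + 2),
                      texelFetch(palette, base + 3));
    gl_Position = viewProj * model * vec4(position, 1.0);
}
)";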

But outside of that kind of situation, this will generally perform poorly, and not because of the vertex shader. So your “matrix per vertex” solution is a non-starter.

Generally speaking, if you have multiple objects, each using independent transforms, you use multiple draw calls. That’s what they’re there for. The old NVIDIA “Batch Batch Batch” presentation cited between 10,000 and 40,000 draw calls per frame (in D3D; more in GL) for a 1GHz CPU. Nowadays, you’re looking at rather more than that. So unless you’re dealing with tens of thousands of individual objects, all of them different (so no instancing), odds are good that you’ll be fine.

On desktop GL, of course.
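In other words, the boring version is usually fine. A sketch (the object array, uniform location, and member names here are hypothetical):

// One draw call per object, updating the model-view uniform in between.
// modelViewLoc was fetched once with glGetUniformLocation.
for (size_t i = 0; i < numObjects; ++i)
{
    glUniformMatrix4fv(modelViewLoc, 1, GL_FALSE, objects[i].modelView);
    glBindVertexArray(objects[i].vao);
    glDrawElements(GL_TRIANGLES, objects[i].indexCount, GL_UNSIGNED_SHORT, 0);
}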

Very useful, thanks. This gives me some confidence to worry less about doing multiple draws until I actually see a serious bottleneck in effect. It certainly simplifies things for now. Appreciate it.

between 10,000 and 40,000 draw calls per frame

I have found that if each batch has only a small number of triangles, like a polyline or a simple cube structure, you cannot get anything like 10,000 draw calls per frame with an acceptable frame rate (say 20 fps). Over about 2,500 calls, the frame rate rapidly approaches 1 fps.

I have not profiled it to the extent of finding exactly what I am CPU-bound on, but the loop was not changing states; it was changing buffers with each call.

My solution was easy because my data is relatively static, so I pre-multiplied the instanced objects by their translation/rotation matrices and stored the resulting vertices in a large buffer to minimise draw calls, and quite happily got back to 20+ fps. Of course the trade-off is more data space for the vertices, but the matrices don’t come free, and the individual objects were typically fewer than 20 vertices.
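A minimal sketch of that baking step (the struct and function here are mine for illustration, not the actual code):

#include <vector>

// Pre-multiply each object's vertices by its model matrix on the CPU,
// appending the world-space results to one big array that backs a single VBO.
struct Object { std::vector<float> verts; float model[16]; };  // xyz triples + column-major matrix

void bakeObject(const Object& obj, std::vector<float>& batched)
{
    const float* m = obj.model;
    for (size_t i = 0; i + 2 < obj.verts.size(); i += 3)
    {
        float x = obj.verts[i], y = obj.verts[i + 1], z = obj.verts[i + 2];
        // column-major mat4 * vec4(x, y, z, 1)
        batched.push_back(m[0] * x + m[4] * y + m[8]  * z + m[12]);
        batched.push_back(m[1] * x + m[5] * y + m[9]  * z + m[13]);
        batched.push_back(m[2] * x + m[6] * y + m[10] * z + m[14]);
    }
}
// Upload 'batched' once with glBufferData and draw everything in one call.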

I have found that if each batch has only a small number of triangles, like a polyline or a simple cube structure, you cannot get anything like 10,000 draw calls per frame with an acceptable frame rate (say 20 fps).

Are you saying that the performance per batch decreases if the batch size is small? On what hardware did you see this?

[QUOTE=tonyo_au;1250117]
My solution was easy because my data is relatively static, so I pre-multiplied the instanced objects by their translation/rotation matrices and stored the resulting vertices in a large buffer to minimise draw calls, and quite happily got back to 20+ fps. Of course the trade-off is more data space for the vertices, but the matrices don’t come free, and the individual objects were typically fewer than 20 vertices.[/QUOTE]

Now there’s an idea I hadn’t thought of. Take the model-view matrix calculations out of the shader entirely and just pass the vertices after multiplication. This allows a single draw call for many objects in different orientations and translations. The cost just comes in all the CPU calculations, but I suppose if that bottleneck is smaller than the bottleneck of multiple draw calls, it would be worth it, as you noted.

I wonder how often others end up doing this to achieve a decent frame rate.

Are you saying that the performance per batch decreases if the batch size is small? On what hardware did you see this?

I don’t think it was directly related to the batch size; I think it is more related to the number of buffers I had - I had 7000+ (not a good idea :whistle:) but with small batch sizes I think the GPU was basically idle, as it had very little work to do with each render call.

I run on an ATI 5870, an nVidia Quadro 5000 and a GTX 580 - the frame rate is different on each, but the percentage change is similar.

I wonder how often others end up doing this to achieve a decent frame rate.

If you look at the games industry, they do as much pre-processing as possible - that is why they get such impressive frame rates.

My biggest problem now is when a single object is moved or deleted. My current solution is repacking the vertex buffer, but it is proving quite slow, and I am looking at just modifying the object vertices so that they are co-located.
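I.e. something like this (a sketch with hypothetical bookkeeping names): collapsing a deleted object’s vertices onto one point makes its triangles degenerate, so they rasterise to nothing and the real repack can be deferred.

#include <vector>

// 'Delete' an object in place by collapsing all of its vertices onto one point.
// offsetBytes and vertexCount locate the object's slice of the shared VBO.
void collapseObject(GLuint bigVBO, GLintptr offsetBytes, size_t vertexCount,
                    const float point[3])
{
    std::vector<float> collapsed(vertexCount * 3);
    for (size_t i = 0; i < vertexCount; ++i)
        for (int c = 0; c < 3; ++c)
            collapsed[i * 3 + c] = point[c];

    glBindBuffer(GL_ARRAY_BUFFER, bigVBO);
    glBufferSubData(GL_ARRAY_BUFFER, offsetBytes,
                    collapsed.size() * sizeof(float), collapsed.data());
}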

I don’t think it was directly related to the batch size; I think it is more related to the number of buffers I had - I had 7000+ (not a good idea) but with small batch sizes I think the GPU was basically idle, as it had very little work to do with each render call.

Wait. The number of buffers and the number of batches aren’t the same thing. Did you try putting all that in one buffer and just rendering parts of it, without changing the vertex format (i.e. no glVertexAttribPointer calls)?
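That is, something along these lines (a sketch; the shared VAO and per-object ranges are assumed):

// All objects share one VBO and one vertex format; each object is just a
// (first, count) range within it. No buffer rebinds or attribute-pointer
// changes between draws.
glBindVertexArray(sharedVAO);   // set up once against the single big VBO
for (size_t i = 0; i < numObjects; ++i)
    glDrawArrays(GL_TRIANGLES, objects[i].firstVertex, objects[i].vertexCount);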

As we were talking about pre-multiplying vertex transformations before submitting them to the shader, I’m wondering what you mean in this context; are you saying even these matrix calculations are pre-processed?

In some 3D model formats, such as the .MD2 format, the vertex and matrix data are pre-processed to minimize the size of the model’s data:



// vertex
typedef struct
{
    unsigned char v[3];                // compressed vertex (x, y, z) coordinates
    unsigned char lightnormalindex;    // index to a normal vector for the lighting
} vertex_t;

// texture coordinates
typedef struct
{
    short s;
    short t;
} texCoord_t;

// triangle
typedef struct
{
    short index_xyz[3];    // indexes to triangle's vertices
    short index_st[3];     // indexes to vertices' texture coordinates
} triangle_t;

// frame
typedef struct
{
    float    scale[3];        // scale values
    float    translate[3];    // translation vector
    char     name[16];        // frame name
    vertex_t verts[1];        // first vertex of this frame
} frame_t;

glBegin( GL_TRIANGLES );   // draw each triangle
for( int i = 0; i < header.num_tris; i++ )
{
    // draw triangle #i
    for( int j = 0; j < 3; j++ )
    {
        // k is the frame to draw
        // i is the current triangle of the frame
        // j is the current vertex of the triangle
        glTexCoord2f(
            (float)TexCoord[ Meshes[i].index_st[j] ].s / header.skinwidth,
            (float)TexCoord[ Meshes[i].index_st[j] ].t / header.skinheight
        );

        glNormal3fv( anorms[ Vertices[ Meshes[i].index_xyz[j] ].lightnormalindex ] );

        // uncompress: scale and translate the byte-packed coordinates
        glVertex3f(
            (Vertices[ Meshes[i].index_xyz[j] ].v[0] * frame[k].scale[0]) + frame[k].translate[0],
            (Vertices[ Meshes[i].index_xyz[j] ].v[1] * frame[k].scale[1]) + frame[k].translate[1],
            (Vertices[ Meshes[i].index_xyz[j] ].v[2] * frame[k].scale[2]) + frame[k].translate[2]
        );
    }
}
glEnd();


We can find a full explanation of the .MD2 format at C++ > OpenGL > The MD2 Model File Format, for example:


You may have noticed that v[3] contains the vertex’s (x, y, z) coordinates, and because of the unsigned char type these coordinates can only range from 0 to 255. In fact these 3D coordinates are compressed (3 bytes instead of 12 if we would use float or vec3_t). To uncompress them, we’ll use other data proper to each frame. lightnormalindex is an index into a precalculated normal table. Normal vectors will be used for the lighting.

=> here, we can clearly say that the input vertex and matrix data is pre-processed …
(the vertex coordinates are stored as 3 bytes rather than 3 floats, the normal is stored in a precalculated table [+ the matrix data is simplified to handle only scaling and translation])

Note that in this 3D model format, the vertex/normal/texel/matrix data is still not pre-multiplied :)
(i.e. it is pre-processed [to minimize the size of the stored data] but not pre-multiplied)

[QUOTE=tonyo_au;1250166]I don’t think it was directly related to the batch size; I think it is more related to the number of buffers I had - I had 7000+ (not a good idea) but with small batch sizes I think the GPU was basically idle, as it had very little work to do with each render call.

I run on an ATI 5870, an nVidia Quadro 5000 and a GTX 580 - the frame rate is different on each, but the percentage change is similar.[/QUOTE]

Yes, in this case you are totally CPU limited, not GPU limited, as explained in NVIDIA’s “Batch Batch Batch” presentation:


Yes, at < 130 tris/batch (avg) you are completely, utterly, totally, 100% CPU limited! The CPU is busy doing nothing but submitting batches!

I think one good solution would be something like a “primitive transformation restart” that could be stored in the batch’s indices: special index values indicating that the primitive which follows carries “transformation vertices” rather than true vertex indices.

For a triangle batch, the first index could point into a translation table, the second into a rotation table, and the third into a scaling table, for example (with quad batches, the fourth index could be used to handle homogeneous coordinates).

=> we could certainly use negative indices to indicate that the incoming primitive is in fact a transformation primitive

The number of buffers and the number of batches aren’t the same thing

In my case I had one buffer for each render call

Did you try putting all that in one buffer and just rendering parts of it

Since I wanted to render all the objects, I put all the vertices into several buffers of about 100,000 vertices each and used the primitive restart index. This got me back to an acceptable frame rate.

The 100,000 size was a compromise between render time and update time when objects are deleted. Doubling this number did not make a practical difference to the overall frame render time, but it noticeably slowed my deletes.
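For reference, the primitive restart setup looks roughly like this (GL 3.1+; the reserved index value is whatever you choose):

// Reserve one index value as a restart marker; whenever the GPU meets it,
// the current primitive ends and a new one begins, so many disjoint pieces
// can live in one glDrawElements call.
glEnable(GL_PRIMITIVE_RESTART);
glPrimitiveRestartIndex(0xFFFF);   // marker for GL_UNSIGNED_SHORT indices

// indices = { piece A ..., 0xFFFF, piece B ..., 0xFFFF, piece C ... }
glDrawElements(GL_TRIANGLE_STRIP, totalIndexCount, GL_UNSIGNED_SHORT, 0);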

As we were talking about pre-multiplying vertex transformations before submitting them to the shader

One of the objects I render lots of is pipes. These are all cylinders and could therefore use the same geometry with a scale/rotate/translate matrix. I have tried rendering these three different ways:

  1. instancing with a matrix
  2. creating the geometry in the tessellator from a parameterised vertex that describes the radius/length/rotation/translation of the pipe
  3. separate geometry for each pipe, with each vertex at its world location

The third option is the fastest but takes the most space.

I am currently using the second option, which uses little space (only marginally more than instancing) and allows LOD to improve render speed. It is not as fast as the third option, even with a pipe of 64 sides, but it is a lot less data.

With the new graphics cards tessellation is a lot faster, but the amount of memory on the card is also larger, so I am not sure my option is the best choice.
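In spirit, the parameterised-vertex approach looks something like this minimal tessellation evaluation shader (a simplified sketch, not my production code; the control shader sets gl_TessLevelOuter, which is where the LOD mentioned above comes from):

const char* tes = R"(
#version 400
layout(quads, equal_spacing, ccw) in;
// Per-patch pipe description forwarded by the tessellation control shader:
patch in float pipeRadius;
patch in float pipeLength;
patch in mat4  pipeTransform;   // the pipe's rotation + translation
uniform mat4 viewProj;
void main()
{
    float angle = gl_TessCoord.x * 6.2831853;           // around the circumference
    vec4 p = vec4(cos(angle) * pipeRadius,
                  sin(angle) * pipeRadius,
                  gl_TessCoord.y * pipeLength, 1.0);    // along the pipe axis
    gl_Position = viewProj * (pipeTransform * p);
}
)";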

[QUOTE=tonyo_au;1250332]
2) creating the geometry in the tessellator from a parameterised vertex that describes the radius/length/rotation/translation of the pipe

With the new graphics cards tessellation is a lot faster, but the amount of memory on the card is also larger, so I am not sure my option is the best choice.[/QUOTE]

By “tessellator” are you referring to building the geometry in the shader?

By “tessellator” are you referring to building the geometry in the shader?

Yes, using the tessellation shaders.