Thread: What is best practice for batch drawing objects with different transformations?

  1. #1
    Junior Member Newbie
    Join Date
    Apr 2013
    Posts
    28

    I'm trying to work out a good approach to rendering as many disjoint pieces of geometry as possible with a single draw call in OpenGL. The wall I'm up against is how to do so when each piece has a different translation and maybe rotation, since you don't have the luxury of updating the model view uniform between single object draws. I've read a few other questions here and elsewhere, and the directions people are pointed in seem quite varied. It would be nice to list the main methods of doing this and attempt to isolate what is most common or recommended. Here are the ideas I've considered:

    1) Instancing; A new attribute is sent and updated per object, rather than per vertex. I could then pass varied transformation data efficiently, and within one draw call. The drawback of this technique is that my code would be less portable, supporting desktop GL only, since most mobile platforms do not seem to support this feature yet in OpenGL ES 2.0.

    2) Creating matrix transformations in the shader. Here I'd send a translation vector or maybe a rotation angle or quaternion as part of the attributes. The advantage is it would work cross-platform, including mobile. But it seems a bit wasteful to send the exact same transformation data for every single vertex in an object, as an attribute. Without instancing, I'd have to repeat these identical vectors or scalars for a single object many, many times in a VBO as part of the interleaved array, right? The other drawback is I'm relying on the shader to do the math; I don't know if this is wise or not.

    3) Similar to 2), but instead of relying on the shader to do the matrix calculations, I instead do these on the client side but still send through the final model view matrix as a stream of 16 floats in the VBO. But as far as I can tell, without instancing, I'd have to repeat this identical stream for every single vertex in the VBO, right? Just seems wasteful. The tradeoff with 2) above is that I am sending more data in the VBO per vertex (16 floats rather than a 3-float vector for translation and maybe a 4 float quaternion), but requiring the shader to do less work.

    4) Skip all the above limitations and instead compromise with a separate draw call for each object. This is what is typically "taught" in the books I'm reading, no doubt for simplicity's sake.

    Are there other common methods than these?

    As an academic question, I'm curious if all the above are feasible and "acceptable" or if one of them is clearly a winner over the others? If I was to exclusively use desktop GL, is instancing the primary way for achieving this?

  2. #2
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    Instancing; A new attribute is sent and updated per object, rather than per vertex. I could then pass varied transformation data efficiently, and within one draw call. The drawback of this technique is that my code would be less portable, supporting desktop GL only, since most mobile platforms do not seem to support this feature yet in OpenGL ES 2.0.
    Um, no. The biggest drawback with instancing is that instancing only supports drawing the same mesh. Instancing repeatedly loops through the same per-vertex data multiple times, each time with a different gl_InstanceID value in the VS and/or a different set of per-object attributes.

    Whether instancing is or is not supported is irrelevant if it simply can't do what you need. If you're drawing different objects, instancing just isn't going to help you.

    Here I'd send a translation vector or maybe a rotation angle or quaternion as part of the attributes.
    I seriously doubt that this could be faster than multiple draw calls in virtually all situations. The two main problems are the added vertex shader input data, and the fact that you're now streaming vertex data on what may have been static models otherwise.

    The first problem persists even if you use shorts for the quat+trans; that will still be no less than 16 bytes per vertex. The absolute best you could hope for is to pass an index (perhaps as a byte, but even then, it's a good idea to align attributes to 4 bytes, so that's still an extra 4 bytes per vertex), which you use to look something up in a buffer texture or uniform buffer.
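
    To make the byte counts above concrete, here is a small C++ sketch. The struct names and field layout are made up for illustration; the only things taken from the post are the "quat+trans as shorts" idea and the advice to keep attributes 4-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-vertex payload: a quaternion plus a translation,
// both stored as 16-bit shorts.
struct PackedTransform {
    std::int16_t quat[4];   // rotation: 4 shorts = 8 bytes
    std::int16_t trans[3];  // translation: 3 shorts = 6 bytes
    std::int16_t pad;       // rounds 14 bytes up to 16 for 4-byte alignment
};
static_assert(sizeof(PackedTransform) == 16, "quat+trans as shorts is 16 bytes");

// Hypothetical alternative: a one-byte index into a table of transforms,
// padded so the attribute still occupies 4 bytes.
struct PackedIndex {
    std::uint8_t index;
    std::uint8_t pad[3];
};
static_assert(sizeof(PackedIndex) == 4, "a padded index is 4 bytes");

// Extra vertex data added to a mesh by a per-vertex payload of the given size.
std::size_t extraBytes(std::size_t vertexCount, std::size_t bytesPerVertex) {
    assert(bytesPerVertex % 4 == 0);  // keep attributes 4-byte aligned
    return vertexCount * bytesPerVertex;
}
```

    For a 1,000-vertex mesh this works out to 16,000 extra bytes for the packed transform versus 4,000 for the padded index.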

    The second problem causes a number of issues. If you've got half-static and half-streamed data, then now you're going to have to split your vertex data (one buffer object for static, one for streamed). This is almost certainly going to be less performance friendly just in terms of upload time. Coupled with that, you're going to need to do buffer object streaming of some form. This is certainly doable, but non-trivial.

    If you use an index rather than the actual data, you might have a functional solution, especially if you can hide that index in some other attribute (for example, if you only use the RGB of the color, you can hide the index in alpha). This would in effect be doing matrix palette skinning, just with only one index per vertex and no blending between matrices. This can be a workable solution, but generally it's for objects that are hierarchically linked already, not an arbitrary cloud of stuff.
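
    A rough CPU-side sketch of the index idea, assuming a made-up vertex layout where the byte that would normally hold alpha carries the matrix index instead. The function `transformVertex` only mimics the lookup a vertex shader would do against a uniform buffer or buffer texture; all names here are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical interleaved vertex: the alpha byte is reused as a matrix index.
struct Vertex {
    float position[3];
    std::uint8_t rgb[3];
    std::uint8_t matrixIndex;  // stored where alpha would normally go
};

// 4x4 matrix in column-major order, as OpenGL conventionally stores it.
using Mat4 = std::array<float, 16>;

// Builds a pure translation matrix (translation lives in elements 12..14).
Mat4 translation(float x, float y, float z) {
    return Mat4{1.0f, 0.0f, 0.0f, 0.0f,
                0.0f, 1.0f, 0.0f, 0.0f,
                0.0f, 0.0f, 1.0f, 0.0f,
                x,    y,    z,    1.0f};
}

// CPU reference for the shader-side work: look up the matrix by index,
// then transform the position (w is assumed to be 1).
std::array<float, 3> transformVertex(const Vertex& v,
                                     const std::vector<Mat4>& palette) {
    assert(v.matrixIndex < palette.size());
    const Mat4& m = palette[v.matrixIndex];
    std::array<float, 3> out{};
    for (int row = 0; row < 3; ++row) {
        out[row] = m[0 * 4 + row] * v.position[0] +
                   m[1 * 4 + row] * v.position[1] +
                   m[2 * 4 + row] * v.position[2] +
                   m[3 * 4 + row];
    }
    return out;
}
```

    With a single byte as the index, one batch can address at most 256 distinct transforms, which fits the "hierarchically linked objects" use case better than an arbitrary cloud of stuff.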

    But outside of that kind of situation, this will generally be a poor performer. And not because of the vertex shader, so your "matrix per vertex" solution is a non-starter.

    Generally speaking, if you have multiple objects, with each object using independent transforms, you use multiple draw calls. That's what they're there for. The old NVIDIA "Batch Batch Batch" presentation cited between 10,000 and 40,000 draw calls per-frame (in D3D; more in GL) for a 1GHz CPU. Nowadays, you're looking at rather more than that. So unless you're dealing with tens of thousands of individual objects, all of them being different (so no instancing), odds are good that you'll be fine.

    On desktop GL, of course.

  3. #3
    Junior Member Newbie
    Join Date
    Apr 2013
    Posts
    28
    Very useful, thanks. This gives me some confidence to worry less about doing multiple draws until I actually see a serious bottleneck in effect. It certainly simplifies things for now. Appreciate it.

  4. #4
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,101
    between 10,000 and 40,000 draw calls per-frame
    I have found that if the batch has only a small number of triangles, like a polyline or a simple cube structure, you cannot get anything like 10,000 draw calls per frame with an acceptable frame rate (say 20 fps). Over about 2,500 calls, the frame rate rapidly approaches 1 fps.

    I have not profiled it to the extent of finding exactly what I am CPU bound on. The loop was not changing states, but it was changing buffers with each call.

    My solution was easy because my data is relatively static, so I pre-multiplied the instanced objects by their translation/rotation matrices and stored the resulting vertices in a large buffer to minimise draw calls, and quite happily got back to 20+ fps. Of course the trade-off is more data space for the vertices, but the matrices don't come free and the individual objects were typically less than 20 vertices.
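
    The pre-multiplication described above might look roughly like this in C++. This is only a sketch under assumptions: `Vec3`, `Object` and `bakeBatch` are hypothetical names, matrices are assumed column-major, and uploading the result to a single vertex buffer is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// 4x4 matrix in column-major order (translation in elements 12..14).
using Mat4 = std::array<float, 16>;

// Applies an affine transform to a point, treating w as 1.
Vec3 transformPoint(const Mat4& m, const Vec3& p) {
    return Vec3{m[0] * p.x + m[4] * p.y + m[8]  * p.z + m[12],
                m[1] * p.x + m[5] * p.y + m[9]  * p.z + m[13],
                m[2] * p.x + m[6] * p.y + m[10] * p.z + m[14]};
}

// Hypothetical small object: local-space vertices plus its own model matrix.
struct Object {
    std::vector<Vec3> localVertices;
    Mat4 modelMatrix;
};

// Transforms every object's vertices on the CPU and appends them to one
// array, so the whole set can live in one buffer and go out in one draw call.
std::vector<Vec3> bakeBatch(const std::vector<Object>& objects) {
    std::size_t total = 0;
    for (const Object& o : objects) total += o.localVertices.size();

    std::vector<Vec3> baked;
    baked.reserve(total);
    for (const Object& o : objects) {
        for (const Vec3& v : o.localVertices) {
            baked.push_back(transformPoint(o.modelMatrix, v));
        }
    }
    assert(baked.size() == total);
    return baked;
}
```

    The baking cost is paid once for static data; anything that moves has to be re-baked and re-uploaded, which is the trade-off discussed in the replies below.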

  5. #5
    Junior Member Newbie
    Join Date
    Apr 2013
    Posts
    28
    Originally posted by tonyo_au:
    My solution was easy because my data is relatively static, so I pre-multiplied the instanced objects by their translation/rotation matrices and stored the resulting vertices in a large buffer to minimise draw calls, and quite happily got back to 20+ fps. Of course the trade-off is more data space for the vertices, but the matrices don't come free and the individual objects were typically less than 20 vertices.
    Now there's an idea I hadn't thought of. Take the modelview matrix calculations out of the shader entirely and just pass the vertices after multiplication. This allows a single draw call for many objects in different orientations and translations. The cost comes from all the CPU calculations, but I suppose if that bottleneck is not as big as the bottleneck of multiple draw calls, it would be worth it, as you noted.

    I wonder how often others end up doing this to achieve a decent frame rate.

  6. #6
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,101
    I wonder how often others end up doing this to achieve a decent frame rate.
    If you look at the games industry, they do as much pre-processing as possible - that is why they get such impressive frame rates.

    My biggest problem now is when a single object is moved or deleted. My current solution is repacking the vertex buffer, but it is proving quite slow and I am looking at just modifying the object's vertices so that they are co-located.
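
    One possible reading of "co-located" is to collapse a deleted object's vertices onto a single point, so its triangles become zero-area and draw nothing, instead of repacking the whole buffer. The C++ sketch below works under that assumption; `ObjectRange` and `collapseObject` are hypothetical names, and the CPU-side array stands in for the real vertex buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical bookkeeping: where one object's vertices sit in the big array.
struct ObjectRange {
    std::size_t first;
    std::size_t count;
};

// "Deletes" an object without repacking: every vertex in its range is moved
// to the position of the range's first vertex, making its primitives degenerate.
void collapseObject(std::vector<Vec3>& vertices, const ObjectRange& range) {
    assert(range.first + range.count <= vertices.size());
    if (range.count == 0) return;

    const Vec3 anchor = vertices[range.first];
    for (std::size_t i = range.first; i < range.first + range.count; ++i) {
        vertices[i] = anchor;
    }
}
```

    Only the touched range would then need re-uploading (for example with glBufferSubData), at the cost of leaving dead vertices in the buffer until some later repack.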

  7. #7
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    I have found that if the batch has only a small number of triangles, like a polyline or a simple cube structure, you cannot get anything like 10,000 draw calls per frame with an acceptable frame rate (say 20 fps).
    Are you saying that the performance per batch decreases if the batch size is small? On what hardware did you see this?

  8. #8
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,101
    Are you saying that the performance per batch decreases if the batch size is small? On what hardware did you see this?
    I don't think it was directly related to the batch size; I think it is more related to the number of buffers I had. I had 7000+ (not a good idea), but with small batch sizes I think the GPU was basically idle, as it had very little work to do with each render call.

    I run on an ATI 5870, an NVIDIA Quadro 5000 and a GTX 580. The frame rate is different on each, but the percentage change is similar.
    Last edited by tonyo_au; 04-22-2013 at 08:55 PM.
