glMultiDrawElements() and VBO

Say I wanted to model a human. I have a submodel each for the legs, torso, arms and head. Which is more feasible when drawing with glMultiDrawElements():

a) put vertices of all submodels and indices of all submodels into a single VBO (in total, 1 VBO),
b) put vertices of all submodels into one VBO and indices of all submodels into one VBO (in total, 2 VBOs),
c) put vertices of all submodels into one VBO and indices of each submodel into a separate VBO (in total, 5 VBOs),
d) put vertices and indices of each submodel into its own VBO, then draw with glDrawElements() (in total, 4 VBOs)

In my opinion the correct answer is b, but I am not certain because of possible alignment problems. Are indices immune to this sort of problem in your experience?

Additional question: if VAO or display lists were used, is the answer the same?

mmm…
You can’t use C and D because you can’t switch VBOs inside a draw call; you can only use one VBO for the data and another VBO for the indices.
A is a little strange. Indices require a GL_ELEMENT_ARRAY_BUFFER and data requires a GL_ARRAY_BUFFER.
Theoretically you can bind the same buffer twice, but I’ve never tried it.
B is the most commonly used solution.

I don’t understand which kind of alignment problems you are talking about. Probably you need an offset for the indices. That problem is solved in OpenGL 3.2 with the function
glMultiDrawElementsBaseVertex
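
A rough sketch of what I mean (buffer names, counts and offsets are made up, and it needs GL 3.2 or ARB_draw_elements_base_vertex):

/* Two submodels sharing one vertex VBO and one index VBO; each
 * submodel keeps its own zero-based indices. Names are hypothetical. */
GLsizei counts[2] = { 300, 450 };                  /* indices per submodel */
const GLvoid *offsets[2] = {
    (const GLvoid *)0,                             /* submodel 0 indices   */
    (const GLvoid *)(300 * sizeof(GLushort))       /* submodel 1 indices   */
};
GLint baseVerts[2] = { 0, 100 };                   /* first vertex of each */

glBindBuffer(GL_ARRAY_BUFFER, vertexVBO);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexVBO);
/* ... attribute pointer setup ... */
glMultiDrawElementsBaseVertex(GL_TRIANGLES, counts, GL_UNSIGNED_SHORT,
                              offsets, 2, baseVerts);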

The trick of putting indices and vertices into the same VBO works just fine. I’ve tried it with Mesa, ATI, NVIDIA and even the iPhone/iPod; it works everywhere. You can bind the same VBO for both indices and vertices.
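
For the record, the upload and draw then look something like this (fixed-function style; all names and sizes are hypothetical):

/* Pack vertices first, then indices, into one buffer object and bind
 * that same object to both targets. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexBytes + indexBytes, NULL, GL_STATIC_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, vertexBytes, vertexData);
glBufferSubData(GL_ARRAY_BUFFER, vertexBytes, indexBytes, indexData);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, vbo);   /* same buffer, second target */

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
               (const GLvoid *)vertexBytes);  /* indices start after vertices */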

All the variants are implementable. I am mostly worried about alignment issues: say the GPU wants indices or vertices at a specific alignment; if I pack everything into 1 VBO, I might miss this requirement and hurt performance. So please help: which variant is safest performance-wise? I know that a) is fastest if everything goes right with alignment.

Oh… sorry.
In this case I think you are more expert than me :smiley:
I still don’t understand how C works; you still need separate draw calls, don’t you?
Yep… D is fine and probably used where you have to switch textures or state. But it is surely the slowest.

I think performance is really driver-dependent, but usually the less state you switch, the faster you are… that’s all I know :frowning:
Oh… damn… I’m a noob. Sorry.

Never heard of any such alignment problems (or any kind of “alignment fault”) in sending data to the GPU. In fact, some folks have been doing a) (packing vertex attributes and indices into the same buffer) for a long while. See this post for example.

But what’s interesting is to ask why. Generally speaking, binding buffers is expensive, especially when you’re dealing with smaller batches (though vendor-specific extensions allow you to circumvent this overhead). So you generally net a speed-up from binding fewer buffers over the course of a frame. The driver translating buffer handles to GPU addresses on the CPU can be expensive (since CPU memory is so daggum slow relative to CPU instructions nowadays).

Additional question: if VAO or display lists were used, is the answer the same?

Well, as with everything GPU, there are no performance guarantees or absolutes. But…

With display lists, the driver is gonna rip all your data out of your VBOs and create its own copy, carefully optimized (if it’s doing a good job) and managed internally by the driver. So while anything’s possible, which option you choose may not matter much, particularly on NVidia, which has “awesome” display list performance.

With VAOs, it again depends on the driver implementation. Remember that expense I mentioned of translating VBO handles to addresses, which (probably among other things) makes binding VBOs relatively expensive? VAOs were a first stab at getting rid of some of this “batch setup” overhead, or at least taking the hit once up-front. They can usually help some (especially with smaller batches), but not like vendor-specific extensions such as NVidia bindless. With bindless, this overhead pretty much just goes away, and you don’t have to go crazy packing bunches of batches into each VBO (to reduce the overhead of binding buffers) to get great performance.
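
For example, a minimal VAO sketch (GL 3.0+; the buffer names and the single attribute are hypothetical):

/* Record the buffer bindings and attribute setup once... */
GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const GLvoid *)0);
glEnableVertexAttribArray(0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);   /* saved as part of the VAO */
glBindVertexArray(0);

/* ...then each frame the whole setup collapses to one bind. */
glBindVertexArray(vao);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, (const GLvoid *)0);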

Never heard of any such alignment problems (or any kind of “alignment fault”) in sending data to the GPU. In fact, some folks have been doing a) (packing vertex attributes and indices into the same buffer) for a long while. See this post for example.

The post you reference does mention that 3 color bytes in an array break the alignment, which leads to slow performance. I am worried that the GPU expects to find indices at an address with a certain alignment. If I place a multitude of index arrays into a single VBO, I’ll surely break any alignment requirements, which are probably met if you place indices into separate VBOs. At least, that’s my thinking. You say you’ve never heard of any problems with this?

If I place a multitude of index arrays into a single VBO, I’ll surely break any alignment requirements

What alignment requirements are you talking about?

First of all, the GL spec says nothing about the alignment of anything with regard to gl*Pointer or any of the glDrawElements calls. So you are allowed to use any alignment you want and your rendering will work.

Different implementations may exhibit slower performance without certain alignments. But this is a per-attribute and per-vertex alignment. For example, it’s usually best to make sure that attribute data are aligned to their component size. So an attribute of 3 shorts should begin on a 2-byte boundary. An attribute of 3 floats should be on a 4-byte boundary. And so on.

For indices, it’s even simpler. Alignment would be on 2 or 4 byte boundaries depending on whether the index is a 2 or 4-byte index (short vs. int). That’s all.

The most restrictive alignment suggestions I’ve heard are:

1: Put all attributes on 4-byte alignment, regardless of size.
2: When interleaving data, align every vertex to 32-bytes.

Both of these are simple to do with appropriate buffer strides and offsets. And neither of them interacts with vertex indices in any way.
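
For instance, one hypothetical way to satisfy suggestion 2 with a C struct (the attribute set is made up; offsetof() comes from <stddef.h>):

typedef struct {
    GLfloat  pos[3];     /* 12 bytes, 4-byte aligned  */
    GLfloat  normal[3];  /* 12 bytes                  */
    GLushort uv[2];      /*  4 bytes                  */
    GLubyte  pad[4];     /* pads the vertex out to 32 */
} Vertex;

/* Stride is sizeof(Vertex) == 32; each attribute gets its own offset. */
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (const GLvoid *)offsetof(Vertex, pos));
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (const GLvoid *)offsetof(Vertex, normal));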

Beyond any of that, it’s do whatever you want.

Well, while certain GPUs/drivers might prefer a specific alignment for best performance, you won’t trip a “GPU core dump” by using another alignment.

The same thing applies to the size of interleaved vertex attribute data alone, without even mixing indices into the same VBO. Some GPUs/drivers could net you a slight speed-up if you padded your vertex data so that each vertex is a multiple of 32 bytes.

But anyway, if you want to be totally paranoid about the index alignment thing, don’t ever mix indices and attributes in the same VBO. Or (slightly less paranoid), if you do, then ensure that the byte offset at which you start your indices is a multiple of 8. It’s not like you’ll be wasting a ton of memory. But benchmark it too, to see if you can actually see a difference.
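
In code, the paranoid offset is just a rounding (vertexBytes and friends are hypothetical):

/* Start the index block of a mixed VBO on an 8-byte boundary. */
GLintptr indexOffset = (vertexBytes + 7) & ~(GLintptr)7;
glBufferSubData(GL_ARRAY_BUFFER, indexOffset, indexBytes, indexData);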

If I place a multitude of index arrays into a single VBO, I’ll surely break any alignment requirements, which are probably met if you place indices into separate VBOs.

Well, if you take the 2-VBO approach (i.e. only put indices in an index VBO), and you assume the start of the VBO is properly aligned by the driver, and especially if you only use one format of indices (e.g. all shorts) in the index VBO, then you can be pretty sure there’s no way you’re gonna end up “unaligned” if you tightly pack your indices.

I was talking about the performance-hit alignment problems. Say I had a VBO like this:

| vertices, all triples of 4 byte floats
| some 2 byte indices
| some 4 byte indices

The 4-byte indices could fall on a 2-byte boundary if both index arrays are packed into the same VBO.

I see that strides can help, though. I could also rearrange the indices so the 4-byte indices come before the 2-byte ones. Until now I had seen strides as an ‘exotic’ feature.

BTW:

2: When interleaving data, align every vertex to 32-bytes.

You meant 32 bits, right? Also, in what situation would one want to interleave?

From this post and others, I believe it all comes down to testing: seeing how the app behaves with different cards/drivers, and if one is not satisfied with performance, then the alignment fixes are just another tool in the toolbox with which one can try to improve performance.

Well, two things. Years back, yeah, there was some comment that ATI wants each vertex attribute to be 32-bit aligned (link).

But there have also been comments that ATI and/or NVidia at least used to prefer, with interleaved vertex attributes, the data for one complete vertex (all attributes) to be 32 bytes in length for best performance (random link, link, link, link, link, just to name a few).

That said, a lot of this is old news in GPU-evolution terms, so I’d retest to see if it still applies to the latest cards/drivers.

Also, in what situation would one want to interleave?

Most, if not all of them (on today’s GPUs), to minimize the number of buffers you have to bind. E.g. with 3 vtx attribs each in their own VBO, you’ve gotta bind 3. Interleaved: bind just one. Not only that, you can envision memory accesses likely being more coherent with 1 interleaved vtx attribute VBO rather than N. With N, the GPU has gotta do a memory gather from potentially all over GPU memory. With interleaved, it’s just streaming. And many (if not all nowadays) support memory access coalescing, even without a cache (though the newer cards allegedly have cache).

One reason not to interleave everything would be that you need to dynamically update one attribute while the rest are static. So you might store all the static ones interleaved in one VBO and separate the dynamic one out into its own VBO (which is periodically re-uploaded).
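
A sketch of that split (names, strides and attribute slots are all made up):

/* Static attributes (position + normal) interleaved in one VBO... */
glBindBuffer(GL_ARRAY_BUFFER, staticVBO);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 24, (const GLvoid *)0);   /* pos    */
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 24, (const GLvoid *)12);  /* normal */

/* ...the dynamic attribute in its own VBO, re-uploaded when it changes. */
glBindBuffer(GL_ARRAY_BUFFER, dynamicVBO);
glBufferData(GL_ARRAY_BUFFER, colorBytes, colorData, GL_STREAM_DRAW);
glVertexAttribPointer(2, 4, GL_UNSIGNED_BYTE, GL_TRUE, 0, (const GLvoid *)0);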

Now, 5+ generations of cards ago, there used to be some cases where interleaving might be slower depending on the formats/alignment and how “picky” the card was, but those days are probably long since dead. That said, I haven’t benched this with lots of permutations recently.

…if one is not satisfied with performance, then the alignment fixes are just another tool in the toolbox with which one can try to improve performance.

Right.

Most, if not all of them (on today’s GPUs), to minimize the number of buffers you have to bind. E.g. with 3 vtx attribs each in their own VBO, you’ve gotta bind 3. Interleaved: bind just one.

Bind three buffers? Why? With a VBO like this, where the attributes have been stacked on top of one another:

| vattr1
| vattr2
| vattr3

So the vattrs don’t interleave, but you can provide three different vertex attribute pointers and bind just one VBO, no?
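
I.e. something like this (sizes hypothetical):

/* One VBO, three tightly packed attribute blocks stacked back to back. */
GLsizeiptr n = vertexCount;
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const GLvoid *)0);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 0,
                      (const GLvoid *)(n * 3 * sizeof(GLfloat)));
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 0,
                      (const GLvoid *)(n * 6 * sizeof(GLfloat)));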

Not only that, you can envision memory accesses likely being more coherent with 1 interleaved vtx attribute VBO rather than N. With N, the GPU has gotta do a memory gather from potentially all over GPU memory. With interleaved, it’s just streaming. And many (if not all nowadays) support memory access coalescing, even without a cache (though the newer cards allegedly have cache).

This argument, however, is still valid even with the stacked VBO. So a personal question, do you use interleaving as a rule in your projects?

The 4-byte indices could fall on a 2-byte boundary if both index arrays are packed into the same VBO.

“Could”? It’s up to you to decide where to put them. You don’t have to put the 4-byte indices on the byte directly after the 2-byte index array is finished. Just add a bit of extra room.
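
E.g. (all names hypothetical):

/* Round the start of the 4-byte indices up to a 4-byte boundary. */
GLintptr shortBytes = numShortIdx * sizeof(GLushort);
GLintptr intOffset  = (shortBytes + 3) & ~(GLintptr)3;
glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, 0, shortBytes, shortIdxData);
glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, intOffset,
                numIntIdx * sizeof(GLuint), intIdxData);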

So a personal question, do you use interleaving as a rule in your projects?

It’s generally best to interleave unless you have a reason not to, like needing to dynamically update a certain attribute’s data. Or if memory becomes a problem due to the added padding for interleaved data.

Yes, that 1 non-interleaved VBO is yet another permutation, halfway between the 3 separate VBOs I mentioned and the 1 interleaved VBO we were also discussing. It has the advantage of only one VBO to bind, but still has the disadvantage of 3 streams potentially fragmenting memory fetches from GPU memory.

So a personal question, do you use interleaving as a rule in your projects?

Yes, for the static attributes. One VBO to bind and only one sequential stream from GPU memory. But time both approaches yourself and see whether you notice any difference. Make sure your batch sizes are big enough, and that you’re not shader-limited, so that you’re most likely to be able to detect vertex-fetch performance.
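
If it helps, one way to time the draws themselves is a GL timer query (GL 3.3 / ARB_timer_query); a minimal sketch:

GLuint query;
GLuint64 ns;
glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED, query);
/* ... the batch of draw calls being compared ... */
glEndQuery(GL_TIME_ELAPSED);
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);  /* waits for the GPU */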

Thanks for your kind answers. I have one last question.

What would happen if one were to upload all static vertex attributes and indices of all models of an app into one single big VBO? The binding overhead would then largely disappear (except for dynamic attributes), but what would the downside of such an approach be? Streaming would still be possible, as one could interleave “related” vertex attributes. Alignment problems could also be sorted out within the big VBO, I suppose. Is it possible to find out the maximum possible size of the big VBO in advance, assuming more than one application uses the GPU at the same time?
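
To illustrate the first question, what I have in mind is, schematically (all names and sizes hypothetical):

/* One big allocation; each model's data sub-uploaded at its own offset. */
glBindBuffer(GL_ARRAY_BUFFER, bigVBO);
glBufferData(GL_ARRAY_BUFFER, totalBytes, NULL, GL_STATIC_DRAW);
for (int i = 0; i < modelCount; ++i)
    glBufferSubData(GL_ARRAY_BUFFER, modelOffset[i],
                    modelBytes[i], modelData[i]);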

It’s generally best to interleave unless you have a reason not to, like needing to dynamically update a certain attribute’s data. Or if memory becomes a problem due to the added padding for interleaved data.

Yes, and you add padding according to the rules you’ve given, I suppose, or whatever the card/driver prefers.

What would happen if one were to upload all static vertex attributes and indices of all models of an app into one single big VBO?

Give it a shot. If that satisfies your needs, then great.

Is it possible to find out the maximum possible size of the big VBO in advance, assuming more than one application uses the GPU at the same time?

That I don’t know. That is, how to query the maximum possible size.

Even then, unless you allocate the space, it’s not reserved for your app specifically. Also, I would guess GPU memory fragmentation might lower this size from the theoretical limit. And with OpenGL there’s no guarantee that there’s even that much space in GPU device memory at all, so every time you bind the VBO and render with it, the driver could be shuffling parts of it back and forth to GPU memory, potentially causing your application to slow to a crawl.

So allocating super-big VBOs intuitively seems like it might increase your chances of having performance or insufficient-memory problems (but try it and see!). Whether it does depends on your app’s needs, the GPU memory size, and what other GL apps you might be competing with. Maybe you can just assume no other GL apps will be eating GPU memory while your application is running.

Could it be that even the super-big VBO is divided into pages (like on the x86) and these get shuffled in and out of system memory? I’ll give it a shot. Maybe several large VBOs are going to perform better than a single one, as long as there aren’t too many.

It seems like it would depend on the implementation. There are maximums for how many vertices you can use per draw call, as well as for the highest vertex index you can use (you can retrieve them at startup using D3D if that’s a viable option for you), but what happens when you go over those maximums could be anything from the entire thing running from system memory to a more graceful degradation. And what happens on your hardware might not be the same as what happens on someone else’s hardware. That’s before we even touch on GPU memory.
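
For what it’s worth, OpenGL exposes the analogous (advisory) limits for glDrawRangeElements():

/* Recommended maximums for glDrawRangeElements(); hints, not hard caps. */
GLint maxVerts, maxIndices;
glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxVerts);
glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &maxIndices);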

Thanks. Just as photon said, it all comes down to benching: finding the balance between the number of VBOs (and consequently the number of binds) and their size, apparently.