VBO layout of attributes

If you have a quantity of vertex attributes that are bound together in varying configurations, what would be an ideal VBO layout for performance?

Consider the following:

a) Put all attributes in one big VBO and gl*Pointer those necessary.
Does this cause binding of unnecessary attributes and hence an overhead? Perhaps in low-memory situations?

b) Put each optional attribute in its own VBO and bind it as needed.
Does using and binding a larger number of VBOs incur an overhead? How severe?

c) Create a complete VBO for each possible combination of attributes.
Well, obviously quite wasteful of memory, and could also be expensive if attribs need frequent data updates.

I’m not actually suggesting these as solutions, but I thought they each clearly illustrated the problem and potential overheads. How would you weigh these overheads against each other?

I didn’t get into application specific benefits or things like interleaving but if you have anything to add…

Target hardware would be PC consumer.

If your application needs frequent updates of a single attribute, I would go for option B.

If all vertex attributes need to be updated at once, I would suggest glInterleavedArrays. Performance-wise, this option should be the better choice because of data localization (no large jumps in memory when accessing the different attributes for a single vertex).
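Something along these lines (just a sketch; the struct layout and the names vertCount/verts are placeholders). With a VBO bound, the pointer argument becomes a byte offset into the buffer:

typedef struct { GLfloat u, v, nx, ny, nz, x, y, z; } Vert;   /* GL_T2F_N3F_V3F order */

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertCount * sizeof(Vert), verts, GL_STATIC_DRAW);

/* the fixed-function convenience call (enables the arrays for you)... */
glInterleavedArrays(GL_T2F_N3F_V3F, 0, (void*)0);

/* ...or the equivalent explicit setup, which also works for layouts
   glInterleavedArrays doesn't know about: */
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glEnableClientState(GL_VERTEX_ARRAY);
glTexCoordPointer(2, GL_FLOAT, sizeof(Vert), (void*)0);
glNormalPointer(GL_FLOAT, sizeof(Vert), (void*)(2 * sizeof(GLfloat)));
glVertexPointer(3, GL_FLOAT, sizeof(Vert), (void*)(5 * sizeof(GLfloat)));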

N.

That makes sense, but I’m actually more concerned about rendering performance than buffer updates. Unfortunately (b) is also a bit of a bitch to implement (or rather, inelegant) for a very general-purpose renderer.

Interleaving is actually worth discussing in more detail. Would (a) suffer much with interleaving? Would (c) actually be a worthwhile option with interleaving, despite the memory cost?

Thanks

I’m not sure what your goal is… I know you’re worried about performance, but why would you want to store all possible combinations of vertex attributes? Are you willing to spend more time loading the data in exchange for being able to quickly switch between different combinations of attributes at rendering time? If that’s the case, why would option B be an inelegant solution?

N.

glVertexPointer calls are really expensive with VBO. Do it 100,000 times a frame and it becomes a bottleneck.
Personally, I create a fixed-size VBO (16-bit addressable) for a format on first encounter and populate it until full (offsetting indices depending on the destination position of the verts within the VBO); when there isn’t enough room I create another fixed-size VBO for that format. So each format has its own list of VBOs, and I sort by VBO (after material/xform) so I only call glVertexPointer once for a big batch of meshes.
I get very good performance improvements with this scheme, but the type of scene helps - low material count, high batch count, high poly count. Basically engineering data.
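In rough code it looks something like this (cut down to a single position-only format; the type and function names here are just for illustration, and the caller is expected to pick a pool buffer with enough free space):

typedef struct { GLfloat x, y, z; } Vert;                        /* one format, for brevity */
typedef struct { Vert *verts; GLushort *indices;
                 unsigned vertCount, indexCount; } Mesh;
typedef struct { GLuint vbo; unsigned used; } PoolBuffer;        /* one slot in a format's pool */

#define POOL_VERTS 65536u                                        /* max verts addressable by GLushort */

void poolCreateBuffer(PoolBuffer *buf)
{
    glGenBuffers(1, &buf->vbo);
    glBindBuffer(GL_ARRAY_BUFFER, buf->vbo);
    glBufferData(GL_ARRAY_BUFFER, POOL_VERTS * sizeof(Vert), NULL, GL_STATIC_DRAW);
    buf->used = 0;
}

/* Pack a mesh into a pool buffer, offsetting its indices on the CPU.
   Caller guarantees buf->used + m->vertCount <= POOL_VERTS. */
void poolAddMesh(PoolBuffer *buf, Mesh *m)
{
    unsigned base = buf->used;
    unsigned i;
    for (i = 0; i < m->indexCount; ++i)
        m->indices[i] = (GLushort)(m->indices[i] + base);

    glBindBuffer(GL_ARRAY_BUFFER, buf->vbo);
    glBufferSubData(GL_ARRAY_BUFFER, base * sizeof(Vert),
                    m->vertCount * sizeof(Vert), m->verts);
    buf->used += m->vertCount;
}

/* Draw a run of meshes that were sorted into the same buffer:
   one bind + one glVertexPointer for the whole run. */
void poolDrawRun(PoolBuffer *buf, Mesh **meshes, unsigned count)
{
    unsigned i;
    glBindBuffer(GL_ARRAY_BUFFER, buf->vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, sizeof(Vert), (void*)0);
    for (i = 0; i < count; ++i)
        glDrawElements(GL_TRIANGLES, meshes[i]->indexCount,
                       GL_UNSIGNED_SHORT, meshes[i]->indices);
}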

You absolutely want to interleave data, assuming that all the channels of that data are being used. This allows the vertex fetch circuitry to do a better job of streaming data onto the card (having to do with how DRAM is accessed).

knackered has a point, assuming you suspect the software is ever to be used on h/w not capable of handling int indices (in h/w). I’ve used a similar approach myself and it’s often about as fast as you can get (offsetting an index buffer using the CPU before sending it to the server is hardly noticeable).

However, 32-bit indices changed the game. Bigtime. All of a sudden locality-of-reference comes into play in a much bigger way. All of a sudden we need to start thinking about what data is needed in what pass on the server, and in what order.

As it’s obvious that sequential reads are the fastest (which is what the OP asked about), I won’t go into this any further than suggesting you think about “when is what data needed”.

As for OP’s questions:
a) Too little info to answer. Are the attributes interleaved or one-after-another? If the latter, are they at least properly aligned?
b) Don’t do this unless you have to. Allocate a “scratch” space in the VBO for this stuff, that you can use without switching buffer. VRAM is “cheap”, buffer-switching is expensive.
c) “each possible combination of attributes”? If you need something for the worst case (f.ex. 142 texture coordinates :) ) don’t bother. Use the common case (even if you create buffers for 1-2 more texture coordinates than you usually use) and save the extra stuff for when you actually need the extra attribute space.

Of course, this, as always, is domain-specific. If I got these questions in the context of my half-a-gigabyte VRAM card, I might answer one thing. If it’s in the context of a PS2… you get the point.
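For (b), the “scratch space” idea in code, roughly (names are made up; one per-frame colour attribute as the example, Vert as in the earlier sketch): over-allocate the VBO and keep a region at the end for the data that changes, so updating it never requires binding a different buffer.

GLsizeiptr staticBytes  = vertCount * sizeof(Vert);              /* positions etc., uploaded once */
GLsizeiptr scratchBytes = vertCount * 3 * sizeof(GLfloat);       /* e.g. per-frame colours */

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, staticBytes + scratchBytes, NULL, GL_DYNAMIC_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, staticBytes, verts);

/* per frame: refresh only the scratch region; the buffer binding never changes */
glBufferSubData(GL_ARRAY_BUFFER, staticBytes, scratchBytes, colours);
glVertexPointer(3, GL_FLOAT, sizeof(Vert), (void*)0);
glColorPointer(3, GL_FLOAT, 0, (void*)(size_t)staticBytes);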

By locality of reference, you mean having vertices arranged in the order they’re referenced by the indices of the primitive you’re rendering?
If so, yes, that gave me a bit of a jump too.
Look into the d3dx functions OptimizeFaces and OptimizeVertices if you want to get some marked improvements in vertex throughput.
BTW, I still get much better performance with 16bit indices than 32bit on the latest nvidia chipsets…

Don’t forget the pre-vertex cache and post-vertex cache.
The pre-vertex cache is a bit like the CPU cache. You want your vertex data to be local in VRAM to benefit from it the most.
The post-vertex cache is beneficial when you reuse vertices, so it depends on your indices.

An ideal case would be something like
index = 0, 1, 2, 2, 1, 3, 3, 1, 4, …
glDrawElements(GL_TRIANGLES, …, …);

Sucky precache usage, good post cache usage :
index = 0, 1, 33, 33, 1, 45, 45, 1, 4, …
glDrawElements(GL_TRIANGLES, …, …);

Good pre-cache usage, sucky post cache usage :
index = 0, 1, 2, 3, 4, 5, 6, 7, 8, …
glDrawElements(GL_TRIANGLES, …, …);

BTW, I still get much better performance with 16bit indices than 32bit on the latest nvidia chipsets…

Sure, it’s all about fitting more into the cache.

Woohoo, 3000 posts!

that’s why I pointed him at OptimizeFaces and OptimizeVertices - for pre and post vertex cache sorting.

In the days of VAR a lot of stuff was more obvious; with VBOs it’s hard to find out what’s going on under the hood. I’m doing a rewrite of my first VBO implementation, which was based on guesswork at best. I can’t say I’m having performance problems, and I doubt I have much to gain, but as I’m making some changes anyway I want to be well informed.

So, this is what I’ve understood so far:

  1. As switching VBOs is expensive, prefer sticking more data at the end of the same VBO over getting it from another VBO.

  2. Interleaving is definitely worthwhile.

  3. In the case of dynamic attribs, putting those in a separate VBO can be beneficial.
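For point 3, I’m picturing something like this (my own sketch, so correct me if I’ve misunderstood; staticVbo/dynamicVbo etc. are placeholder names). Each gl*Pointer call captures whichever buffer is bound at the time, so the dynamic attribute can live in its own VBO and still feed the same draw call:

glBindBuffer(GL_ARRAY_BUFFER, staticVbo);                 /* interleaved position + normal, uploaded once */
glVertexPointer(3, GL_FLOAT, 6 * sizeof(GLfloat), (void*)0);
glNormalPointer(GL_FLOAT, 6 * sizeof(GLfloat), (void*)(3 * sizeof(GLfloat)));

glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);                /* the attribute that changes every frame */
glBufferData(GL_ARRAY_BUFFER, vertCount * 4 * sizeof(GLfloat), NULL, GL_STREAM_DRAW);  /* orphan old data */
glBufferSubData(GL_ARRAY_BUFFER, 0, vertCount * 4 * sizeof(GLfloat), newColours);
glColorPointer(4, GL_FLOAT, 0, (void*)0);

glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);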

I already attempt to reduce VBO binds and gl*Pointer calls a little. I do optimise triangles and VB entries for the cache. I stopped using interleaving after I benchmarked it exhaustively on a few different boards and found it did little or nothing.

The engine is used in numerous and diverse applications so it’s hard to optimise very specifically. The renderer is multi-pass. In a purely hypothetical example, say you had a surface rendered in 4 passes with a total of 6 attribs, and the passes use them like this:
[a0,a1]
[a0,a1,a2,a3]
[a0,a1,a2,a4]
[a0,a2,a5]
What would you do? Or better, what would you not do?

From tamlin’s post I gather he might suggest something along the lines of interleaving a0, a1 and a2 and sticking the rest somewhere after them in the same VBO.
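Something like this, maybe (purely my guess at a layout, assuming each attrib is 3 floats and using generic glVertexAttribPointer slots; vbo and vertCount are placeholders). a0-a2 interleaved at the front since most passes use them together, a3-a5 packed one after another behind them, all in the same buffer, so switching passes only changes pointer setup, never the binding:

size_t n        = vertCount;
size_t strideA  = 9 * sizeof(GLfloat);                /* interleaved a0,a1,a2 */
size_t blockA3  = n * strideA;                        /* tightly packed a3 */
size_t blockA4  = blockA3 + n * 3 * sizeof(GLfloat);  /* tightly packed a4 */
size_t blockA5  = blockA4 + n * 3 * sizeof(GLfloat);  /* tightly packed a5 */

glBindBuffer(GL_ARRAY_BUFFER, vbo);                   /* one bind covers all passes */

/* e.g. pass 2 of the example, which wants a0,a1,a2,a3: */
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, strideA, (void*)0);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, strideA, (void*)(3 * sizeof(GLfloat)));
glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, strideA, (void*)(6 * sizeof(GLfloat)));
glVertexAttribPointer(3, 3, GL_FLOAT, GL_FALSE, 0, (void*)blockA3);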

tamlin, what do you mean by “are they at least properly aligned”?

Apologies for my initial post being a bit confusing (that was me trying to be clear). I should have put it in the form of questions rather than examples… Something like:

a) What is the overhead of binding a VBO that contains unnecessary attributes? Is this a really bad idea? What about interleaving unused attributes?

b) How does performance of spanning attribs across multiple VBOs compare to that of having them non-interleaved in the same VBO?

c) What does a considerably higher VBO memory usage result in? Memory being paged in?
Slower memory being used? What kind of hit can I expect?

Thanks everyone! (except for knackered)

F uck you too, madoc.

Damn, I was hoping for something more humorous. Anyway, that was intended as an affectionate jest, not a f uck you. You’re actually my favourite forum member and the only one to have received a rating from me (5*).

Aha, and I only meant it to be an affectionate f uck you, so there! I out-did your feigned aggression.

Oh yeah? Well… I meant it! But I’m willing to reconcile, how about I buy you a pint?

Anyway, is there some trick to getting a boost out of interleaving? I understand how it works in theory, but Google found me lots of people claiming their benchmarks showed there was nothing to be gained, versus one claim of a 15% boost, and to quote John, “You absolutely want to interleave data”. But also “assuming that all the channels of that data are being used”, which suggests I should not use it with something like the above-mentioned multipass setup.

I think alignment is more important, and interleaving gets you 32-byte alignment without wasting quite so much space on padding.

Anyway, is there some trick to getting a boost out of interleaving?

Well, no, but odds are you aren’t going to get a performance decrease from it (unless you have a whole lot of alignment padding), so you may as well do it where possible.
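To be concrete about the 32-byte thing, this is what I take “padding to 32-byte boundaries” to mean (an example layout, nothing more): pad the interleaved vertex so its size is a multiple of 32 bytes, so every vertex starts on a 32-byte boundary.

typedef struct {
    GLfloat pos[3];       /* 12 bytes */
    GLfloat normal[3];    /* 12 bytes */
    GLfloat uv[2];        /*  8 bytes -> 32 total, nothing to pad */
} Vert32;

typedef struct {
    GLfloat pos[3];       /* 12 bytes */
    GLfloat uv[2];        /*  8 bytes -> 20 so far... */
    GLfloat pad[3];       /* ...12 bytes of padding to reach 32 */
} Vert32Padded;

/* poor man's compile-time check that both really are 32 bytes */
typedef char vert32_size_check[(sizeof(Vert32) == 32 && sizeof(Vert32Padded) == 32) ? 1 : -1];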

summary of the big things I got a benefit from (kind of in order of benefit):-
1/ dropping tristrips in favour of trilists, eliminating lots of degenerates.
2/ face sorting for post-transform cache coherence.
3/ sorting by VBO and offsetting indices, to reduce glVertexPointer calls.
4/ always using 16bit indices, and allocating VBOs accordingly.
5/ interleaving with padding to 32 byte boundaries.
6/ vertex sorting based on index fetch order for pre-transform cache coherence (rough code sketch at the end of this post).

I was batch-bound, in that I was drawing scenes composed of insane numbers of batches that weren’t practical to merge into bigger ones.
I still think I shouldn’t have to be doing this stuff, because I’m sure with each generation of cards the priorities will swing back and forth. Geometry display lists for static geometry would be the right thing to do, in my opinion.
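For number 6, the vertex reorder is basically this (rough code, names invented; assumes <stdlib.h>, <string.h> and a Vert struct like the ones above, and it’s run after the face sort so triangle order is preserved): walk the index buffer, move each vertex to the next free slot the first time it’s referenced, and rewrite the indices to match.

void reorderVertsByFirstUse(Vert *verts, unsigned vertCount,
                            GLushort *indices, unsigned indexCount)
{
    unsigned *remap  = (unsigned*)malloc(vertCount * sizeof(unsigned));
    Vert     *sorted = (Vert*)malloc(vertCount * sizeof(Vert));
    unsigned  i, next = 0;

    for (i = 0; i < vertCount; ++i)
        remap[i] = ~0u;                         /* "not placed yet" */

    for (i = 0; i < indexCount; ++i) {          /* first-use order of the index buffer */
        GLushort old = indices[i];
        if (remap[old] == ~0u) {
            remap[old] = next;
            sorted[next] = verts[old];
            ++next;
        }
        indices[i] = (GLushort)remap[old];
    }

    for (i = 0; i < vertCount; ++i)             /* unreferenced verts keep the trailing slots */
        if (remap[i] == ~0u)
            sorted[next++] = verts[i];

    memcpy(verts, sorted, vertCount * sizeof(Vert));
    free(sorted);
    free(remap);
}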

You got a benefit (increase of performance) by dropping triangle strips? I don’t understand how this is possible, could you explain it a little bit further? What degenerates?

CatDog

To draw everything in a single batch, people generally link multiple tristrips together with invisible triangles built by specifying the same index for more than one corner (degenerate triangles). The hardware will probably process these triangles exactly the same way as a normal triangle (take a look at the wireframe), so they’ll still involve a cache look-up etc. Then the rasterizer will reject the zero-area triangle.
I measured a slight increase in performance (half to one million tris/sec) by converting the tristrips into triangle lists, thus eliminating the need for degenerate triangles.
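To make it concrete, a toy example of my own: two quads, one on verts 0-3 and one on verts 4-7, linked into one strip versus drawn as a plain list.

Linked strips (4 degenerate triangles hidden in the middle):
index = 0, 1, 2, 3, 3, 4, 4, 5, 6, 7
glDrawElements(GL_TRIANGLE_STRIP, 10, …, …);

Same geometry as a triangle list, no degenerates:
index = 0, 1, 2, 2, 1, 3, 4, 5, 6, 6, 5, 7
glDrawElements(GL_TRIANGLES, 12, …, …);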