VBO layout of attributes



Madoc
01-18-2008, 05:26 AM
If you have a quantity of vertex attributes that are bound together in varying configurations, what would be an ideal VBO layout for performance?

Consider the following:

a) Put all attributes in one big VBO and gl*pointer those necessary.
Does this cause binding of unnecessary attributes and hence an overhead? Perhaps in low-memory situations?

b) Put each optional attribute in its own VBO and bind it as needed.
Does using and binding a larger number of VBOs incur an overhead? How severe?

c) Create a complete VBO for each possible combination of attributes.
Well, obviously quite wasteful of memory, and could also be expensive if attribs need frequent data updates.

I'm not actually suggesting these as solutions but I thought they each clearly illustrated the problem and potential overheads. How would you weigh these overheads against each other?

I didn't get into application specific benefits or things like interleaving but if you have anything to add...

Target hardware would be PC consumer.

-NiCo-
01-18-2008, 05:55 AM
If your application needs frequent updates of a single attribute, I would go for option B.

If all vertex attributes need to be updated at once, I would suggest glInterleavedArrays. Performance-wise, this option should be the better choice because of data locality (no large jumps in memory when accessing the different attributes of a single vertex).

N.

Madoc
01-18-2008, 06:08 AM
That makes sense but I'm actually more concerned about rendering performance than buffer updates. Unfortunately b is also a bit of a bitch to implement (or rather inelegant) for a very general purpose renderer.

Interleaving is actually worth discussing in more detail. Would option a suffer much with interleaving? Would option c actually be worthwhile with interleaving, despite the memory cost?

Thanks

-NiCo-
01-18-2008, 06:27 AM
I'm not sure what your goal is... I know you're worried about performance but why would you want to store all possible combinations of vertex attributes? Do you want it to spend much time on loading the data and then be able to quickly switch between different combinations of attributes at rendering time? If this is the case, why would option B be an inelegant solution?

N.

knackered
01-18-2008, 08:45 AM
glVertexPointer calls are really expensive with VBO. Do it 100,000 times a frame and it becomes a bottleneck.
Personally I create a fixed-size VBO (16-bit addressable) for a format on first encounter and populate it until full (offsetting indices depending on the destination position of the verts within the VBO); when there is not enough room I create another fixed-size VBO for that format. So each format has its own list of VBOs, and I sort by VBO (after material/xform) so I only call glVertexPointer once for a big batch of meshes.
I get very good performance improvements with this scheme, but the type of scene helps - low material count, high batch count, high poly count. Basically engineering data.
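
The pooling scheme described above can be sketched on the CPU side roughly as follows. This is only an illustration under assumptions: the names Pool, Placement and addMesh are hypothetical, one Pool stands in for one fixed-size VBO of a given vertex format, and the actual buffer uploads and glVertexPointer calls are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping for one vertex format. Each pool stands in for one
// fixed-size VBO whose vertices can all be addressed with 16-bit indices.
constexpr std::size_t kPoolCapacity = 65536;

struct Pool {
    std::size_t used = 0;                // vertices already placed in this VBO
    std::vector<std::uint16_t> indices;  // rebased indices of all meshes in it
};

struct Placement {
    std::size_t pool;        // which VBO the mesh landed in
    std::size_t baseVertex;  // position of its first vertex inside that VBO
};

// Places a mesh in the last pool if it fits, otherwise opens a new pool, and
// appends the mesh's indices shifted by the destination position.
Placement addMesh(std::vector<Pool>& pools, std::size_t vertexCount,
                  const std::vector<std::uint16_t>& localIndices) {
    assert(vertexCount <= kPoolCapacity);
    if (pools.empty() || pools.back().used + vertexCount > kPoolCapacity) {
        pools.emplace_back();
    }
    Pool& pool = pools.back();
    const std::size_t base = pool.used;
    for (std::uint16_t i : localIndices) {
        assert(i < vertexCount);
        pool.indices.push_back(static_cast<std::uint16_t>(base + i));
    }
    pool.used += vertexCount;
    return {pools.size() - 1, base};
}
```

Sorting draw calls by the returned pool number is what would let one vertex pointer setup serve a whole run of meshes.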

jwatte
01-18-2008, 02:34 PM
You absolutely want to interleave data, assuming that all the channels of that data are being used. This allows the vertex fetch circuitry to do a better job of streaming data onto the card (having to do with how DRAM is accessed).

tamlin
01-18-2008, 03:20 PM
knackered has a point, assuming you suspect the software is ever to be used on h/w not capable of handling int indices (in h/w). I've used a similar approach myself and it's often about as fast as you can get (offsetting an index buffer on the CPU before sending it to the server is hardly noticeable).

However, 32-bit indices changed the game. Bigtime. All of a sudden locality-of-reference comes into play in a much bigger way. All of a sudden we need to start thinking about what data is needed in what pass on the server, and in what order.

As it's obvious that sequential read is the fastest (what OP requested) I won't go into this any more than suggesting to think of "when is what data needed".

As for OP's questions:
a) Too little info to answer. Are the attributes interleaved or one-after-another? If the latter, are they at least properly aligned?
b) Don't do this unless you have to. Allocate a "scratch" space in the VBO for this stuff, that you can use without switching buffer. VRAM is "cheap", buffer-switching is expensive.
c) "each possible combination of attributes"? If you need something for worst-case (f.ex. 142 texture coordinates :) ) don't bother. Use the common case (even if you create a buffer for 1-2 more texture coordinates than you usually use) and save the extra stuff for when you actually need the extra attribute space.

Of course, this as always is domain specific. If I got these questions in context of my half-a-gigabyte VRAM card, I might answer one thing. If it's in the context of a PS/2... you get the point.

knackered
01-18-2008, 07:43 PM
by locality of reference, you mean have vertices arranged in order that they're referenced by the indices of the primitive you're rendering?
if so, yes that gave me a bit of a jump too.
Look into the d3dx functions OptimizeFaces and OptimizeVertices if you want to get some marked improvements in vertex throughput.
BTW, I still get much better performance with 16bit indices than 32bit on the latest nvidia chipsets...
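
A minimal sketch of arranging vertices in the order the indices reference them, for anyone without the d3dx functions at hand. The function name reorderByFirstUse and the kUnused sentinel are hypothetical; the caller is assumed to move the actual vertex data according to the returned remap table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marks vertices that no index refers to.
constexpr std::uint32_t kUnused = 0xFFFFFFFFu;

// Renumbers vertices in the order the index list first touches them.
// Rewrites the index list in place and returns remap[oldIndex] = newIndex.
std::vector<std::uint32_t> reorderByFirstUse(std::vector<std::uint32_t>& indices,
                                             std::size_t vertexCount) {
    std::vector<std::uint32_t> remap(vertexCount, kUnused);
    std::uint32_t next = 0;
    for (std::uint32_t& idx : indices) {
        assert(idx < vertexCount);
        if (remap[idx] == kUnused) remap[idx] = next++;
        idx = remap[idx];
    }
    return remap;
}
```

After the remap, a list that jumped around the buffer (say 0, 1, 33, 33, 1, 45, ...) walks it front to back instead.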

V-man
01-18-2008, 10:01 PM
Don't forget the pre-vertex cache and post-vertex cache.
The pre-vertex cache is a bit like the CPU cache. You would want your vertices to be local in VRAM to benefit from it the most.
The post-vertex cache is beneficial when you reuse vertices, so it depends on your indices.

An ideal case would be something like
index[] = 0, 1, 2, 2, 1, 3, 3, 1, 4, ...
glDrawElements(GL_TRIANGLES, ......, ..);


Sucky precache usage, good post cache usage :
index[] = 0, 1, 33, 33, 1, 45, 45, 1, 4, ...
glDrawElements(GL_TRIANGLES, ......, ..);


Good pre-cache usage, sucky post cache usage :
index[] = 0, 1, 2, 3, 4, 5, 6, 7, 8, ...
glDrawElements(GL_TRIANGLES, ......, ..);
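
The post-cache side of the three index lists above can be compared with a small simulation. This is a sketch under assumptions: it models the post-transform cache as a plain FIFO of a given size, which real hardware may not match, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts vertex shader runs for an indexed triangle list, assuming a FIFO
// post-transform cache holding cacheSize vertices.
std::size_t countCacheMisses(const std::vector<std::uint32_t>& indices,
                             std::size_t cacheSize) {
    std::deque<std::uint32_t> fifo;
    std::size_t misses = 0;
    for (std::uint32_t idx : indices) {
        if (std::find(fifo.begin(), fifo.end(), idx) != fifo.end()) continue;  // hit
        ++misses;
        fifo.push_back(idx);
        if (fifo.size() > cacheSize) fifo.pop_front();  // evict the oldest entry
    }
    return misses;
}

// Transformed vertices per triangle; lower is better.
double missesPerTriangle(const std::vector<std::uint32_t>& indices,
                         std::size_t cacheSize) {
    assert(!indices.empty() && indices.size() % 3 == 0);
    return static_cast<double>(countCacheMisses(indices, cacheSize)) /
           static_cast<double>(indices.size() / 3);
}
```

With a 16-entry cache, the first two lists each need 5 vertex transforms for their 3 triangles, while the third needs 9 because no vertex is ever reused. The simulation says nothing about the pre-vertex cache, which depends on where the vertices sit in memory.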



knackered wrote: "BTW, I still get much better performance with 16bit indices than 32bit on the latest nvidia chipsets..."
Sure, it's all about fitting more into the cache.

Woohoo, 3000 posts!

knackered
01-19-2008, 05:04 AM
that's why I pointed him at OptimizeFaces and OptimizeVertices - for pre and post vertex cache sorting.

Madoc
01-19-2008, 08:21 AM
In the days of VAR a lot of stuff was more obvious, with VBOs it's hard to find out what's going on under the hood. I'm doing a rewrite of my first VBO implementation which was based on guesswork at best. I can't say I'm having performance problems, and I doubt I have much to gain but as I'm making some changes anyway I want to be well informed.

So, this is what I've understood so far:

1) As switching VBOs is expensive, prefer sticking more data at the end of the VBO than getting it from another VBO.

2) Interleaving is definitely worthwhile.

3) In the case of dynamic attribs, putting those in a separate VBO can be beneficial.

I already attempt to reduce VBO binds and gl*pointer calls a little. I do optimise triangles and VB entries for cache. I stopped using interleaving after I benchmarked it exhaustively on a few different boards and found it did little or nothing.

The engine is used in numerous and diverse applications so it's hard to optimise very specifically. The renderer is multi-pass. In a purely hypothetical example, if you had a surface rendered in 4 passes, having a total of 6 attribs and the passes use them like this:
[a0,a1]
[a0,a1,a2,a3]
[a0,a1,a2,a4]
[a0,a2,a5]
What would you do? Or better, what would you _not_ do?

From tamlin's post I gather he might suggest something along the lines of interleaving a0, a1 and a2 and sticking the rest somewhere after in the same VBO.
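
One way to arrive at that split mechanically is to count how many passes read each attribute. The sketch below is purely illustrative: the "more than half of the passes" rule and the function names are assumptions made for the example, not something anyone in the thread proposed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each attribute, counts how many passes read it.
std::vector<int> countUses(const std::vector<std::vector<std::size_t>>& passes,
                           std::size_t attribCount) {
    std::vector<int> uses(attribCount, 0);
    for (const auto& pass : passes) {
        for (std::size_t a : pass) {
            assert(a < attribCount);
            ++uses[a];
        }
    }
    return uses;
}

// Hypothetical rule: interleave the attributes read by more than half of the
// passes; the remaining ones would go after the interleaved block.
std::vector<std::size_t> pickInterleaved(const std::vector<int>& uses,
                                         std::size_t passCount) {
    std::vector<std::size_t> chosen;
    for (std::size_t a = 0; a < uses.size(); ++a) {
        if (2 * static_cast<std::size_t>(uses[a]) > passCount) chosen.push_back(a);
    }
    return chosen;
}
```

For the four passes above the counts come out as 4, 3, 3, 1, 1, 1, so this rule picks a0, a1 and a2 for the interleaved block.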

tamlin, what do you mean by "are they at least properly aligned"?


Apologies for my initial post being a bit confusing (that was me trying to be clear). I should have put it in the form of questions rather than examples... Something like:

a) What is the overhead of binding a VBO that contains unnecessary attributes? Is this a really bad idea? What about _interleaving_ unused attributes?

b) How does performance of spanning attribs across multiple VBOs compare to that of having them non-interleaved in the same VBO?

c) What does a considerably higher VBO memory usage result in? Memory being paged in?
Slower memory being used? What kind of hit can I expect?

Thanks everyone! (except for knackered)

knackered
01-20-2008, 08:57 AM
F uck you too, madoc.

Madoc
01-20-2008, 09:04 AM
Damn, I was hoping for something more humorous. Anyway, that was intended as an affectionate jest, not a f uck you. You're actually my favourite forum member and the only to have received a rating from me (5*).

knackered
01-21-2008, 05:53 AM
Aha, and I only meant it to be an affectionate f uck you, so there! I out-did your feigning aggression.

Madoc
01-21-2008, 07:30 AM
Oh yeah? Well... I meant it! But I'm willing to reconcile, how about I buy you a pint?

Anyway, is there some trick to getting a boost out of interleaving? I understand how it works in theory but google found me lots of people claiming that benchmarks led them to believe there is nothing to be gained vs one claim of a 15% boost and to quote John "You absolutely want to interleave data". But also "assuming that all the channels of that data are being used", which suggests I should not use it with something like the above mentioned multipass setup.

knackered
01-21-2008, 07:45 AM
I think alignment is more important, and interleaving gets you 32 byte alignment without wasting space with quite so much padding.
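
As an illustration of the alignment point, here is a sketch of an interleaved vertex that fills exactly one 32-byte block. The Vertex layout (position, normal, one texture coordinate set) and the padTo32 helper are assumptions chosen for the example.

```cpp
#include <cstddef>

// Hypothetical interleaved vertex: 8 floats, so no padding is needed.
struct Vertex {
    float position[3];  // 12 bytes at offset 0
    float normal[3];    // 12 bytes at offset 12
    float texcoord[2];  //  8 bytes at offset 24
};
static_assert(sizeof(Vertex) == 32, "one vertex per 32-byte block");
static_assert(offsetof(Vertex, normal) == 12, "normal follows position");
static_assert(offsetof(Vertex, texcoord) == 24, "texcoord follows normal");

// Rounds a vertex stride up to the next multiple of 32 bytes.
constexpr std::size_t padTo32(std::size_t bytes) {
    return (bytes + 31) / 32 * 32;
}
static_assert(padTo32(32) == 32, "already aligned");
static_assert(padTo32(36) == 64, "a ninth float costs 28 bytes of padding");
```

Stored as separate arrays, each 12-byte position or normal would need 20 bytes of padding to reach a 32-byte boundary per element; interleaved, the three attributes share one block. sizeof(Vertex) and the offsets are what would be passed as stride and pointer to the gl*Pointer calls.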

Korval
01-21-2008, 11:29 AM
Madoc wrote: "Anyway, is there some trick to getting a boost out of interleaving?"

Well, no, but odds are you aren't going to get a performance decrease from it, (unless you have a whole lot of alignment padding), so you may as well do it where possible.

knackered
01-21-2008, 04:55 PM
summary of the big things I got a benefit from (kind of in order of benefit):-
1/ dropping tristrips in favour of trilists, eliminating lots of degenerates.
2/ face sorting for post-transform cache coherence.
3/ sorting by VBO and offsetting indices, to reduce glVertexPointer calls.
4/ always using 16bit indices, and allocating VBO's accordingly.
5/ interleaving with padding to 32 byte boundaries.
6/ vertex sorting based on index fetch order for pre-transform cache coherence.

I was batch-bound, in that I was drawing scenes composed of insane numbers of batches that weren't practical to merge into bigger ones.
I still think I shouldn't have to be doing this stuff, because I'm sure with each generation of cards the priorities will swing back and forth. Geometry display lists for static geometry would be the right thing to do, in my opinion.

CatDog
01-27-2008, 04:56 AM
knackered wrote: "dropping tristrips in favour of trilists, eliminating lots of degenerates."
You got a benefit (increase of performance) by dropping triangle strips? I don't understand how this is possible, could you explain it a little bit further? What degenerates?

CatDog

knackered
01-27-2008, 01:14 PM
To draw everything in a single batch, people generally link multiple tristrips together with invisible triangles built by specifying the same index for more than 1 corner (degenerate triangles). The hardware will probably process these triangles exactly the same way as if it were a normal triangle (take a look at the wireframe), so they'll involve a cache look-up etc. Then the rasterizer will reject the zero area triangle.
I measured a slight increase in performance (half to one mtps) by converting the tristrips into triangle lists, and thus eliminating the need for degenerate triangles.
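
A minimal sketch of that conversion, assuming 16-bit indices and strips stitched with repeated indices. The function name stripToList is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Expands an indexed triangle strip into an indexed triangle list, dropping
// degenerate triangles (any two corners sharing an index).
std::vector<std::uint16_t> stripToList(const std::vector<std::uint16_t>& strip) {
    std::vector<std::uint16_t> list;
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        std::uint16_t a = strip[i], b = strip[i + 1], c = strip[i + 2];
        if (a == b || b == c || a == c) continue;  // stitch triangle, skip it
        if (i % 2 == 1) std::swap(a, b);           // odd triangles flip winding
        list.push_back(a);
        list.push_back(b);
        list.push_back(c);
    }
    return list;
}
```

For two four-vertex strips stitched as 0, 1, 2, 3, 3, 4, 4, 5, 6, 7 the strip describes 8 triangles, 4 of them degenerate; the list keeps only the 4 visible ones at the cost of 12 indices instead of 10.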

CatDog
01-27-2008, 02:59 PM
You mean "swap"-operations in triangle strips? Sure, the number of swaps should be as low as possible... but, uhm, it's hard to believe that you got a better result with simple triangle lists. (Although I don't want to argue about that, since you measured it.)

Maybe I should do some tests on current hardware.

CatDog

Dark Photon
01-28-2008, 06:43 AM
knackered wrote: "dropping tristrips in favour of trilists, eliminating lots of degenerates."
CatDog wrote: "You got a benefit (increase of performance) by dropping triangle strips? I don't understand how this is possible, could you explain it a little bit further? What degenerates?"
knackered would have to say for sure, but his tristrips-to-trilists transition may also have been an explicit-vertex to indexed-vertex transition (e.g. gl{Multi}DrawArrays -> glDraw{Range}Elements). With the former, you have no potential to get < 1 vertex shader run per triangle (the average cache miss ratio [ACMR]). With indexed vertices, you do.

So in addition to eliminating degenerates, perhaps there was enough locality in his data that he got some vertex cache-induced speed-up even before sorting triangles to get closer to 0.5 ACMR. Also, the driver/hardware is probably more efficient in handling indexed primitives (beyond vert cache benefit), which would make perfect sense as that's where the real performance is.

So yeah, strips are dead. The only thing they may get you now is eliminating a few bytes sending duplicate indices to the GPU, and I've never seen that to be a bottleneck. Besides, there's no cross-vendor NV_primitive_restart, the conditions under which you can use even that are extremely limited, and you sure don't want to break a batch just to start a new strip. And degenerate triangles are out as options because they're not recognized until after the vertex shader, and thus eat triangle setup time and vertex index bandwidth.

CatDog
01-28-2008, 07:15 AM
Of course, indexed primitives are for sure better than explicit vertex arrays. If knackered moved from vertex arrays to indexed vertices, he would have said so! (Err, knackered?)

I'm currently using glMultiDrawElements(GL_TRIANGLE_STRIP). The strips are somewhat optimized for cache performance, especially for good (as V-man called it) postcache usage.

I was very happy with the results, until now... would you say, it's worth a try to change to glMultiDrawElements(GL_TRIANGLES)? (I'm really curious if it is *worth* it.)

<s>Ah, and what about vertex attributes like normals or shader attributes? Using triangle lists would mean to duplicate these things much more often than with strips. Don't you see this as a drawback?</s> *edit* Ouch, forget this last one, please. :-)

CatDog

Dark Photon
01-28-2008, 07:26 AM
CatDog wrote: "I'm currently using glMultiDrawElements(GL_TRIANGLE_STRIP). The strips are somewhat optimized for cache performance...would you say, it's worth a try to change to glMultiDrawElements(GL_TRIANGLES)? (I'm really curious if it is *worth* it.)"
If you're curious I'd say yes, but switch to glDrawElements instead. If you need the multi to join strips, you no longer need that when you switch to triangles.

Past posts have stated that glMulti*Draw APIs don't really help you at all unless you're CPU limited, suggesting it may just be a for loop down in the GL driver, maybe saving a stack frame or two of API calls and a little setup/validation. That may have changed. Try it and see.

AlexN
01-28-2008, 01:34 PM
I can second knackered's results, I've noticed a small benefit from using indexed triangle lists as opposed to indexed strips. This is with simply using the same indices as an optimized triangle strip, minus the degenerate stitches.

knackered
01-28-2008, 01:35 PM
Well of course I'm using indexed primitives - I'm not an idiot.
Catdog, you're using glMultiDrawElements(GL_TRIANGLE_STRIP)???
And you're getting spec performance? I always read that the multi functions just call the single functions in a loop in the driver, which would kill performance. Unless you're compiling it all into a display list? In which case take my word for it, you'll get much better performance using VBO with some decent manager code to reduce buffer binds (on nvidia anyway).
Oh, and always use glDrawRangeElements - otherwise something on the vendor side (driver or gpu) will have to find out this information every time you draw a batch (makes about .01 mtps difference on my setup, but every little helps).
Thanks, I forgot about the CMR, darkphoton. There's some articles scattered around t'internet that describe the improvements you get in CMR when you use indexed tri lists as opposed to indexed strips.
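
The information glDrawRangeElements wants up front is the lowest and highest vertex index a batch touches. A sketch of computing it once at load time, assuming 16-bit indices; the helper name indexRange is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns {lowest, highest} vertex index referenced by the count indices
// starting at position first.
std::pair<std::uint16_t, std::uint16_t> indexRange(
        const std::vector<std::uint16_t>& indices,
        std::size_t first, std::size_t count) {
    assert(count > 0 && first + count <= indices.size());
    const auto begin = indices.begin() + static_cast<std::ptrdiff_t>(first);
    const auto mm = std::minmax_element(begin, begin + static_cast<std::ptrdiff_t>(count));
    return {*mm.first, *mm.second};
}
```

The pair would be stored with the batch and passed as the start and end arguments of glDrawRangeElements, so nothing has to scan the indices at draw time.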

CatDog
01-28-2008, 02:04 PM
Well, now I am really very curious.

knackered, what is so bad about letting the driver loop within glMultiDrawElements(GL_TRIANGLE_STRIP)? I just thought, there had to be a good reason for introducing this routine in GL1.4 - so I used it. Maybe it is giving the driver the opportunity to build something like a display list internally by itself!? Whatever. Oh my... do I always have to know driver internals to get the best from OpenGL? Obviously, the answer is: yes.

So, that's what I'm taking home today:
- Strips are dead. (Still, I have to see this with my own eyes.)
- Use old glDrawRangeElements()

CatDog

knackered
01-28-2008, 02:20 PM
If the driver's calling drawelements under the hood then the gpu is going to be starved of things to do while it waits for the next batch. This can't be happening, otherwise I'm sure you'd have noticed very bad performance.
Just do some benchmarking - look on the box your graphics card came in and measure what million-tris-per-second you're getting and compare with the figure on the box.
I agree that all this stuff should be hardware abstracted, but the consensus of developers seems to be against it for reasons best left to the imagination....I suppose some people get a kick out of tweaking mesh data based on a vendor string. The fact that there have been at least 4 papers published about how best to format your mesh data based on an unknown cache size speaks volumes to me. The objections can't be about speed of loading, because the d3dx OptimizeFaces makes a negligible difference to load times in my system.

CatDog
01-28-2008, 03:11 PM
I just tried my favorite STL file. It's a very good 3d scan of a statue, that my stripifier can turn into a single mesh of tristrips. So this mesh is rendered by a single glMultiDrawElements-Call.

In total, it contains 1.8M triangles. On my Asus 7950GX2 it renders at around 80FPS (fixed lighting with one light source). So it is around 144M tris/second. Now anyone knows if this is a good value? I can't find any spec to compare, only marketing buzz everywhere, damn.

CatDog

CatDog
01-29-2008, 06:24 AM
Ok, I did some benchmarking.

Data is that STL file with 1.8M triangles.
As AlexN proposed, I am converting the indexed strips to indexed triangle lists, maintaining the cache friendliness.

glMultiDrawElements(GL_TRIANGLE_STRIP) -> 80 fps
glMultiDrawElements(GL_TRIANGLES) -> 86 fps

Cool! But, now I'm rendering this data 10 times (=18M tris), simply by calling glMultiDrawElements 10 times.

glMultiDrawElements(GL_TRIANGLE_STRIP) -> 9.0 fps
glMultiDrawElements(GL_TRIANGLES) -> 8.6 fps

Interesting, huh?

I also tried some "real world" engineering data (= high batch count). Using triangle lists is never significantly slower, and sometimes slightly faster. Unfortunately, the advantage never seems to grow as the data size increases, just as seen above. So in the real world, the improvements are not noticeable.
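
Converted to triangle throughput, the four fixed-lighting measurements above look like this. The helper is a hypothetical one-liner; frame rates are given in tenths of a frame per second so the integer arithmetic stays exact.

```cpp
#include <cstdint>

// Triangles per second from a triangle count and a frame rate in tenths of fps.
constexpr std::uint64_t trisPerSecond(std::uint64_t triangles,
                                      std::uint64_t fpsTenths) {
    return triangles * fpsTenths / 10;
}

static_assert(trisPerSecond(1800000, 800) == 144000000, "strips, 1.8M tris at 80 fps");
static_assert(trisPerSecond(1800000, 860) == 154800000, "lists, 1.8M tris at 86 fps");
static_assert(trisPerSecond(18000000, 90) == 162000000, "strips, 18M tris at 9.0 fps");
static_assert(trisPerSecond(18000000, 86) == 154800000, "lists, 18M tris at 8.6 fps");
```

The triangle lists deliver the same 154.8M tris/s in both runs, while the strips go from 144M to 162M tris/s when the mesh is drawn ten times.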


Now I'm using a somewhat complicated fragment based point light shader instead of fixed lighting:

glMultiDrawElements(GL_TRIANGLE_STRIP) -> 7.2 fps
glMultiDrawElements(GL_TRIANGLES) -> 5.1 fps

So if I didn't do something wrong, I can not confirm, that using triangle list is always faster (on my machine of course).

(Testing with glDrawRangeElements will take some time, since it needs rearranging my VBO layout.)

Any suggestions?

CatDog

Dark Photon
01-29-2008, 10:11 AM
CatDog wrote: "Any suggestions?"
Well, first, make sure you are vertex bound. All of this discussion is geared toward improving vertex throughput (and to a small degree, reducing CPU load, with Multi vs. non-multi). It does nothing for fill. So first thing, shrink your window to 1x1 pixel (or as small as you can get it) to hopefully ensure you are not fill bound. If you are fill bound in any of the tests above, the timings don't prove anything. If you shrink your window and the timings improve, you are.

Now, you still could be either vertex or CPU bound here (we'd like to be vertex bound for the test). In the case of your high batch count data, you may be CPU bound, which means GPU pipeline bubbles just due to that, and your vertex cache efficiency is only mildly relevant. Also even if you're not CPU bound, strips to tris alone is unlikely to net you a big win unless you re-optimize your triangle order. Don't just convert strips to triangles and dump them in the buffer.

Hope this gives you some ideas.

bobvodka
01-29-2008, 10:25 AM
Just a small point, AlexN proposed swapping your glMultiDrawElements call for a glDrawElements call for the GL_TRIANGLES, which with a 32bit index buffer is a single draw call (well, I say that, it's more about vertices than triangles but meh, you get the idea).

knackered
01-29-2008, 01:50 PM
catdog, either use d3dxOptimizeFaces/Vertices or download this:-
http://www.deep-shadows.com/hax/3DRipperDX.htm#Download
Install it, and browse into the installation directory - and nick a file called VCache.h. It contains an implementation of Tom Forsyth's optimiser, which you can read about here:-
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
Be warned though, this particular implementation is quite slow at processing the data - but you can re-factor it to get the speed up.

CatDog
01-30-2008, 06:30 AM
Dark Photon, I'm not CPU bound. The third test may be fill bound, since this was an expensive fragment shader, but the first two were vertex bound, definitely.

Next, I'm going to try to optimize the meshes using Forsyth's method. Thanks for the links, knackered, this will speed things up for me!

I'll come back here soon.

CatDog

CatDog
02-17-2008, 10:31 AM
Small update.

I tried out the VCache.h implementation. It produces good triangle lists, but they are not really faster than my old strips. It depends heavily on the scene. Sometimes it's 5% faster, sometimes not. But the problem is that this particular implementation only works with small meshes. I tried to feed it my STL files with 2-5 million tris... and no way! This would run for days! (OK, you warned about it, knackered, but... looking at the code confirms quadratic runtime behaviour! Ugh.)

I'm now thinking about trying d3dxOptimizeFaces(). Since I never used it, could someone please tell me something about its performance?

CatDog

CatDog
02-17-2008, 11:13 AM
Oh, I was a little too hasty with my judgement. I just loaded a small (500,000 tris) file. It took half an hour to optimize, but it renders at 140% compared to my tri strips!!

Wow, asap I will take a closer look at it!

CatDog

Dark Photon
02-18-2008, 06:03 AM
CatDog wrote: "I tried out the VCache.h implementation. It produces good triangle lists, but they are not really faster than my old strips. It depends heavily on the scene. Sometimes it's 5% faster, sometimes not."
You'll only see the maximum speed-up on scenes where you are vertex bound. For others, you may see no benefit at all from vertex cache optimization if vertex transform overhead is not your bottleneck. That's the breaks of optimizing a pipelined system.


CatDog wrote: "But the problem is that this particular implementation only works with small meshes. I tried to feed it my STL files with 2-5 million tris... and no way! This would run for days!"
Code up the Forsyth algorithm from his write-up. Linear time if you implement his performance tweaks (sounds like VCache may not have those optimizations, which knackered basically stated), and shouldn't take long. Then you can move on to more interesting problems.
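
The heart of Forsyth's method is a per-vertex score; the optimiser repeatedly emits the triangle whose three vertices have the highest combined score. Below is a sketch of that scoring function only. The constants are quoted from memory of his write-up and should be treated as assumptions to check against it; the names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Tuning constants, recalled from the write-up (assumptions, verify there).
constexpr int   kCacheSize         = 32;
constexpr float kCacheDecayPower   = 1.5f;
constexpr float kLastTriScore      = 0.75f;
constexpr float kValenceBoostScale = 2.0f;
constexpr float kValenceBoostPower = 0.5f;

// cachePosition: 0 = most recently used, negative = not in the modelled cache.
// remainingTris: triangles using this vertex that have not been emitted yet.
float vertexScore(int cachePosition, int remainingTris) {
    assert(cachePosition < kCacheSize);
    if (remainingTris <= 0) return -1.0f;  // no triangle needs this vertex
    float score = 0.0f;
    if (cachePosition >= 0) {
        if (cachePosition < 3) {
            score = kLastTriScore;  // belongs to the triangle just emitted
        } else {
            const float scale = 1.0f / (kCacheSize - 3);
            score = std::pow(1.0f - (cachePosition - 3) * scale, kCacheDecayPower);
        }
    }
    // Boost vertices with few triangles left so stragglers get finished off.
    score += kValenceBoostScale *
             std::pow(static_cast<float>(remainingTris), -kValenceBoostPower);
    return score;
}
```

The linear running time comes from only rescoring the vertices currently in the modelled cache after each emitted triangle, rather than rescanning the whole mesh.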


CatDog wrote: "I'm now thinking about trying d3dxOptimizeFaces(). Since I never used it, could someone please tell me something about its performance?"
No clue. Thankfully, my day job is 100% OpenGL. Do you really want Direct3D in your tools pipe?

CatDog
02-18-2008, 07:27 AM
Dark Photon wrote: "Do you really want Direct3D in your tools pipe?"
No! (Well, as long as I use OpenGL. In fact, the existence of such a routine in D3D is one more thing on my contra list. But that's another topic (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=234270#Post234270).)

I just wanted a quick way to test how much I can get from this kind of optimization.

You're so right, there are more interesting problems - I have to find the time somehow in between.

CatDog

knackered
02-18-2008, 11:17 AM
d3dxOptimizeFaces uses the hoppe method and is quite fast at sorting the faces, but not as fast as the forsyth method (when optimised, not like the vcache.h implementation). I couldn't measure any difference in render performance between the two algorithms, so if I were you I'd use the forsyth method (but not the vcache.h version in production code, just use it for inspiration).

CatDog
03-05-2008, 05:48 PM
Last weekend I spent some time on the topic again.

I've got my own implementation of the forsyth method now. Very good, I get increases of performance of 5-30%!

Since it's a greedy algorithm, its success depends on the order of the input triangles. So I rewrote my old tristrip generator and passed its output to the forsyth thing. That gave me another 5-25% increase, especially with huge meshes! I'm guessing that using strip-ordered triangle lists as input decreases the chance of cache misses. It depends on the original data, but with unsorted triangle clouds, pre-sorting seems to work well (although this needs some further testing).

Then I replaced glMultiDrawElements() with a loop over glDrawRangeElements(), of course while staying within the limits given by MAX_INDICES and MAX_VERTICES (GL_MAX_ELEMENTS_INDICES and GL_MAX_ELEMENTS_VERTICES). That resulted in no performance gain. But I noticed a blatant drop when exceeding those limits, not only with glDrawRangeElements() but also with glDrawElements()! Apparently these limits should be respected for all kinds of element arrays, no matter which draw command is used.
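
Splitting a large triangle list so that each draw call stays under the index limit can be sketched as below. Assumptions: maxIndices is the value queried with glGetIntegerv(GL_MAX_ELEMENTS_INDICES, ...), the function name splitDraws is hypothetical, and the sketch only bounds the index count; keeping the vertex range of each chunk under the vertex limit additionally depends on how local the indices are.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits a triangle list of indexCount indices into {first, count} ranges of
// at most maxIndices indices each, never cutting a triangle in half.
std::vector<std::pair<std::size_t, std::size_t>> splitDraws(std::size_t indexCount,
                                                            std::size_t maxIndices) {
    assert(indexCount % 3 == 0 && maxIndices >= 3);
    const std::size_t step = maxIndices - maxIndices % 3;  // whole triangles only
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t first = 0; first < indexCount; first += step) {
        ranges.emplace_back(first, std::min(step, indexCount - first));
    }
    return ranges;
}
```

Each range would become one glDrawRangeElements call, with the index pointer offset by first and the count taken from the pair.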

Finally, and that really made my day, all these changes seem to solve a completely different problem (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=235321#Post235321) that was bugging me for two years now!

Complete success!! :)

Thanks - especially to Dark Photon and knackered!

CatDog

Jackis
03-06-2008, 06:17 AM
So you mean that when you split large chunks into smaller ones to fit the MAX_INDICES and MAX_VERTICES requirements, the performance drops with multithreading went away?

CatDog
03-06-2008, 08:04 AM
No, I just made the observation that on my hardware those limits should always be respected, not only with glDrawRangeElements(). Otherwise performance drops with glDrawElements() as well. This seems independent of the multithreading issue from the other thread.

(Please reply there (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=235321#Post235321) to further discuss the threaded optimization bug.)

CatDog

Dark Photon
03-10-2008, 05:47 AM
CatDog wrote: "while keeping the limits given by MAX_INDICES and MAX_VERTICES...I noticed a blatant drop, when exceeding those limits"
That's surprising, as those numbers are tiny on NVidia:

* GeForce 8800GTX: 4096 indices / 4096 vertices (http://www.delphi3d.net/hardware/viewreport.php?report=1631)
* Radeon X1900 XTX: 65535 indices / 2147483647 vertices (http://www.delphi3d.net/hardware/viewreport.php?report=1630)

What board are you working with?

CatDog
03-10-2008, 10:30 AM
Geforce 7950GX2

It reports 1048576 and 1048576!! (Can't find it on Delphi3D. Here (http://links.mycelium.de/glinfo_7950GX2.htm) is my report. I also submitted it.)

When I exceed these limits, *all* index-based drawing loses performance.

4096?? Unbelievable. So this seems to be an issue for my board only..? It's really hard to believe, that these fundamental drawing routines are defective.

CatDog