VBO optimization: bandwidth & culling granularity



spurserh
06-28-2009, 07:31 AM
Hello everyone,

I am in the process of optimizing my renderer, and I am wondering if I should change its caching behavior a bit. Right now it caches the smallest cullable bits of geometry on the video card, and calls them all individually. This results in about 1000 calls to glCallList() per pass, and there are many passes.

I am planning to switch to use VBOs instead of display lists, and it brings up some interesting questions for me. I am wondering if I should minimize OpenGL calls as much as possible by culling my scene, sorting the geometry by material, transforming it, and then filling in a single VBO for each material by using glMapBuffer or GL_STREAM_DRAW. These VBOs would change just about every frame, but given the small size of my geometry relative to the bandwidth of the video card, I wonder if it would be a win?

What do you think? If I just migrate my renderer to VBOs as-is, the average VBO will be only 50 triangles. Of course there are a few tricks I can play to get that number up, but maybe someone experienced has some insight on how to strike the right balance?

Thanks,

Sean Purser-Haskell

tamlin
06-28-2009, 10:18 AM
A VBO will likely (OK, almost certainly) be slower if you only have 50 triangles. The time to set up and tear down a VBO will easily overshadow the time to submit that data using plain immediate mode. Half a decade (perhaps more) ago, nv suggested ... was it 1k-2k or 10k-20k vertices? Nowadays, the bandwidth to upload a buffer, compared to the time to set it up and then tear it down, suggests buffer sizes "somewhat" larger. Only profiling can nail it on your platform, but I'd start my binary search with at least 1MB.

While it's a good thing to reduce the number of OpenGL calls, what's even more important is to reduce the number of state changes. So yes, you should sort by material.
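To illustrate the point about sorting by material, here is a minimal C++ sketch of the CPU-side bookkeeping only. The `DrawItem` struct and both function names are hypothetical and not taken from any renderer in this thread; the idea is just that, after sorting, each material needs to be bound once per pass rather than once per mesh.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-mesh draw record; the field names are illustrative only.
struct DrawItem {
    int materialId;  // which material / state set the mesh needs
    int meshId;      // which mesh to draw
};

// Counts how many times a material would have to be bound if the items
// were drawn in the given order (one bind per run of equal materials).
std::size_t countMaterialBinds(const std::vector<DrawItem>& items) {
    std::size_t binds = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].materialId != items[i - 1].materialId) {
            ++binds;
        }
    }
    return binds;
}

// Sorts by material so meshes sharing a material are drawn back to back.
// stable_sort keeps the original relative order within each material.
void sortByMaterial(std::vector<DrawItem>& items) {
    const std::size_t bindsBefore = countMaterialBinds(items);
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.materialId < b.materialId;
                     });
    // Sanity check: sorting can never increase the number of binds.
    assert(countMaterialBinds(items) <= bindsBefore);
    (void)bindsBefore;  // avoids an unused-variable warning when NDEBUG is set
}
```

Comparing `countMaterialBinds` before and after `sortByMaterial` on a frame's draw list gives a rough idea of how many state changes the sort saves. It says nothing about actual frame time, which still has to be profiled.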

My concern is - is your scene really as small as 50 triangles, or have you perhaps not considered you can use one VBO for the geometry of many, many objects? If you multiply that with the number of passes you have ("many") where I have to assume the geometry is not changing, I wouldn't be surprised to see a quite measurable performance win if you put all you could into a VBO (or more VBOs, should you need the space).

But to start with, I'd suggest getting some familiarity with VBOs in general, and perhaps with manually rebasing indices when packing many objects and their corresponding index arrays into VBO(s).

++luck;

spurserh
06-28-2009, 10:52 AM
Sorry, I didn't mean to give the impression that I had only 50 triangles.

I am testing with a single model, of which there may be many in a scene, which consists of about 50 000 triangles divided into about 1000 meshes, each with a material assigned. I will be implementing some optimizations to the depth peeling process to reduce the number of passes by an order of magnitude, but right now between 20 and 30 depth peeling passes are done over this model for each light in the scene. That's at least 1 000 000 triangles pushed for just one light.

I am fairly familiar with VBOs from a functional perspective, having worked with them before. I am just wondering about the best way to use them in this case, since there are so many possible ways to do it. The answer may also not be obvious, in which case I may have to just benchmark them all - but I'd rather avoid that if there is an obvious answer.

tamlin
06-28-2009, 12:07 PM
50k triangles is indeed a more suitable number for VBO. :-)

Put all the vertices for all those triangles in a single VBO (if possible).

Manually (on the CPU - it's fast) rebase the indices for the meshes to refer to the rebased vertex index of the corresponding mesh in the resulting VBO. That way you'll only have to bind it once each frame. Just don't forget to put the indices in a VBO too (preferably the same one).
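A minimal C++ sketch of that rebasing step, CPU side only. The `Mesh`, `PackedGeometry` and `packMeshes` names are hypothetical, vertices are reduced to positions to keep it short, and 32-bit indices are an assumption. Each mesh's indices are shifted by the number of vertices already packed before it, and the per-mesh index range is recorded so individual meshes can still be drawn or culled later.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side mesh: positions only, to keep the sketch short.
struct Vertex { float x, y, z; };

struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;  // relative to this mesh's own vertices
};

// Where one mesh's indices ended up inside the packed index array.
struct IndexRange {
    std::size_t firstIndex;  // offset into PackedGeometry::indices
    std::size_t indexCount;
};

struct PackedGeometry {
    std::vector<Vertex> vertices;        // data destined for the vertex buffer
    std::vector<std::uint32_t> indices;  // data destined for the index buffer
    std::vector<IndexRange> ranges;      // one entry per input mesh
};

PackedGeometry packMeshes(const std::vector<Mesh>& meshes) {
    PackedGeometry out;
    for (const Mesh& mesh : meshes) {
        // Every index of this mesh is shifted by the vertices packed so far.
        const std::uint32_t base =
            static_cast<std::uint32_t>(out.vertices.size());
        out.ranges.push_back({out.indices.size(), mesh.indices.size()});
        out.vertices.insert(out.vertices.end(),
                            mesh.vertices.begin(), mesh.vertices.end());
        for (std::uint32_t index : mesh.indices) {
            assert(index < mesh.vertices.size());  // index must be valid locally
            out.indices.push_back(base + index);
        }
    }
    return out;
}
```

The two resulting arrays would then be uploaded once, and `ranges` kept on the CPU to pick which part of the index data to draw.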

If each model indeed consists of 1000 (!) materials, perhaps you should consider not only a texture atlas, but also a higher-level approach? Assuming you can use shaders, uniforms can probably take you a long way to reduce (as I see it) the number of permutations. Perhaps a shader doing lookup from a texture could further help?

As you again mention many passes, I get the feeling this might be many translucent materials (like fur/hair). Could that perhaps be pre-processed to reduce the number of passes?

As for lighting complexity, have you considered deferred lighting (search this forum)?

Anyway, I think there's still too little information to give a more informed answer. I hope I at least pointed you in the right direction re. VBOs, and perhaps gave you some ideas.

++luck;

Dark Photon
06-28-2009, 12:50 PM
...Right now it caches the smallest cullable bits of geometry on the video card, and calls them all individually. This results in about 1000 calls to glCallList() per pass, and there are many passes.

I am planning to switch to use VBOs instead of display lists, and it brings up some interesting questions for me. I am wondering if I should minimize OpenGL calls as much as possible by culling my scene, sorting the geometry by material, transforming it, and then filling in a single VBO for each material by using glMapBuffer or GL_STREAM_DRAW...
From several recent posts, it seems that (from the NVidia camp at least), making lots of draw calls (already pretty efficient) has become even more efficient, to the point where you may not need to sweat pure batch size. Here's one recent post (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=257897#Post257897).

So as tamlin suggests, state changes. Minimize these. However, profile your app first to ensure that this is your primary bottleneck.

spurserh
06-28-2009, 01:31 PM
Thanks for your helpful answers! What other information can I provide to help?

So the consensus, then, is that I should do my best to compile all of the vertex data into as few huge VBOs as possible, and then use smaller VBOs for indexing into this per-mesh? In my situation, this would likely result in one 5-10MB VBO for the vertex data, and then 1000 index VBOs of ~150 indices each for the meshes. Would this be expected to give better performance than one giant VBO with flat, un-indexed vertex data for all the visible geometry (built once per frame, re-used for each pass)?

Perhaps the ideal solution would be to put all the vertices into said huge VBO, and then generate VBOs full of indices for each view [frustum] on the fly?

I have been put off of deferred shading because of the depth complexity of the scenes that are to be rendered (that's why I'm using depth peeling in the first place). Once optimized, the depth peeling algorithm will only shade the pixels which actually influence the final rendered output - so unless I am missing something, I don't think deferred shading will help here.

satan
06-29-2009, 06:21 AM
Once optimized, the depth peeling algorithm will only shade the pixels which actually influence the final rendered output - so unless I am missing something, I don't think deferred shading will help here.
It looks like you are missing something, because that is exactly what deferred shading gives you. In the geometry pass you fill your FBO (G-Buffer) with all the necessary information (position/normal/color/etc.) per pixel/fragment. Then you draw your light sources and only light pixels which are visible (have passed the depth test) and are inside the area of influence of the corresponding light source.
Only transparent materials will have to be processed the normal (forward rendering) way. So if your scene consists mostly of opaque meshes deferred rendering might be the right tool for you.

spurserh
06-29-2009, 04:43 PM
satan,

I don't see how that benefits me when I am already guaranteed to only shade the pixels that contribute to the final scene. Also, my scene does not reliably consist of mostly opaque meshes - that's why I am doing things the way I am doing them.

Anyhow this is getting a bit off-topic. What I have decided to do is to have two VBOs per geometry object (which in my test cases tends to range from 20k to 60k triangles). The first contains all the interleaved vertices. The second contains all the rebased indices, sorted by mesh. I can then cull at whatever granularity I choose by drawing ranges of indices from the index VBO. Does that sound like a reasonable solution?
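A minimal C++ sketch of how drawing ranges from a single index VBO could look on the CPU side. `MeshRange`, `DrawRange` and `buildDrawRanges` are hypothetical names, and merging neighbouring ranges is an extra assumption beyond what is described above: visible meshes that are adjacent in the index buffer and share a material are combined, so they can go out as one draw instead of several.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record for one mesh's slice of the shared index buffer.
struct MeshRange {
    std::size_t firstIndex;
    std::size_t indexCount;
    int materialId;
    bool visible;  // result of culling for the current view
};

struct DrawRange {
    std::size_t firstIndex;
    std::size_t indexCount;
    int materialId;
};

// Assumes `meshes` is ordered the same way the indices are laid out in the
// buffer (sorted by mesh, non-overlapping).
std::vector<DrawRange> buildDrawRanges(const std::vector<MeshRange>& meshes) {
    std::vector<DrawRange> draws;
    for (const MeshRange& m : meshes) {
        if (!m.visible || m.indexCount == 0) {
            continue;  // culled or empty: nothing to draw
        }
        if (!draws.empty()) {
            DrawRange& last = draws.back();
            const std::size_t lastEnd = last.firstIndex + last.indexCount;
            assert(lastEnd <= m.firstIndex);  // checks the ordering assumption
            if (lastEnd == m.firstIndex && last.materialId == m.materialId) {
                last.indexCount += m.indexCount;  // extend the previous draw
                continue;
            }
        }
        draws.push_back({m.firstIndex, m.indexCount, m.materialId});
    }
    return draws;
}
```

Each resulting `DrawRange` would map to one `glDrawElements` call on the bound index buffer, with `firstIndex` multiplied by the size of the index type as the byte offset. Culling granularity is then just a matter of how the `MeshRange` entries are grouped.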

One last question I have is whether or not it _hurts_ anything to have a number of very small VBOs lying around on top of my big ones. My scenes largely consist of both very trivial objects and very complex ones - and very little in between. Should I just go with old fashioned client-side arrays under some minimum?


- Sean

Dark Photon
06-29-2009, 05:46 PM
One last question I have is whether or not it _hurts_ anything to have a number of very small VBOs lying around on top of my big ones. My scenes largely consist of both very trivial objects and very complex ones - and very little in between. Should I just go with old fashioned client-side arrays under some minimum?
Yes. In my experience, small batches actually made VBOs render slower than client arrays on NVidia. So yeah, I'd do some testing and see where the break-even point is for you with your batches.
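One possible way to turn such testing into a threshold, sketched in C++. `TimingSample` and `findBreakEven` are hypothetical names, and the timings are assumed to come from your own measurements of both paths at several batch sizes; nothing here measures anything by itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One hypothetical measurement: the same batch drawn via both paths.
struct TimingSample {
    std::size_t trianglesPerBatch;
    double clientArrayMs;  // measured time using client-side arrays
    double vboMs;          // measured time using a VBO
};

// Returns the smallest measured batch size at and above which the VBO path
// was never slower than client arrays, or 0 if the largest sample already
// favours client arrays (or there are no samples).
std::size_t findBreakEven(std::vector<TimingSample> samples) {
    std::sort(samples.begin(), samples.end(),
              [](const TimingSample& a, const TimingSample& b) {
                  return a.trianglesPerBatch < b.trianglesPerBatch;
              });
    std::size_t breakEven = 0;
    // Walk from the largest batch downwards until VBOs start losing.
    for (auto it = samples.rbegin(); it != samples.rend(); ++it) {
        assert(it->trianglesPerBatch > 0);  // 0 is reserved for "not found"
        if (it->vboMs > it->clientArrayMs) {
            break;
        }
        breakEven = it->trianglesPerBatch;
    }
    return breakEven;
}
```

Batches below the returned size would stay on client arrays and everything else would go into VBOs. The result is only as good as the samples, and likely differs per driver and card.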

spurserh
06-29-2009, 06:36 PM
Dark Photon: Thanks!

Satan: Actually you're right, deferred shading could still help me if there are many lights. Thanks!