Rendering a HEAVY load of static billboards

We are trying to render a nice forest. On our demo scene, we need about 3000 trees, each tree containing 10 billboards. It makes about 30 000 quads …

I tried to render a fifth of this, and frame rate is killed when there is a lot of overdraw (trees are packed all close together). So I suppose I have to sort front to back, and at some point use some level of detail, let’s say I’ll render first 5000 billboards.

How can I sort 5000 quads each frame (knowing that all these quads are static) ? This won’t make a huge stall ?

How can I render these 3000 trees in a tight rendering bin (reducing the calls to glDrawElem) ?

Thanks,
SeskaPeel.

Originally posted by SeskaPeel:
We are trying to render a nice forest. On our demo scene, we need about 3000 trees, each tree containing 10 billboards. It makes about 30 000 quads …

I would use a quadtree for sorting the billboards. Have you considered adding simplier models with only one billboard for trees far away?

Why do you think I should use a quadtree ?

For the far away 1 billboard tree, of course, this is what I was talking about when saying “level of detail”.

SeskaPeel.

Originally posted by SeskaPeel:
Why do you think I should use a quadtree ?
SeskaPeel.

You can use a quadtree to output a reasonably depth-sorted list of visible objects. For any quadnode in a quadtree, you can traverse the children in an arbitrary order, right? Now imagine what happens if you always pick the farthest child (from the viewer) first. If you traverse the quadtree in depth-first (in both senses of the word) and all objects are in leaf nodes, then you should get a well sorted result.

Imperfections in the sorting come from a) leaf quadnodes having more than one object in unsorted order, and b) intermediate quadnodes holding objects (that is, if those objects spanned quad boundaries). Both of these are minor problems, IMO, unless you need a perfect depth order for alpha blending, which can be had for a few more passes of traditional sorting. In this case, fast and almost right may be better than perfect, since the goal is to reduce overdraw without creating a new bottleneck.

However, my question is whether you’re really fill limited or not? Unless your billboard trees are big, you’re more likely API limited in trying to draw so many individual billboards.

In that case, I’ll suggest something I suggested in another particle thread (and was apparently ignored there): use cheaper billboards that can go all in one draw pool and, most importantly, one drawElements call.

The trick is to put all billboard trees in one coordinate system, starting with the tree center and offsetting by four displacements, which are pre-rotated to make your trees parallel to the image plane. The vectors are made from combinations of two basis vectors, which in your case would be the world “up” vector and the cameras “right” vector.

[edit: it’s not exactly the “right” vector if the camera can tilt, but more properly the normalized projection of that vector on the ground plane. Modulate the “up” vector slightly to add a simple wind/sway effect.]

However, if you mind the slight error caused by using the same cameras “right” vector to rotate all of your trees, you can do the extra math to compute the proper perpendiculoar vector for each tree. It adds perhaps two extras cross products and one normalization per tree, so it’s not a killer, but I generally skip it. You can try it both ways and see what you like.

But the important thing is to put all of the trees into one VAR or VBO pool and render with a small number of glDrawElements calls. The fastest way I know is similar to the Nvidia “learning VAR” example, where you render in fixed-size batches, interleaving the CPU-side contstruction of the vertices with async work on the GPU.

The next best alternative is to use a vertex program. There, you’d create a single VBO where, for each billboard-quad, you’d write center point four times. VP will rewrite this. The trick is that another per-vertex parameter, say the first texcoord, holds the number 0-3 based on which corner of the quad it is. In the VP, you use that param to do a table lookup of the real texcoords and also do the math I mentioned above to construct each of the four correct corner points. The 0-3 number is a kind of switch, though you don’t need branching since it’s just a table lookup. In my programs, I used the same four offset vectors for all billboards, which I just loaded into VP constants before rendering to keep the VP small and fast.

The choice between the two approaches I mentioned is mainly one of CPU time. The first approach will use more CPU while trees are rendering. The second will be modestly slower on the GPU, but take zero CPU time. If you’re not transform limited, you can probably go either way.

Hope that helps.

Avi

[This message has been edited by Cyranose (edited 10-29-2003).]

I believe that the 10 “billboards” are not actually billboards as such, because that would cause horrible parallax when rotating. Instead, I think they are cut-out alpha-tested images of parts of the tree, essentially a higher-poly version of cruciforms.

If that is the case, then you can group trees in either a quad/oct-tree, or just in a regular grid, with some minimum cell size (~10 trees per cell). Then traverse the tree or grid, and for cells that are visible, draw the entire set of trees in a single draw-range-elements call. This assumes that you can aggregate all your textures for a single group, as well, which should be possible with texture sheeting.

This would be very light on the CPU (just traversing the nodes) and reasonably efficient for the GPU (drawing static buffers, each covering 20x20 meters). You could even glom nodes together at larger sizes further out, a la geo-mip-mapping for terrain (with or without level of detail in the geometry).

If you’re drawing using cut-out (alpha test, as opposed to alpha blend) then you want to draw near-to-far, to get the benefit of early fragment rejection and hierarchical Z. If you draw using alpha blend, you can’t do that, but instead should draw far-to-near.

Thus, assuming you build geometry buffers (or display lists) for the bottom N levels of your tree, the traversal would be something like:

Node::Traverse()
if is entirely outside frustum
terminate
if is entirely inside frustum
if is within the bottom N levels
if doesnt’ have aggregated geometry
generate aggregate geometry for all children
draw aggregate geometry
terminate
else
sort child nodes in order (near-to-far or far-to-near)
recurse children in this order
else
sort child nodes in order
recurse children in this order

If you do this, you probably want to purge generated aggregate geometry that hasn’t been viewed in a while, to avoid exhausting available memory.

You divide the trees into grid cells, as jwatte said, but what I don’t think he said was that you should pre-sort each cell by the 4 major axis, constructing index arrays for each sort axis. Then at runtime, when drawing a cell, you just need to choose the appropriate major axis depending on the view direction, and send its index array to the driver with a single gldrawelements for each cell.

Precisions :

These are cut-out (alpha tested, and not blended) billboards. Both camera aligned or view plane aligned are acceptable, and incurs no visual artifacts, though some small popping can occur sometimes.

I didn’t fully understand all what you all said. I need to reread a lot of time and to think about it. The first question that arises, is why I should use a quadtree, rather than a BSP, rather than an octree ? And I’ll try to answer to myself : octrees are rejected because no two trees can be at the same position and at two different altitudes. One point, but why not BSP ?

SeskaPeel.

Originally posted by SeskaPeel:
One point, but why not BSP ?

A BSP tree doesn’t quite work if all your polygons are constantly spinning around.

– Tom

I’m not sorting polygons, I’m sorting billboards. So I’m sorting depending on the position of the center of the billboard, that is a point and not polygon.

SeskaPeel.

Ah right, you’d be looking at more of a kd-tree then? I don’t believe that’ll help you sort the points front-to-back like a BSP tree would for polygons. (Points don’t have a front and a back side like polygons do)

On another note, have you tried to benchmark any brute-force sorting algorithms? I’ve used a simple bubble sort to do depth sorting on scenes with around 1000 objects before and didn’t find it to be a big bottleneck (bubble sort exploits temporal coherence). If you’ve only got about 3000 trees, you might very well get away with it.

– Tom

Tom, you’re right … I should test a bit brute force before trying to optimize.

I just hope brute force won’t be too long to implement.

SeskaPeel.

Since you are using static objects, you might consider using dynamic imposters. Mark Harris describes the technique and got some good results doing this for cloud rendering (http://www.cs.unc.edu/~harrism/clouds/).
You basically pre-render some of your billboards to texture and reuse these frame to frame. Cuts down on fill enormously, especially for stuff in the background.

Cyranose :

I limited both by API and fillrate.
So I need to sort and to bin.

For the sorting stuff, I’ll go for either a quadtree or a regular grid or a brute force, depending on perfs.

I’m still very interested in your “cheaper billboards that can go in one draw pool and, most importantly, one drawElements call”.

Tons of questions :

the tree center and offsetting by four displacements, which are pre-rotated to make your trees parallel to the image plane

Sorry you left me with a stupid face after reading this. What are those “four displacements” ?

However, if you mind the slight error caused by using the same cameras “right” vector to rotate all of your trees, you can do the extra math to compute the proper perpendiculoar vector for each tree. It adds perhaps two extras cross products and one normalization per tree, so it’s not a killer, but I generally skip it. You can try it both ways and see what you like.

Huh … is this view plane aligned opposed to camera aligned ?

I used the same four offset vectors for all billboards, which I just loaded into VP constants before rendering to keep the VP small and fast

Alternative would be to pass a second stream containing offsets ? So your improvment is to pass a single scalar (integer) value instead a of 3-scalar vector ? (Of course you can pad it to the end of an existing stream, and thus saving a full stream)

Last question : how do you mix this vertex program method with the sort ? Am I supposed to sort an array of index to vertices ?

Thanks,
SeskaPeel.

To construct a billboard, given a center point, the four points [0-3] of the billboard quad can be expressed as center + displacement[0-3]

For example:

displacement[0] = + basisX + basisY
displacement[1] = + basisX - basisY
displacement[2] = - basisX - basisY
displacement[3] = - basisX + basisY

The order will depend on which corner you want to start with and whether you want to go clockwise or not. As I generally turn off cull-face anyway, order doesn’t matter much. The texcoords will follow the same pattern as above within texture coord ranges, of course.

For “cheap” billboards, basisX is a scaled version of camera-right and basisY is a scaled version of camera-up. That makes the billboards draw parallel to the image plane, rather than rotate for each individual “at” vector, which is usually fine for me. Some people might mind the slight warping you see in billboards near the corners of the window.

For “cheap” trees, however, you want them to stand up straight, so you’d use the world “up” axis as basisY in that case, possibly with some gentle sway offset for a virtual wind.

[edit: you might also want to shift the math above to make the “center” of the tree at each tree’s base for proper swaying. In that case, the -basisY disappears and +basisY is replaced with +basisY*2. make sense?]

Now, for basisX, you have two main options: You can use the projection of the camera’s “right” vector on the groud plane. This will give you trees that are generally parallel to the image plane (at least on their X axis). The second method is better but a little slower: to take the “at” vector from the camera to each tree individually, cross it with the basisY and project that onto the ground plane, normalizing and scaling it too. The goal is to get a vector that’s parallel to the ground and perpendicular to the “at” vector.

So to sum up, whatever basis vectors you choose, those four combinations of basisX and basisY give you the four displacements for your corner points. The code can very easily fit in a tight-loop over your quads, using SSE if you have time.

If you use a VP, I suggest computing those four common displacements on the CPU and loading them into a VP constant table, so the VP code is simply

pos = center + displacement[i],

where ‘center’ and ‘i’ come in via vertex data. And the four displacements can be preloaded into vertex constant registers. (the Four Displacements keeps sounding like an old R&B band to me…)

In this case, it might be best to send in vertices of the form XYZi, so they’re 16 bytes each. Multiples of 16 or 32 bytes seem to fetch faster from AGP memory.

If you go with VP and want each tree rotated to the camera “at” vector, you’ll need to compute a new basisX for each vertex. You already have the center point passed in so all you need is the eye point passed into the VP (once, via constants) plus some more VP instructions for the extra math.

On the sort, I’d go with coarsely sorting whole objects first via the quadtree or a radix sort and then iterate the sorted list to compute an up-to-date index list for that frame. Sorting the indices directly sounds painful, since you have to do swaps in groups of four now.

Mixing the VP with the sort? Once enabled, loaded, and bound, the VP will simply work on each [new/uncached] vertex in the order specified in the index list you pass with glDrawElements. No magic there.

Avi

[This message has been edited by Cyranose (edited 10-30-2003).]

Cyranose,

yes thanks, all is clear. I still have to look at that radix sort (someone told me of this before, don’t remember where).

Your opinion would be that wich one is faster to implement (quadtree or radix) ? Which one will be the more efficient at run time ?

SeskaPeel.

Originally posted by SeskaPeel:
[b]Cyranose,

yes thanks, all is clear. I still have to look at that radix sort (someone told me of this before, don’t remember where).

Your opinion would be that wich one is faster to implement (quadtree or radix) ? Which one will be the more efficient at run time ?

SeskaPeel.[/b]

Radix sorting is one of the fastest sorts around for fixed range values, like 32 bit integers or floats.

However, if you’re doing a quad-tree anyway (e.g., for view culling, LOD), modifying it to support sorted or mostly-sorted output could easily be faster. I’d personally go with the quadtree first, since that buys you more overall, but it’s more complicated. You can find a radix implementation via google.

BTW, it looks like Tom implemented something like the VP particle billboard approach I described and has kindly posted source on his site. It’s on the OpenGL main page.

Avi

We are in a hurry, and I need the fastest implementation possible. So I choosed a simple bin sort.

I still didn’t understand what were the real improvment of your method (the four displacement vectors), but as I’ll implement a simple vertex program to render my billboarded trees very soon, I’ll consider it at this time.

Of course, I’ll use only 3 calls to glDrawElem (one for leaves, one for truncs, and one for far trees) and 3 VBO. I use some paging for textures to have multiple leaves billboarded textures so I can render all of them in a single call, but I need a stream of texture coordinates to do this.

I’ll let know how it went quite soon.

SeskaPeel.

Originally posted by jwatte:
If you’re drawing using cut-out (alpha test, as opposed to alpha blend) then you want to draw near-to-far, to get the benefit of early fragment rejection and hierarchical Z.

Alpha-tested geometry have problems with Hierarchical Z, because it can’t be used to build one. So, if one have heavy overdraw based on alha-tested geometry it can’t update Hierarchical Z, and he is screwed.
So, maybe switching from alpha-tested in front to alpha-blended in bigger distances is good idea, moreover if distance is very big that trees will fade anyway.

I implemented using your displacement vectors (thanks to Tom vertex program code). I did it that way because it was faster (implementation) than computing the correct rotation matrix.
I added a small feature, passing a different size for X and for Y. So I need 8 displacement vectors, in two separate arrays, one for X displacement, and the second for Y, each scaled by its proprer scalar.
I pass index value in vertex.position.w, and in texture stream 0 I pass tex coo s, tex coo t, scale x, and scale y.

I still need to do some performance tests, and I’ll let you know if it gave something good.

SeskaPeel.

Brute force rendering of 300 trees, each containing 11 billboards : 30 fps (stable)

Compiling all data so that only 2 glDrawElems are called + vertex program using displacement vectors : 120 fps. Looking from close to the trees (worst position having back to front rendering of the trees on the whole frame buffer) : 55fps

Bin sorting + 2 glDrawElems : 110 fps at least when looking from the worst possible position.

I’ll post some results with 3000 trees and a LOD depending on position of the trees in the bins soon.

I think this is a win

SeskaPeel.