Fastest way to render lots of models

I’m trying to figure out the fastest way to send the vertex data for models to the GPU – some models are non-animated, others are animated. Should I use a display list for each static model and alter the matrices for each individual one when rendering it? The static models have multiple materials, so each one would take a few calls to render. This will result in many calls to DrawElements, which I believe is not a good thing.

For the dynamic ones I assume the best thing is to rotate/translate each vertex and batch them onto a common array so I’m not calling DrawElements hundreds of times. This is what I already do for level geometry and I’m wondering if I should also do it even for static models. That’s my main conundrum.
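In case it helps later readers, the transform-and-batch idea described here can be sketched in plain C. Everything below (the `Vec3`/`Mat4` types and function names) is illustrative, not from the engine being discussed; the matrix is column-major, as OpenGL expects.

```c
#include <stddef.h>

/* Illustrative types -- not from any particular engine. */
typedef struct { float x, y, z; } Vec3;

/* 4x4 column-major matrix, as OpenGL expects: element (row, col) is m[col*4 + row]. */
typedef struct { float m[16]; } Mat4;

/* Transform one point by a model matrix (rotation + translation). */
static Vec3 transform_point(const Mat4 *t, Vec3 v)
{
    Vec3 r;
    r.x = t->m[0]*v.x + t->m[4]*v.y + t->m[8] *v.z + t->m[12];
    r.y = t->m[1]*v.x + t->m[5]*v.y + t->m[9] *v.z + t->m[13];
    r.z = t->m[2]*v.x + t->m[6]*v.y + t->m[10]*v.z + t->m[14];
    return r;
}

/* Append one model's vertices, pre-transformed to world space, onto a
   shared batch array so everything can go out in one DrawElements call.
   Returns the new vertex count in the batch. */
static size_t batch_model(Vec3 *batch, size_t batch_count,
                          const Vec3 *model_verts, size_t n,
                          const Mat4 *model_matrix)
{
    for (size_t i = 0; i < n; ++i)
        batch[batch_count + i] = transform_point(model_matrix, model_verts[i]);
    return batch_count + n;
}
```

The trade-off debated in the rest of this thread is exactly whether this CPU work is worth the reduction in DrawElements calls.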

[This message has been edited by CGameProgrammer (edited 03-26-2003).]

Hi,

glDrawElements calls are quite fast so I don’t think you need to worry about them. Matrix and material changes can take longer, especially if you change textures, so those are the ones you should optimize for. Transforming your dynamic models on the CPU isn’t necessary, unless you use skinning, in which case I think you can get the modelview transform for free.

-Ilkka

Well, animation means I need to transform all the vertices anyway, though simple keyframe interpolation is faster than a full rotation. And since I’m already looping through all the vertices, I figure I might as well transform them then.
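For reference, the keyframe interpolation mentioned here is just a per-component lerp between two stored poses. A minimal sketch (types and names are made up for illustration):

```c
/* One vertex position in a keyframe pose. */
typedef struct { float x, y, z; } KfVec3;

/* Linear keyframe interpolation for one vertex.
   t is in [0,1] between keyframe a and keyframe b. */
static KfVec3 lerp_keyframe(KfVec3 a, KfVec3 b, float t)
{
    KfVec3 r;
    r.x = a.x + (b.x - a.x) * t;
    r.y = a.y + (b.y - a.y) * t;
    r.z = a.z + (b.z - a.z) * t;
    return r;
}
```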

For the dynamic ones I assume the best thing is to rotate/translate each vertex and batch them onto a common array so I’m not calling DrawElements hundreds of times.

You’re not transforming it on the CPU, are you? The conventional wisdom as far as speed is concerned these days is, “Let the GPU figure it out.”

Don’t try detailed LOD schemes; just do some gross frustum culling and give the results to the GPU. Do use regular OpenGL T&L or even vertex programs when possible and reasonable.

The key to drawing performance is minimizing state changes and calls to glDraw*.

If each of your buildings has multiple materials, but many building parts share the same material, you shouldn’t be drawing each building individually. Instead, you need to set up one material and draw everything that uses that material.
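The set-one-material-then-draw-everything-that-uses-it approach can be sketched like this. The `DrawRecord` type and the change counter are illustrative stand-ins; the actual GL calls are indicated in comments:

```c
#include <stddef.h>
#include <stdlib.h>

/* One mesh chunk: which material it uses plus an id. In a real renderer
   this would carry vertex/index ranges instead of just an id. */
typedef struct { int material; int mesh_id; } DrawRecord;

static int by_material(const void *a, const void *b)
{
    return ((const DrawRecord *)a)->material - ((const DrawRecord *)b)->material;
}

/* Sort chunks by material, then set up each material once and draw every
   chunk that uses it. Returns the number of material changes, which is
   what this ordering minimizes. */
static int render_sorted(DrawRecord *recs, size_t n)
{
    int changes = 0, current = -1;
    qsort(recs, n, sizeof *recs, by_material);
    for (size_t i = 0; i < n; ++i) {
        if (recs[i].material != current) {
            current = recs[i].material;
            ++changes;      /* here: glBindTexture / glMaterial setup */
        }
        /* here: glDrawElements for recs[i]'s index range */
    }
    return changes;
}
```

With this ordering, the number of texture/material binds per frame is the number of distinct materials, not the number of chunks.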

Also, you should consider using glMultiDraw* rather than glDraw*. It’s a relatively new extension, but it could give you a boost to your game’s speed. It does the equivalent of several calls to glDraw*.
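For anyone trying this: glMultiDrawElementsEXT (from EXT_multi_draw_arrays) takes parallel arrays of index counts and index pointers, one entry per mesh. A sketch of building those arrays follows; the `GLuint_`/`MeshRange` types are stand-ins since no GL context is assumed here, and the actual call is shown in a comment:

```c
#include <stddef.h>

typedef unsigned int GLuint_;  /* stand-in for GLuint; no GL headers here */

/* One mesh's index range, ready to be gathered into a multi-draw. */
typedef struct { const GLuint_ *indices; int count; } MeshRange;

/* Gather per-mesh index ranges into the parallel arrays that
   glMultiDrawElementsEXT takes, replacing n glDrawElements calls with one. */
static void build_multidraw(const MeshRange *meshes, size_t n,
                            int *counts, const void **index_ptrs)
{
    for (size_t i = 0; i < n; ++i) {
        counts[i] = meshes[i].count;
        index_ptrs[i] = meshes[i].indices;
    }
    /* With a GL context and the EXT_multi_draw_arrays extension:
       glMultiDrawElementsEXT(GL_TRIANGLES, counts, GL_UNSIGNED_INT,
                              index_ptrs, (GLsizei)n); */
}
```

Note the meshes still have to share the same vertex arrays and material state; the extension only collapses the draw calls, not the state changes.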

Yes yes, I do all that for the buildings (except glMultiDraw*, thanks for the tip, I’ll look into it). The buildings are very well optimized, and of course they are not transformed on the CPU.

But they’re not models; they’re part of the level. I’m adding models to the engine and am just unclear on the best way to go about doing it. The reason models are different is because they are always moving around and some of them are animated as well.

By the way, frustum culling in lieu of LOD is not an option for the buildings. Standing in the corner of the level and looking at the buildings, 600-odd buildings are in the view, plus houses and streets and so on. But I get 40fps on a GeForce2 by using LOD.

The models will of course be culled and they won’t have any LOD, except for one type that will use an identical scheme to the level geometry (for very large objects like spaceships, buildings [if a building gets demolished and sinks into the ground], etc.).

[This message has been edited by CGameProgrammer (edited 03-26-2003).]

The fastest way to render a lot of models is to use a combination of a few well-known tricks. I see that you know about display lists and data buffers. Unfortunately they can’t help you much with animated models and other complex stuff. There are some other very good ways to save time and gain speed. Backface culling is an excellent way to skip roughly a third of the geometry that you don’t need. Other good ideas are frustum culling, LOD, and a few others. Hope this helps.

Well I need to clarify… my problem is not figuring out how to minimize the polygon count, I’m only concerned with the fastest method of transferring the data onto the 3D card. Do I:

a) use a lot of DrawElements calls so that I can let the GPU transform the models and so that I can use display lists for non-animated models, or
b) do I manually transform the vertices and put them onto large vertex arrays each frame to minimize the calls to DrawElements?

Someone said DrawElements was fast but I doubt that since the vertex array is uploaded to the GPU when the function is called.

Presumably, sooner or later, you’re going to switch to VBOs or something like that.

You should almost never do ‘b’, as that makes your rendering CPU dependent. The ultimate hope is that you get full async rendering.

The best way to get full speed is to do as few glDraw* calls as possible. That may mean you have to rearchitect your meshes and textures. Make it so that each building can be drawn with a single glDraw* call (concatenate textures and re-texture your meshes).
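For the concatenate-textures suggestion, each mesh’s UVs need to be remapped into its sub-rectangle of the combined texture. A minimal sketch, assuming UVs are stored as interleaved (u, v) pairs in [0,1] (the names here are illustrative):

```c
#include <stddef.h>

/* One mesh's sub-rectangle inside the combined (atlas) texture,
   in normalized [0,1] coordinates. */
typedef struct { float u0, v0, u1, v1; } AtlasRect;

/* Remap a mesh's [0,1] UVs into its atlas sub-rectangle, so several
   meshes can share one bound texture and be drawn in one batch.
   uv holds n_verts interleaved (u, v) pairs, modified in place. */
static void remap_uvs(float *uv, size_t n_verts, AtlasRect r)
{
    for (size_t i = 0; i < n_verts; ++i) {
        uv[i*2 + 0] = r.u0 + uv[i*2 + 0] * (r.u1 - r.u0);
        uv[i*2 + 1] = r.v0 + uv[i*2 + 1] * (r.v1 - r.v0);
    }
}
```

One caveat the thread touches on later: texture wrapping and mipmap bleeding across sub-rectangles need handling, which is part of why the original poster resists combining textures.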

Here’s a cute trick that works on vertex program-equipped cards.

Put all your building geometry in one large vertex array. For each instance of a building, you have a separate attribute array that contains just the same number repeated over and over (this array is the same size as the other arrays). This attribute array is an index into the per-instance constant data in the program.

That array contains the center position, in 3D space, of where the building is. All your vertex shader needs to do is offset every vertex position with this value (the w coordinate might even be a uniform scale or some kind of angular rotation, if you need it). This way, you can draw upwards of 80 buildings using a single glDrawElements call. All you need to do is make sure the attribute vertex array is correct for each instance. Now, you will have to rebuild your index list per frame, so that your index list contains multiple buildings.

Now, this doesn’t take into account material changes. You need to limit the number of materials you use to something reasonable like 5-15 or so. That way, a scene with 600 buildings only requires, at worst, 15 different material changes. Each material has 40 different buildings (or so).

You can maintain variety by having that same index array you use to reach the constant registers also select vertex colors. These can be independent of the building’s location (but they should be constant, of course).

Now, if you only have 5 materials, you get 120 different buildings. A 9500 or better can handle that in a vertex program, but a GeForce 3 cannot (it only has 96 registers). To improve upon this, you can limit the building positions to being 2D, and have each 4-float register represent two 2D positions. It’ll be a bit tougher in the shader to pull this off (especially figuring out which half to offset with), but it can be done.
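The two-positions-per-register packing could look like this on the C side; which half the shader offsets with would be driven by the per-vertex index attribute. The names here are illustrative:

```c
/* One 4-float constant "register", as seen by a vertex program. */
typedef struct { float v[4]; } Reg4;

/* Pack two 2D building positions into one register, halving the
   constant-register usage described in the post. */
static Reg4 pack_two_positions(float x0, float y0, float x1, float y1)
{
    Reg4 r = {{ x0, y0, x1, y1 }};
    return r;
}

/* which = 0 selects the first position (xy), 1 the second (zw); in the
   shader this selection is the tricky "which half" step mentioned above. */
static void unpack_position(Reg4 r, int which, float *x, float *y)
{
    *x = r.v[which * 2 + 0];
    *y = r.v[which * 2 + 1];
}
```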

The key to performance in this case is to limit both the number of material changes and the number of glDrawElements calls. And this method will improve much more if you use VBO (when it gets optimized).

You may have problems integrating this into an LOD scheme, however. But, it may go faster using the full LOD than whatever you come up with without this scheme, so that’s something.

BTW, the above doesn’t quite work as advertised. To get it to work, you’d need to bind one attribute array per instance, so that each instance’s index array gets loaded up to the shader. Then, it would have to pick which one to use based on some value. Otherwise, you still have to do one glDraw* per building, with gl*Pointer calls in between.

Actually, this would work if OpenGL allowed different attributes to be indexed with different index arrays. This is one of the cases where it would really be useful.

[This message has been edited by Korval (edited 03-26-2003).]



While I certainly appreciate you trying to help, I must urge you to read my previous posts. The buildings are fully optimized (well, mostly) but my concern is with actual models that move, like cars, people, etc. Cars are static models – they don’t animate. The wheels rotate/turn but that is not an animation; they would simply be separate models “attached” to the car body. People are of course animated – since I will, one can assume, be changing the vertices every frame as I loop through animations, I figure I might as well rotate/translate them as well. But it’s static models like cars that are more questionable.

Now, you DID come up with something I can theoretically apply to the cars, which is to take each instance of a car and put its untransformed vertices into a common array, upload that with DrawElements, and use a vertex shader to rotate/translate them. This is an interesting idea but I cannot do it because I’m avoiding really hardware-dependent features such as that. My target platform is any T&L card, but preferably it should be functional with cards below that. Apparently the engine does work on an 8MB ATI Rage Mobility; the guy said he got 3-5 FPS (on a 600MHz CPU), but with no graphical errors.

But you did say it was important to reduce the number of calls to DrawElements which is what I figured, so it looks like I will need to do CPU transforms. But I can keep your idea in mind as an optimization to be done later.

And in case you’re wondering, the reason buildings are not applicable here is that they use their own efficient pipeline that takes advantage of the fact that they are usually static; even with LOD, most of the time a given building’s level of detail remains unchanged. So their vertices are combined onto common dynamic/static buffers (or, in OpenGL, arrays and display lists, but I definitely need to replace display lists with… VBOs? or something).

You also mentioned reducing the texture count, but I don’t want to use enormous textures and yet I want to allow high-resolution textures, so that means combining them isn’t really an option. The models will use a single texture each, but models are small.

Now, you DID come up with something I can theoretically apply to the cars, which is to take each instance of a car and put its untransformed vertices into a common array, upload that with DrawElements, and use a vertex shader to rotate/translate them. This is an interesting idea but I cannot do it because I’m avoiding really hardware-dependent features such as that. My target platform is any T&L card, but preferably it should be functional with cards below that. Apparently the engine does work on an 8MB ATI Rage Mobility; the guy said he got 3-5 FPS (on a 600MHz CPU), but with no graphical errors.

You should at no time be doing any T&L operations, either regular GL matrix multiplies or skinning, that can be done just as easily on the GPU. This means that buildings, cars, people, etc. all go through the GL pipe untransformed if at all possible.

Unless your card supports vertex programs (or vertex blend, but that’s been phased out), you’re stuck with software skinning. However, everything else should be transformed on the GPU unless you have a good reason not to.

To deal with static-mobile geometry (things that move, but can be transformed by a regular T&L card), simply use the regular OpenGL T&L pipe. Do your very best to minimize state changes (a car’s texture should include its tires). Definitely use different LOD models. Only the highest LOD really needs to have tires that rotate and turn; anything less can have tires that are fixed rigidly to the vehicle. Do everything in your power such that the average car renders in one glDrawElements call. Keep your material changes down.

And in case you’re wondering, the reason buildings are not applicable here is that they use their own efficient pipeline that takes advantage of the fact that they are usually static; even with LOD, most of the time a given building’s level of detail remains unchanged. So their vertices are combined onto common dynamic/static buffers (or, in OpenGL, arrays and display lists, but I definitely need to replace display lists with… VBOs? or something).

Buildings, as mentioned before, should at no time be transformed on the CPU. You won’t get the performance you should out of VBOs if you keep rebuilding that array each frame. You’ll get better performance by rendering each of the 600 buildings individually. Granted, you’ll get even better performance by rendering clumps of buildings (groups of 20, say) that all use one texture and can therefore be rendered in one shot.

You also mentioned reducing the texture count, but I don’t want to use enormous textures and yet I want to allow high-resolution textures, so that means combining them isn’t really an option. The models will use a single texture each, but models are small.

Then you’re sacrificing performance for something. For modern graphics cards (GeForce3/Radeon8500 or better), 2048x2048 is a good cross-platform maximum texture size, though GeForces allow for up to 4096.

Above all else, sit down with your code and benchmark different rendering techniques. Take various pieces of this discussion to your code and see what happens when they’re implemented (though VBOs are not mature yet, so you can’t really gauge their performance).

Korval, I don’t understand your cute vp trick. I don’t understand where you’re getting performance gains; you just seem to be adding more data to be transferred and more per-vertex math. Are you saying that for every instanced house you just have to change the ‘index’ stream rather than position/normal/texcoords etc.? In which case, are you sure this gives a gain? If you’re using VAR, and therefore the vertex data is already in fast memory, I don’t see the gain… or am I going mad? It certainly won’t cut down on the number of DrawElements calls.
Are you talking about compiling the vertex data every frame from the visible meshes? Touching that much vertex data is going to be bad.

Korval, I don’t understand your cute vp trick.

Nor should you, as I realized right after I sent it that it would not work. I edited the post and added a comment about how it doesn’t work (and a way to make it work if certain facilities were available in OpenGL).

Rendering the 600 buildings individually would be slow – nVidia urges developers to use buffers with “thousands of vertices” rather than smaller groups of vertices and more buffers. That’s why I batch things.

[EDIT: Unless you mean that if the buffers remain onboard the card (static) then it would be fast just using many, and that it’s only slow using many small buffers if you have to upload them all. Well I forgot that actually the static buffers are small, it’s only the dynamic buffers that are large.]

Also while it is certainly true that constantly remaking the arrays would be slow, I only do it when they need to be updated – when a building’s LOD changes, the buffers it uses need to be updated, but the majority of the time they can remain unchanged.

I just added code where absolutely everything except rendering is ignored unless the camera crosses a boundary (every 16 units, for example). When standing still (meaning only rendering is done, not LOD or culling or rebuilding the arrays) the framerate increased from 40fps to 44fps in the low-polygon area of the level, and from 21fps to 31fps in the high-polygon area. In the latter area there are a bunch of dynamic buffers being used, and these are rebuilt every frame (not good but not too bad). So the CPU processing really isn’t major.

But I have been largely ignoring hardware facilities when writing the engine, as I’m initially developing for the lowest system that can handle the polygon count – I’ll add hardware tweaks later, including the vertex shader idea (which I just saw in an NVidia performance PDF) and a lower reliance on LOD to minimize buffer updates.

By the way, how do you do a shader in OpenGL? I thought they were Direct3D-specific (until OGL 2.0).

[This message has been edited by CGameProgrammer (edited 03-27-2003).]