Pushing the polygon limits.

I need some help here, guys. My engine isn't measuring up to what I expected it to do. I have checked the application's critical path to ensure no additional work is taking place. I've just finished adding support for vertex_array_range (more on this below).

Vsync off, 16-bit z-buffer, fullscreen. I am using a supported VAR format. Release build with VC7.

Following are some tests at 800x600x32x75
4 * 20k-triangle models yields 3.2Mtris/sec, 40 fps
60 * 1k-triangle models yields 3.2Mtris/sec, 53 fps

Increasing the screen size does not appear to have any effect until I hit 1600x1200x75 and higher.

SphereMark nets 12.1m on my system.

Specs:

Athlon 700, 384MB 133MHz RAM, GeForce2 32MB, running on XP.

Notes:

-My method of rendering is currently primitive: walk a scene tree (no culling) and render items as we find them.
-My VAR support is based on the nVidia doc and what I found in VAR_Learning. I create x buffers (out of one memory block). I iterate over them for each model to render and (gasp) memcpy the verts & normals into the buffer. Set up pointers, call DrawElements, set the fence, and move on. If I iterate back to a buffer with the fence still set, I call FinishFence and wait.
-I know that not using strips is non-optimal, however I need this for a later implementation of VIPM, which cannot use strips.

I’m open to suggestions here. I think I’ve followed all of the obvious leads. I understand that not using strips will hurt, though I doubt it should hurt this much.

Many thanks,

Chris

You would be surprised at how much not using strips can hurt. The number of vertices (and the associated data) that get transformed multiple times can be quite insane. I don’t know what the T&L performance is for the GeForce2 off the top of my head, but it’s something to look into. Like you said, though, SphereMark seems to do a lot better, so I don’t think that is it. I must warn you that VAR can actually be slower than standard vertex arrays if the implementation isn’t exactly correct: a poor VAR program seems to be slower than a bad standard vertex array program with the arrays in system memory. I am not sure exactly what to tell you. Try posting the code that you think is most likely causing the problem. I am certainly not an expert in VAR (I’m not even implementing it in my program yet), but there are many guys on this forum who can definitely answer all your VAR questions.

I am running on a TNT 2 Ultra. Probably gonna wait until NV30. Unless I build a new workstation before then. With software T&L strips are very important and provide a very good speed increase. Anything to reduce the transformations.

Devulon

Gimp, having no strips hurts a LOT. It’s almost 3X the vertices to be sent for long strips. Just count them, it’s easy to estimate. You’re getting 3.2Mtris/sec, so you might be closer to 10Mtris/sec with strips.

Maybe you should try making a persistent representation. Why make a copy of your data every frame?

[This message has been edited by dorbie (edited 06-28-2002).]

If you can’t convert to strips, at least convert your vertices to a vertex-cache-friendly format using the NvTriStrip library or something.

In any case, converting to strips really didn’t give me a huge speed boost; I could only squeeze out some 10% performance gain. But converting lists to a vertex-cache-friendly format gives a good speed increase.

Sort objects by distance; that could help you a lot with the z-tests.

Are you using vertex lighting or your own per-pixel lighting?

-Sundar

Yes, but triangle strips aren’t always usable. In my case, I am making an engine that renders my maps, and I am working with per-vertex lighting. When you turn a corner, the verts on the corner have MULTIPLE normals, 90 degrees off from each other. It would be a waste of time to stripify them, since they are seen as two individually different points in space by OpenGL. So I use indexed vertex arrays, and this seems to be very reliable when you can’t use strips.

Why not have a separate normal list for each strip if that is the case? If your strips are large enough, which they may not be if they are just simple walls, then this will not be a problem.

I have also found that if I send some very large index lists, the performance is very poor. Instead, I send a series of ~1k index lists to the card; this increased performance dramatically. I am using just triangles as well, because I don’t have the time to generate strips in real time, so splitting my index lists is very easy. You may find that this will help some.

Neil Witcomb

I am doing something very similar: I have broken my virtual world up into zones. This is great for three reasons. First, I can CULL entire zones from ever being drawn with a simple frustum-to-zone test; if the test fails, I simply don’t draw that zone at all, which helps OpenGL A LOT. Second, each zone is its own indexed array, so each zone is small and fast for OpenGL. Third, I did this to REALLY speed up my collision detection algorithm: I simply test which zone I’m in and test only that particular zone for collisions. Seems to work well so far.

Originally posted by dabeav:
I am doing something very similar: I have broken my virtual world up into zones. This is great for three reasons. First, I can CULL entire zones from ever being drawn with a simple frustum-to-zone test; if the test fails, I simply don’t draw that zone at all, which helps OpenGL A LOT. Second, each zone is its own indexed array, so each zone is small and fast for OpenGL. Third, I did this to REALLY speed up my collision detection algorithm: I simply test which zone I’m in and test only that particular zone for collisions. Seems to work well so far.

You mean you made a quad tree?

I heard that the limit for the vertex buffer should be 4K. With glDrawRangeElements, those limits should be respected to give better results. I don’t know what the limit is for the indices and the other buffer types.

Also, turn off other features, like collision detection, when benchmarking your graphics card.

V-man

“You would be surprised at how much not using strips can hurt”

For me, unfortunately, I have no choice but to suffer with unsorted triangle lists. The next feature to be added to my engine will be View Independent Progressive Meshes. For those that have never heard of it, it’s a LOD scheme for storing a stack of triangles that represent the model at progressive levels of detail. For this reason I have no control over triangle order for optimisation purposes (the order will soon be determined by the LOD scheme), and VIPM only uses triangle lists.

I could use discrete LODs (like a mipmapped model), but not only is that much more memory intensive (because you have so many copies of the model), it’s also not as easy to hide the morphing between the two models. I have seen suggestions that alpha blending between the two LOD models is an option, but I’ve always admired the purity of vanilla VIPM. Perhaps I should consider this method. The difficult part is that either solution will be so time consuming to code that I don’t want to just ‘try out’ something that will take me weeks to write. My brain is soft today, but from memory a mipmap-style pyramid (downwards), where each level is half the size of its parent, totals to roughly the same size as the parent, so about double a model’s size. … Not so bad I guess.

"Maybe you should try making a persistent representation. "

I think you’re suggesting just leaving the model data in AGP memory. I considered this but gave myself a headache trying to work out how I would page data in and out of AGP memory once the static data exceeded it. Paging is a lot harder than it initially seems, especially if you have priorities.

I only use AGP memory in the VAR calls. I was told by a professional game developer once to stay away from video RAM, or I’d mess up texture caching and cause thrashing; besides, there is no GetFreeVideoRam()-type call to see how much OpenGL is using, and my game will be using fully unique landscape texturing, so I probably don’t want to mess around here, do I?

Is spatial sorting worth the effort? I could probably turn the z-buffer off for things like a terrain system where I don’t have any intersections to worry about. Is it worth the effort of doing all those comparisons on the CPU? How fast are statically created display lists compared to AGP-resident vertex/normal arrays?

“Are you using vertex lighting or your own per-pixel lighting?”

Good point. I just left vertex lighting on. I have long-term plans of baking light into non-moving objects via radiosity. Dynamic objects were going to be left with bump maps + vertex lighting (though I just heard about vertex maps).

"A poor VAR program seems to be slower than a bad standard vertex array program "

Yes. My initial attempt just called a global flush at the end of each mesh; this sucked, but it was good for testing. The next version used fences properly with multiple buffers (in a single memory block), which saw a 50% speedup over standard vertex arrays.

Does anyone know of a document that tells you how to identify where a bottleneck is? I.e., if you increase the screen resolution without a drop in frame rate, you have a problem with X (fill, bus, etc.).

Chris

The next feature to be added to my engine will be View Independent Progressive Meshes. For those that have never heard of it, it’s a LOD scheme for storing a stack of triangles that represent the model at progressive levels of detail. For this reason I have no control over triangle order for optimisation purposes (the order will soon be determined by the LOD scheme), and VIPM only uses triangle lists.

Then you’ve got to accept that the trade-off for using this progressive mesh scheme is overall vertex performance.

The best you can do is use VAR (to turn on the post-T&L cache) and send indexed triangles that use similar indices from one triangle to another. You’ll need to figure out an algorithm to produce this cache-aware triangle list, but once you get that, you should get a reasonable speed increase. Don’t expect to beat a nice stripped model, let alone a cache-aware stripped model.

It’s interesting to think that many of the triangle reduction techniques I thought were “really cool” a few years ago seem very uninteresting now.

The two that come to mind at the front of that list are the ROAM heightfield algorithm and progressive meshes.

I say, better to draw more triangles with better-optimized locality, longer strips, and less CPU time wasted figuring out exactly which triangles to draw.

For heightfields, I’d say a simple block-based LOD algorithm is probably good enough. You can stitch edges together to match up LODs without cracks without too much trouble.

For meshes, I’d go with a fixed set of LOD models.

  • Matt