Fastest method of rendering geo?

What’s the fastest method in your opinion?

Currently display lists seem to be 50% faster than vertex buffer objects (even static, non-changing VBOs) on my g5…
but rather than assume that’s the same on all cards, I thought I’d ask.

Also, does using an interleaved array make much of a difference? I use separate buffers for verts/normals/colors/tris currently.

Lastly, and slightly off topic, but I didn’t want to start three threads in a day… Do you guys suggest sticking with Cg for shader support, or using glslang, considering I’d like shaders to work on all pixel shader cards (i.e. g3 and up), not just next-gen ones?

Please don’t disillusion me, I’m just thinking about changing from the current display list implementation in my program to VBO to get it FASTER (not slower) :wink: .

I guess this question can hardly be answered generally; it depends on what you do and which hardware you are using. On nvidia, display lists are highly optimized, while on ati (afaik) they even tend to slow things down (compared to immediate-mode OpenGL calls). Also, in display lists it’s best to use triangle strips, while with VBO I heard that plain triangles are best… but I am also curious about other people’s experiences. However, I am afraid this thread will be closed anyway, as this is hardly a rare topic.

On Macintosh (regardless of nvidia/ATI) display lists are the fastest for static geometry. Of course there is an initial hit for compiling the list, so for dynamic stuff, recompiling a huge display list could very well turn out slower than a VBO. See the 2003 WWDC session (#209, 27 minutes in) where this is graphed out.

For dynamic stuff, using non-interleaved arrays could be faster with VBOs since you can update only the part that is actually changing (i.e. dynamic vertices and static tex coords or vice versa.) But we can’t talk about VBO on Mac since as of 10.3.3 it doesn’t exist. :wink: VAR is the current best path for dynamics on Mac.

Well, I wouldn’t get disillusioned just yet. Just prepare yourself for a very dodgy world of uncertainty :wink:
I really wish they’d sort out proper AGP scene management on cards; it’s just a nightmare to optimize…

For example, in my engine a 17,000 poly scene = 100fps.
A 700 poly one = 200fps.

Going on that, you’d expect a 1.2 million poly scene to cripple it to something like 20 seconds per frame… yet nope, it runs at 17fps.

There doesn’t appear to be any sort of consistency… at least not on my g5-6000xt.

I really don’t get how it can handle 1.2 million polys at 17fps yet crawl to 100fps with just 17,000… (considering the surface changes etc. are the same, as the engine optimizes mesh order to cut out needless state changes…)

Someone high up (i.e. the vendors) needs to sort out a set method for optimal speeds… even if each vendor does one just for their own cards… (well, obviously…). I mean, we only have to look at all the poorly optimised, slow commercial games out there to realise it’s not exactly common knowledge.

rambling…

Originally posted by ManOfSpace:
There doesn’t appear to be any sort of consistency… at least not on my g5-6000xt.
There is, but assuming a simple linear connection between number of triangles and fps is just plain wrong.

Originally posted by ManOfSpace:
For example, in my engine a 17,000 poly scene = 100fps.
A 700 poly one = 200fps.

700*200=140,000 polys/sec.

Are you sure you’re not hitting a fill rate bottleneck there? That is terribly slow.

Aderian, I very much doubt that, as the frame rate is the same whether the model (a terrain, not dynamic) is on screen or not; plus this is a bare-bones test of the engine, i.e. only a minuscule CPU load…

If I use a display list for 17,000 I then get 140fps.

This is with vsync disabled.

Yet the same engine/demo on an ati-9800, p4-3ghz etc. (i.e. about as top-spec as it gets, at least prior to the g6 era) can render a 1.2 million poly scene at 65fps…

YET! Throw in LOD, so the scene is reduced to 120,000 polys, i.e. 10x fewer, and it only renders 2x as fast on the ati (120fps).

Fill-rate perhaps…

but this is what I mean,

The 1.2 million scene, btw, runs as fast on a g3 as it does on my g5, and even slower than on my g4 (which itself is slower than a g3 in all the tests I’ve carried out)…

Life is strange? They should try coding for a nvidia gpu… :wink:

Jared, of course it’s not that simple, but my engine is pretty much optimized to avoid the pitfalls.

I.e. that 17,000 poly scene has just as many surface changes/material state changes etc. as the 1.2 million one, so the only difference is in fact the tris rendered. Both are single-textured, single-light.

BUT, the 1.2 million scene is split up into a load of separate renders, i.e. no changes to the surface, just different world space. So the gpu is starting/stopping around 100 times per frame, yet still pumps 1.2 million out 17 times a sec even on my budget g5-5600xt 64-bit.
The 17,000 one, on the other hand, is a single render.

Perhaps GPUs are not designed to handle single high-poly objects/display lists? (Well, not to handle them as well as a load of small calls, anyway…)

But I’ll happily admit I’m far from an expert in these matters, which is why I started this thread :wink:
Don’t suppose any Nvidia OpenGL driver developers post here? :slight_smile:

Not that I know much about high-end optimizations and such, but I eventually found papers that recommend sending bigger chunks of geometry instead of many small ones.
There is a paper on nvidia’s developer site called batchbatchbatch.pdf on this topic.

I found display lists were extremely optimized on nvidia, but on ati VBOs were the same in performance as DLs, while still allowing more “changes” than the very static display lists…
Also, on a non-ati and non-nvidia card I found display lists to be no faster at all, and possibly less reliable.

I guess VBO is the way to go for the future, then, but until then DLs are likely the simplest and most widespread method for static objects…

maybe this has a few suggestions that help:
http://www.ati.com/developer/gdc/PerformanceTuning.pdf

There are quite a few pitfalls when it comes to VBO: buffers that are too small or too big, data that isn’t optimally aligned, etc.

For example, my “record” in pure transfer and transformation on a rad9800 was something over 200 million tris/sec (though that number still seems too high). 40–70 million seems to be a good range when there’s some typical texturing going on.