Display list question

I’m trying to design a reasonably high-performance OpenGL wrapper for a specific geometry-once, render-many case.

Past experience has shown me that on NVIDIA hardware at least, display lists can be faster than VBOs. So I thought I’d arrange things so that geometry always gets “baked” into a display list when it’s specified, using vertex arrays. I figure I can upgrade those to VBOs easily enough later if speed testing shows a benefit.

However, I’ve gotten a bit rusty, and there are a few things about display lists I don’t recall. I know glVertexPointer and glEnableClientState don’t get added to display lists. Do I therefore need to reset those each time I glCallList(), or will the display list compile in the relevant information on its own?

I’m particularly interested in how this affects glPrimitiveRestartIndexNV(), which also does not get added to display lists.

No, all vertex/index data is dereferenced at compile time. You can delete your data once you hit glEndList(); glCallList() will then draw the copy that was captured at compile time.
This is one of the reasons dlists are so fast: they don’t rely on how you laid out your data when they’re rendered. They copy your data and optimise it to hell and back. It doesn’t even stop there. Depending on the way you call your display lists, the driver can do other things at render time, such as splitting or merging lists. Wonderful mechanism. Deprecated though. Go figure.
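
To make that concrete for the original question, the bake-then-free pattern looks roughly like this (a minimal sketch; the function and parameter names are mine, not anything from the spec):

```c
#include <GL/gl.h>

/* Bake an indexed triangle mesh into a display list.  The pointer state
 * set here is client state and is NOT recorded in the list; it just has
 * to be valid while the list is compiled, because glDrawElements
 * dereferences (copies) the vertex/index data at compile time. */
GLuint bake_mesh(const GLfloat *verts, const GLushort *indices, GLsizei index_count)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);

    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, indices);
    glEndList();

    glDisableClientState(GL_VERTEX_ARRAY);
    /* The caller can free verts/indices now - the list owns its own copy. */
    return list;
}
```

At render time a plain glCallList(list) is enough for the compiled geometry; no vertex-pointer state needs to be rebound first.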

Oh, bear in mind that ATI/AMD consumer cards (non-workstation) have a crippled display-list compiler, in order to sell more workstation cards to the markets that actually use display lists (i.e. data-driven apps, not heavily pre-processed and packed game-like apps). Again, go figure.

Slide 29/35 of http://www.sci.utah.edu/~bavoil/opengl/bavoil_trimeshes_2005.pdf shows that VBOs and display lists have similar performance when there are fewer than roughly 50,000 vertices per display list.

The conclusion is to batch your big data into buckets below this limit.

Go experiment with your card. The threshold is probably close to the value of GL_MAX_ELEMENTS_VERTICES.
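
For what it’s worth, that limit is trivial to query (a quick sketch; GL_MAX_ELEMENTS_INDICES is the companion value):

```c
#include <GL/gl.h>
#include <stdio.h>

/* Query the recommended per-batch limits (from GL 1.2 /
 * EXT_draw_range_elements).  Purely advisory, but a reasonable
 * starting point for bucket sizes. */
void print_batch_limits(void)
{
    GLint max_verts = 0, max_indices = 0;
    glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &max_verts);
    glGetIntegerv(GL_MAX_ELEMENTS_INDICES, &max_indices);
    printf("max recommended vertices per batch: %d\n", max_verts);
    printf("max recommended indices per batch:  %d\n", max_indices);
}
```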

I don’t know if you can trust those numbers. In theory, VBOs should get faster the larger the batches are (due to reduced per-batch overhead), so those graphs should show a somewhat curved line sloping down from left to right.
Display lists should show similar behavior.

Now I’m not exactly 100% sure about this, but I’m guessing that:

  1. they are not measuring VBO/DL performance there.
  2. they are running the benchmark with v-sync on.
  3. they are running on old hardware with stupid limitations.

As long as VBO batches are above roughly 2,500 polys, they should be the fastest method.

I don’t understand your conclusion, zeoverlord.
You’re comparing apples and oranges.
You’ve basically said that explicit vertex data rendered as-is will be faster than a method that is a black box with completely vendor-specific, unspecified optimisations. If display lists are not faster than any other drawing method, then the IHV has either made a mistake in their code or they’ve deliberately crippled it for ‘some reason’. There are only two prices to pay with display lists: compilation time and the inability to read from or write to them after compilation. The payoff is that they should give the best performance, period.

Display list speed is entirely implementation dependent. Conformance testing only mandates what goes in and what comes out; you’re completely at the mercy of your driver for what happens in between. In the specific case of display lists, yes, they can be stored on the GPU, which will give the best performance, but they don’t have to be. The driver could just as easily store them in system memory, in which case you get zero performance advantage. Even worse, the driver could convert a nice fast vertex array into a series of glBegin/glEnd calls, which would make them run slower.

So unless you want to descend into the netherworld of driver-specific hacks (abandon hope all ye who enter here) I would avoid display lists like the plague.

You test your code on the target hardware. On nvidia hardware they give the best performance, so ‘avoid them like the plague’ and your app is going to suffer in comparison to apps that use display lists on nvidia hardware. Also, what you consider a “nice fast vertex array” may not be nice and fast for the hardware you throw it at.
Incidentally, I haven’t noticed anything in the spec about VBOs created with static_draw requiring them to be stored in GPU memory. The spec is just providing hints to the implementation. A display list is a single giant hint to the driver - you’re effectively saying static_draw. Again, nothing in the spec says what should be inferred from this, but it’s assumed to mean “give me best performance” - and that’s what you get, on nvidia and firegl cards.
The method we use is: static stuff gets compiled into display lists on nvidia hardware; on other hardware we use buffer objects with static_draw. Nothing needs to be avoided like the plague. You use the best options you’re given by each vendor.
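
Concretely, the choice can hang off a vendor-string check at startup (a rough sketch; the strstr heuristic and names are just illustrative):

```c
#include <GL/gl.h>
#include <string.h>

typedef enum { STATIC_PATH_DISPLAY_LIST, STATIC_PATH_VBO } StaticPath;

/* Pick the static-geometry path per vendor: display lists on nvidia,
 * buffer objects with GL_STATIC_DRAW everywhere else. */
StaticPath choose_static_path(void)
{
    const char *vendor = (const char *)glGetString(GL_VENDOR);
    if (vendor && strstr(vendor, "NVIDIA"))
        return STATIC_PATH_DISPLAY_LIST;
    return STATIC_PATH_VBO;
}
```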

And the GPU vendor is free to do the same and underclock the GPU every time it’s asked to render VBOs, so that performance might suffer even more than normal (client arrays sometimes being faster even without that handicap). So maybe we shouldn’t use the GPU at all and should hand-code all our rasterization on the CPU… Get to work on that, will you?

Most of your note is ridiculous, and everything peterfilm said in his post stands.

Yes display list perf is vendor specific, but so are the various permutations of other draw methods. Welcome to GPU development.

If you happen to be developing support for vendors that have great display list perf (i.e. don’t cripple them), and they give you a speed-up you need that you can’t easily get otherwise, you’re nuts if you don’t take it.

The goal of some recent threads has been to try to expose the “display list speed-ups” in the API via other means so they aren’t needed anymore (to obtain the best performance). Until then, clicking your heels and wishing this perf advantage gone, or trying to scare folks away with “here be dragons” nonsense is just childish. As with all things GPU, perf test all features on all GPUs you care about. We’re big boys and girls here – we’ll make our own calls.

Yes, but not really; I am of course also including everything else needed to render the same thing.

Well yeah, if you’re rendering the exact same thing, the exact same data, then with reasonably modern hardware (SM3 and up) the display list can only be as fast as a VBO, since the logical optimization would be to do the exact same thing.
What you’re arguing is that standalone rendering commands can’t get the same kind of optimizations as they enter the rendering queue that the “black box” rendering would get, especially since it’s from the same vendor, and that a black box doesn’t add any overhead in order to do all those things.

It’s faster than immediate mode, which it was designed to optimize; the basic reasoning was that if you could just send the finished vertex data instead of having to do the calculations on the fly, the CPU would be able to transfer the data to the GPU much faster.
But with vertex shaders and vertex buffers that is now obsolete.
And sure, on at least nvidia cards they basically convert the incoming vertex data to a VBO internally, but that’s not to make display lists a viable option; it’s to speed up legacy software. And speed is not the only thing; there are other reasons why it’s now deprecated.

Then what about the overhead of accessing the list and executing it?

Sorry zeoverlord, I don’t mean to be dismissive of the points you’ve raised, but when all is said and done, have you managed to match the speed of nvidia display lists on a quadro (with any other mechanism available through nvidia’s OpenGL implementation)?
If you have, could you please give me some hints?

By the way, I don’t want that last comment to seem in any way flippant or flame-like. I really would like to know how to beat the nvidia display list mechanism; I’ve tried everything over the years. I’m left with bindless graphics and maybe the basevertex extension to try (so I can still use 16-bit indices, but not have to rebind the buffer offsets).
Thanks if you can offer some insight.
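
For reference, the basevertex idea I mean is ARB_draw_elements_base_vertex: keep 16-bit indices for every sub-mesh in one big buffer and let the draw call add the offset, instead of rebinding the attribute pointers. Something like this (a sketch with illustrative parameters):

```c
/* Requires GL 3.2 or ARB_draw_elements_base_vertex (via GLEW or similar). */
#include <GL/glew.h>

/* Draw two sub-meshes packed back-to-back in one VBO/IBO pair.
 * basevertex is added to each 16-bit index before the vertex fetch,
 * so the vertex attribute pointers never have to be rebound. */
void draw_submeshes(GLsizei count0, GLsizei count1,
                    GLsizeiptr index_bytes0, GLint base1)
{
    glDrawElementsBaseVertex(GL_TRIANGLES, count0, GL_UNSIGNED_SHORT,
                             (void *)0, 0);
    glDrawElementsBaseVertex(GL_TRIANGLES, count1, GL_UNSIGNED_SHORT,
                             (void *)index_bytes0, base1);
}
```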

Bindless graphics (NV_vertex_buffer_unified_memory in particular) gets me very, very close to beating them.

This implies to me that if you feed a display list well-ordered triangles, a lot of the remaining speed-up comes from pre-resolving GPU VBO handles into GPU addresses, i.e. removing CPU memory-access inefficiency.

…but then who knows for sure what special sauce they’ve got in there :P
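
For anyone wanting to try it, the core of NV_vertex_buffer_unified_memory goes roughly like this (a sketch from my reading of the extension spec, with NV_shader_buffer_load providing residency; the layout and names are illustrative, error handling omitted):

```c
#include <GL/glew.h>  /* needs NV_vertex_buffer_unified_memory + NV_shader_buffer_load */

static GLuint      vbo;
static GLuint64EXT vbo_addr;   /* raw GPU virtual address of the buffer */

void setup_bindless(const GLfloat *verts, GLsizeiptr bytes)
{
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_STATIC_DRAW);

    /* Pin the buffer and fetch its GPU address once, up front. */
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vbo_addr);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

void draw_bindless(GLsizei vert_count, GLsizeiptr bytes)
{
    /* Switch the vertex puller to address mode and describe attrib 0
     * (tightly packed xyz floats). */
    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
    glEnableVertexAttribArray(0);
    glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat));

    /* No buffer binds at draw time - just the pre-resolved address. */
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, vbo_addr, bytes);
    glDrawArrays(GL_TRIANGLES, 0, vert_count);
}
```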

Thanks, darkphoton.
It’s at least encouraging that bindless won’t be a waste of time. I’m trying to find a bit of time to try it out on real data. No doubt I shall be back here enthusing about it! (I hope, anyway.)