Display Lists - The Next Generation (CBO)

As we all know, display lists have been deprecated since OpenGL 3.0… But their speed has never been matched, although there is a bunch of new “speedy” stuff (instancing, for drawing huge numbers of similar objects; VAOs, for collecting state; UBOs, for collecting uniforms; bindless, for direct access to VBOs; etc.).

Bindless graphics significantly reduces CPU cache misses. If we have a lot of VBOs in the scene, each of them has to be bound before drawing, and their IDs have to be translated into physical addresses; that is the stage bindless skips. BUT, if we have a lot of VBOs, we have A LOT OF FUNCTION CALLS, which makes our application CPU bound. Hundreds of thousands of function calls create enormous driver overhead, and no extension, as far as I know, tries to solve that problem. Instancing helps only for similar objects, and in many applications it is not suitable.
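To make the problem concrete, here is roughly what such a per-frame loop looks like (a minimal sketch; the Block structure and its fields are made up for illustration):

    /* One bind + attribute setup + draw per block; with 65K blocks this
       is hundreds of thousands of driver entry points per frame. */
    for (int i = 0; i < numBlocks; ++i) {
        glBindBuffer(GL_ARRAY_BUFFER, block[i].vbo);            /* ID -> address lookup */
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);  /* vertex state revalidated */
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, block[i].ibo);
        glDrawElements(GL_TRIANGLE_STRIP, block[i].indexCount, GL_UNSIGNED_INT, 0);
    }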

The only solution is to store the function calls in a “display list” and execute them with just one call. I will illustrate this with a scene containing more than 65K VBOs. Using a highly optimized approach with bindless access, the frame rate on a 9600GT is about 21 (up to 64 with view-frustum culling). Without bindless it is almost half that. What happens if we draw it with a single “old fashioned” display list? The frame rate without culling is 152. More than 7 times faster!!! Of course, this display list is not very useful. This is just an illustration.

The purpose of this long introduction is to draw attention to what I think is the greatest bottleneck in many applications today. In mine, certainly! :wink:

Well, with this post I want to gather your opinions about a suggestion for a new revision of OpenGL, or an extension. I think it would be very useful to have a new kind of display list that stores a batch of commands that can be invoked with a single call. The set of commands could be restricted to: activating a shader program, setting uniforms, setting attributes, and binding and drawing VBOs. It would be even more efficient if each command had its own slot, so that changing one command would not invalidate the entire buffer.

Because this buffered object does not require data management (VBO reorganization or similar), maybe command buffer object (CBO) is a more suitable name for it.
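To make the proposal concrete, here is how I imagine the API could look. None of the glGenCommandBuffers/glBeginCommandBuffer/glEndCommandBuffer/glCallCommandBuffer entry points below exist; they are invented purely to sketch the concept:

    /* Hypothetical API sketch - these entry points do not exist. */
    GLuint cbo;
    glGenCommandBuffers(1, &cbo);

    glBeginCommandBuffer(cbo);                               /* start recording      */
    glUseProgram(terrainProg);                               /* recorded into slot 0 */
    glBindBuffer(GL_ARRAY_BUFFER, blockVBO);                 /* slot 1               */
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);   /* slot 2               */
    glDrawArrays(GL_TRIANGLE_STRIP, 0, vertexCount);         /* slot 3               */
    glEndCommandBuffer();

    /* Per frame: a single call replays the whole recorded batch,
       removing the per-command driver overhead. */
    glCallCommandBuffer(cbo);

Since each recorded command sits in its own slot, replacing, say, slot 3 with a different draw call would not require rebuilding the rest of the buffer.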

I agree completely.

I think the display list mechanism is fine as it is. Maybe add a new enum in addition to the existing GL_COMPILE and GL_COMPILE_AND_EXECUTE. Something like GL_COMPILE_RESTRICTED. Then you spec a limited set of allowed commands in a restricted display list.
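Roughly like this, reusing the existing entry points (GL_COMPILE_RESTRICTED is the invented enum; note that under the current spec calls like glBindBuffer and glVertexAttribPointer are executed immediately rather than compiled into the list, so the restricted mode would have to change that):

    /* Existing display list API; only GL_COMPILE_RESTRICTED is new. */
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE_RESTRICTED); /* record only the allowed command set */
    glUseProgram(prog);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    glEndList();

    glCallList(list); /* replay the recorded batch in one call */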

Sorry, I was agreeing with you really. Just the way in which it’s formalised is up for debate. I don’t think a new set of entry points is really needed, seeing as there’s already something suitable in the API. glNewList is then just analogous to binding a vertex array object, in that it captures the command stream from all drawing commands until it’s unbound.
I’m sure we’re all in agreement that display lists are obviously worth keeping, because just look at the performance NVIDIA gets out of them! They are obviously a driver writer’s best friend.

I’m sure that they are not. :slight_smile:
Furthermore, for ATI’s driver developers they are a nightmare. :wink:

Okay, let’s be serious. DLs, as they were (and are now), are pretty complicated to implement. If we impose an additional burden, it could all collapse. What you see as just a new enumeration requires huge work underneath. glNewList has many disadvantages:

  • it does not reuse buffers already stored on the server side, so we get no benefit from previously created VBOs,
  • it completely rebuilds (and optimizes) the data, which is totally unnecessary and time costly (just as an illustration, building a huge DL takes from 2.5 up to 10 times longer than building the equivalent VBOs),
  • DLs cannot be changed, only destroyed and created again.

Old-fashioned DLs are a mixture of data and commands. With VBOs we do not need that; just a usable command list that can be “compiled” and stored on the server side, to be executed with one function call (totally removing the driver overhead).

This is just an illustration.

A very poor illustration, since it doesn’t even pretend to mirror any actual real-world use.

And have you tested this on ATI hardware? What is your performance speedup there?

I think it would be very useful to have a new kind of display list that stores a batch of commands that can be invoked with a single call.

And why do you think this will guarantee you significant performance improvements over buffer objects?

I’m sure we’re all in agreement that display lists are obviously worth keeping, because just look at the performance NVIDIA gets out of them!

I don’t agree. Look at the performance you don’t get out of them on ATI or Intel. Relying on display lists for performance is fiction.

Hi, Alfonse! Believe it or not, I have been waiting for you and your criticism to find out whether there is any weakness in my proposal. :smiley:

And, so far, there is none!

Before starting to critique, please patiently read what is written. I’m not against VBOs. The new DLs (or CBOs) would just be an efficient way to draw VBOs. And I don’t just think; I know for sure that it would boost drawing speed significantly! If you know what makes VBOs faster than immediate-mode drawing, you will also know why I’m claiming this.

It is a real application that draws a huge terrain with 65K blocks.

I haven’t, but I know that the speed gain would be smaller. The point is not the efficiency of NVIDIA’s DLs.

Before starting to critique, please patiently read what is written. I’m not against VBOs. The new DLs (or CBOs) would just be an efficient way to draw VBOs. And I don’t just think; I know for sure that it would boost drawing speed significantly!

You’re missing the point.

ATI’s display list implementation is no faster than using buffer objects yourself. This can be for one of two reasons: either their VBO code is really good, or they don’t care to optimize display lists.

If it’s the former, then this won’t get any performance benefits either. If it’s the latter, well, why would ATI care any more about optimizing CBOs than optimizing display lists? If ATI doesn’t care, they’re not suddenly going to start caring just because a new, more limited form of display list is available.

It is a real application that draws a huge terrain with 65K blocks.

Which you put into one display list. Which means you could have just as easily put them into one buffer object and drawn them with a sequence of glVertexAttribPointer/glDraw* calls.

It’s the buffer binding that’s killing your performance (technically, the bind followed by glVertexAttribPointer). If you just put the data into one buffer object, you’d only need to bind the buffer once.
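That is (a rough sketch, with made-up names, assuming the per-block indices were rebased when the data was packed into the shared buffers):

    /* Bind the shared buffers and set up the attributes once... */
    glBindBuffer(GL_ARRAY_BUFFER, bigVBO);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, bigIBO);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);
    glEnableVertexAttribArray(0);

    /* ...then only the draw calls remain in the per-frame loop. */
    for (int i = 0; i < numBlocks; ++i) {
        glDrawElements(GL_TRIANGLE_STRIP, block[i].indexCount, GL_UNSIGNED_INT,
                       (const GLvoid*)(block[i].firstIndex * sizeof(GLuint)));
    }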

I’m sure that they are not. :slight_smile:
Furthermore, for ATI’s driver developers they are a nightmare. :wink:
Sorry, Aleksandar, but I was actually quoting an NVIDIA driver writer there. It’s obvious really. Give them what you want to draw, and they can pretty much optimise it specifically for their hardware. If you are too specific (so they’re unable to rearrange the buffer layouts, concatenate them, etc.), then there’s a huge amount they cannot do.

As for ATI, well, yes, their display list performance is poor on the consumer cards, but on the workstation cards their dlist performance is almost at NVIDIA’s level. They’re just choosing to use their optimal display list compiler to drive sales of FireGL workstation cards (non-game software tends to use display lists because its content is user-driven and therefore needs the driver’s help where possible). If they didn’t, then almost all engineering software would run at a snail’s pace and nobody would buy FireGL cards.
At the end of the day, you’re proposing an object that just stores up some buffer binds and draw commands. That is certainly not going to give you the performance you’re currently getting from display lists (on NVIDIA and FireGL cards). They do so much more than save you some CPU cycles. As for display lists not reusing buffer objects, etc., well, the buffer objects you created before compiling the display list are irrelevant to the driver by then… they’re just a source for its dlist compiler. The driver will create its own buffer objects, or concat onto its existing ones. Compile the display list and delete your buffers! Or don’t even use buffers when compiling them. It’s all static data anyway.

As a side note, I have in the past written a batch/buffer manager that managed to pretty much match nvidia’s display list compiler, but unfortunately it took about 5 times as long to build them. About 4 months later nvidia’s display list compiler was beating mine after a driver update. This is the power of having display lists in the driver.

Alfonse, I agree that you shouldn’t rely on display list performance for anything other than NVIDIA and FireGL cards. Fortunately, our customers wouldn’t use anything else. It’s obviously different for games companies - who wouldn’t be using OpenGL anyway… if you are, and you’re not targeting mac/linux, then might I suggest you switch to d3d. As far as I can see, the only two reasons for using OpenGL on Windows are quad-buffered stereo and, ironically, display lists.

[edit] a pre-emptive apology. Having read back what I’ve just written, I have to say I come across a little [censored]-sure. I didn’t mean to; I’m open to debate on the subject. It’s just that I’ve battled with display lists for a long time, and I absolutely know they’re doing some very clever stuff when you render large numbers of batches with very little state change going on. A few years back I would have loved to ditch them and go full-on static buffer objects, but that just doesn’t give me the performance I need to compete with other products that stick with display lists. Now I just think they’re absolutely the right way to do things and should never be taken away. Even more so now that we can do so much on the GPU, so dynamic buffers filled with data generated on the CPU aren’t as important.

One more thing before I shut the heck up - I would like to see some priority put into making display list compilation asynchronous, along with shader compilation. It’s all driver work anyway, with just the internal buffer uploads involving the GPU (and those could be scheduled so that no frames are dropped).

Come on, Alfonse… Have you ever read about performance issues, and how the driver makes an application CPU bound when there are a lot of function calls? I’m really sorry… I expected more constructive comments from you. :frowning:

And yes, putting the whole terrain into a single DL or VBO is nonsense. At least each LOD should have a separate VBO/DL. My LOD scheme requires far more blocks (VBOs) inside each LOD ring in order to exploit spatial coherence and minimize updates during horizontal movement. It was just an example of the performance issues.

Peterfilm, do not feel sorry or apologize. Any advice, comment or opinion is valuable. Thank you for sharing your experience with us. Oops, I sound like a preacher or a psychiatrist. :slight_smile: Sorry! That was not my intent. But I’m really thankful for your comments. Display lists really are the fastest way to represent static geometry, but they cannot be updated as fast as VBOs. With a command buffer, I think VBOs could come very close to DL performance, and maybe even outperform them.

Have you ever read about performance issues, and how the driver makes an application CPU bound when there are a lot of function calls?

Are you kidding? Are you actually suggesting that the reason that binding 65,000 buffer objects and rendering with them is slower than rendering with a single display list is solely because of function call overhead?

So the fact that the driver must access 65,000 buffer objects and pull out a GPU address doesn’t matter. The fact that the driver must access 65,000 index buffer objects and pull out a GPU address doesn’t matter. The fact that each of these memory accesses is effectively an uncached memory read isn’t the performance problem.

No. You’re saying the problem is that you’re calling too many functions.

:rolleyes:

I’m sorry, but no: the performance problem isn’t simple function call overhead. The kind of functions you call matters. Some operations are more expensive than others.

And yes, putting the whole terrain into a single DL or VBO is nonsense. At least each LOD should have a separate VBO/DL.

Putting all the terrain in a display list makes no sense, because you can only render either none of the terrain or all of it. This is almost never what you want to do.

Buffer objects are not like display lists. You can render whatever parts of them you want. It will be faster to do one bind and 65,000 draw calls than to do 65,000 binds and 65,000 draw calls.

I don’t see a reason to put each LOD of a model in a separate buffer object.

My LOD scheme requires far more blocks (VBOs) inside each LOD ring in order to exploit spatial coherence and minimize updates during horizontal movement. It was just an example of the performance issues.

But your example is invalid because it shows bad buffer object usage. If you compare bad buffer object usage to artificially optimal display list usage, how can you reasonably expect valid results?

Aleksandar, have you tried using some custom VBO memory management, like peterfilm and Alfonse mentioned? The idea is to pack the data of many VBOs that share the same interleaved vertex format into one big VBO, and then use glDrawRangeElementsBaseVertex() instead of glDrawElements().

This way, if you have 60,000 meshes but only 20 vertex layouts, you’ll effectively call glBindBuffer+glVertexAttribPointer only 20 times, and have only 20 VBOs. Sorting by VBO needs to be added in the scenegraph traversal, naturally. Calling a glDraw** (glDrawRangeElementsBaseVertex, specifically) 60,000 times will still be necessary.
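A rough sketch of what I mean (the mesh[] bookkeeping and the setupInterleavedAttribPointers() helper are made up; the point is that each mesh keeps zero-based indices, and the basevertex parameter rebases them into the shared VBO at draw time):

    /* Once per vertex layout (~20 times total): */
    glBindBuffer(GL_ARRAY_BUFFER, layoutVBO);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, layoutIBO);
    setupInterleavedAttribPointers(); /* the layout's glVertexAttribPointer calls */

    /* Still one draw call per mesh: */
    for (int i = 0; i < numMeshesInLayout; ++i) {
        glDrawRangeElementsBaseVertex(
            GL_TRIANGLES,
            0, mesh[i].vertexCount - 1,                /* range of the zero-based indices  */
            mesh[i].indexCount, GL_UNSIGNED_SHORT,
            (const GLvoid*)mesh[i].indexOffsetBytes,   /* where its indices start in the IBO */
            mesh[i].baseVertex);                       /* where its vertices start in the VBO */
    }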

Instancing can still be used to draw many meshes at once that have completely different geometry and vertex-attribute values (but the same attribute layout) and uniform params; you just have to match the NumPrimitives of such meshes by creating degenerate triangles, override the vertex input assembler, and do without the post-VS cache. Of course, this is only necessary and natural when there’s a severe CPU bottleneck.

Thanks, Aleksandar, for your understanding.

Ilian, I’ve tried it, but as I say, I never got the same performance as dlists (damn near, though). That suggests it’s not just about sorting by attrib format. There’s something else going on.

Alfonse, have you ever beaten display lists for static geometry on a Quadro or FireGL card? Or even come within 5% of them?

By the way, Aleksandar, are you sure you’re eliminating cache misses in your own data structures when you’re benchmarking?
It’s just that I’ve seen people compile whole branches of scenegraphs into display lists, then compare the performance with VBOs drawn at the leaves. What they’re seeing is that the display list path also eliminates traversal of big fat node structures strewn all over the working set.

(but a like-for-like test still shows dlists outperform even the most organised buffer manager by at least 5%…which in large datasets is the difference between interactive and a slide show).

@peterfilm, could that 5% difference be because of:

  • calling glUniform many times before glDrawXX (vs. having all constants batched together with DLs, in VRAM). Curable by batching the per-instance/mesh uniforms into one mat4[] array, I hope (see the sketch after this list).
  • having the triangle indices ordered inefficiently for post-VS caching (vs. the driver doing index/data priority optimization, or unwelding it all into a continuous stream of 3*NumTris vertices)
  • having vertex attributes in the VBO with imperfect alignment (vs. repacking and padding to e.g. 32-byte multiples)
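On the first point, a minimal sketch of what I mean by batching (the perMesh uniform name, meshIndex, and numMeshes are all made up):

    /* GLSL side (assumed): uniform mat4 perMesh[256];
                            ... gl_Position = perMesh[meshIndex] * position; */
    /* C side: upload ALL per-mesh matrices with one call instead of
       issuing one glUniformMatrix4fv per mesh. */
    GLint loc = glGetUniformLocation(prog, "perMesh");
    glUniformMatrix4fv(loc, numMeshes, GL_FALSE, (const GLfloat*)matrices);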

Alfonse, have you ever beaten display lists for static geometry on a Quadro or FireGL card? Or even come within 5% of them?

I think you’re misunderstanding.

Beating display lists isn’t the goal, because display list performance is unreliable across hardware.

Buffer object performance is reliable. Thus, doing something that improves this performance will improve performance generally.

The idea behind the CBO is to improve buffer object performance. However, the thinking behind the CBO is based on incorrect ideas about where buffer objects lose performance. So the idea doesn’t really solve the problem.

  • no, the uniforms are not compiled into the display lists, just the geometry. So comparing VBOs to dlists is pretty much direct. How I handle the uniforms is irrelevant to this thread, I think you’ll agree; both my dlist and VBO geometry managers benefit from those optimisations.
  • the indices and vertices are sorted into fetch order and for optimal ACMR, yes (Forsyth etc.). Interestingly, the NVIDIA dlist compiler doesn’t seem to do this for you: you give it unsorted indices/vertices and you get worse performance. As I say, something else is going on beyond what NVIDIA expects you to have already done.
  • nah, verts are aligned to 16 words.

This thread is interesting, because it’s bringing people’s experiences together. It is missing some hard numbers, though. You must all have done these comparisons before?

Yes, Alfonse, I think I understand your point now. You’re saying that buffer objects are reliable across multiple vendors while dlists aren’t. That’s the current state, yes, which is unfortunate. You must admit that in an ideal world ATI/Intel would just sort out their dlist compilers (or, in ATI’s case, just use the one from the FireGL drivers)?

By the way, you also said that the buffer binds were the problem, but they’re not. The glVertexAttribPointer call is the bottleneck (which you alluded to, but you seemed to be suggesting it’s the bind followed by the attribptr call). That call is just as expensive whether you’ve changed the currently bound buffer or not. Your basic idea is correct, though, except it should be a single buffer where the data is laid out in attribute-format order and drawn in that order.
Bindless seems to have sorted a lot of this out by finally separating the vertex format from the buffer offsets (like d3d did ages ago).

Ilian, glDrawRangeElementsBaseVertex isn’t supported on older cards, unfortunately. Considering there’s a reasonable architecture change needed to take full advantage of it, that makes it a bit on the useless side. This suggests it required a hardware change, to literally add a base index onto each index fetch on the card. But it was always in d3d’s DrawIndexedPrimitive, so was specifying a non-zero base index in that d3d function a performance killer? Never tried it.