Why would VBO be faster than display lists?

Driver issues aside, why would VBOs be faster than display lists? People on this forum have indicated that in certain situations, VBOs are actually the preferred way to go. This question is strictly about static data.

My first thought was that NVidia drivers have been known to frustum cull all the geometry in a DL for you. Ignoring this, is there any other reason to use VBO over DLs when you know that your geometry will never change throughout the life of your application?

My first thought was that NVidia drivers have been known to frustum cull all the geometry in a DL for you.

They can’t. Unless you actually build the projection and modelview matrices into the display list, it cannot possibly do frustum culling.

Ignoring this, is there any other reason to use VBO over DLs when you know that your geometry will never change throughout the life of your application?

Sure. Building a fast display list is somewhat difficult. Some hardware doesn’t mind certain state changes in a DL, while other hardware does. You’d have to experiment to see what is acceptable in a DL.

With VBOs, as long as you use an optimized format, you’re getting the maximum performance the API allows.
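To be concrete, here is roughly what I mean by an optimized static format - a minimal sketch using the ARB_vertex_buffer_object entry points; the interleaved layout and the vertices/numVerts names are placeholders for whatever your data actually is:

/* Interleaved position + normal, uploaded once as static data. */
typedef struct { GLfloat pos[3]; GLfloat normal[3]; } Vertex;

GLuint vbo;
glGenBuffersARB(1, &vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, numVerts * sizeof(Vertex),
                vertices, GL_STATIC_DRAW_ARB);

/* With a buffer bound, the pointer arguments are byte offsets into it. */
glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (const GLvoid*)0);
glNormalPointer(GL_FLOAT, sizeof(Vertex), (const GLvoid*)(3 * sizeof(GLfloat)));
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);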

Why wouldn’t the driver be able to do frustum culling for display lists? Just build a bounding sphere in model space for the list and then cull away using the current modelview/projection matrices when the list is called. There was a thread on this a while back where someone claimed NVidia said they did frustum culling for display lists.

Korval, NVidia drivers do frustum cull DL geometry. They don’t magically pre-cull it, but they most definitely do some sort of bounding volume culling against the frustum.

I appreciate your input, as you’ve always given insightful posts on this forum, but I think I speak for most of us when I say that I could do without the haughty dismissal of ideas that you don’t agree with.

You often sound condescending and even disgusted when you oppose someone. It’s uncalled for.

[This message has been edited by CatAtWork (edited 10-19-2003).]

Hmm… since the modelview matrix can change each time, and thus the visibility of each triangle, you would have to cull every time you call the display list, which is just stupid (Korval described the only case where it can be efficient).

But I think it should be possible for the driver to build an optimized VBO internally, since as far as I can see it has more information and freedom than with a standard VBO. I haven’t heard of any driver that does such an optimization, though.

[This message has been edited by TheSillyJester (edited 10-19-2003).]

NVidia drivers DO frustum culling on display lists. I’ve tested it myself. They also have the same optimization in D3D drivers with static vertex buffers. It has been confirmed by their driver team, so there’s no question about that.

since the modelview matrix can change each time, and thus the visibility of each triangle, you would have to cull every time you call the display list, which is just stupid

Ok, so you’ve got a bounding box/frustum test per display list, per call. It’s not stupid; that’s what you’re doing when you perform frustum culling yourself. The problem is that if you also do frustum culling on your DLs yourself, the work is done twice.

Y.

Driver issues aside, indexed VBOs can take advantage of the post T&L vertex cache.
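For illustration, this is the kind of indexed submission that can hit the post-T&L cache - a sketch assuming the vertex arrays are already set up in a bound VBO, and that indices/numIndices are your index data; it is the reuse of indices between triangles that lets a transformed vertex be fetched from the cache instead of being run through T&L again:

/* Put the indices in their own buffer object, also static. */
GLuint ibo;
glGenBuffersARB(1, &ibo);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, ibo);
glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, numIndices * sizeof(GLushort),
                indices, GL_STATIC_DRAW_ARB);

/* Submit; the last argument is an offset into the bound index buffer. */
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, (const GLvoid*)0);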

Counting driver issues: A couple months ago, I did some profiling on different geometry submission methods and the amount of CPU time they use. You can tell by looking at the number of CPU cycles being used whether the DL is being stored on the card or in main memory/AGP. Old Nvidia cards took a lot of cycles rendering DLs. Newer Nvidia cards and ATI cards took very, very few cycles. The performance of VBOs, on the other hand, was a lot more consistent, which is why I prefer to always use VBOs over DLs whenever possible.
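(For clarity on the methodology: I was timing only the CPU side of the submission call, roughly like the sketch below. cpuSeconds() stands in for whatever high-resolution counter you have, and textList/submitTime are placeholder names.)

/* Hypothetical helper: returns CPU time in seconds from some
   high-resolution source (QueryPerformanceCounter, rdtsc, ...). */
extern double cpuSeconds(void);

double t0 = cpuSeconds();
glCallList(textList);            /* or the equivalent glDrawElements() call */
double t1 = cpuSeconds();
submitTime += t1 - t0;           /* accumulate over many frames, then average */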

edit - I’ve also read several times on this forum and others that people have found that DLs use significantly more memory than VBO/VAO/VAR. I use DLs to encapsulate OpenGL state changes, but very rarely for geometry.
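(By encapsulating state changes I mean something along these lines; the particular calls and the stoneTexture/stoneDiffuse names are only an example.)

/* Bake a block of state changes into a list once... */
GLuint stateList = glGenLists(1);
glNewList(stateList, GL_COMPILE);
    glEnable(GL_LIGHTING);
    glEnable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, stoneTexture);       /* example texture object */
    glMaterialfv(GL_FRONT, GL_DIFFUSE, stoneDiffuse);
glEndList();

/* ...then apply the whole block with a single call wherever it is needed. */
glCallList(stateList);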

[This message has been edited by Stephen_H (edited 10-19-2003).]

Originally posted by Ysaneya:
Ok, so you’ve got a bounding box/frustum test per display list, per call. It’s not stupid; that’s what you’re doing when you perform frustum culling yourself. The problem is that if you also do frustum culling on your DLs yourself, the work is done twice.

I was thinking of testing visibility for each triangle, but testing the whole DL once makes sense.

Why wouldn’t the driver be able to do frustum culling for display lists?

Because, in particular with vertex programs, vertices may not be transformed the way you expect them to be, for one. The position data with a vertex program is just a group of floats; there is no guarantee that the user isn’t doing something silly like using them for texture coordinates.

Granted, that’s an outside chance, but it is entirely possible that the user isn’t using the OpenGL modelview or projection matrices with a vertex program. And the driver certainly isn’t going to run the vertex program on the data on the CPU to determine where the output positions end up.

Of course, the driver can just turn off this frustum culling if the user binds a vertex program, but then I’d be at least partially correct. And, as we choose to do more vertex-program based operations, optimizations for the non-vertex program case become less and less important.

Driver issues aside, indexed VBOs can take advantage of the post T&L vertex cache.

Technically, there’s no reason why indexed data rendered into a display list can’t use the post-T&L cache either. After all, the display list code could easily just create a VBO itself.

That doesn’t mean that the post-T&L cache is being used in current DL implementations. It just means that, technically, it should be possible.

Old Nvidia cards took a lot of cycles rendering DLs. Newer Nvidia cards and ATI cards took very, very few cycles.

Were these geometry-only display lists, or did they include state changes as well?

Originally posted by Ysaneya:
NVidia drivers DO frustum culling on display lists. I’ve tested it myself.

I’d be very interested in hearing exactly how you tested it such that you could draw that conclusion.
I’m very sceptical of the idea of frustum culling on DLs. How are they going to do it? Triangle by triangle, every time you call the DL? Sounds more like a loss than a gain.

At one point in time, with Detonator 45.something, using display lists with the ffp (or a vp with position_invariant) behaved exactly as if I was doing manual frustum culling on AABBs for each list.

As soon as a list was no longer visible, framerate increased by a large amount. As soon as the first triangle of the list intersected the frustum, that performance increase was lost.

Originally posted by Humus:
I’m very sceptical of the idea of frustum culling on DLs. How are they going to do it? Triangle by triangle, every time you call the DL? Sounds more like a loss than a gain.

It seems quite simple to me - when you ‘compile’ a display list (at the glEndList() point), the driver can easily construct a bounding volume (be it sphere, box or whatever) for all the vertices you specified inside the list - it would start with an identity matrix, and it would transform the vertices by any matrix uploads specified inside the DL as it constructs the volume.
Then, at the time you issue a glCallList, it can simply transform that bounding volume by the current modelview matrix and test it against the view frustum - continuing to draw or terminating depending on the result. And since you can nest display lists (as is often done in scenegraphs), you can hierarchically reject large portions of the whole display list with one test (a rough sketch of the idea follows below).
Yes, of course they wouldn’t do the volume/frustum test if a vertex program was bound when a glCallList is issued - that would be silly. But the fixed vertex functions should be used by default unless you’re doing anything fancy - hence you’ll get a big performance boost from this intelligent DL optimisation.
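To make the idea concrete, here is roughly the application-side analogue of that test, with made-up structures for the box and the planes - the driver would do something equivalent internally, extracting the planes from the current projection and modelview matrices:

/* Axis-aligned box accumulated over the list's vertices at glEndList() time. */
typedef struct { float min[3], max[3]; } AABB;

/* At glCallList() time: extract the six frustum planes from the current
   projection*modelview matrix (so the box can stay in object space) and
   skip the whole list if all eight corners lie outside any one plane. */
int boxOutsideFrustum(const AABB* box, const float planes[6][4]) /* ax+by+cz+d, inside if >= 0 */
{
    int p, c;
    for (p = 0; p < 6; ++p) {
        int allOutside = 1;
        for (c = 0; c < 8; ++c) {
            float x = (c & 1) ? box->max[0] : box->min[0];
            float y = (c & 2) ? box->max[1] : box->min[1];
            float z = (c & 4) ? box->max[2] : box->min[2];
            if (planes[p][0]*x + planes[p][1]*y + planes[p][2]*z + planes[p][3] >= 0.0f) {
                allOutside = 0;   /* at least one corner is on the visible side */
                break;
            }
        }
        if (allOutside)
            return 1;             /* entirely behind one plane: cull the list */
    }
    return 0;                     /* possibly visible: draw as usual */
}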

But the fixed vertex functions should be used by default unless you’re doing anything fancy - hence you’ll get a big performance boost from this intelligent DL optimisation.

Unless, of course, you happen to be doing frustum culling yourself. This is likely in any performance-minded application, since you can’t guarantee that all DL implementations will do it for you. In which case it is a waste of good CPU time.

As for using fixed-function by default, why? Doing something “fancy” these days tends to be the common case, rather than the exception. It’s easier to write an engine based on an API you define (through your vertex programs) than to write one based on the standard GL API as well as your vp-based one.

Were these geometry-only display lists, or did they include state changes as well?

There were no state changes. I was just rendering a bunch of small quads (characters in a string of text) for differently sized strings - 1, 2, 4, 8, … 128 characters to see if VBOs or DLs were faster.

I tested on a 7500, a 9700, and a GF2, GF3, and GF4. In all cases except the 9700, the DLs used more CPU time than VBOs to make the render call. On the 9700, the DLs used roughly 10x fewer cycles for a render call than the best I got from VBOs. I am guessing that this is because ATI’s drivers attempt to store their DLs in card memory (see the performance FAQ ATI came out with a few months ago).

My testing was done about 6 months ago. I was using the most recent version of drivers at that time.

Note that my results say nothing about which is actually faster to render on the card, only which uses less CPU time to submit.

edit - I reviewed my timing logfiles before posting this time to be sure. I thought I remembered newer Nvidia cards being faster with DLs, but I was wrong.

[This message has been edited by Stephen_H (edited 10-20-2003).]