Display Lists - The Next Generation (CBO)



Aleksandar
04-23-2010, 04:29 AM
As we all know, display lists have been deprecated since OpenGL 3.0... But their speed has never been matched, although there is a bunch of new "speedy" stuff (like: instancing - for drawing huge numbers of similar objects, VAO - for collecting states, UBO - for collecting uniforms, bindless - for direct access to VBOs, etc.).

Bindless graphics significantly reduces CPU cache misses. If we have a lot of VBOs in the scene, each of them has to be bound before drawing. Their IDs have to be translated into physical addresses, and that is the stage skipped by bindless. BUT, if we have a lot of VBOs, we have A LOT OF FUNCTION CALLS, which makes our application CPU bound. Hundreds of thousands of function calls make the driver overhead enormous, and no extension, as far as I know, tries to solve that problem. Instancing is useful only for similar objects, and in many applications it is simply not suitable.
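Just to make concrete what I mean by "a lot of function calls", the typical per-object inner loop looks roughly like this (a minimal sketch with illustrative names, not the actual code of my application):

// Rough sketch of the classic per-object path.
for (int i = 0; i < objectCount; ++i) {
    glBindBuffer(GL_ARRAY_BUFFER, obj[i].vbo);             // handle -> address lookup inside the driver
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, obj[i].stride, (void*)0);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, obj[i].ibo);
    glDrawElements(GL_TRIANGLES, obj[i].indexCount, GL_UNSIGNED_INT, (void*)0);
}
// Several driver entries per object, repeated tens of thousands of times per frame.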

The only solution is to store the function calls in a "display list" and execute them with just one call. I will illustrate this with a scene having more than 65K VBOs. Using a highly optimized approach with bindless access, the frame rate on a 9600GT is about 21 (up to 64 with view-frustum culling). Without bindless it is almost halved. What happens if we draw it with a single "old fashioned" display list? The frame rate without culling is 152. More than 7 times faster!!! Of course, this display list is not very useful. This is just an illustration.

The purpose of this long introduction is to draw attention to what I think is the greatest bottleneck in many applications today. In mine certainly! ;)

With this post I want to gather your opinions about a suggestion for a new revision of OpenGL, or a new extension. I think it would be very useful to have a new kind of display list that stores a batch of commands that can be invoked with a single call. The list of commands could be restricted to: activating a shader program, setting uniforms, setting attributes, binding and drawing a VBO. It would be even more efficient if each command had its own slot, so that changing one command would not affect the entire buffer.

Because this buffer object does not require data management (VBO reorganization or similar), maybe command buffer object (CBO) is a more suitable name for it.
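To make the idea a little more concrete, here is a purely hypothetical sketch of how such an extension might look. None of these entry points exist; the names are invented just for discussion:

// Hypothetical CBO API -- invented names, for illustration only.
GLuint cbo;
glGenCommandBuffersCBO(1, &cbo);                       // hypothetical
glBeginCommandBufferCBO(cbo);                          // hypothetical: start recording
    glUseProgram(terrainProgram);                      // recorded, not executed
    glBindVertexArray(blockVAO);
    glDrawElements(GL_TRIANGLES, blockIndexCount, GL_UNSIGNED_INT, (void*)0);
    // ... thousands more recorded commands ...
glEndCommandBufferCBO();                               // hypothetical: "compile" the command list

// Per frame: the whole batch replayed with a single driver entry.
glCallCommandBufferCBO(cbo);                           // hypothetical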

Jan
04-23-2010, 05:34 AM
I agree completely.

peterfilm
04-23-2010, 05:43 AM
I think the display list mechanism is fine as it is. Maybe add a new enum in addition to the existing GL_COMPILE and GL_COMPILE_AND_EXECUTE. Something like GL_COMPILE_RESTRICTED. The spec would then define a limited set of commands allowed in a restricted display list.
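Roughly like this, reusing the existing entry points (GL_COMPILE_RESTRICTED being the hypothetical new enum; everything else is the familiar API):

GLuint list = glGenLists(1);
glNewList(list, GL_COMPILE_RESTRICTED);   // hypothetical enum: only a limited command set allowed
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
glEndList();

glCallList(list);                         // replays the captured commands in one call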

peterfilm
04-23-2010, 09:04 AM
Sorry, I was agreeing with you really. Just the way in which it's formalised is up for debate. I don't think a new set of entry points is really needed, seeing as there's already something suitable in the API. glNewList is then just analogous to binding a vertex array object, in that it captures the command stream from all drawing commands until it's unbound.
I'm sure we're all in agreement that display lists are obviously worth keeping because just look at the performance NVidia get out of them! They are obviously a driver writer's best friend.

Aleksandar
04-23-2010, 11:19 AM
...They are obviously a driver writer's best friend.
I'm sure that they are not. :)
Furthermore, for ATI's driver developers they are a nightmare. ;)

Okay, let's be serious. DLs, as they were (and are now), are pretty complicated to implement. If we impose an additional burden, it could collapse. What you see as an enumeration requires huge work underneath. glNewList has many disadvantages:
- it does not reuse buffers already stored on the server side, so we get no benefit from previously created VBOs,
- it completely rebuilds (and optimizes) the data, which is totally unnecessary and time-consuming (just as an illustration, building a huge DL takes from 2.5 up to 10 times longer than building the equivalent VBOs),
- DLs cannot be changed, only destroyed and created again.

Old-fashioned DLs are a mixture of data and commands. With VBOs we do not need that; just a usable command list that can be "compiled" and stored on the server side, to be executed with one function call (totally removing the driver overhead).

Alfonse Reinheart
04-23-2010, 11:22 AM
This is just an illustration.

A very poor illustration, since it doesn't even pretend to mirror any actual real-world use.

And have you tested this on ATI hardware? What is your performance speedup there?


I think it would be very useful to have a new kind of display lists that would store batch of commands that can be invoked with a single call.

And why do you think this will guarantee you significant performance improvements over buffer objects?


I'm sure we're all in agreement that display lists are obviously worth keeping because just look at the performance NVidia get out of them!

I don't agree. Look at the performance you don't get out of them on ATI or Intel. Relying on display lists for performance is fiction.

Aleksandar
04-23-2010, 11:42 AM
And why do you think this will guarantee you significant performance improvements over buffer objects?
Hi, Alfonse! Believe it or not, I have been waiting for you and your criticism to find out if there is any weakness in my proposal. :D

And, so far, there is none!

Before you start to critique, please patiently read what is written. I'm not against VBOs. The new DLs (or CBOs) would be just an efficient way to draw VBOs. And I don't just think; I know for sure that it would boost drawing speed significantly! If you know what makes VBOs faster than immediate mode drawing, you will also know why I'm claiming this.

Aleksandar
04-23-2010, 12:02 PM
A very poor illustration, since it doesn't even pretend to mirror any actual real-world use.
It is a real application that draws a huge terrain with 65K blocks.


And have you tested this on ATI hardware? What is your performance speedup there?
I haven't, but I know that the speed gain will be less. The point is not the efficiency of NVIDIA's DLs.

Alfonse Reinheart
04-23-2010, 12:24 PM
Before you start to critique, please patiently read what is written. I'm not against VBOs. The new DLs (or CBOs) would be just an efficient way to draw VBOs. And I don't just think; I know for sure that it would boost drawing speed significantly!

You're missing the point.

ATI's display list implementation is no faster than doing buffer objects yourself. This can be for one of two reasons: either their VBO code is really good, or they don't care to optimize display lists.

If it's the former, then this won't get any performance benefits either. If it's the latter, well, why would ATI care any more about optimizing CBOs than optimizing display lists? If ATI doesn't care, they're not suddenly going to start caring just because a new, more limited form of display list is available.


It is a real application that draws a huge terrain with 65K blocks.

Which you put into one display list. Which means you could have just as easily put them into one buffer object and drawn them with a sequence of glVertexAttribPointer/glDraw* calls.

It's the buffer binding that's killing your performance (technically, the bind followed by glVertexAttribPointer). If you just put the data into one buffer object, you'd only need to bind the buffer once.
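Something along these lines (a rough sketch, assuming all blocks share one vertex format and one index buffer; names are illustrative):

// One bind for the whole terrain, then one draw per block.
glBindBuffer(GL_ARRAY_BUFFER, terrainVBO);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride, (void*)0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, terrainIBO);

for (int i = 0; i < blockCount; ++i) {
    // the byte offset selects each block's slice of the shared index buffer
    glDrawElements(GL_TRIANGLES, block[i].indexCount, GL_UNSIGNED_INT,
                   (void*)(block[i].firstIndex * sizeof(GLuint)));
}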

peterfilm
04-23-2010, 01:41 PM
...They are obviously a driver writers best friend.
I'm sure that they are not. :)
Furthermore, for the ATI drivers developers they are the nightmare. ;)
Sorry Aleksandar, but I was actually quoting an nvidia driver writer there. It's obvious really. Give them what you want to draw, and they can pretty much optimise it specifically for their hardware. If you are too specific (so they're unable to rearrange the buffer layouts and concatenate them etc.) then there's a huge amount they cannot do. As for ATI, well yes, their display list performance is poor on the consumer cards, but on the workstation cards their dlist performance is almost at nvidia's level. They're just choosing to use their optimal display list compiler to drive sales of fireGL workstation cards (non-game software tends to use display lists because its content is user-driven and therefore needs the driver's help where possible). If they didn't, then almost all engineering software would run at a snail's pace and nobody would buy fireGL cards.
At the end of the day, you're proposing an object that just stores up some buffer binds and draw commands. That is certainly not going to give you the performance you're currently getting from display lists (on nvidia and fireGL cards). They do so much more than save you some CPU cycles. As for display lists not reusing buffer objects etc., well, the buffer objects you created before compiling the display list are irrelevant to the driver by then... they're just a source for its dlist compiler. They'll create their own buffer objects, or concatenate onto their existing ones. Compile the display list and delete your buffers! Or don't even use buffers when compiling it. It's all static data anyway.

As a side note, I have in the past written a batch/buffer manager that managed to pretty much match nvidia's display list compiler, but unfortunately it took about 5 times as long to build them. About 4 months later nvidia's display list compiler was beating mine after a driver update. This is the power of having display lists in the driver.

Alfonse, I agree that you shouldn't rely on display list performance for anything other than nvidia and firegl cards. Fortunately, our customers wouldn't use anything else. It's obviously different for games companies - who wouldn't be using OpenGL anyway... if you are, and you're not targeting mac/linux, then might I suggest you switch to d3d. As far as I can see, the only two reasons for using OpenGL on Windows are quad-buffered stereo and, ironically, display lists.

[edit] a pre-emptive apology. Having read back what I've just written, I have to say I come across a little [censored]-sure. I didn't mean to, I'm open to debate on the subject. It's just that I've battled with display lists for a long time, and I absolutely know they're doing some very clever stuff when you render large numbers of batches with very little state change going on. A few years back I would have loved to have been able to ditch them and go full-on static buffer objects, but it just doesn't give me the performance I need to compete with other products that stick with display lists. Now I just think they're absolutely the right way to do things and should never be taken away. Even more so now that we can do so much on the GPU, and filling dynamic buffers with data generated on the CPU isn't as important.

One more thing before I shut the heck up - I would like to see some priority put into making display list compilation asynchronous, along with shader compilation. It's all driver work anyway, with just internal buffer uploads involving the GPU (which could be scheduled correctly so that no frames are dropped).

Aleksandar
04-23-2010, 02:50 PM
...
If it's the former, then this won't get any performance benefits either.

Come on, Alfonse... Have you ever read about performance issues, and how the driver makes an application CPU bound if there are a lot of function calls? I'm really sorry... I expected more constructive comments from you. :(

And yes, putting the whole terrain into a single DL or VBO is nonsense. At least each LOD should have a separate VBO/DL. My LOD scheme requires far more blocks (VBOs) inside each LOD ring in order to exploit spatial coherence and minimize updates during horizontal movement. It was just an example of performance issues.

Aleksandar
04-23-2010, 03:01 PM
Peterfilm, do not feel sorry or apologize. Any advice, comment or opinion is valuable. Thank you for sharing your experience with us. Oops, I sound like a preacher or a psychiatrist. :) Sorry! That was not my intent. But I'm really thankful for your comments. Display lists really are the fastest way to represent static geometry, but they cannot be updated as fast as VBOs. With a command buffer, I think VBOs could come very close to DL performance, and maybe even outperform them.

Alfonse Reinheart
04-23-2010, 03:22 PM
Have you ever read about performance issues, and how driver makes an application a CPU bound if there are a lot of function calls?

Are you kidding? Are you actually suggesting that the reason that binding 65,000 buffer objects and rendering with them is slower than rendering with a single display list is solely because of function call overhead?

So the fact that the driver must access 65,000 buffer objects and pull out a GPU address doesn't matter. The fact that the driver must access 65,000 index buffer objects and pull out a GPU address doesn't matter. The fact that each of these memory accesses is effectively an uncached memory read isn't the performance problem.

No. You're saying the problem is that you're calling too many functions.

:rolleyes:

I'm sorry, but no: the performance problem isn't simple function call overhead. The kind of functions you call matters. Some operations are more expensive than others.


And yes, putting the whole terrain into a single DL or VBO is nonsense. At least each LOD should have a separate VBO/DL.

Putting all the terrain in a display list makes no sense, because you can only render either none of the terrain or all of it. This is almost never what you want to do.

Buffer objects are not like display lists. You can render whatever parts of it you want. It will be faster to do one bind and 65,000 draw calls than to do 65,000 binds and 65,000 draw calls.

I don't see a reason to put each LOD of a model in a separate buffer object.


My LOD scheme requires far more blocks (VBOs) inside each LOD ring in order to exploit spatial coherence and minimal update during horizontal movements. It was just an example of performance issues.

But your example is invalid because it shows bad buffer object usage. If you compare bad buffer object usage to artificially optimal display list usage, how can you reasonably expect valid results?

Ilian Dinev
04-23-2010, 03:46 PM
Aleksandar, have you tried using some custom VBO memory management, like peterfilm and Alfonse mentioned? The idea is to pack the data of many VBOs that have the same interleaved vertex format into one big VBO, and then use glDrawRangeElementsBaseVertex() instead of glDrawElements().

This way, if you have 60,000 meshes but 20 vtx-layouts, you'll effectively call glBindBuffer+glVertexAttribPtr[] only 20 times, and have only 20 VBOs. Sorting by VBO needs to be added in the scenegraph traversal, naturally. Calling a glDraw** (glDrawRangeElementsBaseVertex, specifically) 60,000 times will still be necessary.
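Roughly like this (a sketch, assuming the meshes were packed back-to-back into one shared VBO/IBO at load time; names are illustrative):

// One bind + vertex setup per layout, then one BaseVertex draw per mesh.
glBindBuffer(GL_ARRAY_BUFFER, packedVBO);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, vertexStride, (void*)0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, packedIBO);

for (int i = 0; i < meshCount; ++i) {
    glDrawRangeElementsBaseVertex(GL_TRIANGLES,
                                  0, mesh[i].vertexCount - 1,          // index range within this mesh
                                  mesh[i].indexCount, GL_UNSIGNED_INT,
                                  (void*)(mesh[i].firstIndex * sizeof(GLuint)),
                                  mesh[i].baseVertex);                 // offset added to every index
}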

Instancing can still be used to draw many meshes at once, even meshes with completely different geometry and vtxattrib values (but the same vtxattrib layout) and uniform params; you just have to match the NumPrimitives of such meshes by creating degenerate triangles, override the vtx input assembler and forgo the post-VS cache. Of course, this is only necessary and natural when there's a severe CPU bottleneck.

peterfilm
04-23-2010, 04:07 PM
thanks Aleksandar for your understanding.

Ilian, I've tried it but, as I say, never got the same performance as dlists (but damn near). That suggests it's not just about sorting by attrib format. There's something else going on.

Alfonse, have you ever beaten display lists for static geometry on a quadro or firegl card? or even come within 5% of them?

peterfilm
04-23-2010, 04:14 PM
By the way, Aleksandar, are you sure you're eliminating cache misses in your own data structures when you're benchmarking?
It's just that I've seen people compile whole branches of scenegraphs into display lists, then compare the performance with VBOs drawn at the leaves. What they're seeing is that the display list path also eliminates traversal of big fat node structures strewn all over the working set.

(but a like-for-like test still shows dlists outperform even the most organised buffer manager by at least 5%...which in large datasets is the difference between interactive and a slide show).

Ilian Dinev
04-23-2010, 04:38 PM
@peterfilm, could that 5% difference be because of:
- calling glUniform many times before glDrawXX (vs having all constants batched together with DLs, in VRAM)? Curable by batching the per-instance/mesh uniforms into one mat4[] array, I hope (a rough sketch follows below).
- having the triangle indices ordered in a way that's inefficient for post-VS caching (vs the driver doing index/data priority optimization, or unwelding it all into a continuous stream of 3*NumTris vertices)?
- having vtxattribs in the VBO with imperfect alignment (vs repacking and inflating to e.g. 32-byte multiples)?
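What I mean by batching the uniforms, roughly (a sketch; it assumes the shader declares something like "uniform mat4 instanceMatrix[64]" and indexes it with gl_InstanceID or a per-vertex ID):

// One glUniformMatrix4fv call uploads all per-instance matrices at once,
// instead of one glUniform* call per mesh before every draw.
#define BATCH_SIZE 64
GLfloat matrices[BATCH_SIZE * 16];    /* filled with the per-instance transforms */
GLint loc = glGetUniformLocation(prog, "instanceMatrix");
glUniformMatrix4fv(loc, BATCH_SIZE, GL_FALSE, matrices);
/* The vertex shader then picks its matrix, e.g. instanceMatrix[gl_InstanceID]. */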

Alfonse Reinheart
04-23-2010, 05:10 PM
Alfonse, have you ever beaten display lists for static geometry on a quadro or firegl card? or even come within 5% of them?

I think you're misunderstanding.

Beating display lists isn't the goal, because display list performance is unreliable across hardware.

Buffer object performance is reliable. Thus, doing something that improves this performance will improve performance generally.

The idea behind CBO is to improve buffer object performance. However, the thought behind CBO is based on incorrect ideas about where buffer objects lose performance. So the idea doesn't really solve the problem.

peterfilm
04-23-2010, 05:14 PM
- no, the uniforms are not compiled into display lists. Just the geometry. So comparing VBOs to dlists is pretty much direct. How I handle the uniforms is irrelevant to this thread, I think you'll agree. Both my dlist and VBO geometry managers benefit from those optimisations.
- the indices and vertices are sorted into fetch order and optimal ACMR, yes (Forsyth etc.). Interestingly, the nvidia dlist compiler doesn't seem to do this for you. You give it unsorted indices/vertices and you get worse performance. As I say, something else is going on beyond what nvidia expect you to have already done.
- nah, verts are aligned to 16 words.

this thread is interesting, because it's bringing peoples experiences together. It is missing some hard numbers though. You must all have done these comparisons before?

peterfilm
04-23-2010, 05:33 PM
Yes, Alfonse, I think I understand your point now. You're saying that buffer objects are reliable across multiple vendors while dlists aren't. That's the current state, yes, which is unfortunate. You must admit that in an ideal world ATI/Intel would just sort out their dlist compilers (or in ATI's case, just use the one from the fireGL drivers)?

By the way, you also said that the buffer binds were the problem, but they're not. The glVertexAttribPointer call is the bottleneck (which you alluded to, but you seemed to be suggesting it's the bind followed by the attribptr call). That call is just as expensive whether you've changed the currently bound buffer or not. Your basic idea is correct though, except it should be a single buffer where the data is laid out in attribute format order, and drawn in that order.
Bindless seems to have sorted a lot of this out by finally separating the vertex format from the buffer offsets (like d3d did ages ago).

Ilian, glDrawRangeElementsBaseVertex isn't supported on older cards... unfortunately. Considering there's a reasonable architecture change needed to fully take advantage of it, that makes it a bit on the useless side. This suggests it required a change in the hardware, to literally add a base index onto each index fetch on the card. But... it was always in d3d's DrawIndexedPrimitive, so was specifying a non-zero base index in that d3d function a performance killer? Never tried it.

Ilian Dinev
04-23-2010, 05:47 PM
Here's an ancient bench I did, incomplete for the purposes of this thread:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=255602#Post255602

My experience is limited, as it's just a hobby for me, and the actual data I work with isn't much; the frames are easily gpu-bound. I sort by material_pass->mesh_buffer->instance, so there are few buffer-bind/vtxsetup calls per viewport per frame.

Alfonse Reinheart
04-23-2010, 06:07 PM
You must admit that in an ideal world ATI/Intel would just sort out their dlist compilers (or in ATI's case, just use the one from the fireGL drivers)?

No. In an ideal world, we would have one method for rendering that simply works as fast as possible with as great flexibility as possible.


The glVertexAttribPointer call is the bottleneck (which you alluded to, but you seemed to be suggesting it's the bind followed by the attribptr call). That call is just as expensive whether you've changed the currently bound buffer or not.

Where is your evidence on that? In their discussion of why they implemented NV_vertex_buffer_unified_memory, NVIDIA specifically called out the cache issue of fetching the GPU address from the internal buffer object.

Furthermore, if glVertexAttribPointer is a problem, then wouldn't it make more sense to simply divide the vertex format from the buffer object+offset the way that NV_vertex_buffer_unified_memory does? After all, if there is both a cache problem and a glVAP problem, then that glVAP problem must have to do with the cost of changing vertex formats.

Rather than making some gigantic change that requires lots of IHV support to make work and may not actually help, just make small, targeted, specific changes that fix the specific problems.

And while I don't like bindless for breaking the buffer object abstraction, I have to say, it is a very specific, targeted change to solve a specific problem.


Ilian, glDrawRangeElementsBaseVertex isn't supported on older cards... unfortunately.

CBO won't be supported on them either. And BaseVertex is supported on DX9 cards; at least, the ones that are still receiving driver support from the IHVs. It's a feature that has been available on D3D for some time.

Dark Photon
04-23-2010, 08:37 PM
Before you start to critique, please patiently read what is written.
You're missing the point.
No, you're diluting this thread by confusing the issue.


Beating display lists isn't the goal...
No, beating display lists isn't the goal. Maximal application performance is!

If display list perf is the fastest route on one vendor for static pre-loaded geometry, then we will use it, no matter what you say! If another route is faster on another vendor, then we will use that! This is not rocket science. This is common sense. And switching rendering paths is easy. Only academia can afford to lean back and "settle" for the lowest common denominator across all vendors. Performance and content sell -- that's reality.

So the fact that vendor X does a sad job on render path Y is a really dumb reason to say that path Y doesn't matter.

Now (ignoring your destructive bashing), Aleksandar's point in starting this thread was a very good one: paraphrasing,

NVidia's display lists provide huge speed improvements, but display lists are deprecated. How do we expose that speed-up in next-gen OpenGL?

and he proposed CBOs. ...after which this degenerated into a food-fight over what the "cause" of the speed-up is, and (in some cases) how you could kludge around those causes rather than fix the underlying problem(s).

This is something the vendors will have to decide. On NVidia in my experience, we have our modern "display list" perf solution: it is called the Bindless extensions (http://developer.nvidia.com/object/bindless_graphics.html). If other vendors want that perf to make their cards look good and compare well, they'll implement it (or something like it) too. I don't see a need for another layer of objects/abstraction on top of this (display lists, CBOs, etc.) but vendor driver internals may steer the shared solution to that. Again, the vendors will have to decide.

And here's hoping they are working through the ARB to facilitate this discussion, so we can get one EXT or ARB extension from this, not 2-3 vendor-specific extensions. A point which you also agree with:


In an ideal world, we would have one method for rendering that simply works as fast as possible with as great flexibility as possible.

So I agree with Aleksandar, and I'm glad he started this thread. Something is needed (API support) to fill this performance gap in a simple, cross-vendor way.

So far I've yet to hear a good reason why bindless (http://developer.nvidia.com/object/bindless_graphics.html) (using 64-bit buffer handles, which just so happen to be GPU addresses on some hardware) isn't "it".

Alfonse Reinheart
04-23-2010, 10:43 PM
So the fact that vendor X does a sad job on render path Y is a really dumb reason to say that path Y doesn't matter.

It's all a matter of effort vs. reward.

NVIDIA's graphics card division is... well... things aren't going well for them. They're 6 months late with a DX11 card, the card they eventually released is not exactly shipping in quantity, it runs fantastically hot, etc.

ATI by contrast was able to ship 4 DX11 chips in 6 months, and they're able to meet demand in selling those chips. They're selling DX11 hardware to the mainstream market, while NVIDIA can't even produce mainstream (sub-$200) DX11 cards after a 6 month delay.

One company is winning, and the other is losing.

The simple economic reality is this: development resources are not infinite. It'd be great if we could optimize everything, everywhere, for every piece of hardware. But what matters most is doing the greatest good for the greatest number. Adding a rendering path for display lists only helps NVIDIA card users; for most people, that means some percentage of their customer base less than 100%. This rendering path requires testing, debugging, and other care&feeding.

Or, one could spend those development resources tweaking shaders to make them faster and gain a performance benefit there. Alternatively, since performance is being lost anyway, one could make the game look better at the same performance. Maybe make the shaders more complex, or add in HDR or bloom, or whatever. Unlike the display list optimization, both of these will be useful for 100% of the customer base.

Where are the development resources better spent? On the slowly dwindling population of NVIDIA card holders? Or on all of the potential customers? Yes, it'd be nice if development resources could be spent on both. And for some, they can afford it; more power to them.

The rest of the developers would rather have a single path that both NVIDIA and ATI are willing to optimize as much as possible. Right now, that path is VBOs.


and he proposed CBOs. ...after which this degenerated into a food-fight over what the "cause" of the speed-up is, and (in some cases) how you could kludge around those causes rather than fix the underlying problem(s).

That's how you see it, but that's not what the actual discussion is.

First, identifying the cause of the performance increase from display lists or bindless is vital to determining how to actually achieve it. If the cause of the increase is not what was identified in the original post, then CBOs will not help! And proposing something that will not actually solve the problem is a waste of everyone's time.

If you want to consider any discussion of whether CBOs will actually solve the problem to be missing the point, well, that's something you'll have to deal with yourself.

Second, "kludging" around the problem is more likely to solve it than inventing an entire new API. Bindless is nothing if not a gigantic kludge, yet you seem totally happy with it.


Something is needed (API support) to fill this performance gap in a simple, cross-vendor way.

This thread is not about "something" that solves the problem. It is not a thread for discussing arbitrary solutions to the problem. It is about a specific solution. A solution whose efficacy is far from settled.


using 64-bit buffer handles, which just so happen to be GPU addresses on some hardware

Those are not 64-bit handles; they are actual GPU addresses. Even if you completely ignore the fact that the function is called glBufferAddressRangeNV and the fact that the spec constantly refers to them as "addresses", glBufferAddressRangeNV doesn't take an offset. So that 64-bit value must be referring to an address. Either the address returned from querying it or an offset from the queried value.

If it looks like an address, acts like an address, and everyone calls it an address, then it is an address. So please don't act like bindless is something that could be trivially adopted by the ARB or something that doesn't break the buffer object abstraction.

Aleksandar
04-24-2010, 11:23 AM
I'm using glMultiDrawElements(). glDrawElements() would further increase the number of function calls. Maybe it is true that glMultiDrawElements() just iterates through many glDrawElements() calls inside the driver, but I still believe that it is a little bit faster than if I do the iteration myself. I have also mentioned that I'm using bindless, so binding a VBO is not critical any more.
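For reference, that call collapses many glDrawElements() calls into one, roughly like this (a sketch with illustrative names and array sizes):

GLsizei counts[65536];            // index count per visible block
const GLvoid *offsets[65536];     // byte offset of each block in the index buffer
/* ... fill counts/offsets during culling ... */
glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_INT, offsets, visibleBlocks);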

Would it be faster if I had just one VBO that is not static, instead of many static VBOs? I'm not sure. But there is definitely a need to update some parts of it. There is also the problem of indexing such a big and complex structure. Anyway, thank you all for the useful advice! I have tried to draw attention to something else...

Obviously I chose a wrong example. Maybe the next illustration will be better. Can anyone tell me why a single glCallLists() is faster than several independent glCallList() calls? I've changed the application so that it draws 65K DLs in two different ways. A single glCallLists() call is 60% faster than thousands of glCallList() calls. Measurement is done using QueryCounter (on the GPU). Maybe the answer to this question will help make clear what I wanted to say.
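The two variants I compared are roughly these (a sketch; listIDs holds the DL names):

// Variant A: one driver entry for the whole batch.
glCallLists(listCount, GL_UNSIGNED_INT, listIDs);

// Variant B: one driver entry per list.
for (GLsizei i = 0; i < listCount; ++i)
    glCallList(listIDs[i]);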

Fugitive
04-25-2010, 06:47 AM
@Alfonse: NVidia is 'losing'? A slowly dwindling population of NVidia card holders? Seriously? I use both ATI and NVidia cards every day, and buy the latest ones every year or so, and I can tell you, it's a one-up game for both of them. Six months later, with Fermi, it could very well be ATI that seems to be 'losing'. Neither of them is...

Your entire section on the 'reality' of reward vs time is misplaced. It's true, but misplaced. The gaming industry has been customizing their render paths for specific cards since the birth of the GPU era. There are only two major brands of cards to worry about: NVidia and ATI. For major game development companies (which produce perhaps 80% of the professional games?), one more programmer who works on tweaking the render paths for two brands of cards is not a big deal. In fact, it's a competitive advantage.

Alfonse Reinheart
04-25-2010, 12:13 PM
On bindless and buffer objects:

It's clear that bindless achieves what it sets out to, at least on NVIDIA hardware. Thus, in order to decide how best to create a platform neutral extension that gives bindless performance without the bad parts of bindless, it stands to reason that the first step is to examine why bindless works. And that starts with the basic differences between rendering with bindless and rendering without it.

There are really only 2 differences between the bindless API and the regular one.

1: The division of vertex format (type, normalization, stride, etc) from GPU location (buffer + offset). In bindless, these are set by different API calls, whereas in regular GL, they are not.

2: The explicit locking of buffer objects, which prevents them from being moved. This also means that the buffer has an explicit address for the duration that it is bound.

NVIDIA did not have to do #1. Indeed, they went out of their way to do #1 in the implementation: they added a bunch of new functions just to add this functionality. This suggests that, for NVIDIA hardware/drivers at least, performing a glVertexAttribArray/Format call is expensive. Indeed, in the bindless examples, they specifically minimize this kind of state change. Setting the GPU address is done much more frequently.

And this makes some degree of sense for the user. Vertex formats don't change nearly as frequently as which buffer object + offset you use. Indeed, you could imagine some applications that only use maybe 7 or 8 vertex formats per frame, if that. Indeed, with clever enable/disable logic, one imagines that you could set up a vertex format once and pretty much never change it (though if you're making heavy use of attributes, this may not be possible).

So just from analyzing how bindless changes rendering, we can already see something that the ARB should be looking into: separating vertex formats from buffer object+offset.

I would suggest a VFO: vertex format object. This should work similarly to the way that sampler objects work: if a VFO is bound, it overrides the equivalent VAO settings. Like sampler objects, it should be DSAified by nature; binding only to use.

This would also require adding an API to assign a buffer object+offset to an attribute. While this data would be VAO data, it should probably not be DSAified.
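A purely hypothetical sketch of what I mean (invented entry points, loosely modelled on how sampler objects override texture state; nothing here is an existing API):

// Hypothetical "vertex format object" API -- names invented for illustration.
GLuint vfo;
glGenVertexFormats(1, &vfo);                                   // hypothetical
glVertexFormatAttrib(vfo, 0, 3, GL_FLOAT, GL_FALSE, 32, 0);    // hypothetical: position at offset 0
glVertexFormatAttrib(vfo, 1, 3, GL_FLOAT, GL_FALSE, 32, 12);   // hypothetical: normal at offset 12

glBindVertexFormat(vfo);             // hypothetical: overrides the bound VAO's format state
glVertexAttribBuffer(0, vbo, 0);     // hypothetical: attach buffer + offset to attribute 0
glVertexAttribBuffer(1, vbo, 0);
glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT, (void*)0);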


NVidia is 'losing'? A slowly dwindling population of NVidia card holders? Seriously? I use both ATI and NVidia cards every day, and buy the latest ones every year or so, and I can tell you, it's a one-up game for both of them. Six months later, with Fermi, it could very well be ATI that seems to be 'losing'.

As of right now, what I said was true. I made no claim that this would continue in perpetuity. Only that, right now, ATI's cards are selling better than NVIDIA's.

My point was that you can't ignore ATI. During the embarrassing years of the R520 and R600, there was a reasonable case to be made for ignoring ATI cards. That case cannot currently be made.

As an aside, you may want to investigate the Fermi problems (and other NVIDIA/TSMC manufacturing problems) in more depth. It's very fascinating, and it does not offer a rosy outlook for NVIDIA in the near term. NVIDIA might come up with some mainstream Fermi-based cards that trump ATI's Evergreen/Southern Islands stuff. But NVIDIA's problems with manufacturing really don't suggest that this is likely.


For major game development companies (which produce perhaps 80% of the professional games?), one more programmer who works on tweaking the render paths for two brands of cards is not a big deal. In fact, it's a competitive advantage.

I don't believe I mentioned game developers. However, not all game developers are equal. And "one more programmer" is a lot of money for many game developers. If your project has 7 people on it, adding 1 more is pretty substantial.

Pierre Boudier
04-25-2010, 12:26 PM
Obviously I chose a wrong example. Maybe the next illustration will be better. Can anyone tell me why a single glCallLists() is faster than several independent glCallList() calls? I've changed the application so that it draws 65K DLs in two different ways. A single glCallLists() call is 60% faster than thousands of glCallList() calls. Measurement is done using QueryCounter (on the GPU). Maybe the answer to this question will help make clear what I wanted to say.


there are several possible answers:
- the display list optimizer can do a better job inside one DL than across several DLs (for instance, it can remove unused attributes, it can reindex, ...)
- or, if you are CPU bound, then writing a few dwords once is faster for the driver than writing a few dwords a thousand times

Aleksandar
04-25-2010, 12:49 PM
there are several possible answers:
- the display list optimizer can do a better job inside one DL than across several DLs (for instance, it can remove unused attributes, it can reindex, ...)
Indeed, but glCallLists() does not create a single display list; it just iteratively calls separate DLs.


- or, if you are CPU bound, then writing a few dwords once is faster for the driver than writing a few dwords a thousand times
Completely agree. That's one of the reasons I've proposed CBO.

This thread has become both an NV vs. ATI and a DL vs. VBO fight. That was not my intent. I just wanted to ask the community whether a command buffer object (or whatever its name would be) could be beneficial for boosting rendering speed. So far, it seems that the community is not interested in such an extension. :(

Alfonse Reinheart
04-25-2010, 12:54 PM
That's one of the reasons I've proposed CBO.

And that's the problem: it only solves that particular issue. It has most of the limitations of display lists and a lot less optimization potential. It doesn't guarantee optimal performance, just like display lists. It can't even offer display list performance in the example you yourself used.

So it's not a good idea.

Aleksandar
04-25-2010, 01:06 PM
It has most of the limitations of display lists and a lot less optimization potential.

I don't understand what optimization you are talking about. The data is already stored in VBOs. Commands can be compiled (and optimized by reordering) into a separate buffer, and called as a batch in just one call. Where is the problem?

Of course, I'm unaware of all possible problems, and that's the reason I've started this debate. All suggestions are welcome.

Fugitive
04-25-2010, 02:05 PM
My point was that you can't ignore ATI.

No, your point was that you can ignore NVidia because they are doing poorly anyway. It's not reasonable to ignore NVidia because of their current performance, as this may just be a temporary turn, as it usually is. BTW, I already know about the problems with Fermi. Haven't we seen ATI struggle with similar problems in the past?




I don't believe I mentioned game developers. However, not all game developers are equal. And "one more programmer" is a lot of money for many game developers. If your project has 7 people on it, adding 1 more is pretty substantial.
You are right, you didn't mention game developers. I also agree that on smaller projects, adding one more person is substantial. However, the reality still is that:

1) Most professional/commercial software is made by large corporations.
2) As a consequence, commercial software will continue to support multiple render paths, i.e., optimize for each card separately, since one more developer is not as much of a cost as losing out to the competition with a sub-optimized product on a particular GPU.

peterfilm
04-25-2010, 03:01 PM
Indeed, but glCallLists() does not create a single display list; just iteratively calls separate DLs.
Aleksandar, they re-order or optimise whole frames of display lists. Telling the driver you want to draw a big contiguous block of display lists in a single call is gold dust to them. It gets sent to a worker thread which re-optimises the whole thing, given a batch id, and the next frame that optimised block is used instead. (speculation based on some observations).

Alfonse Reinheart
04-25-2010, 03:12 PM
I don't understand what optimization you are talking about. Data is already stored in VBOs. Commands can be compiled (and optimized by reordering) into separate buffer, and called as a batch in just one call. Where is the problem?

And this is the problem. You believe that the problem is function call overhead. That each function call itself is necessarily creating a noticeable performance drop. That it doesn't matter which functions you call thousands of times per frame.

A display list is free to do the following things that a CBO cannot:

1: Put all of the mesh data into a single buffer.

2: Be directly tied to this buffer, so that when it is moved, the display list is notified.

3: Analyze the mesh data and modify the vertex format for optimal performance (interleaving, etc).

4: Minimize vertex format state changes during rendering.

And that's just what I came up with off the top of my head.

You specifically stated, "The new DLs (or CBOs) would be just an efficient way to draw VBOs." This means that CBOs must be using the same buffer objects that were used when compiling them. So there is no chance for format changes or reordering or anything.


No, your point was that you can ignore NVidia because they are doing poorly anyway.

My point was that you can't let NVIDIA alone guide your decision making about where to spend your money. NVIDIA-specific optimizations are reaching less of your customer base.

peterfilm
04-25-2010, 03:48 PM
Display lists are fine. No need to change them. Restrict what can be compiled into them, and maybe ATI will produce a better implementation for consumer cards (doubt it, as I said, I believe they've deliberately crippled dlists on non-workstation cards in order to sell more workstation cards).

I love the way display lists 'describe' a frame of drawing ops. If your scene is basically static, the whole thing can be a single 'display list' as far as the driver is concerned. That could include compiling all display lists called contiguously without intermittent state updates into a single display list. So it goes beyond the compile stage of the display list mechanism - the draw part is also easier to optimise.

None of this would be possible with your description of CBO's. But change your description to "content is dereferenced at compile time" and you're in business again. But then that's display lists you're describing. Or at least, display lists as most people use them (as geometry display lists, not any of the state change stuff).

Dark Photon
04-25-2010, 07:02 PM
it stands to reason that the first step is to examine why bindless works.
NVidia was pretty blatant about that, as you well know. Buffer handle->addr lookups causing CPU cache pollution. CPU-side inefficiency.


There are really only 2 differences between the bindless API and the regular one.

1: The division of vertex format (type, normalization, stride, etc) from GPU location (buffer + offset). In bindless, these are set by different API calls, whereas in regular GL, they are not.

2: The explicit locking of buffer objects, which prevents them from being moved. This also means that the buffer has an explicit address for the duration that it is bound.

NVIDIA did not have to do #1. Indeed, they went out of their way to do #1 in the implementation: they added a bunch of new functions just to add this functionality.
If we ignore the legacy, deprecated vertex attributes (as you usually do), then there is only one new API for that:

glVertexAttribFormatNV

In general:

glVertexAttribPointer = glVertexAttribFormatNV + glBufferAddressRangeNV

So yes, they separated the set of the vtx attr format from the set of the buffer address. Consequently, they can use the same API to "bind the buffer" via address (glBufferAddressRangeNV) just as we use the same API now to "bind the buffer" via buffer handle without bindless (glBindBuffer).

So for a modern OpenGL app that uses new-style vertex attributes, there are only these 2 new APIs that matter (glVertexAttribFormatNV and glBufferAddressRangeNV).
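For concreteness, the new-style path boils down to roughly this (a sketch with illustrative strides and sizes; the one-time residency/address calls from NV_shader_buffer_load are included as well):

// One-time setup per buffer (NV_shader_buffer_load):
GLuint64EXT vboAddr, iboAddr;
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);                  // lock the buffer in place
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vboAddr);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glMakeBufferResidentNV(GL_ELEMENT_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ELEMENT_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &iboAddr);

// Per frame (NV_vertex_buffer_unified_memory):
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableClientState(GL_ELEMENT_ARRAY_UNIFIED_NV);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, stride);               // format, set rarely
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, vboAddr, vboSize);  // "bind" by address
glBufferAddressRangeNV(GL_ELEMENT_ARRAY_ADDRESS_NV, 0, iboAddr, iboSize);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, (void*)0);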


This suggests that, for NVIDIA hardware/drivers at least, performing a glVertexAttribArray/Format call is expensive.
Well, I can't speak for all hardware, but I can tell you I tried both lazy sets of the vtx attr format with bindless and setting it every time regardless (both via glVertexAttribFormatNV of course), and saw basically no significant difference. I tried this on systems with two different latest-gen CPU/CPU-mem/MB combinations, one slow, one moderately fast (2.0/2.6 GHz Core i7s).

Nearly all (> 98%) of the benefit to be had is just through using buffer handles vs. buffer addresses. Nothing to do with lazy setting vtx attr formats.


Indeed, in the bindless examples, they specifically minimize this kind of state change.
Yeah, I noticed that. Weird.


So just from analyzing how bindless changes rendering, we can already see something that the ARB should be looking into: separating vertex formats from buffer object+offset.
That's premature, unless it just makes the API cleaner, which in bindless's case it seems to do, and apparently it doesn't matter to perf.

Dark Photon
04-26-2010, 04:38 AM
...right now, ATI's cards are selling better than NVIDIA's.

My point was that you can't ignore ATI.
Yeah, you work for them, we get it. ;) Marketing dept?

This "was" a technical discussion about how to expose display list perf in next-gen OpenGL. Let's get back to that... And stop bashing people...

Dark Photon
04-26-2010, 04:54 AM
This thread has become both an NV vs. ATI and a DL vs. VBO fight. That was not my intent. I just wanted to ask the community whether a command buffer object (or whatever its name would be) could be beneficial for boosting rendering speed. So far, it seems that the community is not interested in such an extension. :(
It's not that there's no interest. It's just that it's another object (like display lists) in the driver, that needs to be created and managed. If that's the fastest/cleanest approach, great, but do we really need this level of abstraction?

...from what I've seen (using geometry-only display lists), the underlying gain is almost purely from switching from buffer handles to buffer addresses.

We maintain buffer handles now. There is no reason we can't maintain buffer addresses (or 64-bit handles) in addition, in the same app data structures, especially when it buys you so much perf. Ignoring Alfonse's senseless bashing, I haven't heard a good reason why this "isn't" a good idea.

...the only reason I see for needing another level of abstraction is if the underlying concept of "GPU buffer addresses" is purely an "NVidia quirk" and not the way other GPUs work.

Then maybe we go with CBOs or some abstraction which can pre-resolve buffer handles to addresses once and then reuse them many times.

Alfonse Reinheart
04-26-2010, 11:16 AM
Ignoring Alfonse's senseless bashing, I haven't heard a good reason why this "isn't" a good idea.

Because it breaks the fundamental abstraction of buffer objects.

It's the same reason the ARB used VBOs (which behave similarly to ATI_vertex_array_object) instead of NV_vertex_array_range.

Aleksandar
04-26-2010, 11:43 AM
Aleksandar, they re-order or optimise whole frames of display lists. Telling the driver you want to draw a big contiguous block of display lists in a single call is gold dust to them. It gets sent to a worker thread which re-optimises the whole thing, given a batch id, and the next frame that optimised block is used instead. (speculation based on some observations).
I really doubt it works that way. In glCallLists(), through the ID vector you can pass any combination of DL IDs and it works perfectly. I think the driver has no time to reorganize anything during that call. It happens in a fraction of a millisecond.


You specifically stated, "The new DLs (or CBOs) would be just an efficient way to draw VBOs." This means that CBOs must be using the same buffer objects that were used when compiling them. So there is no chance for format changes or reording or anything.
Of course you shouldn't change the layout of a VBO without recompiling the CBO, but you can change the content of the buffer without affecting the CBO.


None of this would be possible with your description of CBO's. But change your description to "content is dereferenced at compile time" and you're in business again. But then that's display lists you're describing. Or at least, display lists as most people use them (as geometry display lists, not any of the state change stuff).
Dereferencing at compile time could totally remove the need for bindless, because the CBO would manage physical addresses. I don't understand one thing: why must VBOs used with bindless be made resident every time we change their content? The layout of the buffer does not change. The size is the same. Why should we do all the work of getting physical addresses and making buffers resident if we just want to change the content?

Anyway, I think that the major advantage of CBO over DL is that it decouples commands from the data. A DL cannot be modified. A VBO can! By modification I don't mean changing the size or the vertex format, just the content. If we want to change the layout of the vertex buffer or a vertex format, the CBO should be recompiled. But even a recompilation can be optimized.

Something about Bindless: Recent experiments with my ("more optimized") application showed that the maximum overall speedup using Bindless is slightly below 50%. I can post charts for three NV cards. A very interesting phenomenon is that Bindless is slower for small scenes than ordinary VBOs, and has a jump for middle-range scenes. Of course, this suggests that finding the bottleneck is not easy, because in most cases it is distributed across different stages of execution. My application is definitely CPU/driver bound, and some kind of command buffer could be beneficial for boosting the speed.

Alfonse Reinheart
04-26-2010, 12:45 PM
I don't understand one thing: why VBOs used with bindless must be made resident every time we change their content? The layout of the buffer does not change.

Because that's the whole point. The only way you can resolve a buffer object into a GPU address is to prevent the buffer from being moved around.


Anyway, I think that the major advantage of CBO over DL is that it decouples commands from the data.

Yes, but a lot of the performance benefit of Display Lists comes from the freedom to optimize the data format.

peterfilm
04-26-2010, 12:48 PM
Sorry Aleksandar, I think you're talking a lot of sense, but I'm absolutely sure that's what nvidia do (referring to the first bit of your reply). Maybe this is a quadro-only thing, I'm not too sure as I rarely run on geforces these days. Render the same thing for 4 frames (well, it depends on the complexity of the scene - 4 frames for a very heavy scene) and your frame rate goes up significantly. This is because they are optimising the frame in the background, and then uploading the optimised data.
None of this would be possible with CBOs. Dynamic data is a completely different topic, as far as I'm concerned. Not a particularly common thing either... do you do a lot of CPU vertex work each frame? I don't. I barely touch my geometry after I've created the GL resources. I stream data in big blocks, which I have to stagger over a number of frames so I don't drop one. If I could do this in another thread I'd be completely happy. I would literally consider the problem solved. Yay for the big black box that is display lists.

Dark Photon
04-26-2010, 06:47 PM
Recent experiments with my ("more optimized") application showed that maximum overall speedup using Bindless is slightly below 50%. I can post charts for three NV cards.
Would be interested in seeing this (and trying the code here). However note that since bindless optimizes CPU/CPU-mem-limited issues, the CPU/CPU-mem should be more relevant for good bindless test cases than GPU, so definitely cite that too. In the limit (large batches), I haven't seen bindless cost you anything over static VBOs though.


A very interesting phenomenon is that Bindless is slower for small scenes than ordinary VBOs, and has a jump for middle-range scenes.
I'd be "very" interested in more details on this. I have never seen this. You are saying that bind-by-handle was faster than bind-by-address? And just to be clear, are we only talking about bindless for vtx attrs and index data (i.e. NV_vertex_buffer_unified_memory), not shader data? And static, unchanging VBOs? Interleaved attrs? Not repeatedly doing the buffer addr query and make resident (glGetBufferParameterui64vNV / glMakeBufferResidentNV)?

(Forgot those two in my list above. There are actually 4 APIs relevant to new-style vtx attrib batches, not 2.)

It'd be cool if we could collectively pull a test program together to illustrate it, and pass it around to try on different CPU/CPU-mem/GPU/driver combos, to verify it and get a better feel for when this oddity occurs.

Simon Arbon
04-26-2010, 10:20 PM
Let's go back to first principles and see where it takes us.

The heart of the rendering loop is something like this:

For each Material
    Select Shaders, UBO, Textures, Samplers etc. to use;
    For each Object
        Load transformation Matrix;
        Render triangles from VBO;
    Next Object
Next Material

The setup for each material (Skin, cloth, metal, wood etc.) currently involves several calls to select a shader program for the material, which textures it uses, and possibly a UBO.
However, once a material is defined it doesn't change, so this would be better done with an immutable material object.
The pipeline state for this object will be pre-validated when its created, so switching materials will be simpler and faster.
(This is very similar to the Longs Peak 'Program Object')

When switching between materials, some state changes are more expensive than others, so the material rendering order should be sorted to minimise rendering time by grouping together materials that share that slow-to-change state.

Now we could run our own tests to find the most expensive state changes, but that could change in future hardware or differ between vendors.
Hence it would be better if the driver decided the rendering order, sorting the materials optimally whenever we add a new Material Object.

But if the driver is controlling the rendering order then it needs to know which VBO's contain objects of each material.
So let's add an Object Buffer Object that contains a list of object records, each containing the position and rotation of an object as a transformation matrix, plus the VBO, offset, size etc. of the actual triangle data.
Then give each Material Object a linked-list of the Objects that are made from that particular material.
This has the added benefit that the transformation matrix and VBO can be updated by an OpenCL physics engine without being shuffled to the CPU and back.
(I chose a linked-list for objects so that objects can be added and removed from the scene without expensive searching or data re-packing)

To make a depth pre-pass more efficient we could use a separate linked-list of objects sorted in front-to-back order.

Now the GPU has all the information it needs to render the main scene in a single API call. No cache-misses, no table lookups, no Draw calls, and very few API calls.

But we still have one big problem, the lag between the CPU and GPU.
The API was designed for a CPU that directly controlled a graphics peripheral, but with modern hardware the GPU will stall if we try to control the render from information read back from it, forcing us to use out-of-date information from the previous frame to control the current frame.
With commands, display lists, or even CBO's, we are limited to a linear sequence of commands similar to an old DOS batch file.
Conditional rendering was added in recognition of this problem, but it only provides a very basic if-then branch for occlusion queries.

Display lists are said to be 'Compiled' to run efficiently on the GPU command processor, but why limit ourselves to such simple programs? Why not go all the way and have a 'Command Shader'?

This would be compiled in the same way as a GLSL program, but would have a single instance that would automatically run on the GPU command processor after each buffer swap.
Its purpose would be to move the main rendering loop from the CPU to the GPU, removing all lag, allowing more complex control of rendering, improving speed, and reducing CPU workload.
It would also allow proper synchronisation between OpenCL and OpenGL by directly scheduling OpenCL kernels to be run when the rendering has completed and OpenGL is waiting for the buffer swap.
The CPU would now be responsible for changes to the game world, adding and removing objects, moving the camera and animating creature movements, while the GPU does all the repetitive processing that is the same every frame.

peterfilm
04-27-2010, 10:15 AM
He's a man of big ideas. Sounds good to me. By Friday, please.

Alfonse Reinheart
04-27-2010, 10:56 AM
I am not at all convinced that adding more objects is going to solve the problem.

Also, relying on the driver to optimize things like rendering order and so forth is, well, look at ATI's display lists. If they don't want to optimize this stuff, why should that inhibit your performance?


No cache-misses, no table lookups, no Draw calls, and very few API calls.

No cache misses? What, is this stuff somehow magically preloaded into the cache? What exactly do you expect the driver to be doing behind the scene when you say, "execute this rendering list?"

It's going to have to read the array that stores those objects. Cache miss, one for every N indices.

It then has to dereference this pointer and read the data for that object. Cache miss, every time.

It then has to read the various other objects (VAOs, programs, textures, etc) used by that object. Cache miss, cache miss, cache miss.

Putting the traversal of the scene graph on the driver does not magically make the problem go away. It's still there; it's just hidden from you.

It's much better to give the programmer more ability to optimize rather than forcing it on the driver.


why limit this to such simple programs? Why not go all the way and have a 'Command Shader'?

Because shaders run on the GPU. They can only read GPU memory.

The only way to do what you're suggesting is to make a CPU thread that executes "shader" code compiled to CPU assembly.

Also, even if you could make the GPU do this, GPUs are not actually good at this stuff. They have caches too as well as cache misses, and their caches are optimized for graphics work, not general programming work (which is what you're asking for). Building the rendering command list is not a highly parallel activity like actual shaders. It's something best left to a CPU thread.

Plus, every GPU cycle spent on traversing the scene graph is a cycle lost to your actual rendering.

Lastly, it doesn't even do what you want. Even if the GPU could build the command list, it wouldn't "remove all lag." This "command shader" would have to wait to do readback, just like the CPU. It would have to sit there and wait until the GPU has completed the operation before it could effectively do readback.

So there's really no difference, except that you're wasting precious GPU resources on a task that the GPU is highly unsuited to doing.

peterfilm
04-27-2010, 11:10 AM
true, and if it were practical to do it on the GPU, then display lists would be executed entirely on the GPU - but they're not.

Simon Arbon
04-27-2010, 10:51 PM
No cache misses? What, is this stuff somehow magically preloaded into the cache? What exactly do you expect the driver to be doing behind the scene when you say, "execute this rendering list?"
I mean no cache misses on the CPU, because this would be run entirely on the GPU command processor.
The CPU would set up the VBOs, Material objects and Object data, then the GPU would render frames repeatedly.
The application/driver on the CPU would only be involved with changes to the game world.
Cache misses on the GPU can be avoided by the driver inserting prefetch instructions in the command stream, just as it does with compiled display lists.
Materials and objects are much smaller and accessed much less often than the actual VBO data; they are optimised when created just like a display list is, and internally they would use GPU addresses not names, so cache misses would have a minor impact anyway.

Because shaders run on the GPU. They can only read GPU memory.
Which is why we put all the data it needs into the GPU memory first.

Also, even if you could make the GPU do this, GPUs are not actually good at this stuff. They have caches too as well as cache misses, and their caches are optimized for graphics work, not general programming work (which is what you're asking for). Building the rendering command list is not a highly parallel activity like actual shaders. It's something best left to a CPU thread.
Fermi does have a cache very like a CPU cache.
The command shader is not meant to run on either the CPU or the shader processors, it is meant to run on the command processor.
This is a separate processor in the GPU that receives a block of rendering commands from the CPU (or a compiled display list) and executes them, while managing the distribution of work threads to all of the shader processors.

Lastly, it doesn't even do what you want. Even if the GPU could build the command list, it wouldn't "remove all lag." This "command shader" would have to wait to do readback, just like the CPU. It would have to sit there and wait until the GPU has completed the operation before it could effectively do readback.
A command shader does not 'build' a rendering command list, it replaces the rendering command list.
Yes, the command shader has to wait for the shader processors to finish a specific task before it can test and branch, but this response is almost instantaneous and nothing like a CPU/GPU synchronisation.
Inserting another task between the operation and the test for its result can ensure that the shader processors are kept busy.
Conditional rendering already does exactly this for occlusion query results.
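For reference, the existing mechanism looks roughly like this (a minimal sketch using the standard GL 3.0 occlusion query plus conditional render; the two draw helpers are hypothetical):

    GLuint query;
    glGenQueries(1, &query);

    glBeginQuery(GL_SAMPLES_PASSED, query);
    drawBoundingBox();                                  // cheap proxy geometry
    glEndQuery(GL_SAMPLES_PASSED);

    glBeginConditionalRender(query, GL_QUERY_NO_WAIT);  // GPU decides, no CPU readback
    drawFullObject();                                   // skipped if no samples passed
    glEndConditionalRender();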

Alfonse Reinheart
04-27-2010, 11:26 PM
Cache misses on the GPU can be avoided by the driver inserting prefetch instructions in the command stream, just as it does with compiled display lists.

There is no "command stream," because your whole idea revolves around the specific removal of commands. Instead, you just have a "shader". If you have a scene graph, with descriptions of how to render the various things, this will not be located in contiguous memory. Prefetching doesn't help, because you're essentially accessing data at random.

Furthermore, any such prefetching could just as easily be done on the CPU.


Materials and objects are much smaller and accessed much less often than the actual VBO data; they are optimised when created just like a display list is, and internally they would use GPU addresses not names, so cache misses would have a minor impact anyway.

Do you think that buffer objects don't use GPU addresses internally?

In terms of cache behavior, what matters is the access pattern. The scene graph, at best, would be an array of pointers to objects. These objects would, essentially, be random accesses. And since you only access these objects once (or relatively few times, at any rate) per frame, you get terrible cache behavior.

The problem isn't what those objects store internally; the problem is that it takes two fetch operations. The first fetch is to get the pointer to the object itself. The second is to dereference the pointer to get the object's data.

Bindless is designed to do an end-run around this.


Which is why we put all the data it needs into the GPU memory first.

Which will require CPU/GPU synchronization in order for the CPU to modify that memory. After all, you can't go changing all of these heavyweight objects before the rendering on them has been done from last frame.


The command shader is not meant to run on either the CPU or the shader processors, it is meant to run on the command processor.

That's not going to happen.

GPU hardware makers have spent a great deal of effort over the past years unifying their shader architecture, so that vertex, fragment, geometry and whatever all run on the same hardware. They're not going to make a completely new shader stage with its own specialized logic hardware just for something you could do yourself.

peterfilm
04-28-2010, 01:33 AM
why is your scenegraph not contiguous in memory?

Simon Arbon
04-28-2010, 02:15 AM
In terms of cache behavior, what matters is the access pattern. The scene graph, at best, would be an array of pointers to objects. These objects would, essentially, be random accesses. And since you only access these objects once (or relatively few times, at any rate) per frame, you get terrible cache behavior.
The problem isn't what those objects store internally; the problem is that it takes two fetch operations. The first fetch is to get the pointer to the object itself. The second is to dereference the pointer to get the object's data.
All I am doing is shifting the scenegraph access from the CPU to the GPU; either way you will be randomly accessing objects in memory once per frame to get their transformation matrices and VBO addresses.
If done on the CPU, then after reading the object (CPU cache miss can be avoided by prefetch) we first need to call the driver to put the matrix into a UBO, then call it again to draw from the VBO (both of which execute a LOT of instructions).
These commands get assembled into a buffer on the client side which sometime later gets flushed to the server side and passed to the GPU, which then starts reading vertices and scheduling vertex shader runs (with a GPU cache miss on first access to the VBO).

If done on the GPU, then we copy object transformation matrix directly to UBO, read the address of the VBO, then immediately start reading vertices. MUCH less work overall.

There is potentially a cache miss on the object read and the first VBO access, but there are ways around this.
The loop that iterates through the objects can prefetch the VBO address a loop ahead, and the next object address a loop before that.
In many cases the entire object buffer would be small enough to remain in the 768 KB L2 cache anyway.
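The CPU-side equivalent of that pattern looks something like this (a sketch using the GCC/Clang __builtin_prefetch intrinsic; the Object layout and submitDraw helper are hypothetical). The command processor could in principle do the same staggered prefetch with its own prefetch instruction:

    for (size_t i = 0; i < objectCount; ++i) {
        if (i + 1 < objectCount)
            __builtin_prefetch(objects[i + 1]);   // next object record, one loop ahead
        const Object* obj = objects[i];
        __builtin_prefetch(obj->vertexData);      // start of this object's vertex data
        submitDraw(obj);                          // issue the draw for the current object
    }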


Bindless is designed to do an end-run around this.
Bindless only saves one table lookup, and hence one cache miss, per command, in the CPU case.
In the GPU case the name-to-address translation was done during compilation, so bindless is irrelevant.


Which will require CPU/GPU synchronization in order for the CPU to modify that memory. After all, you can't go changing all of these heavyweight objects before the rendering on them has been done from last frame.
No synchronisation is required, commands to alter objects will simply be queued and executed between frames.
VBO's and other buffers will use the same ping-pong and orphaning techniques we use now.
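(For anyone unfamiliar with the orphaning idiom, it is just this standard pattern; vbo and size are placeholders:)

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);   // orphan: new storage; the old
                                                                 // storage stays valid for in-flight frames
    void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size, GL_MAP_WRITE_BIT);
    /* ... fill ptr with this frame's data ... */
    glUnmapBuffer(GL_ARRAY_BUFFER);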


GPU hardware makers have spent a great deal of effort over the past years unifying their shader architecture, so that vertex, fragment, geometry and whatever all run on the same hardware. They're not going to make a completely new shader stage with its own specialized logic hardware just for something you could do yourself.
This is complete nonsense; in the Fermi block diagram I can see texture units, vertex fetch units, tessellators, viewport transform units, attribute setup units, stream output units, rasterisation engines, memory controllers and a GigaThread engine.
Some of these will be dedicated hardware, but the GigaThread engine not only executes the commands sent from the CPU, it also "creates and dispatches thread blocks to various SMs" and has to do load balancing between all of the SMs, so it is likely to be quite a capable processor.
"Individual SMs in turn schedule warps to CUDA cores and other execution units" so there could be other processors there as well.

I am not talking about adding a new stage, I am talking about using the existing NVIDIA GigaThread engine or AMD command processor in a slightly different way.
There may be limitations that prevent them from executing arbitrary code in current GPUs, but only slight modifications would enable next-generation GL5 hardware to run command shaders.

Simon Arbon
04-28-2010, 02:38 AM
By friday, please
I would at least like to see Immutable Material objects added on July 25th (Siggraph), after all, this is basically the same as was proposed for Longs Peak back in 2007.
A query that asks the driver for a list giving the most efficient ordering of the material state changes would be nice too...
The Command shader may have to wait for GL5 hardware though.


true, and if it were practical to do it on the GPU, then display lists would be executed entirely on the GPU - but they're not.
I find that quite surprising, do you have evidence for this? I would have expected the more modern GPUs to keep a compiled display list in GPU memory and directly execute it.


why is your scenegraph not contiguous in memory?
In my case I have an entire planet as my game world, so most of it stays on the hard disk most of the time.
I am continuously streaming objects in and out of the scenegraph as the player moves towards them or away from them. (Not to mention the different Level-Of-Detail versions that each object has.)
This causes a lot of memory fragmentation, though I do try to keep it as compacted as possible.

Alfonse Reinheart
04-28-2010, 03:01 AM
If done on the CPU, then after reading the object (CPU cache miss can be avoided by prefetch) we first need to call the driver to put the matrix into a UBO, then call it again to draw from the VBO (both of which execute a LOT of instructions).

First, how can you avoid that cache miss? You don't know what object you're going to be reading next until you dereference the pointer.

Second, none of what you're talking about involves the execution of a "LOT of instructions". It involves a lot of work, due to the synchronization needed in updating a buffer object's contents. But this is not a lot of instructions.

And the GPU version needs to do that synchronization too.


If done on the GPU, then we copy object transformation matrix directly to UBO, read the address of the VBO, then immediately start reading vertices. MUCH less work overall.

And where does this object transformation matrix come from? The GPU isn't allowed to read arbitrary CPU data, so it must be coming from a buffer object or the parameter of some other object. Which the CPU must set. This requires CPU/GPU synchronization.


In many cases the entire object buffer would be small enough to remain in the 768 KB L2 cache anyway.

You're making an assumption that the quantity of data used by the shaders is rather small. I imagine that the state graph for scenes of significance exceeds 1MB. At least, the ones that are state-change or drawing call bound, rather than shader bound.

Also, where are you coming up with this 768KB L2 cache from?


In the GPU case the name-to-address translation was done during compilation, so bindless is irrelevant.

Not if you're doing what you're talking about. So long as those buffer objects can be affected by the CPU and you expect the results of those changes to be reflected in rendering, the GPU-based scene graph code must still be using the buffer object's name. Buffer objects can be moved around by the creation/destruction of other memory objects.

The reason bindless works is because of the MakeResident call, which explicitly forbids the implementation from changing or moving the buffer object's location in memory.
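For anyone who hasn't used the extension, the calls in question look roughly like this (GL_NV_shader_buffer_load plus GL_NV_vertex_buffer_unified_memory; vbo, vboSize and the Vertex struct are placeholders):

    GLuint64EXT addr = 0;
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);              // pin the buffer in place
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER,
                                GL_BUFFER_GPU_ADDRESS_NV, &addr);       // query its GPU address

    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);             // vertex pulls go by address
    glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex));
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0,
                           addr, vboSize);                              // attribute 0 reads from addr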


in the fermi block diagram i can see texture units, vertex fetch units, tesselators, viewport transform units, attribute setup units, stream output units, rasterisation engines, memory controllers and a gigathread engine

All of which are fixed functionality, not arbitrary shader processors.


No synchronisation is required, commands to alter objects will simply be queued and executed between frames.

Queued by what? And if such queuing were possible, why isn't it done now? How do you define a "frame"? When does the state actually change? And how big exactly are these objects? If I'm trying to render 10,000 copies of something, which would have been child's play with instancing, is that going to require 10,000 objects?

And how does this interact with non-traditional rendering models, like rendering a GUI (typically ad-hoc, without a lot of formal objects for each element) or deferred rendering (the number of passes in the deferred part is based on the number of lights in the scene)?


I am not talking about adding a new stage, I am talking about using the existing NVIDIA GigaThread engine or AMD command processor in a slightly different way.

Except that the command processors in question do not execute arbitrary code. They are incapable of processing a state graph. Command processors are very simple pieces of hardware. They execute a FIFO whose commands are very limited. Set registers, execute rendering, clear cache X, etc. All very trivial.


I would at least like to see Immutable Material objects added on July 25th (Siggraph), after all, this is basically the same as was proposed for Longs Peak back in 2007.
A query that asks the driver for a list giving the most efficient ordering of the material state changes would be nice too...

These two things work at cross-purposes. Immutable combinations of program objects and the particular set of uniforms they use are not the most efficient way to go.

For example, let's say you have 7 objects. 3 use program A and 4 use program B. Even though they use different programs, they share a UBO between them all. And two of the objects that use program A share a UBO, as do 2 of the objects that use program B.

Ignoring all other state, this leads to the following sequence of bind and rendering commands:

1: Bind program B.
2: Bind common UBO to the common UBO slot (say, slot 7).
3: Bind shared UBO to slot 0.
4: Render object 1.
5: Render object 2.
6: Bind UBO to slot 0.
7: Render object 3.
8: Bind UBO to slot 0.
9: Render object 4.
10: Bind program A.
11: Bind shared UBO to slot 0.
12: Render object 5.
13: Render object 6.
14: Bind UBO to slot 0.
15: Render object 7.
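(In actual GL calls, with placeholder buffer and program names, that sequence is roughly:)

    glUseProgram(programB);
    glBindBufferBase(GL_UNIFORM_BUFFER, 7, commonUBO);    // common UBO, shared by everything
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, sharedUBO_B);  // shared by objects 1 and 2
    drawObject(1);                                        // placeholder for the actual draw call
    drawObject(2);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo3);
    drawObject(3);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo4);
    drawObject(4);
    glUseProgram(programA);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, sharedUBO_A);  // shared by objects 5 and 6
    drawObject(5);
    drawObject(6);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo7);
    drawObject(7);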

Now, let's compare this to what you would have to do with immutable "material" objects:

1: Bind object 1's material.
2: Render object 1.
3: Bind object 2's material.
4: Render object 2.
5: Bind object 3's material.
6: Render object 3.
7: Bind object 4's material.
8: Render object 4.
9: Bind object 5's material.
10: Render object 5.
11: Bind object 6's material.
12: Render object 6.
13: Bind object 7's material.
14: Render object 7.

Looks more efficient, right? You don't have all of those separate binds that we did in the first one.

However, what you're not seeing is one simple fact: those binds don't go away just because we happen to be using an immutable material.

Every time you bind one of these material objects, one of two things has to happen. Either the driver has to be stupid, or it has to be smart.

If the driver is stupid, then it will internally bind all of the state, even if that state was previously bound. So in the above case, we get a performance penalty for binding a common UBO 6 extra times.

If the driver is smart, then it will examine the old material and the new, changing only the state that is necessary to change. The problem here is that there's no need for that. The driver is wasting time doing something that the application could do much more easily.

The application knows that all of these objects share a certain UBO. The driver doesn't have to check on every bind whether the incoming material uses a different UBO in that slot; we know it doesn't. So why make the driver do the work?

Now, you could say that we would have to do the same thing on the CPU. Except that's not true. The work that the driver does to detect whether there is a shared uniform buffer being used is a lot harder than it is on the client side. Objects that share UBOs likely have other traits in common. Traits that can be used to sort the rendering list properly. Traits the driver does not have.

Doing a sort operation on the list of rendered objects will buy you more performance than immutable materials, even in the case where drivers are written to avoid redundant state changes. And if the drivers are written stupidly, you're in a world of hurt.

I would rather have low-level drivers that do exactly and only what they're told, rather than drivers that have to figure out stuff I already know.

Simon Arbon
04-28-2010, 06:39 AM
First, how can you avoid that cache miss? You don't know what object you're going to be reading next until you dereference the pointer.
You must have misunderstood me here; I'm just talking about iterating through my own scene-graph, so I certainly can prefetch each object.
I didn't mention the OpenGL name lookup cache miss because i am assuming the use of bindless.


none of what you're talking about involves the execution of a "LOT of instructions".
When I trace into an API call with my debugger it sure looks like a lot to me.


And where does this object transformation matrix come from? The GPU isn't allowed to read arbitrary CPU data, so it must be coming from a buffer object or the parameter of some other object. Which the CPU must set.
Yes, this is a GPU buffer containing the matrix of each object. For static objects it never changes. If I am running an OpenCL physics engine then it is this that updates the matrix of a moving object. If I just want to move an object from the CPU then I just send that command (and the new matrix) to the GPU just like a normal OpenGL command.


where are you coming up with this 768KB L2 cache from?
NVIDIA’s Fermi: The First Complete GPU Computing Architecture, A white paper by Peter N. Glaskowsky;
NVIDIA GF100 Whitepaper;
Whitepaper: NVIDIA’s Next Generation CUDA Compute Architecture: Fermi;


The reason bindless works is because of the MakeResident call, which explicitly forbids the implementation from changing or moving the buffer object's location in memory.
So just use MakeResident for the object buffer. The VBO's could be made resident as well if you really want. You seem to be suggesting that Bindless is some sort of alternative to a GPU scenegraph or a command shader, but there is no reason not to have both.


Queued by what? And if such queuing were possible, why isn't it done now?
It is, that's just the OpenGL command buffer. The only difference is that it waits until the command shader finishes executing the frame (the equivalent of a SwapBuffers command) before being executed.


If I'm trying to render 10,000 copies of something, which would have been child's play with instancing, is that going to require 10,000 objects?
No, you can still use instancing.


how does this interact with non-traditional rendering models, like rendering a GUI (typically ad-hoc, without a lot of formal objects for each element) or deferred rendering (the number of passes in the deferred part is based on the number of lights in the scene)?
This is exactly why I introduced the Command Shader, to give you this flexibility. It's a GLSL-like program, so you can have a for loop that triggers a full-screen rasterisation for as many lights as you want.
The Material/Object scenegraph idea by itself would simply let you have a single API call to draw all the opaque objects in the scene. (Though given some time to work on the idea I'm sure it could be extended to at least a depth pre-pass and transparent objects.)


the command processors in question do not execute arbitrary code. Command processors are very simple pieces of hardware. They execute a FIFO whose commands are very limited.
Do you have references to back this up? They may have been this simple in older GPUs, but modern GPUs like Fermi are very complex systems that at least deserve something at the level of an 8086, which is all you would need.
It is certainly possible that they currently execute firmware from ROM or flash, but GPUs become more flexible and programmable every generation, so it should be easily done in the next generation.


Immutable combinations of program objects and the particular set of uniforms they use are not the most efficient way to go.
Longs Peak seems to have had all of the UBOs as part of the program object, which I agree is wrong, simply because the transformation matrix has to get to the vertex shader somehow, and it's per object, not per material.
Perhaps allow UBOs to be attached to both Material objects and to the objects being rendered. (I really need another word for the object being rendered here; Thing object, segment object, section object, Mesh object? Any ideas?)


<Very long block of text that we won't repeat here>
For a start you put a material bind between objects 1&2 and 5&6, which is obviously unnecessary as they use the same material, and in practical applications you would often have quite a few objects sharing the same material.
As for re-binding all the state on a material change, no driver writer is THAT stupid (well OK, maybe an Intel driver writer).
You say that the driver is wasting time doing comparisons between the old and new materials that can be done better by the application. I don't agree at all; the material objects consist of a few numbers that reference some shaders and a UBO or two, and these comparisons are trivial and would have to be done by the application anyway.
Furthermore, if we use my idea of a GPU scenegraph (or even just letting the driver pre-sort the material order) then we can pre-compile the state switching when we create the material objects. During the rendering we don't need to test anything, we just replay the stored state-switching sequence.

Finally you mentioned traits that objects can have that affect the ideal rendering order but which are not OpenGL state or anything the driver can know about.
Can you give an example of what you mean?

Ilian Dinev
04-28-2010, 09:21 AM
Btw guys, let's not forget GL_ARB_draw_indirect on DX11 hw. Mix with instancing facilities like texture-arrays/etc.
Its unavailability in GL3.3 hints that maybe DX10 cards can't support such command-buffers.

Alfonse Reinheart
04-28-2010, 12:13 PM
NVIDIA’s Fermi: The First Complete GPU Computing Architecture, A white paper by Peter N. Glaskowsky;

I must have missed the part of the paper that says what the L2 cache on a Cypress is.

Between you and Dark Photon, I'm starting to wonder if I stumbled onto the NVIDIA forums or something.


So just use MakeResident for the object buffer. The VBO's could be made resident as well if you really want. You seem to be suggesting that Bindless is some sort of alternative to a GPU scenegraph or a command shader, but there is no reason not to have both.

There are reasons not to have command "shaders". I'm outlining them here.

And if we have "bindless" (or whatever form it eventually takes), why do we need command shaders? We'd already be getting performance nearly equivalent to NVIDIA display lists. And if NVIDIA can't get much more performance than bindless, I don't see command shaders doing any better.


Do you have references to back this up?

Do you? Besides Fermi, that is.

Why would an IHV waste the silicon and transistors on the command processor for a GPU? The only reason the Fermi might have a more complicated CP is because it's designed for GPGPU first and as a renderer second.

The CP simply doesn't do anything worth the extra die space in making it fully programmable.


For a start you put a material bind between objects 1&2 and 5&6, which is obviously unnecessary as they use the same material, and in practical applications you would often have quite a few objects sharing the same material.

You're assuming that the only material properties are UBOs. In this example, I only showed the UBO properties, but there could just as easily have been shared UBOs but different textures.


As for re-binding all the state on a material change, no driver writer is THAT stupid (well OK, maybe an Intel driver writer).

Never underestimate the stupidity of drivers. It was not too long ago that NVIDIA's GL drivers constantly recompiled shaders when you changed certain uniform state. This was considered an "optimization."

You're more likely to get consistently good performance if the specification is tight. A loose specification gives IHVs a lot of room to make things better, but it also allows them to make things worse. That's why I prefer buffer objects to display lists; VBOs may be slower than DLs sometimes, but they're consistent.


You say that the driver is wasting time doing comparisons between the old and new materials that can be done better by the application. I don't agree at all; the material objects consist of a few numbers that reference some shaders and a UBO or two, and these comparisons are trivial and would have to be done by the application anyway.

Here is the set of data that a material needs:

1: program.
2: textures and where they are bound.
3: UBOs and where they are bound.
4: non-buffer object uniform state (and no, not everything is or should be a UBO).

Some of this state is intrinsically per-instance state. Some of it is shared among several instances. Some of it is global.

The only way for a driver to know what state changes between materials is for them to actually do the test. However fast this may be (and it can't be that fast) it is still slower than the 0 time that would be spent if the user simply sent the data properly.

The user is at a higher level than the driver. The user has more tools to know what state is global, what state is per-instance, and what is shared. The user does not have to check the basic material properties; it knows all of the "soldiers" share the same array texture.


Finally you mentioned traits that objects can have that affect the ideal rendering order but which are not OpenGL state or anything the driver can know about.
Can you give an example of what you mean?

Shadow mapping. You render to a depth texture, then use that texture for rendering the scene. In that second pass, every shader uses this texture.

The driver doesn't know this; it will have to check this at every material change for the second pass, even though you the user already know that it isn't changing. It's a waste of time.

Dark Photon
04-28-2010, 05:53 PM
Between you and Dark Photon, I'm starting to wonder if I stumbled onto the NVIDIA forums or something.
No problem, as you more than compensate on the pro-ATI and NVidia FUD side. ;)

P.S. I'd evangelize ATI too if we could get our apps running on their drivers (would be good to have them as an alternative, especially right now), but they keep locking the whole machine up randomly and crashing the app in the driver. And I don't get bindless from them yet. So no surprise, I don't relay much ATI experience, good or bad.


Never underestimate the stupidity of drivers. It was not too long ago that NVIDIA's GL drivers constantly recompiled shaders when you changed certain uniform state. This was considered an "optimization."
Now you're dredging, dude. Yes it did happen, but in NVidia GPUs circa 2005-6. NVidia driver writers are still the tops out there in product stability. (Disclosure: No, I don't work for them.)

Simon Arbon
04-28-2010, 10:05 PM
Between you and Dark Photon, I'm starting to wonder if I stumbled onto the NVIDIA forums or something.
I have nothing against ATI; in fact there are some things, like their closer adherence to the spec, that I prefer over NVIDIA.
I used Fermi as an example simply because I am checking it out at the moment and have the documents to hand.
It would have been a more useful comment if you had actually told us what the L2 cache on a Cypress is.


why do we need command shaders? We'd already be getting performance nearly equivalent to NVIDIA display lists. And if NVIDIA can't get much more performance than bindless, I don't see command shaders doing any better.
Display lists, CBOs, or command shaders are all designed to reduce the number of API calls the application has to make. If your application is not CPU bound then you won't see any difference; if your CPU has too much work to do and can't keep up with the GPU, then it can make a big difference.
On the GPU side all that matters is that the shader processors are kept busy and don't stall waiting for the CPU to catch up.
But the main advantage of command shaders is that you could do conditional branching in the rendering loop that depends on GPU state, which currently stalls the pipeline if you try to do it from a CPU rendering loop.


It was not too long ago that NVIDIA's GL drivers constantly recompiled shaders when you changed certain uniform state. This was considered an "optimization."
This wasn't stupidity, this was NVIDIA trying to make certain benchmarks run faster so they got better scores and hence sold more cards.


However fast this may be (and it can't be that fast) it is still slower than the 0 time that would be spent if the user simply sent the data properly.
Let's see: the materials have a variable called "Vertex Shader" that contains the name of a compiled shader object (or maybe its GPU address), so to find out if the vertex shader changed we need to compare two numbers. Last I heard, CPUs are pretty good at that sort of thing. If the names match then we don't need to do anything; if not, we bind the new shader. Repeat for four other shaders, a couple of UBOs and some textures: a few billionths of a second extra.
And how do you get "zero" time for your application to do the same thing? You either do the comparison yourself as part of your scenegraph logic, or create a display list for each material-to-material change.
But the real problem is that for each of your shader and UBO changes the driver needs to do some validation checks to ensure that what you are telling it to do makes sense.
With material objects the validation is done when they are created, so when we change material the driver simply changes the pipeline state without repeating these checks every time.
In your shadow mapping example, the driver does have to check that each material has the same texture bound as the previous one did, but that's a single CPU instruction, and how many materials per frame would you have anyway? It would take an awful lot of them for this time to even be detectable.
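Concretely, the comparison I am talking about is nothing more than this (a sketch; the Material fields and the PipelineCache struct are hypothetical):

    // Redundant-bind filter: one integer compare per piece of state.
    void bindMaterial(const Material& m, PipelineCache& cur) {
        if (m.program != cur.program) {
            glUseProgram(m.program);
            cur.program = m.program;
        }
        for (int slot = 0; slot < m.uboCount; ++slot) {
            if (m.ubo[slot] != cur.ubo[slot]) {
                glBindBufferBase(GL_UNIFORM_BUFFER, slot, m.ubo[slot]);
                cur.ubo[slot] = m.ubo[slot];
            }
        }
        // ...same pattern for textures and samplers.
    }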

Simon Arbon
04-29-2010, 01:25 AM
If we can get back to what this thread was originally about:
minimising API calls when you have many thousands of meshes that are too different to use instancing, and I assume each has its own Model matrix that needs to be written to a UBO before another API call to draw from its VBO.

I'll list all the options I can think of, with my personal opinions; I would like to hear what everybody else thinks is the best option(s), or if you can think of any other ways to do this.

Traditional Display Lists
Gives the vendor the most opportunities for optimisation, and can be lightning fast for fixed geometry, but useless for a world that is constantly changing, as they need to be recompiled from scratch and this takes too long.

Mark Kilgard's Enhanced Display Lists (Siggraph Asia 2008)
-Compile commands into display lists that defer vertex and pixel transfers until execute-time rather than compile-time.
-Allow objects (textures, buffers, programs) to be bound “by reference”.
-Conditional display list execution.
-Relaxed vertex index and command order.
-Parallel construction of display lists by multiple threads.
This is a lot more flexible, objects can be animated and building a modified display list does not stall rendering.
But if the world is changing too quickly then the effort of continuously rebuilding display lists could exceed the gains from using them.

Aleksandar's command buffer object
Similar to a display list but limited to commands.
The main difference being that it is organised as an array of command slots, allowing it to be edited.
This would allow some flexibility in changing an object's parameters without having to re-compile it, but then you lose all the optimisations of that compilation, and adding/removing commands could cause a fragmentation problem.

Longs Peak Program Objects
Reduces several API calls to one, and pre-validates the state settings so less work needs to be done to switch materials.
All UBO's seem to be attached to program objects, but then how do you set the per-mesh transformation matrix?
When switching to a new program object the driver must check all attached shaders, UBO's and textures to determine which state needs to be set.

Material Objects
Similar to above, but the driver sorts the material objects into the most efficient rendering order.
This allows the state changes to be optimised and pre-compiled just like display lists.

GL_ARB_draw_indirect
Allows several meshes (from different parts of the same VBO) to be drawn (and multiple instances generated) from data stored in a structure in a GPU buffer object.
This puts all the objects into the same VBO, so they will become fragmented if you add & remove objects.
The main use of this seems to be to allow an OpenCL program to switch between different meshes on the fly or change the number of instances, though I can't really see where I would use this; a physics engine would either animate an object by moving the vertices in the VBO, or change the ModelView matrix to be used with the mesh.
It doesn't specify how you position each of these meshes in the world; I would assume you need a UBO containing a ModelView matrix for each instance of each mesh (a rough sketch of the indirect draw structure appears at the end of this post).

Mesh Buffer Objects
Similar in concept to CBO's, its an array of slots used to draw meshes. But instead of arbitrary commands, each slot contains the Model (or ModelView?) matrix of a particular mesh in the game world, and a pointer to the VBO that describes its shape.
For efficient rendering there could be one MBO per material, allowing a single API call to draw the whole lot at once.
Could still have fragmentation problems as removed meshes leave holes in the array.
Could use a linked-list of mesh objects, but then that could cause cache misses unless the whole MBO is prefetched.

Command Shader
Move the entire main rendering loop from the CPU to the GPU.
Will probably require enhancements to the GPU command processor, hence is for future hardware only.
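For the GL_ARB_draw_indirect option above, the sketch I promised: the draw parameters live in a struct inside a GPU buffer, so an OpenCL kernel (or the CPU) can rewrite them without touching the draw call itself (the buffer and VAO names are placeholders):

    typedef struct {        // layout defined by GL_ARB_draw_indirect
        GLuint count;       // number of indices
        GLuint primCount;   // number of instances
        GLuint firstIndex;
        GLint  baseVertex;
        GLuint reservedMustBeZero;
    } DrawElementsIndirectCommand;

    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBindVertexArray(meshVAO);
    glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (const void*)0);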

Alfonse Reinheart
04-29-2010, 11:01 AM
Let's see: the materials have a variable called "Vertex Shader" that contains the name of a compiled shader object (or maybe its GPU address), so to find out if the vertex shader changed we need to compare two numbers. Last I heard, CPUs are pretty good at that sort of thing. If the names match then we don't need to do anything; if not, we bind the new shader. Repeat for four other shaders, a couple of UBOs and some textures: a few billionths of a second extra.

Doing comparison operations in a tight loop is a big no-no for performance. Branches, and branch misprediction, are bad.

Also, you are ignoring the non-buffer object based uniforms in the program. Not every uniform is UBO based, nor should it be.


And how do you get "zero" time for your application to do the same thing? You either do the comparison yourself as part of your scenegraph logic, or create a display list for each material-to-material change.

Because I'm doing a different comparison. It's the reason why high-level logic can organize data faster than a low-level sorting algorithm: it knows what the data is for. I can write an optimal sorter because I know what data comes from where, what uses which shaders, what things are shared with other things, etc.

If I'm rendering a "soldier", then I know that he is made up of a number of rendered objects that use certain programs. I know that he uses a texture array atlas that is shared among all soldiers. I know

From a single comparison of "entity type == soldier", I have already done the equivalent work of 20+ comparisons by the driver.
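In code, that knowledge collapses into a one-off sort of the render queue by a key the application builds itself (a sketch; the key layout is just an example):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct DrawItem {
        uint64_t key;          // e.g. (program << 40) | (textureSet << 20) | uboSet
        const void* object;    // whatever the application actually renders
    };

    void sortQueue(std::vector<DrawItem>& queue) {
        std::sort(queue.begin(), queue.end(),
                  [](const DrawItem& a, const DrawItem& b) { return a.key < b.key; });
        // Identical state now runs back-to-back; the submit loop only rebinds
        // when the relevant bits of the key actually change.
    }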


But the real problem is that for each of your shader and UBO changes the driver needs to do some validation checks to ensure that what you are telling it to do makes sense.

Does it? And considering that it is possible to delete objects at pretty much any time, materials would need the same object validation.


If we can get back to what this thread was originally about:
minimising API calls when you have many thousands of meshes that are too different to use instancing, and I assume each has its own Model matrix that needs to be written to a UBO before another API call to draw from its VBO.

The problem is this: we have insufficient evidence that there is a significant performance penalty coming specifically from function call overhead. Bindless graphics doesn't get its performance from reducing the number of function calls; Dark Photon demonstrated that with his test where he used redundant glVertexAttribFormat calls.

Display Lists don't get their performance on NVIDIA implementations because of lower function call overhead. They get their performance by properly optimizing the sequence of rendering steps for the hardware in question. Putting all of the data in a driver-controlled buffer object, reformatting the data to be optimal when read, etc.

You shouldn't try to fix a problem unless you have evidence that the problem exists. Thus far, if such evidence exists, it has not been presented.

barthold
04-29-2010, 03:33 PM
using 64-bit buffer handles, which just so happen to be GPU addresses on some hardware

Those are not 64-bit handles; they are actual GPU addresses. Even if you completely ignore the fact that the function is called glBufferAddressRangeNV and the fact that the spec constantly refers to them as "addresses", glBufferAddressRangeNV doesn't take an offset. So that 64-bit value must be referring to an address. Either the address returned from querying it or an offset from the queried value.

If it looks like an address, acts like an address, and everyone calls it an address, then it is an address. So please don't act like bindless is something that could be trivially adopted by the ARB or something that doesn't break the buffer object abstraction.

Hi Alfonse,

Just to clarify (and apologies if this is already clear). There's nothing fundamental to the bindless graphics extensions that requires that a GPU address be a physical address, any more than a CPU pointer needs to be a physical address. For example, the Windows Display Driver Model (Vista/Win7) requires that all allocations be movable, yet the bindless extensions still provide a static GPU address. As a result, we believe that bindless graphics can be implemented on any DX10 capable hardware.

Regards,
Barthold
(with my NVIDIA hat on)

Simon Arbon
04-29-2010, 06:11 PM
Display Lists don't get their performance on NVIDIA implementations because of lower function call overhead. They get their performance by properly optimizing the sequence of rendering steps for the hardware in question. Putting all of the data in a driver-controlled buffer object, reformatting the data to be optimal when read, etc.
So what is your actual suggestion?
Do you want a limited version of display lists reintroduced to the core profile specifically for state changes?
How do you propose we query the driver for the relative expense of binding shaders, samplers, textures, and UBOs, so we can group materials in the most efficient way for whatever future hardware the user is running on that we have not had the opportunity to profile ourselves?
If we need to continuously update the data in VBOs (for soft-body animation, for example) then the driver can't reformat the data. Would you like a query that gives you information on how best to arrange your VBOs?
Or is your plan to buy every new card that comes out, manually profile it yourself, and send all your customers regular software updates?


You shouldn't try to fix a problem unless you have evidence that the problem exists. Thus far, if such evidence exists, it has not been presented.
The problem is quite clear: on certain combinations of hardware the GPU has power to spare but the CPU is running at 100% on all cores and can't keep up with a reasonable frame rate.
Aleksandar and I simply want to find the best way to reduce the total number of CPU clock cycles being used to service the GPU.

MZ
04-30-2010, 11:18 AM
Alfonse Reinheart = Korval.

Congratulations on your resurrection, O Master Of Pointless Bickering!

Rejoice, fellow programmers. The single biggest source of opengl.org forum noise since it acquired moderators is among us. Again.

Aleksandar
05-02-2010, 06:48 AM
Oh, the discussion was very interesting last week. I'm sorry I was not able to take part.

Well, Simon, your proposal is very interesting, but requires significant changes in hardware, interfaces and even API.

First, I suggest a name change, from Command Shader to something more suitable. For example Execution Control Program (ECP). This would make its purpose more obvious. The name 'shader' suggests that it is part of the 3D pipeline, which an ECP certainly is not. At least, it would spare us one argumentative post from Alfonse Reinheart. ;)

Second, NVIDIA's GigaThread thread scheduler cannot execute an ECP. The GigaThread global scheduler distributes thread blocks to SM thread schedulers, while at the SM level each warp scheduler distributes warps of 32 threads to its execution units. I do not know how it works on ATI, but AFAIK there is no trace of outer programmability. Furthermore, for managing thousands of threads and splitting tasks I would rather rely on the driver than do it on my own. So, to support ECP we need significantly revised existing hardware, or a new block that can manage what you've proposed.

Third, to manage that new piece of hardware we need a new API. None of the existing APIs (OpenGL, GLSL, OpenCL, CUDA) can be used for flow/execution control management.

Having all that in mind, I don't believe that the hardware vendors would be pleased to implement such a significant modification, unless it is proven to be useful and profitable.

CBO is simpler than DL. In fact, all the functionality (if we reject slots for commands and make a CBO immutable) is already implemented in display lists. It wouldn't be faster than DLs, but it would certainly significantly boost the speed of static groups of objects.

knackered
05-02-2010, 02:57 PM
Alfonse Reinheart = Korval.
yes, it is him - those of us who have read nearly 10 years of his posts could recognise his style a mile off, and there's zero overlap between korval ending his contribution and alfonse beginning his.
I'm quite pleased though - he's very thorough in his analysis, and is a welcome authority to have back among us. It's not pointless bickering! He does seem to be repeating himself a lot in this thread, though...maybe senility? ;)
i miss jwatte.

Simon Arbon
05-02-2010, 07:40 PM
First, I suggest a name change, from Command Shader to something more suitable. For example Execution Control Program (ECP).
Yes, I fully agree.


So, to support ECP we need significantly revised existing hardware, or a new block that can manage what you've proposed.
The load balancing, memory management and other tasks that occur at a global level in GPUs would be extremely complex to perform in fixed-function hardware, and having to interrupt the CPU for the driver to do it all is just too slow.
At some stage in GPU evolution they will require a control processor to manage all of these global tasks efficiently.
There is also a precedent: shaders were introduced because the fixed-function hardware was becoming too complex, and a programmable unit is far more flexible in what it can do.
Both of these also apply to GPU thread/memory management.


Having all that in mind, I don't believe that the hardware vendors would be pleased to implement such a significant modification, unless it is proven to be useful and profitable.
However it is done, GPUs currently execute a command buffer or display list which can contain at least one branching instruction (conditional rendering).
Hence they should be able to execute an ECP (which is just commands with conditional branches); the main question is how efficiently it could be done on current hardware.


Furthermore, for managing thousands of threads and splitting tasks I would rather rely on the driver than do it on my own.
I was not suggesting this, just an enhancement of the existing command buffer to allow conditional branching, better synchronisation of OpenGL with OpenCL, and to allow a function to be automatically executed after each front/back buffer swap.


Third, to manage that new piece of hardware we need a new API. None of the existing APIs (OpenGL, GLSL, OpenCL, CUDA) can be used for flow/execution control management.
The main difference is that the OpenGL and OpenCL API commands would be specified as text instead of as function calls which add themselves to a display list.
The program would be compiled by the GLSL front-end but output to the display-list compiler back-end (with additions for flow control) to generate the ECP.
The number of API functions this would need to support would be small as it does not need to include any of the shader, UBO, VBO, or drawing commands (These would be done using normal API calls, Material objects, Mesh objects or display lists).


CBO is simpler than DL. In fact, all the functionality (if we reject slots for commands and make a CBO immutable) is already implemented in display lists. It wouldn't be faster than DLs, but it would certainly significantly boost the speed of static groups of objects.
I have been trying to write my renderer using only core OpenGL, but I still can't quite get the same speed as a compiled display list for static objects.
There is clearly a need for some kind of stored command buffer. DX added them just as OpenGL took them out.

The main problem I have with display lists is that very little of my geometry stays static long enough to make creating a display list worthwhile.
A CBO with slots would help manage dynamic objects, but how would you prevent excessive fragmentation of the buffer over time?
If it consists simply of a list of commands then you would have some state setting commands followed by a series of drawing commands, then a change of state followed by some more drawing.
If you now remove an object then you have an empty slot that needs to be skipped over, but if you want to add an object (or an extra state change) and you don't have a spare slot, then you either have to move all of the following slots to make room, or repeat all of the state settings so you can add it at the end.
A linked-list structure instead of a linear sequence of slots might be more efficient if the CBO needs to be modified regularly.
If a CBO is short enough then it may be better to compile a new one in a background thread ready to be switched in when needed.

Aleksandar
05-04-2010, 11:33 AM
If you now remove an object then you have an empty slot that needs to be skipped over, but if you want to add an object (or an extra state change) and you don't have a spare slot, then you either have to move all of the following slots to make room, or repeat all of the state settings so you can add it at the end.

Maybe it could be solved by allowing CBO nesting. If you need something to change frequently, it can be "isolated" into separate CBOs. In that case "slots" would be "pointers" to other structures.

When I proposed CBO I had in mind changeable data and fixed commands. You have proposed something more flexible. In any case, a very interesting idea. We shall see if any of our wishes will be fulfilled. :)

Simon Arbon
05-04-2010, 05:24 PM
Or just the equivalent of glCallList/glCallLists commands for CBOs.
The glCallLists array could be stored in a buffer object and contain actual CBO addresses, hence acting like your 'slots'.
Null pointers in this array would simply be skipped over.
Compiling a small CBO should be very fast, the tradeoff being that you would lose a lot of optimisation opportunities to gain more flexibility.
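For comparison, the display-list version of that indirection already exists today (list names here are placeholders); the CBO equivalent I have in mind would store the same kind of array in a buffer object and simply skip zero entries, but that part is purely hypothetical:

    GLuint lists[3] = { opaqueList, terrainList, skyList };
    glListBase(0);                                 // entries are absolute list names
    glCallLists(3, GL_UNSIGNED_INT, lists);        // names that aren't valid lists are ignored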