FBO performance: who is right?

Hello,

years ago, I saw Simon Green’s presentation, which gave recommendations for FBO performance. The relevant slide is nr. 29 from http://http.download.nvidia.com/developer/presentations/2005/GDC/OpenGL_Day/OpenGL_FrameBuffer_Object.pdf .
He recommends using one FBO and switching between textures.

But then Valve comes along and tells a very different story. The slides can be found here: https://developer.nvidia.com/sites/default/files/akamai/gamedev/docs/Porting%20Source%20to%20Linux.pdf ; the relevant one is nr. 65. It says “Do not create a single FBO and then swap out attachments on it.”

Now, I realize that (a) nothing replaces doing your own benchmarks (b) Green’s slides are years older than Valve’s. So, perhaps changes in the hardware caused Green’s preferred option to no longer be the fastest one?

Your thoughts?

FBOs were designed to encapsulate the whole framebuffer state and provide a way to replace it all at once, so I would strongly recommend having multiple FBOs instead of a single one with swapped attachments. Why?

  1. This is how the API was designed to be used
  2. It usually results in fewer API calls
  3. The driver doesn’t have to re-validate framebuffer completeness and other internal state (these are cached inside the FBO)
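
To make the contrast concrete, here is a rough sketch of the two approaches (the handles fbo, fboA, fboB, texA, texB are placeholders assumed to be created elsewhere; error checking omitted):

```c
/* Single FBO, attachments swapped per pass (what Valve advises against): */
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, texA, 0);  /* driver must re-validate */
/* ... render to texA ... */
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, texB, 0);  /* and re-validate again */
/* ... render to texB ... */

/* One FBO per target, each configured once at init: */
glBindFramebuffer(GL_FRAMEBUFFER, fboA);  /* completeness already cached */
/* ... render to texA ... */
glBindFramebuffer(GL_FRAMEBUFFER, fboB);  /* one call, cached state */
/* ... render to texB ... */
```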

Sure, doing your own benchmarks is the best way to determine what’s best for you, but even if the choice I’ve suggested is currently not the most efficient on some driver implementation, it is the one more likely to be further optimized, so in the long run it should be faster.

(Off-topic: That’s why I didn’t really understand Valve’s presentation regarding VAOs, as they recommend against using them. In practice, VAOs are in fact faster once you have more than one or two attributes, and they should be preferred for the same reason: that’s their purpose, fewer API calls and state that can be cached in the object.)

Well, it doesn’t really matter for the modern core-GL developer anyway. :slight_smile:

It matters. In Core Profile you do need a VAO, it’s true, but you can gen and bind one at application startup and never unbind it. Then, before each draw call, configure your arrays with GL_ARB_vertex_attrib_binding. This is a fast path for the NV driver.
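
A minimal sketch of that pattern (vao, vbo, and vertexCount are placeholder names; this assumes a GL 4.3 context or GL_ARB_vertex_attrib_binding support):

```c
/* At startup: Core Profile requires a VAO, so create one and leave it bound. */
GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);

/* The attribute format can be declared once if it doesn't change: */
glVertexAttribFormat(0, 3, GL_FLOAT, GL_FALSE, 0);  /* attrib 0: vec3 position */
glVertexAttribBinding(0, 0);                        /* attrib 0 reads binding 0 */
glEnableVertexAttribArray(0);

/* Before each draw call: just point binding 0 at the right buffer. */
glBindVertexBuffer(0, vbo, 0, 3 * sizeof(float));
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
```

This separates the (rarely changing) attribute format from the (frequently changing) buffer binding, so the per-draw work reduces to a single glBindVertexBuffer call.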

Well, having just read the very brief passage in the Valve slides, I find a claim like “Slower than glVertexAttribPointer on all implementations” without any substantial data to back it up rather amusing - just like them calling a VAO a vertex attribute object. Someone clearly has a love for the spec terminology.

The idea that more API calls can still reach higher performance, even if true on all current implementations, just baffles me. It again leaves me wondering why it is so difficult with OpenGL for mandatory concepts to perform as well as, or better than, the very call sequences they are supposed to replace.

I agree that binding as few VAOs per frame as possible is good, but binding a single VAO just to conform to the spec and then setting up your arrays every frame? That’s simply messed up.

I believe the problem with using VAOs is the lack of distinction between “bind for modify” and “bind for draw”. Because the driver has no idea what you’re going to do, it has to set up all the state for each vertex attribute location (×16 or ×20), whether that location is enabled or not, so that all the vertex queries and state are available to you for a modify op. But if there existed a specialized “bind VAO for draw” call, it could skip setting up the state for all disabled locations and simply set the disabled flag on those.

That’s probably why calling glVertexAttribPointer() on a small subset of locations is faster than a VAO switch, which sets up the state for all locations.

There isn’t really a “bind for modify” for VAOs except at the beginning, when you set them up, so that argument sounds irrelevant. Also, assuming that binding a VAO will internally set all 16 vertex attribute states even if e.g. 14 are disabled is pretty naive; it’s unlikely any driver does that.

I think a rather logical reason why VAOs look slower than glVertexAttribPointer is that most apps only use 1-2 vertex attributes, and drivers probably have more optimizations in place for glVertexAttribPointer calls than for VAO binds simply because more applications use the former. But once again, the theoretical maximum performance of a VAO bind is definitely higher than that of individual glVertexAttribPointer calls; it may just be that the former is so underused that it’s not well optimized.

This generally applies to everything, even core profile vs. compatibility profile. A lot of people have already benchmarked it and said that the compatibility profile looks faster. Why? Because nearly all applications use the compatibility profile, so those paths are more mature and better optimized at the moment. That doesn’t mean they are any better; in fact, it’s quite the opposite.

VAO state can be changed at any time, so the driver has to treat each bind as both “bind for edit” and “bind for draw”. AFAIK it is not possible to optimize VAOs in the NV driver; if it were, they would have done it.

[QUOTE]Also, assuming that binding a VAO will internally set 16 vertex attribute state even if e.g. 14 are disabled is also pretty naive and it is unlikely to be done so by any driver.[/QUOTE]

Yet it’s been shown in several threads that VAOs are measurably slower in many cases, so this doesn’t give me a lot of faith that the driver is doing something smart. It almost seems as if VAOs have been implemented as a client-side convenience macro that sets up global vertex state, rather than as a referenced server-side object. At the very least, using a VAO should be just as fast as manually setting up the state, seeing as it’s an integral part of GL now.

Why? You think the NV driver has optimized every single feature to the level that it cannot be improved any further? That’s a pretty naive assumption.

No, but VAOs are an obvious optimization, and Valve is a rather important client. I am sure NVIDIA got reports from Valve about slow VAOs, but new drivers still aren’t any faster in this area.
Other optimizations requested by Valve have been implemented by NVIDIA (NVIDIA 310.14: OpenGL 4.3, Threaded Optimizations - Phoronix).

If they theoretically could and should be but still aren’t, do you really think it’s because the driver is already fully optimized? Valve is one AAA client on Linux. One.

One and only one on Linux.

There is no fully optimized OpenGL driver, and there never will be, in the same way that there are no perfect things in the world in general. In software engineering this is even more true: no sufficiently large application is 100% complete and optimal. It’s just how things are. Once again, assuming something is fully optimized is rather naive.

Also, it doesn’t take a rocket scientist to figure out that if something is done with 2 or more API calls, the same thing can definitely be done faster with a single API call (if the underlying implementation is optimal, the additional function-call overhead alone would make the former slightly slower).

What I am saying is that the VAO design and the NV driver design just don’t work well together. This is why the GL_NV_vertex_buffer_unified_memory extension was developed, why NVIDIA doesn’t use VAOs in its code samples, and why NVIDIA employees do not recommend using them.

If a company is prepared to put their balls on the chopping block over this, if that company has millions of dollars at stake over this, if that company has a priority of a working program that runs well rather than taking sides in a religious war - I’m inclined to believe them.

[QUOTE]If a company is prepared to put their balls on the chopping block over this, if that company has millions of dollars at stake over this, if that company has a priority of a working program that runs well rather than taking sides in a religious war - I’m inclined to believe them.[/QUOTE]

Fair enough, so let’s analyze this on those grounds.

NVIDIA does indeed have “millions of dollars at stake over this”. But what exactly is “this”? They’re basically saying, “Don’t use the standard method; use our proprietary extension.” Or put another way, “Don’t use the standard method; make your code only work on our hardware.”

How does NVIDIA not have a stake in “taking sides in a religious war?” NVIDIA has financial reasons to want people to use their proprietary extensions, and financial reasons to encourage people to not use similar core functionality. With AMD being less competitive and having financial troubles… why should we expect what NVIDIA says to be on the level here?

What would NVIDIA have to gain by putting effort into making VAOs more performant? The more they push their proprietary extensions, the more people buy into it. Which puts people in their ecosystem. This puts pressure on some developers, who then start wanting their customers to buy NVIDIA because that’s what they write their code for. Thus pushing sales of NVIDIA hardware. Which in turn increases NVIDIA’s marketshare, thus encouraging other developers to make the switch, which causes more sales, etc.

NVIDIA doesn’t make more money by having fast VAO code. NVIDIA makes more money by encouraging more people to write NVIDIA-GL code instead of OpenGL code. Does this mean that they have convinced their developers not to work on VAO performance? Well, someone told them to write the bindless graphics extensions to begin with. Whether that someone was on the driver team, pushing for a performance extension, or someone in marketing wanting to differentiate their performance from others, I can’t say.

But you cannot reject the very real possibility that they’re not on the level here. I see no reason to blindly trust an organization that has a direct financial stake in getting people to not use cross-platform OpenGL code.

There’s no way to know for certain because NVIDIA guards their hardware specifications vigorously. We only have the word of someone who has plenty of reasons to lie.

No. VAOs are a half-way bandaid (that can improve perf on NV BTW). But bindless vtx attribs are a faster/cleaner solution for the problem, on NV at least. No bazillion VAO blocks floating around in driver memory to cause cache misses.

That said, bindless does make assumptions about the underlying implementation, which may be an obstacle to wider adoption. I’d like to see some discussion on that. Perhaps there’s an intermediate approach that bridges the gap, such as storing the “contents” of VAOs in “client” memory. That gets you part way there.
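
For reference, the bindless path looks roughly like this (a sketch only; it requires GL_NV_vertex_buffer_unified_memory together with GL_NV_shader_buffer_load, and vbo, bufSize, and vertexCount are placeholders):

```c
/* Init: query the buffer's GPU address and make it resident. */
GLuint64 addr;
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);

/* Switch vertex pulling to "unified memory" (address-based) mode. */
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableVertexAttribArray(0);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float));

/* Per draw: feed attrib 0 straight from the GPU address; no object lookup. */
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, bufSize);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
```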

That’s all lovely but I was talking about Valve.

[QUOTE=dv;1250041]years ago, I saw Simon Green’s presentation<snip>He recommends to use one FBO and switch between textures.
<snip>
But then, Valve comes, and tells a very different story. It says “Do not create a single FBO and then swap out attachments on it.”[/QUOTE]

As far as I am aware, both are right; they are talking about two different scenarios.

If you switch between fewer render textures (of the same size) than the maximum number of FBO attachments, you can attach them all to a single FBO and switch quickly between them. These are static attachments, typically set up at init time, and they never change.

What Valve is talking about is swapping out the attachments themselves, which does not apply to the situation described above.
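
In code, the scenario Green describes might look like this (a sketch; fbo, texA, and texB are placeholder handles for same-sized render textures, and the switch happens via glDrawBuffer rather than by re-attaching):

```c
/* Init: attach every render texture to the FBO once, then never touch it. */
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, texA, 0);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1,
                       GL_TEXTURE_2D, texB, 0);

/* Per frame: select the render target without changing any attachment. */
glDrawBuffer(GL_COLOR_ATTACHMENT0);  /* render into texA */
/* ... */
glDrawBuffer(GL_COLOR_ATTACHMENT1);  /* render into texB */
```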