OpenGL performance - demystified

Hello to the whole GL community,

This post is not a question, but an attempt to help beginners get an overview of OpenGL performance and to dispel some common misconceptions. Everything written here relates to NVIDIA’s implementation in the 19x.xx drivers for Windows. I’ll be glad if you take part and broaden this overview to other vendors (AMD/ATI in the first place).

The first three misconceptions I’ll comment on are:

  1. Using OpenGL 3.2 Core Profile significantly boosts the application speed
  2. Shaders are faster than fixed functionality
  3. Bindless Graphics boosts speed by an order of magnitude

Some months ago, I read in one of the posts here that the GL 3.2 Core Profile enables much cheaper function calls. I was very excited about that, and it was the prime reason to switch to the “new technology”. But after some time spent “porting” an application to GL 3.2, I realized that the boost is negligibly small, if it exists at all. The few percent of change came from code reorganization, not from cheaper function calls. To be more direct: using the OpenGL 3.2 Core Profile on NVIDIA currently does not change the speed one bit! Of course, my intent is not to turn you away from the new programming model; I use it myself. I just want to emphasize that switching to GL 3.2 Core does not bring any speed boost. I measured the execution speed of glMultiDrawElements() and noticed no change, nor any change in the application’s frame rate.
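
For those who want to repeat such a measurement, here is a rough sketch of bracketing a draw call with glFinish() and a high-resolution timer on Windows. It is a simplified sketch, not my exact test code; it assumes a current GL context and an extension loader such as GLEW for post-1.1 entry points:

```c
#include <windows.h>
#include <GL/glew.h>   /* assumed: an extension loader for post-1.1 entry points */

/* Returns the time in seconds spent in one glMultiDrawElements() batch.
   glFinish() is for benchmarking only: it stalls the pipeline, so do not
   leave it in production code. */
double TimeBatch(const GLsizei *counts, const GLvoid * const *indices,
                 GLsizei primCount)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    glFinish();                       /* drain previously queued GPU work   */
    QueryPerformanceCounter(&t0);

    glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_INT,
                        indices, primCount);

    glFinish();                       /* wait until the GPU really finishes */
    QueryPerformanceCounter(&t1);

    return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}
```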

Because fixed functionality is still supported (no matter how it is implemented), it is unlikely that fixed-function calls take different paths through the pipeline than shaders do. It is also unlikely that you can implement any of that functionality better than the driver developers can. Furthermore, shaders are usually used to extend the standard functionality, which means more computation in the shaders. To be more direct: shaders are at best as fast as fixed functionality, and often slower! We can, of course, use tricks and skip some calculations to make a shader faster, but implementing the full fixed-function pipeline in shaders will certainly be slower than, or at best equal to, the standard fixed functionality.
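
As a trivial illustration: the closest a shader can get to the fixed path is to reproduce it exactly, e.g. via ftransform(). A minimal sketch like this can only match fixed functionality, never beat it, and full fixed-function lighting would add far more code on top:

```c
/* A fixed-function-equivalent vertex shader (GLSL 1.20 era), as a C string.
   ftransform() reproduces the fixed-function vertex transform exactly. */
static const char *vs_src =
    "#version 120\n"
    "void main()\n"
    "{\n"
    "    gl_FrontColor  = gl_Color;            // pass vertex color through\n"
    "    gl_TexCoord[0] = gl_MultiTexCoord0;   // pass texcoord through\n"
    "    gl_Position    = ftransform();        // exact fixed-function transform\n"
    "}\n";
```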

Bindless Graphics is one of the best things that happened in the OpenGL world in the past year. Porting an application to it was a very pleasant experience (although, until I resolved some bugs in my application, some very severe crashes happened). Because I use (tens of) thousands of VBOs in a scene, I thought the bindless extensions were something I had to try, and I haven’t regretted it. With 65,025 VBOs in the scene (and from 600K to 10M triangles), the speed gain was from 50% to 70%. The greatest speed gain achieved with the bindless extensions (across all test cases) was 2x. Although I didn’t achieve 7.5x, with “just” 1.5-2x I’m very satisfied. (For fewer than 1K VBOs there is no speed gain at all.) Another great feature of the bindless extensions is that they support both fixed functionality and shaders.
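
For those who haven’t looked at the extensions yet, the core idea is small. Roughly, the one-time setup per VBO looks like this (a sketch assuming GL_NV_shader_buffer_load and GL_NV_vertex_buffer_unified_memory are present; error checking omitted):

```c
/* One-time setup per VBO: query the buffer's GPU address and make the
   buffer resident. After this, it never has to be bound again to draw. */
GLuint64EXT vboAddr;

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vboAddr);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glBindBuffer(GL_ARRAY_BUFFER, 0);   /* from here on, only the address is used */
```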

Table 1 shows the results of testing on a textured and lit terrain. The values in the gray columns are pseudo frame rates (the reciprocal of the rendering time); greater values are better. Values in the yellow columns are speed-gain factors.

http://sites.google.com/site/opengltutorialsbyaks/events/vboperformancetest/Tab-1.PNG

Table 2 shows how the triangle and VBO counts map to the LODx and block-size (64, 128, or 256) values.

http://sites.google.com/site/opengltutorialsbyaks/events/vboperformancetest/Tab-2.PNG

  1. Using OpenGL 3.2 Core Profile significantly boosts the application speed

I don’t think anyone has ever argued that.

  2. Shaders are faster than fixed functionality

Or this.

  3. Bindless Graphics boosts speed by an order of magnitude

Even NVIDIA only ever claimed a 7x performance improvement, and that involved both halves of bindless, not just the vertex half.

So I’m afraid your findings are not particularly Earth-shattering. Disproving arguments nobody is making is not particularly insightful :wink:

I agree with Alfonse.

  1. This has always been mentioned only as a “possibility”: “if they were to write an entirely new driver only for this profile, they COULD make it faster”. It was always clear that this is not the case at the moment and won’t be for many years.

  2. Maybe on the very first shader-only GPUs, and only for a few months. You can’t beat the IHVs’ hand-tuned fixed-function shaders.

  3. I am actually positively surprised by your findings. I had assumed this extension would be as pointless as VAOs are.

Jan.

I’ve realised that the vector versions of functions, particularly glVertex, are very slow with PyOpenGL at least. Not sure about OpenGL with other languages.

Changing glVertexfv(li) to glVertexf(*li) in Python gave me a massive speed increase.

That’s something I’ve noticed about performance.

Matthew, it could be caused by Python itself, i.e. having to clone memory instead of pushing a value onto the virtual machine’s stack. In a native binary app it would be a face-palm if this problem existed. Anyway, I think Aleksandar didn’t mean to discuss the glBegin/glEnd interfaces here.

What Aleksandar wrote may not be news to many of us, but I think it’s a nice post for the Google search archive, so newcomers can directly find clarification on performance aspects (for the current snapshot of drivers).

If I had any Earth-shattering discovery, I would write a scientific paper, not a post on the beginners’ forum. You have overlooked that. But my goal is achieved: you have agreed with my statements, and many beginners will see them before they fall into false assumptions. :wink:

And yes, there is a post that claims GL 3.2 calls are cheaper (not that they will be, but that they are), but I won’t post the link, because it would be a “negative citation” for the poster.

There are also questions on this forum about shader speed, so I didn’t make any of these statements up on my own.

I brought the bindless extensions into the story because I was excited by speed improvements that are not advertised enough. :wink:

That poster might have been me, with some beta GeForce drivers, multithreading enabled, comparing CPU+GPU cycles between slightly differently tuned PCs, drivers, and benchmark scenes (basically, a completely unfair comparison). I regret not realizing my mistakes while the thread was still recent and on topic. Though it was a post about 3.1, IIRC.

Aleksandar, actually, writing a PDF and uploading it anywhere seems to get a higher Google rank when searching for tech topics.

Don’t worry, Ilian, and thank you for your many very useful posts.

Writing a PDF may be better for Google searches, but it eliminates feedback; the official OpenGL forum is better for that purpose. I’m also curious about how ATI deals with OpenGL performance issues. Intel is “out of the game”, but it will also be interesting to see how the i5 with an integrated graphics chip deals with OpenGL.

…maybe not out for good, but definitely gonna be really, really late to the party.

So just to quantify it, 50% reduction in draw time. I’ve definitely seen ~15-20% here, and that’s only for vertex attribute bindless.

It depends on the number of VBOs and their complexity. A greater number of VBOs with fewer triangles achieves better performance. The 50% reduction in drawing time is the absolute maximum I have seen in my tests.

A greater number of VBOs with fewer triangles achieves better performance.

Of course it does. Bindless removes the performance impact of buffer object binding. The more binding you do, the better bindless is. If you’re not doing much binding already, bindless isn’t going to help much.

Your bindless tests were solely using vertex buffer bindless, right? You weren’t testing program “uniform” data bindless.

I’m guessing the wrapper doesn’t convert the lists to vectors properly.

Yes, I’ve used just vertex buffer bindless, because the number of vertex buffers is critical in my application. I think that using pointers to uniform blocks would not help much, but I’ll try that too.
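
For reference, the shader half (GL_NV_shader_buffer_load) looks roughly like this. This is only a sketch, and perObjectData, dataBuffer, and prog are made-up example names:

```c
/* GLSL side: with GL_NV_shader_buffer_load a uniform can be a raw GPU
   pointer that the shader dereferences directly. */
static const char *vs_src =
    "#version 120\n"
    "#extension GL_NV_shader_buffer_load : require\n"
    "uniform vec4 *perObjectData;    // a GPU address, not a bound block\n"
    "void main()\n"
    "{\n"
    "    gl_Position = ftransform() + perObjectData[0];\n"
    "}\n";

/* C side: make the buffer resident and hand its GPU address to the shader */
GLuint64EXT addr;
glBindBuffer(GL_ARRAY_BUFFER, dataBuffer);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glUniformui64NV(glGetUniformLocation(prog, "perObjectData"), addr);
```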

Well, I for one appreciate Aleksandar chiming in here with his personal experience (apparently you don’t). This is much more valuable than mere intuition about the expected performance of a feature, especially ex post facto, when several folks have already cited actual quantitative results, rendering expectations moot. Don’t be so impertinent.

Thanks, Aleksandar! Very relevant info; as Dark Photon mentioned, it is real experience, and thank you for sharing it here.

P.S. Sorry, other guys, but there are no points for arguments or disputes. Every one of us has a lot of expectations about Bindless Graphics, VAOs, whatever… but we must always respect real experience, especially when someone shares it absolutely for free. If we have our own experience that differs greatly from Aleksandar’s, let us share it here; that would be the best we can do.

Also, before we put this thread totally to rest, let me say that my total bindless batches draw-time speedup thus far is ~42% (nearly 2X) – very close to Aleksandar’s 50%. That’s not even using shader bindless.

You get about half of that from providing actual GPU addresses for the vertex attribute and index “pointers”, and the other half from getting rid of all the now-needless glBindBuffer calls. This makes VBOs (which tend to have a very heavy batch setup cost) very light – to the point of almost matching NVIDIA display list performance, which, as most here know, is awesome!

Without bindless, VBOs can’t come anywhere close to display lists (unless your batches are silly-huge, which means your culling suffers). With bindless, they’re neck-and-neck. Doubtless something bindless-like is what display lists have been doing under the covers all this time.
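
To make those two halves concrete, here is a rough sketch of a per-frame draw loop both ways (fixed-function, position-only arrays; vboAddr[] and vboSize[] are assumed to have been queried once at load time, as sketched earlier in the thread):

```c
/* Alternative A, classic VBO path: one glBindBuffer per batch, every frame */
glEnableClientState(GL_VERTEX_ARRAY);
for (int i = 0; i < numVBOs; ++i) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo[i]);
    glVertexPointer(3, GL_FLOAT, 0, 0);
    glDrawArrays(GL_TRIANGLES, 0, vertCount[i]);
}

/* Alternative B, bindless path: enable unified vertex memory once, then
   feed raw GPU addresses; no glBindBuffer calls inside the loop */
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glVertexFormatNV(3, GL_FLOAT, 0);          /* vertex format set just once */
for (int i = 0; i < numVBOs; ++i) {
    glBufferAddressRangeNV(GL_VERTEX_ARRAY_ADDRESS_NV, 0,
                           vboAddr[i], vboSize[i]);
    glDrawArrays(GL_TRIANGLES, 0, vertCount[i]);
}
```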

speedup thus far is ~42% (nearly 2X)

A +42% speed-up is 1.42X, not “nearly 2X”.
Can you clarify?

Ah, I think I see your mistake (or at least the difference in our thinking). :slight_smile:

By % speed-up, I mean:

42% speed-up: Original = X ms. New = (X − 0.42·X) ms, i.e. a 42% time reduction. That leaves 0.58·X ms, so it takes 58% of the original time. 1/0.58 ≈ 1.72, i.e. 1.72X faster: you can do 1.72X as much as before in the same time (e.g. what took 10 ms before now takes 5.8 ms).
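
In general, if r is the fractional time reduction, the speed-up factor is S = 1/(1 − r); inverting, an S-times speed-up corresponds to r = 1 − 1/S. So: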

1.42X faster would be a 30% speed-up. 2X faster would be a 50% speed-up (time reduction). The 7X speed-up NVIDIA claimed in their bindless test would be an 85.7% time reduction.

Is there another interpretation of % speed-up that I might have overlooked? Mine could be wrong.

@Dark Photon, sorry, but your “speed up” definition seems so odd.

% and X are kinds of units, actually ratios, so they’re unitless. You can’t assume “ms”, framerate, or anything else; that’s mathematically incorrect. X has 1 as its reference, and % has 100 as its reference. Hmmm, sounds like an obvious relationship to me! :stuck_out_tongue:

Well, you know the saying: “A good program is a program that solves a complex problem with a simple solution.”

That works for this % / X business, I guess.