Backstabbed by skanky IHVs

In order to increase sales of their professional cards, ATI and Nvidia speed-limit OpenGL performance on their consumer cards. Believe it or not, my 8800 GTX can’t do more than 8.5 million triangles a second in OpenGL, whereas it can do over 200 million (that’s a guess, probably far more) in D3D, and I could get 16 million out of my GeForce 256 almost 10 years ago.

I’ve been aware of this for some time and I can’t begin to tell you how disgusted I am. Naively, I thought such an outrageous, horrible thing couldn’t last but it’s actually getting much worse.

Well, I make 3D graphics software pretty much exclusively for consumers, mostly in the field of cultural heritage. Visualisation of the odd complex model is part of this, and with fast GPUs becoming commonplace this should be fine and dandy. In practice, it’s become impossible.

So, with 10 years of a reputation for extremely high-performance 3D graphics software (I’ve always sweated blood over that last 0.01% of performance), I now can’t deliver a simple model visualisation app to my clients.

Thanks a lot IHVs, should I then start looking for a new job and forget about OpenGL entirely?

Now, I’m wondering, have the rest of you not noticed this peculiar fact? When I first got an X800 and discovered it could only do ~40 million tris, whereas the previous-gen 9800 Pro could do four times that, I spent an inordinate amount of time ripping our engine to shreds trying to figure out what was wrong with it. Finally I discovered that installing a modified driver, which tricked the card into believing it was a FireGL, brought performance on par with its (normal) D3D performance. Now the Nvidia 8800 has flung me right back to the dark ages, when no one knew what a 3D accelerator was.

I guess I’m hoping there is some way around this, but if there is the IHVs will probably remove it. What I would really like to see is the IHVs reconsider this feat of treachery. And Nvidia, 8.5 million?! What?! You’ve really overdone it! I can probably do better in software!

In fact I don’t believe it. This would make the latest OpenGL games totally unplayable.

But maybe you can show how you measured that value?

That is a completely crazy statement, unless you’re talking about rendering polygons as lines (wireframe). I and many others render several hundred million triangles per second with OpenGL on the 8800 series, and that’s for shaded triangles. I find it quite hard to believe you work professionally with 3D graphics. The basis for your accusations is flawed. And even if it weren’t, you should back your accusations up with proof that they really do this to increase sales of their cards targeted at the professional market.

@Madoc: um … what the ** have you smoked lately? How did you get those ridiculous numbers? Bold statements need considerable proof!

Well, I’m not sure exactly how it works, but it seems not to adversely affect performance until you actually attempt to use high polygon counts, and I can’t think of many games that use higher polygon counts than that. It’s also possible that they attempt to detect the type of application and scale performance accordingly in some way. Perhaps the IHVs can enlighten us on how they implement this.

As to how I measured it, nothing fancy: just rendering an optimised dense mesh with a variety of vertex formats and simple shaders, counting the number of frames over time. I’ve had the same benchmark run on a number of systems. For example, a GeForce Go 7800 GTX I have here doesn’t seem to be limited in GL_FILL mode and does up to 185 million triangles per second on the same test. A Radeon X1600 (speed-limited) does ~40 million IIRC; I can’t remember what it does with modified drivers, but I think again around 200 million.

Anyway, try it for yourself, I’d be curious to see what kind of figures you get. Unfortunately I’m not at liberty to distribute my particular benchmark in its current state.

Edit:
By “simple shaders” I mean the simplest possible fixed function modes. Untextured, unlit; textured, unlit; textured with 1 directional light etc.
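
If anyone wants to run a comparable test, the measurement itself is nothing more than something along these lines. This is a rough sketch using freeglut and GLEW for brevity, not my actual benchmark (which has its own framework), and you need vsync off or the swap will cap the figure:

```cpp
// Rough sketch of the kind of throughput test I mean: a dense static mesh
// in a VBO, drawn with the simplest fixed-function state (untextured,
// unlit), counting frames over a time window. freeglut + GLEW assumed.
#include <GL/glew.h>
#include <GL/freeglut.h>
#include <vector>
#include <cstdio>

static GLsizei indexCount = 0;
static int frames = 0, t0 = 0;

static void buildMesh()
{
    const int N = 1024;                        // ~2.1M triangles per frame
    std::vector<float> verts;
    std::vector<unsigned int> idx;
    for (int y = 0; y <= N; ++y)
        for (int x = 0; x <= N; ++x) {
            verts.push_back(x / float(N) * 2.0f - 1.0f);
            verts.push_back(y / float(N) * 2.0f - 1.0f);
            verts.push_back(0.0f);
        }
    for (int y = 0; y < N; ++y)
        for (int x = 0; x < N; ++x) {
            unsigned int i = y * (N + 1) + x;
            idx.push_back(i);     idx.push_back(i + 1);     idx.push_back(i + N + 1);
            idx.push_back(i + 1); idx.push_back(i + N + 2); idx.push_back(i + N + 1);
        }
    indexCount = (GLsizei)idx.size();

    GLuint vbo, ibo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(float), verts.data(), GL_STATIC_DRAW);
    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, idx.size() * sizeof(unsigned int), idx.data(), GL_STATIC_DRAW);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, 0);        // simplest mode: untextured, unlit
}

static void display()
{
    glClear(GL_COLOR_BUFFER_BIT);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
    glutSwapBuffers();

    ++frames;
    int t = glutGet(GLUT_ELAPSED_TIME);        // milliseconds
    if (t - t0 >= 5000) {                      // report every 5 seconds
        double tps = double(frames) * (indexCount / 3) / ((t - t0) / 1000.0);
        std::printf("%.1f million tris/sec\n", tps / 1e6);
        frames = 0; t0 = t;
    }
    glutPostRedisplay();
}

int main(int argc, char** argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
    glutCreateWindow("triangle throughput");
    glewInit();                                // after context creation
    buildMesh();
    t0 = glutGet(GLUT_ELAPSED_TIME);
    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}
```

Swap in the other vertex formats and fixed-function modes as you like; the counting part stays the same.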

Please, why do you have to be insulting?

I don’t see anything particularly bold about my statements. In fact, this is actually common knowledge (I will post links later). If you are inclined not to believe what I say, get hold of a Radeon X800 or later and a modified ATI driver, and run some simple tests with that and with the standard latest drivers. That’s proof. I have not invented these figures, and all but the GeForce 256 ones are from the same exact benchmark.

Hmmm… Great! How? I can’t say I’ve investigated it extensively on the 8800 series, but I most certainly did a while back on a few Radeons. I tried every combination of state, vertex formats and layouts, primitive modes, VBO modes and even pixel formats, etc., with the exact same results. IIRC the triangle throughput didn’t even scale; it was just capped. The only thing that made a difference was the hacked driver.

I have second-hand information from one of my colleagues about Nvidia doing the same thing (we all know they have always done it for some things, such as wireframe); I’m trying to contact him so he can point me to the source of it.

I’ve been presuming that the poor 8800 performance was the exact same thing, perhaps it’s worth investigating further.

Few games use 8.5M polygons per frame, but you easily get more than that per second. An 8800GTX easily runs most games at 100+ fps in lower resolutions, and 85k polygons per frame is ridiculously low for a current game.

Well, I just ran one of my projects here, and it drew 137498 triangles per frame at around 645 frames per second, so about 88 million tris/sec. That was with two texture lookups per fragment and a bit of lighting, and almost all the tris were visible. I got an 8800 GT, running Vista x64. And I’d say those numbers are by no means spectacular.

Now… in order to get these numbers, I had to disable Threaded Optimization (or force the app to one core). Before that it’d get 10-15 fps (i.e. ~2 Mtris/s). This happens frequently with OpenGL apps for me (it happened in XP as well). So perhaps that’s what you’re seeing?

Right, I’ve just tested on another system with an 8800 GTS and it did 350 million textured lit and 396 million untextured unlit. It seems that indeed I was jumping to conclusions regarding the 8800.

I will have to test the original system again (should be able to tonight) and see if thread optimisation is what is causing the problem. It seems very likely. Thanks for that.

I am already greatly relieved, I never expected such behaviour from nvidia who I always greatly respected. Sincere apologies for the premature accusations.

However, how do we get around this problem without sacrificing use of multiple cores?

I just ran my current project. It rendered about 2.1 million triangles (most of them visible, only a few back-face culled) at 30 FPS, resulting in 63 mtris/sec.

Test was on a Geforce 9600 GT 512 MB, Quadcore, 2 GB RAM, WinXP.

I disabled all my fancy shaders, which didn’t increase FPS, which might suggest that at that viewpoint the app is indeed vertex-limited (usually I am fill-rate limited, due to extensive post-processing effects, but then most of the geometry is culled).

That is easily double the speed that my Radeon X1600 (Mobility) was able to do. I know that this app runs at 1.5 to 2.0 times that speed on a GeForce 8800 Ultra (some kind of Quadro).

I would say that there is some truth in the accusations. Quadro cards are of course a bit faster, and the price difference is by far not justifiable. But HOW much faster they are is the question. An 8800 Ultra is ALWAYS a lot faster than my 9600 GT, be it a Quadro or not. But 8.5 mtris/sec? I’d say: you are doing it wrong :wink:

And the 200 mtris/sec through D3D: I don’t believe it.
All that PR fluff has been floating around for ages, and everyone should know real-world scenarios never even get close to half or a third of those claims! It’s just completely exaggerated.

IF there are indeed speed limits for OpenGL compared to D3D, they are not as extreme as you suggest. However, IF it is like that, it would still be a real scandal. And honestly, I wouldn’t doubt that it is actually done, because it would be a great argument to buy the expensive Quadro/FireGL cards, since OpenGL is usually only used for CAD and similar stuff, where you can tell your clients what to buy. And Doom 3 etc. is detected by the driver and the speed is unlimited.

Jan.

Yup, I use an 8800 GTS too. In my terrain algorithm (prototype) I can push hundreds of millions of triangles. 8.5 million triangles can be rendered in real time, depending on the complexity of the shader. In one test scene I am doing 5 texture lookups per vertex, although it’s only diffuse lit (4 of the texture lookups are for normals).

I also see lower vertex throughput when the driver has Threaded Optimization enabled. I just set it to Off instead of the default Auto. You can find the option in the NVIDIA Control Panel under 3D Settings. That will possibly solve the issue you’re experiencing. I have yet to see a benefit from enabling Threaded Optimization. Maybe someone who knows exactly what it is good for could post how it works?!

Jan, did you read my last post? That 8.5 million tris/s is obviously a problem with that system, the figures I reported for the 8800 GTS on another system would suggest that no kind of speed limiting is being applied (this was with thread optimisation on auto and off, didn’t make a difference).

When this 8.5 figure came up, a colleague (on the phone) reported that Nvidia apply the same limiting we discovered on ATI cards (and that, unlike ATI cards, they could not be tricked into thinking they were professional versions of the product). I took his word for it and proceeded to post this thread. It was a mistake; I apologise.

Errata: the X1600 from an earlier post was actually an X1900 XTX; we measured a 360% increase in triangle throughput with the modified drivers. Interestingly, a standard Mobility X1400 outperformed it before we made it think it was a FireGL.

I used that glsl_pseudo_instancing (ARB path) Nvidia demo and it gives me 350 Mtris/second on a GeForce 9600 GT 512 MB, quad-core, 2 GB RAM, WinXP (same rig as Jan). Threaded optimisations were off, though I tested with them on as well; no difference for something that simple.

I would think the numbers could be even higher if larger batches were used (the instances have a maximum of a few thousand tris each).

Now, obviously that very simple demo isn’t really a “real world” application, but it shows what could be done (the demo has a single light, with textured and vertex-coloured surfaces).

I’ve never had a “CAD” card (FireGL or Quadro), but I always expected them to have better wireframe drawing and slightly optimised drivers for “immediate mode” rendering and other old-school OpenGL (no VBOs and so on), since that kind of rendering seems to be the norm in many “serious” applications (just guessing; I can only refer to some medical vis stuff we use here).

AFAIK you can make the driver disable the thread optimizations for your app if you set the affinity of your process to one core just before creating the context. Then, after you’ve got the context, you set the affinity back to all cores. This way you can do all the multithreading you want and not have the driver screw things up :smiley:
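
Something like this, off the top of my head (Win32 only, untested, just a sketch of the affinity trick described above):

```cpp
// Pin the process to one core while the GL context is created so the driver
// sets itself up single-threaded, then give the cores back. Untested sketch.
#include <windows.h>
#include <GL/gl.h>

HGLRC createContextOnOneCore(HDC hdc)
{
    DWORD_PTR processMask = 0, systemMask = 0;
    GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);

    DWORD_PTR oneCore = processMask & (~processMask + 1);   // lowest set bit
    SetProcessAffinityMask(GetCurrentProcess(), oneCore);

    HGLRC ctx = wglCreateContext(hdc);    // driver sees a single-core process
    wglMakeCurrent(hdc, ctx);

    SetProcessAffinityMask(GetCurrentProcess(), processMask);  // restore all cores
    return ctx;
}
```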

Cool, thanks Lord crc.

So, I’ve had someone test that card again and… I don’t have any idea why it was doing 8.5; it now performs as well as the other 8800, though only with thread optimisation off, which makes it 40% faster. Apparently my colleague was looking at some OpenGL benchmark results and, though he claimed otherwise, these must include rendering modes for “pro apps” only.

Right, well, made myself look like a right git, sorry everyone.

However, I’m still confident that what I say about the ATI cards is accurate (and is in fact why I so readily jumped to conclusions). I have quite a lot of data to support this and it would seem to exclude the possibility of something like the thread optimisations.

I have a quad-core, so I disabled the thread optimization; it didn’t make any difference at all. The question is, of course: in which situations DOES it make a difference?

I have some experience with command-queued drivers in the context of the MT-GL feature on OS X Leopard; it usually doesn’t help if you have become GPU-limited (restated: you are generating batches faster than the GPU can consume them).

For a steady state demo re-drawing the same thing over and over, the command queue will either trend towards being empty (CPU not generating enough work) or full (CPU generating too much work).

The win comes when you are not GPU-limited: the main thread of your app can spend less wall-clock time specifying the scene’s commands, and the time freed up can go to “other things” while the driver mows down all the commands you piled up. In a demo there are usually no “other things” to do, so you just wait (or try to draw more stuff, jam the command queue full, and then you have to wait).

So my perception of the benefit of command-queued / thread-optimized drivers is that they can convert workloads which might have stayed CPU-limited before, into ones which can run closer to the GPU limit. Restated more simply you get free time back on your drawing thread to do other stuff.

On a crowded scene in World of Warcraft, the difference between MTGL on and off can be as high as 2X in frame rate, because there was sufficient “other stuff” going on to translate into a net wall clock time savings. That said there’s not too much stopping a developer from implementing their own drawing thread and getting the same kind of scaling.

So just to sum up my gut feeling is that threaded drivers can’t easily speed up scenes which are already 100% GPU limited, and some demos may be in that category.
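
For what it’s worth, rolling your own drawing thread is mostly just a producer/consumer queue between the game thread and a thread that owns the GL context. A bare-bones sketch (modern C++ for brevity; the class and names are only illustrative, and the context handling and the actual GL calls are yours):

```cpp
// The app thread enqueues closures, a dedicated thread that owns the GL
// context drains them. Assumes the context is made current on that thread
// before run() is entered.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class DrawQueue {
public:
    // Called from the game thread: cheap, never blocks on the GPU.
    void submit(std::function<void()> cmd) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(cmd)); }
        cv_.notify_one();
    }
    void stop() { submit(nullptr); }    // empty function = shutdown sentinel

    // Runs on the GL thread, e.g.  std::thread gl([&]{ /* make context current */ q.run(); });
    void run() {
        for (;;) {
            std::function<void()> cmd;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return !q_.empty(); });
                cmd = std::move(q_.front());
                q_.pop();
            }
            if (!cmd) break;
            cmd();                      // issue the GL calls for this command
        }
    }
private:
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
};
```

The effect is the same as the driver’s version: the game thread gets its wall-clock time back while the GL thread mows down the queue, except you decide what counts as a command and how full the queue is allowed to get.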

I can believe anything of ATI. For example, ATI don’t know how to compile an efficient display list, while Nvidia are brilliant at it. If Nvidia weren’t so good, ATI might look OK.
However, you won’t be surprised if everyone treats your “supporting data” with a big lump of scepticism after that tirade.

Performance drop with threaded optimization: Here’s what I found out.

  • Use VBOs for vertex arrays
  • Always interleave vertex attributes; never split attributes across multiple VBOs (e.g. normals in one VBO, texcoords in another)
  • Use indexed primitives
  • Regardless of which glDraw*() command is used: never exceed GL_MAX_ELEMENTS_INDICES or GL_MAX_ELEMENTS_VERTICES
  • Don’t use GL_DOUBLE
  • Don’t use immediate mode

When you do it like this, the threaded optimization works really nicely! It seems not to be related to using shaders; it’s got something to do with the data management/conversion inside the driver.

After all, this means: pass the data in exactly the one way the hardware can handle best. If the driver has to rearrange your data in any way, either performance drops, or all cores go to 100%, or everything lags like hell.
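
To make that concrete, this is roughly the “one way” I mean (just a sketch; the attribute layout is made up):

```cpp
// One interleaved VBO plus an index buffer, floats only, indexed draws.
// Keep batch sizes under GL_MAX_ELEMENTS_VERTICES / _INDICES. The layout
// here is only an example.
#include <GL/glew.h>
#include <cstddef>

struct Vertex {            // interleaved: position, normal, texcoord
    float pos[3];
    float nrm[3];
    float uv[2];
};

void setupAndDraw(const Vertex* verts, int numVerts,
                  const unsigned short* indices, int numIndices)
{
    GLuint vbo, ibo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, numVerts * sizeof(Vertex), verts, GL_STATIC_DRAW);

    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, numIndices * sizeof(unsigned short),
                 indices, GL_STATIC_DRAW);

    // All attributes come out of the same interleaved buffer.
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, pos));
    glEnableClientState(GL_NORMAL_ARRAY);
    glNormalPointer(GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, nrm));
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), (void*)offsetof(Vertex, uv));

    // Tell the driver the vertex range up front; no GL_DOUBLE, no immediate mode.
    glDrawRangeElements(GL_TRIANGLES, 0, numVerts - 1, numIndices,
                        GL_UNSIGNED_SHORT, 0);
}
```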

CatDog