Backstabbed by skanky IHVs



Madoc
06-26-2008, 02:27 AM
In order to increase sales of their professional cards, ATI and Nvidia speed-limit OpenGL performance on their consumer cards. Believe it or not, my 8800 GTX can't do more than 8.5 million triangles a second in OpenGL, whereas it can do over 200 million (that's a guess, probably far more) in D3D, and I could get 16 million out of my GeForce 256 almost 10 years ago.

I've been aware of this for some time and I can't begin to tell you how disgusted I am. Naively, I thought such an outrageous, horrible thing couldn't last but it's actually getting much worse.

Well, I make 3D graphics software pretty much exclusively for consumers, mostly in the field of cultural heritage. Visualisation of the odd complex model is part of this, and with fast GPUs becoming commonplace this should be fine and dandy. In practice, it's become impossible.

So, with 10 years of reputation for extremely high-performance 3D graphics software (I've always sweated blood over that last 0.01% of performance), I now can't deliver a simple model visualisation app to my clients.

Thanks a lot, IHVs. Should I start looking for a new job and forget about OpenGL entirely?

Now, I'm wondering, have the rest of you not noticed this peculiar fact? When I first got an X800 and discovered it could only do ~40m tris, whereas the previous-gen 9800 Pro could do 4 times that, I spent an inordinate amount of time ripping our engine to shreds trying to figure out what was wrong with it. Finally I discovered that installing a modified driver that tricked the card into believing it was a FireGL brought performance on par with its (normal) D3D performance. Now the Nvidia 8800 has flung me right back to the dark ages when no one knew what a 3D accelerator was.

I guess I'm hoping there is some way around this, but if there is, the IHVs will probably remove it. What I would really like to see is the IHVs reconsider this act of treachery. And Nvidia, 8.5 million?! What?! You've really overdone it! I can probably do better in software!

Xmas
06-26-2008, 03:11 AM
Believe it or not, my 8800 GTX can't do more than 8.5 million triangles a second in opengl
In fact I don't believe it. This would make the latest OpenGL games totally unplayable.

But maybe you can show how you measured that value?

Nicolai de Haan
06-26-2008, 03:49 AM
Believe it or not, my 8800 GTX can't do more than 8.5 million triangles a second in opengl whereas it can do over 200 million (that's a guess, probably far more) in D3D

That is a completely crazy statement, unless you're talking about rendering polygons with lines (wireframe). I and many others render several hundred million triangles per second with OpenGL on the 8800 series. This is for shaded triangles. I find it quite hard to believe you work professionally with 3D graphics. The basis for your accusations is flawed. Even if it weren't flawed, you should back your accusations up with proof that they indeed do this to increase the sales of their cards targeted at the professional market.

tanzanite
06-26-2008, 03:55 AM
@Madoc: um ... what the ** have you smoked lately? How did you get those ridiculous numbers? Bold statements need considerable proof!

Madoc
06-26-2008, 04:05 AM
Well, I'm not sure exactly how it works, but it seems not to adversely affect performance until you actually attempt to use high polygon counts. I can't think of many games that use higher polygon counts than that. It's also possible that they attempt to detect the type of application and scale performance accordingly in some way. Perhaps the IHVs can enlighten us on how they implement this.

As to how I measured it, nothing fancy: just rendering an optimised dense mesh with a variety of vertex formats and simple shaders, counting the number of frames over time. I've had the same benchmark run on a number of systems. For example, a GeForce Go 7800 GTX I have here doesn't seem to be limited in GL_FILL mode and does up to 185 million triangles per second on the same test. A Radeon X1600 (speed limited) does ~40 million IIRC; I can't remember what it does with modified drivers, but I think again around 200 million.

Anyway, try it for yourself, I'd be curious to see what kind of figures you get. Unfortunately I'm not at liberty to distribute my particular benchmark in its current state.


Edit:
By "simple shaders" I mean the simplest possible fixed function modes. Untextured, unlit; textured, unlit; textured with 1 directional light etc.

Madoc
06-26-2008, 04:14 AM
@Madoc: um ... what the ** have you smoked lately? How did you get those ridiculous numbers? Bold statements need considerable proof!

Please, why do you have to be insulting?

I don't see anything particularly bold about my statements. In fact, this is actually common knowledge (I will post links later). If you are inclined not to believe what I say, get hold of a Radeon X800 or later and a modified ATI driver and run some simple tests with that and with the standard latest drivers. That's proof. I have not invented these figures, and all but the GeForce 256 ones are from the same exact benchmark.

Madoc
06-26-2008, 04:35 AM
I and many others render several hundred million triangles per second with OpenGL on the 8800 series.

Hmmm... Great! How? I can't say I've investigated it extensively on the 8800 series, but I most certainly did a while back on a few Radeons. I tried every combination of state, vertex formats and layouts, primitive modes, VBO modes and even pixel formats etc., with the same exact results. IIRC the triangle throughput didn't even scale; it was just capped. The only thing that made a difference was the hacked driver.

I have second-hand information from one of my colleagues about Nvidia doing the same thing (we all know they have always done it for some things, such as wireframe); I'm trying to contact him so he can point me to the source of it.

I've been presuming that the poor 8800 performance was the exact same thing, perhaps it's worth investigating further.

Xmas
06-26-2008, 04:57 AM
Well, I'm not sure exactly how it works but it seems to not adversely affect performance until you actually attempt to use high polygon counts. I can't think of many games that use higher polygon counts than that.
Few games use 8.5M polygons per frame, but you easily get more than that per second. An 8800 GTX easily runs most games at 100+ fps in lower resolutions, and 85k polygons per frame (which is all that 8.5M per second works out to at 100 fps) is ridiculously low for a current game.

Lord crc
06-26-2008, 05:31 AM
Well, I just ran one of my projects here, and it drew 137498 triangles per frame at around 645 frames per second, so about 88 million tris/sec. That was with two texture lookups per fragment and a bit of lighting, and almost all the tris were visible. I got an 8800 GT, running Vista x64. And I'd say those numbers are by no means spectacular.

Now... in order to get these numbers, I had to disable Thread Optimizations (or force the app to one core). Before that it'd get 10-15 fps (i.e. ~2 Mtris/s). This happens frequently with OpenGL apps for me (it happened in XP as well). So perhaps that's what you're seeing?

Madoc
06-26-2008, 06:53 AM
Right, I've just tested on another system with an 8800 GTS and it did 350 million textured lit and 396 million untextured unlit. It seems that indeed I was jumping to conclusions regarding the 8800.

I will have to test the original system again (should be able to tonight) and see if thread optimisation is what is causing the problem. It seems very likely. Thanks for that.

I am already greatly relieved; I never expected such behaviour from Nvidia, whom I have always greatly respected. Sincere apologies for the premature accusations.

However, how do we get around this problem without sacrificing use of multiple cores?

Jan
06-26-2008, 08:21 AM
I just ran my current project. It rendered about 2.1 million triangles (most of them visible, only few back-face culled) at 30 FPS resulting in 63 mtris/sec.

Test was on a Geforce 9600 GT 512 MB, Quadcore, 2 GB RAM, WinXP.

I disabled all my fancy shaders, which didn't increase FPS; this might suggest that at that viewpoint the app is indeed vertex-limited (usually I am fill-rate-limited, due to extensive post-processing effects, but then most of the geometry is culled).

That is easily double the speed that my Radeon X1600 (Mobility) was able to do. I know that this app runs at 1.5 to 2.0 times that speed on a GeForce 8800 Ultra (some kind of Quadro).

I would say that there is some truth in the accusations. Quadro cards are of course a bit faster, and the price difference is by far not justifiable. But HOW much faster they are is the question. An 8800 Ultra is ALWAYS a lot faster than my 9600 GT, be it a Quadro or not. But 8.5 mtris/sec? I'd say: you are doing it wrong ;-)

And the 200 mtris/sec through D3D: I don't believe it.
All that PR fluff has been floating around for ages and everyone should know real-world scenarios can never even get close to half or a third of those claims! It's just completely exaggerated.

IF there are indeed speed limits for OpenGL compared to D3D, they are not as extreme as you suggest. However, IF it is like that, it would still be a real scandal. And honestly, I wouldn't doubt that it is actually done, because it would be a great argument to buy the expensive Quadro/FireGL cards, since OpenGL is usually only used for CAD and similar stuff, where you can tell your clients what to buy. And Doom 3 etc. gets detected by the driver and the speed is unlimited.

Jan.

Nicolai de Haan
06-26-2008, 08:37 AM
Yup, I use an 8800 GTS too. In my terrain algorithm (prototype) I can push hundreds of millions of triangles. 8.5 million triangles can be rendered in real time, depending on the complexity of the shader. In this screenshot (http://www.ndhb78.web.surftown.dk/nesos/screenshots/terrain_diffuse.png) I am doing 5 texture lookups per vertex, although it's only diffuse lit (4 of the texture lookups are for normals).

I also see lower vertex throughput when the driver has Threaded Optimization enabled, so I just set it to Off instead of the default Auto. You can find the option in the NVIDIA Control Panel under 3D settings. It will possibly solve the issue you're experiencing. I have yet to see a benefit from enabling Threaded Optimization. Maybe someone who knows exactly what it is good for could post how it works?!

Madoc
06-26-2008, 09:00 AM
Jan, did you read my last post? That 8.5 million tris/s is obviously a problem with that system; the figures I reported for the 8800 GTS on another system would suggest that no kind of speed limiting is being applied (this was with thread optimisation on auto and off; it didn't make a difference).

When this 8.5 figure came up, a colleague (on the phone) reported that Nvidia apply the same limiting we discovered on ATI cards (and that, unlike ATI cards, they could not be tricked into thinking they were professional versions of the product). I took his word for it and proceeded to post this thread. It was a mistake; I apologise.

Errata: the X1600 from an earlier post was actually an X1900 XTX; we measured a 360% increase in triangle throughput with the modified drivers. Interestingly, a standard Mobility X1400 outperformed it before we made it think it was a FireGL.

CrazyButcher
06-26-2008, 10:40 AM
I used that glsl_pseudo_instancing (ARB path) Nvidia demo and it gives me 350 MTris/second on a GeForce 9600 GT 512 MB, quad-core, 2 GB RAM, WinXP (same rig as Jan). Threaded optimisations were off, though I tested with them on as well; it made no difference for something that simple.

I would think the numbers could be even higher if larger batches were used (the instances have a maximum of a few thousand tris each).

Now obviously that very simple demo isn't really a "real world" application, but it still shows what could be done (the demo has a single light, textured and vertex-coloured surfaces).

I've never had a "CAD" card (FireGL or Quadro), but I've always expected them to have better wireframe drawing and slightly optimised drivers for "immediate mode" rendering and other old-school OpenGL (no VBOs and so on), since that kind of rendering seems to be common in many "serious" applications (just guessing; I can only refer to some medical visualisation stuff we use here).

Lord crc
06-26-2008, 12:41 PM
However, how do we get around this problem without sacrificing use of multiple cores?


AFAIK you can make the driver disable the thread optimizations for your app if you set the affinity of your process to one core just before creating the context. Then, after you've got the context, you set the affinity back to all cores. This way you can do all the multithreading you want and not have the driver screw things up :D
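
Roughly like this, assuming a plain Win32 setup where the pixel format has already been set on the HDC (just a sketch; whether the driver really keys its decision off the affinity at context-creation time is of course up to the driver):

    /* Create the GL context while the process is pinned to one core,
       then restore the original affinity mask afterwards. */
    #include <windows.h>
    #include <GL/gl.h>

    HGLRC createContextOnOneCore(HDC hdc)
    {
        HANDLE    proc = GetCurrentProcess();
        DWORD_PTR procMask = 0, sysMask = 0;
        HGLRC     ctx;

        GetProcessAffinityMask(proc, &procMask, &sysMask);
        SetProcessAffinityMask(proc, 1);        /* pin to the first core */

        ctx = wglCreateContext(hdc);            /* driver sees a single-core process */
        wglMakeCurrent(hdc, ctx);

        SetProcessAffinityMask(proc, procMask); /* give the other cores back */
        return ctx;
    }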

Madoc
06-26-2008, 12:57 PM
Cool, thanks Lord crc.

So, I've had someone test that card again and... I don't have any idea why it was doing 8.5; it now performs as well as the other 8800, though only with thread optimisation off, which makes it 40% faster. Apparently my colleague was looking at some OpenGL benchmark results and, though he claimed otherwise, these must include rendering modes for "pro apps" only.

Right, well, made myself look like a right git, sorry everyone.

However, I'm still confident that what I say about the ATI cards is accurate (and is in fact why I so readily jumped to conclusions). I have quite a lot of data to support this and it would seem to exclude the possibility of something like the thread optimisations.

Jan
06-26-2008, 02:25 PM
I have a quad-core, so I disabled the thread optimization; it didn't make any difference at all. The question is, of course: in which situations DOES it make a difference?

Rob Barris
06-26-2008, 03:11 PM
I have some experience with command-queued drivers in the context of the MT-GL feature on OS X Leopard; it usually doesn't help if you have become GPU-limited (restated: you are generating batches faster than the GPU can consume them).

For a steady state demo re-drawing the same thing over and over, the command queue will either trend towards being empty (CPU not generating enough work) or full (CPU generating too much work).

The win comes when you are not GPU limited: the main thread of your app can spend less wall-clock time specifying the scene's commands, and time is then freed up to do "other things" while the driver mows down all the commands you piled up. In a demo there are usually no "other things" to do, so you just wait (or try to draw more stuff, jam the command queue full, and then you have to wait).

So my perception of the benefit of command-queued / thread-optimized drivers is that they can convert workloads which might have stayed CPU-limited before into ones which can run closer to the GPU limit. Restated more simply, you get free time back on your drawing thread to do other stuff.

On a crowded scene in World of Warcraft, the difference between MTGL on and off can be as high as 2X in frame rate, because there was sufficient "other stuff" going on to translate into a net wall clock time savings. That said there's not too much stopping a developer from implementing their own drawing thread and getting the same kind of scaling.

So just to sum up my gut feeling is that threaded drivers can't easily speed up scenes which are already 100% GPU limited, and some demos may be in that category.

knackered
06-26-2008, 03:17 PM
I have quite a lot of data to support this and it would seem to exclude the possibility of something like the thread optimisations.

I can believe anything of ATI. For example, ATI don't know how to compile an efficient display list, while nvidia are brilliant at it. If nvidia wasn't so good, ATI might look ok.
However, you won't be surprised if everyone treats your "supporting data" with a big lump of scepticism after that tirade.

CatDog
06-26-2008, 03:21 PM
Performance drop with threaded optimization: Here's what I found out.

- Use VBOs for vertex arrays
- Always interleave vertex attributes; never split them across VBOs (e.g. one VBO for normals, another for texcoords)
- Use indexed primitives
- Regardless of which glDrawXX()-command is used: never exceed MAX_INDICES or MAX_VERTICES
- Don't use GL_DOUBLE
- Don't use immediate mode

When you do it like this, the threaded optimization works really nicely! It seems not to be related to using shaders; it's got something to do with the data management/conversion inside the driver. There's a rough sketch of such a layout at the end of this post.

All this means: pass the data in exactly the one format the hardware can handle best. If the driver has to rearrange your data in any way, either performance drops, all cores go to 100%, or everything lags like hell.
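
To make it concrete, the kind of layout I mean looks roughly like this (only a sketch; it assumes GLEW, or a GL 1.5 header, for the buffer-object entry points, and the exact attribute set is of course up to you):

    #include <GL/glew.h>
    #include <stddef.h>    /* offsetof */

    /* One interleaved vertex format: floats only, no doubles, 32 bytes each. */
    typedef struct {
        GLfloat pos[3];
        GLfloat normal[3];
        GLfloat uv[2];
    } Vertex;

    static GLuint vbo, ibo;

    /* Upload everything once into one vertex buffer plus one index buffer. */
    void uploadMesh(const Vertex *verts, GLsizei numVerts,
                    const GLushort *indices, GLsizei numIndices)
    {
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, numVerts * sizeof(Vertex), verts, GL_STATIC_DRAW);

        glGenBuffers(1, &ibo);
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
        glBufferData(GL_ELEMENT_ARRAY_BUFFER, numIndices * sizeof(GLushort), indices, GL_STATIC_DRAW);
    }

    /* Draw as plain indexed GL_TRIANGLES; keep the counts per batch within
       GL_MAX_ELEMENTS_VERTICES / GL_MAX_ELEMENTS_INDICES as noted above. */
    void drawMesh(GLsizei numIndices)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);

        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_NORMAL_ARRAY);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);

        glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (const GLvoid *)offsetof(Vertex, pos));
        glNormalPointer(GL_FLOAT, sizeof(Vertex), (const GLvoid *)offsetof(Vertex, normal));
        glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), (const GLvoid *)offsetof(Vertex, uv));

        glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, 0);
    }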

CatDog

Lord crc
06-26-2008, 04:32 PM
Well, I've experienced "the lag" even when drawing nothing, so it's not just related to how you pass the data...

tanzanite
06-26-2008, 11:10 PM
@Madoc: um ... what the ** have you smoked lately? How did you get those ridiculous numbers? Bold statements need considerable proof!
Please, why do you have to be insulting?
I was expressing my profound disbelief of the numbers given - reasonable disbelief, as it seems now. As you felt it to be merely insulting instead, I clearly failed at that - sorry.

zed
06-27-2008, 02:16 AM
Please, why do you have to be insulting?
This coming from someone who titled the thread
'Backstabbed by skanky IHVs'.
You've gotta laugh :)

Madoc
06-27-2008, 03:19 AM
Uh, I've always taken "skanky" to be quite a light way of saying "knavish"; that's how people use it where I come from. Apparently the dictionary quite disagrees with me. Meeeeh! Wrong again.

Well, I'll add that to my other apology. Until someone deals the final blow and proves me wrong about ATI too, I hope you at least see where I was coming from. It was a rather unfortunate post either way, I guess I just panicked.

tamlin
06-27-2008, 04:00 AM
At 8Mtris/s, I'd panic too. :)

Nicolai de Haan
06-27-2008, 05:11 AM
Performance drop with threaded optimization: Here's what I found out.

- Use VBOs for vertex arrays
- Always interleave vertex attributes; never split them across VBOs (e.g. one VBO for normals, another for texcoords)
- Use indexed primitives
- Regardless of which glDrawXX()-command is used: never exceed MAX_INDICES or MAX_VERTICES
- Don't use GL_DOUBLE
- Don't use immediate mode

Hmm. I just looked through some code and I am already doing all these things.

Yet threaded optimization still seems quirky. When it's set to "Auto/On", the OS reports the CPU as 58-61% busy (dual-core CPU). When it's set to "Off", the OS reports 8-11% busy. It looks like the driver is hogging all the resources of one core (50%) with threaded optimization enabled. In other words, it *seems* the driver is running a thread that is implemented with "busy-waiting"/"polling". This happens regardless of whether anything is rendered; only the front and back buffers are swapped, as far as I can tell.

What API have you used to create and manage your GL context, CatDog?

CatDog
06-27-2008, 12:28 PM
Your description fits exactly what happens when threaded optimization goes wrong. (Lord crc, maybe you are talking about some other problem?)

Hm, if you really do everything on the list, there must be something else. What a pity, I thought I had tracked it down. But somehow I managed to get rid of the lagging.

Ok wait, there's two more things:
- Don't use GL_XXX_STRIP (use GL_TRIANGLES and GL_LINES instead)
- Optimize batches using the Forsyth method

In fact, the problem was gone after I made some HUGE changes to the entire VBO layout, as described in this thread (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=233320#Post233320). So I don't know exactly what helped.

I'm using plain Windows API for the setup and everything. No other libs involved here.

CatDog


*edit* Btw, the current version of Google Earth also suffers from this. Here, after some varying period of time, one of the cores goes to 100% and stays there; it's nvogl.dll spinning away indefinitely. So we are not alone.

Seth Hoffert
06-27-2008, 01:33 PM
It sounds ridiculous to have to avoid and conform to so many things just to get it to work right... Is it safe to say that it's best to just disable it?

Seth Hoffert
06-27-2008, 01:38 PM
In other words, it *seems* the driver is running a thread that is implemented with "busy-waiting"/"polling". This happens regardless of whether anything is rendered, only the front and back buffer is swapped as far as I can tell.

I've noticed something similar (on Linux, anyway) when using glFinish() + vsync. The driver spinlocks in glFinish until it is ready to perform the swap. If I use glFlush instead, it passes right through to the glXSwapBuffers call (which performs an implicit flush anyway...), which does NOT spinlock.
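
Roughly, the two end-of-frame paths I'm comparing look like this (dpy and win here are just placeholders for an existing GLX setup):

    #include <GL/gl.h>
    #include <GL/glx.h>

    /* With vsync on, this variant burns a core spinning inside glFinish
       until the driver is ready to swap. */
    void endFrameFinish(Display *dpy, Window win)
    {
        glFinish();
        glXSwapBuffers(dpy, win);
    }

    /* This variant just queues the work; glXSwapBuffers performs an
       implicit flush anyway, and no spinlock shows up for me. */
    void endFrameFlush(Display *dpy, Window win)
    {
        glFlush();
        glXSwapBuffers(dpy, win);
    }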

CatDog
06-27-2008, 01:43 PM
But isn't that what glFinish() is supposed to do (not returning until all GL commands are *finished*)?

CatDog

Seth Hoffert
06-27-2008, 01:51 PM
You're right. But shouldn't it block, or at least sleep, instead of spinning on the CPU? I've found that by loading in my own custom sched_yield(), which performs a very small sleep, the CPU usage drops to 0%. :D I suppose the tight loop they used was to ensure MAXIMAL performance...
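
For the curious, the override can be as simple as something like this (build it as a shared object and LD_PRELOAD it; the 100 microsecond sleep is an arbitrary choice):

    /* yield_sleep.c - interpose sched_yield() so the driver's busy-wait
       sleeps briefly instead of spinning.
       Build: gcc -shared -fPIC -o yield_sleep.so yield_sleep.c
       Run:   LD_PRELOAD=./yield_sleep.so ./myapp */
    #define _GNU_SOURCE
    #include <time.h>

    int sched_yield(void)
    {
        struct timespec ts = { 0, 100 * 1000 };   /* 100 microseconds */
        nanosleep(&ts, NULL);
        return 0;
    }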

CatDog
07-04-2008, 05:01 AM
Well, I don't know if anybody is interested, but I found another one: two-sided lighting is evil.

- Use VBOs for vertex arrays
- Always interleave vertex attributes; never split them across VBOs (e.g. one VBO for normals, another for texcoords)
- Use indexed primitives
- Regardless of which glDrawXX()-command is used: never exceed MAX_INDICES or MAX_VERTICES
- Don't use GL_DOUBLE
- Don't use immediate mode
- Don't use GL_XXX_STRIP (use GL_TRIANGLES and GL_LINES instead)
- Optimize batches using the Forsyth method
- Don't use two-sided lighting

To make it short:
- Don't use OpenGL at all, except for pushing your preformatted and optimized vertex arrays over the bus. Do all the driver's work yourself.

CatDog

NeARAZ
07-04-2008, 05:56 AM
To make it short:
- Don't use OpenGL at all, except for pushing your preformatted and optimized vertex arrays over the bus. Do all driver work by yourself.
In other words, "know your hardware". That doesn't look like shocking news to me... If you want it to be fast, know what the hardware is and what happens in your code (and the code you call into). Sure, it's somewhat messy in OpenGL, with a myriad of ways of doing the same thing (where usually only one of them is optimal)...

Seth Hoffert
07-04-2008, 07:41 AM
Or, disable that threading stuff in the driver. :)

I've also found two-sided lighting to be very slow, so instead I branch on gl_FrontFacing in the fragment shader, which runs much more efficiently. Various other things are unusable on my card as well, such as changing the polygon mode to lines...
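
The flip is something along these lines (a trimmed-down sketch, not my actual shader; GLEW is assumed for the GL 2.0 entry points, and the matching vertex shader just has to write the normal varying):

    #include <GL/glew.h>

    /* Single-sided lighting with a per-fragment normal flip on gl_FrontFacing,
       instead of enabling two-sided lighting in the fixed-function state. */
    static const char *fragSrc =
        "varying vec3 normal;\n"
        "void main()\n"
        "{\n"
        "    vec3 n = normalize(gl_FrontFacing ? normal : -normal);\n"
        "    vec3 l = normalize(gl_LightSource[0].position.xyz);  // directional\n"
        "    float diff = max(dot(n, l), 0.0);\n"
        "    gl_FragColor = vec4(vec3(diff), 1.0);\n"
        "}\n";

    GLuint buildFragShader(void)
    {
        GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(fs, 1, &fragSrc, NULL);
        glCompileShader(fs);
        return fs;    /* attach to a program object, link and use as usual */
    }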