multithreaded OpenGL WTF?



def
12-07-2006, 11:47 PM
Am I the only one irritated by this so-called "Multithreaded OpenGL" Apple is talking about? This awesome new "technology", of course only available on Mac OSX. :confused:
It seems Steve Jobs has invented a new breed of graphics cards which are capable of doing this, while still using the manufacturers' hardware reference designs...
;)

ZbuffeR
12-08-2006, 12:28 AM
And no more info on the provided link. Cheap PR stunt from Blizzard and Apple :p

OneSadCookie
12-08-2006, 12:48 AM
There is information available... like this developer technote: http://developer.apple.com/technotes/tn2006/tn2085.html

Basically, it offloads the CPU work associated with making an OpenGL call (allocation of driver resources, pixel format conversion, etc.) to a secondary thread. Obviously, that means that the application as a whole is doing more work (thread synchronization and data copying + all the work it was already doing), but if the second CPU was otherwise under-utilized, the quicker return of the OpenGL call on the main thread can improve performance.

If your application was already CPU-bound on all CPUs, you will lose out with the multithreaded GL. That's why it's opt-in; applications have to explicitly enable it if they feel they will benefit from it. Basically, it's an easy way for a developer to get a multithreaded application out of a single-threaded one; if you've already done the work of threading your app well you'll likely gain nothing from it.

It's pretty buggy in 10.4.8; WoW must be doing things pretty strictly on whatever clean path Apple has provided to not be crashing.
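
For anyone curious what the opt-in looks like in code, the technote above boils it down to a single CGL enable on the current context. A minimal sketch along those lines (the helper function name is just a placeholder):

#include <OpenGL/OpenGL.h>

/* Opt the current context into the multithreaded GL engine, as described
   in TN2085. If the enable fails (e.g. on a single-core machine), the
   context simply keeps running single-threaded. */
static int enable_multithreaded_gl(void)
{
    CGLContextObj ctx = CGLGetCurrentContext();
    CGLError err = CGLEnable(ctx, kCGLCEMPEngine);
    return (err == kCGLNoError) ? 0 : -1;
}

The idea is to call this once at context-setup time; everything after that is unchanged GL code.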

ZbuffeR
12-08-2006, 12:55 AM
Thanks for the relevant link and the clarification.

def
12-08-2006, 01:36 AM
Thanks for the info. That is exactly what I was expecting...
This has nothing to do with OpenGL, however. It's just a framework for multithreaded applications with support for OpenGL in a single thread. It's nothing new at all!
But maybe it is for mac users. I am not familiar with the OSX API.

"Since part of the OpenGL framework's workload has been parallelized, OpenGL calls will return control to your application sooner and more CPU cycles will be available for its main thread."

Parallelism means execution in parallel, not waiting for execution in parallel, which is what this actually is.

I think the phrasing is misleading (intentionally so) and agree with ZbuffeR, just a cheap PR trick.

Jan
12-08-2006, 03:23 AM
Don't recent drivers from nVidia and ATI do this already? I think I read somewhere that they already utilize multithreading, but I might be wrong.

Jan.

gybe
12-08-2006, 06:15 AM
I did some simple tests yesterday on my X1900 and the results really surprised me. I attached two threads of the same process to the same hDC (I called wglCreateContext in both threads) and it worked. The geometry rendered from both threads is combined in the same back buffer. My test application is very simple (I just render two quads without depth test), so I'm wondering whether it will always work or I was just lucky.

In the MSDN documentation they say:

"An application can perform multithread drawing by making different rendering contexts current to different threads, supplying each thread with its own rendering context and device context."

In my test app I have different rendering contexts, but the same device context for both threads, so I don't follow the documentation. There's a lot of information on the internet about the many new effects we can do with programmable hardware, but only a little about multithreaded rendering.
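
A rough sketch of the arrangement MSDN describes (each thread with its own device context and its own rendering context) might look like the following; render_thread and the hwnd parameter are just placeholder names, and it assumes the window's pixel format has already been set by the app:

#include <windows.h>
#include <GL/gl.h>

/* Sketch of the MSDN-style setup: each worker thread gets its own DC and
   its own rendering context for the same window. */
static DWORD WINAPI render_thread(LPVOID param)
{
    HWND hwnd = (HWND)param;
    HDC hdc = GetDC(hwnd);                /* per-thread device context    */
    HGLRC hglrc = wglCreateContext(hdc);  /* per-thread rendering context */
    wglMakeCurrent(hdc, hglrc);

    /* ... issue the GL calls owned by this thread here ... */

    wglMakeCurrent(NULL, NULL);
    wglDeleteContext(hglrc);
    ReleaseDC(hwnd, hdc);
    return 0;
}

/* Launched from the main thread, e.g.:
   CreateThread(NULL, 0, render_thread, (LPVOID)hwnd, 0, NULL); */

Whether sharing a single hDC across threads keeps working, as in your test, is exactly the part the documentation leaves open, so the per-thread GetDC seems the safer route.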

OneSadCookie
12-08-2006, 12:12 PM
Originally posted by def:
This has nothing to do with OpenGL, however. It's just a framework for multithreaded applications with support for OpenGL in a single thread. It's nothing new at all!

Um, I'm not certain, but I think you may have misinterpreted. When you opt in, every GL call submits work to another thread. It has nothing to do with anything *not* GL, and indeed, couldn't work with most other APIs where synchronous execution is required.



"Since part of the OpenGL framework's workload has been parallelized, OpenGL calls will return control to your application sooner and more CPU cycles will be available for its main thread."
Parallelism means execution in parallel, not waiting for execution in parallel, which is what this actually is.

There is no waiting involved... you make a GL call, and it starts executing asynchronously on the other thread some time later. You wait only for the work required to transfer the call to the other thread, not for any of the work the call itself does.


I think the phrasing is misleading (intentionally so) and agree with ZbuffeR, just a cheap PR trick.

I think the phrasing succinctly captures what it's doing, and if an 80% performance improvement for WoW is a "cheap PR trick", I'll take a cheap PR trick any day of the week ;)

V-man
12-08-2006, 10:33 PM
If your GL calls are lightweight, meaning the command just needs to get to the GPU, then there is no need for a second thread.
I don't know why, but other libs like OpenAL seem to create another thread.

What evidence is there that there is a benefit?

Jan
12-09-2006, 12:04 AM
AFAIK OpenAL does its audio processing in a second thread. That makes it independent of the main app, so that it can process audio data at the necessary update rate, no matter how fast the main app runs.

I think what the threads are used for is the big difference, because the audio processing actually needs to continue across the next frames.

So I don't think one can compare the two use cases.

Jan.

Adrian
12-09-2006, 04:09 AM
Originally posted by Jan:
Don't recent drivers from nVidia and ATI do this already? I think I read somewhere that they already utilize multithreading, but I might be wrong.

Jan.

I've heard this too. It would be good if NVidia/ATI could tell us how it compares to the Apple OpenGL multithreading, and, if it is significantly different, when we will see it on the PC (if ever). If it's an OS-related feature, perhaps Vista will allow NVidia/ATI to implement something similar? Hopefully this will all be sorted out by the time I get a quad core :)

V-man
12-09-2006, 07:33 AM
I don't know how audio works exactly, but isn't it like GL? Make a sound BO and tell the system to play it, the system being the sound card. Or maybe the sound mixing is done in software, and that's why a thread is needed.

I remember someone said he had a dual-core 64-bit CPU and WinXP 64. nVidia spawned a thread that used 100% of the second CPU even when he wasn't doing anything.

Korval
12-09-2006, 08:02 AM
What evidence is there that there is a benefit?

Evidence? An 80% performance improvement in WoW isn't good enough for you?

Frogblast
12-09-2006, 02:00 PM
Although it does change the performance cost of various GL API calls, other behavior should not change. It does not allow an app to submit commands to the same GL context from multiple threads at once. An app is, of course, free to use other threads for its own work, as long as each context is only used from a single thread at a time.

The performance boost comes from the state validation within GL occurring in another thread, allowing it to run in parallel with your code. GL calls effectively return earlier, allowing your code to continue.

OneSadCookie
12-09-2006, 03:14 PM
hmm, does it mean that for code like:


glDrawElements(.........);
glGetError();
sleep(5);
glGetError();

The first glGetError can return NO_ERROR because it hasn't seen the error generated by DrawElements, and the second glGetError can return an error code?

Frogblast
12-09-2006, 03:16 PM
Nope, it'll still work. However, the first glGetError will likely be expensive, as it must wait for any outstanding commands to complete. It isn't a full glFinish (you don't need to wait for the HW, only for the software-side work), but it will hurt.

Basically, you should be able to just flip the switch on for any app, and the results should be unchanged from a correctness point of view*. The app might actually be slower in the case that OneSadCookie describes (hence the need for many apps to be modified to avoid constructs like this), but it should still be correct.

If you've observed otherwise, you really need to file a radar.

*Assuming the app is correctly written. If an app relies on undefined behavior, then 'undefined' very well may have changed. This is especially true for apps that may not get their fencing quite right when using extensions such as APPLE_flush_buffer_range. It is even more true for apps with latent threading errors (i.e., trying to re-enter a GL context), even if you were getting lucky before.

def
12-10-2006, 10:52 PM
Hmm, I always thought the underlying hardware was responsible for OpenGL being single threaded...
If "Multithreaded OpenGL" means I can do CPU work 80% faster (WoW), that's great, but OpenGL is still the same as before.
Who is saying that actual raw OpenGL performance is getting better through Multithreaded OpenGL? Raise your hands, please. ;-)

OneSadCookie
12-11-2006, 12:09 AM
You're right, of course, multithreaded OpenGL won't let you draw more polygons with more complex shaders or anything. All it does is spread the CPU load (and in fact, increase it slightly).

Still, if that makes it easier to make a fast game, who's to complain? Nobody ever claimed that this was magic ;)

andras
12-11-2006, 04:26 PM
Well, just for the record, there's a multithreading switch in the latest nVidia drivers for WinXP (only present if you are running on a multi-core CPU), but it actually made our app run a lot slower, so we had to turn it off! YMMV..

EDIT: Correction! After reading the article, I checked whether we had any glGetError() calls in the code, and we did, so I #ifdef'd them out, and now our app runs okay with the multithreaded optimization. I say okay because it's not any faster (our app is already threaded, and is not really CPU limited), but it makes the framerate a bit jerky (even when locked to vsync!), so I still keep it turned off.
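
For what it's worth, instead of #ifdef-ing each call site, a single debug-only macro does the same job; a sketch (the macro and DEBUG_GL define are just placeholder names):

#include <stdio.h>
#include <GL/gl.h>

/* Debug-only GL error check: in release builds it compiles away entirely,
   so the driver's worker thread is never forced to drain its queue just to
   answer glGetError. */
#ifdef DEBUG_GL
#define CHECK_GL_ERROR()                                            \
    do {                                                            \
        GLenum e = glGetError();                                    \
        if (e != GL_NO_ERROR)                                       \
            fprintf(stderr, "GL error 0x%04X at %s:%d\n",           \
                    (unsigned)e, __FILE__, __LINE__);               \
    } while (0)
#else
#define CHECK_GL_ERROR() ((void)0)
#endif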

Hampel
12-11-2006, 10:39 PM
There is also a flag "Generate/Log OpenGL errors" (or something similar; written from memory) in the driver settings (just above the MT-flag). What does it generate and where can I find the logs?

Rob Barris
12-12-2006, 01:59 PM
Originally posted by def:
Hmm, I always thought the underlying hardware was responsible for OpenGL being single threaded...
If "Multithreaded OpenGL" means I can do CPU work 80% faster (WoW), that's great, but OpenGL is still the same as before.
Who is saying that actual raw OpenGL performance is getting better through Multithreaded OpenGL? Raise your hands, please. ;-)

Hand raised.

World of Warcraft can be up to twice as fast on OS X with the MT-GL mode on. No new threads on the application side.

Edit: But MT-GL can't raise frame rates if you are running at the GPU limit already. Basically it raises the odds for any given scene that you will be able to keep the GPU running at its limits.

gdewan
12-13-2006, 05:00 AM
Originally posted by andras:
Well, just for the record, there's a multithreading switch in the latest nVidia drivers for WinXP (only present if you are running on a multi-core CPU), but it actually made our app run a lot slower, so we had to turn it off! YMMV..

... Where is this set? I have an 8800 GTX here on a dual-core WinXP machine and I can't find the setting anywhere.

Stephen_H
12-13-2006, 04:36 PM
Could someone explain what parts of the driver could be multithreaded that would increase WoW's framerate by 80%?

I personally don't see how the driver could take more than 5 to 10% of CPU time in WoW or any commercial game. AFAIK the driver mostly does verification and checking and then sticks the information/GL commands onto a FIFO for the card to pull.

I can see that multithreading might help, but only in certain specific situations. For example, if you are supplying RGBA uint8 textures and Nvidia drivers prefer BGRA uint8, the driver might do a conversion on the CPU to BGRA.

There seems to be something I'm not understanding here... does the CPU actually have to waste cycles feeding data from the FIFO to the card?

V-man
12-13-2006, 04:38 PM
Originally posted by Korval:

What evidence is there that there is a benefit?

Evidence? An 80% performance improvement in WoW isn't good enough for you?

80% increase in FPS? I think that is really large and it would be good to know the reason. Since lots of games are said to be GPU limited, and many other games CPU limited because of AI and physics, why is WoW GL-driver limited?

It makes me think something is not well coded.

Mars_999
12-13-2006, 06:01 PM
I can say this: Rob Barris knows his stuff... so when he says it's 2x as fast, you can bet on it.

Rob Barris
12-13-2006, 07:36 PM
So, picture a few different programs, all single threaded.

Program A is totally application CPU-bound. Say a fractal generator cranking out texture animations to play back on some spinning quad: 95% application work, 5% driver work. MT-GL won't help (well, it might help 5% by getting that driver work to run concurrently instead of in the main thread). Restated, the app was not being held back significantly by the driver work taking place.

Program B is the opposite of A: how about a fancy Pong game using OpenGL, except it doesn't do very good state control, switches shaders too often, and basically does a lot of stuff incurring CPU work in the driver. Say it has the opposite ratio of work - 5% application, 95% driver. MT-GL won't help much here either; a 5% benefit again.

http://en.wikipedia.org/wiki/Amdahl's_law

Now, consider Program C: say its work balance can vary drastically depending on what is going on - it might be 80% app and 20% driver, or in some really rough situations it might be 50% app and 50% driver. Scene dependent.

Program C's benefit from MT-GL will therefore also vary between 20 and 50% reduction in clock time assuming the application thread avoids making any calls that result in synchronization between the app side and the driver side (queries, readbacks, a few other cases).

Thus the "up to 2X faster"... in some weird cases maybe even a little better than 2X when you have less cache contention going on between app-land and driver-land. "2X faster than what?" -> in comparison with the same scene rendered with MT-GL off.

I haven't seen any claim from Apple saying that this technique is novel or unique to OS X.

If you have WoW on OS X (Intel Mac dual core) you can flip the MT stuff on and off in-game:

/console glfaster X

where X = 0, 1, or 2

0 = off
1 = MT on but with a bit of frame throttling
2 = MT on, no throttling, some mouse lag can occur.

knackered
12-13-2006, 11:01 PM
I don't get this, doesn't everyone have a dispatching draw thread anyway?
App/cull/draw, anyone?
The draw thread is just issuing GL commands from a queue of draw messages... or is that just me and SGI?

Jan
12-14-2006, 01:44 AM
Maybe they coded WoW completely in immediate mode...

Knackered: I don't think many people are doing that. On single-core CPUs it doesn't make sense (and that's what we were dealing with until now). SGI, on the other hand, has certainly had multi-processor machines for a long time, haven't they?

Jan.

knackered
12-14-2006, 03:59 AM
Well I suppose we've been delivering systems targeted at multi-processor x86 systems for literally years now. But I still can't understand why someone would deliberately engineer a system that couldn't be easily separated out into threads. You're talking a simple FIFO...pushing a few bytes into a queue for every draw op. It's just proper modular programming.
For me, this sounds like the driver being forced into doing optimizations that the app should be doing - if it were not for the games industry's addiction to hiring cheap graduates with no programming skill or experience. No offence, Rob.
Thank god it's an optional feature, or my software would be paying a heavy price in doubled CPU synchronization for other people's laziness/ineptitude.

Rob Barris
12-14-2006, 07:03 AM
WoW is not coded in immediate mode; it uses VBOs, PBOs, pbuffers or FBOs, DrawElements, ARB vertex/pixel shaders...

The idea of rolling our own "rendering thread" has come up before, but the subtleties involving two-way communication between low-level and high-level code - especially for titles where assets are being dynamically loaded using async I/O at almost all times - made it a tough sell.

Also, consider if you had three different games by three different teams with three unique engines. Rewriting each engine to have a rendering thread means duplication of work on the developer side. When the command queuing and parallel processing are provided as a standard feature by the implementation, that's less work for the developers to have to worry about.

I really don't see anything wrong with an OS level feature that allows single-threaded renderers to still utilize multiple cores and run faster. If some number of developers find that they can make a few changes and achieve the FPS benefit that we got with WoW, I'm not sure who should be bothered by that outcome.

On the topic of perceived extra CPU synchronization, how would an OS provided MT implementation have any more or less synchronization overhead than a thread you authored yourself?

knackered
12-14-2006, 07:35 AM
Depending on the subtleties of a particular driver implementation to determine whether your application is interactive or not (say, 80% slower than 60fps), when you have an alternative that will guarantee it, doesn't strike me as good sense. But if it works for you, and I'm sure you've done your field work, then all's well with the world... you're a braver man than I.

On the topic of perceived extra CPU synchronization, how would an OS provided MT implementation have any more or less synchronization overhead than a thread you authored yourself?

If you're referring to my last sentence, I was saying that if I detect I'm running on multiple cores, I'll put my drawer in another thread, which has the very small (but real) overhead of a guarded message queue. Now the driver kicks in and decides to fork off its own thread for dispatching GL commands, with the very small (but real) overhead of a guarded message queue. Hence double the sync overhead for every GL command.

Rob Barris
12-14-2006, 08:16 AM
Originally posted by knackered:
Depending on the subtleties of a particular driver implementation to determine whether your application is interactive or not (say, 80% slower than 60fps), when you have an alternative that will guarantee it, doesn't strike me as good sense. But if it works for you, and I'm sure you've done your field work, then all's well with the world... you're a braver man than I.

On the topic of perceived extra CPU synchronization, how would an OS provided MT implementation have any more or less synchronization overhead than a thread you authored yourself?

If you're referring to my last sentence, I was saying that if I detect I'm running on multiple cores, I'll put my drawer in another thread, which has the very small (but real) overhead of a guarded message queue. Now the driver kicks in and decides to fork off its own thread for dispatching GL commands, with the very small (but real) overhead of a guarded message queue. Hence double the sync overhead for every GL command.

You've described a good rationale for making the new behavior opt-in, which is what Apple GL does. We're glad the old single-threaded mode is still available, since it can make some debugging and profiling tasks easier, and it's in fact the only mode available on single-core CPUs.

Edit: We'd be happy if every one of our users was getting 60 FPS, but due to the wide spread of machines our games run on, it's not uncommon to find users playing at 5-10 FPS and other users playing at 120+ FPS. The attribute of "interactivity" is not a Boolean.

Stephen_H
12-14-2006, 10:07 AM
If you care to elaborate, I'm curious if you did any profiling/testing to see which gl operations were using up most of your driver's CPU time? Also curious, are OSX drivers any better/worse than say Nvidia's GL driver for chomping up CPU time?

knackered
12-14-2006, 11:33 AM
gizza job, rob.

Rob Barris
12-14-2006, 01:06 PM
Originally posted by Stephen_H:
If you care to elaborate, I'm curious if you did any profiling/testing to see which gl operations were using up most of your driver's CPU time? Also curious, are OSX drivers any better/worse than say Nvidia's GL driver for chomping up CPU time?

Yes, we did quite a bit (of profiling). We didn't do any comparative benchmarking against PC GL; closing the gap with Direct3D/WinXP when tested on the same hardware was higher priority.

PC GL presently lacks some of the extensions we're now counting on with OS X, such as flush_buffer_range, so that would have been a bit of a skewed test as well.

Jan
12-14-2006, 01:39 PM
Rob, WoW being written in immediate mode was a joke. I didn't want to offend anyone. I just wanted to point out that apps using immediate mode will most certainly benefit very much from a multithreaded driver, since that uses even more CPU resources.

Jan.

knackered
12-14-2006, 09:35 PM
Originally posted by Jan:
Maybe they coded WoW completely in immediate mode...

Sounded unequivocal and unambiguous to me. ;)

Jan
12-15-2006, 02:01 AM
Nice. I learned two new words, and still don't know how you actually meant that.

Rob Barris
12-15-2006, 02:30 AM
Originally posted by Jan:
Rob, WoW being written in immediate mode was a joke. I didn't want to offend anyone. I just wanted to point out that apps using immediate mode will most certainly benefit very much from a multithreaded driver, since that uses even more CPU resources.

Jan.

(wasn't offended, really!)

Good point about immediate mode apps, though - I can see how the much higher API-call frequency could potentially give the command queuing mechanism a real workout.

Though, a tuned implementation could batch up everything between a glBegin and glEnd and then submit that to the command queue as a single blob; I have no idea which drivers might do this already.

Rob Barris
12-15-2006, 02:37 AM
Originally posted by knackered:
I was saying that if I detect I'm running on multiple cores, I'll put my drawer in another thread, which has the very small (but real) overhead of a guarded message queue.
It occurred to me that there are ways to implement one-writer / one-reader FIFOs without explicit mutexing on every queue transaction; with a ring buffer and atomic fetch-and-adds you can do it in a lock-free style. You might already be doing that?

If mutexes are the only way to go on a given platform, there's also a way to set something like this up by incurring a little bit of command batching, and only using the lock to obtain larger batch buffers from a pool... you can amortize the locking over a larger number of transactions that way, trading latency for speed.

It might get a little twisty with variable sized commands of course.
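
For the record, a one-writer/one-reader ring of the kind Rob describes can get by with plain atomic loads and stores of the two indices (fetch-and-add isn't strictly required when only one thread advances each side). A sketch, using C11 atomics purely for illustration, with fixed-size commands to keep it short and all names made up:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024  /* must be a power of two */

typedef struct { uint32_t opcode; uint32_t args[3]; } GLCommand;

typedef struct {
    GLCommand slots[RING_SIZE];
    _Atomic uint32_t head;   /* advanced only by the producer (app thread)    */
    _Atomic uint32_t tail;   /* advanced only by the consumer (worker thread) */
} CommandRing;

/* App thread: enqueue one command; returns false if the ring is full. */
static bool ring_push(CommandRing *r, const GLCommand *cmd)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;                                   /* full */
    r->slots[head & (RING_SIZE - 1)] = *cmd;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Worker thread: dequeue one command; returns false if the ring is empty. */
static bool ring_pop(CommandRing *r, GLCommand *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;                                   /* empty */
    *out = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

The power-of-two size lets the wrap be a simple mask, and since head only ever grows on the producer side and tail on the consumer side, the head - tail comparison stays meaningful even after the counters wrap.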

Jan
12-15-2006, 06:23 AM
Actually, right now I am working on a small editor application, which renders using immediate mode, since it is so much easier and there isn't much graphics stuff going on.

I was able to test it on an AMD X2 1.6 GHz laptop with a GeForce 7300 Go (Windows XP).

I could multithread the main worker loop using OpenMP. It decreased speed drastically, although both cores had maximum workload. After disabling multithreading in my app, I found that speed was much better, but still both cores were fully utilized.

I didn't find any option in the driver to turn it on/off, as mentioned in this thread somewhere, but I guess the graphics driver already uses its own thread.

The driver is a few months old, so I assume it is one of the first versions doing that. I also got the impression that it is quite inefficient, since it used the whole processing power of one core even though the graphics were very simple (a few hundred textured billboards, no shaders, no nothing). That's only my impression, though; I could be wrong entirely.

Jan.

andras
12-16-2006, 08:58 AM
Originally posted by gdewan:
Where is this set? I have an 8800 GTX here on a dual-core WinXP machine and I can't find the setting anywhere.

It's in the NVidia control panel. It's called "Threaded optimization". I think you have to enable the advanced view to see it.

gdewan
12-17-2006, 07:56 AM
Originally posted by andras:

Originally posted by gdewan:
Where is this set? I have an 8800 GTX here on a dual-core WinXP machine and I can't find the setting anywhere.

It's in the NVidia control panel. It's called "Threaded optimization". I think you have to enable the advanced view to see it.

It's not in 97.44 anywhere. And the standard view is disabled, so I can't take it out of advanced view.

CatAtWork
12-17-2006, 11:02 AM
3D Settings
Manage 3D Settings
Threaded optimization

gdewan
12-17-2006, 04:57 PM
Originally posted by CatAtWork:
3D Settings
Manage 3D Settings
Threaded optimization

Well, my options there are (in order):

Anisotropic filtering
Antialiasing - Gamma correction
Antialiasing - Mode
Antialiasing - Setting
Antialiasing - Transparency
Conformant texture clamp
Extension limit
Force mipmaps
Multi-display/mixed GPU acceleration
Texture filtering - Anisotropic sample optimiz...
Texture filtering - Negative LOD bias
Texture filtering - Quality
Texture filtering - Trilinear optimization
Triple buffering
Vertical sync

CatAtWork
12-17-2006, 08:15 PM
Well, I'll check at work, because this is a Quadro FX 1500M. The 97.44 driver set on the 8800GTXs is very buggy, however. I would drop back to 97.02. I saw a ~200 FPS app drop to 50FPS when I upgraded. Some forum is claiming that the 97.44s severely underclock the memory, and although I can't verify that specific claim, my benchmarks show something is amiss.

gdewan
12-18-2006, 04:14 AM
97.02 also didn't show it. I haven't noticed any performance difference between 97.02 and 97.44.

CatAtWork
12-18-2006, 09:47 AM
I see it under 97.02 on my 8800GTX. Do you have a multi-core or -cpu system?

gdewan
12-18-2006, 10:16 AM
Yes, it's a multi-core system. Doing some digging on Google, it seems like the option may only be available on GeForce 7 series cards.

CatAtWork
12-18-2006, 11:11 AM
The option is available on my 8 series cards.

andras
12-18-2006, 12:36 PM
Are you sure that your BIOS and OS recognize your CPU correctly?

V-man
12-18-2006, 02:16 PM
I understand the cases of Programs A and B.
In the case of Program B, if the driver does state control, then it will benefit.
So a separate thread that handles all that work will help. I think drivers don't do state checks.

In the case of C, it's not clear what is going on.
What is happening in the driver that makes it use 20%-50%?
Can the program be fixed?

The IHVs should give advice to whoever is making WoW, because an 80% increase is insanely high.
If it's 5%, okay.
What is the driver doing? Culling?



Originally posted by Rob Barris:

Now, consider Program C: say its work balance can vary drastically depending on what is going on - it might be 80% app and 20% driver, or in some really rough situations it might be 50% app and 50% driver. Scene dependent.

Program C's benefit from MT-GL will therefore also vary between 20 and 50% reduction in clock time assuming the application thread avoids making any calls that result in synchronization between the app side and the driver side (queries, readbacks, a few other cases).

Thus the "up to 2X faster"... in some weird cases maybe even a little better than 2X when you have less cache contention going on between app-land and driver-land. "2X faster than what?" -> in comparison with the same scene rendered with MT-GL off.

I haven't seen any claim from Apple saying that this technique is novel or unique to OS X.

If you have WoW on OS X (Intel Mac dual core) you can flip the MT stuff on and off in-game:

/console glfaster X

where X = 0, 1, or 2

0 = off
1 = MT on but with a bit of frame throttling
2 = MT on, no throttling, some mouse lag can occur.