Control for busy-waits in blocking commands



l_belev
07-13-2011, 02:43 AM
The problem: as it is now (at least on NVIDIA), the driver implements glClientWaitSync with a busy-wait instead of releasing the CPU.
I know that releasing the CPU requires a context switch, which is a heavy operation with higher latency, but sometimes it is really needed.

For example, in one of my applications I need a "waiter" thread whose sole purpose is to block on fences and raise flags when fences are passed, while consuming as little CPU as possible,
whereas various other threads do hard work on the CPU (the OpenGL drawing is done by another thread with a shared context).
The working threads need all the available CPU, and wasting it on busy-waiting is extremely unwanted; it degrades overall performance a great deal.
In contrast, the higher latency of glClientWaitSync, if it blocked instead of busy-waiting, would be completely acceptable.
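
For concreteness, here is a minimal sketch of the waiter-thread pattern (the queue and flag helpers are illustrative names, not real APIs; assumes GL 3.2-level headers and a shared context current on this thread):

#include <GL/glcorearb.h> // any GL 3.2-level header/loader will do

// Hypothetical helpers, illustrative only.
struct PendingFence { GLsync sync; int id; };
PendingFence dequeuePendingFence(); // blocks until the render thread enqueues a fence
void raiseFlagForWorkers(int id);   // whatever cross-thread/process signal the workers watch

void waiterThreadLoop() // a shared GL context is current on this thread
{
    for (;;)
    {
        PendingFence pf = dequeuePendingFence();
        GLenum r;
        do { // this is the call that should sleep rather than spin
            r = glClientWaitSync(pf.sync, GL_SYNC_FLUSH_COMMANDS_BIT,
                                 1000000000ull); // 1 s; re-issued until signaled
        } while (r == GL_TIMEOUT_EXPIRED);
        if (r != GL_WAIT_FAILED)
            raiseFlagForWorkers(pf.id);
        glDeleteSync(pf.sync);
    }
}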

My suggestion: please define a new flag for glClientWaitSync that forces the driver to block the thread (release the CPU) instead of busy-waiting.
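
To make this concrete, usage might look like the following (GL_SYNC_CLIENT_BLOCK_BIT is an invented name for the proposed flag, not an existing GL enum):

// Hypothetical usage of the proposed flag.
GLenum r = glClientWaitSync(sync,
                            GL_SYNC_FLUSH_COMMANDS_BIT | GL_SYNC_CLIENT_BLOCK_BIT,
                            timeoutNs);
// With the flag set, the driver would have to put the thread to sleep until
// the fence signals or the timeout expires, never busy-wait.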

It would also appear that the driver does other internal busy-waits. This can be seen from the abnormal CPU consumption of internal driver
threads for no apparent reason.
Again, there are cases when the latency of the wait operations is less important than the CPU utilization.
Please provide a means for the application to express its preference between lower latency and lower CPU wastage by the driver. Maybe use the OpenGL hint mechanism.

Alfonse Reinheart
07-13-2011, 03:20 AM
For example, in one of my applications I need a "waiter" thread whose sole purpose is to block on fences and raise flags when fences are passed, while consuming as little CPU as possible

That doesn't sound like a good idea. Wouldn't it make more sense to just test the fences when you might be interested in seeing if they're done? Testing a fence takes very little CPU, so I don't see the problem. Indeed, you could implement your "waiter" thread exactly that way: just test the fence, and not block the CPU if it isn't finished yet.

l_belev
07-13-2011, 05:38 AM
That doesn't sound like a good idea. Wouldn't it make more sense to just test the fences when you might be interested in seeing if they're done? Testing a fence takes very little CPU, so I don't see the problem. Indeed, you could implement your "waiter" thread exactly that way: just test the fence, and not block the CPU if it isn't finished yet.

The threads that need the fence info are not OpenGL threads - they do not and can NOT have their own current contexts, and may even be located in a separate process. Please don't assume other people don't know what they are doing.

Anyway, I only gave an example; the problem is more general. Please don't focus on my concrete example and try to find a workaround for it, that's not the point.

l_belev
07-13-2011, 06:17 AM
Alfonse, there was a guy named Korval on these mailing lists who took his greatest pleasure in endless, pointless, mindless carping and in annoying people.
Your behavior very closely resembles his, and for this reason I will be ignoring you from now on.

aqnuep
07-13-2011, 08:36 AM
This is not something that should be in the specification. It's the implementation's responsibility to choose the most efficient way of implementing glClientWaitSync.

When have you ever seen in the GL spec something like "The GL implementation must not use busy-wait for the implementation of the glFinish command"?

I think you should rather write about your problem to NVIDIA.

I would also like to point out that blocking is not necessarily as expensive compared to busy-waiting as people think. I would always vote for blocking instead of busy-waiting.

l_belev
07-13-2011, 09:47 AM
This is not something that should be in the specification. It's the implementation's responsibility to choose the most efficient way of implementing glClientWaitSync.

When have you ever seen in the GL spec something like "The GL implementation must not use busy-wait for the implementation of the glFinish command"?

I think you should rather write about your problem to NVIDIA.


I agree this is a somewhat strange thing to include in the spec, but I don't like the notion of "the implementation's responsibility to choose the most efficient way". The implementation has no way of knowing which way is most efficient, because that depends on the particular application and the particular situation. That's why I would prefer that there be a way for the app to choose one way or the other.

Then again:



I would also like to point out that blocking is not necessarily as expensive compared to busy-waiting as people think. I would always vote for blocking instead of busy-waiting.


I agree.

Note that if the driver blocks (yielding the CPU), the application can still busy-wait if it wishes, by calling glClientWaitSync in a loop with a zero timeout.
In contrast, the application can do nothing if it needs to block but glClientWaitSync busy-waits.
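
For example, a pure application-side busy-wait is trivial to build from the zero-timeout form (no yield, entirely under the application's control):

// Spin until the fence signals; this works regardless of how the driver
// implements the blocking path.
while (glClientWaitSync(sync, 0, 0) == GL_TIMEOUT_EXPIRED)
    ; // spin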

I would prefer glClientWaitSync never to busy-wait, but it seems "some vendors" are really fond of doing that, and that's the reason I suggested a flag as a compromise: if the flag is not specified, let them do whatever they think is "the most efficient way", but if the flag is specified, let the application have its blocking behavior.




When have you ever seen in the GL spec something like "The GL implementation must not use busy-wait for the implementation of the glFinish command"?

What kind of argument is that? Of course things can change. Over time people discover shortcomings in the API and patch them. glFinish also does not have a timeout. Is that a reason that glClientWaitSync should not have one either?

Ilian Dinev
07-13-2011, 12:18 PM
Btw, context switching in Windows is not really a heavy operation. I measured it to take something like 370-450 cycles consistently, on older CPUs.
So, could you try polling with:


while(glClientWaitSync(hsync,0,0) == GL_TIMEOUT_EXPIRED) SwitchToThread();

l_belev
07-13-2011, 01:25 PM
I need exactly the opposite - not to poll.
That's what glClientWaitSync is doing - polling in a loop and burning all the CPU time it can get in the process.

I need the thread to SLEEP 99.99% of the time and wake only when a gl fence is passed.

SwitchToThread() causes the thread to yield the CPU momentarily, but it remains runnable and Windows will switch to it again. I don't want that.


The bad thing is that, from the documentation, one can assume that glClientWaitSync is a BLOCKING function in the same sense as e.g. the Windows WaitForSingleObject() or the Unix select(). This assumption sounds very logical, and one designs his software around it.
Then suddenly he discovers that glClientWaitSync actually does dumb busy-loop polling, because someone decided that while their driver is in use it can assume exclusive ownership of any and all CPUs in the machine, and no one else could need to do work on them!
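
For reference, this is the kind of blocking behavior the documentation leads one to expect, illustrated with Win32 primitives (a sketch, not driver code):

#include <windows.h>

// The thread is removed from the run queue and consumes no CPU until the
// event is signaled from elsewhere via SetEvent(ev).
void expectedBlockingBehavior(HANDLE ev) // ev from CreateEvent(NULL, FALSE, FALSE, NULL)
{
    WaitForSingleObject(ev, INFINITE); // sleeps here; woken promptly on SetEvent
    // ... react to the event ...
}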

Alfonse Reinheart
07-13-2011, 01:33 PM
The threads that need the fence info are not OpenGL threads - they do not and can NOT have their own current contexts, and may even be located in a separate process. Please don't assume other people don't know what they are doing.

I didn't say those threads should be waiting on the fences. I said that your "waiter thread" should periodically check the fences and fire off whatever it needs to if it finds that they have completed. If no fences have completed, it can release the CPU and get a timeslice later to test again. This is as opposed to forcing implementations to implement things a certain way.

Unless you're saying that the "waiter thread" isn't an OpenGL thread. And if that's the case, I have no idea how you planned to have it call glClientWaitSync, regardless of its CPU behavior.


Anyway, I only gave an example; the problem is more general. Please don't focus on my concrete example and try to find a workaround for it, that's not the point.

You're talking about a feature that exists to force implementations to implement something in a very specific way. That is not a minor thing, and it is not something that the OpenGL spec should do. Therefore, if your concrete example can be solved in another way, one that is fairly simple and easily implemented, then that concrete example simply is not a good reason for doing that.


I don't like the notion of "the implementation's responsibility to choose the most efficient way". The implementation has no way of knowing which way is most efficient, because that depends on the particular application and the particular situation.

But you have all the tools you need to implement it yourself in the way that is most efficient for your needs. I consider the timed version of glClientWaitSync to mean, "I don't care how you halt the CPU", since the untimed version (wait time = 0) already allows you to implement the exact behavior you need.

You are asking for something that you could do yourself. And in so doing, guarantee that it would have the efficiency you need.

l_belev
07-13-2011, 01:48 PM
You are asking for something that you could do yourself. And in so doing, guarantee that it would have the efficiency you need.

OK, please tell me how to do what I need. I still fail to understand how.

Alfonse Reinheart
07-13-2011, 01:58 PM
You do what Ilian Dinev suggested:


while(glClientWaitSync(hsync,0,0) == GL_TIMEOUT_EXPIRED)
SwitchToThread();

Where "SwitchToThread" is a sleep function or whatever.

If you pass 0 for the wait time to glClientWaitSync, then it will test the fence and return immediately. Either the fence will have been completed, or the timeout of 0 will have expired. If it expired, then it hasn't been completed and you can relinquish your timeslice.

l_belev
07-13-2011, 02:12 PM
while(glClientWaitSync(hsync,0,0) == GL_TIMEOUT_EXPIRED) SwitchToThread();

this code will not work, for the following reason:
let's say it just executed SwitchToThread(), which managed to find another thread ready to run, and so our waiter thread is put to sleep. At this moment some fence gets passed by the GPU, but our waiter thread remains sleeping, because the condition for it to wake has NOTHING to do with the fences - it will wake when Windows decides to give it a slice again.
This defeats the whole point of the waiter thread - it was supposed to react immediately to fence passage. A delay of a few context switches is OK, but not much more.
But with this code we have a potential delay equal to the Windows time slice.

The Windows time slice can be very long. I think typically it is 5 or 10 ms, but it can be over 100 ms. In any case, it is a whole eternity compared to the timings we need.

Alfonse Reinheart
07-13-2011, 03:09 PM
At this moment some fence gets passed by the GPU, but our waiter thread remains sleeping, because the condition for it to wake has NOTHING to do with the fences - it will wake when Windows decides to give it a slice again.

This defeats the whole point of the waiter thread - it was supposed to react immediately to fence passage.

How is this different from what you're asking for? Do you believe that the driver is somehow able to wake a user-created thread immediately upon the completion of a fence? I highly doubt it.

Does Windows even have a way to immediately wake a thread? Or is that function just a way of saying, "The next time you pick who gets a timeslice, give priority to this thread." There is no guarantee that the given thread will awaken immediately or in the immediate future.

If the driver is blocking on a fence and relinquishing its timeslice, there's no guarantee that it will get a timeslice immediately after the fence completes. Fences are not the same as OS mutexes, and even if they were, not even OS mutexes guarantee that blocked threads will get a timeslice immediately after the mutex is released.

The only way to ensure a prompt response to a fence being completed is to sit there and wait for it. Once you relinquish your timeslice, when you get another one is in the hands of the OS. This is just as true for the driver as for your code.

Therefore, even if this proposal were accepted, the simple fact is that it wouldn't get you the timings you want. There would always be a "potential delay equal to the Windows time slice." If that is unacceptable to you, then the only alternative is to waste precious CPU time sitting there and waiting for the fence to complete.

Which is likely why NVIDIA implemented it this way. If you want to give up your timeslice (and therefore have the OS decide when you get another), you have the means to do that with the aforementioned code. But if you want to sit and wait within your timeslice, you have a way to do that by giving glClientWaitSync a non-zero time to wait.

There is no third option. There is no way to give up a timeslice and be given one immediately after a fence completes. That's simply not possible, for you and for the driver.

l_belev
07-13-2011, 03:46 PM
Well, I'm not going to present a lecture about how modern OSes work, but yes, the blocking system calls (like WaitForSingleObject) most definitely have far finer time granularity than the slice. Basically they wake the thread immediately when their condition is met, the only delay being one or two switches between kernel and user mode and a thread context switch, unless there are complications (e.g. all available CPUs are busy with higher-priority threads).

Do you really imagine that all inter-thread synchronization (critical sections, etc.), which is based on blocking system calls, has a timing accuracy of ~10 ms?
You are not being serious, are you?

Ilian Dinev
07-13-2011, 03:56 PM
Could you then quickly tell me the KeXYZ function that a DPC queued by the ISR must use to force a specific thread to be resumed immediately once IRQL is low enough? I can't find it yet.
(and btw, I doubt the sync specification requires hardware to have facilities to issue IRQs on command-completion; it seems it's always been easy and preferable so far to simply map some device-memory to userspace sysram, and just dump some data to 4-16 bytes there, and have the userspace part poll that value)

l_belev
07-13-2011, 04:03 PM
Was this question for me?
If so, I don't know what these abbreviations mean or what their relation to the subject is.

Ilian Dinev
07-13-2011, 04:11 PM
Oh, and that remark about WaitForSingleObject... sorry, but no. Reread the DDK. WFSO removes the thread from scheduling, gets checked annually for its timeout, and should the object be signalled, the thread is queued for scheduling; and then it hopes to take a timeslice sometime this week. So, not "finer granularity than a slice", but "hope for 100 other running threads to quickly finish their polling, while 500 threads are sitting in the non-scheduled graveyard it just came out of".
Right?

l_belev
07-13-2011, 04:12 PM
(and btw, I doubt the sync specification requires hardware to have facilities to issue IRQs on command-completion; it seems it's always been easy and preferable so far to simply map some device-memory to userspace sysram, and just dump some data to 4-16 bytes there, and have the userspace part poll that value)

Aha, now I think I'm starting to get an idea of what you are talking about :)
Well, I'm not a hardware vendor so I don't know. I would guess the hardware has the ability to trigger an IRQ signal on fence crossing, which the driver can process to cause the interested threads to be awakened.
All this should not be too hard for the HW vendors to do, because even back in the (good old) VGA times the IRQ was already in place. Back then it only served as a vertical retrace signal, because GPU fences had not been invented yet. But I would imagine they might have extended its function by now to include servicing the fences too.

l_belev
07-13-2011, 04:17 PM
Oh, and that remark about WaitForSingleObject... sorry, but no. Reread the DDK. WFSO removes the thread from scheduling, gets checked annually for its timeout, and should the object be signalled, the thread is queued for scheduling; and then it hopes to take a timeslice sometime this week. So, not "finer granularity than a slice", but "hope for 100 other running threads to quickly finish their polling, while 500 threads are sitting in the non-scheduled graveyard it just came out of".
Right?

I think you are wrong here. If waking from blocking syscalls were rounded up to the next slice, that would make any meaningful multi-threading impossible. But I'm not going to argue about this anymore.

I'm neither familiar with nor interested in the internal guts of MS Windows, but that does not mean I don't know how certain user-space APIs work.
Just as it is not necessary to know that water consists of oxygen and hydrogen in order to know how to drink it.

I mean, your argument is invalid in the same way as the argument of someone who tells me he knows better than me how to drink water because he knows its internal structure and I don't. You see?

Ilian Dinev
07-13-2011, 06:01 PM
"If waking from blocking syscalls was being rounded up to the next slice" - this happens if another thread with same priority is compute-intensive, so doesn't let the dispatcher run all living ("ready") threads multiple times per 16ms (assuming they are just polling and using SwitchToThread).

My point above was that, afaik, Windows won't let a device driver force a switch to a specific thread after I/O completion, or a DPC or ISR, or ever. All the driver can do is give a hint - a small thread-priority boost, a value which is checked *behold* the next time the thread dispatcher is weighing its options.

So, with the way the sync API is now, you have the flexibility to tune your app for more different things than if you wanted the driver+OS to try to handle it with heavier code (which would affect everyone else who happens to care more about getting results ASAP than about yielding to other threads).

l_belev
07-14-2011, 12:50 AM
"If waking from blocking syscalls was being rounded up to the next slice" - this happens if another thread with same priority is compute-intensive

This is not true; the other thread will be preempted DESPITE being of equal priority.

Think of it this way: the blocked thread still had its slice unfinished at the time of blocking (apart from Sleep and SwitchToThread, no other blocking syscalls give up the remainder of the thread's timeslice). So the blocking syscall essentially "lends" time to another thread to run while it is still our turn to run. When our thread is unblocked, it is immediately switched to and continues its unfinished timeslice.

Alfonse Reinheart
07-14-2011, 01:15 AM
When our thread is unblocked, it is immediately switched to and continues its unfinished timeslice.

So let me get this straight. Every time you release a mutex, every thread that was blocked on that mutex (there can be lots) immediately becomes active. So releasing a mutex basically means, "my timeslice is done; somebody else take over."

I'm afraid I'm going to need to see some documentation or other evidence on that. Especially since you have admitted that you are "neither familiar with nor interested in the internal guts of MS Windows" (and whether you believe it or not, this is all about the internal guts of Windows). So I want to see something that proves that this is how it is implemented.

Also, we're talking about a cross-platform API in OpenGL. So not only do I want to see documentation on that for Windows, but I'll need to see some on Linux, other flavors of UNIX, BSD, and Mac OS X. And if this is going to propagate to mobile platforms with OpenGL ES, now I need info for iOS and Android too.

And then there are fences themselves. Fences are not OS mutexes; they are GPU constructs. So even if you can show that, under all of these systems, mutex release will instantly restore a previous thread to functioning, that doesn't show that fence completion can instantly restore a thread to functioning.

If even one of these GPUs is incapable of doing that, then what you are asking for would be impossible on that platform. And therefore, it would not be a good idea to implement it.

aqnuep
07-14-2011, 01:41 AM
I think we slipped too deep into the mud.

What l_belev wants to say is that operating systems (including Windows and Linux) can move a thread from the running state into a waiting state, and it will automatically be put back into the running state as soon as some OS synchronization primitive is signaled by some other thread/process. This way there is no need to switch to the context of the waiting thread; in fact, the scheduler does not even see the thread while it is in the waiting state, as it only deals with runnable threads.

This is where blocking has an edge over a busy loop, and this *is* a busy loop:


while(glClientWaitSync(hsync,0,0) == GL_TIMEOUT_EXPIRED)
SwitchToThread();

In this case the thread still gets scheduled, wasting precious time on context switching, calling driver functions, and so on.
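
For contrast, a genuinely blocking wait looks like this (standard C++ primitives, illustrative names): the waiting thread leaves the run queue entirely until it is notified.

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool fenceSignaled = false; // set by whichever thread learns the fence passed

void blockUntilFence()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return fenceSignaled; }); // sleeps, no polling at all
}

void onFencePassed()
{
    { std::lock_guard<std::mutex> lock(m); fenceSignaled = true; }
    cv.notify_one(); // moves the waiter straight back to the runnable set
}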

While I still believe such an indication has no place in the GL spec, I agree with l_belev that a busy loop is never a good choice, even if you are yielding.

Alfonse, as evidence:
At my workplace we were working on a server that we inherited from previous developers (Linux platform).
They said threading is expensive, so they implemented the southbound interface on three threads: a sender, a receiver and a worker.
They did polling in the receiver thread (aka busy-wait, but with yielding). This way the server consumed roughly 30% of the CPU time even when it was idle: no requests, nothing.
We changed this so that instead of these three threads there are thousands (1000 to 3000) of threads, each dealing with its own job, including sending and receiving internal messages. The key point here is that we used blocking receives (aka select with proper parameters) and semaphores for blocking idle threads.
In this case (even with the thousands of threads) the processor load was less than 0.01%, and even when handling requests the load barely passed a few percent, as most of the time the server threads were communicating with other internal modules; the blocking wait enabled them to really go idle and thus not consume processor time.
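
A minimal sketch of the blocking-receive pattern described above (POSIX, illustrative):

#include <sys/select.h>
#include <unistd.h>

// The thread sleeps inside select() until data arrives; a NULL timeout
// means it blocks indefinitely, with no periodic wakeups at all.
ssize_t blockingReceive(int fd, char *buf, size_t len)
{
    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(fd, &readSet);
    if (select(fd + 1, &readSet, NULL, NULL, NULL) <= 0)
        return -1;
    return read(fd, buf, len);
}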

Maybe this is not good enough evidence for you, but believe me, every modern OS implements blocking waits efficiently.

Alfonse Reinheart
07-14-2011, 02:48 AM
What l_belev wants to say is that operating systems (including Windows and Linux) can move a thread from the running state into a waiting state, and it will automatically be put back into the running state as soon as some OS synchronization primitive is signaled by some other thread/process. This way there is no need to switch to the context of the waiting thread; in fact, the scheduler does not even see the thread while it is in the waiting state, as it only deals with runnable threads.

That sleeping threads are woken when a mutex is flipped was never contested. What was contested is when those newly awoken threads get a timeslice. And the OS provides no guarantees on when exactly that may be. It may immediately switch over, or it may wait until the scheduler's next update, or even longer.

It also doesn't deal with the deeper issue: that fences are not OS mutexes, but GPU-signaled state. Can a client thread block on GPU-signaled state without polling? For all GPUs?

l_belev
07-14-2011, 03:47 AM
As aqnuep said, this thread has gone too far astray.
If you are interested in learning about mutexes, please find a book on the matter. This forum is about OpenGL.

Ilian Dinev
07-14-2011, 01:31 PM
The latest edit to your original post is now nice and coherent. I support the suggested addition of a hint/flag (specified at glFenceSync) and the relaxed timing requirements. This way devices that can actually issue interrupts on such events will be able to wake the thread, while others can implement this by quickly polling a list of sync objects whenever they handle some existing interrupt.
:)

Groovounet
07-19-2011, 06:50 PM
Very interesting post! I encountered that type of problem a while ago, and it ended up pretty ugly and buggy in my code.