High CPU usage

I run into the following problem with all nVidia linux driver releases:
I am enabling syncing on vblank (setenv __GL_SYNC_TO_VBLANK 1)
My program uses a double-buffered visual and calls glXSwapBuffers.
In this case, I expect that rendering which is not too complex (taking substantially less time than the monitor frame time) should NOT use 100% CPU.
Yet this is exactly what happens.
(It should block-wait (not busy-wait) for the next vblank interrupt.)
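
For reference, a minimal test along these lines (my own sketch, not anything from the driver) already shows the 100% CPU: a double-buffered GLX window that only clears and swaps, run with __GL_SYNC_TO_VBLANK=1 set in the environment. Build with gcc test.c -o test -lGL -lX11 and watch it in top.

/* Minimal test sketch: double-buffered GLX window that only clears and swaps.
 * Run with __GL_SYNC_TO_VBLANK=1 in the environment. */
#include <stdio.h>
#include <GL/gl.h>
#include <GL/glx.h>
#include <X11/Xlib.h>

int main(void)
{
    Display *dpy;
    XVisualInfo *vi;
    Colormap cmap;
    XSetWindowAttributes swa;
    Window win;
    GLXContext ctx;
    static int attribs[] = { GLX_RGBA, GLX_DOUBLEBUFFER, GLX_DEPTH_SIZE, 16, None };

    dpy = XOpenDisplay(NULL);
    if (!dpy) { fprintf(stderr, "cannot open display\n"); return 1; }

    vi = glXChooseVisual(dpy, DefaultScreen(dpy), attribs);
    if (!vi) { fprintf(stderr, "no double-buffered visual\n"); return 1; }

    cmap = XCreateColormap(dpy, RootWindow(dpy, vi->screen), vi->visual, AllocNone);
    swa.colormap = cmap;
    swa.event_mask = ExposureMask;
    win = XCreateWindow(dpy, RootWindow(dpy, vi->screen), 0, 0, 640, 480, 0,
                        vi->depth, InputOutput, vi->visual,
                        CWColormap | CWEventMask, &swa);
    XMapWindow(dpy, win);

    ctx = glXCreateContext(dpy, vi, NULL, GL_TRUE);
    glXMakeCurrent(dpy, win, ctx);

    for (;;) {
        /* Rendering far simpler than one monitor frame time. */
        glClearColor(0.2f, 0.3f, 0.4f, 1.0f);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glXSwapBuffers(dpy, win);   /* should block-wait for vblank, not busy-wait */
    }
    return 0;
}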

[ this question went to nVidia support but was not answered for a long time ]

Hi Moshe,
I think Eric or someone else posted about a similar problem in the Advanced Forum (under Win2k and nVidia HW). Not 100% sure, however; it was long ago…

Regards
Martin


I saw that post. It is old, and doesn’t seem to be resolved. In that thread it seems it was “agreed” that a thread-blocking (vs. busy-wait) SwapBuffers is the “good thing”, but it is not done yet. Any idea if it’s done now on the Windows side? What about Linux?

I’m not sure if this is at all possible with Linux: depending on your vertical refresh frequency, that gives you an interrupt every 20 ms or so. It’s possible that the GL driver can’t block-wait until the interrupt occurs, because it can’t hand all the drawing work over to the interrupt handler.
If the interrupt handler sent a signal to the OpenGL process, the process wouldn’t necessarily be awakened by Linux before the next interrupt (the scheduler swaps tasks every 0.1 s or so).

IMHO!

Originally posted by rixed:
It’s possible that the GL driver can’t block-wait until the interrupt occurs, because it can’t hand all the drawing work over to the interrupt handler.

?? What do you mean by “hand the drawing work over to the interrupt handler”? There is no need to do that…


If the interrupt handler sent a signal to the OpenGL process, the process wouldn’t necessarily be awakened by Linux before the next interrupt (the scheduler swaps tasks every 0.1 s or so).

0.1 seconds?!? You are off by exactly one order of magnitude. Default Linux kernels are compiled with a 10 ms “timer interrupt” (the preemptive scheduling interrupt). This is 0.01 seconds.
You don’t have to take my word for it. Try this:
grep timer /proc/interrupts ; sleep 1 ; grep timer /proc/interrupts
and see by how much the interrupt counter has increased during that one second.
You can also check in the kernel source, in include/asm-i386/param.h; look for #define HZ.

I even compile my kernel with a 0.5 ms timer, and it works really fine.
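
For anyone who wants to try that: on 2.4-era i386 kernels it is a single constant in include/asm-i386/param.h (the exact file may differ for other architectures or kernel versions); change it and rebuild the kernel.

/* include/asm-i386/param.h -- the stock value is 100 (a 10 ms tick).
 * Setting it to 2000 and recompiling gives the 0.5 ms tick mentioned above. */
#ifndef HZ
#define HZ 2000   /* stock kernels: 100 */
#endif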

I don’t know why it works, but my app was using close to 100% CPU, and adding

#include <time.h>

struct timespec timereq = { 0, 1 };   /* 0 seconds, 1 nanosecond */
struct timespec timerem;

nanosleep(&timereq, &timerem);

before the render function dropped the usage to near zero when there are few primitives, and to about 50% when displaying about 3700 individual triangles with quadratic light attenuation. That is, I asked the app to sleep for one nanosecond before drawing.

Originally posted by FermatSpiral:

I don’t know why it works, but my app was using close to 100% CPU, and adding

struct timespec timereq = { 0, 1 };   /* 0 s, 1 ns */
struct timespec timerem;

nanosleep(&timereq, &timerem);

before the render function

Very interesting.
Where exactly is “before the render function”?
Immediately after glXSwapBuffers()?
Any glFinish()'s involved there?
Maybe sched_yield() would have the same effect?

I have three functions inside main(): begin(), draw() and end(), running inside a loop.

Inside begin() are glClear() and the transformation code.

Inside draw() is all the glBegin() and glEnd() stuff.

Inside end() is glXSwapBuffers().

All the other stuff inside these functions is application specific.

The call to nanosleep() is made between begin() and draw().

I have tried with and without glFinish(), the results are the same.
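
So the loop is roughly this (just a sketch of the structure; begin(), draw() and end() are my own functions, and everything else in them is application specific):

#include <time.h>

/* Sketch of the main loop described above.  begin() does glClear() and the
 * transformations, draw() does the glBegin()/glEnd() work, end() calls
 * glXSwapBuffers().  The 1 ns nanosleep() sits between begin() and draw(). */
extern void begin(void);
extern void draw(void);
extern void end(void);

void main_loop(volatile int *running)
{
    struct timespec timereq = { 0, 1 };   /* 0 s, 1 ns */
    struct timespec timerem;

    while (*running) {
        begin();
        nanosleep(&timereq, &timerem);
        draw();
        end();
    }
}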

From Moshe Nissim:

“0.1 seconds?!? You are off by exactly one order of magnitude. Default Linux kernels are compiled with a 10 ms ‘timer interrupt’ (the preemptive scheduling interrupt). This is 0.01 seconds.”

It’s true that the timer is at 100 Hz on PCs. But the scheduler does not swap tasks on every timer interrupt; on average it does so every 10 interrupts or so. So we come to 0.1 s.

(Yes, on PCs there are only about 10 different tasks that get executed every second. See “Core Kernel Commentary” for more details.)

So I doubt that you can non-busy-wait for the vertical interrupt without missing at least the next one.

First, let me say that I’ve just stumbled over this same problem while trying to run two “real-time” rendering applications sync’d to the vertical blank of two TV-out channels on two different video cards.

I also discovered behavior similar to the nanosleep() post using usleep(); however, I believe that the timer resolution issue is still a problem here. Basically, no matter how small a number you specify for the sleep time, you’re still sleeping some minimum number of milliseconds (say 10ms or more). At higher refresh rates, this simply causes you to miss one or more subsequent vertical blanks. In my application, this is not good…

The underlying problem is the Linux timer resolution. If I’m not mistaken, there are a few Linux patches/extensions available such as UTIME, KURT, and/or possibly RTLinux which do offer timing resolution reliably down into the tens of micro-seconds. With these extensions, you may actually be able to get some relief by clever use of usleep(); however, this is still a poor substitute for having a blocking-wait.

With the usleep() method and a timer resolution modification, you’d still have to be able to accurately determine your rendering time, usleep() for any remaining time in each vblank cycle, and wake up in time to spin-lock on the glXSwapBuffers() call. Though this may work as a temporary fix, this doesn’t sound like much fun to me… especially considering that you’ve got multiple applications competing over the processor and this may complicate accurate determination of time required to render… thereby causing missed vertical blanks!
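
To make that concrete, the workaround would look roughly like this (a sketch only; get_time_us(), render_frame() and swap_buffers() are placeholder names of mine, and REFRESH_US has to match your actual refresh period):

#include <unistd.h>
#include <sys/time.h>

/* Sketch of the usleep() workaround described above. */
#define REFRESH_US 16667   /* vblank period: ~16667 us at 60 Hz */

static long long get_time_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

void one_frame(void (*render_frame)(void), void (*swap_buffers)(void))
{
    long long start, elapsed, slack;

    start = get_time_us();
    render_frame();                        /* the measured rendering work        */

    elapsed = get_time_us() - start;
    slack = REFRESH_US - elapsed - 1000;   /* keep ~1 ms margin before vblank    */
    if (slack > 0)
        usleep(slack);                     /* release the CPU for the idle gap   */

    swap_buffers();                        /* then spin only for the remainder   */
}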

Anyway, the correct solution would have to come in the form of a blocking-wait buffer swap instead of the current spin-wait. As noted in other posts, this is not feasible with most (if not all) standard Linux implementations; however, it should be possible in combination with one of the timer resolution modifications available for Linux. So, what I’d like to see is nVidia providing an option whereby the developer could select blocking-wait or spin-wait as appropriate for their application. Spin-wait could certainly be left as default for default Linux users, but I’d like to be able to switch to blocking-wait on my timer resolution modified systems capable of pulling off the scheduling required.

So the question is: does such an option exist to change wait methods? If not, could we have one please??? Hopefully nVidia has and/or will address this issue shortly… would certainly be a big help in achieving my real-time computing goals under Linux!

Anyone been in contact with nVidia on this topic???

Originally posted by ScuzziOne:

The underlying problem is the Linux timer resolution. If I’m not mistaken, there are a few Linux patches/extensions available such as UTIME, KURT, and/or possibly RTLinux which do offer timing resolution reliably down into the tens of micro-seconds.

I work reliably with standard Linux, with a 0.5 ms timer, simply by compiling the kernel with HZ = 2000. No special extension needed.

Anyway, the correct solution would have to come in the form of a blocking-wait buffer swap instead of the current spin-wait.

I completely agree

As noted in other posts, this is not feasible with most (if not all) standard Linux implementations; however, it should be possible in combination with one of the timer resolution modifications available for Linux.

I disagree. It’s possible with different coding in the nVidia driver. Linux is perfectly capable of putting a process to ‘sleep’ when it’s doing a blocking wait (select, read, whatever), and waking it up very quickly upon a HW interrupt (handled in this case in the nVidia driver kernel module).
Remember that even without modifying the timer resolution in the kernel (HZ), the scheduler can still wake up the waiting process exactly on the event (vsync, via interrupt), if no other process holds the CPU. Modifying the timer resolution will help preemption (kicking another process off the CPU when vsync arrives and ‘our’ process is waiting for it). But again, this is just a compilation of the standard kernel with #define HZ changed; no patch needed.
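
To illustrate, here is a purely user-space analogy (my own sketch, nothing to do with the actual driver internals): one thread plays the role of the vblank interrupt by writing a timestamp into a pipe every ~16.7 ms, and the main thread block-waits in read(). While blocked it uses no CPU at all, and it is woken almost immediately when the “interrupt” arrives. Build with gcc wakeup.c -o wakeup -lpthread.

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>

static int pipefd[2];

static long long now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Fake "vblank interrupt": fires ~60 times per second for one second. */
static void *fake_vblank(void *arg)
{
    int i;
    long long t;

    for (i = 0; i < 60; i++) {
        usleep(16667);                    /* ~60 Hz "vblank"          */
        t = now_us();
        write(pipefd[1], &t, sizeof t);   /* the "interrupt" fires    */
    }
    close(pipefd[1]);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    long long t;

    if (pipe(pipefd) != 0) { perror("pipe"); return 1; }
    pthread_create(&tid, NULL, fake_vblank, NULL);

    /* Block-wait: no CPU is used here until an event arrives. */
    while (read(pipefd[0], &t, sizeof t) == (ssize_t)sizeof t)
        printf("woken %lld us after the event\n", now_us() - t);

    pthread_join(tid, NULL);
    return 0;
}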

Anyone been in contact with nVidia on this topic???

I have. I sent a report to their Linux support … and got zero response :(

I think this is what happens when the program takes substantially less time than the monitor frame time:
The first call to glXSwapBuffers is done the proper way, letting the driver wait for the sync using the interrupt. Control is given back to the program, which creates another frame and calls glXSwapBuffers. nVidia does not store more than one pending frame, so if the first frame is still waiting, the driver has to block now. Using the sync interrupt is a little more complicated at this point, since it is the first frame that should be swapped then, not the second. The above works very well as long as the program does not create frames faster than the sync rate of the monitor. If you make the second frame block-wait, the modified code will probably be a little worse for the more common case.

Has anybody tried to put in a glFinish()? Perhaps this function waits in a more polite way.
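
Something along these lines is what I mean (just a guess on my part; glFinish() may well spin in the same driver path as the swap):

#include <GL/gl.h>
#include <GL/glx.h>

/* Just a guess: wait for all queued GL commands to complete before
 * asking for the swap, in case glFinish() waits more politely than
 * glXSwapBuffers() does. */
void end_frame(Display *dpy, Window win)
{
    glFinish();                 /* block until rendering is finished        */
    glXSwapBuffers(dpy, win);   /* then request the swap at the next vblank */
}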

hi all !!

I’m a beginner with OpenGL, GLX, threads and Linux…

I tried it myself (and got some problems, see my other post); I did something like this:

I have one thread that schedules the whole thing, and one thread that handles the window management: rendering and user interactivity…

If I don’t ask each thread to sleep a few nanoseconds, my renderer crashes.

Rixed said: “It’s true that the timer is at 100 Hz on PCs. But the scheduler does not swap tasks on every timer interrupt; on average it does so every 10 interrupts or so. So we come to 0.1 s.”

I don’t think you’re right… the scheduler swaps threads or processes according to their priorities; in fact, their effective priorities, which it calculates from their real ones.
So if you have a high priority, your task could be executed several times in a row, which could look like it was run just once at a slower rate. I think that’s it.

When you make a basic program with 2 threads just counting, you get more than 10 lines appearing on your screen per second…
and the Linux scheduler manages all the other tasks too, such as, in our case, X.
X could be another problem too.

thanx for reading me !

saian:

“When you make a basic program with 2 threads just counting, you get more than 10 lines appearing on your screen per second”

Of course, if you don’t wait for swapping.
Here is a little test for you to see:


toto1.sh :

#!/bin/bash
while true;
do
while ! test -e /tmp/toto ; do
echo “wait…”
done ;
echo “thread 1”;
rm -f /tmp/toto
done


toto2.sh :

#!/bin/bash
while true;
do
while test -e /tmp/toto ; do
echo “wait…”
done ;
echo “thread 2”;
touch /tmp/toto
done


chmod a+x toto1.sh toto2.sh

((./toto1.sh &) ; (./toto2.sh &)) > output ; sleep 10 ; killall toto1.sh && killall toto2.sh

grep 'thread 1' output | wc -l
-> 123
grep 'thread 2' output | wc -l
-> 122

meaning: toto1.sh was interrupted by toto2.sh about 122 times in 10 s.

Do it with another HZ, and you’re gonna have more.

Clear? :)


I do not have any more speculations about how the driver works, but I have some comments about threads. If you have to call a sleep function to avoid a crash, your code probably has a race condition. It is only the scheduler that runs during the timer interrupt, not the threads and processes.

I also think that the assumptions about how fast threads can be swapped are not true. Here is a little hack that gives me a lot more than 1000 swaps per second on a standard RH 7.3 kernel. Remember to build with -D_GNU_SOURCE.

#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <pthread.h>

volatile sig_atomic_t done;
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex2 = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t m1 = PTHREAD_COND_INITIALIZER;
pthread_cond_t m2 = PTHREAD_COND_INITIALIZER;

void alarmHandler(int signum)
{
    done = 1;
}

/* Thread 2: waits to be signalled, signals back, and counts one swap. */
void *thread2(void *count)
{
    int *c = count;

    pthread_mutex_lock(&mutex2);
    while (!done) {
        pthread_cond_wait(&m2, &mutex2);

        pthread_mutex_lock(&mutex1);
        pthread_cond_signal(&m1);
        pthread_mutex_unlock(&mutex1);

        *c = *c + 1;
    }
    pthread_mutex_unlock(&mutex2);
    pthread_cond_signal(&m1);
    pthread_exit(NULL);
}

int main()
{
    int count = 0;
    pthread_t pt2;

    done = 0;

    pthread_create(&pt2, NULL, thread2, &count);

    signal(SIGALRM, alarmHandler);
    alarm(10);

    /* give the second thread a little time so it starts */
    usleep(20);

    /* Main thread: ping-pong with thread2 via the condition variables
       for ten seconds, until the alarm sets 'done'. */
    pthread_mutex_lock(&mutex1);
    while (!done) {
        pthread_mutex_lock(&mutex2);
        pthread_cond_signal(&m2);
        pthread_mutex_unlock(&mutex2);

        pthread_cond_wait(&m1, &mutex1);
    }
    pthread_mutex_unlock(&mutex1);
    pthread_cond_signal(&m2);

    pthread_join(pt2, NULL);
    pthread_mutex_destroy(&mutex1);
    pthread_mutex_destroy(&mutex2);

    printf("\nthread2 was swapped %d times in ten seconds\n", count);

    return 0;
}

Of course you can swap between THREADS very quickly. This has nothing to do with the timer interrupt: by doing a pthread_cond_wait or a pthread_cond_signal you implicitly call the scheduler! So the number of swaps, in your example, is only limited by your CPU.

I could have achieved the same result in the shell example I gave if, instead of simply touching/removing the file used as a lock, I had connected the two shell scripts with a lockfile.

The point was, if I understood it correctly, how long a task can run before being interrupted by the scheduler; so obviously the tasks must not explicitly ask the scheduler to switch to a specific thread, which is of course done at once :)

The example I gave does just that: it executes a 100% CPU loop and counts, by a trick, how many times these stand-alone, CPU-hungry processes were forcibly interrupted by the scheduler.

The scheduler itself is launched by the timer interrupt every 1/100 s, and actually swaps tasks about 1 time out of 10, so that leads us to the 0.1 s of execution time per process that I claimed.

Why is that the point, instead of threads calling each other? Because we were talking about the vertical IRQ adding a task for the GLX function that swaps the screen, a task which, as far as I understand the situation, is not connected by any signal to a given OpenGL application.
That’s why I doubted the non-busy waiting functionality of glXSwapBuffers.

I may have misunderstood the relations between the vertical IRQ, GLX and the application, anyway. I tried to disassemble libGLX, looking for the IRQ handler and the SwapBuffers function, without results.

Someone with knowledge of the technical design of GLX could help me…

friendly waiting for your comments,

Sorry if I made your statement more general than intended. I think that you overlook the fact that Linux uses preemptive scheduling. According to this page, every process is given a fixed time slice of 200 ms to run because of that: http://iamexwiwww.unibe.ch/studenten/schlpbch/linuxScheduling/LinuxScheduling.html
You can create processes with higher priority that will interrupt the lower-priority task. The man page for sched_setscheduler also describes the scheduler.

The NVidia people know but will not tell us. I guess they find our speculations amusing.

The fact that the preemption interval (“timeslice”) is X (10 ms in an unmodified kernel) doesn’t mean processes are swapped exactly at 10 ms boundaries, or swapped every 10 ms. It can be more, if the scheduler decides to let the current process keep the CPU. It can be less, if a process “voluntarily” yields the CPU by making a blocking system call (select, read, etc.).

rixed: you doubted glXSwapBuffers makes a non-busy wait?? You think it SHOULD make a BUSY wait?? You can find the top level of the IRQ handler in the nv.c file in the open-source part of the nVidia kernel module. But it doesn’t help much; very little is visible there.

For what it’s worth, I checked the behaviour under Windows (and OpenGL) and it’s exactly the same.

The fact that the board delivers an interrupt at every vertical blanking (it does, I checked) means to me that they know it should be done this way (nVidia, I mean).
Why they don’t do it after all, that’s the interesting question.

Well,

I personally believe this is a problem with the nVidia drivers/GLX and have sent a bug report to them. Of course, there’s been no reply to date. Regardless, I urge any/all who are interested in gaining maximum utilization/performance out of their system (such as in a realtime environment) to send a note to nVidia to let them know you exist. As in all things, “the squeaky wheel gets the grease” and it appears that not enough of us have squeaked yet to warrant any sort of response from the nVidia technical support folks. Given that I’ve got my reputation on the line with projects pushing real-time linux and OpenGL, I certainly hope nVidia is not on the verge of pulling a 3dfx on us! Don’t get me wrong, I appreciate the support they’ve dedicated to the Linux community thus far… it just concerns me that they seem to feel no need to respond to their Linux tech support e-mail! So, if you’re in the same boat with the rest of us, please let nVidia know you exist as a Linux based user.

Perhaps we’ll get some sort of response if enough folks e-mail them. Personally, I wouldn’t even mind if the response is only to point out that we’re all friggin’ idiots who don’t know squat about coding OpenGL as long as it also points out the error of our ways so that we can move on to happier times.

Hey, our e-mails may be going straight to /dev/null; however, it’s worth a try!

Thanks

Originally posted by ScuzziOne:
Well,

Given that I’ve got my reputation on the line with projects pushing real-time linux and OpenGL, I certainly hope nVidia is not on the verge of pulling a 3dfx on us! Don’t get me wrong, I appreciate the support they’ve dedicated to the Linux community thus far… it just concerns me that they seem to feel no need to respond to their Linux tech support e-mail!
Thanks

Just so you know, the problem isn’t specific to Linux. A 1 millisecond rendering loop with vsynced swap on Windows (2000 included…) also takes up 100% CPU.
If you search the advanced forum you’ll see that this thing has already come up, without resolution (and without a mention of Linux…). The nVidia people seem to go as far as stating something like “we know it should be done differently, but our bosses won’t let us release the CPU for fear of losing even one benchmark point in a game engine benchmark (which doesn’t multithread…)”.
In Linux the problem is even somewhat reduced by the fact that you can run the OS with a much smaller preemptive time slice than 10 ms (giving the scheduler a chance to take the CPU from that dreaded nv-driver busy-wait and give it back in time!).