Strange CPU-Load with enabled VSync

vbsvbs · October 5, 2016, 1:26pm

Hey Guys,

I am new to this and have taken my first steps with OpenGL. I have written a small test program (basically hacked an OpenGL example) that uses SDL2 to get an OpenGL context and to draw a rotating cube (basically took this code and made it compile on SDL2: Using OpenGL With SDL). I don’t really understand the CPU load and the framerate that I see when running it on Linux on an Atom D525 with ION2 (in relation to VSync). I think there might be something wrong with my setup (maybe related to the graphics driver or Xorg?).

WITH VSync the framerate gets limited to 60 fps and causes a CPU load of about 2 % (fine so far). But when I disable VSync I get a framerate of around 480 fps but a CPU load of 100 %. 100 % CPU means that the program is CPU-bound, right? So the framerate could be increased with a faster CPU I think.
I have the feeling that a CPU load of 100 % is not ok (or that the framerate should be higher). Is it expected that a framerate of 60 fps (vsync) causes 2% CPU but a framerate of 480 fps (non-vsync) causes 100% CPU? It’s 50 times the CPU load (2% -> 100%) but only 8 times the framerate (60fps -> 480fps).

I have read that sometimes you get 100% CPU WITH VSync since waiting for vsync is sometimes implemented as a busy loop but in my case it is exactly the other way round.

And is it expected that a program like that is CPU-bound? It does not do much on the CPU and is only drawing on the GPU, no?

These are my system specs:

Shuttle XS35GTv2 (Intel Atom D525 with ION2)
Lubuntu 16.04 LTS
Graphics-Driver: Nvidia 340.96

This is the output of glxinfo: (Pastebin.com link)

This is my Xorg.log: (Pastebin.com link)

If any information is missing I will happily provide more

So if you have any idea how my numbers (fps <-> cpu) make sense or what could be wrong about my setup, then I would be very thankful!

EDIT:
Actual code: SDL OpenGL Tutorial, (c) Michael Vance, 2000 (Pastebin.com link)

Cornix · October 5, 2016, 1:56pm

With VSync enabled your render loop should block when swapping buffers. This will also block your application and thus reduce CPU load.
If you don’t have VSync enabled and you don’t do any thread blocking / sleeping / yielding yourself, then your application runs as fast as it can. It’s practically a while(true) loop. Of course this will get you 100% CPU load.

These are the relevant lines in your code:

    while( 1 ) {
        /* Process incoming events. */
        process_events( );
        /* Draw the screen. */
        draw_screen( );
    }

What other than 100% CPU load do you expect to get without ever blocking your thread?

vbsvbs · October 5, 2016, 2:22pm

Thanks for the quick answer! I thought that the program would maybe be limited by the GPU (instead of the CPU) so that it would not reach 100% CPU. In that case maybe a function like “SDL_GL_SwapWindow” would yield while waiting for the GPU, freeing up the CPU. Something similar seems to happen in SDL_GL_SwapWindow with VSync enabled while waiting for the next sync.

But ok, let’s say the program is CPU-bound: why does it produce 100% CPU load drawing 480 fps (without vsync) while the same program produces just 2% CPU load to draw 60 fps (with vsync)? CPU load increased by a factor of ~50 while FPS only increased by a factor of ~8. Can this be explained, please?

Also I noticed I forgot to add the actual source code to the first post, sorry, fixed :dejection:

john_connor · October 5, 2016, 3:30pm

https://www.opengl.org/wiki/Performance#FPS_vs._Frame_Time

you should build a “check” for the FPS / frametime in your main loop
something like:


static double time_lastframe = 0;
double time_now = /*.. any time function with relatively high precision ..*/;
double time_frame = time_now - time_lastframe;
if (time_frame < 1.0 / 60 /*FPS*/)
  continue;
time_lastframe = time_now;

/*render 1 frame*/

that makes sure that the FPS does not rise too high, e.g. if you have ALT+TAB pressed / window minimized / etc
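Note that skipping the frame with `continue` keeps the loop spinning, so the CPU load stays high. A minimal sketch of a variant that sleeps for the rest of the frame budget instead is shown below. It assumes a POSIX system (`clock_gettime`, `nanosleep`); the helper names `now_seconds`, `remaining_frame_time`, `sleep_seconds` and `limit_frame_rate` are made up for this example and are not part of SDL or OpenGL.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() and nanosleep() */
#endif
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Seconds left in the current frame budget, or 0.0 if it is already used up. */
static double remaining_frame_time(double frame_start, double now, double target_fps)
{
    double left = 1.0 / target_fps - (now - frame_start);
    return left > 0.0 ? left : 0.0;
}

/* Suspend the calling thread for roughly the given time. */
static void sleep_seconds(double seconds)
{
    struct timespec req;
    if (seconds <= 0.0)
        return;
    req.tv_sec = (time_t)seconds;
    req.tv_nsec = (long)((seconds - (double)req.tv_sec) * 1e9);
    nanosleep(&req, NULL);
}

/* Hypothetical helper: call once per loop iteration after drawing a frame.
   Returns the start time to use for the next frame. */
static double limit_frame_rate(double frame_start, double target_fps)
{
    sleep_seconds(remaining_frame_time(frame_start, now_seconds(), target_fps));
    return now_seconds();
}
```

In the main loop this would be used as `frame_start = limit_frame_rate(frame_start, 60.0);` right after `draw_screen( )`. The sleep granularity depends on the kernel, so the resulting frame rate is only approximately the target.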

Dark_Photon · October 5, 2016, 5:15pm

vbsvbs, first what Cornix said is the dead-on answer to your question, with two qualifications:

1. With VSync on (or off for that matter), your render loop will block when the driver decides it should block. This may be when you swap buffers, but it may not be. It could be inside some random GL command when the driver FIFO has just filled up. It could also be when the GPU has gotten too many frames behind the commands you’re submitting on the CPU. In any case, you can force the driver to block on VSync by following SwapBuffers with a glFinish() call. That will force any conformant driver to wait until it can swap and does swap, which is (in a VSync-limited case with VSync enabled when VSync isn’t being “virtualized” by a compositor) typically after it hits the vertical retrace in the video signal. This is fine to do on a desktop GPU, but don’t do this on a tile-based mobile GPU.
2. Re “of course this’ll get you 100% CPU”, it will so long as you’re CPU-cycles limited. If you’re not (e.g. GPU limited, draw thread keeps getting bumped off the CPU due to locks, I/O, thread preemption, etc.), then obviously you won’t hit 100% CPU.

First off, if you’re going to be doing much GPU profiling, you should ditch frames-per-second (FPS). It’s non-linear and borderline useless. Gamerz like it partly because of the non-linearity, as it makes their FPS numbers just silly as the frame times get smaller.

Instead, use frame time (in milliseconds) for profiling. 60 Hz (frames/second) == 16.66 msec/frame.

Given that, let’s look at your numbers:

2% CPU-Load to draw 60 fps (with vsync). … 100% CPU-Load drawing 480 fps (without vsync)

With VSync off, you’re getting 480fps. That’s 2.1 ms/frame.

With VSync on, you’re making 60Hz == 16.66 ms/frame. If your used frame time is still 2.1 ms, that’d be 12% used CPU and 88% idle CPU.

So without mixing in other factors, we’d guess that with VSync on you’d get 12% CPU instead of 2% CPU. However, when you’re free-running (running without VSync), it could be that there are some different code paths kicking in that consume more CPU. For instance, if the driver FIFO fills, I wonder if your driver is doing a busy-wait until space opens up. Another possibility is: I notice you’re using triple-buffering. This means you could be rendering more frames than necessary. If so, accounting for this could get your estimate down closer to the 2% measured that you’re looking for (at least down to 6%). Try turning triple buffering off and see what you find. Yet another thought: On that embedded Ion GPU, are you sure that CPU load doesn’t include your GPU?

In any case, if you profile in more detail you can probably figure out whether the 10% CPU difference is in your app or elsewhere (e.g. in the kernel-mode driver).
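The frame-time arithmetic above can be written down as two small helpers. This is only a sketch for checking the numbers; the function names `frame_time_ms` and `busy_percent` are invented for this example.

```c
/* Convert a frame rate (frames per second) to a frame time in milliseconds. */
static double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Share of each frame period that is spent busy, in percent. */
static double busy_percent(double busy_ms, double period_ms)
{
    return 100.0 * busy_ms / period_ms;
}
```

With these, `frame_time_ms(480.0)` is about 2.08 ms and `frame_time_ms(60.0)` is about 16.67 ms, so `busy_percent(frame_time_ms(480.0), frame_time_ms(60.0))` comes out at 12.5%, which is the roughly 12% estimate. Going the other way, the measured 2% of a 16.67 ms frame corresponds to about 0.33 ms of CPU time per frame.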

vbsvbs · October 6, 2016, 10:54am

Ok, very interesting.
If I am not wrong this means for Vsync off:
2% CPU load -> CPUs get utilized for around 20ms per second (2% from one second) -> renders 60 frames in that 20ms -> 0.33ms per frame (compared to 2.1ms with Vsync)

Could I verify this by modifying the program to provoke this situation? Let’s say I just fire GPU commands in a while-loop to make sure the FIFO fills up and then see what happens. If I then see 100% CPU load, this would kind of prove that theory. Any hints on how I could do this, please? What if I did something simple like this?

    
while(1)
{
    glClearColor( 0, 0, 0, 0 );
}

I tried it with disabling triple buffering but it didn’t change anything.

Hm no, not 100% sure. I googled for it but could not find out. Any ideas how I could verify?

Ok, if the other test will not lead to anything I will try to profile it. Will be tricky in the driver but maybe I am lucky.

vbsvbs · October 6, 2016, 2:41pm

Well, I hacked the code and added a loop to the drawing function like this, so it is drawing the cube 8000 times to the same place now:


    /* Send our triangle data to the pipeline. */
    for(int i=0; i < 8000; ++i)
    {
        glBegin( GL_TRIANGLES );

        glColor4ubv( red );
        glVertex3fv( v0 );
        glColor4ubv( green );
        glVertex3fv( v1 );
        glColor4ubv( blue );
        glVertex3fv( v2 );
<...>
        glColor4ubv( black );
        glVertex3fv( v5 );
        glEnd( );
    }

The framerate dropped from the ~480fps before to now ~22fps. CPU-Load stays at 100%!
Am I right that this code should be limited by the GPU for sure? So since the CPU still stays at 100% load it really looks like it is busy-waiting for the overloaded GPU which seems to be the reason for my “odd” numbers, right?

Dark_Photon · October 6, 2016, 5:36pm

[QUOTE=vbsvbs;1284147]Ok, very interesting.
If I am not wrong this means for Vsync off:
2% CPU load -> CPUs get utilized for around 20ms per second (2% from one second) -> renders 60 frames in that 20ms -> 0.33ms per frame[/QUOTE]

Yes, except this is the Vsync ON case. So you’re likely using the CPU 20ms per 1 second of wall-clock time, or 0.33ms per 16.66ms (60Hz) frame.

(compared to 2.1ms with Vsync)

No, that’s your Vsync OFF frame time, computed from your 480 Hz frame rate.

Could I verify this by modifying the program to provoke this situation?

Probably. Before you go to guessing though, check and see what profiling tools you have access to (CPU and GPU). Also, I’d check your NVidia driver configuration. On Linux IIRC, NVidia provides settings to configure how the driver waits. See /usr/share/doc/NVIDIA_GLX-1.0/README.txt and search down for “sleep” (in particular, see __GL_YIELD).

What if I would do something simple like this?

Well that’ll definitely burn down your CPU. However, it depends on the driver implementation whether this by itself will queue anything to be sent to the driver and thus block on a FIFO full condition.

Hm no, not 100% sure. I googled for it but could not find out. Any ideas how I could verify?

I’d download a benchmarking tool that you know is GPU limited, and then look at your % CPU metric.

Dark_Photon · October 6, 2016, 5:46pm

[QUOTE=vbsvbs;1284148]Well, I hacked the code and added a loop to the drawing function like this, so it is drawing the cube 8000 times to the same place now:


    /* Send our triangle data to the pipeline. */
    for(int i=0; i < 8000; ++i)
    {
        glBegin( GL_TRIANGLES );

        glColor4ubv( red );
        glVertex3fv( v0 );
        glColor4ubv( green );
        glVertex3fv( v1 );
        glColor4ubv( blue );
        glVertex3fv( v2 );
<...>
        glColor4ubv( black );
        glVertex3fv( v5 );
        glEnd( );
    }

The framerate dropped from the ~480fps before to now ~22fps. CPU-Load stays at 100%![/QUOTE]

Yup. 2.1 ms/frame -> 45 ms/frame (a 21X increase in CPU consumption).

Am I right that this code should be limited by the GPU for sure?

No. You just said you’re 100% CPU. That strongly indicates that you are CPU limited.

And stepping back, the reason why you’re CPU limited is clear. The way you’re feeding vertex data to the GPU is very inefficient. It requires many, many GL calls and lots of CPU per draw call to pack and reformat the data into a format where it can even be sent to the kernel-space GL driver to be rendered on the GPU.

You may know but this glBegin()…gl{Vertex,Normal,Color,etc.}()…glEnd() style of drawing is called immediate mode, and it dates back to the earliest versions of OpenGL. It is not a good match for GPUs today. If you want better performance (read: much lower CPU cost per batch), you should switch from immediate mode to vertex arrays: either Client-side Vertex Arrays or Server-side Vertex Arrays (via Vertex Buffer Objects (VBOs)). I’d start with client arrays as that’s the simplest first step and provides a huge speed-up. Once you understand that, switching your vertex arrays to VBOs is easy. However, making VBOs fast can be tricky.
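As a rough illustration of the client-side vertex array route, the sketch below replaces the per-vertex glColor/glVertex calls with one glDrawElements call per cube. The cube coordinates, colors and index order are made up (the tutorial's actual v0, v1, ... values are not shown in this thread), and `USE_OPENGL` is a hypothetical build switch added only so the data tables compile without the GL headers. The function `draw_cube_with_client_arrays` is likewise invented for this example and needs a current legacy (compatibility) GL context.

```c
#ifdef USE_OPENGL            /* hypothetical build switch for this sketch */
#include <GL/gl.h>
#endif

/* Hypothetical cube data: 8 corners, one RGBA color per corner. */
static const float cube_positions[8][3] = {
    {-1.0f, -1.0f,  1.0f}, { 1.0f, -1.0f,  1.0f},
    { 1.0f,  1.0f,  1.0f}, {-1.0f,  1.0f,  1.0f},
    {-1.0f, -1.0f, -1.0f}, { 1.0f, -1.0f, -1.0f},
    { 1.0f,  1.0f, -1.0f}, {-1.0f,  1.0f, -1.0f}
};

static const unsigned char cube_colors[8][4] = {
    {255,   0,   0, 255}, {  0, 255,   0, 255},
    {  0,   0, 255, 255}, {255, 255, 255, 255},
    {255, 255,   0, 255}, {  0,   0,   0, 255},
    {255, 165,   0, 255}, {128,   0, 128, 255}
};

/* Two triangles per face, six faces: 36 indices into the arrays above. */
static const unsigned char cube_indices[36] = {
    0, 1, 2,  0, 2, 3,   /* front  */
    1, 5, 6,  1, 6, 2,   /* right  */
    5, 4, 7,  5, 7, 6,   /* back   */
    4, 0, 3,  4, 3, 7,   /* left   */
    3, 2, 6,  3, 6, 7,   /* top    */
    4, 5, 1,  4, 1, 0    /* bottom */
};

#ifdef USE_OPENGL
/* Draw the whole cube with a single draw call instead of one GL call per vertex. */
static void draw_cube_with_client_arrays(void)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, cube_positions);
    glColorPointer(4, GL_UNSIGNED_BYTE, 0, cube_colors);
    glDrawElements(GL_TRIANGLES, 36, GL_UNSIGNED_BYTE, cube_indices);
    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}
#endif
```

The arrays stay in application memory here, so the driver still copies them on each draw; moving them into a VBO would be the follow-up step mentioned above.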

So since the CPU still stays at 100% load it really looks like it is busy-waiting for the overloaded GPU which seems to be the reason for my “odd” numbers, right?

No, you’re spending all of your time in the GL driver (likely the user-space GL driver) packing all of these individual vertex attributes into arrays under-the-hood so that the driver can actually send the batch to the GPU to draw.

vbsvbs · October 7, 2016, 11:08am

[QUOTE=Dark Photon;1284151]Yes, except this is the Vsync ON case.
No, that’s your Vsync OFF frame time, computed from your 480 Hz frame rate.
[/QUOTE]
Oh yeah sorry, I confused on/off in the post.

Wow, I think you hit it with this!
When I do __GL_YIELD=USLEEP (without vsync) then the CPU usage drops from 100% to ~22% and the frametime rises from 2.08 ms (480 fps) to 2.17 ms (460 fps). So this really makes sense now, no? Nvidia chooses sched_yield() by default, which seems to be more effective than usleep(0), and therefore the framerate is slightly decreased with USLEEP (480 fps -> 460 fps). Also the CPU usage of 22% roughly makes sense compared to “8 x 2.5%”! So I think this explains everything, no?
Tbh I am not really sure what the big difference between usleep(0) and sched_yield is but it is conclusive for me. Big thanks to you!! It really gave me headaches!!

Dark_Photon · October 7, 2016, 5:46pm

[QUOTE=vbsvbs;1284157]
Wow, I think you hit it with this!
When I do __GL_YIELD=USLEEP (without vsync) then the CPU usage drops from 100% to ~22% and the frametime rises from 2.08 ms (480 fps) to 2.17 ms (460 fps). So this really makes sense now, no?[/QUOTE]

Yes, I think so.

Also the CPU usage of 22% makes roughly sense compared to “8 x 2.5%”! So I think this explains everything, no?

I don’t know where the 8 x 2.5% came from. But your new VSync OFF % CPU of 22% just suggests that the driver is waiting in a way that doesn’t burn so many CPU cycles. Before it was effectively doing a busy-wait.

Tbh I am not really sure what the big difference between usleep(0) and sched_yield is but it is conclusive for me. Big thanks to you!! It really gave me headaches!!

Sure thing. As to the difference… If there are no other threads ready to run, a sched_yield may bounce immediately back to your thread, effectively becoming a no-op (read busy-wait == high CPU usage). As to *sleep() , at least at one time, some *sleep calls on Linux were guaranteed to suspend your thread for at least a minimum time configured into the kernel. That was a while back though, so I’m not sure if this is still the case.

http://www.perlmonks.org/bare/?node_id=763235
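To make the difference concrete, here is a small sketch that waits for the same wall-clock time in two ways and reports how much CPU time each one consumed. It assumes a POSIX system; all function names are invented for this example, and it only illustrates the two waiting styles, not what the NVIDIA driver actually does internally.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() and nanosleep() */
#endif
#include <errno.h>
#include <sched.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Wait by repeatedly giving up the time slice. If nothing else is runnable,
   sched_yield() returns right away and this loop effectively spins. */
static void wait_by_yielding(double seconds)
{
    double deadline = monotonic_seconds() + seconds;
    while (monotonic_seconds() < deadline)
        sched_yield();
}

/* Wait by asking the kernel to suspend the thread. */
static void wait_by_sleeping(double seconds)
{
    struct timespec req, rem;
    if (seconds <= 0.0)
        return;
    req.tv_sec = (time_t)seconds;
    req.tv_nsec = (long)((seconds - (double)req.tv_sec) * 1e9);
    /* If a signal interrupts the sleep, continue with the remaining time. */
    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;
}

/* CPU time (not wall-clock time) the process consumed while wait_fn ran. */
static double cpu_seconds_during(void (*wait_fn)(double), double seconds)
{
    clock_t start = clock();
    wait_fn(seconds);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing `cpu_seconds_during(wait_by_yielding, 0.5)` with `cpu_seconds_during(wait_by_sleeping, 0.5)` on an otherwise idle machine would be expected to show the yielding version using CPU time close to the full wait and the sleeping version using very little, though the exact numbers depend on the kernel and on what else is running.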

vbsvbs · October 8, 2016, 10:44am

I meant the difference in framerate/frametime with vsync on and off. 60fps with vsync on -> ~480fps with vsync off is roughly a factor of 8. And when the 60fps produced a CPU load of 2.5% then it makes somehow sense to have now a load of 22% (equals roughly 8 times the 2.5%).

[QUOTE=Dark Photon;1284158]
Sure thing. As to the difference… If there are no other threads ready to run, a sched_yield may bounce immediately back to your thread, effectively becoming a no-op (read busy-wait == high CPU usage). As to *sleep() , at least at one time, some *sleep calls on Linux were guaranteed to suspend your thread for at least a minimum time configured into the kernel. That was a while back though, so I’m not sure if this is still the case.

http://www.perlmonks.org/bare/?node_id=763235[/QUOTE]
Very interesting read.

Big thanks again!