Is the GPU being used?

Hi everyone,

I use Windows 7 + MinGW + GLUT to draw a simple sphere (glutSolidSphere, no lighting, no texture).
Then I define an idle function that calls glutPostRedisplay and increments a frame counter, and from that I compute the frames per second.
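The idle/FPS part is roughly like this (a simplified sketch, not my exact code; the names are illustrative):

/* Count frames in the idle callback and report FPS about once per second.
   Requires <GL/glut.h> and <stdio.h>. */
static int frameCount = 0;
static int lastTimeMs = 0;

void idle(void)
{
    int nowMs = glutGet(GLUT_ELAPSED_TIME);   /* ms since glutInit */

    frameCount++;
    if (nowMs - lastTimeMs >= 1000) {
        printf("%.1f fps\n", frameCount * 1000.0 / (nowMs - lastTimeMs));
        frameCount = 0;
        lastTimeMs = nowMs;
    }
    glutPostRedisplay();                      /* request the next frame */
}

/* registered once at startup with: glutIdleFunc(idle); */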

I wanted to “stress” my Athlon X2 5000+ and my Nvidia GeForce 8800 GT, so I tried to draw the sphere with 300 slices and 150 stacks (45,000 quads). I got only 27 frames per second.

I am a bit disappointed, because recent video games usually give me more frames per second, and I think those games draw at least as many polygons, plus shadows, lighting, antialiasing, etc.

So my question is: by default, does OpenGL use the GPU?
I ask because I noticed that my CPU load is very high when I run my exe. I have tried to check my GPU load but I haven’t found suitable software yet (I tried GPU-Z, but its GPU load sensor doesn’t work with my graphics card).

Thanks

glutSolidSphere probably does a lot of calculations if you are calling it every frame - I believe the expected usage is to call it once inside a display list (old OpenGL) and then call the display list each frame.
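If I recall correctly, the pattern is something like this (a sketch only; sphereList is a hypothetical global, and the radius/slices/stacks are just example values):

/* Compile the sphere once into a display list, then replay it each frame. */
GLuint sphereList;

void init(void)
{
    sphereList = glGenLists(1);
    glNewList(sphereList, GL_COMPILE);
    glutSolidSphere(1.0, 300, 150);   /* tessellation happens once, here */
    glEndList();
}

void display(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glCallList(sphereList);           /* replay the recorded commands */
    glutSwapBuffers();
}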

The GPU is much faster than your CPU.
But to take advantage of that power you must send all the data to GPU memory (with VBOs, VAOs, or other GPU-resident techniques) and then issue just a few commands to draw it.

Right now you are doing a lot of work on the CPU while your GPU sits idle, waiting for commands.
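A minimal VBO setup looks roughly like this (a sketch under assumptions: the vertices/vertexCount parameters are hypothetical, and on Windows the buffer-object entry points need an extension loader such as GLEW):

/* Upload the vertex data to GPU memory once at startup... */
GLuint vbo;

void initBuffer(const GLfloat *vertices, int vertexCount)
{
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER,
                 vertexCount * 3 * sizeof(GLfloat),
                 vertices, GL_STATIC_DRAW);      /* data now lives GPU-side */
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

/* ...then draw it each frame with just a couple of commands. */
void drawBuffer(int vertexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (void *)0);  /* source vertices from the VBO */
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);  /* one call, no per-vertex CPU work */
    glDisableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}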

As said above, glutSolidSphere is not optimized; it was designed to be called from within a display list:
http://www.google.com/codesearch/p?hl=en#ncwl7ziM06g/trunk/Box2D/freeglut/freeglut_geometry.c&q=glutSolidSphere%20package:http://box2d\.googlecode\.com&sa=N&cd=1&ct=rc&l=170&t=1

OK, I am learning how to use display lists.
At the moment, I can’t make the program go faster with a display list than without.
That is strange, because it actually runs slower.
I have checked with GPU-Z: the GPU memory used with the display list is bigger than without, but the CPU is still heavily loaded.

Here are the results I have:

  • 65,024 triangles => 57 fps without display list, 28 fps with display list
  • 1,046,528 triangles => 4.3 fps without, 1.7 fps with

I’ll keep trying to optimise it.

For those who are patient enough to read my program, here is the file I use.
I have written my own sphere-drawing function instead of using glutSolidSphere, because I wanted to draw each triangle in a different color from the adjacent triangle.
The STRIP_NBR define controls the number of triangles, and the WITH_DISPLAY_LIST define lets me compile the two versions, with and without the display list.
NB: I haven’t found any button to attach a file to a post. Is that still possible?

main.c

It’s also possible that display lists aren’t optimal with your hardware either, they’re quite old-hat these days. Your other big problem is use of glBegin/glEnd. You really want to be putting your data into an indexed vertex array and using glDrawElements instead (and you should prefer 16-bit indexes to 32-bit indexes unless you want to risk it going through software emulation on some hardware). This is essentially the method used by recent games and is the key to high performance on modern hardware - batch your primitives, index them so that your hardware’s vertex cache will be used, and - if appropriate (it’s not always) - put them in a vertex buffer so that they don’t need to be sent to the GPU every frame. Low number of draw calls, as little data as possible, and you get very high performance.
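In outline, the indexed approach looks something like this (a sketch with hypothetical arrays, not a drop-in replacement for your sphere code):

/* Client-side indexed vertex arrays with 16-bit indices. Each shared
   vertex is stored once in 'positions'/'colors' and referenced by index;
   note that 16-bit indices cap a batch at 65,536 vertices, so a very
   large mesh would need to be split. */
void drawIndexedMesh(const GLfloat *positions,   /* 3 floats per vertex */
                     const GLubyte *colors,      /* 3 bytes per vertex */
                     const GLushort *indices,    /* 16-bit index list */
                     int indexCount)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, positions);
    glColorPointer(3, GL_UNSIGNED_BYTE, 0, colors);

    /* One draw call for the whole mesh; indexing lets the hardware's
       vertex cache reuse shared vertices. */
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);

    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}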

Even if you don’t go the full vertex array route, move your glBegin/glEnd outside the loops like so:

void drawSphere(void) {
    int lonIndex;
    int latIndex;
#ifdef WITH_DISPLAY_LIST
    sphere = glGenLists(1);
    glNewList(sphere, GL_COMPILE);
#endif
    glBegin(GL_TRIANGLES);

    // up part of the sphere
    for (lonIndex = 0; lonIndex < 2 * STRIP_NBR; lonIndex++) {
        if (lonIndex % 2 == 0) {
            glColor3f(1.0, 0.0, 0.0);
        } else {
            glColor3f(0.9, 0.0, 0.0);
        }
            glVertex3f(0.0, 1.0, 0.0);
            glVertex3fv(vertexArray[lonIndex][STRIP_NBR - 2]);
            glVertex3fv(vertexArray[lonIndex + 1][STRIP_NBR - 2]);
    }
    
    // center part of the sphere
    for (lonIndex = 0; lonIndex < 2 * STRIP_NBR; lonIndex++) {
        for (latIndex = 0; latIndex < STRIP_NBR - 2; latIndex++) {
            glColor3f(1.0, 0.0, 0.0);
                glVertex3fv(vertexArray[lonIndex][latIndex]);
                glVertex3fv(vertexArray[lonIndex + 1][latIndex]);
                glVertex3fv(vertexArray[lonIndex + 1][latIndex + 1]);
            glColor3f(0.9, 0.0, 0.0);
                glVertex3fv(vertexArray[lonIndex][latIndex]);
                glVertex3fv(vertexArray[lonIndex + 1][latIndex + 1]);
                glVertex3fv(vertexArray[lonIndex][latIndex + 1]);
        }
    }
    
    // down part of the sphere
    for (lonIndex = 0; lonIndex < 2 * STRIP_NBR; lonIndex++) {
        if (lonIndex % 2 == 0) {
            glColor3f(1.0, 0.0, 0.0);
        } else {
            glColor3f(0.9, 0.0, 0.0);
        }
            glVertex3f(0.0, -1.0, 0.0);
            glVertex3fv(vertexArray[lonIndex][0]);
            glVertex3fv(vertexArray[lonIndex + 1][0]);
    }
    glEnd();
#ifdef WITH_DISPLAY_LIST
    glEndList();
#endif
}

If your driver is intelligent enough it should give you much better performance from that alone.

mhagain nailed it.

For the latter, switching from 65,000+ Begin/End pairs in the display list to a single pair improves perf from 140 fps (~7 ms/frame) to 2500 fps (0.4 ms/frame) – a 17× improvement. My guess is that each Begin/End constitutes a batch, and NVidia doesn’t always (if ever) merge batches within display lists.

Also note that even here on NVidia, which typically has stellar display list perf, the non-display list version was 190 fps (~5 ms/frame) – faster than the display list version with 65,000+ Begin/End pairs. But the display list version totally smokes it once we trim that down to one Begin/End pair (0.4 ms vs. ~5 ms/frame).

As mhagain said, you can likely improve on even the 0.4 ms/frame by using indexed arrays, and then stick them into a display list for maximum driver perf tweaking.
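That combination is just a matter of recording the single indexed draw call into the list, something like this (a sketch; drawIndexedMesh stands in for whatever indexed-array draw you end up with):

/* Record one indexed draw into a display list; the driver captures the
   dereferenced vertex data at compile time and can optimize it. */
sphereList = glGenLists(1);
glNewList(sphereList, GL_COMPILE);
drawIndexedMesh(positions, colors, indices, indexCount);  /* one batch, no Begin/End */
glEndList();

/* per frame: */
glCallList(sphereList);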

Thanks, guys, for those tips!
I’ve got my action plan for the next weeks/months :slight_smile:

As a first step, I’ve moved the glBegin/glEnd outside the loops as advised by mhagain, and it really helps: 10.8 fps for about 1,000,000 triangles instead of 4.3 fps. Not bad!