This section offers advice on making your OpenGL programs go faster and run smoother.
Perhaps the most common error in measuring the performance of an OpenGL program is to say something like:
    start_the_clock();
    draw_a_bunch_of_polygons();
    stop_the_clock();
    swapbuffers();
This fails because OpenGL implementations are almost always pipelined - that is to say, things are not necessarily drawn when you tell OpenGL to draw them - and the fact that an OpenGL call returned doesn't mean it finished rendering.
Typically, there is a large FIFO buffer on the front end of the graphics card. When your application code sends polygons to OpenGL, the driver places them into the FIFO and returns to your application. Sometime later, the graphics processor picks up the data from the FIFO and draws it.
Hence, measuring the time it took to pass the polygons to OpenGL tells you very little about performance. It is wise to use a 'glFinish' command between drawing the polygons and stopping the clock - this theoretically forces all of the rendering to be completed before it returns - so you get an accurate idea of how long the rendering took. That works to a degree - but experience suggests that not all implementations have literally completed processing by then.
Worse still, there is the possibility that when you started drawing polygons, the graphics card was already busy because of something you did earlier. Even a 'swapbuffer' call can leave work for the graphics system to do that can hang around and slow down subsequent operations in a mysterious fashion. The cure for this is to put a 'glFinish' call before you start the clock as well as just before you stop the clock.
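The bracketing described above can be sketched in C roughly as follows. This is a minimal sketch, not a definitive implementation: `elapsed_ms`, `time_bracketed_ms`, `sync` and `draw` are hypothetical names, `sync` is assumed to be a thin wrapper that calls glFinish(), and the C11 `timespec_get` wall clock stands in for whatever high-resolution timer your platform offers.

```c
#include <assert.h>
#include <time.h>

/* Milliseconds elapsed between two timestamps. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e3
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/*
 * Times draw() with a synchronisation call on both sides.
 * Assumption: sync is an application-supplied wrapper around glFinish(),
 * and draw issues the OpenGL calls being measured.
 */
double time_bracketed_ms(void (*sync)(void), void (*draw)(void))
{
    struct timespec start, end;

    assert(sync != NULL && draw != NULL);

    sync();                          /* drain work left over from earlier */
    timespec_get(&start, TIME_UTC);  /* wall clock; prefer a monotonic timer if available */
    draw();
    sync();                          /* wait for the work just submitted */
    timespec_get(&end, TIME_UTC);

    return elapsed_ms(start, end);
}
```

With a current OpenGL context, a call might look like `time_bracketed_ms(my_finish, draw_scene)`, where `my_finish` simply calls glFinish(). Since some implementations may not have fully finished when glFinish returns, the result is best treated as an estimate.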
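However, eliminating these overlaps in time can also be misleading, because in normal operation the CPU and the graphics card work in parallel. If you measure the time it takes to draw 1000 triangles (with a glFinish before and after), you'll probably be happy to discover that it takes less than twice as long to draw 2000 of them, since part of the cost is paid once per measurement rather than once per triangle.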
Graphics cards are quite complex parallel systems and it's exceedingly hard to measure precisely what's going on.
The best practical solution is to measure the time between consecutive returns from the swapbuffer command with your entire application running. You can then adjust one piece of the code at a time, remeasure and get a reasonable idea of the practical improvements you are getting as a result of your optimisation efforts.
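A small helper for measuring the time between consecutive swap returns might look like the sketch below. `FrameTimer` and `frame_timer_tick` are hypothetical names, and the caller is assumed to supply the current time in seconds from its own clock.

```c
#include <assert.h>
#include <stddef.h>

/* Remembers when the previous buffer swap returned. */
typedef struct {
    double last_seconds;  /* timestamp of the previous swap return */
    int    has_last;      /* 0 until the first timestamp is recorded */
} FrameTimer;

/*
 * Call immediately after the buffer swap returns, passing the current time
 * in seconds. Returns the time since the previous call in milliseconds, or
 * -1.0 on the first call, when there is no earlier frame to compare with.
 */
double frame_timer_tick(FrameTimer *timer, double now_seconds)
{
    double frame_ms = -1.0;

    assert(timer != NULL);

    if (timer->has_last) {
        frame_ms = (now_seconds - timer->last_seconds) * 1000.0;
    }
    timer->last_seconds = now_seconds;
    timer->has_last = 1;
    return frame_ms;
}
```

In an application that happens to use GLFW, for example, the render loop could call `glfwSwapBuffers(window)` followed by `frame_timer_tick(&timer, glfwGetTime())`. Averaging the result over many frames smooths out noise.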
Even when you do this, you may have to take a little care. Some systems force the swapbuffers command to wait for the next video vertical retrace before performing the swap. If that is the case, you'll only ever see times that are a whole multiple of the video frame time, and it will be impossible to see exactly how much time you are consuming. Most PC graphics adaptors do not do this by default, so you would probably have had to take an active step to turn this feature on. Still, if you have (say) a 60Hz monitor and your times all come out at 16.67ms, 33.33ms or 50ms, suspect that this is what is happening.
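The "whole multiple of the frame time" symptom can be checked mechanically. The sketch below is one possible test, with a hypothetical function name and a caller-chosen tolerance. A single sample proves little, so it would normally be applied to many consecutive frame times.

```c
#include <assert.h>

/*
 * Returns 1 if frame_ms lies within tolerance_ms of a whole multiple
 * (1x, 2x, 3x, ...) of the monitor's refresh period, otherwise 0.
 */
int looks_vsync_locked(double frame_ms, double refresh_hz, double tolerance_ms)
{
    double period_ms, nearest_ms, diff_ms;
    long multiple;

    assert(frame_ms > 0.0 && refresh_hz > 0.0 && tolerance_ms >= 0.0);

    period_ms = 1000.0 / refresh_hz;
    multiple = (long)(frame_ms / period_ms + 0.5);  /* round to nearest */
    if (multiple < 1) {
        return 0;  /* shorter than one refresh period: not locked */
    }
    nearest_ms = (double)multiple * period_ms;
    diff_ms = frame_ms - nearest_ms;
    if (diff_ms < 0.0) {
        diff_ms = -diff_ms;
    }
    return diff_ms <= tolerance_ms;
}
```

If nearly every frame passes this test on a 60Hz display, the swap is probably waiting for the vertical retrace.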
FPS vs. Frame Time
It is common for people to measure FPS (Frames Per Second, sometimes called Frame Rate). Most gamers and new programmers treat this as the rendering speed. FPS is not a great way to measure performance because it is not linear: equal changes in FPS do not correspond to equal changes in the time spent per frame. It is better to measure Frame Time, which is the inverse of the FPS (or rather, FPS is the inverse of Frame Time). So if the FPS is 25, the Frame Time is 1/25 = 0.04 seconds. An FPS of 200 means the Frame Time is 0.005 seconds. To explain the problem with FPS, consider the following example: Let's say the FPS is 180. You add a few models to your scene and you end up with 160. How bad is that? Yes, you did lose 20 FPS, but how many seconds longer is each frame taking?
1/160 - 1/180 = 0.00069 seconds
Let's continue the example. Assume your FPS is 60 and you add some models and your FPS drops to 40. How bad is that?
1/40 - 1/60 = 0.00833 seconds
Although the FPS dropped by 20 in both cases, the slowdown in the first case isn't very big, but in the latter case it is more than ten times bigger.
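The same arithmetic can be written as two small C functions. The names are hypothetical, and the results are in milliseconds rather than seconds for readability.

```c
#include <assert.h>

/* Frame time in milliseconds for a given frame rate. */
double frame_time_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Extra milliseconds per frame when the frame rate changes. */
double added_frame_time_ms(double fps_before, double fps_after)
{
    return frame_time_ms(fps_after) - frame_time_ms(fps_before);
}
```

For the two cases above, `added_frame_time_ms(180.0, 160.0)` is a little under 0.7 ms, while `added_frame_time_ms(60.0, 40.0)` is a little over 8.3 ms.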
Another thing that is commonly mentioned is that something makes rendering 3x (or, loosely, 300%) faster. Most of the time, this means that the FPS is 3x as high. If the old Frame Time is X, the new Frame Time is 1/(3*(1/X)) = X/3, so the improvement is X - X/3 = (2/3)*X. A similar case is when rendering speed is increased by, for example, 50%. This means that the FPS is 50% higher, so the new Frame Time is X/1.5 = (2/3)*X and the improvement is X/3. Note the difference between "an increase of P%" and "P% as fast": an increase of P% usually corresponds to (100+P)% as fast.
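As a rough sketch, the conversion from an FPS claim to the fraction of Frame Time saved can be expressed in C like this. Both function names are hypothetical.

```c
#include <assert.h>

/* Converts "P% more FPS" into a speedup factor, e.g. 50 -> 1.5. */
double speedup_from_percent_increase(double percent)
{
    assert(percent > -100.0);
    return 1.0 + percent / 100.0;
}

/* Fraction of the old frame time saved when FPS is multiplied by speedup. */
double frame_time_saved_fraction(double speedup)
{
    assert(speedup > 0.0);
    return 1.0 - 1.0 / speedup;
}
```

A speedup of 3 saves two thirds of the old Frame Time, while a 50% increase in FPS saves only one third.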
Performance is not Linear
The Frame Time is not proportional to the number of things that you render. For example, there can be a constant-time job that the driver has to do for every frame. Measuring the performance when rendering very little (or perhaps nothing at all) is therefore meaningless. In fact, a faster GPU on a faster computer might actually be slower when doing very small jobs. However, when rendering more things (as in most applications and games), the high-end computer would most likely be faster.
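One way to make the constant per-frame cost visible is to time two workloads of different sizes and fit a straight line through them. The sketch below assumes a deliberately simplified model (a fixed cost plus a cost per item), which real hardware will only follow approximately. `FrameCostModel` and `fit_frame_cost` are hypothetical names.

```c
#include <assert.h>

/* Simplified model: frame_ms = fixed_ms + per_item_ms * items. */
typedef struct {
    double fixed_ms;     /* cost paid every frame regardless of content */
    double per_item_ms;  /* additional cost for each item rendered */
} FrameCostModel;

/* Fits the model to two measurements taken with different item counts. */
FrameCostModel fit_frame_cost(double items_a, double ms_a,
                              double items_b, double ms_b)
{
    FrameCostModel model;

    assert(items_a != items_b);

    model.per_item_ms = (ms_b - ms_a) / (items_b - items_a);
    model.fixed_ms = ms_a - model.per_item_ms * items_a;
    return model;
}
```

With made-up measurements of 3 ms for 1000 items and 5 ms for 2000 items, the fit gives a fixed cost of about 1 ms per frame, which is why doubling the work did not double the time.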
Performance and Monitors
There is also the refresh rate of the monitor to consider. If the refresh rate is 60 Hz, then the monitor can show 60 different images per second and no more. If your FPS is 100, some of those frames will never be seen. It is recommended to enable vsync, or at least to give the user the option to activate or deactivate it. You can read about it in the Swap Interval article. If you are making a game, you will want to know what your monitor's refresh rate is. Rendering, physics and game logic together need to take less than one refresh interval. For example, if your monitor's refresh rate is 60 Hz, each frame lasts 1/60 = 0.01667 seconds, so you have 0.01667 seconds to render, do your physics and do your game logic. If you have some time left over, that is not a problem.
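The frame budget can be sketched as a simple check. The function names and the three-way split into render, physics and logic times are assumptions made for illustration.

```c
#include <assert.h>

/* Time available per frame, in milliseconds, at a given refresh rate. */
double frame_budget_ms(double refresh_hz)
{
    assert(refresh_hz > 0.0);
    return 1000.0 / refresh_hz;
}

/* Returns 1 if the per-frame work fits within one refresh interval. */
int fits_frame_budget(double render_ms, double physics_ms, double logic_ms,
                      double refresh_hz)
{
    double total_ms = render_ms + physics_ms + logic_ms;
    return total_ms <= frame_budget_ms(refresh_hz);
}
```

At 60 Hz the budget is roughly 16.67 ms, so a frame that spends 8 ms rendering, 4 ms on physics and 3 ms on logic fits, while 10 ms, 5 ms and 3 ms does not.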
Understanding where the bottlenecks are
There are generally four things to look at initially.
- CPU performance: If your code is so slow that it's not feeding the graphics pipe at the maximum rate it could go - then improving the nature of the things you actually draw won't help much.
- Bus bandwidth: There is a finite limit to the rate at which data can be sent from the CPU to the graphics card. If you require too much data to describe what you have to draw - then you may be unable to keep the graphics card busy simply because you can't get it the data it needs.
- Vertex performance: The first thing of any significance that happens on the graphics card is vertex processing. A Vertex Shader can be a bottleneck. This is usually easy to diagnose by replacing the shader with a simpler one (for example, one without lighting calculations) and seeing whether things speed up. If they do - then the odds are good that you are vertex limited.
- Fragment performance: After vertex processing, the polygons are chopped up into fragments (typically the size of a pixel) and the fragment processing takes over. Fragment processing takes longer for large polygons than for small ones - and generally gets slower the more textures you use - and the more complex your fragment shader is.
To find out whether you are fragment processing bound, a really simple test is to reduce the size of the window you are rendering into down to the size of a postage stamp and see if your frame rate improves. If it does then you are at least partially fill rate limited - if it doesn't then that's not the problem. This is such a simple test that it should always be the first thing you try.
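The window-shrinking test boils down to comparing two frame times. A minimal sketch follows, with a hypothetical function name and a caller-chosen threshold for what counts as a meaningful improvement.

```c
#include <assert.h>

/*
 * Compares the frame time at full window size with the frame time in a
 * tiny window. Returns 1 if shrinking the window saved at least
 * min_gain_fraction of the full-size frame time, otherwise 0.
 */
int looks_fill_limited(double full_window_ms, double tiny_window_ms,
                       double min_gain_fraction)
{
    double gain;

    assert(full_window_ms > 0.0 && tiny_window_ms > 0.0);

    gain = (full_window_ms - tiny_window_ms) / full_window_ms;
    return gain >= min_gain_fraction;
}
```

For instance, a drop from 20 ms to 8 ms suggests the application is at least partly fill rate limited, whereas a drop from 20 ms to 19.8 ms points elsewhere.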
There are more subtleties here - but this is a start.
Deprecated functionality is often a lot slower than core functionality. This is NOT because it is deprecated; it was deprecated because it was slow. Not all deprecated functionality is slow, however.
CPU and Bus limits
If you are using some deprecated functionality, for example immediate mode (glBegin() ... glEnd(), etc.), consider switching to more modern alternatives. Immediate mode requires one or more function calls per vertex, so with a large number of vertices, the CPU has to do a lot of work. Using Vertex Buffer Objects is a more efficient alternative.
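The CPU-side half of the buffer object approach is simply to build all vertices in one contiguous array, so that they can be handed to the driver together. The sketch below shows only that packing step. `Vertex` and `fill_grid` are hypothetical names, and the upload call appears in a comment because it needs a live OpenGL context.

```c
#include <assert.h>
#include <stddef.h>

/* One vertex position, laid out contiguously for upload. */
typedef struct {
    float x, y, z;
} Vertex;

/*
 * Writes cols * rows vertices of a flat grid into out and returns the
 * number written, or 0 if capacity is too small to hold them all.
 */
size_t fill_grid(Vertex *out, size_t capacity, size_t cols, size_t rows)
{
    size_t r, c, n = 0;

    assert(out != NULL);

    if (cols * rows > capacity) {
        return 0;
    }
    for (r = 0; r < rows; r++) {
        for (c = 0; c < cols; c++) {
            out[n].x = (float)c;
            out[n].y = (float)r;
            out[n].z = 0.0f;
            n++;
        }
    }
    return n;
}

/*
 * With a current context and a buffer object bound to GL_ARRAY_BUFFER, the
 * whole array can then be uploaded in one call instead of one call per vertex:
 *
 *   glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)(n * sizeof(Vertex)),
 *                out, GL_STATIC_DRAW);
 */
```

The point of the design is that the number of API calls no longer grows with the number of vertices.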
Using the legacy fixed-function pipeline when you don't need its features (such as materials and lights) might force the GPU to do some unnecessary work. Instead, use small shaders that do only what you need.
If you are fragment limited, consider moving some computations to the Vertex Shader when possible. Then let the GPU interpolate the values for each fragment. The results may not look as nice, so this is a trade-off between performance and quality.
Not using mipmaps for your textures can give you poor performance. Whether the texture is 1D, 2D, 3D or a cubemap, performance can be improved if you generate a full or partial mipmap chain.
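To give a feel for what a full mipmap chain involves, the sketch below counts its levels for a 2D texture, assuming each level halves both dimensions (rounding down, never below 1) until a 1x1 level is reached. The function name is hypothetical.

```c
#include <assert.h>

/* Number of levels in a full mipmap chain for a width x height 2D texture. */
int mip_level_count(int width, int height)
{
    int levels = 1;  /* level 0 is the full-size image */

    assert(width > 0 && height > 0);

    while (width > 1 || height > 1) {
        if (width > 1) {
            width /= 2;
        }
        if (height > 1) {
            height /= 2;
        }
        levels++;
    }
    return levels;
}
```

A 256x256 texture has 9 levels. In OpenGL 3.0 and later the chain itself can be generated for the currently bound texture with `glGenerateMipmap(GL_TEXTURE_2D)`.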
Anisotropic texture filtering can be very costly. This is because the GPU will need to sample the texture at multiple locations and run an algorithm on the many texels it reads and then return the final result to your shader.