How to assess performance of OpenGL program?



2Wheels
04-11-2013, 11:40 AM
Hello Gurus,

I am learning OpenGL, with a hope of employment in the games industry.

I think I am getting to grips with it, including storing standard geometries on the GPU & re-using them, same for texture maps, varying level of detail by depth, drawing objects ordered by shader & texture etc. I am getting to grips with off-screen drawing for shadows & glows.

My question: how can I assess the performance of my programs? I have a laptop (Intel HD integrated graphics) and a desktop (NVidia GeForce GPU). I can look up the specs, but I have no idea how to tell whether I am getting anywhere near the performance envelope of the chips, or whether I have missed something and still have an inefficient implementation (and therefore still have things to learn).

I don't feel I need to wring every last ounce of performance out, but it would be nice to know that I am in the right order of magnitude for throughput / framerate.

Any suggestions, please?

Thanks

tonyo_au
04-11-2013, 06:10 PM
For NVidia GPUs with Visual Studio 2008/2010, you can use Nsight from NVidia.

Dark Photon
04-11-2013, 07:09 PM
Well, part of learning to optimize GPU rendering is getting familiar with the various ways you can get bottlenecked and what to do about them. You can get bottlenecked on your app side (solution: profile/optimize your app-side code). You can get bottlenecked doing state changes in the GL driver (solution: do more with fewer state changes). You can get bottlenecked on the GPU for various reasons, and so on. And your bottlenecks will change over the course of a frame.

Re your NVidia GeForce desktop GPU, which GPU is it? As an example of a good "performance wall" to measure your rendering code against (and one way you can get bottlenecked on the GPU), NVidia GeForce GPUs have a limit on the number of tris (triangles) they can set up per GPU core clock cycle. IIRC with anything GTX285 and newer, you get at most one tri setup per clock for standard rasterization. Also, with Fermi+ (e.g. GTX4xx+), you seem to have near-free triangle frustum culling, so culled tris apparently don't count toward the limit. So assuming your GeForce GPU is recent, take your GPU core clock rate in Hz, and that's about how many std tris/sec you can push through the card.
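To put a number on that rule of thumb, here is a minimal sketch in Python. The function name and the 700 MHz clock are made up for illustration; plug in the core clock your driver panel reports. It assumes the "one tri setup per clock" figure recalled above, which is a recollection rather than a published spec.

```python
def peak_tris_per_sec(core_clock_mhz, tris_per_clock=1.0):
    """Rough upper bound on triangle setup rate, in triangles per second.

    tris_per_clock=1.0 is the rule-of-thumb figure for standard
    rasterization quoted in this thread (an assumption, not a spec).
    """
    if core_clock_mhz <= 0 or tris_per_clock <= 0:
        raise ValueError("core_clock_mhz and tris_per_clock must be positive")
    return core_clock_mhz * 1e6 * tris_per_clock


if __name__ == "__main__":
    # 700 MHz is a placeholder clock, not a figure for any particular card.
    print(f"{peak_tris_per_sec(700.0) / 1e6:.0f} Mtris/sec")
```

The second parameter is there only so the same helper covers cases where a different per-clock rate applies.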

Now render a non-trivial scene start to finish and time it (with a glFinish() at both the beginning and the end of your timing interval, to ensure there's no stray GPU work leaking in or out). Note the tri count you throw at the GPU over that interval. Compute tris/sec over the interval, and divide that by the theoretical max tris/sec for your GPU that we established above. That'll give you a sense of what percentage of maximum GPU triangle throughput you're utilizing. Note: that's a separate question from whether you really need to be sending all those tris down the pipe in the first place, but it's a useful GPU utilization benchmark.
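The measurement described above could be sketched roughly as follows, in Python. Everything here is illustrative: `render_frame` stands for whatever issues your draw calls, and `finish` is any callable that blocks until the GPU is idle (with PyOpenGL you would pass `glFinish`). Taking both as parameters keeps the sketch runnable without a GL context. One choice made here that the post does not spell out: the first `finish()` runs just before the clock starts, so leftover work from earlier frames is not charged to the interval.

```python
import time


def measure_tris_per_sec(render_frame, finish, tris_per_frame, frames=100):
    """Time `frames` calls to render_frame and return triangles per second.

    render_frame: hypothetical callable that issues one frame's draw calls.
    finish: callable that blocks until the GPU is idle (e.g. glFinish).
    tris_per_frame: triangles submitted by each render_frame call.
    """
    finish()  # drain earlier GPU work so it does not leak into the interval
    start = time.perf_counter()
    for _ in range(frames):
        render_frame()
    finish()  # wait for the timed work to complete before stopping the clock
    elapsed = time.perf_counter() - start
    return tris_per_frame * frames / elapsed


def utilization(measured_tris_per_sec, peak_tris_per_sec):
    """Fraction of the theoretical triangle rate actually achieved."""
    return measured_tris_per_sec / peak_tris_per_sec
```

Multiply the result of `utilization` by 100 for a percentage. If vsync is on, the frame rate cap will also cap the measured rate, so for this kind of test you probably want it off or a scene heavy enough not to hit the cap.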

Also note that on Fermi+ (GTX480+) you can push 4 tris/clock with tessellation (and you allegedly can hit this rate even with std triangle rasterization on Fermi+ Quadros). But if you're just doing std tri rasterization on a GeForce GTX285+, 1 tri/clock is a good benchmark to compare against.

2Wheels
04-13-2013, 06:19 AM
Great reply ... a simple number like triangles compared to clock speed is just what I meant. Thanks very much. I have a GTX440 btw.

Dark Photon
04-13-2013, 04:59 PM
GTX440? I don't think there is such a beast, is there? Maybe you mean GT440. To see which you have, bring up "nvidia-settings" (or NVidia control panel). The GPU 0 tab will tell you what you have. Further, select the PowerMizer tab, and look at the highest clock rate in the Graphics Clock column to get your GPU core clock.

If it is a GT440, it looks like there are two versions: retail (810 MHz core freq) and OEM (594 MHz core freq). Assuming you're running SYNC_TO_VBLANK with a 60Hz LCD monitor in tow, and 1 tri/clock, dividing the core clock by 60 frames/sec gives a theoretical max throughput for std tri rasterization of about 13.5 Mtris/frame (retail) or 9.9 Mtris/frame (OEM).
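For anyone who wants to redo that arithmetic for a different clock or refresh rate, here is a small Python sketch. The function name is made up for illustration, and the 1 tri/clock default is the same rule-of-thumb assumption used earlier in the thread.

```python
def tris_per_frame_budget(core_clock_mhz, refresh_hz=60.0, tris_per_clock=1.0):
    """Theoretical triangles per frame when locked to the display refresh.

    Assumes the rule-of-thumb setup rate of tris_per_clock triangles
    per core clock cycle (an assumption, not a published spec).
    """
    return core_clock_mhz * 1e6 * tris_per_clock / refresh_hz


if __name__ == "__main__":
    for label, mhz in (("retail", 810.0), ("OEM", 594.0)):
        budget = tris_per_frame_budget(mhz)
        print(f"GT440 {label}: {budget / 1e6:.1f} Mtris/frame")
    # GT440 retail: 13.5 Mtris/frame
    # GT440 OEM: 9.9 Mtris/frame
```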