Performance measurement

Hi,
I have developed a GPU-based ray casting system. The system consists of several fragment programs (kernels) executed in a loop, and I need exact timings for each kernel. To be as precise as possible, I count CPU ticks (using the RDTSC instruction). The problem is that

startTimer
    doSomeGlStuff
endTimer

returns immediately and gives incorrect results, because the GL calls are asynchronous and the GPU executes them concurrently with the CPU.
To remedy this, I instead use

glFinish()   // to finish any pending gl operations
startTimer
    doSomeGlStuff
glFinish()   // ditto
endTimer
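
In real code the pattern looks roughly like this (a minimal sketch only, assuming x86 and the __rdtsc() compiler intrinsic; kernelPass() is a hypothetical stand-in for one of my fragment programs):

    #include <cstdint>
    #ifdef _MSC_VER
    #include <intrin.h>      // __rdtsc() on MSVC
    #else
    #include <x86intrin.h>   // __rdtsc() on GCC/Clang
    #endif
    #include <GL/gl.h>

    void kernelPass();       // hypothetical: one fragment-program pass

    uint64_t timeKernelTicks()
    {
        glFinish();                  // drain any previously queued GL work
        uint64_t start = __rdtsc(); // read the CPU time-stamp counter
        kernelPass();                // the GL calls being measured
        glFinish();                  // block until the GPU has really finished
        return __rdtsc() - start;   // elapsed ticks, glFinish() cost included
    }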

The glFinish() calls here may adversely affect overall performance, though.
My question: is this the correct way to measure performance, or are there better, more precise ways?
Thanks

Well, you don’t really need an extremely fine-grained timer. You just need to run the code more often.

  1. Start timer
  2. render not once, but say 100 times
  3. Read timer. Divide result by 100 (see the sketch below).
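
Something like this, reusing the tick counter from the question (a sketch only; renderFrame() is a hypothetical placeholder for one full ray-casting pass, and dividing by N amortizes the timer and glFinish() overhead):

    // Same headers as the sketch in the question (__rdtsc(), GL/gl.h).
    void renderFrame();              // hypothetical: one full ray-casting pass

    uint64_t ticksPerFrame()
    {
        const int N = 100;
        glFinish();                  // start from an idle pipeline
        uint64_t start = __rdtsc();
        for (int i = 0; i < N; ++i)
            renderFrame();           // render 100 times, not once
        glFinish();                  // wait for all N frames to complete
        return (__rdtsc() - start) / N;   // average ticks per frame
    }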

In addition to the previous comment, I'd also like to say that glFinish() is the way to get timings for individual pieces. Once you start to string them together, unforeseen timing consequences can rear their heads: a glFinish() in the middle of a number of operations might either create a stall (making things seem slower) or hide e.g. filled buffers pending transfer to the GPU (making things seem faster).

glFinish() obviously also has an inherent overhead, which may or may not reliably be O(1) depending on what the OS is currently busy doing. In a micro-test you might find that an operation is not a concern, but once you factor in a program consuming 100% CPU while the driver wants to swap textures in and out of AGP (system) memory, it might turn out that an alternative operation would have been faster in the end.

As many before me have said, profiling is still more of a black art than a science, but I think you are on the right path.