Hi,
I have developed a GPU-based ray casting system. The system consists of several fragment programs (kernels) executed in a loop, and I need exact timing for each kernel. To be as precise as possible I count CPU ticks (using the RDTSC instruction). However, a naive sequence like
startTimer
doSomeGlStuff
endTimer
returns immediately and gives incorrect results, because GL calls are asynchronous and the GPU executes them concurrently with the CPU.
To remedy this, I instead use
glFinish() // to finish any pending gl operations
startTimer
doSomeGlStuff
glFinish() // ditto
endTimer
The glFinish() calls here may adversely affect overall performance, though.
My question: is this the correct way to measure performance, or are there better, more precise ways?
Thanks
In addition to the current comment, I'd also like to say that glFinish() is the way to get timings for individual pieces. Once you start to string them together, unforeseen timing consequences can rear their head: a glFinish() in the middle of a number of operations might either create a stall (making it seem slower) or even hide e.g. filled buffers pending transfer to the GPU (making it seem faster).
glFinish() obviously also has an inherent overhead, which may or may not reliably be O(1) depending on what the OS is currently busy doing. In a micro-test you might find that an operation isn't a concern, but once you factor in a program consuming 100% CPU while the driver wants to swap textures in and out of AGP (system) memory, it might turn out that an alternative operation would have been faster in the end.
As many before me have said, profiling is still more of a black art than a science, but I think you are on the right path.