Maybe this is not the most suitable place to ask such a question, but on the other hand is there any better place in the Universe to do it?
OK, let's be more serious.
Has anyone ever tried to use NVIDIA PerfKit for profiling OpenGL applications?
I've been using it for some time, but only to read some hardware counters. Recently I've started using simplified experiments, hoping they would give me useful hints about unit utilization and help isolate the bottleneck, but...
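For reference, this is roughly how I drive the experiments. Treat it as a sketch: the calls follow the NVPerfSDK experiment API as documented for PerfKit 2.x, the "GPU Bottleneck" counter name is taken from the PerfKit docs and may differ between SDK revisions, and RenderFrame() is a placeholder for my own rendering code.

```cpp
#include "NVPerfSDK.h"  // ships with PerfKit; header name may vary by version

// Simplified-experiment loop: PerfKit replays the same scene over several
// passes, instrumenting different units each pass, then reports per-unit
// bottleneck and utilization (SOL) percentages.
void RunBottleneckExperiment()
{
    if (NVPMInit() != NVPM_OK)
        return;

    // "GPU Bottleneck" is the experiment counter name from the PerfKit
    // docs; treat it as a placeholder if your SDK names it differently.
    NVPMAddCounterByName("GPU Bottleneck");

    UINT nPasses = 0;
    NVPMBeginExperiment(&nPasses);
    for (UINT i = 0; i < nPasses; ++i) {
        NVPMBeginPass(i);
        RenderFrame();        // placeholder: render the identical scene each pass
        NVPMEndPass(i);
    }
    NVPMEndExperiment();

    UINT64 value = 0, cycles = 0;
    NVPMGetCounterValueByName("GPU Bottleneck", 0, &value, &cycles);
    // For percentage-type counters the reading is value relative to cycles.

    NVPMShutdown();
}
```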
Something really unexpected is happening. Namely, instead of the shaders, as I expected, those experiments find the bottleneck in either FB or ROP.
I have to mention that the application under test does not use blending (in fact it does, but blending is skipped during the experiments), the stencil buffer, or anything else that would raise the ROP load. ROP is the blending unit; it handles both color blending and Z/stencil processing.
The FB (frame buffer) unit handles all memory read requests that miss the L1/L2 caches.
Another important fact is the readings of the other counters.
For the same scene, the simplified experiments returned:
ROP bottleneck: 68%
FB bottleneck: 13%
ROP utilization (SOL): 6%
FB utilization (SOL): 80%
(These values vary widely, though. Sometimes ROP and sometimes FB is reported as the bottleneck.)
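The readings above come from the regular per-frame sampling path, roughly as sketched below. Again this is only a sketch: the counter names are placeholders matching the labels I see in my PerfKit counter list, and the counters must have been registered with NVPMAddCounterByName() beforehand.

```cpp
#include <cstdio>
#include "NVPerfSDK.h"

// Per-frame sampling of real-time counters (as opposed to the multi-pass
// experiments above). Counter names are placeholders from my setup.
const char* kCounters[] = { "ROP Bottleneck", "FB Bottleneck",
                            "ROP SOL", "FB SOL" };

void SampleCountersOncePerFrame()
{
    UINT nCount = 0;
    NVPMSample(NULL, &nCount);  // latch all previously added counters

    for (const char* name : kCounters) {
        UINT64 value = 0, cycles = 0;
        NVPMGetCounterValueByName(name, 0, &value, &cycles);
        // Percentage-type counters report value relative to cycles.
        double pct = cycles ? 100.0 * (double)value / (double)cycles : 0.0;
        printf("%s: %.1f%%\n", name, pct);
    }
}
```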
I have tried trivial fragment shaders (just outputting black, or even discarding the fragment) on a G80 GPU, and FB is still reported as the bottleneck.
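To be concrete, these are the kinds of trivial shaders I mean, sketched here as GLSL source in C++ string literals (GLSL 1.20, which a G80 supports):

```cpp
// "Black" writes a constant color; "discard" kills every fragment, so with
// it nothing should ever reach blending or the color/Z write path at all.
const char* kBlackFS =
    "#version 120\n"
    "void main() { gl_FragColor = vec4(0.0, 0.0, 0.0, 1.0); }\n";

const char* kDiscardFS =
    "#version 120\n"
    "void main() { discard; }\n";
```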
Can anyone help me interpret these readings?
It seems to me that the simplified experiments don't return correct values, but on the other hand, how could such a problem persist for several years? (I've tried various drivers, from R266 to R332.)