Number of performance states for Fermi GPUs

The question for the whole community:
How many performance states do you think Fermi GPUs have?

(Today I discovered something very interesting while using GLProfiler.) :slight_smile:

What is a “performance state”?

Performance states (P-States) are the GPU's performance-capability and power-consumption levels.

If you have ever started GPU Shark, you should have spotted the P-State indication. P-States severely impact GPU performance and should be carefully tracked during profiling. The problem is that vendors don't expose an API for that, although one already exists internally. That's why I asked for P-States to be exposed through the OpenGL API a month ago.

The Px notation (P0 through P15) is not adequate, since NV has never exposed a way to read the state in that form, so I'll reformulate the question:
How many combinations of GPU/Memory frequencies are possible on Fermi GPUs?

Interesting! You are asking NV to make the power control interface public? It can only be set through the NV control panel right now; an app has no way to tweak it.

Tell us… I demand you tell us what you found! :wink: …please.

Well, actually I'm only asking NV to make the interface for tracking P-States public. The API exists, since I can read all the frequencies, but it is not public. I guess the same is also true for AMD. You can keep power management to yourselves, but let us know what state we are currently in.

You can keep power management to yourselves, but let us know what state we are currently in.

And what would that mean, exactly? You query the NV-specific API and it says that it’s in PS3. OK… so what do you do about that? How does that help you accomplish something?

Also, the specific meaning of power states changes frequently, sometimes on a driver-to-driver basis. So “PS3” might not mean the same thing it used to. Indeed, one driver might want more power states than another.

And what of power states for different kinds of hardware from the same vendor? AMD's most recent Cayman-based GPUs, for example, have complex power management logic built directly into the hardware. The "frequency" changes on a clock-to-clock basis. Its "power states" are measured by the current pulled, not by a "frequency." Their earlier GPUs use a more traditional method. So they would have to release two APIs, one for the old kind and one for the new. Or else they would have to try to map one kind of "power state" onto the other.

I want to know the frequencies of the GPU and memory at every moment in order to be able to track states. I don't care whether it is NV hardware or not. All vendors have power management control.

Exactly! I don't need to know how many states they have or how frequent the changes are. I just need the current state reading.
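For what it's worth, NVIDIA's NVML library already exposes this kind of query on many cards, even though no graphics API does. A minimal sketch (assumes the NVML headers and library are installed; link with -lnvidia-ml; availability of individual queries varies by card and driver):

```c
/* Minimal sketch: polling the current P-State, clocks and utilization
   via NVML. Error handling mostly omitted. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlPstates_t pstate;
    nvmlUtilization_t util;
    unsigned int gpuMHz, memMHz;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);

    /* Current performance state: NVML_PSTATE_0 (max) .. NVML_PSTATE_15 (min). */
    nvmlDeviceGetPerformanceState(dev, &pstate);

    /* Current clocks, in MHz. */
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_GRAPHICS, &gpuMHz);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memMHz);

    /* Coarse utilization (GPU and memory controller, in percent). */
    nvmlDeviceGetUtilizationRates(dev, &util);

    printf("P%d: GPU %u MHz, MEM %u MHz, util %u%%/%u%%\n",
           (int)pstate, gpuMHz, memMHz, util.gpu, util.memory);

    nvmlShutdown();
    return 0;
}
```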

What does "current pulled" mean?

The load certainly dictates the state; I was talking about that from the very first moment. It is a dynamic process. Furthermore, modern graphics cards have a very aggressive power-state management policy. That is exactly why I need to know which state the GPU is in.

Why is the frequency so important, and why is dissipation controlled by changing the frequency? It is common knowledge that dynamic dissipation (in CMOS technology) depends linearly on the frequency but quadratically on the voltage, and lowering the frequency is what allows the voltage to be lowered as well.
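For reference, the standard first-order model (not stated in this thread, but standard in the CMOS literature) is:

```latex
P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} \, f
```

where α is the switching activity, C the switched capacitance, V the supply voltage, and f the clock frequency. Since a lower f typically permits a lower V, combined voltage/frequency scaling reduces dynamic power roughly with the cube of the frequency, which is why P-State transitions typically adjust voltage along with the clocks.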

What is the “traditional method”?

By whose authority? :wink:

I thought someone might guess, but obviously no one tried.
Well, there are more states than are reported by GPU Shark and similar utilities, or listed in NV's documentation. :slight_smile:

I'm not sure if it is a bug in the power management system or intended behavior, but I have encountered a state with the maximal GPU speed and the lowest memory speed. I could write more about it if there is any interest in the community in the topic. But first we have to clarify why P-States are so important.

I want to know the frequencies of the GPU and memory at every moment in order to be able to track states.

That doesn't explain what you intend to do with it, or what you intend to gain from that knowledge.

What does "current pulled" mean?

Electrical current. How much energy it uses. How much heat it dissipates.

What is the “traditional method”?

The traditional method being that the driver sets a clock speed based on the estimated load.
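In caricature, that policy looks something like the sketch below. This is invented pseudologic, not any vendor's real driver code; the state table, clocks and thresholds are all made up:

```c
/* Illustrative sketch of a load-driven P-State governor.
   The state table, clocks and thresholds are all invented. */
#include <stdio.h>

typedef struct { unsigned gpuMHz, memMHz; } PState;

static const PState table[] = {
    { 675, 1800 },   /* P0:  full 3D performance */
    { 405,  324 },   /* P8:  light 2D/video load */
    {  51,  135 },   /* P12: idle                */
};

/* Called periodically by the driver with an estimated load (0..100%). */
static int pick_pstate(int loadPercent)
{
    if (loadPercent > 60) return 0;   /* heavy load  -> P0  */
    if (loadPercent > 10) return 1;   /* medium load -> P8  */
    return 2;                         /* idle        -> P12 */
}

int main(void)
{
    int s = pick_pstate(75);
    printf("load 75%% -> table entry %d: %u/%u MHz\n",
           s, table[s].gpuMHz, table[s].memMHz);
    return 0;
}
```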

With that knowledge I could interpret profiling results more accurately and, together with the graphics_engine/frame_buffer/video_engine utilization counters, get a very precise estimate of the implemented algorithm's efficiency.

What causes that current pull? What increases the energy usage?

The traditional method being that the driver sets a clock speed based on the estimated load.

Do you have any reference for that? I would really like to read some official documents about the power management of modern GPUs. NV severely reduces frequencies if the load is low. Does AMD have a different policy? AMD builds GPUs from a larger number of simpler processing units. Does AMD "eliminate" a certain number of units according to the current load?

What causes that current pull? What increases the energy usage?

Lots of things. Executing memory fetches. Running shaders. Writing data to memory. You know, doing stuff on the GPU.

Does AMD have a different policy? AMD builds GPUs from a larger number of simpler processing units. Does AMD "eliminate" a certain number of units according to the current load?

Which AMD? Like NVIDIA, they make a lot of GPUs.

My point was that AMD’s recent Cayman architecture uses a completely different kind of power management, one that your “power states” concept cannot hope to map to. It’s dynamic, changing from clock to clock, and it lives entirely on the GPU.

On these chips, the driver says, “Don’t use more than 225W,” and the GPU doesn’t. Exactly what it’s doing on a moment-to-moment basis to keep itself as fully utilized as possible while not drawing more than that much power is all hard-wired GPU logic. It can clock itself up higher on certain loads, based on how much power that load is drawing, and then clock itself down on other loads, again based on power draw. And I don’t mean “every 10 frames”; I mean that this can happen many times within a frame.
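Again as a caricature with invented numbers, not AMD's actual logic, the contrast with a table-driven governor is roughly:

```c
/* Illustrative sketch of a PowerTune-style control loop: the clock is
   nudged continuously against a power budget instead of snapping
   between a few discrete states. All numbers are invented. */
#include <stdio.h>

#define POWER_CAP_W 225.0   /* board power budget             */
#define MAX_MHZ     880     /* highest engine clock           */
#define MIN_MHZ     500     /* lowest engine clock under load */

/* Runs every control interval (many times per frame, in hardware). */
static int adjust_clock(int curMHz, double estimatedWatts)
{
    if (estimatedWatts > POWER_CAP_W && curMHz > MIN_MHZ)
        return curMHz - 5;   /* over budget: back off */
    if (estimatedWatts < POWER_CAP_W && curMHz < MAX_MHZ)
        return curMHz + 5;   /* headroom: clock up    */
    return curMHz;
}

int main(void)
{
    int clk = MAX_MHZ;
    clk = adjust_clock(clk, 240.0);   /* power-virus-like load */
    printf("clock after one step: %d MHz\n", clk);
    return 0;
}
```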

I didn't ask for a story, but for a reference. The P-States are not my invention; the whitepapers for the Cayman architecture also mention P-States. The only "innovation" AMD has built into its new GPUs is finer-grained clock control: instead of several P-States, there are many more P-States (or, if you prefer, clock frequencies), controlled by the engine known as PowerTune.

It sounds like preaching from the Bible. :slight_smile:
Please use more precise and technically correct terms. What you wanted to say is that Cayman, like all modern GPUs, has TDP clamping.

Having finer-grained clock control doesn't mean that we should wander in the dark without knowing under what conditions our application executes.

Thank you for drawing my attention to AMD’s “world”!

I am still wondering how this API, if released, could benefit an application. In other words, what can applications do with this power-consumption/performance knowledge? If an application wants to use this knowledge to adjust its workload, it does not make sense, because the hardware itself will decide whether to push hard or run slowly based on the current workload. There would be an endless loop between application and hardware, racing over who determines the power consumption.

No, I really don't need to control power management; there is no doubt you are doing that job well. I'm just asking for the ability to track the current state. While profiling applications, I want to know under what conditions the measured performance is achieved.

By the way, thank you for the whitepapers available for download! AMD is quite open with that kind of information.

Let me illustrate what I meant with my request…

Suppose we have two algorithms, A and B.
Algorithm A requires about 5 ms to execute on the GPU.
Algorithm B finishes after 5.66 ms on the same GPU.
Both algorithms execute every frame, and the measured values are averages over 10000 frames. The measurements are done on a GTX 460.
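For anyone reproducing this kind of measurement: per-algorithm GPU times like these are typically taken with ARB_timer_query (core since OpenGL 3.3, so a GTX 460 supports it). A minimal sketch, assuming an existing GL context; error handling omitted:

```c
/* Sketch: timing one algorithm's GPU work with ARB_timer_query.
   Assumes a current OpenGL 3.3+ context. */
GLuint query;
GLuint64 elapsedNs;

glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
/* ... issue the draw calls of Algorithm A or B here ... */
glEndQuery(GL_TIME_ELAPSED);

/* GL_QUERY_RESULT blocks until the GPU finishes; a real profiler
   would double-buffer queries across frames to avoid the stall. */
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);

/* Accumulate elapsedNs over the 10000 frames and average. */
glDeleteQueries(1, &query);
```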

Which algorithm is faster, A or B?

Which algorithm is faster, A or B?

This question is essentially asking, “which algorithm takes the fewest clock cycles to complete?” I submit that the question is irrelevant.

Let’s say that you could tell that Algorithm A ran in Power State 2, while Algorithm B ran in Power State 3. And let’s also say that you knew the exact memory and shader clock speed of PS2 and PS3.
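For concreteness, the computation in question is nothing more than a clock normalization. In the obvious notation, with t the measured time and f the shader clock of the state each algorithm ran in:

```latex
\mathrm{cycles}_A = t_A \cdot f_{\mathrm{PS2}}, \qquad
\mathrm{cycles}_B = t_B \cdot f_{\mathrm{PS3}}
```

and the "faster" algorithm in the clock-independent sense is the one with fewer cycles.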

At which point, you can now compute the answer to that question. But what have you gained?

Nothing. So long as you cannot force Algorithm B to run in PS2, what good is it to know that Algorithm B is technically faster, but also somehow hits the unknown and unknowable combination of factors to provoke PS3 instead of PS2?

It doesn’t matter that Algorithm B may be theoretically faster than Algorithm A. So long as you cannot force the driver into PS2, that advantage is purely theoretical.

So in the end, you would implement Algorithm A in the product in question. Because it runs faster on the given hardware and driver version.

If all you’re interested in is an academic question, that’s one thing. But there is ultimately no practical benefit to answering this question.

Excuse me, but I have to disagree.

I don't need to force the driver into the higher P-State, because Algorithm B is twice as efficient as Algorithm A and can deliver the same performance with lower power consumption. On low-end cards, Algorithm B will actually be twice as fast, because the load would be high enough not to trigger a performance-state change. On high-end cards, the execution time would be twice as long as in our test example, because not one but maybe two P-States would be changed.

Could you explain to a user why his expensive graphics card actually gives a lower frame rate than his friend's less expensive card, without involving P-States in the explanation? Isn't an algorithm better if it manages to achieve an interactive frame rate with much lower power consumption?

It should also be stressed that graphics cards don't reduce frequencies unless the GPU utilization is low enough (or the load is high enough to cause TDP clamping).

Yes, I am interested in the academic aspect of measuring performance, but it is also a very real problem. Whoever wants to measure performance needs to know under which conditions the measurements are carried out. What is the purpose of an experiment if we are not aware of the conditions?

On low-end cards, Algorithm B will actually be twice as fast, because the load would be high enough not to trigger a performance-state change.

Let's assume this is true. You could easily find this out by testing on them, like you should be doing anyway. If you're making a real product, you should be testing on a wide variety of systems. At the very least, you should be testing on the lowest-end and highest-end cards.

It seems to me what you’re trying to do is get away without doing proper testing. You want to test on a GeForce 460 and try to extrapolate that down to mean something to a GeForce 430.

Isn't an algorithm better if it manages to achieve an interactive frame rate with much lower power consumption?

“Better” is a subjective question. What is “better” to me is the one that gives the fastest performance on the hardware of interest. Which ultimately requires testing on the hardware of interest.

It was not my intention, but you gave me a great idea. :wink:
Just kidding, of course! It would be a gaming-industry award-winning tool if it could test an application on a single machine and extrapolate the results to a wide variety of hardware.

I took the GTX 460 for the example because it is a mid-range graphics card and its highest P-State differs from the next one by a factor of 1.77 in GPU frequency (this varies from model to model, but 1.77 holds for DEV_0E22-SUBSYS_34FC1458-REV_A1).
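To make the arithmetic of the earlier example explicit, under the assumption that A ran in the top state and B in the next one, and counting only the GPU clock:

```latex
t_B^{\mathrm{norm}} = \frac{5.66\ \mathrm{ms}}{1.77} \approx 3.20\ \mathrm{ms} \;<\; 5\ \mathrm{ms} = t_A
```

That is, measured in GPU cycles rather than wall-clock time, B costs about 5/3.20 ≈ 1.56× less than A; differences in memory clock between the two states would shift the exact factor.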

Aleksander, I for one agree with you, and Alfonse IMO is just being difficult. You need a method of profiling that guarantees a consistent test environment, or comparison is nonsensical.

In the absence of a good method to control this environment, we’re left with disabling the annoying power management altogether and benching in apples-to-apples max perf mode.
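In the meantime, a practical stopgap is to sample the performance state around the timed region and discard any run during which it changed. A sketch using NVML as in the earlier example; `run_timed_algorithm` is a hypothetical stand-in for the benchmark body:

```c
/* Sketch: discard benchmark samples during which the P-State changed.
   `run_timed_algorithm` is a hypothetical stand-in for the timed work. */
#include <stdio.h>
#include <nvml.h>

extern double run_timed_algorithm(void);   /* returns GPU time in ms */

int main(void)
{
    nvmlDevice_t dev;
    nvmlPstates_t before, after;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlDeviceGetPerformanceState(dev, &before);
    double ms = run_timed_algorithm();
    nvmlDeviceGetPerformanceState(dev, &after);

    if (before == after)
        printf("%.3f ms, measured entirely in P%d\n", ms, (int)before);
    else
        printf("sample discarded: P-State changed P%d -> P%d\n",
               (int)before, (int)after);

    nvmlShutdown();
    return 0;
}
```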