Number of performance states for Fermi GPUs



Aleksandar
10-11-2011, 12:05 PM
The question for the whole community:
How many performance states do you think Fermi GPUs have?

(Today I have discovered something very interesting by using GLProfiler) :)

Alfonse Reinheart
10-11-2011, 12:37 PM
What is a "performance state"?

Aleksandar
10-11-2011, 01:19 PM
The performance states (P-States) are GPU performance capability and power consumption states.


P-States range from P0 to P15, with P0 being the highest performance/power state, and P15 being the lowest performance/power state. Each P-State maps to a performance level. Not all P-States are available on a given system. The P-States are currently defined as follows:
- P0/P1 - Maximum 3D performance
- P2/P3 - Balanced 3D performance-power
- P8 - Basic HD video playback
- P10 - DVD playback
- P12 - Minimum idle power consumption

If you have ever started GPU Shark you should have spotted the P-State indication. P-States severely impact GPU performance and should be carefully tracked during profiling. The problem is that vendors don't expose an API for that, although it already exists. That's why I asked for P-States to be exposed through the OpenGL API a month ago.

Px-notations (P0 through P15) are not adequate since NV has never exposed the way to read it in such form, so I'll reformulate the question:
How many combinations of GPU/Memory frequencies are possible on Fermi GPUs?

Hongwei Li
10-11-2011, 04:18 PM
Interesting! You are asking NV to make the power control interface public? It can only be set by the GPU control panel right now. An app has no way to tweak it.

Chris Lux
10-12-2011, 12:43 AM
(Today I have discovered something very interesting by using GLProfiler) :)
tell us... i demand you tell us what you found! ;) ...please.

Aleksandar
10-12-2011, 12:56 AM
Well, actually I'm only asking NV to make the interface for tracking P-States public. The API exists, since I can read all the frequencies, but it is not public. The same is also true for AMD, I guess. You can keep power management to yourself, but let us know what state we are currently in.

Alfonse Reinheart
10-12-2011, 02:37 AM
You can keep power management for yourself, but let us know at what state we are currently.

And what would that mean, exactly? You query the NV-specific API and it says that it's in PS3. OK... so what do you do about that? How does that help you accomplish something?

Also, the specific meaning of power states changes frequently, sometimes on a driver-to-driver basis. So "PS3" might not mean the same thing it used to. Indeed, one driver might want more power states than another.

And what of power states for different kinds of hardware from the same vendor? AMD's most recent Cayman-based GPUs, for example, have complex power management logic built directly into the hardware. The "frequency" changes on a clock-to-clock basis. Its "power states" are measured by current pulled, not a "frequency." Their earlier GPUs use a more traditional method. So they would have to release two APIs, one for the old kind and one for the new. Or else they would have to try to map one kind of "power state" onto another.

Aleksandar
10-12-2011, 05:14 AM
And what would that mean, exactly? You query the NV-specific API and it says that it's in PS3. OK... so what do you do about that? How does that help you accomplish something?
I want to know the frequencies of GPU and memory at every moment in order to be able to track states. I don't mind whether it is NV hardware or not. All vendors have power management control.


Also, the specific meaning of power states changes frequently, sometimes on a driver-to-driver basis. So "PS3" might not mean the same thing it used to. Indeed, one driver might want more power states than another.
Exactly! I don't need to know how many states they have or how frequent changes are. I just need the current state reading.


And what of power states for different kinds of hardware from the same vendor? AMD's most recent Cayman-based GPUs, for example, have complex power management logic built directly into the hardware. The "frequency" changes on a clock-to-clock basis. Its "power states" are measured by current pulled, not a "frequency." Their earlier GPUs use a more traditional method. So they would have to release two APIs, one for the old kind and one for the new. Or else they would have to try to map one kind of "power state" onto another.
What does it mean "current pulled"?

The load certainly dictates the state. I have been talking about that from the very first moment. It is a dynamic process. Furthermore, modern graphics cards have a very aggressive power state management policy. That is exactly the reason I need to know which state the GPU is in.

Why is the frequency so important (given that the dissipation is controlled by changing the frequency)? It is common knowledge that the dynamic dissipation in CMOS technology depends linearly on the frequency and on the second power of the voltage, and a lower frequency is what allows the voltage to be lowered as well.
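
As a quick sanity check on how clock and voltage enter the picture, here is a minimal sketch of the textbook first-order model of dynamic CMOS power, P = a * C * V^2 * f. The function name and all sample numbers (activity factor, capacitance, voltage, clocks) are made up for illustration and are not measurements of any real GPU.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order dynamic CMOS power in watts: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating points (illustrative numbers only).
full = dynamic_power(0.2, 1e-7, 1.0, 800e6)
half_clock = dynamic_power(0.2, 1e-7, 1.0, 400e6)
half_clock_low_v = dynamic_power(0.2, 1e-7, 0.8, 400e6)

print(half_clock / full)        # clock alone: power scales linearly
print(half_clock_low_v / full)  # clock plus voltage: extra V^2 saving
```

In this model, halving the clock alone halves the power, while the larger saving comes from the voltage reduction that the lower clock makes possible.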

What is the "traditional method"?



tell us... i demand you tell us what you found! ...please.
By whose authority? ;)

I thought someone might guess, but obviously no one tried.
Well, there are more states than reported by GPU Shark or similar utilities, or in NV's documentation. :)

I'm not sure if it is a bug in the power management system or intended behavior, but I have encountered a state with the maximal GPU speed and the lowest memory speed. I could write more about it if there is any interest in the community for the topic. But first we have to clarify why P-States are so important.

Alfonse Reinheart
10-12-2011, 11:34 AM
I want to know the frequencies of GPU and memory at every moment in order to be able to track states.

That doesn't explain what you intend to do with that. What you intend to gain by that knowledge.


What does it mean "current pulled"?

Electrical current. How much energy it uses. How much heat it dissipates.


What is the "traditional method"?

The traditional method being that the driver sets a clock speed based on the estimated load.

Aleksandar
10-12-2011, 01:13 PM
That doesn't explain what you intend to do with that. What you intend to gain by that knowledge.
With that knowledge I could interpret profiling results more accurately and, along with graphics_engine/frame_buffer/video_engine utilization, get a very precise estimate of the efficiency of the implemented algorithm.


Electrical current. How much energy it uses. How much heat it dissipates.
What is the cause of that current pull? What does increase the energy usage?


The traditional method being that the driver sets a clock speed based on the estimated load.
Do you have any reference for that? I would really like to read some official documents about the power management of modern GPUs. NV severely reduces frequencies if the load is low. Does AMD have a different policy? AMD builds GPUs from a higher number of simpler processing units. Does AMD "eliminate" a certain number of units according to the current load?

Alfonse Reinheart
10-12-2011, 02:08 PM
What is the cause of that current pull? What does increase the energy usage?

Lots of things. Executing memory fetches. Running shaders. Writing data to memory. You know, doing stuff on the GPU.


Does AMD have a different policy? AMD builds GPUs from a higher number of simpler processing units. Does AMD "eliminate" a certain number of units according to the current load?

Which AMD? Like NVIDIA, they make a lot of GPUs.

My point was that AMD's recent Cayman architecture uses a completely different kind of power management, one that your "power states" concept cannot hope to map to. It's dynamic, changing from clock to clock, and it lives entirely on the GPU.

On these chips, the driver says, "Don't use more than 225W," and the GPU doesn't. Exactly what it's doing on a moment-to-moment basis to keep itself as fully utilized as possible while not drawing more than that much power is all hard-wired GPU logic. It can clock itself up higher on certain loads, based on how much power that load is drawing, and then clock itself down on other loads, again based on power draw. And I don't mean "every 10 frames"; I mean that this can happen many times within a frame.

Aleksandar
10-12-2011, 03:06 PM
I didn't ask for a story, but for a reference. The P-States are not my invention. The whitepapers for the Cayman architecture also mention P-States. The only "innovation" AMD has built into the new GPUs is finer-grained clock control. Instead of several P-States, there are many more P-States (or, if you prefer, clock frequencies) controlled by the engine known as PowerTune.


On these chips, the driver says, "Don't use more than 225W," and the GPU doesn't.
It sounds like preaching from the Bible. :)
Please, use more precise and technically correct terms. What you wanted to say is that Cayman, like all modern GPUs, has TDP clamping.

Having a more fine-grain clock control doesn't mean that we should wander in the darkness without knowing under what conditions our application executes.

Thank you for drawing my attention to AMD's "world"!

Hongwei Li
10-13-2011, 09:20 PM
I am still wondering how this API, if released, could benefit applications. In other words, what can applications do with this knowledge of power consumption and performance? If an application wants to use this knowledge to adjust its workload, that does not make sense, because the hardware itself decides whether to push hard or run slowly based on the current workload. There would be an endless loop between application and hardware, racing over who determines the power consumption.

Aleksandar
10-14-2011, 07:55 AM
No, I really don't need to control power management. There is no doubt you are doing the job well. I just ask for the possibility to track current state. While profiling applications, I want to know under what conditions the measured performance is achieved.
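
A minimal sketch of what such tracking could look like during profiling: every timing sample is tagged with the clocks that were active, and averages are reported per state instead of as one blended number. The function name and the sample data are hypothetical. How the clocks are actually read is exactly the missing public interface, so here they are simply given as input.

```python
from collections import defaultdict


def average_by_state(samples):
    """Group (time_ms, core_mhz, mem_mhz) samples by clock pair and average the times."""
    buckets = defaultdict(list)
    for time_ms, core_mhz, mem_mhz in samples:
        buckets[(core_mhz, mem_mhz)].append(time_ms)
    return {state: sum(times) / len(times) for state, times in buckets.items()}


# Hypothetical frame timings: the clocks are invented for illustration.
samples = [
    (5.0, 800, 2000),
    (5.2, 800, 2000),
    (9.0, 400, 2000),
    (9.4, 400, 2000),
]

blended = sum(t for t, _, _ in samples) / len(samples)
per_state = average_by_state(samples)
print(blended)    # one number that hides the state changes
print(per_state)  # separate averages for each clock combination
```

The blended average describes neither state, which is why the conditions of each measurement matter.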

By the way, thank you for the whitepapers available for download! AMD is quite open for that kind of information.

Aleksandar
10-14-2011, 11:46 AM
Let me illustrate what I meant with my request...

Suppose we have two algorithms: A and B.
Algorithm A requires about 5 ms to execute on the GPU.
Algorithm B finishes after 5.66 ms on the same GPU.
Both algorithms execute each frame, and the measured values are averages over 10000 frames. The measurements were taken on a GTX460.

Which algorithm is faster, A or B?

Alfonse Reinheart
10-14-2011, 01:56 PM
Which algorithm is faster, A or B?

This question is essentially asking, "which algorithm takes the fewest clock cycles to complete?" I submit that the question is irrelevant.

Let's say that you could tell that Algorithm A ran in Power State 2, while Algorithm B ran in Power State 3. And let's also say that you knew the exact memory and shader clock speed of PS2 and PS3.

At which point, you can now compute the answer to that question. But what have you gained?

Nothing. So long as you cannot force Algorithm B to run in PS2, what good is it to know that Algorithm B is technically faster, but also somehow hits the unknown and unknowable combination of factors to provoke PS3 instead of PS2?

It doesn't matter that Algorithm B may be theoretically faster than Algorithm A. So long as you cannot force the driver into PS2, that advantage is purely theoretical.

So in the end, you would implement Algorithm A in the product in question. Because it runs faster on the given hardware and driver version.

If all you're interested in is an academic question, that's one thing. But there is ultimately no practical benefit to answering this question.

Aleksandar
10-14-2011, 03:38 PM
Excuse me, but I have to disagree.

It doesn't matter that Algorithm B may be theoretically faster than Algorithm A. So long as you cannot force the driver into PS2, that advantage is purely theoretical.
I don't need to force the driver into a higher P-State, because Algorithm B is twice as efficient as Algorithm A and can achieve the same performance with lower power consumption. On low-end cards, Algorithm B will be actually twice as fast, because the load would be high enough not to trigger changing performance state. On high-end cards, the execution time would be twice as long as in our test example, because not one, but maybe two P-States would be changed.

Could you explain to any user why his expensive graphics card actually gives a lower frame rate than his friend's less expensive card without bringing P-States into the explanation? Isn't it a better algorithm if it manages to reach an interactive frame rate with much lower power consumption?

It should also be stressed that graphics cards don't reduce frequencies unless the GPU utilization is low enough (or high enough to cause TDP clamping).


If all you're interested in is an academic question, that's one thing. But there is ultimately no practical benefit to answering this question.
Yes, I am interested in the academic aspect of measuring performance, but it is also a very real problem. Whoever wants to measure performance needs to know under which conditions the measurements are carried out. What is the purpose of the experiment if we are not aware of the conditions?

Alfonse Reinheart
10-14-2011, 04:33 PM
On low-end cards, Algorithm B will be actually twice as fast, because the load would be high enough not to trigger changing performance state.

Let's assume this is true. You could easily find this out by testing on them, like you should be doing anyway. If you're making a real product, you should be testing on a wide variety of systems. At the very least, you should be testing on the lowest-end and high-end cards.

It seems to me what you're trying to do is get away without doing proper testing. You want to test on a GeForce 460 and try to extrapolate that down to mean something to a GeForce 430.


Isn't it a better algorithm if it manages to reach an interactive frame rate with much lower power consumption?

"Better" is a subjective question. What is "better" to me is the one that gives the fastest performance on the hardware of interest. Which ultimately requires testing on the hardware of interest.

Aleksandar
10-15-2011, 02:04 AM
It seems to me what you're trying to do is get away without doing proper testing. You want to test on a GeForce 460 and try to extrapolate that down to mean something to a GeForce 430.

It was not my intention, but you gave me a great idea. ;)
Just kidding, of course! It would be a gaming-industry award-winning tool if it could test an application on only a single machine and extrapolate the results to a wide variety of hardware.

I took the GTX460 as the example because it is a mid-range graphics card and its highest P-State differs from the next one by a factor of 1.77 in GPU frequency (this varies from model to model, but 1.77 is true for DEV_0E22-SUBSYS_34FC1458-REV_A1).
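
To make the A versus B example concrete, here is a rough sketch that rescales a measured time to a common clock. It assumes, purely for illustration, that Algorithm A ran in the highest state, that Algorithm B ran entirely in the next state down, and that execution time scales inversely with the GPU clock, which real workloads rarely do exactly. The function name is hypothetical.

```python
def time_at_reference_clock(measured_ms, clock_ratio):
    """Scale a time measured at a lower clock to the reference clock.

    clock_ratio = reference_clock / clock_during_measurement.
    Assumes the work is purely core-clock-bound.
    """
    return measured_ms / clock_ratio


a_ms = 5.0    # Algorithm A, assumed to run in the highest state
b_ms = 5.66   # Algorithm B, assumed to run entirely in the next state down
ratio = 1.77  # GPU clock ratio between the two states quoted above

b_normalized = time_at_reference_clock(b_ms, ratio)
print(f"A: {a_ms:.2f} ms, B normalized: {b_normalized:.2f} ms")
print(f"B speedup over A at equal clocks: {a_ms / b_normalized:.2f}x")
```

Under these assumptions B comes out ahead at equal clocks even though its measured time is longer. Without knowing which state each measurement ran in, that conclusion cannot be drawn from the raw averages.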

Dark Photon
10-15-2011, 04:54 PM
Which algorithm is faster, A or B?
...there is ultimately no practical benefit to answering this question.
Aleksandar, I for one agree with you, and Alfonse IMO is just being difficult. You need a method of profiling that guarantees a consistent test environment, or comparison is nonsensical.

In the absence of a good method to control this environment, we're left with disabling the annoying power management altogether and benching in apples-to-apples max perf mode.

Aleksandar
10-16-2011, 02:19 AM
Thanks Dark for the support. Alfonse is a really clever guy. I have no problem communicating with him. ;)

Let's get back to the topic... Yes, the problem is the inability to have a consistent test environment. You have found a way to disable the power management options, but who guarantees that some application will not exceed the TDP when power management is disabled and damage the graphics card? Also, AMD's new PowerTune engine controls clock speed at a very fine level. Unlike NV cards, which have several P-States, Cayman has a whole range of different working frequencies. Even if you (somehow) disable PowerTune, it is likely that TDP clamping will remain. Who guarantees that the working frequency will be the same if you reach the TDP? The most effective way to control dissipation is to control frequency. I just ask them to provide a public interface to read the current frequency. Nothing more.

This is an illustration of what I need: State Tracker (https://sites.google.com/site/opengltutorialsbyaks/download/state-tracker/MemCon.zip). It is a Windows application that works only on NV cards.

Hongwei Li
10-17-2011, 10:11 PM
Now I agree. I'll discuss it with some seniors and let you know when there is any news.

Aleksandar
10-18-2011, 04:16 AM
I'm glad you have agreed.

Thank you in advance!

Hongwei Li
10-18-2011, 09:10 PM
See if the ADL_Overdrive5_CurrentActivity_Get() function in the ADL SDK is what you want. You can download the ADL SDK here: http://developer.amd.com/sdks/ADLSDK/Pages/default.aspx

Aleksandar
10-20-2011, 02:56 PM
Wow! I asked for the ability to track frequencies and you gave me an overclocking tool.

Really nice, indeed! Thank you! Unfortunately I don't possess any of your cards and it would not be nice to play with other people's hardware.

I have tested it on several cards and the conclusions are the following:
- AMD denotes P-States in the opposite way (P-State 0 is the lowest P-State, and 2 is the highest for the previous generation of cards - 5xxx)
- Cayman GPUs (6xxx series) also change P-States in discrete steps. How can we activate PowerTune and get continuous frequency changes?
- All frequencies are defined in strange units (tens of kHz). Why didn't you simply use kHz?
- The number of bus lanes varies. Why is that? How are P-States controlled by disabling bus lanes? Is there any documentation publicly available? I would be glad to read something about it.
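
Regarding the units, here is a minimal sketch of a conversion helper, assuming (as observed in the list above) that clock values come back in tens of kHz. The function name and the raw sample values are hypothetical and not part of ADL.

```python
def adl_clock_to_mhz(raw_value):
    """Convert a clock reported in units of 10 kHz to MHz."""
    return raw_value * 10 / 1000  # 10 kHz units -> kHz -> MHz


# Hypothetical raw readings, not taken from a real card.
engine_raw = 88000
memory_raw = 137500
print(adl_clock_to_mhz(engine_raw))  # 880.0
print(adl_clock_to_mhz(memory_raw))  # 1375.0
```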

Is there any additional technical documentation publicly available about Overdrive5 or ADL? It is a really nice API.

Thank you so much!