1 
How to build an energy model 
for your SoC 
Linaro Connect LCU14, Burlingame, CA. 
Morten Rasmussen
Why do you need an energy model? 
 Most of the Linux kernel is blissfully unaware of SoC power 
management features: 
 P-states, clock domains, C-states, power domains, ... 
 Only largely autonomous subsystems are aware of some of these 
details (cpufreq, cpuidle, …) 
 The plan is to change that by coordinating task scheduling, frequency 
scaling, and idle-state selection to improve power management. 
 Energy saving techniques must be applied under the right 
circumstances which vary between SoCs. 
 The kernel must therefore have a better understanding of 
power(energy)/performance trade-offs for the particular SoC to make 
the right decisions. 
 An energy model can provide that information. 
 As a bonus, the energy model may also be used by tools to make quick 
energy estimates based on execution traces. 
2
Modelling limitations 
 Models are never accurate, but we only need enough detail 
to make the right decisions most of the time. 
 The model will be used by critical code paths in the kernel, 
so it has to be as simple as possible. 
 The model only considers cpus, not memory or peripherals. 
3
A simplified system view 
4 
[Diagram: two cpu clusters, cpu0/cpu1 and cpu2/cpu3, each with its own shared HW and 
clock source, all fed from a common power supply. Legend: clock gating points, power 
gating points, and power domain boundaries.] 
Energy consumption simplified 
5 
[Plot: power vs. time for a cpu running busy at P-state Px, transitioning into idle 
state Cz, then transitioning back to run busy at P-state Py. The shaded areas are the 
busy energy, the transition energy, and the idle energy.] 
Scheduler Topology Hierarchy 
6 
[Diagram: sched_domain hierarchy for cpus 0-3, grouped into two clusters/packages. 
Energy model tables are attached to each struct sched_group: per-core P-states and 
C-states at the cpu level, cluster/package P-states and C-states at the level above.] 
Disclaimer: This is a simplified view of the sched_domain hierarchy. 
Energy model data 
 P-states: 
 Compute capacity: Performance score normalized to the highest P-state of the 
fastest cpu in the system (1024). Choose the benchmark carefully; preferably use a 
suite of benchmarks. 
 Power: Busy power = energy/second. May be normalized to any reference, but 
must be consistent across all cpus. 
 C-states: 
 Power: Idle power = energy/second. Normalized. 
 Wake-up energy. Energy consumed during P->C + C->P state 
transitions. Unit must be consistent with power numbers. 
 Note: 
7 
 Power numbers should only include the power consumption associated 
with the group where the tables are attached, i.e. per-core P-state 
power should only include power consumed by the core itself; shared 
HW is accounted for in the table belonging to the level above.
Energy model data 
8 
Cluster-level tables: 
  P-states:  capacity   power   (freq) 
               358       2967   (350) 
               ...        ...    ... 
              1024       4905   (1000) 
  C-states:  power   wu   (state) 
               10      6   (C1) 
              ...     ...    ... 
Per-cpu tables (cpu 0 and cpu 1): 
  P-states:  capacity   power   (freq) 
               358        187   (350) 
               ...        ...    ... 
              1024       1024   (1000) 
  C-states:  power   wu   (state) 
                0      0   (WFI) 
              ...     ...    ... 
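The tables above can be sketched directly as data. Below is a minimal Python illustration; the structure and the names (CLUSTER_TABLE, CPU_TABLE, busy_power) are assumptions for this sketch, not the kernel's data structures, and only the first and last rows from the example are filled in. 

# Illustrative encoding of the example energy model tables above.
CLUSTER_TABLE = {
    # (capacity, busy power, freq in MHz)
    "p_states": [(358, 2967, 350), (1024, 4905, 1000)],
    # (idle power, wake-up energy, state)
    "c_states": [(10, 6, "C1")],
}
CPU_TABLE = {
    "p_states": [(358, 187, 350), (1024, 1024, 1000)],
    "c_states": [(0, 0, "WFI")],
}

def busy_power(table, capacity):
    # Busy power at the lowest P-state that provides at least `capacity`;
    # saturate at the highest P-state otherwise.
    for cap, power, _freq in table["p_states"]:
        if cap >= capacity:
            return power
    return table["p_states"][-1][1]

print(busy_power(CPU_TABLE, 358))   # -> 187
print(busy_power(CPU_TABLE, 500))   # -> 1024 (rounded up to the next P-state)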
Energy model algorithm 
9 
for_each_domain(cpu, sd) { 
    sg = sched_group_of(cpu) 
    energy_before = curr_util(sg) * busy_power(sg) 
                    + (1 - curr_util(sg)) * idle_power(sg) 
    energy_after = new_util(sg) * busy_power(sg) 
                   + (1 - new_util(sg)) * idle_power(sg) 
                   + (1 - new_util(sg)) * wakeups * wakeup_energy(sg) 
    energy_diff += energy_before - energy_after 
    if (energy_before == energy_after) 
        break; 
} 
return energy_diff
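A runnable Python sketch of the walk above, assuming each level of the hierarchy is handed over as a plain record (curr_util, new_util, busy_power, idle_power, wakeups, wakeup_energy); the names are illustrative, not scheduler code. 

def group_energy(util, busy_power, idle_power, wakeups=0, wakeup_energy=0):
    # Busy energy + idle energy + (for the "after" case) wake-up cost.
    return (util * busy_power
            + (1 - util) * idle_power
            + (1 - util) * wakeups * wakeup_energy)

def estimate_energy_diff(groups):
    # `groups` is ordered from the cpu-level sched_group upwards.
    diff = 0.0
    for g in groups:
        before = group_energy(g["curr_util"], g["busy_power"], g["idle_power"])
        after = group_energy(g["new_util"], g["busy_power"], g["idle_power"],
                             g["wakeups"], g["wakeup_energy"])
        diff += before - after
        if before == after:
            # Nothing changed at this level, so nothing changes further up.
            break
    return diff

# Example: the cpu-level group changes, the cluster-level group does not
# (triggering the early exit). Numbers are made up.
print(estimate_energy_diff([
    {"curr_util": 0.5, "new_util": 0.3, "busy_power": 1024, "idle_power": 0,
     "wakeups": 100, "wakeup_energy": 0.1},
    {"curr_util": 0.5, "new_util": 0.5, "busy_power": 4905, "idle_power": 10,
     "wakeups": 0, "wakeup_energy": 0},
]))   # -> ~197.8 (positive: the change saves energy)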
Backups 
10
11 
Platform performance/energy 
data/model in scheduler or 
user-space 
Energy-Aware Workshop @ Kernel Summit 2014, Chicago 
Morten Rasmussen
Sub-topics 
 Techniques for reducing energy consumption vary between 
platforms: 
 Race-to-idle 
 Task packing 
 P- and C-state constraints (Turbo Mode, package C-states, …) 
 … but they are not universally good; most apply only to a certain 
extent. 
 We need to know when to apply each of the techniques for a 
particular platform. 
 Proposals: 
12 
 Tunable heuristics for each technique that can be controlled by somebody 
else (user-space?), basically passing the problem on to others. 
 Provide in-kernel performance/energy model that can estimate the 
impact of scheduling decisions.
Backup/More stuff 
13
Model Validation: ARM TC2, sysbench 
14 
Correlation (Pearson): 
A15 = 0.93 
A7 = 0.96
Model Validation: ARM TC2, periodic 
15 
Correlation (Pearson): 
A15 = 0.17 
A7 = -0.01
Model Validation: ARM TC2, Android audio 
16 
Correlation (Pearson): 
A15 = 0.03 
A7 = 0.48
Model Validation: ARM TC2, Android 
bbench 
17 
Correlation (Pearson): 
A15 = 0.67 
A7 = 0.80
Old slides 
18
Motivation 
 Energy cost driven task placement (load-balancing) 
19 
 Focus on the actual goal of the energy-aware scheduling activities: 
 Saving energy while achieving (near) optimum performance. 
 Energy benefit of scheduling decision clear when made. 
 Assuming energy cost estimates are fairly accurate. 
 Introduce a simple energy model to estimate costs and guide 
scheduling decisions. 
 Requested by maintainers at the KS workshop. 
 Gives the right amount of packing and spreading. 
 May simplify balancing decision logic. 
 Strong focus on saving energy in load balancing algorithms. 
 big.LITTLE support comes naturally and almost for free. 
 This is just one part of the energy efficiency work. 
 Several related sessions this week.
Energy Load Balancing 
 The idea (a bit simplified): 
20 
 Let the resulting energy consumption guide all balancing decisions: 
 if (energy_diff(task, src_cpu, dst_cpu) > 0) { 
       move_task(task, src_cpu, dst_cpu); 
   } else { 
       /* Try some other task */ 
   } 
 Ideally, we should get the optimum balance if we try all combinations 
of tasks and cpus. 
 In reality it is not that simple. We can't try all combinations, but we 
can get fairly close for most scenarios. 
 If the energy model is accurate enough, we get packing and spreading 
implicitly, and only when it saves energy. 
 Should work for any system. SMP and big.LITTLE (with a few 
extensions).
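As a toy illustration of the balancing idea above (not scheduler code): given an energy_diff() like the one sketched later in the deck, where a positive value means the move saves energy, pick the candidate task with the largest estimated saving. 

def pick_task_to_move(task_loads, src_cpu, dst_cpu, energy_diff):
    # Try each candidate task load and keep the one with the largest positive
    # estimated energy saving; return None if no move saves energy.
    best_load, best_saving = None, 0.0
    for tload in task_loads:
        saving = energy_diff(tload, src_cpu, dst_cpu)
        if saving > best_saving:
            best_load, best_saving = tload, saving
    return best_load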
Power and Energy 
 Goal: Save energy, not power. 
21 
[Plot: power vs. time; energy is the area under the power curve.] 
 e_cpu = P * t,  where t = inst / cc  and  cc = compute capacity (~ freq * uarch) 
 e_cpu = P(cc) * inst / cc,  so P(cc)/cc = energy per instruction: this is what we try to minimize. 
 Splitting the instructions into the tasks' work and idle time: 
   e_cpu = P(cc) * (inst_task / cc + inst_idle / cc) = e_task + e_idle 
 If we have cpuidle support we get: 
   e_cpu = P_busy(cc) * inst_task / cc + P_idle * inst_idle / cc 
 inst_task / cc ~ utilization (tracked load: time in runnable state) 
 We have to add an additional leakage energy term to reflect that it is better not to 
wake cpus unnecessarily. 
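A tiny worked example of the busy/idle split above, using made-up numbers borrowed from the "entirely made up" little-cpu table later in the deck; the helper name is illustrative. It shows that which P-state is cheaper for a fixed amount of work depends entirely on the busy and idle power numbers. 

def window_energy(work, window, cc, p_busy, p_idle):
    # e_cpu = P_busy(cc) * inst_task/cc + P_idle * inst_idle/cc over a fixed
    # wall-clock window: busy time is work/cc, the rest of the window is idle.
    busy_time = work / cc
    idle_time = window - busy_time
    return p_busy * busy_time + p_idle * idle_time

# Same work, a window of 10 time units, two P-states of the made-up little cpu.
print(window_energy(1.0, 10.0, cc=0.2, p_busy=0.4, p_idle=0.1))  # -> 2.5
print(window_energy(1.0, 10.0, cc=0.4, p_busy=0.9, p_idle=0.1))  # -> 3.0

With these particular numbers the lower P-state wins; with a different power curve or non-zero wake-up costs, running faster and idling longer (race-to-idle) can win instead, which is why the tables have to come from the particular SoC. 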
Simple Energy Model 
22 
 cpu_energy = power(cc) * util/cc 
              + idle_power * (1 - util/cc) 
              + leakage_energy 
 cluster_energy = c_active_power * c_util 
                  + c_idle_power * (1 - c_util) 
 util = Scale invariant cpu utilization (Tracked load). 
 cc = Current compute capacity (depends on freq and uarch). 
 power(cc) = Busy power (fully loaded) at current capacity from table. 
 idle_power = Idle power consumption (~WFI). 
 leakage_energy = Constant representing the cost of waking the cpu. 
 c_util = Cluster utilization. Depends on max(util/cc) ratio of its cpus. 
 c_active_power = Cluster active power. 
 c_idle_power = Cluster idle power.
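A direct Python transcription of the two formulas above, handy for experimenting with made-up tables (illustrative only; leakage_energy is the wake-up cost constant and is simply added, matching the formula as written): 

def cpu_energy(util, cc, busy_power, idle_power, leakage_energy=0.0):
    # power(cc) * util/cc + idle_power * (1 - util/cc) + leakage_energy
    return (busy_power * (util / cc)
            + idle_power * (1 - util / cc)
            + leakage_energy)

def cluster_energy(c_util, c_active_power, c_idle_power):
    # c_active_power * c_util + c_idle_power * (1 - c_util)
    return c_active_power * c_util + c_idle_power * (1 - c_util)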
Compute Capacity and Power 
 Processor-specific table expressing power and compute 
capacity at each P-state. 
 The sched domain hierarchy is in a good position to hold this type of 
information. 
 Example (entirely made up): 
23 
Little cpu:                        Big cpu: 
  Capacity   Power                   Capacity   Power 
    0.2       0.4                      0.4       1.6 
    0.4       0.9                      0.8       4.4 
    0.6       1.5                      1.2       9.0 
    0.8       2.2                      1.6      15.0 
    1.0       3.2                      2.0      23.0 
  idle        0.1                    idle        0.3 
  leakage     0.1                    leakage     0.5 
(Equal compute capacity rows can be compared directly: at capacity 0.8 the big cpu 
draws twice the power of the little.) 
Cluster:          Little   Big 
  active power     2.4     6.0 
  idle power       0.0     0.0
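The made-up tables above are enough to drive the energy_diff() sketch on the next slide. Below is one way to encode them in Python together with the two table lookups that sketch relies on; for simplicity the helpers take a cpu type string ("little"/"big") rather than a cpu id, and all names are illustrative. 

CAP_POWER = {
    "little": [(0.2, 0.4), (0.4, 0.9), (0.6, 1.5), (0.8, 2.2), (1.0, 3.2)],
    "big":    [(0.4, 1.6), (0.8, 4.4), (1.2, 9.0), (1.6, 15.0), (2.0, 23.0)],
}
IDLE_POWER = {"little": 0.1, "big": 0.3}
LEAKAGE_ENERGY = {"little": 0.1, "big": 0.5}
CLUSTER_POWER = {"little": {"active": 2.4, "idle": 0.0},
                 "big":    {"active": 6.0, "idle": 0.0}}

def find_cpu_cap(cpu_type, util):
    # Lowest P-state capacity that can serve `util` (saturate at the top).
    for cap, _power in CAP_POWER[cpu_type]:
        if cap >= util:
            return cap
    return CAP_POWER[cpu_type][-1][0]

def cpu_cc_power(cpu_type, cc):
    # Busy power at compute capacity `cc`, looked up from the table.
    for cap, power in CAP_POWER[cpu_type]:
        if cap >= cc:
            return power
    return CAP_POWER[cpu_type][-1][1]

print(find_cpu_cap("little", 0.3), cpu_cc_power("little", 0.4))  # -> 0.4 0.9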
energy_diff() 
 Balancing two cpus: 
24 
def energy_diff(tload, scpu, dcpu): 
    # Estimate the source cpu compute capacity (P-state) 
    s_new_cc = find_cpu_cap(scpu, cpu_util(scpu)) 
    # Energy model cost for the task on the source cpu 
    s_task_energy = tload/s_new_cc * cpu_cc_power(scpu, s_new_cc) 
    if nr_running(scpu) == 1: 
        s_task_energy += cpu_leakage_energy[cpu_type[scpu]] 
    # Estimate the destination cpu compute capacity after adding the task 
    d_new_cc = find_cpu_cap(dcpu, cpu_util(dcpu) + tload) 
    # Energy model cost for the task on the destination cpu 
    d_task_energy = tload/d_new_cc * cpu_cc_power(dcpu, d_new_cc) 
    if nr_running(dcpu) == 0: 
        d_task_energy += cpu_leakage_energy[cpu_type[dcpu]] 
    return s_task_energy - d_task_energy 
 Balancing sched domains is slightly more complicated as it 
involves cluster power as well.
Example 
25 
Before the balance (total energy 3.35): 
  cpu      rq          util   cap   cc_power   leak   power 
  0        {0.2}       0.2    0.2   0.4        0.1    0.5 
  1        {0.1}       0.1    0.2   0.4        0.1    0.35 
  2        {}          0.0    0.2   0.4        0.1    0.1 
  cluster  -           1.0    -     2.4        -      2.4 
  Total                                               3.35 
energy_diff() for moving the 0.1 task from cpu 1 to cpu 0 = 0.075* 
After EA load balance (total energy 2.8, i.e. 0.55 saved): 
  cpu      rq          util   cap   cc_power   leak   power 
  0        {0.2, 0.1}  0.3    0.4   0.9        0.1    0.8 
  1        {}          0.0    0.4   0.9        0.1    0.1 
  2        {}          0.0    0.4   0.9        0.1    0.1 
  cluster  -           0.75   -     2.4        -      1.8 
  Total                                               2.8 
* energy_diff() ignores cluster power and other tasks to keep computations cheap and simple. 
Better accuracy can be added if necessary.
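The numbers above can be reproduced with a few lines of self-contained Python using the made-up little-cpu tables from the earlier slide. The helper names and the packed-cluster P-state assumption (all cpus in the cluster share the P-state required by the busiest cpu) are illustrative. 

CAP_POWER = [(0.2, 0.4), (0.4, 0.9), (0.6, 1.5), (0.8, 2.2), (1.0, 3.2)]
IDLE, LEAK, C_ACTIVE = 0.1, 0.1, 2.4

def cap_for(util):
    return next(cap for cap, _p in CAP_POWER if cap >= util)

def busy_power(cc):
    return next(p for cap, p in CAP_POWER if cap >= cc)

def cpu_power(util, cc):
    # Idle cpus only pay idle power; busy cpus pay busy + idle + leakage.
    if util == 0:
        return IDLE
    return busy_power(cc) * util / cc + IDLE * (1 - util / cc) + LEAK

def system_energy(utils):
    cc = cap_for(max(utils))                       # cluster-wide P-state
    c_util = max(u / cc for u in utils)            # cluster utilization
    return sum(cpu_power(u, cc) for u in utils) + C_ACTIVE * c_util

before = system_energy([0.2, 0.1, 0.0])            # tasks spread
after = system_energy([0.3, 0.0, 0.0])             # tasks packed on cpu 0
print(round(before, 2), round(after, 2), round(before - after, 2))  # 3.35 2.8 0.55

# energy_diff() for moving the 0.1 task from cpu 1 to cpu 0 (cluster ignored):
s_cc, d_cc = cap_for(0.1), cap_for(0.2 + 0.1)
s_task = 0.1 / s_cc * busy_power(s_cc) + LEAK      # task was alone on cpu 1
d_task = 0.1 / d_cc * busy_power(d_cc)             # cpu 0 already busy
print(round(s_task - d_task, 3))                   # 0.075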
Is the energy model too simple? 
 It is essential that the energy model is fast and easy to use for load-balancing. 
26 
 The scheduler is a critical path and already complex enough. 
 Python model tests 
 Disclaimer: These numbers have not been validated in any way. 
 Test configuration: 3+3 big.LITTLE, 1000 random balance scenarios. 
 Rand/Opt: Random balance energy (starting point) worse than best possible balance 
energy (brute-force). 
 EA/Opt: Energy model based balance energy worse than best possible balance energy. 
 EA == Opt: Scenarios where EA found best possible balance. 
Tasks Rand/Opt EA/Opt EA == Opt 
2 7.86% 0.09% 72.60% 
3 7.79% 0.15% 64.80% 
4 9.39% 0.45% 62.00% 
5 10.02% 1.15% 51.10% 
6 11.44% 2.23% 38.30%
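The brute-force comparison described above can be sketched in a few lines: enumerate every assignment of the task loads to cpus, score each with an energy function, and compare the best against the model-guided placement. The energy() callback is deliberately left abstract here; the original tests used the full big.LITTLE model and are not reproduced. 

from itertools import product

def brute_force_best(task_loads, n_cpus, energy):
    # Exhaustively enumerate placements (n_cpus ** n_tasks of them) and keep
    # the minimum-energy one. Only feasible for small scenarios like the above.
    best_energy, best_placement = None, None
    for placement in product(range(n_cpus), repeat=len(task_loads)):
        utils = [0.0] * n_cpus
        for load, cpu in zip(task_loads, placement):
            utils[cpu] += load
        e = energy(utils)
        if best_energy is None or e < best_energy:
            best_energy, best_placement = e, placement
    return best_energy, best_placement

Plugging in a system_energy() like the one sketched for the example slide would give the "Opt" column; comparing it against random and model-guided placements gives the ratios in the table. 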
What is next? 
 Early prototype to validate the idea. Initial focus is getting 
energy_diff() working on a simple SMP system. 
 Post on LKML very soon. 
 Open Issues 
 Exposing power/capacity tables to the kernel. Essential to make the right 
decisions. 
 Plumbing: Where do the tables come from? DT? 
 Next steps: 
27 
 Scale invariance: Requirement for the energy model to work. 
 Fix cpu_power/compute capacity use in scheduler. 
 Tooling and benchmarks (covered in another session) 
 Idle integration (covered in another session)
Questions? 
28
