Power-efficient scheduling, and the latest
news from the kernel summit
Linaro Connect USA 2013
Morten Rasmussen, Dietmar Eggemann
Topics Overview
 Timeline
 Towards a unified scheduler-driven power policy
 Task placement based on CPU suitability
 Kernel Summit Feedback
 Status
 Questions?
Timeline
 May – Ingo Molnar's response to Vincent Guittot's task packing
patches reignited discussions on power-aware scheduling
 Early July – Posted proposed patches for a power aware
scheduler based on a power driver running in conjunction
with the current scheduler
 Avoid big changes to the already complex current scheduler
 Migrate functionality back into the scheduler once the kinks had
been worked out
 Sept – At Plumbers there was a relatively broad agreement
with the approach
 October – Morten reposts patchset with refined APIs between
power driver and the scheduler
 LKS – Reopened the discussion. More on this later
Unified scheduler-driven power policy … Why?
 big.LITTLE MP patches are tested, stable and performant
 Take the principles learnt during the implementation and apply to
an upstream solution
 Existing power management frameworks (cpufreq, cpuidle) are not
coordinated with the scheduler
 E.g. the scheduler decides which cpu to wake up or idle without
having any knowledge about C-states. cpuidle is left to do its best
based on these uninformed choices.
 The scheduler is the most obvious place to coordinate power
management, as it has the best view of the overall system load.
 The scheduler knows when tasks are scheduled and decides the
load balance. cpufreq has to wait until it can see the result of the
scheduler decisions before it can react.
 Task packing in the scheduler needs P and C-state information to
make informed decisions.
Existing Power Policies
 Frequency scaling: cpufreq
 Generic governor + platform specific driver
 Decides target frequency based on overall cpu load.
 Idle state selection: cpuidle
 Generic governor + platform specific driver
 Attempts to predict idle time when cpus enter idle.
 Scheduler:
 Completely generic and unaware of cpufreq and cpuidle policies.
 Determines when and where a task runs, i.e. on which cpu.
 Task placement considering CPU suitability required.
Existing Power Policies

[Diagram: two cpus, each with a run-queue; the scheduler (load balance),
cpufreq (frequency vs. load) and cpuidle (idle states) each apply their
own policy independently. Load tracking shown pre-3.11 and as of 3.11.]

 No coordination between power policies to avoid
conflicting/suboptimal decisions.
 Is it a problem?
Issues
 Scheduler->cpufreq->scheduler cpu load feedback loop
 From 3.11 the scheduler uses tracked load for load-balancing.
 Tracked load is impacted by frequency scaling. Lower frequency
leads to higher tracked load for the same task.
 Hindering new power-aware scheduling features
 Task packing: Needs feedback from cpufreq to determine when cpus
are full.
 Topology-aware task placement: Needs topology information inside
the scheduler to determine the best cpus to use when the
system is partially loaded.
 Heterogeneous systems (big.LITTLE): Needs topology information
and accurate load tracking.
 Thermal also needs to be considered
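The feedback loop above exists because tracked load is measured in wall-clock runnable time. One way out, sketched below in userspace C under assumed names (`scale_runnable_delta` and its arguments are illustrative, not a kernel API), is to scale each runnable delta by the current frequency so the signal becomes scale-invariant: a task running twice as long at half speed contributes the same load.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch only: scale the runnable time delta by the current
 * P-state so a task's tracked load no longer inflates when the cpu runs
 * at a lower frequency. */
static uint64_t scale_runnable_delta(uint64_t delta_us,
                                     uint32_t cur_freq_khz,
                                     uint32_t max_freq_khz)
{
        /* A delta accumulated at half the maximum frequency only
         * contributes half as much scale-invariant load. */
        return delta_us * cur_freq_khz / max_freq_khz;
}
```

With this scaling, cpufreq dropping the frequency no longer feeds back into a higher tracked load for the same task.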
Power scheduler proposal

[Diagram: proposed split between the scheduler (fair.c), a power
framework (power.c) and platform power drivers (drivers/*/?.c). The
scheduler keeps the sched_domain hierarchy (generic topology), load
balance algorithms, load tracking and an “important tasks” cgroup,
extended with: new generic info (pack, heterogeneous, ...), packing,
P & C-state awareness, heterogeneous support and scale invariance. The
power framework hosts the existing policy algorithms (performance state
selection, sleep state selection), a helper function library
(drivers/power/?.c), driver registration and an abstract power
driver/topology interface. Platform HW drivers supply detailed platform
topology and platform perf. and energy monitoring.]
Task placement based on CPU suitability
Part of the power scheduler proposal
 sched_domain hierarchy
 Load balance algorithm (Heterogeneous)
Existing big.LITTLE MP Patches
 Definition: CFS scheduler optimization for heterogeneous platforms.
Attempts to select task affinity to optimize power and performance
based on task load and CPU type
 Hosted at
http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git
 Co-exists with existing (CFS) scheduler code
 Guarded by CONFIG_SCHED_HMP
 Set up HMP domains as a dependency of the topology code
Implement big.LITTLE MP functionality inside mainline scheduler code
Task placement scheduler architectural bricks
1) Additional sched domain data structures
2) Specify sched domain level for task placement
3) Unweighted instantaneous load signal
4) Task placement hook in select task
5) Task placement hook in load balance
6) Task placement idle pull
Brick 1: Additional sched domain data structures
big.LITTLE MP:
 struct hmp_domain

struct hmp_domain {
        struct cpumask cpus;
        struct cpumask possible_cpus;
        struct list_head hmp_domains;
};
Task placement based on CPU suitability:
 Use the existing sched groups in CPU sched domain level
 Add task load ranges into CPU, sched domain and group
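A userspace sketch of how the hmp_domain list can be used: one domain per cluster, linked fastest-first. cpumasks are reduced to plain bitmasks and list_head to a next pointer for illustration; the cpu numbers and masks below are made up, not the patchset's actual topology code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One hmp_domain per cluster, fastest cluster first in the list. */
struct hmp_domain {
        uint32_t cpus;           /* online cpus in this cluster */
        uint32_t possible_cpus;  /* all cpus that may belong to it */
        struct hmp_domain *next; /* stand-in for the hmp_domains list */
};

static struct hmp_domain little = { .cpus = 0x0f, .possible_cpus = 0x0f,
                                    .next = NULL };
static struct hmp_domain big    = { .cpus = 0x30, .possible_cpus = 0x30,
                                    .next = &little };

/* Walk from the fastest domain down to find the cluster owning a cpu. */
static struct hmp_domain *hmp_cpu_domain(int cpu)
{
        struct hmp_domain *d;

        for (d = &big; d != NULL; d = d->next)
                if (d->cpus & (1u << cpu))
                        return d;
        return NULL;
}
```

The fastest-first ordering is what lets up-migration code simply walk the list from the head when looking for a faster cluster.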
Brick 2: Specify sched domain level
big.LITTLE MP:
 No additional sched domain flag
 Deletes SD_LOAD_BALANCE flag in CPU level
Task placement based on CPU suitability:
 Adds SD_SUITABILITY flag to CPU level
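The difference between the two approaches can be shown as flag arithmetic. SD_LOAD_BALANCE matches the mainline flag value of the era; SD_SUITABILITY and its value are assumptions taken from the proposal, not a mainline API.

```c
#include <assert.h>

#define SD_LOAD_BALANCE 0x0001
#define SD_SUITABILITY  0x8000 /* hypothetical value for the proposed flag */

static int cpu_level_flags(void)
{
        /* Unlike big.LITTLE MP, which deletes SD_LOAD_BALANCE at CPU
         * level, the suitability approach keeps it set and adds the new
         * flag on top. */
        return SD_LOAD_BALANCE | SD_SUITABILITY;
}
```

Keeping SD_LOAD_BALANCE set is what later makes a separate idle-pull mechanism unnecessary (Brick 6).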
Brick 3: Unweighted instantaneous load signal
 big.LITTLE MP & Task placement based on CPU suitability:
 For sched entity and cfs_rq
    struct sched_avg {
            u32 runnable_avg_sum, runnable_avg_period;
            u64 last_runnable_update;
            s64 decay_count;
            unsigned long load_avg_contrib;
            unsigned long load_avg_ratio;
    };
 sched entity: runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1)
 cfs_rq: set in [update/enqueue/dequeue]_entity_load_avg()
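The sched entity formula above can be sketched directly; the function name is illustrative, but the arithmetic is exactly the slide's: the fraction of time the entity was runnable, scaled to NICE_0_LOAD and independent of the task's nice value.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024 /* scheduler load weight of a nice-0 task */

/* Unweighted instantaneous load signal:
 * runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1) */
static unsigned long load_avg_ratio(uint32_t runnable_avg_sum,
                                    uint32_t runnable_avg_period)
{
        /* The +1 guards against dividing by zero on a fresh entity. */
        return (unsigned long)runnable_avg_sum * NICE_0_LOAD
                / (runnable_avg_period + 1);
}
```

A fully runnable entity saturates near NICE_0_LOAD regardless of its weight, which is what makes the signal usable for suitability decisions.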
Brick 4: Task placement hook in select task
big.LITTLE MP:
 Force new non-kernel tasks onto big CPUs until
load stabilises
 Least loaded CPU of big cluster is used
Task placement based on CPU suitability:
 Use task load ranges of previous CPU and
(initialized) task load ratio to set new CPU
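The big.LITTLE MP rule above can be sketched as a least-loaded search restricted to big cpus. NR_CPUS, the topology array and the example loads are made up for illustration; the real patchset works on cpumasks and per-cpu run-queues.

```c
#include <assert.h>

/* A new user task, whose load has not stabilised yet, goes to the
 * least loaded cpu of the big cluster. */
#define NR_CPUS 6
static const int is_big[NR_CPUS] = { 0, 0, 0, 0, 1, 1 };
static const unsigned long cpu_load[NR_CPUS] = { 50, 10, 20, 30, 300, 100 };

static int select_cpu_for_new_task(void)
{
        int cpu, best = -1;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (!is_big[cpu])
                        continue;
                if (best < 0 || cpu_load[cpu] < cpu_load[best])
                        best = cpu;
        }
        return best;
}
```

Note that a lightly loaded little cpu (cpu 1 here) is deliberately ignored until the task's tracked load has settled.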
Brick 5: Task placement hook in load balance
big.LITTLE MP:
 Completely bypasses load_balance() in CPU level
 hmp_force_up_migration() in run_rebalance_domains()
 Calls hmp_up_migration() for migration to faster CPU
 Calls hmp_offload_down() for using little CPUs when idle
 Does not use env->imbalance or something equivalent
Task placement based on CPU suitability:
 Happens inside load_balance()
 Find most unsuitable queue (i.e. find source run-queue)
 Move unsuitable tasks (counterpart to load balance)
 Move one unsuitable task (counterpart to active load balance)
 Cannot use env->imbalance to control load balance
 Using grp_load_avg_ratio / (NICE_0_LOAD * sg->group_weight) <= THRESHOLD
 Falling back to 'mainline load balance' if the condition is not met (destination
group is already overloaded)
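The destination-group check from the last two bullets, rearranged into integer arithmetic to avoid division. The THRESHOLD value is an assumption for illustration; only the shape of the test comes from the slide.

```c
#include <assert.h>

#define NICE_0_LOAD 1024
#define THRESHOLD   80 /* percent of full load per cpu, assumed value */

/* Only move unsuitable tasks while the group's average load ratio per
 * cpu stays below the threshold; otherwise fall back to the mainline
 * load balance path. */
static int group_can_take_tasks(unsigned long grp_load_avg_ratio,
                                unsigned int group_weight)
{
        return grp_load_avg_ratio * 100
                <= (unsigned long)THRESHOLD * NICE_0_LOAD * group_weight;
}
```

So a two-cpu group already carrying four fully runnable tasks' worth of load would refuse further suitability migrations.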
Brick 6: Task placement idle pull
big.LITTLE MP:
 Big CPU pulls running task above the threshold from little CPU
Task placement based on CPU suitability:
 Not necessary because idle_balance()->load_balance() is not
suppressed at CPU level (the SD_LOAD_BALANCE flag is kept)
 Idle pull happens inside load_balance()
Kernel Summit Feedback
 Good to get active discussion
 First time with everybody in the same room
 LWN article - “The power-aware scheduling mini-summit”
 Key points made
 Power benchmarks are needed for evaluation
 Use-case descriptions are needed to define common ground.
 The scheduler needs energy/power information to make power-aware
scheduling decisions.
 Power-awareness should be moved into the scheduler.
 cpufreq is not fit for its purpose and should go away.
 cpuidle will be integrated into the scheduler, possibly supported by
new per-task properties such as latency constraints
 Are there ways to replay energy scenarios?
 Linsched or perf sched
Kernel Summit feedback observations
 All part of the open-source process
 Discussions have raised awareness of the issues
 Maintainers recognise the need for improved power management
 Iterative approach necessary but the steps are clear
 Maintainers have a clear server/desktop background
 ARM community can help educate this audience on embedded
requirements
 Benchmarking for power could be hard to do in a simple way
 cyclictest- and sysbench-style tests are unlikely to yield realistic results
on real systems
 However, full accuracy not required
 Power models necessarily complex and often closely guarded
secrets
 Collection and reporting of meaningful metrics is probably sufficient
Status
 Latest Power-aware scheduling patches on LKML
 https://lkml.org/lkml/2013/10/11/547
 Task placement based on CPU suitability patches prepared
 Proof of concept done
 Waiting for right time to post to lists
 Feedback from the Linux Kernel Summit needs to be discussed
Questions?
 Thanks for listening.
