Power-efficient scheduling, and the latest
news from the kernel summit
Linaro Connect USA 2013
Morten Rasmussen, Dietmar Eggemann
Topics Overview
 Timeline
 Towards a unified scheduler-driven power policy
 Task placement based on CPU suitability
 Kernel Summit Feedback
 Status
 Questions?
Timeline
 May – Ingo Molnar's response to Vincent Guittot's task packing
patches reignited discussions on power-aware scheduling
 Early July – Posted proposed patches for a power aware
scheduler based on a power driver running in conjunction
with the current scheduler
 Avoid big changes to the already complex current scheduler
 Migrate functionality back into the scheduler once the kinks had
been worked out
 Sept – At Plumbers there was a relatively broad agreement
with the approach
 October – Morten reposts patchset with refined APIs between
power driver and the scheduler
 LKS – Reopened the discussion. More on this later
Unified scheduler-driven power policy … Why?
 big.LITTLE MP patches are tested, stable and performant
 Take the principles learnt during the implementation and apply to
an upstream solution
 Existing power management frameworks (cpufreq, cpuidle) are not
coordinated with the scheduler
 E.g. the scheduler decides which cpu to wake up or idle without
having any knowledge about C-states. cpuidle is left to do its best
based on these uninformed choices.
 The scheduler is the most obvious place to coordinate power
management, as it has the best view of the overall system load.
 The scheduler knows when tasks are scheduled and decides the
load balance. cpufreq has to wait until it can see the result of the
scheduler decisions before it can react.
 Task packing in the scheduler needs P and C-state information to
make informed decisions.
Existing Power Policies
 Frequency scaling: cpufreq
 Generic governor + platform specific driver
 Decides target frequency based on overall cpu load.
 Idle state selection: cpuidle
 Generic governor + platform specific driver
 Attempts to predict idle time when cpus enter idle.
 Scheduler:
 Completely generic and unaware of cpufreq and cpuidle policies.
 Determines when and where a task runs, i.e. on which cpu.
 Task placement considering CPU suitability required.
Existing Power Policies

[Diagram: two cpus, each with a run-queue; the scheduler (load balance),
cpufreq (frequency vs. load) and cpuidle (idle states) each apply their
own policy independently. Load tracking shown pre-3.11 and as of 3.11.]

 No coordination between power policies to avoid
conflicting/suboptimal decisions.
 Is it a problem?
Issues
 Scheduler->cpufreq->scheduler cpu load feedback loop
 From 3.11 the scheduler uses tracked load for load-balancing.
 Tracked load is impacted by frequency scaling. Lower frequency
leads to higher tracked load for the same task.
 Hindering new power-aware scheduling features
 Task packing: Needs feedback from cpufreq to determine when cpus
are full.
 Topology-aware task placement: Needs topology information inside
the scheduler to determine the best cpus to use when the
system is partially loaded.
 Heterogeneous systems (big.LITTLE): Needs topology information
and accurate load tracking.
 Thermal also needs to be considered
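The feedback loop above exists because tracked load is measured in wall-clock runnable time. One way out, sketched below in userspace C under assumed names (`scale_runnable_delta` and its arguments are illustrative, not a kernel API), is to scale each runnable delta by the current frequency so the signal becomes scale-invariant: a task running twice as long at half speed contributes the same load.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch only: scale the runnable time delta by the current
 * P-state so a task's tracked load no longer inflates when the cpu runs
 * at a lower frequency. */
static uint64_t scale_runnable_delta(uint64_t delta_us,
                                     uint32_t cur_freq_khz,
                                     uint32_t max_freq_khz)
{
        /* A delta accumulated at half the maximum frequency only
         * contributes half as much scale-invariant load. */
        return delta_us * cur_freq_khz / max_freq_khz;
}
```

With this scaling, cpufreq dropping the frequency no longer feeds back into a higher tracked load for the same task.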
Power scheduler proposal

[Diagram: proposed split between the scheduler (fair.c), a power
framework (power.c) and platform power drivers (drivers/*/?.c). The
scheduler keeps the sched_domain hierarchy (generic topology), load
balance algorithms, load tracking and an “important tasks” cgroup,
extended with: new generic info (pack, heterogeneous, ...), packing,
P & C-state awareness, heterogeneous support and scale invariance. The
power framework hosts the existing policy algorithms (performance state
selection, sleep state selection), a helper function library
(drivers/power/?.c), driver registration and an abstract power
driver/topology interface. Platform HW drivers supply detailed platform
topology and platform perf. and energy monitoring.]
Task placement based on CPU suitability
Part of the power scheduler proposal
 sched_domain hierarchy
 Load balance algorithm (Heterogeneous)
Existing big.LITTLE MP Patches
 Definition: CFS scheduler optimization for heterogeneous platforms.
Attempts to select task affinity to optimize power and performance
based on task load and CPU type
 Hosted at
http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git
 Co-exists with existing (CFS) scheduler code
 Guarded by CONFIG_SCHED_HMP
 Set up HMP domains as a dependency of the topology code
Implement big.LITTLE MP functionality inside mainline scheduler code
Task placement scheduler architectural bricks
1) Additional sched domain data structures
2) Specify sched domain level for task placement
3) Unweighted instantaneous load signal
4) Task placement hook in select task
5) Task placement hook in load balance
6) Task placement idle pull
Brick 1: Additional sched domain data structures
big.LITTLE MP:
 struct hmp_domain

struct hmp_domain {
        struct cpumask cpus;
        struct cpumask possible_cpus;
        struct list_head hmp_domains;
};
Task placement based on CPU suitability:
 Use the existing sched groups in CPU sched domain level
 Add task load ranges into CPU, sched domain and group
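A userspace sketch of how the hmp_domain list can be used: one domain per cluster, linked fastest-first. cpumasks are reduced to plain bitmasks and list_head to a next pointer for illustration; the cpu numbers and masks below are made up, not the patchset's actual topology code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One hmp_domain per cluster, fastest cluster first in the list. */
struct hmp_domain {
        uint32_t cpus;           /* online cpus in this cluster */
        uint32_t possible_cpus;  /* all cpus that may belong to it */
        struct hmp_domain *next; /* stand-in for the hmp_domains list */
};

static struct hmp_domain little = { .cpus = 0x0f, .possible_cpus = 0x0f,
                                    .next = NULL };
static struct hmp_domain big    = { .cpus = 0x30, .possible_cpus = 0x30,
                                    .next = &little };

/* Walk from the fastest domain down to find the cluster owning a cpu. */
static struct hmp_domain *hmp_cpu_domain(int cpu)
{
        struct hmp_domain *d;

        for (d = &big; d != NULL; d = d->next)
                if (d->cpus & (1u << cpu))
                        return d;
        return NULL;
}
```

The fastest-first ordering is what lets up-migration code simply walk the list from the head when looking for a faster cluster.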
Brick 2: Specify sched domain level
big.LITTLE MP:
 No additional sched domain flag
 Deletes SD_LOAD_BALANCE flag in CPU level
Task placement based on CPU suitability:
 Adds SD_SUITABILITY flag to CPU level
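The difference between the two approaches can be shown as flag arithmetic. SD_LOAD_BALANCE matches the mainline flag value of the era; SD_SUITABILITY and its value are assumptions taken from the proposal, not a mainline API.

```c
#include <assert.h>

#define SD_LOAD_BALANCE 0x0001
#define SD_SUITABILITY  0x8000 /* hypothetical value for the proposed flag */

static int cpu_level_flags(void)
{
        /* Unlike big.LITTLE MP, which deletes SD_LOAD_BALANCE at CPU
         * level, the suitability approach keeps it set and adds the new
         * flag on top. */
        return SD_LOAD_BALANCE | SD_SUITABILITY;
}
```

Keeping SD_LOAD_BALANCE set is what later makes a separate idle-pull mechanism unnecessary (Brick 6).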
Brick 3: Unweighted instantaneous load signal
 big.LITTLE MP & Task placement based on CPU suitability:
 For sched entity and cfs_rq
    struct sched_avg {
            u32 runnable_avg_sum, runnable_avg_period;
            u64 last_runnable_update;
            s64 decay_count;
            unsigned long load_avg_contrib;
            unsigned long load_avg_ratio;
    };
 sched entity: runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1)
 cfs_rq: set in [update/enqueue/dequeue]_entity_load_avg()
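The sched entity formula above can be sketched directly; the function name is illustrative, but the arithmetic is exactly the slide's: the fraction of time the entity was runnable, scaled to NICE_0_LOAD and independent of the task's nice value.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024 /* scheduler load weight of a nice-0 task */

/* Unweighted instantaneous load signal:
 * runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1) */
static unsigned long load_avg_ratio(uint32_t runnable_avg_sum,
                                    uint32_t runnable_avg_period)
{
        /* The +1 guards against dividing by zero on a fresh entity. */
        return (unsigned long)runnable_avg_sum * NICE_0_LOAD
                / (runnable_avg_period + 1);
}
```

A fully runnable entity saturates near NICE_0_LOAD regardless of its weight, which is what makes the signal usable for suitability decisions.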
Brick 4: Task placement hook in select task
big.LITTLE MP:
 Force new non-kernel tasks onto big CPUs until
load stabilises
 Least loaded CPU of big cluster is used
Task placement based on CPU suitability:
 Use task load ranges of previous CPU and
(initialized) task load ratio to set new CPU
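The big.LITTLE MP rule above can be sketched as a least-loaded search restricted to big cpus. NR_CPUS, the topology array and the example loads are made up for illustration; the real patchset works on cpumasks and per-cpu run-queues.

```c
#include <assert.h>

/* A new user task, whose load has not stabilised yet, goes to the
 * least loaded cpu of the big cluster. */
#define NR_CPUS 6
static const int is_big[NR_CPUS] = { 0, 0, 0, 0, 1, 1 };
static const unsigned long cpu_load[NR_CPUS] = { 50, 10, 20, 30, 300, 100 };

static int select_cpu_for_new_task(void)
{
        int cpu, best = -1;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (!is_big[cpu])
                        continue;
                if (best < 0 || cpu_load[cpu] < cpu_load[best])
                        best = cpu;
        }
        return best;
}
```

Note that a lightly loaded little cpu (cpu 1 here) is deliberately ignored until the task's tracked load has settled.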
Brick 5: Task placement hook in load balance
big.LITTLE MP:
 Completely bypasses load_balance() in CPU level
 hmp_force_up_migration() in run_rebalance_domains()
 Calls hmp_up_migration() for migration to faster CPU
 Calls hmp_offload_down() for using little CPUs when idle
 Does not use env->imbalance or something equivalent
Task placement based on CPU suitability:
 Happens inside load_balance()
 Find most unsuitable queue (i.e. find source run-queue)
 Move unsuitable tasks (counterpart to load balance)
 Move one unsuitable task (counterpart to active load balance)
 Cannot use env->imbalance to control load balance
 Using grp_load_avg_ratio / (NICE_0_LOAD * sg->group_weight) <= THRESHOLD
 Falling back to 'mainline load balance' if the condition is not met (destination
group is already overloaded)
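The destination-group check from the last two bullets, rearranged into integer arithmetic to avoid division. The THRESHOLD value is an assumption for illustration; only the shape of the test comes from the slide.

```c
#include <assert.h>

#define NICE_0_LOAD 1024
#define THRESHOLD   80 /* percent of full load per cpu, assumed value */

/* Only move unsuitable tasks while the group's average load ratio per
 * cpu stays below the threshold; otherwise fall back to the mainline
 * load balance path. */
static int group_can_take_tasks(unsigned long grp_load_avg_ratio,
                                unsigned int group_weight)
{
        return grp_load_avg_ratio * 100
                <= (unsigned long)THRESHOLD * NICE_0_LOAD * group_weight;
}
```

So a two-cpu group already carrying four fully runnable tasks' worth of load would refuse further suitability migrations.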
Brick 6: Task placement idle pull
big.LITTLE MP:
 Big CPU pulls running task above the threshold from little CPU
Task placement based on CPU suitability:
 Not necessary because idle_balance()->load_balance() is not
suppressed at CPU level (the SD_LOAD_BALANCE flag is kept)
 Idle pull happens inside load_balance()
Kernel Summit Feedback
 Good to get active discussion
 First time with everybody in the same room
 LWN article - “The power-aware scheduling mini-summit”
 Key points made
 Power benchmarks are needed for evaluation
 Use-case descriptions are needed to define common ground.
 The scheduler needs energy/power information to make power-aware
scheduling decisions.
 Power-awareness should be moved into the scheduler.
 cpufreq is not fit for its purpose and should go away.
 cpuidle will be integrated into the scheduler, possibly supported by
new per-task properties such as latency constraints
 Are there ways to replay energy scenarios?
 Linsched or perf sched
Kernel Summit feedback observations
 All part of the open-source process
 Discussions have raised awareness of the issues
 Maintainers recognise the need for improved power management
 Iterative approach necessary but the steps are clear
 Maintainers have a clear server/desktop background
 ARM community can help educate this audience on embedded
requirements
 Benchmarking for power could be hard to do in a simple way
 cyclictest- and sysbench-style tests are unlikely to yield realistic results
on real systems
 However, full accuracy not required
 Power models necessarily complex and often closely guarded
secrets
 Collection and reporting of meaningful metrics is probably sufficient
Status
 Latest Power-aware scheduling patches on LKML
 https://lkml.org/lkml/2013/10/11/547
 Task placement based on CPU suitability patches prepared
 Proof of concept done
 Waiting for right time to post to lists
 Feedback from the Linux Kernel Summit needs to be discussed
Questions?
 Thanks for listening.
