LCU13: Power-efficient scheduling, and the latest news from the kernel summit

Resource: LCU13
Name: Power-efficient scheduling, and the latest news from the kernel summit
Date: 28-10-2013
Speaker: Dietmar Eggemann, Morten Rasmussen


  • 1. Power-efficient scheduling, and the latest news from the kernel summit
    Linaro Connect USA 2013
    Morten Rasmussen, Dietmar Eggemann
  • 2. Topics Overview
    • Timeline
    • Towards a unified scheduler driven power policy
    • Task placement based on CPU suitability
    • Kernel Summit Feedback
    • Status
    • Questions?
  • 3. Timeline
    • May – Ingo's response to the task packing patches from VincentG reignited discussions on power-aware scheduling
    • Early July – Posted proposed patches for a power-aware scheduler based on a power driver running in conjunction with the current scheduler
        • Avoid big changes to the already complex current scheduler
        • Migrate functionality back into the scheduler once the kinks have been worked out
    • Sept – At Plumbers there was relatively broad agreement with the approach
    • October – Morten reposts patchset with refined APIs between power driver and the scheduler
    • LKS – Reopened the discussion. More on this later
  • 4. Unified scheduler driven power policy … Why?
    • big.LITTLE MP patches are tested, stable and performant
        • Take the principles learnt during the implementation and apply them to an upstream solution
    • Existing power management frameworks (cpufreq, cpuidle) are not coordinated with the scheduler
        • E.g. the scheduler decides which cpu to wake up or idle without having any knowledge about C-states. cpuidle is left to do its best based on these uninformed choices.
    • The scheduler is the most obvious place to coordinate power management, as it has the best view of the overall system load.
        • The scheduler knows when tasks are scheduled and decides the load balance. cpufreq has to wait until it can see the result of the scheduler decisions before it can react.
    • Task packing in the scheduler needs P-state and C-state information to make informed decisions.
  • 5. Existing Power Policies
    • Frequency scaling: cpufreq
        • Generic governor + platform specific driver
        • Decides target frequency based on overall cpu load.
    • Idle state selection: cpuidle
        • Generic governor + platform specific driver
        • Attempts to predict idle time when cpus enter idle.
    • Scheduler:
        • Completely generic and unaware of cpufreq and cpuidle policies.
        • Determines when and where a task runs, i.e. on which cpu.
        • Task placement considering CPU suitability required.
  • 6. Existing Power Policies
    [Figure labels: cpu0, cpu1; Freq, Load, Power, rq over time T; Scheduler policy, cpufreq policy, cpuidle policy; Load balance, idle; Current load (pre-3.11), Current load (3.11)]
    • No coordination between power policies to avoid conflicting/suboptimal decisions.
    • Is it a problem?
  • 7. Issues
    • Scheduler->cpufreq->scheduler cpu load feedback loop
        • From 3.11 the scheduler uses tracked load for load-balancing.
        • Tracked load is impacted by frequency scaling. Lower frequency leads to higher tracked load for the same task.
    • Hindering new power-aware scheduling features
        • Task packing: Needs feedback from cpufreq to determine when cpus are full.
        • Topology aware task placement: Needs topology information inside the scheduler to determine the most optimal cpus to use when the system is partially loaded.
        • Heterogeneous systems (big.LITTLE): Needs topology information and accurate load tracking.
    • Thermal also needs to be considered
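The frequency dependence of tracked load noted on slide 7 can be illustrated with simple arithmetic. The sketch below assumes an idealised task whose run time scales linearly with CPU frequency. The helper names `tracked_load` and `scale_invariant_load`, and the value 1024 for `NICE_0_LOAD`, are assumptions made for illustration and are not taken from the slides or from the posted patches.

```c
#include <assert.h>

/* Assumed fixed-point value for "CPU fully busy" (illustrative). */
#define NICE_0_LOAD 1024UL

/*
 * Load that would be tracked for a task needing busy_at_max
 * (0..NICE_0_LOAD) of a CPU at max_khz when the CPU actually runs at
 * cur_khz. Under the linear-scaling assumption the task runs longer at
 * a lower frequency, so the tracked load rises until it saturates.
 */
unsigned long tracked_load(unsigned long busy_at_max,
                           unsigned long cur_khz,
                           unsigned long max_khz)
{
    unsigned long load;

    assert(cur_khz > 0 && cur_khz <= max_khz);
    load = busy_at_max * max_khz / cur_khz;
    return load > NICE_0_LOAD ? NICE_0_LOAD : load;
}

/* Rescale a tracked load back to max-frequency units. */
unsigned long scale_invariant_load(unsigned long tracked,
                                   unsigned long cur_khz,
                                   unsigned long max_khz)
{
    assert(cur_khz > 0 && cur_khz <= max_khz);
    return tracked * cur_khz / max_khz;
}
```

With these assumptions, a task that keeps a CPU 25% busy at 1 GHz (256 of 1024) appears 50% busy (512) at 500 MHz, and rescaling by the frequency ratio recovers 256. Once the load saturates at 1024 the original demand can no longer be recovered, which is one way to see why a packing decision would need feedback from cpufreq.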
  • 8. Power scheduler proposal
    [Diagram labels: Scheduler (fair.c); Power framework (power.c); Power driver (drivers/*/?.c); Library (drivers/power/?.c); Helper function library; Driver registration; sched_domain hierarchy (generic topology); Load balance algorithms; Load tracking; “Important tasks” cgroup; Detailed platform topology; Platform HW driver; Platform perf. and energy monitoring; Performance state selection; Sleep state selection; Existing policy algorithms; Abstract power driver/topology interface]
    [Diagram annotations: + New generic info (pack, heterogeneous, ...); + Packing, + P & C-state aware, + Heterogeneous; + Scale invariant]
  • 9. Task placement based on CPU suitability
    • Part of the power scheduler proposal
        • sched_domain hierarchy
        • Load balance algorithm (Heterogeneous)
    • Existing big.LITTLE MP Patches
        • Definition: CFS scheduler optimization for heterogeneous platforms. Attempts to select task affinity to optimize power and performance based on task load and CPU type
        • Hosted at
        • Co-exists with existing (CFS) scheduler code
        • Guarded by CONFIG_SCHED_HMP
        • Setup HMP domains as a dependency to topology code
    • Implement big.LITTLE MP functionality inside scheduler mainline code
  • 10. Task placement scheduler architectural bricks
    1) Additional sched domain data structures
    2) Specify sched domain level for task placement
    3) Unweighted instantaneous load signal
    4) Task placement hook in select task
    5) Task placement hook in load balance
    6) Task placement idle pull
  • 11. Brick 1: Additional sched domain data structures
    • big.LITTLE MP:
        • struct hmp_domain

            struct hmp_domain {
                    struct cpumask cpus;
                    struct cpumask possible_cpus;
                    struct list_head hmp_domains;
            };

    • Task placement based on CPU suitability:
        • Use the existing sched groups in CPU sched domain level
        • Add task load ranges into CPU, sched domain and group
  • 12. Brick 2: Specify sched domain level
    • big.LITTLE MP:
        • No additional sched domain flag
        • Deletes SD_LOAD_BALANCE flag in CPU level
    • Task placement based on CPU suitability:
        • Adds SD_SUITABILITY flag to CPU level
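As a rough illustration of the two approaches on slide 12, the sketch below models the flag word of a sched domain level. The value 0x0001 for `SD_LOAD_BALANCE` is believed to match kernels of that era; the bit chosen for `SD_SUITABILITY` and all three helper names are assumptions, since the slide does not give them.

```c
#include <assert.h>

#define SD_LOAD_BALANCE 0x0001u /* balance load at this domain level */
#define SD_SUITABILITY  0x8000u /* hypothetical bit, not given on the slide */

/* big.LITTLE MP approach: switch off regular balancing at the CPU level. */
unsigned int hmp_cpu_level_flags(unsigned int flags)
{
    return flags & ~SD_LOAD_BALANCE;
}

/* Suitability approach: keep balancing and mark the level for task placement. */
unsigned int suitability_cpu_level_flags(unsigned int flags)
{
    return flags | SD_SUITABILITY;
}

/* Non-zero when regular load balancing may run at a level with these flags. */
int level_load_balances(unsigned int flags)
{
    return (flags & SD_LOAD_BALANCE) != 0;
}
```

The practical difference is that the first variant has to re-implement migration outside `load_balance()`, while the second leaves the existing path enabled and only adds a marker to it.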
  • 13. Brick 3: Unweighted instantaneous load signal
    • big.LITTLE MP & Task placement based on CPU suitability:
        • For sched entity and cfs_rq

            struct sched_avg {
                    u32 runnable_avg_sum, runnable_avg_period;
                    u64 last_runnable_update;
                    s64 decay_count;
                    unsigned long load_avg_contrib;
                    unsigned long load_avg_ratio;
            };

        • sched entity: runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1)
        • cfs_rq: set in [update/enqueue/dequeue]_entity_load_avg()
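The per-entity expression on slide 13 can be written out as a small user-space sketch. The struct below is a cut-down stand-in for `sched_avg`, not the kernel type, and `update_load_avg_ratio` is a hypothetical helper name. `NICE_0_LOAD` is assumed to be 1024, and storing the result in `load_avg_ratio` is an inference from the field list on the slide.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024UL /* assumed value, for illustration */

/* Subset of the sched_avg fields needed for the ratio (stand-in type). */
struct sched_avg_sketch {
    uint32_t runnable_avg_sum;    /* decayed time spent runnable */
    uint32_t runnable_avg_period; /* decayed total time observed */
    unsigned long load_avg_ratio; /* unweighted load, 0..NICE_0_LOAD */
};

/*
 * Unweighted load: fraction of time runnable, scaled to NICE_0_LOAD.
 * Task priority does not enter the calculation. The "+ 1" presumably
 * guards against division by zero for a newly created entity.
 */
void update_load_avg_ratio(struct sched_avg_sketch *sa)
{
    uint64_t contrib;

    assert(sa != 0);
    contrib = (uint64_t)sa->runnable_avg_sum * NICE_0_LOAD;
    sa->load_avg_ratio =
        (unsigned long)(contrib / ((uint64_t)sa->runnable_avg_period + 1));
}
```

For an entity that was runnable for 500 out of 999 tracked units, the ratio comes out as 512, i.e. half of `NICE_0_LOAD`, regardless of its nice level.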
  • 14. Brick 4: Task placement hook in select task
    • big.LITTLE MP:
        • Force new non-kernel tasks onto big CPUs until load stabilises
        • Least loaded CPU of big cluster is used
    • Task placement based on CPU suitability:
        • Use task load ranges of previous CPU and (initialized) task load ratio to set new CPU
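One possible reading of the suitability-based hook on slide 14 is sketched below: each CPU carries a task load range, and a waking task stays on its previous CPU if its load ratio falls inside that CPU's range. The struct, both function names, and the fallback policy are hypothetical; the slides do not describe how the ranges are stored or searched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU suitability range, in load_avg_ratio units. */
struct cpu_load_range {
    unsigned long min_ratio;
    unsigned long max_ratio;
};

/* Non-zero when the task's load ratio lies inside the CPU's range. */
int cpu_suits_task(const struct cpu_load_range *r, unsigned long ratio)
{
    return ratio >= r->min_ratio && ratio <= r->max_ratio;
}

/*
 * Keep the task on prev_cpu when that CPU is suitable; otherwise pick
 * the first CPU whose range fits, falling back to prev_cpu if none does.
 */
int select_suitable_cpu(const struct cpu_load_range *ranges, size_t nr_cpus,
                        int prev_cpu, unsigned long ratio)
{
    size_t cpu;

    assert(prev_cpu >= 0 && (size_t)prev_cpu < nr_cpus);
    if (cpu_suits_task(&ranges[prev_cpu], ratio))
        return prev_cpu;
    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (cpu_suits_task(&ranges[cpu], ratio))
            return (int)cpu;
    return prev_cpu;
}
```

With two little CPUs covering 0..512 and two big CPUs covering 513..1024 (made-up ranges), a task with ratio 800 waking on CPU 0 would be steered to CPU 2, while a task with ratio 300 would stay where it was.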
  • 15. Brick 5: Task placement hook in load balance
    • big.LITTLE MP:
        • Completely bypasses load_balance() in CPU level
        • hmp_force_up_migration() in run_rebalance_domains()
            • Calls hmp_up_migration() for migration to faster CPU
            • Calls hmp_offload_down() for using little CPUs when idle
        • Does not use env->imbalance or anything equivalent
    • Task placement based on CPU suitability:
        • Happens inside load_balance()
            • Find most unsuitable queue (i.e. find source run-queue)
            • Move unsuitable tasks (counterpart to load balance)
            • Move one unsuitable task (counterpart to active load balance)
        • Cannot use env->imbalance to control load balance
            • Using grp_load_avg_ratio/(NICE_0_LOAD * sg->group_weight) <= THRESHOLD
            • Falling back to 'mainline load balance' in case the condition is not met (destination group is already overloaded)
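The threshold condition on slide 15 can be sketched as a stand-alone check. The slide does not state the unit of THRESHOLD, so a percentage is assumed here. The struct is a hypothetical stand-in for the destination sched group, `grp_load_avg_ratio` is assumed to be the summed unweighted load of the group, and `NICE_0_LOAD` is assumed to be 1024.

```c
#include <assert.h>

#define NICE_0_LOAD 1024UL /* assumed value, for illustration */

/* Hypothetical view of a destination sched group. */
struct group_sketch {
    unsigned long grp_load_avg_ratio; /* assumed: summed unweighted load */
    unsigned int group_weight;        /* number of CPUs in the group */
};

/*
 * Non-zero when the group is at or below threshold_pct of its capacity.
 * Cross-multiplied form of
 *   grp_load_avg_ratio / (NICE_0_LOAD * group_weight) <= threshold_pct / 100
 * so that integer division does not truncate the ratio to zero.
 */
int group_below_threshold(const struct group_sketch *sg,
                          unsigned int threshold_pct)
{
    assert(sg->group_weight > 0);
    return sg->grp_load_avg_ratio * 100UL <=
           (unsigned long)threshold_pct * NICE_0_LOAD * sg->group_weight;
}
```

When the check fails, the destination group counts as overloaded, and a caller would fall back to the regular load balance path, as the slide describes.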
  • 16. Brick 6: Task placement idle pull
    • big.LITTLE MP:
        • Big CPU pulls running task above the threshold from little CPU
    • Task placement based on CPU suitability:
        • Not necessary because idle_balance()->load_balance() is not suppressed on CPU level by missing SD_LOAD_BALANCE flag
        • Idle pull happens inside load_balance
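A minimal sketch of the big.LITTLE MP idle pull described on slide 16 follows: an idle big CPU scans the little CPUs for a running task whose load ratio is above a threshold. The struct, the function name, and the choice to prefer the heaviest candidate are assumptions for illustration and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of what a little CPU is currently running. */
struct little_cpu_sketch {
    int has_running_task;
    unsigned long curr_load_avg_ratio; /* load ratio of the running task */
};

/*
 * Return the index of the little CPU whose running task has the highest
 * load ratio strictly above up_threshold, or -1 if there is none.
 */
int idle_pull_source(const struct little_cpu_sketch *little, size_t nr,
                     unsigned long up_threshold)
{
    unsigned long best_ratio = up_threshold;
    int best = -1;
    size_t i;

    assert(little != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        if (!little[i].has_running_task)
            continue;
        if (little[i].curr_load_avg_ratio > best_ratio) {
            best_ratio = little[i].curr_load_avg_ratio;
            best = (int)i;
        }
    }
    return best;
}
```

In the suitability-based approach this separate scan would not be needed, because the idle path already reaches `load_balance()` at the CPU level.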
  • 17. Kernel Summit Feedback
    • Good to get active discussion
        • First time with everybody in the same room
        • LWN article - “The power-aware scheduling mini-summit”
    • Key points made
        • Power benchmarks are needed for evaluation
        • Use-case descriptions are needed to define common ground.
        • The scheduler needs energy/power information to make power-aware scheduling decisions.
        • Power-awareness should be moved into the scheduler.
        • cpufreq is not fit for its purpose and should go away.
        • cpuidle will be integrated in the scheduler, possibly supported by new per-task properties such as latency constraints
    • Are there ways to replay energy scenarios?
        • Linsched or perf sched
  • 18. Kernel Summit feedback observations
    • All part of the open-source process
        • Discussions have raised awareness of the issues
        • Maintainers recognise the need for improved power management
        • Iterative approach necessary but the steps are clear
    • Maintainers have a clear server/desktop background
        • ARM community can help educate this audience on embedded requirements
    • Benchmarking for power could be hard to do in a simple way
        • Cyclic test, sysbench type tests unlikely to yield realistic results in real systems
        • However, full accuracy not required
        • Power models necessarily complex and often closely guarded secrets
        • Collection and reporting of meaningful metrics is probably sufficient
  • 19. Status
    • Latest Power-aware scheduling patches on LKML
    • Task placement based on CPU suitability patches prepared
        • Proof of concept done
        • Waiting for right time to post to lists
    • Feedback from Linux Kernel Summit needs to be discussed
  • 20. Questions?
    Thanks for listening.