LCA14: LCA14-306: CPUidle & CPUfreq integration with scheduler


Resource: LCA14
Name: LCA14-306: CPUidle & CPUfreq integration with scheduler
Date: 5 March 2014
Speakers: Daniel Lezcano, Mike Turquette

  1. Wed 5 March, 11:15am, Daniel Lezcano, Mike Turquette
     LCA14-306: CPUidle & CPUfreq integration with scheduler
  2. Introduction
     ● Power-aware scheduling discussion
     ● Patchset "Small task packing"
        − Some information shared between cpuidle and the scheduler
     ● "Line in the sand" by Ingo Molnar
        − First integrate cpuidle and cpufreq with the scheduler
  3. CPUidle + scheduler: current design
     [Diagram: Scheduler, Idle task, CPUidle, Governor, CPUidle backend driver; calls shown: switch_to, cpuidle_idle_call, cpuidle_select, cpuidle_enter]
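The layering in the diagram can be sketched as a small user-space C model: the idle task calls into CPUidle, which asks the governor for a state (the cpuidle_select step) and then enters it through the backend driver (the cpuidle_enter step). Everything below is hypothetical: gov_model, drv_model and the *_model functions are illustrative stand-ins, not the kernel's struct cpuidle_governor or struct cpuidle_driver, and the "driver" only prints.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the governor and driver layers. */
struct gov_model {
	const char *name;
	int (*select)(int nr_states);       /* returns an idle state index */
};

struct drv_model {
	int nr_states;
	int (*enter)(int cpu, int index);   /* enters the state, returns it */
};

/* Trivial illustrative governor: always the shallowest state. */
static int shallowest_select(int nr_states)
{
	(void)nr_states;
	return 0;
}

/* Illustrative backend driver: logs instead of touching hardware. */
static int log_enter(int cpu, int index)
{
	printf("cpu%d: entering idle state %d\n", cpu, index);
	return index;
}

static struct gov_model gov = { "shallowest", shallowest_select };
static struct drv_model drv = { 3, log_enter };

/* Mirrors the diagram: governor selection, then driver entry. */
static int idle_call_model(int cpu)
{
	int index = gov.select(drv.nr_states);   /* cpuidle_select step */

	assert(index >= 0 && index < drv.nr_states);
	return drv.enter(cpu, index);            /* cpuidle_enter step */
}

/* One iteration of the idle task, reached after switch_to() picked it. */
static int idle_task_iteration(int cpu)
{
	return idle_call_model(cpu);
}
```

Keeping the governor and the driver behind separate function-pointer tables is what lets the selection policy change without touching the platform code, which is the split the later slides want to feed with scheduler information.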
  4. Idle time measurement
     ● From the scheduler:
        − The duration the idle task is running
        − Includes the interrupt processing time
     ● From CPUidle:
        − The duration between interrupts
     ● CPUidle code runs with local interrupts disabled
     ● T(idle task) = Σ T(CPUidle) + Σ T(irqs)
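As a worked example of T(idle task) = Σ T(CPUidle) + Σ T(irqs), the sketch below sums per-period durations for one stretch of idle-task time. The function names and the microsecond values used with it are hypothetical; the point is that the two views differ by exactly the interrupt processing time.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of n durations, in microseconds. */
static unsigned long sum(const unsigned long *v, size_t n)
{
	unsigned long total = 0;

	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += v[i];
	return total;
}

/*
 * Scheduler view: the idle task runs across every CPUidle period
 * (interrupts off, CPU in an idle state) and every interrupt that
 * woke it, so T(idle task) = sum T(CPUidle) + sum T(irqs).
 */
static unsigned long idle_task_time(const unsigned long *cpuidle_us, size_t n_idle,
				    const unsigned long *irq_us, size_t n_irq)
{
	return sum(cpuidle_us, n_idle) + sum(irq_us, n_irq);
}
```

With idle periods of 800, 1200 and 500 µs separated by interrupts of 30, 20 and 50 µs, the scheduler accounts 2600 µs to the idle task while CPUidle measures 2500 µs.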
  5. Idle time measurement
  6. Idle time measurement unification
     ● What is the impact of returning to the scheduler each time an interrupt occurs?
        − The scheduler will choose the idle task again if there is nothing to do
        − Idle mainloop code simplified
        − Idle time measured nearly identically by the scheduler and cpuidle
        − Probably a negative impact on performance, to be fixed
  7. Load balance
     ● Deciding whether to balance a task when going idle
        ■ Uses avg_idle
        ■ Does not use how long the CPU will sleep
        ■ The idle state should be selected beforehand
        ■ CPUidle should report the state the CPU is about to enter
     ● Balancing a task to the idlest CPU
        ■ Does not use the CPU's exit latency
        ■ CPUidle should report the state the CPU is in
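A minimal sketch of what the two bullets ask for, assuming CPUidle exported a per-CPU snapshot of the state each CPU is in. The struct cpu_idle_info type and both functions are hypothetical: one picks the idle CPU cheapest to wake (lowest exit latency), the other decides on idle balancing from the sleep length implied by the selected state instead of avg_idle.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical per-CPU snapshot that CPUidle would export to the scheduler. */
struct cpu_idle_info {
	int idle;                       /* 1 if the CPU is currently idle */
	unsigned int exit_latency_us;   /* exit latency of the state it is in */
};

/*
 * Among idle CPUs, pick the one in the shallowest state (lowest exit
 * latency), so the balanced task starts running soonest.
 * Returns -1 if no CPU is idle.
 */
static int pick_wakeup_cpu(const struct cpu_idle_info *cpus, int nr_cpus)
{
	unsigned int best_lat = UINT_MAX;
	int best = -1;

	assert(nr_cpus > 0);
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (!cpus[cpu].idle)
			continue;
		if (cpus[cpu].exit_latency_us < best_lat) {
			best_lat = cpus[cpu].exit_latency_us;
			best = cpu;
		}
	}
	return best;
}

/*
 * Hypothetical alternative to the avg_idle heuristic: pulling a task is
 * only worth it if the CPU was going to sleep longer than a migration costs.
 */
static int worth_idle_balancing(unsigned long expected_sleep_us,
				unsigned long migration_cost_us)
{
	return expected_sleep_us >= migration_cost_us;
}
```

Ties go to the lowest-numbered CPU here; a real policy would likely also weigh topology and the energy cost of waking a deeper-sleeping cluster.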
  8. CPUidle main function
     ● Reduce the distance between the scheduler and the cpuidle framework
        − Move the idle task to kernel/sched
        − Move the cpuidle_idle function into the idle task code
        − Integrate the idle mainloop and cpuidle_idle_call
     ● Gives access to the scheduler's private structure definitions
  9. Menu governor split
     ● Wakeup events can be classified into three categories:
        1. Predictable → timers
        2. Repetitive → IOs
        3. Random → keystroke, incoming packet
     ● Category 2 could be integrated into the scheduler
  10. IO latency tracking
      ● IO events repeat within a short enough interval to be treated as predictable
  11. IO latency tracking
      ● Measurement from the scheduler
         − io_schedule
         − io_schedule_timeout
      ● Count the IO latency per task
         − Task migration moves the IO history with the task, unlike the current governor
         − Latency constraint for the task
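A minimal sketch of per-task IO latency tracking, assuming hooks where the task blocks (the io_schedule()/io_schedule_timeout() paths named on the slide) and where it is woken. The struct task_io_stats type, both hook functions and the 1/8 moving-average weight are hypothetical choices for illustration.

```c
#include <assert.h>

/* Hypothetical per-task IO latency history, stored with the task. */
struct task_io_stats {
	unsigned long long block_start_us;   /* when the task blocked on IO */
	unsigned long avg_latency_us;        /* running average IO latency */
	unsigned int nr_samples;
};

/* Would be called where the task blocks in io_schedule(). */
static void io_block(struct task_io_stats *t, unsigned long long now_us)
{
	t->block_start_us = now_us;
}

/* Would be called when the task is woken after its IO completed. */
static void io_wakeup(struct task_io_stats *t, unsigned long long now_us)
{
	long lat, diff;

	assert(now_us >= t->block_start_us);
	lat = (long)(now_us - t->block_start_us);

	if (t->nr_samples++ == 0) {
		t->avg_latency_us = (unsigned long)lat;   /* first sample */
		return;
	}
	/* Exponential moving average: the new sample is weighted 1/8. */
	diff = lat - (long)t->avg_latency_us;
	t->avg_latency_us = (unsigned long)((long)t->avg_latency_us + diff / 8);
}
```

Because the history lives in the task rather than in a per-CPU governor structure, it follows the task when it migrates, which is the advantage the slide points out.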
  12. Combine information
      ● Move the predictable event framework into the scheduler
      ● Information combined between the scheduler and the menu governor will be more accurate
         − Idle balance decision based on the idle state a CPU is in or about to enter
         − Task load tracking for idle state exit latency
         − CPU compute power and topology
         − DVFS strategies for boosting on idle state exit
  13. Scheduler + CPUidle
      ● The scheduler should have all the information needed to tell CPUidle:
         − How long the CPU will sleep
         − What the latency constraint is
      ● CPUidle should use the information provided by the scheduler to:
         − Select an idle state
         − Use the backend driver idle callback
         − No more heuristics
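Once the scheduler supplies the sleep length and the latency constraint, state selection needs no heuristics: walk the states from shallow to deep and keep the deepest one whose target residency fits the sleep and whose exit latency respects the constraint. The state table and simple_select() below are a hypothetical sketch with made-up values, not an existing governor.

```c
#include <assert.h>

struct idle_state_desc {
	unsigned int exit_latency_us;       /* time needed to wake up */
	unsigned int target_residency_us;   /* minimum sleep for it to pay off */
};

/* Hypothetical table, ordered from shallowest to deepest; values are made up. */
static const struct idle_state_desc idle_states[] = {
	{ 1, 1 },         /* e.g. WFI */
	{ 100, 500 },     /* e.g. core power down */
	{ 1000, 5000 },   /* e.g. cluster power down */
};
#define NR_IDLE_STATES ((int)(sizeof(idle_states) / sizeof(idle_states[0])))

/*
 * sleep_us and latency_req_us come from the scheduler; return the
 * deepest state that honors both. State 0 is always allowed.
 */
static int simple_select(unsigned int sleep_us, unsigned int latency_req_us)
{
	int selected = 0;

	for (int i = 1; i < NR_IDLE_STATES; i++) {
		if (idle_states[i].target_residency_us > sleep_us)
			break;
		if (idle_states[i].exit_latency_us > latency_req_us)
			break;
		selected = i;
	}
	assert(selected >= 0 && selected < NR_IDLE_STATES);
	return selected;
}
```

Stopping at the first state that does not fit relies on the table being sorted so that both latency and residency grow with depth.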
  14. Status
      ● Many cleanups around the idle mainloop
      ● CPUidle main function moved inside the idle mainloop
         − Code distance reduced; structures shared between the scheduler and cpuidle
         − Communication between subsystems made easier
  15. Work in progress
      ● First iteration of IO latency tracking implemented
         − Validation in progress
      ● Simple governor for CPUidle
         − Selects a state
      ● Idle time unification experiments
  18. CPUfreq + scheduler
      The title is misleading… CPUfreq may completely disappear in the future.
      The goal is to initiate CPU dynamic voltage & frequency scaling (DVFS) from the Linux scheduler.
      Nobody knows what this will look like, so please ask questions and raise suggestions.
  19. CPUfreq today
      • Polling workqueue
         • E.g. ondemand
         • Based on idle time / busyness
      • No relation to decisions taken by the scheduler
         • Task may be run at any time
      • No relation to the idle task
         • In fact, the task will not wake up during idle
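To make the contrast with the next slides concrete, here is a simplified ondemand-style decision as it would run on each polling sample: busyness is derived only from idle time over the sampling period, with no input from the scheduler. This is an assumption-laden sketch, not the actual ondemand code (its formula has changed across kernel versions); the frequency limits and the 80% threshold are hypothetical values.

```c
#include <assert.h>

/* Hypothetical frequency limits (kHz) and up-threshold (percent). */
#define FREQ_MIN_KHZ   200000u
#define FREQ_MAX_KHZ  1200000u
#define UP_THRESHOLD       80u

/* Busyness over one sampling period, computed only from idle time. */
static unsigned int sample_load(unsigned long wall_us, unsigned long idle_us)
{
	assert(wall_us > 0 && idle_us <= wall_us);
	return (unsigned int)(100 * (wall_us - idle_us) / wall_us);
}

/* Simplified decision taken by the timer-driven governor on each sample. */
static unsigned int next_freq_khz(unsigned int load)
{
	unsigned int f;

	if (load > UP_THRESHOLD)
		return FREQ_MAX_KHZ;   /* busy: jump straight to the maximum */
	f = (unsigned int)((unsigned long long)FREQ_MAX_KHZ * load / 100);
	return f < FREQ_MIN_KHZ ? FREQ_MIN_KHZ : f;
}
```

Whatever the exact formula, the decision only happens when the sampling timer fires, so a burst of newly enqueued work waits up to one sampling period before any frequency change.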
  20. Event-driven behavior
      • Replace the polling loop with event-driven action
      • The scheduler already takes actions that affect available compute capacity
         • Load balance
         • Migrating tasks to and from CPUs of different compute capacity
      • DVFS transitions are a natural fit
  21. Lots of work ahead
      • Method to initiate CPU DVFS transitions from the scheduler
      • Identify call sites to initiate those transitions
         • Enqueue/dequeue task
         • Load balance
         • Idle entry/exit
         • Aggressively schedule deadline tasks
         • Maybe others
      • Define the interface between the scheduler & the DVFS thingy
         • Currently a power driver in Morten’s RFC
         • Remove the CPUfreq governor layer from the power driver completely?
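A minimal sketch of the event-driven alternative, using two of the call sites listed above (enqueue and dequeue): each event updates the per-CPU utilization and immediately re-evaluates the capacity request instead of waiting for a polling timer. The struct cpu_dvfs type, the 0..1024 capacity scale and dvfs_update() are hypothetical; a real driver would also round the request to a discrete operating point.

```c
#include <assert.h>

/* Hypothetical per-CPU state; capacities on a 0..1024 scale. */
struct cpu_dvfs {
	unsigned int util;           /* summed load of runnable tasks */
	unsigned int capacity;       /* capacity at the current frequency */
	unsigned int max_capacity;
	unsigned int requests;       /* number of transitions requested */
};

/* Hypothetical hook: request the capacity the runnable tasks need. */
static void dvfs_update(struct cpu_dvfs *c)
{
	unsigned int want = c->util > c->max_capacity ? c->max_capacity : c->util;

	if (want != c->capacity) {
		c->capacity = want;   /* a real driver would change the OPP here */
		c->requests++;
	}
}

/* Call site: a task becomes runnable on this CPU. */
static void enqueue_task_model(struct cpu_dvfs *c, unsigned int task_load)
{
	c->util += task_load;
	dvfs_update(c);
}

/* Call site: a task leaves this CPU (sleep or migration). */
static void dequeue_task_model(struct cpu_dvfs *c, unsigned int task_load)
{
	assert(c->util >= task_load);
	c->util -= task_load;
	dvfs_update(c);
}
```

The transition is requested at the moment the compute demand changes, which is exactly the information the polling governor never sees.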
  22. Lots of work ahead, part 2
      • Experiment with policy
         • When and where to evaluate whether the frequency should change
         • Which metrics matter to the algorithm?
         • DVFS versus race-to-idle
      • Integrate with the power model
      • Benchmark performance & power
         • Performance regressions
         • Does it save power?
      • Make it work with non-CPUfreq mechanisms such as PSCI and ACPI for changing the CPU P-state
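The DVFS-versus-race-to-idle question is an energy comparison over a fixed time window: run fast and idle for the rest, or run slower for longer. The sketch below computes both from a simple power model; struct opp, energy_uj() and every number used with it are hypothetical parameters, not measurements, and which strategy wins depends entirely on them.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical operating point. */
struct opp {
	unsigned long freq_mhz;   /* operating frequency */
	unsigned long power_mw;   /* active power at that frequency */
};

/*
 * Energy in microjoules (mW x ms = uJ) to run work_mcycles million cycles
 * at operating point o within window_ms, idling at idle_mw for the rest.
 * Returns ULLONG_MAX if the work does not fit in the window.
 */
static unsigned long long energy_uj(struct opp o, unsigned long work_mcycles,
				    unsigned long window_ms, unsigned long idle_mw)
{
	unsigned long long busy_ms;

	assert(o.freq_mhz > 0);
	/* million cycles / MHz = seconds; x1000 for milliseconds */
	busy_ms = (unsigned long long)work_mcycles * 1000 / o.freq_mhz;
	if (busy_ms > window_ms)
		return ULLONG_MAX;   /* misses the deadline */

	return (unsigned long long)o.power_mw * busy_ms +
	       (unsigned long long)idle_mw * (window_ms - busy_ms);
}
```

With 600 Mcycles of work in a 1000 ms window and 20 mW idle power, a hypothetical 1200 MHz / 1000 mW point costs 510000 µJ (race-to-idle), while a 600 MHz point costs 400000 µJ at 400 mW but 550000 µJ at 550 mW: the answer flips with the power model, which is why the slide ties this to integrating the power model.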
  23. Morten’s power aware scheduling RFC
      • Replaces the polling loop in the CPUfreq governor with scheduler event-driven action
      • CPUfreq machine drivers are re-used initially
      • The CPUfreq governor becomes a shim layer to the power driver
  24. Nitty gritty details
      • The DVFS task is itself scheduled on a workqueue
         • It might not run for some time after the scheduler determines that a DVFS transition should happen
      • Kworker threads are filtered out
         • Prevents infinite reentrancy into the scheduler
         • CPU capacity is not changed when enqueuing and dequeuing these tasks
  25. include/linux/sched/power.h

      struct power_driver {
              /*
               * Power driver calls may happen from scheduler context with irq
               * disabled and rq locks held. This must be taken into account in
               * the power driver.
               */

              /* cpu already at max capacity? */
              int (*at_max_capacity) (int cpu);
              /* Increase cpu capacity hint */
              int (*go_faster) (int cpu, int hint);
              /* Decrease cpu capacity hint */
              int (*go_slower) (int cpu, int hint);
              /* Best cpu to wake up */
              int (*best_wake_cpu) (void);

              /* Scheduler call-back without rq lock held and with irq enabled */
              void (*late_callback) (int cpu);
      };
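A hypothetical toy driver filling in this interface, to show how the two calling contexts might be used: go_faster/go_slower may run with irqs off and rq locks held, so they only record a pending operating point, and late_callback (no rq lock, irqs on) applies it. The OPP indices, the "prefer the fastest CPU" wake policy and all toy_* names are illustrative assumptions; the struct is copied from the slide with its comments shortened.

```c
#include <assert.h>

/* The interface as shown on the slide (comments shortened). */
struct power_driver {
	int (*at_max_capacity) (int cpu);
	int (*go_faster) (int cpu, int hint);
	int (*go_slower) (int cpu, int hint);
	int (*best_wake_cpu) (void);
	void (*late_callback) (int cpu);
};

#define NR_CPUS_MODEL 4
#define NR_OPPS       4   /* OPP index 0 = slowest, NR_OPPS - 1 = fastest */

static int cur_opp[NR_CPUS_MODEL];       /* OPP currently applied */
static int pending_opp[NR_CPUS_MODEL];   /* OPP requested, not yet applied */

static int toy_at_max_capacity(int cpu)
{
	return cur_opp[cpu] == NR_OPPS - 1;
}

/* Scheduler context (irqs off, rq locks held): only record the request. */
static int toy_go_faster(int cpu, int hint)
{
	(void)hint;   /* this sketch ignores the hint and steps one OPP */
	if (pending_opp[cpu] < NR_OPPS - 1)
		pending_opp[cpu]++;
	return 0;
}

static int toy_go_slower(int cpu, int hint)
{
	(void)hint;
	if (pending_opp[cpu] > 0)
		pending_opp[cpu]--;
	return 0;
}

/* Illustrative policy: wake the CPU already running fastest. */
static int toy_best_wake_cpu(void)
{
	int best = 0;

	for (int cpu = 1; cpu < NR_CPUS_MODEL; cpu++)
		if (cur_opp[cpu] > cur_opp[best])
			best = cpu;
	return best;
}

/* No rq lock, irqs enabled: the place to do the (slow) transition. */
static void toy_late_callback(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	cur_opp[cpu] = pending_opp[cpu];
}

static const struct power_driver toy_power_driver = {
	.at_max_capacity = toy_at_max_capacity,
	.go_faster       = toy_go_faster,
	.go_slower       = toy_go_slower,
	.best_wake_cpu   = toy_best_wake_cpu,
	.late_callback   = toy_late_callback,
};
```

Splitting "decide" from "apply" this way keeps any sleeping clock or regulator work out of the locked scheduler paths, which is what the comment in the original struct warns about.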
  26. Incremental changes on top
      • Replaced the workqueue method with a per-CPU kthread
         • This allows removal of the kworker filter
         • Please commence bikeshedding over the name of this kthread
      • Use the SCHED_FIFO policy for the task
         • Will be run before the normal work (right?)
      • These patches were just validated yesterday
         • Bugs
         • Holes in logic
         • Misunderstandings
         • Voided warranties
  27. What’s next?
      • Gather more opinions on the power driver interface
         • Is go_faster/go_slower the right way?
         • Spoiler alert: probably not.
      • When else might we want to evaluate CPU frequency?
         • Idle entry/exit, as mentioned by Daniel
      • Cluster-level considerations
         • Sched domains
         • Not just per-core
         • Four Cortex-A9 cores with a single CPU clock
      • Coordinate with the power model work
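The cluster-level point can be made concrete: when cores share one clock, as with the four Cortex-A9 cores mentioned above, per-core requests must be aggregated into a single frequency for the whole clock domain. Taking the maximum request is one common rule (assumed here), since running any slower would starve the busiest core; domain_freq_khz() is a hypothetical helper.

```c
#include <assert.h>

/*
 * Frequency for a clock domain shared by nr_cpus cores: the highest
 * frequency any core in the domain requests.
 */
static unsigned int domain_freq_khz(const unsigned int *req_khz, int nr_cpus)
{
	unsigned int f = 0;

	assert(nr_cpus > 0);
	for (int i = 0; i < nr_cpus; i++)
		if (req_khz[i] > f)
			f = req_khz[i];
	return f;
}
```

A consequence is that go_slower on one core may have no effect while a sibling is busy, which is one reason a purely per-core interface is questioned on this slide.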
  28. Questions?