LCA14: LCA14-306: CPUidle & CPUfreq integration with scheduler
Resource: LCA14
Name: LCA14-306: CPUidle & CPUfreq integration with scheduler
Date: 05-03-2014
Speaker: Daniel Lezcano, Mike Turquette



Presentation Transcript

  • LCA14-306: CPUidle & CPUfreq integration with scheduler (Wed 5 March, 11:15am; Daniel Lezcano, Mike Turquette)
  • Introduction
    ● Power-aware scheduling discussion
    ● "Small task packing" patch set
      − Some information is shared between cpuidle and the scheduler
    ● "Line in the sand" drawn by Ingo Molnar
      − Integrate cpuidle and cpufreq with the scheduler first
  • CPUidle + scheduler: current design
    [diagram: the scheduler switches to the idle task (switch_to), which calls cpuidle_idle_call; CPUidle asks the governor for a state (cpuidle_select) and enters it through the CPUidle backend driver (cpuidle_enter)]
  • Idle time measurement
    ● From the scheduler:
      − How long the idle task runs
      − Includes the interrupt processing time
    ● From CPUidle:
      − The duration between interrupts
      − CPUidle code runs with local interrupts disabled
    ● T(idle task) = Σ T(CPUidle) + Σ T(irqs)
  • Idle time measurement unification
    ● What is the impact of returning to the scheduler each time an interrupt occurs?
      − The scheduler will pick the idle task again if there is nothing to do
      − The idle main loop code is simplified
      − Idle time is measured nearly the same way by the scheduler and cpuidle
      − There is probably a negative performance impact to fix
  • Load balance
    ● Deciding to balance a task when going idle
      ■ Uses avg_idle
      ■ Does not consider how long the cpu will sleep
      ■ The idle state should be selected beforehand
      ■ CPUidle should report which state the cpu will enter
    ● Balancing a task to the idlest cpu
      ■ Does not consider the cpu's exit latency
      ■ CPUidle should report which state the cpu is in
  • CPUidle main function
    ● Reduce the distance between the scheduler and the cpuidle framework
      − Move the idle task to kernel/sched
      − Move the cpuidle_idle function into the idle task code
      − Integrate the idle main loop and cpuidle_idle_call
    ● Gives access to the scheduler's private structure definitions
  • Menu governor split
    ● Events can be classified into three categories:
      1. Predictable → timers
      2. Repetitive → IOs
      3. Random → keystrokes, incoming packets
    ● Category 2 could be integrated into the scheduler
  • IO latency tracking
    ● IOs repeat within a reasonable interval, so they can be considered predictable enough
  • IO latency tracking
    ● Measured from the scheduler
      − io_schedule
      − io_schedule_timeout
    ● IO latency is accounted per task
      − Task migration moves the IO history along, unlike the current governor
      − Provides a latency constraint for the task
  • Combined information
    ● Move the predictable-event framework into the scheduler
    ● Information combined between the scheduler and the menu governor will be more accurate
      − Idle balance decisions based on the idle state a cpu is in or about to enter
      − Per-task load tracking for idle state exit latency
      − CPU compute capacity and topology
      − DVFS strategies such as a frequency boost on idle state exit
  • Scheduler + CPUidle
    ● The scheduler should have all the information needed to tell CPUidle:
      − How long the cpu will sleep
      − What the latency constraint is
    ● CPUidle should use the information provided by the scheduler to:
      − Select an idle state
      − Use the backend driver idle callback
      − Drop the heuristics
  • Status
    ● A lot of cleanups around the idle main loop
    ● CPUidle main function moved inside the idle main loop
      − Code distance reduced; structures shared between scheduler and cpuidle
      − Communication between the sub-systems made easier
  • Work in progress
    ● First iteration of IO latency tracking implemented
      − Validation in progress
    ● Simple governor for CPUidle
      − Selects a state
    ● Idle time unification experiments
  • CPUfreq + scheduler
    The title is misleading… CPUfreq may completely disappear in the future.
    The goal is to initiate CPU dynamic voltage & frequency scaling (DVFS) from the Linux scheduler.
    Nobody knows what this will look like, so please ask questions and raise suggestions.
  • CPUfreq today
    • Polling workqueue
      • e.g. ondemand
      • Based on idle time / busyness
    • No relation to decisions taken by the scheduler
      • The polling task may run at any time
    • No relation to the idle task
      • In fact, the task will not wake up during idle
  • Event driven behavior
    • Replace the polling loop with event-driven action
    • The scheduler already takes actions that affect the available compute capacity
      • Load balance
      • Migrating tasks to and from CPUs of different compute capacity
    • DVFS transitions are a natural fit
  • Lots of work ahead
    • Method to initiate CPU DVFS transitions from the scheduler
    • Identify call sites to initiate those transitions
      • Enqueue/dequeue task
      • Load balance
      • Idle entry/exit
      • Aggressively schedule deadline tasks
      • Maybe others
    • Define the interface between the scheduler & the DVFS thingy
      • Currently a power driver in Morten's RFC
      • Remove the CPUfreq governor layer from the power driver completely?
  • Lots of work ahead, part 2
    • Experiment with policy
      • When and where to evaluate whether the frequency should change
      • What metrics matter to the algorithm?
      • DVFS versus race-to-idle
      • Integrate with the power model
    • Benchmark performance & power
      • Performance regressions
      • Does it save power?
    • Make it work with non-CPUfreq mechanisms such as PSCI and ACPI for changing the CPU P-state
  • Morten's power aware scheduling RFC
    • Replaces the polling loop in the CPUfreq governor with scheduler event-driven action
    • CPUfreq machine drivers are re-used initially
    • The CPUfreq governor becomes a shim layer to the power driver
  • Nitty gritty details
    • The DVFS task is itself scheduled on a workqueue
      • It might not run for some time after the scheduler decides a DVFS transition should happen
    • Kworker threads are filtered out
      • Prevents infinite re-entrancy into the scheduler
    • CPU capacity is not changed when enqueuing and dequeuing these tasks
  • include/linux/sched/power.h

    struct power_driver {
            /*
             * Power driver calls may happen from scheduler context with irq
             * disabled and rq locks held. This must be taken into account in
             * the power driver.
             */

            /* cpu already at max capacity? */
            int (*at_max_capacity)(int cpu);
            /* Increase cpu capacity hint */
            int (*go_faster)(int cpu, int hint);
            /* Decrease cpu capacity hint */
            int (*go_slower)(int cpu, int hint);
            /* Best cpu to wake up */
            int (*best_wake_cpu)(void);
            /* Scheduler call-back without rq lock held and with irq enabled */
            void (*late_callback)(int cpu);
    };
  • Incremental changes on top
    • Replaced the workqueue method with a per-CPU kthread
      • This allows removal of the kworker filter
      • Please commence bikeshedding over the name of this kthread
    • Use the SCHED_FIFO policy for the task
      • It will run before the normal work (right?)
    • These patches were only validated yesterday
      • Bugs
      • Holes in logic
      • Misunderstandings
      • Voided warranties
  • What's next?
    • Gather more opinions on the power driver interface
      • Is go_faster/go_slower the right way?
      • Spoiler alert: probably not.
    • When else might we want to evaluate CPU frequency?
      • Idle entry/exit, as mentioned by Daniel
    • Cluster-level considerations
      • Sched domains
      • Not just per-core
      • Four Cortex-A9s with a single CPU clock
    • Coordinate with the power model work
  • Questions?
  • More about Linaro Connect: More about Linaro: More about Linaro engineering: Linaro members: