
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE



Today Xen schedules guest virtual cpus on all available physical cpus independently of each other. Recent security issues in modern processors (e.g. L1TF) require turning off hyperthreading for best security, in order to avoid leaking information from one hyperthread to the other. One way to avoid having to turn off hyperthreading is to only ever schedule virtual cpus of the same guest on one physical core at the same time. This is called core scheduling.

This presentation shows results from the effort to implement core scheduling in the Xen hypervisor. The basic modifications to Xen are presented and performance numbers with core scheduling active are shown.



  1. Core scheduling in Xen, Jürgen Groß, Virtualization Kernel Developer, SUSE Linux GmbH, jgross@suse.com
  2. Agenda • What is “Core scheduling”? • Motivation • How does it work? • Performance numbers • Current state
  3. What is “Core scheduling”?
  4. Today: Cpu scheduling • On each physical cpu the scheduler decides which vcpu is to be scheduled next • When taking other physical cpus into account, only the overall load of the system is looked at • Each vcpu can run on any physical cpu, within some constraints (cpupools, pinning)
  5. Cpu scheduling [diagram: four cores with two threads each and Dom0's eight vcpus distributed over the individual threads; blocked vcpus are marked]
  6. Cpu scheduling [diagram: the same picture with the four vcpus of a domU added]
  7. Cpu scheduling [diagram: the same picture with a run queue (runq) for vcpus waiting to be scheduled]
  8. Core scheduling • The scheduler no longer acts on (v)cpus, but on (v)cores • All siblings (threads) of a core are scheduled together; scheduling for all siblings of a single core is synchronized • The relation between vcores and vcpus is fixed, in contrast to “core aware scheduling”, where it might change • Pinning and cpupools affect cores (so e.g. pinning a vcpu to a specific physical cpu will pin all vcpus of the same vcore); a sketch of such a scheduling unit follows below
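To make the fixed vcpu-to-vcore relation concrete, here is a minimal, hedged sketch of a structure grouping vcpus into one schedulable unit; the field names are illustrative and need not match those of the actual Xen series:

```c
/* Hedged sketch only, field names are illustrative.  One unit groups as
 * many vcpus of a domain as there are threads per core; the scheduler
 * only ever deals with whole units. */
struct sched_unit {
    struct domain      *domain;        /* owning guest                        */
    struct vcpu        *vcpu_list;     /* the unit's vcpus, one per thread    */
    unsigned int        unit_id;       /* corresponds to the vcore number     */
    struct sched_unit  *next_in_list;  /* next unit of the same domain        */
    /* ... run state, per-scheduler private data, ...                         */
};

/* With two threads per core, vcpus 2n and 2n+1 always belong to unit n, so
 * pinning one of them to a physical core implicitly pins its sibling too. */
```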
  9. Core scheduling [diagram: the same machine, but Dom0's vcpus are now placed in pairs (vcpu0/1, vcpu2/3, ...) on whole cores; blocked vcpus are marked]
  10. Core scheduling [diagram: the same picture with the domU's vcpu pairs added and a run queue (runq) for units waiting to be scheduled]
  11. Motivation
  12. Cpu bugs • Several cpu bugs (e.g. L1TF, MDS) involve side channel attacks which can steal data from threads of the same core • Core scheduling prohibits cross-domain side channel attacks of this kind • This lays the groundwork for safe operation with SMT enabled
  13. Fairness of accounting • Threads running on the same core share multiple resources (execution units, TLB, caches), so they influence each other's performance • With cpu scheduling a guest's cpu performance therefore depends on the host load, not just on the guest's own load • If the owner of the guest has to pay for the used cpu time, the price will depend on host load
  14. Guest side optimizations • The guest might decide to run only one thread of a core in order to make all resources of that core available to a single thread • Some threads might be able to benefit from shared resources when running on the same core • The guest might want to mitigate cpu bugs via core-aware scheduling of its own
  15. How does it work?
  16. Decoupling scheduling from cpus • In the schedulers switch: ‒ vcpu→sched_unit ‒ pcpu→sched_resource • Scheduling decisions are the same as before • The amount of needed changes in sched_*.c is rather high, but mostly mechanical • schedule.c acts as abstraction layer for the rest of the hypervisor • A sched_resource can be a cpu, a core or a socket (see the sketch below)
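As a hedged illustration of this abstraction, a sched_resource bundles the physical cpus the scheduler treats as one scheduling target; the fields and hook signatures below are a sketch, not necessarily those of the actual series:

```c
/* Hedged sketch, names illustrative: the entity a sched_unit is assigned
 * to.  Depending on the granularity it stands for one cpu, a core or a
 * whole socket. */
struct sched_resource {
    struct scheduler   *scheduler;    /* scheduler driving this resource     */
    struct sched_unit  *curr;         /* unit currently running on it        */
    unsigned int        master_cpu;   /* cpu doing the scheduling work       */
    cpumask_t          *cpus;         /* all physical cpus of the resource   */
};

/* A former per-vcpu hook of an individual scheduler (sched_credit2.c etc.)
 * then changes from something like
 *     void (*insert_vcpu)(const struct scheduler *ops, struct vcpu *v);
 * to
 *     void (*insert_unit)(const struct scheduler *ops, struct sched_unit *u);
 * while the decision logic inside the hook stays unchanged. */
```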
  17. Syncing of context switches • When switching vcpus on a cpu, all other vcpus of the same sched_unit must be switched on all other cpus of the sched_resource • Syncing is done in 2 steps: ‒ after the decision to switch is made, all other cpus of the sched_resource must rendezvous ‒ the context switch is performed on all affected cpus in parallel, after which all cpus rendezvous again before proceeding • At no time are two vcpus of different sched_units running in guest mode on the same sched_resource
  18. Syncing of context switches 1. Schedule event on one cpu 2. Take the schedule lock, call the scheduler to select the next sched_unit to run 3. If nothing changes, drop the lock and exit; otherwise signal the other cpus of the sched_resource to process a schedule_slave event, then drop the lock and wait for the others to join 4. The last one to join switches the sched_unit on the sched_resource and frees the others to continue 5. On each cpu of the sched_resource the context is switched to the new vcpu 6. Wait on each cpu until all context switches are done, then leave schedule handling (see the sketch below)
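A hedged sketch of the two rendezvous points described above, using plain atomic counters; the counter fields (assumed to live in struct sched_resource) and the helper names are made up, and the real series handles locking, softirq handling and re-arming of the counters differently:

```c
/* Hedged sketch of the synchronized context switch.  Both counters are
 * assumed to have been initialised to the number of cpus of the resource
 * before the schedule event; re-arming them for the next pass is omitted. */
static void sched_slave(struct sched_resource *sr)
{
    struct vcpu *next;

    /* Rendezvous 1 (steps 3/4): the last cpu to arrive installs the new
     * unit and releases the waiting siblings. */
    if ( atomic_dec_and_test(&sr->rendezvous_in_cnt) )
    {
        sr->curr = sr->next_unit;            /* unit picked by the scheduler */
        smp_wmb();
        write_atomic(&sr->go, 1);
    }
    else
        while ( !read_atomic(&sr->go) )
            cpu_relax();

    /* Step 5: each cpu switches to its own vcpu of the new unit. */
    next = sched_unit_vcpu(sr->curr, smp_processor_id());   /* made-up helper */
    context_switch(current, next);
}

/* Called when a cpu has completed its context switch (step 6): nobody
 * leaves schedule handling before all siblings are done as well. */
static void sched_context_switched(struct sched_resource *sr)
{
    atomic_dec(&sr->rendezvous_out_cnt);
    while ( atomic_read(&sr->rendezvous_out_cnt) )
        cpu_relax();
}
```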
  19. Idle vcpus • A guest vcpu becoming idle results in the idle vcpu being scheduled on its cpu • Only if the scheduler decides to switch sched_units is a synchronized context switch needed • No change of address space when switching between idle and guest vcpus without a sched_unit switch (no change on x86) • An idle vcpu running within a guest sched_unit won't run tasklets or do livepatching, in order to avoid doing work on behalf of other guests (see the sketch below)
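A hedged sketch of how the idle loop can honour that rule; unit_is_true_idle() is a made-up helper, and the surrounding structure only loosely follows the real x86 idle loop:

```c
/* Hedged sketch: only do global work (tasklets, livepatching) when this
 * cpu runs the real idle unit.  When the idle vcpu merely fills a sibling
 * slot of a guest's sched_unit it stays passive, so no work is done on
 * behalf of other guests on this core.  unit_is_true_idle() is made up. */
static void idle_loop(void)
{
    for ( ; ; )
    {
        if ( unit_is_true_idle(current->sched_unit) )
        {
            do_tasklet();
            check_for_livepatch_work();
        }

        /* Halt until an interrupt arrives, unless work is already pending. */
        if ( !softirq_pending(smp_processor_id()) )
            safe_halt();

        do_softirq();    /* a pending SCHEDULE softirq is handled here */
    }
}
```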
  20. Cpupools • Only complete sched_resources can be moved from/to cpupools (see the sketch below) • For easy support of cpu hotplug, cpus not in any cpupool are not handled in units of sched_resources, but individually • At system boot cpupool0 is created only after all cpus have been brought online, as otherwise the number of cpus per sched_resource isn't known yet • Cpus not in any pool are no longer handled by the default scheduler, but by the new idle_scheduler
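A hedged sketch of the "only complete sched_resources" rule using Xen's cpumask helpers; cpupool_free_cpus() is a made-up name for the set of cpus currently assigned to no pool:

```c
/* Hedged sketch, not the code of the actual series: with sched-gran=core a
 * cpu may only join a cpupool together with all of its siblings, so the
 * sched_resource always moves as a whole. */
static bool sched_res_complete_and_free(unsigned int cpu)
{
    const cpumask_t *siblings = per_cpu(cpu_sibling_mask, cpu);

    /* All threads of the core have to be online ... */
    if ( !cpumask_subset(siblings, &cpu_online_map) )
        return false;

    /* ... and none of them may already belong to another cpupool. */
    return cpumask_subset(siblings, cpupool_free_cpus());   /* made-up helper */
}
```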
  21. Cpu hotplug • SMT on/off switching at runtime is disabled with core scheduling active • Offlining a cpu will now first remove the related sched_resource from cpupool0 if necessary • Onlining a cpu will add it to cpupool0 only once the complete sched_resource is online
  22. Cpupools and cpu hotplug [diagram: cpupool0 containing all four cores with both threads each]
  23. Cpupools and cpu hotplug [diagram: the same machine with one core taken out of cpupool0]
  24. Cpupools and cpu hotplug [diagram: the same machine with only one thread of that core still present]
  25. Performance numbers
  26. Test basics • All tests were done by Dario Faggioli (SUSE) • Test machine was a 4-core system with HT (8 cpus) • Dom0 always with 8 vcpus, HVM domU with 4 or 8 vcpus • Scenarios (all results compared to “without patches, HT on”; positive numbers are better): ‒ without patches (HT on/off) ‒ sched-gran=cpu (HT on/off) ‒ sched-gran=core (see the boot-line example below) • Benchmarks: ‒ Stream (memory benchmark, 4 tasks in parallel) ‒ Kernbench (kernel build with 2, 4, 8 or 16 threads) ‒ Hackbench (communication via pipes, machine saturated) ‒ Mutilate (load generator for memcached) ‒ Netperf (TCP/UDP/UNIX, two communicating tasks) ‒ Pgioperf (postgres micro-benchmark)
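For reference, the scheduling granularity used in these scenarios is chosen on the Xen boot command line; a minimal, hedged configuration example, assuming a distribution whose grub scripts honour GRUB_CMDLINE_XEN_DEFAULT (the option spelling is taken from the slides):

```
# /etc/default/grub (hedged example)
GRUB_CMDLINE_XEN_DEFAULT="sched-gran=core"     # or sched-gran=cpu
# afterwards regenerate grub.cfg, e.g. grub2-mkconfig -o /boot/grub2/grub.cfg
```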
  27. Dom0 only (ranges relative to “unpatched, HT on”):
                  Unpatched, no-HT     gran=cpu, HT         gran=cpu, no-HT      gran=core
      Stream      -0.06% … +0.11%      +0.11% … +0.50%      -0.02% … +1.11%      -7.37% … -3.82%
      Kernbench   -36.9% … +7.07%      -0.61% … +0.21%      -36.81% … +6.77%     -5.98% … -0.01%
      Hackbench   -67.08% … -43.44%    -4.79% … +7.35%      -68.30% … -37.19%    -3.22% … +5.38%
      Mutilate    -20.65% … +10.05%    -0.63% … -0.08%      -19.70% … +11.26%    -11.40% … -2.23%
      Netperf     -0.37% … +3.14%      -4.38% … +1.00%      -5.01% … +1.71%      -33.08% … +6.71%
      Pgioperf    -14.01% … -6.63%     -12.54% … +1.15%     -11.04% … +3.09%     -6.71% … -4.04%
  28. HVM domU, 4 vcpus (ranges relative to “unpatched, HT on”):
                  Unpatched, no-HT     gran=cpu, HT         gran=cpu, no-HT      gran=core
      Stream      -6.67% … -0.25%      -6.86% … -5.38%      -1.35% … +0.23%      -16.81% … -8.35%
      Kernbench   +1.17% … +14.52%     +1.14% … +6.31%      -0.03% … +13.52%     -39.96% … -13.99%
      Hackbench   -8.12% … +26.34%     -33.51% … +10.54%    -11.78% … +24.71%    -43.25% … -4.07%
      Mutilate    -0.49% … +9.76%      -4.49% … -0.12%      -3.12% … +8.80%      -16.66% … -8.48%
      Netperf     -8.04% … +11.83%     -41.63% … +2.55%     -10.78% … +17.42%    -26.58% … +4.74%
      Pgioperf    -1.47% … +3.57%      -29.63% … +1.77%     +0.28% … +5.48%      +0.10% … +13.85%
  29. HVM domU, 8 vcpus (ranges relative to “unpatched, HT on”):
                  Unpatched, no-HT     gran=cpu, HT         gran=cpu, no-HT      gran=core
      Stream      +2.82% … +6.84%      +0.47% … +5.07%      +4.52% … +5.73%      -14.91% … -9.55%
      Kernbench   -46.41% … +6.04%     +0.46% … +1.70%      -46.42% … +6.25%     -6.91% … +0.19%
      Hackbench   -50.23% … +4.17%     -14.08% … +14.06%    -48.40% … +7.08%     -16.51% … +11.06%
      Mutilate    -68.33% … -6.48%     -1.11% … +2.33%      -66.81% … -3.00%     -45.50% … -6.17%
      Netperf     -11.87% … +25.95%    -15.48% … +14.57%    -8.64% … +4.58%      -18.00% … +1.81%
      Pgioperf    +0.79% … +94.25%     -1.62% … +19.02%     -0.44% … +83.68%     -49.56% … +0.51%
  30. 2 * HVM domU, 8 vcpus (ranges relative to “unpatched, HT on”):
                  Unpatched, no-HT     gran=cpu, HT         gran=cpu, no-HT      gran=core
      Stream      -26.13% … -22.94%    -0.87% … +1.45%      -25.48% … -22.58%    -13.34% … -6.37%
      Kernbench   -50.26% … -48.38%    -0.24% … -0.13%      -51.79% … -49.89%    -23.98% … -17.84%
      Hackbench   +15.02% … +35.59%    -2.28% … +5.42%      +10.41% … +34.48%    -12.19% … +16.91%
      Mutilate    -93.85% … -56.82%    -2.19% … +8.57%      -91.89% … -57.33%    -83.70% … -13.03%
      Netperf     -50.48% … -15.77%    -16.39% … +7.61%     -48.31% … -18.41%    -36.22% … +4.41%
      Pgioperf    -7.32% … -2.18%      -231.22% … +0.30%    -1605.80% … -5.63%   -6035.64% … -30.76%
  31. Current state
  32. Patches already committed • Removal of cpu on/offlining hooks in schedule.c and cpupool.c for suspend/resume handling • Small correction in sched_credit2.c for SMT-aware scheduling (needed for core scheduling) • Inline wrappers for calling per-scheduler functions from schedule.c • Test for mandatory per-scheduler functions instead of ASSERT() • Interface change for the sched_switch_sched() per-scheduler function, avoiding code duplication
  33. Patches in review • V1 of the (rest-)series, currently 57 patches • 40 files changed, 3704 insertions(+), 2299 deletions(-) • Only small parts of V1 have been reviewed so far (thanks to all reviewers!) • All comments on RFC-V1 and RFC-V2 have been addressed, which required some major reworks (partially due to renaming requests, partially conceptual ones)
  34. Future plans • Rework of scheduler related files (move to common/sched/, making sched-if.h really scheduler-private) • ARM support • Support of per-cpupool scheduling granularity • Support of per-cpupool SMT setting • Sane topology reporting to the guests • Add hypercall syncing between threads for full L1TF/MDS mitigation (probably kills performance)
