Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

These slides were presented at the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE'16).

Virtual Machine based approaches to workload consolidation, as seen in IaaS clouds as well as datacenter platforms, have long had to contend with performance degradation caused by synchronization primitives inside the guest environments. These primitives can be affected by virtual CPU preemptions by the host scheduler that can introduce delays that are orders of magnitude longer than those primitives were designed for. While a significant amount of work has focused on the behavior of spinlock primitives as a source of these performance issues, spinlocks do not represent the entirety of synchronization mechanisms that are susceptible to scheduling issues when running in a virtualized environment. In this paper we address the virtualized performance issues introduced by TLB shootdown operations. Our profiling study, based on the PARSEC benchmark suite, has shown that up to 64% of a VM's CPU time can be spent on TLB shootdown operations under certain workloads. In order to address this problem, we present a paravirtual TLB shootdown scheme named Shoot4U. Shoot4U completely eliminates TLB shootdown preemptions by invalidating guest TLB entries from the VMM and allowing guest TLB shootdown operations to complete without waiting for remote virtual CPUs to be scheduled. Our performance evaluation using the PARSEC benchmark suite demonstrates that Shoot4U can reduce benchmark runtime by up to 85% compared to an unmodified Linux kernel, and up to 44% over a state-of-the-art paravirtual TLB shootdown scheme.

Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs

  1. VEE'16, 04/02/2016. Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs. Jiannan Ouyang, John Lange (University of Pittsburgh); Haoqiang Zheng (VMware Inc.)
  2. CPU Consolidation in the Cloud. CPU consolidation: multiple virtual CPUs (vCPUs) share the same physical CPU (pCPU). Motivation: improve datacenter utilization. [Figure 1: average activity distribution of typical shared Google clusters, including online services, each containing over 20,000 servers, over a period of 3 months [Barroso 13].]
  3. Problems with Preempted vCPUs. [Diagram: vCPUs of VM-A and VM-B time-sharing pCPUs; at any moment each vCPU is either running or preempted.] Performance problems stem from busy-waiting based kernel synchronization operations: the Lock Holder Preemption problem, the Lock Waiter Preemption problem, and the TLB Shootdown Preemption problem.
  4. Lock Holder Preemption. Lock holder preemption [Uhlig 04, Friebel 08]: a preempted vCPU holds a spinlock, causing dramatically longer lock waiting times (context switch latency plus the CPU shares allocated to other vCPUs). Scheduling techniques: co-scheduling and relaxed co-scheduling [VMware 10], adaptive co-scheduling [Weng HPDC'11], balanced scheduling [Sukwong EuroSys'11], and demand-based coordinated scheduling [Kim ASPLOS'13]. Hardware-assisted techniques: Intel Pause-Loop Exiting (PLE) [Riel 11].
  5. Lock Waiter Preemption [Ouyang VEE'13]. Linux uses a FIFO-order fair spinlock, the ticket spinlock. Lock waiter preemption: a preempted lock waiter blocks the queue behind it, and P(waiter preemption) > P(holder preemption). Preemptable ticket spinlock: the key idea is a proportional timeout, where waiters at queue positions i, i+1, i+2, i+3 get timeouts of roughly 0, T, 2T, 3T. (A simplified sketch of the proportional timeout appears after the slides.)
  6. TLB Shootdown Preemption. KVM paravirtual remote TLB flush [kvmtlb 12]: the VMM maintains vCPU preemption states and shares them with the guest; the guest uses the conventional approach if the remote vCPU is running and defers the TLB flush if the remote vCPU is preempted; drawback: the preemption state may change right after the check. TLB shootdown IPIs have also been used as scheduling heuristics [Kim ASPLOS'13]. Shoot4U: the goal is to eliminate the problem; the key idea is to invalidate guest TLB entries from the VMM. (A sketch of the deferral scheme appears after the slides.)
  7. Contributions: an analysis of the impact that various low-level synchronization operations have on system benchmark performance; Shoot4U, a novel virtualized TLB architecture that ensures consistently low latencies for synchronized TLB operations; and an evaluation of the performance benefits achieved by Shoot4U over current state-of-the-art software and hardware-assisted approaches.
  8. Performance Analysis.
  9. Overhead of CPU Consolidation. [Chart: slowdown of PARSEC benchmarks when run with a co-located VM versus running alone (12-core VMs, measured on Linux/KVM with PLE disabled); the worst case reaches 70.6x, far above the ideal slowdown line.]
  10. CPU Usage Profiling (perf). [Chart: per-benchmark breakdown of CPU time (percent) into k:lock, k:tlb, k:other, and u:* for the 1-VM and 2-VM configurations across the PARSEC benchmarks.]
  11. CDF of TLB Shootdown Latency (ktap). [Chart: cumulative distribution of TLB shootdown latency (us, log scale) for the 1-VM and 2-VM configurations.]
  12. How TLB Shootdown Works in VMs. The TLB (Translation Lookaside Buffer) is a per-core hardware cache of page table translation results, and TLB coherence is managed by the OS. A TLB shootdown is an IPI plus invlpg, and Linux TLB shootdowns are busy-waiting based. [Diagram: in a VM the sequence is (1) virtual IPIs, (2) trap to the VMM, (3) physical IPIs, (4) virtual interrupt injection, (5) the target vCPU must be scheduled, which is where TLB Shootdown Preemption occurs, and (6) invalidation and ACK.] (A sketch of the busy-waiting shootdown path appears after the slides.)
  13. Shoot4U.
  14. Shoot4U. Observation: modern hardware allows the VMM to invalidate guest TLB entries (e.g., Intel INVVPID). Key idea: invalidate guest TLB entries from the VMM. The guest tells the VMM which TLB entries and vCPUs to invalidate via a hypercall; the VMM performs the invalidation and returns, with no interrupt injection and no waiting. [Diagram: (1) hypercall <vcpu set, addr range>, (2) physical IPIs, (3) invalidation and ACK, all inside the VMM.]
  15. Implementation. KVM/Linux 3.16, ~200 LOC (~50 LOC on the guest side); https://github.com/ouyangjn/shoot4u. Guest: use a hypercall for TLB shootdowns. VMM: the hypercall handler maps the vCPU set to a pCPU set and sends IPIs; the IPI handler invalidates guest TLB entries with INVVPID. Shoot4U API: kvm_hypercall3(KVM_HC_SHOOT4U, unsigned long vcpu_bitmap, unsigned long start_addr, unsigned long end_addr). (A guest/VMM sketch appears after the slides.)
  16. Evaluation. Hardware: a dual-socket Dell R450 server with 6-core Intel "Ivy Bridge" Xeon processors with hyperthreading, 24 GB RAM split across two NUMA domains, running CentOS 7 (Linux 3.16). Virtual machines: 12 vCPUs and 4 GB RAM on the same socket, running Fedora 19 (Linux 4.0); VM1 runs the PARSEC benchmark suite and VM2 runs the sysbench CPU test. Schemes: baseline (unmodified Linux kernel), kvmtlb [kvmtlb 12], Shoot4U, Pause-Loop Exiting (PLE) [Riel 11], and Preemptable Ticket Spinlock (PMT) [Ouyang VEE'13].
  17. TLB Shootdown Latency (Cycles). [Figure: TLB shootdown latency in cycles; Shoot4U achieves an order of magnitude lower latency.]
  18. TLB Shootdown Latency (CDF). [Chart: cumulative distribution of TLB shootdown latency (us, log scale) for shoot4u, kvmtlb, and baseline in the 1-VM and 2-VM configurations.]
  19. PARSEC Performance (2 VMs). [Chart: normalized execution time per PARSEC benchmark for baseline, ple, pmt, pmt+kvmtlb, pmt+shoot4u, and ple+pmt+shoot4u.]
  20. Revisiting Performance Slowdown. [Chart: PARSEC slowdown for baseline (up to 70.6x) versus ple+pmt+shoot4u.]
  21. Revisiting CPU Usage Profiling. [Chart: per-benchmark breakdown of CPU time into k:lock, k:tlb, k:other, and u:* for baseline (2 VMs) versus ple+pmt+shoot4u (2 VMs).]
  22. Conclusions. We conducted a set of experiments to provide a breakdown of the overheads caused by preempted virtual CPUs, showing that TLB operations can have a significant impact on performance for certain workloads. We presented Shoot4U, an optimization for TLB shootdown operations that internalizes TLB shootdowns in the VMM and so no longer requires the involvement of a guest's vCPUs. Our evaluation demonstrates the effectiveness of our approach and illustrates how, under certain workloads, it is dramatically better than state-of-the-art techniques.
  23. https://github.com/ouyangjn/shoot4u
  24. Q & A. Jiannan Ouyang, Ph.D. candidate, University of Pittsburgh, ouyang@cs.pitt.edu, http://www.cs.pitt.edu/~ouyang/. The Prognostic Lab, University of Pittsburgh, http://www.prognosticlab.org (Pisces Co-Kernel, Kitten Lightweight Kernel, Palacios VMM).
  25. References. [Ouyang 13] Jiannan Ouyang and John R. Lange. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud. In Proc. 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2013. [Uhlig 04] Volkmar Uhlig, Joshua LeVasseur, Espen Skoglund, and Uwe Dannowski. Towards Scalable Multiprocessor Virtual Machines. In Proc. 3rd Virtual Machine Research and Technology Symposium (VM), 2004. [Friebel 08] Thomas Friebel. How to Deal with Lock-Holder Preemption. Presented at the Xen Summit North America, 2008. [Kim ASPLOS'13] H. Kim, S. Kim, J. Jeong, J. Lee, and S. Maeng. Demand-based Coordinated Scheduling for SMP VMs. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.
  26. [VMware 10] VMware vSphere: The CPU Scheduler in VMware ESX 4.1. Technical report, VMware, Inc., 2010. [Barroso 13] L. A. Barroso, J. Clidaras, and U. Holzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2013. [Weng HPDC'11] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic Adaptive Scheduling for Virtual Machines. In Proc. 20th International Symposium on High Performance Parallel and Distributed Computing (HPDC), 2011. [Sukwong EuroSys'11] O. Sukwong and H. S. Kim. Is Co-scheduling Too Expensive for SMP VMs? In Proc. 6th European Conference on Computer Systems (EuroSys), 2011.
  27. [Riel 11] R. van Riel. Directed Yield for Pause Loop Exiting, 2011. URL http://lwn.net/Articles/424960/. [kvmtlb 12] KVM Paravirt Remote Flush TLB. https://lwn.net/Articles/500188/.
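
Sketch for slide 5: a minimal C illustration of the proportional-timeout idea behind the preemptable ticket spinlock. The structure and names (pmt_spinlock_t, yield_to_hypervisor, TIMEOUT_BASE) are assumptions for illustration, not the code from [Ouyang 13]; the full algorithm also relaxes strict FIFO order after a timeout, which is omitted here.

    /* Illustrative only: a ticket lock whose waiters spin for a budget
     * proportional to their distance from the queue head (0, T, 2T, ...),
     * then yield the pCPU; helper names are hypothetical. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define TIMEOUT_BASE 1000          /* spins granted per queue position (placeholder) */

    typedef struct {
        atomic_uint next;              /* next ticket to hand out */
        atomic_uint owner;             /* ticket currently allowed into the critical section */
    } pmt_spinlock_t;

    extern void yield_to_hypervisor(void);   /* e.g., a yield hypercall (assumed) */

    void pmt_lock(pmt_spinlock_t *lock)
    {
        unsigned int ticket = atomic_fetch_add(&lock->next, 1);

        for (;;) {
            unsigned int owner = atomic_load(&lock->owner);
            if (owner == ticket)
                return;                                   /* our turn */

            /* Proportional timeout: k positions from the head => ~k * T spins. */
            uint64_t budget = (uint64_t)(ticket - owner) * TIMEOUT_BASE;
            while (budget-- > 0 && atomic_load(&lock->owner) != ticket)
                ;                                         /* busy-wait */
            if (atomic_load(&lock->owner) != ticket)
                yield_to_hypervisor();                    /* give up the pCPU for a while */
        }
    }

    void pmt_unlock(pmt_spinlock_t *lock)
    {
        atomic_fetch_add(&lock->owner, 1);                /* pass the lock to the next ticket */
    }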
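
Sketch for slide 6: roughly how the KVM paravirtual remote-TLB-flush deferral described on the slide could look. The shared-state layout and helper names (shared_vcpu_state, defer_tlb_flush, send_shootdown_ipi, wait_for_acks) are assumptions, not the upstream [kvmtlb 12] interface.

    /* Illustrative only: the guest checks VMM-shared preemption state and
     * defers flushes to preempted vCPUs; the race noted on the slide is
     * that the state can change right after the check. */
    #include <stdbool.h>

    #define MAX_VCPUS 64

    struct shared_vcpu_state {
        unsigned int flags;                 /* updated by the VMM on schedule/preempt */
    };
    #define VCPU_PREEMPTED 0x1

    /* Hypothetical stand-ins for kernel/KVM internals. */
    extern struct shared_vcpu_state vcpu_state[MAX_VCPUS];
    extern void defer_tlb_flush(int vcpu, unsigned long start, unsigned long end);
    extern void send_shootdown_ipi(int vcpu, unsigned long start, unsigned long end);
    extern void wait_for_acks(const bool *ipi_sent);

    void flush_tlb_others_paravirt(const bool *vcpu_mask,
                                   unsigned long start, unsigned long end)
    {
        bool ipi_sent[MAX_VCPUS] = { false };

        for (int cpu = 0; cpu < MAX_VCPUS; cpu++) {
            if (!vcpu_mask[cpu])
                continue;
            if (vcpu_state[cpu].flags & VCPU_PREEMPTED) {
                /* Remote vCPU is not running: record a deferred flush that it
                 * performs when it is scheduled back in. */
                defer_tlb_flush(cpu, start, end);
            } else {
                /* Remote vCPU is running: conventional IPI-based shootdown. */
                send_shootdown_ipi(cpu, start, end);
                ipi_sent[cpu] = true;
            }
        }
        wait_for_acks(ipi_sent);            /* busy-wait only for vCPUs we IPI'd */
    }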
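
Sketch for slide 12: the conventional busy-waiting shootdown path in the shape of Linux's flush_tlb_others, to make the waiting step concrete. Helper names (send_flush_ipi, flush_acked, cpu_relax) follow the general pattern but are illustrative, not the actual kernel code.

    /* Illustrative only: send flush IPIs, then busy-wait for every target
     * to run its handler (invlpg over the range) and ACK.  In a VM, a
     * preempted target vCPU stalls the waiting loop until the host
     * scheduler runs it again: the TLB Shootdown Preemption problem. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_CPUS 64

    extern void send_flush_ipi(int cpu, unsigned long start, unsigned long end);
    extern atomic_bool flush_acked[MAX_CPUS];   /* set by each IPI handler after invlpg */

    static inline void cpu_relax(void) { /* pause hint omitted in this sketch */ }

    void flush_tlb_others(const bool *cpu_mask, unsigned long start, unsigned long end)
    {
        /* 1. IPI every CPU that may cache the stale translations. */
        for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
            if (cpu_mask[cpu]) {
                atomic_store(&flush_acked[cpu], false);
                send_flush_ipi(cpu, start, end);
            }
        }

        /* 2. Busy-wait until every target has invalidated and ACKed. */
        for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
            if (cpu_mask[cpu]) {
                while (!atomic_load(&flush_acked[cpu]))
                    cpu_relax();
            }
        }
    }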
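
Sketch for slide 15: the two halves of Shoot4U as described on the slide, one hypercall on the guest side and an invalidation handler on the VMM side. Only KVM_HC_SHOOT4U and the three hypercall arguments come from the slide; the hypercall number, shoot4u_flush_tlb_others, kvm_handle_shoot4u, vcpu_to_pcpu, and send_invalidate_ipi are illustrative assumptions, and the real code lives in the linked repository.

    /* Illustrative only: the guest issues one hypercall instead of the
     * IPI + busy-wait sequence; the VMM maps vCPUs to pCPUs, IPIs them,
     * and invalidates the guest's TLB entries with INVVPID. */

    #define KVM_HC_SHOOT4U 20        /* placeholder hypercall number (assumption) */

    /* Declared static inline in the guest kernel's kvm_para.h; shown as an
     * extern prototype here to keep the sketch self-contained. */
    extern long kvm_hypercall3(unsigned int nr, unsigned long p1,
                               unsigned long p2, unsigned long p3);

    /* Guest side: replaces the IPI-based remote flush path. */
    void shoot4u_flush_tlb_others(unsigned long vcpu_bitmap,
                                  unsigned long start, unsigned long end)
    {
        kvm_hypercall3(KVM_HC_SHOOT4U, vcpu_bitmap, start, end);
    }

    /* VMM side (outline): hypothetical helpers for the two handler steps. */
    extern int  vcpu_to_pcpu(int vcpu);                       /* where is this vCPU now? */
    extern void send_invalidate_ipi(int pcpu, unsigned long start,
                                    unsigned long end);       /* IPI handler runs INVVPID */

    void kvm_handle_shoot4u(unsigned long vcpu_bitmap,
                            unsigned long start, unsigned long end)
    {
        for (int vcpu = 0; vcpu < (int)(8 * sizeof(vcpu_bitmap)); vcpu++) {
            if (vcpu_bitmap & (1UL << vcpu))
                send_invalidate_ipi(vcpu_to_pcpu(vcpu), start, end);
        }
        /* Returns to the calling vCPU once the physical IPIs complete;
         * preempted target vCPUs never need to be scheduled. */
    }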
