These slides were presented at the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15)
Performance isolation is emerging as a requirement for High Performance Computing (HPC) applications, particularly as HPC architectures turn to in situ data processing and application composition techniques to increase system throughput. These approaches require the co-location of disparate workloads on the same compute node, each with different resource and runtime requirements. In this paper we claim that these workloads cannot be effectively managed by a single Operating System/Runtime (OS/R). Therefore, we present Pisces, a system software architecture that enables the co-existence of multiple independent and fully isolated OS/Rs, or enclaves, that can be customized to address the disparate requirements of next-generation HPC workloads. Each enclave consists of a specialized lightweight OS co-kernel and runtime, which is capable of independently managing partitions of dynamically assigned hardware resources. Contrary to other co-kernel approaches, in this work we consider performance isolation to be a primary requirement, and we present a novel co-kernel architecture to achieve this goal. We further present a set of design requirements necessary to ensure performance isolation, including: (1) elimination of cross OS dependencies, (2) internalized management of I/O, (3) limiting cross enclave communication to explicit shared memory channels, and (4) using virtualization techniques to provide missing OS features. The implementation of the Pisces co-kernel architecture is based on the Kitten Lightweight Kernel and the Palacios Virtual Machine Monitor, two system software architectures designed specifically for HPC systems. Finally, we show that lightweight isolated co-kernels can provide better performance for HPC applications, and that isolated virtual machines are even capable of outperforming native environments in the presence of competing workloads.
Achieving Performance Isolation with Lightweight Co-Kernels
1. Jiannan Ouyang, Brian Kocoloski, John Lange
The Prognostic Lab @ University of Pittsburgh
Kevin Pedretti
Sandia National Laboratories
HPDC 2015
Achieving Performance Isolation
with Lightweight Co-Kernels
2. HPC Architecture
— Traditional pipeline: simulation output is shipped from the supercomputer to a shared storage cluster and then to a separate processing cluster
— Problem: massive data movement over interconnects
— In situ data processing: move computation to data
— Improved data locality
— Reduced power consumption
— Reduced network traffic
[Figure: the traditional pipeline (supercomputer → shared storage cluster → processing cluster) contrasted with an in situ compute node whose Operating System and Runtimes (OS/R) host both the simulation and analytic/visualization workloads]
3. Challenge: Predictable High Performance
— Tightly coupled HPC workloads are sensitive to OS noise
and overhead [Petrini SC’03, Ferreira SC’08, Hoefler SC’10]
— Specialized kernels for predictable performance
— Tailored from Linux: CNL for Cray supercomputers
— Lightweight kernels (LWK) developed from scratch: IBM CNK, Kitten
— Data processing workloads favor Linux environments
— Cross workload interference
— Shared hardware (CPU time, cache, memory bandwidth)
— Shared system software
How can we provide both Linux and specialized kernels on the same node,
while ensuring performance isolation?
4. Approach: Lightweight Co-Kernels
— Hardware resources on one node are dynamically composed into
multiple partitions or enclaves
— Independent software stacks are deployed on each enclave
— Optimized for certain applications and hardware
— Performance isolation at both the software and hardware level
[Figure: left, a single Linux OS/R runs both the simulation and analytic/visualization workloads on shared hardware; right, the hardware is partitioned so that analytic/visualization stays on Linux while the simulation runs on a co-located LWK enclave]
6. Building Blocks: Kitten and Palacios
— The Kitten Lightweight Kernel (LWK)
— Goal: provide predictable performance for massively parallel HPC applications
— Simple resource management policies
— Limited kernel I/O support + direct user-level network access
— The Palacios Lightweight Virtual Machine Monitor (VMM)
— Goal: predictable performance
— Lightweight resource management policies
— Established history of providing virtualized environments for HPC [Lange et al. VEE '11, Kocoloski and Lange ROSS '12]
Kitten: https://software.sandia.gov/trac/kitten
Palacios: http://www.prognosticlab.org/palacios http://www.v3vee.org/
7. The Pisces Lightweight Co-Kernel Architecture
[Figure: one node's hardware hosting Linux alongside two Kitten co-kernels. Linux runs applications and virtual machines; Kitten co-kernel (1) runs the Palacios VMM hosting an isolated virtual machine; Kitten co-kernel (2) runs an isolated application; the Pisces module connects Linux to each co-kernel]
http://www.prognosticlab.org/pisces/
Pisces Design Goals
— Performance isolation at both software and hardware level
— Dynamic creation of resizable enclaves
— Isolated virtual environments
8. Design Decisions
— Elimination of cross OS dependencies
— Each enclave must implement its own complete set of supported
system calls
— No system call forwarding is allowed
— Internalized management of I/O
— Each enclave must provide its own I/O device drivers and manage
its hardware resources directly
— Userspace cross enclave communication
— Cross enclave communication is not a kernel-provided feature
— Cross enclave shared memory is explicitly set up at runtime (XEMEM); a single-OS stand-in for this pattern is sketched after this list
— Using virtualization to provide missing OS features
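XEMEM itself is described in the HPDC '15 paper cited on a later slide; its exact API is not reproduced here. The sketch below is a minimal single-OS stand-in for the same pattern, a named shared segment that each side explicitly creates or attaches, using POSIX shared memory. The channel name and size are invented for illustration; in Pisces the equivalent segment spans two enclaves.

```c
/* Sketch: the XEMEM pattern of explicitly created, named shared memory,
 * modeled with POSIX shm on a single Linux kernel. Illustrative only;
 * link with -lrt on older glibc. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHANNEL "/sim-to-analytics"   /* hypothetical channel name */
#define SIZE    4096

int main(void)
{
    /* Exporter side: create and size the named segment. */
    int fd = shm_open(CHANNEL, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SIZE) < 0)
        return 1;

    /* Importer side: attach the same named segment into its own address
     * space. After setup, communication is plain loads and stores, with
     * no cross-kernel system calls on the data path. */
    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    strcpy(buf, "timestep 0 complete");
    printf("%s\n", buf);

    munmap(buf, SIZE);
    close(fd);
    shm_unlink(CHANNEL);
    return 0;
}
```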
9. Cross Kernel Communication
[Figure: two hardware partitions, one running Linux and one running the Kitten co-kernel. In kernel context, the kernels exchange cross-kernel messages over a shared memory control channel; in user context, control processes and Linux-compatible workloads communicate with isolated processes and virtual machines over shared memory communication channels (a user-space model of the control channel follows below)]
XEMEM: Efficient Shared Memory for Composed Applications on Multi-OS/R Exascale Systems [Kocoloski and Lange, HPDC '15]
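As a rough illustration of the kernel-level control channel above, the following user-space model pairs a shared command word with a doorbell flag standing in for the inter-processor interrupt (IPI). All names and the command encoding are invented; the real Pisces channel is kernel code and is not shown here.

```c
/* Toy model of a cross-kernel control channel: a command word in shared
 * memory plus a doorbell that stands in for the IPI. Compile with
 * -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct ctrl_channel {
    atomic_int doorbell;   /* models the IPI: 0 = idle, 1 = pending */
    int        cmd;        /* command written by the management kernel */
};

static struct ctrl_channel chan;

static void *cokernel_side(void *arg)
{
    /* The co-kernel spins (or sleeps until the IPI arrives), then reads
     * the command out of the shared page and acknowledges it. */
    while (atomic_load(&chan.doorbell) == 0)
        ;
    printf("co-kernel: received command %d\n", chan.cmd);
    atomic_store(&chan.doorbell, 0);
    return arg;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, cokernel_side, NULL);

    chan.cmd = 42;                     /* e.g., "add CPU core" */
    atomic_store(&chan.doorbell, 1);   /* "send" the IPI */

    pthread_join(t, NULL);
    return 0;
}
```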
11. Challenges & Approaches
— How to boot a co-kernel?
— Hot-remove resources from Linux and load the co-kernel (see the hot-remove sketch after this list)
— Reuse Linux boot code with a modified target kernel address
— Restrict the Kitten co-kernel to accessing only its assigned resources
— How to share hardware resources among kernels?
— Hot-remove from Linux + direct assignment and adjustment (e.g. CPU cores, memory blocks, PCI devices)
— Managed by Linux and Pisces (e.g. IOMMU)
— How to communicate with a co-kernel?
— Kernel level: IPI + shared memory, primarily for Pisces commands (modeled in the sketch above)
— Application level: XEMEM [Kocoloski HPDC '15]
— How to route device interrupts? (see next slide)
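The hot-remove step relies on Linux's standard hotplug interfaces, such as the per-CPU online file in sysfs. The sketch below offlines one core that way; the choice of CPU 5 is arbitrary, it must run as root, and the handoff of the core to the co-kernel is not shown.

```c
/* Sketch of "hot-remove from Linux" using the stock CPU hotplug
 * interface: writing 0 to a CPU's online file takes the core away from
 * the Linux scheduler, after which Pisces can assign it to a co-kernel.
 * Run as root; writing "1" later returns the core to Linux. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/cpu/cpu5/online";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    fputs("0", f);   /* offline the core */
    fclose(f);
    return 0;
}
```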
12. I/O Interrupt Routing
[Figure: two interrupt paths. Left, "Legacy Interrupt Forwarding": a legacy device raises INTx through the IO-APIC to the management kernel, whose IRQ forwarder sends an IPI to the co-kernel's IRQ handler. Right, "Direct Device Assignment (w/ MSI)": an MSI/MSI-X device delivers its message-signaled interrupt directly to the co-kernel's IRQ handler]
• Legacy interrupt vectors are potentially shared among multiple devices
• Pisces provides IRQ forwarding service
• IRQ forwarding is only used during initialization for PCI devices
• Modern PCI devices support dedicated interrupt vectors (MSI/MSI-X)
• Directly route to the corresponding enclave (a toy model of this routing policy follows below)
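A toy user-space model of the routing policy above: shared legacy INTx vectors are forwarded by the management kernel via IPI, while per-device MSI/MSI-X vectors are delivered directly. The types, vector numbers, and enclave IDs are invented for illustration.

```c
/* Toy model of the slide's routing decision; not kernel code. */
#include <stdio.h>

enum irq_kind { IRQ_LEGACY_INTX, IRQ_MSI, IRQ_MSIX };

struct irq_desc {
    enum irq_kind kind;
    int           vector;
    int           owner_enclave;   /* enclave the device is assigned to */
};

static void deliver(const struct irq_desc *d)
{
    if (d->kind == IRQ_LEGACY_INTX) {
        /* The vector may be shared among devices owned by different
         * kernels, so the management kernel takes the interrupt and
         * forwards it to the co-kernel with an IPI. */
        printf("vector %d: forward via IPI to enclave %d\n",
               d->vector, d->owner_enclave);
    } else {
        /* MSI/MSI-X vectors are per-device, so the hardware can target
         * the owning enclave's cores directly. */
        printf("vector %d: direct delivery to enclave %d\n",
               d->vector, d->owner_enclave);
    }
}

int main(void)
{
    struct irq_desc nic  = { IRQ_MSIX,        61, 2 };
    struct irq_desc uart = { IRQ_LEGACY_INTX,  4, 2 };
    deliver(&nic);
    deliver(&uart);
    return 0;
}
```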
13. Implementation
— Pisces
— Linux kernel module supports unmodified Linux kernels
(2.6.3x – 3.x.y)
— Co-kernel initialization and management
— Kitten (~9000 LOC changes)
— Manage assigned hardware resources
— Dynamic resource assignment
— Kernel level communication channel
— Palacios (~5000 LOC changes)
— Dynamic resource assignment
— Command forwarding channel
Pisces: http://www.prognosticlab.org/pisces/
Kitten: https://software.sandia.gov/trac/kitten
Palacios: http://www.prognosticlab.org/palacios http://www.v3vee.org/
15. Evaluation
— 8-node Dell R450 cluster
— Two six-core Intel “Ivy-Bridge” Xeon processors
— 24GB RAM split across two NUMA domains
— QDR Infiniband
— CentOS 7, Linux kernel 3.16
— For performance isolation experiments, the hardware is
partitioned by NUMA domains.
— i.e. Linux on one NUMA domain, co-kernel on the other
16. Fast Pisces Management Operations

Operation                     Latency (ms)
Booting a co-kernel           265.98
Adding a single CPU core      33.74
Adding a 128 MB memory block  82.66
Adding an Ethernet NIC        118.98
17. Eliminating Cross Kernel Dependencies

Execution time of getpid()
               solitary workloads (us)   w/ other workloads (us)
Linux          3.05                      3.48
co-kernel fwd  6.12                      14.00
co-kernel      0.39                      0.36

— Co-kernel has the best average performance
— Co-kernel has the most consistent performance
— System call forwarding has longer latency and suffers from cross stack performance interference (see the microbenchmark sketch below)
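A minimal microbenchmark in the spirit of these numbers: time a large batch of back-to-back getpid() calls and report the per-call mean. The paper's exact methodology may differ; invoking the raw syscall sidesteps any C library caching of the pid.

```c
/* getpid() latency microbenchmark sketch. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000000

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        syscall(SYS_getpid);           /* bypass glibc's cached pid */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    printf("getpid: %.3f us per call\n", ns / ITERS / 1000.0);
    return 0;
}
```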
18. Noise Analysis

[Figure: OS interruption latency (us, y-axis 0-20) sampled over 5 seconds, for Linux (top row) and the Kitten co-kernel (bottom row), each shown (a) without and (b) with competing workloads]

Co-Kernel: less noise + better isolation
* Each point represents the latency of an OS interruption
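Plots of this kind can be reproduced in spirit with a simple noise probe: spin on a monotonic clock and record any gap large enough to indicate the OS took the core away. The 1 us threshold and 5 second duration below are arbitrary illustrative choices, not the paper's parameters.

```c
/* "Selfish detour" style noise probe sketch. */
#include <stdio.h>
#include <time.h>

static inline double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    const double threshold_us = 1.0;   /* gap that counts as noise */
    const double run_us = 5e6;         /* sample for 5 seconds */
    double start = now_us(), prev = start;

    while (prev - start < run_us) {
        double t = now_us();
        if (t - prev > threshold_us)   /* we were interrupted */
            printf("%.6f s: %.2f us detour\n",
                   (t - start) / 1e6, t - prev);
        prev = t;
    }
    return 0;
}
```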
19. Single Node Performance

[Figure: CoMD completion time (seconds, broken y-axis, ~82-85) and Stream throughput (GUPS, broken y-axis, ~20,250-21,250) on CentOS, Kitten/KVM, and the co-kernel, each with and without a background workload]

Co-Kernel: consistent performance + performance isolation
21. Co-VMM for HPC in the Cloud

[Figure: CDF (%) of HPCCG runtime (44-51 seconds) under Co-VMM, native, and KVM]

CDF of HPCCG Performance (running with Hadoop, 8 nodes)
Co-VMM: consistent performance + performance isolation
22. Related Work
— Exascale operating systems and runtimes (OS/Rs)
— Hobbes (SNL, LBNL, LANL, ORNL, U. Pitt, various universities)
— Argo (ANL, LLNL, PNNL, various universities)
— FusedOS (Intel / IBM)
— mOS (Intel)
— McKernel (RIKEN AICS, University of Tokyo)
Our uniqueness: performance isolation, dynamic
resource composition, lightweight virtualization
23. Conclusion
— Design and implementation of the Pisces co-kernel architecture
— Pisces framework
— Kitten co-kernel
— Palacios VMM for Kitten co-kernel
— Demonstrated that the co-kernel architecture provides
— Optimized execution environments for in situ processing
— Performance isolation
https://software.sandia.gov/trac/kitten
http://www.prognosticlab.org/pisces/
http://www.prognosticlab.org/palacios
24. Thank You
Jiannan Ouyang
— Ph.D. Candidate @ University of Pittsburgh
— ouyang@cs.pitt.edu
— http://people.cs.pitt.edu/~ouyang/
— The Prognostic Lab @ U. Pittsburgh
— http://www.prognosticlab.org