Achieving Performance Isolation
with Lightweight Co-Kernels
Jiannan Ouyang, Brian Kocoloski, John Lange
The Prognostic Lab @ University of Pittsburgh
Kevin Pedretti
Sandia National Laboratories
HPDC 2015
HPC Architecture
—  Move computation to data
—  Improved data locality
—  Reduced power consumption
—  Reduced network traffic
[Figure: traditional vs. in situ data processing. The traditional workflow ships simulation output from the supercomputer over the interconnect to a shared storage cluster and a separate processing cluster (problem: massive data movement over interconnects); in situ data processing instead runs the simulation and the analytics/visualization within a single compute-node operating system and runtimes (OS/R)]
Challenge: Predictable High Performance
—  Tightly coupled HPC workloads are sensitive to OS noise
and overhead [Petrini SC’03, Ferreira SC’08, Hoefler SC’10]
—  Specialized kernels for predictable performance
—  Tailored from Linux: CNL for Cray supercomputers
—  Lightweight kernels (LWK) developed from scratch: IBM CNK, Kitten
—  Data processing workloads favor Linux environments
—  Cross workload interference
—  Shared hardware (CPU time, cache, memory bandwidth)
—  Shared system software
How to provide both Linux and specialized kernels on the same node,
while ensuring performance isolation?
Approach: Lightweight Co-Kernels
—  Hardware resources on one node are dynamically composed into
multiple partitions or enclaves
—  Independent software stacks are deployed on each enclave
—  Optimized for certain applications and hardware
—  Performance isolation at both the software and hardware level
[Figure: two single-node configurations: one Linux stack running both the simulation and the analytics/visualization on shared hardware, versus the same hardware partitioned into a Linux enclave (analytics/visualization) and an LWK enclave (simulation)]
Agenda
—  Introduction
—  The Pisces Lightweight Co-Kernel Architecture
—  Implementation
—  Evaluation
—  Related Work
—  Conclusion
Building Blocks: Kitten and Palacios
—  The Kitten Lightweight Kernel (LWK)
—  Goal: provide predictable performance for massively parallel HPC applications
—  Simple resource management policies
—  Limited kernel I/O support + direct user-level network access
—  The Palacios Lightweight Virtual Machine Monitor (VMM)
—  Goal: predictable performance
—  Lightweight resource management policies
—  Established history of providing virtualized environments for HPC [Lange et al.
VEE '11, Kocoloski and Lange ROSS '12]
Kitten: https://software.sandia.gov/trac/kitten
Palacios: http://www.prognosticlab.org/palacios http://www.v3vee.org/
The Pisces Lightweight Co-Kernel Architecture
[Figure: Linux and two Kitten co-kernels run side by side on partitioned hardware, each booted and managed through Pisces; co-kernel (1) hosts an isolated application, while co-kernel (2) runs the Palacios VMM hosting applications and isolated virtual machines]
http://www.prognosticlab.org/pisces/
Pisces Design Goals
—  Performance isolation at both software and hardware level
—  Dynamic creation of resizable enclaves
—  Isolated virtual environments
Design Decisions
—  Elimination of cross OS dependencies
—  Each enclave must implement its own complete set of supported
system calls
—  No system call forwarding is allowed
—  Internalized management of I/O
—  Each enclave must provide its own I/O device drivers and manage
its hardware resources directly
—  Userspace cross enclave communication
—  Cross enclave communication is not a kernel provided feature
—  Explicitly set up cross enclave shared memory at runtime (XEMEM; see the sketch after this list)
—  Using virtualization to provide missing OS features
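To make the userspace communication model concrete, below is a minimal runnable stand-in for the XEMEM pattern: one process exports a region under a well-known name, and another attaches to it by name at runtime. POSIX shm_open on a single Linux kernel only models the protocol; the segment name and message are invented for this example, and the real cross-enclave channel uses XEMEM named segments [Kocoloski and Lange, HPDC '15], not shm_open.

/* Single-kernel stand-in for the XEMEM usage pattern.
 * Compile with: cc xemem_demo.c -o xemem_demo -lrt */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHANNEL "/enclave-channel"  /* illustrative well-known name */
#define SIZE    4096

int main(int argc, char **argv)
{
    int producer = (argc > 1 && strcmp(argv[1], "produce") == 0);

    /* Producer creates the named region; consumer looks it up by name. */
    int fd = shm_open(CHANNEL, producer ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (producer && ftruncate(fd, SIZE) != 0) { perror("ftruncate"); return 1; }

    char *region = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    if (producer)
        snprintf(region, SIZE, "simulation step complete");  /* write message */
    else
        printf("consumer read: %s\n", region);               /* read it back */

    munmap(region, SIZE);
    close(fd);
    return 0;
}

Running one instance as "xemem_demo produce" and a second with no arguments passes a message through the shared page; with XEMEM the same set-up happens explicitly in userspace across enclave boundaries, which is exactly the design point above.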
Cross Kernel Communication
[Figure: two hardware partitions. Linux runs compatible workloads plus a control process; the Kitten co-kernel runs isolated processes and virtual machines plus its own control process. The kernels exchange cross-kernel messages over a shared-memory control channel, while applications communicate over shared-memory channels established in user context]
XEMEM: Efficient Shared Memory for Composed
Applications on Multi-OS/R Exascale Systems
[Kocoloski and Lange, HPDC‘15]
Agenda
—  Introduction
—  The Pisces Lightweight Co-Kernel Architecture
—  Implementation
—  Evaluation
—  Related Work
—  Conclusion
Challenges & Approaches
—  How to boot a co-kernel?
—  Hot-remove resources from Linux, then load the co-kernel (see the sketch after this list)
—  Reuse Linux boot code with a modified target kernel address
—  Restrict the Kitten co-kernel to access assigned resources only
—  How to share hardware resources among kernels?
—  Hot-remove from Linux + direct assignment and adjustment (e.g.
CPU cores, memory blocks, PCI devices)
—  Managed by Linux and Pisces (e.g. IOMMU)
—  How to communicate with a co-kernel?
—  Kernel level: IPI + shared memory, primarily for Pisces commands
—  Application level: XEMEM [Kocoloski HPDC’15]
—  How to route device interrupts?
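As a rough illustration of the first two answers above, this sketch walks the standard Linux hotplug interface that Pisces leverages to free resources before handing them to a co-kernel. The sysfs paths are the stock Linux CPU and memory hotplug files (requires root); the hand-off step itself is only stubbed out, since the real Pisces control interface is not reproduced here.

/* A minimal sketch, assuming root on a stock Linux box: offline one
 * CPU core and one memory block via sysfs hotplug, the first step of
 * the Pisces boot sequence. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return -1; }
    ssize_t n = write(fd, val, strlen(val));
    close(fd);
    return (n == (ssize_t)strlen(val)) ? 0 : -1;
}

int main(void)
{
    /* 1. Hot-remove CPU 6 from Linux (standard hotplug interface). */
    if (write_str("/sys/devices/system/cpu/cpu6/online", "0") != 0)
        return 1;

    /* 2. Hot-remove one memory block (128MB on typical x86_64 systems). */
    if (write_str("/sys/devices/system/memory/memory64/state", "offline") != 0)
        return 1;

    /* 3. The freed core and block would now be assigned to a Kitten
     *    enclave through the Pisces control device (not shown here). */
    printf("cpu6 and memory block 64 offlined; ready for co-kernel\n");
    return 0;
}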
I/O Interrupt Routing
[Figure: two interrupt paths. With legacy interrupt forwarding, an INTx interrupt from a legacy device reaches the management kernel's IRQ forwarder via the IO-APIC and is relayed to the co-kernel's IRQ handler as an IPI. With direct device assignment (w/ MSI), an MSI/MSI-X device delivers its MSI directly to the owning kernel's IRQ handler]
•  Legacy interrupt vectors are potentially shared among multiple devices
•  Pisces provides an IRQ forwarding service (sketched after this list)
•  IRQ forwarding is only used during initialization for PCI devices
•  Modern PCI devices support dedicated interrupt vectors (MSI/MSI-X)
•  Directly route to the corresponding enclave
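A compilable sketch of the forwarding idea on the Linux side is below. request_irq() and the module scaffolding are the real Linux kernel APIs; the IRQ number, target CPU, vector, and the pisces_send_ipi() helper are all illustrative stand-ins, since the actual Pisces IPI path is not reproduced here.

/* Sketch of legacy IRQ forwarding: the management kernel claims the
 * shared INTx line and relays each interrupt to the enclave as an IPI. */
#include <linux/interrupt.h>
#include <linux/module.h>

#define FORWARDED_IRQ  11     /* example legacy INTx line */
#define ENCLAVE_CPU    6      /* a core owned by the co-kernel */
#define ENCLAVE_VECTOR 0xe9   /* vector the co-kernel registered */

static int dev_cookie;        /* dev_id required for shared IRQs */

/* Stand-in for the Pisces IPI path: the real code programs the local
 * APIC to deliver ENCLAVE_VECTOR to a CPU Linux no longer manages. */
static void pisces_send_ipi(int cpu, int vector)
{
    pr_info("forwarding IRQ as IPI: cpu=%d vector=0x%x\n", cpu, vector);
}

static irqreturn_t irq_forwarder(int irq, void *dev_id)
{
    pisces_send_ipi(ENCLAVE_CPU, ENCLAVE_VECTOR);
    return IRQ_HANDLED;   /* the co-kernel's handler does the real work */
}

static int __init fwd_init(void)
{
    return request_irq(FORWARDED_IRQ, irq_forwarder, IRQF_SHARED,
                       "pisces-irq-fwd", &dev_cookie);
}

static void __exit fwd_exit(void)
{
    free_irq(FORWARDED_IRQ, &dev_cookie);
}

module_init(fwd_init);
module_exit(fwd_exit);
MODULE_LICENSE("GPL");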
Implementation
—  Pisces
—  A Linux kernel module that supports unmodified Linux kernels
(2.6.3x – 3.x.y)
—  Co-kernel initialization and management
—  Kitten (~9000 LOC changes)
—  Manage assigned hardware resources
—  Dynamic resource assignment
—  Kernel level communication channel
—  Palacios (~5000 LOC changes)
—  Dynamic resource assignment
—  Command forwarding channel
Pisces: http://www.prognosticlab.org/pisces/
Kitten: https://software.sandia.gov/trac/kitten
Palacios: http://www.prognosticlab.org/palacios http://www.v3vee.org/
Agenda
—  Introduction
—  The Pisces Lightweight Co-Kernel Architecture
—  Implementation
—  Evaluation
—  Related Work
—  Conclusion
Evaluation
—  8-node Dell R450 cluster
—  Two six-core Intel “Ivy-Bridge” Xeon processors
—  24GB RAM split across two NUMA domains
—  QDR InfiniBand
—  CentOS 7, Linux kernel 3.16
—  For performance isolation experiments, the hardware is
partitioned by NUMA domains.
—  i.e., Linux on one NUMA domain, the co-kernel on the other
Fast Pisces Management Operations
Operation                      Latency (ms)
Booting a co-kernel            265.98
Adding a single CPU core       33.74
Adding a 128MB memory block    82.66
Adding an Ethernet NIC         118.98
Eliminating Cross Kernel Dependencies
Execution time of getpid()

                 solitary workloads (us)   w/ other workloads (us)
Linux                    3.05                      3.48
co-kernel fwd            6.12                     14.00
co-kernel                0.39                      0.36
—  Co-kernel has the best average performance
—  Co-kernel has the most consistent performance
—  System call forwarding has longer latency and suffers from
cross stack performance interference
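A minimal version of the microbenchmark behind the table above might look like the following; the paper's exact harness is not reproduced here, and syscall(SYS_getpid) is used so that each iteration actually enters the kernel rather than hitting any userspace caching of the PID.

/* Time repeated getpid() system calls and report the mean latency. */
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000000L

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        syscall(SYS_getpid);            /* one kernel crossing per call */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("getpid: %.3f us/call\n", ns / ITERS / 1e3);
    return 0;
}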
Noise Analysis
[Figure: OS interruption latency (0 to 20 us) sampled over 5 seconds for Linux (top) and the Kitten co-kernel (bottom), each shown (a) without competing workloads and (b) with competing workloads]
Co-Kernel: less noise + better isolation
* Each point represents the latency of an OS interruption
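Traces like the ones above can be gathered with a simple probe of this shape: spin on the clock and log every gap larger than a threshold as an OS interruption. This is a sketch of the general fixed-work measurement technique, not the paper's exact benchmark; the 1 us threshold and 5 second window are chosen to match the figure.

/* Log clock-reading gaps above a threshold as OS interruptions. */
#include <stdio.h>
#include <time.h>

#define RUN_SECONDS  5
#define THRESHOLD_NS 1000   /* gaps above 1 us count as interruptions */

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    long long start = now_ns(), prev = start;

    while (prev - start < RUN_SECONDS * 1000000000LL) {
        long long t = now_ns();
        if (t - prev > THRESHOLD_NS)    /* we were preempted or interrupted */
            printf("%.6f s: %.2f us\n",
                   (t - start) / 1e9, (t - prev) / 1e3);
        prev = t;
    }
    return 0;
}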
Single Node Performance
[Figure: CoMD completion time (roughly 82 to 85 seconds) and Stream throughput (GUPS, roughly 20,250 to 21,250) on CentOS, Kitten/KVM, and the co-kernel, each with and without a background (bg) workload]
Co-Kernel: consistent performance + performance isolation
8 Node Performance
[Figure: throughput (GFLOP/s, 2 to 20) versus number of nodes (1 to 8) for co-VMM, native, and KVM, each with and without background (bg) workloads]
w/o bg: co-VMM achieves native Linux performance
w/ bg: co-VMM outperforms native Linux
Co-VMM for HPC in the Cloud
[Figure: CDF (%) of HPCCG runtime (44 to 51 seconds) for Co-VMM, Native, and KVM]
CDF of HPCCG Performance (running with Hadoop, 8 nodes)
co-VMM: consistent performance + performance isolation
Related Work
— Exascale operating systems and runtimes (OS/Rs)
—  Hobbes (SNL, LBNL, LANL, ORNL, U. Pitt, various universities)
—  Argo (ANL, LLNL, PNNL, various universities)
—  FusedOS (Intel / IBM)
—  mOS (Intel)
—  McKernel (RIKEN AICS, University of Tokyo)
What sets our work apart: performance isolation, dynamic
resource composition, and lightweight virtualization
Conclusion
—  Design and implementation of the Pisces co-kernel architecture
—  Pisces framework
—  Kitten co-kernel
—  Palacios VMM for the Kitten co-kernel
—  Demonstrated that the co-kernel architecture provides
—  Optimized execution environments for in situ processing
—  Performance isolation
https://software.sandia.gov/trac/kitten
http://www.prognosticlab.org/pisces/
http://www.prognosticlab.org/palacios
Thank You
Jiannan Ouyang
—  Ph.D. Candidate @ University of Pittsburgh
—  ouyang@cs.pitt.edu
—  http://people.cs.pitt.edu/~ouyang/
—  The Prognostic Lab @ U. Pittsburgh
—  http://www.prognosticlab.org
