Building Efficient HPC Clouds with MVAPICH2 and
RDMA-Hadoop over SR-IOV InfiniBand Clusters
Talk at OpenFabrics Alliance Workshop (OFAW ‘17)
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Xiaoyi Lu
The Ohio State University
E-mail: luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
• Cloud Computing focuses on maximizing the effectiveness of shared resources
• Virtualization is the key technology for resource sharing in the Cloud
• Widely adopted in industry computing environments
• IDC Forecasts Worldwide Public IT Cloud Services Spending to Reach Nearly $108 Billion
by 2017 (Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS24298013)
Cloud Computing and Virtualization
Drivers of Modern HPC Cluster and Cloud Architecture
• Multi-core/many-core technologies, Accelerators
• Large memory nodes
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)
(Figure: building blocks – high-performance interconnects such as InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth), multi-/many-core processors, SSDs and object storage clusters, large-memory nodes (up to 2 TB), and cloud systems such as SDSC Comet and TACC Stampede)
• Single Root I/O Virtualization (SR-IOV) provides new opportunities to design
HPC clouds with very low overhead
Single Root I/O Virtualization (SR-IOV)
• Allows a single physical device, or a
Physical Function (PF), to present itself as
multiple virtual devices, or Virtual
Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver changes are needed
• Each VF can be dedicated to a single VM through PCI pass-through
• Works with 10/40 GigE and InfiniBand
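As a quick illustration (not taken from the slides), the verbs API can be used inside a guest to confirm that a passed-through SR-IOV VF shows up as a regular InfiniBand device. A minimal sketch, assuming libibverbs is installed in the VM:

/* List the IB devices (e.g., a passed-through SR-IOV VF) visible in this VM.
 * Build with: gcc list_ib.c -o list_ib -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num_devices; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}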
• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
– InfiniBand
– 10/40/100 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Very Good Performance
– Low latency (a few microseconds)
– High Bandwidth (100 Gb/s with EDR InfiniBand)
– Low CPU overhead (5-10%)
• The OpenFabrics software stack with IB, iWARP, and RoCE interfaces is driving HPC systems
• How can we build HPC Clouds with SR-IOV and InfiniBand that deliver optimal performance?
Building HPC Cloud with SR-IOV and InfiniBand
• Virtualization Support with Virtual Machines and Containers
– KVM, Docker, Singularity, etc.
• Communication coordination among optimized communication channels on Clouds
– SR-IOV, IVShmem, IPC-Shm, CMA, etc.
• Locality-aware communication
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable Collective communication
– Offload; Non-blocking; Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
– Multiple end-points per node
• NUMA-aware communication for nested virtualization
• Integrated Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
– Migration support with virtual machines
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, …)
• Energy-Awareness
• Co-design with resource management and scheduling systems on Clouds
– OpenStack, Slurm, etc.
Broad Challenges in Designing Communication and I/O Middleware
for HPC on Clouds
• High-Performance designs for Big Data middleware
– RDMA-based designs to accelerate Big Data middleware on high-performance Interconnects
– NVM-aware communication and I/O schemes for Big Data
– SATA-/PCIe-/NVMe-SSD support
– Parallel Filesystem support
– Optimized overlapping among Computation, Communication, and I/O
– Threaded Models and Synchronization
• Fault-tolerance/resiliency
– Migration support with virtual machines
– Data replication
• Efficient data access and placement policies
• Efficient task scheduling
• Fast deployment and automatic configurations on Clouds
Additional Challenges in Designing Communication and I/O
Middleware for Big Data on Clouds
• MVAPICH2-Virt with SR-IOV and IVSHMEM
– Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM
– SLURM alone, SLURM + OpenStack
• Big Data Libraries on Cloud
– RDMA-Hadoop, OpenStack Swift
Approaches to Build HPC Clouds
MVAPICH2 Software Family
High-Performance Parallel Programming Libraries
MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
MVAPICH2-EA – Energy-aware and high-performance MPI
MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
Microbenchmarks
OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
OEMT – Utility to measure the energy consumption of MPI applications
HPC on Cloud Computing Systems: Challenges Addressed by OSU So Far
(Figure: layered view – applications and HPC/Big Data middleware (MPI, PGAS, MPI+PGAS, MPI+OpenMP, etc.) on top of a communication and I/O library providing locality- and NUMA-aware communication, communication channels (SR-IOV, IVShmem, IPC-Shm, CMA), virtualization (hypervisor and container), fault-tolerance and consolidation (migration), QoS awareness, and future studies; co-designed with resource management and scheduling systems for cloud computing (OpenStack Nova, Heat; Slurm); running over networking technologies (InfiniBand, Omni-Path, 1/10/40/100 GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD, SSD, NVRAM, and NVMe-SSD))
• Redesign MVAPICH2 to make it virtual machine aware
– SR-IOV shows near-native performance for inter-node point-to-point communication
– IVSHMEM offers shared-memory-based data access across co-resident VMs
– Locality Detector: maintains the locality information of co-resident virtual machines
– Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively (a simplified selection sketch follows the references below)
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM
J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM
Shmem Benefit MPI Applications on SR-IOV based
Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High
Performance MPI Library over SR-IOV Enabled
InfiniBand Clusters. HiPC, 2014
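To make the channel selection concrete, the following is an illustrative sketch (hypothetical types and names, not MVAPICH2 internals) of how a communication coordinator could pick between IVSHMEM and SR-IOV using the locality detector's information:

/* Illustrative locality-aware channel selection between IVSHMEM and SR-IOV. */
#include <stdbool.h>

typedef enum { CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

typedef struct {
    int  rank;
    bool co_resident;  /* filled by the locality detector: is the peer's VM on the same host? */
} peer_info_t;

/* Co-resident VMs can exchange data through the IVSHMEM shared-memory region;
 * all other peers are reached over the SR-IOV virtual function. */
static channel_t select_channel(const peer_info_t *peer)
{
    return peer->co_resident ? CHANNEL_IVSHMEM : CHANNEL_SRIOV;
}

In the actual runtime this decision is made per message and combined with the intra-VM shared-memory channels; the sketch only shows the basic idea.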
• OpenStack is one of the most popular
open-source solutions to build clouds and
manage virtual machines
• Deployment with OpenStack
– Supporting SR-IOV configuration
– Supporting IVSHMEM configuration
– Virtual Machine aware design of MVAPICH2
with SR-IOV
• An efficient approach to build HPC Clouds
with MVAPICH2-Virt and OpenStack
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack
J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to
Build HPC Clouds. CCGrid, 2015
(Charts: execution time of SPEC MPI2007 applications (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) and Graph500 problem sizes for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native)
• 32 VMs, 6 Core/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 Procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs
Application-Level Performance on Chameleon (SPEC MPI2007 and Graph500)
• MVAPICH2-Virt with SR-IOV and IVSHMEM
– Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM
– SLURM alone, SLURM + OpenStack
• Big Data Libraries on Cloud
– RDMA-Hadoop, OpenStack Swift
Approaches to Build HPC Clouds
Execute Live Migration with SR-IOV Device
High Performance SR-IOV enabled VM Migration Support in
MVAPICH2
J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV
enabled InfiniBand Clusters. IPDPS, 2017
• Migration with an SR-IOV device has to handle the challenges of detaching and re-attaching the virtualized IB device and the IB connection
• Consists of an SR-IOV enabled IB cluster and an external migration controller
• Multiple parallel libraries to notify MPI applications during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, migration status)
• Handles IB connection suspension and reactivation
• Proposes progress engine (PE) based and migration thread (MT) based designs to optimize VM migration and MPI application performance (a simplified MT sketch follows below)
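A highly simplified sketch of the migration-thread (MT) idea, with placeholder stubs standing in for the controller notifications and the channel suspend/resume logic (illustrative only, not the MVAPICH2 design):

/* MT-style design sketch: a helper thread blocks on migration notifications and
 * quiesces/reactivates communication around the VM migration, so application
 * threads can keep computing. All functions below are placeholder stubs. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef enum { MIG_NONE, MIG_PRE, MIG_POST } mig_event_t;

static mig_event_t wait_for_migration_event(void) { sleep(1); return MIG_NONE; }
static void suspend_channels(void) { puts("detach VF/IVSHMEM, pause IB communication"); }
static void resume_channels(void)  { puts("re-attach devices, re-establish IB connections"); }

static void *migration_thread(void *arg)
{
    (void)arg;
    for (;;) {
        mig_event_t ev = wait_for_migration_event();
        if (ev == MIG_PRE)
            suspend_channels();
        else if (ev == MIG_POST)
            resume_channels();
    }
    return NULL;
}

/* Started once at initialization time in an MT-style design:
 *   pthread_t tid; pthread_create(&tid, NULL, migration_thread, NULL); */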
• Compared with TCP, the RDMA scheme reduces the total migration time by 20%
• Total time is dominated by the 'Migration' step; times for the other steps are similar across schemes
• The proposed migration framework reduces migration time by up to 51%
Performance Evaluation of VM Migration Framework
(Charts: breakdown of VM migration time (Set PRE_MIGRATION, Detach VF, Remove IVSHMEM, Migration, Attach VF, Add IVSHMEM, Set POST_MIGRATION) for TCP, IPoIB, and RDMA schemes; and total migration time for 2/4/8/16 VMs with the sequential vs. proposed migration frameworks)
Bcast (4 VMs * 2 procs/VM)
• Migrate a VM from one machine to another while the benchmark is running inside
• Proposed MT-based designs perform slightly worse than PE-based designs because of lock/unlock overhead
• No benefit from MT because no computation is involved
Performance Evaluation of VM Migration Framework
(Charts: point-to-point latency and Bcast latency vs. message size for PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA)
• 8 VMs in total; 1 VM carries out migration while the application is running
• MT-worst and PE incur some overhead compared with NM
• MT-typical allows migration to be completely overlapped with computation
Performance Evaluation of VM Migration Framework
(Charts: NAS (LU.C, EP.C, IS.C, MG.C, CG.C) and Graph500 execution time for PE, MT-worst, MT-typical, and NM)
• MVAPICH2-Virt with SR-IOV and IVSHMEM
– Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM
– SLURM alone, SLURM + OpenStack
• Big Data Libraries on Cloud
– RDMA-Hadoop, OpenStack Swift
Approaches to Build HPC Clouds
• Container-based technologies (e.g., Docker) provide lightweight virtualization solutions
• Container-based virtualization – containers share the host kernel
Overview of Containers-based Virtualization
Benefits of Containers-based Virtualization for HPC on Cloud
(Chart: Graph500 BFS execution time, Scale 20 / Edgefactor 16: Native-16P 65 ms, 1 Cont*16P 65.5 ms, 2 Conts*8P 178.7 ms, 4 Conts*4P 253.7 ms)
• Experiments on Chameleon Cloud (with NFS)
• Containers have less overhead than VMs
• BFS time in Graph500 increases significantly as the number of containers per host increases. Why?
(Chart: ib_send_lat latency vs. message size for VM-PT, VM-SR-IOV, Container-PT, and Native)
J. Zhang, X. Lu, D. K. Panda. Performance Characterization of Hypervisor- and Container-Based Virtualization
for HPC on SR-IOV Enabled InfiniBand Clusters. IPDRM, IPDPS Workshop, 2016
• What are the performance bottlenecks when running MPI applications on multiple containers per host in an HPC cloud?
• Can we propose a new design to overcome the bottlenecks on such a container-based HPC cloud?
• Can an optimized design deliver near-native performance for different container deployment scenarios?
• Locality-aware design to enable CMA and shared-memory channels for MPI communication across co-resident containers (a minimal CMA sketch follows the reference below)
Containers-based Design: Issues, Challenges, and Approaches
J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters.
ICPP, 2016
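The CMA channel mentioned above relies on the kernel's cross-memory-attach system calls. A minimal sketch, assuming the peer's pid and buffer address have already been exchanged out of band (for example through a shared-memory region) and that the two containers share a PID namespace or have the required ptrace permission; this is a generic illustration, not MVAPICH2 code:

/* Cross Memory Attach (CMA) read: copy 'len' bytes directly from a co-resident
 * peer process's address space, avoiding an intermediate shared-memory copy. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

ssize_t cma_read(pid_t peer_pid, void *local_buf, void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    ssize_t n = process_vm_readv(peer_pid, &local, 1, &remote, 1, 0);
    if (n < 0)
        perror("process_vm_readv");
    return n;
}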
(Chart: NAS Class D (MG, FT, EP, LU, CG) execution time for Container-Def, Container-Opt, and Native)
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% reduction in execution time for NAS and Graph500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph500
Application-Level Performance on Docker with MVAPICH2
(Chart: Graph500 BFS execution time, Scale 20 / Edgefactor 16, for 1 Cont*16P, 2 Conts*8P, and 4 Conts*4P with Container-Def, Container-Opt, and Native)
• Less than 18% overhead on latency
• Less than 13% overhead on BW
MVAPICH2 Intra-Node and Inter-Node Point-to-Point
Performance on Singularity
(Charts: intra-node and inter-node latency and bandwidth vs. message size for Singularity and Native)
• 512 Processes across 32 nodes
• Less than 15% and 14% overhead for Bcast and Allreduce, respectively
MVAPICH2 Collective Performance on Singularity
(Charts: Bcast and Allreduce latency vs. message size for Singularity and Native)
• 512 Processes across 32 nodes
• Less than 16% and 11% overhead for NPB and Graph500, respectively
Application-Level Performance on Singularity with MVAPICH2
(Charts: Graph500 BFS execution time and NPB Class D execution time for Singularity and Native)
• MVAPICH2-Virt with SR-IOV and IVSHMEM
– Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM
– SLURM alone, SLURM + OpenStack
• Big Data Libraries on Cloud
– RDMA-Hadoop, OpenStack Swift
Approaches to Build HPC Clouds
OFAW 2017 29Network Based Computing Laboratory
Nested Virtualization: Containers over Virtual Machines
• Useful for live migration, application sandboxing, legacy system integration, software deployment, etc.
• Performance issues arise because of redundant call stacks (two-layer virtualization) and isolated physical resources
OFAW 2017 30Network Based Computing Laboratory
Multiple Communication Paths in Nested Virtualization
1. Intra-VM Intra-Container (across core 4 and core 5)
2. Intra-VM Inter-Container (across core 13 and core 14)
3. Inter-VM Inter-Container (across core 6 and core 12)
4. Inter-Node Inter-Container (across core 15 and the core on remote node)
• Different VM placements introduce multiple communication paths
on container level
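For illustration, the classification a two-layer locality detector has to perform over these four paths can be sketched as follows (hypothetical names, not MVAPICH2 internals):

/* Classify the container-level communication path from locality facts. */
#include <stdbool.h>

typedef enum {
    INTRA_VM_INTRA_CONTAINER,   /* path 1 */
    INTRA_VM_INTER_CONTAINER,   /* path 2 */
    INTER_VM_INTER_CONTAINER,   /* path 3 */
    INTER_NODE_INTER_CONTAINER  /* path 4 */
} comm_path_t;

static comm_path_t classify_path(bool same_node, bool same_vm, bool same_container)
{
    if (!same_node)      return INTER_NODE_INTER_CONTAINER;
    if (!same_vm)        return INTER_VM_INTER_CONTAINER;
    if (!same_container) return INTRA_VM_INTER_CONTAINER;
    return INTRA_VM_INTRA_CONTAINER;
}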
OFAW 2017 31Network Based Computing Laboratory
Performance Characteristics on Communication Paths
• Two VMs are deployed on the same socket and different sockets, respectively
• *-Def and Inter-VM Inter-Container-1Layer have similar performance
• Large gap compared to native performance
(Charts: intra-socket and inter-socket latency vs. message size for Intra-VM Inter-Container-Def, Inter-VM Inter-Container-Def, Intra-VM Inter-Container-1Layer*, Inter-VM Inter-Container-1Layer*, and Native)
1Layer* - J. Zhang, X. Lu, D. K. Panda. High
Performance MPI Library for Container-based HPC
Cloud on InfiniBand, ICPP, 2016
OFAW 2017 32Network Based Computing Laboratory
Challenges of Nested Virtualization
• How can we further reduce the performance overhead of running applications in a nested virtualization environment?
• What are the impacts of different VM/container placement schemes on container-level communication?
• Can we propose a design that adapts to these different VM/container placement schemes and delivers near-native performance for nested virtualization environments?
OFAW 2017 33Network Based Computing Laboratory
Overview of Proposed Design in MVAPICH2
Two-Layer Locality Detector: dynamically detects MPI processes in co-resident containers inside one VM as well as those in co-resident VMs
Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message characteristics to select the appropriate communication channel
J. Zhang, X. Lu, D. K. Panda. Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization
based HPC Cloud with SR-IOV Enabled InfiniBand, VEE, 2017
OFAW 2017 34Network Based Computing Laboratory
Inter-VM Inter-Container Pt2Pt (Intra-Socket)
• 1Layer has similar performance to the Default
• Compared with 1Layer, 2Layer delivers up to 84% and 184% improvement for
latency and BW
Inter-VM Inter-Container Pt2Pt (Inter-Socket)
• 1-Layer has similar performance to the Default
• 2-Layer has near-native performance for small messages, but clear overhead for large messages
• Compared to 2-Layer, the hybrid design brings up to 42% and 25% improvement for latency and BW, respectively
OFAW 2017 36Network Based Computing Laboratory
Application-level Evaluations
• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (Graph500, 28,16) and 10% (Class D NAS, LU), respectively
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit
• MVAPICH2-Virt with SR-IOV and IVSHMEM
– Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM
– SLURM alone, SLURM + OpenStack
• Big Data Libraries on Cloud
– RDMA-Hadoop, OpenStack Swift
Approaches to Build HPC Clouds
OFAW 2017 38Network Based Computing Laboratory
• Requirement to manage and isolate the virtualized SR-IOV and IVSHMEM resources
• Such management and isolation is hard to achieve with the MPI library alone, but much easier with SLURM
• Running MPI applications efficiently on HPC Clouds requires SLURM support for managing SR-IOV and IVSHMEM
– Can critical HPC resources be efficiently shared among users by extending SLURM with
support for SR-IOV and IVSHMEM based virtualization?
– Can SR-IOV and IVSHMEM enabled SLURM and MPI library provide bare-metal performance
for end applications on HPC Clouds?
Need for Supporting SR-IOV and IVSHMEM in SLURM
OFAW 2017 39Network Based Computing Laboratory
(Figure: SLURM-V workflow with the SPANK plugin – mpirun_vm submits the job; Slurmctld registers the job step with the Slurmd daemons, which load the SPANK components (VM Config Reader, VM Launcher, VM Reclaimer) to set up VMs, run the MPI job across the VMs, and finally reclaim the VMs and release the hosts)
• VM Configuration Reader – registers all VM configuration options and sets them in the job control environment so that they are visible to all allocated nodes
• VM Launcher – sets up VMs on each allocated node
- File-based lock to detect occupied VFs and exclusively allocate a free VF
- Assigns a unique ID to each IVSHMEM device and dynamically attaches it to each VM
• VM Reclaimer – tears down VMs and reclaims resources
(a minimal plugin skeleton follows below)
SLURM SPANK Plugin based Design
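Below is a minimal skeleton of such a plugin; the callback names come from the SPANK API, while the bodies are placeholders rather than the actual SLURM-V implementation:

/* Minimal SPANK plugin skeleton along the lines of the design above. */
#include <slurm/spank.h>

SPANK_PLUGIN(vm_launcher_sketch, 1);

/* Plugin load time: register VM configuration options here
 * (e.g., via spank_option_register) so they reach all allocated nodes. */
int slurm_spank_init(spank_t sp, int ac, char **av)
{
    slurm_info("vm_launcher_sketch: registering VM configuration options");
    return ESPANK_SUCCESS;
}

/* Per-node job setup: a real plugin would pick a free VF (file-based lock),
 * attach an IVSHMEM device, and launch the VM(s) here. */
int slurm_spank_user_init(spank_t sp, int ac, char **av)
{
    slurm_info("vm_launcher_sketch: would launch VM(s) on this node");
    return ESPANK_SUCCESS;
}

/* Job teardown: reclaim VMs and release the VF/IVSHMEM resources. */
int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    slurm_info("vm_launcher_sketch: would reclaim VM(s) and resources");
    return ESPANK_SUCCESS;
}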
• VM Configuration Reader – registers VM options
• VM Launcher, VM Reclaimer – offload to the underlying OpenStack infrastructure
- PCI whitelist to pass through a free VF to the VM
- Extend Nova to enable IVSHMEM when launching the VM
SLURM SPANK Plugin with OpenStack based Design
J. Zhang, X. Lu, S. Chakraborty, D. K. Panda.
SLURM-V: Extending SLURM for Building Efficient
HPC Cloud with SR-IOV and IVShmem. Euro-Par,
2016
(Figure: workflow – mpirun_vm loads the SPANK VM Config Reader; Slurmctld registers the job step with Slurmd, which loads the VM Launcher/VM Reclaimer and asks the OpenStack daemon to launch or reclaim the VMs, then releases the hosts)
• 32 VMs across 8 nodes, 6 cores/VM
• EASJ – compared to Native, less than 4% overhead with 128 processes
• SACJ, EACJ – also minor overhead when running NAS as a concurrent job with 64 processes
Application-Level Performance on Chameleon (Graph500)
(Charts: Graph500 BFS execution time for VM vs. Native under exclusive-allocation sequential jobs (EASJ), shared-host-allocation concurrent jobs (SACJ), and exclusive-allocation concurrent jobs (EACJ))
• MVAPICH2-Virt with SR-IOV and IVSHMEM
– Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM
– SLURM alone, SLURM + OpenStack
• Big Data Libraries on Cloud
– RDMA-Hadoop, OpenStack Swift
Approaches to Build HPC Clouds
OFAW 2017 43Network Based Computing Laboratory
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 215 organizations from 29 countries
• More than 21,000 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE
OFAW 2017 44Network Based Computing Laboratory
High-Performance Apache Hadoop over Clouds: Challenges
• What are the performance characteristics of native IB-based designs for Apache
Hadoop in an SR-IOV enabled cloud environment?
• To achieve locality-aware communication, how can the cluster topology be
automatically detected in a scalable and efficient manner and be exposed to the
Hadoop framework?
• How can we design virtualization-aware policies in Hadoop for efficiently taking
advantage of the detected topology?
• Can the proposed policies improve the performance and fault tolerance of
Hadoop on virtualized platforms?
“How can we design a high-performance Hadoop library for Cloud-based systems?”
OFAW 2017 45Network Based Computing Laboratory
Overview of RDMA-Hadoop-Virt Architecture
• Virtualization-aware modules in all four main Hadoop components:
– HDFS: Virtualization-aware Block Management
to improve fault-tolerance
– YARN: Extensions to Container Allocation Policy
to reduce network traffic
– MapReduce: Extensions to Map Task Scheduling
Policy to reduce network traffic
– Hadoop Common: Topology Detection Module
for automatic topology detection
• Communications in HDFS, MapReduce, and RPC
go through RDMA-based designs over SR-IOV
enabled InfiniBand
(Figure: RDMA-Hadoop-Virt architecture – Big Data applications such as CloudBurst and MR-MS Polygraph running over MapReduce, HBase, YARN, HDFS, and Hadoop Common, with the Topology Detection Module, Map Task Scheduling Policy Extension, Container Allocation Policy Extension, and Virtualization-Aware Block Management, deployed on virtual machines, containers, or bare-metal nodes)
S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on
SR-IOV-enabled Clouds. CloudCom, 2016.
OFAW 2017 46Network Based Computing Laboratory
Evaluation with Applications
– 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
– 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join
(Charts: CloudBurst and Self-Join execution time for RDMA-Hadoop vs. RDMA-Hadoop-Virt in Default and Distributed modes; 30% and 55% reductions for RDMA-Hadoop-Virt in Distributed mode)
• Distributed Cloud-based Object Storage Service
• Deployed as part of OpenStack installation
• Can be deployed as standalone storage solution as well
• Worldwide data access via Internet
– HTTP-based
• Architecture
– Multiple Object Servers: To store data
– Few Proxy Servers: Act as a proxy for all requests
– Ring: Handles metadata
• Usage
– Input/output source for Big Data applications (most common use
case)
– Software/Data backup
– Storage of VM/Docker images
• Based on traditional TCP sockets communication (a minimal client-side HTTP sketch follows below)
OpenStack Swift Overview
(Figure: Swift architecture – a client sends PUT/GET /v1/<account>/<container>/<object> requests to a proxy server, which consults the ring and forwards them to object servers and their disks)
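For reference, the HTTP-based access path shown above can be exercised with a few lines of libcurl; the endpoint, account/container/object names, and token below are placeholders:

/* Fetch an object via Swift's HTTP API: GET /v1/<account>/<container>/<object>.
 * Build with: gcc swift_get.c -o swift_get -lcurl */
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* Placeholder proxy-server endpoint and object path. */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://proxy.example.com:8080/v1/AUTH_test/mycontainer/myobject");

    /* Placeholder auth token obtained from the identity service. */
    struct curl_slist *hdrs = curl_slist_append(NULL, "X-Auth-Token: <token>");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);

    CURLcode rc = curl_easy_perform(curl);   /* object body goes to stdout */
    if (rc != CURLE_OK)
        fprintf(stderr, "GET failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}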
• Challenges
– The proxy server is a bottleneck for large-scale deployments
– Object upload/download operations are network-intensive
– Can an RDMA-based approach provide benefits?
• Design
– Re-designed Swift architecture for improved scalability and
performance; Two proposed designs:
• Client-Oblivious Design: No changes required on the client side
• Metadata Server-based Design: Direct communication between
client and object servers; bypass proxy server
– RDMA-based communication framework for accelerating
networking performance
– High-performance I/O framework to provide maximum
overlap between communication and I/O
Swift-X: Accelerating OpenStack Swift with RDMA for Building
Efficient HPC Clouds
S. Gugnani, X. Lu, and D. K. Panda, Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud,
accepted at CCGrid’17, May 2017
Proposed designs: Client-Oblivious Design (D1); Metadata Server-based Design (D2)
OFAW 2017 49Network Based Computing Laboratory
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds
(Charts: time breakup of GET and PUT (communication, I/O, hashsum, other) for Swift, Swift-X (D1), and Swift-X (D2); GET latency for object sizes from 1 MB to 4 GB, reduced by up to 66%)
• Up to 66% reduction in GET latency
• Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET
OFAW 2017 50Network Based Computing Laboratory
Available Appliances on Chameleon Cloud*
• CentOS 7 KVM SR-IOV – Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/
• MPI bare-metal cluster complex appliance (based on Heat) – Deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/
• MPI + SR-IOV KVM cluster (based on Heat) – Deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation, configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/
• CentOS 7 SR-IOV RDMA-Hadoop – Built from the CentOS 7 appliance; additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/
• Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with
– High-Performance SR-IOV + InfiniBand
– High-Performance MVAPICH2 Library over bare-metal InfiniBand clusters
– High-Performance MVAPICH2 Library with Virtualization Support over SR-IOV enabled KVM clusters
– High-Performance Hadoop with RDMA-based Enhancements Support
[*] Only include appliances contributed by OSU NowLab
OFAW 2017 51Network Based Computing Laboratory
• MVAPICH2-Virt over SR-IOV-enabled InfiniBand is an efficient approach to build HPC Clouds
– Standalone, OpenStack, Slurm, and Slurm + OpenStack
– Support Virtual Machine Migration with SR-IOV InfiniBand devices
– Support Virtual Machine, Container (Docker and Singularity), and Nested Virtualization
• Very little overhead with virtualization; near-native performance at the application level
• Much better performance than Amazon EC2
• MVAPICH2-Virt is available for building HPC Clouds
– SR-IOV, IVSHMEM, Docker support, OpenStack
• Big Data analytics stacks such as RDMA-Hadoop can benefit from cloud-aware designs
• Appliances for MVAPICH2-Virt and RDMA-Hadoop are available for building HPC Clouds
• Future releases will support running MPI jobs in VMs/containers with SLURM, etc.
• SR-IOV/container support and appliances for other MVAPICH2 libraries (MVAPICH2-X,
MVAPICH2-GDR, ...) and RDMA-Spark/Memcached
Conclusions
OFAW 2017 52Network Based Computing Laboratory
One More Presentation
• Friday (03/31/17) at 11:00am
NVM-aware RDMA-Based Communication and I/O Schemes for High-Perf Big
Data Analytics
OFAW 2017 53Network Based Computing Laboratory
Funding Acknowledgments
Funding Support by
Equipment Support by
OFAW 2017 54Network Based Computing Laboratory
Personnel Acknowledgments
Current Students: A. Awan (Ph.D.), M. Bayatpour (Ph.D.), R. Biswas (M.S.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), H. Javed (Ph.D.), M. Li (Ph.D.), D. Shankar (Ph.D.), H. Shi (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists: X. Lu, H. Subramoni
Current Research Specialist: J. Smith
Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Programmers: D. Bureddy, M. Arnold, J. Perkins
{panda, luxi}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 

Recently uploaded (20)

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 

Building Efficient HPC Clouds with MVAPICH2 and RDMA-Hadoop over SR-IOV InfiniBand Clusters

  • 1. Building Efficient HPC Clouds with MVAPICH2 and RDMA-Hadoop over SR-IOV InfiniBand Clusters Talk at OpenFabrics Alliance Workshop (OFAW ‘17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi
  • 2. OFAW 2017 2Network Based Computing Laboratory • Cloud Computing focuses on maximizing the effectiveness of the shared resources • Virtualization is the key technology for resource sharing in the Cloud • Widely adopted in industry computing environment • IDC Forecasts Worldwide Public IT Cloud Services Spending to Reach Nearly $108 Billion by 2017 (Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS24298013) Cloud Computing and Virtualization VirtualizationCloud Computing
  • 3. OFAW 2017 3Network Based Computing Laboratory Drivers of Modern HPC Cluster and Cloud Architecture • Multi-core/many-core technologies, Accelerators • Large memory nodes • Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) • Single Root I/O Virtualization (SR-IOV) High Performance Interconnects – InfiniBand (with SR-IOV) <1usec latency, 200Gbps Bandwidth> Multi-/Many-core Processors SSDs, Object Storage Clusters Large memory nodes (Up to 2 TB) Cloud Cloud SDSC Comet TACC Stampede
  • 4. OFAW 2017 4Network Based Computing Laboratory • Single Root I/O Virtualization (SR-IOV) is providing new opportunities to design HPC cloud with very low overhead Single Root I/O Virtualization (SR-IOV) • Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs) • VFs are designed based on the existing non-virtualized PFs, no need for driver change • Each VF can be dedicated to a single VM through PCI pass-through • Works with 10/40 GigE and InfiniBand
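To make the PF/VF relationship concrete: once a VF has been passed through, software inside the VM sees it as an ordinary HCA. The minimal C sketch below (not from the talk; the build line is an assumption) simply lists the RDMA devices visible through libibverbs, which is a quick way to confirm that the SR-IOV VF is usable by verbs-based MPI libraries.

```c
/* Minimal sketch (not from the talk): enumerate RDMA devices with
 * libibverbs. Inside a VM with an SR-IOV Virtual Function passed
 * through via PCI pass-through, the VF shows up here like any other
 * HCA, so unmodified verbs/MPI stacks can use it directly.
 * Build (assumption): gcc check_vf.c -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }
    printf("Found %d RDMA device(s)\n", num);
    for (int i = 0; i < num; i++)
        printf("  %s\n", ibv_get_device_name(list[i]));
    ibv_free_device_list(list);
    return 0;
}
```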
  • 5. OFAW 2017 5Network Based Computing Laboratory • High-Performance Computing (HPC) has adopted advanced interconnects and protocols – InfiniBand – 10/40/100 Gigabit Ethernet/iWARP – RDMA over Converged Enhanced Ethernet (RoCE) • Very Good Performance – Low latency (a few microseconds) – High Bandwidth (100 Gb/s with EDR InfiniBand) – Low CPU overhead (5-10%) • The OpenFabrics software stack with IB, iWARP, and RoCE interfaces is driving HPC systems • How to Build HPC Clouds with SR-IOV and InfiniBand for delivering optimal performance? Building HPC Cloud with SR-IOV and InfiniBand
  • 6. OFAW 2017 6Network Based Computing Laboratory • Virtualization Support with Virtual Machines and Containers – KVM, Docker, Singularity, etc. • Communication coordination among optimized communication channels on Clouds – SR-IOV, IVShmem, IPC-Shm, CMA, etc. • Locality-aware communication • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) • Scalable Collective communication – Offload; Non-blocking; Topology-aware • Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores) – Multiple end-points per node • NUMA-aware communication for nested virtualization • Integrated Support for GPGPUs and Accelerators • Fault-tolerance/resiliency – Migration support with virtual machines • QoS support for communication and I/O • Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, …) • Energy-Awareness • Co-design with resource management and scheduling systems on Clouds – OpenStack, Slurm, etc. Broad Challenges in Designing Communication and I/O Middleware for HPC on Clouds
  • 7. OFAW 2017 7Network Based Computing Laboratory • High-Performance designs for Big Data middleware – RDMA-based designs to accelerate Big Data middleware on high-performance Interconnects – NVM-aware communication and I/O schemes for Big Data – SATA-/PCIe-/NVMe-SSD support – Parallel Filesystem support – Optimized overlapping among Computation, Communication, and I/O – Threaded Models and Synchronization • Fault-tolerance/resiliency – Migration support with virtual machines – Data replication • Efficient data access and placement policies • Efficient task scheduling • Fast deployment and automatic configurations on Clouds Additional Challenges in Designing Communication and I/O Middleware for Big Data on Clouds
  • 8. OFAW 2017 8Network Based Computing Laboratory • MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack • SR-IOV-enabled VM Migration Support in MVAPICH2 • MVAPICH2 with Containers (Docker and Singularity) • MVAPICH2 with Nested Virtualization (Container over VM) • MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack • Big Data Libraries on Cloud – RDMA-Hadoop, OpenStack Swift Approaches to Build HPC Clouds
  • 9. OFAW 2017 9Network Based Computing Laboratory MVAPICH2 Software Family – High-Performance Parallel Programming Libraries: MVAPICH2 (support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE); MVAPICH2-X (advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime); MVAPICH2-GDR (optimized MPI for clusters with NVIDIA GPUs); MVAPICH2-Virt (high-performance and scalable MPI for hypervisor and container based HPC cloud); MVAPICH2-EA (energy aware and high-performance MPI); MVAPICH2-MIC (optimized MPI for clusters with Intel KNC) – Microbenchmarks: OMB (microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs) – Tools: OSU INAM (network monitoring, profiling, and analysis for clusters with MPI and scheduler integration); OEMT (utility to measure the energy consumption of MPI applications)
  • 10. OFAW 2017 10Network Based Computing Laboratory HPC on Cloud Computing Systems: Challenges Addressed by OSU So Far HPC and Big Data Middleware Networking Technologies (InfiniBand, Omni-Path, 1/10/40/100 GigE and Intelligent NICs) Storage Technologies (HDD, SSD, NVRAM, and NVMe-SSD) HPC (MPI, PGAS, MPI+PGAS, MPI+OpenMP, etc.) Applications Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Communication and I/O Library Future Studies Resource Management and Scheduling Systems for Cloud Computing (OpenStack Nova, Heat; Slurm) Virtualization (Hypervisor and Container) Locality- and NUMA-aware Communication Communication Channels (SR-IOV, IVShmem, IPC-Shm, CMA) Fault-Tolerance & Consolidation (Migration) QoS-aware
  • 11. OFAW 2017 11Network Based Computing Laboratory Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM • Redesign MVAPICH2 to make it virtual machine aware – SR-IOV shows near-native performance for inter-node point-to-point communication – IVSHMEM offers shared-memory-based data access across co-resident VMs – Locality Detector: maintains the locality information of co-resident virtual machines – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014 J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014
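The following is a simplified C illustration of the Locality Detector / Communication Coordinator idea described above. It is not MVAPICH2-Virt source code; the rank-to-VM/host tables and helper names are made up for the example.

```c
/* Simplified sketch of locality-aware channel selection (illustrative only). */
#include <stdio.h>

typedef enum { CHANNEL_SHMEM, CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

/* Placeholder locality data. In the real library this would come from the
 * Locality Detector; here ranks 0,1 live in VM A and ranks 2,3 in VM B,
 * with all four VMs' ranks on the same physical host. */
static const int vm_id[4]   = { 0, 0, 1, 1 };
static const int host_id[4] = { 0, 0, 0, 0 };

static int on_same_vm(int a, int b)            { return vm_id[a] == vm_id[b]; }
static int on_same_physical_host(int a, int b) { return host_id[a] == host_id[b]; }

/* "Communication Coordinator": pick a channel per peer. */
static channel_t select_channel(int my_rank, int peer_rank)
{
    if (on_same_vm(my_rank, peer_rank))
        return CHANNEL_SHMEM;     /* intra-VM: regular shared memory   */
    if (on_same_physical_host(my_rank, peer_rank))
        return CHANNEL_IVSHMEM;   /* co-resident VMs: IVSHMEM region   */
    return CHANNEL_SRIOV;         /* different hosts: SR-IOV VF over IB */
}

int main(void)
{
    const char *names[] = { "shared memory", "IVSHMEM", "SR-IOV" };
    printf("rank 0 -> rank 1: %s\n", names[select_channel(0, 1)]);
    printf("rank 0 -> rank 2: %s\n", names[select_channel(0, 2)]);
    return 0;
}
```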
  • 12. OFAW 2017 12Network Based Computing Laboratory • OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines • Deployment with OpenStack – Supporting SR-IOV configuration – Supporting IVSHMEM configuration – Virtual Machine aware design of MVAPICH2 with SR-IOV • An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015
  • 13. OFAW 2017 13Network Based Computing Laboratory Application-Level Performance on Chameleon [Charts: SPEC MPI2007 and Graph500 execution time; MV2-SR-IOV-Def vs. MV2-SR-IOV-Opt vs. MV2-Native] • 32 VMs, 6 Core/VM • Compared to Native, 2-5% overhead for Graph500 with 128 Procs • Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs
  • 14. OFAW 2017 14Network Based Computing Laboratory • MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack • SR-IOV-enabled VM Migration Support in MVAPICH2 • MVAPICH2 with Containers (Docker and Singularity) • MVAPICH2 with Nested Virtualization (Container over VM) • MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack • Big Data Libraries on Cloud – RDMA-Hadoop, OpenStack Swift Approaches to Build HPC Clouds
  • 15. OFAW 2017 15Network Based Computing Laboratory Execute Live Migration with SR-IOV Device
  • 16. OFAW 2017 16Network Based Computing Laboratory High Performance SR-IOV enabled VM Migration Support in MVAPICH2 J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017 • Migration with an SR-IOV device has to handle the challenges of detaching/re-attaching the virtualized IB device and the IB connection • Consists of an SR-IOV enabled IB Cluster and an External Migration Controller • Multiple parallel libraries to notify MPI applications during migration (detach/reattach SR-IOV/IVShmem, migrate VMs, migration status) • Handles IB connection suspension and reactivation • Proposes Progress Engine (PE) and Migration Thread (MT) based designs to optimize VM migration and MPI application performance
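For illustration only, the C sketch below shows the kind of controller-side sequence such a framework has to drive, expressed with the public libvirt API: detach the SR-IOV VF, live-migrate the VM, and re-attach a VF on the destination. It is not the actual migration controller; host names, the domain name, the hostdev XML, and the build line are placeholders, and IVSHMEM handling, MPI-runtime notification, and error handling are omitted.

```c
/* Hypothetical controller-side sketch of SR-IOV-aware live migration.
 * Build (assumption): gcc migrate_vf.c -lvirt
 */
#include <stdio.h>
#include <libvirt/libvirt.h>

/* Placeholder hostdev XML describing the VF's PCI address. */
static const char *vf_xml =
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "  <source><address domain='0x0000' bus='0x81' slot='0x00' function='0x2'/></source>"
    "</hostdev>";

int main(void)
{
    virConnectPtr src = virConnectOpen("qemu+ssh://src-host/system");
    virConnectPtr dst = virConnectOpen("qemu+ssh://dst-host/system");
    virDomainPtr  dom = virDomainLookupByName(src, "mpi-vm-0");

    /* 1. Detach the VF (the MPI runtime is expected to have suspended
     *    its IB channels before this point). */
    virDomainDetachDevice(dom, vf_xml);

    /* 2. Live-migrate the VM to the destination host. */
    virDomainMigrateToURI(dom, "qemu+ssh://dst-host/system",
                          VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER, NULL, 0);

    /* 3. Re-attach a free VF on the destination so the runtime can
     *    re-establish its IB connections. */
    virDomainPtr dom_dst = virDomainLookupByName(dst, "mpi-vm-0");
    virDomainAttachDevice(dom_dst, vf_xml);

    virDomainFree(dom);
    virDomainFree(dom_dst);
    virConnectClose(src);
    virConnectClose(dst);
    return 0;
}
```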
  • 17. OFAW 2017 17Network Based Computing Laboratory Performance Evaluation of VM Migration Framework • Compared with TCP, the RDMA scheme reduces the total migration time by 20% • Total time is dominated by the ‘Migration’ step; times on the other steps (Set PRE_MIGRATION, Detach VF, Remove IVSHMEM, Attach VF, Add IVSHMEM, Set POST_MIGRATION) are similar across the different schemes • The proposed migration framework reduces migration time by up to 51% [Charts: breakdown of VM migration for TCP, IPoIB, and RDMA; multiple-VM migration time for the Sequential vs. Proposed Migration Framework with 2-16 VMs]
  • 18. OFAW 2017 18Network Based Computing Laboratory Performance Evaluation of VM Migration Framework • Migrate a VM from one machine to another while the benchmark is running inside • Proposed MT-based designs perform slightly worse than PE-based designs because of lock/unlock overhead • No benefit from MT because no computation is involved [Charts: Pt2Pt latency and Bcast (4 VMs * 2 Procs/VM) latency vs. message size for PE-IPoIB, PE-RDMA, MT-IPoIB, MT-RDMA]
  • 19. OFAW 2017 19Network Based Computing Laboratory Performance Evaluation of VM Migration Framework • 8 VMs in total; 1 VM carries out migration while the application is running • MT-worst and PE incur some overhead compared with NM • MT-typical allows migration to be completely overlapped with computation [Charts: NAS (LU.C, EP.C, IS.C, MG.C, CG.C) and Graph500 execution time for PE, MT-worst, MT-typical, NM]
  • 20. OFAW 2017 20Network Based Computing Laboratory • MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack • SR-IOV-enabled VM Migration Support in MVAPICH2 • MVAPICH2 with Containers (Docker and Singularity) • MVAPICH2 with Nested Virtualization (Container over VM) • MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack • Big Data Libraries on Cloud – RDMA-Hadoop, OpenStack Swift Approaches to Build HPC Clouds
  • 21. OFAW 2017 21Network Based Computing Laboratory Overview of Container-based Virtualization • Container-based technologies (e.g., Docker) provide lightweight virtualization solutions • Container-based virtualization – containers share the host kernel [Diagram: VM-based vs. container-based stacks (VM1, Container1)]
  • 22. OFAW 2017 22Network Based Computing Laboratory Benefits of Container-based Virtualization for HPC on Cloud • Experiment on NFS Chameleon Cloud • Containers have less overhead than VMs • BFS time in Graph 500 significantly increases as the number of containers on one host increases. Why? [Charts: ib_send_lat latency vs. message size for VM-PT, VM-SR-IOV, Container-PT, Native; Graph500 BFS execution time, Scale/Edgefactor (20,16), for Native-16P, 1Cont*16P, 2Conts*8P, 4Conts*4P] J. Zhang, X. Lu, D. K. Panda. Performance Characterization of Hypervisor- and Container-Based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters. IPDRM, IPDPS Workshop, 2016
  • 23. OFAW 2017 23Network Based Computing Laboratory Containers-based Design: Issues, Challenges, and Approaches • What are the performance bottlenecks when running MPI applications on multiple containers per host in an HPC cloud? • Can we propose a new design to overcome the bottleneck on such a container-based HPC cloud? • Can the optimized design deliver near-native performance for different container deployment scenarios? • Locality-aware design to enable CMA and shared memory channels for MPI communication across co-resident containers J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016
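As a concrete view of the CMA channel mentioned in the last bullet, the generic C helper below (not MVAPICH2 code) copies a buffer directly out of a co-resident peer process with a single process_vm_readv() call, avoiding the extra copies of a loopback network path. The peer's pid and remote buffer address are assumed to have been exchanged beforehand (e.g., over shared memory), and the containers must be permitted to ptrace each other (shared PID namespace or suitable privileges).

```c
/* Generic Cross Memory Attach (CMA) read helper, for illustration only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

ssize_t cma_read_from_peer(pid_t peer_pid, void *remote_addr,
                           void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* One kernel-mediated copy from the peer's memory into ours. */
    ssize_t n = process_vm_readv(peer_pid, &local, 1, &remote, 1, 0);
    if (n < 0)
        perror("process_vm_readv");
    return n;
}
```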
  • 24. OFAW 2017 24Network Based Computing Laboratory Application-Level Performance on Docker with MVAPICH2 • 64 Containers across 16 nodes, pinning 4 Cores per Container • Compared to Container-Def, up to 11% and 73% execution time reduction for NAS and Graph 500 • Compared to Native, less than 9% and 5% overhead for NAS and Graph 500 [Charts: NAS (Class D) execution time and Graph 500 BFS execution time, Scale/Edgefactor (20,16), for Container-Def, Container-Opt, Native]
  • 25. OFAW 2017 25Network Based Computing Laboratory MVAPICH2 Intra-Node and Inter-Node Point-to-Point Performance on Singularity • Less than 18% overhead on latency • Less than 13% overhead on BW [Charts: latency (us) and bandwidth (MB/s) vs. message size for Singularity-Intra-Node, Native-Intra-Node, Singularity-Inter-Node, Native-Inter-Node]
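The numbers above come from point-to-point microbenchmarks; the simplified MPI ping-pong below conveys what such a test measures. It is not the actual OSU Micro-Benchmark code, and the suggested launch line is an assumption that depends on the local Singularity/MPI setup.

```c
/* Simplified MPI ping-pong latency sketch (not the OSU Micro-Benchmarks).
 * Example launch (assumption): mpirun -np 2 singularity exec img ./pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 1000
#define MSG_SIZE 4096

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];
    memset(buf, 'a', sizeof(buf));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d-byte one-way latency: %.2f us\n",
               MSG_SIZE, (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```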
  • 26. OFAW 2017 26Network Based Computing Laboratory MVAPICH2 Collective Performance on Singularity • 512 Processes across 32 nodes • Less than 15% and 14% overhead for Bcast and Allreduce, respectively [Charts: Bcast and Allreduce latency (us) vs. message size, Singularity vs. Native]
  • 27. OFAW 2017 27Network Based Computing Laboratory Application-Level Performance on Singularity with MVAPICH2 • 512 Processes across 32 nodes • Less than 16% and 11% overhead for NPB and Graph500, respectively [Charts: Graph500 BFS execution time by problem size (Scale, Edgefactor) and NPB (Class D) execution time, Singularity vs. Native]
  • 28. OFAW 2017 28Network Based Computing Laboratory • MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack • SR-IOV-enabled VM Migration Support in MVAPICH2 • MVAPICH2 with Containers (Docker and Singularity) • MVAPICH2 with Nested Virtualization (Container over VM) • MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack • Big Data Libraries on Cloud – RDMA-Hadoop, OpenStack Swift Approaches to Build HPC Clouds
  • 29. OFAW 2017 29Network Based Computing Laboratory Nested Virtualization: Containers over Virtual Machines • Useful for live migration, sandbox application, legacy system integration, software deployment, etc. • Performance issues because of the redundant call stacks (two-layer virtualization) and isolated physical resources
  • 30. OFAW 2017 30Network Based Computing Laboratory Multiple Communication Paths in Nested Virtualization 1. Intra-VM Intra-Container (across core 4 and core 5) 2. Intra-VM Inter-Container (across core 13 and core 14) 3. Inter-VM Inter-Container (across core 6 and core 12) 4. Inter-Node Inter-Container (across core 15 and the core on remote node) • Different VM placements introduce multiple communication paths on container level
  • 31. OFAW 2017 31Network Based Computing Laboratory Performance Characteristics on Communication Paths • Two VMs are deployed on the same socket and different sockets, respectively • *-Def and Inter-VM Inter-Container-1Layer have similar performance • Large gap compared to native performance [Charts: Intra-Socket and Inter-Socket latency (us) vs. message size for Intra-VM Inter-Container-Def, Inter-VM Inter-Container-Def, Intra-VM Inter-Container-1Layer*, Inter-VM Inter-Container-1Layer*, Native] 1Layer* - J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand, ICPP, 2016
  • 32. OFAW 2017 32Network Based Computing Laboratory Challenges of Nested Virtualization • How to further reduce the performance overhead of running applications on the nested virtualization environment? • What are the impacts of the different VM/container placement schemes for the communication on the container level? • Can we propose a design which can adapt these different VM/container placement schemes and deliver near-native performance for nested virtualization environments?
  • 33. OFAW 2017 33Network Based Computing Laboratory Overview of Proposed Design in MVAPICH2 • Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as the ones in the co-resident VMs • Two-Layer NUMA Aware Communication Coordinator: leverages nested locality info, NUMA architecture info, and the message to select the appropriate communication channel J. Zhang, X. Lu, D. K. Panda. Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, VEE, 2017
  • 34. OFAW 2017 34Network Based Computing Laboratory Inter-VM Inter-Container Pt2Pt (Intra-Socket) • 1Layer has similar performance to the Default • Compared with 1Layer, 2Layer delivers up to 84% and 184% improvement for latency and BW [Charts: latency and BW]
  • 35. OFAW 2017 35Network Based Computing Laboratory Inter-VM Inter-Container Pt2Pt (Inter-Socket) [Charts: latency and BW] • 1-Layer has similar performance to the Default • 2-Layer has near-native performance for small messages, but clear overhead for large messages • Compared to 2-Layer, the Hybrid design brings up to 42% and 25% improvement for latency and BW, respectively
  • 36. OFAW 2017 36Network Based Computing Laboratory Application-level Evaluations • 256 processes across 64 containers on 16 nodes • Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) and 10% (LU) for Graph 500 and NAS, respectively • Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit [Charts: Graph500 and Class D NAS]
  • 37. OFAW 2017 37Network Based Computing Laboratory • MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack • SR-IOV-enabled VM Migration Support in MVAPICH2 • MVAPICH2 with Containers (Docker and Singularity) • MVAPICH2 with Nested Virtualization (Container over VM) • MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack • Big Data Libraries on Cloud – RDMA-Hadoop, OpenStack Swift Approaches to Build HPC Clouds
  • 38. OFAW 2017 38Network Based Computing Laboratory Need for Supporting SR-IOV and IVSHMEM in SLURM • Requirement of managing and isolating the virtualized SR-IOV and IVSHMEM resources • Such management and isolation is hard to achieve with the MPI library alone, but much easier with SLURM • Efficiently running MPI applications on HPC Clouds requires SLURM support for managing SR-IOV and IVSHMEM – Can critical HPC resources be efficiently shared among users by extending SLURM with support for SR-IOV and IVSHMEM based virtualization? – Can SR-IOV and IVSHMEM enabled SLURM and MPI library provide bare-metal performance for end applications on HPC Clouds?
  • 39. OFAW 2017 39Network Based Computing Laboratory SLURM SPANK Plugin based Design [Diagram: Slurmctld/Slurmd workflow – mpirun_vm, load SPANK, register job step, run job step, MPI Job across VMs, reclaim VMs, release hosts – with VM Config Reader, VM Launcher, and VM Reclaimer components] • VM Configuration Reader – Register all VM configuration options and set them in the job control environment so that they are visible to all allocated nodes • VM Launcher – Set up VMs on each allocated node - File-based lock to detect occupied VFs and exclusively allocate a free VF - Assign a unique ID to each IVSHMEM device and dynamically attach it to each VM • VM Reclaimer – Tear down VMs and reclaim resources
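To make the SPANK-based approach concrete, here is a minimal C plugin sketch in the spirit of the VM Configuration Reader: it adds a hypothetical --vm-config option to srun and exports its value into the job environment on the allocated nodes. It is not the actual MVAPICH2-Virt/SLURM-V plugin; the option name, environment variable, and build command are assumptions.

```c
/* Minimal SPANK plugin sketch (illustrative, not the SLURM-V plugin).
 * Build (assumption): gcc -shared -fPIC -o vm_config.so vm_config.c
 */
#include <stdio.h>
#include <slurm/spank.h>

SPANK_PLUGIN(vm_config_reader, 1);

static char vm_config_path[4096] = "";

/* Callback invoked when the user passes --vm-config=<file>. */
static int opt_cb(int val, const char *optarg, int remote)
{
    if (optarg)
        snprintf(vm_config_path, sizeof(vm_config_path), "%s", optarg);
    return 0;
}

struct spank_option spank_options[] = {
    { "vm-config", "file", "VM configuration file for this MPI job",
      1 /* has_arg */, 0, opt_cb },
    SPANK_OPTIONS_TABLE_END
};

/* Runs after option processing; export the value on the remote side so
 * every allocated node's job environment can see it. */
int slurm_spank_init_post_opt(spank_t sp, int ac, char **av)
{
    if (spank_remote(sp) && vm_config_path[0] != '\0') {
        spank_setenv(sp, "MV2VIRT_VM_CONFIG", vm_config_path, 1);
        slurm_info("vm_config_reader: using %s", vm_config_path);
    }
    return 0;
}
```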
  • 40. OFAW 2017 40Network Based Computing Laboratory SLURM SPANK Plugin with OpenStack based Design • VM Configuration Reader – registers VM options • VM Launcher, VM Reclaimer – Offload to the underlying OpenStack infrastructure - PCI whitelist to pass through a free VF to the VM - Extend Nova to enable IVSHMEM when launching the VM [Diagram: Slurmctld/Slurmd workflow with the OpenStack daemon handling VM launch and reclaim requests] J. Zhang, X. Lu, S. Chakraborty, D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. Euro-Par, 2016
  • 41. OFAW 2017 41Network Based Computing Laboratory Application-Level Performance on Chameleon (Graph500) • 32 VMs across 8 nodes, 6 Core/VM • EASJ – Compared to Native, less than 4% overhead with 128 Procs • SACJ, EACJ – Also minor overhead when running NAS as a concurrent job with 64 Procs [Charts: BFS execution time vs. problem size (Scale, Edgefactor), VM vs. Native, for Exclusive Allocations Sequential Jobs, Shared-host Allocations Concurrent Jobs, and Exclusive Allocations Concurrent Jobs]
  • 42. OFAW 2017 42Network Based Computing Laboratory • MVAPICH2-Virt with SR-IOV and IVSHMEM – Standalone, OpenStack • SR-IOV-enabled VM Migration Support in MVAPICH2 • MVAPICH2 with Containers (Docker and Singularity) • MVAPICH2 with Nested Virtualization (Container over VM) • MVAPICH2-Virt on SLURM – SLURM alone, SLURM + OpenStack • Big Data Libraries on Cloud – RDMA-Hadoop, OpenStack Swift Approaches to Build HPC Clouds
  • 43. OFAW 2017 43Network Based Computing Laboratory • RDMA for Apache Spark • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions • RDMA for Apache HBase • RDMA for Memcached (RDMA-Memcached) • RDMA for Apache Hadoop 1.x (RDMA-Hadoop) • OSU HiBD-Benchmarks (OHB) – HDFS, Memcached, HBase, and Spark Micro-benchmarks • http://hibd.cse.ohio-state.edu • Users Base: 215 organizations from 29 countries • More than 21,000 downloads from the project site The High-Performance Big Data (HiBD) Project Available for InfiniBand and RoCE
  • 44. OFAW 2017 44Network Based Computing Laboratory High-Performance Apache Hadoop over Clouds: Challenges • What are the performance characteristics of native IB-based designs for Apache Hadoop in an SR-IOV enabled cloud environment? • To achieve locality-aware communication, how can the cluster topology be automatically detected in a scalable and efficient manner and be exposed to the Hadoop framework? • How can we design virtualization-aware policies in Hadoop to efficiently take advantage of the detected topology? • Can the proposed policies improve the performance and fault tolerance of Hadoop on virtualized platforms? “How can we design a high-performance Hadoop library for Cloud-based systems?”
  • 45. OFAW 2017 45Network Based Computing Laboratory Overview of RDMA-Hadoop-Virt Architecture • Virtualization-aware modules in all four main Hadoop components: – HDFS: Virtualization-aware Block Management to improve fault-tolerance – YARN: Extensions to Container Allocation Policy to reduce network traffic – MapReduce: Extensions to Map Task Scheduling Policy to reduce network traffic – Hadoop Common: Topology Detection Module for automatic topology detection • Communications in HDFS, MapReduce, and RPC go through RDMA-based designs over SR-IOV enabled InfiniBand [Architecture diagram: Big Data applications (CloudBurst, MR-MS Polygraph, others) over HDFS, YARN, MapReduce, HBase, and Hadoop Common, running on VMs, containers, and bare-metal nodes] S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016.
  • 46. OFAW 2017 46Network Based Computing Laboratory Evaluation with Applications – 14% and 24% improvement with Default Mode for CloudBurst and Self-Join – 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join [Charts: CloudBurst and Self-Join execution time in Default and Distributed Modes, RDMA-Hadoop vs. RDMA-Hadoop-Virt]
  • 47. OFAW 2017 47Network Based Computing Laboratory OpenStack Swift Overview • Distributed Cloud-based Object Storage Service • Deployed as part of an OpenStack installation • Can also be deployed as a standalone storage solution • Worldwide data access via the Internet – HTTP-based • Architecture – Multiple Object Servers: store data – A few Proxy Servers: act as a proxy for all requests – Ring: handles metadata • Usage – Input/output source for Big Data applications (most common use case) – Software/Data backup – Storage of VM/Docker images • Based on traditional TCP sockets communication [Architecture diagram: clients send PUT/GET /v1/<account>/<container>/<object> requests to Proxy Servers, which use the Ring and forward them to Object Servers backed by disks]
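Since Swift is accessed through the plain HTTP path shown above, a client GET is just a request to /v1/<account>/<container>/<object> via a proxy server. The libcurl-based C sketch below illustrates that sockets-based baseline (the endpoint, token, and build line are placeholders); this is the path that Swift-X accelerates with RDMA on the following slides.

```c
/* Illustrative Swift object GET over plain HTTP using libcurl.
 * Build (assumption): gcc swift_get.c -lcurl
 */
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* Placeholder proxy endpoint, account, container, and object name. */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://proxy-server:8080/v1/AUTH_demo/mycontainer/myobject");

    /* Auth token obtained from Keystone beforehand (placeholder value). */
    struct curl_slist *hdrs =
        curl_slist_append(NULL, "X-Auth-Token: <token>");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);

    /* Object body is written to stdout by libcurl's default callback. */
    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "GET failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```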
  • 48. OFAW 2017 48Network Based Computing Laboratory Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds • Challenges – The proxy server is a bottleneck for large-scale deployments – Object upload/download operations are network intensive – Can an RDMA-based approach benefit? • Design – Re-designed Swift architecture for improved scalability and performance; two proposed designs: • Client-Oblivious Design (D1): No changes required on the client side • Metadata Server-based Design (D2): Direct communication between client and object servers; bypasses the proxy server – RDMA-based communication framework for accelerating networking performance – High-performance I/O framework to provide maximum overlap between communication and I/O S. Gugnani, X. Lu, and D. K. Panda, Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud, accepted at CCGrid’17, May 2017
  • 49. OFAW 2017 49Network Based Computing Laboratory Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds • Up to 66% reduction in GET latency • Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET [Charts: time breakup of GET and PUT (Communication, I/O, Hashsum, Other) and GET latency vs. object size (1MB-4GB) for Swift, Swift-X (D1), Swift-X (D2)]
  • 50. OFAW 2017 50Network Based Computing Laboratory Available Appliances on Chameleon Cloud* – CentOS 7 KVM SR-IOV: Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/ – MPI bare-metal cluster complex appliance (Based on Heat): Deploys an MPI cluster composed of bare metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/ – MPI + SR-IOV KVM cluster (Based on Heat): Deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation and configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/ – CentOS 7 SR-IOV RDMA-Hadoop: Built from the CentOS 7 appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/ • Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with – High-Performance SR-IOV + InfiniBand – High-Performance MVAPICH2 Library over bare-metal InfiniBand clusters – High-Performance MVAPICH2 Library with Virtualization Support over SR-IOV enabled KVM clusters – High-Performance Hadoop with RDMA-based Enhancements Support [*] Only includes appliances contributed by OSU NowLab
  • 51. OFAW 2017 51Network Based Computing Laboratory • MVAPICH2-Virt over SR-IOV-enabled InfiniBand is an efficient approach to build HPC Clouds – Standalone, OpenStack, Slurm, and Slurm + OpenStack – Support Virtual Machine Migration with SR-IOV InfiniBand devices – Support Virtual Machine, Container (Docker and Singularity), and Nested Virtualization • Very little overhead with virtualization, near native performance at application level • Much better performance than Amazon EC2 • MVAPICH2-Virt is available for building HPC Clouds – SR-IOV, IVSHMEM, Docker support, OpenStack • Big Data analytics stacks such as RDMA-Hadoop can benefit from cloud-aware designs • Appliances for MVAPICH2-Virt and RDMA-Hadoop are available for building HPC Clouds • Future releases for supporting running MPI jobs in VMs/Containers with SLURM, etc. • SR-IOV/container support and appliances for other MVAPICH2 libraries (MVAPICH2-X, MVAPICH2-GDR, ...) and RDMA-Spark/Memcached Conclusions
  • 52. OFAW 2017 52Network Based Computing Laboratory One More Presentation • Friday (03/31/17) at 11:00am NVM-aware RDMA-Based Communication and I/O Schemes for High-Perf Big Data Analytics
  • 53. OFAW 2017 53Network Based Computing Laboratory Funding Acknowledgments Funding Support by Equipment Support by
  • 54. OFAW 2017 54Network Based Computing Laboratory Personnel Acknowledgments Current Students – A. Awan (Ph.D.) – R. Biswas (M.S.) – M. Bayatpour (Ph.D.) – S. Chakraborthy (Ph.D.) Past Students – A. Augustine (M.S.) – P. Balaji (Ph.D.) – S. Bhagvat (M.S.) – A. Bhat (M.S.) – D. Buntinas (Ph.D.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – R. Rajachandrasekar (Ph.D.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) Past Research Scientist – K. Hamidouche – S. Sur Past Post-Docs – D. Banerjee – X. Besseron – H.-W. Jin – W. Huang (Ph.D.) – W. Jiang (M.S.) – J. Jose (Ph.D.) – S. Kini (M.S.) – M. Koop (Ph.D.) – K. Kulkarni (M.S.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – A. Moody (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – C.-H. Chu (Ph.D.) – S. Guganani (Ph.D.) – J. Hashmi (Ph.D.) – H. Javed (Ph.D.) – J. Lin – M. Luo – E. Mancini Current Research Scientists – X. Lu – H. Subramoni Past Programmers – D. Bureddy – M. Arnold – J. Perkins Current Research Specialist – J. Smith – M. Li (Ph.D.) – D. Shankar (Ph.D.) – H. Shi (Ph.D.) – J. Zhang (Ph.D.) – S. Marcarelli – J. Vienne – H. Wang
  • 55. OFAW 2017 55Network Based Computing Laboratory {panda, luxi}@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda http://www.cse.ohio-state.edu/~luxi Thank You! Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/