5. HPC on Cloud (8 papers)
1. “Reliability Guided Resource Allocation for Large-Scale Systems,” S. Umamaheshwaran and T. J. Hacker (Purdue U.)
2. “Energy-Efficient Scheduling of Urgent Bag-of-Tasks Applications in Clouds through DVFS,” R. N. Calheiros and R. Buyya (U. Melbourne)
3. “A Framework for Measuring the Impact and Effectiveness of the NEES Cyberinfrastructure for Earthquake Engineering,” T. Hacker and A. J. Magana (Purdue U.)
4. “Executing Bag of Distributed Tasks on the Cloud: Investigating the Trade-Offs between Performance and Cost,” L. Thai, B. Varghese, and A. Barker (U. St Andrews)
5. “CPU Performance Coefficient (CPU-PC): A Novel Performance Metric Based on Real-Time CPU Resource Provisioning in Time-Shared Cloud Environments,” T. Mastelić, I. Brandić, and J. Jašarević (Vienna U. of Technology)
6. “Performance Analysis of Cloud Environments on Top of Energy-Efficient Platforms Featuring Low Power Processors,” V. Plugaru, S. Varrette, and P. Bouvry (U. Luxembourg)
7. “Exploring the Performance Impact of Virtualization on an HPC Cloud,” N. Chakthranont, P. Khunphet, R. Takano, and T. Ikegami (KMUTNB, AIST)
8. “GateCloud: An Integration of Gate Monte Carlo Simulation with a Cloud Computing Environment,” B. A. Rowedder, H. Wang, and Y. Kuang (UNLV)
The remaining slides summarize paper 7, “Exploring the Performance Impact of Virtualization on an HPC Cloud.”

8. ASGC Hardware Spec.
Compute Node
– CPU: Intel Xeon E5-2680v2 (2.8 GHz, 10 cores) × 2
– Memory: 128 GB DDR3-1866
– InfiniBand: Mellanox ConnectX-3 (FDR)
– Ethernet: Intel X520-DA2 (10 GbE)
– Disk: Intel SSD DC S3500, 600 GB
• The 155-node cluster consists of Cray H2312 blade servers
• The theoretical peak performance is 69.44 TFLOPS
• Operation started in July 2014
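As a sanity check, the quoted peak follows from the node spec, assuming the Ivy Bridge figure of 8 double-precision flops per core per cycle (an inference from the CPU model, not stated on the slide):

\[
155\ \text{nodes} \times 2\ \text{CPUs} \times 10\ \text{cores} \times 2.8\ \text{GHz} \times 8\ \text{flops/cycle} = 69.44\ \text{TFLOPS}
\]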
9. ASGC Software Stack
Management Stack
– CentOS 6.5 (QEMU/KVM 0.12.1.2)
– Apache CloudStack 4.3 + our extensions
• PCI passthrough/SR-IOV support (KVM only)
• sgc-tools: Virtual cluster construction utility
– RADOS cluster storage
HPC Stack (Virtual Cluster)
– Intel Compiler/Math Kernel Library SP1 1.1.106
– Open MPI 1.6.5
– Mellanox OFED 2.1
– Torque job scheduler
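Since the HPC stack relies on PCI passthrough/SR-IOV for the InfiniBand HCA, a quick guest-side check is whether the device appears under sysfs. A minimal sketch (illustrative only, not part of sgc-tools):

```c
/* List InfiniBand HCAs visible inside the guest. With Mellanox OFED
 * loaded, a passed-through ConnectX-3 shows up under
 * /sys/class/infiniband (typically as mlx4_0). */
#include <stdio.h>
#include <dirent.h>

int main(void)
{
    DIR *d = opendir("/sys/class/infiniband");
    if (!d) {
        puts("no InfiniBand device visible (passthrough not active?)");
        return 1;
    }
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')   /* skip "." and ".." */
            printf("HCA: %s\n", e->d_name);
    closedir(d);
    return 0;
}
```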
10. Benchmark Programs
Micro benchmark
– Intel MPI Benchmarks (IMB) version 3.2.4
Application-level benchmark
– HPC Challenge (HPCC) version 1.4.3
• G-HPL
• EP-STREAM
• G-RandomAccess
• G-FFT
– OpenMX version 3.7.4
– Graph 500 version 2.1.4
12. MPI Collectives (64 bytes)
[Figure: IMB execution time (usec) vs. number of nodes (0-128) for Allgather, Allreduce, and Alltoall, physical vs. virtual cluster. Virtualization overhead at 128 nodes: +77% (Allgather), +88% (Allreduce), +43% (Alltoall).]
The overhead becomes significant as the number of nodes increases. … load imbalance?
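Numbers like these come from a tight timing loop around each collective. A minimal sketch of the measurement pattern (IMB itself adds warm-up phases, varying message sizes, and per-rank statistics), timing MPI_Allreduce on a 64-byte buffer:

```c
/* Average per-call latency of MPI_Allreduce at a fixed 64-byte message
 * size, in the style of the IMB collective benchmarks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_BYTES 64
#define ITERS     1000

int main(int argc, char **argv)
{
    int rank, nprocs;
    char sendbuf[MSG_BYTES], recvbuf[MSG_BYTES];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    memset(sendbuf, 1, sizeof sendbuf);

    /* One untimed call so connection setup is excluded. */
    MPI_Allreduce(sendbuf, recvbuf, MSG_BYTES, MPI_BYTE, MPI_BOR,
                  MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(sendbuf, recvbuf, MSG_BYTES, MPI_BYTE, MPI_BOR,
                      MPI_COMM_WORLD);
    double usec = (MPI_Wtime() - t0) / ITERS * 1e6;

    if (rank == 0)
        printf("%d ranks: MPI_Allreduce(64 B) avg %.2f usec\n",
               nprocs, usec);
    MPI_Finalize();
    return 0;
}
```

Run at increasing node counts, the same loop would trace out curves of the shape shown above.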
13. G-HPL (LINPACK)
[Figure: HPCC G-HPL performance (TFLOPS) vs. number of nodes (0-128), physical vs. virtual cluster.]
Performance degradation: 5.4-6.6%
Efficiency* on 128 nodes
– Physical: 90%
– Virtual: 84%
*) Rmax / Rpeak
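As a consistency check against the hardware spec (derived here, not stated on the slide): each node peaks at 2 × 10 × 2.8 GHz × 8 flops/cycle = 448 GFLOPS, so on 128 nodes

\[
R_{\text{peak}} = 128 \times 448\ \text{GFLOPS} \approx 57.3\ \text{TFLOPS},\qquad
R_{\text{max}} \approx 0.90\,R_{\text{peak}} \approx 51.6\ \text{TFLOPS (physical)},\quad
R_{\text{max}} \approx 0.84\,R_{\text{peak}} \approx 48.2\ \text{TFLOPS (virtual)}.
\]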
14. EP-STREAM and G-FFT
[Figure: HPCC EP-STREAM performance (GB/s) and G-FFT performance (GFLOPS) vs. number of nodes (0-128), physical vs. virtual cluster.]
The overheads are negligible.
– EP-STREAM: memory intensive, with no communication
– G-FFT: all-to-all communication with large messages
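EP-STREAM's insensitivity to virtualization follows from what it measures: every MPI process independently runs the STREAM kernels on node-local memory, so neither the interconnect nor interrupt virtualization is on the critical path. A minimal sketch of the triad kernel (the real benchmark also times copy, scale, and add, and reports the best of several repetitions):

```c
/* STREAM-style triad: streams three large arrays through memory with
 * no communication, so performance is set by local memory bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 25)   /* 3 arrays x 256 MiB each, far beyond the LLC */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];        /* 24 bytes moved per element */
    double t = omp_get_wtime() - t0;

    printf("triad bandwidth: %.2f GB/s\n", 24.0 * N / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```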
15. Graph500 (replicated-csc, scale 26)
[Figure: Graph500 performance (TEPS, log scale from 1.0E+07 to 1.0E+10) vs. number of nodes (0-64), physical vs. virtual cluster.]
Performance degradation: 2% (64 nodes)
Graph500 is a hybrid parallel program (MPI + OpenMP). We used a combination of 2 MPI processes and 10 OpenMP threads per node, i.e., one process per 10-core socket.
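A minimal sketch of that hybrid setup (illustrative, not Graph500's actual code): each rank initializes MPI with threading support and fans out into an OpenMP team, so 2 ranks × 10 threads fills the two sockets of an ASGC node.

```c
/* Hybrid MPI + OpenMP skeleton: a few ranks per node, each running an
 * OpenMP thread team for the compute phase. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the master thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Compute phase: all threads work on rank-local data. */
        #pragma omp master
        printf("rank %d running %d OpenMP threads\n",
               rank, omp_get_num_threads());
    }

    /* Communication phase would follow, funneled through the master. */
    MPI_Finalize();
    return 0;
}
```

With Open MPI this would be launched with OMP_NUM_THREADS=10 and two ranks per node.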
16. Findings
• PCI passthrough is effective in improving I/O performance; however, it still cannot reach the low communication latency of a physical cluster, due to virtual interrupt injection.
• VCPU pinning improves performance for HPC applications (see the sketch after this list).
• Almost all MPI collectives suffer from the scalability issue on the virtual cluster.
• The virtualization overhead has much less impact on application-level workloads than on microbenchmarks.
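For context, vCPU pinning binds each guest vCPU to a fixed host core so the host scheduler cannot migrate it between cores; in this stack it would be configured through KVM/libvirt or the CloudStack extensions rather than in application code. A minimal process-level analogue of the mechanism, using Linux sched_setaffinity():

```c
/* Pin the calling process to one core, the process-level analogue of
 * pinning a guest vCPU to a host core. Illustrative sketch only. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(2, &set);                    /* allow core 2 only */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d pinned to core 2\n", (int)getpid());
    /* From here on, no cross-core migrations (and no lost cache state). */
    return 0;
}
```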