Large-scale optimization strategies for typical HPC workloads include:
1) Building a powerful profiling tool to analyze application performance and identify bottlenecks such as inefficient instruction usage, memory-bandwidth saturation, and poor network utilization.
2) Harnessing state-of-the-art hardware, such as new CPU architectures, instruction sets, and accelerators, to maximize application performance.
3) Leveraging the latest algorithms and computational models that are better suited to large-scale parallelization and new hardware.
1. Large-scale optimization strategies for typical HPC workloads
Yu Liu, Inspur Group
The PASC19 Conference, 12th–14th June 2019, ETH Zurich, Switzerland
2. Who are we?
• No.3 Server Vendor Worldwide
• Leading System Share in TOP500 List
• Top AI GPU Server Market Share
• R&D contribution to OCP, ODCC, Open19
• The only vendor that can provide both Power (by Inspur Power Systems) and x86 infrastructure solutions
3. The challenge of large-scale optimization
➢ New HPC architectures
➢ Poor application scalability
➢ Low hardware utilization rate
[Figure: ideal vs. actual floating-point performance (GFlops, 0–8) across architectures (CPU, CPU+MIC, MPE+CPE, CPU+GPU) and instruction sets (x87, SSE, AVX), highlighting low efficiency]
4. How to be faster?
• Compute faster
• Data throughput faster
• Communicate faster
Interconnects: PCIe, NVLink, IB, OPA, Ethernet
8. WRF: a state-of-the-art atmospheric modeling system
Application characteristics:
• Very complex model
• Very old codes
• Big data input/output
• Computing intensive
• Large parallelism required
• Run-time sensitive
Optimization areas: computing, communication, I/O
Optimization levels: instruction, algorithm, architecture, model
Build a Powerful Profiling Tool
9. Target performance (run on 4,096 cores; before optimization: > 150 min in total)
• Pre-processing (WPS): 10 min
• WRF (including real): 90 min
• Post-processing: 20 min
Scheme: 330×336 and 605×449 grids; grid lengths 6 km and 1.5 km; 39 vertical levels
11. • I/O optimization
Analysis: three methods to improve I/O performance (a parallel-write sketch follows below):
– Lustre performance tuning: accelerate I/O speed
– Quilt I/O: asynchronous I/O
– PnetCDF: parallel I/O
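To illustrate the PnetCDF path, here is a minimal sketch in which every MPI rank writes its own slice of a field into one shared file. The file, dimension, and variable names are hypothetical, not taken from WRF.

```c
/* Minimal PnetCDF parallel-write sketch (hypothetical field/dim names).
 * Build: mpicc pnetcdf_demo.c -lpnetcdf    Run: mpiexec -n 4 ./a.out */
#include <mpi.h>
#include <pnetcdf.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset nx = 64;            /* points per rank */
    int ncid, dimid, varid;

    /* All ranks create the shared file collectively. */
    ncmpi_create(MPI_COMM_WORLD, "field.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", nx * nprocs, &dimid);
    ncmpi_def_var(ncid, "field", NC_FLOAT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each rank fills and writes its own contiguous slice. */
    float *buf = malloc(nx * sizeof(float));
    for (MPI_Offset i = 0; i < nx; i++) buf[i] = (float)rank;
    MPI_Offset start = rank * nx, count = nx;
    ncmpi_put_vara_float_all(ncid, varid, &start, &count, buf);

    ncmpi_close(ncid);
    free(buf);
    MPI_Finalize();
    return 0;
}
```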
[Figure: InfiniBand transmit/receive traffic (ib_XmitData_MB, ib_RcvData_MB) over time, before vs. after the I/O optimization]
Time before vs. after optimization:
• Reading grids, task allocation: 130 s → 75 s
• Wrfout: 175 s → 40 s → 1–2 s
• Nudge: 100 s → 24 s
Data taken from Inspur TEYE
12. • Network optimization
Analysis: MPI communication becomes the bottleneck when WRF runs on thousands of cores. The MPI+OpenMP hybrid mode is the best solution to reduce MPI inter-process communication overhead (a hybrid sketch follows below).
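A minimal sketch of the hybrid scheme (not WRF's actual code): fewer MPI ranks per node, with OpenMP threads doing the intra-node work, so far fewer messages cross the network. MPI_THREAD_FUNNELED is sufficient here because only the master thread makes MPI calls.

```c
/* Hybrid MPI+OpenMP sketch: threads share the rank's subdomain.
 * Build: mpicc -fopenmp hybrid.c    Run: OMP_NUM_THREADS=8 mpiexec -n 2 ./a.out */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the master thread communicates via MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double u[N];
    double local = 0.0;

    /* Threads cooperate on this rank's data: no intra-node MPI traffic. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        u[i] = (double)i * 1e-6;
        local += u[i];
    }

    /* Only one message per rank leaves the node. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f (threads per rank: %d)\n", global, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```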
[Figure: IB send and receive traffic (ib_XmitData_MB, ib_RcvData_MB), MPI-only vs. MPI+OpenMP]
Performance improvement: 26.9%
Data taken from Inspur TEYE
13. • Memory bandwidth optimization
Analysis: the MPI+OpenMP hybrid mode reduces not only network traffic but also the pressure on memory and cache. The comparison figure shows that memory-bandwidth usage is clearly reduced after optimization. (A bandwidth-measurement sketch follows below.)
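The bandwidth numbers here come from Inspur TEYE; where such a tool is not available, a STREAM-style triad gives a quick estimate of a node's sustained memory bandwidth. This is a minimal sketch, not the TEYE methodology.

```c
/* STREAM-style triad: rough sustained-memory-bandwidth estimate.
 * Build: cc -O2 -fopenmp triad.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 26)          /* 64M doubles per array, ~1.5 GB total */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad touches 3 arrays */
    double t1 = omp_get_wtime();

    /* Bytes moved: 3 arrays of N doubles (2 reads + 1 write). */
    double gb = 3.0 * N * sizeof(double) / 1e9;
    printf("triad bandwidth: %.1f GB/s\n", gb / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}
```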
[Figure: memory bandwidth (mem_total_bw_GB, 0–25 GB/s) over time, before vs. after optimization]
Data taken from Inspur TEYE
16. Harness the State-of-the-art Hardware
➢ Hardware and software matching
• Using new architectures
• Using new instruction sets
[Figure: scaling of LAMMPS on TH-2 (~24,000 cores), QE on TH-2 (~20,000 cores), and GTC on TH-2 (~400,000 cores); 64–4,096 cores, 2 CPUs vs. 1/2/3 MICs]
17. Customer in-house codes for a global-scale atmospheric modeling system
➢ JOB information:
Job: Global Weather Forecast; resolution: 0.125°; running time: < 3 hours
➢ Platform information:
OS / compiler: RHEL 7.2, Intel Compiler 2017
CPU: Xeon E5-2690 v3 (peak 1.0 TFlops DP); memory: DDR3 (136.5 GB/s); network: Intel OPA (100 Gbit/s)
18. Analyzing the elapsed time of the main functions in an atmosphere model (run on 8,192 cores)
The functions related to the Helmholtz equation account for above 50% of the elapsed time. (A timing sketch follows below.)
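One lightweight way to obtain such a breakdown, if a full profiler is unavailable, is to bracket candidate hotspots with timers. The sketch below uses MPI_Wtime and a hypothetical solver routine, not the customer's actual code.

```c
/* Coarse per-function timing sketch with MPI_Wtime (hypothetical solver). */
#include <mpi.h>
#include <stdio.h>

void helmholtz_solve(void) { /* stand-in for the real solver */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_total = MPI_Wtime();
    double t0 = MPI_Wtime();
    helmholtz_solve();
    double t_solver = MPI_Wtime() - t0;

    /* ... other model phases would be timed the same way ... */

    t_total = MPI_Wtime() - t_total;
    /* Report the slowest rank: load imbalance hides in the maximum. */
    double max_solver;
    MPI_Reduce(&t_solver, &max_solver, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("solver: %.3f s of %.3f s total (max over ranks)\n",
               max_solver, t_total);
    MPI_Finalize();
    return 0;
}
```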
Harness the State-of-the-art Hardware
21. Results:
• Total time: 1073 s (original) → 762 s (optimized), a gain of about 40%
• Model_integrate time: 541 s → 335 s, a gain of about 61%
• Number of iterations: 32 → 8
The GMRM algorithm significantly reduces the number of floating-point operations by requiring fewer iterations. Calling a general function library also improves model performance (see the sketch below). The overall optimized efficiency gain is up to 40%.
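As an illustration of the "general function library" point: dense kernels written as hand-rolled loops are often far slower than the vendor-tuned BLAS equivalent. A minimal sketch follows; the matrix size is arbitrary and the model's actual kernels are not shown in the slides.

```c
/* Replacing a naive triple loop with a tuned BLAS call (cblas_dgemm).
 * Build: cc -O2 gemm.c -lopenblas   (or link Intel MKL) */
#include <cblas.h>
#include <stdlib.h>

#define N 512

int main(void) {
    double *A = calloc((size_t)N * N, sizeof(double));
    double *B = calloc((size_t)N * N, sizeof(double));
    double *C = calloc((size_t)N * N, sizeof(double));

    /* Naive version, typical of old hand-written model code:
     * for (i..) for (j..) for (k..) C[i*N+j] += A[i*N+k] * B[k*N+j]; */

    /* Library version: C = 1.0 * A * B + 0.0 * C, row-major layout. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    free(A); free(B); free(C);
    return 0;
}
```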
22. Customer case: VASP on NiMnSb
➢ JOB information:
Ion system: NiMnSb (289 ions); basis set: ~140 million; bands: 2176; k-points: 16; algorithm: CG / RMM-DIIS / HSE
➢ Environment information: RHEL 7.5, Intel Compiler 2019, VASP 5.4.4
➢ Platform information:
CPU: Xeon 6142 (peak 2.66 TFlops DP); GPU: V100 / P100 (7.8 / 5.3 TFlops DP); memory: DDR4 (256 GB/s); network: EDR (100 Gbit/s)
24. After optimization, the FLOPS increased significantly; the FLOPS were mainly provided by the AVX-512 instruction sets.
Data taken from Inspur TEYE. Run on 512 cores.
25. Loop time: 403 s (original) → 123 s (optimized), a gain of 228%.
With the help of the AVX-512 instruction set, the overall optimized efficiency gain is up to 228% (a vectorization sketch follows below).
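For illustration, a hand-vectorized kernel of the kind that benefits from AVX-512: eight double-precision lanes per instruction plus fused multiply-add. This is a generic daxpy-style sketch, not VASP's actual loop; in practice the Intel compiler can often generate such code automatically with -O3 -xCORE-AVX512.

```c
/* AVX-512 daxpy-style kernel: y += a * x, 8 doubles per iteration.
 * Build: icc -O3 -xCORE-AVX512 daxpy.c   (or gcc -O3 -mavx512f) */
#include <immintrin.h>
#include <stdio.h>

#define N 1024                       /* multiple of 8 for simplicity */

static double x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    const __m512d a = _mm512_set1_pd(3.0);
    for (int i = 0; i < N; i += 8) {
        __m512d xv = _mm512_loadu_pd(&x[i]);
        __m512d yv = _mm512_loadu_pd(&y[i]);
        /* fused multiply-add: yv = a * xv + yv */
        yv = _mm512_fmadd_pd(a, xv, yv);
        _mm512_storeu_pd(&y[i], yv);
    }

    printf("y[0] = %f\n", y[0]);     /* expect 5.0 */
    return 0;
}
```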
[Figure: relative performance on Xeon E5-2650, Xeon E5-2680 v3, and Xeon 6142 (run on 512 cores), with generation-over-generation gains of 90% and 70%]
26. Non-persistent network communication is conducive to improving application scalability (a non-blocking sketch follows below).
Data taken from Inspur TEYE. Run on 512 cores.
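The slides do not show the code change itself; as a generic illustration of non-persistent, non-blocking communication that lets computation overlap the transfer, consider this MPI_Isend/MPI_Irecv sketch in which ranks exchange a buffer in a ring (the buffer size is arbitrary).

```c
/* Non-blocking ring exchange: communication overlaps local work.
 * Build: mpicc ring.c    Run: mpiexec -n 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

#define N 4096

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double sendbuf[N], recvbuf[N];
    for (int i = 0; i < N; i++) sendbuf[i] = rank;

    int right = (rank + 1) % size, left = (rank + size - 1) % size;
    MPI_Request reqs[2];

    /* Post both transfers, then compute while they progress. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    double local = 0.0;                 /* overlapped local work */
    for (int i = 0; i < N; i++) local += sendbuf[i] * 0.5;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    if (rank == 0) printf("got %f from rank %d\n", recvbuf[0], left);

    MPI_Finalize();
    return 0;
}
```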
28. Leverage the Latest Model Algorithms
➢ New algorithms & models: large-scale computation sits at the intersection of application science and computing science
• A new physical model matches the computing architecture
• A new computing architecture is driven by the application model
29. Customer in-house codes for global climate simulation
❖ Research background:
Climate models need to better integrate the biological, chemical, and physical components of the Earth system. The coupler connects the various physical component models, such as atmosphere, ocean, and land. It receives two-dimensional boundary data from each component model, integrates the collected data into the appropriate calculations, and then transmits the required data back to each component.
❖ Performance before optimization:
The higher the spatial and temporal resolution, the more computing resources are needed: resolution × 2 (horizontal & vertical) → computing time × 24.
30. Leverage the Latest Model Algorithms (run on 1,212 cores)
[Diagram: two process layouts of the coupled components (cpl, atm, lnd, ice, ocn)]
31. Leverage the Latest Model Algorithms
[Diagram: model restructuring. Before: an atm driver controls the physical process and dynamical frame and couples land, atm-ice, and atm-ocean. After: a cpl driver couples atm, lnd, ice, and ocn directly.]
➢ Optimized the atm module
➢ Optimized cpl to change the model structure
➢ Optimized the MPI collective functions (see the sketch below)
The atm module controls the physical process of the whole model, so the other modules must wait to be called by atm. The communication response time (MPI_Barrier in ocn, MPI_Bcast in lnd and ice) is too long, which leads to low model performance.
[Figure: per-module performance gains after optimization: 190%, 22%, 120%, 130%, 150%]
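The slides do not include the modified code. As a generic illustration of reducing collective wait time, a blocking broadcast can be replaced by its non-blocking counterpart so a component keeps computing until the data is actually needed (MPI_Ibcast is standard MPI-3; the component names here are illustrative).

```c
/* Overlapping a broadcast with component work using MPI_Ibcast (MPI-3). */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double boundary[N];              /* data coming from cpl/atm  */
    if (rank == 0)
        for (int i = 0; i < N; i++) boundary[i] = 1.0;

    /* Post the broadcast instead of blocking in MPI_Bcast... */
    MPI_Request req;
    MPI_Ibcast(boundary, N, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* ...and let the component (e.g. lnd/ice) advance work that does
     * not depend on the incoming boundary data. */
    double local = 0.0;
    for (int i = 0; i < N; i++) local += (double)i;

    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* boundary[] is now valid  */
    if (rank == 0) printf("local=%f boundary[0]=%f\n", local, boundary[0]);

    MPI_Finalize();
    return 0;
}
```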
32. Summary
• Take run-time performance data with powerful profiling tools, and get the application's features and bottlenecks by analyzing the performance data for platform optimization.
• Harness state-of-the-art hardware to maximize performance by using new technologies.
• Pursue interdisciplinary innovation in new models and algorithms to optimize application codes: the best way to realize the highest performance for large-scale applications.
• Together, these realize better large-scale computing.
33. ASC Student Supercomputer Challenge
ASC provides a stage for students to practice large-scale HPC optimization:
• ASC14 (Tianhe-2): 3D elastic wave modeling for petroleum prospecting, run on TH-2 with 200K cores
• ASC17 (Sunway TaihuLight): Gordon Bell Prize nominated application MASNUM, run on Sunway TaihuLight with 10K cores