Large-scale optimization strategies for typical HPC workloads include:
1) Building a powerful profiling tool to analyze application performance and identify bottlenecks such as inefficient instruction usage, memory-bandwidth saturation, and poor network utilization.
2) Harnessing state-of-the-art hardware, such as new CPU architectures, instruction sets, and accelerators, to maximize application performance.
3) Leveraging the latest algorithms and computational models that are better suited to large-scale parallelization and new hardware.
1. Large-scale optimization strategies for typical HPC workloads
Yu Liu, Inspur Group
The PASC19 Conference, 12th–14th June 2019, ETH Zurich, Switzerland
2. Who are we?
• No.3 Server Vendor Worldwide
• Leading System Share in TOP500 List
• Top AI GPU Server Market Share
• R&D contribution to OCP, ODCC, Open19
• The only vendor that can provide both Power (by Inspur Power Systems) and x86 infrastructure solutions
3. The challenge of large-scale optimization
➢ New HPC architectures
➢ Poor application scalability
➢ Low hardware utilization rate
[Figure: ideal vs. actual floating-point performance (GFlops, 0–8) across architectures (CPU, CPU+MIC, MPE+CPE, CPU+GPU) and instruction sets (x87, SSE, AVX), highlighting low efficiency]
4. How to be faster?
• Compute faster
• Data throughput faster
• Communicate faster
Interconnects: PCIe, NVLink, IB, OPA, Ethernet
8. WRF: a state-of-the-art atmospheric modeling system
Application characteristics:
• Very complex model
• Very old codes
• Big data input/output
• Computing intensive
• Large parallelism required
• Run-time sensitive
Optimization areas: computing, communication, I/O
Optimization levels: instruction, algorithm, architecture, model
Build a Powerful Profiling Tool
9. Target performance (run on 4,096 cores; before optimization: > 150 min in total)
• Pre-processing (WPS): 10 min
• WRF (including real): 90 min
• Post-processing: 20 min
Scheme: 330×336 and 605×449 grids; grid lengths 6 km and 1.5 km; 39 vertical levels
11. • I/O optimization
Analysis: three methods to improve I/O performance (a parallel-write sketch follows below):
– Lustre performance tuning: accelerate I/O speed
– Quilt I/O: asynchronous I/O
– PnetCDF: parallel I/O
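To illustrate the PnetCDF path, here is a minimal sketch in which every MPI rank writes its own slice of a field into one shared file. The file, dimension, and variable names are hypothetical, not taken from WRF.

```c
/* Minimal PnetCDF parallel-write sketch (hypothetical field/dim names).
 * Build: mpicc pnetcdf_demo.c -lpnetcdf    Run: mpiexec -n 4 ./a.out */
#include <mpi.h>
#include <pnetcdf.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset nx = 64;            /* points per rank */
    int ncid, dimid, varid;

    /* All ranks create the shared file collectively. */
    ncmpi_create(MPI_COMM_WORLD, "field.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", nx * nprocs, &dimid);
    ncmpi_def_var(ncid, "field", NC_FLOAT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each rank fills and writes its own contiguous slice. */
    float *buf = malloc(nx * sizeof(float));
    for (MPI_Offset i = 0; i < nx; i++) buf[i] = (float)rank;
    MPI_Offset start = rank * nx, count = nx;
    ncmpi_put_vara_float_all(ncid, varid, &start, &count, buf);

    ncmpi_close(ncid);
    free(buf);
    MPI_Finalize();
    return 0;
}
```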
[Figure: InfiniBand transmit/receive traffic (ib_XmitData_MB, ib_RcvData_MB) over time, before vs. after the I/O optimization]
Time before vs. after optimization:
• Reading grids, task allocation: 130 s → 75 s
• Wrfout: 175 s → 40 s → 1–2 s
• Nudge: 100 s → 24 s
Data taken from Inspur TEYE
12. • Network optimization
Analysis: MPI communication becomes the bottleneck when WRF runs on thousands of cores. The MPI+OpenMP hybrid mode is the best solution to reduce MPI inter-process communication overhead (a hybrid sketch follows below).
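A minimal sketch of the hybrid scheme (not WRF's actual code): fewer MPI ranks per node, with OpenMP threads doing the intra-node work, so far fewer messages cross the network. MPI_THREAD_FUNNELED is sufficient here because only the master thread makes MPI calls.

```c
/* Hybrid MPI+OpenMP sketch: threads share the rank's subdomain.
 * Build: mpicc -fopenmp hybrid.c    Run: OMP_NUM_THREADS=8 mpiexec -n 2 ./a.out */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the master thread communicates via MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double u[N];
    double local = 0.0;

    /* Threads cooperate on this rank's data: no intra-node MPI traffic. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        u[i] = (double)i * 1e-6;
        local += u[i];
    }

    /* Only one message per rank leaves the node. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f (threads per rank: %d)\n", global, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```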
[Figure: IB send and receive traffic (ib_XmitData_MB, ib_RcvData_MB), MPI-only vs. MPI+OpenMP]
Performance improvement: 26.9%
Data taken from Inspur TEYE
13. • Memory bandwidth optimization
Analysis: the MPI+OpenMP hybrid mode reduces not only network traffic but also the pressure on memory and cache. The comparison figure shows that memory-bandwidth usage is clearly reduced after optimization. (A bandwidth-measurement sketch follows below.)
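The bandwidth numbers here come from Inspur TEYE; where such a tool is not available, a STREAM-style triad gives a quick estimate of a node's sustained memory bandwidth. This is a minimal sketch, not the TEYE methodology.

```c
/* STREAM-style triad: rough sustained-memory-bandwidth estimate.
 * Build: cc -O2 -fopenmp triad.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 26)          /* 64M doubles per array, ~1.5 GB total */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad touches 3 arrays */
    double t1 = omp_get_wtime();

    /* Bytes moved: 3 arrays of N doubles (2 reads + 1 write). */
    double gb = 3.0 * N * sizeof(double) / 1e9;
    printf("triad bandwidth: %.1f GB/s\n", gb / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}
```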
[Figure: memory bandwidth (mem_total_bw_GB, 0–25 GB/s) over time, before vs. after optimization]
Data taken from Inspur TEYE
16. Harness the State-of-the-art Hardware
➢ Hardware and software matching
• Using new architectures
• Using new instruction sets
[Figure: scaling of LAMMPS on TH-2 (~24,000 cores), QE on TH-2 (~20,000 cores), and GTC on TH-2 (~400,000 cores); 64–4,096 cores, 2 CPUs vs. 1/2/3 MICs]
17. Customer in-house codes for a global-scale atmospheric modeling system
➢ JOB information:
Job: Global Weather Forecast; resolution: 0.125°; running time: < 3 hours
➢ Platform information:
OS / compiler: RHEL 7.2, Intel Compiler 2017
CPU: Xeon E5-2690 v3 (peak 1.0 TFlops DP); memory: DDR3 (136.5 GB/s); network: Intel OPA (100 Gbit/s)
18. Analyzing the elapsed time of the main functions in an atmosphere model (run on 8,192 cores)
The functions related to the Helmholtz equation account for above 50% of the elapsed time. (A timing sketch follows below.)
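One lightweight way to obtain such a breakdown, if a full profiler is unavailable, is to bracket candidate hotspots with timers. The sketch below uses MPI_Wtime and a hypothetical solver routine, not the customer's actual code.

```c
/* Coarse per-function timing sketch with MPI_Wtime (hypothetical solver). */
#include <mpi.h>
#include <stdio.h>

void helmholtz_solve(void) { /* stand-in for the real solver */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_total = MPI_Wtime();
    double t0 = MPI_Wtime();
    helmholtz_solve();
    double t_solver = MPI_Wtime() - t0;

    /* ... other model phases would be timed the same way ... */

    t_total = MPI_Wtime() - t_total;
    /* Report the slowest rank: load imbalance hides in the maximum. */
    double max_solver;
    MPI_Reduce(&t_solver, &max_solver, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("solver: %.3f s of %.3f s total (max over ranks)\n",
               max_solver, t_total);
    MPI_Finalize();
    return 0;
}
```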
Harness the State-of-the-art Hardware
21. Results:
• Total time: 1073 s (original) → 762 s (optimized), a gain of about 40%
• Model_integrate time: 541 s → 335 s, a gain of about 61%
• Number of iterations: 32 → 8
The GMRM algorithm significantly reduces the number of floating-point operations by requiring fewer iterations. Calling a general function library also improves model performance (see the sketch below). The overall optimized efficiency gain is up to 40%.
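As an illustration of the "general function library" point: dense kernels written as hand-rolled loops are often far slower than the vendor-tuned BLAS equivalent. A minimal sketch follows; the matrix size is arbitrary and the model's actual kernels are not shown in the slides.

```c
/* Replacing a naive triple loop with a tuned BLAS call (cblas_dgemm).
 * Build: cc -O2 gemm.c -lopenblas   (or link Intel MKL) */
#include <cblas.h>
#include <stdlib.h>

#define N 512

int main(void) {
    double *A = calloc((size_t)N * N, sizeof(double));
    double *B = calloc((size_t)N * N, sizeof(double));
    double *C = calloc((size_t)N * N, sizeof(double));

    /* Naive version, typical of old hand-written model code:
     * for (i..) for (j..) for (k..) C[i*N+j] += A[i*N+k] * B[k*N+j]; */

    /* Library version: C = 1.0 * A * B + 0.0 * C, row-major layout. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    free(A); free(B); free(C);
    return 0;
}
```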
22. Customer case: VASP on NiMnSb
➢ JOB information:
Ion system: NiMnSb (289 ions); basis set: ~140 million; bands: 2176; k-points: 16; algorithm: CG / RMM-DIIS / HSE
➢ Environment information: RHEL 7.5, Intel Compiler 2019, VASP 5.4.4
➢ Platform information:
CPU: Xeon 6142 (peak 2.66 TFlops DP); GPU: V100 / P100 (7.8 / 5.3 TFlops DP); memory: DDR4 (256 GB/s); network: EDR (100 Gbit/s)
24. After optimization, the FLOPS increased significantly; the FLOPS were mainly provided by the AVX-512 instruction sets.
Data taken from Inspur TEYE. Run on 512 cores.
25. Loop time: 403 s (original) → 123 s (optimized), a gain of 228%.
With the help of the AVX-512 instruction set, the overall optimized efficiency gain is up to 228% (a vectorization sketch follows below).
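For illustration, a hand-vectorized kernel of the kind that benefits from AVX-512: eight double-precision lanes per instruction plus fused multiply-add. This is a generic daxpy-style sketch, not VASP's actual loop; in practice the Intel compiler can often generate such code automatically with -O3 -xCORE-AVX512.

```c
/* AVX-512 daxpy-style kernel: y += a * x, 8 doubles per iteration.
 * Build: icc -O3 -xCORE-AVX512 daxpy.c   (or gcc -O3 -mavx512f) */
#include <immintrin.h>
#include <stdio.h>

#define N 1024                       /* multiple of 8 for simplicity */

static double x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    const __m512d a = _mm512_set1_pd(3.0);
    for (int i = 0; i < N; i += 8) {
        __m512d xv = _mm512_loadu_pd(&x[i]);
        __m512d yv = _mm512_loadu_pd(&y[i]);
        /* fused multiply-add: yv = a * xv + yv */
        yv = _mm512_fmadd_pd(a, xv, yv);
        _mm512_storeu_pd(&y[i], yv);
    }

    printf("y[0] = %f\n", y[0]);     /* expect 5.0 */
    return 0;
}
```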
[Figure: relative performance on Xeon E5-2650, Xeon E5-2680 v3, and Xeon 6142 (run on 512 cores), with generation-over-generation gains of 90% and 70%]
26. Non-persistent network communication is conducive to improving application scalability (a non-blocking sketch follows below).
Data taken from Inspur TEYE. Run on 512 cores.
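The slides do not show the code change itself; as a generic illustration of non-persistent, non-blocking communication that lets computation overlap the transfer, consider this MPI_Isend/MPI_Irecv sketch in which ranks exchange a buffer in a ring (the buffer size is arbitrary).

```c
/* Non-blocking ring exchange: communication overlaps local work.
 * Build: mpicc ring.c    Run: mpiexec -n 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

#define N 4096

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double sendbuf[N], recvbuf[N];
    for (int i = 0; i < N; i++) sendbuf[i] = rank;

    int right = (rank + 1) % size, left = (rank + size - 1) % size;
    MPI_Request reqs[2];

    /* Post both transfers, then compute while they progress. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    double local = 0.0;                 /* overlapped local work */
    for (int i = 0; i < N; i++) local += sendbuf[i] * 0.5;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    if (rank == 0) printf("got %f from rank %d\n", recvbuf[0], left);

    MPI_Finalize();
    return 0;
}
```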
28. Leverage the Latest Model Algorithms
➢ New algorithms & models: large-scale computation sits at the intersection of application science and computing science
• A new physical model matches the computing architecture
• A new computing architecture is driven by the application model
29. Customer in-house codes for global climate simulation
❖ Research background:
Climate models need to better integrate the biological, chemical, and physical components of the Earth system. The coupler connects the various physical component models, such as atmosphere, ocean, and land. It receives two-dimensional boundary data from each component model, integrates the collected data into the appropriate calculations, and then transmits the required data back to each component.
❖ Performance before optimization:
The higher the spatial and temporal resolution, the more computing resources are needed: resolution × 2 (horizontal & vertical) → computing time × 24.
30. Leverage the Latest Model Algorithms (run on 1,212 cores)
[Diagram: two process layouts of the coupled components (cpl, atm, lnd, ice, ocn)]
31. Leverage the Latest Model Algorithms
[Diagram: model restructuring. Before: an atm driver controls the physical process and dynamical frame and couples land, atm-ice, and atm-ocean. After: a cpl driver couples atm, lnd, ice, and ocn directly.]
➢ Optimized the atm module
➢ Optimized cpl to change the model structure
➢ Optimized the MPI collective functions (see the sketch below)
The atm module controls the physical process of the whole model, so the other modules must wait to be called by atm. The communication response time (MPI_Barrier in ocn, MPI_Bcast in lnd and ice) is too long, which leads to low model performance.
[Figure: per-module performance gains after optimization: 190%, 22%, 120%, 130%, 150%]
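The slides do not include the modified code. As a generic illustration of reducing collective wait time, a blocking broadcast can be replaced by its non-blocking counterpart so a component keeps computing until the data is actually needed (MPI_Ibcast is standard MPI-3; the component names here are illustrative).

```c
/* Overlapping a broadcast with component work using MPI_Ibcast (MPI-3). */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double boundary[N];              /* data coming from cpl/atm  */
    if (rank == 0)
        for (int i = 0; i < N; i++) boundary[i] = 1.0;

    /* Post the broadcast instead of blocking in MPI_Bcast... */
    MPI_Request req;
    MPI_Ibcast(boundary, N, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* ...and let the component (e.g. lnd/ice) advance work that does
     * not depend on the incoming boundary data. */
    double local = 0.0;
    for (int i = 0; i < N; i++) local += (double)i;

    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* boundary[] is now valid  */
    if (rank == 0) printf("local=%f boundary[0]=%f\n", local, boundary[0]);

    MPI_Finalize();
    return 0;
}
```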
32. Summary
• Take run-time performance data with powerful profiling tools, and get the application's features and bottlenecks by analyzing the performance data for platform optimization.
• Harness state-of-the-art hardware to maximize performance by using new technologies.
• Pursue interdisciplinary innovation in new models and algorithms to optimize application codes: the best way to realize the highest performance for large-scale applications.
• Together, these realize better large-scale computing.
33. ASC Student Supercomputer Challenge
ASC provides a stage for students to practice large-scale HPC optimization:
• ASC14 (Tianhe-2): 3D elastic wave modeling for petroleum prospecting, run on TH-2 with 200K cores
• ASC17 (Sunway TaihuLight): Gordon Bell Prize nominated application MASNUM, run on Sunway TaihuLight with 10K cores