Large-scale optimization strategies for
typical HPC workloads
Inspur Group
The PASC19 Conference
12th-14th June 2019, ETH Zurich, Switzerland
Yu Liu
Who are we?
• No.3 Server Vendor Worldwide
• Leading System Share in TOP500 List
• Top AI GPU Server Market Share
• R&D contribution to OCP, ODCC, Open19
• The only vendor that can provide both
Power (by Inspur Power Systems) and
x86 infrastructure solutions.
The challenge of large-scale optimization
➢ New HPC architectures
➢ Poor application scalability
➢ Low hardware utilization rate
[Figures: evolution of HPC node architectures (CPU, CPU+MIC, MPE+CPE, CPU+GPU), and ideal vs. actual GFlops per instruction set (x87, SSE, AVX), showing low efficiency in practice.]
How to be faster?
• Compute faster
• Data throughput faster
• Communicate faster
(interconnects: PCIe, NVLink, IB, OPA, Ethernet)
Our strategies
Large-scale
Optimization
Build a Powerful Profiling Tool
Harness the State-of-the-art Hardware
Leverage the Latest Model Algorithms
Sharp tools make good work
Build a Powerful Profiling Tool
Compute (a minimal sampling sketch follows this list):
➢ user%, sys%, iowait%
➢ SSE/AVX/AVX512 GFlops
➢ Vectorization rate
➢ Clock cycles per instruction
Data throughput:
➢ Memory bandwidth
➢ PCIe bandwidth
➢ NVLink bandwidth
➢ IO bandwidth
➢ IOPS
Communicate:
➢ IB/OPA bandwidth
➢ Ethernet bandwidth
➢ Message size
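The Compute metrics above (user%, sys%, iowait%) can be sampled directly from /proc/stat on Linux; the minimal C sketch below shows the idea. It is a generic illustration, not the Inspur TEYE implementation, and it ignores the irq/softirq/steal fields for brevity.

```c
/* Minimal sketch: sample user%, sys%, iowait% from /proc/stat on Linux.
 * Generic illustration only, not the Inspur TEYE implementation. */
#include <stdio.h>
#include <unistd.h>

typedef struct { unsigned long long user, nice, sys, idle, iowait; } cpu_t;

static int read_cpu(cpu_t *c) {
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return -1;
    int n = fscanf(f, "cpu %llu %llu %llu %llu %llu",
                   &c->user, &c->nice, &c->sys, &c->idle, &c->iowait);
    fclose(f);
    return n == 5 ? 0 : -1;
}

int main(void) {
    cpu_t a, b;
    if (read_cpu(&a)) return 1;
    sleep(1);                       /* sampling interval */
    if (read_cpu(&b)) return 1;
    /* approximate total: first five counters only */
    double total = (double)((b.user - a.user) + (b.nice - a.nice) +
                            (b.sys - a.sys) + (b.idle - a.idle) +
                            (b.iowait - a.iowait));
    printf("user %.1f%%  sys %.1f%%  iowait %.1f%%\n",
           100.0 * (b.user - a.user) / total,
           100.0 * (b.sys  - a.sys)  / total,
           100.0 * (b.iowait - a.iowait) / total);
    return 0;
}
```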
WRF: a state-of-the-art atmospheric modeling system
• Very complex model
• Very old code base
• Big data input/output
• Compute intensive
• Large parallelism required
• Run-time sensitive
Optimization targets: computing, communication, and I/O, at the instruction, algorithm, architecture, and model levels.
Build a Powerful Profiling Tool
Target performance: WPS 10 min; WRF (including real) 90 min; post-processing 20 min. Before optimization: > 150 min.
Scheme: number of grids 330x336 / 605x449; grid length 6 km / 1.5 km; vertical levels 39.
Run on 4096 cores
Build a Powerful Profiling Tool
[Figure: GFlops over the run, broken down by instruction type: Total_DP_GFlops, Total_SP_GFlops, X87_GFlops, SSE_DP_Packed_GFlops, SSE_DP_Scalar_GFlops, AVX_DP_Packed_GFlops, SSE_SP_Packed_GFlops, AVX_SP_Packed_GFlops.]
Analysis:
These figures show that WRF mainly uses single-precision floating point and is not a floating-point-intensive application. Besides, WRF is not highly optimized for the AVX instructions.
Run phases visible in the profile: reading nested grids and task allocation; numerical calculations and differential-equation solving; I/O (writing wrfout); nudging.
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
• IO Optimization
Analysis: three methods to improve IO performance (see the sketch below)
- Lustre tuning: accelerate IO speed
- Quilt IO: asynchronous IO
- PnetCDF: parallel IO
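PnetCDF replaces serial writes funneled through rank 0 with collective parallel writes. The sketch below illustrates the same idea with plain MPI-IO rather than WRF's actual PnetCDF path: every rank writes its own slab of a field in one collective call, which the MPI-IO layer can aggregate for Lustre.

```c
/* Sketch of parallel output via MPI-IO collective writes. Illustrates the
 * idea behind PnetCDF/quilting; this is not WRF source code. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int local_n = 1 << 20;               /* this rank's slab of the field */
    float *slab = malloc(local_n * sizeof(float));
    for (int i = 0; i < local_n; i++) slab[i] = (float)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "wrfout_parallel.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* each rank writes at its own offset; _all makes the write collective,
     * so the MPI-IO layer can aggregate requests for the file system */
    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(float);
    MPI_File_write_at_all(fh, offset, slab, local_n, MPI_FLOAT,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(slab);
    MPI_Finalize();
    return 0;
}
```

On the Lustre side, striping the output directory over more OSTs (for example with lfs setstripe -c) lets such collective writes hit several storage servers in parallel.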
[Figures: IB send/receive traffic (ib_XmitData_MB, ib_RcvData_MB) before and after the IO optimization.]
Phase                            Before    After
reading grids, task allocation   130s      75s
wrfout                           40s       1-2s
nudge                            100s      24s
Total time saving: 175s
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
• Network Optimization
Analysis:
MPI communication becomes the bottleneck when WRF runs on thousands of cores; the MPI+OpenMP hybrid mode is the best solution to reduce MPI communication overhead (see the sketch below).
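A minimal skeleton of the hybrid layout, shown below as a generic C example rather than the WRF build itself (WRF enables it through its dm+sm configuration): fewer MPI ranks per node, OpenMP threads inside each rank, so inter-process communication volume drops.

```c
/* Hybrid MPI+OpenMP skeleton: fewer MPI ranks, threads fill each node.
 * Generic sketch; compile e.g. with: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* FUNNELED: only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)         /* thread-parallel compute */
        local_sum += 1.0 / (1.0 + i);

    double global_sum = 0.0;                  /* rank-level communication */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (threads per rank: %d)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```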
[Figures: IB send and receive traffic (ib_XmitData_MB, ib_RcvData_MB), MPI-only vs. MPI+OpenMP.]
Performance
↑26.9%
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
• Memory bandwidth optimization
Analysis:
The MPI+OpenMP hybrid mode reduces not only network traffic but also memory and cache pressure. The comparison figures show that memory bandwidth usage is clearly reduced after optimization.
[Figures: memory bandwidth (mem_total_bw_GB) before and after optimization, both on a 0-25 GB/s scale.]
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
• Performance improvement
[Figures: IB send/receive traffic (ib_XmitData_MB, ib_RcvData_MB) at three stages: baseline, after I/O optimization, and after network optimization.]
Analysis:
IO and network optimization together improve the total performance by nearly 200%.
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
Our strategies
Large-scale
Optimization
Build a Powerful Profiling Tool
Harness the State-of-the-art Hardware
Leverage the Latest Model Algorithms
Harness the State-of-the-art Hardware
➢ Hardware and software matching
• Using new architectures
• Using new instruction sets
Examples: LAMMPS on TH-2 (~24,000 cores), QE on TH-2 (~20,000 cores), GTC on TH-2 (~400,000 cores).
[Figure: speedup vs. core count (64 to 4096) for 2 CPUs, 1 MIC, 2 MICs, and 3 MICs configurations.]
Customer in-house codes for a global-scale atmospheric modeling system
➢ JOB information:
  JOB: Global Weather Forecast   RESOLUTION: 0.125°   RUNNING TIME: < 3 hours
➢ Platform information (RHEL 7.2, Intel Compiler 2017):
         CPU               Memory        Network
  Model  Xeon E5-2690v3    DDR3          Intel OPA
  Peak   1.0 TFlops DP     136.5 GB/s    100 Gbit/s
Harness the State-of-the-art Hardware
Analyzing the elapsed time of the main functions in the atmosphere model: the routines related to the Helmholtz equation account for above 50% of the run time.
Run on 8192 cores
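For reference, the Helmholtz equation in its standard continuous form is shown below; the model's own discretized operator differs in detail.

    \nabla^2 u(\mathbf{x}) + k^2\, u(\mathbf{x}) = f(\mathbf{x})

where u is the unknown field, k the wavenumber, and f the source term.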
Harness the State-of-the-art Hardware
Before optimization: flops were mainly provided by the SSE instruction set. The SSE vectorization rate is not high enough, and the AVX instruction set has not been used.
Data taken from Inspur TEYE
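One common way to close this gap on the Haswell platform above, assuming the Intel toolchain listed, is to rebuild the hot kernels with architecture-specific flags (for example -xCORE-AVX2) and make the loops themselves auto-vectorizable. A minimal, hypothetical kernel sketch, not the customer's code:

```c
/* Sketch of a vectorization-friendly kernel. Compile e.g. with:
 *   icc -O3 -std=c99 -xCORE-AVX2 -qopenmp-simd -qopt-report=5 kernel.c
 * (the flags shown are standard Intel-compiler options; the model's real
 *  kernels and build system are of course more involved) */
#include <stddef.h>

/* restrict tells the compiler the arrays do not alias, which is often
 * what blocks auto-vectorization in old C/Fortran-style codes */
void axpy(size_t n, double a,
          const double *restrict x, double *restrict y)
{
    #pragma omp simd                 /* explicitly request SIMD code */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```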
Harness the State-of-the-art Hardware
After optimization: the flops have increased significantly and are now mainly provided by the AVX instruction set.
Data taken from Inspur TEYE
Harness the State-of-the-art Hardware
                        Original    Optimized
Total time              1073        762         (speedup ~40%)
Model_integrate time    541         335         (speedup ~61%)
Number of iterations    32          8
The GMRM algorithm significantly reduces the number of floating-point operations by requiring fewer iterations. Calling a general-purpose function library also improves model performance. The overall optimization gain is up to 40%.
Harness the State-of-the-art Hardware
➢ JOB information:
  ION: NiMnSb (289)   Basesets: ~140 million   Bands: 2176   K-Points: 16   AG: CG/RMM-DIIS/HSE
➢ Environment information: RHEL 7.5, Intel Compiler 2019, VASP 5.4.4
➢ Platform information:
         CPU               GPU                    Memory      Network
  Model  Xeon 6142         V100 / P100            DDR4        EDR
  Peak   2.66 TFlops DP    7.8 / 5.3 TFlops DP    256 GB/s    100 Gbit/s
Harness the State-of-the-art Hardware
Before optimization: flops were mainly provided by the SSE instruction set. The SSE vectorization rate is high enough, but the AVX/AVX512 instruction sets have not been used.
Data taken from Inspur TEYE. Run on 512 cores.
Harness the State-of-the-art Hardware
After optimization: the flops have increased significantly and are now mainly provided by the AVX512 instruction set.
Data taken from Inspur TEYE. Run on 512 cores.
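Where compiler flags alone (for example -xCORE-AVX512 with -qopt-zmm-usage=high on Skylake-SP) are not enough, hot kernels can also be written with AVX-512 intrinsics. A generic sketch of a double-precision dot product, not VASP source code:

```c
/* Minimal AVX-512 intrinsics sketch (generic illustration, not VASP code).
 * Build e.g. with: icc -O3 -std=c99 -xCORE-AVX512 dot512.c */
#include <immintrin.h>
#include <stddef.h>

/* double-precision dot product, 8 elements per 512-bit FMA */
double dot_avx512(size_t n, const double *restrict x, const double *restrict y)
{
    __m512d acc = _mm512_setzero_pd();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m512d vx = _mm512_loadu_pd(x + i);
        __m512d vy = _mm512_loadu_pd(y + i);
        acc = _mm512_fmadd_pd(vx, vy, acc);   /* acc += vx * vy */
    }
    double sum = _mm512_reduce_add_pd(acc);   /* horizontal sum of 8 lanes */
    for (; i < n; i++)                        /* scalar remainder */
        sum += x[i] * y[i];
    return sum;
}
```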
Harness the State-of-the-art Hardware
Loop time: original 403 vs. optimized 123 (+228%).
With the help of the AVX512 instruction set, the overall performance improvement is up to 228%.
[Figure: relative performance on Xeon E5-2650, Xeon E5-2680v3, and Xeon 6142 (scale 0 to 3.5), with generation-over-generation gains of 90% and 70%.]
Run on 512 cores
Harness the State-of-the-art Hardware
Data taken from Inspur TEYE. Run on 512 cores.
Non-persistent network communication is conducive to improving application scalability.
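The scalability observation above concerns how ranks drive the network. A widely used complementary technique is to post communication with non-blocking calls and overlap it with computation; the generic sketch below illustrates that pattern and is not the profiled application's code.

```c
/* Generic non-blocking exchange sketch: overlap a neighbour exchange
 * with local computation. Illustration only. */
#include <mpi.h>

void exchange_and_compute(double *halo_send, double *halo_recv, int halo_n,
                          double *interior, int interior_n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Irecv(halo_recv, halo_n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(halo_send, halo_n, MPI_DOUBLE, right, 0, comm, &req[1]);

    for (int i = 0; i < interior_n; i++)   /* interior work overlaps transfer */
        interior[i] *= 0.5;

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    /* halo-dependent work would follow here */
}
```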
Harness the State-of-the-art Hardware
Our strategies
Large-scale
Optimization
Build a Powerful Profiling Tool
Harness the State-of-the-art Hardware
Leverage the Latest Model Algorithms
Leverage the Latest Model Algorithms
➢ New algorithms & models
Application science + computing science, applied to large-scale computation:
• New physical models that match the computing architecture
• New computing architectures driven by the application model
Customer in-house codes for
global climate simulation
❖ Research background:
Climate models need to better integrate the biological, chemical, and physical components of the Earth system. The coupler connects the various physical component models, such as atmosphere, ocean, and land. It receives two-dimensional boundary data from each component model, integrates the collected data into the appropriate calculations, and then transmits the required data back to each component.
❖ Performance before optimization:
The higher the spatial and temporal resolution, the more computing resources are needed: resolution x 2 (horizontal & vertical) leads to computing time x 24.
Leverage the Latest Model Algorithms
Leverage the Latest Model Algorithms
Run on 1212 cores
[Figure: process layout of the coupled model components (cpl, atm, lnd, ice, ocn) before and after restructuring.]
Leverage the Latest Model Algorithms
[Figure: restructured driver hierarchy. Before: the atm driver manages the physical process, dynamical frame, land, atm-ice, and atm-ocean couplings. After: a cpl driver manages the atm, lnd, ice, and ocn components and their couplings (coupling-atm, coupling-lnd, coupling-atm-ice, coupling-atm-ocn).]
➢ Optimized atm module
➢ Optimized cpl to change the model structure
➢ Optimized MPI collective functions
The atm module controls the physical process of the whole model; the other modules must wait to be called by atm. The communication response time (MPI_Barrier in ocn, MPI_Bcast in lnd and ice) is too long, which leads to low model performance (see the sketch below).
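Assuming an MPI-3 library, one way to shorten the collective waits described above is to switch a blocking MPI_Bcast to its non-blocking counterpart MPI_Ibcast, so a component can keep computing while the coupler's boundary data is in flight. A hedged sketch, not the actual coupler code:

```c
/* Sketch: overlap a coupler broadcast with component work using the
 * MPI-3 non-blocking collective MPI_Ibcast. Not the actual coupler code. */
#include <mpi.h>

void receive_coupling_fields(double *fields, int n, int cpl_root,
                             MPI_Comm comm, double *local_state, int m)
{
    MPI_Request req;
    MPI_Ibcast(fields, n, MPI_DOUBLE, cpl_root, comm, &req);

    for (int i = 0; i < m; i++)        /* work that does not yet need the
                                          incoming boundary data */
        local_state[i] += 1.0e-3;

    MPI_Wait(&req, MPI_STATUS_IGNORE); /* boundary data now usable */
}
```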
[Figure: per-component performance gains after optimization: 190%, 22%, 120%, 130%, 150%.]
Build a powerful profiling tool: take run-time performance data with powerful profiling tools, then analyze it to find the application's features and bottlenecks and guide platform optimization.
Harness the state-of-the-art hardware: use the newest technologies to maximize performance, the most direct way to reach the highest performance for large-scale applications.
Leverage the latest model algorithms: interdisciplinary innovation on new models and algorithms to optimize the application codes.
Realize better large-scale computing.
ASC Student Supercomputer Challenge
• ASC14, Tianhe-2: 3D Elastic Wave Modeling for petroleum prospecting, run on TH-2 with 200K cores
• ASC17, Sunway TaihuLight: Gordon Bell Prize nominated application MASNUM, run on Sunway TaihuLight with 10K cores
ASC provides a stage for students to practice large-scale HPC optimization.