Large-scale optimization strategies for
typical HPC workloads
Inspur Group
The PASC19 Conference
12th-14th June 2019, ETH Zurich, Switzerland
Yu Liu
Who are we?
• No.3 Server Vendor Worldwide
• Leading System Share in TOP500 List
• Top AI GPU Server Market Share
• R&D contribution to OCP, ODCC, Open19
• The only vendor that can provide both
Power (by Inspur Power Systems) and
x86 infrastructure solutions.
The challenge of large-scale Optimization
➢New HPC architecture
➢Poor application scalability
➢Low hardware utilization rate
CPU CPU+MIC MPE+CPE CPU+GPU
[Figure: GFlops by instruction set (x87, SSE, AVX), ideal vs. actual; actual shows low efficiency]
How to be faster?
Compute
faster
Data throughput
faster
Communicate
faster
PCIe NVlink IB OPA Ethernet
Our strategies
Large-scale
Optimization
Build a Powerful Profiling Tool
Harness the State-of-the-art Hardware
Leverage the Latest Model Algorithms
Sharp tools make good work
Build a Powerful Profiling Tool
Compute
➢ user%, sys%, iowait%
➢ SSE/AVX/AVX512 GFlops
➢ Vectorization rate
➢ Clock cycles per instruction
Data throughput
➢ Memory bandwidth
➢ PCIe bandwidth
➢ NVLink bandwidth
➢ IO bandwidth
➢ IOPS
Communicate
➢ IB/OPA bandwidth
➢ Ethernet bandwidth
➢ Message size
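As a rough illustration of how the first compute metrics can be derived, the sketch below turns two snapshots of the Linux /proc/stat "cpu" counters into user%, sys% and iowait%, and computes cycles per instruction from two counter values. It is not how Inspur TEYE works internally; the function names, the grouping of counters and the sample numbers are assumptions made for this example.

```python
from typing import Dict, Sequence

# Field order of the aggregate "cpu" line in Linux /proc/stat (first 8 counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_percentages(before: Sequence[int], after: Sequence[int]) -> Dict[str, float]:
    """Return user%, sys% and iowait% for the interval between two snapshots."""
    deltas = dict(zip(FIELDS, (new - old for old, new in zip(before, after))))
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    return {
        # Grouping nice with user and irq/softirq with sys is an assumption.
        "user%": 100.0 * (deltas["user"] + deltas["nice"]) / total,
        "sys%": 100.0 * (deltas["system"] + deltas["irq"] + deltas["softirq"]) / total,
        "iowait%": 100.0 * deltas["iowait"] / total,
    }


def cycles_per_instruction(cycles: int, instructions: int) -> float:
    """CPI from two hardware counter readings taken over the same interval."""
    return cycles / instructions


# Synthetic snapshots (made-up numbers, not TEYE output).
before = (100, 0, 50, 800, 50, 0, 0, 0)
after = (200, 0, 100, 1500, 200, 0, 0, 0)
print(cpu_percentages(before, after))
print(cycles_per_instruction(2_000, 1_000))
```

A high iowait% together with low GFlops would point at the data-throughput column rather than the compute column.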
Very complex model
Very old codes
Big data input / output
Computing intensive
Large parallelism required
Run-time sensitive
Optimization
Computing
Communication
I/O
• Instruction
• Algorithm
• Architecture
• Model
WRF: A state-of-the-art atmospheric modeling system
Build a Powerful Profiling Tool
Target performance:
WPS: 10 min
WRF (including real): 90 min
Post-processing: 20 min
Before optimization: > 150 min
Scheme:
Number of grids: 330x336, 605x449
Grid length: 6 km, 1.5 km
Vertical levels: 39
Run on 4096 cores
Build a Powerful Profiling Tool
[Figure: GFlops over samples 1-1026. Series: Total_DP_GFlops, Total_SP_GFlops, X87_GFlops, SSE_DP_Packed_GFlops, SSE_DP_Scalar_GFlops, AVX_DP_Packed_GFlops, SSE_SP_Packed_GFlops, AVX_SP_Packed_GFlops]
Analysis:
These figures show that WRF uses single-precision floating point and is not a floating-point-intensive application. In addition, WRF is not highly optimized for AVX instructions.
Phases marked on the figure: reading nested grids and task allocation; numerical calculations and solving differential equations; IO (writing wrfout); nudging
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
• IO Optimization
Analysis: 3 methods to improve IO performance
- Lustre performance: accelerate IO speed
- Quilt IO: asynchronous IO
- PnetCDF: parallel IO
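The idea behind asynchronous IO can be sketched in a few lines: the compute loop hands each output frame to a separate writer and continues with the next step instead of waiting for the write to finish. WRF's quilting does this with dedicated I/O tasks; the version below is only a thread-based analogy, and the function name, queue depth and frame strings are invented for illustration.

```python
import queue
import threading
from typing import List


def run_with_async_output(n_steps: int, sink: List[str]) -> None:
    """Compute n_steps frames while a background thread 'writes' them to sink."""
    pending = queue.Queue(maxsize=2)  # small buffer between compute and IO

    def writer() -> None:
        # Stand-in for an I/O server: drains frames in arrival order.
        while True:
            frame = pending.get()
            if frame is None:  # sentinel: no more output
                break
            sink.append(frame)  # a real writer would call the file system here

    thread = threading.Thread(target=writer)
    thread.start()
    for step in range(n_steps):
        frame = f"history frame {step}"  # stand-in for one computed time step
        pending.put(frame)  # blocks only if the writer is 2 frames behind
    pending.put(None)
    thread.join()


written: List[str] = []
run_with_async_output(5, written)
print(len(written), "frames written")
```

The bounded queue is a deliberate choice: it lets compute run ahead of the writer but caps the memory held by frames that have not been written yet.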
[Figures: IB traffic (ib_XmitData_MB, ib_RcvData_MB), before and after]
Phase                            Before   After
Reading grids, task allocation   130 s    75 s
Wrfout                           40 s     1-2 s
Nudge                            100 s    24 s
Time saving: 175 s
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
• Network Optimization
Analysis:
MPI communication becomes the bottleneck when WRF runs on thousands of cores; MPI+OpenMP hybrid mode is the best solution for reducing the cost of communication between MPI processes.
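A back-of-the-envelope model shows why fewer, larger MPI ranks help. The sketch assumes a uniform 2D block decomposition of the 605x449 grid on 4096 cores and counts the halo cells that cross rank boundaries per exchange. The one-cell halo depth, the choice of 8 OpenMP threads per rank and the near-square process grid are assumptions for illustration; this is not WRF's actual decomposition.

```python
import math
from typing import Tuple


def near_square_factors(n: int) -> Tuple[int, int]:
    """Split n ranks into a px x py process grid as close to square as possible."""
    px = math.isqrt(n)
    while n % px:
        px -= 1
    return px, n // px


def halo_cells(nx: int, ny: int, ranks: int, halo: int = 1) -> int:
    """Halo cells exchanged across all internal boundaries (idealized 2D blocks)."""
    px, py = near_square_factors(ranks)
    # (px - 1) cuts of length ny and (py - 1) cuts of length nx;
    # every cut is crossed in both directions, `halo` cells deep.
    return 2 * halo * ((px - 1) * ny + (py - 1) * nx)


cores = 4096
threads_per_rank = 8  # hypothetical hybrid setting
pure_mpi = halo_cells(605, 449, cores)
hybrid = halo_cells(605, 449, cores // threads_per_rank)
print(f"MPI only:   {cores} ranks, {pure_mpi} halo cells per exchange")
print(f"MPI+OpenMP: {cores // threads_per_rank} ranks, {hybrid} halo cells per exchange")
```

In this model the hybrid layout also cuts the number of communicating processes by the thread count, which lowers the number of messages as well as their total size.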
[Figures: IB Send and Receive (ib_XmitData_MB, ib_RcvData_MB), MPI only vs. MPI+OpenMP]
Performance
↑26.9%
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
• Memory bandwidth optimization
Analysis:
MPI+OpenMP hybrid mode reduces not only network bandwidth but also the frequency of memory and cache accesses. The comparison figure shows that memory bandwidth is clearly lower after optimization.
[Figures: Memory Bandwidth (mem_total_bw_GB), before and after]
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
• Performance improvement
[Figures: IB traffic (ib_XmitData_MB, ib_RcvData_MB) in three panels, linked by the steps "I/O Optimization" and "Network Optimization"]
Analysis:
IO and network optimization increase total performance by nearly 200%
Data taken from Inspur TEYE
Build a Powerful Profiling Tool
Our strategies
Large-scale
Optimization
Build a Powerful Profiling Tool
Harness the State-of-the-art Hardware
Leverage the Latest Model Algorithms
Harness the State-of-the-art Hardware
➢Hardware and Software matching
Using new architecture
Using new instruction sets
LAMMPS on TH2: ~24,000 cores
QE on TH2: ~20,000 cores
GTC on TH2: ~400,000 cores
[Figure: results for 64, 128, 256, 512, 1024, 2048, 4096 (x-axis) with 2 CPUs, 1 MIC, 2 MICs, 3 MICs]
Customer in-house codes for a global-scale atmospheric modeling system
➢ JOB information:
JOB                       RESOLUTION   RUNNING TIME
Global Weather Forecast   0.125°       < 3 hours
➢ Platform information:
RHEL 7.2, Intel Compiler 2017
        CPU               Memory       Network
Model   Xeon E5-2690v3    DDR4         Intel OPA
Peak    1.0 TFlops DP     136.5 GB/s   100 Gbit/s
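The peak figures quoted for this platform can be reproduced with simple arithmetic. The sketch below assumes a two-socket node, 12 cores per socket at a 2.6 GHz base clock, 16 double-precision flops per cycle per core (two 256-bit FMA units), and four memory channels per socket at 2133 MT/s with 8 bytes per transfer. None of these parameters appear on the slide; they are assumptions that happen to match its 1.0 TFlops and 136.5 GB/s.

```python
def peak_gflops(sockets: int, cores_per_socket: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFlops: every core retires flops_per_cycle each cycle."""
    return sockets * cores_per_socket * ghz * flops_per_cycle


def peak_mem_bw_gbs(sockets: int, channels: int, mega_transfers: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical memory bandwidth in GB/s across all channels of all sockets."""
    return sockets * channels * mega_transfers * bytes_per_transfer / 1000.0


# Assumed node configuration (not stated on the slide).
flops = peak_gflops(sockets=2, cores_per_socket=12, ghz=2.6, flops_per_cycle=16)
bandwidth = peak_mem_bw_gbs(sockets=2, channels=4, mega_transfers=2133)
print(f"peak compute: {flops:.1f} GFlops DP")
print(f"peak memory bandwidth: {bandwidth:.1f} GB/s")
```

Comparing measured GFlops and bandwidth against these ceilings is what makes a utilization rate meaningful.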
Harness the State-of-the-art Hardware
Analyzing elapsed time of the main function in an atmosphere model
Reference to Helmholtz equation: above 50%
Run on 8192 cores
Harness the State-of-the-art Hardware
Flops were mainly provided by the SSE instruction set.
SSE vectorization is not high enough, and the AVX instruction set has NOT been used.
Data taken from Inspur TEYE
Harness the State-of-the-art Hardware
After optimization, the flops have increased significantly.
After optimization, the flops are mainly provided by the AVX instruction set.
Data taken from Inspur TEYE
Harness the State-of-the-art Hardware
Time               original   optimized   improvement
Total Time         1073       762         40%
Model_integrate    541        335         61%
Number of iterations: 32 (original), 8 (optimized)
The GMRM algorithm needs fewer iterations and can therefore significantly reduce the number of floating-point operations. Calling a general function library also improves model performance. Overall, the optimization improves performance by up to 40%.
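The percentages on this slide are consistent with reading improvement as the ratio of original to optimized time, minus one. The sketch below checks that reading against the chart values, taking 1073 and 541 as the original times and 762 and 335 as the optimized ones. That pairing is an inference from the chart layout, not something the slide states.

```python
def improvement_pct(t_original: float, t_optimized: float) -> float:
    """Performance gain in percent, defined as (t_original / t_optimized - 1) * 100."""
    return (t_original / t_optimized - 1.0) * 100.0


total = improvement_pct(1073, 762)      # Total Time
integrate = improvement_pct(541, 335)   # Model_integrate time
print(f"Total Time:      {total:.1f}%")
print(f"Model_integrate: {integrate:.1f}%")
```

Both results land within a point of the 40% and 61% shown. The other common definition, time saved divided by original time, would give noticeably smaller numbers.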
Harness the State-of-the-art Hardware
➢ JOB information:
         ION   Basesets       Bands   K-Points   AG
NiMnSb   289   ~140 million   2176    16         CG/RMM-DIIS/HSE
➢ Environment information:
RHEL 7.5, Intel Compiler 2019, VASP 5.4.4
➢ Platform information:
        CPU              GPU                   Memory     Network
Model   Xeon 6142        V100/P100             DDR4       EDR
Peak    2.66 TFlops DP   7.8 / 5.3 TFlops DP   256 GB/s   100 Gbit/s
[Figure: NiMnSb]
Harness the State-of-the-art Hardware
Flops were mainly provided by the SSE instruction set.
SSE vectorization is high enough, but the AVX/AVX512 instruction sets have NOT been used.
Data taken from Inspur TEYE
Run on 512 cores
Harness the State-of-the-art Hardware
After optimization, the flops have increased significantly.
After optimization, the flops are mainly provided by the AVX512 instruction set.
Data taken from Inspur TEYE
Run on 512 cores
Harness the State-of-the-art Hardware
Loop Time: 403 (original), 123 (optimized), 228% improvement
With the help of the AVX512 instruction set, the overall improvement is up to 228%.
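To put the 228% in context, the sketch below compares the measured loop-time ratio with the ratio of vector widths between SSE and AVX512 for double precision. The register widths are standard; treating the width ratio as an upper bound ignores clock-frequency changes, FMA and memory effects, so it is only a rough yardstick. Reading 403 as the original and 123 as the optimized loop time is an assumption.

```python
# Vector register width in bits for each instruction set.
VECTOR_BITS = {"SSE": 128, "AVX": 256, "AVX512": 512}


def lanes(isa: str, element_bits: int = 64) -> int:
    """Number of elements processed per vector instruction."""
    return VECTOR_BITS[isa] // element_bits


def speedup(t_original: float, t_optimized: float) -> float:
    """Plain time ratio; 1.0 means no change."""
    return t_original / t_optimized


width_ratio = lanes("AVX512") / lanes("SSE")  # double-precision lanes
measured = speedup(403, 123)                  # loop times read off the chart
print(f"DP lanes: SSE={lanes('SSE')}, AVX={lanes('AVX')}, AVX512={lanes('AVX512')}")
print(f"width ratio SSE -> AVX512: {width_ratio:.0f}x")
print(f"measured loop speedup: {measured:.2f}x ({(measured - 1) * 100:.0f}% improvement)")
```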
[Figure: Performance on Xeon E5-2650, Xeon E5-2680v3 and Xeon 6142, annotated with 90% and 70%]
Run on 512 cores
Harness the State-of-the-art Hardware
Data taken from Inspur TEYE
Run on 512 cores
Non-persistent network communication is conducive to improving application scalability
Harness the State-of-the-art Hardware
Our strategies
Large-scale
Optimization
Build a Powerful Profiling Tool
Harness the State-of-the-art Hardware
Leverage the Latest Model Algorithms
Leverage the Latest Model Algorithms
➢New algorithm & model
Application science
Computing science
Large-scale computation
New physical model matches computing architecture
New computing architecture driven by application model
Customer in-house codes for
global climate simulation
❖ Research background:
Climate models need to better integrate the biological, chemical, and physical components of the Earth system. The coupler connects the various physical component models, such as atmosphere, ocean, and land.
The coupler receives two-dimensional boundary data from each component model, integrates the collected data into the appropriate calculations, and then transmits the required data back to each component.
❖ Performance before optimization:
The higher the resolution in space and time, the more computing resources are needed.
Resolution x 2 (horizontal & vertical)
Computing time: x 2^4
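One common way to arrive at a factor of 2^4 is sketched below, on the assumption that the "24" on the slide is a flattened 2^4. Doubling the resolution in two horizontal directions and in the vertical multiplies the number of grid points by 2^3, and an explicit time-stepping scheme limited by a CFL condition then needs twice as many steps. The function name and the CFL assumption are illustrative, not taken from the slides.

```python
def cost_factor(refinement: float, spatial_dims: int = 3, cfl_limited: bool = True) -> float:
    """Relative compute cost after refining the grid by `refinement` in each dimension."""
    # One power per refined spatial dimension, plus one for the shorter time step.
    exponent = spatial_dims + (1 if cfl_limited else 0)
    return refinement ** exponent


print(cost_factor(2))                     # grid points and time steps both scale
print(cost_factor(2, cfl_limited=False))  # grid points only
```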
Leverage the Latest Model Algorithms
Run on 1212 cores
[Diagrams: two layouts of the components cpl, atm, lnd, ice, ocn]
Leverage the Latest Model Algorithms
[Diagrams: cpl, atm, ocn, lnd, ice under an atm driver (physical process, dynamical frame, land, atm-ice, atm-ocean) and under a cpl driver (atm, lnd, ice, ocn, with coupling for atm, lnd, atm-ice, atm-ocn)]
➢ Optimized atm module
➢ Optimized cpl to change the model structure
➢ Optimized MPI collective functions
The atm module controls the physical processes of the whole model; the other modules must wait to be called by atm. Communication response time (MPI_Barrier in ocn, MPI_Bcast in lnd and ice) is too long, which leads to low model performance.
[Figure: annotated values 190%, 22%, 120%, 130%, 150%]
Take the run-time performance data with powerful profiling tools
Get the application features and bottlenecks by analyzing the performance data for platform optimization
State-of-the-art technology to maximize the performance by using new technologies
The best way to realize the highest performance for large-scale applications
Interdisciplinary innovation for new models or algorithms to optimize the codes of all the applications
Realize better large-scale computing
ASC14 Tianhe-2
• 3D Elastic Wave Modeling for petroleum prospecting
• Run on TH2 with 200K cores
ASC17 Sunway TaihuLight
• Gordon Bell Prize nominated application: MASNUM
• Run on Sunway TaihuLight with 10K cores
ASC Student Supercomputer Challenge
ASC provides a stage for students to practice large-scale HPC optimization
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

Large-scale optimization strategies for typical HPC workloads

  • 1. Large-scale optimization strategies for typical HPC workloads Inspur Group The PASC19 Conference 12th-14th June 2019, ETH Zurich, Switzerland Yu Liu
  • 2. Who are we? • No.3 Server Vendor Worldwide • Leading System Share in TOP500 List • Top AI GPU Server Market Share • R&D contribution to OCP, ODCC, Open19 • The only vendor that can provide both Power (by Inspur Power Systems) and x86 infrastructure solutions.
  • 3. The challenge of large-scale optimization ➢ New HPC architectures ➢ Poor application scalability ➢ Low hardware utilization rate [Figures: architectures CPU, CPU+MIC, MPE+CPE, CPU+GPU; ideal vs. actual; low efficiency in GFlops by instruction set (x87, SSE, AVX)]
  • 4. How to be faster? Compute faster; data throughput faster (PCIe, NVLink); communicate faster (IB, OPA, Ethernet)
  • 5. Our strategies Large-scale Optimization Build a Powerful Profiling Tool Harness the State-of-the-art Hardware Leverage the Latest Model Algorithms
  • 6. Sharp tools make good work
  • 7. Build a Powerful Profiling Tool. Compute: user%, sys%, iowait%; SSE/AVX/AVX512 GFlops; vectorization rate; clock cycles per instruction. Data throughput: memory bandwidth; PCIe bandwidth; NVLink bandwidth; IO bandwidth; IOPS. Communicate: IB/OPA bandwidth; Ethernet bandwidth; message size.
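As a minimal sketch of how the compute metrics listed on slide 7 could be derived from raw counters, the Python below computes GFlops, vectorization rate, and cycles per instruction for one sampling interval. The `CounterSample` type, its field names, the sample values, and the definition of vectorization rate as the share of floating-point operations issued by packed instructions are all assumptions made for illustration; none of this is taken from Inspur TEYE.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counts over one sampling interval (hypothetical field names)."""
    seconds: float       # length of the sampling interval
    cycles: int          # unhalted core cycles
    instructions: int    # retired instructions
    scalar_flops: int    # floating-point operations from scalar instructions
    packed_flops: int    # floating-point operations from packed SSE/AVX/AVX512 instructions


def gflops(s: CounterSample) -> float:
    """Floating-point throughput in GFlops over the interval."""
    return (s.scalar_flops + s.packed_flops) / s.seconds / 1e9


def vectorization_rate(s: CounterSample) -> float:
    """Assumed definition: fraction of floating-point operations that were packed."""
    total = s.scalar_flops + s.packed_flops
    return s.packed_flops / total if total else 0.0


def cycles_per_instruction(s: CounterSample) -> float:
    """Average clock cycles spent per retired instruction (CPI)."""
    return s.cycles / s.instructions


# Made-up values for a one-second interval, for illustration only.
sample = CounterSample(
    seconds=1.0,
    cycles=2_400_000_000,
    instructions=1_600_000_000,
    scalar_flops=300_000_000,
    packed_flops=900_000_000,
)

print(f"GFlops: {gflops(sample):.2f}")
print(f"Vectorization rate: {vectorization_rate(sample):.0%}")
print(f"CPI: {cycles_per_instruction(sample):.2f}")
```

Under this assumed definition, a low vectorization rate together with a high CPI would point at the compute column of the slide rather than at data throughput or communication.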
  • 8. WRF: a state-of-the-art atmospheric modeling system. Characteristics: very complex model; very old code; big data input/output; compute-intensive; large parallelism required; run-time sensitive. Optimization areas: computing, communication, I/O. Optimization levels: instruction, algorithm, architecture, model.
  • 9. Target performance: WPS: 10 min; WRF (including real): 90 min; post-processing: 20 min. Before optimization: > 150 min. Scheme: number of grids 330x336 and 605x449; grid length 6 km and 1.5 km; vertical levels 39. Run on 4096 cores.
  • 10. [Chart: GFlops per sample, samples 1 to 1026; series Total_DP_GFlops, Total_SP_GFlops, X87_GFlops, SSE_DP_Packed_GFlops, SSE_DP_Scalar_GFlops, AVX_DP_Packed_GFlops, SSE_SP_Packed_GFlops, AVX_SP_Packed_GFlops; annotated phases: reading nested grids and task allocation; numerical calculations and differential equation solving; IO (writing wrfout); nudging] Analysis: These figures show that WRF uses single-precision floating-point arithmetic and is not a floating-point-intensive application. In addition, WRF is not highly optimized for AVX instructions. Data taken from Inspur TEYE.
  • 11. IO optimization. Analysis: three methods to improve IO performance: Lustre performance (accelerate IO speed); quilt IO (asynchronous IO); PnetCDF (parallel IO). [Charts, before and after: ib_XmitData_MB and ib_RcvData_MB] Before vs. after: reading grids and task allocation 130 s vs. 75 s; wrfout 40 s vs. 1-2 s; nudge 100 s vs. 24 s. Time saving: 175 s. Data taken from Inspur TEYE.
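The idea behind asynchronous IO on slide 11 can be sketched in a few lines of Python: the compute loop hands each output frame to a dedicated worker and carries on, instead of blocking on the write. This is a thread-based analogue for illustration only. It is not how WRF implements quilting, and the names `run_with_io_worker`, `compute_step`, and `write_frame` are hypothetical.

```python
import queue
import threading


def run_with_io_worker(num_steps, compute_step, write_frame):
    """Run compute steps while one dedicated worker thread performs the writes."""
    frames = queue.Queue()
    written = []

    def io_worker():
        while True:
            frame = frames.get()
            if frame is None:          # sentinel: no more frames will arrive
                break
            written.append(write_frame(frame))

    worker = threading.Thread(target=io_worker)
    worker.start()
    for step in range(num_steps):
        # Hand the frame off and continue computing without waiting for the write.
        frames.put(compute_step(step))
    frames.put(None)
    worker.join()                      # wait until every queued frame is written
    return written


# Stand-ins for a model time step and an output routine.
result = run_with_io_worker(
    num_steps=4,
    compute_step=lambda step: {"step": step, "value": step * step},
    write_frame=lambda frame: f"wrote step {frame['step']}",
)
print(result)
```

With a single worker and a FIFO queue, frames are written in the order they were produced, so the output stays deterministic while the compute loop no longer waits on each write.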
  • 12. Network optimization. Analysis: MPI communication became the bottleneck when WRF ran on thousands of cores; MPI+OpenMP hybrid mode is the best solution for reducing the cost of communication between MPI processes. [Charts, MPI only vs. MPI+OpenMP: IB send and receive, ib_XmitData_MB and ib_RcvData_MB] Performance ↑26.9%. Data taken from Inspur TEYE.
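A toy model can show why fewer, larger MPI ranks reduce communication, as slide 12 reports. The Python sketch below counts messages and halo cells per boundary exchange for a 2D domain decomposition, first with one rank per core and then with one rank per group of threads. The 605x449 grid and the 4096 cores come from slide 9; the 8 threads per rank, the one-cell halo width, and the near-square decomposition are assumptions, and the model ignores uneven subdomain sizes and everything except nearest-neighbour exchange.

```python
import math


def near_square_factors(n):
    """Return (px, py) with px * py == n and px the largest divisor <= sqrt(n)."""
    px = math.isqrt(n)
    while n % px:
        px -= 1
    return px, n // px


def halo_traffic(nx, ny, ranks, halo_width=1):
    """Messages and grid cells sent per halo update for a px-by-py decomposition."""
    px, py = near_square_factors(ranks)
    # Internal boundaries between neighbouring subdomains, each counted once.
    x_boundaries = (px - 1) * py
    y_boundaries = px * (py - 1)
    messages = 2 * (x_boundaries + y_boundaries)            # one message each way
    cells = 2 * halo_width * ((px - 1) * ny + (py - 1) * nx)
    return messages, cells


cores = 4096
threads_per_rank = 8                                        # assumed, not from the slides

mpi_only = halo_traffic(605, 449, cores)
hybrid = halo_traffic(605, 449, cores // threads_per_rank)

print("MPI only   (messages, cells):", mpi_only)
print("MPI+OpenMP (messages, cells):", hybrid)
```

In this model the hybrid layout exchanges far fewer messages and fewer halo cells per update, because threads inside a rank share memory instead of exchanging boundaries. Whether that translates into the reported 26.9% gain depends on factors the model leaves out.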
  • 13. Memory bandwidth optimization. Analysis: MPI+OpenMP hybrid mode reduces not only network bandwidth usage but also the frequency of memory and cache accesses. The comparison shows that memory bandwidth usage is clearly lower after optimization. [Charts, before vs. after: memory bandwidth, mem_total_bw_GB, axis 0 to 25] Data taken from Inspur TEYE.
  • 14. Performance improvement. [Charts: ib_XmitData_MB and ib_RcvData_MB for three runs, labeled I/O Optimization and Network Optimization] Analysis: IO and network optimization together increase total performance by nearly 200%. Data taken from Inspur TEYE.
  • 15. Our strategies Large-scale Optimization Build a Powerful Profiling Tool Harness the State-of-the-art Hardware Leverage the Latest Model Algorithms
  • 16. Harness the State-of-the-art Hardware ➢ Hardware and software matching: using new architecture; using new instruction sets. LAMMPS on TH2: ~24000 cores. QE on TH2: ~20000 cores. GTC on TH2: ~400000 cores. [Chart: x-axis 64 to 4096, y-axis 0 to 3.5; series 2CPUs, 1MIC, 2MICs, 3MICs]
  • 17. Customer in-house code for a global-scale atmospheric modeling system ➢ Job information: job: Global Weather Forecast; resolution: 0.125°; running time: < 3 hours. ➢ Platform information: CPU: Xeon E5-2690v3, peak 1.0 TFlops DP; memory: DDR3, peak 136.5 GB/s; network: Intel OPA, peak 100 Gbit/s. Environment: RHEL 7.2, Intel Compiler 2017.
  • 18. Analysis of the elapsed time of the main functions in an atmosphere model: the functions related to the Helmholtz equation account for above 50%. Run on 8192 cores.
  • 19. Flops were mainly provided by SSE instruction sets. The SSE vectorization rate is not high enough, and AVX instruction sets have NOT been used. Data taken from Inspur TEYE.
  • 20. After optimization, the flops increased significantly and were mainly provided by AVX instruction sets. Data taken from Inspur TEYE.
  • 21. [Chart, original vs. optimized: total time 1073 vs. 762 (40%); Model_integrate time 541 vs. 335 (61%); number of iterations 32 vs. 8] The GMRM algorithm significantly reduces the number of floating-point operations by needing fewer iterations. Calling a general function library also improves model performance. The overall efficiency gain is up to 40%.
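The four timings and two percentages on slide 21 can be cross-checked with a short calculation. The sketch below assumes the values pair up as 1073 to 762 for total time and 541 to 335 for Model_integrate time, and that improvement is measured as original / optimized - 1. Both the pairing and the formula are inferred from the numbers rather than stated on the slide, and the time unit is not given.

```python
def improvement_percent(original, optimized):
    """Relative gain of the optimized run: (original / optimized - 1) * 100."""
    return (original / optimized - 1.0) * 100.0


# Pairing of original and optimized values is an assumption (see the text above).
total_gain = improvement_percent(1073, 762)
integrate_gain = improvement_percent(541, 335)

print(f"Total time gain:           {total_gain:.1f}%")
print(f"Model_integrate time gain: {integrate_gain:.1f}%")
```

Under this reading both results land within about one percentage point of the 40% and 61% labels on the slide, which supports the assumed pairing.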
  • 22. ➢ Job information (NiMnSb): ION: 289; Basesets: ~140 million; Bands: 2176; K-Points: 16; AG: CG/RMM-DIIS/HSE. ➢ Platform information: CPU: Xeon 6142, peak 2.66 TFlops DP; GPU: V100/P100, peak 7.8 / 5.3 TFlops DP; memory: DDR4, peak 256 GB/s; network: EDR, peak 100 Gbit/s. ➢ Environment information: RHEL 7.5, Intel Compiler 2019, VASP 5.4.4.
  • 23. Flops were mainly provided by SSE instruction sets. The SSE vectorization rate is high enough, but AVX/AVX512 instruction sets have NOT been used. Data taken from Inspur TEYE. Run on 512 cores.
  • 24. After optimization, the flops increased significantly and were mainly provided by AVX512 instruction sets. Data taken from Inspur TEYE. Run on 512 cores.
  • 25. [Chart, original vs. optimized: loop time 403 vs. 123 (228%)] With the help of the AVX512 instruction set, the overall efficiency gain is up to 228%. [Chart: performance on Xeon E5-2650, Xeon E5-2680v3, and Xeon 6142, axis 0 to 3.5; labels 90% and 70%] Run on 512 cores.
  • 26. Non-persistent network communication helps improve application scalability. Data taken from Inspur TEYE. Run on 512 cores.
  • 27. Our strategies Large-scale Optimization Build a Powerful Profiling Tool Harness the State-of-the-art Hardware Leverage the Latest Model Algorithms
  • 28. Leverage the Latest Model Algorithms ➢ New algorithm & model [Diagram: application science, computing science, large-scale computation] New physical model matches computing architecture; new computing architecture driven by application model.
  • 29. Customer in-house code for global climate simulation ❖ Research background: Climate models need to better integrate the biological, chemical, and physical components of the Earth system. The coupler connects the physical component models, such as atmosphere, ocean, and land. The coupler receives two-dimensional boundary data from each component model, integrates the collected data into the appropriate calculations, and then transmits the required data back to each component. ❖ Performance before optimization: The higher the resolution in space and time, the more computing resources are needed. Resolution x 2 (horizontal & vertical): computing time x 2^4.
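The cost figure on slide 29 most plausibly reads as 2 to the power 4, with the superscript lost in extraction. A minimal sketch of that scaling, assuming cost grows with the refinement factor once for each of two horizontal dimensions, once for the vertical dimension, and once for a time step that shrinks in proportion to the grid spacing. The function name and its parameters are hypothetical.

```python
def cost_factor(refinement, horizontal_dims=2, vertical_dims=1, time_dims=1):
    """Growth in computing time when every dimension is refined by `refinement`."""
    return refinement ** (horizontal_dims + vertical_dims + time_dims)


print(cost_factor(2))   # doubling the resolution in all four dimensions
print(cost_factor(4))   # quadrupling it
```

Under these assumptions, doubling the resolution multiplies the computing time by 16, which is why higher resolution quickly demands either more cores or better algorithms.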
  • 30. [Diagram: layout of the coupler (cpl) and the atm, lnd, ice, and ocn components] Run on 1212 cores.
  • 31. ➢ Optimized atm module ➢ Optimized cpl to change the model structure ➢ Optimized MPI collective functions. The atm module controls the physical processes of the whole model; the other modules have to wait to be called by atm. Communication response time (MPI_Barrier in ocn, MPI_Bcast in lnd and ice) is too long, which leads to low model performance. [Diagrams: model structure with cpl, driver, atm (physical process, dynamical frame), lnd, ice, ocn, and the couplings atm-lnd, atm-ice, atm-ocn; labels 190%, 22%, 120%, 130%, 150%]
  • 32. Collect run-time performance data with powerful profiling tools. Identify application features and bottlenecks by analyzing the performance data for platform optimization. Use state-of-the-art technology to maximize performance. The best way to realize the highest performance for large-scale applications: interdisciplinary innovation on new models or algorithms to optimize the code of the whole application. Realize better large-scale computing.
  • 33. ASC Student Supercomputer Challenge: ASC provides a stage for students to practice large-scale HPC optimization. ASC14 (Tianhe-2): 3D elastic wave modeling for petroleum prospecting, run on TH2 with 200K cores. ASC17 (Sunway TaihuLight): MASNUM, a Gordon Bell Prize nominated application, run on Sunway TaihuLight with 10K cores.