In this video from the NVIDIA GPU Technology Conference, Gabriel Hautreux from GENCI presents: Pre-exascale Architectures: OpenPOWER Performance and Usability Assessment for French Scientific Community.
"In order to prepare the scientific communities, GENCI and its partners have set up a technology watch group and lead collaborations with vendors, relying on HPC experts and early adoption of HPC solutions. The two main objectives are to provide guidance and to prepare the scientific communities for the challenges of exascale architectures. The talk presents the OpenPOWER platform bought by GENCI and provided to the scientific community. It then presents the first results obtained on the platform for a set of about 15 applications using all the solutions provided to the users (CUDA, OpenACC, OpenMP, ...). Finally, one specific application is presented in detail, covering its porting effort and the techniques used for GPUs with both OpenACC and OpenMP."
Watch the video: https://wp.me/p3RLHQ-iyl
Learn more: https://openpowerfoundation.org/
and
https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
1. Pre-exascale Architectures: OpenPOWER Performance and Usability Assessment for French Scientific Community
GTC17 12/11/2017
G. Hautreux (GENCI), E. Boyer (GENCI)
Technological watch group,
GENCI-CEA-CNRS-INRIA and French Universities
+
Abel Marin-Lafleche and Matthieu Haefele
(Maison de la Simulation)
2. GENCI
Presentation (GTC Europe 2017, 12/10/2017)
In charge of national HPC strategy for civil research
▪ Close to 7 Pflops available across the 3 national centers (CINES, IDRIS and TGCC)
Partnerships at the regional level
▪ Equip@meso, 15 partners
Represent France in the PRACE research infrastructure
Promote the use of supercomputing for the benefit of French scientific communities and industry
▪ Specific actions for SMEs through the Simseo initiative
3. TECHNOLOGICAL WATCH GROUP
Led by GENCI and its partners
➢ Goals:
➢ anticipate upcoming (pre) exascale architectures
➢ deploy prototypes and prepare our users
➢ organise code modernization
➢ share and mutualise expertise
➢ Preserve legacy codes by using standards – OpenMP
4. OUESSANT
OpenPOWER-based prototype
OpenPOWER platform @ IDRIS, Orsay (France)
▪ 12 IBM System S822LC “Minsky”, >250 Tflops peak
• 2 IBM Power8 10-core processors @ 4.2GHz
• 128GB of memory per node
• 2 Nvidia P100 GPUs per socket
• Connection socket <-> GPU with NVLink 1.0 (80GB/s)
▪ IB EDR Interconnect
Software stack
▪ Multiple compilers
• PGI (main target OpenACC)
• IBM XL (main target OpenMP)
• LLVM (Fortran issues in 2016)
▪ Power AI within Docker
High level support
▪ Multiple workshops organised
▪ Thanks to IBM and Nvidia teams
5. RELEVANT SET OF APPLICATIONS
Represent French research community
18 "real" applications
▪ 2 GPU-focused (RAMSES and EMMA)
▪ 1 OpenCL (CMS-MEM)
• No official OpenCL support at the moment
▪ 15 "standard" applications coming from various scientific and industrial domains
4 withdrawals so far
Work performed on 14 applications
6. APPLICATION RESULTS
Scope
The results presented here aim to define
▪ a baseline in terms of performance on one full Minsky node
▪ the porting effort on GPU
▪ the software stack maturity (for code offloading)
Power8 results
▪ no Power8-only results are shown
▪ codes compile and run, with no trouble moving from x86 to Power8 processors
The comparison is made between
▪ Power8 node only (dual socket)
▪ OpenPOWER node (dual socket + 4GPUs)
7. PERFORMANCE SUMMARY
Preliminary results
The overall performance for those applications at the moment:
▪ Mainly CUDA
▪ Unfortunately, no real OpenMP porting at the moment
8. FIRST CONCLUSIONS
Feedback on OpenPower platform
Power8 processor is easy to use (compile and run)
Programming models
▪ CUDA: very high performance
▪ OpenACC: high performance
▪ OpenMP 4.5: no global feedback at the moment
Compilers
▪ PGI works efficiently (for both Power8 and GPUs with OpenACC)
▪ IBM XL support for OpenMP GPU offloading is improving steadily
First results are very promising
▪ Opening of the platform to the full French community in April 2017
▪ more applications, and a new focus on AI applications (50% of the applications received)
9. FOCUS ON METALWALLS
First results
Molecular dynamics application
▪ Co-developed by Université Pierre et Marie Curie and Maison de la Simulation (UPMC and MdS)
▪ Used for the development of novel storage devices: supercapacitors
MPI + OpenACC
Abel Marin-Lafleche, Matthieu Haefele (MdS)
▪ Development started in Q1 2017
▪ 3500 lines of code (computational part)
▪ First results available after one month
▪ Roughly 90% of the app ported
▪ Porting effort: 2 months
MPI + OpenMP
▪ Development started in Q3 2017
▪ First results available after a week (thanks to the existing OpenACC implementation)
▪ Numerical result issues remain at the moment
[Slide graphic: supercapacitor schematic, charge Q = f(t)]
10. FOCUS ON METALWALLS
Application study
2 main types of functions
▪ One takes more than 80% of the computational time
▪ 2 main loops, which look roughly as follows:
[Code images: Loop #1 and Loop #2]
Then
11. FOCUS ON METALWALLS
Application study
Function study
▪ The l, m, n loops express enough parallelism
▪ SIMD is expressed through the inner i loop
[Code image: Loop #1, annotated (the same applies to Loop #2). High enough parallelism in the outer loops plus SIMD in the inner loop are the conditions for efficient use of the GPU.]
12. FOCUS ON METALWALLS
Memory management
OpenACC pragmas added to both loops
• Can we improve the memory management?
[Code images: Loop #1 and Loop #2 with OpenACC pragmas added]
13. FOCUS ON METALWALLS
Memory management
Managed memory: how does it work?
[Diagram, what we could expect: entire arrays transferred between CPU and GPU]
14. FOCUS ON METALWALLS
Memory management
Managed memory: how does it work?
[Diagram, what is done: BUFFER(n,m,l) moved between CPU and GPU with smart data management, resulting in fewer memory transfers]
15. FOCUS ON METALWALLS
Memory management
Explicit data management
▪ OpenACC declarations added to the function to see the impact
16. FOCUS ON METALWALLS
Results
Main function
Speed-ups
▪ Using a single GPU vs a full Power8 node: x11.9
▪ Using 4 GPUs vs 1 GPU: x3.5
• Quite good scalability across 4 GPUs
▪ 4 GPUs vs a full node: x40.9
▪ Using guided memory vs automatic managed memory: no real impact, thanks to smart automatic memory management
Implementation | Architecture used | Time to solution
Fortran, MPI | Power8 (20 cores) | 368 s
OpenACC, managed memory | 1 GPU (P100) | 31 s
OpenACC, guided memory | 1 GPU (P100) | 31 s
OpenACC, guided memory | 4 GPUs (P100) | 9 s
17. FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
▪ private (or firstprivate, ...) clauses MUST be used, as OpenMP is an explicit model
[Code image: Loop #1 in OpenACC and OpenMP]
18. FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
[Code image: Loop #2 in OpenACC and OpenMP]
19. FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
▪ With OpenMP you HAVE TO give the array lengths explicitly, otherwise you get wrong results
▪ The OpenMP code runs but has numerical stability issues; work is ongoing
[Code image: data management directives in OpenACC and OpenMP]
20. CONCLUSIONS
Overall feedback
Porting to GPU with OpenACC
▪ Work has to be done on the way you express parallelism
• Here we work on large 3-dimensional arrays
▪ Compiler helps a lot for memory management
• Automatic management using -ta=tesla:managed compilation option
▪ Very good performance
Porting to GPU with OpenMP
▪ Easy if you start from OpenACC
▪ Difficult to get feedback from the XL compiler (compared to OpenACC)
▪ Numerical stability issues at the moment in the application
• We do not know whether they come from the implementation or from the compiler
▪ OpenMP for GPU is not yet understood by CPU compilers
OpenMP-GPU is an increasingly serious candidate for using GPUs
21. Thank you for your attention!
Questions?
gabriel.hautreux@genci.fr