In this video from the NVIDIA GPU Technology Conference, Gabriel Hautreux from GENCI presents: Pre-exascale Architectures: OpenPOWER Performance and Usability Assessment for French Scientific Community.
"In order to prepare the scientific communities, GENCI and its partners have set up a technology watch group and lead collaborations with vendors, relying on HPC experts and early adoption of HPC solutions. The two main objectives are to provide guidance and to prepare the scientific communities for the challenges of exascale architectures. The talk presents the OpenPOWER platform bought by GENCI and provided to the scientific community. It then presents the first results obtained on the platform for a set of about 15 applications using all the solutions provided to the users (CUDA, OpenACC, OpenMP, ...). Finally, one specific application is presented in detail, covering its porting effort and the techniques used for GPUs with both OpenACC and OpenMP."
Watch the video: https://wp.me/p3RLHQ-iyl
Learn more: https://openpowerfoundation.org/
and
https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
1. Pre-exascale Architectures: OpenPOWER Performance and Usability Assessment for French Scientific Community
GTC17 12/11/2017
G. Hautreux (GENCI), E. Boyer (GENCI)
Technological watch group,
GENCI-CEA-CNRS-INRIA and French Universities
+
Abel Marin-Lafleche and Matthieu Haefele
(Maison de la Simulation)
2. GENCI
Presentation (GTC Europe 2017, 12/10/2017)
In charge of national HPC strategy for civil research
▪ Close to 7 Pflops available across the 3 national centers (CINES, IDRIS and TGCC)
Partnerships at the regional level
▪ Equip@meso, 15 partners
Represent France in the PRACE research infrastructure
Promote the use of supercomputing for the benefit of French scientific communities and industry
▪ Specific actions for SMEs through the Simseo initiative
3. TECHNOLOGICAL WATCH GROUP
Led by GENCI and its partners
➢ Goals:
➢ anticipate upcoming (pre) exascale architectures
➢ deploy prototypes and prepare our users
➢ organise code modernization
➢ share and mutualise expertise
➢ Preserve legacy codes by using standards – OpenMP
4. OUESSANT
OpenPOWER-based prototype
OpenPOWER platform @ IDRIS, Orsay (France)
▪ 12 IBM System S822LC “Minsky”, >250 Tflops peak
• 2 IBM Power8 10-core processors @ 4.2GHz
• 128GB of memory per node
• 2 Nvidia P100 GPUs per socket
• Connection socket <-> GPU with NVLink 1.0 (80GB/s)
▪ IB EDR Interconnect
Software stack
▪ Multiple compilers
• PGI (main target OpenACC)
• IBM XL (main target OpenMP)
• LLVM (Fortran issues in 2016)
▪ Power AI within Docker
High level support
▪ Multiple workshops organised
▪ Thanks to IBM and Nvidia teams
5. RELEVANT SET OF APPLICATIONS
Represent French research community
18 "real" applications
▪ 2 GPU-focused (RAMSES and EMMA)
▪ 1 OpenCL (CMS-MEM)
• No official OpenCL support at the moment
▪ 15 "standard" applications coming from various scientific and industrial domains
4 withdrawals so far
Work performed on 14 applications
6. APPLICATION RESULTS
Scope
The results presented here aim to define
▪ a baseline in terms of performance on one full Minsky node
▪ the porting effort on GPU
▪ the software stack maturity (for code offloading)
Power8 results
▪ no Power8-only results are shown
▪ codes compile and run, with no trouble moving from x86 to Power8 processors
The comparison is made between
▪ Power8 node only (dual socket)
▪ OpenPOWER node (dual socket + 4GPUs)
7. PERFORMANCE SUMMARY
Preliminary results
The overall performance for those applications at the moment:
▪ Mainly CUDA
▪ Unfortunately, no real OpenMP porting at the moment
8. FIRST CONCLUSIONS
Feedback on OpenPower platform
Power8 processor is easy to use (compile and run)
Programming models
▪ CUDA: very high performance
▪ OpenACC: high performance
▪ OpenMP 4.5: no global feedback at the moment
Compilers
▪ PGI works efficiently (for both Power8 and GPUs with OpenACC)
▪ IBM XL support for OpenMP GPU offloading is improving steadily
First results are very promising
▪ Opening of the platform to the full French community in April 2017
▪ more applications, and a new focus on AI applications (50% of the applications received)
9. FOCUS ON METALWALLS
First results
Molecular dynamics application
▪ Co-developed by Université Pierre et Marie Curie and Maison de la Simulation (UPMC and MdS)
▪ Used for the development of novel storage devices: supercapacitors
MPI + OpenACC
Abel Marin-Lafleche, Matthieu Haefele (MdS)
▪ Development started in Q1 2017
▪ 3500 lines of code (computational part)
▪ First results available after one month
▪ Roughly 90% of the app ported
▪ Porting effort: 2 months
MPI + OpenMP
▪ Development started in Q3 2017
▪ First results available after a week (thanks to the existing OpenACC implementation)
▪ Numerical result issues remain at the moment
[Slide graphic: supercapacitor schematic, charge Q = f(t)]
10. FOCUS ON METALWALLS
Application study
2 main types of functions
▪ One takes more than 80% of the computational time
▪ 2 main loops, which look roughly as follows:
[Code images: Loop #1 and Loop #2]
Then
11. FOCUS ON METALWALLS
Application study
Function study
▪ The l, m, n loops express enough parallelism
▪ SIMD is expressed through the inner i loop
[Code image: Loop #1, annotated (the same applies to Loop #2). High enough parallelism in the outer loops plus SIMD in the inner loop are the conditions for efficient use of the GPU.]
12. FOCUS ON METALWALLS
Memory management
OpenACC pragmas added to both loops
• Can we improve the memory management?
[Code images: Loop #1 and Loop #2 with OpenACC pragmas added]
13. FOCUS ON METALWALLS
Memory management
Managed memory: how does it work?
[Diagram, what we could expect: entire arrays transferred between CPU and GPU]
14. FOCUS ON METALWALLS
Memory management
Managed memory: how does it work?
[Diagram, what is done: BUFFER(n,m,l) moved between CPU and GPU with smart data management, resulting in fewer memory transfers]
15. FOCUS ON METALWALLS
Memory management
Explicit data management
▪ OpenACC declarations added to the function to see the impact
16. FOCUS ON METALWALLS
Results
Main function
Speed-ups
▪ Using a single GPU vs a full Power8 node: x11.9
▪ Using 4 GPUs vs 1 GPU: x3.5
• Quite good scalability across 4 GPUs
▪ 4 GPUs vs a full node: x40.9
▪ Using guided memory vs automatic managed memory: no real impact, thanks to smart automatic memory management
Implementation | Architecture used | Time to solution
Fortran, MPI | Power8 (20 cores) | 368 s
OpenACC, managed memory | 1 GPU (P100) | 31 s
OpenACC, guided memory | 1 GPU (P100) | 31 s
OpenACC, guided memory | 4 GPUs (P100) | 9 s
17. FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
▪ private (or firstprivate, ...) clauses MUST be used, as OpenMP is an explicit model
[Code image: Loop #1 in OpenACC and OpenMP]
18. FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
[Code image: Loop #2 in OpenACC and OpenMP]
19. FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
▪ With OpenMP you HAVE TO give the array lengths explicitly, otherwise you get wrong results
▪ The OpenMP code runs but has numerical stability issues; work is ongoing
[Code image: data management directives in OpenACC and OpenMP]
20. CONCLUSIONS
Overall feedback
Porting to GPU with OpenACC
▪ Work has to be done on the way you express parallelism
• Here we work on large 3-dimensional arrays
▪ Compiler helps a lot for memory management
• Automatic management using -ta=tesla:managed compilation option
▪ Very good performance
Porting to GPU with OpenMP
▪ Easy if you start from OpenACC
▪ Difficult to get feedback from the XL compiler (compared to OpenACC)
▪ Numerical stability issues at the moment in the application
• We do not know whether they come from the implementation or from the compiler
▪ OpenMP for GPU is not yet understood by CPU compilers
OpenMP-GPU is an increasingly serious candidate for using GPUs
21. Thank you for your attention!
Questions?
gabriel.hautreux@genci.fr