Pre-exascale Architectures:
OpenPOWER Performance and
Usability Assessment for French
Scientific Community
GTC17 12/11/2017
G. Hautreux (GENCI), E. Boyer (GENCI)
Technological watch group,
GENCI-CEA-CNRS-INRIA and French Universities
+
Abel Marin-Lafleche and Matthieu Haefele
(Maison de la Simulation)
GENCI
Presentation
In charge of national HPC strategy for civil research
▪ Close to 7 Pflops available across 3 national centers (CINES, IDRIS and TGCC)
Partnerships at the regional level
▪ Equip@meso, 15 partners
Represent France in the PRACE research infrastructure
Promote the use of supercomputing for the benefit of French scientific communities and industry
▪ Specific actions for SMEs through the Simseo initiative
TECHNOLOGICAL WATCH GROUP
Led by GENCI and its partners
➢ Goals:
➢ anticipate upcoming (pre) exascale architectures
➢ deploy prototypes and prepare our users
➢ organise code modernization
➢ share and pool expertise
➢ Preserve legacy codes by using standards – OpenMP
OUESSANT
OpenPOWER based prototype
OpenPOWER platform @ IDRIS, Orsay (France)
▪ 12 IBM System S822LC “Minsky”, >250 Tflops peak
• 2 IBM Power8 10-core processors @ 4.2GHz
• 128GB of memory per node
• 2 Nvidia P100 GPUs per socket
• Connection socket <-> GPU with NVLink 1.0 (80GB/s)
▪ IB EDR Interconnect
Software stack
▪ Multiple compilers
• PGI (main target OpenACC)
• IBM XL (main target OpenMP)
• LLVM (Fortran issues in 2016)
▪ PowerAI within Docker
High level support
▪ Multiple workshops organised
▪ Thanks to IBM and Nvidia teams
RELEVANT SET OF APPLICATIONS
Represent French research community
18 "real" applications
▪ 2 GPU focused (RAMSES and EMMA)
▪ 1 OpenCL (CMS-MEM)
• No official support at the moment
▪ 15 "standard applications"
coming from various scientific
and industrial domains
4 withdrawals so far
→ Work performed on 14 applications
APPLICATION RESULTS
Scope
The results provided in the papers aim to define
▪ a baseline in terms of performance on one full Minsky node
▪ the porting effort on GPU
▪ the software stack maturity (for code offloading)
Power8 results
▪ no real results on Power8 are shown
▪ codes compile and run; no trouble moving from x86 to Power8 processors
The comparison is made between
▪ Power8 node only (dual socket)
▪ OpenPOWER node (dual socket + 4GPUs)
PERFORMANCE SUMMARY
Preliminary results
The overall performance for these applications at the moment (chart shown on slide):
▪ Mainly CUDA
▪ Unfortunately, no real OpenMP porting at the moment
FIRST CONCLUSIONS
Feedback on OpenPower platform
Power8 processor is easy to use (compile and run)
Programming models
▪ CUDA: very high performance
▪ OpenACC: high performance
▪ OpenMP 4.5: no global feedback at the moment
Compilers
▪ PGI working efficiently (for both Power8 and GPUs with OpenACC)
▪ IBM XL support for OpenMP GPU offloading is steadily improving
First results are very promising
▪ Opening of the platform to the full French community in April 2017
▪ more applications and a new focus on AI applications (50% of applications received)
FOCUS ON METALWALLS
First results
Molecular dynamics application
▪ Co-developed by Université Pierre et Marie Curie and
Maison de la Simulation (UPMC and MdS)
▪ Used for the development of novel storage devices: supercapacitors
MPI + OpenACC
Abel Marin-Lafleche, Matthieu Haefele (MdS)
▪ Development started in Q1 2017
▪ 3500 lines of code (computational part)
▪ First results available after one month
▪ Roughly 90% of the application ported
▪ Porting effort: 2 months
MPI + OpenMP
▪ Development started in Q3 2017
▪ First results available after a week (thanks to OpenACC existing implementation)
▪ Issues with the numerical results at the moment
[Slide figure: supercapacitor schematic, electrode charge Q = f(t)?]
FOCUS ON METALWALLS
Application study
2 main types of functions
▪ One takes more than 80% of the computational time
▪ 2 main loops, which look more or less like the following (see the sketch below):
[Slide figure: source listings of Loop #1 and Loop #2]
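The loop listings themselves exist only as an image on the original slide. As a rough guide, here is a minimal Fortran sketch of the kind of nest being described; the array names, extents and kernel body are invented for illustration and are not the actual MetalWalls source.

! Illustrative sketch only: array names, extents and the loop body are
! assumptions, not the actual MetalWalls source.
subroutine loop1_sketch(buffer, q, nl, nm, nn, ni)
  implicit none
  integer, intent(in)    :: nl, nm, nn, ni
  real(8), intent(in)    :: q(ni)
  real(8), intent(inout) :: buffer(nn, nm, nl)
  integer :: l, m, n, i
  real(8) :: acc

  ! The l, m, n loops carry the coarse-grain parallelism; the inner
  ! i loop carries the SIMD work (as described on the next slide).
  do l = 1, nl
    do m = 1, nm
      do n = 1, nn
        acc = 0.0d0
        do i = 1, ni
          acc = acc + q(i)            ! placeholder for the real kernel body
        end do
        buffer(n, m, l) = acc
      end do
    end do
  end do
end subroutine loop1_sketch

Loop #2 is described as having the same shape.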
FOCUS ON METALWALLS
Application study
Function study
▪ The l, m, n loops expose enough parallelism
▪ SIMD is expressed through the inner i loop
Loop #1 (the same applies to Loop #2)
High enough parallelism + SIMD: the conditions for efficient use of the GPU
FOCUS ON METALWALLS
Memory management
OpenACC directives added to both loops (see the sketch below)
• Can we improve the memory management?
[Slide figure: OpenACC-annotated Loop #1 and Loop #2]
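The annotated listing is again only on the slide image. A minimal sketch of how such OpenACC directives might be placed on Loop #1 follows; the clause choices (collapse, gang/vector split, reduction) are assumptions, not the actual MetalWalls directives.

! Sketch of OpenACC directives on Loop #1; clause choices are assumptions.
subroutine loop1_openacc(buffer, q, nl, nm, nn, ni)
  implicit none
  integer, intent(in)    :: nl, nm, nn, ni
  real(8), intent(in)    :: q(ni)
  real(8), intent(inout) :: buffer(nn, nm, nl)
  integer :: l, m, n, i
  real(8) :: acc

  ! Coarse parallelism on the collapsed l, m, n loops, vector work on i.
!$acc parallel loop gang collapse(3) private(acc)
  do l = 1, nl
    do m = 1, nm
      do n = 1, nn
        acc = 0.0d0
!$acc loop vector reduction(+:acc)
        do i = 1, ni
          acc = acc + q(i)            ! placeholder kernel body
        end do
        buffer(n, m, l) = acc
      end do
    end do
  end do
end subroutine loop1_openacc

Loop #2 would be annotated the same way. With no explicit data clauses, the data movement is left to the compiler and runtime, which is the question raised above.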
FOCUS ON METALWALLS
Memory management
Managed memory: how does it work?
What we could expect: [Slide figure: CPU ↔ GPU data transfers]
FOCUS ON METALWALLS
Memory management
Managed memory: how does it work?
What is done: smart data management of BUFFER(n,m,l) → fewer memory transfers (see the sketch below)
[Slide figure: CPU ↔ GPU data movement for BUFFER(n,m,l)]
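A minimal self-contained sketch of the managed-memory behaviour, assuming the PGI -ta=tesla:managed option named in the conclusions: allocatable arrays are placed in CUDA unified memory and only the pages actually touched migrate between CPU and GPU, which is the "smart data management" the figure refers to. All names and sizes below are invented.

! Managed-memory sketch (assumes compilation with PGI and -ta=tesla:managed,
! so that ALLOCATABLE arrays are placed in CUDA unified memory).
program managed_sketch
  implicit none
  integer, parameter :: nn = 64, nm = 64, nl = 64
  real(8), allocatable :: buffer(:,:,:)
  integer :: l, m, n

  allocate(buffer(nn, nm, nl))
  buffer = 0.0d0                      ! first touch on the CPU

  ! No data clauses: the runtime migrates the touched pages of buffer to
  ! the GPU on demand ("smart data management" -> fewer memory transfers).
!$acc parallel loop collapse(3)
  do l = 1, nl
    do m = 1, nm
      do n = 1, nn
        buffer(n, m, l) = buffer(n, m, l) + 1.0d0
      end do
    end do
  end do

  print *, sum(buffer)                ! pages migrate back when read on the CPU
end program managed_sketch

Under these assumptions the build line would look something like: pgfortran -acc -ta=tesla:managed managed_sketch.f90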
FOCUS ON METALWALLS
Memory management
Explicit data management
▪ OpenACC data directives added to the function to see the impact (see the sketch below)
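A minimal sketch of the explicit ("guided") data management being compared here; the array names and clause choices are assumptions rather than the actual MetalWalls directives.

! Sketch of explicit ("guided") OpenACC data management around the hot
! function; names and clause choices are assumptions.
subroutine loop1_guided(buffer, q, nl, nm, nn, ni)
  implicit none
  integer, intent(in)    :: nl, nm, nn, ni
  real(8), intent(in)    :: q(ni)
  real(8), intent(inout) :: buffer(nn, nm, nl)
  integer :: l, m, n, i
  real(8) :: acc

  ! Explicit data region: q is only read on the GPU, buffer is read and
  ! written, so nothing is left to the managed-memory runtime.
!$acc data copyin(q) copy(buffer)
!$acc parallel loop gang collapse(3) private(acc)
  do l = 1, nl
    do m = 1, nm
      do n = 1, nn
        acc = 0.0d0
!$acc loop vector reduction(+:acc)
        do i = 1, ni
          acc = acc + q(i)            ! placeholder kernel body
        end do
        buffer(n, m, l) = acc
      end do
    end do
  end do
!$acc end data
end subroutine loop1_guided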
FOCUS ON METALWALLS
Results
Main function
Speed-ups
▪ Using a single GPU vs a full Power8 node: x11.9
▪ Using 4 GPUs vs 1 GPU: x3.5
• Quite good scalability when using 4 GPUs
▪ 4 GPUs vs a full node: x40.9
▪ Guided memory vs automatic managed memory: no real impact (smart memory management)
Implementation     | Fortran MPI       | OpenACC, managed memory | OpenACC, guided memory | OpenACC, guided memory
Architecture used  | Power8 (20 cores) | 1 GPU (P100)            | 1 GPU (P100)           | 4 GPUs (P100)
Time to solution   | 368 s             | 31 s                    | 31 s                   | 9 s
FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
▪ private (or firstprivate, …) MUST be used, as OpenMP is an explicit model (see the sketch below)
[Slide figure: OpenMP version of Loop #1]
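The OpenMP listing is only on the slide image. A hedged sketch of what an OpenMP 4.5 target port of Loop #1 could look like follows, with the explicit private clause the bullet above insists on; names, clauses and the kernel body are assumptions.

! Sketch of an OpenMP 4.5 target port of Loop #1 (names assumed).
! Unlike OpenACC, the scalar acc MUST be listed explicitly in private(),
! otherwise it is shared between threads and the results are wrong.
subroutine loop1_openmp(buffer, q, nl, nm, nn, ni)
  implicit none
  integer, intent(in)    :: nl, nm, nn, ni
  real(8), intent(in)    :: q(ni)
  real(8), intent(inout) :: buffer(nn, nm, nl)
  integer :: l, m, n, i
  real(8) :: acc

!$omp target teams distribute parallel do collapse(3) &
!$omp   private(acc) map(to: q) map(tofrom: buffer)
  do l = 1, nl
    do m = 1, nm
      do n = 1, nn
        acc = 0.0d0
!$omp simd reduction(+:acc)
        do i = 1, ni
          acc = acc + q(i)            ! placeholder kernel body
        end do
        buffer(n, m, l) = acc
      end do
    end do
  end do
end subroutine loop1_openmp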
FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
[Slide figure: OpenMP version of Loop #2]
FOCUS ON METALWALLS
OpenACC vs OpenMP implementations
Easy to port OpenACC to OpenMP
▪ With OpenMP, you HAVE TO be explicit about the array lengths, otherwise you get wrong results (see the sketch below)
▪ The OpenMP code runs but has numerical stability issues; work ongoing
[Slide figure: data-management directives]
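A minimal sketch of the data-management point above, with assumed names: every mapped array carries explicit bounds, which is what was needed to get correct results in this port.

! Sketch of explicit OpenMP data management around the hot kernel (names
! assumed). Each array section spells out its bounds explicitly.
subroutine run_with_explicit_maps(buffer, q, nl, nm, nn, ni)
  implicit none
  integer, intent(in)    :: nl, nm, nn, ni
  real(8), intent(in)    :: q(ni)
  real(8), intent(inout) :: buffer(nn, nm, nl)

!$omp target data map(to: q(1:ni)) map(tofrom: buffer(1:nn, 1:nm, 1:nl))
  call loop1_openmp(buffer, q, nl, nm, nn, ni)   ! kernel from the previous sketch
!$omp end target data
end subroutine run_with_explicit_maps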
CONCLUSIONS
Overall feedback
Porting to GPU with OpenACC
▪ Work has to be done on the way you express parallelism
• Here we work on large arrays in 3 dimensions
▪ Compiler helps a lot for memory management
• Automatic management using -ta=tesla:managed compilation option
▪ Very good performance
Porting to GPU with OpenMP
▪ Easy if you start from OpenACC
▪ Difficult to get feedback from the XL compiler (compared with OpenACC)
▪ Numerical stability issues at the moment in the application
• Do not know whether they come from the implementation or from the compiler
▪ OpenMP GPU offload directives are not yet understood when targeting CPUs
OpenMP for GPU is an increasingly serious candidate for using GPUs
Thank you for your attention!
Questions?
gabriel.hautreux@genci.fr