IBM POWER8 as an HPC platform
Alexander Pozdneev, Georgy Pavlov
IBM
October 23, 2015 — IBM Linux on Power: Platform News
1 c 2015 IBM Corporation
What is HPC?
• HPC — High Performance Computing
• A.k.a. technical computing
• Aeroacoustics: Effects of chevrons on jet noise
• Supersonic jet engine noise computational fluid dynamics simulation
• 128k Blue Gene/P cores — ≈ 100 hours
• 1M Blue Gene/Q cores — ≈ 12 hours
http://youtu.be/cjoz5tncRUs http://youtu.be/uxT-VmY3OWc
2 c 2015 IBM Corporation
Secrets of the Dark Universe
• Cosmology: The evolution of the Universe simulation
• Understanding the physics of the dark matter and energy
• 1 BG/Q rack — 68B particles
• 32 BG/Q racks — 1.1T particles
http://www.youtube.com/watch?v=tdv8yrJk4VE http://www.youtube.com/watch?v=-S-T_iTiAxQ
3 c 2015 IBM Corporation
Real-time modeling of human heart ventricles
• Physiology: Simulation of drug-induced
arrhythmias
• Resolution — 0.1 mm
• 768k Blue Gene/Q cores
• 43% peak
• http://dl.acm.org/citation.cfm?id=2388999
• LLNL, IBM Research, IBM Research
Collaboratory for Life Sciences
4 c 2015 IBM Corporation
Modelling of a complete human viral pathogen poliovirus
• Molecular biology: Reconstruction and simulation of poliovirus
• Antiviral drugs, virus infection, modelling related viruses
• 3.3M–3.7M atoms
• Blue Gene/Q, Victorian Life Sciences Computing Initiative
• http://www.youtube.com/watch?v=Nih0Qa673FY
5 c 2015 IBM Corporation
Speakers
• Alexander Pozdneev
Research Software Engineer
HPC
• Georgy Pavlov
Software Engineer
ESSL Russian team leader
6 c 2015 IBM Corporation
Outline
1 Data centric computing as a new HPC paradigm
2 Architecture of IBM HPC systems based on POWER8+NVIDIA servers
3 Software stack of IBM HPC systems
4 IBM HPC mathematical libraries
5 Measuring efficiency of an HPC system on real applications
6 Summary
7 c 2015 IBM Corporation
Application diversity
Image credit: http://www.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=POL03229USEN
8 c 2015 IBM Corporation
Data centric computing as a new HPC paradigm
• That is all about moving data around
• Memory bandwidth
• Memory latency
• High value of “memory access operations” / “computations”
• Number of FLOPs1 per cycle is no longer relevant
• Offloading computing to memory (Active Memory Cube by Micron)
1
FLOP — Floating-Point Operation
9 c 2015 IBM Corporation
Department of Energy CORAL project
10 c 2015 IBM Corporation
IBM systems in CORAL project
11 c 2015 IBM Corporation
Architecture of POWER8+NVIDIA systems for HPC
12 c 2015 IBM Corporation
Overview of IBM Power System S822LC
Power S822LC model 8335-GTA
• POWER8 processor module:
8-core, 3.32 GHz
10-core, 2.92 GHz
• Two sockets
• Graphics processing units
Two NVIDIA K80 GPUs
• Eight memory slots
• 2U height
13 c 2015 IBM Corporation
Architecture of IBM Power System S822LC
14 c 2015 IBM Corporation
System software
Software stack of IBM HPC systems
• System software
Operating system: Linux, bare-metal (no virtualization)
Drivers:
• Mellanox InfiniBand OFED
• NVIDIA
Deployment: xCAT
• Parallel operating environment
IBM Parallel Environment Runtime Edition (PE RTE)
Workload scheduler: IBM Platform LSF
Parallel filesystem: IBM Spectrum Scale (“GPFS”2
)
2
General Parallel File System
15 c 2015 IBM Corporation
Development tools
Software stack of IBM HPC systems
• Compilers
IBM XL C/C++/Fortran compilers
IBM Advance Toolchain, http://ibm.co/AdvanceToolchain
• Fork of GNU compiler/tools optimized for POWER8
• gcc, g++, gfortran
• Analysis tools (oprofile, valgrind, itrace)
Vanilla GCC, binutils, etc.
CUDA Toolkit
• IBM Parallel Environment Developer Edition (PE DE)
• IBM Software Development Kit for Linux on Power
• Mathematical libraries
Mathematical Acceleration Subsystem (MASS)
IBM Engineering and Scientific Subroutine Library (ESSL)
IBM Parallel ESSL
16 c 2015 IBM Corporation
Engineering and Scientific Subroutine Library
• High-performance mathematical functions
Scientific applications
Engineering applications
• Platforms
IBM POWER servers
IBM POWER clusters
• Libraries
ESSL Serial and SMP: 600+ subroutines
(SMP — Symmetric Multi-Processing)
Parallel ESSL: 125+ subroutines
• Languages:
C
C++
Fortran
http://www.ibm.com/systems/power/software/essl
17 c 2015 IBM Corporation
ESSL: Industry de facto standards
• ESSL implements the following interfaces:
BLAS (linear algebra)
LAPACK (linear algebra)
FFTW (Fourier transformation)
• Parallel ESSL implements the following interfaces:
ScaLAPACK
• Easy migration
• Just recompile! http://fftw.org
18 c 2015 IBM Corporation
What mathematical areas are supported?
ESSL
• Linear algebra subprograms
• Matrix operations
• Linear algebraic equations
• Eigensystems analysis
• Fourier transforms, convolution,
correlation, . . .
• Sorting and searching
• Interpolation
• Numerical quadrature
• Random number generation
Parallel ESSL
• BLACS
• Level 2 parallel BLAS
• Level 3 parallel BLAS
• Linear algebraic equations
• Eigensystems analysis
• Fourier transforms
• Random number generation
19 c 2015 IBM Corporation
How to leverage the hardware?
Symmetric multiprocessing:
• Multiple hardware threads
• Multiple cores
POWER8+NVIDIA:
• Use multiple GPUs
• Select which GPU to use
• Run ESSL in a hybrid mode
20 c 2015 IBM Corporation
Synthetic benchmarks vs. real apps
Measuring car pollution in official tests?
• You get low toxic nitrogen oxides in a lab environment
• You cannot predict how much smoke you produce,
unless you test your scenarios
• You would run a testdrive prior to car purchase
21 c 2015 IBM Corporation
Threads behavior: Typical vs. HPC
Typical workload
HPC workload
22 c 2015 IBM Corporation
Importance of threads affinity
NAS Parallel Benchmarks, mg.C (peaks at SMT1), 20 cores
23 c 2015 IBM Corporation
Choice of compilation parameters: -O5 -qnohot
NAS Parallel Benchmarks, bt.C, affinity, baseline: -O3 -qhot
24 c 2015 IBM Corporation
Compilation parameters: -O3 -qhot, -O5 -qprefetch
NAS Parallel Benchmarks, mg.C, affinity, baseline: -O5 -qnohot
25 c 2015 IBM Corporation
Choice of an SMT mode: SMT1
NAS Parallel Benchmarks, mg.C, affinity, baseline: SMT8
26 c 2015 IBM Corporation
Choice of an SMT mode: SMT2, SMT4
NAS Parallel Benchmarks, bt.C, affinity, baseline: SMT1
27 c 2015 IBM Corporation
Choice of an SMT mode: SMT8
NAS Parallel Benchmarks, cg.C, affinity, baseline: SMT1
28 c 2015 IBM Corporation
Summary
• Technical computing problems ⇒ need for HPC
• Data centric computing as a new HPC paradigm
• CORAL project
• IBM Power System S822LC model 8335-GTA
• IBM HPC Software stack
• High performance math libraries
• Leveraging performance
29 c 2015 IBM Corporation
Publications
http://www.redbooks.ibm.com/abstracts/sg248263.html
30 c 2015 IBM Corporation
Further reads
• XL C/C++ for Linux,
http://www.ibm.com/support/knowledgecenter/SSXVZZ/
• XL Fortran for Linux,
http://www.ibm.com/support/knowledgecenter/SSAT4T/
• XL C/C++ for Linux 13.1.2 Optimization and Programming Guide,
http:
//www.ibm.com/support/knowledgecenter/SSXVZZ_13.1.2/com.
ibm.xlcpp1312.lelinux.doc/proguide/optimization.html
• XL Fortran for Linux 15.1.2 Optimization and Programming Guide,
http://www.ibm.com/support/knowledgecenter/SSAT4T_15.1.
2/com.ibm.xlf1512.lelinux.doc/proguide/optimization.html
31 c 2015 IBM Corporation
Relevance of LINPACK
• Based on DGEMM()
• 80–90% of peak performance
• Commercial deployment verification test for large systems
• Proprietary binary files run by the installation team
32 c 2015 IBM Corporation
LINPACK vs. HPC analytics
33 c 2015 IBM Corporation
Benchmarking methodology options
1. Take one core for the initial tuning
Try SMT1, SMT2, SMT4, SMT8
(number of threads + affinity + -qtune=pwr8:XXX
Try different optimization options (-O3, -O4, . . . )
2. Choose SMT-mode and compiler options that provide the best timing
3. Take one core as a baseline
Run on 1–5 cores (within one chip)
Run on 5 and 10 cores (within one socket)
Run on 10 and 20 cores
34 c 2015 IBM Corporation
POWER8 features
• Eight threads per core
Hide memory latency (like GPU3
)
Instuction flow is arbitrary (unlike GPU)
• Memory bandwidth
• No sense in benchmarking only one thread (like in GPU)
• Scalability within a core depends only on the application
• Advanced features to try
Transactional memory
Relaxed memory model
Decimal floating point unit
3
GPU — Graphical Processing Unit
35 c 2015 IBM Corporation
Disclaimer
All the information, representations, statements, opinions and proposals in this
document are correct and accurate to the best of our present knowledge but are
not intended (and should not be taken) to be contractually binding unless and
until they become the subject of separate, specific agreement between us.
Any IBM Machines provided are subject to the Statements of Limited Warranty
accompanying the applicable Machine.
Any IBM Program Products provided are subject to their applicable license terms.
Nothing herein, in whole or in part, shall be deemed to constitute a warranty.
IBM products are subject to withdrawal from marketing and or service upon
notice, and changes to product configurations, or follow-on products, may result
in price changes.
Any references in this document to “partner” or “partnership” do not constitute or
imply a partnership in the sense of the Partnership Act 1890.
IBM is not responsible for printing errors in this proposal that result in pricing or
information inaccuracies.
36 c 2015 IBM Corporation
Правовая информация
IBM, логотип IBM, BladeCenter, System Storage и System x являются товарными знаками International Business
Machines Corporation в США и/или других странах. Полный список товарных знаков компании IBM смотрите
на узле Web: www.ibm.com/legal/copytrade.shtml.
Названия других компаний, продуктов и услуг могут являться товарными знаками или знаками обслуживания
других компаний.
(c) 2015 International Business Machines Corporation. Все права защищены.
Упоминание в этой публикации продуктов или услуг корпорации IBM не означает, что IBM предполагает
предоставлять их во всех странах, в которых осуществляет свою деятельность, информация о
предоставлении продуктов или услуг может быть изменена без уведомления. За самой свежей информацией
о продуктах и услугах компании IBM, предоставляемых в Вашем регионе, следует обращаться в ближайшее
торговое представительство IBM или к авторизованным бизнес-партнерам.
Все заявления относительно намерений и перспективных планов IBM могут быть изменены без уведомления.
Информация о продуктах третьих фирм получена от производителей этих продуктов или из опубликованных
анонсов указанных продуктов. IBM не тестировала эти продукты и не может подтвердить
производительность, совместимость, или любые другие заявления относительно продуктов третьих фирм.
Вопросы о возможностях продуктов третьих фирм следует адресовать поставщику этих продуктов.
Информация может содержать технические неточности или типографические ошибки. В представленную в
публикации информацию могут вноситься изменения, эти изменения будут включаться в новые редакции
данной публикации. IBM может вносить изменения в рассматриваемые в данной публикации продукты или
услуги в любое время без уведомления.
Любые ссылки на узлы Web третьих фирм приведены только для удобства и никоим образом не служат
поддержкой этим узлам Web. Материалы на указанных узлах Web не являются частью материалов для
данного продукта IBM.
37 c 2015 IBM Corporation

IBM POWER8 as an HPC platform

  • 1.
    IBM POWER8 asan HPC platform Alexander Pozdneev, Georgy Pavlov IBM October 23, 2015 — IBM Linux on Power: Platform News 1 c 2015 IBM Corporation
  • 2.
    What is HPC? •HPC — High Performance Computing • A.k.a. technical computing • Aeroacoustics: Effects of chevrons on jet noise • Supersonic jet engine noise computational fluid dynamics simulation • 128k Blue Gene/P cores — ≈ 100 hours • 1M Blue Gene/Q cores — ≈ 12 hours http://youtu.be/cjoz5tncRUs http://youtu.be/uxT-VmY3OWc 2 c 2015 IBM Corporation
  • 3.
    Secrets of theDark Universe • Cosmology: The evolution of the Universe simulation • Understanding the physics of the dark matter and energy • 1 BG/Q rack — 68B particles • 32 BG/Q racks — 1.1T particles http://www.youtube.com/watch?v=tdv8yrJk4VE http://www.youtube.com/watch?v=-S-T_iTiAxQ 3 c 2015 IBM Corporation
  • 4.
    Real-time modeling ofhuman heart ventricles • Physiology: Simulation of drug-induced arrhythmias • Resolution — 0.1 mm • 768k Blue Gene/Q cores • 43% peak • http://dl.acm.org/citation.cfm?id=2388999 • LLNL, IBM Research, IBM Research Collaboratory for Life Sciences 4 c 2015 IBM Corporation
  • 5.
    Modelling of acomplete human viral pathogen poliovirus • Molecular biology: Reconstruction and simulation of poliovirus • Antiviral drugs, virus infection, modelling related viruses • 3.3M–3.7M atoms • Blue Gene/Q, Victorian Life Sciences Computing Initiative • http://www.youtube.com/watch?v=Nih0Qa673FY 5 c 2015 IBM Corporation
  • 6.
    Speakers • Alexander Pozdneev ResearchSoftware Engineer HPC • Georgy Pavlov Software Engineer ESSL Russian team leader 6 c 2015 IBM Corporation
  • 7.
    Outline 1 Data centriccomputing as a new HPC paradigm 2 Architecture of IBM HPC systems based on POWER8+NVIDIA servers 3 Software stack of IBM HPC systems 4 IBM HPC mathematical libraries 5 Measuring efficiency of an HPC system on real applications 6 Summary 7 c 2015 IBM Corporation
  • 8.
    Application diversity Image credit:http://www.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=POL03229USEN 8 c 2015 IBM Corporation
  • 9.
    Data centric computingas a new HPC paradigm • That is all about moving data around • Memory bandwidth • Memory latency • High value of “memory access operations” / “computations” • Number of FLOPs1 per cycle is no longer relevant • Offloading computing to memory (Active Memory Cube by Micron) 1 FLOP — Floating-Point Operation 9 c 2015 IBM Corporation
  • 10.
    Department of EnergyCORAL project 10 c 2015 IBM Corporation
  • 11.
    IBM systems inCORAL project 11 c 2015 IBM Corporation
  • 12.
    Architecture of POWER8+NVIDIAsystems for HPC 12 c 2015 IBM Corporation
  • 13.
    Overview of IBMPower System S822LC Power S822LC model 8335-GTA • POWER8 processor module: 8-core, 3.32 GHz 10-core, 2.92 GHz • Two sockets • Graphics processing units Two NVIDIA K80 GPUs • Eight memory slots • 2U height 13 c 2015 IBM Corporation
  • 14.
    Architecture of IBMPower System S822LC 14 c 2015 IBM Corporation
  • 15.
    System software Software stackof IBM HPC systems • System software Operating system: Linux, bare-metal (no virtualization) Drivers: • Mellanox InfiniBand OFED • NVIDIA Deployment: xCAT • Parallel operating environment IBM Parallel Environment Runtime Edition (PE RTE) Workload scheduler: IBM Platform LSF Parallel filesystem: IBM Spectrum Scale (“GPFS”2 ) 2 General Parallel File System 15 c 2015 IBM Corporation
  • 16.
    Development tools Software stackof IBM HPC systems • Compilers IBM XL C/C++/Fortran compilers IBM Advance Toolchain, http://ibm.co/AdvanceToolchain • Fork of GNU compiler/tools optimized for POWER8 • gcc, g++, gfortran • Analysis tools (oprofile, valgrind, itrace) Vanilla GCC, binutils, etc. CUDA Toolkit • IBM Parallel Environment Developer Edition (PE DE) • IBM Software Development Kit for Linux on Power • Mathematical libraries Mathematical Acceleration Subsystem (MASS) IBM Engineering and Scientific Subroutine Library (ESSL) IBM Parallel ESSL 16 c 2015 IBM Corporation
  • 17.
    Engineering and ScientificSubroutine Library • High-performance mathematical functions Scientific applications Engineering applications • Platforms IBM POWER servers IBM POWER clusters • Libraries ESSL Serial and SMP: 600+ subroutines (SMP — Symmetric Multi-Processing) Parallel ESSL: 125+ subroutines • Languages: C C++ Fortran http://www.ibm.com/systems/power/software/essl 17 c 2015 IBM Corporation
  • 18.
    ESSL: Industry defacto standards • ESSL implements the following interfaces: BLAS (linear algebra) LAPACK (linear algebra) FFTW (Fourier transformation) • Parallel ESSL implements the following interfaces: ScaLAPACK • Easy migration • Just recompile! http://fftw.org 18 c 2015 IBM Corporation
  • 19.
    What mathematical areasare supported? ESSL • Linear algebra subprograms • Matrix operations • Linear algebraic equations • Eigensystems analysis • Fourier transforms, convolution, correlation, . . . • Sorting and searching • Interpolation • Numerical quadrature • Random number generation Parallel ESSL • BLACS • Level 2 parallel BLAS • Level 3 parallel BLAS • Linear algebraic equations • Eigensystems analysis • Fourier transforms • Random number generation 19 c 2015 IBM Corporation
  • 20.
    How to leveragethe hardware? Symmetric multiprocessing: • Multiple hardware threads • Multiple cores POWER8+NVIDIA: • Use multiple GPUs • Select which GPU to use • Run ESSL in a hybrid mode 20 c 2015 IBM Corporation
  • 21.
    Synthetic benchmarks vs.real apps Measuring car pollution in official tests? • You get low toxic nitrogen oxides in a lab environment • You cannot predict how much smoke you produce, unless you test your scenarios • You would run a testdrive prior to car purchase 21 c 2015 IBM Corporation
  • 22.
    Threads behavior: Typicalvs. HPC Typical workload HPC workload 22 c 2015 IBM Corporation
  • 23.
    Importance of threadsaffinity NAS Parallel Benchmarks, mg.C (peaks at SMT1), 20 cores 23 c 2015 IBM Corporation
  • 24.
    Choice of compilationparameters: -O5 -qnohot NAS Parallel Benchmarks, bt.C, affinity, baseline: -O3 -qhot 24 c 2015 IBM Corporation
  • 25.
    Compilation parameters: -O3-qhot, -O5 -qprefetch NAS Parallel Benchmarks, mg.C, affinity, baseline: -O5 -qnohot 25 c 2015 IBM Corporation
  • 26.
    Choice of anSMT mode: SMT1 NAS Parallel Benchmarks, mg.C, affinity, baseline: SMT8 26 c 2015 IBM Corporation
  • 27.
    Choice of anSMT mode: SMT2, SMT4 NAS Parallel Benchmarks, bt.C, affinity, baseline: SMT1 27 c 2015 IBM Corporation
  • 28.
    Choice of anSMT mode: SMT8 NAS Parallel Benchmarks, cg.C, affinity, baseline: SMT1 28 c 2015 IBM Corporation
  • 29.
    Summary • Technical computingproblems ⇒ need for HPC • Data centric computing as a new HPC paradigm • CORAL project • IBM Power System S822LC model 8335-GTA • IBM HPC Software stack • High performance math libraries • Leveraging performance 29 c 2015 IBM Corporation
  • 30.
  • 31.
    Further reads • XLC/C++ for Linux, http://www.ibm.com/support/knowledgecenter/SSXVZZ/ • XL Fortran for Linux, http://www.ibm.com/support/knowledgecenter/SSAT4T/ • XL C/C++ for Linux 13.1.2 Optimization and Programming Guide, http: //www.ibm.com/support/knowledgecenter/SSXVZZ_13.1.2/com. ibm.xlcpp1312.lelinux.doc/proguide/optimization.html • XL Fortran for Linux 15.1.2 Optimization and Programming Guide, http://www.ibm.com/support/knowledgecenter/SSAT4T_15.1. 2/com.ibm.xlf1512.lelinux.doc/proguide/optimization.html 31 c 2015 IBM Corporation
  • 32.
    Relevance of LINPACK •Based on DGEMM() • 80–90% of peak performance • Commercial deployment verification test for large systems • Proprietary binary files run by the installation team 32 c 2015 IBM Corporation
  • 33.
    LINPACK vs. HPCanalytics 33 c 2015 IBM Corporation
  • 34.
    Benchmarking methodology options 1.Take one core for the initial tuning Try SMT1, SMT2, SMT4, SMT8 (number of threads + affinity + -qtune=pwr8:XXX Try different optimization options (-O3, -O4, . . . ) 2. Choose SMT-mode and compiler options that provide the best timing 3. Take one core as a baseline Run on 1–5 cores (within one chip) Run on 5 and 10 cores (within one socket) Run on 10 and 20 cores 34 c 2015 IBM Corporation
  • 35.
    POWER8 features • Eightthreads per core Hide memory latency (like GPU3 ) Instuction flow is arbitrary (unlike GPU) • Memory bandwidth • No sense in benchmarking only one thread (like in GPU) • Scalability within a core depends only on the application • Advanced features to try Transactional memory Relaxed memory model Decimal floating point unit 3 GPU — Graphical Processing Unit 35 c 2015 IBM Corporation
  • 36.
    Disclaimer All the information,representations, statements, opinions and proposals in this document are correct and accurate to the best of our present knowledge but are not intended (and should not be taken) to be contractually binding unless and until they become the subject of separate, specific agreement between us. Any IBM Machines provided are subject to the Statements of Limited Warranty accompanying the applicable Machine. Any IBM Program Products provided are subject to their applicable license terms. Nothing herein, in whole or in part, shall be deemed to constitute a warranty. IBM products are subject to withdrawal from marketing and or service upon notice, and changes to product configurations, or follow-on products, may result in price changes. Any references in this document to “partner” or “partnership” do not constitute or imply a partnership in the sense of the Partnership Act 1890. IBM is not responsible for printing errors in this proposal that result in pricing or information inaccuracies. 36 c 2015 IBM Corporation
  • 37.
    Правовая информация IBM, логотипIBM, BladeCenter, System Storage и System x являются товарными знаками International Business Machines Corporation в США и/или других странах. Полный список товарных знаков компании IBM смотрите на узле Web: www.ibm.com/legal/copytrade.shtml. Названия других компаний, продуктов и услуг могут являться товарными знаками или знаками обслуживания других компаний. (c) 2015 International Business Machines Corporation. Все права защищены. Упоминание в этой публикации продуктов или услуг корпорации IBM не означает, что IBM предполагает предоставлять их во всех странах, в которых осуществляет свою деятельность, информация о предоставлении продуктов или услуг может быть изменена без уведомления. За самой свежей информацией о продуктах и услугах компании IBM, предоставляемых в Вашем регионе, следует обращаться в ближайшее торговое представительство IBM или к авторизованным бизнес-партнерам. Все заявления относительно намерений и перспективных планов IBM могут быть изменены без уведомления. Информация о продуктах третьих фирм получена от производителей этих продуктов или из опубликованных анонсов указанных продуктов. IBM не тестировала эти продукты и не может подтвердить производительность, совместимость, или любые другие заявления относительно продуктов третьих фирм. Вопросы о возможностях продуктов третьих фирм следует адресовать поставщику этих продуктов. Информация может содержать технические неточности или типографические ошибки. В представленную в публикации информацию могут вноситься изменения, эти изменения будут включаться в новые редакции данной публикации. IBM может вносить изменения в рассматриваемые в данной публикации продукты или услуги в любое время без уведомления. Любые ссылки на узлы Web третьих фирм приведены только для удобства и никоим образом не служат поддержкой этим узлам Web. Материалы на указанных узлах Web не являются частью материалов для данного продукта IBM. 37 c 2015 IBM Corporation