Post-K: Building the Arm HPC Ecosystem

By Koichi Hirai, Fujitsu

Post-K is an Arm-based supercomputer. However, there are not yet many Arm-based servers for HPC, so we believe an Arm HPC ecosystem needs to be built before Post-K is released. In this presentation, we describe our collaboration efforts to build that ecosystem.

For more information on Linaro High Performance Computing (HPC), visit https://www.linaro.org/sig/hpc/

Published in: Technology

Post-K: Building the Arm HPC Ecosystem

  1. 1. Post-K: Building the Arm HPC Ecosystem
       Kouichi Hirai, FUJITSU LIMITED
       Linaro Workshop, Dec. 12, 2017
       Copyright 2017 FUJITSU LIMITED
  2. 2. Post-K: Building up Arm HPC Ecosystem
       • Fujitsu’s approach for HPC
       • For making the Post-K a resounding success
       • The high performance compiler increases software portability
       • Summary
  3. 3. Fujitsu HPC Solutions to Meet Customer Demands
       • Supercomputers, both Fujitsu-developed CPUs and x86
       • Single system image operation w/ Fujitsu system software
       • High performance, high availability, and high reliability
       Figure: product lineup by scale (High-end, Divisional, Departmental, Workgroup)
       • Supercomputers: K computer (co-developed with RIKEN, image © RIKEN), PRIMEHPC FX10, PRIMEHPC FX100, Post-K (under development w/ RIKEN). High scalability with Fujitsu-developed CPU and interconnect
       • x86 cluster: RX2530/RX2540, CX400, CX600. PRIMERGY x86 cluster systems support the latest CPUs and accelerators
       • Large-scale SMP system: RX900
  4. 4. Fujitsu High-end Supercomputers Development
       Figure: timeline, 2011 to 2020
       • FUJITSU: K computer development and operation, PRIMEHPC FX10, PRIMEHPC FX100, Post-K computer development
       • Japan’s national projects: HPCI strategic apps program, FS projects, app. review
       • PRIMEHPC FX10: 1.8x CPU perf. of K; easier installation
       • PRIMEHPC FX100: 4x (DP) / 8x (SP) CPU perf. of K, Tofu2; high-density pkg & lower energy
       K computer and PRIMEHPC FX10/FX100 in operation
       • The CPU and interconnect of FX10/FX100 inherit the K computer architectural concept, featuring state-of-the-art technologies
       • System software “TCS” supports Fujitsu supercomputers with originally introduced technologies
       • Many applications are currently running and being developed for science and various industries
       Post-K supercomputer
       • RIKEN and Fujitsu are working together to provide a successor to the K computer with application R&D teams using a co-design approach
       Technical Computing Suite (TCS)
       • Handles millions of parallel jobs
       • FEFS: super scalable file system
       • MPI: ultra scalable collective communication libraries
       • OS: lower OS jitter w/ assistant core
  5. 5. Post-K Features and Status
       • Fujitsu CPU core (w/ Arm SVE) and Tofu maintain the programming models and provide high application performance
       • RIKEN & Fujitsu system software enable high performance and low power consumption with flexible operations
       • Apps from 9 “priority issues” & many “exploratory challenges” are being optimized for the Post-K

       Functions & architecture                 | Post-K   | FX100    | FX10     | K
       CPU core: instruction set architecture   | Armv8-A  | SPARC V9 | SPARC V9 | SPARC V9
       CPU core: SIMD width                     | 512bit   | 256bit   | 128bit   | 128bit
       CPU core: double precision (64bit)       | ✔        | ✔        | ✔        | ✔
       CPU core: single precision (32bit)       | ✔        | ✔        | ✔        | ✔
       CPU core: half precision (16bit)         | ✔        | -        | -        | -
       Interconnect: Tofu interconnect          | Enhanced | Tofu2    | Tofu     | Tofu
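The SIMD widths in the table translate directly into the number of elements processed per vector register. The following is a minimal sketch of that arithmetic in C; the function name `simd_lanes` is hypothetical and is not part of any Fujitsu or Arm API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of elements ("lanes") that fit in one
 * SIMD register of simd_bits bits when each element is elem_bits wide. */
size_t simd_lanes(size_t simd_bits, size_t elem_bits)
{
    /* The element width must be non-zero and divide the register width. */
    assert(elem_bits > 0 && simd_bits % elem_bits == 0);
    return simd_bits / elem_bits;
}
```

With the Post-K width of 512 bits, `simd_lanes(512, 64)` gives 8 double-precision lanes, `simd_lanes(512, 32)` gives 16 single-precision lanes, and `simd_lanes(512, 16)` gives 32 half-precision lanes. The K computer's 128-bit width holds 2 double-precision lanes by the same calculation.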
  6. 6. Post-K Software Stack
       • Valuable feedback through “co-design” from application R&D teams
       Figure: software stack, top to bottom (Post-K: under development w/ RIKEN)
       • Post-K Applications
       • FUJITSU Technical Computing Suite / RIKEN Advanced System Software
           • Management Software: system management for highly available & power saving operation; job management for higher system utilization & power efficiency
           • Hierarchical File I/O Software: Lustre-based distributed file system FEFS; application-oriented file I/O middleware
           • Programming Environment: MPI (Open MPI, MPICH); OpenMP, COARRAY, Math Libs; XcalableMP; Compilers (C, C++, Fortran); debugging and tuning tools
       • Linux OS / McKernel (Lightweight Kernel)
       • Post-K System Hardware
  7. 7. Post-K to be More Useful?
       • More apps from OSS & ISVs
       • High performance on “real” applications
       • Lower TCO
           • Low power consumption
           • Water cooling
       • De-facto standards
           • Lowering barriers in developing and porting
       • Ecosystem
           • More Arm platforms
           • More partners
           • More knowledge/experience inside/outside of communities
  8. 8. Making the Post-K a Resounding Success
       • Recapping the goal & requirements
           • High performance HW and SW complying with open standards
           • Apps in quality & variety
           • Environments: rich, modern, and comprehensive
       • Our approach
           • Arm architecture (w/ Fujitsu’s proven microarchitecture)
               • SBSA: Server Base System Architecture
               • SBBR: Server Base Boot Requirements
               • VLA: Vector-Length Agnostic
           • Fujitsu enhanced/maintained system software
               • Based on Linux & OSS
               • Single source for x86 & Arm
               • Open MPI, OpenMP, libraries
               • Performance analyzer, debugger
           • Powerful but original compilers: will be aligned to be useful & popular
       • Callouts: Assure binary compatibility; Lowering barriers for single source development
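The vector-length agnostic (VLA) idea can be illustrated without any SVE hardware. The sketch below is a plain C simulation, not SVE intrinsics or compiler output: the vector length `vl` is a run-time parameter, and a per-lane predicate switches off lanes that fall past the end of the array. The function name `vla_multiply` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated vector-length agnostic loop: a[i] = b[i] * c[i].
 * vl is the number of lanes per "vector" and is only known at run time,
 * so the same code works unchanged for any vector length. */
void vla_multiply(float *a, const float *b, const float *c,
                  size_t n, size_t vl)
{
    assert(vl > 0);
    for (size_t base = 0; base < n; base += vl) {
        /* One simulated vector operation over vl lanes. */
        for (size_t lane = 0; lane < vl; ++lane) {
            size_t i = base + lane;
            int active = (i < n);   /* predicate: lane is inside the array */
            if (active) {
                a[i] = b[i] * c[i];
            }
        }
    }
}
```

Calling `vla_multiply` with `vl = 4` and with `vl = 16` on the same 10-element input produces identical results, which is the property that lets one binary run on implementations with different vector lengths.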
  9. 9. Compilers to Increase Software Portability
       • Transform our original & powerful compilers to be all-around
           • Working on and contributing to the Clang project to satisfy both high performance and portability
       • Fujitsu’s back-end advantage
           • Auto-parallelization for many-core architecture
           • Auto-vectorization for Scalable Vector Extension
           • Strong software pipelining with loop fission
       Figure: compiler structure
       • Input: Software (apps, middleware, and basics, written in a variety of styles)
       • Front-ends: Fujitsu original front-end; Clang front-end
       • Back-ends: Fujitsu original back-end from knowledge of CPU development; Clang back-end
       • Output: portable binaries
       • Utilize Post-K μArch: rich & wide SIMD, sector cache…
  10. 10. Auto-vectorization for Arm SVE
       • 4 Byte x 16 SIMD list memory access by utilizing the 512bit register
           int index[n];
           float P[n], Q[n];
           for (i=0; i<n; ++i) { P[i] = Q[index[i]]; }
         Figure: elements of Q in memory are gathered into the SVE destination register through an index register
       • Various types of SIMD optimization by utilizing predicate registers
           • Loop including IF clause
               for (int i=0; i<n; ++i) { if (mask[i] != 0) { a[i] = b[i]; } }
           • Small loop less than SIMD length
               for (int i=0; i<VL/2; ++i) { a[i] = b[i] * c[i]; }
           • While loop with data dependency
               do { b[i] = a[i]; } while (a[i++] != 0);
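The loop shapes on this slide can be written as small, portable C functions. The sketch below gives scalar reference versions only; the function names are hypothetical, and whether a particular compiler turns them into SVE predicated or gather instructions is not something this example demonstrates.

```c
#include <assert.h>
#include <stddef.h>

/* Loop including an IF clause: copy b[i] to a[i] only where mask[i] != 0.
 * A predicated SIMD unit can express the condition as a per-lane mask. */
void masked_copy(float *a, const float *b, const int *mask, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (mask[i] != 0) {
            a[i] = b[i];
        }
    }
}

/* List (indexed) memory access: p[i] = q[index[i]], i.e. a gather. */
void gather(float *p, const float *q, const int *index, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        assert(index[i] >= 0);   /* indices must be valid offsets into q */
        p[i] = q[index[i]];
    }
}

/* While loop with a data-dependent exit: copy up to and including the
 * first zero in a. Returns the number of elements copied. */
size_t copy_until_zero(int *b, const int *a)
{
    size_t i = 0;
    do {
        b[i] = a[i];
    } while (a[i++] != 0);
    return i;
}
```

For example, `copy_until_zero` applied to the input `{3, 5, 0, 7}` copies three elements (3, 5, and the terminating 0) and leaves the rest of the destination untouched.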
  11. 11. Fujitsu Compiler Back-end Optimization Flow
       • Loop Fission reduces required resources, such as registers
       • Software Pipelining and Register Allocation
           • Best utilization of hardware functions and resources
       Figure: back-end optimization pipeline
       • Stages: SIMDize -> Loop Fission -> Software Pipelining -> Register Allocation -> Instruction Scheduling -> portable Arm binaries
       • Loop transformation: Original loop -> Divided #1 and Divided #2 (reduced # of regs.) -> Software pipelined #1 and Software pipelined #2 (higher ILP)
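Loop fission itself is easy to show by hand. The following is a minimal sketch with hypothetical function names and an invented loop body, not output from the Fujitsu compiler: one loop with two independent statements is split into two loops, each of which touches fewer arrays and so needs fewer live registers per iteration. The sketch assumes that the output arrays do not overlap each other or the inputs, which is what makes the split legal.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: both statements share one loop body. */
void update_fused(double *x, double *y,
                  const double *u, const double *v, size_t n)
{
    assert(x != y);   /* assumption: outputs do not alias */
    for (size_t i = 0; i < n; ++i) {
        x[i] = u[i] * u[i] + 1.0;
        y[i] = v[i] * 0.5 - u[i];
    }
}

/* After loop fission: the statements are independent, so each gets its
 * own loop. Each loop has a smaller working set per iteration. */
void update_fissioned(double *x, double *y,
                      const double *u, const double *v, size_t n)
{
    assert(x != y);   /* assumption: outputs do not alias */
    for (size_t i = 0; i < n; ++i) {   /* divided loop #1 */
        x[i] = u[i] * u[i] + 1.0;
    }
    for (size_t i = 0; i < n; ++i) {   /* divided loop #2 */
        y[i] = v[i] * 0.5 - u[i];
    }
}
```

Both versions compute the same results; the fissioned form simply gives later stages such as software pipelining and register allocation smaller loop bodies to work with.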
  12. 12. Effectiveness of SWP w/ Loop Fission and SoA
       • Runs on FX100 w/ 32 registers
           • 72% speed-up per core is observed
           • >2x speed-up compared w/ K computer
       • Software Pipelining w/ Loop Fission utilizes CPU resources
       • SoA-style layout extracts more performance
       Figure: NICAM* single core performance on FX100 w/ 32 regs (Source: http://www.riken.jp/pr/topics/2013/20130920_1/)
       • Vertical axis: CPU clocks normalized by K computer
       • Bars: without loop fission; SWP w/ loop fission + SoA style
       • Annotation: 72% speedup w/ loop fission + SoA
       *NICAM-DC-MINI: Climate simulations with fine mesh, https://github.com/fiber-miniapp/nicam-dc-mini
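The slide does not show the data layouts involved, so the following is only an illustrative sketch of the general array-of-structures (AoS) versus structure-of-arrays (SoA) distinction. The type and function names are hypothetical and are not taken from NICAM-DC-MINI.

```c
#include <assert.h>
#include <stddef.h>

/* AoS: the fields of one cell are adjacent in memory, so reading only u
 * for every cell strides over v and w as well. */
struct cell_aos {
    double u, v, w;
};

/* SoA: each field is its own contiguous array, so reading only u touches
 * consecutive memory, which suits wide SIMD loads. */
struct cells_soa {
    double *u;
    double *v;
    double *w;
};

double sum_u_aos(const struct cell_aos *cells, size_t n)
{
    assert(cells != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += cells[i].u;      /* stride of sizeof(struct cell_aos) */
    }
    return sum;
}

double sum_u_soa(const struct cells_soa *cells, size_t n)
{
    assert(cells != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += cells->u[i];     /* unit stride */
    }
    return sum;
}
```

Both functions return the same sum for the same data; the difference is only in memory access pattern, which is the aspect an SoA-style rewrite is meant to improve.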
  13. 13. Summary
       • Fujitsu’s Approach to HPC
           • Supporting high-end supercomputers with original CPU & x86 clusters
           • Developing the Post-K for app performance and low power consumption
           • Expecting more apps from OSS & ISVs through growing ecosystem
       • Keys for Post-K Success
           • High performance standard-compliant HW and SW
           • All-around high performance compiler with binary compatibility
           • Many and varied high quality apps with x86 software compatibility
       • Open & Highly Optimized Compilers
           • Clang + Fujitsu technologies
           • Tentative evaluation results are encouraging