Technologies beyond the K computer
September 5th, 2012

Takashi Aoki
Next Generation Technical Computing Unit
Fujitsu Limited
Agenda
 Corporate profile
 Fujitsu supercomputer past and present
 Second generation Petascale supercomputer PRIMEHPC FX10
   Hardware
   Software
 Challenge to the future




Sep 5th, 2012 TACC-2012            1/41             Copyright 2012 FUJITSU LIMITED
Who we are

 Japan’s largest IT services provider and No. 3 in the world.*
 We do everything in ICT. We use our experience and the power of ICT to shape the future of society with our customers.
 Over 170,000 Fujitsu people support customers in more than 100 countries.

 * 2011 IT Services Vendor Revenue. Source: Gartner, "Market Share: IT Services, 2011", 9 April 2012
Our products and services

 Technology Solutions
    Services: our datacenters in the world
    Systems platform: PRIMERGY TX120, ETERNUS DX8000, Supercomputer PRIMEHPC FX10
 Ubiquitous Product Solutions
    LIFEBOOK E751C, Smart phone F07D, Tablet PC ARROWS
 Device solutions
    High-end multi-core processor SPARC64 VII+
    FM3 family (32-bit RISC MCU)
    FRAM (Ferroelectric Random Access Memory)
Where we work

 ‘shaping tomorrow with you’ wherever you are. (As of March 2012)
    Japan: 107,000
    EMEA: 31,000
    Asia-Pacific: 27,000
    Americas: 8,000
 Over 170,000 Fujitsu colleagues working with customers in over 100 countries
Fujitsu HPC Servers - past and present -

 Timeline figure, in roughly chronological order:
    F230-75APU: Japan’s first vector (array) supercomputer (1977)
    VP Series
    VPP500
    NWT*: developed with NAL; No.1 in Top500 (Nov. 1993); Gordon Bell Prize (1994, 95, 96)
    AP1000
    VPP300/700
    AP3000
    VPP5000: world’s fastest vector processor (1999)
    PRIMEPOWER HPC2500: world’s most scalable supercomputer (2003)
    PRIMERGY RX200 cluster node: Japan’s largest cluster in Top500 (July 2004)
    SPARC Enterprise
    PRIMEQUEST
    HX600 cluster node
    FX1: most efficient performance in Top500 (Nov. 2008)
    PRIMERGY BX900 cluster node
    K computer: No.1 in Top500 (June and Nov., 2011)
    FX10
    PRIMERGY CX400 skinless server (coming soon)
 *NWT: Numerical Wind Tunnel
HPC Platform Solutions - Hardware -
 Full range coverage with choice of HPC hardware platform
    Range covered: High-end, Divisional, Departmental, Work Group
    Petascale Supercomputer: PRIMEHPC FX10
       High performance scaling over several PFlops
       Fujitsu proprietary CPU and interconnect technologies for high performance, high reliability and high operability
    x86 HPC Cluster: CX400 skinless server, BX Series (BX900, BX400), RX Series (RX200)
       High-performance de facto HPC cluster
       Follows the Intel CPU and MIC roadmap and adopts Fujitsu’s latest packaging technologies for high performance and high operability
    Large-Scale SMP System: RX900 (PRIMERGY series)
Design targets and features of FX10

 Customer requirements and FX10 design targets
    High performance
       High peak performance and high application performance
    High parallel application productivity
       Easy to achieve high performance when running highly parallel programs, without inordinate programming effort
    High operability
       Low power consumption
       High reliability and ease of operation
    K computer compatibility
       Binary compatibility
       Same programming environment
Design targets and features of FX10

 FX10 features for each customer requirement and design target
    High performance
       High-performance CPU “SPARC64 IXfx” with SPARC V9 + HPC-ACE architecture
       High-performance, highly reliable and fault-tolerant 6D mesh/torus interconnect “Tofu” *1
    High parallel application productivity
       “VISIMPACT” *2 supports efficient hybrid parallel execution
       Parallel language, programming tools and Petascale HPC middleware for high reliability and operability
    High operability
       Water cooling system
       High-reliability components and functions based on mainframe development experience
    K computer compatibility
       Binary compatibility
       Same programming environment
 *1) Tofu: Torus Fusion
 *2) VISIMPACT: Virtual Single Processor by Integrated Multicore Parallel Architecture
PRIMEHPC FX10 System Configuration
 Compute node configuration
    SPARC64™ IXfx CPU
    DDR3 memory
    ICC (Interconnect Control Chip)
 System configuration (figure labels)
    Compute nodes
    I/O nodes, attached through the Tofu interconnect for I/O
    Local file system on local disks
    IO network (IB or GB)
    Global file system: file servers and global disk
    Management servers, portal servers and login server
 IB: InfiniBand
 GB: Gigabit Ethernet
FX10 System H/W Specifications
 PRIMEHPC FX10 H/W specifications
    Unit                        Item                    Value
    CPU                         Name                    SPARC64™ IXfx
                                Performance             236.5 GFlops @ 1.848 GHz
    Node                        Configuration           1 CPU / node
                                Memory capacity         32 or 64 GB
    Rack                        Performance/rack        22.7 TFlops
    System (4 to 1,024 racks)   No. of compute nodes    384 to 98,304
                                Performance             90.8 TFlops to 23.2 PFlops
                                Memory                  12 TB to 6 PB
 Packaging hierarchy
    SPARC64™ IXfx CPU: 16 cores/socket, 236.5 GFlops
    System board: 4 nodes (4 CPUs)
    System rack: 96 compute nodes, 6 I/O nodes, with optional water cooling exhaust unit
    System: max. 23.2 PFlops, max. 1,024 racks, max. 98,304 CPUs
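The system-level rows of the specification table follow from the per-node values. The sketch below is a plausibility check rather than vendor tooling: it rebuilds the node count, peak performance and memory for the smallest and largest configurations. The function name and the 1 TB = 1024 GB convention are assumptions made for this illustration; that convention happens to reproduce the 12 TB and 6 PB figures.

```python
# Rebuild the system-level figures of the FX10 specification table from
# the per-node and per-rack values quoted on the slide.
NODE_GFLOPS = 236.5          # peak per node (1 CPU per node)
NODES_PER_RACK = 96          # compute nodes per rack
MEM_PER_NODE_GB = (32, 64)   # supported memory capacities per node


def system_figures(racks, mem_per_node_gb):
    """Return (nodes, peak in TFlops, memory in TB) for a rack count."""
    nodes = racks * NODES_PER_RACK
    peak_tflops = nodes * NODE_GFLOPS / 1000.0
    memory_tb = nodes * mem_per_node_gb / 1024.0   # assumes 1 TB = 1024 GB
    return nodes, peak_tflops, memory_tb


smallest = system_figures(4, MEM_PER_NODE_GB[0])
largest = system_figures(1024, MEM_PER_NODE_GB[1])
print("4 racks:   ", smallest)
print("1024 racks:", largest)
```

Four racks give 384 nodes, about 90.8 TFlops and 12 TB; 1,024 racks give 98,304 nodes, about 23.2 PFlops and 6,144 TB (6 PB).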
The K computer and FX10
Comparison of System H/W Specifications

                                K computer                      FX10
 CPU
    Name                        SPARC64™ VIIIfx                 SPARC64™ IXfx
    Performance                 128 GFlops @ 2 GHz              236.5 GFlops @ 1.848 GHz
    Architecture                SPARC V9 + HPC-ACE extension    (same)
    L1(I) cache                 32 KB/core                      (same)
    L1(D) cache                 32 KB/core                      (same)
    L2 cache                    6 MB (shared)                   12 MB (shared)
    No. of cores/socket         8                               16
    Memory bandwidth            64 GB/s                         85 GB/s
 Node
    Configuration               1 CPU / node                    (same)
    Memory capacity             16 GB                           32 or 64 GB
 System board
    Nodes/system board          4 nodes                         (same)
 Rack
    System boards/rack          24 system boards                (same)
    Performance/rack            12.3 TFlops                     22.7 TFlops
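A few derived ratios make the comparison easier to read. The sketch below uses only numbers from the table; the dictionary layout and the choice of ratios (memory bytes per flop, rack peak, flops per clock per core) are illustrative assumptions, not something the slides compute.

```python
# Derived ratios for the K computer and FX10, using only table values.
specs = {
    "K computer": {"gflops": 128.0, "mem_bw_gbs": 64.0,
                   "cores": 8, "clock_ghz": 2.0},
    "FX10": {"gflops": 236.5, "mem_bw_gbs": 85.0,
             "cores": 16, "clock_ghz": 1.848},
}
NODES_PER_RACK = 4 * 24  # 4 nodes per system board, 24 boards per rack


def derived(spec):
    """Return ratios that follow directly from the table entries."""
    return {
        "bytes_per_flop": spec["mem_bw_gbs"] / spec["gflops"],
        "rack_tflops": spec["gflops"] * NODES_PER_RACK / 1000.0,
        "flops_per_clock_per_core":
            spec["gflops"] / (spec["cores"] * spec["clock_ghz"]),
    }


for name, spec in specs.items():
    print(name, {k: round(v, 3) for k, v in derived(spec).items()})
```

Both CPUs come out at about 8 floating-point operations per clock per core, so the higher FX10 peak comes from the doubled core count. Memory bandwidth per flop drops from 0.5 to roughly 0.36 bytes/flop.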
The K computer and FX10
Comparison of System H/W Specifications (cont.)

                                K computer                      FX10
 Interconnect
    Topology                    6D mesh/torus                   (same)
    Performance                 5 GB/s x 2 (bi-directional)     (same)
    No. of links per node       10                              (same)
    Additional features         H/W barrier, reduction          (same)
                                No external switch box          (same)
 Cooling
    CPU, ICC (interconnect      Direct water cooling            (same)
    chip), DDCON
    Other parts                 Air cooling                     Air cooling + exhaust air water
                                                                cooling unit (optional)
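The figure of 10 links per node can be illustrated with a small neighbor enumeration. The slides do not give the axis layout, so the sketch assumes the layout described in published Tofu material: three torus axes X, Y, Z of configurable size, plus fixed axes a, b, c of sizes 2, 3, 2, where a and c are meshes and b is a torus. The function name and the example XYZ sizes are hypothetical.

```python
from itertools import product

# Assumed (not stated on the slides): sizes of the fixed a, b, c axes.
ABC_SHAPE = (2, 3, 2)
# Assumed wrap-around per axis: X, Y, Z and b are tori; a and c are meshes.
WRAPS = (True, True, True, False, True, False)


def neighbors(node, xyz_shape):
    """Return the set of nodes directly linked to `node` (a 6-tuple)."""
    shape = tuple(xyz_shape) + ABC_SHAPE
    result = set()
    for axis, (size, wraps) in enumerate(zip(shape, WRAPS)):
        for step in (-1, 1):
            pos = node[axis] + step
            if wraps:
                pos %= size
            elif not 0 <= pos < size:
                continue  # mesh axis: no link past the edge
            neighbor = node[:axis] + (pos,) + node[axis + 1:]
            if neighbor != node:
                result.add(neighbor)
    return result


xyz = (3, 3, 3)  # hypothetical XYZ sizes; each must be at least 3
all_nodes = product(*(range(s) for s in xyz + ABC_SHAPE))
link_counts = {len(neighbors(node, xyz)) for node in all_nodes}
print(link_counts)  # {10}
```

Under these assumptions every node has six links along X, Y and Z, two along b, and one each along a and c. At 5 GB/s in each direction per link, the aggregate is 10 x 5 GB/s x 2 = 100 GB/s per node.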
Node configuration
 Single CPU as a node
    SPARC64™ IXfx based
    32/64 GB memory capacity
    Single CPU per node to maximize memory bandwidth
    High memory bandwidth of 85 GB/s
 On-board InterConnect Controller (ICC)
    Direct RDMA and global synchronization operations
    No external switch
 Node type
    Compute node
       Consists of CPU, ICC and memory
       No I/O capability except interconnect
       Four nodes are mounted on a system board
    I/O node
       Same CPU as compute node
       Includes four PCI Express Gen2 x8 slots
       8 GB/s I/O bandwidth per I/O node
       One node is mounted on an I/O system board
 Figure: node block diagram (SPARC64™ IXfx with cores, SX ctrl, L2$ and MC, attached to memory and to the ICC, which provides interconnect and I/O), system board (four CPUs with ICC) and I/O system board (CPU, ICC and I/O slots)
SPARC64™ IXfx
 High-performance and low-power multi-core CPU
    High-performance core by HPC-ACE
       Extended number of registers, SIMD operation, software-controllable cache, etc.
    VISIMPACT: supports a highly efficient hybrid execution model (thread + process)
       Shared second-level cache, hardware barrier among cores and compiler support

 SPARC64™ IXfx specifications
    Architecture                        SPARC V9 + HPC-ACE
    No. of FP operations/clock/core     8 (= 4 multiply-and-add)
    No. of cores                        16
    Peak performance and clock          236.5 GFlops @ 1.848 GHz
    Memory bandwidth                    85 GB/s
    Power consumption                   110 W (typical)

 High performance-per-power ratio and high reliability
    Water cooling system has lowered the CPU temperature and leakage current
    Wide-ranging error detection/self-recovery functions, instruction retry function

 Figure: die layout showing 16 cores, L2$ data and L2$ control, MAC blocks, DDR3 interfaces and HSIO
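The peak figure in the specification table can be recomputed from the core count, the operations per clock and the clock frequency. The sketch below does this and also divides by the typical power to get a rough performance-per-power value. That ratio is an illustrative derivation, not a number from the slides, and it uses typical rather than maximum power.

```python
# Recompute the SPARC64 IXfx peak from the specification table.
CORES = 16
FLOPS_PER_CLOCK_PER_CORE = 8   # 4 multiply-and-add operations per clock
CLOCK_GHZ = 1.848
TYPICAL_POWER_W = 110.0        # "110 W (typical)" from the table

peak_gflops = CORES * FLOPS_PER_CLOCK_PER_CORE * CLOCK_GHZ
gflops_per_watt = peak_gflops / TYPICAL_POWER_W

print(f"peak: {peak_gflops:.1f} GFlops")
print(f"peak per typical watt: {gflops_per_watt:.2f} GFlops/W")
```

The product 16 x 8 x 1.848 is 236.544, which the slides round to 236.5 GFlops.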
Overview of HPC-ACE

“High Performance Computing - Arithmetic Computational Extensions”
 Extended number of integer registers and floating point registers
 Software-controllable “Sector Cache”
 Flexible Single Instruction Multiple Data (SIMD) operation
 Hardware barrier synchronization for VISIMPACT
      VISIMPACT: automatic thread-parallelization compiler technology
 Other special features
      XFILL instruction
      Reciprocal approximation instruction
      Reciprocal square root approximation instruction
      Trigonometric function acceleration instructions
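Approximation instructions of this kind are commonly paired with a few Newton-Raphson steps that bring the estimate up to full precision. The slides do not describe how the FX10 compiler uses them, so the sketch below only illustrates that general technique. The 1% starting error stands in for an unspecified hardware estimate, and the function names are hypothetical.

```python
import math


def refine_reciprocal(a, estimate, steps=3):
    """Newton-Raphson refinement of an estimate of 1/a."""
    x = estimate
    for _ in range(steps):
        x = x * (2.0 - a * x)          # error is squared at each step
    return x


def refine_rsqrt(a, estimate, steps=3):
    """Newton-Raphson refinement of an estimate of 1/sqrt(a)."""
    y = estimate
    for _ in range(steps):
        y = y * (1.5 - 0.5 * a * y * y)
    return y


a = 7.0
crude_recip = (1.0 / a) * 1.01               # stand-in for a coarse estimate
crude_rsqrt = (1.0 / math.sqrt(a)) * 1.01    # stand-in for a coarse estimate

recip = refine_reciprocal(a, crude_recip)
rsqrt = refine_rsqrt(a, crude_rsqrt)
print(abs(recip * a - 1.0), abs(rsqrt * rsqrt * a - 1.0))
```

Each step uses only multiplies and adds, so it maps onto fused multiply-add pipelines and avoids a long-latency divide or square root.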




HPC-ACE: Extended Number of Registers
 Enables larger loop unrolling and eliminates register spills

 Integer registers
    SPARC-V9           160 / 32
    HPC-ACE            192 / 64
    Figure: V9 register window (160) plus 32 HPC-ACE registers; 32 V9 plus 32 HPC-ACE registers visible

 Double precision floating-point registers
    SPARC-V9           32
    HPC-ACE            256 (Scalar) / 128 (SIMD)
    Figure: 32 V9 registers plus 224 HPC-ACE registers, labelled SIMD basic and SIMD extended
HPC-ACE: Number of FP registers extension (1)
 NPB3.3-LU high cost loop
    By using the extended number of registers, the compiler can generate more efficient scheduling and also eliminate unnecessary memory operations
 Figure: execution time [sec] of the LU jacld loop (proc0) with 32 registers and with 256 registers; x 1.42 improvement
HPC-ACE: Number of FP registers extension (2)
 Performance boost by 256 FP registers with 138 application program kernels
    Performance improvement: average 120%, max. 252%
 Figure: improved ratio per program number. Caption: performance improvement by extending the number of FP registers (from 32 to 256)
HPC-ACE: Sector Cache (1)
 Increases the cache hit rate by selectively keeping reused data in the cache
    The cache is divided into two sectors (Sectors 0 and 1).
    Sector 1 is used for data that will be reused; reusable data are loaded by a special load instruction.
    Sector 0 is used for other data and works with the ordinary cache replacement policy.
    Data in Sector 1, which will be used again soon, is no longer evicted from the cache by accesses to data that use Sector 0.
    The user can specify the data to be retained in Sector 1 on a compiler directive line.
 The N ways of the L2 cache are divided as follows:
    N1: Sector 0
    N2: Sector 1

           !ocl CACHE_SECTOR_SIZE(N1,N2)
           !ocl CACHE_SUBSECTOR_ASSIGN(a)
           do j=1,m
             do i=1,n
                a(i) = a(i) + b(i,j) * c(i,j)
             enddo
           enddo

    • Array a is held in Sector 1.
    • All others are held in Sector 0.
    • Array a is no longer removed from the cache by references to array b or c.
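The effect of pinning array a in its own sector can be reproduced with a toy cache model. The sketch below is a deliberate simplification and not a model of the real L2: it assumes a fully associative LRU cache with one array element per line, whereas the hardware divides the ways of a set-associative cache. The class and function names are hypothetical. It replays the access pattern of the loop nest above and counts hits on a.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line):
        """Touch `line`; return True on a hit, False on a miss."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[line] = None
        return False


def hits_on_a(n, m, sector0_lines, sector1_lines):
    """Hits on array a for the nest a(i) = a(i) + b(i,j) * c(i,j)."""
    if sector1_lines == 0:
        shared = LRUCache(sector0_lines)       # no sector split
        cache_a = cache_bc = shared
    else:
        cache_a = LRUCache(sector1_lines)      # Sector 1: array a only
        cache_bc = LRUCache(sector0_lines)     # Sector 0: everything else
    hits = 0
    for j in range(m):
        for i in range(n):
            hits += cache_a.access(("a", i))
            cache_bc.access(("b", i, j))
            cache_bc.access(("c", i, j))
    return hits


print(hits_on_a(8, 4, sector0_lines=16, sector1_lines=0))  # 0
print(hits_on_a(8, 4, sector0_lines=8, sector1_lines=8))   # 24
```

With one shared 16-line cache, the streaming accesses to b and c evict every element of a before it is reused, so a never hits. Giving a its own 8 lines makes every access after the first sweep a hit, at the same total capacity.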
HPC-ACE: Sector Cache (2)
 NPB3.3-CG case
    By putting array P in Sector 1, the floating-point data cache access wait is reduced
 Figure: execution time [sec.] without sector cache (before improvement) and with sector cache (after improvement); x 1.23 improvement
HPC-ACE: SIMD (Single Instruction Multiple Data)
 Eight floating-point ops can be executed simultaneously per core
    Two SIMD instructions can be executed simultaneously per core
    A SIMD instruction executes two floating-point ops (single or double precision)
    FMA is supported
 Software can flexibly perform SIMD optimization
    It is possible to execute operations in SIMD by obtaining pieces of data one by one from noncontiguous memory spaces
    It is possible to selectively store floating-point registers into memory (mask operation)
 Figure: floating-point registers feeding the floating-point pipelines (A, B, C, D)
                     SIMD basic     SIMD extended
    SIMD[0]          f[0]           f[256]
    SIMD[1]          f[2]           f[258]
    ...
    SIMD[126]        f[252]         f[508]
    SIMD[127]        f[254]         f[510]
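The register pairing in the figure and the count of eight operations per clock both follow simple rules. The sketch below states them as code. The function and constant names are hypothetical, and the mapping is read off the figure's first and last rows rather than taken from an architecture manual.

```python
def simd_pair(index):
    """Return the (basic, extended) FP register numbers of SIMD[index]."""
    if not 0 <= index <= 127:
        raise ValueError("SIMD register index must be in 0..127")
    basic = 2 * index
    return basic, basic + 256


SIMD_INSTRUCTIONS_PER_CLOCK = 2   # two SIMD instructions issue together
ELEMENTS_PER_INSTRUCTION = 2      # each works on two floating-point values
OPS_PER_FMA = 2                   # one multiply plus one add

flops_per_clock = (SIMD_INSTRUCTIONS_PER_CLOCK
                   * ELEMENTS_PER_INSTRUCTION
                   * OPS_PER_FMA)

for index in (0, 1, 126, 127):
    print(f"SIMD[{index}] ->", simd_pair(index))
print("floating-point ops per clock per core:", flops_per_clock)
```

The four printed pairs match the rows shown in the figure, and the product 2 x 2 x 2 gives the eight floating-point operations per clock quoted for each core.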
HPC-ACE: SIMD extension (mask operation effect)
 Example of a computational chemistry program
    Because of the branch (“if”) in the loop, the SIMD option alone shows no effect
    By using the mask operation, the compiler can SIMDize the loop and apply software pipelining, resulting in a 2.5x performance improvement
 Figure: execution time [sec.] for the nosimd, simd and simd=2 builds; x 2.5 improvement
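The idea behind the mask operation is to turn a conditional update into an unconditional computation followed by a selective store. The sketch below illustrates that transformation on a made-up loop body, since the chemistry kernel itself is not shown in the slides. The function names and the two-element width are assumptions, and plain Python lists stand in for SIMD registers.

```python
def update_with_branch(x, threshold):
    """Reference loop: the 'if' in the body blocks straightforward SIMD."""
    out = list(x)
    for i, value in enumerate(x):
        if value > threshold:
            out[i] = value * value
    return out


def update_with_mask(x, threshold, width=2):
    """Same loop in masked form, processed `width` elements at a time."""
    out = list(x)
    for start in range(0, len(x), width):
        lanes = x[start:start + width]
        mask = [v > threshold for v in lanes]      # one flag per element
        computed = [v * v for v in lanes]          # every element computes
        for lane, (keep, value) in enumerate(zip(mask, computed)):
            if keep:                               # models the masked store
                out[start + lane] = value
    return out


data = [0.5, 2.0, 3.0, -1.0, 1.5]
print(update_with_branch(data, 1.0))
print(update_with_mask(data, 1.0))
```

Both functions return the same list. In the masked form the arithmetic no longer depends on the condition, which is what lets a compiler vectorize and software-pipeline the loop.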
HPC-ACE:XFILL capability
 XFILL capability works in Earthquake simulation program
   XFILL fills L2 cache line with undetermined data(allocate cache line without data
    load)
   So, with XFILL in advance, following FP reg store instructions should hit and
    would not cause data load from memory
   XFILL can reduce memory read accesses and improve performance when a
    memory throughput is the bottleneck

[Figure: bar chart of execution time in seconds (axis 0 to 0.1) for kernel pdiffz3_m4 without XFILL and with XFILL, annotated “x 1.5 improvement”]
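A rough traffic model shows why avoiding the allocate-time load helps a bandwidth-bound loop. The sketch below is an assumption-laden estimate, not a description of the measured kernel: it assumes every store stream misses in the cache, that a store miss normally costs one line read plus one line write-back, and that XFILL removes the read. The function name and its parameters are hypothetical.

```python
def memory_traffic_ratio(load_streams, store_streams):
    """Memory traffic without XFILL divided by traffic with XFILL.

    Each load stream moves one cache line per iteration block.
    Each store stream moves two lines without XFILL (read to allocate,
    then write back) and one line with XFILL (write back only).
    """
    if load_streams < 0 or store_streams <= 0:
        raise ValueError("need at least one store stream")
    without_xfill = load_streams + 2 * store_streams
    with_xfill = load_streams + store_streams
    return without_xfill / with_xfill
```

In this model a copy-like loop (one load stream, one store stream) gives a ratio of 1.5 and a pure fill loop gives 2.0. The slide does not state the shape of the earthquake kernel, so the model only bounds what a memory-bound loop could gain.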

VISIMPACT technology
 Fine-grain thread-parallelization
      Low-overhead barrier synchronization with HPC-ACE ASI registers
      Coalesced memory access exploits shared L2 cache
      “Virtual Single Processor by Integrated Multi-core Parallel Architecture”

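Vectorization (inner loop: vector, outer loop: serial)
      DO J=1,N
        DO I=1,M
          A(I,J)=...
        END
      END
Conventional threading (outer loop: parallel, inner loop: serial; requires separate or large L2 cache)
    P DO J=1,N
    P   DO I=1,M
    P     A(I,J)=...
    P   END
    P END
VISIMPACT (inner loop: parallel, outer loop: serial)
      DO J=1,N
    P   DO I=1,M
    P     A(I,J)=...
    P   END
      END
(Lines marked P belong to the parallel region)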
 Fujitsu compilers support VISIMPACT automatic parallelization
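The work distribution in the VISIMPACT column can be sketched with ordinary threads. This is only an illustration of the loop structure: the function name, the fill formula and the thread count are hypothetical, a software `threading.Barrier` stands in for the hardware barrier, and Python threads will not show the speedup that real cores would.

```python
import threading


def fill_inner_parallel(n_outer, m_inner, n_threads=4):
    """Outer loop walked in step by all threads, inner loop split among them."""
    a = [[0] * n_outer for _ in range(m_inner)]  # a[i][j], like A(I,J)
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        # Each thread owns a contiguous slice of the inner (I) index range.
        lo = tid * m_inner // n_threads
        hi = (tid + 1) * m_inner // n_threads
        for j in range(n_outer):
            for i in range(lo, hi):
                a[i][j] = i + 10 * j
            # One barrier per outer iteration: cheap barriers are what make
            # this fine-grain split worthwhile.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a
```

Because neighbouring threads touch neighbouring elements of the same column, their accesses fall in the same region of memory, which is the situation a shared L2 cache is meant to exploit.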

VISIMPACT technology
 The Fujitsu compiler automatically turns MPI programs into hybrid parallel
  executions by parallelizing each process on a CPU into multiple threads that
  run on its cores
 Reducing the number of ranks improves communication efficiency
 The inter-core hardware barrier and the shared L2 cache help efficient execution


[Figure: VISIMPACT model versus pure-MPI model, each shown on two nodes (Node0, Node1) joined by the interconnect. VISIMPACT model: each process runs as several threads (T) on the cores (multi-threaded parallel process). Pure-MPI model: every core runs its own process (P) (parallel processes). Legend: P = process, T = thread, arrows = inter-process communication]
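The effect of reducing the number of ranks can be put in numbers with a small sketch. The node and core counts used in the usage note are illustrative assumptions, and the function names are hypothetical; the message count is the usual one for an all-to-all in which every rank sends one message to every other rank.

```python
def rank_count(nodes, cores_per_node, threads_per_process):
    """Number of MPI ranks when each process runs threads_per_process threads."""
    if threads_per_process <= 0 or cores_per_node % threads_per_process != 0:
        raise ValueError("threads_per_process must divide cores_per_node")
    return nodes * (cores_per_node // threads_per_process)


def all_to_all_messages(ranks):
    """Point-to-point messages in one all-to-all: each rank to every other rank."""
    return ranks * (ranks - 1)
```

For an assumed 96 nodes with 16 cores each, the pure-MPI model has `rank_count(96, 16, 1)` = 1536 ranks, while one 16-thread process per node leaves 96 ranks, so an all-to-all involves far fewer (and correspondingly larger) messages.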
6D-Mesh/Torus Network Topology
 Higher bisection bandwidth and fewer hops than a 3D-Torus
 Torus fusion
      Every XYZ Cartesian grid point has another ABC 3D-Torus
      X, Z and B are torus (ring) axes
      A, C and Y are mesh (linear) axes




[Figure: conceptual model of the 6D mesh/torus, showing the X, Y and Z axes and the A, B and C axes]
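The difference between torus and mesh axes shows up in the hop count between two nodes. The sketch below assumes minimal routing along each axis independently; the function names are hypothetical, and the axis sizes in the usage note are an assumed assignment chosen for illustration.

```python
TORUS_AXES = frozenset("XZB")  # ring axes: wrap-around links exist
MESH_AXES = frozenset("ACY")   # linear axes: no wrap-around


def axis_hops(axis, src, dst, size):
    """Minimal hops along one axis between coordinates src and dst."""
    delta = abs(src - dst)
    if axis in TORUS_AXES:
        return min(delta, size - delta)  # may go the short way round the ring
    if axis in MESH_AXES:
        return delta
    raise ValueError(f"unknown axis {axis!r}")


def hop_count(src, dst, sizes):
    """Sum of per-axis minimal hops; src, dst and sizes are dicts keyed by axis."""
    return sum(axis_hops(ax, src[ax], dst[ax], sizes[ax]) for ax in sizes)
```

With assumed sizes `{"X": 24, "Y": 18, "Z": 16, "A": 2, "B": 3, "C": 2}`, moving from X=0 to X=23 costs one hop because X wraps, while moving from Y=0 to Y=17 costs 17 hops because Y is a mesh axis.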
Virtual Topology
 System software generates a virtual 1D, 2D or 3D torus for a 6D cuboid of
     arbitrary size
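[Figure: a 6D cuboid with virtual rings drawn on it: nodes numbered 0 to 7 in the X-A plane, nodes numbered 0 to 7 in the Z-C plane, and nodes numbered 0 to 11 in the Y-B plane, laid out as
    B   10    9    3    4
        11    8    7    6
    Y    0    1    2    5 ]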
 Virtual topology expands the range of applicable algorithms
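The 0 to 11 numbering in the Y-B plane can be checked to form a virtual ring in which consecutive ranks are always one hop apart. The sketch below encodes the numbering from the figure; which row corresponds to which B coordinate is an assumption (it does not affect the result, since every pair of positions on a ring of three is adjacent), and the function name is hypothetical. Y is treated as a mesh axis and B as a ring axis.

```python
RING_LAYOUT = [
    [10, 9, 3, 4],   # one B coordinate
    [11, 8, 7, 6],   # the next B coordinate
    [0, 1, 2, 5],    # the remaining B coordinate
]


def ring_is_contiguous(layout):
    """True if rank k and rank k+1 (wrapping at the end) are always neighbours."""
    n_b = len(layout)
    position = {rank: (y, b)
                for b, row in enumerate(layout)
                for y, rank in enumerate(row)}
    n = len(position)
    for rank in range(n):
        y0, b0 = position[rank]
        y1, b1 = position[(rank + 1) % n]
        dy = abs(y0 - y1)            # Y is a mesh axis: no wrap-around
        db = abs(b0 - b1)
        db = min(db, n_b - db)       # B is a ring axis: wrap-around allowed
        if dy + db != 1:
            return False
    return True
```

`ring_is_contiguous(RING_LAYOUT)` holds, so an application written for a 1D torus sees nearest-neighbour links even though the ring is folded onto two physical axes.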

ICC: Tofu Interconnect Controller
 Companion chip for SPARC64™ VIIIfx / IXfx processors
 Tofu Interconnect
      4 Tofu Network Interfaces
      Tofu Network Router
 PCI Express Gen2
      2 ports for I/O nodes
 Water-cooled

[Figure: ICC block diagram with the host bus interface, four Tofu network interfaces (one with Tofu barrier), two PCI Express blocks, a crossbar, and routing and link blocks]

Process technology        65 nm
Die size                  18.2 mm x 18.1 mm
Frequency                 312.5 MHz
No. of Tofu links         10 ports
Tofu link throughput      in 5 GB/s + out 5 GB/s
PCI Express Gen2          8 lanes × 2 ports
Host Bus Interface        in 20 GB/s + out 20 GB/s
Power consumption         28 W (typical)
No. of transistors        200 million
Signal transfer speed     6.25 Gbps
Differential signals      128 lanes
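The per-link throughput can be related to the signal transfer speed with a short calculation. Two inputs below are assumptions that the slide does not state: that a Tofu link uses 8 lanes per direction and that it uses 8b/10b line coding. The PCI Express Gen2 figures (5 GT/s per lane, 8b/10b coding) follow the PCI Express 2.0 specification. The function and constant names are hypothetical.

```python
def payload_gbytes_per_s(line_rate_gbps, lanes, data_bits=8, coded_bits=10):
    """Payload bandwidth in GB/s for one direction of a serial link."""
    return line_rate_gbps * lanes * data_bits / coded_bits / 8


TOFU_LANES_PER_LINK = 8   # assumption, not stated on the slide
TOFU_PORTS = 10           # from the slide

tofu_link = payload_gbytes_per_s(6.25, TOFU_LANES_PER_LINK)   # one direction
tofu_all_links = tofu_link * TOFU_PORTS                       # one direction
pcie_gen2_x8 = payload_gbytes_per_s(5.0, 8)                   # one direction
```

Under these assumptions the result for one Tofu link agrees with the 5 GB/s per direction quoted on the slide, and the ten ports together give 50 GB/s in each direction.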

Static and Dynamic Failure Avoidance
 Static Failure Avoidance
      Pre-calculated routing table
      For intra-job communication
 Dynamic Failure Avoidance
      Time-out detection by the protocol
      For I/O communication




[Figure: network diagram with a point of failure marked “Failure”]




Fault Isolation by Virtual Topology
 Jobs that use a virtual topology can run in a rectangular region that includes a failed node




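[Figure: numbering of a virtual ring in the Y-B plane before and after a node failure]
Before (12 nodes):
    B   10    9    3    4
        11    8    7    6
    Y    0    1    2    5
After (11 nodes; x marks the position left out of the ring):
    B    9    8    7    6
        10    x    3    4
    Y    0    1    2    5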

 Decreases in executable job size and in system availability are minimized


All-to-all communication performance
 Link utilization is important for actual communications
 New optimized algorithm
      Uses all links uniformly to maximize All-to-All communication performance
      Four RDMA engines execute 4 sends and 4 receives simultaneously
 Using Tofu features
      Virtual 3D-Torus
      Flow-control features for congestion prevention
 Many applications use All-to-All communication and benefit from this acceleration
[Figure: All-to-All throughput in GB/s (axis 0 to 4) versus message size in bytes (1.E+00 to 1.E+06); series: Tofu (8x4x8=256) and InfiniBand QDR (256); annotation: “New algorithm”]

All-to-all communication trace on Tofu


Trace result of the K computer
System configuration of Tofu: 24×18×16×2×3×2 = 82,944 nodes
Each node transfers 32 KB
Colors show link utilization and wait time
    Greener: higher utilization
    Redder: longer wait time
[Figure: link traces. Left: new algorithm, elapsed time 2.77 sec. Right: standard Open MPI (pair-wise exchange), elapsed time 24.08 sec]
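For reference, the pair-wise exchange pattern named as the baseline can be sketched as a schedule of steps. This is the generic textbook form (in step s, rank r sends to rank (r + s) mod P and receives from rank (r - s) mod P); it is not taken from any MPI library's source and it is not the new algorithm, whose details the slides do not give. The function name is hypothetical.

```python
def pairwise_exchange_schedule(n_ranks):
    """Return one list of (sender, receiver) pairs per step of the exchange."""
    schedule = []
    for step in range(1, n_ranks):
        # Every rank sends exactly one message and receives exactly one.
        pairs = [(rank, (rank + step) % n_ranks) for rank in range(n_ranks)]
        schedule.append(pairs)
    return schedule
```

The schedule depends only on rank numbers, not on where the ranks sit in the network, so nothing in it spreads traffic evenly over the links. The two elapsed times quoted on the slide differ by a factor of roughly 8.7 (24.08 / 2.77).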
FX10 Software Stack

Applications
HPC Portal / System Management Portal
Technical Computing Suite
    System Management
        System management
        System control
        System monitoring
        System operation support
    Job Management
        Job manager
        Job scheduler
        Resource management
        Parallel job execution
    High Performance Parallel File System: FEFS
        Lustre based high performance distributed file system
        High scalability, high reliability and availability
    Automatic parallelization compiler
        Fortran
        C
        C++
    Tools and math. libraries
        Programming support tools
        Mathematical libraries (SSL II/BLAS etc.)
    Parallel languages and libraries
        OpenMP
        MPI
        XPFortran
Linux based OS enhanced for FX10
PRIMEHPC FX10
Lustre Extension of FEFS: Features


FEFS features, grouped around the Lustre features (the original figure color-codes each item as New, Extended or Reuse)
    Large scale: max file size, max number of files, max client number, max stripe count, 512KB block
    High performance: file striping, parallel I/O, client cache, server cache, MDS response, I/O zoning, OS jitter reduction
    Network: Tofu Interconnect, IB/Ether, IB multi-rail, LNET router
    Operations management: ACL, QoS, disk quota, directory quota, dynamic configuration change
    Connectivity: Lustre mount, NFS export
    Reliability: failover, RAS, journal / fsck




FEFS performance
 Achieved the world’s top-level throughput*
     Read 334 GB/s, Write 249 GB/s
         (574 OSSs, 18432 Clients, 192 racks)
[Figure: throughput in GB/s (axis 0 to 400) versus number of OSSs (0 to 600); series: read direct, write direct]
 Metadata performance of mdtest*
     (distributed directory)

     IOPS      FEFS             FEFS        Lustre 1.8.5    Lustre 2.0.0.1
               K computer**     IA***       IA***           IA***
     create    34697.6          31803.9     24628.1         17672.2
     unlink    39660.5          26049.5     26419.5         20231.5
     mkdir     87741.6          77931.3     38015.5         22846.8
     rmdir     28153.8          24671.4     17565.1         13973.4

                     * : Collaborative work with RIKEN on the K computer
                     ** : MDS:RX300S6 (X5680 3.33 GHz 6core x2, 48GB, IB(QDR)x2)
                     *** : MDS:RX200S5 (E5520 2.27GHz 4core x2, 48GB, IB(QDR)x1)
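The mdtest figures are easier to compare as ratios. The sketch below uses only the three columns measured on the same IA metadata server, since the K computer column was measured on different hardware; the dictionary and function names are hypothetical, and the values are copied from the table.

```python
MDTEST_IOPS_IA = {
    # operation: (FEFS, Lustre 1.8.5, Lustre 2.0.0.1), all on the IA MDS
    "create": (31803.9, 24628.1, 17672.2),
    "unlink": (26049.5, 26419.5, 20231.5),
    "mkdir": (77931.3, 38015.5, 22846.8),
    "rmdir": (24671.4, 17565.1, 13973.4),
}


def fefs_ratios(table):
    """FEFS IOPS divided by each Lustre version's IOPS, per operation."""
    return {op: (fefs / lustre_185, fefs / lustre_2001)
            for op, (fefs, lustre_185, lustre_2001) in table.items()}
```

Calling `fefs_ratios(MDTEST_IOPS_IA)` shows that the gain is largest for mkdir (more than 2x against Lustre 1.8.5), while unlink is the one operation where FEFS is marginally below Lustre 1.8.5.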
Language System overview
 Fortran/C/C++ compilers
 Programming models (OpenMP, MPI, XPFortran)
 Instruction-level / loop-level optimization using HPC-ACE
 Debugging and tuning tools for highly parallel computers


              Programming language, MPI            Programming tool           Math. lib.
Intra node    Fortran 2003, C, C++, OpenMP 3.0     IDE, debugger, profiler    BLAS, LAPACK, SSL II
              Instruction-level opt.: instruction scheduling, SIMDization
              Loop-level opt.: automatic parallelization
Inter node    XPFortran*1, MPI 2.1                 RMATT*2                    ScaLAPACK

*1: eXtended Parallel Fortran (Distributed Parallel Fortran)
*2: Rank Map Automatic Tuning Tool
Programming Environment


[Figure: programming environment. User client: IDE, interactive debugger GUI, profiler. FX10 system, login node: IDE interface, command interface, debugger interface, data converter. FX10 system, compute nodes: job control, debugger, applications, data sampler. Data flow labels: sampling data, stage out, visualized data]



Application Tuning Cycle and Tools


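[Figure: application tuning cycle linking job information, execution, MPI tuning, CPU tuning and overall tuning. Tools shown near MPI tuning: profiler, Vampir-trace, RMATT, Tofu-PA. Tools shown near CPU tuning: profiler, Vampir-trace, PAPI. Overall tuning: profiler snapshot. Legend: FX10 specific tools, open source tools]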


On Course to Exascale
 The world’s first 1 exaflops computer is expected to appear by 2020




Towards exascale
 Realizing an exascale system is a grand challenge
      Development in at least two steps is necessary
      The biggest challenges are high density and low power consumption
 Fujitsu is developing a Trans-Exa system as a midterm goal
      The Trans-Exa system is expected to be scalable to 100 Petaflops
      Employs
           Wide SIMD and multicore CPU
           High-performance, low-power interconnect
           High-performance, high-density memory technologies
 Fujitsu continues to invest in research for the exascale system
      Higher performance and lower power consumption technologies
      Technologies for higher reliability
[Figure: roadmap on a time axis marked 2010, 2015 and 2020: K computer (No.1 in Top500, June and Nov. 2011), then the Trans-Exa system, then the Exascale system]
Key technology developments on Trans-Exa

  Goal

         Significant improvement of power efficiency, high density

  Technology
         Silicon tech. ⇒ Employs the latest tech.
         Innovative memory tech. ⇒ High density & BW memory
         System integration tech. ⇒ Higher integration & density
         The latest optical tech. ⇒ High speed signal transfer

  Gains
         Performance / power consumption
         Performance / rack
         Accumulation of key technologies toward exascale systems


More Related Content

Similar to Fujitsu - Technologies beyond-the-k-computer

3 Par
3 Par3 Par
Isc group llc gral presentation final revised 050212
Isc group llc gral   presentation final revised  050212Isc group llc gral   presentation final revised  050212
Isc group llc gral presentation final revised 050212
Joelchait
 
Oracle Systems _ Angus MacDonald _ An insight into what is coming next!.pdf
Oracle Systems _ Angus MacDonald _ An insight into what is coming next!.pdfOracle Systems _ Angus MacDonald _ An insight into what is coming next!.pdf
Oracle Systems _ Angus MacDonald _ An insight into what is coming next!.pdfInSync2011
 
Fujitsu POS Printer: FP-1000 series
Fujitsu POS Printer: FP-1000 seriesFujitsu POS Printer: FP-1000 series
Fujitsu POS Printer: FP-1000 series
FujitsuComponents
 
Sun sparc enterprise t5440 server customer presentation
Sun sparc enterprise t5440 server customer presentationSun sparc enterprise t5440 server customer presentation
Sun sparc enterprise t5440 server customer presentationxKinAnx
 
SAN storage arrays
SAN storage arraysSAN storage arrays
SAN storage arraysmosinyi
 
POWER Systems April 2012
POWER Systems April 2012POWER Systems April 2012
POWER Systems April 2012
COMMON Europe
 
Oracle Systems _ Angus MacDonald _ Running Oracle on Oracle .pdf
Oracle Systems _ Angus MacDonald _ Running Oracle on Oracle .pdfOracle Systems _ Angus MacDonald _ Running Oracle on Oracle .pdf
Oracle Systems _ Angus MacDonald _ Running Oracle on Oracle .pdfInSync2011
 
Introdution - Fujitsu PRIMEQUEST
Introdution - Fujitsu PRIMEQUESTIntrodution - Fujitsu PRIMEQUEST
Introdution - Fujitsu PRIMEQUEST
Andrew Wong
 
The Cloud and Exa by intel
The Cloud and Exa by intelThe Cloud and Exa by intel
The Cloud and Exa by intelOracle Day
 
Certifications list 2018
Certifications list 2018Certifications list 2018
Certifications list 2018
Clinton Strader
 
IT FUTURE 2011 - Symantec
IT FUTURE 2011 - SymantecIT FUTURE 2011 - Symantec
IT FUTURE 2011 - Symantec
Fujitsu France
 
Sun sparc enterprise t5140 and t5240 servers customer presentation
Sun sparc enterprise t5140 and t5240 servers customer presentationSun sparc enterprise t5140 and t5240 servers customer presentation
Sun sparc enterprise t5140 and t5240 servers customer presentationxKinAnx
 
Top500 11/2011 BOF Slides
Top500 11/2011 BOF SlidesTop500 11/2011 BOF Slides
Top500 11/2011 BOF Slides
top500
 
Fujitsu-Make IT Dynamic a TBIZ2011
Fujitsu-Make IT Dynamic a TBIZ2011Fujitsu-Make IT Dynamic a TBIZ2011
Fujitsu-Make IT Dynamic a TBIZ2011TechnologyBIZ
 
Juniper IPv6 Workshop by Irzan
Juniper IPv6 Workshop by IrzanJuniper IPv6 Workshop by Irzan
Juniper IPv6 Workshop by IrzanFebrian ‎
 
The Explosion of Petascale in the Race to Exascale
The Explosion of Petascale in the Race to ExascaleThe Explosion of Petascale in the Race to Exascale
The Explosion of Petascale in the Race to Exascale
Intel IT Center
 
Thoughts Beyond High Performance Computing: A Personal Assessment
Thoughts Beyond High Performance Computing: A Personal AssessmentThoughts Beyond High Performance Computing: A Personal Assessment
Thoughts Beyond High Performance Computing: A Personal Assessment
Marek Michalewicz
 
TOP500 List November 2014
TOP500 List November 2014TOP500 List November 2014
TOP500 List November 2014
top500
 
Fujitsu PRIMERGY BX400 Server
Fujitsu PRIMERGY BX400 ServerFujitsu PRIMERGY BX400 Server
Fujitsu PRIMERGY BX400 Server
Kingfin Enterprises Limited
 

Similar to Fujitsu - Technologies beyond-the-k-computer (20)

3 Par
3 Par3 Par
3 Par
 
Isc group llc gral presentation final revised 050212
Isc group llc gral   presentation final revised  050212Isc group llc gral   presentation final revised  050212
Isc group llc gral presentation final revised 050212
 
Oracle Systems _ Angus MacDonald _ An insight into what is coming next!.pdf
Oracle Systems _ Angus MacDonald _ An insight into what is coming next!.pdfOracle Systems _ Angus MacDonald _ An insight into what is coming next!.pdf
Oracle Systems _ Angus MacDonald _ An insight into what is coming next!.pdf
 
Fujitsu POS Printer: FP-1000 series
Fujitsu POS Printer: FP-1000 seriesFujitsu POS Printer: FP-1000 series
Fujitsu POS Printer: FP-1000 series
 
Sun sparc enterprise t5440 server customer presentation
Sun sparc enterprise t5440 server customer presentationSun sparc enterprise t5440 server customer presentation
Sun sparc enterprise t5440 server customer presentation
 
SAN storage arrays
SAN storage arraysSAN storage arrays
SAN storage arrays
 
POWER Systems April 2012
POWER Systems April 2012POWER Systems April 2012
POWER Systems April 2012
 
Oracle Systems _ Angus MacDonald _ Running Oracle on Oracle .pdf
Oracle Systems _ Angus MacDonald _ Running Oracle on Oracle .pdfOracle Systems _ Angus MacDonald _ Running Oracle on Oracle .pdf
Oracle Systems _ Angus MacDonald _ Running Oracle on Oracle .pdf
 
Introdution - Fujitsu PRIMEQUEST
Introdution - Fujitsu PRIMEQUESTIntrodution - Fujitsu PRIMEQUEST
Introdution - Fujitsu PRIMEQUEST
 
The Cloud and Exa by intel
The Cloud and Exa by intelThe Cloud and Exa by intel
The Cloud and Exa by intel
 
Certifications list 2018
Certifications list 2018Certifications list 2018
Certifications list 2018
 
IT FUTURE 2011 - Symantec
IT FUTURE 2011 - SymantecIT FUTURE 2011 - Symantec
IT FUTURE 2011 - Symantec
 
Sun sparc enterprise t5140 and t5240 servers customer presentation
Sun sparc enterprise t5140 and t5240 servers customer presentationSun sparc enterprise t5140 and t5240 servers customer presentation
Sun sparc enterprise t5140 and t5240 servers customer presentation
 
Top500 11/2011 BOF Slides
Top500 11/2011 BOF SlidesTop500 11/2011 BOF Slides
Top500 11/2011 BOF Slides
 
Fujitsu-Make IT Dynamic a TBIZ2011
Fujitsu-Make IT Dynamic a TBIZ2011Fujitsu-Make IT Dynamic a TBIZ2011
Fujitsu-Make IT Dynamic a TBIZ2011
 
Juniper IPv6 Workshop by Irzan
Juniper IPv6 Workshop by IrzanJuniper IPv6 Workshop by Irzan
Juniper IPv6 Workshop by Irzan
 
The Explosion of Petascale in the Race to Exascale
The Explosion of Petascale in the Race to ExascaleThe Explosion of Petascale in the Race to Exascale
The Explosion of Petascale in the Race to Exascale
 
Thoughts Beyond High Performance Computing: A Personal Assessment
Thoughts Beyond High Performance Computing: A Personal AssessmentThoughts Beyond High Performance Computing: A Personal Assessment
Thoughts Beyond High Performance Computing: A Personal Assessment
 
TOP500 List November 2014
TOP500 List November 2014TOP500 List November 2014
TOP500 List November 2014
 
Fujitsu PRIMERGY BX400 Server
Fujitsu PRIMERGY BX400 ServerFujitsu PRIMERGY BX400 Server
Fujitsu PRIMERGY BX400 Server
 

Fujitsu - Technologies beyond the K computer

  • 1. Technologies beyond the K computer September 5th, 2012 Takashi Aoki Next Generation Technical Computing Unit Fujitsu Limited
  • 2. Agenda
      - Corporate profile
      - Fujitsu supercomputer past and present
      - Second generation Petascale supercomputer PRIMEHPC FX10: hardware, software
      - Challenge to the future
      Sep 5th, 2012 TACC-2012 1/41 Copyright 2012 FUJITSU LIMITED
  • 3. Who we are
      Japan’s largest IT services provider and No. 3 in the world.* We do everything in ICT. We use our experience and the power of ICT to shape the future of society with our customers. Over 170,000 Fujitsu people support customers in more than 100 countries.
      *2011 IT Services Vendor Revenue. Source: Gartner, "Market Share: IT Services, 2011", 9 April 2012
  • 4. Our products and services
      - Technology Solutions
          Services: our datacenters in the world
          Systems platform: PRIMERGY TX120, ETERNUS DX8000, supercomputer PRIMEHPC FX10
      - Ubiquitous Product Solutions: LIFEBOOK E751C, smart phone F07D, tablet PC ARROWS
      - Device solutions: high-end multi-core processor SPARC64 VII+, FM3 family (32-bit RISC MCU), FRAM (Ferroelectric Random Access Memory)
  • 5. Where we work
      ‘shaping tomorrow with you’ wherever you are. Over 170,000 Fujitsu colleagues working with customers in over 100 countries.
      Headcount as of March 2012: Japan 107,000; EMEA 31,000; Asia-Pacific 27,000; Americas 8,000
  • 6. Fujitsu HPC Servers - past and present (timeline figure)
      - F230-75APU: Japan’s first vector (array) supercomputer (1977)
      - VP Series, VPP500, VPP300/700
      - NWT*, developed with NAL: No.1 in Top500 (Nov. 1993), Gordon Bell Prize (1994, 95, 96)
      - VPP5000: world’s fastest vector processor (1999)
      - AP1000, AP3000
      - PRIMEPOWER HPC2500: world’s most scalable supercomputer (2003)
      - PRIMERGY RX200 cluster node: Japan’s largest cluster in Top500 (July 2004)
      - HX600 cluster node, PRIMERGY BX900 cluster node, PRIMERGY CX400 skinless server (coming soon)
      - SPARC Enterprise, PRIMEQUEST
      - FX1: most efficient performance in Top500 (Nov. 2008)
      - K computer: No.1 in Top500 (June and Nov., 2011)
      - FX10
      *NWT: Numerical Wind Tunnel
  • 7. HPC Platform Solutions - Hardware
      Full range coverage with a choice of HPC hardware platforms.
      - Petascale supercomputer (PRIMEHPC FX10): high-performance scaling over several PFlops; Fujitsu proprietary CPU and interconnect technologies for high performance, high reliability and high operability
      - x86 HPC cluster (PRIMERGY): the de facto high-performance HPC cluster; follows the Intel CPU and MIC roadmap and adopts Fujitsu's latest packaging technologies for high performance and high operability
          CX400 skinless server; BX Series (BX900, BX400); RX Series (RX900, RX200 series)
      - Market tiers shown in the figure: high-end, large-scale SMP system, divisional, departmental, work group
  • 8. Design targets and features of FX10
      Customer's requirements and FX10 design targets:
      - High performance: high peak performance and high application performance
      - High parallel application productivity: easy to achieve high performance when running highly parallel programs, without inordinate programming effort
      - High operability: high reliability and ease of operation
      - Low power consumption
      - K computer compatibility: binary compatibility, same programming environment
  • 9. Design targets and features of FX10 (features added to each target)
      - High performance: high-performance CPU “SPARC64 IXfx” with SPARC V9 + HPC-ACE architecture; high-performance, highly reliable and fault-tolerant 6D mesh/torus interconnect “Tofu” *1
      - High parallel application productivity: “VISIMPACT” *2 supports efficient hybrid parallel execution; parallel language and programming tools
      - High operability: petascale HPC middleware for high reliability and operability; high-reliability components and functions based on mainframe development experience
      - Low power consumption: water cooling system
      - K computer compatibility: binary compatibility, same programming environment
      *1) Tofu: Torus Fusion
      *2) VISIMPACT: Virtual Single Processor by Integrated Multicore Parallel Architecture
  • 10. PRIMEHPC FX10 System Configuration (block diagram)
      - Compute node configuration: SPARC64™ IXfx CPU, DDR3 memory, ICC (Interconnect Control Chip)
      - Compute nodes and I/O nodes on the Tofu interconnect, which is also used for I/O
      - Management servers, portal servers and login server
      - I/O network (IB or GB) to file servers
      - Local disks (local file system) and global disk (global file system)
      IB: InfiniBand, GB: Gigabit Ethernet
  • 11. FX10 System H/W Specifications
      PRIMEHPC FX10 H/W specifications
        CPU     Name                    SPARC64™ IXfx
                Performance             236.5 GFlops @ 1.848 GHz
        Node    Configuration           1 CPU / node
                Memory capacity         32 or 64 GB
        Rack    Performance/rack        22.7 TFlops
        System (4 to 1,024 racks)
                No. of compute nodes    384 to 98,304
                Performance             90.8 TFlops to 23.2 PFlops
                Memory                  12 TB to 6 PB
      Packaging
        - SPARC64™ IXfx CPU: 16 cores/socket, 236.5 GFlops
        - System board: 4 nodes (4 CPUs)
        - System rack: 96 compute nodes, 6 I/O nodes, with optional water cooling exhaust unit
        - System: max. 1,024 racks, max. 98,304 CPUs, max. 23.2 PFlops
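The rack and system figures on this slide can be reproduced from the per-node numbers. The following is a minimal sketch of that arithmetic; the function name `system_peak` is only illustrative, and the conversion of GB to TB by a factor of 1024 is an assumption chosen because it matches the slide's 12 TB and 6 PB values.

```python
# Values quoted on the slide.
NODE_GFLOPS = 236.5      # one SPARC64 IXfx CPU per node
NODES_PER_RACK = 96      # compute nodes per rack


def system_peak(racks, mem_gb_per_node):
    """Return (compute nodes, peak TFlops, memory in TB) for a rack count."""
    nodes = racks * NODES_PER_RACK
    peak_tflops = nodes * NODE_GFLOPS / 1000.0
    memory_tb = nodes * mem_gb_per_node / 1024.0  # assumed binary GB -> TB
    return nodes, peak_tflops, memory_tb


if __name__ == "__main__":
    for racks, mem in [(1, 32), (4, 32), (1024, 64)]:
        nodes, tflops, tb = system_peak(racks, mem)
        print(f"{racks:5d} racks: {nodes:6d} nodes, "
              f"{tflops:9.1f} TFlops, {tb:7.1f} TB")
```

With these inputs, one rack gives about 22.7 TFlops, four racks about 90.8 TFlops with 12 TB, and 1,024 racks about 23.2 PFlops with 6 PB, in line with the table.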
  • 12. The K computer and FX10: Comparison of System H/W Specifications
                                        K computer                 FX10
        CPU           Name              SPARC64™ VIIIfx            SPARC64™ IXfx
                      Performance       128 GFlops @ 2 GHz         236.5 GFlops @ 1.848 GHz
                      Architecture      SPARC V9 + HPC-ACE ext.    same
                      L1 cache          L1(I) 32 KB/core,          same
                                        L1(D) 32 KB/core
                      L2 cache          6 MB (shared)              12 MB (shared)
                      Cores/socket      8                          16
                      Memory bandwidth  64 GB/s                    85 GB/s
        Node          Configuration     1 CPU / node               same
                      Memory capacity   16 GB                      32 or 64 GB
        System board  Nodes/board       4 nodes                    same
        Rack          Boards/rack       24 system boards           same
                      Performance/rack  12.3 TFlops                22.7 TFlops
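One way to read this comparison is through the memory bandwidth available per peak floating-point operation. The slide does not state this ratio; the sketch below derives it from the quoted bandwidth and peak figures, and also re-derives the per-rack performance from 24 boards of 4 nodes. All function names are illustrative.

```python
def bytes_per_flop(mem_bw_gb_s, peak_gflops):
    """Memory bytes available per peak floating-point operation."""
    return mem_bw_gb_s / peak_gflops


def rack_tflops(node_gflops, boards_per_rack=24, nodes_per_board=4):
    """Peak TFlops of one rack, built from the per-node peak."""
    return node_gflops * boards_per_rack * nodes_per_board / 1000.0


if __name__ == "__main__":
    for name, bw, peak in [("K computer", 64.0, 128.0), ("FX10", 85.0, 236.5)]:
        print(f"{name:10s}: {bytes_per_flop(bw, peak):.2f} B/Flop, "
              f"{rack_tflops(peak):.1f} TFlops/rack")
```

Under these numbers the K computer node has 0.5 bytes per flop and the FX10 node roughly 0.36, so FX10 nearly doubles peak compute while memory bandwidth grows by about a third.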
  • 13. The K computer and FX10: Comparison of System H/W Specifications (cont.)
                                              K computer                    FX10
        Interconnect  Topology                6D Mesh/Torus                 same
                      Performance             5 GB/s x2 (bi-directional)    same
                      No. of links per node   10                            same
                      Additional features     H/W barrier, reduction;       same
                                              no external switch box
        Cooling       CPU, ICC (interconnect  Direct water cooling          same
                      chip), DDCON
                      Other parts             Air cooling                   Air cooling + exhaust air
                                                                            water cooling unit (optional)
  • 14. Node configuration
      - Single CPU as a node
          SPARC64™ IXfx based
          32/64 GB memory capacity
          Single CPU per node to maximize memory bandwidth
          High memory bandwidth of 85 GB/s
      - On-board InterConnect Controller (ICC)
          Direct RDMA and global synchronization operations
          No external switch
      - Node types
          Compute node: consists of CPU, ICC and memory; no I/O capability except the interconnect; four nodes are mounted on a system board
          I/O node: same CPU as the compute node; includes four PCI Express Gen2 x8 slots; 8 GB/s I/O bandwidth per I/O node; one node is mounted on an I/O system board
  • 15. SPARC64™ IXfx
      - High-performance and low-power multi-core CPU
          High-performance core by HPC-ACE: increased number of registers, SIMD operation, software-controllable cache, etc.
          VISIMPACT: supports a highly efficient hybrid execution model (thread + process) through the shared second-level cache, a hardware barrier among cores, and compiler support
      - SPARC64™ IXfx specifications
          Architecture                      SPARC V9 + HPC-ACE
          FP operations per clock per core  8 (= 4 multiply-and-add)
          No. of cores                      16
          Peak performance and clock        236.5 GFlops @ 1.848 GHz
          Memory bandwidth                  85 GB/s
          Power consumption                 110 W (typical)
      - High performance-per-power ratio and high reliability
          The water cooling system has lowered the CPU temperature and leak current
          Wide-ranging error detection/self-recovery functions, instruction retry function
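The peak figure quoted on this slide is the product of core count, floating-point operations per clock per core, and clock frequency. A minimal sketch of that calculation follows. The FX10 inputs come from this slide; applying the same 8 operations per clock per core to the K computer's SPARC64 VIIIfx is an assumption, although it is consistent with the 128 GFlops at 2 GHz with 8 cores given in the comparison slide.

```python
def peak_gflops(cores, flops_per_clock_per_core, clock_ghz):
    """Theoretical peak of one CPU in GFlops."""
    return cores * flops_per_clock_per_core * clock_ghz


if __name__ == "__main__":
    # SPARC64 IXfx (FX10): values from the slide.
    print(f"IXfx:   {peak_gflops(16, 8, 1.848):.1f} GFlops")
    # SPARC64 VIIIfx (K computer): 8 flops/clock/core is assumed here.
    print(f"VIIIfx: {peak_gflops(8, 8, 2.0):.1f} GFlops")
```

The first result, 236.544 GFlops, is the value the slides round to 236.5.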
  • 16. Overview of HPC-ACE (“High Performance Computing - Arithmetic Computational Extensions”)
      - Extended number of integer registers and floating-point registers
      - Software-controllable “Sector Cache”
      - Flexible Single Instruction Multiple Data (SIMD) operation
      - Hardware barrier synchronization for VISIMPACT
          VISIMPACT: automatic thread-parallelization compiler technology
      - Other special features
          XFILL instruction
          Reciprocal approximation instruction
          Reciprocal square root approximation instruction
          Trigonometric function acceleration instructions
  • 17. HPC-ACE: Extended Number of Registers
      - Enables larger loop unrolling and eliminates register spills
      - Integer registers
          SPARC-V9: 160 / 32
          HPC-ACE: 192 / 64
      - Double-precision floating-point registers
          SPARC-V9: 32
          HPC-ACE: 256 (scalar) / 128 (SIMD)
      Figure: register file layout showing the V9 registers and register window alongside the HPC-ACE additions, and the SIMD basic and SIMD extended halves of the floating-point registers.
  • 18. HPC-ACE: Number of FP registers extension (1)
      - NPB3.3-LU high-cost loop
      - By using the extended number of registers, the compiler can generate more efficient scheduling and also eliminate unnecessary memory operations
      Chart: execution time in seconds of the lu proc0 jacld loop with 32 registers versus 256 registers; x1.42 improvement.
  • 19. HPC-ACE: Number of FP registers extension (2)
      - Performance boost by 256 FP registers with 138 application program kernels
      Chart: performance improvement ratio by program number when the number of FP registers is extended from 32 to 256; average 120%, max. 252%.
  • 20. HPC-ACE: Sector Cache (1)
      - Increases the cache hit rate by selectively keeping reused data in the cache
          The cache is divided into two sectors (Sector 0 and Sector 1).
          Sector 1 is used for data that will be reused. Reusable data are loaded by a special load instruction.
          Sector 0 is used for other data. It works as an ordinary cache with the ordinary replacement policy.
          Data in Sector 1, which will be used again soon, is no longer evicted from the cache by accesses to data that uses Sector 0.
      - The user can specify the data to be retained in Sector 1 on a compiler directive line.
      Example, dividing the N ways of the L2 cache into N1 ways for Sector 0 and N2 ways for Sector 1:

          !ocl CACHE_SECTOR_SIZE(N1,N2)
          !ocl CACHE_SUBSECTOR_ASSIGN(a)
          do j=1,m
            do i=1,n
              a(i) = a(i) + b(i,j) * c(i,j)
            enddo
          enddo

      Array a is held in Sector 1 and all other data are held in Sector 0, so array a is no longer evicted from the cache by references to array b or c.
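The effect described on this slide can be illustrated with a toy cache model. The sketch below is not a model of the SPARC64 IXfx L2 cache: it assumes a fully associative cache with LRU replacement, one array element per cache line, and it counts each iteration's access to a(i) once. The `LRU` class and `hits_on_a` function are hypothetical names used only for this illustration.

```python
from collections import OrderedDict


class LRU:
    """Fully associative cache with LRU replacement (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert and evict the LRU line."""
        if key in self.lines:
            self.lines.move_to_end(key)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[key] = None
        return False


def hits_on_a(n, m, sector0, sector1):
    """Count hits on array a for the loop nest above.

    sector1 == 0 models an ordinary, unpartitioned cache of sector0 lines.
    """
    if sector1 == 0:
        cache_a = cache_other = LRU(sector0)
    else:
        cache_other, cache_a = LRU(sector0), LRU(sector1)
    hits = 0
    for j in range(m):
        for i in range(n):
            hits += cache_a.access(("a", i))   # reused on every j iteration
            cache_other.access(("b", i, j))    # streamed, never reused
            cache_other.access(("c", i, j))    # streamed, never reused
    return hits


if __name__ == "__main__":
    print("unpartitioned, 12 lines:", hits_on_a(8, 4, 12, 0))
    print("sector 0 = 4, sector 1 = 8:", hits_on_a(8, 4, 4, 8))
```

With the same 12 lines in total, the unpartitioned cache loses every element of a to the streaming b and c data before it is reused, while reserving 8 lines for a lets it hit on every pass after the first.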
  • 21. HPC-ACE: Sector Cache (2)
      - NPB3.3-CG case
      - By placing array P in Sector 1, the floating-point data cache access wait is reduced
      Chart: execution time in seconds without the sector cache (before improvement) versus with the sector cache (after improvement); x1.23 improvement.
  • 22. HPC-ACE: SIMD (Single Instruction Multiple Data)
      - Eight floating-point ops can be executed simultaneously per core
          Two SIMD instructions can be executed simultaneously per core
          Each SIMD instruction executes two floating-point ops (single or double precision)
          FMA is supported
      - Software can flexibly perform SIMD optimization
          It is possible to execute operations in SIMD by obtaining pieces of data one by one from noncontiguous memory spaces
          It is possible to selectively store floating-point registers into memory (mask operation)
      Figure: floating-point registers paired as SIMD basic and SIMD extended (SIMD[0] = f[0] and f[256], SIMD[1] = f[2] and f[258], ..., SIMD[126] = f[252] and f[508], SIMD[127] = f[254] and f[510]), feeding the floating-point pipelines.
  • 23. HPC-ACE: SIMD extension (mask operation effect)
      - Example from a computational chemistry program
      - Because of the branch (“if”) inside the loop, the SIMD option alone shows no effect
      - By using the mask operation, the compiler can SIMDize the loop and apply software pipelining, which results in a 2.5x performance improvement
      Chart: execution time in seconds for nosimd, simd and simd=2; x2.5 improvement.
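The transformation described on this slide, replacing a branch in the loop body with a compare that produces a mask and a masked store, can be sketched in plain Python. This only illustrates the idea and is not the Fujitsu compiler's output; the loop body, the cutoff test and the function names are invented for the example, and the two-lane width follows the slide on SIMD.

```python
WIDTH = 2  # lanes per SIMD instruction, as stated on the SIMD slide


def update_branchy(y, x, cutoff):
    """Original form: a branch inside the loop body."""
    for i in range(len(x)):
        if x[i] < cutoff:
            y[i] = y[i] + x[i] * x[i]


def update_masked(y, x, cutoff):
    """If-converted form: compute every lane, store only the masked lanes."""
    for base in range(0, len(x), WIDTH):
        lanes = range(base, min(base + WIDTH, len(x)))
        mask = [x[i] < cutoff for i in lanes]         # compare produces a mask
        vals = [y[i] + x[i] * x[i] for i in lanes]    # computed for all lanes
        for i, keep, v in zip(lanes, mask, vals):     # masked store
            if keep:
                y[i] = v


if __name__ == "__main__":
    x = [0.5, 2.0, 1.0, 3.0, 0.1]
    y1, y2 = [0.0] * 5, [0.0] * 5
    update_branchy(y1, x, 1.5)
    update_masked(y2, x, 1.5)
    print(y1 == y2)
```

Because the arithmetic no longer depends on the branch outcome, every iteration executes the same instruction sequence, which is what allows the loop to be vectorized and software-pipelined.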
  • 24. HPC-ACE: XFILL capability
      - XFILL applied to an earthquake simulation program
      - XFILL fills an L2 cache line with undetermined data, that is, it allocates the cache line without loading data from memory
      - With an XFILL issued in advance, the following floating-point register store instructions hit in the cache and do not cause a data load from memory
      - XFILL can reduce memory read accesses and improve performance when memory throughput is the bottleneck
      Chart: execution time in seconds of pdiffz3_m4 without XFILL versus with XFILL; x1.5 improvement.
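A rough traffic count shows why avoiding the load helps a bandwidth-bound loop. The sketch below assumes a write-allocate cache in which a store miss first reads the line from memory, and a loop that overwrites whole cache lines; neither assumption is stated on the slide, and the kernel a(i) = b(i) + c(i) is a hypothetical example rather than the earthquake code.

```python
def traffic_bytes(n_elems, loads_per_elem, stores_per_elem, xfill, elem_bytes=8):
    """Estimate memory traffic in bytes for a streaming loop.

    Without XFILL, each stored line is assumed to be read from memory
    before being overwritten (write-allocate). With XFILL that read is
    skipped because the line is allocated without a data load.
    """
    read = loads_per_elem * n_elems * elem_bytes
    write = stores_per_elem * n_elems * elem_bytes
    fill = 0 if xfill else write
    return read + fill + write


if __name__ == "__main__":
    n = 1_000_000
    without = traffic_bytes(n, 2, 1, xfill=False)   # a(i) = b(i) + c(i)
    with_xfill = traffic_bytes(n, 2, 1, xfill=True)
    print(f"traffic ratio: {without / with_xfill:.2f}")
```

For this kernel the estimate drops from four memory streams to three, a ratio of about 1.33. The measured gain depends on the actual ratio of loads to stores in the loop.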
  • 25. VISIMPACT technology (“Virtual Single Processor by Integrated Multi-core Parallel Architecture”)
      - Fine-grain thread-parallelization
      - Low-overhead barrier synchronization with HPC-ACE ASI registers
      - Coalesced memory access exploits the shared L2 cache
      - Fujitsu compilers support VISIMPACT automatic parallelization
      Figure: the loop nest DO J=1,N / DO I=1,M / A(I,J)=... / END / END shown under vectorization, conventional threading and VISIMPACT, with each loop level labelled serial, vector or parallel, and a note that one scheme requires a separate or large L2 cache.
  • 26. VISIMPACT technology
      - The Fujitsu compiler automatically transforms MPI programs into hybrid parallel executions by parallelizing each process on a CPU into multiple threads across the cores
      - Reducing the number of ranks improves communication efficiency
      - The inter-core hardware barrier and the shared L2 cache help efficient execution
      Figure: the VISIMPACT model (one process per node, multi-threaded across the cores) next to the pure-MPI model (one process per core), with inter-process communication over the interconnect between Node0 and Node1. P: process, T: thread.
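The benefit of fewer ranks can be made concrete by counting the point-to-point messages of a naive all-to-all exchange, where every rank sends one message to every other rank. This is a simplified illustration rather than a model of any particular MPI implementation; the 96-node job size is taken from the rack size quoted earlier and the 16 ranks per node from the core count.

```python
def alltoall_messages(nodes, ranks_per_node):
    """Messages in a naive all-to-all: each rank sends to every other rank."""
    ranks = nodes * ranks_per_node
    return ranks * (ranks - 1)


if __name__ == "__main__":
    nodes = 96
    pure_mpi = alltoall_messages(nodes, 16)   # one rank per core
    hybrid = alltoall_messages(nodes, 1)      # one rank per node, 16 threads
    print(f"pure MPI: {pure_mpi} messages, hybrid: {hybrid} messages")
```

For one rack this is 2,357,760 messages against 9,120, each carrying proportionally more data in the hybrid case, so per-message overheads are paid far less often.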
  • 27. 6D-Mesh/Torus Network Topology
      - Higher bisection bandwidth and smaller hop counts than a 3D-Torus
      - Torus fusion: every XYZ Cartesian grid point has another ABC 3D-Torus
          X, Z and B are torus (ring) axes
          A, C and Y are mesh (linear) axes
      Figure: conceptual model showing the X, Y, Z, A, B and C axes.
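The difference between torus and mesh axes shows up in the hop count between two nodes: a torus axis can wrap around, a mesh axis cannot. The sketch below computes a minimal hop count under the axis classification given on this slide. It assumes shortest-path routing with no failed nodes, and the axis sizes in the example are hypothetical values chosen for illustration, not the dimensions of any installed system.

```python
TORUS_AXES = {"X", "Z", "B"}   # ring axes, per the slide
MESH_AXES = {"A", "C", "Y"}    # linear axes, per the slide


def axis_hops(a, b, size, torus):
    """Hops along one axis; a torus axis may wrap around."""
    d = abs(a - b)
    return min(d, size - d) if torus else d


def hops(src, dst, shape):
    """Minimal hop count between two 6D coordinates given as dicts."""
    return sum(
        axis_hops(src[axis], dst[axis], shape[axis], axis in TORUS_AXES)
        for axis in shape
    )


if __name__ == "__main__":
    # Hypothetical axis sizes for illustration only.
    shape = {"X": 8, "Y": 6, "Z": 8, "A": 2, "B": 3, "C": 2}
    src = {axis: 0 for axis in shape}
    dst = {"X": 7, "Y": 5, "Z": 4, "A": 1, "B": 2, "C": 1}
    print(hops(src, dst, shape))
```

In this example the X and B axes each need one hop thanks to wraparound, while the Y mesh axis needs the full five.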
  • 28. Virtual Topology
      - System software generates a virtual 1D-, 2D- or 3D-torus for an arbitrary size of 6D-cuboid
      - Virtual topology expands the range of applicable algorithms
      Figure: a 6D-cuboid on the X, Y, Z, A, B and C axes with nodes numbered along a virtual ring (0 to 7 in one view, 0 to 11 in a B-Y plane view).
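The idea of laying a virtual ring over a block of physically adjacent nodes can be shown in two dimensions. The sketch below builds a ring on a rows x cols mesh in which every pair of neighbouring ring positions, including the last and the first, is one physical hop apart. It is a textbook construction given as an assumption-laden illustration, not the algorithm used by the Fujitsu system software, and it requires an even number of rows.

```python
def ring_on_mesh(rows, cols):
    """Order the nodes of a rows x cols mesh as a closed ring.

    Consecutive entries (and the last and first) are mesh neighbours.
    """
    if rows < 2 or rows % 2 or cols < 2:
        raise ValueError("this construction needs an even row count and cols >= 2")
    order = [(0, c) for c in range(cols)]            # along the first row
    for r in range(1, rows):                          # snake over columns 1..
        cs = range(cols - 1, 0, -1) if r % 2 else range(1, cols)
        order += [(r, c) for c in cs]
    order += [(r, 0) for r in range(rows - 1, 0, -1)]  # return via column 0
    return order


def is_ring(order):
    """Check that every ring neighbour pair is exactly one mesh hop apart."""
    n = len(order)
    return all(
        abs(order[i][0] - order[(i + 1) % n][0])
        + abs(order[i][1] - order[(i + 1) % n][1]) == 1
        for i in range(n)
    )


if __name__ == "__main__":
    ring = ring_on_mesh(4, 3)
    print(ring, is_ring(ring))
```

For a 4 x 3 block this yields a 12-node ring, the same size as the 0 to 11 numbering in the slide's figure, without needing any wraparound links.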
  • 29. ICC: Tofu Interconnect Controller
      • Companion chip for SPARC64™ VIIIfx / IXfx processors
      • Tofu Interconnect
          • 4 Tofu Network Interfaces
          • Tofu Network Router
      • PCI Express Gen2
          • 2 ports for I/O nodes
      • Water-cooled
      • Specifications:
          Process technology       65 nm
          Die size                 18.2 mm x 18.1 mm
          Frequency                312.5 MHz
          No. of Tofu links        10 ports
          Tofu link throughput     in 5 GB/s + out 5 GB/s
          PCI Express              Gen2, 8 lanes × 2 ports
          Host Bus Interface       in 20 GB/s + out 20 GB/s
          Power consumption        28 W (typical)
          No. of transistors       200 million
          Signal transfer speed    6.25 Gbps
          Differential signals     128 lanes
      • [Figure: die layout showing the Tofu Network Router (routing and link blocks around a crossbar), the Tofu Network Interfaces, the Tofu Network Interface with Tofu Barrier, the Host Bus Interface and PCI Express.]
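The specification figures can be cross-checked with a little arithmetic. The sketch below uses the values from the slide; the lane count per link and the 8b/10b line encoding are assumptions made here for illustration and are not stated on the slide.

```python
# Values taken from the slide.
SIGNAL_GBPS_PER_LANE = 6.25        # signal transfer speed
LINK_GBYTES_PER_DIRECTION = 5.0    # Tofu link throughput, in or out
TOFU_LINKS = 10
TOFU_NETWORK_INTERFACES = 4
HOST_BUS_GBYTES_PER_DIRECTION = 20.0

# Assumptions (not on the slide): 8 lanes per link direction, 8b/10b encoding.
ASSUMED_LANES_PER_LINK = 8
ASSUMED_PAYLOAD_BITS, ASSUMED_LINE_BITS = 8, 10

raw_gbps = SIGNAL_GBPS_PER_LANE * ASSUMED_LANES_PER_LINK
payload_gbytes = raw_gbps * ASSUMED_PAYLOAD_BITS / ASSUMED_LINE_BITS / 8

aggregate_link_gbytes = TOFU_LINKS * LINK_GBYTES_PER_DIRECTION
interface_gbytes = TOFU_NETWORK_INTERFACES * LINK_GBYTES_PER_DIRECTION

print(payload_gbytes, aggregate_link_gbytes, interface_gbytes)
```

Under these assumptions the per-link payload rate matches the quoted 5 GB/s, and four network interfaces at link speed add up to the quoted host bus bandwidth of 20 GB/s per direction.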
  • 30. Static and Dynamic Failure Avoidance
      • Static failure avoidance
          • Pre-calculated routing table
          • For intra-job communication
      • Dynamic failure avoidance
          • Time-out detection by the protocol
          • For I/O communication
      • [Figure: network with a failed node marked "Failure".]
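A pre-calculated routing table that avoids known failures can be sketched with a breadth-first search. This is a generic illustration on a small 2D torus, assuming the set of failed nodes is known before the job starts; it is not the routing scheme of the Tofu interconnect, and the function name is hypothetical.

```python
from collections import deque


def routing_table(shape: tuple, source: tuple, failed: set) -> dict:
    """Map each reachable node to the first hop of a shortest path from source."""
    rows, cols = shape
    first_hop = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = ((r + dr) % rows, (c + dc) % cols)   # torus wraparound
            if nxt in failed or nxt in first_hop:
                continue                               # skip failed or already routed nodes
            first_hop[nxt] = nxt if node == source else first_hop[node]
            queue.append(nxt)
    return first_hop


table = routing_table(shape=(4, 4), source=(0, 0), failed={(0, 1)})
print(table[(0, 2)])
```

Here the direct route from (0, 0) to (0, 2) is blocked by the failed node (0, 1), so the table sends the traffic through the wraparound link instead.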
  • 31. Fault Isolation by Virtual Topology
      • Jobs using a virtual topology can use a rectangular region that includes a failed node.
      • Decreases in executable job size and in system availability are minimized.
      • [Figure: a 12-node virtual ring (0 to 11) on the B-Y plane, and an 11-node ring (0 to 10) on the same region numbered around a failed node.]
  • 32. All-to-all communication performance
      • Link utilization is important for actual communications
      • New optimized algorithm
          • Uses all links uniformly to maximize all-to-all communication performance
          • Four RDMA engines execute 4 sends and 4 receives simultaneously
      • Uses Tofu features
          • Virtual 3D-Torus
          • Flow-control features for congestion prevention
      • Many applications use all-to-all communication and benefit from this acceleration
      • [Figure: all-to-all throughput in GB/s (0 to 4) versus message size in bytes (1.E+00 to 1.E+06); legend entries: Tofu (8x4x8=256), InfiniBand QDR (256), New algorithm.]
  • 33. All-to-all communication trace on Tofu
      • Trace result of the K computer
      • Tofu system configuration: 24×18×16×2×3×2 = 82,944 nodes
      • Each node transfers 32 KB
      • Left: new algorithm. Right: standard OpenMPI (pair-wise exchange).
      • Colors show link utilization and wait time: greener means higher utilization, redder means longer wait time.
      • Elapsed time: new algorithm 2.77 sec; standard OpenMPI (pair-wise exchange) 24.08 sec
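The pair-wise exchange used as the baseline can be sketched as a communication schedule: at step k, rank r sends to rank (r + k) mod p and receives from rank (r - k) mod p. The code below only builds that schedule and does not perform any communication; the function name is hypothetical.

```python
def pairwise_exchange_schedule(n_ranks: int) -> list:
    """Return steps[k - 1][r] = (send_to, recv_from) for rank r at step k."""
    return [
        [((r + k) % n_ranks, (r - k) % n_ranks) for r in range(n_ranks)]
        for k in range(1, n_ranks)
    ]


schedule = pairwise_exchange_schedule(4)
for step, pairs in enumerate(schedule, start=1):
    print(step, pairs)
```

Every rank talks to exactly one partner per step, so the schedule is simple, but it takes no account of where the partners sit in the network or of which links their messages share.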
  • 34. FX10 Software Stack
      • Applications
      • HPC Portal / System Management Portal
      • Technical Computing Suite
          • System Management
              • System management: system control, system monitoring, system operation support
              • Job management: job manager, job scheduler, resource management, parallel job execution
          • High Performance Parallel File System (FEFS)
              • Lustre-based high-performance distributed file system
              • High scalability, high reliability and availability
          • Automatic parallelization compiler
              • Fortran, C, C++
              • Tools and math libraries: programming support tools, mathematical libraries (SSL II/BLAS etc.)
              • Parallel languages and libraries: OpenMP, MPI, XPFortran
      • Linux-based OS enhanced for FX10
      • PRIMEHPC FX10
  • 35. Lustre Extension of FEFS: Features
      • Legend: New, Extended, Reuse
      • Feature categories: Large scale, High performance, Network, Operations management, Connectivity, Reliability
      • Features shown: max file size, max number of files, max client number, max stripe count, 512KB block, file striping, parallel I/O, client cache, server cache, MDS response, I/O zoning, OS jitter reduction, Tofu Interconnect, IB/Ether, IB multi-rail, LNET router, ACL, disk quota, QoS, directory quota, dynamic configuration change, Lustre mount, NFS export, failover, RAS, journal / fsck
  • 36. FEFS performance
      • Achieved the world's top-level throughput*
          • Read 334 GB/s, write 249 GB/s (574 OSSs, 18,432 clients, 192 racks)
      • [Figure: throughput in GB/s (0 to 400) versus number of OSSs (0 to 600) for direct read and direct write.]
      • Metadata performance of mdtest* (distributed directory), in IOPS:

          Operation   FEFS             FEFS        Lustre 1.8.5   Lustre 2.0.0.1
                      (K computer**)   (IA***)     (IA***)        (IA***)
          create      34697.6          31803.9     24628.1        17672.2
          unlink      39660.5          26049.5     26419.5        20231.5
          mkdir       87741.6          77931.3     38015.5        22846.8
          rmdir       28153.8          24671.4     17565.1        13973.4

      • *: Collaborative work with RIKEN on the K computer
      • **: MDS: RX300S6 (X5680 3.33 GHz 6 cores x 2, 48 GB, IB (QDR) x 2)
      • ***: MDS: RX200S5 (E5520 2.27 GHz 4 cores x 2, 48 GB, IB (QDR) x 1)
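The aggregate throughput figures can be broken down per server with simple division. This sketch assumes the load is spread evenly over all 574 OSSs, which the slide does not state; the variable names are illustrative.

```python
READ_GBYTES_PER_SEC = 334.0    # aggregate read throughput from the slide
WRITE_GBYTES_PER_SEC = 249.0   # aggregate write throughput from the slide
NUM_OSS = 574

# Average per-OSS throughput, assuming an even spread across servers.
read_per_oss = READ_GBYTES_PER_SEC / NUM_OSS
write_per_oss = WRITE_GBYTES_PER_SEC / NUM_OSS

print(f"read  {read_per_oss:.3f} GB/s per OSS")
print(f"write {write_per_oss:.3f} GB/s per OSS")
```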
  • 37. Language System overview
      • Fortran/C/C++ compilers
      • Programming models (OpenMP, MPI, XPFortran)
      • Instruction-level and loop-level optimization using HPC-ACE
      • Debugging and tuning tools for highly parallel computers
      • Components:
          • Intra-node programming languages: Fortran 2003, C, C++, OpenMP 3.0
          • Inter-node programming languages and MPI: XPFortran*1, MPI 2.1
          • Instruction-level optimization: instruction scheduling, SIMDization
          • Loop-level optimization: automatic parallelization
          • Programming tools: IDE, debugger, profiler (intra-node); RMATT*2 (inter-node)
          • Math libraries: BLAS, LAPACK, SSL II (intra-node); ScaLAPACK (inter-node)
      • *1: eXtended Parallel Fortran (Distributed Parallel Fortran)
      • *2: Rank Map Automatic Tuning Tool
  • 38. Programming Environment
      • [Figure: FX10 system programming environment spanning the user client, the login node and the compute nodes. Components shown: IDE and IDE interface, command interface, job control, interactive debugger GUI, debugger interface, debugger, application, data sampler, data converter, sampling data (staged out), visualized data and profiler.]
  • 39. Application Tuning Cycle and Tools
      • [Figure: tuning cycle with stages labeled Execution, MPI Tuning, CPU Tuning and Overall Tuning. Tools shown: job information, profiler, profiler snapshot, RMATT, Tofu-PA, Vampir-trace and PAPI. The legend distinguishes FX10-specific tools from open-source tools.]
  • 40. On Course to Exascale
      • The world's first 1 exaflops computer is expected to appear by 2020.
  • 41. Towards exascale
      • Realizing an exascale system is a grand challenge
          • At least two-step development is necessary
          • The biggest challenge is achieving high density and low power consumption
      • Fujitsu is developing a Trans-Exa system as a midterm goal
          • The Trans-Exa system is expected to be scalable to 100 petaflops
          • Employs
              • Wide SIMD and multicore CPU
              • High-performance, lower-power interconnect
              • High-performance, high-density memory technologies
      • Fujitsu continues to invest in research for the exascale system
          • Higher-performance and lower-power technologies
          • Technologies for higher reliability
      • [Figure: roadmap from 2010 to 2020 showing the K computer (No. 1 in Top500, June and Nov. 2011), the Trans-Exa system and the exascale system.]
  • 42. Key technology developments on Trans-Exa
      • Goal: significant improvement of power efficiency and high density
      • Technology
          • Silicon technology ⇒ employs the latest technology
          • Innovative memory technology ⇒ high-density and high-bandwidth memory
          • System integration technology ⇒ higher integration and density
          • The latest optical technology ⇒ high-speed signal transfer
      • Gains
          • Performance / power consumption
          • Performance / rack
          • Accumulation of key technologies toward exascale systems
  • 43. Sep 5th, 2012 TACC-2012