Fujitsu - Technologies beyond-the-k-computer

Technologies beyond
the K computer
September 5th, 2012

Takashi Aoki
Next Generation Technical Computing Unit
Fujitsu Limited

Agenda
 Corporate profile
 Fujitsu supercomputer past and present
 Second generation Petascale supercomputer PRIMEHPC FX10
 Hardware
 Software
 Challenge to the future

Sep 5th, 2012 TACC-2012 1/41 Copyright 2012 FUJITSU LIMITED

Who we are

Japan’s largest IT services provider and
No. 3 in the world. *

We do everything in ICT. We use our
experience and the power of ICT to shape the
future of society with our customers.

Over 170,000 Fujitsu people support
customers in more than 100 countries.

*2011 IT Services Vendor Revenue. Source: Gartner, "Market
Share: IT Services, 2011" 9 April 2012


Our products and services

Technology Solutions
 Services: our datacenters in the world
 System platforms: PRIMERGY TX120, ETERNUS DX8000, Supercomputer PRIMEHPC FX10

Ubiquitous Product Solutions
 LIFEBOOK E751C, Smart phone F07D, Tablet PC ARROWS

Device Solutions
 High-end multi-core processor SPARC64 VII+
 FM3 family (32-bit RISC MCU)
 FRAM (Ferroelectric Random Access Memory)


Where we work

‘shaping tomorrow with you’ wherever you are. As of March 2012

EMEA: 31,000
Japan: 107,000
Americas: 8,000
Asia-Pacific: 27,000

Over 170,000 Fujitsu colleagues working with customers in over 100 countries

Fujitsu HPC Servers - past and present -

 F230-75APU: Japan's first vector (array) supercomputer (1977)
 VP Series
 VPP500
 NWT*, developed with NAL: No.1 in Top500 (Nov. 1993), Gordon Bell Prize (1994, 95, 96)
 VPP300/700
 VPP5000: world's fastest vector processor (1999)
 AP1000, AP3000
 PRIMEPOWER HPC2500: world's most scalable supercomputer (2003)
 PRIMERGY RX200 cluster node: Japan's largest cluster in Top500 (July 2004)
 HX600 cluster node
 PRIMERGY BX900 cluster node
 SPARC Enterprise, PRIMEQUEST
 FX1: most efficient performance in Top500 (Nov. 2008)
 K computer: No.1 in Top500 (June and Nov., 2011)
 FX10
 PRIMERGY CX400 skinless server (coming soon)

*NWT: Numerical Wind Tunnel


HPC Platform Solutions - Hardware -
 Full range coverage with choice of HPC hardware platform

Petascale Supercomputer: PRIMEHPC FX10
 High performance scaling over several PFlops
 Fujitsu proprietary CPU and interconnect technologies for high performance, high reliability and high operability

x86 HPC Cluster: CX400 skinless server, BX Series (BX900, BX400), RX Series (PRIMERGY RX200 series)
 High performance de facto HPC cluster
 Follows the Intel CPU and MIC roadmap and adopts Fujitsu's latest packaging technologies for high performance and high operability

Large-Scale SMP System: RX900

Segments covered: High-end, Divisional, Departmental, Work Group

Design targets and features of FX10

Customer's requirements and FX10 design targets

 High performance
    High peak performance and high application performance
 High parallel application productivity
    Easy to achieve high performance running highly parallel programs without inordinate programming effort
 High operability
    Low power consumption
    High reliability and ease of operation
 K computer compatibility
    Binary compatibility
    Same programming environment

Design targets and features of FX10

Features that address each design target

 High performance
    High-performance CPU "SPARC64 IXfx" with SPARC V9 + HPC-ACE architecture
    High performance, highly reliable and fault tolerant 6D mesh/torus interconnect "Tofu" *1
 High parallel application productivity
    "VISIMPACT" *2 supports efficient hybrid parallel execution
    Parallel languages, programming tools and petascale HPC middleware for high reliability and operability
 High operability
    Water cooling system
    High reliability components and functions based on mainframe development experience
 K computer compatibility
    Binary compatibility
    Same programming environment

*1) Tofu: Torus Fusion
*2) VISIMPACT: Virtual Single Processor by Integrated Multicore Parallel Architecture


PRIMEHPC FX10 System Configuration

Compute node configuration
 SPARC64™ IXfx CPU
 DDR3 memory
 ICC (Interconnect Control Chip)

System components
 PRIMEHPC FX10 compute nodes
 I/O nodes, reached over the Tofu interconnect for I/O, with local disks (local file system)
 I/O network (IB or GB) to the file servers and global disk (global file system)
 Management servers, portal servers and login server

IB: InfiniBand
GB: GigaBit Ethernet

PRIMEHPC FX10 H/W Specifications

CPU     Name                   SPARC64™ IXfx
        Performance            236.5 GFlops @ 1.848 GHz
Node    Configuration          1 CPU / node
        Memory capacity        32 or 64 GB
Rack    Performance/rack       22.7 TFlops
System  No. of compute nodes   384 to 98,304
        Performance            90.8 TFlops to 23.2 PFlops (4 to 1,024 racks)
        Memory                 12 TB to 6 PB

 SPARC64™ IXfx CPU
    16 cores/socket
    236.5 GFlops
 System board
    4 nodes (4 CPUs)
 System rack
    96 compute nodes
    6 I/O nodes
    With optional water cooling exhaust unit
 System
    Max. 23.2 PFlops
    Max. 1,024 racks
    Max. 98,304 CPUs
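The system-level figures follow from the per-node and per-rack numbers. The short Python sketch below rechecks that arithmetic; the function names are illustrative only, and 1 TB = 1024 GB is an assumption made so that the memory totals match the table.

```python
# Figures taken from the specification table; names are illustrative.
NODE_PEAK_GFLOPS = 236.5
NODES_PER_RACK = 96


def system_peak_tflops(racks):
    """Peak performance in TFlops for a given number of racks."""
    return racks * NODES_PER_RACK * NODE_PEAK_GFLOPS / 1000.0


def system_memory_tb(racks, gb_per_node):
    """Total compute-node memory in TB (assumes 1 TB = 1024 GB)."""
    return racks * NODES_PER_RACK * gb_per_node / 1024.0


for racks, gb_per_node in [(1, 32), (4, 32), (1024, 64)]:
    print(racks, "rack(s):",
          racks * NODES_PER_RACK, "nodes,",
          round(system_peak_tflops(racks), 1), "TFlops,",
          system_memory_tb(racks, gb_per_node), "TB")
```

Four racks give the minimum configuration (384 nodes, 90.8 TFlops, 12 TB) and 1,024 racks the maximum (98,304 nodes, about 23.2 PFlops, 6 PB with 64 GB nodes).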

The K computer and FX10
Comparison of System H/W Specifications
(← : same as the K computer)

                                      K computer                     FX10
CPU           Name                    SPARC64™ VIIIfx                SPARC64™ IXfx
              Performance             128 GFlops @ 2 GHz             236.5 GFlops @ 1.848 GHz
              Architecture            SPARC V9 + HPC-ACE extension   ←
              Cache configuration     L1(I) Cache: 32 KB/core,       ←
                                      L1(D) Cache: 32 KB/core
                                      L2 Cache: 6 MB (shared)        L2 Cache: 12 MB (shared)
              No. of cores/socket     8                              16
              Memory bandwidth        64 GB/s                        85 GB/s
Node          Configuration           1 CPU / node                   ←
              Memory capacity         16 GB                          32 or 64 GB
System board  Nodes/system board      4 nodes                        ←
Rack          System boards/rack      24 system boards               ←
              Performance/rack        12.3 TFlops                    22.7 TFlops
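One way to read the comparison is through derived ratios. The Python sketch below computes memory bytes per flop, shared L2 per core and rack performance from the table values; the dictionary layout and function names are illustrative and not part of the original slides.

```python
# Values copied from the comparison table; structure and names are illustrative.
SPECS = {
    "K computer": {"gflops": 128.0, "mem_bw_gbs": 64.0, "cores": 8, "l2_mb": 6.0},
    "FX10": {"gflops": 236.5, "mem_bw_gbs": 85.0, "cores": 16, "l2_mb": 12.0},
}
NODES_PER_RACK = 4 * 24  # 4 nodes per system board, 24 system boards per rack


def bytes_per_flop(name):
    """Memory bandwidth divided by peak floating-point rate."""
    spec = SPECS[name]
    return spec["mem_bw_gbs"] / spec["gflops"]


def l2_mb_per_core(name):
    """Shared L2 capacity divided evenly over the cores."""
    spec = SPECS[name]
    return spec["l2_mb"] / spec["cores"]


def rack_tflops(name):
    """Peak performance of one rack in TFlops."""
    return NODES_PER_RACK * SPECS[name]["gflops"] / 1000.0


for name in SPECS:
    print(name,
          round(bytes_per_flop(name), 2), "B/flop,",
          l2_mb_per_core(name), "MB L2/core,",
          round(rack_tflops(name), 1), "TFlops/rack")
```

The L2 capacity per core is unchanged between the two CPUs, while the bytes-per-flop ratio is lower on FX10 because peak performance grew faster than memory bandwidth.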


The K computer and FX10
Comparison of System H/W Specifications (cont.)
(← : same as the K computer)

                                          K computer                    FX10
Interconnect  Topology                    6D Mesh/Torus                 ←
              Performance                 5 GB/s x 2 (bi-directional)   ←
              No. of links per node       10                            ←
              Additional features         H/W barrier, reduction,       ←
                                          no external switch box
Cooling       CPU, ICC (interconnect      Direct water cooling          ←
              chip), DDCON
              Other parts                 Air cooling                   Air cooling + exhaust air water
                                                                        cooling unit (optional)


Node configuration
 Single CPU as a node
    SPARC64™ IXfx based
    32/64 GB memory capacity
    Single CPU per node to maximize memory bandwidth
    High memory bandwidth of 85 GB/s
 On-board InterConnect Controller (ICC)
    Direct RDMA and global synchronization operations
    No external switch
 Node types
    Compute node
       Consists of CPU, ICC and memory
       No I/O capability except the interconnect
       Four nodes are mounted on a system board
    I/O node
       Same CPU as the compute node
       Includes four PCI Express Gen2 x8 slots
       8 GB/s I/O bandwidth per I/O node
       One node is mounted on an I/O system board (I/O SB)

[Figure: node block diagram (cores, shared L2$, memory controller, memory, ICC), a system board with four CPU and ICC pairs, and an I/O system board with one CPU, one ICC and I/O slots]

SPARC64™ IXfx
 High-performance and low-power multi-core CPU
 High performance core by HPC-ACE
    Increased number of registers, SIMD operation, software-controllable cache, etc.
 VISIMPACT: supports a highly efficient hybrid execution model (thread + process)
    Shared second-level cache, hardware barrier among cores and compiler support

SPARC64™ IXfx specifications
Architecture                       SPARC V9 + HPC-ACE
No. of FP operations/clock/core    8 (= 4 multiply-and-add)
No. of cores                       16
Peak performance and clock         236.5 GFlops @ 1.848 GHz
Memory bandwidth                   85 GB/s
Power consumption                  110 W (typical)

 High performance-per-power ratio and high reliability
    Water cooling system has lowered the CPU temperature and leakage current
    Wide-ranging error detection/self-recovery functions, instruction retry function

[Figure: die layout with 16 cores, L2$ data and L2$ control, four MACs, two DDR3 interfaces and HSIO]

Overview of HPC-ACE

“High Performance Computing - Arithmetic Computational Extensions”
 Extended number of integer registers and floating point registers
 Software-controllable “Sector Cache”
 Flexible Single Instruction Multiple Data (SIMD) operation
 Hardware barrier synchronization for VISIMPACT
 VISIMPACT: automatic thread-parallelization compiler technology
 Other special features
 XFILL instruction
 Reciprocal approximation instruction
 Reciprocal square root approximation instruction
 Trigonometric function acceleration instructions


HPC-ACE: Extended Number of Registers
 Enables larger loop unrolling and eliminates register spills

 Integer registers
    SPARC-V9: 160 / 32
    HPC-ACE: 192 / 64
 Double precision floating-point registers
    SPARC-V9: 32
    HPC-ACE: 256 (Scalar) / 128 (SIMD)

[Figure: integer registers shown as the V9 register window (160) and V9 registers (32) plus 32 added by HPC-ACE; floating-point registers shown as 32 V9 registers plus 224 added by HPC-ACE, split into SIMD basic and SIMD extended halves]

HPC-ACE:Number of FP registers extension (1)
 NPB3.3-LU high cost loop
 By using extended number of registers, compiler can generate more efficient
scheduling and also eliminate unnecessary memory operations

[Figure: bar chart of elapsed time in seconds for the jacld loop (lu, proc0) with 32 registers and with 256 registers; x 1.42 improvement with 256 registers]

HPC-ACE: Number of FP registers extension (2)
 Performance boost by 256 FP registers w/ 138 application program kernels

[Figure: performance improvement by extending the number of FP registers from 32 to 256, plotted as improvement ratio against program number; average 120%, max. 252%]

HPC-ACE: Sector Cache (1)
 Increases the cache hit rate by selectively keeping reused data in the cache
 The cache is divided into two sectors (Sectors 0 and 1).
    Sector 1 is used for data that will be reused. Reusable data are loaded by a special load instruction.
    Sector 0 is used for other data. It works with the ordinary cache replacement policy.
 Data in Sector 1, which will be used again soon, is no longer evicted from the cache by accesses to data that uses Sector 0.
 The user can specify the data to be retained in Sector 1 on the compiler directive line.

Dividing the N ways of the L2 cache as follows:
N1: Sector 0
N2: Sector 1

!ocl CACHE_SECTOR_SIZE(N1,N2)
!ocl CACHE_SUBSECTOR_ASSIGN(a)
do j=1,m
  do i=1,n
    a(i) = a(i) + b(i,j) * c(i,j)
  enddo
enddo

• Array a is held in Sector 1.
• All others are held in Sector 0.
Array a is no longer evicted from the cache by references to array b or c.
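The effect on the loop above can be pictured with a toy model. The Python sketch below is an assumption-laden simplification, not the SPARC64 IXfx implementation: it treats the cache as one fully associative LRU pool of lines, gives each array element its own line, and models the sector split as two independent LRU pools. The class and function names are hypothetical.

```python
from collections import OrderedDict


class LRU:
    """Toy fully associative LRU pool holding at most `ways` lines."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()

    def access(self, line):
        """Touch a line; return True on a hit, False on a miss."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)
        else:
            if len(self.lines) >= self.ways:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[line] = True
        return hit


def hits_on_a(n, m, cache_a, cache_other):
    """Replay the a(i) = a(i) + b(i,j) * c(i,j) loop; count hits on a."""
    hits = 0
    for j in range(m):
        for i in range(n):
            hits += cache_a.access(("a", i))
            cache_other.access(("b", i, j))
            cache_other.access(("c", i, j))
    return hits


shared = LRU(8)  # no sectors: a, b and c compete for all 8 ways
print("hits on a without sectors:", hits_on_a(4, 3, shared, shared))
# Sector 1 (4 ways) holds a; Sector 0 (4 ways) holds b and c.
print("hits on a with sectors:   ", hits_on_a(4, 3, LRU(4), LRU(4)))
```

In this model the streaming references to b and c push a out of the shared pool before it is reused, while the partitioned pools keep a resident after the first pass over j.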


HPC-ACE:Sector Cache (2)
 NPB3.3-CG case
 By putting array P on sector 1, floating point data cache access wait is reduced

[Figure: bar chart of elapsed time in seconds for NPB3.3-CG without and with the sector cache; x 1.23 improvement with the sector cache]


HPC-ACE: SIMD (Single Instruction Multiple Data)
 Eight floating-point ops can be executed simultaneously per core
    Two SIMD instructions can be executed simultaneously per core
    A SIMD instruction executes two floating-point ops (single or double precision)
    FMA is supported
 Software can flexibly perform SIMD optimization
    It is possible to execute operations in SIMD by obtaining pieces of data one by one from noncontiguous memory spaces
    It is possible to selectively store floating-point registers into memory (mask operation)

[Figure: floating-point registers arranged as SIMD[0] to SIMD[127], each pairing a SIMD basic register f[0], f[2], ..., f[254] with a SIMD extended register f[256], f[258], ..., f[510], feeding the floating-point pipelines]
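The figure of eight floating-point operations per clock, and from it the CPU peak performance, can be rebuilt from the numbers on these slides. The Python sketch below does so; counting an FMA as two operations is the usual convention and is stated here as an assumption, and the names are illustrative.

```python
# Per-core issue capabilities as described on the SIMD slide.
SIMD_INSTRUCTIONS_PER_CLOCK = 2
LANES_PER_SIMD_INSTRUCTION = 2
FLOPS_PER_FMA = 2  # assumption: one fused multiply-add counts as two flops

FLOPS_PER_CLOCK_PER_CORE = (SIMD_INSTRUCTIONS_PER_CLOCK
                            * LANES_PER_SIMD_INSTRUCTION
                            * FLOPS_PER_FMA)


def peak_gflops(cores, clock_ghz):
    """Peak GFlops of one CPU: cores x flops per clock x clock in GHz."""
    return cores * FLOPS_PER_CLOCK_PER_CORE * clock_ghz


print("flops/clock/core:", FLOPS_PER_CLOCK_PER_CORE)
print("SPARC64 IXfx   :", peak_gflops(16, 1.848), "GFlops")
print("SPARC64 VIIIfx :", peak_gflops(8, 2.0), "GFlops")
```

The results agree with the quoted 236.5 GFlops at 1.848 GHz for the 16-core SPARC64 IXfx and 128 GFlops at 2 GHz for the 8-core SPARC64 VIIIfx.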

HPC-ACE: SIMD extension (mask operation effect)
 Example of a computational chemistry program
    Because of the branch ("if") inside the loop, the SIMD option alone shows no effect
    By using the mask operation, the compiler can SIMDize the loop and apply software pipelining. The result is a 2.5x performance improvement

[Figure: bar chart of elapsed time in seconds for the nosimd, simd and simd=2 builds; x 2.5 improvement]
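The idea behind the mask operation can be sketched in plain Python. The loop body and function names below are hypothetical and are not taken from the chemistry program: the scalar version branches per element, while the masked version computes every lane unconditionally and lets a per-lane mask decide which results are stored, which is what removes the branch from the loop body.

```python
def scalar_loop(x, y):
    """Reference loop with a data-dependent branch in the body."""
    out = list(y)
    for i in range(len(x)):
        if x[i] > 0.0:
            out[i] = x[i] * 2.0 + y[i]
    return out


def masked_simd_loop(x, y, width=2):
    """Same result, processed `width` lanes at a time without a branch
    in the arithmetic; the store is controlled by a per-lane mask."""
    out = list(y)
    for base in range(0, len(x), width):
        lanes = range(base, min(base + width, len(x)))
        values = [x[i] * 2.0 + y[i] for i in lanes]  # computed for all lanes
        mask = [x[i] > 0.0 for i in lanes]           # per-lane condition
        for i, value, keep in zip(lanes, values, mask):
            if keep:                                 # models the masked store
                out[i] = value
    return out


x = [1.0, -2.0, 3.0, 0.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
print(scalar_loop(x, y))
print(masked_simd_loop(x, y))
```

Both functions return the same list; the width of 2 mirrors the two floating-point operations per SIMD instruction described on the previous slide.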


HPC-ACE: XFILL capability
 XFILL is effective in an earthquake simulation program
    XFILL fills an L2 cache line with undetermined data (it allocates the cache line without loading data)
    With XFILL issued in advance, the following FP register store instructions hit in the cache and do not cause a data load from memory
    XFILL can reduce memory read accesses and improve performance when memory throughput is the bottleneck

[Figure: bar chart of elapsed time in seconds for pdiffz3_m4 without and with XFILL; x 1.5 improvement with XFILL]
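Why skipping the line load helps a memory-bound kernel can be shown with a simple traffic count. The Python sketch below is a rough model under stated assumptions: every store misses in the cache, a store miss without XFILL first loads the line from memory, and every written line is eventually written back. The kernel shape is hypothetical and is not the earthquake code.

```python
def memory_traffic_bytes(n_elements, loads_per_store, use_xfill, elem_bytes=8):
    """Bytes moved between memory and cache for a streaming kernel that
    stores n_elements values and reads loads_per_store inputs per store."""
    reads = n_elements * loads_per_store * elem_bytes
    # Without XFILL the destination line is fetched before being overwritten.
    store_fill = 0 if use_xfill else n_elements * elem_bytes
    writeback = n_elements * elem_bytes
    return reads + store_fill + writeback


for loads in (1, 2):
    without = memory_traffic_bytes(1000, loads, use_xfill=False)
    with_xfill = memory_traffic_bytes(1000, loads, use_xfill=True)
    print(loads, "load(s) per store: traffic ratio", without / with_xfill)
```

In this model the saving shrinks as the number of input streams per store grows, so the benefit depends on the read-to-write mix of the loop.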


VISIMPACT technology
 Fine-grain thread-parallelization
 Low-overhead barrier synchronization with HPC-ACE ASI registers
 Coalesced memory access exploits shared L2 cache
 “Virtual Single Processor by Integrated Multi-core Parallel Architecture”

Vectorization              Conventional threading     VISIMPACT
DO J=1,N     (serial)      DO J=1,N     (parallel)    DO J=1,N     (serial)
  DO I=1,M   (vector)        DO I=1,M   (serial)        DO I=1,M   (parallel)
    A(I,J)=...                 A(I,J)=...                 A(I,J)=...
  END                        END                        END
END                        END                        END

Conventional threading requires separate or large L2 cache
 Fujitsu compilers support VISIMPACT automatic parallelization

VISIMPACT technology
 The Fujitsu compiler automatically transforms MPI programs into hybrid parallel executions by parallelizing each process on a CPU into multiple threads that run on the cores
 Reducing the number of ranks improves communication efficiency
 The inter-core hardware barrier and the shared L2 cache help efficient execution

[Figure: two nodes joined by the interconnect in each model. VISIMPACT model: each process runs as multi-threaded parallel processing, with threads (T) on the cores. Pure-MPI model: one process (P) per core, with inter-process communication between the parallel processes]
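The reduction in rank count can be made concrete with a small calculation. The Python sketch below assumes 16 cores per node (the SPARC64 IXfx core count) and counts point-to-point messages in a naive all-to-all, where every rank sends one message to every other rank; the function names are illustrative.

```python
CORES_PER_NODE = 16  # SPARC64 IXfx cores per CPU, one CPU per node


def mpi_ranks(nodes, threads_per_rank):
    """Number of MPI ranks when each rank owns `threads_per_rank` cores."""
    return nodes * CORES_PER_NODE // threads_per_rank


def all_to_all_messages(ranks):
    """Messages in a naive all-to-all: each rank sends to every other rank."""
    return ranks * (ranks - 1)


nodes = 96  # one FX10 rack
for label, threads in [("pure MPI", 1), ("hybrid", CORES_PER_NODE)]:
    ranks = mpi_ranks(nodes, threads)
    print(label, ":", ranks, "ranks,", all_to_all_messages(ranks), "messages")
```

For one rack the hybrid model uses 96 ranks instead of 1,536, and the naive all-to-all message count falls by roughly the square of that factor.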

6D-Mesh/Torus Network Topology
 Higher bisection bandwidth and fewer hops than a 3D torus
 Torus fusion
    Every XYZ Cartesian grid point has another ABC 3D mesh/torus
    X, Z and B are torus (ring) axes
    A, C and Y are mesh (linear) axes

[Figure: conceptual model showing the X, Y and Z axes and the A, B and C axes]
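The axis rules above determine how many direct neighbours a node has. The Python sketch below enumerates them; it assumes the axis order X, Y, Z, A, B, C and borrows the 24×18×16×2×3×2 shape quoted on the later all-to-all trace slide, and the function name is hypothetical.

```python
AXES = ("X", "Y", "Z", "A", "B", "C")  # assumed coordinate order
TORUS_AXES = {"X", "Z", "B"}           # ring axes; A, C and Y are mesh axes


def neighbors(coord, shape):
    """Return the set of coordinates one hop away from `coord`."""
    result = set()
    for axis, name in enumerate(AXES):
        for step in (-1, 1):
            pos = coord[axis] + step
            if name in TORUS_AXES:
                pos %= shape[axis]            # wrap around the ring
            elif not 0 <= pos < shape[axis]:
                continue                      # mesh axes stop at the edge
            if pos != coord[axis]:
                hop = list(coord)
                hop[axis] = pos
                result.add(tuple(hop))
    return result


shape = (24, 18, 16, 2, 3, 2)
print(len(neighbors((5, 5, 5, 0, 1, 0), shape)))  # node inside the Y range
print(len(neighbors((5, 0, 5, 0, 1, 0), shape)))  # node on a Y mesh edge
```

A node away from the Y edges gets 2 neighbours on each of X, Y, Z and B and 1 on each of A and C, which is consistent with the 10 links per node in the specification table.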

Virtual Topology
 System software generates a virtual 1d-, 2d- or 3d-torus for an arbitrary size of 6d-cuboid

[Figure: a 6d-cuboid on the X, Y, Z, A, B and C axes with its nodes numbered along a virtual torus, and a 12-node virtual torus numbered 0 to 11 on a B x Y plane:]

10 9 3 4
B 11 8 7 6
Y 0 1 2 5

 Virtual topology expands the range of applicable algorithms


ICC : Tofu Interconnect Controller
 Companion chip for SPARC64™ VIIIfx / IXfx processors
 Tofu Interconnect
    4 Tofu Network Interfaces
    Tofu Network Router
 Host Bus Interface
 PCI Express Gen2
    2 ports for I/O nodes
 Water-cooled

Process technology       65 nm
Die size                 18.2 mm x 18.1 mm
Frequency                312.5 MHz
No. of Tofu links        10 ports
Tofu link throughput     in 5 GB/s + out 5 GB/s
PCI Express Gen2         8 lanes x 2 ports
Host Bus Interface       in 20 GB/s + out 20 GB/s
Power consumption        28 W (typical)
No. of transistors       200 million
Signal transfer speed    6.25 Gbps
Differential signals     128 lanes

[Figure: die layout showing Tofu network routing and link blocks, Tofu network interfaces with Tofu barrier, crossbar, host bus interface and PCI Express]


Static and Dynamic Failure Avoidance
 Static Failure Avoidance
 Pre-calculated routing table
 For intra-job communication
 Dynamic Failure Avoidance
 Time-out detection by the protocol
 For I/O communication

Failure


Fault Isolation by Virtual Topology
 Jobs using a virtual topology can use a rectangular region that includes a failed node

10 9 3 4
B 11 8 7 6
Y 0 1 2 5

9 8 7 6
B 10 3 4
Y 0 1 2 5

 Decreases in executable job size and in system availability are minimized
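The two node numberings shown for the B x Y plane can be checked mechanically. The Python sketch below assumes that rows run along B (a torus axis of size 3, so the top and bottom rows are adjacent) and columns along Y (a mesh axis), and verifies that consecutive virtual-torus numbers always sit on physically adjacent nodes, both for the full 12-node plane and for the 11-node layout that skips the failed node. The helper names are hypothetical.

```python
FULL = [[10, 9, 3, 4],
        [11, 8, 7, 6],
        [0, 1, 2, 5]]

WITH_FAILURE = [[9, 8, 7, 6],
                [10, None, 3, 4],   # None marks the failed node
                [0, 1, 2, 5]]


def adjacent(p, q, rows):
    """True if grid positions p and q are one hop apart."""
    (r1, c1), (r2, c2) = p, q
    if c1 == c2:                       # move along B: torus, wraps around
        return (r1 - r2) % rows in (1, rows - 1)
    if r1 == r2:                       # move along Y: mesh, no wrap
        return abs(c1 - c2) == 1
    return False


def is_ring(grid):
    """Check that numbers 0..k-1 form a closed ring of adjacent nodes."""
    pos = {v: (r, c)
           for r, row in enumerate(grid)
           for c, v in enumerate(row) if v is not None}
    order = sorted(pos)
    return all(adjacent(pos[order[k]], pos[order[(k + 1) % len(order)]],
                        len(grid))
               for k in range(len(order)))


print(is_ring(FULL), is_ring(WITH_FAILURE))
```

Both layouts pass the check, so the job keeps a closed ring in the same rectangular region with only the failed node removed.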


All-to-all communication performance
 Link utilization is important for actual communications
 New optimized algorithm
    Uses all links uniformly to maximize All-to-All communication performance
    Four RDMA engines execute 4 sends and 4 receives simultaneously
 Using Tofu features
    Virtual 3D-Torus
    Flow-control features for congestion prevention
 Many applications use All-to-All type of communication and enjoy this acceleration

[Figure: All-to-All throughput in GB/s (0 to 4) against message size in bytes (1.E+00 to 1.E+06). Legend: Tofu (8x4x8=256), InfiniBand QDR (256), New algorithm]


All-to-all communication trace on Tofu

Trace Result of the K computer

System configuration of Tofu
24×18×16×2×3×2 ＝ 82,944 nodes
Each node transfers 32KB

Left: new algorithm, elapsed time 2.77 sec
Right: standard OpenMPI (pair-wise exchange), elapsed time 24.08 sec

Colors show link utilization and wait time
Greener: higher utilization
Redder: longer wait time
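The node count and the gain between the two runs can be rechecked from the numbers on this slide. The Python sketch below multiplies out the six Tofu dimensions and compares the elapsed times; it assumes the 2.77 sec figure belongs to the new algorithm and the 24.08 sec figure to the pair-wise exchange.

```python
from math import prod

TOFU_SHAPE = (24, 18, 16, 2, 3, 2)  # dimensions quoted on the slide
NEW_ALGORITHM_SEC = 2.77            # assumed to be the new algorithm
PAIRWISE_SEC = 24.08                # assumed to be standard OpenMPI

nodes = prod(TOFU_SHAPE)
speedup = PAIRWISE_SEC / NEW_ALGORITHM_SEC

print("nodes:", nodes)
print("speedup:", round(speedup, 1))
```

The product of the dimensions reproduces the 82,944 nodes, and the elapsed times correspond to a speedup of roughly 8.7x.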

FX10 Software Stack

Applications
HPC Portal / System Management Portal
Technical Computing Suite
 System Management
    System management
    System control
    System monitoring
    System operation support
 Job Management
    Job manager
    Job scheduler
    Resource management
    Parallel job execution
 High Performance Parallel File System FEFS
    Lustre based high performance distributed file system
    High scalability, high reliability and availability
 Automatic parallelization compiler
    Fortran
    C
    C++
 Tools and math. libraries
    Programming support tools
    Mathematical libraries (SSL II/BLAS etc.)
 Parallel languages and libraries
    OpenMP
    MPI
    XPFortran
Linux based OS enhanced for FX10
PRIMEHPC FX10

Lustre Extension of FEFS: Features

Feature classes in the figure: Reuse, Extended, New

 Large scale: max file size, max number of files, max client number, max stripe count, 512KB block
 High performance: file striping, parallel I/O, client cache, server cache, MDS response, I/O zoning, OS jitter reduction
 Network: Tofu interconnect, IB/Ether, IB multi-rail, LNET router
 Operations management: Lustre ACL, QoS, disk quota, directory quota, dynamic configuration change
 Connectivity: Lustre mount, NFS export
 Reliability: failover, RAS, journal / fsck


FEFS performance
* : Collaborative work with RIKEN on the K computer
 Achieved the world's top-level throughput*
    Read 334 GB/s, Write 249 GB/s
    (574 OSSs, 18432 Clients, 192 racks)

[Figure: throughput in GB/s (0 to 400) against number of OSSs (0 to 600), with read direct and write direct series]

 Metadata performance of mdtest* (distributed directory)

IOPS      FEFS            FEFS       Lustre 1.8.5   Lustre 2.0.0.1
          K computer**    IA***      IA***          IA***
create    34697.6         31803.9    24628.1        17672.2
unlink    39660.5         26049.5    26419.5        20231.5
mkdir     87741.6         77931.3    38015.5        22846.8
rmdir     28153.8         24671.4    17565.1        13973.4
** : MDS:RX300S6 (X5680 3.33 GHz 6core x2, 48GB, IB(QDR)x2)
*** : MDS:RX200S5 (E5520 2.27GHz 4core x2, 48GB, IB(QDR)x1)
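The throughput result can be broken down per server. The Python sketch below divides the quoted read and write rates by the number of OSSs and checks that the client count matches 192 racks of 96 compute nodes (24 system boards of 4 nodes, from the hardware comparison); the variable names are illustrative.

```python
READ_GBS = 334.0
WRITE_GBS = 249.0
OSS_COUNT = 574
CLIENT_COUNT = 18432
RACKS = 192
NODES_PER_RACK = 24 * 4  # system boards per rack x nodes per board

read_per_oss = READ_GBS / OSS_COUNT
write_per_oss = WRITE_GBS / OSS_COUNT
clients_per_oss = CLIENT_COUNT / OSS_COUNT

print("read per OSS  :", round(read_per_oss, 2), "GB/s")
print("write per OSS :", round(write_per_oss, 2), "GB/s")
print("clients per OSS:", round(clients_per_oss, 1))
print("clients match racks:", CLIENT_COUNT == RACKS * NODES_PER_RACK)
```

Each OSS delivers a little under 0.6 GB/s for reads and a little over 0.4 GB/s for writes while serving about 32 clients.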

Language System overview
 C/C++/Fortran compiler
 Programming model (OpenMP, MPI, XPFortran)
 Instruction level / loop level optimization using HPC-ACE
 Debugging and tuning tools for highly parallel computers

Intra node
 Programming languages: Fortran 2003, C, C++, OpenMP 3.0
 Compiler optimization
    Instruction level: instruction scheduling, SIMDization
    Loop level: automatic parallelization
 Programming tools: IDE, debugger, profiler
 Math. libraries: BLAS, LAPACK, SSL II

Inter node
 Programming languages and MPI: XPFortran *1, MPI 2.1
 Programming tools: RMATT *2
 Math. libraries: ScaLAPACK

*1: eXtended Parallel Fortran (Distributed Parallel Fortran)
*2: Rank Map Automatic Tuning Tool

Programming Environment

[Figure: FX10 programming environment linking the user client (IDE, interactive debugger GUI, profiler, visualized data), the login node (IDE interface, command and job control, debugger interface, data converter) and the compute nodes (applications, debugger, data sampler, sampling data, stage out)]


Application Tuning Cycle and Tools

Tuning cycle: Execution, Overall Tuning, MPI Tuning, CPU Tuning
FX10 specific tools: Job Information, Profiler, Profiler snapshot, RMATT, Tofu-PA
Open source tools: Vampir-trace, PAPI


On Course to Exascale
 World’s first 1 Exa-Flops computer is expected to appear by 2020


Towards exascale
 Realization of an exascale system is a grand challenge
    At least two-step development is necessary
    The biggest challenge is high density and low power consumption
 Fujitsu is developing a Trans-Exa system as a midterm goal
    The Trans-Exa system is expected to be scalable to 100 Petaflops
    Employs
       Wide SIMD and multicore CPU
       High performance and lower power consumption interconnect
       High performance and high density memory technologies
 Continues to invest effort in research for the exascale system
    Higher performance and lower power consumption technologies
    Technologies for higher reliability

[Figure: roadmap on a 2010, 2015, 2020 timeline from the K computer (No.1 in Top500, June and Nov. 2011) through the Trans-Exa system to the exascale system]

Key technology developments on Trans-Exa

Goal: significant improvement of power efficiency and high density

Technology
 Silicon tech. ⇒ Employs the latest tech.
 Innovative memory tech. ⇒ High density & BW memory
 System integration tech. ⇒ Higher integration & density
 The latest optical tech. ⇒ High speed signal transfer

Gains
 Performance / power consumption
 Performance / rack
 Accumulation of key technologies toward exascale systems

