Your SlideShare is downloading. ×
Fujitsu - Technologies beyond-the-k-computer
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Fujitsu - Technologies beyond-the-k-computer


Published on

The K computer Day is a special one-day session, jointly sponsored by RIKEN, Fujitsu, University of Pavia and Comitato TACC. …

The K computer Day is a special one-day session, jointly sponsored by RIKEN, Fujitsu, University of Pavia and Comitato TACC.

Part of TACC-2012 (Theory and Applications of Computational Chemistry), the day will consist of a series of presentations, interactive discussions and conclude with an exhibition of the K computer and its applications.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Technologies beyond the K computerSeptember 5th, 2012Takashi AokiNext Generation Technical Computing UnitFujitsu Limited
  • 2. Agenda Corporate profile Fujitsu supercomputer past and present Second generation Petascale supercomputer PRIMEHPC FX10  Hardware  Software Challenge to the futureSep 5th, 2012 TACC-2012 1/41 Copyright 2012 FUJITSU LIMITED
  • 3. Who we are Japan’s largest IT services provider and No. 3 in the world. * We do everything in ICT. We use ourexperience and the power of ICT to shape the future of society with our customers. Over 170,000 Fujitsu people support customers in more than 100 countries. *2011 IT Services Vendor Revenue. Source: Gartner, "Market Share: IT Services, 2011" 9 April 2012Sep 5th, 2012 TACC-2012 2/41 Copyright 2012 FUJITSU LIMITED
  • 4. Our products and services Technology Solutions Services Systems platformOur datacenters in the world PRIMERGY ETERNUS Supercomputer TX120 DX8000 PRIMEHPC FX10Ubiquitous Product Solutions Device solutions LIFEBOOK Smart phone Tablet PC High-end multi-core FM3 family FRAM E751C F07D ARROWS processor (32-bit RISC MCU) (Ferroelectric SPARC64 VII+ Random Access Memory)Sep 5th, 2012 TACC-2012 3/41 Copyright 2012 FUJITSU LIMITED
  • 5. Where we work ‘shaping tomorrow with you’ wherever you are. As of March 2012 EMEA 31,000 Japan 107,000 Americas 8,000 Asia-Pacific 27,000 Over 170,000 Fujitsu colleagues working with customers in over 100 countriesSep 5th, 2012 TACC-2012 4/41 Copyright 2012 FUJITSU LIMITED
  • 6. Fujitsu HPC Servers - past and present - FX10 No.1 in Top500 (June and Nov., 2011) K computer FX1 World’s Fastest Vector Processor (1999) Most Efficient Performance VPP5000 SPARC NWT* in Top500 (Nov. 2008) Enterprise Developed with NAL No.1 in Top500 PRIMEQUEST VPP300/700 PRIMERGY (Nov. 1993) PRIMEPOWER CX400 Gordon Bell Prize Skinless server HPC2500 (1994, 95, 96) (coming soon) VPP500 World’s Most Scalable PRIMERGY VP Series BX900 Supercomputer Cluster node (2003) AP3000 HX600 Cluster node F230-75APU AP1000 PRIMERGY RX200 Cluster node Japan’s Largest Cluster in Top500 Japan’s First (July 2004) *NWT: Vector (Array) Supercomputer Numerical Wind Tunnel (1977)Sep 5th, 2012 TACC-2012 5/41 Copyright 2012 FUJITSU LIMITED
  • 7. HPC Platform Solutions - Hardware - Full range coverage with choice of HPC hardware platform Petascale High Performance scaling over several PFlops Supercomputer  Fujitsu propriety CPU and interconnect technologies for high performance, high reliability and high operability High Performance de facto HPC cluster x86  Following Intel CPU and MIC roadmap PRIMEHPC HPC Cluster and adopt Fujitsu latest packaging FX10 technologies for high performance and High-end high operability CX400 Skinless server Large-Scale Divisional SMP System BX Series BX900 Departmental BX400 RX900 RX Series PRIMERGY Work Group RX200 seriesSep 5th, 2012 TACC-2012 6/41 Copyright 2012 FUJITSU LIMITED
  • 8. Design targets and features of FX10  High parallel application productivity  High Performance  Easy to achieve high  High peak performance and high performance running highly application performance paralleled programs without inordinate effort of programming Customer ‘s requirement and FX10 design targets  High operability  Low power consumption  High reliability and ease of  K computer compatibility operation  Binary compatibility  Same programing environmentSep 5th, 2012 TACC-2012 7/41 Copyright 2012 FUJITSU LIMITED
  • 9. Design targets and features of FX10  High parallel application productivity  High Performance  Easy to achieve high  High-performance CPU  High peak performance and high  “VISIMPACT *2” supports efficient performance running highly “SPARC64 IXfx” with SPARC V9 application performance hybrid paralleled programs without parallel execution + HPC-ACE architecture inordinate effort of programming High performance, highly  reliable and fault tolerant 6D mesh/torus interconnect “Tofu*1” Customer ‘s requirement and FX10 design targets  Parallel Language, programing tools and Petascale HPC middleware for  High operability high reliability and operability  Low power consumption  High reliability and ease of  K computer compatibility  Water cooling system operation  Binary compatibility  Same programing  High reliability components & functions based environment on mainframe development experience *1) Tofu: Torus Fusion *2) VISIMPACT: Virtual Single Processor by Integrated Multicore Parallel ArchitectureSep 5th, 2012 TACC-2012 8/41 Copyright 2012 FUJITSU LIMITED
  • 10. PRIMEHPC FX10 System Configuration SPARC64TM IXfx CPU PRIMEHPC FX10 DDR3 memory ICC (Interconnect Control Chip) Compute node configuration Management servers Compute Nodes Portal servers IO Network Tofu interconnect for I/O Login Network server I/O nodes (IB or GB) File servers Global file system Local disks Local file system Global disk IB: InfiniBand GB: GigaBit EthernetSep 5th, 2012 TACC-2012 9/41 Copyright 2012 FUJITSU LIMITED
  • 11. FX10 System H/W Specifications PRIMEHPC FX10 H/W Specifications Name SPARC64TM IXfx CPU Performance 236.5GFlops@1.848GHz Configuration 1 CPU / Node Node Memory capacity 32, 64 GB Rack Performance/rack 22.7 TFlops No. of compute node 384 to 98,304 System Performance 90.8TFlops to 23.2PFlops (4 ~1024 racks) Memory 12 TB to 6 PB  System rack  96 compute nodes SPARC64TM IXfx CPU  6 I/O nodes  16 cores/socket  With optional water  236.5 GFlops cooling exhaust unit  System  Max. 23.2 PFlops  Max. 1,024 racks  Max. 98,304 CPUs  System board  4 nodes (4 CPUs) Sep 5th, 2012 TACC-2012 10/41 Copyright 2012 FUJITSU LIMITED
  • 12. The K computer and FX10Comparison of System H/W Specifications K computer FX10 Name SPARC64TM VIIIfx SPARC64TM IXfx Performance 128GFlops@2GHz 236.5GFlops@1.848GHz SPARC V9 + Architecture HPC-ACE extension ← L1(I) Cache:32KB/core, CPU L1(D) Cache:32KB/core ← Cache configuration L2 Cache: 6MB(shared) L2 Cache: 12MB(shared) No. of cores/socket 8 16 Memory band width 64 GB/s. 85 GB/s. Configuration 1 CPU / Node ← Node Memory capacity 16 GB 32, 64 GB System board Node/system board 4 Nodes ← System board/rack 24 System boards ← Rack Performance/rack 12.3 TFlops 22.7 TFlopsSep 5th, 2012 TACC-2012 11/41 Copyright 2012 FUJITSU LIMITED
  • 13. The K computer and FX10Comparison of System H/W Specifications (cont.) K computer FX10 Topology 6D Mesh/Torus ← 5GB/s x2 Performance (bi-directional) ← Interconnect No. of link per node 10 ← H/W barrier, reduction ← Additional features no external switch box ← CPU, ICC(interconnect Direct water cooling ← chip), DDCON Cooling Air cooling + Other parts Air cooling Exhaust air water cooling unit (Optional)Sep 5th, 2012 TACC-2012 12/41 Copyright 2012 FUJITSU LIMITED
  • 14. Node configuration Single CPU as a node Node SPARC64™ IXfx  SPARC64TM IXfx based L2$ MC Memory  32/64GB memory capacity Core  Single CPU per node to maximize memory BW Core SX : ctrl ICC  High memory bandwidth of 85 GB/s Core : Core On board InterConnect Controller (ICC) Interconnect I/O  Direct RDMA and global synchronization operations  No external switch CPU Node type ICC CPU  Compute node  Consist of CPU, ICC and memory  No I/O capability except interconnect CPU  Four nodes are mounted on a system board CPU  I/O node  Same CPU as compute node System Board  Includes four PCI Express Gen2 x8 slots  8 GB/s I/O bandwidth per I/O node  One node is mounted on an I/O system board I/O Slots CPU ICC I/O SB thSep 5 , 2012 TACC-2012 13/41 Copyright 2012 FUJITSU LIMITED
  • 15. SPARC64™ IXfx High-performance and low-power multi-core CPU  High performance core by HPC-ACE  Multiply number of register, SIMD operation, software controllable cache, etc.  VISIMPACT : Support highly efficient hybrid execution model (thread + process)  Shared second cache, hardware barrier among cores and compiler support SPARC64™ IXfx specifications Architecture SPARC V9 + HPC-ACE # of FP operations 8 (= 4 Multiply and Add ) HSIO /clock/core No. of cores 16 Core Core Core Core Peak performance 236.5 Gflops@1.848GHz Core Core Core Core and clock Memory bandwidth 85 GB/s DDR3 interface DDR3 interface Power L2$ Data L2$ Data MAC MAC 110 W (typical) consumption L2$ MAC MAC Control  High performance-per-power ratio and L2$ Data L2$ Data High reliability  Water cooling system has lowered the CPU Core Core Core Core temperature and leak current  Wide-ranging error detection/self-recovery Core Core Core Core functions, instruction retry functionSep 5th, 2012 TACC-2012 14/41 Copyright 2012 FUJITSU LIMITED
  • 16. Overview of HPC-ACE“High Performance Computing - Arithmetic Computational Extensions” Extended number of integer registers and floating point registers Software-controllable “Sector Cache” Flexible Single Instruction Multiple Data (SIMD) operation Hardware barrier synchronization for VISIMPACT  VISIMPACT: automatic thread-parallelization compiler technology Other special features  XFILL instruction  Reciprocal approximation instruction  Reciprocal square root approximation instruction  Trigonometric function acceleration instructionsSep 5th, 2012 TACC-2012 15/41 Copyright 2012 FUJITSU LIMITED
  • 17. HPC-ACE:Extended Number of Registers Enables larger loop unrolling and eliminates register spills Integer registers  SPARC-V9 160 / 32 V9 Register V9 32 Window  HPC-ACE 192 / 64 HPC-ACE 32 160 32 HPC-ACE Double precision floating-point registers  SPARC-V9 32 V9 32  HPC-ACE 256 (Scalar) / 128 (SIMD) SIMD HPC-ACE basic 224 SIMD extendedSep 5th, 2012 TACC-2012 16/41 Copyright 2012 FUJITSU LIMITED
  • 18. HPC-ACE:Number of FP registers extension (1) NPB3.3-LU high cost loop  By using extended number of registers, compiler can generate more efficient scheduling and also eliminate unnecessary memory operations 1.6E+01 x 1.42 improvement 1.4E+01 1.2E+01 [sec] 1.0E+01 8.0E+00 6.0E+00 4.0E+00 2.0E+00 0.0E+00 lu proc0 jacld-loop 32reg lu proc0 jacld-loop 256reg 32 registers 256 registersSep 5th, 2012 TACC-2012 17/41 Copyright 2012 FUJITSU LIMITED
  • 19. HPC-ACE:Number of FP registers extension (2) Performance boost by 256 FP registers w/ 138 application program kernels Performance improvement Average 120% Improved ratio Max. 252% Program No. Performance improvement by # of FP registers extension(from 32 to 256)Sep 5th, 2012 TACC-2012 18/41 Copyright 2012 FUJITSU LIMITED
  • 20. HPC-ACE:Sector Cache(1) Increasing the cache hit rate by selectively leave a reused data in the cache  The cache is divided into two sectors (Sectors 0 and 1). Cache  Sector 1 is used for data that will be reused. Reusable data are  Sector 0 is used for other data. Works in ordinary cache loaded by special replacement policy load inst.  Data in Sector 1, which will be used again soon, is no longer removed from cache, by the access of data that uses Sector 0. Sector 0 Sector 1  The user can specify the data to be retained in Sector 1 by specifying it on the compiler directive line. Dividing N ways of the L2 cache as follows: N1: Sector 0 N2: Sector 1 !ocl CACHE_SECTOR_SIZE(N1,N2) !ocl CACHE_SUBSECTOR_ASSIGN(a) do j=1,m Array a is no longer removed from the do i=1,n a(i) = a(i) + b(i,j) * c(i,j) cache by references to array b or c. enddo Enddo • Array a is held in Sector 1. • All others are held in Sector 0.Sep 5th, 2012 TACC-2012 19/41 Copyright 2012 FUJITSU LIMITED
  • 21. HPC-ACE:Sector Cache (2) NPB3.3-CG case  By putting array P on sector 1, floating point data cache access wait is reduced [sec.] 2.5E-01 x 1.23 improvement 2.0E-01 1.5E-01 1.0E-01 5.0E-02 0.0E+00 w/o改善前 $ sector with 改善後 $ sectorSep 5th, 2012 TACC-2012 20/41 Copyright 2012 FUJITSU LIMITED
  • 22. HPC-ACE: SIMD (Single Instruction Multiple Data) Eight floating-point ops can be executed Floating-point Registers simultaneously per core SIMD SIMD  Two SIMD instructions can be executed basic extended simultaneously per core SIMD[0] f [0] f [256]  SIMD instruction executes two floating- SIMD[1] f [2] f [258] point ops (single or double precision)  FMA is supported SIMD[126] f [252] f [508] Software can flexibly perform SIMD SIMD[127] f [254] f [510] optimization  It is possible to execute operations in SIMD by obtaining pieces of data one by one from noncontiguous memory spaces Operation  It is possible to selectively store floating Operation register into memory (mask operation) A C B D Floating-point PipelinesSep 5th, 2012 TACC-2012 21/41 Copyright 2012 FUJITSU LIMITED
  • 23. HPC-ACE:SIMD extension (mask operation effect) Example of Computational chemistry program  Due to the branch operation, “if” in the loop, SIMD option shows NO effect  By using mask operation, compiler can SIMDize the loop and utilize software pipelining. Results 2.5x performance improvement [sec.] 1.0E-01 x 2.5 9.0E-02 improvement 8.0E-02 7.0E-02 6.0E-02 5.0E-02 4.0E-02 3.0E-02 2.0E-02 1.0E-02 0.0E+00 -1.0E-02 nosimd simd simd=2Sep 5th, 2012 TACC-2012 22/41 Copyright 2012 FUJITSU LIMITED
  • 24. HPC-ACE:XFILL capability XFILL capability works in Earthquake simulation program  XFILL fills L2 cache line with undetermined data(allocate cache line without data load)  So, with XFILL in advance, following FP reg store instructions should hit and would not cause data load from memory  XFILL can reduce memory read accesses and improve performance when a memory throughput is the bottleneck [sec.] 1.0E-01 x 1.5 improvement 9.0E-02 8.0E-02 7.0E-02 6.0E-02 5.0E-02 4.0E-02 3.0E-02 2.0E-02 1.0E-02 0.0E+00 without XFILL pdiffz3_m4 with XFILL pdiffz3_m4 xfillSep 5th, 2012 TACC-2012 23/41 Copyright 2012 FUJITSU LIMITED
  • 25. VISIMPACT technology Fine-grain thread-parallelization  Low-overhead barrier synchronization with HPC-ACE ASI registers  Coalesced memory access exploits shared L2 cache  “Virtual Single Processor by Integrated Multi-core Parallel Architecture” Vectorization Conventional Threading VISIMPACT DO J=1,N P DO J=1,N DO J=1,N DO I=1,M P DO I=1,M P DO I=1,M A(I,J)=... P A(I,J)=... P A(I,J)=... END P END P END END P END END Parallel Vector Serial Parallel Serial Serial requires separate or large L2 cache Fujitsu compilers support VISIMPACT automatic parallelizationSep 5th, 2012 TACC-2012 24/41 Copyright 2012 FUJITSU LIMITED
  • 26. VISIMPACT technology Fujitsu compiler transforms MPI programs to hybrid parallel executions automatically, by parallelizing a process on a CPU into multi-threads to cores By reducing the number of ranks, communication efficiency would be improved Inter-core hardware barrier and shared L2 cache help efficient execution VISIMPACT model pure-MPI model Interconnect Interconnect Node0 Node1 Node0 Node1 Process Process Process T T T T T T T T P P P P P P P P CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Multi-threads Parallel process parallel process : Process : Thread Inter process P T communicationSep 5th, 2012 TACC-2012 25/41 Copyright 2012 FUJITSU LIMITED
  • 27. 6D-Mesh/Torus Network Topology Higher bisection bandwidth and smaller hops than 3D-Torus Torus fusion  Every XYZ Cartesian grid point has another ABC 3D-Torus  X, Z and B are torus (ring) axes  A, C and Y are mesh (linear) axes Z B C A X Y Conceptual ModelSep 5th, 2012 TACC-2012 26/41 Copyright 2012 FUJITSU LIMITED
  • 28. Virtual Topology System software generates virtual 1d-, 2d- or 3d-torus for an arbitrary size of 6d-cuboid 4 3 5 2 6d-cuboid 4 6 1 3 5 2 7 X 0 6 1 C A Z 7 10 9 3 4 B 0 11 8 7 6 Y 0 1 2 5 Virtual topology expands the range of applicable algorithmsSep 5th, 2012 TACC-2012 27/41 Copyright 2012 FUJITSU LIMITED
  • 29. ICC : Tofu Interconnect Controller Companion chip for SPARC64TM VIIIfx / IXfx processors Tofu Interconnect  4 Tofu Network Interfaces  Tofu Network Router Host Bus Interface PCI Express Gen2 PCI Express  2 ports for I/O nodes Tofu Network Tofu Network Routing Routing Routing Routing Routing Routing Routing Routing Routing Routing Routing Routing Water-cooled // Link // Link Interface Interface PCI Link Link ExpressProcess technology 65 nm // Link // Link Routing RoutingDie size 18.2 mm x 18.1 mm Link Link / Link Tofu NetworkFrequency 312.5 MHz Interface Tofu Network withNo. of Tofu link 10 ports Interface Tofu Barrier // Link // Link Link Link Interface / LinkTofu link throughput in 5 GB/s + out 5 GB/sPCI Express Gen2 8 lane×2 ports CrossbarHost Bus Interface in 20 GB/s + out 20 GB/s // Link // Link Routing Routing Routing Routing Link LinkPower consumption 28 W (typical) / Link / Link / Link / LinkNo. of transistors 200 millionSignal Transfer Speed 6.25 GbpsDifferential signals 128 lanesSep 5th, 2012 TACC-2012 28/41 Copyright 2012 FUJITSU LIMITED
  • 30. Static and Dynamic Failure Avoidance Static Failure Avoidance  Pre-calculated routing table  For intra-job communication Dynamic Failure Avoidance  Time-out detection by the protocol  For I/O communication FailureSep 5th, 2012 TACC-2012 29/41 Copyright 2012 FUJITSU LIMITED
  • 31. Fault Isolation by Virtual Topology Jobs using virtual topology can use rectangle region including failed node 10 9 3 4 B 11 8 7 6 Y 0 1 2 5 9 8 7 6 B 10 3 4 Y 0 1 2 5 Decreases in executable job size and in system availability are minimizedSep 5th, 2012 TACC-2012 30/41 Copyright 2012 FUJITSU LIMITED
  • 32. All-to-all communication performance Link utilization is important for actual communications New optimized algorithm  Uses all links uniformly to maximize All-to-All communication performance  Four RDMA engines execute 4 sends and 4 receives simultaneously Using Tofu features 4  Virtual 3D-Torus Tofu (8x4x8=256)  Flow-control features InfiniBand QDR (256) 3  for congestion prevention Many applications use All-to-All New algorithm type of communication and 2 GB/s enjoy this acceleration 1 0 1.E+00 1.E+02 1.E+04 1.E+06 Message size in bytesSep 5th, 2012 TACC-2012 31/41 Copyright 2012 FUJITSU LIMITED
  • 33. All-to-all communication trace on Tofu Trace Result of the K computer System configuration of Tofu 24×18×16×2×3×2 = 82,944 nodes Each node transfers 32KB Left: new algorithm Right: standard OpenMPI (pair-wise exchange) Colors show link utilization and wait time Greener – Higher utilization Redder – Longer wait time Standard OpenMPI New Algorithm (pair-wise exchange) Elapsed Time: 2.77sec Elapsed Time: 24.08secSep 5th, 2012 TACC-2012 32/41 Copyright 2012 FUJITSU LIMITED
  • 34. FX10 Software Stack Applications HPC Portal / System Management Portal Technical Computing Suite System Management High Performance Automatic parallelization Parallel File System compiler  Fortran  System management FEFS C  System control  C++  System monitoring Tools and math. libraries  System operation support  Lustre based high performance  Programming support tools Job Management distributed file  Mathematical libraries system (SSL II/BLAS etc.)  Job manager  High scalability, high Parallel languages and libraries  Job scheduler reliability and  OpenMP  Resource management availability  MPI  Parallel job execution  XPFortran Linux based OS enhanced for FX10 PRIMEHPC FX10Sep 5th, 2012 TACC-2012 33/41 Copyright 2012 FUJITSU LIMITED
  • 35. Lustre Extension of FEFS: FeaturesNew FEFS FeaturesExtended Large scale High performanceReuse Max file size File striping MDS response Max number of files Parallel I/O I/O zoning Max client number Max stripe count Client cache 512KB block Server cache OS jitter reduction Network Operations Management Tofu Interconnect IB/Ether Lustre ACL QoS Disk Quota Directory Quota IB Multi-rail LNET Router Features Dynamic configuration change Connectivity Reliability Lustre mount NFS export Failover RAS Journal / fsckSep 5th, 2012 TACC-2012 34/41 Copyright 2012 FUJITSU LIMITED
  • 36. FEFS performance * : Collaborative work with RIKEN on the K computer Achieved the world’s top-level 400 350 throughput* Throughput [GB/s] 300  Read 334GB/s, Write 249GB/s 250 (574 OSSs, 18432 Clients, 192 racks) 200 read diret 150 write direct 100 50 Metadata performance of mdtest* 0 (distributed directory) 0 100 200 300 400 500 600 Number of OSSs FEFS Lustre IOPS K computer** IA*** IA*** 1.8.5 create 34697.6 31803.9 24628.1 17672.2 unlink 39660.5 26049.5 26419.5 20231.5 mkdir 87741.6 77931.3 38015.5 22846.8 rmdir 28153.8 24671.4 17565.1 13973.4 ** : MDS:RX300S6 (X5680 3.33 GHz 6core x2, 48GB, IB(QDR)x2) *** : MDS:RX200S5 (E5520 2.27GHz 4core x2, 48GB, IB(QDR)x1)Sep 5th, 2012 TACC-2012 35/41 Copyright 2012 FUJITSU LIMITED
  • 37. Language System overview Fortran C/C++/Fortran Compiler Programming model (OpenMP, MPI, XPFortran) Instruction level /Loop level optimization using HPC-ACE Debugging and Tuning tools for highly parallel computer Programming Language, MPI Programming tool Math. Lib. Fortran 2003 •Insts. level opt.  Instruction IDE Intra Node C scheduling BLAS  SIMDization Debugger LAPACK C++ •Loop level opt. Profiler SSL II  Automatic OpenMP 3.0 Parallelization *1 Inter Node XPFortran *2 RMATT ScaLAPACK MPI 2.1 *1: eXtended Parallel Fortran (Distributed Parallel Fortran) *2: Rank Map Automatic Tuning ToolSep 5th, 2012 TACC-2012 36/41 Copyright 2012 FUJITSU LIMITED
  • 38. Programming Environment FX10 System User Client Login Node Compute Nodes IDE Interface Command Job Control IDE Interface debugger Debugger App Interface App Interactive Debugger GUI Data Data Converter Sampler Visualized Sampling Data Data Stage out ProfilerSep 5th, 2012 TACC-2012 37/41 Copyright 2012 FUJITSU LIMITED
  • 39. Application Tuning Cycle and Tools Job Profiler RMATT Information Vampir-trace Tofu-PA Profiler snapshot MPI Tuning Overall Execution Tuning CPU Tuning FX10 Specific Profiler Tools Vampir-trace Open Source PAPI ToolsSep 5th, 2012 TACC-2012 38/41 Copyright 2012 FUJITSU LIMITED
  • 40. On Course to Exascale World’s first 1 Exa-Flops computer is expected to appear by 2020Sep 5th, 2012 TACC-2012 39/41 Copyright 2012 FUJITSU LIMITED
  • 41. Towards exascale Realization of Exascale system is grand challenge  At least two-step development is necessary  The biggest challenge is high density and low power consumption Fujitsu is developing a Trans-Exa system as a midterm goal  The Trans-Exa system is expected to be scalable to 100 Petaflops  Employs  Wide SIMD and multicore CPU  High performance and lower power consumption interconnect  High performance and high density memory technologies Continues to invest effort in research for the exascale system  Higher performance and lower power consumption technologies  Technologies for higher reliability Exascale system No.1 in Top500 Trans-Exa system (June, Nov. 2011) K computer 2010 2015 2020 thSep 5 , 2012 TACC-2012 40/41 Copyright 2012 FUJITSU LIMITED
  • 42. Key technology developments on Trans-Exa Goal Significant improvement of power efficiency, high density Technology Gains Silicon tech. Performance / power ⇒Employs the latest tech. consumption Innovative memory tech. ⇒High density & BW memory Performance / rack System integration tech. ⇒Higher integration & density Accumulation of key technologies toward The latest optical tech. exascale systems ⇒High speed signal transferSep 5th, 2012 TACC-2012 41/41 Copyright 2012 FUJITSU LIMITED
  • 43. Sep 5th, 2012 TACC-2012