[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data
Rakuten Technology Conference 2013
"TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data"
Satoshi Matsuoka
Professor
Global Scientific Information and Computing (GSIC) Center
Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)

Transcript

  • 1. TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data Satoshi Matsuoka Professor Global Scientific Information and Computing (GSIC) Center Tokyo Institute of Technology Fellow, Association for Computing Machinery (ACM) Rakuten Technology Conference 2013 2013/10/26 Tokyo, Japan
  • 2. Supercomputers from the Past Fast, Big, Special, Inefficient, Evil device to conquer the world…
  • 3. Let us go back to the mid ’70s: the birth of “microcomputers” and the arrival of commodity computing (start of my career) • Commodity 8-bit CPUs… – Intel 4004/8008/8080/8085, Zilog Z-80, Motorola 6800, MOS Tech. 6502, … • Led to hobbyist computing… – Evaluation boards: Intel SDK-80, Motorola MEK6800D2, MOS Tech. KIM-1, (in Japan) NEC TK-80, Fujitsu Lkit-8, … – System kits: MITS Altair 8800/680b, IMSAI 8080, Proc. Tech. SOL-20, SWTPC 6800, … • & led to early personal computers – Commodore PET, Tandy TRS-80, Apple II – (in Japan): Hitachi Basic Master, NEC CompoBS / PC8001, Fujitsu FM-8, …
  • 4. Supercomputing vs. personal computing in the late 1970s • Hitachi Basic Master (1978) – “The first PC in Japan” – Motorola 6802, 1MHz, 16KB ROM, 16KB RAM – Linpack in BASIC: approx. 70-80 FLOPS (1/1,000,000 of a supercomputer) • We got “simulation” done (in assembly language) – Nintendo NES (1982) • MOS Technology 6502, 1MHz (same as the Apple II) – “Pinball” by Matsuoka & Iwata (now CEO of Nintendo) • Realtime dynamics + collision detection + lots of shortcuts • Average ~a few KFLOPS. Cf. the Cray-1 (1976) running Linpack: 80-90 MFlops (est.)
  • 5. Then things accelerated from the mid ’80s to the mid ’90s (rapid commoditization towards what we use now) • PC CPUs: Intel 8086/286/386/486/Pentium (superscalar & fast-FP x86), Motorola 68000/020/030/040, … to Xeons, GPUs, Xeon Phis – c.f. RISCs: SPARC, MIPS, PA-RISC, IBM Power, DEC Alpha, … • Storage evolution: cassettes and floppies to HDDs, optical disks, and now Flash • Network evolution: RS-232C to Ethernet, now to FDR InfiniBand • PC (incl. I/O): IBM PC “clones” and Macintoshes: ISA to VLB to PCIe • Software evolution: CP/M to MS-DOS to Windows, Linux, … • WAN evolution: RS-232 + modem + BBS, to modem + Internet, to ISDN/ADSL/FTTH broadband, DWDM backbones, LTE, … • Internet evolution: email + ftp to the Web, Java, Ruby, … • Then clusters, grids/clouds, 3-D gaming, and the Top500 all started in the mid ’90s(!), and commoditized supercomputing
  • 6. Modern-day supercomputers: supercomputers now “look like” IDC servers; high-end COTS dominates; Linux-based machines with a standard + HPC OSS software stack.
  • 7. (Timeline figure, 1957 / 2010 / 2011 / 2012): “Reclaimed the No. 1 Supercomputer Rank in the World”
  • 8. Top supercomputers vs. global IDC:
       – K Computer (#1 2011-12), RIKEN AICS: Fujitsu SPARC64 VIIIfx (“Venus”) CPUs; 88,000 nodes, 800,000 CPU cores; ~11 Petaflops (~10^16); 1.4 Petabytes memory; 13 MW power; 864 racks, 3000 m²
       – Tianhe-2 (#1 2013), Guangzhou, China: 48,000 KNC Xeon Phi + 36,000 Ivy Bridge Xeons; 18,000 nodes, >3 million CPU cores; 54 Petaflops; 0.8 Petabytes memory; 20 MW power; ??? racks, ??? m²
       – IBM BlueGene/Q “Sequoia” (#1 2012), Lawrence Livermore National Lab: IBM PowerPC system-on-chip; 98,000 nodes, 1.57 million cores; ~20 Petaflops; 1.6 Petabytes; 8 MW; 96 racks
       – C.f. Amazon ~= 450,000 nodes, ~3 million cores
       – DARPA study: a 2020 Exaflop (10^18) machine, 100 million ~ 1 billion cores
  • 9. Scalability and massive parallelism: more nodes & cores => a massive increase in parallelism => faster, “bigger” simulation and a qualitative difference (CPU cores ~= parallelism). Ideal linear scaling is difficult to achieve: limitations in power, cost, and reliability; limitations in scaling.
  • 10. TSUBAME2.0
  • 11. 2006: TSUBAME1.0 as No. 1 in Japan: total 85 TeraFlops, #7 on the June 2006 Top500, more than all university centers combined (45 TeraFlops) and the Earth Simulator (40 TeraFlops, #1 2002~2004).
  • 12. TSUBAME 2.0, Nov. 1, 2010: “The Greenest Production Supercomputer in the World”, a new development. Figure labels: 32nm / 40nm (CPU / GPU processes); memory bandwidth >400GB/s (per node), >1.6TB/s, >12TB/s, >600TB/s (whole system); 80Gbps network BW and ~1KW max per node; 35KW max per rack; 220Tbps network bisection BW and 1.4MW max for the system.
  • 13. Performance comparison of CPU vs. GPU, in peak performance [GFLOPS] and memory bandwidth [GByte/s]: an x5-6 socket-to-socket advantage for the GPU in both compute and memory bandwidth, at the same power (200W GPU vs. 200W CPU + memory + NW + …).
  • 14. TSUBAME2.0 compute node (thin node), productized as the HP ProLiant SL390s (SL390G7, developed for TSUBAME 2.0):
       – 1.6 TFlops, 400GB/s memory BW, InfiniBand QDR x2 (80Gbps), ~1KW max per node
       – GPU: NVIDIA Fermi M2050 x 3, 515 GFlops and 3GB memory per GPU
       – CPU: Intel Westmere-EP 2.93GHz x 2 (12 cores/node)
       – Multiple I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) for 3 GPUs + 2 IB QDR
       – Memory: 54 or 96 GB DDR3-1333; SSD: 60GB x 2 or 120GB x 2
       – System totals: 2.4 PFlops, ~100TB memory, ~200TB SSD
  • 15. TSUBAME2.0 storage overview: 11PB total (7PB HDD, 4PB tape) on an InfiniBand QDR network for LNET and other services (QDR IB (x4) x 8 and x 20 links):
       – “Global Work Space” #1-#3 (/work0, /work9, /work19) and /gscr0 on SFA10k #1-#5: GPFS #1-#4 with HSM plus a Lustre “scratch” volume; 3.6 PB of parallel file system volumes at 30~60GB/s, 2.4 PB HDD + ~4PB tape
       – Home volumes, 1.2PB: “cNFS/clustered Samba w/ GPFS” and “NFS/CIFS/iSCSI by BlueARC” on SFA10k #6 (10GbE x 2), for home directories, system applications, and iSCSI
       – Node-local SSDs (“thin node SSD”, “fat/medium node SSD”): 250 TB at 300~500GB/s; scratch 130 TB => 500TB~1PB
       – Grid storage
  • 16. TSUBAME2.0 storage overview, annotated by usage:
       – Concurrent parallel I/O (e.g. MPI-IO) and read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys) on the “Global Work Space” volumes (GPFS with HSM)
       – Fine-grained R/W I/O (checkpoints, temporary files, Big Data processing) on the Lustre scratch volume and node-local SSDs (250 TB, 300GB/s; scratch 130 TB => 500TB~1PB)
       – Home storage for the compute nodes and cloud-based campus storage services on the home volumes (1.2PB)
       – Data transfer service between SCs/CCs, long-term backup, and HPCI storage (2.4 PB HDD + ~4PB tape)
  • 17. 3,500 fiber cables, >100km in total, with DFB silicon photonics; end-to-end 7.5GB/s, >2µs; non-blocking, 200Tbps bisection.
  • 18. 2010: TSUBAME2.0 as No. 1 in Japan: total 2.4 Petaflops, #4 on the Nov. 2010 Top500, more than all other Japanese centers on the Top500 combined (2.3 Petaflops).
  • 19. TSUBAME Wins Awards… “Greenest Production Supercomputer in the World” the Green 500 Nov. 2010, June 2011 (#4 Top500 Nov. 2010) 3 times more power efficient than a laptop!
  • 20. TSUBAME Wins Awards… ACM Gordon Bell Prize 2011 2.0 Petaflops Dendrite Simulation Special Achievements in Scalability and Time-to-Solution “Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer”
  • 21. TSUBAME wins awards… Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology (MEXT), 2012: Prize for Science and Technology, Development Category, for the development of the greenest production peta-scale supercomputer. Satoshi Matsuoka, Toshio Endo, Takayuki Aoki.
  • 22. Precise blood-flow simulation of an artery on TSUBAME2.0 (Bernaschi et al., IAC-CNR, Italy): personal CT scan + simulation => accurate diagnostics of cardiac illness; 5 billion red blood cells and 10 billion degrees of freedom.
  • 23. MUPHY: multiphysics simulation of blood flow (Melchionna, Bernaschi et al.), combining a Lattice-Boltzmann (LB) simulation for the plasma with Molecular Dynamics (MD) for the red blood cells, on realistic geometry from a CAT scan:
       – Fluid: blood plasma, Lattice Boltzmann; body: red blood cells, extended MD; the two are coupled
       – The irregular mesh is partitioned using the PT-SCOTCH tool, considering the cutoff distance
       – Red blood cells (RBCs) are represented as ellipsoidal particles
       – Two levels of parallelism: CUDA (on the GPU) + MPI
       – 1 billion mesh nodes for the LB component, 100 million RBCs, 4000 GPUs, 0.6 Petaflops
       – ACM Gordon Bell Prize 2011 Honorable Mention
  • 24. Lattice-Boltzmann LES with a coherent-structure SGS model [Onodera & Aoki 2013]: a coherent-structure Smagorinsky model whose model parameter is locally determined by the second invariant of the velocity gradient tensor (Q) and the energy dissipation (ε). Well suited to turbulent flow around complex objects and to large-scale parallel computation.
  • 25. Computational area: entire downtown Tokyo, 10km x 10km, covering the major part of Tokyo including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuo-ku, Shibuya, and Shinagawa. Building data: Pasco Co. Ltd. TDM 3D. Achieved 0.592 Petaflops using over 4000 GPUs (15% efficiency). (Map ©2012 Google, ZENRIN)
  • 27. Area around the Tokyo Metropolitan Government Building: wind flow profile at a height of 25m above the ground, over a 640 m x 960 m area. (Map data ©2012 Google, ZENRIN)
  • 29. Current weather forecasts run at 5km resolution (inaccurate cloud simulation). ASUCA typhoon simulation on TSUBAME2.0: 500m resolution, 4792 x 4696 x 48 grid, 437 GPUs (x1000 the resolution).
  • 30. CFD analysis over a car body. Calculation conditions: number of grid points 3,623,878,656 (3,072 x 1,536 x 768); grid resolution 4.2mm (13m x 6.5m x 3.25m domain); number of GPUs: 288 (96 nodes); 60 km/h.
  • 31. LBM: DrivAer (the BMW-Audi reference car geometry), Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München; 3,000 x 1,500 x 1,500 grid; Re = 1,000,000.
  • 34. Industry program: TOTO Inc., TSUBAME (150 GPUs) vs. an in-house cluster.
  • 35. Drug discovery with Astellas Pharma for specific cures for tropical diseases such as dengue fever: accelerating in-silico screening and data mining.
  • 36. 100‐million‐atom MD Simulation M. Sekijima (Tokyo Tech), Jim Phillips (UIUC)
  • 37. Mixed-precision AMBER on TSUBAME2.0 for industrial drug discovery: x10 faster with mixed precision on a nucleosome (25,095 particles), 75% energy efficient. With $500 mil ~ $1 bil of development cost per drug, even a 5-10% improvement of the process will more than pay for TSUBAME.
  • 38. Towards TSUBAME 3.0: interim upgrade TSUBAME2.0 to 2.5 (early fall 2013). Upgrade TSUBAME2.0's GPUs from NVIDIA Fermi M2050 to Kepler K20X (3 x 1408 = 4224 GPUs); SFP/DFP peak goes from 4.8PF/2.4PF => 17PF/5.7PF (c.f. the K Computer: 11.2/11.2). Acceleration of important apps, considerable improvement, summer 2013: a significant capacity improvement at low cost and without a power increase. TSUBAME3.0: 2H2015.
  • 39. TSUBAME2.0 ⇒ 2.5 thin node upgrade (HP SL390G7, developed for TSUBAME 2.0, modified for TSUBAME2.5):
       – Peak perf. 4.08 TFlops, ~800GB/s memory BW, 80Gbps network, ~1KW max per node
       – GPU: NVIDIA Kepler K20X x 3 (3950/1310 GFlops SFP/DFP, 6GB memory per GPU), replacing the NVIDIA Fermi M2050 (1039/515 GFlops)
       – CPU: Intel Westmere-EP 2.93GHz x 2 (unchanged)
       – Multiple I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) for 3 GPUs + 2 IB QDR; memory 54 or 96 GB DDR3-1333; SSD 60GB x 2, 120GB x 2
  • 40. 2013: TSUBAME2.5 is No. 1 in Japan in single-precision FP: 17.1 Petaflops SFP (5.76 Petaflops DFP), vs. ~9 Petaflops SFP for all university centers combined and 11.4 Petaflops SFP/DFP for the K Computer.
  • 41. TSUBAME2.0 vs. TSUBAME2.5, thin node x 1408 units:
       – Node machine: HP ProLiant SL390s (no change); CPU: Intel Xeon X5670 (6-core, 2.93GHz, Westmere) x 2 (no change)
       – GPU: NVIDIA Tesla M2050 x 3 (448 CUDA cores, Fermi; SFP 1.03TFlops, DFP 0.515TFlops; 3GiB GDDR5; 150GB/s peak, ~90GB/s STREAM memory BW) ⇒ NVIDIA Tesla K20X x 3 (2688 CUDA cores, Kepler; SFP 3.95TFlops, DFP 1.31TFlops; 6GiB GDDR5; 250GB/s peak, ~180GB/s STREAM memory BW)
       – Node performance (incl. CPU turbo boost): SFP 3.40TFlops, DFP 1.70TFlops, ~500GB/s peak / ~300GB/s STREAM memory BW ⇒ SFP 12.2TFlops, DFP 4.08TFlops, ~800GB/s peak / ~570GB/s STREAM memory BW
       – Total system performance: SFP 4.80PFlops, DFP 2.40PFlops, peak ~0.70PB/s / STREAM ~0.440PB/s memory BW ⇒ SFP 17.1PFlops (x3.6), DFP 5.76PFlops (x2.4), peak ~1.16PB/s / STREAM ~0.804PB/s memory BW (x1.8) (see the quick check below)
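As a quick sanity check on the totals in the table above, the system numbers are essentially the per-node figures multiplied by the 1,408 thin nodes (the small differences come from rounding and the remaining non-thin nodes). A minimal back-of-the-envelope sketch in Python:

    nodes = 1408                            # thin nodes only
    sfp_tflops, dfp_tflops = 12.2, 4.08     # per-node peaks from the table (3x K20X + CPUs)
    print(nodes * sfp_tflops / 1000)        # ~17.2 PFlops SFP (table: 17.1)
    print(nodes * dfp_tflops / 1000)        # ~5.74 PFlops DFP (table: 5.76)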
  • 42. Phase-field simulation of dendritic solidification [Shimokawabe, Aoki et al.]: weak scaling on TSUBAME (single precision), mesh size per GPU + 4 CPU cores: 4096 x 162 x 130.
       – TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), 4,096 x 5,022 x 16,640 mesh
       – TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), 4,096 x 6,480 x 13,000 mesh
       Peta-scale phase-field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials: developing lightweight strengthening materials by controlling microstructure, towards a low-carbon society. 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution.
  • 43. Peta-scale stencil application: a large-scale LES wind simulation using the lattice Boltzmann method [Onodera, Aoki et al.]. Weak scalability in single precision (N = 192 x 256 x 256 per GPU); large-scale wind simulation for a 10km x 10km area of metropolitan Tokyo, 10,080 x 10,240 x 512 on 4,032 GPUs.
       – TSUBAME 2.5 (overlap): 1142 TFlops on 3,968 GPUs, 288 GFlops/GPU (x1.93)
       – TSUBAME 2.0 (overlap): 149 TFlops on 1,000 GPUs, 149 GFlops/GPU
       These peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012. An LES wind simulation of a 10km x 10km area at 1-m resolution had never been done before in the world; we achieved 1.14 PFLOPS using 3,968 GPUs on the TSUBAME 2.5 supercomputer.
  • 44. AMBER pmemd benchmark (nucleosome = 25,095 atoms; Dr. Sekijima @ Tokyo Tech), throughput in ns/day:
       – TSUBAME2.5, K20X x8 / x4 / x2 / x1: 11.39 / 6.66 / 4.04 / 3.11
       – TSUBAME2.0, M2050 x8 / x4 / x2 / x1: 3.44 / 2.22 / 1.85 / 0.99
       – CPU-only MPI, 4 / 2 / 1 nodes (12 cores per node): 0.31 / 0.15 / 0.11
  • 45. Application performance, TSUBAME2.0 vs. TSUBAME2.5 (boost ratio):
       – Top500/Linpack (PFlops): 1.192 => 2.843 (2.39x)
       – Green500/Linpack (GFlops/W): 0.958 => >2.400 (>2.50x)
       – Semi-definite programming / nonlinear optimization (PFlops): 1.019 => 1.713 (1.68x)
       – Gordon Bell dendrite stencil (PFlops): 2.000 => 3.444 (1.72x)
       – LBM LES whole-city airflow (PFlops): 0.600 => 1.142 (1.90x)
       – Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day): 3.44 => 11.39 (3.31x)
       – GHOSTM genome homology search (sec): 19,361 => 10,785 (1.80x)
       – MEGADOC protein docking (vs. 1 CPU core): 37.11 => 83.49 (2.25x)
  • 46. TSUBAME evolution towards exascale and extreme big data (roadmap figure): TSUBAME2.5 (5.7PF; fast I/O 250TB at 300GB/s; ~30PB/day; Graph500 No. 3 in 2011 and other awards) ⇒ TSUBAME3.0 (2015H2, 25-30PF; Phase 1: fast I/O of 5~10PB at 1TB/s; Phase 2: 10TB/s, >100 million IOPS; ~1 ExaB/day).
  • 47. DoE exascale parameters: x1000 power efficiency in 10 years. System attributes for “2010” (Jaguar / TSUBAME2.0), “2015”, and “2020”:
       – System peak: 2 PetaFlops ⇒ 100-200 PetaFlops ⇒ 1 ExaFlop
       – Power: 6 MW (Jaguar) / 1.3 MW (TSUBAME) ⇒ 15 MW ⇒ 20 MW
       – System memory: 0.3PB / 0.1PB ⇒ 5 PB ⇒ 32-64PB
       – Node performance: 125GF / 1.6TF ⇒ 0.5TF or 7TF ⇒ 1TF or 10TF
       – Node memory BW: 25GB/s / 0.5TB/s ⇒ 0.1TB/s or 1TB/s ⇒ 0.4TB/s or 4TB/s
       – Node concurrency: 12 / O(1000) ⇒ O(100) or O(1000) ⇒ O(1000) or O(10000)
       – Total node interconnect BW: 1.5GB/s / 8GB/s ⇒ 20GB/s ⇒ 200GB/s
       – # nodes: 18,700 / 1442 ⇒ 50,000 or 5,000 ⇒ 1 million or 100,000 (a billion cores)
       – MTTI: O(days) ⇒ O(1 day) ⇒ O(1 day)
  • 48. Challenges of exascale (FLOPS, bytes, … at 10^18)! Various physical limitations surface all at once:
       – # CPU cores: 1 billion; low power (c.f. the total # of smartphones sold globally = 400 million)
       – # nodes: 100K~xM (c.f. the K Computer ~100K, Google ~1 million)
       – Memory: x00PB~ExaB (c.f. the total memory of all PCs (300 million) shipped globally in 2011 ~ ExaB; BTW 2^64 ≈ 1.8 x 10^19 = 18 ExaB)
       – Storage: xExaB (c.f. Google storage: ~2 Exabytes (200 million x 7GB+))
       – All of this at 20MW (50 GFlops/W), with reliability (MTTI = days), ease of programming (a billion cores?), and acceptable cost… in 2020?!
  • 49. Focused Research Towards Tsubame 3.0 and Beyond towards Exa • Green Computing: Ultra Power Efficient HPC • High Radix Bisection Networks – HW, Topology, Routing Algorithms, Placement… • Fault Tolerance – Group-based Hierarchical Checkpointing, Fault Prediction, Hybrid Algorithms • Scientific “Extreme” Big Data – Ultra Fast I/O, Hadoop Acceleration, Large Graphs • New Memory Systems – Pushing the envelope of low power vs. capacity vs. BW, exploiting the deep hierarchy with new algorithms to decrease Bytes/Flops • Post-Petascale Programming – OpenACC and other manycore programming substrates, Task Parallelism • Scalable Algorithms for Many Core – Apps/System/HW Co-Design
  • 50. JST-CREST “Ultra Low Power (ULP)-HPC” project, 2007-2012: ultra multi-core (slow & parallel, & ULP), SIMD-vector (GPGPU, etc.), ULP-HPC networks, and new memory devices (MRAM, PRAM, Flash, etc.), combined with auto-tuning for performance & power; low power at high performance, model-based power optimization using novel components in HPC, power-aware and optimizable applications; the optimization point on the power-performance curve gives x10 power efficiency (x1000 improvement in 10 years).
       ABCLibScript algorithm selection (auto-tuning directives specified before execution; CacheS, NB, and NPrc are the input variables used in the cost definition functions):
         !ABCLib$ static select region start
         !ABCLib$ parameter (in CacheS, in NB, in NPrc)
         !ABCLib$ select sub region start
         !ABCLib$ according estimated
         !ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
           [target region 1: algorithm 1]
         !ABCLib$ select sub region end
         !ABCLib$ select sub region start
         !ABCLib$ according estimated
         !ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
           [target region 2: algorithm 2]
         !ABCLib$ select sub region end
         !ABCLib$ static select region end
       Bayesian fusion of the model and measurements. Bayesian model and prior distribution (model-based estimate of the execution time):
         y_i ~ N(mu_i, sigma_i^2),   mu_i | sigma_i^2 ~ N(x_i^T beta, sigma_i^2 / kappa_0),   sigma_i^2 ~ Inv-chi^2(nu_0, sigma_0^2)
       Posterior predictive distribution after n measured execution times:
         y_i | (y_i1, y_i2, …, y_in) ~ t_{nu_n}(mu_in, sigma_n^2 (1 + 1/kappa_n)),   where
         kappa_n = kappa_0 + n,   nu_n = nu_0 + n,   mu_in = (kappa_0 x_i^T beta + n ybar_i) / kappa_n,
         nu_n sigma_n^2 = nu_0 sigma_0^2 + sum_m (y_im - ybar_i)^2 + kappa_0 n (ybar_i - x_i^T beta)^2 / kappa_n,   ybar_i = (1/n) sum_m y_im
       (A small numeric sketch of this fusion follows below.)
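To make the model/measurement fusion concrete, here is a minimal Python sketch of the posterior predictive computation above for one algorithm's execution time; the prior hyperparameters (kappa0, nu0, sigma0_sq), the cost-model estimates, and the measured runtimes are illustrative values, not numbers from the project.

    import numpy as np

    def posterior_predictive(model_estimate, runtimes, kappa0=1.0, nu0=1.0, sigma0_sq=1.0):
        """Fuse a cost-model estimate (the prior mean x_i^T beta) with n measured
        runtimes; return (location, scale, dof) of the Student-t posterior
        predictive distribution for the next runtime (conjugate normal / Inv-chi^2 model)."""
        y = np.asarray(runtimes, dtype=float)
        n, ybar = len(y), y.mean()
        kappa_n, nu_n = kappa0 + n, nu0 + n
        mu_n = (kappa0 * model_estimate + n * ybar) / kappa_n
        nu_sigma_sq = (nu0 * sigma0_sq + np.sum((y - ybar) ** 2)
                       + kappa0 * n * (ybar - model_estimate) ** 2 / kappa_n)
        scale = np.sqrt((nu_sigma_sq / nu_n) * (1.0 + 1.0 / kappa_n))
        return mu_n, scale, nu_n

    # Pick the variant with the smaller expected runtime after a few trial runs
    # (the model estimates play the role of the ABCLib cost expressions above).
    mu1, s1, _ = posterior_predictive(model_estimate=683e3, runtimes=[710e3, 690e3, 730e3])
    mu2, s2, _ = posterior_predictive(model_estimate=133e3, runtimes=[150e3, 141e3])
    print("select algorithm", 1 if mu1 < mu2 else 2)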
  • 51. Aggressive power saving in HPC: how the methodologies fare in enterprise/business clouds vs. HPC:
       – Server consolidation: clouds good, HPC no good
       – DVFS (dynamic voltage/frequency scaling): clouds good, HPC poor
       – New devices: clouds poor (cost & continuity), HPC good
       – New HW & SW architectures: clouds poor (cost & continuity), HPC good
       – Novel cooling: clouds limited (cost & continuity), HPC good (high thermal density)
  • 52. How do we achieve x1000? Process shrink (x100) x many-core GPU usage (x5) x DVFS & other low-power SW (x1.5) x efficient cooling (x1.4) = x1000 (100 x 5 x 1.5 x 1.4 = 1050). ULP-HPC project 2007-12; Ultra Green Supercomputing project 2011-15.
  • 53. Statistical power modeling of GPUs [IEEE IGCC10]: estimate GPU power consumption statistically with a linear regression model over n GPU performance counters used as explanatory variables, p = sum_{i=1..n} alpha_i c_i + beta, trained against average power consumption measured with a high-resolution power meter.
       – High accuracy (avg. error 4.7%); accurate even with DVFS; a linear model shows sufficient accuracy
       – Prevents overfitting by ridge regression; determines the optimal parameters by cross validation
       – Future: model-based power optimization, with the possibility of optimizing exascale systems with O(10^8) processors
       (A small fitting sketch follows below.)
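As an illustration of the kind of model this slide describes (not the authors' code), here is a minimal ridge-regression fit of average power against performance-counter rates in Python; the counter data, the penalty value, and the coefficients are synthetic.

    import numpy as np

    def fit_power_model(counters, power, lam=1e-2):
        """Ridge fit of p ~ sum_i alpha_i * c_i + beta.
        counters: (samples, n_counters) matrix of performance-counter rates
        power:    measured average power per sample, in watts"""
        X = np.hstack([counters, np.ones((counters.shape[0], 1))])  # append intercept column
        A = X.T @ X + lam * np.eye(X.shape[1])
        A[-1, -1] -= lam                       # do not penalize the intercept
        coef = np.linalg.solve(A, X.T @ power)
        return coef[:-1], coef[-1]             # (alpha_i, beta)

    # Synthetic example: 5 counters, 50 kernel runs, noisy "measured" power.
    rng = np.random.default_rng(0)
    C = rng.random((50, 5))
    p = C @ np.array([40.0, 25.0, 10.0, 5.0, 60.0]) + 80.0 + rng.normal(0.0, 2.0, 50)
    alpha, beta = fit_power_model(C, p)
    rel_err = np.abs(C @ alpha + beta - p) / p
    print(f"mean relative error: {rel_err.mean():.1%}")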
  • 54. Power efficiency of the dendrite application, from TSUBAME1.0 through the JST-CREST ULP-HPC prototype running the Gordon Bell dendrite app.
  • 55. TSUBAME-KFC: an ultra-green supercomputer testbed [2011-2015]. Fluid submersion cooling + outdoor air cooling + high-density GPU supercomputing in a 20-foot container (16 m²).
       – Compute nodes: NEC/SMC 1U server x 40; per node: Intel Ivy Bridge 2.1GHz 6-core x 2, NVIDIA Tesla K20X GPU x 4, 64GB DDR3 memory, 120GB SSD, 4x FDR InfiniBand 56Gbps; total peak 210 TFlops (DP) / 630 TFlops (SP)
       – Cooling: GRC submersion rack with heat exchanger; processors at 80~90℃ ⇒ coolant oil (SpectraSyn 8) at 35~45℃ ⇒ water at 25~35℃ ⇒ cooling tower ⇒ heat dissipated to outdoor air
       – Targets: the world's top power efficiency (>3 GFlops/Watt), average PUE 1.05, lower component power; a field test of the ULP-HPC results
  • 56. TSUBAME-KFC Towards TSUBAME3.0 and Beyond Shooting for #1 on Nov. 2013 Green 500!
  • 57. Power efficiency across machines: power, Linpack performance (PF), Linpack MFLOPS/W, total memory BW (STREAM, TB/s), and memory BW per watt (MByte/s/W):
       – Earth Simulator 1: 10MW, 0.036 PF, 3.6 MFLOPS/W, 160 TB/s, 16 MB/s/W
       – Tsubame1.0 (2006Q1): 1.8MW, 0.038 PF, 21 MFLOPS/W, 13 TB/s, 7.2 MB/s/W
       – ORNL Jaguar (XT5, 2009Q4): ~9MW, 1.76 PF, 196 MFLOPS/W, 432 TB/s, 48 MB/s/W
       – Tsubame2.0 (2010Q4): 1.8MW, 1.2 PF, 667 MFLOPS/W, 440 TB/s, 244 MB/s/W
       – K Computer (2011Q2): ~16MW, 10 PF, 625 MFLOPS/W, 3,300 TB/s, 206 MB/s/W
       – BlueGene/Q (2012Q1): ~12MW?, 17 PF, ~1,400 MFLOPS/W, 3,000 TB/s, 250 MB/s/W
       – TSUBAME2.5 (2013Q3): 1.4MW, ~3 PF, ~2,100 MFLOPS/W, 802 TB/s, 572 MB/s/W
       – Tsubame3.0 (2015Q4~2016Q1): 1.5MW, ~20 PF, ~13,000 MFLOPS/W, 6,000 TB/s, 4,000 MB/s/W
       – EXA (2019~20): 20MW, 1,000 PF, 50,000 MFLOPS/W, 100K TB/s, 5,000 MB/s/W
  • 58. Extreme Big Data (EBD): Next-Generation Big Data Infrastructure Technologies Towards Yottabyte/Year. Principal Investigator: Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology.
  • 59. The current “Big Data” are not really that big… • Typical “real” definition: “mining people's privacy data to make money” • Corporate data are usually in data-warehoused silos => limited volume, in gigabytes~terabytes, seldom petabytes • Processing involves simple O(n) algorithms, or those that can be accelerated with DB-inherited indexing algorithms • Executed on re-purposed commodity “web” servers linked with 1Gbps networks running Hadoop/HDFS • A vicious cycle of stagnation in innovation… • NEW: breaking down of the silos ⇒ convergence with supercomputing with extreme big data
  • 60. But “Extreme Big Data” will change everything • “Breaking down of silos” (Rajeeb Hazra, Intel VP of Technical Computing) • Already happening in science & engineering due to the Open Data movement • More complex analysis algorithms: O(n log n), O(m x n), … • Will become the NORM for competitiveness reasons.
  • 61. We will have tons of unknown genes [slide courtesy Yutaka Akiyama @ Tokyo Tech]. Metagenome analysis: directly sequencing uncultured microbiomes obtained from the target environment and analyzing the sequence data, to find novel genes from unculturable microorganisms and to elucidate the composition of species/genes of environments. Examples of microbiomes: gut microbiome, human body, soil, sea.
  • 62. Results from the Akiyama group @ Tokyo Tech: ultra high-sensitivity “big data” metagenome sequence analysis of the human oral microbiome.
       – Required >1 million node-hours on the K computer: 572.8 M reads/hour on 82,944 nodes (663,552 cores) of the K computer (2012)
       – The world's most sensitive sequence analysis (based on an amino acid similarity matrix)
       – Discovered at least three microbiome clusters with functional differences, integrating 422 experiment samples taken from 9 different oral parts (figure: metabolic pathway map; inside the dental arch, outside the dental arch, dental plaque)
  • 63. Extreme big data in genomics: the impact of new-generation sequencers [slide courtesy Yutaka Akiyama @ Tokyo Tech]. Sequencing data (bp) per dollar grows by ~x4000 every 5 years, c.f. HPC at ~x33 in 5 years. (Lincoln Stein, Genome Biology, vol. 11(5), 2010)
  • 64. Extremely “big” graphs: large-scale graphs in various fields.
       – US road network: 24 million vertices & 58 million edges
       – Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
       – Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
       – Cyber-security: 15 billion log entries / day
       Fast and scalable graph processing by using HPC.
  • 65. K computer (65,536 nodes): Graph500 5,524 GTEPS. Scale chart (log2 of # vertices vs. log2 of # edges, up to ~1 trillion of each) comparing the Graph500 problem classes (Toy, Mini, Small, Medium, Large, Huge) with the USA road networks (NY / LKS / USA), Twitter (tweets/day), and the Human Brain Project; c.f. an Android tablet (Tegra3 1.7GHz, 1GB RAM): 0.15 GTEPS, 64.12 MTEPS/W.
  • 66. Towards continuous billion-scale social simulation with real-time streaming data (Toyotaro Suzumura / IBM-Tokyo Tech). Applications – target area: the planet (OpenStreetMap), 7 billion people. Input data – road network (OpenStreetMap) for the planet: 300 GB (XML); trip data for 7 billion people: 10 KB (1 trip) x 7 billion = 70 TB; real-time streaming data (e.g. social sensors, physical data). Simulated output for 1 iteration: 700 TB.
  • 67. The Graph500 “Big Data” benchmark: BSP problem on a Kronecker graph with generator probabilities A: 0.57, B: 0.19, C: 0.19, D: 0.05. November 15, 2010, “Graph 500 Takes Aim at a New Kind of HPC”, Richard Murphy (Sandia NL => Micron): “I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list.” Reality: Top500 supercomputers dominate, no cloud IDCs at all. TSUBAME2.0: #3 (Nov. 2011), #4 (Jun. 2012). (A generator sketch follows below.)
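For reference, a toy R-MAT-style edge generator in Python using the quadrant probabilities quoted above; this only sketches the idea, since the official Graph500 Kronecker generator additionally scrambles vertex labels and is formulated for vectorized, parallel execution.

    import random

    A, B, C, D = 0.57, 0.19, 0.19, 0.05   # initiator probabilities from the slide

    def rmat_edge(scale, rng=random):
        """Sample one directed edge of a 2**scale-vertex Kronecker/R-MAT graph by
        recursively choosing one quadrant of the adjacency matrix per bit."""
        src = dst = 0
        for _ in range(scale):
            r = rng.random()
            if r < A:                 # upper-left quadrant
                s_bit, d_bit = 0, 0
            elif r < A + B:           # upper-right
                s_bit, d_bit = 0, 1
            elif r < A + B + C:       # lower-left
                s_bit, d_bit = 1, 0
            else:                     # lower-right
                s_bit, d_bit = 1, 1
            src = (src << 1) | s_bit
            dst = (dst << 1) | d_bit
        return src, dst

    def generate(scale, edgefactor=16):
        """Graph500-style edge list: 2**scale vertices, edgefactor * 2**scale edges."""
        return [rmat_edge(scale) for _ in range(edgefactor * (1 << scale))]

    edges = generate(scale=10)        # a toy-sized graph: 1,024 vertices, 16,384 edges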
  • 68. Supercomputer vs. cloud datacenter networks:
       – Tokyo Tech TSUBAME 2.0 (#4 Top500, 2010): ~1,500 compute & storage nodes on a full-bisection, multi-rail optical network; injection 80Gbps/node, bisection 220 Terabps; connected to the Internet via advanced silicon photonics (40G on a single CMOS die, 1490nm DFB, 100km fiber, 10GbE)
       – A major northern Japanese cloud datacenter (2013): Juniper MX480 x 2 and Juniper EX8208 x 2 (10GbE, LACP), with 2 Juniper EX4200 zone switches (virtual chassis) per zone; 8 zones of 700 nodes, 5,600 nodes in total; injection 1Gbps/node, bisection 160 Gigabps
       – Roughly x1000 difference in bisection bandwidth!
  • 69. But what does “220Tbps” mean? Global IP traffic, 2011-2016 (source: Cisco), in PB per month / average Tbps:
       – Fixed Internet: 23,288 / 71.9 (2011); 32,990 / 101.8 (2012); 40,587 / 125.3 (2013); 50,888 / 157.1 (2014); 64,349 / 198.6 (2015); 81,347 / 251.1 (2016); CAGR 2011-2016: 28%
       – Managed IP: 6,849 / 21.1; 9,199 / 28.4; 11,846 / 36.6; 13,925 / 43.0; 16,085 / 49.6; 18,131 / 56.0; CAGR 21%
       – Mobile data: 597 / 1.8; 1,252 / 3.9; 2,379 / 7.3; 4,215 / 13.0; 6,896 / 21.3; 10,804 / 33.3; CAGR 78%
       – Total IP traffic: 30,734 / 94.9; 43,441 / 134.1; 54,812 / 169.2; 69,028 / 213.0; 87,331 / 269.5; 110,282 / 340.4; CAGR 29%
       The TSUBAME2.0 network has TWICE the capacity of the global Internet, which is used by 2.1 billion people.
  • 70. “Convergence” at future extreme scale for computing and data (in clouds?). (Source: Assessing trends over time in performance, costs, and energy use for servers, Intel, 2009.) HPC: x1000 in 10 years, CAGR ~= 100%; IDC: x30 in 10 years, CAGR ~= 30-40%, with server unit sales flat (replacement demand).
  • 71. What does this all mean? • “Leveraging of mainframe technologies in HPC has been dead for some time.” • But will leveraging Cloud/Mobile be sufficient? • NO! They are already falling behind, and will be perpetually behind – CAGR of Clouds 30%, HPC 100%: all data supports it – Stagnation in network, storage, scaling, … • Rather, HPC will be the technology driver for future Big Data, for Cloud/Mobile to leverage! – Rather than repurposed standard servers
  • 72. Future “Extreme Big Data” is NOT mining terabytes of silo data: it is peta~zettabytes of data, ultra high-BW data streams, highly unstructured and irregular, with complex correlations between data from multiple sources. Extreme capacity, bandwidth, and compute are all required.
  • 73. Extreme big data is not just traditional HPC!!! [slide courtesy Alok Choudhary, Northwestern U]. Analysis of required system properties: a radar chart comparing extreme-scale computing, big data analytics, and a BDEC knowledge discovery engine along the axes of processor speed, OPS, memory/ops, algorithmic variety, power optimization opportunities, communication pattern variability, approximate computations, communication latency tolerance, local persistent storage, and read/write performance.
  • 74. EBD research scheme: future non-silo extreme big data apps (ultra-large-scale graphs and social infrastructures, large-scale metagenomics, massive sensors and data assimilation in weather prediction) are co-designed with the EBD system software, including the EBD object system (EBD bag, Cartesian plane, KVS / EBD KVS, graph store), on a convergent exascale big data HPC architecture (phases 1~4) with large-capacity NVM and a high-bisection network; in contrast, cloud IDCs have very low BW & efficiency and conventional supercomputers are compute- & batch-oriented.
  • 75. Phase 4 (2019-20): DRAM + NVM + CPU with 3D/2.5D die stacking, the ultimate convergence of BD and EC. Figure: a high-powered main CPU and low-power CPUs on a TSV interposer on the PCB, with stacked DRAM and NVM/Flash; 4~6 HBM channels at 2Tbps each, 1.5TB/s of DRAM & NVM bandwidth, 30PB/s I/O bandwidth possible, 1 yottabyte/year.
  • 76. Preliminary I/O performance evaluation on GPU and NVRAM: how should local storage be designed for next-generation supercomputers? We designed a local I/O prototype using 16 mSATA SSDs on a RAID card on the motherboard (capacity 4TB, read bandwidth 8 GB/s).
       – I/O performance of multiple mSATA SSDs (raw vs. RAID0 with 1MB and 64KB stripes): ~7.39 GB/s from 16 mSATA SSDs with RAID0 enabled
       – I/O performance from the GPU to multiple mSATA SSDs: ~3.06 GB/s from 8 mSATA SSDs to the GPU, measured across matrix sizes from 0.274 to 140 GB
  • 77. Algorithm kernels on EBD: large-scale BFS using NVRAM.
       1. Introduction: large-scale graph processing appears in various domains and DRAM resources have increased; flash devices are spreading (pros: price per bit and energy consumption; cons: latency and throughput). Using NVRAM for large-scale graph processing has the potential for minimal performance degradation.
       2. Hybrid BFS: switch between the top-down and bottom-up approaches based on the number of frontier vertices n_frontier vs. the number of all vertices n_all, with switching parameters α and β.
       3. Proposal: (1) offload data with small accesses to NVRAM; (2) run BFS reading data from NVRAM.
       4. Evaluation (130M vertices, 2.1G edges): DRAM only (β=10α): 5.2 GTEPS; DRAM+SSD (β=0.1α): 2.8 GTEPS (47.1% down). We could reduce the DRAM size by half with 47.1% performance degradation. C.f. Pearce et al.: 13 times larger datasets at 52 MTEPS (1TB DRAM, 12TB NVRAM). We are working on multiplexed I/O, which improves NVRAM I/O performance. (A hybrid-BFS sketch follows below.)
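To illustrate the top-down/bottom-up switching in item 2 above, here is a minimal in-memory Python sketch of a direction-optimizing BFS; the vertex-count switching rule and the alpha/beta defaults are simplified stand-ins rather than the exact thresholds used on the slide, and there is no NVRAM offloading here.

    def hybrid_bfs(adj, source, alpha=20.0, beta=1000.0):
        """Direction-optimizing ("hybrid") BFS on an undirected graph given as an
        adjacency list: top-down while the frontier is small, bottom-up once it
        exceeds n/alpha, and back to top-down when it shrinks below n/beta."""
        n = len(adj)
        parent = [-1] * n
        parent[source] = source
        frontier = {source}
        bottom_up = False
        while frontier:
            if not bottom_up and len(frontier) > n / alpha:
                bottom_up = True
            elif bottom_up and len(frontier) < n / beta:
                bottom_up = False
            next_frontier = set()
            if bottom_up:
                # Bottom-up: each unvisited vertex scans its neighbors for a parent
                # already in the frontier (cheap when the frontier is very large).
                for v in range(n):
                    if parent[v] == -1:
                        for u in adj[v]:
                            if u in frontier:
                                parent[v] = u
                                next_frontier.add(v)
                                break
            else:
                # Top-down: frontier vertices push to their unvisited neighbors.
                for u in frontier:
                    for v in adj[u]:
                        if parent[v] == -1:
                            parent[v] = u
                            next_frontier.add(v)
            frontier = next_frontier
        return parent

    # Tiny usage example: a 6-vertex path graph, BFS tree as parent pointers.
    adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
    print(hybrid_bfs(adj, source=0))   # [0, 0, 1, 2, 3, 4]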
  • 78. High-performance sorting: fast algorithms, distribution- vs. comparison-based.
       – Comparison of keys: classical O(N log N) sorts (quick, merge, etc.) and bitonic sort; handle variable-length / short / long keys
       – Distribution (integer) sorts: MSD radix sort and LSD radix sort (Thrust); high efficiency on small fixed-length keys; GPUs are good at counting numbers; efficient implementations are good for GPU nodes
       – Computational genomics: small alphabets (A, C, G, T), so you don't have to examine all characters
       – Map-Reduce / Hadoop: easy to use but not that efficient
       – Scalability: hybrid approaches, the best yet to be found; balancing I/O and computation
       (An LSD radix sort sketch follows below.)
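A minimal LSD radix sort in Python, as an illustration of the distribution-style, counting-based sorting that the slide contrasts with O(N log N) comparison sorts; this is plain CPU code, not the Thrust GPU implementation mentioned above.

    def lsd_radix_sort(keys, key_bits=32, radix_bits=8):
        """LSD (least-significant-digit-first) radix sort for fixed-width unsigned
        integer keys: key_bits/radix_bits passes, each a stable distribution of the
        keys into 2**radix_bits buckets by the current digit."""
        radix = 1 << radix_bits
        mask = radix - 1
        for shift in range(0, key_bits, radix_bits):
            buckets = [[] for _ in range(radix)]
            for k in keys:
                buckets[(k >> shift) & mask].append(k)   # distribute by current digit
            keys = [k for b in buckets for k in b]       # stable gather
        return keys

    print(lsd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
    # [2, 24, 45, 66, 75, 90, 170, 802]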
  • 79. Twitter follow-ship network, 2009 (an application of the Graph500 benchmark): 41 million vertices and 1.47 billion edges, where an (i, j) edge means user i follows user j. Frontier size in a BFS with user 21,804,357 as the source; our NUMA-optimized BFS on a 4-way Xeon system takes 69 ms per BFS ⇒ 21.28 GTEPS. Six degrees of separation; frontier size by BFS level (frequency %, cumulative %):
       – Lv 0: 1 (0.00%, 0.00%); Lv 1: 7 (0.00%, 0.00%); Lv 2: 6,188 (0.01%, 0.01%); Lv 3: 510,515 (1.23%, 1.24%); Lv 4: 29,526,508 (70.89%, 72.13%); Lv 5: 11,314,238 (27.16%, 99.29%); Lv 6: 282,456 (0.68%, 99.97%); Lv 7: 11,536 (0.03%, 100.00%); Lv 8: 673; Lv 9: 68; Lv 10: 19; Lv 11: 10; Lv 12: 5; Lv 13-15: 2 each; total reached: 41,652,230
  • 80. 100,000-times-fold EBD “convergent” system overview (figure): Tasks 5-1~5-3: EBD application co-design and validation; Task 3: EBD programming system; Task 2: graph store; Task 1: data assimilation in large-scale sensors and exascale atmospherics; Task 4: EBD “converged” real-time resource scheduling; Task 6: EBD performance modeling & evaluation. The EBD distributed object store (EBD bag, Cartesian plane, KVS / EBD KVS) runs on 100,000 NVM extreme compute and data nodes with ultra-parallel & low-power I/O; the EBD “convergent” supercomputer scales ~10TB/s ⇒ ~100TB/s ⇒ ~10PB/s with ultra-high-BW & low-latency NVM and NW, processor-in-memory, and 3D stacking. Applications: large-scale graphs and social infrastructure apps, large-scale genomic correlation. TSUBAME 2.0/2.5 ⇒ TSUBAME 3.0.
  • 81. Summary • TSUBAME1.0 -> 2.0 -> 2.5 -> 3.0 -> … – TSUBAME 2.5 is number 1 in Japan at 17 Petaflops SFP – a template for future supercomputers and IDC machines • TSUBAME3.0, early 2016 – new supercomputing leadership – tremendous power efficiency, extreme big data, extremely high reliability • Lots of background R&D for TSUBAME3.0 and towards exascale – green computing: ULP-HPC & TSUBAME-KFC – extreme big data: the convergence of HPC and IDC! – exascale resilience – programming with millions of cores – … • Please stay tuned! We look forward to your support.