Extreme Big Data (EBD) 
Convergence of Extreme Computing 
and Big Data Technologies 
Global Scientific Information and Computing Center, 
Tokyo Institute of Technology 
 
Hitoshi Sato
Big Data Examples 
Data rates and volumes are both extreme 
Social NW 
• Facebook 
– 1 billion users 
– Average 130 friends 
– 30 billion pieces of content 
shared per month 
• Twitter 
– 500 million active users 
– 340 million tweets per day 
• Internet 
– 300 million new websites per year 
– 48 hours of video to YouTube per minute 
– 30,000 YouTube videos played per second 
Genomics 
Social Simulation
Sequencing data (bp/$) 
becomes x4000 per 5 years 
(c.f., HPC: x33 in 5 years) 
• Applications 
– Target Area: Planet 
(Open Street Map) 
– 7 billion people 
• Input Data 
– Road Network for Planet: 
300GB (XML) 
– Trip data for 7 billion people 
10KB (1 trip) x 7 billion = 70TB 
– Real-Time Streaming Data 
(e.g., Social sensor, physical data) 
• Simulated Output for 1 Iteration 
– 700TB 
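A quick back-of-envelope check of the volumes quoted above (my arithmetic from the slide's figures, assuming decimal KB/TB):

```python
# Data volumes for the planet-scale social simulation, from the figures above.
people = 7_000_000_000
trip_bytes = 10 * 1000                    # 10KB per trip
trip_total = people * trip_bytes          # total trip input data
print(trip_total / 1e12, "TB")            # 70.0 TB, matching the slide

output_per_iteration = 700e12             # 700TB simulated output
print(output_per_iteration / trip_total)  # 10.0: each iteration emits 10x the input
```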
Weather 
A-1. Quality Control / A-2. Data Processing 
30-sec Ensemble Forecast Simulations: 2 PFLOP 
Ensemble Data Assimilation: 2 PFLOP 
Himawari: 500MB/2.5min 
Ensemble Forecasts: 200GB 
Phased Array Radar: 1GB/30sec/2 radars 
Ensemble Analyses: 200GB 
B-1. Quality Control / B-2. Data Processing 
Analysis Data: 2GB 
30-min Forecast Simulation: 1.2 PFLOP 
30-min Forecast: 2GB 
Repeat every 30 sec.
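The 30-second cycle implies demanding sustained rates; a rough sketch of the arithmetic (my reading of the slide's figures, e.g. taking the radar rate as the total for both radars):

```python
# Sustained rates implied by the 30-sec weather forecasting cycle above.
PFLOP = 1e15

# Each cycle runs a 2 PFLOP ensemble forecast plus 2 PFLOP of ensemble
# data assimilation, and must finish before the next cycle starts.
ops_per_cycle = 2 * PFLOP + 2 * PFLOP
print(ops_per_cycle / 30 / 1e12)   # ~133 TFLOPS sustained just to keep up

# Sensor ingest: Himawari 500MB per 2.5 min; phased-array radar 1GB per 30s.
himawari_bps = 500e6 / 150
radar_bps = 1e9 / 30
print((himawari_bps + radar_bps) / 1e6)  # ~36.7 MB/s of raw observations
```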
Future “Extreme Big Data” 
• NOT mining TBytes of silo data 
• Peta~Zettabytes of data 
• Ultra high-BW data streams 
• Highly unstructured, irregular 
• Complex correlations between data from multiple sources 
• Extreme capacity, bandwidth, and compute all required 
Graph500 “Big Data” Benchmark
Kronecker generator edge-quadrant probabilities: 
A: 0.57, B: 0.19, C: 0.19, D: 0.05 
November 15, 2010: “Graph500 Takes Aim at a New Kind of HPC”, 
Richard Murphy (Sandia NL / Micron): 
“I expect that this ranking may at times look very 
different from the TOP500 list. Cloud architectures will 
almost certainly dominate a major chunk of part of the 
list.” 
The 8th Graph500 List (June 2014): K Computer #1, TSUBAME2 #12 
Koji Ueno, Tokyo Institute of Technology / RIKEN AICS 
Reality: Top500 supercomputers dominate the list; no Cloud IDCs at all. 
#1 K Computer: RIKEN Advanced Institute for Computational Science (AICS)’s 
K computer is ranked No.1 on the Graph500 Ranking of Supercomputers with 
17977.1 GE/s on Scale 40, on the 8th Graph500 list published at the 
International Supercomputing Conference, June 22, 2014. 
Congratulations from the Graph500 Executive Committee. 
#12 TSUBAME2: Global Scientific Information and Computing Center, Tokyo 
Institute of Technology’s TSUBAME 2.5 is ranked No.12 on the Graph500 
Ranking of Supercomputers with 1280.43 GE/s on Scale 36, on the 8th 
Graph500 list published at the International Supercomputing Conference, 
June 22, 2014. 
Congratulations from the Graph500 Executive Committee.
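For context, Graph500 builds a synthetic Kronecker (R-MAT-style) graph from the quadrant probabilities A, B, C, D shown above and then times breadth-first search, reporting traversed edges per second. A minimal edge-sampling sketch (the reference generator is more elaborate, adding per-level noise and vertex permutation, which this version omits):

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, d=0.05, rng=random):
    """Sample one edge of a 2^scale-vertex R-MAT/Kronecker graph using
    the Graph500 quadrant probabilities (a, b, c, d sum to 1)."""
    src = dst = 0
    for _ in range(scale):
        # Descend one level: pick a quadrant of the adjacency matrix.
        src, dst = src * 2, dst * 2
        r = rng.random()
        if r < a:                # quadrant A: top-left, no bit set
            pass
        elif r < a + b:          # quadrant B: top-right
            dst += 1
        elif r < a + b + c:      # quadrant C: bottom-left
            src += 1
        else:                    # quadrant D: bottom-right
            src += 1
            dst += 1
    return src, dst

# Scale 40 (the K computer run above) means 2^40 ~ 1.1e12 vertices;
# here we just sample a few edges of a toy scale-10 (1024-vertex) graph.
print([rmat_edge(10) for _ in range(5)])
```

Because A dominates, edges skew heavily toward low vertex ids, which is what gives these graphs their skewed, social-network-like degree distribution.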
[Garbled slides: TSUBAME2.0/2.5 hardware summary, Tokyo Tech. Recoverable 
fragments: compute nodes/racks; CPU: Intel Xeon; GPU: NVIDIA Tesla K20X; 
Mem: 54GB DDR; SSDs; full-bisection optical QDR InfiniBand; HDD: 7PB 
(Lustre, GPFS); tape storage. Full details appear in the TSUBAME2 System 
Overview that follows.]
TSUBAME2 System Overview 
11PB (7PB HDD, 4PB Tape, 200TB SSD) 
 
Computing Nodes: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), 
~100TB MEM, ~200TB SSD 
 
Thin nodes: HP ProLiant SL390s G7, 1408 nodes (32 nodes x 44 racks) 
• CPU: Intel Westmere-EP 2.93GHz, 6 cores x 2 = 12 cores/node 
• GPU: NVIDIA Tesla K20X, 3 GPUs/node 
• Mem: 54GB (96GB) 
• SSD: 60GB x 2 = 120GB (120GB x 2 = 240GB) 
Medium nodes: HP ProLiant DL580 G7, 24 nodes 
• CPU: Intel Nehalem-EX 2.0GHz, 8 cores x 2 = 32 cores/node 
• GPU: NVIDIA Tesla S1070, NextIO vCORE Express 2070 
• Mem: 128GB 
• SSD: 120GB x 4 = 480GB 
Fat nodes: HP ProLiant DL580 G7, 10 nodes 
• CPU: Intel Nehalem-EX 2.0GHz, 8 cores x 2 = 32 cores/node 
• GPU: NVIDIA Tesla S1070 
• Mem: 256GB (512GB) 
• SSD: 120GB x 4 = 480GB 
Local SSDs on every node. 
 
Interconnect: full-bisection optical QDR InfiniBand network 
• Core Switch: Voltaire Grid Director 4700 x 12 (IB QDR: 324 ports); 
QDR IB (x4) x 20 
• Edge Switch: Voltaire Grid Director 4036 x 179 (IB QDR: 36 ports); 
QDR IB (x4) x 8 
• Edge Switch w/ 10GbE ports: Voltaire Grid Director 4036E x 6 
(IB QDR: 34 ports, 10GbE: 2 ports); 10GbE x 2 
 
Storage, on InfiniBand QDR networks: 
• Parallel File System Volumes (“Global Work Space” #1~#3) 
– GPFS + Tape: SFA10k #1-#4 (GPFS#1-#4), /data0 /data1, 
2.4PB HDD + 4PB Tape 
– Lustre: SFA10k #5, /work0 /work1 /gscr, 3.6PB 
• Home Volumes: SFA10k #6, 1.2PB 
– HOME, system application: “cNFS/Clustered Samba w/ GPFS” 
– HOME, iSCSI: “NFS/CIFS/iSCSI by BlueARC”
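The "~200TB SSD" aggregate in the overview can be cross-checked against the per-node specs (my arithmetic, using the base non-parenthesized SSD sizes and decimal units):

```python
# Aggregate local SSD capacity from the TSUBAME2 node specs above.
thin   = 1408 * 120e9   # 60GB x 2 per thin node
medium =   24 * 480e9   # 120GB x 4 per medium node
fat    =   10 * 480e9   # 120GB x 4 per fat node
total = thin + medium + fat
print(total / 1e12)     # ~185 TB, i.e. the "~200TB SSD" headline figure
```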
TSUBAME2 storage I/O usage (annotations on the same system diagram): 
• Local SSDs: fine-grained R/W I/O (checkpoints, temporary files) 
• Parallel file system volumes: read-mostly I/O (data-intensive apps, 
parallel workflows, parameter surveys) 
• Home volumes: home storage for computing nodes; cloud-based campus 
storage services; backup
TSUBAME2 Storage Usage Since Nov. 2010
TSUBAME2.5 as a Big Data Infrastructure 
• GPU-based many-core accelerators 
– NVIDIA Tesla K20X, 4224 cards 
• Huge memory volumes per node 
– 54GB per node, 1408 nodes 
• Fat-tree, dual-rail QDR InfiniBand 
– 200Tbps of full-bisection bandwidth 
• Local SSD devices per node 
– ~200TB in total 
• Large-scale storage systems 
– Lustre, GPFS 
– 7PB of HDDs, 4PB of Tapes
A Major Northern Japanese Cloud Datacenter (2013) 
• Core: Juniper MX480 x 2, 10GbE to the Internet 
• 2 zone switches (Virtual Chassis): Juniper EX8208 x 2, 10GbE, LACP 
• Zones of 700 nodes each, on Juniper EX4200 switches 
• 8 zones, total 5600 nodes 
• Injection 1GBps/node, bisection 160 Gigabps 
 
vs. Supercomputer: Tokyo Tech. TSUBAME 2.0, #4 Top500 (2010) 
• ~1500 compute & storage nodes 
• Full-bisection, multi-rail optical network 
• Injection 80GBps/node, bisection 220 Terabps 
• Advanced silicon photonics: 40G on a single CMOS die, 
1490nm DFB, 100km fiber 
x1000!
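The "x1000!" note compares per-node bisection bandwidth; redoing the arithmetic from the slide's figures (order of magnitude only):

```python
# Per-node bisection bandwidth: cloud datacenter vs. TSUBAME 2.0.
cloud_per_node = 160e9 / 5600        # 160 Gbps bisection across 5600 nodes
sc_per_node = 220e12 / 1500          # 220 Tbps bisection across ~1500 nodes
print(cloud_per_node / 1e6)          # ~28.6 Mbps per node
print(sc_per_node / 1e9)             # ~146.7 Gbps per node
print(sc_per_node / cloud_per_node)  # well over 1000x per node
```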
Towards Extreme-scale 
Supercomputers and Big Data Machines 
• Computation 
– Increase in parallelism, heterogeneity, density 
• Multi-core, many-core processors 
• Heterogeneous processors 
• Hierarchical Memory/Storage Architecture 
– NVM (Non-Volatile Memory), 
SCM (Storage Class Memory) 
• Flash, PCM, STT-RAM, ReRAM, HMC, etc. 
– Next-gen HDDs (SMR), 
Tapes (LTFS) 
Problems: Algorithm, Network, Locality, Power, FT, Productivity, 
Storage Hierarchy, I/O, Scalability, Heterogeneity
Extreme Big Data (EBD) 
Next Generation Big Data 
Infrastructure Technologies Towards 
Yottabyte/Year  
Principal Investigator 
Satoshi Matsuoka 
Global Scientific Information and 
Computing Center 
Tokyo Institute of Technology 
2014/11/05 JST CREST Big Data Symposium
EBD Research Scheme 
Future Non-Silo Extreme Big Data Apps 
Co-Design 
Co-Design 
Co-Design 
EBD System Software 
incl. EBD Object System 
NVM/Flash, NVM/Flash, NVM/Flash 
DRAM, DRAM, DRAM 
2Tbps HBM, 4~6 HBM channels 
1.5TB/s DRAM & NVM BW 
30PB/s I/O BW possible 
1 Yottabyte / Year 
TSV Interposer 
NVM/Flash, NVM/Flash, NVM/Flash 
DRAM, DRAM, DRAM 
EBD Bag 
Cartesian Plane 
EBD KVS: KVS, KVS, KVS 
1000km 
Convergent Architecture (Phases 1~4) 
Large Capacity NVM, High-Bisection NW 
Supercomputers: compute & batch-oriented 
Cloud + IDC: very low BW & efficiency 
 
PCB 
High Powered 
Main CPU 
Low 
Power 
CPU 
Low 
Power 
CPU 
Large Scale 
Metagenomics 
(embedded poster: “A Multi-GPU Read Alignment Algorithm”, 
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka) 
Massive Sensors and 
Data Assimilation in 
Weather Prediction 
Ultra Large Scale 
Graphs and Social 
Infrastructures 
Exascale Big Data HPC 
Graph 
Store
Extreme Big Data (EBD) Team 
Co-Design EHPC and EBD Apps 
• Satoshi Matsuoka (PI), Toshio 
Endo, Hitoshi Sato (Tokyo 
Tech.) (Tasks 1, 3, 4, 6) 
• Osamu Tatebe (Univ. 
Tsukuba) (Tasks 2, 3) 
• Michihiro Koibuchi (NII) 
(Tasks 1, 2) 
• Yutaka Akiyama, Ken 
Kurokawa (Tokyo Tech, 5-1) 
• Toyotaro Suzumura 
(IBM Lab, 5-2) 
• Takemasa Miyoshi (RIKEN 
AICS, 5-3)
100,000-Times-Fold EBD “Convergent” System Overview 
• Task 1: Ultra-Parallel & Low-Power I/O EBD “Convergent” Supercomputer 
– ~10TB/s → ~100TB/s → ~10PB/s 
– Ultra high-BW & low-latency NVM; ultra high-BW & low-latency NW 
– Processor-in-memory, 3D stacking 
– TSUBAME 2.0/2.5 → TSUBAME 3.0 
• Task 2: EBD Distributed Object Store on 100,000 NVM Extreme Compute 
and Data Nodes 
– EBD Bag, EBD KVS, Graph Store, Cartesian Plane 
– EBD “converged” real-time resource scheduling 
• Task 3: EBD Programming System 
• Task 4: EBD Performance Modeling & Evaluation 
• Tasks 5-1~5-3, Task 6: EBD Application Co-Design and Validation 
– Large Scale Genomic Correlation; Data Assimilation in Large Scale 
Sensors and Exascale Atmospherics; Large Scale Graphs and Social 
Infrastructure Apps
100,000-Times-Fold EBD “Convergent” System Overview (software stack) 
• SQL for EBD; Xpregel (Graph); Workflow/Scripting Languages for EBD; 
MapReduce for EBD; PGAS/Global Array for EBD; 
Message Passing (MPI, X10) for EBD 
• EBD Abstract Data Models 
(Distributed Array, Key-Value, Sparse Data Model, Tree, etc.) 
• EBD Algorithm Kernels 
(Search/Sort, Matching, Graph Traversals, etc.) 
• EBD Bag, EBD KVS, Graph Store, Cartesian Plane 
• EBD File System; EBD Data Object; EBD Burst I/O Buffer; 
EBD Network Topology and Routing 
• NVM (Flash, PCM, STT-RAM, ReRAM, HMC, etc.); HPC Storage 
• Interconnect (InfiniBand, 100GbE); Network (SINET5) 
• TSUBAME 3.0; TSUBAME-GoldenBox; Cloud Datacenter; 
Intercloud/Grid (HPCI); Web Object Storage 
• Apps: Large Scale Genomic Correlation; Data Assimilation in Large 
Scale Sensors and Exascale Atmospherics; Large Scale Graphs and 
Social Infrastructure Apps
Hamar (Highly Accelerated Map Reduce) 
[Shirahata, Sato et al., Cluster 2014] 
• A software framework for large-scale supercomputers 
w/ many-core accelerators and local NVM devices 
– Abstraction for the deepening memory hierarchy 
• Device memory on GPUs, DRAM, Flash devices, etc. 
• Features 
– Object-oriented 
• C++-based implementation 
• Easy adaptation to modern commodity 
many-core accelerator/Flash devices w/ SDKs 
– CUDA, OpenNVM, etc. 
– Weak-scaling over 1000 GPUs 
• TSUBAME2 
– Out-of-core GPU data management 
• Optimized data streaming between 
device/host memory 
• GPU-based external sorting 
– Optimized data formats for 
many-core accelerators 
• Similar to JDS format
Hamar Overview 
• Each rank (Rank 0, Rank 1, …, Rank n) holds a Local Array; 
together they form a Distributed Array 
• Execution alternates Map → Shuffle → Reduce, with the Shuffle 
phase transferring data between ranks 
• Local Arrays reside on NVM as virtualized data objects 
• Data moves between device (GPU) and host (CPU) memory 
via memcpy (H2D, D2H)
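The execution model in the diagram above can be sketched in plain Python; the "ranks", the word-count job, and the function names below are stand-ins, not Hamar's actual API:

```python
from collections import defaultdict

def hamar_style_mapreduce(partitions, map_fn, reduce_fn):
    """Toy sketch of the map -> shuffle -> reduce flow in the Hamar
    diagram: each 'rank' maps its own local array, the shuffle regroups
    intermediate pairs by key across ranks, then each key is reduced.
    (Pure-Python stand-in; real Hamar streams data GPU <-> host <-> NVM.)"""
    # Map: every rank processes its local array independently.
    intermediate = []
    for local_array in partitions:
        for item in local_array:
            intermediate.extend(map_fn(item))
    # Shuffle: regroup intermediate pairs by key (the inter-rank transfer).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: combine each key's values.
    return {key: reduce_fn(values) for key, values in groups.items()}

# Word count across 3 "ranks":
ranks = [["a b a"], ["b c"], ["a"]]
out = hamar_style_mapreduce(
    ranks,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=sum,
)
print(out)  # {'a': 3, 'b': 2, 'c': 1}
```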
Application Example: GIM-V 
(Generalized Iterative Matrix-Vector multiplication*1) 
• Easy description of various graph algorithms by implementing the 
combine2, combineAll, and assign functions 
– PageRank, Random Walk with Restart, Connected Components 
• v’ = M ×G v, where 
v’_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_i,j, v_j)})) (i = 1..n) 
– combine2 is stage 1; combineAll and assign are stage 2 
• Iterative two-phase MapReduce operations 
• Straightforward implementation using Hamar 
*1: Kang, U. et al., “PEGASUS: A Peta-Scale Graph Mining System - 
Implementation and Observations”, IEEE International Conference on 
Data Mining, 2009
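A minimal sketch of the GIM-V definition above, instantiated for PageRank as on the slide; the dict-of-dicts matrix and the `pagerank` helper are illustrative stand-ins (real Hamar runs this over distributed arrays on GPUs, and the sketch assumes every page has at least one outgoing link):

```python
# v' = M xG v with user-supplied combine2 / combineAll / assign.
def gimv(M, v, combine2, combineAll, assign, iters=1):
    n = len(v)
    for _ in range(n and iters):
        new_v = []
        for i in range(n):
            # Stage 1 (combine2): pair each nonzero m_ij with v_j.
            xs = [combine2(M[i][j], v[j]) for j in M[i]]
            # Stage 2 (combineAll + assign): fold and update v'_i.
            new_v.append(assign(v[i], combineAll(i, xs)))
        v = new_v
    return v

def pagerank(adj, d=0.85, iters=30):
    """PageRank via GIM-V: combine2 = m*v, combineAll = damped sum,
    assign = replace. Assumes no dangling nodes (outdeg > 0)."""
    n = len(adj)
    outdeg = [len(a) for a in adj]
    # Column-stochastic M: M[i][j] = 1/outdeg(j) for each edge j -> i.
    M = [{} for _ in range(n)]
    for j, targets in enumerate(adj):
        for i in targets:
            M[i][j] = 1.0 / outdeg[j]
    return gimv(
        M, [1.0 / n] * n,
        combine2=lambda m, vj: m * vj,
        combineAll=lambda i, xs: (1 - d) / n + d * sum(xs),
        assign=lambda old, new: new,
        iters=iters,
    )

ranks = pagerank([[1], [2], [0]])    # 3-cycle: all pages rank equally
print([round(r, 3) for r in ranks])  # [0.333, 0.333, 0.333]
```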

Japan Lustre User Group 2014
