Extreme Big Data (EBD) 
Convergence of Extreme Computing 
and Big Data Technologies 
Global Scientific Information and Computing Center, 
Tokyo Institute of Technology 
 
Hitoshi Sato
Big Data Examples 
Data rates and volumes are both extreme 
Social NW 
• Facebook 
– 1 billion users 
– Average 130 friends 
– 30 billion pieces of content 
shared per month 
• Twitter 
– 500 million active users 
– 340 million tweets per day 
• Internet 
– 300 million new websites per year 
– 48 hours of video to YouTube per minute 
– 30,000 YouTube videos played per second 
Genomics 
Social Simulation
Sequencing data (bp/$) 
becomes x4000 per 5 years 
(c.f., HPC: x33 in 5 years) 
• Applications 
– Target Area: Planet 
(Open Street Map) 
– 7 billion people 
• Input Data 
– Road Network for Planet: 
300GB (XML) 
– Trip data for 7 billion people 
10KB (1 trip) x 7 billion = 70TB 
– Real-Time Streaming Data 
(e.g., Social sensor, physical data) 
• Simulated Output for 1 Iteration 
– 700TB 
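A quick back-of-envelope check of the volumes quoted above (my arithmetic from the slide's figures, assuming decimal KB/TB):

```python
# Data volumes for the planet-scale social simulation, from the figures above.
people = 7_000_000_000
trip_bytes = 10 * 1000                    # 10KB per trip
trip_total = people * trip_bytes          # total trip input data
print(trip_total / 1e12, "TB")            # 70.0 TB, matching the slide

output_per_iteration = 700e12             # 700TB simulated output
print(output_per_iteration / trip_total)  # 10.0: each iteration emits 10x the input
```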
Weather 
A-1. Quality Control / A-2. Data Processing 
30-sec Ensemble Forecast Simulations: 2 PFLOP 
Ensemble Data Assimilation: 2 PFLOP 
Himawari: 500MB/2.5min 
Ensemble Forecasts: 200GB 
Phased Array Radar: 1GB/30sec/2 radars 
Ensemble Analyses: 200GB 
B-1. Quality Control / B-2. Data Processing 
Analysis Data: 2GB 
30-min Forecast Simulation: 1.2 PFLOP 
30-min Forecast: 2GB 
Repeat every 30 sec.
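The 30-second cycle implies demanding sustained rates; a rough sketch of the arithmetic (my reading of the slide's figures, e.g. taking the radar rate as the total for both radars):

```python
# Sustained rates implied by the 30-sec weather forecasting cycle above.
PFLOP = 1e15

# Each cycle runs a 2 PFLOP ensemble forecast plus 2 PFLOP of ensemble
# data assimilation, and must finish before the next cycle starts.
ops_per_cycle = 2 * PFLOP + 2 * PFLOP
print(ops_per_cycle / 30 / 1e12)   # ~133 TFLOPS sustained just to keep up

# Sensor ingest: Himawari 500MB per 2.5 min; phased-array radar 1GB per 30s.
himawari_bps = 500e6 / 150
radar_bps = 1e9 / 30
print((himawari_bps + radar_bps) / 1e6)  # ~36.7 MB/s of raw observations
```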
Future “Extreme Big Data” 
• NOT mining TBytes of silo data 
• Peta~Zettabytes of data 
• Ultra high-BW data streams 
• Highly unstructured, irregular 
• Complex correlations between data from multiple sources 
• Extreme capacity, bandwidth, and compute all required 
Graph500 “Big Data” Benchmark
Kronecker generator edge-quadrant probabilities: 
A: 0.57, B: 0.19, C: 0.19, D: 0.05 
November 15, 2010: “Graph500 Takes Aim at a New Kind of HPC”, 
Richard Murphy (Sandia NL / Micron): 
“I expect that this ranking may at times look very 
different from the TOP500 list. Cloud architectures will 
almost certainly dominate a major chunk of part of the 
list.” 
The 8th Graph500 List (June 2014): K Computer #1, TSUBAME2 #12 
Koji Ueno, Tokyo Institute of Technology / RIKEN AICS 
Reality: Top500 supercomputers dominate the list; no Cloud IDCs at all. 
#1 K Computer: RIKEN Advanced Institute for Computational Science (AICS)’s 
K computer is ranked No.1 on the Graph500 Ranking of Supercomputers with 
17977.1 GE/s on Scale 40, on the 8th Graph500 list published at the 
International Supercomputing Conference, June 22, 2014. 
Congratulations from the Graph500 Executive Committee. 
#12 TSUBAME2: Global Scientific Information and Computing Center, Tokyo 
Institute of Technology’s TSUBAME 2.5 is ranked No.12 on the Graph500 
Ranking of Supercomputers with 1280.43 GE/s on Scale 36, on the 8th 
Graph500 list published at the International Supercomputing Conference, 
June 22, 2014. 
Congratulations from the Graph500 Executive Committee.
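For context, Graph500 builds a synthetic Kronecker (R-MAT-style) graph from the quadrant probabilities A, B, C, D shown above and then times breadth-first search, reporting traversed edges per second. A minimal edge-sampling sketch (the reference generator is more elaborate, adding per-level noise and vertex permutation, which this version omits):

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, d=0.05, rng=random):
    """Sample one edge of a 2^scale-vertex R-MAT/Kronecker graph using
    the Graph500 quadrant probabilities (a, b, c, d sum to 1)."""
    src = dst = 0
    for _ in range(scale):
        # Descend one level: pick a quadrant of the adjacency matrix.
        src, dst = src * 2, dst * 2
        r = rng.random()
        if r < a:                # quadrant A: top-left, no bit set
            pass
        elif r < a + b:          # quadrant B: top-right
            dst += 1
        elif r < a + b + c:      # quadrant C: bottom-left
            src += 1
        else:                    # quadrant D: bottom-right
            src += 1
            dst += 1
    return src, dst

# Scale 40 (the K computer run above) means 2^40 ~ 1.1e12 vertices;
# here we just sample a few edges of a toy scale-10 (1024-vertex) graph.
print([rmat_edge(10) for _ in range(5)])
```

Because A dominates, edges skew heavily toward low vertex ids, which is what gives these graphs their skewed, social-network-like degree distribution.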
[Garbled slides: TSUBAME2.0/2.5 hardware summary, Tokyo Tech. Recoverable 
fragments: compute nodes/racks; CPU: Intel Xeon; GPU: NVIDIA Tesla K20X; 
Mem: 54GB DDR; SSDs; full-bisection optical QDR InfiniBand; HDD: 7PB 
(Lustre, GPFS); tape storage. Full details appear in the TSUBAME2 System 
Overview that follows.]
TSUBAME2 System Overview 
11PB (7PB HDD, 4PB Tape, 200TB SSD) 
 
Computing Nodes: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), 
~100TB MEM, ~200TB SSD 
 
Thin nodes: HP ProLiant SL390s G7, 1408 nodes (32 nodes x 44 racks) 
• CPU: Intel Westmere-EP 2.93GHz, 6 cores x 2 = 12 cores/node 
• GPU: NVIDIA Tesla K20X, 3 GPUs/node 
• Mem: 54GB (96GB) 
• SSD: 60GB x 2 = 120GB (120GB x 2 = 240GB) 
Medium nodes: HP ProLiant DL580 G7, 24 nodes 
• CPU: Intel Nehalem-EX 2.0GHz, 8 cores x 2 = 32 cores/node 
• GPU: NVIDIA Tesla S1070, NextIO vCORE Express 2070 
• Mem: 128GB 
• SSD: 120GB x 4 = 480GB 
Fat nodes: HP ProLiant DL580 G7, 10 nodes 
• CPU: Intel Nehalem-EX 2.0GHz, 8 cores x 2 = 32 cores/node 
• GPU: NVIDIA Tesla S1070 
• Mem: 256GB (512GB) 
• SSD: 120GB x 4 = 480GB 
Local SSDs on every node. 
 
Interconnect: full-bisection optical QDR InfiniBand network 
• Core Switch: Voltaire Grid Director 4700 x 12 (IB QDR: 324 ports); 
QDR IB (x4) x 20 
• Edge Switch: Voltaire Grid Director 4036 x 179 (IB QDR: 36 ports); 
QDR IB (x4) x 8 
• Edge Switch w/ 10GbE ports: Voltaire Grid Director 4036E x 6 
(IB QDR: 34 ports, 10GbE: 2 ports); 10GbE x 2 
 
Storage, on InfiniBand QDR networks: 
• Parallel File System Volumes (“Global Work Space” #1~#3) 
– GPFS + Tape: SFA10k #1-#4 (GPFS#1-#4), /data0 /data1, 
2.4PB HDD + 4PB Tape 
– Lustre: SFA10k #5, /work0 /work1 /gscr, 3.6PB 
• Home Volumes: SFA10k #6, 1.2PB 
– HOME, system application: “cNFS/Clustered Samba w/ GPFS” 
– HOME, iSCSI: “NFS/CIFS/iSCSI by BlueARC”
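The "~200TB SSD" aggregate in the overview can be cross-checked against the per-node specs (my arithmetic, using the base non-parenthesized SSD sizes and decimal units):

```python
# Aggregate local SSD capacity from the TSUBAME2 node specs above.
thin   = 1408 * 120e9   # 60GB x 2 per thin node
medium =   24 * 480e9   # 120GB x 4 per medium node
fat    =   10 * 480e9   # 120GB x 4 per fat node
total = thin + medium + fat
print(total / 1e12)     # ~185 TB, i.e. the "~200TB SSD" headline figure
```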
TSUBAME2 storage I/O usage (annotations on the same system diagram): 
• Local SSDs: fine-grained R/W I/O (checkpoints, temporary files) 
• Parallel file system volumes: read-mostly I/O (data-intensive apps, 
parallel workflows, parameter surveys) 
• Home volumes: home storage for computing nodes; cloud-based campus 
storage services; backup
TSUBAME2 Storage Usage Since Nov. 2010
TSUBAME2.5 as a Big Data Infrastructure 
• GPU-based many-core accelerators 
– NVIDIA Tesla K20X, 4224 cards 
• Huge memory volumes per node 
– 54GB per node, 1408 nodes 
• Fat-tree, dual-rail QDR InfiniBand 
– 200Tbps of full-bisection bandwidth 
• Local SSD devices per node 
– ~200TB in total 
• Large-scale storage systems 
– Lustre, GPFS 
– 7PB of HDDs, 4PB of Tapes
A Major Northern Japanese Cloud Datacenter (2013) 
• Core: Juniper MX480 x 2, 10GbE to the Internet 
• 2 zone switches (Virtual Chassis): Juniper EX8208 x 2, 10GbE, LACP 
• Zones of 700 nodes each, on Juniper EX4200 switches 
• 8 zones, total 5600 nodes 
• Injection 1GBps/node, bisection 160 Gigabps 
 
vs. Supercomputer: Tokyo Tech. TSUBAME 2.0, #4 Top500 (2010) 
• ~1500 compute & storage nodes 
• Full-bisection, multi-rail optical network 
• Injection 80GBps/node, bisection 220 Terabps 
• Advanced silicon photonics: 40G on a single CMOS die, 
1490nm DFB, 100km fiber 
x1000!
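The "x1000!" note compares per-node bisection bandwidth; redoing the arithmetic from the slide's figures (order of magnitude only):

```python
# Per-node bisection bandwidth: cloud datacenter vs. TSUBAME 2.0.
cloud_per_node = 160e9 / 5600        # 160 Gbps bisection across 5600 nodes
sc_per_node = 220e12 / 1500          # 220 Tbps bisection across ~1500 nodes
print(cloud_per_node / 1e6)          # ~28.6 Mbps per node
print(sc_per_node / 1e9)             # ~146.7 Gbps per node
print(sc_per_node / cloud_per_node)  # well over 1000x per node
```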
Towards Extreme-scale 
Supercomputers and Big Data Machines 
• Computation 
– Increase in parallelism, heterogeneity, density 
• Multi-core, many-core processors 
• Heterogeneous processors 
• Hierarchical Memory/Storage Architecture 
– NVM (Non-Volatile Memory), 
SCM (Storage Class Memory) 
• Flash, PCM, STT-RAM, ReRAM, HMC, etc. 
– Next-gen HDDs (SMR), 
Tapes (LTFS) 
Problems: Algorithm, Network, Locality, Power, FT, Productivity, 
Storage Hierarchy, I/O, Scalability, Heterogeneity
Extreme Big Data (EBD) 
Next Generation Big Data 
Infrastructure Technologies Towards 
Yottabyte/Year  
Principal Investigator 
Satoshi Matsuoka 
Global Scientific Information and 
Computing Center 
Tokyo Institute of Technology 
2014/11/05 JST CREST Big Data Symposium
EBD Research Scheme 
Future Non-Silo Extreme Big Data Apps 
Co-Design 
Co-Design 
Co-Design 
EBD System Software 
incl. EBD Object System 
NVM/Flash, NVM/Flash, NVM/Flash 
DRAM, DRAM, DRAM 
2Tbps HBM, 4~6 HBM channels 
1.5TB/s DRAM & NVM BW 
30PB/s I/O BW possible 
1 Yottabyte / Year 
TSV Interposer 
NVM/Flash, NVM/Flash, NVM/Flash 
DRAM, DRAM, DRAM 
EBD Bag 
Cartesian Plane 
EBD KVS: KVS, KVS, KVS 
1000km 
Convergent Architecture (Phases 1~4) 
Large Capacity NVM, High-Bisection NW 
Supercomputers: compute & batch-oriented 
Cloud + IDC: very low BW & efficiency 
 
PCB 
High Powered 
Main CPU 
Low 
Power 
CPU 
Low 
Power 
CPU 
Large Scale 
Metagenomics 
(embedded poster: “A Multi-GPU Read Alignment Algorithm”, 
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka) 
Massive Sensors and 
Data Assimilation in 
Weather Prediction 
Ultra Large Scale 
Graphs and Social 
Infrastructures 
Exascale Big Data HPC 
Graph 
Store
Extreme Big Data (EBD) Team 
Co-Design EHPC and EBD Apps 
• Satoshi Matsuoka (PI), Toshio 
Endo, Hitoshi Sato (Tokyo 
Tech.) (Tasks 1, 3, 4, 6) 
• Osamu Tatebe (Univ. 
Tsukuba) (Tasks 2, 3) 
• Michihiro Koibuchi (NII) 
(Tasks 1, 2) 
• Yutaka Akiyama, Ken 
Kurokawa (Tokyo Tech, 5-1) 
• Toyotaro Suzumura 
(IBM Lab, 5-2) 
• Takemasa Miyoshi (RIKEN 
AICS, 5-3)
100,000-Times-Fold EBD “Convergent” System Overview 
• Task 1: Ultra-Parallel & Low-Power I/O EBD “Convergent” Supercomputer 
– ~10TB/s → ~100TB/s → ~10PB/s 
– Ultra high-BW & low-latency NVM; ultra high-BW & low-latency NW 
– Processor-in-memory, 3D stacking 
– TSUBAME 2.0/2.5 → TSUBAME 3.0 
• Task 2: EBD Distributed Object Store on 100,000 NVM Extreme Compute 
and Data Nodes 
– EBD Bag, EBD KVS, Graph Store, Cartesian Plane 
– EBD “converged” real-time resource scheduling 
• Task 3: EBD Programming System 
• Task 4: EBD Performance Modeling & Evaluation 
• Tasks 5-1~5-3, Task 6: EBD Application Co-Design and Validation 
– Large Scale Genomic Correlation; Data Assimilation in Large Scale 
Sensors and Exascale Atmospherics; Large Scale Graphs and Social 
Infrastructure Apps
100,000-Times-Fold EBD “Convergent” System Overview (software stack) 
• SQL for EBD; Xpregel (Graph); Workflow/Scripting Languages for EBD; 
MapReduce for EBD; PGAS/Global Array for EBD; 
Message Passing (MPI, X10) for EBD 
• EBD Abstract Data Models 
(Distributed Array, Key-Value, Sparse Data Model, Tree, etc.) 
• EBD Algorithm Kernels 
(Search/Sort, Matching, Graph Traversals, etc.) 
• EBD Bag, EBD KVS, Graph Store, Cartesian Plane 
• EBD File System; EBD Data Object; EBD Burst I/O Buffer; 
EBD Network Topology and Routing 
• NVM (Flash, PCM, STT-RAM, ReRAM, HMC, etc.); HPC Storage 
• Interconnect (InfiniBand, 100GbE); Network (SINET5) 
• TSUBAME 3.0; TSUBAME-GoldenBox; Cloud Datacenter; 
Intercloud/Grid (HPCI); Web Object Storage 
• Apps: Large Scale Genomic Correlation; Data Assimilation in Large 
Scale Sensors and Exascale Atmospherics; Large Scale Graphs and 
Social Infrastructure Apps
Hamar (Highly Accelerated Map Reduce) 
[Shirahata, Sato et al., Cluster 2014] 
• A software framework for large-scale supercomputers 
w/ many-core accelerators and local NVM devices 
– Abstraction for the deepening memory hierarchy 
• Device memory on GPUs, DRAM, Flash devices, etc. 
• Features 
– Object-oriented 
• C++-based implementation 
• Easy adaptation to modern commodity 
many-core accelerator/Flash devices w/ SDKs 
– CUDA, OpenNVM, etc. 
– Weak-scaling over 1000 GPUs 
• TSUBAME2 
– Out-of-core GPU data management 
• Optimized data streaming between 
device/host memory 
• GPU-based external sorting 
– Optimized data formats for 
many-core accelerators 
• Similar to JDS format
Hamar Overview 
• Each rank (Rank 0, Rank 1, …, Rank n) holds a Local Array; 
together they form a Distributed Array 
• Execution alternates Map → Shuffle → Reduce, with the Shuffle 
phase transferring data between ranks 
• Local Arrays reside on NVM as virtualized data objects 
• Data moves between device (GPU) and host (CPU) memory 
via memcpy (H2D, D2H)
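The execution model in the diagram above can be sketched in plain Python; the "ranks", the word-count job, and the function names below are stand-ins, not Hamar's actual API:

```python
from collections import defaultdict

def hamar_style_mapreduce(partitions, map_fn, reduce_fn):
    """Toy sketch of the map -> shuffle -> reduce flow in the Hamar
    diagram: each 'rank' maps its own local array, the shuffle regroups
    intermediate pairs by key across ranks, then each key is reduced.
    (Pure-Python stand-in; real Hamar streams data GPU <-> host <-> NVM.)"""
    # Map: every rank processes its local array independently.
    intermediate = []
    for local_array in partitions:
        for item in local_array:
            intermediate.extend(map_fn(item))
    # Shuffle: regroup intermediate pairs by key (the inter-rank transfer).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: combine each key's values.
    return {key: reduce_fn(values) for key, values in groups.items()}

# Word count across 3 "ranks":
ranks = [["a b a"], ["b c"], ["a"]]
out = hamar_style_mapreduce(
    ranks,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=sum,
)
print(out)  # {'a': 3, 'b': 2, 'c': 1}
```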
Application Example: GIM-V 
(Generalized Iterative Matrix-Vector multiplication*1) 
• Easy description of various graph algorithms by implementing the 
combine2, combineAll, and assign functions 
– PageRank, Random Walk with Restart, Connected Components 
• v’ = M ×G v, where 
v’_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_i,j, v_j)})) (i = 1..n) 
– combine2 is stage 1; combineAll and assign are stage 2 
• Iterative two-phase MapReduce operations 
• Straightforward implementation using Hamar 
*1: Kang, U. et al., “PEGASUS: A Peta-Scale Graph Mining System - 
Implementation and Observations”, IEEE International Conference on 
Data Mining, 2009
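A minimal sketch of the GIM-V definition above, instantiated for PageRank as on the slide; the dict-of-dicts matrix and the `pagerank` helper are illustrative stand-ins (real Hamar runs this over distributed arrays on GPUs, and the sketch assumes every page has at least one outgoing link):

```python
# v' = M xG v with user-supplied combine2 / combineAll / assign.
def gimv(M, v, combine2, combineAll, assign, iters=1):
    n = len(v)
    for _ in range(n and iters):
        new_v = []
        for i in range(n):
            # Stage 1 (combine2): pair each nonzero m_ij with v_j.
            xs = [combine2(M[i][j], v[j]) for j in M[i]]
            # Stage 2 (combineAll + assign): fold and update v'_i.
            new_v.append(assign(v[i], combineAll(i, xs)))
        v = new_v
    return v

def pagerank(adj, d=0.85, iters=30):
    """PageRank via GIM-V: combine2 = m*v, combineAll = damped sum,
    assign = replace. Assumes no dangling nodes (outdeg > 0)."""
    n = len(adj)
    outdeg = [len(a) for a in adj]
    # Column-stochastic M: M[i][j] = 1/outdeg(j) for each edge j -> i.
    M = [{} for _ in range(n)]
    for j, targets in enumerate(adj):
        for i in targets:
            M[i][j] = 1.0 / outdeg[j]
    return gimv(
        M, [1.0 / n] * n,
        combine2=lambda m, vj: m * vj,
        combineAll=lambda i, xs: (1 - d) / n + d * sum(xs),
        assign=lambda old, new: new,
        iters=iters,
    )

ranks = pagerank([[1], [2], [0]])    # 3-cycle: all pages rank equally
print([round(r, 3) for r in ranks])  # [0.333, 0.333, 0.333]
```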

Japan Lustre User Group 2014
