Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers
• JST (Japan Science and Technology Agency) CREST (Core Research for Evolutional Science and Technology) Project (Oct. 2011 – March 2017)
• 4 groups, over 60 members
  1. Fujisawa-G (Kyushu University): Large-scale Mathematical Optimization
  2. Suzumura-G (University College Dublin, Ireland): Large-scale Graph Processing
  3. Sato-G (Tokyo Institute of Technology): Hierarchical Graph Store System
  4. Wakita-G (Tokyo Institute of Technology): Graph Visualization
• Innovative Algorithms and Implementations
  • Optimization, Searching, Clustering, Network Flow, etc.
• Extreme Big Graph Data for emerging applications
  • 2^30 – 2^42 nodes and 2^40 – 2^46 edges (see the footprint sketch after this list)
  • Over 1M threads are required for real-time analysis
• Many applications on post peta-scale supercomputers
  • Analyzing massive cyber security and social networks
  • Optimizing smart grid networks
  • Health care and medical science
  • Understanding complex life systems
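As a rough sense of scale, the sketch below estimates the raw memory footprint of graphs at these sizes; the 8-byte vertex IDs and plain edge-list layout are illustrative assumptions, not figures from the project.

```python
# Back-of-envelope footprint of an edge list at the scales above,
# assuming 8-byte vertex IDs and two endpoints stored per edge (assumed layout).
bytes_per_edge = 2 * 8
for log_edges in (40, 46):
    total_bytes = bytes_per_edge * 2**log_edges
    print(f"2^{log_edges} edges -> {total_bytes / 2**40:.0f} TiB")
# 2^40 edges -> 16 TiB; 2^46 edges -> 1024 TiB (1 PiB),
# far beyond the DRAM capacity of any single compute node.
```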
The 2nd Green Graph500 list on Nov. 2013
• Measures power efficiency using the TEPS/W ratio (see the sketch below this list)
• Results on various systems, such as Huawei’s RH5885v2 w/ Tecal ES3000 PCIe SSD 800GB x 2 and 1.2TB x 2
• http://green.graph500.org
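For context, TEPS/W is simply traversal throughput divided by power draw; a minimal illustration with hypothetical numbers (not measured results from this project):

```python
# Energy efficiency in MTEPS/W = traversed edges per second / average power in watts.
def mteps_per_watt(gteps, avg_power_watts):
    return gteps * 1e3 / avg_power_watts   # convert GTEPS to MTEPS

# Hypothetical run: 1.0 GTEPS at 30 W -> ~33.3 MTEPS/W.
print(mteps_per_watt(1.0, 30.0))
```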
Tokyo Institute of Technology GraphCREST-Custom #1 is ranked No. 3 in the Big Data category of the Green Graph 500 Ranking of Supercomputers, with 35.21 MTEPS/W on Scale 31, on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014.
Congratulations from the Green Graph 500 Chair
Lessons from our Graph500 activities
• We can efficiently process large-scale data that exceeds the DRAM capacity of a compute node by utilizing commodity-based NVM devices
• Convergence of practical algorithms and software implementation techniques is very important
• Big Data basically consists of sets of sparse data; converting sparse datasets into dense representations is also key to efficient Big Data processing
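The slide does not name a specific layout for this sparse-to-dense conversion; as one hedged example, the sketch below packs an edge list into CSR so that each vertex's neighbors become a contiguous (dense) array, a layout commonly used in Graph500-style BFS kernels.

```python
import numpy as np

def edge_list_to_csr(edges, num_vertices):
    # Sort edges by source, then store destinations contiguously (col_index)
    # with per-vertex offsets (row_ptr): a dense, cache-friendly layout.
    src = np.array([s for s, _ in edges])
    dst = np.array([d for _, d in edges])
    order = np.argsort(src, kind="stable")
    col_index = dst[order]
    row_ptr = np.zeros(num_vertices + 1, dtype=np.int64)
    np.add.at(row_ptr, src + 1, 1)          # out-degree counts
    row_ptr = np.cumsum(row_ptr)            # prefix sums -> offsets
    return row_ptr, col_index

row_ptr, col_index = edge_list_to_csr([(0, 1), (0, 2), (2, 0)], 3)
# Neighbors of vertex v: col_index[row_ptr[v]:row_ptr[v + 1]]
```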
Hamar Overview
[Figure: Hamar runtime. A distributed array is partitioned into local arrays across Rank 0, Rank 1, ..., Rank n. Each rank runs Map and Reduce phases, with Shuffle steps transferring data between ranks. Local arrays are virtualized as local array objects on NVM, and data moves between Host (CPU) and Device (GPU) memory via memcpy (H2D, D2H).]
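As a rough, hedged sketch of the Map -> Shuffle -> Reduce flow the diagram depicts (plain mpi4py on CPU data; the actual Hamar runtime additionally virtualizes local arrays on NVM and moves data to and from GPUs, which is omitted here):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns one partition (local array) of the distributed array.
local_array = [(k, 1) for k in range(rank * 4, rank * 4 + 4)]

# Map: produce (destination rank, value) pairs from local elements.
mapped = [(k % size, v) for k, v in local_array]

# Shuffle: all-to-all exchange so each key group lands on its owner rank.
buckets = [[] for _ in range(size)]
for dest, v in mapped:
    buckets[dest].append(v)
received = comm.alltoall(buckets)

# Reduce: combine everything this rank received from all ranks.
local_sum = sum(v for part in received for v in part)
print(f"rank {rank}: reduced value = {local_sum}")
```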
Application Example: GIM-V (Generalized Iterative Matrix-Vector multiplication)*1
• Easy description of various graph algorithms by implementing the combine2, combineAll, and assign functions (see the PageRank sketch at the end of this slide)
  • PageRank, Random Walk with Restart, Connected Components
  – v' = M ×G v, where v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_{i,j}, v_j)})), for i = 1..n
  – Iterative two-phase MapReduce operations
Straightforward implementation using Hamar
[Figure: v' = M ×G v computed in two MapReduce stages: combine2 over m_{i,j} and v_j (stage 1), then combineAll and assign (stage 2).]
*1: Kang, U. et al., “PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations”, IEEE International Conference on Data Mining (ICDM), 2009
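A minimal, single-node, dense sketch of the GIM-V pattern instantiated for PageRank (my own illustration, not the Hamar implementation; assumes M holds the column-normalized adjacency matrix and a damping factor c):

```python
def gimv(M, v, combine2, combineAll, assign):
    # One GIM-V step: v'_i = assign(v_i, combineAll({combine2(m_ij, v_j)}))
    n = len(v)
    return [assign(v[i], combineAll([combine2(M[i][j], v[j]) for j in range(n)]))
            for i in range(n)]

def pagerank(M, n_iter=30, c=0.85):
    n = len(M)
    combine2   = lambda m_ij, v_j: m_ij * v_j            # stage 1 (Map)
    combineAll = lambda xs: sum(xs)                      # stage 2 (Reduce)
    assign     = lambda v_i, s_i: (1 - c) / n + c * s_i  # damped update
    v = [1.0 / n] * n
    for _ in range(n_iter):
        v = gimv(M, v, combine2, combineAll, assign)
    return v

# Example: 3-node cycle 0 -> 1 -> 2 -> 0, column-normalized adjacency matrix.
M = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
print(pagerank(M))   # converges to ~[1/3, 1/3, 1/3]
```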
Weak Scaling on TSUBAME2.5 [Shirahata, Sato et al., Cluster 2014]
• PageRank application
• Targets graphs that exceed GPU memory capacity (RMAT graphs)
[Figure: Weak scaling of performance (MEdges/sec) vs. number of compute nodes (SCALE 23–24 per node), comparing 1 CPU and 1 GPU (S23 per node) with 2 CPUs, 2 GPUs, and 3 GPUs (S24 per node). Peak: 2.81 GE/s on 3072 GPUs (SCALE 34); 2.10x speedup of 3 GPUs over 2 CPUs.]
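The per-device rates below are my own division of the figures quoted in the caption (assuming 3 GPUs per node, as in the legend); they are not numbers reported on the slide:

```python
total_medges_per_s = 2.81e3        # 2.81 GE/s expressed in MEdges/s
gpus = 3072
nodes = gpus // 3                  # assumed: 3 GPUs per node, as in the legend
print(total_medges_per_s / nodes)  # ~2.74 MEdges/s per node
print(total_medges_per_s / gpus)   # ~0.91 MEdges/s per GPU
```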
I/O Configurations Considering GPU Accelerators and Non-Volatile Memory [Shirahata, Sato et al., HPC141]
• Design of a prototype machine using 16 mSATA SSDs (see the aggregate sketch at the end of this section)
  • Capacity: 256 GB x 16 → 4 TB
  • Read bandwidth: 0.5 GB/s x 16 → 8 GB/s

3.2 Burst Buffer System
To solve the problems in a flat buffer system, we consider a burst buffer system [21]. A burst buffer is a storage space to bridge the gap in latency and bandwidth between node-local storage and the PFS, and is shared by a subset of compute nodes. Although additional nodes are required, a burst buffer can offer a system many advantages, including higher reliability and efficiency over a flat buffer system. A burst buffer system is more reliable for checkpointing because burst buffers are located on a smaller number of dedicated I/O nodes, so the probability of lost checkpoints is decreased. In addition, even if a large number of compute nodes fail concurrently, an application can still access the checkpoints from the burst buffer. A burst buffer system provides more efficient utilization of storage resources for the partial restart of uncoordinated checkpointing because the processes involved in the restart can exploit higher storage bandwidth. For example, if compute nodes 1 and 3 are in the same cluster and both restart from a failure, the processes can utilize all SSD bandwidth, unlike in a flat buffer system. This capability accelerates the partial restart of uncoordinated checkpoint/restart.
Table 1: Node specification
CPU: Intel Core i7-3770K (3.50 GHz x 4 cores)
Memory: Cetus DDR3-1600 (16 GB)
M/B: GIGABYTE GA-Z77X-UD5H
SSD: Crucial m4 mSATA 256 GB CT256M4SSD3 (peak read: 500 MB/s, peak write: 260 MB/s)
SATA converter: KOUTECH IO-ASS110 mSATA to 2.5" SATA Device Converter with Metal Frame
RAID card: Adaptec RAID 7805Q ASR-7805Q Single
[Photos: a single mSATA SSD, 8 integrated mSATA SSDs, RAID cards, and the prototype/test machine.]
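The aggregate capacity and read bandwidth quoted above are straightforward multiplications over the 16 SSDs in Table 1 (ignoring RAID and file-system overheads):

```python
n_ssds = 16
capacity_tb = 256 * n_ssds / 1000   # 256 GB each -> ~4.1 TB (slide rounds to 4 TB)
read_bw_gbs = 0.5 * n_ssds          # ~0.5 GB/s peak read each -> 8 GB/s aggregate
print(capacity_tb, read_bw_gbs)
```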