Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers
• JST (Japan Science and Technology Agency) CREST (Core Research for Evolutional Science and Technology) Project (Oct. 2011 – March 2017)
• 4 groups, over 60 members
  1. Fujisawa-G (Kyushu University): Large-scale Mathematical Optimization
  2. Suzumura-G (University College Dublin, Ireland): Large-scale Graph Processing
  3. Sato-G (Tokyo Institute of Technology): Hierarchical Graph Store System
  4. Wakita-G (Tokyo Institute of Technology): Graph Visualization
• Innovative Algorithms and Implementations
  • Optimization, Searching, Clustering, Network Flow, etc.
• Extreme Big Graph Data for emerging applications
  • 2^30 – 2^42 nodes and 2^40 – 2^46 edges (see the footprint sketch after this list)
  • Over 1M threads are required for real-time analysis
• Many applications on post peta-scale supercomputers
  • Analyzing massive cyber security and social networks
  • Optimizing smart grid networks
  • Health care and medical science
  • Understanding complex life systems
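As a rough sense of scale, the sketch below estimates the raw memory footprint of graphs at these sizes; the 8-byte vertex IDs and plain edge-list layout are illustrative assumptions, not figures from the project.

```python
# Back-of-envelope footprint of an edge list at the scales above,
# assuming 8-byte vertex IDs and two endpoints stored per edge (assumed layout).
bytes_per_edge = 2 * 8
for log_edges in (40, 46):
    total_bytes = bytes_per_edge * 2**log_edges
    print(f"2^{log_edges} edges -> {total_bytes / 2**40:.0f} TiB")
# 2^40 edges -> 16 TiB; 2^46 edges -> 1024 TiB (1 PiB),
# far beyond the DRAM capacity of any single compute node.
```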
The 2nd Green Graph500 list on Nov. 2013
• Measures power efficiency using the TEPS/W ratio (see the sketch below this list)
• Results on various systems, such as Huawei’s RH5885v2 w/ Tecal ES3000 PCIe SSD 800GB x 2 and 1.2TB x 2
• http://green.graph500.org
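For context, TEPS/W is simply traversal throughput divided by power draw; a minimal illustration with hypothetical numbers (not measured results from this project):

```python
# Energy efficiency in MTEPS/W = traversed edges per second / average power in watts.
def mteps_per_watt(gteps, avg_power_watts):
    return gteps * 1e3 / avg_power_watts   # convert GTEPS to MTEPS

# Hypothetical run: 1.0 GTEPS at 30 W -> ~33.3 MTEPS/W.
print(mteps_per_watt(1.0, 30.0))
```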
Tokyo Institute of Technology GraphCREST-Custom #1 is ranked No. 3 in the Big Data category of the Green Graph 500 Ranking of Supercomputers, with 35.21 MTEPS/W on Scale 31, on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014.
Congratulations from the Green Graph 500 Chair
Lessons from our Graph500 activities
• We can efficiently process large-scale data that exceeds the DRAM capacity of a compute node by utilizing commodity-based NVM devices
• Convergence of practical algorithms and software implementation techniques is very important
• Big Data basically consists of sets of sparse data; converting sparse datasets into dense representations is also key to efficient Big Data processing
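The slide does not name a specific layout for this sparse-to-dense conversion; as one hedged example, the sketch below packs an edge list into CSR so that each vertex's neighbors become a contiguous (dense) array, a layout commonly used in Graph500-style BFS kernels.

```python
import numpy as np

def edge_list_to_csr(edges, num_vertices):
    # Sort edges by source, then store destinations contiguously (col_index)
    # with per-vertex offsets (row_ptr): a dense, cache-friendly layout.
    src = np.array([s for s, _ in edges])
    dst = np.array([d for _, d in edges])
    order = np.argsort(src, kind="stable")
    col_index = dst[order]
    row_ptr = np.zeros(num_vertices + 1, dtype=np.int64)
    np.add.at(row_ptr, src + 1, 1)          # out-degree counts
    row_ptr = np.cumsum(row_ptr)            # prefix sums -> offsets
    return row_ptr, col_index

row_ptr, col_index = edge_list_to_csr([(0, 1), (0, 2), (2, 0)], 3)
# Neighbors of vertex v: col_index[row_ptr[v]:row_ptr[v + 1]]
```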
Hamar Overview
[Figure: Hamar runtime. A distributed array is partitioned into local arrays across Rank 0, Rank 1, ..., Rank n. Each rank runs Map and Reduce phases, with Shuffle steps transferring data between ranks. Local arrays are virtualized as local array objects on NVM, and data moves between Host (CPU) and Device (GPU) memory via memcpy (H2D, D2H).]
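As a rough, hedged sketch of the Map -> Shuffle -> Reduce flow the diagram depicts (plain mpi4py on CPU data; the actual Hamar runtime additionally virtualizes local arrays on NVM and moves data to and from GPUs, which is omitted here):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns one partition (local array) of the distributed array.
local_array = [(k, 1) for k in range(rank * 4, rank * 4 + 4)]

# Map: produce (destination rank, value) pairs from local elements.
mapped = [(k % size, v) for k, v in local_array]

# Shuffle: all-to-all exchange so each key group lands on its owner rank.
buckets = [[] for _ in range(size)]
for dest, v in mapped:
    buckets[dest].append(v)
received = comm.alltoall(buckets)

# Reduce: combine everything this rank received from all ranks.
local_sum = sum(v for part in received for v in part)
print(f"rank {rank}: reduced value = {local_sum}")
```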
Application Example: GIM-V (Generalized Iterative Matrix-Vector multiplication)*1
• Easy description of various graph algorithms by implementing the combine2, combineAll, and assign functions (see the PageRank sketch at the end of this slide)
  • PageRank, Random Walk with Restart, Connected Components
  – v' = M ×G v, where v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_{i,j}, v_j)})), for i = 1..n
  – Iterative two-phase MapReduce operations
Straightforward implementation using Hamar
[Figure: v' = M ×G v computed in two MapReduce stages: combine2 over m_{i,j} and v_j (stage 1), then combineAll and assign (stage 2).]
*1: Kang, U. et al., “PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations”, IEEE International Conference on Data Mining (ICDM), 2009
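A minimal, single-node, dense sketch of the GIM-V pattern instantiated for PageRank (my own illustration, not the Hamar implementation; assumes M holds the column-normalized adjacency matrix and a damping factor c):

```python
def gimv(M, v, combine2, combineAll, assign):
    # One GIM-V step: v'_i = assign(v_i, combineAll({combine2(m_ij, v_j)}))
    n = len(v)
    return [assign(v[i], combineAll([combine2(M[i][j], v[j]) for j in range(n)]))
            for i in range(n)]

def pagerank(M, n_iter=30, c=0.85):
    n = len(M)
    combine2   = lambda m_ij, v_j: m_ij * v_j            # stage 1 (Map)
    combineAll = lambda xs: sum(xs)                      # stage 2 (Reduce)
    assign     = lambda v_i, s_i: (1 - c) / n + c * s_i  # damped update
    v = [1.0 / n] * n
    for _ in range(n_iter):
        v = gimv(M, v, combine2, combineAll, assign)
    return v

# Example: 3-node cycle 0 -> 1 -> 2 -> 0, column-normalized adjacency matrix.
M = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
print(pagerank(M))   # converges to ~[1/3, 1/3, 1/3]
```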
Weak Scaling on TSUBAME2.5 [Shirahata, Sato et al., Cluster 2014]
• PageRank application
• Targets graphs that exceed GPU memory capacity (RMAT graphs)
[Figure: Weak scaling of performance (MEdges/sec) vs. number of compute nodes (SCALE 23–24 per node), comparing 1 CPU and 1 GPU (S23 per node) with 2 CPUs, 2 GPUs, and 3 GPUs (S24 per node). Peak: 2.81 GE/s on 3072 GPUs (SCALE 34); 2.10x speedup of 3 GPUs over 2 CPUs.]
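The per-device rates below are my own division of the figures quoted in the caption (assuming 3 GPUs per node, as in the legend); they are not numbers reported on the slide:

```python
total_medges_per_s = 2.81e3        # 2.81 GE/s expressed in MEdges/s
gpus = 3072
nodes = gpus // 3                  # assumed: 3 GPUs per node, as in the legend
print(total_medges_per_s / nodes)  # ~2.74 MEdges/s per node
print(total_medges_per_s / gpus)   # ~0.91 MEdges/s per GPU
```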
I/O Configurations Considering GPU Accelerators and Non-Volatile Memory [Shirahata, Sato et al., HPC141]
• Design of a prototype machine using 16 mSATA SSDs (see the aggregate sketch at the end of this section)
  • Capacity: 256 GB x 16 → 4 TB
  • Read bandwidth: 0.5 GB/s x 16 → 8 GB/s

3.2 Burst Buffer System
To solve the problems in a flat buffer system, we consider a burst buffer system [21]. A burst buffer is a storage space to bridge the gap in latency and bandwidth between node-local storage and the PFS, and is shared by a subset of compute nodes. Although additional nodes are required, a burst buffer can offer a system many advantages, including higher reliability and efficiency over a flat buffer system. A burst buffer system is more reliable for checkpointing because burst buffers are located on a smaller number of dedicated I/O nodes, so the probability of lost checkpoints is decreased. In addition, even if a large number of compute nodes fail concurrently, an application can still access the checkpoints from the burst buffer. A burst buffer system provides more efficient utilization of storage resources for the partial restart of uncoordinated checkpointing because the processes involved in the restart can exploit higher storage bandwidth. For example, if compute nodes 1 and 3 are in the same cluster and both restart from a failure, the processes can utilize all SSD bandwidth, unlike in a flat buffer system. This capability accelerates the partial restart of uncoordinated checkpoint/restart.
Table 1: Node specification
CPU: Intel Core i7-3770K (3.50 GHz x 4 cores)
Memory: Cetus DDR3-1600 (16 GB)
M/B: GIGABYTE GA-Z77X-UD5H
SSD: Crucial m4 mSATA 256 GB CT256M4SSD3 (peak read: 500 MB/s, peak write: 260 MB/s)
SATA converter: KOUTECH IO-ASS110 mSATA to 2.5" SATA Device Converter with Metal Frame
RAID card: Adaptec RAID 7805Q ASR-7805Q Single
[Photos: a single mSATA SSD, 8 integrated mSATA SSDs, RAID cards, and the prototype/test machine.]
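The aggregate capacity and read bandwidth quoted above are straightforward multiplications over the 16 SSDs in Table 1 (ignoring RAID and file-system overheads):

```python
n_ssds = 16
capacity_tb = 256 * n_ssds / 1000   # 256 GB each -> ~4.1 TB (slide rounds to 4 TB)
read_bw_gbs = 0.5 * n_ssds          # ~0.5 GB/s peak read each -> 8 GB/s aggregate
print(capacity_tb, read_bw_gbs)
```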