TSUBAME3.0 at Tokyo Tech and ABCI at AIST: AI / Big Data Infrastructure Leveraging the Latest HPC Technology
Satoshi Matsuoka
Professor, Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology
Designated Fellow, Artificial Intelligence Research Center (AIRC), AIST
Director, AIST / Tokyo Tech Real World Big Data Open Innovation Laboratory (RWBC-OIL)
NVIDIA Deep Learning Institute Day 2017 in Nagoya Presentation
2017/05/22
(TSUBAME2.0 system diagram call-outs: 32 nm / 40 nm processes; >400 GB/s memory BW, 80 Gbps network BW, ~1 kW max; >1.6 TB/s and >12 TB/s memory BW, 35 kW max; >600 TB/s memory BW, 220 Tbps network bisection BW, 1.4 MW max.)
TSUBAME2.0 entered operation on November 1, 2010: at the time, the world's smallest petaflops-class, power-efficient supercomputer.
Built on a range of basic research and newly co-developed with vendors:
• Both high performance and low power through large-scale GPU adoption
• Minimal footprint (about 200 m²) and high cost performance
• Optical network and SSD storage matched to the high performance
ACM Gordon Bell Prize 2011
Achieved 2.0 petaflops (3.4 PF on TSUBAME2.5), awarded jointly with the K computer, for an application running on TSUBAME2.0:
Special Achievements in Scalability and Time-to-Solution,
“Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer”
Comparison of TSUBAME and the K computer (1): performance ≈, cost <<
K computer (2011): 11.4 petaflops single precision, 11.4 petaflops double precision (fastest); roughly 150 billion yen(?) over 6 years (incl. electricity)
TSUBAME2.0 (2010) → TSUBAME2.5 (2013): 17.1 petaflops single precision (fastest), 5.76 petaflops double precision; roughly 5 billion yen over 6 years (incl. electricity)
(Same diagram, English version: 32 nm / 40 nm processes; >400 GB/s memory BW, 80 Gbps network BW, ~1 kW max; >1.6 TB/s and >12 TB/s memory BW, 35 kW max; 4224 GPUs, >600 TB/s memory BW, 220 Tbps network bisection BW, 1.4 MW max.)
TSUBAME2.0 Nov. 1, 2010
“The Greenest Production Supercomputer in the World”
• GPU-centric (> 4000) high performance & low power
• Small footprint (~200m2 or 2000 sq.ft), low TCO
• High bandwidth memory, optical network, SSD storage…
TSUBAME 2.0 (new development) → 2013 GPU upgrade → TSUBAME2.5, 5.7 Petaflops
Core Center of AI for Industry-Academia Co-operation
Application Domains
NLP, NLU
Text mining
Behavior
Mining & Modeling
Manufacturing
Industrial robots
Automobile
Innovative
Retailing
Health Care
Elderly Care
Deployment of AI in real businesses and society
Data-Knowledge Integration AI / Brain-Inspired AI
Ontology
Knowledge
Model of
Hippocampus
Model of
Basal ganglia
Logic & Probabilistic
Modeling
Bayesian net ・・・
・・・
Security
Network Services
Communication
Big Sciences
Bio-Medical Sciences
Material Sciences
Model of
Cerebral cortex
Technology transfer
Starting Enterprises
Start-Ups
Institutions
Companies
Technology transfer
Joint research / Common AI Platform
Common Modules
Common Data/Models
Planning
Control
Prediction
Recommend
Image Recognition
3D Object recognition
Planning/Business Team
・・・
Effective Cycles among Research and Deployment of AI
Standard Tasks
Standard Data
AI Research Framework
Planning/Business Team
2015- AI Research Center (AIRC), AIST
Now > 400 FTEs
Matsuoka : Joint
appointment as
“Designated” Fellow
since July 2017
Director:
Jun-ichi Tsujii
Industry
ITCS
Departments
Other Big Data / AI
research organizations
and proposals
JST BigData CREST
JST AI CREST
Etc.
Tsubame 3.0/2.5
Big Data /AI
resources
Industrial
Collaboration in data,
applications
Resources and Acceleration of
AI / Big Data, systems research
Basic Research
in Big Data / AI
algorithms and
methodologies
Joint
Research on
AI / Big Data
and
applications
Application Area
Natural Language
Processing
Robotics
Security
National Institute for
Advanced Industrial Science
and Technology (AIST)
Ministry of Economics
Trade and Industry (METI)
Director: Satoshi Matsuoka
Tokyo Institute of
Technology / GSIC
Joint Lab established Feb.
2017 to pursue BD/AI joint
research using large-scale
HPC BD/AI infrastructure
AIST Artificial
Intelligence
Research Center
(AIRC)
ABCI
AI Bridging Cloud
Infrastructure
The current status of AI & Big Data in Japan
We need the triad of advanced algorithms, infrastructure, and data, but we lack
the cutting-edge infrastructure dedicated to AI & Big Data (c.f. HPC)
R&D ML
Algorithms
& SW
AI&Data
Infrastructures
“Big” Data
IoT Communication,
location & other data
Petabytes of Drive
Recording Video
FA&Robots
Web access and
merchandise
Use of Massive Scale Data now
Wasted
Seeking Innovative
Application of AI &
Data
AI Venture Startups
Big Companies AI/BD
R&D (also Science)
In HPC, Cloud continues to
be insufficient for cutting
edge research =>
dedicated SCs dominate &
racing to Exascale
Massive Rise in Computing
Requirements (1 AI-PF/person?)
Massive “Big” Data in
Training
Riken
-AIP
Joint
RWBC
Open Innov.
Lab (OIL)
(Director: Matsuoka)
AIST-AIRC
NICT-
UCRI
Over $1B Govt.
AI investment
over 10 years
AI/BD Centers &
Labs in National
Labs & Universities
Two computational characteristics of large-scale big data / AI processing
As big data / AI:
Graph analysis: social networks
Sort / hash: DB and log-data analysis
Symbolic processing: classical AI
As HPC:
Integer and sparse-matrix computation
Data-movement-centric, large memory
Sparse, random data
As big data / AI:
Dense matrix operations: deep learning
Inference, training, generation
As HPC:
Dense matrix computation, but mostly in reduced precision
Dense, relatively well-ordered networks and data
Speed-up and scale-up via supercomputer acceleration, but the machine must fit both.
Although the two computational profiles differ completely, simulation on supercomputers is essentially one or the other as well.
(Big data) BYTES capability, above all data bandwidth and capacity, is crucial for accelerating BD/AI, yet much of it is missing from recent (Linpack-)FLOPS-biased supercomputers.
• System-wide bandwidth (BYTES/s) and capacity (BYTES) are required for HPC-BD/AI integration (memory bandwidth alone is not enough)
• Obvious for the sparse-data-centric applications on the left
• Also needed for the DNNs on the right, when strong-scaling across many nodes for speed or when analyzing ever-larger 3D data such as CT or geological volumes
(Source: https://www.spineuniverse.com/image-
library/anterior-3d-ct-scan-progressive-kyphoscoliosis)
(Source: http://www.dgi.com/images/cvmain_overview/CV4DOverview_Model_001.jpg)
Time for one training iteration of CaffeNet on TSUBAME-KFC/DL (mini-batch size 256):
most of the time is spent in communication and in memory operations.
(Charts: execution time [s] per iteration vs. number of nodes (1-16, plus 1A-16A variants), broken down into ForwardBackward, D2H, Communication, H2D, and Other.)
Almost all of the time is communication; computation on the GPU is only 3.9%.
For a supercomputer that unifies BD/AI with HPC, architectures and system software that secure memory and network capacity and performance are therefore critically important.
R&D requirements for BYTES-centric supercomputers in the big data / AI era
(activities at Tokyo Tech GSIC, AIST AIRC, and the joint RWBC-OIL lab)
Bytes-Centric Supercomputer
• R&D of BYTES-centric supercomputers, moving beyond double-precision FLOPS growth (TSUBAME3)
• Peta- to exabytes of data moving freely within the machine
• Deep memory hierarchies of "fat nodes" using the latest non-volatile memory, coupled by terabit-class ultra-fast inter-node networks
• Integer / sparse-matrix algorithms and software on BYTES-centric supercomputers
• Scaling and acceleration of deep-learning algorithms on BYTES-centric supercomputers
• Integration of simulation science with accelerated data science
• Research organizations for HPC-accelerated BD/AI research (e.g., the Tokyo Tech / AIST OIL)
• Industry-government-academia collaboration and technology transfer as Japan's AI infrastructure (TSUBAME3 -> ABCI)
• Develop a world-leading HPC-AI cloud infrastructure
TSUBAME-KFC/DL: an ultra-green supercomputer research facility
(MEXT budget request, 2011-2015, about 200 million yen), later upgraded to a machine-learning supercomputer
Warm cooling loop: coolant oil 35-45°C ⇒ water 25-35°C (TSUBAME2 uses 7-17°C water)
Cooling tower: water 25-35°C ⇒ released to the ambient air
Integrates supercomputing technologies for oil-immersion cooling, warm ambient-air cooling, high-density packaging, and power control; the prototype for TSUBAME3.0
Container-type research facility: a 20-foot container (16 m²), unmanned automatic operation, future energy recovery planned
High-density packaging with oil immersion: 210 TFlops (double precision), 630 TFlops (single precision) in a single rack
Ranked #1 in the world on the Green500 (a first for Japan) in November 2013 and June 2014
2015 upgrade: 500 TFlops (double precision), 1.5 PFlops (single precision) for machine learning; the world's highest performance density (still one rack; 7 racks would match the K computer), with 60 kW of heat per rack
TSUBAME3.0 lineage:
2006: TSUBAME1.0, 80 teraflops, #1 in Asia, #7 in the world
2010: TSUBAME2.0, 2.4 petaflops, #4 in the world, greenest production supercomputer
2013: TSUBAME2.5 upgrade (supplementary budget), 5.7 petaflops
2013: TSUBAME-KFC, world's greenest supercomputer
2017: TSUBAME3.0, 12.1 petaflops, HPCI leading supercomputer, integrating HPC/BD/AI
Large-scale academic and industrial applications; 2011 ACM Gordon Bell Prize
Toward the 2017 leading supercomputer TSUBAME3.0: "a supercomputer taking on cutting-edge technology challenges"
1. HPCI leading supercomputer: combined with TSUBAME2, twice the performance of the K computer: 15-20 petaflops, 4-5 petabytes/s of memory bandwidth, a petabit-class optical network
2. BD/AI data-centric architecture: multi-terabyte/s I/O from large-scale silicon storage; big-data middleware and applications including machine learning / AI
3. Green supercomputer: power efficiency above 10 gigaflops/W, PUE < 1.05
4. Cloud supercomputer: combining a range of cloud services with high performance
5. "Visible" supercomputer: the state of an extremely complex machine made observable
TSUBAME 3.0 system overview
A BYTES-centric, scalable architecture across all GPUs, all nodes, and all levels of the memory hierarchy
Full-bisection-bandwidth Intel® Omni-Path® optical network: 432 terabits/s bidirectional, roughly twice the average traffic of the entire Internet
DDN storage system (15.9 PB parallel FS + 45 TB home)
540 SGI ICE® XA compute nodes: 2 Intel® Xeon® CPUs + 4 NVIDIA Pascal GPUs, 256 GB of memory, and a 2 TB NVMe Intel® SSD each
47.2 AI-Petaflops, 12.1 Petaflops
Full production operation from August 2017
NVIDIA TESLA
P100 “PASCAL”
GPU
Fast GPU-to-GPU communication via NVLink
Hardware-supported memory sharing with the CPU
High compute performance for HPC & AI
HBM2 2.5D stacked memory
(Charts comparing K40, M40, and P100: teraflops in FP32/FP16, bidirectional NVLink bandwidth (GB/s), memory bandwidth (GB/s), and addressable memory (GB).)
21 teraflops of FP16 deep-learning performance; 5.3 teraflops of FP64 for HPC
High power efficiency from the 16 nm FinFET process
5x GPU-GPU communication bandwidth compared with PCIe 3
3x memory bandwidth for large HPC & data workloads, removing the GPU memory limitation
TSUBAME3 deploys 2,160 of these GPUs
Five technology innovations
TSUBAME3: A Massively BYTES Centric Architecture for Converged BD/AI and HPC
Intra-node GPU via NVLink
20~40GB/s
Intra-node GPU via NVLink
20~40GB/s
Inter-node GPU via OmniPath
12.5GB/s fully switched
HBM2
64GB
2.5TB/s
DDR4
256GB
150GB/s
Intel Optane
1.5TB 12GB/s
(planned)
NVMe Flash
2TB 3GB/s
16GB/s PCIe
Fully Switched
16GB/s PCIe
Fully Switched
~4 Terabytes/node Hierarchical Memory for Big Data / AI (c.f. K computer 16GB/node)
 Over 2 Petabytes in TSUBAME3, Can be moved at 54 Terabytes/s or 1.7 Zettabytes / year
Terabit class network/node
800Gbps (400+400)
full bisection
Any “Big” Data in the
system can be moved
to anywhere via
RDMA speeds
minimum
12.5GBytes/s
also with Stream
Processing
Scalable to all 2160
GPUs, not just 8
TSUBAME3.0 Co-Designed SGI ICE-XA Blade (new)
- No exterior cable mess (power, NW, water)
- Plan to become a future HPE product
(Node block diagram, shown twice: CPU 0 and CPU 1, each with 4 DIMMs, linked by QPI; GPU 0-3 linked by NVLink; two PLX PCIe switches; four Omni-Path (OPA) HFIs; an NVMe SSD behind the PCH via DMI; x16 PCIe links throughout, x4 PCIe to the SSD.)
TSUBAME3.0 Compute Node SGI ICE-XA, a New GPU Compute Blade Co-
Designed by SGI and Tokyo Tech GSIC
x9
SGI ICE XA Infrastructure
Intel Omni-Path spine switches, full-bisection fat-tree network
432 Terabit/s bidirectional for HPC and DNN
Compute Blade Compute Blade
x60 sets
(540 nodes)
X60 Pairs
(Total 120 Switches)
18 Ports
18 Ports
18 Ports
18 Ports
Ultra high performance & bandwidth “Fat Node”
• High performance: 4 SXM2 (NVLink) NVIDIA Pascal P100 GPUs + 2 Intel Xeon CPUs = 84 AI-TFlops
• High network bandwidth: Intel Omni-Path 100 Gbps x 4 = 400 Gbps (100 Gbps per GPU)
• High I/O bandwidth: 2 TB Intel NVMe SSD
• > 1 PB and 1.5-2 TB/s total across the system
• Future Optane 3D XPoint memory: a petabyte or more directly accessible
• Ultra-high-density, hot-water-cooled blades
• 36 blades / rack = 144 GPUs + 72 CPUs, 50-60 kW, roughly 10x the thermal density of a typical IDC
(Same node block diagram, with two Optane NVM devices attached via x4 PCIe.)
400Gbps / node for
HPC and DNN
Terabytes
Memory
Japanese Open Supercomputing Sites, Aug. 2017 (pink = HPCI sites); peak double-precision Rpeak (PF) and Nov. 2016 Top500 rank:
1. U-Tokyo / Tsukuba U (JCAHPC): Oakforest-PACS, PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path; 24.9 PF; Top500 #6
2. Tokyo Institute of Technology GSIC: TSUBAME 3.0, HPE/SGI ICE-XA custom, NVIDIA Pascal P100 + Intel Xeon, Intel Omni-Path; 12.1 PF; NA
3. Riken AICS: K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect, Fujitsu; 11.3 PF; Top500 #7
4. Tokyo Institute of Technology GSIC: TSUBAME 2.5, Cluster Platform SL390s G7, Xeon X5670 6C 2.93 GHz, InfiniBand QDR, NVIDIA K20x, NEC/HPE; 5.71 PF; Top500 #40
5. Kyoto University: Camphor 2, Cray XC40, Intel Xeon Phi 68C 1.4 GHz; 5.48 PF; Top500 #33
6. Japan Aerospace eXploration Agency (JAXA): SORA-MA, Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 1.98 GHz, Tofu interconnect 2; 3.48 PF; Top500 #30
7. Information Technology Center, Nagoya U: Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 2.2 GHz, Tofu interconnect 2; 3.24 PF; Top500 #35
8. National Institute for Fusion Science (NIFS): Plasma Simulator, Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 1.98 GHz, Tofu interconnect 2; 2.62 PF; Top500 #48
9. Japan Atomic Energy Agency (JAEA): SGI ICE X, Xeon E5-2680v3 12C 2.5 GHz, InfiniBand FDR; 2.41 PF; Top500 #54
10. AIST AI Research Center (AIRC): AAIC (AIST AI Cloud), NEC/SMC cluster, NVIDIA Pascal P100 + Intel Xeon, InfiniBand EDR; 2.2 PF; NA
Floating-point precision for big data / AI
(Chart: comparison of sites by compute performance at half precision or better (AI-FLOPS, in PFLOPS): Riken (K computer), U-Tokyo (Oakforest-PACS (JCAHPC), Reedbush (U&H)), Tokyo Tech (TSUBAME3.0 + T2.5).)
Double precision (64-bit), single precision (32-bit), half precision (16-bit), and integer operations span numerical analysis and simulation, computer graphics and gaming, and big data / machine learning / AI.
65.8 Petaflops: combining TSUBAME3 + 2.5 + KFC, the machine-learning / AI performance is the highest in Japan across all supercomputers and clouds (~6,700 GPUs + ~4,000 CPUs).
TSUBAME3.0 Datacenter
15 SGI ICE-XA Racks
2 Network Racks
3 DDN Storage Racks
20 Total Racks
Compute racks cooled with
32°C warm water,
year-round ambient cooling
Av. PUE = 1.033
TSUBAME3.0 cooling system diagram
High-efficiency warm-water cooling with 32°C water chilled only by the ambient air
(Diagram: the compute nodes (SGI ICE XA) sit on a 32°C supply / 40°C return water loop coupled through a heat exchanger to an ambient-air cooling tower; I/O, the file system, and the remaining 10-15% of the system heat plus environmental latent heat are handled by 22°C air cooling backed by a 14°C supply / 21°C return chilled-water loop, an auxiliary 200 kW cooling circuit, and a chiller for peripherals.)
TSUBAME3.0: world-class cooling efficiency
Estimated TSUBAME 3.0 PUE (assuming 900 kW of compute-node power), computed from 2013-2015 weather data:
  Jan-Apr, Nov, Dec: cooling 28.83 kW, PUE 1.032
  May, Oct: cooling 29.21 kW, PUE 1.032
  Jun, Jul, Sep: cooling 30.06 kW, PUE 1.033
  Aug: cooling 32.00 kW, PUE 1.036
  Annual average: cooling 29.465 kW, PUE 1.033
PUE expresses the cooling overhead; the closer to 1.0, the better (see the sketch below).
PUE = (compute-node power + power of all facilities needed to cool the compute nodes) / (compute-node power)
• Typical data center: PUE = 2-3 (cooling consumes more power than the machines themselves)
• State-of-the-art air-cooled data centers, TSUBAME1: PUE = 1.4-1.6
• TSUBAME2.0: PUE = 1.28
• TSUBAME-KFC: PUE = 1.09
The annual average PUE of the TSUBAME3.0 compute nodes is 1.033: world class.
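The definition above can be checked directly; a minimal sketch in Python, assuming the 900 kW compute-node power used for the estimate and the monthly cooling figures from the table:

```python
# Minimal sketch: PUE = (node power + cooling power) / node power.
# Assumes the 900 kW compute-node power stated for the TSUBAME3.0 estimate
# and the monthly average cooling-facility power figures (kW) above.
node_power_kw = 900.0
monthly_cooling_kw = [28.83, 28.83, 28.83, 28.83, 29.21, 30.06,
                      30.06, 32.00, 30.06, 29.21, 28.83, 28.83]

def pue(node_kw, cooling_kw):
    """Cooling overhead relative to useful (compute-node) power."""
    return (node_kw + cooling_kw) / node_kw

for month, cooling in enumerate(monthly_cooling_kw, start=1):
    print(f"month {month:2d}: PUE = {pue(node_power_kw, cooling):.3f}")
print(f"annual average: PUE = {pue(node_power_kw, sum(monthly_cooling_kw) / 12):.3f}")
```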
Cloud features
Sharing compute nodes among multiple jobs
(Diagram: a TSUBAME2.5 node with CPU0/CPU1, MEM0/MEM1, NIC0/NIC1, and GPU0-2.)
TSUBAME2.5 used fixed partitioning: a G queue (4 cores + 3 GPUs) and a V queue (8 cores).
Job type and outcome:
Jobs using many GPUs: G queue
CPU-only jobs: V queue
Jobs using only 1-2 GPUs: GPUs left idle
Jobs with small memory footprints: memory left idle
TSUBAME3.0 fine-grained resource allocation
Docker partitions a compute node's resources (CPU cores, GPUs, NICs, memory).
In cooperation with the job scheduler, each job is allocated exactly the resources it needs, so nothing in the node is wasted.
(Diagram: a TSUBAME3.0 node with CPU0/CPU1, MEM0/MEM1, NIC0-3, and GPU0-3; the figure shows fewer CPU cores than the real node.)
TSUBAME3.0 fine-grained resource allocation (example)
(Diagram: the same node, now shared by five jobs.)
Job and allocated resources:
1: 2 CPU cores, NIC0, GPU1, 32 GB memory
2: 8 CPU cores, 64 GB memory
3: 4 CPU cores, GPU0, 16 GB memory
4: 8 CPU cores, 80 GB memory
5: 4 CPU cores, NIC2 & 3, GPU2 & 3, 48 GB memory
(The figure shows fewer CPU cores than the real node. A toy packing sketch follows below.)
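The toy sketch below only illustrates the packing idea behind the allocation table: check each job request against the node's free CPU cores, GPUs, and memory and hand out disjoint slices. The real TSUBAME3.0 mechanism is Docker containers driven by the job scheduler; the data structures here are hypothetical.

```python
# Hypothetical sketch of fine-grained node sharing; not the TSUBAME3.0 scheduler.
node = {"cpu_cores": 28, "gpus": [0, 1, 2, 3], "mem_gb": 256}

# (name, cores, #gpus, mem_gb), roughly following the allocation table above
jobs = [("job1", 2, 1, 32), ("job2", 8, 0, 64), ("job3", 4, 1, 16),
        ("job4", 8, 0, 80), ("job5", 4, 2, 48)]

free_cores, free_gpus, free_mem = node["cpu_cores"], list(node["gpus"]), node["mem_gb"]
for name, cores, ngpus, mem in jobs:
    if cores <= free_cores and ngpus <= len(free_gpus) and mem <= free_mem:
        granted = [free_gpus.pop(0) for _ in range(ngpus)]   # take the lowest free GPUs
        free_cores -= cores
        free_mem -= mem
        print(f"{name}: {cores} cores, GPUs {granted}, {mem} GB")
    else:
        print(f"{name}: deferred, not enough free resources on this node")
```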
JST-CREST “Extreme Big Data” Project (2013-2018)
Supercomputers
Compute&Batch-Oriented
More fragile
Cloud IDC
Very low BW & Efficiency
Highly available, resilient
Convergent Architecture (Phases 1~4)
Large Capacity NVM, High-Bisection NW
(Convergent-architecture node diagram: a high-powered main CPU plus low-power CPUs on a TSV interposer / PCB, each with DRAM and NVM/Flash stacks; 2 Tbps HBM with 4-6 HBM channels, 1.5 TB/s DRAM & NVM bandwidth, 30 PB/s I/O bandwidth possible, i.e. 1 yottabyte / year.)
EBD System Software
incl. EBD Object System
(Embedded poster excerpt on DNA sequence analysis: Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (Tokyo Tech), “A Multi-GPU Read Alignment Algorithm…”.)
Large Scale
Metagenomics
Massive Sensors and
Data Assimilation in
Weather Prediction
Ultra Large Scale
Graphs and Social
Infrastructures
Exascale Big Data HPC
Co-Design
Integrating HPC and BD/AI: from FLOPS-centric to BYTES-centric
Graph Store / EBD Bag (Co-Design)
(Figure: data mapped over Japan, ~1000 km scale.)
EBD KVS: distributed key-value stores over a Cartesian plane (Co-Design)
Given a top-class
supercomputer,
how fast can we
accelerate next
generation big
data c.f.
conventional
Clouds?
Issues regarding
Architecture,
algorithms, system
software in co-design
Performance Model?
Accelerators?
• World-class technology for building large-scale supercomputers, power-efficient computing such as TSUBAME-KFC (Kepler Fluid Cooling), and extensive operational experience
• Technology for building fast deep-learning platforms, large-scale simulation, modeling based on statistical physics, and bioinformatics analysis
• System-integration technology that connects data-processing machines for machine learning with high-performance computers
• Analysis of large-scale environmental measurement data, service design, probabilistic modeling, multi-dimensional data analysis and visualization
Real-world big data describing the behavior of individuals and of society
Provide the resulting big-data platform broadly to industry and create value from the real-world big data it holds
Biometric data and personal-behavior data; big data held by companies and municipalities; group-behavior data
Healthcare and pharma; electronics makers, retail, transport; e-commerce, telecom carriers, transport
Open data; regional-activity data; Regional Economy and Society Analyzing System (RESAS); elderly data; JAGES elderly-health survey samples; …
Used to improve a company's own services, to create new services, or sold to other companies:
banking / securities / insurance, real estate, manufacturing, healthcare, e-commerce providers, think tanks
AIST / Tokyo Tech OIL: building a platform for exploiting real-world big data
1. Establish an open platform for big-data processing
2. Develop data-processing technology that makes use of big data
Big-data processing systems: TSUBAME, GPGPU machines, next-generation machines, and the data-processing environment
Data processing: probabilistic modeling, data mining, data visualization, simulation
AIST leads the collaboration with industry and the commercialization of the developed technology
[AIST's strength] development of big-data application software; [Tokyo Tech's strength] hardware construction technology
AIST / Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory
Research and experimental data, and academic data, from universities, research institutes, and medical institutions
<Purpose of the OIL> form a hub for collaboration with companies; R&D of hardware and software for advanced big-data analytics; bridge the results to companies and others
Sparse BYTES: The Graph500 – 2015~2016 – world #1 x 4
K Computer #1 Tokyo Tech[Matsuoka EBD CREST] Univ.
Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu
List Rank GTEPS Implementation
November 2013 4 5524.12 Top-down only
June 2014 1 17977.05 Efficient hybrid
November 2014 2 19585.2 Efficient hybrid
June, Nov 2015
June Nov 2016
1 38621.4
Hybrid + Node
Compression
BYTES-rich
machine + superior
BYTES algorithm
88,000 nodes,
660,000 CPU Cores
1.3 Petabyte mem
20GB/s Tofu NW
≫
LLNL-IBM Sequoia
1.6 million CPUs
1.6 Petabyte mem
(Chart: elapsed time (ms) per BFS on 64 nodes (Scale 30) vs. 65,536 nodes (Scale 40), split into computation and communication; 73% of total execution time is spent waiting on communication.)
TaihuLight
10 million CPUs
1.3 Petabyte mem
Effective x13
performance c.f.
Linpack
#1 38621.4 GTEPS
(#7 10.51PF Top500)
#2 23755.7 GTEPS
(#1 93.01PF Top500)
#3 23751 GTEPS
(#4 17.17PF Top500)
BYTES, not FLOPS!
K-computer No.1 on Graph500: 4th Consecutive Time
• What is the Graph500 benchmark?
• A supercomputer benchmark for data-intensive applications.
• It ranks supercomputers by the performance of breadth-first search (BFS) on very large
graph data. A minimal reference BFS is sketched below.
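For reference, a minimal level-synchronous BFS in Python: Graph500 times essentially this traversal, reported in TEPS (traversed edges per second), on synthetic graphs with up to trillions of edges. This is an illustration of the kernel only, not the benchmark code.

```python
from collections import deque

def bfs(adj, source):
    """Breadth-first search over an adjacency list; returns the BFS parent tree,
    which is the output that Graph500 validates."""
    parent = {source: source}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                frontier.append(w)
    return parent

# Toy graph; the benchmark uses Kronecker graphs of enormous scale.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs(adj, 0))
```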
(Chart: Graph500 performance (GTEPS) from June 2012 to June 2016 for the K computer (Japan), Sequoia (U.S.A.), and Sunway TaihuLight (China); the K computer holds the No. 1 position.)
This is achieved by a combination
of high machine performance and
our software optimization.
• Efficient Sparse Matrix Representation with
Bitmap
• Vertex Reordering for Bitmap Optimization
• Optimizing Inter-Node Communications
• Load Balancing
etc.
• Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka, "Efficient Breadth-First Search on
Massively Parallel and Distributed Memory Machines", in proceedings of 2016 IEEE International Conference on Big Data (IEEE
BigData 2016), Washington D.C., Dec. 5-8, 2016 (to appear)
(Diagram, image credit share.sandia.gov: a dynamic graph application streams edges into a large-scale dynamic graph data store that is mmap-ed onto NVRAM on each compute node.)
Distributed Large-Scale Dynamic Graph Data Store (work with LLNL, [SC16 etc.])
Node Level Dynamic Graph Data Store
Extend for multi-processes using an async
MPI communication framework
Follows an adjacency-list format and leverages
open-address hashing to construct its tables
(Chart, multi-node experiment: inserted billion edges/sec vs. number of nodes (24 processes per node), reaching about 2 billion insertions/s.)
STINGER
• A state-of-the-art dynamic graph processing
framework developed at Georgia Tech
Baseline model
• A naïve implementation using Boost library (C++) and
the MPI communication framework
Based on the K computer results, adapting to (1) deep
memory hierarchy, (2) rapid dynamic graph changes
K. Iwabuchi, S. Sallinen, R. Pearce, B. V. Essen, M. Gokhale, and S. Matsuoka, Towards a distributed large-scale dynamic graph data store. In 2016
IEEE Interna- tional Parallel and Distributed Processing Symposium Workshops (IPDPSW)
(Chart, c.f. STINGER (single-node, in-memory): speed-up at 6/12/24-way parallelism; DegAwareRHH reaches 212x over the baseline.)
Dynamic graph construction (in-memory & NVM)
The K computer has large memory but is very expensive and DRAM-only; we instead develop algorithms and software that exploit a large hierarchical memory, yielding a dynamic graph store with the world's top graph-update performance and scalability.
Large-scale Graph Colouring (vertex colouring)
• Colour each vertex with the minimal number of colours so that no two adjacent vertices share a colour (a greedy static baseline is sketched below)
• Compare our dynamic graph-colouring algorithm on DegAwareRHH against:
1. two static algorithms, including GraphLab
2. another graph-store implementation running the same dynamic algorithm (Dynamic-MAP)
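As a point of reference for the problem itself (not the DegAwareRHH algorithm), a minimal static, sequential greedy colouring; the work above does this incrementally on a distributed, dynamic graph store.

```python
def greedy_colouring(adj):
    """Give each vertex the smallest colour not used by an already-coloured neighbour."""
    colour = {}
    for v in adj:                                  # visiting order affects the colour count
        used = {colour[w] for w in adj[v] if w in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [0]}    # a triangle plus one pendant vertex
print(greedy_colouring(adj))                       # {0: 0, 1: 1, 2: 2, 3: 1}
```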
(Chart: static colouring vs. our DegAwareRHH vs. the baseline.)
Scott Sallinen, Keita Iwabuchi, Roger Pearce, Maya Gokhale, Matei Ripeanu, “Graph Coloring as a Challenge Problem for Dynamic Graph Processing on Distributed Systems”, SC’16
Incremental Graph Community Detection
• Background
• Community detection for large-scale, time-evolving, dynamic graphs has been one of the important research problems in graph computing.
• Recomputing communities over the entire graph from scratch every time is wasteful.
• Proposal
• An incremental community detection algorithm based on core
procedures in a state-of-the-art community detection algorithm
named DEMON.
• Ego Minus Ego, Label Propagation and Merge
Hiroki Kanezashi and Toyotaro Suzumura, An Incremental Local-First Community Detection
Method for Dynamic Graphs, Third International Workshop on High Performance Big Graph
Data Management, Analysis, and Mining (BigGraphs 2016), to appear
101.0x
faster
101.5x
faster
69.2x
faster
GPU-based Distributed Sorting
[Shamoto, IEEE BigData 2014, IEEE Trans. Big Data 2015]
• Sorting: Kernel algorithm for various EBD processing
• Fast sorting methods
– Distributed Sorting: Sorting for distributed system
• Splitter-based parallel sort
• Radix sort
• Merge sort
– Sorting on heterogeneous architectures
• Many sorting algorithms are accelerated by many cores and high memory bandwidth.
• Sorting for large-scale heterogeneous systems remains an open problem
• We develop and evaluate a bandwidth- and latency-reducing GPU-based HykSort on
TSUBAME2.5 via latency hiding (the splitter-based idea is sketched below)
– Now preparing to release the sorting library
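As mentioned above, a sketch of the splitter-based idea behind HykSort, reduced to a single process: sample the keys, pick splitters, bucket every key, and sort each bucket. In the real library each bucket lives on a different MPI rank and the local sorts run on GPUs; this only shows the algorithmic skeleton.

```python
import random

def splitter_sort(keys, n_buckets=4, oversample=8):
    """Sample-splitter sort: buckets cover disjoint, ordered key ranges, so
    concatenating the sorted buckets yields a globally sorted sequence."""
    sample = sorted(random.sample(keys, min(len(keys), n_buckets * oversample)))
    splitters = [sample[i * len(sample) // n_buckets] for i in range(1, n_buckets)]
    buckets = [[] for _ in range(n_buckets)]
    for k in keys:
        buckets[sum(k >= s for s in splitters)].append(k)   # first bucket k fits in
    return [k for bucket in buckets for k in sorted(bucket)]

keys = [random.randrange(10**6) for _ in range(10000)]
assert splitter_sort(keys) == sorted(keys)
print("sorted", len(keys), "keys with 4 buckets")
```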
EBD Algorithm Kernels
(Chart: weak-scaling sort rate in billions of keys/second vs. number of processes (2 per node, up to ~2000) for HykSort with 1 thread, 6 threads, and GPU + 6 threads; annotations x1.4, x3.61, x389; 0.25 TB/s.)
Performance prediction:
x2.2 speed-up over the CPU-based implementation when the PCIe bandwidth increases to 50 GB/s
8.8% reduction of overall runtime when the accelerators are 4 times faster than a K20x
• PCIe_#: interconnect bandwidth of # GB/s between CPU and GPU
• Weak-scaling performance (Grand Challenge run on TSUBAME2.5)
– 1 to 1024 nodes (2 to 2048 GPUs)
– 2 processes per node
– Each node holds 2 GB of 64-bit integer keys
• C.f. Yahoo/Hadoop TeraSort: 0.02 TB/s
– including I/O
GPU implementation of splitter-based sorting (HykSort)
Xtr2sort: Out-of-core Sorting Acceleration
using GPU and Flash NVM [IEEE BigData2016]
• Sample-sort-based Out-of-core Sorting Approach for Deep Memory
Hierarchy Systems w/GPU and Flash NVM
– I/O chunking to fit device memory capacity of GPU
– Pipeline-based Latency hiding to overlap data transfers between NVM, CPU, and
GPU using asynchronous data transfers,
e.g., cudaMemCpyAsync(), libaio
(Pipeline diagram: chunks i through i+6 each pass through RD → R2H → H2D → EX → D2H → H2W → WR stages, overlapped in time; configurations compared are GPU only, GPU + CPU + NVM, and CPU + NVM.)
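A minimal sketch of the chunked, overlapped pattern shown in the pipeline above, using a background thread to read the next chunk while the current one is processed. Plain Python threads stand in for cudaMemcpyAsync()/libaio, and the read/process functions are placeholders.

```python
import concurrent.futures

def read_chunk(i):
    """Placeholder for an asynchronous chunk read from NVM."""
    return list(range(i * 4, i * 4 + 4))

def process_chunk(data):
    """Placeholder for the GPU-side sort/compute stage."""
    return sorted(data, reverse=True)

def pipelined(n_chunks):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_chunk, 0)               # prefetch chunk 0
        for i in range(n_chunks):
            data = pending.result()                      # wait for chunk i
            if i + 1 < n_chunks:
                pending = io.submit(read_chunk, i + 1)   # overlap the read of chunk i+1
            results.append(process_chunk(data))          # compute while i+1 is being read
    return results

print(pipelined(4))
```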
How should we combine ever-deeper memory layers for future
HPC / big data workloads, targeting the post-Moore era?
x4.39 speed-up
A BYTES-centric HPC algorithm: combining the bandwidth of fast GPU sorting with the capacity of non-volatile memory.
Hierarchical, UseR-level and ON-demand File system(HuronFS)
(IEEE ICPADS 2016) w/LLNL
• HuronFS: dedicated dynamic instances to provide “burst buffer” for caching data
• I/O requests from Compute Nodes are forwarded to HuronFS
• The whole system consists of several SHFS (Sub HuronFS)
• Workloads are distributed among the SHFS instances using a hash of the file path (sketched below)
• Each SHFS consists of a Master and several IOnodes
• Masters: controlling all IOnodes in the same SHFS and handling all I/O requests
• IOnodes: storing actual data and transferring data with Compute Nodes
• Supporting TCP/IP, Infiniband (CCI framework)
• Supporting Fuse, LD_PRELOAD
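A minimal sketch of the static path-hashing step described above: the hash of the file path picks the SHFS that owns the file. The function names here are illustrative, not the HuronFS API.

```python
import hashlib

def shfs_for_path(path, n_shfs):
    """Statically map a file path to one of n_shfs sub-filesystems by hashing the path."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shfs

for p in ["/scratch/run1/out.bin", "/scratch/run2/out.bin", "/home/user/log.txt"]:
    print(p, "-> SHFS", shfs_for_path(p, n_shfs=4))
```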
(Architecture diagram: compute nodes 1..N issue I/O to HuronFS; a static hash of the file path selects one of the SHFS instances (SHFS 0 … SHFS M-1), each consisting of a master and several IOnodes in front of the parallel file system.)
HuronFS basic I/O performance
(Charts: write/read throughput (MiB/s) from a single client and from a single IOnode over IPoIB vs. CCI; latency (µs) broken down into FUSE overhead, processing, epoll overhead, and communication; aggregate throughput over 1, 2, and 4 nodes with 8 processes per node.)
Test environment: InfiniBand 4X FDR 56 Gb/s (Mellanox); CPU: Intel Xeon E5-2650 v3 @ 2.30 GHz; memory: 251 GB.
Plans
• Continuing researching on auto buffer allocation
• Utilizing computation power on IOnodes
• Data preprocessing
• Format conversion
(Diagram: job n and job n+1 exchange data through the IOnodes; data preprocessing and format conversion run in memory on the IOnodes instead of shipping the data across the network twice.)
• Background
Snore sound (SnS) data carry very important information for the diagnosis and
evaluation of primary snoring and obstructive sleep apnea (OSA). With the
increasing volume of SnS data collected from subjects, handling such a large
amount of data is a major challenge. In this study, we use the GPU to process a
large amount of SnS data collected from two hospitals in China and Germany,
accelerating the feature extraction of the biomedical signal.
GPU-Based Fast Signal Processing for Large Amounts of Snore Sound Data
Snore sound data: 57 subjects (China + Germany); total time 187.75 hours; data size 31.10 GB; format WAV; sampling rate 16 kHz, mono.
* Jian Guo, Kun Qian, Huijie Xu, Christoph Janott, Bjorn Schuller, Satoshi Matsuoka, “GPU-Based Fast Signal Processing for Large Amounts of Snore Sound Data”, In proceedings of 5th IEEE Global Conference on
Consumer Electronics (GCCE 2016), October 11-14, 2016.
• Acoustic features of SnS data
We extract 11 acoustic features from a large amount of SnS data; these can be
visualized to help doctors and specialists diagnose, study, and treat the diseases efficiently.
(Chart: results of the GPU- and CPU-based systems for processing SnS data.)
• Result
We take 1 CPU (Python 2.7 with numpy 1.10.4 and scipy 0.17) processing one subject's data
as the baseline. The GPU-based system is almost 4.6x faster than the CPU implementation.
However, the speed-up decreases as the data size grows; we attribute this to data transfers
that are not hidden behind other computation, as they would need to be in a real-world application.
Open Source Release of EBD System Software
(install on T3/Amazon/ABCI)
• mrCUDA
• rCUDA extension enabling remote-to-
local GPU migration
• https://github.com/EBD-
CREST/mrCUDA
• GPL 3.0
• Co-Funded by NVIDIA
• Huron FS (w/LLNL)
• I/O Burst Buffer for Inter Cloud
Environment
• https://github.com/EBD-CREST/cbb
• Apache License 2.0
• Co-funded by Amazon
• ScaleGraph Python
• Python Extension for ScaleGraph X10-
based Distributed Graph Library
• https://github.com/EBD-
CREST/scalegraphpython
• Eclipse Public License v1.0
• GPUSort
• GPU-based Large-scale Sort
• https://github.com/EBD-
CREST/gpusort
• MIT License
• Others, including dynamic graph
store
HPC and BD/AI Convergence Example [ Yutaka Akiyama, Tokyo Tech]
Oral/Gut Metagenomics
Ultra-fast Seq. Analysis
Exhaustive PPI
Prediction System
Pathway Predictions
Fragment-based
Virtual Screening
Learning-to-Rank VS
Genomics Protein-Protein
Interactions
Drug Discovery
• Ohue et al., Bioinformatics (2014)
• Suzuki et al., Bioinformatics (2015)
• Suzuki et al., PLOS ONE (2016)
• Matsuzaki et al., Protein Pept Lett (2014) • Suzuki et al., AROB2017 (2017)
• Yanagisawa et al., GIW (2016)
• Yamasawa et al., IIBMP (2016)
EBD vs. EBD : Large Scale Homology Search for Metagenomics
increasing
Taxonomic composition
Next generation sequencer
- Revealing uncultured microbiomes and finding novel genes in various environments
- Applied for human health in recent years
O(n)
Meas.
data
O(m) Reference
Database
O(m n) calculation
Correlation,
Similarity search
EBD
・ With Tokyo Dental College, Prof. Kazuyuki Ishihara
・ Comparative metagenomic analysis between
healthy persons and patients
Various environments
Human
body
Sea
Soil
EBD
High risk microorganisms are detected.
Metabolic Pathway
Metagenomic analysis of periodontitis patients
increasing
Development of Ultra-fast Homology Search Tools
(Chart, log scale 1 to 1,000,000: computational time for 10,000 sequences (sec.), 3.9 GB DB, 1 CPU core; BLAST vs. GHOSTZ.)
Suzuki, et al. Bioinformatics, 2015.
Subsequence clustering
GHOSTZ-GPU
(Chart: speed-up ratio relative to 1 core for 1C, 1C+1G, 12C+1G, and 12C+3G configurations; about 70x faster than 1 core using 12 cores + 3 GPUs.)
Suzuki, et al. PLOS ONE, 2016.
Multithreading on GPU; MPI + OpenMP hybrid parallelization
Retains strong scaling up to 100,000 cores (GHOST-MP)
Kakuta, et al. (submitted)
About 240x faster than the conventional algorithm on a TSUBAME 2.5 thin-node GPU
(Chart, TSUBAME 2.5: GHOST-MP vs. mpi-BLAST, 80x-100x faster.)
Plasma Protein Binding (PPB) Prediction by Machine Learning
Application to peptide drug discovery
Problems
・ Candidate peptides tend to be degraded and excreted faster than small-molecule drugs
・ There is a strong need to design bio-stable peptides as drug candidates
(Plot: predicted vs. experimental values; existing PPB prediction software for small molecules cannot predict peptide PPB.)
Solutions
Compute feature values (more than 500 features): LogS, LogP, molecular weight, SASA, polarity, …
Combine the feature values to build a predictive model for the PPB value.
(Plot: predicted vs. experimental values, R² = 0.905; the constructed model explains peptide PPB well. A toy version of this feature-to-model pipeline is sketched below.)
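A toy version of the feature-to-model pipeline with synthetic data, assuming scikit-learn is available; the actual work computes more than 500 physicochemical descriptors per peptide and reports R² = 0.905 against experiment.

```python
# Toy sketch with synthetic data; the real model uses >500 peptide descriptors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                              # stand-in for LogS, LogP, SASA, ...
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)    # stand-in for measured PPB values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(r2_score(y_te, model.predict(X_te)), 3))
```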
Molecular Dynamics Simulation for Membrane Permeability
Application to peptide drug discovery
Sequence: D-Pro, D-Leu, D-Leu, L-Leu, D-Leu, L-Tyr; membrane permeability: 7.9 × 10⁻⁶ cm/s
Sequence: D-Pro, D-Leu, D-Leu, D-Leu, D-Leu, L-Tyr; membrane permeability: 0.045 × 10⁻⁶ cm/s (×0.006)
Problems
1) A single-residue mutation can drastically change membrane permeability
2) Standard MD simulation cannot follow membrane permeation:
membrane permeation is a millisecond-order phenomenon.
Ex) membrane thickness 40 Å, peptide membrane permeability 7.9×10⁻⁶ cm/s:
a typical peptide membrane permeation takes 40 Å / 7.9×10⁻⁶ cm/s = 0.5 millisecond
Solutions
1) Apply enhanced sampling: Supervised MD (SuMD), metadynamics (MTD)
(Diagram: free energy vs. collective variable (CV) for metadynamics.)
2) GPU acceleration and massively parallel computation (GROMACS / DESMOND MD engines on GPU)
・ Millisecond-order phenomena can be simulated
・ Hundreds of peptides can be calculated simultaneously on TSUBAME
RWBC-OIL 2-3: Tokyo Tech IT-Drug Discovery Factory
Simulation & Big Data & AI at Top HPC Scale
(Tonomachi, Kawasaki-city: planned 2017, PI Yutaka Akiyama)
Tokyo Tech’s research seeds
①Drug Target selection system
②Glide-based Virtual Screening
③Novel Algorithms for fast virtual
screening against huge databases
New Drug Discovery platform especially for
specialty peptide and nucl. acids.
Plasma binding
(ML-based)
Membrane penetration
(Mol. Dynamics simulation)
Minister of Health, Labour and Welfare Award of
the 11th annual Merit Awards for Industry-
Academia-Government Collaboration
TSUBAME’s GPU-environment allows
World’s top-tier Virtual Screening
• Yoshino et al., PLOS ONE (2015)
• Chiba et al., Sci Rep (2015)
Fragment-based efficient algorithm
designed for 100-millions cmpds data
• Yanagisawa et al., GIW (2016)
Application projects
Drug Discovery platform powered by
Supercomputing and Machine Learning
Investments from JP Govt., Tokyo Tech. (TSUBAME SC)
Municipal Govt. (Kawasaki), JP & US Pharma
Multi-Petaflops Compute
Peta~Exabytes Data
Processing Continuously
Cutting Edge, Large-
Scale HPC & BD/AI
Infrastructure
Absolutely Necessary
Estimated Compute Resource Requirements for Deep Learning
[Source: Preferred Network Japan Inc.]
(Timeline 2015 to 2030, axis from 10 PF to 100 EF; P: peta, E: exa, F: flops)
Auto driving: 1E-100E Flops. 1 TB per car per day; training on 100 days of driving data from 10 to 1,000 cars (or, at 1 TB per car per year, on data from 1 million to 100 million cars)
Bio / healthcare: 100P-1E Flops. About 10M SNPs per person from genome analysis; 100 PFlops for 1 million people, 1 EFlops for 100 million
Speech: 10P- Flops. 5,000 hours of speech from 10,000 people, trained against 100,000 hours of artificially generated speech data [Baidu 2015]
Image/video recognition: 10P (image) to 10E (video) Flops. Training data of 100 million images in a 10,000-class task; several thousand nodes for 6 months [Google 2015]
Robots / drones, image recognition
Machine learning and deep learning become more accurate as the training data grows; today the data is generated by people, but in the future it will be generated by machines.
The estimates assume that training on 1 GB of data in one day requires 1 TFlops (checked in the sketch below).
To complete the learning phase in one day
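The rule of thumb above (about 1 TFlops of sustained compute to learn 1 GB of data in one day), applied to the autonomous-driving case as a quick check:

```python
TFLOPS_PER_GB_PER_DAY = 1.0                 # rule of thumb stated above

def required_flops(data_gb):
    """Sustained FLOPS needed to finish training on data_gb within one day."""
    return data_gb * TFLOPS_PER_GB_PER_DAY * 1e12

# Auto driving: 1 TB per car per day, 100 days of data, 10 to 1,000 cars
for cars in (10, 1000):
    data_gb = cars * 100 * 1000             # cars x days x (GB per car per day)
    print(f"{cars:5d} cars -> {required_flops(data_gb):.0e} FLOPS")
# prints 1e+18 and 1e+20, i.e. 1 to 100 ExaFlops, matching the slide's range
```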
It’s the FLOPS
(in reduced
precision)
and BW!
So both are
important in the
infrastructure
Fast and cost-effective deep learning algorithm
platform for video processing in social infrastructure
Principal Investigator: Koichi Shinoda
Collaborators: Satoshi Matsuoka
Tsuyoshi Murata
Rio Yokota
Tokyo Institute of Technology
(Members RWBC-OIL 1-1 and 2-1)
JST-CREST “Development and Integration of Artificial
Intelligence Technologies for Innovation Acceleration”
Background
• Video processing in smart society
for safety and security
• Intelligent transport systems
Drive recorder video
• Security systems
Surveillance camera video
• Deep learning
• Much higher performance than
before
• IT giants with large computational
resources have formed a monopoly
Problems:
• Real-time accurate recognition of small objects and their movement
• Edge-computing without heavy traffic on Internet
• Flexible framework for training which can adapt rapidly to the
environmental changes
Research team
(Research-team diagram: Shinoda G, Murata G, Matsuoka G, and Yokota G at Tokyo Tech, with AIST AIRC and Denso / Denso IT Lab as collaborators; roles cover fast deep learning, minimizing network size, and parallel processing on CPU/GPU nodes; Argonne National Laboratory / University of Chicago (system) and Toyota InfoTechnology Center (application) as reference partners.)
Example AI Research: Predicting Statistics of Asynchronous SGD Parameters
for a Large-Scale Distributed Deep Learning System on GPU Supercomputers
Background
• In large-scale Asynchronous Stochastic Gradient Descent
(ASGD), mini-batch size and gradient staleness tend to be
large and unpredictable, which increase the error of trained
DNN
(Diagram: in the DNN parameter space, each update moves W(t) to W(t+1) by -ηΣi∇Ei; if two asynchronous updates occur while a gradient is being computed, that gradient is applied with staleness = 2 instead of 0.)
(Charts: measured vs. predicted probability distributions of mini-batch size (NMinibatch) and staleness (NStaleness) for NSubbatch = 1, 4, 8, 11, the number of samples per GPU iteration, on 4, 8, and 16 nodes.)
Proposal
• We propose an empirical performance model for the ASGD
deep learning system SPRINT that considers the probability
distributions of mini-batch size and staleness
• Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of
Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in proceedings of
2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016
Performance Prediction of Future HW for CNN
 Predicts the best performance with two future architectural extensions
 FP16: precision reduction to double the peak floating-point performance
 EDR IB: 4x EDR InfiniBand (100 Gbps) upgrade from FDR (56 Gbps)
→ Not only the number of nodes but also a fast interconnect is important for scalability
Predicted best parameters (average mini-batch size 138 ± 25%), TSUBAME-KFC/DL, ILSVRC2012 dataset deep learning:
  Current HW: N_Node 8, N_Subbatch 8, epoch time 1779, average mini-batch size 165.1
  FP16: N_Node 7, N_Subbatch 22, epoch time 1462, average mini-batch size 170.1
  EDR IB: N_Node 12, N_Subbatch 11, epoch time 1245, average mini-batch size 166.6
  FP16 + EDR IB: N_Node 8, N_Subbatch 15, epoch time 1128, average mini-batch size 171.5
(SWoPP 2016, Aug. 8, 2016)
Hierarchical matrix (H-matrix) for CNN acceleration
- A hierarchical matrix is an efficient data-sparse representation of
certain densely populated matrices.
- CNN (Convolutional Neural Network)
• Background
- A hierarchical matrix (H-matrix) is an approximated form representing the n × n correlations of n objects, which would normally require a huge n × n dense matrix.
- Significant memory savings when compressed: O(n^2) ⇒ O(k n log n)
- Computational complexity of matrix-matrix multiplication, LU factorization, inversion, …: O(n^3) ⇒ O(k^2 n log^2 n)
(Figure: the H-matrix approximation of a dense matrix; red blocks are dense, green blocks are low-rank matrices with rank k. A low-rank sketch follows below.)
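A minimal numpy sketch of the low-rank building block used for the green blocks: truncate the SVD of one smooth block to rank k and compare storage. This is plain low-rank approximation, not a full H-matrix construction; the kernel and sizes are made up for illustration.

```python
import numpy as np

def low_rank(block, k):
    """Rank-k approximation via truncated SVD: store two n-by-k factors
    instead of the full n-by-n block."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]

n, k = 512, 8
x = np.linspace(0.0, 1.0, n)
block = 1.0 / (3.0 + np.abs(x[:, None] - x[None, :]))   # smooth, far-field-like kernel
A, B = low_rank(block, k)

rel_err = np.linalg.norm(block - A @ B) / np.linalg.norm(block)
print("relative error:", rel_err)
print("storage ratio (dense / compressed):", block.size / (A.size + B.size))  # n/(2k) = 32
```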
Preliminary results: compression rate of matrices
(Chart: compression rate (%) vs. matrix size n × n from 1632 to 29056; bem1d falls from 15.80% to 1.33%, sdpara from 20.23% to 10.52%.)
We can compress the matrix in some applications.
- bem1d: 1-dimensional boundary element method
- sdpara: a parallel implementation of the interior-point method for semidefinite programming (SDP)
Compressive rate = (uncompressed size) / (compressed size)
(Charts: matrix sizes (MB), non-compressed vs. compressed, for Matrix A and Matrix B in the SDPARA case and in the deep-learning (CNN) case; both Matrix A and Matrix B compress successfully.)
SDPARA and deep learning (CNN)
In the CNN application, SGEMM (single-precision general matrix multiplication), C = αAB + βC,
accounts for a large part of the computation (around 70%).
Example shapes: (m, n, k) = (1764, 1350, 178) and (m, n, k) = (324, 298, 1908)
Power optimization using Deep Q-Network (Kento Teranishi)
・ Background
Existing work optimizes power by frequency control, but it requires detailed analysis and generalizes poorly; use deep learning for the analysis instead.
Inputs such as performance counters, temperature, and frequency determine power and execution time: P = f(x1, x2, ...), Texe = g(x1, x2, ...).
・ Objective
Implement a computer control system using a Deep Q-Network: monitor counters, power, frequency, temperature, etc., and control the frequency in response.
Deep Q-Network (DQN): deep reinforcement learning that computes the action-value function Q with a neural network;
used for game-playing AI, robot cars, and AlphaGo. (A simplified tabular sketch follows below.)
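A minimal tabular Q-learning sketch of the control loop described here, as a simplified stand-in for the DQN: the state is the current frequency level, the action lowers, keeps, or raises it, and the reward penalizes the energy (power x time) of a measured step. The frequency levels, the toy power/time model, and the reward shape are all made up for illustration.

```python
import random

FREQS = [1000, 1200, 1400, 1600]        # MHz levels (illustrative)
ACTIONS = [-1, 0, 1]                    # lower / keep / raise the frequency level

def measure(freq_mhz):
    """Stand-in for reading the power meter and timing the kernel at a given clock."""
    power = 440.0 + 1e-7 * freq_mhz**3 + random.gauss(0, 5)   # static + dynamic power (W), toy model
    time_ = 1.6e5 / freq_mhz + random.gauss(0, 1)             # runtime (s), toy model
    return power, time_

Q = {(s, a): 0.0 for s in range(len(FREQS)) for a in ACTIONS}
state, alpha, gamma, eps = 0, 0.1, 0.9, 0.2
for _ in range(5000):
    a = random.choice(ACTIONS) if random.random() < eps else \
        max(ACTIONS, key=lambda x: Q[(state, x)])
    nxt = min(max(state + a, 0), len(FREQS) - 1)
    power, time_ = measure(FREQS[nxt])
    reward = -power * time_ / 1e4                   # minimise energy = power x time
    Q[(state, a)] += alpha * (reward + gamma * max(Q[(nxt, b)] for b in ACTIONS)
                              - Q[(state, a)])
    state = nxt

print({FREQS[s]: max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(len(FREQS))})
```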
METI AIST-AIRC ABCI
as the world's first large-scale open AI
infrastructure
Univ. Tokyo Kashiwa
Campus
• >130 AI-Petaflops
• < 3MW Power
• < 1.1 Avg. PUE
• Operational ~2018Q1
• ABCI: AI Bridging Cloud Infrastructure
• Top-Level SC compute & data capability for DNN (>130 AI-
Petaflops)
• Open Public & Dedicated infrastructure for Al & Big Data
Algorithms, Software and Applications
• Platform to accelerate joint academic-industry R&D for AI in Japan
The “Chicken or Egg Problem” of AI-
HPC Infrastructures
• “On Premise” machines in clients => “Can’t invest in big in AI
machines unless we forecast good ROI. We don’t have the experience
in running on big machines.”
• Public Clouds other than the giants => “Can’t invest big in AI machines
unless we forecast good ROI. We are cutthroat.”
• Large scale supercomputer centers => “Can’t invest big in AI machines
unless we forecast good ROI. Can’t sacrifice our existing clients and
our machines are full”
• Thus the giants dominate, AI technologies, big data, and people stay
behind the corporate firewalls…
But Commercial Companies esp. the “AI
Giants”are Leading AI R&D, are they not?
• Yes, but that is because their short-term goals could harvest the low-
hanging fruit in DNN-rejuvenated AI
• But AI/BD research is just beginning--- if we leave it to the interests of
commercial companies, we cannot tackle difficult problems with no
proven ROI
• Very unhealthy for research
• This is different from more mature
fields, such as pharmaceuticals or
aerospace, where there is balanced
investments and innovations in both
academia/government and the industry
ABCI Prototype: AIST AI Cloud (AAIC)
March 2017 (System Vendor: NEC)
• 400x NVIDIA Tesla P100s and Infiniband EDR accelerate various AI workloads
including ML (Machine Learning) and DL (Deep Learning).
• Advanced data analytics leveraged by 4PiB shared Big Data Storage and Apache
Spark w/ its ecosystem.
AI Computation System Large Capacity Storage System
Computation Nodes (w/GPU) x50
• Intel Xeon E5 v4 x2
• NVIDIA Tesla P100 (NVLink) x8
• 256GiB Memory, 480GB SSD
Computation Nodes (w/o GPU) x68
• Intel Xeon E5 v4 x2
• 256GiB Memory, 480GB SSD
Mgmt & Service
Nodes x16
Interactive Nodes
x2
400 Pascal GPUs
30TB Memory
56TB SSD
DDN SFA14K
• File server (w/10GbEx2,
IB EDRx4) x4
• 8TB 7.2Krpm NL-SAS
HDD x730
• GRIDScaler (GPFS)
>4PiB effective
RW100GB/s
Computation Network
Mellanox CS7520 Director Switch
• EDR (100Gbps) x216
Bi-direction 200Gbps
Full bi-section bandwidth
Service and Management Network
IB EDR (100Gbps) IB EDR (100Gbps)
GbE or 10GbE GbE or 10GbE
Firewall
• FortiGate 3815D x2
• FortiAnalyzer 1000E x2
UTM Firewall
40-100Gbps class
10GbE
SINET5
Internet
Connection
10-100GbE
The “Real” ABCI – 2018Q1
• Extreme computing power
– w/ >130 AI-PFlops for AI/ML especially DNN
– >x1 million speedup over high-end PC: 1 Day training for 3000-Year DNN
training job
– TSUBAME-KFC (1.4 AI-Pflops) x 90 users (T2 avg)
• Big Data and HPC converged modern design
– For advanced data analytics (Big Data) and scientific simulation (HPC), etc.
– Leverage Tokyo Tech’s “TSUBAME3” design, but differences/enhancements
being AI/BD centric
• Ultra high BW & Low latency memory, network, and storage
– For accelerating various AI/BD workloads
– Data-centric architecture, optimizes data movement
• Big Data/AI and HPC SW Stack Convergence
– Incl. results from JST-CREST EBD
– Wide contributions from the PC Cluster community desirable.
• Ultra-Green (PUE<1.1), High Thermal (60KW) Rack
– Custom, warehouse-like IDC building and internal pods
– Final “commoditization” of HPC technologies into Clouds
ABCI Cloud Infrastructure
• Ultra-dense IDC design from ground-up
– Custom inexpensive lightweight “warehouse” building w/ substantial
earthquake tolerance
– x20 thermal density of standard IDC
• Extreme green
– Ambient warm liquid cooling, large Li-ion battery storage, and high-
efficiency power supplies, etc.
– Commoditizing supercomputer cooling technologies to
Clouds (60KW/rack)
• Cloud ecosystem
– Wide-ranging Big Data and HPC standard software stacks
• Advanced cloud-based operation
– Incl. dynamic deployment, container-based virtualized provisioning,
multitenant partitioning, and automatic failure recovery, etc.
– Joining HPC and Cloud Software stack for real
• Final piece in the commoditization of HPC (into IDC)
ABCI AI-IDC CG image (source: NEC case study; reference image)
ABCI Cloud Data Center
“Commoditizing the 60 kW/rack Supercomputer”
Data center image
Layout Plan
high voltage transformers (3.25MW)
Passive Cooling Tower
Free cooling
Cooling Capacity: 3MW
Active Chillers
Cooling Capacity: 200kW
Lithium battery
1MWh, 1MVA
W:18m x D:24m x H:8m
72 Racks
18 Racks
• Single Floor, inexpensive build
• Hard concrete floor 2 tonnes/m2
weight tolerance for racks and
cooling pods
• Number of Racks
• Initial: 90
• Max: 144
• Power Capacity
• 3.25 MW (MAX)
• Cooling Capacity
• 3.2 MW (Minimum in
Summer)
Future
Expansion
Space
Implementing 60KW cooling in Cloud IDC – Cooling Pods
Cooling Block Diagram (Hot Rack)
19 or 23 inch
Rack (48U)
Computing
Server
Hot Water Circuit: 40°C
Cold Water Circuit: 32°C
Hot Aisle: 40℃
Fan Coil Unit
Cooling
Capacity 10kW
Front
side
Water Block
(CPU or/and
Accelerator, etc.)
Air: 40°C / Air: 35°C
CDU
Cooling Capacity 10kW
Cold Aisle: 35℃
Water
Cooling Capacity
• Fan Coil Unit 10KW/Rack
• Water Block: 50KW/Rack
Hot Aisle
Capping
Flat concrete slab – 2 tonnes/m2 weight tolerance
Commoditizing Supercomputing
Cooling Density and Efficiency
• Warm water cooling – 32C
• Liquid cooling & air cooling in same rack
• 60KW Cooling Capacity, 50KW
Liquid+10KW Air
• Very low PUE
• Structural integrity by rack + skeleton
frame built on high flat floor load
Comparison of TSUBAME3.0 and ABCI
• TSUBAME3 is a "general-purpose supercomputer oriented toward big data / AI";
ABCI is a "supercomputer-inspired prototype for next-generation AI / big-data cloud IDCs"
• This difference drives the gap between two otherwise very similar sister machines
• In hardware, the double-precision compute and network capability that TSUBAME3 needs as a supercomputer are relaxed in ABCI to hold down cost
• In node and system packaging and cooling, TSUBAME3 uses supercomputer-specific construction to push performance density (3.1 petaflops per rack), heat density (61 kW per rack), and cooling efficiency (PUE = 1.033); ABCI achieves nearly the same density and efficiency in standard 19-inch IDC racks, easing future adoption in IDCs
• In software, both support the cloud / big data / AI software stacks, but ABCI adopts them more broadly and more completely
• One theme of ABCI is how to push a TSUBAME3-like AI-oriented supercomputer into the IDC ecosystem ==> the other performance parameters closely mirror TSUBAME3
• Compute-node performance parameters other than the network are very similar
• Heat density of 50-60 kW/rack (typical IDC: 3-6 kW/rack) and PUE below 1.1 (typical IDC: 1.5-3)
• As an end-to-end demonstration, ABCI builds an inexpensive, ultra-efficient ABCI data center that can serve as the prototype for future AI IDCs
TSUBAME3.0 vs. ABCI comparison table: TSUBAME3 (July 2017) / ABCI (March 2018) / reference: K computer (2012)
• Peak AI performance (AI-FLOPS): 47.2 PetaFlops (12.1 double precision), 3.1 PetaFlops per rack / 130-200 Petaflops, double precision undisclosed, 3-4 Petaflops per rack / 11.3 Petaflops, 12.3 TeraFlops per rack
• Packaging: supercomputer-specific (water-cooled ICE-XA) / standard 19-inch IDC racks (water-cooled) / supercomputer-specific
• Operational power (incl. cooling): under 1 MW / about 2 MW / over 15 MW
• Max rack heat density, cooling efficiency (PUE): 61 kW, 1.033 / 50-60 kW, under 1.1 (slightly lower) / ~20 kW, ~1.3
• Hardware architecture: many-core (NVIDIA Pascal P100) + multi-core (Intel Xeon) / many-core AI/DL-oriented general-purpose processors, etc. (undecided) / homogeneous multi-core with large CPU cores
• Memory technology: HBM2 + DDR4 / fast on-package memory + DDR4 / DDR3
• Network technology: Intel OmniPath, 4 x 100 Gbps per node, full bisection, optical between switches / lower network speed and bisection than T3 to cut cost and ease IDC adoption / copper Tofu
• Per-node non-volatile memory: 2 TeraByte NVMe per node / > 400 GB NVMe per node / none
• Power monitoring, control, active power capping: fine-grained node- and system-wide monitoring, capping, and power knobs / same / simple system-wide power measurement only, no control
• Cloud support, virtualization, AI: cloud APIs, container virtualization and node partitioning, dynamic provisioning, various machine-learning software stacks / same / none
ABCI Procurement Benchmarks
• Big Data Benchmarks
– (SPEC CPU Rate)
– Graph 500
– MinuteSort
– Node Local Storage I/O
– Parallel FS I/O
• AI/ML Benchmarks
– Low precision GEMM
• CNN Kernel, defines “AI-Flops”
– Single Node CNN
• AlexNet => RESNET?
• ILSVRC2012 Dataset
– Multi-Node CNN
• Caffe+MPI
– Large Memory CNN
• Convnet on Chainer
– RNN / LSTM
• Being defined with NICT Lab
No traditional HPC simulation benchmarks, except SPEC CPU
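A small numpy illustration of the precision trade-off behind the low-precision GEMM item: the same matrix product in FP16 vs. FP64. This only shows the accuracy side; the benchmark itself measures throughput on the accelerators.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((256, 256))
b = rng.random((256, 256))

c64 = a @ b                                                       # FP64 reference
c16 = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float64)

rel_err = np.abs(c64 - c16) / np.abs(c64)
print("max relative error of the FP16 product:", rel_err.max())   # far larger than FP64 rounding
```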
Current ABCI Benchmarking Status
• Extensive discussions w/processor and system vendors, experienced DL
researchers and users
• Collaboration with NICT (the National Institute of Information and
Communications Technology of Japan)
• Dedicated benchmark team formed within AIRC
• Early benchmarks and tweaking on AAIC
• Also looking at other HPC-DL benchmarks such as ANL CANDLE DNN
benchmark
• Will be able to directly contribute to IRDS (along with other HPC
benchmarks)
• We also have a generation of GPUs for DNN, from G200 to Latest Pascal,
with Volta forthcoming (DGX-1)
– G200-S1070 (2008), Fermi M2050 (2010), Kepler K20c/K20x (2013), /K40/K80
(2014), Maxwell Geforce cards (2015), Pascal P100 (2016), Volta V100 (2017)
Basic Requirements for AI Cloud System
PFS
Lustre・
GPFS
Batch Job
Schedulers
Local Flash+3D
XPoint
Storage
DFS
HDFS
BD/AI User Applications
RDB
PostgreSQL
Python, Jupyter Notebook, R etc. + IDL
SQL
Hive/Pig
CloudDB/NoSQL
HBase/MongoDB/Redis
Resource
Brokers
Machine
Learning
Libraries
Numerical Libraries
BLAS/Matlab
Fortran・C・C++
Native Codes
BD Algorithm
Kernels (sort etc.)
Parallel Debuggers and Profilers
Workflow
Systems
Graph
Computing
Libraries
Deep
Learning
Frameworks
Web
Services
Linux Containers ・Cloud Services
MPI・OpenMP/ACC・CUDA/OpenCL
Linux OS
IB・OPA
High Capacity
Low Latency NW
X86 (Xeon, Phi)+
Accelerators e.g.
GPU, FPGA, Lake
Crest
Application
✓ Easy use of various ML/DL/Graph frameworks from
Python, Jupyter Notebook, R, etc.
✓ Web-based applications and services provision
System Software
✓ HPC-oriented techniques for numerical libraries, BD
Algorithm kernels, etc.
✓ Supporting long running jobs / workflow for DL
✓ Accelerated I/O and secure data access to large data
sets
✓ User-customized environment based on Linux
containers for easy deployment and reproducibility
OS
Hardware
✓ Modern supercomputing facilities based on commodity
components
Oct. 2015
TSUBAME-KFC/DL
(Tokyo Tech./NEC)
1.4 AI-PF(Petaflops)
Cutting Edge Research AI Infrastructures in Japan
Accelerating BD/AI with HPC
(and my effort to design & build them)
Mar. 2017
AIST AI Cloud
(AIST-AIRC/NEC)
8.2 AI-PF
Mar. 2017
AI Supercomputer
Riken AIP/Fujitsu
4.1 AI-PF
Aug. 2017
TSUBAME3.0 (Tokyo Tech./HPE)
47.2 AI-PF (65.8 AI-PF
w/Tsubame2.5)
In Production
Under
Acceptance
Being
Manufactured Mar. 2018
ABCI (AIST-AIRC)
130-200 AI-PF
Draft RFC out
IDC under
construction
1H 2019?
“ExaAI”
~1 AI-ExaFlop
Undergoing
Engineering
Study
R&D Investments into world leading
AI/BD HW & SW & Algorithms and their
co-design for cutting edge Infrastructure
absolutely necessary (just as is with
Japan Post-K and US ECP in HPC)
x5.8
x5.8
x2.8~4.2
x5.0~7.7
Big Data / AI-Oriented
Supercomputers
Acceleration
Scaling, and
Control of HPC via
BD/ML/AI and
future SC designs
Robots / Drones
Image and Video
Big Data and
ML/AI Apps
and
Methodologies
Large Scale Graphs
Future Big Data・AI
Supercomputer Design
Optimizing System
Software and Ops
Mutual and Semi-
Automated Co-
Acceleration of
HPC and BD/ML/AI
Co-Design of BD/ML/AI with HPC using BD/ML/AI
- for survival of HPC
Accelerating
Conventional HPC Apps
Acceleration and Scaling of
BD/ML/AI via HPC and
Technologies and
Infrastructures
ABCI: World’s first and
largest open 100 Peta AI-
Flops AI Supercomputer,
Fall 2017, for co-design
• Strategy 5: Develop shared public datasets and
environments for AI training and testing. The
depth, quality, and accuracy of training datasets
and resources significantly affect AI performance.
Researchers need to develop high quality
datasets and environments and enable
responsible access to high-quality datasets as well
as to testing and training resources.
• Strategy 6: Measure and evaluate AI technologies
through standards and benchmarks. Essential to
advancements in AI are standards, benchmarks,
testbeds, and community engagement that guide
and evaluate progress in AI. Additional research is
needed to develop a broad spectrum of
evaluative techniques.
We are implementing the US AI&BD strategies already
…in Japan, at AIRC w/ABCI
What is worse: Moore’s Law will end in the 2020’s
•Much of underlying IT performance growth due to Moore’s law
•“LSI: x2 transistors in 1~1.5 years”
•Causing qualitative “leaps” in IT and societal innovations
•The main reason we have supercomputers and Google...
•But this is slowing down & ending, by mid 2020s…!!!
•End of Lithography shrinks
•End of Dennard scaling
•End of Fab Economics
•How do we sustain “performance growth” beyond the “end of
Moore”?
•Not just one-time speed bumps
•Will affect all aspects of IT, including BD/AI/ML/IoT, not just HPC
•End of IT as we know it
Gordon Moore
The curse of constant
transistor power shall
soon be upon us
20 year Eras towards of End of Moore’s Law
3-5nm and
beyond 2025-
Constant
Transistor Power
• 1980s~2004
Dennard scaling,
perf+ = single
thread+ = transistor
& freq+ = power+
• 2004~2015 feature
scaling, perf+ =
transistor+ =
core#+, constant
power
• 2015~2025 all of the
above gets harder
• 2025~ post-Moore:
constant feature size and
power = flat performance
Need to realize the next 20-year era of supercomputing
20 year
Post-Dennard
Many-Core Era
20-year
Moore-Dennard
Single Core
ILP-Vector
Killer-Micro Era
20-year
Next-Gen
Post-Moore era
The “curse of constant transistor power”
- Ignorance of this is like ignoring global warming -
• Systems people have been telling the algorithm people that
“FLOPS will be free, bandwidth is important, so devise
algorithms under that assumption”
• This will certainly be true until exascale in 2020…
• But when Moore’s Law ends in 2025-2030, constant transistor
power (esp. for logic) = FLOPS will no longer be free!
• So algorithms that simply increase arithmetic intensity will no
longer scale beyond that point
• Like countering global warming – need disruptive change in
computing – in HW-SW-Alg-Apps etc. for the next 20 year era
Performance growth via data-centric computing:
“From FLOPS to BYTES”
• Identify the new parameter(s) for scaling over time
• Because data-related parameters (e.g. capacity and bandwidth) will still
likely continue to grow towards 2040s
• Can grow transistor# for compute, but CANNOT use them AT THE SAME
TIME(Dark Silicon) => multiple computing units specialized to type of data
• Continued capacity growth: 3D stacking (esp. direct silicon layering) and
low power NVM (e.g. ReRAM)
• Continued BW growth: Data movement energy will be capped constant by
dense 3D design and advanced optics from silicon photonics technologies
• Almost back to the old “vector” days(?), but no free lunch – latency still
problem, locality still important, need general algorithmic acceleration
thru data capacity and bandwidth, not FLOPS
Transistor Lithography Scaling
(CMOS Logic Circuits, DRAM/SRAM)
Loosely Coupled with Electronic Interconnect
Data Data
Hardware/Software System APIs
Flops-Centric Massively Parallel Architecture
Flops-Centric System Software
Novel Devices + CMOS (Dark Silicon)
(Nanophotonics, Non-Volatile Devices etc.)
Ultra Tightly Coupled w/Aggressive
3-D+Photonic Switching Interconnected
Hardware/Software System APIs
Data-Centric Heterogeneous Architecture
Bytes-Centric System Software
Heterogeneous CPUs + Holistic Data
Data Data
Homogeneous General Purpose Nodes
+ Localized Data
Reconfigurable dataflow
Optical computing
DNN & neuromorphic
Massive-BW 3-D packaging
Quantum computing
Low-precision, error-prone computing
Non-volatile memory
Flops-Centric Algorithms and Apps Bytes-Centric Algorithms and Apps
Compute
Nodes
Gen CPU (general-purpose CPU) nodes
~2025
M-P Extinction
Event
Many Core Era Post Moore Era
Compute
Nodes
Compute
Nodes
Compute
Nodes
Post-Moore High Bandwidth Hierarchical Memory
Model
Post-Moore Data Science
and AI Libraries
Post-Moore Computational
Science Libraries
Computation / Communication / Memory
PC-RAM
ReRAM
Photonic Switching
3D architecture
Next-gen VIAs & silicon
fabrication
New memory devices
Tungsten VIAs and 3D silicon
Photonic interposers
Silicon photonics WDM interconnect
Inductive TCI
Building Block “gluable” architecture
Neural networks /
neuromorphic /
Ising annealing
Brain-inspired computing
Low-reliability computing
Near-threshold computing
Low-reliability communication
Quantum Computing
Customizable logic
Photonic Compute Devices
Optical Packet Switching
STT-MRAM
High B/F Algorithms
Low-precision &
probabilistic
computing
Accelerator-
Specific
Compilers
Accelerator
“Binaries”
Programmable
Logic
Data Memoization
Machine Learning
based acceleration
Data Assimilation
Parallel Space-and-Time
Algorithms
BW Reducing Alg.
Post-Moore
Performance
Parameters
Auto Tuning
Post-Moore
Performance
Models
Data & Custom Compute Centric Platform
Uncertainty
Quantification
Couplers
Multi-Physics
Simulation
Low Rank
Approximation
Out-of-core Alg
Massive Medical
Imaging
Post-Moore Programming Model
Data-oriented
Scheduling
Latency
Hiding
Data-Movement
Runtime
Hierarchical Data
Abstractions
Fault
Tolerance
High-Level
Synthesis
Compilers
Fusion/Plasma, EMF Analysis, Manufacturing
Post-Moore is NOT a
More-Moore device
as a panacea
Device & arch. advances
improving data-related
parameters over time
“Rebooting Computing”
in terms of devices,
architectures, software.
Algorithms, and
applications necessary
=> Co-Design even
more important
c.f. Exascale
  • 9. Two computational characteristics of large-scale big data / AI processing. On one side, as big data / AI: graph analysis (social networks), sorting and hashing (database and log-data analysis), and symbolic processing (classical AI); as HPC: integer and sparse-matrix computation, data-movement-centric with large memories, over sparse and random data. On the other side, as big data / AI: dense matrix operations (deep learning — inference, training, generation); as HPC: dense matrix computation, mostly in reduced precision, over dense and relatively well-ordered networks and data. Both sides need to be made faster and larger, i.e. accelerated by supercomputers, and the supercomputer must fit both. The two computational profiles are completely different from each other, yet simulation workloads on supercomputers are essentially of one kind or the other as well.
  • 10. The BYTES capability (of big data) — above all data bandwidth and capacity — is crucial for accelerating BD/AI, yet much of it is missing from recent (Linpack-)FLOPS-obsessed supercomputers. System-wide bandwidth (bytes/s) and capacity (bytes) are required for HPC–BD/AI integration; memory bandwidth alone is not enough. For the sparse, data-centric applications on the left of the previous slide this is obvious, but even for DNNs it matters when strong-scaling across many nodes for speed, or when analysing large 3-D data such as CT scans or geological volumes (image sources: spineuniverse.com anterior 3-D CT scan; dgi.com CV4D overview). [Figure: time of one training iteration of CaffeNet on TSUBAME-KFC/DL (mini-batch size 256), broken down by phase versus the number of nodes — most of the time goes to communication and in-memory operations, and GPU computation is only 3.9%.] In a supercomputer that unifies BD/AI and HPC, architectures and system software that secure memory and network capacity and performance are therefore critically important.
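To see why the iteration becomes communication-bound, a rough data-parallel training model helps. The sketch below is a back-of-the-envelope estimate with hypothetical parameter counts, per-sample FLOPs and link bandwidth; it is not a measurement of the TSUBAME-KFC/DL run above.

```python
# Back-of-the-envelope model of one data-parallel training iteration:
# time = local compute + gradient allreduce. The allreduce term is bandwidth-bound,
# which is why interconnect bytes/s, not peak FLOPS, dominates when strong-scaling.
# All numbers below are hypothetical, chosen only to illustrate the trend.

def iteration_time(params_millions, gflops_per_sample, global_batch, gpus,
                   gpu_tflops=10.0, link_gbs=12.5):
    grad_bytes = params_millions * 1e6 * 4                        # FP32 gradients
    samples_per_gpu = max(global_batch // gpus, 1)
    compute = samples_per_gpu * gflops_per_sample / (gpu_tflops * 1e3)
    comm = 2 * (gpus - 1) / gpus * grad_bytes / (link_gbs * 1e9)  # ring allreduce
    return compute, comm

for gpus in (1, 4, 16, 64):
    comp, comm = iteration_time(params_millions=60, gflops_per_sample=2.0,
                                global_batch=256, gpus=gpus)
    share = comm / (comp + comm)
    print(f"{gpus:3d} GPUs: compute {comp*1e3:6.1f} ms, allreduce {comm*1e3:6.1f} ms,"
          f" communication share {share:5.1%}")
```

Even with generous assumptions, the fixed-size global batch shrinks the per-GPU compute while the gradient volume stays constant, so the communication share climbs toward 90% and beyond — the same qualitative picture as the measured CaffeNet breakdown.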
  • 11. R&D requirements for BYTES-centric supercomputers in the big-data / AI era — activities at Tokyo Tech GSIC, AIST AIRC, and the joint laboratory RWBC-OIL. Bytes-Centric Supercomputer: research and development of BYTES-centric supercomputers that move beyond simply growing double-precision FLOPS (TSUBAME3); petabyte- to exabyte-scale data that can move freely within the machine; "fat nodes" with deep memory hierarchies built on the latest non-volatile memory, interconnected by terabit-class ultra-high-speed networks; integer and sparse-matrix algorithms and software for BYTES-centric supercomputers; scaling and acceleration of deep-learning algorithms on BYTES-centric supercomputers; integration of simulation science with accelerated data science; research organizations that pursue HPC-accelerated BD/AI research (e.g. the Tokyo Tech–AIST OIL); industry-government-academia collaboration and technology transfer to industry as Japan's AI infrastructure (TSUBAME3 → ABCI); and development of a world-top-class HPC-AI cloud infrastructure.
  • 12. TSUBAME-KFC/DL: an ultra-green supercomputer research facility (MEXT budget request, 2011-2015, about 200 million yen), later upgraded into a machine-learning supercomputer. It integrates oil-immersion cooling, warm ambient-air cooling, high-density packaging, and power-control supercomputing technologies, and served as the prototype for TSUBAME3.0. Warm cooling loop: coolant oil at 35-45°C → water at 25-35°C (TSUBAME2 uses 7-17°C water); cooling tower: water at 25-35°C → released to the ambient air. Housed in a 20-foot container (16 m2) research facility with unmanned, automated operation; energy recovery is planned for the future. High-density, oil-immersed packaging: 210 TFLOPS double precision / 630 TFLOPS single precision in a single rack, with up to 60 kW of heat per rack. Ranked #1 in the world on the Green500 in November 2013 and June 2014 — a first for Japan. The 2015 upgrade brought 500 TFLOPS double precision and 1.5 PFLOPS single precision for machine learning, the world's highest performance density (still one rack; roughly seven such racks would match the K computer).
  • 13. Toward the 2017 leading supercomputer TSUBAME3.0 — "a supercomputer taking on the most advanced technology challenges". History: 2006 TSUBAME1.0, 80 TFLOPS, #1 in Asia and #7 in the world; 2010 TSUBAME2.0, 2.4 PFLOPS, #4 in the world and the greenest production supercomputer; 2011 ACM Gordon Bell Prize; 2013 TSUBAME2.5 upgrade (supplementary budget), 5.7 PFLOPS; 2013 TSUBAME-KFC, Green500 world #1; 2017 TSUBAME3.0, 12.1 PFLOPS, an HPCI leading supercomputer unifying HPC/BD/AI for large-scale academic and industrial applications. Goals: (1) HPCI leading supercomputer — together with TSUBAME2, twice the performance of the K computer: 15-20 PFLOPS, 4-5 PB/s of memory bandwidth, and a petabit-class optical network; (2) BD/AI data-centric architecture — multi-terabyte/s I/O on large-scale silicon storage, plus big-data middleware and applications including machine learning and AI; (3) green supercomputer — more than 10 GFLOPS/W power efficiency and PUE < 1.05; (4) cloud supercomputer — combining a variety of cloud services with high performance; (5) a "transparent" supercomputer whose extremely complex internal state is made visible.
  • 14. TSUBAME 3.0 system overview: a BYTES-centric scalable architecture across all GPUs, all nodes, and all levels of the memory hierarchy. Full-bisection-bandwidth Intel Omni-Path optical network: 432 Tbit/s bidirectional — roughly twice the average traffic of the entire Internet. DDN storage system (15.9 PB parallel file system + 45 TB home). 540 SGI ICE XA compute nodes, each with 2 Intel Xeon CPUs + 4 NVIDIA Pascal GPUs, 256 GB of memory, and a 2 TB NVMe-capable Intel SSD. 47.2 AI-Petaflops, 12.1 Petaflops. Full production operation from August 2017.
  • 15. NVIDIA Tesla P100 "Pascal" GPU — five technology innovations (2,160 units adopted in TSUBAME3): (1) NVLink for fast GPU-to-GPU communication, about 5x the GPU-GPU bandwidth of PCIe 3; (2) hardware-supported memory sharing with the CPU, removing the GPU memory-capacity limit; (3) high compute performance for HPC & AI — 21 TFLOPS FP16 for deep learning and 5.3 TFLOPS FP64 for HPC; (4) HBM2 2.5-D stacked memory — 3x the memory bandwidth for large-scale HPC and data workloads; (5) high power efficiency from the 16 nm FinFET process. [Charts comparing K40, M40 and P100 on FP32/FP16 throughput, GPU-GPU bandwidth, memory bandwidth, and addressable memory omitted.]
  • 16. TSUBAME3: a massively BYTES-centric architecture for converged BD/AI and HPC. Per node: intra-node GPU-GPU via NVLink at 20-40 GB/s; inter-node GPU-GPU via Omni-Path at 12.5 GB/s, fully switched; HBM2 64 GB at 2.5 TB/s; DDR4 256 GB at 150 GB/s; Intel Optane 1.5 TB at 12 GB/s (planned); NVMe flash 2 TB at 3 GB/s; 16 GB/s fully switched PCIe. Roughly 4 terabytes of hierarchical memory per node for big data / AI (c.f. the K computer's 16 GB/node), over 2 petabytes in TSUBAME3 as a whole, movable at 54 TB/s — about 1.7 zettabytes per year. Terabit-class network per node: 800 Gbps (400 + 400) with full bisection. Any "big" data in the system can be moved anywhere via RDMA at a minimum of 12.5 GB/s, also with stream processing, scalable to all 2,160 GPUs, not just 8.
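The aggregate figures above follow from simple arithmetic. A quick check, assuming the decomposition 540 nodes x 800 Gbps (= 100 GB/s) of injection bandwidth per node:

```python
# Back-of-the-envelope check of the aggregate-bandwidth claim above
# (assumed decomposition: 540 nodes x 800 Gbps injection each).

SECONDS_PER_YEAR = 365 * 24 * 3600

nodes = 540
per_node_gbps = 800                      # 400 + 400 Gbps Omni-Path per node
per_node_GBs = per_node_gbps / 8         # -> 100 GB/s

aggregate_TBs = nodes * per_node_GBs / 1e3
per_year_ZB = aggregate_TBs * 1e12 * SECONDS_PER_YEAR / 1e21

print(f"aggregate injection bandwidth ~ {aggregate_TBs:.0f} TB/s")
print(f"data movable per year        ~ {per_year_ZB:.2f} ZB")
# ~54 TB/s and ~1.7 ZB/year, matching the figures quoted above.
```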
  • 18. TSUBAME3.0 Co-Designed SGI ICE-XA Blade (new) - No exterior cable mess (power, NW, water) - Plan to become a future HPE product
  • 19. TSUBAME3.0 compute node: SGI ICE-XA, a new GPU compute blade co-designed by SGI and Tokyo Tech GSIC. [Node block diagram — 2 Xeon CPUs, 4 NVLink-connected P100 GPUs behind PLX PCIe switches, 4 Omni-Path HFIs, an NVMe SSD and planned Optane NVM — omitted.] An ultra-high-performance, high-bandwidth "fat node": 4 SXM2 (NVLink) NVIDIA Pascal P100 GPUs + 2 Intel Xeon CPUs = 84 AI-TFLOPS; high network bandwidth — Intel Omni-Path 100 Gbps x 4 = 400 Gbps per node (100 Gbps per GPU) for HPC and DNN; high I/O bandwidth — a 2 TB Intel NVMe SSD per node, over 1 PB and 1.5-2 TB/s system total, with future Optane 3D XPoint memory making a petabyte or more directly accessible; ultra-high-density, hot-water-cooled blades — 36 blades per rack = 144 GPUs + 72 CPUs at 50-60 kW, roughly 10x the thermal density of a typical IDC. SGI ICE XA infrastructure: Intel Omni-Path spine switches in a full-bisection fat-tree network with 432 Tbit/s bidirectional bandwidth; 9 compute blades per 18-port switch pair, 60 sets (540 nodes), 120 switches in total.
  • 20. Japanese open supercomputing sites, August 2017, ranked by peak double-precision performance (HPCI sites were highlighted in pink on the slide):
    1. U-Tokyo / U-Tsukuba JCAHPC — Oakforest-PACS, PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path: 24.9 PF (Nov. 2016 Top500 #6)
    2. Tokyo Institute of Technology GSIC — TSUBAME 3.0, HPE/SGI ICE-XA custom, NVIDIA Pascal P100 + Intel Xeon, Intel Omni-Path: 12.1 PF (N/A)
    3. RIKEN AICS — K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect, Fujitsu: 11.3 PF (#7)
    4. Tokyo Institute of Technology GSIC — TSUBAME 2.5, Cluster Platform SL390s G7, Xeon X5670 6C 2.93 GHz, InfiniBand QDR, NVIDIA K20x, NEC/HPE: 5.71 PF (#40)
    5. Kyoto University — Camphor 2, Cray XC40, Intel Xeon Phi 68C 1.4 GHz: 5.48 PF (#33)
    6. Japan Aerospace Exploration Agency (JAXA) — SORA-MA, Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 1.98 GHz, Tofu interconnect 2: 3.48 PF (#30)
    7. Information Technology Center, Nagoya University — Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 2.2 GHz, Tofu interconnect 2: 3.24 PF (#35)
    8. National Institute for Fusion Science (NIFS) — Plasma Simulator, Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 1.98 GHz, Tofu interconnect 2: 2.62 PF (#48)
    9. Japan Atomic Energy Agency (JAEA) — SGI ICE X, Xeon E5-2680v3 12C 2.5 GHz, InfiniBand FDR: 2.41 PF (#54)
    10. AIST AI Research Center (AIRC) — AAIC (AIST AI Cloud), NEC/SMC cluster, NVIDIA Pascal P100 + Intel Xeon, InfiniBand EDR: 2.2 PF (N/A)
  • 21. Floating-point precision for big data / AI. Double precision (64-bit): numerical analysis and simulation. Single precision (32-bit): computer graphics and gaming. Half precision (16-bit) and integer arithmetic: big data and machine learning / AI. [Chart: comparison of Japanese centres by compute performance at half precision or better ("AI-FLOPS") — RIKEN (K computer), U-Tokyo (Oakforest-PACS / JCAHPC, Reedbush-U/H), and Tokyo Tech (TSUBAME3.0 + 2.5 + KFC).] The combined TSUBAME3 + 2.5 + KFC systems deliver 65.8 petaflops of machine-learning / AI performance — the highest in Japan across all supercomputers and clouds — with roughly 6,700 GPUs and 4,000 CPUs.
  • 22. TSUBAME3.0 datacenter: 15 SGI ICE-XA racks, 2 network racks, and 3 DDN storage racks — 20 racks in total. Compute racks are cooled with 32°C warm water, using year-round ambient cooling. Average PUE = 1.033.
  • 23. TSUBAME3.0 cooling system diagram: highly efficient warm-water cooling using 32°C water chilled only by the ambient air. The compute nodes (SGI ICE XA) sit on a warm-water loop (supply ~32°C, return ~40°C) that is heat-exchanged against an outdoor ambient-air evaporative cooling tower. The I/O and file systems, together with 10-15% of the system heat plus environmental latent heat, are air-cooled at 22°C by air conditioners fed from a chiller-backed cold-water loop for peripheral equipment (supply ~14°C, return ~21°C), with a 200 kW auxiliary cooling circuit; distribution headers connect the loops across the roof, basement, and machine-room floors.
  • 24. TSUBAME3.0: world-top-class cooling efficiency. PUE expresses the cooling overhead; the closer to 1.0, the better. PUE = (compute-node power + power of all facilities needed to cool the compute nodes) / (compute-node power). A typical datacenter has PUE = 2-3 (cooling consumes more power than the machines); modern air-cooled datacenters and TSUBAME1: PUE = 1.4-1.6; TSUBAME2.0: PUE = 1.28; TSUBAME-KFC: PUE = 1.09. Predicted TSUBAME3.0 PUE (assuming 900 kW consumption, computed from 2013-2015 weather data): average cooling-facility power of 28.83-32.00 kW per month (annual mean 29.465 kW), giving monthly PUE values of 1.032-1.036 and an annual average of 1.033 — world top class.
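The PUE definition above can be applied directly to the slide's assumed numbers:

```python
# Worked PUE example using the stated assumptions (900 kW IT load,
# ~29.5 kW annual-average cooling power, ~32 kW in the warmest month).

def pue(it_power_kw, cooling_power_kw):
    """PUE = (IT power + cooling power) / IT power."""
    return (it_power_kw + cooling_power_kw) / it_power_kw

annual_avg = pue(900.0, 29.465)      # -> ~1.033, the annual average quoted above
august_peak = pue(900.0, 32.0)       # -> ~1.036, the August (warmest) value
print(f"annual average PUE ~ {annual_avg:.3f}, August PUE ~ {august_peak:.3f}")
```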
  • 27. Fine-grained resource allocation on TSUBAME3.0. Each node (2 CPUs, 4 GPUs, 4 NICs, 2 memory banks) can be shared among multiple jobs, for example: job 1 — 2 CPU cores, NIC0, GPU1, 32 GB memory; job 2 — 8 CPU cores, 64 GB memory; job 3 — 4 CPU cores, GPU0, 16 GB memory; job 4 — 8 CPU cores, 80 GB memory; job 5 — 4 CPU cores, NIC2 & 3, GPU2 & 3, 48 GB memory. (The figure shows fewer CPU cores than the real node for clarity.)
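A minimal sketch of the partial-node packing idea, with a hypothetical per-node core count and a naive first-fit policy — this is not the actual TSUBAME3.0 scheduler:

```python
# Toy first-fit packing of the slide's five job requests onto one node.
# The core count and the flat GPU/memory pools are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Node:
    cores: int = 28                                   # assumed core count
    gpus: set = field(default_factory=lambda: {0, 1, 2, 3})
    mem_gb: int = 256

    def try_allocate(self, cores, num_gpus, mem_gb):
        if cores > self.cores or mem_gb > self.mem_gb or num_gpus > len(self.gpus):
            return None                               # request does not fit
        self.cores -= cores
        self.mem_gb -= mem_gb
        return {self.gpus.pop() for _ in range(num_gpus)}

node = Node()
requests = [("job1", 2, 1, 32), ("job2", 8, 0, 64), ("job3", 4, 1, 16),
            ("job4", 8, 0, 80), ("job5", 4, 2, 48)]
for name, cores, num_gpus, mem in requests:
    gpus = node.try_allocate(cores, num_gpus, mem)
    status = f"{cores} cores, {mem} GB, GPUs {sorted(gpus)}" if gpus is not None else "rejected"
    print(f"{name}: {status}")
print(f"left on node: {node.cores} cores, {node.mem_gb} GB, {len(node.gpus)} GPUs")
```

All five example requests fit on one node, which is the point of fine-grained allocation: small jobs do not consume whole fat nodes.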
  • 28. JST-CREST "Extreme Big Data" (EBD) project (2013-2018): converging FLOPS-centric HPC with BYTES-centric BD/AI. Today's supercomputers are compute- and batch-oriented and more fragile, while cloud IDCs are highly available and resilient but have very low bandwidth and efficiency. The project targets a convergent architecture (phases 1-4) with large-capacity NVM and a high-bisection network: nodes combining a high-powered main CPU and low-power CPUs with DRAM and NVM/flash stacked via PCB/TSV interposers, 2 Tbps HBM with 4-6 HBM channels, 1.5 TB/s of DRAM & NVM bandwidth, and 30 PB/s of I/O bandwidth — potentially one yottabyte per year. EBD system software, including an EBD object system, graph store, EBD bag, and EBD KVS, is co-designed with application kernels: large-scale metagenomics (a multi-GPU read-alignment algorithm — Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka, TITECH), massive sensors and data assimilation in weather prediction, and ultra-large-scale graphs for social infrastructures. The central question: given a top-class supercomputer, how much faster can we run next-generation big-data workloads compared with conventional clouds? This raises co-design issues across architecture, algorithms, and system software, as well as performance models and accelerators.
  • 29. AIST–Tokyo Tech Real-World Big-Data Computation Open Innovation Laboratory (RWBC-OIL): building a platform for exploiting real-society big data. Tokyo Tech strengths (hardware construction): world-top-class large-scale supercomputer construction; energy-efficient computing such as TSUBAME-KFC (Kepler Fluid Cooling) with an extensive operational track record; construction of fast deep-learning platforms; large-scale simulation; statistical-physics-based modelling; bioinformatics analysis; and system-integration technology connecting machine-learning data-processing machines with high-performance computers. AIST strengths (big-data software): analysis of large-scale environmental measurement data, service design, probabilistic modelling, multi-dimensional data analysis and visualization; AIST leads collaboration with industry and commercialization of the developed technology. The OIL (1) establishes an open platform for big-data processing and (2) develops data-processing technologies that exploit big data — a processing system combining TSUBAME, GPGPU machines and next-generation computers with an environment for probabilistic modelling, data mining, visualization and simulation. Real-society big data describing individual and collective behaviour comes from companies and local governments (biometric, personal-behaviour, and group-behaviour data — healthcare, pharma, electronics makers, retail, transportation, e-commerce, telecom carriers), from open data such as the regional economy analysis system RESAS and regional activity data, from elderly health-survey samples (JAGES), and from academic data held by universities, research institutes, and medical institutions. The platform is offered broadly to industry (banking/securities/insurance, real estate, manufacturing, healthcare, e-commerce providers, think tanks) to create value from their own big data — improving existing services, creating new services, or selling to other companies — acting as a bridge to enterprises.
  • 30. Sparse BYTES: the Graph500 — world #1 four times in 2015-2016 with the K computer (Tokyo Tech [Matsuoka EBD CREST], Kyushu Univ. [Fujisawa Graph CREST], RIKEN AICS, Fujitsu). Results by list: November 2013, rank 4, 5,524.12 GTEPS (top-down only); June 2014, rank 1, 17,977.05 GTEPS (efficient hybrid); November 2014, rank 2, 19,585.2 GTEPS (efficient hybrid); June/November 2015 and June/November 2016, rank 1, 38,621.4 GTEPS (hybrid + node compression). A BYTES-rich machine plus a superior BYTES algorithm: the K computer's 88,000 nodes, 660,000 CPU cores, 1.3 PB of memory and 20 GB/s Tofu network outperform LLNL-IBM Sequoia (1.6 million CPUs, 1.6 PB memory) and TaihuLight (10 million CPUs, 1.3 PB memory). [Chart: elapsed BFS time from 64 nodes (scale 30) to 65,536 nodes (scale 40), split into computation and communication — 73% of total execution time is spent waiting in communication.] Effectively 13x the performance relative to the Linpack ranking: #1 at 38,621.4 GTEPS (only #7 on the Top500 at 10.51 PF), versus #2 at 23,755.7 GTEPS (Top500 #1 at 93.01 PF) and #3 at 23,751 GTEPS (Top500 #4 at 17.17 PF). BYTES, not FLOPS!
  • 31. K computer No. 1 on the Graph500 for the 4th consecutive time. What is the Graph500 benchmark? A supercomputer benchmark for data-intensive applications that ranks machines by the performance of breadth-first search (BFS) over a very large graph. [Chart: Graph500 performance in GTEPS from June 2012 to June 2016 for the K computer (Japan), Sequoia (U.S.A.), and Sunway TaihuLight (China), with the K computer at No. 1.] This is achieved by a combination of high machine performance and software optimization: an efficient sparse-matrix representation with bitmaps, vertex reordering for bitmap optimization, optimized inter-node communication, load balancing, and so on. Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka, "Efficient Breadth-First Search on Massively Parallel and Distributed Memory Machines", in proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016.
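For reference, the kernel that Graph500 times is breadth-first search. A minimal single-node, level-synchronous version is sketched below; the cited work instead uses a distributed, direction-optimizing (top-down/bottom-up hybrid) BFS with bitmap frontiers.

```python
# Minimal level-synchronous BFS over an adjacency list -- the kernel Graph500 measures.

from collections import deque

def bfs_levels(adj, source):
    """Return the BFS level (hop count) of every vertex reachable from `source`."""
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in level:          # first visit fixes the BFS level
                level[w] = level[v] + 1
                frontier.append(w)
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_levels(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

The work per edge is a handful of integer operations plus a memory access to an essentially random location, which is why BFS performance tracks memory and network bandwidth rather than FLOPS.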
  • 32. Distributed large-scale dynamic graph data store (work with LLNL, [SC16 etc.]). A dynamic graph application streams edges into a large-scale dynamic graph data store spread across compute nodes, each mmap-ing NVRAM. Node-level design: follows an adjacency-list format and leverages open-address hashing to construct its tables, extended to multiple processes using an asynchronous MPI communication framework. Multi-node experiments (24 processes per node) reach about 2 billion edge insertions per second; comparisons are against STINGER (a state-of-the-art dynamic graph processing framework developed at Georgia Tech) and a baseline model (a naive implementation using the Boost C++ library and the MPI communication framework), with the DegAwareRHH store achieving up to 212x speedup over the baseline for dynamic graph construction (in-memory & NVM), c.f. STINGER (single node, in memory). Building on the K computer results — large memory, but expensive DRAM only — the goal is to develop algorithms and software exploiting large hierarchical memories, adapting to (1) deep memory hierarchies and (2) rapid dynamic graph changes: a dynamic graph store with the world's top graph-update performance and scalability. K. Iwabuchi, S. Sallinen, R. Pearce, B. Van Essen, M. Gokhale, and S. Matsuoka, "Towards a Distributed Large-Scale Dynamic Graph Data Store", 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
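A toy illustration of the underlying data structure: an adjacency list whose vertex table uses open addressing (linear probing here; the actual DegAwareRHH store uses degree-aware Robin Hood hashing over NVRAM-backed tables).

```python
# Adjacency-list store keyed by an open-addressing hash table (no resizing, for brevity).

class AdjacencyStore:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity          # each slot: (vertex, [neighbors])

    def _probe(self, v):
        i = hash(v) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != v:
            i = (i + 1) % len(self.slots)       # linear probing on collision
        return i

    def insert_edge(self, u, v):
        i = self._probe(u)
        if self.slots[i] is None:
            self.slots[i] = (u, [])
        self.slots[i][1].append(v)

    def neighbors(self, u):
        slot = self.slots[self._probe(u)]
        return slot[1] if slot else []

store = AdjacencyStore()
for u, v in [(0, 1), (0, 2), (1, 2), (2, 0)]:   # streaming edge insertions
    store.insert_edge(u, v)
print(store.neighbors(0))                        # [1, 2]
```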
  • 33. Large-scale graph colouring (vertex colouring): colour each vertex with the minimal number of colours so that no two adjacent vertices share a colour. Our dynamic graph-colouring algorithm on DegAwareRHH is compared against (1) two static algorithms, including GraphLab, and (2) another graph-store implementation running the same dynamic algorithm (Dynamic-MAP). Scott Sallinen, Keita Iwabuchi, Roger Pearce, Maya Gokhale, Matei Ripeanu, "Graph Coloring as a Challenge Problem for Dynamic Graph Processing on Distributed Systems", SC'16.
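As a baseline for intuition, the classic sequential greedy colouring looks like this; the SC'16 work above instead colours a streaming, distributed graph incrementally.

```python
# Minimal greedy (first-fit) vertex colouring sketch.

def greedy_colour(adj):
    """Assign each vertex the smallest colour not used by already-coloured neighbours."""
    colour = {}
    for v in adj:                                    # visiting order affects colour count
        used = {colour[w] for w in adj[v] if w in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(greedy_colour(adj))   # e.g. {0: 0, 1: 1, 2: 2, 3: 0} -- no adjacent pair shares a colour
```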
  • 34. Incremental graph community detection. Background: community detection for large-scale, time-evolving, dynamic graphs is an important research problem in graph computing, and it is wasteful to recompute communities over the entire graph from scratch every time. Proposal: an incremental community-detection algorithm based on the core procedures of the state-of-the-art DEMON algorithm — Ego-Minus-Ego, label propagation, and merge — achieving roughly 69.2x to 101.5x speedups on the evaluated graphs. Hiroki Kanezashi and Toyotaro Suzumura, "An Incremental Local-First Community Detection Method for Dynamic Graphs", Third International Workshop on High Performance Big Graph Data Management, Analysis, and Mining (BigGraphs 2016).
  • 35. GPU-based distributed sorting [Shamoto, IEEE BigData 2014; IEEE Trans. Big Data 2015]. Sorting is a kernel algorithm for various EBD processing tasks. Fast sorting methods include distributed sorting for distributed systems — splitter-based parallel sort, radix sort, merge sort — and sorting on heterogeneous architectures. Many sorting algorithms are accelerated by many cores and high memory bandwidth, but sorting for large-scale heterogeneous systems remains an open question. We develop and evaluate a bandwidth- and latency-reducing GPU-based HykSort on TSUBAME2.5 via latency hiding, and are preparing to release the sorting library (EBD algorithm kernels).
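The splitter-based (sample sort) idea can be sketched on a single process; in HykSort the buckets live on different MPI ranks and the local sorts run on GPUs.

```python
# Splitter-based (sample sort) skeleton: sample splitters, partition keys into buckets,
# then sort each bucket locally and concatenate.

import random

def splitter_sort(data, parts=4):
    sample = sorted(random.sample(data, min(len(data), parts * 8)))
    splitters = [sample[i * len(sample) // parts] for i in range(1, parts)]

    buckets = [[] for _ in range(parts)]
    for x in data:                                   # route each key to its bucket
        i = sum(x >= s for s in splitters)           # number of splitters <= x
        buckets[i].append(x)

    return [x for b in buckets for x in sorted(b)]   # locally sort and concatenate

data = [random.randint(0, 10**6) for _ in range(10_000)]
assert splitter_sort(data) == sorted(data)
print("splitter-based sort matches sorted() on", len(data), "keys")
```

In the distributed setting the bucket routing step becomes an all-to-all exchange, which is why interconnect bandwidth dominates the runtime reported on the next slide.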
  • 36. [Chart: weak-scaling throughput in billions of keys/second versus the number of processes (2 per node) for HykSort with 1 thread, 6 threads, and GPU + 6 threads; annotated speedups of x1.4, x3.61, and x389, with the GPU version reaching about 0.25 TB/s.] Performance prediction: a x2.2 speedup over the CPU-based implementation if the PCIe bandwidth between CPU and GPU grows to 50 GB/s, and an 8.8% reduction of overall runtime if the accelerators become 4 times faster than the K20x. Weak-scaling experiment (grand challenge run on TSUBAME2.5): 1-1,024 nodes (2-2,048 GPUs), 2 processes per node, 2 GB of 64-bit integers per node. C.f. Yahoo/Hadoop TeraSort: 0.02 TB/s (including I/O). GPU implementation of splitter-based sorting (HykSort).
  • 37. Xtr2sort: out-of-core sorting acceleration using GPU and flash NVM [IEEE BigData 2016]. A sample-sort-based out-of-core sorting approach for deep-memory-hierarchy systems with GPU and flash NVM: I/O chunking to fit the device-memory capacity of the GPU, and pipeline-based latency hiding that overlaps data transfers between NVM, CPU, and GPU using asynchronous transfers (e.g., cudaMemcpyAsync(), libaio). [Pipeline diagram: each chunk flows through read (RD), read-to-host (R2H), host-to-device (H2D), GPU execution (EX), device-to-host (D2H), host-to-write (H2W), and write (WR) stages, with successive chunks overlapped in time across GPU, CPU, and NVM.] The approach delivers a reported x4.39 improvement; the broader question is how to combine deepening memory layers for future HPC / big-data workloads targeting the post-Moore era. A BYTES-centric HPC algorithm: combining the GPU's bandwidth for fast sorting with the large capacity of non-volatile memory.
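The chunked pipeline can be sketched with three overlapped stages; the sleeps below stand in for libaio reads/writes, cudaMemcpyAsync transfers, and the GPU sort kernel, so this is only a shape-of-the-idea sketch, not the Xtr2sort implementation.

```python
# Three-stage pipeline over fixed-size chunks: read -> sort -> write, overlapped via
# bounded queues so that chunk i+1 is being read while chunk i is being sorted.

import threading, queue, time

def stage(name, inq, outq, cost):
    while True:
        chunk = inq.get()
        if chunk is None:                 # poison pill terminates the stage
            if outq: outq.put(None)
            break
        time.sleep(cost)                  # pretend to read / sort / write the chunk
        print(f"{name:5s} finished chunk {chunk}")
        if outq: outq.put(chunk)

read_q, sort_q, write_q = queue.Queue(2), queue.Queue(2), queue.Queue(2)
threads = [threading.Thread(target=stage, args=a) for a in
           [("read", read_q, sort_q, 0.02),
            ("sort", sort_q, write_q, 0.02),
            ("write", write_q, None, 0.02)]]
for t in threads: t.start()
for c in range(6): read_q.put(c)          # stream 6 chunks through the pipeline
read_q.put(None)
for t in threads: t.join()
```

With the stages overlapped, total time approaches the cost of the slowest stage times the number of chunks, rather than the sum of all stages, which is the latency-hiding effect the slide describes.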
  • 38. Hierarchical, UseR-level and ON-demand File system (HuronFS) (IEEE ICPADS 2016, with LLNL). HuronFS provides dedicated, dynamically created instances that act as a "burst buffer" for caching data: I/O requests from compute nodes are forwarded to HuronFS instead of going straight to the parallel file system. The whole system consists of several SHFS (sub-HuronFS) units, with the workload distributed among them by a static hash of the file path. Each SHFS consists of a master and several IOnodes: the master controls all IOnodes in its SHFS and handles all I/O requests, while the IOnodes store the actual data and transfer it to and from the compute nodes. HuronFS supports TCP/IP and InfiniBand (via the CCI framework), and both FUSE and LD_PRELOAD interfaces.
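A sketch of the routing rule: hash the file path to pick an SHFS and, as an assumption here, an IOnode within it. The hash function and layout are illustrative, not the actual HuronFS code.

```python
# Path-hash routing sketch: file path -> (SHFS index, IOnode index).

import hashlib

def route(path, num_shfs, ionodes_per_shfs):
    digest = hashlib.md5(path.encode()).digest()
    shfs = int.from_bytes(digest[:4], "little") % num_shfs
    ionode = int.from_bytes(digest[4:8], "little") % ionodes_per_shfs
    return shfs, ionode

for p in ["/data/run1/ckpt.0", "/data/run1/ckpt.1", "/data/run2/log.txt"]:
    s, i = route(p, num_shfs=4, ionodes_per_shfs=8)
    print(f"{p:22s} -> SHFS {s}, IOnode {i}")
```

Because the mapping depends only on the path, every compute node computes the same destination without coordination, which is what lets the burst buffer scale out.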
  • 39. HuronFS basic I/O performance. [Charts: read/write latency breakdown (FUSE overhead, process, epoll overhead, communication) and throughput for IPoIB versus CCI — single-client throughput, single-IOnode throughput, and scaling from 1 to 4 client nodes with 8 processes per node.] Testbed: InfiniBand 4x FDR 56 Gb/s (Mellanox), Intel Xeon E5-2650 v3 @ 2.30 GHz, 251 GB of memory.
  • 40. Plans: continue research on automatic buffer allocation, and utilize the computational power of the IOnodes for in-memory data preprocessing and format conversion between jobs — job N writes, the IOnode converts the format in place, and job N+1 reads — instead of shipping the data back and forth over the network.
  • 41. GPU-based fast signal processing for large amounts of snore sound data. Background: snore sound (SnS) data carry very important information for the diagnosis and evaluation of primary snoring and obstructive sleep apnea (OSA); with the growing volume of SnS data collected from subjects, handling such large amounts is a major challenge. In this study we use GPUs to accelerate feature extraction from a large amount of SnS data collected at two hospitals in China and Germany. Data: 57 subjects (China + Germany), 187.75 hours in total, 31.10 GB, WAV format, 16 kHz mono. Acoustic features: 11 acoustic features are extracted from the SnS data and can be visualized to help doctors and specialists diagnose, research, and treat the diseases efficiently. Result: taking 1 CPU (Python 2.7 with NumPy 1.10.4 and SciPy 0.17) processing one subject's data as the baseline, the GPU-based system is almost 4.6x faster than the CPU implementation; the speed-up decreases as data size grows, which we attribute to data transfers not being hidden by other computation, as they would be in a real-world application. Jian Guo, Kun Qian, Huijie Xu, Christoph Janott, Bjorn Schuller, Satoshi Matsuoka, "GPU-Based Fast Signal Processing for Large Amounts of Snore Sound Data", 5th IEEE Global Conference on Consumer Electronics (GCCE 2016), October 11-14, 2016.
  • 42. Open-source release of EBD system software (installed on TSUBAME3 / Amazon / ABCI):
    - mrCUDA — rCUDA extension enabling remote-to-local GPU migration; https://github.com/EBD-CREST/mrCUDA; GPL 3.0; co-funded by NVIDIA.
    - HuronFS (with LLNL) — I/O burst buffer for inter-cloud environments; https://github.com/EBD-CREST/cbb; Apache License 2.0; co-funded by Amazon.
    - ScaleGraph Python — Python extension for the ScaleGraph X10-based distributed graph library; https://github.com/EBD-CREST/scalegraphpython; Eclipse Public License v1.0.
    - GPUSort — GPU-based large-scale sort; https://github.com/EBD-CREST/gpusort; MIT License.
    - Others, including the dynamic graph store.
  • 43. HPC and BD/AI convergence example [Yutaka Akiyama, Tokyo Tech]: genomics — oral/gut metagenomics with ultra-fast sequence analysis; protein-protein interactions — an exhaustive PPI prediction system and pathway predictions; drug discovery — fragment-based virtual screening and learning-to-rank virtual screening. References: Ohue et al., Bioinformatics (2014); Suzuki et al., Bioinformatics (2015); Suzuki et al., PLOS ONE (2016); Matsuzaki et al., Protein Pept Lett (2014); Suzuki et al., AROB2017 (2017); Yanagisawa et al., GIW (2016); Yamasawa et al., IIBMP (2016).
  • 44. EBD vs. EBD: large-scale homology search for metagenomics. Next-generation sequencers reveal uncultured microbiomes and novel genes in various environments (human body, sea, soil) and have been applied to human health in recent years; both the measurement data and the reference databases keep growing. With O(n) measured data against an O(m) reference database, the correlation / similarity search costs O(mn) — extreme big data against extreme big data. Example: metagenomic analysis of periodontitis patients with Tokyo Dental College (Prof. Kazuyuki Ishihara) — a comparative metagenomic analysis between healthy persons and patients, detecting high-risk microorganisms and characterizing taxonomic composition and metabolic pathways.
  • 45. Development of Ultra-fast Homology Search Tools • GHOSTZ: subsequence clustering makes it x240 faster than the conventional algorithm (BLAST) for 10,000 query sequences against a 3.9 GB database on 1 CPU core (Suzuki et al., Bioinformatics, 2015) [chart: computational time, log scale] • GHOSTZ-GPU: multithreading on GPUs gives a x70 speedup over 1 core when using 12 cores + 3 GPUs on a TSUBAME 2.5 thin node (Suzuki et al., PLOS ONE, 2016) [chart: speed-up ratio for 1C, 1C+1G, 12C+1G, 12C+3G] • GHOST-MP: MPI + OpenMP hybrid parallelization retains strong scaling up to 100,000 cores on TSUBAME 2.5 and runs x80〜x100 faster than mpi-BLAST (Kakuta et al., submitted); a toy sketch of the seed-indexing idea behind such tools follows below
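The seed-indexing idea that fast homology search tools build on can be illustrated with a tiny toy example; the sketch below is not the GHOSTZ algorithm (which uses subsequence clustering and many further optimizations), just the basic k-mer index that avoids an all-pairs O(mn) comparison, with an arbitrary seed length and made-up sequences.

```python
# Toy k-mer index for sequence similarity search (illustrative of the
# seed-based idea behind fast homology search; NOT the GHOSTZ algorithm).
from collections import defaultdict

K = 8  # seed length (arbitrary choice for this sketch)

def build_index(reference_seqs):
    """Map every k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sid, seq in enumerate(reference_seqs):
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].append((sid, i))
    return index

def candidate_hits(query, index):
    """Count shared k-mer seeds per reference sequence (candidates to extend)."""
    hits = defaultdict(int)
    for i in range(len(query) - K + 1):
        for sid, _ in index.get(query[i:i + K], ()):
            hits[sid] += 1
    return sorted(hits.items(), key=lambda kv: -kv[1])

refs = ["ACGTACGTGGCCTTAAGGCTA", "TTGGCCAACGGTTACGTACGA", "GATTACAGATTACAGATTACA"]
idx = build_index(refs)
print(candidate_hits("CCTTAAGGCTAACGT", idx))   # ranked candidate references
```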
  • 46. Plasma Protein Binding (PPB) Prediction by Machine Learning: application to peptide drug discovery • Problems: candidate peptides tend to be degraded and excreted faster than small-molecule drugs; there is a strong need to design bio-stable peptides as drug candidates; previous PPB prediction software for small molecules cannot predict peptide PPB [scatter plot: predicted vs. experimental values] • Solutions: compute more than 500 feature values per peptide (LogS, LogP, MolWeight, SASA, polarity, ...) and combine them to build a predictive model of the PPB value; the constructed model explains peptide PPB well (R² = 0.905, predicted vs. experimental); a hedged sketch of this kind of feature-based regression follows below
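A hedged, minimal sketch of this kind of feature-based regression is shown below; it is not the actual Tokyo Tech pipeline, and the CSV file name and column names are hypothetical stand-ins for the more than 500 real descriptors mentioned on the slide.

```python
# Minimal sketch (NOT the actual PPB pipeline): regress a PPB value from
# precomputed molecular descriptors. "peptide_features.csv" and its columns
# ("LogS", "LogP", ..., "PPB") are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("peptide_features.csv")            # descriptors + measured PPB
X = df.drop(columns=["PPB"]).values                  # feature matrix (>500 columns)
y = df["PPB"].values                                 # experimental PPB values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out peptides:", r2_score(y_te, model.predict(X_te)))
```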
  • 47. Molecular Dynamics Simulation for Membrane Permeability: application to peptide drug discovery • Example: the sequence D-Pro, D-Leu, D-Leu, L-Leu, D-Leu, L-Tyr has membrane permeability 7.9x10⁻⁶ cm/s, while the single-residue mutant D-Pro, D-Leu, D-Leu, D-Leu, D-Leu, L-Tyr has 0.045x10⁻⁶ cm/s (x0.006) • Problems: (1) a single residue mutation can drastically change membrane permeability; (2) standard MD simulation cannot follow membrane permeation, which is a millisecond-order (or slower) phenomenon; e.g. for a 40 Å membrane and permeability 7.9x10⁻⁶ cm/s, the naive estimate 40 Å / 7.9x10⁻⁶ cm/s gives roughly 5x10⁻² s (see the short calculation below) • Solutions: (1) apply enhanced sampling such as supervised MD (SuMD) and metadynamics (MTD) over collective variables / free energy; (2) GPU acceleration (GROMACS and DESMOND MD engines on GPU) and massively parallel computation, so that millisecond-order phenomena can be simulated and hundreds of peptides can be calculated simultaneously on TSUBAME
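The naive permeation-time estimate above is just the membrane thickness divided by the permeability coefficient treated as an effective crossing speed; the short calculation below reproduces it (illustration only, using the numbers quoted on the slide).

```python
# Naive permeation-time estimate from the slide's numbers: thickness divided
# by the permeability coefficient treated as an effective crossing speed.
thickness_cm = 40e-8                    # 40 Angstrom = 40e-8 cm
permeability_cm_per_s = 7.9e-6
t_seconds = thickness_cm / permeability_cm_per_s
print(f"~{t_seconds * 1e3:.0f} ms")     # far beyond the ns-us reach of standard MD
```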
  • 48. RWBC-OIL 2-3: Tokyo Tech IT-Drug Discovery Factory, Simulation & Big Data & AI at Top HPC Scale (Tonomachi, Kawasaki city: planned 2017, PI Yutaka Akiyama) • Tokyo Tech's research seeds: ① drug-target selection system, ② Glide-based virtual screening, ③ novel algorithms for fast virtual screening against huge databases • New drug-discovery platform especially for specialty peptides and nucleic acids: plasma binding (ML-based) and membrane penetration (molecular dynamics simulation) • Minister of Health, Labour and Welfare Award of the 11th annual Merit Awards for Industry-Academia-Government Collaboration • TSUBAME's GPU environment allows world's top-tier virtual screening (Yoshino et al., PLOS ONE (2015); Chiba et al., Sci Rep (2015)); fragment-based efficient algorithm designed for 100-million-compound datasets (Yanagisawa et al., GIW (2016)) • Application projects: a drug-discovery platform powered by supercomputing and machine learning, with investments from the Japanese government, Tokyo Tech (TSUBAME SC), municipal government (Kawasaki), and Japanese & US pharma • Multi-petaflops compute and peta- to exabyte data processing: a continuously cutting-edge, large-scale HPC & BD/AI infrastructure is absolutely necessary
  • 49. Estimated Compute Resource Requirements for Deep Learning [Source: Preferred Networks Japan Inc.] (chart: 2015-2030 timeline, 10 PFlops to 100 EFlops; P: Peta, E: Exa, F: Flops) • ~10P Flops and up: 5,000 hours of speech data from 10,000 speakers, trained together with 100,000 hours of artificially generated speech [Baidu 2015] • Image/Video Recognition, 10P (image) to 10E (video) Flops: training data of 100 million images, 10,000-class classification, 6 months on several thousand nodes [Google 2015] • Bio / Healthcare, 100P to 1E Flops: roughly 10M SNPs per person in genome analysis; 100 PFlops for 1 million people, 1 EFlops for 100 million people • Robots / Drones, 1E to 100E Flops: 1 TB per unit per year, training on data collected from 1 million to 100 million units • Auto Driving, 1E to 100E Flops: 1 TB per autonomous car per day, training on 100 days of driving data from 10 to 1,000 cars • Machine learning and deep learning become more accurate as the training data grows; today the data is mostly human-generated, but in the future it will be machine-generated • Each estimate assumes that 1 TFlops is needed to train on 1 GB of data in one day, i.e. to complete the learning phase in one day (see the calculator sketch below) • It's the FLOPS (in reduced precision) and BW! So both are important in the infrastructure
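The slide's stated rule of thumb (roughly 1 TFlops of sustained compute per 1 GB of training data to finish training in one day) lends itself to a quick back-of-envelope calculation; the sketch below simply applies that assumption, and the example dataset sizes fed to it are illustrative rather than taken from the slide.

```python
# Back-of-envelope calculator for the slide's stated rule of thumb:
# training on 1 GB of data in one day requires ~1 TFlops sustained.
# The example dataset sizes below are illustrative assumptions.
def required_flops(dataset_gb: float, days: float = 1.0) -> float:
    """Sustained FLOP/s needed to finish training within `days`."""
    tflops = dataset_gb / days          # 1 TFlops per GB per day
    return tflops * 1e12                # TFlops -> FLOP/s

examples = {
    "1000 cars x 100 days x 1 TB/day": 1_000 * 100 * 1_000,   # GB
    "100 M images of ~100 KB each":    100e6 * 100e3 / 1e9,   # GB
}
for name, gb in examples.items():
    print(f"{name}: ~{required_flops(gb) / 1e15:.1f} PFLOP/s sustained")
```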
  • 50. Fast and cost-effective deep learning algorithm platform for video processing in social infrastructure • Principal Investigator: Koichi Shinoda • Collaborators: Satoshi Matsuoka, Tsuyoshi Murata, Rio Yokota, Tokyo Institute of Technology (members of RWBC-OIL 1-1 and 2-1) • JST CREST "Development and Integration of Artificial Intelligence Technologies for Innovation Acceleration"
  • 51. Background • Video processing in a smart society for safety and security: intelligent transport systems (drive recorder video), security systems (surveillance camera video) • Deep learning: much higher performance than before, but IT giants with large computational resources have formed a monopoly • Problems: real-time, accurate recognition of small objects and their movement; edge computing without heavy traffic on the Internet; a flexible training framework that can adapt rapidly to environmental changes
  • 52. Research team [diagram: Shinoda G, Murata G, Matsuoka G, and Yokota G at Tokyo Tech together with AIST AIRC, covering fast deep learning, minimizing network size, CPU/GPU node-parallel processing, and the system layer; collaborators: Argonne National Laboratory and the University of Chicago (system reference), Denso / Denso IT Lab and Toyota InfoTechnology Center (application)]
  • 53. Example AI Research: Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers • Background: in large-scale asynchronous stochastic gradient descent (ASGD), mini-batch size and gradient staleness tend to be large and unpredictable, which increases the error of the trained DNN [diagram: in DNN parameter space, an update W(t+1) = W(t) - ηΣi∇Ei applied immediately has staleness 0, whereas two asynchronous updates arriving during a gradient computation give staleness 2; N_Subbatch = number of samples per GPU iteration] • Proposal: we propose an empirical performance model for the ASGD deep learning system SPRINT that predicts the probability distributions of mini-batch size and staleness [plots: measured vs. predicted distributions of mini-batch size and staleness for 4, 8, and 16 nodes and several N_Subbatch values] (a toy staleness simulation follows below) • Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016
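To make the notion of gradient staleness concrete, here is a small self-contained toy simulation (it is not the SPRINT system or the paper's performance model): workers with random gradient-computation times send updates to a parameter server, and an update's staleness is the number of global updates applied between the worker's parameter read and the application of its gradient; the exponential timing distribution is an arbitrary assumption.

```python
# Toy ASGD staleness simulation (illustrative only; not the SPRINT model).
# Each worker reads the current global parameter version, spends a random
# time computing a gradient, and its update's staleness is the number of
# global updates other workers applied in the meantime.
import heapq
import random

random.seed(0)
n_workers, n_updates = 8, 10_000
global_version = 0
staleness = []

# Event queue of (finish_time, worker_id, version_read_at_start).
events = [(random.expovariate(1.0), w, 0) for w in range(n_workers)]
heapq.heapify(events)

while len(staleness) < n_updates:
    t, w, read_version = heapq.heappop(events)
    staleness.append(global_version - read_version)   # updates applied since read
    global_version += 1                                # apply this worker's update
    # The worker immediately starts computing its next gradient.
    heapq.heappush(events, (t + random.expovariate(1.0), w, global_version))

print(f"average staleness with {n_workers} workers:",
      sum(staleness) / len(staleness))
```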
  • 54. Performance Prediction of Future HW for CNN • Predicts the best performance with two future architectural extensions: FP16 (precision reduction to double the peak floating-point performance) and EDR IB (4x EDR InfiniBand (100 Gbps) upgrade from FDR (56 Gbps)) → not only the number of nodes but also a fast interconnect is important for scalability • Predicted best parameters for TSUBAME-KFC/DL, ILSVRC2012-dataset deep learning (average minibatch size 138±25%): Current HW: N_Node 8, N_Subbatch 8, epoch time 1779, average minibatch size 165.1; FP16: 7, 22, 1462, 170.1; EDR IB: 12, 11, 1245, 166.6; FP16 + EDR IB: 8, 15, 1128, 171.5 (SWoPP 2016, 2016/08/08)
  • 55. Hierarchical matrix (H-matrix) for CNN acceleration • Background: a hierarchical matrix (H-matrix) is an efficient data-sparse representation of certain densely populated matrices; it is an approximated form representing the n × n correlations of n objects, which would otherwise require a huge n × n dense matrix • Significant memory savings when compressed: O(n²) ⟹ O(kn log n) • Computational complexity of operations such as matrix-matrix multiplication, LU factorization, and inversion: O(n³) ⟹ O(k²n log²n) • Target application: CNN (Convolutional Neural Network) acceleration [figure: the H-matrix approximation of a dense matrix; red blocks are dense matrices, green blocks are low-rank matrices with rank k]
  • 56. Preliminary Results: compression rate of matrices • We can compress the matrix in some applications: bem1d (1-dimensional boundary element method) and sdpara (SDPARA, a parallel implementation of the interior-point method for semidefinite programming (SDP)); compression rate = (uncompressed size) / (compressed size) [chart: compression rate (%) vs. matrix size n × n from 1632 to 29056; bem1d goes from 15.80 to 1.33 and sdpara from 20.23 to 10.52 as n grows] • [bar charts: non-compressed vs. compressed sizes (MB) of Matrix A and Matrix B for SDPARA and for deep learning (CNN), with annotations "Matrix A successfully compressed!" and "Matrix B successfully compressed!"] • In the CNN application, SGEMM (single-precision general matrix-matrix multiplication) C = αAB + βC accounts for a large part of the computation (around 70%); example shapes: (m, n, k) = (1764, 1350, 178) and (324, 298, 1908)
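As a self-contained illustration of the low-rank idea behind H-matrix blocks (a sketch only, not the project's code), the snippet below compresses a smooth dense kernel block with a truncated SVD and reports the storage ratio and reconstruction error; the kernel, block size, and rank are arbitrary choices.

```python
# Minimal sketch of the low-rank block compression underlying H-matrices
# (illustrative only). A smooth, well-separated kernel block is approximated
# by rank-k factors, cutting storage from n*m entries to k*(n+m).
import numpy as np

n, m, k = 512, 512, 16                        # block size and truncation rank (arbitrary)
x, y = np.linspace(0.0, 1.0, n), np.linspace(2.0, 3.0, m)
A = 1.0 / np.abs(x[:, None] - y[None, :])     # 1/|x-y| kernel, sources/targets well separated

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k] * s[:k]                         # rank-k factors
Vk = Vt[:k, :]

rel_err = np.linalg.norm(A - Uk @ Vk) / np.linalg.norm(A)
ratio = (Uk.size + Vk.size) / A.size
print(f"rank {k}: relative error {rel_err:.2e}, storage ratio {ratio:.1%}")
```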
  • 57. Power optimization using Deep Q-Network (Kento Teranishi) • Background: existing research optimizes power by frequency control, but it requires detailed analysis and has low versatility, so we use deep learning for the analysis; power and execution time are modeled as P = f(x1, x2, ...) and Texe = g(x1, x2, ...) from inputs such as performance counters, temperature, and frequency • Objective: implement a computer control system using a Deep Q-Network (DQN), i.e. deep reinforcement learning that computes the action-value function Q with a neural network (used for game-playing AI, robot cars, and AlphaGo); the controller observes counters, power, frequency, temperature, etc. and issues frequency-control actions (a toy sketch follows below)
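The sketch below is a toy, DQN-flavored frequency controller written only to illustrate the idea on this slide; the environment model (how power and runtime respond to clock frequency), the state encoding, and all constants are made-up assumptions, and a real system would use measured counters, a replay buffer, and a target network.

```python
# Toy sketch of DQN-style frequency control (illustrative only; not the
# actual system). State = (load, normalized frequency, temperature placeholder),
# actions = lower / keep / raise the clock, reward = -(power + alpha * runtime).
import random
import torch
import torch.nn as nn

torch.manual_seed(0); random.seed(0)

FREQS = [1.0, 1.5, 2.0, 2.5]                        # GHz levels (hypothetical)
qnet = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 3))
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)

def reward(freq_idx, load):
    """Made-up model: power grows ~f^3, runtime shrinks ~1/f."""
    f = FREQS[freq_idx]
    return -((10 + 20 * load * f ** 3) + 50.0 * load / f)

freq_idx, gamma, eps = 1, 0.9, 0.2
for _ in range(2000):
    load = random.random()
    state = torch.tensor([load, FREQS[freq_idx] / FREQS[-1], 0.5])
    q = qnet(state)
    action = random.randrange(3) if random.random() < eps else int(q.argmax())
    freq_idx = min(max(freq_idx + action - 1, 0), len(FREQS) - 1)
    r = reward(freq_idx, load)

    next_state = torch.tensor([load, FREQS[freq_idx] / FREQS[-1], 0.5])
    with torch.no_grad():
        target = r + gamma * qnet(next_state).max()
    loss = (q[action] - target) ** 2                # one-step TD error
    opt.zero_grad(); loss.backward(); opt.step()

print("Q-values for a high-load state:",
      qnet(torch.tensor([0.9, 0.6, 0.5])).detach().numpy())
```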
  • 58. METI AIST-AIRC ABCI as the world's first large-scale OPEN AI infrastructure • Univ. of Tokyo Kashiwa Campus • >130 AI-Petaflops • <3 MW power • <1.1 avg. PUE • Operational ~2018Q1 • ABCI: AI Bridging Cloud Infrastructure • Top-level SC compute & data capability for DNN (>130 AI-Petaflops) • Open public & dedicated infrastructure for AI & Big Data algorithms, software, and applications • Platform to accelerate joint academic-industry R&D for AI in Japan
  • 59. The “Chicken or Egg Problem” of AI-HPC Infrastructures • “On-premise” machines at clients => “Can’t invest big in AI machines unless we forecast good ROI. We don’t have the experience of running on big machines.” • Public clouds other than the giants => “Can’t invest big in AI machines unless we forecast good ROI. We are cutthroat.” • Large-scale supercomputer centers => “Can’t invest big in AI machines unless we forecast good ROI. We can’t sacrifice our existing clients, and our machines are full.” • Thus the giants dominate, and AI technologies, big data, and people stay behind corporate firewalls…
  • 60. But Commercial Companies, esp. the “AI Giants”, are Leading AI R&D, are they not? • Yes, but that is because their short-term goals could harvest the low-hanging fruit in DNN-rejuvenated AI • AI/BD research is just beginning: if we leave it to the interests of commercial companies, we cannot tackle difficult problems with no proven ROI • Very unhealthy for research • This is different from more mature fields, such as pharmaceuticals or aerospace, where there are balanced investments and innovations in both academia/government and industry
  • 61. ABCI Prototype: AIST AI Cloud (AAIC), March 2017 (system vendor: NEC) • 400 NVIDIA Tesla P100s and InfiniBand EDR accelerate various AI workloads including ML (machine learning) and DL (deep learning) • Advanced data analytics leveraged by 4 PiB of shared Big Data storage and Apache Spark w/ its ecosystem • AI computation system: computation nodes (w/ GPU) x50 (Intel Xeon E5 v4 x2, NVIDIA Tesla P100 (NVLink) x8, 256 GiB memory, 480 GB SSD); computation nodes (w/o GPU) x68 (Intel Xeon E5 v4 x2, 256 GiB memory, 480 GB SSD); mgmt & service nodes x16; interactive nodes x2; in total 400 Pascal GPUs, 30 TB memory, 56 TB SSD • Large-capacity storage system: DDN SFA14K with file servers (w/ 10GbE x2, IB EDR x4) x4, 8 TB 7.2 Krpm NL-SAS HDD x730, GRIDScaler (GPFS), >4 PiB effective, R/W 100 GB/s • Computation network: Mellanox CS7520 director switch, EDR (100 Gbps) x216, 200 Gbps bidirectional, full bisection bandwidth • Service and management network: GbE or 10GbE, with IB EDR (100 Gbps) uplinks • Firewall: FortiGate 3815D x2, FortiAnalyzer 1000E x2 (UTM firewall, 40-100 Gbps class, 10GbE) • SINET5 Internet connection, 10-100 GbE
  • 62. The “Real” ABCI – 2018Q1 • Extreme computing power – w/ >130 AI-PFlops for AI/ML, especially DNN – >x1 million speedup over a high-end PC: 1-day training for a 3000-year DNN training job – TSUBAME-KFC (1.4 AI-PFlops) x 90 users (T2 avg.) • Big Data and HPC converged modern design – for advanced data analytics (Big Data) and scientific simulation (HPC), etc. – leverages Tokyo Tech’s “TSUBAME3” design, with AI/BD-centric differences/enhancements • Ultra-high-BW & low-latency memory, network, and storage – for accelerating various AI/BD workloads – data-centric architecture, optimizes data movement • Big Data/AI and HPC SW stack convergence – incl. results from JST-CREST EBD – wide contributions from the PC cluster community desirable • Ultra-green (PUE < 1.1), high-thermal (60 kW) racks – custom, warehouse-like IDC building and internal pods – final “commoditization” of HPC technologies into clouds
  • 63. ABCI Cloud Infrastructure • Ultra-dense IDC design from the ground up – custom, inexpensive, lightweight “warehouse” building w/ substantial earthquake tolerance – x20 the thermal density of a standard IDC • Extreme green – ambient warm-liquid cooling, large Li-ion battery storage, high-efficiency power supplies, etc. – commoditizing supercomputer cooling technologies to clouds (60 kW/rack) • Cloud ecosystem – wide-ranging Big Data and HPC standard software stacks • Advanced cloud-based operation – incl. dynamic deployment, container-based virtualized provisioning, multi-tenant partitioning, automatic failure recovery, etc. – joining the HPC and cloud software stacks for real • The final piece in the commoditization of HPC (into the IDC) [ABCI AI-IDC CG image; source: NEC case study, reference image]
  • 64. ABCI Cloud Data Center: “Commoditizing a 60 kW/rack Supercomputer” [data center image and layout plan: high-voltage transformers (3.25 MW), passive cooling tower (free cooling, cooling capacity 3 MW), active chillers (cooling capacity 200 kW), lithium battery (1 MWh, 1 MVA), building W 18 m x D 24 m x H 8 m, 72 racks + 18 racks, future expansion space] • Single floor, inexpensive build • Hard concrete floor with 2 tonnes/m² weight tolerance for racks and cooling pods • Number of racks: initial 90, max 144 • Power capacity: 3.25 MW (max) • Cooling capacity: 3.2 MW (minimum, in summer)
  • 65. Implementing 60 kW cooling in a Cloud IDC: Cooling Pods [cooling block diagram (hot rack): 19- or 23-inch 48U rack with computing servers; hot-water circuit 40℃, cold-water circuit 32℃; hot aisle 40℃ with hot-aisle capping, cold aisle 35℃; fan coil unit (cooling capacity 10 kW) on the front side; water blocks (on CPUs and/or accelerators, etc.) fed through a CDU; air at 35-40℃] • Water cooling capacity per rack: fan coil unit 10 kW, water block 50 kW • Commoditizing supercomputing cooling density and efficiency: warm-water cooling at 32℃, liquid and air cooling in the same rack, 60 kW cooling capacity (50 kW liquid + 10 kW air), very low PUE • Structural integrity from the rack + skeleton frame built on a flat concrete slab with 2 tonnes/m² load tolerance
  • 66. Comparing TSUBAME3.0 and ABCI • TSUBAME3: a “Big Data / AI oriented general-purpose supercomputer”; ABCI: a “prototype for next-generation, supercomputer-oriented AI / Big Data cloud IDCs” • This difference is what distinguishes two otherwise very similar sister machines • Hardware: capabilities that TSUBAME3 emphasizes as a supercomputer, such as double-precision performance and network capability, are relaxed in ABCI to keep costs down • Packaging and cooling: TSUBAME3 uses supercomputer-specific packaging to raise performance density (3.1 PFlops per rack), thermal density (61 kW per rack), and cooling efficiency (PUE = 1.033); ABCI achieves similar density and efficiency in standard 19-inch IDC racks, making future adoption by IDCs easier • Software: both support the various cloud, Big Data, and AI software stacks, but ABCI adopts them more extensively • One theme of ABCI is how to spread TSUBAME3-like AI-oriented supercomputing into the IDC ecosystem ==> the other performance parameters closely resemble TSUBAME3 • Compute-node performance parameters other than the network are nearly identical • Thermal density of 50-60 kW per rack (vs. 3-6 kW per rack in an ordinary IDC) and PUE below 1.1 (vs. 1.5-3 in an ordinary IDC) • As an overall demonstration, ABCI builds an inexpensive, ultra-high-efficiency data center that serves as a prototype for future AI IDCs
  • 67. TSUBAME3.0 vs. ABCI comparison table (reference: K computer)
Peak AI performance (AI-FLOPS): TSUBAME3 (Jul. 2017): 47.2 PetaFlops (12.1 double precision), 3.1 PetaFlops per rack; ABCI (Mar. 2018): 130-200 Petaflops, double precision TBD, 3-4 Petaflops per rack; K computer (2012): 11.3 Petaflops, 12.3 TeraFlops per rack
Packaging: TSUBAME3: supercomputer-specific packaging (water-cooled ICE-XA); ABCI: IDC 19-inch racks (water-cooled); K: supercomputer-specific packaging
Operational power (incl. cooling): TSUBAME3: under 1 MW; ABCI: about 2 MW; K: over 15 MW
Max rack thermal density / cooling efficiency (PUE): TSUBAME3: 61 kW, 1.033; ABCI: 50-60 kW, under 1.1 (slightly lower); K: ~20 kW, ~1.3
Hardware architecture: TSUBAME3: many-core (NVIDIA Pascal P100) + multi-core (Intel Xeon); ABCI: many-core AI/DL-oriented general-purpose processors, etc. (undecided); K: homogeneous multi-core with large CPU cores
Memory technology: TSUBAME3: HBM2 + DDR4; ABCI: fast on-package memory + DDR4; K: DDR3
Network technology: TSUBAME3: Intel OmniPath, 4 x 100 Gbps per node, full bisection, optical inter-switch network; ABCI: network speed and bisection both reduced relative to T3 to lower cost and ease IDC adoption; K: copper Tofu
Per-node non-volatile memory: TSUBAME3: 2 TeraByte NVMe per node; ABCI: >400 GB NVMe per node; K: none
Power monitoring / control / active power capping: TSUBAME3: fine-grained node- and system-wide monitoring, capping, and power knobs; ABCI: the same; K: only simple whole-system power measurement, no control
Cloud support / virtualization / AI: TSUBAME3: cloud APIs, container virtualization and node partitioning, dynamic provisioning, various machine learning software stacks; ABCI: the same; K: none
  • 68. ABCI Procurement Benchmarks • Big Data benchmarks: (SPEC CPU Rate), Graph500, MinuteSort, node-local storage I/O, parallel FS I/O • AI/ML benchmarks: low-precision GEMM (CNN kernel, defines “AI-Flops”); single-node CNN (AlexNet => ResNet?, ILSVRC2012 dataset); multi-node CNN (Caffe + MPI); large-memory CNN (ConvNet on Chainer); RNN/LSTM (being defined with an NICT lab) • No traditional HPC simulation benchmarks except SPEC CPU
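Since low-precision GEMM throughput is what defines the “AI-Flops” figure above, a rough single-GPU measurement can be sketched as below; this is only an illustration using CuPy with an arbitrary matrix size, not the actual ABCI procurement benchmark code.

```python
# Rough FP16 GEMM throughput measurement on one GPU (illustrative sketch;
# not the actual ABCI procurement benchmark). The matrix size is arbitrary.
import time
import cupy as cp

n = 8192
a = cp.random.rand(n, n, dtype=cp.float32).astype(cp.float16)
b = cp.random.rand(n, n, dtype=cp.float32).astype(cp.float16)

for _ in range(3):                         # warm-up
    c = a @ b
cp.cuda.Stream.null.synchronize()

iters = 10
t0 = time.perf_counter()
for _ in range(iters):
    c = a @ b
cp.cuda.Stream.null.synchronize()
dt = (time.perf_counter() - t0) / iters

print(f"FP16 GEMM: {2 * n ** 3 / dt / 1e12:.1f} TFLOP/s")   # 2*n^3 ops per GEMM
```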
  • 69. Current ABCI Benchmarking Status • Extensive discussions w/ processor and system vendors and experienced DL researchers and users • Collaboration with NICT (National Institute of Information and Communications Technology, Japan) • Dedicated benchmark team formed within AIRC • Early benchmarks and tweaking on AAIC • Also looking at other HPC-DL benchmarks such as the ANL CANDLE DNN benchmark • Will be able to contribute directly to IRDS (along with other HPC benchmarks) • We also have every generation of GPUs for DNN, from the G200 to the latest Pascal, with Volta forthcoming (DGX-1) – G200-S1070 (2008), Fermi M2050 (2010), Kepler K20c/K20x (2013), K40/K80 (2014), Maxwell GeForce cards (2015), Pascal P100 (2016), Volta V100 (2017)
  • 70. Basic Requirements for an AI Cloud System • Application layer: BD/AI user applications; Python, Jupyter Notebook, R, etc. + IDL; SQL (Hive/Pig); CloudDB/NoSQL (HBase/MongoDB/Redis); RDB (PostgreSQL); web services • System software: machine learning libraries, deep learning frameworks, graph computing libraries, numerical libraries (BLAS/Matlab), BD algorithm kernels (sort etc.), Fortran/C/C++ native codes, workflow systems, parallel debuggers and profilers, batch job schedulers and resource brokers, Linux containers / cloud services, MPI, OpenMP/ACC, CUDA/OpenCL; storage: PFS (Lustre, GPFS), DFS (HDFS), local Flash + 3D XPoint • OS and hardware: Linux OS; IB/OPA high-capacity, low-latency network; x86 (Xeon, Phi) + accelerators, e.g. GPU, FPGA, Lake Crest • Requirements: ✓ easy use of various ML/DL/graph frameworks from Python, Jupyter Notebook, R, etc. ✓ web-based application and service provision ✓ HPC-oriented techniques for numerical libraries, BD algorithm kernels, etc. ✓ support for long-running jobs / workflows for DL ✓ accelerated I/O and secure access to large data sets ✓ user-customized environments based on Linux containers for easy deployment and reproducibility ✓ modern supercomputing facilities based on commodity components
  • 71. Cutting-Edge Research AI Infrastructures in Japan: Accelerating BD/AI with HPC (and my effort to design & build them) • Oct. 2015: TSUBAME-KFC/DL (Tokyo Tech/NEC), 1.4 AI-PF (Petaflops), in production • Mar. 2017: AIST AI Cloud (AIST-AIRC/NEC), 8.2 AI-PF (x5.8), under acceptance • Mar. 2017: AI supercomputer (Riken AIP/Fujitsu), 4.1 AI-PF • Aug. 2017: TSUBAME3.0 (Tokyo Tech/HPE), 47.2 AI-PF (65.8 AI-PF w/ TSUBAME2.5), x5.8, being manufactured • Mar. 2018: ABCI (AIST-AIRC), 130-200 AI-PF, x2.8~4.2, draft RFC out, IDC under construction • 1H 2019?: “ExaAI”, ~1 AI-ExaFlop, x5.0~7.7, undergoing engineering study • R&D investments into world-leading AI/BD HW & SW & algorithms and their co-design for cutting-edge infrastructure are absolutely necessary (just as with Japan's Post-K and the US ECP in HPC)
  • 72. Co-Design of BD/ML/AI with HPC using BD/ML/AI, for the survival of HPC [diagram] • Acceleration and scaling of BD/ML/AI apps and methodologies (image and video, large-scale graphs, robots/drones) via HPC technologies and infrastructures on Big Data / AI oriented supercomputers • Acceleration, scaling, and control of HPC via BD/ML/AI and future supercomputer designs: accelerating conventional HPC apps, optimizing system software and operations, future Big Data / AI supercomputer design • Mutual and semi-automated co-acceleration of HPC and BD/ML/AI • ABCI: world's first and largest open 100 Peta AI-Flops AI supercomputer, Fall 2017, for co-design
  • 73. • Strategy 5: Develop shared public datasets and environments for AI training and testing. The depth, quality, and accuracy of training datasets and resources significantly affect AI performance. Researchers need to develop high-quality datasets and environments and enable responsible access to high-quality datasets as well as to testing and training resources. • Strategy 6: Measure and evaluate AI technologies through standards and benchmarks. Essential to advancements in AI are standards, benchmarks, testbeds, and community engagement that guide and evaluate progress in AI. Additional research is needed to develop a broad spectrum of evaluative techniques. • We are already implementing the US AI & BD strategies … in Japan, at AIRC w/ ABCI
  • 74. What is worse: Moore's Law will end in the 2020s • Much of the underlying IT performance growth is due to Moore's Law (“LSI: x2 transistors in 1~1.5 years”), causing qualitative “leaps” in IT and societal innovations; it is the main reason we have supercomputers and Google... • But this is slowing down & ending by the mid-2020s…!!! The end of lithography shrinks, the end of Dennard scaling, the end of fab economics • How do we sustain “performance growth” beyond the “end of Moore”? Not just one-time speed bumps; this will affect all aspects of IT, including BD/AI/ML/IoT, not just HPC; the end of IT as we know it [photo: Gordon Moore] • The curse of constant transistor power shall soon be upon us
  • 75. 20-Year Eras toward the End of Moore's Law (3-5 nm and beyond, 2025-: constant transistor power) • 1980s~2004: Dennard scaling; perf+ = single-thread+ = transistors & frequency+ = power+ (the 20-year Moore-Dennard single-core ILP/vector “killer micro” era) • 2004~2015: feature scaling; perf+ = transistors+ = core count+, constant power (the 20-year post-Dennard many-core era) • 2015~2025: all of the above gets harder • 2025~: post-Moore; constant features & power = flat performance • Need to realize the next 20-year era of supercomputing (the 20-year next-gen post-Moore era)
  • 76. The “curse of constant transistor power” - Ignorance of this is like ignoring global warming - • Systems people have been telling the algorithm people that “FLOPS will be free, bandwidth is important, so devise algorithms under that assumption” • This will certainly be true until exascale in 2020… • But when Moore’s Law ends in 2025-2030, constant transistor power (esp. for logic) = FLOPS will no longer be free! • So algorithms that simply increase arithmetic intensity will no longer scale beyond that point • Like countering global warming – need disruptive change in computing – in HW-SW-Alg-Apps etc. for the next 20 year era
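The roofline model makes the FLOPS-versus-bandwidth trade-off in this argument concrete: attainable performance is the minimum of peak compute and arithmetic intensity times memory bandwidth. The sketch below is purely illustrative, with made-up machine numbers rather than TSUBAME3 or ABCI figures.

```python
# Illustrative roofline calculation (machine numbers are made up, not
# TSUBAME3/ABCI specs): attainable GFLOP/s = min(peak, AI * bandwidth),
# where AI is arithmetic intensity in floating-point ops per byte moved.
def roofline(peak_gflops: float, bw_gbs: float, ai_flops_per_byte: float) -> float:
    return min(peak_gflops, ai_flops_per_byte * bw_gbs)

PEAK, BW = 10_000.0, 900.0                 # hypothetical: 10 TFLOP/s peak, 900 GB/s

for ai in (0.25, 1.0, 4.0, 16.0, 64.0):    # e.g. stencils ~0.25-1, large GEMM >> 10
    perf = roofline(PEAK, BW, ai)
    bound = "memory-bound" if perf < PEAK else "compute-bound"
    print(f"AI = {ai:5.2f} FLOP/byte -> {perf:8.1f} GFLOP/s ({bound})")
```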
  • 77. Performance growth via data-centric computing: “From FLOPS to BYTES” • Identify the new parameter(s) for scaling over time, because data-related parameters (e.g. capacity and bandwidth) will still likely continue to grow towards the 2040s • We can grow the transistor count for compute, but CANNOT use them all AT THE SAME TIME (dark silicon) => multiple computing units specialized to the type of data • Continued capacity growth: 3D stacking (esp. direct silicon layering) and low-power NVM (e.g. ReRAM) • Continued BW growth: data-movement energy will be capped constant by dense 3D design and advanced optics from silicon photonics technologies • Almost back to the old “vector” days(?), but no free lunch: latency is still a problem, locality is still important, and we need general algorithmic acceleration through data capacity and bandwidth, not FLOPS
  • 78. [diagram: the transition around the ~2025 “M-P extinction event” from the many-core era to the post-Moore era. Many-core era: transistor lithography scaling (CMOS logic circuits, DRAM/SRAM); homogeneous general-purpose compute nodes loosely coupled with electronic interconnect and localized data; FLOPS-centric massively parallel architecture, system software, algorithms, and apps behind hardware/software system APIs. Post-Moore era: novel devices + CMOS (dark silicon), e.g. nanophotonics and non-volatile devices; compute nodes ultra-tightly coupled w/ aggressive 3-D packaging and photonic switching interconnect; data-centric heterogeneous architecture (reconfigurable dataflow, DNN & neuromorphic, optical computing, low-precision error-prone computing, quantum computing, non-volatile memory, massive-BW 3-D packages) with heterogeneous CPUs and holistic data; bytes-centric system software, algorithms, and apps]
  • 79. Post-Moore is NOT a single more-Moore device as a panacea: device & architecture advances improve data-related parameters over time, and “rebooting computing” in terms of devices, architectures, software, algorithms, and applications is necessary => co-design becomes even more important, c.f. exascale • Memory: post-Moore high-bandwidth hierarchical memory model; new memory devices (PC-RAM, ReRAM, STT-MRAM); 3D architecture with next-gen vias & silicon fabrication (tungsten vias and 3D silicon, photonic interposers, inductive TCI) • Communication: silicon photonics WDM interconnect, photonic switching, optical packet switching, low-reliability communication, building-block “gluable” architecture • Computation: neural networks / neuromorphic / Ising annealing and brain-inspired computing, near-threshold and low-reliability computing, quantum computing, customizable logic, photonic compute devices • Software and algorithms: post-Moore data science & AI libraries and computational science libraries; high-B/F algorithms, BW-reducing algorithms, parallel space-and-time algorithms, low-rank approximation, out-of-core algorithms, data memoization, machine-learning-based acceleration, data assimilation, uncertainty quantification, couplers, multi-physics simulation; low-precision & probabilistic computing; accelerator-specific compilers, accelerator “binaries”, programmable logic, high-level synthesis compilers; post-Moore performance parameters and performance models, auto-tuning; post-Moore programming model, data-oriented scheduling, latency-hiding data-movement runtime, hierarchical data abstractions, fault tolerance • Applications: fusion/plasma, EMF analysis, manufacturing, massive medical imaging • Overall: a data- and custom-compute-centric platform