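A Graph-Processing FPGA Accelerator Using PyCoRAM

Shinya Takamaeda-Yamazaki, Masahiro Edamoto, Jun Yao, Yasuhiko Nakashima
Graduate School of Information Science, Nara Institute of Science and Technology
July 28, 2014, 15:15-15:45
SWoPP2014 @ Toki Messe, Niigata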

Overview
■  Developed a Dijkstra's-algorithm FPGA accelerator using PyCoRAM, a memory abstraction framework
●  PyCoRAM
•  An FPGA accelerator development framework that combines HDL and Python
2014-07-28 Shinya T-Y. NAIST 2

FPGA as SoC (System-on-Chip)
n  沢山のパーツを単一FPGA上に集積しSoCとして利用
l  CPUコア
•  Microblaze (ソフトマクロ)
•  Cortex-A9 (ハードマクロ)
l  アクセラレータロジック
•  普通のRTLでモデリング
–  Verilog HDL, VHDL
•  新しい言語でモデリング
–  Bluespec, AutoESL, Chisel, …
l  DDR3 DRAM
l  PCI-Express
l  Ethernet, …
[Figure: block diagram of an FPGA SoC: CPU, two hardware accelerators (HW Acc), DRAM I/F, Ether, and PCI-E connected through an interconnect]

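IP-Core-Based System Development
■  Develop IP cores, add them, and connect them, and the system is complete
●  IP cores are connected through a standard interconnect
●  EDA tools automatically generate the interconnect and (some) device-dependent interfaces, which saves effort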
[Figure: Xilinx Platform Studio (XPS) screenshot showing the IP-core list, the interconnect, and the IP-core instances, next to the FPGA SoC block diagram (CPU, two HW Acc, DRAM I/F, Ether, PCI-E, interconnect, DRAM)]

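How Should the Accelerator IP Be Implemented?
■  Implementing the accelerator in plain HDL is unimaginative
●  More to the point, it is tedious in many ways
•  Scheduling logic for computation and memory accesses
–  Double buffering and the like are tedious
•  Control circuits for the memory system
–  Writing state machines in HDL is tedious and error-prone
•  Debugging is tedious
●  Still, pipeline behavior should be defined at the cycle level
•  A highly utilized pipeline is essential for performance on an FPGA
–  So the computation logic should be written in HDL
–  Tuning is rather difficult with high-level synthesis
■  An abstracted memory system would make this much easier
CoRAM memory architecture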

CoRAM [Chung+,FPGA’11]
n  FPGAアクセラレータのためのメモリ抽象化
l  高位モデルによるメモリ管理でアクセラレータをポータブルに
•  計算カーネルとメモリアクセスの分離
•  ソフトウェアのモデルによるメモリアクセスパターンの記述
[Figure: CoRAM architecture. HW kernels (computing logic) read and write CoRAM memories (abstracted on-chip memories). Control threads (memory access patterns) manage the CoRAM memories, read and write off-chip memory, and communicate with the kernels through CoRAM channels (communication FIFOs/registers)]

PyCoRAM [Takamaeda+,CARL’13]
n  ベンダーEDK向けのPythonベースのCoRAM実装
l  計算カーネルのRTL記述とメモリアクセスパターンの
Python記述からAXI4 IPコアを自動合成
l  出来上がったIPコアをEDKでポチポチつなげばシステム完成！
n  特徴
l  Pythonでのコントロールスレッド記述
•  Pythonで簡単にメモリアクセスパターンを記述できる
–  独自の高位合成コンパイラでPython記述からVerilog HDLのRTLを合成
l  AMBA AXI4インターコネクトのサポート
•  Xilinx Platform Studio (XPS)などを用いたIPコアベースの開発を支援
l  計算ロジックの複雑なデザインに対応
•  ハードウェアデザイン解析・生成のための
オープンソースツールキットPyverilogを活用

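PyCoRAM Microarchitecture
[Figure: block diagram. Modeled in RTL (Verilog HDL): user logic with user I/O. Modeled in Python: control thread (memory access pattern), shown with an FSM. Between them: CoRAM Memory (with DMAC), CoRAM Stream (with DMAC), CoRAM Channel, CoRAM Register, and GPIO]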

PyCoRAM Microarchitecture
[Figure: the same block diagram as on the previous slide, shown with the following control-thread and user-logic code]
Control thread (Python):
def calc_sum(times):
    ram = CoramMemory(idx=0, datawidth=32, size=1024)
    channel = CoramChannel(idx=0, datawidth=32)
    addr = 0
    sum = 0
    for i in range(times):
        ram.write(0, addr, 128)   # Transfer (off-chip DRAM to BRAM)
        channel.write(addr)       # Notification to User-logic
        sum += channel.read()     # Wait for Notification from User-logic
        addr += 128 * (32/8)      # advance by 128 words of 4 bytes each
    print('sum=', sum)            # $display Verilog system task
calc_sum(8)
User logic (Verilog HDL):
CoramMemory1P
#(
  .CORAM_THREAD_NAME("thread_name"),
  .CORAM_ID(0),
  .CORAM_ADDR_LEN(ADDR_LEN),
  .CORAM_DATA_WIDTH(DATA_WIDTH)
)
inst_memory
(.CLK(CLK),
 .ADDR(mem_addr),
 .D(mem_d),
 .WE(mem_we),
 .Q(mem_q)
);
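To see how the control thread and the user logic cooperate, the handshake can be mimicked in plain Python without any FPGA tooling. This is only a software sketch: `MockCoramMemory` and `MockCoramChannel` are hypothetical stand-ins (not PyCoRAM classes), off-chip DRAM is modeled as a Python list of 32-bit words, and the user logic is assumed to sum the 128 words currently held in the BRAM, which the slide does not state explicitly.

```python
from collections import deque

WORD_BYTES = 4  # 32-bit words, matching datawidth=32 in the slide


class MockCoramMemory:
    """Hypothetical software stand-in for a CoRAM memory (on-chip BRAM)."""

    def __init__(self, dram, size):
        self.dram = dram          # list of words modeling off-chip DRAM
        self.bram = [0] * size    # on-chip buffer

    def write(self, ram_addr, ext_addr, size):
        # DMA transfer: off-chip DRAM (byte address) -> BRAM (word address)
        base = ext_addr // WORD_BYTES
        self.bram[ram_addr:ram_addr + size] = self.dram[base:base + size]


class MockCoramChannel:
    """Hypothetical stand-in for a CoRAM channel between thread and logic."""

    def __init__(self, user_logic):
        self.user_logic = user_logic
        self.fifo = deque()

    def write(self, value):
        # Notify the user logic; the mock runs it at once and queues its reply.
        self.fifo.append(self.user_logic(value))

    def read(self):
        # Wait for the notification from the user logic.
        return self.fifo.popleft()


def calc_sum(dram, times, burst=128):
    ram = MockCoramMemory(dram, size=1024)
    # Assumed user logic: sum the words of the burst just loaded into BRAM.
    channel = MockCoramChannel(lambda _addr: sum(ram.bram[:burst]))
    addr = 0
    total = 0
    for _ in range(times):
        ram.write(0, addr, burst)      # DRAM -> BRAM
        channel.write(addr)            # kick the user logic
        total += channel.read()        # collect its partial sum
        addr += burst * WORD_BYTES     # next burst, in bytes
    return total


dram = list(range(1024))
result = calc_sum(dram, times=8)
print("sum=", result)
```

With eight bursts of 128 words, the thread covers all 1024 words of the modeled DRAM, so the total equals the sum of the whole list.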

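Implementation of the PyCoRAM Microarchitecture
[Figure: on the FPGA, the PyCoRAM IP (user logic with user I/O, control thread with FSM, CoRAM Memory and CoRAM Stream each with a DMAC and an AXI I/F, CoRAM Channel, CoRAM Register, GPIO) connects through the AXI4 interconnect to the DRAM controller]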

Matrix Multiplication and Stencil Computation [Takamaeda+, ARC2014-01]
[Figure, matrix multiplication: computing logic (Verilog HDL) with CoRAM Memory 0 to 2 for the data A, B, and C, an 8-stage multiply pipeline with accumulation (sum), control logic with a check sum, and CoRAM Channel 0 to the control thread (Python)]
[Figure, 9-point stencil: computing logic (Verilog HDL) with CoRAM Memory 0 to 3 for the data d0, d1, d2, and rslt, a 41-stage add-divide pipeline, control logic with a check sum, and CoRAM Channel 0 to the control thread (Python)]

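Memory Performance
■  Memory bandwidth utilization: about 86% of the theoretical maximum is achieved
■  This looks effective for bandwidth-bound applications
●  Dense matrix multiplication and stencil computation achieve high performance and development efficiency
•  Long bursts make effective use of the bandwidth
●  What about latency-bound applications, then?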
[Figure: two charts of bandwidth utilization (0 to 1) versus SIMD size [byte]: Atlys (1.2GB/s MAX) for SIMD sizes 4, 8, 16, 32; ML605 (6.4GB/s MAX) for SIMD sizes 4, 8, 16, 32, 64]

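Goal of This Talk
■  Clarify whether PyCoRAM is applicable to applications with irregular memory access patterns
●  Applications with regular memory access patterns (matrix multiplication, stencil) are bandwidth-bound
●  Is PyCoRAM usable for applications dominated by memory access latency?
•  First step: implement one and see
■  Subject of this work: graph processing
●  Shortest-path search (Dijkstra's algorithm)
•  Likely bottlenecks
–  Managing unvisited nodes → a priority queue keyed on distance
–  Reading and writing neighbor-node information (cost, parent node) → paging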

Shortest-Path Search: Dijkstra's Algorithm (1)
[Figure: six animation frames of a five-node example graph (S, a, b, c, G; edge costs 10, 15, 20, 5, 30, 5) drawn next to the priority queue. Labels recoverable from the frames:]
Frame 1: distance S = 0; priority queue: (S, 0)
Frame 2: distances a = 10, b = 15; priority queue: (a, 10), (b, 15)
Frame 3: distance c = 30; priority queue: (b, 15), (c, 30)
Frame 4: distance c updated 30 → 20, G = 45; priority queue: (c, 20), (G, 45), (c, 30)
Frame 5: distance G updated 45 → 25; priority queue: (G, 25), (G, 45), (c, 30)
Frame 6: distance G = 25; priority queue: (c, 30), (G, 45)
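The walkthrough above can be reproduced with a short software sketch. The edge endpoints used here are an assumption inferred from the queue contents in the frames (S→a 10, S→b 15, a→c 20, b→c 5, b→G 30, c→G 5); the slides only show the drawing. Stale queue entries such as (c, 30) are simply skipped when popped, which plays the role of the visited flag.

```python
import heapq

# Example graph; the edge endpoints are inferred from the frames (assumption).
GRAPH = {
    "S": [("a", 10), ("b", 15)],
    "a": [("c", 20)],
    "b": [("c", 5), ("G", 30)],
    "c": [("G", 5)],
    "G": [],
}


def dijkstra(graph, start):
    """Return (cost, parent) dictionaries for all nodes reachable from start."""
    cost = {start: 0}
    parent = {start: None}
    visited = set()
    queue = [(0, start)]              # priority queue keyed on cost
    while queue:
        c, node = heapq.heappop(queue)
        if node in visited:           # stale entry, e.g. (c, 30) after (c, 20)
            continue
        visited.add(node)
        for neighbor, weight in graph[node]:
            new_cost = c + weight
            if neighbor not in cost or new_cost < cost[neighbor]:
                cost[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    return cost, parent


def path_to(parent, goal):
    """Follow parent pointers back from goal to the start node."""
    path = []
    while goal is not None:
        path.append(goal)
        goal = parent[goal]
    return path[::-1]


cost, parent = dijkstra(GRAPH, "S")
print(cost["G"], path_to(parent, "G"))   # 25 ['S', 'b', 'c', 'G']
```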

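Data Structures
■  Node
●  A structure with 4 entries
•  Cost, pointer to the parent node, pointer to the edge information table, visited flag
■  Edge (Edge Page)
●  Pointers to neighbor-node information are bundled and managed in pages
●  On a CPU, a page fits in the cache and can be prefetched
●  On an FPGA, a page can be fetched with a burst transfer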
[Figure: memory layout by address]
Edge pages (EdgePage0, EdgePage1, ...), each page: Num Entries | Next Page Pointer | Neighbor Node Ptr | Cost | Neighbor Node Ptr | Cost
Node array (Node0, Node1, ..., NodeN-1), each node: Current Cost | Parent Node Pointer | Edge Page Pointer | Visited Flag
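A minimal sketch of this layout, assuming a flat list of words as the memory image. The page size (`PAGE_ENTRIES`), the null-pointer and "unreached" encodings, and the helper names `build_image` and `read_edges` are all hypothetical; the slides give only the field order. The point of the sketch is that each edge page is one contiguous slice, which is what makes a burst transfer possible.

```python
NODE_WORDS = 4                      # cost, parent pointer, edge page pointer, visited flag
PAGE_ENTRIES = 2                    # entries per edge page (hypothetical size)
PAGE_WORDS = 2 + 2 * PAGE_ENTRIES   # header (num entries, next page) + (neighbor, cost) pairs
NULL = 0xFFFFFFFF                   # hypothetical null-pointer encoding
INF = 0xFFFFFFFF                    # hypothetical "unreached" cost


def build_image(num_nodes, edges):
    """Lay out the node array followed by edge pages in one flat word list.

    edges maps a node index to a list of (neighbor index, cost) pairs.
    All pointers are word addresses into the returned list.
    """
    mem = []
    for _ in range(num_nodes):
        mem += [INF, NULL, NULL, 0]
    for node, adj in edges.items():
        prev_link = node * NODE_WORDS + 2       # slot pointing at the next page
        for i in range(0, len(adj), PAGE_ENTRIES):
            chunk = adj[i:i + PAGE_ENTRIES]
            page = len(mem)
            mem[prev_link] = page
            mem += [len(chunk), NULL]
            for neighbor, cost in chunk:
                mem += [neighbor * NODE_WORDS, cost]
            mem += [0] * (2 * (PAGE_ENTRIES - len(chunk)))   # pad to the fixed page size
            prev_link = page + 1
    return mem


def read_edges(mem, node_addr):
    """Walk one node's page list; each page is a single contiguous read."""
    result = []
    page = mem[node_addr + 2]
    while page != NULL:
        burst = mem[page:page + PAGE_WORDS]     # models one burst transfer
        num, page = burst[0], burst[1]
        for k in range(num):
            result.append((burst[2 + 2 * k], burst[3 + 2 * k]))
    return result


mem = build_image(3, {0: [(1, 10), (2, 15), (1, 7)], 1: [(2, 5)]})
print(read_edges(mem, 0))   # [(4, 10), (8, 15), (4, 7)]
```

Node 0 has three edges, so its list spans two pages linked through the Next Page Pointer; the returned neighbor values are node addresses (index × 4 words).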

Software Implementation
■  Implemented in C
●  Edges are managed in pages: cache-friendly
●  Unvisited nodes are managed with a priority queue (binary heap)

Dijkstra IP Core Using PyCoRAM
■  With PyCoRAM, the computation modules are implemented in Verilog HDL and the memory access control in Python
[Figure: block diagram of the Dijkstra IP core. User definition (modeled in Verilog HDL and Python): Dijkstra logic (Verilog HDL) with Priority Queue, Read Node, Read Edge, Update Node, and Mark Visited blocks connected through InStream/OutStream ports, a cost adder, and an FSM; control threads (Python): Main, Priority Queue, Read Node, Read Edge, Mark Visited. Generated by PyCoRAM: DMACs with AXI4 master interfaces and an AXI4-lite slave interface]

n  ステージ1: 最小コストノード取り出し
Read Node
InStream
Read Edge
InStream
Update
Node
OutStream
Mark Visited
OutStream
Priority Queue
InStreamOutStream
Edge
Page
Addr
Cost
+
Next Node Addr
Node
Addr
Next Node Cost
Node Addr
Next Node Addr
Next Node Cost
FSM
AXI4-lite
Slave Interfaces
Dijkstra
Logic
(Modeled in Verilog
HDL)
Mark Visited
Cthread
Read Node
Cthread
Read Edge
Cthread
Mark Visited
Cthread
Main
CThread
PyCoRAM

n  ステージ2：ノード情報読み出し
Read Node
InStream
Read Edge
InStream
Update
Node
OutStream
Mark Visited
OutStream
Priority Queue
InStreamOutStream
Edge
Page
Addr
Cost
+
Next Node Addr
Node
Addr
Next Node Cost
Node Addr
Next Node Addr
Next Node Cost
FSM
AXI4-lite
Slave Interfaces
Dijkstra
Logic
(Modeled in Verilog
HDL)
Mark Visited
Cthread
Read Node
Cthread
Read Edge
Cthread
Mark Visited
Cthread
Main
CThread
PyCoRAM

n  ステージ3：ノードに訪問済みフラグ書き込み
Read Node
InStream
Read Edge
InStream
Update
Node
OutStream
Mark Visited
OutStream
Priority Queue
InStreamOutStream
Edge
Page
Addr
Cost
+
Next Node Addr
Node
Addr
Next Node Cost
Node Addr
Next Node Addr
Next Node Cost
FSM
AXI4-lite
Slave Interfaces
Dijkstra
Logic
(Modeled in Verilog
HDL)
Mark Visited
Cthread
Read Node
Cthread
Read Edge
Cthread
Mark Visited
Cthread
Main
CThread
PyCoRAM

n  パイプライン動作：（１）エッジ読み出し→
（２）隣接ノード読み出し→（３）隣接ノード更新
Read Node
InStream
Read Edge
InStream
Update
Node
OutStream
Mark Visited
OutStream
Priority Queue
InStreamOutStream
Edge
Page
Addr
Cost
+
Next Node Addr
Node
Addr
Next Node Cost
Node Addr
Next Node Addr
Next Node Cost
FSM
AXI4-lite
Slave Interfaces
Dijkstra
Logic
(Modeled in Verilog
HDL)
Mark Visited
Cthread
Read Node
Cthread
Read Edge
Cthread
Mark Visited
Cthread
Main
CThread
PyCoRAM
123

Priority Queue (Heap)
■  2 CoRAM memories + 1 BRAM
●  CoramInChannel for reading from external memory
●  CoramOutChannel for writing to external memory
●  BRAM that stores the group of low-cost nodes
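[Figure: priority queue logic (modeled in Verilog HDL): FSM, compare logic over parent / left child / right child, BRAM, and In/Out channels with DMACs to the memory bus (to DRAM); a control thread (modeled in Python) issues the DMA requests through a channel FIFO. Example heap: (d, 20) at the root, (b, 30) and (f, 45) on the second level, (a, 40) and (e, 50) on the third level, with a marked "BRAM Zone"]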
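The heap organization can be sketched in software as an array-based binary min-heap whose first few entries model the BRAM zone and whose remaining entries model DRAM. This is an illustration only: `ZonedHeap`, the `bram_entries` parameter, and the move counters are hypothetical and do not describe the actual Verilog module. The comparison among parent, left child, and right child mirrors the compare logic in the figure.

```python
class ZonedHeap:
    """Array-based binary min-heap that counts entry moves per storage zone.

    Indices below bram_entries model the on-chip BRAM zone; all others model
    off-chip DRAM. The zone size is a hypothetical parameter.
    """

    def __init__(self, bram_entries):
        self.items = []               # (cost, node) tuples in heap order
        self.bram_entries = bram_entries
        self.bram_moves = 0
        self.dram_moves = 0

    def _touch(self, index):
        if index < self.bram_entries:
            self.bram_moves += 1
        else:
            self.dram_moves += 1

    def _swap(self, i, j):
        self._touch(i)
        self._touch(j)
        self.items[i], self.items[j] = self.items[j], self.items[i]

    def push(self, cost, node):
        self.items.append((cost, node))
        i = len(self.items) - 1
        self._touch(i)
        while i > 0:                  # sift up while smaller than the parent
            parent = (i - 1) // 2
            if self.items[parent] <= self.items[i]:
                break
            self._swap(parent, i)
            i = parent

    def pop(self):
        top = self.items[0]
        last = self.items.pop()
        self._touch(0)
        if self.items:
            self.items[0] = last
            i = 0
            while True:               # sift down: compare parent, left, right
                left, right = 2 * i + 1, 2 * i + 2
                smallest = i
                if left < len(self.items) and self.items[left] < self.items[smallest]:
                    smallest = left
                if right < len(self.items) and self.items[right] < self.items[smallest]:
                    smallest = right
                if smallest == i:
                    break
                self._swap(i, smallest)
                i = smallest
        return top


heap = ZonedHeap(bram_entries=3)
for cost, node in [(40, "a"), (30, "b"), (20, "d"), (50, "e"), (45, "f")]:
    heap.push(cost, node)
order = [heap.pop() for _ in range(5)]
print(order)
print(heap.bram_moves, heap.dram_moves)
```

Entries come out in increasing cost order regardless of the zone split; the two counters give a rough feel for how much of the traffic stays in the small on-chip zone.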

Evaluation
■  Evaluated on a real FPGA board
●  Board: Digilent Atlys
•  FPGA: Spartan-6 LX45
•  DRAM: DDR2-800 (1.6GB/s), 128MB
●  Tools: Xilinx PlanAhead 14.7, XPS 14.7
■  Compared against a general-purpose PC
●  Intel Core i7 3770K (3.5GHz), DDR3-1600 (12.8GB/s ×2)
●  Linux (Ubuntu 14.04), gcc 4.8.2 (-O3)
■  Graph
●  Randomly generated with XORSHIFT random numbers
●  Nodes: 5000, edges: 100000
•  Larger graphs are left as future work because debugging was not finished in time
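A graph of this size can be generated along the following lines. The slides say only that XORSHIFT random numbers were used, so everything else here is an assumption: the 32-bit variant with shifts 13, 17, 5 (Marsaglia's classic parameters), the seed, the cost range, and the decision to skip self-loops while allowing duplicate edges.

```python
def xorshift32(state):
    """One step of Marsaglia's 32-bit xorshift generator (shifts 13, 17, 5)."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state


def random_graph(num_nodes, num_edges, max_cost=100, seed=2463534242):
    """Generate (src, dst, cost) edges; max_cost and seed are hypothetical."""
    edges = []
    state = seed
    while len(edges) < num_edges:
        state = xorshift32(state)
        src = state % num_nodes
        state = xorshift32(state)
        dst = state % num_nodes
        state = xorshift32(state)
        cost = 1 + state % max_cost
        if src != dst:                # skip self-loops
            edges.append((src, dst, cost))
    return edges


edges = random_graph(num_nodes=5000, num_edges=100000)
print(len(edges), edges[:3])
```

Because the generator is deterministic for a given seed, the same graph can be rebuilt on the host PC and on the FPGA system for a like-for-like comparison.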

Evaluation Environment: FPGA System
■  Controlled from the host over UART; a MicroBlaze builds the graph
[Figure: FPGA system on the Spartan-6 LX45. Dijkstra accelerator: Dijkstra logic (Priority Queue, Read Node, Read Edge, Update Node, Mark Visited) on a CoRAM abstraction layer with a main control thread and one CThread per block. UART loader: loader logic and a control thread on a CoRAM abstraction layer, connected to the host PC. MicroBlaze: 3-stage, 16KB local memory, 2KB I-cache, 2KB D-cache. AXI4 interconnect (128-bit, crossbar) to the DRAM controller (DDR2-800, 16-bit, 1.6GB/s) and the 128MB DRAM; AXI4-lite interconnect (32-bit, shared bus)]

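Implementation Details
■  AXI4 interconnect: 4 configurations
●  2 crossbar types: high-performance, with pipeline registers and the like
•  C128: 128-bit width
•  C32: 32-bit width
●  2 shared-bus types: area-saving, prioritizing resource usage
•  S128: 128-bit width
•  S32: 32-bit width
■  On the AXI4 bus, in-order processing of reads and writes is not guaranteed across different master ports
●  A request issued to the bus first is not necessarily processed first
●  Write → Read dependencies need particular care
●  Solution
•  Raise the priority of the write ports in the AXI4 bus settings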
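The hazard and its fix can be illustrated with a toy arbiter. This is not a model of the AXI4 protocol: it assumes one write port and one read port, a fixed-priority arbiter that serves one request per step, and a single shared address; the function name `simulate` and the values are hypothetical.

```python
from collections import deque


def simulate(priority):
    """Serve pending requests one per step, always picking the
    highest-priority port that has a request waiting (toy arbiter)."""
    memory = {0x100: 0}                            # old value at the shared address
    ports = {
        "write": deque([("write", 0x100, 42)]),    # issued first
        "read": deque([("read", 0x100)]),          # issued second, depends on the write
    }
    observed = []
    while any(ports.values()):
        port = next(p for p in priority if ports[p])
        request = ports[port].popleft()
        if request[0] == "write":
            memory[request[1]] = request[2]
        else:
            observed.append(memory[request[1]])
    return observed


print(simulate(priority=["read", "write"]))   # [0]   the read overtakes the write
print(simulate(priority=["write", "read"]))   # [42]  the write port has priority
```

When the read port wins arbitration, the dependent read returns the stale value even though the write was issued earlier; giving the write port the higher priority restores the intended order in this toy setting.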

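Execution Time
■  The FPGA implementation is about 25x slower than the general-purpose PC
●  Even per unit of memory bandwidth, performance is about 1.5x worse
●  Why?
•  The data set is small, and the out-of-order (OoO) processor is very good at extracting memory-level parallelism (MLP)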
[Figure: bar chart of execution time [msec]: C128 498.9, C32 492.7, S128 413.4, S32 404.3, x86 16.2 (annotated 25x)]
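The two headline ratios can be re-derived from the numbers in the chart and the evaluation setup. Two assumptions are made here: that the 25x figure refers to the fastest FPGA configuration (S32), and that "12.8GB/s ×2" means 25.6GB/s of peak bandwidth on the PC.

```python
exec_time_ms = {"C128": 498.9, "C32": 492.7, "S128": 413.4, "S32": 404.3, "x86": 16.2}
bandwidth_gbs = {"fpga": 1.6, "x86": 12.8 * 2}   # DDR2-800 vs. two channels of DDR3-1600

# Fastest FPGA configuration versus the PC.
best_fpga = min(t for name, t in exec_time_ms.items() if name != "x86")
slowdown = best_fpga / exec_time_ms["x86"]

# Normalize the slowdown by the difference in peak memory bandwidth.
bandwidth_ratio = bandwidth_gbs["x86"] / bandwidth_gbs["fpga"]
per_bandwidth_gap = slowdown / bandwidth_ratio

print(round(slowdown, 1), round(bandwidth_ratio, 1), round(per_bandwidth_gap, 2))
# 25.0 16.0 1.56
```

Under these assumptions the PC has 16x the peak bandwidth, so a 25x slowdown corresponds to roughly 1.5x lower performance per unit of bandwidth, in line with the slide.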

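Execution Time
■  Comparing the FPGA execution times gives the opposite of what intuition suggests
●  The shared bus outperforms the crossbar
●  The narrower bus outperforms the wider one
■  Why?
●  The shared bus has lower latency
●  The narrower bus has lower latency
●  What is needed: low latency, not high bandwidth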
[Figure: the same bar chart of execution time [msec]: C128 498.9, C32 492.7, S128 413.4, S32 404.3, x86 16.2]

FPGA Resource Usage
[Figure: three stacked bar charts (# occupied slices, # occupied registers, # occupied LUTs) for C128, C32, S128, and S32, each broken down into Dijkstra, Loader, CPU, Peripheral, and Interconnect; and a fourth chart of # occupied resources for Dijkstra Reg, Dijkstra LUT, Loader Reg, and Loader LUT, broken down into DMAC, Control Thread, and User Logic]

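Summary
■  Developed a Dijkstra's-algorithm FPGA accelerator using PyCoRAM, a memory abstraction framework
●  PyCoRAM
•  An FPGA accelerator development framework that combines HDL and Python
●  25x slower than a general-purpose CPU, and 1.5x worse in performance per unit of bandwidth
●  Evaluated with 4 interconnect configurations
•  Latency appears to matter more than throughput
■  The tools and framework are available on GitHub
●  PyCoRAM: http://shtaxxx.github.io/PyCoRAM/
●  Pyverilog: http://shtaxxx.github.io/Pyverilog/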
