IMAX3: Amazing Dataflow-Centric CGRA and its Applications
I present this slide to all hungry engineers who are tired of CPU, GPU, FPGA, tensor core, AI core, who want some challenging one with no black box inside, and who want to improve by themselves.
7. 20210401
7
Scalar, SIMD and CGRA
time
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
MM
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
MM
I1
I1
I1
I1
VST
VST
VST
VST
VFMA
VFMA
VFMA
VFMA
VLD
VLD
VLD
VLD
VLD
VLD
VLD
VLD
MM
Scalar
(VL=32)
Vector1
(VL=256)
Vector2
(VL=2048)
CGRA
(VL=16K)
8. 20210401
8
従来のプログラムは手順を書く
A B C
D
for (i=0; i<128; i++)
D[i]=A[i]+B[i]*C[i];
D[256] A[256] B[256] C[256]
float A[256],B[256],C[256],D[256];
for (i=0; i<128; i++)
D[i+128]=A[i+128]+B[i]*C[i+128];
Main memory
9. D A B C B
D A C
20210401
9
キャッシュメモリが頑張る
A B C
D
for (i=0; i<128; i++)
D[i]=A[i]+B[i]*C[i];
D[256] A[256] B[256] C[256]
for (i=0; i<128; i++)
D[i+128]=A[i+128]+B[i]*C[i+128];
Main memory
Cache memory
10. 20210401
10
データフローを書くと明示的に分散配置できる
D
A B C
D
A B C
D
A B C
Load Ai (top=A,len=64)
Load Bi (top=B,len=64)
Load Ci (top=C,len=64)
Di=Ai+Bi*Ci
Store Di (top=D,len=64)
Similar to assembly language, but has DMA info.
j=i+64; Load Aj (top=A+64,len=64)
Load Bi (top=B, len=64)
Load Cj (top=C+64,len=64)
Dj=Aj+Bi*Cj
Store Dj (top=D+64,len=64)
k=i+128;Load Ak (top=A+128,len=64)
Load Bi (top=B, len=64)
Load Ck (top=C+128,len=64)
Dk=Ak+Bi*Ck
Store Dk (top=D+128,len=64)
Can broadcast
11. 1988 VPP 4way VLIW+8elem. Vector
My CGRAs began from VLIW+Vector processor
F D W
C
D
M M M M
M M M M D D M M M M
Load pipe Store pipe
FPU pipe
F D
F D
Cache lines are shared among
heterogeneous multithreads.
M M M M M M M M
D
W
M M M M M M M M
M M M M M M M M
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
C
E
E
E
E
M M M M
B
B
B
B
B
B
B
B
B
B
Vector Length
≤ 2048 dwords
($miss)
F D D
2008 LAPP 8way VLIW+32stage Array
C2
E
E
E
LD
M M M M
C1
E
E
E
LD
C0
E
E
E
E
B
B
B
B
B
C3
E
E
E
E
HW Mapper (8stages) HW Mapper (binary compatible)
Vector Length
≤ 1024 words
E
E
E
E
E
E
E
ST
E
E
E
LD
E
E
E
LD
E
E
E
E
E
E
E
ST
2006 OROCHI 9way VLIW+5way SS
20210401
11
Data from all cache ways are passed through array.
Location free LD/ST can keep binary compatibility.
26. さらに縦に並べて全体のデータ流は上から下へ
20210401
26
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
27. さらにマルチチップ拡張として、横に増やす
20210401
27
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
28. ここで、論理4UNITを物理1UNITに重畳
20210401
28
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
29. 演算位置とローカルメモリ位置の同調制御のため、リング構造化
20210401
29
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
31. メモリバスに応じて並列接続、この図では8万オペレーションを写像
A B
C
D
E
H
B
M
2
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
A B
C
D
E
2560op x 32 = 81920op
20210401
31
37. 20210401
37
外部メモリとも協調させ、途切れないデータフローを作る
C
O
N
F
R
E
G
s
A
D
D
R
Overlapping post-drain, burst-exec, pre-fetch
L
M
M
L
M
M
L
M
M
Burst exec.
R
E
G
s
A
D
D
R
L
M
M
L
M
M
L
M
M
Burst exec.
C
O
N
F
A
D
D
R
R
E
G
s
Sequential execution
L
M
M
L
M
M
L
M
M
Burst exec. A
D
D
R
R
E
G
s
L
M
M
time
L
M
M
L
M
M
Burst exec.
PIO/DMA External Memory
PIO/DMA External Memory
R
E
G
s
A
D
D
R
L
M
M
L
M
M
L
M
M
Burst exec.
89. CPU GPU
Ultimate CGRA w/ high-speed compiler
CGRA for Energy-efficient Cryptography
Beyond-Neuromorphic Systems
Non-Deterministic Computing
20210401
89
ナレータ VOICEVOX:もち子(cv 明日葉よもぎ)
はらぺこエンジニアに贈るCGRAの世界2022
(ハッシュ計算とFFTと文字列検索)
スパコンからIoTまで 省エネ社会に
AI+BCだけじゃない超効率計算手法
90. 20210401
90
ハッシュ計算には、面倒な演算が必要
for (i=0; i<ctx->mbuflen; i+=BLKSIZE) { /* 1データ流内の並列実行は不可能. 多数データ流のパイプライン実行のみ */
for (th=0; th<thnum; th++) {
sregs[th*8+0] = state[th*8+0]; sregs[th*8+1] = state[th*8+1]; sregs[th*8+2] = state[th*8+2]; sregs[th*8+3] = state[th*8+3];
sregs[th*8+4] = state[th*8+4]; sregs[th*8+5] = state[th*8+5]; sregs[th*8+6] = state[th*8+6]; sregs[th*8+7] = state[th*8+7];
}
for (j=0; j<BLKSIZE; j+=BLKSIZE/DIV) {
for (th=0; th<thnum; th++) {
a = sregs[th*8+0]; b = sregs[th*8+1]; c = sregs[th*8+2]; d = sregs[th*8+3];
e = sregs[th*8+4]; f = sregs[th*8+5]; g = sregs[th*8+6]; h = sregs[th*8+7];
#if (DIV==4)
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 0]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 0]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 1]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 1]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 2]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 2]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 3]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 3]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 4]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 4]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 5]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 5]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 6]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 6]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 7]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 7]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 8]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 8]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 9]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 9]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+10]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+10]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+11]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+11]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+12]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+12]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+13]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+13]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+14]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+14]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+15]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+15]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
#endif
sregs[th*8+0] = a; sregs[th*8+1] = b; sregs[th*8+2] = c; sregs[th*8+3] = d;
sregs[th*8+4] = e; sregs[th*8+5] = f; sregs[th*8+6] = g; sregs[th*8+7] = h;
}
}
for (th=0; th<thnum; th++) {
state[th*8+0] += sregs[th*8+0]; state[th*8+1] += sregs[th*8+1]; state[th*8+2] += sregs[th*8+2]; state[th*8+3] += sregs[th*8+3];
state[th*8+4] += sregs[th*8+4]; state[th*8+5] += sregs[th*8+5]; state[th*8+6] += sregs[th*8+6]; state[th*8+7] += sregs[th*8+7];
}
}