IMAX3: Amazing Dataflow-Centric CGRA and its Applications
I present these slides to all hungry engineers who are tired of CPUs, GPUs, FPGAs, tensor cores, and AI cores, who want a challenging platform with no black box inside, and who want to improve it themselves.
1. CPU GPU
Ultimate CGRA w/ high-speed compiler
CGRA for Energy-efficient Cryptography
Beyond-Neuromorphic Systems
Non-Deterministic Computing
20210401
2.
IT devices consume 60% of total power: filled up by AI and BC, Society 5.0 w/ much CO2
IT devices consume 10% of total power: efficient AI and BC, zero-carbon society
202X Phase-down of Von-Neumann computers!?
Programmability ∝ Power dissipation
Ease of use ∝ Power dissipation
7. Scalar, SIMD and CGRA
[Figure: execution timelines along a time axis for Scalar (VL=32), Vector1 (VL=256), Vector2 (VL=2048) and CGRA (VL=16K). Stages shown: I1, D1, L2, MM, LD, LM, FMA, ST, VLD, VFMA, VST.]
8. Traditional style of programming based on procedures
float A[256],B[256],C[256],D[256];
int i;
for (i=0; i<128; i++)
  D[i]=A[i]+B[i]*C[i];
for (i=0; i<128; i++)
  D[i+128]=A[i+128]+B[i]*C[i+128];
[Figure: arrays D[256], A[256], B[256], C[256] in main memory]
9. Cache memory is convenient for high speed
for (i=0; i<128; i++)
  D[i]=A[i]+B[i]*C[i];
for (i=0; i<128; i++)
  D[i+128]=A[i+128]+B[i]*C[i+128];
[Figure: arrays D[256], A[256], B[256], C[256] in main memory, with blocks of D, A, B and C held in cache memory]
10. Explicitly spatial programming based on dataflow
Similar to assembly language, but has DMA info.
Load Ai (top=A,len=64)
Load Bi (top=B,len=64)
Load Ci (top=C,len=64)
Di=Ai+Bi*Ci
Store Di (top=D,len=64)
j=i+64;
Load Aj (top=A+64,len=64)
Load Bi (top=B,len=64)
Load Cj (top=C+64,len=64)
Dj=Aj+Bi*Cj
Store Dj (top=D+64,len=64)
k=i+128;
Load Ak (top=A+128,len=64)
Load Bi (top=B,len=64)
Load Ck (top=C+128,len=64)
Dk=Ak+Bi*Ck
Store Dk (top=D+128,len=64)
The shared Load Bi can be broadcast.
[Figure: three units, each reading A, B, C and writing D]
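The three load/compute/store groups above can be emulated in plain C to check what each unit is expected to produce. This is only a software sketch, not IMAX code: the names unit_fma, run_units, UNIT_LEN and NUM_UNITS are hypothetical, and the loop over units runs sequentially, whereas the slide suggests the units work side by side with the B block broadcast to all of them.

```c
#include <assert.h>
#include <stddef.h>

#define UNIT_LEN  64  /* elements per unit, matching len=64 on the slide */
#define NUM_UNITS 3   /* units for i, j=i+64 and k=i+128 */

/* One unit: d[n] = a[n] + b[n] * c[n] for n in [0, len). */
static void unit_fma(const float *a, const float *b, const float *c,
                     float *d, size_t len)
{
    for (size_t n = 0; n < len; n++)
        d[n] = a[n] + b[n] * c[n];
}

/* Hypothetical driver: unit u reads A and C and writes D at offset
 * u*UNIT_LEN, while every unit receives the same B block (the broadcast). */
static void run_units(const float *A, const float *B, const float *C, float *D)
{
    assert(A && B && C && D);
    for (size_t u = 0; u < NUM_UNITS; u++) {
        size_t top = u * UNIT_LEN;
        unit_fma(A + top, B, C + top, D + top, UNIT_LEN);
    }
}
```

Calling run_units with A, C and D of 192 elements and B of 64 elements fills D[0..191]; only one copy of B is read, which is the point of the broadcast.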
11. My CGRAs began from VLIW+Vector processor
1988 VPP: 4way VLIW + 8elem. Vector
2006 OROCHI: 9way VLIW + 5way SS
2008 LAPP: 8way VLIW + 32stage Array
[Figure: pipeline diagrams of the three processors with F, D, E, M, B, W and C stages; load pipe, store pipe and FPU pipe; HW Mapper (8stages, binary compatible). Annotated vector lengths: ≤ 2048 dwords ($miss) and ≤ 1024 words.]
Cache lines are shared among heterogeneous multithreads.
Data from all cache ways are passed through the array.
Location-free LD/ST can keep binary compatibility.
12. CGRAs based on FPGA require much compilation time
One hour of compilation. We are not that patient!
Much time for converting program + data
13. Start of IMAX
Source register file: (32bit x 2waySIMD = 64bit) x4
Condition code
Pipelined FPU: (32bit x 2wayFMA = 64bit)
Media unit (SAD, …)
Stochastic FMA (32 elements)
Hash function for SHA256
Address generator
Local memory (64KB)
Destination register file: (32bit x 2waySIMD = 64bit) x4
CGRA requires a data path similar to a CPU core.
14. 4way multithreading
Double buffered 16 registers
Column multithreading
Local memory variable partitioning (64KB, 32KB, 16KB, mix)
Most CGRAs have no FPU. Normal CGRAs dislike multicycle operation. Employing multithreading breaks the limitation.
15. Slave interface
Simultaneous selective receive into 4 logical partitions (256bit/cycle)
Simultaneous selective send and merge (256bit/cycle)
Connected to succeeding unit
Connected as slave
Low-cost broadcasting and gathering from CPU due to selective address filters in local memory
DMAs are optimized and integrated by compiler
24. How to put 64 units
[Figure: layout of one unit, built from blocks labeled F, A, M, C and R]
25. Programming mode is 4 logical units per row
[Figure: four logical units (each built from F, A, M, C and R blocks) placed in one row]
26. Expand in the vertical direction
[Figure: rows of logical units (F, A, M, C and R blocks) stacked vertically]
27. Expand to multi-chip
[Figure: the array of logical units (F, A, M, C and R blocks) extended across multiple chips]
28. Multiplex 4 logical units into a physical unit
[Figure: the array of logical units (F, A, M, C and R blocks), with each group of 4 logical units multiplexed into one physical unit]
29. Reshaping as a ring
[Figure: physical units (F, A, M, C and R blocks) connected as a ring]
36. Micro pipelining and macro pipelining w/ double buffering
Macro pipelining with log N-units
Merge sort O(N)
FFT O(N)
Writing next result & Reading previous result
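A software analogue of the log N stages with double buffering is a bottom-up merge sort that ping-pongs between two buffers: each pass reads the previous result from one buffer while writing the next result into the other. The sketch below is sequential C, so it still costs O(N log N) in total; the O(N) figure on the slide presumably assumes that the log N stages run concurrently on separate units. The names merge_runs and merge_sort_stages are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Merge sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst,
                       size_t lo, size_t mid, size_t hi)
{
    size_t a = lo, b = mid, out = lo;
    while (a < mid && b < hi)
        dst[out++] = (src[b] < src[a]) ? src[b++] : src[a++];
    while (a < mid) dst[out++] = src[a++];
    while (b < hi)  dst[out++] = src[b++];
}

/* Bottom-up merge sort over two buffers of n elements each.
 * Each pass plays the role of one stage: it reads the previous result
 * from src and writes the next result to dst, then the roles swap.
 * Returns whichever buffer (buf0 or buf1) holds the sorted data. */
static int *merge_sort_stages(int *buf0, int *buf1, size_t n)
{
    assert(buf0 && buf1);
    int *src = buf0, *dst = buf1;
    for (size_t width = 1; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(src, dst, lo, mid, hi);
        }
        int *tmp = src; src = dst; dst = tmp;  /* swap the two buffers */
    }
    return src;
}
```

The caller has to use the returned pointer, because the sorted data ends up in buf0 or buf1 depending on whether the number of passes is even or odd.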
37. Non-stop data flow
Overlapping post-drain, burst-execution and prefetch
[Figure: timelines along a time axis of CONF, REGs, ADDR and LMM transfers (PIO/DMA from external memory) and burst execution, contrasting sequential execution with overlapped execution]
38. Sandwich structure of ALU and memory seems best
More complicated memory access for light field, graph, string search, AI, …
80. IMAX3: Amazing Dataflow-Centric CGRA and its Applications (Sparse matrix and sorting)
81. Dual address synchronizer for sparse matrix and sort
Sparse MM (index and value)
Array A: 1 1.0, 2 2.0, 3 3.0, 11 4.0, 12 5.0, 13 6.0
Array B: 2 1.0, 3 2.0, 4 3.0, 9 4.0, 10 5.0, 11 6.0
Merge sort (value and attribute)
Stream A: 1 a1, 2 a2, 3 a3, 11 a11, 12 a12, 13 a13
Stream B: 2 b2, 3 b3, 4 b4, 9 b9, 10 b10, 11 b11
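For the sparse case, keeping two addresses in step amounts to advancing whichever side has the smaller index and multiplying only when the indices match. The following C sketch shows that idea on a sparse dot product. It is an illustration under assumptions, not the IMAX datapath: the sparse_elem layout and the sparse_dot name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical (index, value) pair as listed for Array A and Array B. */
typedef struct { int index; float value; } sparse_elem;

/* Dot product of two sparse vectors sorted by ascending index.
 * Two addresses walk the arrays in step: the side with the smaller
 * index advances, and a product is accumulated only on a match. */
static float sparse_dot(const sparse_elem *a, size_t na,
                        const sparse_elem *b, size_t nb)
{
    assert((a || na == 0) && (b || nb == 0));
    size_t ia = 0, ib = 0;
    float sum = 0.0f;
    while (ia < na && ib < nb) {
        if (a[ia].index < b[ib].index) {
            ia++;
        } else if (a[ia].index > b[ib].index) {
            ib++;
        } else {
            sum += a[ia].value * b[ib].value;
            ia++;
            ib++;
        }
    }
    return sum;
}
```

With the Array A and Array B values above, the indices 2, 3 and 11 match, giving 2.0*1.0 + 3.0*2.0 + 4.0*6.0 = 32.0.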
83. Five rings in a physical unit (sparse MM)
[Figure: datapath of a physical unit with numbered steps ① to ⑪. Labels: base address of B and A, +offset, mask address, Load B42, Load A00, store to C.]
86. Four rings in a physical unit (merge sort)
[Figure: datapath of a physical unit with numbered steps ① to ⑪. Labels: + offset, mask address, Load B42, Load A00.]
Similar to sparse MM
Succeeding unit compares A and B
Conditional store of A or B
One of the log N stages is mapped
Store address is sequential
Two-level pipelining: the succeeding unit reads the previous data from local memory and performs the next of the log N stages.
87. Merge sort
Generate condition codes
Update address A or B, based on the value
Conditional store of A or B
Store address is sequential
Store one stage of the log N stages
Double buffering
[Figure: datapath with Addr.A, Addr.B, Val.A and Val.B]
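The steps listed for this slide can be read as a single merge loop: compare the two values to form a condition code, store A or B to a sequential output address, and advance only the address of the side that was consumed. The C sketch below follows that reading. The record layout and the merge_streams name are hypothetical, and ties are resolved in favor of stream A as an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical (value, attribute) pair as listed for Stream A and Stream B. */
typedef struct { int value; int attr; } record;

/* Merge two streams sorted by ascending value into out.
 * Returns the number of records written (na + nb). */
static size_t merge_streams(const record *a, size_t na,
                            const record *b, size_t nb, record *out)
{
    assert(out);
    size_t addr_a = 0, addr_b = 0, addr_out = 0;
    while (addr_a < na || addr_b < nb) {
        /* Condition code: 1 selects A, 0 selects B. */
        int take_a = (addr_b >= nb) ||
                     (addr_a < na && a[addr_a].value <= b[addr_b].value);
        /* Conditional store of A or B; the store address is sequential. */
        out[addr_out++] = take_a ? a[addr_a] : b[addr_b];
        /* Update address A or B, based on the comparison. */
        addr_a += (size_t)take_a;
        addr_b += (size_t)!take_a;
    }
    return addr_out;
}
```

Running this once per stage, with the output of one stage used as the input of the next, gives the log N passes of a merge sort.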
89. IMAX3: Amazing Dataflow-Centric CGRA and its Applications (Hash, FFT and string search)
90. SHA256
/* Parallel execution within a single data stream is impossible; only pipelined execution across many data streams is possible. */
for (i=0; i<ctx->mbuflen; i+=BLKSIZE) {
  for (th=0; th<thnum; th++) {
    sregs[th*8+0] = state[th*8+0]; sregs[th*8+1] = state[th*8+1]; sregs[th*8+2] = state[th*8+2]; sregs[th*8+3] = state[th*8+3];
    sregs[th*8+4] = state[th*8+4]; sregs[th*8+5] = state[th*8+5]; sregs[th*8+6] = state[th*8+6]; sregs[th*8+7] = state[th*8+7];
  }
  for (j=0; j<BLKSIZE; j+=BLKSIZE/DIV) {
    for (th=0; th<thnum; th++) {
      a = sregs[th*8+0]; b = sregs[th*8+1]; c = sregs[th*8+2]; d = sregs[th*8+3];
      e = sregs[th*8+4]; f = sregs[th*8+5]; g = sregs[th*8+6]; h = sregs[th*8+7];
#if (DIV==4)
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 0]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 0]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 1]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 1]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 2]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 2]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 3]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 3]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 4]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 4]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 5]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 5]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 6]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 6]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 7]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 7]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 8]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 8]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+ 9]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 9]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+10]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+10]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+11]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+11]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+12]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+12]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+13]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+13]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+14]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+14]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
      t1 = h+EP1(e)+CH(e,f,g)+k[j+15]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+15]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
#endif
      sregs[th*8+0] = a; sregs[th*8+1] = b; sregs[th*8+2] = c; sregs[th*8+3] = d;
      sregs[th*8+4] = e; sregs[th*8+5] = f; sregs[th*8+6] = g; sregs[th*8+7] = h;
    }
  }
  for (th=0; th<thnum; th++) {
    state[th*8+0] += sregs[th*8+0]; state[th*8+1] += sregs[th*8+1]; state[th*8+2] += sregs[th*8+2]; state[th*8+3] += sregs[th*8+3];
    state[th*8+4] += sregs[th*8+4]; state[th*8+5] += sregs[th*8+5]; state[th*8+6] += sregs[th*8+6]; state[th*8+7] += sregs[th*8+7];
  }
}
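The excerpt above uses EP0, EP1, CH and MAJ without showing their definitions. The sketch below fills them in with the standard SHA-256 round functions, on the assumption that EP0 and EP1 stand for the Σ0 and Σ1 functions of the standard. The ROTR macro and the sha256_round helper are hypothetical names added for illustration. One call to sha256_round performs the same update as one unrolled line of the excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Standard SHA-256 round functions on 32-bit words.
 * Arguments are expected to be uint32_t values. */
#define ROTR(x, n)   (((x) >> (n)) | ((x) << (32 - (n))))
#define CH(x, y, z)  (((x) & (y)) ^ (~(x) & (z)))
#define MAJ(x, y, z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
#define EP0(x)       (ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22))
#define EP1(x)       (ROTR(x, 6) ^ ROTR(x, 11) ^ ROTR(x, 25))

/* One round on the working state s[0..7] = a..h, with round constant kt
 * and message word wt (the k[...] and mbuf[...] terms in the excerpt). */
static void sha256_round(uint32_t s[8], uint32_t kt, uint32_t wt)
{
    assert(s);
    uint32_t t1 = s[7] + EP1(s[4]) + CH(s[4], s[5], s[6]) + kt + wt;
    uint32_t t2 = EP0(s[0]) + MAJ(s[0], s[1], s[2]);
    s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
    s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + t2;
}
```

Each round depends on the state produced by the previous one, which is why the translated comment in the excerpt notes that a single stream cannot be parallelized and only many independent streams can be pipelined.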