PBL1-v1-200e.pptx

CPU GPU
Ultimate CGRA w/ high-speed compiler
CGRA for Energy-efficient Cryptography
Beyond-Neuromorphic Systems
Non-Deterministic Computing
20210401
1
IMAX3: Amazing Dataflow-Centric CGRA
and its Applications

20210401
2
20180716 2
IT devices consume
60% of total power
Filled up by ＡＩ and BC
Society 5.0 w/ much CO2
IT devices consume
10% of total power
Efficient ＡＩ and BC
Zero-carbon society
202X Phase-down of Von-Neumann computers!?
Programmability∝ Power dissipation
Easiness to use ∝ Power dissipation

Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8 80-loops
ALUALU
I-Cache
ALU
ALU
D-Cache
Registers
I-Cache
ALU
ALU
D-Cache
Regs.
Regs.
D-Cache
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
All
instructions
are
executed.
I-Cache
ALU
ALU
D-Cache
Registers
3
Vect.Insn.1
Vect.Insn.2
Vect.Insn.3
Vect.Insn.4
Vect.Insn.5
Vect.Insn.6
Vect.Insn.7
Vect.Insn.8
Vector
Registers
Register
File
20210401
3
CPU ⇒ SIMD(vector) ⇒ MISD(CGRA)
"GoogleのTPUにも使われたシストリックアレイアー
キテクチャとDeep Learningについて", 富士通研究
所技術講演会, Jul. (2017)
"プログラマビリティを維持できる限界に向けて”,
SONY本社研究紹介, Mar. (2020)
"Deep Learningに向けたApproximate Computingと
シストリックアレイアーキテクチャ", 革新的コン
ピューティングの研究開発戦略検討会, CRDS/JST,
Jul. (2017)
"Approximate Computingとシストリックアレイ", ジス
クソフト技術講演会, Dec. (2017)
"99%メモリなアクセラレータIMAX(In Memory
Accelerator eXtension)", CAE計算環境研究会@関
西シスラボ第8回シンポジウム, Mar. (2017)
"Systolic Arrays as The Last Frontiers", Invited
talk in IPB Seminar and UI seminar @ Indonesia,
Jan. (2019)
“IMAX2: A CGRA with FPU+Multithreading+Chiplet",
Panel: CGRA and their Opportunities as Application
Accelerators, ASAP2021, invited panel, Jul. (2021)
“コンピュータ(データセンタ)の消費電力低減策
意見交換会”, LCS/JST, Jul. (2021)
"非ノイマン型の世界 -CGRAを含む最近の研究紹
介-", JEITAデバイス技術分科会招待講演, Nov.
(2021)
"CGRAのJITコンパイル化と高機能化の魔法教え
ます", 回路とシステムワークショップ招待講演, Aug.
(2022)

Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
for {
iter=0 iter=1 iter=2 iter=3 iter=4 iter=5 iter=6 iter=7
}
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Super Scalar, VLIW
I-Cache
ALU
ALU
D-Cache
Registers
EAG EAG
20210401
4

for {
}
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
VECTOR
Insn.1 Insn.1 Insn.1 Insn.1 Insn.1 Insn.1 Insn.1
Insn.1
I-Cache
ALU
ALU
D-Cache / Main Memory
Registers
EAG
V-insn.1
ALU
ALU
ALU
ALU
ALU
ALU
V-insn.2
Insn.2 Insn.2 Insn.2 Insn.2 Insn.2 Insn.2 Insn.2
Insn.2
20210401
5

for {
}
CGRA
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Insn.1
Insn.2
Insn.3
Insn.4
Insn.5
Insn.6
Insn.7
Insn.8
Local Memory
Registers EAG
ALU
ALU
Local Memory
Registers EAG
ALU
ALU
Local Memory
Registers EAG
ALU
ALU
Local Memory
Registers EAG
ALU
ALU
Insn.1 Insn.2
Insn.3 Insn.4
Insn.1 Insn.2 Insn.1 Insn.2
Insn.5 Insn.6
Insn.3 Insn.4
Insn.1 Insn.2
Insn.3 Insn.4
Insn.7 Insn.8
Insn.5 Insn.6
Insn.3 Insn.4
Insn.1 Insn.2
Insn.5 Insn.6
Insn.3 Insn.4
Insn.1 Insn.2
Insn.3 Insn.4
20210401
6

20210401
7
Scalar, SIMD and CGRA
time
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
I1
L2
VST
L2
VLD VLD
VFMA
MM
LD LM MM
LD LM FMA LM
ST LD LM LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST
LD LM MM
LD LM FMA LM
ST
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
LD LD FMA ST
D1 D1 D1
I1 I1 I1
L2
L2
MM
I1
I1
I1
I1
VST
VST
VST
VST
VFMA
VFMA
VFMA
VFMA
VLD
VLD
VLD
VLD
VLD
VLD
VLD
VLD
MM
Scalar
(VL=32)
Vector1
(VL=256)
Vector2
(VL=2048)
CGRA
(VL=16K)

20210401
8
Traditional style of programming based on procedures
A B C
D
for (i=0; i<128; i++)
D[i]=A[i]+B[i]*C[i];
D[256] A[256] B[256] C[256]
float A[256],B[256],C[256],D[256];
for (i=0; i<128; i++)
D[i+128]=A[i+128]+B[i]*C[i+128];
Main memory

D A B C B
D A C
9
Cache memory is convenient for high-speed
A B C
D
for (i=0; i<128; i++)
D[i]=A[i]+B[i]*C[i];
D[256] A[256] B[256] C[256]
for (i=0; i<128; i++)
D[i+128]=A[i+128]+B[i]*C[i+128];
Main memory
Cache memory
20210401

20210401
10
Explicitly spatial programming based on dataflow
D
A B C
D
A B C
D
A B C
Load Ai (top=A,len=64)
Load Bi (top=B,len=64)
Load Ci (top=C,len=64)
Di=Ai+Bi*Ci
Store Di (top=D,len=64)
Similar to assembly language, but has DMA info.
j=i+64; Load Aj (top=A+64,len=64)
Load Bi (top=B, len=64)
Load Cj (top=C+64,len=64)
Dj=Aj+Bi*Cj
Store Dj (top=D+64,len=64)
k=i+128;Load Ak (top=A+128,len=64)
Load Bi (top=B, len=64)
Load Ck (top=C+128,len=64)
Dk=Ak+Bi*Ck
Store Dk (top=D+128,len=64)
Can broadcast

1988 VPP 4way VLIW+8elem. Vector
My CGRAs began from VLIW+Vector processor
F D W
C
D
M M M M
M M M M D D M M M M
Load pipe Store pipe
FPU pipe
F D
F D
Cache lines are shared among
heterogeneous multithreads.
M M M M M M M M
D
W
M M M M M M M M
M M M M M M M M
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
E
C
E
E
E
E
M M M M
B
B
B
B
B
B
B
B
B
B
Vector Length
≤ 2048 dwords
($miss)
F D D
2008 LAPP 8way VLIW+32stage Array
C2
E
E
E
LD
M M M M
C1
E
E
E
LD
C0
E
E
E
E
B
B
B
B
B
C3
E
E
E
E
HW Mapper (8stages) HW Mapper (binary compatible)
Vector Length
≤ 1024 words
E
E
E
E
E
E
E
ST
E
E
E
LD
E
E
E
LD
E
E
E
E
E
E
E
ST
2006 OROCHI 9way VLIW+5way SS
20210401
11
Data from all cache ways are passed through array.
Location free LD/ST can keep binary compatibility.

CGRAs based on FPGA require much compilation time
20210401
12
One hour compilation. We are not so patient !
Much time for converting program + data

Start of IMAX
20210401
13
Source register file
(32bit x 2waySIMD = 64bit) x4
Condition code,
Pipelined FPU,
(32bit x 2wayFMA = 64bit)
Media UNIT (SAD,…),
Stochastic FMA (32 elements)、
Hash function for SHA256,
Address generator
Local memory (64KB)
Destination register file
(32bit x 2waySIMD = 64bit) x4
CGRA requires data
path similar to CPU
core.

4way multithreading
20210401
14
Double buffered 16 registers
Column multithreading
Local memory variable
partitioning
(64KB, 32KB, 16KB, mix)
Double buffered 16 registers
Most CGRAs have no FPU.
Normal CGRAs dislike
multicycle operation.
Employing multithreading
breaks the limitation.

Slave interface
20210401
15
Simultaneous selective
receive into 4 logical
partitions (256bit/cycle)
Simultaneous selective send
and merge (256bit/cycle)
Connected to succeeding unit
Connected as slave
Low-cost broadcasting
and gathering from CPU
due to selective address
filters in local memory
DMAs are optimized and
integrated by compiler

20210401
16
Folding
Similar to CPU data path

20210401
17
Dual port
Four octa-word load
Eight double-word load
Four 64bit read-modify-write
Four partitions in maximum
Has double buffering

20210401
18
Four physical pass through
Sixteen logical pass through
Pass through

20210401
19
- Compression of matrices
- Sparse matrix multiplication
- Merge sort
Address synchronizer

40ops/1unit, 2560ops/64units, 10240ops/4chips 20210401
20
load store
octa-ld
Add/sub/mul
And/or/xor
Shift
+FPU
+Media
octa-ld
Add/sub/mul
And/or/xor
Shift
+FPU
+Media
For
1st column
Double buffer
for write
for read

20210401
21
For
2nd column
load store
octa-ld
Add/sub/mul
And/or/xor
Shift
+FPU
+Media
octa-ld
Add/sub/mul
And/or/xor
Shift
+FPU
+Media
Double buffer
for write
for read
40ops/1unit, 2560ops/64units, 10240ops/4chips

20210401
22
For
3rd column
load store
octa-ld
Add/sub/mul
And/or/xor
Shift
+FPU
+Media
octa-ld
Add/sub/mul
And/or/xor
Shift
+FPU
+Media
Double buffer
for write
for read

20210401
23
For
4th column
load store
octa-ld
Add/sub/mul
And/or/xor
Shift
+FPU
+Media
octa-ld
Add/sub/mul
And/or/xor
Shift
+FPU
+Media
Double buffer
for write
for read

How to put 64 units
20210401
24
F F A A
F F A A
F F M M
C C M M
R R R R
R

Programming mode is 4 logical units per row
20210401
25
R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R

Expand to vertical directions
20210401
26
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R

Expand to multi-chip
20210401
27
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R

Multiplex 4 logical units into a physical unit
20210401
28
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R

Reshaping as a ring
20210401
29
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R
F F A A
F F A A
F F M M
C C M M
R R R R

A
B
C
D
E
A
B
C
D
E
A
B
C
D
E
Final structure of 64 units and 4 chips (2560 operations x 4)
64bit
ARM
AXIIF
2560op x 4
64KB x 64 x 4
20210401
30

81920 operations with HBM2
20210401
31

HOST
20210401
32
Execution network and memory network
Ring for execution Eight lanes for memory IF

20210401
33
Multiplexing local memory

20210401
34
Non-stop data flow is more important than ALU array
load+burst-execution or burst-execution&prefetch

20210401
35
Non-stop data flow is more important than ALU array
burst-execution+store or post-drain&burst-execution

20210401
36
Micro pipelining and macro pipelining w/ double buffering
Macro pipelining with log N-units
Merge sort O(N)
FFT O(N)
Writing next result & Reading previous result

20210401
37
Non-stop data flow
C
O
N
F
R
E
G
s
A
D
D
R
Overlapping post-drain, burst-execution and prefetch
L
M
M
L
M
M
L
M
M
Burst exec.
R
E
G
s
A
D
D
R
L
M
M
L
M
M
L
M
M
Burst exec.
C
O
N
F
A
D
D
R
R
E
G
s
Sequential execution
L
M
M
L
M
M
L
M
M
Burst exec. A
D
D
R
R
E
G
s
L
M
M
time
L
M
M
L
M
M
Burst exec.
PIO/DMA External Memory
PIO/DMA External Memory
R
E
G
s
A
D
D
R
L
M
M
L
M
M
L
M
M
Burst exec.

Sandwich structure of ALU and memory seems best
More complicated memory access for light field, graph, string search, AI, …
20210401
38

39
・中島康彦, 木村睦, 張任遠: "制御装置（スパイクメモリ構成方法）", 特願2021- 27859 (2021. 2. 24)
・トランティホン, 中島康彦: "処理要素、その制御方法および制御プログラム、並びに処理装置（BC）", 特願2021-009164 (2021. 1. 22)
・中島康彦, 高前田伸也: "データ処理装置（メモリ内蔵アクセラレータの構成方法）", 中国ZL201680019602 (2020. 12. 11)
・中島康彦: "データ処理装置（高効率アクセラレータ構成方法）", PCT/JP2020/025123 (2020. 6. 26)
・中島康彦, 木村睦, 張任遠: "データ処理装置（メムキャパシタ構成方法）", 特願2020-91392 (2020. 5. 26)
・中島康彦: "データ処理装置（高効率アクセラレータ構成方法）", 特願2019-517698 (2019. 9. 19)
・Yasuhiko Nakashima, Shinya Takamaeda: "Data processing Device", United States Patent 10,275,392 (2019.4.30)
・中島康彦: "データ処理装置（NCHIP制御方法）", 特願2019-121853 (2019. 6. 28)
・Yasuhiko Nakashima, Takashi Nakada: "Data processing Device for Performing a Plurality of Calculation Processes in Parallel", European Patent Application No.09820420.9 (H31. 1. 18)
・中島康彦: "データ処理装置（高効率アクセラレータ構成方法）", PCT/JP2018/018169 (H30. 5. 10)
・中島康彦: "データ処理装置（高効率アクセラレータ構成方法）", 特願2017-96061 (H29. 5. 12)
・Jun Yao, Yasuhiko Nakashima, Tao Wang, Wei Zhang, Zuqi Liu, Shuzhan Bi: "METHOD FOR ACCESSING MEMORY OF MULTI-CORE SYSTEM, RELATED APPARATUS, SYSTEM, AND ST
MEDIUM", PCT/CN2017/083523 (2017. 5. 8)
・中島康彦, 高前田伸也: "データ処理装置（メモリ内蔵アクセラレータの構成方法）", PCT/JP2016/061302 (H28. 4. 6)
・中島康彦, 高前田伸也: "データ処理装置（メモリ内蔵アクセラレータの構成方法）", 特願2015-079552 (H27. 4. 8)
・中島康彦: "エミュレーション方式", 特願2013-055660 (H25. 3. 18)
・中島康彦, 姚駿: "データ供給装置及びデータ処理装置", PCT/JP2013/057503 (H25. 3. 15)
・中島康彦, 姚駿: "データ供給装置及びデータ処理装置", 特願2012-061110 (H24. 3. 16)
20210401
39
28nmLSI : 200x performance/area compared with GPGPU
Largest prototype for 10240 operations

20210401
40
Performance comparison
Kernel
CPU
ARMv8 1.2GHz
GPU 256core
JetsonTX2 1.3GHz
DDR4 480Gbps
16nm 43.6mm²
CGRA 64core*4
IMAX2 140MHz
DDR4 40Gbps
[28nm想定 14.6mm² *4]
8nm想定 1.2mm² *4
GPU 3584core
GTX1080Ti 1.5GHz
GDDR5 3872Gbps
16nm 471mm²
GPU 10496core
RTX3090 1.4GHz
GDDR6X 7490Gbps
8nm 628mm²
DDR bandwidth 12 1 97 187
Power 7.5W ARM 0.6W + [31W] 2.7W 250W 350W
MM 3160msec 170 16 [3msec]
[EDP=284] EDP=30
12
EDP=36K
1.2
EDP=504
CNN 2080msec
280
EDP=588K
23 [4msec]
[EDP=505] EDP=53
18
EDP=81K
2.9
EDP=2943
Lightfield 14500msec
1190
EDP=10.6M
754 [126msec]
[EDP=501K] EDP=52K
43
EDP=462K
35
EDP=428K
Sparse
MM 32002 - - 333+469 [134ms]
[EDP=567K] EDP=59K
Cusparse使用
2044
EDP=1045M
Cusparse使用
280
EDP=27.4M
Sparse
MM 40002 - - 2378+734 [519ms]
[EDP=8.51M] EDP=889K
Cusparse使用
3492
EDP=3049M
Cusparse使用
350
EDP=43.1M

Top-down approach
20210401
41
HBM2
HOST HOST
I/O I/O
FFT CONV
MM SORT SpMM

Multilevel pipelining
20210401
42
HBM2

20210401
43
75% is memory block
CGRA 64core*4
IMAX2 140MHz
DDR4 40Gbps
If 8nm, 1.2mm² *4
GPU 10496core
RTX3090 1.4GHz
GDDR6X 7490Gbps
8nm 628mm²
75% is DP-SRAM https://thinkcomputers.org/renowned-ir-photographer-fritzchens-fritz-
shares-die-shots-of-nvidia-3000-series-ga-102-silicon/
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAMLogic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
DP
SRAM
Logic
External interface
DP
SRAM

IMAX (64units) x 120 modules = 307200 operations / 4 cycles
20210401
44
HBM2
HOST HOST
IMAX (1.2mm2/8nm) x 120 modules ≃ 144mm2 ?

CPU GPU
20210401
45
(Image processing)

/* SCREEN=WD*HT */
for (row=0; row<HT; row++) {
for (col=0; col<WD; col++) {
pix = in[row*WD+col];
r = t[ pix>>24 ];
g = t[256+((pix>>16)&255)];
b = t[512+((pix>> 8)&255)];
out[row*WD+col]=r<<24 | g<<16 | b<<8;
} }
20210401
46
Tone curve
Load →
Store ←
Color map tables

int loop=WD/2;
//EMAX5A begin tone_curve mapdist=0
while (loop--) {
mop(OP_LDR, &BR[0][1][1], in++, 0LL, MSK_D0, in, WD, 0, 0, NULL, 0);
mop(OP_LDBR, &BR[1][1][1], t1, BR[0][1][1], MSK_B3, t, 256*3/4, 0, 0, NULL, 0);
exe(OP_CCAT, &r1, BR[1][1][0], EXP_H3210, BR[1][1][1], EXP_H3210, 0, EXP_H3210,OP_NOP,0,OP_NOP,0);
exe(OP_MMRG, &r0, r1, EXP_H3210, r2, EXP_H3210, r3, EXP_H3210,OP_NOP,0,OP_NOP,0);
mop(OP_STR, &r0, out++, 0LL, MSK_D0, out, WD, 0, 0, NULL, 0);
}
//EMAX5A end
20210401
47
2way SIMD in IMAX
SIMD Load →
SIMD Store ←
Color map tables

20210401
48
2way SIMD in IMAX
CCAT
Six loads from color tables

Input
pixels
Load
Sort
Select center
Output
4bytes
4bytes
20210401
49
Median filter

20210401
50
Median filter
LM
LM
LM
LM
Main
Memory
PREFETCH
DRAIN
LM
LM

20210401
51
Median filter
①
⑮
④
②
③
⑤
⑥
⑦
⑧
⑨
⑩
⑪
⑫
⑬
⑭
⑯
LM
LM
LM
LM
PREFETCH
LM
LM
DRAIN

20210401
52
Unsharp mask
//EMAX5A begin unsharp mapdist=1
for (CHIP=0; CHIP<NCHIP; CHIP++) { /* output channels are parallelized by multi-chip (OC/#chip) */
for (INIT0=1,LOOP0=WD,col=0-4LL; LOOP0--; INIT0=0) {
exe(OP_ADD, &col, col, EXP_H3210, 4LL, EXP_H3210, 0, EXP_H3210, OP_AND, 0x00000ffffffffLL, OP_NOP,0);
exe(OP_ADD, &in_center, row_center, EXP_H3210, col, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
mop(OP_LDWR, &r1, in_center, -1276, MSK_D0, row_prev, WD, NULL, 0);
exe(OP_MAUH, &r11, r1, EXP_B5410, r2, EXP_B5410, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
mop(OP_LDWR, &r6, in_center, 4, MSK_D0, row_center, WD, NULL, 0);
mop(OP_LDWR, &r7, in_center, -4, MSK_D0, row_center, WD, NULL, 0);
mop(OP_LDWR, &r0, in_center, 0, MSK_D0, row_center, WD, NULL, 0);
exe(OP_MLUH, &r20, r0, EXP_B5410, 239, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_MLUH, &r21, r0, EXP_B7632, 239, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
mop(OP_LDWR, &r3, in_center, 1284, MSK_D0, row_next, WD, row_next_next, WD);
mop(OP_LDWR, &r4, in_center, 1276, MSK_D0, row_next, WD, row_next_next, WD);
mop(OP_LDWR, &r8, in_center, 1280 , MSK_D0, row_next, WD, row_next_next, WD);
exe(OP_MAUH3, &r11, r3, EXP_B5410, r4, EXP_B5410, r11, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_MLUH, &r13, r11, EXP_H3210, 13, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_NOP, &r7, r15, EXP_H3210, 0LL, EXP_H3210, 0, EXP_H3210, OP_OR, 0, OP_SRLM, 2);
exe(OP_NOP, &r8, r16, EXP_H3210, 0LL, EXP_H3210, 0, EXP_H3210, OP_OR, 0, OP_SRLM, 2);
exe(OP_MSUH3, &r10, r20, EXP_H3210, r7, EXP_H3210, r17, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_MSUH3, &r11, r21, EXP_H3210, r8, EXP_H3210, r18, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_MSUH, &r20, r10, EXP_H3210, r13, EXP_H3210, 0, EXP_H3210, OP_OR, 0, OP_SRLM, 7);
exe(OP_MSUH, &r21, r11, EXP_H3210, r14, EXP_H3210, 0, EXP_H3210, OP_OR, 0, OP_SRLM, 7);
exe(OP_MH2BW, &r31, r21, EXP_H3210, r20, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
mop(OP_STWR, &r31, out_center, col, MSK_D0, out_center, WD, row_prev, WD);
}
}
//EMAX5A end
}
r1 r5 r2
r7 r0 r6
r4 r8 r3

20210401
53
Frame interpolation

20210401
54
Super resolution
for (Y=0; Y<768; Y++) {
k = Y*240/768;
kfraq = (((Y*240)<<4)/768)&15;
for (X=0; X<1024; X++) {
l = X*320/1024;
lfraq = (((X*320)<<4)/1024)&15;/
out[Y][X] = ((in[k ][l ]>>24&0xff)*r1 + (in[k ][l-1]>>24&0xff)*r2 + (in[k ][l+1]>>24&0xff)*r3
+ (in[k-1][l ]>>24&0xff)*r4 + (in[k+1][l ]>>24&0xff)*r5 + (in[k-1][l-1]>>24&0xff)*r6
+ (in[k-1][l+1]>>24&0xff)*r7 + (in[k+1][l-1]>>24&0xff)*r8 + (in[k+1][l+1]>>24&0xff)*r9)/256<<24
| ((in[k ][l ]>>16&0xff)*r1 + (in[k ][l-1]>>16&0xff)*r2 + (in[k ][l+1]>>16&0xff)*r3
+ (in[k-1][l ]>>16&0xff)*r4 + (in[k+1][l ]>>16&0xff)*r5 + (in[k-1][l-1]>>16&0xff)*r6
+ (in[k-1][l+1]>>16&0xff)*r7 + (in[k+1][l-1]>>16&0xff)*r8 + (in[k+1][l+1]>>16&0xff)*r9)/256<<16
| ((in[k ][l ]>> 8&0xff)*r1 + (in[k ][l-1]>> 8&0xff)*r2 + (in[k ][l+1]>> 8&0xff)*r3
+ (in[k-1][l ]>> 8&0xff)*r4 + (in[k+1][l ]>> 8&0xff)*r5 + (in[k-1][l-1]>> 8&0xff)*r6
+ (in[k-1][l+1]>> 8&0xff)*r7 + (in[k+1][l-1]>> 8&0xff)*r8 + (in[k+1][l+1]>> 8&0xff)*r9)/256<<8;
}
}
K-1
K = Y*240/768
L-1 L = X*320/1024
(X,Y) 1024x768の1画素
(L,K) 320x240画像
kfraq = (((Y*240)<<4)/ 768)&15;/* 4bit */
lfraq = (((X*320)<<4)/1024)&15;/* 4bit */
Y=1 kfraq= 5/16
Y=2 kfraq=10/16
Y=3 kfraq=15/16
Y=4 kfraq= 4/16
Y=5 kfraq= 9/16
X=1 lfraq= 5/16
X=2 lfraq=10/16
X=3 lfraq=15/16
X=4 lfraq= 4/16
X=5 lfraq= 9/16

20210401
55
Stereo matching
for (row=8; row<240-8; row++) {
for (Y=-8; Y<8; Y++) {
for (col=8; col<320-8; col++) {
for (X=-8; X<8; X++)
SAD2[row][col] += sad(L[row+Y][col+X+視差], R[row+Y][col+X]);
}
}
}

for (row=-8; row<240-8; row++) {
//EMAX5A begin wdifline mapdist=0
for (CHIP=0; CHIP<NCHIP; CHIP++) {
for (INIT0=1,LOOP0=320,col=0-4LL; LOOP0--; INIT0=0) {
exe(OP_ADD, &col, col, EXP_H3210, 4LL, EXP_H3210, 0, EXP_H3210, OP_AND, 0x00000000ffffffffLL, OP_NOP, 0);
exe(OP_ADD, &rofs1, L, EXP_H3210, col, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_ADD, &rofs2, R, EXP_H3210, col, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
mop(OP_LDWR, &r2, rofs1, 0, MSK_D0, L, 320, 0, 0, NULL, 320);
mop(OP_LDWR, &r6, rofs2, 0, MSK_D0, R, 320, 0, 0, NULL, 320);
exe(OP_MSAD, &r22, r2, EXP_H3210, r6, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_MSSAD, &r12, r22, EXP_H3210, r12, EXP_H3210, r16, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_MAUH3, &r31, r22, EXP_H3210, r23, EXP_H3210, r24, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
exe(OP_MAUH3, &r1, r31, EXP_H3210, r25, EXP_H3210, 0, EXP_H3210, OP_SUMHL,0, OP_NOP, 0);
mop(OP_LDWR, &BR[8][0][1], sad2, col, MSK_D0, sad2, 320, 0, 1, NULL, 320);
exe(OP_ADD, &AR[8][0], BR[8][0][1], EXP_H3210, r1, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0);
mop(OP_STWR, &AR[8][0], col, sad2, MSK_D0, sad2, 320, 0, 1, NULL, 320);
：のこり15か所のSADにも加算
}
}
//EMAX5A end
}
//EMAX5A drain_dirty_lmm
20210401
56
Stereo matching

CPU GPU
20210401
57
(Machine learning)

20210401
58
Machine learning = training + inference

GPU: unpack and matrix-multiplication (x9 memory space)
1 -1-1-1 1 -1-1-1 1 1 -1-1-1 1 -1-1-1 1
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3 … 0.27
1.3 … 1.27
2.3 … 2.27
: : :
: : :
27.0 27.1 27.2
: …
: : :
27.3 … 27:27
unpack_patch2col
1.0 1.1
1.1 1.2
1.2 1.3
1.0 1.1 …
1.1 1.2 :
1.2 1.3 …
1:25 2.0 2.1
1.26 2.1 2.2
1.27 2.2 2.3
… 1.25 25.0
:… 1.26 25.1
… 1.27 25.2
25.1 … 25.25
25.2 … 25.26
25.3 … 25.27
… 2.25 26.0
: 2.26 26.1
… 2.27 26.2
26.1 … 26.25
26.2 : 26.26
26.3 … 26:27
: : :
2.2 2.3 …
: : :
2.27 3.4 3.5
: : :
… 3.27 27.2
: : :
27.3 … 27:27
3x3
26x26
28
28
…
…
…
…
:
…
: : : : 4.5 4.27
4.4
x100
x100
copy
0.0 0.1 … 0.25
0.1 0.2 … 0.26
0.27
…
0.3
0.2
1 -1 -1
-1 1 -1
-1-1 1
C
M&A
Convolution on CPU/GPU
20210401
59

0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3 … 0.27
1.3 … 1.27
2.3 … 2.27
: : :
: : :
27.0 27.1 27.2
: …
: : :
27.3 … 27:27
…
…
…
…
:
…
: : : : 4.5 4.27
4.4
0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3 … 0.27
1.3 … 1.27
2.3 … 2.27
: : : : …
…
…
…
…
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3 … 0.27
1.3 … 1.27
2.3 … 2.27
: : : : …
…
…
…
…
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3 … 0.27
1.3 … 1.27
2.3 … 2.27
: : : : …
…
…
…
…
-1 -1
1 -1 1
1 1 -1
-1 … 1
-1 … 1
1 … 1
: : : : …
…
…
…
…
1 -1
1 -1 1
1 1 -1
-1 … 1
-1 … 1
1 … 1
: : : : …
…
…
…
…
1 -1 -1
1 -1 1
1 1 -1
-1 …
-1 … 1
1 … 1
: : : : …
…
…
…
…
0.0 0.1
-1
1
OCH#0
OCH#1
OCH#2
OCH#0
OCH#1
OCH#2
OCH#0
OCH#1
OCH#2
0.0 0.1 0.2
1.0 1.1 1.2
0.25
1.25
: : :
…
…
…
0.0 0.1 0.2
1.0 1.1 1.2
0.25
1.25
: : :
…
…
…
0.0 0.1 0.2
1.0 1.1 1.2
0.25
1.25
: : :
…
…
…
0.0 0.1 0.2
1.0 1.1 1.2
0.25
1.25
: : :
…
…
…
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
１ -１ -１
-１１ -１
-１ -１１
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3 … 0.27
1.3 … 1.27
2.3 … 2.27
: : :
: : :
27.0 27.1 27.2
: …
: : :
27.3 … 27:27
…
…
…
…
:
…
: : : : 4.5 4.27
4.4
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3 … 0.27
1.3 … 1.27
2.3 … 2.27
: : :
: : :
27.0 27.1 27.2
: …
: : :
27.3 … 27:27
…
…
…
…
:
…
: : : : 4.5 4.27
4.4
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3 … 0.27
1.3 … 1.27
2.3 … 2.27
: : :
: : :
27.0 27.1 27.2
: …
: : :
27.3 … 27:27
…
…
…
…
:
…
: : : : 4.5 4.27
4.4
ICH#0
ICH#0
ICH#1
ICH#2
ICH#3
ICH#0
ICH#1
ICH#2
ICH#3
OCH#0
OCH#1
1
1
FMUL
FMA
ICH#1
ICH#5
54 units
Kernel bcast
ICH#0 bcast
ICH#1 bcast
Drain
4 columns
20210401
60
Convolution on CGRA (hardware unpacking)

imax() {
Ull CHIP; Ull LOOP1, LOOP0; Ull INIT1, INIT0;
Ull AR[64][4]; Ull BR[64][4][4];
Ull r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15;
Ull cc0, cc1, cc2, cc3, ex0, ex1; Ull cofs, rofs, oofs, k;
for (top=0; top<M1/NCHIP; top+=RMGRP) { /* will be parallelized by multi-chip (M/#chip) */
for (blk=0; blk<L; blk+=H) {
typedef struct {Uint i[8]} Ui8;
Uint *a0[NCHIP]; Uint *a[H][NCHIP]; Ui8 *b[H], *b0[H], *b1[H], *b2[H], *b3[H];
Ui8 *c0[NCHIP]; Ui8 *c00[NCHIP], *c01[NCHIP], *c02[NCHIP], *c03[NCHIP];
for (k=0; k<H; k++) {
b[k] = B+(blk+k)*M2; b0[k] = b[k]; b1[k] = (Uint*)b[k]+2; b2[k] = (Uint*)b[k]+4; b3[k] = (Uint*)b[k]+6;
}
for (CHIP=0; CHIP<NCHIP; CHIP++) { /* will be parallelized by multi-chip (M/#chip) */
a0[CHIP] = A+(CHIP*M1/NCHIP+top)*L;
for (k=0; k<H; k++) a[k][CHIP] = a0[CHIP]+blk+k;
c0[CHIP] = C1+(CHIP*M1/NCHIP+top)*M2;
c00[CHIP]= (Uint*)c0[CHIP]+0; c01[CHIP]= (Uint*)c0[CHIP]+2; c02[CHIP]= (Uint*)c0[CHIP]+4; c03[CHIP]= (Uint*)c0[CHIP]+6;
}
//EMAX5A begin mm mapdist=0
/*3*/ for (CHIP=0; CHIP<NCHIP; CHIP++) { /* will be parallelized by multi-chip (M/#chip) */
/*2*/ for (INIT1=1,LOOP1=RMGRP,rofs=(0-L*4)<<32|((0-M2*4)&0xffffffff); LOOP1--; INIT1=0) { /* stage#0 *//* mapped to FOR() on BR[63][1][0] */
/*1*/ for (INIT0=1,LOOP0=M2/W/2,cofs=(0-W*8)<<32|((0-W*8)&0xffffffff); LOOP0--; INIT0=0) { /* stage#0 *//* mapped to FOR() on BR[63][0][0] */
exe(OP_ADD, &cofs, INIT0?cofs:cofs, EXP_H3210, (W*8)<<32|(W*8), EXP_H3210, 0LL, EXP_H3210, OP_AND, 0xffffffffffffffffLL, OP_NOP, 0LL);
exe(OP_ADD, &rofs, rofs, EXP_H3210, INIT0?(L*4)<<32|(M2*4):0, EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL);
exe(OP_ADD, &oofs, rofs, EXP_H3210, cofs, EXP_H3210, 0, EXP_H3210, OP_AND, 0xffffffff, OP_NOP, 0LL); /* stage#1 */
mop(OP_LDR, 3, &BR[1][0][1], (Ull)b0[0], (Ull)cofs, MSK_W1, (Ull)b[0], M2/2, 0, 0, (Ull)NULL, M2/2); /* stage#1 */
mop(OP_LDR, 3, &BR[1][1][0], (Ull)b3[0], (Ull)cofs, MSK_W1, (Ull)b[0], M2/2, 0, 0, (Ull)NULL, M2/2); /* stage#1 2KB */
mop(OP_LDUWR,1, &BR[1][2][1], (Ull)a[0][CHIP], (Ull)rofs, MSK_W1, (Ull)a0[CHIP], L*RMGRP/2, 0, 0, (Ull)NULL, L*RMGRP/2); /* stage#1 16KB */
exe(OP_FML, &AR[2][0], BR[1][0][1], EXP_H3210, BR[1][2][1], EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL); /* stage#2 */
sgemm00_core1( 2, 1, 3);
sgemm00_core1( 3, 2, 4);
:
sgemm00_core1(59, 58, 60);
sgemm00_core1(60, 59, 61);
sgemm00_final(61, 62);
} } }
//EMAX5A end
} }
}
Just repeating following macros
Gemm in IMAX
20210401
61

#define sgemm00_core1(r, rm1, rp1)
mop(OP_LDR, 3, &BR[r][0][1], b0[rm1], cofs, MSK_W1, b[rm1], M2/2, 0, 0, NULL, M2/2);
mop(OP_LDUWR,1, &BR[r][2][1], a[rm1][CHIP], rofs, MSK_W1, a0[CHIP], L*RMGRP/2, 0, 0, NULL, L*RMGRP/2);
exe (OP_FMA, &AR[rp1][0], AR[r][0], EXP_H3210, BR[r][2][1], EXP_H3210, BR[r][0][1], EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL);
exe (OP_FMA, &AR[rp1][3], AR[r][3], EXP_H3210, BR[r][2][1], EXP_H3210, BR[r][1][0], EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL)
#define sgemm00_final(r, rp1)
exe(OP_CMP_LT, &cc1, cofs, EXP_H3210, cofslimit1, EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL);
mop(OP_LDUWR, 1, &BR[rp1][0][1], c00[CHIP], oofs, MSK_W0, c0[CHIP], Clen, 0, 1, NULL, Clen);
exe (OP_FAD, &AR[rp1][0], AR[r][0], EXP_H3210, BR[rp1][0][1], EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL);
mop(OP_STWR, 1, &AR[rp1][0], oofs, c00[CHIP], MSK_D0, c0[CHIP], Clen, 0, 1, NULL, Clen);
cex (OP_CEXE, &ex1, 0, 0, 0, cc1, 0xaaaa);
mop(OP_STWR, ex1, &AR[rp1][1], oofs, c01[CHIP], MSK_D0, c0[CHIP], Clen, 0, 1, NULL, Clen);
mop(OP_STWR, ex2, &AR[rp1][2], oofs, c02[CHIP], MSK_D0, c0[CHIP], Clen, 0, 1, NULL, Clen);
mop(OP_STWR, ex3, &AR[rp1][3], oofs, c03[CHIP], MSK_D0, c0[CHIP], Clen, 0, 1, NULL, Clen)
One physical (four logical) unit
One physical (four logical) unit
Inside of macros
Prefetching
Length of area
Top of area
LD/ST offset
LD/ST base addr.
20210401
62

#define sgemm00_core1
mop(OP_LDUWR, 1, &BR[r][0][1]
exe(OP_FMA, &AR[rp1][0]
#define sgemm00_final
exe(OP_CMP_LT, &cc1
exe(OP_CMP_LT, &cc2
exe(OP_CMP_LT, &cc3
mop(OP_LDUWR, 1, &BR[rp1][0][1]
exe(OP_FAD, &AR[rp1][0], AR[r][0]
mop(OP_STWR, 1, &AR[rp1][0]
cex(OP_CEXE, &ex1, cc1
mop(OP_STWR, ex1, &AR[rp1][1]
20210401
63
Last part of gemm

unpack_patch2col(tmp_col:25x[100x24x24] <-- ninput:■100x28x28, ksize, kstride)
reshape nhidden:100x[8x24x24] --> tmp_dst:8x[100x24x24]
(C1)multiply_float2D(▲g_Ki2h:8x25 <-- tmp_dst:8x[100x24x24], tmp_col:25x[100x24x24]^T)
(C2)multiply_float2D(tmp_col:25x[100x24x24] <-- Ki2h:8x25^T, tmp_dst:8x[100x24x24])
pack_col2patch(ninput:■100x28x28 <-- tmp_col:25x[100x24x24], 5, 1)
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
… … 0.27
… … 1.27
… … 2.27
: : :
: : :
27.0 27.1 27.2
… …
: : :
27.3 … 27:27
28
28
0:23
1:23
2:23
:
:
…
23:0 23:1 23:2 … … 4.27
23.23
xICx100
24
24
x100x8
tmp_dst
ninput
(original)
g_Ki2h
(C1)
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.23
1.23
2.23
: : :
23.0 23.1 23.2
:
23.23
…
…
…
…
…
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3
1.3
2.3
: : : :
0.4
1.4
2.4
:
: : : : 4.4
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3
1.3
2.3
: : : :
0.4
1.4
2.4
:
: : : : 4.4
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3
1.3
2.3
: : : :
0.4
1.4
2.4
:
: : : : 4.4
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3
1.3
2.3
: : : :
0.4
1.4
2.4
:
: : : : 4.4
0.0 0.1 …
: : …
0.4 0.5 …
0.23 1.0 1.1
: : :
0:27 1.4 1.5
1.0 1.1 …
: : :
1.4 1.5 …
1:23 2.0 2.1
: : :
1.27 2.4 2.5
… 1.23 23.0
:… : :
… 1.27 23.4
23.1 … 23.23
: … :
23.5 … 23.27
… 2.23 24.0
: : :
… 2.27 24.4
24.1 … 24.23
: : :
24.5 … 24:27
: : :
4.4 4.5 …
: : :
4.27 5.4 5.5
: : :
… 5.27 27.4
: : :
27.5 … 27:27
5x5
xIC
24x24
x100
tmp_col
(reshaped for GPU)
copy
5x5xICx8
nhidden 24 x8x100
20210401
64
Backpropagation for weights

unpack_patch2col(tmp_col:25x[100x24x24] <-- ninput:■100x28x28, ksize, kstride)
reshape nhidden:100x[8x24x24] --> tmp_dst:8x[100x24x24]
(C1)multiply_float2D(▲g_Ki2h:8x25 <-- tmp_dst:8x[100x24x24], tmp_col:25x[100x24x24]^T)
(C2)multiply_float2D(tmp_col:25x[100x24x24] <-- Ki2h:8x25^T, tmp_dst:8x[100x24x24])
pack_col2patch(ninput:■100x28x28 <-- tmp_col:25x[100x24x24], 5, 1)
24
24
x8x100
nhidden
Ki2h
ninput
(C2)
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3
1.3
2.3
: : : :
0.4
1.4
2.4
:
: : : : 4.4
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3
1.3
2.3
: : : :
0.4
1.4
2.4
:
: : : : 4.4
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3
1.3
2.3
: : : :
0.4
1.4
2.4
:
: : : : 4.4
0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.3
1.3
2.3
: : : :
0.4
1.4
2.4
:
: : : : 4.4
0.0 0.1
0.0 0.2
1.0 1.1 1.2
2.0 2.1 2.2
… … 0.27
… … 1.27
… … 2.27
: : :
: : :
27.0 27.1 27.2
… …
: : :
27.3 … 27:27
28
28
0:23
1:23
2:23
:
:
…
23:0 23:1 23:2 … … 4.27
23.23
xICx100
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.23
1.23
2.23
: : :
:23.0 23.1 23.2
:
23.23
…
…
…
…
…
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.23
1.23
2.23
: : :
:23.0 23.1 23.2
:
23.23
…
…
…
…
…
0.0 0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.23
1.23
2.23
: : :
:23.0 23.1 23.2
:
23.23
…
…
…
…
…
0.1 0.2
1.0 1.1 1.2
2.0 2.1 2.2
0.23
1.23
2.23
: : :
23.0 23.1 23.2
:
23.23
…
…
…
…
…
0.0
0.0 0.1 …
: …
0.4 0.5 …
0.23 1.0 1.1
: 1.1 :
0:27 1.4 1.5
1.0 1.1 …
1.1 : :
1.4 1.5 …
1:23 2.0 2.1
: : :
1.27 2.4 2.5
… 1.23 23.0
:… : :
… 1.27 23.4
23.1 … 23.23
: … :
23.5 … 23.27
… 2.23 24.0
: : :
… 2.27 24.4
24.1 … 24.23
: : :
24.5 … 24:27
: : :
4.4 4.5 …
: : :
4.27 5.4 5.5
: : :
… 5.27 27.4
: : :
27.5 … 27:27
5x5
xIC
24x24
x100
contribution
of pixel/ninput
:
merge
5x5xICx8
Integration
of OC(8)
0.1
tmp_col
(reshaped for GPU)
20210401
65
Backpropagation for input channel

０．５０×０．５０≒０．２０
００００１１１１０１
０１１１１１００００
００００１１００００
Multiplier by AND gate
０．５０＋０．５０≒０．５０＊２
００００１１１１０１
０１１１１１００００
０１０１１１１０００
Scaled adder by random selector
20210401
66
Stochastic FMA
０．５０＋０．５０＝１．００
Adder by population counter
Tati Erlina, Yan Chen, Renyuan Zhang and Yasuhiko Nakashima: "An Efficient Time-based Stochastic Computing
Circuitry Employing Neuron-MOS", GLSVLSI2019, pp.51-56, May. (2019)

CPU GPU
20210401
68
(High-degree stencil computation)

20210401
69
Variety of stencil computations

20210401
70
Mapping of FD6
Green: loads in x-axis
Purple: loads in y-axis
Red: loads in z-axis

20210401
71
Mapping of FD6
FD6

20210401
72
Another mapping of FD6

20210401
73
Light field image processing
Refocus All in Focus Perspective Shift

20210401
74

20210401
75
Rendering
Depth
extraction

CPU GPU
20210401
76
（Inverse matrix)

20210401
77
Inverse matrix is a special case of linear equation
Linear equation
LU decomposition
Forward elimination
Backward substitution

20210401
78
Read-modify-write for inverse matrix
for (j=i+1; j<M; j+=NCHIP*H*RMGRP) { /* 行方向 */
//EMAX5A begin inv_x1 mapdist=0
for (INIT1=1,LOOP1=RMGRP,rofs=0-M*4; LOOP1--; INIT1=0) { /* stage#0 *//* mapped to FOR() on BR[63][1][0] */
for (INIT0=1,LOOP0=M-(i+1),cofs=0; LOOP0--; INIT0=0) { /* stage#0 *//* mapped to FOR() on BR[63][0][0] */
exe(OP_ADD, &cofs, INIT0?cofs:cofs, EXP_H3210, 4LL, EXP_H3210, 0LL, EXP_H3210, OP_AND, 0x00000000ffffffffLL, OP_NOP, 0LL); /* stage#0 */
exe(OP_ADD, &rofs, rofs, EXP_H3210, INIT0?M*4:0, EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL); /* stage#0 */
exe(OP_ADD, &oofs, rofs, EXP_H3210, cofs, EXP_H3210, 0LL, EXP_H3210, OP_AND, 0x00000000ffffffffLL, OP_NOP, 0LL); /* stage#1 */
/***************************/
/* + - - - - - - - - - - - */ /* A[p[i]] 先頭行 */ /* 先頭行はi更新まで再利用可能 */
/* | * > > > > > > > > > > */ /* A[p[j]] 次行から引く */ /* 1行をLMMに写像 */
/* | v + - - - - - - - - - */
/* | v | * > > > > > > > > */ /* M/60を収容してi更新までj+=60を繰り返す *//* 行番号比較とcstによる端数制御 */
/* | v | v + - - - - - - - */ /* + CHIP#0 h=0 grp=0 */
/* | v | v - + - - - - - - */ /* + CHIP#0 h=0 grp=1 */
/* | v | v - - + - - - - - */ /* + CHIP#1 h=0 grp=0 */
/* | v | v - - - + - - - - */ /* + CHIP#1 h=0 grp=1 */
/* | v | v - - - - + - - - */ /* + CHIP#0 h=1 grp=0 */
/* | v | v - - - - - + - - */ /* + CHIP#0 h=1 grp=1 */
/* | v | v - - - - - - + - */ /* + CHIP#1 h=1 grp=0 */
/* | v | v - - - - - - - + */ /* + CHIP#1 h=1 grp=1 */
/***************************/ /* 最大60行まで写像可能 */
/* FOLDING時は,少なくとも第0列がFOLDINGであることが必要(conv-c2c仕様) */
/* CEXEにも関わらずSTWRの無意味なLMM入れ換えが発生するため,A[M][*](枠外領域)を使用 */ /* OK exe-loop */
exe(OP_CMP_LT, &cc0, l00[CHIP], EXP_H3210, M, EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0); /* stage#1 LD */
mop(OP_LDWR, 1, &BR[2][2][1], top, cofs, MSK_W0, topw, len, 0, 0, NULL, len); /* A[p[i]*M+k] stage#2 | */
mop(OP_LDWR, 1, &BR[2][0][1], d00[CHIP], oofs, MSK_W0, d00w[CHIP], len2,0, 1, NULL, len2); /* A[p[j+h*NCHIP+CHIP]*M+k] stage#2 +-> | */
mop(OP_LDWR, 1, &BR[2][1][1], d00[CHIP], rofs, MSK_W0, d00w[CHIP], len2,0, 1, NULL, len2); /* A[p[j+h*NCHIP+CHIP]*M+k] stage#2 +-> | */
exe(OP_FMS, &AR[2][0], BR[2][0][1], EXP_H3210, BR[2][1][1], EXP_H3210, BR[2][2][1], EXP_H3210, OP_NOP, 0LL, OP_NOP, 0); /* stage#2 | ■■■ | 1.0 */
cex(OP_CEXE, &ex0, 0, 0, 0, cc0, 0xaaaa); /* stage#2 | AR[1] | */
mop(OP_STWR,ex0, &AR[2][0], oofs, d00[CHIP], MSK_D0, d00w[CHIP], len2, 0, 1, NULL, len2); /* stage#2 | + ST v */
#if (H>1) /* *--------- BR[2] */
exe(OP_CMP_LT, &cc1, l01[CHIP], EXP_H3210, M, EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0); /* stage#2 LD */
mop(OP_LDWR, 1, &BR[3][2][1], top, cofs, MSK_W0, topw, len, 0, 0, NULL, len); /* A[p[i]*M+k] stage#3 | */
mop(OP_LDWR, 1, &BR[3][0][1], d01[CHIP], oofs, MSK_W0, d01w[CHIP], len2,0, 1, NULL, len2); /* A[p[j+h*NCHIP+CHIP]*M+k] stage#3 +-> | */
mop(OP_LDWR, 1, &BR[3][1][1], d01[CHIP], rofs, MSK_W0, d01w[CHIP], len2,0, 1, NULL, len2); /* A[p[j+h*NCHIP+CHIP]*M+k] stage#3 +-> | */
exe(OP_FMS, &AR[3][0], BR[3][0][1], EXP_H3210, BR[3][1][1], EXP_H3210, BR[3][2][1], EXP_H3210, OP_NOP, 0LL, OP_NOP, 0); /* stage#3 | ■■■ | 1.0 */
cex(OP_CEXE, &ex0, 0, 0, 0, cc1, 0xaaaa); /* stage#3 | AR[2] | */
mop(OP_STWR,ex0, &AR[3][0], oofs, d01[CHIP], MSK_D0, d01w[CHIP], len2, 0, 1, NULL, len2); /* stage#3 | + ST v */
#if (H>2) /* *--------- BR[3] */
if (j+h*NCHIP*RMGRP+CHIP*RMGRP+grp<M)
A[(j+h*NCHIP*RMGRP+CHIP*RMGRP+grp)*M+i+1+k]
-= A[(j+h*NCHIP*RMGRP+CHIP*RMGRP+grp)*M+i]*A[i*M+i+1+k];

20210401
79
Conditional store is helpful
exe(OP_CMP_LT, &cc0, l00[CHIP], EXP_H3210, M, EXP_H3210, 0LL,
mop(OP_LDWR, 1, &BR[2][2][1], top, cofs, MSK_W0, topw, len, 0, 0, NULL, len);
mop(OP_LDWR, 1, &BR[2][0][1], d00[CHIP], oofs, MSK_W0, d00w[CHIP], len2,0, 1, NULL, len2);
mop(OP_LDWR, 1, &BR[2][1][1], d00[CHIP], rofs, MSK_W0, d00w[CHIP], len2,0, 1, NULL, len2);
exe(OP_FMS, &AR[2][0], BR[2][0][1], EXP_H3210, BR[2][1][1], EXP_H3210, BR[2][2][1],
cex(OP_CEXE, &ex0, 0, 0, 0, cc0, 0xaaaa );
mop(OP_STWR, ex0, &AR[2][0], oofs, d00[CHIP], MSK_D0, d00w[CHIP], len2, 0, 1, NULL, len2);
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

CPU GPU
20210401
80
（Sparse matrix and sorting）

20210401
81
Sparse MM
Merge sort
1 1.0 2 2.0 3 3.0 11 4.0 12 5.0 13 6.0
Array A
Array B 2 1.0 3 2.0 4 3.0 9 4.0 10 5.0 11 6.0
Index and value
1 a1 2 a2 3 s3 11 a11 12 a12 13 a13
Stream A
Stream B 2 b2 3 b3 4 b4 9 b9 10 b10 11 b11
Value and attribute
Dual address synchronizer for sparse matrix and sort

20210401
82
Dual address synchronizer for sparse matrix and sort

①
②
③ ④
⑤
⑥
⑦
⑧
⑨
⑩ ⑪
⑨
①
② ⑤
⑥
⑦
⑧
①
①
② ②
⑦ ⑦
⑨ ⑨
⑤
⑧
⑤
⑥ ⑥
⑧
④
③
⑩ ⑪
+offset
Mask address
Load B42 Load A00
Base address of B and A
Store to C
20210401
83
Five rings in a physical unit (sparse MM)

mex(OP_CMPA_LE, &b0[h], INIT0?b:b0[h], INIT0?0:8, BR[r][2][1], BR[r][2][0]);
mex(OP_CMPA_GE, &a0[h][CHIP], INIT0?a[h][CHIP]:a0[h][CHIP], INIT0?0:8, BR[r][2][1], BR[r][2][0]);
mop(OP_LDR, 3, &BR[r][2][1], b0[h], bofs, MSK_W1, b, 2*LP*RMGRP, 0, 0, NULL, 2*LP*RMGRP);
mop(OP_LDR, 3, &BR[r][2][0], a0[h][CHIP], bofs, MSK_W0, a[h][CHIP], 2*LP, 0, 0, NULL, 2*LP);
mop(OP_LDWR, 1, &c00, c0[h][CHIP], oofs, MSK_W0, c[h][CHIP], RMGRP, 0, 1, NULL, RMGRP);
exe(OP_CFMA, &c00, INIT0?c00:c00, EXP_H3210, BR[r][2][1], EXP_H3210, BR[r][2][0], EXP_H3210, OP_NOP, 00);
mop(OP_STWR, 1, &c00, oofs, c0[h][CHIP], MSK_D0, c[h][CHIP], RMGRP, 0, 1, NULL, RMGRP);
20210401
84
Ten rings in a physical unit (sparse MM)

20210401
85
void emax6sc_pth_imax_02(struct sc_param *param) {
Ull CHIP, LOOP0=param->LOOP0, LOOP1=param->LOOP1;
Ull INIT1[4], INIT0[4];
Uint uLOOP[4];
for (CHIP=0; CHIP<1; CHIP++) { /* unit2 */
uLOOP[CHIP]=LOOP1*LOOP0;
}
while (1) {
for (CHIP=0; CHIP<1; CHIP++)
if (uLOOP[CHIP]) break;
if (CHIP==1) break;
for (CHIP=0; CHIP<1; CHIP++) {
if (uLOOP[CHIP]==0) continue;
if ((2 && SCBR[1].enq[CHIP]==SCBR[1].deq[CHIP]) || (2<17 && SCBR[2].enq[CHIP]!=SCBR[2].deq[CHIP])) continue;
INIT1[CHIP]=(uLOOP[CHIP]>LOOP1*LOOP0-LOOP0);
INIT0[CHIP]=(uLOOP[CHIP]-(uLOOP[CHIP]/LOOP0*LOOP0)==0);
SCBR[1].deq[CHIP] = 1-SCBR[1].deq[CHIP];
{
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][0] = SCBR[1].r[CHIP][SCBR[2].enq[CHIP]][6];
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][3] = SCBR[1].r[CHIP][SCBR[2].enq[CHIP]][2];
}
{ Ull base, offs, adr, mexdist, load64;
static int emax6_unaligned_load_valid;
static Ull emax6_unaligned_load_high;
base = (!(0&1)||INIT0[CHIP]) ? ((0&2)?SCBR[1].r[CHIP][SCBR[2].enq[CHIP]][0]:SCM1[2].b[CHIP][0]) : SCM1[2].awoo[CHIP][0];
offs = eam(1 ? SCBR[1].r[CHIP][SCBR[2].enq[CHIP]][6] : SCM1[2].o[CHIP][0], 12);
mexdist = INIT0[CHIP] ? 0 : 0;
SCM1[2].awoo[CHIP][0] = (Ull)(INIT0[CHIP]?base:SCM1[2].awoo[CHIP][0]);
adr = (Uint)(SCM1[2].awoo[CHIP][0] + offs);
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][1] = (Ull)*(Uint*)(adr&~3LL)<<32 | (Ull)*(Uint*)(adr&~3LL);
SCM1[2].d[CHIP][0] = SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][1];
}
{ Ull base, offs, adr, mexdist, load64;
static int emax6_unaligned_load_valid;
static Ull emax6_unaligned_load_high;
SCM1[2].awoo[CHIP][2] = (Ull)(INIT0[CHIP]?base:SCM1[2].awoo[CHIP][2])+(INIT0[CHIP]?0:((SCM0[2].d[CHIP][2]>>32)!=0xffffffff &&
(SCM1[2].d[CHIP][2]>>32)<=(SCM0[2].d[CHIP][2]>>32))?mexdist:0);
load64 = *(Ull*)(adr&~7LL);
if ((adr&7) == 0)
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][9] = load64;
else if (!emax6_unaligned_load_valid) { /* BR[][][1] */
emax6_unaligned_load_valid = 1;
emax6_unaligned_load_high = load64;
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][9] = load64 >> (adr&7)*8;
}
else { /* BR[][][0] */
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][9] = emax6_unaligned_load_high << (8-(adr&7))*8 | load64 >> (adr&7)*8;
}
SCM0[2].awoo[CHIP][2] = (Ull)(INIT0[CHIP]?base:SCM0[2].awoo[CHIP][2])+(INIT0[CHIP]?0:((SCM1[2].d[CHIP][2]>>32)!=0xffffffff &&
(SCM1[2].d[CHIP][2]>>32)>=(SCM0[2].d[CHIP][2]>>32))?mexdist:0);
load64 = *(Ull*)(adr&~7LL);
if ((adr&7) == 0)
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][8] = load64;
else if (!emax6_unaligned_load_valid) { /* BR[][][1] */
emax6_unaligned_load_high = load64;
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][8] = load64 >> (adr&7)*8;
}
else { /* BR[][][0] */
SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][8] = emax6_unaligned_load_high << (8-(adr&7))*8 | load64 >> (adr&7)*8;
}
}
{ union { Uint i; float f; } f3, f2, f1, f0; Ull t3, t2, t1, t0, ex1, ex2, ex3, ex4, ex5, c1, c0, ex1_outd, ex2_outd;
ex1 = exm(!1||(INIT1[CHIP]&&INIT0[CHIP])||((1&1)&&INIT0[CHIP]) ? SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][1] : SCAR[2].r[CHIP][0], 0);
ex2 = exm(((1&2)&&!INIT0[CHIP]) ? 0 : SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][9], 0);
ex3 = exm(SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][8], 0);
f1.i = (Uint)(ex1);
f2.i = (Uint)(ex2>>32);
f3.i = (Uint)(ex3>>32);
if (f2.i != -1 && f2.i == f3.i) {
f2.i = (Uint)(ex2);
f3.i = (Uint)(ex3);
f0.f = f1.f + (f2.f * f3.f);
}
else {
f0.f = f1.f;
}
t0 = f0.i;
ex1_outd = t0;
ex2_outd = ex1_outd;
SCAR[2].r[CHIP][0] = ex2_outd;
}
{ Ull cs0, cs1, cs2, cs3, cex, base, offs, adr, mexdist;
cex = 3;
SCM0[2].awoo[CHIP][0] = (Ull)(INIT0[CHIP]?base:SCM0[2].awoo[CHIP][0]);
if (cex &1) *(Uint*)(adr&~3LL) = (1==1? SCAR[2].r[CHIP][0] : SCBR[2].r[CHIP][SCBR[2].enq[CHIP]][6]);
}
uLOOP[CHIP]--;
SCBR[2].enq[CHIP] = 1-SCBR[2].enq[CHIP];
}
}
}
RISC for CPU, but CISC for CGRA
Upper: IMAX program … executed w/ only 4cycles
Lower: Equivalent C program … much complicated
Integrating rich functions in data path makes CGRA power efficient ?

①
①
② ②
⑦ ⑦
⑨ ⑨
⑤
⑧
⑤
⑥ ⑥
⑧
④
③
+ offset
Mask address
Similar to sparse MM
Succeeding unit compares A and B
Conditional store of A or B
One of LogN-stages is mapped
Store address is sequential
Two level pipelining: succeeding unit
reads the previous data from local memory
and performs next stage of LogN-stages.
Four rings in a physical unit (merge sort)
20210401
86
Load B42 Load A00
⑪
⑩

Update address A or
B, based on the value
Conditional store of A or B
Store address is sequential
Store one stage of logN-stages
Addr.A
Addr.B Val.B Val.A
Update address A or
B, based on the value
Double buffering
Merge sort
20210401
87
Generate condition codes

Compression of sparse matrix
count0 = 0;
for (row=0; row<M1; row++) {
for (col=0; col<M2; col++) {
if (A32_0[row*M2+col] == 0)
continue;
if (count0 >= M1*M2)
continue;
A32_P[count0].d = A32_0[row*M2+col];
A32_P[count0].x = row*M2+col;
count0++;
}
}
*Bas1P = B32_P-1; /* end of B32_0 */
for (i=0; i<M1; i+=RMGRP) {
r1 = (Ull)(i*M2-1)<<32;
ibase0 = B32_0+i*M2;
itop0 = B32_0+i*M2;
itop1 = itop0+RMGRP*M2;
obase0 = *Bas1P; /* end of B32_0 */
otop1 = otop0;
otop0 = *Bas1P+8; /* top of B32_P */
//with-prefetch/post-drain
//EMAX5A begin imax mapdist=0
for (INIT1=1,LOOP1=RMGRP,rofs=0; LOOP1--; INIT1=0) {
for (INIT0=1,LOOP0=M2,cofs=0; LOOP0--; INIT0=0) {
mop(OP_LDWR, 1, &r0, ibase0++, 0, MSK_D0, itop0, M2*RMGRP, 0, 0, itop1, M2*RMGRP);
exe(OP_ADD, &r1, r1, EXP_H3210, 0x100000000LL, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0LL);
exe(OP_NOP, &std, r1, EXP_H3210, 0, EXP_H3210, 0, EXP_H3210, OP_OR, r0, OP_NOP, 0LL);
exe(OP_CMP_EQ, &cc0, r0, EXP_H1010, 0x00000000LL, EXP_H1010, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0LL);
exe(OP_CMP_EQ, &cc1, r0, EXP_H1010, 0x80000000LL, EXP_H1010, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0LL);
exe(OP_NOP, &cc2, cc0, EXP_H3210, 0, EXP_H3210, 0, EXP_H1010, OP_OR, cc1, OP_NOP, 0LL);
exe(OP_CMOV, &oofs, cc2, EXP_H3210, 0, EXP_H3210, 8, EXP_H3210, OP_NOP, 0, OP_NOP, 0LL);
exe(OP_ADD, &obase0, obase0, EXP_H3210, oofs, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP, 0LL);
mop(OP_STR, 3, &obase0, Bas1P, 0, MSK_D0, Bas1P, 2, 0, 0, NULL, 2);
exe(OP_NOP, &AR[5][0], 0, EXP_H3210, 0, EXP_H3210, 0, EXP_H1010, OP_NOP, 0, OP_NOP, 0LL);
cex(OP_CEXE, &ex0, 0, 0, 0, cc2, 0x0001);
mop(OP_STR, ex0, &std, obase0, 0, MSK_D0, otop0, LP*2*RMGRP, 0, 0, otop1, LP*2*RMGRP);
}
}
}
//EMAX5A end
}
20210401
88

CPU GPU
20210401
89
（Hash, FFT and string search）

20210401
90
SHA256
for (i=0; i<ctx->mbuflen; i+=BLKSIZE) { /* 1データ流内の並列実行は不可能. 多数データ流のパイプライン実行のみ */
for (th=0; th<thnum; th++) {
sregs[th*8+0] = state[th*8+0]; sregs[th*8+1] = state[th*8+1]; sregs[th*8+2] = state[th*8+2]; sregs[th*8+3] = state[th*8+3];
sregs[th*8+4] = state[th*8+4]; sregs[th*8+5] = state[th*8+5]; sregs[th*8+6] = state[th*8+6]; sregs[th*8+7] = state[th*8+7];
}
for (j=0; j<BLKSIZE; j+=BLKSIZE/DIV) {
a = sregs[th*8+0]; b = sregs[th*8+1]; c = sregs[th*8+2]; d = sregs[th*8+3];
e = sregs[th*8+4]; f = sregs[th*8+5]; g = sregs[th*8+6]; h = sregs[th*8+7];
#if (DIV==4)
t1 = h+EP1(e)+CH(e,f,g)+k[j+ 0]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+ 0]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
t1 = h+EP1(e)+CH(e,f,g)+k[j+10]+mbuf[i/BLKSIZE*MAX_THNUM*BLKSIZE+th*BLKSIZE+j+10]; t2 = EP0(a)+MAJ(a,b,c); h = g; g = f; f = e; e = d+t1; d = c; c = b; b = a; a = t1+t2;
#endif
sregs[th*8+0] = a; sregs[th*8+1] = b; sregs[th*8+2] = c; sregs[th*8+3] = d;
sregs[th*8+4] = e; sregs[th*8+5] = f; sregs[th*8+6] = g; sregs[th*8+7] = h;
}
}
state[th*8+0] += sregs[th*8+0]; state[th*8+1] += sregs[th*8+1]; state[th*8+2] += sregs[th*8+2]; state[th*8+3] += sregs[th*8+3];
state[th*8+4] += sregs[th*8+4]; state[th*8+5] += sregs[th*8+5]; state[th*8+6] += sregs[th*8+6]; state[th*8+7] += sregs[th*8+7];
}
}

20210401
91
Leaf functions of SHA256
case OP_MAJ: /* (((x) & (y))^((x) & (z))^((y) & (z))) */
ex1_outd = (r1&0xffffffff00000000LL) | (((r1 & r2)^(r1 & r3)^(r2 & r3))&0xffffffffLL);
case OP_CH: /* (((x) & (y))^(~(x) & (z))) */
ex1_outd = (r1&0xffffffff00000000LL) | (((r1 & r2)^(~r1 & r3))&0xffffffffLL);
case OP_ROTS: /* hi-32bit #define ROTRIGHT (((a) >> (b)) | ((a) << (32-(b)))) */
t2 = ex1_outd & 0xffffffff00000000LL;
ro10 = r4>>32 & 0xff;
ro11 = r4>>40 & 0xff;
ro12 = r4>>48 & 0xff;
t0 = ex1_outd & 0x00000000ffffffffLL;
ro00 = r4 & 0xff;
ro01 = r4>> 8 & 0xff;
ro02 = r4>>16 & 0xff;
ex2_outd = (((t2>>ro12|t2<<(32-ro12))^(t2>>ro11|t2<<(32-ro11))^(t2>>ro10|t2<<(32-ro10)))&0xffffffff00000000LL)
|(((t0>>ro02|t0<<(32-ro02))^(t0>>ro01|t0<<(32-ro01))^(t0>>ro00|t0<<(32-ro00)))&0x00000000ffffffffLL);
exe(OP_MAJ, &ep0maj, a, EXP_H1010, fb, EXP_H1010, gc, EXP_H1010, OP_ROTS, (2LL<<48)|(13LL<<40)|(22LL<<32). NOP,0);
exe(OP_CH, &ep1ch, e, EXP_H3232, fb, EXP_H3232, gc, EXP_H3232, OP_ROTS, (6LL<<48)|(11LL<<40)|(25LL<<32), NOP,0);

20210401
93
Double buffering and 2-level pipelining for FFT
BlockEnd = 1;
for (BlockSize=2; BlockSize<=NumSamples; BlockSize<<=1) {
for (i=0; i<NumSamples; i+=BlockSize) {
for (j=i,n=0; n<BlockEnd; j++,n++) {
k = j + BlockEnd;
idx = n + BlockEnd;
tr = art[idx]*RealOut[k] - ait[idx]*ImagOut[k];
ti = art[idx]*ImagOut[k] + ait[idx]*RealOut[k];
RealOut[k] = RealOut[j] - tr;
ImagOut[k] = ImagOut[j] - ti;
RealOut[j] += tr;
ImagOut[j] += ti;
}
}
BlockEnd = BlockSize;
} Stage1 stage2 stage3

20210401
95
String search
void strsearch(int i)
{
char *str = sstr[i];
int len = slen[i];
register size_t shift;
register size_t pos = len - 1;
char *found;
while (pos < clen) {
while (pos < clen && (shift = table[(unsigned char)target[pos]]) > 0)
pos += shift;
if (!shift) {
if (!strncmp(str, &target[pos-len+1], len))
out0[i*clen+(pos-len+1)] = 0xff;
pos++;
} } }

20210401
96
String search
for (i=0; i<snum; i+=OMAP) {
//EMAX5A begin search mapdist=0
for (CHIP=0; CHIP<NCHIP; CHIP++) { //チップ方向は検索対象文書の大きさ方向
for (INIT0=1,LOOP0=loop,dmy=0; LOOP0--; INIT0=0) { //長さは32KB文字まで
/* map#0 */
exe(OP_ADD, &t0[CHIP], t0[CHIP], EXP_H3210, 1LL, EXP_H3210, 0LL, EXP_H3210, OP_AND, 0x0000ffffffffffffLL, OP_NOP, 0LL);
exe(OP_MCAS, &r00, slen0, EXP_H3210, 1, EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL);
mop(OP_LDBR, 1, &BR[1][0][1], t0[CHIP], 0, MSK_D0, t0t[CHIP], dwi, 0, 0, (Ull)NULL, dwi);
exe(OP_CMP_NE, &r16, c00, EXP_H3210, BR[1][0][1], EXP_H3210, 0LL, EXP_H3210, OP_AND, r00, OP_NOP, 0LL); // 1 if unmatch
exe(OP_ADD3, &r10, r16, EXP_H3210, r17, EXP_H3210, r18, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL); //
exe(OP_ADD, &r12, r22, EXP_H3210, r23, EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL); //
exe(OP_MCAS, &r31, 0LL, EXP_H3210, r00, EXP_H3210, 0LL, EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL); // FF if match
mop(OP_STBR, 3, &r31, r0[CHIP]++, 0, MSK_D0, r0t[CHIP], dwo, 0, 0, (Ull)NULL, dwo);
/* map#1 */
:
/* map#8 */
:
}
}
//EMAX5A end
}

CPU GPU
20210401
98
（High-speed compiler）

20210401
99
FMA C+B*A ⇒ D
Dual Random access

20210401
100
FMA C+B*A ⇒ D
D has location info (same as C)
Sequential updates

20210401
101
FMA C+B*A ⇒ C
Accumulation

20210401
102
Compilation of source code
void tone_curve(r, d, t)
unsigned int *r, *d;
unsigned char *t;
{
#if !defined(EMAX5) && !defined(EMAX6)
int j;
for (j=0; j<WD; j++) {
*d = ((t)[*r>>24])<<24 | (t[256+((*r>>16)&255)])<<16 | (t[512+((*r>>8)&255)])<<8;
r++; d++;
}
#else
Ull t1 = t;
Ull t2 = t+256;
Ull t3 = t+512;
Ull BR[16][4][4]; /* output registers in each unit */
int loop=WD;
while (loop--) {
mop(OP_LDWR, 1, &BR[0][1][1], (Ull)(r++), 0LL, MSK_D0, (Ull)r, 320, 0, 0, (Ull)NULL, 320); /* stage#0 */
mop(OP_LDUBR, 1, &BR[1][1][1], (Ull)t1, BR[0][1][1], MSK_B3, (Ull)t1, 64, 0, 0, (Ull)NULL, 64); /* stage#1 */
exe(OP_MMRG, &r1, BR[1][1][1], EXP_H3210, BR[1][2][1], EXP_H3210, BR[1][3][1], EXP_H3210, OP_NOP, 0LL, OP_NOP, 0LL);
mop(OP_STWR, 3, &r1, (Ull)(d++), 0LL, MSK_D0, (Ull)d, 320, 0, 0, (Ull)NULL, 320); /* stage#2 */
}
//EMAX5A end
#endif
}
 filter+rmm.c

20210401
103
Generated code
 filter+rmm-emax6.c ../../src/conv-c2c/emax6.h ../../src/conv-c2c/emax6lib.c
void tone_curve(r, d, t)
unsigned int *r, *d;
unsigned char *t;
{
Ull t1 = t;
Ull t2 = t+256;
Ull t3 = t+512;
Ull BR[16][4][4];
Ull r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
int loop=WD;
volatile emax6_conf_tone_curve();
emax6.lmmio = emax6.lmmic;
emax6.lmmic = 1-emax6.lmmic;
emax6.mapdist = 0;
*(Uint*)&emax6.lmmi[0][0][1][emax6.lmmic] = 0x013f0001|(0<<2);
emax6.lmmi[0][0][1][emax6.lmmic].ofs = 0; emax6.lmmi[0][0][1][emax6.lmmic].top = r;
emax6.lmmi[0][1][1][emax6.lmmic].ofs = 0; emax6.lmmi[0][1][1][emax6.lmmic].top = t1;
emax6.lmmi[0][2][0][emax6.lmmic].ofs = 0; emax6.lmmi[0][2][0][emax6.lmmic].top = d;
emax6.lmmi_bitmap[0] = 0x0000000000000004LL;
emax6_pre_with_drain_cache();
get_nanosec(NANOS_ARM);
if (emax6.last_conf == emax6_conf_tone_curve) {
emax6.status = STATUS_DRAIN;
emax6_check_lmmi_and_dma(0, 1, 0, 0, 2, 0);/*drain*/
}
get_nanosec(NANOS_DRAIN);
Updating of memory map
Drain of previous burst execution

20210401
104
Generated code
if (emax6.last_conf != emax6_conf_tone_curve) {
Dll *dst, *src;
int i,j;
emax6.status = STATUS_CONF;
emax6.last_conf = emax6_conf_tone_curve;
emax6.lastdist = 0;
dst = (Dll*)(((struct reg_ctrl*)emax6.reg_ctrl)->i[0].conf);
src = (Dll*)emax6_conf_tone_curve;
for (i=0; i<sizeof(conf)/sizeof(Dll); i++)
*dst++ = *src++;
for (i=0; i<64; i++) {
for (j=0; j<4; j++)
emax6.lmmi[0][i][j][emax6.lmmio].v = 0;
}
while (((struct reg_ctrl*)emax6.reg_ctrl)->i[0].stat & 0xf0); //LMRING_BUSY
}
get_nanosec(NANOS_CONF);
emax6.status = STATUS_REGV;
((struct reg_ctrl*)emax6.reg_ctrl)->i[0].breg[63][0].br[0] = loop;
((struct reg_ctrl*)emax6.reg_ctrl)->i[0].breg[63][0].br[1] = -1LL;
((struct reg_ctrl*)emax6.reg_ctrl)->i[0].addr[0][1].ea1b = (Ull)r;
((struct reg_ctrl*)emax6.reg_ctrl)->i[0].addr[0][1].ea1o = (Ull)0LL;
((struct reg_ctrl*)emax6.reg_ctrl)->i[0].addr[1][1].ea1b = (Ull)t1;
((struct reg_ctrl*)emax6.reg_ctrl)->i[0].addr[2][0].ea0b = (Ull)d;
((struct reg_ctrl*)emax6.reg_ctrl)->i[0].addr[2][0].ea0o = (Ull)0LL;
get_nanosec(NANOS_REGV);
emax6.status = STATUS_RANGE;
{struct reg_ctrl *reg_ctrl = emax6.reg_ctrl;
Uint lmmic = emax6.lmmic;
*(Ull*)&(reg_ctrl->i[0].addr[0][1].top)=((Ull)(emax6.lmmi[0][0][1][lmmic].top+*((Ushort*)&emax6.lmmi[0][0][1][lmmic]+1)*sizeof(Uint)+(sizeof(Uint)-1))<<32)|(Ull)(Uint)emax6.lmmi[0][0][1][lm
}
get_nanosec(NANOS_RANGE);
Sending of new configuration
Initialization of registers
Setting of top+bottom to local memory

20210401
105
Generated code
emax6.status = STATUS_LOAD;
emax6_check_lmmi_and_dma(0, 2, emax6.lastdist, 0, 0, 1);/*load*/
get_nanosec(NANOS_LOAD);
((struct reg_ctrl*)emax6.reg_ctrl)->i[0].cmd = 3LL; // EXEC
{struct reg_ctrl *reg_ctrl = emax6.reg_ctrl;
Uint lmmic = emax6.lmmic;
}
emax6.lmmd[2][0] = 0xff>>7;
while (((struct reg_ctrl*)emax6.reg_ctrl)->i[0].stat); //LMRING_BUSY|EXRING_BUSY
get_nanosec(NANOS_EXEC);
asm volatile("b emax6_conf_end_tone_curven"
".align 5n"
".global emax6_conf_tone_curven"
"emax6_conf_tone_curve:n"
" .word 0x031e0003, 0x00000000n"
" .word 0xffff0000, 0x00000000n"
" .word 0x00000000, 0x00000000n"
:
" .word 0xffff0000, 0x00000000n"
" .word 0x00000000, 0x00000000n"
" .word 0x00000000, 0x00000000n"
".global emax6_conf_end_tone_curven"
"emax6_conf_end_tone_curve:n"
);
}
DMA
Activation of IMAX
Simultaneous DMA
Binary of configuration

ld @(gr1, 0) -> gr2
add gr1, 4 -> gr1
sub gr3, 1 -> gr3, z
VLIW0
add sub
gr1 gr3z
Register File
gr1 4 gr3 1
eag
gr1 4
0 1 2 3 4 5 6
0 0 0 0 0 0 0
0 1 2 3 4 5 6
0 0 0 0 0 0 0
0 1 2 3 4 5 6
0 0 0 0 0 0 0
0 1 2 3 4 5 6
0 0 0 0 0 0 0
0 1 2 3 4 5 6
0 0 0 0 0 0 0
z
0
z
0
z
0
z
0
z
0
gr2
ld
Propagation skip table
gr/cr
Stage
0
1
2
3
4
20210401
106
Non-heuristic compilation similar to previous LAPP

ld @(gr4, 0) -> gr5
add gr4, 4 -> gr4
bz end
VLIW1
add sub
gr1 gr3z
add
gr4
Register File
gr1 4
gr4 4
gr3 1 gr4
bz
end
eag
eag
gr1 4
gr4 4
0 1 2 3 4 5 6
0 1 1 1 0 0 0
0 1 2 3 4 5 6
0 1 1 1 0 0 0
0 1 2 3 4 5 6
0 0 1 0 0 0 0
0 1 2 3 4 5 6
0 0 0 0 0 0 0
0 1 2 3 4 5 6
0 0 0 0 0 0 0
z
1
z
1
z
0
z
0
z
0
gr5
ld
gr2
ld
gr/cr
Stage
0
1
2
3
4
VLIW1
20210401
107

gr5
ld
gr2
ld
sll gr2, 16 -> gr2
VLIW2
add sub
gr1 gr3z
add
gr4
sll
Register File
gr1 4
gr4 4
16
gr3 1
0 1 2 3 4 5 6
0 1 1 1 1 1 0
0 1 2 3 4 5 6
0 1 1 1 1 1 0
0 1 2 3 4 5 6
0 0 1 0 1 1 0
0 1 2 3 4 5 6
0 0 0 0 0 1 0
0 1 2 3 4 5 6
0 0 0 0 0 0 0
z
1
z
1
z
0
z
0
z
0
gr4
bz
end
eag
eag
gr1 4
gr4 4
gr/cr
Stage
0
1
2
3
4
VLIW2
VLIW2
20210401
108

gr5
ld
gr2
ld
or gr2, gr5 -> gr5
VLIW3
add sub
gr1 gr3z
add
gr4
gr2
Register File
or
gr5
gr1 4
gr4 4
gr3 1
0 1 2 3 4 5 6
0 1 1 1 1 1 0
0 1 2 3 4 5 6
0 1 1 1 1 1 0
0 1 2 3 4 5 6
0 0 1 0 1 1 0
0 1 2 3 4 5 6
0 0 1 0 0 1 0
0 1 2 3 4 5 6
0 0 0 0 0 0 0
z
1
z
1
z
0
z
0
z
0
gr4
bz
end
eag
eag
gr1 4
gr4 4
gr/cr
Stage
0
1
2
3
4
sll
16
VLIW3
VLIW3
VLIW3
20210401
109

gr5
gr2
ld
st, gr5 -> @(gr6, 0)
add gr6, 4 -> gr6
bra loop
VLIW4
add sub
gr1 gr3z
add
gr4
Register File
gr5 gr5
add
gr6
gr1 4
gr4 4
16
gr6
gr6 4
gr6
gr6
gr3 1
0 1 2 3 4 5 6
0 1 1 1 1 1 0
0 1 2 3 4 5 6
0 1 1 1 1 1 0
0 1 2 3 4 5 6
0 0 1 0 1 1 0
0 1 2 3 4 5 6
0 0 1 0 0 1 0
0 1 2 3 4 5 6
0 0 0 0 0 1 0
z
1
z
1
z
0
z
0
z
0
gr4 gr6
bz
bra
end
loop
eag
eag
eag
gr1 4
gr4 4
gr6 0
st
gr/cr
Stage
0
1
2
3
4
gr5
ld
gr2
or
sll
VLIW4
VLIW4
VLIW4
VLIW4
20210401
110

CPU GPU
20210401
111
（Elaborate triple loop structure）

20210401
112
Triple loop = double loop + multiple IMAX
HOST
Ring for execution
Eight lanes for memory IF

20210401
113
Double loop and indices are managed by one physical unit

20210401
114
2-dimensional processing in stochastic FMA
//EMAX5A begin smax2 mapdist=0
for (INIT1=1,LOOP1=RMGRP,rofs=(0-IC32)<<32|((0-1LL)&0xffffffff); LOOP1--; INIT1=0) {
for (INIT0=1,LOOP0=IC32/32,cofs=(0-32LL)<<32|((0)&0xffffffff); LOOP0--; INIT0=0) {
① exe(OP_ADD, &cofs, INIT0?cofs:cofs, EXP_H3210, 32LL<<32|0, EXP_H3210, 0, EXP_H3210, OP_AND, 0xffffffffffffffffLL, OP_NOP,0);
② exe(OP_ADD, &rofs, rofs, EXP_H3210, INIT0?IC321:0, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP,0);
③ exe(OP_ADD, &bofs, rofs, EXP_H3210, cofs, EXP_H3210, 0, EXP_H3210, OP_AND, 0xffffffffffffffffLL, OP_NOP,0);
④ exe(OP_ADD, &oofs, rofs, EXP_H3210, cofs, EXP_H3210, 0, EXP_H3210, OP_AND, 0x00000000ffffffffLL, OP_NOP,0);
spike01_core1( 2, 0);
#define spike01_core1(r, s)
mo4(OP_LDRQ, 1, BR[r][2], b0, bofs, MSK_W1, b, IC32D4RMGRP, 0, 0, NULL, IC32D4RMGRP);
mo4(OP_LDRQ, 1, BR[r][1], a[s][CHIP], cofs, MSK_W1, a[s][CHIP], IC32D4, 0, 0, NULL, IC32D4 );
mop(OP_LDBR, 1, &b00, c0[s][CHIP], oofs, MSK_W0, c[s][CHIP], RMGRPD4, 0, 1, NULL, RMGRPD4 );
ex4(OP_SFMA, &b00, INIT0?b00:b00, EXP_H3210, BR[r][1], EXP_H3210, BR[r][2], EXP_H3210, OP_NOP,0, OP_NOP,0 );
mop(OP_STBR, 1, &b00, oofs, c0[s][CHIP], MSK_D0, c[s][CHIP], RMGRPD4, 0, 1, NULL, RMGRPD4 )

20210401
115
2-dimensional processing in Bayesian inference
//EMAX5A begin x1 mapdist=0
for (INIT1=1,LOOP1=RMGRP,row=0-M*4; LOOP1--; INIT1=0) {
for (INIT0=1,LOOP0=M/W,bofs=0-W*4; LOOP0--; INIT0=0) {
exe(OP_ADD, &bofs, INIT0?bofs:bofs, EXP_H3210, W*4, EXP_H3210, 0, EXP_H3210, OP_AND, 0x00ffffffffLL, OP_NOP,0);
exe(OP_ADD, &row, row, EXP_H3210, INIT0?M*4:0, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP,0);
exe(OP_ADD, &rofs, row, EXP_H3210, 0, EXP_H3210, 0, EXP_H3210, OP_AND, 0x00ffffffffLL, OP_NOP,0);
mop(OP_LDWR, 1, &b00, c600, rofs, MSK_W0, c60, M*RMGRP, 0, 1, NULL, M*RMGRP);
exe(OP_ADD, &b00, INIT0?b00:b00, EXP_H3210, PARAM, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP,0);
mop(OP_STWR, 1, &b00, rofs, c600, MSK_D0, c60, M*RMGRP, 0, 1, NULL, M*RMGRP);

20210401
116
2-dimensional processing in sparse MM
//EMAX5A begin imax mapdist=0
for (INIT1=1,LOOP1=RMGRP,rofs=(0-LP*8)<<32|((0-4LL)&0xffffffff); LOOP1--; INIT1=0) {
for (INIT0=1,LOOP0=LP,cofs=(0LL)<<32|((0LL)&0xffffffff); LOOP0--; INIT0=0) {
exe(OP_ADD, &rofs, rofs, EXP_H3210, INIT0?(LP*8)<<32|4:0, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP,0);
exe(OP_ADD, &bofs, rofs, EXP_H3210, 0, EXP_H3210, 0, EXP_H3210, OP_AND, 0xffffffff00000000LL, OP_NOP,0);
exe(OP_ADD, &oofs, rofs, EXP_H3210, 0, EXP_H3210, 0, EXP_H3210, OP_AND, 0x00000000ffffffffLL, OP_NOP,0);
sparse_core1( 2, 0);
sparse_core1( 3, 1); /* H=2 */
sparse_core1( 5, 3); /* H=4 */
sparse_core1( 9, 7); /* H=8 */
#define sparse_core1(r, h)
mex(OP_CMPA_LE, &b0[h],INIT0?b:b0[h],INIT0?0:8,OP_CMPA_GE,&a0[h][CHIP],INIT0?a[h][CHIP]:a0[h][CHIP],INIT0?0:8,0,BR[r][2][1],……);
mop(OP_LDR, 3, &BR[r][2][1], b0[h], bofs, MSK_W1, b, 2*LP*RMGRP, 0, 0, NULL, 2*LP*RMGRP );
mop(OP_LDR, 3, &BR[r][2][0], a0[h][CHIP], bofs, MSK_W0, a[h][CHIP], 2*LP, 0, 0, NULL, 2*LP );
exe(OP_NOP, &AR[r][0], 0, EXP_H3210, 0, EXP_H3210, 0, EXP_H3210, OP_NOP, 0, OP_NOP,0);
mop(OP_LDWR, 1, &c00, c0[h][CHIP], oofs, MSK_W0, c[h][CHIP], RMGRP, 0, 1, NULL, RMGRP );
exe(OP_CFMA, &c00, INIT0?c00:c00, EXP_H3210, BR[r][2][1], EXP_H3210, BR[r][2][0], EXP_H3210, OP_NOP, 0, OP_NOP,0);
mop(OP_STWR, 1, &c00, oofs, c0[h][CHIP], MSK_D0, c[h][CHIP], RMGRP, 0, 1, NULL, RMGRP )

CPU GPU
20210401
117
and its HW/SW codesign

20210401
118
Templates for IMAX programming
exe(OP_X, &var|&AR[0-63][0-3], s1, e1, s2, e2, s3, e3, OP_Y, s4, OP_Z, s5)
ex4(OP_X, &var|&AR[0-63], s1, e1, s2, e2, s3, e3, OP_Y, s4, OP_Z, s5)
exe(OP_X, &var, INIT0?var:var, e1, s2, e2, s3, e3, OP_Y, s4, OP_Z, s5)
exe(OP_X, &var, var, e1, INIT0?s2:0, e2, s3, e3, OP_Y, s4, OP_Z, s5)
mex(OP MEX2, &s2, INIT0?s20:s2, INIT0?0:expr,
OP MEX1, &s1, INIT0?s10:s1, INIT0?0:expr, limit, BR[0-63][0-3][1], BR[0-63][0-3][0])
cex(OP_CEXE, &ex0-9, c3, c2, c1, c0, 16bit-pattern)
mop(OP_X, ex9-0, &src|&dst, base, offset, mask, top, len, block, force, ptop, plen)
mo4(OP_X, ex9-0, &src|&dst, base, offset, mask, top, len, block, force, ptop, plen)
DMA information

Original C C+IMAX-code Target
(A) Intel-PC Native Intel-emax6lib Intel-CC Intel-CC Algorithm
(B) Intel-PC Simulator ARM+emax6lib ARM-XCC ARM-XCC(cross compiler) Algorithm
(C) Intel-PC Simulator IMAX/PIO Conv-c2c + ARM-XCC IMAX-code
(D) Intel-PC Simulator IMAX/DMA Conv-c2c + ARM-XCC IMAX-code + testbench
(E) Verilog Simulator Vsim + Testbench Verilog
(F) FPGA+Chipscope Vivado + hw_server Real Hardware
(G) ARM-SoC ARM+emax6lib ARM-CC Conv-c2c + ARM-CC Algorithm
(H) ARM-SoC IMAX/PIO Conv-c2c + ARM-CC Hardware w/o DMA
(I) ARM-SoC IMAX/DMA Conv-c2c + ARM-CC Performance
Conv-c2c (IMAX-CC) runs on CentOS/FreeBSD/ARM-SoC
- IMAX-code is translated to IMAX-config + DMA sequence, and embedded in ARM binary.
Simulator (csim) runs on CentOS/FreeBSD
- Register transfer level simulator
- ARMv8, 64cores, 32threads/core, L1+L2cache/core, L2-directory
reorder-buffer, parameterized memory hierarchy
- 64 IMAX, AXI4-IF, test-bench generator
20210401
119
HW/SW codesign

/* SCREEN=WD*HT */
r = t[ pix>>24 ];
g = t[256+((pix>>16)&255)];
b = t[512+((pix>> 8)&255)];
out[row*WD+col]=r<<24 | g<<16 | b<<8;
} }
20210401
120
Start programming
Load →
Store ←
Color map tables

/* SCREEN=WD*HT */
for (LOOP0=WD, col=-4; LOOP0--;) {
col += 4;
pix = in[row*WD+col/4];
r = t[ pix>>24 ];
g = t[256+((pix>>16)&255)];
b = t[512+((pix>> 8)&255)];
out[row*WD+col/4]=r<<24 | g<<16 | b<<8;
}
//EMAX5A end
}
//EDMAX5A drain_dirty_lmm
20210401
121
Put the loop structure
Load →
Store ←
Color map tables
/* SCREEN=WD*HT */
r = t[ pix>>24 ];
g = t[256+((pix>>16)&255)];
b = t[512+((pix>> 8)&255)];
out[row*WD+col]=r<<24 | g<<16 | b<<8;
} }
Load →
Store ←
Color map tables

/* SCREEN=WD*HT */
col += 4;
mop(OP_LDWR, &pix, in_row_WD, col, MSK_W0, in, WD); //pix = in[row*WD+col/4];
mop(OP_LDUBR, &r, t_r, pix, MSK_B3, t, 256*3/4); //r = t[ pix>>24 ];
mop(OP_LDUBR, &g, t_g, pix, MSK_B2, t, 256*3/4); //g = t[256+((pix>>16)&255)];
mop(OP_LDUBR, &b, t_b, pix, MSK_B1, t, 256*3/4); //b = t[512+((pix>> 8)&255)];
out[row*WD+col/4]=r<<24|g<<16|b<<8;
}
//EMAX5A end
}
20210401
122
Put templates
Load →
Store ←
Color map tables

/* SCREEN=WD*HT */
exe(OP_ADD, &col, col, EXP_H3210, 4, EXP_H3210, 0, EXP_H3210); //col += 4;
exe(OP_MMRG, &out, r, EXP_H3210, g, EXP_H3210, b, EXP_H3210);
mop(OP_STWR, &out, out_row_WD, col, MSK_W0); //out[row*WD+col/4]=r<<24|g<<16|b<<8;
}
//EMAX5A end
}
20210401
123
Try (A) first
Load →
Store ←
Color map tables

20210401
124
Check the data flow
Load →
Store ←
Color map tables
/* SCREEN=WD*HT */
exe(OP_ADD, &col, col, EXP_H3210, 4, EXP_H3210, 0, EXP_H3210); //col += 4;
exe(OP_MMRG, &out, r, EXP_H3210, g, EXP_H3210, b, EXP_H3210);
mop(OP_STWR, &out, out_row_WD, col, MSK_W0); //out[row*WD+col/4]=r<<24|g<<16|b<<8;
}
//EMAX5A end
}

20210401
125
Observe the result
Load →
Store ←
Color map tables

20210401
126
You can test IMAX

20210401
127
HBM2 and VMK version will appear soon
IMAX2: Ultra-speed compilable CGRA 2022/12/XX
First CGRA, based on linear cores (not island-style)
32-unit, 1280-operations/4cycle (768-int32, 256-fp32, 256-media8/16,
512-load/store, 1024-stochastic-fma8, and 128-sparse-matrix)
IMAX2 32 cores 250MHz
1280 operations per 4 cycles
ALVEO-U280/U280
Memory/core: 64KB
Operations/core: 32-load/8-store, quad-sparse-load,
3-cascaded octa-int/media, octa-single-float FMA,
32-stochastic FMA
http://archlab.naist.jp/proj-arm64/fpga/U280-step4000-20221020.img.gz
IMAX2 32 cores 250MHz
1280 operations per 4 cycles
VMK180/VM1802
Memory/core: 64KB
Operations/core: 32-load/8-store, quad-sparse-load,
3-cascaded octa-int/media, octa-single-float FMA,
32-stochastic FMA
http://archlab.naist.jp/proj-arm64/fpga/VMK180-step4000-20221020.img.gz

You can test IMAX
20210401
128

PBL1-v1-200e.pptx

Recommended

Recommended

More Related Content

Similar to PBL1-v1-200e.pptx

Similar to PBL1-v1-200e.pptx (20)

More from NAIST

More from NAIST (20)

Recently uploaded

Recently uploaded (20)

PBL1-v1-200e.pptx