An AI accelerator ASIC architecture

A scalable AI accelerator
ASIC platform
K. Le, January 17, 2019
Presented for information only. No guarantee of accuracy
or correctness.

Market moving toward a different type of AI chip
 New realization: cloud-based AI processing has many limitations
 Concentrated cloud-based AI processing costs too much storage, power and bandwidth
 Limited ability to support real time requirements (automotive, robots, drones, etc.)
 Connectivity to cloud is not always guaranteed (security, mobility, network coverage,
etc.)
 Better user experience requires local AI processing
 Mega-trend is toward edge AI processing
 Need new AI chips to enable “sensor-triggered actions” and “decentralized AI” at the edge

Required edge AI processing functions
 Audio
 speech recognition
 identification, security
 language processing/translation
 Video
 image recognition
 pattern/object/face recognition
 Environmental/physical condition
 pressure, tension, force, temperature, noise, heartbeat, humidity, etc.

A scalable accelerator ASIC platform
for edge AI

High level architecture
 Based on a scalable AI compute fabric
 Pipelined flow for fast learning and inference
 Flexible architecture suitable for cloud, gateway and edge AI
 Allows up-scaling to multi-chip solutions
[Block diagram: AI Compute Fabric; Input Data Buffer, Memory & Control; Output Data Buffer, Memory & Control; Control Processor; IO Interfaces; ROM/SRAM; PLL & PM; Multiple Data Streams]
• AI COMPUTE FABRIC
• MULTIPLE PARALLEL DATA STREAMS
• SCALABLE AND PARTITIONABLE
• ENERGY EFFICIENT
• FAST (UP TO 2-4 GHz)
• CONTROL PROCESSOR (ANDES, ARM, MIPS, RISC-V, ETC.)
• LEARNING (CALCULATIONS, ALGORITHM EXECUTION, COMPARING, MODEL UPDATES)
• INFERRING (ALGORITHM UPDATE, DECISION MAKING)
• IO INTERFACE TO MULTI-CHIP SOLUTIONS

Detailed diagram
 At 250 MHz fabric frequency, the maximum throughput of a 1-cluster AI accelerator is 4 giga byte-operations per second
 (2 bytes x 8 PEs) x 1 cluster x 0.25 GHz = 4 giga byte-operations per second
 128-bit input and output data buffers allow scaling of fabric frequency without throttling
 Control processor can be V-type, ARM Cortex, etc., operating at 100-250 MHz
 Two independent clock domains: fabric and control
 Power: <2 W on 14-16FF at 250 MHz
 Die size: 15 to 20 mm²
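The throughput figure above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the buffer calculation assumes one full buffer word moves per fabric cycle, which the slide does not state. Under that assumption, a 128-bit buffer at 250 MHz needs 32 Gbps, the same figure as the 32 x 1 Gbps LVDS interface shown in the diagram.

```python
MHZ = 1_000_000

def peak_byte_ops_per_s(bytes_per_pe: int, pes_counted: int,
                        clusters: int, fabric_hz: int) -> int:
    """Peak byte-operations per second, following the slide's formula:
    (bytes per PE x PEs counted per cluster) x clusters x fabric frequency."""
    return bytes_per_pe * pes_counted * clusters * fabric_hz

def buffer_bits_per_s(width_bits: int, fabric_hz: int) -> int:
    """Bandwidth of a data buffer, ASSUMING one full word per fabric cycle."""
    return width_bits * fabric_hz

if __name__ == "__main__":
    ops = peak_byte_ops_per_s(bytes_per_pe=2, pes_counted=8,
                              clusters=1, fabric_hz=250 * MHZ)
    bw = buffer_bits_per_s(width_bits=128, fabric_hz=250 * MHZ)
    print(f"peak throughput: {ops / 1e9:g} giga byte-ops/s")
    print(f"buffer bandwidth: {bw / 1e9:g} Gbps")
```

The slide's formula counts 8 PEs per cluster rather than all 64 PEs of the 8x8 block; the sketch keeps that count as a parameter rather than guessing at the reason.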
[Block diagram: AI Compute Fabric (1 block of 8x8 PEs); Vertical Fabric Flow Control; Horizontal Fabric Control; 128-bit Wide Input Data Buffer; 128-bit Wide Output Data Buffer; 32 Gbps Interface (32 x 1 Gbps LVDS), shown twice; Control Processor; IO Interface Control; 32KB L1 / 0.5MB L2 SRAM; PLLs; Power Mgmt; ROM/Security; GPIOs/LVDS; JTAG/Test; CPU LPDDR Controller & PHY]

AI accelerator ASIC platform: multi-chip solutions
 Up-scalable SoC & system architecture
 Suitable for massive data processing
 Connectivity to server racks or cloud via network
[Block diagram: four instances of (AI Compute Fabric; Input Data Buffer, Memory & Control; Output Data Buffer, Memory & Control); Host IO: PCIe-4; Switch & Flow Control SoC; Host IO: PCIe-4; To Cloud or Cards/Racks]

Processing Fabric
An example is the “PipeRench” coarse-grained reconfigurable
architecture from Carnegie Mellon University.

Fabric: local cluster
 Local fabric architecture offers:
 8x8 local cluster configuration, sufficient for most applications
 Byte-wide processing elements
 Easy scalability to 8 bytes per local cluster
 Predictable performance
 Ample routing resources
 Pipelined flow architecture
 Faster and more power efficient than CPU/GPU architectures
 Might add local memory blocks for reverse machine learning applications
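To make the idea of a pipelined flow through an 8x8 cluster concrete, here is a toy cycle-level model. It is a sketch under stated assumptions, not the deck's design: each of the 8 rows is treated as one pipeline stage operating on a vector of 8 bytes (one per column), data advances one row per cycle, and the per-row operation (`make_stage`, adding a constant modulo 256) is a hypothetical placeholder for whatever the PEs are configured to do.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[int]                 # 8 bytes, one per PE column
Stage = Callable[[Vector], Vector]

def make_stage(add: int) -> Stage:
    """Hypothetical row operation: add a constant to every byte (mod 256)."""
    return lambda v: [(x + add) & 0xFF for x in v]

def run_pipeline(stages: List[Stage],
                 inputs: List[Vector]) -> Tuple[List[Vector], int]:
    """Push input vectors through the rows, one row per cycle.

    Returns the output vectors in order and the number of cycles used."""
    regs: List[Optional[Vector]] = [None] * len(stages)  # one register per row
    pending = list(inputs)
    outputs: List[Vector] = []
    cycles = 0
    while pending or any(r is not None for r in regs):
        # The last row's register holds a finished result: retire it.
        if regs[-1] is not None:
            outputs.append(regs[-1])
        # Advance from the bottom row upward so each value moves one row only.
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        # Feed the next input (if any) into the first row.
        regs[0] = stages[0](pending.pop(0)) if pending else None
        cycles += 1
    return outputs, cycles

if __name__ == "__main__":
    rows = [make_stage(k) for k in range(1, 9)]      # 8 rows
    outs, cycles = run_pipeline(rows, [[0] * 8, [1] * 8, [2] * 8])
    print(outs[0], cycles)
```

In this model the total cycle count is the number of inputs plus the number of rows, so once the pipeline is full one result vector retires every cycle. That fixed latency is one way to read the "predictable performance" bullet, though the deck does not spell out its timing model.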
Note: H- and V-bus widths to be optimized. Expandable to 16b, 32b, 64b, etc. word widths.
[Diagram: 8 x 8 array of 8-bit wide PEs with H-Local (8-bit)* and V-Local (8-bit)* buses; Multiple Compute Streams (Compute Stream A, Compute Stream B, ..., Compute Stream N); Local Mem blocks]

Fabric: global clusters
 Global fabric architecture offers:
 Easy scalability to any (X,Y) configuration to fit particular applications
 Pipelined flow architecture
 Higher performance and efficiency
Note: H- and V-bus widths to be optimized
[Diagram: X by Y grid of local clusters (8x8 PEs each) with H-Global (8/16-bit)* and V-Global (8/16-bit)* buses; Multiple Compute Streams]

Fast parallel
computational fabric
 Parallel computational tasks are mapped at compiler level to multiple kernels that execute concurrently inside the fabric
 On-chip HW task-master
 control
 schedule
 monitor
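The control, schedule and monitor roles of the task-master can be illustrated with a small software model. This is only a sketch: the deck describes a hardware block and gives no allocation policy, so the `TaskMaster` class, its method names, and the first-fit placement of kernels onto numbered local clusters are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TaskMaster:
    """Toy model: tracks which local clusters each kernel occupies."""
    total_clusters: int
    placement: Dict[str, List[int]] = field(default_factory=dict)

    def free_clusters(self) -> List[int]:
        used = {c for ids in self.placement.values() for c in ids}
        return [c for c in range(self.total_clusters) if c not in used]

    def schedule(self, kernel: str, clusters_needed: int) -> bool:
        """Schedule: give the kernel the lowest-numbered free clusters.

        Returns False if the kernel is already placed or does not fit."""
        free = self.free_clusters()
        if (kernel in self.placement or clusters_needed < 1
                or clusters_needed > len(free)):
            return False
        self.placement[kernel] = free[:clusters_needed]
        return True

    def release(self, kernel: str) -> None:
        """Control: free the clusters of a finished kernel."""
        self.placement.pop(kernel, None)

    def utilization(self) -> float:
        """Monitor: fraction of clusters currently occupied."""
        return 1 - len(self.free_clusters()) / self.total_clusters

if __name__ == "__main__":
    tm = TaskMaster(total_clusters=16)      # e.g. a 4x4 grid of local clusters
    tm.schedule("Kernel A", 4)
    tm.schedule("Kernel B", 8)
    print(tm.placement, tm.utilization())
```

Kernels A and B run side by side on disjoint clusters, which mirrors the slide's picture of several kernels executing concurrently. A real task-master would also have to respect bus topology and data-stream routing, which this model ignores.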
[Diagram: grid of local clusters (8x8 PEs each) with Multiple Compute Streams; Kernel A, Kernel B and Kernel C mapped onto the grid; TASK MASTER block]

Processing element
An example is this PE from U. of Illinois

Processing Element
 Proposed by Lu Wan, Chen Dong and Deming Chen of the University of Illinois at Urbana-Champaign (2012)
 Advantages:
 Complete
 High-performance bypass path
 Compatible with fabric architecture
 Changes from original:
 No on-the-fly fabric reconfiguration; configuration is done at compile time
 PE to be re-optimized (add barrel shifter?)

Extension: high performance AI
accelerator ASIC platform

AI accelerator ASIC: high
performance platform
 At 1 GHz fabric frequency, the maximum throughput of a 4x4-cluster AI accelerator ASIC is 256 giga byte-operations per second
 (2 bytes x 8 PEs) x (4 x 4 clusters) x 1 GHz = 256 giga byte-operations per second
 At 1 GHz fabric frequency, the maximum throughput of an 8x8-cluster AI accelerator ASIC is about 1 TOPS: (2 bytes x 8 PEs) x (8 x 8 clusters) x 1 GHz = 1,024 giga byte-operations per second
 Fabric can probably operate at up to 4 GHz in 14LPP, giving about 4 TOPS
 512-bit input and output data buffers allow scaling of fabric frequency to over 1 GHz without significant throttling
 Control processor can be 32- or 64-bit (Andes N9 or V-type, ARM Cortex, etc.) operating at 300-500 MHz
 Two independent clock domains: fabric and control
 Power: 10-12 W on 14/16FF (7 W for PCIe-4, 2-3 W for fabric, 2 W for all others)
 Die size: should not exceed 50-60 mm²
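The scaling figures above follow from the same formula applied to larger cluster grids. The sketch below tabulates them; the function names are hypothetical. The last comparison is an added sanity check rather than a claim from the deck: it assumes one 512-bit buffer word per fabric cycle and that each of the 32 SerDes lanes runs at the PCIe 4.0 raw rate of 16 GT/s, ignoring encoding and protocol overhead.

```python
GHZ = 1_000_000_000

def peak_byte_ops_per_s(cluster_rows: int, cluster_cols: int, fabric_hz: int,
                        bytes_per_pe: int = 2, pes_counted: int = 8) -> int:
    """Slide formula: (bytes x PEs counted) x clusters x fabric frequency."""
    return bytes_per_pe * pes_counted * cluster_rows * cluster_cols * fabric_hz

def buffer_bits_per_s(width_bits: int, fabric_hz: int) -> int:
    """Buffer bandwidth, ASSUMING one full word per fabric cycle."""
    return width_bits * fabric_hz

def serdes_raw_bits_per_s(lanes: int, transfers_per_s: int = 16 * GHZ) -> int:
    """Raw line rate of the SerDes lanes (PCIe 4.0: 16 GT/s per lane),
    before 128b/130b encoding and protocol overhead."""
    return lanes * transfers_per_s

if __name__ == "__main__":
    for rows, cols, hz in [(4, 4, 1 * GHZ), (8, 8, 1 * GHZ), (8, 8, 4 * GHZ)]:
        ops = peak_byte_ops_per_s(rows, cols, hz)
        print(f"{rows}x{cols} clusters at {hz // GHZ} GHz: "
              f"{ops / 1e9:g} giga byte-ops/s")
    print("512-bit buffer at 1 GHz:", buffer_bits_per_s(512, GHZ) / 1e9, "Gbps")
    print("32 lanes raw:", serdes_raw_bits_per_s(32) / 1e9, "Gbps")
```

Under these assumptions the buffer demand at 1 GHz and the raw SerDes rate both come to 512 Gbps, so usable link bandwidth would fall slightly short of the buffer rate once overhead is counted. That is one possible reading of "without significant throttling".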
[Block diagram: 1 GHz AI Compute Fabric (4x4 blocks of 8x8 PEs); Vertical Fabric Flow Control; Horizontal Fabric Control; 512-bit Wide Input Data Buffer; 512-bit Wide Output Data Buffer; 32 PCIe-4 SerDes, shown twice; 32- or 64-bit Control Processor; IO Interface Control; 64KB L1 / 1MB L2 SRAM; PLLs; Power Mgmt; ROM/Security; GPIOs/LVDS; JTAG/Test; CPU LPDDR Controller & PHY; Fabric LPDDR Controller & PHY]
This architecture utilizes a dedicated CPU for AI tasks in the SoC.
