A scalable AI accelerator ASIC platform
K. Le, January 17, 2019
Presented for information only. No guarantee of accuracy or correctness.
Market moving toward a different type of AI chip
• New realization: cloud-based AI processing has many limitations
  • Concentrated cloud-based AI processing costs too much storage, power and bandwidth
  • Limited ability to support real-time requirements (automotive, robots, drones, etc.)
  • Connectivity to the cloud is not always guaranteed (security, mobility, network coverage, etc.)
• Better user experience requires local AI processing
• Mega-trend is toward edge AI processing
• Need new AI chips to enable “sensor-triggered actions” and “decentralized AI” at the edge
Required edge AI processing functions
• Audio
  • Speech recognition
  • Identification, security
  • Language processing/translation
• Video
  • Image recognition
  • Pattern/object/face recognition
• Environmental/physical condition
  • Pressure, tension, force, temperature, noise, heartbeat, humidity, etc.
A scalable accelerator ASIC platform for edge AI
High-level architecture
• Based on a scalable AI compute fabric
• Pipelined flow for fast learning and inferring
• Flexible architecture suitable for cloud, gateway and edge AI
• Allows up-scaling to multi-chip solutions
[Figure: block diagram showing the AI Compute Fabric; Input Data Buffer, Memory & Control; Output Data Buffer, Memory & Control; Control Processor; IO Interfaces; ROM/SRAM; PLL & PM; and multiple data streams.]
• AI COMPUTE FABRIC
• MULTIPLE PARALLEL DATA STREAMS
• SCALABLE AND PARTITIONABLE
• ENERGY EFFICIENT
• FAST (UP TO 2-4 GHz)
• CONTROL PROCESSOR (ANDES, ARM, MIPS, RISC-V, ETC.)
• LEARNING (CALCULATIONS, ALGORITHM EXECUTION, COMPARING, MODEL UPDATES)
• INFERRING (ALGORITHM UPDATE, DECISION MAKING)
• IO INTERFACE TO MULTI-CHIP SOLUTIONS
Detailed diagram
• At 250 MHz fabric frequency, the maximum throughput of a 1-cluster AI accelerator is 4 giga byte-operations per second
  • (2 bytes x 8 PEs) x 1 cluster x 0.25 GHz -> 4 giga byte-operations per second
• 128-bit input and output data buffers allow scaling of fabric frequency without throttling
• Control processor can be V-type, ARM Cortex, etc., operating at 100-250 MHz
• Two independent clock domains: fabric and control
• Power: <2 W on 14-16FF at 250 MHz
• Die size: 15 to 20 mm²
[Figure: detailed block diagram. AI Compute Fabric (1 block of 8x8 PEs) with Vertical Fabric Flow Control and Horizontal Fabric Control; 128-bit wide input and output data buffers, each with a 32 Gbps interface (32 x 1 Gbps LVDS); Control Processor (CPU) with 32 KB L1 / 0.5 MB L2 SRAM; LPDDR Controller & PHY; IO Interface Control; PLLs; Power Mgmt; ROM/Security; GPIOs/LVDS; JTAG/Test.]
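The numbers on this slide can be checked with a few lines of arithmetic. The sketch below assumes the slide's own formula, (2 bytes x 8 PE) x clusters x fabric frequency, and additionally assumes that each 128-bit buffer moves one full word per fabric cycle. The function names are illustrative and do not come from the deck.

```python
# Illustrative only: names and defaults are assumptions read off the slide's formula.

def peak_byte_ops_per_s(clusters: int, fabric_hz: int,
                        bytes_per_pe: int = 2, active_pes: int = 8) -> int:
    """(bytes_per_pe x active_pes) x clusters x fabric frequency."""
    return bytes_per_pe * active_pes * clusters * fabric_hz


def buffer_bits_per_s(width_bits: int, fabric_hz: int) -> int:
    """Raw bandwidth of a buffer that moves one full word per fabric cycle (assumed)."""
    return width_bits * fabric_hz


FABRIC_HZ = 250_000_000  # 250 MHz fabric clock from the slide

ops = peak_byte_ops_per_s(clusters=1, fabric_hz=FABRIC_HZ)
bw = buffer_bits_per_s(width_bits=128, fabric_hz=FABRIC_HZ)

print(f"peak throughput:  {ops / 1e9:g} giga byte-ops/s")  # 4
print(f"buffer bandwidth: {bw / 1e9:g} Gbps")              # 32
```

Under these assumptions the 128-bit buffer width equals the 16 bytes (2 x 8) that the formula consumes per cycle, and its bandwidth equals the 32 Gbps interface (32 x 1 Gbps LVDS) labeled in the diagram.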
AI accelerator ASIC platform: multi-chip solutions
• Up-scalable SOC and system architecture
• Suitable for massive data processing
• Connectivity to server racks or cloud via network
[Figure: four accelerator blocks, each with an AI Compute Fabric; Input Data Buffer, Memory & Control; and Output Data Buffer, Memory & Control. They attach via Host IO: PCIe-4 to a Switch & Flow Control SOC, which has a Host IO: PCIe-4 link to cloud or cards/racks.]
Processing fabric
An example is the PipeRench coarse-grained reconfigurable architecture from Carnegie Mellon University.
Fabric: local cluster
• Local fabric architecture offers:
  • 8x8 local cluster configuration sufficient for most applications
  • Byte-wide processing elements
  • Easy scalability to 8 bytes per local cluster
  • Predictable performance
  • Ample routing resources
  • Pipelined flow architecture
  • Faster and more power efficient than CPU/GPU architectures
• Might add local mem blocks for reverse machine learning applications
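The pipelined flow listed above can be illustrated with a toy cycle-level model. This is a sketch under assumptions the deck does not state: each of the 8 rows of a local cluster is treated as one pipeline stage working on an 8-byte word (one byte per PE), and the per-row operation (add 1) is an arbitrary placeholder. All names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Word = List[int]                  # 8 bytes, one per byte-wide PE in a row
Stage = Callable[[Word], Word]    # one row of PEs acting as a pipeline stage


def byte_op(f: Callable[[int], int]) -> Stage:
    """Lift a scalar function to a row of byte-wide PEs (results wrap to 8 bits)."""
    return lambda word: [f(b) & 0xFF for b in word]


def run_pipeline(stages: List[Stage], inputs: List[Word]) -> List[Tuple[int, Word]]:
    """Step the pipeline cycle by cycle; return (cycle, word) pairs leaving the last stage."""
    regs: List[Optional[Word]] = [None] * len(stages)  # output register of each stage
    pending = list(inputs)
    out: List[Tuple[int, Word]] = []
    cycle = 0
    while len(out) < len(inputs):
        cycle += 1
        # Update from the last stage backwards so each word advances one stage per cycle.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](pending.pop(0)) if pending else None
        if regs[-1] is not None:
            out.append((cycle, regs[-1]))
    return out


stages = [byte_op(lambda b: b + 1)] * 8          # 8 rows, placeholder operation
words = [list(range(8)), [10] * 8, [250] * 8]
results = run_pipeline(stages, words)

for cycle, word in results:
    print(cycle, word)
```

In this model the first result appears after 8 cycles (the pipeline depth) and one further result appears every cycle after that, which is one way to read the "predictable performance" bullet.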
Note: H- and V-bus widths to be optimized. Expandable to 16b, 32b, 64b, etc. word widths.
[Figure: 8 x 8 array of 8-bit-wide PEs connected by V-Local (8-bit)* and H-Local (8-bit)* buses, carrying multiple compute streams (Compute Stream A, Compute Stream B, ... Compute Stream N), with Local Mem blocks.]
Fabric: global clusters
• Global fabric architecture offers:
  • Easy scalability to any (X,Y) configuration to fit particular applications
  • Pipelined flow architecture
  • Higher performance and efficiency
Note: H- and V-bus widths to be optimized
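One way to read "scalability to any (X,Y) configuration" is as a sizing exercise: pick the smallest grid of local clusters that meets a target throughput at a given fabric frequency. The sketch below is only an illustration. It assumes the per-cluster peak of (2 bytes x 8 PE) per cycle used in this deck's throughput formulas, and the preference for a near-square grid is my assumption, not the deck's.

```python
import math
from typing import Tuple

BYTE_OPS_PER_CLUSTER_PER_CYCLE = 2 * 8   # from the deck's formula: (2 bytes x 8 PE)


def clusters_needed(target_byte_ops_per_s: int, fabric_hz: int) -> int:
    """Minimum number of local clusters whose combined peak meets the target."""
    per_cluster = BYTE_OPS_PER_CLUSTER_PER_CYCLE * fabric_hz
    return -(-target_byte_ops_per_s // per_cluster)   # ceiling division on integers


def grid_for(clusters: int) -> Tuple[int, int]:
    """Near-square (X, Y) grid with X * Y >= clusters (assumed layout preference)."""
    if clusters < 1:
        raise ValueError("need at least one cluster")
    x = math.isqrt(clusters - 1) + 1      # ceil(sqrt(clusters))
    y = -(-clusters // x)                 # ceil(clusters / x)
    return x, y


GHZ = 10**9
for target in (256 * 10**9, 10**12):      # 256 giga byte-ops/s and 1 TOPS
    n = clusters_needed(target, 1 * GHZ)
    print(target, n, grid_for(n))
```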
[Figure: X x Y grid of local clusters (8x8 PEs each) connected by V-Global (8/16-bit)* and H-Global (8/16-bit)* buses, carrying multiple compute streams.]
Fast parallel computational fabric
• Parallel computational tasks are mapped at compiler level to multiple kernels that execute concurrently inside the fabric
• On-chip HW task-master
  • Control
  • Schedule
  • Monitor
[Figure: grid of local clusters (8x8 PEs each) carrying multiple compute streams, with regions labeled Kernel A, Kernel B and Kernel C and a TASK MASTER block.]
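The deck describes the task-master only by its three roles (control, schedule, monitor). As a purely hypothetical software model of those roles, the sketch below reserves disjoint sets of local clusters for named kernels and tracks their state. The class, its methods, and the first-fit row-major placement are all assumptions made for illustration, not the hardware design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Cluster = Tuple[int, int]   # (x, y) position of one 8x8-PE local cluster


@dataclass
class TaskMaster:
    """Toy model of the task-master roles named on the slide (hypothetical API)."""
    x: int
    y: int
    placement: Dict[str, List[Cluster]] = field(default_factory=dict)
    state: Dict[str, str] = field(default_factory=dict)

    def _free(self) -> List[Cluster]:
        used = {c for cs in self.placement.values() for c in cs}
        return [(i, j) for j in range(self.y) for i in range(self.x)
                if (i, j) not in used]

    def schedule(self, kernel: str, n_clusters: int) -> List[Cluster]:
        """Reserve clusters for a kernel using first-fit, row-major order (assumed policy)."""
        if kernel in self.placement:
            raise ValueError(f"{kernel} is already scheduled")
        free = self._free()
        if n_clusters > len(free):
            raise RuntimeError("not enough free clusters")
        self.placement[kernel] = free[:n_clusters]
        self.state[kernel] = "scheduled"
        return self.placement[kernel]

    def control(self, kernel: str, command: str) -> None:
        """Start or stop a kernel that has already been scheduled."""
        if kernel not in self.placement:
            raise KeyError(kernel)
        if command not in ("start", "stop"):
            raise ValueError(f"unknown command: {command}")
        self.state[kernel] = "running" if command == "start" else "stopped"

    def monitor(self) -> Dict[str, str]:
        """Snapshot of every kernel's state."""
        return dict(self.state)


tm = TaskMaster(x=4, y=4)
a = tm.schedule("Kernel A", 8)
b = tm.schedule("Kernel B", 4)
c = tm.schedule("Kernel C", 4)
tm.control("Kernel A", "start")
status = tm.monitor()
print(status)
```

Because the deck says the mapping is done at compiler level, a real flow would presumably compute the placement offline; the model does it at call time only to keep the example short.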
Processing element
An example is this PE from U. of Illinois
Processing element
• Proposed by Lu Wan, Chen Dong and Deming Chen of the University of Illinois at Urbana-Champaign (2012)
• Advantages:
  • Complete
  • High-performance by-pass path
  • Compatible with fabric architecture
• Changes from original:
  • No on-the-fly fabric reconfiguration; configuration is done at compile time
  • PE to be re-optimized (add barrel shifter?)
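To make two of the listed properties concrete (a by-pass path, and configuration fixed at compile time), here is a toy byte-wide PE. It is not the Illinois design: the opcode set, the names, and the shift operation (a stand-in for the barrel shifter the slide raises as a question) are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    ADD = "add"
    AND = "and"
    SHL = "shl"        # stand-in for a possible barrel shifter
    BYPASS = "bypass"  # forward operand a unchanged


@dataclass(frozen=True)   # frozen: the configuration cannot change after it is created
class PEConfig:
    op: Op


def pe_eval(cfg: PEConfig, a: int, b: int = 0) -> int:
    """Evaluate one byte-wide PE; every result is truncated to 8 bits."""
    if cfg.op is Op.BYPASS:
        return a & 0xFF
    if cfg.op is Op.ADD:
        return (a + b) & 0xFF
    if cfg.op is Op.AND:
        return a & b & 0xFF
    if cfg.op is Op.SHL:
        return (a << (b & 7)) & 0xFF
    raise ValueError(f"unsupported op: {cfg.op}")


add_pe = PEConfig(Op.ADD)
bypass_pe = PEConfig(Op.BYPASS)
print(pe_eval(add_pe, 200, 100))     # wraps modulo 256
print(pe_eval(bypass_pe, 0x5A, 7))   # second operand is ignored
```

The frozen dataclass mirrors the "done at compile time" change: a PE's operation is chosen once and any later attempt to modify it raises an error.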
Extension: high-performance AI accelerator ASIC platform
AI accelerator ASIC: high-performance platform
• At 1 GHz fabric frequency, the maximum throughput of a 4x4-cluster AI accelerator ASIC is 256 giga byte-operations per second
  • (2 bytes x 8 PEs) x 4x4 clusters x 1 GHz -> 256 giga byte-operations per second
• At 1 GHz fabric frequency, the maximum throughput of an 8x8-cluster AI accelerator ASIC is 1 TOPS
• Fabric can probably operate at up to 4 GHz in 14LPP -> 4 TOPS
• 512-bit input and output data buffers allow scaling of fabric frequency to over 1 GHz without significant throttling
• Control processor can be 32- or 64-bit (Andes N9 or V-type, ARM Cortex, etc.), operating at 300-500 MHz
• Two independent clock domains: fabric and control
• Power: 10-12 W on 14/16FF (7 W for PCIe-4, 2-3 W for fabric, 2 W for all others)
• Die size: should not exceed 50-60 mm²
[Figure: detailed block diagram. 1 GHz AI Compute Fabric (4x4 blocks of 8x8 PEs) with Vertical Fabric Flow Control and Horizontal Fabric Control; 512-bit wide input and output data buffers, each with 32 PCIe-4 SerDes; 32- or 64-bit Control Processor (CPU) with 64 KB L1 / 1 MB L2 SRAM; CPU LPDDR Controller & PHY; Fabric LPDDR Controller & PHY; IO Interface Control; PLLs; Power Mgmt; ROM/Security; GPIOs/LVDS; JTAG/Test.]
This architecture utilizes a dedicated CPU for AI tasks in the SOC.
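The throughput and power figures quoted for the high-performance platform can be reproduced with the same formula the deck uses, (2 bytes x 8 PE) x clusters x fabric frequency. The sketch below is illustrative: the function and variable names are mine, and the power entries are simply the itemized values from the slide.

```python
def peak_byte_ops_per_s(clusters: int, fabric_hz: int,
                        bytes_per_pe: int = 2, active_pes: int = 8) -> int:
    """(bytes_per_pe x active_pes) x clusters x fabric frequency, as on the slide."""
    return bytes_per_pe * active_pes * clusters * fabric_hz


GHZ = 10**9
configs = [
    ("4x4 clusters at 1 GHz", 4 * 4, 1 * GHZ),
    ("8x8 clusters at 1 GHz", 8 * 8, 1 * GHZ),
    ("8x8 clusters at 4 GHz", 8 * 8, 4 * GHZ),
]
table = {name: peak_byte_ops_per_s(n, hz) for name, n, hz in configs}
for name, ops in table.items():
    print(f"{name}: {ops / 1e9:g} giga byte-ops/s")

# Itemized power budget from the slide, in watts, as (low, high) ranges.
power_w = {"PCIe-4": (7, 7), "fabric": (2, 3), "all others": (2, 2)}
total_low = sum(low for low, _ in power_w.values())
total_high = sum(high for _, high in power_w.values())
print(f"itemized power: {total_low}-{total_high} W")
```

The 8x8 results, 1024 and 4096 giga byte-operations per second, correspond to the deck's rounded "1 TOPS" and "4 TOPS". The itemized power entries sum to 11-12 W, which sits in the upper part of the quoted 10-12 W range.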


Editor's Notes

  1. 160 zettabytes by 2025 from AI devices; 2.6 billion connected AI devices.
  2. CPU and memory contain the known agents, model representations, and algorithms for ML. Needs DDR (128b) for large data sets and complex models and algorithms.
  3. PCIe-4 switch and interface to rack or cloud.
  4. CPU and memory contain the known agents, model representations, and algorithms for ML. Needs DDR (128b) for large data sets and complex models and algorithms.