The document proposes a scalable AI accelerator ASIC platform for edge AI processing. It describes a high-level architecture based on a scalable AI compute fabric that allows for fast learning and inference. The architecture is flexible and can scale from single-chip solutions to multi-chip solutions connected via high-speed interfaces. It also provides details on the AI compute fabric, processing elements, and how the platform could enable high-performance edge AI processing.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Bill Jenkins, Senior Product Specialist for High Level Design Tools at Intel, presents the "Accelerating Deep Learning Using Altera FPGAs" tutorial at the May 2016 Embedded Vision Summit.
While large strides have recently been made in the development of high-performance systems for neural networks based on multi-core technology, significant challenges in power, cost, and performance scaling remain. Field-programmable gate arrays (FPGAs) are a natural choice for implementing neural networks because they can combine computing, logic, and memory resources in a single device. Intel's Programmable Solutions Group has developed a scalable convolutional neural network reference design for deep learning systems using the OpenCL programming language, built with our SDK for OpenCL. The design's performance is being benchmarked using several popular CNN benchmarks: CIFAR-10, ImageNet and KITTI.
Building the CNN with OpenCL kernels allows true scaling of the design from smaller to larger devices and from one device generation to the next. New designs can be sized using different numbers of kernels at each layer. Performance scaling from one generation to the next also benefits from architectural advancements, such as floating-point engines and frequency scaling. Thus, you achieve greater than linear performance and performance per watt scaling with each new series of devices.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/xilinx/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nick Ni, Director of Product Marketing at Xilinx, presents the "Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability" tutorial at the May 2019 Embedded Vision Summit.
AI inference demands orders-of-magnitude more compute capacity than what today’s SoCs offer. At the same time, neural network topologies are changing too quickly to be addressed by ASICs that take years to go from architecture to production. In this talk, Ni introduces the Xilinx AI Engine, which complements the dynamically-programmable FPGA fabric to enable ASIC-like performance via custom data flows and a flexible memory hierarchy. This combination provides an orders-of-magnitude boost in AI performance along with the hardware architecture flexibility needed to quickly adapt to rapidly evolving neural network topologies.
Linux on RISC-V with Open Hardware (ELC-E 2020) - Drew Fustini
Want to run Linux on open hardware? This talk will explore how RISC-V, an open instruction set architecture (ISA), and open source FPGA tools can be leveraged to achieve that goal. I will explain how I and others at Hackaday Supercon teamed up to get Linux running on a RISC-V soft core in the ECP5 FPGA on the conference badge. I will introduce Migen, LiteX and VexRiscv, and explain how they enabled us to quickly implement an SoC in the FPGA capable of running Linux. I will also explore other Linux-capable open source RISC-V implementations, and how some are being used in industry. I will highlight that OpenHW Group has adopted the PULP Ariane from ETH Zurich for its CORE-V CVA6 implementation. Finally, I will look at what Linux-capable "hard" RISC-V SoCs currently exist, and what is on the horizon for 2020 and 2021. This talk should be relevant to people who are interested in building open hardware systems capable of running Linux. It should also be useful to people who are curious about RISC-V. Software engineers may find it exciting to learn how Python can be used for chip-level design with Migen and LiteX, simplifying the building of a System-on-Chip (SoC) for an FPGA.
A short introduction to FPGA acceleration and the impact of the new high-level synthesis toolchains on FPGA programmability
Video here: https://www.linkedin.com/posts/marcobarbone_can-my-application-benefit-from-fpga-acceleration-activity-6848674747375460352-0fua
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit-google
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Pete Warden, Staff Research Engineer and TensorFlow Lite development lead at Google, presents the "Using TensorFlow Lite to Deploy Deep Learning on Cortex-M Microcontrollers" tutorial at the May 2019 Embedded Vision Summit.
Is it possible to deploy deep learning models on low-cost, low-power microcontrollers? While it may be surprising, the answer is a definite “yes”! In this talk, Warden explains how the new TensorFlow Lite framework enables creating very lightweight DNN implementations suitable for execution on microcontrollers. He illustrates how this works using an example of a 20 Kbyte DNN model that performs speech wake word detection, and discusses how this generalizes to image-based use cases. Warden introduces TensorFlow Lite, and explores the key steps in implementing lightweight DNNs, including model design, data gathering, hardware platform choice, software implementation and optimization.
Jim Huang (jserv) from 0xlab.org prepared this technical training on ARM and SoC design. Part I introduces an overview of the ARM architecture, processor families and ISA features, followed by an SoC overview and several practical examples using the XScale SoC.
AI model efficiency is crucial for making AI ubiquitous, leading to smarter devices and enhanced lives. Besides the performance benefit, quantized neural networks also increase power efficiency for two reasons: reduced memory access costs and increased compute efficiency.
The quantization work done by the Qualcomm AI Research team is crucial in implementing machine learning algorithms on low-power edge devices. In network quantization, we focus on both pushing the state of the art (SOTA) in compression and making quantized inference as easy to access as possible. For example, our SOTA work on oscillations in quantization-aware training pushes the boundaries of what is possible with INT4 quantization. Furthermore, for ease of deployment, integer formats such as INT16 and INT8 give performance comparable to floating-point formats such as FP16 and FP8, but with significantly better performance per watt. Researchers and developers can make use of this quantization research to successfully optimize and deploy their models across devices with open-source tools like the AI Model Efficiency Toolkit (AIMET).
Presenters: Tijmen Blankevoort and Chirag Patel
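As a rough illustration of why integer formats cut memory traffic, here is a minimal symmetric INT8 quantize/dequantize round trip in NumPy. This is a generic sketch, not Qualcomm's AIMET API; the function names and the per-tensor scale choice are my assumptions.

```python
import numpy as np

def quantize_int8(x, scale):
    """Map float32 values to INT8 using a symmetric per-tensor scale."""
    q = np.round(x / scale)
    return np.clip(q, -127, 127).astype(np.int8)

def dequantize_int8(q, scale):
    """Recover approximate float32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
scale = float(np.abs(w).max()) / 127.0   # symmetric per-tensor scale
q = quantize_int8(w, scale)              # 1 byte per weight instead of 4
w_hat = dequantize_int8(q, scale)        # rounding error bounded by scale/2
```

Each weight now occupies one byte instead of four, which is the "reduced memory access cost" half of the argument; the integer arithmetic itself is the "increased compute efficiency" half.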
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/12/vitis-and-vitis-ai-application-acceleration-from-cloud-to-edge-a-presentation-from-xilinx/
For more information about edge AI and computer vision, please visit:
https://www.edge-ai-vision.com
Vinod Kathail, Fellow and Chief Architect at Xilinx, presents the “Vitis and Vitis AI: Application Acceleration from Cloud to Edge” tutorial at the September 2020 Embedded Vision Summit.
Xilinx SoCs and FPGAs provide significant advantages in throughput, latency, and energy efficiency for production deployments of compute-intensive applications when compared to CPUs and GPUs. Over the last decade, FPGAs have evolved into highly configurable devices that provide on-chip heterogeneous multi-core CPUs, domain-specific programmable accelerators and “any-to-any” interface connectivity.
Today, the Xilinx Vitis Unified Software Platform supports high-level programming in C, C++, OpenCL, and Python, enabling developers to build and seamlessly deploy applications on Xilinx platforms including Alveo cards, FPGA instances in the cloud, and embedded devices. Moreover, Vitis enables the acceleration of large-scale data processing and machine learning applications using familiar high-level frameworks, such as TensorFlow and Spark. This presentation provides an overview of the Vitis software platform and the accelerated Vitis Vision Library, which enables customizable functions such as image signal processing, adaptable AI inference, 3D reconstruction and motion analysis.
A talk I gave at Creative Crew (Singapore) on 12 August 2016 to introduce newcomers to the Raspberry Pi.
Video link of this talk can be found here: https://engineers.sg/v/955
Code used in the talk can be found here: https://github.com/yeokm1/getting-started-with-rpi
PCIe Gen 3.0 Presentation @ 4th FPGA Camp - FPGA Central
PCIe Gen3 presentation by PLDA at 4th FPGA Camp in Santa Clara, CA. For more details visit http://www.fpgacentral.com/fpgacamp or http://www.fpgacentral.com
How to Select Hardware for Internet of Things Systems? - Hannes Tschofenig
With the increasing commercial interest in Internet of Things (IoT) the question about a reasonable hardware configuration surfaces again and again.
Peter Aldworth, a hardware engineer with more than 19 years of experience, discusses this topic in a presentation given to the IETF community.
In this session I will describe the Hortonworks and IBM Power solutions and how they can deliver significant business value through the rapid adoption of open innovation in future cognitive applications. In addition, I will introduce the unique added value that the IBM and Hortonworks partnership can provide from the viewpoints of storage, analytics, data science and streaming analysis.
Heterogeneous Computing: The Future of Systems - Anand Haridass
Charts from NITK-IBM Computer Systems Research Group (NCSRG)
- Dennard scaling, Moore's Law, OpenPOWER, storage-class memory, FPGA, GPU, CAPI, OpenCAPI, NVIDIA NVLink, and heterogeneous system usage at Google and Microsoft
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI - insideBigData.com
Satoshi Matsuoka from RIKEN gave this talk at the HPC User Forum in Santa Fe.
"With the rapid rise of big data and AI as a new breed of high-performance workloads on supercomputers, we need to accommodate them at scale, hence the need for R&D on hardware and software infrastructures where traditional simulation-based HPC and big data/AI converge, in a bytes-oriented fashion. Post-K is the flagship next-generation national supercomputer being developed by RIKEN and Fujitsu in collaboration. Post-K will have hyperscale-class resources in one exascale machine, with well more than 100,000 nodes of server-class A64fx many-core Arm CPUs, realized through an extensive co-design process involving the entire Japanese HPC community.
Rather than focusing on double-precision FLOPS, which are of lesser utility, Post-K, especially its A64fx processor and Tofu-D network, is designed to sustain extreme bandwidth on realistic applications, including those for oil and gas such as seismic wave propagation and CFD, as well as structural codes, besting its rivals by several factors in measured performance. Post-K is slated to perform 100 times faster on some key applications compared to its predecessor, the K computer, and will also likely be the premier big data and AI/ML infrastructure. Currently, we are conducting research to scale deep learning to more than 100,000 nodes on Post-K, where we would obtain near top GPU-class performance on each node."
Watch the video: https://wp.me/p3RLHQ-k6G
Learn more: https://en.wikichip.org/wiki/supercomputers/post-k
and
http://hpcuserforum.com
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Intel Microprocessors - a Top-down Approach - Editor IJCATR
Intel is the world's largest manufacturer of computer chips. Although it has been challenged in recent years by newcomers AMD and Cyrix, Intel still dominates the market for PC microprocessors; nearly all PCs are based on Intel's x86 architecture. IBM (International Business Machines) is by far the world's largest information technology company in terms of gross revenue ($88 billion in 2000) and by most other measures, a position it has held for about the past 50 years. IBM products include hardware and software for a line of business servers, storage products, custom-designed microchips, and application software. Increasingly, IBM derives revenue from a range of consulting and outsourcing services. In this paper we compare different computer system technologies, processors and chips.
VEDLIoT – A heterogeneous hardware platform for next-gen AIoT applications, Jens Hagemeyer, EU-IoT Training Session on “Machine Learning at the Edge and the FarEdge”, IoT Week (online event), August 2021
An AI accelerator ASIC architecture
1. A scalable AI accelerator ASIC platform
K. Le, January 17, 2019
Presented for information only; no guarantee of accuracy or correctness.
2. Market moving toward a different type of AI chips
New realization: cloud-based AI processing has many limitations
- Concentrated cloud-based AI processing costs too much storage, power and bandwidth
- Limited ability to support real-time requirements (automotive, robots, drones, etc.)
- Connectivity to the cloud is not always guaranteed (security, mobility, network coverage, etc.)
Better user experience requires local AI processing
The mega-trend is toward edge AI processing
Need new AI chips to enable "sensor-triggered actions" and "decentralized AI" at the edge
3. Required edge AI processing functions
Audio
- speech recognition
- identification, security
- language processing/translation
Video
- image recognition
- pattern/object/face recognition
Environmental/physical condition
- pressure, tension, force, temperature, noise, heartbeat, humidity, etc.
5. High-level architecture
Based on a scalable AI compute fabric
Pipelined flow for fast learning and inferring
Flexible architecture suitable for cloud, gateway and edge AI
Allows up-scaling to multi-chip solutions
[Block diagram: multiple data streams feed the AI Compute Fabric through an input data buffer, memory & control block; results leave through an output data buffer, memory & control block; supporting blocks include a control processor, IO interfaces, ROM/SRAM, and PLL & power management]
- AI compute fabric: multiple parallel data streams; scalable and partitionable; energy efficient; fast (up to 2-4 GHz)
- Control processor (Andes, ARM, MIPS, RISC-V, etc.)
- Learning (calculations, algorithm execution, comparing, model updates)
- Inferring (algorithm updates, decision making)
- IO interface to multi-chip solutions
6. Detailed diagram
At 250 MHz fabric frequency, the maximum throughput of a 1-cluster AI accelerator is 4 giga byte-ops: (2 bytes x 8 PE) x 1 cluster x 0.25 GHz -> 4 giga byte operations per second
128-bit input and output data buffers allow scaling of fabric frequency without throttling
Control processor can be V-type, ARM Cortex, etc., operating at 100-250 MHz
Two independent clock domains: fabric and control
Power: <2 W on 14/16FF at 250 MHz
Die size: 15 to 20 mm2
[Block diagram: the AI Compute Fabric (1 block of 8x8 PEs) sits between a 128-bit wide input data buffer and a 128-bit wide output data buffer, each served by a 32 Gbps interface (32 x 1 Gbps LVDS); supporting blocks include the control processor with 32 KB L1 / 0.5 MB L2 SRAM, IO interface control, PLLs, horizontal and vertical fabric flow control, power management, ROM/security, GPIOs/LVDS, JTAG/test, and a CPU-side LPDDR controller & PHY]
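The slide's throughput arithmetic can be checked with a short sketch. The formula is the one stated on the slide; the function name and parameterization are mine.

```python
def fabric_peak_bytes_per_s(bytes_per_pe, pes, clusters, freq_hz):
    """Peak throughput = (bytes per PE per cycle) x PEs x clusters x frequency,
    per the slide's formula: (2 bytes x 8 PE) x 1 cluster x 0.25 GHz."""
    return bytes_per_pe * pes * clusters * freq_hz

# 1-cluster accelerator at the 250 MHz fabric frequency quoted above
one_cluster = fabric_peak_bytes_per_s(2, 8, 1, 250e6)  # 4e9 byte-ops/s
```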
7. AI accelerator ASIC platform: multi-chip solutions
Up-scalable SoC & system architecture
Suitable for massive data processing
Connectivity to server racks or cloud via network
[Block diagram: four AI Compute Fabric chips, each with its own input and output data buffer, memory & control blocks, connected to a switch & flow control SoC; PCIe-4 host IO links the switch to the cloud or to cards/racks]
8. Processing fabric
An example is the "PipeRench" coarse-grained reconfigurable architecture from Carnegie Mellon University.
9. Fabric: local cluster
The local fabric architecture offers:
- 8x8 local cluster configuration, sufficient for most applications
- Byte-wide processing elements
- Easy scalability to 8 bytes per local cluster
- Predictable performance
- Ample routing resources
- Pipelined flow architecture
- Faster and more power-efficient than CPU/GPU architectures
- Might add local memory blocks for reverse machine-learning applications
Note: H- and V-bus widths to be optimized; expandable to 16b, 32b, 64b, etc. word widths
[Diagram: an 8x8 grid of byte-wide processing elements (PEs) with 8-bit H-Local and V-Local buses, local memory blocks, and multiple compute streams (Compute Stream A, B, ..., N)]
10. Fabric: global clusters
The global fabric architecture offers:
- Easy scalability to any (X,Y) configuration to fit particular applications
- Pipelined flow architecture
- Higher performance and efficiency
Note: H- and V-bus widths to be optimized
[Diagram: an X-by-Y grid of local clusters (8x8 PEs each) connected by 8/16-bit H-Global and V-Global buses, carrying multiple compute streams]
11. Fast parallel computational fabric
Parallel computational tasks are mapped at compiler level to multiple kernels concurrently executed inside the fabric
On-chip HW task master: control, schedule, monitor
[Diagram: kernels A, B and C mapped onto the grid of local 8x8-PE clusters, with multiple compute streams, under the supervision of the task master]
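The deck does not specify the task master's scheduling policy. As a purely illustrative sketch of the idea of mapping compiler-generated kernels onto the grid of local clusters, a round-robin placement might look like this; all names and the policy itself are hypothetical.

```python
from itertools import cycle

def assign_kernels(kernels, grid_x, grid_y):
    """Round-robin placement of kernels onto an X-by-Y grid of 8x8-PE clusters.

    A simplified stand-in for the on-chip HW task master: it only decides
    which cluster runs which kernel, ignoring scheduling and monitoring.
    """
    clusters = [(x, y) for y in range(grid_y) for x in range(grid_x)]
    slots = cycle(clusters)  # wrap around if kernels outnumber clusters
    return {kernel: next(slots) for kernel in kernels}

# Three concurrent kernels on a 4x4 grid of local clusters
placement = assign_kernels(["kernel_A", "kernel_B", "kernel_C"], grid_x=4, grid_y=4)
```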
13. Processing element
Proposed by Lu Wan, Chen Dong and Deming Chen of the University of Illinois at Urbana-Champaign (2012)
Advantages:
- Complete
- High-performance bypass path
- Compatible with the fabric architecture
Changes from the original:
- No on-the-fly fabric reconfiguration; it is done at compile time
- PE to be re-optimized (add a barrel shifter?)
15. AI accelerator ASIC: high-performance platform
At 1 GHz fabric frequency, the maximum throughput of a 4x4-cluster AI accelerator ASIC is 256 giga byte-ops: (2 bytes x 8 PE) x 4x4 clusters x 1 GHz -> 256 giga byte operations per second
At 1 GHz fabric frequency, the maximum throughput of an 8x8-cluster AI accelerator ASIC is 1 TOPS
The fabric can probably operate at up to 4 GHz in 14LPP -> 4 TOPS
512-bit input and output data buffers allow scaling of the fabric frequency to over 1 GHz without significant throttling
Control processor can be 32- or 64-bit (Andes N9 or V-type, ARM Cortex, etc.) operating at 300-500 MHz
Two independent clock domains: fabric and control
Power: 10-12 W on 14/16FF (7 W for PCIe-4, 2-3 W for fabric, 2 W for all others)
Die size: should not exceed 50-60 mm2
[Block diagram: the 1 GHz AI Compute Fabric (4x4 blocks of 8x8 PEs) sits between 512-bit wide input and output data buffers, each served by 32 PCIe-4 SerDes lanes; supporting blocks include the 32- or 64-bit control processor with 64 KB L1 / 1 MB L2 SRAM, IO interface control, PLLs, horizontal and vertical fabric flow control, power management, ROM/security, GPIOs/LVDS, JTAG/test, and separate CPU and fabric LPDDR controllers & PHYs]
This architecture utilizes a dedicated CPU for AI tasks in the SoC.
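The multi-cluster figures on this slide follow from the same per-cluster formula quoted earlier in the deck; a quick arithmetic check (the function name is mine):

```python
def peak_byte_ops(bytes_per_pe, pes, clusters, freq_hz):
    """Peak byte-operations per second: (2 bytes x 8 PE) x clusters x frequency."""
    return bytes_per_pe * pes * clusters * freq_hz

gops_4x4 = peak_byte_ops(2, 8, 4 * 4, 1e9)  # 256e9 -> 256 giga byte-ops
tops_8x8 = peak_byte_ops(2, 8, 8 * 8, 1e9)  # 1.024e12 -> ~1 TOPS
at_4ghz = peak_byte_ops(2, 8, 8 * 8, 4e9)   # 4.096e12 -> ~4 TOPS (14LPP upper bound)
```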
Editor's Notes
160 zettabytes by 2025 from AI devices
2.6 billion connected AI devices
CPU and memory contain the known agents, model representations, and algorithms for ML. Needs DDR (128b) for large data sets and complex models and algos.
PCIe-4 switch and interface to rack or cloud