AI-Optimized Chipsets
May 2018
Part II: Computing Hardware
Computing Hardware
Previously in Part I, we reviewed the ADAC loop and key factors driving innovation for AI-optimized
chipsets.
In this instalment, we explore how AI-led computing demands are powering these trends:
• Deep learning is expected to drive training for neural networks requiring massive datasets for AI
algorithm development
• This in turn leads to a shift in the performance focus of computing from general application to neural
nets, increasing demand for high performance computing
• Deep learning is both computationally and memory intensive, necessitating enhancements in
processor performance
• Hence the rise of startups adopting alternative, innovative approaches, which is expected to
pave the way for different types of AI-optimized chipsets
Source: Nvidia | Graphcore
Deep learning is expected to drive training for neural
networks
[Diagram] Training: an untrained neural network model learns from massive labeled data (e.g. “dog”, “cat”). Inference: the trained model, optimized for performance, applies what it learned to label new data (e.g. “cat”).
• Training refers to neural network learning with significant
data
• AI algorithms are developed via training
• Consumes significant computing power
• Training loads can be divided into many concurrent tasks. This suits the GPU’s floating-point
precision and huge core counts
• Training can also be conducted using FPGAs
• Requires calculations with relatively high precision, often
using 32-bit floating-point operations
• Inference refers to the neural network interpreting new data
to generate accurate results
• Typically conducted at the application or client end-point (i.e.
edge), rather than on the server or cloud
• Requires fewer hardware resources and, depending on the application, can be performed
using CPUs, FPGAs, ASICs, digital signal processors (DSPs), etc.
• Inference is expected to shift locally to mobile devices
• Precision can be sacrificed in favor of greater speed or less
power consumption
“The workloads are changing dramatically for computing, as a result of machine learning… and
whenever workloads have changed in computing, it has always created an opportunity for new
kinds of computing.”
Andrew Feldman
CEO | Cerebras
Source: Intel, NVIDIA, ImageNet, Ark Invest Management LLC
Deep Learning Growth Drivers
With massive datasets required for AI algorithm development
and inference
Source: Deep Learning: An Artificial Intelligence Revolution by ARK Investment | Learning both Weights and Connections for Efficient Neural Networks by Song Han et al. | Icon made
by Those Icons from www.flaticon.com
Shifting the performance focus of computing from general
application to neural nets
Source: Deep Learning: An Artificial Intelligence Revolution by ARK Investment | Learning both Weights and Connections for Efficient Neural Networks by Song Han et al. |
Convolutional Neural Network by Mathworks
Deep learning chipsets are designed to optimize performance, power and memory.
• Algorithms tend to be highly parallel
• Requires data splitting between different processing units
• Connecting the pipeline in the most efficient manner is key
• Significant transfer of data back and forth between memory
• For instance, convolutional neural networks require convolution operations to be repeated throughout the pipeline, and the
number of operations can be extremely large
Example of a neural network with many convolutional layers
Deep learning is both computationally and memory intensive
• A neural network takes input data, multiplies it by a weight matrix and applies an activation function
• Multiplying matrices is often the most computationally intensive part of running a trained model
Driving enhancements in processor performance via matrix
multiplication….
A neural network layer takes input neurons X1, X2, X3 and produces outputs Y1, Y2:
Y1 = f (W11X1 + W12X2 + W13X3)
Y2 = f (W21X1 + W22X2 + W23X3)
This sequence of multiplications and additions can be written as a matrix multiplication, and the
outputs of that matrix multiplication are then processed further by an activation function
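This layer computation can be sketched in a few lines of NumPy (an illustrative sketch: the input values, weights, and the choice of ReLU for the activation f are arbitrary assumptions, not from the source):

```python
import numpy as np

# Inputs X1..X3 and a 2x3 weight matrix, matching the two-output example above
x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])

def f(z):
    # One common choice of activation function (ReLU); the slide leaves f generic
    return np.maximum(z, 0.0)

# Y1 = f(W11*X1 + W12*X2 + W13*X3), Y2 = f(W21*X1 + W22*X2 + W23*X3)
y = f(W @ x)
```

The single `W @ x` call is exactly the matrix multiplication the slide describes; accelerators are built to make this one operation as fast as possible.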
Source: An in-depth look at Google’s first Tensor Processing Unit (TPU) by Kaz Sato
Quantization in neural networks
Quantization is a process of converting a range of input values into a smaller set of output values that closely
approximates the original data
• Reduces the cost of neural network predictions and memory usage
▪ Especially for mobile and embedded deployments
• Neural network predictions may not require the precision of 16-bit or 32-bit floating point calculations
▪ For example, if it is raining - knowing whether it is light or heavy will suffice, there is no need to know how many
droplets of water are falling per second
• 8-bit integers can still be used to calculate a neural network prediction while maintaining the appropriate level of
accuracy
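As a minimal illustration (a generic linear quantization sketch, not TensorFlow’s actual scheme), float32 weights can be mapped onto 8-bit integers and recovered with only a small error:

```python
import numpy as np

def quantize_int8(w):
    """Map float32 values onto signed int8 levels via a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-0.9, 0.05, 0.42, 0.88], dtype=np.float32)
q, scale = quantize_int8(w)
w_approx = dequantize(q, scale)
# Each recovered weight is within half a quantization step of the original
```

The int8 values take a quarter of the memory of float32 and can be processed with cheap integer arithmetic, which is the cost saving the slide refers to.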
Source: An in-depth look at Google’s first Tensor Processing Unit (TPU) by Kaz Sato
Quantization in TensorFlow
And graph processing….
Scalar Processing
• Processes an operation per instruction
• CPUs run at clock speeds in the GHz range
• Might take a long time to execute large matrix operations
via a sequence of scalar operations
Scalar: a[i] + b[i] = c[i], for i = 1 to n (each element computed by a separate instruction)
Vector: a[1:n] + b[1:n] = c[1:n] (the entire array computed in a single operation)
Source: Spark 2.x - 2nd generation Tungsten Engine
Vector Processing
• Same operation performed concurrently across a large
number of data elements at the same time
• GPUs are effectively vector processors
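The contrast can be sketched with NumPy, whose whole-array operations mirror what vector hardware does (an illustrative sketch with arbitrary example data):

```python
import numpy as np

a = np.arange(8, dtype=np.float64)
b = np.ones(8)

# Scalar style: one add per loop iteration, as in c[i] = a[i] + b[i]
c_scalar = np.empty(8)
for i in range(8):
    c_scalar[i] = a[i] + b[i]

# Vector style: the whole array in one operation, as in c[1:n] = a[1:n] + b[1:n]
c_vector = a + b
```

Both produce identical results; the vector form simply dispatches one operation over all elements at once, which is why GPUs and SIMD units execute it so much faster at scale.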
Graph Processing
• Runs many computational processes (vertices)
• Calculates the effects these vertices have on other vertices with
which they interact via lines (i.e. edges)
• Overall processing works on many vertices and points
simultaneously
• Low precision needed
Source: Cerebras Founder Feldman Contemplates the A.I. Chip Age by Barron’s | Suffering Ceepie-Geepies! Do We Need a New Processor Architecture? by The Register | Startup
Unveils Graph Processor at Hot Chips by EETimes
• The key to a “graph” machine is
software that captures the “intent”
of the graph problems it needs to
solve
▪ Processing in parallel instead of
sequential
• ThinCI’s Graph Streaming Processor
(GSP) is designed to understand the
complex data dependencies and
flow
• GSPs manage this entirely on the
chip with:
▪ Minimal software intervention
▪ Extremely low memory bandwidth
needs
• Reduces or eliminates inter-
processor communications and
synchronizations
• A microprocessor wastes a lot of effort multiplying by zero
when working with a sparse matrix
▪ A sparse matrix is a matrix in which many elements
are zero
• A new chip is needed to:
▪ Handle sparse matrix math
▪ Emphasize communications between
inputs and outputs of calculations
• Machine learning methods (e.g.
convolutional neural networks) involve:
▪ Recursion
▪ Feedback
▪ Computations in one instance feed into
computations elsewhere in the process
• Cerebras’ solution: Simple on compute,
on arithmetic and very intense on
communications
Creating new approaches that focus on graph processing and
sparse matrix math, emphasizing communications between
inputs and outputs of calculations
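The waste described above can be made concrete with a small sketch (illustrative numbers; the 90% sparsity level and matrix size are assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 100))
W[rng.random((100, 100)) < 0.9] = 0.0  # ~90% of the weights set to zero

# A dense matrix-vector multiply performs one multiplication per entry,
# but only the nonzero entries contribute anything to the result
dense_mults = W.size
useful_mults = int(np.count_nonzero(W))
wasted_fraction = 1 - useful_mults / dense_mults  # roughly 0.9 here
```

A chip that schedules only the nonzero work, as these startups propose, avoids that wasted fraction entirely, at the cost of more complex communication between compute elements.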
• Graphcore’s Intelligence Processing
Unit (IPU) has a structure which
provides:
▪ Efficient massive compute
parallelism
▪ Huge memory bandwidth
• Both factors essential for delivering a
significant step-up in graph
processing power needed for machine
intelligence
• The graph is a highly-parallel
execution plan for the IPU
• Expected to increase the speed of
machine learning workloads
significantly:
▪ General: by 5x
▪ Specific: by 50 - 100x (e.g.
autonomous vehicle workloads)
Source: Horizon Robotics | Hailo | Gyrfalcon Technology
As well as AI processing in memory architectures and
massively parallel compute capabilities
• Deep learning processor for edge
devices offering datacenter class
performance in an embedded
device
• Dataflow approach, based on the
structure of Neural Networks (NNs)
• Distributed memory fabric,
combined with purpose-made
pipeline elements, allowing very low
power memory access (without
using batch processing)
• Novel control scheme based on
combination of hardware and
software, reaching very low
Joules/operation metrics with a high
degree of flexibility
• Extremely efficient computational
elements, which can be variably
applied according to need
• Low overhead interconnect, allowing
Near Memory Processing (NMP)
and balancing changing
requirements of memory, compute
and control along the NN
• Gyrfalcon’s Intelligent Matrix
Processor, Lightspeeur®
2801S, delivers an APiM (AI
Processing in Memory)
architecture which features
massively parallel compute
capabilities
• Its APiM architecture uses
memory as the AI processing
unit, eliminating the huge
data movement that results in
high power consumption
• The architecture features true,
on-chip parallelism, in situ
computing, and eliminates
memory bottlenecks. It has
roughly 28,000 parallel
computing cores and does
not require external memory
for AI inference
• It works with various open
frameworks such as TensorFlow
and Caffe to complete
deep learning training and
inference tasks
• The Brain Processing Unit (BPU) by Horizon Robotics is a
heterogeneous Multiple Instruction, Multiple Data
(MIMD) computation system
• Heterogeneity here means the BPU uses multiple kinds of
Processing Units (PUs) designed specifically
for neural network inference. It gains performance and
energy efficiency by adding dissimilar PUs, incorporating
specialized processing capabilities to handle particular
tasks
• MIMD is a technique employed to achieve
parallelism, with a number of PUs that function
asynchronously and independently
• At any one time, different PUs may be executing
different instructions on different pieces of data
• The first generation BPU employs a Gaussian
architecture - allowing each vision task to be divided
into 2 stages (i.e. attention and cognition) for optimal
allocation of computations. This offers a parallel and
fast filter of task-irrelevant information, on-demand
cognition and edge learning to adjust models after
deployment
• This design enables the BPU to achieve a performance
of up to 1 TOPS at a low power of 1.5W. It can
process 1080p video input at 30 frames per
second, as well as detect and recognize up to 200
objects per frame
The choice of chipset depends on use - for training, inference,
in the cloud, at the edge or a hybrid of both
Cloud
• Some cloud providers have
been creating their own chips
• Using alternative architectures
to GPUs (e.g. FPGAs and ASICs)
• Cloud-based systems can
handle neural network
training and inference
Edge
• Edge devices, from phones to drones, are expected to
focus mainly on inference, due to
energy efficiency and low-latency
computation considerations
• Inference will be moved to edge devices
for most applications (AR expected to be
a key driver)
• New entrants will have the best chance of
success in the end-device market given its
nascence
• Chips for end-devices have power
requirements as low as 1 watt
• The device market is too large and diverse
for a single chip design to address, and
customers will ultimately want custom
designs
With industry players adopting different approaches
Cloud Edge
• Google TPUs are ASICs
• The high non-recurring costs
associated with designing the
ASIC can be absorbed due to
Google’s large scale
• Using TPUs across multiple
operations helps save costs,
ranging from Street View
to search queries
• TPUs are more power-efficient
than GPUs
• Rolling out FPGAs in its own datacenter revamp
▪ Similar to ASICs
▪ But reprogrammable so that their algorithms
can be updated
• Smartphone System-on-Chips (SoCs) are
likely to incorporate ASIC logic blocks
• Creates opportunities for new IP licensing
companies (e.g. Cambricon has licensed
its ASIC design to Huawei for its Kirin 970
SoC)
• Specialized chips for mobile devices - an
increasing trend with:
▪ Dedicated AI chips appearing in
Apple’s iPhone X, Huawei’s Mate 10,
and Google’s Pixel 2
▪ ARM has reconfigured its chip design
to optimize AI
▪ Qualcomm launched its own mobile AI
chips
Huawei Mate 10’s Kirin 970
Source: Google | Microsoft | Huawei
Source: Artificial Intelligence: 10 Trends to Watch in 2017 and Beyond by Tractica | Expect Deeper and Cheaper Machine Learning by IEEE Spectrum | MIT Technology in Review | Google
Rattles the Tech World with a New AI Chip for All by Wired | Back to the Edge: AI Will Force Distributed Intelligence Everywhere by Azeem | When Moore’s Law Met AI – Artificial
Intelligence and the Future of Computing by Azeem
Latency and contextualization of locales are key drivers of
edge computing
Key Drivers of Edge Computing
• Learning typically happens in the cloud
• Devices do not do any learning from their environment or experience
• Besides inference, it will also be essential to push training to the edge
Latency
• For many applications, the delay of a round trip to the cloud will be unacceptable (e.g. the high latency risk of sending signal
data to the cloud for self-driving prediction, even with 5G networks)
Context
• Devices will soon need to be powerful enough to learn at the edge of the network
• Devices will be used in situ and those locales will be increasingly contextualized
• The environment where the device is placed will be a key input to its operation, allowing the network to learn from the
experience of edge devices and their environment
Going forward, we are likely to see Federated Learning -
a multi-faceted infrastructure where learning happens on the edge
of the network and in the cloud
Federated Learning
• Allows for smarter models, lower latency and power consumption, while enabling differential privacy and personalized
experiences
• Allows the network to learn from the experience of many edge devices in their environments
• In a federated environment, instead of sending their raw experiential data back to the cloud for analysis, edge devices
could do some learning and efficiently send back deltas (or weight updates) to the cloud, where a central model could be
updated more efficiently
• Differential privacy also ensures that the aggregate data in a database captures significant patterns, while protecting
individual privacy
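A minimal sketch of the delta-averaging idea (hypothetical least-squares "devices"; the function names, data, and learning rate are illustrative assumptions, not from any federated-learning library):

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1):
    """One edge device: take a gradient step on its local data, return only the delta."""
    grad = X.T @ (X @ global_w - y) / len(y)
    return -lr * grad  # the delta sent to the cloud; raw data never leaves the device

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
devices = []
for _ in range(5):
    X = rng.standard_normal((20, 3))
    devices.append((X, X @ true_w))  # each device holds its own private data

w = np.zeros(3)
for _ in range(200):
    deltas = [local_update(w, X, y) for X, y in devices]  # learning at the edge
    w = w + np.mean(deltas, axis=0)  # the cloud averages deltas into the central model
```

Only the small delta vectors cross the network, yet the central model still converges on what the pooled data would have taught it, which is the efficiency and privacy gain described above.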
• Google designed its original TPU for execution. Its new cloud TPU offers a chip that handles
training as well
• Amazon and Microsoft offer GPU processing via cloud services, but they do not offer
bespoke AI chips for both training and executing neural networks
• Bitmain claims to have built 70% of all the computers on the Bitcoin network. It makes
specialized chips to perform the critical hash functions involved in mining and trading
bitcoins, and packages those chips into the top mining rig - the Antminer S9
• In 2017, Bitmain unveiled details of its new AI chip, the Sophon BM1680, specialized for both
training and executing deep learning algorithms
Google’s cloud TPU
Bitmain’s Sophon
BM1680
• AI applications increasingly demand higher performance
and lower power, but deep learning technology has
primarily been a software play
• The need for hardware acceleration was only recognized
recently. Top global semiconductor companies and a
number of startups have ventured to develop specialized
chipsets to address these demands
• Current chipset market is dominated by GPUs and CPUs
• An expanded role for other chipset types, including
ASICs and FPGAs, is expected in the future
• According to Tractica, deep learning chipset shipments
are expected to increase from 863K units in 2016 to
41.2M units by 2025, with revenue growing from USD
513M to USD 12.2B at a CAGR of 42.2%
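The cited growth rate checks out arithmetically over the nine compounding years from 2016 to 2025:

```python
# Revenue grows from USD 513M (2016) to USD 12.2B (2025): 9 compounding years
cagr = (12.2e9 / 513e6) ** (1 / 9) - 1
# cagr is approximately 0.422, i.e. the 42.2% stated above
```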
Source: Tractica | Graphcore
This is expected to drive increasing production of ASICs,
FPGAs, and other emerging chipsets
Deep Learning Chipset Unit Shipment by Type
(Global, 2016-2025)
“We believe that intelligence is the future of
computing, and graph processing is the future of
computers.”
Nigel Toon
CEO | Graphcore
863K units
41.2M units
Source: Tractica
In Part I, we noted that deep learning technology has primarily been a software play to date. The rise of
new applications (e.g. autonomous driving) is expected to create substantial demand for computing.
Existing processors were not originally designed for these new applications, hence the need to develop
AI-optimized chipsets.
Currently, most of the computing happens in the cloud. As AI applications become more ubiquitous, we
expect a shift in inference and training closer to where it is needed, resulting in a relative increase in
intelligence at the network edge.
This is the end of Part II of a 4-part series of Vertex Perspectives that seeks to understand key factors
driving innovation for AI-optimized chipsets, their industry landscape and development trajectory.
In Part III, we assess the dominance of tech giants in the cloud, coupled with disruptive startups adopting
cloud-first or edge-first approaches to AI-optimized chips. Most industry players are expected to focus
on the cloud, with ASIC startups featuring prominently in the cloud and at the edge. Importantly, we look
at what these opportunities mean for investors and entrepreneurs.
Finally in Part IV, we look at other emerging technologies including neuromorphic chips and quantum
computing systems, to explore their promise as alternative AI-optimized chipsets.
Do let us know if you would like to subscribe to future Vertex Perspectives.
Key Takeaways
Disclaimer
This presentation has been compiled for informational purposes only. It does not constitute a recommendation to any party. The presentation relies on data and
insights from a wide range of sources including public and private companies, market research firms, government agencies and industry professionals. We cite
specific sources where information is public. The presentation is also informed by non-public information and insights.
Information provided by third parties may not have been independently verified. Vertex Holdings believes such information to be reliable and adequately
comprehensive but does not represent that such information is in all respects accurate or complete. Vertex Holdings shall not be held liable for any information
provided.
Any information or opinions provided in this report are as of the date of the report and Vertex Holdings is under no obligation to update the information or
communicate that any updates have been made.
About Vertex Ventures
Vertex Ventures is a global network of operator-investors who manage portfolios in the U.S., China, Israel, India and
Southeast Asia.
Vertex teams combine firsthand experience in transformational technologies; on-the-ground knowledge in the world’s major
innovation centers; and global context, connections and customers.
About the Authors
Emanuel TIMOR
General Partner
Vertex Ventures Israel
emanuel@vertexventures.com
XIA Zhi Jin
Partner
Vertex Ventures China
xiazj@vertexventures.com
Brian TOH
Director
Vertex Holdings
btoh@vertexholdings.com
Tracy JIN
Director
Vertex Holdings
tjin@vertexholdings.com

Third-Generation Semiconductor: The Next Wave?
Third-Generation Semiconductor: The Next Wave?Third-Generation Semiconductor: The Next Wave?
Third-Generation Semiconductor: The Next Wave?
Vertex Holdings
 
E-mobility E-mobility | Part 5 - The future of EVs and AVs (English)
E-mobility  E-mobility | Part 5 - The future of EVs and AVs (English)E-mobility  E-mobility | Part 5 - The future of EVs and AVs (English)
E-mobility E-mobility | Part 5 - The future of EVs and AVs (English)
Vertex Holdings
 
E-mobility | Part 5 - The future of EVs and AVs (Japanese)
E-mobility | Part 5 - The future of EVs and AVs (Japanese)E-mobility | Part 5 - The future of EVs and AVs (Japanese)
E-mobility | Part 5 - The future of EVs and AVs (Japanese)
Vertex Holdings
 
E-mobility | Part 5 - The future of EVs and AVs (German)
E-mobility | Part 5 - The future of EVs and AVs (German)E-mobility | Part 5 - The future of EVs and AVs (German)
E-mobility | Part 5 - The future of EVs and AVs (German)
Vertex Holdings
 
E-mobility | Part 5 - The future of EVs and AVs (Chinese)
E-mobility | Part 5 - The future of EVs and AVs (Chinese)E-mobility | Part 5 - The future of EVs and AVs (Chinese)
E-mobility | Part 5 - The future of EVs and AVs (Chinese)
Vertex Holdings
 
E-mobility | Part 5 - The future of EVs and AVs (Korean)
E-mobility | Part 5 - The future of EVs and AVs (Korean)E-mobility | Part 5 - The future of EVs and AVs (Korean)
E-mobility | Part 5 - The future of EVs and AVs (Korean)
Vertex Holdings
 
E-mobility | Part 3 - Battery recycling & power electronics (German)
E-mobility | Part 3 - Battery recycling & power electronics (German)E-mobility | Part 3 - Battery recycling & power electronics (German)
E-mobility | Part 3 - Battery recycling & power electronics (German)
Vertex Holdings
 
E-mobility | Part 3 - Battery recycling & power electronics (Japanese)
E-mobility | Part 3 - Battery recycling & power electronics (Japanese)E-mobility | Part 3 - Battery recycling & power electronics (Japanese)
E-mobility | Part 3 - Battery recycling & power electronics (Japanese)
Vertex Holdings
 
E-mobility | Part 4 - EV charging and the next frontier (Korean)
E-mobility | Part 4 - EV charging and the next frontier (Korean)E-mobility | Part 4 - EV charging and the next frontier (Korean)
E-mobility | Part 4 - EV charging and the next frontier (Korean)
Vertex Holdings
 
E-mobility | Part 4 - EV charging and the next frontier (Japanese)
E-mobility | Part 4 - EV charging and the next frontier (Japanese)E-mobility | Part 4 - EV charging and the next frontier (Japanese)
E-mobility | Part 4 - EV charging and the next frontier (Japanese)
Vertex Holdings
 
E-mobility | Part 4 - EV charging and the next frontier (Chinese)
E-mobility | Part 4 - EV charging and the next frontier (Chinese)E-mobility | Part 4 - EV charging and the next frontier (Chinese)
E-mobility | Part 4 - EV charging and the next frontier (Chinese)
Vertex Holdings
 
E-mobility | Part 4 - EV charging and the next frontier (German)
E-mobility | Part 4 - EV charging and the next frontier (German)E-mobility | Part 4 - EV charging and the next frontier (German)
E-mobility | Part 4 - EV charging and the next frontier (German)
Vertex Holdings
 
E-mobility | Part 4 - EV charging and the next frontier (English)
E-mobility | Part 4 - EV charging and the next frontier (English)E-mobility | Part 4 - EV charging and the next frontier (English)
E-mobility | Part 4 - EV charging and the next frontier (English)
Vertex Holdings
 
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Korean)
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Korean)E-mobility | Part 3 - Battery Technology & Alternative Innovations (Korean)
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Korean)
Vertex Holdings
 
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Chinese)
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Chinese)E-mobility | Part 3 - Battery Technology & Alternative Innovations (Chinese)
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Chinese)
Vertex Holdings
 
E-mobility | Part 3 - Battery recycling & power electronics (English)
E-mobility | Part 3 - Battery recycling & power electronics (English)E-mobility | Part 3 - Battery recycling & power electronics (English)
E-mobility | Part 3 - Battery recycling & power electronics (English)
Vertex Holdings
 
E-mobility | Part 2 - Battery Technology & Alternative Innovations (German)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (German)E-mobility | Part 2 - Battery Technology & Alternative Innovations (German)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (German)
Vertex Holdings
 
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Chinese)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Chinese)E-mobility | Part 2 - Battery Technology & Alternative Innovations (Chinese)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Chinese)
Vertex Holdings
 
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Japanese)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Japanese)E-mobility | Part 2 - Battery Technology & Alternative Innovations (Japanese)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Japanese)
Vertex Holdings
 
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Korean)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Korean)E-mobility | Part 2 - Battery Technology & Alternative Innovations (Korean)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Korean)
Vertex Holdings
 

More from Vertex Holdings (20)

Third-Generation Semiconductor: The Next Wave?
Third-Generation Semiconductor: The Next Wave?Third-Generation Semiconductor: The Next Wave?
Third-Generation Semiconductor: The Next Wave?
 
E-mobility E-mobility | Part 5 - The future of EVs and AVs (English)
E-mobility  E-mobility | Part 5 - The future of EVs and AVs (English)E-mobility  E-mobility | Part 5 - The future of EVs and AVs (English)
E-mobility E-mobility | Part 5 - The future of EVs and AVs (English)
 
E-mobility | Part 5 - The future of EVs and AVs (Japanese)
E-mobility | Part 5 - The future of EVs and AVs (Japanese)E-mobility | Part 5 - The future of EVs and AVs (Japanese)
E-mobility | Part 5 - The future of EVs and AVs (Japanese)
 
E-mobility | Part 5 - The future of EVs and AVs (German)
E-mobility | Part 5 - The future of EVs and AVs (German)E-mobility | Part 5 - The future of EVs and AVs (German)
E-mobility | Part 5 - The future of EVs and AVs (German)
 
E-mobility | Part 5 - The future of EVs and AVs (Chinese)
E-mobility | Part 5 - The future of EVs and AVs (Chinese)E-mobility | Part 5 - The future of EVs and AVs (Chinese)
E-mobility | Part 5 - The future of EVs and AVs (Chinese)
 
E-mobility | Part 5 - The future of EVs and AVs (Korean)
E-mobility | Part 5 - The future of EVs and AVs (Korean)E-mobility | Part 5 - The future of EVs and AVs (Korean)
E-mobility | Part 5 - The future of EVs and AVs (Korean)
 
E-mobility | Part 3 - Battery recycling & power electronics (German)
E-mobility | Part 3 - Battery recycling & power electronics (German)E-mobility | Part 3 - Battery recycling & power electronics (German)
E-mobility | Part 3 - Battery recycling & power electronics (German)
 
E-mobility | Part 3 - Battery recycling & power electronics (Japanese)
E-mobility | Part 3 - Battery recycling & power electronics (Japanese)E-mobility | Part 3 - Battery recycling & power electronics (Japanese)
E-mobility | Part 3 - Battery recycling & power electronics (Japanese)
 
E-mobility | Part 4 - EV charging and the next frontier (Korean)
E-mobility | Part 4 - EV charging and the next frontier (Korean)E-mobility | Part 4 - EV charging and the next frontier (Korean)
E-mobility | Part 4 - EV charging and the next frontier (Korean)
 
E-mobility | Part 4 - EV charging and the next frontier (Japanese)
E-mobility | Part 4 - EV charging and the next frontier (Japanese)E-mobility | Part 4 - EV charging and the next frontier (Japanese)
E-mobility | Part 4 - EV charging and the next frontier (Japanese)
 
E-mobility | Part 4 - EV charging and the next frontier (Chinese)
E-mobility | Part 4 - EV charging and the next frontier (Chinese)E-mobility | Part 4 - EV charging and the next frontier (Chinese)
E-mobility | Part 4 - EV charging and the next frontier (Chinese)
 
E-mobility | Part 4 - EV charging and the next frontier (German)
E-mobility | Part 4 - EV charging and the next frontier (German)E-mobility | Part 4 - EV charging and the next frontier (German)
E-mobility | Part 4 - EV charging and the next frontier (German)
 
E-mobility | Part 4 - EV charging and the next frontier (English)
E-mobility | Part 4 - EV charging and the next frontier (English)E-mobility | Part 4 - EV charging and the next frontier (English)
E-mobility | Part 4 - EV charging and the next frontier (English)
 
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Korean)
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Korean)E-mobility | Part 3 - Battery Technology & Alternative Innovations (Korean)
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Korean)
 
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Chinese)
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Chinese)E-mobility | Part 3 - Battery Technology & Alternative Innovations (Chinese)
E-mobility | Part 3 - Battery Technology & Alternative Innovations (Chinese)
 
E-mobility | Part 3 - Battery recycling & power electronics (English)
E-mobility | Part 3 - Battery recycling & power electronics (English)E-mobility | Part 3 - Battery recycling & power electronics (English)
E-mobility | Part 3 - Battery recycling & power electronics (English)
 
E-mobility | Part 2 - Battery Technology & Alternative Innovations (German)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (German)E-mobility | Part 2 - Battery Technology & Alternative Innovations (German)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (German)
 
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Chinese)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Chinese)E-mobility | Part 2 - Battery Technology & Alternative Innovations (Chinese)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Chinese)
 
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Japanese)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Japanese)E-mobility | Part 2 - Battery Technology & Alternative Innovations (Japanese)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Japanese)
 
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Korean)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Korean)E-mobility | Part 2 - Battery Technology & Alternative Innovations (Korean)
E-mobility | Part 2 - Battery Technology & Alternative Innovations (Korean)
 

Recently uploaded

Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Things to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUUThings to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUU
FODUU
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 

Recently uploaded (20)

Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Things to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUUThings to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUU
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 

Vertex Perspectives | AI Optimized Chipsets | Part II

  • 1. AI-Optimized Chipsets May 2018 Part II: Computing Hardware
  • 2. Computing Hardware Previously in Part I, we reviewed the ADAC loop and key factors driving innovation for AI-optimized chipsets. In this instalment, we explore how AI-led computing demands are powering these trends: • Deep learning is expected to drive training for neural networks requiring massive datasets for AI algorithm development • This in turn leads to a shift in the performance focus of computing from general application to neural nets, increasing demand for high performance computing • Deep learning is both computationally and memory intensive, necessitating enhancements in processor performance • Hence, the rise of startups adopting alternative, innovative approaches and how this is expected to pave the way for different types of AI-optimized chipsets
  • 3. Source: Nvidia | Graphcore Deep learning is expected to drive training for neural networks Training Inference “Dog” “Cat” Untrained Neural Network Model “Cat” Trained Model Optimized for Performance • Training refers to neural network learning with significant data • AI algorithms are developed via training • Consumes significant computing power • Training loads can be divided into many concurrent tasks. This is ideal for the GPU’s double floating point precision and huge core counts • Training can also be conducted using FPGAs • Requires calculations with relatively high precision, often using 32-bit floating-point operations • Inference refers to the neural network interpreting new data to generate accurate results • Typically conducted at the application or client end-point (i.e. edge), rather than on the server or cloud • Requires fewer hardware resources and depending on the application, can be performed using CPUs • This could be FPGAs, ASICs, Digital Signal Processors (DSPs) etc • Inference is expected to shift locally to mobile devices • Precision can be sacrificed in favor of greater speed or less power consumption
  • 4. “The workloads are changing dramatically for computing as a result of machine learning…and whenever workloads have changed in computing, it has always created an opportunity for new kinds of computing.” Andrew Feldman CEO | Cerebras Source: Intel, NVIDIA, ImageNet, Ark Invest Management LLC Deep Learning Growth Drivers With massive datasets required for AI algorithm development and inference
  • 5. Source: Deep Learning: An Artificial Intelligence Revolution by ARK Investment | Learning both Weights and Connections for Efficient Neural Networks by Song Han et al. | Icon made by Those Icons from www.flaticon.com Shifting the performance focus of computing from general application to neural nets
  • 6. Source: Deep Learning: An Artificial Intelligence Revolution by ARK Investment | Learning both Weights and Connections for Efficient Neural Networks by Song Han et al. | Convolutional Neural Network by Mathworks Deep learning chipsets are designed to optimize performance, power and memory. • Algorithms tend to be highly parallel • Requires data splitting between different processing units • Connecting the pipeline in the most efficient manner is key • Significant transfer of data to and from memory • For instance, convolutional neural networks require convolution operations to be repeated throughout the pipeline, and the number of operations can be extremely large Example of a neural network with many convolutional layers Deep learning is both computationally and memory intensive
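To make the operation count above concrete, here is a minimal plain-Python sketch of a single 2-D convolution: the same multiply-accumulate over the filter is repeated at every output position, and a real network repeats this across many filters and layers. The image and kernel values are illustrative only.

```python
# Valid-mode 2-D convolution (no padding), written out explicitly so the
# repeated multiply-accumulate work is visible. (As in deep learning
# frameworks, the kernel is not flipped, i.e. this is cross-correlation.)
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1        # output height
    ow = len(image[0]) - kw + 1     # output width
    out = []
    for i in range(oh):
        row = []
        for j in range(ow):
            # one multiply-accumulate per kernel element, per output position
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]    # a simple 2x2 difference filter
result = conv2d(image, kernel)   # 2x2 output, 4 MACs per output element
```

Even this toy layer does kh*kw multiply-adds per output pixel; at realistic image sizes and filter counts the totals run into the billions, which is why the memory traffic and compute described above dominate chip design.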
  • 7. • A neural network takes input data, multiplies it by a weight matrix and applies an activation function • Multiplying matrices is often the most computationally intensive part of running a trained model Driving enhancements in processor performance via matrix multiplication… The outputs of this matrix multiplication are then processed further by an activation function This sequence of multiplications and additions can be written as a matrix multiplication: input neurons X1, X2, X3 feed outputs Y1 = f (W11X1 + W12X2 + W13X3) and Y2 = f (W21X1 + W22X2 + W23X3) Source: An in-depth look at Google’s first Tensor Processing Unit (TPU) by Kaz Sato
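The slide's Y1/Y2 example can be sketched directly in plain Python. The weight and input values below are hypothetical; the point is that both outputs come from one matrix-vector multiplication followed by an element-wise activation.

```python
def matmul(W, x):
    """Multiply weight matrix W (one row per output neuron) by input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """A common activation function f: max(0, value), applied element-wise."""
    return [max(0.0, z) for z in v]

W = [[0.5, -1.0, 0.25],   # W11, W12, W13 -> feed output neuron Y1
     [1.0,  0.5, -0.5]]   # W21, W22, W23 -> feed output neuron Y2
x = [2.0, 1.0, 4.0]       # input neurons X1, X2, X3

y = relu(matmul(W, x))    # the matmul is the computationally intensive part
```

Running a trained model is essentially a chain of such multiply-then-activate steps, which is why accelerators are judged largely on how fast they do the multiplication.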
  • 8. Quantization in neural networks Quantization is a process of converting a range of input values into a smaller set of output values that closely approximates the original data • Reduces the cost of neural network predictions and memory usage ▪ Especially for mobile and embedded deployments • Neural network predictions may not require the precision of 16-bit or 32-bit floating point calculations ▪ For example, if it is raining, knowing whether the rain is light or heavy will suffice; there is no need to know how many droplets of water are falling per second • 8-bit integers can still be used to calculate a neural network prediction while maintaining the appropriate level of accuracy Source: An in-depth look at Google’s first Tensor Processing Unit (TPU) by Kaz Sato Quantization in TensorFlow
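A minimal sketch of the idea, assuming the simplest linear scheme: floats are mapped onto 8-bit integers by a scale factor and mapped back approximately. Real quantization schemes (e.g. in TensorFlow) add refinements such as zero-points and per-channel scales; the values here are illustrative.

```python
def quantize(values, scale):
    """Round float values onto the int8 range (-128..127) using a fixed scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(qvalues, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in qvalues]

weights = [0.91, -0.42, 0.07, -1.20]
scale = 1.20 / 127           # map the largest magnitude onto the int8 range
q = quantize(weights, scale) # 1 byte per value instead of 4
approx = dequantize(q, scale)
```

The recovered values track the originals to within about half a scale step, which is the "appropriate level of accuracy" trade-off the slide describes, at a quarter of the 32-bit storage and memory bandwidth.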
  • 9. And graph processing… Scalar Processing • Processes one operation per instruction • CPUs run at clock speeds in the GHz range • Might take a long time to execute large matrix operations via a sequence of scalar operations: a[i] + b[i] = c[i] for i = 1 to n Vector Processing • Performs the same operation concurrently across a large number of data elements: a[1:n] + b[1:n] = c[1:n] in a single operation • GPUs are effectively vector processors Graph Processing • Runs many computational processes (vertices) • Calculates the effects these vertices have on other points with which they interact via lines (i.e. edges) • Overall processing works on many vertices and points simultaneously • Low precision needed Source: Spark 2.x - 2nd generation Tungsten Engine
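The scalar-vs-vector contrast above can be sketched on the slide's own a[i] + b[i] = c[i] example. The scalar version issues one addition per loop iteration; a vector processor performs the whole element-wise add as one wide operation, emulated below with a single comprehension purely for illustration.

```python
def scalar_add(a, b):
    """Scalar model: one addition per instruction, n iterations for n elements."""
    c = []
    for i in range(len(a)):
        c.append(a[i] + b[i])   # a[i] + b[i] = c[i], for i = 1 to n
    return c

def vector_add(a, b):
    """Vector model: conceptually a[1:n] + b[1:n] = c[1:n] in one operation."""
    return [x + y for x, y in zip(a, b)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
assert scalar_add(a, b) == vector_add(a, b) == [11, 22, 33, 44]
```

Both produce the same result; the difference is how many instructions the hardware must issue, which is exactly where GPUs (wide vector units) beat GHz-range scalar CPUs on large matrix work.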
  • 10. Source: Cerebras Founder Feldman Contemplates the A.I. Chip Age by Barron’s | Suffering Ceepie-Geepies! Do We Need a New Processor Architecture? by The Register | Startup Unveils Graph Processor at Hot Chips by EETimes • The key to a “graph” machine is software that captures the “intent” of the graph problems it needs to solve ▪ Processing in parallel instead of sequentially • ThinCI’s Graph Streaming Processor (GSP) is designed to understand the complex data dependencies and flow • GSPs manage this entirely on the chip with: ▪ Minimal software intervention ▪ Extremely low memory bandwidth needs • Reduces or eliminates inter-processor communications and synchronizations • A microprocessor wastes a lot of effort with a sparse matrix, multiplying by zero ▪ A sparse matrix is a matrix in which many elements are zero • A new chip is needed to: ▪ Handle sparse matrix math ▪ Emphasize communications between inputs and outputs of calculations • Machine learning methods (e.g. convolutional neural networks) involve: ▪ Recursion ▪ Feedback ▪ Computations in one instance feed into computations elsewhere in the process • Cerebras’ solution: simple on compute and arithmetic, very intense on communications Creating new approaches that focus on graph processing and sparse matrix math, emphasizing communications between inputs and outputs of calculations • Graphcore’s Intelligence Processing Unit (IPU) has a structure which provides: ▪ Efficient massive compute parallelism ▪ Huge memory bandwidth • Both factors are essential for delivering a significant step-up in the graph processing power needed for machine intelligence • The graph is a highly-parallel execution plan for the IPU • Expected to increase the speed of machine learning workloads significantly: ▪ General: by 5x ▪ Specific: by 50 - 100x (e.g. autonomous vehicle workloads)
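The sparse-matrix point above can be made concrete with a minimal sketch: storing only the nonzero entries means the multiply-by-zero work a dense routine would do is never issued at all. The matrix values are illustrative, and this coordinate-dictionary format is just the simplest of several sparse layouts.

```python
def to_sparse(matrix):
    """Store a matrix as {(row, col): value} for nonzero entries only."""
    return {(i, j): v
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0}

def sparse_matvec(sparse, n_rows, x):
    """Multiply a sparse matrix by vector x, touching only nonzero entries."""
    y = [0] * n_rows
    for (i, j), v in sparse.items():   # zeros are never visited
        y[i] += v * x[j]
    return y

A = [[0, 0, 3],
     [0, 0, 0],
     [4, 0, 0]]       # mostly zero, as is typical for pruned networks
s = to_sparse(A)      # only 2 of 9 entries stored and multiplied
y = sparse_matvec(s, 3, [1, 2, 5])
```

Here 2 multiplies replace 9; pruned neural networks push sparsity far higher, which is why chips that handle sparse math natively can skip most of the arithmetic a dense design performs.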
  • 11. Source: Horizon Robotics | Hailo | Gyrfalcon Technology As well as AI processing in memory architectures and massively parallel compute capabilities • Deep learning processor for edge devices offering datacenter class performance in an embedded device • Dataflow approach, based on the structure of Neural Networks (NNs) • Distributed memory fabric, combined with purpose-made pipeline elements, allowing very low power memory access (without using batch processing) • Novel control scheme based on a combination of hardware and software, reaching very low Joules/operation metrics with a high degree of flexibility • Extremely efficient computational elements, which can be variably applied according to need • Low overhead interconnect, allowing Near Memory Processing (NMP) and balancing changing requirements of memory, compute and control along the NN • Gyrfalcon’s Intelligent Matrix Processor: Lightspeeur® 2801S delivers an APiM (AI Processing in Memory) architecture which features massively parallel compute capabilities • Its APiM architecture uses memory as the AI processing unit. This eliminates the huge data movement that results in high power consumption • The architecture features true, on-chip parallelism, in situ computing, and eliminates memory bottlenecks. It has roughly 28,000 parallel computing cores and does not require external memory for AI inference • It runs in various open frameworks like TensorFlow, Caffe and others to complete deep learning training and inference tasks • The Brain Processing Unit (BPU) by Horizon Robotics is a heterogeneous Multiple Instruction, Multiple Data (MIMD) computation system • Through heterogeneity, the BPU uses multiple kinds of Processing Units (PUs) that were designed specifically for neural network inference.
It gains performance or energy efficiency by adding dissimilar PUs, incorporating specialized processing capabilities to handle particular tasks • MIMD is a technique employed to achieve parallelism, with a number of PUs that function asynchronously and independently • At any one time, different PUs may be executing different instructions on different pieces of data • The first generation BPU employs a Gaussian architecture - allowing each vision task to be divided into 2 stages (i.e. attention and cognition) for optimal allocation of computations. This offers a parallel and fast filter of task-irrelevant information, on-demand cognition and edge learning to adjust models after deployment • This design enables the BPU to achieve a performance of up to 1TOPS at a low-power of 1.5W. It can process the 1080P video input at 30 frames per second, as well as detect and recognize up to 200 objects per frame
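A quick back-of-envelope check puts the quoted BPU figures in perspective. The numbers below are taken directly from the text (1 TOPS, 1.5W, 30 fps); everything else is simple arithmetic, not a vendor specification.

```python
# Back-of-envelope check of the quoted BPU figures:
# 1 TOPS at 1.5 W, processing 1080p video at 30 frames per second.

tops = 1.0       # tera-operations per second (from the text)
power_w = 1.5    # watts (from the text)
fps = 30         # frames per second (from the text)

ops_per_joule = tops * 1e12 / power_w   # efficiency: operations per unit energy
ops_per_frame = tops * 1e12 / fps       # compute budget available per video frame

print(f"{ops_per_joule:.2e} ops/J")     # ~6.7e11 operations per joule
print(f"{ops_per_frame:.2e} ops/frame") # ~3.3e10 operations per frame
```

At roughly 0.67 TOPS per watt, the edge-power constraint discussed later in this series (chips as low as 1 watt) is within reach of this class of design.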
  • 12. The choice of chipset depends on use - for training, inference, in the cloud, at the edge, or a hybrid of both
Cloud
• Some cloud providers have been creating their own chips
• Using alternative architectures to GPUs (e.g. FPGAs and ASICs)
• Cloud-based systems can handle both neural network training and inference
Edge
• Edge devices, from phones to drones, will focus mainly on inference, due to energy efficiency and low-latency computation considerations
• Inference will move to edge devices for most applications (AR is expected to be a key driver)
• New entrants will have the best chance of success in the end-device market given its nascence
• Chips for end-devices have power requirements as low as 1 watt
• The devices market is too large and diverse for a single chip design to address, and customers will ultimately want custom designs
  • 13. Source: Google | Microsoft | Huawei
With industry players adopting different approaches
Cloud
• Google's TPUs are ASICs
▪ The high non-recurring costs associated with designing an ASIC can be absorbed due to Google's large scale
▪ Using TPUs across multiple operations, ranging from Street View to search queries, helps save costs
▪ TPUs consume less power than GPUs
• Microsoft is rolling out FPGAs in its own datacenter revamp
▪ Similar to ASICs
▪ But reprogrammable, so that their algorithms can be updated
Edge
• Smartphone System-on-Chips (SoCs) are likely to incorporate ASIC logic blocks
• This creates opportunities for new IP licensing companies (e.g. Cambricon has licensed its ASIC design to Huawei for its Kirin 970 SoC)
• Specialized chips for mobile devices are an increasing trend:
▪ Dedicated AI chips appearing in Apple's iPhone X, Huawei's Mate 10, and Google's Pixel 2
▪ ARM has reconfigured its chip design to optimize for AI
▪ Qualcomm has launched its own mobile AI chips
Huawei Mate 10's Kirin 970
  • 14. Source: Artificial Intelligence: 10 Trends to Watch in 2017 and Beyond by Tractica | Expect Deeper and Cheaper Machine Learning by IEEE Spectrum | MIT Technology Review | Google Rattles the Tech World with a New AI Chip for All by Wired | Back to the Edge: AI Will Force Distributed Intelligence Everywhere by Azeem | When Moore's Law Met AI – Artificial Intelligence and the Future of Computing by Azeem
Latency and contextualization of locales are key drivers of edge computing
Key Drivers of Edge Computing
• Learning typically happens in the cloud
• Devices do not do any learning from their environment or experience
• Besides inference, it will also be essential to push training to the edge
Latency
• For many applications, the delay of a round trip to the cloud will be unacceptable (e.g. the high latency risk of sending signal data to the cloud for self-driving prediction, even with 5G networks)
Context
• Devices will soon need to be powerful enough to learn at the edge of the network
• Devices will be used in situ, and those locales will be increasingly contextualized
• The environment where the device is placed will be a key input to its operation, allowing the network to learn from the experience of edge devices and their environment
  • 15. Source: Artificial Intelligence: 10 Trends to Watch in 2017 and Beyond by Tractica | Expect Deeper and Cheaper Machine Learning by IEEE Spectrum | MIT Technology Review | Google Rattles the Tech World with a New AI Chip for All by Wired | Back to the Edge: AI Will Force Distributed Intelligence Everywhere by Azeem | When Moore's Law Met AI – Artificial Intelligence and the Future of Computing by Azeem
Going forward, we are likely to see Federated Learning - a multi-faceted infrastructure where learning happens at the edge of the network and in the cloud
Federated Learning
• Allows for smarter models, lower latency and lower power consumption, while enabling differential privacy and personalized experiences
• Allows the network to learn from the experience of many edge devices and their experiences of the environment
• In a federated environment, instead of sending their raw experiential data back to the cloud for analysis, edge devices could do some learning and efficiently send back deltas (or weights) to the cloud, where a central model could be updated more efficiently
• Differential privacy also ensures that the aggregate data in a database captures significant patterns while protecting individual privacy
• Google designed its original TPU for execution; its new cloud TPU is a chip that handles training as well
• Amazon and Microsoft offer GPU processing via cloud services, but they do not offer bespoke AI chips for both training and executing neural networks
• Bitmain claims to have built 70% of all the computers on the Bitcoin network. It makes specialized chips to perform the critical hash functions involved in mining and trading bitcoins, and packages those chips into the top mining rig - the Antminer S9
• In 2017, Bitmain unveiled details of its new AI chip, the Sophon BM1680 - specialized for both training and executing deep learning algorithms
Google's cloud TPU | Bitmain's Sophon BM1680
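The federated idea described above - devices learn locally and send only deltas, not raw data, to the cloud for aggregation - can be sketched in a few lines. This is a minimal toy model; the update rule, function names and shapes are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of a federated round: each edge device computes a
# weight delta from its private data, and only the deltas travel to
# the cloud, where the central model averages them in.

def local_update(global_weights, local_data, lr=0.1):
    """Toy local step: nudge each weight toward the mean of the local data."""
    target = sum(local_data) / len(local_data)
    return [lr * (target - w) for w in global_weights]

def federated_round(global_weights, device_datasets):
    """Cloud side: average the deltas from all devices into the central model."""
    deltas = [local_update(global_weights, d) for d in device_datasets]
    n = len(deltas)
    return [w + sum(d[i] for d in deltas) / n
            for i, w in enumerate(global_weights)]

weights = [0.0, 0.0]
devices = [[1.0, 3.0], [5.0, 7.0]]   # each device's raw data never leaves it
weights = federated_round(weights, devices)
print(weights)  # weights move toward the overall mean of the devices' data
```

The key property is in the data flow: only `deltas` cross the network, which is what makes the privacy and bandwidth arguments above possible.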
  • 16. Source: Tractica | Graphcore
This is expected to drive increasing production of ASICs, FPGAs, and other emerging chipsets
• AI applications increasingly demand higher performance and lower power, but deep learning technology has primarily been a software play
• The need for hardware acceleration was only recognized recently. Top global semiconductor companies and a number of startups have ventured to develop specialized chipsets to address these demands
• The current chipset market is dominated by GPUs and CPUs
• An expanded role for other chipset types, including ASICs and FPGAs, is expected in the future
• According to Tractica, deep learning chipset shipments are expected to increase from 863K units in 2016 to 41.2M units by 2025, with revenue growing from USD 513M to USD 12.2B at a CAGR of 42.2%
Deep Learning Chipset Unit Shipment by Type (Global, 2016-2025): from 863K units to 41.2M units
"We believe that intelligence is the future of computing, and graph processing is the future of computers."
Nigel Toon, CEO | Graphcore
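The Tractica revenue figures above are internally consistent, which a two-line computation confirms (2016 to 2025 is a nine-year span):

```python
# Sanity check of the quoted forecast: USD 513M (2016) growing to
# USD 12.2B (2025) implies the stated compound annual growth rate.

start_usd_b, end_usd_b = 0.513, 12.2
years = 2025 - 2016
cagr = (end_usd_b / start_usd_b) ** (1 / years) - 1
print(f"{cagr:.1%}")  # 42.2%
```

The same formula applied to unit shipments (863K to 41.2M) gives a somewhat higher unit-growth rate, implying falling average selling prices over the forecast period.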
  • 17. Key Takeaways
In Part I, we noted that deep learning technology has primarily been a software play to date. The rise of new applications (e.g. autonomous driving) is expected to create substantial demand for computing. Existing processors were not originally designed for these new applications, hence the need to develop AI-optimized chipsets.
Currently, most of the computing happens in the cloud. As AI applications become more ubiquitous, we expect a shift of inference and training closer to where they are needed, resulting in a relative increase in intelligence at the network edge.
This is the end of Part II of a 4-part series of Vertex Perspectives that seeks to understand the key factors driving innovation for AI-optimized chipsets, their industry landscape and development trajectory.
In Part III, we assess the dominance of tech giants in the cloud, coupled with disruptive startups adopting cloud-first or edge-first approaches to AI-optimized chips. Most industry players are expected to focus on the cloud, with ASIC startups featuring prominently both in the cloud and at the edge. Importantly, we look at what these opportunities mean for investors and entrepreneurs.
Finally, in Part IV, we look at other emerging technologies, including neuromorphic chips and quantum computing systems, to explore their promise as alternative AI-optimized chipsets.
Do let us know if you would like to subscribe to future Vertex Perspectives.
  • 18. Disclaimer
This presentation has been compiled for informational purposes only. It does not constitute a recommendation to any party. The presentation relies on data and insights from a wide range of sources, including public and private companies, market research firms, government agencies and industry professionals. We cite specific sources where information is public. The presentation is also informed by non-public information and insights.
Information provided by third parties may not have been independently verified. Vertex Holdings believes such information to be reliable and adequately comprehensive, but does not represent that such information is in all respects accurate or complete. Vertex Holdings shall not be held liable for any information provided.
Any information or opinions provided in this report are as of the date of the report, and Vertex Holdings is under no obligation to update the information or communicate that any updates have been made.
About Vertex Ventures
Vertex Ventures is a global network of operator-investors who manage portfolios in the U.S., China, Israel, India and Southeast Asia. Vertex teams combine firsthand experience in transformational technologies; on-the-ground knowledge in the world's major innovation centers; and global context, connections and customers.
About the Authors
Emanuel TIMOR, General Partner, Vertex Ventures Israel, emanuel@vertexventures.com
XIA Zhi Jin, Partner, Vertex Ventures China, xiazj@vertexventures.com
Brian TOH, Director, Vertex Holdings, btoh@vertexholdings.com
Tracy JIN, Director, Vertex Holdings, tjin@vertexholdings.com