Accelerating algorithmic and
hardware advancements for
power efficient on-device AI
Qualcomm Technologies, Inc.
@qualcomm_tech
July 2018
2
By 2025, the data center sector could
be using 20% of all available electricity
in the world [1]
A cloud provider used the equivalent
energy consumption of ~366,000 US
households in 2014 [2]
Bitcoin mining in 2017 used the same
energy as all of Ireland [3]
Computers
are consuming
an increasing
amount of energy
1. Andrae, Anders (2017), Total Consumer Power Consumption Forecast; 2. The Verge (2014); 3. The Guardian (2017)
Given the economic potential
of AI, these numbers will
only be increasing
3
Will we have reached the capacity of the human brain?
Energy efficiency of a brain is 100x better than current hardware
Source: Welling
[Chart: energy consumption and network size N over time (log scale, 10^0 to 10^14), 1940 to 2030. Data points: 1943, first NN (N ≈ 10); 1988, NetTalk (N ≈ 20K); 2009, Hinton’s Deep Belief Net (N ≈ 10M); 2013, Google/Y! (N ≈ 1B); 2017, extremely large neural networks (N = 137B); 2025 projection, N = 100T = 10^14]
Deep neural networks
are energy hungry
and growing fast
AI is being powered by the explosive
growth of deep neural networks
4
Value created by AI must exceed the cost to run the service
Broad economic viability requires energy efficient AI
Economic feasibility per transaction may
require cost as low as a micro-dollar
(1/10,000th of a cent)
• Personalized advertisements and recommendations
• Smart security monitoring based on image and sound recognition
• Efficiency improvements for smart cities and factories
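As a rough back-of-envelope illustration (the electricity price and the assumption that the whole budget goes to energy are mine, not from the slide), a micro-dollar per transaction implies a hard energy ceiling on the order of tens of joules:

```python
# Back-of-envelope sketch: what a micro-dollar per transaction buys in energy.
# Assumed numbers (not from the slides): electricity at $0.10 per kWh,
# and the entire budget spent on energy alone (no hardware or cooling costs).

PRICE_PER_KWH = 0.10          # US dollars per kilowatt-hour (assumption)
BUDGET = 1e-6                 # one micro-dollar per transaction

kwh_per_transaction = BUDGET / PRICE_PER_KWH          # 1e-5 kWh
joules_per_transaction = kwh_per_transaction * 3.6e6  # 1 kWh = 3.6e6 J

print(f"Energy budget per transaction: {joules_per_transaction:.0f} J")
# -> 36 J: every inference, data transfer, and storage access for the
#    transaction has to fit inside roughly this energy envelope.
```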
5
The AI power and thermal ceiling
The challenge
of AI workloads
Very compute intensive
Complex concurrencies
Real-time
Always-on
Constrained
mobile environment
Must be thermally efficient
for sleek, ultra-light designs
Requires long battery
life for all-day use
Storage/memory
bandwidth limitations
6
Soon, AI algorithms will be
measured by the amount
of intelligence they provide
per watt hour.
7
[Diagram: a deep network mapping Pixels → Edges → Parts → Objects → Output (Dog? / Cat?), with weight parameters on the connections between activation nodes]
Deep learning
The good
Convolutional
neural networks
(CNNs) have been
very successful
• Extract learnable features with state-of-the-art results
• Encode location invariance, namely that the same object may appear anywhere in the image
• Share parameters, making them “data efficient” (see the sketch below)
• Execute quickly on modern hardware with parallel processing
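A minimal sketch of the parameter-sharing point above: the same small kernel is reused at every image position, so a convolutional layer needs orders of magnitude fewer parameters than a fully connected layer over the same input (the layer sizes below are illustrative assumptions, not from the slide).

```python
# Parameter counts for a convolutional vs. fully connected layer.
# Sizes are illustrative assumptions (224x224 RGB input, 64 output channels).
H, W, C_in, C_out = 224, 224, 3, 64
K = 3  # 3x3 kernel

# Convolution: one KxKxC_in kernel per output channel, shared across all positions.
conv_params = C_out * (K * K * C_in) + C_out

# A fully connected layer producing the same output volume needs one weight per
# (input pixel, output unit) pair: no sharing, and no built-in location invariance.
dense_params = (H * W * C_in) * (H * W * C_out) + (H * W * C_out)

print(f"conv parameters:  {conv_params:,}")   # 1,792
print(f"dense parameters: {dense_params:,}")  # ~4.8e11
```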
8
[Same network diagram as the previous slide]
Deep learning
The bad and the ugly
CNNs use too much
memory, compute,
and energy (today)
• CNNs do not encode additional symmetries, such as rotation invariance (an object may appear in any orientation [1])
• CNNs do not reliably quantify the confidence in a prediction
• CNNs are easy to fool by changing the input only slightly, as with adversarial examples (see the sketch below)
1. Cohen & Welling 2016
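A minimal numpy sketch of that fragility, using a toy linear classifier instead of a real CNN (the model, data, and step size are all assumptions for illustration): an input change of at most 0.01 per dimension flips the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "classifier": score = w . x, predict class 1 if score > 0.
d = 1000
w = rng.normal(size=d) / np.sqrt(d)
x = rng.normal(size=d)
x = x - (w @ x) * w / (w @ w) + 0.05 * w   # place x slightly on the class-1 side

print("clean score:", w @ x)               # small positive -> class 1

# FGSM-style perturbation: nudge each input dimension a tiny step against the
# class-1 direction (the gradient of the score w.r.t. x is w, so its sign is sign(w)).
eps = 0.01
x_adv = x - eps * np.sign(w)

print("adversarial score:", w @ x_adv)                      # now negative -> class 0
print("max per-dimension change:", np.max(np.abs(x_adv - x)))  # only 0.01
```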
9
[Same network diagram as the previous slides]
Bayesian deep learning
addresses these challenges
Inspired by brain functionality, introducing noise
to neural networks is beneficial
Noise can be a
good thing for AI
Compression
and quantization
• Reduce complexity
of the neural network model
• Reduce bit-width of the
parameters and activations
• Save power and
improve efficiency
Introduce noise to the weights; the noise propagates to the activations (see the sketch below)
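A minimal numpy sketch of this step (layer sizes and the noise scale are assumptions): draw Gaussian noise on the weights of a small layer and observe how it shows up as spread in the activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fully connected layer with a Gaussian distribution over each weight:
# mean mu (the "usual" weight value) and standard deviation sigma (the noise).
d_in, d_out = 32, 16
mu = rng.normal(size=(d_in, d_out)) * 0.1
sigma = np.full((d_in, d_out), 0.05)          # assumed noise scale

x = rng.normal(size=d_in)                     # one input example

# Draw several weight samples and run the same forward pass each time.
samples = []
for _ in range(100):
    w = mu + sigma * rng.normal(size=mu.shape)   # noisy weights
    samples.append(np.maximum(w.T @ x, 0.0))     # ReLU activations
samples = np.array(samples)

# The weight noise shows up as spread in the activations.
print("activation std across weight samples:", samples.std(axis=0).round(3))
```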
10
[Diagram: the network (Pixels → Edges → Parts → Objects → Output: Dog / Cat) alongside a weight-space plot showing the initial values of two weights, w1 and w2]
Apply Bayesian deep learning to shrink the model
Compression
and quantization
• Quantize weights:
Use lower precision (bit-width)
• Prune activations:
Reduce number of activation
nodes
11
[Diagram: introduce noise to the parameters; the weight-space plot now shows the prior P(W), the distribution over w1 and w2 before data (X)]
Apply Bayesian deep learning to shrink the model
Compression
and quantization
• Quantize weights:
Use lower precision (bit-width)
• Prune activations:
Reduce number of activation
nodes
12
[Diagram: add data (X); the weight-space plot now shows both the prior P(W), the distribution before data, and the posterior P(W | X, Y), the distribution after data]
Apply Bayesian deep learning to shrink the model
Compression
and quantization
• Quantize weights:
Use lower precision (bit-width)
• Prune activations:
Reduce number of activation
nodes
13
[Diagram: in the posterior P(W | X, Y), w1 is pinned down by the data while w2 is still highly uncertain; uncertain weights like w2 can be quantized coarsely or even removed, and activations can be pruned with a similar method (sketched in the code after the bullets below)]
Apply Bayesian deep learning to shrink the model
Compression
and quantization
• Quantize weights:
Use lower precision (bit-width)
• Prune activations:
Reduce number of activation
nodes
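A minimal sketch of the pruning and quantization rule these slides build up to, assuming training has already produced a posterior mean and standard deviation per weight (the random stand-in values, threshold, and bit-width below are illustrative): weights whose posterior is wide relative to its mean, like w2, are removed; confident weights, like w1, are kept and quantized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume a learned posterior N(mu, sigma^2) per weight (values here are random
# stand-ins; in practice they would come from variational training).
mu = rng.normal(size=1000) * 0.5
sigma = np.abs(rng.normal(size=1000)) * 0.3

# Prune weights with a low signal-to-noise ratio |mu| / sigma: the data never
# "pinned them down", so setting them to zero costs little.
snr = np.abs(mu) / sigma
keep = snr > 1.0                                  # illustrative threshold
pruned = np.where(keep, mu, 0.0)

# Quantize the surviving weights to a low bit-width.
bits = 4                                          # illustrative choice
levels = 2 ** bits
scale = np.abs(pruned).max() / (levels / 2 - 1)
quantized = np.round(pruned / scale) * scale

print(f"kept {keep.mean():.0%} of weights, quantized to {bits} bits")
```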
14
Bayesian deep learning provides broad benefits
A powerful tool to address a variety of deep learning challenges
Compression
and quantization
Quantize parameters and
activations, prune model
components
Regularization
and generalization
Avoid overfitting data; choose
the simplest model to explain
observations (Occam’s razor)
Confidence
estimation
Generate confidence intervals for the predictions (see the sketch below)
Privacy/adversarial
robustness
Avoid storing personal information in parameters and be less sensitive to adversarial attacks
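A minimal numpy sketch of the confidence-estimation benefit (the two-class linear model and its posterior are stand-ins, not anything from the slide): sampling the weight posterior several times turns one prediction into a distribution whose spread can serve as a confidence estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Assumed stand-ins: a 2-class linear model with a Gaussian posterior per weight.
d = 64
mu = rng.normal(size=(d, 2)) * 0.2
sigma = np.full((d, 2), 0.1)
x = rng.normal(size=d)

# Monte Carlo over weight samples -> a distribution over predictions.
probs = []
for _ in range(200):
    w = mu + sigma * rng.normal(size=mu.shape)
    probs.append(softmax(w.T @ x))
probs = np.array(probs)

mean, std = probs.mean(axis=0), probs.std(axis=0)
print(f"P(class 0) = {mean[0]:.2f} +/- {std[0]:.2f}")   # spread = confidence
```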
15
[Chart: Top-5 accuracy (83% to 90%) vs. compression ratio (1x to 3.5x) for ResNet-18 on a large set of labeled images, comparing baseline, channel pruning, SVD, and Bayesian pruning]
Applying Bayesian deep learning to real use cases
Image classification
Zhang, Zou, He, & Sun 2015 (SVD);
He, Zhang, & Sun 2017 (channel pruning)
3x compression ratio while maintaining close to the same accuracy (see the accounting sketch below)
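For context, a sketch of how a compression ratio like the one above can be accounted for when pruning and lower bit-widths are combined (the pruning fraction and bit-widths below are illustrative assumptions, not the numbers behind the chart):

```python
# Illustrative accounting of a compression ratio; the fractions and bit-widths
# are assumptions, not the ones behind the ResNet-18 result above.
total_params = 11_700_000        # roughly ResNet-18 sized
kept_fraction = 0.67             # parameters remaining after pruning (assumed)
orig_bits, new_bits = 32, 16     # float32 weights stored at 16 bits (assumed)

orig_size = total_params * orig_bits
new_size = int(total_params * kept_fraction) * new_bits

print(f"compression ratio: {orig_size / new_size:.1f}x")   # ~3.0x with these numbers
```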
16
What does the
future look like
for AI hardware?
Distributed computing
architectures
AI acceleration
research
17
AI acceleration research
Balance AI acceleration capabilities across engines (CPU, GPU, DSP)
Focused on fundamentals to accelerate deep learning workloads at low power
Memory hierarchy
Optimize the memory hierarchy to reduce
the power consumption of data movement
while ensuring performance
Compute architecture
Optimize instruction type, parallelism,
and precision of the functional units
Utilization
Exploit sparsity to improve utilization and reduce power consumption
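A minimal numpy sketch of the sparsity point (sizes and sparsity level are assumptions): when most activations are zero, the multiply-accumulates for those entries can be skipped entirely, which is where the utilization and power win comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layer: 512 sparse input activations (e.g. after ReLU), 256 outputs.
d_in, d_out = 512, 256
w = rng.normal(size=(d_in, d_out))
a = np.maximum(rng.normal(size=d_in) - 1.0, 0.0)   # ~84% of entries are zero

# Dense matvec: d_in * d_out multiply-accumulates regardless of the zeros.
dense_macs = d_in * d_out

# Sparsity-aware matvec: only rows with nonzero activations touch the weights.
nz = np.flatnonzero(a)
out = w[nz].T @ a[nz]
sparse_macs = len(nz) * d_out

assert np.allclose(out, w.T @ a)
print(f"MACs: {dense_macs} dense vs {sparse_macs} sparse "
      f"({sparse_macs / dense_macs:.0%} of the work)")
```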
18
Algorithmic
advancements
Neural network algorithm design
optimized for hardware
Optimization for space and
runtime efficiency
Efficient
hardware
Efficient architecture design
Selecting the right compute
engine for the right task
Software
tools
Software-accelerated
run-time for deep learning
Neural processing SDK for
model compression and
optimization
The approach
for making
power efficient
on-device AI
Focusing on the joint
design of algorithms and
hardware to achieve
high performance
19
AI algorithms and hardware
need to be energy efficient
We are a leader in Bayesian
deep learning and its
applications to model
compression and quantization
We are doing fundamental
research on AI algorithms,
software, and hardware to
maximize power efficiency
Follow us on:
For more information, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Thank you
Nothing in these materials is an offer to sell any of the
components or devices referenced herein.
©2018 Qualcomm Technologies, Inc. and/or its affiliated
companies. All Rights Reserved.
Qualcomm is a trademark of Qualcomm Incorporated,
registered in the United States and other countries. Other
products and brand names may be trademarks or registered
trademarks of their respective owners.
References in this presentation to “Qualcomm” may mean Qualcomm
Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries
or business units within the Qualcomm corporate structure, as
applicable. Qualcomm Incorporated includes Qualcomm’s licensing
business, QTL, and the vast majority of its patent portfolio. Qualcomm
Technologies, Inc., a wholly-owned subsidiary of Qualcomm
Incorporated, operates, along with its subsidiaries, substantially all of
Qualcomm’s engineering, research and development functions, and
substantially all of its product and services businesses, including its
semiconductor business, QCT.
