Deep learning at the edge:
100x Inference improvement on edge devices
Dr. Lawrence Spracklen
Director of ML Architecture
AI goes big!
● Deep learning dominates today’s AI
conversations
● Delivering significant accuracy
improvements
● Enabled by explosive growth in model sizes
○ Initially in NLP, but increasingly vision and beyond
● Spiraling training and inference costs
● Unsustainable energy utilization
[Chart: 17,000X increase in model size. Source: Microsoft Research blog]
A brute force solution
Perform matrix multiplications very fast
● GPUs have become AI workhorses
○ 500+ trillion arithmetic operations per second per card
● Hardware performance doubles every few years
● Exploding AI costs
○ 2018 : BERT cost $6K+ to train
○ 2020 : GPT-3 cost $10M+ to train
● Hardware failing to keep pace with growth in model size
Source: NVIDIA
Edge systems are constrained systems
● Problems compounded at the edge
● Limited memory and computation resources
● Makes large SOTA models on edge devices impractical
● Efficiency step-change needed to drive change at the edge
Necessary to address either the hardware or the software: design new hardware, or improve model efficiency on existing hardware
Source: NVIDIA blog
Natural Intelligence
Examine the human brain
• Neuron interconnections are sparse
• Neuron activations are sparse
• Neurons are low-fidelity
• Neurons are significantly more complex than AI’s point neuron abstraction
• Humans can learn from very limited examples
Sparse models
• Weight sparsity zeros the majority of a model’s neuron weights
• Activation sparsity dynamically activates context-dependent sub-networks
• Enables significant computational savings
• Benefits are multiplicative
• Delivering 100X+ reductions in computational costs for identical model architectures (see the sketch below)
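A minimal PyTorch sketch of the two forms of sparsity, assuming illustrative density values (5% weight density, 10% activation density) that are not taken from the slides: weight sparsity is a static binary mask over a layer's weights, while activation sparsity keeps only the top-k activations per input. When consecutive layers are sparse in both weights and activations, the surviving multiply-accumulates are roughly weight density × activation density of the dense cost, which is where the multiplicative saving comes from. Real speedups additionally require kernels that actually skip the zeros.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseLinear(nn.Module):
    """Linear layer with static weight sparsity and dynamic activation sparsity.

    Illustrative sketch only; not Numenta's implementation.
    """
    def __init__(self, in_features, out_features,
                 weight_density=0.05, activation_density=0.10):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Static binary mask: keep ~5% of weights (95% weight sparsity).
        mask = (torch.rand_like(self.linear.weight) < weight_density).float()
        self.register_buffer("weight_mask", mask)
        # Number of activations to keep per sample (10% activation density).
        self.k = max(1, int(activation_density * out_features))

    def forward(self, x):
        # Weight sparsity: zero out the pruned connections.
        w = self.linear.weight * self.weight_mask
        y = F.linear(x, w, self.linear.bias)
        # Activation sparsity: keep only the k largest-magnitude activations,
        # so downstream sparse layers see mostly-zero inputs.
        topk = torch.topk(y.abs(), self.k, dim=-1)
        act_mask = torch.zeros_like(y).scatter_(-1, topk.indices, 1.0)
        return y * act_mask

# With 5% weight density and 10% activation density, the useful work in a
# stack of such layers is roughly 0.05 * 0.10 = 0.5% of the dense cost,
# i.e. the 100X+ multiplicative saving described above.
x = torch.randn(8, 256)
layer = SparseLinear(256, 512)
print(layer(x).shape)  # torch.Size([8, 512])
```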
You can control your models
• Industry has struggled to accelerate sparse models on general-purpose
hardware
• 2-3X acceleration from 20X reduction in non-zero parameters
• However, it is possible to precisely control weight and activation placement to ensure hardware-sympathetic patterns
• Achievable with both directly trained sparse models and ‘pruned’ sparse models
• Enables speedups that directly mirror the reduction in non-zeros
• Achievable on existing hardware (FPGAs, CPUs and GPUs)
• 95% weight sparsity delivers 20X acceleration (illustrated in the pruning sketch below)
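As one hedged illustration of the 'pruned' route, PyTorch's built-in pruning utility can impose a 95% sparsity mask by weight magnitude. Achieving the hardware-sympathetic placement described above would additionally require constraining where the non-zeros land (e.g. fixed non-zeros per block or row), which the stock utility does not do; this sketch only shows the reduction in non-zero parameters.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Illustrative only: magnitude-prune a layer to 95% weight sparsity.
layer = nn.Linear(1024, 1024)
prune.l1_unstructured(layer, name="weight", amount=0.95)

density = layer.weight.count_nonzero().item() / layer.weight.numel()
print(f"remaining non-zero weights: {density:.1%}")  # ~5.0%

# The slide's point: a 20X reduction in non-zeros only turns into a ~20X
# speedup when the surviving weights sit in hardware-friendly patterns that
# kernels can exploit, rather than the 2-3X typical of unconstrained sparsity.
```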
ConvNets on FPGAs
• Optimized model inference outperforms the traditional model by 112X
• Remember – same accuracy as the original dense model
• Corresponding improvements in per-operation energy efficiency
• Optimized models fit on ZU3EG edge devices
• Edge device outperforms standard models running on the datacenter-class U250!
| FPGA platform | Power consumption | Network type | Networks on chip | Full-chip throughput (words/sec) | Full-chip speedup | Relative efficiency |
|---------------|-------------------|---------------|------------------|----------------------------------|-------------------|---------------------|
| Alveo U250    | 225W              | Dense         | 4                | 12,195                           | -                 | 100%                |
| Alveo U250    | 225W              | Sparse-Dense  | 24               | 689,655                          | 56.5X             | 5675%               |
| Alveo U250    | 225W              | Sparse-Sparse | 20               | 1,369,863                        | 112.3X            | 11274%              |
| ZU3EG         | 24W               | Dense         | 0                | 0                                | -                 | -                   |
| ZU3EG         | 24W               | Sparse-Dense  | 1                | 21,053                           | Infinite          | 1624%               |
| ZU3EG         | 24W               | Sparse-Sparse | 1                | 45,455                           | Infinite          | 3505%               |

(The dense model does not fit on the ZU3EG at all, hence the "Infinite" speedup entries.)
Transformers on CPUs and GPUs
● Techniques are compatible with large language and vision models
● These optimized models run extremely efficiently on CPUs and GPUs
● 10-30X throughput improvement for typical models
○ Identical model architecture and accuracy
● Accelerates both training and inference
● Significant reduction in memory footprint
Unleash deep learning at the edge
Low latency opportunity
● Long latencies prevent large model use in many
application spaces
○ Small transformer CPU inference latencies can approach 100ms
● Sparse models also deliver improved latency
● Inference latency can be decreased by 10X+
● Enables models to be used with conversational apps
○ Sub-3ms latency for BERT-base on CPUs (a simple way to measure per-query latency is sketched below)
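A minimal timing harness for checking per-query CPU latency yourself, assuming the Hugging Face transformers library and the stock dense `bert-base-uncased` checkpoint (both assumptions, not part of the slides); the sub-3ms figure above refers to the optimized sparse model, so the dense baseline measured here should come out far slower.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative: measure single-query CPU latency for a stock dense BERT-base.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("How late is the pharmacy open today?", return_tensors="pt")
with torch.no_grad():
    for _ in range(10):                      # warm-up
        model(**inputs)
    start = time.perf_counter()
    for _ in range(100):                     # timed runs
        model(**inputs)
    latency_ms = (time.perf_counter() - start) / 100 * 1000

print(f"mean per-query latency: {latency_ms:.1f} ms")
```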
Pick your path
TRAIN
● New model: create an optimized model and train it with accelerated training on GPUs (5-20X performance improvements in training)
● Existing model: start from the pretrained model and apply accelerated fine-tuning on CPUs
INFERENCE
● New optimized models: accelerated CPU or GPU inference (10-30X performance improvements in inference)
● Fine-tuned existing models: accelerated CPU or GPU inference (5-15X performance improvements in inference)
Example Applications for these Optimized Models
• Voice recognition on embedded devices
• Conversational AI – real-time responsiveness
• Computer vision at the edge
• Forecasting early disorders on wearables
Taking it further
• Sparsity represents just the start of what’s possible
• Quantization
• Leverage reduced-fidelity representations
• Further reduces memory footprint
• Translates into improved performance on most hardware [2.5X+] (sketched below)
• Knowledge distillation
• Reduce the number/size of layers and use a large teacher model during training to maximize accuracy
• Fewer layers directly reduce inference costs [5X+]
• Benefits are multiplicative…
Deep learning models can run 100X faster on existing edge devices
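One hedged example of the quantization step: PyTorch's dynamic quantization converts Linear layers to int8, shrinking the memory footprint and typically speeding up CPU inference. The toy model and layer sizes here are illustrative assumptions, and the actual gain depends on hardware and model (the 2.5X+ figure above is the slide's claim, not a guarantee of this snippet).

```python
import torch
import torch.nn as nn

# Illustrative toy model; in practice this would be the sparse model from earlier.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128)).eval()

# Dynamic int8 quantization: weights of Linear layers are stored as int8 and
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 128])
```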
In Summary
● Edge systems are constrained as explosive AI growth continues
● We can leverage structures and efficiencies from the brain to create
hardware-aware, highly performant sparse models
● Optimized models unlock possibilities for deep learning at the edge:
○ Outperform traditional models by 10-100X+
○ Reduced memory footprint
○ Sub-3ms inference latency for large language models
○ Can run where traditional models don’t fit
Interested in partnering with us?
Contact me at lspracklen@numenta.com
or visit numenta.com.
THANK YOU
QUESTIONS?
