Deep learning at the edge:
100x Inference improvement on edge devices
Dr. Lawrence Spracklen
Director of ML Architecture
AI goes big!
● Deep learning dominates today’s AI
conversations
● Delivering significant accuracy
improvements
● Enabled by explosive growth in model sizes
○ Initially in NLP, but increasingly vision and beyond
● Spiraling training and inference costs
● Unsustainable energy utilization
[Chart: 17,000X increase in model size. Source: Microsoft Research blog]
A brute force solution
Perform matrix multiplications very fast
● GPUs have become AI workhorses
○ 500+ trillion arithmetic operations per second per card
● Hardware performance doubles every few years
● Exploding AI costs
○ 2018 : BERT cost $6K+ to train
○ 2020 : GPT-3 cost $10M+ to train
● Hardware failing to keep pace with growth in model size
Source: NVIDIA
Edge systems are constrained systems
● Problems compounded at the edge
● Limited memory and computation resources
● Makes large SOTA models on edge devices impractical
● Efficiency step-change needed to drive change at the edge
Necessary to address either the hardware or the software: design new hardware, or improve model efficiency on existing hardware
Source: NVIDIA blog
Natural Intelligence
Examine the human brain
• Neuron interconnections are sparse
• Neuron activations are sparse
• Neurons are low-fidelity
• Neurons are significantly more complex than AI’s point neuron abstraction
• Humans can learn from very limited examples
Sparse models
• Weight sparsity zeros the majority of a model’s neuron weights
• Activation sparsity dynamically activates context-dependent sub-networks
• Enables significant computational savings
• Benefits are multiplicative
• Delivering 100X+ reductions in computational costs for identical model architectures (see the sketch below)
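A minimal PyTorch sketch of the two forms of sparsity, assuming illustrative density values (5% weight density, 10% activation density) that are not taken from the slides: weight sparsity is a static binary mask over a layer's weights, while activation sparsity keeps only the top-k activations per input. When consecutive layers are sparse in both weights and activations, the surviving multiply-accumulates are roughly weight density × activation density of the dense cost, which is where the multiplicative saving comes from. Real speedups additionally require kernels that actually skip the zeros.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseLinear(nn.Module):
    """Linear layer with static weight sparsity and dynamic activation sparsity.

    Illustrative sketch only; not Numenta's implementation.
    """
    def __init__(self, in_features, out_features,
                 weight_density=0.05, activation_density=0.10):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Static binary mask: keep ~5% of weights (95% weight sparsity).
        mask = (torch.rand_like(self.linear.weight) < weight_density).float()
        self.register_buffer("weight_mask", mask)
        # Number of activations to keep per sample (10% activation density).
        self.k = max(1, int(activation_density * out_features))

    def forward(self, x):
        # Weight sparsity: zero out the pruned connections.
        w = self.linear.weight * self.weight_mask
        y = F.linear(x, w, self.linear.bias)
        # Activation sparsity: keep only the k largest-magnitude activations,
        # so downstream sparse layers see mostly-zero inputs.
        topk = torch.topk(y.abs(), self.k, dim=-1)
        act_mask = torch.zeros_like(y).scatter_(-1, topk.indices, 1.0)
        return y * act_mask

# With 5% weight density and 10% activation density, the useful work in a
# stack of such layers is roughly 0.05 * 0.10 = 0.5% of the dense cost,
# i.e. the 100X+ multiplicative saving described above.
x = torch.randn(8, 256)
layer = SparseLinear(256, 512)
print(layer(x).shape)  # torch.Size([8, 512])
```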
You can control your models
• Industry has struggled to accelerate sparse models on general-purpose
hardware
• 2-3X acceleration from 20X reduction in non-zero parameters
• However, it is possible to precisely control weight and activation placement to ensure hardware-sympathetic patterns
• Achievable with both directly trained sparse models and ‘pruned’ sparse models
• Enables speedups that directly mirror the reduction in non-zeros
• Achievable on existing hardware (FPGAs, CPUs and GPUs)
• 95% weight sparsity delivers 20X acceleration (illustrated in the pruning sketch below)
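As one hedged illustration of the 'pruned' route, PyTorch's built-in pruning utility can impose a 95% sparsity mask by weight magnitude. Achieving the hardware-sympathetic placement described above would additionally require constraining where the non-zeros land (e.g. fixed non-zeros per block or row), which the stock utility does not do; this sketch only shows the reduction in non-zero parameters.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Illustrative only: magnitude-prune a layer to 95% weight sparsity.
layer = nn.Linear(1024, 1024)
prune.l1_unstructured(layer, name="weight", amount=0.95)

density = layer.weight.count_nonzero().item() / layer.weight.numel()
print(f"remaining non-zero weights: {density:.1%}")  # ~5.0%

# The slide's point: a 20X reduction in non-zeros only turns into a ~20X
# speedup when the surviving weights sit in hardware-friendly patterns that
# kernels can exploit, rather than the 2-3X typical of unconstrained sparsity.
```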
ConvNets on FPGAs
• Optimized model inference outperforms the traditional model by 112X
• Remember – same accuracy as the original dense model
• Corresponding improvements in per-operation energy efficiency
• Optimized models fit on ZU3EG edge devices
• Edge device outperforms standard models running on the datacenter-class U250!
| FPGA platform | Power consumption | Network type | Networks on chip | Full-chip throughput (words/sec) | Full-chip speedup | Relative efficiency |
|---------------|-------------------|---------------|------------------|----------------------------------|-------------------|---------------------|
| Alveo U250    | 225W              | Dense         | 4                | 12,195                           | -                 | 100%                |
| Alveo U250    | 225W              | Sparse-Dense  | 24               | 689,655                          | 56.5X             | 5675%               |
| Alveo U250    | 225W              | Sparse-Sparse | 20               | 1,369,863                        | 112.3X            | 11274%              |
| ZU3EG         | 24W               | Dense         | 0                | 0                                | -                 | -                   |
| ZU3EG         | 24W               | Sparse-Dense  | 1                | 21,053                           | Infinite          | 1624%               |
| ZU3EG         | 24W               | Sparse-Sparse | 1                | 45,455                           | Infinite          | 3505%               |

(The dense model does not fit on the ZU3EG at all, hence the "Infinite" speedup entries.)
Transformers on CPUs and GPUs
● Techniques are compatible with large language and vision models
● These optimized models run extremely efficiently on CPUs and GPUs
● 10-30X throughput improvement for typical models
○ Identical model architecture and accuracy
● Accelerates both training and inference
● Significant reduction in memory footprint
Unleash deep learning at the edge
Low latency opportunity
● Long latencies prevent large model use in many
application spaces
○ Small transformer CPU inference latencies can approach 100ms
● Sparse models also deliver improved latency
● Inference latency can be decreased by 10X+
● Enables models to be used with conversational apps
○ Sub-3ms latency for BERT-base on CPUs (a simple way to measure per-query latency is sketched below)
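A minimal timing harness for checking per-query CPU latency yourself, assuming the Hugging Face transformers library and the stock dense `bert-base-uncased` checkpoint (both assumptions, not part of the slides); the sub-3ms figure above refers to the optimized sparse model, so the dense baseline measured here should come out far slower.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative: measure single-query CPU latency for a stock dense BERT-base.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("How late is the pharmacy open today?", return_tensors="pt")
with torch.no_grad():
    for _ in range(10):                      # warm-up
        model(**inputs)
    start = time.perf_counter()
    for _ in range(100):                     # timed runs
        model(**inputs)
    latency_ms = (time.perf_counter() - start) / 100 * 1000

print(f"mean per-query latency: {latency_ms:.1f} ms")
```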
Pick your path
TRAIN
● New model: create an optimized model and train it with accelerated training on GPUs (5-20X performance improvements in training)
● Existing model: start from the pretrained model and apply accelerated fine-tuning on CPUs
INFERENCE
● New optimized models: accelerated CPU or GPU inference (10-30X performance improvements in inference)
● Fine-tuned existing models: accelerated CPU or GPU inference (5-15X performance improvements in inference)
Example Applications for these Optimized Models
• Voice recognition on embedded devices
• Conversational AI – real-time responsiveness
• Computer vision at the edge
• Forecasting early disorders on wearables
Taking it further
• Sparsity represents just the start of what’s possible
• Quantization
• Leverage reduced-fidelity representations
• Further reduces memory footprint
• Translates into improved performance on most hardware [2.5X+] (sketched below)
• Knowledge distillation
• Reduce the number/size of layers and use a large teacher model during training to maximize accuracy
• Fewer layers directly reduce inference costs [5X+]
• Benefits are multiplicative…
Deep learning models can run 100X faster on existing edge devices
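One hedged example of the quantization step: PyTorch's dynamic quantization converts Linear layers to int8, shrinking the memory footprint and typically speeding up CPU inference. The toy model and layer sizes here are illustrative assumptions, and the actual gain depends on hardware and model (the 2.5X+ figure above is the slide's claim, not a guarantee of this snippet).

```python
import torch
import torch.nn as nn

# Illustrative toy model; in practice this would be the sparse model from earlier.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128)).eval()

# Dynamic int8 quantization: weights of Linear layers are stored as int8 and
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 128])
```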
In Summary
● Edge systems are constrained as explosive AI growth continues
● We can leverage structures and efficiencies from the brain to create
hardware-aware, highly performant sparse models
● Optimized models unlock possibilities for deep learning at the edge:
○ Outperform traditional models by 10-100X+
○ Reduced memory footprint
○ Sub-3ms inference latency for large language models
○ Can run where traditional models don’t fit
Interested in partnering with us?
Contact me at lspracklen@numenta.com
or visit numenta.com.
THANK YOU
QUESTIONS?
