Deep learning models have grown exponentially in size, making them impractical for deployment on edge devices with limited resources. However, new research shows it is possible to achieve 100x faster inference on edge devices by leveraging sparse models inspired by the human brain. Sparse models activate only a small subset of connections between neurons, enabling significant computational savings without loss of accuracy. When precisely controlled to match hardware capabilities, sparse models can achieve speedups directly proportional to their reduction in parameters. This allows state-of-the-art deep learning models to run efficiently on edge devices, unlocking new applications in areas like computer vision, language processing, and health monitoring.
1. Deep learning at the edge:
100x Inference improvement on edge devices
Dr. Lawrence Spracklen
Director of ML Architecture
2. AI goes big!
● Deep learning dominates today’s AI conversations
● Delivering significant accuracy improvements
● Enabled by explosive growth in model sizes
○ Initially in NLP, but increasingly vision and beyond
● Spiraling training and inference costs
● Unsustainable energy utilization
[Chart: 17,000X increase in model size. Source: Microsoft Research blog]
3. A brute force solution
Perform matrix multiplications very fast
● GPUs have become AI workhorses
○ 500+ trillion arithmetic operations per second per card
● Hardware performance doubles every few years
● Exploding AI costs
○ 2018 : BERT cost $6K+ to train
○ 2020 : GPT-3 cost $10M+ to train
● Hardware failing to keep pace with growth in model size
Source: NVIDIA
4. Design new hardware
Edge systems are constrained systems
● Problems compounded at the edge
● Limited memory and computation resources
● Makes large SOTA models on edge devices impractical
● Efficiency step-change needed to drive change at the edge
Necessary to address either the software or the hardware: improve model efficiency on existing hardware
Source: NVIDIA blog
5. Natural Intelligence
Examine the human brain
• Neuron interconnections are sparse
• Neuron activations are sparse
• Neurons are low-fidelity
• Neurons are significantly more complex than AI’s point neuron abstraction
• Humans can learn from very limited examples
6. Sparse models
• Weight sparsity zeros the majority of a model’s neuron weights
• Activation sparsity dynamically activates context-dependent sub-networks
• Enables significant computational savings
• Benefits are multiplicative
• Delivering 100X+ reductions in computational costs for identical model architectures
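As a rough illustration of why the two savings multiply, the NumPy sketch below (made-up layer sizes; a simple top-k "k-winners" rule standing in for activation sparsity) counts the fraction of multiply-accumulates that survive when 95% weight sparsity is combined with roughly 90% activation sparsity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense layer 1024 -> 1024: a dense forward pass costs 1024 * 1024 MACs.
W = rng.standard_normal((1024, 1024)).astype(np.float32)
x = rng.standard_normal(1024).astype(np.float32)

# Weight sparsity: keep only the largest 5% of weights by magnitude.
k_w = int(0.05 * W.size)
threshold = np.sort(np.abs(W), axis=None)[-k_w]
W_sparse = np.where(np.abs(W) >= threshold, W, 0.0)

# Activation sparsity: keep only the top 10% of activations (k-winners).
def k_winners(a, frac=0.10):
    k = int(frac * a.size)
    thr = np.sort(a)[-k]
    return np.where(a >= thr, a, 0.0)

y = k_winners(W_sparse @ x)

# Only (non-zero weight) x (non-zero input activation) pairs cost anything,
# so the fractions of surviving weights and activations multiply.
weight_frac = np.count_nonzero(W_sparse) / W.size          # ~0.05
act_frac = np.count_nonzero(k_winners(x)) / x.size         # ~0.10
print(f"weight density: {weight_frac:.2f}")
print(f"combined cost vs dense: {weight_frac * act_frac:.4f}")  # ~0.005
```

With these illustrative numbers the surviving work is about 0.5% of the dense cost, consistent with the 100X+ reductions claimed above.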
7. You can control your models
• Industry has struggled to accelerate sparse models on general-purpose hardware
• Only 2-3X acceleration from a 20X reduction in non-zero parameters
• However, it is possible to precisely control weight and activation placement to ensure hardware-sympathetic patterns
• Achievable with both directly trained sparse models and ‘pruned’ sparse models
• Enables speedups that directly mirror the reduction in non-zeros
• Achievable on existing hardware (FPGAs, CPUs and GPUs)
• 95% weight sparsity delivers 20X acceleration
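One hardware-sympathetic placement pattern is balanced sparsity: keep exactly k non-zeros in every row, so each output neuron costs the same number of operations and the work maps cleanly onto SIMD lanes or FPGA pipelines. The NumPy sketch below (illustrative sizes; magnitude pruning as the placement rule, which is one option, not necessarily the deck's method) shows how the compressed form turns ~95% sparsity into a ~20X reduction in per-row work:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)

# Balanced magnitude pruning: keep exactly k non-zeros per row.
def prune_balanced(W, density=0.05):
    k = max(1, int(density * W.shape[1]))
    idx = np.argsort(np.abs(W), axis=1)[:, -k:]      # top-k columns per row
    mask = np.zeros(W.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return W * mask, idx

W_pruned, idx = prune_balanced(W, density=0.05)

# Compressed execution: store only surviving values + their column indices;
# the matvec then touches k entries per row instead of all 256.
vals = np.take_along_axis(W_pruned, idx, axis=1)
y_sparse = (vals * x[idx]).sum(axis=1)

assert np.allclose(y_sparse, W_pruned @ x, atol=1e-4)  # same result
speedup = W.shape[1] / idx.shape[1]
print(f"per-row MACs: {idx.shape[1]} vs {W.shape[1]} dense ({speedup:.0f}X fewer)")
```

Because every row has identical work, there is no load imbalance, which is what lets the speedup track the non-zero reduction instead of stalling at 2-3X.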
8. ConvNets on FPGAs
• Optimized model inference outperforms the traditional model by 112X
• Remember – same accuracy as the original dense model
• Corresponding improvements in per-operation energy efficiency
• Optimized models fit on ZU3EG edge devices
• Edge device outperforms standard models running on the datacenter-class U250!
FPGA platform | Power consumption | Network type | Networks on chip | Full-chip throughput (words/sec) | Full-chip speedup | Relative efficiency
Alveo U250 | 225W | Dense | 4 | 12,195 | - | 100%
Alveo U250 | 225W | Sparse-Dense | 24 | 689,655 | 56.5 | 5675%
Alveo U250 | 225W | Sparse-Sparse | 20 | 1,369,863 | 112.3 | 11274%
ZU3EG | 24W | Dense | 0 | 0 | - | -
ZU3EG | 24W | Sparse-Dense | 1 | 21,053 | Infinite | 1624%
ZU3EG | 24W | Sparse-Sparse | 1 | 45,455 | Infinite | 3505%
(The dense network does not fit on the ZU3EG at all, hence the "Infinite" speedup for the sparse variants.)
9. Transformers on CPUs and GPUs
● Techniques are compatible with large language and vision models
● These optimized models run extremely efficiently on CPUs and GPUs
● 10-30X throughput improvement for typical models
○ Identical model architecture and accuracy
● Accelerates both training and inference
● Significant reduction in memory footprint
Unleash deep learning at the edge
10. Low latency opportunity
● Long latencies prevent large model use in many application spaces
○ Small transformer CPU inference latencies can approach 100ms
● Sparse models also deliver improved latency
● Inference latency can be decreased by 10X+
● Enables models to be used with conversational apps
○ Sub-3ms latency for BERT-base on CPUs
11. Pick your path
● New model (train): create an optimized model via accelerated training on GPUs (5-20X performance improvement in training), then run accelerated CPU or GPU inference (10-30X performance improvement in inference)
● Existing model (fine-tune): start from a pretrained model and use accelerated fine-tuning on CPUs, then run accelerated CPU or GPU inference (5-15X performance improvement in inference)
12. Example Applications for these Optimized Models
● Voice recognition on embedded devices
● Conversational AI – real-time responsiveness
● Computer vision at the edge
● Forecasting early disorders on wearables
13. Taking it further
• Sparsity represents just the start of what’s possible
• Quantization
• Leverage reduced fidelity representations
• Further reduces memory footprint
• Translates into improved performance on most hardware [2.5X+]
• Knowledge distillation
• Reduce number/size of layers & use large teacher during training to maximize accuracy
• Fewer layers directly reduces inference costs [5X+]
• Benefits are multiplicative…
Deep learning models can run 100X faster on existing edge devices
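A minimal sketch of the quantization step (symmetric per-tensor int8, illustrative tensor size, not the deck's exact scheme), plus the back-of-the-envelope compounding of the figures quoted above:

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.standard_normal(4096).astype(np.float32)

# Symmetric per-tensor int8 quantization: 4 bytes/weight -> 1 byte/weight.
scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale  # dequantize to check the error

print(f"memory: {w.nbytes} B fp32 -> {w_q.nbytes} B int8 (4X smaller)")
print(f"max quantization error: {np.abs(w - w_deq).max():.4f}")

# The optimizations compound multiplicatively (illustrative arithmetic
# using the deck's own figures): 20X sparsity x 2.5X quantization x
# 5X distillation.
combined = 20 * 2.5 * 5
print(f"combined speedup estimate: {combined:.0f}X")
```

The quantization error stays bounded by half the scale, which is why accuracy typically survives; the 250X product is an upper-bound estimate, consistent with the 100X+ claim.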
14. In Summary
● Edge systems are constrained as explosive AI growth continues
● We can leverage structures and efficiencies from the brain to create hardware-aware, highly performant sparse models
● Optimized models unlock possibilities for deep learning at the edge:
○ Outperform traditional models by 10-100X+
○ Reduced memory footprint
○ Sub-3ms inference latency for large language models
○ Can run where traditional models don’t fit
15. Interested in partnering with us?
Contact me at lspracklen@numenta.com
or visit numenta.com.
THANK YOU
QUESTIONS?