Nick Ni (Xilinx) and Lawrence Spracklen (Numenta) presented a talk at the FPGA Conference Europe on July 8th, 2021. In this talk, they presented a neuroscience-based approach to optimizing state-of-the-art deep learning networks into sparse topologies, and showed how it can unlock significant performance gains on FPGAs without major loss of accuracy. They then walked through an FPGA implementation that exploits the advantages of sparse networks with a unique Domain Specific Architecture (DSA).
8. Xilinx Achieves Highest AI Inference Efficiency
[Chart: "ResNet-50 Efficiency Achieved vs. Claimed", based on MLPerf benchmarks; bars for the Xilinx U250, NVIDIA A100, NVIDIA T4, and NVIDIA Jetson on a 0% to 120% scale]
9. Numenta
Developing machine intelligence through neocortical theory
• Understand how the brain works
• Apply neocortical principles to AI
Developed the “Thousand Brains” theory of how the neocortex works
10. Artificial Neural Networks (ANNs)
[Diagram: input feeding Layer 1, Layer 2, Layer 3, ..., Layer N to output; dense, fully-connected and computationally expensive]
11. Traditional Approach to ANNs
Perform matrix multiplications very fast
• GPUs have become AI workhorses
• 500+ trillion arithmetic operations per second per card
• Hardware performance doubles every few years
• Hardware cannot keep pace with growth in model size
• Exploding AI costs
• 2018: BERT cost $6K+ to train
• 2020: GPT-3 cost $10M+ to train
[Chart: a 17,000X increase over 3 years]
12. Can Neuroscience Help?
Examine the Neocortex
• Neuron interconnections are sparse
• Neuron activations are sparse
• Neurons are significantly more complex than AI’s point-neuron abstraction
• Humans can learn from very limited examples
[Diagram: Numenta’s Roadmap]
13. The Neocortex is Highly Sparse
Source: Prof. Hasan, Max-Planck-Institute for Research
• Neural activity and connectivity are both highly sparse
• Only 0.5% to 2% of cells are active at any time
• Only 1% - 5% of connections actually exist between two connected layers
• Nothing like today’s dense deep learning networks
14. Sparse Layers
• Sparse weights
• Weight matrix is sparse
• Enforced via mask
• Sparse activations
• Outputs of the top-k units are kept; the rest are zeroed (see the sketch below)
• Also applicable to sparse convolutional layers
• Zero-valued kernel weights
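For concreteness, here is a minimal NumPy sketch of both mechanisms (an illustration of the idea only, not Numenta's implementation); the dimensions, ~95% weight sparsity, and k=4 are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def sparse_linear(x, w, mask):
    """Linear layer whose weight matrix is zeroed by a fixed binary mask."""
    return x @ (w * mask).T

def kwinners(a, k):
    """Keep only the top-k activations per sample; zero out the rest."""
    out = np.zeros_like(a)
    idx = np.argpartition(a, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(a, idx, axis=-1), axis=-1)
    return out

in_dim, out_dim = 64, 32
w = rng.standard_normal((out_dim, in_dim))
mask = rng.random((out_dim, in_dim)) < 0.05    # ~95% of weights forced to zero
x = rng.standard_normal((8, in_dim))

h = kwinners(sparse_linear(x, w, mask), k=4)   # 4 of 32 units survive (~88% activation sparsity)
```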
16. Make Models Fast
Sparse models
• Deliver comparable accuracy with up to 20X fewer parameters
• Also leverage activation sparsity for multiplicative benefits
• 100X+ reduction in compute costs
• Hardware needs to be capable of exploiting sparsity
• Efficiently avoid multiplying by the zeros
17. Multiplicative Sparsity Benefits
• Massive opportunity for performance by simultaneously exploiting weight and activation sparsity
• Assuming your hardware is sufficiently flexible!
[Chart: "Potential Speedup from Sparsity"; potential speedup compared to dense, plotted against % weight sparsity (0 to 96) and % activation sparsity (0 to 90), with speedups ranging up to roughly 500X]
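A back-of-the-envelope model for reading the chart (an assumption about the underlying calculation, not a measurement from the talk): if the hardware skips every multiply involving a zero, the surviving fraction of work is (1 - weight sparsity) * (1 - activation sparsity), and the potential speedup is its reciprocal.

```python
def potential_speedup(weight_sparsity: float, activation_sparsity: float) -> float:
    """Idealized speedup over dense when all multiplies by zero are skipped."""
    return 1.0 / ((1.0 - weight_sparsity) * (1.0 - activation_sparsity))

print(potential_speedup(0.95, 0.00))   # 20x from weight sparsity alone
print(potential_speedup(0.95, 0.88))   # ~167x when both kinds are exploited
```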
18. The Challenges of Sparsity
• Dense matrix operations are efficiently handled by hardware
• Predictable memory access patterns to contiguous locations
• Computations compatible with wide SIMD processing
• Sparse matrix operations require efficiently locating, extracting and processing non-zero elements
• On many architectures the overheads outweigh the computational savings
• Makes it challenging to exploit SIMD units
• Block sparsity attempts to group non-zero elements for more efficient processing
• Large block sizes impact obtainable model accuracy
• Flexibility of FPGAs is well-suited to the irregular nature of sparse computations
• Traditional fine-grain sparsity is still a costly proposition (illustrated below)
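To make the overhead concrete, here is a CSR sparse matrix-vector product in NumPy/SciPy (sizes and density are illustrative, not from the talk). Each row walks index arrays and gathers scattered elements of x, exactly the irregular access pattern that defeats wide SIMD loads:

```python
import numpy as np
from scipy.sparse import random as sparse_random

a = sparse_random(512, 512, density=0.05, format="csr", random_state=0)
x = np.random.default_rng(0).standard_normal(512)

y = np.zeros(a.shape[0])
for row in range(a.shape[0]):
    start, end = a.indptr[row], a.indptr[row + 1]
    # Gathering x at scattered column indices is the costly, irregular step.
    y[row] = a.data[start:end] @ x[a.indices[start:end]]

assert np.allclose(y, a @ x)   # matches SciPy's built-in sparse matvec
```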
19. Numenta’s Sparsity
• Numenta has developed high performance sparse techniques
• Permit accurate fine-grain sparsity
• Significantly reduce processing overheads
• Performance scales linearly with degree of sparsity
20. Numenta Sparsity Details
• Leverage sparsity patterns that allow the computation to be framed as a dense operation
• Applicable to both linear and convolutional layers
• This light-weight constraint doesn’t impact achievable accuracy
• Enforce non-overlapping patterns across multiple sets of kernels or linear-layer weights
• Degree of sparsity dictates the size of the set
• Allows multiple kernels/weights to be elegantly combined into a single dense entity
• Speedup scales linearly with the degree of sparsity
• Use sparse static-mask training to provide control over the placement of non-zero weights
• Accurate networks while exploiting extreme sparsity (see the sketch below)
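A toy NumPy illustration of our reading of this packing idea (not Numenta's actual kernels): four 75%-sparse weight vectors whose non-zero positions do not overlap are summed into one dense vector, so a single dense multiply pass serves all four, with the products routed to per-kernel accumulators.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 16, 4                                  # input size; 4 kernels, each 75% sparse

# k disjoint masks that together cover every input position: owner[j] is the
# id of the (single) kernel holding a non-zero weight at position j.
owner = rng.permutation(np.repeat(np.arange(k), n // k))
weights = [np.where(owner == i, rng.standard_normal(n), 0.0) for i in range(k)]

packed = np.sum(weights, axis=0)              # dense: every position is owned by one kernel
x = rng.standard_normal(n)

prod = x * packed                             # one dense elementwise pass over the input
y = np.array([prod[owner == i].sum() for i in range(k)])   # route into k accumulators

# Identical result to evaluating each sparse kernel separately.
assert np.allclose(y, [x @ w for w in weights])
```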
22. Example CNN
• Deep Convolutional Neural Network (CNN)
• 2 Convolutional layers & 2 linear layers
• 2.5M parameters
23. Google Speech Commands
• Trained our CNN on Google Speech Commands dataset
• One-word utterances spoken by thousands of different individuals
• Task is to recognize the word being spoken from the audio signal
• Model accuracy is ~98% for 10 category classification
• Compared dense model performance with 2 sparse networks
• Sparsity varied by network layer
• Weights 95% sparse
• Activations 88% sparse
• Accuracy of sparse models is identical to dense model
• Additional performance achievable by relaxing accuracy constraints
Name           Weights  Activations
Sparse-Dense   Sparse   Dense
Sparse-Sparse  Sparse   Sparse
24. FPGA Performance
• Implemented sparse and dense networks on two Xilinx chips
• Chips designed for datacenter and embedded (Internet of Things) applications
• Dense model implemented using Vitis AI
• Our sparse implementations coded using HLS
                    Alveo U250  Zynq UltraScale+ ZU3EG
System logic cells  1,728,000   154,000
Memory              54MB        0.95MB
DSP slices          12,288      360
System power        225W        24W
25. Sparse Performance: single network
• Order of magnitude performance improvement from sparsity
• ~3X additional benefit from adding activation sparsity
• Dense network doesn’t even fit on ZU3EG
• Sparse network on ZU3EG outperforms dense on U250!
• Sparse networks make AI-at-the-edge a reality
FPGA platform  Network type   Throughput (words/sec)  Speedup over dense
Alveo U250     Dense          3,049                   -
Alveo U250     Sparse-Dense   35,714                  11.71
Alveo U250     Sparse-Sparse  102,564                 33.63
ZU3EG          Dense          0 (doesn't fit)         -
ZU3EG          Sparse-Dense   21,053                  Infinite
ZU3EG          Sparse-Sparse  45,455                  Infinite
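As a quick arithmetic sanity check, the speedup column is just each throughput divided by the dense U250 baseline of 3,049 words/sec:

```python
dense = 3_049
for name, tput in [("Sparse-Dense", 35_714), ("Sparse-Sparse", 102_564)]:
    print(f"{name}: {tput / dense:.2f}X")   # 11.71X and ~33.64X (table shows 33.63)
```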
26. Sparse Performance: full chip
• Efficiencies of the sparse network implementations allow the FPGA to accommodate many more copies
• Sparse-Sparse outperforms traditional dense model by 112X
• Remember – same accuracy as the original dense model
• Sparse models on ZU3EG outperform dense models on U250
FPGA platform  Network type   Networks on chip  Full-chip throughput (words/sec)  Full-chip speedup
Alveo U250     Dense          4                 12,195                            -
Alveo U250     Sparse-Dense   24                689,655                           56.5
Alveo U250     Sparse-Sparse  20                1,369,863                         112.3
ZU3EG          Dense          0                 0                                 -
ZU3EG          Sparse-Dense   1                 21,053                            Infinite
ZU3EG          Sparse-Sparse  1                 45,455                            Infinite
27. FPGA Performance Summary
• The flexible, reconfigurable nature of FPGAs makes them an ideal platform for running sparse models
[Chart: "Speedup from Sparsity on FPGA (relative to dense, Xilinx U250)"; Sparse-Dense and Sparse-Sparse bars for one network and for the full chip, on a 0 to 120X scale]
29. CPU Performance
• How do other compute platforms handle sparsity?
• Create 95% sparse model for CPUs
• Only leverage weight sparsity (sparse-dense)
• Investigate different CPU model inference engines
• Microsoft and Intel runtimes don’t leverage sparsity
• Performance improvements at best around 3X
[Chart: "Speedup on CPUs from 95% sparsity" by CPU execution engine: OpenVino, OnnxRuntime, TVM, and DeepSparse, on a 0 to 3X scale]
30. FPGA vs CPU
• Compare sparse performance on CPUs and FPGAs
• Using a 24-core (48 hardware thread) Intel Xeon 8275CL
• AWS C5.12xlarge
[Chart: samples/second for single-thread and full-chip configurations; CPU-OpenVino, CPU-OnnxRuntime, CPU-TVM, and CPU-DeepSparse vs. Numenta-SD and Numenta-SS, with the U250 annotated as 12X faster; scale 0 to 1,400,000]
31. Sparse ResNets
• Created a sparse version of ResNet50
• Trained on Imagenet
• Training respects MLPerf optimization constraints
• Accuracy results (table below):
• FPGA implementation in progress
Network                  Sparsity  Accuracy (float32)  Accuracy (int8)  Quantization impact
MLPerf benchmark, dense  0%        76.7%               75.7%            -1.00%
NVIDIA, static sparsity  50%       76.8%
Ours, static sparsity    75%       76.22%              74.67%           -1.55%
Ours, dynamic sparsity   75%       77.1%               76.77%           -0.33%
32. Early Results
• Significant performance benefits
• Weight and activation sparsity
• Constrained by MLPerf optimization restrictions
33. Generalized Sparse Support
• Creating a general-purpose library for implementing sparse neural networks on FPGAs
• Exploiting both weight and activation sparsity
• Supports linear layers and the full spectrum of convolutional kernels
• In addition to other support functions (pooling, etc.)
• Degree of sparsity and performance requirements are parameterizable settings (sketched below)
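Purely as a hypothetical sketch of what such parameterizable settings might look like (every name and field here is invented for illustration, not the actual library API):

```python
from dataclasses import dataclass

# Hypothetical per-layer configuration; names are invented for illustration.
@dataclass
class SparseLayerSpec:
    kind: str                   # "linear", "conv2d", "pool", ...
    weight_sparsity: float      # fraction of zero weights, e.g. 0.95
    activation_sparsity: float  # fraction of zeroed activations, e.g. 0.88
    target_throughput: int      # desired inferences/sec, used to size parallelism

network = [
    SparseLayerSpec("conv2d", weight_sparsity=0.90, activation_sparsity=0.88,
                    target_throughput=50_000),
    SparseLayerSpec("linear", weight_sparsity=0.95, activation_sparsity=0.88,
                    target_throughput=50_000),
]
```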
34. Your Cake and Eat it?
• Sparse networks deliver improved performance
• But, done correctly, they can also deliver:
• Improved robustness to noise
• Improved generalization
* How Can We Be So Dense? The Benefits of Using Highly Sparse Representations, Ahmad & Scheinkman
35. The Blessing of Dimensionality
• Advantageous to maximize sparsity to boost performance
• Achievable accuracy decreases at extreme sparsity
• Increasing the width of the network while holding the parameter count constant increases achievable accuracy (see the arithmetic below)
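The width/parameter trade-off is simple arithmetic: with a fixed weight budget, layer width can grow in proportion to 1 / (1 - sparsity). A toy calculation (the dimensions are illustrative):

```python
in_dim, dense_width = 100, 100
budget = in_dim * dense_width                  # 10,000 weights in the dense layer

for sparsity in (0.0, 0.5, 0.9, 0.95):
    width = budget / (in_dim * (1.0 - sparsity))
    print(f"{sparsity:.0%} sparse -> width {width:.0f} at the same parameter count")
# e.g. 90% sparsity permits a 10X wider layer for an unchanged weight budget.
```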
36. Conclusions
• Sparsity represents a powerful technique to reduce the computational costs of AI
• It also provides improved robustness to noise
• It is frequently problematic to efficiently exploit sparsity in hardware
• Speedups are sometimes as little as 3X
• Numenta’s sparsity techniques allow efficient execution on general-purpose hardware
• Linear speedup with the degree of sparsity
• Multiplicative effect from exploiting both weight and activation sparsity
• The flexibility of FPGAs makes them an ideal platform for high-performance AI
• Numenta demonstrated a 100X speedup from sparsity on a Xilinx U250
• Creating drag-n-drop support for sparsity on FPGAs!