Nick Ni (Xilinx) and Lawrence Spracklen (Numenta) presented a talk at the FPGA Conference Europe on July 8th, 2021. In this talk, they presented a neuroscience-based approach to optimizing state-of-the-art deep learning networks into sparse topologies, and showed how it can unlock significant performance gains on FPGAs without major loss of accuracy. They then walked through an FPGA implementation that exploits the advantages of sparse networks with a unique Domain Specific Architecture (DSA).
8. Xilinx Achieves Highest AI Inference Efficiency
[Chart: "ResNet-50 Efficiency Achieved vs. Claimed", based on MLPerf benchmarks; bars for the Xilinx U250, NVIDIA A100, NVIDIA T4, and NVIDIA Jetson on a 0% to 120% scale]
9. Numenta
Developing machine intelligence through neocortical theory
• Understand how the brain works
• Apply neocortical principles to AI
Developed the “Thousand Brains” theory of how the neocortex works
10. Artificial Neural Networks (ANNs)
[Diagram: input feeding Layer 1, Layer 2, Layer 3, ..., Layer N to output; dense, fully-connected and computationally expensive]
11. Traditional Approach to ANNs
Perform matrix multiplications very fast
• GPUs have become AI workhorses
• 500+ trillion arithmetic operations per second per card
• Hardware performance doubles every few years
• Hardware cannot keep pace with growth in model size
• Exploding AI costs
• 2018: BERT cost $6K+ to train
• 2020: GPT-3 cost $10M+ to train
[Chart: a 17,000X increase over 3 years]
12. Can Neuroscience Help?
Examine the Neocortex
• Neuron interconnections are sparse
• Neuron activations are sparse
• Neurons are significantly more complex than AI’s point-neuron abstraction
• Humans can learn from very limited examples
[Diagram: Numenta’s Roadmap]
13. The Neocortex is Highly Sparse
Source: Prof. Hasan, Max-Planck-Institute for Research
• Neural activity and connectivity are both highly sparse
• Only 0.5% to 2% of cells are active at any time
• Only 1% - 5% of connections actually exist between two connected layers
• Nothing like today’s dense deep learning networks
14. Sparse Layers
• Sparse weights
• Weight matrix is sparse
• Enforced via mask
• Sparse activations
• Outputs of the top-k units are kept; the rest are zeroed (see the sketch below)
• Also applicable to sparse convolutional layers
• Zero-valued kernel weights
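For concreteness, here is a minimal NumPy sketch of both mechanisms (an illustration of the idea only, not Numenta's implementation); the dimensions, ~95% weight sparsity, and k=4 are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def sparse_linear(x, w, mask):
    """Linear layer whose weight matrix is zeroed by a fixed binary mask."""
    return x @ (w * mask).T

def kwinners(a, k):
    """Keep only the top-k activations per sample; zero out the rest."""
    out = np.zeros_like(a)
    idx = np.argpartition(a, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(a, idx, axis=-1), axis=-1)
    return out

in_dim, out_dim = 64, 32
w = rng.standard_normal((out_dim, in_dim))
mask = rng.random((out_dim, in_dim)) < 0.05    # ~95% of weights forced to zero
x = rng.standard_normal((8, in_dim))

h = kwinners(sparse_linear(x, w, mask), k=4)   # 4 of 32 units survive (~88% activation sparsity)
```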
16. Make Models Fast
Sparse models
• Deliver comparable accuracy with up to 20X fewer parameters
• Also leverage activation sparsity for multiplicative benefits
• 100X+ reduction in compute costs
• Hardware needs to be capable of exploiting sparsity
• Efficiently avoid multiplying by the zeros
17. Multiplicative Sparsity Benefits
• Massive opportunity for performance by simultaneously exploiting weight and activation sparsity
• Assuming your hardware is sufficiently flexible!
[Chart: "Potential Speedup from Sparsity"; potential speedup compared to dense, plotted against % weight sparsity (0 to 96) and % activation sparsity (0 to 90), with speedups ranging up to roughly 500X]
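A back-of-the-envelope model for reading the chart (an assumption about the underlying calculation, not a measurement from the talk): if the hardware skips every multiply involving a zero, the surviving fraction of work is (1 - weight sparsity) * (1 - activation sparsity), and the potential speedup is its reciprocal.

```python
def potential_speedup(weight_sparsity: float, activation_sparsity: float) -> float:
    """Idealized speedup over dense when all multiplies by zero are skipped."""
    return 1.0 / ((1.0 - weight_sparsity) * (1.0 - activation_sparsity))

print(potential_speedup(0.95, 0.00))   # 20x from weight sparsity alone
print(potential_speedup(0.95, 0.88))   # ~167x when both kinds are exploited
```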
18. The Challenges of Sparsity
• Dense matrix operations are efficiently handled by hardware
• Predictable memory access patterns to contiguous locations
• Computations compatible with wide SIMD processing
• Sparse matrix operations require efficiently locating, extracting and processing non-zero elements
• On many architectures the overheads outweigh the computational savings
• Makes it challenging to exploit SIMD units
• Block sparsity attempts to group non-zero elements for more efficient processing
• Large block sizes impact obtainable model accuracy
• Flexibility of FPGAs is well-suited to the irregular nature of sparse computations
• Traditional fine-grain sparsity is still a costly proposition (illustrated below)
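To make the overhead concrete, here is a CSR sparse matrix-vector product in NumPy/SciPy (sizes and density are illustrative, not from the talk). Each row walks index arrays and gathers scattered elements of x, exactly the irregular access pattern that defeats wide SIMD loads:

```python
import numpy as np
from scipy.sparse import random as sparse_random

a = sparse_random(512, 512, density=0.05, format="csr", random_state=0)
x = np.random.default_rng(0).standard_normal(512)

y = np.zeros(a.shape[0])
for row in range(a.shape[0]):
    start, end = a.indptr[row], a.indptr[row + 1]
    # Gathering x at scattered column indices is the costly, irregular step.
    y[row] = a.data[start:end] @ x[a.indices[start:end]]

assert np.allclose(y, a @ x)   # matches SciPy's built-in sparse matvec
```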
19. Numenta’s Sparsity
• Numenta has developed high performance sparse techniques
• Permit accurate fine-grain sparsity
• Significantly reduce processing overheads
• Performance scales linearly with degree of sparsity
20. Numenta Sparsity Details
• Leverage sparsity patterns that allow the computation to be framed as a dense operation
• Applicable to both linear and convolutional layers
• This light-weight constraint doesn’t impact achievable accuracy
• Enforce non-overlapping patterns across multiple sets of kernels or linear-layer weights
• Degree of sparsity dictates the size of the set
• Allows multiple kernels/weights to be elegantly combined into a single dense entity
• Speedup scales linearly with the degree of sparsity
• Use sparse static-mask training to provide control over the placement of non-zero weights
• Accurate networks while exploiting extreme sparsity (see the sketch below)
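A toy NumPy illustration of our reading of this packing idea (not Numenta's actual kernels): four 75%-sparse weight vectors whose non-zero positions do not overlap are summed into one dense vector, so a single dense multiply pass serves all four, with the products routed to per-kernel accumulators.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 16, 4                                  # input size; 4 kernels, each 75% sparse

# k disjoint masks that together cover every input position: owner[j] is the
# id of the (single) kernel holding a non-zero weight at position j.
owner = rng.permutation(np.repeat(np.arange(k), n // k))
weights = [np.where(owner == i, rng.standard_normal(n), 0.0) for i in range(k)]

packed = np.sum(weights, axis=0)              # dense: every position is owned by one kernel
x = rng.standard_normal(n)

prod = x * packed                             # one dense elementwise pass over the input
y = np.array([prod[owner == i].sum() for i in range(k)])   # route into k accumulators

# Identical result to evaluating each sparse kernel separately.
assert np.allclose(y, [x @ w for w in weights])
```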
22. Example CNN
• Deep Convolutional Neural Network (CNN)
• 2 Convolutional layers & 2 linear layers
• 2.5M parameters
23. Google Speech Commands
• Trained our CNN on Google Speech Commands dataset
• One-word utterances spoken by thousands of different individuals
• Task is to recognize the word being spoken from the audio signal
• Model accuracy is ~98% for 10 category classification
• Compared dense model performance with 2 sparse networks
• Sparsity varied by network layer
• Weights 95% sparse
• Activations 88% sparse
• Accuracy of sparse models is identical to dense model
• Additional performance achievable by relaxing accuracy constraints
Name           Weights  Activations
Sparse-Dense   Sparse   Dense
Sparse-Sparse  Sparse   Sparse
24. FPGA Performance
• Implemented sparse and dense networks on two Xilinx chips
• Chips designed for datacenter and embedded (Internet of Things) applications
• Dense model implemented using Vitis AI
• Our sparse implementations coded using HLS
                    Alveo U250  Zynq UltraScale+ ZU3EG
System logic cells  1,728,000   154,000
Memory              54MB        0.95MB
DSP slices          12,288      360
System power        225W        24W
25. Sparse Performance: single network
• Order of magnitude performance improvement from sparsity
• ~3X additional benefit from adding activation sparsity
• Dense network doesn’t even fit on ZU3EG
• Sparse network on ZU3EG outperforms dense on U250!
• Sparse networks make AI-at-the-edge a reality
FPGA platform  Network type   Throughput (words/sec)  Speedup over dense
Alveo U250     Dense          3,049                   -
Alveo U250     Sparse-Dense   35,714                  11.71
Alveo U250     Sparse-Sparse  102,564                 33.63
ZU3EG          Dense          0 (doesn't fit)         -
ZU3EG          Sparse-Dense   21,053                  Infinite
ZU3EG          Sparse-Sparse  45,455                  Infinite
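As a quick arithmetic sanity check, the speedup column is just each throughput divided by the dense U250 baseline of 3,049 words/sec:

```python
dense = 3_049
for name, tput in [("Sparse-Dense", 35_714), ("Sparse-Sparse", 102_564)]:
    print(f"{name}: {tput / dense:.2f}X")   # 11.71X and ~33.64X (table shows 33.63)
```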
26. Sparse Performance: full chip
• Efficiencies of the sparse network implementations allow the FPGA to accommodate many more copies
• Sparse-Sparse outperforms traditional dense model by 112X
• Remember – same accuracy as the original dense model
• Sparse models on ZU3EG outperform dense models on U250
FPGA platform  Network type   Networks on chip  Full-chip throughput (words/sec)  Full-chip speedup
Alveo U250     Dense          4                 12,195                            -
Alveo U250     Sparse-Dense   24                689,655                           56.5
Alveo U250     Sparse-Sparse  20                1,369,863                         112.3
ZU3EG          Dense          0                 0                                 -
ZU3EG          Sparse-Dense   1                 21,053                            Infinite
ZU3EG          Sparse-Sparse  1                 45,455                            Infinite
27. FPGA Performance Summary
• The flexible, reconfigurable nature of FPGAs makes them an ideal platform for running sparse models
[Chart: "Speedup from Sparsity on FPGA (relative to dense, Xilinx U250)"; Sparse-Dense and Sparse-Sparse bars for one network and for the full chip, on a 0 to 120X scale]
29. CPU Performance
• How do other compute platforms handle sparsity?
• Create 95% sparse model for CPUs
• Only leverage weight sparsity (sparse-dense)
• Investigate different CPU model inference engines
• Microsoft and Intel runtimes don’t leverage sparsity
• Performance improvements at best around 3X
[Chart: "Speedup on CPUs from 95% sparsity" by CPU execution engine: OpenVino, OnnxRuntime, TVM, and DeepSparse, on a 0 to 3X scale]
30. FPGA vs CPU
• Compare sparse performance on CPUs and FPGAs
• Using a 24-core (48 hardware thread) Intel Xeon 8275CL
• AWS C5.12xlarge
[Chart: samples/second for single-thread and full-chip configurations; CPU-OpenVino, CPU-OnnxRuntime, CPU-TVM, and CPU-DeepSparse vs. Numenta-SD and Numenta-SS, with the U250 annotated as 12X faster; scale 0 to 1,400,000]
31. Sparse ResNets
• Created a sparse version of ResNet50
• Trained on Imagenet
• Training respects MLPerf optimization constraints
• Accuracy results (table below):
• FPGA implementation in progress
Network                  Sparsity  Accuracy (float32)  Accuracy (int8)  Quantization impact
MLPerf benchmark, dense  0%        76.7%               75.7%            -1.00%
NVIDIA, static sparsity  50%       76.8%
Ours, static sparsity    75%       76.22%              74.67%           -1.55%
Ours, dynamic sparsity   75%       77.1%               76.77%           -0.33%
32. Early Results
• Significant performance benefits
• Weight and activation sparsity
• Constrained by MLPerf optimization restrictions
33. Generalized Sparse Support
• Creating a general-purpose library for implementing sparse neural networks on FPGAs
• Exploiting both weight and activation sparsity
• Supports linear layers and the full spectrum of convolutional kernels
• In addition to other support functions (pooling, etc.)
• Degree of sparsity and performance requirements are parameterizable settings (sketched below)
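Purely as a hypothetical sketch of what such parameterizable settings might look like (every name and field here is invented for illustration, not the actual library API):

```python
from dataclasses import dataclass

# Hypothetical per-layer configuration; names are invented for illustration.
@dataclass
class SparseLayerSpec:
    kind: str                   # "linear", "conv2d", "pool", ...
    weight_sparsity: float      # fraction of zero weights, e.g. 0.95
    activation_sparsity: float  # fraction of zeroed activations, e.g. 0.88
    target_throughput: int      # desired inferences/sec, used to size parallelism

network = [
    SparseLayerSpec("conv2d", weight_sparsity=0.90, activation_sparsity=0.88,
                    target_throughput=50_000),
    SparseLayerSpec("linear", weight_sparsity=0.95, activation_sparsity=0.88,
                    target_throughput=50_000),
]
```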
34. Your Cake and Eat it?
• Sparse networks deliver improved performance
• But, done correctly, they can also deliver:
• Improved robustness to noise
• Improved generalization
* How Can We Be So Dense? The Benefits of Using Highly Sparse Representations, Ahmad & Scheinkman
35. The Blessing of Dimensionality
• Advantageous to maximize sparsity to boost performance
• Achievable accuracy decreases at extreme sparsity
• Increasing the width of the network while holding the parameter count constant increases achievable accuracy (see the arithmetic below)
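The width/parameter trade-off is simple arithmetic: with a fixed weight budget, layer width can grow in proportion to 1 / (1 - sparsity). A toy calculation (the dimensions are illustrative):

```python
in_dim, dense_width = 100, 100
budget = in_dim * dense_width                  # 10,000 weights in the dense layer

for sparsity in (0.0, 0.5, 0.9, 0.95):
    width = budget / (in_dim * (1.0 - sparsity))
    print(f"{sparsity:.0%} sparse -> width {width:.0f} at the same parameter count")
# e.g. 90% sparsity permits a 10X wider layer for an unchanged weight budget.
```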
36. Conclusions
• Sparsity represents a powerful technique to reduce the computational costs of AI
• It also provides improved robustness to noise
• It is frequently problematic to efficiently exploit sparsity in hardware
• Speedups are sometimes as little as 3X
• Numenta’s sparsity techniques allow efficient execution on general-purpose hardware
• Linear speedup with the degree of sparsity
• Multiplicative effect from exploiting both weight and activation sparsity
• The flexibility of FPGAs makes them an ideal platform for high-performance AI
• Numenta demonstrated a 100X speedup from sparsity on a Xilinx U250
• Creating drag-n-drop support for sparsity on FPGAs!