Breaking the TOPS ceiling
with sparse neural networks
Nick Ni, Xilinx
Lawrence Spracklen, Numenta
© Copyright 2019 Xilinx
Era of Domain-Specific Architectures (DSAs)
Cache
CPU
CISC → RISC → Multi-Core
Fixed HW Accelerators
GPU, ASSP, ASIC
DSA
TPU, DLA
FPGA, Zynq, ACAP
What is a Domain-Specific Architecture?
[Figure: AlexNet data path — a 224×224 RGB input processed by convolutional layers (96 11×11 kernels at stride 4 → 55×55×96; 256 5×5 kernels → 27×27×256; then 384, 384, and 256 3×3 kernels at 13×13), interleaved with max pooling, followed by three dense layers (4096, 4096, 1000) producing an output label such as "Dog".]
1) Custom Data Path
2) Custom Precision
What is a Domain-Specific Architecture?
1) Custom Data Path
2) Custom Precision
3) Custom Memory Hierarchy
[Diagram: custom memory hierarchy — off-chip DDR plus on-chip memory.]
AI Applications are Everywhere
APPLICATIONS: Classification, Object Detection, Segmentation, Speech Recognition, Recommendation Engine, Anomaly Detection
Diverse models (CNN; RNN, LSTM; MLP) over a broad range of applications
AI is Evolving Rapidly
[Chart: Top-1 ImageNet accuracy (50–80%) of models introduced between 2012 and 2018 — AlexNet, BN-AlexNet, BN-NIN, ENet, GoogLeNet, VGG-16/19, ResNet-18/34/50/101/153, ResNeXt-101, Inception-v3/v4, DenseNet-264, ShuffleNet 2x w/SE, SENet-154, MobileNet v2.]
Sources: https://arxiv.org/pdf/1605.07678.pdf, https://arxiv.org/pdf/1608.06993.pdf, https://arxiv.org/pdf/1709.01507.pdf, https://arxiv.org/pdf/1611.05431.pdf
AI is Evolving Rapidly
[Chart: the same accuracy timeline, annotated with silicon lifecycles — each hardware design cycle spans several model generations (AlexNet → GoogLeNet → DenseNet).]
Silicon hardware design cycles can't keep up with the rate of AI innovation
Xilinx Achieves Highest AI Inference Efficiency
Based on MLPerf benchmarks
[Chart: ResNet-50 efficiency achieved vs. claimed (0–120%) for the Xilinx U250, NVIDIA A100, NVIDIA T4, and NVIDIA Jetson.]
Numenta
Developing machine intelligence
through neocortical theory
• Understand how the brain works
• Apply neocortical principles to AI
Developed the “Thousand Brains”
theory of how the neocortex works
Artificial Neural Networks (ANNs)
[Diagram: an input passing through Layer 1, Layer 2, Layer 3, …, Layer N to produce an output.]
Dense, fully connected, and computationally expensive
Traditional Approach to ANNs
Perform matrix multiplications very fast
• GPUs have become AI workhorses
• 500+ trillion arithmetic operations per
second per card
• Hardware performance doubles
every few years
• Hardware cannot keep pace with
growth in model size
• Exploding AI costs
• 2018 : BERT cost $6K+ to train
• 2020 : GPT-3 cost $10M+ to train
[Chart: compute required to train state-of-the-art models — a 17,000X increase over 3 years.]
Can Neuroscience Help?
Examine the Neocortex
• Neuron interconnections are sparse
• Neuron activations are sparse
• Neurons are significantly more complex than AI’s point
neuron abstraction
• Humans can learn from very limited examples
Numenta’s Roadmap
The Neocortex is Highly Sparse
Source: Prof. Hasan, Max-Planck-Institute for Research
• Neural activity and connectivity are both highly sparse
• Only 0.5% to 2% of cells are active at any time
• Only 1% - 5% of connections actually exist between two connected layers
• Nothing like today’s dense deep learning networks
Sparse Layers
• Sparse weights
• Weight matrix is sparse
• Enforced via mask
• Sparse activations
• Outputs of top-k units are maintained
• Also applicable to sparse convolutional layers
• Zero-valued kernel weights
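The two mechanisms above — a static weight mask and top-k activation selection — can be sketched in a few lines of NumPy. This is an illustrative toy, not Numenta's implementation; the function name, mask density, and value of k are arbitrary choices:

```python
import numpy as np

def sparse_linear(x, weights, mask, k):
    """Toy sparse layer: masked (sparse) weights, then top-k (sparse) activations."""
    w = weights * mask                  # weight sparsity, enforced via a static mask
    y = x @ w
    y_sparse = y.copy()
    y_sparse[np.argsort(y)[:-k]] = 0.0  # only the outputs of the top-k units survive
    return y_sparse

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
weights = rng.standard_normal((64, 128))
mask = (rng.random((64, 128)) < 0.05).astype(float)  # ~95% weight sparsity
y = sparse_linear(x, weights, mask, k=16)            # at most 16 non-zero outputs
```

Training against a fixed mask keeps the non-zero positions stable, so the hardware data path can be specialized to them.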
Sparse Deep Networks
Make Models Fast
Sparse models
• Deliver comparable accuracy with up to 20X fewer parameters
• Also leverage activation sparsity for multiplicative benefits
• 100X+ reduction in compute costs
• Hardware needs to be capable of exploiting sparsity
• Efficiently avoid multiplying by the zeros
Multiplicative Sparsity Benefits
• Massive opportunity for performance by simultaneously exploiting
weight and activation sparsity
• Assuming your hardware is sufficiently flexible!
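Under the idealized assumption that every zero operand can be skipped for free, the attainable speedup is simply the inverse of the fraction of multiplies that survive both sparsity filters. A small sketch (the function name and figures are illustrative, not measured):

```python
def potential_speedup(weight_sparsity, activation_sparsity):
    """Fraction of multiplies surviving is (1-ws)*(1-as); speedup is its inverse."""
    surviving = (1.0 - weight_sparsity) * (1.0 - activation_sparsity)
    return 1.0 / surviving

# 95% weight sparsity alone gives ~20X; adding 88% activation sparsity gives ~167X.
print(round(potential_speedup(0.95, 0.00)))   # 20
print(round(potential_speedup(0.95, 0.88)))   # 167
```

This multiplicative compounding is why exploiting both forms of sparsity at once matters so much more than either alone.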
[Chart: potential speedup (compared to dense) as a function of weight sparsity (0–96%) and activation sparsity (0–90%), rising from 1X to roughly 500X when both are exploited.]
The Challenges of Sparsity
• Dense matrix operations are efficiently handled by hardware
• Predictable memory access patterns to contiguous locations
• Computations compatible with wide SIMD processing
• Sparse matrix operations require efficiently locating, extracting
and processing non-zero elements
• On many architectures overheads outweigh computational savings
• Makes it challenging to exploit SIMD units
• Block sparsity attempts to group non-zero elements for more
efficient processing
• Large block sizes impact obtainable model accuracy
• Flexibility of FPGAs is well-suited to irregular nature of sparse
computations
• Traditional fine-grain sparsity is still a costly proposition
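The irregularity is easiest to see in a compressed sparse row (CSR) matrix-vector product, the standard fine-grain sparse kernel. The inner gather through `col_idx` is data-dependent, which is exactly what frustrates wide SIMD units; this toy example is for illustration only:

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """Matrix-vector product for a matrix stored in CSR form."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):                        # one pass per row
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[j] * x[col_idx[j]]      # data-dependent gather from x
    return y

# CSR encoding of [[5, 0, 0, 2], [0, 0, 3, 0], [0, 4, 0, 0]]
values  = np.array([5.0, 2.0, 3.0, 4.0])   # non-zero elements, row-major
col_idx = np.array([0, 3, 2, 1])           # column index of each non-zero
row_ptr = np.array([0, 2, 3, 4])           # where each row starts in `values`
y = csr_matvec(values, col_idx, row_ptr, np.ones(4))
print(y)  # [7. 3. 4.]
```

Rows also have unequal non-zero counts, so the work per row varies — another mismatch with fixed-width vector hardware that an FPGA data path can be shaped around.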
Numenta’s Sparsity
• Numenta has developed high performance sparse techniques
• Permit accurate fine-grain sparsity
• Significantly reduce processing overheads
• Performance scales linearly with degree of sparsity
Numenta Sparsity Details
• Leverage sparsity patterns that allow computation to be framed
as a dense operation
• Applicable to both linear and convolutional layers
• Light-weight constraint doesn’t impact achievable accuracy
• Enforce non-overlapping patterns across multiple sets of kernels
or linear-layer weights
• Degree of sparsity dictates size of set
• Allows multiple kernels/weights to be elegantly combined into a
single dense entity
• Speedup scales linearly with degree of sparsity
• Use sparse static mask training to provide control over
placement of non-zero weights
• Accurate networks while exploiting extreme sparsity
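One way to picture the non-overlapping pattern described above. This NumPy sketch is our own reconstruction from the bullet points, not Numenta's actual kernel: several sparse weight vectors with disjoint supports are folded into a single dense vector, one dense multiply is performed, and the element-wise products are routed back to the kernel that owns each position.

```python
import numpy as np

rng = np.random.default_rng(42)
n, sparsity = 20, 0.8
k = int(round(1 / (1 - sparsity)))   # 80% sparsity -> sets of 5 kernels

# k sparse weight vectors whose non-zero positions never overlap
perm = rng.permutation(n)
weights = np.zeros((k, n))
for i in range(k):
    weights[i, perm[i::k]] = rng.standard_normal(n // k)

owner = np.empty(n, dtype=int)
owner[perm] = np.arange(n) % k       # which kernel owns each position

packed = weights.sum(axis=0)         # k sparse kernels folded into ONE dense vector
x = rng.standard_normal(n)

products = packed * x                # a single dense element-wise multiply
y_packed = np.bincount(owner, weights=products, minlength=k)  # un-pack per kernel

y_ref = weights @ x                  # reference: k separate sparse dot products
assert np.allclose(y_packed, y_ref)
```

Because the degree of sparsity fixes how many kernels fit in one dense entity (here 5 at 80%), the speedup from packing scales linearly with sparsity, matching the claim above.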
80% Sparse, 5x5 Kernel
Example CNN
• Deep Convolutional Neural Network (CNN)
• 2 Convolutional layers & 2 linear layers
• 2.5M parameters
Google Speech Commands
• Trained our CNN on Google Speech Commands dataset
• One-word utterances spoken by thousands of different individuals
• Task is to recognize the word being spoken from the audio signal
• Model accuracy is ~98% for 10 category classification
• Compared dense model performance with 2 sparse networks
• Sparsity varied by network layer
• Weights 95% sparse
• Activations 88% sparse
• Accuracy of sparse models is identical to dense model
• Additional performance achievable by relaxing accuracy constraints
Name            Weights   Activations
Sparse-Dense    Sparse    Dense
Sparse-Sparse   Sparse    Sparse
FPGA Performance
• Implemented sparse and dense networks on two Xilinx chips
• Chips designed for datacenter and embedded (internet of things) applications
• Dense model implemented using Vitis AI
• Our sparse implementations coded using HLS
                     Alveo U250   Zynq UltraScale+ ZU3EG
System logic cells   1,728,000    154,000
Memory               54MB         0.95MB
DSP slices           12,288       360
System power         225W         24W
Sparse Performance: single network
• Order of magnitude performance improvement from sparsity
• ~3X additional benefit from adding activation sparsity
• Dense network doesn’t even fit on ZU3EG
• Sparse network on ZU3EG outperforms dense on U250!
• Sparse networks make AI-at-the-edge a reality
FPGA platform   Network type    Throughput (words/sec)   Speedup over dense
Alveo U250      Dense           3,049                    -
Alveo U250      Sparse-Dense    35,714                   11.71
Alveo U250      Sparse-Sparse   102,564                  33.63
ZU3EG           Dense           0 (does not fit)         -
ZU3EG           Sparse-Dense    21,053                   Infinite
ZU3EG           Sparse-Sparse   45,455                   Infinite
Sparse Performance: full chip
• Efficiencies of sparse network implementations allow FPGA to
accommodate many more copies
• Sparse-Sparse outperforms traditional dense model by 112X
• Remember – same accuracy as the original dense model
• Sparse models on ZU3EG outperform dense models on U250
FPGA platform   Network type    Networks on chip   Full-chip throughput (words/sec)   Full-chip speedup
Alveo U250      Dense           4                  12,195                             -
Alveo U250      Sparse-Dense    24                 689,655                            56.5
Alveo U250      Sparse-Sparse   20                 1,369,863                          112.3
ZU3EG           Dense           0                  0                                  -
ZU3EG           Sparse-Dense    1                  21,053                             Infinite
ZU3EG           Sparse-Sparse   1                  45,455                             Infinite
FPGA Performance Summary
• The flexible, reconfigurable nature of FPGAs makes them an ideal platform for running sparse models
[Chart: speedup from sparsity on FPGA (relative to dense, Xilinx U250) — Sparse-Dense and Sparse-Sparse bars, for one network and for the full chip, reaching ~112X.]
Power Efficiency
FPGA platform   System power   Network type    Networks   Words/sec/watt   Relative efficiency
Alveo U250      225W           Dense           4          54               100%
Alveo U250      225W           Sparse-Dense    24         3,065            5,675%
Alveo U250      225W           Sparse-Sparse   20         6,088            11,274%
ZU3EG           24W            Dense           0          0                0
ZU3EG           24W            Sparse-Dense    1          877              1,624%
ZU3EG           24W            Sparse-Sparse   1          1,893            3,505%
• Increased performance does not come at the cost of increased
power consumption
• Two orders of magnitude reduction in per inference energy cost
CPU Performance
• How do other compute platforms handle sparsity?
• Create 95% sparse model for CPUs
• Only leverage weight sparsity (sparse-dense)
• Investigate different CPU model inference engines
• Microsoft and Intel runtimes don’t leverage sparsity
• Performance improvements at best around 3X
[Chart: speedup on CPUs from 95% sparsity, by execution engine — OpenVINO, ONNX Runtime, TVM, and DeepSparse — ranging from roughly 1X to 3X.]
FPGA vs CPU
• Compare sparse performance on CPUs and FPGAs
• Using 24 core (48 hardware thread) Intel Xeon 8275CL
• AWS C5.12xlarge
[Chart: samples/second (up to ~1,400,000), single-thread and full-chip, for the CPU engines (OpenVINO, ONNX Runtime, TVM, DeepSparse) versus Numenta Sparse-Dense and Sparse-Sparse on the U250 — the U250 is 12X faster.]
• Created a sparse version of ResNet50
• Trained on Imagenet
• Training respects MLPerf optimization constraints
• Accuracy results:
• FPGA implementation in progress
Sparse ResNets
Network                   Sparsity   Accuracy (float32)   Accuracy (int8)   Quantization impact
MLPerf benchmark, dense   0%         76.7%                75.7%             -1.00%
NVIDIA, static sparsity   50%        76.8%                -                 -
Ours, static sparsity     75%        76.22%               74.67%            -1.55%
Ours, dynamic sparsity    75%        77.1%                76.77%            -0.33%
Early Results
• Significant performance benefits
• Weight and activation sparsity
• Constrained by MLPerf optimization restrictions
Generalized Sparse Support
• Creating general-purpose library for implementing sparse
neural networks on FPGAs
• Exploiting both weight and activation sparsity
• Supports linear layers and full-spectrum of convolutional
kernels
• In addition to other support functions (pooling etc)
• Degree of sparsity and performance requirements are
parameterizable settings
Your Cake and Eat it?
• Sparse networks deliver improved performance
• But [done correctly] they can also deliver
• Improved robustness to noise
• Improved generalization
* How Can We Be So Dense? The Benefits of Using Highly Sparse Representations, Ahmad & Scheinkman
The Blessing of Dimensionality
• Advantageous to maximize sparsity to boost performance
• Achievable accuracy decreases at extreme sparsity
• Increasing width of network while holding parameter count
constant increases achievable accuracy
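The trade above can be made concrete with a little arithmetic (the layer sizes and sparsity levels here are hypothetical, chosen only to show the bookkeeping):

```python
def nonzero_params(n_in, n_out, sparsity):
    """Number of non-zero weights in an n_in x n_out layer at a given sparsity."""
    return n_in * n_out * (1.0 - sparsity)

# A dense 256-unit layer and a 4X-wider, 75%-sparse layer carry the SAME number
# of non-zero parameters, but the wide layer has far more representational units.
dense_layer = nonzero_params(1024, 256, 0.00)
wide_sparse = nonzero_params(1024, 1024, 0.75)
print(dense_layer, wide_sparse)  # 262144.0 262144.0
```

Widening while holding the non-zero count constant is what recovers accuracy at extreme sparsity without giving back the compute savings.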
Conclusions
• Sparsity represents a powerful technique to reduce
computational costs for AI
• Also provides improved robustness to noise
• It is frequently difficult to efficiently exploit sparsity on hardware
• Speedups are sometimes as little as 3X
• Numenta's sparsity techniques allow efficient execution on general-purpose hardware
• Linear speedup with degree of sparsity
• Multiplicative effect of exploiting both weight and activation sparsity
• The flexibility of FPGAs makes them an ideal platform for high
performance AI
• Numenta demonstrated 100X speedup from sparsity on a Xilinx U250
• Creating drag-and-drop support for sparsity on FPGAs!
Acknowledgements
Lead Architect on Numenta’s FPGA Sparse Networks
• Kevin Hunter
Numenta’s research team
Interested in sparse networks on FPGAs?
• Let us know!
THANK YOU
Questions?
lspracklen@numenta.com
https://numenta.com
FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks - Xilinx & Numenta

  • 1. Breaking the TOPS ceiling with sparse neural networks Nick Ni, Xilinx Lawrence Spracklen, Numenta
  • 2. © Copyright 2019 Xilinx Era of Domain-Specific Architectures (DSAs) [Diagram: CPU + Cache] CISC → RISC → Multi-Core; Fixed HW Accelerators: GPU, ASSP, ASIC; DSA: TPU, DLA, FPGA, Zynq, ACAP
  • 3. What is a Domain-Specific Architecture? [Figure: AlexNet CNN — 224×224 RGB input image, stride-4 11×11 convolution, successive 5×5 and 3×3 convolutional layers with max pooling, three dense layers (4096, 4096, 1000), output "Dog"] 1) Custom Data Path 2) Custom Precision
  • 4. What is a Domain-Specific Architecture? 1) Custom Data Path 2) Custom Precision 3) Custom Memory Hierarchy [Figure: Off-Chip DDR vs. On-Chip Memory]
  • 5. AI Applications are Everywhere — Diverse models over a broad range of applications. APPLICATIONS: Classification, Object Detection, Segmentation, Speech Recognition, Recommendation Engine, Anomaly Detection. MODELS: CNN; RNN, LSTM; MLP
  • 6. AI is Evolving Rapidly [Chart: ImageNet Top-1 accuracy (%), 2012–2018 — AlexNet, BN-AlexNet, BN-NIN, ENet, GoogLeNet, ResNet-18/34/50/101/152, VGG-16/19, ResNeXt-101, Inception-v3/v4, DenseNet-264, ShuffleNet 2x w/SE, SENet-154, MobileNet v2] Sources: https://arxiv.org/pdf/1605.07678.pdf https://arxiv.org/pdf/1608.06993.pdf https://arxiv.org/pdf/1709.01507.pdf https://arxiv.org/pdf/1611.05431.pdf
  • 7. AI is Evolving Rapidly — The Silicon Hardware Design Cycle Can't Keep Up with the Rate of AI Innovation [Chart: ImageNet Top-1 accuracy (%) over time (AlexNet → GoogLeNet → DenseNet) overlaid on silicon lifecycles]
  • 8. Xilinx Achieves Highest AI Inference Efficiency — Based on MLPerf benchmarks [Chart: ResNet-50 efficiency achieved vs. claimed — Xilinx U250, NVIDIA A100, NVIDIA T4, NVIDIA Jetson]
  • 9. Numenta Developing machine intelligence through neocortical theory • Understand how the brain works • Apply neocortical principles to AI Developed the “Thousand Brains” theory of how the neocortex works
  • 10. Artificial Neural Networks (ANNs) Layer 1 Layer 2 Layer 3 Layer N Input Output Dense, fully-connected and computationally expensive
  • 11. Traditional Approach to ANNs — Perform matrix multiplications very fast • GPUs have become AI workhorses • 500+ trillion arithmetic operations per second per card • Hardware performance doubles every few years • Hardware cannot keep pace with growth in model size • Exploding AI costs • 2018: BERT cost $6K+ to train • 2020: GPT-3 cost $10M+ to train [Figure: 17,000X increase in training compute over 3 years]
  • 12. Can Neuroscience Help? Examine the Neocortex • Neuron interconnections are sparse • Neuron activations are sparse • Neurons are significantly more complex than AI’s point neuron abstraction • Humans can learn from very limited examples Numenta’s Roadmap
  • 13. The Neocortex is Highly Sparse Source: Prof. Hasan, Max-Planck-Institute for Research • Neural activity and connectivity are both highly sparse • Only 0.5% to 2% of cells are active at any time • Only 1% - 5% of connections actually exist between two connected layers • Nothing like today’s dense deep learning networks
  • 14. Sparse Layers • Sparse weights • Weight matrix is sparse • Enforced via mask • Sparse activations • Outputs of top-k units are maintained • Also applicable to sparse convolutional layers • Zero-valued kernel weights
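The two ingredients above — a static weight mask and top-k activation sparsity — can be sketched in a few lines of NumPy (a minimal illustration with arbitrary sizes, not Numenta's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def sparse_linear(x, weight, mask, k):
    """Linear layer with a static weight mask (weight sparsity) and
    top-k winner-take-all outputs (activation sparsity)."""
    y = x @ (weight * mask).T       # masked-out weights never contribute
    out = np.zeros_like(y)
    top = np.argsort(y)[-k:]        # keep only the k largest activations
    out[top] = y[top]
    return out

d_in, d_out = 32, 16
weight = rng.standard_normal((d_out, d_in))
mask = rng.random((d_out, d_in)) < 0.05   # ~95% of weights forced to zero
x = rng.standard_normal(d_in)
y = sparse_linear(x, weight, mask, k=2)   # at most 2 non-zero outputs
```

The same masking idea applies to convolutional layers by zeroing kernel weights, as the slide notes.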
  • 16. Make Models Fast Sparse models • Deliver comparable accuracy with up to 20X fewer parameters • Also leverage activation sparsity for multiplicative benefits • 100X+ reduction in compute costs • Hardware needs to be capable of exploiting sparsity • Efficiently avoid multiplying by the zeros
  • 17. Multiplicative Sparsity Benefits • Massive opportunity for performance by simultaneously exploiting weight and activation sparsity • Assuming your hardware is sufficiently flexible! [Chart: potential speedup over dense as a function of weight sparsity (0–90%) and activation sparsity (0–96%), reaching roughly 500X]
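As a back-of-envelope model (our assumption, not taken directly from the chart): if hardware can skip every multiply in which either the weight or the activation is zero, the attainable speedup over dense is the inverse product of the two densities:

```python
def potential_speedup(weight_sparsity, activation_sparsity):
    """First-order upper bound: only (weight density x activation density)
    of the original multiply-accumulates remain."""
    return 1.0 / ((1.0 - weight_sparsity) * (1.0 - activation_sparsity))

# Weight sparsity alone (95%) gives a ~20X upper bound; adding 90%
# activation sparsity multiplies that to ~200X.
speedup_w = potential_speedup(0.95, 0.0)
speedup_ws = potential_speedup(0.95, 0.90)
```

This is why the benefits are multiplicative: the two sparsity types compound rather than add.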
  • 18. The Challenges of Sparsity • Dense matrix operations are efficiently handled by hardware • Predictable memory access patterns to contiguous locations • Computations compatible with wide SIMD processing • Sparse matrix operations require efficiently locating, extracting and processing non-zero elements • On many architectures overheads outweigh computational savings • Makes it challenging to exploit SIMD units • Block sparsity attempts to group non-zero elements for more efficient processing • Large block sizes impact obtainable model accuracy • Flexibility of FPGAs is well-suited to irregular nature of sparse computations • Traditional fine-grain sparsity is still a costly proposition
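The irregular access pattern the slide describes is easy to see in a plain compressed-sparse-row (CSR) matrix-vector product (a simplified sketch, not a tuned kernel):

```python
import numpy as np

def to_csr(dense):
    """Compressed Sparse Row: store only non-zeros plus index metadata."""
    indptr, indices, data = [0], [], []
    for row in dense:
        nz = np.flatnonzero(row)
        indices.extend(nz)
        data.extend(row[nz])
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def csr_matvec(indptr, indices, data, x):
    y = np.empty(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        # The gather x[indices[lo:hi]] touches scattered, data-dependent
        # addresses -- this is what defeats caches and wide SIMD units.
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)) * (rng.random((8, 8)) < 0.1)  # ~90% sparse
x = rng.standard_normal(8)
indptr, indices, data = to_csr(A)
```

Each output row now costs index lookups and gathers on top of the useful multiplies, which is why, on many architectures, the overheads outweigh the computational savings.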
  • 19. Numenta’s Sparsity • Numenta has developed high performance sparse techniques • Permit accurate fine-grain sparsity • Significantly reduce processing overheads • Performance scales linearly with degree of sparsity
  • 20. Numenta Sparsity Details • Leverage sparsity patterns that allow computation to be framed as a dense operation • Applicable to both linear and convolutional layers • Light-weight constraint doesn’t impact achievable accuracy • Enforce non-overlapping patterns across multiple sets of kernels or linear-layer weights • Degree of sparsity dictates size of set • Allows multiple kernels/weights to be elegantly combined into a single dense entity • Speedup scales linearly with degree of sparsity • Use sparse static mask training to provide control over placement of non-zero weights • Accurate networks while exploiting extreme sparsity
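A toy version of the non-overlapping trick, for a linear layer (the 4-way partition and sizes are illustrative assumptions, not Numenta's actual scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 16, 4                    # input width; rows combined per dense op
owner = np.arange(n) % k        # row i "owns" every k-th weight position
W = np.zeros((k, n))
for i in range(k):              # 4 rows, each 75% sparse, supports disjoint
    W[i, owner == i] = rng.standard_normal(n // k)

combined = W.sum(axis=0)        # disjoint supports -> lossless overlay

x = rng.standard_normal(n)
prod = combined * x             # ONE dense multiply serves all 4 sparse rows
y_fast = np.array([prod[owner == i].sum() for i in range(k)])
y_ref = W @ x                   # reference: 4 separate sparse dot products
```

The hardware sees a single dense operation, while the stored ownership map routes each partial product back to the right output — and the number of rows combined per dense op (here 4) scales with the degree of sparsity, which is where the linear speedup comes from.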
  • 21. 80% Sparse, 5x5 Kernel
  • 22. Example CNN • Deep Convolutional Neural Network (CNN) • 2 Convolutional layers & 2 linear layers • 2.5M parameters
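For scale, a parameter count in that ballpark is easy to reproduce; the layer sizes below are purely hypothetical, chosen only to land near 2.5M parameters (the slide does not give the actual architecture):

```python
def conv_params(c_in, c_out, k):
    """Parameters in a k x k convolutional layer with bias."""
    return c_out * (c_in * k * k + 1)

def linear_params(n_in, n_out):
    """Parameters in a fully-connected layer with bias."""
    return n_out * (n_in + 1)

# Hypothetical sizes: two conv layers, two linear layers.
total = (conv_params(1, 64, 5)        # 1,664
         + conv_params(64, 64, 5)     # 102,464
         + linear_params(2304, 1000)  # 2,305,000
         + linear_params(1000, 12))   # 12,012
# total is ~2.4M -- the same order as the slide's 2.5M.
```

Note that in such networks nearly all parameters sit in the first linear layer, which is exactly where weight sparsity pays off most.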
  • 23. Google Speech Commands • Trained our CNN on the Google Speech Commands dataset • One-word utterances spoken by thousands of different individuals • Task is to recognize the word being spoken from the audio signal • Model accuracy is ~98% for 10-category classification • Compared dense model performance with 2 sparse networks • Sparsity varied by network layer • Weights 95% sparse • Activations 88% sparse • Accuracy of sparse models is identical to dense model • Additional performance achievable by relaxing accuracy constraints
      Name            Weights   Activations
      Sparse-Dense    Sparse    Dense
      Sparse-Sparse   Sparse    Sparse
  • 24. FPGA Performance • Implemented sparse and dense networks on two Xilinx chips • Chips designed for datacenter and embedded (Internet of Things) applications • Dense model implemented using Vitis AI • Our sparse implementations coded using HLS
                            Alveo U250   Zynq UltraScale+ ZU3EG
      System logic cells    1,728,000    154,000
      Memory                54MB         0.95MB
      DSP slices            12,288       360
      System power          225W         24W
  • 25. Sparse Performance: single network • Order of magnitude performance improvement from sparsity • ~3X additional benefit from adding activation sparsity • Dense network doesn't even fit on ZU3EG • Sparse network on ZU3EG outperforms dense on U250! • Sparse networks make AI-at-the-edge a reality
      FPGA platform   Network type    Throughput (words/sec)   Speedup over dense
      Alveo U250      Dense           3,049                    -
      Alveo U250      Sparse-Dense    35,714                   11.71
      Alveo U250      Sparse-Sparse   102,564                  33.63
      ZU3EG           Dense           0 (doesn't fit)          -
      ZU3EG           Sparse-Dense    21,053                   Infinite
      ZU3EG           Sparse-Sparse   45,455                   Infinite
  • 26. Sparse Performance: full chip • Efficiencies of sparse network implementations allow FPGA to accommodate many more copies • Sparse-Sparse outperforms traditional dense model by 112X • Remember – same accuracy as the original dense model • Sparse models on ZU3EG outperform dense models on U250
      FPGA platform   Network type    Networks on chip   Full-chip throughput (words/sec)   Full-chip speedup
      Alveo U250      Dense           4                  12,195                             -
      Alveo U250      Sparse-Dense    24                 689,655                            56.5
      Alveo U250      Sparse-Sparse   20                 1,369,863                          112.3
      ZU3EG           Dense           0                  0                                  -
      ZU3EG           Sparse-Dense    1                  21,053                             Infinite
      ZU3EG           Sparse-Sparse   1                  45,455                             Infinite
  • 27. FPGA Performance Summary • Flexible, reconfigurable nature of FPGAs makes them an ideal platform for running sparse models [Chart: speedup from sparsity on FPGA relative to dense on a Xilinx U250 — Sparse-Dense and Sparse-Sparse, one network vs. full chip]
  • 28. Power Efficiency • Increased performance does not come at the cost of increased power consumption • Two orders of magnitude reduction in per-inference energy cost
      FPGA platform   System power   Network type    Networks   Words/sec/watt   Relative efficiency
      Alveo U250      225W           Dense           4          54               100%
      Alveo U250      225W           Sparse-Dense    24         3065             5675%
      Alveo U250      225W           Sparse-Sparse   20         6088             11274%
      ZU3EG           24W            Dense           0          0                0
      ZU3EG           24W            Sparse-Dense    1          877              1624%
      ZU3EG           24W            Sparse-Sparse   1          1893             3505%
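The efficiency column is simply throughput divided by system power; reproducing two rows (throughputs taken from the full-chip results earlier):

```python
def words_per_sec_per_watt(throughput, system_power_w):
    """Per-watt efficiency figure used in the power-efficiency table."""
    return throughput / system_power_w

# U250 Sparse-Dense, full chip: 689,655 words/sec at 225W -> ~3065
u250_sparse_dense = words_per_sec_per_watt(689_655, 225)
# ZU3EG Sparse-Dense: 21,053 words/sec at 24W -> ~877
zu3eg_sparse_dense = words_per_sec_per_watt(21_053, 24)
```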
  • 29. CPU Performance • How do other compute platforms handle sparsity? • Create a 95% sparse model for CPUs • Only leverage weight sparsity (sparse-dense) • Investigate different CPU model inference engines • Microsoft and Intel runtimes don't leverage sparsity • Performance improvements at best around 3X [Chart: speedup on CPUs from 95% sparsity — OpenVino, OnnxRuntime, TVM, DeepSparse]
  • 30. FPGA vs CPU • Compare sparse performance on CPUs and FPGAs • Using a 24-core (48 hardware thread) Intel Xeon 8275CL • AWS c5.12xlarge [Chart: samples/second, single-thread and full-chip — CPU-OpenVino, CPU-OnnxRuntime, CPU-TVM, CPU-DeepSparse, Numenta-SD, Numenta-SS; U250 is 12X faster]
  • 31. Sparse ResNets • Created a sparse version of ResNet-50 • Trained on ImageNet • Training respects MLPerf optimization constraints • Accuracy results below • FPGA implementation in progress
      Network                    Sparsity   Accuracy (float32)   Accuracy (int8)   Quantization impact
      MLPerf benchmark, dense    0%         76.7%                75.7%             -1.00%
      NVIDIA, static sparsity    50%        76.8%                -                 -
      Ours, static sparsity      75%        76.22%               74.67%            -1.55%
      Ours, dynamic sparsity     75%        77.1%                76.77%            -0.33%
  • 32. Early Results • Significant performance benefits • Weight and activation sparsity • Constrained by MLPerf optimization restrictions
  • 33. Generalized Sparse Support • Creating a general-purpose library for implementing sparse neural networks on FPGAs • Exploiting both weight and activation sparsity • Supports linear layers and the full spectrum of convolutional kernels • In addition to other support functions (pooling, etc.) • Degree of sparsity and performance requirements are parameterizable settings
  • 34. Have Your Cake and Eat It Too? • Sparse networks deliver improved performance • But, done correctly, they can also deliver: • Improved robustness to noise • Improved generalization * How Can We Be So Dense? The Benefits of Using Highly Sparse Representations, Ahmad & Scheinkman
  • 35. The Blessing of Dimensionality • Advantageous to maximize sparsity to boost performance • Achievable accuracy decreases at extreme sparsity • Increasing width of network while holding parameter count constant increases achievable accuracy
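One intuition for the "blessing of dimensionality" (our illustration, not taken from the slide): with a fixed number of active units, the number of distinct sparse patterns a layer can represent grows combinatorially with its width:

```python
from math import comb

# Distinct activation patterns with exactly 4 active units:
narrow = comb(64, 4)    # 635,376
wide = comb(256, 4)     # ~175 million -- same activity, far more capacity
```

Widening the layer while holding the non-zero parameter count constant therefore buys representational capacity essentially for free, which is consistent with the accuracy gains the slide reports.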
  • 36. Conclusions • Sparsity represents a powerful technique to reduce the computational costs of AI • Sparse networks also provide improved robustness to noise • It is frequently problematic to efficiently exploit sparsity in hardware • Speedups are sometimes as little as 3X • Numenta's sparsity techniques allow efficient execution on general-purpose hardware • Linear speedup with degree of sparsity • Multiplicative effect of exploiting both weight and activation sparsity • The flexibility of FPGAs makes them an ideal platform for high-performance AI • Numenta demonstrated 100X speedup from sparsity on a Xilinx U250 • Creating drag-and-drop support for sparsity on FPGAs!
  • 37. Acknowledgements Lead Architect on Numenta’s FPGA Sparse Networks • Kevin Hunter Numenta’s research team
  • 38. Let us know! • Interested in sparse networks on FPGAs? • Let us know!!