Manish Pandey
May 13, 2019
Keynote Talk
Artificial Intelligence: Driving the Next Generation of
Chips and Systems
2019
Tel Aviv, Israel
Revolution in Artificial Intelligence
Intelligence Infused from the Edge to the Cloud
Fueled by advances in AI/ML
[Charts: growth in AI/ML papers relative to 1996, 2000–2015; error rate in image classification falling (accuracy increasing), 2010–2017]
AI Revolution Enabled by Hardware
AI driven by advances in hardware
[Images: Chihuahua image-classification example]
Deep Learning Advances Gated by Hardware
[Bill Dally, SysML 2018]
Deep Learning Advances Gated by Hardware
• Results improve with
o Larger Models
o Larger Datasets
⇒ More Computation
Data Set and Model Size Scaling
[Hestness et al., arXiv:1712.00409]
• 256× data
• 64× model size
⇒ Compute: 32,000×
How Much Compute?
12 HD Cameras
Inferencing**
• 25 million weights
• 300 Gops per HD image
• 9.4 Tops for 30 fps
• 12 cameras, 3 nets = 338 Tops
Training
• 30 Tops × 10^8 (training set) × 10^2 (epochs) ≈ 10^23 Ops
**ResNet-50
[Bill Dally, SysML 2018]
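The slide's arithmetic can be sanity-checked in a few lines of Python (the per-camera and training constants are taken from the figures above; "ops" here loosely counts multiply-accumulates):

```python
# Sanity check of the compute figures above (constants from the slide).
tops_per_camera = 9.4e12          # ResNet-50 on HD frames at 30 fps
cameras, nets = 12, 3
inference_ops_per_s = tops_per_camera * cameras * nets
print(f"{inference_ops_per_s / 1e12:.0f} Tops")   # 338 Tops

# Training cost: per-pass compute x dataset size x epochs
training_ops = 30e12 * 1e8 * 1e2  # 30 Tops x 10^8 images x 10^2 epochs
print(f"{training_ops:.0e} ops")  # 3e+23, i.e. ~10^23
```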
Challenge: End of Line for CMOS Scaling
• Device scaling down slowing
• Power Density stopped scaling in 2005
[Olukotun, Horowitz][IMEC]
Dennard Scaling to Dark Silicon
Can we specialize designs for DNNs?
Taylor, "A landscape of the new dark silicon design regime." IEEE Micro 33.5 (2013): 8-19
Energy Cost for Operations
[Chart: energy per operation; instruction fetch/decode ≈ 70 pJ]
[Horowitz ISSCC 2014]
*45nm
Energy Cost for DNN Ops
Y = AB + C

Instruction           Location  Data Type  Energy   Energy/Op
1 Op/Instr (*,+)      Memory    fp32       89 nJ    693 pJ
128 Ops/Instr (AX+B)  Memory    fp32       72 nJ    562 pJ
128 Ops/Instr (AX+B)  Cache     fp32       0.87 nJ  6.82 pJ
128 Ops/Instr (AX+B)  Cache     fp16       0.47 nJ  3.68 pJ
*45nm
Build a processor tuned to the application – specialize for DNNs
• More effective parallelism for AI operations (not ILP)
– SIMD vs. MIMD
– VLIW vs. speculative, out-of-order
• Eliminate unneeded accuracy
– IEEE FP replaced by lower-precision floating point
– 32–64-bit integers reduced to 8–16-bit integers
• More effective use of memory bandwidth
– User-controlled memory versus caches
How many TeraOps per Watt?
2-D Grid of Processing Elements
Systolic Arrays
• Balance compute and I/O
• Local Communication
• Concurrency
• M1·M2: n×n matrix multiply in ~2n cycles
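A toy cycle-level simulation (a sketch, not any specific accelerator's design) shows why a systolic array finishes an n×n multiply in O(n) cycles rather than O(n³) sequential steps: operand pair (A[i][k], B[k][j]) reaches PE (i, j) at cycle t = i + j + k, so the last PE finishes at cycle 3n − 2.

```python
def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary n x n systolic array.

    A streams in from the left (row i delayed by i cycles), B from the
    top (column j delayed by j cycles); PE (i, j) multiply-accumulates
    the pair of values passing through it each cycle.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    cycles = 3 * n - 2                # last PE done at t = 2(n-1) + (n-1)
    for t in range(cycles):
        for i in range(n):
            for j in range(n):
                k = t - i - j         # operand pair reaching PE(i,j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C, cycles
```

For n = 2 this completes in 4 cycles, matching the ~2n figure; with fill/drain overlapped across back-to-back multiplies the steady-state cost approaches 2n.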
TPUv2
• 8GB + 8GB HBM
• 600 GB/s Mem BW
• 32b float
• MXU fp32 accumulation
• Reduced precision mult.
[Dean NIPS’17]
Datatypes – Reduced Precision
16-bit accuracy matches 32 bits!
Stochastic rounding: 2.2 rounds to 2 with probability 0.8
and to 3 with probability 0.2
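Stochastic rounding fits in a few lines: the rounding direction is drawn with probability equal to the fractional part, so the result is unbiased in expectation, a property plain round-to-nearest lacks when accumulating many small updates:

```python
import math
import random

def stochastic_round(x, rng=random):
    """Round x down or up with probability given by its fractional part,
    so E[stochastic_round(x)] == x (e.g. 2.2 -> 2 w.p. 0.8, 3 w.p. 0.2)."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < x - lo else 0)
```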
Integer Weight Representations
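A generic way to represent weights as integers is uniform affine quantization; the sketch below (an illustrative scheme, not a specific paper's recipe) maps fp32 weights to 8-bit unsigned integers plus one shared scale and zero point:

```python
def quantize_int8(weights):
    """Uniform affine quantization of fp32 weights to 8-bit unsigned
    integers (illustrative sketch): w ~ (q - zero_point) * scale."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 255 or 1.0        # avoid 0 for constant weights
    zero_point = round(-w_min / scale)          # integer mapped to w == 0
    q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate fp32 values from the integer codes."""
    return [(qi - zero_point) * scale for qi in q]
```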
Extreme Datatypes
• Q: What can we do with single-bit or ternary (+a, −b, 0) representations?
• A: Surprisingly, a lot. Just retrain and/or increase the number of activation layers.
[Zhu et al., arXiv:1612.01064]
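A minimal ternarization sketch, in the spirit of trained ternary quantization (the threshold heuristic and per-sign scales here are illustrative, not the paper's exact rule):

```python
def ternarize(weights, threshold=0.05):
    """Map each weight to {+a, 0, -b}: small weights become 0, larger
    ones snap to the mean magnitude of their sign group (illustrative)."""
    pos = [w for w in weights if w > threshold]
    neg = [w for w in weights if w < -threshold]
    a = sum(pos) / len(pos) if pos else 0.0     # positive scale
    b = -sum(neg) / len(neg) if neg else 0.0    # negative scale
    out = []
    for w in weights:
        if w > threshold:
            out.append(a)
        elif w < -threshold:
            out.append(-b)
        else:
            out.append(0.0)
    return out
```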
Pruning – How many values?
~90% of weight values can be thrown away
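Magnitude pruning is the simplest version of this idea: zero out the smallest-magnitude weights and keep the rest. A minimal sketch (illustrative; production flows retrain after pruning to recover accuracy):

```python
def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest `sparsity` fraction of weights by |value|."""
    k = int(len(weights) * sparsity)            # how many weights to drop
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])                       # indices of smallest weights
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```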
Quantization – How many distinct values?
16 values ⇒ 4-bit representation
Instead of 16 fp32 numbers, store 16 4-bit indices into a shared codebook
[Han et al., 2015]
Memory Locality
[Diagram: energy costs across a 20 mm die – compute, weights, interconnect]
• 32-bit op: 4 pJ
• 256-bit access to an 8 KB cache: 50 pJ
• 256-bit bus, local: 25 pJ
• 256-bit bus, across chip: 256 pJ
• DRAM, 32-bit access: 640 pJ
Bring Compute Closer to Memory
Memory and inter-chip communication advances
[Graphcore 2018]
• 64 pJ/B: 16 GB, 900 GB/s @ 60 W
• 10 pJ/B: 256 MB, 6 TB/s @ 60 W
• 1 pJ/B: 1000 × 256 KB, 60 TB/s @ 60 W
Memory power density is ~25% of logic power density
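The three tiers are consistent with a fixed ~60 W memory power budget: bandwidth × energy-per-byte comes out to roughly 60 W at every level, so cheaper bytes buy proportionally more bandwidth. A quick check:

```python
# Bandwidth x energy-per-byte should hit the ~60 W budget at each tier.
tiers = [
    (64e-12, 900e9),   # 64 pJ/B at 900 GB/s
    (10e-12, 6e12),    # 10 pJ/B at 6 TB/s
    (1e-12, 60e12),    # 1 pJ/B at 60 TB/s
]
for pj_per_byte, bandwidth in tiers:
    watts = pj_per_byte * bandwidth
    print(f"{watts:.1f} W")        # 57.6 W, 60.0 W, 60.0 W
```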
In-Memory Compute
[Ando et al., JSSC 04/18]
In-memory compute with low-bits/value, low cost ops
AI Application-System Co-design
AI Application-System Co-design
[Reagen et al, Minerva, ISCA 2016]
• Co-design across the algorithm, architecture, and circuit levels
• Optimize DNN hardware accelerators across multiple datasets
Where Next?
• Rise of non-von Neumann architectures to efficiently solve domain-specific problems
– A new golden age of computer architecture
• Advances in semiconductor technology and computer architecture will continue to drive advances in AI
• 10 Tops/W is the current state of the art (~14 nm)
• Several orders of magnitude gap remains with the human brain
Thank You

ChipEx 2019 keynote
