Systolic Array
https://www.appliedimage.com/reference-info/using-nbs-1010a-resolution-test-target/
Dark Silicon
https://www.publicdomainpictures.net/en/view-image.php?image=44607&picture=portrait-of-the-dark-sides-man
Systolic Array
Systolic array (脈衝陣列, literally "pulse array")?!
Systolic Array
• Easy to describe in a software language.

• Easy to program with some kind of domain-specific language.

• Elegant

• Layout friendly
Memory Bandwidth
grows slowly
Spec   Year   MB/s
DDR    2000    2667
DDR2   2003    5333
DDR3   2007   12800
DDR4   2014   19200
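To put a number on "grows slowly", here is a minimal sketch that computes the compound annual growth rate between the first and last rows of the table. The helper name `annual_growth` is made up for illustration; the figures are the ones on the slide.

```python
def annual_growth(first_year, first_value, last_year, last_value):
    """Compound annual growth rate between two (year, value) points."""
    years = last_year - first_year
    return (last_value / first_value) ** (1 / years) - 1

# Peak transfer rates from the table above, in MB/s.
specs = [
    ("DDR", 2000, 2667),
    ("DDR2", 2003, 5333),
    ("DDR3", 2007, 12800),
    ("DDR4", 2014, 19200),
]

rate = annual_growth(specs[0][1], specs[0][2], specs[-1][1], specs[-1][2])
print(f"{rate:.1%} per year")
```

With these figures the rate comes out at roughly 15% per year, a factor of about 7 over 14 years.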
Increasing Operations / IO
H. T. Kung 1982
Convolution Problem
y_1 = y_1 + w_1 x_1
ϵ = ϵ + w_2 x_1
ϵ = ϵ + w_3 x_1
ϵ = ϵ + w_4 x_1
y_2 = y_2 + w_1 x_2
y_1 = y_1 + w_2 x_2
ϵ = ϵ + w_3 x_2
ϵ = ϵ + w_4 x_2
y_3 = y_3 + w_1 x_3
y_2 = y_2 + w_2 x_3
y_1 = y_1 + w_3 x_3
ϵ = ϵ + w_4 x_3
y_4 = y_4 + w_1 x_4
y_3 = y_3 + w_2 x_4
y_2 = y_2 + w_3 x_4
y_1 = y_1 + w_4 x_4
y_5 = y_5 + w_1 x_5
y_4 = y_4 + w_2 x_5
y_3 = y_3 + w_3 x_5
y_2 = y_2 + w_4 x_5
output: y_1
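The update schedule listed above can be sketched as a small simulation. This is only an illustration: it assumes each time step broadcasts one input to every cell, that cell j keeps weight w_(j+1), and that the ϵ rows are updates with no valid output, which the sketch skips. The function name and 0-based indexing are choices made here, not something from the slides.

```python
def broadcast_convolution(weights, xs):
    """Compute y_i = sum_j weights[j] * xs[i + j] using the broadcast schedule."""
    k = len(weights)
    n_out = len(xs) - k + 1           # number of complete outputs
    ys = [0] * n_out
    for t, x in enumerate(xs):        # time: one input is broadcast per step
        for j, w in enumerate(weights):   # space: cell j holds weights[j]
            i = t - j                 # output that cell j updates at step t
            if 0 <= i < n_out:        # skip updates without a complete output
                ys[i] += w * x
    return ys

print(broadcast_convolution([1, 2, 3, 4], [1, 2, 3, 4, 5]))
```

In this sketch the first output is complete once the fourth input has been broadcast, which matches the point where the listing accumulates w_4 x_4 into y_1.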
time(broadcast)
space(move)
H. T. Kung 1982
Better precision of summation if the MAC has more digits than the bus
Requires a separate bus for collecting output from individual cells
Adder tree (could be pipelined)
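A minimal sketch of how an adder-tree design might compute the same convolution, assuming this slide refers to the fan-in arrangement in Kung 1982 where weights stay in the cells, inputs shift through them, and all products are summed each cycle. The plain `sum` call stands in for the adder tree, and the names are illustrative.

```python
def fan_in_convolution(weights, xs):
    """Weights stay put, inputs shift one cell per cycle, products are summed."""
    k = len(weights)
    regs = [0] * k                    # regs[j] is the input currently held by cell j
    ys = []
    for t, x in enumerate(xs):
        regs = regs[1:] + [x]         # inputs move one cell; the newest enters last
        if t >= k - 1:                # every cell now holds a real input
            # The adder tree: sum all k products in the same cycle.
            ys.append(sum(w * r for w, r in zip(weights, regs)))
    return ys

print(fan_in_convolution([1, 2, 3, 4], [1, 2, 3, 4, 5]))
```

A hardware adder tree would add a pipeline latency of about log2(k) stages, which this sequential sketch does not model.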
Without global data communication
Better precision of summation (same as B2)
Systolic output path (or use the next row in 2D)
Nodes are active only half of the time.
[Figure: inputs x1..x4 and weights w1..w3 moving through the cells over cycles 0 to 6]
Half of the nodes are active at any given time.
Without global data communication
1 node / cycle
1 node / 2 cycles
Register to keep w
Better precision of summation 

[Figure: weights w1..w3 and inputs x1..x8 moving through the cells; outputs y4 and y5 shown]
Sorting
Odd-Even Transposition Sort
Active
Compare & Swap
O((n/k)log(n/k)) + O(k(n/k))
H. T. Kung 1979
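A minimal sketch of odd-even transposition sort, assuming one value per cell (the k = n case rather than the blocked variant the cost formula describes). Each phase activates either the even or the odd neighbour pairs, and every active pair does one compare-and-swap. The inner loop runs sequentially here, but its pairs are disjoint and could all run in the same cycle.

```python
def odd_even_transposition_sort(values):
    """Sort by alternating compare-and-swap phases on even and odd pairs."""
    cells = list(values)
    n = len(cells)
    for phase in range(n):                 # n phases are enough for n cells
        start = phase % 2                  # even phase: (0,1),(2,3)...; odd: (1,2),(3,4)...
        for i in range(start, n - 1, 2):   # active pairs are disjoint
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))
```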
Finite Impulse
Response Filtering
In Matrix Form
H. T. Kung 1979
Priority Queue
• insert()

• delete()

• extract_min()
Priority Queue Operations
For n operations: from O(n log n) to O(n)
Key: one operation can be issued after another in O(1) time.
Priority Queue Operations
insert(k)
A) Sink down the element with key k.

delete(k)
A) Sink down a fake element with key k to find the target.
B) Remove the target.
C) Bubble up the elements below it.

extract_min()
A) Take the first element.
B) Bubble up the elements below it.
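The three operations above can be sketched as a sequential simulation of a linear array of cells. This is an illustration under assumptions, not the design from the slides: the class name is hypothetical, and each loop walks the cells one by one. In a systolic array the per-cell steps would overlap, so a new operation could be issued while the previous one is still sinking.

```python
class LinearArrayPriorityQueue:
    """Cells hold keys in non-decreasing order; cells[0] is the head."""

    def __init__(self):
        self.cells = []

    def insert(self, key):
        carried = key
        for i in range(len(self.cells)):
            # Each cell keeps the smaller key and passes the larger one down.
            if carried < self.cells[i]:
                self.cells[i], carried = carried, self.cells[i]
        self.cells.append(carried)

    def delete(self, key):
        # A probe with the same key sinks until it meets the target.
        for i, held in enumerate(self.cells):
            if held == key:
                del self.cells[i]      # removing it shifts the later cells up
                return True
            if held > key:
                break                  # the probe has passed where key would be
        return False

    def extract_min(self):
        if not self.cells:
            raise IndexError("extract_min from an empty queue")
        return self.cells.pop(0)       # take the head; the rest bubble up


queue = LinearArrayPriorityQueue()
for key in [5, 1, 4, 2]:
    queue.insert(key)
print(queue.extract_min(), queue.delete(4), queue.cells)
```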
Recurrence Evaluation
x_i = R(x_{i-1}, ..., x_{i-k})
x_i = a x_{i-1} + b x_{i-2} + c x_{i-k} + d
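A small sketch of evaluating a linear recurrence of this form. It assumes the coefficients are given in order for x_{i-1} down to x_{i-k}, and that the first k values are supplied by the caller. The function name and argument layout are illustrative.

```python
def evaluate_recurrence(coeffs, d, initial, n):
    """Return the first n terms of x_i = sum_j coeffs[j] * x_{i-1-j} + d."""
    k = len(coeffs)
    if len(initial) < k:
        raise ValueError("need at least k initial values")
    xs = list(initial)
    while len(xs) < n:
        recent = xs[-1:-k - 1:-1]      # x_{i-1}, x_{i-2}, ..., x_{i-k}
        xs.append(sum(c * x for c, x in zip(coeffs, recent)) + d)
    return xs[:n]

# Fibonacci as a k = 2 instance: x_i = x_{i-1} + x_{i-2}.
print(evaluate_recurrence([1, 1], 0, [1, 1], 7))
```

Each new term depends on the k previous ones, which is the loop-carried dependence that a systolic design has to pipeline around.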
Removing Loops
Alternatives
Cloud TPU
Google Cloud Platform Blog 

https://cloud.google.com/tpu/
TPU V3, TPU V2
TPU V2 Pod
TPU Programming
• A Cloud TPU has 4 chips x 2 cores x 1 or 2 MXUs

• MXU

• 128x128 systolic array

• 16K MACs / cycle

• bfloat16

• TPU memory prefers 8-byte alignment.

• 8 or 16 GB HBM2 / core
https://cloud.google.com/tpu/docs/tpus

https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/
Titan X has 3.5K CUDA cores
So, each TPU V3 card has
4 chips x 2 cores x 2 MXU x 16K MAC / cycle
= 256K MAC / cycle at most.
https://cloud.google.com/tpu/docs/system-architecture
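The arithmetic behind the 256K figure, written out as a quick check. The function and parameter names are made up for illustration; the default values are the ones on the slides (4 chips, 2 cores per chip, 2 MXUs per core, a 128x128 array per MXU).

```python
def peak_macs_per_cycle(chips=4, cores_per_chip=2, mxus_per_core=2, array_dim=128):
    """Upper bound on multiply-accumulates per cycle for one device."""
    macs_per_mxu = array_dim * array_dim        # 128 x 128 = 16384 = 16K
    return chips * cores_per_chip * mxus_per_core * macs_per_mxu

total = peak_macs_per_cycle()
print(total, total // 1024)   # 262144 MACs per cycle, i.e. 256K
```

This is a peak figure: it assumes every cell of every MXU does useful work in every cycle.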
TPU Programming
• XLA compiler for TensorFlow programs.

• Tiling => tensors need to be reshaped

• Static shapes => no dynamic batch size

• Padding => underutilizes the TPU and uses more memory

• op_profile tool
TPU Programming
• Dense vector and matrix computations are fast

• Matrix x matrix, matrix x vector, convolution

• Data movement on PCIe is slow.

• Only the dense parts of the model, loss, and gradient subgraphs run on the TPU.

• I/O, reading data, writing checkpoints, and preprocessing data run on the CPU.

• Decoding compressed images, random sampling/cropping, assembling training minibatches

• Non-matrix operations will likely not achieve high MXU utilization.

• add, reshape, or concatenate

• Feature dimension => multiple of 128

• Batch dimension => multiple of 8
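A rough sketch of what the two sizing rules above cost when a shape does not fit, reading them as "feature dimension a multiple of 128" and "batch dimension a multiple of 8". The helper names are hypothetical, and the real layout chosen by XLA may differ. This only shows how padding wastes work.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple

def padded_shape(batch, features, batch_multiple=8, feature_multiple=128):
    """Shape after padding each dimension up to its tile multiple."""
    return round_up(batch, batch_multiple), round_up(features, feature_multiple)

def useful_fraction(batch, features):
    """Share of the padded tensor that holds real data."""
    padded_batch, padded_features = padded_shape(batch, features)
    return (batch * features) / (padded_batch * padded_features)

print(padded_shape(10, 200), useful_fraction(10, 200))
```

In this example a batch of 10 with 200 features pads to 16 x 256, so under half of the padded tensor holds real data. Choosing a batch of 16 and 256 features would avoid the padding entirely.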
TPUEstimator
• TPUEstimator provides a graph operator to build and run
a replicated computation
https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimator
Module: tf.contrib.tpu
https://www.tensorflow.org/api_docs/python/tf/contrib/tpu
Affinity
https://en.wikipedia.org/wiki/The_Boss_Baby

Why Systolic Architectures