In-Datacenter Performance Analysis of a
Tensor Processing Unit™
6th May, 2018
PR12 Paper Review
Jinwon Lee
Samsung Electronics
References
Most figures and slides are from
 Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor
Processing Unit", 44th IEEE/ACM International Symposium on Computer
Architecture (ISCA-44), Toronto, Canada, June 2017.
https://arxiv.org/abs/1704.04760
 David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural
Network Accelerator for the Datacenter", NAE Regional Meeting, April 2017.
https://sites.google.com/view/naeregionalsymposium
 Kaz Sato, "An in-depth look at Google's first Tensor Processing Unit (TPU)",
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Authors
A Golden Age in Microprocessor Design
• Stunning progress in microprocessor design: 40 years ≈ 10⁶x faster!
• Three architectural innovations (~1000x)
 Width: 8→16→32→64 bit (~8x)
 Instruction level parallelism:
4-10 clock cycles per instruction to 4+ instructions per clock cycle (~10-20x)
 Multicore: 1 processor to 16 cores (~16x)
• Clock rate: 3 to 4000 MHz (~1000x through technology & architecture)
• Made possible by IC technology:
 Moore's Law: growth in transistor count (2X every 1.5 years)
 Dennard Scaling: power/transistor shrinks at the same rate as transistors are
added (constant power per mm² of silicon)
End of Growth of Performance?
What’s Left?
• Since
 Transistors not getting much better
 Power budget not getting much higher
 Already switched from 1 inefficient processor/chip to N efficient
processors/chip
• Only path left is Domain-Specific Architectures
 Just do a few tasks, but extremely well
TPU Origin
• Starting as far back as 2006, Google engineers discussed deploying GPUs,
FPGAs, or custom ASICs in their data centers. They concluded that the excess
capacity of their large data centers could cover the demand at the time.
• The conversation changed in 2013, when it was projected that if people
used voice search for 3 minutes a day via speech recognition DNNs, Google's
data centers would have to double to meet the computation demand.
• Google then started a high-priority project to quickly produce a
custom ASIC for inference.
• The goal was to improve cost-performance by 10x over GPUs.
• Given this mandate, the TPU was designed, verified, built, and
deployed in data centers in just 15 months.
TPU
• Built on a 28nm process
• Runs at 700 MHz
• Consumes 40 W when running
• Connected to its host via a
PCIe Gen3 x16 bus
• The TPU card slots in place of a disk
• Up to 4 cards per server
3 Kinds of Popular NNs
• Multi-Layer Perceptrons (MLP)
 Each new layer is a set of nonlinear functions of a weighted sum of all
outputs (fully connected) from the prior layer
• Convolutional Neural Networks (CNN)
 Each ensuing layer is a set of nonlinear functions of weighted sums of
spatially nearby subsets of outputs from the prior layer, which also reuse
the weights
• Recurrent Neural Networks (RNN)
 Each subsequent layer is a collection of nonlinear functions of weighted sums
of outputs and the previous state. The most popular RNN is Long Short-Term
Memory (LSTM).
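As a quick illustration (not from the paper), here is a minimal numpy sketch of one layer of each type; all sizes and the choice of ReLU/tanh nonlinearities are arbitrary:

```python
import numpy as np

relu = lambda v: np.maximum(v, 0)

# MLP layer: nonlinear function of a weighted sum of ALL prior outputs.
x = np.random.rand(256)                  # outputs of the prior layer
W = np.random.rand(128, 256)             # fully connected weights
h_mlp = relu(W @ x)

# CNN layer: weighted sums over spatially nearby subsets, reusing one kernel.
img = np.random.rand(32, 32)
k = np.random.rand(3, 3)                 # the same 3x3 weights everywhere
h_cnn = np.array([[relu(np.sum(img[i:i + 3, j:j + 3] * k))
                   for j in range(30)] for i in range(30)])

# RNN step: weighted sum of the current input AND the previous state.
W_x, W_h = np.random.rand(64, 256), np.random.rand(64, 64)
h_prev = np.zeros(64)
h_rnn = np.tanh(W_x @ x + W_h @ h_prev)
```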
Inference Datacenter Workload (95%)
TPU Architecture and Implementation
• Added as an accelerator to existing servers
 Connects over the I/O bus ("PCIe")
 TPU ≈ matrix accelerator on the I/O bus
• The host server sends it instructions, as it would to a Floating Point Unit
 Unlike a GPU, which fetches and executes its own instructions
• The goal was to run whole inference models in the TPU to reduce
interactions with the host CPU and to be flexible enough to match
the NN needs of 2015 and beyond
TPU Block Diagram
TPU High Level Architecture
• The Matrix Multiply Unit is the heart of the TPU
 65,536 (256×256) 8-bit MAC units
 The matrix unit holds one 64 KiB tile of weights
plus one for double-buffering
 >25x as many MACs vs GPU, >100x as many MACs vs CPU
• Peak performance: 92 TOPS = 65,536 × 2 × 700 MHz
• The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below
the matrix unit
 The 4 MiB represents 4096 256-element, 32-bit accumulators
 Operations/byte at peak performance ≈ 1350 → round up to 2048 → double
buffering → 4096
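The slide's arithmetic can be checked directly; a small sketch (numbers from the paper, variable names mine):

```python
macs = 256 * 256                       # 65,536 8-bit MAC units
clock_hz = 700e6
peak_ops = macs * 2 * clock_hz         # multiply + add = 2 ops per MAC
print(peak_ops / 1e12)                 # ~91.8 -> "92 TOPS"

# Accumulator sizing: MACs per weight byte needed to sustain peak,
# given 34 GB/s of weight bandwidth (two DDR3-2133 channels).
print(macs * clock_hz / 34e9)          # ~1350 -> round up to 2048,
                                       # double-buffer -> 4096 accumulators
```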
TPU High Level Architecture
• The weights for the matrix unit are staged
through an on-chip Weight FIFO that reads
from an off-chip 8 GiB DRAM called Weight Memory
 Two 2133 MHz DDR3 DRAM channels
 For inference, weights are read-only
 8 GiB supports many simultaneously active models
• The intermediate results are held in the 24 MiB on-chip Unified Buffer,
which can serve as inputs to the Matrix Unit
 The 24 MiB size was picked in part to match the pitch of the Matrix Unit on
the die and, given the short development schedule, in part to simplify the
compiler
Floorplan of TPU Die
• The Unified Buffer is
almost a third of the die
• Matrix Multiply Unit is a
quarter
• Control is just 2%
RISC, CISC and theTPU Instruction Set
• Most modern CPUs are heavily influenced by the Reduced Instruction
Set Computer (RISC) design style
 With RISC, the focus is to define simple instructions (e.g., load, store, add
and multiply) that are commonly used by the majority of applications, and
then to execute those instructions as fast as possible
• A Complex Instruction Set Computer (CISC) design focuses on
implementing high-level instructions that run more complex tasks
(such as calculating multiply-and-add many times) with each
instruction
 The average clock cycles per instruction (CPI) of these CISC instructions is
typically 10 to 20
• The TPU chose the CISC style
TPU Instructions
• It has about a dozen instructions overall; the five below are the key ones
TPU Instructions
• The CISC MatrixMultiply instruction is 12 bytes
 3 bytes are the Unified Buffer address; 2 are the accumulator address; 4 are
the length (sometimes 2 dimensions for convolutions); and the rest are the
opcode and flags (a hypothetical encoding sketch follows this slide)
• Average clock cycles per instruction: >10
• 4-stage overlapped execution, 1 instruction type per stage
 Executes other instructions while the matrix multiplier is busy
• Complexity pushed into SW
 No branches, in-order issue, SW-controlled buffers, SW-controlled pipeline
synchronization
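The paper gives the field sizes of the 12-byte MatrixMultiply instruction but not the exact encoding, so the field order, endianness, and the 1-byte-opcode/2-byte-flags split below are a hypothetical sketch for illustration only:

```python
import struct

def encode_matrix_multiply(ub_addr, acc_addr, length, opcode, flags):
    # 3B Unified Buffer addr | 2B accumulator addr | 4B length
    # | 1B opcode | 2B flags  (assumed layout, 12 bytes total)
    return (ub_addr.to_bytes(3, "little")
            + struct.pack("<HIBH", acc_addr, length, opcode, flags))

insn = encode_matrix_multiply(ub_addr=0x000100, acc_addr=0x0010,
                              length=256, opcode=0x02, flags=0)
assert len(insn) == 12
```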
Systolic Execution in Matrix Array
• Problem: reading a large SRAM uses much more power than arithmetic
• Solution: use "systolic execution" to save energy by reducing reads
and writes of the Unified Buffer
• A systolic array is a two-dimensional collection of arithmetic units
that each independently compute a partial result as a function of
inputs from other arithmetic units that are considered upstream of it
• It is similar to blood being pumped through the human circulatory
system by the heart, which is the origin of the "systolic" name
Systolic Array (Example: vector input)
Systolic Array (Example: matrix input)
TPU Systolic Array
• In the TPU, the systolic array is
rotated
• Weights are loaded from the top,
and the input data flows into the
array from the left
• Weights are preloaded and take
effect with the advancing wave
alongside the first data of a new
block (see the simulation sketch below)
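To make the timing concrete, here is a toy cycle-level simulation of a weight-stationary systolic array (my sketch, not Google's design): PE (i, j) holds W[i, j], activations enter from the left with a one-cycle skew per row (the "advancing wave"), and partial sums flow down to accumulators at the bottom:

```python
import numpy as np

def systolic_matvec(W, x):
    """Computes y[j] = sum_i x[i] * W[i, j] with one MAC per PE per cycle."""
    n = len(x)
    a = np.zeros((n, n))     # activation registers (values flow rightward)
    p = np.zeros((n, n))     # partial-sum registers (values flow downward)
    y = np.zeros(n)
    for t in range(2 * n):
        a_in = np.zeros((n, n))
        a_in[:, 1:] = a[:, :-1]          # take activation from left neighbour
        if t < n:
            a_in[t, 0] = x[t]            # skewed injection: row i at cycle i
        p_in = np.zeros((n, n))
        p_in[1:, :] = p[:-1, :]          # take partial sum from the PE above
        a, p = a_in, p_in + a_in * W     # every PE does one MAC, then latches
        for j in range(n):
            if t == (n - 1) + j:         # column j drains at cycle n-1+j
                y[j] = p[n - 1, j]
    return y

W, x = np.random.rand(4, 4), np.random.rand(4)
assert np.allclose(systolic_matvec(W, x), x @ W)
```

In this scheme each input value is read from the Unified Buffer once; everything else moves register-to-register inside the array, which is where the energy saving comes from.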
Software Stack
• The software stack is split into a User Space
Driver and a Kernel Driver.
• The Kernel Driver is lightweight and
handles only memory management
and interrupts.
• The User Space Driver changes
frequently. It sets up and controls TPU
execution, reformats data into TPU
order, translates API calls into TPU
instructions, and turns them into an
application binary.
Relative Performances : 3 Contemporary Chips
*TPU is less than half the die size of the Intel Haswell processor
• K80 and TPU are built on a 28nm process; Haswell is fabbed in an Intel
22nm process
• These chips and platforms were chosen for comparison because they are
widely deployed in Google data centers
Relative Performance : 3 Platforms
• These chips and platforms were chosen for comparison because they are
widely deployed in Google data centers
Performance Comparison
• Roofline performance model
 This simple visual model is not perfect, yet
it offers insights into the causes of
performance bottlenecks
 The Y-axis is performance in floating-point
operations per second, so the peak
computation rate forms the "flat" part of
the roofline
 The X-axis is operational intensity,
measured as floating-point operations per
DRAM byte accessed
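The roofline itself is just a min() of two ceilings. A sketch using the TPU's numbers, with intensity counted in MACs per weight byte as in the slides (the sample intensities are made up):

```python
def attainable(peak, bw, oi):
    # Roofline: min(compute roof, memory roof = bandwidth * intensity)
    return min(peak, bw * oi)

peak_macs = 65536 * 700e6          # 45.9e12 MACs/s ("92 TOPS" as 2-op MACs)
bw = 34e9                          # weight-memory bandwidth, bytes/s
for oi in (100, 1000, 3000):       # illustrative operational intensities
    print(oi, attainable(peak_macs, bw, oi) / 1e12, "T MACs/s")
```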
TPU Die Roofline
• The TPU has a long "slanted" part of
its roofline, where performance is
limited by memory bandwidth rather
than by peak compute
• Five of the six applications are
happily bumping their heads against
the ceiling
• MLPs and LSTMs are memory bound;
CNNs are computation bound
CPU & GPU Rooflines
Log Rooflines for CPU, GPU and TPU
Linear Rooflines for CPU, GPU and TPU
Why So Far Below Rooflines? (MLP0)
• Response time is the reason
• Researchers have demonstrated that even small increases in response
time cause customers to use a service less
• Inference favors latency over throughput
TPU & GPU Relative Performance to CPU
• GM: Geometric Mean
• WM: Weighted Mean
Performance/Watt
ImprovingTPU : Move “Ridge Point” to the Left
• Current DRAM
 Two DDR3-2133 channels → 34 GB/s
• Replace with GDDR5 as in the K80
 BW: 34 GB/s → 180 GB/s
 Moves the ridge point from 1350 to 250
 This improvement would expand the die size by about 10%. However, higher
memory bandwidth reduces pressure on the Unified Buffer, so shrinking the
Unified Buffer to 14 MiB could gain back 10% in area.
Maximum MiB of the 24 MiB Unified Buffer used per NN app
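The ridge point is just peak compute divided by bandwidth, so the shift quoted above can be reproduced directly (a sketch, counting MACs per weight byte as before):

```python
peak_macs = 65536 * 700e6       # 45.9e12 MACs/s
print(peak_macs / 34e9)         # ~1350 (two DDR3-2133 channels)
print(peak_macs / 180e9)        # ~255  (K80-style GDDR5) -> "about 250"
```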
Revised TPU Raised Roofline
Performance/Watt: Original & Revised TPU
Overall Performance/Watt
Energy Proportionality
Evaluation ofTPU Designs
• The table below shows the differences between the model results and
the hardware performance counters, which average below 10%
Weighted Mean TPU Relative Performance
Weighted Mean TPU Relative Performance
• First, increasing memory bandwidth has the biggest impact:
performance improves 3X on average when memory bandwidth
increases 4X
• Second, clock rate has little benefit on average, with or without more
accumulators. The reason is that MLPs and LSTMs are memory bound;
only the CNNs are compute bound
 Increasing the clock rate by 4X has almost no impact on MLPs and LSTMs
but improves performance of CNNs by about 2X (see the what-if sketch below)
• Third, average performance slightly degrades when the matrix unit
expands from 256x256 to 512x512, for all apps
 The issue is analogous to internal fragmentation of large pages
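These findings fall out of a first-order roofline model. A what-if sketch (the two operational intensities are invented stand-ins for memory-bound and compute-bound apps, not the paper's measured values):

```python
def perf(peak_macs, bw, oi):
    # First-order roofline estimate of throughput
    return min(peak_macs, bw * oi)

PEAK, BW = 65536 * 700e6, 34e9
for name, oi in [("memory-bound (MLP/LSTM-like)", 200),
                 ("compute-bound (CNN-like)", 3000)]:
    base = perf(PEAK, BW, oi)
    print(name,
          "| 4x memory:", round(perf(PEAK, 4 * BW, oi) / base, 1),
          "| 4x clock:", round(perf(4 * PEAK, BW, oi) / base, 1))

# memory-bound:  4x memory -> 4.0x, 4x clock -> 1.0x
# compute-bound: 4x memory -> 1.0x, 4x clock -> ~2.2x (it becomes memory bound)
```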