Google TPU

Hao (Robin) Dong
Big Data Architect at Alibaba Group
Read paper: “In-Datacenter Performance Analysis of a Tensor Processing Unit”
2009-8-22
Authors
• Norman P. Jouppi (first author)
– Distinguished Engineer at Google
– Lead designer of several microprocessors and graphics accelerators
• David Patterson (fourth author)
– Father of “RISC”
Ref: https://www.computer.org/web/awards/goode-norman-jouppi
Neural Networks
• Application
– MLP, CNN, and RNN models represent 95% of the NN inference workload in Google's datacenters
– Each model needs 5M ~ 100M weights
• Hardware
– The TPU has 25 times as many MACs and 3.5 times as much on-chip memory as the K80 GPU (a back-of-envelope check follows below)
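To sanity-check those figures, here is a back-of-envelope sketch in Python (the 65,536 and 2,496 MAC counts and the 28 MiB / 8 MiB on-chip memories are taken from the conclusion slide later in the deck; the 1-byte-per-weight assumption reflects the TPU's 8-bit inference datapath):

```python
# Back-of-envelope check of the workload and hardware figures above.
weights_small, weights_large = 5_000_000, 100_000_000    # 5M ~ 100M weights per model
print(f"Weight footprint at 1 byte/weight: {weights_small / 1e6:.0f}-{weights_large / 1e6:.0f} MB")

tpu_macs = 256 * 256      # 65,536 8-bit MACs in the Matrix Multiply Unit
k80_macs = 2_496          # 2,496 32-bit lanes on a K80 die
print(f"MAC ratio: {tpu_macs / k80_macs:.0f}x")                       # ~26x, rounded to 25x in the paper

tpu_sram_mib, k80_sram_mib = 28, 8                                    # on-chip memory
print(f"On-chip memory ratio: {tpu_sram_mib / k80_sram_mib:.1f}x")    # 3.5x
```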
Neural Networks (Cont.)
Origin
• Requirement
– DNNs might double the datacenters' computation demands
– Quickly produce a custom ASIC for inference
• Definition
– A coprocessor on the PCIe bus that plugs into existing servers; the host sends it instructions to execute (a toy offload sketch follows below)
– More like an FPU (floating-point unit) than a GPU
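To make “coprocessor, more like an FPU than a GPU” concrete: the TPU does not fetch its own instructions; the host builds them and pushes them across PCIe. The Python sketch below is purely illustrative of that offload model; `TpuQueue` and its methods are hypothetical stand-ins, not the real driver API, and only the instruction names come from the architecture slide and the paper.

```python
# Illustrative only: a toy model of the host-driven offload flow.
# TpuQueue and enqueue() are hypothetical names, not Google's driver interface.

class TpuQueue:
    """Pretend PCIe command queue: the host enqueues CISC-style TPU instructions."""
    def __init__(self):
        self.instructions = []

    def enqueue(self, opcode, **operands):
        self.instructions.append((opcode, operands))

def run_inference(queue, layer_count):
    # Host-side loop: stage inputs, then issue one macro-instruction per step.
    queue.enqueue("Read_Host_Memory", dst="unified_buffer")
    for layer in range(layer_count):
        queue.enqueue("Read_Weights", layer=layer)         # stream weights into the FIFO
        queue.enqueue("MatrixMultiply", layer=layer)       # the 256x256 MAC array does the work
        queue.enqueue("Activate", fn="ReLU", layer=layer)  # nonlinearity on the partial sums
    queue.enqueue("Write_Host_Memory", src="unified_buffer")

q = TpuQueue()
run_inference(q, layer_count=3)
print(len(q.instructions), "instructions queued for the TPU")   # 11
```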
TPU Block Diagram
Architecture
• Matrix Multiply Unit
– Contains 256 x 256 MACs that perform 8-bit multiply-and-adds (a minimal matmul sketch follows below)
– Designed for dense matrices
• Off-chip 8 GiB DRAM (Weight Memory)
– Read-only (different from the global memory of a GPU)
– Supports many simultaneously active models
• Instruction Set
– Traditional CISC
– Read_Host_Memory / Read_Weights / MatrixMultiply / Convolve / Activate, etc.
– 4-stage pipeline
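A minimal sketch of the arithmetic one MatrixMultiply step performs: 8-bit operands multiplied and summed into wider integer accumulators (the paper describes 32-bit accumulators). Plain NumPy stands in for the 256 x 256 MAC array here; the tile size and data types are the only assumptions beyond the slide.

```python
import numpy as np

TILE = 256  # the Matrix Multiply Unit is a 256 x 256 array of 8-bit MACs

rng = np.random.default_rng(0)
activations = rng.integers(-128, 128, size=(1, TILE), dtype=np.int8)   # one row of inputs
weights     = rng.integers(-128, 128, size=(TILE, TILE), dtype=np.int8)

# Widen before multiplying so the int8 products and their sums don't overflow.
acc = activations.astype(np.int32) @ weights.astype(np.int32)
print(acc.shape, acc.dtype)   # (1, 256) int32 partial sums, later activated/re-quantized
```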
Architecture (Cont.)
Architecture (Cont.)
Implementation
• Flows
– Data flows in from the left (the Unified Buffer)
– Weights are loaded from the top (Weight FIFO, fed by 8 GiB of DDR3 DRAM)
• Systolic System
– A network of processors that rhythmically compute and pass data through the system (see the toy simulation below)
• Software Stack
– User-space library and kernel driver (like an Nvidia GPU)
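The toy simulation below illustrates the systolic schedule just described, under simplifying assumptions: weights sit still in the array, each activation enters from the left one cycle later than the row above it, and partial sums ripple down each column one cell per cycle, so the finished dot products drain out of the bottom row. It is a sketch of the idea, not the TPU's actual pipeline.

```python
import numpy as np

def systolic_matvec(W, x):
    """Toy weight-stationary systolic schedule: y[j] = sum_i x[i] * W[i, j].

    Cell (i, j) holds W[i, j]. Activation x[i] enters row i from the left at
    cycle i and reaches cell (i, j) at cycle i + j, exactly when the partial
    sum from the cell above arrives, so every cell fires once and the result
    drains out of the bottom row.
    """
    n = len(x)
    psum = np.zeros((n, n), dtype=np.int64)       # partial sum held in each cell
    for t in range(2 * n - 1):                    # cycles of the diagonal wavefront
        for i in range(n):
            j = t - i                             # the cell in row i that fires this cycle
            if 0 <= j < n:
                above = psum[i - 1, j] if i > 0 else 0
                psum[i, j] = above + int(x[i]) * int(W[i, j])
    return psum[n - 1, :]                         # bottom row = finished dot products

rng = np.random.default_rng(0)
W = rng.integers(-128, 128, size=(4, 4))
x = rng.integers(-128, 128, size=4)
assert np.array_equal(systolic_matvec(W, x), x @ W)   # matches a plain vector-matrix product
print(systolic_matvec(W, x))
```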
Performance
Performance (Cont.)
Performance (Cont.)
Alternative TPU Design
Discussion
• Fallacy: The K80 GPU is a good match for inference
“GPUs have traditionally been seen as high-throughput architectures that rely on high-bandwidth DRAM and thousands of threads to achieve their goals”
Conclusion
• Advantage
– K80 GPU: 2,496 32-bit MACs, 8 MiB of on-chip memory; TPU: 65,536 8-bit MACs, 28 MiB of on-chip memory (see the rough arithmetic below)
– The TPU leverages its advantage in MACs and on-chip memory
– The TPU succeeded because of its large matrix multiply unit
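A rough peak-throughput comparison built from those counts (hedged: the 700 MHz TPU clock and the ~92 tera-ops/s peak come from the paper; the K80 figure uses its ~560 MHz base clock, is per die, and ignores boost mode):

```python
# Each MAC counts as 2 operations per cycle (one multiply + one add).
tpu_peak = 65_536 * 2 * 700e6      # 8-bit ops/s at the TPU's 700 MHz clock
k80_peak = 2_496 * 2 * 560e6       # 32-bit FLOP/s per K80 die at its base clock

print(f"TPU peak: {tpu_peak / 1e12:.0f} tera-ops/s (8-bit)")      # ~92
print(f"K80 peak: {k80_peak / 1e12:.1f} tera-FLOP/s (32-bit)")    # ~2.8
print(f"Raw peak ratio: ~{tpu_peak / k80_peak:.0f}x")             # ~33x
```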
Q1: Why not use the TPU for training?
• The TPU's 8 GiB weight DRAM is read-only
– CPUs pay a high price for synchronized operations on RAM
– GPUs are made in large volumes, which lowers the cost of a single chip
• GPUs have more “parallel” performance
– They could train two small models, or a large batch of samples, at the same time
Q2: Why is the TPU faster?
• Application-specific instruction set
– Intel CPUs (CISC) need decoding, out-of-order execution, branch prediction, SMT, etc.
– GPUs were optimized for “parallel” rather than “matrix” workloads
• Read-only weight memory
• TensorRT makes GPU inference much faster
GPU performance is growing faster and faster
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
Q3: TPU or FPGA?
• They look similar
– With programming, an FPGA could implement a similar Matrix Multiply Unit
– An FPGA could also have “read-only” on-chip memory
• Making an utterly new chip is a high-risk task
– AMD
– Calxeda
– Fusion-io
Thank you