Google TPU
Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”
2009-8-22
Authors
• Norman P. Jouppi (first author)
– Distinguished Engineer at Google
– Lead designer of several microprocessors and graphics accelerators
• David Patterson (fourth author)
– Father of “RISC”
Ref: https://www.computer.org/web/awards/goode-norman-jouppi
Neural Networks
• Application
– MLPs, CNNs, and RNNs represent 95% of the NN inference workload in Google datacenters
– Each model needs 5M to 100M weights
• Hardware
– The TPU has 25 times as many MACs and 3.5 times as much on-chip memory as the K80 GPU
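As a rough sizing sketch (not taken from the slides), the weight counts above can be turned into a memory footprint. It assumes one byte per weight, in line with the 8-bit arithmetic the deck describes for the TPU, and an 8 GiB weight memory. The helper names are hypothetical.

```python
# Rough sizing sketch: how much memory do 5M to 100M weights need?
# Assumptions (not from the slides): one byte per weight (8-bit values)
# and a weight memory of exactly 8 GiB.
BYTES_PER_WEIGHT = 1
WEIGHT_MEMORY_BYTES = 8 * 2**30  # 8 GiB


def model_bytes(num_weights):
    """Footprint of one model's weights in bytes."""
    return num_weights * BYTES_PER_WEIGHT


def models_that_fit(num_weights):
    """How many models of this size fit in the weight memory at once."""
    return WEIGHT_MEMORY_BYTES // model_bytes(num_weights)


if __name__ == "__main__":
    for n in (5_000_000, 100_000_000):
        mib = model_bytes(n) / 2**20
        print(f"{n:>11,} weights -> {mib:.1f} MiB, "
              f"{models_that_fit(n)} such models fit in 8 GiB")
```

Under these assumptions even the largest model occupies a small fraction of the weight memory, which is consistent with keeping many models resident at the same time.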
Neural Networks (Cont.)
Origin
• Requirement
– DNNs might double datacenter computation demands
– Quickly produce a custom ASIC for inference
• Definition
– Coprocessor on the PCIe bus that plugs into existing servers
– More like an FPU (floating-point unit) than a GPU
TPU Block Diagram
Architecture
• Matrix Multiply Unit
– Contains 256 x 256 MACs that perform 8-bit multiply-and-adds
– Designed for dense matrices
• Off-chip 8 GiB DRAM (Weight Memory)
– Read-only (unlike the global memory of a GPU)
– Supports many simultaneously active models
• Instruction Set
– Traditional CISC
– Read_Host_Memory/Read_Weights/MatrixMultiply/Convolve/Activate, etc.
– 4-stage pipeline
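To illustrate what an array of 8-bit multiply-and-add units computes, here is a minimal sketch in plain Python. The function names are hypothetical, and the 32-bit accumulator width is an assumption made for the sketch rather than something stated on the slide.

```python
# Sketch of dense matrix multiplication built from 8-bit multiply-accumulates.
# Assumption: products are summed in a signed 32-bit accumulator.
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def mac(acc, a, w):
    """One multiply-accumulate step: acc + a * w with signed 8-bit operands."""
    if not (INT8_MIN <= a <= INT8_MAX and INT8_MIN <= w <= INT8_MAX):
        raise ValueError("operands must fit in a signed 8-bit integer")
    acc += a * w
    if not (INT32_MIN <= acc <= INT32_MAX):
        raise OverflowError("accumulator exceeded the assumed 32-bit width")
    return acc


def matmul_int8(x, w):
    """Dense product of x (M x K) and w (K x N), one mac() per term."""
    rows, inner, cols = len(x), len(w), len(w[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc = mac(acc, x[i][k], w[k][j])
            out[i][j] = acc
    return out


if __name__ == "__main__":
    print(matmul_int8([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # Largest possible sum over one 256-element column of the array:
    print(256 * (-128) * (-128))
```

The second printed value shows why a wider accumulator is needed: a 256-term sum of 8-bit products can exceed 16 bits by a wide margin while still fitting comfortably in 32 bits.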
Architecture (Cont.)
Architecture (Cont.)
Implementation
• Flows
– Data flows in from the left (Unified Buffer)
– Weights are loaded from the top (Weight FIFO, fed by the 8 GiB DDR3 DRAM)
• Systolic System
– A network of processors that rhythmically compute and pass data through the system
• Software Stack
– User-space library and kernel driver (as with Nvidia GPUs)
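The data flow described above can be sketched as a small cycle-by-cycle simulation. This is a simplified model under stated assumptions, not the TPU's actual implementation: weights are preloaded and stay in place, input values enter each row from the left with a one-cycle skew per row, and partial sums move downward one cell per cycle. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(x, w):
    """Compute x (M x K) times w (K x N) on a simulated K x N grid of cells.

    Cell (i, j) permanently holds w[i][j]. On every cycle each cell takes an
    input value from its left neighbour and a partial sum from the cell
    above, adds input * weight, then passes the input right and the sum down.
    """
    m_rows, k_rows, n_cols = len(x), len(w), len(w[0])
    a_reg = [[0] * n_cols for _ in range(k_rows)]  # inputs moving right
    p_reg = [[0] * n_cols for _ in range(k_rows)]  # partial sums moving down
    y = [[0] * n_cols for _ in range(m_rows)]

    # The last result leaves the bottom-right cell at cycle M + K + N - 3.
    for t in range(m_rows + k_rows + n_cols - 2):
        new_a = [[0] * n_cols for _ in range(k_rows)]
        new_p = [[0] * n_cols for _ in range(k_rows)]
        for i in range(k_rows):
            for j in range(n_cols):
                if j == 0:
                    # Row i is fed element i of input vector (t - i): skewed.
                    m = t - i
                    a_in = x[m][i] if 0 <= m < m_rows else 0
                else:
                    a_in = a_reg[i][j - 1]
                p_in = p_reg[i - 1][j] if i > 0 else 0
                new_a[i][j] = a_in
                new_p[i][j] = p_in + a_in * w[i][j]
        a_reg, p_reg = new_a, new_p

        # Finished sums drop out of the bottom row, one column per cycle.
        for j in range(n_cols):
            m = t - (k_rows - 1) - j
            if 0 <= m < m_rows:
                y[m][j] = p_reg[k_rows - 1][j]
    return y


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each cell only ever talks to its immediate neighbours, so once the pipeline is full a new row of results emerges every cycle without any cell re-reading the weight memory.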
Performance
Performance (Cont.)
Performance (Cont.)
Alternative TPU Design
Discussion
• Fallacy: the K80 GPU is a good match for inference
“GPUs have traditionally been seen as high-throughput architectures that rely on high-bandwidth DRAM and thousands of threads to achieve their goals”
Conclusion
• Advantage
– K80 GPU: 2496 32-bit MACs, 8 MiB on-chip memory
TPU: 65536 8-bit MACs, 28 MiB on-chip memory
– The TPU leverages its advantage in MACs and on-chip memory
– The TPU succeeded because of its large matrix multiply unit
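The comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on this slide and the 256 x 256 array size from the Architecture slide; the variable names are hypothetical.

```python
# Figures quoted on the slides.
K80_MACS, TPU_MACS = 2496, 65536
K80_ONCHIP_MIB, TPU_ONCHIP_MIB = 8, 28

# The TPU MAC count is exactly the size of the 256 x 256 matrix unit.
matrix_unit_macs = 256 * 256

mac_ratio = TPU_MACS / K80_MACS
memory_ratio = TPU_ONCHIP_MIB / K80_ONCHIP_MIB

if __name__ == "__main__":
    print(f"matrix unit MACs: {matrix_unit_macs}")
    print(f"MAC ratio (TPU / K80): {mac_ratio:.1f}")
    print(f"on-chip memory ratio (TPU / K80): {memory_ratio:.1f}")
```

The MAC ratio comes out at roughly 26, in line with the "25 times as many MACs" figure quoted earlier, and the memory ratio is exactly 3.5.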
Q1: Why not use the TPU for training?
• The TPU’s off-chip 8 GiB DRAM is read-only
– The CPU pays a high cost for synchronous operations on RAM
– A large number of GPUs lowers the cost per chip
• GPUs have more “parallel” performance
– They could train two small models, or a large number of samples, at the same time
Q2: Why is the TPU faster?
• Application-specific instruction set
– An Intel CPU (CISC) needs decoding, out-of-order execution, branch prediction, SMT, etc.
– The GPU was optimized for “Parallel” rather than “Matrix”
• Read-only on-chip memory
• TensorRT makes GPU inference much faster
GPU grows faster and faster
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
Q3: TPU or FPGA?
• They look much the same
– By programming, an FPGA could have a similar Matrix-Multiply-Unit
– An FPGA could also have “read-only” on-chip memory
• Making an entirely new chip is a high-risk task
– AMD
– Calxeda
– Fusion-io
Thank you
