Google TPU

Summary and discussion of Google's TPU paper.
Does the TPU really end the life of the GPU? No, it can currently only do inference.


  1. Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit” (2009-8-22)
  2. Authors
     • Norman P. Jouppi (first author)
       – Distinguished Engineer at Google
       – Lead designer of several microprocessors and graphics accelerators
     • David Patterson (fourth author)
       – Father of “RISC”
     Ref: https://www.computer.org/web/awards/goode-norman-jouppi
  3. Neural Networks
     • Application
       – MLP, CNN, RNN represent 95% of the NN inference workload in Google datacenters
       – Each model needs 5M ~ 100M weights (a rough parameter-count sketch follows this slide)
     • Hardware
       – The TPU has 25 times as many MACs and 3.5 times as much on-chip memory as the K80 GPU
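To make the “5M ~ 100M weights” figure concrete, here is a minimal Python sketch that counts the parameters of a small fully connected MLP. The layer sizes are hypothetical (not taken from the paper); they only show that a modest MLP already lands in that range.

# Rough weight count for a small fully connected MLP.
# The layer sizes below are illustrative only, not taken from the paper.
layer_sizes = [2000, 4096, 4096, 1000]  # input, two hidden layers, output

params = 0
for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
    params += fan_in * fan_out + fan_out  # weight matrix plus bias vector

print(f"total parameters: {params / 1e6:.1f}M")  # ~29.1M, inside the 5M ~ 100M range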
  4. Neural Networks (Cont.)
  5. Origin
     • Requirement
       – DNNs might double computation demands
       – Quickly produce a custom ASIC for inference
     • Definition
       – A coprocessor on the PCIe bus, plugged into existing servers
       – More like an FPU (floating-point unit) than a GPU
  6. TPU Block Diagram
  7. Architecture
     • Matrix Multiply Unit
       – Contains 256 x 256 MACs that perform 8-bit multiply-and-adds (a toy 8-bit matrix-multiply sketch follows this slide)
       – Designed for dense matrices
     • Off-chip 8 GiB DRAM (Weight Memory)
       – Read-only (unlike the global memory of a GPU)
       – Supports many simultaneously active models
     • Instruction Set
       – Traditional CISC
       – Read_Host_Memory / Read_Weights / MatrixMultiply / Convolve / Activate, etc.
       – 4-stage pipeline
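As a rough illustration of the arithmetic the 256 x 256 MAC array performs, the sketch below (Python/NumPy, with made-up shapes and a simplistic scale-only requantization, not the TPU's actual datapath) multiplies 8-bit activations by 8-bit weights while accumulating in 32 bits.

import numpy as np

# Toy 8-bit matrix multiply with 32-bit accumulation.
# Shapes, the random data, and the naive rescaling are illustrative only.
rng = np.random.default_rng(0)
acts = rng.integers(-128, 128, size=(16, 256), dtype=np.int8)      # 8-bit activations
weights = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)  # 8-bit weights

# Accumulate the 8-bit products in int32 so the 256-term sums cannot overflow.
acc = acts.astype(np.int32) @ weights.astype(np.int32)

# Requantize back to 8 bits (here simply by the largest magnitude in the result).
scale = np.abs(acc).max() / 127.0
out = np.clip(np.round(acc / scale), -128, 127).astype(np.int8)
print(out.shape)  # (16, 256)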
  8. Architecture (Cont.)
  9. Architecture (Cont.)
  10. Implementation
     • Flows
       – Data flows in from the left (Unified Buffer)
       – Weights are loaded from the top (Weight FIFO, fed from 8 GiB DDR3 DRAM)
     • Systolic System
       – A network of processors that rhythmically compute and pass data through the system (a toy simulation follows this slide)
     • Software Stack
       – User-space library and kernel driver (like an NVIDIA GPU)
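A minimal sketch of the systolic idea, assuming a textbook output-stationary array (operands stream in from the left and the top, and every processing element keeps its own accumulator) rather than the TPU's actual weight-stationary design; it only shows how skewed operands meet rhythmically as they march through the grid.

import numpy as np

def systolic_matmul(A, B):
    # Toy output-stationary systolic array computing A @ B cycle by cycle.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N), dtype=A.dtype)    # one accumulator per PE
    a_reg = np.zeros((M, N), dtype=A.dtype)  # values marching right, one PE per cycle
    b_reg = np.zeros((M, N), dtype=B.dtype)  # values marching down, one PE per cycle

    for t in range(M + N + K - 2):           # enough cycles to drain the array
        # March the values held by the PEs one step right / one step down.
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # Feed the skewed input streams at the edges: row i of A is delayed by
        # i cycles and column j of B by j cycles, so matching operands meet.
        for i in range(M):
            a_reg[i, 0] = A[i, t - i] if 0 <= t - i < K else 0
        for j in range(N):
            b_reg[0, j] = B[t - j, j] if 0 <= t - j < K else 0
        # Every PE multiplies what it currently holds and accumulates locally.
        acc += a_reg * b_reg
    return acc

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
print(np.array_equal(systolic_matmul(A, B), A @ B))  # True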
  11. Performance
  12. Performance (Cont.)
  13. Performance (Cont.)
  14. Alternative TPU Design
  15. Discussion
     • Fallacy: the K80 GPU is a good match to inference
       – “GPUs have traditionally been seen as high-throughput architectures that rely on high-bandwidth DRAM and thousands of threads to achieve their goals”
  16. Conclusion
     • Advantage
       – K80 GPU: 2,496 32-bit MACs, 8 MiB on-chip memory; TPU: 65,536 8-bit MACs, 28 MiB on-chip memory (the ratios are checked below)
       – The TPU leverages its advantage in MACs and on-chip memory
       – The TPU succeeded because of the large matrix multiply unit
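A quick back-of-the-envelope check, in plain Python, that the counts above reproduce the “25 times as many MACs and 3.5 times as much on-chip memory” ratios quoted on the Neural Networks slide:

# Counts taken from the Conclusion slide.
k80_macs, tpu_macs = 2496, 65536   # 32-bit vs. 8-bit multiply-accumulators
k80_mem_mib, tpu_mem_mib = 8, 28   # on-chip memory in MiB

print(f"MAC ratio:    {tpu_macs / k80_macs:.1f}x")        # ~26.3x, quoted as ~25x
print(f"memory ratio: {tpu_mem_mib / k80_mem_mib:.1f}x")  # 3.5x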
  17. Q1: Why not use the TPU for training?
     • The TPU’s off-chip 8 GiB weight DRAM is read-only
       – The CPU pays a lot for synchronous operations on RAM
       – A large number of GPUs lowers the cost per chip
     • GPUs have more “parallel” capacity
       – They could train two small models, or a large number of samples, at the same time
  18. Q2: Why is the TPU faster?
     • Application-specific instruction set
       – Intel CPUs (CISC) need decoding, out-of-order execution, branch prediction, SMT, etc.
       – GPUs were optimized for “parallel” rather than “matrix” workloads
     • Read-only weight memory
     • TensorRT makes GPU inference much faster
  19. GPUs are getting faster and faster
     https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
  20. Q3: TPU or FPGA?
     • They look much the same
       – With suitable programming, an FPGA could implement a similar Matrix Multiply Unit
       – An FPGA could also provide “read-only” memory for weights
     • Making an utterly new chip is a high-risk task
       – AMD
       – Calxeda
       – Fusion-io
  21. Thank you
