SlideShare a Scribd company logo
1 of 22
Download to read offline
TensorFlow for HPC?
Peter Braam
peter@braam.io
me
1980 1993 20021997 2013
pure math & th. physics @oxford
cs @cmu
@5 startups & @3 big acquirers - Lustre
SKA @cambridge
work with 100’s of largest compute centers and
virtually all major system & CPU/GPU vendors
Math / ML /
Astrophysics
@flatiron Institue
2018
Origin of this talk
I worked extensively on HPC infrastructure for the SKA telescope.
Through coincidence I was offered a generous visit of CERN and asked to
explain some of my thoughts to the HEP ML community.
I decided to offer the HEP ML community a “systems perspective” of
TensorFlow, and I came away highly impressed about this platform’s history
and promise..
Why talk about TF?
Very widely used, gained much ground on other packages
Has unmatched flexibility for deployment
Achieves very high performance
Systems Engineering Masterpiece
Best of breed specialists involved from multiple domains
Door Opener for new xPU design
Domain specific computation infrastructure template
4
Why did Google do this?
Google’s AI could mean doubling their data centres (modest use)
100’s of projects will pursue ML: development productivity is central
Google released TensorFlow in 2015 (a 2nd design following DistBelief).
TensorFlow’s scope is profound: language, compiler, chips, tools, devops
One of the most impressive software - systems - hardware project I’ve seen
Character of mega projects ... (108
-1010
$$)
TensorFlow
Google realized they would massively develop
ML driven applications, modest use would
require twofold expansion of data centers
Challenge:
● high productivity software development
● portable deployment from phones to
massive clusters
● lowest cost performance ratio
SKA Telescope
SKA is deploying a massive new radio
telescope. It needs to provide usable science
data product (i.e. images) for astrophysicists
using algorithms that might need adaptation.
Challenge:
● Understand required compute systems
● Flexible development and runtime
environment
● Meet energy and financial budgets
6
Co-design
experts
architecture
suppliersIT
users
executives
7
TensorFlow
components
High level API’s like Keras allow rapid prototyping of ML models (this
is largely maths, not programming). Automatic differentiation.
Debugging and profiling tools exist, such as tensorboard and a data
flow debugger.
One code base can be used for development, training, evaluation,
inference and snapshotting, and runs on mobile devices through
specialized large clusters.
ML focus
this is frequently discused
Language, execution
platform and optimization
Compiler and Chips (TPU)
Devops
The architecture reflects a
strong separation of
concerns
Tensorflow Core
v (M vect)A (NxM mat)
y=Ax
y (N vect) output or “fetch”
input or “feed”
TF operation
Data Flow model with extremely rich features.
Expressions in programming languages define
data flow graphs from call graphs and arguments
TF treats graphs declaratively, i.e. they are
defined but not executed at the same time.
TF Graphs can be automatically split for
distributed execution on multiple devices.
9
TF Operations Reflect Domain Specific
Aspects found throughout TensorFlow
TF execution
Matmul
w
b
x
Add
User
Fetches
Output
TensorFlow
Performs
Operations
User
Feeds
Inputs
TensorFlow
Trains
Variables
TensorFlow
data flow
Tensors
Define graph
Start 1 or more sessions
The session executes the graph:
- recursively check what graph nodes the
fetch (output) depends on
- lots of optimizations
- execute dependencies
- in parallel
- this is called lazy evaluation
- enables parallelism
10
Execution Framework Challenges
➔ run on distributed systems and on many architectures (including the TPU)
◆ split graphs, feed data to remote architectures, control architecture
◆ understand the data: inactive and identity nodes, mapping constants, tensor dimensions
➔ create code for different architectures, optimized for scalable clusters and
for handheld devices. Compiler offers
◆ JIT: just in time (during execution) to take full advantage of the sizes of the tensors
◆ AOT: ahead of time to create a standalone binary
➔ optimizations:
◆ tiling sizes, threading, data alignment, perform padding, minimize communications,
adapt queue lengths
The TF framework itself but particularly the XLA compiler make this
transparent
Graph modifications for distributed execution
node 0
b c
w
y
xa
xPU1
xPU0
b
xPU0
c
w
y
a x
send send
recvrecv
xPU1
node 0
node 1
message
queues
12
Execution Platforms & Tensor Processing Units (TPU)
➔ A docker machine with Python can run Tensorflow and its debugging tools
◆ can even invoke a GPU
➔ Training and inference may require performance and scale
◆ Support for GPU, FPGA acceleration - through XLA
◆ Custom TPU processors
TPU’s are accelerators on PCI bus in servers
14
TPU pods (clusters)
TPU pods (clusters) - TPU chips
TPU chips have systolic MXU (matrix multiply
unit) reducing memory accesses by ~100x:
Pass data between ~100K ALU’s. Small processing units,
using a global clock, no registers.
Only for TF ops. ~100T Ops/cycle (limited precision)
System Organization
Send TF graph as a whole to a TF node
Send individual XLA generated operations
with their data to the TPU accelerator.
This includes instructions and data. The
TPU does not fetch instructions like a CPU
17
grpc over PCI
grpc over TCP/IP
storage
TPU v3.0 specs (conservative guesses based on v2)
18
TPU 3.0 TPU 3.0 / node TPU / pod
#TPU’s 1 card, 4 chips, 16 MXU 4 cards 1024 cards, 256 nodes
mem BW 5 TB/sec (?) 20 TB/sec 5 PB/sec
flops / sec (*) 100 TF/sec 400 TF/sec 100 PF/sec
Operations per clock cycle
CPU 10’s (cores)
CPU vectorized 1000 (core x vector length)
GPU 10K ‘s
TPU 128K (TPU v1)
* flops are of various precisions
instructions: - model is RPC to chip
Read_Host_Memory
Write_Host_Memory
Read_Weights
MatrixMultiply/Convolve
Activate (ReLU, Sigmoid, Maxpool, LRN, …)
This should raise eyebrows ...
256 nodes for 5PB/sec of BW and 100PF ??? -
pretty much a top 5 machine in top500
It would work very well for moderate
granularity computations, like SKA (and AI for
which it was made). Wouldn’t help with AMR
likely
Like GPU’s in 2003, this is worth tinkering and
playing with to see its HPC potential.
And yes, ultimately, it may require a new chip.
Some candidate enablers:
1. does HPC use need more operations
than the TensorFlow operations?
2. does a more general systolic network
interconnect offer more opportunities?
3. can mixed precision arithmetic be
introduced? (posithub.org)
This would come at a cost of adapting the
chip? For a project like SKA it could be a
major breakthrough.
Lessons Learned
Significant cost benefits make software and
custom HW projects viable solutions
Replicating an effort of this stature is
extremely difficult.
Domain specific solutions hold a lot of
promise.
Acknowledgement: I’ve used some images
from TensorFlow documentation
(https://tensorflow.org), and Google’s blog
(https://cloud.google.com/blog/products/gcp)
Things to remember:
TensorFlow is a complete systems project:
language, compiler, hardware, devops, tools
Compiler enables advanced use models from
one code base: mobile, cloud, distributed,
GPU, TPU
TPU design has extremely high memory
bandwidth and ops/sec
Thank You
peter@braam.io
TensorFlow for HPC?

More Related Content

What's hot

An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureMani Goswami
 
An evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsAn evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsLinaro
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitCarlo C. del Mundo
 
High performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveHigh performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveJason Shih
 
TensorFlow example for AI Ukraine2016
TensorFlow example  for AI Ukraine2016TensorFlow example  for AI Ukraine2016
TensorFlow example for AI Ukraine2016Andrii Babii
 
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...MLconf
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
LeFlowを調べてみました
LeFlowを調べてみましたLeFlowを調べてみました
LeFlowを調べてみましたMr. Vengineer
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoSciCompIIT
 
Get Your Hands Dirty with Intel® Distribution for Python*
Get Your Hands Dirty with Intel® Distribution for Python*Get Your Hands Dirty with Intel® Distribution for Python*
Get Your Hands Dirty with Intel® Distribution for Python*Intel® Software
 
Buzzwords Numba Presentation
Buzzwords Numba PresentationBuzzwords Numba Presentation
Buzzwords Numba Presentationkammeyer
 
Distributed Multi-device Execution of TensorFlow – an Outlook
Distributed Multi-device Execution of TensorFlow – an OutlookDistributed Multi-device Execution of TensorFlow – an Outlook
Distributed Multi-device Execution of TensorFlow – an OutlookSebnem Rusitschka
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupJeff Larkin
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowAI Frontiers
 
PyData NYC whatsnew NumPy-SciPy 2019
PyData NYC whatsnew NumPy-SciPy 2019PyData NYC whatsnew NumPy-SciPy 2019
PyData NYC whatsnew NumPy-SciPy 2019Ralf Gommers
 

What's hot (20)

An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
An evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsAn evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loops
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
 
Google TPU
Google TPUGoogle TPU
Google TPU
 
TPU paper slide
TPU paper slideTPU paper slide
TPU paper slide
 
High performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveHigh performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspective
 
TensorFlow example for AI Ukraine2016
TensorFlow example  for AI Ukraine2016TensorFlow example  for AI Ukraine2016
TensorFlow example for AI Ukraine2016
 
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
LeFlowを調べてみました
LeFlowを調べてみましたLeFlowを調べてみました
LeFlowを調べてみました
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
 
Introduction to TensorFlow
Introduction to TensorFlowIntroduction to TensorFlow
Introduction to TensorFlow
 
Get Your Hands Dirty with Intel® Distribution for Python*
Get Your Hands Dirty with Intel® Distribution for Python*Get Your Hands Dirty with Intel® Distribution for Python*
Get Your Hands Dirty with Intel® Distribution for Python*
 
Buzzwords Numba Presentation
Buzzwords Numba PresentationBuzzwords Numba Presentation
Buzzwords Numba Presentation
 
Distributed Multi-device Execution of TensorFlow – an Outlook
Distributed Multi-device Execution of TensorFlow – an OutlookDistributed Multi-device Execution of TensorFlow – an Outlook
Distributed Multi-device Execution of TensorFlow – an Outlook
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
IBM AI at Scale
IBM AI at ScaleIBM AI at Scale
IBM AI at Scale
 
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
 
PyData NYC whatsnew NumPy-SciPy 2019
PyData NYC whatsnew NumPy-SciPy 2019PyData NYC whatsnew NumPy-SciPy 2019
PyData NYC whatsnew NumPy-SciPy 2019
 

Similar to TensorFlow for HPC?

Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
running Tensorflow in Production
running Tensorflow in Productionrunning Tensorflow in Production
running Tensorflow in ProductionMatthias Feys
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Slide_N
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Tyrone Systems
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsStijn Decubber
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V International
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systemsdairsie
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputinginside-BigData.com
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningSergey Karayev
 
Heroku @ Toyota Motor Europe: Platform As A Factory As A Service
Heroku @ Toyota Motor Europe: Platform As A Factory As A ServiceHeroku @ Toyota Motor Europe: Platform As A Factory As A Service
Heroku @ Toyota Motor Europe: Platform As A Factory As A ServiceCQD
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDatabricks
 
Fugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedFugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedRCCSRENKEI
 

Similar to TensorFlow for HPC? (20)

Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
running Tensorflow in Production
running Tensorflow in Productionrunning Tensorflow in Production
running Tensorflow in Production
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
Heroku @ Toyota Motor Europe: Platform As A Factory As A Service
Heroku @ Toyota Motor Europe: Platform As A Factory As A ServiceHeroku @ Toyota Motor Europe: Platform As A Factory As A Service
Heroku @ Toyota Motor Europe: Platform As A Factory As A Service
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
 
Fugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedFugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons Learned
 

More from inside-BigData.com

Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 

Recently uploaded

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

TensorFlow for HPC?

  • 1. TensorFlow for HPC? Peter Braam peter@braam.io
  • 2. me 1980 1993 20021997 2013 pure math & th. physics @oxford cs @cmu @5 startups & @3 big acquirers - Lustre SKA @cambridge work with 100’s of largest compute centers and virtually all major system & CPU/GPU vendors Math / ML / Astrophysics @flatiron Institue 2018
  • 3. Origin of this talk I worked extensively on HPC infrastructure for the SKA telescope. Through coincidence I was offered a generous visit of CERN and asked to explain some of my thoughts to the HEP ML community. I decided to offer the HEP ML community a “systems perspective” of TensorFlow, and I came away highly impressed about this platform’s history and promise..
  • 4. Why talk about TF? Very widely used, gained much ground on other packages Has unmatched flexibility for deployment Achieves very high performance Systems Engineering Masterpiece Best of breed specialists involved from multiple domains Door Opener for new xPU design Domain specific computation infrastructure template 4
  • 5. Why did Google do this? Google’s AI could mean doubling their data centres (modest use) 100’s of projects will pursue ML: development productivity is central Google released TensorFlow in 2015 (a 2nd design following DistBelief). TensorFlow’s scope is profound: language, compiler, chips, tools, devops One of the most impressive software - systems - hardware project I’ve seen
  • 6. Character of mega projects ... (108 -1010 $$) TensorFlow Google realized they would massively develop ML driven applications, modest use would require twofold expansion of data centers Challenge: ● high productivity software development ● portable deployment from phones to massive clusters ● lowest cost performance ratio SKA Telescope SKA is deploying a massive new radio telescope. It needs to provide usable science data product (i.e. images) for astrophysicists using algorithms that might need adaptation. Challenge: ● Understand required compute systems ● Flexible development and runtime environment ● Meet energy and financial budgets 6
  • 8. TensorFlow components High level API’s like Keras allow rapid prototyping of ML models (this is largely maths, not programming). Automatic differentiation. Debugging and profiling tools exist, such as tensorboard and a data flow debugger. One code base can be used for development, training, evaluation, inference and snapshotting, and runs on mobile devices through specialized large clusters. ML focus this is frequently discused Language, execution platform and optimization Compiler and Chips (TPU) Devops The architecture reflects a strong separation of concerns
  • 9. Tensorflow Core v (M vect)A (NxM mat) y=Ax y (N vect) output or “fetch” input or “feed” TF operation Data Flow model with extremely rich features. Expressions in programming languages define data flow graphs from call graphs and arguments TF treats graphs declaratively, i.e. they are defined but not executed at the same time. TF Graphs can be automatically split for distributed execution on multiple devices. 9 TF Operations Reflect Domain Specific Aspects found throughout TensorFlow
  • 10. TF execution Matmul w b x Add User Fetches Output TensorFlow Performs Operations User Feeds Inputs TensorFlow Trains Variables TensorFlow data flow Tensors Define graph Start 1 or more sessions The session executes the graph: - recursively check what graph nodes the fetch (output) depends on - lots of optimizations - execute dependencies - in parallel - this is called lazy evaluation - enables parallelism 10
  • 11. Execution Framework Challenges ➔ run on distributed systems and on many architectures (including the TPU) ◆ split graphs, feed data to remote architectures, control architecture ◆ understand the data: inactive and identity nodes, mapping constants, tensor dimensions ➔ create code for different architectures, optimized for scalable clusters and for handheld devices. Compiler offers ◆ JIT: just in time (during execution) to take full advantage of the sizes of the tensors ◆ AOT: ahead of time to create a standalone binary ➔ optimizations: ◆ tiling sizes, threading, data alignment, perform padding, minimize communications, adapt queue lengths The TF framework itself but particularly the XLA compiler make this transparent
  • 12. Graph modifications for distributed execution node 0 b c w y xa xPU1 xPU0 b xPU0 c w y a x send send recvrecv xPU1 node 0 node 1 message queues 12
  • 13. Execution Platforms & Tensor Processing Units (TPU) ➔ A docker machine with Python can run Tensorflow and its debugging tools ◆ can even invoke a GPU ➔ Training and inference may require performance and scale ◆ Support for GPU, FPGA acceleration - through XLA ◆ Custom TPU processors
  • 14. TPU’s are accelerators on PCI bus in servers 14
  • 16. TPU pods (clusters) - TPU chips TPU chips have systolic MXU (matrix multiply unit) reducing memory accesses by ~100x: Pass data between ~100K ALU’s. Small processing units, using a global clock, no registers. Only for TF ops. ~100T Ops/cycle (limited precision)
  • 17. System Organization Send TF graph as a whole to a TF node Send individual XLA generated operations with their data to the TPU accelerator. This includes instructions and data. The TPU does not fetch instructions like a CPU 17 grpc over PCI grpc over TCP/IP storage
  • 18. TPU v3.0 specs (conservative guesses based on v2) 18 TPU 3.0 TPU 3.0 / node TPU / pod #TPU’s 1 card, 4 chips, 16 MXU 4 cards 1024 cards, 256 nodes mem BW 5 TB/sec (?) 20 TB/sec 5 PB/sec flops / sec (*) 100 TF/sec 400 TF/sec 100 PF/sec Operations per clock cycle CPU 10’s (cores) CPU vectorized 1000 (core x vector length) GPU 10K ‘s TPU 128K (TPU v1) * flops are of various precisions instructions: - model is RPC to chip Read_Host_Memory Write_Host_Memory Read_Weights MatrixMultiply/Convolve Activate (ReLU, Sigmoid, Maxpool, LRN, …)
  • 19. This should raise eyebrows ... 256 nodes for 5PB/sec of BW and 100PF ??? - pretty much a top 5 machine in top500 It would work very well for moderate granularity computations, like SKA (and AI for which it was made). Wouldn’t help with AMR likely Like GPU’s in 2003, this is worth tinkering and playing with to see its HPC potential. And yes, ultimately, it may require a new chip. Some candidate enablers: 1. does HPC use need more operations than the TensorFlow operations? 2. does a more general systolic network interconnect offer more opportunities? 3. can mixed precision arithmetic be introduced? (posithub.org) This would come at a cost of adapting the chip? For a project like SKA it could be a major breakthrough.
  • 20. Lessons Learned Significant cost benefits make software and custom HW projects viable solutions Replicating an effort of this stature is extremely difficult. Domain specific solutions hold a lot of promise. Acknowledgement: I’ve used some images from TensorFlow documentation (https://tensorflow.org), and Google’s blog (https://cloud.google.com/blog/products/gcp) Things to remember: TensorFlow is a complete systems project: language, compiler, hardware, devops, tools Compiler enables advanced use models from one code base: mobile, cloud, distributed, GPU, TPU TPU design has extremely high memory bandwidth and ops/sec