In this deck, Peter Braam looks at how the TensorFlow framework could be used to accelerate high performance computing.
"Google has developed TensorFlow, a truly complete platform for ML. The performance of the platform is amazing, and it raises the question of whether it will be useful for HPC, much as GPUs heralded a revolution.
As described in his talk at the CHPC 2018 Conference in South Africa, TensorFlow contains many ingredients, for example:
* many domain-specific libraries for machine learning
* the TensorFlow domain-specific data-flow language
* carefully organized input and output for data flow
* an optimizing runtime and compiler
* hardware implementations of TensorFlow operations in TensorFlow Processing Unit (TPU) chips
Learn more: https://wp.me/p3RLHQ-jMv
and
https://www.tensorflow.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
2. me
1980: pure math & theoretical physics @ Oxford
1993: CS @ CMU
1997-2002: 5 startups & 3 big acquirers - Lustre
2013: SKA @ Cambridge
Work with 100's of the largest compute centers and virtually all major system & CPU/GPU vendors
2018: Math / ML / Astrophysics @ Flatiron Institute
3. Origin of this talk
I worked extensively on HPC infrastructure for the SKA telescope.
Through coincidence I was offered a generous invitation to visit CERN and asked to explain some of my thoughts to the HEP ML community.
I decided to offer the HEP ML community a "systems perspective" on TensorFlow, and I came away highly impressed by this platform's history and promise.
4. Why talk about TF?
Very widely used; has gained much ground on other packages
Unmatched flexibility for deployment
Achieves very high performance
A systems engineering masterpiece
Best-of-breed specialists involved from multiple domains
Door opener for new xPU designs
A template for domain-specific computation infrastructure
5. Why did Google do this?
Even modest use of AI could mean doubling Google's data centres
100's of projects will pursue ML: development productivity is central
Google released TensorFlow in 2015 (a 2nd design, following DistBelief).
TensorFlow's scope is profound: language, compiler, chips, tools, devops
One of the most impressive software - systems - hardware projects I've seen
6. Character of mega projects ... ($10^8 - $10^10)
TensorFlow
Google realized it would massively develop ML-driven applications; even modest use would require a twofold expansion of its data centers.
Challenge:
● high-productivity software development
● portable deployment from phones to massive clusters
● lowest cost/performance ratio
SKA Telescope
SKA is deploying a massive new radio telescope. It needs to provide usable science data products (i.e. images) for astrophysicists, using algorithms that might need adaptation.
Challenge:
● understand the required compute systems
● a flexible development and runtime environment
● meet energy and financial budgets
8. TensorFlow components
High-level APIs like Keras allow rapid prototyping of ML models (this is largely maths, not programming), with automatic differentiation.
Debugging and profiling tools exist, such as TensorBoard and a data-flow debugger.
One code base can be used for development, training, evaluation, inference and snapshotting, and runs on everything from mobile devices to specialized large clusters.
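As a rough illustration of this kind of rapid prototyping with automatic differentiation, here is a minimal sketch using the Keras API and a gradient tape; the layer sizes and dummy data are made up for the example:

```python
import tensorflow as tf

# Tiny Keras model: a few lines of maths-like code, no boilerplate.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam()

x = tf.random.normal((32, 10))   # dummy batch of inputs
y = tf.random.normal((32, 1))    # dummy targets

with tf.GradientTape() as tape:  # records operations for automatic differentiation
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```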
The architecture reflects a strong separation of concerns:
● ML focus (this is the part that is frequently discussed)
● language, execution platform and optimization
● compiler and chips (TPU)
● devops
9. TensorFlow Core
[Diagram: a single TF operation computing y = Ax, where x (an M-vector) is the input or "feed", A is an N×M matrix, and y (an N-vector) is the output or "fetch".]
A data-flow model with extremely rich features.
Expressions in the host programming language define data-flow graphs from call graphs and arguments.
TF treats graphs declaratively, i.e. they are defined first and executed separately later.
TF graphs can be automatically split for distributed execution on multiple devices.
TF operations reflect domain-specific aspects found throughout TensorFlow.
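To make the declarative feed/fetch model concrete, here is a minimal sketch in the graph-and-session style TensorFlow used at the time of the talk (via tf.compat.v1 in current releases); the matrix sizes are illustrative:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # use the classic deferred-execution graph mode

N, M = 3, 4
A = tf.constant(np.random.rand(N, M).astype(np.float32))   # N x M matrix
x = tf.placeholder(tf.float32, shape=(M,), name="x")       # input, the "feed"
y = tf.linalg.matvec(A, x)                                  # defines y = Ax in the graph; nothing runs yet

with tf.Session() as sess:
    # Execution happens only here: fetch y while feeding a concrete x.
    print(sess.run(y, feed_dict={x: np.ones(M, dtype=np.float32)}))
```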
11. Execution Framework Challenges
➔ run on distributed systems and on many architectures (including the TPU)
◆ split graphs, feed data to remote architectures, control architecture
◆ understand the data: inactive and identity nodes, mapping constants, tensor dimensions
➔ create code for different architectures, optimized both for scalable clusters and for handheld devices. The compiler offers:
◆ JIT: just-in-time (during execution), to take full advantage of the sizes of the tensors
◆ AOT: ahead-of-time, to create a standalone binary
➔ optimizations:
◆ tiling sizes, threading, data alignment, padding, minimizing communication, adapting queue lengths
The TF framework itself, and particularly the XLA compiler, makes all of this transparent.
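As a minimal sketch of the JIT path, current TensorFlow exposes XLA compilation through tf.function; the function and shapes below are made up for illustration (the talk's era used session-level XLA flags instead):

```python
import tensorflow as tf

@tf.function(jit_compile=True)      # ask XLA to JIT-compile the traced graph
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((128, 256))
w = tf.random.normal((256, 64))
b = tf.zeros((64,))
y = dense_layer(x, w, b)            # first call traces the graph and triggers XLA compilation
```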
12. Graph modifications for distributed execution
[Diagram: a data-flow graph with nodes a, b, c, w, x, y originally placed on node 0 is split across two devices, xPU0 and xPU1, spanning node 0 and node 1; matching send/recv operation pairs and message queues are inserted automatically at the cut edges.]
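A minimal sketch of the placement mechanism behind this: when operations are pinned to different devices, TensorFlow inserts the necessary transfers itself (device names below are illustrative, and soft placement falls back to the CPU if no GPU is present):

```python
import tensorflow as tf

tf.config.set_soft_device_placement(True)   # fall back gracefully if a device is absent

with tf.device("/CPU:0"):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))

with tf.device("/GPU:0"):
    c = tf.matmul(a, b)   # TF moves a and b to the GPU behind the scenes (the send/recv of the diagram)
```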
13. Execution Platforms & Tensor Processing Units (TPU)
➔ A Docker container with Python can run TensorFlow and its debugging tools
◆ can even invoke a GPU
➔ Training and inference may require performance and scale
◆ Support for GPU, FPGA acceleration - through XLA
◆ Custom TPU processors
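A quick way to check which of these execution platforms a given TensorFlow installation can actually see (for example inside such a Docker container); output varies by setup:

```python
import tensorflow as tf

print("TensorFlow", tf.__version__)
print("CPUs:", tf.config.list_physical_devices("CPU"))
print("GPUs:", tf.config.list_physical_devices("GPU"))   # empty list if no GPU is visible
```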
16. TPU pods (clusters) - TPU chips
TPU chips have a systolic MXU (matrix multiply unit) that reduces memory accesses by ~100x:
data is passed directly between ~100K ALUs - small processing units using a global clock and no registers.
Only for TF ops. Roughly 100 TOps/sec (at limited precision).
17. System Organization
Send the TF graph as a whole to a TF node.
Send individual XLA-generated operations, with their data, to the TPU accelerator.
This includes both instructions and data; the TPU does not fetch instructions the way a CPU does.
[Diagram: system organization, with gRPC over TCP/IP between client and TF node, gRPC over PCI between TF node and TPU, and attached storage.]
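For a sense of how a client program hands work to a TPU worker over gRPC, here is a hedged sketch using today's tf.distribute API (the worker address is purely hypothetical, and the talk predates this exact interface):

```python
import tensorflow as tf

# Connect to a (hypothetical) TPU worker over gRPC and initialize it.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://10.0.0.2:8470")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Graphs built here are compiled by XLA and shipped to the TPU as described above.
    model = tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)])
```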
18. TPU v3.0 specs (conservative guesses based on v2)
| | TPU 3.0 (per card) | TPU 3.0 / node | TPU / pod |
|---|---|---|---|
| #TPUs | 1 card, 4 chips, 16 MXU | 4 cards | 1024 cards, 256 nodes |
| mem BW | 5 TB/sec (?) | 20 TB/sec | 5 PB/sec |
| flops/sec (*) | 100 TF/sec | 400 TF/sec | 100 PF/sec |

(*) flops are of various precisions

Operations per clock cycle:
* CPU: 10's (cores)
* CPU vectorized: ~1000 (cores × vector length)
* GPU: 10K's
* TPU: 128K (TPU v1)

TPU instructions (the programming model is an RPC to the chip):
* Read_Host_Memory
* Write_Host_Memory
* Read_Weights
* MatrixMultiply / Convolve
* Activate (ReLU, Sigmoid, Maxpool, LRN, …)
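A quick sanity check of how the per-pod guesses follow from the per-card numbers (all figures are the talk's conservative estimates, not official specs):

```python
cards_per_node, nodes_per_pod = 4, 256
flops_per_card = 100e12            # 100 TF/sec per card
membw_per_card = 5e12              # 5 TB/sec per card

cards_per_pod = cards_per_node * nodes_per_pod              # 1024 cards
print(cards_per_pod * flops_per_card / 1e15, "PF/sec")      # ~102.4, i.e. ~100 PF/sec
print(cards_per_pod * membw_per_card / 1e15, "PB/sec")      # ~5.1, i.e. ~5 PB/sec
```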
19. This should raise eyebrows ...
256 nodes for 5 PB/sec of memory bandwidth and 100 PF ??? - pretty much a top-5 machine in the Top500.
It would work very well for moderate-granularity computations, like SKA (and the AI for which it was made). It likely wouldn't help with AMR.
Like GPUs in 2003, this is worth tinkering and playing with to see its HPC potential.
And yes, ultimately, it may require a new chip.
Some candidate enablers:
1. Does HPC use need more operations than the TensorFlow operations?
2. Does a more general systolic network interconnect offer more opportunities?
3. Can mixed-precision arithmetic be introduced? (posithub.org)
This would come at the cost of adapting the chip, but for a project like SKA it could be a major breakthrough.
20. Lessons Learned
Significant cost benefits make software and custom HW projects viable solutions.
Replicating an effort of this stature is extremely difficult.
Domain-specific solutions hold a lot of promise.
Acknowledgement: I've used some images from the TensorFlow documentation (https://tensorflow.org) and Google's blog (https://cloud.google.com/blog/products/gcp).
Things to remember:
* TensorFlow is a complete systems project: language, compiler, hardware, devops, tools.
* The compiler enables advanced use models from one code base: mobile, cloud, distributed, GPU, TPU.
* The TPU design has extremely high memory bandwidth and ops/sec.