The Future of Computing is Distributed
Ion Stoica
December 4, 2020
(Timeline: ARPA Network (1970s) → HPC (1980s) → Web (1990s) → Big Data (2000s))
Distributed systems not new…
Distributed computing still the exception…
Inaccessible to most developers
Few universities teach distributed computing
This will change…
Distributed computing will be the norm, rather than the exception
What is different this time?
The rise of deep learning (DL)
The end of Moore’s Law
Apps becoming AI centric
DL demands growing faster than ever: 35x every 18 months
(https://openai.com/blog/ai-and-compute/)
Not only esoteric apps
(Chart: image processing — ResNet-50, ResNeXt-101 — and translation)
(Chart: model sizes, from ResNet-50, Transformer, GPT-1, and BERT up to GPT-2, Turing Proj. (17B), and GPT-3 (175B) — growing 20x every 18 months)
Not only processing, but memory
(Chart: model memory requirements, from ResNet-50 and Transformer up to Turing Proj. (17B) and GPT-3 (175B) — growing 200x every 18 months!)
What is different?
The rise of deep learning (DL)
The end of Moore’s Law
Apps becoming AI centric
The end of Moore's Law:
from 2x every 18 months to 1.05x every 18 months.
Hardware cannot keep up
(Chart: DL compute demand, 35x every 18 months, vs. CPU performance under Moore's Law, 2x every 18 months)
(https://openai.com/blog/ai-and-compute/)
What about specialized hardware?
Trade generality for performance
Not enough! It just extends Moore's Law: demand grows 35x every 18 months, so even hardware that doubles in performance every generation falls further behind.
Specialized hardware not enough
(Chart: DL demand, 35x every 18 months, vs. Moore's Law, 2x every 18 months — CPU, GPU, TPU)
(https://openai.com/blog/ai-and-compute/)
Memory dwarfed by demand
(Chart: model memory requirements, from ResNet-50 and Transformer up to Turing Proj. (17B), growing 20x every 18 months, while GPU memory increased by just 1.45x every 18 months)
(https://devblogs.nvidia.com/training-bert-with-gpus/)
No way out but distributed!
What is different?
The rise of deep learning (DL)
The end of Moore’s Law
Apps becoming AI centric
Apps becoming AI centric...
... and integrating other distributed workloads
Before: isolated distributed workloads — HPC, Big Data, Microservices
AI integrates with other distributed workloads:
● HPC — training, simulations: distributed training using MPI; leveraging industrial simulators for RL
● Big Data — log processing, featurization: online learning, RL
● Microservices — serving, business logic: inference, online learning, web backends
Distributed apps becoming the norm
No way to scale AI but to go distributed
(Diagram: AI at the center, integrating HPC, Big Data, and Microservices, as above)
Apps becoming AI centric
(and integrating with other distributed workloads)
No general solution for end-to-end AI apps
(DL, Big Data, Microservices, HPC — ?)
Natural solution...
Stitch together existing frameworks
(Diagram: each workload — training, model serving, hyperparameter tuning, data processing, simulation, business logic — sits on top of its own separate distributed system)
RAY
Universal framework for distributed computing (Python and Java)
Libraries: model serving, training, hyperparameter tuning, data processing, simulation, business logic
(Spanning DL, Big Data, Microservices, HPC)
Three key ideas
● Execute functions remotely as tasks, and instantiate classes remotely as actors
○ Support both stateful and stateless computations
● Asynchronous execution using futures
○ Enables parallelism
● Distributed (immutable) object store — see the sketch after this list
○ Efficient communication (send arguments by reference)
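A minimal sketch of the third idea, the shared object store (the array size and the frobenius_norm task are illustrative, not from the talk, and it uses the @ray.remote task API introduced on the following slides): a large value is put into the store once and then passed to tasks by reference, so it is not copied into every call.

import numpy as np

import ray

ray.init()

# ray.put() stores the array in the distributed object store and
# returns an Object ID (a reference) instead of copying the data.
big = ray.put(np.zeros((1000, 1000)))

@ray.remote
def frobenius_norm(x):
    # Ray resolves the reference; workers on the same node read the
    # array from shared memory rather than receiving a copy.
    return float(np.linalg.norm(x))

# Send the argument by reference to several tasks.
refs = [frobenius_norm.remote(big) for _ in range(4)]
print(ray.get(refs))  # [0.0, 0.0, 0.0, 0.0]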
def read_array(file):
    # read ndarray "a"
    # from "file"
    return a

def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Function

class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()

Class
@ray.remote
def read_array(file):
    # read ndarray "a"
    # from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Function → Task

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()

Class → Actor
@ray.remote
def read_array(file):
    # read ndarray "a"
    # from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

Function → Task

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()

Class → Actor
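A short usage note, not on the slide (and assuming a GPU is available so the num_gpus=1 actor can be scheduled): the actor keeps its state across method calls, so fetching the two results shows consecutive values.

# State persists inside the actor between calls:
print(ray.get(id4))  # 1
print(ray.get(id5))  # 2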
Task API

@ray.remote
def read_array(file):
    # read ndarray "a"
    # from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

● id1, id2, and id (shown in blue on the slides) are Object IDs, similar to futures
● read_array.remote(file1) returns id1 immediately, before read_array() finishes; the two read_array tasks run on different nodes (Node 1 reads file1, Node 2 reads file2)
● The task graph is dynamic, built at runtime: add is scheduled (on Node 3) while its inputs are still being computed — every task is scheduled, but not finished yet
● ray.get(id) blocks until the result is available
● The task graph is then executed to compute sum
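A minimal sketch of the parallelism that futures enable (slow_square and its 1-second sleep are illustrative stand-ins, not from the talk): every .remote() call returns immediately, so all four tasks are submitted up front and run concurrently, and ray.get() blocks only once at the end.

import time

import ray

ray.init()

@ray.remote
def slow_square(x):
    time.sleep(1)  # stand-in for real work
    return x * x

start = time.time()
refs = [slow_square.remote(i) for i in range(4)]  # returns immediately
print(ray.get(refs))                              # [0, 1, 4, 9]
print(f"elapsed: {time.time() - start:.1f}s")     # ~1s with 4+ CPUs, not ~4s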
Best Distributed Library Ecosystem
Ray: universal framework for distributed computing

Native Libraries
● RLlib — the most popular scalable RL library
○ PyTorch and TF support
○ Largest number of algorithms
● Tune — a popular hyperparameter tuning library:
"For me, and I say this as a Hyperopt maintainer, Tune is the clear winner down the road. Tune is fairly well architected and it integrates with everything else, and it's built on top of Ray so it has all the benefits stemming from that as well. … In 2020, I would certainly bet on Tune."
-- Max Pumperla, Hyperopt maintainer
● Model serving — just launched; a promising start, but a long way to go

3rd Party Libraries
● Horovod: the most popular distributed training library
● Optuna and Hyperopt: popular hyperparameter search libraries
● The two most popular NLP libraries, using Ray to scale up
● Major ML cloud platforms embedding Ray/RLlib/Tune (e.g., ModelArts)
● Dask running on Ray — faster, more resilient, and more scalable
● A popular experiment tracking platform — one-line integration with Ray Tune
● Intel's unified Data Analytics and AI platform — integrated Ray together with Spark, TF, etc.; use cases include streaming ML with customers such as Burger King and BMW

Your apps here!
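To give a feel for how small these library APIs are, here is a minimal Ray Tune sketch under the Ray 1.x API current at the time of this talk (the objective function and search space are illustrative assumptions): tune.run() launches the trials as Ray tasks in parallel and returns an analysis object with the best configuration.

from ray import tune

def objective(config):
    # Toy objective; a real trainable would train and evaluate a model here.
    tune.report(loss=(config["x"] - 3) ** 2)

analysis = tune.run(
    objective,
    config={"x": tune.grid_search([0, 1, 2, 3, 4])},  # 5 trials, run in parallel on Ray
)
print(analysis.get_best_config(metric="loss", mode="min"))  # {'x': 3}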
Adoption
Significant community contributions:
● Refactored actor management
● Placement groups
● Java API
● Direct calls between workers
● Azure support in cluster launcher
● Original dashboard
● Refactored worker to C++
● Initial port to Bazel
Summary
Ray: universal framework for distributed computing
Comprehensive ecosystem of scalable libraries
(Ecosystem diagram: native and 3rd party libraries — your apps here!)
https://github.com/ray-project/ray
Thank you!

The Future of Computing is Distributed