The Future of Computing is Distributed
Ion Stoica
December 4, 2020
(Timeline: ARPA Network (1970s) → HPC (1980s) → Web (1990s) → Big Data (2000s))
Distributed systems not new…
Distributed computing still the exception…
Inaccessible to most developers
Few universities teach distributed computing
This will change…
Distributed computing will be the norm, rather than the exception
What is different this time?
The rise of deep learning (DL)
The end of Moore’s Law
Apps becoming AI centric
DL demands growing faster than ever: 35x every 18 months
(https://openai.com/blog/ai-and-compute/)
Not only esoteric apps
(Chart: image processing — ResNet-50, ResNeXt-101 — and translation)
(Chart: model sizes, from ResNet-50, Transformer, GPT-1, and BERT up to GPT-2, Turing Proj. (17B), and GPT-3 (175B) — growing 20x every 18 months)
Not only processing, but memory
(Chart: model memory requirements, from ResNet-50 and Transformer up to Turing Proj. (17B) and GPT-3 (175B) — growing 200x every 18 months!)
What is different?
The rise of deep learning (DL)
The end of Moore’s Law
Apps becoming AI centric
The end of Moore's Law:
from 2x every 18 months to 1.05x every 18 months.
Hardware cannot keep up
(Chart: DL compute demand, 35x every 18 months, vs. CPU performance under Moore's Law, 2x every 18 months)
(https://openai.com/blog/ai-and-compute/)
What about specialized hardware?
Trade generality for performance
Not enough! It just extends Moore's Law: demand grows 35x every 18 months, so even hardware that doubles in performance every generation falls further behind.
Specialized hardware not enough
(Chart: DL demand, 35x every 18 months, vs. Moore's Law, 2x every 18 months — CPU, GPU, TPU)
(https://openai.com/blog/ai-and-compute/)
Memory dwarfed by demand
(Chart: model memory requirements, from ResNet-50 and Transformer up to Turing Proj. (17B), growing 20x every 18 months, while GPU memory increased by just 1.45x every 18 months)
(https://devblogs.nvidia.com/training-bert-with-gpus/)
No way out but distributed!
What is different?
The rise of deep learning (DL)
The end of Moore’s Law
Apps becoming AI centric
Apps becoming AI centric...
... and integrating other distributed workloads
Before: isolated distributed workloads — HPC, Big Data, Microservices
AI integrates with other distributed workloads:
● HPC — training, simulations: distributed training using MPI; leveraging industrial simulators for RL
● Big Data — log processing, featurization: online learning, RL
● Microservices — serving, business logic: inference, online learning, web backends
Distributed apps becoming the norm
No way to scale AI but to go distributed
(Diagram: AI at the center, integrating HPC, Big Data, and Microservices, as above)
Apps becoming AI centric
(and integrating with other distributed workloads)
No general solution for end-to-end AI apps
(DL, Big Data, Microservices, HPC — ?)
Natural solution...
Stitch together existing frameworks
(Diagram: each workload — training, model serving, hyperparameter tuning, data processing, simulation, business logic — sits on top of its own separate distributed system)
RAY
Universal framework for distributed computing (Python and Java)
Libraries: model serving, training, hyperparameter tuning, data processing, simulation, business logic
(Spanning DL, Big Data, Microservices, HPC)
Three key ideas
● Execute functions remotely as tasks, and instantiate classes remotely as actors
○ Support both stateful and stateless computations
● Asynchronous execution using futures
○ Enables parallelism
● Distributed (immutable) object store — see the sketch after this list
○ Efficient communication (send arguments by reference)
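A minimal sketch of the third idea, the shared object store (the array size and the frobenius_norm task are illustrative, not from the talk, and it uses the @ray.remote task API introduced on the following slides): a large value is put into the store once and then passed to tasks by reference, so it is not copied into every call.

import numpy as np

import ray

ray.init()

# ray.put() stores the array in the distributed object store and
# returns an Object ID (a reference) instead of copying the data.
big = ray.put(np.zeros((1000, 1000)))

@ray.remote
def frobenius_norm(x):
    # Ray resolves the reference; workers on the same node read the
    # array from shared memory rather than receiving a copy.
    return float(np.linalg.norm(x))

# Send the argument by reference to several tasks.
refs = [frobenius_norm.remote(big) for _ in range(4)]
print(ray.get(refs))  # [0.0, 0.0, 0.0, 0.0]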
def read_array(file):
    # read ndarray "a"
    # from "file"
    return a

def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Function

class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()

Class
@ray.remote
def read_array(file):
    # read ndarray "a"
    # from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Function → Task

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()

Class → Actor
@ray.remote
def read_array(file):
    # read ndarray "a"
    # from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

Function → Task

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()

Class → Actor
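A short usage note, not on the slide (and assuming a GPU is available so the num_gpus=1 actor can be scheduled): the actor keeps its state across method calls, so fetching the two results shows consecutive values.

# State persists inside the actor between calls:
print(ray.get(id4))  # 1
print(ray.get(id5))  # 2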
Task API

@ray.remote
def read_array(file):
    # read ndarray "a"
    # from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

● id1, id2, and id (shown in blue on the slides) are Object IDs, similar to futures
● read_array.remote(file1) returns id1 immediately, before read_array() finishes; the two read_array tasks run on different nodes (Node 1 reads file1, Node 2 reads file2)
● The task graph is dynamic, built at runtime: add is scheduled (on Node 3) while its inputs are still being computed — every task is scheduled, but not finished yet
● ray.get(id) blocks until the result is available
● The task graph is then executed to compute sum
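A minimal sketch of the parallelism that futures enable (slow_square and its 1-second sleep are illustrative stand-ins, not from the talk): every .remote() call returns immediately, so all four tasks are submitted up front and run concurrently, and ray.get() blocks only once at the end.

import time

import ray

ray.init()

@ray.remote
def slow_square(x):
    time.sleep(1)  # stand-in for real work
    return x * x

start = time.time()
refs = [slow_square.remote(i) for i in range(4)]  # returns immediately
print(ray.get(refs))                              # [0, 1, 4, 9]
print(f"elapsed: {time.time() - start:.1f}s")     # ~1s with 4+ CPUs, not ~4s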
Best Distributed Library Ecosystem
Ray: universal framework for distributed computing

Native Libraries
● RLlib — the most popular scalable RL library
○ PyTorch and TF support
○ Largest number of algorithms
● Tune — a popular hyperparameter tuning library:
"For me, and I say this as a Hyperopt maintainer, Tune is the clear winner down the road. Tune is fairly well architected and it integrates with everything else, and it's built on top of Ray so it has all the benefits stemming from that as well. … In 2020, I would certainly bet on Tune."
-- Max Pumperla, Hyperopt maintainer
● Model serving — just launched; a promising start, but a long way to go

3rd Party Libraries
● Horovod: the most popular distributed training library
● Optuna and Hyperopt: popular hyperparameter search libraries
● The two most popular NLP libraries, using Ray to scale up
● Major ML cloud platforms embedding Ray/RLlib/Tune (e.g., ModelArts)
● Dask running on Ray — faster, more resilient, and more scalable
● A popular experiment tracking platform — one-line integration with Ray Tune
● Intel's unified Data Analytics and AI platform — integrated Ray together with Spark, TF, etc.; use cases include streaming ML with customers such as Burger King and BMW

Your apps here!
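To give a feel for how small these library APIs are, here is a minimal Ray Tune sketch under the Ray 1.x API current at the time of this talk (the objective function and search space are illustrative assumptions): tune.run() launches the trials as Ray tasks in parallel and returns an analysis object with the best configuration.

from ray import tune

def objective(config):
    # Toy objective; a real trainable would train and evaluate a model here.
    tune.report(loss=(config["x"] - 3) ** 2)

analysis = tune.run(
    objective,
    config={"x": tune.grid_search([0, 1, 2, 3, 4])},  # 5 trials, run in parallel on Ray
)
print(analysis.get_best_config(metric="loss", mode="min"))  # {'x': 3}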
Adoption
Significant community contributions:
● Refactored actor management
● Placement groups
● Java API
● Direct calls between workers
● Azure support in cluster launcher
● Original dashboard
● Refactored worker to C++
● Initial port to Bazel
Summary
Ray: universal framework for distributed computing
Comprehensive ecosystem of scalable libraries
(Ecosystem diagram: native and 3rd party libraries — your apps here!)
https://github.com/ray-project/ray
Thank you!

The Future of Computing is Distributed