Distributed computing
with Ray
Jan Margeta
PyDays Vienna, May 3, 2019
jan@kardio.me @jmargeta
Healthier hearts Waste reduction Failure prevention
Hi, I am Jan
Computer vision and machine learning
Pythonista since 2.5+
Founder of KardioMe
Distributed what?
Martin Fowler's first rule of
distributed objects: Don't
Massive complexity booster
See also Common fallacies of distributed computing
Scale up and down on
demand
Yamazaki et al. trained ImageNet in 74.7 seconds
with 2048 GPUs (March 2019)
Heterogeneous computations
[Pipeline diagram] CT or MRI image → preprocess → segment →
landmark estimation → meshing → view estimation → VR / 3D print
Mixes GPU-based machine learning, CPU-intensive operations,
a WebVR-based UI, and long-running external processes
3D printed model of your own heart
Concurrent world packed with real-time decisions
Acquisition → Processing → Visualisation
Real-time cookie quality control
Resilience cannot be
achieved with a single
machine
Concurrency and
parallelism in Python
Threads, processes, async, distributed, Dask, Celery,
PySpark…
Threads
GIL - not using all cores anyway; no direct way to get output values back (see the sketch below)
import threading

def analyze_image(im):
    return im.mean()

def process_image(im):
    return im * 5

t1 = threading.Thread(target=analyze_image, args=(im,))
t2 = threading.Thread(target=process_image, args=(im,))
t1.start()
t2.start()
t1.join()
t2.join()
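One way to get values back from threads (a minimal stdlib sketch, not on the original slide): concurrent.futures wraps threads in futures

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    future = executor.submit(analyze_image, im)
    print(future.result())  # blocks until the thread is done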
Processes
Sharing objects between processes - constant pickling
There is hope - shared memory in Python 3.8:
import multiprocessing

def analyze_image(im):
    return im.mean()

def process_image(im):
    return im * 5

p1 = multiprocessing.Process(target=analyze_image, args=(im,))
p2 = multiprocessing.Process(target=process_image, args=(im,))
p1.start()
p2.start()
p1.join()
p2.join()
https://docs.python.org/3.8/library/multiprocessing.shared_memory.html
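A minimal sketch of that Python 3.8 API (assuming a numpy image im, as above):

from multiprocessing import shared_memory
import numpy as np

# allocate a named block and back a numpy array by it
shm = shared_memory.SharedMemory(create=True, size=im.nbytes)
shared_im = np.ndarray(im.shape, dtype=im.dtype, buffer=shm.buf)
shared_im[:] = im[:]  # one copy in; other processes attach via shm.name
# ...in another process: shared_memory.SharedMemory(name=...)
shm.close()
shm.unlink()  # free the block once everyone is done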
And we are still just
running on a single
machine
Celery
from celery import Celery

app = Celery('jobs', ...)

@app.task
def compute_stuff(x, y):
    return x + y

@app.task
def another_compute_stuff(x, y):
    return x + y

from jobs import compute_stuff, another_compute_stuff

compute_stuff.delay(1, 1).get()
compute_stuff.apply_async((2, 2), link=another_compute_stuff.s(16))
compute_stuff.starmap([(2, 2), (4, 4)])
PySpark
Mature, excellent for ETL and simple queries
Great for homogeneous processing of data points
"Big Data" ecosystem in Java
R = matrix(rand(M, F)) * matrix(rand(U, F).T)
ms = matrix(rand(M, F))
us = matrix(rand(U, F))
Rb = sc.broadcast(R)
msb = sc.broadcast(ms)
usb = sc.broadcast(us)
for i in range(ITERATIONS):
    ms = sc.parallelize(range(M), partitions) \
        .map(lambda x: update(x, usb.value, Rb.value)) \
        .collect()
    ms = matrix(np.array(ms)[:, :, 0])
…
Spark barriers vs dynamic task graphs
Ray: A Distributed Execution Framework for Emerging AI Applications Michael Jordan (UC
Berkeley)
Dask
Much more "Pythonic" than Spark
Plays well with data science tools
Global scheduler → latency
https://dask.org/
import dask

@dask.delayed
def add(x, y):
    return x + y

x = add(1, 2)
y = add(x, 3)
y.compute()
Why new system?
Play well with existing tools
Scale from a laptop to a cluster
Heterogeneous code and hardware
Real-time and low-latency
Dynamically schedule tasks
Less cognitive load
Ray is a general-purpose framework for parallel and
distributed Python and a collection of libraries targeting
data processing workflows
Developed at UC Berkeley as an attempt to replace Spark
https://github.com/ray-project/ray
Unique components
Stateless tasks and actors combined
Bottom-up scheduling for low latency
Shared object store with zero copy deserialization
Clean Pythonic API
Most* of Ray's API
you will ever need
The rest is (mostly) Python as we know it
*Seriously, this is pretty much it
ray.init # connect to a Ray cluster
ray.remote # declare a task/actor & remote execution
ray.get # retrieve a Ray object and convert to a Python object
ray.put # manually place an object to the object store
ray.wait # retrieve results as they are made ready
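ray.wait deserves a quick sketch of its own (heavy_computation is a hypothetical remote task; later slides show how tasks are declared):

futures = [heavy_computation.remote(i) for i in range(10)]
while futures:
    # returns as soon as at least one result is ready
    ready, futures = ray.wait(futures, num_returns=1)
    print(ray.get(ready[0]))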
Two main abstractions
Tasks and actors
Tasks
Stateless computations
Decorate a function with ray.remote
Optionally with some extra parameters
@ray.remote
def imread(fname):
    return cv2.imread(fname)

@ray.remote(num_cpus=1, num_gpus=0, num_return_vals=2)
def segment(image, threshold=128):
    dark = image < threshold
    bright = image > threshold
    return dark, bright
Execute the task on a cluster
Append .remote
Immediately returns a future and gives back control
future = imread.remote('/data/python.png')
ObjectID(0100000067dc20383d2f04ea6cfade301eef9919)
Get the results
Schedule a computation for execution
ray.get blocks until the computation is completed
All subsequent ray.gets return almost instantly
Use the future as many times as needed
future = heavy_computation.remote()
arr = ray.get(future)
arr0 = ray.get(future)
arr1 = ray.get(future)
thumb_future = make_thumbnail.remote(future)
landmarks_future = find_landmarks.remote(future)
Actors
Mutable state and unique resources
Instantiate the actor somewhere
@ray.remote
class ParameterServer(object):
    def __init__(self, keys, values):
        values = [value.copy() for value in values]
        self.weights = dict(zip(keys, values))

    def push(self, keys, values):
        for key, value in zip(keys, values):
            self.weights[key] += value

    def pull(self, keys):
        return [self.weights[key] for key in keys]

ps = ParameterServer.remote(keys, initial_values)
Ray actor methods
always called sequentially
the only way to mutate a resource
simpler model without deadlocks
#LifeWithoutLocks
future0 = ps.push.remote(keys, grads0)
future1 = ps.push.remote(keys, grads1)
future2 = ps.pull.remote(keys)
Actors for resources
*camlib is our custom Cython-based wrapper for a vendor-specific camera library.
Check out the vendor-agnostic and open-source harvester.
@ray.remote
class Camera:
    def __init__(self, ref):
        self.cam = camlib.Camera(ref=ref)
        self.cam.open()
        self.num_frames = 0

    def grab(self):
        self.num_frames += 1
        return self.cam.grab_frame()

    def total_frames(self):
        return self.num_frames

cam = Camera.remote(ref='1337')
im_fut = cam.grab.remote()
Mix and match tasks and actors
Grab and process images from a camera
Or run a distributed SGD training
frame_id = camera.grab.remote()
segmented_id = segment.remote(frame_id)
segmented = ray.get(segmented_id)
@ray.remote
def worker(ps):
    while True:
        # Get the latest parameters
        weights = ray.get(ps.pull.remote(keys))
        # Compute an update of the params
        # (e.g. the gradients for neural nets)
        # Push the updates to the parameter server
        ps.push.remote(keys, gradients)
worker_tasks = [worker.remote(ps) for _ in range(10)]
Dynamically define the task graph by running the code
import numpy as np

@ray.remote
def aggregate_data(x, y):
    return x + y

data = [np.random.normal(size=1000) for i in range(4)]
while len(data) > 1:
    intermediate_result = aggregate_data.remote(data[0], data[1])
    data = data[2:] + [intermediate_result]
result = ray.get(data[0])
Ray - architecture
Worker & driver
Receive and execute tasks
Submit tasks to other workers
Driver is not assigned tasks for execution
Plasma - Shared
memory object store
Share objects across local processes
In-memory key-value object store
data = ['Hallo PyDays', 4, (5, 5), np.ones((128, 128))]
key = ray.put(data)
deserialized = ray.get(key)
Apache Arrow serialization
Standard objects
Numpy arrays
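For numpy arrays, deserialization is zero-copy: ray.get hands back a read-only array backed by the shared object store. A small sketch (assuming ray.init() was called):

big = np.ones((4096, 4096))
key = ray.put(big)
view = ray.get(key)  # no copy - a read-only array backed by Plasma
# view.flags.writeable is False; use view.copy() if you need to mutate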
Raylet
Local scheduler
Driver can assign a task to a worker
Bottom-up scheduling with fractional resources (see the sketch below)
No more tasks in parallel than the number of CPUs
(multithreaded libs - set the number of threads to 1)
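Fractional resources are declared on the task or actor itself; a sketch (illustrative task, not from the slides):

# Ray schedules at most two of these per GPU
@ray.remote(num_gpus=0.5)
def infer(batch):
    ...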
Global control state
Take all metadata and state out of the system
Centralize it in a Redis cluster
Everything else is largely stateless
Reschedule tasks on other machines
Fault-tolerance
Failover to other nodes based on
the global control state
Non-actors - reconstructed by lineage
Actors - replayed (experimental)
Does it scale?
[MuJoCo video]
Moritz, Nishihara et al.: Ray: A Distributed Framework
for Emerging AI Applications
OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
Setting up a Ray
cluster
On-prem set-up
Start Ray head on one of the nodes
Start Ray workers on the nodes
Connect and run commands
Tear down Ray
$ ray start --head --redis-port=6379 # head IP: 192.168.1.5
$ ray start --redis-address=192.168.1.5:6379
ray.init(redis_address="192.168.1.5:6379")
@ray.remote
def imread(filename):
return cv2.imread(filename)
ims = ray.get([imread.remote(f) for f in glob('*.png')])
$ ray stop
Make a private Ray
cluster on the cloud
Ready-made auto-scaling scripts for AWS and GCP
Set up a Ray cluster
Tear it down
or write a custom provider
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
https://ray.readthedocs.io/en/latest/autoscaling.html
Set up Ray on any
Kubernetes cluster
$ kubectl create -f ray/kubernetes/head.yaml
$ kubectl create -f ray/kubernetes/worker.yaml
https://ray.readthedocs.io/en/latest/deploy-on-kubernetes.html
1. Create a Kubernetes cluster + download kubectl
Download the kubeconfig.yaml file from the UI
2. Check that the nodes are running
3. Deploy the head and the workers
4. Wait till the pods are running
$ kubectl --kubeconfig="kubeconfig.yaml" get nodes
NAME STATUS ROLES AGE VERSION
pool-6pi4ni81f-q4dn Ready <none> 87m v1.14.1
$ kubectl --kubeconfig="kubeconfig.yaml" apply -f head.yaml
$ kubectl --kubeconfig="kubeconfig.yaml" apply -f worker.yaml
$ kubectl --kubeconfig="kubeconfig.yaml" get pods
NAME READY STATUS RESTARTS AGE
ray-head-56fdb7fdd-qtgbt 1/1 Running 0 85m
ray-worker-85454649dd-5nb8k 0/1 Pending 0 13m
...
5. Enter the head pod and run ipython
6. Profit from distributed Python
$ kubectl --kubeconfig="kubeconfig.yaml" exec -it \
    ray-head-56fdb7fdd-qtgbt -- bash
$ ipython
from collections import Counter
import time

import ray

ray.init(redis_address="localhost:6379")

@ray.remote
def get_node_ip():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

%time Counter(ray.get([get_node_ip.remote() for _ in range(100)]))
Development tools
Testing
usually trivial - inputs → outputs are well defined
pytest (see the sketch below)
Debugging
standard tools often work
pdb, pudb, web-pdb…
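A minimal pytest sketch for a Ray task (the fixture and the task are illustrative, not from the slides):

import numpy as np
import pytest
import ray

@pytest.fixture(scope="session", autouse=True)
def ray_session():
    ray.init(num_cpus=1)
    yield
    ray.shutdown()

@ray.remote
def segment(image, threshold=128):
    return image < threshold, image > threshold

def test_segment():
    image = np.full((4, 4), 200, dtype=np.uint8)
    dark, bright = ray.get(segment.remote(image))
    assert not dark.any() and bright.all()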
Generate a tracing file
with ray timeline
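Typical usage (a sketch; the exact flags and output path may differ between Ray versions):

$ ray timeline  # dumps a Chrome-tracing JSON file
# open chrome://tracing in Chrome and load the dumped file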
The ecosystem
Higher-level libs built on top of Ray
Tune - hyper-parameter optimization
RLlib - reinforcement learning
modin - distributed* Pandas
and more…
*experimental
Model hyper-parameter
tuning
Config 1 lr=0.001, n=4, act=relu
> Config 2 lr=0.1, n=5, act=elu
Tunable function
config - All tunable parameters of the function
reporter - Collector of metrics for the optimizer and for
visualization of the training in Tensorboard
def my_tunable_function(config, reporter):
    train_data, test_data = make_data_loaders(config)
    model = make_model(config)
    trainer = make_optimizer(model, config)
    for epoch in range(10):  # Could be an infinite loop too
        train(model, trainer, train_data)
        accuracy = evaluate(model, test_data)
        reporter(mean_accuracy=accuracy)
Class-based tunable API
Support for model checkpointing and restoration.
class MyTunableClass(Trainable):
    def _setup(self, config):
        self.train_data, self.test_data = make_data_loaders(config)
        self.model = make_model(config)
        self.trainer = make_optimizer(self.model, config)

    def _train(self):
        train_for_a_while(self.model, self.train_data, self.trainer)
        return {"mean_accuracy": eval_model(self.model, self.test_data)}

    def _save(self, checkpoint_dir):
        return save_model(self.model, checkpoint_dir)

    def _restore(self, checkpoint_path):
        self.model.load_state_dict(checkpoint_path)
Define the parameter space
Register the trainable function
Launch hyper-parameter search
Consider extracting your argparse arguments
spec = {
    "stop": {
        "mean_accuracy": 0.995,
        "time_total_s": 600,
    },
    "config": {
        "activation": tune.grid_search(["relu", "elu", "tanh"]),
        "learning_rate": tune.grid_search([0.001, 0.01, 0.1]),
    },
}

tune.register_trainable("train_imagenet", my_tunable_function)
tune.run("train_imagenet", name="tune_imagenet_test", **spec)
See the progress and compare the models with TensorBoard
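Tune writes its results under ~/ray_results by default, so (a sketch):

$ tensorboard --logdir ~/ray_results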
Reinforcement learning
https://gym.openai.com/ https://ray.readthedocs.io/en/latest/rllib.html
Wrapping OpenAI gym environments
in actors
import gym

@ray.remote
class Simulator:
    def __init__(self):
        self.env = gym.make("SpaceInvaders-v0")
        self.env.reset()

    def step(self, action):
        return self.env.step(action)

simulator = Simulator.remote()

# Take actions in the simulator
observations = []
observations.append(simulator.step.remote(0))
observations.append(simulator.step.remote(1))
Speed-up your Pandas
With a single-line-of-code change
import modin.pandas as pd
Modin automatically partitions and
distributes your data frames
Earlier stage: 71% of the Pandas API covered; the rest falls back to Pandas
https://github.com/modin-project/modin
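After the import swap the rest of the code stays plain Pandas; a sketch (hypothetical file and column names):

import modin.pandas as pd

df = pd.read_csv("measurements.csv")  # read and partitioned in parallel
df.groupby("sensor_id").mean()        # computed across all cores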
And some more…
Check out the detailed docs, examples, code
Serial vs. parallel and distributed
Remember
def heavy_computation(x):
    # do something nice here
    return x

results = [
    heavy_computation(i)
    for i in range(100)
]
@ray.remote
def heavy_computation(x):
    # do something nice here
    return x

ray.init()
results = ray.get([
    heavy_computation.remote(i)
    for i in range(100)
])
Conclusion
Simple API with tasks and actors
A sane local alternative to threads and processes
Use the same code locally and on a cluster
Growing ecosystem of libraries
Ray has fantastic docs and tutorials
pip install ray
Thanks!
Distributed computing
with Ray
Jan Margeta
May 3, 2019
jan@kardio.me @jmargeta
References
Seven concurrency models in seven weeks - Butcher
A note on distributed computing - Waldo J. et al.
The free lunch is over - Herb Sutter
Fallacies of distributed computing explained - Rotem-Gal-Oz
Fallacies of distributed computing - P. Deutsch
Ray docs
Ray tutorial
Plasma store
Plasma store and Arrow
Scaling Python modules with Ray framework
References
Ray - a cluster computing engine for reinforcement learning applications
https://ray-project.github.io/2018/07/15/parameter-server-in-fifteen-lines.html
Ray: A Distributed Execution Framework for AI | SciPy
2018 - Robert Nishihara
Dask and Celery - M. Rocklin
Dask comparison to Spark
Ray: A Distributed System for AI
Resources
My referral link to $100 at DigitalOcean for 60 days
