Building an ML Platform with Ray and MLflow

Amog Kamsetty and Archit Kulkarni
Ray Team @ Anyscale

The Team
Archit Kulkarni Amog Kamsetty Dmitri Gekhtman Edward Oakes
Richard Liaw Kai Fricke Simon Mo
Kathryn Zhou

Overview of Talk
▪ What are ML Platforms?
▪ Ray and its libraries
▪ MLflow
▪ Demo: An ML Platform built with MLflow and Ray

Typical ML Process
(Diagram labels: Fuzzy search!; NLP, DL …)

Typical ML Process -- Simplified
Execution
- Feature engineering
- Training
  - Including tuning
- Serving
  - Offline scoring, inference
  - Online serving
Management
- Tracking
  - Data, Code, Configurations
- Reproducing Results
- Deployment
  - Deploy in a variety of environments

Challenges with the ML Process
Data/Features
• Data Preparation
• Data Analysis
• Feature Engineering
• Data Pipeline
• Data Management/Feature Store
• Manages big data clusters
Model
• ML Expertise
• Implement SOTA ML Research
• Experimentation
• Manage GPU infrastructure
• Scalable training & hyperparameter tuning
Production
• A/B Testing
• Model Evaluation
• Analysis of Predictions
• Deploy in a variety of environments
• CI/CD
• Highly Available prediction service
Data/Research Scientist
Engineers

Challenges with the ML Process
Data
• Data Preparation
• Data Analysis
• Feature Engineering
• Data Pipeline
• Data Management/Feature Store
• Manages big data clusters
Model
• ML Expertise
• Implement SOTA ML Research
• Experimentation
• Manage GPU infrastructure
• Scalable training & hyperparameter tuning
Production
• A/B Testing
• Model Evaluation
• Analysis of Predictions
• Deploy in a variety of environments
• CI/CD
• Highly Available prediction service
Data/Research Scientist
Software/Data/ML Engineer
ML Platform Abstraction

ML Platforms -- Scale
- LinkedIn:
  - 500+ “AI engineers” building models; 50+ MLP engineers
  - > 50% offline compute demand (12K servers each with 256 GB RAM)
  - More than 2x a year
- Uber Michelangelo, Airbnb Bighead, Facebook FBLearner, etc.
- Globally, a few billion $ now, growing 40%+ YoY
- Many companies building ML Platforms from the ground up

ML Platforms -- Landscape
(Source: Intel Capital)

Execution
- Feature engineering 🔪
- Training 🍳
  - Including tuning 🧂
- Serving 🍽
  - Online serving
Management
- Tracking 📝
- Reproducing Results 📖
- Deployment 🚚 💻
  - Deploy in a variety of environments

What is Ray?
• A simple/general library for distributed computing
  • Single machine or 100s of nodes
  • Agnostic to the type of work
• An ecosystem of libraries (for scaling ML and more)
  • Native: Ray RLlib, Ray Tune, Ray Serve
  • Third party: Modin, Dask, Horovod, XGBoost, PyTorch Lightning
• Tools for launching clusters on any cloud provider

Three key ideas
Execute remote functions as tasks, and instantiate remote classes as actors
• Support both stateful and stateless computations
Asynchronous execution using futures
• Enable parallelism
Distributed (immutable) object store
• Efficient communication (send arguments by reference)

API
Functions -> Tasks
import numpy as np
def read_array(file):
    a = np.load(file)  # read array "a" from "file"
    return a
def add(a, b):
    return np.add(a, b)

API
Functions -> Tasks
import numpy as np
import ray
@ray.remote
def read_array(file):
    a = np.load(file)  # read array "a" from "file"
    return a
@ray.remote
def add(a, b):
    return np.add(a, b)

API
Functions -> Tasks
@ray.remote
def read_array(file):
    a = np.load(file)  # read array "a" from "file"
    return a
@ray.remote
def add(a, b):
    return np.add(a, b)
id1 = read_array.remote("/input1")  # returns a future immediately
id2 = read_array.remote("/input2")
id3 = add.remote(id1, id2)  # futures can be passed to other tasks
ray.get(id3)  # blocks until the result is ready
(Task graph: read_array -> id1, read_array -> id2, add(id1, id2) -> id3)

API
Functions -> Tasks
@ray.remote
def read_array(file):
    a = np.load(file)  # read array "a" from "file"
    return a
@ray.remote
def add(a, b):
    return np.add(a, b)
Classes -> Actors
@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value
c = Counter.remote()  # creates the actor
id4 = c.inc.remote()  # method calls also return futures
id5 = c.inc.remote()
ray.get([id4, id5])

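API
Functions -> Tasks
@ray.remote
def read_array(file):
    a = np.load(file)  # read array "a" from "file"
    return a
@ray.remote(num_gpus=1)  # request one GPU per task
def add(a, b):
    return np.add(a, b)
Classes -> Actors
@ray.remote(num_gpus=1)  # request one GPU for the actor
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value
c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()
ray.get([id4, id5])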

Ray Ecosystem
Universal framework for distributed computing
Ecosystem: Native Libraries, 3rd Party Libraries, “Your app here!”

Ray Tune: Scalable Hyperparameter Tuning
Wide variety of algorithms: HyperBand, PBT, Bayesian optimization
Compatible with ML frameworks

Ray Tune focuses on simplifying execution
Easily launch distributed multi-GPU tuning jobs
Automatic fault tolerance to save 3x on GPU costs
$ ray up {cluster config}
ray.init(address="auto")
tune.run(func, num_samples=100)

Ray Tune interoperates with other HPO libraries
Ray Tune
Ax
Optuna
scikit-optimize
…

def train_model(config={}):
    model = ConvNet(config)
    for i in range(steps):
        current_loss = model.train()

from ray import tune
def train_model(config={}):
    model = ConvNet(config)
    for i in range(steps):
        current_loss = model.train()
        tune.report(loss=current_loss)  # send the metric to Ray Tune

def train_model(config):
    model = ConvNet(config)
    for i in range(epochs):
        current_loss = model.train()
        tune.report(loss=current_loss)
tune.run(train_model,
         config={"lr": 0.1})

tune.run(
    train_model,
    config={"lr": tune.uniform(0.001, 0.1)},
    num_samples=100
)
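Conceptually, sampling `lr` from a uniform range 100 times amounts to a random search. The sketch below shows that idea in plain Python, without Ray, so it runs anywhere. It is not how Ray Tune is implemented, and the quadratic `train_model` is a hypothetical stand-in for a real training run.

```python
import random

def train_model(config):
    # Hypothetical stand-in for training: the loss is lowest near lr = 0.01.
    return (config["lr"] - 0.01) ** 2

def random_search(num_samples, low, high, seed=0):
    rng = random.Random(seed)
    trials = []
    for _ in range(num_samples):
        # Draw one configuration, as tune.uniform(low, high) would per trial.
        config = {"lr": rng.uniform(low, high)}
        trials.append((train_model(config), config))
    # Return the (loss, config) pair with the smallest loss.
    return min(trials, key=lambda trial: trial[0])

best_loss, best_config = random_search(num_samples=100, low=0.001, high=0.1)
```

Ray Tune adds what this loop lacks: running the trials in parallel across a cluster and tracking their results.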

from ray.tune.schedulers import ASHAScheduler
tune.run(
    train_model,
    num_samples=100,
    scheduler=ASHAScheduler(metric="loss", mode="min"))
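A scheduler such as ASHA stops unpromising trials early instead of training all 100 to completion. The plain-Python sketch below shows the underlying idea, successive halving, in a simplified synchronous form. It is not Ray Tune's asynchronous implementation, and `loss_after` is a hypothetical learning curve invented for illustration.

```python
import random

def loss_after(config, steps):
    # Hypothetical learning curve: the loss shrinks with more steps and
    # settles at a floor determined by how far lr is from 0.01.
    return abs(config["lr"] - 0.01) + 1.0 / (steps + 1)

def successive_halving(configs, min_steps=1, reduction_factor=2):
    survivors = list(configs)
    steps = min_steps
    while len(survivors) > 1:
        # Rank the surviving trials by their loss at the current budget.
        ranked = sorted(survivors, key=lambda c: loss_after(c, steps))
        # Keep the best fraction and give the survivors a larger budget.
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
        steps *= reduction_factor
    return survivors[0]

rng = random.Random(0)
configs = [{"lr": rng.uniform(0.001, 0.1)} for _ in range(16)]
best = successive_halving(configs)
```

With 16 starting configurations and a reduction factor of 2, only one trial receives the largest training budget.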

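from ray.tune.schedulers import PopulationBasedTraining
tune.run(
    train_model,
    num_samples=100,
    scheduler=PopulationBasedTraining(...))

import os
def train_model(config, checkpoint_dir=None):
    model = ConvNet(config)
    if checkpoint_dir is not None:  # resume from a saved checkpoint
        model.load_checkpoint(os.path.join(checkpoint_dir, "model.pt"))
    for i in range(epochs):
        current_loss = model.train()
        with tune.checkpoint_dir(step=i) as ckpt_dir:
            model.save_checkpoint(os.path.join(ckpt_dir, "model.pt"))
        tune.report(loss=current_loss)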

Ray Serve is a Web Framework Built for Model Serving

Ray Serve is high-performance and flexible
• Framework-agnostic
• Easily scales
• Supports batching
• Query your endpoints from HTTP and from Python
• Easily integrate with other tools

Ray Serve is built on top of Ray
As a user, you don’t need to think about:
• Interprocess communication
• Failure management
• Scheduling
Just tell Ray Serve to scale up your model.

Ray Serve API
Serve functions and stateful classes.
Ray Serve will use multiple replicas to parallelize across cores and across nodes in your cluster.

Flexibility
Query your model from HTTP:
> curl "http://127.0.0.1:8000/my/route"
Or query from Python using ServeHandle:

Challenges of ML in production
• It’s difficult to keep track of experiments.
• It’s difficult to reproduce code.
• There’s no standard way to package and deploy models.
• There’s no central store to manage models (their versions and stage transitions).
Source: mlflow.org

What is MLflow?
• Open-source ML lifecycle management tool
• Single solution for all of the above challenges
• Library-agnostic and language-agnostic
• (Works with your existing code)

Four key functions of MLflow
Source: MLflow

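Ray Tune + MLflow Tracking
from ray.tune.integration.mlflow import MLflowLoggerCallback  # Ray 1.x import path
tune.run(
    train_model,
    num_samples=100,
    callbacks=[MLflowLoggerCallback(experiment_name="my_experiment")])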

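Ray Tune + MLflow Tracking
from ray.tune.integration.mlflow import mlflow_mixin  # Ray 1.x import path
@mlflow_mixin
def train_model(config):
    mlflow.autolog()
    xgboost_results = xgb.train(config, ...)
tune.run(
    train_model,
    num_samples=100)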

MLflow + Ray Serve
> pip install mlflow-ray-serve
> ray start --head
> serve start

MLflow deployments CLI
Create deployment
> mlflow deployments create -t ray-serve -m <model URI> \
    --name my_model -C num_replicas=100
Model URI:
• models:/MyModel/1
• runs:/93203689db9c4b50afb6869
• s3://<bucket>/<path>
• ...

MLflow deployments Python API
Create model

Integrating with Ray Serve is easy.
• Ray Serve endpoints can be called from Python.
• Clean conceptual separation:
  • Ray Serve handles data plane (processing)
  • MLflow handles control plane (metadata, configuration)

Demo: An ML Platform built with MLflow and Ray

Acknowledgements
Thanks to Jules Damji, Sid Murching, and Paul Ogilvie for their help and guidance with MLflow.
Thanks to Dmitri Gekhtman, Kai Fricke, Simon Mo, Edward Oakes, Richard Liaw, Kathryn Zhou, and the rest of the Ray team!

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
