This talk was given at the Cologne AI and Machine Learning Meetup on April 13, 2023 (https://www.meetup.com/de-DE/cologne-ai-and-machine-learning-meetup/events/291513393/) by Dr. Andreas Weiden, Co-Lead Cloud / Data Engineering at skillbyte: MLOps pipelines using MLFlow - From training to production
In this talk we explore the world of MLOps pipelines and how MLFlow can be used to facilitate workflows for getting your machine learning models from training to production. We will briefly delve into the tracking aspects of MLFlow and how to store experiments and runs. Next, we will move on to an actual use case that involves managing artifacts generated by multiple training pipelines running on a daily schedule. These artifacts are used in prediction services, but also in managed vector search engines such as Elasticsearch and Google Vertex AI. A simple microservice that polls the MLFlow registry is used both to update REST APIs running in Kubernetes and to ingest the models into the vector search services. Finally, we will compare the different alternatives that were considered.
MLOps pipelines using MLFlow - From training to production
1. MLOps pipelines using MLFlow - From training to production
Dr. Andreas Weiden, skillbyte
CAIML#24, April 13th, 2023
2. Problem Description
Team A, a (majority) Data Scientist team, creates many machine learning models
Team B, a (majority) Data Engineer team, needs to deploy these models into production and use them for a recommender system
So two problems:
1. Technical
Need to deploy these models to multiple targets (→ this talk)
2. Organizational
Need to make two teams work together (→ not this talk)
3. MLOps
Got its name from DevOps and GitOps
Continuous training and deployment for machine learning systems
ml-ops.org
4. Pipelines
Created and maintained by Team A
Daily training runs, since fresh data is constantly coming in
Output various artifacts which are needed by the prediction services, run by Team B
Examples of what the Data Science Magic™ can be:
popularity of items
user embeddings from user-item interactions
item embeddings from item descriptions
Need somewhere to store the outputs of those pipelines
And deploy them, too
5. Manage end-to-end machine learning lifecycle
Open source: GitHub
Four pillars:
Tracking
Log parameters, code versions,
metrics, artifacts
Projects
Models
Registry
Basic unit is a Run
Whenever your pipeline runs, a new Run is created
Runs can be grouped under Experiments
You can add arbitrary data to a Run as well as the output artifacts
import datetime
import pickle
import random
from tempfile import TemporaryDirectory

import mlflow
import numpy as np

# Runs are grouped under an Experiment
mlflow.set_experiment("Pipeline A")
run_name = f"Pipeline A {datetime.datetime.now().isoformat()}"
tags = {"version": "0.0.1"}
with mlflow.start_run(run_name=run_name, tags=tags) as run:
    # Arbitrary parameters and metrics attached to the Run
    mlflow.log_param("ndims", 1024)
    mlflow.log_metric("recall", random.random())
    # Output artifacts (here: pickled embeddings) logged to the Run
    with TemporaryDirectory() as temp_dir:
        with open(f"{temp_dir}/out.pickle", "wb") as f:
            pickle.dump([np.random.rand(1024) for _ in range(100)], f)
        mlflow.log_artifacts(temp_dir)
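On the consuming side, a service can query these runs and pull the artifacts back down. A minimal sketch, assuming MLFlow 2.x and the same tracking server; the experiment name matches the snippet above:

import mlflow

# Most recent run of the experiment (assumes at least one run exists)
runs = mlflow.search_runs(
    experiment_names=["Pipeline A"],
    order_by=["attributes.start_time DESC"],
    max_results=1,
)
run_id = runs.iloc[0]["run_id"]

# Download the run's logged artifacts to a local directory
local_path = mlflow.artifacts.download_artifacts(run_id=run_id)
print(f"Artifacts for run {run_id} are in {local_path}")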
7. Manage end-to-end machine learning lifecycle
Open source: GitHub
Four pillars:
Tracking
Log parameters, code versions,
metrics, artifacts
Projects
Package Data Science code including
dependencies
Git and containerization already do this, if you lock your dependencies (which you should)
Models
Registry
8. Manage end-to-end machine learning lifecycle
Open source: GitHub
Four pillars:
Tracking
Log parameters, code versions,
metrics, artifacts
Projects
Package Data Science code including
dependencies
Git and containerization already do this, if you lock your dependencies (which you should)
Models
Package ML models and deploy them
Containers and/or model artifacts
Registry
A standard format for packaging machine learning models that can be used in a variety of downstream tools, e.g.
real-time serving through a REST API
batch inference on Apache Spark
Saves model-specific data and environment data:
# Directory written by mlflow.sklearn.save_model(model, "my_model")
my_model/
├── MLmodel
├── model.pkl
├── conda.yaml
├── python_env.yaml
└── requirements.txt
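A minimal sketch of the round trip behind that directory, assuming a scikit-learn model; any downstream tool can then load it through the generic pyfunc interface:

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Writes the my_model/ directory shown above (MLmodel, model.pkl, env files)
mlflow.sklearn.save_model(model, "my_model")

# Load it back in a flavor-agnostic way, e.g. for real-time serving
loaded = mlflow.pyfunc.load_model("my_model")
print(loaded.predict(X[:5]))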
9. Manage end-to-end machine learning lifecycle
Open source: GitHub
Four pillars:
Tracking
Log parameters, code versions,
metrics, artifacts
Projects
Package Data Science code including
dependencies
Git and containerization already do this, if you lock your dependencies (which you should)
Models
Package ML models and deploy them
Containers and/or model artifacts
Registry
Model storage and lifecycle (versioning, stage transitions)
Can associate runs with a Model:
import mlflow

with mlflow.start_run() as run:
    ...
    # The model URI points at the model's artifact path within the Run
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "Model A")
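Versioning and stage transitions are then driven through the client API. A minimal sketch, assuming a model named "Model A" has been registered as above:

from mlflow import MlflowClient

client = MlflowClient()

# Every register_model call creates a new version; promote the newest one
latest = client.get_latest_versions("Model A", stages=["None"])[0]
client.transition_model_version_stage(
    name="Model A",
    version=latest.version,
    stage="Production",
)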
10. Manage end-to-end machine learning lifecycle
Open source: GitHub
Four pillars:
Tracking
Log parameters, code versions,
metrics, artifacts
Projects
Package Data Science code including
dependencies
Git and containerization already do this, if you lock your dependencies (which you should)
Models
Package ML models and deploy them
Containers and/or model artifacts
Registry
Model storage and lifecycle (versioning, stage transitions)
Sounds interesting, but only very rudimentary
12. Not so fast…
… managed vector stores are also a thing
They give you e.g.
pre-filtering
a full-blown query syntax
Quite a few options exist nowadays
Elasticsearch
Google Vertex AI
RediSearch
Milvus
…
→ Need a generic way to deploy machine learning artifacts to multiple targets
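As an illustration of what these targets offer, a hedged sketch of ingesting and querying embeddings with the Elasticsearch 8.x Python client; the index name, dimensions, and filter field are made up:

import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical index with a dense_vector field for kNN search
es.indices.create(index="item-embeddings", mappings={
    "properties": {
        "embedding": {"type": "dense_vector", "dims": 1024,
                      "index": True, "similarity": "cosine"},
        "category": {"type": "keyword"},
    },
})

es.index(index="item-embeddings", id="item-1",
         document={"embedding": np.random.rand(1024).tolist(),
                   "category": "books"})

# kNN query with pre-filtering on a keyword field
results = es.search(index="item-embeddings", knn={
    "field": "embedding",
    "query_vector": np.random.rand(1024).tolist(),
    "k": 10,
    "num_candidates": 100,
    "filter": {"term": {"category": "books"}},
})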
13. Watcher
Simple microservice that periodically polls the MLFlow registry
Pushes the updated artifacts to all ML deployment options
Uploads embeddings to managed databases
Updates ConfigMap definitions with the correct Run ID per model and environment
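The core loop of such a watcher might look roughly like this; a minimal sketch with hypothetical names, assuming models are promoted to the Production stage in the MLFlow registry and the service runs inside the cluster:

import time

from kubernetes import client as k8s, config
from mlflow import MlflowClient

mlflow_client = MlflowClient()
config.load_incluster_config()  # assumes the watcher runs in-cluster
core_v1 = k8s.CoreV1Api()

# Hypothetical mapping of registered models to their ConfigMaps
WATCHED_MODELS = {"Model A": "foo-configmap"}

while True:
    for model_name, configmap_name in WATCHED_MODELS.items():
        # Newest version promoted to Production in the registry
        version = mlflow_client.get_latest_versions(
            model_name, stages=["Production"])[0]
        # Patching the ConfigMap lets the Reloader (next slide) restart
        # the deployments that consume it
        core_v1.patch_namespaced_config_map(
            name=configmap_name,
            namespace="default",
            body={"data": {"RUN_ID": version.run_id}},
        )
        # (uploading embeddings to the vector stores would go here too)
    time.sleep(300)  # poll every five minutes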
15. ConfigMap Reloader
https://github.com/stakater/Reloader
Ensures that all deployments that rely on a ConfigMap get restarted whenever that ConfigMap changes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
  annotations:
    configmap.reloader.stakater.com/reload: "foo-configmap"
spec:
  # (selector and pod labels omitted for brevity)
  template:
    spec:
      containers:
        - name: foo
          image: foo:0.0.1
          env:
            - name: MLFLOW_RUN_ID
              valueFrom:
                configMapKeyRef:
                  name: foo-configmap
                  key: RUN_ID
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: foo-configmap
data:
  RUN_ID: "1234abcde"
16. Alternatives
Possible alternatives that we considered:
Model deployment directly through MLFlow
Seldon Core
AWS SageMaker
(Got more? Let me know!)
However, they all have the same drawbacks:
No control over final images
Image size not optimised
Custom logging, metrics, tracing, … difficult
No control over API
Only deploy to REST APIs, but we also want other targets
17. Summary
Embeddings and models are centrally produced → Need some central model storage
Each of the targets supports training and/or deploying ML models individually; none of them supports doing so for all targets
→ If your needs are diverse enough, you may need to roll your own (ML deployment)
→ But use existing tools where applicable