MLflow serving is a great way to deploy any model as a REST API endpoint and start experimenting. But what about taking it to the next level? What if we want to deploy our application to production just like any other server, in a containerized environment? What about adding custom middlewares, monitoring, logging, and tuning performance for high scale?
2. Agenda
▪ The Yotpo Use Case - using mlflow serving to deploy and serve AI models that enrich and analyze images via REST API
▪ The Challenges with ML in Production - tons of repetitive infrastructure work, distributed across multiple teams
▪ Mlflow Serving - why it's awesome, and what we did to make it production ready
3. Yotpo at a glance
eCommerce Marketing Platform Overview
Founded 2011 | Raised $176,000,000
Paying customers: 10,000+ | Employees: 524 | Offices: 8 (New York, Tel Aviv, London, Sofia, Boston, Philippines, Yokneam, Modiin)
Products: SMS, Reviews, Loyalty & Referral, Visual
4. Use Case - ML models to enrich Visual Content
for the Visual Marketing Suite
5. Visual Marketing
▪ Combine customer photos, videos, and reviews to create product galleries based on UGC
▪ Leverage visual content to create a better on-site experience and engaging marketing campaigns
6. Smart Filters Module
Extend the filters in the VMS moderation page with AI-based visual filters.
▪ Shop owners curate images from Instagram
▪ Filter them based on:
  ▪ Dominant colors
  ▪ Contains people
  ▪ Image quality - noise, under-exposed, low res
  ▪ Scenery - outdoor/indoor
  ▪ Smiling people
▪ Choose which images to display in their onsite widget gallery
▪ Possibilities are endless - adding more models should be a breeze
9. Multiple teams and contexts
Common functional silos in large organizations can create barriers, stifling the ability to automate the end-to-end process of deploying ML applications to production.
10. Challenges
▪ Vast and complex multi-layered infrastructure
  ▪ Data Processing & Preparation
  ▪ Feature Extraction
  ▪ Training Infrastructure
  ▪ Serving Infrastructure
  ▪ Model -> REST API. Repeat!
▪ Organizational structure
  ▪ Self-contained teams
  ▪ End-to-end efforts with full accountability & ownership
▪ Dev process
  ▪ Large and cumbersome "hand-overs"
  ▪ "Waterfall" instead of "Agile"
12. MLflow - an open source platform for the machine learning lifecycle
▪ MLflow Tracking
  ▪ Record and query experiments: code, data, config, and results
▪ MLflow Projects
  ▪ Package data science code in a format to reproduce runs on any platform
▪ MLflow Models
  ▪ Deploy machine learning models in diverse serving environments
▪ Model Registry
  ▪ Store, annotate, discover, and manage models in a central repository
* Solved a lot of problems for us
13. Serving - a built-in module in mlflow models to deploy ML models as REST API endpoints
▪ Generic and model agnostic (pyfunc flavor)
▪ Can be automated to streamline any model deployment!
▪ The out-of-the-box server accepts the following data formats as POST input to the /invocations path:
  ▪ JSON-serialized pandas DataFrames (split orientation)
  ▪ JSON-serialized pandas DataFrames in the records orientation
  ▪ CSV-serialized pandas DataFrames
▪ Running using the mlflow CLI
  ▪ Good for testing things locally and experimenting
  ▪ mlflow models serve -m runs:/my-run-id/model-path
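As a quick illustration (not from the deck), posting a JSON-serialized DataFrame to a locally running server could look like the sketch below; the host, port and the image_url column are made up, and the pandas-split content type matches mlflow 1.x behavior:

import pandas as pd
import requests

# One-row DataFrame with the input the model expects (illustrative column name)
df = pd.DataFrame({"image_url": ["https://example.com/photo.jpg"]})

response = requests.post(
    "http://localhost:5000/invocations",
    data=df.to_json(orient="split"),
    headers={"Content-Type": "application/json; format=pandas-split"},
)
print(response.json())  # model predictions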
15. Scoring Server
mlflow creates a server using the scoring_server module
▪ Loads the model from a given path (run_id, path)
  ▪ Downloads artifacts and loads them into memory
  ▪ Initializes the conda env (your model deps)
▪ Creates a Flask web application
▪ Binds the /invocations endpoint
  ▪ Code for parsing the request and passing the required input to the model object
▪ Binds the /ping endpoint
  ▪ Used for health-checking our server; responds 200 OK if the model can be loaded by mlflow
16. Scoring Server
wsgi.py as an entrypoint to launch the webserver

import os
from mlflow.pyfunc import scoring_server
from mlflow.pyfunc import load_model

# Load the model from the path mlflow passes via an environment variable
# and let scoring_server build the Flask app around it
app = scoring_server.init(load_model(os.environ[scoring_server._SERVER_MODEL_PATH]))
18. MLflow microservice lifecycle
▪ Develop - extend and customize our mlflow serving code & server
▪ Package - build docker, install relevant deps
▪ Deploy - run on any container orchestration, load model from MLflow
▪ Monitor - configure logging and metrics, service health
▪ Optimize - scale, change instance types, timeouts, auto scaling, restart policies
19. What needs to be productionized?
▪ Treat mlflow based services like any other microservice
▪ Enjoy the centralized infrastructure that is already in place
  ▪ Streamlined deployment mechanism
  ▪ Monitoring
  ▪ Logs shipped automatically to ELK
  ▪ Metrics scraped automatically by Prometheus - available in Grafana
  ▪ Auto scaling based on anything we choose to measure
  ▪ Automatic restarts, registration in Service Discovery
  ▪ Canary / Blue-Green deployments, A/B testing
  ▪ Unit tests, integration tests, E2E
* This is about customizing & extending mlflow to suit your current infra
20. MLflow microservice lifecycle (recap)
21. Key Components of an mlflow based server - web application & web server
▪ Flask - lightweight WSGI web application microframework
  ▪ The web application code
  ▪ Easy setup and development
  ▪ Not performant enough as a standalone production server
▪ WSGI - Web Server Gateway Interface
  ▪ A specification that describes how a web server communicates with web applications
  ▪ A clear separation and decoupling between the web server (NGINX) and the Python application code (Flask)
▪ Gunicorn - WSGI-compliant web server
  ▪ Widely popular & already used in mlflow serving
  ▪ Fast, scalable & flexible
24. What is a server middleware?
[Diagram: Request → Middleware 1 (logic → next() → more logic) → Middleware 2 (logic → next() → more logic) → Mlflow Prediction Code → Response]
Middlewares wrap the mlflow prediction code in a chain: each one runs its own logic, calls next() to hand the request to the next layer, and can run more logic on the way back before the response is returned.
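As a sketch of this pattern (the class name and registration are illustrative, not an MLflow API), a plain WSGI middleware can run logic before and after handing the request to the next layer:

import logging
import time

logger = logging.getLogger(__name__)

class TimingMiddleware:
    # Wraps a WSGI app: logic, next(), more logic
    def __init__(self, wsgi_app):
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        start = time.time()                                   # logic
        response = self.wsgi_app(environ, start_response)     # next()
        elapsed = time.time() - start                         # more logic
        logger.info("handled %s in %.3fs", environ.get("PATH_INFO"), elapsed)
        return response

# In a custom wsgi.py entrypoint (hypothetical): app.wsgi_app = TimingMiddleware(app.wsgi_app)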
25. Middleware Examples
▪ Monitoring - add custom metrics you wish to measure
▪ Authorization - authorize requests against a 3rd party service (see the sketch below)
▪ Logging - configure logging level, handlers and format
▪ Transformation - transform incoming payloads into the pandas format the model requires as input
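A minimal sketch of the authorization example, registered on the MLflow Flask app with a before_request hook; AUTH_SERVICE_URL, the header handling, and register_auth_middleware are assumptions, not part of mlflow:

import os
import requests
from flask import abort, request

# Hypothetical 3rd party authorization endpoint
AUTH_SERVICE_URL = os.getenv("AUTH_SERVICE_URL", "http://localhost:9000/authorize")

def register_auth_middleware(app):
    @app.before_request
    def authorize():
        if request.path != "/invocations":
            return  # keep /ping open for health checks
        token = request.headers.get("Authorization")
        resp = requests.post(AUTH_SERVICE_URL, json={"token": token}, timeout=2)
        if resp.status_code != 200:
            abort(401)

# In a custom wsgi.py entrypoint (hypothetical): register_auth_middleware(app)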
26. Custom Instrumentation
Custom wsgi.py entrypoint to launch the web server - using Prometheus to export metrics

import os
from flask import request
from mlflow.pyfunc import scoring_server, load_model
from prometheus_flask_exporter.multiprocess import GunicornInternalPrometheusMetrics

# Build the scoring app, then attach the Prometheus exporter to it
app = scoring_server.init(load_model(os.getenv('MODEL_PATH')))
metrics = GunicornInternalPrometheusMetrics(app, defaults_prefix=os.getenv('APP_NAME'))

# Count requests per path on top of the exporter's default metrics
metrics.register_default(
    metrics.counter(
        'by_path_counter', 'Request count by request paths',
        labels={'path': lambda: request.path}
    )
)
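With this entrypoint, the per-path counter and the exporter's default request metrics should be available for Prometheus to scrape; in multiprocess (Gunicorn) mode the exporter also relies on the prometheus_multiproc_dir environment variable, which the packaging script below sets.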
27. MLflow microservice lifecycle (recap)
29. Packaging our Application

#!/bin/sh
export GUNICORN_CMD_ARGS="--statsd-host=${STATSD_HOST:-localhost:8125}
--statsd-prefix=${STATSD_PREFIX:-scene-detector}
--log-config ${WORKDIR}/gunicorn_logging.conf
--config ${WORKDIR}/gunicorn_conf.py
--bind ${SERVER_HOST}:${SERVER_PORT}
--workers ${WORKERS:-1}
--threads ${THREADS:-1}
--graceful-timeout ${GRACEFUL_TIMEOUT_SECONDS:-5}
--timeout ${TIMEOUT:-60}"
export prometheus_multiproc_dir=/tmp

# Notice: we are not running `mlflow models serve` directly, because we modified the flask app
# and registered our middleware
# mlflow models serve -m "runs:/$MODEL_VERSION/$MODEL_NAME/" -h $SERVER_HOST -p $SERVER_PORT --no-conda --workers $WORKERS

exec gunicorn ${GUNICORN_CMD_ARGS} wsgi:app
30. MLflow microservice lifecycle (recap)
31. Deploying to production - CI/CD
[Pipeline diagram: Codebase - Data Processing & Cleaning, Feature Extraction, Model Training, Build Scripts, Configurations; Training flow - Deploy & Schedule, producing a New Run ID; Serving flow - Validations & Evaluations, Deploy, Monitor, Optimize]
32. MLflow microservice lifecycle (recap)
34. MLflow microservice lifecycle (recap)
35. General Optimizations
▪ Instance Types
  ▪ Memory & CPU
  ▪ Configuring the instance type that suits your model best can be done on any container orchestration
▪ Horizontal Auto Scaling
  ▪ HPA on K8s
  ▪ Libra on Nomad
▪ Spot instances
36. Workers
CPU-bound applications - increasing the number of parallel requests
▪ Sync
  ▪ The most basic and the default worker type is a synchronous worker class that handles a single request at a time
▪ Async
  ▪ Based on Greenlets (via Eventlet and Gevent)
  ▪ Greenlets are an implementation of cooperative multi-threading for Python
▪ The suggested number of workers is (2*CPU)+1
  ▪ For a quad-core machine: gunicorn --workers=9 main:app
37. Threads & Pseudo-Threads
I/O-bound applications - increasing the number of concurrent operations
▪ Threads
  ▪ Gunicorn also allows each worker to have multiple threads
  ▪ Threads spawned by the same worker share the same memory space
  ▪ When using multiple threads the worker type changes automatically to gthread
  ▪ The suggested number of workers mixed with threads is still (2*CPU)+1
  ▪ For a quad-core machine: gunicorn --workers=3 --threads=3 main:app
▪ Pseudo-Threads
  ▪ Some Python libraries such as gevent and asyncio enable concurrency in Python by using "pseudo-threads" implemented with coroutines
  ▪ mlflow uses the gevent worker type by default (pip install gunicorn[gevent])
  ▪ gunicorn --worker-class=gevent --worker-connections=1000 --workers=9 main:app
  ▪ In this case, the maximum number of concurrent requests is 9000 (9 workers * 1000 connections per worker)
* Visibility on the requests that are waiting to be served
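The gunicorn_conf.py referenced in the packaging script is not shown in the deck; a minimal sketch along these lines (the environment variable names and defaults are assumptions) could translate the recommendations above into configuration. Note that any flags passed on the command line, as in the packaging script, would take precedence over these values:

import multiprocessing
import os

# (2 * CPU) + 1 workers by default, overridable via the environment
workers = int(os.getenv("WORKERS", multiprocessing.cpu_count() * 2 + 1))
worker_class = os.getenv("WORKER_CLASS", "gevent")            # mlflow's default worker type
worker_connections = int(os.getenv("WORKER_CONNECTIONS", "1000"))
threads = int(os.getenv("THREADS", "1"))
timeout = int(os.getenv("TIMEOUT", "60"))
graceful_timeout = int(os.getenv("GRACEFUL_TIMEOUT_SECONDS", "5"))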
39. The Serving Space is Crowded!
There are plenty of options out there - it all comes down to what suits you best!
▪ Databricks released a built-in serving module in Databricks
  ▪ https://databricks.com/blog/2020/06/25/announcing-mlflow-model-serving-on-databricks.html
▪ AWS SageMaker / Azure ML
▪ cnvrg.io
▪ TF Serving
▪ Seldon
40. Wrapping up!
▪ Key Takeaways
  ▪ ML in production is hard
    ▪ Organizational
    ▪ Technical
  ▪ mlflow serving is a great generic way to serve any model - and it can also be extended easily!
    ▪ https://github.com/YotpoLtd/scene-detector-demo/tree/master (vgg16)
  ▪ For Yotpo, this was a game-changer, lowering the barrier to entry for AI applications
  ▪ Utilize the centralized production infrastructure that is already in place
▪ Where to go from here
  ▪ What about A/B testing?
  ▪ Base docker images to encapsulate mutual and shared code (entrypoint, Dockerfile, dependencies)
  ▪ Dynamic middleware registration
  ▪ CD4ML - when a training pipeline finishes successfully, trigger deployment with the new model version