MLflow serving is a great way to deploy any model as a REST API endpoint and start experimenting. But what about taking it to the next level? What if we want to deploy our application to production just like any other server, in a containerized environment? What about adding custom middlewares, monitoring, logging, and tuning performance for high scale?
2. Agenda
▪ The Yotpo Use Case - using mlflow serving to deploy and serve AI models that enrich and analyze images via REST API
▪ The Challenges with ML in Production - tons of repetitive infrastructure work, distributed across multiple teams
▪ Mlflow Serving - why it's awesome, and what we did to make it production ready
3. Yotpo at a glance
eCommerce Marketing Platform Overview
Founded 2011 | Raised $176,000,000
Paying customers: 10,000+ | Employees: 524 | Offices: 8 (New York, Tel Aviv, London, Sofia, Boston, Philippines, Yokneam, Modiin)
Products: SMS, Reviews, Loyalty & Referral, Visual
4. Use Case - ML models to enrich Visual Content
for the Visual Marketing Suite
5. Visual Marketing
▪ Combine customer photos, videos, and reviews to create product galleries based on UGC
▪ Leverage visual content to create a better on-site experience and engaging marketing campaigns
6. Smart Filters Module
Extend the filters in the VMS moderation page with AI-based visual filters.
▪ Shop owners curate images from Instagram
▪ Filter them based on:
  ▪ Dominant colors
  ▪ Contains people
  ▪ Image quality - noise, under-exposed, low res
  ▪ Scenery - outdoor/indoor
  ▪ Smiling people
▪ Choose which images to display in their onsite widget gallery
▪ Possibilities are endless - adding more models should be a breeze
9. Multiple teams and contexts
Common functional silos in large organizations can create barriers, stifling the ability to automate the end-to-end process of deploying ML applications to production.
10. Challenges
▪ Vast and complex multi-layered infrastructure
  ▪ Data Processing & Preparation
  ▪ Feature Extraction
  ▪ Training Infrastructure
  ▪ Serving Infrastructure
  ▪ Model -> REST API. Repeat!
▪ Organizational structure
  ▪ Self-contained teams
  ▪ End-to-end efforts with full accountability & ownership
▪ Dev process
  ▪ Large and cumbersome "hand-overs"
  ▪ "Waterfall" instead of "Agile"
12. MLflow - an open source platform for the machine learning lifecycle
▪ MLflow Tracking
  ▪ Record and query experiments: code, data, config, and results
▪ MLflow Projects
  ▪ Package data science code in a format to reproduce runs on any platform
▪ MLflow Models
  ▪ Deploy machine learning models in diverse serving environments
▪ Model Registry
  ▪ Store, annotate, discover, and manage models in a central repository
* Solved a lot of problems for us
13. Serving - a built-in module in mlflow models to deploy ML models as REST API endpoints
▪ Generic and model agnostic (pyfunc flavor)
▪ Can be automated to streamline any model deployment!
▪ The out-of-the-box server accepts the following data formats as POST input to the /invocations path:
  ▪ JSON-serialized pandas DataFrames (split orientation)
  ▪ JSON-serialized pandas DataFrames in the records orientation
  ▪ CSV-serialized pandas DataFrames
▪ Running using the mlflow CLI
  ▪ Good for testing things locally and experimenting
  ▪ mlflow models serve -m runs:/my-run-id/model-path
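As a quick illustration (not from the deck), posting a JSON-serialized DataFrame to a locally running server could look like the sketch below; the host, port and the image_url column are made up, and the pandas-split content type matches mlflow 1.x behavior:

import pandas as pd
import requests

# One-row DataFrame with the input the model expects (illustrative column name)
df = pd.DataFrame({"image_url": ["https://example.com/photo.jpg"]})

response = requests.post(
    "http://localhost:5000/invocations",
    data=df.to_json(orient="split"),
    headers={"Content-Type": "application/json; format=pandas-split"},
)
print(response.json())  # model predictions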
15. Scoring Server
mlflow creates a server using the scoring_server module
▪ Loads the model from a given path (run_id, path)
  ▪ Downloads artifacts and loads them into memory
  ▪ Initializes the conda env (your model deps)
▪ Creates a Flask web application
▪ Binds the /invocations endpoint
  ▪ Code for parsing the request and passing the required input to the model object
▪ Binds the /ping endpoint
  ▪ Used for health-checking our server; responds 200 OK if the model can be loaded by mlflow
16. Scoring Server
wsgi.py as an entrypoint to launch the webserver

import os
from mlflow.pyfunc import scoring_server
from mlflow.pyfunc import load_model

# Load the model from the path mlflow passes via an environment variable
# and let scoring_server build the Flask app around it
app = scoring_server.init(load_model(os.environ[scoring_server._SERVER_MODEL_PATH]))
18. MLflow microservice lifecycle
▪ Develop - extend and customize our mlflow serving code & server
▪ Package - build docker, install relevant deps
▪ Deploy - run on any container orchestration, load model from MLflow
▪ Monitor - configure logging and metrics, service health
▪ Optimize - scale, change instance types, timeouts, auto scaling, restart policies
19. What needs to be productionized?
▪ Treat mlflow based services like any other microservice
▪ Enjoy the centralized infrastructure that is already in place
  ▪ Streamlined deployment mechanism
  ▪ Monitoring
  ▪ Logs shipped automatically to ELK
  ▪ Metrics scraped automatically by Prometheus - available in Grafana
  ▪ Auto scaling based on anything we choose to measure
  ▪ Automatic restarts, registration in Service Discovery
  ▪ Canary / Blue-Green deployments, A/B testing
  ▪ Unit tests, integration tests, E2E
* This is about customizing & extending mlflow to suit your current infra
20. MLflow microservice lifecycle (recap)
21. Key Components of an mlflow based server - web application & web server
▪ Flask - lightweight WSGI web application microframework
  ▪ The web application code
  ▪ Easy setup and development
  ▪ Not performant enough as a standalone production server
▪ WSGI - Web Server Gateway Interface
  ▪ A specification that describes how a web server communicates with web applications
  ▪ A clear separation and decoupling between the web server (NGINX) and the Python application code (Flask)
▪ Gunicorn - WSGI-compliant web server
  ▪ Widely popular & already used in mlflow serving
  ▪ Fast, scalable & flexible
24. What is a server middleware?
[Diagram: Request → Middleware 1 (logic → next() → more logic) → Middleware 2 (logic → next() → more logic) → Mlflow Prediction Code → Response]
Middlewares wrap the mlflow prediction code in a chain: each one runs its own logic, calls next() to hand the request to the next layer, and can run more logic on the way back before the response is returned.
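As a sketch of this pattern (the class name and registration are illustrative, not an MLflow API), a plain WSGI middleware can run logic before and after handing the request to the next layer:

import logging
import time

logger = logging.getLogger(__name__)

class TimingMiddleware:
    # Wraps a WSGI app: logic, next(), more logic
    def __init__(self, wsgi_app):
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        start = time.time()                                   # logic
        response = self.wsgi_app(environ, start_response)     # next()
        elapsed = time.time() - start                         # more logic
        logger.info("handled %s in %.3fs", environ.get("PATH_INFO"), elapsed)
        return response

# In a custom wsgi.py entrypoint (hypothetical): app.wsgi_app = TimingMiddleware(app.wsgi_app)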
25. Middleware Examples
▪ Monitoring - add custom metrics you wish to measure
▪ Authorization - authorize requests against a 3rd party service (see the sketch below)
▪ Logging - configure logging level, handlers and format
▪ Transformation - transform incoming payloads into the pandas format the model requires as input
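A minimal sketch of the authorization example, registered on the MLflow Flask app with a before_request hook; AUTH_SERVICE_URL, the header handling, and register_auth_middleware are assumptions, not part of mlflow:

import os
import requests
from flask import abort, request

# Hypothetical 3rd party authorization endpoint
AUTH_SERVICE_URL = os.getenv("AUTH_SERVICE_URL", "http://localhost:9000/authorize")

def register_auth_middleware(app):
    @app.before_request
    def authorize():
        if request.path != "/invocations":
            return  # keep /ping open for health checks
        token = request.headers.get("Authorization")
        resp = requests.post(AUTH_SERVICE_URL, json={"token": token}, timeout=2)
        if resp.status_code != 200:
            abort(401)

# In a custom wsgi.py entrypoint (hypothetical): register_auth_middleware(app)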
26. Custom Instrumentation
Custom wsgi.py entrypoint to launch the web server - using Prometheus to export metrics

import os
from flask import request
from mlflow.pyfunc import scoring_server, load_model
from prometheus_flask_exporter.multiprocess import GunicornInternalPrometheusMetrics

# Build the scoring app, then attach the Prometheus exporter to it
app = scoring_server.init(load_model(os.getenv('MODEL_PATH')))
metrics = GunicornInternalPrometheusMetrics(app, defaults_prefix=os.getenv('APP_NAME'))

# Count requests per path on top of the exporter's default metrics
metrics.register_default(
    metrics.counter(
        'by_path_counter', 'Request count by request paths',
        labels={'path': lambda: request.path}
    )
)
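With this entrypoint, the per-path counter and the exporter's default request metrics should be available for Prometheus to scrape; in multiprocess (Gunicorn) mode the exporter also relies on the prometheus_multiproc_dir environment variable, which the packaging script below sets.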
27. MLflow microservice lifecycle (recap)
29. Packaging our Application

#!/bin/sh
export GUNICORN_CMD_ARGS="--statsd-host=${STATSD_HOST:-localhost:8125}
--statsd-prefix=${STATSD_PREFIX:-scene-detector}
--log-config ${WORKDIR}/gunicorn_logging.conf
--config ${WORKDIR}/gunicorn_conf.py
--bind ${SERVER_HOST}:${SERVER_PORT}
--workers ${WORKERS:-1}
--threads ${THREADS:-1}
--graceful-timeout ${GRACEFUL_TIMEOUT_SECONDS:-5}
--timeout ${TIMEOUT:-60}"
export prometheus_multiproc_dir=/tmp

# Notice: we are not running `mlflow models serve` directly, because we modified the flask app
# and registered our middleware
# mlflow models serve -m "runs:/$MODEL_VERSION/$MODEL_NAME/" -h $SERVER_HOST -p $SERVER_PORT --no-conda --workers $WORKERS

exec gunicorn ${GUNICORN_CMD_ARGS} wsgi:app
30. MLflow microservice lifecycle (recap)
31. Deploying to production - CI/CD
[Pipeline diagram: Codebase - Data Processing & Cleaning, Feature Extraction, Model Training, Build Scripts, Configurations; Training flow - Deploy & Schedule, producing a New Run ID; Serving flow - Validations & Evaluations, Deploy, Monitor, Optimize]
32. MLflow microservice lifecycle (recap)
34. MLflow microservice lifecycle (recap)
35. General Optimizations
▪ Instance Types
  ▪ Memory & CPU
  ▪ Configuring the instance type that suits your model best can be done on any container orchestration
▪ Horizontal Auto Scaling
  ▪ HPA on K8s
  ▪ Libra on Nomad
▪ Spot instances
36. Workers
CPU-bound applications - increasing the number of parallel requests
▪ Sync
  ▪ The most basic and the default worker type is a synchronous worker class that handles a single request at a time
▪ Async
  ▪ Based on Greenlets (via Eventlet and Gevent)
  ▪ Greenlets are an implementation of cooperative multi-threading for Python
▪ The suggested number of workers is (2*CPU)+1
  ▪ For a quad-core machine: gunicorn --workers=9 main:app
37. Threads & Pseudo-Threads
I/O-bound applications - increasing the number of concurrent operations
▪ Threads
  ▪ Gunicorn also allows each worker to have multiple threads
  ▪ Threads spawned by the same worker share the same memory space
  ▪ When using multiple threads the worker type changes automatically to gthread
  ▪ The suggested number of workers mixed with threads is still (2*CPU)+1
  ▪ For a quad-core machine: gunicorn --workers=3 --threads=3 main:app
▪ Pseudo-Threads
  ▪ Some Python libraries such as gevent and asyncio enable concurrency in Python by using "pseudo-threads" implemented with coroutines
  ▪ mlflow uses the gevent worker type by default (pip install gunicorn[gevent])
  ▪ gunicorn --worker-class=gevent --worker-connections=1000 --workers=9 main:app
  ▪ In this case, the maximum number of concurrent requests is 9000 (9 workers * 1000 connections per worker)
* Visibility on the requests that are waiting to be served
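The gunicorn_conf.py referenced in the packaging script is not shown in the deck; a minimal sketch along these lines (the environment variable names and defaults are assumptions) could translate the recommendations above into configuration. Note that any flags passed on the command line, as in the packaging script, would take precedence over these values:

import multiprocessing
import os

# (2 * CPU) + 1 workers by default, overridable via the environment
workers = int(os.getenv("WORKERS", multiprocessing.cpu_count() * 2 + 1))
worker_class = os.getenv("WORKER_CLASS", "gevent")            # mlflow's default worker type
worker_connections = int(os.getenv("WORKER_CONNECTIONS", "1000"))
threads = int(os.getenv("THREADS", "1"))
timeout = int(os.getenv("TIMEOUT", "60"))
graceful_timeout = int(os.getenv("GRACEFUL_TIMEOUT_SECONDS", "5"))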
39. The Serving Space is Crowded!
There are plenty of options out there - it all comes down to what suits you best!
▪ Databricks released a built-in serving module in Databricks
  ▪ https://databricks.com/blog/2020/06/25/announcing-mlflow-model-serving-on-databricks.html
▪ AWS SageMaker / Azure ML
▪ cnvrg.io
▪ TF Serving
▪ Seldon
40. Wrapping up!
▪ Key Takeaways
  ▪ ML in production is hard
    ▪ Organizational
    ▪ Technical
  ▪ mlflow serving is a great generic way to serve any model - and it can also be extended easily!
    ▪ https://github.com/YotpoLtd/scene-detector-demo/tree/master (vgg16)
  ▪ For Yotpo, this was a game-changer, lowering the barrier to entry for AI applications
  ▪ Utilize the centralized production infrastructure that is already in place
▪ Where to go from here
  ▪ What about A/B testing?
  ▪ Base docker images to encapsulate mutual and shared code (entrypoint, Dockerfile, dependencies)
  ▪ Dynamic middleware registration
  ▪ CD4ML - when a training pipeline finishes successfully, trigger deployment with the new model version