Deploying ML models in production, with or without CI/CD, is significantly more complicated than deploying traditional applications, mainly because ML models consist not just of their training code but also depend on the data they are trained on and on the supporting code. Monitoring ML models adds further complexity beyond what is usual for traditional applications. This talk covers these problems and best practices for solving them, with a special focus on how it is done on the Databricks platform.
2. inteligencija.com
Agenda
• Why is productionising Machine Learning hard?
• Overview of the Machine Learning lifecycle best practices
• Overview of how Databricks solves Machine Learning
productionisation
ML lifecycle – the naive version
[Diagram: a simple data → training → deployment pipeline, annotated with the questions it glosses over]
• Streaming data? Is the data fresh? Schema changes?
• ETL code versioning? ETL testing?
• Has the data distribution changed?
• Which data was used for training?
• Preparation code versioning and testing?
• Are all the features really needed?
• Are features in prod equivalent to the ones from training?
• What parameters and algos worked?
• Which environment? Can the environment be reproduced?
• Is the model performance still OK?
• Is the whole pipeline integration tested?
How is ML development different?
"The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction", Breck et al., Google 2017
What is DevOps?
Image source: Wikipedia
• Application code versioning
• Continuous integration – CI
• Continuous deployment – CD
• Automated testing
• Infrastructure as code
• Configuration management
• Monitoring
What is DataOps?
Image source: Monte Carlo Data
• ETL/ELT pipelines
• Code versioning
• Data lineage
• Data testing
• Data privacy
• Data self service
• Feature engineering
What is ModelOps?
Image source: Aksel Yap
• Feature engineering
• Tracking experiments
• Model validation and testing
• Versioning of model code
• Apply models to real-life data (deployment)
• Model performance monitoring
Deployment paradigms
• Batch – most applications
• Streaming – latency of seconds to minutes
• Real time – latency under one second
• Edge (on-device) – specially tuned models
Data best practices
• Version control data pipeline code
• Document feature expectations and automate data quality
checks
• Design for reusable and modular data pipelines
• Test feature creation and data processing code
• Use a Feature Store to ensure that features are consistent
across different models and pipelines
• Adopt CI/CD for data pipelines
• Adopt Infrastructure as Code
• Beware of sensitive data and compliance requirements
• Watch for training/serving skew – check that training and serving features are computed in the same way (a.k.a. online/offline skew)
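The skew check in the last bullet can be sketched in plain Python: compute the same feature through the training-path and serving-path implementations and assert they agree. The feature and function names here are hypothetical, chosen only to illustrate the pattern:

```python
# Hypothetical feature: average order value, computed two ways.

def avg_order_value_training(orders):
    # Training path: batch computation over a list of order amounts.
    return sum(orders) / len(orders) if orders else 0.0

def avg_order_value_serving(orders):
    # Serving path: incremental computation, as an optimized
    # online implementation might do it.
    total, count = 0.0, 0
    for amount in orders:
        total += amount
        count += 1
    return total / count if count else 0.0

def check_skew(orders, tolerance=1e-9):
    """Fail loudly if the two feature paths disagree."""
    offline = avg_order_value_training(orders)
    online = avg_order_value_serving(orders)
    assert abs(offline - online) <= tolerance, (
        f"training/serving skew: {offline} vs {online}"
    )
    return offline

check_skew([10.0, 20.0, 30.0])
```

In practice a check like this would run over a sample of real entities in CI, before a new serving implementation is rolled out.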
Model best practices
• Version control model training code and track experiments
• Model testing:
• Check for feature usefulness and cost
• Tune all hyperparameters
• Compare models to simpler alternatives – sanity check
• Test performance on important subsets of data (e.g.
regions)
• Understand the real-world impact of the model outputs
• Use canary deployments and A/B testing in production
• Have a rollback strategy
• Monitor for model degradation in production
• Understand how fast the model goes stale
• Set up automatic retraining pipelines (continuous learning)
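The "compare to simpler alternatives" sanity check above can be sketched in a few lines: a candidate model should clearly beat a trivial baseline such as always predicting the training mean. The numbers below are made up for illustration:

```python
# Sanity check: compare a candidate model against a trivial baseline.

def mae(y_true, y_pred):
    # Mean absolute error over paired true/predicted values.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

train_y = [3.0, 5.0, 7.0, 9.0]
test_y  = [4.0, 6.0, 8.0]

# Trivial baseline: predict the training mean for every row.
mean_pred = sum(train_y) / len(train_y)
baseline_mae = mae(test_y, [mean_pred] * len(test_y))

# Stand-in for the candidate model's test-set predictions.
model_preds = [4.5, 5.5, 7.5]
model_mae = mae(test_y, model_preds)

# Promote only if the model clearly beats the baseline.
assert model_mae < baseline_mae, "model is no better than predicting the mean"
```

A check like this is cheap to automate as a gate in the promotion pipeline.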
Delta Lake
• ACID transactions – ensures data consistency and reliability
• Schema enforcement and evolution – helps with data quality
• Time travel (Data versioning) – facilitates experimentation
• Deletes and upserts (MERGE INTO) – iterative and incremental
feature preparation
• Data skipping and other optimizations – improves
performance
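The MERGE INTO (upsert) behaviour above can be modelled in plain Python, ignoring transactions and storage: rows whose key matches are updated, the rest are inserted. The table layout here is a made-up toy:

```python
def merge_into(target, updates, key="id"):
    """Toy model of Delta Lake's MERGE INTO, with tables as lists of dicts:
    update rows whose key matches, insert the rest."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in by_key:
            by_key[row[key]].update(row)      # WHEN MATCHED THEN UPDATE
        else:
            by_key[row[key]] = dict(row)      # WHEN NOT MATCHED THEN INSERT
    return sorted(by_key.values(), key=lambda r: r[key])

features = [{"id": 1, "clicks": 3}, {"id": 2, "clicks": 5}]
new_batch = [{"id": 2, "clicks": 6}, {"id": 3, "clicks": 1}]
print(merge_into(features, new_batch))
# [{'id': 1, 'clicks': 3}, {'id': 2, 'clicks': 6}, {'id': 3, 'clicks': 1}]
```

Delta Lake does the same keyed update/insert, but atomically and at table scale, which is what makes incremental feature preparation practical.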
Delta Live Tables
• Framework for building data processing pipelines
• You define transformations and DLT manages:
• Orchestration
• Cluster management
• Monitoring
• Data quality (Expectations)
• Error handling
• Can perform CDC with APPLY CHANGES INTO .. FROM ..
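DLT Expectations attach data-quality predicates to a table and can drop, flag, or fail on violating rows. The idea can be sketched without DLT; the column names and predicate below are made up:

```python
def expect_or_drop(rows, name, predicate):
    """Keep rows satisfying the expectation and report how many were
    dropped — mimicking DLT's expect_or_drop policy on a list of dicts."""
    kept = [r for r in rows if predicate(r)]
    dropped = len(rows) - len(kept)
    print(f"expectation {name!r}: kept {len(kept)}, dropped {dropped}")
    return kept

events = [
    {"user_id": 1, "amount": 25.0},
    {"user_id": None, "amount": 10.0},   # fails: missing user_id
    {"user_id": 3, "amount": -5.0},      # fails: negative amount
]

clean = expect_or_drop(
    events, "valid_row",
    lambda r: r["user_id"] is not None and r["amount"] >= 0,
)
```

In DLT itself this is declared with decorators such as `@dlt.expect_or_drop`, and the kept/dropped counts surface in the pipeline's quality monitoring.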
Unity Catalog
• Centralized data discovery and access – quick search for and
reuse of existing datasets
• Centralized data governance and security – Fine-grained
access control management from a central location
• Data lineage – tables, columns, notebooks, workflows and
dashboards provide automatically collected lineage information
• Collaboration – Cross-workspace sharing enables teams to
share datasets across projects without data duplication,
promoting consistent use
Feature Store
• Centralized Feature Management – discoverable and reusable
• Any table in Unity Catalog can serve as a feature table (since
DBR 13.2)
• Lineage – upstream and downstream
• Consistency Across Models – Features used for training
models are also served in production
• Simplified Serving – models include feature metadata
• Use it consistently (log_model) so that feature usage is tracked
• You can publish features to online stores (Amazon
DynamoDB, Aurora or RDS MySQL)
• for models served with Databricks Model Serving
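The consistency guarantee above comes from training and serving assembling feature vectors through the same keyed lookup against the same table. A toy version, with made-up table and feature names:

```python
# Toy feature table keyed by entity id; the same lookup function is
# used both to assemble training rows and to serve online requests.
feature_table = {
    "user_42": {"avg_session_min": 12.5, "purchases_30d": 3},
    "user_99": {"avg_session_min": 2.0,  "purchases_30d": 0},
}

def lookup_features(entity_id, feature_names):
    # Single source of truth: both paths read the same stored values.
    row = feature_table[entity_id]
    return [row[f] for f in feature_names]

training_vec = lookup_features("user_42", ["avg_session_min", "purchases_30d"])
serving_vec  = lookup_features("user_42", ["avg_session_min", "purchases_30d"])
assert training_vec == serving_vec
```

A real Feature Store adds versioning, point-in-time correctness, and online/offline synchronization on top of this basic contract.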
MLflow
Integrated within the Databricks platform (notebooks and
workflows):
• MLflow Tracking – Log and query experiments and runs in
terms of code, data, config, and results
• MLflow Projects – Package data science code in a reusable,
reproducible form to share with other data scientists or transfer
to production.
• MLflow Models – Manage and deploy machine learning models
from a variety of ML libraries to a variety of model serving and
inference platforms.
• MLflow Model Registry – A centralized model store, set of APIs,
and UI, to collaboratively manage the lifecycle of a MLflow
Model.
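MLflow Tracking records, per run, the parameters, metrics, and artifacts that produced a model (via `mlflow.start_run`, `mlflow.log_param`, `mlflow.log_metric`). A minimal stdlib-only sketch of that per-run record, with illustrative experiment and parameter names:

```python
import json
import time
import uuid

def log_run(params, metrics, experiment="churn-model"):
    """Record one training run as a JSON document, MLflow-Tracking style.
    Experiment, parameter, and metric names here are illustrative."""
    run = {
        "run_id": uuid.uuid4().hex,
        "experiment": experiment,
        "start_time": time.time(),
        "params": params,      # e.g. hyperparameters
        "metrics": metrics,    # e.g. validation scores
    }
    return json.dumps(run)

record = log_run({"max_depth": 5, "lr": 0.1}, {"val_auc": 0.87})
run = json.loads(record)
```

The value of keeping such records is exactly the "what parameters and algos worked?" question from the lifecycle diagram: every result stays queryable and reproducible.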
AutoML
• Generates ML classification, regression or forecasting code
automatically, based on an input table and the target field
• Features from the Feature Store can be joined
• Jupyter notebooks with code for splitting data, setting up
libraries etc.
• Provides a good starting point for experiments and/or models
ready to be registered
CI/CD integration
• Databricks Repos UI is used for checking out Git branches,
merging and pushing changes
• It provides a REST API that can be invoked by Git automation
• In production you can:
• directly reference notebooks in remote Git repos by tags or
branches
• set up read-only folders with checked-out repos and update
them automatically using Git automation
• MLflow Model Registry provides an API so that Git automation
can automatically transition models between environments
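The Registry-driven promotion in the last bullet is essentially a gated stage transition. Sketched without the real MlflowClient API, using the classic Registry stage names:

```python
# Legal stage transitions, mirroring the classic Model Registry stages.
ALLOWED = {
    "None": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
}

def transition(model, target_stage, tests_passed):
    """Move a model version to a new stage only if tests pass and the
    transition is legal — what Git automation would call the Registry
    API for."""
    if not tests_passed:
        raise ValueError("tests failed; refusing transition")
    if target_stage not in ALLOWED[model["stage"]]:
        raise ValueError(f"illegal transition {model['stage']} -> {target_stage}")
    model["stage"] = target_stage
    return model

m = {"name": "churn", "version": 7, "stage": "None"}
transition(m, "Staging", tests_passed=True)
transition(m, "Production", tests_passed=True)
assert m["stage"] == "Production"
```

Encoding the gate in CI (rather than in someone's head) is what makes promotions auditable and repeatable.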
Promoting code and models
Use Git branches to separate code versions:
• dev branch for development
• specific feature branches for feature development
• release branches for different versions
Lifecycle of models might be independent of code
The recommended approach for model promotion
The workflow recommended by Databricks:
• Dev environment:
• Develop training and other code
• Promote code
• Staging environment:
• Test training code on subset of data
• Test other code
• Promote code
• Prod environment:
• Train model on production data
• Test model
• Deploy model
• Deploy code
Links & Resources
• Big Book of MLOps – Databricks
• The Comprehensive Guide to Feature Stores – Databricks
• ML in Production – Databricks course
• https://github.com/databricks/mlops-stacks
• https://ml-ops.org/
• https://cloud.google.com/architecture/mlops-continuous-
delivery-and-automation-pipelines-in-machine-learning
We are a Data & Analytics consulting company committed to delivering great solutions and products that enable our clients to unlock hidden opportunities within data, become data-driven, and make better business decisions.
Our goal is to enable data-driven business decisions.
• Offices in UK, Sweden, Austria, Slovenia and Croatia
• 180+ employees
• 20 years in Data & Analytics
• 250+ projects
• 100+ clients on 5 continents
Data extraction & preparation – data collection, cleaning, wrangling, aggregating, transforming, feature engineering, etc.
Exploratory Data Analysis – gaining an understanding of the data, its statistical properties and its mapping to business use case
Model Training – train models on training data
Model Validation – validate models on validation data
Deployment – deploy models to production
Monitoring – keeping track of how well models behave in production
It should be clear by now that managing ML development and deployment is more complicated than traditional software development, and hence we need a distinct methodology: MLOps.
The whole point of DevOps is to enable fast, flexible and reliable delivery of applications. Besides development, it also comprises configuration management and monitoring.
Training/serving skew can arise, for example, when optimized code running in production computes features differently from the training pipeline.