MLOps implemented - how we combine the cloud & open-source to boost data scientists work - Krzysztof Zarzycki, Marek Wiewiórka - GetInData

MLOps implemented - how we
combine the cloud & open-source
to boost data scientists work

© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Marek Wiewiórka
Chief Data Architect
marek.wiewiorka@getindata.com
Krzysztof Zarzycki
Chief Technology Officer
krzysztof.zarzycki@getindata.com

GetInData in a Nutshell
Founded by ex-Spotify engineers in 2014
Focus only on Big Data and Cloud (from day 1)
Community builders (Big Data Tech Warsaw, blogs, OSS)
80+ Big Data engineers (and growing)

How We Got to MLOps
2015: Google publishes “Hidden Technical Debt in Machine Learning Systems”
2018: Started building a cloud-native ML platform at ING Bank
2019: Started building an ML platform for a large Polish telecom
2020: Built an ML platform for Kcell, the largest Kazakh telecom
2020: MLOps projects started with retail (cloud), mobile app
2021: MLOps project started for the largest Polish bank (cloud)
and more...

Our MLOps Principles
● A software-engineering-like process, but for ML models
● The result is the pipeline, not the model
● No IT involvement required to take data science to production
● Freedom of choice of tools
● Loosely coupled mix of cloud services and open-source
● Best of breed instead of an all-in-one approach

Data Science Workbench - Our Vision

Data Scientists IDE - Batteries Included

Kedro - Data Scientist’s Swiss Knife
● Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code
● Kedro’s main concepts:
○ Project template
○ Configuration and environments
○ Data catalog
○ Nodes and pipelines
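The "configuration and environments" concept can be illustrated with a minimal sketch: settings from a base environment are overridden by an environment-specific layer, in the way a Kedro project keeps `conf/base` and `conf/local`. This is a plain-Python illustration, not Kedro's configuration loader, which applies further rules. The names `load_config`, `BASE` and `LOCAL` and all the values are hypothetical.

```python
from typing import Any, Dict


def load_config(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return the base settings with top-level keys replaced by the overrides."""
    merged = dict(base)       # copy, so the base settings are never mutated
    merged.update(overrides)  # environment-specific values win
    return merged


# Hypothetical settings standing in for files under conf/base and conf/local.
BASE = {"model": {"max_depth": 6}, "data_path": "data/01_raw/sample.csv"}
LOCAL = {"data_path": "/tmp/sample.csv"}

config = load_config(BASE, LOCAL)
print(config["data_path"])
```

Keeping the override layer separate means the same pipeline code can run on a laptop and on a cluster with only the environment changing.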

Kedro - Project starters
● Common directory structure for all projects
● Customizable Cookiecutter templates
● Boilerplate code for an ML project using the Kedro framework
● Official and in-house starters
kedro new --starter=pyspark

Kedro - Data Catalog
Data source definition:
● Separation of transformation code and data connectors
● Can be reused between projects
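The separation described above can be sketched in plain Python: connectors know where and how data is stored, while the transformation only sees rows. This is an illustration of the idea under simplifying assumptions, not Kedro's Data Catalog API. `CsvDataset`, `keep_adults` and the dataset names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List

Row = Dict[str, str]


class CsvDataset:
    """Hypothetical connector: the only place that knows the path and format."""

    def __init__(self, filepath: Path) -> None:
        self._path = Path(filepath)

    def load(self) -> List[Row]:
        with self._path.open(newline="") as handle:
            return list(csv.DictReader(handle))

    def save(self, rows: List[Row]) -> None:
        if not rows:
            raise ValueError("nothing to save")
        with self._path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)


def keep_adults(rows: List[Row]) -> List[Row]:
    """Transformation code: no paths and no file formats appear here."""
    return [row for row in rows if int(row["age"]) >= 18]


# The catalog maps dataset names to connectors; it could be shared across projects.
workdir = Path(tempfile.mkdtemp())
catalog = {
    "customers": CsvDataset(workdir / "customers.csv"),
    "adult_customers": CsvDataset(workdir / "adult_customers.csv"),
}

catalog["customers"].save([{"name": "Ada", "age": "36"}, {"name": "Tim", "age": "12"}])
catalog["adult_customers"].save(keep_adults(catalog["customers"].load()))
result = catalog["adult_customers"].load()
```

Swapping a CSV file for another storage backend would then only touch the catalog entry, not `keep_adults`.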

Kedro Nodes and Pipelines
● Node - a Python function with zero or more input and/or output datasets
● Pipeline - a DAG: a collection of nodes with defined relationships and dependencies
kedro run
Kedro viz
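A minimal sketch of the node and pipeline idea, using only the standard library: each node is a function with named inputs and an output, and the execution order is derived from those names rather than from the order in which nodes are listed. This is not Kedro's implementation. `Node`, `run_pipeline` and the dataset names are hypothetical, and each node is limited to a single output for brevity.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Any, Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def run_pipeline(nodes: List[Node], data: Dict[str, Any]) -> Dict[str, Any]:
    """Run nodes in dependency order and return all datasets by name."""
    producers = {node.output: node for node in nodes}
    # A node depends on whichever nodes produce its inputs.
    graph = {
        node.output: {name for name in node.inputs if name in producers}
        for node in nodes
    }
    results = dict(data)
    for output in TopologicalSorter(graph).static_order():
        node = producers[output]
        results[output] = node.func(*(results[name] for name in node.inputs))
    return results


def clean(raw: List[Any]) -> List[float]:
    return [value for value in raw if value is not None]


def average(values: List[float]) -> float:
    return sum(values) / len(values)


# Listed out of order on purpose: the DAG decides what runs first.
nodes = [
    Node(average, ["clean_values"], "mean_value"),
    Node(clean, ["raw_values"], "clean_values"),
]
outputs = run_pipeline(nodes, {"raw_values": [1, None, 3]})
print(outputs["mean_value"])
```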

Local development with Kedro and MLflow
1. Log into JupyterLab
2. Create a project with a Kedro starter
3. EDA with notebooks & pipeline implementation using VS Code
4. Run your project and automatically track experiments with a local MLflow
5. Optionally schedule it with a local Airflow
6. Repeat until you’re happy with your model!

Delivering ML Model to Production
● Pipeline containerization with kedro-docker
● DAG generation and scheduling with one of our plugins:
○ kedro-airflow-k8s
○ kedro-kubeflow
● Dataset stability with kedro-popmon (together with ING)
● Kubernetes pod profiling (R&D)
● CI/CD for maximum automation
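The dataset stability item can be illustrated with a simplified check: compare a statistic of each new batch with its spread over reference batches, and flag batches that deviate too far. This is a generic sketch of the idea, not the kedro-popmon or popmon API. The function name, the choice of the mean as the statistic, the threshold and the data are all assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def stability_flags(
    reference_batches: Sequence[Sequence[float]],
    new_batches: Sequence[Sequence[float]],
    threshold: float = 3.0,
) -> List[bool]:
    """Flag new batches whose mean lies more than `threshold` standard
    deviations away from the mean of the reference batch means."""
    reference_means = [mean(batch) for batch in reference_batches]
    centre = mean(reference_means)
    spread = stdev(reference_means)
    if spread == 0:
        raise ValueError("reference batches show no variation")
    return [abs(mean(batch) - centre) / spread > threshold for batch in new_batches]


# Made-up data: four stable reference batches, one normal and one shifted batch.
reference = [[10, 11, 9], [10, 10, 11], [9, 10, 10], [11, 10, 10]]
incoming = [[10, 10, 10], [20, 21, 19]]

flags = stability_flags(reference, incoming)
print(flags)
```

A check of this kind could run as a pipeline step before training, so that a shifted input is noticed before it reaches a model.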

Model Training
● Freedom of toolkit choice with containerized execution
● Scalable training
● Experiment and model tracking
● “Continuous Training”: schedule- or event-driven

Model Serving
● CI/CD
● Models from registry
● Batch & online
● Scalability
● Extensive monitoring

Model Deployment to Production!
[Diagram; arrow labels: writes, produces]

kedro-kubeflow
kedro-airflow-k8s
Model deployer
Jupyter plugins
Prebaked images
Google AI Platform
Experimentation, Training, Serving

MLOps R&D
● Align Data and ML engineering
● Feature Store
○ Feast, GCP, AWS
● Kedro
○ Company-wide data discovery tools
○ Hyperparameter tuning
○ Serving, model deployment
● Advanced deployments
● Retraining, data drift
● Business monitoring, outcome attribution

How to Start with MLOps?
● Focus on unlocking data scientists
○ Start with Data Science Workbench
○ Make code reproducible by CI
○ Then build Scalable Training

Thank you! - Dziękujemy!
github.com/getindata/kedro-kubeflow
github.com/getindata/kedro-airflow-k8s
