A Collaborative and
Scalable Machine
Learning Workflow
Nick Hale
Senior Researcher
Agenda
§ Overview
§ Components
§ Architecture
§ Project Workflow
Overview
• Objectives
• A cost-effective and scalable process for collaborative Machine Learning (ML) R&D
• A framework for comparing experiments and deploying production models
• A cloud-agnostic ML training and deployment framework
• Core Contributions
• A Git version control method suited for collaborative ML development
• A method for scaling data science pipelines
• A tightly integrated method for collaborative data science, experiment tracking, and
model deployment
Components
Components
• Environment Consistency – Docker
• Docker is an open platform for developing and running applications
• Docker uses loosely isolated environments called Containers that include
everything needed to develop and run code
• Docker is used to create a consistent environment (dependencies, libraries,
etc.) for ML among developers, model training runs, and deployments
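As a rough illustration, the sketch below uses the Docker SDK for Python (docker-py) to build the shared image and run a pipeline inside it; the image tag, mount paths, and the kedro run command are assumptions, not part of the workflow described above.

import docker

client = docker.from_env()  # connect to the local Docker daemon

# Build the shared development/training image from the project's Dockerfile
# (hypothetical tag "ml-workflow-dev").
image, _ = client.images.build(path=".", tag="ml-workflow-dev:latest")

# Run the pipeline inside the container so every developer, training run,
# and deployment sees the same dependencies and libraries.
client.containers.run(
    "ml-workflow-dev:latest",
    command="kedro run",
    volumes={"/path/to/project": {"bind": "/home/project", "mode": "rw"}},
    working_dir="/home/project",
    remove=True,
)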
Components
• ML Pipelines – Kedro
• Open-source Python framework from QuantumBlack for reproducible, maintainable, and modular
data science code
• Creates pipelines: Directed Acyclic Graphs (DAGs) composed of functions and datasets
• Nodes in the pipeline are functions that can be data transformations, model
training, etc.
• Uses a data engineering convention to track data transformations across local
and cloud datastores
Example Pipeline
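A minimal sketch of what a Kedro pipeline definition might look like; the node functions, dataset names (raw_data, primary_data, trained_model), and the parameter key are hypothetical and would normally be declared in the project's data catalog and parameters files.

import pandas as pd
from kedro.pipeline import Pipeline, node
from sklearn.linear_model import LinearRegression


def preprocess(raw_data: pd.DataFrame) -> pd.DataFrame:
    # Placeholder data transformation node
    return raw_data.dropna()


def train_model(primary_data: pd.DataFrame, parameters: dict) -> LinearRegression:
    # Placeholder model training node; "target_column" is an assumed parameter
    target = parameters["target_column"]
    features = primary_data.drop(columns=[target])
    return LinearRegression().fit(features, primary_data[target])


# Kedro wires nodes into a DAG by matching the named inputs and outputs.
example_pipeline = Pipeline(
    [
        node(preprocess, inputs="raw_data", outputs="primary_data", name="preprocess"),
        node(
            train_model,
            inputs=["primary_data", "parameters"],
            outputs="trained_model",
            name="train_model",
        ),
    ]
)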
Data Engineering Convention
Components
• ML Experiment Tracking – MLflow
• Open-source platform for managing the ML lifecycle, including experiments,
reproducibility, deployment, and model registry
• ML experiments from multiple users and compute environments are tracked and
compared
• Models can be deployed to production environments directly from MLflow and
predictions can be served through REST APIs
• Model versions and parameters are tracked so that models can always be
reproduced
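A minimal sketch of logging a run with MLflow; the tracking URI, experiment name, and toy data are illustrative assumptions.

import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Point the client at the shared tracking server (URI is an assumption).
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("ml-workflow-demo")

# Toy data so the example is self-contained.
rng = np.random.default_rng(42)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestRegressor(**params).fit(X, y)
    rmse = float(np.sqrt(mean_squared_error(y, model.predict(X))))

    # Parameters, metrics, and the model artifact are recorded so runs from
    # different users and compute environments can be compared and reproduced.
    mlflow.log_params(params)
    mlflow.log_metric("train_rmse", rmse)
    mlflow.sklearn.log_model(model, artifact_path="model")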
MLflow Server User Interface
Components
• ML Lifecycle Integration – Databricks
• Databricks is a platform that enables seamless integration of data science
code, data, experiment tracking, and cloud resources
• Databricks allows data scientists to easily run local Kedro pipelines on
compute clusters via the Databricks Connect library
• Databricks Connect is a Spark client library that connects local development
environments to Databricks clusters
• Databricks allows for easy logging of ML experiments in MLflow
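A rough sketch of the Databricks Connect pattern, assuming the classic databricks-connect client has already been configured (workspace URL, token, cluster ID) via databricks-connect configure.

from pyspark.sql import SparkSession

# With Databricks Connect installed, this local call returns a SparkSession
# whose jobs execute on the remote Databricks cluster, not on the laptop.
spark = SparkSession.builder.getOrCreate()

# Any Spark work from here on (including Kedro nodes that use Spark)
# is shipped to the cluster.
df = spark.range(1_000_000)
print(df.selectExpr("count(*) AS n", "avg(id) AS mean_id").first())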
Components
• Big Data Processing – PySpark
• PySpark is the Python API for Apache Spark; it ships distributed data science jobs to a Spark
cluster running on Databricks
• PySpark and Databricks Connect allow Kedro pipelines developed on a local
machine to run in the cloud with no friction
• Koalas implements the pandas DataFrame API on top of Spark, which allows
scalable computing with a minimal learning curve
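A minimal Koalas sketch, with a hypothetical file path and column names, showing the pandas-like API executing as distributed Spark work.

import databricks.koalas as ks

# Read straight from cloud storage into a Koalas (pandas-like) DataFrame.
kdf = ks.read_csv("dbfs:/data/raw/transactions.csv")

# Familiar pandas-style operations, run as Spark jobs on the cluster.
top_customers = (
    kdf.groupby("customer_id")["amount"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(top_customers)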
Architecture
[Architecture diagram: local machines / individual developer compute run a Databricks base image with databricks-connect and installed/cloned ML libraries; Docker/Git repositories host an ML Scaffold repo (custom ML functions) and a new, Kedro-ized ML project repo; pipelines run as Spark jobs on cloud training compute and deployment compute; data flows follow the Kedro data engineering convention.]
Project Workflow
[Project set-up / Git flow diagram (Steps 1-5): a new project is initialized from the project template, dev/training container, and scaffold library (pip install), with master/dev branches created from the project set-up; ML engineers work on dev branches, developing pipelines and experiments and training candidate models on local dev environments and cloud compute via databricks-connect; experiments are logged and models registered/versioned; models and code are evaluated and a final model is selected; new functions and dependencies are contributed back to the scaffold; the final model is deployed to an inference machine for serving, with model re-training feeding back into the flow. Data movement follows the Kedro data engineering convention.]
