A Collaborative Data Science Development Workflow


Collaborative data science workflows have several moving parts, and many organizations struggle to develop an efficient, scalable process. In our solution, data scientists individually build and test Kedro pipelines and measure performance using MLflow tracking. Once a strong solution is found, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If the pipeline is production-worthy, the resulting model is served to a production application through MLflow.



  1. A Collaborative and Scalable Machine Learning Workflow – Nick Hale, Senior Researcher
  2. Agenda: Overview, Components, Architecture, Project Workflow
  3. Overview
     • Objectives:
       • A cost-effective and scalable process for collaborative Machine Learning (ML) R&D
       • A framework for comparing experiments and deploying production models
       • A cloud-agnostic ML training and deployment framework
     • Core Contributions:
       • A Git version-control method suited for collaborative ML development
       • A method for scaling data science pipelines
       • A tightly integrated method for collaborative data science, experiment tracking, and model deployment
  4. Components
  5. Components – Environment Consistency: Docker
     • Docker is an open platform for developing and running applications
     • Docker uses loosely isolated environments called containers that include everything needed to develop and run code
     • Docker is used to create a consistent environment (dependencies, libraries, etc.) for ML across developers, model training runs, and deployments
  6. Components – ML Pipelines: Kedro
     • Open-source Python framework from QuantumBlack for reproducible, maintainable, and modular data science code
     • Creates pipelines: Directed Acyclic Graphs (DAGs) composed of functions and datasets
     • Nodes in the pipeline are functions that perform data transformations, model training, etc.
     • Uses a data engineering convention to track data transformations across local and cloud datastores
     • (See the minimal Kedro pipeline sketch after this transcript)
  7. Example Pipeline – Data Engineering Convention (diagram)
  8. Components – ML Experiment Tracking: MLflow
     • Open-source platform for managing the ML lifecycle, including experiments, reproducibility, deployment, and a model registry
     • ML experiments from multiple users and compute environments are tracked and compared
     • Models can be deployed to production environments directly from MLflow, and predictions can be served through REST APIs
     • Model versions and parameters are tracked so that models can always be reproduced
     • (See the minimal MLflow tracking sketch after this transcript)
  9. MLflow Server User Interface (screenshot)
  10. Components – ML Lifecycle Integration: Databricks
     • Databricks is a platform that enables seamless integration of data science code, data, experiment tracking, and cloud resources
     • Databricks lets data scientists easily run local Kedro pipelines on compute clusters via the Databricks Connect library
     • Databricks Connect is a Spark client library that connects local development environments to Databricks clusters
     • Databricks allows easy logging of ML experiments in MLflow
     • (See the Databricks Connect sketch after this transcript)
  11. Components – Big Data Processing: PySpark
     • PySpark is a Python library that ships distributed data science jobs to a Spark cluster running on Databricks
     • PySpark and Databricks Connect allow Kedro pipelines developed on a local machine to run in the cloud with no friction
     • Koalas implements the familiar pandas DataFrame API on top of Spark, enabling scalable computing with a minimal learning curve
     • (See the Koalas sketch after this transcript)
  12. Architecture (diagram): local machines / individual developer compute; Docker and Git repositories in the cloud, including a Databricks base image with databricks-connect, an ML scaffold repo (custom ML functions), and new Kedro-ized ML project repos; cloud training and deployment compute running Spark jobs/pipelines; datasets tracked via the Kedro data engineering convention
  13. Project Workflow (diagram), five steps: (1) project set-up – a Git Flow project template with the scaffold library pip-installed into a dev/training container, and master/dev branches initialized from the project set-up; (2) development – ML engineers build pipelines and run experiments on local dev branches via databricks-connect and the Kedro data engineering convention, contributing new functions and dependencies back to the scaffold; (3) tracking – experiments are logged and models registered/versioned; (4) candidate model training on cloud compute, with models and code evaluated and a final model selected; (5) deployment – the final model is served on an inference machine, with model re-training as needed
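To make the Kedro slide concrete, here is a minimal sketch of a two-node pipeline. The node functions, dataset names (raw_data, model_input, regressor), and the params:target_column parameter are hypothetical placeholders, not names from the original project.

```python
# Minimal Kedro pipeline sketch; dataset and parameter names are hypothetical.
from kedro.pipeline import Pipeline, node
from sklearn.linear_model import LinearRegression

def preprocess(raw_data):
    # Example transformation node: drop incomplete rows.
    return raw_data.dropna()

def train_model(model_input, target_column):
    # Example training node: fit a simple regressor on the prepared table.
    X = model_input.drop(columns=[target_column])
    y = model_input[target_column]
    return LinearRegression().fit(X, y)

def create_pipeline(**kwargs):
    # Kedro wires the nodes into a DAG by matching dataset names declared
    # in the data catalog (local or cloud datastores).
    return Pipeline([
        node(preprocess, inputs="raw_data", outputs="model_input", name="preprocess"),
        node(train_model, inputs=["model_input", "params:target_column"],
             outputs="regressor", name="train_model"),
    ])
```

Each dataset name maps to an entry in Kedro's data catalog, which is where the data engineering convention (raw, intermediate, model input, and similar layers) referenced on slide 7 is expressed.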
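The MLflow slide describes experiment tracking and model registration; below is a minimal sketch of what a tracked run might look like. The tracking URI, experiment name, parameters, and metric values are placeholders.

```python
# Minimal MLflow tracking sketch; URIs, names, and values are placeholders.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

mlflow.set_tracking_uri("http://mlflow-server:5000")   # hypothetical tracking server
mlflow.set_experiment("collaborative-ml-demo")         # hypothetical experiment name

with mlflow.start_run(run_name="kedro-pipeline-test"):
    # Log the knobs of this run so it can be compared with other users' runs.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("val_rmse", 0.42)

    # Register the trained model so a version can later be served over REST.
    model = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 1.0]))
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="demo_model")
```

A registered version can then be exposed as a REST endpoint, for example with the MLflow CLI (`mlflow models serve -m "models:/demo_model/1"`), which is one way to serve predictions as the slide mentions.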
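The Databricks slide relies on Databricks Connect to run locally developed pipelines on a remote cluster. Under the classic databricks-connect workflow (install the library, then run `databricks-connect configure` with the workspace URL, token, and cluster ID), a SparkSession created locally is transparently backed by the cluster; the sketch below assumes that configuration has already been done.

```python
# Sketch assuming databricks-connect is installed and configured;
# the SparkSession below executes on the remote Databricks cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The data lives and is processed on the cluster, not the laptop.
df = spark.range(1_000_000).toDF("row_id")
print(df.count())
```

A Kedro pipeline whose nodes use this SparkSession therefore runs against cluster compute without any change to the pipeline code.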
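Finally, the PySpark slide mentions Koalas as the pandas-like layer over Spark. A small sketch is below; the DBFS path and column names are hypothetical.

```python
# Minimal Koalas sketch; the DBFS path and column names are hypothetical.
import databricks.koalas as ks

# Reads into a distributed DataFrame but exposes the familiar pandas API.
kdf = ks.read_csv("dbfs:/data/transactions.csv")
daily_totals = kdf.groupby("date")["amount"].sum()

# Convert to a native Spark DataFrame when a Spark-specific API is needed.
sdf = daily_totals.to_frame().to_spark()
sdf.show()
```

In Spark 3.2 and later, the same API is also available as pyspark.pandas.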
