Machine Learning Data Lineage with MLflow and Delta Lake



Many organizations using machine learning face challenges storing and versioning their complex ML data, as well as the large number of models generated from that data. To simplify this process, organizations tend to start building their own customized ‘ML platforms.’

Published in: Data & Analytics


  1. 1. Machine Learning Data Lineage with MLflow and Delta Lake. Richard Zang, Senior Software Engineer, Databricks; Denny Lee, Staff Developer Advocate, Databricks
  2. 2. Richard Zang Senior Software Engineer at Databricks Previously ▪ Senior Software Engineer at Hortonworks ▪ Senior Software Engineer at Opentext Analytics
  3. 3. Denny Lee Staff Developer Advocate at Databricks Previously ▪ Senior Director of Data Science Engineering at Concur ▪ Principal Program Manager at Microsoft ▪ Project Isotope (Azure HDInsight) ▪ SQLCAT DW/BI Lead
  4. 4. Intro
  5. 5. Machine Learning Development is Complex
  6. 6. ML Lifecycle [Diagram labels: Raw Data, Data Prep, Training, Deploy; Delta; Tuning (μ λ θ); Scale; Model Exchange; Governance]
  7. 7. Components • Tracking: Record and query experiments (code, metrics, parameters, artifacts, models) • Models: General model format that standardizes deployment options • Model Registry: Centralized and collaborative model lifecycle management • Projects: Packaging format for reproducible runs on any compute platform
  8. 8. Model Lifecycle, Data Lineage [Diagram labels: Tracking (Parameters, Metrics, Artifacts, Models, Metadata); Models (Flavor 1, Flavor 2); Model Registry (v1, v2, v3; Staging, Production, Archived); Serving (In-Line Code, Containers, Batch & Stream Scoring, Cloud Inference Services, OSS Serving Solutions); Data Scientists, Deployment Engineers; data versions v0, v1]
  9. 9. Challenges in Model Management: When you’re working on one ML app alone, keeping the model in files is manageable. [Diagram: a model developer with files classifier_v1.h5, classifier_v2.h5, classifier_v3_sept_19.h5, classifier_v3_new.h5, …]
  10. 10. Challenges in Model Management: When you work in a large organization with many models, management becomes a big challenge: • Where can I find the best version of this model? • How was this model trained? • How can I track docs for each model? • How can I review models? [Diagram roles: model developer, reviewer, model user]
  11. 11. MLflow Model Registry • Repository of named, versioned models with comments & tags • Track each model’s stage: dev, staging, production, archived • Easily load a specific version
  13. 13. A Data Engineer’s Dream: Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming. [Diagram labels: CSV, JSON, TXT…; Kinesis; Data Lake; AI & Reporting]
  14. 14. Delta On Disk • my_table/ • _delta_log/ (Transaction Log) • 00000.json, 00001.json (Table Versions) • date=2019-01-01/ ((Optional) Partition Directories) • file-1.parquet (Data Files)
  15. 15. Implementing Atomicity: Changes to the table are stored as ordered, atomic units called commits. • 000000.json: Add 1.parquet, Add 2.parquet • 000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet • …
  16. 16. Solving Conflicts Optimistically: 1. Record start version 2. Record reads/writes 3. Attempt commit 4. If someone else wins, check if anything you read has changed. 5. Try again. [Diagram: User 1 and User 2 each perform "Read: Schema" and "Write: Append" against log files 000000.json, 000001.json, 000002.json]
  17. 17. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
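
The Model Registry slide says a specific version is easy to load. A minimal sketch of what that might look like in Python, assuming MLflow's `models:/<name>/<version-or-stage>` URI scheme and `mlflow.pyfunc.load_model`. The model name `classifier` and both helper functions are hypothetical, introduced only for illustration.

```python
from typing import Optional


def registry_model_uri(name: str, version: Optional[int] = None,
                       stage: Optional[str] = None) -> str:
    """Build a registry URI of the form models:/<name>/<version-or-stage>."""
    if (version is None) == (stage is None):
        raise ValueError("pass exactly one of version or stage")
    selector = str(version) if version is not None else stage
    return f"models:/{name}/{selector}"


def load_registered_model(name: str, version: int):
    """Load one registered version as a generic Python-function model."""
    # Imported lazily so the URI helper above works without MLflow installed.
    import mlflow.pyfunc
    return mlflow.pyfunc.load_model(registry_model_uri(name, version=version))


# "classifier" is a hypothetical registered model name.
uri_v3 = registry_model_uri("classifier", version=3)
uri_prod = registry_model_uri("classifier", stage="Production")
print(uri_v3)
print(uri_prod)
```

Calling `load_registered_model("classifier", 3)` needs a reachable tracking server with that model registered, so it is left out of the runnable part of the sketch.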
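
The "Delta On Disk" and "Implementing Atomicity" slides describe a table as a directory of data files plus an ordered log of commits. The sketch below is a simplified stand-in for that idea, not Delta Lake's actual implementation: it writes two commit files as JSON lines and replays them to find which data files make up each table version. It assumes the first commit adds `1.parquet` and `2.parquet` and the second removes both and adds `3.parquet`. The helper names and the reduced `add`/`remove` action shape are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir: Path, version: int, actions: list) -> None:
    """Write one commit as a JSON-lines file named by its zero-padded version."""
    path = log_dir / f"{version:06d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


def live_files(log_dir: Path, as_of=None) -> set:
    """Replay commits in version order; return the data files in the table."""
    files = set()
    # Zero-padded names sort lexicographically in version order.
    for commit in sorted(log_dir.glob("*.json")):
        if as_of is not None and int(commit.stem) > as_of:
            break
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "my_table" / "_delta_log"
    log_dir.mkdir(parents=True)
    write_commit(log_dir, 0, [{"add": {"path": "1.parquet"}},
                              {"add": {"path": "2.parquet"}}])
    write_commit(log_dir, 1, [{"remove": {"path": "1.parquet"}},
                              {"remove": {"path": "2.parquet"}},
                              {"add": {"path": "3.parquet"}}])
    version_0 = live_files(log_dir, as_of=0)  # table as of the first commit
    version_1 = live_files(log_dir)           # latest table state
```

Because every version can be rebuilt by replaying the log up to that commit, a training run could record the table version it read, which is one way the lineage theme of the talk might be realized.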
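
The five steps on the "Solving Conflicts Optimistically" slide can be sketched as follows. This is a toy model under stated assumptions, not Delta Lake's real conflict-detection logic: each commit is a file created with Python's exclusive-create mode (`"x"`), so only one writer can claim a version, and "anything you read has changed" is approximated by intersecting the loser's read set with the winner's write set. All names here (`commit`, `ConflictError`, the `"schema"` and `"append:..."` labels) are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class ConflictError(Exception):
    """Raised when a concurrent commit changed something this writer read."""


def latest_version(log_dir: Path) -> int:
    """Return the highest committed version, or -1 for an empty log."""
    return max((int(p.stem) for p in log_dir.glob("*.json")), default=-1)


def commit(log_dir: Path, start_version: int, reads: set, writes: set,
           max_retries: int = 10) -> int:
    """Attempt a commit after start_version, retrying past harmless winners."""
    version = start_version + 1
    for _ in range(max_retries):
        path = log_dir / f"{version:06d}.json"
        try:
            # Mode "x" fails if the file exists, so one writer wins each version.
            with open(path, "x") as f:
                json.dump({"reads": sorted(reads), "writes": sorted(writes)}, f)
            return version
        except FileExistsError:
            # Someone else won: check whether they changed anything we read.
            winner = json.loads(path.read_text())
            changed = reads & set(winner["writes"])
            if changed:
                raise ConflictError(f"version {version} changed {sorted(changed)}")
            version += 1  # no overlap, so try the next version
    raise RuntimeError("too many retries")


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    first = commit(log_dir, -1, reads=set(), writes={"schema"})
    start = latest_version(log_dir)  # both users record the same start version
    user1 = commit(log_dir, start, reads={"schema"}, writes={"append:file-a"})
    user2 = commit(log_dir, start, reads={"schema"}, writes={"append:file-b"})

    stale_start = latest_version(log_dir)
    commit(log_dir, stale_start, reads={"schema"}, writes={"schema"})
    try:
        # This writer read the schema before the schema change above landed.
        commit(log_dir, stale_start, reads={"schema"}, writes={"append:file-c"})
        conflicted = False
    except ConflictError:
        conflicted = True
```

In this sketch the two appends both succeed, with the second writer landing one version later, while the writer that read a since-changed schema gets a `ConflictError` and would need to re-read and start over.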