Machine Learning Data Lineage with MLflow and Delta Lake

Databricks
DatabricksDeveloper Marketing and Relations at MuleSoft
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage
with and Delta Lake
Richard Zang, Senior Software Engineer, Databricks
Denny Lee, Staff Developer Advocate, Databricks
Richard Zang
Senior Software Engineer at
Databricks
Previously
▪ Senior Software Engineer at
Hortonworks
▪ Senior Software Engineer at
Opentext Analytics
Denny Lee
Staff Developer Advocate at
Databricks
Previously
▪ Senior Director of Data Science
Engineering at Concur
▪ Principal Program Manager at at
Microsoft
▪ Project Isotope (Azure
HDInsight)
▪ SQLCAT DW/BI Lead
Intro
Machine Learning
Development is Complex
ML Lifecycle
7
Delta
Data Prep
Training
Deploy
Raw Data
μ
λ θ Tuning
Scale
μ
λ θ Tuning
Scale
Scale
Scale
Model
Exchange
Governance
Tracking
Record and query
experiments: code,
metrics, parameters,
artifacts, models
Models
General model
format
that standardizes
deployment options
Model Registry
Centralized and
collaborative
model lifecycle
management
Projects
Packaging format
for reproducible runs
on any compute
platform
Components
Model Lifecycle Data Lineage
Staging Production Archived
Data Scientists Deployment Engineers
v1
v2
v3
Models Tracking
Flavor 2Flavor 1
Model Registry
In-Line Code
Containers
Batch & Stream
Scoring
Cloud Inference
Services
OSS Serving
Solutions
Serving
Parameter
s
Metrics Artifacts
ModelsMetadata
v0
v1
Challenges in Model Management
When you’re working on one ML app alone, keeping the model in
files is manageable
MODEL
DEVELOPER
classifier_v1.h5
classifier_v2.h5
classifier_v3_sept_19.h5
classifier_v3_new.h5
…
Challenges in Model Management
When you work in a large organization with many models,
management becomes a big challenge:
• Where can I find the best version of this model?
• How was this model trained?
• How can I track docs for each model?
• How can I review models?
MODEL
DEVELOPER
REVIEWER
MODEL
USER
???
MLflow Model Registry
Repository of named, versioned
models with comments & tags
Track each model’s stage: dev,
staging, production, archived
Easily load a specific version
MLflow Model Registry
Model Registry
MODEL
DEVELOPER
DOWNSTREAM
USERS
AUTOMATED JOBS
REST SERVING
REVIEWERS,
CI/CD TOOLS
A Data Engineer’s Dream...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
Process data continuously and incrementally as new data arrive in a cost
efficient way without having to choose between batch or streaming
Delta On Disk
my_table/
_delta_log/
00000.json
00001.json
date=2019-01-01/
file-1.parquet
Transaction Log
Table Versions
(Optional) Partition Directories
Data Files
Implementing Atomicity
Changes to the table
are stored as
ordered, atomic units
called commits
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
000000.json
000001.json
…
Solving Conflicts Optimistically
1. Record start version
2. Record reads/writes
3. Attempt commit
4. If someone else wins,
check if anything you
read has changed.
5. Try again.
000000.json
000001.json
000002.json
User 1 User 2
Write: Append
Read: Schema
Write: Append
Read: Schema
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Machine Learning Data Lineage with MLflow and Delta Lake
1 of 19

Recommended

Databricks Overview for MLOps by
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOpsDatabricks
1.4K views30 slides
Introducing Databricks Delta by
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
5.9K views36 slides
Modernizing to a Cloud Data Architecture by
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
644 views22 slides
Building End-to-End Delta Pipelines on GCP by
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
649 views25 slides
Architect’s Open-Source Guide for a Data Mesh Architecture by
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
3.1K views48 slides
Intro to Delta Lake by
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
1.5K views22 slides

More Related Content

What's hot

Demystifying Data Warehousing as a Service - DFW by
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWKent Graziano
2.2K views79 slides
Ml ops past_present_future by
Ml ops past_present_futureMl ops past_present_future
Ml ops past_present_futureNisha Talagala
282 views21 slides
Introduction SQL Analytics on Lakehouse Architecture by
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
5.8K views52 slides
Big data architectures and the data lake by
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
54.1K views53 slides
Learn to Use Databricks for the Full ML Lifecycle by
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleDatabricks
639 views15 slides
What’s New with Databricks Machine Learning by
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
401 views102 slides

What's hot(20)

Demystifying Data Warehousing as a Service - DFW by Kent Graziano
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
Kent Graziano2.2K views
Introduction SQL Analytics on Lakehouse Architecture by Databricks
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks5.8K views
Big data architectures and the data lake by James Serra
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra54.1K views
Learn to Use Databricks for the Full ML Lifecycle by Databricks
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML Lifecycle
Databricks639 views
What’s New with Databricks Machine Learning by Databricks
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks401 views
Data Lakehouse Symposium | Day 4 by Databricks
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks1.8K views
MLflow Model Serving by Databricks
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
Databricks1.4K views
DW Migration Webinar-March 2022.pptx by Databricks
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks4.3K views
Scaling and Modernizing Data Platform with Databricks by Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
Databricks594 views
Data Warehouse - Incremental Migration to the Cloud by Michael Rainey
Data Warehouse - Incremental Migration to the CloudData Warehouse - Incremental Migration to the Cloud
Data Warehouse - Incremental Migration to the Cloud
Michael Rainey1.4K views
Learn to Use Databricks for Data Science by Databricks
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks1.6K views
Data Lakehouse Symposium | Day 1 | Part 1 by Databricks
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks1.5K views
Introduction to Azure Data Factory by Slava Kokaev
Introduction to Azure Data FactoryIntroduction to Azure Data Factory
Introduction to Azure Data Factory
Slava Kokaev8.7K views
Differentiate Big Data vs Data Warehouse use cases for a cloud solution by James Serra
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra9.3K views
Introducing the Snowflake Computing Cloud Data Warehouse by Snowflake Computing
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
Snowflake Computing10.6K views
Databricks Delta Lake and Its Benefits by Databricks
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks5.1K views
Using Databricks as an Analysis Platform by Databricks
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
Databricks746 views
Data Lake Overview by James Serra
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra19.8K views

Similar to Machine Learning Data Lineage with MLflow and Delta Lake

The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... by
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
1.5K views50 slides
MLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle by
MLOps Virtual Event | Building Machine Learning Platforms for the Full LifecycleMLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle
MLOps Virtual Event | Building Machine Learning Platforms for the Full LifecycleDatabricks
2.4K views41 slides
Machine Learning Models in Production by
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
2.2K views32 slides
Lviv Data Science Club (Sergiy Lunyakin) by
Lviv Data Science Club (Sergiy Lunyakin)Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)Lviv Startup Club
151 views21 slides
20160317 - PAZUR - PowerBI & R by
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & RŁukasz Grala
690 views24 slides
An introduction to QuerySurge webinar by
An introduction to QuerySurge webinarAn introduction to QuerySurge webinar
An introduction to QuerySurge webinarRTTS
13K views14 slides

Similar to Machine Learning Data Lineage with MLflow and Delta Lake(20)

The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... by Databricks
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks1.5K views
MLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle by Databricks
MLOps Virtual Event | Building Machine Learning Platforms for the Full LifecycleMLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle
MLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle
Databricks2.4K views
Machine Learning Models in Production by DataWorks Summit
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
DataWorks Summit2.2K views
Lviv Data Science Club (Sergiy Lunyakin) by Lviv Startup Club
Lviv Data Science Club (Sergiy Lunyakin)Lviv Data Science Club (Sergiy Lunyakin)
Lviv Data Science Club (Sergiy Lunyakin)
Lviv Startup Club151 views
20160317 - PAZUR - PowerBI & R by Łukasz Grala
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & R
Łukasz Grala690 views
An introduction to QuerySurge webinar by RTTS
An introduction to QuerySurge webinarAn introduction to QuerySurge webinar
An introduction to QuerySurge webinar
RTTS13K views
Introduction to SQL Server Analysis services 2008 by Tobias Koprowski
Introduction to SQL Server Analysis services 2008Introduction to SQL Server Analysis services 2008
Introduction to SQL Server Analysis services 2008
Tobias Koprowski1.5K views
Python + MPP Database = Large Scale AI/ML Projects in Production Faster by Paige_Roberts
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts149 views
Serverless machine learning architectures at Helixa by Data Science Milan
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
Data Science Milan259 views
Analyti x mapping manager product overview presentation by AnalytixDataServices
Analyti x mapping manager product overview presentationAnalyti x mapping manager product overview presentation
Analyti x mapping manager product overview presentation
Databricks Platform.pptx by Alex Ivy
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy3.2K views
How does Microsoft solve Big Data? by James Serra
How does Microsoft solve Big Data?How does Microsoft solve Big Data?
How does Microsoft solve Big Data?
James Serra7.4K views
DataMass Summit - Machine Learning for Big Data in SQL Server by Łukasz Grala
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
Łukasz Grala981 views
What’s new in SQL Server 2017 by James Serra
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
James Serra10.4K views
Mapping Manager Product Overview by Rakesh Kumar
Mapping Manager Product OverviewMapping Manager Product Overview
Mapping Manager Product Overview
Rakesh Kumar481 views
Whats New In 2010 (Msdn & Visual Studio) by Steve Lange
Whats New In 2010 (Msdn & Visual Studio)Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)
Steve Lange1.2K views

More from Databricks

Data Lakehouse Symposium | Day 1 | Part 2 by
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
739 views16 slides
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop by
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
6.3K views64 slides
Democratizing Data Quality Through a Centralized Platform by
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
1.4K views36 slides
Why APM Is Not the Same As ML Monitoring by
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
743 views26 slides
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix by
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
688 views48 slides
Stage Level Scheduling Improving Big Data and AI Integration by
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
850 views38 slides

More from Databricks(20)

Data Lakehouse Symposium | Day 1 | Part 2 by Databricks
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks739 views
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop by Databricks
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks6.3K views
Democratizing Data Quality Through a Centralized Platform by Databricks
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks1.4K views
Why APM Is Not the Same As ML Monitoring by Databricks
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks743 views
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix by Databricks
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks688 views
Stage Level Scheduling Improving Big Data and AI Integration by Databricks
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks850 views
Simplify Data Conversion from Spark to TensorFlow and PyTorch by Databricks
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks1.8K views
Scaling your Data Pipelines with Apache Spark on Kubernetes by Databricks
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks2.1K views
Scaling and Unifying SciKit Learn and Apache Spark Pipelines by Databricks
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks667 views
Sawtooth Windows for Feature Aggregations by Databricks
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks604 views
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink by Databricks
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks675 views
Re-imagine Data Monitoring with whylogs and Spark by Databricks
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks550 views
Raven: End-to-end Optimization of ML Prediction Queries by Databricks
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks448 views
Processing Large Datasets for ADAS Applications using Apache Spark by Databricks
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks512 views
Massive Data Processing in Adobe Using Delta Lake by Databricks
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks719 views
Machine Learning CI/CD for Email Attack Detection by Databricks
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks389 views
Jeeves Grows Up: An AI Chatbot for Performance and Quality by Databricks
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks260 views
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue by Databricks
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Databricks348 views
Infrastructure Agnostic Machine Learning Workload Deployment by Databricks
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
Databricks347 views
Improving Apache Spark for Dynamic Allocation and Spot Instances by Databricks
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot Instances
Databricks281 views

Recently uploaded

ColonyOS by
ColonyOSColonyOS
ColonyOSJohanKristiansson6
9 views17 slides
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented GenerationDataScienceConferenc1
5 views29 slides
How Leaders See Data? (Level 1) by
How Leaders See Data? (Level 1)How Leaders See Data? (Level 1)
How Leaders See Data? (Level 1)Narendra Narendra
13 views76 slides
RuleBookForTheFairDataEconomy.pptx by
RuleBookForTheFairDataEconomy.pptxRuleBookForTheFairDataEconomy.pptx
RuleBookForTheFairDataEconomy.pptxnoraelstela1
67 views16 slides
CRIJ4385_Death Penalty_F23.pptx by
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptxyvettemm100
6 views24 slides
MOSORE_BRESCIA by
MOSORE_BRESCIAMOSORE_BRESCIA
MOSORE_BRESCIAFederico Karagulian
5 views8 slides

Recently uploaded(20)

[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
RuleBookForTheFairDataEconomy.pptx by noraelstela1
RuleBookForTheFairDataEconomy.pptxRuleBookForTheFairDataEconomy.pptx
RuleBookForTheFairDataEconomy.pptx
noraelstela167 views
CRIJ4385_Death Penalty_F23.pptx by yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1006 views
Survey on Factuality in LLM's.pptx by NeethaSherra1
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra15 views
Understanding Hallucinations in LLMs - 2023 09 29.pptx by Greg Makowski
Understanding Hallucinations in LLMs - 2023 09 29.pptxUnderstanding Hallucinations in LLMs - 2023 09 29.pptx
Understanding Hallucinations in LLMs - 2023 09 29.pptx
Greg Makowski13 views
Organic Shopping in Google Analytics 4.pdf by GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials10 views
UNEP FI CRS Climate Risk Results.pptx by pekka28
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptx
pekka2811 views
Short Story Assignment by Kelly Nguyen by kellynguyen01
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyen
kellynguyen0118 views
Data structure and algorithm. by Abdul salam
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
Abdul salam 18 views
Advanced_Recommendation_Systems_Presentation.pptx by neeharikasingh29
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx by DataScienceConferenc1
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf by vikas12611618
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdfVikas 500 BIG DATA TECHNOLOGIES LAB.pdf
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf
vikas126116188 views
Cross-network in Google Analytics 4.pdf by GA4 Tutorials
Cross-network in Google Analytics 4.pdfCross-network in Google Analytics 4.pdf
Cross-network in Google Analytics 4.pdf
GA4 Tutorials6 views
3196 The Case of The East River by ErickANDRADE90
3196 The Case of The East River3196 The Case of The East River
3196 The Case of The East River
ErickANDRADE9011 views
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20045 views

Machine Learning Data Lineage with MLflow and Delta Lake

  • 2. Machine Learning Data Lineage with and Delta Lake Richard Zang, Senior Software Engineer, Databricks Denny Lee, Staff Developer Advocate, Databricks
  • 3. Richard Zang Senior Software Engineer at Databricks Previously ▪ Senior Software Engineer at Hortonworks ▪ Senior Software Engineer at Opentext Analytics
  • 4. Denny Lee Staff Developer Advocate at Databricks Previously ▪ Senior Director of Data Science Engineering at Concur ▪ Principal Program Manager at at Microsoft ▪ Project Isotope (Azure HDInsight) ▪ SQLCAT DW/BI Lead
  • 7. ML Lifecycle 7 Delta Data Prep Training Deploy Raw Data μ λ θ Tuning Scale μ λ θ Tuning Scale Scale Scale Model Exchange Governance
  • 8. Tracking Record and query experiments: code, metrics, parameters, artifacts, models Models General model format that standardizes deployment options Model Registry Centralized and collaborative model lifecycle management Projects Packaging format for reproducible runs on any compute platform Components
  • 9. Model Lifecycle Data Lineage Staging Production Archived Data Scientists Deployment Engineers v1 v2 v3 Models Tracking Flavor 2Flavor 1 Model Registry In-Line Code Containers Batch & Stream Scoring Cloud Inference Services OSS Serving Solutions Serving Parameter s Metrics Artifacts ModelsMetadata v0 v1
  • 10. Challenges in Model Management When you’re working on one ML app alone, keeping the model in files is manageable MODEL DEVELOPER classifier_v1.h5 classifier_v2.h5 classifier_v3_sept_19.h5 classifier_v3_new.h5 …
  • 11. Challenges in Model Management When you work in a large organization with many models, management becomes a big challenge: • Where can I find the best version of this model? • How was this model trained? • How can I track docs for each model? • How can I review models? MODEL DEVELOPER REVIEWER MODEL USER ???
  • 12. MLflow Model Registry Repository of named, versioned models with comments & tags Track each model’s stage: dev, staging, production, archived Easily load a specific version
  • 13. MLflow Model Registry Model Registry MODEL DEVELOPER DOWNSTREAM USERS AUTOMATED JOBS REST SERVING REVIEWERS, CI/CD TOOLS
  • 14. A Data Engineer’s Dream... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting Process data continuously and incrementally as new data arrive in a cost efficient way without having to choose between batch or streaming
  • 16. Implementing Atomicity Changes to the table are stored as ordered, atomic units called commits Add 1.parquet Add 2.parquet Remove 1.parquet Remove 2.parquet Add 3.parquet 000000.json 000001.json …
  • 17. Solving Conflicts Optimistically 1. Record start version 2. Record reads/writes 3. Attempt commit 4. If someone else wins, check if anything you read has changed. 5. Try again. 000000.json 000001.json 000002.json User 1 User 2 Write: Append Read: Schema Write: Append Read: Schema
  • 18. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.