
Marquez: An Open Source Metadata Service for ML Platforms

Data quality describes the correctness, reliability, and usability of datasets. Data scientists and business analysts often judge a dataset's quality by its trustworthiness and completeness. But what information is needed to tell good data from bad? How quickly can data quality issues be identified and explored? More importantly, how can metadata help data scientists make better sense of the high volume of data their organization draws from a variety of sources?

To maximize the usefulness of datasets for data-intensive applications, metadata must be collected, maintained, and shared across the organization. Investing in metadata enables data lineage, data governance, and data discovery.
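As a rough illustration of what collected metadata makes possible, the sketch below uses a hypothetical in-memory store (not the Marquez API) in which each job pushes a record of its own run. The entity names loosely follow the data model in the slides (Job, JobVersion, Run, Dataset, DatasetVersion). The class names, method names, and dataset names are assumptions made for this example; only `room_bookings_7_days` is taken from a file name on the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (name, version) pair identifying one DatasetVersion or one JobVersion.
VersionRef = Tuple[str, str]


@dataclass
class Run:
    """One execution of a job version and the dataset versions it touched."""
    job_version: VersionRef
    inputs: List[VersionRef] = field(default_factory=list)
    outputs: List[VersionRef] = field(default_factory=list)


@dataclass
class MetadataStore:
    """Hypothetical in-memory stand-in for a metadata service."""
    runs: List[Run] = field(default_factory=list)

    def record_run(self, job, version, inputs, outputs):
        """Push-style collection: the job reports its own run metadata."""
        run = Run((job, version), list(inputs), list(outputs))
        self.runs.append(run)
        return run

    def consumers_of(self, dataset, version):
        """Answer: which job version(s) consumed this dataset version?"""
        target = (dataset, version)
        return sorted({r.job_version for r in self.runs if target in r.inputs})


store = MetadataStore()
store.record_run("room_bookings_7_days", "v1",
                 inputs=[("bookings", "v3")], outputs=[("bookings_7_days", "v1")])
store.record_run("room_bookings_7_days", "v2",
                 inputs=[("bookings", "v4")], outputs=[("bookings_7_days", "v2")])
store.record_run("train_demand_model", "v1",
                 inputs=[("bookings", "v4")], outputs=[])

consumers = store.consumers_of("bookings", "v4")
print(consumers)
```

Keeping versions on both jobs and datasets is what lets the debugging question from the slides be answered with a simple filter over recorded runs.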

Machine learning (ML) jobs are just another type of data-intensive application and benefit from metadata as well. But unlike most software projects, which use established tools for maintaining quality, ML projects have fewer safeguards to prevent defects. Marquez helps fill this tooling gap by tracking the relationships between training jobs, input datasets, and ML models. Marquez also links the different variations of a training job, which can multiply quickly through experimentation and hyperparameter optimization. Data lineage tracking in Marquez also reveals unexpected changes in upstream data dependencies, which can harm model performance and be time-consuming to debug.
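The lineage idea above can be sketched as a walk over a dependency graph. This is a minimal illustration, assuming lineage is available as a plain mapping from each node to its direct upstream nodes. The node names and the `UPSTREAM` mapping are hypothetical, and the code does not use or represent the Marquez API.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to the nodes directly upstream of it.
UPSTREAM = {
    "model:demand_forecast": ["job:train_demand_model"],
    "job:train_demand_model": ["dataset:bookings_7_days"],
    "dataset:bookings_7_days": ["job:room_bookings_7_days"],
    "job:room_bookings_7_days": ["dataset:bookings"],
    "dataset:bookings": [],
}


def upstream_lineage(node, edges):
    """Breadth-first walk returning every upstream node, nearest first."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def upstream_datasets(node, edges):
    """Keep only the datasets a model (or job) ultimately depends on."""
    return [n for n in upstream_lineage(node, edges) if n.startswith("dataset:")]


datasets = upstream_datasets("model:demand_forecast", UPSTREAM)
print(datasets)
```

When a model's accuracy drops, a list like this gives the datasets to inspect, starting with the ones closest to the training job.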

In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain high model performance and prevent quality issues.
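One way to prevent quality issues, which the slides describe as disabling training when quality checks fail, is to consult dataset metadata before a training run starts. The sketch below assumes each dataset version carries a boolean `quality_status`, as shown on the slides. The lookup table, the function name, and the dataset names are hypothetical and stand in for whatever a real metadata service would return.

```python
from typing import Dict, Iterable, Optional, Tuple

# (dataset name, version) pair identifying one dataset version.
VersionRef = Tuple[str, str]


def safe_to_train(inputs: Iterable[VersionRef],
                  quality_status: Dict[VersionRef, Optional[bool]]) -> bool:
    """Allow training only if every input dataset version passed its checks.

    A version with no recorded status (missing or None) is treated as unsafe,
    which is a conservative assumption made for this sketch.
    """
    return all(quality_status.get(ref) is True for ref in inputs)


# Hypothetical metadata: quality_status per dataset version.
quality_status = {
    ("bookings", "v3"): True,
    ("bookings", "v4"): False,
    ("members", "v7"): True,
}

ok_run = safe_to_train([("bookings", "v3"), ("members", "v7")], quality_status)
blocked_run = safe_to_train([("bookings", "v4"), ("members", "v7")], quality_status)
print(ok_run, blocked_run)
```

A scheduler could call a check like this as the first step of a training job and skip the run when it returns `False`, so bad data does not propagate into a new model version.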


Marquez: An Open Source Metadata Service for ML Platforms

  1. Marquez: An Open Source Metadata Service for ML Platforms. Data Platform, AI NEXTCon SF ‘19
  2. Software Engineer, Marquez Team, Data Platform (@wslulciuc); Staff Engineer, ML Team, Data Platform (@theshah)
  3. AGENDA: 01 Why metadata? 02 Intro to Marquez 03 Demo 04 ML + Marquez
  4. 01 Why metadata?
  5. Why manage and utilize metadata? Data lineage ● Add context to data. Democratize ● Self-service data culture. Data quality ● Build trust in data
  6. … creating a healthy data ecosystem
  7. A healthy data ecosystem. Freedom ● Experiment ● Flexible ● Self-sufficient. Accountability ● Cost ● Trust. Self-service ● Discover ● Explore ● Global context
  8. Metadata (Marquez): Ingest, Storage, Compute (Streaming, Batch/ETL) ● WeWork’s Data platform built around Marquez ● Integrations ○ Ingest ○ Storage ○ Compute (diagram labels: Flink, Airflow, Kafka, Iceberg / S3)
  9. 02 Intro to Marquez
  10. Marquez: Data Lineage, Data Governance, Data Discovery
  11. Marquez: Data model (diagram labels: Source, Dataset, DatasetVersion, Job, JobVersion, Run; relationships are one-to-many)
  12. Marquez: Data model (diagram labels: DbTable, Filesystem, Stream, Source, Dataset, DatasetVersion, Job, JobVersion, Run; relationships are one-to-many)
  13. Marquez: Data model (diagram of versioned Jobs and Datasets). Design benefits ● Debugging ○ What job version(s) consumed dataset version X? ● Backfilling ○ Full / incremental processing
  14. Marquez: Metadata collection. How is metadata collected? ● Push-based metadata collection ● REST API ● Language-specific SDKs ○ Java ○ Python ○ Go (diagram: Job sends dataset + job metadata to Marquez)
  15. Integrations; APIs / Libraries
  16. Services (Core, Governance, Search); Integrations; APIs / Libraries
  17. Services (Core, Governance, Search); Storage (Metadata DB, Graph, Search); Integrations; APIs / Libraries
  18. Marquez @ WeWork (diagram labels: Core, ETL, Batch, Stream, Search, Governance, UI, CI/CD, Marquez)
  19. +Marquez
  20. Marquez: Airflow. Airflow support for Marquez ● Enables global task-level metadata collection ● Extends Airflow’s DAG class. Code shown (room_bookings_7_days_dag.py): from marquez_airflow import DAG; from airflow.operators.postgres_operator import PostgresOperator ...
  21. Marquez: Airflow. Airflow support for Marquez (cont.) ● Metadata ○ Task lifecycle ○ Task parameters ○ Task runs linked to versioned code ○ Task inputs / outputs ● Lineage ○ Track origin of data (diagram labels: Airflow, DAG, Marquez Lib.)
  22. Marquez: Airflow. Capturing task-level metadata in a nutshell (diagram labels: DAG, Marquez API, Integration, Marquez REST API; data model: Source, Dataset, Dataset Version, Job, Job Version, Run)
  23. 03 Demo
  24. 04 ML + Marquez
  25. Let’s start with a (familiar?) story...
  26. ● You are a successful Data Scientist or Machine Learning Engineer ● Your organization has a healthy data ecosystem ● Occasionally you build ML models for periodic, offline use
  27. ● You are a successful Data Scientist or Machine Learning Engineer ● Your organization has a healthy data ecosystem ● Occasionally you build ML models for one-time use ● Life is good
  28. ● Your CTO schedules a meeting with you ● He says those ML models are great and all… ● But he wants way more models… making real-time predictions… driving impactful business decisions
  29. ● You’re going from ML in “the Small” to ML in “the Large” [1] ● What happens next? [1] https://al3x.net/posts/2010/07/27/node.html
  30. Machine Learning at Scale ● You set up infrastructure to build way more models ● Your models are driving business decisions in real-time ● The models make great predictions (diagram: three Models, each 😄)
  31. Some problems emerge
  32. Machine Learning at Scale ● For some models, accuracy is declining without explanation ● There are no bugs in the training workflow ● Changing learning algorithms does not help (diagram: Model 😭)
  33. Machine Learning at Scale (image: https://xkcd.com/1838/)
  34. Model Lineage Tracking in Marquez
  35. Machine Learning at Scale. Check the upstream data!
  36. Machine Learning at Scale. Check the training data! (diagram labels: Model)
  37. Machine Learning at Scale. Check the training data! (diagram labels: Dataset, Model, Job)
  38. Machine Learning at Scale. Check the training data! (diagram labels: Job, Dataset, Model, Job)
  39. Machine Learning at Scale. Check the training data! (diagram labels: Job, Dataset, Model, Dataset, Job)
  40. Marquez: Data model (diagram labels: Job, Dataset, JobVersion, Run, DatasetVersion; relationships are one-to-many)
  41. Marquez: Data model (diagram labels: Job, Dataset, JobVersion, Run, DatasetVersion, Model; relationships are one-to-many)
  42. Marquez: Data model. Lineage Visualization!
  43. Marquez: Data model. Lineage Visualization! (diagram label: Model)
  44. Marquez: Data model. Lineage Visualization for ML Models (diagram label: Model)
  45. ML + Marquez: Problems Solved ✅ Identified training data issues with lineage
  46. Things are not OK yet
  47. Machine Learning at Scale ● You traced the upstream lineage and found the source of bad data (diagram labels: Job, Dataset, Model, Dataset, Job)
  48. Machine Learning at Scale ● You traced the upstream lineage and found the source of bad data ● But it will take days of data cleansing work before model accuracy is restored (diagram labels: Job, Dataset, Model, Dataset, Job)
  49. Machine Learning at Scale ● You traced the upstream lineage and found the source of bad data ● But it will take days of data cleansing work before model accuracy is restored ● You need to roll back to the last good model (diagram labels: Job, Dataset, Model, Dataset, Job)
  50. Model Version Tracking in Marquez
  51. Marquez: Data model. Model Version Tracking! (diagram labels: Job, Dataset, JobVersion, Run, DatasetVersion, Model, ModelVersion; relationships are one-to-many)
  52. Marquez: Data model. Model Version Tracking: v1 ✅ v2 ✅ v3 ✅ v4 ❌
  53. ML + Marquez: Problems Solved ✅ Identified training data issues with lineage ✅ Fast model rollbacks with model version tracking
  54. Could this have been prevented?
  55. Machine Learning at Scale ● Regularly test quality of upstream datasets? (diagram labels: Model)
  56. Machine Learning at Scale ● Regularly test quality of upstream datasets? (diagram labels: Dataset, Model, Job, Test ✅)
  57. Machine Learning at Scale ● Regularly test quality of upstream datasets? (diagram labels: Job, Dataset, Model, Dataset, Job, Dataset, Test ✅, Test ❌)
  58. Machine Learning at Scale ● Regularly test quality of upstream datasets? ● Automatically alert an engineer for faster resolution (diagram labels: Job, Dataset, Model, Dataset, Job, Dataset, Test ✅, Test ❌)
  59. Data Quality Tracking: Leverage existing Data Quality Frameworks
  60. Dataset Quality Tracking in Marquez
  61. Marquez: Data model. Track Quality on Dataset Versions! DatasetVersion gains a field: quality_status boolean (diagram labels: Job, Dataset, JobVersion, Run, DatasetVersion, Model, ModelVersion)
  62. Is Quality Monitoring Enough?
  63. Machine Learning at Scale ● Data pipelines often run on a fixed schedule (diagram labels: Job, Dataset, Model, Dataset, Job, Dataset ❌)
  64. Machine Learning at Scale ● Data pipelines often run on a fixed schedule ● It’s a race to fix the issue before bad data propagates (diagram labels: Job, Dataset, Model, Dataset, Job, Dataset ❌)
  65. Machine Learning at Scale ● Data pipelines often run on a fixed schedule ● It’s a race to fix the issue before bad data propagates (diagram labels: Job, Dataset, Model, Dataset, Job, Dataset ❌)
  66. Machine Learning at Scale ● Data pipelines often run on a fixed schedule ● It’s a race to fix the issue before bad data propagates ● What if the pipeline was dataset quality-aware? (diagram labels: Job, Dataset, Model, Dataset, Job, Dataset ❌)
  67. Machine Learning at Scale ● Data pipelines often run on a fixed schedule ● It’s a race to fix the issue before bad data propagates ● What if the pipeline was aware of dataset quality? (diagram labels: Job, Dataset, Model, Dataset, Job, Dataset ❌, OK to run?)
  68. Disable Training when quality checks fail
  69. Marquez: Data model. Determine if training is safe by checking metadata: DatasetVersion quality_status boolean (diagram labels: Job, Dataset, JobVersion, Run, DatasetVersion, Model, ModelVersion)
  70. ML + Marquez: Problems Solved ✅ Identified training data issues with lineage ✅ Fast model rollbacks with model version tracking ✅ Prevent bad training runs with data quality checking
  71. Thanks!
  72. github.com/MarquezProject @MarquezProject
  73. Questions? Data Platform, AI NEXTCon SF ‘19
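
Slides 49 to 52 describe rolling back to an earlier model version while bad upstream data is being cleaned. A minimal sketch of choosing the rollback target follows, assuming each model version is stored with a pass/fail flag. The `ModelVersion` class and its `healthy` field are hypothetical names for this example, and the v1 to v4 history assumes the marks on slide 52 map to versions in order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelVersion:
    version: str
    healthy: bool  # hypothetical flag: did this version meet its accuracy bar?


def rollback_target(history: List[ModelVersion]) -> Optional[ModelVersion]:
    """Return the most recent healthy version, scanning newest to oldest."""
    for candidate in reversed(history):
        if candidate.healthy:
            return candidate
    return None  # no healthy version has ever been recorded


# Version history ordered oldest to newest (assumed reading of slide 52).
history = [
    ModelVersion("v1", True),
    ModelVersion("v2", True),
    ModelVersion("v3", True),
    ModelVersion("v4", False),
]

target = rollback_target(history)
print(target.version if target else "no rollback target")
```

Because every version is recorded rather than overwritten, the rollback becomes a lookup instead of a retraining job.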
