Data Platform
Marquez: An Open Source Metadata Service for ML Platforms
AI NEXTCon SF ‘19
Software Engineer
Marquez Team, Data Platform
@wslulciuc
Staff Engineer
ML Team, Data Platform
@theshah
AGENDA
01 Why metadata?
02 Intro to Marquez
03 Demo
04 ML + Marquez
01 Why metadata?
Why manage and utilize metadata?
Data lineage
● Add context to data
Democratize
● Self-service data culture
Data quality
● Build trust in data
A healthy data ecosystem
… creating a healthy data ecosystem
Freedom
● Experiment
● Flexible
● Self-sufficient
Accountability
● Cost
● Trust
Self-service
● Discover
● Explore
● Global context
Data Platform
● WeWork’s Data platform built around Marquez
● Integrations
○ Ingest
○ Storage
○ Compute
[Diagram: Metadata (Marquez) alongside Ingest, Storage, and Compute (Batch/ETL, Streaming); components shown: Flink, Airflow, Kafka, Iceberg / S3]
02 Intro to Marquez
Marquez
● Data Lineage
● Data Governance
● Data Discovery
Marquez: Data model
[Diagram: entity-relationship model. Entities: Source, Dataset (types: DbTable, Filesystem, Stream), DatasetVersion, Job, JobVersion, Run. Entities are linked by one-to-many (1 to *) relationships.]
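The entities in the data model diagram can be sketched as plain Python dataclasses. This is only an illustration of the shape of the model: the field names, the example data, and the exact way Runs link to DatasetVersions are assumptions read off the diagram, not the Marquez schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetVersion:
    version: str


@dataclass
class Dataset:
    name: str
    kind: str  # one of "DbTable", "Filesystem", "Stream"
    versions: List[DatasetVersion] = field(default_factory=list)


@dataclass
class Source:
    name: str
    datasets: List[Dataset] = field(default_factory=list)  # Source 1 to * Dataset


@dataclass
class Run:
    run_id: str
    # Assumption: a run records the dataset versions it read and wrote.
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)


@dataclass
class JobVersion:
    version: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Job:
    name: str
    versions: List[JobVersion] = field(default_factory=list)


# Hypothetical example data: one source, one dataset, one job run.
bookings_v1 = DatasetVersion("v1")
bookings = Dataset("room_bookings", "DbTable", [bookings_v1])
warehouse = Source("warehouse_db", [bookings])
run = Run("run-1", inputs=[bookings_v1])
job = Job("room_bookings_7_days", [JobVersion("v1", [run])])
```

Keeping versions as separate objects is what lets a run point at the exact dataset version it consumed rather than at the dataset as a whole.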
Marquez: Data model
Design benefits
● Debugging
○ What job version(s) consumed dataset version X?
● Backfilling
○ Full / incremental processing
[Diagram: dataset versions linked to the job versions (v1, v2) that read and write them]
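The debugging question above ("What job version(s) consumed dataset version X?") becomes a simple lookup once every run records the dataset versions it read. A minimal in-memory sketch, assuming hypothetical run records and names rather than a real Marquez query:

```python
from collections import defaultdict

# Hypothetical run records: (job name, job version, input dataset versions).
runs = [
    ("room_bookings_7_days", "v1", [("room_bookings", "v1")]),
    ("room_bookings_7_days", "v2", [("room_bookings", "v4")]),
    ("occupancy_report", "v1", [("room_bookings", "v4")]),
]


def index_consumers(run_records):
    """Map each (dataset, version) to the set of (job, job version) that read it."""
    consumers = defaultdict(set)
    for job, job_version, inputs in run_records:
        for dataset_version in inputs:
            consumers[dataset_version].add((job, job_version))
    return consumers


consumers = index_consumers(runs)
# Job versions that consumed room_bookings v4, in sorted order.
affected = sorted(consumers[("room_bookings", "v4")])
```

The same index could drive a backfill: the affected job versions are the ones to re-run once a corrected dataset version exists.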
Marquez: Metadata collection
How is metadata collected?
● Push-based metadata collection
● REST API
● Language-specific SDKs
○ Java
○ Python
○ Go
[Diagram: a Job pushes dataset + job metadata to Marquez]
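Push-based collection means the job itself sends its metadata to the service rather than waiting to be crawled. The sketch below builds such a request with the Python standard library. The base URL, route, and payload fields are placeholders chosen for illustration, not the documented Marquez REST API, and the request is constructed but not sent.

```python
import json
import urllib.request

# Assumption: placeholder base URL, not a real deployment.
MARQUEZ_URL = "http://localhost:5000"


def build_job_request(namespace, job_name, inputs, outputs):
    """Build (but do not send) a PUT request describing a job's datasets."""
    # Assumption: hypothetical payload shape and route.
    payload = {"inputs": inputs, "outputs": outputs}
    return urllib.request.Request(
        url=f"{MARQUEZ_URL}/namespaces/{namespace}/jobs/{job_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_job_request(
    "analytics",
    "room_bookings_7_days",
    inputs=["room_bookings"],
    outputs=["room_bookings_7_days"],
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

In practice a language-specific SDK would wrap this step so that jobs do not construct HTTP requests by hand.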
[Diagram: layered view of Marquez. Integrations: APIs / Libraries. Services: Core, Governance, Search. Storage: Metadata DB, Graph, Search.]
Marquez @ WeWork
[Diagram: Marquez Core together with ETL, Batch, Stream, Search, Governance, UI, and CI/CD]
Airflow + Marquez
Marquez: Airflow
Airflow support for Marquez
● Enables global task-level metadata collection
● Extends Airflow’s DAG class
room_bookings_7_days_dag.py
from marquez_airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
...
[Diagram: Airflow DAGs reporting through the Marquez Lib.]
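The idea of extending a DAG class to collect metadata can be illustrated without Airflow installed. In the sketch below, `BaseDAG` and `MetadataDAG` are hypothetical stand-ins, not Airflow's `DAG` or the `marquez_airflow` implementation. It only shows the pattern: a subclass wraps the parent's behavior and reports a record per task.

```python
class BaseDAG:
    """Stand-in for a scheduler's DAG class (not Airflow's real DAG)."""

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = []

    def add_task(self, task_id):
        self.tasks.append(task_id)

    def run(self):
        return [f"ran {task_id}" for task_id in self.tasks]


class MetadataDAG(BaseDAG):
    """Hypothetical subclass that reports metadata for every task it runs."""

    def __init__(self, dag_id, reporter):
        super().__init__(dag_id)
        self.reporter = reporter  # any callable that accepts a dict

    def run(self):
        results = super().run()
        for task_id in self.tasks:
            self.reporter({"dag": self.dag_id, "task": task_id, "state": "COMPLETED"})
        return results


collected = []
dag = MetadataDAG("room_bookings_7_days", reporter=collected.append)
dag.add_task("load_bookings")
dag.run()
```

Because the subclass keeps the parent's interface, existing DAG files only need to change their import, which matches the one-line change shown on the slide.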
Marquez: Airflow
Airflow support for Marquez (cont.)
● Metadata
○ Task lifecycle
○ Task parameters
○ Task runs linked to versioned code
○ Task inputs / outputs
● Lineage
○ Track origin of data
Marquez: Airflow
Capturing task-level metadata in a nutshell
[Diagram: DAG, Integration, Marquez API, Marquez REST API, and the data model entities Source, Dataset, Dataset Version, Job, Job Version, Run]
03 Demo
04 ML + Marquez
Let’s start with a (familiar?) story...
● You are a successful Data Scientist or Machine Learning Engineer
● Your organization has a healthy data ecosystem
● Occasionally you build ML models for periodic or one-time offline use
● Life is good
● Your CTO schedules a meeting with you
● He says those ML models are great and all…
● But he wants way more models… making real-time predictions… driving impactful business decisions
● You’re going from ML in “the Small” to ML in “the Large” [1]
● What happens next?
[1] https://al3x.net/posts/2010/07/27/node.html
Machine Learning at Scale
● You set up infrastructure to build way more models
● Your models are driving business decisions in real-time
● The models make great predictions
[Diagram: three Models, all 😄]
Some problems emerge
Machine Learning at Scale
● For some models, accuracy is declining without explanation
● There are no bugs in the training workflow
● Changing learning algorithms does not help
[Diagram: Model 😭]
Machine Learning at Scale
https://xkcd.com/1838/
Model Lineage Tracking in Marquez
Machine Learning at Scale
Check the upstream data!
Machine Learning at Scale
Check the training data!
[Diagram: the Model's upstream lineage of Jobs and Datasets]
Marquez: Data model
[Diagram: the data model (Job, JobVersion, Run, Dataset, DatasetVersion) extended with a Model entity, linked by one-to-many (1 to *) relationships]
Lineage Visualization!
Marquez: Data model
Lineage Visualization for ML Models
[Figure: lineage graph that includes a Model node]
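Checking a model's training data amounts to walking the lineage graph upstream from the model. A minimal breadth-first sketch over a hypothetical edge map follows. The node names and the graph itself are invented for illustration and are not taken from Marquez.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to the nodes directly upstream of it.
upstream = {
    "model:churn": ["job:train_churn"],
    "job:train_churn": ["dataset:features"],
    "dataset:features": ["job:build_features"],
    "job:build_features": ["dataset:room_bookings"],
    "dataset:room_bookings": [],
}


def trace_upstream(node, edges):
    """Return every upstream node of `node`, nearest first (breadth-first)."""
    seen, order, queue = {node}, [], deque([node])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = trace_upstream("model:churn", upstream)
```

The `seen` set keeps the traversal from revisiting a dataset that feeds the model through more than one path.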
ML + Marquez
Problems Solved
✅ Identified training data issues with lineage
Things are not OK yet
Machine Learning at Scale
● You traced the upstream lineage and found the source of bad data
● But it will take days of data cleansing work before model accuracy is restored
● You need to roll back to the last good model
[Diagram: the Model's upstream lineage of Jobs and Datasets]
Model Version Tracking in Marquez
Marquez: Data model
[Diagram: the data model extended with Model and ModelVersion entities, linked by one-to-many (1 to *) relationships]
Model Version Tracking!
Marquez: Data model
Model Version Tracking
[Diagram: model versions v1 ✅, v2 ✅, v3 ✅, v4 ❌]
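With model versions tracked, a rollback reduces to finding the most recent version that is still marked good. A minimal sketch, assuming a hypothetical version history in which each entry carries a pass/fail flag:

```python
# Hypothetical model version history: (version, passed_validation), oldest first.
history = [("v1", True), ("v2", True), ("v3", True), ("v4", False)]


def last_good_version(versions):
    """Return the newest version whose flag is True, or None if there is none."""
    for version, ok in reversed(versions):
        if ok:
            return version
    return None


rollback_target = last_good_version(history)
```

How a version earns its flag (offline metrics, production monitoring, manual review) is outside this sketch.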
ML + Marquez
Problems Solved
✅ Identified training data issues with lineage
✅ Fast model rollbacks with model version tracking
Could this have been prevented?
Machine Learning at Scale
● Regularly test quality of upstream datasets?
● Automatically alert an engineer for faster resolution
[Diagram: Tests attached to the Datasets upstream of the Model; one Test ✅, one Test ❌]
Data Quality Tracking
Leverage existing Data Quality Frameworks
Dataset Quality Tracking in Marquez
Marquez: Data model
[Diagram: the data model with Model and ModelVersion entities, linked by one-to-many (1 to *) relationships]
Track Quality on Dataset Versions!
DatasetVersion: quality_status boolean
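Tracking quality on dataset versions can be sketched as running a set of checks and storing the combined result in `quality_status`. Only the `quality_status` boolean comes from the slide. The class layout, the check functions, and the sample rows are assumptions, and a real setup would delegate the checks to an existing data quality framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetVersion:
    dataset: str
    version: str
    quality_status: Optional[bool] = None  # None until checks have run


def not_empty(rows):
    return len(rows) > 0


def no_null_room_ids(rows):
    return all(row.get("room_id") is not None for row in rows)


def run_quality_checks(dataset_version, rows, checks):
    """Run every check and record the combined result on the dataset version."""
    dataset_version.quality_status = all(check(rows) for check in checks)
    return dataset_version.quality_status


checks = [not_empty, no_null_room_ids]
good = DatasetVersion("room_bookings", "v3")
bad = DatasetVersion("room_bookings", "v4")
run_quality_checks(good, [{"room_id": 1}, {"room_id": 2}], checks)
run_quality_checks(bad, [{"room_id": 1}, {"room_id": None}], checks)
```

Storing the result on the version, rather than on the dataset, keeps the history of which versions passed and which did not.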
Is Quality Monitoring Enough?
Machine Learning at Scale
● Data pipelines often run on a fixed schedule
● It’s a race to fix the issue before bad data propagates
● What if the pipeline was aware of dataset quality?
[Diagram: a ❌ Dataset upstream of the Model; the downstream Job asks "OK to run?"]
Disable Training when quality checks fail
Marquez: Data model
[Diagram: the data model with Model and ModelVersion entities, linked by one-to-many (1 to *) relationships]
Determine if training is safe by checking metadata
DatasetVersion: quality_status boolean
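The "OK to run?" check can be sketched as a gate in front of the training step: training proceeds only if every input dataset version has `quality_status` set to true. The dict layout and function names below are hypothetical. In a real pipeline the status would be read from the metadata service.

```python
def ok_to_run(input_versions):
    """True only if every input dataset version passed its quality checks."""
    # A missing or unknown status is treated as unsafe.
    return all(v.get("quality_status") is True for v in input_versions)


def train_if_safe(input_versions, train):
    """Call `train` only when all inputs are safe; otherwise skip and return None."""
    if not ok_to_run(input_versions):
        return None
    return train()


inputs_good = [{"dataset": "features", "version": "v3", "quality_status": True}]
inputs_bad = inputs_good + [
    {"dataset": "room_bookings", "version": "v4", "quality_status": False}
]

trained = train_if_safe(inputs_good, lambda: "model-v5")
skipped = train_if_safe(inputs_bad, lambda: "model-v6")
```

Treating an unknown status as unsafe is a design choice for this sketch. A pipeline could instead wait and retry until the checks have run.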
ML + Marquez
Problems Solved
✅ Identified training data issues with lineage
✅ Fast model rollbacks with model version tracking
✅ Prevent bad training runs with data quality checking
Thanks!
github.com/MarquezProject
@MarquezProject
Questions?