
Marquez: An Open Source Metadata Service for ML Platforms

The term data quality is used to describe the correctness, reliability, and usability of datasets. Data scientists and business analysts often determine the quality of a dataset by its trustworthiness and completeness. But what information might be needed to differentiate between good vs bad data? How quickly can data quality issues be identified and explored? More importantly, how can metadata enable data scientists to make better sense of the high volume of data within their organization from a variety of data sources?

To maximize the usefulness of datasets for data-intensive applications, it is critical that metadata is collected, maintained, and shared across the organization. Investing in metadata enables data lineage, data governance, and data discovery.

Machine Learning (ML) jobs, just another type of data-intensive application, would benefit from metadata as well. But unlike most software projects, which use established tools for maintaining quality, ML projects have fewer safeguards to prevent defects. Marquez helps fill this tooling gap for ML jobs by tracking the relationships between training jobs, input datasets, and ML models. Marquez also links the many variations of a training job, whose number can grow quickly due to experimentation and hyperparameter optimization. Data lineage tracking in Marquez also reveals unexpected changes in upstream data dependencies, which can harm model performance and be time-consuming to debug.

In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain high model performance and prevent quality issues.
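
The relationship tracking described in the abstract (training jobs, input datasets, and ML models) can be pictured as a small dependency graph. The following is a minimal in-memory sketch, not the Marquez API: the `LineageGraph` class, its methods, and the job, dataset, and model names are all hypothetical and exist only to illustrate how upstream dependencies of a model could be walked.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage graph for illustration only; this is not the Marquez API."""

    def __init__(self):
        # Maps each node to the set of nodes it directly depends on.
        self._upstream = defaultdict(set)

    def record_run(self, job, inputs, outputs):
        """Record that `job` read `inputs` and produced `outputs`."""
        for dataset in inputs:
            self._upstream[job].add(dataset)
        for output in outputs:
            self._upstream[output].add(job)

    def upstream_of(self, node):
        """Return every job and dataset that `node` transitively depends on."""
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for parent in self._upstream.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


# Hypothetical names: a cleaning job feeds a training job that produces a model.
graph = LineageGraph()
graph.record_run(
    "job:clean_bookings",
    inputs=["dataset:raw_bookings"],
    outputs=["dataset:clean_bookings"],
)
graph.record_run(
    "job:train_model",
    inputs=["dataset:clean_bookings"],
    outputs=["model:demand_forecast"],
)

upstream = graph.upstream_of("model:demand_forecast")
print(sorted(upstream))
```

Walking the graph from the model back to `dataset:raw_bookings` is the kind of query that helps when a change in an upstream dependency degrades a model without any change to the training job itself.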

Published in: Data & Analytics

Marquez: An Open Source Metadata Service for ML Platforms

  1. Marquez: An Open Source Metadata Service for ML Platforms. Data Platform, AI NEXTCon SF ‘19
  2. Speakers: @wslulciuc, Software Engineer, Marquez Team, Data Platform; @theshah, Staff Engineer, ML Team, Data Platform
  3. AGENDA: 01 Why metadata? 02 Intro to Marquez 03 Demo 04 ML + Marquez
  4. 01 Why metadata?
  5. Why manage and utilize metadata? ● Data lineage ○ Add context to data ● Democratize ○ Self-service data culture ● Data quality ○ Build trust in data
  6. … creating a healthy data ecosystem
  7. A healthy data ecosystem ● Freedom ○ Experiment ○ Flexible ○ Self-sufficient ● Accountability ○ Cost ○ Trust ● Self-service ○ Discover ○ Explore ○ Global context
  8. ● WeWork’s Data platform built around Marquez ● Integrations ○ Ingest ○ Storage ○ Compute [Diagram: Metadata (Marquez) spanning Ingest, Storage, and Compute (Streaming, Batch/ETL); tools shown: Flink, Airflow, Kafka, Iceberg / S3]
  9. 02 Intro to Marquez
  10. Marquez: Data Lineage, Data Governance, Data Discovery
  11. Marquez: Data model [Diagram: Source, Dataset, DatasetVersion, Job, JobVersion, and Run, linked by one-to-many (1 to *) relationships]
  12. Marquez: Data model [Diagram: the same entities, plus DbTable, Filesystem, and Stream]
  13. Marquez: Data model. Design benefits ● Debugging ○ What job version(s) consumed dataset version X? ● Backfilling ○ Full / incremental processing [Diagram: versioned datasets and versioned jobs]
  14. Marquez: Metadata collection. How is metadata collected? ● Push-based metadata collection ● REST API ● Language-specific SDKs ○ Java ○ Python ○ Go [Diagram: a Job pushes dataset + job metadata to Marquez]
  15. [Diagram: Integrations; APIs / Libraries]
  16. [Diagram: Services (Core, Governance, Search); Integrations; APIs / Libraries]
  17. [Diagram: Services (Core, Governance, Search); Storage (Metadata DB, Graph, Search); Integrations; APIs / Libraries]
  18. Marquez @ WeWork [Diagram: Marquez Core with ETL, Batch, Stream, Search, Governance, UI, and CI/CD]
  19. +Marquez
  20. Marquez: Airflow. Airflow support for Marquez ● Enables global task-level metadata collection ● Extends Airflow’s DAG class. Example (room_bookings_7_days_dag.py): from marquez_airflow import DAG; from airflow.operators.postgres_operator import PostgresOperator ...
  21. Marquez: Airflow. Airflow support for Marquez (cont.) ● Metadata ○ Task lifecycle ○ Task parameters ○ Task runs linked to versioned code ○ Task inputs / outputs ● Lineage ○ Track origin of data [Diagram: Airflow DAGs reporting through the Marquez Lib.]
  22. Marquez: Airflow. Capturing task-level metadata in a nutshell [Diagram: DAG, Marquez API Integration, Marquez REST API, and the data model: Source, Dataset, DatasetVersion, Job, JobVersion, Run]
  23. 03 Demo
  24. 04 ML + Marquez
  25. Let’s start with a (familiar?) story...
  26. ● You are a successful Data Scientist or Machine Learning Engineer ● Your organization has a healthy data ecosystem ● Occasionally you build ML models for periodic, offline use
  27. ● You are a successful Data Scientist or Machine Learning Engineer ● Your organization has a healthy data ecosystem ● Occasionally you build ML models for one-time use ● Life is good
  28. ● Your CTO schedules a meeting with you ● He says those ML models are great and all… ● But he wants way more models ...making real-time predictions… driving impactful business decisions
  29. ● You’re going from ML in “the Small” to ML in “the Large” [1] ● What happens next? [1] https://al3x.net/posts/2010/07/27/node.html
  30. Machine Learning at Scale ● You set up infrastructure to build way more models ● Your models are driving business decisions in real-time ● The models make great predictions [Diagram: three models, each 😄]
  31. Some problems emerge
  32. Machine Learning at Scale ● For some models, accuracy is declining without explanation ● There are no bugs in the training workflow ● Changing learning algorithms does not help [Diagram: Model 😭]
  33. Machine Learning at Scale https://xkcd.com/1838/
  34. Model Lineage Tracking in Marquez
  35. Machine Learning at Scale: Check the upstream data!
  36. Machine Learning at Scale: Check the training data! [Diagram: Model]
  37. Machine Learning at Scale: Check the training data! [Diagram: Model, one Job, one Dataset]
  38. Machine Learning at Scale: Check the training data! [Diagram: Model, two Jobs, one Dataset]
  39. Machine Learning at Scale: Check the training data! [Diagram: Model, two Jobs, two Datasets]
  40. Marquez: Data model [Diagram: Job, JobVersion, Run, Dataset, DatasetVersion]
  41. Marquez: Data model [Diagram: Job, JobVersion, Run, Dataset, DatasetVersion, plus Model]
  42. Marquez: Data model. Lineage Visualization!
  43. Marquez: Data model. Lineage Visualization! [Diagram: Model]
  44. Marquez: Data model. Lineage Visualization for ML Models [Diagram: Model]
  45. ML + Marquez: Problems Solved ✅ Identified training data issues with lineage
  46. Things are not OK yet
  47. Machine Learning at Scale ● You traced the upstream lineage and found the source of bad data [Diagram: Model, two Jobs, two Datasets]
  48. Machine Learning at Scale ● You traced the upstream lineage and found the source of bad data ● But it will take days of data cleansing work before model accuracy is restored [Diagram: Model, two Jobs, two Datasets]
  49. Machine Learning at Scale ● You traced the upstream lineage and found the source of bad data ● But it will take days of data cleansing work before model accuracy is restored ● You need to roll back to the last good model [Diagram: Model, two Jobs, two Datasets]
  50. Model Version Tracking in Marquez
  51. Marquez: Data model. Model Version Tracking! [Diagram: Job, JobVersion, Run, Dataset, DatasetVersion, plus Model and ModelVersion]
  52. Marquez: Data model. Model Version Tracking: v1 ✅ v2 ✅ v3 ✅ v4 ❌
  53. ML + Marquez: Problems Solved ✅ Identified training data issues with lineage ✅ Fast model rollbacks with model version tracking
  54. Could this have been prevented?
  55. Machine Learning at Scale ● Regularly test quality of upstream datasets? [Diagram: Model]
  56. Machine Learning at Scale ● Regularly test quality of upstream datasets? [Diagram: Model, Job, Dataset; Test ✅]
  57. Machine Learning at Scale ● Regularly test quality of upstream datasets? [Diagram: Model, two Jobs, three Datasets; Test ✅, Test ❌]
  58. Machine Learning at Scale ● Regularly test quality of upstream datasets? ● Automatically alert an engineer for faster resolution [Diagram: Model, two Jobs, three Datasets; Test ✅, Test ❌]
  59. Data Quality Tracking: Leverage existing Data Quality Frameworks
  60. Dataset Quality Tracking in Marquez
  61. Marquez: Data model. Track Quality on Dataset Versions! [Diagram: Job, JobVersion, Run, Dataset, DatasetVersion, Model, ModelVersion; DatasetVersion gains quality_status (boolean)]
  62. Is Quality Monitoring Enough?
  63. Machine Learning at Scale ● Data pipelines often run on a fixed schedule [Diagram: Model, two Jobs, three Datasets; one marked ❌]
  64. Machine Learning at Scale ● Data pipelines often run on a fixed schedule ● It’s a race to fix the issue before bad data propagates [Diagram: Model, two Jobs, three Datasets; one marked ❌]
  65. Machine Learning at Scale ● Data pipelines often run on a fixed schedule ● It’s a race to fix the issue before bad data propagates [Diagram: Model, two Jobs, three Datasets; one marked ❌]
  66. Machine Learning at Scale ● Data pipelines often run on a fixed schedule ● It’s a race to fix the issue before bad data propagates ● What if the pipeline was dataset quality-aware? [Diagram: Model, two Jobs, three Datasets; one marked ❌]
  67. Machine Learning at Scale ● Data pipelines often run on a fixed schedule ● It’s a race to fix the issue before bad data propagates ● What if the pipeline was aware of dataset quality? [Diagram: Model, two Jobs, three Datasets; one marked ❌; “OK to run?”]
  68. Disable Training when quality checks fail
  69. Marquez: Data model. Determine if training is safe by checking metadata [Diagram: Job, JobVersion, Run, Dataset, DatasetVersion, Model, ModelVersion; DatasetVersion has quality_status (boolean)]
  70. ML + Marquez: Problems Solved ✅ Identified training data issues with lineage ✅ Fast model rollbacks with model version tracking ✅ Prevent bad training runs with data quality checking
  71. Thanks!
  72. github.com/MarquezProject @MarquezProject
  73. Questions?
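
Slides 49 to 69 describe three ideas: tracking model versions, recording a quality_status boolean on each dataset version, and refusing to train when quality checks fail. The sketch below shows one way these could fit together, under stated assumptions. Only the entity names DatasetVersion and ModelVersion and the quality_status field come from the slides. The `trained_on` link, the helper functions, and the sample names are hypothetical, and treating a model version as "good" when all of its input dataset versions passed their checks is a simplification chosen for illustration. None of this is the Marquez API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetVersion:
    dataset: str
    version: int
    quality_status: bool  # True when quality checks passed for this version


@dataclass
class ModelVersion:
    model: str
    version: int
    # Assumption: each model version records the dataset versions it was trained on.
    trained_on: List[DatasetVersion] = field(default_factory=list)

    def is_healthy(self) -> bool:
        # Simplification: healthy means every training input passed its checks.
        return all(dv.quality_status for dv in self.trained_on)


def safe_to_train(inputs: List[DatasetVersion]) -> bool:
    """Gate a training run on the quality status of its input dataset versions."""
    return all(dv.quality_status for dv in inputs)


def last_good_version(history: List[ModelVersion]) -> Optional[ModelVersion]:
    """Return the newest healthy model version, or None if there is none."""
    for mv in sorted(history, key=lambda m: m.version, reverse=True):
        if mv.is_healthy():
            return mv
    return None


# Hypothetical history mirroring slide 52: v1 to v3 are good, v4 is bad.
good = DatasetVersion("bookings", 3, quality_status=True)
bad = DatasetVersion("bookings", 4, quality_status=False)
history = [
    ModelVersion("forecast", 1, [DatasetVersion("bookings", 1, True)]),
    ModelVersion("forecast", 2, [DatasetVersion("bookings", 2, True)]),
    ModelVersion("forecast", 3, [good]),
    ModelVersion("forecast", 4, [bad]),
]

rollback_target = last_good_version(history)
print("roll back to version:", rollback_target.version)
print("safe to train on latest data:", safe_to_train([bad]))
```

A scheduled pipeline could call a check like `safe_to_train` before each training task, so that a failed quality test blocks the run instead of relying on an engineer to win the race against the schedule.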
