
Democratizing Data Quality Through a Centralized Platform

Developer Marketing and Relations at MuleSoft
Jun. 16, 2021

Democratizing Data Quality Through a Centralized Platform

  1. Democratizing Data Quality at Zillow through a Centralized Platform (Smit Shah, Yuliana Havryshchuk)
  2. Who We Are: Data Governance Platform Team @ Zillow ● Smit Shah, Senior Software Development Engineer, Big Data ● Yuliana Havryshchuk, Software Development Engineer, Big Data
  3. Agenda ● What is Zillow? ● Data Quality Challenges ● Centralized Data Quality Platform ○ Architecture ○ Self-Service ○ Pipeline integration ● Key Takeaways
  4. Zillow
  5. About Zillow ● Reimagining real estate to make it easier to unlock life’s next chapter ● Offer customers an on-demand experience for selling, buying, renting and financing with transparency and nearly seamless end-to-end service ● Most-visited real estate website in the United States * As of Q4-2020
  6. Data Quality Challenges
  7. Why Monitor Data Quality? ● Data fuels many customer-facing and internal services at Zillow that rely on high-quality data ○ Zestimate ○ Zillow Offers ○ Zillow Premier Agent ○ Econ and many more ● Reliable performance of ML and services requires a certain level of data quality
  8. Challenges we Faced ● No standard way to monitor quality ● Lack of visibility into data health ● No known lineage between data and processes
  9. Centralized Data Quality Platform
  10. 5 Pillars for Data Quality Platform ● Standardize Data Quality Rules ● Increase Visibility of Data Health ● Enable Safe Evolution of Rules ● Support Built-in Alerting ● Integrate with Data Lineage
  11. Platform Architecture * As of May 2021
  18. Self-Service Capabilities
  19. Self-Service Onboarding - Goals ● Must be scalable ● Must be accessible to all user archetypes ● Must require minimal configuration
  20. Self-Service Onboarding - Data Discovery * These values are simulated
  21. Self-Service Onboarding - Example * These values are simulated

      id  name               type       page_views  data_date
      1   123 Green St       house      709         2021-05-01
      2   47 Walker Rd       townhouse  132         2021-05-01
          1225 City St #901  condo      800         2021-05-01
      4   47 Walker Ave      test       600         2021-05-01
  22. Self-Service Onboarding - Rule-based * These values are simulated
  23. Self-Service Monitoring - Rule-based * These values are simulated
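The rule-based slides do not list the actual rules, so the following is only a minimal sketch of how row-level rules might be evaluated against the simulated sample from the example slide. The rule names, the allowed set of `type` values, and the reading of the third row's missing `id` as a null are all assumptions made for illustration. This is not the Luminaire Contract Evaluation Library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named row-level check; `check` returns True when the row passes."""
    name: str
    check: Callable[[Row], bool]


# Hypothetical allowed values; the slides do not define them.
ALLOWED_TYPES = {"house", "townhouse", "condo"}

# Hypothetical rules for the sample table.
RULES: List[Rule] = [
    Rule("id_not_null", lambda r: r["id"] is not None),
    Rule("type_in_allowed_set", lambda r: r["type"] in ALLOWED_TYPES),
    Rule("page_views_non_negative", lambda r: r["page_views"] >= 0),
]

# Simulated rows from the example slide; the third row's id is assumed null.
ROWS: List[Row] = [
    {"id": 1, "name": "123 Green St", "type": "house",
     "page_views": 709, "data_date": "2021-05-01"},
    {"id": 2, "name": "47 Walker Rd", "type": "townhouse",
     "page_views": 132, "data_date": "2021-05-01"},
    {"id": None, "name": "1225 City St #901", "type": "condo",
     "page_views": 800, "data_date": "2021-05-01"},
    {"id": 4, "name": "47 Walker Ave", "type": "test",
     "page_views": 600, "data_date": "2021-05-01"},
]


def evaluate(rows: List[Row], rules: List[Rule]) -> Dict[str, List[int]]:
    """Map each rule name to the indexes of the rows that fail it."""
    return {
        rule.name: [i for i, row in enumerate(rows) if not rule.check(row)]
        for rule in rules
    }


failures = evaluate(ROWS, RULES)
```

Under these assumed rules, the row with the missing `id` and the row whose `type` is `test` are the two that fail.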
  24. Self-Service Onboarding - Example * These values are simulated

      id  name          type   page_views  data_date
      1   123 Green St  house  709         2021-05-01
      1   123 Green St  house  820         2021-05-02
      1   123 Green St  house  12          2021-05-03
      1   123 Green St  house  760         2021-05-04
  25. Self-Service Onboarding - Metrics * These values are simulated
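To illustrate what a metric monitor could catch in the simulated `page_views` series from the preceding example slide, here is a small sketch that flags outliers with a modified z-score based on the median and the median absolute deviation. This is a deliberately simple stand-in chosen for illustration, not the method used by Luminaire. The function name and the 3.5 threshold are assumptions.

```python
from statistics import median
from typing import List, Tuple

# Simulated daily page_views for one listing, taken from the example slide.
PAGE_VIEWS: List[Tuple[str, int]] = [
    ("2021-05-01", 709),
    ("2021-05-02", 820),
    ("2021-05-03", 12),
    ("2021-05-04", 760),
]


def flag_anomalies(series: List[Tuple[str, float]],
                   threshold: float = 3.5) -> List[str]:
    """Return the dates whose value has a modified z-score above `threshold`.

    The modified z-score is 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation. The threshold is a hypothetical default.
    """
    values = [value for _, value in series]
    center = median(values)
    mad = median([abs(value - center) for value in values])
    if mad == 0:
        # All values sit on the median; nothing can be ranked as an outlier.
        return []
    return [
        date for date, value in series
        if abs(0.6745 * (value - center) / mad) > threshold
    ]


anomalous_dates = flag_anomalies(PAGE_VIEWS)
```

With only four points this is a toy, but it shows the idea: the drop to 12 views on 2021-05-03 stands far outside the other days, while the ordinary day-to-day variation does not.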
  26. Self-Service Onboarding - Monitoring (panels: Overview, Metric) * These values are simulated
  27. Behind the Scenes ● Rule-based monitors turn into contracts ● Metrics monitors turn into ML-based anomaly detection ● Register data quality requirements in config stores ● Dynamically generate validation pipelines
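One way to picture "register requirements in config stores, then dynamically generate validation pipelines" is a registry that turns declarative constraint entries into executable checks. The sketch below assumes a made-up config shape, dataset name, and constraint vocabulary (`not_null`, `allowed_values`, `min_value`); none of these come from the talk.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Check = Callable[[Row], bool]

# Hypothetical constraint types, each mapped to a factory that builds a check.
CONSTRAINT_FACTORIES: Dict[str, Callable[..., Check]] = {
    "not_null": lambda column: (
        lambda row: row.get(column) is not None),
    "allowed_values": lambda column, values: (
        lambda row: row.get(column) in values),
    "min_value": lambda column, minimum: (
        lambda row: row.get(column) is not None and row[column] >= minimum),
}

# Hypothetical contract as it might be registered in a config store.
CONTRACT_CONFIG: Dict[str, Any] = {
    "dataset": "listings_daily",
    "constraints": [
        {"type": "not_null", "column": "id"},
        {"type": "allowed_values", "column": "type",
         "values": ["house", "townhouse", "condo"]},
        {"type": "min_value", "column": "page_views", "minimum": 0},
    ],
}


def build_validator(
        config: Dict[str, Any]) -> Callable[[List[Row]], Dict[str, int]]:
    """Generate a validation function from a declarative contract."""
    checks: List[Tuple[str, Check]] = []
    for spec in config["constraints"]:
        params = {k: v for k, v in spec.items() if k != "type"}
        label = f'{spec["type"]}({spec["column"]})'
        checks.append((label, CONSTRAINT_FACTORIES[spec["type"]](**params)))

    def validate(rows: List[Row]) -> Dict[str, int]:
        # Count the failing rows for each constraint.
        return {
            label: sum(1 for row in rows if not check(row))
            for label, check in checks
        }

    return validate


validate = build_validator(CONTRACT_CONFIG)
report = validate([
    {"id": None, "type": "test", "page_views": 5},
    {"id": 1, "type": "house", "page_views": 10},
])
```

Because the checks are built from data rather than written by hand, adding a monitor in this sketch means adding a config entry, which fits the self-service goal of minimal configuration.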
  28. Validation Libraries Built in-house: ● Luminaire Contract Evaluation Library (Scala) for rule-based constraints ● Luminaire Anomaly Detection Library (Python) for time-series metrics ○ https://github.com/zillow/luminaire
  29. Pipeline Integration
  30. Pipeline Integration (before) Producers Consumers
  31. Pipeline Integration (after) Producers Consumers
  32. Validation Results ● Alert data users if any checks fail ● Integrate with pipeline execution to prevent propagation ● Provide visibility through data discovery tool ● Provide common understanding between producers and consumers
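Alerting on failure and preventing propagation can be sketched as a gate placed between a producer step and the publish step. Everything named here (`publish_if_valid`, `DataQualityError`, the two checks, the alert and publish callbacks) is hypothetical, and the talk does not describe how the real integration is wired.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
BatchCheck = Callable[[List[Row]], bool]


class DataQualityError(Exception):
    """Raised to stop a pipeline run when validation fails."""


def publish_if_valid(rows: List[Row],
                     checks: Dict[str, BatchCheck],
                     publish: Callable[[List[Row]], None],
                     alert: Callable[[List[str]], None]) -> None:
    """Publish `rows` only if every check passes; otherwise alert and raise."""
    failed = [name for name, check in checks.items() if not check(rows)]
    if failed:
        alert(failed)  # notify data users before stopping the run
        raise DataQualityError(f"failed checks: {', '.join(failed)}")
    publish(rows)


# Hypothetical batch-level checks.
CHECKS: Dict[str, BatchCheck] = {
    "non_empty_batch": lambda rows: len(rows) > 0,
    "id_not_null": lambda rows: all(r.get("id") is not None for r in rows),
}

alerts: List[List[str]] = []   # stands in for an alerting channel
published: List[Row] = []      # stands in for the consumer-facing table

# A clean batch reaches consumers.
publish_if_valid([{"id": 1}], CHECKS, published.extend, alerts.append)

# A bad batch triggers an alert and never reaches consumers.
try:
    publish_if_valid([{"id": None}], CHECKS, published.extend, alerts.append)
    blocked = False
except DataQualityError:
    blocked = True
```

Raising an exception is one simple way to make an orchestrator mark the task as failed so that downstream tasks do not run on bad data.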
  33. Future Direction ● Tighter integration between components ● Expand libraries to support more use-cases ● Move from detection to diagnosis ● Validation for streaming data
  34. Key Takeaways
  35. Key Takeaways ● 5 pillars that helped us build a robust platform: standardization, visibility, evolution, alerting, lineage ● Alerting on data quality issues early allows proactive response ● Producing quality data increases trust in data and improves decisions made ● Data quality is a shared responsibility, and collaboration is needed to be successful
  36. Questions? Thank you! https://www.zillow.com/careers/