Democratizing Data Quality Through a Centralized Platform

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.

At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:

● Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
● Performing data quality validations using libraries built to work with Spark
● Dynamically generating pipelines that can be abstracted away from users
● Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
● Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
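
The capabilities above (declared expectations, generated validation steps, and early flagging) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Zillow's implementation: the `Rule` class, the rule kinds (`not_null`, `in_set`, `min`), and the `build_validator` and `gate` helpers are all hypothetical names, and the rows are loosely modeled on the simulated example shown in the slides. The real platform runs its validations with libraries built for Spark.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """One declarative expectation, as it might be registered in a config store."""
    column: str
    check: str        # hypothetical rule kinds: "not_null", "in_set", "min"
    arg: Any = None


# Hypothetical registry that maps a rule kind to a predicate over (value, arg).
CHECKS: Dict[str, Callable[[Any, Any], bool]] = {
    "not_null": lambda value, _: value is not None,
    "in_set": lambda value, allowed: value in allowed,
    "min": lambda value, lower: value is not None and value >= lower,
}


def build_validator(rules: List[Rule]) -> Callable[[List[Row]], Dict[str, int]]:
    """Generate a validation step from config; it returns failure counts per rule."""
    def validate(rows: List[Row]) -> Dict[str, int]:
        failures: Dict[str, int] = {}
        for rule in rules:
            predicate = CHECKS[rule.check]
            failures[f"{rule.column}:{rule.check}"] = sum(
                1 for row in rows if not predicate(row.get(rule.column), rule.arg)
            )
        return failures
    return validate


def gate(rows: List[Row], validate: Callable[[List[Row]], Dict[str, int]]) -> Tuple[bool, Dict[str, int]]:
    """Return (ok, failures); a pipeline would publish downstream only when ok is True."""
    failures = validate(rows)
    return all(count == 0 for count in failures.values()), failures


# Expectations a producer or consumer might declare for the listing dataset.
rules = [
    Rule("id", "not_null"),
    Rule("type", "in_set", frozenset({"house", "townhouse", "condo"})),
    Rule("page_views", "min", 0),
]

# Simulated rows: one has no id, one has an unexpected type.
rows = [
    {"id": 1, "type": "house", "page_views": 709},
    {"id": 2, "type": "townhouse", "page_views": 132},
    {"id": None, "type": "condo", "page_views": 800},
    {"id": 4, "type": "test", "page_views": 600},
]

validate = build_validator(rules)
ok, failures = gate(rows, validate)
print(ok, failures)
```

Because the rules are plain data, the same validator factory can be reused for every dataset, which is one way the pipeline details could stay hidden from the people who only declare expectations.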

  1. 1. Democratizing Data Quality at Zillow through a Centralized Platform Smit Shah, Yuliana Havryshchuk
  2. 2. Who We Are Data Governance Platform Team @ Zillow ● Smit Shah, Senior Software Development Engineer, Big Data ● Yuliana Havryshchuk, Software Development Engineer, Big Data
  3. 3. Agenda ● What is Zillow? ● Data Quality Challenges ● Centralized Data Quality Platform ○ Architecture ○ Self-Service ○ Pipeline integration ● Key Takeaways
  4. 4. Zillow
  5. 5. About Zillow ● Reimagining real estate to make it easier to unlock life’s next chapter ● Offering customers an on-demand experience for selling, buying, renting, and financing with transparency and nearly seamless end-to-end service ● Most-visited real estate website in the United States * As of Q4-2020
  6. 6. Data Quality Challenges
  7. 7. Why Monitor Data Quality? ● Data fuels many customer-facing and internal services at Zillow that rely on high-quality data ○ Zestimate ○ Zillow Offers ○ Zillow Premier Agent ○ Econ, and many more ● Reliable performance of ML models and services requires a certain level of data quality
  8. 8. Challenges We Faced ● No standard way to monitor quality ● Lack of visibility into data health ● No known lineage between data and processes
  9. 9. Centralized Data Quality Platform
  10. 10. 5 Pillars for Data Quality Platform ● Standardize Data Quality Rules ● Increase Visibility of Data Health ● Enable Safe Evolution of Rules ● Support Built-in Alerting ● Integrate with Data Lineage
  11. 11. Platform Architecture * As of May 2021
  18. 18. Self-Service Capabilities
  19. 19. Self-Service Onboarding - Goals ● Must be scalable ● Must be accessible to all user archetypes ● Must require minimal configuration
  20. 20. Self-Service Onboarding - Data Discovery * These values are simulated
  21. 21. Self-Service Onboarding - Example * These values are simulated
        id  name               type       page_views  data_date
        1   123 Green St       house      709         2021-05-01
        2   47 Walker Rd       townhouse  132         2021-05-01
            1225 City St #901  condo      800         2021-05-01
        4   47 Walker Ave      test       600         2021-05-01
  22. 22. Self-Service Onboarding - Rule-based * These values are simulated
  23. 23. Self-Service Monitoring - Rule-based * These values are simulated
  24. 24. Self-Service Onboarding - Example * These values are simulated
        id  name          type   page_views  data_date
        1   123 Green St  house  709         2021-05-01
        1   123 Green St  house  820         2021-05-02
        1   123 Green St  house  12          2021-05-03
        1   123 Green St  house  760         2021-05-04
  25. 25. Self-Service Onboarding - Metrics * These values are simulated
  26. 26. Self-Service Onboarding - Monitoring (panels: Overview, Metric) * These values are simulated
  27. 27. Behind the Scenes ● Rule-based monitors turn into contracts ● Metrics monitors turn into ML-based anomaly detection ● Register data quality requirements in config stores ● Dynamically generate validation pipelines
  28. 28. Validation Libraries Built in-house: ● Luminaire Contract Evaluation Library (Scala) for rule-based constraints ● Luminaire Anomaly Detection Library (Python) for time-series metrics ○ https://github.com/zillow/luminaire
  29. 29. Pipeline Integration
  30. 30. Pipeline Integration (before) Producers Consumers
  31. 31. Pipeline Integration (after) Producers Consumers *
  32. 32. Validation Results ● Alert data users if any checks fail ● Integrate with pipeline execution to prevent propagation ● Provide visibility through data discovery tool ● Provide common understanding between producers and consumers
  33. 33. Future Direction ● Tighter integration between components ● Expand libraries to support more use cases ● Move from detection to diagnosis ● Validation for streaming data
  34. 34. Key Takeaways
  35. 35. Key Takeaways ● 5 pillars that helped us build a robust platform: standardization, visibility, evolution, alerting, lineage ● Alerting on data quality issues early allows proactive response ● Producing quality data increases trust in data and improves decisions made ● Data quality is a shared responsibility, and collaboration is needed to be successful
  36. 36. Questions? Thank you! https://www.zillow.com/careers/
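
As an illustration of the metric monitors mentioned on slide 27, the sketch below flags outliers in the simulated `page_views` series from slide 24. It is a hedged stand-in rather than the Luminaire algorithm: it uses a simple modified z-score (median and median absolute deviation) with the commonly used 3.5 cutoff, and the `flag_anomalies` function and its `threshold` parameter are hypothetical names chosen for this example. A series of only four points is far too short for real monitoring; it is used here only because it matches the slide.

```python
from statistics import median
from typing import List, Tuple


def flag_anomalies(series: List[Tuple[str, float]], threshold: float = 3.5) -> List[str]:
    """Return the dates whose value has a modified z-score above the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation of the series.
    """
    values = [value for _, value in series]
    center = median(values)
    mad = median([abs(value - center) for value in values])
    if mad == 0:
        # All values sit at the median (or nearly so); nothing can be scored.
        return []
    return [
        date
        for date, value in series
        if 0.6745 * abs(value - center) / mad > threshold
    ]


# Simulated daily page views for listing id 1, copied from slide 24.
page_views = [
    ("2021-05-01", 709),
    ("2021-05-02", 820),
    ("2021-05-03", 12),
    ("2021-05-04", 760),
]

print(flag_anomalies(page_views))
```

With these values the median is 734.5 and the MAD is 55.5, so only the drop to 12 views on 2021-05-03 exceeds the cutoff. The median-based score is used here because a single extreme value barely moves the median, whereas it would inflate a mean and standard deviation enough to hide itself in such a short series.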