Luminaire is an open-source Python library developed by Zillow to provide scalable and automated time series anomaly detection. It uses AutoML techniques to optimize model selection and configuration with minimal input. Luminaire profiles and preprocesses time series data, trains both batch and streaming models, scores anomalies, and supports distributed training and scoring on large datasets using Spark. Zillow uses Luminaire to monitor data quality across many metrics and data sources that power its products and services.
2. Who We Are
Data Governance Platform Team
@ Zillow
Sayan Chakraborty
Senior Applied Scientist
Smit Shah
Senior Software Development Engineer, Big Data
3. Agenda
● What is Zillow?
● Why Monitor Data Quality
● Data Quality Challenges
● Luminaire and Scaling
● Key Takeaways
5. About Zillow
● Reimagining real estate to make it easier to unlock life’s next chapter
● Offer customers an on-demand experience for selling, buying, renting, and financing with transparency and nearly seamless end-to-end service
● Most-visited real estate website in the United States*
* As of Q4-2020
7. Why Monitor Data Quality?
● Data fuels many customer-facing and internal services at Zillow that depend on high-quality inputs
○ Zestimate
○ Zillow Offers
○ Zillow Premier Agent
○ Econ and many more
● Reliable performance of ML models and services requires a certain level of data quality
8. Why Detect Anomalies?
Anomaly: a data instance or behavior significantly different from the ‘regular’ patterns
● Complex
● Time-sensitive
● Inevitable
Catching anomalies in important metrics helps keep our business healthy
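To make the definition concrete, here is a minimal sketch (illustrative only, not Luminaire's API): a point can be flagged as anomalous when it deviates far from the 'regular' pattern of recent history, e.g., by more than a few standard deviations. The function name and threshold are hypothetical.

```python
import statistics

def is_anomaly(history, value, threshold=3.0):
    """Flag `value` as anomalous if it lies more than `threshold`
    standard deviations away from the mean of `history`."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

# A metric that normally hovers around 100
history = [98, 101, 99, 100, 102, 97, 103, 100, 99, 101]
print(is_anomaly(history, 100))  # a regular observation
print(is_anomaly(history, 250))  # a large, unexpected spike
```

Real systems such as Luminaire replace this fixed rule with models that account for trend, seasonality, and changing variance.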
9. Ways to Monitor Data Quality
Rule Based
● Domain experts set pre-specified rules or thresholds
○ Example: Percent of null data should be less than 2% per day for a given metric
● Less complicated to set up and easy to interpret
● Works well when the properties of the data are simple and remain stationary over time
ML Based
● Rules are set through mathematical modeling
● Works well when the properties of the data are complex and change over time
● A more hands-off approach
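The rule-based approach above amounts to a simple threshold check. As a sketch (a hypothetical helper, not part of any library), the daily null-rate rule from the example could look like:

```python
def passes_null_rate_rule(values, max_null_pct=2.0):
    """Rule-based check: the percent of null (None) records for the
    day must stay below a pre-specified threshold for the metric."""
    if not values:
        return False  # no data at all is itself a quality failure
    null_pct = 100.0 * sum(v is None for v in values) / len(values)
    return null_pct < max_null_pct

# One day of a metric: 1 null out of 100 records (1% nulls) -> passes
day = [None] + [1.0] * 99
print(passes_null_rate_rule(day))
```

The appeal is interpretability; the drawback is that the 2% threshold must be chosen and maintained by a domain expert, and it breaks down once the data's properties drift over time.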
11. Data Quality is Context Dependent
● Depends on the use case
● Depends on the reference time frame under consideration
○ Example: The same fluctuation can be interpreted differently when compared under shorter vs. longer reference time frames
● Depends on externalities such as holidays, product launches, market-specific events, etc.
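The reference-time-frame point can be shown with a small sketch (illustrative only, not Luminaire code): the same observation looks anomalous against a short, locally quiet window, but ordinary against a longer window that contains similar past fluctuations.

```python
import statistics

def zscore(window, value):
    """How far `value` deviates from `window`, in standard deviations."""
    return abs(value - statistics.mean(window)) / statistics.stdev(window)

# Longer history containing occasional spikes near 120
long_window = [100, 101, 99, 120, 100, 98, 102, 119, 100, 101, 99, 100]
# A recent, locally quiet stretch of the same series
short_window = [100, 101, 99, 100, 98, 102]

value = 120
print(zscore(short_window, value))  # large: looks anomalous vs. the short frame
print(zscore(long_window, value))   # small: similar spikes occurred before
```

With a conventional 3-standard-deviation cutoff, the same value of 120 would be flagged under the short frame but not under the long one, which is why the reference window is part of the anomaly definition.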
12. Challenges
● Modeling
○ Wide range of time series patterns from different data sources - one model doesn’t fit all
○ The definition of an anomaly changes at different levels of aggregation of the same data
● Scaling and Standardization
○ Everyone (Analyst, PM, DE) should be able to use ML for anomaly detection and get trustworthy data (but not everyone is an ML expert)
○ Requires scalability to handle large amounts of data across teams
13. Wishlist for the System
● Able to catch any data irregularity
● Scales to large amounts of data and metrics
● Minimal configuration
● Minimal maintenance over time
No existing solution meets all of the above requirements
15. Luminaire Python Package
Key Features
● Integrated with different models
● AutoML built-in
● Proven to outperform many existing methods
● Time series data profiling enabled
● Built for batch and streaming use cases
Github: https://github.com/zillow/luminaire
Tutorials: https://zillow.github.io/luminaire/
Scientific Paper (IEEE BigData 2020): Building an Automated and Self-Aware Anomaly Detection System (arxiv link)
30. Our Integrations with Central Data Systems
● Self-service UI for easier on-boarding
● Surfacing health metrics of the data source with central data catalog
● Tagging producers and consumers of the anomaly detection jobs
● Smart Alerting based on scoring output sensitivity
31. Future Direction
● Support anomaly detection beyond temporal context
● Build decision systems for ML pipelines using Luminaire
● Root Cause Analysis to go a step beyond detection to diagnosis
● User feedback to get labeled anomalies
33. Key Takeaways
● Luminaire is a Python library that supports anomaly detection for a wide variety of time series patterns and use cases
● We proposed a technique for building a fully automated anomaly detection system that scales to big data use cases and requires minimal maintenance