Luminaire is an open-source Python library developed by Zillow to provide scalable and automated time series anomaly detection. It uses AutoML techniques to optimize model selection and configuration with minimal input. Luminaire profiles and preprocesses time series data, trains both batch and streaming models, scores anomalies, and supports distributed training and scoring on large datasets using Spark. Zillow uses Luminaire to monitor data quality across many metrics and data sources that power its products and services.
2. Who We Are
Data Governance Platform Team
@ Zillow
Sayan Chakraborty
Senior Applied Scientist
Smit Shah
Senior Software Development Engineer, Big Data
3. Agenda
● What is Zillow?
● Why Monitor Data Quality
● Data Quality Challenges
● Luminaire and Scaling
● Key Takeaways
5. About Zillow
● Reimagining real estate to make it easier to unlock life’s next chapter
● Offer customers an on-demand experience for selling, buying, renting, and financing with transparency and nearly seamless end-to-end service
● Most-visited real estate website in the United States*
* As of Q4-2020
7. Why Monitor Data Quality?
● Data fuels many customer-facing and internal services at Zillow that depend on high-quality inputs
○ Zestimate
○ Zillow Offers
○ Zillow Premier Agent
○ Econ and many more
● Reliable performance of ML models and services requires a certain level of data quality
8. Why Detect Anomalies?
Anomaly: a data instance or behavior significantly different from the ‘regular’ patterns
● Complex
● Time-sensitive
● Inevitable
Catching anomalies in important metrics helps keep our business healthy
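To make the definition concrete, here is a minimal sketch (illustrative only, not Luminaire's API): a point can be flagged as anomalous when it deviates far from the 'regular' pattern of recent history, e.g., by more than a few standard deviations. The function name and threshold are hypothetical.

```python
import statistics

def is_anomaly(history, value, threshold=3.0):
    """Flag `value` as anomalous if it lies more than `threshold`
    standard deviations away from the mean of `history`."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

# A metric that normally hovers around 100
history = [98, 101, 99, 100, 102, 97, 103, 100, 99, 101]
print(is_anomaly(history, 100))  # a regular observation
print(is_anomaly(history, 250))  # a large, unexpected spike
```

Real systems such as Luminaire replace this fixed rule with models that account for trend, seasonality, and changing variance.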
9. Ways to Monitor Data Quality
Rule Based
● Domain experts set pre-specified rules or thresholds
○ Example: Percent of null data should be less than 2% per day for a given metric
● Less complicated to set up and easy to interpret
● Works well when the properties of the data are simple and remain stationary over time
ML Based
● Rules are set through mathematical modeling
● Works well when the properties of the data are complex and change over time
● A more hands-off approach
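The rule-based approach above amounts to a simple threshold check. As a sketch (a hypothetical helper, not part of any library), the daily null-rate rule from the example could look like:

```python
def passes_null_rate_rule(values, max_null_pct=2.0):
    """Rule-based check: the percent of null (None) records for the
    day must stay below a pre-specified threshold for the metric."""
    if not values:
        return False  # no data at all is itself a quality failure
    null_pct = 100.0 * sum(v is None for v in values) / len(values)
    return null_pct < max_null_pct

# One day of a metric: 1 null out of 100 records (1% nulls) -> passes
day = [None] + [1.0] * 99
print(passes_null_rate_rule(day))
```

The appeal is interpretability; the drawback is that the 2% threshold must be chosen and maintained by a domain expert, and it breaks down once the data's properties drift over time.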
11. Data Quality is Context Dependent
● Depends on the use case
● Depends on the reference time frame under consideration
○ Example: The same fluctuation can be interpreted differently when compared under shorter vs. longer reference time frames
● Depends on externalities such as holidays, product launches, market-specific events, etc.
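The reference-time-frame point can be shown with a small sketch (illustrative only, not Luminaire code): the same observation looks anomalous against a short, locally quiet window, but ordinary against a longer window that contains similar past fluctuations.

```python
import statistics

def zscore(window, value):
    """How far `value` deviates from `window`, in standard deviations."""
    return abs(value - statistics.mean(window)) / statistics.stdev(window)

# Longer history containing occasional spikes near 120
long_window = [100, 101, 99, 120, 100, 98, 102, 119, 100, 101, 99, 100]
# A recent, locally quiet stretch of the same series
short_window = [100, 101, 99, 100, 98, 102]

value = 120
print(zscore(short_window, value))  # large: looks anomalous vs. the short frame
print(zscore(long_window, value))   # small: similar spikes occurred before
```

With a conventional 3-standard-deviation cutoff, the same value of 120 would be flagged under the short frame but not under the long one, which is why the reference window is part of the anomaly definition.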
12. Challenges
● Modeling
○ Wide range of time series patterns from different data sources - one model doesn’t fit all
○ The definition of an anomaly changes at different levels of aggregation of the same data
● Scaling and Standardization
○ Everyone (Analyst, PM, DE) should be able to use ML for anomaly detection and get trustworthy data (but not everyone is an ML expert)
○ Requires scalability to handle large amounts of data across teams
13. Wishlist for the System
● Able to catch any data irregularity
● Scales to large amounts of data and metrics
● Minimal configuration
● Minimal maintenance over time
No existing solution meets all of the above requirements
15. Luminaire Python Package
Key Features
● Integrated with different models
● AutoML built-in
● Proven to outperform many existing methods
● Time series data profiling enabled
● Built for batch and streaming use cases
Github: https://github.com/zillow/luminaire
Tutorials: https://zillow.github.io/luminaire/
Scientific Paper (IEEE BigData 2020): Building an Automated and Self-Aware Anomaly Detection System (arxiv link)
30. Our Integrations with Central Data Systems
● Self-service UI for easier on-boarding
● Surfacing health metrics of the data source with central data catalog
● Tagging producers and consumers of the anomaly detection jobs
● Smart Alerting based on scoring output sensitivity
31. Future Direction
● Support anomaly detection beyond temporal context
● Build decision systems for ML pipelines using Luminaire
● Root Cause Analysis to go a step beyond detection to diagnosis
● User feedback to get labeled anomalies
33. Key Takeaways
● Luminaire is a Python library that supports anomaly detection for a wide variety of time series patterns and use cases
● We proposed a technique for building a fully automated anomaly detection system that scales to big data use cases and requires minimal maintenance