WHOOPS, THE NUMBERS
ARE WRONG!
SCALING DATA QUALITY
@
MICHELLE UFFORD
DATA ENGINEERING & ANALYTICS, NETFLIX
HADOOP SUMMIT 2017
Overview.
The business.
20170612
100+ million
members
$6 billion
on content
125+ million
hours watched
launched
in 1997
every. day.
$
Anytime. Anywhere.*
20170612
* Well, almost anywhere.
Any device.
20170612
300 terabyte
DW writes
5 petabyte
DW reads
The data.
20170612
60+ petabyte
data warehouse
700+ billion
events written
300 terabyte
DW writes
5 petabyte
DW reads
The data.
20170612
60+ petabyte
data warehouse
700+ billion
events written
20170612
data access
AWS
S3
Amazon
Redshift
data processing
fast storage data viz
METACA
T
data services
events data
operational data
elastic storage
Apache Pig
Big Data Platform
Data Quality.
Federated metastore &
extensible data catalog
20170612
Metacat
Federated Metastore
s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855
…
s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702
s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541
dw.fact_table_f
20170612
Metacat
Federated Metastore
s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855
…
s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702
s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541
dw.fact_table_f
utc_date=20170101
utc_date=20170611
utc_date=20170612
…
20170612
Metacat
Federated Metastore
utc_date=20170101
20170612
Metacat
Federated Metastore
utc_date=20170101
20170612
Extended table attributes
● primary key(s)
● column types
● lifecycle
● audience
● “valid-thru” timestamp
● … and much more
Metacat
Federated Metastore
Data Quality Service.
20170612
Quinto
Data Quality Service
20170612
Quinto
Data Quality Service
20170612
Quinto
Data Quality Service
20170612
Quinto
Data Quality Service
Write - Audit - Publish
ETL pattern
for high-quality
big data jobs
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
dw.my_table_f
WAP
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
dw.my_table_f audit.my_table_f_1497312000
WAPStage-0: Prep
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
s3://…/utc_date=20170612/batchid=1497312541
WAPStage-1: Write
audit.my_table_f_1497312000dw.my_table_f
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAPStage-1: Write
audit.my_table_f_1497312000
$TABLE
dw.my_table_f
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_f
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_f
Quint
o
metric eval behavior result
--------------------------------------------------
RowCount >= zero fail job
RowCount >= prior value fail job
NullCount normal dist warn job
Quinto configuration
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_f
Quint
o
metric eval behavior result
--------------------------------------------------
RowCount >= zero fail job
RowCount >= prior value fail job
NullCount normal dist warn job
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_f
Quint
o
metric eval behavior result
--------------------------------------------------
RowCount >= zero fail job
RowCount >= prior value fail job
NullCount normal dist warn job
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job fail
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
…
WAPStage-2: Audit
audit.my_table_f_1497312000dw.my_table_fmetric eval behavior result
--------------------------------------------------
RowCount >= zero fail job pass
RowCount >= prior value fail job pass
NullCount normal dist warn job fail
Quint
o
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
Quinto configuration
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAPStage-3: Publish
audit.my_table_f_1497312000dw.my_table_f
s3://…/utc_date=20170612/batchid=1497312541
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAPStage-3: Publish
audit.my_table_f_1497312000dw.my_table_f
s3://…/utc_date=20170612/batchid=1497312541
ETL Pattern
20170612
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAPStage-3: Publish
s3://…/utc_date=20170612/batchid=1497312541
dw.my_table_f
valid_thru_ts = 20170613 00:00:00
ETL Pattern
20170612
Quinto evaluations
● intelligent recommendations
● multiple tiers of coverage
● configurable rules
Jumpstarter.
Python Library
WAP.
Python Library
Minimal requirements
● parameterized destination table
WAP.
Python Library
Running WAP.
20170612
What’s Next.
● additional Metacat statistics
● robust anomaly detection (RAD)
● complete migration for all prod tables
20170612
Tips & Lessons Learned.
● Query-based solution may be “good enough” for many.
● Not all tables need quality coverage.
● One size rarely fits all tables.
● Build components, not “all-or-nothing” frameworks.
MICHELLE UFFORD
mufford@netflix.com
twitter.com/MichelleUfford
DATA
techblog.netflix.com
medium.com/netflix-techblog
twitter.com/NetflixData
tinyurl.com/NetflixData
Thank you!
WE’RE HIRING! jobs.netflix.com

Scaling Data Quality @ Netflix

Editor's Notes

  • #2 https://dataworkssummit.com/san-jose-2017/sessions/whoops-the-numbers-are-wrong-scaling-data-quality-netflix/ WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ NETFLIX Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies? In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into details of our data quality solution. We’ll cover what worked, what didn’t work so well, and what we plan to work on next. We’ll conclude with some tips and lessons learned for ensuring data quality on big data. DETAILS This session is a (Intermediate) talk in our Data Processing and Warehousing track. It focuses on Apache Hadoop, Apache Hive, Apache Pig, Apache Spark and is geared towards Architect, Data Analyst, Developer / Engineer, Operations / IT audiences.