Scaling Data Quality @ Netflix

1. Whoops, the Numbers Are Wrong! Scaling Data Quality @ Netflix. Michelle Ufford, Data Engineering & Analytics, Netflix. Hadoop Summit 2017.

2. Overview.

3. The business. 20170612 ● launched in 1997 ● 100+ million members ● 125+ million hours watched every. day. ● $6 billion on content

4. Anytime. Anywhere.* (* Well, almost anywhere.)

5. Any device.

6. The data. ● 60+ petabyte data warehouse ● 700+ billion events written ● 300 terabyte DW writes ● 5 petabyte DW reads

8. Big Data Platform (architecture diagram). Layer labels: events data, operational data, elastic storage, fast storage, data processing, data services, data access, data viz. Named components: AWS S3, Amazon Redshift, Apache Pig, Metacat.

9. Data Quality.

11. Federated metastore & extensible data catalog

12. Metacat Federated Metastore. Table dw.fact_table_f, with S3 locations:
    s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855
    …
    s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702
    s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541

13. Metacat Federated Metastore. Table dw.fact_table_f, with S3 locations:
    s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855
    …
    s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702
    s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541
    Partitions: utc_date=20170101, …, utc_date=20170611, utc_date=20170612

14. Metacat Federated Metastore. Partition: utc_date=20170101

16. Metacat Federated Metastore. Extended table attributes: ● primary key(s) ● column types ● lifecycle ● audience ● “valid-thru” timestamp ● … and much more

17. Data Quality Service.

18. Quinto Data Quality Service

22. Write-Audit-Publish (WAP): ETL pattern for high-quality big data jobs

23. WAP ETL Pattern. Table: dw.my_table_f. S3 locations:
    s3://…/utc_date=20170101/batchid=1483229855
    …
    s3://…/utc_date=20170611/batchid=1497226702

24. WAP ETL Pattern, Stage-0: Prep. Tables: dw.my_table_f, audit.my_table_f_1497312000. S3 locations:
    s3://…/utc_date=20170101/batchid=1483229855
    …
    s3://…/utc_date=20170611/batchid=1497226702

25. WAP ETL Pattern, Stage-1: Write. Tables: audit.my_table_f_1497312000, dw.my_table_f. S3 locations:
    s3://…/utc_date=20170101/batchid=1483229855
    …
    s3://…/utc_date=20170611/batchid=1497226702
    s3://…/utc_date=20170612/batchid=1497312541

26. WAP ETL Pattern, Stage-1: Write. Tables: audit.my_table_f_1497312000, $TABLE, dw.my_table_f. S3 locations:
    s3://…/utc_date=20170101/batchid=1483229855
    …
    s3://…/utc_date=20170611/batchid=1497226702
    Metrics for utc_date=20170612:
    com.netflix.dse.mds.metric.RowCount: 17240
    com.netflix.dse.mds.metric.NullCount: 17240
    ...
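
The partition metrics shown on the Stage-1 slide could be collected roughly as follows. This is a minimal sketch, not the Metacat or Quinto implementation: `collect_metrics`, the row layout, and the column names are hypothetical, and it assumes NullCount counts rows whose value in one chosen column is null.

```python
def collect_metrics(rows, column):
    """Return partition-level metrics for a freshly written batch.

    Assumption: NullCount means "rows where `column` is None".
    """
    return {
        "RowCount": len(rows),
        "NullCount": sum(1 for row in rows if row.get(column) is None),
    }


# Hypothetical batch in which every value of `country` is missing,
# mirroring the slide where NullCount equals RowCount.
rows = [
    {"member_id": 1, "country": None},
    {"member_id": 2, "country": None},
    {"member_id": 3, "country": None},
]
metrics = collect_metrics(rows, "country")
```

A NullCount equal to the RowCount, as on the slide, is the kind of signal the audit stage is meant to catch before the data is published.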

27. WAP ETL Pattern, Stage-2: Audit. Tables: audit.my_table_f_1497312000, dw.my_table_f. Metrics passed to Quinto:
    utc_date=20170611  com.netflix.dse.mds.metric.RowCount: 16135  com.netflix.dse.mds.metric.NullCount: 21 ...
    utc_date=20170612  com.netflix.dse.mds.metric.RowCount: 17240  com.netflix.dse.mds.metric.NullCount: 17240 ...

28. WAP ETL Pattern, Stage-2: Audit. Tables: audit.my_table_f_1497312000, dw.my_table_f. Metrics passed to Quinto:
    utc_date=20170611  com.netflix.dse.mds.metric.RowCount: 16135  com.netflix.dse.mds.metric.NullCount: 21 ...
    utc_date=20170612  com.netflix.dse.mds.metric.RowCount: 17240  com.netflix.dse.mds.metric.NullCount: 17240 ...
    Quinto configuration:
    metric     eval            behavior  result
    --------------------------------------------------
    RowCount   >= zero         fail job
    RowCount   >= prior value  fail job
    NullCount  normal dist     warn job

31. WAP ETL Pattern, Stage-2: Audit. Tables: audit.my_table_f_1497312000, dw.my_table_f. Metrics passed to Quinto:
    utc_date=20170611  com.netflix.dse.mds.metric.RowCount: 16135  com.netflix.dse.mds.metric.NullCount: 21 ...
    utc_date=20170612  com.netflix.dse.mds.metric.RowCount: 17240  com.netflix.dse.mds.metric.NullCount: 17240 ...
    Quinto configuration:
    metric     eval            behavior  result
    --------------------------------------------------
    RowCount   >= zero         fail job  pass
    RowCount   >= prior value  fail job
    NullCount  normal dist     warn job

34. WAP ETL Pattern, Stage-2: Audit. Tables: audit.my_table_f_1497312000, dw.my_table_f. Metrics passed to Quinto:
    utc_date=20170611  com.netflix.dse.mds.metric.RowCount: 16135  com.netflix.dse.mds.metric.NullCount: 21 ...
    utc_date=20170612  com.netflix.dse.mds.metric.RowCount: 17240  com.netflix.dse.mds.metric.NullCount: 17240 ...
    Quinto configuration:
    metric     eval            behavior  result
    --------------------------------------------------
    RowCount   >= zero         fail job  pass
    RowCount   >= prior value  fail job  pass
    NullCount  normal dist     warn job

37. WAP ETL Pattern, Stage-2: Audit. Tables: audit.my_table_f_1497312000, dw.my_table_f. Metrics passed to Quinto:
    utc_date=20170611  com.netflix.dse.mds.metric.RowCount: 16135  com.netflix.dse.mds.metric.NullCount: 21 ...
    utc_date=20170612  com.netflix.dse.mds.metric.RowCount: 17240  com.netflix.dse.mds.metric.NullCount: 17240 ...
    Quinto configuration:
    metric     eval            behavior  result
    --------------------------------------------------
    RowCount   >= zero         fail job  pass
    RowCount   >= prior value  fail job  pass
    NullCount  normal dist     warn job  fail
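
The three audit rules on the slide above can be sketched as plain Python. This is an illustrative sketch under stated assumptions, not Quinto's API: the function names, the `RULES` layout, and the three-sigma threshold for "normal dist" are all hypothetical. Only the latest historical values (16135 and 21) and the current values (17240 and 17240) come from the slides; the earlier history points are made up so that a standard deviation can be computed.

```python
from statistics import mean, stdev


def non_negative(current, history):
    return current >= 0


def not_below_prior(current, history):
    # Passes trivially when there is no prior value to compare against.
    return not history or current >= history[-1]


def within_normal_range(current, history, max_sigma=3.0):
    # Assumption: "normal dist" means within max_sigma standard deviations
    # of the historical mean. Needs at least two points to estimate spread.
    if len(history) < 2:
        return True
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current == mu
    return abs(current - mu) <= max_sigma * sigma


# (metric, eval label, check, behavior on failure)
RULES = [
    ("RowCount", ">= zero", non_negative, "fail job"),
    ("RowCount", ">= prior value", not_below_prior, "fail job"),
    ("NullCount", "normal dist", within_normal_range, "warn job"),
]


def audit(current, history):
    results = []
    for metric, label, check, behavior in RULES:
        passed = check(current[metric], history.get(metric, []))
        results.append((metric, label, behavior, "pass" if passed else "fail"))
    return results


def job_should_fail(results):
    # Only rules configured as "fail job" block the pipeline.
    return any(b == "fail job" and r == "fail" for _, _, b, r in results)


current = {"RowCount": 17240, "NullCount": 17240}
# Last value of each list is from the slides; earlier values are hypothetical.
history = {"RowCount": [15980, 16020, 16135], "NullCount": [19, 24, 21]}
results = audit(current, history)
```

With these inputs the two RowCount rules pass and the NullCount rule fails, matching the result column on the slide. Because that rule is configured as "warn job", `job_should_fail(results)` is false in this sketch; what a warning does downstream is not specified in the deck.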

39. WAP ETL Pattern, Stage-3: Publish. Tables: audit.my_table_f_1497312000, dw.my_table_f. S3 locations:
    s3://…/utc_date=20170101/batchid=1483229855
    …
    s3://…/utc_date=20170611/batchid=1497226702
    s3://…/utc_date=20170612/batchid=1497312541

41. WAP ETL Pattern, Stage-3: Publish. Table: dw.my_table_f, with valid_thru_ts = 20170613 00:00:00. S3 locations:
    s3://…/utc_date=20170101/batchid=1483229855
    …
    s3://…/utc_date=20170611/batchid=1497226702
    s3://…/utc_date=20170612/batchid=1497312541
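
The four stages walked through above (Prep, Write, Audit, Publish) can be sketched end to end with an in-memory catalog. This is a toy model under stated assumptions, not the Netflix WAP library: `Table`, `AuditFailed`, `write_audit_publish`, `make_job`, and the `non_empty` check are hypothetical names, and "publish" is modeled as pointing the production table's partition at the already-written location. The table names, partition key, location string, and valid_thru_ts value are copied from the slides.

```python
from dataclasses import dataclass, field


class AuditFailed(Exception):
    """Raised when an audit check fails; nothing is published."""


@dataclass
class Table:
    partitions: dict = field(default_factory=dict)  # partition key -> location
    attributes: dict = field(default_factory=dict)


def write_audit_publish(catalog, prod, audit, partition, job, checks, valid_thru_ts):
    # Stage-0: Prep. Create an empty audit table alongside the production table.
    catalog[audit] = Table()

    # Stage-1: Write. The job is parameterized by its destination table,
    # so it writes to the audit table instead of production.
    location, rows = job(audit)
    catalog[audit].partitions[partition] = location

    # Stage-2: Audit. Stop here if any check fails; production is untouched.
    failed = [name for name, check in checks if not check(rows)]
    if failed:
        raise AuditFailed(f"{audit} failed: {', '.join(failed)}")

    # Stage-3: Publish. Point the production partition at the audited data.
    catalog[prod].partitions[partition] = location
    catalog[prod].attributes["valid_thru_ts"] = valid_thru_ts


def make_job(rows):
    def job(destination_table):
        # Hypothetical stand-in for a real job that writes rows to storage.
        return "s3://…/utc_date=20170612/batchid=1497312541", rows
    return job


PROD, AUDIT = "dw.my_table_f", "audit.my_table_f_1497312000"
checks = [("non_empty", lambda rows: len(rows) > 0)]
catalog = {PROD: Table()}

# A bad batch is written and audited but never reaches production.
try:
    write_audit_publish(catalog, PROD, AUDIT, "utc_date=20170612",
                        make_job([]), checks, "20170613 00:00:00")
    blocked = False
except AuditFailed:
    blocked = True
partitions_after_failure = dict(catalog[PROD].partitions)

# A good batch passes the audit and is published.
write_audit_publish(catalog, PROD, AUDIT, "utc_date=20170612",
                    make_job([{"member_id": 1}]), checks, "20170613 00:00:00")
```

The point of the ordering is that consumers of dw.my_table_f never see a partition that has not passed its audit; the only requirement on the job itself is that its destination table is a parameter.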

43. Jumpstarter. Python Library. Quinto evaluations: ● intelligent recommendations ● multiple tiers of coverage ● configurable rules

44. WAP. Python Library. Minimal requirements: ● parameterized destination table

45. WAP. Python Library. Running WAP.

47. What’s Next. ● additional Metacat statistics ● robust anomaly detection (RAD) ● complete migration for all prod tables

48. Tips & Lessons Learned. ● Query-based solution may be “good enough” for many. ● Not all tables need quality coverage. ● One size rarely fits all tables. ● Build components, not “all-or-nothing” frameworks.
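
As an illustration of the first tip, a query-based check can be as small as one aggregate query per table. The sketch below uses SQLite from the Python standard library purely so it is self-contained; the table layout, column names, and the "all values null" rule are hypothetical and not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table_f (utc_date INTEGER, member_id INTEGER, country TEXT)"
)
conn.executemany(
    "INSERT INTO my_table_f VALUES (?, ?, ?)",
    [
        (20170611, 1, "US"),
        (20170611, 2, "BR"),
        (20170611, 3, None),
        (20170612, 4, None),
        (20170612, 5, None),
    ],
)

# One pass over the table yields row and null counts per partition.
QUERY = """
SELECT utc_date,
       COUNT(*) AS row_count,
       SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS null_count
FROM my_table_f
GROUP BY utc_date
ORDER BY utc_date
"""
stats = conn.execute(QUERY).fetchall()

# Flag partitions where every value of the column is missing.
suspect = [d for d, row_count, null_count in stats if null_count == row_count]
```

The trade-off is that each check rescans the data, which is why collecting statistics at write time may scale better on very large tables.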

49. Thank you! Michelle Ufford: mufford@netflix.com, twitter.com/MichelleUfford. Data: techblog.netflix.com, medium.com/netflix-techblog, twitter.com/NetflixData, tinyurl.com/NetflixData. We’re hiring! jobs.netflix.com
