
Data Engineering Efficiency @ Netflix - Strata 2017


Slides from Strata 2017 talk, "Data Engineering Efficiency @ Netflix."

Michelle Ufford explains how Netflix’s data engineering and analytics team is using data to find common patterns among the chaos that enable the company to automate repetitive and time-consuming tasks and discover ways to improve data quality, reduce costs, and quickly identify and respond to issues. Michelle provides a quick overview of Netflix’s analytics environment before diving into some of the major challenges facing the company’s data engineers. Along the way, Michelle shares how Netflix is building more intelligent data platform services and tools to improve data quality, automate data maintenance, alert on job optimization opportunities, and more.

Published in: Data & Analytics

Data Engineering Efficiency @ Netflix - Strata 2017

  1. Working smarter, not harder. DATA ENGINEERING EFFICIENCY @ NETFLIX. MICHELLE UFFORD, MANAGER, CORE INNOVATION, DATA ENGINEERING & ANALYTICS. STRATA NYC, FALL 2017
  2. Part one. A brief stroll down memory lane
  3. [Chart, circa 2007] Michelle’s Wildly Subjective & Completely Unscientific Observations of Engineering Efforts over Time. Axes: engineer time across year 1, year 2, year 3. Series: new development, support & maintenance, everything else.
  4. [Same chart, circa 2007, annotated] Michelle’s Wildly Subjective & Completely Unscientific Observations of Engineering Efforts over Time. Annotations: ~10%, ~75%, "when Michelle jumps ship".
  5. Support & Maintenance. ● archiving old data or unused tables ● fixing & reflowing bad data ● documenting lineage & relationships ● troubleshooting failed jobs ● investigating data quality issues ● migrating to newer releases ● optimizing job performance ● etc. etc. etc.
  6. There must be a better way.
  7. Part two. A peek at data engineering at Netflix
  8. [Architecture diagram, labels only] 20170914; data access; Amazon Redshift; data processing; fast storage; data viz; events data; RAW; data storage; DW; RPT; METACAT; data catalog; api; job execution; data ingestion; harlotte
  9. [Organization diagram] Roles: data scientists, business analysts, data engineers, data viz engineers, quantitative analysts, product managers, analytics engineers, software engineers, ML scientists, algorithm engineers, research scientists, executives. Groups: Planning & Analysis; Data Engineering & Analytics; Science & Algorithms; Algorithms Engineering; Business Engineering
  10. 2016 2015 2014 2013 2012
  11. 2017
  12. [Organization diagram, repeated] Roles: data scientists, business analysts, data engineers, data viz engineers, quantitative analysts, product managers, analytics engineers, software engineers, ML scientists, algorithm engineers, research scientists, executives. Groups: Planning & Analysis; Data Engineering & Analytics; Science & Algorithms; Algorithms Engineering; Business Engineering
  13. How can we do more?
  14. Part three. Driving data engineering efficiency
  15. Simplify & Automate
  16. Simplify & Automate: Data Maintenance. ● data archival ● unused data assets ● table metadata ● data lineage ● etc.
  17. 20170612. Quinto evaluations ● intelligent recommendations ● multiple tiers of coverage ● configurable rules. Jumpstarter. Python Library
  18. Simplify & Automate: Data Quality. ● identify appropriate level of quality coverage for a given table based upon usage data ● provide initial configuration of quality thresholds based upon table behavior patterns ● simplify integration of quality checks into data pipelines ● etc.
  19. 20170612. Metacat Federated Metastore. dw.fact_table_f: s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855 … s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702 s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541
  20. 20170612. Metacat Federated Metastore. dw.fact_table_f: s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855 … s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702 s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541. Partitions: utc_date=20170101 … utc_date=20170611 utc_date=20170612
  21. 20170612. Metacat Federated Metastore. utc_date=20170101
  22. 20170612. Metacat Federated Metastore. utc_date=20170101
  23. 20170612. Data Quality ● intelligent recommendations ● multiple tiers of coverage ● configurable rules. Jumpstarter. Python Library
  24. ETL Pattern. WAP Stage-0: Prep. 20170612. Tables: dw.my_table_f, audit.my_table_f_1497312000. s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702
  25. ETL Pattern. WAP Stage-1: Write. 20170612. Tables: dw.my_table_f, audit.my_table_f_1497312000. s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 s3://…/utc_date=20170612/batchid=1497312541
  26. ETL Pattern. WAP Stage-2: Audit. 20170612. Tables: dw.my_table_f, audit.my_table_f_1497312000. s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 s3://…/utc_date=20170612/batchid=1497312541
  27. ETL Pattern. WAP Stage-3: Publish. 20170612. Tables: dw.my_table_f, audit.my_table_f_1497312000. s3://…/utc_date=20170101/batchid=1483229855 … s3://…/utc_date=20170611/batchid=1497226702 s3://…/utc_date=20170612/batchid=1497312541
  28. Simplify & Automate: Data Insight. ● provide easy visibility into current state & changes over time ● provide prescriptive guidance on impactful optimization opportunities ● notify users of unexpected conditions which may indicate problems ● etc.
  29. Data Engineering @ Netflix. Support & maintenance: 35%. New development & functionality: 45%
  30. Good. But we can do better.
  31. Part four. A sneak peek at what we’re working on now
  32. [Chart, circa 2017] Michelle’s Wildly Subjective & Currently Unproven Theory of the Impact of ‘Smarter’ Solutions. Axes: engineer time across year 1, year 2, year 3. Series: new development, support & maintenance, everything else. Annotations: ~20% ???, ~60% ???
  33. Faster & Smarter: Data Maintenance. ● multi-node object deprecation ● field-level deprecation ● beyond pattern matching ● etc.
  34. Faster & Smarter: Data Quality. ● additional Metacat statistics ● robust anomaly detection ● aggressively experiment with configurations ● etc.
  35. Faster & Smarter: Data Insight. ● … next year’s Strata talk?
  36. Thank you! MICHELLE UFFORD: mufford@netflix.com, twitter.com/MichelleUfford. NETFLIX DATA: techblog.netflix.com, medium.com/netflix-techblog, twitter.com/NetflixData, tinyurl.com/NetflixData. WE’RE HIRING! jobs.netflix.com
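
The WAP ETL pattern on slides 24 to 27 (Prep, Write, Audit, Publish) can be sketched roughly as follows. This is a minimal illustration and not Netflix's implementation: it assumes WAP stands for write-audit-publish, as the stage names suggest, and it replaces the metastore and S3 with an in-memory dictionary. The names `Catalog`, `write_audit_publish`, `not_empty`, `no_null_ids`, and the `store/` location prefix are all hypothetical, and the audit table suffix reuses the batch id for simplicity, whereas the slides show two different values.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass
class Catalog:
    """Hypothetical in-memory stand-in for a metastore plus object storage."""
    tables: Dict[str, Dict[str, str]] = field(default_factory=dict)  # table -> partition -> location
    storage: Dict[str, Rows] = field(default_factory=dict)           # location -> rows


def not_empty(rows: Rows) -> bool:
    """Example audit check: the batch contains at least one row."""
    return len(rows) > 0


def no_null_ids(rows: Rows) -> bool:
    """Example audit check: every row carries a non-null 'id' (assumed column)."""
    return all(row.get("id") is not None for row in rows)


def write_audit_publish(
    catalog: Catalog,
    table: str,
    partition: str,
    batch_id: int,
    rows: Rows,
    checks: List[Callable[[Rows], bool]],
) -> List[str]:
    """Run the four stages; return the names of failed checks (empty on success)."""
    # Stage 0 (Prep): create an audit table alongside the production table.
    audit_table = f"audit.{table.split('.', 1)[1]}_{batch_id}"
    catalog.tables[audit_table] = {}

    # Stage 1 (Write): write the new batch and register it on the audit table only.
    location = f"store/{table}/{partition}/batchid={batch_id}"
    catalog.storage[location] = rows
    catalog.tables[audit_table][partition] = location

    # Stage 2 (Audit): run quality checks against the data the audit table points at.
    audited_rows = catalog.storage[catalog.tables[audit_table][partition]]
    failures = [check.__name__ for check in checks if not check(audited_rows)]
    if failures:
        return failures  # the production table is left untouched

    # Stage 3 (Publish): point the production partition at the audited location.
    catalog.tables.setdefault(table, {})[partition] = location
    return []
```

Calling `write_audit_publish(Catalog(), "dw.my_table_f", "utc_date=20170612", 1497312541, rows, [not_empty, no_null_ids])` publishes the partition only when both checks pass. The design point this sketch tries to capture is that consumers of the production table never see a batch that has not been audited, because publishing is a pointer change rather than a data copy.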
