Rapid Analytics @ Netflix LA
Chris Stephens
Senior Data Engineer
Yes, we’re in LA!
● 500+ employees in Beverly Hills
● growing rapidly
● new 14-story LA headquarters under construction
○ planning to move in early 2017
Freedom & Responsibility
Context, not Control
Highly Aligned, Loosely Coupled Teams
Culture + Technology
Courage
Judgement
Honesty
Communication
Curiosity
Passion
Innovation
Impact
Selflessness
Freedom & Responsibility
What does this mean for our day-to-day?
We let everyone drop tables
in production
Cost / Benefit
Conscientious people make mistakes,
but not very often
Data warehouse is not an operational system
What happens if a table is accidentally dropped?
● Do you have backups?
● How quickly can you restore a table?
Is the benefit worth the tax on every data /
analytical product your team produces?
We have some protection
In Hive, all tables are external tables pointing to S3 locations.
ETL writes a new “batch” of data then updates the metastore.
s3://[bucket]/hive/schema.db/table/batchid=1459364911
ALTER TABLE table SET LOCATION [path to new batch ID];
DROP TABLE does not delete any data.
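
A minimal Python sketch of the batch-swap pattern described above. The function names, the example bucket and table, and the use of epoch seconds as the batch ID are assumptions for illustration; the slide shows only the path layout and the ALTER TABLE statement.

```python
import time


def batch_location(bucket, schema, table, batch_id):
    """Build the S3 prefix for one batch, following the layout on the slide."""
    return f"s3://{bucket}/hive/{schema}.db/{table}/batchid={batch_id}"


def swap_statement(schema, table, location):
    """Build the metastore update that points the table at the new batch."""
    return f"ALTER TABLE {schema}.{table} SET LOCATION '{location}';"


def publish_batch(bucket, schema, table, batch_id=None):
    # Assumption: batch IDs are epoch seconds, as the example ID suggests.
    if batch_id is None:
        batch_id = int(time.time())
    location = batch_location(bucket, schema, table, batch_id)
    # The ETL job would write its files under `location` first and only
    # then run the returned statement, so readers never see a partial batch.
    return location, swap_statement(schema, table, location)


# Hypothetical bucket, schema, and table names.
location, statement = publish_batch("example-bucket", "dw", "plays", 1459364911)
print(statement)
```

Because earlier batch directories stay in S3, pointing the table back at a previous batch should be enough to recover from a bad load.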
In our MPP databases, we have a procedure for upgrading and
downgrading our privileges.
CALL admin.UpgradePrivileges('me')
Lasts for several hours. Usage is logged.
Accidents? Restore from backups. Or reload from Hive.
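
The idea of a time-limited, logged privilege upgrade can be sketched in Python as a toy model. This is not the actual admin.UpgradePrivileges procedure; the class, its method names, and the 4-hour default are assumptions (the slide says only "several hours").

```python
from datetime import datetime, timedelta, timezone


class PrivilegeManager:
    """Toy model of time-limited privilege elevation with an audit log."""

    def __init__(self, ttl_hours=4):
        # Assumption: 4 hours stands in for "several hours".
        self.ttl = timedelta(hours=ttl_hours)
        self.expires = {}    # user -> time at which elevation lapses
        self.audit_log = []  # (timestamp, user, event) tuples

    def upgrade(self, user, now=None):
        """Elevate a user and record the request."""
        now = now or datetime.now(timezone.utc)
        self.expires[user] = now + self.ttl
        self.audit_log.append((now, user, "upgrade"))

    def is_elevated(self, user, now=None):
        """True while the user's elevation has not yet lapsed."""
        now = now or datetime.now(timezone.utc)
        expiry = self.expires.get(user)
        return expiry is not None and now < expiry


manager = PrivilegeManager()
manager.upgrade("me")
print(manager.is_elevated("me"))
```

Elevation lapses on its own, so nobody has to remember to downgrade, and the log records who asked for it and when.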
When other teams are ready to move to production ...
We’re done. And moving on to the next thing.
You can trust your people to work the same way.
We share our code
Netflix believes in open source, both inside and out.
● cross-team access to source code repositories
● bi-weekly dedicated time for innovation
● common identity management and access control for APIs
Netflix’s Big Data Portal
Data engineers & analysts use good judgement to decide
what work will have the most impact for our customers.
Managers provide context and support.
We don’t have an “on call”
(Use a “first responder” instead)
Everyone on the team takes a shift: both BI and data engineers (even
managers every once in a while!)
First Responder = the first one to respond
● handles most common failures (restarting jobs)
● reaches out directly to ETL owner if escalation is required
● handles communication surrounding ETL delays
Goal is to protect the team’s time and focus
How we do this
● visually define what needs attention and what doesn’t
○ “above the line” vs “below the line”
● email alerts for “above the line” jobs that take longer than normal
● playbook for fixing common stuff
○ the more complete your entries are, the less you get called!
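
One way to express the "above the line" alert rule in Python. The Job fields, the use of the median of recent runs as "normal", and the 1.5x slack factor are all assumptions; the slide does not say how "longer than normal" is measured.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Job:
    name: str
    above_the_line: bool       # does a delay in this job need attention?
    runtime_minutes: float     # runtime of the current run so far
    recent_runtimes: list = field(default_factory=list)


def needs_alert(job, slack=1.5):
    """Alert only for above-the-line jobs running well past their usual time."""
    if not job.above_the_line or not job.recent_runtimes:
        return False
    # Assumption: "normal" is the median of recent runs, with some slack.
    return job.runtime_minutes > slack * median(job.recent_runtimes)


def jobs_to_alert(jobs, slack=1.5):
    """Names of the jobs the first responder should be emailed about."""
    return [job.name for job in jobs if needs_alert(job, slack)]


# Hypothetical jobs for illustration.
jobs = [
    Job("daily_plays", True, 95, [40, 42, 45]),
    Job("adhoc_export", False, 300, [30, 30, 30]),
    Job("daily_signups", True, 44, [40, 42, 45]),
]
print(jobs_to_alert(jobs))
```

Below-the-line jobs never alert, however slow they are, which keeps the first responder's inbox limited to what is actually urgent.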
Have a very clear sense of what is urgent, and what isn’t
Treating every failure like it’s urgent bleeds your team of the time they
need to do work
Build your processes so they can be ignored for 3 days
● don’t load data if it’s incomplete
● reprocess fact data for several days instead of picking up only the latest
Gives you the freedom to judge whether a failure is worth an interruption
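
A small Python sketch of the two rules above: rebuild a trailing window of days on every run, and skip any day whose source data is incomplete. The function names, the 3-day lookback, and the is_complete callback are assumptions for illustration.

```python
from datetime import date, timedelta


def reprocess_window(run_date, lookback_days=3):
    """Dates to rebuild on this run, oldest first, ending at run_date."""
    return [run_date - timedelta(days=n)
            for n in range(lookback_days - 1, -1, -1)]


def plan_load(run_date, is_complete, lookback_days=3):
    """Dates that are safe to load: in the window and with complete data."""
    return [day for day in reprocess_window(run_date, lookback_days)
            if is_complete(day)]


# Hypothetical completeness check: data for July 9 has not fully arrived.
def is_complete(day):
    return day != date(2016, 7, 9)


print(plan_load(date(2016, 7, 9), is_complete))
```

Because the windows overlap, a day skipped today is picked up by a later run once its data is complete, so a failed or skipped run heals itself without anyone being paged.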
Our Engineers use what they need
Netflix doesn’t have a CTO, and doesn’t have
architects, technical fellows, or technocrats.
Instead, Netflix has amazing engineers.
Data team wants to store real-time aggregations of
billions of records available for point queries.
Example #1
Data team does most processing on a table in batch,
but needs real-time lookups and updates in some cases.
Example #2
Our data platform team:
● gives us access to, or builds, tools that let us do it ourselves
● holds regular “office hours” we can use if we need help
We aren’t defined by our roles
BI engineer needs data structured a certain way for a report
Many environments:
● Ask a data engineer to build them a table
Our environment:
● Let the BI engineer schedule a Hive script and adjust as necessary
We focus on centers of excellence, not role boundaries
More Examples:
● our BI engineers use Python to automate tasks
● our data engineers have Tableau licenses, and use them for quick
visualizations and report deployments
For small tasks, this helps us avoid the overhead of interruption and
knowledge transfer
Questions?

Big Data Day LA 2016 / Big Data Track - Rapid Analytics @ Netflix LA (Updated and Expanded), Chris Stephens, Senior Data Engineer, Netflix
