
Distributed Time Travel for Feature Generation at Netflix


Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create features is critical for machine learning projects to succeed. At Netflix, we spend significant time and effort experimenting with new features and new ways of building models. This involves generating features for our members from different regions over multiple days. To enable this, we built a time machine using Apache Spark that computes features for any arbitrary time in the recent past. The first step in building this time machine is snapshotting the data from various microservices on a regular basis. We built a general-purpose workflow orchestration and scheduling framework optimized for machine learning pipelines and used it to run the snapshot and model training workflows. Snapshot data is then consumed by feature encoders to compute various features for offline experimentation and model training. Crucially, the same feature encoders are used in both offline model building and online scoring for production or A/B tests. Building this time machine helped us try new ideas quickly without placing stress on production services and without having to wait for data to accumulate for newly implemented features. Moreover, building it with Apache Spark empowered us both to scale up the data size by an order of magnitude and to train and validate models in less time. Finally, using the Apache Zeppelin notebook, we are able to interactively prototype features and run experiments.
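As a rough, hypothetical sketch of the flow described above (the S3 path, column names, and the `OfflineFeatureGeneration` object are all invented; the actual Netflix APIs are not public), offline feature generation amounts to reading the snapshot taken at a destination date and deriving features from it with the same logic the online system uses:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch: "time travel" by reading the service state snapshotted
// on `date`, then derive features from it offline. Paths/columns are invented.
object OfflineFeatureGeneration {
  def run(spark: SparkSession, date: String): DataFrame = {
    val snapshot = spark.read.parquet(s"s3://snapshots/dt=$date")
    // The same feature logic is shared with online scoring (see later slides).
    snapshot.selectExpr("memberId", "size(viewingHistory) AS historyLength")
  }
}
```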


Speaker Bio: DB Tsai

DB Tsai is an Apache Spark committer and a Senior Research Engineer working on personalized recommendation algorithms at Netflix. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he implemented several algorithms, including linear regression and binary/multinomial logistic regression with Elastic-Net (L1/L2) regularization, using L-BFGS/OWL-QN optimizers in Apache Spark. DB was a Ph.D. candidate in Applied Physics at Stanford University and holds a Master's degree in Electrical Engineering from Stanford University.


  1. Distributed Time Travel for Feature Generation. DB Tsai, March 24, 2016, SF Big Analytics Meetup
  2. Who am I?
     • I am a Senior Research Engineer at Netflix
     • I am an Apache Spark committer
  3. Turn on Netflix, and the absolute best content for you would automatically start playing
  4. Data Driven
     • Try an idea offline using historical data to see if it would have made better recommendations
     • If it did, deploy a live A/B test to see if it performs well in production
  5. Quickly try ideas on historical data and transition to an online A/B test
  6. Feature Engineering (image slide: pictures combined with "+" and "≠")
  7. Why build a Time Machine?
  8. Label Leaks
     • Recommendation is an analytic learning process: learn from the past to predict the future.
     • P(play_t | features_t') where t' < t
     • This has to be done very carefully; otherwise the features can contain the labels, so that offline metrics look very good but online evaluation does not perform as expected.
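Guarding against label leaks comes down to enforcing the t' < t constraint when assembling features. A minimal sketch (the `Event` type is a toy, invented for illustration):

```scala
import java.time.Instant

// Toy event type; only events strictly before the label time t may become
// features, i.e. P(play_t | features_t') with t' < t.
case class Event(ts: Instant, videoId: String)

def featureWindow(events: Seq[Event], labelTime: Instant): Seq[Event] =
  events.filter(_.ts.isBefore(labelTime)) // strict: drop events at or after t
```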
  9. The Past
     • Generate features based on event data logged in Hive
       • Need to reimplement features for the online A/B test
       • Data discrepancies between offline and online sources
     • Log features online where the model will be used
       • Need to deploy each idea into production
     • Feature generation calls online services and filters data past a certain time
       • Works only when a service records a log of historical events
       • Additional load on online services
  10. DeLorean image by JMortonPhoto.com & OtoGodfrey.com
  11. Time Travel using Snapshots
      • Snapshot online services and use the snapshot data offline to generate features
      • Share facts and features between experiments without calling live systems
  12. How to build a Time Machine
  13. Three components: Context Selection, Data Snapshots, APIs for Time Travel
  14. Context Selection (diagram): runs once a day; stratified sampling over data in Hive produces a context set stored in S3, with contexts tagged with metadata
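In Spark, the stratified sampling step could be expressed with `sampleByKey`, as in this hedged sketch (the strata keys, fractions, and `Context` fields are invented; `sampleByKey` requires a fraction for every key present in the RDD):

```scala
import org.apache.spark.rdd.RDD

case class Context(memberId: Long, country: String, device: String)

// Stratified sampling of contexts keyed by region: keep a larger fraction of
// smaller strata so every region is represented in the context set.
def selectContexts(byRegion: RDD[(String, Context)]): RDD[(String, Context)] = {
  val fractions = Map("US" -> 0.01, "BR" -> 0.05, "JP" -> 0.05)
  byRegion.sampleByKey(withReplacement = false, fractions = fractions)
}
```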
  15. Data Snapshots (diagram): runs once a day; reads the context set from S3; Prana (Netflix libraries) fetches data from the Viewing History, MyList, and Ratings services; the snapshot data for each context is serialized with Thrift and stored as Parquet in S3
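A hedged sketch of that step, with invented client traits standing in for the Prana-fronted services, and plain Parquet in place of the Thrift serialization described on the slide:

```scala
import org.apache.spark.sql.SparkSession

// Invented stand-ins for the Prana-fronted microservices.
trait ServiceClients extends Serializable {
  def viewingHistory(memberId: Long): Seq[String]
  def myList(memberId: Long): Seq[String]
  def ratings(memberId: Long): Map[String, Int]
}

case class Snapshot(memberId: Long, viewingHistory: Seq[String],
                    myList: Seq[String], ratings: Map[String, Int])

def snapshotContexts(spark: SparkSession, memberIds: Seq[Long],
                     clients: ServiceClients, date: String): Unit = {
  import spark.implicits._
  // One snapshot object per selected context (path invented).
  val snapshots = memberIds.map { id =>
    Snapshot(id, clients.viewingHistory(id), clients.myList(id), clients.ratings(id))
  }
  snapshots.toDS().write.mode("overwrite").parquet(s"s3://snapshots/dt=$date")
}
```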
  16. APIs for Time Travel
  17. Data Architecture (diagram): context selection runs once a day, applying stratified sampling to Hive data to produce a context set tagged with metadata in S3; data snapshots also run once a day, with Prana (Netflix libraries) pulling from the Viewing History, MyList, and Ratings services and writing Thrift-serialized snapshots to S3; batch APIs then expose the snapshots as RDDs of snapshot objects
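The batch APIs presumably behave something like the sketch below: given a destination date, hand back that day's snapshots as an RDD so experiments never touch live services (the object name and path are invented; `Snapshot` is the case class sketched above):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Invented signature for a time-travel batch API: read snapshot objects for
// one day from S3 rather than calling any live service.
object TimeTravelApi {
  def snapshots(spark: SparkSession, date: String): RDD[Snapshot] = {
    import spark.implicits._
    spark.read.parquet(s"s3://snapshots/dt=$date").as[Snapshot].rdd
  }
}
```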
  18. Generating Features via Time Travel
  19. Great Scott! DeLorean: a time-traveling vehicle that
      • uses data snapshots to travel in time
      • scales with Apache Spark
      • prototypes new ideas with Zeppelin
      • requires minimal code changes from experimentation to A/B test to production
      (Image: "There's the DeLorean!"; Emmett Brown, https://en.wikipedia.org/wiki/Emmett_Brown)
  20. Running Time Travel Experiment (image slide): select the destination time, then bring it up to 88 miles per hour!
  21. Running Time Travel Experiment (flow diagram): design experiment → select contexts → collect label dataset → DeLorean offline feature generation → distributed model training (parallel training of individual models using different executors) → compute validation metrics → model testing → choose the best model; good metrics lead to online A/B testing, bad metrics lead back to designing a new experiment to test out different ideas
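The "parallel training of individual models using different executors" box can be read as grid-style training: one Spark task per hyperparameter setting, each fitting one model against a broadcast copy of the training set. A hedged sketch (all types and the pluggable `fit` function are invented):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

case class Example(label: Double, features: Array[Double])
case class HyperParams(regularization: Double, stepSize: Double)

// One executor task per hyperparameter setting; each task trains one model and
// reports a validation metric. Assumes the training set fits in a broadcast.
def trainGrid(sc: SparkContext,
              grid: Seq[HyperParams],
              train: Broadcast[Seq[Example]],
              fit: (Seq[Example], HyperParams) => Double): Seq[(HyperParams, Double)] =
  sc.parallelize(grid, numSlices = grid.size)
    .map(hp => (hp, fit(train.value, hp)))
    .collect()
    .toSeq
```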
  22. DeLorean Input Data
      • Contexts: the setting for evaluating a set of items (e.g. tuples of member profile, country, time, device, etc.)
      • Items: the elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities)
      • Labels: for supervised learning, the label (target) for each item
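Typed out, the three inputs might look like this minimal sketch (all names invented):

```scala
// Invented types mirroring DeLorean's three inputs.
case class Context(memberId: Long, country: String, time: Long, device: String)
case class Item(videoId: String)
case class LabeledItem(item: Item, label: Double)              // target per item
case class TrainingRow(context: Context, items: Seq[LabeledItem])
```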
  23. Feature Encoders
      • Compute features for each item in a given context
      • Each type of raw data element has its own data key
      • A data map is a map from data keys to data objects in a given context
      • The data map is consumed by feature encoders to compute features
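The encoder contract this slide describes might be sketched as the trait below: each encoder declares the data keys it needs and maps items to feature values given one context's data map (names invented; `Item` as in the previous sketch):

```scala
// A data map binds each data key (e.g. "viewing_history") to the data object
// fetched for one context. Everything here is an invented illustration.
trait FeatureEncoder extends Serializable {
  def requiredDataKeys: Set[String]                    // data keys this encoder consumes
  def encode(dataMap: Map[String, AnyRef], items: Seq[Item]): Map[Item, Double]
}
```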
  24. Two types of Data Elements
      • Context-dependent data elements
        • Viewing History
        • MyList
        • ...
      • Context-independent data elements
        • Video Metadata
        • Genre Metadata
        • ...
  25. Video Country of Origin Matching Fraction (worked example): for each context-item pair, the context-dependent data element is the member's viewing history and the context-independent data element is video metadata; the feature is the fraction of videos in the viewing history whose country of origin matches the candidate item's (the slide shows example values of 0.5, 1.0, and 0.0)
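That worked example is simple to state in code. A sketch (field and function names invented):

```scala
case class VideoMeta(videoId: String, countryOfOrigin: String)

// Fraction of the member's viewing history whose country of origin matches the
// candidate's: viewing history is context-dependent, metadata is not.
def countryMatchFraction(viewingHistory: Seq[String],      // watched video ids
                         metadata: Map[String, VideoMeta], // context-independent
                         candidateId: String): Double =
  if (viewingHistory.isEmpty) 0.0
  else {
    val target = metadata(candidateId).countryOfOrigin
    viewingHistory.count(id => metadata.get(id).exists(_.countryOfOrigin == target))
      .toDouble / viewingHistory.size
  }
```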
  26. Feature Generation (diagram): a feature model (JSON) lists the required feature keys; the corresponding feature encoders declare the data keys they need; data elements from the S3 snapshot are loaded into POJOs and assembled into a data map; the encoders consume the data map, together with label data, to produce labels and features for model training
  27. Features
      • Represented as Spark DataFrames
      • Nested structure to avoid data shuffling in the ranking process
      • Stored in Parquet format in S3
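One way to realize the nested layout (schema invented): keep every item of a context inside a single row, so ranking a context never has to shuffle to regroup its items.

```scala
import org.apache.spark.sql.SparkSession

// Invented nested schema: one row per context, items nested inside, so scoring
// a context reads a single row and no shuffle is needed to collect its items.
case class ItemRow(videoId: String, label: Double, features: Map[String, Double])
case class ContextRow(memberId: Long, country: String, items: Seq[ItemRow])

def writeFeatures(spark: SparkSession, rows: Seq[ContextRow], path: String): Unit = {
  import spark.implicits._
  spark.createDataset(rows).write.parquet(path) // columnar Parquet on S3
}
```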
  28. Features (example): one row per context, with each item's label and features nested inside the row
  29. Going Online (diagram): in the offline experiment, DeLorean generates features from S3 snapshots for model training, validation, and testing; the chosen models are deployed to the online ranking/scoring service, where online feature generation calls the Viewing History, MyList, and Ratings services directly; the feature encoders are shared between the offline and online systems
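Because the encoders are shared, the online path differs only in where the data map comes from: live services instead of snapshots. A sketch reusing the invented `FeatureEncoder` and `ServiceClients` traits from earlier (data-key strings are also invented):

```scala
// Online scoring builds the data map from live services, then calls the very
// same encode method that offline feature generation used.
def scoreOnline(encoder: FeatureEncoder,
                clients: ServiceClients,
                memberId: Long,
                candidates: Seq[Item]): Map[Item, Double] = {
  val dataMap: Map[String, AnyRef] = Map(
    "viewing_history" -> clients.viewingHistory(memberId),
    "my_list"         -> clients.myList(memberId)
  )
  encoder.encode(dataMap, candidates)
}
```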
  30. Conclusion: Spark helped us significantly reduce the time from an idea to an A/B test
  31. Future Work
      • Event-driven data snapshots
      • Time travel to the future!!
  32. We're hiring! (Come talk to us.) https://jobs.netflix.com/ • Tech Blog: http://bit.ly/sparktimetravel
