Distributed Time Travel for Feature Generation at Netflix


Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create features is critical for machine learning projects to be successful. At Netflix, we spend significant time and effort experimenting with new features and new ways of building models. This involves generating features for our members from different regions over multiple days. To enable this, we built a time machine using Apache Spark that computes features for any arbitrary time in the recent past. The first step in building this time machine is to snapshot the data from various microservices on a regular basis. We built a general-purpose workflow orchestration and scheduling framework optimized for machine learning pipelines and used it to run the snapshot and model training workflows. Snapshot data is then consumed by feature encoders to compute various features for offline experimentation and model training. Crucially, the same feature encoders are used in both offline model building and online scoring for production or A/B tests. Building this time machine helped us try new ideas quickly, without placing stress on production services and without having to wait for data to accumulate for newly implemented features. Moreover, building it with Apache Spark empowered us to both scale up the data size by an order of magnitude and train and validate the models in less time. Finally, using Apache Zeppelin notebooks, we are able to prototype features and run experiments interactively.
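
As a rough illustration of the idea (the S3 paths, column names, and API below are hypothetical, not the actual DeLorean interface), generating features for a past date amounts to loading the snapshot taken on that date, joining it to the sampled contexts, and handing the result to the same feature encoders that run online. A minimal Spark/Scala sketch:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object TimeTravelSketch {
      // Hypothetical layout: one snapshot per day, stored as Parquet on S3.
      def loadSnapshot(spark: SparkSession, date: String): DataFrame =
        spark.read.parquet(s"s3://feature-bucket/snapshots/snapshotDate=$date")

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("time-travel-sketch").getOrCreate()

        // "Destination time": generate features as they would have looked on this day.
        val destinationDate = "2016-03-01"

        val snapshot = loadSnapshot(spark, destinationDate)
        val contexts = spark.read.parquet(s"s3://feature-bucket/context-sets/date=$destinationDate")

        // Join each sampled context with the snapshot data it needs; the real system
        // would pass each (context, data) pair to feature encoders instead of showing it.
        contexts.join(snapshot, Seq("profileId")).show(5)

        spark.stop()
      }
    }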


Speaker Bio: DB Tsai

DB Tsai is an Apache Spark committer and a Senior Research Engineer working on Personalized Recommendation Algorithms at Netflix. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he implemented several algorithms, including Linear Regression and Binary/Multinomial Logistic Regression with Elastic-Net (L1/L2) regularization, using the L-BFGS/OWL-QN optimizers in Apache Spark. DB was a Ph.D. candidate in Applied Physics at Stanford University. He holds a Master's degree in Electrical Engineering from Stanford University.



  1. Distributed Time Travel for Feature Generation DB Tsai March 24, 2016 at SF Big Analytics Meetup
  2. Who am I? • I am a Senior Research Engineer at Netflix • I am an Apache Spark Committer
  3. Turn on Netflix, and the absolute best content for you would automatically start playing
  4. Data Driven • Try an idea offline using historical data to see if it would have made better recommendations • If it did, deploy a live A/B test to see if it performs well in Production
  5. Quickly try ideas on historical data and transition to online A/B test
  6. + ≠ Feature Engineering
  7. Why build a Time Machine?
  8. Label Leaks • Recommendation is an analytic learning process: learn from the past to predict the future. • P(play_t | features_t') where t' < t • We have to be very careful here; otherwise the features can contain the labels, so that offline metrics look very good but online evaluation does not perform as expected. (A point-in-time filtering sketch follows the transcript.)
  9. The Past • Generate features based on event data logged in Hive • Need to reimplement features for online A/B test • Data discrepancies between offline and online sources • Log features online where the model will be used • Need to deploy each idea into production • Feature generation calls online services and filters data past a certain time • Works only when a service records a log of historical events • Additional load on online services
  10. DeLorean image by JMortonPhoto.com & OtoGodfrey.com
  11. Time Travel using Snapshots • Snapshot online services and use the snapshot data offline to generate features • Share facts and features between experiments without calling live systems
  12. How to build a Time Machine
  13. Context Selection Data Snapshots APIs for Time Travel
  14. Context Selection (diagram) • Runs once a day • Stratified sampling over data in Hive • Writes a context set to S3, with contexts tagged with metadata (a sampling sketch follows the transcript)
  15. Data Snapshots (diagram) • Runs once a day • Reads the context set from S3 • Fetches data for each context from the Viewing History, MyList, and Ratings services through Prana (Netflix libraries) over Thrift • Writes snapshot data to S3 as Parquet (a snapshot-writing sketch follows the transcript)
  16. APIs for Time Travel
  17. Data Architecture (diagram) • Context Selection: runs once a day; stratified sampling over Hive; contexts tagged with metadata; context set stored in S3 • Data Snapshots: runs once a day; Prana (Netflix libraries) snapshots the Viewing History, MyList, and Ratings services to S3 via Thrift • Data snapshot batch APIs return RDDs of snapshot objects
  18. Generating Features via Time Travel
  19. Great Scott! • DeLorean: A time-traveling vehicle • uses data snapshots to travel in time • scales with Apache Spark • prototypes new ideas with Zeppelin • requires minimal code changes from experimentation to A/B test to production https://en.wikipedia.org/wiki/Emmett_Brown There's the DeLorean!
  20. Running Time Travel Experiment Select the destination time Bring it up to 88 miles per hour!
  21. Running Time Travel Experiment (workflow diagram) • Design experiment → collect label dataset and selected contexts → DeLorean: offline feature generation → distributed model training (parallel training of individual models using different executors) → compute validation metrics → model testing → choose best model • Good metrics: move from the offline experiment to the online system for online A/B testing • Bad metrics: design a new experiment to test out different ideas
  22. DeLorean Input Data • Contexts: The setting for evaluating a set of items (e.g. tuples of member profiles, country, time, device, etc.) • Items: The elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities) • Labels: For supervised learning, this will be the label (target) for each item. (A schema sketch follows the transcript.)
  23. Feature Encoders • Compute features for each item in a given context • Each type of raw data element has its own data key • Data map is a map from data keys to data objects in a given context • Data map is consumed by feature encoders to compute features (an interface sketch follows the transcript)
  24. Two types of Data Elements • Context-dependent data elements • Viewing History • Mylist • ... • Context-independent data elements • Video Metadata • Genre Metadata • ...
  25. Video Country of Origin Matching Fraction (worked example) • Combines a context-dependent data element (the member's viewing history) with a context-independent one (video metadata: country of origin) • For each context-item pair, the feature is the fraction of viewed videos whose country of origin matches the item's (the example shows values such as 0.5, 1.0, and 0.0) (a sketch follows the transcript)
  26. Feature Generation (diagram) • S3 snapshot data and label data are loaded as data elements (data in POJOs), addressed by data keys • The feature model (JSON) selects the feature encoders, whose required feature keys determine which data goes into the data map • The encoders consume the data map to produce labels and features for model training
  27. Features • Represented in Spark's DataFrames • In nested structure to avoid data shuffling in the ranking process • Stored in Parquet format in S3 (a sketch of the nested shape follows the transcript)
  28. Features (example) • One row per context, with item, label, and features nested inside
  29. Going Online (diagram) • Offline experiment: S3 snapshots → DeLorean offline feature generation → model training / validation / testing • Online system: the Viewing History, MyList, and Ratings services feed online feature generation and the online ranking / scoring service • Models are deployed from offline to online, and the feature encoders are shared between the two paths
  30. Conclusion Spark helped us significantly reduce the time from an idea to an A/B test
  31. Future work Event Driven Data Snapshots Time Travel to the Future!!
  32. We're hiring! (come talk to us) https://jobs.netflix.com/ Tech Blog: http://bit.ly/sparktimetravel
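
The sketches below illustrate a few of the slides with minimal Spark/Scala code; every table name, column name, S3 path, and interface in them is a made-up assumption for illustration, not the actual Netflix DeLorean code. First, slide 8's label-leak guard: when computing features for a context at time t, only events strictly earlier than t may enter the data map, so that P(play_t | features_t') always has t' < t.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    object LabelLeakGuard {
      // Keep only events that happened strictly before the context time, so the
      // features never see the play event (the label) they are meant to predict.
      def eventsBefore(viewingHistory: DataFrame, contextTime: java.sql.Timestamp): DataFrame =
        viewingHistory.filter(col("eventTime") < contextTime)   // "eventTime" is an assumed column
    }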
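
Slide 14's context selection is a daily stratified-sampling job over data in Hive that writes a tagged context set to S3. A sketch of the sampling step, stratifying on a hypothetical country column with made-up fractions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.current_date

    object ContextSelectionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("context-selection-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Candidate contexts logged in Hive (table name is hypothetical).
        val candidates = spark.table("default.member_contexts")

        // Stratified sampling: different fractions per country so that smaller
        // regions are still well represented in the daily context set.
        val fractions = Map("US" -> 0.01, "BR" -> 0.05, "JP" -> 0.05)
        val contextSet = candidates.stat.sampleBy("country", fractions, seed = 42L)

        // Tag with selection metadata and persist the context set to S3 (path assumed).
        contextSet
          .withColumn("selectionDate", current_date())
          .write.mode("overwrite")
          .parquet("s3://feature-bucket/context-sets/date=2016-03-01")

        spark.stop()
      }
    }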
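
Slide 15's snapshot job fetches data for each context from services such as Viewing History, MyList, and Ratings (through Prana) and lands it in Parquet on S3. A sketch of the persistence step only, assuming the service responses have already been collected into a DataFrame:

    import org.apache.spark.sql.{DataFrame, SaveMode}
    import org.apache.spark.sql.functions.lit

    object SnapshotWriterSketch {
      // Persist one day's snapshot of a service, partitioned by snapshot date so the
      // time machine can later load exactly the state the service had on that day.
      def writeSnapshot(serviceData: DataFrame, serviceName: String, date: String): Unit =
        serviceData
          .withColumn("snapshotDate", lit(date))
          .write
          .mode(SaveMode.Append)
          .partitionBy("snapshotDate")
          .parquet(s"s3://feature-bucket/snapshots/$serviceName")   // bucket and layout assumed
    }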
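
Slide 22's three inputs can be pictured as a small schema; the field names here are purely illustrative:

    // Illustrative shape of DeLorean's inputs; the real field names differ.
    case class Context(profileId: Long, country: String, time: java.sql.Timestamp, device: String)
    case class Item(videoId: Long)
    // For supervised learning, each item in a context carries a label (target).
    case class LabeledItem(context: Context, item: Item, label: Double)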
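
Slide 23's feature encoders declare the data keys they need; the framework assembles a data map from the snapshots and hands it to each encoder. A sketch of that contract (an assumed interface, not the actual one):

    object FeatureEncoding {
      // A data key identifies one type of raw data element (e.g. viewing history).
      final case class DataKey(name: String)

      // The data map holds the raw data objects available for a given context.
      type DataMap = Map[DataKey, Any]

      trait FeatureEncoder[Item] {
        // Data keys this encoder needs; the framework fills them from the snapshots.
        def requiredKeys: Set[DataKey]

        // Compute this encoder's feature value for each item in the given context.
        def encode(data: DataMap, items: Seq[Item]): Seq[Double]
      }
    }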
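
Slide 25's example feature combines a context-dependent element (the member's viewing history) with a context-independent one (video metadata). A hypothetical implementation: for each candidate video, the fraction of viewed videos whose country of origin matches the candidate's:

    object CountryMatchFraction {
      // viewedVideos: context-dependent (this member's viewing history)
      // countryOfOrigin: context-independent (video metadata, shared across contexts)
      def matchingFraction(viewedVideos: Seq[Long],
                           candidateVideo: Long,
                           countryOfOrigin: Map[Long, String]): Double = {
        if (viewedVideos.isEmpty) 0.0
        else countryOfOrigin.get(candidateVideo) match {
          case Some(country) =>
            viewedVideos.count(v => countryOfOrigin.get(v).contains(country)).toDouble / viewedVideos.size
          case None => 0.0
        }
      }

      // Example: history = videos 1 (US) and 2 (JP), candidate = video 3 (US) => 0.5
      // matchingFraction(Seq(1L, 2L), 3L, Map(1L -> "US", 2L -> "JP", 3L -> "US")) == 0.5
    }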
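
Slides 27-28 keep features in nested DataFrames, one row per context with its items, labels, and features nested inside, stored as Parquet in S3, so that all items of a context are ranked together without a shuffle. A sketch of that shape with made-up field names:

    import org.apache.spark.sql.SparkSession

    object NestedFeaturesSketch {
      case class ItemFeatures(videoId: Long, label: Double, features: Seq[Double])
      case class ContextRow(profileId: Long, country: String, items: Seq[ItemFeatures])

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("nested-features-sketch").getOrCreate()
        import spark.implicits._

        // One row per context; all of its items (with labels and features) stay
        // together, so ranking a context never requires shuffling data around.
        val features = Seq(
          ContextRow(1L, "US", Seq(
            ItemFeatures(10L, label = 1.0, features = Seq(0.5, 1.0)),
            ItemFeatures(11L, label = 0.0, features = Seq(0.0, 1.0))))
        ).toDS()

        features.write.mode("overwrite").parquet("s3://feature-bucket/features/date=2016-03-01")
        spark.stop()
      }
    }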
