Balancing Discovery and Continuation in Recommendations
Hossein Taghavi
With: Ashok Chandrashekar, Linas Baltrunas, and Justin Basilico
RecSysTV 2016
Outline
§ Background: Netflix recommendations
§ Recommending for different modes of watching
§ Case study: Continue Watching row
§ Conclusions
Evolution of Netflix
[Images: the Netflix experience in 2006 vs. 2016]
Netflix Scale
§ > 83M members
§ > 190 countries
§ > 1000 device types
§ > 3.7B hours of content streamed every month
§ 36% of peak US downstream traffic
The Netflix Prize
§ Recommendations through predicted star rating
§ Contest:
§ Accuracy measured by root mean squared error (RMSE)
§ Improve it by 10% = $1 million!
§ Data size:
§ 100M ratings (back then “almost massive”)
Recommendation System: Ideal State
Turn on Netflix, and the absolute best content for you would automatically start playing.
Meanwhile…
Create a page of recommendations where the titles you are most likely to watch and enjoy are shown on the most visible parts of the page.
Everything is a Recommendation
[Annotated homepage screenshot: row selection & ordering, and title ranking within each row]
§ Recommendations are driven by machine learning algorithms
§ Over 80% of what members watch comes from our recommendations
How the Homepage is Built
§ The titles are organized as rows
§ Ordering of titles within rows depends on the row type
§ Selection and ordering of rows:
§ Personalized page generation algorithm (toy sketch below)
§ Also some business rules and constraints
§ Balance thematic coherence, relevance, and diversity
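The page generation algorithm itself is not specified here; purely as an illustration of trading off relevance against thematic redundancy (the toy sketch referenced above), consider a greedy loop that discounts a candidate row's relevance by its overlap with rows already placed. All names and scores are made up.

def build_page(candidate_rows, num_rows, diversity_weight=0.5):
    # Toy greedy page construction: repeatedly pick the candidate row whose
    # relevance, discounted by similarity to already-chosen rows, is highest.
    # candidate_rows is a list of (name, relevance, set_of_themes) tuples.
    page = []
    remaining = list(candidate_rows)
    while remaining and len(page) < num_rows:
        def score(row):
            name, relevance, themes = row
            # Jaccard overlap with the most similar row already on the page.
            overlap = max((len(themes & t) / max(len(themes | t), 1)
                           for _, _, t in page), default=0.0)
            return relevance - diversity_weight * overlap
        best = max(remaining, key=score)
        page.append(best)
        remaining.remove(best)
    return [name for name, _, _ in page]

rows = [("Trending Now", 0.9, {"popular"}),
        ("Popular on Netflix", 0.85, {"popular"}),
        ("Critically-acclaimed Dramas", 0.7, {"drama"})]
print(build_page(rows, num_rows=2))  # the near-duplicate second row is demoted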
Various Types of Member Interactions/Feedback
§ Plays
§ How long, pause, rewind, skip, etc.
§ Rating and social
§ Rate, like, share
§ Context
§ Time, location, device, language
§ Interactions
§ Scrolling, opening a title page, search, list add
Building the Recommendations is Data Driven
§ Try an idea offline using historical data to see if it would have made better recommendations
§ Offline metrics: AUC, nDCG (sketched below), Recall, …
§ If it did, deploy a live A/B test to see if it performs well in production
§ Primary metric: Member retention
[Cycle: Idea/Problem → Data → Algorithm → Model → Metrics → A/B Testing]
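As a concrete example of one of these offline metrics, the sketch below computes nDCG for a single ranked list; this is a generic textbook implementation, not Netflix's evaluation code.

import math

def dcg(relevances):
    # Discounted cumulative gain: top positions count more (log2 discount).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the best possible ordering of the same labels.
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels (1 = played, 0 = not played) in the order the model ranked the titles.
print(ndcg([0, 1, 0, 1]))  # < 1.0 because the played titles are not at the top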
For More Reading
§ Netflix tech blog:
§ bit.ly/beyondfivestars
§ bit.ly/learnapage
§ bit.ly/sparktimetravel
Building recommendation algorithms that are balanced for different modes of watching
What Is the Most Likely Title You Will Watch?
The same one you watched last time!
§ A large portion of watching hours is spent in continue-watching mode
Different Modes of Watching
§ Continuation: Resume a recently watched TV show or movie
§ List: Play a title previously added to My List
§ Rewatch: Rewatch a title enjoyed in the past
§ Discovery: Discover a new title to watch
Recommending for Different Modes: Approach 1
§ Build one unified model for ranking the titles in each row and one for ranking rows
§ Optimized for the likelihood of play/enjoyment from the page
§ Benefits:
§ Fewer models to maintain
§ Fewer A/B tests
Approach 1: Challenges
§ Members behave differently in different modes
§ Different row types are designed for different behaviors
§ Hard to capture and balance all of that in one objective
§ E.g., simply ranking titles by likelihood of play will fill the page with already-watched titles → poor member experience
§ Recommendations for different modes have different sensitivities to member actions
§ Continuation recs may react immediately to watching activity, My List recs may react to My List add/remove activity, etc.
Approach 2: Dedicated Models + Blend
§ Build separate models for each mode
§ Blend the results on the page
§ Blending can be done through a model trained offline, or a parameter tuned online (see the sketch below)
§ E.g., one or more dedicated rows for each mode
§ Pro:
§ More modular, provides more intuitive knobs for balancing
§ Con:
§ Less elegant, more maintenance
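A minimal sketch of the blend step, assuming each mode's dedicated model has already scored its candidate rows; all names and weights here are hypothetical stand-ins for the kind of knobs the slide describes.

def blend_rows(mode_scores, mode_weights):
    # mode_scores maps each candidate row to {mode: score}; mode_weights maps
    # each mode to its blend weight (the parameter tuned offline or via A/B test).
    def blended(row):
        return sum(mode_weights.get(mode, 0.0) * s
                   for mode, s in mode_scores[row].items())
    # Order rows for the page by their blended score, best first.
    return sorted(mode_scores, key=blended, reverse=True)

scores = {
    "Continue Watching": {"continuation": 0.8},
    "My List":           {"list": 0.6},
    "Trending Now":      {"discovery": 0.7},
}
weights = {"continuation": 1.0, "list": 0.7, "discovery": 0.9}
print(blend_rows(scores, weights))  # ['Continue Watching', 'Trending Now', 'My List']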
Case Study: Continue Watching Row
Continue Watching Row: The Past
§ CW row was shown on some devices
§ Videos sorted by recency of last watch
§ Row appearance on the page governed by business rules
§ On the website, only a single CW title was shown
§ A very significant fraction of plays are continuations
§ CW deserved better treatment
Objective
§ Unify the CW row across devices
§ Optimize the row in two dimensions:
§ Row position on page: place it higher when the member is more likely to resume a video
§ Re-order the titles within the CW row by their likelihood of being resumed in the current session
Some Intuitive Patterns
§ A member may be more likely to want to
§ Resume a video if they:
§ Are in the middle of binging a TV show
§ Partially watched a movie recently
§ Often watched it around this time of day, in this location, or on the current device
§ Discover a new title if they:
§ Just finished a movie or completed all episodes of a show
§ Haven't watched anything recently
§ Are a relatively new member
Building a Recommendation Model for CW
§ Feature Brainstorm
§ Training Data
§ Models and Metrics
§ Implementation
Feature Ideas
§ Member-level:
§ Member’s subscription: tenure, country, language
§ How active has the member been recently
§ Member past ratings, genre preferences, etc.
Feature Ideas
§ Video and member’s previous interactions with it:
§ How recently was the video added to the catalog, watched, ...
§ How much of the movie/show watched
§ Video metadata:
§ Type and genre of video, # episodes
§ E.g., kids titles may be re-watched more
§ What else is in the catalog
§ Popularity and relevance of the video
§ How often do members resume this video
Feature Ideas
§ Contextual:
§ Time of the day and day of the week
§ Location at various resolutions
§ Device
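To make the three feature brainstorm slides concrete, here is a hypothetical feature record for one (member, title, context) triple; the field names are illustrative, not Netflix's actual schema.

from dataclasses import dataclass

@dataclass
class CWFeatures:
    # Illustrative features for scoring one title in the Continue Watching row.
    # Member-level
    tenure_days: int
    recent_activity_score: float
    # Video and the member's previous interactions with it
    hours_since_last_watch: float
    fraction_watched: float
    is_tv_show: bool
    resume_rate_in_population: float  # how often members resume this video
    # Contextual
    hour_of_day: int
    day_of_week: int
    device_type: str

example = CWFeatures(tenure_days=400, recent_activity_score=0.7,
                     hours_since_last_watch=18.0, fraction_watched=0.45,
                     is_tv_show=True, resume_rate_in_population=0.6,
                     hour_of_day=21, day_of_week=5, device_type="tv")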
Title Ranking Model
§ Training data
§ Continuation sessions
§ Look at which of the recently watched titles was played
§ Model
§ Learn-to-rank: linear models, ensembles, …
§ Optimize for how well we rank the played title among the other titles (toy sketch below)
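A toy pointwise stand-in for the linear learn-to-rank option named above, assuming scikit-learn; the features and data are made up for illustration, and the talk does not describe the actual model or pipeline at this level of detail.

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (continuation session, candidate title); the label is 1 for the
# title that was actually resumed in that session, 0 for the others.
# Illustrative features: fraction watched, recency rank, is-TV-show flag.
X = np.array([[0.9, 1.0, 1.0],
              [0.2, 2.0, 0.0],
              [0.5, 3.0, 1.0],
              [0.1, 4.0, 0.0]])
y = np.array([1, 0, 0, 0])

model = LogisticRegression().fit(X, y)

# At serving time, rank the row's candidate titles by score, best first.
scores = model.predict_proba(X)[:, 1]
print(np.argsort(-scores))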
Title Ranking Model: Performance
§ Baseline: ranking by recency of last play
§ Recency rank was also an important feature in the model
§ Metrics significantly higher than the baseline
§ E.g., significant lift in precision
§ A/B testing also showed improvements
Row Placement Model
§ Objective
§ Estimate the likelihood of continuation vs. discovery
§ Map that likelihood to a position on the page
§ Simplification (sketched below):
§ Fix two candidate positions on the page and apply a threshold
§ Tune the threshold to optimize some accuracy metric
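A minimal sketch of this simplification; the two candidate positions and the threshold value are made up, and per the next slides the threshold was ultimately tuned via A/B testing.

def cw_row_position(p_continuation, threshold=0.5, high_pos=0, low_pos=3):
    # Place the CW row high on the page when a continuation session looks
    # likely, low otherwise. The threshold trades off false positives (CW
    # hogs the top of the page) against false negatives (hard to find).
    return high_pos if p_continuation >= threshold else low_pos

print(cw_row_position(0.8))  # -> 0 (top of page)
print(cw_row_position(0.2))  # -> 3 (lower on the page)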
Row Placement Model: Training
§ Training data
§ Randomly select sessions with plays globally
§ Model
§ Binary classification of continuation vs. discovery sessions
§ Evaluated using classification and ranking metrics
Row Placement Model: Performance
§ Metrics
§ Achieved high classification metrics for predicting continuation vs. discovery
§ Error types:
§ False positives → CW occupies the top of the page unnecessarily
§ False negatives → Difficult for the member to find the CW title
§ Placing the row
§ Threshold trades off FP and FN → hard to tune offline
§ Tuned the threshold by A/B testing
Reusing the Title Ranking Model
§ Use the title-level scores
§ Calibrate scores to get a probability P_t of continuation for each CW title t
§ Aggregate into an overall probability of continuation
§ E.g., assuming independence: P_CW = 1 − ∏_{t ∈ CW} (1 − P_t)  (computed in the sketch below)
§ Pro: Avoids maintaining two separate models
§ Con: Not as accurate as a dedicated model
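In code, the independence assumption above makes the aggregation a one-liner: the probability that at least one Continue Watching title is resumed.

import math

def p_cw(title_probs):
    # P_CW = 1 - prod over t in CW of (1 - P_t): the chance that at least one
    # CW title is resumed, assuming independence across titles.
    return 1.0 - math.prod(1.0 - p for p in title_probs)

print(p_cw([0.5, 0.3, 0.2]))  # 1 - 0.5 * 0.7 * 0.8 = 0.72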
Context Awareness
§ A title ranks highest at the same time of day and on the same device as its last play
§ Experiment:
§ Played “Sid the Science Kid” on iPhone
§ Played “Narcos” on the website
§ → Different rankings on iPhone and the web
Serving the CW Row in Production
§ Scores cannot be precomputed → real-time or near-real-time computation
§ Some features are context dependent
§ Row should refresh each time a member watches a title
§ Need to push updates to clients to keep the row fresh
§ Latency bottleneck: data transfers from the cache to the computation backend
§ Requires careful backend engineering
§ Fallback strategy: if computation fails, fall back to recency ranking (sketched below)
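A sketch of the fallback guard; score_titles and the title objects are hypothetical stand-ins for the real-time scoring service and catalog entities.

def serve_cw_row(member_id, context, score_titles, recently_watched):
    # Score each recently watched title in real time; if anything fails
    # (e.g., a timeout fetching contextual features), fall back to recency
    # ordering, which matches the legacy Continue Watching behavior.
    try:
        scores = score_titles(member_id, context, recently_watched)
        return sorted(recently_watched, key=lambda t: -scores[t.id])
    except Exception:
        return sorted(recently_watched, key=lambda t: -t.last_watch_time)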
Conclusions and Future Directions
Conclusions
§ Important to understand different modes of behavior
§ Continuation is a key driver of streaming hours
§ Improving CW recommendations improves the member experience
§ A/B testing showed a significant boost in user engagement
§ Future:
§ Incorporate the placement of the CW row (and others) into the main page construction model
§ When can we automatically start resuming a title?
Questions?
Upcoming blog post on this topic at: techblog.netflix.com
Job openings: jobs.netflix.com