
A Multi-Armed Bandit Framework For Recommendations at Netflix

In this talk, we present a general multi-armed bandit framework for recommendations on the Netflix homepage. We present two example case studies of MABs at Netflix: a) Artwork Personalization, which recommends personalized visuals for each member and title, and b) Billboard Recommendation, which recommends the right title to promote on the Billboard.


  1. 1. A Multi-Armed Bandit Framework for Recommendations at Netflix Jaya Kawale & Fernando Amat PRS Workshop, June 2018
  2. 2. Quickly help members discover content they’ll love
  3. 3. Global Members, Personalized Tastes 125 Million Members ~200 Countries
  4. 4. 98% Match
  5. 5. Spot the Algorithms! 98% Match
  6. 6. Spot the Algorithms! 98% Match
  7. 7. Case Study I: Artwork Optimization Goal: Recommend personalized artwork or imagery for a title to help members decide whether they will enjoy the title.
  8. 8. Case Study II: Billboard Recommendation Goal: Successfully introduce content to the right members.
  9. 9. Traditional Approaches for Recommendation: Collaborative Filtering ● Idea is to use the “wisdom of the crowd” to recommend items ● Well understood and various algorithms exist (e.g. Matrix Factorization) [Figure: binary users x items interaction matrix]
  10. 10. Challenges for Traditional Approaches ● Scarce feedback ● Dynamic catalog ● Country availability ● Non-stationary member base ● Time sensitivity ○ Content popularity changes ○ Member interests evolve ○ Respond quickly to member feedback
  11. 11. Challenges for Traditional Approaches Continuous and fast learning needed ● Scarce feedback ● Dynamic catalog ● Country availability ● Non-stationary member base ● Time sensitivity ○ Content popularity changes ○ Member interests evolve ○ Respond quickly to member feedback
  12. 12. Multi-Armed Bandits Increasingly successful in various practical settings where these challenges occur Clinical Trials Network Routing Online Advertising AI for Games Hyperparameter Optimization
  13. 13. Multi-Armed Bandits ● A gambler playing multiple slot machines with unknown reward distribution ● Which machine to play to maximize reward?
  14. 14. Multi-Armed Bandit For Recommendation Exploration-exploitation tradeoff: recommend the optimal title given the evidence (i.e. exploit) OR recommend other titles to gather feedback (i.e. explore).
  15. 15. Numerous Variants ● Different Strategies: ε-Greedy, Thompson Sampling (TS), Upper Confidence Bound (UCB), etc. ● Different Environments: ○ Stochastic and stationary: Reward is generated i.i.d. from a distribution specific to the action. No payoff drift. ○ Adversarial: No assumptions on how rewards are generated. ● Different objectives: Cumulative regret, tracking the best expert ● Continuous or discrete set of actions, finite vs infinite ● Extensions: Varying set of arms, Contextual Bandits, etc.
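As a rough illustration of two of the strategies named above, here is a minimal sketch of ε-greedy and Thompson Sampling for a simple stochastic Bernoulli bandit (a play / no-play reward per arm). The arm count, ε value, and Beta(1, 1) priors are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, epsilon = 4, 0.1        # illustrative assumptions, not Netflix settings
pulls = np.zeros(n_arms)        # impressions per arm
wins = np.zeros(n_arms)         # observed plays (successes) per arm

def epsilon_greedy():
    # With probability epsilon explore a random arm, otherwise exploit
    # the arm with the best empirical success rate.
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))
    return int(np.argmax(wins / np.maximum(pulls, 1)))

def thompson_sampling():
    # Sample a plausible success rate per arm from its Beta(1 + wins, 1 + losses)
    # posterior and play the arm with the highest sample.
    samples = rng.beta(wins + 1, pulls - wins + 1)
    return int(np.argmax(samples))

def update(arm, reward):
    # reward is 1 for a play, 0 otherwise
    pulls[arm] += 1
    wins[arm] += reward
```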
  16. 16. Case Study I: Artwork Personalization
  17. 17. Bandit Algorithms Setting For each (user, show) request: ● Actions: set of candidate images available ● Reward: how many minutes the user played from that impression ● Environment: Netflix homepage on the user’s device ● Learner: its goal is to maximize the cumulative reward after N requests [Diagram: learner-environment loop exchanging context, action, and reward]
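A minimal sketch of the interaction loop this slide describes; the learner and environment objects and their method names are hypothetical stand-ins, not a Netflix API.

```python
def run(learner, environment, n_requests):
    # Learner-environment loop: observe context, choose an image, observe reward.
    total_reward = 0.0
    for _ in range(n_requests):
        context = environment.observe()               # features of the (user, show) request
        action = learner.choose(context)              # one image from the candidate set
        reward = environment.minutes_played(context, action)  # minutes played from the impression
        learner.update(context, action, reward)       # learn from the feedback
        total_reward += reward
    return total_reward                               # the learner tries to maximize this
```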
  18. 18. Specific challenges ● Play attribution and reward assignment ○ Incremental effect of the image on top of the recommender system ● Only one image per title can be presented ○ Although inherently it is a ranking problem Would you play because the movie is recommended or because of the artwork? Or both?
  19. 19. Specific challenges ● Change effect ○ Can changing images too often make users confused? [Diagram: image sequences A and B across sessions 1 to N]
  20. 20. Actions ● We have control over the set of actions ○ How many images per show ○ Image design ● What makes a good asset? ○ Representative (no clickbait) ○ Differential ○ Informative ○ Engaging ○ Personal (i.e. contextual)
  21. 21. Intuition for Personalized Assets ● Emphasize themes through different artwork according to some context (user, viewing history, country, etc.) Preferences in genre
  22. 22. Intuition for Personalized Assets ● Emphasize themes through different artwork according to some context (user, viewing history, country, etc) Preferences in cast members
  23. 23. Epsilon Greedy for MABs ● Explore (with probability ε): unbiased training data, like an A/B test across actions ● Exploit (with probability 1-ε): greedy, select the optimal action
  24. 24. Greedy Exploit Policy ● Learn a binary classifier per image to predict probability of play ● Pick the winner (arg max) [Diagram: member (context) features and image pool fed to per-image models; arg max selects the winner]
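A minimal sketch of this exploit step: one binary play-probability model per candidate image, with the winner chosen by arg max. The model objects and their scoring method are hypothetical placeholders, not the production interface.

```python
def pick_image(member_features, image_pool, models):
    # models[image] is a per-image binary classifier exposing a
    # predict_play_probability(features) -> float method (hypothetical interface).
    scores = {image: models[image].predict_play_probability(member_features)
              for image in image_pool}
    return max(scores, key=scores.get)   # greedy winner: highest predicted P(play)
```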
  25. 25. Take Fraction Example (Luke Cage): one play out of three impressions across Users A, B, and C, so Take Fraction = 1 / 3
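The take fraction itself is simply plays divided by impressions; a trivial sketch with the slide's numbers:

```python
def take_fraction(plays, impressions):
    # Fraction of impressions that led to a play.
    return plays / impressions

# Luke Cage example from the slide: 1 play out of 3 impressions -> 1/3
assert abs(take_fraction(1, 3) - 1 / 3) < 1e-9
```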
  26. 26. Offline metric: Replay [Li et al., 2010] ● Unbiased offline evaluation from explore data [Diagram: six users with random vs model image assignments; on the matched impressions the offline Take Fraction = 2 / 3]
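A sketch of the replay estimate [Li et al., 2010] as described here: keep only the randomly assigned (explore) impressions where the candidate policy would have shown the same image that was actually shown, then compute the take fraction on that matched subset. The logged-impression field names below are assumptions, not a Netflix schema.

```python
def replay_take_fraction(logged_impressions, policy):
    # logged_impressions: explore data where images were randomly assigned.
    matched, plays = 0, 0
    for imp in logged_impressions:
        chosen = policy(imp["member_features"], imp["image_pool"])
        if chosen == imp["shown_image"]:   # policy agrees with the random assignment
            matched += 1
            plays += imp["played"]         # 1 if the member played, else 0
    return plays / matched if matched else 0.0
```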
  27. 27. Offline Replay ● Context matters ● Artwork diversity matters ● Personalization wiggles around most popular images [Figure: lift in Replay for the various algorithms compared to the Random baseline]
  28. 28. Online results ● Rollout to our >125M member base ● Most beneficial for lesser-known titles ● Compression from title-level offline metrics due to cannibalization between titles
  29. 29. Case Study II: Billboard Recommendation
  30. 30. Considerations for the greedy policy ● Explore ○ Bandwidth allocation and cost of exploration ○ New vs existing titles ● Exploit ○ Model synchronisation ○ Title availability ○ Frequency of model update ○ Incremental updates vs batch training ■ Stationarity of title popularities
  31. 31. Greedy Exploit Policy [Diagram: member features and candidate pool fed to per-title models; the winner is selected by predicted probability of play]
  32. 32. Would the member have played the title anyway?
  33. 33. Netflix Promotions The Netflix homepage is expensive real estate (opportunity cost): - so many titles to promote - so few opportunities to win a “moment of truth” [Chart: probability of play across days D1 to D5; when to promote?]
  34. 34. Netflix Promotions The Netflix homepage is expensive real estate (opportunity cost): - so many titles to promote - so few opportunities to win a “moment of truth” Traditional (correlational) ML systems: - take action if the probability of positive reward is high, irrespective of the reward base rate - don’t model the incremental effect of taking the action [Chart: probability of play across days D1 to D5; when to promote?]
  35. 35. Incrementality from Advertising ● Goal: Measure ad effectiveness. ● Incrementality: The difference in the outcome because the ad was shown; the causal effect of the ad. [Figure: random assignment to treatment vs control (other advertisers’ ads); treatment revenue $1.1M vs control $1.0M, a $100k incremental effect*] *Johnson, Garrett A. and Lewis, Randall A. and Nubbemeyer, Elmar I., Ghost Ads: Improving the Economics of Measuring Online Ad Effectiveness (January 12, 2017). Simon Business School Working Paper No. FR 15-21. Available at SSRN: https://ssrn.com/abstract=2620078
  36. 36. Incrementality Based Policy ● Goal: Select the title for promotion that benefits most from being shown on the Billboard ○ A member can play a title from other sections on the homepage or from search ○ Popular titles are likely to appear on the homepage anyway: Trending Now ○ Better utilize the most expensive real estate on the homepage! ● Define the policy to be incremental with respect to probability of play
  37. 37. Incrementality Based Policy on Billboard ● Goal: Recommend the title which has the largest additional benefit from being presented on the Billboard ○ Recommend the title with the argmax of: P(play | title shown on the Billboard) − P(play | title not shown on the Billboard)
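A minimal sketch of this incrementality-based selection: score every candidate by the lift in play probability from being shown on the Billboard and take the arg max. The two probability models are hypothetical placeholders for whatever estimates those quantities.

```python
def pick_billboard_title(member_features, candidate_pool,
                         p_play_with_billboard, p_play_without_billboard):
    # Incremental lift of promoting a title: P(play | shown on Billboard)
    # minus P(play | not shown). Both callables are hypothetical models.
    def incremental_lift(title):
        return (p_play_with_billboard(member_features, title)
                - p_play_without_billboard(member_features, title))
    return max(candidate_pool, key=incremental_lift)
```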
  38. 38. Which titles benefit from the Billboard? Title A benefits much more than Title C from being shown on the Billboard. [Figure: scatter plot of incremental vs baseline probability of play for various members]
  39. 39. Offline & Online Results ● The incrementality-based policy sacrifices replay by selecting a lesser-known title that would benefit from being shown on the Billboard. ● Our implementation of incrementality is able to shift engagement within the candidate pool. [Figure: lift in Replay for the various algorithms compared to a random baseline]
  40. 40. Research Directions
  41. 41. Action selection orchestration ● Neighboring image selection influences the result ● Title-level optimization is not enough [Example: Row A (diverse images) vs Row B (“the microphone row”) for stand-up comedy titles]
  42. 42. Automatic image selection ● Generating new artwork is costly and time-consuming ● Develop an algorithm to predict asset quality from the raw image [Diagram: raw image to box art]
  43. 43. Long-term Reward: Road to RL ● Maximize long-term reward: reinforcement learning ○ Optimize for the user’s long-term joy rather than play clicks or duration.
  44. 44. Thank you. Jaya Kawale (jkawale@netflix.com) Fernando Amat (famat@netflix.com)
