Building Industrial-scale Real-world Recommender Systems

September 11, 2012

Xavier Amatriain
Personalization Science and Engineering - Netflix
@xamat
Outline
1.  Anatomy of Netflix Personalization
2.  Data & Models
3.  Consumer (Data) Science
4.  Architectures
Anatomy of Netflix Personalization
Everything is a Recommendation

Everything is personalized
[Figure annotations: Ranking, Rows]
Note: Recommendations are per household, not individual user
4
Top 10
Personalization awareness
[Figure annotations: All, Dad, Dad&Mom, Daughter, All, All?, Daughter, Son, Mom, Mom]
Diversity
5
Support for Recommendations
[Figure annotation: Social Support]
6
Watch again & Continue Watching




                                  7
Genres




8
Genre rows
§  Personalized genre rows focus on user interest
   §  Also provide context and “evidence”
   §  Important for member satisfaction – moving personalized rows to the top on
       devices increased retention
§  How are they generated?
   §  Implicit: based on user’s recent plays, ratings, & other interactions
   §  Explicit taste preferences
   §  Hybrid: combine the above
   §  Also take into account:
      §  Freshness – has this been shown before?
      §  Diversity – avoid repeating tags and genres, limit number of TV genres, etc.
Genres - personalization




                           10
Genres - personalization




                           11
Genres – explanations
12

Genres – explanations
13
Genres – user involvement




                            14
Genres – user involvement




                            15
Similars
 §  Displayed in
     many different
     contexts
    §  In response to
        user actions/
        context (search,
        queue add…)
    §  More like… rows
Anatomy of Personalization - Recap
§  Everything is a recommendation: not only rating
    prediction, but also ranking, row selection, similarity…
§  We strive to make it easy for the user, but…
§  We want the user to be aware and be involved in the
    recommendation process
§  Deal with implicit/explicit and hybrid feedback
§  Add support/explanations for recommendations
§  Consider issues such as diversity or freshness
                                                               17
Data & Models

Big Data
§  Plays
§  Behavior
§  Geo-Information
§  Time
§  Ratings
§  Searches
§  Impressions
§  Device info
§  Metadata
§  Social
§  …

19
Big Data   §  25M+ subscribers
@Netflix   §  Ratings: 4M/day
           §  Searches: 3M/day
           §  Plays: 30M/day
           §  2B hours streamed in Q4
               2011
           §  1B hours in June 2012



                                         20
Models
§    Logistic/linear regression
§    Elastic nets
§    Matrix Factorization
§    Markov Chains
§    Clustering
§    LDA
§    Association Rules
§    Gradient Boosted Decision Trees
§  …

                                        21
Rating Prediction




                    22
2007 Progress Prize
§  KorBell team (AT&T) improved by 8.43%
§  Spent ~2,000 hours
§  Combined 107 prediction algorithms with a linear
    equation
§  Gave us the source code
2007 Progress Prize
§  Top 2 algorithms
   §  SVD - Prize RMSE: 0.8914
   §  RBM - Prize RMSE: 0.8990
§  Linear blend Prize RMSE: 0.88
§  Limitations
   §  Designed for 100M ratings, we have 5B ratings
   §  Not adaptable as users add ratings
   §  Performance issues
§  Currently in use as part of Netflix’s rating prediction component
SVD
X[m x n] = U[m x r] S[r x r] (V[n x r])^T

§  X: m x n matrix (e.g., m users, n videos)
§  U: m x r matrix (m users, r concepts)
§  S: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
§  V: n x r matrix (n videos, r concepts)
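As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of this decomposition on a toy ratings matrix, keeping only the top r singular values; the matrix and variable names are made up for the example.

import numpy as np

# Toy ratings matrix X: m users x n videos (unrated entries are 0 here for simplicity)
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

# Full SVD: X = U S V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-r "concepts"
r = 2
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # rank-r approximation of X

print("rank-%d reconstruction error: %.3f" % (r, np.linalg.norm(X - X_r)))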
Simon Funk’s SVD
§  One of the most
    interesting findings
    during the Netflix
    Prize came out of a
    blog post
§  Incremental, iterative,
    and approximate way
    to compute the SVD
    using gradient
    descent
                              http://sifter.org/~simon/journal/20061211.html   26
SVD for Rating Prediction
§  Associate each user with a user-factors vector pu ∈ ℜf
§  Associate each item with an item-factors vector qv ∈ ℜf
§  Define a baseline estimate buv = µ + bu + bv to account for
    user and item deviation from the average
§  Predict rating using the rule

    $r'_{uv} = b_{uv} + p_u^T q_v$

27
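A minimal sketch (Python/NumPy, illustrative only, not Netflix's code) of fitting this biased factor model with stochastic gradient descent, in the spirit of the incremental approach from the previous slide; the function name and hyperparameter values are assumptions.

import numpy as np

def train_mf_sgd(ratings, n_users, n_items, n_factors=20,
                 lr=0.005, reg=0.02, n_epochs=20):
    """ratings: list of (user, item, rating) triples with 0-based ids.
    Learns mu, user/item biases and factors for r'_uv = b_uv + p_u^T q_v."""
    ratings = list(ratings)
    mu = np.mean([r for _, _, r in ratings])              # global average
    bu, bi = np.zeros(n_users), np.zeros(n_items)         # user / item biases
    P = np.random.normal(0, 0.1, (n_users, n_factors))    # user factors p_u
    Q = np.random.normal(0, 0.1, (n_items, n_factors))    # item factors q_v

    for _ in range(n_epochs):
        np.random.shuffle(ratings)
        for u, v, r in ratings:
            err = r - (mu + bu[u] + bi[v] + P[u].dot(Q[v]))
            bu[u] += lr * (err - reg * bu[u])
            bi[v] += lr * (err - reg * bi[v])
            pu, qv = P[u].copy(), Q[v].copy()
            P[u] += lr * (err * qv - reg * pu)
            Q[v] += lr * (err * pu - reg * qv)
    return mu, bu, bi, P, Q

def predict(mu, bu, bi, P, Q, u, v):
    return mu + bu[u] + bi[v] + P[u].dot(Q[v])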
SVD++
§  Koren et al. proposed an asymmetric variation that includes
    implicit feedback:

    $r'_{uv} = b_{uv} + q_v^T \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$

§  Where
   §  qv, xv, yv ∈ ℜf are three item factor vectors
   §  Users are not parametrized, but rather represented by:
      §  R(u): items rated by user u
      §  N(u): items for which the user has given an implicit preference (e.g. rated
          vs. not rated)

28
RBM

First generation neural networks (~60s)
§  Perceptrons (~1960)
   §  Single layer of hand-coded features
   §  Linear activation function
   §  Fundamentally limited in what they can learn to do.
[Figure: input units (features) → non-adaptive hand-coded features → output units
(class labels: Like / Hate)]
Second generation neural networks (~80s)
[Figure: input features → hidden layers (non-linear activation function) → outputs;
compare output to correct answer to compute error signal; back-propagate error
signal to get derivatives for learning]
Belief Networks (~90s)
§  Directed acyclic graph composed of stochastic
    variables with weighted connections.
§  Can observe some of the variables
§  Solve two problems:
   §  Inference: Infer the states of the unobserved variables.
   §  Learning: Adjust the interactions between variables
       to make the network more likely to generate the observed data.
[Figure: stochastic hidden causes at the top, visible effects at the bottom]
Restricted Boltzmann Machine
§  Restrict the connectivity to make learning easier.
   §  Only one layer of hidden units.
      §  Although multiple layers are possible
   §  No connections between hidden units.
§  Hidden units are independent given the visible states.
   §  So we can quickly get an unbiased sample from the posterior distribution over
       hidden “causes” when given a data-vector
§  RBMs can be stacked to form Deep Belief Nets (DBN)
[Figure: bipartite graph of hidden units j and visible units i]
RBM for the Netflix Prize




                            34
What about the final prize ensembles?
§  Our offline studies showed they were too
    computationally intensive to scale
§  Expected improvement not worth the
    engineering effort
§  Plus, focus had already shifted to other
    issues that had more impact than rating
    prediction...

                                               35
Ranking
Key algorithm, sorts titles in most contexts
Ranking
§  Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
§  Goal: Find the best possible ordering of a set of videos for a user within a specific
    context in real-time
§  Objective: maximize consumption
§  Aspirations: Played & “enjoyed” titles have best score
§  Akin to CTR forecast for ads/search results
§  Factors
   §  Accuracy
   §  Novelty
   §  Diversity
   §  Freshness
   §  Scalability
   §  …
Ranking
§  Popularity is the obvious baseline
§  Ratings prediction is a clear secondary data
    input that allows for personalization
§  We have added many other features (and tried
    many more that have not proved useful)
§  What about the weights?
  §  Based on A/B testing
  §  Machine-learned
Example: Two features, linear model
[Figure: candidate videos plotted by Predicted Rating (y-axis) vs. Popularity (x-axis);
points labeled 1-5 indicate the final ranking]

Linear Model:  frank(u,v) = w1 p(v) + w2 r(u,v) + b

39
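To make this concrete, here is a toy Python sketch of scoring and sorting a candidate set with this two-feature linear model; the weight values, helper names, and candidate data are placeholders (in practice the weights are A/B-tested or machine-learned, as noted on the previous slide).

# Hypothetical two-feature linear ranker: popularity p(v) and predicted rating r(u, v)
W1, W2, B = 0.4, 0.6, 0.0   # placeholder weights (learned or A/B-tested in practice)

def frank(popularity, predicted_rating, w1=W1, w2=W2, b=B):
    """Ranking score for one (user, video) pair."""
    return w1 * popularity + w2 * predicted_rating + b

def rank_videos(candidates):
    """candidates: list of (video_id, popularity, predicted_rating).
    Returns video ids sorted by descending ranking score."""
    scored = [(frank(p, r), vid) for vid, p, r in candidates]
    return [vid for _, vid in sorted(scored, reverse=True)]

# Example: popularity normalized to [0, 1], predicted rating rescaled from a 1-5 scale
print(rank_videos([("A", 0.9, 3.2 / 5), ("B", 0.4, 4.8 / 5), ("C", 0.7, 4.0 / 5)]))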
Results




          40
Learning to rank
§  Machine learning problem: goal is to construct ranking
    model from training data
§  Training data can have partial order or binary judgments
    (relevant/not relevant).
§  Resulting order of the items typically induced from a
    numerical score
§  Learning to rank is a key element for personalization
§  You can treat the problem as a standard supervised
    classification problem

                                                               41
Learning to Rank Approaches
1.  Pointwise
   §    Ranking function minimizes loss function defined on individual
         relevance judgment
   §    Ranking score based on regression or classification
   §    Ordinal regression, Logistic regression, SVM, GBDT, …
2.  Pairwise
   §    Loss function is defined on pair-wise preferences
   §    Goal: minimize number of inversions in ranking
   §    Ranking problem is then transformed into the binary classification
         problem
   §    RankSVM, RankBoost, RankNet, FRank…
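As an illustration of the pairwise idea (reducing ranking to binary classification over preference pairs), here is a small hedged Python sketch; the feature construction, the toy data, and the use of scikit-learn's LogisticRegression are assumptions for the example, not a specific system described in the slides.

import numpy as np
from sklearn.linear_model import LogisticRegression

def to_pairwise(X, y):
    """Turn pointwise examples (features X, relevance y) from one ranking list
    into pairwise examples: feature differences labeled by which item wins."""
    Xp, yp = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                 # item i preferred over item j
                Xp.append(X[i] - X[j]); yp.append(1)
                Xp.append(X[j] - X[i]); yp.append(0)
    return np.array(Xp), np.array(yp)

# Toy list: 4 items, 2 features each, with graded relevance judgments
X = np.array([[0.9, 0.2], [0.3, 0.8], [0.5, 0.5], [0.1, 0.1]])
y = np.array([3, 2, 2, 0])

Xp, yp = to_pairwise(X, y)
clf = LogisticRegression().fit(Xp, yp)      # RankSVM-style idea with a logistic loss

# Rank items by the learned linear score w^T x
scores = X @ clf.coef_.ravel()
print(np.argsort(-scores))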
Learning to rank - metrics
§  Quality of ranking measured using metrics such as:
   §  Normalized Discounted Cumulative Gain (NDCG)

      $NDCG = \frac{DCG}{IDCG}$  where  $DCG = relevance_1 + \sum_{i=2}^{n} \frac{relevance_i}{\log_2 i}$  and IDCG is the DCG of the ideal ranking

   §  Mean Reciprocal Rank (MRR)

      $MRR = \frac{1}{|H|} \sum_{h \in H} \frac{1}{rank(h)}$  where the h are the positive “hits” from the user

   §  Mean Average Precision (MAP)

      $MAP = \frac{\sum_{n=1}^{N} AveP(n)}{N}$  where N can be the number of users, items… and $P = \frac{tp}{tp + fp}$

43
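For reference, a small Python sketch of two of these metrics exactly as defined above (DCG in the relevance_1 + sum of rel_i/log2(i) form, and MRR over the user's hits); the function names and example inputs are illustrative.

import math

def dcg(relevances):
    """DCG = rel_1 + sum_{i>=2} rel_i / log2(i), positions are 1-based."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """NDCG = DCG of the given ranking / DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr(ranks_of_hits):
    """MRR = (1/|H|) * sum over hits h of 1/rank(h), ranks are 1-based."""
    return sum(1.0 / r for r in ranks_of_hits) / len(ranks_of_hits)

# Example: relevance of items in the order the ranker showed them
print(ndcg([3, 2, 0, 1]))        # vs. the ideal order [3, 2, 1, 0]
print(mrr([1, 3, 10]))           # positions at which the user's "hits" appeared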
Learning to rank - metrics
§  Quality of ranking measured using metrics such as:
   §  Fraction of Concordant Pairs (FCP)
      §  Given items xi and xj, a user preference P and a ranking method R, a concordant
          pair (CP) is $\{x_i, x_j\}$ such that $P(x_i) > P(x_j) \Leftrightarrow R(x_i) < R(x_j)$
      §  Then  $FCP = \frac{\sum_{i \neq j} CP(x_i, x_j)}{n(n-1)/2}$
   §  Others…
§  But, it is hard to optimize machine-learned models directly on
    these measures
   §  They are not differentiable
§  Recent research on models that directly optimize ranking
    measures

44
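A minimal sketch of FCP as defined above, comparing a user-preference score P with the positions R produced by a ranking method; the inputs are made up, and tied preferences are simply skipped, which is one of several possible conventions.

from itertools import combinations

def fcp(preference, rank_position):
    """Fraction of Concordant Pairs.
    preference[i]: user's true preference score for item i (higher = better)
    rank_position[i]: position the ranker gave item i (lower = ranked higher)
    A pair (i, j) is concordant when P(i) > P(j) <=> R(i) < R(j)."""
    n = len(preference)
    concordant = sum(
        1 for i, j in combinations(range(n), 2)
        if preference[i] != preference[j]
        and (preference[i] > preference[j]) == (rank_position[i] < rank_position[j])
    )
    return concordant / (n * (n - 1) / 2)

# Example: 4 items, true preferences vs. the positions the ranker assigned
print(fcp(preference=[4.5, 3.0, 2.0, 5.0], rank_position=[2, 3, 4, 1]))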
Learning to Rank Approaches
3.  Listwise
   a.  Directly optimizing IR measures (difficult since they are not differentiable)
      §  Directly optimize IR measures through Genetic Programming
      §  Directly optimize measures with Simulated Annealing
      §  Gradient descent on a smoothed version of the objective function
      §  SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
      §  AdaRank uses boosting to optimize NDCG
   b.  Indirect Loss Function
      §  RankCosine uses similarity between the ranking list and the ground truth as
          loss function
      §  ListNet uses KL-divergence as loss function by defining a probability
          distribution
      §  Problem: optimization of the listwise loss function does not necessarily
          optimize IR metrics
Similars

           §  Different similarities computed
               from different sources: metadata,
               ratings, viewing data…
           §  Similarities can be treated as
               data/features
           §  Machine Learned models
               improve our concept of “similarity”




                                                46
Data & Models - Recap
§  All sorts of feedback from the user can help generate better
    recommendations
§  Need to design systems that capture and take advantage of
    all this data
§  The right model is as important as the right data
§  It is important to come up with new theoretical models, but
    also need to think about application to a domain, and practical
    issues
§  Rating prediction models are only part of the solution to
    recommendation (think about ranking, similarity…)

                                                                      47
Consumer
(Data) Science
Consumer Science
§  Main goal is to effectively innovate for customers
§  Innovation goals
  §  “If you want to increase your success rate, double
      your failure rate.” – Thomas Watson, Sr., founder of
      IBM
  §  The only real failure is the failure to innovate
  §  Fail cheaply
  §  Know why you failed/succeeded

                                                             49
Consumer (Data) Science
1.  Start with a hypothesis:
   §  Algorithm/feature/design X will increase member engagement
       with our service, and ultimately member retention
2.  Design a test
   §  Develop a solution or prototype
   §  Think about dependent & independent variables, control,
       significance…
3.  Execute the test
4.  Let data speak for itself

                                                                    50
Offline/Online testing process
[Flow: Offline testing (days) → [success] → Online A/B testing (weeks to months) →
[success] → Rollout Feature to all users; [fail] → back to offline testing]

51
Offline testing process
[Flow:
Initial Hypothesis → Decide Model → Train Model offline → Test offline →
Hypothesis validated offline?
   [yes] → Online A/B testing: Rollout Prototype → Wait for Results → Analyze Results →
           Significant improvement on users? → [yes] → Rollout Feature to all users
   [no] → Try different model? → [yes] → Decide Model; [no] → Reformulate Hypothesis
   [fail] (from the A/B test) → Reformulate Hypothesis]

52
Offline testing
§  Optimize algorithms offline
§  Measure model performance, using metrics such as:
   §  Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of
       Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity…

§  Offline performance used as an indication to make informed
    decisions on follow-up A/B tests
§  A critical (and unsolved) issue is how offline metrics can
    correlate with A/B test results.
§  Extremely important to define a coherent offline evaluation
    framework (e.g. How to create training/testing datasets is not
    trivial)

                                                                                  53
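As one illustration of why dataset construction matters, here is a hedged Python sketch of a time-based split (train on interactions before a cutoff, evaluate on what each user did afterwards, so the model never sees the future); the data layout is hypothetical and other splitting schemes are of course possible.

from collections import defaultdict

def temporal_split(events, cutoff_ts):
    """events: iterable of (user_id, item_id, timestamp) interactions.
    Train on everything before the cutoff; the test set holds, per user,
    the items they interacted with afterwards."""
    train, test = [], defaultdict(set)
    for user, item, ts in events:
        if ts < cutoff_ts:
            train.append((user, item, ts))
        else:
            test[user].add(item)
    # Only evaluate users who have both training history and test interactions
    seen_users = {u for u, _, _ in train}
    test = {u: items for u, items in test.items() if u in seen_users}
    return train, test

events = [("u1", "a", 10), ("u1", "b", 30), ("u2", "a", 5), ("u2", "c", 40), ("u3", "b", 50)]
train, test = temporal_split(events, cutoff_ts=25)
print(len(train), test)   # u3 has no history before the cutoff, so it is dropped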
Online A/B testing process
[Flow (continuing from the offline loop):
Hypothesis validated offline? → [success] → Design A/B Test → Choose Control Group →
Rollout Prototype → Wait for Results → Analyze Results →
Significant improvement on users?
   [yes] → Rollout Feature to all users
   [no] → Reformulate Hypothesis
Offline loop as before: Decide Model → Train Model offline → Offline testing →
Hypothesis validated offline? [no] → Try different model? / Reformulate Hypothesis]

54
Executing A/B tests
§  Many different metrics, but ultimately trust user
    engagement (e.g. hours of play and customer retention)
§  Think about significance and hypothesis testing
   §  Our tests usually have thousands of members and 2-20 cells
§  A/B Tests allow you to try radical ideas or test many
    approaches at the same time.
   §  We typically have hundreds of customer A/B tests running
§  Decisions on the product always data-driven

                                                                    55
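For illustration, a hedged sketch of the kind of significance check one might run on a retention-style metric from two test cells (a simple two-proportion z-test); the numbers are made up and this is not a description of Netflix's internal tooling. With many cells, corrections for multiple comparisons would also matter.

import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """z-test for the difference between two retention (conversion) rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up example: control cell vs. one test cell, members retained after a month
z, p = two_proportion_z_test(success_a=8400, n_a=10000, success_b=8550, n_b=10000)
print("z = %.2f, p = %.4f" % (z, p))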
What to measure
§  OEC: Overall Evaluation Criteria
§  In an AB test framework, the measure of success is key
§  Short-term metrics do not always align with long term
    goals
   §  E.g. CTR: generating more clicks might mean that our
       recommendations are actually worse
§  Use long term metrics such as LTV (Life time value)
    whenever possible
   §  In Netflix, we use member retention
                                                              56
What to measure
§  Short-term metrics can sometimes be informative, and
    may allow for faster decision-making
   §  At Netflix we use many, such as hours streamed by users or
       % hours from a given algorithm
§  But, be aware of several caveats of using early decision
    mechanisms
   [Figure note: initial effects appear to trend. See “Trustworthy Online Controlled
   Experiments: Five Puzzling Outcomes Explained” (Kohavi et al., KDD 2012)]

57
Consumer Data Science - Recap
§  Consumer Data Science aims to innovate for the
    customer by running experiments and letting data speak
§  This is mainly done through online AB Testing
§  However, we can speed up innovation by experimenting
    offline
§  But, both for online and offline experimentation, it is
    important to choose the right metric and experimental
    framework

                                                              58
Architectures



                59
Technology




http://techblog.netflix.com
60
61
Event & Data
Distribution




               62
Event & Data Distribution
•  UI devices should broadcast many
   different kinds of user events
    •    Clicks
    •    Presentations
    •    Browsing events
    •    …
•  Events vs. data
    •  Some events only need to be
       propagated and trigger an action
       (low latency, low information per
       event)
    •  Others need to be processed and
       “turned into” data (higher latency,
       higher information quality).
    •  And… there are many in between
•  Real-time event flow managed
   through internal tool (Manhattan)
•  Data flow mostly managed through
   Hadoop.

                                             63
Offline Jobs




               64
Offline Jobs
•  Two kinds of offline jobs
     •  Model training
     •  Batch offline computation of
        recommendations/
        intermediate results
•  Offline queries either in Hive or
   PIG
•  Need a publishing mechanism
   that solves several issues
     •  Notify readers when result of
        query is ready
     •  Support different repositories
        (s3, cassandra…)
     •  Handle errors, monitoring…
     •  We do this through Hermes
                                         65
Computation




              66
Computation
•  Two ways of computing personalized
   results
    •  Batch/offline
    •  Online
•  Each approach has pros/cons
    •  Offline
         +    Allows more complex computations
         +    Can use more data
         -    Cannot react to quick changes
         -    May result in staleness
    •  Online
         +    Can respond quickly to events
         +    Can use most recent data
         -    May fail because of SLA
         -    Cannot deal with “complex”
              computations
•  It’s not an either/or decision
    •  Both approaches can be combined

                                                 67
Signals & Models




                   68
Signals & Models

•  Both offline and online algorithms are
   based on three different inputs:
    •  Models: previously trained from
       existing data
    •  (Offline) Data: previously
       processed and stored information
    •  Signals: fresh data obtained from
       live services
        •  User-related data
        •  Context data (session, date,
           time…)



                                          69
Results




          70
Results
•  Recommendations can be serviced
   from:
    •  Previously computed lists
    •  Online algorithms
    •  A combination of both
•  The decision on where to service the
   recommendation from can respond to
   many factors including context.
•  Also, important to think about the
   fallbacks (what if plan A fails)
•  Previously computed lists/intermediate
   results can be stored in a variety of
   ways
     •  Cache
     •  Cassandra
     •  Relational DB
                                            71
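A toy sketch of the "plan A with fallbacks" idea for serving a row: try the online algorithm within an SLA, fall back to a previously computed list, and finally to an unpersonalized default. All names here are hypothetical; the slides do not describe a specific API.

import concurrent.futures

# A shared pool for online ranking calls (illustrative)
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def serve_row(user_id, online_ranker, precomputed_store, default_row, sla_seconds=0.1):
    """Return a list of video ids, degrading gracefully if plan A fails."""
    # Plan A: online computation, bounded by an SLA
    try:
        return _pool.submit(online_ranker, user_id).result(timeout=sla_seconds)
    except Exception:
        pass  # timeout or online failure: fall through

    # Plan B: previously computed list (e.g. from a cache / Cassandra)
    cached = precomputed_store.get(user_id)
    if cached:
        return cached

    # Plan C: unpersonalized default (e.g. popularity)
    return default_row

Note that the timed-out online call is simply abandoned rather than cancelled; a real service would also need to account for that wasted work.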
Alerts and Monitoring
§  A non-trivial concern in large-scale recommender
    systems
§  Monitoring: continuously observe quality of system
§  Alert: fast notification if quality of system goes below a
    certain pre-defined threshold
§  Questions:
   §  What do we need to monitor?
   §  How do we know something is “bad enough” to alert


                                                                 72
What to monitor
§  Staleness
   §  Monitor time since last data update
[Figure: data-update metric over time, annotated “Did something go wrong here?”]

73
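A toy sketch of the staleness check described above: track when each data source was last updated and flag any source older than its threshold. The source names and thresholds are made up for illustration.

import time

# Maximum tolerated age per data source, in seconds (illustrative thresholds)
STALENESS_THRESHOLDS = {
    "ratings_snapshot": 24 * 3600,
    "popularity_ranking": 6 * 3600,
    "similars_lists": 48 * 3600,
}

def check_staleness(last_update_ts, now=None):
    """last_update_ts: dict of source -> unix timestamp of its last successful update.
    Returns the list of (source, age) pairs that should trigger an alert."""
    now = now or time.time()
    alerts = []
    for source, max_age in STALENESS_THRESHOLDS.items():
        age = now - last_update_ts.get(source, 0)   # missing source counts as stale
        if age > max_age:
            alerts.append((source, age))
    return alerts

for source, age in check_staleness({"ratings_snapshot": time.time() - 30 * 3600,
                                    "popularity_ranking": time.time() - 3600,
                                    "similars_lists": time.time() - 10 * 3600}):
    print("ALERT: %s has not been updated for %.1f hours" % (source, age / 3600))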
What to monitor
§  Algorithmic quality
   §  Monitor different metrics by comparing what users do and what
       your algorithm predicted they would do




                                                                       74
What to monitor
§  Algorithmic quality
   §  Monitor different metrics by comparing what users do and what
       your algorithm predicted they would do
[Figure: same metric plot as before, annotated “Did something go wrong here?”]

75
What to monitor
§  Algorithmic source for users
   §  Monitor how users interact with different algorithms
[Figure: usage of Algorithm X vs. a new version over time, annotated “Did something
go wrong here?”]

76
When to alert
§  Alerting thresholds are hard to tune
   §  Avoid unnecessary alerts (the “learn-to-ignore problem”)
   §  Avoid important issues being noticed before the alert happens
§  Rules of thumb
   §  Alert on anything that will impact user experience significantly
   §  Alert on issues that are actionable
   §  If a noticeable event happens without an alert… add a new alert
       for next time



                                                                          77
Conclusions

              78
The Personalization Problem
§  The Netflix Prize simplified the recommendation problem
    to predicting ratings
§  But…
  §  User ratings are only one of the many data inputs we have
  §  Rating predictions are only part of our solution
     §  Other algorithms such as ranking or similarity are very important

§  We can reformulate the recommendation problem
  §  Function to optimize: probability a user chooses something and
      enjoys it enough to come back to the service

                                                                             79
More to Recsys than Algorithms
§  Not only is there more to algorithms than rating
    prediction
§  There is more to Recsys than algorithms
   §  User Interface & Feedback
   §  Data
   §  AB Testing
   §  Systems & Architectures




                                                       80
More data +
         Better models +
     More accurate metrics +
Better approaches & architectures
  Lots of room for improvement!
                                    81
We’re hiring!




Xavier Amatriain (@xamat)
 xamatriain@netflix.com
