SlideShare a Scribd company logo
1 of 17
Evaluation Methods and Challenges
Evaluation Methods

• Ideal method
   – Experimental Design: Run side-by-side experiments on a small
     fraction of randomly selected traffic with new method (treatment)
     and status quo (control)
   – Limitation
       • Often expensive and difficult to test large number of methods


• Problem: How do we evaluate methods offline on logged
  data?
   – Goal: To maximize clicks/revenue and not prediction accuracy on
     the entire system. Cost of predictive inaccuracy for different
     instances vary.
       • E.g. 100% error on a low CTR article may not matter much
         because it always co-occurs with a high CTR article that is
         predicted accurately


 Deepak Agarwal & Bee-Chung Chen @ ICML’11                           2
Usual Metrics

• Predictive accuracy
    – Root Mean Squared Error (RMSE)
    – Mean Absolute Error (MAE)
    – Area under the Curve, ROC
• Other rank based measures based on retrieval accuracy for top-k

    – Recall in test data
       • What Fraction of items that user actually liked in the test data were
         among the top-k recommended by the algorithm (fraction of hits, e.g.
         Karypsis, CIKM 2001)
• One flaw in several papers
    – Training and test split are not based on time.
       • Information leakage
       • Even in Netflix, this is the case to some extent
               – Time split per user, not per event. For instance, information may leak if
                 models are based on user-user similarity.

  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                                  3
Metrics continued..

• Recall per event based on Replay-Match method
    – Fraction of clicked events where the top recommended item
      matches the clicked one.


• This is good if logged data collected from a randomized
  serving scheme, with biased data this could be a problem
    – We will be inventing algorithms that provide recommendations that
      are similar to the current one
       • No reward for novel recommendations




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                          4
Details on Replay-Match method (Li, Langford, et al)

• x: feature vector for a visit
• r = [r1,r2,…,rK]: reward vector for the K items in inventory
• h(x): recommendation algorithm to be evaluated
• Goal: Estimate expected reward for h(x)




• s(x): recommendation scheme that generated logged-data
• x1,..,xT: visits in the logged data
• rti: reward for visit t, where i = s(xt)

  Deepak Agarwal & Bee-Chung Chen @ ICML’11                      5
Replay-Match continued

• Estimator




• If importance weights
  and

   – It can be shown estimator is unbiased


• E.g. if s(x) is random serving scheme, importance weights
  are uniform over the item set
• If s(x) is not random, importance weights have to be
  estimated through a model
 Deepak Agarwal & Bee-Chung Chen @ ICML’11                6
Back to Multi-Objective Optimization


                              EDITORIAL
 Recommender                                    AD SERVER

               •Clicks on FP links influence      PREMIUM display
               downstream supply distribution
                                                   (GUARANTEED)
content
                                                  Spot Market (Cheaper)




                                                 Downstream
                                                 engagement

                                                  (Time spent)




   Deepak Agarwal & Bee-Chung Chen @ ICML’11                      7
Serving Content on Front Page: Click Shaping

• What do we want to optimize?
• Current: Maximize clicks (maximize downstream supply from FP)
• But consider the following
    – Article 1: CTR=5%, utility per click = 5
    – Article 2: CTR=4.9%, utility per click=10
       • By promoting 2, we lose 1 click/100 visits, gain 5 utils
• If we do this for a large number of visits --- lose some clicks but obtain
  significant gains in utility?
    – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc)




  Deepak Agarwal & Bee-Chung Chen @ ICML’11                                   8
Why call it Click Shaping?
                                                                                            other
                                                                                                                  buzz
                other                                                                video              autos            finance
                              buzz   finance                                   videogames                                    gmy.news
        video                                                                        tv                                       health
                          autos
  videogames                            gmy.news
                                          health                              travel                                           hotjobs
         tv
                                          hotjobs                             tech                                              movies
  travel
                                                                                                                                new.music
                                            movies
   tech                                                                      sports
                                            new.music
                                                                                                                                         •AFTER
                                             •BEFORE                       shopping
 sports                                                                                                                       news
                                                                               shine
                                                                                   rivals
shopping                                                                                              omg
                                         news                                  realestate
    shine
                                            10.00%
       rivals
                        omg                  8.00%
   realestate                                6.00%
                                             4.00%
                                             2.00%
  •Supply distribution                       0.00%




                                                                                                                   ts
  •Changes




                                                                                                                 ing
                                                                            bs
                                                           s




                                                                                          omg


                                                                                               s




                                                                                                               te ch
                                             -2.00%




                                                                                                         e




                                                                                                                                     tv
                                                                           ie s


                                                                              s
                                                                              s




                                                                                                                                                  r
                                                                                                                                      o
                                                               buzz




                                                                                                                    l
                                                                                         s tat e
                                                                             ic
                                                                            ce


                                                                            th




                                                                                                                                                es
                                                                                                             tra ve




                                                                                                                                             othe
                                                      aut o




                                                                                                    shin
                                                                                         rival
                                                                        new
                                                                       .ne w




                                                                                                                                 vide
                                                                                                             spor
                                                                      .mus
                                                                     hotjo
                                                                       heal
                                                                     finan




                                                                                                                                           gam
                                                                      mov
                                             -4.00%




                                                                                                              p
                                                                                                         shop
                                                                                   rea le
                                                                   gmy


                                             -6.00%


                                                                  new




                                                                                                                                             o
                                                                                                                                         vide
                                             -8.00%
                                            -10.00%




           •SHAPING can happen with respect to any downstream metrics (like engagement)




       Deepak Agarwal & Bee-Chung Chen @ ICML’11                                                                                                      9
Multi-Objective Optimization
     K properties                                m user segments
                                    n articles



                              •A1                        •S1

         •news



                              •A2                        •S2

       •finance


            •…                        •…                  •…


         •omg
                              •An                        •Sm




                    • CTR of user segment i on article j: pij
                    • Time duration of i on j: dij
                                                                    •10
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                        10
Multi-Objective Program
    Scalarization




  Goal Programming




                     Simplex constraints on xiJ is always applied
                     Constraints are linear


         Every 10 mins, solve x
         Use this x as the serving scheme in the next 10 mins
                                                                     11
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                         11
Pareto-optimal solution (more in KDD 2011)




                                               •12
  Deepak Agarwal & Bee-Chung Chen @ ICML’11   12
Summary

• Modern recommendation systems on the web crucially depend on
  extracting intelligence from massive amounts of data collected on a
  routine basis
• Lots of data and processing power not enough, the number of things
  we need to learn grows with data size
• Extracting grouping structures at coarser resolutions based on
  similarity (correlations) is important
   – ML has a big role to play here
• Continuous and adaptive experimentation in a judicious manner crucial
  to maximize performance
   – Again, ML has a big role to play
• Multi-objective optimization is often required, the objectives are
  application dependent.
   – ML has to work in close collaboration with engineering, product &
     business execs
  Deepak Agarwal & Bee-Chung Chen @ ICML’11                             13
Challenges
Recall: Some examples

• Simple version
   – I have an important module on my page, content inventory is
     obtained from a third party source which is further refined through
     editorial oversight. Can I algorithmically recommend content on this
     module? I want to drive up total CTR on this module
• More advanced
   – I got X% lift in CTR. But I have additional information on other
     downstream utilities (e.g. dwell time). Can I increase downstream
     utility without losing too many clicks?
• Highly advanced
   – There are multiple modules running on my website. How do I take
     a holistic approach and perform a simultaneous optimization?




 Deepak Agarwal & Bee-Chung Chen @ ICML’11                            15
For the simple version

• Multi-position optimization
    – Explore/exploit, optimal subset selection


• Explore/Exploit strategies for large content pool and high
  dimensional problems
    – Some work on hierarchical bandits but more needs to be done


• Constructing user profiles from multiple sources with less
  than full coverage
    – Couple of papers at KDD 2011
• Content understanding
• Metrics to measure user engagement (other than CTR)


  Deepak Agarwal & Bee-Chung Chen @ ICML’11                         16
Other problems

• Whole page optimization
   – Incorporating correlations




• Incentivizing User generated content


• Incorporating Social information for better recommendation


• Multi-context Learning



 Deepak Agarwal & Bee-Chung Chen @ ICML’11                17

More Related Content

Similar to Evaluate Methods for Improving Recommendations

Metrics for start up ninjas
Metrics for start up ninjasMetrics for start up ninjas
Metrics for start up ninjasAman Sandhu
 
Startup Metrics for Pirates (Nov 2012)
Startup Metrics for Pirates (Nov 2012)Startup Metrics for Pirates (Nov 2012)
Startup Metrics for Pirates (Nov 2012)Dave McClure
 
Michael Goguen, Sequoia Capital: Think Big, Start Small
Michael Goguen, Sequoia Capital: Think Big, Start SmallMichael Goguen, Sequoia Capital: Think Big, Start Small
Michael Goguen, Sequoia Capital: Think Big, Start SmallDanuta Pysarenko
 
myThings - NOAH12 London
myThings - NOAH12 LondonmyThings - NOAH12 London
myThings - NOAH12 LondonMarco Rodzynek
 
Startup Metrics 4 Pirates (Seoul)
Startup Metrics 4 Pirates (Seoul)Startup Metrics 4 Pirates (Seoul)
Startup Metrics 4 Pirates (Seoul)Dave McClure
 
Attribution SESsf 2012 by @stevelatham - Encore Media Metrics
Attribution SESsf 2012 by @stevelatham - Encore Media MetricsAttribution SESsf 2012 by @stevelatham - Encore Media Metrics
Attribution SESsf 2012 by @stevelatham - Encore Media MetricsEncore Media Metrics
 
TCC - Digital Download - January 2013
TCC - Digital Download - January 2013TCC - Digital Download - January 2013
TCC - Digital Download - January 2013WiTH Collective
 
Creator's portal website 「Re:Quest.com」
Creator's portal website 「Re:Quest.com」Creator's portal website 「Re:Quest.com」
Creator's portal website 「Re:Quest.com」Takeaki Tsutsumi
 
1998 Recruiting Strategy
1998 Recruiting Strategy1998 Recruiting Strategy
1998 Recruiting StrategyJohn Sumser
 
Customer Development 2: Three types of markets
Customer Development 2: Three types of marketsCustomer Development 2: Three types of markets
Customer Development 2: Three types of marketsVenture Hacks
 
Attribution Demystified: Digital World Expo 2012
Attribution Demystified: Digital World Expo 2012Attribution Demystified: Digital World Expo 2012
Attribution Demystified: Digital World Expo 2012Encore Media Metrics
 
UX & ROI: What to measure and what to expect
UX & ROI: What to measure and what to expectUX & ROI: What to measure and what to expect
UX & ROI: What to measure and what to expectcxpartners
 
QM-038-什麼是6Sigma good
QM-038-什麼是6Sigma goodQM-038-什麼是6Sigma good
QM-038-什麼是6Sigma goodhandbook
 
About Our Recommender System
About Our Recommender SystemAbout Our Recommender System
About Our Recommender SystemKimikazu Kato
 

Similar to Evaluate Methods for Improving Recommendations (20)

Astia Pitch
Astia PitchAstia Pitch
Astia Pitch
 
Metrics for start up ninjas
Metrics for start up ninjasMetrics for start up ninjas
Metrics for start up ninjas
 
Startup Metrics for Pirates (Nov 2012)
Startup Metrics for Pirates (Nov 2012)Startup Metrics for Pirates (Nov 2012)
Startup Metrics for Pirates (Nov 2012)
 
Michael Goguen, Sequoia Capital: Think Big, Start Small
Michael Goguen, Sequoia Capital: Think Big, Start SmallMichael Goguen, Sequoia Capital: Think Big, Start Small
Michael Goguen, Sequoia Capital: Think Big, Start Small
 
myThings - NOAH12 London
myThings - NOAH12 LondonmyThings - NOAH12 London
myThings - NOAH12 London
 
Startup Metrics 4 Pirates (Seoul)
Startup Metrics 4 Pirates (Seoul)Startup Metrics 4 Pirates (Seoul)
Startup Metrics 4 Pirates (Seoul)
 
Attribution SESsf 2012 by @stevelatham - Encore Media Metrics
Attribution SESsf 2012 by @stevelatham - Encore Media MetricsAttribution SESsf 2012 by @stevelatham - Encore Media Metrics
Attribution SESsf 2012 by @stevelatham - Encore Media Metrics
 
TCC - Digital Download - January 2013
TCC - Digital Download - January 2013TCC - Digital Download - January 2013
TCC - Digital Download - January 2013
 
Creator's portal website 「Re:Quest.com」
Creator's portal website 「Re:Quest.com」Creator's portal website 「Re:Quest.com」
Creator's portal website 「Re:Quest.com」
 
1998 Recruiting Strategy
1998 Recruiting Strategy1998 Recruiting Strategy
1998 Recruiting Strategy
 
Customer Development 2: Three types of markets
Customer Development 2: Three types of marketsCustomer Development 2: Three types of markets
Customer Development 2: Three types of markets
 
Attribution Demystified: Digital World Expo 2012
Attribution Demystified: Digital World Expo 2012Attribution Demystified: Digital World Expo 2012
Attribution Demystified: Digital World Expo 2012
 
UX & ROI: What to measure and what to expect
UX & ROI: What to measure and what to expectUX & ROI: What to measure and what to expect
UX & ROI: What to measure and what to expect
 
Presentation vn
Presentation vnPresentation vn
Presentation vn
 
Web Analytics Whitepaper
Web Analytics WhitepaperWeb Analytics Whitepaper
Web Analytics Whitepaper
 
Alan Whitford
Alan WhitfordAlan Whitford
Alan Whitford
 
Lecture 6 revenue model
Lecture 6   revenue modelLecture 6   revenue model
Lecture 6 revenue model
 
QM-038-什麼是6Sigma good
QM-038-什麼是6Sigma goodQM-038-什麼是6Sigma good
QM-038-什麼是6Sigma good
 
About Our Recommender System
About Our Recommender SystemAbout Our Recommender System
About Our Recommender System
 
Enterprise Gamification by itzCorinne
Enterprise Gamification by itzCorinneEnterprise Gamification by itzCorinne
Enterprise Gamification by itzCorinne
 

Evaluate Methods for Improving Recommendations

  • 2. Evaluation Methods • Ideal method – Experimental Design: Run side-by-side experiments on a small fraction of randomly selected traffic with new method (treatment) and status quo (control) – Limitation • Often expensive and difficult to test large number of methods • Problem: How do we evaluate methods offline on logged data? – Goal: To maximize clicks/revenue and not prediction accuracy on the entire system. Cost of predictive inaccuracy for different instances vary. • E.g. 100% error on a low CTR article may not matter much because it always co-occurs with a high CTR article that is predicted accurately Deepak Agarwal & Bee-Chung Chen @ ICML’11 2
  • 3. Usual Metrics • Predictive accuracy – Root Mean Squared Error (RMSE) – Mean Absolute Error (MAE) – Area under the Curve, ROC • Other rank based measures based on retrieval accuracy for top-k – Recall in test data • What Fraction of items that user actually liked in the test data were among the top-k recommended by the algorithm (fraction of hits, e.g. Karypsis, CIKM 2001) • One flaw in several papers – Training and test split are not based on time. • Information leakage • Even in Netflix, this is the case to some extent – Time split per user, not per event. For instance, information may leak if models are based on user-user similarity. Deepak Agarwal & Bee-Chung Chen @ ICML’11 3
  • 4. Metrics continued.. • Recall per event based on Replay-Match method – Fraction of clicked events where the top recommended item matches the clicked one. • This is good if logged data collected from a randomized serving scheme, with biased data this could be a problem – We will be inventing algorithms that provide recommendations that are similar to the current one • No reward for novel recommendations Deepak Agarwal & Bee-Chung Chen @ ICML’11 4
  • 5. Details on Replay-Match method (Li, Langford, et al) • x: feature vector for a visit • r = [r1,r2,…,rK]: reward vector for the K items in inventory • h(x): recommendation algorithm to be evaluated • Goal: Estimate expected reward for h(x) • s(x): recommendation scheme that generated logged-data • x1,..,xT: visits in the logged data • rti: reward for visit t, where i = s(xt) Deepak Agarwal & Bee-Chung Chen @ ICML’11 5
  • 6. Replay-Match continued • Estimator • If importance weights and – It can be shown estimator is unbiased • E.g. if s(x) is random serving scheme, importance weights are uniform over the item set • If s(x) is not random, importance weights have to be estimated through a model Deepak Agarwal & Bee-Chung Chen @ ICML’11 6
  • 7. Back to Multi-Objective Optimization EDITORIAL Recommender AD SERVER •Clicks on FP links influence PREMIUM display downstream supply distribution (GUARANTEED) content Spot Market (Cheaper) Downstream engagement (Time spent) Deepak Agarwal & Bee-Chung Chen @ ICML’11 7
  • 8. Serving Content on Front Page: Click Shaping • What do we want to optimize? • Current: Maximize clicks (maximize downstream supply from FP) • But consider the following – Article 1: CTR=5%, utility per click = 5 – Article 2: CTR=4.9%, utility per click=10 • By promoting 2, we lose 1 click/100 visits, gain 5 utils • If we do this for a large number of visits --- lose some clicks but obtain significant gains in utility? – E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc) Deepak Agarwal & Bee-Chung Chen @ ICML’11 8
  • 9. Why call it Click Shaping? other buzz other video autos finance buzz finance videogames gmy.news video tv health autos videogames gmy.news health travel hotjobs tv hotjobs tech movies travel new.music movies tech sports new.music •AFTER •BEFORE shopping sports news shine rivals shopping omg news realestate shine 10.00% rivals omg 8.00% realestate 6.00% 4.00% 2.00% •Supply distribution 0.00% ts •Changes ing bs s omg s te ch -2.00% e tv ie s s s r o buzz l s tat e ic ce th es tra ve othe aut o shin rival new .ne w vide spor .mus hotjo heal finan gam mov -4.00% p shop rea le gmy -6.00% new o vide -8.00% -10.00% •SHAPING can happen with respect to any downstream metrics (like engagement) Deepak Agarwal & Bee-Chung Chen @ ICML’11 9
  • 10. Multi-Objective Optimization K properties m user segments n articles •A1 •S1 •news •A2 •S2 •finance •… •… •… •omg •An •Sm • CTR of user segment i on article j: pij • Time duration of i on j: dij •10 Deepak Agarwal & Bee-Chung Chen @ ICML’11 10
  • 11. Multi-Objective Program  Scalarization Goal Programming Simplex constraints on xiJ is always applied Constraints are linear Every 10 mins, solve x Use this x as the serving scheme in the next 10 mins 11 Deepak Agarwal & Bee-Chung Chen @ ICML’11 11
  • 12. Pareto-optimal solution (more in KDD 2011) •12 Deepak Agarwal & Bee-Chung Chen @ ICML’11 12
  • 13. Summary • Modern recommendation systems on the web crucially depend on extracting intelligence from massive amounts of data collected on a routine basis • Lots of data and processing power not enough, the number of things we need to learn grows with data size • Extracting grouping structures at coarser resolutions based on similarity (correlations) is important – ML has a big role to play here • Continuous and adaptive experimentation in a judicious manner crucial to maximize performance – Again, ML has a big role to play • Multi-objective optimization is often required, the objectives are application dependent. – ML has to work in close collaboration with engineering, product & business execs Deepak Agarwal & Bee-Chung Chen @ ICML’11 13
  • 15. Recall: Some examples • Simple version – I have an important module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to drive up total CTR on this module • More advanced – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. dwell time). Can I increase downstream utility without losing too many clicks? • Highly advanced – There are multiple modules running on my website. How do I take a holistic approach and perform a simultaneous optimization? Deepak Agarwal & Bee-Chung Chen @ ICML’11 15
  • 16. For the simple version • Multi-position optimization – Explore/exploit, optimal subset selection • Explore/Exploit strategies for large content pool and high dimensional problems – Some work on hierarchical bandits but more needs to be done • Constructing user profiles from multiple sources with less than full coverage – Couple of papers at KDD 2011 • Content understanding • Metrics to measure user engagement (other than CTR) Deepak Agarwal & Bee-Chung Chen @ ICML’11 16
  • 17. Other problems • Whole page optimization – Incorporating correlations • Incentivizing User generated content • Incorporating Social information for better recommendation • Multi-context Learning Deepak Agarwal & Bee-Chung Chen @ ICML’11 17