
Measuring effectiveness of machine learning systems



Many online systems, such as recommender systems and ad systems, are increasingly used in societally critical domains such as education, healthcare, finance and governance. A natural question is how effective they are, which is often answered with observational metrics. However, these metrics hide the cause-and-effect processes linking these systems, people's behavior and outcomes. I will present a causal framework for tackling questions about the effects of algorithmic systems and demonstrate its use by evaluating Amazon's recommender system and a major search engine. I will also discuss how such evaluations can lead to metrics for designing better systems.



  1. Measuring effectiveness of machine learning systems. Amit Sharma, Microsoft Research India. www.amitsharma.in | @amt_shrma
  2. Are these systems performing well? How do these systems affect people’s behavior? How can we make these systems better? Better for whom?
  3. Evaluating systems = estimating causal effects. Three examples, two studies. Question: How good is a recommender system? Estimate what would have happened without the recommender system. Question: Is a search engine performing well for all users? Estimate how performance would change if a person had a different demographic, without changing anything else.
  4. Easiest answer: look at metrics.
  5. But metrics can be misleading. A performance metric can be high or low for a number of reasons: day of week, time of year, selection effects. Different metrics may provide a different picture.
  6. Example 1: Increasing activity on Xbox.
  7. From data to prediction. Use these correlations to make a predictive model: Future Activity = f(number of friends, logins in past month).
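As a rough illustration of this step, the sketch below fits a purely correlational model of future activity on synthetic data; the feature names and numbers are hypothetical, not the talk's actual Xbox data.

```python
# Minimal sketch (hypothetical data): a predictive model built from
# correlations only. Its coefficients describe association, not causation.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_users = 1000
num_friends = rng.poisson(10, n_users)
logins_past_month = rng.poisson(5, n_users)
# Synthetic future activity that happens to correlate with both features.
future_activity = 0.5 * num_friends + 2.0 * logins_past_month + rng.normal(0, 3, n_users)

X = np.column_stack([num_friends, logins_past_month])
model = LinearRegression().fit(X, future_activity)
print(model.coef_)  # predictive weights; they say nothing about what an intervention would do
```

The next slide's question, whether adding friends would raise activity, cannot be answered from such a model alone.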
  8. From data to “actionable insights”. Would increasing the number of friends increase people’s activity on our system? Maybe, maybe not (!)
  9. Different explanations are possible.
  10. Example 2: Search Ads.
  11. Are search ads really that effective?
  12. But search results point to the same website.
  13. Without reasoning about causality, we may overestimate the effectiveness of ads.
  14. Okay, search ads have an explicit intent. Display ads should be fine?
  15. Estimating the impact of ads.
  16. People buy more toys in December anyway.
  17. [Figure-only slide.]
  18. [Figure-only slide.]
  19. Example 3: Did a system change lead to better outcomes? System A vs. System B.
  20. Comparing old versus new system. Old (A): 50/1000 (5%). New (B): 54/1000 (5.4%). New system is better?
  21. Looking at change in CTR by income. Low-income: Old (A) 10/400 (2.5%), New (B) 4/200 (2%). High-income: Old (A) 40/600 (6.6%), New (B) 50/800 (6.2%).
  22. Simpson’s paradox. Is algorithm A better? Conversion rate for low-income people: Old (A) 10/400 (2.5%), New (B) 4/200 (2%). Conversion rate for high-income people: Old (A) 40/600 (6.6%), New (B) 50/800 (6.2%). Total conversion rate: Old (A) 50/1000 (5%), New (B) 54/1000 (5.4%).
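The arithmetic behind the paradox is easy to verify; the short script below recomputes the slide's table and shows the reversal between the subgroup rates and the overall rate.

```python
# Recompute the slide's conversion rates: the new system (B) wins overall
# yet loses within each income group (Simpson's paradox).
old = {"low": (10, 400), "high": (40, 600)}   # (conversions, impressions)
new = {"low": (4, 200), "high": (50, 800)}

def rate(conversions, n):
    return conversions / n

for group in ("low", "high"):
    print(f"{group}-income: old={rate(*old[group]):.1%}, new={rate(*new[group]):.1%}")

old_overall = rate(sum(c for c, _ in old.values()), sum(n for _, n in old.values()))
new_overall = rate(sum(c for c, _ in new.values()), sum(n for _, n in new.values()))
print(f"overall: old={old_overall:.1%}, new={new_overall:.1%}")
```

The reversal happens because the new system serves a much larger share of its traffic to high-income users (800/1000 vs. 600/1000), who convert at a higher rate in both systems.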
  23. Answer (as usual): maybe, maybe not. High-income people might have better means to know about the updated information. The time of year may have an effect. There could be other hidden causal variations.
  24. [Figure-only slide.]
  25. Easy answer: do A/B tests. Great for testing focused hypotheses. But cannot help you find robust hypotheses. Not possible in many scenarios for ethical or practical reasons.
  26. Harder answer: approximate A/B tests offline. Gives all the benefits of A/B tests and can be run at scale. But without an experiment, we have no data on what would happen if we changed the system. How do we estimate something that we have never observed?
  27. Causal inference to the rescue…
  28. Evaluating systems = estimating causal effects. Two examples. Question: How good is a recommender system? Estimate what would have happened without the recommender system. Question: Is a search engine performing well for all users? Estimate how performance would change if a person had a different demographic, without changing anything else.
  29. Study 1: Estimating the impact of a recommender system. Sharma, Hofman and Watts (2015, 2016).
  30. Example: Estimating the causal impact of Amazon’s recommender system.
  31. How much activity comes from the recommendation system?
  32. Confounding: observed click-throughs may be due to correlated demand. [Causal diagram: Demand for The Road → Visits to The Road → Rec. visits to No Country for Old Men ← Demand for No Country for Old Men.]
  33. Observed activity is almost surely an overestimate of the causal effect. [Figure: observed activity from the recommender is a mix of causal and convenience clicks; the activity that would occur without the recommender is unknown.]
  34. Finding an auxiliary outcome: split all visits to a recommended product into recommender visits (the primary outcome) and direct visits (search visits and direct browsing, the auxiliary outcome). The auxiliary outcome is a proxy for unobserved demand.
  35. 1a. Search for any product with a shock to page visits.
  36. 1b. Filtering out invalid natural experiments.
  37. The “split-door” criterion: X ⫫ Y_D, i.e., visits to the focal product are independent of direct visits to the recommended product. [Causal diagram: Demand for focal product (U_X) → Visits to focal product (X) → Rec. visits (Y_R); Demand for rec. product (U_Y) → Rec. visits (Y_R) and Direct visits (Y_D).]
  38. More formally, why does it work? [The same diagram in general form: Unobserved variables (U_X) → Cause (X) → Outcome (Y_R); Unobserved variables (U_Y) → Outcome (Y_R) and Auxiliary outcome (Y_D).]
  39. Data from Amazon.com, using the Bing toolbar. Of these, 20K products have at least 10 visits on any one day.
  40. Constructed sequence of visits for each user.
  41. Implementing the split-door criterion: time window t = 15 days.
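A minimal sketch of how such windows might be screened is shown below. The column names, the 15-day windowing, and the use of a simple rank-correlation test as the independence check are assumptions for illustration; the actual study's procedure and test may differ.

```python
# Sketch of the split-door idea (illustrative, not the paper's exact method):
# within each 15-day window, keep a focal product only if its visit series X
# looks independent of direct visits Y_D to the recommended product; in those
# windows, recommender visits Y_R divided by X gives a causal click-through.
# Column names (product_id, date, x_visits, yr_rec_visits, yd_direct_visits)
# are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

WINDOW = 15  # days per candidate natural experiment

def split_door_experiments(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    experiments = []
    for pid, g in df.groupby("product_id"):
        g = g.sort_values("date").reset_index(drop=True)
        for start in range(0, len(g) - WINDOW + 1, WINDOW):
            w = g.iloc[start:start + WINDOW]
            rho, pval = spearmanr(w["x_visits"], w["yd_direct_visits"])
            if pval > alpha:  # no evidence of dependence: treat as a natural experiment
                causal_ctr = w["yr_rec_visits"].sum() / max(w["x_visits"].sum(), 1)
                experiments.append({"product_id": pid,
                                    "window_start": w["date"].iloc[0],
                                    "causal_ctr": causal_ctr})
    return pd.DataFrame(experiments)
```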
  42. Using the split-door criterion, obtain 23,000 natural experiments for over 12,000 products.
  43. [Figure-only slide.]
  44. Observational click-through rate overestimates the causal effect.
  45. Generalization? The distribution of products with a natural experiment is identical to the overall distribution.
  46. Study 2: How does user satisfaction with a search engine vary across demographics? Mehrotra, Anderson, Diaz, Sharma and Wallach (2017).
  47. Tricky: straightforward optimization can lead to differential performance. The search engine uses a standard metric, time spent on the clicked result page, as an indicator of satisfaction. Goal: estimate the difference in user satisfaction between two demographic groups. Suppose older users issue more “retirement planning” queries. [Figure: age >50 years: 80% of users; age <30 years: 10% of users.]
  48. 1. Overall metrics can hide differential satisfaction. Average user satisfaction for “retirement planning” may be high. But average satisfaction for younger users = 0.7 and average satisfaction for older users = 0.2.
  49. 2. Query-level metrics can hide differential satisfaction. [Figure: query streams for younger vs. older users; “retirement planning” appears far more often in the older users’ stream.] Satisfaction with “retirement planning” is the same (0.7) for both older and younger users. But what if average satisfaction for other queries is 0.9? Older users are still receiving more of the lower-quality results than younger users.
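A tiny worked example makes the mixture effect concrete. The 80%/10% query shares are borrowed loosely from the earlier illustration and are assumptions here, as is treating 0.9 as the satisfaction for every other query.

```python
# Per-query satisfaction is identical for both groups, yet the query mix
# alone produces different overall averages (assumed shares: 80% vs. 10%).
sat = {"retirement planning": 0.7, "other": 0.9}
mix_older = {"retirement planning": 0.8, "other": 0.2}
mix_younger = {"retirement planning": 0.1, "other": 0.9}

avg_older = sum(share * sat[q] for q, share in mix_older.items())      # 0.74
avg_younger = sum(share * sat[q] for q, share in mix_younger.items())  # 0.88
print(avg_older, avg_younger)
```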
  50. 3. More critically, even individual-level metrics can hide differential satisfaction. [Figure: distributions of time spent on the same webpage result, at the same satisfaction level, differ between younger and older users.]
  51. How do we know whether some users are more satisfied than others?
  52. Data: demographic characteristics of search engine users. Internal logs from Bing.com for two weeks: 4M users | 32M impressions | 17M sessions. Demographics: age and gender. Age groups: post-Millennial (<18), Millennial (18-34), Generation X (35-54), Baby Boomer (55-74).
  53. Overall metrics across demographics. Four metrics: Graded Utility (GU), Reformulation Rate (RR), Successful Click Count (SCC), Page Click Count (PCC).
  54. Pitfalls with overall metrics. They conflate two separate effects: (1) natural demographic variation caused by the differing traits of demographic groups, e.g., different queries issued, different information needs for the same query, or a tendency of demographic A to click more than demographic B even at the same satisfaction; and (2) systemic differences in user satisfaction due to the search engine.
  55. Constructing a causal model. [Causal diagram with nodes: Demographics, Information Need, Query, Search Results, User satisfaction, Metric.]
  56. I. Context matching: selecting for activity with near-identical context. [Same causal diagram, with a Context node added.]
  57. [Same causal diagram.] For any two users from different demographics, require: (1) the same query; (2) the same information need, by controlling for user intent (same final SAT click) and only considering navigational queries; (3) identical top-8 search results. This yields 1.2M impressions, 19K unique queries, 617K users.
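The sketch below shows one plausible way to implement this matching over an impression log; the column names and grouping keys are assumptions, not the paper's code.

```python
# Sketch of context matching (hypothetical column names): group impressions
# that share the query, the final SAT click, and the top-8 result list, then
# compare the metric across age groups within each matched context.
import pandas as pd

def metric_by_age_within_context(impressions: pd.DataFrame) -> pd.DataFrame:
    keys = ["query", "sat_click_url", "top8_results_hash"]
    nav = impressions[impressions["is_navigational"]]
    per_group = (nav.groupby(keys + ["age_group"])["metric"]
                    .mean()
                    .unstack("age_group"))
    # Keep only contexts observed for at least two demographic groups,
    # so that within-context comparisons are possible.
    return per_group.dropna(thresh=2)
```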
  58. Age-wise differences in metrics disappear. As a general auditing tool it is robust, but coverage across queries is very low: did we control for too much?
  59. II. Query-level pairwise model: estimating satisfaction directly by considering pairs of users. [Same causal diagram.]
  60. Estimating absolute satisfaction is non-trivial. Instead, estimate relative satisfaction by considering pairs of users for the same query. Use a conservative proxy for pairwise satisfaction by only considering “big” differences in the observed metric for the same query. Fit a logistic regression model to estimate the probability that impression i is more satisfied than impression j.
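One common way to parameterize such a pairwise model is a logistic regression on the difference of the two impressions' feature vectors; the sketch below follows that form and is an assumption, since the slide's exact specification is not reproduced here.

```python
# Pairwise preference sketch: P(i more satisfied than j) = sigmoid(w · (x_i - x_j)).
# Labels come from "big" metric differences between paired impressions on the
# same query; data below is synthetic and the parameterization is assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pairwise_model(X_i, X_j, i_preferred):
    """X_i, X_j: feature matrices for paired impressions; i_preferred: 1 if
    impression i had the clearly higher metric, else 0."""
    model = LogisticRegression(fit_intercept=False)
    model.fit(X_i - X_j, i_preferred)
    return model

# Toy usage on synthetic pairs.
rng = np.random.default_rng(0)
X_i, X_j = rng.normal(size=(500, 4)), rng.normal(size=(500, 4))
labels = (X_i[:, 0] > X_j[:, 0]).astype(int)  # synthetic preference signal
model = fit_pairwise_model(X_i, X_j, labels)
print(model.predict_proba(X_i[:5] - X_j[:5])[:, 1])  # estimated P(i preferred)
```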
  61. Again, see a small age-wise difference in satisfaction.
  62. Conclusion I: ML systems need Grey Box analysis.
  63. Conclusion II. Evaluation of ML systems requires careful analysis of inputs and outputs. Observational metrics are usually biased: a big fraction of recommendation click-throughs may be due to convenience, and search engine metrics do not provide a clear picture of user satisfaction. Causal models are essential for developing robust metrics.
  64. Thank you. Amit Sharma | @amt_shrma | http://www.amitsharma.in
