
Auditing search engines for differential satisfaction across demographics


Many online services, such as search engines, social media platforms, and digital marketplaces, are advertised as being available to any user, regardless of their age, gender, or other demographic factors. However, there are growing concerns that these services may systematically underserve some groups of users. In this work, we present a framework for internally auditing such services for differences in user satisfaction across demographic groups, using search engines as a case study. We first explain the pitfalls of naively comparing the behavioral metrics that are commonly used to evaluate search engines. We then propose three methods for measuring latent differences in user satisfaction from observed differences in evaluation metrics. To develop these methods, we drew on ideas from the causal inference and multilevel modeling literature. Our framework is broadly applicable to other online services, and provides general insight into interpreting their evaluation metrics.


  1. Auditing Search Engines for Differential Satisfaction across Demographics. Rishabh Mehrotra, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, Emine Yilmaz. Microsoft Research New York
  2. From public libraries to search engines
  3. Motivation for auditing
     • Ethical: equal access for everyone
     • Practical: equal access helps attract a large and diverse population of users; service providers are scrutinized for seemingly unfair behavior [1,2,3]
     • We offer methods for auditing a system's performance to detect differences in user satisfaction across demographics
     [1] N. Diakopoulos. Algorithmic accountability. Digital Journalism, 3(3):398–415, 2015.
     [2] S. Barocas and A. D. Selbst. Big data's disparate impact. California Law Review, 104, 2016.
     [3] C. Munoz, M. Smith, and D. Patel. Big data: A report on algorithmic systems, opportunity, and civil rights. Technical report, Executive Office of the President of the United States, May 2016.
  4. Tricky: straightforward optimization can lead to differential performance
     • The search engine uses a standard metric: time spent on a clicked result page as an indicator of satisfaction.
     • Goal: estimate the difference in user satisfaction between two demographic groups.
     • Suppose older users issue more "retirement planning" queries. (Figure: for this query, 80% of users are over 50 and 10% are under 30.)
  5. 1. Overall metrics can hide differential satisfaction
     • Average user satisfaction for "retirement planning" may be high. But:
     • Average satisfaction for younger users = 0.7
     • Average satisfaction for older users = 0.2
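The masking effect on this slide is easy to reproduce. A minimal sketch with hypothetical numbers (the group sizes and scores below are illustrative, not from the paper's data): the overall mean sits between the two group means, so reporting only the overall metric hides the gap.

```python
# Hypothetical per-impression satisfaction scores for one query, by group.
satisfaction = {
    "younger": [0.7, 0.7, 0.7, 0.7],  # many impressions from younger users
    "older":   [0.2, 0.2],            # fewer impressions from older users
}

def mean(xs):
    return sum(xs) / len(xs)

all_scores = [s for scores in satisfaction.values() for s in scores]
overall = mean(all_scores)  # lands between the group means, hiding the gap

print(f"overall satisfaction: {overall:.2f}")
for group, scores in satisfaction.items():
    print(f"{group}: {mean(scores):.2f}")
```

Because the overall mean is a traffic-weighted blend of the group means, it can look acceptable even when one group's experience is far worse.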
  6. 2. Query-level metrics can hide differential satisfaction
     • Same user satisfaction for "retirement planning" for both older and younger users = 0.7
     • What if average satisfaction for other queries = 0.9? Older users, who issue "retirement planning" more often, are still receiving more lower-quality results than younger users.
     (Figure: query streams in which "retirement planning" makes up a larger share of older users' queries than of younger users'.)
  7. 3. More critically, even individual-level metrics can hide differential satisfaction
     (Figure: distributions of time spent on the same webpage, at the same level of user satisfaction, differ between younger and older users.)
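One simple way to see what "controlling for" this behavioral difference could look like (the deck's discussion later suggests recalibrating metrics by demographics, though it does not specify a method) is to standardize the metric within each group. A sketch with hypothetical dwell times:

```python
# Sketch (hypothetical data): the same dwell time can mean different
# satisfaction for different groups, so standardize within each group
# to put the metric on a comparable scale.
import statistics

dwell_seconds = {
    "younger": [20, 25, 30, 35, 40],
    "older":   [40, 50, 60, 70, 80],  # slower readers at the same satisfaction
}

def zscores(xs):
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

# After standardization, each group's scores have mean 0 and unit variance,
# so a raw reading-speed difference no longer looks like a satisfaction gap.
standardized = {g: zscores(xs) for g, xs in dwell_seconds.items()}
```

This is only an illustration of the recalibration idea; within-group standardization would also erase any genuine between-group satisfaction difference, which is exactly why the paper develops the more careful methods that follow.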
  8. We must control for natural demographic variation to meaningfully audit for differential satisfaction.
  9. Data: demographic characteristics of search engine users
     • Internal logs from Bing.com for two weeks
     • 4M users | 32M impressions | 17M sessions
     • Demographics: age and gender
     • Age groups: post-Millennial (<18), Millennial (18-34), Generation X (35-54), Baby Boomer (55-74)
  10. Demographic distribution of user activity (Figure: activity broken down by age group)
  11. Overall metrics across demographics
     • Four metrics: Graded Utility (GU), Reformulation Rate (RR), Successful Click Count (SCC), Page Click Count (PCC)
  12. Pitfalls with overall metrics: they conflate two separate effects
     • Natural demographic variation caused by differing traits among demographic groups, e.g. different queries issued, different information needs for the same query, or one demographic tending to click more than another at the same level of satisfaction
     • Systematic differences in user satisfaction due to the search engine
  13. Utilize work from causal inference (Figure: causal graph linking Demographics, Information Need, Query, Search Results, User Satisfaction, and the observed Metric)
  14. I. Context matching: selecting activity with near-identical context (Figure: the causal graph, with the shared Context marked)
  15. For any two users from different demographics, require:
     1. Same query
     2. Same information need: control for user intent (same final SAT click), and only consider navigational queries
     3. Identical top-8 search results
     Matched data: 1.2M impressions, 19K unique queries, 617K users
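The matching criteria above amount to exact stratification on context. A minimal sketch, assuming a hypothetical impression record of (group, query, final SAT click, top-8 results, metric): group impressions into strata with identical context, keep only strata observed for both demographic groups, and compare the metric within strata.

```python
# Sketch of context matching (hypothetical record format and data).
from collections import defaultdict

impressions = [
    ("younger", "bank login", "bank.com", ("bank.com", "news.com"), 0.9),
    ("older",   "bank login", "bank.com", ("bank.com", "news.com"), 0.8),
    ("older",   "tax forms",  "irs.gov",  ("irs.gov",),             0.4),  # unmatched
]

# Stratify on the full context: (query, SAT click, top-8 results).
strata = defaultdict(lambda: defaultdict(list))
for group, query, sat_click, top8, metric in impressions:
    strata[(query, sat_click, top8)][group].append(metric)

def mean(xs):
    return sum(xs) / len(xs)

# Keep only contexts seen for both groups; average the within-stratum gaps.
gaps = [
    mean(by_group["older"]) - mean(by_group["younger"])
    for by_group in strata.values()
    if "older" in by_group and "younger" in by_group
]
avg_gap = mean(gaps) if gaps else None
```

The unmatched impression is simply dropped, which illustrates the coverage problem the next slide raises: exact matching is robust but discards most of the traffic.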
  16. Age-wise differences in metrics disappear
     • As a general auditing tool: robust
     • But very low coverage across queries
     • Did we control for too much?
  17. II. Query-level hierarchical model: differential satisfaction for the same query (Figure: the causal graph from slide 13)
  18. • Simply fitting a separate model for each query will not work for less popular queries.
     • We formulate a hierarchical model that borrows strength from more popular queries:
     • Model the metric for each impression (a query and a user) as a deviation from the overall metric, based on query topic and user demographics.
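The "borrowing strength" idea can be sketched with a simple partial-pooling (shrinkage) estimator. This is not the paper's exact model, just an illustration of the principle: a query-level estimate is a blend of that query's own mean and the global mean, weighted by how much data the query has, with the prior strength below a hypothetical tuning constant.

```python
# Sketch of partial pooling: sparse queries are pulled toward the global
# mean; popular queries keep an estimate close to their own mean.

global_mean = 0.6      # overall metric across all impressions (hypothetical)
prior_strength = 10.0  # pseudo-count controlling the amount of shrinkage

def pooled_mean(query_scores):
    n = len(query_scores)
    query_mean = sum(query_scores) / n
    # Weighted blend of the query's own mean and the global mean.
    return (n * query_mean + prior_strength * global_mean) / (n + prior_strength)

popular = [0.9] * 100  # estimate stays near the query's own mean (~0.87)
rare = [0.1]           # estimate is pulled strongly toward 0.6 (~0.55)

print(pooled_mean(popular))
print(pooled_mean(rare))
```

A full hierarchical model would do the analogous shrinkage jointly over query topics and demographic deviations, but the mechanics of regularizing sparse queries toward a shared mean are the same.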
  19. Age-wise differences appear again: bigger differences for harder queries
  20. III. Query-level pairwise model: estimating satisfaction directly by considering pairs of users (Figure: the causal graph from slide 13)
  21. Estimating absolute satisfaction is non-trivial
     • Instead, estimate relative satisfaction by considering pairs of users for the same query
     • Build a conservative proxy for pairwise satisfaction by only considering "big" differences in the observed metric for the same query
     • Fit a logistic regression model to estimate the probability of impression i being more satisfied than impression j
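The slide gives no equation, so the sketch below uses a generic pairwise formulation: model P(i preferred over j) as a sigmoid of a weight vector dotted with the feature difference x_i - x_j, and fit it by stochastic gradient ascent on pairs labeled by large metric gaps. The features, labels, and learning rate are all hypothetical.

```python
# Sketch of a pairwise preference model fit on impression pairs.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training pairs: (x_i - x_j feature difference, label "i > j"),
# where a pair is labeled 1 only if i's metric beat j's by a big margin.
pairs = [
    ([1.0, 0.5], 1),
    ([0.8, 0.2], 1),
    ([-0.9, -0.4], 0),
    ([-1.1, -0.3], 0),
]

w = [0.0, 0.0]
lr = 0.5
for _ in range(500):
    for dx, y in pairs:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, dx)))
        # Gradient ascent on the log-likelihood of the pairwise labels.
        w = [wi + lr * (y - p) * xi for wi, xi in zip(w, dx)]

# Probability that an impression with clearly stronger features is preferred.
p_new = sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0, 0.5])))
```

Averaging such pairwise preference probabilities across demographic pairings for the same query gives a relative satisfaction comparison without ever estimating absolute satisfaction.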
  22. Again, we see a small age-wise difference in satisfaction
  23. Discussion
     • Auditing is more nuanced than merely measuring metrics on demographically binned traffic.
     • We find a slight trend toward older users being more satisfied.
     • General framework for auditing systems: plug in different metrics, plug in different demographics/user groups
     • Suggests recalibrating metrics based on demographics
  24. Thank You! Amit Sharma, Postdoctoral Researcher, http://www.amitsharma.in, @amt_shrma, amshar@microsoft.com
     • Auditing is more nuanced than merely measuring metrics on demographically binned traffic.
     • General framework for auditing systems: plug in different metrics, plug in different demographics/user groups
     • Paper: http://datworkshop.org/papers/dat16-final41.pdf
