This document presents a framework for auditing search engines to detect differences in user satisfaction across demographics. It describes three methods for more meaningful auditing that control for natural demographic variation: 1) context matching to select near-identical user activity, 2) a hierarchical query-level model that borrows strength from popular queries, and 3) a query-level pairwise model that directly estimates relative satisfaction between pairs of users issuing the same query. Applying the framework revealed a slight trend of older users being more satisfied, but showed that auditing is nuanced and differs from simply measuring metrics on demographically binned traffic. The framework is general: it accommodates different metrics and different user groups.
Auditing search engines for differential satisfaction across demographics
1. Auditing Search Engines for
Differential Satisfaction
across Demographics
Rishabh Mehrotra, Ashton Anderson, Fernando Diaz,
Amit Sharma, Hanna Wallach, Emine Yilmaz
Microsoft Research New York
3. Motivation for auditing
• Ethical
  • Equal access for everyone
• Practical
  • Equal access helps attract a large and diverse population of users
  • Service providers are scrutinized for seemingly unfair behavior [1,2,3]
• We offer methods for auditing a system’s performance to detect differences in user satisfaction across demographics
[1] N. Diakopoulos. Algorithmic accountability. Digital Journalism, 3(3):398–415, 2015
[2] S. Barocas and A. D. Selbst. Big data’s disparate impact. California Law Review, 104, 2016.
[3] C. Munoz, M. Smith, and D. Patel. Big data: A report on algorithmic systems, opportunity, and civil rights.
Technical report, Executive Office of the President of the United States, May 2016.
4. Tricky: straightforward optimization can lead to differential performance
• Search engine uses a standard metric: time spent on a clicked result page as an indicator of satisfaction.
• Goal: estimate the difference in user satisfaction between two demographic groups (age >50 vs. age <30).
• Suppose older users issue “retirement planning” queries far more often: 80% of users aged >50, vs. 10% of users aged <30.
5. 1. Overall metrics can hide differential satisfaction
• Average user satisfaction for “retirement planning” may be high.
• But:
  • Average satisfaction for younger users = 0.7
  • Average satisfaction for older users = 0.2
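The slide’s point can be made concrete with a weighted average. This is a minimal sketch using the slide’s group means (0.7 and 0.2); the 80/20 traffic split is a hypothetical assumption for illustration, not a number from the talk.

```python
# Hypothetical illustration: a pooled satisfaction average can look
# acceptable while one demographic group fares much worse.
def pooled_average(groups):
    """groups: list of (share_of_traffic, mean_satisfaction) tuples."""
    total = sum(share for share, _ in groups)
    return sum(share * sat for share, sat in groups) / total

# Slide's group means: younger = 0.7, older = 0.2.
# Assumed split: younger users contribute 80% of the traffic.
overall = pooled_average([(0.8, 0.7), (0.2, 0.2)])
print(round(overall, 2))  # 0.6 -- looks fine, yet older users sit at 0.2
```

The pooled number is dominated by whichever group supplies more traffic, which is exactly why an overall metric can mask a dissatisfied minority.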
6. 2. Query-level metrics can hide differential satisfaction
[Figure: query streams for younger vs. older users; the older users’ stream contains many more “retirement planning” queries]
• Same user satisfaction for “retirement planning” for both older and younger users = 0.7
• But what if average satisfaction for the other queries = 0.9?
• Older users are still receiving more of the lower-quality results than younger users.
7. 3. More critically, even individual-level metrics can hide differential satisfaction
• Reading time differs for the same webpage result at the same level of user satisfaction
[Figure: distributions of time spent on a webpage for younger vs. older users]
8. We must control for natural
demographic variation to
meaningfully audit for differential
satisfaction.
9. Data: Demographic characteristics of search engine users
• Internal logs from Bing.com for two weeks
• 4M users | 32M impressions | 17M sessions
• Demographics: Age & Gender
• Age:
  • post-Millennial: <18
  • Millennial: 18-34
  • Generation X: 35-54
  • Baby Boomer: 55-74
12. Pitfalls with Overall Metrics
• They conflate two separate effects:
  • Natural demographic variation caused by differing traits across demographic groups, e.g.:
    • Different queries issued
    • Different information needs for the same query
    • Even at the same satisfaction level, demographic A tends to click more than demographic B
  • Systemic differences in user satisfaction due to the search engine
13. Utilize work from causal inference
[Causal diagram relating Demographics, Information Need, Query, Search Results, User satisfaction, and Metric]
14. I. Context Matching: selecting for activity with near-identical context
[Causal diagram as on slide 13, with an added Context node]
16. Age-wise differences in metrics disappear
• General auditing tool: robust
• Very low coverage across queries
• Did we control for too much?
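The matching idea on slides 14–16 can be sketched as follows. This is a hedged illustration, not the paper’s exact procedure: the field names and the choice of context key (same query, same top-ranked result) are assumptions; the talk only specifies “near-identical context.”

```python
from collections import defaultdict
from statistics import mean

def matched_gap(impressions, metric="dwell_time"):
    """Mean (older - younger) metric gap, computed only within strata of
    impressions that share a near-identical context (illustrative key)."""
    strata = defaultdict(lambda: defaultdict(list))
    for imp in impressions:
        key = (imp["query"], imp["top_result"])  # assumed context key
        strata[key][imp["age_group"]].append(imp[metric])
    gaps = []
    for groups in strata.values():
        means = {g: mean(v) for g, v in groups.items()}
        if "older" in means and "younger" in means:
            gaps.append(means["older"] - means["younger"])
    return mean(gaps) if gaps else None  # None: no matched contexts

imps = [
    {"query": "retirement planning", "top_result": "a.com", "age_group": "older", "dwell_time": 30},
    {"query": "retirement planning", "top_result": "a.com", "age_group": "older", "dwell_time": 34},
    {"query": "retirement planning", "top_result": "a.com", "age_group": "younger", "dwell_time": 28},
    {"query": "mortgage rates", "top_result": "b.com", "age_group": "older", "dwell_time": 50},
]
print(matched_gap(imps))  # 4.0 -- only the matched context contributes
```

Note the trade-off the slide raises: the stricter the context key, the fewer strata contain both groups, which is exactly the “very low coverage” problem.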
17. II. Query-level hierarchical model: Differential satisfaction for the same query
[Causal diagram as on slide 13]
18. • Simply fitting a separate model for each query will not work for less popular queries.
• We formulate a hierarchical model that borrows strength from more popular queries:
  • Treat the metric for each impression (a query-user pair) as a deviation from the overall metric, based on:
    • Query topic
    • User demographics
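One way to write the slide’s idea down is the additive decomposition below. This is a sketch consistent with the bullets, not the paper’s exact specification; the interaction term and error structure are assumptions.

```latex
% Metric for impression i, with query q_i and user u_i, as a deviation
% from the overall mean, driven by query topic and user demographics:
y_i = \mu + \alpha_{\mathrm{topic}(q_i)} + \beta_{\mathrm{demo}(u_i)}
      + \gamma_{\mathrm{topic}(q_i),\,\mathrm{demo}(u_i)} + \varepsilon_i
```

Because the effects are indexed by topic rather than by individual query, impressions of rare queries share parameters with popular queries on the same topic, which is how the model “borrows strength.”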
20. III. Query-level pairwise model: Estimating satisfaction directly by considering pairs of users
[Causal diagram as on slide 13]
21. Estimating absolute satisfaction is non-trivial
• Instead, estimate relative satisfaction by considering pairs of users for the same query
• Conservative proxy for pairwise satisfaction: only consider “big” differences in the observed metric for the same query
• Logistic regression model for estimating the probability of impression i being more satisfied than impression j
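The pairwise construction can be sketched end to end. This is a hedged toy version: the feature encoding (a single older/younger indicator difference), the margin `tau`, and the field names are assumptions for illustration, not the paper’s actual design.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_pairs(impressions, tau=5.0):
    """Label pair (i, j) -- two impressions for the SAME query -- as 1 when
    i's metric exceeds j's by a big margin tau (the conservative proxy)."""
    pairs = []
    for i in impressions:
        for j in impressions:
            if i is j or i["query"] != j["query"]:
                continue
            if i["dwell"] - j["dwell"] > tau:
                x = float(i["older"]) - float(j["older"])  # assumed feature
                pairs.append((x, 1.0))
                pairs.append((-x, 0.0))  # symmetric negative example
    return pairs

def fit_logistic(pairs, steps=2000, lr=0.1):
    """Gradient descent on log-loss for P(i more satisfied than j)."""
    w = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w

imps = [
    {"query": "retirement planning", "dwell": 60, "older": 1},
    {"query": "retirement planning", "dwell": 20, "older": 0},
]
w = fit_logistic(make_pairs(imps))
print(w > 0)  # True: older impressions look more satisfied in this toy data
```

A positive coefficient on the older-minus-younger feature corresponds to older users winning more of the big-margin comparisons for the same query.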
22. Again, we see a small age-wise difference in satisfaction
23. Discussion
• Auditing is more nuanced than merely measuring metrics on demographically-binned traffic.
• We find a slight trend toward older users being more satisfied.
• General framework for auditing systems:
  • Plug in different metrics
  • Plug in different demographics/user groups
• Suggests recalibrating metrics based on demographics
24. Thank You!
Amit Sharma
Postdoctoral Researcher
http://www.amitsharma.in
@amt_shrma
amshar@microsoft.com
Paper: http://datworkshop.org/papers/dat16-final41.pdf