1. Designing at the Intersection of HCI & AI:
Misinformation & Crowdsourced Annotation
Matt Lease
School of Information, University of Texas at Austin
@mattlease • ml@utexas.edu
Slides: slideshare.net/mattlease
2. What’s an Information School?
“The place where people & technology meet” ~ Wobbrock et al., 2009
“iSchools” now exist at over 100 universities around the world
3. UT Austin “Moonshot” Project
Matt Lease (University of Texas at Austin)
Goal: design a future of AI & autonomous technologies that are beneficial — not detrimental — to society.
http://goodsystems.utexas.edu
4. Part I: Design for Crowdsourced Annotation
5. Motivation 1: Supervised Learning
• AI accuracy is greatly impacted by the amount of training data
• Want labels that are reliable, inexpensive, & easy to collect
• Snow et al., EMNLP 2008
– Ensure label quality by assigning the same task to multiple workers & aggregating responses
– Can we ensure quality without reliance on redundant work?
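The redundancy-based approach of Snow et al. can be sketched as simple majority voting over repeated labels. The following toy example is illustrative only (not code from the talk): five workers label the same item, and the most common answer wins.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate redundant worker labels for one item by majority vote."""
    counts = Counter(labels)
    winner, _ = counts.most_common(1)[0]
    return winner

# Five workers label the same item; redundancy masks individual noise.
item_labels = ["relevant", "relevant", "not-relevant", "relevant", "relevant"]
print(majority_vote(item_labels))  # relevant
```

Redundancy like this is exactly the cost the rationale-based designs later in the talk try to reduce.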
6. Motivation 2: Human Computation
“Software developers with innovative ideas for businesses and technologies are constrained by the limits of artificial intelligence… If software developers could programmatically access and incorporate human intelligence into their applications, a whole new class of innovative businesses and applications would be possible. This is the goal of Amazon Mechanical Turk… people are freer to innovate because they can now imbue software with real human intelligence.”
7. Collecting Annotator Rationales for Relevance Judgments
Tyler McDonnell, Matthew Lease, Mucahid Kutlu, Tamer Elsayed
2016 AAAI Conference on Human Computation & Crowdsourcing (HCOMP)
Follow-on work: Kutlu et al., SIGIR 2018
10. Search Relevance
[Example query: “What are the symptoms of jaundice?”]
25 Years of the National Institute of Standards & Technology Text REtrieval Conference (NIST TREC)
● Expert assessors provide relevance labels for web pages.
● Task is highly subjective: even expert assessors disagree often.
Google: Quality Rater Guidelines (150 pages of instructions!)
11. A First Experiment
• Collected sample of relevance judgments on Mechanical Turk.
• Labeled some data ourselves.
• Checked agreement:
● Between workers.
● Between workers vs. our own labels.
● Between workers vs. NIST gold.
● Between our labels vs. NIST gold.
• Why do our labels disagree with NIST? Who knows…
Can we do better?
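Pairwise agreement checks like those above are commonly reported with Cohen’s kappa, which corrects raw agreement for chance. A minimal self-contained sketch (the deck does not specify which statistic was used):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label lists."""
    assert len(a) == len(b)
    n = len(a)
    labels = set(a) | set(b)
    # Raw observed agreement rate.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's marginal label rates.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

ours = [1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical binary relevance labels
nist = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(ours, nist), 3))  # 0.5
```

Here 6/8 observed agreement against 0.5 chance agreement yields kappa = 0.5, i.e. moderate agreement.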
14. Why Rationales? 1. Transparency
● Focused context for interpreting objective or subjective answers.
● Workers can justify decisions and establish other valid answers.
● Scalable gold creation w/o experts.
● Can verify labels both now & in the future (e.g., imagine NIST gold with rationales attached).
15. Why Rationales? 2. Reliability & Verifiability
● Increased accountability reduces temptation to cheat.
● Enables iterative task design. (more to come…)
● Enables dual supervision, both when aggregating answers and when training a model for the actual task. (more to come…)
● Better quality assurance could reduce the need to aggregate redundant work.
16. Why Rationales? 3. Increased Inclusivity
Hypothesis: With improved transparency and accountability, we can remove all traditional barriers to participation so anyone interested is allowed to work.
● Scalability
● Diversity
● Equal Opportunity
17. Experimental Setup
• Collected 10K relevance judgments through Mechanical Turk.
• Evaluated two main task types:
– Standard Task (Baseline): Assessors provide a relevance judgment.
– Rationale Task: Assessors provide a relevance judgment & rationale.
– Two other variant designs will be mentioned later in the talk…
• No worker qualifications or “honey-pot” questions used.
• Equal pay across all evaluated tasks.
18. Results - Accuracy
• Requiring rationales yields much higher quality work.
• Accuracy with one rationale (80%) is not far off from five standard judgments (86%).
19. Results - Cost-Efficiency
• Rationale tasks are initially slower, but the difference becomes negligible with task familiarity.
• Rationales make explicit the implicit reasoning process underlying labeling.
20. But wait, there’s more!
What about using the collected rationales?
22. Using Rationales: Overlap
[Figure: Assessor 1’s rationale, Assessor 2’s rationale, and their overlap]
Idea: Filter judgments based on pairwise rationale overlap among assessors.
Motivation: Workers who converge on similar rationales are likely to agree on labels too.
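The filtering idea can be sketched with token-level Jaccard overlap between rationale strings. This is a minimal illustration, not the paper’s exact overlap measure; the threshold value is an arbitrary placeholder.

```python
def jaccard(r1, r2):
    """Token-level Jaccard overlap between two rationale strings."""
    s1, s2 = set(r1.lower().split()), set(r2.lower().split())
    return len(s1 & s2) / len(s1 | s2)

def filter_by_overlap(judgments, threshold=0.2):
    """Keep labels whose rationale overlaps at least one peer's rationale."""
    kept = []
    for i, (label, rationale) in enumerate(judgments):
        peers = [r for j, (_, r) in enumerate(judgments) if j != i]
        if any(jaccard(rationale, p) >= threshold for p in peers):
            kept.append(label)
    return kept

judgments = [
    ("relevant", "lists symptoms like yellow skin and fatigue"),
    ("relevant", "mentions yellow skin among jaundice symptoms"),
    ("not-relevant", "page is about liver anatomy only"),
]
print(filter_by_overlap(judgments))  # ['relevant', 'relevant']
```

The outlier rationale shares no tokens with its peers, so its judgment is dropped before aggregation.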
23. Results - Accuracy (Overlap)
Filtering collected judgments by rationale overlap before aggregation increases quality.
24. Using Rationales: Two-Stage Task Design
[Figure: Assessor 1’s rationale and label (“Relevant”) shown to Assessor 2, who must respond]
Idea: Reviewer must confirm or refute the initial assessor.
Motivation: Worker must consider their response in the context of a peer’s reasoning.
82% of Stage 1 errors fixed. No new errors introduced.
25. Results - Accuracy (Two-Stage)
• One review achieves the same accuracy as using four extra standard judgments.
• Aggregating reviewers reaches the same accuracy as filtered approaches.
[Chart conditions: 1 Assessor + 1 Reviewer; 1 Assessor + 4 Reviewers]
26. The Big Picture
• Transparency
– Context for understanding and validating subjective answers.
– Convergence on justification-based crowdsourcing.
• Improved Accuracy
– Rationales make the implicit explicit and hold workers accountable.
• Improved Cost-Efficiency
– No additional cost for collection once workers are familiar with the task.
• Improved Aggregation
– Rationales can be used for filtering or aggregating judgments.
27. Future Work
• Dual Supervision: How can we further leverage rationales for aggregation?
– Supervised learning over labels/rationales. Zaidan, Eisner, & Piatko, NAACL 2007.
• Task Design: What about other sequential task designs? (e.g., multi-stage)
• Generalizability: How far can we generalize rationales to other tasks beyond text? (e.g., images)
– Donahue & Grauman. Annotator Rationales for Visual Recognition. ICCV 2011.
29. “Truthiness” is not a new problem
“Truthiness is tearing apart our country... It used to be, everyone was entitled to their own opinion, but not their own facts. But that’s not the case anymore.”
– Stephen Colbert (Jan. 25, 2006)
“You furnish the pictures and I’ll furnish the war.”
– William Randolph Hearst (Jan. 25, 1898)
30. Information Literacy
National Information Literacy Awareness Month, US Presidential Proclamation, October 1, 2009.
“Though we may know how to find the information we need, we must also know how to evaluate it. Over the past decade, we have seen a crisis of authenticity emerge. We now live in a world where anyone can publish an opinion or perspective, true or not, and have that opinion amplified…”
Matt Lease (UT Austin) • Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking
32. Automatic Fact Checking
33. (screenshot)
34. (screenshot)
35. Design Challenge: How to interact with ML models?
36. Brief Case Study: Facebook (simpler case: journalist fact-checking)
37. Tessa Lyons, a Facebook News Feed product manager:
“…putting a strong image, like a red flag, next to an article may actually entrench deeply held beliefs — the opposite effect to what we intended.”
40. AI & HCI for Misinformation
“A few classes in ‘use and users of information’ … could have helped social media platforms avoid the common pitfalls of the backfire effect in their fake news efforts and perhaps even avoided … mob rule, virality-based algorithmic prioritization in the first place.”
https://www.forbes.com/sites/kalevleetaru/ (August 5, 2019)
41. Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking
Joint work with An Thanh Nguyen (UT), Byron Wallace (Northeastern), & more…
Matt Lease, School of Information, University of Texas at Austin
@mattlease • ml@utexas.edu
Slides: slideshare.net/mattlease
44. Design Challenges
• Fair, Accountable, & Transparent (AI)
– Why trust a “black box” classifier?
– How do we reason about potential bias?
– Do people really only want to know “fact” vs. “fake”?
– How to integrate human knowledge/experience?
• Joint AI + human reasoning, error correction, personalization
• How to design strong Human + AI Partnerships?
– Horvitz, CHI’99: mixed-initiative design
– Dove et al., CHI’17: “Machine Learning As a Design Material”
45. Nguyen et al., AAAI’18
• Crowdsourced stance labels
– Hybrid AI + Human (near real-time) prediction
• Joint graphical model of stance, veracity, & annotators
– Interaction between variables
– Interpretable
• Source code on GitHub
46. Demo! (Nguyen et al., UIST’18)
47. Primary Interface
48. Source Reputation
49. System Architecture
• Google Search API
• Two logistic regression models
– Average accuracy > 70%, but with variance
– Stance (Ferreira & Vlachos ’16) w/ same features
– Veracity (Popat et al. ’17)
– scikit-learn, L1 regularization, liblinear solver, & default parameters
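A minimal sketch of one such classifier, matching the stated setup (scikit-learn logistic regression, L1 regularization, liblinear solver). The features and toy data here are placeholders, not the actual features from Ferreira & Vlachos or Popat et al.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# L1-regularized logistic regression over simple TF-IDF text features.
stance_clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear"),
)

# Toy training data: does an article support or deny a claim?
articles = [
    "study confirms the claim is accurate",
    "experts verify the report as true",
    "officials deny the rumor entirely",
    "fact checkers debunk the viral story",
]
stances = ["for", "for", "against", "against"]
stance_clf.fit(articles, stances)
print(stance_clf.predict(["researchers confirm the story"])[0])
```

L1 regularization drives many feature weights to exactly zero, which keeps the model sparse and relatively interpretable, consistent with the interpretability goals stated earlier.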
50. Data: Train & Test
Emergent dataset (Ferreira & Vlachos ’16)
[Chart: accuracy of prediction models]
51. Findings
• User studies on MTurk, ~100 participants per experiment
• Experiment 1: Whether or not participants see model predictions
52. 2 Groups: Control vs. System
53. 2 Groups: Control vs. System
54. Summary of Findings
• User studies on MTurk, ~100 participants per experiment
• Experiment 1: Whether or not reputation/stance shown
– Predict claim veracity before & after seeing model predictions
55. 2 Groups: Control vs. System
56. 2 Groups: Control vs. System
57. Summary of Findings
• User studies on MTurk, ~100 participants per experiment
• Experiment 1: Whether or not reputation/stance shown
– Predict claim veracity before & after seeing model predictions
– Result: human accuracy roughly follows model accuracy
58. Summary of Findings
• User studies on MTurk, ~100 participants per experiment
• Experiment 1: Whether or not reputation/stance shown
– Predict claim veracity before & after seeing model predictions
– Result: human accuracy roughly follows model accuracy
• Experiment 2: Whether or not users can override predictions
– Predict claim veracity and give confidence in the prediction
– Result: not statistically significant on average; interaction sometimes hurts
60. A new form of echo chamber?
Interaction promotes transparency & trust, but can affirm user bias
61. CobWeb: A Research Prototype for Exploring User Bias in Political Fact-Checking
Anubrata Das, Kunjan Mehta, and Matthew Lease
SIGIR 2019 Workshop on Fair, Accountable, Confidential, Transparent, and Safe Information Retrieval (FACTS-IR). July 25, 2019
62. We introduce “political leaning” (bias) as a function of the adjusted reputation of the news sources.
CobWeb; Anubrata Das, Kunjan Mehta and Matthew Lease
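The slide does not give CobWeb’s exact formula, but the idea can be sketched as a reputation-weighted average of per-source bias scores. The function, data, and scales below are illustrative assumptions, not CobWeb’s published definition.

```python
def overall_leaning(sources):
    """Reputation-weighted average of per-source political bias.

    `sources` maps name -> (bias, reputation), with bias in [-1, 1]
    (left to right) and reputation in [0, 1]. Raising a source's
    reputation pulls the overall leaning toward that source's bias.
    """
    total = sum(rep for _, rep in sources.values())
    return sum(bias * rep for bias, rep in sources.values()) / total

sources = {
    "outlet_a": (-0.8, 0.9),  # left-leaning, high reputation
    "outlet_b": (0.4, 0.5),   # right-leaning, medium reputation
    "outlet_c": (0.0, 0.6),   # centrist
}
print(round(overall_leaning(sources), 3))  # -0.26
```

This makes the two-way coupling on the following slides concrete: editing a reputation score changes the overall leaning, and any scheme that edits the overall leaning must redistribute the underlying reputations.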
63. A user can alter the source reputation:
– Changing reputation scores changes the predicted correctness.
– Changing reputation scores changes the overall political leaning.
[Screenshot: source bias, stance, and sources for an imaginary claim]
64. A user can alter the overall political leaning:
– Changing the overall political leaning changes the predicted correctness.
– Changing the overall political leaning changes the source reputations.
[Screenshot: source bias, stance, and sources for an imaginary claim]
65. Pilot results (n = 10)
● 8/10 participants correctly identified the correctness of an imaginary claim.
● 6/10 participants found the change in overall political leaning as a function of change in reputation score intuitive.
● 6/10 participants found the relationship between change in overall political leaning and the source reputation score intuitive.
66. Post-task Questionnaire
01 Accuracy: Knowing the overall political leaning would help me accurately predict the truthfulness of claims.
02 Effectiveness: Knowing the overall political leaning would enable me to effectively predict the truthfulness of claims.
03 Ease of Use: Knowing the overall political leaning would make it easier to predict the truthfulness of claims.
04 Usefulness: Communicating overall political leaning is useful in predicting truthfulness of claims.
67. Pilot results, continued
● 8/10 participants found it useful to have an indicator of their overall political leaning in a claim-checking scenario.
68. Contribution: An interface that communicates a user’s own political biases in a fact-checking context.
Conclusion: Communicating a user’s own bias in fact checking helps the user assess the credibility of a claim.
Future Work: bias detection on real data; extensive user study; evaluation design for search bias.
CobWeb: A Research Prototype for Exploring User Bias in Political Fact-Checking (working paper; more to come!) https://arxiv.org/abs/1907.03718
A Conceptual Framework for Evaluating Fairness in Search: https://arxiv.org/abs/1907.09328
69. Wrap-up on Misinformation
• Fact-checking is more than black-box prediction: interaction, exploration, trust
– Useful problem for grounding work on Fair, Accountable, & Transparent (FAT) AI
• Mixed-initiative human + AI partnership for fact-checking
– Back-end NLP + front-end interaction
• Fact Checking & IR (Lease, DESIRES’18)
– How to diversify search results for controversial topics?
– Information evaluation (e.g., vaccination & autism)
• Potential harm as well as good
– Potential added confusion, data / algorithmic bias
– Potential for a personal “echo chamber”
– Adversarial settings
70. Thank You!
Slides: slideshare.net/mattlease
Lab: ir.ischool.utexas.edu