
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Annotation


  1. 1. Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Annotation Matt Lease School of Information @mattlease University of Texas at Austin ml@utexas.edu Slides: slideshare.net/mattlease
  2. 2. What’s an Information School? “The place where people & technology meet” ~ Wobbrock et al., 2009. “iSchools” now exist at over 100 universities around the world.
  3. 3. UT Austin “Moonshot” Project. Goal: design a future of AI & autonomous technologies that are beneficial, not detrimental, to society. http://goodsystems.utexas.edu
  4. 4. Part I: Design for Crowdsourced Annotation
  5. 5. Motivation 1: Supervised Learning • AI accuracy is greatly impacted by the amount of training data • Want labels that are reliable, inexpensive, & easy to collect • Snow et al., EMNLP 2008 – Ensure label quality by assigning the same task to multiple workers & aggregating responses – Can we ensure quality without reliance on redundant work?
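To make the redundancy-plus-aggregation idea concrete, here is a minimal majority-vote sketch in Python. The function and data are illustrative, not from the talk; Snow et al. also studied more sophisticated bias-corrected weighting, and plain majority vote is just the simplest baseline.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate redundant worker labels for one item by majority vote."""
    return Counter(labels).most_common(1)[0][0]

# Example: five workers judge the same (query, page) pair.
worker_labels = ["relevant", "relevant", "non-relevant", "relevant", "non-relevant"]
print(majority_vote(worker_labels))  # -> relevant
```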
  6. 6. Motivation 2: Human Computation. “Software developers with innovative ideas for businesses and technologies are constrained by the limits of artificial intelligence… If software developers could programmatically access and incorporate human intelligence into their applications, a whole new class of innovative businesses and applications would be possible. This is the goal of Amazon Mechanical Turk… people are freer to innovate because they can now imbue software with real human intelligence.”
  7. 7. Collecting Annotator Rationales for Relevance Judgments Tyler McDonnell, Matthew Lease, Mucahid Kutlu, Tamer Elsayed 2016 AAAI Conference on Human Computation & Crowdsourcing (HCOMP) Follow-on work: Kutlu et al., SIGIR 2018
  8. 8. Search Relevance. Example query: “What are the symptoms of jaundice?” [screenshot: web search results for the query]
  9. 9. Search Relevance. [screenshot: a result page for the same query, with “jaundice” highlighted]
  10. 10. Search Relevance. 25 Years of the National Institute of Standards & Technology Text REtrieval Conference (NIST TREC): ● Expert assessors provide relevance labels for web pages. ● The task is highly subjective: even expert assessors disagree often. See also Google’s Quality Rater Guidelines (150 pages of instructions!).
  11. 11. A First Experiment • Collected a sample of relevance judgments on Mechanical Turk. • Labeled some data ourselves. • Checked agreement: ● between workers; ● between workers and our own labels; ● between workers and NIST gold; ● between our labels and NIST gold. • Why do our labels disagree with NIST? Who knows… Can we do better?
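As an illustration of the agreement checks listed above, here is one way such comparisons might be computed. The talk does not say which agreement statistic was used, so Cohen's kappa and the toy labels below are assumptions:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical relevance labels (1 = relevant, 0 = non-relevant) for the
# same ten documents, from three label sources.
workers = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]  # aggregated crowd labels
ours    = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]  # our own labels
nist    = [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]  # NIST gold

print(cohen_kappa_score(workers, ours))  # workers vs. our labels
print(cohen_kappa_score(ours, nist))     # our labels vs. NIST gold
```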
  12. 12. The Rationale. [screenshot: the same result page, with a passage highlighted as a rationale for the query “What are the symptoms of jaundice?”]
  13. 13. The Rationale. Zaidan, Eisner, & Piatko, NAACL 2007. [screenshot: the highlighted rationale passage]
  14. 14. Why Rationales? 1. Transparency ● Focused context for interpreting objective or subjective answers. ● Workers can justify decisions and establish other valid answers. ● Scalable gold creation without experts. ● Can verify labels both now & in the future (e.g., imagine NIST gold with rationales attached).
  15. 15. Why Rationales? 2. Reliability & Verifiability ● Increased accountability reduces the temptation to cheat. ● Enables iterative task design (more to come…). ● Enables dual supervision, both when aggregating answers and when training a model for the actual task (more to come…). ● Better quality assurance could reduce the need to aggregate redundant work.
  16. 16. Why Rationales? 3. Increased Inclusivity. Hypothesis: with improved transparency and accountability, we can remove all traditional barriers to participation so anyone interested is allowed to work. ● Scalability ● Diversity ● Equal Opportunity
  17. 17. Experimental Setup • Collected 10K relevance judgments through Mechanical Turk. • Evaluated two main task types: – Standard Task (Baseline): assessors provide a relevance judgment. – Rationale Task: assessors provide a relevance judgment & a rationale. – Two other variant designs will be mentioned later in the talk… • No worker qualifications or “honey-pot” questions used. • Equal pay across all evaluated tasks.
  18. 18. Results - Accuracy • Requiring rationales yields much higher quality work. • Accuracy with one rationale (80%) is not far off from five standard judgments (86%).
  19. 19. Results - Cost-Efficiency • Rationale tasks are initially slower, but the difference becomes negligible with task familiarity. • Rationales make explicit the implicit reasoning process underlying labeling.
  20. 20. But wait, there’s more! What about using the collected rationales?
  21. 21. Using Rationales: Overlap. [diagram: Assessor 1’s rationale and Assessor 2’s rationale highlighted on the same document]
  22. 22. Using Rationales: Overlap. [diagram: the two rationales, with their overlapping span marked] Idea: Filter judgments based on pairwise rationale overlap among assessors. Motivation: Workers who converge on similar rationales are likely to agree on labels too.
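A minimal sketch of the overlap-filtering idea. The paper's exact overlap measure is not given on the slide, so the token-level Jaccard similarity and the threshold below are assumptions:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two rationale snippets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def filter_by_overlap(judgments, threshold=0.2):
    """Keep a judgment only if its rationale overlaps sufficiently with
    at least one other assessor's rationale for the same item."""
    kept = []
    for i, (label_i, rat_i) in enumerate(judgments):
        others = (rat_j for j, (_, rat_j) in enumerate(judgments) if j != i)
        if any(jaccard(rat_i, r) >= threshold for r in others):
            kept.append(label_i)
    return kept

judgments = [
    ("relevant", "symptoms include yellow skin and eyes"),
    ("relevant", "yellowing of the skin and eyes is a symptom"),
    ("non-relevant", "page is about liver anatomy"),
]
print(filter_by_overlap(judgments))  # -> ['relevant', 'relevant']
```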
  23. 23. Results - Accuracy (Overlap). Filtering collected judgments by rationale overlap before aggregation increases quality.
  24. 24. Using Rationales: Two-Stage Task Design. [diagram: Assessor 1 judges “Relevant” with a rationale; Assessor 2 reviews] Idea: A reviewer must confirm or refute the initial assessor’s judgment. Motivation: The reviewer must consider their response in the context of a peer’s reasoning. Result: 82% of Stage 1 errors fixed; no new errors introduced.
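The two-stage workflow could be wired together roughly as below. The data type, the reviewer's heuristic, and all names are hypothetical; the point is only the routing of a judgment plus its rationale to a confirming/refuting reviewer:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    label: str      # "relevant" or "non-relevant"
    rationale: str  # text span quoted from the page

def two_stage_label(stage1: Judgment, review) -> str:
    """Stage 2: the reviewer sees both the Stage 1 label and its rationale,
    then either confirms it or returns a corrected judgment."""
    return review(stage1).label

# Hypothetical reviewer heuristic: refute a "relevant" label whose
# rationale never mentions the query topic.
def reviewer(j: Judgment) -> Judgment:
    if j.label == "relevant" and "jaundice" not in j.rationale:
        return Judgment("non-relevant", j.rationale)
    return j

print(two_stage_label(Judgment("relevant", "discusses liver disease"), reviewer))
# -> non-relevant
```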
  25. 25. Results - Accuracy (Two-Stage) • One review achieves the same accuracy as using four extra standard judgments. • Aggregating reviewers reaches the same accuracy as the filtered approaches. [chart compares: 1 Assessor + 1 Reviewer vs. 1 Assessor + 4 Reviewers]
  26. 26. The Big Picture • Transparency – Context for understanding and validating subjective answers. – Convergence on justification-based crowdsourcing. • Improved Accuracy – Rationales make the implicit explicit and hold workers accountable. • Improved Cost-Efficiency – No additional cost for collection once workers are familiar with the task. • Improved Aggregation – Rationales can be used for filtering or aggregating judgments.
  27. 27. Future Work • Dual Supervision: How can we further leverage rationales for aggregation? – Supervised learning over labels/rationales (Zaidan, Eisner, & Piatko, NAACL 2007; a sketch follows below). • Task Design: What about other sequential task designs (e.g., multi-stage)? • Generalizability: How far can we generalize rationales to other tasks, e.g., images? (Donahue & Grauman, “Annotator Rationales for Visual Recognition,” ICCV 2011)
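For the dual-supervision direction, Zaidan, Eisner, & Piatko train a classifier with "contrast examples" built by masking rationale tokens: removing the rationale should make the classifier less confident, so the difference between the full and masked documents is itself a useful pseudo-example. A rough sketch under simplifying assumptions (toy data; a plain LinearSVC stands in for their constrained SVM formulation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
import numpy as np

docs = ["yellow skin and eyes are classic symptoms of jaundice",
        "this page describes the anatomy of the liver"]
rationales = ["yellow skin and eyes", ""]  # assessor-highlighted spans
labels = np.array([1, 0])                  # 1 = relevant

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()

# Build "contrast" documents by deleting rationale tokens.
masked = [" ".join(t for t in d.split() if t not in r.split())
          for d, r in zip(docs, rationales)]
V = vec.transform(masked).toarray()

# Pseudo-examples (full - masked) carry the same label as the original;
# only documents that actually have a rationale contribute one.
has_rat = [i for i, r in enumerate(rationales) if r]
X_aug = np.vstack([X, X[has_rat] - V[has_rat]])
y_aug = np.concatenate([labels, labels[has_rat]])

clf = LinearSVC().fit(X_aug, y_aug)
print(clf.predict(vec.transform(
    ["symptoms of jaundice include yellow eyes"]).toarray()))
```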
  28. 28. Part II: Misinformation & Human-AI Partnerships
  29. 29. “Truthiness” is not a new problem. “Truthiness is tearing apart our country... It used to be, everyone was entitled to their own opinion, but not their own facts. But that’s not the case anymore.” – Stephen Colbert (Jan. 25, 2006) “You furnish the pictures and I’ll furnish the war.” – William Randolph Hearst (Jan. 25, 1898)
  30. 30. Information Literacy. National Information Literacy Awareness Month, US Presidential Proclamation, October 1, 2009: “Though we may know how to find the information we need, we must also know how to evaluate it. Over the past decade, we have seen a crisis of authenticity emerge. We now live in a world where anyone can publish an opinion or perspective, true or not, and have that opinion amplified…”
  31. 31. [image slide]
  32. 32. Automatic Fact Checking [image slide]
  33. 33. [image slide]
  34. 34. [image slide]
  35. 35. Design Challenge: How to interact with ML models?
  36. 36. Brief Case Study: Facebook (a simpler case: journalist fact-checking)
  37. 37. Tessa Lyons, a Facebook News Feed product manager: “…putting a strong image, like a red flag, next to an article may actually entrench deeply held beliefs — the opposite effect to what we intended.”
  38. 38. Alternative Design [screenshot]
  39. 39. Another Alternative Design [screenshot]
  40. 40. AI & HCI for Misinformation. “A few classes in ‘use and users of information’ … could have helped social media platforms avoid the common pitfalls of the backfire effect in their fake news efforts and perhaps even avoided … mob rule, virality-based algorithmic prioritization in the first place.” https://www.forbes.com/sites/kalevleetaru/ (August 5, 2019)
  41. 41. Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking Joint work with An Thanh Nguyen (UT), Byron Wallace (Northeastern), & more… Matt Lease School of Information @mattlease University of Texas at Austin ml@utexas.edu Slides: slideshare.net/mattlease
  42. 42. [image slide]
  43. 43. Automatic Fact-Checking [image slide]
  44. 44. Design Challenges • Fair, Accountable, & Transparent (AI) – Why trust a “black box” classifier? – How do we reason about potential bias? – Do people really only want to know “fact” vs. “fake”? – How to integrate human knowledge/experience? • Joint AI + Human Reasoning, Error Correction, Personalization • How to design strong Human + AI Partnerships? – Horvitz, CHI’99: mixed-initiative design – Dove et al., CHI’17: “Machine Learning As a Design Material”
  45. 45. Nguyen et al., AAAI’18 • Crowdsourced stance labels – Hybrid AI + Human (near real-time) prediction • Joint graphical model of stance, veracity, & annotators – Interaction between variables – Interpretable • Source code on GitHub
  46. 46. Demo! Nguyen et al., UIST’18
  47. 47. Primary Interface [screenshot]
  48. 48. Source Reputation [screenshot]
  49. 49. System Architecture • Google Search API • Two logistic regression models – average accuracy > 70%, but with variance – Stance (Ferreira & Vlachos ’16), with the same features – Veracity (Popat et al. ’17) – Scikit-learn, L1 regularization, liblinear solver, & default parameters
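The slide pins down the classifier configuration (scikit-learn logistic regression, L1 regularization, liblinear solver, default parameters). A minimal sketch of what fitting the stance model could look like; the TF-IDF featurization and toy data below are placeholders, not the Ferreira & Vlachos feature set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# L1-regularized logistic regression with the liblinear solver,
# as stated on the slide; featurization is a stand-in.
stance_model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear"),
)

headlines = ["Study confirms claim X", "Experts deny claim X"]
stances = ["for", "against"]
stance_model.fit(headlines, stances)
print(stance_model.predict(["New report supports claim X"]))
```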
  50. 50. Data: Train & Test. Emergent (Ferreira & Vlachos ’16). [table: accuracy of the prediction models]
  51. 51. Findings • User studies on MTurk, ~100 participants per experiment • Experiment 1: whether or not participants see model predictions
  52. 52. 2 Groups: Control vs. System [screenshot]
  53. 53. 2 Groups: Control vs. System [screenshot]
  54. 54. Summary of Findings • User studies on MTurk, ~100 participants per experiment • Experiment 1: whether or not reputation/stance is shown – Predict claim veracity before & after seeing model predictions
  55. 55. 2 Groups: Control vs. System [screenshot]
  56. 56. 2 Groups: Control vs. System [screenshot]
  57. 57. Summary of Findings • User studies on MTurk, ~100 participants per experiment • Experiment 1: whether or not reputation/stance is shown – Predict claim veracity before & after seeing model predictions – Result: human accuracy roughly follows model accuracy
  58. 58. Summary of Findings • User studies on MTurk, ~100 participants per experiment • Experiment 1: whether or not reputation/stance is shown – Predict claim veracity before & after seeing model predictions – Result: human accuracy roughly follows model accuracy • Experiment 2: whether or not the user can override predictions – Predict claim veracity and give confidence in the prediction – Result: no statistically significant difference on average; interaction sometimes hurts
  59. 59. What about user bias?
  60. 60. New form of echo chamber? Interaction promotes transparency & trust, but can affirm user bias.
  61. 61. CobWeb: A Research Prototype for Exploring User Bias in Political Fact-Checking. Anubrata Das, Kunjan Mehta, and Matthew Lease. SIGIR 2019 Workshop on Fair, Accountable, Confidential, Transparent, and Safe Information Retrieval (FACTS-IR), July 25, 2019.
  62. 62. We introduce “political leaning” (bias) as a function of the adjusted reputation of the news sources. [screenshot of the CobWeb interface]
  63. 63. A user can alter a source’s reputation: changing reputation scores changes both the predicted correctness and the overall political leaning. [screenshot: imaginary sources with bias, stance, and reputation controls]
  64. 64. A user can also alter the overall political leaning: changing the overall leaning changes both the predicted correctness and the source reputations. [screenshot: the same controls, adjusted from the leaning side]
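One plausible reading of the arithmetic behind these linked sliders; the prototype's actual formulas are not shown on the slides, so everything below is an assumption. Predicted correctness is modeled as a reputation-weighted vote over article stances, and overall leaning as a reputation-weighted average of source leanings, so adjusting a reputation moves both quantities, as the slides describe:

```python
# Each article: stance (+1 supports the claim, -1 refutes it), a
# user-adjustable source reputation in [0, 1], and a fixed source
# political lean in [-1 (left), +1 (right)]. All values illustrative.
def predicted_correctness(articles):
    score = sum(a["stance"] * a["reputation"] for a in articles)
    total = sum(a["reputation"] for a in articles)
    return score / total if total else 0.0   # > 0 leans "true"

def overall_leaning(articles):
    lean = sum(a["lean"] * a["reputation"] for a in articles)
    total = sum(a["reputation"] for a in articles)
    return lean / total if total else 0.0    # < 0 left, > 0 right

articles = [
    {"stance": +1, "reputation": 0.9, "lean": -0.2},
    {"stance": -1, "reputation": 0.4, "lean": +0.6},
]
print(predicted_correctness(articles), overall_leaning(articles))
```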
  65. 65. Pilot results: ● 8/10 participants correctly identified the correctness of an imaginary claim. ● 6/10 found the change in overall political leaning as a function of a change in reputation score intuitive. ● 6/10 found the relationship between a change in overall political leaning and the source reputation scores intuitive.
  66. 66. Post-task Questionnaire: 01 Accuracy – Knowing the overall political leaning would help me accurately predict the truthfulness of claims. 02 Effectiveness – Knowing the overall political leaning would enable me to effectively predict the truthfulness of claims. 03 Ease of Use – Knowing the overall political leaning would make it easier to predict the truthfulness of claims. 04 Usefulness – Communicating overall political leaning is useful in predicting truthfulness of claims.
  67. 67. Pilot results (continued): ● 8/10 participants correctly identified the correctness of an imaginary claim. ● 8/10 found it useful to have an indicator of their overall political leaning in a claim-checking scenario. ● 6/10 found the change in overall political leaning as a function of a change in reputation score intuitive. ● 6/10 found the relationship between a change in overall political leaning and the source reputation score intuitive.
  68. 68. 01 Contribution – An interface that communicates a user’s own political biases in a fact-checking context. 02 Conclusion – Communicating a user’s own bias in fact-checking helps the user assess the credibility of a claim. 03 Future Work – Bias detection on real data; an extensive user study; evaluation design for search bias. CobWeb: A Research Prototype for Exploring User Bias in Political Fact-Checking (working paper; more to come!) https://arxiv.org/abs/1907.03718 • A Conceptual Framework for Evaluating Fairness in Search https://arxiv.org/abs/1907.09328
  69. 69. Wrap-up on Misinformation • Fact-checking is more than black-box prediction: interaction, exploration, trust – A useful problem for grounding work on Fair, Accountable, & Transparent (FAT) AI • Mixed-initiative human + AI partnership for fact-checking – backend NLP + front-end interaction • Fact Checking & IR (Lease, DESIRES’18) – How to diversify search results for controversial topics? – Information evaluation (e.g., vaccination & autism) • Potential harm as well as good – Potential added confusion, data / algorithmic bias – Potential for a personal “echo chamber” – Adversarial settings
  70. 70. Thank You! Slides: slideshare.net/mattlease Lab: ir.ischool.utexas.edu
