Adventures in Crowdsourcing: Toward Safer Content Moderation & Better Supporting Complex Annotation Tasks

Talk given at Delft University speaker series on "Crowd Computing & Human-Centered AI" (https://www.academicfringe.org/). November 23, 2020. Covers two 2020 works:

(1) Anubrata Das, Brandon Dang, and Matthew Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. In Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2020.

(2) Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of the Web Conference, pages 1807-1818, 2020.

1. Matt Lease, School of Information, The University of Texas at Austin. Adventures in Crowdsourcing: Toward Safer Content Moderation & Better Supporting Complex Annotation Tasks. Lab: ir.ischool.utexas.edu | @mattlease | Slides: slideshare.net/mattlease
2. Roadmap
• Context: UT Good Systems & iSchool
• Two parts to today's talk:
  – Content Moderation
  – Aggregating Complex Annotations
3. Good Systems: an 8-year, $10M UT Austin Grand Challenge. Goal: design a future of Artificial Intelligence (AI) technologies to meet society's needs and values. http://goodsystems.utexas.edu
4. What's an Information School? "The place where people & technology meet" ~ Wobbrock et al., 2009. "iSchools" now exist at over 100 universities around the world.
5. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. Anubrata Das, Brandon Dang, and Matthew Lease, School of Information, The University of Texas at Austin. Lab: ir.ischool.utexas.edu | @mattlease | Slides: slideshare.net/mattlease
6. Today's Talk: Content Moderation
- Social media platforms are hubs of user-generated content
- Some types of content are unacceptable or may cause harm: pornography & nudity, depictions of violence, hate speech, mis/disinformation
- What is considered acceptable varies by platform and region
- Further issues of free speech & due process arise in content removal & remediation, e.g., Moderate Globally, Impact Locally: The Global Impacts of Content Moderation (Yale, Nov. 2020)
See also: Alon Halevy et al. "Preserving integrity in online social networks." arXiv preprint, September 25, 2020.
7. Scale of Content Moderation (Facebook, YouTube). Source: Paul M. Barrett (2020). Who Moderates the Social Media Giants? A Call to End Outsourcing.
8. Can't we just use AI?
• High cost of errors -> very high accuracy required
• Continually evolving content and moderation policies, plus regional variants, cultural issues, and adversarial attacks
• While AI systems are often advertised and perceived as fully automated, in practice human labor is typically required and often hidden: Gray and Suri (2019) "ghost work", Ekbia and Nardi (2014) "heteromation", Irani and Silberman (2013) "invisible work"
• Human moderators today: Facebook ~15K, YouTube ~10K
• No free lunch: human annotators are still needed to create training data
9. Barr & Cabrera, ACM Queue 2006: "Software developers with innovative ideas for businesses and technologies are constrained by the limits of artificial intelligence… If software developers could programmatically access and incorporate human intelligence into their applications, a whole new class of innovative businesses and applications would be possible. This is the goal of Amazon Mechanical Turk… people are freer to innovate because they can now imbue software with real human intelligence."
11. Implications for Moderators
"The psychological effects of viewing harmful content is well documented, with reports of moderators experiencing posttraumatic stress disorder (PTSD) symptoms and other mental health issues as a result of the disturbing content they are exposed to." (Cambridge Consultants, 2019)
"From my own interviews with more than 100 moderators… a significant number [get PTSD]. And many other employees develop long-lasting mental health symptoms that stop short of full-blown PTSD, including depression, anxiety, and insomnia." (Casey Newton, 2020)
Volume quotas (akin to a call center): "constant measurement for accuracy is as pressurizing as a quota" (Dwoskin 2019). Image source: The Verge.
12. The Great Irony: the sort of task we most want an algorithm to do (because it is emotionally disturbing) is exactly what people are doing, because the algorithm isn't good enough.
13. But Who Protects the Moderators? (HCOMP 2018). Brandon Dang (School of Information), Martin J. Riedl (School of Journalism), and Matthew Lease (School of Information); both students contributed equally. The University of Texas at Austin. AAAI HCOMP & ACM Collective Intelligence, July 2018, Zurich, Switzerland.
14. Research Question: by revealing less of an image, can we reduce the emotional labor of image moderation without compromising moderator accuracy and efficiency?
15. Design and Demo: http://ir.ischool.utexas.edu/CM/demo/ Dang, Brandon, Martin J. Riedl, and Matthew Lease. "But who protects the moderators? The case of crowdsourced image moderation." arXiv preprint arXiv:1804.10999 (2018). Code: https://github.com/budang/content-moderation
16. Exposure and Control
"Shielding moderators from harm begins with giving them more control of what they're seeing and how they're seeing it, so just the existence of ...preferences helps" (Sullivan 2019)
"Scientifically, do we know how much [exposure] is too much? The answer is no, we don't... If there's something that were to keep me up at night... it's that question" (Facebook psychologist Chris Harrison)
"Finding the right balance between content reviewer well-being and resiliency, quality, and productivity is very challenging at the scale we operate in. We are continually working to get this balance right." (Facebook's Carolyn Glanville)
Source: https://images.fastcompany.net/image/upload/w_596,c_limit,q_auto:best,f_auto/wp-cms/uploads/2019/06/Quick-Settings.png
17. Exposure and Control: industry is moving towards establishing best practices for providing control & tools.
18. Source: https://docs.microsoft.com/en-us/azure/cognitive-services/content-moderator/images/video-review-default-view.png; https://docs.microsoft.com/en-us/azure/cognitive-services/content-moderator/
19. Exposure and Control
- Industry is moving towards establishing best practices for providing control & tools
- Such interventions include grayscaling, muting videos, and blurring
- How effective such practices are is not well understood
- Google: Ramakrishnan and Karunakaran (HCOMP 2019) report that grayscaling of images and videos reduces harm; they also study static blurring
20. HCOMP'20: MTurk Moderation Task
21. Survey: Well-being and Usability
01 Positive and Negative Experience: 5-point Likert scale on how often workers experience emotions such as positive, negative, good, bad, pleasant, unpleasant (SPANE; Diener et al., 2010)
02 Positive and Negative Affect: 7-point Likert scale on what emotions workers are currently feeling (I-PANAS-SF; Thompson 2007)
03 Emotional Exhaustion: slightly modified version of the emotional exhaustion scale (Wharton 1993; Cates and Howe 2015)
04 Usefulness: perceived usefulness and perceived ease of use (Davis 1989; Venkatesh and Davis 2000)
22. Experiment
- Random sample of 60 synthetic & real images across categories: 180 total images
- Divided into groups of 9, balanced over classes
- 20 HITs, 5 workers per HIT
- Workers restricted to a single HIT
- Adult content qualification, >98% approval rate with 300+ submitted HITs
- Paid $7.25/hour
23. Results
Performance: accuracy; time taken; effort* (# clicks, # mouse movements)
Well-being: worker comfort, experience, affect, emotional exhaustion, usefulness
*Brandon Dang, Miles Hutson, & Matthew Lease. MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Amazon Mechanical Turk. HCOMP 2016. https://github.com/budang/turkey-lite
24. Speed and accuracy are not impacted by interactive blurring (worker accuracy; time)
25. Similar effort across designs, except for "click" (# clicks; # mouse movements)
26. Slider is perceived to be the most usable interface (perceived usefulness; perceived ease of use)
27. Hover is perceived as most comfortable
28. SPANE-B score for all interventions except click is higher than the unblurred baseline (positive and negative experience; overall experience)
29. Overall emotional exhaustion is lowest for hover
30. Mean positive affect increases with increasing level of blur (positive and negative affect)
31. Summary: Hover is the Champion for Adoption (B: baseline; **p < 0.05, ***p < 0.005)
- Slider and hover are both top performers
- Hover shows significantly lower emotional exhaustion with comparatively high accuracy
- If the key goal is to keep accuracy intact & reduce emotional impact, we recommend the hover design
32. Contribution: proposed and extensively evaluated an intervention that improves moderator well-being.
Conclusion: as opposed to static blurring, which decreases accuracy, interactive blurring improves well-being without sacrificing accuracy or speed.
Future work: qualitative analysis; intelligent unblurring; early warning for severity.
33. Modeling and Aggregation of Complex Annotations via Annotation Distances. Alex Braylan (Dept. of Computer Science) and Matthew Lease (School of Information), The University of Texas at Austin. ml@utexas.edu | @mattlease | Slides: slideshare.net/mattlease. Encore: Dec 11 talk @ NeurIPS Crowd Science Workshop (https://research.yandex.com/workshops/crowd/neurips-2020). Code & Data: https://github.com/Praznat/annotationmodeling
34. Simple annotation & aggregation
• Classification: sentiment analysis; image categorization
• Ordinal rating: product & movie reviews; search relevance
• Multiple-choice selection: quizzes
Aggregation:
• Crowdsourcing: quality control
• Experts: wisdom of crowds
• Goal: select the best label available for each item (no label fusion)
35. What's the capital of Texas? Responses: Austin, Austin, Houston.
36. What's the capital of Texas? Responses: Austin, Austin, Houston. Majority vote selects Austin.
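For simple categorical labels like these, aggregation is nearly a one-liner; a minimal sketch in Python (the example data comes from the slide, the helper name is mine):

```python
from collections import Counter

def majority_vote(labels):
    """Pick the most frequent label among annotator responses."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["Austin", "Austin", "Houston"]))  # -> Austin
```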
37. Caption this image: "A cat is eating" / "The cat eats" / "A beautiful picture"
38. Caption this image: when majority voting falls short. Problem: with a large label space, exact match doesn't work! ("A cat is eating" / "The cat eats" / "A beautiful picture")
39. What about complex annotations? Ranked lists, parse trees, image captions, range sequences. (A1: A cat is eating; A2: The cat eats; A3: A beautiful picture)
40. Outline • Prior work • Approach • Experiments • Conclusion
41. Aggregating Simple Labels
• Hundreds of papers
• Multiple benchmarking studies
• Rich body of Bayesian modeling: Dawid-Skene, MACE, Hierarchical Dawid-Skene, Item Difficulty, Logistic Random Effects (source: Paun et al. 2018, "Comparing Bayesian models of annotation")
• But general-purpose aggregation models for simple labels don't support complex labels!
42. Task-specific models
• Pros: task specialization maximizes accuracy
• Cons: need a new model for every task; complicated, difficult to formulate
Examples: Nguyen et al. 2017 (sequences); Lin, Mausam, and Weld 2012 (math)
43. Task-specific workflows
• Pros: empower workers for complex tasks
• Cons: need a new workflow for every task; complicated, difficult to formulate
Examples: Noronha et al. 2011 (image analysis); Lasecki et al. 2012 (transcription)
44. Our goals
• We want aggregation for complex data types, building on ideas from simple label aggregation models
• We want to generalize across many labeling tasks: can we reduce the problem to a common, simpler state space?
45. Outline • Prior work • Approach • Experiments • Conclusion
46. Key Insight: partial credit matching via a task-specific distance function
• Encapsulate task-specific label features into a requester-provided distance function
• Model annotation distances rather than the annotations themselves
• Distance functions already exist for most tasks, because people need evaluation functions to compare predicted labels against gold
47. Distance functions must satisfy non-negativity, symmetry, and the triangle inequality. The slide's table checks these three properties for two data types, free text and rankings, listing BLEU(x, y) as an example evaluation function for free text; all three properties are marked as holding for both example distance functions.
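One common way to derive a distance from a similarity-style evaluation function such as BLEU is to subtract from 1 and symmetrize. This construction is an illustration of mine, not one specified on the slide; note that symmetry and non-negativity then hold by construction, while the triangle inequality depends on the underlying score:

```python
def distance_from_similarity(sim_fn):
    """Wrap a similarity score in [0, 1] (e.g., a BLEU-style metric) as a
    symmetric distance: d(x, y) = 1 - mean(sim(x, y), sim(y, x))."""
    def dist(x, y):
        if x == y:
            return 0.0  # d(x, x) = 0, and d is never negative
        return 1.0 - 0.5 * (sim_fn(x, y) + sim_fn(y, x))
    return dist
```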
48-50. Calculate distances. Example task: text annotation; example distance function: string edit distance. Among the captions "a cat is eating", "cat is eating", "the cat eats", and "a beautiful picture", the three cat captions are close to one another (distances 0.05, 0.1, 0.1), while "a beautiful picture" is far from all of them (0.8, 0.82, 0.82).
51. All tasks reduce to matrices of annotation distances (e.g., A1: A cat is eating; A2: The cat eats; A3: A beautiful picture, with pairwise distances 0.1, 0.6, and 0.3).
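A sketch of that reduction for the caption example, using Python's standard-library difflib similarity as a stand-in for the string edit distance named on the earlier slides (the slides' exact distance values come from the authors' own function, so the numbers below will differ):

```python
from difflib import SequenceMatcher

def string_distance(a, b):
    """Normalized text distance in [0, 1]: 1 - similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

captions = ["a cat is eating", "the cat eats", "a beautiful picture"]

# Every task reduces to this: a matrix of pairwise annotation distances.
D = [[string_distance(x, y) for y in captions] for x in captions]
for row in D:
    print(" ".join(f"{d:.2f}" for d in row))
```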
52. How to aggregate given distances: a local selection model, a global selection model, or a combination of the two.
53. Local approach: Smallest Avg Distance
• For each item: (1) compute the average distance between annotations for the item, then (2) choose the annotation with the smallest average distance
• A generalization of majority vote
• Assumes independence between items
• The local approach does not model annotator reliability
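A minimal sketch of this local rule as described on the slide (function and variable names are mine):

```python
def smallest_avg_distance(annotations, dist_fn):
    """Local selection: return the annotation with the smallest average
    distance to the other annotations on the same item."""
    if len(annotations) == 1:
        return annotations[0]
    def avg_dist(a):
        return sum(dist_fn(a, b) for b in annotations if b is not a) / (len(annotations) - 1)
    return min(annotations, key=avg_dist)
```

With exact-match distance (0 if equal, 1 otherwise) this recovers majority vote, which is why the slide calls it a generalization.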
54. Global approach: Best Available User
• For each annotator: score by average distance over the full dataset
• For each item: choose the label from the best-scoring annotator
• Assumes fixed annotator reliability
• The global approach does not model how well annotators did on specific items
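And a sketch of the global rule, again only as described on the slide (the dict-of-dicts input format is my assumption):

```python
from collections import defaultdict

def best_available_user(item_annotations, dist_fn):
    """Global selection: score each annotator by average distance to
    co-annotators over the full dataset (lower = more reliable), then
    take each item's label from its best-scoring annotator.

    item_annotations: {item_id: {annotator_id: label}}
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for labels in item_annotations.values():
        for u, label_u in labels.items():
            for v, label_v in labels.items():
                if u != v:
                    totals[u] += dist_fn(label_u, label_v)
                    counts[u] += 1
    score = {u: totals[u] / counts[u] for u in totals}
    return {item: labels[min(labels, key=lambda u: score.get(u, float("inf")))]
            for item, labels in item_annotations.items()}
```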
55. Can we get the best of both worlds?
• Want a method that combines best available user (global) and smallest avg distance (local)
• Should build on the rich history of work on Bayesian annotation modeling
• Need a principled framework for modeling annotation distance matrices (weights + votes -> weighted voting)
56. Multidimensional Annotation Scaling (MAS)
• Based on Multidimensional Scaling (Kruskal & Wish 1978)
• A probabilistic model of multi-item distance matrices
• Hierarchical Bayesian: additional learned parameters represent crowd effects such as worker reliability
57-58. MAS Objective 1: Likelihood. The Multidimensional Scaling objective is D_iuv ~ N(||ε_iu − ε_iv||, σ), where D_iuv is the observed distance between annotations u and v on item i, ε_iu is the embedding of annotation u on item i, and σ is the error scale. (Illustrated with the caption distances 0.05, 0.1, 0.1, 0.8, 0.82, 0.82.)
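The full MAS model is hierarchical Bayesian (see the paper and the authors' repository); the likelihood term alone is classical MDS, which the least-squares sketch below approximates by gradient descent on one item's distance matrix. This simplification is mine, not the authors' inference procedure:

```python
import numpy as np

def fit_embeddings(D, dim=2, steps=2000, lr=0.01, seed=0):
    """Fit embeddings e_u so that ||e_u - e_v|| tracks the observed
    distances D[u, v], i.e., the mode of D_uv ~ N(||e_u - e_v||, sigma)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    E = rng.normal(scale=0.1, size=(n, dim))
    for _ in range(steps):
        diff = E[:, None, :] - E[None, :, :]      # pairwise difference vectors
        dist = np.linalg.norm(diff, axis=-1)      # current embedding distances
        np.fill_diagonal(dist, 1.0)               # avoid 0/0 on the diagonal
        resid = dist - D                          # stress residuals
        np.fill_diagonal(resid, 0.0)
        grad = (resid / dist)[..., None] * diff   # gradient of squared stress
        E -= lr * grad.sum(axis=1)
    return E
```

In the full model, worker reliability parameters and the per-item prior on the next slides shape these embeddings; the selected annotation is then the one whose embedding best agrees with the inferred consensus.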
59-63. MAS Objective 2: Prior. Illustrated with the caption example ("a cat is eating", "cat is eating", "the cat eats", "a beautiful picture"): the prior ties each item's annotation embeddings to a latent "pseudo-gold" reference point.
64. Outline • Prior work • Approach • Experiments • Conclusion
65. Tasks & datasets
Synthetic datasets:
• Syntactic parse trees (distance function: evalb)
• Ranked lists (distance function: Kendall's tau)
Real datasets:
• Biomedical text sequences (distance function: span F1; Nguyen et al. 2017)
• Urdu-English translations (distance function: GLEU; Zaidan and Callison-Burch 2011)
66. Methods
Baselines:
• Random User (RU): pick one label randomly
• ZenCrowd (ZC) (Demartini et al. 2012): weighted voting based on exact match (rare!)
• Crowd Hidden Markov Model (CHMM) (Nguyen et al. 2017): sequence annotation task only
Upper bound:
• Oracle (OR): always picks the best label. Even if 5 workers answer, aggregation is limited by the best answer any of them gave.
67-70. Results on diverse complex label datasets ("-" = method not applicable):

Task          Metric   RU      ZC      CHMM    MAS     Oracle
Translations  GLEU     0.185   0.188   -       0.217   0.246
Sequences     F1       0.561   0.569   0.702   0.709   0.827
Parses        EVALB    0.812   0.819   -       0.932   0.939
Rankings               0.491   0.495   -       0.710   0.724

MAS aggregation is the best way to get closer to ground truth with no model alteration between datasets.
71. Conclusion
• Goal: a general-purpose probabilistic model to aggregate complex annotations; categorical-based methods are insufficient, and custom models are difficult to design for new annotation types
• Solution: model annotation distances via task-specific distance functions, transforming the problem into a general-purpose variable space
• Multidimensional Annotation Scaling (MAS) allows unsupervised weighted voting with inferred annotator reliability
• Not covered in talk (see paper): semi-supervised learning; partial credit
72. Ongoing work
• Generalization to more tasks (e.g., image bounding boxes & keypoints)
• Generalization to simple annotation tasks ("one ring to rule them all")
• Support for multiple latent objects per item
• Merging annotations rather than selecting the best one (e.g., guessing the weight of an ox); MAS vs. a non-embedding EM model, varying noise, fewer annotations, …
Code & Data: https://github.com/Praznat/annotationmodeling
73. Thank you! Matt Lease (University of Texas at Austin). Lab: ir.ischool.utexas.edu | @mattlease | Slides: slideshare.net/mattlease. We thank our many talented crowd workers for their contributions to our research!
