Talk given at Delft University speaker series on "Crowd Computing & Human-Centered AI" (https://www.academicfringe.org/). November 23, 2020. Covers two 2020 works:
(1) Anubrata Das, Brandon Dang, and Matthew Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. In Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2020.
(2) Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of the Web Conference, pages 1807--1818, 2020.
Recommender Systems and Misinformation: The Problem or the Solution? - Alejandro Bellogin
Presentation at the Workshop on Online Misinformation- and Harm-Aware Recommender Systems, co-located with the 14th ACM Conference on Recommender Systems (RecSys 2020).
Deepfake detection is a critical and evolving field aimed at identifying and mitigating the risks associated with manipulated multimedia content created using artificial intelligence (AI) techniques. Deepfakes involve the use of advanced machine learning algorithms, particularly generative models like Generative Adversarial Networks (GANs), to create highly convincing fake videos, audio recordings, or images that can deceive viewers into believing they are genuine.
One prevalent approach to deepfake detection involves leveraging advancements in computer vision and pattern recognition. Researchers and developers employ sophisticated algorithms to analyze various visual and auditory cues that may indicate the presence of deepfake manipulation. For instance, anomalies in facial expressions, inconsistent lighting and shadows, or unnatural lip sync in videos can be indicative of deepfake content. Additionally, deepfake detectors may examine metadata, such as inconsistencies in timestamps or editing artifacts, to identify alterations in the content's authenticity.
Machine learning plays a central role in deepfake detection, with models being trained on diverse datasets that include both authentic and manipulated content. Supervised learning techniques involve training models on labeled datasets, enabling them to recognize patterns associated with deepfake manipulation. Researchers also explore unsupervised and semi-supervised learning methods, allowing detectors to identify anomalies without explicit labels for every training instance.
As the field progresses, deepfake detectors are increasingly adopting advanced neural network architectures to enhance their accuracy. Ensembling multiple models, each specialized in detecting specific types of manipulations, is another strategy employed to improve overall detection performance. Furthermore, the integration of explainable AI techniques enables better understanding of the detection process and provides insights into the features contributing to the decision-making process of the models.
Despite these advancements, deepfake detection remains a challenging task due to the constant evolution of deepfake generation techniques. Adversarial training, where detectors are trained on data that includes adversarial examples, is one method to improve robustness against sophisticated manipulation attempts. Continuous research efforts are required to stay ahead of emerging deepfake technologies and to develop detectors capable of identifying novel manipulation methods.
In conclusion, deepfake detection is a multidimensional challenge that requires a combination of computer vision, machine learning, and data analysis techniques. Researchers and practitioners are actively developing and refining methods to detect manipulated content by examining visual and auditory cues, leveraging machine learning models, and staying vigilant against evolving deepfake technologies. As the threat landscape evolves, ongoing innovation will be needed to keep detection capabilities ahead of increasingly sophisticated manipulation techniques.
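As a minimal illustration of the ensembling strategy mentioned above, the sketch below averages the fake-probability scores of several hypothetical specialized detectors and thresholds the result; the scores, threshold, and detector roles are illustrative assumptions, not any particular system's design.

```python
import numpy as np

# Hypothetical per-model "probability of fake" scores for three videos.
# In practice each row would come from a separately trained detector
# (e.g., one specialized in face-swap artifacts, one in lip-sync errors).
scores = np.array([
    [0.91, 0.12, 0.55],   # detector A
    [0.88, 0.20, 0.40],   # detector B
    [0.95, 0.05, 0.70],   # detector C
])

# Simple score-level fusion: average the detectors' probabilities,
# then threshold to obtain the final fake/real decision.
fused = scores.mean(axis=0)
decisions = fused > 0.5
print(fused, decisions)   # e.g. [0.913 0.123 0.55] [ True False  True]
```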
1. Virtual reality is an artificial 3D environment that is created with software and presented to users in a way that makes them feel like they are experiencing a real environment.
2. Early VR systems included head mounted displays, tracking systems, and input devices. Modern VR uses these components along with powerful computers and sophisticated sensors.
3. VR has applications in many fields including gaming, product design, architecture, medicine, and more. However, challenges remain around creating realistic environments, avoiding health issues, and developing natural human interaction.
Depression Analysis from Social Media Data in Bangla Language using Long Shor... - A. Hasib Uddin
Human emotions like depression are inner sentiments that reveal a person's actual behavior. Analyzing and detecting these emotions from people's social activities in the virtual world can be very helpful for understanding their behavior. Existing approaches may be useful for analyzing common sentiments, such as positive, negative, or neutral expressions. However, emotions such as depression are much harder, and sometimes nearly impossible, to analyze with these approaches. In this work, we deployed a Long Short-Term Memory (LSTM) deep recurrent network for depression analysis on Bangla social media data. We created a small dataset of Bangla tweets and stratified it. In this paper, we show the effects of hyper-parameter tuning and how it can help depression analysis on a small Bangla social media dataset. The results show that a 5-layer LSTM of size 128, with batch size 25 and learning rate 0.0001 trained over 20 epochs, achieves high depression-detection accuracy on the stratified dataset with repeated sampling. This result can help psychologists and other researchers detect depression in individuals from their social activities in the virtual world and take measures to prevent harmful outcomes of depression.
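For concreteness, here is a minimal sketch (not the authors' code) of the reported configuration: a 5-layer LSTM of size 128, batch size 25, learning rate 0.0001, 20 epochs. The vocabulary size, sequence length, and data pipeline are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, seq_len = 20000, 50  # assumed preprocessing parameters

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 128),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128),                       # 5th LSTM layer
    layers.Dense(1, activation="sigmoid"),  # depressed vs. not depressed
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=25, epochs=20,
#           validation_data=(x_val, y_val))
```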
This presentation was made by the BugsBusters team from the Faculty of Computers and Information, Helwan University, Egypt.
This document outlines an agenda for a presentation on ubiquitous computing and context-aware computing systems. The presentation will cover ubiquitous computing concepts like devices, research areas, and great moments in the field's history. It will then discuss context-aware computing systems, including definitions of context, real-world usage scenarios, and interest from industry and academia. Sensors that enable context awareness will also be discussed. The presentation will conclude with a Q&A section.
Virtual reality (VR) uses computer-generated environments to simulate realistic experiences. VR headsets use displays and lenses to create stereoscopic 3D images that trick the brain into perceiving the virtual environment as reality. Special hardware like head-mounted displays, motion trackers, and controllers allow users to interact with and immerse themselves in virtual worlds. The history of VR began in the 1930s with science fiction concepts, and continued to evolve with early head-mounted displays in the 1960s. Today, VR is used across many industries like gaming, real estate, education, and more.
Word embeddings are a technique for converting words into vectors of numbers so that they can be processed by machine learning algorithms. Words with similar meanings are mapped to similar vectors in the vector space. There are two main types of word embedding models: count-based models that use co-occurrence statistics, and prediction-based models like CBOW and skip-gram neural networks that learn embeddings by predicting nearby words. Word embeddings allow words with similar contexts to have similar vector representations, and have applications such as document representation.
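A small sketch of a prediction-based embedding model using gensim's Word2Vec (sg=1 selects skip-gram; sg=0 would give CBOW). The toy corpus and hyperparameters are illustrative assumptions.

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# Train a skip-gram model: each word predicts its context words.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["cat"]                        # 50-dimensional vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest words in the vector space
```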
This document discusses cross-lingual information retrieval. It presents approaches for translating queries from other languages to the document language, including using online machine translation systems and developing a statistical machine translation system. It describes experiments on reranking translations to select the one most effective for retrieval and on adapting the reranking model to new languages. Results show the reranking approach improves over baselines and online translation systems. The document also explores document translation and query expansion techniques.
Artificial intelligence (AI) can be defined as machines that mimic human intelligence and behavior. The document discusses different types of AI, contrasting robots, which are programmed to perform tasks, with AI that can learn from human behavior. Tests for AI are also described, including the Turing Test, which involves determining whether a human or a computer is behind different conversations. Examples of current AI technologies such as drones, cars, and IBM's Watson are provided. While AI is advancing, controversies around job losses, costs, and AI's impact on humanity still need to be addressed as AI continues to grow.
Microsoft HoloLens is a technology that blends virtual content with the real world. The company claims that this so-called computer on the head can process terabytes of data per second, an enormous figure. The technology has a great many applications, more than can be summarized simply here.
Now, the time is not far off when the world will look more like a sci-fi movie.
Sentiment Analysis Using Hybrid Structure of Machine Learning Algorithms - Sangeeth Nagarajan
Sentiment analysis is the process used to determine the attitude, opinion, or emotion expressed by a person about a particular topic. The presentation covers the general approach and different machine learning based classification algorithms. The slides are based on the work "Sentiment analysis using Neuro-Fuzzy and Hidden Markov models of text" by Rustamov S., Mustafayev E., and Clements M. A.
The training content covers:
- Basics of Artificial Intelligence
- Penetration of AI in our daily lives
- Few examples and Use cases
- A brief on how future with AI looks like
This document discusses evaluating hypotheses and estimating hypothesis accuracy. It provides the following key points:
- The accuracy of a hypothesis estimated from a training set may be different from its true accuracy due to bias and variance. Testing the hypothesis on an independent test set provides an unbiased estimate.
- Given a hypothesis h that makes r errors on a test set of n examples, the sample error r/n provides an unbiased estimate of the true error. The variance of this estimate depends on r and n based on the binomial distribution.
- For large n, the binomial distribution can be approximated by the normal distribution. Confidence intervals for the true error can then be derived from the sample error and its standard deviation; a short numeric sketch follows.
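As a concrete instance of the interval construction just described (standard statistics, not code from the source document), with made-up numbers r=12 and n=100:

```python
# 95% confidence interval for the true error of a hypothesis that makes
# r errors on n test examples, using the normal approximation to the binomial.
from math import sqrt

r, n = 12, 100
p_hat = r / n                         # sample error: unbiased estimate of true error
se = sqrt(p_hat * (1 - p_hat) / n)    # standard deviation of the estimate
z = 1.96                              # z-value for 95% confidence
lo, hi = p_hat - z * se, p_hat + z * se
print(f"error = {p_hat:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# error = 0.120, 95% CI = [0.056, 0.184]
```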
Introduction to Artificial Intelligence and Machine Learning - Emad Nabil
Ant colony optimization is an example of taking inspiration from nature for AI. It is inspired by how ants find the shortest path between their colony and a food source. Individual ants deposit pheromones along the paths they follow; other ants are more likely to follow a path with a stronger pheromone concentration and less likely to follow one with a weaker concentration, with the result that the shortest path is identified and reinforced through positive feedback over multiple ant trips between the colony and food source. This decentralized process was abstracted and applied to solve combinatorial optimization problems in computer science.
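A toy sketch of the pheromone feedback loop just described, on a tiny graph with a short and a long route from nest to food. The graph, evaporation rate, and deposit rule are simplified illustrative choices, not a production ACO implementation.

```python
import random

graph = {                              # node -> [(neighbor, edge length)]
    "nest": [("a", 1.0), ("b", 1.0)],
    "a":    [("food", 1.0)],           # short route, total length 2
    "b":    [("c", 1.0)],
    "c":    [("food", 1.0)],           # long route, total length 3
}
pheromone = {(u, v): 1.0 for u, nbrs in graph.items() for v, _ in nbrs}

def walk():
    """One ant walks nest -> food, choosing edges proportional to pheromone."""
    node, path, length = "nest", [], 0.0
    while node != "food":
        nbrs = graph[node]
        weights = [pheromone[(node, v)] for v, _ in nbrs]
        v, w = random.choices(nbrs, weights=weights, k=1)[0]
        path.append((node, v))
        length += w
        node = v
    return path, length

for _ in range(200):                   # many ant trips
    path, length = walk()
    for e in pheromone:                # evaporation weakens all trails
        pheromone[e] *= 0.95
    for e in path:                     # deposit: shorter paths get more
        pheromone[e] += 1.0 / length

print({e: round(p, 2) for e, p in pheromone.items()})
# pheromone on nest->a (short route) should dominate nest->b (long route)
```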
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE - cscpconf
The anonymity of social networks makes it attractive for purveyors of hate speech to mask their criminal activities online, posing a challenge to the world and in particular to Ethiopia. With the ever-increasing volume of social media data, hate speech identification becomes a challenge, aggravating conflict between citizens of nations. The high rate of content production makes it difficult to collect, store, and analyze such big data with traditional detection methods. This paper proposes the application of Apache Spark in hate speech detection to reduce these challenges. The authors developed an Apache Spark based model to classify Amharic Facebook posts and comments into hate and not hate. They employed Random Forest and Naive Bayes for learning, and Word2Vec and TF-IDF for feature selection. Tested by 10-fold cross-validation, the model based on Word2Vec embeddings performed best, with 79.83% accuracy. The proposed method achieves promising results by exploiting Spark's unique features for big data.
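A hedged sketch (not the authors' code) of the kind of Spark ML pipeline described: tokenized posts, Word2Vec features, then a Random Forest classifier. Column names, the data source, and all hyperparameters are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("amharic-hate-speech").getOrCreate()
df = spark.read.json("posts.json")   # assumed schema: text (str), label (0/1)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    Word2Vec(vectorSize=100, minCount=2,
             inputCol="tokens", outputCol="features"),
    RandomForestClassifier(labelCol="label", featuresCol="features",
                           numTrees=100),
])

train, test = df.randomSplit([0.9, 0.1], seed=42)
model = pipeline.fit(train)
preds = model.transform(test)
acc = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(preds)
print(f"accuracy = {acc:.4f}")
```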
Artificial Intelligence: Challenges and Opportunities - Marco Neves
1) The document discusses artificial intelligence (AI) and the challenges and opportunities it presents, focusing on cognitive machines.
2) It explores definitions of intelligence from thinkers like Einstein and Gardner, and discusses early concepts of AI from thinkers like Turing and McCarthy.
3) The document outlines some of the impacts of AI through technologies like the Internet of Things, big data, advanced algorithms, 5G networks, and quantum computing, and how these are changing industries like entertainment, transportation, e-commerce, social media, and hiring.
Artificial intelligence and first order logic - parsa rafiq
The document discusses knowledge representation and first order logic. It defines knowledge representation as how knowledge is encoded in artificial systems. It discusses representing objects, events, performance, meta-knowledge and facts. It also discusses types of knowledge like meta knowledge, heuristic knowledge, procedural knowledge and declarative knowledge. The document then discusses first order logic syntax including logical symbols, terms, formulas, quantifiers and predicates. It also discusses semantics and the uses and history of first order logic.
Presentation by Mark Billinghurst on Collaborative Immersive Analytics at the BDVA conference on November 7th 2017. This talk provides an overview of the topic of Collaborative Immersive Analytics
Deep Reinforcement Learning from Human Preferences - taeseon ryu
For sophisticated reinforcement learning systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. This work explores conveying such goals through human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex reinforcement learning tasks, including Atari games and simulated robot locomotion, without access to a reward function. It reduces the cost of human oversight by requiring feedback on less than 1% of the agent's interactions with the environment. To demonstrate the flexibility of the method, the paper shows that complex novel behaviors can be trained successfully with about an hour of human time.
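A minimal sketch of the core idea behind learning from such comparisons: fit a reward model to pairwise human preferences over trajectory segments with a Bradley-Terry style loss, where P(A preferred) = sigmoid(r(A) - r(B)). The tiny linear model and synthetic data are illustrative assumptions, not the paper's architecture.

```python
import torch

torch.manual_seed(0)
obs_dim = 4
reward_net = torch.nn.Linear(obs_dim, 1)          # r(s): state -> scalar reward
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-2)

# Synthetic preference data: pairs of 10-step segments of obs_dim features;
# label 1.0 means the human preferred segment A over segment B.
seg_a = torch.randn(32, 10, obs_dim)
seg_b = torch.randn(32, 10, obs_dim)
prefs = torch.randint(0, 2, (32,)).float()

for step in range(200):
    ra = reward_net(seg_a).sum(dim=(1, 2))        # total predicted reward of A
    rb = reward_net(seg_b).sum(dim=(1, 2))        # total predicted reward of B
    # Bradley-Terry: P(A preferred) = sigmoid(ra - rb)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(ra - rb, prefs)
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final preference loss: {loss.item():.3f}")
```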
Fuzzy logic is a flexible machine learning technique that mimics human thought by allowing intermediate values between true and false. It provides a mechanism for interpreting and executing commands based on approximate or uncertain reasoning. Unlike binary logic which can only have true or false values, fuzzy logic uses linguistic variables and degrees of membership to represent concepts that may have a partial truth. Fuzzy systems find applications in automatic control, prediction, diagnosis and user interfaces.
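A small sketch of the degrees-of-membership idea: "warm" as a triangular membership function over temperature, giving partial truths between 0 and 1. The breakpoints (15, 22, 30 degrees C) are arbitrary illustrative choices.

```python
def warm(temp_c: float) -> float:
    """Triangular membership: 0 below 15 C, peaks at 22 C, 0 above 30 C."""
    if temp_c <= 15 or temp_c >= 30:
        return 0.0
    if temp_c <= 22:
        return (temp_c - 15) / (22 - 15)   # rising edge
    return (30 - temp_c) / (30 - 22)       # falling edge

for t in (10, 18, 22, 27, 35):
    print(f"{t} C is warm to degree {warm(t):.2f}")
# 18 C is warm to degree 0.43; 22 C to degree 1.00; 27 C to degree 0.38
```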
AI and Robotics: Past, Present and Future - Hongmei He
Abstract: Artificial Intelligence (AI) has been a topic of research since the term was first coined by John McCarthy in 1956. In the last six decades, development of AI has experienced an uneven ride. Recently, the successful application of deep learning in Google AlphaGo triggered a wave of revolutionary advances in AI.
Robotics and AI have developed as inseparable twins. This presentation will briefly trace the history of the relationship between the two, survey various types of robots, and identify the contribution of AI to robot intelligence. In particular, we will consider the robot system architecture and how AI techniques are associated with its various capacities and functions.
Technology is replacing people in many jobs, but also creating new and better work and conditions in some cases. Scientists have estimated that machines could take 50% of our jobs in the next 30 years. Who will own the machines? Join me to explore the future challenges and issues of AI and robotics.
This document appears to be a presentation on sentiment analysis by M. Almenea and M. Albidah. It discusses the need for sentiment analysis to study emotions and opinions expressed in text. It also mentions that companies can use sentiment analysis to understand customer opinions without surveys. The presentation outlines the system design process for sentiment analysis and discusses some of the challenges, such as named entity recognition and language complexity.
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno... - Matthew Lease
This document summarizes a presentation about designing human-AI partnerships for fact-checking misinformation. It discusses using crowdsourced rationales to improve the accuracy and cost-efficiency of annotation tasks. It also addresses challenges in designing interfaces for automatic fact-checking models, such as integrating human knowledge and reasoning to correct errors and account for bias. The goal is to develop mixed-initiative systems where humans and AI can jointly reason and personalize fact-checking.
TOO4TO Module 7 / Artificial Intelligence and Sustainability: Part 3 - TOO4TO
1) The document discusses future predictions about AI including how researchers imagine AI may develop over the next 10 years and some potential negative environmental and social impacts of AI solutions.
2) AI techniques commonly require large amounts of energy to run equipment and process data, which can produce substantial greenhouse gas emissions. Training a single AI system can emit over 110,000 kilograms of carbon dioxide.
3) If AI systems are trained on biased data, they can introduce unwanted biases that lead to discrimination against certain groups. Ensuring AI is developed and applied fairly is important for social well-being.
Presentation given at the Linguistic Data Consortium (LDC), University of Pennsylvania, April 2019. Based on presentations at the 6th ACM Collective Intelligence Conference, 2018 and the 6th AAAI Conference on Human Computation & Crowdsourcing (HCOMP), 2018. Blog post: https://blog.humancomputation.com/?p=9932.
The challenges of the Digital Age create a sea of opportunities for technologists. Developing software transforms the economic, political, cultural, and social reality of countries.
On the one hand, a large part of the population does not know the downside of IT, which does not lessen our great responsibility. On the other hand, technologists do not always know how to make ethical decisions in day-to-day systems development. There is also a long-running discussion about the role of technology in the sustainability of the planet: after all, when is IT good, and when is it bad?
This lecture is an introduction to ethics and sustainability aimed at technologists who want to learn how to position themselves as professionals in the face of the many challenges and opportunities of the 21st century.
Don't look at me that way! - Understanding User Attitudes Towards Data Glasses Usage - EISLab
Data glasses do carry promising potential for hands-free interaction, but also raise various concerns amongst their potential users. In order to gain insights into the nature of those concerns, we investigate how potential usage scenarios are perceived by device users and their peers. We present results of a two-step approach: a focus group discussion with 7 participants, and a user study with 38 participants. In particular, we look into differences between the usage of data glasses and more established devices such as smart phones. We provide quantitative measures for scenario-related social acceptability and point out factors that can influence user attitudes. Based on our quantitative and qualitative results, we derive design implications that might support the development of head-worn devices and applications with an improved social acceptability.
Please cite this work as follows: M. Koelle, M. Kranz, A. Möller: Don't look at me that way! - Understanding User Attitudes Towards Data Glasses Usage. In: Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '15), Copenhagen, Denmark, 2015
This document provides an overview and planning considerations for a mobile learning project. It discusses defining mobile learning and understanding learner needs and behaviors. Key aspects to address in planning include objectives, audience, instructional strategies, content development, technical requirements, evaluation, challenges and opportunities. Testing, sustainability, and taking advantage of mobile capabilities are emphasized. Resources for mobile learning research and tools are also provided.
Violence Detection: Introducing a Machine Learning Based Novel Method - Arindam Paul
Human facial expressions play an important role in identifying human actions or intentions. Facial expressions can represent a specific action of a person, and the pattern of violent behavior strongly depends on the geographic region. We designed an automated system using a convolutional neural network that can detect whether a person has any intention to commit violence. We propose a new method that can identify violent intentions or violent behavior before violence is carried out, using only a small amount of facial expression data captured beforehand. Instead of hand-crafted image features, which are time-consuming and error-prone, we used a CNN (Convolutional Neural Network) as an automated feature selector that can capture the exact facial expressions for training and then predict the target expressions more accurately. We used only facial data from a specific geographic region, which can represent the violent and pre-violence facial patterns of people in the whole region.
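A hedged sketch of a small CNN facial-expression classifier of the kind described (not the authors' architecture). The input size, layer sizes, and binary target (violent intent vs. none) are illustrative assumptions.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),          # grayscale face crops
    layers.Conv2D(32, 3, activation="relu"),  # learned feature extraction
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),    # violent intent vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```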
Explainable AI is not yet Understandable AI - epsilon_tud
Keynote of Dr. Nava Tintarev at RCIS'2020. Decision-making at individual, business, and societal levels is influenced by online content. Filtering and ranking algorithms such as those used in recommender systems are used to support these decisions. However, it is often not clear to a user whether the advice given is suitable to be followed, e.g., whether it is correct, whether the right information was taken into account, or whether the user's best interests were taken into consideration. In other words, there is a large mismatch between the representation of the advice by the system and the representation assumed by its users. This talk addresses why we (might) want to develop advice-giving systems that can explain themselves, and how we can assess whether we are successful in this endeavor. It also describes some of the state of the art in explanations in a number of domains (music, tweets, and news articles) that help link the mental models of systems and people. However, it is not enough to generate rich and complex explanations; more is required in order to understand and be understood. This entails, among other factors, decisions around which information to select to show to people and how to present that information, often depending on the target users and contextual factors.
Digital data is increasingly being used to track and analyze human activities like work, learning, and living. This document discusses how the "datafication" of these areas is redistributing responsibilities between humans and algorithms. It explores issues around accountability, control, and transparency when important decisions are made based on data. The author advocates developing new "literacies" to ensure data practices align with public interests and values, and calls for a posthuman perspective that sees humans and technology as deeply entangled.
The Global Learn Conference in March 2011 featured several presentations on 21st century skills, new media, and elearning:
1. A talk discussed the shift from manual to critical thinking skills in workplaces, and the increase of abstract tasks over routine manual jobs. 21st century skills like collaboration and problem solving are key.
2. A project aims to develop metrics for measuring 21st century skills like collaboration, problem solving, and networking. Games may help develop and assess collaborative skills.
3. Augmented reality is an emerging technology that combines virtual elements with the real world. It is predicted to be adopted within 2-3 years for educational uses.
4. A "no-fuss" model
This panel at CPDP 2020 discussed emotional AI and empathic technologies, focusing on rights, children, and domestication. The panelists were Ben Bland, Frederike Kaltheuner, Giovanna Mascheroni, and Gilad Rosner, moderated by Andrew McStay. They addressed issues such as how children interact with social robots, the liveliness ascribed to such technologies, and concerns about affect recognition capabilities.
The document describes a presentation on introducing a machine learning based method for violence detection. It discusses combining social theory, behavioral theory and statistical reports to build an image dataset for identifying patterns of violence. A convolutional neural network model is used for prediction and is evaluated on a small dataset with over 65% accuracy. The goal is to build an interpretable model for both explaining and predicting violence by analyzing how media portrayal influences behaviors.
Detection and Minimization Influence of Rumor in Social Network - IRJET Journal
The document discusses detection and minimization of the influence of rumors on social networks. It proposes a model called DRIMUX (Dynamic Rumor Influence Minimization with User Experience) that aims to reduce the influence of a rumor by blocking a set of nodes in the network. A dynamic Ising propagation model is used that considers both global rumor characteristics and individual user tendencies. Additionally, the model incorporates a constraint on interference time to maintain user experience utility - nodes are only blocked for a tolerance time threshold to avoid decreasing the overall network utility. Algorithms based on survival theory and maximum probability principles are developed to formulate the problem and provide solutions.
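A much-simplified, illustrative stand-in for the node-blocking idea: simulate rumor spread with an independent-cascade model and greedily block the nodes whose removal most reduces expected spread. DRIMUX itself uses a dynamic Ising propagation model with survival-theory analysis and a tolerance-time constraint, none of which is modeled here; the graph and parameters are arbitrary.

```python
import random
import networkx as nx

def expected_spread(g, seeds, blocked, p=0.1, trials=100):
    """Monte Carlo estimate of rumor reach under independent cascade."""
    total = 0
    for _ in range(trials):
        active = set(seeds) - blocked
        frontier = list(active)
        while frontier:
            u = frontier.pop()
            for v in g.neighbors(u):
                if v not in active and v not in blocked and random.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / trials

g = nx.barabasi_albert_graph(200, 3, seed=1)   # toy scale-free network
seeds, blocked = [0, 1], set()                 # rumor sources
for _ in range(5):                             # block 5 nodes greedily
    candidates = (n for n in g if n not in blocked and n not in seeds)
    best = min(candidates,
               key=lambda n: expected_spread(g, seeds, blocked | {n}))
    blocked.add(best)

print("blocked nodes:", blocked)
print("expected spread:", expected_spread(g, seeds, blocked))
```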
Breakout 3. AI for Sustainable Development and Human Rights: Inclusion, Diver... - Saurabh Mishra
This group reviewed data and measurements indicating the positive potential of AI to serve the Sustainable Development Goals (SDGs). Alongside these optimistic inquiries, the group also investigated the risks of AI in areas such as privacy, vulnerable populations, human rights, and workplace and organizational policy. The socio-political consequences of AI raise many complex questions which require continued rigorous examination.
Smart Data for you and me: Personalized and Actionable Physical Cyber Social ... - Amit Sheth
Featured Keynote at Worldcomp'14, July 2014: http://www.world-academy-of-science.org/worldcomp14/ws/keynotes/keynote_sheth
Video of the talk at: http://youtu.be/2991W7OBLqU
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there is rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is human health, fitness, and well-being. Consider, for instance, understanding the reasons for and avoiding an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or Internet of Things around humans, on the humans, and inside/within the humans), public health signals (information coming from the healthcare system such as hospital admissions), and population health signals (such as Tweets by people related to asthma occurrences and allergens, Web services providing pollen and smog information, etc.). However, no individual has the ability to process all these data without the help of appropriate technology, and each human has a different set of relevant data!
In this talk, I will forward the concept of Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If I am an asthma patient, for all the data relevant to me with the four V-challenges, what I care about is simply, “How is my current health, and what is the risk of having an asthma attack in my personal situation, especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP.
For harnessing volume, I will discuss the concept of Semantic Perception, that is, how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss experience in using agreement represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss somewhat more recent work on Continuous Semantics, which seeks to use dynamically created models of new objects, concepts, and relationships, using them to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city. I will present examples from a couple of these.
University Public Driven Applications - Big Data and Organizational Design - maria chiara pettenati
This document discusses improving access to and use of big data for university and public applications. It summarizes the discussions of a working group on this topic. The group examined current approaches to big data, potential future applications, and challenges. Recommendations focus on developing interdisciplinary education programs to train experts, providing open access to large datasets, and establishing frameworks and standards to support big data analysis. The goal is to leverage big data for addressing societal problems in areas like healthcare, transportation and the environment.
The document summarizes Joe McCarthy's presentation about his research on proactive displays, which aim to bridge online social networks and shared physical spaces. It provides a brief history of McCarthy's work in this area over multiple generations of proactive display systems. It then describes McCarthy's most recent project, the Context, Content & Community Collage, which uses a large display to share coworkers' social media content in a workplace setting to potentially foster greater community.
Here are the key steps for conducting a trade area analysis (a small demographic-comparison sketch follows the list):
1. Define the trade area. Determine the geographic boundaries that encompass the majority (e.g. 75%+) of a store's customers based on factors like drive time, road networks, geographical barriers. Common trade areas are 5, 10, 15 minute drives.
2. Analyze demographic data. Obtain census data on population, income levels, age distribution, household types etc. within the trade area and compare to national averages. This provides insights into customer base.
3. Examine competitor analysis. Identify and locate any competing stores or brands within the trade area. Analyzing their strengths, weaknesses and customer value propositions helps determine opportunities.
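An illustrative sketch of step 2: comparing population-weighted trade-area demographics against national averages with pandas. All figures, tract names, and column names are made-up placeholders, not real census data.

```python
import pandas as pd

national = {"median_income": 74_000, "median_age": 38.5}  # assumed benchmarks

trade_area = pd.DataFrame({
    "tract": ["A", "B", "C"],
    "population": [4200, 3800, 5100],
    "median_income": [81_000, 69_000, 77_500],
    "median_age": [34.2, 41.0, 36.8],
})

# Weight each tract by its share of the trade-area population.
weights = trade_area["population"] / trade_area["population"].sum()
for col in ("median_income", "median_age"):
    area_avg = (trade_area[col] * weights).sum()
    print(f"{col}: trade area {area_avg:,.1f} vs national {national[col]:,}")
```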
UCL joint Institute of Education (London Knowledge Lab) & UCL Interaction Centre seminar, 20th April 2016. Replay: https://youtu.be/0t0IWvcO-Uo
Algorithmic Accountability & Learning Analytics
Simon Buckingham Shum
Connected Intelligence Centre, University of Technology Sydney
ABSTRACT. As algorithms pervade societal life, they are moving from the preserve of computer science to becoming the object of far wider academic and media attention. Many are now asking how the behaviour of algorithms can be made “accountable”. But why are they “opaque” and to whom? As this vital discussion unfolds in relation to Big Data in general, the Learning Analytics community must articulate what would count as meaningful questions and satisfactory answers in educational contexts. In this talk, I propose different lenses that we can bring to bear on a given learning analytics tool, to ask what it would mean for it to be accountable, and to whom. From a Human-Centred Informatics perspective, it turns out that algorithmic accountability may be the wrong focus.
BIO. Simon Buckingham Shum is Professor of Learning Informatics at the University of Technology Sydney, which he joined in August 2014 to direct the new Connected Intelligence Centre. Prior to that he was at The Open University’s Knowledge Media Institute 1995-2014. He brings a Human-Centred Informatics (HCI) approach to his work, with a background in Psychology (BSc, York), Ergonomics (MSc, London) and HCI (PhD, York) where he worked with Rank Xerox Cambridge EuroPARC on Design Rationale. He co-edited Visualizing Argumentation (2003) followed by Knowledge Cartography (2008, 2nd Edn. 2014), and with Al Selvin wrote Constructing Knowledge Art (2015). He is active in the emerging field of Learning Analytics and is a co-founder of the Society for Learning Analytics Research, Compendium Institute and Learning Emergence network.
Similar to Adventures in Crowdsourcing: Toward Safer Content Moderation & Better Supporting Complex Annotation Tasks
Automated Models for Quantifying Centrality of Survey Responses - Matthew Lease
Research talk presented at "Innovations in Online Research" (October 1, 2021)
Event URL: https://web.cvent.com/event/d063e447-1f16-4f70-a375-5d6978b3feea/websitePage:b8d4ce12-3d02-4d24-897d-fd469ca4808a.
Explainable Fact Checking with Humans in-the-loop - Matthew Lease
Invited Keynote at KDD 2021 TrueFact Workshop: Making a Credible Web for Tomorrow, August 15, 2021.
https://www.microsoft.com/en-us/research/event/kdd-2021-truefact-workshop-making-a-credible-web-for-tomorrow/#!program-schedule
AI & Work, with Transparency & the Crowd - Matthew Lease
The document discusses designing human-AI partnerships and the role of crowdsourcing in AI systems. It summarizes work on designing AI assistants to work with humans, using crowds to help fact-check information, and explores challenges around protecting crowd workers who review harmful content or do "dirty jobs". It advocates for more research on ethics in AI and using crowds to help check work for ethical issues.
Designing Human-AI Partnerships to Combat Misinformation - Matthew Lease
The document discusses designing human-AI partnerships to combat misinformation. It describes a prototype partnership where a human and AI work together to fact-check claims. The partnership aims to make the AI more transparent and address user bias by allowing the user to adjust the perceived reliability of news sources, which then changes the AI's political leaning analysis and fact checking results. The discussion wraps up by noting challenges like avoiding echo chambers and assessing potential harms, as well as opportunities for AI to reduce bias and increase trust through explainable, interactive systems.
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact... - Matthew Lease
Presented at the 31st ACM User Interface Software and Technology Symposium (UIST), 2018. Paper: https://www.ischool.utexas.edu/~ml/papers/nguyen-uist18.pdf
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio... - Matthew Lease
Presentation at the 1st Biannual Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2018). August 30, 2018. Paper: https://www.ischool.utexas.edu/~ml/papers/kutlu-desires18.pdf
Talk given August 29, 2018 at the 1st Biannual Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2018). Paper: https://www.ischool.utexas.edu/~ml/papers/lease-desires18.pdf
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E... - Matthew Lease
Presentation at the 6th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), July 7, 2018. Work by Tanya Goyal, Tyler McDonnell, Mucahid Kutlu, Tamer Elsayed, and Matthew Lease. Pages 41-49 in conference proceedings. Online version of paper includes corrections to official version in proceedings: https://www.ischool.utexas.edu/~ml/papers/goyal-hcomp18
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for... - Matthew Lease
Invited Talk at the ACM JCDL 2018 WORKSHOP ON CYBERINFRASTRUCTURE AND MACHINE LEARNING FOR DIGITAL LIBRARIES AND ARCHIVES. https://www.tacc.utexas.edu/conference/jcdl18
Deep Learning for Information Retrieval: Models, Progress, & Opportunities - Matthew Lease
Talk given at the 8th Forum for Information Retrieval Evaluation (FIRE, http://fire.irsi.res.in/fire/2016/), December 10, 2016, and at the Qatar Computing Research Institute (QCRI), December 15, 2016.
Systematic Review is e-Discovery in Doctor's Clothing - Matthew Lease
This document discusses opportunities for collaboration between researchers working in systematic reviews and electronic discovery (e-discovery). It notes similarities in the challenges both fields face, including the need for high recall with bounded costs and reliance on multi-stage review pipelines. The document proposes that technologies developed for semi-automated citation screening and crowdsourcing could help address current limitations. It concludes by encouraging information retrieval researchers to investigate open problems in systematic reviews as opportunities to advance technologies beyond other tasks and help bring together interested parties through forums like the TREC Total Recall track.
Crowd computing utilizes both crowdsourcing and human computation to solve problems. Crowdsourcing enables more efficient and scalable data collection and processing by outsourcing tasks to a large, undefined group of people. Human computation allows software developers to incorporate human intelligence and judgment into applications to provide capabilities beyond current artificial intelligence. Examples discussed include Amazon Mechanical Turk, various crowd-powered applications, and how crowdsourcing has helped label large datasets to train machine learning models.
The Rise of Crowd Computing (December 2015) - Matthew Lease
Crowd computing is rising with two waves - the first using crowds to label large amounts of data for artificial intelligence applications. The second wave delivers applications that go beyond AI abilities by incorporating human computation. Open problems remain around ensuring high quality outputs, task design, understanding the worker context and experience, and addressing ethics concerns around opaque platforms and working conditions. The future holds potential for empowering crowd work but also risks like digital sweatshops if worker freedoms and conditions are not considered.
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms - Matthew Lease
The document summarizes a presentation about analyzing paid crowd work platforms beyond Mechanical Turk. It discusses how Mechanical Turk has dominated research on paid crowdsourcing due to its early popularity, but that it has limitations. The presentation conducts a qualitative study of 7 alternative crowd work platforms to identify distinguishing capabilities not found on MTurk, such as different payment models, richer worker profiles, and support for confidential tasks. It aims to increase awareness of other platforms to further inform practice and research on crowdsourcing.
Toward Effective and Sustainable Online Crowd Work - Matthew Lease
New forms of online crowd work enabled by technology present both opportunities for innovation and risks of harm that require careful consideration. This document discusses three main issues. First, some crowd work tasks may enable illegal or unethical goals. Second, the lack of regulation means crowd work practices sometimes exploit vulnerable workers by not ensuring informed consent. Third, multi-stakeholder discussions are needed to develop win-win solutions that balance costs, quality, and what is fair for all parties in a global context. The goal is to learn from each other and find ways to encourage ethical practices.
Crowdsourcing: From Aggregation to Search Engine EvaluationMatthew Lease
This document provides an overview of statistical crowdsourcing and its applications. It discusses crowdsourcing platforms like Amazon Mechanical Turk and how they have enabled large-scale data labeling for tasks in areas like natural language processing. It also summarizes research on using crowdsourcing to evaluate search engines and benchmarks different statistical consensus methods for aggregating judgments from crowds. Finally, it presents work on using psychometrics and crowdsourcing to model multidimensional relevance through structured surveys and factor analysis.
1. Matt Lease
School of Information
The University of Texas at Austin
Adventures in Crowdsourcing: Toward Safer Content Moderation & Better Supporting Complex Annotation Tasks
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
2. Roadmap
• Context: UT Good Systems & iSchool
• Two parts to talk today
– Content Moderation
– Aggregating Complex Annotations
3. Good Systems: an 8-year, $10M UT Austin Grand Challenge
Goal: Design a future of Artificial Intelligence (AI) technologies to meet society's needs and values.
http://goodsystems.utexas.edu
4. What's an Information School?
"The place where people & technology meet" ~ Wobbrock et al., 2009
"iSchools" now exist at over 100 universities around the world
5. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content
Anubrata Das, Brandon Dang, and Matthew Lease
School of Information
The University of Texas at Austin
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
6. Today's Talk: Content Moderation
- Social media platforms are hubs of user-generated content
- Some types of content are unacceptable or may cause harm: pornography & nudity, depictions of violence, hate speech, mis/disinformation
- What is considered acceptable varies by platform and region
- Further issues of free speech & due process arise in content removal & remediation
- e.g., Moderate Globally, Impact Locally: The Global Impacts of Content Moderation (Yale, Nov. 2020)
Alon Halevy et al. "Preserving integrity in online social networks." arXiv preprint, September 25, 2020.
7. Scale of Content Moderation
[Chart: scale of content moderation at Facebook and YouTube]
Paul M. Barrett (2020). Who Moderates the Social Media Giants? A Call to End Outsourcing.
8. Can't we just use AI?
• High cost of errors → very high accuracy required
• Continually evolving content and moderation policies
– also regional variants, cultural issues, and adversarial attacks
• While AI systems are often advertised/perceived as fully automated, in practice human labor is typically required and often hidden
– Gray and Suri (2019) "ghost work", Ekbia and Nardi (2014) "heteromation", Irani and Silberman (2013) "invisible work"
• Human moderators today: Facebook ~15K, YouTube ~10K
• No free lunch: human annotators still needed to create training data
9. Barr & Cabrera, ACM Queue 2006
"Software developers with innovative ideas for businesses and technologies are constrained by the limits of artificial intelligence… If software developers could programmatically access and incorporate human intelligence into their applications, a whole new class of innovative businesses and applications would be possible. This is the goal of Amazon Mechanical Turk… people are freer to innovate because they can now imbue software with real human intelligence."
11. Implications for Moderators
"The psychological effects of viewing harmful content is well documented, with reports of moderators experiencing posttraumatic stress disorder (PTSD) symptoms and other mental health issues as a result of the disturbing content they are exposed to." (Cambridge Consultants, 2019)
"From my own interviews with more than 100 moderators… a significant number [get PTSD]. And many other employees develop long-lasting mental health symptoms that stop short of full-blown PTSD, including depression, anxiety, and insomnia." (Casey Newton, 2020)
Volume quotas (akin to a call center): "constant measurement for accuracy is as pressurizing as a quota" (Dwoskin 2019)
Image Source: The Verge
12. The Great Irony
The sort of task we most want an algorithm to do (emotionally disturbing) is what people are doing because the algorithm isn't good enough.
13. But Who Protects the Moderators? (HCOMP 2018)
Brandon Dang¹, Martin J. Riedl², and Matthew Lease¹
¹School of Information & ²School of Journalism (both students contributed equally)
The University of Texas at Austin
AAAI HCOMP & ACM Collective Intelligence, July 2018, Zurich, Switzerland
14. Research Question
By revealing less of an image, can we reduce the emotional labor of image moderation without compromising moderator accuracy and efficiency?
15. Design and Demo
Demo: http://ir.ischool.utexas.edu/CM/demo/
Dang, Brandon, Martin J. Riedl, and Matthew Lease. "But who protects the moderators? The case of crowdsourced image moderation." arXiv preprint arXiv:1804.10999 (2018).
Code: https://github.com/budang/content-moderation
16. Exposure and Control
"shielding moderators from harm begins with giving them more control of what they're seeing and how they're seeing it, so just the existence of ...preferences helps" (Sullivan 2019)
"Scientifically, do we know how much [exposure] is too much? The answer is no, we don't... If there's something that were to keep me up at night... it's that question" (Facebook psychologist Chris Harrison)
"Finding the right balance between content reviewer well-being and resiliency, quality, and productivity is very challenging at the scale we operate in. We are continually working to get this balance right." (Facebook's Carolyn Glanville)
Image Source: https://images.fastcompany.net/image/upload/w_596,c_limit,q_auto:best,f_auto/wp-cms/uploads/2019/06/Quick-Settings.png
19. Exposure and Control
- Industry moving towards establishing best practices for providing control & tools
- Such interventions include greyscaling, muting videos, and blurring (sketched below)
- Not well understood how effective such practices are
- Google: Ramakrishnan and Karunakaran (HCOMP 2019) report that grayscaling of images and videos reduces harm; they also study static blurring.
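To make these interventions concrete, here is a minimal Python sketch using the Pillow library (this is illustrative only, not the study's implementation, which was a web UI; see the demo and code links above, and the file names here are placeholders):

from PIL import Image, ImageFilter, ImageOps

img = Image.open("example.jpg")  # placeholder file name

# Grayscaling, as studied by Ramakrishnan and Karunakaran (HCOMP 2019)
gray = ImageOps.grayscale(img)

# Static blurring: one fixed Gaussian blur applied to the whole image
static = img.filter(ImageFilter.GaussianBlur(radius=12))

# An interactive design instead lets the moderator choose how much to
# reveal, e.g. stepping the radius down via a slider or hover region
for radius in (12, 8, 4, 0):
    img.filter(ImageFilter.GaussianBlur(radius=radius)).save(f"blur_{radius}.png")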
21. Survey: Well-being and Usability
01 Positive and Negative Experience: 5-point Likert scale on how often workers experience the following emotions: positive, negative, good, bad, pleasant, unpleasant, etc. (SPANE; Diener et al., 2010)
02 Positive and Negative Affect: 7-point Likert scale on what emotions workers are currently feeling (I-PANAS-SF; Thompson 2007)
03 Emotional Exhaustion: slightly modified version of the emotional exhaustion scale (Wharton 1993; Cates and Howe 2015)
04 Usefulness: perceived usefulness and perceived ease of use (Davis 1989; Venkatesh and Davis 2000)
22. Experiment
- Random sample of 60 synthetic & real images across categories: 180 total images
- Divided into groups of 9, balanced over classes
- 20 HITs, five workers per HIT
- Workers restricted to a single HIT
- Adult content qualification, >98% approval rate with 300+ submitted HITs
- Paid $7.25/hour
23. Results
Performance measures:
- Accuracy
- Time taken
- Effort*: # clicks, # mouse movements
Well-being measures:
- Worker comfort
- Experience
- Affect
- Emotional exhaustion
- Usefulness
*Brandon Dang, Miles Hutson, & Matthew Lease. MmmTurkey: A Crowdsourcing Framework for Deploying Tasks and Recording Worker Behavior on Amazon Mechanical Turk. HCOMP 2016. https://github.com/budang/turkey-lite
24. Speed and Accuracy Are Not Impacted by Interactive Blurring
[Charts: worker accuracy and time taken across blur conditions]
30. Increased mean positive affect with increasing level of blur
[Chart: positive and negative affect scores by blur level]
31. Summary: Hover is the Champion for Adoption
- Slider and hover are both top performers
- Hover shows significantly lower emotional exhaustion with comparatively high accuracy
- If the key goal is to keep accuracy intact & reduce emotional impact, we recommend the hover design
(B: Baseline; **p < 0.05, ***p < 0.005)
32. Contributions, Conclusions & Future Work
01 Contribution: Proposed and extensively evaluated an intervention that improves moderator well-being
02 Conclusion: Unlike static blurring, which decreases accuracy, interactive blurring improves well-being without sacrificing accuracy or speed
03 Future Work: Qualitative analysis; intelligent unblurring; early warning for severity
33. Modeling and Aggregation of Complex Annotations via Annotation Distance
Alex Braylan¹ and Matthew Lease²
¹Dept. of Computer Science & ²School of Information
The University of Texas at Austin
ml@utexas.edu
@mattlease
Slides: slideshare.net/mattlease
Encore: Dec 11 talk @ NeurIPS Crowd Science Workshop (https://research.yandex.com/workshops/crowd/neurips-2020)
Code & Data: https://github.com/Praznat/annotationmodeling
34. Simple annotation & aggregation
Simple annotation tasks:
• Classification – sentiment analysis, image categorization
• Ordinal rating – product & movie reviews, search relevance
• Multiple choice selection – quizzes
Aggregation:
• Crowdsourcing: quality control
• Experts: wisdom of crowds
• Goal: select the best label available for each item (no label fusion)
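For contrast with the complex-annotation setting that follows, here is a minimal sketch of aggregation for such simple labels: plain majority voting over categorical answers. The items and labels are illustrative, not data from the talk.

from collections import Counter

# Illustrative worker labels for two items (hypothetical data)
annotations = {
    "item1": ["cat", "cat", "dog"],
    "item2": ["relevant", "relevant", "not relevant"],
}

def majority_vote(labels):
    # Most frequent label wins; ties broken arbitrarily
    return Counter(labels).most_common(1)[0][0]

aggregated = {item: majority_vote(labels) for item, labels in annotations.items()}
print(aggregated)  # {'item1': 'cat', 'item2': 'relevant'}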
38. When majority voting falls short
Caption this image: [image to caption]
Example captions: "A cat is eating" / "The cat eats" / "A beautiful picture"
Problem: large label space; exact match doesn't work!
39. What about complex annotations?
• Ranked lists
• Parse trees
• Image captions (e.g., A1: "A cat is eating"; A2: "The cat eats"; A3: "A beautiful picture")
• Range sequences
41. Aggregating Simple Labels
• Hundreds of papers
• Multiple benchmarking studies
• Rich body of Bayesian modeling: Dawid-Skene, MACE, Hierarchical Dawid-Skene, Item Difficulty, Logistic Random Effects (source: Paun et al. 2018, "Comparing Bayesian models of annotation")
• General-purpose aggregation models for simple labels don't support complex labels!
42. Task-specific models
• Pros: task specialization maximizes accuracy
• Cons: need a new model for every task; complicated, difficult to formulate
Examples: Nguyen et al. 2017 (sequences), Lin, Mausam, and Weld 2012 (math)
43. Task-specific workflows
• Pros: empower workers for complex tasks
• Cons: need a new workflow for every task; complicated, difficult to formulate
Examples: Noronha et al. 2011 (image analysis), Lasecki et al. 2012 (transcription)
44. Our goals
• We want aggregation for complex data types
– Build on ideas from simple label aggregation models
• We want to generalize across many labeling tasks
– Can we reduce the problem to a common, simpler state space?
46. Key Insight
• Partial credit matching via a task-specific distance function
– Encapsulate task-specific label features into a requester-supplied distance function
– Model annotation distances rather than annotations
– Distance functions already exist for most tasks, because people need evaluation functions to compare predicted labels vs. gold
47. Distance functions
Properties of distance functions: non-negativity, symmetry, triangle inequality

                       Free Text    Rankings
Example evaluation fn  BLEU(x, y)   –
Example distance fn    –            –
Non-negativity         ✓            ✓
Symmetry               ✓            ✓
Triangle inequality    ✓            ✓
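As an illustration of such functions, here is a sketch of two common choices: Levenshtein edit distance for free text and a discordant-pair (Kendall's tau) distance for rankings. Kendall's tau is the ranking distance the talk itself uses later (slide 65); plain edit distance here is a stand-in for the BLEU-style functions named on the slide. Both sketches satisfy non-negativity, symmetry, and the triangle inequality.

def edit_distance(x: str, y: str) -> int:
    # Classic dynamic-programming Levenshtein distance
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def kendall_tau_distance(r1, r2):
    # Count of discordant pairs between two rankings of the same items;
    # r1 and r2 are lists of the same items in ranked order
    pos = {item: i for i, item in enumerate(r2)}
    n, discordant = len(r1), 0
    for i in range(n):
        for j in range(i + 1, n):
            if pos[r1[i]] > pos[r1[j]]:
                discordant += 1
    return discordant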
48. Calculate distances (figure built up incrementally across slides 48–50)
• Example task: text annotation
• Example distance function: string edit distance
[Figure: pairwise distances among four captions. The three similar captions "a cat is eating", "cat is eating", and "the cat eats" lie close together (distances 0.05, 0.1, 0.1), while "a beautiful picture" is far from each of them (distances 0.8, 0.82, 0.82).]
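A sketch of this step: computing the pairwise distance matrix for the caption example, reusing the edit_distance function sketched above and normalizing by the longer string's length (a hypothetical normalization; the exact values on the slide may come from a different variant).

captions = ["a cat is eating", "cat is eating", "the cat eats", "a beautiful picture"]

def normalized_edit(x, y):
    # edit_distance as defined in the earlier sketch
    return edit_distance(x, y) / max(len(x), len(y))

matrix = [[normalized_edit(a, b) for b in captions] for a in captions]
for caption, row in zip(captions, matrix):
    print(f"{caption:>20}: " + "  ".join(f"{d:.2f}" for d in row))
# The three "cat" captions end up near one another, while "a beautiful
# picture" is far from all of them, mirroring the slide's figure.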
51. All tasks reduce to matrices of annotation distances
[Figure: distance triangle over A1: "A cat is eating", A2: "The cat eats", A3: "A beautiful picture", with pairwise distances 0.1, 0.3, and 0.6.]
52. How to aggregate given distances
• Local selection model
• Global selection model
• Combined
53. Local approach: Smallest Avg Distance
• For each item:
1. Compute the average distance between annotations for the item
2. Choose the annotation with the smallest average distance
• A generalization of majority vote
• Items are treated independently
• The local approach does not model annotator reliability
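A minimal sketch of this local rule (the function name and data layout are mine, not the paper's):

def smallest_avg_distance(labels, dist):
    # Return the label with the smallest average distance to the others
    if len(labels) == 1:
        return labels[0]
    best_i, best_avg = 0, float("inf")
    for i, x in enumerate(labels):
        avg = sum(dist(x, y) for j, y in enumerate(labels) if j != i) / (len(labels) - 1)
        if avg < best_avg:
            best_i, best_avg = i, avg
    return labels[best_i]

# e.g., with the caption example and normalized_edit from above, this
# returns one of the "cat" captions, since "a beautiful picture" has a
# large average distance to the rest.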
54. Global approach: Best Available User
• For each annotator: score by average distance over the full dataset
• For each item: choose the label from the best-scoring annotator
• Assumes fixed annotator reliability
• The global approach does not model how well annotators did on specific items
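And a matching sketch of the global rule, under the same illustrative conventions:

from collections import defaultdict

def best_available_user(data, dist):
    # data: {item: {worker: label}}; score each worker by average distance
    # to co-annotators over the whole dataset, then pick per item
    totals, counts = defaultdict(float), defaultdict(int)
    for labels in data.values():
        for a, la in labels.items():
            for b, lb in labels.items():
                if a != b:
                    totals[a] += dist(la, lb)
                    counts[a] += 1
    score = {a: totals[a] / counts[a] for a in totals}  # lower = more reliable
    return {item: labels[min(labels, key=lambda a: score.get(a, float("inf")))]
            for item, labels in data.items()}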
55. Can we get the best of both worlds?
• Want a method that combines:
– Best available user (global)
– Smallest avg distance (local)
• Should build on the rich history of work on Bayesian annotation modeling
• Need a principled framework for modeling annotation distance matrices (weights + votes = weighted voting)
56. Multidimensional Annotation Scaling (MAS)
• Based on Multidimensional Scaling (Kruskal & Wish 1978)
• Probabilistic model of multi-item distance matrices
• "Hierarchical Bayesian": additional learned parameters represent crowd effects such as worker reliability
[Figure: embedding of the example captions "A cat is eating", "The cat eats", and "A beautiful picture"]
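The paper's actual MAS model is hierarchical Bayesian and fit by probabilistic inference (see the released code linked below). As a rough, non-Bayesian sketch of the same intuition only: embed each item's distance matrix with metric MDS, pool each worker's distance-to-centroid across items into a reliability score, then select per item the annotation closest to the reliability-weighted centroid.

import numpy as np
from collections import defaultdict
from sklearn.manifold import MDS

def mas_like_select(data, dist, dim=2, seed=0):
    # data: {item: [(worker, label), ...]}; dist: distance fn on labels.
    # NOT the paper's model: a crude embedding-based approximation.
    worker_errs = defaultdict(list)
    embeddings = {}
    for item, anns in data.items():
        labels = [label for _, label in anns]
        D = np.array([[dist(a, b) for b in labels] for a in labels])
        # Embed so Euclidean distances approximate the annotation distances
        emb = MDS(n_components=dim, dissimilarity="precomputed",
                  random_state=seed).fit_transform(D)
        center = emb.mean(axis=0)
        for (worker, _), point in zip(anns, emb):
            worker_errs[worker].append(np.linalg.norm(point - center))
        embeddings[item] = emb
    # Pool per-worker error across items into a reliability weight
    reliability = {w: 1.0 / (1e-6 + np.mean(e)) for w, e in worker_errs.items()}
    selected = {}
    for item, anns in data.items():
        emb = embeddings[item]
        w = np.array([reliability[worker] for worker, _ in anns])
        target = (emb * w[:, None]).sum(axis=0) / w.sum()  # weighted "truth" proxy
        selected[item] = anns[int(np.argmin(np.linalg.norm(emb - target, axis=1)))][1]
    return selected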
65. Tasks & datasets
Synthetic datasets:
• Syntactic parse trees – distance function: evalb
• Ranked lists – distance function: Kendall's tau
Real datasets:
• Biomedical text sequences – distance function: span F1 (Nguyen et al. 2017)
• Urdu-English translations – distance function: GLEU (Zaidan and Callison-Burch 2011)
66. Methods
Baselines:
• Random User (RU): pick one label randomly
• ZenCrowd (ZC) (Demartini et al. 2012): weighted voting based on exact match (rare!)
• Crowd Hidden Markov Model (CHMM) (Nguyen et al. 2017): sequence annotation task only
Upper bound: Oracle (OR) always picks the best label
• Even if 5 workers answer, we are limited by the best answer any of them gave
67. Results
• Diverse complex label datasets (this slide shows only the Random User baseline and Oracle upper bound; the full comparison follows)

Task          Metric   RU      Oracle
Translations  GLEU     0.185   0.246
Sequences     F1       0.561   0.827
Parses        EVALB    0.812   0.939
Rankings               0.491   0.724
70. Results
Task          Metric   RU      ZC      CHMM    MAS     Oracle
Translations  GLEU     0.185   0.188   -       0.217   0.246
Sequences     F1       0.561   0.569   0.702   0.709   0.827
Parses        EVALB    0.812   0.819   -       0.932   0.939
Rankings               0.491   0.495   -       0.710   0.724
• Diverse complex label datasets
• MAS aggregation gets closest to ground truth with no model alteration between datasets
71. Conclusion
• Goal: a general-purpose probabilistic model to aggregate complex annotations
– Categorical-based methods are insufficient
– Custom models are difficult to design for new annotation types
• Solution: model annotation distances via task-specific distance functions
– Transforms the problem into a general-purpose variable space
• Multidimensional Annotation Scaling (MAS)
– Allows unsupervised weighted voting with inferred annotator reliability
• Not covered in talk (see paper): semi-supervised learning, partial credit
72. Ongoing work
• Generalization to more tasks (e.g., image bounding boxes & keypoints)
• Generalization to simple annotation tasks ("one ring to rule them all")
• Support for multiple latent objects per item
• Merging annotations rather than selecting the best one
– e.g., guessing the weight of an ox
– MAS vs. non-embedding EM model, varying noise, fewer annotations, …
Code & Data: https://github.com/Praznat/annotationmodeling
73. Thank you!
Matt Lease (University of Texas at Austin)
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
We thank our many talented crowd workers for their contributions to our research!