Research talk presented at "Innovations in Online Research" (October 1, 2021)
Event URL: https://web.cvent.com/event/d063e447-1f16-4f70-a375-5d6978b3feea/websitePage:b8d4ce12-3d02-4d24-897d-fd469ca4808a.
Automated Models for Quantifying Centrality of Survey Responses
1. Matt Lease
Associate Professor
School of Information
The University of Texas at Austin
Amazon Scholar
Human-in-the-loop Services
Amazon Web Services (AWS)
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
8. Caption this image:
When majority voting falls short
Problem: large label space, exact match doesn’t work!
A cat is eating
The cat eats
A beautiful picture
9. What about complex annotations?
Ranked lists
Parse trees
Image captions
Range sequences
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
10.
Alexander Braylan¹ and Matthew Lease²
¹Dept. of Computer Science & ²School of Information
The University of Texas at Austin
Modeling and Aggregation of Complex
Annotations via Annotation Distance
Code & Data: https://github.com/Praznat/annotationmodeling
11. Roadmap
• Prior work
• Approach
• Example outputs
• Conclusion
12. Aggregating Simple Labels
• Hundreds of papers
• Multiple benchmarking studies
• Rich body of Bayesian modeling
• General-purpose aggregation
models for simple labels don’t
support complex labels
Models compared: Dawid-Skene, MACE, Hierarchical Dawid-Skene, Item Difficulty, Logistic Random Effects
Source: Paun et al. 2018, “Comparing Bayesian Models of Annotation”
13. Task-specific models
• Pros:
– Task specialization
maximizes accuracy
• Cons:
– Need new model for
every task
– Complicated, difficult
to formulate
Nguyen et al 2017 (Sequences)
Lin, Mausam, and Weld 2012 (Math)
14. Our goals
• We want aggregation for complex data types
– Build on ideas from simple label aggregation models
• We want to generalize across many labeling tasks
– Can we reduce the problem to a common, simpler state space?
15. Roadmap
• Prior work
• Approach
• Example outputs
• Conclusion
16. Key Insight
Partial credit matching via task-specific distance function
• Adopt or define a distance function for each annotation task
• Model annotation distances uniformly across tasks
• Distance functions already exist for many task types
– Free-text responses, e.g., survey questions
17. Calculate distances
(figure: four responses: “a cat is eating”, “cat is eating”, “the cat eats”, “a beautiful picture”)
• Example task: free text answer
• Example distance function: string edit distance
18. Calculate distances
(figure: small pairwise distances among the three “cat” responses: 0.05, 0.1, 0.1)
• Example task: free text answer
• Example distance function: string edit distance
19. Calculate distances
(figure: small distances among the three “cat” responses (0.05, 0.1, 0.1) and large distances to “a beautiful picture” (0.8, 0.82, 0.82))
• Example task: free text answer
• Example distance function: string edit distance
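The running example can be made concrete. A minimal sketch of a normalized string edit distance; the deck does not specify the exact normalization, so this divides Levenshtein distance by the longer string's length:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance(a: str, b: str) -> float:
    # Normalize to [0, 1] by the longer string's length.
    return levenshtein(a, b) / max(len(a), len(b), 1)

print(distance("a cat is eating", "cat is eating"))       # small
print(distance("a cat is eating", "a beautiful picture"))  # large
```

The similar captions come out close to 0 while the off-topic caption is far from all of them, which is exactly the partial-credit signal that exact-match voting throws away.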
22. Distance function properties
Properties of distance functions:
Non-negativity
Symmetry
Triangle inequality

Data                   Free Text    Rankings
Example evaluation fn  BLEU(x, y)
Example distance fn
Non-negativity         ✓            ✓
Symmetry               ✓            ✓
Triangle inequality    ✓            ✓
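The checkmarks above can be spot-checked numerically. A sketch using Jaccard distance over character sets as a stand-in metric (not a distance function from the deck):

```python
import itertools
import random
import string

def jaccard_distance(a: set, b: set) -> float:
    # 1 - |A∩B| / |A∪B|; a proper metric on finite sets.
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

random.seed(0)
samples = [set(random.sample(string.ascii_lowercase, random.randint(1, 8)))
           for _ in range(30)]

for x, y, z in itertools.product(samples, repeat=3):
    d_xy = jaccard_distance(x, y)
    assert d_xy >= 0.0                           # non-negativity
    assert d_xy == jaccard_distance(y, x)        # symmetry
    assert jaccard_distance(x, z) <= d_xy + jaccard_distance(y, z) + 1e-12  # triangle
print("all three metric properties hold on the sample")
```

This kind of empirical check is useful before plugging a new task-specific distance function into the aggregation models.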
23. Calculate distances
(figure, repeated: small distances among the three “cat” responses (0.05, 0.1, 0.1), large distances to “a beautiful picture” (0.8, 0.82, 0.82))
24. All tasks reduce to matrices of distances
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
(figure: pairwise distances of 0.1, 0.6, and 0.3 between the three responses)
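Reducing an item to a distance matrix can be sketched as follows, using Python's stdlib `difflib` similarity ratio as a stand-in string distance (the paper's distance functions differ):

```python
from difflib import SequenceMatcher

def text_distance(a: str, b: str) -> float:
    # 1 - similarity ratio: a stand-in string distance, not the paper's choice.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def distance_matrix(annotations: list[str]) -> list[list[float]]:
    # Every task reduces to a symmetric per-item matrix of pairwise distances.
    n = len(annotations)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = text_distance(annotations[i], annotations[j])
    return D

D = distance_matrix(["A cat is eating", "The cat eats", "A beautiful picture"])
for row in D:
    print([round(d, 2) for d in row])
```

Once every task is in this form, the same aggregation machinery applies regardless of whether the annotations were captions, parse trees, or rankings.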
25. How to aggregate given distances
• Local selection model
• Global selection model
• Combined
(figure: the current item’s distance matrix alongside other items’ matrices)
26. Local approach: Smallest Avg Distance (SAD)
• For each question: compute average
distance between responses
• The response with smallest average
distance is locally most normative,
generalizing majority vote
• Assumes independence between items
• Does not model a respondent’s overall agreement across items
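The SAD rule above can be sketched in a few lines (a minimal reading of the slide, not the repo's implementation). The example matrix reuses the caption distances from the earlier slides:

```python
def smallest_avg_distance(D: list[list[float]]) -> int:
    # D is one item's symmetric response-by-response distance matrix.
    # Pick the response with the smallest average distance to the others:
    # a partial-credit generalization of majority vote.
    n = len(D)
    avg = [sum(row) / (n - 1) for row in D]  # diagonal is 0, so divide by n-1
    return min(range(n), key=avg.__getitem__)

D = [
    [0.00, 0.10, 0.80],   # "A cat is eating"
    [0.10, 0.00, 0.82],   # "The cat eats"
    [0.80, 0.82, 0.00],   # "A beautiful picture"
]
print(smallest_avg_distance(D))  # → 0
```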
27. Global approach: Best Available User (BAU)
• Score each participant by their
average distance to all other
participants across all questions
• The participant with lowest score is
globally most normative; treat their
response as most normative
• Global approach ignores distance
observed on the current item
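A BAU sketch, assuming for simplicity that the same participants answer every question (the real setting need not be this clean):

```python
def best_available_user(item_matrices: list[list[list[float]]]) -> int:
    # Score each participant by mean distance to all other participants,
    # pooled across all questions; the lowest score wins globally, and that
    # participant's response is taken on every item.
    n = len(item_matrices[0])
    score = [0.0] * n
    for D in item_matrices:
        for u in range(n):
            score[u] += sum(D[u]) / (n - 1)
    return min(range(n), key=score.__getitem__)

# Two questions; participant 2 is the outlier on both.
D1 = [[0.0, 0.10, 0.80], [0.10, 0.0, 0.82], [0.80, 0.82, 0.0]]
D2 = [[0.0, 0.20, 0.90], [0.20, 0.0, 0.70], [0.90, 0.70, 0.0]]
print(best_available_user([D1, D2]))  # → 1
```

Note how the global winner (participant 1) differs from the per-item SAD choice on the first question, which is exactly the local vs. global tension the next slide addresses.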
28. Can we get best of both worlds?
• Want a method that combines:
– Best available user (global)
– Smallest avg distance (local)
• Should build on rich history of work on Bayesian annotation modeling
• Need a principled framework for modeling annotation distance matrices
(figure: votes + weights → weighted voting)
29. Multidimensional Annotation Scaling (MAS)
• Based on Multidimensional
Scaling (Kruskal & Wish 1978)
• Probabilistic model of multi-
item distance matrices
• “Hierarchical Bayesian”
– Additional learned parameters
represent crowd effects such as
worker reliability
(figure: “A cat is eating”, “The cat eats”, “A beautiful picture”)
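MAS itself is a hierarchical Bayesian model (see the paper and repo); for flavor, classical multidimensional scaling shows how point embeddings are recovered from a distance matrix. A numpy sketch, not the authors' model:

```python
import numpy as np

def classical_mds(D: np.ndarray, dim: int = 2) -> np.ndarray:
    # Embed points so Euclidean distances approximate D (Torgerson/Gower MDS).
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]  # top eigenpairs
    L = np.sqrt(np.clip(vals[order], 0, None))
    return vecs[:, order] * L

D = np.array([[0.00, 0.10, 0.80],
              [0.10, 0.00, 0.82],
              [0.80, 0.82, 0.00]])
X = classical_mds(D)
# The two "cat" captions land close together; the outlier sits far away.
print(np.linalg.norm(X[0] - X[1]) < np.linalg.norm(X[0] - X[2]))  # → True
```

MAS replaces this deterministic embedding with a probabilistic one and adds learned crowd-effect parameters such as worker reliability.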
32. MAS Objective 2: Prior
(figure: responses “a cat is eating”, “cat is eating”, “the cat eats”, “a beautiful picture”, with a pseudo-gold point)
37. Roadmap
• Prior work
• Approach
• Example outputs
• Conclusion
38. Example Output: father
Response SAD MAS
He always speaks ill about his father behind back. 0.78 0.16
He always speaks ill of his father behind his back. 0.71 0.30
He always talks about his father behind his back. 0.74 0.50
He always speaks ill of his father 0.78 0.55
He always speak ill of his father. 0.79 0.62
He is always talking about his father behind his back. 0.82 0.63
He always says behind his father. 0.90 0.72
He always talks about his dad behind his back. 0.83 0.73
39. Example Output: she says
Response SAD MAS
Please be sure to take a note of what she says. 0.77 0.16
Please take a note of what she says. 0.84 0.30
Be sure to take a warning notice what she says. 0.86 0.46
Please be sure to take notes what she says. 0.81 0.48
Please take a note what she say. 0.92 0.73
Please be sure to take instructions for her saying. 0.93 0.76
Make sure to insert disclaimer about what she says. 0.93 0.80
Please make a memo whatever she says. 0.99 0.82
40. Example Output: quiet
Response SAD MAS
As long as you keep quiet you may stay here 0.83 0.26
You can stay here as long as you keep quiet. 0.86 0.39
You may stay here if you keep quiet. 0.81 0.39
You can stay here if you keep quiet. 0.82 0.57
So long as you remain quiet you may stay here. 0.92 0.57
If it is quiet you may stay here 0.90 0.70
If you keep quiet you can stay here. 0.92 0.81
You may be here if you keep quiet. 0.91 0.84
41. Example Output: go ahead
Response SAD MAS
Please go ahead if i am late. 0.83 0.16
Please go ahead if I'm late. 0.79 0.28
Please go ahead if I delayed. 0.82 0.51
Please go without me if I'm late. 0.91 0.62
Please go ahead if I get late 0.83 0.67
Please go ahead and leave if I'm late. 0.88 0.74
If I am late you can go in first. 1.00 0.79
If I should be late go without me. 1.00 0.81
42. Example Output: married
Response SAD MAS
Actually they are not married 0.91 0.18
To tell the truth they are not couple 0.79 0.47
To tell the truth they are not a married couple 0.84 0.62
To tell the truth they're not married 0.89 0.63
In fact they are not couple 0.94 0.69
to telling the truth we're not married 0.97 0.71
Two people are not couples in truth 1.00 0.79
43. Roadmap
• Prior work
• Approach
• Example outputs
• Conclusion
44. Conclusion
• Probabilistic model identifies normative vs. outlier
responses by quantifying distance between responses
• Many choices for measuring distance between two
texts (e.g., character-based or more semantic NLP)
• 3 models: local (SAD), global (BAU), or combo (MAS)
• Open source: github.com/Praznat/annotationmodeling
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
45. Future work
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
• From objective labeling tasks to subjective responses
• Evaluation on survey data
– Collaboration with behavioral science researchers?
– Compare distance functions and model settings for utility
• Automatic detection of consistent biases in a
participant’s responses vs. what’s group normative
46.
Matt Lease (University of Texas at Austin)
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
We thank our many talented crowd workers
for their contributions to our research!
Alexander Braylan and Matthew Lease. Aggregating Complex Annotations via Merging and Matching. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD), pages 86–94, 2021.
Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of The Web Conference (WWW), pages 1807–1818, 2020.
48. MTurk: The Early Days
48
• Artificial Intelligence, With Help From the Humans.
– J. Pontin. NY Times, March 25, 2007
• Is Amazon's Mechanical Turk a Failure? April 9, 2007
– “As of this writing, there are [only] 128 HITs available on Mechanical Turk.”
• Su et al., WWW 2007: “a web-based human data collection system… ‘System M’ ”
49. 2008: the “Gold” Rush Begins
Snow et al., EMNLP 2008 (Natural Language Processing)
• Annotating human language for natural language processing (NLP)
• 22,000 labels for only $26 USD
• Crowd’s consensus labels can replace traditional expert labels
“Discovery” sparks rush for “gold” data across areas
• Alonso et al., SIGIR Forum (Information Retrieval)
• Kittur et al., CHI (Human-Computer Interaction)
• Sorokin and Forsyth, CVPR (Computer Vision)
50. 2010-11: Social & Behavioral Sciences
50
• A Guide to Behavioral Experiments on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk : A New Source of Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
51. The Future of Crowd Work (ACM CSCW’13)
by Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
55. Tasks & datasets
SYNTHETIC DATASETS
• Syntactic parse trees
– Distance function: evalb
• Ranked lists
– Distance function: Kendall’s tau
REAL DATASETS
• Biomedical text sequences
– Distance function: Span F1
• Urdu-English translations
– Distance function: GLEU
Nguyen et al 2017
Zaidan and Callison-Burch 2011
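For rankings, the deck names Kendall's tau as the distance function; a minimal pair-counting sketch, assuming both rankings cover the same items with no ties:

```python
from itertools import combinations

def kendall_tau_distance(r1: list, r2: list) -> float:
    # Fraction of item pairs ranked in opposite order by the two rankings,
    # normalized to [0, 1]; a standard distance for ranked lists.
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    discordant = sum(
        1 for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )
    n = len(r1)
    return discordant / (n * (n - 1) / 2)

print(kendall_tau_distance(["a", "b", "c"], ["a", "b", "c"]))  # → 0.0
print(kendall_tau_distance(["a", "b", "c"], ["c", "b", "a"]))  # → 1.0
```

Like string edit distance for free text, this plugs into the same distance-matrix machinery without any change to the aggregation model.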
56. Methods
Baselines:
• Random User (RU): pick one label randomly
• ZenCrowd (ZC) (Demartini et al. 2012)
– Weighted voting based on exact match (rare!)
• Crowd Hidden Markov Model (CHMM) (Nguyen et al. 2017)
– Sequence annotation task only
Upper bound: Oracle (OR) (always picks best label)
• Even if 5 workers answer, limited by best answer any of them gave
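ZenCrowd, as summarized on the slide, does weighted voting over exact-match labels; a toy sketch with fixed worker weights (the real model learns weights via EM, and `weighted_vote` with its example weights is hypothetical):

```python
from collections import defaultdict

def weighted_vote(labels: dict[str, str], weights: dict[str, float]) -> str:
    # Each worker's label gets their weight; the highest-weighted label wins.
    # With complex labels, exact agreement is rare, so this often degenerates
    # to picking the single highest-weight worker's response.
    tally = defaultdict(float)
    for worker, label in labels.items():
        tally[label] += weights.get(worker, 1.0)
    return max(tally, key=tally.get)

labels = {"w1": "A cat is eating", "w2": "The cat eats", "w3": "A cat is eating"}
weights = {"w1": 0.9, "w2": 0.8, "w3": 0.5}
print(weighted_vote(labels, weights))  # → "A cat is eating"
```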
57. Results
Task          Metric   RU      Oracle
Translations  GLEU     0.185   0.246
Sequences     F1       0.561   0.827
Parses        EVALB    0.812   0.939
Rankings               0.491   0.724
• Diverse complex label datasets
60. Results
Task          Metric   RU      ZC      CHMM    MAS     Oracle
Translations  GLEU     0.185   0.188   -       0.217   0.246
Sequences     F1       0.561   0.569   0.702   0.709   0.827
Parses        EVALB    0.812   0.819   -       0.932   0.939
Rankings               0.491   0.495   -       0.710   0.724
• Diverse complex label datasets
• MAS aggregation gets closest to ground truth with no model alteration between datasets
62. Good Systems: an 8-year, $10M UT Austin Grand Challenge
Goal: Design a future of Artificial Intelligence (AI) technologies to meet society’s needs and values.
http://goodsystems.utexas.edu
63. What’s an Information School?
“The place where people & technology meet” ~ Wobbrock et al., 2009
“iSchools” now exist at over 100 universities around the world
64. Task-specific workflows
• Pros:
– Empower workers
for complex tasks
• Cons:
– Need new workflow
for every task
– Complicated, difficult
to formulate
Noronha et al 2011
(image analysis)
Lasecki et al 2012
(transcription)