Research talk presented at "Innovations in Online Research" (October 1, 2021)
Event URL: https://web.cvent.com/event/d063e447-1f16-4f70-a375-5d6978b3feea/websitePage:b8d4ce12-3d02-4d24-897d-fd469ca4808a.
Automated Models for Quantifying Centrality of Survey Responses
1. Matt Lease
Associate Professor
School of Information
The University of Texas at Austin
Amazon Scholar
Human-in-the-loop Services
Amazon Web Services (AWS)
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
8. Caption this image:
When majority voting falls short
Problem: large label space, exact match doesn’t work!
A cat is eating
The cat eats
A beautiful picture
9. What about complex annotations?
Ranked lists
Parse trees
Image captions
Range sequences
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
10.
Alexander Braylan¹ and Matthew Lease²
¹Dept. of Computer Science & ²School of Information
The University of Texas at Austin
Modeling and Aggregation of Complex
Annotations via Annotation Distance
Code & Data: https://github.com/Praznat/annotationmodeling
11. Roadmap
• Prior work
• Approach
• Example outputs
• Conclusion
12. Aggregating Simple Labels
• Hundreds of papers
• Multiple benchmarking studies
• Rich body of Bayesian modeling
• General-purpose aggregation
models for simple labels don’t
support complex labels
Models compared: Dawid-Skene, MACE, Hierarchical Dawid-Skene, Item Difficulty, Logistic Random Effects
Source: Paun et al. 2018, “Comparing Bayesian Models of Annotation”
13. Task-specific models
• Pros:
– Task specialization
maximizes accuracy
• Cons:
– Need new model for
every task
– Complicated, difficult
to formulate
Nguyen et al 2017 (Sequences)
Lin, Mausam, and Weld 2012 (Math)
14. Our goals
• We want aggregation for complex data types
– Build on ideas from simple label aggregation models
• We want to generalize across many labeling tasks
– Can we reduce the problem to a common, simpler state space?
15. Roadmap
• Prior work
• Approach
• Example outputs
• Conclusion
16. Key Insight
Partial credit matching via task-specific distance function
• Adopt or define a distance function for each annotation task
• Model annotation distances uniformly across tasks
• Distance functions already exist for many task types
– Free-text responses, e.g., survey questions
17. Calculate distances
(figure: four responses: “a cat is eating”, “cat is eating”, “the cat eats”, “a beautiful picture”)
• Example task: free text answer
• Example distance function: string edit distance
18. Calculate distances
(figure: small pairwise distances among the three “cat” responses: 0.05, 0.1, 0.1)
• Example task: free text answer
• Example distance function: string edit distance
19. Calculate distances
(figure: small distances among the three “cat” responses (0.05, 0.1, 0.1) and large distances to “a beautiful picture” (0.8, 0.82, 0.82))
• Example task: free text answer
• Example distance function: string edit distance
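The running example can be made concrete. A minimal sketch of a normalized string edit distance; the deck does not specify the exact normalization, so this divides Levenshtein distance by the longer string's length:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance(a: str, b: str) -> float:
    # Normalize to [0, 1] by the longer string's length.
    return levenshtein(a, b) / max(len(a), len(b), 1)

print(distance("a cat is eating", "cat is eating"))       # small
print(distance("a cat is eating", "a beautiful picture"))  # large
```

The similar captions come out close to 0 while the off-topic caption is far from all of them, which is exactly the partial-credit signal that exact-match voting throws away.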
22. Distance function properties
Properties of distance functions:
Non-negativity
Symmetry
Triangle inequality

Data                   Free Text    Rankings
Example evaluation fn  BLEU(x, y)
Example distance fn
Non-negativity         ✓            ✓
Symmetry               ✓            ✓
Triangle inequality    ✓            ✓
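The checkmarks above can be spot-checked numerically. A sketch using Jaccard distance over character sets as a stand-in metric (not a distance function from the deck):

```python
import itertools
import random
import string

def jaccard_distance(a: set, b: set) -> float:
    # 1 - |A∩B| / |A∪B|; a proper metric on finite sets.
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

random.seed(0)
samples = [set(random.sample(string.ascii_lowercase, random.randint(1, 8)))
           for _ in range(30)]

for x, y, z in itertools.product(samples, repeat=3):
    d_xy = jaccard_distance(x, y)
    assert d_xy >= 0.0                           # non-negativity
    assert d_xy == jaccard_distance(y, x)        # symmetry
    assert jaccard_distance(x, z) <= d_xy + jaccard_distance(y, z) + 1e-12  # triangle
print("all three metric properties hold on the sample")
```

This kind of empirical check is useful before plugging a new task-specific distance function into the aggregation models.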
23. Calculate distances
(figure, repeated: small distances among the three “cat” responses (0.05, 0.1, 0.1), large distances to “a beautiful picture” (0.8, 0.82, 0.82))
24. All tasks reduce to matrices of distances
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
(figure: pairwise distances of 0.1, 0.6, and 0.3 between the three responses)
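Reducing an item to a distance matrix can be sketched as follows, using Python's stdlib `difflib` similarity ratio as a stand-in string distance (the paper's distance functions differ):

```python
from difflib import SequenceMatcher

def text_distance(a: str, b: str) -> float:
    # 1 - similarity ratio: a stand-in string distance, not the paper's choice.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def distance_matrix(annotations: list[str]) -> list[list[float]]:
    # Every task reduces to a symmetric per-item matrix of pairwise distances.
    n = len(annotations)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = text_distance(annotations[i], annotations[j])
    return D

D = distance_matrix(["A cat is eating", "The cat eats", "A beautiful picture"])
for row in D:
    print([round(d, 2) for d in row])
```

Once every task is in this form, the same aggregation machinery applies regardless of whether the annotations were captions, parse trees, or rankings.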
25. How to aggregate given distances
• Local selection model
• Global selection model
• Combined
(figure: the current item’s distance matrix alongside other items’ matrices)
26. Local approach: Smallest Avg Distance (SAD)
• For each question: compute average
distance between responses
• The response with smallest average
distance is locally most normative,
generalizing majority vote
• Assumes independence between items
• Does not model a respondent’s overall agreement across items
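The SAD rule above can be sketched in a few lines (a minimal reading of the slide, not the repo's implementation). The example matrix reuses the caption distances from the earlier slides:

```python
def smallest_avg_distance(D: list[list[float]]) -> int:
    # D is one item's symmetric response-by-response distance matrix.
    # Pick the response with the smallest average distance to the others:
    # a partial-credit generalization of majority vote.
    n = len(D)
    avg = [sum(row) / (n - 1) for row in D]  # diagonal is 0, so divide by n-1
    return min(range(n), key=avg.__getitem__)

D = [
    [0.00, 0.10, 0.80],   # "A cat is eating"
    [0.10, 0.00, 0.82],   # "The cat eats"
    [0.80, 0.82, 0.00],   # "A beautiful picture"
]
print(smallest_avg_distance(D))  # → 0
```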
27. Global approach: Best Available User (BAU)
• Score each participant by their
average distance to all other
participants across all questions
• The participant with lowest score is
globally most normative; treat their
response as most normative
• Global approach ignores distance
observed on the current item
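A BAU sketch, assuming for simplicity that the same participants answer every question (the real setting need not be this clean):

```python
def best_available_user(item_matrices: list[list[list[float]]]) -> int:
    # Score each participant by mean distance to all other participants,
    # pooled across all questions; the lowest score wins globally, and that
    # participant's response is taken on every item.
    n = len(item_matrices[0])
    score = [0.0] * n
    for D in item_matrices:
        for u in range(n):
            score[u] += sum(D[u]) / (n - 1)
    return min(range(n), key=score.__getitem__)

# Two questions; participant 2 is the outlier on both.
D1 = [[0.0, 0.10, 0.80], [0.10, 0.0, 0.82], [0.80, 0.82, 0.0]]
D2 = [[0.0, 0.20, 0.90], [0.20, 0.0, 0.70], [0.90, 0.70, 0.0]]
print(best_available_user([D1, D2]))  # → 1
```

Note how the global winner (participant 1) differs from the per-item SAD choice on the first question, which is exactly the local vs. global tension the next slide addresses.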
28. Can we get best of both worlds?
• Want a method that combines:
– Best available user (global)
– Smallest avg distance (local)
• Should build on rich history of work on Bayesian annotation modeling
• Need a principled framework for modeling annotation distance matrices
(figure: votes + weights → weighted voting)
29. Multidimensional Annotation Scaling (MAS)
• Based on Multidimensional
Scaling (Kruskal & Wish 1978)
• Probabilistic model of multi-
item distance matrices
• “Hierarchical Bayesian”
– Additional learned parameters
represent crowd effects such as
worker reliability
(figure: “A cat is eating”, “The cat eats”, “A beautiful picture”)
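MAS itself is a hierarchical Bayesian model (see the paper and repo); for flavor, classical multidimensional scaling shows how point embeddings are recovered from a distance matrix. A numpy sketch, not the authors' model:

```python
import numpy as np

def classical_mds(D: np.ndarray, dim: int = 2) -> np.ndarray:
    # Embed points so Euclidean distances approximate D (Torgerson/Gower MDS).
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]  # top eigenpairs
    L = np.sqrt(np.clip(vals[order], 0, None))
    return vecs[:, order] * L

D = np.array([[0.00, 0.10, 0.80],
              [0.10, 0.00, 0.82],
              [0.80, 0.82, 0.00]])
X = classical_mds(D)
# The two "cat" captions land close together; the outlier sits far away.
print(np.linalg.norm(X[0] - X[1]) < np.linalg.norm(X[0] - X[2]))  # → True
```

MAS replaces this deterministic embedding with a probabilistic one and adds learned crowd-effect parameters such as worker reliability.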
32. MAS Objective 2: Prior
(figure: responses “a cat is eating”, “cat is eating”, “the cat eats”, “a beautiful picture”, with a pseudo-gold point)
37. Roadmap
• Prior work
• Approach
• Example outputs
• Conclusion
38. Example Output: father
Response SAD MAS
He always speaks ill about his father behind back. 0.78 0.16
He always speaks ill of his father behind his back. 0.71 0.30
He always talks about his father behind his back. 0.74 0.50
He always speaks ill of his father 0.78 0.55
He always speak ill of his father. 0.79 0.62
He is always talking about his father behind his back. 0.82 0.63
He always says behind his father. 0.90 0.72
He always talks about his dad behind his back. 0.83 0.73
39. Example Output: she says
Response SAD MAS
Please be sure to take a note of what she says. 0.77 0.16
Please take a note of what she says. 0.84 0.30
Be sure to take a warning notice what she says. 0.86 0.46
Please be sure to take notes what she says. 0.81 0.48
Please take a note what she say. 0.92 0.73
Please be sure to take instructions for her saying. 0.93 0.76
Make sure to insert disclaimer about what she says. 0.93 0.80
Please make a memo whatever she says. 0.99 0.82
40. Example Output: quiet
Response SAD MAS
As long as you keep quiet you may stay here 0.83 0.26
You can stay here as long as you keep quiet. 0.86 0.39
You may stay here if you keep quiet. 0.81 0.39
You can stay here if you keep quiet. 0.82 0.57
So long as you remain quiet you may stay here. 0.92 0.57
If it is quiet you may stay here 0.90 0.70
If you keep quiet you can stay here. 0.92 0.81
You may be here if you keep quiet. 0.91 0.84
41. Example Output: go ahead
Response SAD MAS
Please go ahead if i am late. 0.83 0.16
Please go ahead if I'm late. 0.79 0.28
Please go ahead if I delayed. 0.82 0.51
Please go without me if I'm late. 0.91 0.62
Please go ahead if I get late 0.83 0.67
Please go ahead and leave if I'm late. 0.88 0.74
If I am late you can go in first. 1.00 0.79
If I should be late go without me. 1.00 0.81
42. Example Output: married
Response SAD MAS
Actually they are not married 0.91 0.18
To tell the truth they are not couple 0.79 0.47
To tell the truth they are not a married couple 0.84 0.62
To tell the truth they're not married 0.89 0.63
In fact they are not couple 0.94 0.69
to telling the truth we're not married 0.97 0.71
Two people are not couples in truth 1.00 0.79
43. Roadmap
• Prior work
• Approach
• Example outputs
• Conclusion
44. Conclusion
• Probabilistic model identifies normative vs. outlier
responses by quantifying distance between responses
• Many choices for measuring distance between two
texts (e.g., character-based or more semantic NLP)
• 3 models: local (SAD), global (BAU), or combo (MAS)
• Open source: github.com/Praznat/annotationmodeling
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
45. Future work
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
• From objective labeling tasks to subjective responses
• Evaluation on survey data
– Collaboration with behavioral science researchers?
– Compare distance functions and model settings for utility
• Automatic detection of consistent biases in a
participant’s responses vs. what’s group normative
46.
Matt Lease (University of Texas at Austin)
Lab: ir.ischool.utexas.edu
@mattlease
Slides: slideshare.net/mattlease
We thank our many talented crowd workers
for their contributions to our research!
Alexander Braylan and Matthew Lease. Aggregating Complex Annotations via Merging and Matching. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD), pages 86–94, 2021.
Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of The Web Conference (WWW), pages 1807–1818, 2020.
48. MTurk: The Early Days
48
• Artificial Intelligence, With Help From the Humans.
– J. Pontin. NY Times, March 25, 2007
• Is Amazon's Mechanical Turk a Failure? April 9, 2007
– “As of this writing, there are [only] 128 HITs available on Mechanical Turk.”
• Su et al., WWW 2007: “a web-based human data collection system… ‘System M’ ”
49. 2008: the “Gold” Rush Begins
Snow et al., EMNLP 2008 (Natural Language Processing)
• Annotating human language for natural language processing (NLP)
• 22,000 labels for only $26 USD
• Crowd’s consensus labels can replace traditional expert labels
“Discovery” sparks rush for “gold” data across areas
• Alonso et al., SIGIR Forum (Information Retrieval)
• Kittur et al., CHI (Human-Computer Interaction)
• Sorokin and Forsyth, CVPR (Computer Vision)
50. 2010-11: Social & Behavioral Sciences
50
• A Guide to Behavioral Experiments on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk : A New Source of Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
51. The Future of Crowd Work (ACM CSCW’13)
by Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
55. Tasks & datasets
SYNTHETIC DATASETS
• Syntactic parse trees
– Distance function: evalb
• Ranked lists
– Distance function: Kendall’s tau
REAL DATASETS
• Biomedical text sequences
– Distance function: Span F1
• Urdu-English translations
– Distance function: GLEU
Nguyen et al 2017
Zaidan and Callison-Burch 2011
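For rankings, the deck names Kendall's tau as the distance function; a minimal pair-counting sketch, assuming both rankings cover the same items with no ties:

```python
from itertools import combinations

def kendall_tau_distance(r1: list, r2: list) -> float:
    # Fraction of item pairs ranked in opposite order by the two rankings,
    # normalized to [0, 1]; a standard distance for ranked lists.
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    discordant = sum(
        1 for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )
    n = len(r1)
    return discordant / (n * (n - 1) / 2)

print(kendall_tau_distance(["a", "b", "c"], ["a", "b", "c"]))  # → 0.0
print(kendall_tau_distance(["a", "b", "c"], ["c", "b", "a"]))  # → 1.0
```

Like string edit distance for free text, this plugs into the same distance-matrix machinery without any change to the aggregation model.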
56. Methods
Baselines:
• Random User (RU): pick one label randomly
• ZenCrowd (ZC) (Demartini et al. 2012)
– Weighted voting based on exact match (rare!)
• Crowd Hidden Markov Model (CHMM) (Nguyen et al. 2017)
– Sequence annotation task only
Upper bound: Oracle (OR) (always picks best label)
• Even if 5 workers answer, limited by best answer any of them gave
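ZenCrowd, as summarized on the slide, does weighted voting over exact-match labels; a toy sketch with fixed worker weights (the real model learns weights via EM, and `weighted_vote` with its example weights is hypothetical):

```python
from collections import defaultdict

def weighted_vote(labels: dict[str, str], weights: dict[str, float]) -> str:
    # Each worker's label gets their weight; the highest-weighted label wins.
    # With complex labels, exact agreement is rare, so this often degenerates
    # to picking the single highest-weight worker's response.
    tally = defaultdict(float)
    for worker, label in labels.items():
        tally[label] += weights.get(worker, 1.0)
    return max(tally, key=tally.get)

labels = {"w1": "A cat is eating", "w2": "The cat eats", "w3": "A cat is eating"}
weights = {"w1": 0.9, "w2": 0.8, "w3": 0.5}
print(weighted_vote(labels, weights))  # → "A cat is eating"
```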
57. Results
Task          Metric   RU      Oracle
Translations  GLEU     0.185   0.246
Sequences     F1       0.561   0.827
Parses        EVALB    0.812   0.939
Rankings               0.491   0.724
• Diverse complex label datasets
60. Results
Task          Metric   RU      ZC      CHMM    MAS     Oracle
Translations  GLEU     0.185   0.188   -       0.217   0.246
Sequences     F1       0.561   0.569   0.702   0.709   0.827
Parses        EVALB    0.812   0.819   -       0.932   0.939
Rankings               0.491   0.495   -       0.710   0.724
• Diverse complex label datasets
• MAS aggregation gets closest to ground truth with no model alteration between datasets
62. Good Systems: an 8-year, $10M UT Austin Grand Challenge
Goal: Design a future of Artificial Intelligence (AI) technologies to meet society’s needs and values.
http://goodsystems.utexas.edu
63. What’s an Information School?
“The place where people & technology meet” ~ Wobbrock et al., 2009
“iSchools” now exist at over 100 universities around the world
64. Task-specific workflows
• Pros:
– Empower workers
for complex tasks
• Cons:
– Need new workflow
for every task
– Complicated, difficult
to formulate
Noronha et al 2011
(image analysis)
Lasecki et al 2012
(transcription)