Modeling and Aggregation of Complex Annotations

ALEXANDER BRAYLAN* AND MATTHEW LEASE
The University of Texas at Austin
http://ir.ischool.utexas.edu/
APRIL 2020
MODELING AND AGGREGATION OF COMPLEX ANNOTATIONS VIA ANNOTATION DISTANCE

Simple annotation & aggregation
• classification
– sentiment analysis
– image categorization
• ordinal rating
– product & movie reviews
– search relevance
• multiple choice selection
– quizzes
Aggregation
• Crowd-sourcing: quality control
• Experts: wisdom of crowds
• Goal is to select the best label available for each item

What’s the capital of Texas?
Answers: Austin, Austin, Houston
Majority vote: Austin

Caption this image: when majority voting falls short
• A cat is eating
• The cat eats
• A beautiful picture
Problem: large label space, exact match doesn’t work!
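A minimal sketch of exact-match majority voting applied to the two examples above. The function name `majority_vote` is an illustrative choice, not something from the talk. On the capital question the vote is decisive; on the captions every label receives a single vote, so the "winner" is only an artifact of tie-breaking.

```python
from collections import Counter


def majority_vote(labels):
    """Return (label, votes) for the most frequent label under exact match.

    Ties are broken by first occurrence, because Counter.most_common keeps
    equally frequent elements in the order they were first seen.
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes


capital_answers = ["Austin", "Austin", "Houston"]
caption_answers = ["A cat is eating", "The cat eats", "A beautiful picture"]

print(majority_vote(capital_answers))  # two matching votes
print(majority_vote(caption_answers))  # no two captions match exactly
```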

What about complex annotations?
• Ranked lists
• Parse trees
• Image captions
– A1: A cat is eating
– A2: The cat eats
– A3: A beautiful picture
• Range sequences

Outline
• Prior work
• Approach
• Experiments
• Conclusion

Aggregating Simple Labels
• Hundreds of papers
• Multiple benchmarking studies
• Rich body of Bayesian modeling
• General-purpose aggregation models for simple labels don’t support complex labels!
[Figure: example models for simple labels: Dawid-Skene, MACE, Hierarchical Dawid-Skene, Item Difficulty, Logistic Random Effects. Source: Paun et al 2018, “Comparing Bayesian Models of Annotation”]

Task-specific models
• Pros:
– Task specialization maximizes accuracy
• Cons:
– Need new model for every task
– Complicated, difficult to formulate
Nguyen et al 2017 (Sequences)
Lin, Mausam, and Weld 2012 (Math)

Task-specific workflows
• Pros:
– Empower workers for complex tasks
• Cons:
– Need new workflow for every task
– Complicated, difficult to formulate
Noronha et al 2011 (image analysis)
Lasecki et al 2012 (transcription)

Our goals
• We want aggregation for complex data types
– Build on ideas from simple label aggregation models
• We want to generalize across many labeling tasks
– Can we reduce problem to common simpler state space?

Outline
• Prior work
• Approach
• Experiments
• Conclusion

Key Insight
• Partial credit matching via task-specific distance function
– Encapsulate task-specific label features in a distance function supplied by the requester
– Model annotation distances rather than annotations
– Distance functions already exist for most tasks because people need evaluation functions to compare predicted labels vs gold

Distance functions
Properties of distance functions: non-negativity, symmetry, triangle inequality

Data                   Free Text                            Rankings
Example evaluation fn  BLEU(x, y)                           Kendall’s 𝜏(x, y)
Example distance fn    1 – (BLEU(x, y) + BLEU(y, x)) / 2    1 – 𝜏(x, y)
Non-negativity         ✓                                    ✓
Symmetry               ✓                                    ✓
Triangle inequality    ✓                                    ✓
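A rough sketch of how an evaluation function might be turned into a distance function, following the distance-function examples above. `unigram_precision` is a hypothetical toy stand-in for BLEU (it is not BLEU), and `kendall_tau` is a plain-Python version for rankings without ties. All function names here are assumptions made for illustration.

```python
from itertools import combinations


def symmetric_distance(score, x, y):
    """Turn an asymmetric similarity in [0, 1] into a symmetric distance."""
    return 1.0 - (score(x, y) + score(y, x)) / 2.0


def unigram_precision(candidate, reference):
    """Toy asymmetric similarity: share of candidate words found in reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand) if cand else 0.0


def kendall_tau(x, y):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_x = {item: rank for rank, item in enumerate(x)}
    pos_y = {item: rank for rank, item in enumerate(y)}
    concordant = discordant = 0
    for a, b in combinations(x, 2):
        if (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0


def ranking_distance(x, y):
    """1 - tau: 0 for identical rankings, 2 for fully reversed rankings."""
    return 1.0 - kendall_tau(x, y)


print(symmetric_distance(unigram_precision, "a cat is eating", "the cat eats"))
print(ranking_distance(["a", "b", "c"], ["c", "b", "a"]))
```

Averaging the score in both directions is what gives symmetry; non-negativity follows as long as the underlying score never exceeds 1.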

Calculate distances
• Example task: text annotation
• Example distance function: string edit distance
• Example annotations: “a cat is eating”, “cat is eating”, “the cat eats”, “a beautiful picture”
[Figure: pairwise distances between the four annotations: 0.05, 0.1, 0.1, 0.8, 0.82, 0.82]
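A minimal sketch of computing a pairwise distance matrix for the four example strings with string edit distance. Dividing by the longer string's length is an assumption made here to keep values in [0, 1]; the talk does not say how its distances were normalized, so the numbers printed need not match the figure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def distance_matrix(annotations, distance):
    """Square matrix of distances between every pair of annotations."""
    return [[distance(a, b) for b in annotations] for a in annotations]


annotations = ["a cat is eating", "cat is eating",
               "the cat eats", "a beautiful picture"]
matrix = distance_matrix(annotations, normalized_edit_distance)
for row in matrix:
    print(["%.2f" % d for d in row])
```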

All tasks reduce to matrices of annotation distances
A1: A cat is eating
A2: The cat eats
A3: A beautiful picture
[Figure: pairwise distances between A1, A2, and A3: 0.1, 0.6, 0.3]

How to aggregate given distances
• Local selection model
• Global selection model
• Combined
[Figure labels: Current item, Other items]

Local approach: Smallest Avg Distance
• For each item:
1. Compute each annotation’s average distance to the other annotations for the item
2. Choose annotation with smallest average distance
• Generalization of majority vote
• Independence between items
• Local approach does not model annotator reliability
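The local approach can be sketched as follows. The function names and the toy `word_overlap_distance` are assumptions for illustration, not the talk's implementation. With an exact-match distance the procedure behaves like majority vote, and with a graded distance it can still pick a caption when no two captions match.

```python
def smallest_average_distance(annotations, distance):
    """Return the index of the annotation closest on average to the others."""
    if len(annotations) == 1:
        return 0

    def avg_distance(i):
        others = [distance(annotations[i], annotations[j])
                  for j in range(len(annotations)) if j != i]
        return sum(others) / len(others)

    # min() keeps the first index when several annotations tie.
    return min(range(len(annotations)), key=avg_distance)


def exact_match_distance(a, b):
    """0 when labels are identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def word_overlap_distance(a, b):
    """Toy distance: 1 minus Jaccard overlap of lowercased words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not (wa | wb):
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)


capitals = ["Austin", "Austin", "Houston"]
captions = ["A cat is eating", "The cat eats", "A beautiful picture"]
print(capitals[smallest_average_distance(capitals, exact_match_distance)])
print(captions[smallest_average_distance(captions, word_overlap_distance)])
```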

Global approach: Best Available User
• For each annotator:
– Score by average distance over full dataset
• For each item:
– Choose the label from the best-scoring annotator who labeled the item
• Fixed annotator reliability
• Global approach does not model how well annotators did on specific items
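A sketch of the global approach under one stated assumption: an annotator's score is taken to be the average distance between their labels and co-annotators' labels on the same items, pooled over the whole dataset. The data layout (item to annotator to label) and all names are hypothetical.

```python
from collections import defaultdict


def exact_match_distance(a, b):
    """0 when labels are identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def annotator_scores(data, distance):
    """Average distance from each annotator's labels to co-annotators' labels."""
    totals, counts = defaultdict(float), defaultdict(int)
    for labels in data.values():
        for u, label_u in labels.items():
            for v, label_v in labels.items():
                if u != v:
                    totals[u] += distance(label_u, label_v)
                    counts[u] += 1
    return {u: totals[u] / counts[u] for u in counts}


def best_available_user(data, distance):
    """For each item, take the label of the lowest-scoring annotator present."""
    scores = annotator_scores(data, distance)
    chosen = {}
    for item, labels in data.items():
        best = min(labels, key=lambda u: scores.get(u, float("inf")))
        chosen[item] = labels[best]
    return chosen


data = {
    "q1": {"u1": "Austin", "u2": "Austin", "u3": "Houston"},
    "q2": {"u1": "Paris", "u3": "Lyon"},
    "q3": {"u1": "Rome", "u2": "Rome", "u3": "Milan"},
}
print(annotator_scores(data, exact_match_distance))
print(best_available_user(data, exact_match_distance))
```

Because the score is fixed per annotator, the same annotator wins on every item they labeled, which is the limitation noted in the last bullet.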

Can we get best of both worlds?
• Want a method that combines:
– Best available user (global)
– Smallest avg distance (local)
• Should build on rich history of work on Bayesian annotation modeling
• Need a principled framework for modeling annotation distance matrices
[Figure labels: weights, votes, weighted voting]

Multidimensional Annotation Scaling (MAS)
• Based on Multidimensional Scaling (Kruskal & Wish 1978)
• Probabilistic model of multi-item distance matrices
• “Hierarchical Bayesian”
– Additional learned parameters represent crowd effects such as worker reliability
[Figure: the annotations “A cat is eating”, “The cat eats”, “A beautiful picture”]

MAS Objective 1: Likelihood
Multidimensional Scaling objective:
$D_{iuv} \sim \mathcal{N}(\lVert \varepsilon_{iu} - \varepsilon_{iv} \rVert, \sigma)$
• $D_{iuv}$ : observed distance
• $\varepsilon_{iu}$ : annotation embedding
• $\sigma$ : error scale
[Figure: the annotations “a cat is eating”, “cat is eating”, “the cat eats”, “a beautiful picture” with pairwise distances 0.05, 0.1, 0.1, 0.8, 0.82, 0.82]
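A small sketch of the likelihood term for a single item, to make the objective concrete. It assumes σ is a standard deviation and that embeddings are plain coordinate tuples; the function names are illustrative, and the actual model in the paper is fit by Bayesian inference rather than evaluated by hand like this.

```python
import math


def normal_logpdf(x, mean, scale):
    """Log density of a normal distribution with standard deviation `scale`."""
    return (-0.5 * math.log(2 * math.pi * scale ** 2)
            - (x - mean) ** 2 / (2 * scale ** 2))


def mds_log_likelihood(distances, embeddings, sigma):
    """Sum of log N(D_uv | ||e_u - e_v||, sigma) over observed pairs of one item.

    distances:  dict mapping (u, v) annotator pairs to observed distance
    embeddings: dict mapping annotator to a coordinate tuple
    """
    total = 0.0
    for (u, v), observed in distances.items():
        implied = math.dist(embeddings[u], embeddings[v])
        total += normal_logpdf(observed, implied, sigma)
    return total


distances = {("u1", "u2"): 0.1, ("u1", "u3"): 0.8, ("u2", "u3"): 0.82}
good = {"u1": (0.0, 0.0), "u2": (0.1, 0.0), "u3": (0.8, 0.1)}
bad = {"u1": (0.0, 0.0), "u2": (0.8, 0.0), "u3": (0.1, 0.0)}
print(mds_log_likelihood(distances, good, sigma=0.1))
print(mds_log_likelihood(distances, bad, sigma=0.1))
```

Embeddings whose pairwise Euclidean distances track the observed distances receive the higher log-likelihood.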

MAS Objective 2: Prior
Annotation prior probability objective:
$\varepsilon_{iu} = \tilde{\varepsilon}_{iu} / \lvert \tilde{\varepsilon}_{iu} \rvert$
$\tilde{\varepsilon}_{iu} \sim \mathcal{N}(0, \gamma_u \delta_i)$
• $\tilde{\varepsilon}$ : unnormalized embedding
• $\gamma_u$ : annotator error
• $\delta_i$ : item difficulty
[Figure: embeddings of “cat is eating” and “the cat eats” relative to pseudo-gold; example annotator errors 𝜸 = 0.1, 0.4, 0.5, 0.7]
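A rough sketch of what the prior on the unnormalized embedding implies: annotators with larger error γ are expected to land farther from the zero point. It assumes the second argument of N is a standard deviation and samples only the unnormalized embedding, leaving the normalization step out. All names and default values are illustrative.

```python
import math
import random


def sample_unnormalized_embedding(gamma_u, delta_i, dims, rng):
    """Draw one embedding with independent N(0, gamma_u * delta_i) coordinates."""
    scale = gamma_u * delta_i  # assumption: treated as a standard deviation
    return [rng.gauss(0.0, scale) for _ in range(dims)]


def mean_norm(gamma_u, delta_i=1.0, dims=2, n=2000, seed=0):
    """Average distance from the origin over n sampled embeddings."""
    rng = random.Random(seed)
    norms = [math.hypot(*sample_unnormalized_embedding(gamma_u, delta_i, dims, rng))
             for _ in range(n)]
    return sum(norms) / n


for gamma in (0.1, 0.4, 0.5, 0.7):
    print(gamma, round(mean_norm(gamma), 3))
```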

Outline
• Prior work
• Approach
• Experiments
• Conclusion

Tasks & datasets
SYNTHETIC DATASETS
• Syntactic parse trees
– Distance function: evalb
• Ranked lists
– Distance function: Kendall’s tau
REAL DATASETS
• Biomedical text sequences (Nguyen et al 2017)
– Distance function: Span F1
• Urdu-English translations (Zaidan and Callison-Burch 2011)
– Distance function: GLEU

Methods
Baselines:
• Random User (RU): pick one label randomly
• ZenCrowd (ZC) (Demartini et al 2012)
– Weighted voting based on exact match (rare!)
• Crowd Hidden Markov Model (CHMM) (Nguyen et al 2017)
– Sequence annotation task only
Upper bound: Oracle (OR) (always picks best label)
• Even if 5 workers answer, limited by best answer any of them gave

Results
Task          Metric      RU     ZC     CHMM   MAS    Oracle
Translations  GLEU        0.185  0.188  -      0.217  0.246
Sequences     F1          0.561  0.569  0.702  0.709  0.827
Parses        EVALB       0.812  0.819  -      0.932  0.939
Rankings      Kendall 𝜏   0.491  0.495  -      0.710  0.724
• Diverse complex label datasets
• MAS scores highest among the non-oracle methods on all four tasks, with no model alteration between datasets
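One way to read the table is to ask how much of the gap between Random User and the Oracle upper bound MAS closes on each task. This "gap closed" ratio is a derived quantity computed here for illustration from the numbers above; it is not a metric reported in the talk.

```python
# Scores copied from the results table (RU = Random User).
results = {
    "Translations": {"RU": 0.185, "MAS": 0.217, "Oracle": 0.246},
    "Sequences": {"RU": 0.561, "MAS": 0.709, "Oracle": 0.827},
    "Parses": {"RU": 0.812, "MAS": 0.932, "Oracle": 0.939},
    "Rankings": {"RU": 0.491, "MAS": 0.710, "Oracle": 0.724},
}


def gap_closed(row):
    """Fraction of the RU-to-Oracle gap that MAS recovers."""
    return (row["MAS"] - row["RU"]) / (row["Oracle"] - row["RU"])


for task, row in results.items():
    print(f"{task}: {gap_closed(row):.2f}")
```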

Conclusion
• Goal: general-purpose probabilistic model to aggregate complex annotations
– Categorical-based methods insufficient
– Custom models difficult to design for new annotation types
• Solution: Model annotation distances via task-specific distance functions
– Transforms problem into general-purpose variable space
• Multi-dimensional Annotation Scaling (MAS)
– Allows unsupervised weighted voting with inferred annotator reliability
• Not covered in talk (see paper)
– Semi-supervised learning
– Partial credit

Current & Future work
Big picture: what is needed to support complex crowd-sourcing?
• Integration with workflow design and other quality-control mechanisms
• Dynamic (online) collection – measuring value of getting another label
• Merging annotations rather than selecting best one
– e.g. guessing weight of an ox
• Learning difficult tasks over time

THANK YOU!
Code available at
https://github.com/Praznat/annotationmodeling
We thank the crowd workers for the data they contributed for this research study!

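Hierarchical priors:
$\log(\gamma_u) \sim \mathcal{N}(\log(\bar{\gamma}), \phi)$
$\log(\delta_i) \sim \mathcal{N}(\log(\bar{\delta}), \psi)$
• $\bar{\gamma}$ : annotator error location
• $\bar{\delta}$ : item difficulty location
• $\phi$ : annotator error scale
• $\psi$ : item difficulty scale
[Figure: annotator errors γ1 to γ4 distributed around the location parameter with scale 𝜙]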
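A minimal sketch of the hierarchical prior on annotator error: each annotator's error is drawn on the log scale around a shared location. It assumes 𝜙 is a standard deviation, and the function name, seed, and parameter values are illustrative only.

```python
import math
import random


def sample_annotator_errors(n_annotators, gamma_location, phi, seed=0):
    """Draw gamma_u with log(gamma_u) ~ N(log(gamma_location), phi)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(math.log(gamma_location), phi)
            for _ in range(n_annotators)]


# Small phi keeps annotators close to the shared location; large phi spreads them.
print(sample_annotator_errors(4, gamma_location=0.4, phi=0.1))
print(sample_annotator_errors(4, gamma_location=0.4, phi=1.0))
```

Working on the log scale keeps every sampled error positive, whatever the spread.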

Experiments: score-all results
• Traditionally, goal of annotation aggregation is to determine a single ground truth per item
• With complex annotations there could be several acceptable answers
• Alternate goal is to score each annotation by expected quality

Insight: semi-supervised learning
• Noisy parser experiment: more workers whose annotations deviate substantially from gold
• Semi-supervised learning allows rearrangement of inferred worker reliability according to similarity to known gold

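Why is it hard to design bespoke models?
SIMPLE label generative model: $P(L_{ui} \mid g_i, \theta_u, \theta_i, \ldots)$
• Label $L_{ui}$ (observed): categorical
• Gold $g_i$ (unobserved): categorical
• Latent parameters $\theta_u, \theta_i, \ldots$ (unobserved): scalars etc
COMPLEX label generative model: $P(L_{ui} \mid g_i, \theta_u, \theta_i, \ldots)$
• Label $L_{ui}$ (observed): complex data type
• Gold $g_i$ (unobserved): complex data type
• Latent parameters $\theta_u, \theta_i, \ldots$ (unobserved): scalars etc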
