Interannotator Agreement

John Blake
University of Aizu, Japan
Inter-annotator agreement: By hook or by crook
www.orau.org

Overview
• Background
• Case study
– Annotation of scientific research abstracts
– Strategic decision points
• Findings
– Methodological improvements
– Statistical smoke and rhetorical mirrors
• Conclusions

Subjectivity in annotation
• POS tagging, phonetic transcription, etc.
• Annotation guidelines with discussion of boundary cases
• Basic annotation guidelines
• Speaker intuition, e.g. discourse annotation, pragmatics, etc.
Problem:
Vagueness and ambiguity in natural languages
Manning (2011): $97.3^{21} / 100^{21} = 56.28\%$
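The Manning figure can be read as per-token accuracy compounding over a sentence: if each tag is correct 97.3% of the time, a 21-token sentence is tagged entirely correctly only about 56% of the time. The short check below is a sketch under two assumptions that are not stated on the slide: that the exponent 21 stands for tokens per sentence, and that tagging errors are independent across tokens.

```python
# Per-token tagging accuracy quoted on the slide (Manning, 2011).
token_accuracy = 0.973
# Exponent from the slide, interpreted here as tokens per sentence (assumption).
tokens_per_sentence = 21

# Probability that every token in the sentence is tagged correctly,
# assuming errors are independent across tokens (a simplification).
sentence_accuracy = token_accuracy ** tokens_per_sentence

print(f"{sentence_accuracy:.2%}")
```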

Automated and manual annotation compared
Aspect | Automated annotation | Manual annotation
Subjective agent | Software developer | Annotator
Subjective stage | Prior to annotation | During annotation
Replicability | (near) Perfect | Variable
Initial set-up cost | High (if new software) | Low
On-going cost | (near) Zero | High
Scalable | Yes | No
Dependent condition | Availability of training set | Availability of annotators (contingent on time/money)
Speed | (near) Instantaneous | Variable
Factors considered | Endogeneric | Endo- and exogeneric
Strength | Grammatical parsing | Semantic parsing

Inter-annotator agreement
Crucial issue: Are the annotations correct?
We are interested in validity
• Ability to discriminate without error by placing each item into the appropriate category
But there is no “ground truth”
• Linguistic categories are determined by human judgement
→ Implication: We cannot measure correctness directly
So we measure reliability, e.g. reproducibility.
• Intra-annotator reliability
• Inter-annotator reliability
i.e. whether human coders/annotators consistently make the same decisions
→ Assumption 1: lack of reliability rules out validity (text/training issues)
→ Assumption 2: high reliability implies validity
Terminology credit: Artstein & Poesio (2008)
Idea adapted from Boleda & Evert (2009): https://clseslli09.files.wordpress.com/2009/07/02_iaa-slides2.pdf/
Simple example 1
(abbreviated for length to increase readability)
Sentence | Coder 1 | Coder 2 | Agreement
We address the problem of …… recognition | I | P | No
Our aim is to …recognize [x] from [y]. | P | P | Yes
[A] is set up as prior information, and its pose is determined by three parameters, which are [j,k and l]. | M | M | Yes
An efficient local gradient-based method is proposed to …, which is combined into … framework to estimate [V and W] by iterative evolution | P | R | No
It is shown that the local gradient-based method can evaluate accurately and efficiently [V and W]. | R | R | Yes
Observed agreement between Coders 1 and 2 is 60% (3 of 5 sentences)

IAA measures: Kappa coefficient
Inter-annotator agreement is 60% in the previous example, but the chance agreement figure is 20%. Agreement measures must be corrected for chance agreement (Carletta, 1996).
Kappa coefficient (Cohen, 1960, for 2 annotators; Fleiss for more than 2)
e.g. Corrected measure: $K = \frac{P(A) - P(E)}{1 - P(E)}$
1 (agreement), 0 (no correlation), -1 (disagreement)
Interpretation of Kappa
• Landis and Koch (1977): 0.61-0.80 substantial; 0.81+ almost perfect
• Krippendorff (1980): 0.67-0.79 tentative; 0.8+ good
• Green (1997): 0.4-0.74 fair/good; 0.75+ high
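The 20% chance figure corresponds to treating the five move tags as equally likely (1/5). Cohen's kappa instead estimates chance agreement from each coder's own label distribution, so the two corrections give different values on the same data. The following is a minimal sketch comparing both, using the labels from the five-sentence example; the function names are illustrative, not taken from any library.

```python
from collections import Counter

# Move labels assigned by each coder in the five-sentence example.
coder1 = ["I", "P", "M", "P", "R"]
coder2 = ["P", "P", "M", "R", "R"]


def observed_agreement(a, b):
    """P(A): share of items on which both coders chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_expected_agreement(a, b):
    """P(E) for Cohen's kappa: sum over labels of the product of the
    two coders' marginal proportions for that label."""
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    return sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)


def chance_corrected(p_a, p_e):
    """K = (P(A) - P(E)) / (1 - P(E))."""
    return (p_a - p_e) / (1 - p_e)


p_a = observed_agreement(coder1, coder2)

# Uniform chance over five tags, as on the slide (1/5 = 20%).
k_uniform = chance_corrected(p_a, 1 / 5)

# Cohen's kappa, with chance estimated from the coders' marginals.
p_e_cohen = cohen_expected_agreement(coder1, coder2)
k_cohen = chance_corrected(p_a, p_e_cohen)

print(f"P(A)={p_a:.2f}  K(uniform)={k_uniform:.2f}  "
      f"P(E, Cohen)={p_e_cohen:.2f}  K(Cohen)={k_cohen:.2f}")
```

With only five items these values are purely illustrative, but they show that the choice of chance model is itself a decision that changes the reported coefficient.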

IAA measures: Sophisticated
e.g. Typical measures used in computational linguistics are built into NLP pipelines, such as NLTK and GATE.
Rather than measuring agreement alone, we can measure both agreement and disagreement, e.g. using Measuring Agreement on Set-valued Items (MASI) and/or Jaccard distance. Both MASI (Passonneau, 2006) and Jaccard distance make use of the union and intersection between sets.
The Jaccard formula (Jaccard, 1908, cited in Dunn & Everitt, 2004) is: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, with Jaccard distance $1 - J(A, B)$.
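As a concrete illustration for set-valued annotations, the sketch below computes Jaccard distance and a MASI distance in plain Python. The monotonicity weights (1, 2/3, 1/3, 0) follow the usual description of Passonneau (2006); the function names and the example label sets are illustrative assumptions, and NLTK offers its own versions in nltk.metrics.distance.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def masi_distance(a: set, b: set) -> float:
    """1 - (Jaccard coefficient x monotonicity weight)."""
    union = a | b
    if not union:
        return 0.0
    overlap = a & b
    if a == b:
        weight = 1.0      # identical sets
    elif a <= b or b <= a:
        weight = 2 / 3    # one set contains the other
    elif overlap:
        weight = 1 / 3    # partial overlap, neither contains the other
    else:
        weight = 0.0      # disjoint sets
    return 1 - (len(overlap) / len(union)) * weight


# Hypothetical case: one annotator assigns two move tags to a sentence,
# the other assigns only one of them.
annotator_1 = {"Purpose", "Method"}
annotator_2 = {"Purpose"}

d_jaccard = jaccard_distance(annotator_1, annotator_2)
d_masi = masi_distance(annotator_1, annotator_2)
print(round(d_jaccard, 3), round(d_masi, 3))
```

MASI penalises the subset case more heavily than Jaccard alone, which is why the two distances differ for the same pair of label sets.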

Case study overview
• Moves in scientific research abstracts
• Scientific disciplines
• Core corpus specifications
• Example abstract
• Tagset
• Strategic decision points (tag #IAA extraction)
NB: By convention this far-from-linear study is
presented in a linear fashion when in fact there
were numerous forks, dead-ends and iterations.

Moves in scientific research abstracts
Move definition
“a discoursal or rhetorical unit that performs a coherent communicative function in a written or spoken discourse” (Swales, 2004, p.228)
Move sequences
Example (very short) abstract
5-move code: Introduction, Purpose, Method, Results, Discussion

Scientific disciplines
Science
  Fundamental
    Empirical
      Natural
        Physical: Materials science
        Life: Botany
      Social: Linguistics
    Theoretical
      Formal: Information theory
  Applied
    Engineering: Evolutionary computation; Knowledge & data engineering; Image processing; Wireless computing; Electronic engineering
    Healthcare: Medical

Core 1000 corpus specifications
No. | Code | Journal name | # abstracts | # words
1 | EC | Transactions on Evolutionary Computation | 100 | 17,433
2 | KDE | Transactions on Knowledge and Data Engineering | 100 | 18,407
3 | IP | Transactions on Image Processing | 100 | 16,859
4 | IT | Transactions on Information Theory | 100 | 15,982
5 | WC | Transactions on Wireless Communications | 100 | 15,971
6 | Mat | Advanced Materials | 100 | 6,078
7 | Bot | The Plant Cell | 100 | 19,981
8 | Ling | App. Ling; Journal of Comm; J of Cog. Neurosc. | 100 | 13,587
9 | Eng | Transactions on Industrial Electronics | 100 | 14,569
10 | Med | British Medical Journal | 100 | 29,437
Total | | | 1000 | 168,304
First 100 abstracts of research articles from top-tier journals published from Jan 2012.

Standard abstract (IT)
We study the detection error probability associated with a balanced
binary relay tree, where the leaves of the tree correspond to N
identical and independent sensors. The root of the tree represents a
fusion center that makes the overall detection decision. Each of the
other nodes in the tree is a relay node that combines two binary
messages to form a single output binary message. Only the leaves are
sensors. In this way, the information from the sensors is aggregated
into the fusion center via the relay nodes. In this context, we describe
the evolution of the Type I and Type II error probabilities of the binary
data as it propagates from the leaves toward the root. Tight upper and
lower bounds for the total error probability at the fusion center as
functions of N are derived. These characterize how fast the total error
probability converges to 0 with respect to N, even if the individual
sensors have error probabilities that converge to 1/2.
[IT 120616]

Tagset
Manual annotation using UAM Corpus Tool 2.X and 3.X (O'Donnell, 2015)
This layer of annotation is for rhetorical moves.
There are 5 choices of moves and 6 choices of submoves.
In short, each ontological unit is assigned to one of 9 choices.
The “uncertain” tag is designed as a temporary label.

#IAA theme extraction
Strategic decision points
• Research log was kept using themes, e.g. #meth,
#stats, #IAA
• 142 notes relating to #IAA written between 2012 and 2017 were identified.
• The findings presented are the notes that are the
most important and generalizable to other
projects.

Findings overview:
Three types of strategic decisions affecting IAA
1. Methodological decisions
2. Statistical decisions
3. Rhetorical decisions

Findings (1)
Methodological choices to enhance IAA
A. Ontological unit
B. Tagset size
C. Tag clarity of demarcation
D. Catch-all tags
E. Detailed coding booklet
F. Pre-selection, training and testing
G. Easy-to-use tools
H. Monitoring, feedback and regular meetings
I. Pilot studies and small trials

Finding 1a: Ontological unit
Fixed ontological units (i.e. what you code), e.g. each word or each sentence, simplify the calculation of IAA and increase IAA, since the boundaries of each unit are identical for all annotators.
Variable ontological units give researchers additional choices on how to calculate (manipulate?) IAA: do spans count as agreeing when they are identical, subsumed or crossing over? What unit do you calculate by: character (inc. white space?), letter or word?
Example: “I love you.” vs “I love him.”
• Units: 8 letters, 3 words, 11 characters
• Agreement ratio: 0.62 (letters), 0.67 (words), 0.72 (characters)
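The dependence of the agreement ratio on the counting unit can be reproduced with a few lines of Python. This is a sketch that assumes position-by-position comparison of the two strings; the helper name is illustrative. The slide's figures of 0.62 and 0.72 match 5/8 and 8/11 when truncated to two decimals.

```python
def positional_agreement(units_a, units_b):
    """Share of aligned positions at which the two sequences are identical."""
    if len(units_a) != len(units_b):
        raise ValueError("sequences must have the same number of units")
    matches = sum(x == y for x, y in zip(units_a, units_b))
    return matches / len(units_a)


text_a = "I love you."
text_b = "I love him."

# Three ways of splitting the same strings into units.
letters_a = [c for c in text_a if c.isalpha()]
letters_b = [c for c in text_b if c.isalpha()]
words_a = text_a.rstrip(".").split()
words_b = text_b.rstrip(".").split()
chars_a = list(text_a)   # includes white space and the full stop
chars_b = list(text_b)

by_letter = positional_agreement(letters_a, letters_b)   # 5 of 8 letters match
by_word = positional_agreement(words_a, words_b)         # 2 of 3 words match
by_char = positional_agreement(chars_a, chars_b)         # 8 of 11 characters match
print(f"{by_letter:.3f} {by_word:.3f} {by_char:.3f}")
```

The same single-word disagreement yields three different ratios, so a report that omits the unit leaves the figure uninterpretable.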

Finding 1b: Tagset size
The more tags, the less agreement
Rissanen (1989, as cited in Archer, 2012, n.p.) points out the
“mystery of vanishing reliability”
i.e. the statistical unreliability of annotation that is too detailed.
Obvious with hindsight, but researchers tend to develop tags
that will inform their research rather than result in higher IAA.
1 tag = total agreement (but probably no reason to code)
10 tags = less agreement
100 tags = much less agreement
1000 tags = almost no chance of high IAA

Finding 1c:
Tagset clarity of demarcation
Pilot studies of possible tags and tagsets
Pilot study:
Tagged 100 abstracts using IMRD move and CARS move tags
Difficulty:
1. prevalence of method in IMRD positions
2. demarcation of boundary cases → created SOP, codified in coding booklet
Final selection:
Dropped both sets of tags and selected the IPMPC tagset (Hyland, 2004, p.67)

Finding 1d: Catch-all tags
Tag | Description
Fuzzy | Used when difficult to assign to a tag in the existing tagset
Multiple | Used when more than one tag applies
Portmanteau | Used when the item transcends two tag domains
Problematic | Used when impossible to assign a tag
Archer (2012, n.p.) describes four tag types, all of which
increase IAA by providing easy-to-code options for
boundary cases.
My “uncertain” tag is a catch-all. Calculating IAA
including “uncertain” results in higher IAA.

Finding 1e:
Annotation (coding) booklet
Standard operating procedure
• Guidelines, Rules, Examples, Borderline cases disambiguated

Finding 1f:
Training course and test
Course based on annotation booklet
• Face-to-face and/or online
Test based on annotation booklet
• Serialist tests
• Holistic tests
Qualification cut-off points
• e.g. 90% or above: can start annotating
• e.g. 61-89%: needs additional training
• e.g. 60% or below: discontinue training
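The cut-off points can be written down as an explicit decision rule so that every candidate annotator is treated identically. The sketch below uses the example thresholds from the slide; the function name, the wording of the outcomes and the handling of scores between 60 and 61 are assumptions.

```python
def qualification_decision(score_percent: float) -> str:
    """Map a test score (0-100) to a next step, using the example cut-offs."""
    if score_percent >= 90:
        return "start annotating"
    if score_percent >= 61:
        return "additional training"
    # Assumption: anything below 61 is treated like the 60% case.
    return "discontinue training"


for score in (95, 75, 60):
    print(score, qualification_decision(score))
```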

Finding 1g:
Easy-to-use annotation tools
• Tool and instructions!
• UAM Corpus Tool – help forum in Spanish
• Wrote project-specific instruction booklet for annotators

Finding 1h: monitoring,
feedback and regular meetings
I believe these three aspects led to greater retention of annotators and higher accuracy.
• More monitoring in initial stages (real-time is possible in GATE)
– to identify problems early
• Constructive actionable feedback
– to retain annotator and increase accuracy
• Regular meetings
– annotators who cancelled meetings tended to have a
problem (either with annotation or in their life).
I helped with annotation issues.

Finding 1i: Pilot studies
Various pilot studies and small-scale trials enable the researcher to discover issues and proactively avert potential problems.
• 136 abstracts: SFL annotation of process, participant and circumstance
• 136 abstracts: SFL annotation of sub-categories of circumstance
• 10 abstracts: Multimethod
• 500 abstracts: Lexicogrammatical
• 40 abstracts: Specialist vs linguist IMRaD annotation
• 100 abstracts: Tagset selection (CARS vs IMRaD)
• 3 people: Development of coding booklet
• 10 abstracts: Examples vs. coding booklet
• 2 people: Development of training course
• 500 abstracts: Rhetorical moves using coding booklet by self
• 1000 abstracts: Rhetorical moves using coding booklet by self & annotators
• 2500 abstracts: Rhetorical moves using coding booklet by annotators

Findings (2)
Statistical choices to enhance IAA
A. Cherry-picking population-sample size ratio
B. Random vs systematic
C. Dealing with outliers (annotators)
• Omit [+justify?]; replace with mean [?]
D. Sample selection:
• early vs later coding
• pre-discussion vs. post-discussion
E. Granularity (see next slide)
• Reducing granularity by merging units; fewer
categories, higher agreement

Finding 2e: Granularity
Measures of IAA increase greatly as granularity decreases.
[Figure: granularity scale running from lower IAA to higher IAA]
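The effect of reducing granularity can be illustrated with the five-sentence example from earlier in the deck. In the sketch below, merging the Introduction and Purpose tags into one category is a purely hypothetical choice made for illustration, not a step reported in the study.

```python
# Labels from the five-sentence example earlier in the deck.
coder1 = ["I", "P", "M", "P", "R"]
coder2 = ["P", "P", "M", "R", "R"]

# Hypothetical coarser tagset: Introduction and Purpose merged into one tag.
merge_map = {"I": "I/P", "P": "I/P", "M": "M", "R": "R"}


def observed_agreement(a, b):
    """Share of items on which both coders chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


fine = observed_agreement(coder1, coder2)
coarse = observed_agreement(
    [merge_map[tag] for tag in coder1],
    [merge_map[tag] for tag in coder2],
)
print(f"fine-grained: {fine:.0%}, merged: {coarse:.0%}")
```

No annotation changed between the two figures; only the tagset used for the calculation did, which is why the granularity of the reported IAA needs to be stated.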

Findings (3)
Rhetorical choices to enhance IAA
Claim high IAA with no further details
+ gold standard with no further details and/or
+ provide a simple ratio or percentage and/or
+ provide details of sample size
Rely on vagueness and ambiguity so that the reader infers a higher IAA than was found, or assumes the IAA really is high.

Conclusion
High IAA may be due to
• sound or cogent methodological choices;
but it could also be due to manipulating
• statistical smoke
(i.e. selecting parameters leading to higher IAA)
and
• rhetorical mirrors
(i.e. using vagueness/ambiguity to imply that IAA is high).
In most publications in applied linguistics, sufficient detail is not provided.

Best practice suggestions
• Annotate using tags at one level finer granularity.
• Create annotation booklet with clear rules,
examples and discussion of boundary cases.
• Develop, trial and require all annotators to
complete a training course.
• Set a benchmark standard.
• Monitor and provide constructive actionable
feedback to annotators.
• Report IAA in sufficient detail to convince
skeptical readers.

Beware of the
skeleton in the cupboard
• Researchers aim to
portray their work as
sound or cogent.
• Actual IAA may differ
from reported IAA
• Be wary of statistical
smoke and
rhetorical mirrors

Any questions, suggestions or
comments?
John Blake
jblake@u-aizu.ac.jp
