University of Aizu, Japan
By hook or by crook
• Case study
– Annotation of scientific research abstracts
– Strategic decision points
– Methodological improvements
– Statistical smoke and rhetorical mirrors
Subjectivity in annotation
Vagueness and ambiguity in natural languages
Manning (2011) 97.321 / 10021 = 56.28 %
Automated and manual
                          Automated annotation     Manual annotation
Subjective agent          Software developer       Annotator
Subjective stage          Prior to annotation      During annotation
Replicability             (near) Perfect           Variable
Initial set-up cost       High (if new software)   Low
On-going cost             (near) Zero              High
Scalable                  Yes                      No
Availability of training
Speed                     (near) Instantaneous     Variable
                          Endogeneric              Endo- and exogeneric
Strength                  Grammatical parsing      Semantic parsing
Crucial issue: Are the annotations correct?
We are interested in validity
• Ability to discriminate without error by placing item into appropriate category
But there is no “Ground truth”
• Linguistic categories are determined by human judgement
Implication: We cannot measure correctness directly
So we measure reliability, e.g. reproducibility.
• Intra-annotator reliability
• Inter-annotator reliability
i.e. whether human coders/annotators consistently make the same decisions
Assumption 1: lack of reliability rules out validity (text/training issues)
Assumption 2: high reliability implies validity
Terminology credit (Artstein & Poesio, 2008)
Idea adapted from Boleda & Evert (2009): https://clseslli09.files.wordpress.com/2009/07/02_iaa-
Simple example 1
(abbreviated for length to increase readability)
We address the problem of …… recognition I P
Our aim is to …recognize [x] from [y]. P P
[A] is set up as prior information, and its pose is determined by three parameters, which are [j, k and l].
An efficient local gradient-based method is proposed to …, which is combined into … framework to estimate [V and W] by iterative evolution
It is shown that the local gradient-based method can evaluate accurately and efficiently [V and W].
Observed agreement between 1 and 2 is 60%
IAA measures: Kappa coefficient
Inter-annotator agreement of 60% in previous example, but
chance agreement figure is 20%. Agreement measures must
be corrected for chance agreement (Carletta, 1996).
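Kappa coefficient (Cohen, 1960 for 2 annotators; Fleiss for 2+)
e.g. Corrected measure: $K = \frac{P_A - P_E}{1 - P_E}$
where P_A is the observed agreement and P_E is the agreement expected by chance
1 (agreement) 0 (no correlation) -1 (disagreement)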
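As a rough illustration of chance correction, the sketch below applies the standard kappa form (observed agreement minus chance agreement, divided by one minus chance agreement) to the figures quoted on this slide (60% observed, 20% chance). The two label lists are hypothetical and are not the annotations from the example abstract; the function names are likewise illustrative.

```python
from collections import Counter


def chance_corrected(p_a, p_e):
    """Return (P_A - P_E) / (1 - P_E) for observed and chance agreement."""
    if p_e >= 1.0:
        raise ValueError("chance agreement must be below 1")
    return (p_a - p_e) / (1.0 - p_e)


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    # Observed agreement: share of items given the same label.
    p_a = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_1 = Counter(labels_1)
    freq_2 = Counter(labels_2)
    p_e = sum((freq_1[tag] / n) * (freq_2[tag] / n) for tag in freq_1)
    return chance_corrected(p_a, p_e)


# Figures quoted on the slide: 60% observed, 20% chance agreement.
kappa_from_slide = chance_corrected(0.60, 0.20)

# Hypothetical labels (assumption, for illustration only).
annotator_1 = ["I", "P", "M", "M", "R"]
annotator_2 = ["P", "P", "M", "R", "R"]
kappa_hypothetical = cohen_kappa(annotator_1, annotator_2)

print(round(kappa_from_slide, 2), round(kappa_hypothetical, 2))
```

With the slide's figures the corrected value is 0.5, noticeably lower than the raw 60%.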
Interpretation of Kappa
• Landis and Koch (1977) 0.6-0.79 substantial; 0.8+ almost perfect
• Krippendorff (1980) 0.67-0.79 tentative; 0.8+ good
• Green (1997) 0.4-0.74 fair/good; 0.75+ high
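The same kappa value can attract different verbal labels depending on the scale cited. The sketch below encodes only the bands listed on this slide (values below the lowest listed band are reported as not listed); the function name and the band boundaries as inclusive lower limits are assumptions made for illustration.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the labels of the three scales on the slide."""

    def label(bands):
        # bands: (inclusive lower limit, label), highest band first.
        for lower, name in bands:
            if kappa >= lower:
                return name
        return "below the bands listed"

    return {
        # Landis and Koch's own wording for the top band is "almost perfect".
        "Landis and Koch (1977)": label([(0.8, "almost perfect"), (0.6, "substantial")]),
        "Krippendorff (1980)": label([(0.8, "good"), (0.67, "tentative")]),
        "Green (1997)": label([(0.75, "high"), (0.4, "fair/good")]),
    }


labels_for_070 = interpret_kappa(0.70)
for scale, verdict in labels_for_070.items():
    print(f"{scale}: {verdict}")
```

A kappa of 0.70 reads as "substantial", "tentative" or "fair/good" depending on which scale the author chooses to cite.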
IAA measures: Sophisticated
e.g. Typical measures used in computational linguistics built
into NLP pipelines, such as NLTK and GATE
Rather than measuring agreement alone, we can measure
both agreement and disagreement, e.g. using Measuring
agreement on set-valued items (MASI) and/or Jaccard
distance. Both MASI (Passonneau, 2006) and Jaccard distance
make use of the union and intersection between sets.
Jaccard formula (Jaccard, 1908 cited in Dunn & Everitt, 2004)
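For set-valued tags, the Jaccard coefficient is the size of the intersection divided by the size of the union, and the distance is one minus that ratio. MASI additionally weights the ratio by how the two sets relate (identical, one a subset of the other, overlapping, disjoint). The sketch below is a hand-rolled illustration with hypothetical tag sets; the weights 1, 2/3, 1/3 and 0 follow the usual description of MASI, and NLTK offers comparable functions in `nltk.metrics.distance`.

```python
def jaccard_distance(set_a, set_b):
    """1 - |A intersect B| / |A union B|; 0.0 for two empty sets."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def masi_distance(set_a, set_b):
    """1 - Jaccard similarity * monotonicity weight."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    similarity = len(a & b) / len(union)
    if a == b:
        weight = 1.0          # identical sets
    elif a <= b or b <= a:
        weight = 2.0 / 3.0    # one set subsumes the other
    elif a & b:
        weight = 1.0 / 3.0    # overlap, but neither subsumes the other
    else:
        weight = 0.0          # disjoint sets
    return 1.0 - similarity * weight


# Hypothetical tag sets assigned to one unit by two annotators.
tags_1 = {"Purpose", "Method"}
tags_2 = {"Method"}

print(jaccard_distance(tags_1, tags_2))
print(round(masi_distance(tags_1, tags_2), 3))
```

Because MASI penalises partial matches more heavily than Jaccard, the subsumed case above scores a larger distance under MASI.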
Case study overview
• Moves in scientific research abstracts
• Scientific disciplines
• Core corpus specifications
• Example abstract
• Strategic decision points (tag #IAA extraction)
NB: By convention this far-from-linear study is
presented in a linear fashion when in fact there
were numerous forks, dead-ends and iterations.
Moves in scientific research abstracts
“a discoursal or rhetorical unit that performs a coherent
communicative function in a written or spoken discourse”.
(Swales, 2004, p.228)
Example (very short) abstract
5-move code Introduction Purpose Method Results Discussion
Physical Materials science
Knowledge & data
Core 1000 corpus specifications
No.  Code  Journal name                                    Abstracts    Words
1    EC    Transactions on Evolutionary Computation              100   17,433
2    KDE   Transactions on Knowledge and Data Engineering        100   18,407
3    IP    Transactions on Image Processing                      100   16,859
4    IT    Transactions on Information Theory                    100   15,982
5    WC    Transactions on Wireless Communications               100   15,971
6    Mat   Advanced Materials                                    100    6,078
7    Bot   The Plant Cell                                        100   19,981
8    Ling  App. Ling; Journal of Comm; J of Cog. Neurosc.        100   13,587
9    Eng   Transactions on Industrial Electronics                100   14,569
10   Med   British Medical Journal                               100   29,437
     Total                                                      1000  162,232
First 100 abstracts of research articles from top-tier journals published
from Jan 2012.
We study the detection error probability associated with a balanced
binary relay tree, where the leaves of the tree correspond to N
identical and independent sensors. The root of the tree represents a
fusion center that makes the overall detection decision. Each of the
other nodes in the tree is a relay node that combines two binary
messages to form a single output binary message. Only the leaves are
sensors. In this way, the information from the sensors is aggregated
into the fusion center via the relay nodes. In this context, we describe
the evolution of the Type I and Type II error probabilities of the binary
data as it propagates from the leaves toward the root. Tight upper and
lower bounds for the total error probability at the fusion center as
functions of N are derived. These characterize how fast the total error
probability converges to 0 with respect to N , even if the individual
sensors have error probabilities that converge to 1/2.
Standard abstract (IT)
Manual annotation using UAM Corpus Tool 2.X and 3.X (O'Donnell, 2015)
This layer of annotation is for rhetorical moves.
There are 5 choices of moves and 6 choices of submoves.
In short, each ontological unit is assigned to one of 9 choices.
The “uncertain” tag is designed as a temporary label.
#IAA theme extraction
Strategic decision points
• Research log was kept using themes, e.g. #meth,
• 142 notes relating to #IAA written between 2012 and
2017 were identified.
• The findings presented are the notes that are the
most important and generalizable to other
Three types of strategic decisions affecting IAA
1. Methodological decisions
2. Statistical decisions
3. Rhetorical decisions
Methodological choices to enhance IAA
A. Ontological unit
B. Tagset size
C. Tag clarity of demarcation
D. Catch-all tags
E. Detailed coding booklet
F. Pre-selection, training and testing
G. Easy-to-use tools
H. Monitoring, feedback and regular meetings
I. Pilot studies and small trials
Finding 1a: Ontological unit
Fixed ontological units (i.e. what you code), e.g. each
word, each sentence, simplify calculation of IAA and
increase the IAA since boundaries of each unit are
Variable ontological units provide researchers with
additional choices on how to calculate (manipulate?)
IAA: identical, subsumed, cross-over. Which unit do you
count: character (inc. white space?), letter or word?
I love you. 8 letters, 3 words, 11 characters
I love him. Agreement ratio 0.62, 0.67, 0.72
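The effect of the counting unit can be checked with a short sketch. It compares the two example strings position by position at letter, word and character level; the helper name is illustrative, and the results match the ratios on the slide up to rounding.

```python
def positional_agreement(units_1, units_2):
    """Share of positions at which two equal-length sequences match."""
    if len(units_1) != len(units_2) or not units_1:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(u == v for u, v in zip(units_1, units_2))
    return matches / len(units_1)


text_1 = "I love you."
text_2 = "I love him."

# Letters only: drop white space and punctuation.
letters_1 = [ch for ch in text_1 if ch.isalpha()]
letters_2 = [ch for ch in text_2 if ch.isalpha()]

by_letter = positional_agreement(letters_1, letters_2)                # 5 of 8
by_word = positional_agreement(text_1.split(), text_2.split())        # 2 of 3
by_character = positional_agreement(list(text_1), list(text_2))       # 8 of 11

print(f"{by_letter:.3f} {by_word:.3f} {by_character:.3f}")
```

The same pair of annotations therefore yields three different agreement figures, and the researcher chooses which one to report.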
Finding 1b: Tagset size
The more tags, the less agreement
Rissanen (1989, as cited in Archer, 2012, n.p.) points out the
“mystery of vanishing reliability”
i.e. the statistical unreliability of annotation that is too detailed.
Obvious with hindsight, but researchers tend to develop tags
that will inform their research rather than result in higher IAA.
1 tag = total agreement (but probably no reason to code)
10 tags = less agreement
100 tags = much less agreement
1000 tags = almost no chance of high IAA
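One way to see why agreement falls as the tagset grows is to look at the chance floor. The sketch below assumes a deliberately simple model in which two annotators pick tags uniformly at random, so the expected agreement is one divided by the number of tags. Real annotators are not random, so this is only an illustration of the trend, not an estimate of actual IAA.

```python
def uniform_chance_agreement(tagset_size):
    """Expected agreement of two annotators choosing tags uniformly at random."""
    if tagset_size < 1:
        raise ValueError("tagset must contain at least one tag")
    return 1.0 / tagset_size


# Tagset sizes taken from the list on the slide.
chance_by_size = {size: uniform_chance_agreement(size) for size in (1, 10, 100, 1000)}

for size, chance in chance_by_size.items():
    print(f"{size:>4} tags: chance agreement {chance:.3f}")
```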
Tagset clarity of demarcation
Pilot studies of possible tags and tagsets
Tagged 100 abstracts using IMRD move and CARS move tags
1. prevalence of method in IMRD positions
2. demarcation of boundary cases created SOP, codified in
Dropped both sets of tags and selected Hyland (2004, p.67)
Finding 1d: Catch-all tags
Fuzzy Used when difficult to assign to tag in
Multiple Used when more than one tag applies
Portmanteau Used when item transcends two tag
Problematic Used when impossible to assign tag
Archer (2012, n.p.) describes four tag types, all of which
increase IAA by providing easy-to-code options for
My “uncertain” tag is a catch-all. Calculating IAA
including “uncertain” results in higher IAA.
Training course and test
Course based on annotation booklet
• Face-to-face and/or online
Test based on annotation booklet
• Serialist tests
• Holistic tests
Qualification cut-off points
• e.g. 90% can start annotating
• e.g. 61% needs additional training
• e.g. 60% discontinue training
Easy-to-use annotation tools
• Tool and instructions!
• UAM Corpus Tool – help forum in Spanish
• Wrote project-specific instruction booklet for annotators
Finding 1h: Monitoring, feedback and regular meetings
I believe these three aspects led to greater retention of
annotators and higher accuracy.
• More monitoring in initial stages (real-time is possible in GATE)
– to identify problems early
• Constructive actionable feedback
– to retain annotator and increase accuracy
• Regular meetings
– annotators who cancelled meetings tended to have a
problem (either with annotation or in their life).
I helped with annotation issues.
Finding 1i: Pilot studies
Various pilot studies and small-scale trials.
Enables researcher to discover issues and proactively avert potential problems
• 136 abstracts SFL annotation of process, participant and circumstance
• 136 abstracts SFL annotation of sub-categories of circumstance
• 10 abstracts Multimethod
• 500 abstracts Lexicogrammatical
• 40 abstracts Specialist vs linguist IMRaD annotation
• 100 abstracts Tagset selection (CARS vs IMRaD)
• 3 people Development of Coding booklet
• 10 abstracts Examples vs. Coding booklet
• 2 people Development of training course
• 500 abstracts Rhetorical moves using coding booklet by self
• 1000 abstracts Rhetorical moves using coding booklet by self & annotators
• 2500 abstracts Rhetorical moves using coding booklet by annotators
Statistical choices to enhance IAA
A. Cherry-picking population-sample size ratio
B. Random vs systematic
C. Dealing with outliers (annotators)
• Omit [+justify?]; replace with mean [?]
D. Sample selection:
• early vs later coding
• pre-discussion vs. post-discussion
E. Granularity (see next slide)
• Reducing granularity by merging units; fewer
categories, higher agreement
Finding 2e: Granularity
Measures of IAA increase greatly as granularity decreases
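The effect of merging categories can be sketched as follows. The tag names are hypothetical (a move letter followed by a submove suffix) and are not the tagset used in the study; merging simply strips the suffix so that submoves collapse into their parent move.

```python
def observed_agreement(labels_1, labels_2):
    """Share of units to which both annotators gave the same tag."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_1, labels_2)) / len(labels_1)


def merge_submoves(labels):
    """Collapse hypothetical 'M-a' style submove tags into the move 'M'."""
    return [tag.split("-")[0] for tag in labels]


# Hypothetical fine-grained annotations of five units.
fine_1 = ["P", "M-a", "M-b", "R-a", "D"]
fine_2 = ["P", "M-b", "M-b", "R-b", "D"]

fine_agreement = observed_agreement(fine_1, fine_2)
coarse_agreement = observed_agreement(merge_submoves(fine_1), merge_submoves(fine_2))

print(fine_agreement, coarse_agreement)
```

In this toy case every disagreement sits at submove level, so merging lifts observed agreement from 0.6 to 1.0 without any change in the underlying annotations.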
Rhetorical choices to enhance IAA
Claim high IAA with no further details
+ gold standard with no further details and/or
+ provide a simple ratio or percentage and/or
+ provide details of sample size
Rely on vagueness and ambiguity so that the reader infers
a higher IAA than was found, or infers that IAA was high.
High IAA may be due to
• sound or cogent methodological choices;
but it could also be due to manipulating the
• statistical smoke
(i.e. selecting parameters leading to higher IAA)
• rhetorical mirrors.
(i.e. using vagueness/ambiguity to imply that IAA is high)
In most publications in applied linguistics, sufficient
detail is not provided.
Best practice suggestions
• Annotate using tags at one level finer granularity.
• Create annotation booklet with clear rules,
examples and discussion of boundary cases.
• Develop, trial and require all annotators to
complete a training course.
• Set a benchmark standard.
• Monitor and provide constructive actionable
feedback to annotators.
• Report IAA in sufficient detail to convince
Beware of the
skeleton in the cupboard
• Researchers aim to
portray their work as
sound or cogent.
• Actual IAA may differ
from reported IAA
• Be wary of statistical
Any questions, suggestions or