
Interannotator Agreement


Interannotator Agreement: By hook or by crook
Presentation for Asia Pacific Corpus Linguistics Conference 2018


  1. Inter-annotator agreement: By hook or by crook (John Blake, University of Aizu, Japan)
  2. Overview
     • Background
     • Case study
       – Annotation of scientific research abstracts
       – Strategic decision points
     • Findings
       – Methodological improvements
       – Statistical smoke and rhetorical mirrors
     • Conclusions
  3. Subjectivity in annotation
     Problem: vagueness and ambiguity in natural languages.
     Annotation tasks range from low subjectivity (POS tagging, phonetic transcription, etc.) to high subjectivity (speaker intuition, e.g. discourse annotation, pragmatics, etc.); guidelines range from basic annotation guidelines to annotation guidelines with discussion of boundary cases.
     Even POS tagging is imperfect: Manning (2011) notes that 97.3% per-token accuracy means 0.973^21 = 56.28% of 21-token sentences are tagged entirely correctly.
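The 56.28% figure attributed to Manning (2011) follows from simple compounding of the per-token accuracy over an average-length sentence, here assumed to be 21 tokens. A quick check:

```python
# Manning (2011): 97.3% per-token POS-tagging accuracy still leaves
# only ~56% of 21-token sentences tagged entirely correctly.
token_accuracy = 0.973
sentence_length = 21  # assumed average sentence length in tokens
sentence_accuracy = token_accuracy ** sentence_length
print(f"{sentence_accuracy:.2%}")  # 56.28%
```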
  4. Automated and manual annotation compared
     Aspect                Automated annotation            Manual annotation
     Subjective agent      Software developer              Annotator
     Subjective stage      Prior to annotation             During annotation
     Replicability         (near) Perfect                  Variable
     Initial set-up cost   High (if new software)          Low
     On-going cost         (near) Zero                     High
     Scalable              Yes                             No
     Dependent condition   Availability of training set    Availability of annotators (contingent on time/money)
     Speed                 (near) Instantaneous            Variable
     Factors considered    Endogeneric                     Endo- and exogeneric
     Strength              Grammatical parsing             Semantic parsing
  5. Inter-annotator agreement
     Crucial issue: are the annotations correct? We are interested in validity:
     • the ability to discriminate without error by placing each item into the appropriate category.
     But there is no "ground truth":
     • linguistic categories are determined by human judgement.
     Implication: we cannot measure correctness directly, so we measure reliability, e.g. reproducibility:
     • intra-annotator reliability
     • inter-annotator reliability
     i.e. whether human coders/annotators consistently make the same decisions.
     Assumption 1: lack of reliability rules out validity (text/training issues).
     Assumption 2: high reliability implies validity.
     Terminology credit: Artstein & Poesio (2008). Idea adapted from Boldea & Evert (2009).
  6. Simple example
     (sentences abbreviated for length to increase readability)
     Sentence                                                   Coder 1   Coder 2   Agreement
     We address the problem of ... recognition.                 I         P         ✗
     Our aim is to ... recognize [x] from [y].                  P         P         ✓
     [A] is set up as prior information, and its pose is
     determined by three parameters, which are [j, k and l].    M         M         ✓
     An efficient local gradient-based method is proposed
     to ..., which is combined into ... framework to
     estimate [V and W] by iterative evolution.                 P         R         ✗
     It is shown that the local gradient-based method can
     evaluate accurately and efficiently [V and W].             R         R         ✓
     Observed agreement between Coder 1 and Coder 2 is 3/5 = 60%.
  7. IAA measures: Kappa coefficient
     Inter-annotator agreement was 60% in the previous example, but the chance agreement figure is 20%.
     Agreement measures must be corrected for chance agreement (Carletta, 1996).
     Kappa coefficient (Cohen, 1960 for two coders; Fleiss, 1971 for more than two):
     κ = (P(A) − P(E)) / (1 − P(E))
     where P(A) is observed agreement and P(E) is expected (chance) agreement.
     κ = 1 (perfect agreement), 0 (chance-level agreement), −1 (perfect disagreement).
     Interpretation of kappa:
     • Landis and Koch (1977): 0.61-0.80 substantial; 0.81+ almost perfect
     • Krippendorff (1980): 0.67-0.79 tentative; 0.8+ good
     • Green (1997): 0.40-0.74 fair/good; 0.75+ high
  8. IAA measures: Sophisticated
     Typical measures used in computational linguistics are built into NLP pipelines such as NLTK and GATE.
     Rather than measuring agreement alone, we can measure both agreement and disagreement, e.g. using Measuring Agreement on Set-valued Items (MASI) and/or Jaccard distance.
     Both MASI (Passonneau, 2006) and Jaccard distance make use of the union and intersection between sets.
     The Jaccard formula (Jaccard, 1908, cited in Dunn & Everitt, 2004) is:
     J(A, B) = |A ∩ B| / |A ∪ B|, with Jaccard distance = 1 − J(A, B)
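A minimal sketch of both distances for set-valued annotations (NLTK's `nltk.metrics` module also ships `jaccard_distance` and `masi_distance`). The example labels are hypothetical, not data from the study:

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical sets, 1 for disjoint sets."""
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0
    return 1 - len(a & b) / len(a | b)

def masi_distance(a, b):
    """MASI (Passonneau, 2006): Jaccard weighted by a monotonicity factor."""
    a, b = set(a), set(b)
    if a == b:
        m = 1.0      # identical sets
    elif a <= b or b <= a:
        m = 2 / 3    # one set subsumes the other
    elif a & b:
        m = 1 / 3    # overlap, but neither subsumes the other
    else:
        m = 0.0      # disjoint sets
    return 1 - m * (1 - jaccard_distance(a, b))

# Coder 1 tags a unit {Purpose, Method}; coder 2 tags {Purpose, Method, Results}
d_j = jaccard_distance({"P", "M"}, {"P", "M", "R"})  # 1 - 2/3 = 1/3
d_m = masi_distance({"P", "M"}, {"P", "M", "R"})     # 1 - (2/3)(2/3) = 5/9
```

MASI penalises partial overlap more heavily than Jaccard alone, which is why it is favoured for set-valued coding decisions.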
  9. Case study overview
     • Moves in scientific research abstracts
     • Scientific disciplines
     • Core corpus specifications
     • Example abstract
     • Tagset
     • Strategic decision points (tag #IAA extraction)
     NB: By convention this far-from-linear study is presented in a linear fashion, when in fact there were numerous forks, dead-ends and iterations.
  10. Moves in scientific research abstracts
      Move definition: "a discoursal or rhetorical unit that performs a coherent communicative function in a written or spoken discourse" (Swales, 2004, p. 228).
      Move sequences: an example (very short) abstract coded with the 5-move scheme: Introduction, Purpose, Method, Results, Discussion.
  11. Scientific disciplines
      Science
      • Fundamental
        – Empirical
          · Natural: physical (materials science); life (botany)
          · Social: linguistics
        – Theoretical
          · Formal: information theory
      • Applied
        – Engineering: evolutionary computation, knowledge & data engineering, image processing, wireless computing, electronic engineering
        – Healthcare: medical
  12. Core 1000 corpus specifications
      Code     Journal name                                      # abstracts   # words
      1  EC    Transactions on Evolutionary Computation              100        17,433
      2  KDE   Transactions on Knowledge and Data Engineering        100        18,407
      3  IP    Transactions on Image Processing                      100        16,859
      4  IT    Transactions on Information Theory                    100        15,982
      5  WC    Transactions on Wireless Communications               100        15,971
      6  Mat   Advanced Materials                                    100         6,078
      7  Bot   The Plant Cell                                        100        19,981
      8  Ling  App. Ling.; Journal of Comm.; J. of Cog. Neurosc.     100        13,587
      9  Eng   Transactions on Industrial Electronics                100        14,569
      10 Med   British Medical Journal                               100        29,437
      Total                                                         1000       162,232
      First 100 abstracts of research articles from top-tier journals published from Jan 2012.
  13. Standard abstract (IT)
      We study the detection error probability associated with a balanced binary relay tree, where the leaves of the tree correspond to N identical and independent sensors. The root of the tree represents a fusion center that makes the overall detection decision. Each of the other nodes in the tree is a relay node that combines two binary messages to form a single output binary message. Only the leaves are sensors. In this way, the information from the sensors is aggregated into the fusion center via the relay nodes. In this context, we describe the evolution of the Type I and Type II error probabilities of the binary data as it propagates from the leaves toward the root. Tight upper and lower bounds for the total error probability at the fusion center as functions of N are derived. These characterize how fast the total error probability converges to 0 with respect to N, even if the individual sensors have error probabilities that converge to 1/2. [IT 120616]
  14. Tagset
      Manual annotation using UAM Corpus Tool 2.x and 3.x (O'Donnell, 2015).
      This layer of annotation is for rhetorical moves. There are 5 choices of move and 6 choices of submove; in short, each ontological unit is assigned one of 9 choices. The "uncertain" tag is designed as a temporary label.
  15. #IAA theme extraction: strategic decision points
      • A research log was kept using themes, e.g. #meth, #stats, #IAA.
      • 142 notes relating to #IAA written between 2012 and 2017 were identified.
      • The findings presented are the notes that are most important and generalizable to other projects.
  16. Findings overview: three types of strategic decisions affecting IAA
      1. Methodological decisions
      2. Statistical decisions
      3. Rhetorical decisions
  17. Findings (1): Methodological choices to enhance IAA
      A. Ontological unit
      B. Tagset size
      C. Tag clarity of demarcation
      D. Catch-all tags
      E. Detailed coding booklet
      F. Pre-selection, training and testing
      G. Easy-to-use tools
      H. Monitoring, feedback and regular meetings
      I. Pilot studies and small trials
  18. Finding 1a: Ontological unit
      Fixed ontological units (i.e. what you code), e.g. each word, each sentence, simplify calculation of IAA and increase IAA, since the boundaries of each unit are identical.
      Variable ontological units provide researchers with additional choices on how to calculate (manipulate?) IAA: identical, subsumed, cross-over.
      How do you calculate: by character (including white space?), by letter, by word, by what unit?
      Example: "I love you." vs "I love him." (8 letters, 3 words, 11 characters)
      Agreement ratio: 0.62 by letter, 0.67 by word, 0.72 by character
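One plausible reading of the "I love you." / "I love him." example is that the three ratios come from position-wise comparison at three granularities: letters only (5/8), words (2/3), and characters including spaces and punctuation (8/11). A sketch under that assumption:

```python
def agreement_ratio(a, b):
    """Position-wise agreement between two sequences of the same length."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

s1, s2 = "I love you.", "I love him."
by_letter = agreement_ratio([c for c in s1 if c.isalpha()],
                            [c for c in s2 if c.isalpha()])  # 5/8  = 0.625
by_word = agreement_ratio(s1.split(), s2.split())            # 2/3  ≈ 0.667
by_char = agreement_ratio(s1, s2)                            # 8/11 ≈ 0.727
```

The three figures round to the 0.62 / 0.67 / 0.72 on the slide, illustrating how the choice of ontological unit alone moves the reported agreement.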
  19. Finding 1b: Tagset size
      The more tags, the less agreement.
      Rissanen (1989, as cited in Archer, 2012, n.p.) points out the "mystery of vanishing reliability", i.e. the statistical unreliability of annotation that is too detailed.
      Obvious with hindsight, but researchers tend to develop tags that will inform their research rather than result in higher IAA.
      • 1 tag = total agreement (but probably no reason to code)
      • 10 tags = less agreement
      • 100 tags = much less agreement
      • 1000 tags = almost no chance of high IAA
  20. Finding 1c: Tagset clarity of demarcation
      Pilot studies of possible tags and tagsets.
      Pilot study: tagged 100 abstracts using IMRD move and CARS move tags.
      Difficulties:
      1. prevalence of method in IMRD positions
      2. demarcation of boundary cases, which led to creating an SOP, codified in the coding booklet
      Final selection: dropped both sets of tags and selected Hyland's (2004, p. 67) IPMPC tagset.
  21. Finding 1d: Catch-all tags
      Archer (2012, n.p.) describes four tag types, all of which increase IAA by providing easy-to-code options for boundary cases:
      Tag           Description
      Fuzzy         Used when it is difficult to assign a tag in the existing tagset
      Multiple      Used when more than one tag applies
      Portmanteau   Used when an item transcends two tag domains
      Problematic   Used when it is impossible to assign a tag
      My "uncertain" tag is a catch-all. Calculating IAA including "uncertain" results in higher IAA.
  22. Finding 1e: Annotation (coding) booklet
      A standard operating procedure: guidelines, rules, examples, and borderline cases disambiguated.
  23. Finding 1f: Training course and test
      Course based on annotation booklet
      • face-to-face and/or online
      Test based on annotation booklet
      • serialist tests
      • holistic tests
      Qualification cut-off points, e.g.:
      • 90%+: can start annotating
      • 61-89%: needs additional training
      • 60% or below: discontinue training
  24. Finding 1g: Easy-to-use annotation tools
      • Provide the tool and instructions!
      • UAM Corpus Tool: its help forum is in Spanish
      • Wrote a project-specific instruction booklet for annotators
  25. Finding 1h: Monitoring, feedback and regular meetings
      These three aspects, I believe, led to greater retention of annotators and higher accuracy.
      • More monitoring in the initial stages (real-time monitoring is possible in GATE), to identify problems early
      • Constructive, actionable feedback, to retain annotators and increase accuracy
      • Regular meetings: annotators who cancelled meetings tended to have a problem (either with the annotation or in their life). I helped with annotation issues.
  26. Finding 1i: Pilot studies
      Various pilot studies and small-scale trials enable the researcher to discover issues and proactively avert potential problems:
      • 136 abstracts: SFL annotation of process, participant and circumstance
      • 136 abstracts: SFL annotation of sub-categories of circumstance
      • 10 abstracts: multimethod
      • 500 abstracts: lexicogrammatical
      • 40 abstracts: specialist vs linguist IMRaD annotation
      • 100 abstracts: tagset selection (CARS vs IMRaD)
      • 3 people: development of coding booklet
      • 10 abstracts: examples vs. coding booklet
      • 2 people: development of training course
      • 500 abstracts: rhetorical moves using coding booklet, by self
      • 1000 abstracts: rhetorical moves using coding booklet, by self & annotators
      • 2500 abstracts: rhetorical moves using coding booklet, by annotators
  27. Findings (2): Statistical choices to enhance IAA
      A. Cherry-picking the population-to-sample size ratio
      B. Random vs systematic sampling
      C. Dealing with outliers (annotators): omit [+ justify?]; replace with mean [?]
      D. Sample selection: early vs later coding; pre-discussion vs post-discussion
      E. Granularity (see next slide): reducing granularity by merging units; fewer categories, higher agreement
  28. Finding 2e: Granularity
      Measures of IAA increase greatly as granularity decreases.
      [Figure: scale from lower IAA (fine-grained categories) to higher IAA (coarse-grained categories)]
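To illustrate the mechanism with hypothetical submove labels (not data from the study): merging fine-grained submoves into their parent moves raises observed agreement mechanically, because coders who disagree on the submove often still agree on the move:

```python
def agreement(a, b):
    """Proportion of items on which two coders assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical fine-grained submove labels from two coders
c1 = ["P1", "P2", "M1", "M2", "R1", "R2"]
c2 = ["P2", "P2", "M2", "M2", "R1", "R1"]

def merge(tags):
    # Collapse submoves (P1, P2, ...) into their parent moves (P, M, R)
    return [t[0] for t in tags]

fine = agreement(c1, c2)                   # 3/6 = 0.5 at submove level
coarse = agreement(merge(c1), merge(c2))   # 6/6 = 1.0 at move level
```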
  29. Findings (3): Rhetorical choices to enhance IAA
      • Claim high IAA with no further details
      • Claim a gold standard with no further details
      and/or
      • Provide only a simple ratio or percentage
      • Provide only details of sample size
      Rely on vagueness and ambiguity to allow the reader to infer a higher IAA than was found, or that the actual IAA is high.
  30. Conclusion
      High IAA may be due to sound or cogent methodological choices, but it could also be due to manipulating:
      • statistical smoke (i.e. selecting parameters that lead to higher IAA), and
      • rhetorical mirrors (i.e. using vagueness/ambiguity so readers infer IAA is high).
      In most publications in applied linguistics, sufficient detail is not provided.
  31. Best practice suggestions
      • Annotate using tags at one level of granularity finer.
      • Create an annotation booklet with clear rules, examples and discussion of boundary cases.
      • Develop, trial and require all annotators to complete a training course.
      • Set a benchmark standard.
      • Monitor and provide constructive, actionable feedback to annotators.
      • Report IAA in sufficient detail to convince skeptical readers.
  32. Beware of the skeleton in the cupboard
      • Researchers aim to portray their work as sound or cogent.
      • Actual IAA may differ from reported IAA.
      • Be wary of statistical smoke and rhetorical mirrors.
  33. Any questions, suggestions or comments? John Blake