Segmentation Similarity and Agreement

We propose a new segmentation evaluation metric, called segmentation similarity (S), that quantifies the similarity between two segmentations as the proportion of boundaries that are not transformed when comparing them using edit distance, essentially using edit distance as a penalty function and scaling penalties by segmentation size. We propose several adapted inter-annotator agreement coefficients that use S and are suitable for segmentation. We show that S is configurable enough to suit a wide variety of segmentation evaluations, and is an improvement upon the state of the art. We also propose using inter-annotator agreement coefficients to evaluate automatic segmenters in terms of human performance.

For more information, view the paper and software at:
http://nlp.chrisfournier.ca

  1. Segmentation Similarity and Agreement: A metric for evaluating automatic and human segmenters. Chris Fournier and Diana Inkpen, School of Electrical Engineering and Computer Science, University of Ottawa. June 4, 2012
  2. What is segmentation? (Introduction)
     Figure: Baker (1990, pp. 76–77)
  3. What is segmentation? (Introduction)
     Par.    Topic
     1–3     Intro: the search for life in space
     4–5     The moon's chemical composition
     6–8     How early earth-moon proximity shaped the moon
     9–12    How the moon helped life evolve on earth
     13      Improbability of the earth-moon system
     14–16   Binary/trinary star systems make life unlikely
     17–18   The low probability of nonbinary/trinary systems
     19–20   Properties of earth's sun that facilitate life
     21      Summary
     Figure: hypothetical segmentation (Hearst 1997, p. 33)
  4. Why do we segment? (Introduction)
     To model topical shifts, aiding:
       - Video and audio retrieval (Franz et al. 2007)
       - Question answering (Oh et al. 2007)
       - Subjectivity analysis (Stoyanov & Cardie 2008)
       - Automatic summarization (Haghighi & Vanderwende 2009)
  5. Types of segmentation (Introduction)
     [Figure: a linear segmentation (a flat sequence of segment masses)
     contrasted with a hierarchical segmentation (nested segments)]
  6. Automatic segmentation (Introduction)
     Many automatic segmenters exist:
       - TextTiling (Hearst 1997)
       - Minimum Cut segmenter (Malioutov & Barzilay 2006)
       - Bayesian segmenter (Eisenstein & Barzilay 2008)
       - Affinity Propagation for Segmentation (Kazantseva & Szpakowicz 2011)
  7. Problem: selecting a segmenter (Introduction)
     How do we select the best performing segmenter for a task?
       - Ideally, evaluate performance in situ: evaluate end-task performance
         while varying segmenters
       - Attain ecological validity: ". . . the ability of experiments to tell
         us how real people operate in the real world" (Cohen 1995, p. 102);
         for an example study, see McCallum et al. (2012)
       - But this is time consuming and expensive
  8. Problem: selecting a segmenter (Introduction)
     How do we less expensively select the best performing segmenter for a task?
       1. Identify or collect manual segmentations
       2. Verify their reliability
       3. Train an automatic segmenter
       4. Compare automatic and manual segmentations using a metric
  9. Focus (Introduction)
     We focus on comparing segmentations to evaluate:
       - The reliability of manual segmentations
       - The performance of automatic segmenters
 10. Why is this comparison difficult? (Difficulty)
     Difficulty arises because:
       - There is no one "true" segmentation
       - Manual agreement is low (Hearst 1997)
       - Coders disagree on granularity (Pevzner & Hearst 2002)
       - There are few boundaries to agree upon (Hearst 1993, p. 6)
       - Near misses often occur between boundaries
 11. No one "true" segmentation (Difficulty)
     [Figure: 7 manual codings collected by Hearst (1997) of
     Stargazers Look for Life (Baker 1990)]
 12. Near misses (Difficulty)
     [Figure: counts of full vs. near misses (0–1,500) as the distance
     considered a near miss grows from 0 to 20 PBs; S applied to the
     Kazantseva & Szpakowicz (2012) data]
 13. Existing evaluation metrics (Evaluation Metrics)
     Existing segmentation evaluation metrics:
       - Precision, Recall, F_beta-measure: do not discount near misses
       - Pk (Beeferman & Berger 1999): window-based near-miss accounting,
         but not stable (Pevzner & Hearst 2002)
       - WindowDiff (Pevzner & Hearst 2002): a substantial modification of Pk;
         more stable (Pevzner & Hearst 2002)
 14. Stability & internal segment sizes (Evaluation Metrics)
     [Figure: metric value (0.6–1.0) over 10 trials of 100 segmentations
     with FP & FN p = 0.5, for internal segment sizes (20,30), (15,35),
     (10,40), (5,45); series: S and 1 − WD]
 15. Common failings (Evaluation Metrics)
     Existing segmentation evaluation metrics:
       - Require one "true" reference
       - Cannot use multiple manual codings
       - Cannot be adapted for agreement
       - Pairwise means must be permuted, because WD(s1, s2) ≠ WD(s2, s1)
 16. A new metric: S (Segmentation Similarity)
     Segmentation Similarity (S):
       - A new boundary edit distance: edit distance used to penalize error
       - Scales and normalizes penalties in relation to segment mass
     S is ideal because it is:
       - A minimum edit distance (stable)
       - Symmetric (no "true" segmentation required)
       - Highly configurable
 17. Parameters (Segmentation Similarity)
     S has three parameters:
       - n: the number of PBs considered a near miss (default is 2)
       - TE (y/n): whether to use transposition error scaling (default is yes)
       - Weights upon error types to reduce their severity (default is 1 PB each)
 18. Mass and potential boundaries (Segmentation Similarity)
     Segmentations have:
       - Potential boundaries (PBs) separating units
       - Mass, measured in units
       - Types of boundaries
     Figure: a 6-unit sequence annotated as segmentation masses (1, 3, 2);
     a sketch of this representation follows.
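     To make the mass representation concrete, here is a minimal Python
     sketch (the helper and its name are illustrative, not taken from the
     authors' software) converting segment masses into the PB positions
     that hold a boundary:

         def masses_to_boundaries(masses):
             """Return the set of potential-boundary (PB) positions holding a
             boundary, for a segmentation given as segment masses."""
             positions, total = set(), 0
             for mass in masses[:-1]:  # no boundary follows the final segment
                 total += mass
                 positions.add(total)
             return positions

         # The 6-unit example above, masses (1, 3, 2): 5 PBs in total,
         # with boundaries placed at positions 1 and 4.
         print(masses_to_boundaries((1, 3, 2)))  # {1, 4}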
 19. Modelling dissimilarity (Segmentation Similarity)
     Linear segmentation errors can be modeled as edit operations at PB
     positions: an n-wise transposition at position 1 (a near miss), and
     substitutions at positions 2, 3, and 4 (FP/FN full misses).
     Figure: types of segmentation errors between s1 and s2
 20. Normalization (Segmentation Similarity)

     S(s_i1, s_i2) = (mass(i) − 1 − d(s_i1, s_i2)) / (mass(i) − 1)
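     Once the boundary edit distance d has been computed (the edit distance
     itself is the substance of the metric and is not reproduced here), the
     normalization is a one-liner; a sketch:

         def s_from_edit_distance(mass, d):
             """S = (mass - 1 - d) / (mass - 1): the proportion of the
             mass - 1 potential boundaries left untransformed by edits."""
             pbs = mass - 1
             return (pbs - d) / pbs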
 21. Calculating similarity (Segmentation Similarity)
     From the previous example: 4 edits (3 substitutions and 1 transposition)
     over 14 units of mass:

     S(s_i1, s_i2) = (14 − 1 − 4) / (14 − 1) = 9/13 = 0.6923

     For comparison, 1 − WD = 0.6154.
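     Plugging the slide's numbers into the sketch above reproduces the value:

         print(s_from_edit_distance(mass=14, d=4))  # 0.6923..., i.e. 9/13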
 22. Near misses (Segmentation Similarity)
     S can scale near misses by the number of PBs spanned:

     te(n, b) = b − (1/b)^(n−2),  where n ≥ 2 and b > 0

     Example: s1 with masses (6, 8) vs. s2 with masses (7, 7) gives
     S = 0.9231 (for comparison, 1 − WD = 0.8182).
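     A direct transcription of the slide's scaling function (reading b as the
     error weight in PBs, which is my interpretation of the slide):

         def te(n, b):
             """te(n, b) = b - (1/b)**(n - 2), for n >= 2 and b > 0; discounts
             an n-wise transposition relative to counting it as full misses."""
             assert n >= 2 and b > 0
             return b - (1.0 / b) ** (n - 2)

         print(te(2, 2))  # 1.0: consistent with the transposition costing
                          # a single edit in the worked example above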
 23. Increasing near miss span size (Segmentation Similarity)
     [Figure: metric value (0.7–1.0) vs. difference in boundary position
     (0–10 units) for 1 − WD, S(n = 3), S(n = 5, scaled), and
     S(n = 5, w_trp = 0)]
 24. Reliability of manual codings (Segmentation Agreement)
     How do we verify manual reliability? Inter-coder agreement coefficients
     κ, π, κ*, and π*, each of the form

     (A_a − A_e) / (1 − A_e)

     adapted to use Segmentation Similarity: κ_S, π_S, κ*_S, and π*_S.
     Notes: Fleiss's multi-π (π*) is Siegel & Castellan's (1988) κ;
     the formulations from Artstein & Poesio (2008) are used.
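     All four coefficients share the chance-corrected form above; a minimal
     sketch (computing A_a and A_e differs per coefficient and is assumed to
     be supplied, with the S-based variants deriving agreement from S):

         def chance_corrected(a_actual, a_expected):
             """Common form of kappa and pi: (A_a - A_e) / (1 - A_e)."""
             return (a_actual - a_expected) / (1.0 - a_expected)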
 25. Categories (Segmentation Agreement)
     Calculate A_e using one category per boundary type t:
     boundary presence (K = {seg_t | t ∈ T}).
     Why? Coders either place a boundary or not; they do not place
     non-boundaries. We desire boundary agreement: "unsure" and "no choice"
     are not options, and the default is no boundary placement.
 26. Examples of manual codings (Multiply-Coded Corpora)
     Linear multiply-coded segmentations:
       - Kazantseva & Szpakowicz (2012): The Moonstone by Wilkie Collins,
         topically segmented by 4–6 coders at the paragraph level
       - Hearst (1997): Stargazers Look for Life by Dan Baker,
         topically segmented by 7 coders at the paragraph level
 27. Overall agreement (Multiply-Coded Corpora)

     Kazantseva & Szpakowicz (2012):
       Mean coder group π*_S    0.8923 ± 0.0377
       Mean S                   0.8885 ± 0.0662
     Hearst (1997):
       π*_S                     0.7514
       Mean S                   0.7619 ± 0.0706
 28. Overall error types (Multiply-Coded Corpora)

     Misses                            Full    Near
     Kazantseva & Szpakowicz (2012)    1039     212
     Hearst (1997)                       72      28

     [Figure: substitutions, transpositions, and PBs without error for
     K&S (2012) and Hearst (1997)]
 29. Comparing segmenters (Evaluation)
     How can we compare automatic segmenters?
       - Pairwise mean S with the manual codings, e.g. mean(S1, S2, S3)
         over three coders (sketched below)
       - Statistical hypothesis testing
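     A sketch of the first option, assuming a function s(a, b) implementing
     the metric (hypothetical signature):

         from statistics import mean

         def mean_pairwise_s(auto_seg, manual_codings, s):
             """Mean S between one automatic segmentation and each manual
             coding; these means can feed a statistical hypothesis test."""
             return mean(s(auto_seg, coding) for coding in manual_codings)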
 30. Comparing segmenters (Evaluation)
     How can we compare automatic segmenters? Differences in agreement
     (sketched below):
       1. Calculate manual coder agreement: π*_{S,3M}
       2. Recalculate agreement after adding an automatic segmenter's
          codings: π*_{S,3M,1A}
       3. Compare the two agreement values
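     The same comparison, sketched with a hypothetical pi_star_s function
     computing multi-coder agreement over a list of segmentations:

         def agreement_difference(pi_star_s, manual_codings, auto_seg):
             """Drop in agreement when the automatic segmenter joins the coder
             pool; a small drop suggests near-human performance."""
             baseline = pi_star_s(manual_codings)                 # pi*_{S,3M}
             with_auto = pi_star_s(manual_codings + [auto_seg])   # pi*_{S,3M,1A}
             return baseline - with_auto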
 31. Summary (Conclusion)
     Segmentation Similarity (S):
       - Stable, unlike window metrics
       - Highly configurable
       - Gives detailed error information
       - Mean values can be used to perform statistical hypothesis tests
     Adapted inter-annotator agreement:
       - Quantifies manual agreement & reliability
       - Compares automatic segmenters in terms of human performance
 32. Future work & Implementation (Conclusion)
     Future work:
       - Multiple boundary types; and
       - Hierarchical segmentation.
     Software implementation: http://nlp.chrisfournier.ca/
 33. References I
     Artstein, R. & Poesio, M. (2008), 'Inter-coder agreement for
       computational linguistics', Computational Linguistics 34(4), 555–596.
     Baker, D. (1990), 'Stargazers look for life', South Magazine 117, 76–77.
     Beeferman, D. & Berger, A. (1999), 'Statistical models for text
       segmentation', Machine Learning 34(1–3), 177–210.
     Cohen, P. R. (1995), Empirical Methods for Artificial Intelligence,
       MIT Press, Cambridge, MA, USA.
     Eisenstein, J. & Barzilay, R. (2008), Bayesian unsupervised topic
       segmentation, in 'Proceedings of the Conference on Empirical Methods
       in Natural Language Processing', Association for Computational
       Linguistics, Morristown, NJ, USA, pp. 334–343.
 34. References II
     Franz, M., McCarley, J. S. & Xu, J.-M. (2007), User-oriented text
       segmentation evaluation measure, in 'Proceedings of the 30th Annual
       International ACM SIGIR Conference on Research and Development in
       Information Retrieval', pp. 701–702.
     Haghighi, A. & Vanderwende, L. (2009), Exploring content models for
       multi-document summarization, in 'Proceedings of Human Language
       Technologies: The 2009 Annual Conference of the North American Chapter
       of the Association for Computational Linguistics', NAACL '09,
       Association for Computational Linguistics, Stroudsburg, PA, USA,
       pp. 362–370.
     Hearst, M. A. (1993), TextTiling: A quantitative approach to discourse,
       Technical report.
     Hearst, M. A. (1997), 'TextTiling: segmenting text into multi-paragraph
       subtopic passages', Computational Linguistics 23(1), 33–64.
 35. References III
     Kazantseva, A. & Szpakowicz, S. (2011), Linear text segmentation using
       affinity propagation, in 'Proceedings of the 2011 Conference on
       Empirical Methods in Natural Language Processing', Association for
       Computational Linguistics, Edinburgh, Scotland, UK, pp. 284–293.
     Kazantseva, A. & Szpakowicz, S. (2012), Topical segmentation: a study of
       human performance, in 'Proceedings of Human Language Technologies: The
       2012 Annual Conference of the North American Chapter of the Association
       for Computational Linguistics (HLT '12)', Association for Computational
       Linguistics.
     Malioutov, I. & Barzilay, R. (2006), Minimum cut model for spoken lecture
       segmentation, in 'Proceedings of the 21st International Conference on
       Computational Linguistics and the 44th Annual Meeting of the
       Association for Computational Linguistics', ACL-44, Association for
       Computational Linguistics, Stroudsburg, PA, USA, pp. 25–32.
 36. References IV
     McCallum, A., Munteanu, C., Penn, G. & Zhu, X. (2012), Ecological
       validity and the evaluation of speech summarization quality, in
       'Proceedings of the NAACL-HLT 2012 Workshop on Evaluation Metrics and
       System Comparison for Automatic Summarization', Association for
       Computational Linguistics.
     Oh, H.-J., Myaeng, S. H. & Jang, M.-G. (2007), 'Semantic passage
       segmentation based on sentence topics for question answering',
       Information Sciences 177(18), 3696–3717.
     Pevzner, L. & Hearst, M. (2002), 'A critique and improvement of an
       evaluation metric for text segmentation', Computational Linguistics
       28(1), 19–36.
     Siegel, S. & Castellan, N. (1988), Nonparametric Statistics for the
       Behavioral Sciences, second edn, McGraw-Hill, Inc.
 37. References V
     Stoyanov, V. & Cardie, C. (2008), Topic identification for fine-grained
       opinion analysis, in 'Proceedings of the 22nd International Conference
       on Computational Linguistics - Volume 1', COLING '08, Association for
       Computational Linguistics, Stroudsburg, PA, USA, pp. 817–824.
