Experimenting the TextTiling Algorithm

Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.
August 30-31, 2007.
Patan Dhoka, Lalitpur, Nepal.

Transcript

  • 1. Experimenting the TextTiling Algorithm. Summary of the work done by master's students at Université Toulouse Le Mirail: Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L., Delpech E., El Maarouf I., Fontan L., Gotlik W.
  • 2. Experimenting the TextTiling algorithm. Part I: What is the TextTiling algorithm? Part II: Experiments with the TextTiling algorithm. Part III: Demo
  • 3. Part I: What is the TextTiling algorithm?
     « an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflects the subtopic structure of the texts »
     developed by Marti Hearst (1997): « TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages », Computational Linguistics, March 1997. http://www.ischool.berkeley.edu/~hearst/tiling-about.html
  • 4. Why segment a text into multi-paragraph units?
     Computational tasks that use arbitrary windows might benefit from windows with motivated boundaries
     Readability of long online texts (reading-assistant tools)
     IR: retrieving relevant passages instead of whole documents
     Summarization: extracting sentences according to their position in the subtopic structure
  • 5. What is the hypothesis behind TextTiling?
     « TextTiling assumes that a set of lexical items is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well »
     TextTiling does not detect subtopics per se, but shifts in topic, signalled by changes in vocabulary
     It performs a linear segmentation (no hierarchy)
  • 6. Detection of topic shift
     Tokenisation of the raw text
     Segmentation into pseudo-sentences (20 tokens each)
     At every gap between pseudo-sentences, a similarity score is computed between the two blocks of 6 pseudo-sentences on either side (block A vs block B)
     The more vocabulary the two blocks have in common, the higher the score
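
A minimal sketch of this pipeline in Python, assuming simple lower-cased word tokenisation and raw term frequencies as weights (the exact score is the cosine measure given on slide 19); the names pseudo_sentences, gap_scores, w and k are ours:

    import math
    import re
    from collections import Counter

    def pseudo_sentences(text, w=20):
        """Tokenise the raw text and cut it into pseudo-sentences of w tokens."""
        tokens = re.findall(r"\w+", text.lower())
        return [tokens[i:i + w] for i in range(0, len(tokens), w)]

    def gap_scores(pseudo, k=6):
        """At every gap between pseudo-sentences, compare the (up to) k
        pseudo-sentences before the gap with the k after it."""
        scores = []
        for gap in range(1, len(pseudo)):
            block_a = Counter(t for ps in pseudo[max(0, gap - k):gap] for t in ps)
            block_b = Counter(t for ps in pseudo[gap:gap + k] for t in ps)
            shared = sum(block_a[t] * block_b[t] for t in block_a.keys() & block_b.keys())
            norm = math.sqrt(sum(f * f for f in block_a.values()) *
                             sum(f * f for f in block_b.values()))
            scores.append(shared / norm if norm else 0.0)
        return scores
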
  • 7. Detection of topic shift
     A valley in the score curve means a drop in vocabulary similarity
     Topic shifts occur at the deepest valleys (after smoothing)
     Tile boundaries are then adjusted to the nearest paragraph break
    [Plot of the similarity score against pseudo-sentence number, with the deepest valleys marking candidate boundaries.]
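
A sketch of the boundary-selection step, assuming Hearst-style depth scores (climb to the nearest peak on each side of a valley) and a cutoff based on the mean and standard deviation of the depth scores; snapping boundaries to the nearest paragraph break is left out:

    import statistics

    def depth_scores(scores):
        """Depth of the valley at each gap: climb left and right while the
        curve keeps rising, then add up how far the gap sits below both peaks."""
        depths = []
        for i, s in enumerate(scores):
            left, j = s, i
            while j > 0 and scores[j - 1] >= left:
                left = scores[j - 1]
                j -= 1
            right, j = s, i
            while j < len(scores) - 1 and scores[j + 1] >= right:
                right = scores[j + 1]
                j += 1
            depths.append((left - s) + (right - s))
        return depths

    def boundaries(depths):
        """Keep the gaps whose depth exceeds a cutoff derived from the mean
        and standard deviation of all depth scores (one common choice)."""
        cutoff = statistics.mean(depths) - statistics.stdev(depths) / 2
        return [i for i, d in enumerate(depths) if d > cutoff]
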
  • 8. Evaluation by Hearst (1997)
     Evaluation on 12 magazine articles annotated by 7 judges
     Judges were asked « to mark the paragraph boundary at which the topic changed »
     In case of disagreement among judges, a boundary is kept if at least 3 judges agree on it
     Agreement among judges (kappa measure): kappa = 0.647
  • 9. Evaluation by Hearst (1997)
                            Precision   Recall
     TextTiler                0.66       0.61
     Judges                   0.81       0.71
     Baseline (random)        0.43       0.42
     Works well on long (1800+ words) expository texts with little structural demarcation
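
For reference, boundary precision and recall as reported in this table can be computed as below; treating boundaries as paragraph-gap indices and counting only exact matches is an assumption of this sketch:

    def precision_recall(hypothesised, reference):
        """Boundary-level precision and recall over sets of boundary positions."""
        hyp, ref = set(hypothesised), set(reference)
        hits = len(hyp & ref)
        precision = hits / len(hyp) if hyp else 0.0
        recall = hits / len(ref) if ref else 0.0
        return precision, recall

    # Example: 2 of 3 proposed boundaries are correct, 2 of 3 reference ones are found.
    print(precision_recall([3, 7, 12], [3, 8, 12]))  # (0.666..., 0.666...)
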
  • 10. Part II: Experiments with the TextTiling algorithm
     Work done by master's students, Université Toulouse Le Mirail
     Implementation in Perl
     Experiments: cross-annotation of 3 texts; variation of linguistic parameters and of computation parameters
  • 11. Annotation of topic boundaries
     No clear-cut topic shifts, rather 'regions' of shift; annotators felt a smaller unit (the sentence) would have been more convenient
     Our kappa: 0.56; Hearst's judges: 0.65
     Kappa should be at least 0.67, and ideally above 0.8
     A difficult (unnatural?) task for humans
  • 12. Variation of linguistic parameters: three settings (basic, trigrams, lemmatization with TreeTagger*) compared on precision, recall and F-measure
    [Bar chart of precision, recall and F-measure for the three settings.]
    * http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
  • 13. Variation of computation parameters
     Computation window: pseudo-sentence length and block length
     Smoothing
    [Three similarity-score plots, omitted.]
  • 14. Size of the computation window (rows: pseudo-sentence length in tokens; columns: block length in pseudo-sentences; ++ = best segmentation, -- = worst)
              2     4     6     8    10    12    14    16    18    20
       5     ++   +++    ++    ++    ++    ++    ++    ++    ++    ++
      10     ++    ++    ++     +     +    ++     +     +     +     +
      15     ++     +     +     +     +     +     +     -     -     -
      20      +     +     +     -     -     -     -     -     -    --
      25      +     +     -     -     -     -     -    --    --    --
      30      +     -     -     -     -    --    --    --    --    --
      35      +     -     -     -     -    --    --    --    --    --
      40     --    --    --    --    --    --    --    --    --    --
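
One way to reproduce such a sweep today is NLTK's TextTilingTokenizer, whose w (tokens per pseudo-sentence) and k (pseudo-sentences per block) roughly correspond to the two axes above. The argument names are those of recent NLTK releases and may differ elsewhere, so treat this as a sketch:

    # Requires the NLTK stopwords data (nltk.download('stopwords')); the input
    # text must contain blank-line paragraph breaks, and very short texts can
    # raise an error.
    from nltk.tokenize import TextTilingTokenizer

    def sweep(text, sentence_lengths=(5, 10, 15, 20), block_lengths=(2, 4, 6, 8, 10)):
        results = {}
        for w in sentence_lengths:      # pseudo-sentence length in tokens
            for k in block_lengths:     # block length in pseudo-sentences
                tiles = TextTilingTokenizer(w=w, k=k).tokenize(text)
                results[(w, k)] = len(tiles)  # number of segments found
        return results
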
  • 15. Correlation between window size and smoothing
     Window size (number of tokens):   10   20   30   40   50
     Smoothing iterations:              3    3    1    1    1
     Smoothing width:                   2    1    2    2    1
     The smaller the window, the more smoothing is needed
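
A sketch of the smoothing step these two parameters control, assuming plain iterated moving-average smoothing; reading "width" as the number of neighbouring gaps averaged on each side is our interpretation:

    def smooth(scores, width=1, iterations=1):
        """Apply a moving average of (2 * width + 1) points to the gap-score
        curve, repeated `iterations` times, before depth scores are computed."""
        for _ in range(iterations):
            smoothed = []
            for i in range(len(scores)):
                window = scores[max(0, i - width):i + width + 1]
                smoothed.append(sum(window) / len(window))
            scores = smoothed
        return scores
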
  • 16. Optimal parameter sets
                Nb parag.   Nb words   Words/parag.   Sentences/block   Tokens/sentence   Smoothing iterations   Smoothing width
     Text 1        12         2000          167              6                  5                   3                   2
     Text 2        22         2400          109              6                 10                   1                   1
     Text 3        37         1750           20              8                 10                   1                   1
     One optimal parameter set per text
     Does the optimal set vary according to text/paragraph length?
  • 17. Final thoughts
     Linguistic processing: lemmatization does not significantly improve TextTiling; what about stemming?
     Computation parameters: the parameters are highly interdependent, and the optimal parameter set varies from text to text
     Proposal: an adaptive TextTiler? The window size could be adapted to the text's intrinsic properties, and the smoothing could then be adapted to the window size
  • 18. Part III : Demo
  • 19. Similarity score – Hearst (1997)
    sim(b1, b2) = ( ∑_t w_{t,b1} · w_{t,b2} ) / √( ∑_t w_{t,b1}² · ∑_t w_{t,b2}² )
     b1, b2: the two blocks being compared
     t: token
     w_{t,b}: weight (frequency) of token t in block b
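
The same score in Python, using the frequency of a token in a block as its weight, as stated on the slide:

    import math
    from collections import Counter

    def sim(block1, block2):
        """Cosine similarity between two blocks given as lists of tokens."""
        w1, w2 = Counter(block1), Counter(block2)
        numerator = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
        denominator = math.sqrt(sum(f * f for f in w1.values()) *
                                sum(f * f for f in w2.values()))
        return numerator / denominator if denominator else 0.0
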
  • 20. Kappa measure (http://www.musc.edu/dc/icrebm/kappa.html)
                        Annot 1: yes   Annot 1: no   TOTAL
     Annot 2: yes           40             35        Y2 = 75
     Annot 2: no             5             20        N2 = 25
     TOTAL               Y1 = 45        N1 = 55      T = 100
     Observed agreement: P(A) = (40 + 20) / 100 = 0.6
     Expected agreement: P(E) = (Y1·Y2 + N1·N2) / T² = 0.475
     Kappa = (P(A) – P(E)) / (1 – P(E)) = 0.24
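
A worked check of this example in Python, with the four cells laid out as in the table above:

    def cohen_kappa(both_yes, annot2_only_yes, annot1_only_yes, both_no):
        """Cohen's kappa from the four cells of a 2x2 agreement table."""
        t = both_yes + annot2_only_yes + annot1_only_yes + both_no
        p_a = (both_yes + both_no) / t                   # observed agreement P(A)
        y1 = both_yes + annot1_only_yes                  # annotator 1 "yes" total
        n1 = annot2_only_yes + both_no                   # annotator 1 "no" total
        y2 = both_yes + annot2_only_yes                  # annotator 2 "yes" total
        n2 = annot1_only_yes + both_no                   # annotator 2 "no" total
        p_e = (y1 * y2 + n1 * n2) / (t * t)              # expected agreement P(E)
        return (p_a - p_e) / (1 - p_e)

    print(round(cohen_kappa(40, 35, 5, 20), 2))  # 0.24, as above
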
