Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.
August 30-31, 2007.
Patan Dhoka, Lalitpur, Nepal.
1. Experimenting with the
TextTiling Algorithm
Summary of the work done by master
students at Université Toulouse Le Mirail
Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L.,
Delpech E., El Maarouf I., Fontan L., Gotlik W.
2. Experimenting with the TextTiling
algorithm
Part I : What is the TextTiling algorithm ?
Part II : Experiments with the
TextTiling algorithm
Part III : Demo
3. Part I :
What is the TextTiling algorithm?
« an algorithm for partitioning expository texts into
coherent multi-paragraph discourse units which reflect
the subtopic structure of the texts »
developed by Marti Hearst (1997):
«TextTiling: Segmenting Text into Multi-Paragraph
Subtopic Passages », In Computational Linguistics, March
1997.
http://www.ischool.berkeley.edu/~hearst/tiling-about.html
4. Why segment a text into multi-paragraph
units ?
Computational tasks that use arbitrary windows might
benefit from using windows with motivated boundaries
Ease of readability for online long texts (Reading
Assistant Tools)
IR : retrieving relevant passages instead of whole
documents
Summarization : extract sentences according to their
position in the subtopic structure
5. What is the hypothesis behind TextTiling ?
« TextTiling assumes that a set of lexical items is in use
during the course of a given subtopic discussion, and
when that subtopic changes, a significant proportion of the
vocabulary changes as well »
TextTiling doesn’t detect subtopics per se, but shifts in
topic, by means of changes in vocabulary
Performs a linear segmentation (no hierarchy)
6. Detection of topic shift
Raw text -> Tokenisation -> Segmentation into pseudo-sentences (20 tokens) -> similarity score S (block A vs block B)
a similarity score is computed at every pseudo-sentence
gap, between 2 blocks of 6 pseudo-sentences
the more vocabulary in common, the
higher the score
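The pipeline on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's Perl implementation: the function names (`pseudo_sentences`, `cosine`, `gap_scores`) and the regular-expression tokeniser are assumptions made for the example. Only the sizes (20 tokens per pseudo-sentence, 6 pseudo-sentences per block) come from the slide.

```python
import math
import re
from collections import Counter

def pseudo_sentences(text, size=20):
    """Split a raw text into pseudo-sentences of `size` tokens each.
    The regex tokeniser is an assumption, not the authors' tokeniser."""
    tokens = re.findall(r"\w+", text.lower())
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def cosine(a, b):
    """Cosine similarity between two token-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values()) *
                     sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_scores(sentences, block_size=6):
    """One similarity score per gap between consecutive pseudo-sentences,
    comparing the block before the gap with the block after it."""
    scores = []
    for gap in range(1, len(sentences)):
        left = Counter(t for s in sentences[max(0, gap - block_size):gap] for t in s)
        right = Counter(t for s in sentences[gap:gap + block_size] for t in s)
        scores.append(cosine(left, right))
    return scores
```

For a text that switches abruptly from one vocabulary to another, the score at the gap where the switch happens drops to 0, which is the kind of valley the next slide looks for.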
7. I. Detection of topic shift
[Figure: similarity score (0 to 1) plotted against pseudo-sentence number (1 to 39)]
a gap means there is a drop in vocabulary similarity
topic shifts occur at the deepest gaps (after smoothing)
tile boundaries will be adjusted to the nearest paragraph break
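One way to make "deepest gaps" concrete is a depth score: for each gap, climb to the nearest peak on each side and add up the two drops. The sketch below follows that idea. The function names, the choice of keeping the `n` deepest gaps rather than applying a cutoff, and the representation of paragraph breaks as gap indices are all assumptions for illustration.

```python
def depth_scores(scores):
    """Depth of each gap: how far the score sits below the nearest
    peak on its left plus the nearest peak on its right."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for v in reversed(scores[:i]):   # climb leftwards while scores rise
            if v < left:
                break
            left = v
        right = s
        for v in scores[i + 1:]:         # climb rightwards while scores rise
            if v < right:
                break
            right = v
        depths.append((left - s) + (right - s))
    return depths

def deepest_gaps(scores, n):
    """Indices of the n deepest gaps, in text order."""
    depths = depth_scores(scores)
    ranked = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    return sorted(ranked[:n])

def snap_to_paragraph(gap, paragraph_breaks):
    """Move a candidate boundary to the nearest paragraph break."""
    return min(paragraph_breaks, key=lambda p: abs(p - gap))
```

With `scores = [0.9, 0.8, 0.3, 0.7, 0.85, 0.6, 0.75]`, the two deepest gaps are at indices 2 and 5, and a candidate at gap 5 with paragraph breaks at 0, 4 and 9 is moved to 4.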
8. Evaluation by Hearst (1997)
Evaluation on 12 magazine articles annotated by 7
judges
Judges are asked « to mark the paragraph boundary at
which the topic changed »
In case of disagreement among judges, a boundary is
kept if at least 3 judges agree on it
Agreement among judges (kappa measure) :
kappa = 0.647
9. Evaluation by Hearst (1997)
                    Precision   Recall
Baseline (random)   0.43        0.42
TextTiler           0.66        0.61
Judges              0.81        0.71
Works well on long (1800+ words) expository texts with
little structural demarcation
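Precision and recall for segmentation can be computed by comparing the set of predicted boundaries with the set of reference boundaries. The sketch below assumes exact matching on paragraph-break indices and a hypothetical function name; the slides do not say how near-misses were treated.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted boundaries against reference
    boundaries, both given as collections of paragraph-break indices.
    Exact matching only (an assumption of this sketch)."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

For example, predicting boundaries at breaks 3, 7 and 12 when the reference has 3, 12, 20 and 25 gives two hits, so precision is 2/3 and recall is 1/2.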
10. Part II : Experiments with
the TextTiling algorithm
Work done by master's students, Université Toulouse Le
Mirail
Implementation in Perl
Experiments :
cross annotation of 3 texts
variation of :
linguistic parameters
computation parameters
11. Annotation of topic boundary
No clear-cut topic shift, rather ‘regions’ of shift
Annotators felt a smaller unit (sentence) would have
been more convenient
Our kappa : 0.56
Hearst’s judges : 0.65
kappa should be at least 0.67; above 0.8 is best
A difficult (unnatural ?) task for humans
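For readers unfamiliar with kappa, here is a minimal sketch of a multi-annotator agreement coefficient for the boundary / no-boundary decision. It uses Fleiss' kappa, which is an assumption: the slides do not state which kappa variant was computed. The function name and input format are hypothetical.

```python
def fleiss_kappa(votes, n_raters):
    """Fleiss' kappa for a binary decision.
    votes[i] is the number of raters (out of n_raters) who marked a
    boundary at paragraph break i. Undefined (division by zero) when
    every rater gives the same answer everywhere."""
    n_items = len(votes)
    p_yes = sum(votes) / (n_items * n_raters)
    # Agreement expected by chance from the overall boundary rate.
    p_e = p_yes ** 2 + (1 - p_yes) ** 2
    # Observed agreement, averaged over paragraph breaks.
    p_bar = sum(
        (v * v + (n_raters - v) ** 2 - n_raters) / (n_raters * (n_raters - 1))
        for v in votes
    ) / n_items
    return (p_bar - p_e) / (1 - p_e)
```

With 3 raters, votes of `[3, 0, 0, 3]` (full agreement) give a kappa of 1, while `[3, 0, 2, 1]` gives 1/3.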
15. Correlation window size / smoothing
window size (number of tokens)   10   20   30   40   50
smoothing iterations              3    3    1    1    1
smoothing width                   2    1    2    2    1
Correlation between window size and smoothing :
The smaller your window, the more smoothing you need
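The two smoothing parameters can be illustrated with a simple moving average. This is a sketch under assumptions: "width" is taken to mean the number of neighbouring scores on each side included in the average, and "iteration" the number of passes; the slides do not define either term precisely, and the function name is hypothetical.

```python
def smooth(scores, width=1, iterations=1):
    """Moving-average smoothing of a list of gap scores.
    Each pass replaces every score by the mean of itself and up to
    `width` neighbours on each side (fewer at the edges)."""
    for _ in range(iterations):
        smoothed = []
        for i in range(len(scores)):
            window = scores[max(0, i - width):i + width + 1]
            smoothed.append(sum(window) / len(window))
        scores = smoothed
    return scores
```

Small windows give noisy scores with many shallow valleys, so more passes are needed before only the deep valleys remain; for instance `smooth([0, 3, 0])` flattens the spike to `[1.5, 1.0, 1.5]`.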
16. Optimal parameter sets
         Nb parag.   Nb words   words / parag.   sentences / block   tokens / sentence   smooth. iteration   smooth. width
Text 1   12          2000       167              6                   5                   3                   2
Text 2   22          2400       109              6                   10                  1                   1
Text 3   37          1750       20               8                   10                  1                   1
One optimal parameter set per text
Does the optimal set vary according to text/paragraph
length ?
17. Final thoughts
Linguistic processing :
lemmatization doesn’t significantly improve TextTiling
what about stemming ?
Computation parameters :
parameters are highly interdependent
optimal parameter sets vary from text to text
Proposal : an adaptive TextTiler ?
window size could be adapted to the intrinsic properties of the text
smoothing could then be adapted to window size
19. Similarity score – Hearst (1997)
\[
\mathrm{Sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \; \sum_t w_{t,b_2}^2}}
\]
b1 : block 1
b2 : block 2
t : token
w : weight (frequency) of the token in the block
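A small worked example may help. The two blocks and their token counts below are invented for illustration: block 1 contains "topic" twice and "shift" once, block 2 contains "topic" once and "text" once. Only "topic" is shared, so it is the only token contributing to the numerator.

```latex
\[
\mathrm{Sim}(b_1, b_2)
  = \frac{w_{\text{topic},b_1}\, w_{\text{topic},b_2}}
         {\sqrt{\left(w_{\text{topic},b_1}^2 + w_{\text{shift},b_1}^2\right)
                \left(w_{\text{topic},b_2}^2 + w_{\text{text},b_2}^2\right)}}
  = \frac{2 \cdot 1}{\sqrt{(2^2 + 1^2)(1^2 + 1^2)}}
  = \frac{2}{\sqrt{10}}
  \approx 0.63
\]
```

Two blocks with identical token frequencies score 1, and two blocks with no token in common score 0.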