2. Outline
Introduction
Types of stemming algorithms
Experimental evaluations of stemming
Stemming to compress inverted files
Summary
Appendix
3. Introduction
Stemming is one technique to provide ways of finding
morphological variants of search terms.
Used to improve retrieval effectiveness and to reduce the
size of indexing files.
Taxonomy for stemming algorithms
4. Introduction (con’t)
Criteria for judging stemmers
Correctness
Overstemming: too much of a term is removed.
Understemming: too little of a term is removed.
Retrieval effectiveness
measured with recall and precision; stemmers
are also judged on their speed, size, and so on
Compression performance
5. Types of stemming algorithms
Table lookup approach
Successor Variety
n-gram stemmers
Affix Removal Stemmers
6. Table lookup approach
Store a table of all index terms and their
stems, so terms from queries and indexes
could be stemmed very fast.
Problems
No complete table of terms and stems exists for
English, and some terms are domain dependent.
The storage overhead for such a table can be large,
though trading size for time is sometimes warranted.
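A minimal sketch of the table lookup idea, assuming a small hand-built table. The name STEM_TABLE, its entries, and the fallback of returning the unstemmed term are illustrative assumptions, not part of the slides.

```python
# Hypothetical stem table; a real one would be far larger and domain specific.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the term itself when it is not in the table."""
    key = term.lower()
    # Assumption: terms missing from the table are left unstemmed.
    return table.get(key, key)


print(lookup_stem("Engineering"))
print(lookup_stem("zebra"))
```

The dictionary lookup is fast, but every term that should be conflated has to be stored explicitly, which is the storage overhead noted above.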
7. Successor Variety approach
Determine word and morpheme boundaries
based on the distribution of phonemes in a
large body of utterances.
The successor variety of a string is the
number of different characters that follow it in
words in some body of text.
The successor variety of substrings of a term
will decrease as more characters are added
until a segment boundary is reached.
8. Successor Variety approach (con’t)
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE
Prefix      Successor Variety   Letters
R           3                   E,I,O
RE          2                   A,D
REA         1                   D
READ        3                   A,I,S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   (Blank)
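The successor varieties for READABLE can be recomputed from the corpus with a short sketch. The function names are hypothetical, and the sketch counts only letter successors, so the complete word READABLE gets 0 here where the table records a blank with variety 1.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successor_letters(prefix, corpus):
    """Set of distinct letters that follow `prefix` in the corpus words."""
    return {word[len(prefix)] for word in corpus
            if word.startswith(prefix) and len(word) > len(prefix)}


def successor_varieties(word, corpus):
    """List of (prefix, successor variety) for every prefix of `word`."""
    return [(word[:i], len(successor_letters(word[:i], corpus)))
            for i in range(1, len(word) + 1)]


for prefix, variety in successor_varieties("READABLE", CORPUS):
    print(prefix, variety, sorted(successor_letters(prefix, CORPUS)))
```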
9. Successor Variety approach (con’t)
cutoff method
some cutoff value is selected and a boundary
is identified whenever the cutoff value is
reached
peak and plateau method
segment break is made after a character
whose successor variety exceeds that of the
characters immediately preceding and
following it
complete word method
a break is made after a segment if the segment
is a complete word in the corpus
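A sketch of the cutoff and peak and plateau methods, using the successor varieties listed for READABLE (3, 2, 1, 3, 1, 1, 1, 1). The function names are hypothetical. Two assumptions are made: the first and last positions are never treated as peaks because they lack a neighbour on one side, and the cutoff method never cuts after the final character.

```python
# Successor varieties for R, RE, REA, ..., READABLE, taken from the table.
READABLE_VARIETIES = [3, 2, 1, 3, 1, 1, 1, 1]


def cutoff_cuts(varieties, cutoff):
    """Prefix lengths whose successor variety reaches the cutoff value."""
    return [k + 1 for k, v in enumerate(varieties[:-1]) if v >= cutoff]


def peak_and_plateau_cuts(varieties):
    """Prefix lengths whose variety exceeds both neighbouring values."""
    return [k + 1 for k in range(1, len(varieties) - 1)
            if varieties[k] > varieties[k - 1] and varieties[k] > varieties[k + 1]]


def segment(word, cuts):
    """Split `word` after each prefix length listed in `cuts`."""
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces


print(segment("READABLE", peak_and_plateau_cuts(READABLE_VARIETIES)))
print(segment("READABLE", cutoff_cuts(READABLE_VARIETIES, 3)))
```

With these values the peak and plateau method cuts once, after READ, while a cutoff of 3 also cuts after the initial R, which illustrates how sensitive the cutoff method is to the chosen value.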
10. Successor Variety approach (con’t)
entropy method
$|D_i|$: the number of words in a text body beginning
with the i length sequence of letters
$|D_{ij}|$: the number of words in $D_i$ with the successor j
The probability that a member of $D_i$ has the successor j
is given by $|D_{ij}| / |D_i|$
The entropy of $|D_i|$ is
$$H_i = \sum_{j=1}^{26} - \frac{|D_{ij}|}{|D_i|} \log_2 \frac{|D_{ij}|}{|D_i|}$$
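The entropy of the successor distribution can be sketched as follows, reusing the READABLE corpus from the earlier slide. The function name is hypothetical, and the sketch assumes that words equal to the prefix (which have no successor letter) are left out of the count. Letters that never follow the prefix contribute nothing to the sum, so only observed successors are visited.

```python
import math
from collections import Counter

CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successor_entropy(prefix, corpus):
    """Entropy (base 2) of the letters that follow `prefix` in the corpus."""
    counts = Counter(word[len(prefix)] for word in corpus
                     if word.startswith(prefix) and len(word) > len(prefix))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


for prefix in ["R", "RE", "REA", "READ"]:
    print(prefix, round(successor_entropy(prefix, CORPUS), 3))
```

A prefix followed by a single letter has entropy 0, while a prefix followed by several equally likely letters has high entropy, which is what marks a likely segment boundary.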
11. Successor Variety approach (con’t)
Two criteria used to evaluate various
segmentation methods
1. the number of correct segment cuts divided
by the total number of cuts
2. the number of correct segment cuts divided
by the total number of true boundaries
After segmenting, if the first segment occurs
in more than 12 words in the corpus, it is
probably a prefix.
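The two evaluation criteria behave like precision and recall over segment cuts. A minimal sketch, assuming cuts and true boundaries are given as prefix lengths; the function name and the choice to return 0.0 for an empty set are assumptions.

```python
def cut_precision_recall(predicted_cuts, true_boundaries):
    """Return (correct cuts / total cuts, correct cuts / true boundaries)."""
    predicted, truth = set(predicted_cuts), set(true_boundaries)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# READABLE cut after R and after READ, with one true boundary after READ.
print(cut_precision_recall([1, 4], [4]))
```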
12. Successor Variety approach (con’t)
The successor variety stemming process
has three parts
1. determine the successor varieties for a word
2. segment the word using one of the methods
3. select one of the segments as the stem
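The third part, selecting a stem, can be sketched with the prefix heuristic from the previous slide: a first segment that occurs in more than 12 corpus words is treated as a prefix and the second segment is kept instead. The function name is hypothetical, and counting "occurs in" as "the word begins with the segment" is an assumption.

```python
def select_stem(segments, corpus, prefix_threshold=12):
    """Pick the first segment unless it looks like a common prefix."""
    first = segments[0]
    # Assumption: an occurrence means a corpus word starts with the segment.
    occurrences = sum(1 for word in corpus if word.startswith(first))
    if occurrences > prefix_threshold and len(segments) > 1:
        return segments[1]
    return first


corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]
print(select_stem(["READ", "ABLE"], corpus))
```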
13. n-gram stemmers
Association measures are calculated between pairs of
terms based on shared unique digrams.
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
Dice’s coefficient (similarity)
A and B are the numbers of unique digrams in the first
and the second words. C is the number of unique
digrams shared by the two words.
$$S = \frac{2C}{A + B} = \frac{2 \times 6}{7 + 8} = 0.80$$
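The statistics / statistical example can be checked with a short sketch. The function names are hypothetical, and returning 0.0 when neither word has a digram is an assumption for the degenerate case.

```python
def unique_digrams(word):
    """Set of distinct adjacent letter pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(first, second):
    """Dice's coefficient over the unique digrams of two words."""
    a, b = unique_digrams(first), unique_digrams(second)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


print(sorted(unique_digrams("statistics")))
print(sorted(unique_digrams("statistical")))
print(round(dice("statistics", "statistical"), 2))
```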
14. n-gram stemmers (con’t)
Similarity measures are determined for all
pairs of terms in the database, forming a
similarity matrix
Once such a similarity matrix is available,
terms are clustered using a single link
clustering method (as described in Ch.16)
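A sketch of the clustering step, assuming a fixed similarity threshold. With a threshold, single link clustering reduces to finding connected groups of terms in which each term is linked to at least one other by a similarity at or above the threshold. The threshold value, the sample terms, and the function names are illustrative assumptions.

```python
from itertools import combinations


def dice(first, second):
    """Dice's coefficient over the unique digrams of two words."""
    a = {first[i:i + 2] for i in range(len(first) - 1)}
    b = {second[i:i + 2] for i in range(len(second) - 1)}
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def single_link_clusters(terms, threshold):
    """Group terms connected by chains of pairs with similarity >= threshold."""
    parent = {term: term for term in terms}

    def find(term):
        # Follow parent links to the cluster representative.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    # Each sufficiently similar pair merges two clusters.
    for a, b in combinations(terms, 2):
        if dice(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for term in terms:
        clusters.setdefault(find(term), []).append(term)
    return sorted(sorted(group) for group in clusters.values())


terms = ["statistics", "statistical", "statistic", "bead", "beads"]
print(single_link_clusters(terms, threshold=0.6))
```

Computing all pairwise similarities is quadratic in the number of terms, so this sketch is only practical for small vocabularies.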
15. Affix Removal Stemmers
Affix removal algorithms remove suffixes
and/or prefixes from terms leaving a stem
Example: a simple plural-removal stemmer (Harman 1991);
only the first applicable rule is used
If a word ends in “ies” but not “eies” or “aies”
Then “ies” -> “y”
If a word ends in “es” but not “aes”, “ees” or “oes”
Then “es” -> “e”
If a word ends in “s” but not “us” or “ss”
Then “s” -> NULL
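The three rules above translate almost directly into code. This is a sketch: the function name is hypothetical, and lowercasing the input and stopping at the first rule whose condition holds are assumptions about how the rules are meant to be applied.

```python
def s_stem(word):
    """Apply the three plural-removal rules; the first applicable rule wins."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for example in ["ponies", "horses", "cats", "glass", "corpus"]:
    print(example, "->", s_stem(example))
```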
16. The Porter algorithm
The Porter algorithm consists of a set of
condition/action rules.
The conditions fall into three classes
Conditions on the stem
Conditions on the suffix
Conditions on rules
17. Conditions on the stem
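1. The measure, denoted m, of a stem is based on its
alternating vowel-consonant sequences:
$[C](VC)^m[V]$
where C is a sequence of consonants, V is a sequence of
vowels, and square brackets mark optional parts
Measure   Examples
m=0       TR, EE, TREE, Y, BY
m=1       TROUBLE, OATS, TREES, IVY
m=2       TROUBLES, PRIVATE, OATEN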
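The measure can be computed by counting vowel-to-consonant transitions. This sketch assumes Porter's usual convention that a, e, i, o, u are vowels and that y counts as a vowel only when it follows a consonant; the function names are hypothetical.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of VC sequences (m) in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        # Each vowel-to-consonant transition closes one VC sequence.
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


for word in ["TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "OATEN"]:
    print(word, measure(word))
```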
18. Conditions on the stem (con’t)
2. *<X> --- the stem ends with a given letter X
3. *v* --- the stem contains a vowel
4. *d --- the stem ends in a double consonant
5. *o --- the stem ends with a consonant-vowel-consonant
sequence, where the final consonant is not w, x or y
Suffix conditions take the form: (current_suffix == pattern)
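The remaining stem conditions can be sketched as small predicates. The function names are hypothetical, input is assumed to be lowercase, and y is treated as a vowel only after a consonant, which is the usual Porter convention rather than something stated on the slide.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends in a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, last letter not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


print(contains_vowel("plaster"), ends_double_consonant("fall"), ends_cvc("hop"))
```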
19. Conditions on the rules
The rules are divided into steps. The rules in a step
are examined in sequence, and only one rule from a
step can apply.
{
    step1a(word);
    step1b(stem);
    if (the second or third rule of step 1b was used)
        step1b1(stem);
    step1c(stem);
    step2(stem);
    step3(stem);
    step4(stem);
    step5a(stem);
    step5b(stem);
}
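The "only one rule from a step can apply" behaviour can be sketched with step 1a. The rule list below follows Porter's published step 1a (SSES -> SS, IES -> I, SS -> SS, S -> null), which is not listed on these slides; the names STEP_1A_RULES and apply_step are hypothetical, and input is assumed to be lowercase. Step 1a has no stem conditions, so only suffix matching is needed here.

```python
# Step 1a rules in the order they are examined: (suffix, replacement).
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def apply_step(word, rules):
    """Apply the first rule whose suffix matches, then stop."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word


for example in ["caresses", "ponies", "caress", "cats"]:
    print(example, "->", apply_step(example, STEP_1A_RULES))
```

Listing the longer suffixes first is what keeps "caress" from losing its final s: the SS -> SS rule matches and ends the step before the S rule is examined.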
28. Stemming Studies : Conclusion
Most reported effects of stemming on
retrieval performance have been positive
Stemming is as effective as manual conflation
The effect of stemming is dependent on the
nature of vocabulary used
There appears to be little difference between
the retrieval effectiveness of different full
stemmers
29. Stemming to compress inverted files
Lennon et al. report the following compression
percentages for various stemmers and databases.
The savings in storage can be
substantial.
Compression rates also increase for affix removal
stemmers as the number of suffixes increases.
30. Summary
Stemmers are used to conflate terms to
improve retrieval effectiveness and/or to
reduce the size of indexing files.
Stemming will generally increase recall at the cost of
decreased precision.
Stemming can have a marked effect on the size
of indexing files, sometimes decreasing the
size of the files by as much as 50 percent.