unit 4.ppt

Outline
 Introduction
 Types of stemming algorithms
 Experimental evaluations of stemming
 Stemming to compress inverted files
 Summary
 Appendix

Introduction
 Stemming is one technique to provide ways of finding
morphological variants of search terms.
 Used to improve retrieval effectiveness and to reduce the
size of indexing files.
 Taxonomy for stemming algorithms

Introduction (cont'd)
 Criteria for judging stemmers
   Correctness
     Overstemming: too much of a term is removed.
     Understemming: too little of a term is removed.
   Retrieval effectiveness, usually measured with recall and precision
   Speed, size, and similar practical properties
   Compression performance

Types of stemming algorithms
 Table lookup approach
 Successor Variety
 n-gram stemmers
 Affix Removal Stemmers

Table lookup approach
 Store a table of all index terms and their
stems, so terms from queries and indexes
could be stemmed very fast.
 Problems
   No such table exists for English, and some terms are domain dependent, so a general table would not cover them.
   The table adds storage overhead, though trading size for time is sometimes warranted.
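
The lookup itself can be sketched in a few lines of Python. This is only an illustration: the table contents and the names `STEM_TABLE` and `lookup_stem` are hypothetical, and returning the term unchanged when it is missing from the table is an assumption, not something the slides prescribe.

```python
# Hypothetical term -> stem table; a real table would be built offline.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the term itself if it is not in the table."""
    term = term.lower()
    return table.get(term, term)

print(lookup_stem("Engineering"))  # engineer
print(lookup_stem("retrieval"))    # retrieval (not in the table)
```

Each lookup is a single hash access, which is the speed advantage; the cost is the memory for the table and the fact that unseen terms are not stemmed at all.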

Successor Variety approach
 Determine word and morpheme boundaries
based on the distribution of phonemes in a
large body of utterances.
 The successor variety of a string is the
number of different characters that follow it in
words in some body of text.
 The successor variety of substrings of a term
will decrease as more characters are added
until a segment boundary is reached.

Prefix      Successor Variety   Letters
R           3                   E, I, O
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   (Blank)

Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE
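
The counts for the test word can be reproduced with a short Python sketch. The function name `successor_variety` is illustrative, and the sketch assumes that only letters count as successors, so the full word READABLE gets 0 here, whereas the table records the blank as its single successor.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` in the corpus words."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

word = "READABLE"
varieties = [successor_variety(word[:i], CORPUS) for i in range(1, len(word) + 1)]
print(varieties)  # [3, 2, 1, 3, 1, 1, 1, 0]
```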

Successor Variety approach (cont'd)
 Cutoff method
   Some cutoff value is selected, and a boundary is identified whenever the cutoff value is reached.
 Peak and plateau method
   A segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.
 Complete word method
   A break is made after a segment if the segment is a complete word in the corpus.

 entropy method
   $|D_{\alpha i}|$: the number of words in a text body beginning with the $i$-length sequence of letters $\alpha$
   $|D_{\alpha i j}|$: the number of words in $D_{\alpha i}$ with the successor $j$
   The probability that a member of $D_{\alpha i}$ has the successor $j$ is given by $\frac{|D_{\alpha i j}|}{|D_{\alpha i}|}$
   The entropy of $D_{\alpha i}$ is
$$H_{\alpha i} = \sum_{j=1}^{26} -\frac{|D_{\alpha i j}|}{|D_{\alpha i}|} \log_2 \frac{|D_{\alpha i j}|}{|D_{\alpha i}|}$$

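The entropy of the successor distribution for a prefix can be computed directly from these definitions. The Python sketch below is illustrative only: the name `successor_entropy` is hypothetical, the corpus is the READABLE example corpus, and it assumes that words equal to the prefix (which have no successor letter) are left out of the counts.

```python
import math
from collections import Counter

CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_entropy(prefix, corpus):
    """Entropy (base 2) of the letters that follow `prefix` in the corpus."""
    # followers[j] plays the role of |D_alpha_ij|; total plays |D_alpha_i|.
    followers = Counter(w[len(prefix)] for w in corpus
                        if w.startswith(prefix) and len(w) > len(prefix))
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in followers.values())

# "RE" is followed by A four times and by D once.
print(round(successor_entropy("RE", CORPUS), 4))  # 0.7219
```

A prefix with a single possible successor, such as REA, has entropy 0; entropy grows when many successors are about equally likely, which is the signal a boundary detector would look for.
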
 Two criteria used to evaluate various
segmentation methods
1. the number of correct segment cuts divided
by the total number of cuts
2. the number of correct segment cuts divided
by the total number of true boundaries
 After segmenting, if the first segment occurs
in more than 12 words in the corpus, it is
probably a prefix.

 The successor variety stemming process
has three parts
1. determine the successor varieties for a word
2. segment the word using one of the methods
3. select one of the segments as the stem
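
The three parts can be put together in a small Python sketch. It uses the peak and plateau method for segmentation and the 12-word rule for choosing the stem. Several details are assumptions rather than something the slides specify: only letters count as successors, a segment "occurs in" a word when the word starts with it, and the second segment is taken as the stem when the first looks like a prefix. All function names are illustrative.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus):
    """Part 1: number of distinct letters following `prefix` in the corpus."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

def peak_and_plateau(word, corpus):
    """Part 2: cut after a prefix whose variety exceeds both neighbours."""
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

def select_stem(segments, corpus, prefix_threshold=12):
    """Part 3: skip the first segment if it is frequent enough to be a prefix."""
    first = segments[0]
    occurrences = sum(1 for w in corpus if w.startswith(first))
    if occurrences > prefix_threshold and len(segments) > 1:
        return segments[1]
    return first

segments = peak_and_plateau("READABLE", CORPUS)
print(segments)                       # ['READ', 'ABLE']
print(select_stem(segments, CORPUS))  # READ
```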

n-gram stemmers
 Association measures are calculated between pairs of
terms based on shared unique digrams.
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
 Dice's coefficient (similarity)
$$S = \frac{2C}{A + B} = \frac{2 \times 6}{7 + 8} = 0.80$$
A and B are the numbers of unique digrams in the first and the second words. C is the number of unique digrams shared by the two words.
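
The digram sets and the similarity value for this pair of words can be checked with a short Python sketch; the function names `unique_digrams` and `dice` are illustrative.

```python
def unique_digrams(word):
    """Set of distinct adjacent letter pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a, b = unique_digrams(first), unique_digrams(second)
    if not a and not b:
        return 0.0  # words shorter than two letters share no digrams
    return 2 * len(a & b) / (len(a) + len(b))

print(sorted(unique_digrams("statistics")))  # ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(dice("statistics", "statistical"))     # 0.8
```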

n-gram stemmers (cont'd)
 Similarity measures are determined for all
pairs of terms in the database, forming a
similarity matrix
 Once such a similarity matrix is available,
terms are clustered using a single link
clustering method (as described in Ch.16)
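
A rough Python sketch of this clustering step follows. It assumes a fixed similarity threshold, in which case single-link clusters are the connected components of the graph that links every pair of terms at or above the threshold. The threshold value, the term list, and all names are illustrative assumptions, and the similarity matrix is computed pair by pair rather than stored.

```python
from itertools import combinations

def unique_digrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(first, second):
    a, b = unique_digrams(first), unique_digrams(second)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def single_link_clusters(terms, threshold):
    """Merge terms into one cluster whenever any pair reaches the threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster representative.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())

terms = ["statistics", "statistical", "statistician", "reading", "reader"]
clusters = single_link_clusters(terms, threshold=0.5)
print(sorted(sorted(c) for c in clusters))
# [['reader', 'reading'], ['statistical', 'statistician', 'statistics']]
```

Each resulting cluster plays the role of a stem class: all of its members would be conflated to one index entry.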

Affix Removal Stemmers
 Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem.
 A simple example that conflates singular and plural forms (Harman 1991):
   If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"
   If a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e"
   If a word ends in "s" but not "us" or "ss", then "s" -> NULL
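
These three rules translate almost directly into code. The Python sketch below is one possible reading: it assumes the rules are tried in the order listed and that only the first rule whose whole condition holds (suffix and exceptions) is applied. The name `s_stem` is illustrative.

```python
def s_stem(word):
    """Apply the first matching plural-removal rule; otherwise return the word."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

for term in ["ponies", "plates", "cats", "glass", "corpus"]:
    print(term, "->", s_stem(term))
# ponies -> pony, plates -> plate, cats -> cat, glass -> glass, corpus -> corpus
```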

The Porter algorithm
 The Porter algorithm consists of a set of
condition/action rules.
 The conditions fall into three classes
 Conditions on the stem
 Conditions on the suffix
 Conditions on rules

Conditions on the stem
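1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences. Every stem has the form
$$[C](VC)^m[V]$$
where C is a sequence of consonants, V is a sequence of vowels, and square brackets mark optional parts.

Measure   Examples
m=0       TR, EE, TREE, Y, BY
m=1       TROUBLE, OATS, TREES, IVY
m=2       TROUBLES, PRIVATE, OATEN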
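
The measure amounts to counting how often a vowel sequence is followed by a consonant sequence. The Python sketch below is a minimal illustration; it assumes Porter's usual convention that Y counts as a vowel when it follows a consonant, which the slide does not state, and the function names are illustrative.

```python
def is_consonant(stem, i):
    """True if stem[i] is a consonant; 'y' is a vowel only after a consonant."""
    ch = stem[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True

def measure(stem):
    """Count the VC pairs in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        previous_was_vowel = not consonant
    return m

print([measure(w) for w in ["TR", "EE", "TREE", "Y", "BY"]])     # [0, 0, 0, 0, 0]
print([measure(w) for w in ["TROUBLE", "OATS", "TREES", "IVY"]])  # [1, 1, 1, 1]
print([measure(w) for w in ["TROUBLES", "PRIVATE", "OATEN"]])     # [2, 2, 2]
```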

Conditions on the stem (cont'd)
2. *<X>: the stem ends with a given letter X
3. *v*: the stem contains a vowel
4. *d: the stem ends in a double consonant
5. *o: the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
Suffix conditions take the form: (current_suffix == pattern)
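
Conditions 2 to 5 are simple predicates on the stem. A possible Python sketch is shown below; the function names are illustrative, inputs are assumed to be lowercase, and Y is again assumed to be a vowel only when it follows a consonant.

```python
def is_consonant(stem, i):
    ch = stem[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True

def ends_with_letter(stem, letter):   # *<X>
    return stem.endswith(letter)

def contains_vowel(stem):             # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):      # *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):                   # *o
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

print(contains_vowel("tr"), contains_vowel("by"))                 # False True
print(ends_double_consonant("fall"), ends_double_consonant("tree"))  # True False
print(ends_cvc("hop"), ends_cvc("snow"))                          # True False
```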

Conditions on the rules
 The rules are divided into steps. The rules in a step are examined in sequence, and only one rule from a step can apply.
{
    step1a(word);
    step1b(stem);
    if (the second or third rule of step 1b was used)
        step1b1(stem);
    step1c(stem);
    step2(stem);
    step3(stem);
    step4(stem);
    step5a(stem);
    step5b(stem);
}
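
How "only one rule from a step can apply" works in practice can be illustrated with a single step. The Python sketch below uses the four suffix rules of step 1a from Porter's published algorithm (SSES -> SS, IES -> I, SS -> SS, S -> null), which are not listed on the slides; the helper name `apply_step` is hypothetical, and stem conditions are omitted because step 1a has none.

```python
# Step 1a rules in order; longer suffixes come first so they win.
STEP_1A = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def apply_step(word, rules):
    """Apply the first rule whose suffix matches, then stop."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_step(w, STEP_1A))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```

The SS -> SS rule changes nothing by itself; it exists so that words such as "caress" match it and never reach the S -> null rule.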

Experimental Evaluations of stemming

Stemming Studies: Conclusion
 Most reported effects of stemming on retrieval performance have been positive.
 Stemming is as effective as manual conflation.
 The effect of stemming depends on the nature of the vocabulary used.
 There appears to be little difference between the retrieval effectiveness of different full stemmers.

Stemming to compress inverted files
Lennon et al. report compression percentages for various stemmers and databases; the savings in storage can be substantial.
Compression rates for affix removal stemmers also increase as the number of suffixes increases.

Summary
 Stemmers are used to conflate terms in order to improve retrieval effectiveness and/or to reduce the size of indexing files.
 Stemming generally increases recall at the cost of decreased precision.
 Stemming can have a marked effect on the size of indexing files, sometimes decreasing file size by as much as 50 percent.

 Courtesy:
 Information Retrieval and Recommendation Techniques: Midterm Report
 Advisor: Prof. 黃三益
 Students: 9142608 黃哲修
 9142609 張家豪
