Stemming Algorithms
Outline
 Introduction
 Types of stemming algorithms
 Experimental evaluations of stemming
 Stemming to compress inverted files
 Summary
 Appendix
Introduction
 Stemming is a technique for finding morphological
variants of search terms by conflating them to a common stem.
 It is used to improve retrieval effectiveness and to reduce the
size of indexing files.
 Taxonomy for stemming algorithms
Introduction (con’t)
 Criteria for judging stemmers
 Correctness
 Overstemming: too much of a term is removed.
 Understemming: too little of a term is removed.
 Retrieval effectiveness, usually measured with
recall and precision
 Compression performance
 Stemmers can also be judged on their speed,
size, and so on.
Types of stemming algorithms
 Table lookup approach
 Successor Variety
 n-gram stemmers
 Affix Removal Stemmers
Table lookup approach
 Store a table of all index terms and their
stems, so that terms from queries and indexes
can be stemmed very quickly by lookup.
 Problems
 No such table exists for English, and some
terms are domain dependent.
 The table adds storage overhead, though
trading size for time is sometimes warranted.
Successor Variety approach
 Determine word and morpheme boundaries
based on the distribution of phonemes in a
large body of utterances.
 The successor variety of a string is the
number of different characters that follow it in
words in some body of text.
 The successor variety of substrings of a term
will decrease as more characters are added
until a segment boundary is reached.
Successor Variety approach (con’t)
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE

Prefix      Successor Variety   Letters
R           3                   E, I, O
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   (Blank)
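The READABLE table can be reproduced with a short Python sketch. The function name `successor_varieties` is hypothetical, and one counting convention is assumed so that the output matches the table: end-of-word is recorded as "(blank)" only when no letter follows the prefix.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_varieties(word, corpus):
    """Return (prefix, sorted successor letters) for every prefix of word."""
    rows = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct letters that follow this prefix in any corpus word.
        letters = {t[i] for t in corpus if t.startswith(prefix) and len(t) > i}
        # Assumption: end-of-word counts only when nothing else follows.
        if not letters and prefix in corpus:
            letters = {"(blank)"}
        rows.append((prefix, sorted(letters)))
    return rows

for prefix, letters in successor_varieties("READABLE", CORPUS):
    print(f"{prefix:<10}{len(letters):<4}{', '.join(letters)}")
```

The successor variety of a prefix is the length of its letter list.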
Successor Variety approach (con’t)
 cutoff method
 some cutoff value is selected and a boundary
is identified whenever the cutoff value is
reached
 peak and plateau method
 segment break is made after a character
whose successor variety exceeds that of the
characters immediately preceding and
following it
 complete method
Successor Variety approach (con’t)
 entropy method
 $|D_i|$: the number of words in a text body beginning
with the i-length sequence of letters
 $|D_{ij}|$: the number of words in $D_i$ with the successor j
 The probability that a member of $D_i$ has the successor j
is given by $|D_{ij}| / |D_i|$
 The entropy of $D_i$ is
$$H_i = -\sum_{j=1}^{26} \frac{|D_{ij}|}{|D_i|} \log_2 \frac{|D_{ij}|}{|D_i|}$$
Successor Variety approach (con’t)
 Two criteria used to evaluate various
segmentation methods
1. the number of correct segment cuts divided
by the total number of cuts
2. the number of correct segment cuts divided
by the total number of true boundaries
 After segmenting, if the first segment occurs
in more than 12 words in the corpus, it is
probably a prefix.
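The two evaluation criteria correspond to a precision and a recall over segment cuts. A minimal Python sketch follows; the function name `cut_scores` and the example cut positions are hypothetical illustrations, not data from any study.

```python
def cut_scores(predicted_cuts, true_boundaries):
    """Score a segmentation by the two criteria (cut positions as indexes)."""
    predicted, truth = set(predicted_cuts), set(true_boundaries)
    correct = len(predicted & truth)
    # Criterion 1: correct cuts divided by the total number of cuts made.
    precision = correct / len(predicted) if predicted else 0.0
    # Criterion 2: correct cuts divided by the number of true boundaries.
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical example: READABLE cut after positions 4 and 6,
# with one true boundary after position 4 (READ | ABLE).
print(cut_scores([4, 6], [4]))
```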
Successor Variety approach (con’t)
 The successor variety stemming process
has three parts
1. determine the successor varieties for a word
2. segment the word using one of the methods
3. select one of the segments as the stem
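The three parts can be sketched end to end in Python. This is an illustration under stated assumptions, not a reference implementation: it uses the peak and plateau method for segmentation, treats "occurs in" as "is a prefix of", and applies the 12-word threshold from the earlier slide to choose between the first and second segment. All function names are hypothetical.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def variety(prefix, corpus):
    """Part 1: successor variety of one prefix."""
    letters = {t[len(prefix)] for t in corpus
               if t.startswith(prefix) and len(t) > len(prefix)}
    # Assumption: a prefix followed only by end-of-word has variety 1.
    return len(letters) or int(prefix in corpus)

def segment_peak_plateau(word, corpus):
    """Part 2: cut after a character whose variety exceeds both neighbours."""
    sv = [variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

def select_stem(segments, corpus, prefix_threshold=12):
    """Part 3: skip the first segment if it looks like a prefix."""
    first = segments[0]
    occurrences = sum(1 for t in corpus if t.startswith(first))
    if occurrences > prefix_threshold and len(segments) > 1:
        return segments[1]
    return first

segments = segment_peak_plateau("READABLE", CORPUS)
print(segments, select_stem(segments, CORPUS))
```

With the small READABLE corpus, the only peak is after READ, so the word splits into READ and ABLE and READ is selected as the stem.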
n-gram stemmers
 Association measures are calculated between pairs of
terms based on shared unique digrams.
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
 Dice’s coefficient (similarity)
A and B are the numbers of unique digrams in the first
and the second words. C is the number of unique
digrams shared by the two words.
$$S = \frac{2C}{A + B} = \frac{2 \times 6}{7 + 8} = 0.80$$
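The statistics/statistical example can be checked with a few lines of Python. The helper names `unique_digrams` and `dice` are hypothetical.

```python
def unique_digrams(term):
    """Set of distinct adjacent letter pairs in a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a, b = unique_digrams(first), unique_digrams(second)
    if not a and not b:
        return 0.0  # terms too short to have any digrams
    return 2 * len(a & b) / (len(a) + len(b))

print(sorted(unique_digrams("statistics")))
print(sorted(unique_digrams("statistical")))
print(dice("statistics", "statistical"))
```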
n-gram stemmers (con’t)
 Similarity measures are determined for all
pairs of terms in the database, forming a
similarity matrix
 Once such a similarity matrix is available,
terms are clustered using a single link
clustering method (as described in Ch.16)
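A rough Python sketch of the clustering step is given below. It assumes single link clustering is cut at a fixed similarity threshold, which makes the clusters the connected components of the "similar enough" graph; the threshold of 0.5 and the term list are hypothetical choices for illustration.

```python
from itertools import combinations

def dice(first, second):
    """Dice's coefficient over unique digrams."""
    a = {first[i:i + 2] for i in range(len(first) - 1)}
    b = {second[i:i + 2] for i in range(len(second) - 1)}
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def single_link_clusters(terms, threshold):
    """Merge any two terms whose similarity reaches the threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster representative.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())

terms = ["statistics", "statistical", "statistic", "reading", "reader"]
print(single_link_clusters(terms, threshold=0.5))
```

Each resulting cluster plays the role of one stem class.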
Affix Removal Stemmers
 Affix removal algorithms remove suffixes
and/or prefixes from terms, leaving a stem
 Example rules for removing plurals (Harman 1991):
 If a word ends in “ies” but not “eies” or “aies”
Then “ies” -> “y”
 If a word ends in “es” but not “aes”, “ees”, or “oes”
Then “es” -> “e”
 If a word ends in “s” but not “us” or “ss”
Then “s” -> NULL
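The three rules above translate almost directly into Python. This sketch assumes the rules are tried in order and that a word excluded by one rule may still be handled by a later one; the function name `s_stem` is hypothetical.

```python
def s_stem(word):
    """Apply the first matching plural-removal rule to a word."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

for term in ["ponies", "plates", "cats", "glass", "corpus"]:
    print(term, "->", s_stem(term))
```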
The Porter algorithm
 The Porter algorithm consists of a set of
condition/action rules.
 The conditions fall into three classes
 Conditions on the stem
 Conditions on the suffix
 Conditions on the rules
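Conditions on the stem
1. The measure, denoted m, of a stem is based on its
alternating vowel-consonant sequences:
[C](VC)^m[V]
where C is a sequence of consonants, V is a sequence of
vowels, and square brackets mark optional parts.

Measure   Examples
m=0       TR, EE, TREE, Y, BY
m=1       TROUBLE, OATS, TREES, IVY
m=2       TROUBLES, PRIVATE, OATEN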
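The measure can be computed by mapping each letter to c or v and counting vowel-to-consonant transitions. The sketch below assumes Porter's usual convention that Y is a vowel when it follows a consonant; the function names are hypothetical.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's convention."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences m in the form [C](VC)^m[V]."""
    stem = stem.lower()
    pattern = "".join("c" if is_consonant(stem, i) else "v"
                      for i in range(len(stem)))
    # Every v immediately followed by c closes one VC sequence.
    return pattern.count("vc")

for word in ["TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "OATEN"]:
    print(word, measure(word))
```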
Conditions on the stem (con’t)
2. *<X> --- the stem ends with a given letter X
3. *v* --- the stem contains a vowel
4. *d --- the stem ends in a double consonant
5. *o --- the stem ends with a consonant-vowel-consonant
sequence, where the final consonant is not w, x, or y
Suffix conditions take the form: (current_suffix == pattern)
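Conditions 2 to 5 are simple predicates on the stem. A possible Python rendering is sketched here; the function names are hypothetical, and Y is again assumed to be a vowel when it follows a consonant.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (Y after a consonant is a vowel)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def ends_with(stem, letter):
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)

def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """*d: the stem ends in a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):
    """*o: ends consonant-vowel-consonant, final consonant not w, x, or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

print(contains_vowel("tr"), ends_double_consonant("fall"), ends_cvc("hop"))
```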
Conditions on the rules
 The rules are divided into steps. The rules in a step
are examined in sequence, and only one rule from a
step can apply.
{
    step1a(word);
    step1b(stem);
    if (the second or third rule of step 1b was used)
        step1b1(stem);
    step1c(stem);
    step2(stem);
    step3(stem);
    step4(stem);
    step5a(stem);
    step5b(stem);
}
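The "only one rule from a step can apply" behaviour can be illustrated with a single step in Python. The rule list below is step 1a as given in Porter's published algorithm, not in these slides, and the helper name `apply_step` is hypothetical. Step 1a has no stem conditions, so each rule is just a suffix and its replacement.

```python
# Step 1a: (suffix, replacement) pairs, examined in this order.
STEP_1A = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def apply_step(word, rules):
    """Apply the first rule whose suffix matches, then stop."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word  # no rule in this step applies

for term in ["caresses", "ponies", "caress", "cats"]:
    print(term, "->", apply_step(term, STEP_1A))
```

Because the "ss" rule is examined before the "s" rule, a word such as caress is left unchanged instead of losing its final letter.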
Experimental Evaluations of stemming
Stemming Studies: Conclusion
 The majority of stemming’s effects on
retrieval performance have been positive
 Stemming is as effective as manual conflation
 The effect of stemming depends on the
nature of the vocabulary used
 There appears to be little difference between
the retrieval effectiveness of different full
stemmers
Stemming to compress inverted files
Lennon et al. report the following compression
percentages for various stemmers and databases. The
savings in storage can be substantial.
Compression rates also increase for affix removal
stemmers as the number of suffixes they remove increases.
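One common way to express index compression is the percentage reduction in the number of unique index terms after stemming, 100 * (N - S) / N, where N is the number of unique words and S the number of unique stems. The Python sketch below assumes that definition; the toy stemmer and the term list are hypothetical and are not taken from the Lennon et al. data.

```python
def compression_percentage(terms, stem):
    """Percentage reduction in unique index terms after stemming."""
    unique_terms = set(terms)
    if not unique_terms:
        return 0.0
    unique_stems = {stem(t) for t in unique_terms}
    return 100.0 * (len(unique_terms) - len(unique_stems)) / len(unique_terms)

def toy_stem(term):
    """Hypothetical stand-in stemmer: strip one trailing 's'."""
    return term[:-1] if term.endswith("s") else term

terms = ["cat", "cats", "dog", "dogs", "bird"]
print(compression_percentage(terms, toy_stem))
```

Here five unique terms collapse to three stems, a compression of 40 percent.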
Summary
 Stemmers are used to conflate terms to
improve retrieval effectiveness and/or to
reduce the size of indexing files.
 Stemming will increase recall at the cost of
decreased precision.
 Stemming can have a marked effect on the size
of indexing files, sometimes decreasing the
size of the files by as much as 50 percent.
 Courtesy:
 Information Retrieval and Recommendation Techniques: Midterm Report
 Advisor: Prof. 黃三益
 Students: 9142608 黃哲修
 9142609 張家豪
