More on indexing and text operations
Modern Information Retrieval
University of Qom
Z. Imanimehr
Spring 2023
Recall the basic indexing pipeline
Document: Friends, Romans, countrymen.
→ Tokenizer
Token stream: Friends Romans Countrymen
→ Linguistic modules
Modified tokens: friend roman countryman
→ Indexer
Inverted index:
friend → 2, 4
roman → 1, 2
countryman → 13, 16
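The pipeline above can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the `normalize` function is a crude stand-in for the linguistic modules (lowercasing plus stripping a final "s"), and the three sample documents and their IDs are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Split on anything that is not a letter or digit; drop empty strings.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def normalize(token):
    # Crude stand-in for the linguistic modules (assumption, not a real stemmer):
    # lowercase the token and strip one trailing "s" from longer words.
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

def build_index(docs):
    # Map each normalized term to a sorted list of the document IDs containing it.
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in sorted(terms):
            index[term].append(doc_id)
    return dict(index)

# Hypothetical mini-collection keyed by document ID.
docs = {1: "Romans.", 2: "Friends, Romans", 4: "friends"}
index = build_index(docs)
```

Each postings list stays sorted by document ID because documents are processed in ID order.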
Text operations
 Tokenization
 Stop word removal
 Normalization
 Stemming or lemmatization
 Equivalence classes
 Example1: case folding
 Example2: using thesauri (or Soundex) to find equivalence
classes of synonyms and homonyms
Text operations & linguistic
preprocessing
Parsing a document
 What format is it in?
 pdf/word/excel/html?
 What language is it in?
 What character set is in use?
These tasks can be seen as classification problems, which
we will study later in the course.
But these tasks are often done heuristically …
Sec. 2.1
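As an illustration of the heuristic approach, a format detector can look at the first bytes of a file. The sketch below is an assumption about how such a heuristic might look, not a method from the course; the function name and the returned labels are made up for this example. It relies on the well-known signatures `%PDF-` for PDF and `PK\x03\x04` for ZIP containers (which include modern Word, Excel, and PowerPoint files).

```python
def sniff_format(data: bytes) -> str:
    # Inspect leading bytes only; this is a heuristic, not a full parser.
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        return "zip-container"  # e.g. .docx/.xlsx/.pptx or a plain zip file
    # HTML has no fixed signature, so look for a typical opening tag.
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

Language and character-set detection are harder and are usually treated as classification problems, as the slide notes.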
Indexing granularity
 What is a unit document?
 A file? Zip files?
 Whole book or chapters?
 A Powerpoint file or each of its slides?
Tokenization
 Input: “Friends, Romans, Countrymen”
 Output: Tokens
 Friends
 Romans
 Countrymen
 Each such token is now a candidate for an index
entry, after further processing
Sec. 2.2.1
Tokenization
 Issues in tokenization:
 Finland’s capital → Finland? Finlands? Finland’s?
 Hewlett-Packard → Hewlett and Packard as two tokens?
 co-education
 lower-case
 state-of-the-art: break up hyphenated sequence.
 It can be effective to get the user to put in possible hyphens
 San Francisco: one token or two?
 How do you decide it is one token?
Sec. 2.2.1
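The hyphen question is a policy choice that can be made explicit in the tokenizer. The following is a minimal sketch with a hypothetical `split_hyphens` flag; it uses ASCII apostrophes and handles only English letters and digits.

```python
import re

# Keep internal hyphens and apostrophes inside a token.
KEEP_HYPHENS = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")
# Keep apostrophes but break tokens at hyphens.
SPLIT_HYPHENS = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def tokenize(text, split_hyphens=False):
    pattern = SPLIT_HYPHENS if split_hyphens else KEEP_HYPHENS
    return pattern.findall(text)
```

Neither policy resolves "San Francisco": treating it as one token needs a phrase list or a statistical model, which a regular expression cannot provide.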
Tokenization: Numbers
 Examples
 3/12/91 Mar. 12, 1991 12/3/91
 55 B.C.
 B-52
 My PGP key is 324a3df234cb23e
 (800) 234-2333
 Often have embedded spaces
 Older IR systems may not index numbers
 But often very useful
 e.g., looking up error codes/stack traces on the web
 Will often index “meta-data” separately
 Creation date, format, etc.
Sec. 2.2.1
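One way to keep such items intact is to try specific patterns before the generic word pattern. The sketch below is illustrative only: the three patterns are assumptions chosen to cover the examples on this slide, not a general solution for dates or phone numbers.

```python
import re

TOKEN_RE = re.compile(
    r"""
    \(\d{3}\)\s*\d{3}-\d{4}            # US-style phone number with an embedded space
  | \d{1,2}/\d{1,2}/\d{2,4}            # numeric date such as 3/12/91
  | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*     # words, hex strings, and codes like B-52
    """,
    re.VERBOSE,
)

def tokenize(text):
    # Alternatives are tried left to right, so the specific patterns win.
    return TOKEN_RE.findall(text)
```

The pattern does not decide whether 3/12/91 means March 12 or December 3; that ambiguity remains after tokenization.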
Tokenization: Language issues
 French
 L'ensemble: one token or two?
 L ? L’ ? Le ?
 German noun compounds are not segmented
 Lebensversicherungsgesellschaftsangestellter
 ‘life insurance company employee’
 German retrieval systems benefit greatly from a
compound splitter module
 Can give a 15% performance boost for German
Sec. 2.2.1
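A compound splitter can be sketched as a recursive dictionary lookup. This is a toy version under stated assumptions: the lexicon is a hypothetical hand-made set, only the linking element "s" is handled, and the longest matching prefix is tried first. Production splitters typically use large lexicons and corpus statistics.

```python
def split_compound(word, lexicon, linkers=("", "s")):
    """Return lexicon words that concatenate to `word`, or None if no split exists."""
    word = word.lower()
    if not word:
        return []
    # Try the longest prefix first.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        for link in linkers:
            # Allow an optional linking element between the parts.
            if rest.startswith(link):
                tail = split_compound(rest[len(link):], lexicon, linkers)
                if tail is not None:
                    return [head] + tail
    return None

# Hypothetical toy lexicon for the example on this slide.
lexicon = {"leben", "versicherung", "gesellschaft", "angestellter"}
parts = split_compound("Lebensversicherungsgesellschaftsangestellter", lexicon)
```

Indexing the parts alongside the full compound lets a query for one component match the document.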
Tokenization: Language issues
 Chinese and Japanese have no spaces between
words:
 莎拉波娃现在居住在美国东南部的佛罗里达。 (“Sharapova now lives in Florida, in the southeastern United States.”)
 Not always guaranteed a unique tokenization
 Japanese is further complicated by multiple intermingled scripts
 Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(scripts mixed in this example: katakana, hiragana, kanji, romaji)
The end user can express a query entirely in hiragana!
Sec. 2.2.1
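When no word segmenter is available, one common workaround is to index overlapping character n-grams instead of words. The sketch below shows character bigrams; the function name is hypothetical and the choice of n = 2 is an assumption for illustration.

```python
def char_ngrams(text, n=2):
    # Ignore whitespace, then slide a window of n characters over the text.
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

bigrams = char_ngrams("佛罗里达")
```

The same function is applied to queries, so a query and a document match whenever they share bigrams, without committing to a single segmentation.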
Stop words
 Stop list: exclude from dictionary the commonest words.
 They have little semantic content: ‘the’, ‘a’, ‘and’, ‘to’, ‘be’
 There are a lot of them: ~30% of postings for top 30 words
 But the trend is away from doing this:
 Good compression techniques (IIR, Chapter 5)
 the space for including stopwords in a system is very small
 Good query optimization techniques (IIR, Chapter 7)
 pay little at query time for including stop words.
 You need them for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 Relational queries: “flights to London”
Sec. 2.2.2
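The cost of a stop list is easy to see on the queries listed above. In this sketch the stop list is a small hypothetical set: the first five words come from the slide, and the remaining four are added for illustration.

```python
# 'the', 'a', 'and', 'to', 'be' are from the slide; the rest are assumed additions.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "or", "not", "it"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare case-insensitively but return the original tokens.
    return [t for t in tokens if t.lower() not in stop_words]

hamlet = remove_stop_words("To be or not to be".split())
flights = remove_stop_words("flights to London".split())
```

The first query disappears entirely, and the second loses the word that distinguishes flights to London from flights from London.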
Normalization to terms
 Normalize words in indexed text (also query)
 U.S.A. USA
 Term is a (normalized) word type, which is an entry in our IR
system dictionary
 We most commonly implicitly define equivalence classes of
terms by, e.g.,
 deleting periods to form a term
 U.S.A., USA → USA
 deleting hyphens to form a term
 anti-discriminatory, antidiscriminatory → antidiscriminatory
 Crucial: Need to “normalize” indexed text as well as query terms
into the same form
Sec. 2.2.3
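The two deletion rules above amount to a single function that must be applied on both sides. This is a minimal sketch; the function name is hypothetical, and a real system would combine it with other normalization steps.

```python
def normalize(token):
    # Delete periods and hyphens so that variant spellings share one term.
    return token.replace(".", "").replace("-", "")

# Index side and query side go through the same function.
indexed_term = normalize("U.S.A.")
query_term = normalize("USA")
```

If the query were normalized differently from the indexed text, the two forms would land on different dictionary entries and the match would be lost.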
Normalization to terms
 Do we handle synonyms and homonyms?
 E.g., by hand-constructed equivalence classes
 car = automobile color = colour
 We can rewrite to form equivalence-class terms
 When the doc contains automobile, index it under car-
automobile (and/or vice-versa)
 Or we can expand a query
 When the query contains automobile, look under car as
well
Alternative to creating equivalence classes
Query expansion instead of normalization
 An alternative to equivalence classing is to do
asymmetric expansion of query
 An example of where this may be useful
 Enter: window Search: window, windows
 Enter: windows Search: Windows, windows
 Potentially more powerful, but less efficient
Sec. 2.2.3
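Asymmetric expansion can be sketched as a lookup table from the term the user enters to the terms actually searched. The table below contains only the two rows from this slide; the dictionary and function names are hypothetical.

```python
# What the user enters -> which index terms are searched.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
}

def expand_query(term):
    # Terms without an entry are searched as typed.
    return EXPANSIONS.get(term, [term])
```

Unlike an equivalence class, the mapping need not be symmetric: each entered form gets its own list, at the price of searching several postings lists per query term.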
Normalization: Case folding
 Reduce all letters to lower case
 exception: upper case in mid-sentence?
 e.g., General Motors
 Fed vs. fed
 SAIL vs. sail
 Often best to lower case everything, since users will use
lowercase regardless of ‘correct’ capitalization…
 Google example: Query C.A.T.
 #1 result was for “cat” not Caterpillar Inc.
Sec. 2.2.3
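The mid-sentence exception can be approximated with a simple heuristic: lowercase a token at the start of a sentence, and keep capitalization elsewhere. The sketch below assumes the input is already tokenized with sentence-final punctuation as separate tokens; it is an illustration of the exception, not a recommendation over lowercasing everything.

```python
SENTENCE_END = {".", "!", "?"}

def fold_case(tokens):
    out = []
    sentence_start = True
    for tok in tokens:
        if sentence_start or not tok[:1].isupper():
            # Sentence-initial capitals carry no information, so fold them.
            out.append(tok.lower())
        else:
            # Mid-sentence capitals are kept (e.g. Fed, General Motors).
            out.append(tok)
        sentence_start = tok in SENTENCE_END
    return out

folded = fold_case(["The", "Fed", "raised", "rates", ".", "Rates", "fed", "growth"])
```

The heuristic only helps if users also capitalize their queries, which, as noted above, they often do not.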
Normalization: Stemming and lemmatization
 For grammatical reasons, docs may use different
forms of a word
 Example: organize, organizes, and organizing
 There are families of derivationally related words
with similar meanings
 Example: democracy, democratic, and democratization
Lemmatization
 Reduce inflectional/variant forms to their base form,
e.g.,
 am, are, is → be
 car, cars, car's, cars' → car
 the boy's cars are different colors → the boy car be different color
 Lemmatization implies doing “proper” reduction to
dictionary headword form
 It needs a complete vocabulary and morphological
analysis to correctly lemmatize words
Sec. 2.2.4
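At its simplest, lemmatization is a lookup from surface form to headword. The table below is a hypothetical toy covering only the examples on this slide; a real lemmatizer needs a full vocabulary and morphological analysis, as stated above.

```python
# Toy lemma table built from the slide's examples (assumption: ASCII apostrophes).
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(tokens):
    # Unknown forms are passed through in lowercase.
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

lemmas = lemmatize("the boy's cars are different colors".split())
```

The table makes the knowledge requirement visible: every irregular form such as "am" or "are" has to be listed, because no suffix rule maps it to "be".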
Stemming
 Reduce terms to their “roots” before indexing
 Stemmers use language-specific rules, but they require
less knowledge than a lemmatizer
 Stemming: crude affix chopping
 The exact stemmed form does not matter
 only the resulting equivalence classes matter.
Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept as equival to compress
Sec. 2.2.4
Porter’s algorithm
 Commonest algorithm for stemming English
 Results suggest it’s at least as good as other stemming
options
 Conventions + 5 phases of reductions
 phases applied sequentially
 each phase consists of a set of commands
Sec. 2.2.4
Porter’s algorithm: Typical rules
 sses → ss
 ies → i
 ational → ate
 tional → tion
 Rules sensitive to the measure of words
 (m>1) EMENT → (empty string)
 replacement → replac
 cement → cement
Sec. 2.2.4
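The measure m counts vowel-consonant sequences in a stem written as [C](VC)^m[V]. The sketch below computes m and applies only the five rules listed on this slide as one flat list; the real algorithm runs many more rules in five sequential phases. The (m>0) conditions on the ational and tional rules come from Porter's original description, not from the slide.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    # Count vowel-to-consonant transitions, i.e. the number of VC sequences.
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

# (suffix, replacement, measure the remaining stem must exceed); -1 means no condition.
RULES = [
    ("sses", "ss", -1),
    ("ies", "i", -1),
    ("ational", "ate", 0),
    ("tional", "tion", 0),
    ("ement", "", 1),
]

def apply_first_matching_rule(word):
    for suffix, replacement, min_m in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > min_m:
                return stem + replacement
            return word  # the matching suffix failed its condition
    return word
```

For "replacement" the stem "replac" has m = 2, so the suffix is removed; for "cement" the stem "c" has m = 0, so the word is left unchanged.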
Do stemming and other normalizations
help?
 English: very mixed results. Helps recall but harms
precision
 Example of harmful stemming:
 operative (dentistry) ⇒ oper
 operational (research) ⇒ oper
 operating (systems) ⇒ oper
 Definitely useful for Spanish, German, Finnish, …
 30% performance gains for Finnish!
Lemmatization vs. Stemming
 Lemmatization produces at most very modest
benefits for retrieval.
 Either form of normalization tends not to improve
English information retrieval performance in
aggregate
 The situation is different for languages with much
more morphology (such as Spanish, German, and
Finnish).
 quite large gains from the use of stemmers
Normalization: Other languages
 Accents: e.g., French résumé vs. resume.
 Umlauts: e.g., German: Tuebingen vs. Tübingen
 Should be equivalent
 Most important criterion:
 How are your users likely to write their queries for these words?
 Users often may not type accents even in languages that
standardly have accents
 Often best to normalize to a de-accented term
 Tuebingen, Tübingen, Tubingen → Tubingen
 For foreign names, the spelling may be unclear or there
may be variant transliteration standards giving different
spellings
 Soundex forms equivalence classes of words based on phonetic
heuristics
Sec. 2.2.3
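De-accenting can be sketched with Python's standard `unicodedata` module: decompose each character into a base letter plus combining marks, then drop the marks. The function name is hypothetical.

```python
import unicodedata

def strip_accents(text):
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and non-zero for combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

This maps Tübingen to Tubingen but leaves the transliteration Tuebingen unchanged; folding "ue" to "u" would need a separate, language-specific rule.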
Resources
 IIR 2
 MIR 9.2
 Porter’s stemmer:
 http://www.tartarus.org/~martin/PorterStemmer/

Editor's Notes

  1. One answer is using n-grams: Lecture 3
  2. Nevertheless: “Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results.” (Though you can explicitly ask for them to remain.)
  3. Why not the reverse?
  4. Class of heuristics to expand a query into phonetic equivalents. Language specific (mainly for names). E.g., chebyshev → tchebycheff
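Soundex is one such phonetic heuristic. The sketch below follows the simplified textbook variant: keep the first letter, map the remaining letters to digits, collapse runs of identical digits, drop zeros, and pad to four characters. It is an assumption-laden illustration; standard American Soundex treats H and W differently from vowels, so some codes differ from this version.

```python
# Letter groups and their digits; vowels and H, W, Y map to the placeholder "0".
CODES = {}
for letters, digit in (
    ("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
    ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    letters = [c for c in name.upper() if c in CODES]
    if not letters:
        return ""
    digits = [CODES[c] for c in letters]
    # Collapse runs of identical digits into one.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # The first letter is kept as a letter; drop its digit and all zeros.
    tail = [d for d in collapsed[1:] if d != "0"]
    return (letters[0] + "".join(tail) + "000")[:4]
```

Because the first letter is retained, this code places Herman and Hermann in one class but would not merge spellings that differ in their initial letter.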