More on indexing and text operations
Modern Information Retrieval
University of Qom
Z. Imanimehr
Spring 2023
Recall the basic indexing pipeline
Document: Friends, Romans, countrymen.
  → Tokenizer
Token stream: Friends Romans Countrymen
  → Linguistic modules
Modified tokens: friend roman countryman
  → Indexer
Inverted index:
  friend → 2, 4
  roman → 1, 2
  countryman → 13, 16
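The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the `IRREGULAR` table, the plural-stripping rule, and the sample documents are assumptions made only so that the example runs end to end.

```python
import re
from collections import defaultdict

# Hypothetical lookup for irregular plurals (an assumption for this sketch).
IRREGULAR = {"countrymen": "countryman"}

def tokenize(text):
    # Tokenizer: split on anything that is not a letter or digit.
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    # Stand-in for the "linguistic modules": lowercase, then reduce plurals.
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def build_index(docs):
    # Indexer: map each modified token to the sorted list of doc IDs.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "Romans everywhere", 2: "Friends, Romans, countrymen.", 4: "a friend"}
index = build_index(docs)
print(index["friend"])  # [2, 4]
print(index["roman"])   # [1, 2]
```

Each stage is a separate function so that any one of them (for example the normalizer) can be swapped without touching the others.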
Text operations
• Tokenization
• Stop word removal
• Normalization
• Stemming or lemmatization
• Equivalence classes
  - Example 1: case folding
  - Example 2: using thesauri (or Soundex) to find equivalence classes of synonyms and homonyms
Text operations & linguistic preprocessing
Parsing a document
• What format is it in?
  - pdf/word/excel/html?
• What language is it in?
• What character set is in use?
These tasks can be seen as classification problems, which we will study later in the course. In practice, though, they are often done heuristically.
Sec. 2.1
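One common heuristic for the format question is to look at the first bytes of the file. The sketch below is only an illustration: the function name `guess_format` and the returned labels are made up here, and a real system would check many more signatures.

```python
def guess_format(data: bytes) -> str:
    # PDF files begin with the marker "%PDF-".
    if data.startswith(b"%PDF-"):
        return "pdf"
    # ZIP containers (including .docx and .xlsx) begin with "PK\x03\x04".
    if data.startswith(b"PK\x03\x04"):
        return "zip-based"
    # HTML is guessed from a leading doctype or <html> tag.
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"

print(guess_format(b"%PDF-1.7 ..."))            # pdf
print(guess_format(b"  <html><body>hi</body>")) # html
```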
Indexing granularity
• What is a unit document?
  - A file? Zip files?
  - Whole book or chapters?
  - A PowerPoint file or each of its slides?
Tokenization
• Input: “Friends, Romans, Countrymen”
• Output: tokens
  - Friends
  - Romans
  - Countrymen
• Each such token is now a candidate for an index entry, after further processing
Sec. 2.2.1
Tokenization
• Issues in tokenization:
  - Finland’s capital → Finland? Finlands? Finland’s?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - co-education
    - lower-case
    - state-of-the-art: break up the hyphenated sequence?
    - It can be effective to get the user to put in possible hyphens
  - San Francisco: one token or two?
    - How do you decide it is one token?
Sec. 2.2.1
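The choices above can be made explicit as tokenizer options. The following is a minimal sketch under stated assumptions: the regular expression and the `split_hyphens` flag are illustrative choices, not a rule prescribed by the slides.

```python
import re

# One token = alphanumeric runs optionally joined by an apostrophe or hyphen.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:['’\-][A-Za-z0-9]+)*")

def tokenize(text, split_hyphens=False):
    tokens = TOKEN.findall(text)
    if split_hyphens:
        # Break hyphenated sequences into their parts.
        tokens = [part for tok in tokens for part in tok.split("-")]
    return tokens

print(tokenize("Finland’s capital"))
# ['Finland’s', 'capital']
print(tokenize("Hewlett-Packard is state-of-the-art", split_hyphens=True))
# ['Hewlett', 'Packard', 'is', 'state', 'of', 'the', 'art']
print(tokenize("San Francisco"))
# ['San', 'Francisco']
```

Note that whitespace splitting always yields two tokens for San Francisco; treating it as one token needs extra knowledge, such as a phrase list.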
Tokenization: Numbers
• Examples
  - 3/12/91, Mar. 12, 1991, 12/3/91
  - 55 B.C.
  - B-52
  - My PGP key is 324a3df234cb23e
  - (800) 234-2333
• Often have embedded spaces
• Older IR systems may not index numbers
  - But numbers are often very useful, e.g., looking up error codes/stack traces on the web
• Will often index “meta-data” separately
  - Creation date, format, etc.
Sec. 2.2.1
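One way to keep such items intact is to try specific patterns before a generic fallback. The patterns below are assumptions chosen to cover a few of the slide's examples (phone number, numeric date, B-52); they are not a complete or standard rule set.

```python
import re

PATTERN = re.compile(r"""
    \(\d{3}\)\s*\d{3}-\d{4}    # phone number with area code in parentheses
  | \d{1,2}/\d{1,2}/\d{2,4}    # numeric date with slashes
  | [A-Za-z]+-\d+              # letters, hyphen, digits (model codes)
  | [A-Za-z0-9]+               # fallback: plain alphanumeric run
""", re.VERBOSE)

def tokenize(text):
    # Alternatives are tried left to right, so specific patterns win.
    return PATTERN.findall(text)

print(tokenize("Call (800) 234-2333 about the B-52 on 3/12/91"))
# ['Call', '(800) 234-2333', 'about', 'the', 'B-52', 'on', '3/12/91']
```

The phone number keeps its embedded space and stays a single token, which a plain whitespace split would not achieve.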
Tokenization: Language issues
• French
  - L'ensemble: one token or two?
    - L ? L’ ? Le ?
• German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - ‘life insurance company employee’
  - German retrieval systems benefit greatly from a compound splitter module
  - Can give a 15% performance boost for German
Sec. 2.2.1
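A compound splitter can be sketched as a dictionary-driven recursive search. This is an illustrative approach, not necessarily what production German systems use: the tiny `LEXICON` and the set of linking elements (`""` and `"s"`) are assumptions.

```python
# Hypothetical mini-lexicon of known German word stems.
LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter"}

def split_compound(word, lexicon, linkers=("", "s")):
    """Return lexicon parts covering `word`, or None if no split exists."""
    word = word.lower()
    if not word:
        return []
    # Prefer longer heads first.
    for end in range(len(word), 0, -1):
        head, rest = word[:end], word[end:]
        if head not in lexicon:
            continue
        # Allow an optional linking element (e.g. "s") after the head.
        for link in linkers:
            if rest.startswith(link):
                tail = split_compound(rest[len(link):], lexicon, linkers)
                if tail is not None:
                    return [head] + tail
    return None

print(split_compound("Lebensversicherungsgesellschaftsangestellter", LEXICON))
# ['leben', 'versicherung', 'gesellschaft', 'angestellter']
```

Indexing the parts alongside the full compound lets a query for Versicherung match the long word.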
Tokenization: Language issues
• Chinese and Japanese have no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - A unique tokenization is not always guaranteed
• Japanese needs further, more complicated normalization because multiple writing systems are intermingled
  - Dates/amounts in multiple formats
  - フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  - This sentence mixes Katakana, Hiragana, Kanji, and Romaji
• The end user can express a query entirely in hiragana!
Sec. 2.2.1
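When word boundaries are ambiguous, one widely used workaround is to index overlapping character n-grams instead of words. The sketch below assumes bigrams and a small hand-picked punctuation set; both are illustrative choices.

```python
# Punctuation to drop before forming n-grams (an assumed, incomplete set).
PUNCTUATION = "。，、"

def char_ngrams(text, n=2):
    # Keep only non-space, non-punctuation characters.
    chars = [c for c in text if not c.isspace() and c not in PUNCTUATION]
    # Slide a window of width n over the character sequence.
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

print(char_ngrams("佛罗里达。"))
# ['佛罗', '罗里', '里达']
```

The same function is applied to queries, so a query and a document match when they share n-grams, with no segmentation decision required.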
Stop words
• Stop list: exclude the commonest words from the dictionary
  - They have little semantic content: ‘the’, ‘a’, ‘and’, ‘to’, ‘be’
  - There are a lot of them: ~30% of postings for the top 30 words
• But the trend is away from doing this:
  - Good compression techniques (IIR, Chapter 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques (IIR, Chapter 7) mean you pay little at query time for including stop words
• You need them for:
  - Phrase queries: “King of Denmark”
  - Various song titles, etc.: “Let it be”, “To be or not to be”
  - Relational queries: “flights to London”
Sec. 2.2.2
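A short sketch shows why stop lists hurt the queries listed above. The `STOP_WORDS` set here is a small illustrative list, not a standard one.

```python
# Illustrative stop list (an assumption for this example).
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "it", "or", "not"}

def remove_stop_words(tokens):
    # Drop every token whose lowercase form is on the stop list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("King of Denmark".split()))     # ['King', 'Denmark']
print(remove_stop_words("flights to London".split()))   # ['flights', 'London']
print(remove_stop_words("To be or not to be".split()))  # []
```

The last query disappears entirely, and "flights to London" can no longer be distinguished from "flights from London".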
Normalization to terms
• Normalize words in indexed text (and also in the query)
  - e.g., U.S.A. and USA should match
• A term is a (normalized) word type, which is an entry in our IR system dictionary
• We most commonly define equivalence classes of terms implicitly, e.g., by
  - deleting periods to form a term
    - U.S.A., USA → USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory → antidiscriminatory
• Crucial: indexed text and query terms must be “normalized” into the same form
Sec. 2.2.3
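The two deletion rules above amount to a single normalization function that is applied to both indexed tokens and query tokens. The function name is an assumption for this sketch.

```python
def normalize(token):
    # Delete periods and hyphens so that variants collapse to one term.
    return token.replace(".", "").replace("-", "")

# The same function must run at index time and at query time.
indexed = {normalize(t) for t in ["U.S.A.", "anti-discriminatory"]}
print(normalize("USA") in indexed)                 # True
print(normalize("antidiscriminatory") in indexed)  # True
```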
Normalization to terms
• Do we handle synonyms and homonyms?
  - E.g., by hand-constructed equivalence classes
    - car = automobile, color = colour
  - We can rewrite to form equivalence-class terms
    - When the doc contains automobile, index it under car-automobile (and/or vice-versa)
  - Or we can expand a query (an alternative to creating equivalence classes)
    - When the query contains automobile, look under car as well
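Index-time equivalence classing can be sketched as a rewrite table applied to both documents and queries. The table `EQUIV` and the toy postings are assumptions for illustration; only the car-automobile class comes from the slide.

```python
from collections import defaultdict

# Hand-constructed equivalence classes: member -> class term.
EQUIV = {"car": "car-automobile", "automobile": "car-automobile"}

def index_term(term):
    # Rewrite a term to its class term, or keep it unchanged.
    return EQUIV.get(term, term)

index = defaultdict(set)
for doc_id, term in [(1, "car"), (2, "automobile"), (3, "truck")]:
    index[index_term(term)].add(doc_id)

def lookup(query_term):
    # The query goes through the same rewrite as the documents.
    return sorted(index.get(index_term(query_term), set()))

print(lookup("automobile"))  # [1, 2]
print(lookup("car"))         # [1, 2]
```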
Query expansion instead of normalization
• An alternative to equivalence classing is to do asymmetric expansion of the query
• An example of where this may be useful
  - Enter: window → Search: window, windows
  - Enter: windows → Search: Windows, windows
• Potentially more powerful, but less efficient
Sec. 2.2.3
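Asymmetric expansion can be sketched as a per-term lookup table. The two entries mirror the slide's example; the function name and the fallback of returning the term itself are assumptions.

```python
# Expansion table: what to search for, given what the user entered.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
}

def expand_query(term):
    # Terms without an entry are searched as-is.
    return EXPANSIONS.get(term, [term])

print(expand_query("window"))   # ['window', 'windows']
print(expand_query("windows"))  # ['Windows', 'windows']
```

The table is asymmetric: entering window also searches windows, but entering windows does not search window. The cost is that each query may touch several postings lists, which is the efficiency penalty noted above.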
Normalization: Case folding
• Reduce all letters to lower case
  - Exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lowercase everything, since users will use lowercase regardless of ‘correct’ capitalization
• Google example: query C.A.T.
  - #1 result was for “cat”, not Caterpillar Inc.
Sec. 2.2.3
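The mid-sentence exception can be sketched as a simple heuristic: lowercase everything by default, or optionally keep capitals that do not start a sentence. The flag name and the sentence-end markers are assumptions, and tokens are assumed to be non-empty strings.

```python
SENTENCE_END = {".", "!", "?"}

def fold_case(tokens, keep_mid_sentence_caps=False):
    folded, sentence_start = [], True
    for tok in tokens:
        if tok in SENTENCE_END:
            folded.append(tok)
            sentence_start = True
            continue
        # Optionally keep capitalized tokens that are not sentence-initial.
        if keep_mid_sentence_caps and not sentence_start and tok[0].isupper():
            folded.append(tok)
        else:
            folded.append(tok.lower())
        sentence_start = False
    return folded

tokens = ["The", "Fed", "raised", "rates", ".", "They", "fed", "the", "cat"]
print(fold_case(tokens))
# ['the', 'fed', 'raised', 'rates', '.', 'they', 'fed', 'the', 'cat']
print(fold_case(tokens, keep_mid_sentence_caps=True))
# ['the', 'Fed', 'raised', 'rates', '.', 'they', 'fed', 'the', 'cat']
```

With the flag on, Fed and fed remain distinct terms, but a user who types the query in lowercase will no longer match Fed.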
Normalization: Stemming and lemmatization
• For grammatical reasons, docs may use different forms of a word
  - Example: organize, organizes, and organizing
• There are families of derivationally related words with similar meanings
  - Example: democracy, democratic, and democratization
Lemmatization
• Reduce inflectional/variant forms to their base form, e.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
  - the boy's cars are different colors → the boy car be different color
• Lemmatization implies doing “proper” reduction to dictionary headword form
  - It needs a complete vocabulary and morphological analysis to correctly lemmatize words
Sec. 2.2.4
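A toy lemmatizer makes the dependence on a vocabulary concrete. The `LEMMAS` table below is a hypothetical stand-in for a full dictionary with morphological analysis; it covers only the words in the slide's example.

```python
# Hypothetical headword table (a real lemmatizer needs a full vocabulary).
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "colors": "color"}

def lemmatize(token):
    token = token.lower()
    # Strip possessive endings: boy's -> boy, cars' -> cars.
    if token.endswith("'s"):
        token = token[:-2]
    elif token.endswith("'"):
        token = token[:-1]
    # Look up the headword; unknown words are returned unchanged.
    return LEMMAS.get(token, token)

sentence = "the boy's cars are different colors"
print(" ".join(lemmatize(t) for t in sentence.split()))
# the boy car be different color
```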
Stemming
• Reduce terms to their “roots” before indexing
• Stemmers use language-specific rules, but they require less knowledge than a lemmatizer
• Stemming: crude affix chopping
• The exact stemmed form does not matter
  - Only the resulting equivalence classes play a role
• Example
  - Original: for example compressed and compression are both accepted as equivalent to compress.
  - Stemmed: for exampl compress and compress ar both accept as equival to compress
Sec. 2.2.4
Porter’s algorithm
• Commonest algorithm for stemming English
  - Results suggest it’s at least as good as other stemming options
• Conventions + 5 phases of reductions
  - Phases are applied sequentially
  - Each phase consists of a set of commands
Sec. 2.2.4
Porter’s algorithm: Typical rules
• sses → ss
• ies → i
• ational → ate
• tional → tion
• Rules sensitive to the measure of words
  - (m>1) EMENT → (suffix removed)
    - replacement → replac
    - cement → cement
Sec. 2.2.4
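The rules above, including the measure condition, can be sketched as follows. This is only a fragment for illustration, not the full Porter stemmer: it applies a single matching rule rather than five sequential phases, and the `(m>0)` condition on the `ational` and `tional` rules is taken from Porter's published rule list rather than from the slide.

```python
import re

def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" counts as a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    # m is the number of VC sequences in the form [C](VC)^m[V].
    shape = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return len(re.findall(r"v+c+", shape))

# (suffix, replacement, measure the remaining stem must exceed)
RULES = [
    ("sses", "ss", -1),
    ("ies", "i", -1),
    ("ational", "ate", 0),
    ("tional", "tion", 0),
    ("ement", "", 1),
]

def apply_rules(word):
    for suffix, replacement, min_m in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The first matching suffix decides; if its condition fails, stop.
            return stem + replacement if measure(stem) > min_m else word
    return word

for w in ["caresses", "ponies", "relational", "conditional", "replacement", "cement"]:
    print(w, "->", apply_rules(w))
```

For replacement the remaining stem replac has m = 2, so the suffix is removed; for cement the remaining stem c has m = 0, so the word is left alone.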
Do stemming and other normalizations help?
• English: very mixed results. Helps recall but harms precision
  - Example of harmful stemming:
    - operative (dentistry) ⇒ oper
    - operational (research) ⇒ oper
    - operating (systems) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
  - 30% performance gains for Finnish!
Lemmatization vs. Stemming
• Lemmatization produces at most very modest benefits for retrieval
• Either form of normalization tends not to improve English information retrieval performance in aggregate
• The situation is different for languages with much richer morphology (such as Spanish, German, and Finnish)
  - Quite large gains from the use of stemmers
Normalization: Other languages
• Accents: e.g., French résumé vs. resume
• Umlauts: e.g., German Tuebingen vs. Tübingen
  - Should be equivalent
• Most important criterion:
  - How are your users likely to write their queries for these words?
• Users often may not type accents even in languages that standardly have accents
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen → Tubingen
• For foreign names, the spelling may be unclear, or there may be variant transliteration standards giving different spellings
  - Soundex forms equivalence classes of words based on phonetic heuristics
Sec. 2.2.3
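De-accenting can be sketched with Python's standard `unicodedata` module: decompose each character and drop the combining marks. The extra `ue` → `u` rule in `normalize_german` is a deliberately crude assumption added only to reproduce the Tuebingen example; it would damage other words that contain "ue".

```python
import unicodedata

def strip_accents(text):
    # NFD splits "ü" into "u" plus a combining diaeresis, which is then dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_german(text):
    # Crude, hypothetical rule: also fold the "ue" spelling of "ü".
    return strip_accents(text).replace("ue", "u")

print(strip_accents("résumé"))  # resume
print({normalize_german(w) for w in ["Tuebingen", "Tübingen", "Tubingen"]})
# {'Tubingen'}
```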
Resources
• IIR 2
• MIR 9.2
• Porter’s stemmer: http://www.tartarus.org/~martin/PorterStemmer/


Editor's Notes

  1. One answer is using n-grams: Lecture 3
  2. Nevertheless: “Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results.” (Though you can explicitly ask for them to remain.)
  3. Why not the reverse?
  4. Class of heuristics to expand a query into phonetic equivalents. Language specific (mainly for names). E.g., chebyshev → tchebycheff
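As one concrete phonetic heuristic, a classic four-character Soundex code can be sketched as below. This follows the textbook variant in which H, W, and Y are treated like vowels; other variants differ in that detail. Because Soundex keeps the first letter, it would not by itself merge spellings that start with different letters.

```python
# Letter groups and their digits; "0" marks letters that are later dropped.
GROUPS = [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
          ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
CODES = {ch: digit for letters, digit in GROUPS for ch in letters}

def soundex(name):
    # Keep only letters that have a code (ASCII A-Z after uppercasing).
    letters = [c for c in name.upper() if c in CODES]
    if not letters:
        return ""
    digits = [CODES[c] for c in letters]
    # Collapse runs of identical adjacent digits.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # The first letter is kept as a letter; zeros are removed from the rest.
    tail = [d for d in collapsed[1:] if d != "0"]
    # Pad with zeros and keep four positions.
    return (letters[0] + "".join(tail) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Hermann"))                    # H655
```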