Lecture 03
Information Retrieval
Boolean Retrieval Model
Processing Boolean queries
 To process a simple conjunctive query such as “Brutus AND
Calpurnia” using an inverted index and the basic Boolean retrieval
model, we follow these steps:
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists
Boolean Retrieval Model
Processing Boolean queries
 The intersection operation is the crucial one: we need to efficiently
intersect postings lists so as to be able to quickly find documents that
contain both terms.
 This operation is sometimes referred to as merging postings lists.
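A minimal sketch of the merge-style intersection, assuming each postings list is a Python list of docIDs sorted in ascending order. The docIDs below are invented for illustration. Every loop iteration advances at least one pointer, so two lists of lengths x and y need at most x + y steps.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by ascending docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


# Hypothetical postings lists (docIDs are illustrative only).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```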
Boolean Retrieval Model
Processing Boolean queries
 If the lengths of the postings lists are x and y, the
intersection takes O(x + y) operations.
 Processing more complex queries? Example:
 (Brutus OR Caesar) AND NOT Calpurnia
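One possible way to evaluate (Brutus OR Caesar) AND NOT Calpurnia is to add two more linear merges: a union for OR and a difference for AND NOT. This is only a sketch under the assumption that postings are sorted lists of docIDs; the function names and docIDs are hypothetical.

```python
def union(p1, p2):
    """OR: merge two sorted postings lists, keeping each docID once."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    answer.extend(p1[i:])  # at most one of these two tails is non-empty
    answer.extend(p2[j:])
    return answer


def and_not(p1, p2):
    """AND NOT: keep the docIDs of p1 that do not occur in p2."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            j += 1
    answer.extend(p1[i:])
    return answer


# Hypothetical postings lists (docIDs are illustrative only).
brutus = [1, 2, 4, 11, 31]
caesar = [1, 2, 4, 5, 6, 16, 57]
calpurnia = [2, 31, 54]
result = and_not(union(brutus, caesar), calpurnia)
print(result)  # [1, 4, 5, 6, 11, 16, 57]
```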
Boolean Retrieval Model
Processing Boolean queries
 Query optimization is the process of selecting how to
organize the work of answering a query so that the least
total amount of work needs to be done by the system.
 Brutus AND Caesar AND Calpurnia
Boolean Retrieval Model
Processing Boolean queries
Brutus AND Caesar AND Calpurnia
 A major element is the order in which postings lists are
accessed.
 What is the best order for query processing?
(Calpurnia AND Brutus) AND Caesar
Boolean Retrieval Model
Processing Boolean queries
 If we start by intersecting the two smallest postings lists,
then all intermediate results must be no bigger than the
smallest postings list, and we are therefore likely to do the
least amount of total work.
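The ordering heuristic can be sketched as follows, assuming the index is a plain dict from term to a sorted list of docIDs. The index contents and the name `intersect_terms` are made up for this example. Terms are processed in order of increasing postings-list length, so the running result never exceeds the shortest list.

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(index, terms):
    """AND together all terms, starting with the shortest postings list."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:  # nothing left to match, stop early
            break
        result = intersect(result, index.get(term, []))
    return result


# Hypothetical index (docIDs are illustrative only).
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(index, ["brutus", "caesar", "calpurnia"]))  # [2]
```

With these lists, calpurnia is processed first, which corresponds to the order (Calpurnia AND Brutus) AND Caesar.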
The term vocabulary and postings lists
Choosing a Document Unit
• Question: What is the document unit that should be
used for indexing?
Email messages:
• Text message
• Attachment (.doc file / .rar file)
Collection of books:
• Individual books (entire book as a unit)
• Each chapter as a unit
• Individual sentences
[Diagram labels: Precision, Recall]
The term vocabulary and postings lists
Determining the vocabulary of terms
Recall the major steps in inverted index construction:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
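The four steps can be illustrated with a toy indexer. This is a rough sketch, not a production design: the two documents are invented, tokenization is a naive whitespace split, and the only linguistic preprocessing is lowercasing plus stripping edge punctuation.

```python
from collections import defaultdict


def build_index(docs):
    """Build a term -> sorted list of docIDs mapping from {doc_id: text}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                      # step 1: the collection
        tokens = docs[doc_id].split()                # step 2: tokenize
        terms = {tok.lower().strip(".,;:!?") for tok in tokens}  # step 3
        for term in terms:                           # step 4: index
            if term:
                index[term].append(doc_id)
    return dict(index)


# Hypothetical two-document collection.
docs = {
    1: "Brutus killed Caesar.",
    2: "Caesar praised Calpurnia.",
}
index = build_index(docs)
print(index["caesar"])  # [1, 2]
```

Because documents are visited in ascending docID order, every postings list comes out sorted, which is what merge-based intersection relies on.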
• Tokenization: the process of chopping character streams into
tokens, throwing away certain characters.
• Linguistic preprocessing: deals with building equivalence classes
of tokens, which are the set of terms that are indexed.
The term vocabulary and postings lists
Determining the vocabulary of terms
 Token, type, or term?
A token is an instance of a sequence of characters in some
particular document that are grouped together as a useful
semantic unit for processing.
A type is the class of all tokens containing the same character
sequence.
A term is a type that is included in the IR system’s dictionary (a
The term vocabulary and postings lists
Determining the vocabulary of terms
 What about apostrophe for possession and
contractions?
 doc_1 : Dr. Thomas O’Daniel has been the President of Research since
December 2006.
 doc_2 : Students’ solutions weren’t correct.
 doc_3 : Ahmad’s notebook isn’t cheap.
Example: Query = O’Daniel AND Research
 Token 1: o’daniel
 Token 2: odaniel
 Token 3: o’ daniel
 Token 4: o daniel
• Question: What are the correct tokens to use?
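To make the candidate tokenizations concrete, here is a small sketch with three hypothetical tokenizers. The function names and regular expressions are assumptions for illustration, and a straight apostrophe (') stands in for the typographic one. Whichever policy is chosen, the same one has to be applied to both documents and queries, otherwise O'Daniel in a query will not match the indexed form.

```python
import re


def keep_apostrophe(text):
    """Treat the apostrophe as part of the token: o'daniel."""
    return re.findall(r"[a-z0-9']+", text.lower())


def drop_apostrophe(text):
    """Delete apostrophes before tokenizing: odaniel."""
    return re.findall(r"[a-z0-9]+", text.lower().replace("'", ""))


def split_on_apostrophe(text):
    """Treat the apostrophe as a separator: o, daniel."""
    return re.findall(r"[a-z0-9]+", text.lower())


text = "Dr. Thomas O'Daniel"
print(keep_apostrophe(text))      # ['dr', 'thomas', "o'daniel"]
print(drop_apostrophe(text))      # ['dr', 'thomas', 'odaniel']
print(split_on_apostrophe(text))  # ['dr', 'thomas', 'o', 'daniel']
```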
The term vocabulary and postings lists
Determining the vocabulary of terms
 What about tokens associated with special
characters?
 doc_1 : C# is a high-level, multi-paradigm, general-purpose programming
language.
 doc_2 : C++ (pronounced cee plus plus) is a general purpose
programming language.
 doc_3 : A+ is an array programming language descendent from the
programming language A.
Example: Query = C AND programming
 Token 1: C#
 Token 2: C #
• Question: What are the correct tokens to use?
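A short sketch of why special characters matter. A plain word pattern silently reduces C#, C++ and A+ to single letters, while a pattern that allows a trailing #, + or ++ keeps them apart. Both patterns are assumptions made for this example, not a recommended tokenizer.

```python
import re

# Letters, digits and underscore only: the symbols are discarded.
WORD_ONLY = re.compile(r"\w+")

# Hypothetical pattern that keeps a trailing "++", "#" or "+" on a token.
WITH_SYMBOLS = re.compile(r"[A-Za-z0-9]+(?:\+\+|#|\+)?")

text = "C# and C++ and A+"
print(WORD_ONLY.findall(text))     # ['C', 'and', 'C', 'and', 'A']
print(WITH_SYMBOLS.findall(text))  # ['C#', 'and', 'C++', 'and', 'A+']
```

With the first pattern, a query for C would match documents about C# and C++ alike.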
The term vocabulary and postings lists
Determining the vocabulary of terms
 What about hyphenated tokens?
 doc_1 : C# is a high-level, multi-paradigm, general-purpose programming
language.
 doc_2 : C++ (pronounced cee plus plus) is a general purpose
programming language.
 doc_3 : A+ is an array programming language descendent from the
programming language A.
Example: Query = general-purpose AND
programming
 Token 1: general-purpose
 Token 2: general purpose
• Question: What are the correct tokens to use?
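The effect of the two hyphen policies on doc_1 and doc_2 can be checked with a small sketch. The `tokenize` function and its `split_hyphens` flag are hypothetical; the document strings are taken from the example above.

```python
import re


def tokenize(text, split_hyphens):
    """Lowercase and tokenize, either splitting or keeping hyphenated words."""
    if split_hyphens:
        pattern = r"[a-z0-9]+"
    else:
        pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"
    return re.findall(pattern, text.lower())


doc_1 = "C# is a high-level, multi-paradigm, general-purpose programming language."
doc_2 = "C++ (pronounced cee plus plus) is a general purpose programming language."

# Keeping hyphens: only doc_1 contains the token "general-purpose".
print("general-purpose" in tokenize(doc_1, split_hyphens=False))  # True
print("general-purpose" in tokenize(doc_2, split_hyphens=False))  # False

# Splitting hyphens: both documents contain "general" and "purpose".
print({"general", "purpose"} <= set(tokenize(doc_1, split_hyphens=True)))  # True
print({"general", "purpose"} <= set(tokenize(doc_2, split_hyphens=True)))  # True
```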
The term vocabulary and postings lists
Determining the vocabulary of terms
 What about tokens that should be regarded as a
single token?
 doc_1 : The West Bank, including East Jerusalem, has a land area of
5,640 km2.
 doc_2 : The West bank and Gaza Strip.
 doc_3 : There is a branch of the Arab Bank in Palestine in the West of
Jenin City.
Example: Query = West Bank AND Palestine
 Token 1: West Bank
 Token 2: West
 Token 3: Bank
• Question: What are the correct tokens to use?
The term vocabulary and postings lists
Dropping Common Terms (Stop words Removal)
 Using a stop list significantly reduces the number of postings that a
system has to store.
 Keyword searches with terms like “the” and “by” don’t seem very useful.
 However, this is not true for phrase searches. The meaning of
“flights to London” is likely to be lost if the word “to” is
stopped out.
Example: The phrase queries
“President of the United States” and
“Flights to London” are more precise than
“President” AND “United States” and
“Flights” AND “London”, respectively.
• Stop words: some extremely common words which would
appear to be of little value in helping select
documents matching a user need are excluded from
the vocabulary entirely.
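A minimal sketch of stop word removal, assuming a tiny hand-picked stop list (the seven words below are an arbitrary choice for illustration). It also shows how the phrase queries lose their connecting words once the list is applied.

```python
# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"the", "of", "to", "and", "by", "a", "in"}


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


print(remove_stop_words("flights to london".split()))
# ['flights', 'london']
print(remove_stop_words("president of the united states".split()))
# ['president', 'united', 'states']
```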
The term vocabulary and postings lists
Dropping Common Terms (Stop words Removal)
 The general trend in IR systems over time has
been:
 from standard use of quite large stop lists (200–
300 terms)
 to very small stop lists (7–12 terms)
 to no stop list whatsoever.
• Question: How can we exploit the statistics of
language so as to be able to cope with
common words in better ways?
• Question: Do we really need to use stop lists?
The term vocabulary and postings lists
Normalization (equivalence classing of terms)
 Token normalization is the process of canonicalizing
(standardizing or normalizing) tokens so that matches occur
despite superficial differences in the character sequences of the
tokens.
 The easy case is if tokens in the query just match tokens in the
token list of the document.
 However, there are many cases when two character sequences are
not quite the same but you would like a match to occur.
Query
• Token1
• Token 2
• …
Document
• Token1
• Token 2
• …
The term vocabulary and postings lists
Normalization (equivalence classing of terms)
 Create equivalence classes, which are normally named after one
member of the set.
Query
• anti-discriminatory
• co-author
• U.S.A
• …
Document
• antidiscriminatory
• coauthor
• USA
• …
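The query/document pairs above can be mapped to the same class with a very small normalizer. This sketch assumes the only rules are deleting hyphens and periods; a real system would need more care, since such rules can also merge tokens that should stay distinct.

```python
def normalize(token):
    """Map a token to its equivalence class by deleting hyphens and periods."""
    return token.replace("-", "").replace(".", "")


pairs = [
    ("anti-discriminatory", "antidiscriminatory"),
    ("co-author", "coauthor"),
    ("U.S.A", "USA"),
]
for query_token, doc_token in pairs:
    # Both sides are normalized with the same function before matching.
    print(normalize(query_token) == normalize(doc_token))  # True (each pair)
```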
The term vocabulary and postings lists
Normalization (equivalence classing of terms)
 An alternative is to maintain relations between unnormalized
tokens. This method can be extended to hand-constructed lists of
synonyms such as car and automobile.
 These term relationships can be achieved in two ways:
1. The usual way is to index unnormalized tokens and to maintain a
query expansion list of multiple vocabulary entries to consider for a
certain query term.
2. The alternative is to perform the expansion during index
construction.
When the document contains automobile, we index it under car as
well (and, usually, also vice-versa).
 Use of either of these methods is considerably less efficient
than equivalence classing, as there are more postings to store and
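The two approaches can be compared on a toy collection. Everything here is hypothetical: the synonym list, the two documents, and the function names. Query-time expansion looks up extra vocabulary entries per query, while index-time expansion stores extra postings.

```python
# Hypothetical hand-constructed synonym list.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}


def build_index(docs, expand=False):
    """Build term -> sorted docIDs; optionally index synonyms as well."""
    index = {}
    for doc_id, tokens in docs.items():
        for tok in tokens:
            terms = {tok} | SYNONYMS.get(tok, set()) if expand else {tok}
            for term in terms:
                index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_with_query_expansion(index, term):
    """Way 1: unnormalized index, expand the query term at search time."""
    doc_ids = set(index.get(term, []))
    for alternative in SYNONYMS.get(term, ()):
        doc_ids |= set(index.get(alternative, []))
    return sorted(doc_ids)


docs = {1: ["red", "car"], 2: ["used", "automobile"]}
plain = build_index(docs)
expanded = build_index(docs, expand=True)  # way 2: expand at index time

print(search_with_query_expansion(plain, "car"))  # [1, 2]
print(expanded["car"])                            # [1, 2]
print(sum(len(p) for p in plain.values()))        # 4 postings
print(sum(len(p) for p in expanded.values()))     # 6 postings
```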
The term vocabulary and postings lists
Accents and Diacritics
 Diacritics: signs which, when written above or below a letter, indicate a
difference in pronunciation from the same letter when unmarked or
differently marked.
 In English:
naive and naïve
 Matching such variants can be done by normalizing tokens to remove diacritics.
 What about other languages?
ََ‫َتب‬‫ك‬َ‫و‬‫ُتب‬‫ك‬َ‫و‬‫ب‬ُ‫ت‬ُ‫ك‬
 It might be best to equate all words to a form without diacritics.
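One common way to remove diacritics in Python is to decompose the text with Unicode normalization and drop the combining marks. This is a sketch using the standard `unicodedata` module; the function name is made up, and the Arabic example word is an assumption chosen to illustrate vowel marks.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("naïve"))  # naive

# Arabic vowel marks (harakat) are combining marks too, so a vocalized
# word reduces to its bare consonant skeleton.
vocalized = "\u0643\u064f\u062a\u064f\u0628"  # kaf + damma, ta + damma, ba
bare = "\u0643\u062a\u0628"                   # kaf, ta, ba
print(strip_diacritics(vocalized) == bare)    # True
```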
The term vocabulary and postings lists
Capitalization/Case-folding
 Case-folding refers to reducing all letters to lower case.
Naive → naive
General Motors → general motors
Drew University → drew university
Drew West → drew west
The term vocabulary and postings lists
Capitalization/Case-folding
 Case-folding refers to reducing all letters to lower case.
C.A.T → cat
The term vocabulary and postings lists
Capitalization/Case-folding
 An alternative to making every token lowercase is to just make
some tokens lowercase.
 The simplest heuristic is to lowercase words at the
beginning of a sentence and all words occurring in a title that is all
uppercase or in which most or all words are capitalized.
 Mid-sentence capitalized words are left as capitalized (which is
usually correct).
 However, trying to get capitalization right in this way probably
doesn’t help if your users usually use lowercase regardless of the
correct case of words.
 Thus, lowercasing everything often remains the most practical
solution.
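The selective heuristic described above might look roughly like this. It is a sketch under simplifying assumptions: the input is one already tokenized sentence or title, the caller says whether it is a title via a hypothetical `is_title` flag, and "most words capitalized" is taken to mean more than half.

```python
def selective_lowercase(tokens, is_title=False):
    """Lowercase sentence-initial words and title-like lines only."""
    if not tokens:
        return []
    capitalized = sum(1 for t in tokens if t[:1].isupper())
    if is_title and capitalized > len(tokens) / 2:
        # Title in all caps or mostly capitalized: lowercase everything.
        return [t.lower() for t in tokens]
    # Ordinary sentence: lowercase only the first word and leave
    # mid-sentence capitalized words untouched.
    return [tokens[0].lower()] + tokens[1:]


sentence = ["The", "shares", "of", "General", "Motors", "rose"]
print(selective_lowercase(sentence))
# ['the', 'shares', 'of', 'General', 'Motors', 'rose']

title = ["INFORMATION", "RETRIEVAL"]
print(selective_lowercase(title, is_title=True))
# ['information', 'retrieval']
```

For comparison, lowercasing everything is a one-liner, `[t.lower() for t in tokens]`, which is part of why it remains the practical default.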
The term vocabulary and postings lists
Other issues in English
 Other possible normalizations are quite idiosyncratic and
particular to English.
 For instance, you might wish to equate:
colour and color.
3/12/91 and Mar. 12, 1991
 In the U.S., 3/12/91 is Mar. 12, 1991, whereas in Europe it is 3 Dec 1991.
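The date ambiguity is easy to reproduce with the standard `datetime` module. In this sketch the `convention` parameter and its two values are hypothetical; a normalizer needs some such outside hint, because the string alone does not say which reading is meant.

```python
from datetime import datetime


def parse_date(text, convention):
    """Parse a slash-separated date as month-first (US) or day-first (EU)."""
    fmt = "%m/%d/%y" if convention == "US" else "%d/%m/%y"
    return datetime.strptime(text, fmt).date().isoformat()


print(parse_date("3/12/91", "US"))  # 1991-03-12
print(parse_date("3/12/91", "EU"))  # 1991-12-03
```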

The French Revolution Class 9 Study Material pdf free download
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 

Ir 03

  • 2. Boolean Retrieval Model: Processing Boolean queries
    To process a simple conjunctive query such as “Brutus AND Calpurnia” using an inverted index and the basic Boolean retrieval model, we follow these steps:
    1. Locate Brutus in the dictionary.
    2. Retrieve its postings.
    3. Locate Calpurnia in the dictionary.
    4. Retrieve its postings.
    5. Intersect the two postings lists.
  • 3. Boolean Retrieval Model: Processing Boolean queries
    The intersection operation is the crucial one: we need to intersect postings lists efficiently so that we can quickly find the documents that contain both terms.
    This operation is sometimes referred to as merging postings lists.
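The merge described in the two slides above can be sketched as a two-pointer walk over both lists. This is a minimal sketch, assuming each postings list is a Python list of docIDs sorted in increasing order; the docIDs below are invented for illustration and the function name `intersect` is a hypothetical choice.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by increasing docID."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer


# Hypothetical postings lists (illustrative docIDs only).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each loop iteration advances at least one pointer, so the loop runs at most len(p1) + len(p2) times.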
  • 4. Boolean Retrieval Model: Processing Boolean queries
    If the postings lists are sorted by docID and have lengths x and y, the intersection takes O(x + y) operations.
    How do we process more complex queries? Example: (Brutus OR Caesar) AND NOT Calpurnia
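One possible way to evaluate the example query is to combine a linear-time union with a linear-time AND NOT. The sketch below assumes sorted lists of docIDs; the helper names `union` and `and_not` and the docIDs are hypothetical.

```python
def union(p1, p2):
    """OR: merge two sorted postings lists, keeping each docID once."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    answer.extend(p1[i:])  # at most one of these two tails is non-empty
    answer.extend(p2[j:])
    return answer


def and_not(p1, p2):
    """AND NOT: keep the docIDs of p1 that do not occur in p2."""
    answer, j = [], 0
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            answer.append(doc_id)
    return answer


# Hypothetical postings lists (illustrative docIDs only).
brutus = [1, 2, 4, 11, 31]
caesar = [1, 2, 4, 5, 6, 16, 57]
calpurnia = [2, 31, 54, 101]

# (Brutus OR Caesar) AND NOT Calpurnia
print(and_not(union(brutus, caesar), calpurnia))  # [1, 4, 5, 6, 11, 16, 57]
```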
  • 5. Boolean Retrieval Model: Processing Boolean queries
    Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
    Example: Brutus AND Caesar AND Calpurnia
  • 6. Boolean Retrieval Model: Processing Boolean queries
    For the query Brutus AND Caesar AND Calpurnia, a major element is the order in which postings lists are accessed.
    What is the best order for query processing? One candidate: (Calpurnia AND Brutus) AND Caesar
  • 7. Boolean Retrieval Model: Processing Boolean queries
    If we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work.
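The ordering heuristic above can be sketched by sorting the query terms by postings-list length before intersecting. This is an illustration only: for brevity it intersects with Python sets rather than the linear merge, and the index contents and the name `intersect_many` are hypothetical.

```python
def intersect_many(index, terms):
    """AND together several terms, processing the shortest postings lists first."""
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = set(index[ordered[0]])
    for term in ordered[1:]:
        result &= set(index[term])
        if not result:
            break  # nothing can match any more, so stop early
    return sorted(result), ordered


# Hypothetical inverted index (illustrative docIDs only).
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

docs, order = intersect_many(index, ["brutus", "caesar", "calpurnia"])
print(order)  # ['calpurnia', 'brutus', 'caesar']
print(docs)   # [2]
```

Starting with the shortest list means no intermediate result can grow beyond four docIDs in this example.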
  • 8. The term vocabulary and postings lists: Choosing a Document Unit
    Question: What is the document unit that should be used for indexing?
    Email messages: the text message; an attachment (.doc file / .rar file).
    Collection of books: individual books (entire book as a unit); each chapter as a unit; individual sentences.
    Slide labels: Precision, Recall.
  • 9. The term vocabulary and postings lists: Determining the vocabulary of terms
    Recall the major steps in inverted index construction:
    1. Collect the documents to be indexed.
    2. Tokenize the text.
    3. Do linguistic preprocessing of tokens.
    4. Index the documents that each term occurs in.
    Tokenization: the process of chopping character streams into tokens, throwing away certain characters.
    Linguistic preprocessing: deals with building equivalence classes of tokens, which are the set of terms that are indexed.
  • 10. The term vocabulary and postings lists: Determining the vocabulary of terms
    Token, type, or term?
    A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
    A type is the class of all tokens containing the same character sequence.
    A term is a type that is included in the IR system’s dictionary (a ...
    Tokenization: the process of chopping character streams into tokens, throwing away certain characters.
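A small counting example may help separate the three notions. The sentence and the stop list below are invented for illustration, and modelling the dictionary as "all types except stop words" is only one possible assumption about what an IR system keeps.

```python
text = "to sleep perchance to dream"

tokens = text.split()   # every occurrence counts as a token
types = set(tokens)     # distinct character sequences
stop_list = {"to"}      # hypothetical stop list
terms = types - stop_list  # types that end up in the dictionary

print(len(tokens))    # 5
print(len(types))     # 4
print(sorted(terms))  # ['dream', 'perchance', 'sleep']
```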
  • 11. The term vocabulary and postings lists: Determining the vocabulary of terms
    What about apostrophes for possession and contractions?
    doc_1: Dr. Thomas O’Daniel has been the President of Research since December 2006.
    doc_2: Students’ solutions weren’t correct.
    doc_3: Ahmad’s notebook isn’t cheap.
    Example: Query = O’Daniel AND Research
    Token 1: o’daniel
    Token 2: odaniel
    Token 3: o’ daniel
    Token 4: o daniel
    Question: What are the correct tokens to use?
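The four candidate tokenizations listed above can be generated mechanically, which makes it easier to see that a query only matches if documents and queries go through the same strategy. This is a sketch; the function name `candidates` and the strategy labels are hypothetical, and the curly apostrophe is mapped to a straight one for simplicity.

```python
def candidates(word):
    """Return four possible tokenizations of a word containing an apostrophe."""
    lowered = word.lower().replace("’", "'")
    head, sep, tail = lowered.partition("'")
    return {
        "keep apostrophe": [lowered],
        "drop apostrophe": [lowered.replace("'", "")],
        "split, keep apostrophe": [p for p in (head + sep, tail) if p],
        "split on apostrophe": [p for p in lowered.split("'") if p],
    }


for strategy, toks in candidates("O’Daniel").items():
    print(strategy, toks)
# keep apostrophe ["o'daniel"]
# drop apostrophe ['odaniel']
# split, keep apostrophe ["o'", 'daniel']
# split on apostrophe ['o', 'daniel']
```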
  • 12. The term vocabulary and postings lists: Determining the vocabulary of terms
    What about tokens associated with special characters?
    doc_1: C# is a high-level, multi-paradigm, general-purpose programming language.
    doc_2: C++ (pronounced cee plus plus) is a general purpose programming language.
    doc_3: A+ is an array programming language descended from the programming language A.
    Example: Query = C AND programming
    Token 1: C#
    Token 2: C #
    Question: What are the correct tokens to use?
  • 13. The term vocabulary and postings lists: Determining the vocabulary of terms
    What about hyphenated tokens?
    doc_1: C# is a high-level, multi-paradigm, general-purpose programming language.
    doc_2: C++ (pronounced cee plus plus) is a general purpose programming language.
    doc_3: A+ is an array programming language descended from the programming language A.
    Example: Query = general-purpose AND programming
    Token 1: general-purpose
    Token 2: general purpose
    Question: What are the correct tokens to use?
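To make the two choices in the special-character and hyphen slides concrete, here is a sketch that tokenizes the same sentence with two regular expressions. Both patterns are illustrative assumptions, not a recommended tokenizer.

```python
import re

text = "C# is a high-level, general-purpose programming language."

# Strategy A: only runs of letters and digits are tokens.
naive = re.findall(r"[A-Za-z0-9]+", text)

# Strategy B: keep internal hyphens and a trailing '#' or '+' on the token.
aware = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*[#+]*", text)

print(naive)
# ['C', 'is', 'a', 'high', 'level', 'general', 'purpose', 'programming', 'language']
print(aware)
# ['C#', 'is', 'a', 'high-level', 'general-purpose', 'programming', 'language']
```

Under strategy A the query C AND programming matches this sentence; under strategy B it does not, because the sentence contains the token C# rather than C.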
  • 14. The term vocabulary and postings lists: Determining the vocabulary of terms
    What about word sequences that should be regarded as a single token?
    doc_1: The West Bank, including East Jerusalem, has a land area of 5,640 km².
    doc_2: The West bank and Gaza Strip.
    doc_3: There is a branch of the Arab Bank in Palestine in the West of Jenin City.
    Example: Query = West Bank AND Palestine
    Token 1: West Bank
    Token 2: West
    Token 3: Bank
    Question: What are the correct tokens to use?
  • 15. The term vocabulary and postings lists: Dropping Common Terms (Stop Word Removal)
    Stop words: some extremely common words, which would appear to be of little value in helping select documents matching a user need, are excluded from the vocabulary entirely.
    Using a stop list significantly reduces the number of postings that a system has to store.
    Keyword searches with terms like “the” and “by” don’t seem very useful.
    However, this is not true for phrase searches. The meaning of “flights to London” is likely to be lost if the word “to” is stopped out.
    Example: The phrase query “President of the United States” is more precise than “President” AND “United States”, and “Flights to London” is more precise than “Flights” AND “London”.
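The loss of phrase meaning can be seen in a few lines. The stop list below is a hypothetical five-word list chosen for the example, not one taken from a real system.

```python
STOP_WORDS = {"the", "of", "to", "by", "from"}  # hypothetical stop list


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


print(remove_stop_words("Flights to London"))    # ['flights', 'london']
print(remove_stop_words("Flights from London"))  # ['flights', 'london']
print(remove_stop_words("President of the United States"))
# ['president', 'united', 'states']
```

After stopping, "Flights to London" and "Flights from London" become indistinguishable, which is exactly the problem for phrase queries.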
  • 16. The term vocabulary and postings lists: Dropping Common Terms (Stop Word Removal)
    The general trend in IR systems over time has been from standard use of quite large stop lists (200–300 terms), to very small stop lists (7–12 terms), to no stop list whatsoever.
    Question: How can we exploit the statistics of language so as to cope with common words in better ways?
    Question: Do we really need to use stop lists?
  • 17. The term vocabulary and postings lists: Normalization (equivalence classing of terms)
    Token normalization is the process of canonicalizing (standardizing or normalizing) tokens so that matches occur despite superficial differences in the character sequences of the tokens.
    The easy case is when tokens in the query exactly match tokens in the token list of the document.
    However, there are many cases when two character sequences are not quite the same but you would like a match to occur.
    Query: Token 1, Token 2, …
    Document: Token 1, Token 2, …
  • 18. The term vocabulary and postings lists: Normalization (equivalence classing of terms)
    Create equivalence classes, which are normally named after one member of the set.
    Query: anti-discriminatory, co-author, U.S.A, …
    Document: antidiscriminatory, coauthor, USA, …
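For the three pairs on the slide above, a sketch of equivalence classing is a function that maps every variant to one representative by deleting hyphens and periods. The rule set and the name `normalize` are assumptions made for this example; real systems use larger and more careful rule sets.

```python
def normalize(token):
    """Map a token to its equivalence-class name by removing hyphens and periods."""
    return token.replace("-", "").replace(".", "")


pairs = [
    ("anti-discriminatory", "antidiscriminatory"),
    ("co-author", "coauthor"),
    ("U.S.A", "USA"),
]
for query_token, doc_token in pairs:
    print(query_token, "->", normalize(query_token),
          "| match:", normalize(query_token) == normalize(doc_token))
```

The same function has to be applied at indexing time and at query time, otherwise the classes do not line up.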
  • 19. The term vocabulary and postings lists: Normalization (equivalence classing of terms)
    An alternative is to maintain relations between unnormalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile.
    These term relationships can be achieved in two ways:
    1. The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term.
    2. The alternative is to perform the expansion during index construction. When the document contains automobile, we index it under car as well (and, usually, also vice versa).
    Use of either of these methods is considerably less efficient than equivalence classing, as there are more postings to store and
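The two ways of using term relationships can be compared on a toy collection. Everything here is hypothetical: the two documents, the synonym table `SYNONYMS`, and the helper names `build_index` and `search`.

```python
from collections import defaultdict

# Hypothetical hand-built synonym list.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}

# Hypothetical documents, already tokenized.
docs = {1: ["red", "car"], 2: ["vintage", "automobile"]}


def build_index(docs, expand=False):
    """Build term -> set of docIDs; optionally also index each synonym."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)
            if expand:
                for syn in SYNONYMS.get(tok, ()):
                    index[syn].add(doc_id)
    return index


def search(index, term, expand=False):
    """Look up one term; optionally expand the query with its synonyms."""
    terms = {term} | (SYNONYMS.get(term, set()) if expand else set())
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return sorted(result)


plain = build_index(docs)
expanded = build_index(docs, expand=True)

print(search(plain, "car"))               # [1]
print(search(plain, "car", expand=True))  # [1, 2]  (way 1: query expansion)
print(search(expanded, "car"))            # [1, 2]  (way 2: index-time expansion)
print(sum(len(p) for p in plain.values()),
      sum(len(p) for p in expanded.values()))  # 4 6
```

The last line shows the storage cost of the second way: the expanded index holds six postings instead of four.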
  • 20. The term vocabulary and postings lists: Accents and Diacritics
    Diacritics: signs which, when written above or below a letter, indicate a difference in pronunciation from the same letter when unmarked or differently marked.
    In English: naive and naïve.
    Matching such pairs can be done by normalizing tokens to remove diacritics.
    What about other languages? For example, Arabic words written with the same letters كتب but with different diacritics.
    It might be best to equate all words to a form without diacritics.
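One common way to remove diacritics in Python is to decompose the string with the standard `unicodedata` module and drop the combining marks. The sketch below is an illustration, and the function name `strip_diacritics` is a hypothetical choice; the Arabic sample is written with escape codes for the letters kaf, ta, ba and the damma vowel mark.

```python
import unicodedata


def strip_diacritics(token):
    """Decompose to NFD and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("naïve"))  # naive

# Arabic letters kaf, ta, ba with two damma marks (U+064F).
vocalized = "\u0643\u064F\u062A\u064F\u0628"
print(strip_diacritics(vocalized) == "\u0643\u062A\u0628")  # True
```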
  • 21. The term vocabulary and postings lists: Capitalization/Case-folding
    Case-folding refers to reducing all letters to lower case.
    Naive → naive
    General Motors → general motors
    Drew University → drew university
    Drew West → drew west
  • 22. The term vocabulary and postings lists: Capitalization/Case-folding
    Case-folding refers to reducing all letters to lower case.
    C.A.T → cat
  • 23. The term vocabulary and postings lists: Capitalization/Case-folding
    An alternative to making every token lowercase is to just make some tokens lowercase.
    The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase or in which most or all words are capitalized.
    Mid-sentence capitalized words are left as capitalized (which is usually correct).
    However, trying to get capitalization right in this way probably doesn’t help if your users usually use lowercase regardless of the correct case of words.
    Thus, lowercasing everything often remains the most practical solution.
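The difference between full case-folding and the sentence-initial heuristic can be sketched as follows. This is a simplified illustration: it splits sentences on ., ! or ? followed by whitespace, keeps only alphabetic tokens, and ignores the title rule; the sample text and function names are invented.

```python
import re


def fold_all(text):
    """Full case-folding: every token is lowercased."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


def fold_sentence_initial(text):
    """Heuristic: lowercase only the first word of each sentence."""
    tokens = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        if not words:
            continue
        tokens.append(words[0].lower())
        tokens.extend(words[1:])  # mid-sentence capitals are kept
    return tokens


text = "The company hired Drew West. He left General Motors."
print(fold_all(text))
# ['the', 'company', 'hired', 'drew', 'west', 'he', 'left', 'general', 'motors']
print(fold_sentence_initial(text))
# ['the', 'company', 'hired', 'Drew', 'West', 'he', 'left', 'General', 'Motors']
```

With the heuristic, a lowercase user query such as drew west no longer matches the indexed tokens Drew and West, which is the practical objection raised in the slide.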
  • 24. The term vocabulary and postings lists: Other issues in English
    Other possible normalizations are quite idiosyncratic and particular to English.
    For instance, you might wish to equate colour and color, or 3/12/91 and Mar. 12, 1991.
    In the U.S., 3/12/91 is Mar. 12, 1991, whereas in Europe it is 3 Dec. 1991.
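The date ambiguity can be reproduced with the standard `datetime` module by parsing the same string under the two conventions. The format strings are the only assumption here; which convention applies to a given document has to be decided from context.

```python
from datetime import datetime

raw = "3/12/91"

us = datetime.strptime(raw, "%m/%d/%y").date()  # month/day/year
eu = datetime.strptime(raw, "%d/%m/%y").date()  # day/month/year

print(us.isoformat())  # 1991-03-12
print(eu.isoformat())  # 1991-12-03
```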