Ir 03

Lecture 03
Information Retrieval

Boolean Retrieval Model
Processing Boolean queries
 To process a simple conjunctive query such as “Brutus AND
Calpurnia” using an inverted index and the basic Boolean retrieval
model, we follow these steps:
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists

 The intersection operation is the crucial one: we need to efficiently
intersect postings lists so as to be able to quickly find documents that
contain both terms.
 This operation is sometimes referred to as merging postings lists.

 If the lengths of the postings lists are x and y, the
intersection takes O(x + y) operations.
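
The merge-style intersection described above can be sketched as follows. This is a minimal illustration, assuming each postings list is a Python list of docIDs sorted in ascending order; the function name `intersect` and the docIDs are made up for the example.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by ascending docID."""
    answer = []
    i = j = 0
    # Walk both lists in step; each comparison advances at least one
    # pointer, so the loop runs at most len(p1) + len(p2) times.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Illustrative docIDs, not taken from a real collection.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Because both lists are sorted, neither pointer ever moves backwards, which is where the O(x + y) bound comes from.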
 Processing more complex queries? Example:
 (Brutus OR Caesar) AND NOT Calpurnia
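
One possible way to evaluate such a query is to add a merge for OR and a merge for AND NOT. The sketch below assumes sorted lists of docIDs; the helper names `union` and `and_not` and the docIDs are illustrative only.

```python
def union(p1, p2):
    """OR: merge two sorted postings lists, keeping each docID once."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # At most one of the two tails is non-empty.
    return answer + p1[i:] + p2[j:]


def and_not(p1, p2):
    """AND NOT: docIDs of p1 that do not occur in p2 (both sorted)."""
    answer = []
    j = 0
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            answer.append(doc_id)
    return answer


# Illustrative docIDs.
brutus = [1, 2, 4, 11, 31]
caesar = [1, 2, 4, 5, 6, 16]
calpurnia = [2, 31, 54]
print(and_not(union(brutus, caesar), calpurnia))  # [1, 4, 5, 6, 11, 16]
```

Treating AND NOT as a single binary operation avoids materializing the (usually very long) list of all documents that do not contain Calpurnia.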

 Query optimization is the process of selecting how to
organize the work of answering a query so that the least
total amount of work needs to be done by the system.
 Brutus AND Caesar AND Calpurnia

Brutus AND Caesar AND Calpurnia
 A major element is the order in which postings lists are
accessed.
 What is the best order for query processing?
(Calpurnia AND Brutus) AND Caesar

 If we start by intersecting the two smallest postings lists,
then all intermediate results must be no bigger than the
smallest postings list, and we are therefore likely to do the
least amount of total work.
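
A minimal sketch of this ordering heuristic, assuming sorted lists of docIDs. The names `intersect` and `intersect_many` and the docIDs are illustrative; list length stands in for document frequency.

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND several postings lists together, shortest list first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result


# Illustrative docIDs.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]
print(intersect_many([brutus, caesar, calpurnia]))  # [2]
```

Here the shortest list (Calpurnia) is processed first, which corresponds to the order (Calpurnia AND Brutus) AND Caesar.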

The term vocabulary and postings lists
Choosing a Document Unit
Question: What is the document unit that should be used for indexing?
Email messages:
• Text message
• Attachment (.doc file / .rar file)
Collection of books:
• Individual books (entire book as a unit)
• Each chapter as a unit
• Individual sentences
The choice of unit involves a trade-off between precision and recall.

Determining the vocabulary of terms
Recall the major steps in inverted index construction:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
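
The four steps can be sketched on a toy collection as below. This is only an illustration: the whitespace tokenizer, the punctuation stripping, the lowercasing, and the name `build_index` are simplifying assumptions, not a prescribed pipeline.

```python
from collections import defaultdict


def build_index(docs):
    """Build a term -> sorted list of docIDs mapping from {doc_id: text}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                 # 1. collect the documents
        tokens = docs[doc_id].split()           # 2. tokenize (whitespace only)
        terms = {t.lower().strip(".,;:!?") for t in tokens}  # 3. preprocess
        for term in terms:                      # 4. index each term once per doc
            if term:
                index[term].append(doc_id)
    return dict(index)


# Illustrative two-document collection.
docs = {
    1: "I did enact Julius Caesar.",
    2: "So let it be with Caesar.",
}
index = build_index(docs)
print(index["caesar"])  # [1, 2]
print(index["julius"])  # [1]
```

Documents are visited in ascending docID order, so every postings list comes out sorted, which the merge-based intersection relies on.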
• Tokenization: the process of chopping character streams into tokens, throwing away certain characters.
• Linguistic preprocessing: deals with building equivalence classes of tokens, which are the set of terms that are indexed.

 Token, type, or term?
A token is an instance of a sequence of characters in some
particular document that are grouped together as a useful
semantic unit for processing.
A type is the class of all tokens containing the same character
sequence.
A term is a type that is included in the IR system’s dictionary (a
• Tokenization: the process of chopping character streams into tokens, throwing away certain characters.
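
The difference between tokens, types, and terms can be illustrated by counting them on a short string. The sentence and the one-word stop list below are assumptions chosen for the example.

```python
text = "to sleep perchance to dream"

tokens = text.split()        # every occurrence counts
types = set(tokens)          # distinct character sequences
stop_list = {"to"}           # hypothetical stop list
terms = types - stop_list    # types that make it into the dictionary

print(len(tokens), len(types), len(terms))  # 5 4 3
```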

 What about apostrophe for possession and
contractions?
 doc_1 : Dr. Thomas O’Daniel has been the President of Research since
December 2006.
 doc_2 : Students’ solutions weren’t correct.
 doc_3 : Ahmad’s notebook isn’t cheap.
Example: Query = O’Daniel AND Research
 Token 1: o’daniel
 Token 2: odaniel
 Token 3: o’ daniel
 Token 4: o daniel
• Question: What are the correct tokens to use?
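
The four candidate tokenizations can be produced mechanically, as in the sketch below. The function name `tokenize_variants` and its labels are illustrative, and mapping the curly apostrophe to a straight one is an added assumption.

```python
import re


def tokenize_variants(word):
    """Return four candidate tokenizations of a word with an apostrophe."""
    lowered = word.lower().replace("’", "'")  # assumption: unify apostrophes
    return {
        "keep apostrophe": [lowered],
        "drop apostrophe": [lowered.replace("'", "")],
        "split after apostrophe": re.findall(r"[^']+'?", lowered),
        "split on apostrophe": lowered.split("'"),
    }


for name, tokens in tokenize_variants("O’Daniel").items():
    print(name, tokens)
```

Whichever variant is chosen, a query for O’Daniel only matches doc_1 if the query text goes through the same tokenizer as the documents.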

 What about tokens associated with special
characters?
 doc_1 : C# is a high-level, multi-paradigm, general-purpose programming
language.
 doc_2 : C++ (pronounced cee plus plus) is a general purpose
programming language.
 doc_3 : A+ is an array programming language descended from the
programming language A.
Example: Query = C AND programming
 Token 1: C#
 Token 2: C #
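
A tokenizer that splits on every non-letter turns C#, C++ and A+ into the bare letters C and A, so the query C would match all of them. The sketch below contrasts that with a pattern that keeps a trailing #, + or ++ attached to the word; both patterns are assumptions made for this example, not a standard rule.

```python
import re

# Letters optionally followed by "++", "#" or "+" (illustrative pattern).
SYMBOL_AWARE = re.compile(r"[A-Za-z]+(?:\+\+|#|\+)?")


def naive_tokens(text):
    """Keep runs of letters only; symbols are thrown away."""
    return re.findall(r"[A-Za-z]+", text)


def symbol_aware_tokens(text):
    """Keep a trailing '#', '+' or '++' as part of the token."""
    return SYMBOL_AWARE.findall(text)


sample = "C# and C++ and A+ and A."
print(naive_tokens(sample))         # ['C', 'and', 'C', 'and', 'A', 'and', 'A']
print(symbol_aware_tokens(sample))  # ['C#', 'and', 'C++', 'and', 'A+', 'and', 'A']
```

The second pattern would also glue the plus sign onto the first operand of an expression such as a+b, which shows why rules of this kind tend to be domain specific.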

 What about hyphenated tokens?
 doc_1 : C# is a high-level, multi-paradigm, general-purpose programming
language.
 doc_2 : C++ (pronounced cee plus plus) is a general purpose
programming language.
 doc_3 : A+ is an array programming language descended from the
programming language A.
Example: Query = general-purpose AND
programming
 Token 1: general-purpose
 Token 2: general purpose
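
The options for a hyphenated token can be listed explicitly. The helper `hyphen_variants` below is a hypothetical illustration that also includes the joined form.

```python
def hyphen_variants(token):
    """Return the kept, split and joined forms of a hyphenated token."""
    parts = token.split("-")
    return {
        "kept": [token],
        "split": parts,
        "joined": ["".join(parts)],
    }


variants = hyphen_variants("general-purpose")
print(variants["kept"])    # ['general-purpose']
print(variants["split"])   # ['general', 'purpose']
print(variants["joined"])  # ['generalpurpose']
```

In the example documents, doc_1 writes "general-purpose" and doc_2 writes "general purpose", so only the split form yields the same tokens for both.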

 What about tokens that should be regarded as a
single token?
 doc_1 : The West Bank, including East Jerusalem, has a land area of
5,640 km2.
 doc_2 : The West bank and Gaza Strip.
 doc_3 : There is a branch of the Arab Bank in Palestine in the West of
Jenin City.
Example: Query = West Bank AND Palestine
 Token 1: West Bank
 Token 2: West
 Token 3: Bank
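
One possible treatment is to look up adjacent word pairs in a list of known multiword names and emit each hit as a single token. The list `MULTIWORD` below is hand-made for this example, and lowercasing before the lookup is an added assumption.

```python
# Hypothetical list of two-word names to keep together.
MULTIWORD = {("west", "bank"), ("gaza", "strip"), ("arab", "bank")}


def tokenize_with_phrases(text):
    """Whitespace tokenizer that merges known two-word names."""
    words = [w.strip(".,").lower() for w in text.split()]
    words = [w for w in words if w]
    tokens = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in MULTIWORD:
            tokens.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


doc_2 = "The West bank and Gaza Strip."
doc_3 = ("There is a branch of the Arab Bank in Palestine "
         "in the West of Jenin City.")
print(tokenize_with_phrases(doc_2))  # ['the', 'west bank', 'and', 'gaza strip']
print("west bank" in tokenize_with_phrases(doc_3))  # False
```

With single-word tokens, doc_3 would match West AND Bank AND Palestine even though it never mentions the West Bank; with the merged token it does not.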

Dropping Common Terms (Stop words Removal)
 Using a stop list significantly reduces the number of postings that a
system has to store.
 Keyword searches with terms like “the” and “by” don’t seem very useful.
 However, this is not true for phrase searches. The meaning of
“flights to London” is likely to be lost if the word “to” is
stopped out.
Example: The phrase queries
“President of the United States” and
“Flights to London” are more precise than
“President” AND “United States” and
“Flights” AND “London”.
• Stop words: some extremely common words that would appear to be of
little value in helping select documents matching a user need are
excluded from the vocabulary entirely.
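
The effect of a stop list on phrase queries can be seen in a few lines. The stop list below is a small made-up one, and the tokenizer is plain whitespace splitting.

```python
# Small illustrative stop list, not a standard one.
STOP_WORDS = {"the", "of", "to", "by", "and", "a"}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


print(remove_stop_words("Flights to London"))
# ['flights', 'london']
print(remove_stop_words("President of the United States"))
# ['president', 'united', 'states']
```

After filtering, "flights to London" and "flights from London" produce the same tokens, which is the loss of meaning described for phrase searches.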

Dropping Common Terms (Stop words Removal)
 The general trend in IR systems over time has
been:
 from standard use of quite large stop lists (200–
300 terms)
 to very small stop lists (7–12 terms)
 to no stop list whatsoever.
• Question: How can we exploit the statistics of language so as to
cope with common words in better ways?
• Question: Do we really need to use stop lists?

Normalization (equivalence classing of terms)
 Token normalization is the process of canonicalizing
(standardizing or normalizing) tokens so that matches occur
despite superficial differences in the character sequences of the
tokens.
 The easy case is if tokens in the query just match tokens in the
token list of the document.
 However, there are many cases when two character sequences are
not quite the same but you would like a match to occur.
Query: Token 1, Token 2, …
Document: Token 1, Token 2, …

 Create equivalence classes, which are normally named after one
member of the set.
Query
• anti-discriminatory
• co-author
• U.S.A
• …
Document
• antidiscriminatory
• coauthor
• USA
• …
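
A minimal sketch of equivalence classing for the pairs above, assuming the only rule is to delete hyphens and periods. The function name `normalize` is illustrative, and real systems usually apply more rules than this.

```python
def normalize(token):
    """Map a token to its class representative by removing '-' and '.'."""
    return token.replace("-", "").replace(".", "")


pairs = [
    ("anti-discriminatory", "antidiscriminatory"),
    ("co-author", "coauthor"),
    ("U.S.A", "USA"),
]
for query_token, doc_token in pairs:
    print(normalize(query_token) == normalize(doc_token))  # True each time
```

The same function has to be applied to both query tokens and document tokens for the match to occur.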

 An alternative is to maintain relations between unnormalized
tokens. This method can be extended to hand-constructed lists of
synonyms such as car and automobile.
 These term relationships can be achieved in two ways:
1. The usual way is to index unnormalized tokens and to maintain a
query expansion list of multiple vocabulary entries to consider for a
certain query term.
2. The alternative is to perform the expansion during index
construction.
When the document contains automobile, we index it under car as
well (and, usually, also vice-versa).
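
The two options can be compared on a toy index. The synonym list, the index contents, and the function names below are all hypothetical.

```python
# Hand-constructed synonym list (hypothetical).
SYNONYMS = {"car": ["automobile"], "automobile": ["car"]}

# Unnormalized tokens mapped to sorted docIDs (illustrative).
index = {"car": [1, 3], "automobile": [2, 3]}


def search_with_query_expansion(term, index):
    """Option 1: look up the term and its synonyms at query time."""
    doc_ids = set(index.get(term, []))
    for other in SYNONYMS.get(term, []):
        doc_ids.update(index.get(other, []))
    return sorted(doc_ids)


def expand_index(index):
    """Option 2: at indexing time, copy postings to each synonym's entry."""
    expanded = {term: set(postings) for term, postings in index.items()}
    for term, postings in index.items():
        for other in SYNONYMS.get(term, []):
            expanded.setdefault(other, set()).update(postings)
    return {term: sorted(postings) for term, postings in expanded.items()}


print(search_with_query_expansion("car", index))  # [1, 2, 3]
print(expand_index(index)["car"])                 # [1, 2, 3]
```

Option 1 costs extra lookups and merges at query time, while option 2 grows the index: in this toy case from 4 postings to 6.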
 Use of either of these methods is considerably less efficient
than equivalence classing, as there are more postings to store and

Accents and Diacritics
 Diacritics: signs which, when written above or below a letter, indicate a
difference in pronunciation from the same letter when unmarked or
differently marked.
 In English, we would like naive and naïve to match.
 This can be done by normalizing tokens to remove diacritics.
 What about other languages? For example, Arabic: كَتَبَ / كُتب / كُتُب
 It might be best to equate all words to a form without diacritics.
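
One common way to remove diacritics in Python is to decompose each character with the standard `unicodedata` module and drop the combining marks. The function name `strip_diacritics` is illustrative; whether this is the right normalization for a given language is a separate question.

```python
import unicodedata


def strip_diacritics(token):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("naïve"))  # naive
print(strip_diacritics("كَتَبَ") == strip_diacritics("كُتُب"))  # True
```

The Arabic short-vowel marks are combining characters, so both words reduce to the same three letters.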

Capitalization/Case-folding
 Case-folding refers to reducing all letters to lower case.
Naive → naive
General Motors → general motors
Drew University → drew university
Drew West → drew west
C.A.T → cat

 An alternative to making every token lowercase is to just make
some tokens lowercase.
 The simplest heuristic is to lowercase words at the
beginning of a sentence, and all words occurring in a title that is all
uppercase or in which most or all words are capitalized.
 Mid-sentence capitalized words are left as capitalized (which is
usually correct).
 However, trying to get capitalization right in this way probably
doesn’t help if your users usually use lowercase regardless of the
correct case of words.
 Thus, lowercasing everything often remains the most practical
solution.
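
The two strategies can be compared on a short text. This sketch covers only the sentence-start part of the heuristic (titles are not handled), and the sentence splitter and function names are simplifying assumptions.

```python
import re


def lowercase_all(text):
    """Case-fold every token."""
    return [w.strip(".,!?").lower() for w in text.split()]


def lowercase_sentence_starts(text):
    """Lowercase only the first word of each sentence."""
    tokens = []
    # Naive sentence split: whitespace that follows '.', '!' or '?'.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = [w.strip(".,!?") for w in sentence.split()]
        words = [w for w in words if w]
        if words:
            words[0] = words[0].lower()
        tokens.extend(words)
    return tokens


text = "Naive users ask about General Motors. They rarely type capitals."
print(lowercase_sentence_starts(text))
# ['naive', 'users', 'ask', 'about', 'General', 'Motors', 'they', 'rarely', 'type', 'capitals']
print(lowercase_all(text)[4:6])
# ['general', 'motors']
```

A user who types general motors in lowercase matches only the fully case-folded tokens, which is the practical argument for lowercasing everything.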

Other issues in English
 Other possible normalizations are quite idiosyncratic and
particular to English.
 For instance, you might wish to equate:
colour and color.
3/12/91 and Mar. 12, 1991
 In the U.S., 3/12/91 is Mar. 12, 1991, whereas in Europe it is 3 Dec. 1991.
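
The date ambiguity can be made concrete with the standard `datetime` module. The function `parse_date` and its two convention labels are hypothetical names for this sketch.

```python
from datetime import date, datetime

# Format strings for the two reading conventions.
FORMATS = {"US": "%m/%d/%y", "EU": "%d/%m/%y"}


def parse_date(text, convention):
    """Parse a slash-separated date under the given convention."""
    return datetime.strptime(text, FORMATS[convention]).date()


print(parse_date("3/12/91", "US"))  # 1991-03-12
print(parse_date("3/12/91", "EU"))  # 1991-12-03
```

The same character sequence normalizes to two different dates, so the normalizer needs to know which convention the document follows.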
