Information Retrieval (IR)
Techniques
Girish Khanzode
Contents
Information Retrieval
• Information Retrieval - given a set of query terms and a set of documents, select only the relevant documents (precision), and preferably all the relevant ones (recall)
• Goal - find documents relevant to an information need from a large document set
– Mostly textual information ( text/document retrieval)
– documents, images, videos, data, services, audio
– XML, RDF, html, txt, PDF
• Large collections on internet with billions of pages
• Information retrieval problem: locating relevant documents based on user input,
such as keywords or example documents
Information Retrieval / Data Retrieval
                      Information Retrieval   Data Retrieval
Matching              vague                   exact
Model                 probabilistic           deterministic
Query language        natural                 artificial
Query specification   incomplete              complete
Items wanted          relevant                all (matching)
Error handling        insensitive             sensitive
IR Cycle
IR Models
(Figure) Taxonomy of IR models by user task:
• User tasks
– Retrieval: ad hoc, filtering
– Browsing
• Classic models: Boolean, Vector, Probabilistic
• Structured models: Non-Overlapping Lists, Proximal Nodes
• Browsing models: Flat, Structure Guided, Hypertext
• Set theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network
Tasks
• Clustering - Group documents into clusters based on their contents
• Classification - Given a set of topics and a new document D, decide which topics
D belongs to (spam / non-spam…)
• Information Extraction - Find all snippets dealing with a given topic (like
company merger)
• Question Answering - Handle wide range of question types like fact, list,
definition, how, why, hypothetical, semantically constrained, and cross-
lingual questions
• Opinion Mining - Analyze / summarize sentiment in a text
Terminology
• Searching - seeking specific information within a body of information; the result of a search is
a set of hits
• Browsing - unstructured exploration of a body of information
• Linking - moving from one item to another following links like citations, references
• Query - a string of text, describing the information that user seeks. Each word of the query
is called a search term or keyword
• A query can be a single search term, a string of terms, a phrase in natural language or a
stylized expression using special symbols
• Full text searching - methods that compare the query with every word in the text, without
distinguishing the function (meaning, position) of the various words
• Fielded searching - methods that search on specific bibliographic or structural fields, such as
author or heading
Architecture
(Figure) Offline, documents are acquired (e.g. by web crawling) and passed through a
representation function to build the index. Online, the query is passed through its own
representation function, and a comparison function matches the query representation against
the document representations to produce the hits.
Zipf's Law
• Distribution of word frequencies is similar for different texts (natural
language) of significantly large size
• Zipf's law holds even for different languages
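A compact statement of the law (the standard formulation; it is not spelled out on the slide, and the constant is collection-dependent):

```latex
% Zipf's law: the collection frequency cf_r of the r-th most frequent term
% falls off roughly as the inverse of its rank r.
cf_r \;\propto\; \frac{1}{r}
\qquad\text{equivalently}\qquad
cf_r \;\approx\; \frac{c}{r}\ \text{ for some collection-dependent constant } c
```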
Luhn's Hypothesis
• Frequency of words is a measurement of word significance - A measurement of
the power of a word to discriminate documents by their content ...
• Resolving/discriminating power of words
• Optimal power is halfway between the upper and lower frequency cut-offs
Techniques
• IR Models - Governs how a document and a query are represented and how
the relevance of a document to a user query is defined
– Boolean Model
– Vector Model
– Probabilistic Model
• Index Terms (attribute) Selection
– Stop list
– Word stem
– Index terms weighting methods
• Term-document frequency matrices
Indexing Based IR
• Simple queries
– composed of two or three, perhaps a dozen, keywords
– web retrieval
• Boolean queries
– `Database AND computer’
– online catalog and patent search
• Context queries
– proximity search, phrase queries
Sorting & Ranking
• User sends a query to search system which returns a set of hits
• For a large documents collection this set could be large
• The value of results depends on the order in which the hits are presented
• Ranking algorithms are at the core of information retrieval systems
(predicting which documents are relevant and which are not)
• Ranking methods
– Sorting the hits (by date…)
– Ranking the hits by similarity between query and document
– Ranking the hits by the importance of the documents
Bag of Words Model
• The most common way to represent documents in IR
• How to weight a word within a document
– Boolean: 1 if word i is in doc j, 0 otherwise
– Tf*idf and others - the weight is a function of the word frequency in the
document and of the frequency of documents with that word
• What is a word
– Single, inflected word (going)
– Lemmatised word (going, go, gone → go)
– Multi-word, proper nouns, numbers, dates (board of directors, John Stack, April,
2012)
– Meaning: plan, project, design → PLAN#03
Bag of Words Model
• Treats all the words in a document as index
terms
• Assigns a weight to each term based on
importance (or presence/absence of word)
• Disregards order, structure, meaning of the
words
• Simple but effective
• Assumptions
– Term occurrence is independent
– Document relevance is independent
– Words are well-defined
• Consider three documents
– John likes to watch movies.
– Mary likes movies too.
– John also likes football
• The bag of words is shown below
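A minimal sketch of how that bag-of-words table can be built for the three example documents (the naive tokenization and the use of raw term counts are illustrative choices, not prescribed by the slide):

```python
from collections import Counter

docs = [
    "John likes to watch movies.",
    "Mary likes movies too.",
    "John also likes football",
]

# Tokenize naively: lowercase and strip punctuation, then count term occurrences.
def bag_of_words(text):
    tokens = [t.strip(".,").lower() for t in text.split()]
    return Counter(tokens)

bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))

# Print each document as a vector of raw term frequencies over the shared vocabulary.
for i, bag in enumerate(bags, 1):
    print(f"doc{i}:", [bag.get(term, 0) for term in vocabulary])
print("vocabulary:", vocabulary)
```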
Document Parsing
• Format and language of each document
– What format is it in?
– PDF / word / excel / html?
– What language is it in?
– What character set is in use?
• Each of these is a classification problem
• These tasks are often performed heuristically
Parsing Challenges
• Documents being indexed can be from different languages
– A single index may contain terms of several languages
• A document / components can contain multiple languages / formats
– French email with a German PDF attachment
• What is a unit document?
– A file?
– An email?
– An email with 5 attachments?
– A group of files (PPT or LaTeX as HTML pages)
Tokenization
• Token - instance of a sequence of characters
• Each token is a candidate for an index entry after further processing
• Input: Customers Suppliers and Factory
• Output: Tokens
– Customers
– Suppliers
– Factory
Tokenization Issues
• Finland’s capital → Finland? Finlands? Finland’s?
• Hewlett-Packard → Hewlett and Packard as two tokens?
– state-of-the-art - break up hyphenated sequence
– co-education
– lowercase, lower-case, lower case ?
• San Francisco - one token or two?
– How to decide if it is one token?
– Cheap San Francisco-Los Angeles fares
Stop Words
• Many of the most frequently used words in English are useless in IR and text mining
• Those words are called stop words
– the, a, and, to, be , of, in, about, with …
– Little semantic value
– Stop words account for 20-30% of total word counts
• Stop list contains stop-words that should not be used as index terms
– Prepositions
– Articles
– Pronouns
– Some adverbs and adjectives
– Some frequent words (e.g. document)
Stop Words
• Removal of stop-words improves
efficiency and effectiveness of
searches
• A few standard stop-lists are
commonly used
• Reduces indexing data file sizes
Stop Words - New Trend
• Stop words need very small space for storage due to good compression
techniques
• Query time is not affected due to stop words because of good query
optimization techniques
• Stop words are necessary for
– Phrase queries - King of Spain
– Various song titles.. - Let it be, To be or not to be
– Relational queries - flights to London
Normalization - Terms
• Normalization of words in indexed text and query into same form
• A term is a normalized word type, which is a single entry in IR system
dictionary
• Implicitly defines equivalence classes of terms by
– Deleting periods to form a term
• U.S.A., USA → USA
– Deleting hyphens to form a term
• anti-discriminatory, antidiscriminatory → antidiscriminatory
– Synonyms
• Car, Automobile
Case Folding
• Reduces all letters to lower case
• Exception: upper case words in mid-sentence
• General Motors
• Fed vs. fed
• MIT vs. mit
• It is best to lowercase everything, since users will often type search
queries in lowercase regardless of the correct capitalization of what they are looking for
• Google example
– When query is C.A.T. -> #1 result is for “cat” (Wikipedia) but not Caterpillar
Inc.
Synonyms and Homonyms
• Synonyms
– Document - motorcycle repair - motorcycle maintenance
– maintenance and repair are synonyms
– System can extend query as motorcycle and (repair or maintenance)
• Homonyms
– Object has different meanings as noun/verb
– Can disambiguate meanings to some extent from the context
• Extending queries automatically using synonyms can be problematic
– Need to understand intended meaning in order to infer synonyms
• Or verify synonyms with user
– Synonyms may have other meanings as well
Normalization - Synonyms
• Handling Synonyms and Homonyms
– Hand-constructed equivalence classes
• car = automobile color = colour
– Rewrite words to form equivalence-class terms
• When a document contains automobile, index it under car-automobile (and vice-versa)
– Expand a query
• When a query contains automobile, look under car as well
• Spelling mistakes
– Soundex - a phonetic algorithm that forms equivalence classes of words based on phonetic
heuristics
• Google → Googol
Stemming
• Techniques used to reduce words of variant form to a stem or root form
before indexing
• Stemming – Remove endings of word
– Computer
– Compute
– Computes
– Computing
– Computed
– Computation
– all reduce to the stem: comput
Stemming
• As a result, if the query is house plans, the results will also include all pages
containing variations of that term
– House plan
– House planer
– House planning
• Increases recall at the expense of precision
• Improves effectiveness of IR and text mining
– Matching similar words
– Reducing indexing size
– Combining words with same roots may reduce indexing size as much as 40-50%
• Produced by stemmers
Lemmatization
• Transform to standard dictionary form lemma, according to syntactic
category
– verb + ing → verb, noun + s → noun
• More accurate than stemming but consumes more resources
• Balance noise Vs. recognition rate
• Compromise between precision and recall
• Increased recall without hurting precision
• Produced by lemmatizers
• the boy's cars are different colors → the boy car be different color
Porter’s Algorithm
• The most common algorithm for stemming English and one that has
repeatedly been shown to be empirically very effective - suffix stripping
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: of the rules in a compound command, select the one that
applies to the longest suffix
Porter Algorithm Steps
1. Plurals and past participles
   SSES -> SS          caresses -> caress
   (*v*) ING ->        motoring -> motor
2. adj->n, n->v, n->adj, …
   (m>0) OUSNESS -> OUS    callousness -> callous
   (m>0) ATIONAL -> ATE    relational -> relate
3. (m>0) ICATE -> IC       triplicate -> triplic
4. (m>1) AL ->             revival -> reviv
   (m>1) ANCE ->           allowance -> allow
5. (m>1) E ->              probate -> probat
   (m > 1 and *d and *L) -> single letter    controll -> control
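A quick illustration of suffix stripping in practice. This sketch assumes the NLTK library is installed and uses its PorterStemmer class (the import path and class name are NLTK's, not from the slides):

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed (pip install nltk)

stemmer = PorterStemmer()
words = ["caresses", "motoring", "callousness", "relational",
         "triplicate", "revival", "allowance", "probate", "controlling"]

# Each word is reduced to its stem by the phased suffix-stripping rules listed above.
for w in words:
    print(f"{w:>12} -> {stemmer.stem(w)}")
```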
Stemmers Comparison
• Sample text: Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression that is more
biologically transparent and accessible to interpretation
• Porter’s: such an analysi can reveal featur that ar not easili visibl from the variat in the
individu gene and can lead to pictur of express that is more biolog transpar and access
to interpret
• Lovins’s: such an analys can reve featur that ar not eas vis from th vari in th individu
gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
• Paice’s : such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to interpret
Deep Analysis
• Detailed Natural Language Processing (NLP) algorithms
• Semantic disambiguation, phrase indexing (board of directors), named
entities (President Monti = Mario Monti)...
• Standard search engines use deeper techniques (Google)
Document Indexing
• Store an index to optimize speed and performance
• Useful in finding relevant documents for a search query
• Reduces time and CPU usage
• Without an index, search engine will scan every document in
the corpus
• An index of 10,000 documents is queried in milliseconds
• A sequential scan of every word in 10,000 documents takes
much more time
• Each document is represented by a set of weighted
keywords known as terms
– D1 → {(t1, w1), (t2,w2), …}
• D1 → {(comput, 0.2), (architect, 0.3), …}
• D2 → {(comput, 0.1), (network, 0.5), …}
• Inverted file - used in retrieval for higher efficiency
– comput → {(D1,0.2), (D2,0.1), …}
Boolean Model
• Query terms are combined logically using the
Boolean operators
– AND, OR, NOT
– ((asthma AND exercise) AND (NOT cardiac))
– Views each document as a set of words
– Precise: document matches a condition or not
• Retrieval
– Given a Boolean query, system retrieves each
document that makes the query logically true
– Called exact match
• No Rank - A document is judged to be relevant if
the terms in the document satisfy the logical
expression of the query
Inverted index
• A data structure that associates each distinct term with a list of all
documents that contain the term in a document collection
• The list is called a postings list
Document vs. Inverted Views
What Goes in Inverted File
• Boolean retrieval
– Just the document number
• Ranked Retrieval
– Document number and term weight (TF, IDF, TF*IDF, ...)
• Proximity operators
– Word offsets for each occurrence of the term
– Example : t17 (doc1,49) (doc1,70) (doc2,3)
Inverted File Size
• Very compact for Boolean retrieval
– About 10% of the size of the documents
– If an aggressive stop word list is used
• Not much larger for ranked retrieval
– Perhaps 20%
• Enormous for proximity operators
– Sometimes larger than the documents
– But access is fast - you know where to look
Inverted Index Construction
(Figure) Example: the text "Friends, Romans, Countrymen" is tokenized and normalized to the
terms friend, roman, countryman; each term is then linked to its postings list of document IDs,
e.g. friend → 2, 4; roman → 1, 2; countryman → 13, 16
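A minimal sketch of this construction for a tiny collection (the document texts and IDs are illustrative, not from the slides):

```python
from collections import defaultdict

# Toy collection: document ID -> text.
collection = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# Build the inverted index: each term maps to a sorted postings list of doc IDs.
index = defaultdict(set)
for doc_id, text in collection.items():
    for term in text.lower().split():
        index[term].add(doc_id)

postings = {term: sorted(ids) for term, ids in sorted(index.items())}
for term, ids in postings.items():
    print(f"{term:>10} -> {ids}")
```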
Inverted Index – Search Steps
• Given a query q
– Vocabulary search - find each term/word from q in the inverted index
– Results merging - Merge results to find documents that contain all or some of
the words/terms in q
– Rank score computation - To rank the resulting documents/pages, using
• Content-based ranking
• Link-based ranking
Inverted Index - Boolean Retrieval
(Figure) A toy inverted index over four documents: the dictionary lists the terms blue, cat, egg,
fish, green, ham, hat, one, red, two, and each term points to the postings list of document IDs
(1-4) that contain it.
Boolean Retrieval
• To execute a Boolean query
– Build query syntax tree
– For each clause, look up postings
– Traverse postings and apply Boolean operator
• Efficiency analysis
– Postings traversal is linear (assuming sorted postings)
– Start with shortest posting first
Example query: ( blue AND fish ) OR ham
Query Processing - AND
• Consider processing the query - Brutus AND Caesar
– Locate Brutus in the Dictionary
• Retrieve its postings
– Locate Caesar in the Dictionary
• Retrieve its postings
– Merge the two postings
The Merge
• Walk through the two postings simultaneously, in time linear in the total
number of postings entries
• These postings are sorted by docID
Example: Brutus → 2, 4, 8, 16, 32, 64, 128; Caesar → 1, 2, 3, 5, 8, 13, 21, 34; intersection → 2, 8
If the list lengths are x and y, the merge takes O(x+y) operations
Postings Lists – Merge Algorithm
(Figure: pseudocode for merging two docID-sorted postings lists by walking both lists with two
pointers; a sketch follows below.)
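A sketch of the linear-time intersection described above; the Brutus/Caesar postings from the earlier example are reused as test data:

```python
def intersect(p1, p2):
    """Merge two docID-sorted postings lists in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # -> [2, 8]
```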
Inverted Index: TF.IDF
(Figure) The same toy index extended for ranked retrieval: the dictionary stores the document
frequency (df) of each term, and each posting stores the term frequency (tf) of the term in that
document.
Positional Indexes
• Store term position in postings
• Supports richer queries (proximity….)
• Leads to larger indexes…
(Figure) Each posting additionally stores the list of positions at which the term occurs in the
document.
Inverted Index: Positional Information
(Figure) The toy index with document frequencies, term frequencies, and, for each posting, the
document ID and the positions of the term within that document.
Optimization of Index Search
• What is the best order of words for query processing?
• Consider a query that is an AND of n terms
• Process words in order of increasing freq
– start with smallest set, then keep cutting further
– This is why we kept document freq. in dictionary
• For each of the n terms, get its postings, then AND them together
Example - Query: Brutus AND Calpurnia AND Caesar
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 16, 21, 34
Calpurnia → 16
Process in order of increasing frequency: start with Calpurnia, intersect with Brutus, then with Caesar
More General Optimization
• (madding OR crowd) AND (ignoble OR strife)
• Get document frequencies for all terms
• Conservative - estimate the size of each OR by the sum of its doc
frequencies
• Process in increasing order of OR sizes
Query Optimization
Term            Frequency
eyes            213312
kaleidoscope    87009
marmalade       107913
skies           271658
tangerine       46653
trees           316812
Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Estimated OR sizes (sum of document frequencies):
kaleidoscope OR eyes = 300321; tangerine OR trees = 363465; marmalade OR skies = 379571
Recommended order: (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)
Skip Pointers
• Intersection is the most important operation for search engines
• This is because in web search, most queries are implicitly intersections
• car repairs and Britney Spears songs translate into car AND repairs and Britney
AND spears AND songs, which means intersecting two or more
postings lists in order to return a result
• Because intersection is so crucial, search engines try to speed it up in any
way possible. One such way is to use skip pointers
Optimized Skip Pointers
• Augment Postings with skip pointers (at indexing time)
• Why? - To skip postings that will not figure in the search results.
• Where do we place skip pointers?
(Figure) Two postings lists augmented with skip pointers, e.g.
2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers to 41 and 128
1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers to 11 and 31
Query Processing with Skip Pointers
• Start using the normal intersection algorithm
• Continue until the lists match on 12 and advance to the next item in each list. At this point
the "car" list is on 48 and the "repairs" list is on 13, and 13 has a skip pointer
• Check the value the skip pointer is pointing at (i.e. 29) and if this value is less than
the current value of the "car" list (which it is), we follow our skip pointer and jump
to this value in the list
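A sketch of intersection with skip pointers. The postings values reuse the lists reconstructed from the figure above, and the skip pointers are simulated as jumps of roughly √L positions (both choices are illustrative):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two docID-sorted postings lists, following a simulated skip pointer
    (a jump of ~sqrt(L) positions) whenever it does not overshoot the other list."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer on p1 only if its target is still <= p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

car = [2, 4, 8, 41, 48, 64, 128]
repairs = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(car, repairs))  # -> [2, 8]
```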
Where to Place Skips - Tradeoff
• More skips → shorter skip spans ⇒ more likely to skip
– But lots of comparisons to skip pointers.
• Fewer skips → fewer pointer comparisons, but then long skip spans ⇒
few successful skips
Placing Skips
• Simple heuristic - for postings of length L, use √L evenly-spaced skip
pointers
• This ignores the distribution of query terms
• Easy if the index is relatively static
• Harder if L keeps changing because of updates
• How much do skip pointers help?
– When CPUs were slow, skip pointers used to help a lot
– Today’s CPUs are fast and disk is slow, so reducing disk postings list size
dominates
Strengths and Weaknesses
• Strengths
– Very precise queries can be specified
– Easy to implement in the simple form
• Weaknesses
– Retrieval results are poor since term frequency is not considered - No index
term weighting
– Specifying the query may be difficult for casual users
– Result might be 1 or 0 (unordered set of documents)
Similarity Based Retrieval
• Retrieve documents that are similar to a given document
– Similarity may be defined on the basis of common words
– Find k terms in A with highest TF (d, t ) / n (t ) and use these terms to find
relevance of other documents
• Relevance feedback
– Similarity can be used to refine answer set to keyword query
– User selects a few relevant documents from those retrieved by keyword
query and system finds other documents similar to these
Similarity Based Retrieval - Vector Space Model
• Define an n-dimensional space, where n is the number of words in the
document set
• Vector for document d goes from the origin to a point whose i-th coordinate is TF(d, t) / n(t)
• The cosine of the angle between vectors of two documents is used as a
measure of their similarity
Vector Space Model
• Assumption - Documents that are close together in vector space talk
about same things
• Hence retrieve documents depending on closeness to the query
(similarity ~ closeness)
Vector Space Model
• Documents are treated as a bag of words or terms
• Documents are represented in a high dimensional space
• Each document is represented as a vector
• Implemented by forming term-document matrix
• Dimension of space depends on number of indexing terms which are chosen to be
relevant for the collection
• Rank according to the similarity metric (e.g. cosine) between the query and document
• The smaller the angle between the document and query the more similar they are
believed to be
– Documents are represented by a term vector
– Queries are represented by a similar vector
Vector Space Model
• Term weights are not pure 0 or 1
• Each weight is computed based on some variations of TF or TF-IDF
scheme
• Query has the same shape as document (m dimensional vector)
– Cosine is commonly used in text clustering
• Measure of similarity between query q and a document dj is a cosine of
angle between these vectors
• Ad-hoc weightings (term frequency x inverse document frequency ) used
• No optimal ranking
Vector Space Model
• Vector space = all the keywords encountered <t1, t2, t3, …, tn>
• Document D = < a1, a2, a3, …, an> where ai = weight of ti in D
• Query Q = < b1, b2, b3, …, bn> where bi = weight of ti in Q
• R(D,Q) = Sim(D,Q)
• Consider Query q
– Relevance of di to q - Compare similarity of query q and document di
– Cosine similarity (the cosine of the angle between the two vectors)
Vectors Plot
Star
Diet
Document about astrology
Documents about movie stars
Documents about mammal behavior
Term-Document Matrix
• A collection of n documents can
be represented in vector space
model using this matrix
• A m × n matrix where m is
number of terms and n is number
of documents
• Term - row of the term-document matrix
• Document - column of the term-document matrix

        d1    d2    …    dn
  t1 [ a11   a12   …   a1n ]
  t2 [ a21   a22   …   a2n ]
  …
  tm [ am1   am2   …   amn ]  =  A
Similarity Formulae
• Dot Product: Sim(D, Q) = Σ_i a_i b_i
• Cosine: Sim(D, Q) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )
• Dice: Sim(D, Q) = 2 Σ_i a_i b_i / ( Σ_i a_i² + Σ_i b_i² )
• Jaccard: Sim(D, Q) = Σ_i a_i b_i / ( Σ_i a_i² + Σ_i b_i² − Σ_i a_i b_i )
(Figure: document vector D and query vector Q in a two-term space t1, t2, with the angle between them.)
Index Storage
• Term-Document matrix is very sparse
• A few hundred terms per document and a few terms per query, while the
term space is large (~100k)
• Stored as
D1 → {(t1, a1), (t2,a2), …}
t1 → {(D1,a1), …}
Implementation
• Implementation of Vector Space Model using dot product
– Naïve implementation: O(m*n)
– Implementation using inverted file
• Given a query = {(t1,b1), (t2,b2)}
1. Find the sets of related documents through inverted file for t1 and t2
2. Calculate the score of the documents for each weighted term (t1,b1) →
{(D1,a1 *b1), …}
3. Combine the sets and sum the weights (∑)
4. O(|Q|*n)
Similarity Calculation
• Consider two documents D1, D2 and a query Q
– D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
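A quick check of this example; the cosine values printed below are computed here, not taken from the slide:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q  = (1.5, 1.0, 0.0)

print("sim(D1, Q) =", round(cosine(D1, Q), 3))  # ~0.87
print("sim(D2, Q) =", round(cosine(D2, Q), 3))  # ~0.97 -> D2 ranks above D1
```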
Ranked Retrieval
• Documents are ranked based on their score
• Advantages
– Query is easy to specify
– Output is ranked based on the estimated relevance of the documents to the
query
– A wide variety of theoretical models exist
• Disadvantages
– Query less precise (although weighting can be used)
Example Query
• A document space is defined by three terms – computer, application, users
• A set of documents defined as
– A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
– A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
– A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• Query is computer and application
• What documents should be retrieved?
Example Query
• In Boolean query matching
– AND - document A4, A7 will be retrieved
– OR - retrieved: A1, A2, A4, A5, A6, A7, A8, A9
• In similarity matching (cosine)
– q=(1, 1, 0)
– S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
– S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
– S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
– Document retrieved set (with ranking)
• {A4, A7, A1, A2, A5, A6, A8, A9}
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• TF x IDF
Binary Weights
• Only the presence 1 or absence 0 of a term is included in the vector
docs t1 t2 t3 RSV=Q.Di
D1 1 0 1 4
D2 1 0 0 1
D3 0 1 1 5
D4 1 0 0 1
D5 1 1 1 6
D6 1 1 0 3
D7 0 1 0 2
D8 0 1 0 2
D9 0 0 1 3
D10 0 1 1 5
D11 1 0 1 3
Q 1 2 3
q1 q2 q3
Raw Term Weights
• The frequency of occurrence for the term in each document is included in
the vector
docs t1 t2 t3
D1 2 0 3
D2 1 0 0
D3 0 4 7
D4 3 0 0
D5 1 6 3
D6 3 5 0
D7 0 8 0
D8 0 10 0
D9 0 0 1
D10 0 3 5
D11 4 0 1
TF.IDF - Term Weighting
• Assigns a tf * idf weight to each term in each document
• Term weights components
– Local - How important is the term in this document?
– Global - How important is the term in the collection?
• Logic
– Terms that appear often in a document should get high weights
– Terms that appear in many documents should get low weights
• Mathematical Capturing
– Term Frequency (local)
– Inverse Document Frequency (global)
TF.IDF - Term Weighting
• tf = Term Frequency
– Frequency of a term/keyword in a document
– The higher the tf, the higher the importance (weight) for the doc.
• df = document frequency
– Number of documents containing the term
– Distribution of the term
• idf = Inverse Document Frequency
– Unevenness of term distribution in the corpus
– Specificity of term to a document
• The more the term is distributed evenly, the less it is specific to a document:
weight(t, D) = tf(t ,D) * idf(t)
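A small sketch of this weighting over a toy collection; the documents are illustrative, and log base 10 is used to match the idf examples later in the deck:

```python
import math
from collections import Counter

docs = {
    "D1": "new home sales top forecasts",
    "D2": "home sales rise in july",
    "D3": "increase in home sales in july",
}
N = len(docs)

# tf(t, D): raw count of t in D; df(t): number of docs containing t; idf(t) = log10(N / df(t)).
tfs = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tfs.values() for term in counts)

def weight(term, doc):
    return tfs[doc][term] * math.log10(N / df[term])

for d in docs:
    print(d, {t: round(weight(t, d), 3) for t in tfs[d]})
```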
Term Weighting
• Based on common sense, but adjusted/engineered following experiments
• Terms that occur in only a few documents are often more valuable than
ones that occur in many - IDF
• The more often a term occurs in a document, the more likely it is to be
important for that document - TF
• A term that occurs for same number of times in a short and a long
document is likely to be more valuable for the short document
• Word occurrence frequency is a measure of the significance of terms and
their discriminatory power
Term Significance
(Figure) Plotted against word frequency: terms that are too frequent are useless discriminators,
terms that are too rare make no significant contribution to the content of the document, and the
significant terms lie in between.
TF.IDF Weight
• Term frequency weight measures importance in document:
• Inverse document frequency measures importance in collection:
• Some heuristic modifications
Relevance Ranking Using Terms
• TF-IDF (Term frequency/Inverse Document frequency) ranking
– Let n(d) = number of terms in the document d
– n(d, t) = number of occurrences of term t in the document d.
– Relevance of a document d to a term t
• The log factor is to avoid excessive weight to frequent terms
– Relevance of document to query Q
TF(d, t) = log( 1 + n(d, t) / n(d) )
r(d, Q) = Σ_{t ∈ Q} TF(d, t) / n(t)
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10,000 documents (idf = log10(N / df)):
– term in 10,000 documents: log(10000 / 10000) = 0
– term in 5,000 documents: log(10000 / 5000) = 0.301
– term in 20 documents: log(10000 / 20) = 2.698
– term in 1 document: log(10000 / 1) = 4
TF.IDF Normalization
• Normalize the term weights
– So longer documents are not unfairly given more weight
– Normalize usually means force all values to fall within a certain range, usually
between 0 and 1, inclusive.
w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{k=1..t} (tf_ik)² · [log(N / n_k)]² )
where N is the number of documents and n_k is the document frequency of term k
Relevance Ranking Using Terms
• Ranking of documents on the basis of estimated relevance to a query
• Relevance ranking is based on factors
– Term frequency
• Frequency of occurrence of query keyword in document
– Inverse document frequency
• How many documents the query keyword occurs in
– Fewer → give more importance to keyword
– Hyperlinks to documents
• More links to a document → document is more important
Relevance Ranking Using Terms
• Documents are returned in decreasing order of relevance score
– Usually only top few documents are returned
• Most systems refine the above model
– Common words like a, an, the, it etc are eliminated
• Called stop words
– Words that occur in title, author list, section headings are given greater
importance
– Words whose first occurrence is late in the document are given lower importance
– Proximity: if keywords in query occur close together in the document, the
document has higher importance than if they are far apart
Relevance Using Hyperlinks
• Number of documents relevant to a query can be enormous if only term
frequencies are taken into account
• Using term frequencies makes spamming easy
– A fitness center could add many occurrences of the words like weights to its
page to make its rank very high
• Most of the time people are looking for pages from popular sites
• Idea - use popularity of web site to rank site pages that match given
keywords
• Problem - hard to find actual popularity of site
Relevance Using Hyperlinks
• Solution: use number of hyperlinks to a site as a measure of the popularity or
prestige of the site
– Count only one hyperlink from each site
– Popularity measure is for site, not for individual page
• But most hyperlinks point to root of site
• Concept of site is difficult to define since a URL prefix like cs.yale.edu contains many
unrelated pages of varying popularity
• Refinements
– When computing prestige based on links to a site, give more weight to links from
sites that themselves have higher prestige
• Definition is circular
• Set up and solve system of simultaneous linear equations
– Above idea is the basis of the Google Page Rank ranking mechanism
Relevance Using Hyperlinks
• Connections in social networking
– Ranks prestige of people
– Someone known by multiple prestigious people has higher prestige
• Hub and authority based ranking
– A hub is a page that stores links to many pages (on a topic)
– An authority is a page that contains actual information on a topic
– Each page gets a hub prestige based on prestige of authorities that it points to
– Each page gets an authority prestige based on prestige of hubs that point to it
– Prestige definitions are cyclic and can be obtained by solving linear equations
– Use authority prestige when ranking answers to a query
Probability Ranking Principle
• Given a user query q and a document d, estimate the probability that the
user will find d relevant
• Robertson (1977)
• If a reference retrieval system’s response to each request is a ranking of
the documents in the collection in order of decreasing probability of
relevance to the user who submitted the request,
• and the probabilities are estimated as accurately as possible on the basis of
whatever data have been made available to the system for this purpose,
• then the overall effectiveness of the system to its user will be the best that is
obtainable on the basis of those data
IR as Classification
Bayes Classifier
• Bayes Decision Rule
– A document D is relevant if P(R|D) > P(NR|D)
• Estimating probabilities
– Use Bayes Rule
– Classify a document as relevant if
– Left hand side is likelihood ratio
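The slide's formula is not reproduced in this transcript; the standard form of the rule, stated here for completeness, is:

```latex
% Bayes decision rule via Bayes' rule: classify D as relevant when
P(R \mid D) > P(NR \mid D)
\;\Longleftrightarrow\;
\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}
% i.e. the likelihood ratio on the left exceeds a constant threshold.
```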
Estimating P(D|R)
• Assume independence
• Binary independence model
– based on information related to presence and absence of terms in relevant and non-relevant
documents
– information acquired through relevance feedback process
• user stating which of the retrieved documents are relevant / non-relevant
– Based on the probability ranking principle, which ensures an optimal ranking
– document represented by a vector of binary features indicating term occurrence (or non-
occurrence)
– pi is probability that term i occurs (has value 1) in relevant document, si is probability of
occurrence in non-relevant document
Binary Independence Model
Binary Independence Model
• Scoring function is
• Query provides information about relevant documents
• If we assume pi constant, si approximated by entire collection, get idf-like
weight
Contingency Table
• The scoring function is -
BM25 Ranking Algorithm
• Popular and effective ranking algorithm based on binary independence
model
– adds document and query term weights
– k1, k2 and K are parameters whose values are set empirically
– dl is doc length
– Typical TREC value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75
BM25 Example
• Query with two terms, player and Pele (qf = 1 for each term)
• No relevance information (r and R are zero)
• N = 500,000 documents
• player occurs in 40,000 documents (n1 = 40, 000)
• Pele occurs in 300 documents (n2 = 300)
• Player occurs 15 times in doc (f1 = 15)
• Pele occurs 25 times (f2 = 25)
• Document length is 90% of the average length (dl/avdl = .9)
• k1 = 1.2, b = 0.75, and k2 = 100
• K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
BM25 Example
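A sketch of this computation, assuming the standard BM25 form in which, with no relevance information, the term weight reduces to log((N − n + 0.5)/(n + 0.5)); the result should come out close to the 20.66 shown in the table below, with small differences due to rounding:

```python
import math

# Parameters from the example above; r = R = 0 (no relevance information).
N, k1, k2, b = 500_000, 1.2, 100, 0.75
dl_over_avdl = 0.9
K = k1 * ((1 - b) + b * dl_over_avdl)      # 1.2 * (0.25 + 0.75 * 0.9) = 1.11

def bm25_term(n, f, qf):
    """Contribution of one query term: idf-like weight x doc-TF part x query-TF part."""
    idf_like = math.log((N - n + 0.5) / (n + 0.5))
    return idf_like * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))

score = bm25_term(n=40_000, f=15, qf=1)    # player
score += bm25_term(n=300, f=25, qf=1)      # Pele
print(round(score, 2))                      # ~20.6
```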
BM25 Example - Effect of Term Frequencies
Frequency of player Frequency of Pele BM25 Score
15 25 20.66
15 1 12.74
15 0 5.00
1 25 18.2
0 25 16.66
Relevance Feedback Process – Iterative Cycle
1. User is presented with a list of retrieved documents
2. User marks those which are relevant (or not relevant)
   – In practice the top 10-20 ranked documents are examined
   – Incremental: one document after the other
3. The relevance feedback algorithm selects important terms from documents assessed relevant by users
4. The relevance feedback algorithm emphasises the importance of these terms in a new query in the following ways
   – Query expansion - add these terms to the query
   – Term reweighting - modify the term weights in the query
   – Query expansion + term reweighting
5. The updated query is submitted to the system
6. If the user is satisfied with the new set of retrieved documents, the feedback process stops; otherwise go to step 2
• Approaches
– Approach 1: Add/Remove/Change query terms
– Approach 2: Re-weight query terms
Relevance Feedback Issues
• Relevance feedback
– Often users are not reliable in making relevance assessments, or do not make
relevance assessments
– Implicit relevance feedback by looking at what users access
• clicks in web logs
• works well - “wisdom of the crowd"
– Positive, negative
– Partial relevance assessments (very relevant or partially relevant)?
– Why is a document relevant?
• Interactive query expansion (as opposed to automatic)
– Users choose the terms to be added
Latent Semantic Indexing - LSI
• Term document matrices are very large but people talk about few things
• So how to represent term document by a lower dimensional latent space?
• Latent Semantic Analysis is the analysis of latent, i.e. hidden, semantics in a corpus of
text
• LSI transforms the original data in a different space so that two documents/words
about the same concept are mapped close (so that they have higher cosine similarity)
• LSI achieves this by Singular Value Decomposition (SVD) of term-document matrix
• Maps documents and terms to a low dimensional representation
• Design a mapping such that the low dimensional space reflects semantic association
• Compute document similarity based on the inner product in this latent semantic space
Truncated SVD
LSI
• For LSI, truncated SVD is used: A_k = U_k Σ_k V_k^T
• Where
– U_k is an m×k matrix whose columns are the first k left singular vectors of A
– Σ_k is a k×k diagonal matrix whose diagonal is formed by the k leading singular
values of A
– V_k is an n×k matrix whose columns are the first k right singular vectors of A
• Rows of U_k = terms
• Rows of V_k = documents
• SVD can be used to compute optimal low rank approximations
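A minimal numpy sketch of this truncated SVD; the toy term-document matrix and the choice of k are illustrative:

```python
import numpy as np

# Toy term-document matrix A (m terms x n documents).
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # first k singular triplets

A_k = U_k @ S_k @ V_k.T      # rank-k approximation of A
doc_vectors = V_k @ S_k      # rows = documents in the k-dimensional LSI space

print(np.round(A_k, 2))
print(np.round(doc_vectors, 2))
```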
LSI
• In truncated LSI, first k independent linear components of A (singular vectors and values) are
included
• Documents are projected in means of least squares on space spread by first k singular
vectors of A (LSI space)
• First k components capture the major associational structure in the term-document
matrix and throw out the noise
• Minor differences in terminology used in documents are ignored
• Closeness of objects (queries and documents) is determined by overall pattern of term
usage, so it is context based
• Documents which contain synonyms are closer in LSI space than in original space;
• Documents which contain polysemy (a word having different meaning in different contexts )
in different context are more far in LSI space than in original space
Concept Indexing (CI)
• Lexically focused relevance estimation is less effective than semantically
focused estimation
• Indexing using concept decomposition (CD) instead of SVD like in LSI
• Concept decomposition was introduced in 2001
• I. S. Dhillon, D.S. Modha: Concept decomposition for large sparse text
data using clustering, Machine Learning, 42:1, 2001, pp. 143-175
Concept Decomposition
• Cluster the documents in term-document matrix A into k groups
• Clustering algorithms
– Spherical k-means algorithm
– Fuzzy k-means algorithm
• Spherical k-means algorithm is a variant of k-means algorithm which uses the fact that
vectors of documents are of the unit norm
• Centroids of groups = concept vectors
• Concept matrix is the matrix whose columns are the centroids of the groups:
C_k = [c_1 c_2 … c_k], where c_j is the centroid of the j-th group
Concept Decomposition
• Next step: calculate the concept decomposition
• The concept decomposition D_k of term-document matrix A is the least squares
approximation of A on the space of concept vectors:
D_k = C_k Z, where Z is the solution of the least squares problem Z = (C_k^T C_k)^(-1) C_k^T A
• Rows of C_k = terms
• Columns of Z = documents
System Evaluation
• There are many retrieval models/ algorithms/ systems
• Which one is the best?
– How effective is a system at retrieving relevant documents?
• What is the best component for
– Ranking function (dot-product, cosine, …)
– Term selection (stop word removal, stemming…)
– Term weighting (TF, TF-IDF,…)
• How far down the ranked list will a user need to look to find some / all
relevant documents?
Effectiveness
• Goal of an IR system is to retrieve as many relevant documents as
possible and as few non-relevant documents as possible.
• Evaluating the above consists of a comparative evaluation of the technical
performance of IR system(s)
– In traditional IR, technical performance means the effectiveness of the IR
system: the ability of the IR system to retrieve relevant documents and
suppress non-relevant documents
• Effectiveness is measured by the combination of recall and precision
Measuring Query Retrieval Effectiveness
• Information-retrieval systems save space by using index structures that
support only approximate retrieval
• May result in
– False negative (false drop) - some relevant documents may not be retrieved
– False positive - some irrelevant documents may be retrieved
– For many applications a good index should not permit any false drops, but
may permit a few false positives
Precision and Recall
(Figure) All documents fall into four categories, by whether they are retrieved and whether they are relevant:
– Retrieved & relevant
– Retrieved & irrelevant
– Not retrieved but relevant
– Not retrieved & irrelevant
In the Venn-diagram view, the set of relevant documents and the set of retrieved documents
overlap in the region "relevant & retrieved".
Precision and Recall
• In the ideal case, the set of retrieved documents is equal to
the set of relevant documents. However, in most cases, the
two sets will be different
• This difference is formally measured with precision and
recall
• Precision - what percentage of the retrieved documents are
relevant to the query
• Recall - what percentage of the documents relevant to the
query were retrieved
recall = (number of relevant documents retrieved) / (total number of relevant documents)
precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Measuring Retrieval Effectiveness
• The above two measures do not take into account where the relevant documents are retrieved, that is, at
which rank (crucial since the output of most IR systems is a ranked list of documents).
• This is very important because an effective IR system should not only retrieve as many relevant
documents as possible and as few non-relevant documents as possible, but also it should retrieve
relevant documents before the non-relevant ones.
• Recall vs. Precision tradeoff
– Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many
irrelevant documents would be fetched, reducing precision
• Measures of retrieval effectiveness
– Recall as a function of number of documents fetched or
– Precision as a function of recall
• Equivalently, as a function of number of documents fetched
– Example - precision of 75% at recall of 50%, and 60% at a recall of 75%
• Problem: which documents are actually relevant, and which are not
Systems Evaluation - Challenges
• Effectiveness is related to the relevancy of retrieved items
• Relevancy is not typically binary but continuous
• Even if relevancy is binary, it can be a difficult judgment to make
• Relevancy from a human standpoint
– Subjective: Depends upon a specific user’s judgment
– Situational: Relates to user’s current needs
– Cognitive: Depends on human perception and behavior
– Dynamic: Changes over time
• Total number of relevant items is sometimes not available
– Sample across the database and perform relevance judgment on these items
– Apply different retrieval algorithms to the same database for the same query. The aggregate of
relevant items is taken as the total relevant set
Precision and Recall - Trade-offs
(Figure) Precision plotted against recall: the ideal system keeps precision high at all recall levels;
a system that returns relevant documents but misses many useful ones sits at high precision and
low recall; a system that returns most relevant documents but includes lots of junk sits at high
recall and low precision.
Recall / Precision
• Let us assume that for a given query, the following documents are relevant
• {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
• Now suppose that the following documents are retrieved for that query
• For each relevant document (in red bold), we calculate the precision value and the recall value
• For example, for d56, we have 3 retrieved documents, and 2 among them are relevant, so the
precision is 2/3. We have 2 of the relevant documents so far retrieved (the total number of relevant
documents being 10), so recall is 2/10.
Recall / Precision
• For each query, we obtain pairs of recall and
precision values
• In our example, we would obtain (1/10, 1/1)
(2/10, 2/3) (3/10, 3/6) (4/10, 4/10) (5/10, 5/14) . .
. which are usually expressed in % (10%,
100%) (20%, 66.66%) (30%, 50%) (40%, 40%)
(50%, 35.71%) . . .
• This can be read for instance: at 20% recall,
we have 66.66% precision; at 50% recall, we
have 35.71% precision
• The pairs of values are plotted into a graph,
which has the following curve
The complete methodology
• For each IR system / IR system version
– For each query in the test collection
• We first run the query against the system to obtain a ranked list of retrieved
documents
• We use the ranking and relevance judgements to calculate recall/precision pairs
– Then we average recall / precision values across all queries, to obtain an
overall measure of the effectiveness
Averaging
Computing Recall / Precision Points
• For a given query, produce the ranked list of retrievals
• Adjusting a threshold on this ranked list produces different sets of
retrieved documents, and therefore different recall/precision measures
• Mark each document in the ranked list that is relevant according to the
gold standard
• Compute a recall/precision pair for each position in the ranked list that
contains a relevant document
Computing Recall / Precision Points
• Let total # of relevant docs = 6
• Check each new recall point
Missing one
relevant document.
Never reach
100% recall
n doc # relevant
1 588 X R=1/6=0.167; P=1/1=1
2 589 X R=2/6=0.333; P=2/2=1
3 576
4 590 X R=3/6=0.5; P=3/4=0.75
5 986
6 592 X R=4/6=0.667; P=4/6=0.667
7 984
8 988
9 578
10 985
11 103
12 591
13 772 X R=5/6=0.833; p=5/13=0.38
14 990
Computing Recall / Precision Points
• Let total # of relevant docs = 6
• Check each new recall point
n doc # relevant
1 588 X R=1/6=0.167; P=1/1=1
2 576
3 589 X R=2/6=0.333; P=2/3=0.667
4 342
5 590 X R=3/6=0.5; P=3/5=0.6
6 717
7 984
8 772 X R=4/6=0.667; P=4/8=0.5
9 312 X R=5/6=0.833; P=5/9=0.556
10 498
11 113
12 628
13 772
14 592 X R=6/6=1.0; p=6/14=0.429
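A small sketch that recomputes the recall/precision points and the average precision for the second ranked list above; the relevance flags per rank are taken from the table, and the function name is illustrative:

```python
def recall_precision_points(relevant_flags, total_relevant):
    """Yield (recall, precision) at each rank that retrieves a relevant document."""
    points, hits = [], 0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Ranked list from the second example: 1 marks a relevant document at that rank.
flags = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
points = recall_precision_points(flags, total_relevant=6)
for r, p in points:
    print(f"R={r:.3f}  P={p:.3f}")

avg_precision = sum(p for _, p in points) / 6
print("average precision =", round(avg_precision, 3))   # ~0.625
```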
Average Recall / Precision Curve
• Typically average performance over a large set of queries
• Compute average precision at each standard recall level across all queries
• Plot average precision/recall curves to evaluate overall system
performance on a document/query corpus
Compare Systems
• The curve closest to the upper right-hand corner of the graph indicates
the best performance
R- Precision
• It is the precision at position R in the
ranking of results for a query
• R = number of relevant documents
• R-precision places lower emphasis on the
exact ranking of the relevant documents
returned
• This can be useful when a topic has a large
number of judged relevant documents
• Or when an evaluator is more interested in
measuring aggregate performance as
opposed to the fine-grained quality of the
ranking provided by the system.
n doc # relevant
1 588 x
2 589 x
3 576
4 590 x
5 986
6 592 x
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x
14 990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
F-Measure
• A measure of performance that considers both recall and precision
• Harmonic mean of recall and precision
• Compared to arithmetic mean, both need to be high for harmonic mean
to be high
F = 2RP / (R + P) = 2 / ( 1/R + 1/P )
E Measure - Parameterized F Measure
• A variant of F measure that allows weighting emphasis on precision over
recall
• Value of β controls trade-off
– β = 1: Equally weight precision and recall (E=F)
– β > 1: Weight recall more
– β < 1: Weight precision more
E = (1 + β²) R P / (β² P + R) = (1 + β²) / ( β²/R + 1/P )
Mean Average Precision (MAP)
• Average Precision: Average of the precision values at the points at which
each relevant document is retrieved
– Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
– Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
• Mean Average Precision: Average of the average precision value for a set
of queries
• Provides a single number value to decide better algorithm
Non-Binary Relevance
• Documents are rarely entirely relevant or non-relevant to a query
• Many sources of graded relevance judgments
– Relevance judgments on a 5-point scale
– Multiple judges
– Click distribution and deviation from expected levels (but click-through !=
relevance judgments)
A/B Testing
• Exploits existing user base to provide useful feedback
• Randomly send a small fraction (1−10%) of incoming users to a variant of
the system that includes a single change
• Judge effectiveness by measuring change in click-through
• The percentage of users that click on the top result (or any result on the
first page)
References
1. Baeza-Yates, R. and Ribeiro-Neto, B. (2011) - Modern Information Retrieval: The Concepts and Technology behind Search - Addison Wesley
2. Grossman, D. A. and Frieder, O. (2004) - Information Retrieval: Algorithms and Heuristics, 2nd ed., volume 15 of The Information Retrieval Series - Springer
3. Manning, C. D., Raghavan, P., and Schuetze, H. (2008) - Introduction to Information Retrieval - Cambridge University Press
4. Roelleke, T., Tsikrika, T., and Kazai, G. (2006) - A general matrix framework for modelling information retrieval - Information Processing & Management (IP&M), Special Issue on Theory in Information Retrieval, 42(1)
5. Zaragoza, H., Hiemstra, D., and Tipping, M. (2003) - Bayesian extension to the language model for ad hoc information retrieval - In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4-9, New York, NY, USA. ACM Press
6. Roelleke, T. and Wang, J. (2006) - A parallel derivation of probabilistic information retrieval models - In ACM SIGIR, pages 107-114, Seattle, USA
7. Roelleke, T. and Wang, J. (2008) - TF-IDF uncovered: A study of theories and probabilities - In ACM SIGIR, pages 435-442, Singapore
8. Robertson, S. (2004) - Understanding inverse document frequency: On theoretical arguments for IDF - Journal of Documentation, 60:503-520
9. Metzler, D. and Croft, W. B. (2004) - Combining the language model and inference network approaches to retrieval - Information Processing & Management, 40(5):735-750
Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comSimon Hughes
 
How search engines work Anand Saini
How search engines work Anand SainiHow search engines work Anand Saini
How search engines work Anand SainiDr,Saini Anand
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information ArchitectureRob Bogue
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 
Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Heimo Hänninen
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Nikola Milosevic
 

Similar to IR (20)

Chap1
Chap1Chap1
Chap1
 
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
 
Info 2402 irt-chapter_2
Info 2402 irt-chapter_2Info 2402 irt-chapter_2
Info 2402 irt-chapter_2
 
Chapter 6 Query Language .pdf
Chapter 6 Query Language .pdfChapter 6 Query Language .pdf
Chapter 6 Query Language .pdf
 
3_Indexing.ppt
3_Indexing.ppt3_Indexing.ppt
3_Indexing.ppt
 
Large-Scale Semantic Search
Large-Scale Semantic SearchLarge-Scale Semantic Search
Large-Scale Semantic Search
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Information retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of wordsInformation retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of words
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
 
Text Mining
Text MiningText Mining
Text Mining
 
Web indexing finale
Web indexing finaleWeb indexing finale
Web indexing finale
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 
How search engines work Anand Saini
How search engines work Anand SainiHow search engines work Anand Saini
How search engines work Anand Saini
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information Architecture
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)
 

More from Girish Khanzode (12)

Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
NLP
NLPNLP
NLP
 
NLTK
NLTKNLTK
NLTK
 
NoSql
NoSqlNoSql
NoSql
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Hadoop
HadoopHadoop
Hadoop
 
Language R
Language RLanguage R
Language R
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
Funtional Programming
Funtional ProgrammingFuntional Programming
Funtional Programming
 

Recently uploaded

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

IR

  • 13. Indexing Based IR • Simple queries – composed of two or three, perhaps a dozen, keywords – web retrieval • Boolean queries – `Database AND computer` – online catalog and patent search • Context queries – proximity search, phrase queries
  • 14. Sorting & Ranking • User sends a query to the search system, which returns a set of hits • For a large document collection this set can be very large • The value of the results depends on the order in which the hits are presented • Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not) • Ranking methods – Sorting the hits (by date…) – Ranking the hits by similarity between query and document – Ranking the hits by the importance of the documents
  • 15. Bag of Words Model • The most common way to represent documents in IR • How to weight a word within a document – Boolean: 1 if word i is in doc j, 0 otherwise – Tf*idf and others - the weight is a function of the word frequency in the document and of the frequency of documents with that word • What is a word – Single, inflected word (going) – Lemmatised word (going, go, gone → go) – Multi-word, proper nouns, numbers, dates (board of directors, John Stack, April, 2012) – Meaning: plan, project, design → PLAN#03
  • 16. Bag of Words Model • Treats all the words in a document as index terms • Assigns a weight to each term based on importance (or presence/absence of word) • Disregards order, structure, meaning of the words • Simple but effective • Assumptions – Term occurrence is independent – Document relevance is independent – Words are well-defined • Consider three documents – John likes to watch movies. – Mary likes movies too. – John also likes football • The resulting bag-of-words vectors are sketched below
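A minimal sketch, in plain Python, of the bag-of-words vectors for the three example documents above (lowercasing and dropping punctuation are the only preprocessing assumed):

```python
from collections import Counter
import re

docs = [
    "John likes to watch movies.",
    "Mary likes movies too.",
    "John also likes football",
]

# Tokenize: lowercase and keep alphabetic tokens only
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Vocabulary = union of all tokens (sorted so vector positions are reproducible)
vocab = sorted({t for doc in tokenized for t in doc})

# Each document becomes a vector of raw term counts over the vocabulary
counts = [Counter(doc) for doc in tokenized]
vectors = [[c[term] for term in vocab] for c in counts]

print(vocab)
for v in vectors:
    print(v)
```

Note that word order and grammar are discarded: only the per-term counts survive, which is exactly the independence assumption listed above.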
  • 17. Document Parsing • Format and language of each document – What format is it in? – PDF / Word / Excel / HTML? – What language is it in? – What character set is in use? • Each of these is a classification problem • These tasks are often performed heuristically Sec. 2.1
  • 18. Parsing Challenges • Documents being indexed can be from different languages – A single index may contain terms of several languages • A document / components can contain multiple languages / formats – French email with a German PDF attachment • What is a unit document? – A file? – An email? – An email with 5 attachments? – A group of files (PPT or LaTeX as HTML pages) Sec. 2.1
  • 19. Tokenization • Token - instance of a sequence of characters • Each token is a candidate for an index entry after further processing • Input: Customers Suppliers and Factory • Output: Tokens – Customers – Suppliers – Factory Sec. 2.2.1
  • 20. Tokenization Issues • Finland’s capital → Finland? Finlands? Finland’s? • Hewlett-Packard → Hewlett and Packard as two tokens? – state-of-the-art - break up hyphenated sequence – co-education – lowercase, lower-case, lower case ? • San Francisco - one token or two? – How to decide if it is one token? – Cheap San Francisco-Los Angeles fares Sec. 2.2.1
  • 21. Stop Words • Many of the most frequently used words in English are useless in IR and text mining • Those words are called stop words – the, a, and, to, be, of, in, about, with … – Little semantic value – Stop words account for 20-30% of total word counts • A stop list contains stop words that should not be used as index terms – Prepositions – Articles – Pronouns – Some adverbs and adjectives – Some frequent words (e.g. document) Sec. 2.2.2
  • 22. Stop Words • Removal of stop-words improves efficiency and effectiveness of searches • A few standard stop-lists are commonly used • Reduces indexing data file sizes Sec. 2.2.2
  • 23. Stop Words - New Trend • Stop words need very small space for storage due to good compression techniques • Query time is not affected due to stop words because of good query optimization techniques • Stop words are necessary for – Phrase queries - King of Spain – Various song titles.. - Let it be, To be or not to be – Relational queries - flights to London
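A toy illustration of stop-word removal; the stop list below is a tiny hand-made example, not one of the standard stop lists mentioned above:

```python
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "in", "about", "with"}

def remove_stop_words(tokens):
    """Drop tokens found in the stop list; keep everything else in order."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["the", "power", "of", "a", "word", "to", "discriminate", "documents"]
print(remove_stop_words(tokens))   # ['power', 'word', 'discriminate', 'documents']
```

As the preceding slide notes, dropping "of" this way would damage a phrase query such as King of Spain, which is one reason modern engines increasingly keep stop words in the index.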
  • 24. Normalization - Terms • Normalization of words in indexed text and query into the same form • A term is a normalized word type, which is a single entry in the IR system dictionary • Implicitly defines equivalence classes of terms by – Deleting periods to form a term • U.S.A., USA → USA – Deleting hyphens to form a term • anti-discriminatory, antidiscriminatory → antidiscriminatory – Synonyms • Car, Automobile Sec. 2.2.3
  • 25. Case Folding • Reduces all letters to lower case • Exception: upper case words in mid-sentence • General Motors • Fed vs. fed • MIT vs. mit • It is usually best to lowercase everything, since users often type search queries in lowercase regardless of the capitalization of the information they are interested in • Google example – When the query is C.A.T. → the #1 result is for “cat” (Wikipedia), not Caterpillar Inc. Sec. 2.2.3
  • 26. Synonyms and Homonyms • Synonyms – Document - motorcycle repair - motorcycle maintenance – maintenance and repair are synonyms – System can extend query as motorcycle and (repair or maintenance) • Homonyms – Object has different meanings as noun/verb – Can disambiguate meanings to some extent from the context • Extending queries automatically using synonyms can be problematic – Need to understand intended meaning in order to infer synonyms • Or verify synonyms with user – Synonyms may have other meanings as well
  • 27. Normalization - Synonyms • Handling Synonyms and Homonyms – Hand-constructed equivalence classes • car = automobile color = colour – Rewrite words to form equivalence-class terms • When a document contains automobile, index it under car-automobile (and vice-versa) – Expand a query • When a query contains automobile, look under car as well • Spelling mistakes – Soundex - a phonetic algorithm for equivalence classes of words based on phonetic heuristics • Google → Googol
  • 28. Stemming • Techniques used to reduce words of variant form to a stem or root form before indexing • Stemming – Remove endings of word – Computer, Compute, Computes, Computing, Computed, Computation → comput Sec. 2.2.4
  • 29. Stemming • As a result, if the query is house plans, the results will also include all pages containing variations of that term – House plan – House planer – House planning • Increases recall at the expense of precision • Improves effectiveness of IR and text mining – Matching similar words – Reducing indexing size – Combining words with the same roots may reduce indexing size by as much as 40-50% • Produced by Stemmers Sec. 2.2.4
  • 30. Lemmatization • Transform to standard dictionary form lemma, according to syntactic category – verb + ing → verb, noun + s → noun • More accurate than stemming but consumes more resources • Balance noise Vs. recognition rate • Compromise between precision and recall • Increased recall without hurting precision • Produced by lemmatizers • the boy's cars are different colors → the boy car be different color
  • 31. Porter’s Algorithm • The most common algorithm for stemming English and one that has repeatedly been shown to be empirically very effective - suffix stripping • Conventions + 5 phases of reductions – phases applied sequentially – each phase consists of a set of commands – sample convention: of the rules in a compound command, select the one that applies to the longest suffix Sec. 2.2.4
  • 32. Porter Algorithm Steps
  Step 1 - plurals and past participles: SSES → SS (caresses → caress); (*v*) ING → (motoring → motor)
  Step 2 - adj→n, n→v, n→adj, …: (m>0) OUSNESS → OUS (callousness → callous); (m>0) ATIONAL → ATE (relational → relate)
  Step 3: (m>0) ICATE → IC (triplicate → triplic)
  Step 4: (m>1) AL → (revival → reviv); (m>1) ANCE → (allowance → allow)
  Step 5: (m>1) E → (probate → probat); (m>1 and *d and *L) → single letter (controll → control)
  • 33. Stemmers Comparison • Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation • Porter’s: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to pictur of express that is more biolog transpar and access to interpret • Lovins’s: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres • Paice’s : such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret
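To reproduce this kind of output, a Porter stemmer such as the one shipped with NLTK can be used (assuming NLTK is installed; its results may differ slightly from the Lovins and Paice stemmers quoted above):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["computer", "compute", "computes", "computing", "computed", "computation"]

# The variants collapse to (roughly) the same stem, e.g. 'comput'
print([stemmer.stem(w) for w in words])
```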
  • 34. Deep Analysis • Detailed Natural Language Processing (NLP) algorithms • Semantic disambiguation, phrase indexing (board of directors), named entities (President Monti = Mario Monti)... • Standard search engines use deeper techniques (Google)
  • 35. Document Indexing • Store an index to optimize speed and performance • Useful in finding relevant documents for a search query • Reduces time and CPU usage • Without an index, search engine will scan every document in the corpus • An index of 10,000 documents is queried in milliseconds • A sequential scan of every word in 10,000 documents takes much more time • Each document is represented by a set of weighted keywords known as terms – D1 → {(t1, w1), (t2,w2), …} • D1 → {(comput, 0.2), (architect, 0.3), …} • D2 → {(comput, 0.1), (network, 0.5), …} • Inverted file - used in retrieval for higher efficiency – comput → {(D1,0.2), (D2,0.1), …}
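A sketch of inverting the per-document weighted term lists shown above into an inverted file; the document and term names follow the slide's example:

```python
from collections import defaultdict

# Forward representation: document -> [(term, weight), ...]
documents = {
    "D1": [("comput", 0.2), ("architect", 0.3)],
    "D2": [("comput", 0.1), ("network", 0.5)],
}

def build_inverted_file(docs):
    """Invert document -> terms into term -> [(doc_id, weight), ...]."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, weight in terms:
            index[term].append((doc_id, weight))
    return dict(index)

print(build_inverted_file(documents))
# {'comput': [('D1', 0.2), ('D2', 0.1)], 'architect': [('D1', 0.3)], 'network': [('D2', 0.5)]}
```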
  • 36. Boolean Model • Query terms are combined logically using the Boolean operators – AND, OR, NOT – ((asthma AND exercise) AND (NOT cardiac)) – Views each document as a set of words – Precise: document matches a condition or not • Retrieval – Given a Boolean query, system retrieves each document that makes the query logically true – Called exact match • No Rank - A document is judged to be relevant if the terms in the document satisfies the logical expression of the query
  • 37. Inverted index • A data structure that associates each distinct term with a list of all documents in the collection that contain the term • The list is called a posting list Sec. 1.2
  • 39. What Goes in Inverted File • Boolean retrieval – Just the document number • Ranked Retrieval – Document number and term weight (TF, IDF, TF*IDF, ...) • Proximity operators – Word offsets for each occurrence of the term – Example : t17 (doc1,49) (doc1,70) (doc2,3)
  • 40. Inverted File Size • Very compact for Boolean retrieval – About 10% of the size of the documents – If an aggressive stop word list is used • Not much larger for ranked retrieval – Perhaps 20% • Enormous for proximity operators – Sometimes larger than the documents – But access is fast - you know where to look
  • 41. Inverted Index Construction Sec. 1.2 [Figure: the phrase "Friends, Romans, Countrymen" is tokenized into Friends, Romans, Countrymen, normalized to the terms friend, roman, countryman, and each term is linked to the postings list of the documents that contain it]
  • 42. Inverted Index – Search Steps • Given a query q – Vocabulary search - find each term/word from q in the inverted index – Results merging - Merge results to find documents that contain all or some of the words/terms in q – Rank score computation - To rank the resulting documents/pages, using • Content-based ranking • Link-based ranking
  • 43. Inverted Index - Boolean Retrieval [Figure: inverted index over document IDs 1-4 for the terms blue, cat, egg, fish, green, ham, hat, one, red, two, each term pointing to the IDs of the documents that contain it]
  • 44. Boolean Retrieval • To execute a Boolean query – Build query syntax tree – For each clause, look up postings – Traverse postings and apply the Boolean operator • Efficiency analysis – Postings traversal is linear (assuming sorted postings) – Start with the shortest posting first • Example query: (blue AND fish) OR ham
  • 45. Query Processing - AND • Consider processing the query - Brutus AND Caesar – Locate Brutus in the Dictionary • Retrieve its postings – Locate Caesar in the Dictionary • Retrieve its postings – Merge the two postings Sec. 1.3
  • 46. The Merge • Walk through the two postings simultaneously, in time linear in the total number of postings entries • These postings are sorted by docID • Example: Brutus → 2 4 8 16 32 64 128, Caesar → 1 2 3 5 8 13 21 34; intersection → 2, 8 • If the list lengths are x and y, the merge takes O(x+y) operations
  • 47. Postings Lists – Merge Algorithm
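The merge algorithm on the original slide was shown as a figure; below is a sketch of the standard linear-time intersection of two docID-sorted postings lists, using the Brutus/Caesar postings from the earlier slides:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```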
  • 48. Inverted Index: TF.IDF [Figure: the same inverted index augmented with a term frequency (tf) per posting and a document frequency (df) per term, for blue, cat, egg, fish, green, ham, hat, one, red, two across documents 1-4]
  • 49. Positional Indexes • Store term position in postings • Supports richer queries (proximity….) • Leads to larger indexes…
  • 50. Inverted Index: Positional Information [Figure: the inverted index further augmented so that each posting also records the positions of the term within the document, alongside tf and df]
  • 51. Optimization of Index Search • What is the best order of words for query processing? • Consider a query that is an AND of n terms • Process words in order of increasing freq – start with the smallest set, then keep cutting further – This is why we kept document freq. in the dictionary • For each of the n terms, get its postings, then AND them together • Example query: Brutus AND Calpurnia AND Caesar [Figure: postings — Brutus → 1 2 3 5 8 16 21 34, Caesar → 2 4 8 16 32 64 128, Calpurnia → 13 16] Sec. 1.3
  • 52. More General Optimization • (madding OR crowd) AND (ignoble OR strife) • Get document frequencies for all terms • Conservative - estimate the size of each OR by the sum of its doc frequencies • Process in increasing order of OR sizes Sec. 1.3
  • 53. Query Optimization • Term document frequencies: eyes 213312, kaleidoscope 87009, marmalade 107913, skies 271658, tangerine 46653, trees 316812 • Recommend a query processing order for – (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) • Estimated OR sizes: kaleidoscope OR eyes = 300321, tangerine OR trees = 363465, marmalade OR skies = 379571 • Recommended order: (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)
  • 54. Skip Pointers • Intersection is the most important operation for search engines • This is because in web search, most queries are implicitly intersections • car repairs, Britney Spears songs translates into – car AND repairs, Britney AND Spears AND songs, which means intersecting 2 or more postings lists in order to return a result • Because intersection is so crucial, search engines try to speed it up in any way possible. One such way is to use skip pointers
  • 55. Optimized Skip Pointers • Augment postings with skip pointers (at indexing time) • Why? - To skip postings that will not figure in the search results • Where do we place skip pointers? [Figure: two postings lists — 2 4 8 41 48 64 128 with skip pointers to 41 and 128, and 1 2 3 8 11 17 21 31 with skip pointers to 11 and 31] Sec. 2.3
  • 56. Query Processing with Skip Pointers • Start using the normal intersection algorithm • Continue until the lists match at 12, then advance to the next item in each list. At this point the "car" list is on 48 and the "repairs" list is on 13, but 13 has a skip pointer • Check the value the skip pointer points at (i.e. 29); if this value is less than the current value of the "car" list (which it is), follow the skip pointer and jump to this value in the list
  • 57. Where to Place Skips - Tradeoff • More skips → shorter skip spans ⇒ more likely to skip – But lots of comparisons to skip pointers • Fewer skips → fewer pointer comparisons, but then long skip spans ⇒ few successful skips
  • 58. Placing Skips • Simple heuristic - for postings of length L, use √L evenly-spaced skip pointers • This ignores the distribution of query terms • Easy if the index is relatively static • Harder if L keeps changing because of updates • How much do skip pointers help? – CPUs were slow, they used to help a lot – Today’s CPUs are fast and disk is slow, so reducing disk postings list size dominates
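A sketch of intersection with skip pointers, following the √L placement heuristic above; the skip layout is computed on the fly here purely for illustration, whereas a real engine stores the pointers in the index:

```python
import math

def add_skips(postings):
    """skips[i] = index reachable from i via a skip pointer, or None.
    Roughly sqrt(L) evenly spaced skips, per the heuristic above."""
    step = int(math.sqrt(len(postings))) or 1
    return [i + step if i % step == 0 and i + step < len(postings) else None
            for i in range(len(postings))]

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using skip pointers to jump ahead."""
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if s1[i] is not None and p1[s1[i]] <= p2[j]:
                # Follow skip pointers while they still land at or below p2[j]
                while s1[i] is not None and p1[s1[i]] <= p2[j]:
                    i = s1[i]
            else:
                i += 1
        else:
            if s2[j] is not None and p2[s2[j]] <= p1[i]:
                while s2[j] is not None and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]
```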
  • 59. Strengths and Weaknesses • Strengths – Very precise queries can be specified – Easy to implement in the simple form • Weaknesses – Retrieval results are poor since term frequency is not considered - No index term weighting – Specifying the query may be difficult for casual users – Result might be 1 or 0 (unordered set of documents)
  • 60. Similarity Based Retrieval • Retrieve documents that are similar to a given document – Similarity may be defined on the basis of common words – Find k terms in A with highest TF (d, t ) / n (t ) and use these terms to find relevance of other documents • Relevance feedback – Similarity can be used to refine answer set to keyword query – User selects a few relevant documents from those retrieved by keyword query and system finds other documents similar to these
  • 61. Similarity Based Retrieval - Vector Space Model • Define an n-dimensional space, where n is the number of words in the document set • Vector for document d goes from origin to a point whose ith coordinate is TF (d,t ) / n (t ) • The cosine of the angle between vectors of two documents is used as a measure of their similarity
  • 62. Vector Space Model • Assumption - Documents that are close together in vector space talk about same things • Hence retrieve documents depending on closeness to the query (similarity ~ closeness)
  • 63. Vector Space Model • Documents are treated as a bag of words or terms • Documents are presented in high dimensional space • Each document is represented as a vector • Implemented by forming term-document matrix • Dimension of space depends on number of indexing terms which are chosen to be relevant for the collection • Rank according to the similarity metric (e.g. cosine) between the query and document • The smaller the angle between the document and query the more similar they are believed to be – Documents are represented by a term vector – Queries are represented by a similar vector
  • 64. Vector Space Model • Term weights are not pure 0 or 1 • Each weight is computed based on some variations of TF or TF-IDF scheme • Query has the same shape as document (m dimensional vector) – Cosine is commonly used in text clustering • Measure of similarity between query q and a document dj is a cosine of angle between these vectors • Ad-hoc weightings (term frequency x inverse document frequency ) used • No optimal ranking
  • 65. Vector Space Model • Vector space = all the keywords encountered <t1, t2, t3, …, tn> • Document D = < a1, a2, a3, …, an> where ai = weight of ti in D • Query Q = < b1, b2, b3, …, bn> where bi = weight of ti in Q • R(D,Q) = Sim(D,Q) • Consider Query q – Relevance of di to q - Compare similarity of query q and document di – Cosine similarity (the cosine of the angle between the two vectors)
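A minimal cosine-similarity sketch over dense term-weight vectors (real systems evaluate this through the inverted file, as the later implementation slide describes):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc = [0.2, 0.0, 0.7]     # document term weights over a 3-term vocabulary
query = [1.0, 0.5, 0.0]   # query term weights over the same vocabulary
print(cosine(doc, query))  # prints the similarity score used for ranking
```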
  • 66. Vectors Plot [Figure: documents plotted in a two-dimensional space with axes Star and Diet — documents about movie stars, documents about astrology, and documents about mammal behavior occupy different regions of the space]
  • 67. Term-Document Matrix • A collection of n documents can be represented in the vector space model using this matrix • An m × n matrix where m is the number of terms and n is the number of documents • Term - row of the term-document matrix • Document - column of the term-document matrix
  \[ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \]
  where column j holds the weights of terms t_1 … t_m in document d_j
  • 69. Similarity Formulae
  • Dot Product: \( Sim(D,Q) = \sum_i a_i b_i \)
  • Cosine: \( Sim(D,Q) = \dfrac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}} \)
  • Dice: \( Sim(D,Q) = \dfrac{2\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2} \)
  • Jaccard: \( Sim(D,Q) = \dfrac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i} \)
  • 70. Index Storage • Term-Document matrix is very sparse • A few 100 terms for a document and a few terms for a query, while the term space is large (~100k) • Stored as D1 → {(t1, a1), (t2,a2), …} t1 → {(D1,a1), …}
  • 71. Implementation • Implementation of Vector Space Model using dot product – Naïve implementation: O(m*n) – Implementation using inverted file • Given a query = {(t1,b1), (t2,b2)} 1. Find the sets of related documents through the inverted file for t1 and t2 2. Calculate the score of the documents for each weighted term (t1,b1) → {(D1,a1 *b1), …} 3. Combine the sets and sum the weights (∑) • Overall cost: O(|Q|*n)
  • 72. Similarity Calculation • Consider two documents D1, D2 and a query Q – D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0) – a worked cosine computation follows below
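The worked numbers from the original slide are not preserved in this transcript; applying the cosine measure defined earlier to the vectors above gives (a reconstruction, rounded to two decimals):

\[ \cos(D_1, Q) = \frac{0.5 \cdot 1.5 + 0.8 \cdot 1.0 + 0.3 \cdot 0}{\sqrt{0.5^2 + 0.8^2 + 0.3^2}\,\sqrt{1.5^2 + 1.0^2}} = \frac{1.55}{\sqrt{0.98}\,\sqrt{3.25}} \approx 0.87 \]

\[ \cos(D_2, Q) = \frac{0.9 \cdot 1.5 + 0.4 \cdot 1.0 + 0.2 \cdot 0}{\sqrt{0.9^2 + 0.4^2 + 0.2^2}\,\sqrt{1.5^2 + 1.0^2}} = \frac{1.75}{\sqrt{1.01}\,\sqrt{3.25}} \approx 0.97 \]

so under cosine ranking D2 is scored as more similar to Q than D1.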
  • 73. Ranked Retrieval • Documents are ranked based on their score • Advantages – Query is easy to specify – Output is ranked based on the estimated relevance of the documents to the query – A wide variety of theoretical models exist • Disadvantages – Query less precise (although weighting can be used)
  • 74. Example Query • A document space is defined by three terms – computer, application, users • A set of documents defined as – A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) – A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) – A7=(1, 1, 1), A8=(1, 0, 1). A9=(0, 1, 1) • Query is computer and application • What documents should be retrieved?
  • 75. Example Query • In Boolean query matching – AND - document A4, A7 will be retrieved – OR - retrieved: A1, A2, A4, A5, A6, A7, A8, A9 • In similarity matching (cosine) – q=(1, 1, 0) – S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 – S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 – S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 – Document retrieved set (with ranking) • {A4, A7, A1, A2, A5, A6, A8, A9}
  • 76. Assigning Weights to Terms • Binary Weights • Raw term frequency • TF x IDF
  • 77. Binary Weights • Only the presence 1 or absence 0 of a term is included in the vector
  docs   t1  t2  t3   RSV=Q.Di
  D1      1   0   1    4
  D2      1   0   0    1
  D3      0   1   1    5
  D4      1   0   0    1
  D5      1   1   1    6
  D6      1   1   0    3
  D7      0   1   0    2
  D8      0   1   0    2
  D9      0   0   1    3
  D10     0   1   1    5
  D11     1   0   1    3
  Q       1   2   3    (query weights q1, q2, q3)
  • 78. Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector
  docs   t1  t2  t3
  D1      2   0   3
  D2      1   0   0
  D3      0   4   7
  D4      3   0   0
  D5      1   6   3
  D6      3   5   0
  D7      0   8   0
  D8      0  10   0
  D9      0   0   1
  D10     0   3   5
  D11     4   0   1
  • 79. TF.IDF - Term Weighting • Assigns a tf * idf weight to each term in each document • Term weights components – Local - How important is the term in this document? – Global - How important is the term in the collection? • Logic – Terms that appear often in a document should get high weights – Terms that appear in many documents should get low weights • Mathematical Capturing – Term Frequency (local) – Inverse Document Frequency (global)
  • 80. TF.IDF - Term Weighting • tf = Term Frequency – Frequency of a term/keyword in a document – The higher the tf, the higher the importance (weight) for the doc. • df = document frequency – Number of documents containing the term – Distribution of the term • idf = Inverse Document Frequency – Unevenness of term distribution in the corpus – Specificity of term to a document • The more the term is distributed evenly, the less it is specific to a document: weight(t, D) = tf(t ,D) * idf(t)
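A minimal sketch of the weight(t, D) = tf(t, D) * idf(t) scheme, using idf = log(N / df) as one common variant (the toy documents and the exact idf formula are illustrative, not taken from the slides):

```python
import math
from collections import Counter

docs = {
    "D1": ["information", "retrieval", "ranking", "retrieval"],
    "D2": ["database", "query", "retrieval"],
    "D3": ["ranking", "feedback"],
}

N = len(docs)
# df(t): number of documents containing term t
df = Counter(t for terms in docs.values() for t in set(terms))

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)                        # local importance
    idf = math.log(N / df[term]) if term in df else 0.0  # global importance
    return tf * idf

print(tf_idf("retrieval", "D1"))  # appears twice in D1 but in 2 of 3 docs -> 2 * ln(1.5)
print(tf_idf("database", "D2"))   # appears once but in only 1 doc -> 1 * ln(3), higher idf
```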
  • 81. Term Weighting • Based on common sense, but adjusted/engineered following experiments • Terms that occur in only a few documents are often more valuable than ones that occur in many - IDF • The more often a term occurs in a document, the more likely it is to be important for that document - TF • A term that occurs for same number of times in a short and a long document is likely to be more valuable for the short document
  • 82. Term Significance • Word occurrence frequency is a measure of the significance of terms and their discriminatory power [Figure: word-frequency curve — words that are too frequent are useless discriminators, words that are too rare make no significant contribution to the content of the document; significant terms lie between the two cut-offs]
  • 83. TF.IDF Weight • Term frequency weight measures importance in document: • Inverse document frequency measures importance in collection: • Some heuristic modifications
  • 84. Relevance Ranking Using Terms • TF-IDF (Term frequency/Inverse Document frequency) ranking – Let n(d) = number of terms in the document d – n(d, t) = number of occurrences of term t in the document d – Relevance of a document d to a term t: \( TF(d, t) = \log\!\left(1 + \dfrac{n(d, t)}{n(d)}\right) \) • The log factor is to avoid excessive weight to frequent terms – Relevance of document to query Q: \( r(d, Q) = \sum_{t \in Q} \dfrac{TF(d, t)}{n(t)} \)
  • 85. Inverse Document Frequency • IDF provides high values for rare words and low values for common words • For a collection of 10000 documents: \( \log\frac{10000}{10000} = 0, \quad \log\frac{10000}{5000} = 0.301, \quad \log\frac{10000}{20} = 2.698, \quad \log\frac{10000}{1} = 4 \)
  • 86. TF.IDF Normalization • Normalize the term weights – So longer documents are not unfairly given more weight – Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive: \( w_{ik} = \dfrac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N/n_k)]^2}} \)
  • 87. Relevance Ranking Using Terms • Ranking of documents on the basis of estimated relevance to a query • Relevance ranking is based on factors – Term frequency • Frequency of occurrence of query keyword in document – Inverse document frequency • How many documents the query keyword occurs in – Fewer → give more importance to keyword – Hyperlinks to documents • More links to a document → document is more important
  • 88. Relevance Ranking Using Terms • Documents are returned in decreasing order of relevance score – Usually only top few documents are returned • Most systems improvise above model – Common words like a, an, the, it etc are eliminated • Called stop words – Words that occur in title, author list, section headings are given greater importance – Words whose first occurrence is late in the document are given lower importance – Proximity: if keywords in query occur close together in the document, the document has higher importance than if they are far apart
  • 89. Relevance Using Hyperlinks • Number of documents relevant to a query can be enormous if only term frequencies are taken into account • Using term frequencies makes spamming easy – A fitness center could add many occurrences of the words like weights to its page to make its rank very high • Most of the time people are looking for pages from popular sites • Idea - use popularity of web site to rank site pages that match given keywords • Problem - hard to find actual popularity of site
  • 90. Relevance Using Hyperlinks • Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site – Count only one hyperlink from each site – Popularity measure is for site, not for individual page • But most hyperlinks point to root of site • Concept of site is difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity • Refinements – When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige • Definition is circular • Set up and solve system of simultaneous linear equations – Above idea is the basis of the Google Page Rank ranking mechanism
  • 91. Relevance Using Hyperlinks • Connections in social networking – Ranks prestige of people – Someone known by multiple prestigious people has higher prestige • Hub and authority based ranking – A hub is a page that stores links to many pages (on a topic) – An authority is a page that contains actual information on a topic – Each page gets a hub prestige based on prestige of authorities that it points to – Each page gets an authority prestige based on prestige of hubs that point to it – Prestige definitions are cyclic and can be obtained by solving linear equations – Use authority prestige when ranking answers to a query
  • 92. Probability Ranking Principle • Given a user query q and a document d, estimate the probability that the user will find d relevant • Robertson (1977) - If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data
  • 94. Bayes Classifier • Bayes Decision Rule – A document D is relevant if P(R|D) > P(NR|D) • Estimating probabilities – Use Bayes Rule – Classify a document as relevant if – Left hand side is likelihood ratio
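The formulas referred to on this slide did not survive extraction; in the standard form of the rule described above they read:

\[ P(R \mid D) > P(NR \mid D) \;\Longleftrightarrow\; \frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)} \]

where the left-hand side of the second inequality is the likelihood ratio mentioned on the slide.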
  • 95. Estimating P(D|R) • Assume independence • Binary independence model – based on information related to presence and absence of terms in relevant and non-relevant documents – information acquired through relevance feedback process • user stating which of the retrieved documents are relevant / non-relevant – Based on the probability ranking principle, which ensures an optimal ranking – document represented by a vector of binary features indicating term occurrence (or non- occurrence) – pi is probability that term i occurs (has value 1) in relevant document, si is probability of occurrence in non-relevant document
  • 97. Binary Independence Model • Scoring function is • Query provides information about relevant documents • If we assume pi constant, si approximated by entire collection, get idf-like weight
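The scoring function itself is not reproduced in this transcript; in the usual binary independence model derivation, a document is scored by summing over the query terms it contains:

\[ \sum_{i:\; d_i = q_i = 1} \log \frac{p_i\,(1 - s_i)}{s_i\,(1 - p_i)} \]

which, when p_i is assumed constant and s_i is approximated from the entire collection, reduces to the idf-like weight noted on the slide.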
  • 98. Contingency Table [Figure: contingency table of term presence/absence in relevant and non-relevant documents, from which the scoring function above is derived]
  • 99. BM25 Ranking Algorithm • Popular and effective ranking algorithm based on binary independence model – adds document and query term weights – k1, k2 and K are parameters whose values are set empirically – dl is doc length – Typical TREC value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75
  • 100. BM25 Example • Query with two terms player Pele, (qf = 1) • No relevance information (r and R are zero) • N = 500,000 documents • player occurs in 40,000 documents (n1 = 40, 000) • Pele occurs in 300 documents (n2 = 300) • Player occurs 15 times in doc (f1 = 15) • Pele occurs 25 times (f2 = 25) • Document length is 90% of the average length (dl/avdl = .9) • k1 = 1.2, b = 0.75, and k2 = 100 • K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
  • 102. BM25 Example - Effect of Term Frequencies
  Frequency of player   Frequency of Pele   BM25 Score
  15                    25                  20.66
  15                     1                  12.74
  15                     0                   5.00
   1                    25                  18.2
   0                    25                  16.66
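A sketch of a BM25 scorer matching the parameterization described on the previous slides (no relevance information, so r_i = R = 0 and the relevance component reduces to an idf-like log term); with the example's numbers it approximately reproduces the table above:

```python
import math

def bm25_score(query_terms, N, k1=1.2, k2=100, b=0.75, dl_over_avdl=0.9):
    """query_terms: list of (n_i, f_i, qf_i) = doc freq, term freq in doc, term freq in query.
    Assumes no relevance feedback (r_i = R = 0)."""
    K = k1 * ((1 - b) + b * dl_over_avdl)
    score = 0.0
    for n_i, f_i, qf_i in query_terms:
        idf_like = math.log((N - n_i + 0.5) / (n_i + 0.5))
        tf_doc = ((k1 + 1) * f_i) / (K + f_i)
        tf_query = ((k2 + 1) * qf_i) / (k2 + qf_i)
        score += idf_like * tf_doc * tf_query
    return score

# player: n=40000, f=15, qf=1;  Pele: n=300, f=25, qf=1
print(round(bm25_score([(40000, 15, 1), (300, 25, 1)], N=500_000), 2))  # ~20.6
print(round(bm25_score([(40000, 15, 1), (300, 1, 1)],  N=500_000), 2))  # ~12.7
```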
  • 103. Relevance Feedback Process – Iterative Cycle
  1. User is presented with a list of retrieved documents
  2. User marks those which are relevant (or not relevant) – in practice the top 10-20 ranked documents are examined; incremental: one document after the other
  3. The relevance feedback algorithm selects important terms from the documents assessed relevant by users
  4. The relevance feedback algorithm emphasises the importance of these terms in a new query through – query expansion (add these terms to the query), term reweighting (modify the term weights in the query), or query expansion + term reweighting
  5. The updated query is submitted to the system
  6. If the user is satisfied with the new set of retrieved documents, the feedback process stops; otherwise go to step 2
  • Approaches – Approach 1: Add/Remove/Change query terms – Approach 2: Re-weight query terms
  • 104. Relevance Feedback Issues • Relevance feedback – Often users are not reliable in making relevance assessments, or do not make relevance assessments – Implicit relevance feedback by looking at what users access • clicks in web logs • works well - “wisdom of the crowd" – Positive, negative – Partial relevance assessments (very relevant or partially relevant)? – Why is a document relevant? • Interactive query expansion (as opposed to automatic) – Users choose the terms to be added
  • 105. Latent Semantic Indexing - LSI • Term document matrices are very large but people talk about few things • So how can the term-document matrix be represented in a lower dimensional latent space? • Latent Semantic Analysis is the analysis of latent i.e. hidden semantics in a corpus of text • LSI transforms the original data into a different space so that two documents/words about the same concept are mapped close together (so that they have higher cosine similarity) • LSI achieves this by Singular Value Decomposition (SVD) of the term-document matrix • Maps documents and terms to a low dimensional representation • Design a mapping such that the low dimensional space reflects semantic association • Compute document similarity based on the inner product in this latent semantic space
  • 107. LSI • For LSI truncated SVD is used: \( A_k = U_k \Sigma_k V_k^T \) • Where – U_k is an m×k matrix whose columns are the first k left singular vectors of A – Σ_k is a k×k diagonal matrix whose diagonal is formed by the k leading singular values of A – V_k is an n×k matrix whose columns are the first k right singular vectors of A • Rows of U_k = terms • Rows of V_k = documents • SVD can be used to compute optimal low rank approximations
  • 108. LSI • In truncated LSI, the first k independent linear components of A (singular vectors and values) are included • Documents are projected by least squares onto the space spanned by the first k singular vectors of A (LSI space) • The first k components capture the major associational structure in the term-document matrix and throw out the noise • Minor differences in terminology used in documents are ignored • Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based • Documents which contain synonyms are closer in LSI space than in the original space • Documents which contain polysemy (a word having different meanings in different contexts) used in different contexts are farther apart in LSI space than in the original space
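A small numpy sketch of the truncated SVD behind LSI; the toy term-document matrix and the choice k = 2 are illustrative only:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

Uk = U[:, :k]          # first k left singular vectors  (terms in LSI space)
Sk = np.diag(s[:k])    # k leading singular values
Vk = Vt[:k, :].T       # first k right singular vectors (documents in LSI space)

Ak = Uk @ Sk @ Vk.T    # rank-k approximation of A

# Documents can be compared in the latent space via their rows of Vk (scaled by Sk)
doc_vectors = Vk @ Sk
print(np.round(Ak, 2))
print(np.round(doc_vectors, 2))
```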
  • 109. Concept Indexing (CI) • Lexically focused relevance estimation is less effective than semantically focused estimation • Indexing using concept decomposition (CD) instead of SVD as in LSI • Concept decomposition was introduced in 2001 • I. S. Dhillon, D.S. Modha: Concept decomposition for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175
  • 110. Concept Decomposition • Cluster the documents in the term-document matrix A into k groups • Clustering algorithms – Spherical k-means algorithm – Fuzzy k-means algorithm • Spherical k-means is a variant of k-means which uses the fact that document vectors have unit norm • Centroids of groups = concept vectors • The concept matrix is the matrix whose columns are the centroids of the groups: \( C_k = [c_1\; c_2\; \cdots\; c_k] \), where c_j is the centroid of the j-th group
  • 111. Concept Decomposition • Next step: calculate the concept decomposition • The concept decomposition D_k of the term-document matrix A is the least squares approximation of A on the space of concept vectors: \( D_k = C_k Z \), where Z is the solution of the least squares problem \( Z = (C_k^T C_k)^{-1} C_k^T A \) • Rows of C_k = terms • Columns of Z = documents
  • 112. System Evaluation • There are many retrieval models / algorithms / systems • Which one is the best? – How effective is a system at retrieving relevant documents? • What is the best component for – Ranking function (dot-product, cosine, …) – Term selection (stop word removal, stemming…) – Term weighting (TF, TF-IDF,…) • How far down the ranked list will a user need to look to find some / all relevant documents?
  • 113. Effectiveness • The goal of an IR system is to retrieve as many relevant documents as possible and as few non-relevant documents as possible • Evaluating this consists of a comparative evaluation of the technical performance of IR system(s) – In traditional IR, technical performance means the effectiveness of the IR system: the ability of the IR system to retrieve relevant documents and suppress non-relevant documents • Effectiveness is measured by the combination of recall and precision
  • 114. Measuring Query Retrieval Effectiveness • Information-retrieval systems save space by using index structures that support only approximate retrieval • May result in – False negative (false drop) - some relevant documents may not be retrieved – False positive - some irrelevant documents may be retrieved – For many applications a good index should not permit any false drops, but may permit a few false positives
  • 115. Precision and Recall • The retrieved set partitions all documents into four groups:
                   relevant                       irrelevant
    retrieved      retrieved & relevant           retrieved & irrelevant
    not retrieved  not retrieved but relevant     not retrieved & irrelevant
  • (Diagram: Venn of Relevant, Retrieved and their intersection Relevant & Retrieved within All Documents)
  • 116. Precision and Recall • In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be different • This difference is formally measured with precision and recall • Precision - what percentage of the retrieved documents are relevant to the query • Recall - what percentage of the documents relevant to the query were retrieved • precision = (number of relevant documents retrieved) / (total number of documents retrieved) • recall = (number of relevant documents retrieved) / (total number of relevant documents)
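A minimal sketch of these two definitions, using made-up document id sets:

```python
relevant = {"d3", "d5", "d9", "d25"}           # judged relevant for the query
retrieved = {"d3", "d9", "d25", "d41", "d77"}  # returned by the system

hits = relevant & retrieved
precision = len(hits) / len(retrieved)   # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)       # fraction of relevant docs that were retrieved
print(precision, recall)                 # 0.6 0.75
```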
  • 117. Measuring Retrieval Effectiveness • The above two measures do not take into account where the relevant documents are retrieved, that is, at which rank (crucial, since the output of most IR systems is a ranked list of documents) • This is very important because an effective IR system should not only retrieve as many relevant documents and as few non-relevant documents as possible, but should also retrieve the relevant documents before the non-relevant ones • Recall vs. Precision tradeoff – Recall can be increased by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision • Measures of retrieval effectiveness – Recall as a function of the number of documents fetched, or – Precision as a function of recall • Equivalently, as a function of the number of documents fetched – Example - precision of 75% at a recall of 50%, and 60% at a recall of 75% • Problem: which documents are actually relevant, and which are not
  • 118. Systems Evaluation - Challenges • Effectiveness is related to the relevancy of retrieved items • Relevancy is not typically binary but continuous • Even if relevancy is binary, it can be a difficult judgment to make • Relevancy from a human standpoint – Subjective: Depends upon a specific user’s judgment – Situational: Relates to user’s current needs – Cognitive: Depends on human perception and behavior – Dynamic: Changes over time • Total number of relevant items is sometimes not available – Sample across the database and perform relevance judgment on these items – Apply different retrieval algorithms to the same database for the same query. The aggregate of relevant items is taken as the total relevant set
  • 119. Precision and Recall - Trade-offs • (Plot: precision vs. recall, both axes from 0 to 1) • The ideal system sits in the top-right corner: high precision and high recall • High precision, low recall - returns relevant documents but misses many useful ones • High recall, low precision - returns most relevant documents but includes lots of junk
  • 120. Recall / Precision • Let us assume that for a given query, the following documents are relevant: {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} • Now suppose that the following documents are retrieved for that query • For each relevant document (in red bold), we calculate the precision value and the recall value • For example, for d56, we have 3 retrieved documents so far, and 2 among them are relevant, so the precision is 2/3. We have retrieved 2 of the relevant documents so far (the total number of relevant documents being 10), so the recall is 2/10
  • 121. Recall / Precision • For each query, we obtain pairs of recall and precision values • In our example, we would obtain (1/10, 1/1) (2/10, 2/3) (3/10, 3/6) (4/10, 4/10) (5/10, 5/14) . . . which are usually expressed in % (10%, 100%) (20%, 66.66%) (30%, 50%) (40%, 40%) (50%, 35.71%) . . . • This can be read, for instance, as: at 20% recall, we have 66.66% precision; at 50% recall, we have 35.71% precision • The pairs of values are plotted into a graph, giving a recall/precision curve
  • 122. The complete methodology • For each IR system / IR system version – For each query in the test collection • We first run the query against the system to obtain a ranked list of retrieved documents • We use the ranking and relevance judgements to calculate recall/precision pairs – Then we average recall / precision values across all queries, to obtain an overall measure of the effectiveness
  • 124. Computing Recall / Precision Points • For a given query, produce the ranked list of retrievals • Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures • Mark each document in the ranked list that is relevant according to the gold standard • Compute a recall/precision pair for each position in the ranked list that contains a relevant document
  • 125. Computing Recall / Precision Points • Let total # of relevant docs = 6 • Check each new recall point • One relevant document is missed, so 100% recall is never reached
  rank  doc #  relevant  recall / precision
   1    588    X         R = 1/6 = 0.167; P = 1/1 = 1.0
   2    589    X         R = 2/6 = 0.333; P = 2/2 = 1.0
   3    576
   4    590    X         R = 3/6 = 0.5;   P = 3/4 = 0.75
   5    986
   6    592    X         R = 4/6 = 0.667; P = 4/6 = 0.667
   7    984
   8    988
   9    578
  10    985
  11    103
  12    591
  13    772    X         R = 5/6 = 0.833; P = 5/13 = 0.38
  14    990
  • 126. Computing Recall / Precision Points • Let total # of relevant docs = 6 • Check each new recall point
  rank  doc #  relevant  recall / precision
   1    588    X         R = 1/6 = 0.167; P = 1/1 = 1.0
   2    576
   3    589    X         R = 2/6 = 0.333; P = 2/3 = 0.667
   4    342
   5    590    X         R = 3/6 = 0.5;   P = 3/5 = 0.6
   6    717
   7    984
   8    772    X         R = 4/6 = 0.667; P = 4/8 = 0.5
   9    312    X         R = 5/6 = 0.833; P = 5/9 = 0.556
  10    498
  11    113
  12    628
  13    772
  14    592    X         R = 6/6 = 1.0;   P = 6/14 = 0.429
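A minimal sketch of computing these points from per-rank relevance judgements; the judgement vector below corresponds to the second worked example above:

```python
judgements = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]   # 1 = relevant at that rank
total_relevant = 6

hits, points = 0, []
for rank, rel in enumerate(judgements, start=1):
    if rel:
        hits += 1
        points.append((hits / total_relevant, hits / rank))   # (recall, precision)

for r, p in points:
    print(f"R={r:.3f}  P={p:.3f}")
# R=0.167 P=1.000 ... R=1.000 P=0.429, matching the table above
```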
  • 127. Average Recall / Precision Curve • Typically average performance over a large set of queries • Compute average precision at each standard recall level across all queries • Plot average precision/recall curves to evaluate overall system performance on a document/query corpus
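One common way to obtain precision at standard recall levels is 11-point interpolation (an assumption here, since the slide does not name the interpolation scheme): the interpolated precision at recall level r is the maximum precision at any recall ≥ r, and the resulting 11-value vectors are averaged element-wise over all queries:

```python
def interpolated_precision(points, levels=tuple(i / 10 for i in range(11))):
    """points: list of (recall, precision) pairs for one query, as in the previous sketch."""
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# For the points of the previous sketch this gives:
# [1.0, 1.0, 0.667, 0.667, 0.6, 0.6, 0.556, 0.556, 0.556, 0.429, 0.429]
```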
  • 128. Compare Systems • The curve closest to the upper right-hand corner of the graph indicates the best performance
  • 129. R-Precision • The precision at rank R in the ranking of results for a query, where R is the number of relevant documents for that query • R-precision places lower emphasis on the exact ranking of the relevant documents returned • This can be useful when a topic has a large number of judged relevant documents, or when an evaluator is more interested in measuring aggregate performance as opposed to the fine-grained quality of the ranking provided by the system
  rank  doc #  relevant
   1    588    x
   2    589    x
   3    576
   4    590    x
   5    986
   6    592    x
   7    984
   8    988
   9    578
  10    985
  11    103
  12    591
  13    772    x
  14    990
  R = # of relevant docs = 6; R-Precision = precision at rank 6 = 4/6 ≈ 0.67
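A minimal sketch of R-precision on per-rank judgements matching the table above:

```python
judgements = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   # 1 = relevant at that rank
R = 6                                                       # number of relevant docs
r_precision = sum(judgements[:R]) / R
print(r_precision)   # 4/6 ≈ 0.667
```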
  • 130. F-Measure • A measure of performance that considers both recall and precision • Harmonic mean of recall and precision: F = 2PR / (P + R) = 2 / (1/R + 1/P) • Compared to the arithmetic mean, both need to be high for the harmonic mean to be high
  • 131. E Measure - Parameterized F Measure • A variant of the F measure that allows weighting emphasis on precision over recall: E = (1 + β²)PR / (β²P + R) = (1 + β²) / (β²/R + 1/P) • Value of β controls the trade-off – β = 1: Equally weight precision and recall (E = F) – β > 1: Weight recall more – β < 1: Weight precision more
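A minimal sketch of both measures as defined above (following the slide's convention that β = 1 reduces E to F); the input precision/recall values are illustrative:

```python
def f_measure(precision, recall):
    # harmonic mean: F = 2PR / (P + R)
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0

def e_measure(precision, recall, beta=1.0):
    # E = (1 + beta^2) P R / (beta^2 P + R); beta > 1 weights recall more
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

print(f_measure(0.6, 0.75))          # 0.667
print(e_measure(0.6, 0.75, beta=2))  # 0.714 (recall weighted more)
```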
  • 132. Mean Average Precision (MAP) • Average Precision: the average of the precision values at the points at which each relevant document is retrieved – Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633 – Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625 • Mean Average Precision: the mean of the average precision values over a set of queries • Provides a single number for comparing algorithms
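A minimal sketch of average precision and MAP, reusing per-rank judgement vectors for the two worked examples above (relevant documents that are never retrieved contribute a precision of 0, as in Ex1):

```python
def average_precision(judgements, total_relevant):
    hits, precisions = 0, []
    for rank, rel in enumerate(judgements, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant   # missed relevant docs count as 0

queries = [
    ([1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], 6),   # Ex1 -> 0.633
    ([1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1], 6),   # Ex2 -> 0.625
]
mean_ap = sum(average_precision(j, n) for j, n in queries) / len(queries)
print(mean_ap)   # ~0.629
```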
  • 133. Non-Binary Relevance • Documents are rarely entirely relevant or non-relevant to a query • Many sources of graded relevance judgments – Relevance judgments on a 5-point scale – Multiple judges – Click distribution and deviation from expected levels (but click-through != relevance judgments)
  • 134. A/B Testing • Exploits the existing user base to provide useful feedback • Randomly send a small fraction (1-10%) of incoming users to a variant of the system that includes a single change • Judge effectiveness by measuring the change in click-through: the percentage of users that click on the top result (or any result on the first page)
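A minimal sketch of the comparison, with made-up user and click counts:

```python
control_users, control_clicks = 90_000, 31_500   # users kept on the current system
variant_users, variant_clicks = 10_000, 3_900    # ~10% of traffic routed to the variant

ctr_control = control_clicks / control_users     # 0.35
ctr_variant = variant_clicks / variant_users     # 0.39
print(f"control CTR={ctr_control:.3f}, variant CTR={ctr_variant:.3f}, "
      f"relative change={(ctr_variant - ctr_control) / ctr_control:+.1%}")
```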
  • 136. Thank You Check Out My LinkedIn Profile at https://in.linkedin.com/in/girishkhanzode