Information Retrieval (IR)
Techniques
Girish Khanzode
Contents
Information Retrieval
• Information Retrieval - given a set of query terms and a set of documents, select only the relevant documents (precision), and preferably all the relevant ones (recall)
• Goal - find documents relevant to an information need from a large document set
– Mostly textual information ( text/document retrieval)
– documents, images, videos, data, services, audio
– XML, RDF, html, txt, PDF
• Large collections on internet with billions of pages
• Information retrieval problem: locating relevant documents based on user input,
such as keywords or example documents
Information Retrieval / Data Retrieval
                      Information Retrieval   Data Retrieval
Matching              vague                   exact
Model                 probabilistic           deterministic
Query language        natural                 artificial
Query specification   incomplete              complete
Items wanted          relevant                all (matching)
Error handling        insensitive             sensitive
IR Cycle
IR Models
(Figure) Taxonomy of IR models by user task:
• User tasks
– Retrieval: ad hoc, filtering
– Browsing
• Classic models: Boolean, Vector, Probabilistic
• Structured models: Non-Overlapping Lists, Proximal Nodes
• Browsing models: Flat, Structure Guided, Hypertext
• Set theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network
Tasks
• Clustering - Group documents into clusters based on their contents
• Classification - Given a set of topics and a new document D, decide which topics
D belongs to (spam / non-spam…)
• Information Extraction - Find all snippets dealing with a given topic (like
company merger)
• Question Answering - Handle wide range of question types like fact, list,
definition, how, why, hypothetical, semantically constrained, and cross-
lingual questions
• Opinion Mining - Analyze / summarize sentiment in a text
Terminology
• Searching - seeking specific information within a body of information; the result of a search is
a set of hits
• Browsing - unstructured exploration of a body of information
• Linking - moving from one item to another following links like citations, references
• Query - a string of text, describing the information that user seeks. Each word of the query
is called a search term or keyword
• A query can be a single search term, a string of terms, a phrase in natural language or a
stylized expression using special symbols
• Full text searching - methods that compare the query with every word in the text, without
distinguishing the function (meaning, position) of the various words
• Fielded searching - methods that search on specific bibliographic or structural fields, such as
author or heading
Architecture
(Figure) Offline, documents are acquired (e.g. by web crawling) and passed through a
representation function to build the index. Online, the query is passed through its own
representation function, and a comparison function matches the query representation against
the document representations to produce the hits.
Zipf's Law
• Distribution of word frequencies is similar for different texts (natural
language) of significantly large size
• Zipf's law holds even for different languages
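A compact statement of the law (the standard formulation; it is not spelled out on the slide, and the constant is collection-dependent):

```latex
% Zipf's law: the collection frequency cf_r of the r-th most frequent term
% falls off roughly as the inverse of its rank r.
cf_r \;\propto\; \frac{1}{r}
\qquad\text{equivalently}\qquad
cf_r \;\approx\; \frac{c}{r}\ \text{ for some collection-dependent constant } c
```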
Luhn's Hypothesis
• Frequency of words is a measurement of word significance - A measurement of
the power of a word to discriminate documents by their content ...
• Resolving/discriminating power of words
• Optimal power is halfway between the upper and lower frequency cut-offs
Techniques
• IR Models - Governs how a document and a query are represented and how
the relevance of a document to a user query is defined
– Boolean Model
– Vector Model
– Probabilistic Model
• Index Terms (attribute) Selection
– Stop list
– Word stem
– Index terms weighting methods
• Term-document frequency matrices
Indexing Based IR
• Simple queries
– composed of two or three, perhaps a dozen, keywords
– web retrieval
• Boolean queries
– `Database AND computer’
– online catalog and patent search
• Context queries
– proximity search, phrase queries
Sorting & Ranking
• User sends a query to search system which returns a set of hits
• For a large documents collection this set could be large
• The value of results depends on the order in which the hits are presented
• Ranking algorithms are at the core of information retrieval systems
(predicting which documents are relevant and which are not)
• Ranking methods
– Sorting the hits (by date…)
– Ranking the hits by similarity between query and document
– Ranking the hits by the importance of the documents
Bag of Words Model
• The most common way to represent documents in IR
• How to weight a word within a document
– Boolean: 1 if word i is in doc j, 0 otherwise
– Tf*idf and others - the weight is a function of the word frequency in the
document and of the frequency of documents with that word
• What is a word
– Single, inflected word (going)
– Lemmatised word (going, go, gone → go)
– Multi-word, proper nouns, numbers, dates (board of directors, John Stack, April,
2012)
– Meaning: plan, project, design → PLAN#03
Bag of Words Model
• Treats all the words in a document as index
terms
• Assigns a weight to each term based on
importance (or presence/absence of word)
• Disregards order, structure, meaning of the
words
• Simple but effective
• Assumptions
– Term occurrence is independent
– Document relevance is independent
– Words are well-defined
• Consider three documents
– John likes to watch movies.
– Mary likes movies too.
– John also likes football
• The bag of words is shown below
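A minimal sketch of how that bag-of-words table can be built for the three example documents (the naive tokenization and the use of raw term counts are illustrative choices, not prescribed by the slide):

```python
from collections import Counter

docs = [
    "John likes to watch movies.",
    "Mary likes movies too.",
    "John also likes football",
]

# Tokenize naively: lowercase and strip punctuation, then count term occurrences.
def bag_of_words(text):
    tokens = [t.strip(".,").lower() for t in text.split()]
    return Counter(tokens)

bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))

# Print each document as a vector of raw term frequencies over the shared vocabulary.
for i, bag in enumerate(bags, 1):
    print(f"doc{i}:", [bag.get(term, 0) for term in vocabulary])
print("vocabulary:", vocabulary)
```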
Document Parsing
• Format and language of each document
– What format is it in?
– PDF / word / excel / html?
– What language is it in?
– What character set is in use?
• Each of these is a classification problem
• These tasks are often performed heuristically
Parsing Challenges
• Documents being indexed can be from different languages
– A single index may contain terms of several languages
• A document / components can contain multiple languages / formats
– French email with a German PDF attachment
• What is a unit document?
– A file?
– An email?
– An email with 5 attachments?
– A group of files (PPT or LaTeX as HTML pages)
Tokenization
• Token - instance of a sequence of characters
• Each token is a candidate for an index entry after further processing
• Input: Customers Suppliers and Factory
• Output: Tokens
– Customers
– Suppliers
– Factory
Tokenization Issues
• Finland’s capital → Finland? Finlands? Finland’s?
• Hewlett-Packard → Hewlett and Packard as two tokens?
– state-of-the-art - break up hyphenated sequence
– co-education
– lowercase, lower-case, lower case ?
• San Francisco - one token or two?
– How to decide if it is one token?
– Cheap San Francisco-Los Angeles fares
Stop Words
• Many of the most frequently used words in English are useless in IR and text mining
• Those words are called stop words
– the, a, and, to, be , of, in, about, with …
– Little semantic value
– Stop words account for 20-30% of total word counts
• Stop list contains stop-words that should not be used as index terms
– Prepositions
– Articles
– Pronouns
– Some adverbs and adjectives
– Some frequent words (e.g. document)
Stop Words
• Removal of stop-words improves
efficiency and effectiveness of
searches
• A few standard stop-lists are
commonly used
• Reduces indexing data file sizes
Stop Words - New Trend
• Stop words need very small space for storage due to good compression
techniques
• Query time is not affected due to stop words because of good query
optimization techniques
• Stop words are necessary for
– Phrase queries - King of Spain
– Various song titles.. - Let it be, To be or not to be
– Relational queries - flights to London
Normalization - Terms
• Normalization of words in indexed text and query into same form
• A term is a normalized word type, which is a single entry in IR system
dictionary
• Implicitly defines equivalence classes of terms by
– Deleting periods to form a term
• U.S.A., USA → USA
– Deleting hyphens to form a term
• anti-discriminatory, antidiscriminatory → antidiscriminatory
– Synonyms
• Car, Automobile
Case Folding
• Reduces all letters to lower case
• Exception: upper case words in mid-sentence
• General Motors
• Fed vs. fed
• MIT vs. mit
• It is best to lowercase everything, since users will often type search
queries in lowercase regardless of the correct capitalization of what they are looking for
• Google example
– When query is C.A.T. -> #1 result is for “cat” (Wikipedia) but not Caterpillar
Inc.
Synonyms and Homonyms
• Synonyms
– Document - motorcycle repair - motorcycle maintenance
– maintenance and repair are synonyms
– System can extend query as motorcycle and (repair or maintenance)
• Homonyms
– Object has different meanings as noun/verb
– Can disambiguate meanings to some extent from the context
• Extending queries automatically using synonyms can be problematic
– Need to understand intended meaning in order to infer synonyms
• Or verify synonyms with user
– Synonyms may have other meanings as well
Normalization - Synonyms
• Handling Synonyms and Homonyms
– Hand-constructed equivalence classes
• car = automobile color = colour
– Rewrite words to form equivalence-class terms
• When a document contains automobile, index it under car-automobile (and vice-versa)
– Expand a query
• When a query contains automobile, look under car as well
• Spelling mistakes
– Soundex - a phonetic algorithm that forms equivalence classes of words based on phonetic
heuristics
• Google → Googol
Stemming
• Techniques used to reduce words of variant form to a stem or root form
before indexing
• Stemming – Remove endings of word
– Computer
– Compute
– Computes
– Computing
– Computed
– Computation
– all reduce to the stem: comput
Stemming
• As a result, if the query is house plans, the results will also include all pages
containing variations of that term
– House plan
– House planer
– House planning
• Increases recall at the expense of precision
• Improves effectiveness of IR and text mining
– Matching similar words
– Reducing indexing size
– Combining words with same roots may reduce indexing size as much as 40-50%
• Produced by stemmers
Lemmatization
• Transform to standard dictionary form lemma, according to syntactic
category
– verb + ing → verb, noun + s → noun
• More accurate than stemming but consumes more resources
• Balance noise Vs. recognition rate
• Compromise between precision and recall
• Increased recall without hurting precision
• Produced by lemmatizers
• the boy's cars are different colors → the boy car be different color
Porter’s Algorithm
• The most common algorithm for stemming English and one that has
repeatedly been shown to be empirically very effective - suffix stripping
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: of the rules in a compound command, select the one that
applies to the longest suffix
Porter Algorithm Steps
1. Plurals and past participles
   SSES -> SS          caresses -> caress
   (*v*) ING ->        motoring -> motor
2. adj->n, n->v, n->adj, …
   (m>0) OUSNESS -> OUS    callousness -> callous
   (m>0) ATIONAL -> ATE    relational -> relate
3. (m>0) ICATE -> IC       triplicate -> triplic
4. (m>1) AL ->             revival -> reviv
   (m>1) ANCE ->           allowance -> allow
5. (m>1) E ->              probate -> probat
   (m > 1 and *d and *L) -> single letter    controll -> control
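A quick illustration of suffix stripping in practice. This sketch assumes the NLTK library is installed and uses its PorterStemmer class (the import path and class name are NLTK's, not from the slides):

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed (pip install nltk)

stemmer = PorterStemmer()
words = ["caresses", "motoring", "callousness", "relational",
         "triplicate", "revival", "allowance", "probate", "controlling"]

# Each word is reduced to its stem by the phased suffix-stripping rules listed above.
for w in words:
    print(f"{w:>12} -> {stemmer.stem(w)}")
```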
Stemmers Comparison
• Sample text: Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression that is more
biologically transparent and accessible to interpretation
• Porter’s: such an analysi can reveal featur that ar not easili visibl from the variat in the
individu gene and can lead to pictur of express that is more biolog transpar and access
to interpret
• Lovins’s: such an analys can reve featur that ar not eas vis from th vari in th individu
gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
• Paice’s : such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to interpret
Deep Analysis
• Detailed Natural Language Processing (NLP) algorithms
• Semantic disambiguation, phrase indexing (board of directors), named
entities (President Monti = Mario Monti)...
• Standard search engines use deeper techniques (Google)
Document Indexing
• Store an index to optimize speed and performance
• Useful in finding relevant documents for a search query
• Reduces time and CPU usage
• Without an index, search engine will scan every document in
the corpus
• An index of 10,000 documents is queried in milliseconds
• A sequential scan of every word in 10,000 documents takes
much more time
• Each document is represented by a set of weighted
keywords known as terms
– D1 → {(t1, w1), (t2,w2), …}
• D1 → {(comput, 0.2), (architect, 0.3), …}
• D2 → {(comput, 0.1), (network, 0.5), …}
• Inverted file - used in retrieval for higher efficiency
– comput → {(D1,0.2), (D2,0.1), …}
Boolean Model
• Query terms are combined logically using the
Boolean operators
– AND, OR, NOT
– ((asthma AND exercise) AND (NOT cardiac))
– Views each document as a set of words
– Precise: document matches a condition or not
• Retrieval
– Given a Boolean query, system retrieves each
document that makes the query logically true
– Called exact match
• No Rank - A document is judged to be relevant if
the terms in the document satisfy the logical
expression of the query
Inverted index
• A data structure that associates each distinct term with a list of all
documents that contain the term in a document collection
• The list is called a postings list
Document vs. Inverted Views
What Goes in Inverted File
• Boolean retrieval
– Just the document number
• Ranked Retrieval
– Document number and term weight (TF, IDF, TF*IDF, ...)
• Proximity operators
– Word offsets for each occurrence of the term
– Example : t17 (doc1,49) (doc1,70) (doc2,3)
Inverted File Size
• Very compact for Boolean retrieval
– About 10% of the size of the documents
– If an aggressive stop word list is used
• Not much larger for ranked retrieval
– Perhaps 20%
• Enormous for proximity operators
– Sometimes larger than the documents
– But access is fast - you know where to look
Inverted Index Construction
(Figure) Example: the text "Friends, Romans, Countrymen" is tokenized and normalized to the
terms friend, roman, countryman; each term is then linked to its postings list of document IDs,
e.g. friend → 2, 4; roman → 1, 2; countryman → 13, 16
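A minimal sketch of this construction for a tiny collection (the document texts and IDs are illustrative, not from the slides):

```python
from collections import defaultdict

# Toy collection: document ID -> text.
collection = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# Build the inverted index: each term maps to a sorted postings list of doc IDs.
index = defaultdict(set)
for doc_id, text in collection.items():
    for term in text.lower().split():
        index[term].add(doc_id)

postings = {term: sorted(ids) for term, ids in sorted(index.items())}
for term, ids in postings.items():
    print(f"{term:>10} -> {ids}")
```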
Inverted Index – Search Steps
• Given a query q
– Vocabulary search - find each term/word from q in the inverted index
– Results merging - Merge results to find documents that contain all or some of
the words/terms in q
– Rank score computation - To rank the resulting documents/pages, using
• Content-based ranking
• Link-based ranking
Inverted Index - Boolean Retrieval
(Figure) A toy inverted index over four documents: the dictionary lists the terms blue, cat, egg,
fish, green, ham, hat, one, red, two, and each term points to the postings list of document IDs
(1-4) that contain it.
Boolean Retrieval
• To execute a Boolean query
– Build query syntax tree
– For each clause, look up postings
– Traverse postings and apply Boolean operator
• Efficiency analysis
– Postings traversal is linear (assuming sorted postings)
– Start with shortest posting first
Example query: ( blue AND fish ) OR ham
Query Processing - AND
• Consider processing the query - Brutus AND Caesar
– Locate Brutus in the Dictionary
• Retrieve its postings
– Locate Caesar in the Dictionary
• Retrieve its postings
– Merge the two postings
The Merge
• Walk through the two postings simultaneously, in time linear in the total
number of postings entries
• These postings are sorted by docID
Example: Brutus → 2, 4, 8, 16, 32, 64, 128; Caesar → 1, 2, 3, 5, 8, 13, 21, 34; intersection → 2, 8
If the list lengths are x and y, the merge takes O(x+y) operations
Postings Lists – Merge Algorithm
(Figure: pseudocode for merging two docID-sorted postings lists by walking both lists with two
pointers; a sketch follows below.)
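A sketch of the linear-time intersection described above; the Brutus/Caesar postings from the earlier example are reused as test data:

```python
def intersect(p1, p2):
    """Merge two docID-sorted postings lists in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # -> [2, 8]
```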
Inverted Index: TF.IDF
(Figure) The same toy index extended for ranked retrieval: the dictionary stores the document
frequency (df) of each term, and each posting stores the term frequency (tf) of the term in that
document.
Positional Indexes
• Store term position in postings
• Supports richer queries (proximity….)
• Leads to larger indexes…
(Figure) Each posting additionally stores the list of positions at which the term occurs in the
document.
Inverted Index: Positional Information
(Figure) The toy index with document frequencies, term frequencies, and, for each posting, the
document ID and the positions of the term within that document.
Optimization of Index Search
• What is the best order of words for query processing?
• Consider a query that is an AND of n terms
• Process words in order of increasing freq
– start with smallest set, then keep cutting further
– This is why we kept document freq. in dictionary
• For each of the n terms, get its postings, then AND them together
Example - Query: Brutus AND Calpurnia AND Caesar
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 16, 21, 34
Calpurnia → 16
Process in order of increasing frequency: start with Calpurnia, intersect with Brutus, then with Caesar
More General Optimization
• (madding OR crowd) AND (ignoble OR strife)
• Get document frequencies for all terms
• Conservative - estimate the size of each OR by the sum of its doc
frequencies
• Process in increasing order of OR sizes
Query Optimization
Term            Frequency
eyes            213312
kaleidoscope    87009
marmalade       107913
skies           271658
tangerine       46653
trees           316812
Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Estimated OR sizes (sum of document frequencies):
kaleidoscope OR eyes = 300321; tangerine OR trees = 363465; marmalade OR skies = 379571
Recommended order: (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)
Skip Pointers
• Intersection is the most important operation for search engines
• This is because in web search, most queries are implicitly intersections
• car repairs and Britney Spears songs translate into car AND repairs and Britney
AND spears AND songs, which means intersecting two or more
postings lists in order to return a result
• Because intersection is so crucial, search engines try to speed it up in any
way possible. One such way is to use skip pointers
Optimized Skip Pointers
• Augment Postings with skip pointers (at indexing time)
• Why? - To skip postings that will not figure in the search results.
• Where do we place skip pointers?
(Figure) Two postings lists augmented with skip pointers, e.g.
2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers to 41 and 128
1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers to 11 and 31
Query Processing with Skip Pointers
• Start using the normal intersection algorithm
• Continue until the lists match on 12 and advance to the next item in each list. At this point
the "car" list is on 48 and the "repairs" list is on 13, and 13 has a skip pointer
• Check the value the skip pointer is pointing at (i.e. 29) and if this value is less than
the current value of the "car" list (which it is), we follow our skip pointer and jump
to this value in the list
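A sketch of intersection with skip pointers. The postings values reuse the lists reconstructed from the figure above, and the skip pointers are simulated as jumps of roughly √L positions (both choices are illustrative):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two docID-sorted postings lists, following a simulated skip pointer
    (a jump of ~sqrt(L) positions) whenever it does not overshoot the other list."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer on p1 only if its target is still <= p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

car = [2, 4, 8, 41, 48, 64, 128]
repairs = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(car, repairs))  # -> [2, 8]
```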
Where to Place Skips - Tradeoff
• More skips → shorter skip spans ⇒ more likely to skip
– But lots of comparisons to skip pointers.
• Fewer skips → fewer pointer comparisons, but then long skip spans ⇒
few successful skips
Placing Skips
• Simple heuristic - for postings of length L, use √L evenly-spaced skip
pointers
• This ignores the distribution of query terms
• Easy if the index is relatively static
• Harder if L keeps changing because of updates
• How much do skip pointers help?
– When CPUs were slow, skip pointers used to help a lot
– Today’s CPUs are fast and disk is slow, so reducing disk postings list size
dominates
Strengths and Weaknesses
• Strengths
– Very precise queries can be specified
– Easy to implement in the simple form
• Weaknesses
– Retrieval results are poor since term frequency is not considered - No index
term weighting
– Specifying the query may be difficult for casual users
– Result might be 1 or 0 (unordered set of documents)
Similarity Based Retrieval
• Retrieve documents that are similar to a given document
– Similarity may be defined on the basis of common words
– Find k terms in A with highest TF (d, t ) / n (t ) and use these terms to find
relevance of other documents
• Relevance feedback
– Similarity can be used to refine answer set to keyword query
– User selects a few relevant documents from those retrieved by keyword
query and system finds other documents similar to these
Similarity Based Retrieval - Vector Space Model
• Define an n-dimensional space, where n is the number of words in the
document set
• Vector for document d goes from the origin to a point whose i-th coordinate is TF(d, t) / n(t)
• The cosine of the angle between vectors of two documents is used as a
measure of their similarity
Vector Space Model
• Assumption - Documents that are close together in vector space talk
about same things
• Hence retrieve documents depending on closeness to the query
(similarity ~ closeness)
Vector Space Model
• Documents are treated as a bag of words or terms
• Documents are represented in a high dimensional space
• Each document is represented as a vector
• Implemented by forming term-document matrix
• Dimension of space depends on number of indexing terms which are chosen to be
relevant for the collection
• Rank according to the similarity metric (e.g. cosine) between the query and document
• The smaller the angle between the document and query the more similar they are
believed to be
– Documents are represented by a term vector
– Queries are represented by a similar vector
Vector Space Model
• Term weights are not pure 0 or 1
• Each weight is computed based on some variations of TF or TF-IDF
scheme
• Query has the same shape as document (m dimensional vector)
– Cosine is commonly used in text clustering
• Measure of similarity between query q and a document dj is a cosine of
angle between these vectors
• Ad-hoc weightings (term frequency x inverse document frequency ) used
• No optimal ranking
Vector Space Model
• Vector space = all the keywords encountered <t1, t2, t3, …, tn>
• Document D = < a1, a2, a3, …, an> where ai = weight of ti in D
• Query Q = < b1, b2, b3, …, bn> where bi = weight of ti in Q
• R(D,Q) = Sim(D,Q)
• Consider Query q
– Relevance of di to q - Compare similarity of query q and document di
– Cosine similarity (the cosine of the angle between the two vectors)
Vectors Plot
Star
Diet
Document about astrology
Documents about movie stars
Documents about mammal behavior
Term-Document Matrix
• A collection of n documents can
be represented in vector space
model using this matrix
• A m × n matrix where m is
number of terms and n is number
of documents
• Term - row of the term-document matrix
• Document - column of the term-document matrix

        d1    d2    …    dn
  t1 [ a11   a12   …   a1n ]
  t2 [ a21   a22   …   a2n ]
  …
  tm [ am1   am2   …   amn ]  =  A
Similarity Formulae
• Dot Product: Sim(D, Q) = Σ_i a_i b_i
• Cosine: Sim(D, Q) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )
• Dice: Sim(D, Q) = 2 Σ_i a_i b_i / ( Σ_i a_i² + Σ_i b_i² )
• Jaccard: Sim(D, Q) = Σ_i a_i b_i / ( Σ_i a_i² + Σ_i b_i² − Σ_i a_i b_i )
(Figure: document vector D and query vector Q in a two-term space t1, t2, with the angle between them.)
Index Storage
• Term-Document matrix is very sparse
• A few hundred terms per document and a few terms per query, while the
term space is large (~100k)
• Stored as
D1 → {(t1, a1), (t2,a2), …}
t1 → {(D1,a1), …}
Implementation
• Implementation of Vector Space Model using dot product
– Naïve implementation: O(m*n)
– Implementation using inverted file
• Given a query = {(t1,b1), (t2,b2)}
1. Find the sets of related documents through inverted file for t1 and t2
2. Calculate the score of the documents for each weighted term (t1,b1) →
{(D1,a1 *b1), …}
3. Combine the sets and sum the weights (∑)
4. O(|Q|*n)
Similarity Calculation
• Consider two documents D1, D2 and a query Q
– D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
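A quick check of this example; the cosine values printed below are computed here, not taken from the slide:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q  = (1.5, 1.0, 0.0)

print("sim(D1, Q) =", round(cosine(D1, Q), 3))  # ~0.87
print("sim(D2, Q) =", round(cosine(D2, Q), 3))  # ~0.97 -> D2 ranks above D1
```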
Ranked Retrieval
• Documents are ranked based on their score
• Advantages
– Query is easy to specify
– Output is ranked based on the estimated relevance of the documents to the
query
– A wide variety of theoretical models exist
• Disadvantages
– Query less precise (although weighting can be used)
Example Query
• A document space is defined by three terms – computer, application, users
• A set of documents defined as
– A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
– A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
– A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• Query is computer and application
• What documents should be retrieved?
Example Query
• In Boolean query matching
– AND - document A4, A7 will be retrieved
– OR - retrieved: A1, A2, A4, A5, A6, A7, A8, A9
• In similarity matching (cosine)
– q=(1, 1, 0)
– S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
– S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
– S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
– Document retrieved set (with ranking)
• {A4, A7, A1, A2, A5, A6, A8, A9}
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• TF x IDF
Binary Weights
• Only the presence 1 or absence 0 of a term is included in the vector
docs t1 t2 t3 RSV=Q.Di
D1 1 0 1 4
D2 1 0 0 1
D3 0 1 1 5
D4 1 0 0 1
D5 1 1 1 6
D6 1 1 0 3
D7 0 1 0 2
D8 0 1 0 2
D9 0 0 1 3
D10 0 1 1 5
D11 1 0 1 3
Q 1 2 3
q1 q2 q3
Raw Term Weights
• The frequency of occurrence for the term in each document is included in
the vector
docs t1 t2 t3
D1 2 0 3
D2 1 0 0
D3 0 4 7
D4 3 0 0
D5 1 6 3
D6 3 5 0
D7 0 8 0
D8 0 10 0
D9 0 0 1
D10 0 3 5
D11 4 0 1
TF.IDF - Term Weighting
• Assigns a tf * idf weight to each term in each document
• Term weights components
– Local - How important is the term in this document?
– Global - How important is the term in the collection?
• Logic
– Terms that appear often in a document should get high weights
– Terms that appear in many documents should get low weights
• Mathematical Capturing
– Term Frequency (local)
– Inverse Document Frequency (global)
TF.IDF - Term Weighting
• tf = Term Frequency
– Frequency of a term/keyword in a document
– The higher the tf, the higher the importance (weight) for the doc.
• df = document frequency
– Number of documents containing the term
– Distribution of the term
• idf = Inverse Document Frequency
– Unevenness of term distribution in the corpus
– Specificity of term to a document
• The more the term is distributed evenly, the less it is specific to a document:
weight(t, D) = tf(t ,D) * idf(t)
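A small sketch of this weighting over a toy collection; the documents are illustrative, and log base 10 is used to match the idf examples later in the deck:

```python
import math
from collections import Counter

docs = {
    "D1": "new home sales top forecasts",
    "D2": "home sales rise in july",
    "D3": "increase in home sales in july",
}
N = len(docs)

# tf(t, D): raw count of t in D; df(t): number of docs containing t; idf(t) = log10(N / df(t)).
tfs = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tfs.values() for term in counts)

def weight(term, doc):
    return tfs[doc][term] * math.log10(N / df[term])

for d in docs:
    print(d, {t: round(weight(t, d), 3) for t in tfs[d]})
```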
Term Weighting
• Based on common sense, but adjusted/engineered following experiments
• Terms that occur in only a few documents are often more valuable than
ones that occur in many - IDF
• The more often a term occurs in a document, the more likely it is to be
important for that document - TF
• A term that occurs for same number of times in a short and a long
document is likely to be more valuable for the short document
• Word occurrence frequency is a measure of the significance of terms and
their discriminatory power
Term Significance
(Figure) Plotted against word frequency: terms that are too frequent are useless discriminators,
terms that are too rare make no significant contribution to the content of the document, and the
significant terms lie in between.
TF.IDF Weight
• Term frequency weight measures importance in document:
• Inverse document frequency measures importance in collection:
• Some heuristic modifications
Relevance Ranking Using Terms
• TF-IDF (Term frequency/Inverse Document frequency) ranking
– Let n(d) = number of terms in the document d
– n(d, t) = number of occurrences of term t in the document d.
– Relevance of a document d to a term t
• The log factor is to avoid excessive weight to frequent terms
– Relevance of document to query Q
TF(d, t) = log( 1 + n(d, t) / n(d) )
r(d, Q) = Σ_{t ∈ Q} TF(d, t) / n(t)
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10,000 documents (idf = log10(N / df)):
– term in 10,000 documents: log(10000 / 10000) = 0
– term in 5,000 documents: log(10000 / 5000) = 0.301
– term in 20 documents: log(10000 / 20) = 2.698
– term in 1 document: log(10000 / 1) = 4
TF.IDF Normalization
• Normalize the term weights
– So longer documents are not unfairly given more weight
– Normalize usually means force all values to fall within a certain range, usually
between 0 and 1, inclusive.
w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{k=1..t} (tf_ik)² · [log(N / n_k)]² )
where N is the number of documents and n_k is the document frequency of term k
Relevance Ranking Using Terms
• Ranking of documents on the basis of estimated relevance to a query
• Relevance ranking is based on factors
– Term frequency
• Frequency of occurrence of query keyword in document
– Inverse document frequency
• How many documents the query keyword occurs in
– Fewer → give more importance to keyword
– Hyperlinks to documents
• More links to a document → document is more important
Relevance Ranking Using Terms
• Documents are returned in decreasing order of relevance score
– Usually only top few documents are returned
• Most systems refine the above model
– Common words like a, an, the, it etc are eliminated
• Called stop words
– Words that occur in title, author list, section headings are given greater
importance
– Words whose first occurrence is late in the document are given lower importance
– Proximity: if keywords in query occur close together in the document, the
document has higher importance than if they are far apart
Relevance Using Hyperlinks
• Number of documents relevant to a query can be enormous if only term
frequencies are taken into account
• Using term frequencies makes spamming easy
– A fitness center could add many occurrences of the words like weights to its
page to make its rank very high
• Most of the time people are looking for pages from popular sites
• Idea - use popularity of web site to rank site pages that match given
keywords
• Problem - hard to find actual popularity of site
Relevance Using Hyperlinks
• Solution: use number of hyperlinks to a site as a measure of the popularity or
prestige of the site
– Count only one hyperlink from each site
– Popularity measure is for site, not for individual page
• But most hyperlinks point to root of site
• Concept of site is difficult to define since a URL prefix like cs.yale.edu contains many
unrelated pages of varying popularity
• Refinements
– When computing prestige based on links to a site, give more weight to links from
sites that themselves have higher prestige
• Definition is circular
• Set up and solve system of simultaneous linear equations
– Above idea is the basis of the Google Page Rank ranking mechanism
Relevance Using Hyperlinks
• Connections in social networking
– Ranks prestige of people
– Someone known by multiple prestigious people has higher prestige
• Hub and authority based ranking
– A hub is a page that stores links to many pages (on a topic)
– An authority is a page that contains actual information on a topic
– Each page gets a hub prestige based on prestige of authorities that it points to
– Each page gets an authority prestige based on prestige of hubs that point to it
– Prestige definitions are cyclic and can be obtained by solving linear equations
– Use authority prestige when ranking answers to a query
Probability Ranking Principle
• Given a user query q and a document d, estimate the probability that the
user will find d relevant
• Robertson (1977)
• If a reference retrieval system’s response to each request is a ranking of
the documents in the collection in order of decreasing probability of
relevance to the user who submitted the request,
• and the probabilities are estimated as accurately as possible on the basis of
whatever data have been made available to the system for this purpose,
• then the overall effectiveness of the system to its user will be the best that is
obtainable on the basis of those data
IR as Classification
Bayes Classifier
• Bayes Decision Rule
– A document D is relevant if P(R|D) > P(NR|D)
• Estimating probabilities
– Use Bayes Rule
– Classify a document as relevant if
– Left hand side is likelihood ratio
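The slide's formula is not reproduced in this transcript; the standard form of the rule, stated here for completeness, is:

```latex
% Bayes decision rule via Bayes' rule: classify D as relevant when
P(R \mid D) > P(NR \mid D)
\;\Longleftrightarrow\;
\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}
% i.e. the likelihood ratio on the left exceeds a constant threshold.
```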
Estimating P(D|R)
• Assume independence
• Binary independence model
– based on information related to presence and absence of terms in relevant and non-relevant
documents
– information acquired through relevance feedback process
• user stating which of the retrieved documents are relevant / non-relevant
– Based on the probability ranking principle, which ensures an optimal ranking
– document represented by a vector of binary features indicating term occurrence (or non-
occurrence)
– pi is probability that term i occurs (has value 1) in relevant document, si is probability of
occurrence in non-relevant document
Binary Independence Model
Binary Independence Model
• Scoring function is
• Query provides information about relevant documents
• If we assume pi constant, si approximated by entire collection, get idf-like
weight
Contingency Table
• The scoring function is -
BM25 Ranking Algorithm
• Popular and effective ranking algorithm based on binary independence
model
– adds document and query term weights
– k1, k2 and K are parameters whose values are set empirically
– dl is doc length
– Typical TREC value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75
BM25 Example
• Query with two terms, player and Pele (qf = 1 for each term)
• No relevance information (r and R are zero)
• N = 500,000 documents
• player occurs in 40,000 documents (n1 = 40, 000)
• Pele occurs in 300 documents (n2 = 300)
• Player occurs 15 times in doc (f1 = 15)
• Pele occurs 25 times (f2 = 25)
• Document length is 90% of the average length (dl/avdl = .9)
• k1 = 1.2, b = 0.75, and k2 = 100
• K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
BM25 Example
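A sketch of this computation, assuming the standard BM25 form in which, with no relevance information, the term weight reduces to log((N − n + 0.5)/(n + 0.5)); the result should come out close to the 20.66 shown in the table below, with small differences due to rounding:

```python
import math

# Parameters from the example above; r = R = 0 (no relevance information).
N, k1, k2, b = 500_000, 1.2, 100, 0.75
dl_over_avdl = 0.9
K = k1 * ((1 - b) + b * dl_over_avdl)      # 1.2 * (0.25 + 0.75 * 0.9) = 1.11

def bm25_term(n, f, qf):
    """Contribution of one query term: idf-like weight x doc-TF part x query-TF part."""
    idf_like = math.log((N - n + 0.5) / (n + 0.5))
    return idf_like * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))

score = bm25_term(n=40_000, f=15, qf=1)    # player
score += bm25_term(n=300, f=25, qf=1)      # Pele
print(round(score, 2))                      # ~20.6
```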
BM25 Example - Effect of Term Frequencies
Frequency of player Frequency of Pele BM25 Score
15 25 20.66
15 1 12.74
15 0 5.00
1 25 18.2
0 25 16.66
Relevance Feedback Process – Iterative Cycle
1. User is presented with a list of retrieved documents
2. User marks those which are relevant (or not relevant)
   – In practice the top 10-20 ranked documents are examined
   – Incremental: one document after the other
3. The relevance feedback algorithm selects important terms from documents assessed relevant by users
4. The relevance feedback algorithm emphasises the importance of these terms in a new query in the following ways
   – Query expansion - add these terms to the query
   – Term reweighting - modify the term weights in the query
   – Query expansion + term reweighting
5. The updated query is submitted to the system
6. If the user is satisfied with the new set of retrieved documents, the feedback process stops; otherwise go to step 2
• Approaches
– Approach 1: Add/Remove/Change query terms
– Approach 2: Re-weight query terms
Relevance Feedback Issues
• Relevance feedback
– Often users are not reliable in making relevance assessments, or do not make
relevance assessments
– Implicit relevance feedback by looking at what users access
• clicks in web logs
• works well - “wisdom of the crowd"
– Positive, negative
– Partial relevance assessments (very relevant or partially relevant)?
– Why is a document relevant?
• Interactive query expansion (as opposed to automatic)
– Users choose the terms to be added
Latent Semantic Indexing - LSI
• Term document matrices are very large but people talk about few things
• So how to represent term document by a lower dimensional latent space?
• Latent Semantic Analysis is the analysis of latent, i.e. hidden, semantics in a corpus of
text
• LSI transforms the original data in a different space so that two documents/words
about the same concept are mapped close (so that they have higher cosine similarity)
• LSI achieves this by Singular Value Decomposition (SVD) of term-document matrix
• Maps documents and terms to a low dimensional representation
• Design a mapping such that the low dimensional space reflects semantic association
• Compute document similarity based on the inner product in this latent semantic space
Truncated SVD
LSI
• For LSI, truncated SVD is used: A_k = U_k Σ_k V_k^T
• Where
– U_k is an m×k matrix whose columns are the first k left singular vectors of A
– Σ_k is a k×k diagonal matrix whose diagonal is formed by the k leading singular
values of A
– V_k is an n×k matrix whose columns are the first k right singular vectors of A
• Rows of U_k = terms
• Rows of V_k = documents
• SVD can be used to compute optimal low rank approximations
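A minimal numpy sketch of this truncated SVD; the toy term-document matrix and the choice of k are illustrative:

```python
import numpy as np

# Toy term-document matrix A (m terms x n documents).
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # first k singular triplets

A_k = U_k @ S_k @ V_k.T      # rank-k approximation of A
doc_vectors = V_k @ S_k      # rows = documents in the k-dimensional LSI space

print(np.round(A_k, 2))
print(np.round(doc_vectors, 2))
```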
LSI
• In truncated LSI, first k independent linear components of A (singular vectors and values) are
included
• Documents are projected in means of least squares on space spread by first k singular
vectors of A (LSI space)
• First k components capture the major associational structure in the term-document
matrix and throw out the noise
• Minor differences in terminology used in documents are ignored
• Closeness of objects (queries and documents) is determined by overall pattern of term
usage, so it is context based
• Documents which contain synonyms are closer in LSI space than in original space;
• Documents which contain polysemy (a word having different meaning in different contexts )
in different context are more far in LSI space than in original space
Concept Indexing (CI)
• Lexically focused relevance estimation is less effective than semantically
focused estimation
• Indexing using concept decomposition (CD) instead of SVD like in LSI
• Concept decomposition was introduced in 2001
• I. S. Dhillon, D.S. Modha: Concept decomposition for large sparse text
data using clustering, Machine Learning, 42:1, 2001, pp. 143-175
Concept Decomposition
• Cluster the documents in term-document matrix A into k groups
• Clustering algorithms
– Spherical k-means algorithm
– Fuzzy k-means algorithm
• Spherical k-means algorithm is a variant of k-means algorithm which uses the fact that
vectors of documents are of the unit norm
• Centroids of groups = concept vectors
• Concept matrix is the matrix whose columns are the centroids of the groups:
C_k = [c_1 c_2 … c_k], where c_j is the centroid of the j-th group
Concept Decomposition
• Next step: calculate the concept decomposition
• The concept decomposition D_k of term-document matrix A is the least squares
approximation of A on the space of concept vectors:
D_k = C_k Z, where Z is the solution of the least squares problem Z = (C_k^T C_k)^(-1) C_k^T A
• Rows of C_k = terms
• Columns of Z = documents
System Evaluation
• There are many retrieval models/ algorithms/ systems
• Which one is the best?
– How effective is a system at retrieving relevant documents?
• What is the best component for
– Ranking function (dot-product, cosine, …)
– Term selection (stop word removal, stemming…)
– Term weighting (TF, TF-IDF,…)
• How far down the ranked list will a user need to look to find some / all
relevant documents?
Effectiveness
• Goal of an IR system is to retrieve as many relevant documents as
possible and as few non-relevant documents as possible.
• Evaluating the above consists of a comparative evaluation of the technical
performance of IR system(s)
– In traditional IR, technical performance means the effectiveness of the IR
system: the ability of the IR system to retrieve relevant documents and
suppress non-relevant documents
• Effectiveness is measured by the combination of recall and precision
Measuring Query Retrieval Effectiveness
• Information-retrieval systems save space by using index structures that
support only approximate retrieval
• May result in
– False negative (false drop) - some relevant documents may not be retrieved
– False positive - some irrelevant documents may be retrieved
– For many applications a good index should not permit any false drops, but
may permit a few false positives
Precision and Recall
(Figure) All documents fall into four categories, by whether they are retrieved and whether they are relevant:
– Retrieved & relevant
– Retrieved & irrelevant
– Not retrieved but relevant
– Not retrieved & irrelevant
In the Venn-diagram view, the set of relevant documents and the set of retrieved documents
overlap in the region "relevant & retrieved".
Precision and Recall
• In the ideal case, the set of retrieved documents is equal to
the set of relevant documents. However, in most cases, the
two sets will be different
• This difference is formally measured with precision and
recall
• Precision - what percentage of the retrieved documents are
relevant to the query
• Recall - what percentage of the documents relevant to the
query were retrieved
recall = (number of relevant documents retrieved) / (total number of relevant documents)
precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Measuring Retrieval Effectiveness
• The above two measures do not take into account where the relevant documents are retrieved, that is, at
which rank (crucial since the output of most IR systems is a ranked list of documents).
• This is very important because an effective IR system should not only retrieve as many relevant
documents as possible and as few non-relevant documents as possible, but also it should retrieve
relevant documents before the non-relevant ones.
• Recall vs. Precision tradeoff
– Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many
irrelevant documents would be fetched, reducing precision
• Measures of retrieval effectiveness
– Recall as a function of number of documents fetched or
– Precision as a function of recall
• Equivalently, as a function of number of documents fetched
– Example - precision of 75% at recall of 50%, and 60% at a recall of 75%
• Problem: which documents are actually relevant, and which are not
Systems Evaluation - Challenges
• Effectiveness is related to the relevancy of retrieved items
• Relevancy is not typically binary but continuous
• Even if relevancy is binary, it can be a difficult judgment to make
• Relevancy from a human standpoint
– Subjective: Depends upon a specific user’s judgment
– Situational: Relates to user’s current needs
– Cognitive: Depends on human perception and behavior
– Dynamic: Changes over time
• Total number of relevant items is sometimes not available
– Sample across the database and perform relevance judgment on these items
– Apply different retrieval algorithms to the same database for the same query. The aggregate of
relevant items is taken as the total relevant set
Precision and Recall - Trade-offs
(Figure) Precision plotted against recall: the ideal system keeps precision high at all recall levels;
a system that returns relevant documents but misses many useful ones sits at high precision and
low recall; a system that returns most relevant documents but includes lots of junk sits at high
recall and low precision.
Recall / Precision
• Let us assume that for a given query, the following documents are relevant
• {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
• Now suppose that the following documents are retrieved for that query
• For each relevant document (in red bold), we calculate the precision value and the recall value
• For example, for d56, we have 3 retrieved documents, and 2 among them are relevant, so the
precision is 2/3. We have 2 of the relevant documents so far retrieved (the total number of relevant
documents being 10), so recall is 2/10.
Recall / Precision
• For each query, we obtain pairs of recall and
precision values
• In our example, we would obtain (1/10, 1/1)
(2/10, 2/3) (3/10, 3/6) (4/10, 4/10) (5/10, 5/14) . .
. which are usually expressed in % (10%,
100%) (20%, 66.66%) (30%, 50%) (40%, 40%)
(50%, 35.71%) . . .
• This can be read for instance: at 20% recall,
we have 66.66% precision; at 50% recall, we
have 35.71% precision
• The pairs of values are plotted into a graph,
which has the following curve
The complete methodology
• For each IR system / IR system version
– For each query in the test collection
• We first run the query against the system to obtain a ranked list of retrieved
documents
• We use the ranking and relevance judgements to calculate recall/precision pairs
– Then we average recall / precision values across all queries, to obtain an
overall measure of the effectiveness
Averaging
Computing Recall / Precision Points
• For a given query, produce the ranked list of retrievals
• Adjusting a threshold on this ranked list produces different sets of
retrieved documents, and therefore different recall/precision measures
• Mark each document in the ranked list that is relevant according to the
gold standard
• Compute a recall/precision pair for each position in the ranked list that
contains a relevant document
Computing Recall / Precision Points
• Let total # of relevant docs = 6
• Check each new recall point
Missing one
relevant document.
Never reach
100% recall
n doc # relevant
1 588 X R=1/6=0.167; P=1/1=1
2 589 X R=2/6=0.333; P=2/2=1
3 576
4 590 X R=3/6=0.5; P=3/4=0.75
5 986
6 592 X R=4/6=0.667; P=4/6=0.667
7 984
8 988
9 578
10 985
11 103
12 591
13 772 X R=5/6=0.833; p=5/13=0.38
14 990
Computing Recall / Precision Points
• Let total # of relevant docs = 6
• Check each new recall point
n doc # relevant
1 588 X R=1/6=0.167; P=1/1=1
2 576
3 589 X R=2/6=0.333; P=2/3=0.667
4 342
5 590 X R=3/6=0.5; P=3/5=0.6
6 717
7 984
8 772 X R=4/6=0.667; P=4/8=0.5
9 312 X R=5/6=0.833; P=5/9=0.556
10 498
11 113
12 628
13 772
14 592 X R=6/6=1.0; p=6/14=0.429
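A small sketch that recomputes the recall/precision points and the average precision for the second ranked list above; the relevance flags per rank are taken from the table, and the function name is illustrative:

```python
def recall_precision_points(relevant_flags, total_relevant):
    """Yield (recall, precision) at each rank that retrieves a relevant document."""
    points, hits = [], 0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Ranked list from the second example: 1 marks a relevant document at that rank.
flags = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
points = recall_precision_points(flags, total_relevant=6)
for r, p in points:
    print(f"R={r:.3f}  P={p:.3f}")

avg_precision = sum(p for _, p in points) / 6
print("average precision =", round(avg_precision, 3))   # ~0.625
```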
Average Recall / Precision Curve
• Typically average performance over a large set of queries
• Compute average precision at each standard recall level across all queries
• Plot average precision/recall curves to evaluate overall system
performance on a document/query corpus
Compare Systems
• The curve closest to the upper right-hand corner of the graph indicates
the best performance
R- Precision
• It is the precision at position R in the
ranking of results for a query
• R = number of relevant documents
• R-precision places lower emphasis on the
exact ranking of the relevant documents
returned
• This can be useful when a topic has a large
number of judged relevant documents
• Or when an evaluator is more interested in
measuring aggregate performance as
opposed to the fine-grained quality of the
ranking provided by the system.
n doc # relevant
1 588 x
2 589 x
3 576
4 590 x
5 986
6 592 x
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x
14 990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
F-Measure
• A measure of performance that considers both recall and precision
• Harmonic mean of recall and precision
• Compared to arithmetic mean, both need to be high for harmonic mean
to be high
F = 2RP / (R + P) = 2 / ( 1/R + 1/P )
E Measure - Parameterized F Measure
• A variant of F measure that allows weighting emphasis on precision over
recall
• Value of β controls trade-off
– β = 1: Equally weight precision and recall (E=F)
– β > 1: Weight recall more
– β < 1: Weight precision more
E = (1 + β²) R P / (β² P + R) = (1 + β²) / ( β²/R + 1/P )
Mean Average Precision (MAP)
• Average Precision: Average of the precision values at the points at which
each relevant document is retrieved
– Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
– Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
• Mean Average Precision: Average of the average precision value for a set
of queries
• Provides a single number value to decide better algorithm
Non-Binary Relevance
• Documents are rarely entirely relevant or non-relevant to a query
• Many sources of graded relevance judgments
– Relevance judgments on a 5-point scale
– Multiple judges
– Click distribution and deviation from expected levels (but click-through !=
relevance judgments)
A/B Testing
• Exploits existing user base to provide useful feedback
• Randomly send a small fraction (1−10%) of incoming users to a variant of
the system that includes a single change
• Judge effectiveness by measuring change in click-through
• The percentage of users that click on the top result (or any result on the
first page)
References
1. Baeza-Yates, R. and Ribeiro-Neto, B. (2011) - Modern Information Retrieval: The Concepts and Technology behind Search - Addison Wesley
2. Grossman, D. A. and Frieder, O. (2004) - Information Retrieval: Algorithms and Heuristics, 2nd ed., volume 15 of The Information Retrieval Series - Springer
3. Manning, C. D., Raghavan, P., and Schuetze, H. (2008) - Introduction to Information Retrieval - Cambridge University Press
4. Roelleke, T., Tsikrika, T., and Kazai, G. (2006) - A general matrix framework for modelling information retrieval - Information Processing & Management (IP&M), Special Issue on Theory in Information Retrieval, 42(1)
5. Zaragoza, H., Hiemstra, D., and Tipping, M. (2003) - Bayesian extension to the language model for ad hoc information retrieval - In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4-9, New York, NY, USA. ACM Press
6. Roelleke, T. and Wang, J. (2006) - A parallel derivation of probabilistic information retrieval models - In ACM SIGIR, pages 107-114, Seattle, USA
7. Roelleke, T. and Wang, J. (2008) - TF-IDF uncovered: A study of theories and probabilities - In ACM SIGIR, pages 435-442, Singapore
8. Robertson, S. (2004) - Understanding inverse document frequency: On theoretical arguments for IDF - Journal of Documentation, 60:503-520
9. Metzler, D. and Croft, W. B. (2004) - Combining the language model and inference network approaches to retrieval - Information Processing & Management, 40(5):735-750
Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comSimon Hughes
 
How search engines work Anand Saini
How search engines work Anand SainiHow search engines work Anand Saini
How search engines work Anand SainiDr,Saini Anand
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information ArchitectureRob Bogue
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 
Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Heimo Hänninen
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Nikola Milosevic
 

Similar to IR (20)

Chap1
Chap1Chap1
Chap1
 
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
 
Info 2402 irt-chapter_2
Info 2402 irt-chapter_2Info 2402 irt-chapter_2
Info 2402 irt-chapter_2
 
Chapter 6 Query Language .pdf
Chapter 6 Query Language .pdfChapter 6 Query Language .pdf
Chapter 6 Query Language .pdf
 
3_Indexing.ppt
3_Indexing.ppt3_Indexing.ppt
3_Indexing.ppt
 
Large-Scale Semantic Search
Large-Scale Semantic SearchLarge-Scale Semantic Search
Large-Scale Semantic Search
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Information retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of wordsInformation retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of words
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
 
Text Mining
Text MiningText Mining
Text Mining
 
Web indexing finale
Web indexing finaleWeb indexing finale
Web indexing finale
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 
How search engines work Anand Saini
How search engines work Anand SainiHow search engines work Anand Saini
How search engines work Anand Saini
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information Architecture
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)
 

More from Girish Khanzode (12)

Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
NLP
NLPNLP
NLP
 
NLTK
NLTKNLTK
NLTK
 
NoSql
NoSqlNoSql
NoSql
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Hadoop
HadoopHadoop
Hadoop
 
Language R
Language RLanguage R
Language R
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
Funtional Programming
Funtional ProgrammingFuntional Programming
Funtional Programming
 

Recently uploaded

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

IR

  • 13. Indexing Based IR • Simple queries – composed of two or three, perhaps a dozen, keywords – web retrieval • Boolean queries – `Database AND computer` – online catalog and patent search • Context queries – proximity search, phrase queries
  • 14. Sorting & Ranking • User sends a query to the search system, which returns a set of hits • For a large document collection this set can be very large • The value of the results depends on the order in which the hits are presented • Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not) • Ranking methods – Sorting the hits (by date…) – Ranking the hits by similarity between query and document – Ranking the hits by the importance of the documents
  • 15. Bag of Words Model • The most common way to represent documents in IR • How to weight a word within a document – Boolean: 1 if word i is in doc j, 0 otherwise – Tf*idf and others - the weight is a function of the word frequency in the document and of the frequency of documents with that word • What is a word – Single, inflected word (going) – Lemmatised word (going, go, gone → go) – Multi-word, proper nouns, numbers, dates (board of directors, John Stack, April, 2012) – Meaning: plan, project, design → PLAN#03
  • 16. Bag of Words Model • Treats all the words in a document as index terms • Assigns a weight to each term based on importance (or presence/absence of word) • Disregards order, structure, meaning of the words • Simple but effective • Assumptions – Term occurrence is independent – Document relevance is independent – Words are well-defined • Consider three documents – John likes to watch movies. – Mary likes movies too. – John also likes football • The resulting bag-of-words vectors are sketched below
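A minimal sketch, in plain Python, of the bag-of-words vectors for the three example documents above (lowercasing and dropping punctuation are the only preprocessing assumed):

```python
from collections import Counter
import re

docs = [
    "John likes to watch movies.",
    "Mary likes movies too.",
    "John also likes football",
]

# Tokenize: lowercase and keep alphabetic tokens only
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Vocabulary = union of all tokens (sorted so vector positions are reproducible)
vocab = sorted({t for doc in tokenized for t in doc})

# Each document becomes a vector of raw term counts over the vocabulary
counts = [Counter(doc) for doc in tokenized]
vectors = [[c[term] for term in vocab] for c in counts]

print(vocab)
for v in vectors:
    print(v)
```

Note that word order and grammar are discarded: only the per-term counts survive, which is exactly the independence assumption listed above.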
  • 17. Document Parsing • Format and language of each document – What format is it in? – PDF / Word / Excel / HTML? – What language is it in? – What character set is in use? • Each of these is a classification problem • These tasks are often performed heuristically Sec. 2.1
  • 18. Parsing Challenges • Documents being indexed can be from different languages – A single index may contain terms of several languages • A document / components can contain multiple languages / formats – French email with a German PDF attachment • What is a unit document? – A file? – An email? – An email with 5 attachments? – A group of files (PPT or LaTeX as HTML pages) Sec. 2.1
  • 19. Tokenization • Token - instance of a sequence of characters • Each token is a candidate for an index entry after further processing • Input: Customers Suppliers and Factory • Output: Tokens – Customers – Suppliers – Factory Sec. 2.2.1
  • 20. Tokenization Issues • Finland’s capital → Finland? Finlands? Finland’s? • Hewlett-Packard → Hewlett and Packard as two tokens? – state-of-the-art - break up hyphenated sequence – co-education – lowercase, lower-case, lower case ? • San Francisco - one token or two? – How to decide if it is one token? – Cheap San Francisco-Los Angeles fares Sec. 2.2.1
  • 21. Stop Words • Many of the most frequently used words in English are useless in IR and text mining • Those words are called stop words – the, a, and, to, be, of, in, about, with … – Little semantic value – Stop words account for 20-30% of total word counts • A stop list contains stop words that should not be used as index terms – Prepositions – Articles – Pronouns – Some adverbs and adjectives – Some frequent words (e.g. document) Sec. 2.2.2
  • 22. Stop Words • Removal of stop-words improves efficiency and effectiveness of searches • A few standard stop-lists are commonly used • Reduces indexing data file sizes Sec. 2.2.2
  • 23. Stop Words - New Trend • Stop words need very small space for storage due to good compression techniques • Query time is not affected due to stop words because of good query optimization techniques • Stop words are necessary for – Phrase queries - King of Spain – Various song titles.. - Let it be, To be or not to be – Relational queries - flights to London
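A toy illustration of stop-word removal; the stop list below is a tiny hand-made example, not one of the standard stop lists mentioned above:

```python
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "in", "about", "with"}

def remove_stop_words(tokens):
    """Drop tokens found in the stop list; keep everything else in order."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["the", "power", "of", "a", "word", "to", "discriminate", "documents"]
print(remove_stop_words(tokens))   # ['power', 'word', 'discriminate', 'documents']
```

As the preceding slide notes, dropping "of" this way would damage a phrase query such as King of Spain, which is one reason modern engines increasingly keep stop words in the index.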
  • 24. Normalization - Terms • Normalization of words in indexed text and query into the same form • A term is a normalized word type, which is a single entry in the IR system dictionary • Implicitly defines equivalence classes of terms by – Deleting periods to form a term • U.S.A., USA → USA – Deleting hyphens to form a term • anti-discriminatory, antidiscriminatory → antidiscriminatory – Synonyms • Car, Automobile Sec. 2.2.3
  • 25. Case Folding • Reduces all letters to lower case • Exception: upper case words in mid-sentence • General Motors • Fed vs. fed • MIT vs. mit • It is usually best to lowercase everything, since users often type search queries in lowercase regardless of the capitalization of the information they are interested in • Google example – When the query is C.A.T. → the #1 result is for “cat” (Wikipedia), not Caterpillar Inc. Sec. 2.2.3
  • 26. Synonyms and Homonyms • Synonyms – Document - motorcycle repair - motorcycle maintenance – maintenance and repair are synonyms – System can extend query as motorcycle and (repair or maintenance) • Homonyms – Object has different meanings as noun/verb – Can disambiguate meanings to some extent from the context • Extending queries automatically using synonyms can be problematic – Need to understand intended meaning in order to infer synonyms • Or verify synonyms with user – Synonyms may have other meanings as well
  • 27. Normalization - Synonyms • Handling Synonyms and Homonyms – Hand-constructed equivalence classes • car = automobile color = colour – Rewrite words to form equivalence-class terms • When a document contains automobile, index it under car-automobile (and vice-versa) – Expand a query • When a query contains automobile, look under car as well • Spelling mistakes – Soundex - a phonetic algorithm for equivalence classes of words based on phonetic heuristics • Google → Googol
  • 28. Stemming • Techniques used to reduce words of variant form to a stem or root form before indexing • Stemming – Remove endings of word – Computer, Compute, Computes, Computing, Computed, Computation → comput Sec. 2.2.4
  • 29. Stemming • As a result, if the query is house plans, the results will also include all pages containing variations of that term – House plan – House planer – House planning • Increases recall at the expense of precision • Improves effectiveness of IR and text mining – Matching similar words – Reducing indexing size – Combining words with the same roots may reduce indexing size by as much as 40-50% • Produced by Stemmers Sec. 2.2.4
  • 30. Lemmatization • Transform to standard dictionary form lemma, according to syntactic category – verb + ing → verb, noun + s → noun • More accurate than stemming but consumes more resources • Balance noise Vs. recognition rate • Compromise between precision and recall • Increased recall without hurting precision • Produced by lemmatizers • the boy's cars are different colors → the boy car be different color
  • 31. Porter’s Algorithm • The most common algorithm for stemming English and one that has repeatedly been shown to be empirically very effective - suffix stripping • Conventions + 5 phases of reductions – phases applied sequentially – each phase consists of a set of commands – sample convention: of the rules in a compound command, select the one that applies to the longest suffix Sec. 2.2.4
  • 32. Porter Algorithm Steps
  Step 1 - plurals and past participles: SSES → SS (caresses → caress); (*v*) ING → (motoring → motor)
  Step 2 - adj→n, n→v, n→adj, …: (m>0) OUSNESS → OUS (callousness → callous); (m>0) ATIONAL → ATE (relational → relate)
  Step 3: (m>0) ICATE → IC (triplicate → triplic)
  Step 4: (m>1) AL → (revival → reviv); (m>1) ANCE → (allowance → allow)
  Step 5: (m>1) E → (probate → probat); (m>1 and *d and *L) → single letter (controll → control)
  • 33. Stemmers Comparison • Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation • Porter’s: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to pictur of express that is more biolog transpar and access to interpret • Lovins’s: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres • Paice’s : such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret
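To reproduce this kind of output, a Porter stemmer such as the one shipped with NLTK can be used (assuming NLTK is installed; its results may differ slightly from the Lovins and Paice stemmers quoted above):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["computer", "compute", "computes", "computing", "computed", "computation"]

# The variants collapse to (roughly) the same stem, e.g. 'comput'
print([stemmer.stem(w) for w in words])
```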
  • 34. Deep Analysis • Detailed Natural Language Processing (NLP) algorithms • Semantic disambiguation, phrase indexing (board of directors), named entities (President Monti = Mario Monti)... • Standard search engines use deeper techniques (Google)
  • 35. Document Indexing • Store an index to optimize speed and performance • Useful in finding relevant documents for a search query • Reduces time and CPU usage • Without an index, search engine will scan every document in the corpus • An index of 10,000 documents is queried in milliseconds • A sequential scan of every word in 10,000 documents takes much more time • Each document is represented by a set of weighted keywords known as terms – D1 → {(t1, w1), (t2,w2), …} • D1 → {(comput, 0.2), (architect, 0.3), …} • D2 → {(comput, 0.1), (network, 0.5), …} • Inverted file - used in retrieval for higher efficiency – comput → {(D1,0.2), (D2,0.1), …}
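A sketch of inverting the per-document weighted term lists shown above into an inverted file; the document and term names follow the slide's example:

```python
from collections import defaultdict

# Forward representation: document -> [(term, weight), ...]
documents = {
    "D1": [("comput", 0.2), ("architect", 0.3)],
    "D2": [("comput", 0.1), ("network", 0.5)],
}

def build_inverted_file(docs):
    """Invert document -> terms into term -> [(doc_id, weight), ...]."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, weight in terms:
            index[term].append((doc_id, weight))
    return dict(index)

print(build_inverted_file(documents))
# {'comput': [('D1', 0.2), ('D2', 0.1)], 'architect': [('D1', 0.3)], 'network': [('D2', 0.5)]}
```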
  • 36. Boolean Model • Query terms are combined logically using the Boolean operators – AND, OR, NOT – ((asthma AND exercise) AND (NOT cardiac)) – Views each document as a set of words – Precise: document matches a condition or not • Retrieval – Given a Boolean query, system retrieves each document that makes the query logically true – Called exact match • No Rank - A document is judged to be relevant if the terms in the document satisfies the logical expression of the query
  • 37. Inverted index • A data structure that associates each distinct term with a list of all documents in the collection that contain the term • The list is called a posting list Sec. 1.2
  • 39. What Goes in Inverted File • Boolean retrieval – Just the document number • Ranked Retrieval – Document number and term weight (TF, IDF, TF*IDF, ...) • Proximity operators – Word offsets for each occurrence of the term – Example : t17 (doc1,49) (doc1,70) (doc2,3)
  • 40. Inverted File Size • Very compact for Boolean retrieval – About 10% of the size of the documents – If an aggressive stop word list is used • Not much larger for ranked retrieval – Perhaps 20% • Enormous for proximity operators – Sometimes larger than the documents – But access is fast - you know where to look
  • 41. Inverted Index Construction Sec. 1.2 [Figure: the phrase "Friends, Romans, Countrymen" is tokenized into Friends, Romans, Countrymen, normalized to the terms friend, roman, countryman, and each term is linked to the postings list of the documents that contain it]
  • 42. Inverted Index – Search Steps • Given a query q – Vocabulary search - find each term/word from q in the inverted index – Results merging - Merge results to find documents that contain all or some of the words/terms in q – Rank score computation - To rank the resulting documents/pages, using • Content-based ranking • Link-based ranking
  • 43. Inverted Index - Boolean Retrieval [Figure: inverted index over document IDs 1-4 for the terms blue, cat, egg, fish, green, ham, hat, one, red, two, each term pointing to the IDs of the documents that contain it]
  • 44. Boolean Retrieval • To execute a Boolean query – Build query syntax tree – For each clause, look up postings – Traverse postings and apply the Boolean operator • Efficiency analysis – Postings traversal is linear (assuming sorted postings) – Start with the shortest posting first • Example query: (blue AND fish) OR ham
  • 45. Query Processing - AND • Consider processing the query - Brutus AND Caesar – Locate Brutus in the Dictionary • Retrieve its postings – Locate Caesar in the Dictionary • Retrieve its postings – Merge the two postings Sec. 1.3
  • 46. The Merge • Walk through the two postings simultaneously, in time linear in the total number of postings entries • These postings are sorted by docID • Example: Brutus → 2 4 8 16 32 64 128, Caesar → 1 2 3 5 8 13 21 34; intersection → 2, 8 • If the list lengths are x and y, the merge takes O(x+y) operations
  • 47. Postings Lists – Merge Algorithm
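The merge algorithm on the original slide was shown as a figure; below is a sketch of the standard linear-time intersection of two docID-sorted postings lists, using the Brutus/Caesar postings from the earlier slides:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```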
  • 48. Inverted Index: TF.IDF [Figure: the same inverted index augmented with a term frequency (tf) per posting and a document frequency (df) per term, for blue, cat, egg, fish, green, ham, hat, one, red, two across documents 1-4]
  • 49. Positional Indexes • Store term position in postings • Supports richer queries (proximity….) • Leads to larger indexes…
  • 50. Inverted Index: Positional Information [Figure: the inverted index further augmented so that each posting also records the positions of the term within the document, alongside tf and df]
  • 51. Optimization of Index Search • What is the best order of words for query processing? • Consider a query that is an AND of n terms • Process words in order of increasing freq – start with the smallest set, then keep cutting further – This is why we kept document freq. in the dictionary • For each of the n terms, get its postings, then AND them together • Example query: Brutus AND Calpurnia AND Caesar [Figure: postings — Brutus → 1 2 3 5 8 16 21 34, Caesar → 2 4 8 16 32 64 128, Calpurnia → 13 16] Sec. 1.3
  • 52. More General Optimization • (madding OR crowd) AND (ignoble OR strife) • Get document frequencies for all terms • Conservative - estimate the size of each OR by the sum of its doc frequencies • Process in increasing order of OR sizes Sec. 1.3
  • 53. Query Optimization • Term document frequencies: eyes 213312, kaleidoscope 87009, marmalade 107913, skies 271658, tangerine 46653, trees 316812 • Recommend a query processing order for – (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) • Estimated OR sizes: kaleidoscope OR eyes = 300321, tangerine OR trees = 363465, marmalade OR skies = 379571 • Recommended order: (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)
  • 54. Skip Pointers • Intersection is the most important operation for search engines • This is because in web search, most queries are implicitly intersections • car repairs, Britney Spears songs translates into – car AND repairs, Britney AND Spears AND songs, which means intersecting 2 or more postings lists in order to return a result • Because intersection is so crucial, search engines try to speed it up in any way possible. One such way is to use skip pointers
  • 55. Optimized Skip Pointers • Augment postings with skip pointers (at indexing time) • Why? - To skip postings that will not figure in the search results • Where do we place skip pointers? [Figure: two postings lists — 2 4 8 41 48 64 128 with skip pointers to 41 and 128, and 1 2 3 8 11 17 21 31 with skip pointers to 11 and 31] Sec. 2.3
  • 56. Query Processing with Skip Pointers • Start using the normal intersection algorithm • Continue until the lists match at 12, then advance to the next item in each list. At this point the "car" list is on 48 and the "repairs" list is on 13, but 13 has a skip pointer • Check the value the skip pointer points at (i.e. 29); if this value is less than the current value of the "car" list (which it is), follow the skip pointer and jump to this value in the list
  • 57. Where to Place Skips - Tradeoff • More skips → shorter skip spans ⇒ more likely to skip – But lots of comparisons to skip pointers • Fewer skips → fewer pointer comparisons, but then long skip spans ⇒ few successful skips
  • 58. Placing Skips • Simple heuristic - for postings of length L, use √L evenly-spaced skip pointers • This ignores the distribution of query terms • Easy if the index is relatively static • Harder if L keeps changing because of updates • How much do skip pointers help? – CPUs were slow, they used to help a lot – Today’s CPUs are fast and disk is slow, so reducing disk postings list size dominates
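A sketch of intersection with skip pointers, following the √L placement heuristic above; the skip layout is computed on the fly here purely for illustration, whereas a real engine stores the pointers in the index:

```python
import math

def add_skips(postings):
    """skips[i] = index reachable from i via a skip pointer, or None.
    Roughly sqrt(L) evenly spaced skips, per the heuristic above."""
    step = int(math.sqrt(len(postings))) or 1
    return [i + step if i % step == 0 and i + step < len(postings) else None
            for i in range(len(postings))]

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using skip pointers to jump ahead."""
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if s1[i] is not None and p1[s1[i]] <= p2[j]:
                # Follow skip pointers while they still land at or below p2[j]
                while s1[i] is not None and p1[s1[i]] <= p2[j]:
                    i = s1[i]
            else:
                i += 1
        else:
            if s2[j] is not None and p2[s2[j]] <= p1[i]:
                while s2[j] is not None and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]
```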
  • 59. Strengths and Weaknesses • Strengths – Very precise queries can be specified – Easy to implement in the simple form • Weaknesses – Retrieval results are poor since term frequency is not considered - No index term weighting – Specifying the query may be difficult for casual users – Result might be 1 or 0 (unordered set of documents)
  • 60. Similarity Based Retrieval • Retrieve documents that are similar to a given document – Similarity may be defined on the basis of common words – Find k terms in A with highest TF (d, t ) / n (t ) and use these terms to find relevance of other documents • Relevance feedback – Similarity can be used to refine answer set to keyword query – User selects a few relevant documents from those retrieved by keyword query and system finds other documents similar to these
  • 61. Similarity Based Retrieval - Vector Space Model • Define an n-dimensional space, where n is the number of words in the document set • Vector for document d goes from origin to a point whose ith coordinate is TF (d,t ) / n (t ) • The cosine of the angle between vectors of two documents is used as a measure of their similarity
  • 62. Vector Space Model • Assumption - Documents that are close together in vector space talk about same things • Hence retrieve documents depending on closeness to the query (similarity ~ closeness)
  • 63. Vector Space Model • Documents are treated as a bag of words or terms • Documents are presented in high dimensional space • Each document is represented as a vector • Implemented by forming term-document matrix • Dimension of space depends on number of indexing terms which are chosen to be relevant for the collection • Rank according to the similarity metric (e.g. cosine) between the query and document • The smaller the angle between the document and query the more similar they are believed to be – Documents are represented by a term vector – Queries are represented by a similar vector
  • 64. Vector Space Model • Term weights are not pure 0 or 1 • Each weight is computed based on some variations of TF or TF-IDF scheme • Query has the same shape as document (m dimensional vector) – Cosine is commonly used in text clustering • Measure of similarity between query q and a document dj is a cosine of angle between these vectors • Ad-hoc weightings (term frequency x inverse document frequency ) used • No optimal ranking
  • 65. Vector Space Model • Vector space = all the keywords encountered <t1, t2, t3, …, tn> • Document D = < a1, a2, a3, …, an> where ai = weight of ti in D • Query Q = < b1, b2, b3, …, bn> where bi = weight of ti in Q • R(D,Q) = Sim(D,Q) • Consider Query q – Relevance of di to q - Compare similarity of query q and document di – Cosine similarity (the cosine of the angle between the two vectors)
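A minimal cosine-similarity sketch over dense term-weight vectors (real systems evaluate this through the inverted file, as the later implementation slide describes):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc = [0.2, 0.0, 0.7]     # document term weights over a 3-term vocabulary
query = [1.0, 0.5, 0.0]   # query term weights over the same vocabulary
print(cosine(doc, query))  # prints the similarity score used for ranking
```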
  • 66. Vectors Plot [Figure: documents plotted in a two-dimensional space with axes Star and Diet — documents about movie stars, documents about astrology, and documents about mammal behavior occupy different regions of the space]
  • 67. Term-Document Matrix • A collection of n documents can be represented in the vector space model using this matrix • An m × n matrix where m is the number of terms and n is the number of documents • Term - row of the term-document matrix • Document - column of the term-document matrix
  \[ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \]
  where column j holds the weights of terms t_1 … t_m in document d_j
  • 69. Similarity Formulae
  • Dot Product: \( Sim(D,Q) = \sum_i a_i b_i \)
  • Cosine: \( Sim(D,Q) = \dfrac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}} \)
  • Dice: \( Sim(D,Q) = \dfrac{2\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2} \)
  • Jaccard: \( Sim(D,Q) = \dfrac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i} \)
  • 70. Index Storage • Term-Document matrix is very sparse • A few 100 terms for a document and a few terms for a query, while the term space is large (~100k) • Stored as D1 → {(t1, a1), (t2,a2), …} t1 → {(D1,a1), …}
  • 71. Implementation • Implementation of Vector Space Model using dot product – Naïve implementation: O(m*n) – Implementation using inverted file • Given a query = {(t1,b1), (t2,b2)} 1. Find the sets of related documents through the inverted file for t1 and t2 2. Calculate the score of the documents for each weighted term (t1,b1) → {(D1,a1 *b1), …} 3. Combine the sets and sum the weights (∑) • Overall cost: O(|Q|*n)
  • 72. Similarity Calculation • Consider two documents D1, D2 and a query Q – D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0) – a worked cosine computation follows below
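The worked numbers from the original slide are not preserved in this transcript; applying the cosine measure defined earlier to the vectors above gives (a reconstruction, rounded to two decimals):

\[ \cos(D_1, Q) = \frac{0.5 \cdot 1.5 + 0.8 \cdot 1.0 + 0.3 \cdot 0}{\sqrt{0.5^2 + 0.8^2 + 0.3^2}\,\sqrt{1.5^2 + 1.0^2}} = \frac{1.55}{\sqrt{0.98}\,\sqrt{3.25}} \approx 0.87 \]

\[ \cos(D_2, Q) = \frac{0.9 \cdot 1.5 + 0.4 \cdot 1.0 + 0.2 \cdot 0}{\sqrt{0.9^2 + 0.4^2 + 0.2^2}\,\sqrt{1.5^2 + 1.0^2}} = \frac{1.75}{\sqrt{1.01}\,\sqrt{3.25}} \approx 0.97 \]

so under cosine ranking D2 is scored as more similar to Q than D1.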
  • 73. Ranked Retrieval • Documents are ranked based on their score • Advantages – Query is easy to specify – Output is ranked based on the estimated relevance of the documents to the query – A wide variety of theoretical models exist • Disadvantages – Query less precise (although weighting can be used)
  • 74. Example Query • A document space is defined by three terms – computer, application, users • A set of documents defined as – A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) – A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) – A7=(1, 1, 1), A8=(1, 0, 1). A9=(0, 1, 1) • Query is computer and application • What documents should be retrieved?
  • 75. Example Query • In Boolean query matching – AND - document A4, A7 will be retrieved – OR - retrieved: A1, A2, A4, A5, A6, A7, A8, A9 • In similarity matching (cosine) – q=(1, 1, 0) – S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 – S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 – S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 – Document retrieved set (with ranking) • {A4, A7, A1, A2, A5, A6, A8, A9}
  • 76. Assigning Weights to Terms • Binary Weights • Raw term frequency • TF x IDF
  • 77. Binary Weights • Only the presence 1 or absence 0 of a term is included in the vector
  docs   t1  t2  t3   RSV=Q.Di
  D1      1   0   1    4
  D2      1   0   0    1
  D3      0   1   1    5
  D4      1   0   0    1
  D5      1   1   1    6
  D6      1   1   0    3
  D7      0   1   0    2
  D8      0   1   0    2
  D9      0   0   1    3
  D10     0   1   1    5
  D11     1   0   1    3
  Q       1   2   3    (query weights q1, q2, q3)
  • 78. Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector
  docs   t1  t2  t3
  D1      2   0   3
  D2      1   0   0
  D3      0   4   7
  D4      3   0   0
  D5      1   6   3
  D6      3   5   0
  D7      0   8   0
  D8      0  10   0
  D9      0   0   1
  D10     0   3   5
  D11     4   0   1
  • 79. TF.IDF - Term Weighting • Assigns a tf * idf weight to each term in each document • Term weights components – Local - How important is the term in this document? – Global - How important is the term in the collection? • Logic – Terms that appear often in a document should get high weights – Terms that appear in many documents should get low weights • Mathematical Capturing – Term Frequency (local) – Inverse Document Frequency (global)
  • 80. TF.IDF - Term Weighting • tf = Term Frequency – Frequency of a term/keyword in a document – The higher the tf, the higher the importance (weight) for the doc. • df = document frequency – Number of documents containing the term – Distribution of the term • idf = Inverse Document Frequency – Unevenness of term distribution in the corpus – Specificity of term to a document • The more the term is distributed evenly, the less it is specific to a document: weight(t, D) = tf(t ,D) * idf(t)
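A minimal sketch of the weight(t, D) = tf(t, D) * idf(t) scheme, using idf = log(N / df) as one common variant (the toy documents and the exact idf formula are illustrative, not taken from the slides):

```python
import math
from collections import Counter

docs = {
    "D1": ["information", "retrieval", "ranking", "retrieval"],
    "D2": ["database", "query", "retrieval"],
    "D3": ["ranking", "feedback"],
}

N = len(docs)
# df(t): number of documents containing term t
df = Counter(t for terms in docs.values() for t in set(terms))

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)                        # local importance
    idf = math.log(N / df[term]) if term in df else 0.0  # global importance
    return tf * idf

print(tf_idf("retrieval", "D1"))  # appears twice in D1 but in 2 of 3 docs -> 2 * ln(1.5)
print(tf_idf("database", "D2"))   # appears once but in only 1 doc -> 1 * ln(3), higher idf
```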
  • 81. Term Weighting • Based on common sense, but adjusted/engineered following experiments • Terms that occur in only a few documents are often more valuable than ones that occur in many - IDF • The more often a term occurs in a document, the more likely it is to be important for that document - TF • A term that occurs for same number of times in a short and a long document is likely to be more valuable for the short document
  • 82. Term Significance • Word occurrence frequency is a measure of the significance of terms and their discriminatory power [Figure: word-frequency curve — words that are too frequent are useless discriminators, words that are too rare make no significant contribution to the content of the document; significant terms lie between the two cut-offs]
  • 83. TF.IDF Weight • Term frequency weight measures importance in document: • Inverse document frequency measures importance in collection: • Some heuristic modifications
  • 84. Relevance Ranking Using Terms • TF-IDF (Term frequency/Inverse Document frequency) ranking – Let n(d) = number of terms in the document d – n(d, t) = number of occurrences of term t in the document d – Relevance of a document d to a term t: \( TF(d, t) = \log\!\left(1 + \dfrac{n(d, t)}{n(d)}\right) \) • The log factor is to avoid excessive weight to frequent terms – Relevance of document to query Q: \( r(d, Q) = \sum_{t \in Q} \dfrac{TF(d, t)}{n(t)} \)
  • 85. Inverse Document Frequency • IDF provides high values for rare words and low values for common words • For a collection of 10000 documents: \( \log\frac{10000}{10000} = 0, \quad \log\frac{10000}{5000} = 0.301, \quad \log\frac{10000}{20} = 2.698, \quad \log\frac{10000}{1} = 4 \)
  • 86. TF.IDF Normalization • Normalize the term weights – So longer documents are not unfairly given more weight – Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive: \( w_{ik} = \dfrac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N/n_k)]^2}} \)
  • 87. Relevance Ranking Using Terms • Ranking of documents on the basis of estimated relevance to a query • Relevance ranking is based on factors – Term frequency • Frequency of occurrence of query keyword in document – Inverse document frequency • How many documents the query keyword occurs in – Fewer → give more importance to keyword – Hyperlinks to documents • More links to a document → document is more important
  • 88. Relevance Ranking Using Terms • Documents are returned in decreasing order of relevance score – Usually only top few documents are returned • Most systems improvise above model – Common words like a, an, the, it etc are eliminated • Called stop words – Words that occur in title, author list, section headings are given greater importance – Words whose first occurrence is late in the document are given lower importance – Proximity: if keywords in query occur close together in the document, the document has higher importance than if they are far apart
  • 89. Relevance Using Hyperlinks • Number of documents relevant to a query can be enormous if only term frequencies are taken into account • Using term frequencies makes spamming easy – A fitness center could add many occurrences of the words like weights to its page to make its rank very high • Most of the time people are looking for pages from popular sites • Idea - use popularity of web site to rank site pages that match given keywords • Problem - hard to find actual popularity of site
  • 90. Relevance Using Hyperlinks • Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site – Count only one hyperlink from each site – Popularity measure is for site, not for individual page • But most hyperlinks point to root of site • Concept of site is difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity • Refinements – When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige • Definition is circular • Set up and solve system of simultaneous linear equations – Above idea is the basis of the Google Page Rank ranking mechanism
  • 91. Relevance Using Hyperlinks • Connections in social networking – Ranks prestige of people – Someone known by multiple prestigious people has higher prestige • Hub and authority based ranking – A hub is a page that stores links to many pages (on a topic) – An authority is a page that contains actual information on a topic – Each page gets a hub prestige based on prestige of authorities that it points to – Each page gets an authority prestige based on prestige of hubs that point to it – Prestige definitions are cyclic and can be obtained by solving linear equations – Use authority prestige when ranking answers to a query
  • 92. Probability Ranking Principle • Given a user query q and a document d, estimate the probability that the user will find d relevant • Robertson (1977) - If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data
  • 94. Bayes Classifier • Bayes Decision Rule – A document D is relevant if P(R|D) > P(NR|D) • Estimating probabilities – Use Bayes Rule – Classify a document as relevant if – Left hand side is likelihood ratio
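The formulas referred to on this slide did not survive extraction; in the standard form of the rule described above they read:

\[ P(R \mid D) > P(NR \mid D) \;\Longleftrightarrow\; \frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)} \]

where the left-hand side of the second inequality is the likelihood ratio mentioned on the slide.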
  • 95. Estimating P(D|R) • Assume independence • Binary independence model – based on information related to presence and absence of terms in relevant and non-relevant documents – information acquired through relevance feedback process • user stating which of the retrieved documents are relevant / non-relevant – Based on the probability ranking principle, which ensures an optimal ranking – document represented by a vector of binary features indicating term occurrence (or non- occurrence) – pi is probability that term i occurs (has value 1) in relevant document, si is probability of occurrence in non-relevant document
  • 97. Binary Independence Model • Scoring function is • Query provides information about relevant documents • If we assume pi constant, si approximated by entire collection, get idf-like weight
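The scoring function itself is not reproduced in this transcript; in the usual binary independence model derivation, a document is scored by summing over the query terms it contains:

\[ \sum_{i:\; d_i = q_i = 1} \log \frac{p_i\,(1 - s_i)}{s_i\,(1 - p_i)} \]

which, when p_i is assumed constant and s_i is approximated from the entire collection, reduces to the idf-like weight noted on the slide.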
  • 98. Contingency Table [Figure: contingency table of term presence/absence in relevant and non-relevant documents, from which the scoring function above is derived]
  • 99. BM25 Ranking Algorithm • Popular and effective ranking algorithm based on binary independence model – adds document and query term weights – k1, k2 and K are parameters whose values are set empirically – dl is doc length – Typical TREC value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75
  • 100. BM25 Example • Query with two terms player Pele, (qf = 1) • No relevance information (r and R are zero) • N = 500,000 documents • player occurs in 40,000 documents (n1 = 40, 000) • Pele occurs in 300 documents (n2 = 300) • Player occurs 15 times in doc (f1 = 15) • Pele occurs 25 times (f2 = 25) • Document length is 90% of the average length (dl/avdl = .9) • k1 = 1.2, b = 0.75, and k2 = 100 • K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
  • 102. BM25 Example - Effect of Term Frequencies
  Frequency of player   Frequency of Pele   BM25 Score
  15                    25                  20.66
  15                     1                  12.74
  15                     0                   5.00
   1                    25                  18.2
   0                    25                  16.66
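A sketch of a BM25 scorer matching the parameterization described on the previous slides (no relevance information, so r_i = R = 0 and the relevance component reduces to an idf-like log term); with the example's numbers it approximately reproduces the table above:

```python
import math

def bm25_score(query_terms, N, k1=1.2, k2=100, b=0.75, dl_over_avdl=0.9):
    """query_terms: list of (n_i, f_i, qf_i) = doc freq, term freq in doc, term freq in query.
    Assumes no relevance feedback (r_i = R = 0)."""
    K = k1 * ((1 - b) + b * dl_over_avdl)
    score = 0.0
    for n_i, f_i, qf_i in query_terms:
        idf_like = math.log((N - n_i + 0.5) / (n_i + 0.5))
        tf_doc = ((k1 + 1) * f_i) / (K + f_i)
        tf_query = ((k2 + 1) * qf_i) / (k2 + qf_i)
        score += idf_like * tf_doc * tf_query
    return score

# player: n=40000, f=15, qf=1;  Pele: n=300, f=25, qf=1
print(round(bm25_score([(40000, 15, 1), (300, 25, 1)], N=500_000), 2))  # ~20.6
print(round(bm25_score([(40000, 15, 1), (300, 1, 1)],  N=500_000), 2))  # ~12.7
```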
  • 103. Relevance Feedback Process – Iterative Cycle
  1. User is presented with a list of retrieved documents
  2. User marks those which are relevant (or not relevant) – in practice the top 10-20 ranked documents are examined; incremental: one document after the other
  3. The relevance feedback algorithm selects important terms from the documents assessed relevant by users
  4. The relevance feedback algorithm emphasises the importance of these terms in a new query through – query expansion (add these terms to the query), term reweighting (modify the term weights in the query), or query expansion + term reweighting
  5. The updated query is submitted to the system
  6. If the user is satisfied with the new set of retrieved documents, the feedback process stops; otherwise go to step 2
  • Approaches – Approach 1: Add/Remove/Change query terms – Approach 2: Re-weight query terms
  • 104. Relevance Feedback Issues • Relevance feedback – Often users are not reliable in making relevance assessments, or do not make relevance assessments – Implicit relevance feedback by looking at what users access • clicks in web logs • works well - “wisdom of the crowd" – Positive, negative – Partial relevance assessments (very relevant or partially relevant)? – Why is a document relevant? • Interactive query expansion (as opposed to automatic) – Users choose the terms to be added
  • 105. Latent Semantic Indexing - LSI • Term document matrices are very large but people talk about few things • So how can the term-document matrix be represented in a lower dimensional latent space? • Latent Semantic Analysis is the analysis of latent i.e. hidden semantics in a corpus of text • LSI transforms the original data into a different space so that two documents/words about the same concept are mapped close together (so that they have higher cosine similarity) • LSI achieves this by Singular Value Decomposition (SVD) of the term-document matrix • Maps documents and terms to a low dimensional representation • Design a mapping such that the low dimensional space reflects semantic association • Compute document similarity based on the inner product in this latent semantic space
  • 107. LSI • For LSI truncated SVD is used: \( A_k = U_k \Sigma_k V_k^T \) • Where – U_k is an m×k matrix whose columns are the first k left singular vectors of A – Σ_k is a k×k diagonal matrix whose diagonal is formed by the k leading singular values of A – V_k is an n×k matrix whose columns are the first k right singular vectors of A • Rows of U_k = terms • Rows of V_k = documents • SVD can be used to compute optimal low rank approximations
  • 108. LSI • In truncated LSI, the first k independent linear components of A (singular vectors and values) are included • Documents are projected by least squares onto the space spanned by the first k singular vectors of A (LSI space) • The first k components capture the major associational structure in the term-document matrix and throw out the noise • Minor differences in terminology used in documents are ignored • Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based • Documents which contain synonyms are closer in LSI space than in the original space • Documents which contain polysemy (a word having different meanings in different contexts) used in different contexts are farther apart in LSI space than in the original space
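A small numpy sketch of the truncated SVD behind LSI; the toy term-document matrix and the choice k = 2 are illustrative only:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

Uk = U[:, :k]          # first k left singular vectors  (terms in LSI space)
Sk = np.diag(s[:k])    # k leading singular values
Vk = Vt[:k, :].T       # first k right singular vectors (documents in LSI space)

Ak = Uk @ Sk @ Vk.T    # rank-k approximation of A

# Documents can be compared in the latent space via their rows of Vk (scaled by Sk)
doc_vectors = Vk @ Sk
print(np.round(Ak, 2))
print(np.round(doc_vectors, 2))
```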
  • 109. Concept Indexing (CI) • Lexically focused relevance estimation is less effective than semantically focused estimation • Indexing using concept decomposition (CD) instead of SVD as in LSI • Concept decomposition was introduced in 2001 • I. S. Dhillon, D.S. Modha: Concept decomposition for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175
  • 110. Concept Decomposition • Cluster the documents in the term-document matrix A into k groups • Clustering algorithms – Spherical k-means algorithm – Fuzzy k-means algorithm • Spherical k-means is a variant of k-means which uses the fact that document vectors have unit norm • Centroids of groups = concept vectors • The concept matrix is the matrix whose columns are the centroids of the groups: \( C_k = [c_1\; c_2\; \cdots\; c_k] \), where c_j is the centroid of the j-th group
  • 111. Concept Decomposition • Next step: calculate the concept decomposition • The concept decomposition D_k of the term-document matrix A is the least squares approximation of A on the space of concept vectors: \( D_k = C_k Z \), where Z is the solution of the least squares problem \( Z = (C_k^T C_k)^{-1} C_k^T A \) • Rows of C_k = terms • Columns of Z = documents
  • 112. System Evaluation • There are many retrieval models / algorithms / systems • Which one is the best? – How effective is a system at retrieving relevant documents? • What is the best component for – Ranking function (dot-product, cosine, …) – Term selection (stop word removal, stemming…) – Term weighting (TF, TF-IDF,…) • How far down the ranked list will a user need to look to find some / all relevant documents?
  • 113. Effectiveness • The goal of an IR system is to retrieve as many relevant documents as possible and as few non-relevant documents as possible • Evaluating this consists of a comparative evaluation of the technical performance of IR system(s) – In traditional IR, technical performance means the effectiveness of the IR system: the ability of the IR system to retrieve relevant documents and suppress non-relevant documents • Effectiveness is measured by the combination of recall and precision
  • 114. Measuring Query Retrieval Effectiveness • Information-retrieval systems save space by using index structures that support only approximate retrieval • May result in – False negative (false drop) - some relevant documents may not be retrieved – False positive - some irrelevant documents may be retrieved – For many applications a good index should not permit any false drops, but may permit a few false positives
  • 115. Precision and Recall • The retrieved set partitions all documents into four groups:
                   relevant                       irrelevant
    retrieved      retrieved & relevant           retrieved & irrelevant
    not retrieved  not retrieved but relevant     not retrieved & irrelevant
  • (Diagram: Venn of Relevant, Retrieved and their intersection Relevant & Retrieved within All Documents)
  • 116. Precision and Recall • In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be different • This difference is formally measured with precision and recall • Precision - what percentage of the retrieved documents are relevant to the query • Recall - what percentage of the documents relevant to the query were retrieved • precision = (number of relevant documents retrieved) / (total number of documents retrieved) • recall = (number of relevant documents retrieved) / (total number of relevant documents)
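A minimal sketch of these two definitions, using made-up document id sets:

```python
relevant = {"d3", "d5", "d9", "d25"}           # judged relevant for the query
retrieved = {"d3", "d9", "d25", "d41", "d77"}  # returned by the system

hits = relevant & retrieved
precision = len(hits) / len(retrieved)   # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)       # fraction of relevant docs that were retrieved
print(precision, recall)                 # 0.6 0.75
```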
  • 117. Measuring Retrieval Effectiveness • The above two measures do not take into account where the relevant documents are retrieved, that is, at which rank (crucial, since the output of most IR systems is a ranked list of documents) • This is very important because an effective IR system should not only retrieve as many relevant documents and as few non-relevant documents as possible, but should also retrieve the relevant documents before the non-relevant ones • Recall vs. Precision tradeoff – Recall can be increased by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision • Measures of retrieval effectiveness – Recall as a function of the number of documents fetched, or – Precision as a function of recall • Equivalently, as a function of the number of documents fetched – Example - precision of 75% at a recall of 50%, and 60% at a recall of 75% • Problem: which documents are actually relevant, and which are not
  • 118. Systems Evaluation - Challenges • Effectiveness is related to the relevancy of retrieved items • Relevancy is not typically binary but continuous • Even if relevancy is binary, it can be a difficult judgment to make • Relevancy from a human standpoint – Subjective: Depends upon a specific user’s judgment – Situational: Relates to user’s current needs – Cognitive: Depends on human perception and behavior – Dynamic: Changes over time • Total number of relevant items is sometimes not available – Sample across the database and perform relevance judgment on these items – Apply different retrieval algorithms to the same database for the same query. The aggregate of relevant items is taken as the total relevant set
  • 119. Precision and Recall - Trade-offs • (Plot: precision vs. recall, both axes from 0 to 1) • The ideal system sits in the top-right corner: high precision and high recall • High precision, low recall - returns relevant documents but misses many useful ones • High recall, low precision - returns most relevant documents but includes lots of junk
  • 120. Recall / Precision • Let us assume that for a given query, the following documents are relevant: {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} • Now suppose that the following documents are retrieved for that query • For each relevant document (in red bold), we calculate the precision value and the recall value • For example, for d56, we have 3 retrieved documents so far, and 2 among them are relevant, so the precision is 2/3. We have retrieved 2 of the relevant documents so far (the total number of relevant documents being 10), so the recall is 2/10
  • 121. Recall / Precision • For each query, we obtain pairs of recall and precision values • In our example, we would obtain (1/10, 1/1) (2/10, 2/3) (3/10, 3/6) (4/10, 4/10) (5/10, 5/14) . . . which are usually expressed in % (10%, 100%) (20%, 66.66%) (30%, 50%) (40%, 40%) (50%, 35.71%) . . . • This can be read, for instance, as: at 20% recall, we have 66.66% precision; at 50% recall, we have 35.71% precision • The pairs of values are plotted into a graph, giving a recall/precision curve
  • 122. The complete methodology • For each IR system / IR system version – For each query in the test collection • We first run the query against the system to obtain a ranked list of retrieved documents • We use the ranking and relevance judgements to calculate recall/precision pairs – Then we average recall / precision values across all queries, to obtain an overall measure of the effectiveness
  • 124. Computing Recall / Precision Points • For a given query, produce the ranked list of retrievals • Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures • Mark each document in the ranked list that is relevant according to the gold standard • Compute a recall/precision pair for each position in the ranked list that contains a relevant document
  • 125. Computing Recall / Precision Points • Let total # of relevant docs = 6 • Check each new recall point • One relevant document is missed, so 100% recall is never reached
  rank  doc #  relevant  recall / precision
   1    588    X         R = 1/6 = 0.167; P = 1/1 = 1.0
   2    589    X         R = 2/6 = 0.333; P = 2/2 = 1.0
   3    576
   4    590    X         R = 3/6 = 0.5;   P = 3/4 = 0.75
   5    986
   6    592    X         R = 4/6 = 0.667; P = 4/6 = 0.667
   7    984
   8    988
   9    578
  10    985
  11    103
  12    591
  13    772    X         R = 5/6 = 0.833; P = 5/13 = 0.38
  14    990
  • 126. Computing Recall / Precision Points • Let total # of relevant docs = 6 • Check each new recall point
  rank  doc #  relevant  recall / precision
   1    588    X         R = 1/6 = 0.167; P = 1/1 = 1.0
   2    576
   3    589    X         R = 2/6 = 0.333; P = 2/3 = 0.667
   4    342
   5    590    X         R = 3/6 = 0.5;   P = 3/5 = 0.6
   6    717
   7    984
   8    772    X         R = 4/6 = 0.667; P = 4/8 = 0.5
   9    312    X         R = 5/6 = 0.833; P = 5/9 = 0.556
  10    498
  11    113
  12    628
  13    772
  14    592    X         R = 6/6 = 1.0;   P = 6/14 = 0.429
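A minimal sketch of computing these points from per-rank relevance judgements; the judgement vector below corresponds to the second worked example above:

```python
judgements = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]   # 1 = relevant at that rank
total_relevant = 6

hits, points = 0, []
for rank, rel in enumerate(judgements, start=1):
    if rel:
        hits += 1
        points.append((hits / total_relevant, hits / rank))   # (recall, precision)

for r, p in points:
    print(f"R={r:.3f}  P={p:.3f}")
# R=0.167 P=1.000 ... R=1.000 P=0.429, matching the table above
```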
  • 127. Average Recall / Precision Curve • Typically average performance over a large set of queries • Compute average precision at each standard recall level across all queries • Plot average precision/recall curves to evaluate overall system performance on a document/query corpus
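One common way to obtain precision at standard recall levels is 11-point interpolation (an assumption here, since the slide does not name the interpolation scheme): the interpolated precision at recall level r is the maximum precision at any recall ≥ r, and the resulting 11-value vectors are averaged element-wise over all queries:

```python
def interpolated_precision(points, levels=tuple(i / 10 for i in range(11))):
    """points: list of (recall, precision) pairs for one query, as in the previous sketch."""
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# For the points of the previous sketch this gives:
# [1.0, 1.0, 0.667, 0.667, 0.6, 0.6, 0.556, 0.556, 0.556, 0.429, 0.429]
```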
  • 128. Compare Systems • The curve closest to the upper right-hand corner of the graph indicates the best performance
  • 129. R-Precision • The precision at rank R in the ranking of results for a query, where R is the number of relevant documents for that query • R-precision places lower emphasis on the exact ranking of the relevant documents returned • This can be useful when a topic has a large number of judged relevant documents, or when an evaluator is more interested in measuring aggregate performance as opposed to the fine-grained quality of the ranking provided by the system
  rank  doc #  relevant
   1    588    x
   2    589    x
   3    576
   4    590    x
   5    986
   6    592    x
   7    984
   8    988
   9    578
  10    985
  11    103
  12    591
  13    772    x
  14    990
  R = # of relevant docs = 6; R-Precision = precision at rank 6 = 4/6 ≈ 0.67
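A minimal sketch of R-precision on per-rank judgements matching the table above:

```python
judgements = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   # 1 = relevant at that rank
R = 6                                                       # number of relevant docs
r_precision = sum(judgements[:R]) / R
print(r_precision)   # 4/6 ≈ 0.667
```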
  • 130. F-Measure • A measure of performance that considers both recall and precision • Harmonic mean of recall and precision: F = 2PR / (P + R) = 2 / (1/R + 1/P) • Compared to the arithmetic mean, both need to be high for the harmonic mean to be high
  • 131. E Measure - Parameterized F Measure • A variant of the F measure that allows weighting emphasis on precision over recall: E = (1 + β²)PR / (β²P + R) = (1 + β²) / (β²/R + 1/P) • Value of β controls the trade-off – β = 1: Equally weight precision and recall (E = F) – β > 1: Weight recall more – β < 1: Weight precision more
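A minimal sketch of both measures as defined above (following the slide's convention that β = 1 reduces E to F); the input precision/recall values are illustrative:

```python
def f_measure(precision, recall):
    # harmonic mean: F = 2PR / (P + R)
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0

def e_measure(precision, recall, beta=1.0):
    # E = (1 + beta^2) P R / (beta^2 P + R); beta > 1 weights recall more
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

print(f_measure(0.6, 0.75))          # 0.667
print(e_measure(0.6, 0.75, beta=2))  # 0.714 (recall weighted more)
```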
  • 132. Mean Average Precision (MAP) • Average Precision: the average of the precision values at the points at which each relevant document is retrieved – Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633 – Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625 • Mean Average Precision: the mean of the average precision values over a set of queries • Provides a single number for comparing algorithms
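A minimal sketch of average precision and MAP, reusing per-rank judgement vectors for the two worked examples above (relevant documents that are never retrieved contribute a precision of 0, as in Ex1):

```python
def average_precision(judgements, total_relevant):
    hits, precisions = 0, []
    for rank, rel in enumerate(judgements, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant   # missed relevant docs count as 0

queries = [
    ([1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0], 6),   # Ex1 -> 0.633
    ([1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1], 6),   # Ex2 -> 0.625
]
mean_ap = sum(average_precision(j, n) for j, n in queries) / len(queries)
print(mean_ap)   # ~0.629
```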
  • 133. Non-Binary Relevance • Documents are rarely entirely relevant or non-relevant to a query • Many sources of graded relevance judgments – Relevance judgments on a 5-point scale – Multiple judges – Click distribution and deviation from expected levels (but click-through != relevance judgments)
  • 134. A/B Testing • Exploits the existing user base to provide useful feedback • Randomly send a small fraction (1-10%) of incoming users to a variant of the system that includes a single change • Judge effectiveness by measuring the change in click-through: the percentage of users that click on the top result (or any result on the first page)
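A minimal sketch of the comparison, with made-up user and click counts:

```python
control_users, control_clicks = 90_000, 31_500   # users kept on the current system
variant_users, variant_clicks = 10_000, 3_900    # ~10% of traffic routed to the variant

ctr_control = control_clicks / control_users     # 0.35
ctr_variant = variant_clicks / variant_users     # 0.39
print(f"control CTR={ctr_control:.3f}, variant CTR={ctr_variant:.3f}, "
      f"relative change={(ctr_variant - ctr_control) / ctr_control:+.1%}")
```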
  • 136. Thank You Check Out My LinkedIn Profile at https://in.linkedin.com/in/girishkhanzode