Ms. T. Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore
 A retrieval model can be a description of either the
computational process or the human process of
retrieval:
 The process of choosing documents for retrieval
 The process by which information needs are first
articulated and then refined
 Boolean Models
 Vector Space Models
 Probabilistic Models
 Models based on Belief nets
 Models based on Language Models
 A document is represented as a set of keywords.
 Index terms are considered to be either present or absent in a
document and to provide equal evidence with respect to information
needs.
 Queries are Boolean expressions of keywords, connected by AND,
OR, and NOT, including the use of brackets to indicate scope.
[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
 Output: Document is relevant or not. No partial matches or ranking.
 User need: I’m interested in learning about vitamins
other than vitamin E that are antioxidants.
 User’s Boolean query: antioxidant AND vitamin
AND NOT vitamin E
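The query above can be evaluated with plain set operations. The following is a minimal sketch, not a full search engine: the four documents, their keyword sets, and the helper `postings` are hypothetical, and "vitamin e" is assumed to be indexed as a single keyword.

```python
# Toy collection: each document is reduced to a set of keywords (hypothetical data).
docs = {
    1: {"antioxidant", "vitamin", "vitamin e"},
    2: {"antioxidant", "vitamin", "vitamin c"},
    3: {"vitamin", "vitamin d"},
    4: {"antioxidant", "selenium"},
}

def postings(term):
    """Return the ids of all documents whose keyword set contains the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_ids = set(docs)

# antioxidant AND vitamin AND NOT "vitamin e"
# AND is set intersection; NOT is the complement with respect to the collection.
result = postings("antioxidant") & postings("vitamin") & (all_ids - postings("vitamin e"))
print(sorted(result))  # [2]
```

Every document is either in the result set or not, which is the "no partial matches or ranking" behaviour of the Boolean model.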
 For each retrieval model, there are three explicit
components:
 Document representation d
 Query q
 Ranking function R(d, q)
 An IR strategy is a technique by which a relevance
measure is obtained between a query and a document.
 Retrieve documents that make the query true.
 Boolean: documents either match or don’t.
 Good for expert users with precise understanding of
their needs and of the collection.
 Also good for applications: Applications can easily
consume 1000s of results.
 Not good for the majority of users
 This is particularly true of web search.
 Boolean queries often have either too few or too many results.
Query 1
standard AND user AND dlink AND 650
→ 200,000 hits Feast!
Query 2
standard AND user AND dlink AND 650 AND no AND card AND found
→ 0 hits Famine!
 In Boolean retrieval, it takes a lot of skill to come up with a query that
produces a manageable number of hits.
 In ranked retrieval, “feast or famine” is less of a problem.
 Condition: Results that are more relevant are ranked higher than results that
are less relevant. (i.e., the ranking algorithm works.)
 A commonly used measure of overlap of two sets
 Let A and B be two sets
 Jaccard coefficient:
jaccard(A,B) = |A∩B| / |A∪B|
 jaccard(A,A) = 1 (for non-empty A)
 jaccard(A,B) = 0 if A∩B = ∅
 A and B don’t have to be the same size. Always
assigns a number between 0 and 1.
What is the query-document match score that the Jaccard
coefficient computes for:
 Query
“ides of March”
 Document
“Caesar died in March”
jaccard(q,d) = 1/6
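The score can be checked with a short sketch. It assumes whitespace tokenisation and lowercasing, which the slides do not specify; the function name `jaccard` mirrors the notation above, and returning 0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

query = "ides of March".lower().split()
document = "Caesar died in March".lower().split()

score = jaccard(query, document)
print(score)  # one shared term ("march") out of six distinct terms
```

The only shared term is "march", and the union holds six distinct terms, which gives 1/6.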
 It doesn’t consider term frequency (how many
occurrences a term has).
 Rare terms are more informative than frequent terms.
 Jaccard does not consider this information.
Advantages
 Can use very restrictive search
 Makes experienced users happy
 Clear formalism
 Simplicity
 It is still used in small-scale searches, such as searching
e-mails or files on local hard drives
Disadvantages
 Simple queries do not work well.
 Complex query language, confusing to end users
 Difficult to control the number of documents
retrieved.
◦ All matched documents will be returned.
 Difficult to rank output.
◦ All matched documents logically satisfy the query.
 Difficult to perform relevance feedback.
◦ If a document is identified by the user as relevant or
irrelevant, how should the query be modified?
 Vector space model or term vector model is an
algebraic model for representing text documents (and
any objects, in general) as vectors of identifiers, such
as, for example, index terms.
 It is used in information filtering, information
retrieval, indexing and relevancy rankings.
The basis vectors correspond to the dimensions or
directions of the vector space
A vector is a point in a vector space and has length
(from the origin to the point) and direction
 A 2-dimensional vector can be written as [x, y]
 A 3-dimensional vector can be written as [x, y, z]
 Let V denote the size of the indexed vocabulary
 Any arbitrary span of text (i.e., a document, or a
query) can be represented as a vector in
V-dimensional space
 Let’s assume three index terms: dog, bite, man (i.e.,
V=3)
1 = the term appears at least once
0 = the term does not appear
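As an illustration of this binary encoding, the sketch below maps a span of text to a vector over the three index terms. The example texts and the helper `binary_vector` are hypothetical, and tokens are matched exactly (no stemming), which is an assumption.

```python
vocabulary = ["dog", "bite", "man"]  # V = 3

def binary_vector(text):
    """Return one component per index term: 1 if it appears at least once, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

print(binary_vector("man bite dog"))  # [1, 1, 1]
print(binary_vector("dog bite"))      # [1, 1, 0]
print(binary_vector("man"))           # [0, 0, 1]
```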
A query is a vector in V-dimensional space, where
V is the number of terms in the vocabulary
 The vector space model ranks documents based on
the vector-space similarity between the query vector
and the document vector
 There are many ways to compute the similarity
between two vectors
 One way is to compute the inner product
Multiply corresponding components and then sum
those products
Pros and Cons
 The inner-product doesn’t account for the fact that
documents have widely varying lengths
 All things being equal, longer documents are more
likely to have the query-terms
 So, the inner-product favours long documents
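The length bias can be seen in a small sketch. It assumes term-count vectors rather than the binary vectors used earlier, and the two documents are hypothetical: the long one is simply the short one repeated five times, so it is no more relevant, only longer.

```python
vocabulary = ["dog", "bite", "man"]

def tf_vector(text):
    """Count how often each index term occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

def inner_product(u, v):
    """Multiply corresponding components, then sum the products."""
    return sum(x * y for x, y in zip(u, v))

query = tf_vector("dog bite man")
short_doc = tf_vector("dog bite man")
long_doc = tf_vector("dog bite man " * 5)  # same content repeated five times

print(inner_product(query, short_doc))  # 3
print(inner_product(query, long_doc))   # 15
```

The repeated document scores five times higher under the inner product even though it adds no new information.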
 Document represented as a vector:
d = <d1, d2, …, dn>
 Query represented as a vector: q = <q1, q2, …, qn>
 Ranking function (retrieval status value):
 The cosine similarity between two vectors (or two
documents on the Vector Space) is a measure that
calculates the cosine of the angle between them.
 The cosine similarity equation is obtained by solving the
dot product equation, a · b = |a| |b| cos θ, for cos θ:
cos θ = (a · b) / (|a| |b|)
 The numerator is the inner product
 The denominator is the product of the two vector
lengths
 Ranges from 0 to 1 when all components are non-negative, as with
term weights (equals 1 if the vectors are identical)
 a =[1, 2, 3]
 b =[4,-5,6]
a with b is dpab = 1*4 + 2*(-5) + 3*6 = 12
a with itself is dpaa = 1*1 + 2*2 + 3*3 = 14
b with itself is dpbb = 4*4 + (-5)*(-5) + 6*6 = 77
la = (dpaa)^½ = (14)^½ = 3.74; i.e., the length of a.
lb = (dpbb)^½ = (77)^½ = 8.77; i.e., the length of b.
la*lb = (dpaa)^½ * (dpbb)^½ = 32.83;
i.e., the length product (lpab) of a and b.
dot product/length product ratio is dpab/lpab = 12/32.83 ≈ 0.37
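The arithmetic above can be reproduced with a few lines of Python. This is a sketch using only the standard library; the function names `dot`, `length`, and `cosine_similarity` are chosen here for illustration, and zero vectors are not handled (they would cause a division by zero).

```python
import math

def dot(u, v):
    """Inner product: multiply corresponding components and sum."""
    return sum(x * y for x, y in zip(u, v))

def length(u):
    """Euclidean length of a vector: square root of its dot product with itself."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return dot(u, v) / (length(u) * length(v))

a = [1, 2, 3]
b = [4, -5, 6]

print(dot(a, b))                                 # 12
print(round(length(a), 2), round(length(b), 2))  # 3.74 8.77
print(round(cosine_similarity(a, b), 2))         # 0.37
```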
 The vector space model procedure can be divided
into three stages.
 The first stage is the document indexing where
content-bearing terms are extracted from the
document text.
 The second stage is the weighting of the indexed
terms to enhance retrieval of documents relevant to the
user.
 The last stage ranks the documents with respect to the
query according to a similarity measure.
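The three stages can be sketched end to end as follows. This is one possible reading, not the method prescribed by the slides: the stop list, the use of raw term frequency as the weight, the choice of cosine similarity for ranking, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "in", "of"}  # tiny illustrative stop list (assumption)

def index_terms(text):
    """Stage 1: extract content-bearing terms by dropping stop words."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def weight(terms):
    """Stage 2: weight the indexed terms; raw term frequency is used here."""
    return Counter(terms)

def cosine(u, v):
    """Stage 3 similarity: cosine between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Score every document against the query and sort by descending similarity."""
    q = weight(index_terms(query))
    scored = [(cosine(q, weight(index_terms(d))), d) for d in docs]
    return sorted(scored, reverse=True)

docs = [
    "the dog bit the man",
    "the man walked the dog in the park",
    "a cat sat in the sun",
]
ranked = rank("dog man", docs)
for score, doc in ranked:
    print(round(score, 2), doc)
```

Swapping `weight` for a tf-idf scheme, or `cosine` for another similarity measure, changes the ranking but not the three-stage structure.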

Boolean,vector space retrieval Models
