SlideShare a Scribd company logo
Boolean and Vector Space
Retrieval Models

1
Retrieval Models
• A retrieval model specifies the details
of:
– Document representation
– Query representation
– Retrieval function

• Determines a notion of relevance.
• Notion of relevance can be binary or
continuous (i.e. ranked retrieval).
2
Classes of Retrieval Models
• Boolean models (set theoretic)
– Extended Boolean

• Vector space models
(statistical/algebraic)
– Generalized VS
– Latent Semantic Indexing

• Probabilistic models
3
Common Preprocessing Steps
• Strip unwanted characters/markup (e.g. HTML
tags, punctuation, numbers, etc.).
• Break into tokens (keywords) on whitespace.
• Stem tokens to “root” words
– computational  comput

• Remove common stopwords (e.g. a, the, it, etc.).
• Detect common phrases (possibly using a domain
specific dictionary).
• Build inverted index (keyword  list of docs
containing it).
4
Statistical Models
• A document is typically represented by a bag of
words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of the
same element.
• User specifies a set of desired terms with optional
weights:
– Weighted query terms:
Q = < database 0.5; text 0.8; information 0.2 >
– Unweighted query terms:
Q = < database; text; information >
– No Boolean conditions specified in the query.
5
Statistical Retrieval
• Retrieval based on similarity between query
and documents.
• Output documents are ranked according to
similarity to query.
• Similarity based on occurrence frequencies of
keywords in query and document.
• Automatic relevance feedback can be
supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.

6
Issues for Vector Space Model
• How to determine important words in a
document?
– Word sense?
– Word n-grams (and phrases, idioms,…)  terms

• How to determine the degree of importance of a
term within a document and within the entire
collection?
• How to determine the degree of similarity between
a document and the query?
• In the case of the web, what is a collection and
what are the effects of links, formatting
information, etc.?
7
The Vector-Space Model
• Assume t distinct terms remain after
preprocessing; call them index terms or the
vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|

• Each term, i, in a document or query, j, is given a
real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
8
Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3

T3

D2 = 3T1 + 7T2 + T3

5

Q = 0T1 + 0T2 + 2T3
D1 = 2T1+ 3T2 + 5T3

Q = 0T1 + 0T2 + 2T3
2

3

T1
D2 = 3T1 + 7T2 + T3

T2

7

3

• Is D1 or D2 more similar to Q?
• How to measure the degree of
similarity? Distance? Angle?
Projection?
9
Document Collection
• A collection of n documents can be represented in the
vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term has no
significance in the document or it simply doesn’t exist in
the document.
T1 T2 ….
Tt
D1 w11 w21 …
wt1
D2 w12 w22 … wt2
:
:
:
:
:
:
:
:
Dn w1n w2n …
wtn
10
Term Weights: Term Frequency
• More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by
dividing by the frequency of the most common
term in the document:
tfij = fij / maxi{fij}

11
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are
less indicative of overall topic.
df i = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i,
= log2 (N/ df i)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
12
TF-IDF Weighting
• A typical combined term importance indicator is
tf-idf weighting:
wij = tfij idfi = tfij log2 (N/ dfi)
• A term occurring frequently in the document but
rarely in the rest of the collection is given high
weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work
well.
13
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8

14
Query Vector
• Query vector is typically treated as a
document and also tf-idf weighted.
• Alternative is for the user to supply weights
for the given query terms.

15
Similarity Measure
• A similarity measure is a function that computes
the degree of similarity between two vectors.
• Using a similarity measure between the query and
each document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that the
size of the retrieved set can be controlled.

16
Similarity Measure - Inner Product
• Similarity between vectors for the document di and query q can
be computed as the vector inner product (a.k.a. dot product):
t

sim(dj,q) = dj•q =

∑w w
i =1

ij

iq

where wij is the weight of term i in document j and wiq is the weight of
term i in the query

• For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
17
Properties of Inner Product
• The inner product is unbounded.
• Favors long documents with a large number of
unique terms.
• Measures how many terms matched but not how
many terms are not matched.

18
Inner Product -- Examples
Binary:

t
e
e n t io n
ur er
m a
al se ct t
ge rm
ev aba hite pu t
a o
i
etr dat arc com tex man inf
r

– D = 1, 1, 1, 0, 1, 1,

0

– Q = 1, 0 , 1, 0, 0, 1,

1

Size of vector = size of vocabulary = 7
0 means corresponding term not found in
document or query

sim(D, Q) = 3

Weighted:
D1 = 2T1 + 3T2 + 5T3
Q = 0T1 + 0T2 + 2T3

D2 = 3T1 + 7T2 + 1T3

sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2

19
Cosine Similarity Measure
• Cosine similarity measures the cosine of
the angle between two vectors.
• Inner product normalized by the vector
lengths.
 
(
dj q = ∑wij ⋅wiq )
 
CosSim(dj, q) =
dj ⋅ q
∑ ij ⋅∑ iq
w
w

t3

θ1

D1

t

•

i=
1

t

i=
1

2

t

2

θ2

Q
t1

i=
1

t2

D2

D1 = 2T1 + 3T2 + 5T3 CosSim(D1 , Q) = 10 / √(4+9+25)(0+0+4) = 0.81
D2 = 3T1 + 7T2 + 1T3 CosSim(D2 , Q) = 2 / √(9+49+1)(0+0+4) = 0.13
Q = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity but only 5 times better using
inner product.
20
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word
occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite
obvious weaknesses.
• Allows efficient implementation for large
document collections.

21

More Related Content

What's hot

Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
nimmyjans4
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
ssbd6985
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
Basma Gamal
 
The vector space model
The vector space modelThe vector space model
The vector space model
pkgosh
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information RetrievalDishant Ailawadi
 
Information retrieval system
Information retrieval systemInformation retrieval system
Information retrieval system
Leslie Vargas
 
Inverted index
Inverted indexInverted index
Inverted index
Krishna Gehlot
 
Z39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol pptZ39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol ppt
SUNILKUMARSINGH
 
key word indexing and their types with example
key word indexing and their types with example key word indexing and their types with example
key word indexing and their types with example
Sourav Sarkar
 
Information retrieval (introduction)
Information  retrieval (introduction) Information  retrieval (introduction)
Information retrieval (introduction)
Primya Tamil
 
Information Retrieval-4(inverted index_&amp;_query handling)
Information Retrieval-4(inverted index_&amp;_query handling)Information Retrieval-4(inverted index_&amp;_query handling)
Information Retrieval-4(inverted index_&amp;_query handling)
Jeet Das
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval ssilambu111
 
Information Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case StudyInformation Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case Study
Bhojaraju Gunjal
 
Text MIning
Text MIningText MIning
Text MIning
Prakhyath Rai
 
Dublin core Presentation
Dublin core PresentationDublin core Presentation
Dublin core Presentation
Mary Jo Chrabasz
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrievalNanthini Dominique
 
information retrieval Techniques and normalization
information retrieval Techniques and normalizationinformation retrieval Techniques and normalization
information retrieval Techniques and normalization
Ameenababs
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval modelbaradhimarch81
 

What's hot (20)

Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systems
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
 
Information retrieval system
Information retrieval systemInformation retrieval system
Information retrieval system
 
Inverted index
Inverted indexInverted index
Inverted index
 
Z39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol pptZ39.50: Information Retrieval protocol ppt
Z39.50: Information Retrieval protocol ppt
 
key word indexing and their types with example
key word indexing and their types with example key word indexing and their types with example
key word indexing and their types with example
 
Lec1,2
Lec1,2Lec1,2
Lec1,2
 
Information retrieval (introduction)
Information  retrieval (introduction) Information  retrieval (introduction)
Information retrieval (introduction)
 
Information Retrieval-4(inverted index_&amp;_query handling)
Information Retrieval-4(inverted index_&amp;_query handling)Information Retrieval-4(inverted index_&amp;_query handling)
Information Retrieval-4(inverted index_&amp;_query handling)
 
Information retrieval s
Information retrieval sInformation retrieval s
Information retrieval s
 
Information Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case StudyInformation Storage and Retrieval : A Case Study
Information Storage and Retrieval : A Case Study
 
Text MIning
Text MIningText MIning
Text MIning
 
Dublin core Presentation
Dublin core PresentationDublin core Presentation
Dublin core Presentation
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
 
information retrieval Techniques and normalization
information retrieval Techniques and normalizationinformation retrieval Techniques and normalization
information retrieval Techniques and normalization
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval model
 

Viewers also liked

Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Sean Golliher
 
Probabilistic Retrieval
Probabilistic RetrievalProbabilistic Retrieval
Probabilistic Retrieval
otisg
 
Research IT at the University of Bristol
Research IT at the University of BristolResearch IT at the University of Bristol
Research IT at the University of Bristol
Simon Price
 
SubSift: a novel application of the vector space model to support the academi...
SubSift: a novel application of the vector space model to support the academi...SubSift: a novel application of the vector space model to support the academi...
SubSift: a novel application of the vector space model to support the academi...
Simon Price
 
Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space modeldalal404
 
Fuzzy Logic ppt
Fuzzy Logic pptFuzzy Logic ppt
Fuzzy Logic ppt
Ritu Bafna
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
Minha Hwang
 
Information storage and retrieval
Information storage and retrievalInformation storage and retrieval
Information storage and retrievalSadaf Rafiq
 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & Models
Mounia Lalmas-Roelleke
 

Viewers also liked (11)

Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Probabilistic Retrieval
Probabilistic RetrievalProbabilistic Retrieval
Probabilistic Retrieval
 
Research IT at the University of Bristol
Research IT at the University of BristolResearch IT at the University of Bristol
Research IT at the University of Bristol
 
SubSift: a novel application of the vector space model to support the academi...
SubSift: a novel application of the vector space model to support the academi...SubSift: a novel application of the vector space model to support the academi...
SubSift: a novel application of the vector space model to support the academi...
 
Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space model
 
SAX-VSM
SAX-VSMSAX-VSM
SAX-VSM
 
Fuzzy Logic ppt
Fuzzy Logic pptFuzzy Logic ppt
Fuzzy Logic ppt
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Precis writing
Precis writingPrecis writing
Precis writing
 
Information storage and retrieval
Information storage and retrievalInformation storage and retrieval
Information storage and retrieval
 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & Models
 

Similar to Ir models

Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
Habtamu100
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptx
thenmozhip8
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
rchbeir
 
UNIT 3 IRT.docx
UNIT 3 IRT.docxUNIT 3 IRT.docx
UNIT 3 IRT.docx
thenmozhip8
 
Term weighting
Term weightingTerm weighting
Term weighting
Primya Tamil
 
IRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF Metric
IRJET Journal
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
WrushabhShirsat3
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
National Institute of Informatics
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.ppt
pepe3059
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
Kira
 
Simple semantics in topic detection and tracking
Simple semantics in topic detection and trackingSimple semantics in topic detection and tracking
Simple semantics in topic detection and trackingGeorge Ang
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I)
 
Vector space model12345678910111213.pptx
Vector space model12345678910111213.pptxVector space model12345678910111213.pptx
Vector space model12345678910111213.pptx
someyamohsen2
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
BereketAraya
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
BereketAraya
 
Information retrieval 8 term weighting
Information retrieval 8 term weightingInformation retrieval 8 term weighting
Information retrieval 8 term weighting
Vaibhav Khanna
 

Similar to Ir models (20)

Lec 4,5
Lec 4,5Lec 4,5
Lec 4,5
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptx
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Lec1
Lec1Lec1
Lec1
 
UNIT 3 IRT.docx
UNIT 3 IRT.docxUNIT 3 IRT.docx
UNIT 3 IRT.docx
 
Term weighting
Term weightingTerm weighting
Term weighting
 
IRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF Metric
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.ppt
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Simple semantics in topic detection and tracking
Simple semantics in topic detection and trackingSimple semantics in topic detection and tracking
Simple semantics in topic detection and tracking
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
Vector space model12345678910111213.pptx
Vector space model12345678910111213.pptxVector space model12345678910111213.pptx
Vector space model12345678910111213.pptx
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Information retrieval 8 term weighting
Information retrieval 8 term weightingInformation retrieval 8 term weighting
Information retrieval 8 term weighting
 

Recently uploaded

CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Po-Chuan Chen
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 

Recently uploaded (20)

CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 

Ir models

  • 1. Boolean and Vector Space Retrieval Models 1
  • 2. Retrieval Models • A retrieval model specifies the details of: – Document representation – Query representation – Retrieval function • Determines a notion of relevance. • Notion of relevance can be binary or continuous (i.e. ranked retrieval). 2
  • 3. Classes of Retrieval Models • Boolean models (set theoretic) – Extended Boolean • Vector space models (statistical/algebraic) – Generalized VS – Latent Semantic Indexing • Probabilistic models 3
  • 4. Common Preprocessing Steps • Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.). • Break into tokens (keywords) on whitespace. • Stem tokens to “root” words – computational  comput • Remove common stopwords (e.g. a, the, it, etc.). • Detect common phrases (possibly using a domain specific dictionary). • Build inverted index (keyword  list of docs containing it). 4
  • 5. Statistical Models • A document is typically represented by a bag of words (unordered words with frequencies). • Bag = set that allows multiple occurrences of the same element. • User specifies a set of desired terms with optional weights: – Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 > – Unweighted query terms: Q = < database; text; information > – No Boolean conditions specified in the query. 5
  • 6. Statistical Retrieval • Retrieval based on similarity between query and documents. • Output documents are ranked according to similarity to query. • Similarity based on occurrence frequencies of keywords in query and document. • Automatic relevance feedback can be supported: – Relevant documents “added” to query. – Irrelevant documents “subtracted” from query. 6
  • 7. Issues for Vector Space Model • How to determine important words in a document? – Word sense? – Word n-grams (and phrases, idioms,…)  terms • How to determine the degree of importance of a term within a document and within the entire collection? • How to determine the degree of similarity between a document and the query? • In the case of the web, what is a collection and what are the effects of links, formatting information, etc.? 7
  • 8. The Vector-Space Model • Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. • These “orthogonal” terms form a vector space. Dimension = t = |vocabulary| • Each term, i, in a document or query, j, is given a real-valued weight, wij. • Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj) 8
  • 9. Graphic Representation Example: D1 = 2T1 + 3T2 + 5T3 T3 D2 = 3T1 + 7T2 + T3 5 Q = 0T1 + 0T2 + 2T3 D1 = 2T1+ 3T2 + 5T3 Q = 0T1 + 0T2 + 2T3 2 3 T1 D2 = 3T1 + 7T2 + T3 T2 7 3 • Is D1 or D2 more similar to Q? • How to measure the degree of similarity? Distance? Angle? Projection? 9
  • 10. Document Collection • A collection of n documents can be represented in the vector space model by a term-document matrix. • An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document. T1 T2 …. Tt D1 w11 w21 … wt1 D2 w12 w22 … wt2 : : : : : : : : Dn w1n w2n … wtn 10
  • 11. Term Weights: Term Frequency • More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j • May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document: tfij = fij / maxi{fij} 11
  • 12. Term Weights: Inverse Document Frequency • Terms that appear in many different documents are less indicative of overall topic. df i = document frequency of term i = number of documents containing term i idfi = inverse document frequency of term i, = log2 (N/ df i) (N: total number of documents) • An indication of a term’s discrimination power. • Log used to dampen the effect relative to tf. 12
  • 13. TF-IDF Weighting • A typical combined term importance indicator is tf-idf weighting: wij = tfij idfi = tfij log2 (N/ dfi) • A term occurring frequently in the document but rarely in the rest of the collection is given high weight. • Many other ways of determining term weights have been proposed. • Experimentally, tf-idf has been found to work well. 13
  • 14. Computing TF-IDF -- An Example Given a document containing terms with given frequencies: A(3), B(2), C(1) Assume collection contains 10,000 documents and document frequencies of these terms are: A(50), B(1300), C(250) Then: A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6 B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0 C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8 14
  • 15. Query Vector • Query vector is typically treated as a document and also tf-idf weighted. • Alternative is for the user to supply weights for the given query terms. 15
  • 16. Similarity Measure • A similarity measure is a function that computes the degree of similarity between two vectors. • Using a similarity measure between the query and each document: – It is possible to rank the retrieved documents in the order of presumed relevance. – It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled. 16
  • 17. Similarity Measure - Inner Product • Similarity between vectors for the document di and query q can be computed as the vector inner product (a.k.a. dot product): t sim(dj,q) = dj•q = ∑w w i =1 ij iq where wij is the weight of term i in document j and wiq is the weight of term i in the query • For binary vectors, the inner product is the number of matched query terms in the document (size of intersection). • For weighted term vectors, it is the sum of the products of the weights of the matched terms. 17
  • 18. Properties of Inner Product • The inner product is unbounded. • Favors long documents with a large number of unique terms. • Measures how many terms matched but not how many terms are not matched. 18
  • 19. Inner Product -- Examples Binary: t e e n t io n ur er m a al se ct t ge rm ev aba hite pu t a o i etr dat arc com tex man inf r – D = 1, 1, 1, 0, 1, 1, 0 – Q = 1, 0 , 1, 0, 0, 1, 1 Size of vector = size of vocabulary = 7 0 means corresponding term not found in document or query sim(D, Q) = 3 Weighted: D1 = 2T1 + 3T2 + 5T3 Q = 0T1 + 0T2 + 2T3 D2 = 3T1 + 7T2 + 1T3 sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10 sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2 19
  • 20. Cosine Similarity Measure • Cosine similarity measures the cosine of the angle between two vectors. • Inner product normalized by the vector lengths.   ( dj q = ∑wij ⋅wiq )   CosSim(dj, q) = dj ⋅ q ∑ ij ⋅∑ iq w w t3 θ1 D1 t • i= 1 t i= 1 2 t 2 θ2 Q t1 i= 1 t2 D2 D1 = 2T1 + 3T2 + 5T3 CosSim(D1 , Q) = 10 / √(4+9+25)(0+0+4) = 0.81 D2 = 3T1 + 7T2 + 1T3 CosSim(D2 , Q) = 2 / √(9+49+1)(0+0+4) = 0.13 Q = 0T1 + 0T2 + 2T3 D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product. 20
  • 21. Comments on Vector Space Models • Simple, mathematically based approach. • Considers both local (tf) and global (idf) word occurrence frequencies. • Provides partial matching and ranked results. • Tends to work quite well in practice despite obvious weaknesses. • Allows efficient implementation for large document collections. 21