Outline
1 Information Retrieval: A Primer
Bag of Words Representation
Vector space model
Language Model
Pseudo-Relevance Feedback and Query Expansion
IR Evaluation
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Boolean model
Keywords combined using and, or, (and) not
e.g. (medicine or treatment) and (hypertension or “high blood pressure”)
Efficient and easy to implement (list merging)
and ≡ intersection
or ≡ union
Drawbacks
or — one match as good as many
and — one miss as bad as all
no ranking
queries may be difficult to formulate
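As a concrete illustration of the list-merging point above, here is a minimal sketch (the posting lists and docIDs are hypothetical, purely for illustration):

```python
def intersect(p1, p2):
    """AND: walk two sorted posting lists, keep docIDs present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: merge two sorted posting lists, keeping every docID once."""
    return sorted(set(p1) | set(p2))

# (medicine OR treatment) AND (hypertension OR "high blood pressure")
postings = {                      # hypothetical index
    "medicine": [1, 4, 7, 9],
    "treatment": [2, 4, 9, 12],
    "hypertension": [4, 5, 9],
    "high blood pressure": [3, 9],
}
left = union(postings["medicine"], postings["treatment"])
right = union(postings["hypertension"], postings["high blood pressure"])
print(intersect(left, right))     # -> [4, 9]
```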
Outline
1 Information Retrieval: A Primer
Bag of Words Representation
Vector space model
Language Model
Pseudo-Relevance Feedback and Query Expansion
IR Evaluation
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Vector space model
Any text item (“document”) is represented as list of terms and
associated weights.
D = (⟨t_1, w_1⟩, …, ⟨t_n, w_n⟩)
Term = keywords or content-descriptors
Weight = measure of the importance of a term in representing the
information contained in the document
Term weights
Term frequency (tf): repeated words are strongly related to content
Inverse document frequency (idf): uncommon term is more important
Example: medicine vs. antibiotic
Normalization by document length
long docs. contain many distinct words.
long docs. contain same word many times.
term-weights for long documents should be reduced.
use # bytes, # distinct words, Euclidean length, etc.
Weight = tf × idf / normalization (a small weighting sketch follows below)
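A minimal sketch of such a tf × idf / normalization weight (illustrative only; real systems use many variants of each component):

```python
import math

def tf_idf_weight(tf, df, N, doc_len, avg_doc_len):
    """Illustrative tf-idf weight with simple length normalization.
    tf: term frequency in the document, df: document frequency of the term,
    N: number of documents, doc_len/avg_doc_len: document length statistics."""
    tf_part = 1.0 + math.log(tf) if tf > 0 else 0.0   # dampened term frequency
    idf_part = math.log(N / df)                        # rarer terms weigh more
    norm = doc_len / avg_doc_len                       # discount long documents
    return tf_part * idf_part / norm

print(tf_idf_weight(tf=3, df=100, N=100000, doc_len=1200, avg_doc_len=800))
```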
Term weights: commonly used weighting schemes
Pivoted normalization [Singhal et al., SIGIR 96]:

w(t, d) = \frac{\frac{1 + \log(tf)}{1 + \log(\text{average } tf)} \times \log\frac{N}{df}}{(1.0 - slope) \times pivot + slope \times \#\text{unique terms}}

BM25 (probabilistic model) [Robertson and Zaragoza, FTIR 2009]:

w(t, d) = \frac{tf \times \log\frac{N - df + 0.5}{df + 0.5}}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf}

(A BM25 sketch follows below.)
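A minimal sketch of BM25 scoring following the formula above (the k1 and b values are common defaults, not prescribed by the slide; all statistics are hypothetical):

```python
import math

def bm25_weight(tf, df, N, dl, avdl, k1=1.2, b=0.75):
    """BM25 term weight: tf saturation, idf, and document-length normalization."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    norm = k1 * ((1 - b) + b * dl / avdl)
    return tf * idf / (norm + tf)

def bm25_score(query_terms, doc_tf, df, N, dl, avdl):
    """Score a document: sum BM25 weights over the query terms it contains."""
    return sum(bm25_weight(doc_tf[t], df[t], N, dl, avdl)
               for t in query_terms if t in doc_tf)

print(bm25_score(["space", "satellite"],
                 doc_tf={"space": 3, "satellite": 1},
                 df={"space": 12000, "satellite": 900},
                 N=500000, dl=430, avdl=350))
```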
Outline
1 Information Retrieval: A Primer
Bag of Words Representation
Vector space model
Language Model
Pseudo-Relevance Feedback and Query Expansion
IR Evaluation
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Converting posteriors to priors during indexing
P(D|Q) is not known during indexing. Why?
Terms are known (from the document content).
Estimate the sampling probability of each term from a document.
P(t|D) = \frac{tf(t, D)}{\sum_{t' \in D} tf(t', D)}   (Maximum Likelihood)

P(D|Q) \propto P(Q|D) \cdot P(D)

P(D|Q) \propto P(D) \prod_{q \in Q} P(q|D)
Smoothing
Problem with Maximum likelihood?
If a query term is absent in a document, the probability becomes zero.
Too harsh.
Consider a complementary event that a term can be generated from
the collection.
Throw a coin. If ‘Heads’ then sample the term t from document D
else if ‘Tails’ sample t from collection.
Let λ be the probability of ‘Heads’. Probability of ‘Tails’ is (1 − λ).
P(t|D) = \lambda \frac{tf(t, D)}{\sum_{t' \in D} tf(t', D)} + (1 - \lambda) P(t|coll)

P(t|coll) = \frac{cf(t)}{\sum_{t' \in V} cf(t')}

(A smoothed query-likelihood sketch follows below.)
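A minimal sketch of Jelinek-Mercer smoothed query likelihood built from the two estimates above (toy counts, illustrative only):

```python
import math

def p_term_given_doc(t, doc_tf, coll_cf, coll_size, lam=0.7):
    """Jelinek-Mercer smoothing: mix the document MLE with the collection model."""
    doc_len = sum(doc_tf.values())
    p_doc = doc_tf.get(t, 0) / doc_len if doc_len else 0.0
    p_coll = coll_cf.get(t, 0) / coll_size
    return lam * p_doc + (1 - lam) * p_coll

def query_log_likelihood(query, doc_tf, coll_cf, coll_size, lam=0.7):
    """log P(Q|D) = sum over query terms of log P(q|D); rank documents by this."""
    return sum(math.log(p_term_given_doc(q, doc_tf, coll_cf, coll_size, lam))
               for q in query)

doc = {"space": 2, "satellite": 1, "launch": 1}
coll = {"space": 5000, "satellite": 800, "launch": 1200, "budget": 3000}
print(query_log_likelihood(["space", "satellite"], doc, coll, coll_size=10_000))
```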
Outline
1 Information Retrieval: A Primer
Bag of Words Representation
Vector space model
Language Model
Pseudo-Relevance Feedback and Query Expansion
IR Evaluation
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Motivation
Searching depends on matching keywords between user-query and
document.
Searchers and document creators may use different keywords to
denote same “concept”.
Vocabulary mismatch =⇒ poor retrieval quality.
Users may not know what they are looking for, but they’ll know when
they see it.
Effective query formulation may be difficult; simplify the problem
through iteration.
Boost recall: “find me similar documents...”
Solution: expand the query by adding related words/phrases.
Issues:
select which terms to add to query
calculate weights for added terms
Relevance Feedback: Basic Idea
User issues a query.
Search engine returns a set of documents.
User marks some documents as relevant, some as non-relevant.
Search engine uses this feedback information, together with collection
statistics to formulate a better representation of the information need.
Search engine performs retrieval with the reformulated query.
The engine returns a new set of documents, with (hopefully) better recall.
Can iterate this: several rounds of relevance feedback.
Relevance Feedback: Example 2
TREC 2 Topic - 113 : New Space Satellite Applications
Rank  Score  Document title (✓ = marked relevant by the user)
✓ 1   0.539  NASA Hasn't Scrapped Imaging Spectrometer
✓ 2   0.533  NASA Scratches Environment Gear From Satellite Plan
  3   0.528  Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4   0.526  A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5   0.525  Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6   0.524  Report Provides Support for the Critics of Using Big Satellites to Study Climate
  7   0.516  Arianespace Receives Satellite Launch Pact From Telesat Canada
✓ 8   0.509  Telecommunications Tale of Two Companies
Relevance Feedback: Example 2
Initial query: new space satellite applications
Expanded query after relevance feedback
Weight   Term          Weight   Term
 2.074   new           15.106   space
30.816   satellite      5.660   application
 5.991   nasa           5.196   eos
 4.196   launch         3.972   aster
 3.516   instrument     3.446   arianespace
 3.004   bundespost     2.806   ss
 2.790   rocket         2.053   scientist
 2.003   broadcast      1.172   earth
 0.836   oil            0.646   measure
Relevance Feedback: Example 2
TREC 2 Topic - 113 : New Space Satellite Applications
Old rank  Rank  Score  Document title
   2        1   0.513  NASA Scratches Environment Gear From Satellite Plan
   1        2   0.500  NASA Hasn't Scrapped Imaging Spectrometer
            3   0.493  When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
            4   0.493  NASA Uses 'Warm' Superconductors For Fast Circuit
   8        5   0.492  Telecommunications Tale of Two Companies
            6   0.491  Soviets May Adapt Parts of SS-20 Missile For Commercial Use
            7   0.490  Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
            8   0.490  Rescue of Satellite By Space Agency To Cost $90 Million
Key concept for relevance feedback: Centroid
The centroid is the center of mass of a set of points.
Recall that we represent documents as points in a high-dimensional
space.
Thus: we can compute centroids of documents.
The centroid of a set of documents D:
\vec{\mu}(D) = \frac{1}{|D|} \sum_{d \in D} \vec{d}
Rocchio’s Algorithm for Relevance Feedback
Standard algorithm used for relevance feedback (SMART, ’70s).
Integrates a measure of relevance feedback into Vector Space Model.
Maximize the difference between average similarities for relevant and
non-relevant documents.
Rocchio’s Algorithm for Relevance Feedback
Chooses the optimal query ⃗Qopt that maximizes:
\vec{Q}_{opt} = \arg\max_{\vec{Q}} \left[ sim(\vec{Q}, \mu(D_R)) - sim(\vec{Q}, \mu(D_{NR})) \right]

D_R: set of relevant documents
D_{NR}: set of non-relevant documents
\mu(D_X): centroid of a set of documents X

That is, find the vector \vec{Q}_{opt} that maximally separates the relevant from the non-relevant documents.
Rocchio’s Algorithm: Parameter Settings
\vec{Q}_m = \alpha \vec{Q}_0 + \beta \frac{1}{|D_R|} \sum_{\vec{d}_i \in D_R} \vec{d}_i - \gamma \frac{1}{|D_{NR}|} \sum_{\vec{d}_j \in D_{NR}} \vec{d}_j   (the new, reformulated query)

Set \beta, \gamma higher if many judged documents are available.
Positive feedback is more valuable than negative feedback.
Set negative term weights to 0 (a negative weight does not make sense in the vector space model); a sketch of the update follows below.
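A minimal sketch of the Rocchio update over term-weight vectors stored as dictionaries (the α, β, γ values are common textbook settings, not prescribed here; documents and weights are illustrative):

```python
from collections import defaultdict

def centroid(docs):
    """Centroid of a set of documents given as {term: weight} dictionaries."""
    c = defaultdict(float)
    if not docs:
        return c
    for d in docs:
        for term, w in d.items():
            c[term] += w / len(docs)
    return c

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Q_m = alpha*Q0 + beta*centroid(rel) - gamma*centroid(nonrel); clip negatives to 0."""
    new_q = defaultdict(float)
    for term, w in query.items():
        new_q[term] += alpha * w
    for term, w in centroid(rel_docs).items():
        new_q[term] += beta * w
    for term, w in centroid(nonrel_docs).items():
        new_q[term] -= gamma * w
    return {t: w for t, w in new_q.items() if w > 0}   # negative weights dropped

q = {"space": 1.0, "satellite": 1.0}
rel = [{"space": 0.5, "nasa": 0.8}, {"satellite": 0.6, "launch": 0.4}]
nonrel = [{"telecom": 0.9, "satellite": 0.2}]
print(rocchio(q, rel, nonrel))
```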
Rocchio’s Algorithm: issues
What should be regarded as NR?
Documents marked non-relevant by user?
too little information
All documents not known to be relevant?
Unseen relevant documents may affect term-weights
How should terms be selected? How many?
Rank terms by # relevant documents they occur in.
Add 50-100 terms.
Blind/Pseudo Relevance Feedback
In the absence of feedback, assume top-ranked documents are
relevant.
Obvious danger: if the initial retrieval is poor, ad hoc (blind) feedback can aggravate the problem.
Most groups at TREC found ad hoc feedback useful on average.
Suggestion: find ways to improve the initial retrieval.
Relevance based Language Model (RLM)
Assumes that both query and relevant documents are sampled from a
latent relevance model R.
In absence of training data, top ranked documents considered as set
of (known) relevant documents.
The query serves as the only absolute evidence about the relevance
model.
The task is to find (estimate) the density function for R.
Estimates P(w|R) on the basis of P(w|Q).
[Lavrenko and Croft, SIGIR 2001]
Independent and Identically Distributed (IID) Sampling
Given a query Q = {q_1, . . . , q_k} and M, a set of initially retrieved documents.
Assumption
The query words Q = {q_1, . . . , q_k} and the word w are sampled independently and identically from the (pseudo-)relevant documents.
Pseudo-relevant documents =⇒ top ranked documents
RM3
Drawback of RLM
Does not consider the original query terms for estimation of the density
function
RM3
A mixture model of RLM and query likelihood model
P'(w|R) = \mu P(w|R) + (1 - \mu) P(w|Q)

P(w|R) = \sum_{D \in M} P(w|D) \prod_{q \in Q} P(q|D)

P(w|Q) = \frac{tf(w, Q)}{|Q|}

(An estimation sketch follows below.)
[Jaleel et al., UMass at TREC 2004: Novelty and HARD]
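A minimal sketch of the RM3 estimation under the formulas above (the feedback document models P(w|D) are assumed to be already smoothed, e.g. with Jelinek-Mercer; all numbers are illustrative):

```python
from collections import defaultdict

def rm3_weights(query, feedback_doc_models, mu=0.6):
    """RM3: mix the relevance model P(w|R) estimated from the top-ranked
    (feedback) documents with the maximum-likelihood query model P(w|Q)."""
    # P(w|R) = sum_D P(w|D) * prod_q P(q|D)
    p_w_r = defaultdict(float)
    for doc_model in feedback_doc_models:          # doc_model: {term: P(term|D)}
        query_lik = 1.0
        for q in query:
            query_lik *= doc_model.get(q, 1e-9)    # tiny floor for unseen query terms
        for w, p in doc_model.items():
            p_w_r[w] += p * query_lik
    z = sum(p_w_r.values()) or 1.0                 # normalize P(w|R) to a distribution
    p_w_r = {w: p / z for w, p in p_w_r.items()}
    # P(w|Q) = tf(w, Q) / |Q|
    p_w_q = {w: query.count(w) / len(query) for w in set(query)}
    # P'(w|R) = mu * P(w|R) + (1 - mu) * P(w|Q)
    terms = set(p_w_r) | set(p_w_q)
    return {w: mu * p_w_r.get(w, 0.0) + (1 - mu) * p_w_q.get(w, 0.0) for w in terms}

fb = [{"space": 0.20, "satellite": 0.10, "nasa": 0.05},
      {"space": 0.15, "launch": 0.08, "satellite": 0.12}]
print(sorted(rm3_weights(["space", "satellite"], fb).items(),
             key=lambda kv: -kv[1])[:5])
```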
Outline
1 Information Retrieval: A Primer
Bag of Words Representation
Vector space model
Language Model
Pseudo-Relevance Feedback and Query Expansion
IR Evaluation
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Cranfield method (Cleverdon et al., 1960s)
Benchmark data
Document collection
Query / topic collection
Relevance judgments - information about which document is relevant
to which query
(Analogy: syllabus, question paper, correct answers.)
Assumptions
relevance of a document to a query is objectively discernible
all relevant documents contribute equally to the performance
measures
relevance of a document is independent of the relevance of other
documents
Metrics for ranked results
(Non-interpolated) average precision
Rank  Type          Recall  Precision
1     Relevant      0.2     1.00
2     Non-relevant
3     Relevant      0.4     0.67
4     Non-relevant
5     Non-relevant
6     Relevant      0.6     0.50
∞     Relevant      0.8     0.00
∞     Relevant      1.0     0.00

AvgP = \frac{1}{5}\left(1 + \frac{2}{3} + \frac{3}{6}\right)   (5 relevant docs in all; relevant documents that are never retrieved contribute 0)

AvgP = \frac{1}{N_{Rel}} \sum_{d_i \in Rel} \frac{i}{Rank(d_i)}
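A minimal sketch of non-interpolated average precision over a ranked list (1 marks a relevant document; relevant documents that are never retrieved simply contribute nothing):

```python
def average_precision(ranked_relevance, total_relevant):
    """ranked_relevance: list of 0/1 flags in rank order; total_relevant: N_Rel."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank       # i / Rank(d_i)
    return precision_sum / total_relevant

# the example above: relevant docs at ranks 1, 3 and 6; 5 relevant docs in all
print(average_precision([1, 0, 1, 0, 0, 1], total_relevant=5))   # (1 + 2/3 + 3/6) / 5
```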
Metrics for ranked results
Interpolated average precision at a given recall point
Recall points correspond to multiples of \frac{1}{N_{Rel}}.
N_{Rel} is different for different queries.
[Figure: precision-recall curves for Q1 (3 rel. docs) and Q2 (4 rel. docs)]
Interpolation is required to compute averages across queries.
Metrics for sub-document retrieval
Let pr - document part retrieved at rank r
rsize(pr) - amount of relevant text contained by pr
size(pr) - total number of characters contained by pr
Trel - total amount of relevant text for a given topic
P[r] = \frac{\sum_{i=1}^{r} rsize(p_i)}{\sum_{i=1}^{r} size(p_i)}

R[r] = \frac{1}{Trel} \sum_{i=1}^{r} rsize(p_i)
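A minimal sketch of these character-based precision/recall values at each rank r (the rsize/size character counts are hypothetical):

```python
def subdoc_precision_recall(parts, t_rel):
    """parts: list of (rsize, size) per retrieved document part, in rank order.
    Returns (P[r], R[r]) for every rank r."""
    rel_sum = size_sum = 0
    scores = []
    for rsize, size in parts:
        rel_sum += rsize
        size_sum += size
        scores.append((rel_sum / size_sum, rel_sum / t_rel))
    return scores

# three retrieved parts with hypothetical relevant-text / total-text character counts
print(subdoc_precision_recall([(120, 300), (0, 250), (80, 100)], t_rel=400))
```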
Metrics for ranked results
Precision at k (P@k) - precision after k documents have been
retrieved
easy to interpret
not very stable / discriminatory
does not average well
R precision - precision after NRel documents have been retrieved
TREC data
Documents
Genres:
news (AP, LA Times, WSJ, SJMN, Financial Times, FBIS)
govt. documents (Federal Register, Congressional Records)
technical articles (Ziff Davis, DOE abstracts)
Size: 0.8 million documents – 1.7 million web pages
(cf. Google indexes several billion pages)
Topics
title
description
narrative
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
What is Word Embedding?
Represent every word as a vector in some abstract space.
What are the characteristics of this space?
Two terms t1 and t2 are close if and only if they share similar contexts.
Paris is close to France. Why?
If Paris is close to France, then Berlin will be close to Germany. Why?
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
Word2vec: Working principle
Word2vec: Properties and Limitations
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
Word2vec
Architecture:
Input: One-hot vector representation of words.
Context: Concatenated one-hot vectors of neighboring words.
Output: Softmax of dimension D (the desired dimensionality of the
vector).
Connecting the cells
Current output is a function of current input and previous outputs.
y(w_n) = f(y(w_{n-1}), c(w_n))    (1)

Objective Function (to maximize):

J(W) = \sum_{(w_t, c_t) \in D^+} \sum_{c \in c_t} p(D = 1 \mid \vec{w}_t, \vec{c}) - \sum_{(w_t, c'_t) \in D^-} \sum_{c \in c'_t} p(D = 1 \mid \vec{w}_t, \vec{c})

where D^+ contains observed (word, context) pairs and D^- negatively sampled (noise) pairs.
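In practice this objective is rarely implemented by hand; a minimal usage sketch with the gensim library (an assumption of this write-up, not something named on the slide; gensim ≥ 4 API) trains skip-gram vectors with negative sampling:

```python
from gensim.models import Word2Vec

# toy corpus: one tokenized sentence per list
sentences = [
    ["nasa", "launches", "new", "space", "satellite"],
    ["the", "satellite", "monitors", "climate", "from", "orbit"],
    ["budget", "cuts", "delay", "the", "nasa", "satellite", "plan"],
]

# skip-gram (sg=1) with negative sampling approximates the objective above
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, negative=5)

print(model.wv["satellite"][:5])                  # 50-dimensional word vector
print(model.wv.most_similar("satellite", topn=3))
```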
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
Word2vec: Working principle
Word2vec: Properties and Limitations
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
FastText - Learning word vectors from character sequences
Word2vec Limitations
1 For less frequent terms, vector representations would be noisy. Why?
2 For morphologically complex languages, the word vector representation
would also be noisy. Why?
Solution: FastText
1 Learn vector representations of character n-grams.
2 Obtain vector representation of a whole word by summing the vectors
of the n-grams.
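A minimal sketch of the FastText idea of composing a word vector from character n-gram vectors (the n-gram vectors here are random stand-ins; real FastText learns them and hashes n-grams into buckets):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word, with boundary markers as in FastText."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word, ngram_vectors, dim=50):
    """Sum the vectors of the word's character n-grams (zero vector if none known)."""
    vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

# hypothetical "learned" n-gram vectors
rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=50) for g in char_ngrams("retrieval")}
print(word_vector("retrieval", ngram_vectors)[:5])   # a vector even for a rare word
```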
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
Problems with the Term Independence Assumption
Word vectors to address term dependence
Multi-modal
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Term Dependence: Initial Retrieval
Term association: Has been an intriguing problem in IR.
Vocabulary mismatch: different terms may be used in two documents that are about the same topic, e.g. "atomic" and "nuclear".
Terms used in query are different from those in its relevant
documents.
Standard retrieval models assume term independence.
Term Dependence: Relevance Feedback
Uses statistical co-occurrence of words in top ranked docs with query
terms.
No way to take into account multi-word ‘concepts’, e.g. relating
‘osteoporosis’ to ‘bone disease’ beyond pre-defined phrases.
Noisy expansion terms can lead to ‘query drift’ and hence degraded
IR effectiveness after RF.
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
Problems with the Term Independence Assumption
Word vectors to address term dependence
Multi-modal
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Word Embedding for Relevance Feedback (RF)
Semantic similarity captured by distance measure between word
vectors.
Integrate semantic similarity with statistical co-occurrences between
terms for RF.
Exploit term compositionality to extract meaningful concepts to use
in RF.
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
Problems with the Term Independence Assumption
Word vectors to address term dependence
Multi-modal
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Word Embeddings for Multi-modal IR
Multi-modality: a document composed of text, images, speech and video, e.g. a typical Wikipedia page.
Given a unimodal (e.g. text/image) query or more generally a
multi-modal query, how can one retrieve relevant multi-modal
documents?
Given a Wiki page as a query, what are the other relevant documents?
Word Embeddings for Multi-modal IR
Index the different modalities separately. Compute similarities
individually and fuse.
Problems: Different retrieval strategies. How to combine the scores?
Vector Embedding Approach: Joint embedding of categorical data,
such as text, and continuous data such as image features into vectors
of reals.
What we need: A similarity function between sets of vectors.
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
Term Transformation Events
If a term t is observed in a query corresponding to a document d, it may have occurred in three possible ways:
Direct term sampling: Standard LM term sampling
Transformation via Document Sampling: Sampling a term t′ from
d which is then transformed to t by a noisy channel
Transformation via Collection Sampling: Sampling the term t′
from the collection which is then transformed to t by the noisy
channel
Term Transformation Events
[Figure: Schematic of generating a query term t in the Generalized Language Model (GLM) as a noisy channel over the document d and the collection C; GLM degenerates to LM when α = β = 0.]
λ: Standard Jelinek Mercer weighting parameter
α: Probability of sampling term t via a transformation through a term
t′ sampled from the document d
β: Probability of sampling term t via a transformation through a term
t′ sampled from collection
Generalized Language Model : Discussion
GLM consistently and significantly outperforms both LM and LDA-smoothed LM for all the query sets.
LDA achieves higher recall for TREC-6 and Robust, but it does not significantly increase MAP, indicating no improvement in precision at top ranks.
For GLM, an increase in recall is always associated with a significant increase in MAP as well, indicating that precision at top ranks remains relatively stable compared to LDA.
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
Motivation of this work
The vocabulary mismatch problem is a hindrance to effective search.
Standard IR approach: Use relevance feedback along with statistical
co-occurrence of words in top ranked docs with query terms.
Semantic similarity can be captured by distances between word vectors
Research Challenges:
Develop a formal framework to use word embeddings in IR feedback
models
Utilize semantic similarity in conjunction with statistical co-occurrences
in top docs
Relevance based Language Model (RLM)
Assumes that both query and relevant documents are sampled from a
latent relevance model R
In absence of training data, top ranked documents considered as set
of (known) relevant documents
The query serves as the only absolute evidence about the relevance
model
The task is to find (estimate) the density function for R
Estimates P(w|R) on the basis of P(w|Q)
[Lavrenko et al., SIGIR 2001]
RM3
Drawback of RLM
Does not consider the original query terms for estimation of the density
function
RM3
A mixture model of RLM and query likelihood model
P'(w|R) = \mu P(w|R) + (1 - \mu) P(w|Q)

P(w|R) = \sum_{D \in M} P(w|D) \prod_{q \in Q} P(q|D)

P(w|Q) = \frac{tf(w, Q)}{|Q|}
[Jaleel et al., UMass at TREC 2004: Novelty and HARD]
Kernel Density Estimation (KDE)
Estimate a distribution that generates the given data (data points).
Place a Gaussian centered around each observed data point.
Combine the Gaussians to get a function peaked at the observed data points.
f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)
One dimensional KDE
f(w, \alpha) = \frac{1}{kh} \sum_{i=1}^{k} \alpha_i \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(w - q_i)^2}{2\sigma^2 h^2}\right)

The vector embedding of the words of the collection vocabulary makes it possible to define and compute (w - q_i) as a distance measure, e.g. the squared Euclidean distance between the word vectors of w and q_i:

(w - q_i)^2 = (\vec{w} - \vec{q}_i)^T (\vec{w} - \vec{q}_i)

(A sketch follows below.)
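A minimal numpy sketch of this estimate: score a candidate expansion word by Gaussian kernels centered on the embedded query terms (the α_i, σ, h values and the embeddings are illustrative):

```python
import numpy as np

def kde_score(word_vec, query_vecs, alphas, sigma=1.0, h=1.0):
    """f(w) = (1/kh) * sum_i alpha_i * Gaussian(||w - q_i||^2; sigma^2 h^2)."""
    k = len(query_vecs)
    total = 0.0
    for q_vec, alpha in zip(query_vecs, alphas):
        d2 = float(np.dot(word_vec - q_vec, word_vec - q_vec))   # squared Euclidean
        total += alpha * np.exp(-d2 / (2 * sigma**2 * h**2)) / (np.sqrt(2 * np.pi) * sigma)
    return total / (k * h)

# hypothetical embeddings: rank candidate expansion terms against two query terms
rng = np.random.default_rng(1)
emb = {w: rng.normal(size=50) for w in ["space", "satellite", "orbit", "telecom"]}
q_vecs = [emb["space"], emb["satellite"]]
for cand in ["orbit", "telecom"]:
    print(cand, kde_score(emb[cand], q_vecs, alphas=[0.5, 0.5]))
```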
Two dimensional KDE
In 1D-KDE, the set of all top-ranked documents is considered as a single document model (M).
We now generalize the 1D-KDE to two dimensions so as to weight
the contributions obtained from different documents in the ranked list
separately.
Two dimensional KDE
f(x, \alpha) = \sum_{i=1}^{k} \sum_{j=1}^{M} \frac{P(w|D_j) P(q_i|D_j)}{2\pi\sigma^2} \exp\left(-\frac{(\vec{w} - \vec{q}_i)^2 + (P(w|D_j) - P(q_i|D_j))^2}{2\sigma^2 h^2}\right)

\alpha_{ij} = P(w|D_j) P(q_i|D_j)

A word w in document D_j, denoted x, will get a high value of the density function if:
(i) \vec{w} is close to the query term vectors \vec{q}_i
(ii) w frequently co-occurs with the query terms in each top-ranked document D_j
Summary
Given
Corpus: set of documents
Query Q: q_1, . . . , q_k
First-level retrieval is performed using LM.
E_T = all terms of the top-ranked M docs, considered as potential expansion terms.
Calculate KDE(w) for each w ∈ E_T.
Terms with a high estimated value are taken as expansion terms.
Linearly interpolate the derived density function with the underlying query model.
Evaluation : TREC-678, Robust
Dataset   Method   M   N   MAP       Recall    P@5
TREC 6    RM3      20  70  0.2634    0.5368    0.4360
          1d KDE   10  80  0.2818†   0.5757†   0.4680†
          2d KDE   10  80  0.2850†   0.5724†   0.4600†
TREC 7    RM3      20  70  0.2151    0.5432    0.4160
          1d KDE   10  80  0.2351†   0.6001†   0.4425†
          2d KDE   10  80  0.2380    0.6108†‡  0.4400
TREC 8    RM3      20  70  0.2701    0.6410    0.4760
          1d KDE   10  80  0.2746    0.6749†   0.4888
          2d KDE   10  80  0.2957†‡  0.6887†   0.5120†‡
TREC Rb   RM3      20  70  0.3304    0.8559    0.4949
          1d KDE   10  80  0.3228    0.8725    0.4929
          2d KDE   10  80  0.3456†‡  0.8772†‡  0.5152†‡

Table: † and ‡ denote significance with respect to RLM and 1d KDE respectively.
Evaluation : WT10G
Dataset   Method   M   N   MAP       Recall   P@5
TREC 9    RM3      20  70  0.1930    0.6755   0.3233
          1d KDE   10  80  0.1984    0.6851   0.3360
          2d KDE   10  80  0.2145†‡  0.6878   0.3562†‡
TREC 10   RM3      20  70  0.1759    0.7386   0.3347
          1d KDE   10  80  0.2192†   0.7499   0.4004†
          2d KDE   10  80  0.2213†   0.7502   0.4204†

Table: † and ‡ denote significance with respect to RLM and 1d KDE respectively.
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Motivation
Multi-modal and cross-modal IR
7 Task specific word vector learning
Documents as term vectors
Terms as dimensions of a document vector (forms an inner product
space).
Inner product d.q gives the similarity between document and query.
[Figure: Sample three-term space]
Documents as sets of word embedded vectors
Each document: a set of real-valued vectors in p dimensions, D = \{x_i\}_{i=1}^{|D|}, x_i \in \mathbb{R}^p.
Need: generalized distance (inverse similarity) measures d(X, Y), where X, Y are sets of vectors, which satisfy d(X, X) = 0, d(X, Y) = d(Y, X) and d(X, Y) + d(Y, Z) \ge d(X, Z).
Documents as sets of word embedded vectors
Two distance metrics investigated:
Average inter-distance: d(X, Y) = \frac{1}{|X||Y|} \sum_{x \in X} \sum_{y \in Y} d(x, y), where d(x, y) is the L2 (Euclidean) distance between vectors x and y.

Hausdorff distance: d(X, Y) = \max\left( \max_{x \in X} \min_{y \in Y} d(x, y),\ \max_{y \in Y} \min_{x \in X} d(x, y) \right)

(A sketch of both follows below.)
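A minimal numpy sketch of the two set distances over documents represented as arrays of word vectors (dimensions and set sizes are illustrative):

```python
import numpy as np

def pairwise_l2(X, Y):
    """|X| x |Y| matrix of Euclidean distances between two sets of vectors."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def avg_inter_distance(X, Y):
    """Average of all pairwise distances between the two sets."""
    return pairwise_l2(X, Y).mean()

def hausdorff_distance(X, Y):
    """max( max_x min_y d(x,y), max_y min_x d(x,y) )."""
    d = pairwise_l2(X, Y)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

rng = np.random.default_rng(2)
doc = rng.normal(size=(40, 50))    # 40 word vectors of a document
qry = rng.normal(size=(3, 50))     # 3 word vectors of a query
print(avg_inter_distance(doc, qry), hausdorff_distance(doc, qry))
```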
Method Details
A document is treated as a mixture model of Gaussians of the
observed constituent words.
A query is treated as the observed points drawn from the underlying
mixture distribution of a document.
The query likelihood is then given by the probability of sampling the
observed query points from the mixture distribution.
sim(q, d) = \frac{1}{K|q|} \sum_{i} \sum_{k} q_i \cdot \mu_k    (2)

This is combined with the text-based query likelihood (language-model based) to obtain the final query likelihood:

P(d|q) = \alpha P_{LM}(d|q) + (1 - \alpha) P_{WVEC}(d|q)    (3)
Practical Considerations for Implementation
Individually estimating the Gaussian mixture model for each
document is time consuming, and slows the indexing process.
Solution: Cluster the entire vocabulary with an EM based clustering
algorithm such as K-means.
Each term is thus mapped to a cluster id.
Induce the per-document clusters by grouping together words in a
document with the same cluster id and find the centre of each group
Ck.
\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x, \quad C_k = \{x_i : c(w_i) = k\},\ i = 1, \ldots, |d|    (4)
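A minimal sketch of this scheme with scikit-learn: cluster the vocabulary once with K-means, then induce per-document centroids (eq. 4) by grouping a document's word vectors by their global cluster id (K, the vocabulary and the embeddings are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
vocab = ["space", "satellite", "orbit", "telecom", "budget", "nasa"]
emb = {w: rng.normal(size=50) for w in vocab}        # hypothetical word vectors

# one global clustering of the vocabulary (here K = 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(np.array([emb[w] for w in vocab]))
cluster_id = {w: int(c) for w, c in zip(vocab, kmeans.labels_)}

def doc_centroids(doc_words):
    """Group a document's word vectors by cluster id and average each group."""
    groups = {}
    for w in doc_words:
        groups.setdefault(cluster_id[w], []).append(emb[w])
    return {k: np.mean(v, axis=0) for k, v in groups.items()}

print({k: v[:3] for k, v in doc_centroids(["nasa", "space", "satellite", "budget"]).items()})
```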
Observations
Results with word vector based similarities outperform pure text based
ones.
K = 100 produces best results for the TREC 8 and the TREC Robust
topic sets.
Show consistent improvements in both recall and precision at top
ranks.
Very fine-grained representation of documents (each constituent word
as its own cluster) is not optimal.
Somewhat surprisingly, K = 1, i.e., each document represented by a single point (the average of all its word vectors), produces results close to those for K = 100.
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
Motivation
Multi-modal and cross-modal IR
7 Task specific word vector learning
Embedded Vector based Multi-modal IR
Use joint embeddings of text and other data types (e.g. images) to automatically augment text documents with semantically related 'vectors'.
Example: For a given text document, enhance its representative
content (for the purpose of more effective search) by augmenting
relevant images from the Wikimedia (Wikipedia image collection).
Embedded Vector based Cross-modal and Cross-lingual IR
Joint embeddings of vectors can be used for cross-lingual search.
Word embeddings for different languages can be aligned using a parallel corpus.
Document-query similarity can then be measured in this embedded vector space.
Joint embeddings can also be used for addressing cross-modal
information access, e.g. searching for text documents with image
query, searching for speech/video with text query and so on.
Joint Embedding of Multi-modal data
Data from different modalities need to be embedded in a joint space.
Why?
DeViSE (Deep Visual Semantic Embedding)
1 Apply CNN on images. Take the vector from layer f6 or f7.
2 Apply word2vec to obtain pre-trained word vectors.
3 Learn transformation from image to word.
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
Outline
1 Information Retrieval: A Primer
2 Introduction to Word Embedding
3 An overview of Word Embedding applications for IR
4 Generalized Language Model
5 Embedding based Relevance Feedback with Kernel Density Estimate
6 Documents as sets of vectors
7 Task specific word vector learning
Microblog Retrieval
Context driven word2vec training
Given a set S of contextually similar word vector pairs,
Φ(w) = {v : (w, v) ∈ S}
Transform the vector space so that w moves close to each v ∈ Φ(w) (and away from words u ∉ Φ(w)) in the transformed space.
Transformation function:
l(\vec{w}; \theta) = \sum_{\vec{v}: v \in \Phi(w)} \sum_{\vec{u}: u \notin \Phi(w)} \max\left(0,\ m - \left((\theta\vec{w})^T \vec{v} - (\theta\vec{w})^T \vec{u}\right)\right)
Similar to the image-word training. Why?
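A minimal numpy sketch of this margin objective: for a transformation θ, the loss is paid whenever a non-context word u scores within margin m of a context word v (θ, the vectors and the sampled words are all illustrative):

```python
import numpy as np

def context_hinge_loss(theta, w_vec, context_vecs, noise_vecs, m=0.5):
    """sum over (v in context, u not in context) of
    max(0, m - ((theta w)^T v - (theta w)^T u))."""
    tw = theta @ w_vec                       # transformed word vector
    loss = 0.0
    for v in context_vecs:
        for u in noise_vecs:
            loss += max(0.0, m - (tw @ v - tw @ u))
    return loss

rng = np.random.default_rng(4)
dim = 50
theta = np.eye(dim)                                    # start from the identity transform
w = rng.normal(size=dim)
context = [rng.normal(size=dim) for _ in range(3)]     # e.g. words from similar tweets
noise = [rng.normal(size=dim) for _ in range(5)]       # words outside the context set
print(context_hinge_loss(theta, w, context, noise))
```

θ would then be learned by minimizing this loss, e.g. with gradient descent, over all training pairs.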
Context driven word2vec training
The context S depends on the application.
For Microblog retrieval:
1 Other topically similar tweets.
2 Tweets within a time period - tweets are often associated with
temporal bursts.
For Session Modeling:
1 Queries that are close in time by the same user.