Presented at Theoretical Basis of Machine Learning 2018 (TBML 2018) held at International Centre for Theoretical Sciences (ICTS) Bengaluru, 27th - 29th December, 2018.
26. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Baseline Retrieval Model
Term Transformation Events
Noisy Channel
d
C
t
λ
t′
1 − λ − α − β
α
β
Figure: A Generalized Language Model (GLM). GLM degenerates to LM when
α = β = 0.
λ: Standard Jelinek Mercer weighting parameter
α: Probability of sampling term t via a transformation through a term
t′ sampled from the document d
β: Probability of sampling term t via a transformation through a term
t′ sampled from collection
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 20 / 78
33. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Baseline Retrieval Model
Vocabulary Mismatch
Vehicle
Smoke
Pollution
Doc1
Passenger vehicles & heavy-duty trucks
are a major source of air pollution
which includes ozone, particulate matter,
and other smog-forming emissions
in the form of smoke.
Doc2
The exhaust gas of cars powered by
fossil fuels are a major source of
toxic materials in the air that causes
severe damage to the eco-system.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 25 / 78
34. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Baseline Retrieval Model
Vocabulary Mismatch
Vehicle
Smoke
Pollution
Doc1
Passenger vehicles & heavy-duty trucks
are a major source of air pollution
which includes ozone, particulate matter,
and other smog-forming emissions
in the form of smoke.
Doc2
The exhaust gas of cars powered by
fossil fuels are a major source of
toxic materials in the air that causes
severe damage to the eco-system.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 25 / 78
35. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Baseline Retrieval Model
Vocabulary Mismatch
Vehicle
Smoke
Pollution
Doc1
Passenger vehicles & heavy-duty trucks
are a major source of air pollution
which includes ozone, particulate matter,
and other smog-forming emissions
in the form of smoke.
Doc2
The exhaust gas of cars powered by
fossil fuels are a major source of
toxic materials in the air that causes
severe damage to the eco-system.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 25 / 78
39. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Relevance Feedback based Query Expansion
Pseudo-Relevance Feedback based Query Expansion
Given a query Q
Perform initial retrieval using some retrieval model
Consider top K documents as (pseudo-)relevant
Select terms from those documents as candidate expansion terms
Perform re-retrieval with the expanded query
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 29 / 78
44. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Relevance Feedback based Query Expansion
Kernel Density Estimation (KDE)
76543210-1
5
4
3
2
1
0
76543210-1
5
4
3
2
1
0
76543210-1
5
4
3
2
1
0
Estimate a distribution that generates the given data (data points).
Place kernel function (e.g. Gaussian) centered around each observed
data point.
Combine the Gaussians to get a function peaked at the observed data
point.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 33 / 78
49. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Relevance Feedback based Query Expansion
Comparing the Estimations
Relevance Model Estimation
Given set of terms from (pseudo-)relevant documents, estimate the density
function that generates R.
76543210-1
5
4
3
2
1
0
Kernel Density Estimation
Given set of n data samples, estimates the density function that generates
the observed data.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 36 / 78
60. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Relevance Feedback based Query Expansion
Two dimensional KDE
The Points: Two Dimensional KDE
Observed data points (xij) = (qi, Dj): Encapsulates word vector for
i-th query word qi and its normalized term frequency in feedback
document Dj (P(qi|Dj)).
Points at which probability (density) is to be estimated (x)
= (w, Dj): Encapsulates word vector for candidate expansion term w
and its normalized term frequency in feedback document Dj
(P(w|Dj)).
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 45 / 78
62. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Relevance Feedback based Query Expansion
Two dimensional KDE
f(x, α) =
n∑
i=1
M∑
j=1
P(w|Dj)P(qi|Dj)
2πσ2
exp(
(w − qi)2 + (P(w|Dj) − P(qi|Dj))2
−2σ2h2
)
w in document Dj, denoted as x, will get a high value of the density
function if:
(i) w is semanticlly close to query terms qi;
(ii) w frequently co-occurs with query terms in each top ranked document
Dj.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 46 / 78
66. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Improving Relevance Feedback based Query Expansion
Summary
Given:
Corpora : Set of documents.
Query Q : q1, . . . , qn.
Vector embeddings of each of the terms.
First level retrieval performed using any LM.
ET = All terms of top ranked M documents considered as potential
expansion terms.
Calculate KDE(w) for each w ∈ ET.
Terms with high estimated value taken as expansion terms.
Linearly interpolate derived density function with underlying query
model.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 50 / 78
76. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Query Performance Prediction using Word Embedding
Query Performance Prediction (QPP)
Definition
To quantify the quality of search results when no relevance feedback is
given.
For an ambiguous query, it may be difficult for a search engine to
return satisfactory results for the query.
This can lead to poor IR effectiveness.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 55 / 78
81. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Query Performance Prediction using Word Embedding
Characteristics of Difficult Query
d(Q, C) - Distance between query and collection; Reflects likelihood of generation of Q
from C.
d(R, C) - Distance between retrieved documents and collection; Reflects likelihood of
generation of documents of R from C.
d(Q, Q) - Diversity of query. Reflects the ambiguity of Q; more the diameter, more
ambiguous the query.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 58 / 78
82. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Query Performance Prediction using Word Embedding
Characteristics of Difficult Query
d(Q, C) - Distance between query and collection; Reflects likelihood of generation of Q
from C.
d(R, C) - Distance between retrieved documents and collection; Reflects likelihood of
generation of documents of R from C.
d(Q, Q) - Diversity of query. Reflects the ambiguity of Q; more the diameter, more
ambiguous the query.
d(R, R) - Distance among retrieved documents; Manifestation of ambiguity of retrieved
documents.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 58 / 78
83. .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Query Performance Prediction using Word Embedding
Characteristics of Difficult Query
d(Q, C) - Distance between query and collection; Reflects likelihood of generation of Q
from C.
d(R, C) - Distance between retrieved documents and collection; Reflects likelihood of
generation of documents of R from C.
d(Q, Q) - Diversity of query. Reflects the ambiguity of Q; more the diameter, more
ambiguous the query.
d(R, R) - Distance among retrieved documents; Manifestation of ambiguity of retrieved
documents.
d(Q, R) - Distance between query and retrieved documents R; Equivalent to distribution
of retrieval scores of R.
D. Roy (ISI, Kolkata) WE for IR December 28, 2018 58 / 78