Query Dependent Pseudo-Relevance
Feedback
based on Wikipedia
SIGIR '09
Advisor: Dr. Koh Jia-Ling
Speaker: Lin, Yi-Jhen
Date: 2010/01/24
1
Outline
• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work
2
Introduction
• Queries are often too short to convey the specific
information need clearly.
• Among query expansion methods, we prefer
pseudo-relevance feedback (PRF) because it
requires no user input.
• The problem is that the top retrieved documents
frequently contain non-relevant documents,
which introduce noise into the expanded query.
3
Introduction
• Meanwhile, resources have emerged on the web,
e.g. Wikipedia, that can potentially supplement
the initial search results in PRF.
• Wikipedia covers a great many topics, and it
reflects the diverse interests and information
needs of users.
• The basic entry in Wikipedia is an entity page,
which is an article that contains information
focusing on a single entity.
4
Introduction
• The aim of this study is to explore the possible
utility of Wikipedia as a resource for improving
IR through PRF.
• We propose three methods for expansion term
selection and study their effectiveness, each
modeling the Wikipedia-based pseudo-relevance
information from a different perspective.
• We incorporate the expansion terms into the
original query and use language modeling IR to
evaluate these methods.
5
Introduction- Query
• Using PRF on the basis of Wikipedia, we
categorize a query into one of three types:
1) EQ : queries about a specific entity
“Scalable Vector Machine”
2) AQ : ambiguous queries
“Apple”, “Pirate (http://en.wikipedia.org/wiki/Pirate_(disambiguation))”,
“Poaching (http://en.wikipedia.org/wiki/Poaching_(disambiguation))”
3) BQ : broader queries (neither EQ nor AQ)
6
Introduction- Pseudo-relevant docs
• Pseudo-relevant documents are generated in two
ways according to the query type:
1) using top ranked articles from Wikipedia
(regarded as a collection) retrieved in response to
the query
2) using the Wikipedia entity page corresponding
to the query
7
Introduction- considering term distributions
and structure
• In selecting expansion terms, term distributions
and structures of Wikipedia pages are taken into
account.
• We propose and compare a supervised method
and an unsupervised method (both field-based)
for this task.
8
Introduction
[Flow diagram] Query “Data mining” → check Wikipedia to find pseudo-relevant docs / top-ranked docs → select terms to expand query
9
Outline
• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work
10
Query Categorization
• We briefly summarize the relevant features of
Wikipedia for our study, and then examine the
different categories of queries typically posed
by users.
• Wikipedia Data Set (summarizing the relevant features)
• Query Categorization
11
Wikipedia Data Set
• An analysis of 200 randomly selected articles
describing a single topic showed that only 4%
failed to adhere to a standard format.
12
Query Categorization
• We define three types of queries according to
their relationship with Wikipedia topics:
• 1) EQ : queries about a specific entity
“Scalable Vector Machine”
• 2) AQ : ambiguous queries
“Apple”, “Pirate (http://en.wikipedia.org/wiki/Pirate_(disambiguation))”,
“Poaching (http://en.wikipedia.org/wiki/Poaching_(disambiguation))”
• 3) BQ : broader queries (neither EQ nor AQ)
13
Query Categorization
• Strategy for AQ (a disambiguation process)
• Given an ambiguous query:
• 1) use the query-likelihood language model (initial
search) to retrieve the top-ranked 100 documents
• 2) cluster these documents using K-means
• 3) rank the clusters with a cluster-based language model
• 4) the top-ranked cluster is then compared to all the
referents (entity pages) extracted from the
disambiguation page associated with the query
• 5) the top matching entity page is then chosen for
the query (a sketch of this pipeline follows below)
14
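A minimal sketch of this pipeline, assuming TF-IDF vectors and cosine similarity as stand-ins for the paper's language-model components (the retrieval step and the cluster-based ranking of Lee et al. are simplified to centroid similarity; all function and variable names here are hypothetical):

```python
# Illustrative sketch of the cluster-based disambiguation pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def disambiguate(query, top_docs, referent_pages, n_clusters=5):
    """top_docs: texts of the top 100 documents from the initial search.
    referent_pages: {title: text} of entity pages listed on the
    disambiguation page for the query."""
    vec = TfidfVectorizer(stop_words="english")
    doc_matrix = vec.fit_transform(top_docs)

    # Step 2: cluster the top-ranked documents with K-means.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(doc_matrix)

    # Step 3 (simplified): pick the best cluster. Here the cluster whose
    # centroid is closest to the query stands in for the cluster-based
    # language model ranking of Lee et al. [19].
    query_vec = vec.transform([query])
    best_cluster = cosine_similarity(query_vec, km.cluster_centers_).argmax()

    # Steps 4-5: compare the best cluster's centroid to each referent
    # page and choose the top match as the entity page for the query.
    centroid = km.cluster_centers_[[best_cluster]]
    titles = list(referent_pages)
    ref_matrix = vec.transform([referent_pages[t] for t in titles])
    scores = cosine_similarity(centroid, ref_matrix)[0]
    return titles[scores.argmax()]
```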
Query Categorization
• 3) rank the clusters with a cluster-based language
model, as proposed by Lee et al. [19]; one possible
form is sketched below
15
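The slide gives no formula; one common form of cluster-based ranking (an assumed reconstruction in the style of cluster-based language models, not a transcription of Lee et al.'s exact model) scores each cluster C by query likelihood under a smoothed cluster language model:

```latex
% Query likelihood of cluster C; P_ml is the maximum-likelihood estimate
% from the cluster's documents, smoothed with the collection model Coll.
P(Q \mid C) = \prod_{i=1}^{m}
  \left[ \beta\, P_{\mathrm{ml}}(q_i \mid C) + (1-\beta)\, P(q_i \mid \mathrm{Coll}) \right]
```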
Query Categorization
• Evaluation
• a total of 650 queries
• Each participant was asked to judge whether a query
is ambiguous or not.
• If it was, the participant determined which referent
from the disambiguation page is most likely to be
mapped to the query.
• If it was not, the participant manually searched
Wikipedia with the query to identify whether or not it
is defined by Wikipedia (EQ).
16
Query Categorization
• Evaluation
• Participants were in general agreement (87%)
when judging whether a query is ambiguous or not.
• In determining which referent a query should be
mapped to, there was only 54% agreement.
17
Query Categorization
• Evaluation : the effectiveness of the cluster-based
disambiguation process (most queries from TREC topic sets are AQ)
• For each query, we define:
• If there are at least two participants who indicate a
referent as the most likely mapping target, this
target will be used as an answer.
• If a query has no answer, it will not be counted by
the evaluation process.
• Experimental results show that our disambiguation
process leads to an accuracy of 57% for AQ.
18
Outline
• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work
19
Query Expansion Method
• #1 Relevance Model (baseline)
• A relevance model is a query expansion
approach based on the language modeling
framework.
• In the model, the query words q1, q2, …, qm and
the words w in relevant documents are sampled
identically and independently from a distribution
R (see the formula below)
20
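The slide omits the estimate itself; the standard relevance model formula (Lavrenko and Croft's RM1, which a relevance model baseline of this kind follows), with F the set of pseudo-relevant documents, is:

```latex
% RM1: the probability of word w under the relevance model R is estimated
% from the set F of top-ranked (pseudo-relevant) documents D.
P(w \mid R) \propto \sum_{D \in F} P(D)\, P(w \mid D) \prod_{i=1}^{m} P(q_i \mid D)
```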
Query Expansion Method
• #2 Strategy for Entity/Ambiguous Queries
• Both EQ and AQ can be associated with a
specific Wikipedia entity page.
• Instead of considering the top-ranked
documents from the test collection, only the
corresponding entity page from Wikipedia will
be used as pseudo-relevant information.
21
Query Expansion Method
• #2 Strategy for Entity/Ambiguous Queries
• rank all the terms in the entity page; the top K
terms are chosen for expansion (a sketch follows below)
• Score(t) = tf * idf
• idf = log(N / df), N: the number of docs in the Wikipedia collection
22
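A minimal sketch of this ranking, assuming whitespace tokenization and a precomputed document-frequency table for the Wikipedia collection (both hypothetical simplifications):

```python
import math
from collections import Counter

def top_expansion_terms(entity_page_text, df, n_wiki_docs, k=20):
    """Rank terms in an entity page by Score(t) = tf * idf and return the top K.
    df: term -> document frequency in the Wikipedia collection (precomputed).
    n_wiki_docs: total number of documents in the Wikipedia collection."""
    tf = Counter(entity_page_text.lower().split())
    scores = {
        t: count * math.log(n_wiki_docs / df[t])
        for t, count in tf.items()
        if df.get(t, 0) > 0
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```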
Query Expansion Method- considering term
distributions and structure
• #3 Field Evidence for Query Expansion
• E.g., the importance of a term appearing in the
overview may differ from that of a term appearing
in an appendix.
• We examine two methods for utilizing evidence
from different fields:
• #3.1 Unsupervised Method
• #3.2 Supervised Method
23
#3.1 Unsupervised Method
• We replace the term frequency of each pseudo-relevant
document in the original relevance model with a
linear combination of field-weighted term frequencies
(see the formula below).
24
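The slide omits the formula; a plausible reconstruction consistent with this description, where F_j is the j-th of the six fields of document D and w_j its weight, is:

```latex
% Field-weighted term frequency replacing tf(t, D) in the relevance model.
\widetilde{tf}(t, D) = \sum_{j=1}^{6} w_j \cdot tf(t, F_j)
```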
#3.2 Supervised Method
• An alternative way of utilizing field evidence is
to transform it into features for supervised learning.
• An SVM with a radial basis function (RBF) kernel is used.
• Each expansion term is represented by a feature
vector (10 features in total)
25
#3.2 Supervised Method
• The first group of features consists of term
distributions (TD) in the PRF documents and collections.
• The features we used include:
1) TD in the test collection
2) TD in PRF (top 10) from the test collection
3) TD in the Wikipedia collection
4) TD in PRF (top 10) from the Wikipedia
collection.
26
#3.2 Supervised Method
• The second group of features is based on field
evidence (structure).
• As described before, we divide each entity page
into six fields. One feature is defined for each
field, computed from the term's occurrence statistics
in that field (a sketch of the feature vector and
classifier follows below).
27
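A minimal sketch of the supervised selection step, assuming the per-field feature is a length-normalized term frequency and using toy training data (both are assumptions for illustration; the paper defines the exact features and uses labeled expansion terms):

```python
# Hypothetical sketch: score candidate expansion terms with an RBF SVM.
# Feature layout (10 features): 4 term-distribution features + 6 field features.
import numpy as np
from sklearn.svm import SVC

def term_features(term, td_feats, entity_page_fields):
    """td_feats: the four term-distribution features listed above.
    entity_page_fields: texts of the six fields of the entity page.
    The per-field feature here is a length-normalized term frequency,
    an assumed stand-in for the paper's definition."""
    field_feats = [
        f.lower().split().count(term) / max(len(f.split()), 1)
        for f in entity_page_fields
    ]
    return np.array(list(td_feats) + field_feats)

# Toy training data: X has one 10-dim row per labeled term,
# y marks good (1) vs. bad (0) expansion terms.
rng = np.random.default_rng(0)
X_train = rng.random((40, 10))
y_train = (X_train[:, 0] > 0.5).astype(int)

clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, y_train)

# SL: take the top-ranked good terms; SLW: weight each term by its
# classification probability of being a good expansion term.
good_prob = clf.predict_proba(rng.random((5, 10)))[:, 1]
```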
Outline
• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work
28
Experiment
• Experiment Settings
• In our experiments, documents are retrieved for a
given query by the query-likelihood language
model (initial search; the scoring formula is shown below).
• Experiments were conducted using four TREC
collections.
• Retrieval effectiveness is measured in terms of
Mean Average Precision (MAP).
29
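For reference, the query-likelihood score with Dirichlet smoothing (a standard choice; the slides do not state which smoothing was used) is:

```latex
% Query-likelihood retrieval with Dirichlet smoothing (assumed);
% \mu is the smoothing parameter and |D| the document length.
P(Q \mid D) = \prod_{i=1}^{m}
  \frac{tf(q_i, D) + \mu\, P(q_i \mid \mathrm{Coll})}{|D| + \mu}
```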
Experiment
• Baselines
query-likelihood model (QL)
relevance model on the test collection (RMC)
relevance model based on Wikipedia (RMW)
• Parameters for the relevance model:
N = 10 (pseudo-relevant docs), K = 100 (expansion terms), λ = 0.6 (interpolation weight; see below)
30
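λ is presumably the weight interpolating the original query model with the relevance (expansion) model, in the usual RM3 style; under that assumption:

```latex
% Final query model: original query interpolated with the expansion model.
P'(w \mid Q) = \lambda\, P(w \mid Q) + (1 - \lambda)\, P(w \mid R)
```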
Experiment
• Using Entity Pages for Relevance Feedback
• Our method utilizes only the entity page corresponding
to the query for PRF (denoted RE).
• Not all queries can be mapped to a specific Wikipedia
entity page; thus this method is only applicable to EQ
and AQ.
31
Experiment
• Field based expansion
• Note that in our experiments on field-based
expansion, the top retrieved documents are used
as pseudo-relevant documents for EQ and AQ
(not the entity page).
• BQ can also be handled here.
32
Experiment
• Field based expansion
• For the supervised method, we compare two
ways of incorporating expansion terms for retrieval.
• The first is to add the top-ranked 100 good terms
(SL).
• The second is to add the top-ranked 10 good terms,
each weighted by its classification probability
(SLW).
• Unsupervised method:
the relevance model with weighted term frequencies
is denoted RMWTF.
33
Experiment
• Performance improves as the weights for the
Links and Content fields increase.
• Increasing the weight of the Overview field
degrades performance.
• This shows that the position where a term
appears affects how strongly it indicates term
relevance.
34
Experiment
• Query Dependent Expansion
• RE for EQ and AQ; RMWTF for BQ (a dispatch sketch follows below).
• We denote this query-dependent method as QD.
35
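A tiny sketch of the query-dependent dispatch (re_expansion and rmwtf_expansion are hypothetical wrappers for the methods sketched earlier):

```python
def query_dependent_expand(query, qtype, entity_page=None, top_docs=None):
    # QD: use the entity-page method (RE) for entity and ambiguous
    # queries, and the field-weighted relevance model (RMWTF) otherwise.
    if qtype in ("EQ", "AQ"):
        return re_expansion(entity_page)        # RE
    return rmwtf_expansion(query, top_docs)     # RMWTF
```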
Outline
• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work
36
Conclusion
• We have explored the use of Wikipedia in PRF.
• TREC topics are categorized into three types
based on Wikipedia. (We evaluated these methods on four TREC
collections.)
• We propose and study different methods for
term selection using pseudo relevance
information from Wikipedia entity pages.
• Our experimental results show that the query
dependent approach can improve over a baseline
relevance model.
37
Outline
• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work
38
Future Work
• More investigation is needed for the broader
queries (BQ).
• For ambiguous queries, if the disambiguation
process can achieve improved accuracy, the
effectiveness of the final retrieval will be improved.
• For the supervised term selection method, the
results obtained are not satisfactory in terms of
accuracy.
• By combining the initial results from the test
collection and Wikipedia, one may be able to
develop an expansion strategy that is robust even
when the query yields degraded results from either
resource.
39
