Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Probabilistic Models of NovelProbabilistic Models of Novel
Document Rankings forDocument Rankings for
Faceted Topic RetrievalFaceted Topic Retrieval
Ben Cartrette and Praveen Chandar
Dept. of Computer and Information Science
University of Delaware
Newark, DE
( CIKM ’09 )
Date: 2010/05/03
Speaker: Lin, Yi-Jhen
Advisor: Dr. Koh, Jia-Ling

AgendaAgenda
Introduction
- Motivation, Goal
Faceted Topic Retrieval
- Task, Evaluation
Faceted Topic Retrieval Models
- 4 kinds of models
Experiment & Results
Conclusion

Introduction - MotivationIntroduction - Motivation
Modeling documents as independently
relevant does not necessarily provide the
optimal user experience.

Traditional
evaluation
measure would
reward System1
since it has
higher recall
Introduction - MotivationIntroduction - Motivation
Actually,
we prefer
System2
(since it has
more
information)
System2 is better !

IntroductionIntroduction
Novelty and diversity become the
new definition of relevance and
evaluation measures .
They can be achieved through
retrieving documents that are
relevant to query, but cover different
facets of the topic.
we call faceted topic retrieval !

Introduction - GoalIntroduction - Goal
The faceted topic retrieval system
must be able to find a small set of
documents that covers all of the
facets
3 documents that cover 10 facets is
preferable to 5 documents that cover
10 facets

Faceted Topic Retrieval - TaskFaceted Topic Retrieval - Task
Define the task in terms of
Information need :
A faceted topic retrieval information need
is one that has a set of answers – facets –
that are clearly delineated
How that need is best satisfied :
Each answer is fully contained within at
least one document

Faceted Topic Retrieval - TaskFaceted Topic Retrieval - Task
Information need
invest in next generation technologies
increase use of renewable energy sources
Invest in renewable energy sources
double ethanol in gas supply
shift to biodiesel
shift to coal
Facets (a set of
answers)

A Query :
A sort list of keywords
A ranked list of documents
that contain as many unique
facets as possible.
D1D1
DnDn
D2D2

-Evaluation-Evaluation
S-recall
S-precision
Redundancy

Evaluation –Evaluation –
an example for S-recall and S-precisionan example for S-recall and S-precision
Total : 10 facets (assume all facets in
documents are non-overlapped)

Evaluation –Evaluation –
an example for Redundancyan example for Redundancy

Faceted topic retrieval modelsFaceted topic retrieval models
4 kinds of models
- MMR (Maximal Marginal Relevance)
- Probabilistic Interpretation of MMR
- Greedy Result Set Pruning
- A Probabilistic Set-Based Approach

1. MMR1. MMR
2. Probabilistic2. Probabilistic
Interpretation of MMRInterpretation of MMR
Let c1=0, c3=c4

3. Greedy Result Set Pruning3. Greedy Result Set Pruning
First, rank without considering
novelty (in order of relevance)
Second, step down the list of
documents, prune documents with
similarity greater than some
threshold ϴ
 I.e., at rank i, remove any document Dj,
j > i, with sim(Dj,Di) > ϴ

4. A Probabilistic Set-Based4. A Probabilistic Set-Based
ApproachApproach
 P(F ϵ D) :Probability of D contains F
the probability that a facet Fj occurs in at
least one document in a set D is
the probability that all of the facets in a
set F are captured by the documents D is

4. A Probabilistic Set-Based4. A Probabilistic Set-Based
ApproachApproach
4.1 Hypothesizing Facets
4.2 Estimating Document-Facet
Probabilities
4.3 Maximizing Likelihood

4.1 Hypothesizing Facets4.1 Hypothesizing Facets
Two unsupervised probabilistic methods :
Relevance modeling
Topic modeling with LDA
Instead of extract facets directly
from any particular word or phrase,
we build a “ facet model ”P(w|F)

4.1 Hypothesizing Facets4.1 Hypothesizing Facets
Since we do not know the facet
terms or the set of documents
relevant to the facet, we will
estimate them from the retrieved
documents
Obtain m models from the top m
retrieved documents by taking each
document along with its k nearest
neighbors as the basis for a facet
model

Relevance modelingRelevance modeling
Estimate m ”facet models“ P(w|Fj)
from a set of retrieved documents
using the so-called RM2 approach:
DFj : the set of documents relevant to facet Fj
fk : facet terms

Topic modeling with LDATopic modeling with LDA
Probabilistic P(w|Fj) and P(Fj) can
found through expectation
maximization

4.2 Estimating Document-Facet4.2 Estimating Document-Facet
ProbabilitiesProbabilities
Both the facet relevance model and LDA
model produce generation probabilistic
P(Di|Fj)
P(Di|Fj) : the probability that sampling
terms from the facet model Fj will
produce document Di

4.3 Maximizing Likelihood4.3 Maximizing Likelihood
Define the likelihood function
Constrain :
K : hypothesized minimum number
required to cover the facets
Maximizing L(y) is a NP-Hard problem
Approximate solution :
For each facet Fj, take the document Di
with maximum

Experiment - DataExperiment - Data
A Query :
A sort list of keywords
Top 130 retrieved documents
D1D1
D130D130
D2D2
Query Likelihood L.M.

Top 130 retrieved
documents
D1D1
D130D130
D2D2
2 assessors
to judge
44.7 relevant documents per
query
Each document contains 4.3
facets
39.2 unique facets on average
( for average one unique facet
per relevant document )
Agreement :
72% of all relevant documents
were judged relevant by both
assessors
For 60 queries :

TDT5 sample topic definition
Judgments
Query

Experiment – Retrieval EnginesExperiment – Retrieval Engines
Using Lemur toolkit
 LM baseline: a query-likelihood language model
 RM baseline: a pseudo-feedback with relevance
model
 MMR: query similarity scores from LM baseline
and cosine similarity for novelty
 AvgMix (Prob MMR) : the probabilistic MMR
model using query-likelihood scores from LM
baseline and the AvgMix novelty score.
 Pruning: removing documents from the LM
baseline on cosine similarity
 FM: the set-based facet model

Experiment – Retrieval EnginesExperiment – Retrieval Engines
FM: the set-based facet model
 FM-RM:
each of the top m documents and their K nearest
neighbors becomes a “facet model ”P(w|Fj), then
compute the probability P(Di|Fj)
 FM-LDA:
use LDA to discover subtopics zj, and get P(zj|
D) , we extract 50 subtopics

Experiments - EvaluationExperiments - Evaluation
Use five-fold cross-validation to
train and test systems
48 queries in four folds to train
model parameters
Parameters are used to obtain
ranked results on the remaining 12
queries
At the minimum optimal rank S-
rec, we report S-recall, redundancy,
MAP

ConclusionConclusion
We defined a type of novelty retrieval
task called faceted topic retrieval 
retrieve the facets of information need
in a small set of documents.
We presented two novel models: One
that prunes a retrieval ranking and
one a formally-motivated probabilistic
models.
Both models are competitive with
MMR, and outperform another
probabilistic model.

Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Similar to Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval (20)

More from YI-JHEN LIN

More from YI-JHEN LIN (6)

Recently uploaded

Recently uploaded (20)

Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval