5. Family Leave Law (ROBUST04 qid:648): MAP 0.2725
Family Leave: MAP 0.4679
However … can short queries have a similar property?
6. Family Leave Law (ROBUST04 qid:648): MAP 0.2725
Family Leave: MAP 0.4679
However … can short queries have a similar property?
• A subquery of the short query could be better!
7. A high-level overview
• A comparison between the best subqueries and the original queries on TREC collections:

Collection | Orig. Queries | Best Subqueries | Diff.
Disk12     | 0.2597        | 0.2880          | +10.9%
ROBUST04   | 0.2399        | 0.2772          | +15.5%
AQUAINT    | 0.2107        | 0.2426          | +15.1%
WT2G       | 0.3285        | 0.3580          | +9.0%
WT10G      | 0.1720        | 0.2051          | +19.2%
GOV2       | 0.3060        | 0.3221          | +5.3%
On Average | 0.2528        | 0.2821          | +12.5%
9. We formulate it as a Subquery Ranking Problem
Each subquery of "Family Leave Law" is mapped by feature extraction to a feature vector F and labeled with its MAP; a ranking model is then learned over these (Subquery, Features, Label) examples:

Subquery         | Label (MAP)
Family Leave Law | 0.2725
Family           | 0.0029
Leave            | 0.2477
Law              | 0.0000
Family Leave     | 0.4679
Leave Law        | 0.0639
Family Law       | 0.0046
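The formulation above ranks all subqueries of the original query. As a minimal sketch (my illustration, not the authors' code), the candidate set is every non-empty, order-preserving subset of the query terms:

```python
from itertools import combinations

def enumerate_subqueries(query):
    """Enumerate every non-empty subset of the query terms,
    preserving the original term order."""
    terms = query.split()
    subs = []
    for size in range(1, len(terms) + 1):
        for combo in combinations(terms, size):
            subs.append(" ".join(combo))
    return subs

subqueries = enumerate_subqueries("Family Leave Law")
# 2^3 - 1 = 7 candidates, including the original query itself
```

For a 3-term query this gives exactly the 7 candidates shown in the table above.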
10. Then the key is the Features
(Diagram: "Family Leave" → Features)
11. Previously Proposed Features
Previously proposed features (for verbose queries), in three groups:
• Statistical: TF, IDF, Collection TF, Collection IDF, Mutual Information
• Query: Similarity with Orig., Contains Stopwords?
• Post-Retrieval: Query Drift, Query Scope, Clarity Score, Weighted Information Gain
12. The Problem of Previously Proposed Features
Family Leave Law: IDFs 13.26, 12.39, 8.98
13. The Problem of Previously Proposed Features
Family Leave Law: IDFs 13.26, 12.39, 8.98
Remove the term with the lowest IDF
14. The Problem of Previously Proposed Features
Family Leave Law: IDFs 13.26, 12.39, 8.98
Remove the term with the lowest IDF
When to stop removing?
15. The Problem of Previously Proposed Features
Family Leave Law: IDFs 13.26, 12.39, 8.98
Remove the term with the lowest IDF
When to stop removing?
Other features do not work well (details in the paper)
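The baseline heuristic criticized above can be sketched as follows (my illustration; the `keep` parameter is hypothetical and stands in for exactly the stopping criterion the heuristic lacks):

```python
def reduce_by_idf(terms, idfs, keep):
    """Greedy baseline: repeatedly drop the term with the lowest IDF.
    `keep` (how many terms to retain) is the open question -- the
    heuristic itself gives no principled stopping point."""
    ranked = sorted(zip(terms, idfs), key=lambda p: p[1], reverse=True)
    kept = {t for t, _ in ranked[:keep]}
    return [t for t in terms if t in kept]  # restore original order

reduced = reduce_by_idf(["Family", "Leave", "Law"],
                        [13.26, 12.39, 8.98], keep=2)
# drops "Law" (lowest IDF), but nothing says 2 is the right size
```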
16. New features are proposed to tackle the problem
• Post-retrieval
• Focus on term relationships
• Document-level features → term-level features
17. New features are proposed to tackle the problem
• Post-retrieval
• Focus on term relationships
• Document-level features → term-level features
• 3 categories of features:
  • Term Proximity based Features
  • Term Score based Features
  • Compactness and Positions of Term Score Tensors
18. Term Proximity based Features (PXM)
• Term Dependency Model [Metzler05]
Family Leave Law
19. Term Proximity based Features (PXM)
• Term Dependency Model [Metzler05]
Family Leave Law
• Already know it is a law code
• The terms occur together
• In that order
20. Term Proximity based Features (PXM)
• Term Dependency Model [Metzler05]
Family Leave Law
• Already know it is a law code
• The terms occur together
• In that order
How to capture this as a feature?
21. How to Capture PXM?
• Use a proximity query:
#combine(#uw4(family leave) #ow4(family leave))
(#uw4 = unordered window of 4, #ow4 = ordered window of 4)
22. How to Capture PXM?
• Use a proximity query:
#combine(#uw4(family leave) #ow4(family leave))
(#uw4 = unordered window of 4, #ow4 = ordered window of 4)
• Explore the ranking scores: apply the feature functions MIN, MAX, MAX-MIN, MAX/MIN, SUM, MEAN, STD, and GMEAN over the top proximity ranking scores (e.g. 0.5894, 0.5632, 0.5323, 0.4927), and compute the correlation between the proximity ranking scores and the original ranking scores (e.g. 0.6288, 0.6109, 0.6099, 0.5912)
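A sketch of this aggregation, using the example scores from the slide (assumptions: STD is taken as the population standard deviation, and the correlation is plain Pearson correlation; the paper may use different variants):

```python
import math
import statistics

def score_features(scores):
    """The eight aggregate feature functions from the slide, applied to
    the top-ranked proximity scores."""
    gmean = math.exp(sum(math.log(s) for s in scores) / len(scores))
    return {
        "MIN": min(scores),
        "MAX": max(scores),
        "MAX-MIN": max(scores) - min(scores),
        "MAX/MIN": max(scores) / min(scores),
        "SUM": sum(scores),
        "MEAN": statistics.mean(scores),
        "STD": statistics.pstdev(scores),
        "GMEAN": gmean,
    }

def pearson(xs, ys):
    """Pearson correlation between two aligned score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

prox = [0.5894, 0.5632, 0.5323, 0.4927]   # proximity ranking scores
orig = [0.6288, 0.6109, 0.6099, 0.5912]   # original ranking scores
feats = score_features(prox)
r = pearson(prox, orig)                    # close to 1: the rankings agree
```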
23. Term Score based Features (TS)
• TF-IDF Constraint [Fang2011]
SVM Tutorial: term frequencies 99 / 1 vs. 50 / 50 → Counter-intuitive
24. Term Score based Features (TS)
• TF-IDF Constraint [Fang2011]
SVM Tutorial: term frequencies 99 / 1 vs. 50 / 50 → Counter-intuitive
• We instead look at the term scores…
25. Term Score based Features (TS)
• We look at the term scores…
• Colors indicate relevance probability
• Queries have different term score distributions: in some, one term is more important; in others, the terms are of relatively equal importance
26. How to Capture TS?
• Explore the ranking scores of the terms of "Family Leave Law":

     | Family | Leave  | Law
doc1 | 0.2123 | 0.4596 | 0.0038
doc2 | 0.2346 | 0.4087 | 0.0002
doc3 | 0.2016 | 0.4456 | 0.0016
doc4 | 0.1946 | 0.4213 | 0.0027
doc5 | 0.1942 | 0.3928 | 0.0059

• Apply a feature function (max) per document over its individual term scores: 0.4596, 0.4087, 0.4456, 0.4213, 0.3928
• Apply a second feature function (mean) across documents: final feature = 0.4256
• Feature functions: MIN, MAX, MAX-MIN, MAX/MIN, SUM, MEAN, STD, GMEAN
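The two-step aggregation reproduces the slide's numbers (a sketch, assuming a simple max-then-mean composition of feature functions):

```python
import statistics

# Term scores from the slide: rows are doc1..doc5, columns Family/Leave/Law
term_scores = [
    [0.2123, 0.4596, 0.0038],
    [0.2346, 0.4087, 0.0002],
    [0.2016, 0.4456, 0.0016],
    [0.1946, 0.4213, 0.0027],
    [0.1942, 0.3928, 0.0059],
]

# Step 1: feature function (max) per document over its term scores
per_doc = [max(row) for row in term_scores]

# Step 2: a second feature function (mean) across documents
final_feature = statistics.mean(per_doc)  # 0.4256, as on the slide
```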
27. Compactness and Positions of Term Score Tensors (TCP)
• Normalized Query Commitment (NQC) [Shtok2012]
Document ranking scores: 0.5894, 0.5632, 0.5323, 0.4927 vs. 0.6678, 0.5632, 0.4896 (larger gaps between scores)
Quote: "Higher deviation value was correlated with potentially lower query drift, and thus indicating the better effectiveness"
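The intuition can be sketched as the spread of the top ranking scores (an illustration only: NQC proper also normalizes by the query's corpus score, which is omitted here):

```python
import statistics

def score_deviation(scores):
    """Sketch of the NQC intuition [Shtok2012]: the standard deviation
    of the top document ranking scores. Higher deviation suggests lower
    query drift and better effectiveness."""
    return statistics.pstdev(scores)

small_gap = [0.5894, 0.5632, 0.5323, 0.4927]  # left column on the slide
large_gap = [0.6678, 0.5632, 0.4896]          # right column, larger gaps
# the list with larger gaps between scores has the higher deviation
```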
28. Compactness and Positions of Term Score Tensors (TCP)
• We instead look at the term scores…
• Term scores as tensors in multi-dimensional space (relevant vs. non-relevant documents)
29. Compactness and Positions of Term Score Tensors (TCP)
• We instead look at the term scores…
• Term scores as tensors in multi-dimensional space (relevant vs. non-relevant documents)
• The best subquery has more compact tensors
• But they are clustered at different locations
31. Tensor Closeness to Diagonal (CDG)
• The distance from the tensors' centroid to the diagonal line in multi-dimensional space
• The mean and standard deviation of the distances from the tensors to the diagonal line
32. Tensor Closeness to Nearest Axis (CNA)
• The distance from the tensors' centroid to the nearest axis in multi-dimensional space
• The mean and standard deviation of the distances from the tensors to the nearest axis
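Both distances (CDG's distance to the diagonal, CNA's distance to the nearest axis) have simple closed forms; a sketch, under the assumption that the diagonal is the line through the origin in direction (1, …, 1) and each axis is a coordinate axis through the origin:

```python
import math

def dist_to_diagonal(x):
    """Distance from point x to the diagonal line through the origin in
    direction (1, ..., 1): the norm of x after removing its mean."""
    m = sum(x) / len(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x))

def dist_to_nearest_axis(x):
    """Distance from point x to the nearest coordinate axis: drop one
    coordinate, take the norm of the rest, and minimize over the axes."""
    sq = sum(xi ** 2 for xi in x)
    return math.sqrt(max(0.0, min(sq - xi ** 2 for xi in x)))

# A point on the diagonal is at distance ~0 from it;
# a point on an axis is at distance 0 from its nearest axis.
```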
35. Experiments - LambdaMART with other features
• Mutual Information (MI)
• Collection Term Frequency (CTF)
• Document Frequency (DF)
• Inverse Document Frequency (IDF)
• Min Document Term Frequency (MINTF) and Max Document Term Frequency (MAXTF)
• Average Document Term Frequency (AVGTF) and Standard Deviation of Document Term Frequency (STDTF)
• Average Document Term Frequency with IDF (AVGTFIDF) and with Collection Occurrence Probability (AVGTFCOP)
• Simplified Clarity Score (SCS)
37. Feature Analysis
(Bar chart over the feature groups: Basic, PXM, TS, TCP)
• Performance difference per feature group
• The larger the difference, the more important the feature
39. Related Work – Query Reduction
• Statistical features
  • TF-IDF based
  • Mutual Information
  • Domain specific
• Query features
  • Similarity with the original query
• Term dependency features
  • Tree-based dependency
• Post-retrieval features
  • Query-document relevance scores
  • Weighted Information Gain
  • Query drift