Can Short Queries Be Even Shorter?
University of Delaware
Long/Verbose Queries Have Been Extensively Studied
• Example long query here …
However …
Can short queries have a similar property?
Query (ROBUST04 qid:648)   MAP
Family Leave Law           0.2725
Family Leave               0.4679
• A subquery of the short query can be better!
A High-Level Overview
• A comparison between the best subqueries and the original queries on TREC collections:
Collection   Orig. Queries   Best Subqueries   Diff.
Disk12       0.2597          0.2880            +10.9%
ROBUST04     0.2399          0.2772            +15.5%
AQUAINT      0.2107          0.2426            +15.1%
WT2G         0.3285          0.3580            +9.0%
WT10G        0.1720          0.2051            +19.2%
GOV2         0.3060          0.3221            +5.3%
On Average   0.2528          0.2821            +12.5%
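The Diff. column reads as the relative change of the best-subquery score over the original-query score (MAP in the earlier "Family Leave Law" example). A minimal sketch of that arithmetic, using two rows of the table; the function name is only illustrative:

```python
def relative_improvement(original: float, subquery: float) -> float:
    """Relative change, in percent, of the subquery score over the original score."""
    return (subquery - original) / original * 100.0


# Rows taken from the overview table: (collection, original, best subquery).
rows = [("ROBUST04", 0.2399, 0.2772), ("WT10G", 0.1720, 0.2051)]
for name, orig, best in rows:
    print(f"{name}: {relative_improvement(orig, best):+.1f}%")
```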
Question:
Original query: “Family Leave Law”
Best subquery: “Family Leave”
• Can we identify those optimal subqueries?
• How do we identify them?
We formulate it as a Subquery Ranking Problem
Extract features (F) for every subquery of “Family Leave Law”, then learn a ranker with MAP as the label:
Subquery           Features   Label (MAP)
Family Leave Law   F          0.2725
Family             F          0.0029
Leave              F          0.2477
Law                F          0.0000
Family Leave       F          0.4679
Leave Law          F          0.0639
Family Law         F          0.0046
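The candidate set in the ranking formulation is every non-empty subset of the query terms. A minimal sketch of generating those candidates and picking the one a perfect ranker would choose. The MAP labels are the slide's numbers, paired with subqueries in the order they appear on the slide, which is an assumption for all but "Family Leave Law" and "Family Leave":

```python
from itertools import combinations


def subqueries(query: str) -> list[str]:
    """Return every non-empty subset of the query terms, keeping term order."""
    terms = query.split()
    result = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            result.append(" ".join(combo))
    return result


candidates = subqueries("Family Leave Law")

# MAP labels from the slide (pairing assumed from slide order).
labels = {
    "Family Leave Law": 0.2725,
    "Family": 0.0029,
    "Leave": 0.2477,
    "Law": 0.0000,
    "Family Leave": 0.4679,
    "Leave Law": 0.0639,
    "Family Law": 0.0046,
}

# An ideal ranker would put the subquery with the highest label first.
best = max(candidates, key=labels.get)
print(len(candidates), best)
```

A query of n terms yields 2^n - 1 candidates, which stays small for the two- to four-term queries considered here.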
Then the Key Is the Features
Subquery “Family Leave” → Features
Previously Proposed Features (for verbose queries)
Categories: Statistical, Query, Post-Retrieval
• TF
• IDF
• Collection TF
• Collection IDF
• Mutual Information
• Similarity with Orig.
• Contain Stopwords?
• Query Drift
• Query Scope
• Clarity Score
• Weighted Information Gain
The Problem of Previously Proposed Features
Term   Family   Leave   Law
IDF    13.26    12.39   8.98
• Remove the term with the lowest IDF
• But when should we stop removing?
• Other features do not work well (details in the paper)
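The stopping problem can be made concrete with the slide's own numbers. The sketch below repeatedly drops the lowest-IDF term; the IDF-to-term pairing and the MAP of "Family" are read off the slides in display order, which is an assumption. IDF alone gives no signal that the first removal helps while the second one hurts:

```python
# IDF values and MAP labels as shown on the slides (pairing assumed from slide order).
idf = {"Family": 13.26, "Leave": 12.39, "Law": 8.98}
map_of = {"Family Leave Law": 0.2725, "Family Leave": 0.4679, "Family": 0.0029}


def drop_lowest_idf(terms: list[str]) -> list[str]:
    """Remove the single term with the lowest IDF, keeping term order."""
    weakest = min(terms, key=lambda t: idf[t])
    return [t for t in terms if t != weakest]


terms = ["Family", "Leave", "Law"]
trace = [" ".join(terms)]
while len(terms) > 1:
    terms = drop_lowest_idf(terms)
    trace.append(" ".join(terms))

for q in trace:
    print(f"{q:<18} MAP={map_of[q]:.4f}")
```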
New features are proposed to tackle the problem
• Post-retrieval
• Focus on term relationships
• From document-level features to term-level features
• 3 categories of features
  • Term Proximity based Features
  • Term Score based Features
  • Compactness and Positions of Term Score Tensors
Term Proximity based Features (PXM)
• Term Dependency Model [Metzler05]
Family Leave Law
• We already know it is a law code
• The terms occur together
• In that order
How to capture the feature?
How to Capture PXM?
• Use a proximity query
#combine(#uw4(family leave) #ow4(family leave))
  #uw4: unordered window of 4; #ow4: ordered window of 4
• Explore the ranking scores
Proximity ranking scores:   0.5894   0.5632   0.5323   0.4927
Original ranking scores:    0.6288   0.6109   0.6099   0.5912
• Statistics of the proximity ranking scores: MIN, MAX, MAX-MIN, MAX/MIN, SUM, MEAN, STD, GMEAN
• Correlation between the proximity ranking scores and the original ranking scores
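A minimal sketch of turning the two score lists into PXM-style features. The slide does not say which standard deviation or which correlation is used, so population standard deviation and Pearson correlation over list positions are assumptions here, as are the function names:

```python
import math
import statistics


def score_stats(scores: list[float]) -> dict[str, float]:
    """Summary statistics of a list of positive retrieval scores."""
    lo, hi = min(scores), max(scores)
    return {
        "MIN": lo,
        "MAX": hi,
        "MAX-MIN": hi - lo,
        "MAX/MIN": hi / lo,
        "SUM": sum(scores),
        "MEAN": statistics.fmean(scores),
        "STD": statistics.pstdev(scores),  # population std (assumption)
        "GMEAN": math.exp(statistics.fmean([math.log(s) for s in scores])),
    }


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return cov / norm


# Score lists from the slide.
proximity = [0.5894, 0.5632, 0.5323, 0.4927]
original = [0.6288, 0.6109, 0.6099, 0.5912]

features = score_stats(proximity)
features["CORR"] = pearson(proximity, original)
for name, value in features.items():
    print(f"{name:<8}{value:.4f}")
```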
Term Score based Features (TS)
• TF-IDF Constraint [Fang2011]
  SVM: 99, Tutorial: 1 vs. SVM: 50, Tutorial: 50
  Counter-intuitive
• We instead look at the term scores…
• Colors show relevance probability
• Queries have different term score distributions
  • One term is more important
  • Terms are of relatively equal importance
How to Capture TS?
• Explore the ranking scores of the individual terms
Individual term scores:
        Family   Leave    Law      feature func (max)
doc1    0.2123   0.4596   0.0038   0.4596
doc2    0.2346   0.4087   0.0002   0.4087
doc3    0.2016   0.4456   0.0016   0.4456
doc4    0.1946   0.4213   0.0027   0.4213
doc5    0.1942   0.3928   0.0059   0.3928
Final feature, feature func (mean) over the documents: 0.4256
Feature funcs: MIN, MAX, MAX-MIN, MAX/MIN, SUM, MEAN, STD, GMEAN
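The two-step aggregation on this slide can be sketched as one function applied per document, followed by a second function across documents. The matrix is the one shown on the slide; `ts_feature` is a hypothetical helper name:

```python
import statistics

# Rows are doc1..doc5; columns are the per-term scores shown on the slide.
term_scores = [
    [0.2123, 0.4596, 0.0038],
    [0.2346, 0.4087, 0.0002],
    [0.2016, 0.4456, 0.0016],
    [0.1946, 0.4213, 0.0027],
    [0.1942, 0.3928, 0.0059],
]


def ts_feature(matrix, per_doc, across_docs):
    """Apply per_doc to each document's term scores, then across_docs to the results."""
    return across_docs([per_doc(row) for row in matrix])


# max over the terms of each document, then mean over the documents
feature = ts_feature(term_scores, max, statistics.fmean)
print(round(feature, 4))  # 0.4256
```

Swapping in other pairs of functions from the list (for example `sum` for both steps) gives the other TS variants.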
Compactness and Positions of Term Score Tensors (TCP)
• Normalized Query Commitment (NQC) [Shtok2012]
Document ranking scores:                  0.5894   0.5632   0.5323   0.4927
Document ranking scores (larger gaps):    0.6678   0.5632   0.4896
Quote: “Higher deviation value was correlated with potentially lower query drift, and thus indicating the better effectiveness”
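A small sketch of the deviation idea behind NQC, applied to the two score lists on the slide. NQC divides the standard deviation of the top scores by a collection-level score; that normalizer is not on the slide, so it is left as an optional parameter here (an assumption), and population standard deviation is used:

```python
import statistics


def score_deviation(scores, corpus_score=1.0):
    """Standard deviation of the top retrieval scores, divided by a normalizer.

    corpus_score stands in for NQC's collection-level normalizer; it is an
    assumption here and defaults to 1.0 (no normalization).
    """
    return statistics.pstdev(scores) / abs(corpus_score)


small_gaps = [0.5894, 0.5632, 0.5323, 0.4927]
large_gaps = [0.6678, 0.5632, 0.4896]

print(f"{score_deviation(small_gaps):.4f}")
print(f"{score_deviation(large_gaps):.4f}")
```

The list with the larger gaps has the higher deviation, which is the property the quote associates with lower query drift.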
Compactness and Positions of Term Score Tensors (TCP)
• We instead look at the term scores…
• Term scores as tensors in multi-dimensional space
• The best subquery has more compact tensors
• But they are clustered at different locations
[Figure: term score tensors of relevant and non-relevant documents]
Compactness of Tensors
• Mean and standard deviation of the distances between the tensors and their centroid
Tensor Closeness to Diagonal (CDG)
• The distance from the tensors' centroid to the diagonal line in multi-dimensional space
• Mean and standard deviation of the distances from the tensors to the diagonal line
Tensor Closeness to Nearest Axis (CNA)
• The distance from the tensors' centroid to the nearest axis in multi-dimensional space
• Mean and standard deviation of the distances from the tensors to the nearest axis
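The three geometric feature families above can be sketched with plain Euclidean geometry. This is an illustration under assumptions, not the authors' implementation: each document is treated as a point whose coordinates are its per-term scores, distances are Euclidean, and the standard deviation is the population one. The sample points reuse the term-score matrix from the TS slide:

```python
import math
import statistics

Point = list[float]


def centroid(points: list[Point]) -> Point:
    """Coordinate-wise mean of the points."""
    return [statistics.fmean(col) for col in zip(*points)]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dist_to_diagonal(p: Point) -> float:
    """Distance from p to the line x1 = x2 = ... = xn."""
    m = statistics.fmean(p)  # p projects onto the diagonal at (m, ..., m)
    return math.sqrt(sum((x - m) ** 2 for x in p))


def dist_to_nearest_axis(p: Point) -> float:
    """Distance from p to the closest coordinate axis."""
    total = sum(x * x for x in p)
    # The nearest axis is the one carrying the largest coordinate magnitude.
    return math.sqrt(max(total - max(x * x for x in p), 0.0))


def mean_std(values: list[float]) -> tuple[float, float]:
    return statistics.fmean(values), statistics.pstdev(values)


def tcp_features(points: list[Point]) -> dict:
    c = centroid(points)
    return {
        "TC": mean_std([dist(p, c) for p in points]),
        "CDG_centroid": dist_to_diagonal(c),
        "CDG": mean_std([dist_to_diagonal(p) for p in points]),
        "CNA_centroid": dist_to_nearest_axis(c),
        "CNA": mean_std([dist_to_nearest_axis(p) for p in points]),
    }


points = [
    [0.2123, 0.4596, 0.0038],
    [0.2346, 0.4087, 0.0002],
    [0.2016, 0.4456, 0.0016],
    [0.1946, 0.4213, 0.0027],
    [0.1942, 0.3928, 0.0059],
]
for name, value in tcp_features(points).items():
    print(name, value)
```

Points near the diagonal correspond to terms of roughly equal importance, and points near an axis to one dominant term, which matches the two term-score distributions described earlier in the deck.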
Experiments
Collection   #qry   |QL|=2     |QL|=3      |QL|=4
Disk12       150    30 (20%)   37 (25%)    41 (27%)
ROBUST04     250    75 (33%)   147 (59%)   17 (7%)
AQUAINT      50     21 (42%)   27 (54%)    1 (2%)
WT2G         50     24 (48%)   23 (46%)    0 (0%)
WT10G        100    30 (30%)   25 (25%)    20 (20%)
GOV2         150    44 (29%)   65 (43%)    35 (23%)
Keep: |QL|=2, |QL|=3   Drop: |QL|=4
Experiments - mapping labels from AP to integer
Experiments - LambdaMART with other features
• Mutual Information (MI)
• Collection Term Frequency (CTF)
• Document Frequency (DF)
• Inverse Document Frequency (IDF)
• Min Document Term Frequency (MINTF) and Max Document Term Frequency (MAXTF)
• Average Document Term Frequency (AVGTF) and Standard Deviation Document Term Frequency (STDTF)
• Average Document Term Frequency with IDF (AVGTFIDF) and with Collection Occurrence Probability (AVGTFCOP)
• Simplified Clarity Score (SCS)
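Of the basic features listed above, the Simplified Clarity Score is probably the least self-explanatory. A minimal sketch using its commonly cited definition, the sum over query terms of P(w|Q) · log2(P(w|Q) / P(w|C)), with maximum-likelihood estimates. The deck does not give the formula, and the collection counts below are made up purely for illustration:

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_tf, collection_len):
    """SCS = sum over query terms of P(w|Q) * log2(P(w|Q) / P(w|C))."""
    counts = Counter(query_terms)
    qlen = len(query_terms)
    score = 0.0
    for term, qtf in counts.items():
        p_q = qtf / qlen                              # term probability in the query
        p_c = collection_tf[term] / collection_len    # term probability in the collection
        score += p_q * math.log2(p_q / p_c)
    return score


# Hypothetical collection statistics, not taken from any TREC collection.
collection_tf = {"family": 1000, "leave": 2000}
scs = simplified_clarity_score(["family", "leave"], collection_tf, 1_000_000)
print(f"{scs:.4f}")
```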
Results
OG: original queries; SR: subquery ranking; UB: upper bound (best subqueries)
|QL|=2
Collection   OG       SR                 UB
Disk12       0.3216   0.3309 (+2.89%)    0.3372 (+4.85%)
ROBUST04     0.2506   0.2566 (+2.39%)    0.2662 (+6.23%)
AQUAINT      0.2063   0.2091 (+1.36%)    0.2184 (+5.87%)
WT2G         0.2983   0.2983 (+0.00%)    0.3083 (+3.35%)
WT10G        0.2544   0.2663 (+4.68%)    0.2738 (+7.63%)
|QL|=3
Collection   OG       SR                 UB
Disk12       0.2597   0.2833 (+9.09%)    0.2880 (+10.90%)
ROBUST04     0.2399   0.2643 (+10.17%)   0.2772 (+15.55%)
AQUAINT      0.2107   0.2323 (+10.25%)   0.2426 (+15.14%)
WT2G         0.3285   0.3380 (+2.89%)    0.3580 (+8.98%)
WT10G        0.1720   0.1949 (+13.31%)   0.2051 (+19.24%)
GOV2         0.3060   0.3113 (+1.73%)    0.3221 (+5.26%)
Feature Analysis
Feature groups: Basic, PXM, TS, TCP
• Performance difference
• The larger the difference, the more important the feature
Feature Analysis
Group            Feature     Diff.
Basic Features   AVGTFCOP    0.2294 (-15.5%)
Basic Features   SCS         0.2363 (-12.9%)
Basic Features   CTF         0.2370 (-12.7%)
TCP              TCP(TC)     0.2342 (-14.0%)
TCP              TCP(CDG)    0.2359 (-13.6%)
TCP              TCP(CNA)    0.2329 (-14.7%)
PXM              PXM(h)      0.2341 (-14.2%)
PXM              PXM(corr)   0.2364 (-13.3%)
TS               TS1         0.2337 (-13.5%)
TS               TS2         0.2256 (-16.2%)
TS               TS3         0.2259 (-16.1%)
TS1: TS(MAX/MIN,SUM); TS2: TS(SUM,SUM); TS3: TS(GMEAN,MEAN)
Related Work – Query Reduction
• Statistical Features
  • TF-IDF based
  • Mutual Information
  • Domain specific
• Query Features
  • Similarity with Original Query
• Term Dependency Features
  • Tree-based dependency
• Post-Retrieval Features
  • Query-document Relevance Scores
  • Weighted Information Gain
  • Query Drift
Thank You!
Q & A

Can Short Queries be Even Shorter?

  • 1. Can Short Queries Be Even Shorter? University of Delaware 1
  • 2. Long/Verbose Queries Had Been Extensively Studied 2 • Example long query here …
  • 3. 3 However … Can short queries have the similar property?
  • 4. 4 Family Leave Law (ROBUST04 qid:648) 0.2725 MAP However … Can short queries have the similar property?
  • 5. 5 Family Leave Law (ROBUST04 qid:648) 0.2725 MAP Family Leave 0.4679 However … Can short queries have the similar property?
  • 6. 6 Family Leave Law (ROBUST04 qid:648) 0.2725 MAP Family Leave 0.4679 However … Can short queries have the similar property? • Subquery of the short query could be better!
  • 7. A high level overview • A comparison between the Best Subqueries with the Original Queries for TREC collections: 7 Collection Orig. Queries Best Subqueries Diff. Disk12 0.2597 0.2880 +10.9% ROBUST04 0.2399 0.2772 +15.5% AQUAINT 0.2107 0.2426 +15.1% WT2G 0.3285 0.3580 +9.0% WT10G 0.1720 0.2051 +19.2% GOV2 0.3060 0.3221 +5.3% On Average 0.2528 0.2821 +12.5%
  • 8. 8 Question: “Family Leave Law” Original Query “Family Leave” Best Subquery ? • Can we identify those optimal subqueries? • How do identify?
  • 9. We formulate it as a Subquery Ranking Problem Family Leave Law 9 Family Leave Law Family Leave Leave Law Family Law F F F F F F F 0.2725 0.0029 0.2477 0.0000 0.4679 0.0639 0.0046 LearnExtract Subquery Features Label(MAP )
  • 10. Then the key is the Features 10 Family Leave Features
  • 11. Previously Proposed Features 11 Previously Proposed Features (for verbose query) Statistical Query Post-Retrieval TF IDF Collection TF Collection IDF Mutual Information Similarity with Orig. Contain Stopwords? Query Drift Query Scope Clarity Score Weighted Information Gain Family Leave Features
  • 12. The Problem of Previously Proposed Features • IDFs for Family Leave Law: 13.26, 12.39, 8.98
  • 13. The Problem of Previously Proposed Features • IDFs for Family Leave Law: 13.26, 12.39, 8.98 • Remove the term with the lowest IDF
  • 14. The Problem of Previously Proposed Features • IDFs for Family Leave Law: 13.26, 12.39, 8.98 • Remove the term with the lowest IDF • When to stop removing?
  • 15. The Problem of Previously Proposed Features • IDFs for Family Leave Law: 13.26, 12.39, 8.98 • Remove the term with the lowest IDF • When to stop removing? • Other features do not work well (details in the paper)
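A minimal sketch of the IDF-based removal heuristic shows why the stopping point is the open question. The function name `idf_removal_chain` is hypothetical, and assigning the three IDF values to Family, Leave, and Law in that order is an assumption based on the slide layout.

```python
def idf_removal_chain(term_idf):
    """Repeatedly drop the lowest-IDF term and record each resulting query."""
    terms = list(term_idf)
    chain = [" ".join(terms)]
    while len(terms) > 1:
        lowest = min(terms, key=term_idf.get)
        terms.remove(lowest)
        chain.append(" ".join(terms))
    return chain


# Term-to-IDF pairing assumed from the order shown on the slide.
idf = {"Family": 13.26, "Leave": 12.39, "Law": 8.98}
chain = idf_removal_chain(idf)
print(chain)
```

IDF fixes the order in which terms are removed, but every query in the chain is an equally valid output of the heuristic. Nothing in the IDF values alone says whether to stop at the two-term or the one-term query.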
  • 16. New features are proposed to tackle the problem • Post-retrieval • Focus on term relationships • Document-level features, term-level features
  • 17. New features are proposed to tackle the problem • Post-retrieval • Focus on term relationships • Document-level features, term-level features • 3 categories of features • Term Proximity based Features • Term Score based Features • Compactness and Positions of Term Score Tensors
  • 18. Term Proximity based Features (PXM) • Term Dependency Model [Metzler05] • Family Leave Law
  • 19. Term Proximity based Features (PXM) • Term Dependency Model [Metzler05] • Family Leave Law • Already know it is a law code • Occur together • In that order
  • 20. Term Proximity based Features (PXM) • Term Dependency Model [Metzler05] • Family Leave Law • Already know it is a law code • Occur together • In that order • How to capture the feature?
  • 21. How to Capture PXM? • Use a proximity query: #combine(#uw4(family leave) #ow4(family leave)), where #uw4 is an unordered window of 4 and #ow4 is an ordered window of 4
  • 22. How to Capture PXM? • Use a proximity query: #combine(#uw4(family leave) #ow4(family leave)), where #uw4 is an unordered window of 4 and #ow4 is an ordered window of 4 • Explore the ranking scores • Proximity ranking scores (0.5894 0.5632 0.5323 0.4927) summarized with MIN, MAX, MAX-MIN, MAX/MIN, SUM, MEAN, STD, GMEAN • Correlation between the proximity ranking scores (0.5894 0.5632 0.5323 0.4927) and the original ranking scores (0.6288 0.6109 0.6099 0.5912)
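The PXM features can be sketched as below, under stated assumptions. The query string is copied from the slide as-is and is not checked against any retrieval engine. The helper names are hypothetical, STD is taken as the population standard deviation, and the correlation is taken as Pearson's r, since the slides specify neither.

```python
import math
from statistics import fmean, pstdev


def proximity_query(terms, window=4):
    """Build the proximity query string in the form shown on the slide."""
    joined = " ".join(terms)
    return f"#combine(#uw{window}({joined}) #ow{window}({joined}))"


def score_features(scores):
    """Summarize a list of positive ranking scores with the eight functions."""
    return {
        "MIN": min(scores),
        "MAX": max(scores),
        "MAX-MIN": max(scores) - min(scores),
        "MAX/MIN": max(scores) / min(scores),
        "SUM": sum(scores),
        "MEAN": fmean(scores),
        "STD": pstdev(scores),  # population std is an assumption
        "GMEAN": math.exp(fmean([math.log(s) for s in scores])),
    }


def pearson(xs, ys):
    """Pearson correlation; the slide does not say which correlation is used."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Scores copied from the slide.
proximity_scores = [0.5894, 0.5632, 0.5323, 0.4927]
original_scores = [0.6288, 0.6109, 0.6099, 0.5912]

query = proximity_query(["family", "leave"])
features = score_features(proximity_scores)
features["CORR"] = pearson(proximity_scores, original_scores)
```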
  • 23. Term Score based Features (TS) • TF-IDF Constraint [Fang2011] • Example for the query SVM Tutorial: term values of 99 and 1 versus 50 and 50 (counter-intuitive)
  • 24. Term Score based Features (TS) • TF-IDF Constraint [Fang2011] • Example for the query SVM Tutorial: term values of 99 and 1 versus 50 and 50 (counter-intuitive) • We instead look at the term scores…
  • 25. Term Score based Features (TS) • We look at the term scores… • Colors show relevance probability • Queries have different term score distributions: in some, one term is more important; in others, the terms are of roughly equal importance
  • 26. How to Capture TS? • Explore the ranking scores of terms • Individual term scores for Family Leave Law, with feature func (max) applied per document:
      Document    Term scores               max
      doc1        0.2123  0.4596  0.0038    0.4596
      doc2        0.2346  0.4087  0.0002    0.4087
      doc3        0.2016  0.4456  0.0016    0.4456
      doc4        0.1946  0.4213  0.0027    0.4213
      doc5        0.1942  0.3928  0.0059    0.3928
    • Feature func (mean) over the per-document values gives the final feature: 0.4256 • Feature funcs: MIN, MAX, MAX-MIN, MAX/MIN, SUM, MEAN, STD, GMEAN
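The two-level aggregation on this slide can be reproduced with a short sketch. The helper name `ts_feature` is hypothetical, and the column order (Family, Leave, Law) is assumed from the slide layout. The example applies max within each document and mean across documents, which matches the slide's final value.

```python
from statistics import fmean

# Rows are doc1..doc5; columns are assumed to be Family, Leave, Law.
term_scores = [
    [0.2123, 0.4596, 0.0038],
    [0.2346, 0.4087, 0.0002],
    [0.2016, 0.4456, 0.0016],
    [0.1946, 0.4213, 0.0027],
    [0.1942, 0.3928, 0.0059],
]


def ts_feature(matrix, inner, outer):
    """Apply `inner` to each document's term scores, then `outer` across documents."""
    per_document = [inner(row) for row in matrix]
    return outer(per_document)


per_document_max = [max(row) for row in term_scores]
final_feature = ts_feature(term_scores, max, fmean)
print(round(final_feature, 4))
```

Passing other pairs of functions from the slide's list (for example `sum` for both) produces the other TS variants in the same way.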
  • 27. Compactness and Positions of Term Score Tensors (TCP) • Normalized Query Commitment (NQC) [Shtok2012] • Document ranking scores: 0.5894 0.5632 0.5323 0.4927 versus 0.6678 0.5632 0.4896 (larger gaps) • Quote: “Higher deviation value was correlated with potentially lower query drift, and thus indicating the better effectiveness”
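The "larger gap" intuition can be checked on the two score lists from the slide. This sketch computes only the deviation of the ranking scores. The published NQC also normalizes by a corpus-level score, which the slide does not give, so that step is left out here. Using the population standard deviation is an assumption.

```python
from statistics import pstdev

# Document ranking scores copied from the slide.
tight_scores = [0.5894, 0.5632, 0.5323, 0.4927]
wide_scores = [0.6678, 0.5632, 0.4896]


def score_deviation(scores):
    """Population standard deviation of a ranked list's scores."""
    return pstdev(scores)


tight = score_deviation(tight_scores)
wide = score_deviation(wide_scores)
# The list with larger gaps between consecutive scores has the higher deviation.
print(wide > tight)
```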
  • 28. Compactness and Positions of Term Score Tensors (TCP) • We instead look at the term scores… • Term scores as tensors in multi-dimensional space • [Figure: relevant documents vs. non-relevant documents]
  • 29. Compactness and Positions of Term Score Tensors (TCP) • We instead look at the term scores… • Term scores as tensors in multi-dimensional space • Best subquery has more compact tensors • But clustered at different locations • [Figure: relevant documents vs. non-relevant documents]
  • 30. Compactness of Tensors • Mean and standard deviation of the distances between the tensors and their centroid
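A minimal sketch of the compactness feature follows. It assumes each document's term scores form one point (one coordinate per query term), Euclidean distance, and population standard deviation. The function names and the sample points are illustrative only and do not come from the slides.

```python
import math
from statistics import fmean, pstdev


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [fmean(coords) for coords in zip(*points)]


def compactness(points):
    """Mean and std of the Euclidean distances from each point to the centroid."""
    center = centroid(points)
    distances = [math.dist(p, center) for p in points]
    return fmean(distances), pstdev(distances)


# Hypothetical points: four corners of a square, all equally far from the center.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
mean_dist, std_dist = compactness(square)
```

A smaller mean distance indicates a tighter cluster. The standard deviation shows how evenly the points sit around the centroid.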
  • 31. Tensor Closeness to Diagonal (CDG) • The distance from the tensors' centroid to the diagonal line in multi-dimensional space • Mean and standard deviation of the distances from the tensors to the diagonal line
  • 32. Tensor Closeness to Nearest Axis (CNA) • The distance from the tensors' centroid to the nearest axis in multi-dimensional space • Mean and standard deviation of the distances from the tensors to the nearest axis
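The two position features can be sketched with elementary geometry. The sketch assumes Euclidean distance, the diagonal as the line where all coordinates are equal, and population standard deviation. All function names and sample points are hypothetical.

```python
import math
from statistics import fmean, pstdev


def distance_to_diagonal(point):
    """Distance to the line x1 = x2 = ... = xn.

    The projection of a point onto that line has every coordinate equal to
    the point's mean coordinate, so the distance is the norm of the residual.
    """
    mean_coord = fmean(point)
    return math.sqrt(sum((x - mean_coord) ** 2 for x in point))


def distance_to_nearest_axis(point):
    """Distance to the closest coordinate axis.

    The distance to axis i is sqrt(|p|^2 - p_i^2), so the nearest axis is
    the one with the largest squared coordinate.
    """
    squared_norm = sum(x * x for x in point)
    largest = max(x * x for x in point)
    return math.sqrt(max(0.0, squared_norm - largest))


def position_features(points, distance):
    """Centroid distance, plus mean and std of the per-point distances."""
    center = [fmean(coords) for coords in zip(*points)]
    distances = [distance(p) for p in points]
    return distance(center), fmean(distances), pstdev(distances)


# Hypothetical two-term example points.
points = [(3.0, 4.0), (4.0, 3.0)]
cdg = position_features(points, distance_to_diagonal)
cna = position_features(points, distance_to_nearest_axis)
```

Read against the term-score slides, points near the diagonal correspond to terms of similar importance, and points near an axis correspond to one dominant term.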
  • 33. Experiments
      Collection    #qry    |QL|=2      |QL|=3       |QL|=4
      Disk12        150     30 (20%)    37 (25%)     41 (27%)
      ROBUST04      250     75 (33%)    147 (59%)    17 (7%)
      AQUAINT       50      21 (42%)    27 (54%)     1 (2%)
      WT2G          50      24 (48%)    23 (46%)     0 (0%)
      WT10G         100     30 (30%)    25 (25%)     20 (20%)
      GOV2          150     44 (29%)    65 (43%)     35 (23%)
    • Slide annotations: Keep, Drop
  • 34. Experiments • Mapping labels from AP to integers
  • 35. Experiments • LambdaMART with other features • Mutual Information (MI) • Collection Term Frequency (CTF) • Document Frequency (DF) • Inverse Document Frequency (IDF) • Min Document Term Frequency (MINTF) and Max Document Term Frequency (MAXTF) • Average Document Term Frequency (AVGTF) and Standard Deviation of Document Term Frequency (STDTF) • Average Document Term Frequency with IDF (AVGTFIDF) and with Collection Occurrence Probability (AVGTFCOP) • Simplified Clarity Score (SCS)
  • 36. Results • Two tables, labeled |QL|=2 and |QL|=3 on the slide:
      Collection    OG        SR        Diff.      UB        Diff.
      Disk12        0.3216    0.3309    +2.89%     0.3372    +4.85%
      ROBUST04      0.2506    0.2566    +2.39%     0.2662    +6.23%
      AQUAINT       0.2063    0.2091    +1.36%     0.2184    +5.87%
      WT2G          0.2983    0.2983    +0.00%     0.3083    +3.35%
      WT10G         0.2544    0.2663    +4.68%     0.2738    +7.63%

      Collection    OG        SR        Diff.      UB        Diff.
      Disk12        0.2597    0.2833    +9.09%     0.2880    +10.90%
      ROBUST04      0.2399    0.2643    +10.17%    0.2772    +15.55%
      AQUAINT       0.2107    0.2323    +10.25%    0.2426    +15.14%
      WT2G          0.3285    0.3380    +2.89%     0.3580    +8.98%
      WT10G         0.1720    0.1949    +13.31%    0.2051    +19.24%
      GOV2          0.3060    0.3113    +1.73%     0.3221    +5.26%
  • 37. Feature Analysis • [Figure: performance difference for the feature groups Basic, PXM, TS, TCP] • Performance difference: the larger the difference, the more important the feature
  • 38. Feature Analysis
      Group             Feature      Value     Diff.
      Basic Features    AVGTFCOP     0.2294    -15.5%
      Basic Features    SCS          0.2363    -12.9%
      Basic Features    CTF          0.2370    -12.7%
      TCP               TCP(TC)      0.2342    -14.0%
      TCP               TCP(CDG)     0.2359    -13.6%
      TCP               TCP(CNA)     0.2329    -14.7%
      PXM               PXM(h)       0.2341    -14.2%
      PXM               PXM(corr)    0.2364    -13.3%
      TS                TS1          0.2337    -13.5%
      TS                TS2          0.2256    -16.2%
      TS                TS3          0.2259    -16.1%
    • TS1: TS(MAX/MIN,SUM); TS2: TS(SUM,SUM); TS3: TS(GMEAN,MEAN)
  • 39. Related Work: Query Reduction • Statistical features: TF-IDF based, mutual information, domain specific • Query features: similarity with the original query • Term dependency features: tree-based dependency • Post-retrieval features: query-document relevance scores, weighted information gain, query drift