SlideShare a Scribd company logo
1 of 1
Download to read offline
• Kendall's τ and AP correla�on are successful at comparing two given rankigns
• What about the correla�on between the observed and the true ranking?
• Useful as a single, well-understood, figure of the reliability of an experiment
• Contrary to sensi�vity or sta�s�cal significance, it gives an idea of global
similarity with the truth, not just about individual pairs of systems (eg. t-test)
or about a swap somewhere in the ranking (eg. ANOVA)
Toward Es�ma�ng the Rank Correla�on
between the Test Collec�on Results
and the True System Performance
Julián Urbano and Mónica Marrero
fully reproducible:
data and code
available online
SIGIR 2016
Pisa, July 19th
Evalua�on
0.0 0.2 0.4 0.6 0.8 1.0
0.01.02.0
Population of Topics
Effectiveness
Density
Sample of Topics
Effectiveness
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
04812
Test Collec�on Real World
S5 > S12 > S6 > S2 > S1 > S4...
Future Work
• Be�er es�mators of discordance
• Interval es�mators
• Fully Bayesian approach
• Consider other sources of
variability besides topics, such as
systems or documents
Results: Error of es�mators
0.020.040.060.080.10
tau − adhoc6
topic set size
Error
10
20
30
40
50
60
70
80
90
100
ML
MSQD
RES
KD
SH(w/o)
SH(w)
tau − adhoc7
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
tau − adhoc8
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
tauAP − adhoc8
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
tauAP − adhoc7
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
tauAP − adhoc6
topic set size
Error
10
20
30
40
50
60
70
80
90
100
0.020.040.060.080.10
• Split-half es�mators perform very poorly
• About 0.035 error with 50 topics
• All proposals near the same, but MSQD be�er with small samples
Results: Bias of es�mators
tau − adhoc6
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
ML
MSQD
RES
KD
SH(w/o)
SH(w)
0.000.040.08
tau − adhoc7
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
0.000.040.08
0.000.040.08
tau − adhoc8
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
0.000.040.08
tauAP − adhoc6
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
0.000.040.08
tauAP − adhoc7
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
0.000.040.08
tauAP − adhoc8
topic set size
Bias
10
20
30
40
50
60
70
80
90
100
• Split-half es�mators are clearly biased
• Correla�ons generally overes�mated
• MSQD much be�er with small collec�ons, KD slightly be�er otherwise
• We need to know the true scores in order to evaluate the es�mators!
• Stochas�c simula�on from a previous collec�on Y: maintains distribu�ons
and correla�ons, and prefixes vector of true mean scores E[Xs]=μs :=Ys
• From TREC 6, 7 & 8, simulate 3x1000 collec�ons of n=10, 20,...,100 topics
• Split-half baselines w/ and w/o replacement: y=a·ebx
, 2000 replicates
S1 > S2 > S3 > S4 > S5 > S6...
Expected Correla�on with the True Ranking
bias correction
rank of Xi within
the sample

More Related Content

More from Julián Urbano

Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Julián Urbano
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
Julián Urbano
 

More from Julián Urbano (20)

Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered Lists
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
 

Recently uploaded

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Sérgio Sacani
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Lokesh Kothari
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Sérgio Sacani
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
anilsa9823
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Sérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 

Recently uploaded (20)

Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 

Toward Estimating the Rank Correlation between the Test Collection Results and the True System Performance

  • 1. • Kendall's τ and AP correla�on are successful at comparing two given rankigns • What about the correla�on between the observed and the true ranking? • Useful as a single, well-understood, figure of the reliability of an experiment • Contrary to sensi�vity or sta�s�cal significance, it gives an idea of global similarity with the truth, not just about individual pairs of systems (eg. t-test) or about a swap somewhere in the ranking (eg. ANOVA) Toward Es�ma�ng the Rank Correla�on between the Test Collec�on Results and the True System Performance Julián Urbano and Mónica Marrero fully reproducible: data and code available online SIGIR 2016 Pisa, July 19th Evalua�on 0.0 0.2 0.4 0.6 0.8 1.0 0.01.02.0 Population of Topics Effectiveness Density Sample of Topics Effectiveness Frequency 0.0 0.2 0.4 0.6 0.8 1.0 04812 Test Collec�on Real World S5 > S12 > S6 > S2 > S1 > S4... Future Work • Be�er es�mators of discordance • Interval es�mators • Fully Bayesian approach • Consider other sources of variability besides topics, such as systems or documents Results: Error of es�mators 0.020.040.060.080.10 tau − adhoc6 topic set size Error 10 20 30 40 50 60 70 80 90 100 ML MSQD RES KD SH(w/o) SH(w) tau − adhoc7 topic set size Error 10 20 30 40 50 60 70 80 90 100 0.020.040.060.080.10 tau − adhoc8 topic set size Error 10 20 30 40 50 60 70 80 90 100 0.020.040.060.080.10 tauAP − adhoc8 topic set size Error 10 20 30 40 50 60 70 80 90 100 0.020.040.060.080.10 tauAP − adhoc7 topic set size Error 10 20 30 40 50 60 70 80 90 100 0.020.040.060.080.10 tauAP − adhoc6 topic set size Error 10 20 30 40 50 60 70 80 90 100 0.020.040.060.080.10 • Split-half es�mators perform very poorly • About 0.035 error with 50 topics • All proposals near the same, but MSQD be�er with small samples Results: Bias of es�mators tau − adhoc6 topic set size Bias 10 20 30 40 50 60 70 80 90 100 ML MSQD RES KD SH(w/o) SH(w) 0.000.040.08 tau − adhoc7 topic set size Bias 10 20 30 40 50 60 70 80 90 100 0.000.040.08 0.000.040.08 tau − adhoc8 topic set size Bias 10 20 30 40 50 60 70 80 90 100 0.000.040.08 tauAP − adhoc6 topic set size Bias 10 20 30 40 50 60 70 80 90 100 0.000.040.08 tauAP − adhoc7 topic set size Bias 10 20 30 40 50 60 70 80 90 100 0.000.040.08 tauAP − adhoc8 topic set size Bias 10 20 30 40 50 60 70 80 90 100 • Split-half es�mators are clearly biased • Correla�ons generally overes�mated • MSQD much be�er with small collec�ons, KD slightly be�er otherwise • We need to know the true scores in order to evaluate the es�mators! • Stochas�c simula�on from a previous collec�on Y: maintains distribu�ons and correla�ons, and prefixes vector of true mean scores E[Xs]=μs :=Ys • From TREC 6, 7 & 8, simulate 3x1000 collec�ons of n=10, 20,...,100 topics • Split-half baselines w/ and w/o replacement: y=a·ebx , 2000 replicates S1 > S2 > S3 > S4 > S5 > S6... Expected Correla�on with the True Ranking bias correction rank of Xi within the sample