A Comparison of Statistical Significance Tests for Information Retrieval Evaluation CIKM´07, November 2007
  • Problems with using MAP: there is noise in the evaluation of retrieval systems. Some topics are easier than others. The people hired to create the relevance judgments for the topics are only human, so they make mistakes. Finally, the choice of document collection also influences the evaluation result. Why significance tests?
  • General Approach: Two actual runs from TREC 3, 5-8 were used. The MAP of each run is as shown in the Excel table (accounting for every topic). Five significance tests were used to measure whether the difference in MAP between System A and System B was statistically significant, that is, whether System A is in fact better than System B. For every significance test, the p-value was calculated according to the test statistic. That value is then compared with the significance level, which states the maximum value a p-value can have for the null hypothesis to be rejected. Finally, the null hypothesis is either rejected or not rejected. Significance Testing: 1. A test statistic or criterion by which to judge the two systems. IR researchers commonly use the difference in mean average precision (MAP) or the difference in the mean of another IR metric. 2. A distribution of the test statistic given a null hypothesis. A typical null hypothesis is that there is no difference between our two systems. 3. A significance level (p-value) that is computed by taking the value of the test statistic for our experimental systems and determining how likely such a value is to have occurred under the null hypothesis.
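The decision step described in this note can be sketched in a few lines. The p-values are the ones reported on the slides for the same pair of runs. The significance level of 0.05 and the names `P_VALUES`, `ALPHA`, and `decide` are assumptions made for illustration, not values or names taken from the slides.

```python
# p-values reported on the slides for the same pair of runs.
# The sign-test slide lists a second value (0.3604), which is also above 0.05.
P_VALUES = {
    "randomization": 0.0138,
    "wilcoxon": 0.0560,
    "sign": 0.3222,
    "bootstrap": 0.0107,
    "t-test": 0.0153,
}

ALPHA = 0.05  # assumed significance level; not stated on the slides


def decide(p_value, alpha=ALPHA):
    """Reject the null hypothesis when the p-value does not exceed alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


decisions = {test: decide(p) for test, p in P_VALUES.items()}
for test, decision in decisions.items():
    print(f"{test:>13}: p = {P_VALUES[test]:.4f} -> {decision}")
```

Under this assumed level, the randomization, bootstrap, and t tests reject the null hypothesis, while the Wilcoxon and sign tests do not.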
  • Null hypothesis = System A and System B have the same distribution. Test statistic = difference in mean average precision (MAP). p-value = (the number of permutations in which MAP(A) - MAP(B) is at most -0.052, plus the number in which it is at least 0.052) divided by the total number of permutations (100,000). Characteristics: distribution-free and does not assume random sampling.
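A minimal sketch of the randomization test in this note, assuming per-topic average precision scores are available as two equal-length lists. The function name, parameters, and fixed seed are illustrative choices. Swapping the system labels within a topic is implemented as flipping the sign of that topic's score difference.

```python
import random


def randomization_test(scores_a, scores_b, n_permutations=100_000, seed=0):
    """Two-sided paired randomization test on the difference in mean scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n  # |MAP(A) - MAP(B)| for the real labelling
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels A and B are exchangeable per topic,
        # which is equivalent to randomly flipping the sign of each difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= observed:
            extreme += 1
    # Fraction of permutations at least as extreme as the observed difference.
    return extreme / n_permutations
```

With 100,000 permutations the estimate is a Monte Carlo approximation, so repeated runs with different seeds will differ slightly in the last digits.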
  • It can be used as an alternative to the paired Student's t-test when the population cannot be assumed to be normally distributed. When N (the number of samples) is larger than 25, however, the distribution of the Wilcoxon test statistic approximates a normal distribution. Null hypothesis = System A and System B have the same distribution. Test statistic = the sum of the signed ranks of the per-topic differences (in practice, the smaller of the positive and negative rank sums). p-value = the probability, under the null hypothesis, of a test statistic at least as extreme as the one observed.
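The Wilcoxon signed-rank test can be sketched with the normal approximation mentioned in this note. This version makes several simplifying assumptions: zero differences are dropped, tied absolute differences share an average rank, and no tie or continuity correction is applied. The function name is illustrative.

```python
import math


def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation."""
    # Drop topics where both systems score the same (a simplifying assumption).
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give it the average rank.
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))
```

For small N the approximation is rough, and an exact table or enumeration of the rank-sum distribution would be the more careful choice.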
  • Null hypothesis = System A and System B have the same distribution. Test statistic = the number of pairs for which System A is better than System B. p-value = the probability, under a binomial distribution in which each system is equally likely to win a pair, of a count at least as extreme as the observed number of pairs in which System A is better, out of the total number of pairs. Tied cases = when there is a tie, that is, when System A has the same score as System B, and given that numerical precision can vary from computer to computer, a "Minimum Difference" can be defined, according to which ties can be broken. IMPORTANT = the p-value decreases substantially (0.0987) when the "Minimum Difference" is increased, because tied cases then turn into successes for System A.
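A minimal sketch of a two-sided sign test with an optional minimum difference. The tie handling here is an assumption: pairs whose difference does not exceed `min_diff` are treated as ties and dropped, which is one common convention and may differ from the one used for the slides. The function and parameter names are illustrative.

```python
from math import comb


def sign_test(scores_a, scores_b, min_diff=0.0):
    """Two-sided sign test; differences of at most min_diff count as ties."""
    pairs = list(zip(scores_a, scores_b))
    wins_a = sum(1 for a, b in pairs if a - b > min_diff)
    wins_b = sum(1 for a, b in pairs if b - a > min_diff)
    n = wins_a + wins_b  # ties are dropped (assumed convention)
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under Binomial(n, 0.5).
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * upper_tail)
```

Changing `min_diff` changes which pairs count as wins, which is why the note reports a very different p-value once a minimum difference is introduced.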
  • Null hypothesis = the scores of System A and System B are random samples from the same distribution (different from the randomization, Wilcoxon, and sign tests). Test statistic = difference in mean average precision (MAP). p-value = fraction of samples in the shifted distribution that have an absolute value as large as or larger than our experiment's difference. Sampling with replacement: sampling schemes may be without replacement ('WOR': no element can be selected more than once in the same sample) or with replacement ('WR': an element may appear multiple times in one sample). Characteristics: distribution-free and assumes random sampling.
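The shifted bootstrap described in this note can be sketched as follows, assuming paired per-topic scores. Topics are resampled with replacement, and the resulting distribution of mean differences is shifted so that it is centred on zero, as the null hypothesis requires. The function name, parameters, and seed are illustrative.

```python
import random


def bootstrap_shift_test(scores_a, scores_b, n_samples=100_000, seed=0):
    """Two-sided paired bootstrap test using the shift method."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n  # MAP(A) - MAP(B) on the real topics
    extreme = 0
    for _ in range(n_samples):
        # Sampling with replacement: a topic may appear several times.
        resample = rng.choices(diffs, k=n)
        # Shift by the observed mean so the distribution is centred on zero.
        shifted = sum(resample) / n - observed
        if abs(shifted) >= abs(observed):
            extreme += 1
    return extreme / n_samples
```

Unlike the randomization sketch, this procedure treats the topics as a random sample from a larger population, which is the assumption the discussion slide questions.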
  • Null hypothesis = the scores of System A and System B are random samples from the same normal distribution. Test statistic = difference in mean average precision (MAP), scaled by its estimated standard error to give the t statistic. p-value = probability, under Student's t distribution, of a t statistic with an absolute value as large as or larger than the one observed in our experiment. Characteristics: assumes a normal distribution and random sampling. IMPORTANT: it only works with populations that follow a normal distribution, so it may not be suitable for every null hypothesis. Example?
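A minimal sketch of the paired t-test using only the standard library. To stay self-contained, the two-sided p-value is obtained by numerically integrating the Student's t density with Simpson's rule. This is an illustrative choice and not necessarily how the slides computed it; in practice a library routine such as `scipy.stats.ttest_rel` would normally be used. The function name and the `steps` parameter are assumptions.

```python
import math


def paired_t_test(scores_a, scores_b, steps=10_000):
    """Return (t statistic, two-sided p-value) for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        # Degenerate case: all differences are identical.
        if mean_d == 0:
            return 0.0, 1.0
        return math.copysign(math.inf, mean_d), 0.0
    t = mean_d / math.sqrt(var_d / n)
    df = n - 1
    # Log of the normalising constant of the Student's t density.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    # Simpson's rule for the integral of the density over [0, |t|];
    # steps must be even. Accuracy degrades for extremely large |t|.
    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3
    # Mass outside [-|t|, |t|], clamped against rounding error.
    return t, max(0.0, 1.0 - 2.0 * central)
```

With two topics and differences of 0 and 2, the statistic is t = 1 with one degree of freedom, for which the two-sided p-value is 0.5.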
  • In this section we report the amount of agreement among p-values produced by the various significance tests. Table 1 shows the RMSE for each of the tests on a subset of the TREC run pairs. We formed this subset by removing all pairs for which all tests agreed on the p-value. If the tests agree with each other, there is no practical difference among them. The randomization test, bootstrap test, and t-test largely agree with each other. The RMSE between these three tests is approximately 0.01, which is an error of 20% for a p-value of 0.05. The Wilcoxon and sign tests do not agree with any of the other tests. Compared to the randomization test, and thus to the t-test and bootstrap, the Wilcoxon and sign tests will result in both failure to detect significance and false detection of significance. The root mean square error (RMSE) of an estimator is one of many ways to quantify the difference between an estimator and the true value of the quantity being estimated.
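The RMSE between two tests' p-values, and the 20% figure quoted in this note, can be reproduced with a short sketch. The example p-value lists are hypothetical and chosen only so that the RMSE comes out at 0.01. The function name is illustrative.

```python
import math


def rmse(p_values_x, p_values_y):
    """Root mean square error between two tests' p-values over the same run pairs."""
    squared = [(x - y) ** 2 for x, y in zip(p_values_x, p_values_y)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical p-values from two tests on two run pairs.
example = rmse([0.05, 0.10], [0.04, 0.11])

# An RMSE of 0.01 relative to a p-value of 0.05 is an error of about 20%.
relative_error = 0.01 / 0.05
```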
  • Wilcoxon and sign tests: these were appropriate before affordable computation existed, but are inappropriate today. Random sampling versus non-random sampling: an IR researcher may argue that the assumption of random samples from a population is required to draw an inference from the experiment to the larger world. This cannot be the case. IR researchers have long understood that inferences from their experiments must be carefully drawn given the construction of the test setup. Using a significance test based on the assumption of random sampling is not warranted for most IR research.
  • A researcher using the Wilcoxon test or the sign test is likely to spend a lot longer searching for methods that improve retrieval performance than a researcher using the randomization, bootstrap, or t test.
  • Transcript of "Comparison statisticalsignificancetestir"

    1. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation CIKM´07, November 2007
    2. Summary <ul><li>Motivation </li></ul><ul><li>Significance Testing </li></ul><ul><li>General Approach </li></ul><ul><li>Significance Tests </li></ul><ul><ul><li>Randomization test, Wilcoxon test, Sign test, Bootstrap test, Student’s t test </li></ul></ul><ul><li>Results </li></ul><ul><li>Discussion </li></ul><ul><li>Conclusions </li></ul>
    3. Motivation <ul><li>Goal => Promote retrieval methods that truly are better rather than methods that by chance perform better given a set of topics, judgments, and documents used in the evaluation. </li></ul><ul><li>Given two information retrieval (IR) systems, how can we determine which one is better than the other? </li></ul><ul><ul><li>Common approaches like TREC use the difference in Mean Average Precision (MAP). Problems? How can they be solved? Use significance tests! </li></ul></ul><ul><li>What significance test should IR researchers use? </li></ul><ul><ul><li>Student’s paired t-test? Wilcoxon signed-rank test? Sign test? Bootstrap? Fisher’s randomization? </li></ul></ul>
    4. Significance Testing <ul><li>Significance Testing </li></ul><ul><ul><li>1. A test statistic or criterion by which to judge the two systems. IR researchers commonly use the difference in mean average precision (MAP) or the difference in the mean of another IR metric. </li></ul></ul><ul><ul><li>2. A distribution of the test statistic given a null hypothesis. A typical null hypothesis is that there is no difference between our two systems. </li></ul></ul><ul><ul><li>3. A significance level (p-value) that is computed by taking the value of the test statistic for our experimental systems and determining how likely such a value is to have occurred under the null hypothesis. </li></ul></ul>
    5. General Approach
    6. Randomization test p-value = 0.0138
    7. Wilcoxon Test p-value = 0.0560
    8. Sign Test p-value = 0.3222 p-value = 0.3604
    9. Bootstrap Test p-value = 0.0107
    10. Student’s Paired t-test p-value = 0.0153
    11. Results
    12. Discussion <ul><li>Sign and Wilcoxon tests: </li></ul><ul><ul><li>These tests should not be used because they test criteria that do not match the criteria of interest. </li></ul></ul><ul><li>Randomization and Bootstrap tests: </li></ul><ul><ul><li>These tests can use whatever criterion we specify, while the other tests are fixed in their test statistics. </li></ul></ul><ul><li>Bootstrap test and Student’s t test: </li></ul><ul><ul><li>Both assume that the scores from the two IR systems are random samples from a single population. However, test topics are not random samples from the population of topics but are hand-selected to meet various criteria. </li></ul></ul><ul><li>Student’s t test: </li></ul><ul><ul><li>This test can only be used for the difference between means and not for the median or other test statistics. </li></ul></ul><ul><ul><li>At smaller sample sizes, violations of normality may result in errors in the t-test. </li></ul></ul>
    13. Conclusion <ul><li>The Randomization test is the recommended test to use to compare two IR systems. </li></ul><ul><li>The Wilcoxon Signed-Rank test and Sign test should no longer be used in this context. </li></ul><ul><li>The Randomization test, Bootstrap shift method test, and Student’s t test all produced comparable significance values => there is no practical difference between them! </li></ul><ul><li>The Wilcoxon Signed-Rank test and Sign test both produced very different p-values => they can incorrectly predict significance and can fail to detect significant results. </li></ul>