Significance tests


  1. Significance Tests in NLP
     Presented by Jinho D. Choi, University of Colorado at Boulder, September 15th, 2010
  2. Data Type
     • Continuous data
       • Outputs are drawn from infinitely many possible values (regression).
       • e.g., temperatures, document relevancies.
       • Values are related to one another.
       • One sample t-test, paired two sample t-test.
     • Categorical data
       • Outputs are drawn from a finite set of defined categories (classification).
       • e.g., POS tags, dependency labels.
       • Values are not related to one another.
       • Wilcoxon's signed-rank test, Fisher's exact test, Pearson's chi-square test, McNemar's test.
  3. One sample t-test
     • One sample t-test
       • The true mean is known, and a normal distribution is assumed.
       • Null hypothesis: the difference between the true mean and our mean is zero.
     • Example
       • Average ITA score = 84.31% (true mean)
       • Words: be, say, get, know, see
       • Our mean: 90.88%, 89.75%, 84.11%, 87.57%, 88.19%, 90.25%
       • Calculate the t-score: t = (x̄ − μ) / (s / √n), where x̄ is our mean, μ is the true mean, s is the sample standard deviation, and n is the number of values.
       • Use the t-score to find the p-value in the distribution table.
       • Degrees of freedom: the minimal number of values needed to determine all the data points (n − 1 here).
       • p ≤ 0.01 → the difference is statistically significant with over 99% confidence.
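The t-score step above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the presenter's code: it assumes all six percentages listed on the slide form the sample, and the names `one_sample_t` and `CRITICAL_T` are hypothetical. The critical value 4.032 is the usual two-tailed table entry for α = 0.01 with 5 degrees of freedom.

```python
import math
from statistics import mean, stdev

TRUE_MEAN = 84.31  # average ITA score from the slide
# Assumption: all six values on the slide are the sample being tested.
scores = [90.88, 89.75, 84.11, 87.57, 88.19, 90.25]


def one_sample_t(sample, mu):
    """Return (t-score, degrees of freedom) for a one sample t-test."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    return (mean(sample) - mu) / standard_error, n - 1


t_score, df = one_sample_t(scores, TRUE_MEAN)

# Two-tailed critical value for alpha = 0.01 and df = 5 (standard t table).
CRITICAL_T = 4.032
significant = abs(t_score) > CRITICAL_T

print(f"t = {t_score:.3f}, df = {df}, significant at 0.01: {significant}")
```

Comparing against a table value mirrors the slide's "find the p-value in the distribution table" step; a statistics library would report the p-value directly.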
  4. Paired two sample t-test
     • Paired two sample t-test
       • Each sample is tested by two players, or by one player twice.
       • Null hypothesis: the mean difference between two normally distributed populations is zero.
     • Example

                EBC    EBN    SIN    XIN    WEB    WSJ    Mean
       LTH      83.36  86.32  86.80  85.50  85.53  87.15  85.88
       Clear    84.06  86.77  86.55  85.41  85.70  87.58  86.09

       • Calculate the t-score: t = d̄ / (s_d / √n), where d̄ is the mean of the paired differences, s_d is their standard deviation, and n is the number of pairs.
       • Find the p-value.
       • p = 0.1701 → the difference is not statistically significant.
     • NLP data is often not normally distributed.
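As a rough check of the paired example, the sketch below computes the t-score from the six per-corpus differences and then approximates the two-sided p-value by integrating the Student t density with Simpson's rule. The integration approach and the helper names (`t_pdf`, `two_sided_p`) are assumptions made to keep the example standard-library only; in practice a routine such as `scipy.stats.ttest_rel` would normally be used.

```python
import math
from statistics import mean, stdev

lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=10_000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


diffs = [c - l for c, l in zip(clear, lth)]
n = len(diffs)
t_score = mean(diffs) / (stdev(diffs) / math.sqrt(n))
p_value = two_sided_p(t_score, n - 1)

print(f"t = {t_score:.3f}, df = {n - 1}, p = {p_value:.4f}")
```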
  5. Wilcoxon signed-rank test
     • Wilcoxon signed-rank test
       • Non-parametric test: no distribution is assumed.
       • Null hypothesis: the median difference between pairs of observations is zero.
     • Example

                      EBC    EBN    SIN    XIN    WEB    WSJ
       LTH            83.36  86.32  86.80  85.50  85.53  87.15
       Clear          84.06  86.77  86.55  85.41  85.70  87.58
       Clear − LTH     0.70   0.45  −0.25  −0.09   0.17   0.43
       Signed rank     6      5     −3     −1      2      4

       • W+ = 2 + 4 + 5 + 6 = 17, W− = |−1| + |−3| = 4
       • Use min(W+, W−) to find the p-value.
       • p ≤ 0.2188 → the difference is not statistically significant.
       • cf. paired two sample t-test: p = 0.1701.
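The ranking and the W+ / W− sums can be reproduced with a short standard-library sketch. It assumes there are no zero differences and no tied absolute differences (true for this example), and it obtains the two-sided p-value by enumerating all 2^n sign assignments, which is only practical for small n. The variable names are illustrative.

```python
from itertools import product

lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]

# Round to two decimals to avoid floating-point noise in the differences.
diffs = [round(c - l, 2) for c, l in zip(clear, lth)]
n = len(diffs)

# Rank the absolute differences (1 = smallest); assumes no ties or zeros.
order = sorted(range(n), key=lambda i: abs(diffs[i]))
ranks = [0] * n
for rank, i in enumerate(order, start=1):
    ranks[i] = rank

w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
w = min(w_plus, w_minus)

# Exact two-sided p-value: under the null hypothesis every rank is
# positive or negative with equal probability, so count the sign
# assignments whose smaller rank sum is at most the observed one.
rank_total = n * (n + 1) // 2
extreme = 0
for signs in product((0, 1), repeat=n):
    wp = sum(r for r, s in zip(range(1, n + 1), signs) if s)
    if min(wp, rank_total - wp) <= w:
        extreme += 1
p_value = extreme / 2 ** n

print(f"W+ = {w_plus}, W- = {w_minus}, p = {p_value:.4f}")
```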
  6. Fisher's exact test
     • Fisher's exact test
       • Compares binary outputs produced by two methods.
       • The significance of the deviation can be calculated exactly.
       • Null hypothesis: the output difference between the two methods is zero.

                  Method 1  Method 2  Total
       Class 1    a         b         a+b
       Class 2    c         d         c+d
       Total      a+c       b+d       n

     • Example

                  Clear     LTH       Total
       Correct    142,731   142,375   285,106
       Incorrect  23,055    23,411    46,466
       Total      165,786   165,786   331,572

     • Really?
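The slide does not show how the exact probability is computed, so the following is only a sketch of one common formulation: the hypergeometric probability of a table with fixed margins, C(a+b, a) · C(c+d, c) / C(n, a+c), summed over every table that is no more likely than the observed one (a two-sided test). Log-gamma is used so the large counts from the example do not overflow. The function names and the small relative tolerance are assumptions of this sketch.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def table_log_prob(a, row1, row2, col1):
    """Log hypergeometric probability of the table whose top-left cell is a."""
    return (log_comb(row1, a) + log_comb(row2, col1 - a)
            - log_comb(row1 + row2, col1))


def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    observed = table_log_prob(a, row1, row2, col1)
    lowest = max(0, col1 - row2)   # smallest feasible top-left cell
    highest = min(row1, col1)      # largest feasible top-left cell
    p = 0.0
    for x in range(lowest, highest + 1):
        log_p = table_log_prob(x, row1, row2, col1)
        if log_p <= observed + 1e-7:  # tolerance for floating-point ties
            p += math.exp(log_p)
    return min(1.0, p)


# Small sanity check with a 2x2 table whose exact answer is 34/70.
p_small = fisher_exact_two_sided(3, 1, 1, 3)

# The example from the slide: Clear vs. LTH, correct vs. incorrect.
p_slide = fisher_exact_two_sided(142_731, 142_375, 23_055, 23_411)

print(f"small table p = {p_small:.4f}")
print(f"slide table p = {p_slide:.4f}")
```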
  7. Pearson's chi-square test
     • Pearson's chi-square test
       • Each observation is independent of the others.
       • The chi-square distribution is assumed.
       • Null hypothesis: the difference between the observed frequency distribution and the true distribution is zero.
     • Example

                  Clear (observed)  LTH (true)  χ²
       Correct    142,731           142,375     0.89
       Incorrect  23,055            23,411      5.41
       Total      165,786           165,786     6.30

       • Calculate the χ² score: χ² = Σ (O − E)² / E, where O is an observed count and E is the corresponding true count.
       • Use the χ² score to find the p-value.
       • p = 0.0121 → the difference is statistically significant with 98.79% confidence.
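The χ² column and the p-value can be reproduced with the standard library alone. The sketch assumes one degree of freedom (two categories with a fixed total), for which the chi-square tail probability reduces to `erfc(sqrt(χ² / 2))`; the function names are illustrative, not taken from the slides.

```python
import math

observed = {"Correct": 142_731, "Incorrect": 23_055}   # Clear
expected = {"Correct": 142_375, "Incorrect": 23_411}   # LTH, treated as true


def pearson_chi_square(obs, exp):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((obs[k] - exp[k]) ** 2 / exp[k] for k in obs)


def chi_square_p_df1(chi2):
    """Tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2 = pearson_chi_square(observed, expected)
p_value = chi_square_p_df1(chi2)

print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```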
  8. McNemar's test
     • McNemar's test
       • Applied to 2×2 contingency tables with binary outputs.
       • Non-parametric test: no distribution is assumed.
       • Null hypothesis: p(b) = p(c)

                     Method 1: +  Method 1: −
       Method 2: +   a            b
       Method 2: −   c            d

     • Example

                     Clear 1: +  Clear 1: −  Total
       LTH 2: +      138,402     3,973       142,375
       LTH 2: −      4,329       19,082      23,411
       Total         142,731     23,055      165,786

       • Calculate the χ² score and use it to find the p-value.
       • p < 0.0001 → the difference is statistically significant with over 99.99% confidence.
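The slide's formula for the χ² score is not preserved, so the sketch below assumes the common uncorrected McNemar statistic, χ² = (b − c)² / (b + c), computed from the two discordant cells only, with the p-value taken from a chi-square distribution with one degree of freedom. A continuity-corrected variant also exists and would give a slightly smaller score.

```python
import math

# Discordant cells from the example table.
b = 3_973   # LTH correct, Clear incorrect
c = 4_329   # LTH incorrect, Clear correct


def mcnemar_chi_square(b, c):
    """Uncorrected McNemar statistic (assumed form, see the note above)."""
    return (b - c) ** 2 / (b + c)


def chi_square_p_df1(chi2):
    """Tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2 = mcnemar_chi_square(b, c)
p_value = chi_square_p_df1(chi2)

print(f"chi-square = {chi2:.2f}, p = {p_value:.6f}")
```

Only the cases where the two parsers disagree enter the statistic, which is why the result can differ sharply from tests that compare the overall correct and incorrect totals.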
