
# Significance tests


### Significance tests

1. Significance Tests in NLP. Presented by Jinho D. Choi, University of Colorado at Boulder, September 15th, 2010.
2. Data Type
   - Continuous data
     - Outputs are drawn from infinitely many possible values (regression).
     - e.g., temperatures, document relevancies.
     - Values are related to one another.
     - Applicable tests: one-sample t-test, paired two-sample t-test.
   - Categorical data
     - Outputs are drawn from a finite set of defined categories (classification).
     - e.g., POS tags, dependency labels.
     - Values are not related to one another.
     - Applicable tests: Wilcoxon's signed-rank test, Fisher's exact test, Pearson's chi-square test, McNemar's test.
3. One sample t-test
   - The true mean is known, and a normal distribution is assumed.
   - Null hypothesis: the difference between the true mean and our mean is zero.
   - Example: average ITA score = 84.31% (true mean).

     | be | say | get | know | see | our mean |
     |---|---|---|---|---|---|
     | 90.88% | 89.75% | 84.11% | 87.57% | 88.19% | 90.25% |

   - Calculate the t-score.
   - Use the t-score to find the p-value in the distribution table.
   - Degrees of freedom: the minimum number of values needed to determine all the data points (n − 1 for a one-sample test).
   - p ≤ 0.01 → the difference is statistically significant with over 99% confidence.
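As a rough illustration of the t-score step, the sketch below uses the standard one-sample statistic, t = (sample mean − true mean) / (s / √n), with n − 1 degrees of freedom. It assumes, purely for illustration, that the five per-word scores form the sample and that 84.31 is the true mean. The slide's "our mean" (90.25%) is not the arithmetic mean of those five scores, so this sketch need not reproduce the slide's own calculation. The function name `one_sample_t` is hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, true_mean):
    """Return (t-score, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    std_err = stdev(sample) / math.sqrt(n)
    t = (mean(sample) - true_mean) / std_err
    return t, n - 1

# Assumption: the per-word scores from the example are treated as the sample.
scores = [90.88, 89.75, 84.11, 87.57, 88.19]
t, df = one_sample_t(scores, 84.31)
print(f"t = {t:.2f}, df = {df}")
```

The resulting t-score and degrees of freedom are what one would look up in a t-distribution table to obtain the p-value.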
4. Paired two sample t-test
   - Each sample is tested by two players, or by one player twice.
   - Null hypothesis: the mean difference between two normally distributed populations is zero.
   - Example

     |  | EBC | EBN | SIN | XIN | WEB | WSJ | Mean |
     |---|---|---|---|---|---|---|---|
     | LTH | 83.36 | 86.32 | 86.80 | 85.50 | 85.53 | 87.15 | 85.88 |
     | Clear | 84.06 | 86.77 | 86.55 | 85.41 | 85.70 | 87.58 | 86.09 |

   - Calculate the t-score.
   - Find the p-value.
   - p = 0.1701 → the difference is not statistically significant.
   - Note: NLP data is often not normally distributed.
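The paired calculation can be sketched with the standard library alone. The code below computes t from the per-corpus differences (Clear − LTH) and, instead of a table lookup, obtains the two-sided p-value by numerically integrating the Student t density with Simpson's rule. The helper names and the integration approach are assumptions for illustration, not the method used on the slide.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)

def paired_t_test(first, second):
    """Return (t, df, two-sided p) for paired samples."""
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1, two_sided_p(t, n - 1)

# Per-corpus scores from the example (EBC, EBN, SIN, XIN, WEB, WSJ).
lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
t, df, p = paired_t_test(lth, clear)
print(f"t = {t:.3f}, df = {df}, p = {p:.4f}")
```

On these six pairs the p-value comes out close to the 0.1701 reported in the example.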
5. Wilcoxon signed-rank test
   - Non-parametric test: no distribution is assumed.
   - Null hypothesis: the median difference between pairs of observations is zero.
   - Example

     |  | EBC | EBN | SIN | XIN | WEB | WSJ |
     |---|---|---|---|---|---|---|
     | LTH | 83.36 | 86.32 | 86.80 | 85.50 | 85.53 | 87.15 |
     | Clear | 84.06 | 86.77 | 86.55 | 85.41 | 85.70 | 87.58 |
     | Clear − LTH | 0.7 | 0.45 | −0.25 | −0.09 | 0.17 | 0.43 |
     | Signed rank | 6 | 5 | −3 | −1 | 2 | 4 |

   - W+ = 2 + 4 + 5 + 6 = 17, W− = |−1| + |−3| = 4
   - Use min(W+, W−) to find the p-value.
   - p ≤ 0.2188 → the difference is not statistically significant.
   - cf. paired two sample t-test: p = 0.1701.
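The ranking and the W+ / W− sums above can be reproduced in a few lines. This is a minimal sketch that assumes no zero differences and no tied absolute differences (true for this example). The exact two-sided p-value is found by enumerating all 2^n sign assignments, which is only practical for small n. The function name is hypothetical.

```python
from itertools import product

def wilcoxon_signed_rank(first, second):
    """Return (W+, W-, exact two-sided p). Assumes no ties and no zero differences."""
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    # Rank the differences by absolute value (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis every rank is positive or negative with equal
    # probability; count the assignments whose positive-rank sum is <= w.
    count = 0
    for signs in product((0, 1), repeat=n):
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) <= w:
            count += 1
    p = min(1.0, 2 * count / 2 ** n)
    return w_plus, w_minus, p

lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
w_plus, w_minus, p = wilcoxon_signed_rank(lth, clear)
print(w_plus, w_minus, round(p, 4))  # 17 4 0.2188
```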
6. Fisher's exact test
   - Compares binary outputs produced by two methods.
   - The significance of the deviation can be calculated exactly.
   - Null hypothesis: the output difference between the two methods is zero.

     |  | Method 1 | Method 2 | Total |
     |---|---|---|---|
     | Class 1 | a | b | a+b |
     | Class 2 | c | d | c+d |
     | Total | a+c | b+d | n |

   - Example

     |  | Clear | LTH | Total |
     |---|---|---|---|
     | Correct | 142,731 | 142,375 | 285,106 |
     | Incorrect | 23,055 | 23,411 | 46,466 |
     | Total | 165,786 | 165,786 | 331,572 |

   - Really?
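For a 2×2 table with fixed margins, the probability of a given table is hypergeometric: C(a+b, a) · C(c+d, c) / C(n, a+c). A two-sided exact p-value can then be sketched by summing the probabilities of all tables that are no more likely than the observed one. The code below works in log space with `math.lgamma` so that counts as large as those in the example do not overflow. The function names and the "no more likely than observed" two-sided convention are assumptions for illustration; the slide does not say which variant it used.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_prob(x):
        # Hypergeometric log-probability of x in the top-left cell, margins fixed.
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_choose(n, col1)

    observed = log_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p = 0.0
    for x in range(lowest, highest + 1):
        lp = log_prob(x)
        # Small tolerance so equally likely tables are not lost to rounding.
        if lp <= observed + 1e-7:
            p += math.exp(lp)
    return min(1.0, p)

# Small sanity check: [[3, 1], [1, 3]] gives 34/70.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
# Table from the example (Clear vs. LTH, correct vs. incorrect).
p_example = fisher_exact_two_sided(142731, 142375, 23055, 23411)
print(p_small, p_example)
```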
7. Pearson's chi-square test
   - Each observation is independent of the others.
   - The chi-square distribution is assumed.
   - Null hypothesis: the difference between the observed frequency distribution and the true distribution is zero.
   - Example

     |  | Clear (observed) | LTH (true) | X² |
     |---|---|---|---|
     | Correct | 142,731 | 142,375 | 0.89 |
     | Incorrect | 23,055 | 23,411 | 5.41 |
     | Total | 165,786 | 165,786 | 6.3 |

   - Calculate the X²-score.
   - Use the X²-score to find the p-value.
   - p = 0.0121 → the difference is statistically significant with 98.79% confidence.
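The X² column in the example is the standard per-cell term (observed − expected)² / expected, summed over cells. The sketch below reproduces it and, assuming one degree of freedom (two categories), converts the score to a p-value with the identity p = erfc(√(X²/2)), which holds only for one degree of freedom. The function names are hypothetical.

```python
import math

def pearson_chi_square(observed, expected):
    """Return (per-cell terms, total X^2) for matching observed/expected counts."""
    cells = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return cells, sum(cells)

def chi2_p_value_df1(chi2):
    """Upper-tail p-value of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Correct / incorrect counts from the example: Clear is observed, LTH is "true".
clear = [142731, 23055]
lth = [142375, 23411]
cells, chi2 = pearson_chi_square(clear, lth)
p = chi2_p_value_df1(chi2)
print([round(c, 2) for c in cells], round(chi2, 2), round(p, 4))
```

The per-cell terms round to 0.89 and 5.41, matching the table, and the p-value lands close to the reported 0.0121.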
8. McNemar's test
   - Applied to 2×2 contingency tables with binary outputs.
   - Non-parametric test: no distribution is assumed.
   - Null hypothesis: p(b) = p(c)

     |  | Method 1: + | Method 1: − |
     |---|---|---|
     | Method 2: + | a | b |
     | Method 2: − | c | d |

   - Example

     |  | Clear: + | Clear: − | Total |
     |---|---|---|---|
     | LTH: + | 138,402 | 3,973 | 142,375 |
     | LTH: − | 4,329 | 19,082 | 23,411 |
     | Total | 142,731 | 23,055 | 165,786 |

   - Calculate the X²-score.
   - Use the X²-score to find the p-value.
   - p < 0.0001 → the difference is statistically significant with over 99.99% confidence.
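McNemar's statistic depends only on the discordant cells b and c, the cases where exactly one of the two systems is correct. A minimal sketch, assuming the uncorrected form X² = (b − c)² / (b + c) with one degree of freedom (the slide does not say whether a continuity correction was applied), is shown below. The function name is hypothetical.

```python
import math

def mcnemar(b, c):
    """Return (X^2, p) for McNemar's test without continuity correction."""
    chi2 = (b - c) ** 2 / (b + c)
    # Upper-tail chi-square probability for 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Discordant counts from the example: b = 3,973 and c = 4,329.
chi2, p = mcnemar(3973, 4329)
print(f"X2 = {chi2:.2f}, p = {p:.6f}")
```

With these counts the p-value falls below 0.0001, in line with the example.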