Stat2013


  1. 1. Sucheta Tripathy, IICB, November – December 2013
  2. 2. Chi square test. Sucheta Tripathy, Biostatistics course work, IICB, Nov 2013
  3. 3. Definitions • Model or Hypothesis • Null Hypothesis: there is no significant difference between the two (it is either TRUE or FALSE) • Goodness of fit
  4. 4. What you need  A probability value  The degrees of freedom  A contingency table. These are used to determine whether the deviation is due to chance, and to accept or reject the null hypothesis at the chosen significance level (e.g. 10%).
  5. 5. Chi Square Test
  6. 6. Example 1  Mendelian law of dominance: cross Aa X Aa, where A -> Tall (dominant) and a -> Dwarf (recessive); Aa is ……. The offspring genotypes are AA, Aa, Aa, aa, so a 3:1 Tall:Dwarf ratio is expected. Observed: 639 Tall and 281 Dwarf. Chi square requires that you have numeric values (counts). Chi square should not be calculated if an expected value is less than 5. A worked calculation is sketched below.
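A minimal sketch of the chi-square goodness-of-fit calculation for this example, assuming the expected 3:1 ratio under the null hypothesis (scipy is used purely for illustration; compare the resulting p-value with the chosen significance level):

```python
from scipy.stats import chisquare

observed = [639, 281]                          # Tall, Dwarf counts from the slide
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]      # 3:1 ratio under the null hypothesis

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.4g}")
```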
  7. 7. Choosing a Test  First check whether there is a hypothesis to test.  If yes, then decide which test to use.  If no, then there is NO statistical test for that. What is available: parametric tests assume the data come from a standard probability distribution, whereas non-parametric tests can be used for both normally and non-normally distributed data. Question: then why not use them always? Parametric tests make a lot of assumptions, and if the assumptions are correct, the results are more accurate.
  9. 9. Example 9:3:3:1
  10. 10. Example  Number of sixes observed: 0, 1, 2, 3; number of rolls with that count: 48, 35, 15, 3. Binomial probabilities, P(k out of n) = [n! / (k!(n-k)!)] p^k (1-p)^(n-k):
     p1 = P(roll 0 sixes) = P(X=0) = 0.58
     p2 = P(roll 1 six)   = P(X=1) = 0.345
     p3 = P(roll 2 sixes) = P(X=2) = 0.07
     p4 = P(roll 3 sixes) = P(X=3) = 0.005
     http://www.mathsisfun.com/data/binomial-distribution.html
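The quoted probabilities match a binomial distribution with n = 3 rolls per trial and p = 1/6 for a six, which appears to be the intended model; a small sketch reproducing them:

```python
from scipy.stats import binom

n, p = 3, 1 / 6                    # assumed: 3 rolls per trial, fair die
for k in range(4):
    print(f"P(X = {k} sixes) = {binom.pmf(k, n, p):.3f}")
# prints roughly 0.579, 0.347, 0.069, 0.005 -- matching the slide's values
```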
  11. 11. Two samples: compare the mean value of some variable of interest.
     Parametric: t-test for independent samples.
     Nonparametric: Wald-Wolfowitz runs test, Mann-Whitney U test, Kolmogorov-Smirnov two-sample test.
  12. 12. Compare two variables measured in the same sample.
     Parametric: t-test for dependent samples. Nonparametric: sign test, Wilcoxon's matched pairs test.
     If more than two variables are measured in the same sample.
     Parametric: repeated measures ANOVA. Nonparametric: Friedman's two-way analysis of variance, Cochran Q.
  13. 13. Null Hypothesis  Coined by the English geneticist Ronald Fisher in 1935.  At a given probability it can either be true or false. Comparing populations/datasets, Population A and Population B: null hypothesis is true -> no significant difference between the populations; null hypothesis is false -> significant difference between the populations. There are formulas to calculate a statistic for a population comparison, and look-up tables with critical values. If the calculated value exceeds the look-up value -> the null hypothesis is false (rejected), else it is true (retained).
  14. 14. Null Hypothesis Testing  You suspect that whenever it rains your experiment fails?!!! NULL hypothesis is true: no significant association between your experiment failing and it raining -> rain has nothing to do with your experiment failing. NULL hypothesis is false: there is significant damage to your experiments when it rains -> rain ruins your experiment!! Record when your experiment fails and check whether it rained at that time. It may be that this happens by chance, or it may be that there indeed is a relationship.
  15. 15. Lab study vs statistics research  http://www.youtube.com/watch?feature=player_embe dded&v=PbODigCZqL8
  16. 16. T-test  The t-statistic was introduced in 1908 by William Sealy Gosset  Used in a normally distributed population http://www.socialresearchmethods.net/kb/stat_t.php
  17. 17. T-test  Standard deviation of the differences D: s_D = sqrt( (Σ D² - (Σ D)²/n) / (n - 1) )
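Reading the slide's formula as the standard deviation of paired differences D (as used in a paired t-test, t = D̄ / (s_D/√n)), here is a hypothetical sketch that computes it by hand and cross-checks against scipy.stats.ttest_rel; the before/after scores are made up for illustration:

```python
import numpy as np
from scipy.stats import ttest_rel

before = np.array([12, 15, 11, 14, 13, 16])    # hypothetical paired scores
after  = np.array([14, 17, 12, 15, 16, 18])
D = after - before
n = len(D)

s_D = np.sqrt((np.sum(D**2) - np.sum(D)**2 / n) / (n - 1))   # the slide's formula
t_manual = D.mean() / (s_D / np.sqrt(n))

t_scipy, p = ttest_rel(after, before)
print(t_manual, t_scipy, p)                    # the two t values should agree
```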
  18. 18. Why Standardize??
  19. 19. ANOVA: F statistic  Analysis of variance: one-way and two-way. The total variation is split into between-groups variance and within-groups variance, and the between-group variation is weighed against the within-group variation.
  20. 20. So How big is F? Since F is Mean Square Between / Mean Square Within = MSG / MSE A large value of F indicates relatively more difference between groups than within groups (evidence against H0) To get the P-value, we compare to F(I-1,n-I)-distribution • I-1 degrees of freedom in numerator (# groups -1) • n - I degrees of freedom in denominator (rest of df)
  21. 21. Connections between SST, MST, and standard deviation  If we ignore the groups for a moment and just compute the standard deviation of the entire data set, we see s² = Σ(x_ij - x̄)² / (n - 1) = SST / DFT = MST. So SST = (n - 1)s², and MST = s². That is, SST and MST measure the TOTAL variation in the data set. SST: total sum of squares; MST: total mean square; DFT: total degrees of freedom.
  22. 22. Connections between SSE, MSE, and standard deviation  Remember: s_i² = Σ(x_ij - x̄_i)² / (n_i - 1) = SS[Within Group i] / df_i. So SS[Within Group i] = s_i² (df_i). This means that we can compute SSE from the standard deviations and sizes (df) of each group: SSE = Σ SS[Within Group i] = Σ s_i²(n_i - 1) = Σ s_i²(df_i).
  23. 23. Computing the ANOVA F statistic  Data in three groups:
     Group 1: 5.3, 6.0, 6.7 (group mean 6.00)
     Group 2: 5.5, 6.2, 6.4, 5.7 (group mean 5.95)
     Group 3: 7.5, 7.2, 7.9 (group mean 7.53)
     Overall mean: 6.44.
     WITHIN: the squared differences (data - group mean) total 1.757; divided by df (7) this gives MSE ≈ 0.251.
     BETWEEN: the squared differences (group mean - overall mean) total 5.106; divided by df (2) this gives MSG ≈ 2.553.
     F = MSG / MSE ≈ 10.2 (about 10.22 with unrounded intermediate values)
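The same F statistic can be obtained directly with scipy's one-way ANOVA; a minimal sketch using the data above:

```python
from scipy.stats import f_oneway

group1 = [5.3, 6.0, 6.7]
group2 = [5.5, 6.2, 6.4, 5.7]
group3 = [7.5, 7.2, 7.9]

F, p = f_oneway(group1, group2, group3)
print(f"F = {F:.2f}, p = {p:.4f}")     # F should come out around 10.2
```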
  24. 24. Validation  The larger the F value, the more between-group variation there is relative to within-group variation -> reject the null hypothesis when F exceeds the critical value.
  25. 25. For the three groups below, compute the mean and (X - x̄)² for each group, then MST (between) and MSE (within), and the degrees of freedom df1 and df2:
     Group A: 62, 81, 75, 58, 67, 48, 26, 36, 45
     Group B: 72, 49, 63, 68, 39, 79, 40, 76
     Group C: 42, 52, 31, 80, 22, 71, 68
  26. 26. In Summary
     SST = Σ_obs (x_ij - x̄)² = s²(DFT)
     SSE = Σ_obs (x_ij - x̄_i)² = Σ_groups s_i²(df_i)
     SSG = Σ_obs (x̄_i - x̄)² = Σ_groups n_i (x̄_i - x̄)²
     SSE + SSG = SST;  MS = SS / DF;  F = MSG / MSE
  27. 27. R² Statistic  R² gives the percent of variance due to between-group variation: R² = SS[Between] / SS[Total] = SSG / SST. We will see R² again when we study regression.
  28. 28. Where's the Difference? Once ANOVA indicates that the groups do not all appear to have the same means, what do we do?
     Analysis of Variance for days:
     Source     DF    SS      MS      F      P
     treatment   2    34.74   17.37   6.45   0.006
     Error      22    59.26    2.69
     Total      24    94.00
     Level A: N = 8, mean 7.250, StDev 1.669; Level B: N = 8, mean 8.875, StDev 1.458; Level P: N = 9, mean 10.111, StDev 1.764; pooled StDev = 1.641.
     Individual 95% CIs for the means, based on the pooled StDev, show the clearest difference: P is worse than A (the CIs don't overlap).
  29. 29. Multiple Comparisons Once ANOVA indicates that the groups do not all have the same means, we can compare them two by two using the 2-sample t test • We need to adjust our p-value threshold because we are doing multiple tests with the same data. •There are several methods for doing this. • If we really just want to test the difference between one pair of treatments, we should set the study up that way.
  30. 30. Tukey's Pairwise Comparisons  Family error rate = 0.0500, individual error rate = 0.0199, 95% confidence. Use alpha = 0.0199 for each test; critical value = 3.55. Intervals for (column level mean) - (row level mean), which give 98.01% CIs for each pairwise difference:
     A - B: (-3.685, 0.435)
     A - P: (-4.863, -0.859)
     B - P: (-3.238, 0.766)
     The 98% CI for A - P is (-4.86, -0.86). Only P vs A is significant (both interval limits have the same sign).
  31. 31. Tukey's Method in R  Tukey multiple comparisons of means, 95% family-wise confidence level:
            diff      lwr        upr
     B-A    1.6250   -0.43650    3.6865
     P-A    2.8611    0.85769    4.8645
     P-B    1.2361   -0.76731    3.2395
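A comparable computation can be done in Python with statsmodels' pairwise_tukeyhsd; since the deck only reports group summaries, the raw values below are hypothetical placeholders used just to show the call:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical raw data for groups A, B, P (the deck gives only summary statistics).
days   = np.array([6, 7, 9, 5, 8, 7, 8, 8,
                   9, 8, 10, 7, 9, 11, 8, 9,
                   10, 9, 12, 8, 11, 10, 11, 9, 11])
groups = np.array(['A'] * 8 + ['B'] * 8 + ['P'] * 9)

result = pairwise_tukeyhsd(endog=days, groups=groups, alpha=0.05)
print(result)   # table of pairwise mean differences with family-wise adjusted CIs
```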
  32. 32. Independent sample t-test  Number of words recalled; df = (n1 - 1) + (n2 - 1) = 18. t = (x̄1 - x̄2) / s_(x̄1 - x̄2) = (19 - 26) / 1 = -7. Since |t| = 7 > t(0.05, 18) = 2.101, reject H0.
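A quick check of the critical value and the corresponding p-value for the slide's numbers, using scipy's t distribution:

```python
from scipy.stats import t

df = 18
t_stat = (19 - 26) / 1            # means 19 and 26, standard error of the difference = 1
t_crit = t.ppf(0.975, df)         # two-tailed critical value at alpha = 0.05
p_two_sided = 2 * t.sf(abs(t_stat), df)
print(t_crit, p_two_sided)        # t_crit is about 2.101; |t| = 7 far exceeds it
```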
  33. 33. T test  One sample t test  Unpaired and paired t test  Paired: the same set of subjects measured over a period of time  Unpaired: independent sets of subjects. http://www.youtube.com/watch?v=JlfLnx8sh-o One tailed and two tailed t-test: One tailed: the average height of class A is greater than that of class B. Two tailed: the average height of class A is different from that of class B.
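To illustrate the one-tailed vs two-tailed distinction in code, a hypothetical sketch with made-up class heights; the alternative keyword of scipy.stats.ttest_ind (available in recent SciPy versions) selects the tail:

```python
import numpy as np
from scipy.stats import ttest_ind

class_a = np.array([152, 158, 161, 155, 160, 157])   # hypothetical heights (cm)
class_b = np.array([149, 151, 154, 150, 153, 152])

t_two, p_two = ttest_ind(class_a, class_b)                           # H1: means differ
t_one, p_one = ttest_ind(class_a, class_b, alternative='greater')    # H1: A > B
print(p_two, p_one)
```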
  34. 34. Z-test statistics  Used when the sample size is large and the population variance is known. If the sample size is small and the population variance is unknown, go for a t-test.
  35. 35. Calculation of the z value  Z = (X̄ - µ) / sqrt(variance/n). Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 points and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96. We can ask whether this mean score is significantly lower than the regional mean; that is, are the students in this school comparable to a simple random sample of 55 students from the region as a whole, or are their scores surprisingly low? We begin by calculating the standard error of the mean (the calculation is sketched below).
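Completing the calculation that the slide sets up, a small sketch with scipy: the standard error, the z value, and the one-sided probability of a sample mean this low:

```python
import math
from scipy.stats import norm

mu, sigma, n, x_bar = 100, 12, 55, 96

sem = sigma / math.sqrt(n)      # standard error of the mean, about 1.62
z = (x_bar - mu) / sem          # about -2.47
p = norm.cdf(z)                 # one-sided P(Z <= z), about 0.007
print(sem, z, p)
```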
  36. 36. F-tests / Analysis of Variance (ANOVA)  t = (obtained difference between sample means) / (difference expected by chance (error)). F = (variance (differences) between sample means) / (variance (differences) expected by chance (error)). The difference between sample means is easy for 2 samples (e.g. X̄1 = 20, X̄2 = 30, difference = 10), but if X̄3 = 35 the concept of differences between sample means gets tricky.
  37. 37. F-tests / Analysis of Variance (ANOVA)  Simple ANOVA example: total variability is split into between-treatments variance and within-treatments variance. Between-treatments variance measures differences due to (1) treatment effects and (2) chance; within-treatments variance measures differences due to (1) chance alone.
  38. 38. F-tests / Analysis of Variance (ANOVA)  F = MS_between / MS_within. When the treatment has no effect, differences between groups/treatments are entirely due to chance; numerator and denominator will be similar, and the F-ratio should have a value around 1.00. When the treatment does have an effect, the between-treatment differences (numerator) should be larger than chance (denominator), and the F-ratio should be noticeably larger than 1.00.
  39. 39. F-tests / Analysis of Variance (ANOVA)  Simple independent-samples ANOVA example: F(3, 8) = 9.00, p < 0.05.
             Placebo   Drug A   Drug B   Drug C
     Mean    1.0       1.0      4.0      6.0
     SD      1.73      1.0      1.0      1.73
     n       3         3        3        3
     There is a difference somewhere; we have to use post-hoc tests (essentially t-tests corrected for multiple comparisons) to examine further.
  40. 40. F Test Anova  http://www.youtube.com/watch?v=-yQb_ZJnFXw
  41. 41. Non-parametric tests  Non-parametric tests are basically used to overcome the underlying assumption of normality in parametric tests. Only quite general assumptions regarding the population are used in these tests. Read more: Mann-Whitney U-test / Mann-Whitney-Wilcoxon. IT DOES NOT ASSUME THE VARIANCES TO BE EQUAL!!
  42. 42. Mann-Whitney-Wilcoxon (MWW) or Wilcoxon Rank-Sum Test  Proposed by the German Gustav Deuchler in 1914 (with a missing term in the variance) and later independently by Frank Wilcoxon in 1945. This test is based on the idea that the particular pattern exhibited when 'm' X random variables and 'n' Y random variables are arranged together in increasing order of magnitude provides information about the relationship between their parent populations. Assumptions: the two samples are random and are independent of each other; the observations are numeric or ordinal (can be arranged in ranks). It is a test of comparison of medians.
  43. 43. When to use this?
  44. 44. When to use this? Test of Normality: Simple Histogram method Normal Probability plot
  45. 45. How to construct a normal probability plot  Order the data and assign ranks i = 1..N, compute the plotting position (i - 0.5)/N, convert it to a theoretical Z value, and plot the theoretical value (Y) against the observed value.
     Data (rank i = 1..11): 20, 15, 26, 32, 18, 28, 35, 14, 26, 22, 17.
     Mean: 38.8, SD = 11.4.
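A minimal sketch of the same construction in Python, using the data on the slide; the plotting positions are computed by hand and then scipy.stats.probplot is used to draw the plot (matplotlib is assumed as the plotting backend):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

data = np.array([20, 15, 26, 32, 18, 28, 35, 14, 26, 22, 17])
n = len(data)

# Manual construction: plotting positions (i - 0.5)/N for the sorted data,
# converted to theoretical z values.
positions = (np.arange(1, n + 1) - 0.5) / n
z_theoretical = stats.norm.ppf(positions)
print(np.column_stack([np.sort(data), z_theoretical]))

# Or let scipy do the work and plot observed vs theoretical quantiles:
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```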
  46. 46. Ranking the values  Group A (N1 = 5): 5, 6, 7, 8, 9. Group B (N2 = 6): 2, 1, 5, 7, 3, 4. Total number of comparisons = 5 X 6 = 30. How to score each comparison: less = 0, tie = 0.5, more = 1. Scoring A's values against B gives 4.5, 5, 5.5, 6, 6 (total 27); scoring B's values against A gives 0, 0, 0.5, 2.5, 0, 0 (total 3).
  47. 47. Step by step  Rank the values  Add the ranks  Select the larger of the two rank totals  Identify N1, N2, and Nx and Tx (Nx = number of people in the group with the larger rank total, Tx = that larger rank total)  Calculate U: U = N1 * N2 + Nx * (Nx + 1)/2 - Tx
  48. 48. The smaller the calculated U value, the stronger the evidence against the null; if U is less than or equal to the critical (table) value -> reject the null hypothesis.
  49. 49. Calculating the U value  For a small dataset: U can be obtained by directly counting comparisons, as on the ranking slide.  For a larger dataset, use the rank totals R1 and R2: U1 = R1 - n1(n1+1)/2, U2 = R2 - n2(n2+1)/2
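A short sketch applying these formulas to the groups from the ranking slide; the rank-sum route reproduces the counting result (U values 27 and 3), and scipy's built-in test is shown as a cross-check (which U it reports, and how the p-value is computed, can depend on the SciPy version):

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

A = np.array([5, 6, 7, 8, 9])        # group A from the ranking slide
B = np.array([2, 1, 5, 7, 3, 4])     # group B

ranks = rankdata(np.concatenate([A, B]))    # ties receive average ranks
R1 = ranks[:len(A)].sum()
U1 = R1 - len(A) * (len(A) + 1) / 2
U2 = len(A) * len(B) - U1
print(U1, U2)                        # 27.0 and 3.0, matching the counting method

U_stat, p = mannwhitneyu(A, B, alternative='two-sided')
print(U_stat, p)
```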
  50. 50. Kruskal-Wallis test (H test)  Non-parametric test  Equivalent of ANOVA (the F test) among parametric tests  Does not require the distributions to be normal  The samples need to be independent  Used more often when the distributions are unequal  Data are ordinal. Assumes the distributions have the same shape: if one distribution is skewed to the left and the other to the right (unequal variance), this test will give inaccurate results.
  51. 51. Kruskal-Wallis test: data
     Group A: 27, 2, 4, 18, 7, 9
     Group B: 20, 8, 14, 36, 21, 22
     Group C: 34, 31, 3, 23, 30, 6
  52. 52. Kruskal-Wallis test  Define the null and alternative hypotheses  State the probability (significance level)  Calculate the degrees of freedom  Find the critical value  Calculate the test statistic  State the result. H0, accept the NULL hypothesis: there is no difference between the samples. H1, reject the NULL hypothesis: there is a difference between the samples. If the test statistic > critical value, reject the null hypothesis.
  53. 53. Kruskal-Wallis test: worked example  Rank all 18 values together (rank 1 = smallest) and total the ranks within each group:
     Group A ranks: 14, 1, 3, 9, 5, 7 -> R1 = 39 (n = 6)
     Group B ranks: 10, 6, 8, 18, 11, 12 -> R2 = 65 (n = 6)
     Group C ranks: 17, 16, 2, 13, 15, 4 -> R3 = 67 (n = 6)
     H = [12 / (N(N+1))] x Σ(Ri²/ni) - 3(N+1) = 12/(18 x 19) x (39²/6 + 65²/6 + 67²/6) - 3(18+1) = 2.854
     Critical value = 5.99 (reject the NULL hypothesis if H exceeds it); here H = 2.854 < 5.99, so the NULL hypothesis is not rejected.
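The same H statistic can be obtained with scipy's Kruskal-Wallis implementation; a minimal sketch with the data from the previous slides:

```python
from scipy.stats import kruskal

group_a = [27, 2, 4, 18, 7, 9]
group_b = [20, 8, 14, 36, 21, 22]
group_c = [34, 31, 3, 23, 30, 6]

H, p = kruskal(group_a, group_b, group_c)
print(f"H = {H:.3f}, p = {p:.3f}")   # H is about 2.854; p is well above 0.05
```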
  54. 54. Kolmogorov-Smirnov test (KS)  Non-parametric  Usable when the distribution is unknown  One-sample and two-sample versions  One-sample: checks the goodness of fit to a reference distribution  Two-sample: compares two distributions. Goodness of fit: a hypothesis (e.g. Mendel's law of dominance). NULL hypothesis H0: F(x) = F*(x) for all x. H1: F(x) ≠ F*(x) for at least one value of x.
  55. 55. K-S test  The K-S statistic Dn is defined as Dn = max | Fn(x) - F(x) |, where Dn is known as the K-S distance, n is the total number of data points, F(x) is the distribution function of the fitted distribution, Fn(x) = i/n, and i is the cumulative rank of the data point.
  56. 56. Kolmogorov-Smirnov test (KS): example data
                          Group 1   Group 2
     Not confident        20        4
     Slightly confident   30        27
     Somewhat confident   13        28
     Confident            20        18
     Very confident       41        47
     Steps: 1. Take the totals. 2. Find the relative frequencies. 3. Calculate the cumulative frequencies. 4. Find the differences. 5. Get the largest difference. 6. Find the critical value (1.36/sqrt(sample size)). 7. Test the goodness of fit; e.g. if our D > critical D (the distributions are unequal) -> reject the NULL hypothesis.
  57. 57. Worked two-sample KS calculation (totals: 124 in each group)
                          Group 1   Freq     Cumulative   Group 2   Freq     Cumulative   D
     Not confident        20        0.1613   0.1613       4         0.0323   0.0323       0.129
     Slightly confident   30        0.2419   0.4032       27        0.2177   0.2500       0.153
     Somewhat confident   13        0.1048   0.5081       28        0.2258   0.4758       0.032
     Confident            20        0.1613   0.6694       18        0.1452   0.6210       0.048
     Very confident       41        0.3306   1.0000       47        0.3790   1.0000       0.000
     Largest D = 0.153. Critical D = 1.36 x sqrt((n1 + n2)/(n1 x n2)) = 1.36 x sqrt(248/(124 x 124)) ≈ 0.173. Since D = 0.153 < 0.173, the NULL hypothesis is not rejected.
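A small numpy sketch reproducing the table's arithmetic from the binned counts (scipy's ks_2samp expects raw observations, so the cumulative-frequency comparison is done directly here):

```python
import numpy as np

group1 = np.array([20, 30, 13, 20, 41])
group2 = np.array([4, 27, 28, 18, 47])

cum1 = np.cumsum(group1) / group1.sum()
cum2 = np.cumsum(group2) / group2.sum()

D = np.max(np.abs(cum1 - cum2))                    # about 0.153
n1, n2 = group1.sum(), group2.sum()
D_crit = 1.36 * np.sqrt((n1 + n2) / (n1 * n2))     # about 0.173 at alpha = 0.05
print(D, D_crit, "reject H0" if D > D_crit else "do not reject H0")
```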
  58. 58. Kolmogorov-Smirnov test (KS)  The same confidence data as on the previous slide, now compared against an expected 1:2:3:2:1 distribution (the one-sample, goodness-of-fit form of the test), following the same steps: take the totals, find the frequencies, calculate the cumulative frequencies, find the differences, get the largest difference, find the critical value (1.36/sqrt(sample size)), and test the goodness of fit; if our D > critical D -> reject the NULL hypothesis.
  59. 59. Methods of Estimation  Method of moments  Maximum likelihood  Bayesian estimators  Markov chain Monte Carlo… Why? Because the population is too large to measure in full, we estimate parameters and test hypotheses on a set of samples.
  60. 60. Probability density function (pdf) -> for continuous variables. Probability mass function (pmf) -> for discrete variables. Parameter space: the set of all possible parameter values, indexing a family of pdfs/pmfs. An estimator T is unbiased if the sample estimate equals, in expectation, the population parameter.
  61. 61. Probability density function
  62. 62. Estimation Methods  Data gets 2 or multi dimensional…..
  63. 63. Method of maximum likelihood  The maximum likelihood estimates of a distribution type are the values of its parameters that produce the maximum joint probability density or mass for the observed data X given the chosen probability model. Maximum likelihood is more general, can be applied on any probability distribution.
  64. 64. The MLE  The best parameters are obtained by maximizing the probability of the observed samples.  Has good convergence properties as the sample size increases: the estimated value approaches the real value for large N.  Applications are many: from speech recognition to natural language processing to computational biology.
  65. 65. Simple MLE: coin tossing  Toss a coin: Head or Tail. Flip the coin 10 times (n): H, H, H, T, H, T, T, T, H, T => 1, 1, 1, 0, 1, 0, 0, 0, 1, 0. An appropriate model for getting a head in a single flip is the Bernoulli distribution, P(Xi = xi) = P^xi (1 - P)^(1 - xi); for example, if P = 0.6, then P(Xi = 1) = 0.6 and P(Xi = 0) = 0.4.
  66. 66. The maximum likelihood  Example: we want to estimate the probability, p, that individuals are infected with a certain kind of parasite.
     Individuals 1-10, infected (1) or not (0): 1, 0, 1, 1, 0, 1, 1, 0, 0, 1; the probability of each observation is p if infected and 1 - p if not.
     The maximum likelihood method (discrete distribution): 1. Write down the probability of each observation by using the model parameters. 2. Write down the probability of all the data: Pr(Data | p) = p^6 (1 - p)^4. 3. Find the value of the parameter(s) that maximizes this probability.
  67. 67. The maximum likelihood (continued)  Find the value of the parameter that maximizes Pr(Data | p) = p^6 (1 - p)^4. [Plot of the likelihood function L(p) against p over 0 to 1; the curve peaks near p = 0.6, at a likelihood of about 0.0012.]
  68. 68. Brute Force…
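In the spirit of the slide's title, a brute-force grid search over p for the parasite example's likelihood Pr(Data | p) = p^6 (1 - p)^4; the maximum lands at the sample proportion:

```python
import numpy as np

p_grid = np.linspace(0, 1, 1001)
likelihood = p_grid**6 * (1 - p_grid)**4    # Pr(Data | p) for 6 infected out of 10

p_hat = p_grid[np.argmax(likelihood)]
print(p_hat, likelihood.max())              # about 0.6 and 0.0012
```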
  69. 69. Likelihood Function
     L(P | X1 … Xn) = Π_{i=1..n} f(xi | P)
                    = P^x1 (1-P)^(1-x1) · P^x2 (1-P)^(1-x2) · … · P^xn (1-P)^(1-xn)
                    = P^(x1+x2+…+xn) (1-P)^(n - (x1+x2+…+xn))
                    = P^(Σ xi) (1-P)^(n - Σ xi)
  70. 70. Analytically, the maximum likelihood estimate can also be found by taking the derivative of the log-likelihood with respect to P and finding where the slope is 0. For the Bernoulli model above, d/dP [Σxi log P + (n - Σxi) log(1 - P)] = Σxi/P - (n - Σxi)/(1 - P) = 0 gives P̂ = Σxi/n, the sample proportion. http://www.ics.uci.edu/~smyth/courses/cs274/papers/MLtutorial.pdf
  71. 71. Recap…  Identify the population/distribution type and set up the equation.  Write the log-likelihood function.  Differentiate.  Set the derivative to 0.  Solve the equation to estimate the parameter.
  72. 72. Method of moments  Oldest method  Distribution dependent  Geometric  Poisson  Bernoulli…  Depends upon the pdf/pmf
  73. 73. Method of Moments  Population moments can be estimated by sample moments.  Can be robust.  The sample mean can estimate the population mean, and the sample variance can estimate the population variance.  Does not work well when the distribution is exponential.
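As a minimal illustration of matching population moments to sample moments, a hypothetical sketch: the Poisson rate is estimated by the sample mean, and a normal distribution's parameters by the sample mean and variance (the simulated data and parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson: the first population moment E[X] equals lambda,
# so the method-of-moments estimate is the sample mean.
poisson_sample = rng.poisson(lam=3.5, size=1000)
lambda_hat = poisson_sample.mean()

# Normal: match the first two moments -> mu_hat = sample mean, sigma2_hat = sample variance.
normal_sample = rng.normal(loc=10, scale=2, size=1000)
mu_hat, sigma2_hat = normal_sample.mean(), normal_sample.var()

print(lambda_hat, mu_hat, sigma2_hat)   # close to 3.5, 10, and 4 for samples this size
```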
