Probability and basic statistics with R


Quantitative Data Analysis -
Part III: Probability and basic statistics-
Master in Global Environmental Change -
IE University


  1. Quantitative Data Analysis: Probability and basic statistics
  2. Probability: the most familiar way of thinking about probability is within a framework of repeatable random experiments. In this view the probability of an event is defined as the limiting proportion of times the event would occur given many repetitions.
  3. Probability: instead of exclusively relying on knowledge of the proportion of times an event occurs in repeated sampling, this approach allows the incorporation of subjective knowledge, so-called prior probabilities, that are then updated. The common name for this approach is Bayesian statistics.
  4. The Fundamental Rules of Probability. Rule 1: probabilities are never negative (0 ≤ P(A) ≤ 1). Rule 2: for a given sample space, the sum of the probabilities is 1. Rule 3: for disjoint (mutually exclusive) events, P(A ∪ B) = P(A) + P(B).
  5. Counting: permutations (order is important) and combinations (order is not important).
  6. Probability functions: the factorial function is factorial(n), which equals gamma(n+1); combinations can be calculated with choose(n, k).
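     A minimal sketch of these counting functions at the R prompt (the numbers are arbitrary examples, not taken from the slides):

         factorial(5)                    # 5! = 120
         gamma(6)                        # gamma(n + 1) equals n!, so this is also 120
         choose(49, 6)                   # combinations: 6 items chosen from 49, order ignored
         factorial(49) / factorial(43)   # permutations: 6 items from 49 when order matters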
  7. Simple statistics:
     mean(x): arithmetic average of the values in x
     median(x): median value in x
     var(x): sample variance of x
     cor(x, y): correlation between vectors x and y
     quantile(x): vector containing the minimum, lower quartile, median, upper quartile, and maximum of x
     rowMeans(x): row means of dataframe or matrix x
     colMeans(x): column means
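     A short sketch exercising these helpers on made-up vectors (all values are illustrative only):

         x <- c(2, 4, 4, 5, 7, 9)
         y <- c(1, 3, 5, 6, 8, 10)
         mean(x); median(x); var(x)   # centre and spread of x
         cor(x, y)                    # correlation between x and y
         quantile(x)                  # min, lower quartile, median, upper quartile, max
         m <- cbind(x, y)             # small matrix to illustrate row and column means
         rowMeans(m); colMeans(m)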
  8. Cumulative probability function: the cumulative probability function gives, for any value of x, the probability of obtaining a sample value that is less than or equal to x. curve(pnorm(x), -3, 3)
  9. Probability density function: the probability density is the slope of this curve (its 'derivative'). curve(dnorm(x), -3, 3)
  10. Continuous Probability Distributions
  11. Continuous Probability Distributions: R has a wide range of built-in probability distributions, for each of which four functions are available: the probability density function (which has a d prefix); the cumulative probability (p); the quantiles of the distribution (q); and random numbers generated from the distribution (r).
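     A small sketch of the four prefixes, using the standard normal as the example distribution:

         dnorm(0)       # density at x = 0
         pnorm(1.96)    # cumulative probability P(X <= 1.96), about 0.975
         qnorm(0.975)   # quantile: the x whose cumulative probability is 0.975, about 1.96
         rnorm(5)       # five random draws from the distribution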
  12. Normal distribution:
      par(mfrow = c(2, 2))
      x <- seq(-3, 3, 0.01)
      y <- exp(-abs(x));   plot(x, y, type = "l")
      y <- exp(-abs(x)^2); plot(x, y, type = "l")
      y <- exp(-abs(x)^3); plot(x, y, type = "l")
      y <- exp(-abs(x)^8); plot(x, y, type = "l")
  13. Normal distribution norm.R
  14. Exercise: suppose we have measured the heights of 100 people. The mean height was 170 cm and the standard deviation was 8 cm. We can ask three sorts of questions about data like these: what is the probability that a randomly selected individual will be shorter than a particular height? taller than a particular height? between one specified height and another?
  15. Exercise normal.R
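     normal.R itself is not reproduced in the transcript; a sketch of how the three questions could be answered with pnorm, where 160 cm, 185 cm and the 165-180 cm range are invented example heights:

         pnorm(160, mean = 170, sd = 8)            # P(shorter than 160 cm)
         1 - pnorm(185, mean = 170, sd = 8)        # P(taller than 185 cm)
         pnorm(180, 170, 8) - pnorm(165, 170, 8)   # P(between 165 and 180 cm)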
  16. The central limit theorem: if you take repeated samples from a population with finite variance and calculate their averages, then the averages will be normally distributed.
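     A quick simulation sketch of the theorem; the exponential parent distribution and the sample size of 30 are arbitrary choices for illustration:

         # draw 10,000 samples of size 30 from a strongly skewed distribution
         # and look at the distribution of their means
         means <- replicate(10000, mean(rexp(30, rate = 1)))
         hist(means, breaks = 50)       # roughly bell-shaped despite the skewed parent
         qqnorm(means); qqline(means)   # close to a straight line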
  17. Checking normality fishes.R
  18. Checking normality
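     fishes.R is not reproduced in the transcript; a generic sketch of the usual normality checks on a stand-in vector:

         y <- rnorm(100, mean = 4, sd = 1)   # stand-in data; fishes.R presumably reads real measurements
         hist(y)                             # look for a roughly symmetric, bell-shaped histogram
         qqnorm(y); qqline(y)                # points should follow the line if y is normal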
  19. The gamma distribution: the gamma distribution is useful for describing a wide range of processes where the data are positively skewed (i.e. non-normal, with a long tail on the right).
  20. The gamma distribution (gammas.R):
      x <- seq(0.01, 4, 0.01)
      par(mfrow = c(2, 2))
      y <- dgamma(x, 0.5, 0.5); plot(x, y, type = "l")
      y <- dgamma(x, 0.8, 0.8); plot(x, y, type = "l")
      y <- dgamma(x, 2, 2);     plot(x, y, type = "l")
      y <- dgamma(x, 10, 10);   plot(x, y, type = "l")
  21. The gamma distribution: α is the shape parameter and β is the scale parameter. Special cases of the gamma distribution are the exponential (α = 1) and chi-squared (α = ν/2, β = 2). The mean of the distribution is αβ, the variance is αβ², the skewness is 2/√α and the kurtosis is 6/α.
  22. The gamma distribution gammas.R
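     A sketch checking the stated moments by simulation; note that rgamma interprets its second argument as a rate, so the scale parameter must be named explicitly (the values of α and β here are arbitrary):

         a <- 2; b <- 3                              # shape alpha and scale beta
         y <- rgamma(100000, shape = a, scale = b)
         mean(y); a * b                              # sample mean vs alpha * beta
         var(y);  a * b^2                            # sample variance vs alpha * beta^2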
  23. Exercise
  24. Exercise fishes2.R
  25. The exponential distribution
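     The slide carries no text; as a minimal illustration, the exponential density follows the same d-prefix convention and coincides with the gamma density at shape 1:

         x <- seq(0, 5, 0.01)
         plot(x, dexp(x, rate = 1), type = "l")             # exponential density
         lines(x, dgamma(x, shape = 1, rate = 1), lty = 2)  # identical gamma(1, 1) curve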
  26. Quantitative Data Analysis: Hypothesis testing
  27. Cumulative probability function: the cumulative probability function gives, for any value of x, the probability of obtaining a sample value that is less than or equal to x. curve(pnorm(x), -3, 3)
  28. Probability density function: the probability density is the slope of this curve (its 'derivative'). curve(dnorm(x), -3, 3)
  29. Exercise: suppose we have measured the heights of 100 people. The mean height was 170 cm and the standard deviation was 8 cm. We can ask three sorts of questions about data like these: what is the probability that a randomly selected individual will be shorter than a particular height? taller than a particular height? between one specified height and another?
  30. Exercise normal.R
  31. Why test? Statistics is an experimental science, not really a branch of mathematics. It is a tool that can tell you whether data are accidentally or really similar. It does not give you certainty.
  32. Steps in hypothesis testing: 1. Set the null hypothesis and the alternative hypothesis. 2. Calculate the p-value. 3. Decision rule: if the p-value is less than 5% then reject the null hypothesis; otherwise the null hypothesis remains valid. In any case, you must give the p-value as a justification for your decision.
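     A sketch of the three steps using a one-sample t.test on invented data, with a hypothesised mean of 20 chosen purely for illustration:

         y <- c(18.2, 21.5, 19.8, 22.1, 20.4, 19.1)   # invented sample
         out <- t.test(y, mu = 20)                    # step 1: H0 is "true mean = 20"
         out$p.value                                  # step 2: the p-value
         out$p.value < 0.05                           # step 3: reject H0 only if TRUE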
  33. Types of errors: a Type I error occurs when we reject a true null hypothesis (i.e. reject H0 when it is TRUE). A Type II error occurs when we do not reject a false null hypothesis (i.e. do NOT reject H0 when it is FALSE).
  34. Critical regions and power: the table shows schematically the relation between the relevant probabilities under the null and alternative hypotheses. If the null hypothesis is true, the probability of not rejecting is 1 − α and the probability of rejecting is α (Type I error). If the null hypothesis is false, the probability of not rejecting is β (Type II error) and the probability of rejecting is 1 − β (the power).
  35. Significance: it is common in hypothesis testing to set the probability of a Type I error to some value called the significance level. These levels are usually set to 0.1, 0.05 or 0.01. If the null hypothesis is true and the probability of observing the value of the current test statistic is lower than the significance level, then the hypothesis is rejected. Sometimes, instead of setting a pre-defined significance level, the p-value is reported; it is also called the observed significance level.
  36. Significance level: when we reject the null hypothesis there is a risk of drawing a wrong conclusion. The risk of drawing a wrong conclusion (called the p-value or observed significance level) can be calculated. The researcher decides the maximum risk (called the significance level) he or she is ready to take. The usual significance level is 5%.
  37. P-value: we start from the basic assumption that the null hypothesis is true. The p-value is the probability of getting a value equal to or more extreme than the sample result, given that the null hypothesis is true. Decision rule: if the p-value is less than 5% then reject the null hypothesis; if the p-value is 5% or more then the null hypothesis remains valid. In any case, you must give the p-value as a justification for your decision.
  38. Interpreting the p-value: below 0.01, overwhelming evidence (highly significant); 0.01 to 0.05, strong evidence (significant); 0.05 to 0.10, weak evidence (not significant); above 0.10, no evidence (not significant).
  39. Power analysis: the power of a test is the probability of rejecting the null hypothesis when it is false. It has to do with Type II errors: β is the probability of accepting the null hypothesis when it is false. In an ideal world, we would obviously make β as small as possible. But the smaller we make the probability of committing a Type II error, the greater we make the probability of committing a Type I error, rejecting the null hypothesis when, in fact, it is correct. Most statisticians work with α = 0.05 and β = 0.2, so the power of a test is defined as 1 − β = 0.8.
  40. Confidence: a confidence interval with a particular confidence level is intended to give the assurance that, if the statistical model is correct, then, taken over all the data that might have been obtained, the procedure for constructing the interval would deliver a confidence interval that included the true value of the parameter the proportion of the time set by the confidence level.
  41. Don't complicate things: use the classical tests:
      var.test to compare two variances (Fisher's F)
      t.test to compare two means (Student's t)
      wilcox.test to compare two means with non-normal errors (Wilcoxon's rank test)
      prop.test (binomial test) to compare two proportions
      cor.test (Pearson's or Spearman's rank correlation) to correlate two variables
      chisq.test (chi-square test) or fisher.test (Fisher's exact test) to test for independence in contingency tables
  42. Comparing two variances: before comparing means, verify that the variances are not significantly different: var.test(set1, set2). This performs Fisher's F test. If the variances are significantly different, you can transform the output (y) variable to equalise the variances, or you can still use the t.test (Welch's modified test).
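     A hedged sketch of the variance comparison; set1 and set2 are invented samples, not data from the course:

         set1 <- rnorm(20, mean = 10, sd = 2)
         set2 <- rnorm(20, mean = 10, sd = 4)
         var.test(set1, set2)   # Fisher's F test; a small p-value suggests unequal variances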
  43. Comparing two means: Student's t-test (t.test) assumes the samples are independent, the variances constant, and the errors normally distributed. It will use the Welch-Satterthwaite approximation (default, less power) if the variances are different. This test can also be used for paired data. The Wilcoxon rank-sum test (wilcox.test) is used for independent samples whose errors are not normally distributed. If you do a transform to get constant variance, you will probably have to use this test.
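     A sketch of both tests on invented samples (a and b are made-up vectors):

         a <- rnorm(15, mean = 10, sd = 2)
         b <- rnorm(15, mean = 12, sd = 2)
         t.test(a, b)        # Welch test by default; add var.equal = TRUE for classic Student's t
         wilcox.test(a, b)   # rank-based alternative when the errors are not normal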
  44. Student's t: the test statistic is the number of standard errors by which the two sample means are separated: t = (mean(A) − mean(B)) / SE_diff, where SE_diff = sqrt(s_A²/n_A + s_B²/n_B).
  45. Power analysis: so how many replicates do we need in each of two samples to detect a difference of 10% with power = 80%, when the mean is 20 (i.e. delta = 2) and the standard deviation is about 3.5? power.t.test(delta = 2, sd = 3.5, power = 0.8). You can work out what size of difference your sample of 30 would allow you to detect by specifying n and omitting delta: power.t.test(n = 30, sd = 3.5, power = 0.8)
  46. Paired observations: the measurements will not be independent. Use t.test with paired = TRUE; you are then doing a single-sample test of the differences against 0. When you can do a paired t.test, you should always do the paired test: it is more powerful, and it deals with blocking, spatial correlation, and temporal correlation.
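     A sketch of the paired test on invented before/after measurements:

         before <- c(12.1, 14.3, 11.8, 13.5, 12.9)
         after  <- c(12.8, 15.1, 12.0, 14.2, 13.6)
         t.test(before, after, paired = TRUE)   # equivalent to t.test(before - after, mu = 0)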
  47. Sign test: used when you can't measure a difference but can see it. Use the binomial test (binom.test) for this. Binomial tests can also be used to compare proportions: prop.test.
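     A sketch of the sign test as a binomial test; the counts (9 changes in the 'up' direction out of 12 pairs) are invented:

         binom.test(9, 12, p = 0.5)   # 9 "successes" out of 12 comparisons under H0: p = 0.5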
  48. Chi-squared contingency tables: the contingencies are all the events that could possibly happen. A contingency table shows the counts of how many times each of the contingencies actually happened in a particular sample.
  49. Chi-square contingency tables deal with count data. Suppose there are two characteristics (hair colour and eye colour); the null hypothesis is that they are uncorrelated. Create a matrix that contains the counts and apply chisq.test(matrix). This will give you a p-value for the matrix values given the assumption of independence.
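     A sketch with an invented 2 x 2 hair-colour by eye-colour table of counts:

         counts <- matrix(c(38, 14, 11, 51), nrow = 2,
                          dimnames = list(hair = c("fair", "dark"),
                                          eyes = c("blue", "brown")))
         chisq.test(counts)   # a small p-value suggests the two traits are not independent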
  50. Fisher's exact test: used for the analysis of contingency tables when one or more of the expected frequencies is less than 5. Use fisher.test(x).
  51. Compare two proportions: it turns out that 196 men were promoted out of 3270 candidates, compared with 4 promotions out of only 40 candidates for the women. prop.test(c(4, 196), c(40, 3270))
  52. Correlation and covariance: covariance is a measure of how much two variables change together; the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by r) is a measure of the correlation (linear dependence) between two variables.
  53. Correlation and covariance: are two parameters correlated significantly? Create and attach the data.frame, then apply cor(data.frame). To determine the significance of a correlation, apply cor.test to the two variables. You have three options: Kendall's tau (method = "k"), Spearman's rank (method = "s"), or (default) Pearson's product-moment correlation (method = "p").
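     A sketch of this workflow on an invented data frame (attach is skipped here in favour of $ indexing):

         df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                          y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.3))
         cor(df)                              # correlation matrix for all columns
         cor.test(df$x, df$y)                 # Pearson's product-moment correlation (default)
         cor.test(df$x, df$y, method = "s")   # Spearman's rank alternative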
  54. Kolmogorov-Smirnov test: are two sample distributions significantly different? Or does a sample distribution arise from a specific distribution? ks.test(A, B)
