
Statistics for second language educators

Introduction to statistical concepts (population, sample, sampling, central tendency, spread). Mainly aimed at language teachers in advanced studies programmes (e.g., Master's courses).

Published in: Education

  1. INTRODUCTION TO STATISTICS FOR SECOND LANGUAGE EDUCATORS
     DR ACHILLEAS KOSTOULAS
  2. OBJECTIVES OF THIS SESSION
     • You will learn how to construct a sample.
     • You will learn how to describe your sample using statistical methods.
     • You will learn how to find connections between different phenomena in your data.
  3. OUTLINE OF THIS SESSION
     1. Populations and samples
     2. Different types of data
     3. Univariate analysis
        • Central tendency
        • Spread
     4. Bivariate analysis
        • Cross-tabulations
        • T-Tests
        • Correlations
  4. POPULATIONS & SAMPLES
  5. POPULATION
     • All the people (or events, or things) whose properties or behaviour we are interested in understanding
     • e.g. University students in Austria
     • Symbol for its size: N
     [Diagram: Population, Sampling frame, Sample]
  6. SAMPLING FRAME
     • All the people (or events, or things) that we have access to for our research
     • e.g. Students currently present in this classroom
     [Diagram: Population, Sampling frame, Sample]
  7. SAMPLE
     • All the people who were contacted and agreed to participate in the study
     • Symbol for its size: n
     [Diagram: Population, Sampling frame, Sample]
  8. SAMPLING METHODS / STRATEGIES
     • Simple random sampling
     • Systematic sampling
     • Convenience sampling
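The three strategies named on slide 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame is a made-up class register of 30 students, the function names are hypothetical, and "convenience" is simplified to "the first people on the list".

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Every member of the sampling frame has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def systematic_sample(frame, n, seed=None):
    """Pick a random starting point, then take every k-th member.

    Assumes n is not larger than the sampling frame.
    """
    k = len(frame) // n              # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)         # random start within the first interval
    return frame[start::k][:n]


def convenience_sample(frame, n):
    """Take whoever is easiest to reach (simplified: the first n on the list)."""
    return frame[:n]


# Hypothetical sampling frame: 30 students on a class register
frame = [f"student_{i:02d}" for i in range(1, 31)]

print(simple_random_sample(frame, 6, seed=1))
print(systematic_sample(frame, 6, seed=1))
print(convenience_sample(frame, 6))
```

The `seed` argument is there only so that a draw can be repeated; leave it out for a fresh random sample each time.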
  9. HOW LARGE SHOULD THE SAMPLE BE?
     Depends on:
     • the population size
     • the degree of certainty required
     • the statistical tools we want to use
  10. DIFFERENT TYPES OF DATA
  11. [Diagram labels: Variables, Cases, Values]
  12. LEVELS OF MEASUREMENT
      1. Nominal data
      2. Ordinal data
      3. Scale data
  13. CATEGORICAL / NOMINAL DATA
      • A variable is nominal if values do not have a numerical relation to each other.
      • Examples:
        • Gender (M / F / Other)
        • Place of Birth
  14. ORDINAL DATA
      • Ordinal variables are like categorical ones, but we can rank the values according to order, size, frequency, etc.
      • Examples:
        • Level of education (High School, BA, MA/Mag., Doctorate)
        • Attitudes (strongly disagree, disagree, neutral, agree, strongly agree)
  15. CONTINUOUS / SCALE DATA
      • A variable is continuous if it can take any value within a range, and those values can be meaningfully added, subtracted or averaged.
      • Examples:
        • Age (12, 13, 13:2, 14…)
        • Height (165cm, 167cm, 183cm…)
  16. UNIVARIATE ANALYSIS: CENTRAL TENDENCY
  17. MEASURES OF CENTRAL TENDENCY
      • Mode (the most common value)
      • Median (the middle value when the values are ranked)
      • Mean (the sum of all values divided by the number of cases)
  18. EXAMPLE (RAW DATA)
      Case  Height  Gender  Loves Statistics
      1     167     M       Strongly disagree
      2     178     M       Strongly agree
      3     189     F       Agree
      4     201     F       Agree
      5     182     M       Disagree
      6     175     F       Strongly agree
      7     162     M       Strongly disagree
      8     180     F       Disagree
      9     187     M       Agree
  19. EXAMPLE (PROCESSED)
      Case   Height  Gender  Loves Statistics
      1      167     1       4
      2      178     1       1
      3      189     2       3
      4      201     2       3
      5      182     1       2
      6      175     2       4
      7      162     1       1
      8      180     2       2
      9      187     1       3
      Total  1,621   -       -
  20. CENTRAL TENDENCY: THE MODE
      Case    1  2  3  4  5  6  7  8  9
      Gender  1  1  2  2  1  2  1  2  1

      Gender  N  %
      Male    5  55
      Female  4  44
      Total   9  100*
      *Rounding error

      “The majority of respondents were male (n = 5, 55%).”
      “Respondents were almost evenly split between male (n = 5, 55%) and female (n = 4, 44%).”
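As a rough sketch, the frequency table and the mode can be produced with Python's standard library. The codes are the gender column from the processed example; reading 1 as male and 2 as female is inferred from the raw-data slide, and the variable names are illustrative. The percentages are printed to one decimal place, so they will not look identical to the whole numbers on the slide.

```python
from collections import Counter

# Gender codes for the nine cases (assumed coding: 1 = male, 2 = female)
gender = [1, 1, 2, 2, 1, 2, 1, 2, 1]
labels = {1: "Male", 2: "Female"}

counts = Counter(gender)          # how often each code occurs
n = len(gender)

# Frequency table, most common value first
for code, count in counts.most_common():
    print(f"{labels[code]:<7} n = {count}  {100 * count / n:.1f}%")

# The mode is simply the most frequent value
mode_code, mode_count = counts.most_common(1)[0]
print("Mode:", labels[mode_code])
```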
  21. CENTRAL TENDENCY: THE MEDIAN
      Loves Statistics, in case order:
      Case   1  2  3  4  5  6  7  8  9
      Value  4  1  3  3  2  4  1  2  3

      Ranked from highest to lowest:
      Case   1  6  3  4  9  5  8  2  7
      Value  4  4  3  3  3  2  2  1  1

      I love statistics  N  %
      Strongly agree     2  22
      Agree              3  33
      Disagree           2  22
      Strongly disagree  2  22
      Total              9  100*
      *Rounding error

      “As can be seen in Table 1, attitudes towards statistics were largely positive (Mdn = 3).”
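The ranking step on this slide can be sketched as follows. The values are the coded answers from the processed example; the variable names are illustrative only.

```python
import statistics

# Coded answers to "I love statistics" for the nine cases
answers = [4, 1, 3, 3, 2, 4, 1, 2, 3]

# Step 1: rank the values (highest to lowest, as on the slide)
ranked = sorted(answers, reverse=True)

# Step 2: with nine values, the median is the fifth one in the ranking
middle = ranked[len(ranked) // 2]
print(ranked, middle)

# The standard library gives the same result without manual ranking
print(statistics.median(answers))
```

With an even number of cases there is no single middle value; `statistics.median` then averages the two middle values, which may not be meaningful for ordinal data.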
  22. CENTRAL TENDENCY: THE MEAN
      Case   Height
      1      167
      2      178
      3      189
      4      201
      5      182
      6      175
      7      162
      8      180
      9      187
      Total  1,621

      • 1,621 / 9 = 180.1

      “Respondents were rather tall (M = 180.1).”
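The arithmetic can be checked with a short sketch: the nine heights from the example sum to 1,621, and dividing by the number of cases gives the mean. Variable names are illustrative.

```python
import statistics

# Heights (cm) of the nine cases in the example
heights = [167, 178, 189, 201, 182, 175, 162, 180, 187]

total = sum(heights)              # sum of all values
mean = total / len(heights)       # divided by the number of cases
print(total, round(mean, 1))

# The standard library computes the same value
print(round(statistics.mean(heights), 1))
```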
  23. UNIVARIATE ANALYSIS: SPREAD
  24. COMPARE THESE TWO SCHOOLS
      [Bar chart comparing School A and School B: vertical axis 0 to 8; score bands 40-49, 50-59, 60-69, 70-79, 80-89, 90-100. Based on Muijs 2007]
  25. COMPARE THESE TWO SCHOOLS
      Case    School A  School B
      1       45        60
      2       50        65
      3       55        65
      4       60        70
      5       65        70
      6       70        70
      7       70        70
      8       75        70
      9       80        70
      10      85        75
      11      90        75
      12      95        80
      Median  70        70
      Mean    70        70
  26. MEASURES OF SPREAD
      • Range (the difference between the highest and the lowest value)
      • Interquartile range (the difference between the third and the first quartile, i.e. the range of the middle half of the values once the highest and lowest quarters are set aside)
      • Standard deviation (roughly, how far values typically lie from the mean)
  27. MEASURES OF SPREAD: RANGE
      (Scores for School A and School B as in the table on slide 25.)
      Range (School A): 95 – 45 = 50
      Range (School B): 80 – 60 = 20

      “The test scores in School A ranged from 45 to 95 (M = 70). Scores in School B were more tightly distributed, ranging from 60 to 80 (M = 70).”
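A minimal sketch of the range calculation, using the two schools' scores from slide 25. The function name `value_range` is hypothetical.

```python
# Test scores from the two-schools example
school_a = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]


def value_range(scores):
    """Difference between the highest and the lowest value."""
    return max(scores) - min(scores)


print("School A:", value_range(school_a))
print("School B:", value_range(school_b))
```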
  28. MEASURES OF SPREAD: INTERQUARTILE RANGE
      (Scores for School A and School B as in the table on slide 25.)
      IQR (School A): 82.5 – 57.5 = 25
      IQR (School B): 72.5 – 67.5 = 5

      “Although the average test performance in both schools was similar (M = 70), the test scores in School A were much more widely distributed than those in School B (IQR_A = 25, IQR_B = 5).”
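The quartiles on this slide (57.5 and 82.5 for School A) are what you get by splitting the ranked scores into a lower and an upper half and taking the median of each half. The sketch below assumes that convention; statistical packages use several slightly different quartile rules and may report somewhat different values. The function name `iqr` is hypothetical.

```python
import statistics

school_a = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]


def iqr(scores):
    """Interquartile range using the 'median of each half' convention."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower = ordered[:half]        # lowest half of the ranked values
    upper = ordered[-half:]       # highest half of the ranked values
    q1 = statistics.median(lower) # first quartile
    q3 = statistics.median(upper) # third quartile
    return q3 - q1


print("School A:", iqr(school_a))
print("School B:", iqr(school_b))
```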
  29. MEASURES OF SPREAD: STANDARD DEVIATION
      (Scores for School A and School B as in the table on slide 25.)
      SD (School A): 15.81
      SD (School B): 5.22

      “The test scores in School A were satisfactory (M = 70, SD = 15.81). School B reported similar results, which clustered more tightly around the average (M = 70, SD = 5.22).”
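The slide does not show how the standard deviation is obtained, so here is a sketch under one assumption: the reported values (15.81 and 5.22) correspond to the sample standard deviation, which divides by n − 1 rather than n. The helper name `sample_sd` is illustrative.

```python
import math
import statistics

school_a = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]


def sample_sd(scores):
    """Sample standard deviation, computed step by step."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(x - mean) ** 2 for x in scores]
    variance = sum(squared_deviations) / (len(scores) - 1)   # n - 1
    return math.sqrt(variance)


print(round(sample_sd(school_a), 2), round(sample_sd(school_b), 2))

# statistics.stdev uses the same n - 1 formula
print(round(statistics.stdev(school_a), 2), round(statistics.stdev(school_b), 2))
```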
  30. UNIVARIATE STATISTICS: SUMMARY
                  Central Tendency     Spread
                  Mode  Median  Mean   Range  IQR  SD
      Nominal     ✓
      Ordinal     ✓     ✓              ✓      ✓
      Continuous  ✓     ✓       ✓      ✓      ✓    ✓
  31. POP QUIZ
      • Average and mean are the same thing
        In daily use, the words average and mean are interchangeable. In statistics, the mean is one type of average. The mode and median are also types of average.
      • We must always use the median with ordinal variables
        Technically, we can use both the median and the mode, but the median is the more informative measure. The third option, the mean, cannot be used with ordinal data.
      • We must always use the mean with continuous variables
        It is usually the best option. However, if we have unusual data (with one or two very high or very low values) it may be better to use the median.
      • We can calculate mean values in a Likert scale (1: Strongly Agree; 2: Agree; 3: Disagree; 4: Strongly Disagree)
        Some people do, but you shouldn’t. Likert scales produce ordinal data. You should not use the mean when your data is ordinal.
      • The appropriate spread metric for nominal variables is the IQR
        No. Nominal data cannot be ranked in any sensible way, so they do not have a spread.
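The point about unusual data can be illustrated with a small sketch. It takes the School B scores from slide 25 and adds one hypothetical extreme value (a student who scored 0); that extra score is invented purely for illustration.

```python
import statistics

school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]

# Hypothetical: one additional student who scored 0
with_outlier = school_b + [0]

print("Without outlier:",
      round(statistics.mean(school_b), 1), statistics.median(school_b))
print("With outlier:   ",
      round(statistics.mean(with_outlier), 1), statistics.median(with_outlier))
```

The single extreme score pulls the mean down by several points, while the median does not move. This is why the median can be the safer summary when a few values are far from the rest.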
  32. BIVARIATE ANALYSIS: CROSSTABULATIONS & CHI-SQUARE TESTS
  33. CROSSTABULATIONS
      • We use a cross-tabulation when we want to compare two ordinal or nominal variables
      • Examples:
        • Gender x Favourite colour
        • School type x Attitudes towards mathematics
  34. EXAMPLE CROSSTABULATION
  35. CHI-SQUARE TEST
                   Action Figures  Barbie Dolls
      Male (50)    25              25
      Female (60)  30              30
  36. CHI-SQUARE TEST
                   Action Figures  Barbie Dolls
      Male (50)    48 (25)         2 (25)
      Female (60)  25 (30)         35 (30)
      (Counts in parentheses repeat the values from slide 35.)
  37. CHI-SQUARE
      Gender  AF  BD  Total
      Male    48  2   50
      Female  25  35  60
      Total   73  37  110

      “A statistically significant difference was found in the toy preferences of boys and girls. As can be seen in Table 1, boys were much more likely to prefer action figures, compared to girls (χ² = 36.068, df = 1, p < .001).”
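The reported statistic can be reproduced by hand. The sketch below is one way to do it, not necessarily how the slide's figure was produced: it assumes the usual Pearson chi-square without continuity correction, with expected counts derived from the row and column totals. The p-value shortcut via `math.erfc` is valid only for df = 1.

```python
import math

# Observed counts: rows = Male, Female; columns = Action Figures, Barbie Dolls
observed = [[48, 2],
            [25, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count we would expect in this cell if gender and toy were unrelated
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only, the chi-square tail probability equals erfc(sqrt(chi2 / 2))
p = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 3), df, p < 0.001)
```

Statistical software often prints a very small p-value as 0.000; it is conventionally reported as p < .001.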
  38. BIVARIATE ANALYSIS: T-TESTS
  39. T-TESTS
      • We use a t-test to see if there is any connection between a nominal variable with two values (the independent variable) and a continuous one (the dependent variable)
      • The t-test breaks up your sample into two groups (e.g., boys and girls), examines the mean value of the dependent variable for each group, and then compares them.
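To make the procedure concrete, here is a sketch that applies it to the height and gender columns of the example data on slide 18. Several things are assumptions rather than part of the slides: the choice of this data, the use of the classic independent-samples t-test with pooled variance, and the critical value 2.365 (two-tailed, α = .05, df = 7), which is taken from a standard t table.

```python
import math
import statistics

# Heights (cm) from the example data, split by gender
male = [167, 178, 182, 162, 187]      # cases 1, 2, 5, 7, 9
female = [189, 201, 175, 180]         # cases 3, 4, 6, 8

n1, n2 = len(male), len(female)
m1, m2 = statistics.mean(male), statistics.mean(female)
v1, v2 = statistics.variance(male), statistics.variance(female)

# Pooled variance: the two group variances, weighted by group size
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)

# Standard error of the difference between the two means
se = math.sqrt(pooled * (1 / n1 + 1 / n2))

t = (m1 - m2) / se
df = n1 + n2 - 2

CRITICAL_T = 2.365   # assumption: two-tailed, alpha = .05, df = 7 (t table)

print(round(m1, 1), round(m2, 1), round(t, 2), df)
print("significant" if abs(t) > CRITICAL_T else "not significant")
```

With only nine cases the difference in mean height does not reach significance, even though the two group means are about 11 cm apart.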
  40. T-TESTS
  41. T-TEST
  42. BIVARIATE ANALYSIS: CORRELATIONS
  43. CORRELATIONS
      • We use a correlation (e.g., Spearman's or Pearson's coefficient) to see if there is any connection between two continuous variables (e.g. weight and height).
      • Correlations range from -1 to 1. A value close to 1 or -1 means that the two variables are strongly connected. A value close to 0 means that they are not.
      • We can depict correlations visually with a scatterplot.
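A minimal sketch of Pearson's coefficient, written out by hand so that each step is visible. The heights and weights below are hypothetical numbers invented for illustration; they do not come from the slides. The function name `pearson_r` is likewise illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)


# Hypothetical data: height (cm) and weight (kg) of five people
height = [160, 165, 170, 175, 180]
weight = [55, 60, 68, 70, 80]

r = pearson_r(height, weight)
print(round(r, 2))
```

A result near 1 means that taller people in this made-up sample also tend to be heavier; reversing one of the lists would give a value near -1.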
  44. STRONG POSITIVE CORRELATION
  45. STRONG NEGATIVE CORRELATION
  46. NO CORRELATION
  47. CORRELATION IS NOT CAUSATION!
  48. REALLY, IT DOESN'T
  49. OR DOES IT?
  50. SUMMARY
                  Nominal                       Ordinal                       Continuous
      Nominal     Crosstabs / χ²                Crosstabs / χ²                T-Test (if it has two values)
      Ordinal     Crosstabs / χ²                Crosstabs / χ²                T-Test (if it has two values)
      Continuous  T-Test (if it has two values) T-Test (if it has two values) Correlation
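The summary table can also be read as a simple decision rule. The sketch below encodes it as a lookup; the function name `choose_test` and the returned labels are hypothetical, and the rule covers only the three procedures introduced in this session.

```python
LEVELS = {"nominal", "ordinal", "continuous"}


def choose_test(level_a, level_b):
    """Suggest a bivariate procedure from two levels of measurement."""
    pair = {level_a, level_b}
    if not pair <= LEVELS:
        raise ValueError("levels must be nominal, ordinal or continuous")
    if pair == {"continuous"}:
        return "correlation"
    if "continuous" in pair:
        # One continuous variable plus one nominal or ordinal variable
        return "t-test (if the grouping variable has two values)"
    return "crosstab / chi-square"


print(choose_test("nominal", "nominal"))
print(choose_test("ordinal", "continuous"))
print(choose_test("continuous", "continuous"))
```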
  51. POP QUIZ
      • If I want to test whether there is a connection between music preferences and gender, I must use a cross-tab
        That is correct. Music preferences and gender are both nominal variables. The correct procedure for pairing nominal variables is a crosstab (and chi-square).
      • A p value of 0.045 shows that something is statistically significant.
        That is correct. The usual threshold of statistical significance in educational research is 0.05, and anything lower than that is considered significant.
      • I can prove that something is causing something else using a Pearson's correlation coefficient.
        No, you cannot. Correlation does not imply causation.
