Introduction to Statistics - Part 1



  1. Quantitative Data Analysis: Statistics – Part 1
  2. "... while man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician." (Sherlock Holmes in Arthur Conan Doyle's The Sign of the Four)
  3. Overview
     - General Statistics
     - The Normal Distribution
     - Z-Tests
     - Confidence Intervals
     - T-Tests
  4. ~ THE GOLDEN RULE ~
     Statistics NEVER replace the judgment of the expert.
  5. Approach to Statistical Research
     - Formulate a hypothesis
     - State the predictions of the hypothesis
     - Perform experiments or observations
     - Interpret the experiments or observations
     - Evaluate the results with respect to the hypothesis
     - Refine the hypothesis and start again
     (Basically the same as all other research)
  6. Hypothesis Testing
     - H0: Null Hypothesis, the status quo
     - HA: Alternative Hypothesis, the research question
     The test ends in one of two conclusions: either "We reject H0" (the data does not support H0) or "We fail to reject H0".
  7. Types of Data
     - Continuous: height, age, time
     - Discrete: # of days worked this week, # of leaves on a tree
     - Ordinal: {Good, O.K., Bad}
     - Nominal: {Yes/No}, {Teacher/Chemist/Haberdasher}
  8. Picturing the Data
  9. Time-Series Plots
     - Time-related data
     - e.g. stock prices
  10. Pie Charts
     - Nominal/Ordinal data
     - Only suitable for data that adds up to a whole (100%)
     - Hard to compare values in the chart
  11. Bar Charts
     - Nominal/Ordinal data
     - Easier to compare values than a pie chart
     - Suitable for a wider range of data
  12. Histograms
     - Continuous data
     - Divide the data into ranges (bins)
  13. Dot Plots
     - Nominal/Ordinal data
     - Represents all the data
     - Difficult to read
  14. Scatter Plots
     - Excellent for examining the association between two variables
  15. Box Plots
     - Continuous data (often split by a nominal/ordinal grouping variable)
     - The box spans Q1 to Q3, the first and third quartiles; its width is the interquartile range, IQR = Q3 - Q1
     - Points beyond the whiskers are flagged as outliers
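The box-plot quantities above can be sketched with the standard library. This is a minimal illustration: the data list is made up, and the 1.5 × IQR fences are the usual convention for flagging outliers, not something stated on the slide.

```python
# Quartiles, IQR, and the common 1.5*IQR outlier fences.
import statistics

data = [12, 15, 16, 18, 19, 20, 22, 24, 45]   # hypothetical sample

# statistics.quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                                  # width of the box

# Conventional fences: anything outside them is plotted as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, outliers)
```

Here the value 45 falls above the upper fence and would be drawn as a separate point beyond the whisker.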
  16. John Tukey
     - Born June 16, 1915, in New Bedford, Massachusetts
     - Died July 26, 2000
     - Introduced the box plot in his 1977 book "Exploratory Data Analysis"
     - Also known for the Cooley–Tukey FFT algorithm and jackknife estimation
  17. While working with John von Neumann on early computer designs, Tukey introduced the word "bit" as a contraction of "binary digit". The term was first used in print in an article by Claude Shannon in 1948.
     The term "software", which Paul Niquette claims to have coined in 1953, was first used in print by Tukey in a 1958 article in American Mathematical Monthly, and thus some attribute the term to him.
     (Pictured: John Tukey, Paul Niquette, Claude Shannon, John von Neumann)
  18. Question 1
     In a telephone survey of 68 households, when asked whether they have pets, the responses were:
     - 16: No pets
     - 28: Dogs
     - 32: Cats
     Draw the appropriate graphic to illustrate the results!
  19. Question 1 - Solution
     - Total number surveyed = 68
     - Number with no pets = 16
     - => Total with pets = 68 - 16 = 52
     - But 28 dogs + 32 cats = 60 responses
     - => So some households have both cats and dogs
  20. Question 1 - Solution
     - How many have both? It must be 60 - 52 = 8 households
     - No pets = 16
     - Dogs only = 28 - 8 = 20
     - Cats only = 32 - 8 = 24
     - Both = 8
     - Total = 16 + 20 + 24 + 8 = 68
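The reasoning in this solution is the inclusion–exclusion principle: |Dogs ∩ Cats| = |Dogs| + |Cats| - |Dogs ∪ Cats|. A small sketch of the same arithmetic:

```python
# Question 1 via inclusion-exclusion on the survey counts.
surveyed = 68
no_pets = 16
dogs = 28    # households with at least one dog
cats = 32    # households with at least one cat

with_pets = surveyed - no_pets        # union of dog and cat owners
both = dogs + cats - with_pets        # inclusion-exclusion
dogs_only = dogs - both
cats_only = cats - both

# The four disjoint categories add back up to everyone surveyed
print(no_pets, dogs_only, cats_only, both)
```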
  21. Question 1 - Solution
     - Graphic: Pie Chart or Bar Chart
  23. Pitfalls of Surveys
  24. The Literary Digest Poll
     - 1936 US Presidential Election
     - Alf Landon (R) vs. Franklin D. Roosevelt (D)
  25. The Literary Digest Poll
     - The Literary Digest had been conducting successful presidential election polls since 1916
     - It had correctly predicted the outcomes of the 1916, 1920, 1924, 1928, and 1932 elections
     - The polls were a lucrative venture for the magazine: readers liked them, newspapers played them up, and each "ballot" included a subscription blank
  26. The Literary Digest Poll
     - In 1936 they sent out 10 million ballots to two groups of people:
       - prospective subscribers, "who were chiefly upper- and middle-income people"
       - a list designed to "correct for bias" from the first list, consisting of names selected from telephone books and motor vehicle registries
  27. The Literary Digest Poll
     - Response rate: approximately 24% (2,376,523 responses out of 10 million)
     - Poll result: Landon in a landslide (predicted 57% of the vote to Roosevelt's 40%)
     - Election result: Roosevelt received approximately 60% of the vote
  29. The Literary Digest Poll
     POSSIBLE CAUSES OF ERROR
     - Selection Bias: by taking names and addresses from telephone directories and vehicle registries, the survey systematically excluded poor voters
       - Republicans were markedly overrepresented
       - In 1936, Democrats were less likely to have phones, less likely to drive cars, and did not read the Literary Digest
       - The "sampling frame" is the actual population of individuals from which a sample is drawn: selection bias results when the sampling frame is not representative of the population of interest
  30. The Literary Digest Poll
     POSSIBLE CAUSES OF ERROR
     - Non-response Bias: because only about a quarter of the 10 million people polled returned ballots, non-respondents may have had different preferences from respondents
       - Indeed, respondents favored Landon
       - Higher response rates reduce the odds of a biased sample
  31. Definitions and Formulas
  32. Terminology
     - Population: the set of entities about which statistical inferences are to be drawn
     - Sample: a number of independent observations drawn from the same probability distribution
     - Parameter: a numerical value that indexes a family of probability distributions; members of the family are distinguished from each other by the values of a finite number of parameters
     - Bias: a factor that causes a statistical sample to have some members of the population less represented than others
  33. Outliers (and their treatment)
  34. Outliers (and their treatment)
     - An "outlier" is an observation that does not fit the pattern in the rest of the data
       - Check the data
       - Check with the measurer
       - If there is reason to believe it is NOT real, correct it if possible; otherwise leave it out (and note that you did)
       - If there is reason to believe it IS real, leave it in and note it (or report the analysis with and without it)
  35. The Mean
     - The (arithmetic) mean of a set of observations is the average of the measurements: the sum of all the elements divided by the number of elements.
  36. The Mode
     - The mode is defined as the most frequently occurring element in a set of elements.
       - For example, [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] has a mode of 6.
     - Given the list [1, 1, 2, 4, 4], the mode is not unique (both 1 and 4 occur twice): the dataset is bimodal, and a set with more than two modes is described as multimodal.
  37. The Median
     - The median is defined as the middle element: the value separating the higher half of a sample from the lower half.
     - If there is an even number of elements, it is the average of the middle two elements (half their sum).
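The three averages defined above are all in Python's standard library. A small sketch, reusing the sample values from the mode example:

```python
# Mean, mode, and median via the statistics module.
import statistics

sample = [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]

mode = statistics.mode(sample)       # most frequently occurring value
median = statistics.median(sample)   # middle element (odd count here)

# With an even count, median averages the middle two elements:
even_median = statistics.median([1, 2, 3, 4])   # (2 + 3) / 2

# A bimodal list has more than one mode; multimode returns them all:
modes = statistics.multimode([1, 1, 2, 4, 4])
```

Note that `statistics.mode` raises an error only on empty data; for multimodal data it returns the first mode encountered, which is why `multimode` is the safer choice when the mode may not be unique.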
  38. The Variance
     - But there can be a lot of variation among individual elements,
     - e.g. teacher salaries:
     - Average = €22,000
     - Lowest = €12,000
     - Difference = 12,000 - 22,000 = -10,000
  42. The Variance
     - The deviations from the mean always sum to zero: Sum of (Sample - Average) = 0. We therefore square the deviations before summing, which is why we need the variance.
     - The variance of a set of data is the sum of the squared differences of all data values from the mean, divided by the sample size minus one: s² = Σ(xᵢ - x̄)² / (n - 1)
  43. Standard Deviation
     - The standard deviation of a set of data is the positive square root of the variance: s = √s²
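The two definitions above translate directly into code. A minimal sketch (the function names are my own, not from the slides):

```python
# Sample variance and standard deviation exactly as defined:
# squared deviations from the mean, summed, divided by (n - 1).
import math

def sample_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_std_dev(xs):
    # positive square root of the variance
    return math.sqrt(sample_variance(xs))
```

Dividing by n - 1 rather than n (Bessel's correction) makes the sample variance an unbiased estimator of the population variance.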
  44. Karl Pearson
     - Born 27 March 1857 in Islington, London, England
     - Died 27 April 1936
     - "Father of Mathematical Statistics"
     - Protégé of Francis Galton
     - Credited with the P-value, the Pearson correlation coefficient, the chi-squared test, the method of moments, and Principal Component Analysis
  45. Karl Pearson introduced the term "standard deviation" in 1893, "although the idea was by then nearly a century old" (Abbott; Stigler, page 328).
     - The term was introduced in a lecture of 31 January 1893, as a convenient substitute for the cumbersome "root mean square error" and the older expressions "error of mean square" and "mean error".
     - The term was first used in a publication in 1894 by Pearson in "Contributions to the Mathematical Theory of Evolution", Philosophical Transactions of the Royal Society A, 185 (1894), 71-110.
  46. Question 2
     Find the mean and variance of the following sample values:
     36, 41, 43, 44, 46
  47. Question 2
     Mean = (36 + 41 + 43 + 44 + 46) / 5 = 210 / 5 = 42
  48. Question 2
     Value   Difference      Square
     36      36 - 42 = -6    36
     41      41 - 42 = -1    1
     43      43 - 42 = 1     1
     44      44 - 42 = 2     4
     46      46 - 42 = 4     16
     Sum of squares = 58
     Variance = 58 / (5 - 1) = 58 / 4 = 14.5
     Standard Deviation = √14.5 ≈ 3.8
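The Question 2 arithmetic can be checked against the standard library; `statistics.variance` and `statistics.stdev` both use the (n - 1) divisor, matching the definition on the earlier slides.

```python
# Verifying Question 2 with the statistics module.
import statistics

sample = [36, 41, 43, 44, 46]

mean = statistics.mean(sample)           # (36+41+43+44+46) / 5
variance = statistics.variance(sample)   # sum of squares / (n - 1)
std_dev = statistics.stdev(sample)       # square root of the variance
```

(If you wanted the population variance with divisor n instead, the module provides `statistics.pvariance` and `statistics.pstdev`.)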