Exploring Data

AP Statistics

Published in: Education, News & Politics
  • Exploring Data

    1. Exploring Data
       • Displaying Distributions with Graphs
       • Displaying Distributions with Numbers
    2. Displaying Distributions with Graphs
       • Introduction
       • Displaying categorical variables: bar graphs
       • Displaying quantitative variables: dotplots and stemplots
       • Displaying quantitative variables: histograms
       • Relative frequency, cumulative frequency, percentiles, and ogives
       • Time plots
    3. Introduction
       Statistics is the branch of mathematics dealing with the collection, analysis, interpretation, and presentation of numerical data.
       Individuals are the objects described by a set of data. When the individual is human, it is called a subject.
       A variable is any characteristic of an individual. A variable can take different values for different individuals.
    4. Introduction
       Some variables simply place individuals (or subjects) into categories. Other variables take numerical values for which we can do arithmetic.
       A categorical variable places an individual into a group or category.
       A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.
       The distribution of a variable tells us what values the variable takes and how often it takes these values.
    5. Displaying Categorical Variables: Bar Graphs
       A bar graph shows the distribution of a categorical variable and gives either the count or the percent of observations that fall in each category.
       The horizontal axis lists each category of the variable.
       The vertical axis shows the number (or percent) of observations.
       Leave a space between bars.
       Always label the axes and add a title.
    6. Displaying Quantitative Variables: Dotplots and Stemplots
       A dotplot is the simplest display of quantitative data. To create a dotplot, draw a horizontal line and list each outcome in ascending order below the line. Mark a dot above the number that corresponds to each data value. Add a title.
       For example, the numbers of goals scored per game by the Boston Bruins during the 2011 NHL playoffs are: 0, 1, 4, 5, 2, 1, 4, 7, 3, 5, 5, 2, 6, 2, 3, 3, 4, 1, 0, 2, 8, 4, 0, 5, 4. Create a dotplot of these data.
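The dotplot construction described above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `text_dotplot` is made up here, and the layout assumes single-digit whole-number values so that the columns line up.

```python
from collections import Counter

# Goals per game from the slide (25 playoff games).
goals = [0, 1, 4, 5, 2, 1, 4, 7, 3, 5, 5, 2, 6, 2, 3, 3, 4, 1,
         0, 2, 8, 4, 0, 5, 4]

def text_dotplot(values):
    """Return the lines of a text dotplot: marks stacked above a number line.

    Hypothetical helper; assumes single-digit whole numbers.
    """
    counts = Counter(values)
    lo, hi = min(values), max(values)
    tallest = max(counts.values())
    lines = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        row = " ".join("*" if counts[v] >= level else " "
                       for v in range(lo, hi + 1))
        lines.append(row.rstrip())
    lines.append("-" * (2 * (hi - lo) + 1))                    # horizontal axis
    lines.append(" ".join(str(v) for v in range(lo, hi + 1)))  # axis labels
    return lines

print("Goals per game, Boston Bruins, 2011 NHL playoffs")
print("\n".join(text_dotplot(goals)))
```

Each column of marks is one outcome, so the height of a column is the number of games with that many goals.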
    7. Displaying Quantitative Variables: Dotplots and Stemplots
       Refer to the handout for caffeine content (in mg) for 38 different soft drinks. For these data, a dotplot is not ideal due to the large spread. Instead, construct a stemplot.
       Separate each observation into a stem consisting of all digits except the rightmost digit. The rightmost digit is the leaf. For example, 35 mg of caffeine will have a stem of 3 and a leaf of 5.
       Write the stems vertically in increasing order from top to bottom.
    8. Displaying Quantitative Variables: Dotplots and Stemplots
       Draw a vertical line to the right of the stems.
       For each observation, write the leaf to the right of its associated stem, making sure to space the leaves equally. Then rewrite the stems and arrange the leaves so they are in increasing order out from the stem.
       Add a title and key (3 | 5 = 35 mg).
       Note: it may be necessary to split stems or truncate observations.
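The stemplot steps above can be sketched as follows. The handout data are not reproduced in these notes, so the caffeine values below are hypothetical, and the function name `stemplot` is an assumption made for this example. The sketch assumes non-negative whole numbers and does not split stems.

```python
from collections import defaultdict

def stemplot(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):           # sorting first keeps leaves in increasing order
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    # Include empty stems so gaps in the distribution stay visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

# Hypothetical caffeine values in mg (NOT the handout data).
caffeine = [35, 41, 38, 47, 35, 52, 23, 71, 46]

print("Caffeine content of soft drinks (mg)")
for stem, leaves in stemplot(caffeine).items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 3 | 5 = 35 mg")
```

Keeping the empty stem (6 in this made-up data) matters: dropping it would hide the gap between the bulk of the data and the largest value.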
    9. Displaying Quantitative Variables: Dotplots and Stemplots
       After completing a dotplot or stemplot, describe the overall pattern of the distribution. Give the center and spread and determine whether there are outliers. An outlier is an individual observation that falls outside the overall pattern of the graph.
       Also comment on the shape of the distribution. Distributions may be symmetric (the two sides are roughly mirror images), skewed right (the right tail is much longer than the left tail), or skewed left (the left tail is much longer than the right tail).
    10. Activity
        Is Barack Obama a “young” president? Here are the ages of all the U.S. presidents on inauguration day:
        Washington 57, J. Adams 61, Jefferson 57, Madison 57, Monroe 58, J.Q. Adams 57, Jackson 61, Van Buren 54, W. Harrison 68, Tyler 51, Polk 49, Taylor 64, Fillmore 50, Pierce 48, Buchanan 65, Lincoln 52, A. Johnson 56, Grant 46, Hayes 54, Garfield 49, Arthur 51, Cleveland 47, B. Harrison 55, Cleveland 55, McKinley 54, T. Roosevelt 42, Taft 51, Wilson 56, Harding 55, Coolidge 51, Hoover 54, F. Roosevelt 51, Truman 60, Eisenhower 61, Kennedy 43, L. Johnson 55, Nixon 56, Ford 61, Carter 52, Reagan 69, G. Bush 64, Clinton 46, G.W. Bush 54, Obama 47.
    11. Displaying Quantitative Variables: Histograms
        Display the presidential ages at inauguration using a histogram. On a TI-83:
        STAT EDIT 1:Edit and enter values into L1
        2nd STAT PLOT 1: On, choose histogram, Xlist: L1, Freq: 1
        GRAPH
        Sketch the result from the calculator into your notes. Always add axis labels and a title.
    12. Displaying Quantitative Variables: Histograms
        Unlike the bar graph, the bars of the histogram are adjacent to account for continuity of the values on the x-axis.
        There is no “correct” number of classes on the x-axis. However, 7 classes seems to make the histogram look “best” and between 5 and 10 are probably sufficient. Too few classes will result in a skyscraper histogram while too many will result in a pancake histogram.
        In general, use the number of classes your calculator chooses.
    13. Relative Frequency, Cumulative Frequency, Percentiles, and Ogives
        Sometimes we are interested in describing the relative position of an individual within a distribution. For instance, a PSAT result may indicate you were in the 80th percentile. This means you scored at or above 80% of students (and about 20% scored better than you).
        The pth percentile of a distribution is the value such that p percent of observations fall at or below it.
    14. Relative Frequency, Cumulative Frequency, Percentiles, and Ogives
        A histogram is good for displaying the overall pattern of a distribution but is poor for determining the percentile of an individual observation.
        A relative cumulative frequency plot, or ogive, is useful in determining percentiles.
    15. Relative Frequency, Cumulative Frequency, Percentiles, and Ogives
        From the presidential inauguration data, we know there are 44 presidents (observations).
        Fill in the table:

        Class     Frequency   Relative Frequency   Cumulative Frequency   Relative Cumulative Frequency
        40 - 44
        45 - 49
        50 - 54
        55 - 59
        60 - 64
        65 - 69
    16. Relative Frequency, Cumulative Frequency, Percentiles, and Ogives
        The relative cumulative frequency plot is a line graph that plots relative cumulative frequency vs. class. Create one using the data from the previous slide, and don’t forget to label the axes and add a title.
        What percentile is Barack Obama in? On the x-axis, locate the class that contains 47. Move up until you reach the line, then move left to read off the approximate percentile.
        What age corresponds to the 50th percentile?
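One way to check a hand-built frequency table and the points of the ogive is a short Python sketch like the one below. The function name `frequency_table` and the choice of six classes of width 5 starting at 40 are taken from the class labels on the slide; the code itself is an illustration, not part of the original notes.

```python
# Ages at inauguration from the Activity slide (44 observations).
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65, 52,
        56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51, 54, 51,
        60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54, 47]

def frequency_table(values, start, width, n_classes):
    """Return rows of (class label, freq, rel freq, cum freq, rel cum freq)."""
    n = len(values)
    rows, cumulative = [], 0
    for k in range(n_classes):
        lo = start + k * width
        hi = lo + width - 1                  # classes hold whole-number ages
        freq = sum(lo <= v <= hi for v in values)
        cumulative += freq
        rows.append((f"{lo} - {hi}", freq, freq / n, cumulative, cumulative / n))
    return rows

table = frequency_table(ages, start=40, width=5, n_classes=6)
for label, freq, rel, cum, rel_cum in table:
    print(f"{label}  {freq:>2}  {rel:6.1%}  {cum:>2}  {rel_cum:6.1%}")

# Exact percentile of age 47 under the "at or below" definition.
obama_percentile = 100 * sum(v <= 47 for v in ages) / len(ages)
print(f"Age 47 is at about the {obama_percentile:.0f}th percentile")
```

The last column gives the heights of the ogive at the upper end of each class. Reading a percentile off the graph is an approximation; the final lines compute the exact value for comparison.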
    17. Time Plots
        A time plot of a variable plots each observation against the time at which it was measured. Time is always placed on the x-axis.
        Civil disturbances in the United States between 1968 and 1972 were recorded in the table on the next slide. Using the data, construct a time plot of the number of disturbances vs. time. Remember to label the axes and add a title.
        Connect the observations with a line and comment on the overall trend and the seasonal variation.
    18. Time Plots

        Year   Months      Count
        1968   Jan - Mar     6
               Apr - Jun    46
               Jul - Sep    25
               Oct - Dec     3
        1969   Jan - Mar     5
               Apr - Jun    27
               Jul - Sep    19
               Oct - Dec     6
        1970   Jan - Mar    26
               Apr - Jun    24
               Jul - Sep    20
               Oct - Dec     6
        1971   Jan - Mar    12
               Apr - Jun    21
               Jul - Sep     5
               Oct - Dec     1
        1972   Jan - Mar     3
               Apr - Jun     8
               Jul - Sep     5
               Oct - Dec     5
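As a rough companion to the hand-drawn time plot, the sketch below prints the quarterly counts from this slide as a text plot and computes two simple summaries. The variable names are made up for this example, and the yearly totals and per-quarter averages are just one possible way to look at trend and seasonal variation.

```python
# Quarterly counts from this slide, in the order Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec.
quarters = ["Jan - Mar", "Apr - Jun", "Jul - Sep", "Oct - Dec"]
disturbances = {
    1968: [6, 46, 25, 3],
    1969: [5, 27, 19, 6],
    1970: [26, 24, 20, 6],
    1971: [12, 21, 5, 1],
    1972: [3, 8, 5, 5],
}

# Text time plot: time runs down the page, bar length is the count.
for year, counts in disturbances.items():
    for label, count in zip(quarters, counts):
        print(f"{year} {label} {'#' * count} {count}")

# Overall trend: total disturbances per year.
yearly_totals = {year: sum(counts) for year, counts in disturbances.items()}

# Seasonal variation: average count for each quarter across the five years.
quarter_means = [
    sum(counts[q] for counts in disturbances.values()) / len(disturbances)
    for q in range(4)
]

print(yearly_totals)
print(dict(zip(quarters, quarter_means)))
```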
    19. Displaying Distributions with Numbers
        • Measuring center: the mean and the median
        • Comparing the mean and median
        • Measuring spread: the quartiles
        • The five-number summary and modified boxplots
        • Measuring spread: the standard deviation
        • Choosing measures of center and spread
        • Changing the unit of measurement
        • Comparing distributions
    20. Measuring Center: the Mean and Median
        To find the mean (average) of a set of observations, add their individual values and divide by the number of observations. If the n observations are x_1, x_2, …, x_n, then the mean is:
        $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
    21. Measuring Center: the Mean and Median
        Consider the set S = {1, 1, 2, 2, 3, 3, 4, 4}. The mean of this set is 2.5.
        Now consider the set T = {1, 1, 2, 2, 3, 3, 4, 40}. Find the mean.
        Notice that the extreme observation strongly affects the mean. Therefore, we say the mean is not resistant to extreme observations.
    22. Measuring Center: the Mean and Median
        The median, M, is the midpoint of a distribution: the number such that half of the observations are smaller and half of the observations are larger.
        To find the median, arrange the observations in order of size, from smallest to largest.
        If the number of observations, n, is odd, the median is the center observation in the ordered list.
        If the number of observations, n, is even, the median is the mean of the two center observations in the ordered list.
    23. Measuring Center: the Mean and Median
        Consider the set S = {1, 1, 2, 2, 3, 3, 4, 4}. The median of this set is 2.5.
        Now consider the set T = {1, 1, 2, 2, 3, 3, 4, 40}. Find the median.
        Notice that the extreme observation has little effect on the median. Therefore, we say the median is resistant to extreme observations.
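The comparison of S and T can be checked with Python's standard `statistics` module. This is a small sketch to accompany the two slides above, not part of the original notes.

```python
from statistics import mean, median

S = [1, 1, 2, 2, 3, 3, 4, 4]
T = [1, 1, 2, 2, 3, 3, 4, 40]   # same as S except for one extreme observation

for name, data in (("S", S), ("T", T)):
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

Changing a single observation from 4 to 40 moves the mean a long way but leaves the median where it was, which is what "resistant" means here.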
    24. Comparing the Mean and Median
        If a distribution is approximately symmetric, the mean and median are approximately equal.
        In skewed distributions, the mean is farther out in the longer tail than the median (because the mean is not resistant).
        Distributions skewed left will have a mean less than the median.
        Distributions skewed right will have a mean greater than the median.
    25. Measuring Spread: the Quartiles
        The simplest measure of spread for any distribution is the range:
        Range = maximum value - minimum value
        The quartiles mark off the middle half of the observations. The first quartile, Q1, is the 25th percentile. The third quartile, Q3, is the 75th percentile.
    26. Measuring Spread: the Quartiles
        To find Q1 and Q3, arrange the observations in order of size from smallest to largest. Then find the overall median.
        Q1 is the median of the observations smaller than the overall median.
        Q3 is the median of the observations larger than the overall median.
    27. Measuring Spread: the Quartiles
        The interquartile range, IQR, is the range covered by the middle half of the data:
        IQR = Q3 - Q1
        An observation between Q1 and Q3 is not unusually small or large. This observation is between the 25th and 75th percentiles.
    28. Measuring Spread: the Quartiles
        Using the IQR, we can now write a definition for an outlier.
        An observation is considered an outlier if it is smaller than Q1 - 1.5 × IQR or larger than Q3 + 1.5 × IQR.
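The quartile and outlier rules above can be sketched in Python using the presidential ages from the Activity slide. The function name `quartiles` is an assumption for this example. It splits the ordered list by position (leaving out the middle observation when n is odd), which is how the "smaller than the median" and "larger than the median" halves are read here; statistical software may use other quartile conventions and give slightly different values.

```python
from statistics import median

# Ages at inauguration from the Activity slide (44 observations).
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65, 52,
        56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51, 54, 51,
        60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54, 47]

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # When n is odd, skip the middle observation (the overall median).
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(upper)

q1, q3 = quartiles(ages)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in ages if v < low_fence or v > high_fence]

print(f"Q1 = {q1}, M = {median(ages)}, Q3 = {q3}, IQR = {iqr}")
print(f"Outlier fences: below {low_fence} or above {high_fence}")
print(f"Outliers: {outliers}")
```

The same numbers, together with the minimum and maximum, give the five-number summary.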
    29. Measuring Spread: the Five-Number Summary and Modified Boxplots
        The five-number summary combines a measure of center (median) and measures of spread (range and quartiles). It consists of five numbers written in order from smallest to largest. The numbers are:
        Minimum, Q1, M, Q3, Maximum
    30. Measuring Spread: the Five-Number Summary and Modified Boxplots
        A modified boxplot is a graph of the five-number summary. Properties of the modified boxplot are:
        A central box spans Q1 and Q3;
        A vertical line in the box marks M;
        Horizontal lines extend from the box out to the smallest and largest observations that are not outliers;
        Observations more than 1.5 × IQR outside the central box are plotted individually.
    31. Measuring Spread: the Standard Deviation
        The standard deviation, s, measures how far the observations in a distribution are from their mean. To calculate the standard deviation, first calculate the variance, s^2.
        The variance, s^2, of a set of observations is an average of the squares of the deviations of the observations from their mean, found by dividing the sum of the squared deviations by n - 1:
        $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
    32. Measuring Spread: the Standard Deviation
        The standard deviation, s, is the square root of the variance:
        $$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
        Why divide by n - 1 instead of n? Since the sum of the deviations must equal zero, the last deviation can be found once we know the other n - 1 deviations. Only n - 1 of the squared deviations can vary freely, so we average by dividing the total by n - 1. The number n - 1 is called the degrees of freedom.
    33. Measuring Spread: the Standard Deviation
        Properties of the standard deviation:
        s measures spread about the mean and should be used only when the mean is chosen as the measure of center;
        s = 0 when there is no spread. When there is spread, s > 0. Larger spreads imply larger values of s.
        Like the mean, the standard deviation (and variance) is not resistant to outliers. Strong skewness or a few outliers can make s very large.
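The calculation of the variance and standard deviation, with the n - 1 divisor, can be sketched as follows. The helper name `sample_variance` is made up for this example; the sets S and T are reused from the mean and median slides to show that s is not resistant to an extreme observation. The result is compared with `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    deviations = [x - xbar for x in values]   # these always sum to zero
    return sum(d * d for d in deviations) / (n - 1)

S = [1, 1, 2, 2, 3, 3, 4, 4]
T = [1, 1, 2, 2, 3, 3, 4, 40]   # one extreme observation

for name, data in (("S", S), ("T", T)):
    s = sqrt(sample_variance(data))
    print(f"{name}: variance = {sample_variance(data):.3f}, s = {s:.3f}, "
          f"statistics.stdev = {stdev(data):.3f}")
```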
    34. Measuring Spread: the Standard Deviation
        Here are some TI-83 commands to find all the summary statistics mentioned in these notes:
        Enter data into L1
        STAT CALC 1:1-Var Stats L1
        Read off: xbar, Sx, minX, Q1, Med, Q3, maxX
    35. Choosing Measures of Center and Spread
        If a distribution is strongly skewed or has outliers, use the five-number summary to describe center and spread.
        If a distribution is reasonably symmetric and free from outliers, use the mean and standard deviation to describe center and spread.
    36. Changing the Unit of Measurement
        The same variable can be recorded in different units of measurement. Common examples are changing distances from miles to kilometers and changing temperature from °F to °C.
        A linear transformation changes the original value x into a new value x_new via an equation of the form:
        $$x_{\text{new}} = a + bx$$
    37. Changing the Unit of Measurement
        The effects of a linear transformation on measures of center and spread are:
        Adding the same number a to each observation adds a to the mean, median, and quartiles, but does not change measures of spread.
        Multiplying each observation by a positive number b multiplies the mean, median, and quartiles by b and also multiplies the standard deviation and IQR by b. (For a negative b, the measures of spread are multiplied by |b|.)
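These effects can be checked numerically. The sketch below converts a small set of hypothetical temperatures from °F to °C, which is a linear transformation with a = -160/9 and b = 5/9, and also applies a pure shift. The data and the function name `transform` are assumptions made for this illustration.

```python
from math import isclose
from statistics import mean, median, stdev

def transform(values, a, b):
    """Apply the linear transformation x_new = a + b * x to every value."""
    return [a + b * x for x in values]

# Hypothetical temperatures in °F (illustrative data only).
fahrenheit = [32, 50, 59, 68, 86, 212]

# °C = (°F - 32) * 5/9, so a = -160/9 and b = 5/9.
a, b = -160 / 9, 5 / 9
celsius = transform(fahrenheit, a, b)

# Adding a constant only: center shifts, spread is unchanged.
shifted = transform(fahrenheit, a=10, b=1)

print(f"mean:   {mean(fahrenheit):.2f} °F -> {mean(celsius):.2f} °C")
print(f"median: {median(fahrenheit):.2f} °F -> {median(celsius):.2f} °C")
print(f"s:      {stdev(fahrenheit):.2f} °F -> {stdev(celsius):.2f} °C")
print(f"s after adding 10: {stdev(shifted):.2f}")
```

The mean and median follow the full equation a + bx, while the standard deviation is only rescaled by b; the shift a has no effect on it.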
