Describing Distributions with Numbers
Describing Distributions with Numbers Presentation Transcript

  • 1. 1 INTRODUCTION TO STATISTICS & PROBABILITY Chapter 1: Looking at Data—Distributions (Part 2) 1.2 Describing Distributions with Numbers Dr. Nahid Sultana
  • 2. 1.2 Describing Distributions with Numbers 2 Objectives  Measures of center: mean, median  Measures of spread: quartiles, standard deviation  Five-number summary and boxplot  IQR and outliers  Choosing among summary statistics  Changing the unit of measurement
  • 3. Measures of center: The Mean
    • The most common measure of center is the arithmetic average, also called the mean or sample mean.
    • To calculate the mean, add all the values, then divide by the number of individuals.
    • It is the “center of mass” of the distribution.
    • If the n observations are $x_1, x_2, \dots, x_n$, their mean is $\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$, or in more compact notation, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
  • 4. Measures of center: The Mean (cont.)
    Find the mean: Here are the scores on the first exam in an introductory statistics course for 10 students: 80 73 92 85 75 98 93 55 80 90. Find the mean first-exam score for these students.
    Solution: $\bar{x} = \frac{80 + 73 + 92 + 85 + 75 + 98 + 93 + 55 + 80 + 90}{10} = \frac{821}{10} = 82.1$
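As a minimal sketch of the mean calculation, the Python snippet below applies "add all values, then divide by the number of individuals" to the ten first-exam scores listed in the quartile example of these slides. The function name `mean` is an illustrative choice, not something the slides define.

```python
# The ten first-exam scores used in the examples of these slides.
scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]


def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


print(mean(scores))  # 821 / 10 = 82.1
```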
  • 5. Measuring Center: The Median
    • Another common measure of center is the median.
    • The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.
    To find the median of a distribution:
    1. Arrange all observations from smallest to largest.
    2. If the number of observations n is odd, the median M is the center observation in the ordered list.
    3. If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
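  • 6. Measuring Center: The Median (cont.)
    Find the median: Here are the scores on the first exam in an introductory statistics course for 10 students: 80 73 92 85 75 98 93 55 80 90. Find the median first-exam score for these students.
    Solution: In order, the scores are 55 73 75 80 80 85 90 92 93 98. Since n = 10 is even, the median is the average of the two center observations: M = (80 + 85)/2 = 82.5.
    Note: The location of the median is (n + 1)/2 in the sorted list. Here (10 + 1)/2 = 5.5, that is, halfway between the 5th and 6th observations.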
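The median procedure (sort, then take the center observation or the average of the two center observations) can be sketched in Python as follows. The function name `median` and the second, odd-length data set are illustrative assumptions. The first data set is the ten exam scores from the slides.

```python
def median(values):
    """Median by the rule on the slides: sort the data, then take the
    center observation (odd n) or the average of the two center
    observations (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
print(median(scores))     # average of 80 and 85
print(median([3, 1, 2]))  # odd n: the center observation
```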
  • 7. Comparing Mean and Median
  • 8. Comparing Mean and Median (cont.)
    • If a distribution is exactly symmetric, the mean and the median are the same. If it is roughly symmetric, they are close together.
    • In a skewed distribution, the mean is usually farther out in the long tail than the median is.
    • The median is a measure of center that is resistant to skew and outliers. The mean is not.
  • 9. Measuring Spread: The Quartiles
    A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread.
    • We describe the spread or variability of a distribution by giving several percentiles.
    • The median divides the data into two parts: half of the observations are above the median and half are below it. We could call the median the 50th percentile.
    • The lower quartile (first quartile, Q1) is the median of the lower half of the data; the upper quartile (third quartile, Q3) is the median of the upper half of the data.
    • Together with the median, the quartiles divide the data into four equal parts; 25% of the data are in each part.
  • 10. Measuring Spread: The Quartiles (cont.)
    To calculate the quartiles:
    1. Arrange the observations in increasing order and locate the median M.
    2. The first quartile Q1 is the median of the lower half of the data, excluding M.
    3. The third quartile Q3 is the median of the upper half of the data, excluding M.
  • 11. Measuring Spread: The Quartiles (cont.)
    Example: Here are the scores on the first exam in an introductory statistics course for 10 students: 80 73 92 85 75 98 93 55 80 90. Find the quartiles for these first-exam scores.
    Solution: In order, the scores are 55 73 75 80 80 85 90 92 93 98.
    The median is M = (80 + 85)/2 = 82.5.
    Q1 = 75, the median of the first five numbers: 55, 73, 75, 80, 80.
    Q3 = 92, the median of the last five numbers: 85, 90, 92, 93, 98.
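The quartile rule from the slides (median of each half, excluding M when n is odd) can be sketched as below. The helper names `median` and `quartiles` are illustrative assumptions. Library routines often use other interpolation conventions, so their quartiles may differ slightly from the ones this rule gives.

```python
def median(values):
    """Median of a non-empty list (average of the two center values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When n is odd, the median observation itself is left out of both
    halves, as the slides prescribe. Requires at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)


scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
q1, q3 = quartiles(scores)
print(q1, q3)  # 75 92
```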
  • 12. The Five-Number Summary
    The five-number summary of a distribution consists of the following, written in order from smallest to largest:
    • The smallest observation (Min)
    • The first quartile (Q1)
    • The median (M)
    • The third quartile (Q3)
    • The largest observation (Max)
    Minimum, Q1, M, Q3, Maximum
  • 13. Boxplots
    A boxplot is a graph of the five-number summary.
    • Draw a central box from Q1 to Q3.
    • Draw a line inside the box to mark the median M.
    • Extend lines from the box out to the minimum and maximum values that are not outliers.
  • 14. Boxplots (cont.)
    Example: Here are the scores on the first exam in an introductory statistics course for 10 students: 80 73 92 85 75 98 93 55 80 90. Make a boxplot for these first-exam scores.
    Solution: In order, the scores are 55, 73, 75, 80, 80, 85, 90, 92, 93, 98.
    Min = 55, Q1 = 75, M = 82.5, Q3 = 92, Max = 98
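The five numbers a boxplot is drawn from can be computed with a short sketch like the one below, which reuses the quartile rule from the slides. The function names are illustrative assumptions, and the drawing itself is left to whatever plotting tool is at hand.

```python
def median(values):
    """Median of a non-empty list (average of the two center values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max), with quartiles taken as the medians
    of the lower and upper halves (excluding M itself when n is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
print(five_number_summary(scores))  # (55, 75, 82.5, 92, 98)
```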
  • 15. Comparing Boxplots to Histograms
  • 16. Boxplots and skewed data
    Boxplots show symmetry or skew.
    [Figure: boxplots for a symmetric and a right-skewed distribution. Vertical axis: years until death (0 to 15). Groups: Disease X, Multiple Myeloma.]
  • 17. Suspected Outliers: The 1.5 × IQR Rule
    • Outliers are troublesome data points, and it is important to be able to identify them.
    • The interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 − Q1.
    • The IQR is used as part of a rule of thumb for identifying outliers.
    The 1.5 × IQR Rule for Outliers: Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
    • Suspected low outlier: any value < Q1 − 1.5 × IQR
    • Suspected high outlier: any value > Q3 + 1.5 × IQR
  • 18. Suspected Outliers: The 1.5 × IQR Rule (cont.)
    Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 × IQR = 3.225 years. Thus, individual #25 is a suspected outlier.
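A minimal Python sketch of the 1.5 × IQR rule follows. The function names are illustrative assumptions. For the survival example, the quartiles Q1 = 2.2 and Q3 = 4.35 are not stated on the slides; they are back-computed here from the figures given (Q3 = 7.9 − 3.55 and IQR = 3.225 / 1.5), so treat them as an assumption.

```python
def iqr_fences(q1, q3):
    """Return the (low, high) cutoffs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(values, q1, q3):
    """List the values that fall outside the 1.5 * IQR fences."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# Exam scores from the slides: Q1 = 75, Q3 = 92, so IQR = 17.
scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
print(iqr_fences(75, 92))                  # (49.5, 117.5)
print(suspected_outliers(scores, 75, 92))  # no score lies outside the fences

# Survival example: quartiles back-computed from the slide (an assumption).
print(suspected_outliers([7.9], q1=2.2, q3=4.35))  # 7.9 is flagged
```

Note that the lowest exam score, 55, looks extreme but lies above the low cutoff of 49.5, so the rule does not flag it.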
  • 19. Suspected Outliers: The 1.5 × IQR Rule (cont.)
    • Modified boxplots plot suspected outliers individually.
    • The 8 largest call lengths are 438, 465, 479, 700, 700, 951, 1148, 2631.
    • They are plotted as individual points, though 2 of them are identical and so do not appear separately.
  • 20. Measuring Spread: The Standard Deviation
    The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation.
    • The standard deviation s measures, roughly, the average distance of the observations from their mean.
    • It is calculated by $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.
    • The average squared distance under the square root, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, is called the variance.
  • 21. Calculating The Standard Deviation
    1. Calculate the mean.
    2. Calculate each deviation: deviation = observation − mean.
    3. Square each deviation.
    4. Calculate the sum of the squared deviations.
    5. Divide by the degrees of freedom, df = n − 1. The result is the variance.
    6. Take the square root of the variance. The result is the standard deviation.

    x_i    x_i − mean    (x_i − mean)^2
    1      1 − 5 = −4    (−4)^2 = 16
    3      3 − 5 = −2    (−2)^2 = 4
    4      4 − 5 = −1    (−1)^2 = 1
    4      4 − 5 = −1    (−1)^2 = 1
    4      4 − 5 = −1    (−1)^2 = 1
    5      5 − 5 = 0     (0)^2 = 0
    7      7 − 5 = 2     (2)^2 = 4
    8      8 − 5 = 3     (3)^2 = 9
    9      9 − 5 = 4     (4)^2 = 16
    Mean = 5    Sum = 0    Sum = 52

    Variance = 52/(9 − 1) = 6.5
    Standard deviation = √6.5 ≈ 2.55
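The six steps can be followed one by one in a short Python sketch, shown here on the nine observations from the worked table. The function names `sample_variance` and `sample_std` are illustrative assumptions.

```python
import math

# The nine observations from the worked table.
data = [1, 3, 4, 4, 4, 5, 7, 8, 9]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    m = sum(values) / n                             # step 1: mean
    squared = [(x - m) ** 2 for x in values]        # steps 2-3: squared deviations
    return sum(squared) / (n - 1)                   # steps 4-5: sum, divide by df


def sample_std(values):
    """Step 6: the standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))


print(sample_variance(data))       # 52 / 8 = 6.5
print(round(sample_std(data), 2))  # 2.55
```

Python's standard `statistics.variance` and `statistics.stdev` also divide by n − 1, so they can serve as a cross-check.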
  • 22. Properties of The Standard Deviation
    • s measures spread about the mean and should be used only when the mean is the measure of center.
    • s = 0 only when all observations have the same value and there is no spread. Otherwise, s > 0.
    • s is not resistant to outliers.
    • s has the same units of measurement as the original observations.
  • 23. Choosing Measures of Center and Spread
    We now have a choice between two descriptions of center and spread:
    • Mean and standard deviation
    • Median and interquartile range
    Guidelines:
    • The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
    • Use the mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
    NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA FIRST!
  • 24. Changing the Unit of Measurement
    • Variables can be recorded in different units of measurement.
    • Most often, one measurement unit is a linear transformation of another measurement unit: xnew = a + bx.
    Example 1: If a distance x is measured in kilometers, the same distance in miles is xnew = 0.62x. This transformation changes the units without changing the origin: a distance of 0 kilometers is the same as a distance of 0 miles.
    Example 2: Temperatures can be expressed in degrees Fahrenheit or degrees Celsius. This transformation changes both the unit size and the origin of the measurements: the origin of the Celsius scale (0°C, the temperature at which water freezes) is 32° on the Fahrenheit scale.
  • 25. Changing the Unit of Measurement (cont.)
    • Linear transformations do not change the basic shape of a distribution (skew, symmetry).
    • But they do change the measures of center and spread:
    • Multiplying each observation by a positive number b multiplies both measures of center (mean, median) and measures of spread (IQR, s) by b.
    • Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles, but it does not change measures of spread (IQR, s).
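These rules can be checked numerically with a small sketch. The temperature readings below are hypothetical values chosen only for illustration, and the example uses the standard Celsius-to-Fahrenheit conversion, which has the form xnew = a + bx with a = 32 and b = 1.8.

```python
import math
import statistics

# Hypothetical temperature readings in degrees Celsius (illustration only).
celsius = [10.0, 15.0, 20.0, 25.0, 40.0]

# Celsius to Fahrenheit is the linear transformation xnew = a + b * x.
a, b = 32.0, 1.8
fahrenheit = [a + b * x for x in celsius]

# Center: the mean and median are multiplied by b and shifted by a.
print(math.isclose(statistics.mean(fahrenheit), a + b * statistics.mean(celsius)))
print(math.isclose(statistics.median(fahrenheit), a + b * statistics.median(celsius)))

# Spread: the standard deviation is multiplied by b; adding a has no effect.
print(math.isclose(statistics.stdev(fahrenheit), b * statistics.stdev(celsius)))
```

Each comparison prints `True`, using `math.isclose` rather than `==` because the results are floating-point values.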