This chapter will focus on descriptive statistics. Using descriptive statistics and the information from Chapters 3, 4, and 5, we will begin the study of inferential statistics.
Most important characteristics necessary to describe, explore, and compare data sets. page 34 of text
page 35 of text
Discussion of what these rating values represent is found on page 33 (the Chapter Problem for Chapter 2).
Final result of a frequency table on the Qwerty data given in previous slide. Next few slides will develop definitions and steps for developing the frequency table.
The concept is usually difficult for some students. Emphasize that boundaries are often used in graphical representations of data. See Section 2-3.
Students will question where the -0.5 and the 14.5 came from. After establishing the boundaries between the existing classes, explain that one should find the boundary below the first class and above the last class, again referencing the use of boundaries in the development of some histograms (Section 2-3).
Being able to identify the midpoints of each class will be important when determining the mean and standard deviation of a frequency table.
Class widths ideally should be the same for every class. However, open-ended first or last classes (e.g., 65 years or older) are sometimes necessary to keep from having a large number of classes with a frequency of 0.
page 36 of text
#1: Classes should not overlap, so that each value belongs to exactly one class.
#2: Omitting classes with a frequency of zero is a common mistake of students developing frequency tables.
#3: Open-ended classes are sometimes necessary.
#4: Round up to use fewer decimal places or use relevant numbers.
#5: With fewer than 5 classes, the table will be too concise; with more than 20, the table will be too large or unwieldy.
#6: One test to make sure all data has been included.
This is the same as the flow chart on the next slide.
page 37 of text
page 38 of text. Emphasis on the relative frequency table will assist in developing the concept of probability distributions - the use of percentages with classes will relate to the use of probabilities with random variables.
page 55 of text
In this text, the arithmetic mean will be referred to as the MEAN.
Σ is pronounced ‘sigma’ (Greek uppercase ‘S’) and indicates ‘summation’ of values.
The mean is sensitive to every value, so one exceptional value can affect the mean dramatically. The median (a couple of slides later) overcomes this disadvantage.
The mean is sensitive to every value, so one exceptional value can affect the mean dramatically. The median (next slide) overcomes that disadvantage.
Examples on page 57 of text
page 58 of text
page 58 of text Name the two values if the set is bimodal
Midrange is seldom used because it is too sensitive to extremes. (See Figure 2-12, page 59)
Data skewed to the left is said to be ‘negatively skewed’, with the mean and median to the left of the mode. Data skewed to the right is said to be ‘positively skewed’, with the mean and median to the right of the mode.
Data not ‘lopsided’.
Data lopsided to left (or slants down to the left - definition of skew is ‘slanting’)
Data lopsided to the right (or slants down to the right)
Even though the measures of center are all the same, it is obvious from the dotplots of each group of data that there are some differences in the ‘spread’ (or variation) of the data. page 69 of text
The range is not a very useful measure of variation as it only uses two values of the data. Other measures of variation (to follow) will be more useful as they are computed by using every data value.
Ask students to explain what this definition indicates.
The definition indicates that one should find the average distance each score is from the mean. The use of n-1 in the denominator is necessary because there are only n-1 independent values - that is, only n-1 values can be assigned any number before the nth value is determined. page 70 of text
This formula is easier to use if computing the standard deviation ‘by hand’ as it does not rely on the use of the mean. Only three values are needed - n, Σx, and Σx².
page 71 of text
σ is the lowercase Greek ‘sigma’. Note the division by N (population size), rather than n-1. Most data is a sample, rather than a population; therefore, this formula is not used very often.
page 74 of text
After computing the standard deviation, square the resulting number using a calculator.
The formulas for variance are the same as for standard deviation without the square root - remind students that squaring a square root results in the radicand.
Reminder: range is the highest score minus the lowest score
page 79 of text
Some students have difficulty understanding the idea of ‘within one standard deviation of the mean’. Emphasize that this means the interval from one standard deviation below the mean to one standard deviation above the mean.
These percentages will be verified by the concepts learned in Chapter 5. Emphasize the Empirical Rule is appropriate for data that is in a BELL-SHAPED distribution.
Emphasize Chebyshev’s Theorem applies to data that is in a distribution of any shape - that is, it is less prescriptive than the Empirical Rule. page 80 of text
This idea will be revisited throughout the study of Elementary Statistics.
page 85 of text
This concept of ‘unusual’ values will be revisited several times during the course, especially in Chapter 7 - Hypothesis Testing. page 86 of text
These 5 numbers will be used in this text’s method for depicting boxplots.
page 94 of text
Outliers can have a dramatic effect on certain descriptive statistics (mean and standard deviation). It is important to be aware of such data values.
page 96 of text
Medians and quartiles are not very sensitive to extreme values. Refer to Section 2-6 for finding the Q1 and Q3 values, and Section 2-4 for finding the median.
It appears that the Qwerty data set has a distribution that is skewed right. page 97 of text
page 97 of text
Transcript of "Descriptive statistics"
1.
1
Descriptive Statistics
2-1 Overview
2-2 Summarizing Data with Frequency Tables
2-3 Pictures of Data
2-4 Measures of Center
2-5 Measures of Variation
2-6 Measures of Position
2-7 Exploratory Data Analysis (EDA)
2.
2
2-1 Overview
Descriptive Statistics
summarize or describe the important
characteristics of a known set of
population data
Inferential Statistics
use sample data to make inferences (or
generalizations) about a population
3.
3
1. Center: A representative or average value that
indicates where the middle of the data set is located
2. Variation: A measure of the amount that the values
vary among themselves
3. Distribution: The nature or shape of the
distribution of data (such as bell-shaped, uniform, or
skewed)
4. Outliers: Sample values that lie very far away from
the vast majority of other sample values
5. Time: Changing characteristics of the data over
time
Important Characteristics of Data
4.
4
Frequency Table
lists classes (or categories) of values,
along with frequencies (or counts) of the
number of values that fall into each class
2-2 Summarizing Data With
Frequency Tables
8.
8
Lower Class Limits
are the smallest numbers that can actually belong to
different classes
Rating     Frequency
 0 - 2        20
 3 - 5        14
 6 - 8        15
 9 - 11        2
12 - 14        1
Lower class limits: 0, 3, 6, 9, 12
9.
9
Upper Class Limits
are the largest numbers that can actually belong to
different classes
Rating     Frequency
 0 - 2        20
 3 - 5        14
 6 - 8        15
 9 - 11        2
12 - 14        1
Upper class limits: 2, 5, 8, 11, 14
10.
10
Class Boundaries
are the numbers used to separate classes, but
without the gaps created by class limits
14.
14
Class Midpoints
midpoints of the classes
Rating     Class Midpoint     Frequency
 0 - 2           1                20
 3 - 5           4                14
 6 - 8           7                15
 9 - 11         10                 2
12 - 14         13                 1
15.
15
Class Width
is the difference between two consecutive lower
class limits or two consecutive class boundaries
16.
16
Class Width
is the difference between two consecutive lower
class limits or two consecutive class boundaries
Class Width     Rating     Frequency
     3           0 - 2        20
     3           3 - 5        14
     3           6 - 8        15
     3           9 - 11        2
     3          12 - 14        1
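The definitions of class width, class midpoints, and class boundaries can be checked against the rating table with a short sketch. The variable names below are illustrative only; the limits are the ones shown on the slides, and the boundary rule (split the gap between adjacent classes, then extend by the same amount at both ends) follows the explanation in the speaker notes.

```python
# Class limits taken from the rating frequency table on the slides.
lower_limits = [0, 3, 6, 9, 12]
upper_limits = [2, 5, 8, 11, 14]

# Class width: difference between two consecutive lower class limits.
class_width = lower_limits[1] - lower_limits[0]

# Class midpoints: average of each class's lower and upper limits.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Class boundaries: split the gap between one class's upper limit and the
# next class's lower limit, then extend by the same half-gap below the
# first class and above the last class.
half_gap = (lower_limits[1] - upper_limits[0]) / 2
boundaries = [lo - half_gap for lo in lower_limits] + [upper_limits[-1] + half_gap]

print("width:", class_width)
print("midpoints:", midpoints)
print("boundaries:", boundaries)
```

The first and last boundaries come out as -0.5 and 14.5, which is where the values students tend to ask about originate.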
17.
17
Guidelines For Frequency Tables
1. Be sure that the classes are mutually exclusive.
2. Include all classes, even if the frequency is zero.
3. Try to use the same width for all classes.
4. Select convenient numbers for class limits.
5. Use between 5 and 20 classes.
6. The sum of the class frequencies must equal the
number of original data values.
18.
18
Constructing A Frequency Table
1. Decide on the number of classes.
2. Determine the class width by dividing the range by the number
of classes (range = highest score - lowest score) and round up.
$\text{class width} \approx \text{round up of } \dfrac{\text{range}}{\text{number of classes}}$
3. Select for the first lower limit either the lowest score or a
convenient value slightly less than the lowest score.
4. Add the class width to the starting point to get the second lower
class limit, add the width to the second lower limit to get the
third, and so on.
5. List the lower class limits in a vertical column and enter the
upper class limits.
6. Represent each score by a tally mark in the appropriate class.
Total tally marks to find the total frequency for each class.
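The construction steps can be sketched in a few lines of Python. This is a minimal sketch for integer-valued scores only; the function name `build_frequency_table` and the sample ratings are invented for illustration (they are not the Qwerty data set), and the first lower limit is simply taken to be the lowest score.

```python
import math

def build_frequency_table(values, num_classes):
    """Tally integer values into equal-width classes as (lower, upper, count) rows."""
    lowest, highest = min(values), max(values)
    # Step 2: class width = range / number of classes, rounded up.
    width = math.ceil((highest - lowest) / num_classes)
    # If the division was exact, the highest score would fall just past the
    # last class, so widen by one to keep every value inside the table.
    if lowest + num_classes * width <= highest:
        width += 1
    rows = []
    for i in range(num_classes):
        lower = lowest + i * width   # Steps 3-4: successive lower class limits
        upper = lower + width - 1    # Step 5: upper limit (integer data assumed)
        count = sum(1 for v in values if lower <= v <= upper)  # Step 6: tally
        rows.append((lower, upper, count))
    return rows

# Hypothetical ratings, made up for this example.
ratings = [0, 1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8, 9, 11, 14]
table = build_frequency_table(ratings, 5)
for lower, upper, count in table:
    print(f"{lower:>2} - {upper:<2}  {count}")
```

Checking that the counts add up to the number of original values is the same test as guideline 6 for frequency tables.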
22.
22
Cumulative Frequency Table
Rating     Frequency
 0 - 2        20
 3 - 5        14
 6 - 8        15
 9 - 11        2
12 - 14        1
Rating           Cumulative Frequency
Less than 3            20
Less than 6            34
Less than 9            49
Less than 12           51
Less than 15           52
23.
23
Frequency Tables
Rating     Frequency
 0 - 2        20
 3 - 5        14
 6 - 8        15
 9 - 11        2
12 - 14        1
Rating     Relative Frequency
 0 - 2        38.5%
 3 - 5        26.9%
 6 - 8        28.8%
 9 - 11        3.8%
12 - 14        1.9%
Rating           Cumulative Frequency
Less than 3            20
Less than 6            34
Less than 9            49
Less than 12           51
Less than 15           52
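As a quick check, the relative and cumulative columns can be derived from the class frequencies alone. The sketch below uses the five frequencies from the rating table; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies for the ratings 0-2, 3-5, 6-8, 9-11, 12-14.
frequencies = [20, 14, 15, 2, 1]
total = sum(frequencies)

# Relative frequency: class frequency as a percentage of all values.
relative = [round(100 * f / total, 1) for f in frequencies]

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

print("total:", total)
print("relative (%):", relative)
print("cumulative:", cumulative)
```

The last cumulative frequency equals the total number of data values, which is a convenient sanity check on the table.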
24.
24
a value at the
center or middle
of a data set
Measures of Center
25.
25
Mean
(Arithmetic Mean)
AVERAGE
the number obtained by adding the
values and dividing the total by the
number of values
Definitions
26.
26
Notation
Σ denotes the addition of a set of values
x is the variable usually used to represent the individual
data values
n represents the number of data values in a sample
N represents the number of data values in a population
27.
27
Notation
$\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set
of sample values
$\bar{x} = \dfrac{\sum x}{n}$
28.
28
Notation
$\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set
of sample values
$\bar{x} = \dfrac{\sum x}{n}$
µ is pronounced ‘mu’ and denotes the mean of all values
in a population
$\mu = \dfrac{\sum x}{N}$
Calculators can calculate the mean of data
29.
29
Definitions
Median
the middle value when the original
data values are arranged in order of
increasing (or decreasing) magnitude
30.
30
Definitions
Median
the middle value when the original
data values are arranged in order of
increasing (or decreasing) magnitude
often denoted by $\tilde{x}$ (pronounced ‘x-tilde’)
31.
31
Definitions
Median
the middle value when the original
data values are arranged in order of
increasing (or decreasing) magnitude
often denoted by $\tilde{x}$ (pronounced ‘x-tilde’)
is not affected by an extreme value
32.
32
6.72 3.46 3.60 6.44
3.46 3.60 6.44 6.72
(in order - even number of values)
no exact middle -- shared by two numbers
$\dfrac{3.60 + 6.44}{2} = 5.02$
MEDIAN is 5.02
33.
33
6.72 3.46 3.60 6.44 26.70
3.46 3.60 6.44 6.72 26.70
(in order - odd number of values)
exact middle MEDIAN is 6.44
6.72 3.46 3.60 6.44
3.46 3.60 6.44 6.72
(in order - even number of values)
no exact middle -- shared by two numbers
$\dfrac{3.60 + 6.44}{2} = 5.02$
MEDIAN is 5.02
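The two cases (odd and even number of values) can be written as one small function. This is a sketch with an illustrative function name; Python's standard `statistics.median` follows the same rule. The means are computed alongside to show how the extreme value 26.70 pulls the mean but barely moves the median.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: exact middle
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: shared by two numbers

even_data = [6.72, 3.46, 3.60, 6.44]
odd_data = [6.72, 3.46, 3.60, 6.44, 26.70]

even_median = round(median(even_data), 2)
odd_median = median(odd_data)
even_mean = sum(even_data) / len(even_data)
odd_mean = sum(odd_data) / len(odd_data)

print("medians:", even_median, odd_median)
print("means:", round(even_mean, 3), round(odd_mean, 3))
```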
34.
34
Definitions
Mode
the score that occurs most frequently
Bimodal
Multimodal
No Mode
denoted by M
the only measure of central tendency that can be
used with nominal data
35.
35
Examples
a. 5 5 5 3 1 5 1 4 3 5          Mode is 5
b. 1 2 2 2 3 4 5 6 6 6 7 9      Bimodal - 2 and 6
c. 1 2 3 6 7 8 9 10             No Mode
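The three examples can be reproduced with a small helper built on `collections.Counter`. The function name `modes` is illustrative, and returning an empty list for "no mode" is a convention chosen for this sketch (the standard library's `statistics.multimode` would instead return every value when nothing repeats).

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

example_a = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5]
example_b = [1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]
example_c = [1, 2, 3, 6, 7, 8, 9, 10]

print(modes(example_a))  # one mode
print(modes(example_b))  # bimodal
print(modes(example_c))  # no mode
```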
36.
36
Midrange
the value midway between the highest and
lowest values in the original data set
Definitions
37.
37
Midrange
the value midway between the highest and
lowest values in the original data set
Definitions
$\text{Midrange} = \dfrac{\text{highest score} + \text{lowest score}}{2}$
38.
38
Symmetric
Data is symmetric if the left half of its
histogram is roughly a mirror of its
right half.
Skewed
Data is skewed if it is not symmetric
and if it extends more to one side than
the other.
Definitions
40.
40
Skewness
[Figure: two distribution curves]
SKEWED LEFT (negatively): mean and median to the left of the mode
SYMMETRIC: Mode = Mean = Median
41.
41
Skewness
[Figure: three distribution curves]
SKEWED LEFT (negatively): mean and median to the left of the mode
SYMMETRIC: Mode = Mean = Median
SKEWED RIGHT (positively): mean and median to the right of the mode
42.
42
Waiting Times of Bank Customers
at Different Banks
in minutes
Jefferson Valley Bank     Bank of Providence
        6.5                      4.2
        6.6                      5.4
        6.7                      5.8
        6.8                      6.2
        7.1                      6.7
        7.3                      7.7
        7.4                      7.7
        7.7                      8.5
        7.7                      9.3
        7.7                     10.0
43.
43
Waiting Times of Bank Customers
at Different Banks
in minutes
Jefferson Valley Bank     Bank of Providence
        6.5                      4.2
        6.6                      5.4
        6.7                      5.8
        6.8                      6.2
        7.1                      6.7
        7.3                      7.7
        7.4                      7.7
        7.7                      8.5
        7.7                      9.3
        7.7                     10.0
             Jefferson Valley Bank     Bank of Providence
Mean                 7.15                     7.15
Median               7.20                     7.20
Mode                 7.7                      7.7
Midrange             7.10                     7.10
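The four measures of center for both banks can be recomputed with Python's standard `statistics` module. The helper name `center_summary` is illustrative, and results are rounded to two decimal places to match the slide.

```python
import statistics

# Waiting times in minutes, as listed on the slide.
jefferson = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def center_summary(values):
    """Mean, median, mode, and midrange of a data set, rounded to 2 places."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": round(statistics.median(values), 2),
        "mode": statistics.mode(values),
        "midrange": round((max(values) + min(values)) / 2, 2),
    }

print("Jefferson Valley Bank:", center_summary(jefferson))
print("Bank of Providence:   ", center_summary(providence))
```

Both banks produce identical measures of center even though the individual waiting times are clearly spread out differently, which is the motivation for measures of variation.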
51.
51
Population Standard Deviation
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
calculators can compute the
population standard deviation
of data
52.
52
Measures of Variation
Variance
standard deviation squared
53.
53
Measures of Variation
Variance
standard deviation squared
Notation
$s^2$ and $\sigma^2$ (use square key on calculator)
54.
54
Variance
Sample Variance
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
Population Variance
$\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
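The two variance formulas differ only in the divisor, which a short sketch makes concrete. The function names are illustrative; the data are the bank waiting times from the earlier slide (they are samples, so the population formula is applied here only to show the difference in the divisor). Python's `statistics.variance` and `statistics.pvariance` compute the same quantities.

```python
import math
import statistics

def sample_variance(values):
    """s^2: sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """sigma^2: sum of squared deviations from the mean, divided by N."""
    size = len(values)
    mu = sum(values) / size
    return sum((x - mu) ** 2 for x in values) / size

jefferson = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

for name, data in [("Jefferson Valley Bank", jefferson),
                   ("Bank of Providence", providence)]:
    variance = sample_variance(data)
    # The standard deviation is the square root of the variance.
    print(f"{name}: s^2 = {variance:.2f}, s = {math.sqrt(variance):.2f}")
```

Although the two banks share the same mean, the sample standard deviations differ by roughly a factor of four.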
55.
55
Estimation of Standard Deviation
Range Rule of Thumb
$\bar{x} - 2s$ (minimum usual value)     $\bar{x}$     $\bar{x} + 2s$ (maximum usual value)
Range ≈ 4s
56.
56
Estimation of Standard Deviation
Range Rule of Thumb
$\bar{x} - 2s$ (minimum usual value)     $\bar{x}$     $\bar{x} + 2s$ (maximum usual value)
Range ≈ 4s
or
$s \approx \dfrac{\text{Range}}{4} = \dfrac{\text{highest value} - \text{lowest value}}{4}$
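The rule of thumb works in both directions, as the sketch below illustrates. The function names are made up for this example, the data are the Bank of Providence waiting times from the earlier slide, and the value 1.82 is that sample's standard deviation rounded to two places.

```python
def estimate_std_from_range(values):
    """Rough estimate of s: (highest value - lowest value) / 4."""
    return (max(values) - min(values)) / 4

def usual_value_bounds(mean, s):
    """Minimum and maximum 'usual' values: mean - 2s and mean + 2s."""
    return mean - 2 * s, mean + 2 * s

providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

rough_s = estimate_std_from_range(providence)
low, high = usual_value_bounds(7.15, 1.82)

print(f"rough estimate of s: {rough_s:.2f}")
print(f"usual values between {low:.2f} and {high:.2f}")
```

The estimate is only a rough guide: here it lands in the right neighborhood of the computed standard deviation without matching it.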
57.
57
The Empirical Rule
(applies to bell-shaped distributions)
FIGURE 2-15 [bell curve centered at $\bar{x}$]
58.
58
The Empirical Rule
(applies to bell-shaped distributions)
FIGURE 2-15 [bell curve; horizontal axis marked $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$]
68% within 1 standard deviation (34% on each side of the mean)
59.
59
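The Empirical Rule
(applies to bell-shaped distributions)
FIGURE 2-15 [bell curve; horizontal axis marked $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$]
68% within 1 standard deviation
95% within 2 standard deviations
Band percentages on each side of the mean, moving outward: 34%, 13.5%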
60.
60
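The Empirical Rule
(applies to bell-shaped distributions)
FIGURE 2-15 [bell curve; horizontal axis marked $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$]
68% within 1 standard deviation
95% within 2 standard deviations
99.7% of data are within 3 standard deviations of the mean
Band percentages on each side of the mean, moving outward: 34%, 13.5%, 2.4%, 0.1%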
61.
61
Chebyshev’s Theorem
applies to distributions of any shape.
the proportion (or fraction) of any set of data
lying within K standard deviations of the mean
is always at least $1 - 1/K^2$, where K is any
positive number greater than 1.
at least 3/4 (75%) of all values lie within 2
standard deviations of the mean.
at least 8/9 (89%) of all values lie within 3
standard deviations of the mean.
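The bound in Chebyshev's Theorem is a one-line calculation, sketched here with an illustrative function name. The observed fraction for the Bank of Providence waiting times (from the earlier slide) is computed alongside to show that the theorem gives a guaranteed minimum, not an estimate.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum fraction of any data set lying within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]
mean = statistics.mean(providence)
s = statistics.stdev(providence)

# Fraction of the waiting times actually lying within 2 standard deviations.
within_2 = sum(1 for x in providence if abs(x - mean) <= 2 * s) / len(providence)

print("bound for K = 2:", chebyshev_lower_bound(2))
print("bound for K = 3:", round(chebyshev_lower_bound(3), 3))
print("observed within 2s:", within_2)
```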
62.
62
Measures of Variation Summary
For typical data sets, it is unusual for a
score to differ from the mean by more than
2 or 3 standard deviations.
63.
63
z Score (or standard score)
the number of standard deviations
that a given value x is above or below
the mean
Measures of Position
64.
64
Measures of Position
z score
Sample
$z = \dfrac{x - \bar{x}}{s}$
Population
$z = \dfrac{x - \mu}{\sigma}$
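A z score is a single subtraction and division, as the sketch below shows. The function name `z_score` is illustrative, the data are the Bank of Providence waiting times from the earlier slide, and the cutoff of 2 standard deviations for an "unusual" value is taken from the lower end of the "2 or 3 standard deviations" guideline in the variation summary.

```python
import statistics

def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]
x_bar = statistics.mean(providence)
s = statistics.stdev(providence)  # sample standard deviation (n - 1 divisor)

z_longest = z_score(10.0, x_bar, s)
z_shortest = z_score(4.2, x_bar, s)

for label, z in [("longest wait (10.0 min)", z_longest),
                 ("shortest wait (4.2 min)", z_shortest)]:
    flag = "unusual" if abs(z) > 2 else "not unusual"
    print(f"{label}: z = {z:.2f} ({flag})")
```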
70.
70
Quartiles, Deciles, Percentiles
Fractiles
(Quantiles)
partitions data into approximately equal parts
71.
71
Exploratory Data Analysis
the process of using statistical tools
(such as graphs, measures of center,
and measures of variation) to investigate
the data sets in order to understand their
important characteristics
72.
72
Outliers
a value located very far away from
almost all of the other values
an extreme value
can have a dramatic effect on the mean,
standard deviation, and on the scale of
the histogram so that the true nature of
the distribution is totally obscured
73.
73
Boxplots
(Box-and-Whisker Diagram)
Reveals the:
center of the data
spread of the data
distribution of the data
presence of outliers
Excellent for comparing two
or more data sets
74.
74
Boxplots
5 - number summary
Minimum
first quartile Q1
Median (Q2)
third quartile Q3
Maximum
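A 5-number summary can be sketched as follows. Quartile conventions differ between textbooks and software; this sketch assumes the "median of each half" convention (excluding the overall median from both halves when n is odd), which may not match the procedure in Section 2-6 exactly. The function names are illustrative and the data are the Bank of Providence waiting times from the earlier slide.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """(minimum, Q1, median, Q3, maximum) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, skip the middle value so neither half contains the median.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]
minimum, q1, q2, q3, maximum = five_number_summary(providence)
print(minimum, q1, round(q2, 2), q3, maximum)
```

These five values are the ones a boxplot draws: the whiskers run from the minimum to the maximum, and the box spans Q1 to Q3 with a line at the median.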
77.
77
Exploring
Measures of center: mean, median, and mode
Measures of variation: Standard deviation and
range
Measures of spread and relative location:
minimum value, maximum value, and quartiles
Unusual values: outliers
Distribution: histograms, stem-and-leaf plots, and
boxplots