4. Basic Concepts of Statistics
Statistics is concerned with:
• Processing and analyzing data
• Collecting, presenting, and transforming
data to assist decision makers
5. Key Definitions
• A population (universe) is the collection of
all members of a group
• A sample is a portion of the population
selected for analysis
• A parameter is a numerical measure that
describes a characteristic of a population
• A statistic is a numerical measure that
describes a characteristic of a sample
6. Population vs. Sample
[Figure: a population made up of the letters a through z, and a sample of nine of them (b, c, g, i, n, o, r, u, y) selected from it]
Measures used to describe a
population are called
parameters
Measures computed from
sample data are called
statistics
7. Two Branches of Statistics
• Descriptive statistics
– Collecting, summarizing, and presenting data
• Inferential statistics
– Drawing conclusions about a population based only
on sample data
8. Descriptive Statistics
• Collect data
– e.g., Survey
• Present data
– e.g., Tables and graphs
• Characterize data
– e.g., Sample mean = \bar{X} = \frac{\sum X_i}{n}
9. Inferential Statistics
• Estimation
– e.g., Estimate the population
mean weight using the sample
mean weight
• Hypothesis testing
– e.g., Test the claim that the
population mean weight is 120
pounds
Drawing conclusions about a population based on sample results.
11. Types of Data
Data
• Categorical (defined categories)
– Examples: Marital Status, Political Party, Eye Color
• Numerical
– Discrete (counted items). Examples: Number of Children, Defects per hour
– Continuous (measured characteristics). Examples: Weight, Voltage
12. Levels of Measurement
and Measurement Scales
• Ratio Data (highest level, strongest form of measurement): differences between measurements, true zero exists
• Interval Data (higher level): differences between measurements but no true zero
• Ordinal Data (higher level): ordered categories (rankings, order, or scaling)
• Nominal Data (lowest level, weakest form of measurement): categories (no ordering or direction)
13. Levels of Measurement
and Measurement Scales
EXAMPLES:
• Ratio Data (differences between measurements, true zero exists): Height, Age, Weekly Food Spending
• Interval Data (differences between measurements but no true zero): Temperature in Fahrenheit, Standardized exam score
• Ordinal Data (ordered categories: rankings, order, or scaling): Service quality rating, Standard & Poor’s bond rating, Student letter grades
• Nominal Data (categories, no ordering or direction): Marital status, Type of car owned
14. Organizing and Presenting
Data Graphically
• Data in raw form are usually not easy to use for decision
making
– Some type of organization is needed
• Table
• Graph
• Techniques reviewed here:
– Bar charts and pie charts
– Pareto diagram
– Ordered array
– Stem-and-leaf display
– Frequency distributions, histograms and polygons
– Cumulative distributions and ogives
– Contingency tables
– Scatter diagrams
15. Tables and Charts for Categorical
Data
Categorical Data
• Tabulating Data: Summary Table
• Graphing Data: Bar Charts, Pie Charts, Pareto Diagram
16. The Summary Table
Example: Current Investment Portfolio
Summarize data by category (variables are categorical)

Investment Type    Amount (in thousands $)    Percentage (%)
Stocks             46.5                       42.27
Bonds              32.0                       29.09
CD                 15.5                       14.09
Savings            16.0                       14.55
Total              110.0                      100.00
17. Bar and Pie Charts
• Bar charts and Pie charts are often used
for qualitative data (categories or
nominal scale)
• Height of bar or size of pie slice shows
the frequency or percentage for each
category
18. Bar Chart Example
[Figure: horizontal bar chart "Investor's Portfolio" for the Current Investment Portfolio; categories: Stocks, Bonds, CD, Savings; horizontal axis: Amount in $1000's, 0 to 50]
Bar lengths: Stocks 46.5, Bonds 32.0, CD 15.5, Savings 16.0 (total 110.0)
19. Pie Chart Example
[Figure: pie chart "Current Investment Portfolio"; Stocks 42%, Bonds 29%, Savings 15%, CD 14%]
Percentages are rounded to the nearest percent (from 42.27, 29.09, 14.55, and 14.09)
20. Pareto Diagram
• Used to portray categorical data (nominal scale)
• A bar chart, where categories are shown in
descending order of frequency
• A cumulative polygon is often shown in the same
graph
• Used to separate the “vital few” from the “trivial
many”
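The ordering and the cumulative polygon of a Pareto diagram can be sketched in a few lines of Python. This is only an illustration: the function name `pareto_table` is made up here, and the portfolio amounts from the summary table are used as the category weights in place of frequencies.

```python
def pareto_table(amounts):
    """Return (category, percent, cumulative percent) rows, largest category first."""
    total = sum(amounts.values())
    rows = []
    cumulative = 0.0
    # Sort categories in descending order of amount, as a Pareto diagram does.
    for category, amount in sorted(amounts.items(), key=lambda kv: kv[1], reverse=True):
        percent = 100.0 * amount / total
        cumulative += percent
        rows.append((category, round(percent, 2), round(cumulative, 2)))
    return rows


# Investment amounts (in thousands of dollars) from the summary table.
portfolio = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

for row in pareto_table(portfolio):
    print(row)
```

The first rows carry most of the total (the "vital few"), and the cumulative column is what the cumulative polygon would plot.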
22. Tables and Charts for
Numerical Data
Numerical Data
• Ordered Array
• Stem-and-Leaf Display
• Frequency Distributions and Cumulative Distributions: Histogram, Polygon, Ogive
23. The Ordered Array
A sequence of data in rank order:
Shows range (min to max)
Provides some signals about variability
within the range
May help identify outliers (unusual observations)
If the data set is large, the ordered array is
less useful
24. The Ordered Array (continued)
• Data in raw form (as collected):
24, 26, 24, 21, 27, 27, 30, 41, 32, 38
• Data in ordered array from smallest to largest:
21, 24, 24, 26, 27, 27, 30, 32, 38, 41
25. Tabulating Numerical Data: Frequency Distributions
What is a Frequency Distribution?
• A frequency distribution is a list or a table
• containing class groupings (ranges within which the data fall)
• and the corresponding frequencies with which data fall within each grouping or category
26. Why Use a Frequency Distribution?
• It is a way to summarize numerical data
• It condenses the raw data into a more
useful form...
• It allows for a quick visual interpretation
of the data
27. Class Intervals
and Class Boundaries
• Each class grouping has the same width
• Determine the width of each interval by
Width of interval ≈ \frac{range}{number of desired class groupings}
• Usually at least 5 but no more than 15 groupings
• Class boundaries never overlap
• Round up the interval width to get desirable endpoints
28. Frequency Distribution Example
Example: A manufacturer of insulation
randomly selects 20 winter days and records
the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27
29. Frequency Distribution Example (continued)
• Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
• Compute class interval (width): 10 (46/5 = 9.2, then round up)
• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
• Compute class midpoints: 15, 25, 35, 45, 55
• Count observations & assign to classes
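The counting step can be illustrated with a short Python sketch. The function name `frequency_distribution` and its parameters are assumptions made for this example. The class boundaries (starting at 10, width 10, five classes) are the ones chosen in the steps above.

```python
def frequency_distribution(data, lower, width, n_classes):
    """Count observations in the classes [lower, lower + width), [lower + width, ...), and so on."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - lower) // width)  # index of the class that x falls in
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts


# Daily high temperatures for the 20 winter days in the example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

counts = frequency_distribution(temps, lower=10, width=10, n_classes=5)
for i, count in enumerate(counts):
    low = 10 + 10 * i
    print(f"{low} but less than {low + 10}: {count} ({100 * count / len(temps):.0f}%)")
```

Because each class is closed on the left and open on the right ("10 but less than 20"), a value such as 30 is counted only in the 30 to 40 class, so the boundaries never overlap.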
30. Frequency Distribution Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Frequency    Relative Frequency    Percentage
10 but less than 20    3            .15                   15
20 but less than 30    6            .30                   30
30 but less than 40    5            .25                   25
40 but less than 50    4            .20                   20
50 but less than 60    2            .10                   10
Total                  20           1.00                  100
31. Tabulating Numerical Data:
Cumulative Frequency
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                  Frequency    Percentage    Cumulative Frequency    Cumulative Percentage
10 but less than 20    3            15            3                       15
20 but less than 30    6            30            9                       45
30 but less than 40    5            25            14                      70
40 but less than 50    4            20            18                      90
50 but less than 60    2            10            20                      100
Total                  20           100
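The cumulative columns are running totals of the class frequencies. A minimal Python sketch follows, with variable names chosen only for illustration and the class frequencies taken from the temperature example:

```python
from itertools import accumulate

frequencies = [3, 6, 5, 4, 2]  # class frequencies from the temperature example
total = sum(frequencies)

# Running total of the frequencies, class by class.
cumulative_frequency = list(accumulate(frequencies))

# Each running total expressed as a percentage of all observations.
cumulative_percentage = [100 * c / total for c in cumulative_frequency]

print(cumulative_frequency)   # [3, 9, 14, 18, 20]
print(cumulative_percentage)  # [15.0, 45.0, 70.0, 90.0, 100.0]
```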
32. Graphing Numerical Data:
The Histogram
• A graph of the data in a frequency distribution is
called a histogram
• The class boundaries (or class midpoints) are
shown on the horizontal axis
• The vertical axis shows either frequency, relative
frequency, or percentage
• Bars of the appropriate heights are used to
represent the number of observations within
each class
33. Histogram: Daily High Temperature
[Figure: histogram example; horizontal axis: class midpoints, 5 to 65; vertical axis: frequency, 0 to 7; no gaps between bars]

Class                  Class Midpoint    Frequency
10 but less than 20    15                3
20 but less than 30    25                6
30 but less than 40    35                5
40 but less than 50    45                4
50 but less than 60    55                2
34. Frequency Polygon: Daily High Temperature
Graphing Numerical Data: The Frequency Polygon
[Figure: frequency polygon; horizontal axis: class midpoints, 5 to 65; vertical axis: frequency, 0 to 7]

Class                  Class Midpoint    Frequency
10 but less than 20    15                3
20 but less than 30    25                6
30 but less than 40    35                5
40 but less than 50    45                4
50 but less than 60    55                2

(In a percentage polygon the vertical axis would be defined to show the percentage of observations per class)
35. Graphing Cumulative Frequencies:
The Ogive (Cumulative % Polygon)
[Figure: ogive "Daily High Temperature"; horizontal axis: class boundaries (not midpoints), 10 to 60; vertical axis: cumulative percentage, 0 to 100]
Each cumulative percentage is plotted at the upper boundary of its class:

Class                  Upper class boundary    Cumulative Percentage
Less than 10           10                      0
10 but less than 20    20                      15
20 but less than 30    30                      45
30 but less than 40    40                      70
40 but less than 50    50                      90
50 but less than 60    60                      100
36. Tabulating and Graphing
Multivariate Categorical Data
• Contingency Table for Investment Choices ($1000’s)
Investment Category    Investor A    Investor B    Investor C    Total
Stocks                 46.5          55            27.5          129
Bonds                  32.0          44            19.0          95
CD                     15.5          20            13.5          49
Savings                16.0          28            7.0           51
Total                  110.0         147           67.0          324
(Individual values could also be expressed as percentages of the overall total,
percentages of the row totals, or percentages of the column totals)
37. Tabulating and Graphing Multivariate Categorical Data (continued)
• Side-by-side bar charts
[Figure: side-by-side bar chart "Comparing Investors"; categories: Stocks, Bonds, CD, Savings; one bar each for Investor A, Investor B, Investor C; horizontal axis 0 to 60]
38. Side-by-Side Chart Example
• Sales by quarter for three sales territories:
[Figure: side-by-side bar chart by quarter with bars for East, West, and North; vertical axis 0 to 60]

         1st Qtr    2nd Qtr    3rd Qtr    4th Qtr
East     20.4       27.4       59         20.4
West     30.6       38.6       34.6       31.6
North    45.9       46.9       45         43.9
39. Scatter Diagrams
• Scatter Diagrams are used to examine possible relationships between two numerical variables
• The Scatter Diagram:
– one variable is measured on the vertical axis and the other variable is measured on the horizontal axis
40. Scatter Diagram Example
Cost per Day vs. Production Volume
[Figure: scatter diagram; horizontal axis: Volume per Day, 0 to 70; vertical axis: Cost per Day, 0 to 250]

Volume per day    Cost per day
23                131
24                120
26                140
29                151
33                160
38                167
41                185
42                170
50                188
55                195
60                200
41. Time Series Plot
• A Time Series Plot is used to study patterns in the values of a variable over time
• The Time Series Plot:
– one variable is measured on the vertical axis and the time period is measured on the horizontal axis
42. Time Series Plot Example
Number of Franchises, 1996-2004
[Figure: time series plot; horizontal axis: Year, 1994 to 2006; vertical axis: Number of Franchises, 0 to 120]

Year    Number of Franchises
1996    43
1997    54
1998    60
1999    73
2000    82
2001    95
2002    107
2003    99
2004    95
44. Measures of Central Tendency
Overview
• Arithmetic Mean: \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}
• Geometric Mean: \bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}
• Median: midpoint of ranked values
• Mode: most frequently observed value
45. Arithmetic Mean
• The arithmetic mean (mean) is the most
common measure of central tendency
– For a sample of size n:
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}
where n is the sample size and X_1, X_2, ..., X_n are the observed values
46. Arithmetic Mean (continued)
• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
– Values 1, 2, 3, 4, 5: Mean = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3
– Values 1, 2, 3, 4, 10: Mean = \frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4
47. Median
• In an ordered array, the median is the “middle”
number (50% above, 50% below)
• Not affected by extreme values
[Figure: two dot plots on a 0 to 10 scale, each with Median = 3]
48. Finding the Median
• The location of the median:
Median position = \frac{n + 1}{2} position in the ordered data
– If the number of values is odd, the median is the middle number
– If the number of values is even, the median is the average of the two middle numbers
• Note that \frac{n + 1}{2} is not the value of the median, only the position of the median in the ranked data
49. Mode
• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical
(nominal) data
• There may be no mode
• There may be several modes
[Figure: two dot plots; on a 0 to 14 scale, Mode = 9; on a 0 to 6 scale, No Mode]
50. Review Example
• Five houses on a hill by the beach
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
51. Review Example:
Summary Statistics
• Mean: ($3,000,000/5)
= $600,000
• Median: middle value of ranked data
= $300,000
• Mode: most frequent value
= $100,000
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum: $3,000,000)
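The three summary statistics for the house prices can be checked with Python's standard `statistics` module. This is a small sketch, and the variable names are chosen only for illustration.

```python
import statistics

# House prices from the review example, in dollars.
house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(house_prices)      # sum of values / number of values
median_price = statistics.median(house_prices)  # middle value of the ranked data
mode_price = statistics.mode(house_prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean well above the median, which is the effect of an extreme value on the mean.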
52. Which measure of location is the “best”?
• Mean is generally used, unless extreme values (outliers) exist
• Then median is often used, since the median is not sensitive to extreme values.
– Example: Median home prices may be reported for a region, since the median is less sensitive to outliers
53. Quartiles
• Quartiles split the ranked data into 4 segments with an
equal number of values per segment
[Figure: ranked data split by Q1, Q2, and Q3 into four segments of 25% each]
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3
54. Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values
55. Quartiles
Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values,
so Q1 = 12.5
Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency
56. Quartiles (continued)
Example:
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data,
so Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
so Q3 = 19.5
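The position rule can be written out as a short Python sketch. The function name `quartile` is hypothetical. The sketch also assumes linear interpolation for any fractional position, which matches the "halfway between" rule used here for positions ending in .5 but is an assumption for other fractions, and it assumes at least three observations.

```python
import math


def quartile(ordered, k):
    """k-th quartile (k = 1, 2, 3) located at position k(n+1)/4 of the ranked data."""
    n = len(ordered)
    position = k * (n + 1) / 4  # 1-based position in the ordered array
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]  # the position is a whole number
    # Assumption: interpolate linearly between the two neighbouring ranked values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
print(q1, q2, q3)  # 12.5 16 19.5
```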
57. Measures of Variation
Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation
Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center, different variation]
58. Range
• Simplest measure of variation
• Difference between the largest and the
smallest values in a set of data:
Range = X_{largest} - X_{smallest}
Example:
[Figure: dot plot on a 0 to 14 scale]
Range = 14 - 1 = 13
59. Disadvantages of the Range
• Ignores the way in which data are distributed
[Figure: two dot plots on a 7 to 12 scale with differently distributed values; Range = 12 - 7 = 5 in both]
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
63. Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data
– Sample standard deviation:
S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}
64. Calculation Example:
Sample Standard Deviation
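Sample Data (X_i): 10 12 14 15 17 18 18 24
n = 8, Mean = \bar{X} = 16
S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}
  = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}
  = \sqrt{\frac{130}{7}} = 4.3095
A measure of the “average” scatter around the mean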
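The calculation can be reproduced step by step in Python. The sketch below uses a helper named `sample_std` (a name chosen for this example) and divides by n - 1, as the sample standard deviation formula does.

```python
import math


def sample_std(values):
    """Sample standard deviation: square root of the sum of squared deviations over n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


data = [10, 12, 14, 15, 17, 18, 18, 24]
s = sample_std(data)
print(round(s, 4))  # 4.3095
```

The standard library function `statistics.stdev` computes the same quantity.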
66. Comparing Standard Deviations
[Figure: three dot plots on an 11 to 21 scale with the same mean and different spreads]
• Data A: Mean = 15.5, S = 3.338
• Data B: Mean = 15.5, S = 0.926
• Data C: Mean = 15.5, S = 4.567
67. Advantages of Variance and
Standard Deviation
• Each value in the data set is used in the
calculation
• Values far from the mean are given extra
weight
(because deviations from the mean are squared)
68. Coefficient of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Can be used to compare two or more sets of
data measured in different units
CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%
69. Comparing Coefficient
of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5
CV_A = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%
• Stock B:
– Average price last year = $100
– Standard deviation = $5
CV_B = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%
Both stocks have the same standard deviation, but stock B is less variable relative to its price
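The comparison of the two stocks can be written as a one-line function in Python. The name `coefficient_of_variation` is chosen for this sketch only.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(std_dev=5, mean=50)   # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)  # Stock B
print(cv_a, cv_b)  # 10.0 5.0
```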
70. Z Scores
• A measure of distance from the mean (for example, a Z-
score of 2.0 means that a value is 2.0 standard deviations
from the mean)
• The difference between a value and the mean, divided by
the standard deviation
• A Z score above 3.0 or below -3.0 is considered an outlier
Z = \frac{X - \bar{X}}{S}
71. Z Scores (continued)
Example:
• If the mean is 14.0 and the standard deviation is 3.0, what is the Z score for the value 18.5?
Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5
• The value 18.5 is 1.5 standard deviations above the mean
• (A negative Z-score would mean that a value is less than the mean)
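A minimal Python sketch of the Z score calculation follows. The helper names `z_score` and `is_outlier` are made up for this example, and the outlier check uses the ±3.0 guideline given for Z scores.

```python
def z_score(value, mean, std_dev):
    """Distance of a value from the mean, measured in standard deviations."""
    return (value - mean) / std_dev


def is_outlier(z, threshold=3.0):
    """Flag Z scores above +threshold or below -threshold."""
    return abs(z) > threshold


z = z_score(18.5, mean=14.0, std_dev=3.0)
print(z, is_outlier(z))  # 1.5 False
```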
72. Shape of a Distribution
• Describes how data are distributed
• Measures of shape
– Symmetric or skewed
• Left-Skewed: Mean < Median
• Symmetric: Mean = Median
• Right-Skewed: Median < Mean
73. Using Microsoft Excel
• Descriptive Statistics can be obtained from
Microsoft® Excel
– Use menu choice:
tools / data analysis / descriptive statistics
– Enter details in dialog box
77. Numerical Measures
for a Population
• Population summary measures are called parameters
• The population mean is the sum of the values in the
population divided by the population size, N
\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}
Where
μ = population mean
N = population size
X_i = ith value of the variable X
78. Population Variance
• Average of squared deviations of values from the mean
– Population variance:
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
Where μ = population mean
N = population size
X_i = ith value of the variable X
79. Population Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the population variance
• Has the same units as the original data
– Population standard deviation:
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}
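The only computational difference from the sample formulas is the divisor: N for a population, n - 1 for a sample. The Python sketch below shows this with hypothetical helper names. Purely for illustration, it treats the eight values from the earlier sample calculation as if they were a complete population.

```python
import math


def population_variance(values):
    """Average of squared deviations from the population mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


# Assumption for illustration: these eight values are the whole population.
values = [10, 12, 14, 15, 17, 18, 18, 24]

sigma_squared = population_variance(values)  # 130 / 8
sigma = population_std(values)
print(sigma_squared, round(sigma, 4))
```

Dividing the same sum of squared deviations (130) by 7 instead of 8 gives the larger sample variance.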
80. The Empirical Rule
• If the data distribution is approximately bell-shaped, then the interval:
• \mu \pm 1\sigma contains about 68% of the values in the population or the sample
81. The Empirical Rule (continued)
• \mu \pm 2\sigma contains about 95% of the values in the population or the sample
• \mu \pm 3\sigma contains about 99.7% of the values in the population or the sample
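The three intervals of the empirical rule are easy to tabulate in Python. In the sketch below the function name `empirical_rule_intervals` and the example values (mean 14.0, standard deviation 3.0) are assumptions for illustration. The percentages apply only if the data are approximately bell-shaped.

```python
def empirical_rule_intervals(mean, std_dev):
    """Return {k: (lower, upper, approximate % of values)} for k = 1, 2, 3 standard deviations."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, percent)
        for k, percent in coverage.items()
    }


# Hypothetical bell-shaped data with mean 14.0 and standard deviation 3.0.
intervals = empirical_rule_intervals(mean=14.0, std_dev=3.0)
for k, (low, high, percent) in intervals.items():
    print(f"within {k} standard deviation(s): {low} to {high}, about {percent}% of values")
```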
82. Exploratory Data Analysis
• Box-and-Whisker Plot: A Graphical display of
data using 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
Example:
[Figure: box-and-whisker plot marking Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum; each of the four segments holds 25% of the data]
83. Shape of Box-and-Whisker Plots
• The Box and central line are centered between the
endpoints if data are symmetric around the median
• A Box-and-Whisker plot can be shown in either vertical or
horizontal format
[Figure: horizontal box-and-whisker plot labeled Min, Q1, Median, Q3, Max]
85. Box-and-Whisker Plot Example
• Below is a Box-and-Whisker plot for the following data:
0 2 2 2 3 3 4 5 5 10 27
[Figure: box-and-whisker plot with Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27]
• The data are right skewed, as the plot depicts
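The five-number summary behind this plot can be computed with the standard library. The sketch below uses `statistics.quantiles` with `method="exclusive"`, which places quartiles at the (n+1)-based positions used in the quartile formulas. The variable names are illustrative.

```python
import statistics

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]

# Quartile cut points at positions (n+1)/4, (n+1)/2, and 3(n+1)/4 of the ranked data.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)  # (0, 2.0, 3.0, 5.0, 27)
```

The long distance from Q3 (5) to the maximum (27), compared with the distance from the minimum (0) to Q1 (2), is what makes the plot look right skewed.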
86. The Sample Covariance
• The sample covariance measures the strength of the linear relationship between two variables (called bivariate data)
• The sample covariance:
cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
– Only concerned with the strength of the relationship
– No causal effect is implied
87. Interpreting Covariance
• Covariance between two random variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (zero covariance alone does not show that they are independent)
88. Coefficient of Correlation
• Measures the relative strength of the linear
relationship between two variables
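• Sample coefficient of correlation:
r = \frac{cov(X, Y)}{S_X S_Y}
where
cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}
S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}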
89. Features of
Correlation Coefficient, r
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
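The sample covariance and the coefficient of correlation can be computed directly from their definitions. The Python sketch below uses hypothetical helper names and, as example data, the volume and cost figures from the scatter diagram example.

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation, written as the square root of cov(X, X)."""
    return math.sqrt(sample_covariance(values, values))


def correlation(xs, ys):
    """Sample coefficient of correlation r = cov(X, Y) / (S_X * S_Y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Volume per day and cost per day from the scatter diagram example.
volume = [23, 24, 26, 29, 33, 38, 41, 42, 50, 55, 60]
cost = [131, 120, 140, 151, 160, 167, 185, 170, 188, 195, 200]

r = correlation(volume, cost)
print(f"cov = {sample_covariance(volume, cost):.2f}, r = {r:.3f}")
```

Dividing by both standard deviations removes the units, which is why r always lies between -1 and 1 while the covariance depends on the measurement scale.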
90. Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +.3, r = +1, and a second pattern with r = 0]
91. Using Excel to Find
the Correlation Coefficient
• Select
Tools/Data Analysis
• Choose Correlation from
the selection menu
• Click OK . . .
92. Using Excel to Find
the Correlation Coefficient (continued)
• Input data range and select
appropriate options
• Click OK to get output
93. Interpreting the Result
• r = .733
• There is a relatively
strong positive linear
relationship between
test score #1
and test score #2
• Students who scored high on the first test tended to
score high on second test, and students who scored
low on the first test tended to score low on the
second test
[Figure: scatter plot of test scores; horizontal axis: Test #1 Score, 70 to 100; vertical axis: Test #2 Score, 70 to 100]