Lecture 2 Descriptive statistics.pptx

Part II
Professor Friedman's Statistics Course by H & L Friedman is licensed under a
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

 A third important property of data – after
location and dispersion - is its shape.
 Shape can be described by degree of
asymmetry (i.e., skewness).
◦ mean > median: positive or right-skewness
◦ mean = median: symmetric or zero-skewness
◦ mean < median: negative or left-skewness
 Positive skewness can arise when the mean is
increased by some unusually high values.
 Negative skewness can arise when the mean is
decreased by some unusually low values.
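The mean-versus-median rule above can be sketched in a few lines of Python. This is only an illustration: the function name `skew_direction` and the three toy samples are hypothetical and do not come from the lecture.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skew by comparing the mean with the median (rule of thumb)."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"      # mean pulled up by unusually high values
    if m < med:
        return "left"       # mean pulled down by unusually low values
    return "symmetric"

# Hypothetical toy samples, chosen only to illustrate the three cases.
high_outlier = [1, 2, 3, 4, 20]     # mean 6 > median 3
low_outlier = [1, 17, 18, 19, 20]   # mean 15 < median 18
balanced = [1, 2, 3, 4, 5]          # mean 3 = median 3

for name, sample in [("high_outlier", high_outlier),
                     ("low_outlier", low_outlier),
                     ("balanced", balanced)]:
    print(name, skew_direction(sample))
```

The comparison is a quick check of direction only; it says nothing about how strong the skew is.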

[Figure: three example distributions, labeled Left skewed, Right skewed, and Symmetric.]
Source: Levine et al., Business Statistics, Pearson, 2013.

Data (for n = 12 employees):
2 3 8 ┋ 8 9 10 ┋ 10 12 15 ┋ 18 22 63
$\bar{X} = 180/12 = 15$ hours
Median = 10 hours
The (extremely slow) employee who took 63 hours to
complete the task skewed the entire distribution to the
right.
$s^2 = 2868/11 = 260.73$
$s = 16.15$ hours
CV = 107.6%
This guy took a VERY long time!
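As a check on the arithmetic for the 12 employees, here is a minimal Python sketch using only the standard library. The variable names are illustrative. Computed directly from the twelve values, the sample variance is 2868/11, about 260.73, and the standard deviation is about 16.15 hours.

```python
from statistics import mean, median, variance, stdev

# Hours taken by the n = 12 employees in the example.
hours = [2, 3, 8, 8, 9, 10, 10, 12, 15, 18, 22, 63]

x_bar = mean(hours)        # 180 / 12
med = median(hours)        # average of the 6th and 7th sorted values
s2 = variance(hours)       # sample variance, divides by n - 1
s = stdev(hours)           # square root of the sample variance
cv = s / x_bar * 100       # coefficient of variation, in percent

print(f"mean={x_bar} median={med} s2={s2:.2f} s={s:.2f} CV={cv:.1f}%")
```

Because the mean (15) is well above the median (10), the output also shows the right skew caused by the 63-hour value.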

 Scores of 17 students on a national calculus
exam. Data:
0, 0, 10, 12, 15, 18, 20, 25, 30, 33, 34, 41, 56,
87, 92, 94, 95
 Open MS Excel.
 Go to Data Analysis—Analysis Tools —
Descriptive Statistics.
 If you do not have Data Analysis-Analysis Tools, you
have to use the Add-in feature and add it to MS Excel.
 Make sure to check the Summary Statistics box
once you are in descriptive statistics.
 See MS Excel Output on next slide.

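MS Excel uses a formula (the adjusted Fisher-Pearson coefficient of
skewness, the same one behind its SKEW function) to calculate skewness.
You do not have to know the formula. If the coefficient is 0 or very
close to it, you have a symmetric distribution; a positive coefficient
indicates right skew and a negative one left skew.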
Column1
Mean 38.94117647
Standard Error 8.111117365
Median 30
Mode 0
Standard Deviation 33.44299364
Sample Variance 1118.433824
Kurtosis -0.82259021
Skewness 0.782252352
Range 95
Minimum 0
Maximum 95
Sum 662
Count 17
From the output:
• mean is 38.94
• median is 30
• mode is 0
• standard deviation is 33.44
• variance is 1118.43
• skewness is .78 (positive)
• range is 95
• n is 17
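For readers without Excel, most of the summary output can be reproduced with Python's standard library. The sketch below is an approximation of the tool, not its implementation. It assumes the skewness figure is the adjusted Fisher-Pearson coefficient, n/((n-1)(n-2)) times the sum of cubed standardized deviations, which reproduces the 0.78 shown. The helper name `sample_skewness` is hypothetical, and kurtosis is left out.

```python
from math import sqrt
from statistics import mean, median, mode, stdev, variance

# Scores of the 17 students on the calculus exam.
scores = [0, 0, 10, 12, 15, 18, 20, 25, 30, 33, 34, 41, 56, 87, 92, 94, 95]

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness (assumed to match Excel's output)."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)

summary = {
    "Mean": mean(scores),
    "Standard Error": stdev(scores) / sqrt(len(scores)),
    "Median": median(scores),
    "Mode": mode(scores),
    "Standard Deviation": stdev(scores),
    "Sample Variance": variance(scores),
    "Skewness": sample_skewness(scores),
    "Range": max(scores) - min(scores),
    "Minimum": min(scores),
    "Maximum": max(scores),
    "Sum": sum(scores),
    "Count": len(scores),
}

for label, value in summary.items():
    print(f"{label:<20}{value}")
```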

 We can convert the original scores to new
scores with $\bar{X} = 0$ and $s = 1$.
 This will give us a pure number with no
units of measurement.
 Any score below the mean will now be
negative.
 Any score at the mean will be 0.
 Any score above the mean will be positive.

To compute the Z-scores:
$$Z = \frac{X - \bar{X}}{s}$$
Example.
Data: 0, 2, 4, 6, 8, 10
$\bar{X} = 30/6 = 5$; $s = 3.74$
X | (X − 5)/3.74 | Z
0 | (0 − 5)/3.74 | -1.34
2 | (2 − 5)/3.74 | -.80
4 | (4 − 5)/3.74 | -.27
6 | (6 − 5)/3.74 | .27
8 | (8 − 5)/3.74 | .80
10 | (10 − 5)/3.74 | 1.34
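The same standardization can be sketched in Python. The function name `z_scores` is illustrative. It uses the sample standard deviation (dividing by n − 1), which is what gives s = 3.74 in the example.

```python
from statistics import mean, stdev

def z_scores(values):
    """Convert raw scores to Z-scores: subtract the mean, divide by sample s."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

data = [0, 2, 4, 6, 8, 10]
z = z_scores(data)

for x, zi in zip(data, z):
    print(f"{x:>3}  {zi:+.2f}")
```

The standardized values have mean 0 and standard deviation 1, and carry no units.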

 Data: Exam Scores
Original data | Change 7 to 97 | Change 23 to 93
X | Z | X | Z | X | Z
65 | -0.45 | 65 | -0.81 | 65 | -1.40
73 | -0.11 | 73 | -0.38 | 73 | -0.79
78 | 0.10 | 78 | -0.10 | 78 | -0.40
69 | -0.28 | 69 | -0.60 | 69 | -1.09
78 | 0.10 | 78 | -0.10 | 78 | -0.40
7 | -2.89 | <= 97 | 0.94 | 97 | 1.07
23 | -2.21 | 23 | -3.12 | <= 93 | 0.76
98 | 0.94 | 98 | 0.99 | 98 | 1.14
99 | 0.99 | 99 | 1.05 | 99 | 1.22
99 | 0.99 | 99 | 1.05 | 99 | 1.22
97 | 0.90 | 97 | 0.94 | 97 | 1.07
99 | 0.99 | 99 | 1.05 | 99 | 1.22
75 | -0.02 | 75 | -0.27 | 75 | -0.63
79 | 0.14 | 79 | -0.05 | 79 | -0.32
85 | 0.40 | 85 | 0.28 | 85 | 0.14
63 | -0.53 | 63 | -0.92 | 63 | -1.56
67 | -0.36 | 67 | -0.70 | 67 | -1.25
72 | -0.15 | 72 | -0.43 | 72 | -0.86
73 | -0.11 | 73 | -0.38 | 73 | -0.79
93 | 0.73 | 93 | 0.72 | 93 | 0.76
95 | 0.82 | 95 | 0.83 | 95 | 0.91
Mean | 75.57 | Mean | 79.86 | Mean | 83.19
s | 23.75 | s | 18.24 | s | 12.96
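The table shows that a Z-score depends on the whole data set: replacing the two very low scores raises the mean and shrinks s, so the same raw score of 65 moves further below the mean in standard-deviation units. A minimal Python sketch of that effect follows. The helper names `replace_first` and `z_of` are hypothetical.

```python
from statistics import mean, stdev

original = [65, 73, 78, 69, 78, 7, 23, 98, 99, 99, 97,
            99, 75, 79, 85, 63, 67, 72, 73, 93, 95]

def replace_first(values, old, new):
    """Return a copy of values with the first occurrence of old replaced by new."""
    out = list(values)
    out[out.index(old)] = new
    return out

def z_of(x, values):
    """Z-score of x relative to the sample mean and sample s of values."""
    return (x - mean(values)) / stdev(values)

step1 = replace_first(original, 7, 97)    # "Change 7 to 97"
step2 = replace_first(step1, 23, 93)      # "Change 23 to 93"

for label, scores in [("original", original), ("7 -> 97", step1), ("23 -> 93", step2)]:
    print(f"{label:<10} mean={mean(scores):.2f} s={stdev(scores):.2f} "
          f"z(65)={z_of(65, scores):+.2f}")
```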

 No matter what you are measuring, a Z-score of
more than +5 or less than -5 would indicate a
very, very unusual score.
 For standardized data, if it is normally distributed,
about 95% of the data will be within ±2 standard
deviations of the mean.
 If the data follows a normal distribution,
◦ 95% of the data will be between -1.96 and +1.96.
◦ 99.7% of the data will fall between -3 and +3.
◦ 99.99% of the data will fall between -4 and +4.
 Worst case scenario: 75% of the data are between 2
standard deviations about the mean; this is a lower bound that
holds for any distribution, normal or not.
[Chebyshev's inequality.]
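The percentages above can be checked numerically. The sketch below uses the standard identity P(−k < Z < k) = erf(k/√2) for a standard normal variable, and the Chebyshev lower bound 1 − 1/k² for any distribution. The function names are illustrative.

```python
from math import erf, sqrt

def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

def chebyshev_lower_bound(k):
    """Minimum share within k standard deviations for ANY distribution (k > 1)."""
    return 1 - 1 / k ** 2

for k in (1.96, 2, 3, 4):
    print(f"k={k}: normal={normal_coverage(k):.4%}  "
          f"worst case={chebyshev_lower_bound(k):.2%}")
```

For k = 2 the normal figure is a little above 95%, while the worst-case bound is exactly 75%.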

 When examining a distribution for shape,
sometimes the five-number summary is useful:
Smallest | Q1 | Median | Q3 | Largest
 Example:
$\bar{X} = 15$
5-number summary: 2 | 8 | 10 | 16.5 | 63
This data set is right-skewed.
In right-skewed distributions, the distance from Q3 to
X_largest (16.5 to 63) is much greater than the distance
from X_smallest to Q1 (2 to 8).
Data: 2 3 8 8 9 10 10 12 15 18 22 63
[Figure: the sorted data with Smallest (2), Q1 (8), Median (10), Q3 (16.5), and Largest (63) marked.]
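A short Python sketch of the five-number summary follows. Quartile conventions differ between textbooks and software. This version takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, an assumption chosen because it reproduces the 8 and 16.5 in the example. The function name is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Smallest, Q1, median, Q3, largest, with quartiles as medians of the halves."""
    xs = sorted(values)
    half = len(xs) // 2
    lower = xs[:half]
    # For an odd count, leave the middle value out of both halves.
    upper = xs[half:] if len(xs) % 2 == 0 else xs[half + 1:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

hours = [2, 3, 8, 8, 9, 10, 10, 12, 15, 18, 22, 63]
smallest, q1, med, q3, largest = five_number_summary(hours)

print(smallest, q1, med, q3, largest)
print("Q3 to largest:", largest - q3, " smallest to Q1:", q1 - smallest)
```

The upper distance (46.5) is far larger than the lower one (6), which is the right-skew signal described above.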

 The boxplot is a way to graphically portray a
distribution of data by means of its five-number
summary.
A boxplot can be drawn horizontally or vertically.
In a horizontal boxplot:
The vertical line drawn within the box is the median.
The vertical line at the left side of the box is Q1.
The vertical line at the right side of the box is Q3.
The line on the left connects the left side of the box with
X_smallest (lower 25% of data).
The line on the right connects the right side of the box
with X_largest (upper 25% of data).

 A “bell-shaped” symmetric data distribution
would look like this:

 We summarize categorical data using
frequencies and graphical methods.

 A frequency distribution records data
grouped into classes and the number of
observations that fell into each class.
 A frequency distribution can be used for:
◦ categorical data
◦ numerical data that can be grouped into intervals
◦ numerical data with repeated observations
 A percentage distribution records the percent
of the observations that fell into each class.

Example. A sample was taken of 200 professors at a (fictitious)
local college. Each was asked for his or her (take-home) weekly
salary. The responses ranged from about $520 to $590. If we
wanted to display the data in, say, 7 equal intervals, we would use
an interval width of $10.
$$\text{Width of interval} = \frac{\text{Range}}{\text{Number of classes}} = \frac{\$70}{7} = \$10 \text{ per class}$$
The Frequency / Percentage Distribution:
Take-home pay | Frequency | Percentage
$520 and under $530 | 6 | 3%
$530 and under $540 | 30 | 15%
$540 and under $550 | 38 | 19%
$550 and under $560 | 52 | 26%
$560 and under $570 | 42 | 21%
$570 and under $580 | 24 | 12%
$580 to $590 | 8 | 4%
Total | 200 | 100%
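The interval width and the percentage column can be recomputed from the class frequencies. The individual salaries are not given, so this Python sketch starts from the seven frequencies in the table. The variable names are illustrative.

```python
low, high, n_classes = 520, 590, 7

# Width of interval = range / number of classes.
width = (high - low) / n_classes
edges = [low + i * width for i in range(n_classes + 1)]

# Class frequencies taken from the table.
freqs = [6, 30, 38, 52, 42, 24, 8]
total = sum(freqs)
percents = [100 * f / total for f in freqs]

for lo_edge, hi_edge, f, p in zip(edges, edges[1:], freqs, percents):
    print(f"${lo_edge:.0f} and under ${hi_edge:.0f}: {f:>3}  {p:.0f}%")
print(f"Total: {total}  {sum(percents):.0f}%")
```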

A Cumulative Distribution focuses on the
number or percentage of cases that lie below
or above specified values rather than within
intervals.
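Take-home pay | Cumulative frequency | Cumulative percentage
less than $520 | 0 | 0%
less than $530 | 6 | 3%
less than $540 | 36 | 18%
less than $550 | 74 | 37%
less than $560 | 126 | 63%
less than $570 | 168 | 84%
less than $580 | 192 | 96%
less than $590 | 200 | 100%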

The Frequency Histogram:

The Frequency Polygon

The Cumulative Frequency Distribution
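The cumulative counts behind such a plot are a running sum of the class frequencies. Here is a minimal Python sketch using the salary example's frequencies. The leading 0 is the (empty) group below $520, and the variable names are illustrative.

```python
from itertools import accumulate

# Upper bound of each "less than" group, in dollars.
upper_bounds = [520, 530, 540, 550, 560, 570, 580, 590]
# Frequency of the class ending at each bound; nothing lies below $520.
freqs = [0, 6, 30, 38, 52, 42, 24, 8]

cum_freq = list(accumulate(freqs))                  # running total of counts
cum_pct = [100 * c / cum_freq[-1] for c in cum_freq]

for bound, c, p in zip(upper_bounds, cum_freq, cum_pct):
    print(f"less than ${bound}: {c:>3}  {p:.0f}%")
```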

 Categorical Data – graphical representation
◦ Contingency Table
◦ Side-by-Side Bar Chart
 Numerical Data – looking for relationships in
bivariate data
◦ Scatter Plot
◦ Correlation
◦ The Regression Line

Two categorical variables are most easily displayed in a
contingency table. This is a table of two-way frequencies.
Example: “Who would you vote for in the next election?”
This also works for two-way percentages:
Candidate | Male | Female | Total
Republican Candidate | 250 | 250 | 500
Democrat Candidate | 150 | 350 | 500
Total | 400 | 600 | 1000
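The row and column totals, and the two-way percentages mentioned above, can be derived from the four cell counts. The sketch below stores the table as a nested dictionary, which is just one possible layout. The variable names are illustrative.

```python
# Cell counts from the voting example: counts[candidate][gender].
counts = {
    "Republican Candidate": {"Male": 250, "Female": 250},
    "Democrat Candidate": {"Male": 150, "Female": 350},
}

row_totals = {row: sum(cells.values()) for row, cells in counts.items()}

col_totals = {}
for cells in counts.values():
    for col, n in cells.items():
        col_totals[col] = col_totals.get(col, 0) + n

grand_total = sum(row_totals.values())

# Two-way percentages: each cell as a percent of the grand total.
pct_of_total = {
    row: {col: 100 * n / grand_total for col, n in cells.items()}
    for row, cells in counts.items()
}

print(row_totals, col_totals, grand_total)
print(pct_of_total)
```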

Chart: Relative Performance (Source: Microsoft.com)

What can we do with 2 numerical variables? We
can graph them.
Example – Grade and Height (in inches)
Y (Grade) 100 95 90 80 70 65 60 40 30 20
X (Height) 73 79 62 69 74 77 81 63 68 74

Correlation coefficient is $r = .12$
Coefficient of determination is $r^2 = .01$
We will learn about the above measures, as well
as more about scatter plots, in the topic
on CORRELATION.
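As a preview, the two values quoted for the grade and height data can be reproduced with the usual Pearson formula: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. This is a minimal sketch, and the function name `pearson_r` is illustrative.

```python
from math import sqrt
from statistics import mean

grade = [100, 95, 90, 80, 70, 65, 60, 40, 30, 20]    # Y
height = [73, 79, 62, 69, 74, 77, 81, 63, 68, 74]    # X, in inches

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(height, grade)
print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}")
```

A value this close to 0 indicates almost no linear relationship between height and grade.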

 Practice, practice, practice.
◦ As always, do lots and lots of problems. You can
find these in the online lecture notes and
homework assignments.
