STA6166-2-1
Graphics, Tables and Basic Statistics (Chapter 3)
Lecture Objectives:
• Review approaches to visually displaying data.
• Graphics that display key statistical features of measurements from a sample.
• Define the distribution of a set of data.
• Review common basic statistics.
   • Extremes (Minimum and Maximum)
   • Central Tendency (Mean, Median)
   • Spread (Range, Variance, Standard Deviation)
• Review not so common basic statistics.
   • Extremes (upper and lower quartiles)
   • Central Tendency (Mode, Winsorized Mean)
   • Spread (Interquartile Range)
STA6166-2-2
Graphics
“A picture is worth a thousand words…”
Graphics are the visual portrayal of quantitative information.
Graphical Display Objectives
Graphics are used for:
• Tabulation: display the actual data table
• Description: display quantities derived from the data
• Illustration: show what has been learned about the data from other analyses
• Exploration: allow one to see what may be occurring in the data over and above what has already been described
STA6166-2-3
Objectives
As you create graphics, keep the following in mind.
• Avoid distortion of the true story.
• Induce the viewer to think about the substance, not the graph.
• Reveal the data at several layers of detail.
• Encourage the eye to compare different pieces.
• Support the statistical and verbal descriptions of the data.
STA6166-2-4
Chocolate Manufacturers Association
National Confectioners Association
7900 Westpark Blvd. Suite A 320, McLean, Virginia 22102
URL: http://www.candyusa.org/nutfact.html
Standard data format
Qualitative characteristic Quantitative characteristics
Nutrient Profiles for Selected Candy
STA6166-2-5
Example Data
STA6166-2-6
Candy data as Excel spreadsheet
STA6166-2-7
Calories in Common Candies
[Column chart: vertical axis 0 to 250 calories; rotated category labels After Dinner Mint, Candy Corn, Chewing Gum, Gummy Bears, Licorice Twists, Milk Chocola... (three times), Pectin Slices, Sour Balls, Taffy]
Display the data table
What are the problems with this graph?
Column chart
STA6166-2-8
Calories in Common Candies
[Column chart: vertical axis 0 to 250 calories; all 22 candies labeled in sorted order: Chewing Gum, Butterscotch, Lollipop, Sour Balls, Starlight Mints, Toffee, SemiSweet Chocolate Chips, Gummy Bears, Licorice Twists, Pectin Slices, After Dinner Mint, Candy Corn, Caramels, Jelly Beans, Milk Chocolate Covered Raisins, Taffy, Milk Chocolate Malted Milk Balls, Peanut Brittle, Dark Chocolate Bar, Milk Chocolate Almond Bar, Milk Chocolate Bar, Milk Chocolate Covered Peanuts]
Sorting and expanding the scale of the graph allows all
labels to be seen as well as displaying a characteristic of
the data.
Alternate Display
STA6166-2-9
Calories in Common Candies
[Bar chart: calories axis 0 to 250; visible category labels Chewing Gum, Lollipop, Starlight Mints, SemiSweet Chocolate Chips, Licorice Twists, After Dinner Mint, Caramels, Milk Chocolate Covered Raisins, Milk Chocolate Malted Milk Balls, Dark Chocolate Bar, Milk Chocolate Bar]
Vertical Display of Data
In this case, a vertical display allows better comparison of
calorie amounts.
STA6166-2-10
Pie Charts
Pie Chart of SatFatC (category, count, percent):
SatFat (9, 40.9%)
NoSatFat (13, 59.1%)
Pie Chart of protein (value, count, percent):
0 (14, 63.6%)
1 (3, 13.6%)
3 (3, 13.6%)
4 (1, 4.5%)
6 (1, 4.5%)
A pie chart is good for making relative comparisons among
pieces of a whole.
STA6166-2-11
Describe Distributions of Measurements
• Box & Whisker plot (Boxplot)
• Histogram
Compare Distributions
• Multiple Box & Whisker plots
Associations and Bivariate Distributions
• Scatter plot
• Symbolic scatter plot
Multidimensional Data Displays
• All pairwise scatter plot
• Rotating scatter plot
Graphical Methods in Support of Statistical Inference
• Regression lines
• Residual plots
• Quantile-quantile plots
• Cumulative distribution function plots
• Confidence and prediction interval plots
• Partial leverage plots
• Smoothed curves
Statistical Uses of Graphics
Most of these will be demonstrated at some point in the course.
STA6166-2-12
Basic Statistics
Before we get more into statistical uses of graphics, we
need to define some basic statistics. These statistics are
typically referred to as “descriptive statistics”, although
as we will see, they are much more than that. These
basic statistics address specific aspects of the
distribution of the data.
• What is the range of the data?
• When we sort the data, what number might we see
in the “middle” of the range of values?
• Over what subrange do we find the bulk of the data?
We will use the calorie data to illustrate.
STA6166-2-13
Extremes
• Minimum(calories) = 10
• Maximum(calories) = 210
First, if we sort the data we can immediately identify the
extremes.
The minimum and maximum are “statistics”.
Reminder: A statistic is a function of the data. In this
case, the function is very simple.
10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
STA6166-2-14
Range
Range: the difference between the largest and smallest measurements of a variable.
Extremes
• Minimum(calories) = 10
• Maximum(calories) = 210
Range = 210 - 10 = 200
The range tells us something about the spread of the data.
The middle of the range is a measure of the “center” of the data.
Midrange = minimum + (Range/2)
= 10 + 200/2
= 110
Is it a “good” measure of the center of the data?
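The extremes, range and midrange above can be checked with a few lines of Python. This is only a sketch: the list name `calories` is our own label for the 22 sorted values shown on the slides.

```python
# Sorted calorie values for the 22 candies, copied from the slides.
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

minimum = min(calories)               # smallest measurement
maximum = max(calories)               # largest measurement
data_range = maximum - minimum        # spread between the extremes
midrange = minimum + data_range / 2   # middle of the range

print(minimum, maximum, data_range, midrange)  # 10 210 200 110.0
```

Note that only 7 of the 22 values fall below the midrange of 110, which bears on the question of whether it is a “good” measure of the center.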
STA6166-2-15
Measures of Central Tendency
Median = middle value in the sorted list of n numbers: at position (n+1)/2
= unique value at (n+1)/2 if n is an odd number or
= average of the values at n/2 and n/2+1 if n is even
= (160 + 160)/2 = 160
Mean = sum of all values divided by number of values (average)
= (10 + 60 + 60 + 60 + … + 210 + 210)/22
= 133.6
Trimmed mean = mean of the data after a fixed fraction of the smallest and
largest values is set aside. Usually the smallest 5% and largest 5% of the
values are removed, with the number removed from each tail rounded to the
nearest integer.
= 136.0 (with 10% trimmed, 5% from each tail; here 5% of 22 rounds to 1, so the value 10 and one of the 210s are dropped).
Estimate the value that is in the center of the
“distribution” of the data .
Again – these are statistics (functions of the data)
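A minimal sketch of the three measures of central tendency for the calorie data. The helper `trimmed_mean` is our own illustrative function, not part of any package mentioned in the lecture, and it assumes the number of values dropped per tail is `round(fraction * n)`.

```python
import statistics

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

def trimmed_mean(values, fraction_per_tail):
    """Mean after dropping round(fraction * n) values from each tail (illustrative helper)."""
    ordered = sorted(values)
    k = round(fraction_per_tail * len(ordered))   # values removed per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

median = statistics.median(calories)      # average of the 11th and 12th values
mean = statistics.mean(calories)          # sum of all values divided by 22
tr_mean = trimmed_mean(calories, 0.05)    # drops one value from each tail

print(median, round(mean, 1), tr_mean)    # 160.0 133.6 136.0
```

The helper does not guard against trimming away every value, so it is only suitable for small tail fractions like the 5% used here.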
STA6166-2-16
Mathematical Notation
We will need some mathematical notation if we are to
make any progress in understanding statistics. In
particular, since all statistics are functions of the data,
we should be able to represent these statistics
symbolically as equations using mathematical notation.
Let Y be the symbolic name of a random variable (i.e. a placeholder
for the true name of a variable: weight, gender, time, etc.). Let y_i
symbolically represent the i-th value of variable Y observed in the
sample. Let the symbol Σ represent summation. Then the sample mean can be expressed as:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$
Here \bar{y} is the symbolic “name” for the sample mean and n is the number of observations.
STA6166-2-17
Suppose we divide the sorted data into four equal parts. The values which
separate the four parts are known as the quartiles. The first or lower quartile
Q1, is the 25th percentile of the sorted data, the second quartile, Q2, is the
median and the third or upper quartile, Q3, is the 75th percentile of the data.
Because n+1 is not always divisible by 4, we estimate the quartiles by
linear interpolation between adjacent values in the sorted list.
Here n=22 and (n+1)/4 = 23/4 = 5.75, hence Q1 lies three quarters of the way between the 5th and 6th
observations in the sorted list. The 5th value is 60 and the 6th
value is 60, thus
Q1 = 60 + .75(60-60) = 60.
For Q2, (n+1)/2 = 23/2 = 11.5, i.e. halfway between the 11th and 12th observations.
Q2 = 160 + .5(160-160) = 160.
For Q3, 3(n+1)/4 = 3(23)/4 = 69/4 = 17.25, i.e. a quarter of the way between the 17th
and 18th observations.
Q3 = 180 + .25(180-180) = 180.
Quartiles
10 60 60 60 60 60 70 130 140 140 160 160 160 160 160 160 180 180 200 210 210 210
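The interpolation rule on this slide can be sketched as a small function. The name `quantile_n_plus_1` is hypothetical, and clamping to the smallest or largest value when the position falls outside the data is our own assumption.

```python
import statistics

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

def quantile_n_plus_1(values, p):
    """Value at 1-based position (n+1)*p in the sorted list, linearly interpolated."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p        # e.g. 5.75 for Q1 when n = 22
    lower = int(position)         # 1-based index of the lower neighbor
    fraction = position - lower   # how far to move toward the upper neighbor
    if lower < 1:                 # assumption: clamp below the first value
        return ordered[0]
    if lower >= n:                # assumption: clamp above the last value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

q1, q2, q3 = (quantile_n_plus_1(calories, p) for p in (0.25, 0.5, 0.75))
print(q1, q2, q3)  # 60.0 160.0 180.0

# For comparison: the standard library's default ("exclusive") method.
print(statistics.quantiles(calories, n=4))
```

As far as we can tell, `statistics.quantiles` with its default `"exclusive"` method uses the same (n+1)p positions, so the two print lines should agree for this data. Other packages use different interpolation rules and may give slightly different quartiles.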
STA6166-2-18
Percentiles
100pth Percentile: that value in a sorted list of the data that
has approx p100% of the measurements below it
and approx (1-p)100% above it. (The p quantile.)
Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile
• Ott & Longnecker suggest finding a general 100pth percentile via a
complicated graphical method (pp. 87-90).
• We will relegate these elaborate calculations to software packages…
• We will however return to this later when we discuss QQ-Plots.
Distribution
function 0 < p < 1
STA6166-2-19
A simpler way to find Q1 & Q3 is as follows:
1. Order the data from the lowest to the highest value, and find the
median.
2. Divide the ordered data into the lower half and the upper half, using
the median as the dividing value. (Always exclude the median itself
from each half.)
3. Q1 is just the median of the lower half.
4. Q3 is just the median of the upper half.
Ex: For the candy data we still get Q1=60 and Q3=180.
Ex: {3, 4, 7, 8, 9, 11, 12, 15, 18}.
We get Q1=(4+7)/2=5.5 and Q3=(12+15)/2=13.5.
Simplified Quartiles
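The four steps above can be sketched as follows. The function name `simple_quartiles` is our own, and the sketch assumes at least two data values.

```python
import statistics

def simple_quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (median itself excluded)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]     # values below the median position
    upper_half = ordered[-half:]    # values above it; skips the middle value when n is odd
    return statistics.median(lower_half), statistics.median(upper_half)

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

print(simple_quartiles(calories))                           # (60, 180)
print(simple_quartiles([3, 4, 7, 8, 9, 11, 12, 15, 18]))    # (5.5, 13.5)
```

When n is odd, integer division by 2 leaves the middle value out of both halves, which is how the sketch honors the "exclude the median" rule.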
STA6166-2-20
Interquartile Range (IQR): Difference between the third
quartile (Q3) and the first quartile (Q1).
IQR = Q3-Q1 = 180 - 60 = 120
Quartiles:
Q1 = 25th percentile = 60
Q2 = 50th percentile = median = 160
Q3 = 75th percentile = 180
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
STA6166-2-21
Variance and Standard Deviation
Sample Mean:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
Variance: The sum of squared deviations
of measurements from their
mean divided by n-1.
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$$
Standard Deviation: The square
root of the variance.
$$s = \sqrt{s^2}$$
These measure the spread
of the data.
Rough approximation for large n:
s ≈ range/4.
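A short sketch applying the variance and standard deviation formulas to the calorie data, written out by hand rather than with a library call so that each term of the formula is visible. Variable names are our own.

```python
import math

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

n = len(calories)
mean = sum(calories) / n                                      # sample mean
variance = sum((y - mean) ** 2 for y in calories) / (n - 1)   # divide by n-1
std_dev = math.sqrt(variance)                                 # square root of the variance
rough = (max(calories) - min(calories)) / 4                   # range/4 approximation

print(round(variance, 1), round(std_dev, 1), rough)  # 3662.3 60.5 50.0
```

Here range/4 gives 50 against a computed standard deviation of about 60.5, so the approximation is loose for this sample of only 22 values.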
STA6166-2-22
Using Excel Data Analysis Tool
Under the “Tools” menu in
Excel there is a tool called
“Data Analysis”. This tool
is not normally loaded
when the Excel default
installation is used so you
may have to load it
yourself. This will require
the Excel CD. Use the
Tools > Add Ins option,
select the Data Analysis
tool and add it to your
menu.
STA6166-2-23
Excel Data Analysis Tool
Select the Data Analysis Tool
Select Descriptive Statistics
The menu below appears.
Enter the Input Range and
check the output options
desired.
STA6166-2-24
Excel Descriptive Statistics Output
You should be able to easily
identify the basic statistics we
have described so far.
Note: the variance is not in this
list. This is typical of statistics
packages. Since the variance is
simply the square of the
Standard Deviation, it is often
considered redundant.
Learn to use the Excel Help
files. Type “Statistic” in the
Excel Help Keyword dialog for
a list of available help topics.
STA6166-2-25
Importing a text data file in standard format into Minitab
[Screenshot of the Minitab window with callouts: pull-down menus; session worksheet with script commands; spreadsheet-like data area]
STA6166-2-26
Computing Descriptive Stats
Descriptive Statistics
Variable    N    Mean   Median   TrMean   StDev   SEMean
calories   22   133.6    160.0    136.0    60.5     12.9

Variable    Min     Max     Q1      Q3
calories   10.0   210.0   60.0   180.0
STA6166-2-27
Frequency Table
Mode = most abundant value
A tabular representation of a set of data.
A frequency table also describes the distribution of the
data and facilitates the estimation of probabilities.
The “Histogram” dialog in the Excel Data
Analysis Tool can be used to create this table.
But it is not straightforward.
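Outside Excel, a frequency table of the distinct calorie values takes only a few lines. This is a sketch using Python's standard `collections.Counter`; the column layout is our own choice and not the layout of the table on the slide.

```python
from collections import Counter

calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

counts = Counter(calories)   # value -> number of occurrences
n = len(calories)

print(f"{'Value':>5} {'Count':>5} {'RelFreq':>8}")
for value in sorted(counts):
    # Relative frequency = count / n, an estimate of the probability of that value.
    print(f"{value:>5} {counts[value]:>5} {counts[value] / n:>8.3f}")

mode_value, mode_count = counts.most_common(1)[0]   # most abundant value
print("Mode:", mode_value, "seen", mode_count, "times")  # Mode: 160 seen 6 times
```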
STA6166-2-28
Stem and Leaf Plot
Rough grouping or “binning” of the data.
Histogram of calories   N = 22
Midpoint   Count
      20       1   *
      40       0
      60       5   *****
      80       1   *
     100       0
     120       0
     140       3   ***
     160       6   ******
     180       2   **
     200       1   *
     220       3   ***
• A printer graph of the
frequency table.
• Easy to do by hand.
• Quick visualization of
the data.
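The printer graph above can be reproduced with a short script. This sketch assumes bins of width 20 in which the bin with midpoint m covers m-10 up to but not including m+10; that boundary convention is our assumption, chosen because it reproduces the counts shown.

```python
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

width = 20
bin_counts = {m: 0 for m in range(20, 221, width)}   # midpoints 20, 40, ..., 220

for y in calories:
    # Assumed convention: midpoint m collects values in [m - 10, m + 10).
    midpoint = ((y + width // 2) // width) * width
    bin_counts[midpoint] += 1

print(f"Histogram of calories   N = {len(calories)}")
print("Midpoint   Count")
for midpoint, count in bin_counts.items():
    print(f"{midpoint:>8} {count:>7}   {'*' * count}")
```

With a different boundary convention (for example, upper edge inclusive) values such as 70 and 210 would move to neighboring bins, so the counts depend on this choice.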
STA6166-2-29
Box Plot for Calories
[Box plot (SAS Proc Insight) of calories, axis 0 to 200, annotated with Minimum, 25th percentile (Q1), Median (Q2), 75th percentile (Q3), Maximum, and Interquartile range]
A visualization of most of the basic statistics.
Is there an Excel Tool? No.
STA6166-2-30
Percentiles
100pth Percentile: that value in a sorted list of the data that
has approx p100% of the measurements below it
and approx (1-p)100% above it. (The p quantile.)
Examples:
Q1 = 25th percentile
Q2 = 50th percentile
Q3 = 75th percentile
A distribution is said to be symmetric if the distance from the median to the
100pth percentile is the same as the distance from the median to the
100(1-p)th percentile. Otherwise the distribution is said to be skewed.
In the case above, the distribution is skewed to the right since the right tail is
longer than the left tail.
Smoothed
histogram 0 < p < 1
STA6166-2-31
Frequency Histogram
[Frequency histogram of calories: calories axis 0 to 200, Frequency axis 0 to 9; callouts mark "Frequency" (bar height) and "Bin width"]
A graphical presentation of the frequency table where the relative
areas of the bars are in proportion to the frequencies.
This is a frequency histogram
STA6166-2-32
Density Histogram
A density histogram (or simply a histogram) is
constructed just like a frequency histogram, but now the
total area of the bars sums to one. This is accomplished
by rescaling the vertical axis. Instead of frequencies, the
vertical axis records the rescaled value of the density.
Sum of shaded area is equal to one.
Histograms have
important ties to
probability.
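The rescaling described above amounts to dividing each bin count by n times the bin width. A minimal sketch, reusing the width-20 bins from the printer histogram (an assumption; the bins behind the plotted density histogram are not given in the slides):

```python
calories = [10, 60, 60, 60, 60, 60, 70, 130, 140, 140, 160,
            160, 160, 160, 160, 160, 180, 180, 200, 210, 210, 210]

width = 20
n = len(calories)

# Frequency per bin; midpoint m is assumed to cover [m - 10, m + 10).
frequencies = {m: 0 for m in range(20, 221, width)}
for y in calories:
    frequencies[((y + width // 2) // width) * width] += 1

# Density = frequency / (n * bin width), so that bar area = relative frequency.
densities = {m: count / (n * width) for m, count in frequencies.items()}

total_area = sum(density * width for density in densities.values())
print(round(total_area, 10))  # 1.0
```

Each bar's area (density times bin width) is the fraction of the data in that bin, which is the link between histograms and probability.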
STA6166-2-33
Number of Bins for Histograms
[Four panels of the same data: five bins; six bins; eleven bins; smoothed histogram or density curve]
How we view the
“distribution” of a dataset
can depend on how
much data we have and
how it is binned.
STA6166-2-34
Graphics to examine relationships
Scatterplot
[Two scatterplots of the same data, calories (axis 0 to 200) and totfat (axis 0 to 15), drawn with different relative axis lengths]
Is the relationship linear or non-linear?
Beware, changing the relative
lengths of the axes can
change how the relationship is
perceived.
STA6166-2-35
Matrix Plot
View multiple variables at one time.
STA6166-2-36
Three-D
Views
Brushing the plot
to identify
interesting points.
STA6166-2-37
Displaying
multiple variables
symbolically.
Chernoff Faces
