03/20/25 2
Presentation of Qualitative Variables
• The simplest way of presenting/summarizing a qualitative
variable is by using a frequency table, which shows the
frequency of occurrence of each of the different categories.
• Such a table could also include the relative frequency,
which indicates the proportion or percentage of occurrence
of each of the categories.
• The frequency table could then be pictorially represented
by a bar graph or a pie diagram.
An Example
• A manufacturer of jeans has plants in California (CA),
Arizona (AZ), and Texas (TX). A sample of 25 pairs of
jeans was randomly selected from a computerized
database, and the state in which each was produced was
recorded. The data are as follows:
• CA AZ AZ TX CA CA CA TX TX TX AZ AZ CA AZ TX
CA AZ TX TX TX CA AZ AZ CA CA
• Quite uninformative at this stage!
• Need to summarize to reveal information.
The Frequency Table
State Produced   Frequency   Relative Frequency (%)   Cumulative Relative Frequency (%)
CA               9           9/25 = 36%               36%
AZ               8           8/25 = 32%               68%
TX               8           8/25 = 32%               100%
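The counts in this table can be reproduced from the raw sample. The following is a minimal sketch, not part of the original slides; the function name `frequency_table` and the variable names are illustrative assumptions.

```python
from collections import Counter

# The 25 sampled states, copied from the example data.
states = ("CA AZ AZ TX CA CA CA TX TX TX AZ AZ CA AZ TX "
          "CA AZ TX TX TX CA AZ AZ CA CA").split()

def frequency_table(values, order):
    """Return rows of (category, frequency, relative %, cumulative %)."""
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for category in order:
        running += counts[category]  # running total for the cumulative column
        rows.append((category, counts[category],
                     100 * counts[category] / n, 100 * running / n))
    return rows

table = frequency_table(states, ["CA", "AZ", "TX"])
for category, freq, rel, cum in table:
    print(f"{category}  {freq:2d}  {rel:5.1f}%  {cum:5.1f}%")
```

The order of the categories is passed in explicitly because, for a qualitative variable, no ordering is implied by the data.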
Example …continued
• Looking at this frequency table and bar graph, one sees
that the three states produce roughly equal proportions of
the pairs of jeans.
• The frequency table and bar graph are certainly more
informative than the raw presentation of the sample data.
• Another method of pictorial presentation of qualitative
data is by using the pie diagram. In this case a pie is
divided into the categories with a given category’s angle
being equal to 360 degrees times the relative frequency of
occurrence of that category.
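As a quick check of the angle rule, the sketch below applies it to the relative frequencies from the jeans example (9/25, 8/25, 8/25). The variable names are illustrative assumptions.

```python
# Relative frequencies from the jeans frequency table.
relative_frequency = {"CA": 9 / 25, "AZ": 8 / 25, "TX": 8 / 25}

# Angle of each slice = 360 degrees times the relative frequency.
angles = {state: 360 * p for state, p in relative_frequency.items()}

for state, angle in angles.items():
    print(f"{state}: {angle:.1f} degrees")
```

The angles should add up to 360 degrees, since the relative frequencies add up to 1.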
Pie Chart from Minitab
[Figure: Pie Chart of Place, with slices CA (9, 36.0%), AZ (8, 32.0%), TX (8, 32.0%)]
Presentation of Quantitative Variables
• When the quantitative variable is discrete (such as counts),
a frequency table and a bar graph could also be used for
summarizing it.
• The only difference is that the values of the variable
cannot be reordered in the graph, in contrast to when the
variable is categorical or qualitative.
• For example, suppose that we asked a sample of 20 students
about the number of siblings in their family. The sample
data might be:
• 4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3
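A frequency table for these sibling counts can be tallied as in the sketch below. The names are illustrative assumptions; the values are listed in their natural numerical order because, unlike categories, they cannot be reshuffled.

```python
from collections import Counter

# Number of siblings reported by the 20 students in the example.
siblings = [4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3]

counts = Counter(siblings)
n = len(siblings)

# Rows are sorted by value: the numerical order is part of the data.
sibling_table = [(value, counts[value], 100 * counts[value] / n)
                 for value in sorted(counts)]

for value, freq, rel in sibling_table:
    print(f"{value}  {freq}  {rel:4.1f}%")
```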
Frequency Tables and Histograms
Consider the variable “Lunch,” which represents the
percentage of students in the school district whose
lunches are not free. The higher the value of this variable,
the richer the district.
n = Number of Observations = 86
LV = Lowest Value = 15
HV = Highest Value = 96
Let us construct a frequency table with classes:
[10,20), [20,30), [30,40), …, [90,100)
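The classes above are half-open: each includes its lower limit and excludes its upper limit, so a value of exactly 20 falls in [20,30). A minimal sketch of assigning a value to its class, assuming a class width of 10 as in the list above (the helper name `class_interval` is hypothetical):

```python
def class_interval(value, width=10):
    """Return (lower, upper) of the half-open class [lower, upper) containing value."""
    lower = (value // width) * width  # round down to the nearest multiple of width
    return (lower, lower + width)

print(class_interval(15))  # class of the lowest value, LV = 15
print(class_interval(96))  # class of the highest value, HV = 96
print(class_interval(20))  # a boundary value belongs to the class it starts
```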
Stem-and-Leaf Plots
• An important tool for presenting quantitative data when the
sample size is not too large is the stem-and-leaf plot.
• By using this method, there is usually no loss of
information in that the exact values of the observations
could be recovered (in contrast to a frequency table for
continuous data).
• Basic idea: To divide each observation into a stem and a
leaf.
• The stems will serve as the ‘body of the plant’ while the
leaves will serve as the ‘branches or leaves’ of the plant.
• An illustration makes the idea transparent.
An Example
• A random sample of 30 subjects from the 1910 subjects in
the blood pressure data set was selected. We present here
the systolic blood pressures of these 30 subjects.
• 30 Systolic Blood Pressures: 122 135 110 126 100 110 110
126 94 124 108 110 92 98 118 110 102 108 126 104 110
120 110 118 100 110 120 100 120 92
• Lowest Value = 92, Highest Value = 135
• Stems: 9, 10, 11, 12, 13
• Leaves: Ones Digit
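The split into stems and leaves can be sketched as follows, taking the ones digit as the leaf and the remaining leading digits as the stem. The function name `stem_and_leaf` is an illustrative assumption; leaves are kept in the order the observations appear.

```python
from collections import defaultdict

# The 30 systolic blood pressures from the example.
pressures = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
             108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
             110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

def stem_and_leaf(values):
    """Group the ones digits (leaves) under their leading digits (stems)."""
    plot = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # e.g. 135 -> stem 13, leaf 5
        plot[stem].append(leaf)
    return dict(sorted(plot.items()))

plot = stem_and_leaf(pressures)
for stem, leaves in plot.items():
    print(f"{stem:2d} | {''.join(map(str, leaves))}")
```

Each observation can be rebuilt as 10 × stem + leaf, which is why no information is lost.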
Stem-and-Leaf …continued
• In this stem-and-leaf plot, because there will only be 5
stems if we use 9, 10, 11, 12, 13, we decided to subdivide
each stem into two parts corresponding to leaf values <= 4,
and those >= 5.
• Such a procedure usually produces better looking
distributions.
• Looking at this stem-and-leaf plot, notice that many of the
observations are in the range of 100-126.
• The exact values could be recovered from this plot.
• By arranging the leaves in ascending order, the plot also
becomes more informative.
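A sketch of the subdivided version, with each stem split into a "low" half (leaves 0 to 4) and a "high" half (leaves 5 to 9) and the leaves arranged in ascending order. The labels "low" and "high" and the function name are assumptions made for illustration.

```python
# The 30 systolic blood pressures from the example.
pressures = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
             108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
             110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

def split_stem_and_leaf(values):
    """Map (stem, half) to its ascending leaves; half is 'low' (<= 4) or 'high' (>= 5)."""
    rows = {}
    for v in sorted(values):  # sorting first yields ascending leaves in every row
        stem, leaf = divmod(v, 10)
        half = "low" if leaf <= 4 else "high"
        rows.setdefault((stem, half), []).append(leaf)
    return rows

rows = split_stem_and_leaf(pressures)
for (stem, half), leaves in rows.items():
    print(f"{stem:2d} ({half:4s}) | {''.join(map(str, leaves))}")
```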
Comparative Stem-and-Leaf Plots
• When comparing the distributions of two groups (e.g.,
when classified according to GENDER), side-by-side
stem-and-leaf plots (also side-by-side histograms) could be
used.
• To illustrate, consider 30 observations from the blood
pressure data set with Gender and Systolic Blood Pressure
being the observed variables.
• For the males (Sex = 0): 122, 120, 130, 110, 134, 136, 142,
100, 120, 162, 126, 132, 124, 130
• For the females (Sex = 1): 132, 94, 104, 100, 130, 110,
102, 110, 130, 92, 125, 108, 100, 130, 100, 100
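One way to lay the two groups side by side is a back-to-back plot that shares a single column of stems, with male leaves on the left and female leaves on the right. This layout is an assumption about presentation, not necessarily the one used on the slides, and the helper name `leaves_by_stem` is illustrative.

```python
# Systolic blood pressures by sex, copied from the example.
males = [122, 120, 130, 110, 134, 136, 142, 100, 120, 162, 126, 132, 124, 130]
females = [132, 94, 104, 100, 130, 110, 102, 110, 130, 92, 125, 108, 100, 130,
           100, 100]

def leaves_by_stem(values):
    """Map each stem (leading digits) to its leaves (ones digits) in ascending order."""
    groups = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups.setdefault(stem, []).append(leaf)
    return groups

m, f = leaves_by_stem(males), leaves_by_stem(females)
stems = range(min(min(m), min(f)), max(max(m), max(f)) + 1)

for stem in stems:
    # Male leaves are reversed so they read outward from the shared stem column.
    left = "".join(str(d) for d in reversed(m.get(stem, [])))
    right = "".join(str(d) for d in f.get(stem, []))
    print(f"{left:>8} | {stem:2d} | {right}")
```

Sharing the stems makes it easy to see that the male values extend to higher stems than the female values in this sample.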
Scatterplots:
Studying Relationship Between Poverty and Math
[Figure: scatterplot of Actual Math (vertical axis, ticks 10 to 80) against Lunch (horizontal axis, ticks 10 to 100)]
Question: What kind of relationship is there between Lunch
and PACT Math Scores?
Numerical Summary Measures
• Overview
• Why do we need numerical summary
measures?
• Measures of Location
• Measures of Variation
• Measures of Position
• Box Plots
Why We Need Summary Measures
• “A picture is worth a thousand words, but beauty is
always in the eyes of the beholder!”
• Graphs or pictures are sometimes unwieldy
• One usually wants a small set of numbers that
provide the important features of the data set
• When making decisions, objectivity is enhanced
when they are based on numbers!
• Numerical summaries and tabular/graphical
presentations complement each other
The Setting
• In defining and illustrating our summary
measures, assume that we have sample data
• Sample Data: X1, X2, X3, …, Xn
• Sample Size: n
• These summary measures are thus (sample)
statistics.
• If instead they are based on the population values,
they will be (population) parameters.
Measures of Location or Center
• These are summary measures that provide
information on the “center” of the data set
• Usually, these measures of location are where the
observations cluster, but not always
• In layman’s terms, these measures are what we
associate with “averages”
• Will discuss two measures: sample mean and
sample median
Sample Mean or Arithmetic Average
• The sample mean equals the sum of the
observations divided by the number of
observations.
• It is defined symbolically via
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
Properties of the Sample Mean
• “Center of Gravity”
• Sum of the deviations of the observations from the
mean is always zero (barring rounding errors)
• The sample mean could, however, be affected
drastically by extreme values or outliers
• The sample mean is very conducive to
mathematical analysis compared to other measures
of location
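The zero-sum property of the deviations follows from the definition of the sample mean as the sum of the observations divided by n. A short derivation, writing \bar{X} for the sample mean of X_1, ..., X_n:

```latex
\[
\sum_{i=1}^{n} \left( X_i - \bar{X} \right)
  = \sum_{i=1}^{n} X_i - n\bar{X}
  = n\bar{X} - n\bar{X}
  = 0
\]
```

The middle step uses the fact that the sum of the observations equals n times the sample mean.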
Sample Mean Computation
\[ \sum_{i=1}^{30} X_i = 122 + 135 + \cdots + 92 = 3333 \]
\[ \bar{X} = \frac{3333}{30} = 111.1 \]
• This value of 111.1 could be interpreted as the
balancing point of the 30 systolic blood pressure
observations.
• Locating this in the histogram we have:
Sample Mean in Histogram
[Figure: histogram of Systolic Blood Pressure (horizontal axis, ticks 93 to 135) with Relative Frequency (in %) on the vertical axis (ticks 0 to 30)]
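The sample mean of the 30 systolic blood pressures can be checked with a short script. This is a minimal sketch with illustrative variable names; it also confirms that the deviations from the mean sum to zero up to floating-point rounding.

```python
# The 30 systolic blood pressures from the example.
pressures = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
             108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
             110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

total = sum(pressures)   # sum of the observations
n = len(pressures)       # number of observations
mean = total / n         # sample mean

# Deviations from the mean should cancel out (barring rounding error).
deviation_sum = sum(x - mean for x in pressures)

print(total, n, round(mean, 1))
```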
Sample Median
• Sample median (M) = value that divides the
arranged/ordered data set into two equal parts.
• At least 50% are <= M and at least 50% are >= M
• Not sensitive to outliers but harder to deal with
mathematically
• Appropriate when the histogram is left- or right-skewed
• Better to present both mean and median in practice
Illustration of Computation of Median
• Consider again the blood pressure data earlier.
• n=30: an even number.
• Median will be the average of the 15th and 16th
observations in arranged data.
• Arranged data: 92, 92, 94, 98, 100, 100, 100, 102,
104, 108, 108, 110, 110, 110, 110, 110, 110, 110,
110, 118, 118, 120, 120, 120, 122, 124, 126, 126,
126, 135
Continued ...
• The sample median is the average of 110 and 110,
which are the 15th and 16th observations in the
arranged data.
• The median equals 110.
• Note that it is very close to the sample mean value
of 111.1
• This closeness is because of the near symmetry of
the distribution
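The median computation can be sketched as follows: sort the data, then take the middle value when n is odd, or the average of the two middle values when n is even. Variable names are illustrative assumptions.

```python
# The 30 systolic blood pressures from the example.
pressures = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
             108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
             110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

ordered = sorted(pressures)
n = len(ordered)

if n % 2 == 0:
    # Even n: average the (n/2)th and (n/2 + 1)th ordered observations.
    # Python lists are 0-indexed, so these sit at indices n//2 - 1 and n//2.
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Odd n: the single middle observation.
    median = ordered[n // 2]

print(median)
```

With n = 30, the two middle positions are the 15th and 16th ordered observations.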
Relative Positions of Mean and Median
• For symmetric distributions, the mean and the
median coincide.
• For right-skewed distributions, the mean tends to be
larger than the median (mean pulled up by the large
extreme values)
• For left-skewed distributions, the mean tends to be
smaller than the median (mean pulled down by the
small extreme values)
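As a small illustration, the sibling counts from the earlier example have a few large values (5, 6, 6, 7) trailing off to the right, so the mean would be expected to sit at or slightly above the median. The sketch below uses Python's standard `statistics` module; treating this sample as right-skewed is an interpretation, not a claim made on the slides.

```python
import statistics

# Number of siblings reported by the 20 students in the earlier example.
siblings = [4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3]

mean = statistics.mean(siblings)      # pulled toward the large values
median = statistics.median(siblings)  # depends only on the middle of the ordered data

print(f"mean = {mean:.1f}, median = {median}")
```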