03/20/25 1
Descriptive Statistics
Lecture 02:
Tabular and Graphical Presentation
of Data and Measures of Locations
Presentation of Qualitative Variables
• The simplest way of presenting/summarizing a qualitative
variable is by using a frequency table, which shows the
frequency of occurrence of each of the different categories.
• Such a table could also include the relative frequency,
which indicates the proportion or percentage of occurrence
of each of the categories.
• The frequency table could then be pictorially represented
by a bar graph or a pie diagram.
An Example
• A manufacturer of jeans has plants in California (CA),
Arizona (AZ), and Texas (TX). A sample of 25 pairs of
jeans was randomly selected from a computerized
database, and the state in which each was produced was
recorded. The data are as follows:
• CA AZ AZ TX CA CA CA TX TX TX AZ AZ CA AZ TX
CA AZ TX TX TX CA AZ AZ CA CA
• Quite uninformative at this stage!
• Need to summarize to reveal information.
The Frequency Table
State Produced   Frequency   Relative Frequency (%)   Cumulative Relative Frequency (%)
CA               9           9/25 = 36%               36%
AZ               8           8/25 = 32%               68%
TX               8           8/25 = 32%               100%
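The frequency table can be reproduced with a few lines of code. The following is a minimal Python sketch, assuming the 25 observations are typed in as a string; the function name `frequency_table` and its `order` parameter are illustrative and not part of the lecture.

```python
from collections import Counter

# Sample of 25 pairs of jeans, by state of production (from the example).
data = ("CA AZ AZ TX CA CA CA TX TX TX AZ AZ CA AZ TX "
        "CA AZ TX TX TX CA AZ AZ CA CA").split()


def frequency_table(values, order):
    """Return rows of (category, frequency, relative %, cumulative relative %)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for category in order:
        cumulative += counts[category]
        rows.append((category,
                     counts[category],
                     100 * counts[category] / n,
                     100 * cumulative / n))
    return rows


table = frequency_table(data, order=["CA", "AZ", "TX"])
for category, freq, rel, cum in table:
    print(f"{category:<4}{freq:>4}{rel:>8.0f}%{cum:>8.0f}%")
```

The category order is passed explicitly because, for a qualitative variable, the order of the rows is a presentation choice.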
The Bar Chart
[Figure: bar chart of Frequency (vertical axis, 0 to 10) for CA, AZ, and TX]
Example … continued
• From this frequency table and bar graph, one can see that
the pairs of jeans appear to be manufactured in roughly
equal proportions in the three states.
• The frequency table and bar graph are certainly more
informative than the raw presentation of the sample data.
• Another way to present qualitative data pictorially is the
pie diagram. The pie is divided into the categories, with a
given category’s angle equal to 360 degrees times the
relative frequency of occurrence of that category.
Pie Diagram
Angles (in degrees):
CA = (360)(0.36) = 129.6
AZ = (360)(0.32) = 115.2
TX = (360)(0.32) = 115.2
[Figure: pie diagram with slices CA (129.6°), AZ (115.2°), and TX (115.2°)]
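The angle rule (360 degrees times the relative frequency) is easy to check numerically. A small Python sketch, assuming the frequencies from the jeans example are entered by hand; the variable names are illustrative:

```python
# Frequencies from the jeans example (CA 9, AZ 8, TX 8 out of 25).
frequencies = {"CA": 9, "AZ": 8, "TX": 8}
n = sum(frequencies.values())

# Angle of each slice = 360 degrees x relative frequency.
angles = {state: 360 * count / n for state, count in frequencies.items()}

for state, angle in angles.items():
    print(f"{state}: {angle:.1f} degrees")
```

Because the relative frequencies sum to 1, the slice angles always sum to 360 degrees.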
Pie Chart from Minitab
[Figure: Minitab “Pie Chart of Place” with slices CA (9, 36.0%), AZ (8, 32.0%), and TX (8, 32.0%)]
Presentation of Quantitative Variables
• When the quantitative variable is discrete (such as counts),
a frequency table and a bar graph could also be used for
summarizing it.
• The only difference is that the values of the variable cannot
be reordered in the graph, in contrast to when the variable
is categorical or qualitative.
• For example, suppose that we asked a sample of 20 students
about the number of siblings in their family. The sample
data might be:
• 4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3
Its Bar Graph is
[Figure: bar graph of frequency (vertical axis, 0 to 5) against number of siblings (horizontal axis, 1 to 7)]
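For a discrete variable the same counting idea applies, but the values are kept in numerical order. A minimal Python sketch using the siblings data; the text-based bars are only an illustrative stand-in for the bar graph:

```python
from collections import Counter

# Number of siblings reported by the 20 students in the example.
siblings = [4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3]

counts = Counter(siblings)

# Keep the values in numerical order: unlike categories, they cannot be reshuffled.
for value in range(min(siblings), max(siblings) + 1):
    print(f"{value} | {'#' * counts[value]} ({counts[value]})")
```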
Lunch ActualLang ActualMath
59 32 38
46 26 30
90 63 67
29 17 24
41 24 26
51 30 41
41 25 30
43 32 36
70 33 36
93 50 66
84 50 66
64 27 32
52 36 43
50 31 43
53 28 35
78 36 41
57 31 42
51 39 42
55 41 53
60 37 45
96 46 66
75 34 45
60 29 36
71 43 53
68 42 51
76 47 52
82 49 55
73 30 41
31 24 30
75 45 57
57 29 40
80 51 63
54 30 44
67 28 33
76 45 50
87 61 61
54 27 33
60 32 41
35 26 35
51 29 36
50 35 42
43 23 26
66 32 44
86 63 75
54 25 33
87 60 69
49 29 37
46 38 43
50 38 44
57 40 50
90 60 75
26 17 20
47 23 27
53 37 39
58 34 43
16 13 15
74 48 54
77 43 55
94 41 62
88 49 62
78 50 59
79 46 58
61 41 47
45 26 34
87 49 62
68 36 52
76 45 56
32 22 31
63 39 53
33 20 26
64 44 53
39 20 22
37 21 27
47 23 30
40 29 41
43 25 27
37 24 31
64 37 43
59 36 45
70 32 41
55 37 46
90 38 47
45 32 35
31 25 24
35 29 32
15 14 18
An Example of a Real Data Set: Poverty versus PACT in SC
Frequency Tables and Histograms
Consider the variable “Lunch,” which represents the
percentage of students in the school district whose
lunches are not free. The higher the value of this variable,
the richer the district.
n = Number of Observations = 86
LV = Lowest Value = 15
HV = Highest Value = 96
Let us construct a frequency table with classes:
[10,20), [20,30), [30,40), …, [90,100)
Classes     MidPoint   Freq.   RF      CF   CRF
[10, 20)    15         2       2.33    2    2.33
[20, 30)    25         2       2.33    4    4.65
[30, 40)    35         9       10.47   13   15.12
[40, 50)    45         13      15.12   26   30.23
[50, 60)    55         20      23.26   46   53.49
[60, 70)    65         12      13.95   58   67.44
[70, 80)    75         14      16.28   72   83.72
[80, 90)    85         8       9.30    80   93.02
[90, 100)   95         6       6.98    86   100.00
Total                  86
RF = relative frequency (%), CF = cumulative frequency, CRF = cumulative relative frequency (%)
Frequency Table for Variable “Lunch”
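The class frequency table for “Lunch” can be rebuilt from the raw column. The sketch below is a plain-Python illustration, assuming the 86 Lunch values are copied from the data listing; the function name `class_frequency_table` and its parameters are hypothetical, chosen only for this example.

```python
# "Lunch" column of the data set (n = 86), copied from the data listing.
lunch = [
    59, 46, 90, 29, 41, 51, 41, 43, 70, 93, 84, 64, 52, 50, 53, 78, 57, 51,
    55, 60, 96, 75, 60, 71, 68, 76, 82, 73, 31, 75, 57, 80, 54, 67, 76, 87,
    54, 60, 35, 51, 50, 43, 66, 86, 54, 87, 49, 46, 50, 57, 90, 26, 47, 53,
    58, 16, 74, 77, 94, 88, 78, 79, 61, 45, 87, 68, 76, 32, 63, 33, 64, 39,
    37, 47, 40, 43, 37, 64, 59, 70, 55, 90, 45, 31, 35, 15,
]


def class_frequency_table(values, start, stop, width):
    """Tabulate values into left-closed classes [a, a + width)."""
    n = len(values)
    rows = []
    cumulative = 0
    for lower in range(start, stop, width):
        upper = lower + width
        freq = sum(1 for v in values if lower <= v < upper)
        cumulative += freq
        rows.append({
            "class": f"[{lower}, {upper})",
            "midpoint": (lower + upper) / 2,
            "freq": freq,
            "rf": round(100 * freq / n, 2),         # relative frequency (%)
            "cf": cumulative,                       # cumulative frequency
            "crf": round(100 * cumulative / n, 2),  # cumulative relative freq. (%)
        })
    return rows


table = class_frequency_table(lunch, start=10, stop=100, width=10)
for row in table:
    print(f"{row['class']:<10}{row['midpoint']:>6.0f}{row['freq']:>5}"
          f"{row['rf']:>8.2f}{row['cf']:>5}{row['crf']:>8.2f}")
```

Each class is closed on the left and open on the right, so a value such as 50 is counted in [50, 60) and not in [40, 50).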
Frequency Histogram
[Figure: frequency histogram of Lunch; horizontal axis Lunch (10 to 100), vertical axis Frequency (0 to 20)]
Stem-and-Leaf Plots
• An important tool for presenting quantitative data when the
sample size is not too large is the stem-and-leaf plot.
• By using this method, there is usually no loss of
information in that the exact values of the observations
could be recovered (in contrast to a frequency table for
continuous data).
• Basic idea: To divide each observation into a stem and a
leaf.
• The stems will serve as the ‘body of the plant’ while the
leaves will serve as the ‘branches or leaves’ of the plant.
• An illustration makes the idea transparent.
An Example
• A random sample of 30 subjects from the 1910 subjects in
the blood pressure data set was selected. We present here
the systolic blood pressures of these 30 subjects.
• 30 Systolic Blood Pressures: 122 135 110 126 100 110 110
126 94 124 108 110 92 98 118 110 102 108 126 104 110
120 110 118 100 110 120 100 120 92
• Lowest Value = 92, Highest Value = 135
• Stems: 9, 10, 11, 12, 13
• Leaves: Ones Digit
Stem-and-Leaf Plot
 9 | 224
 9 | 8
10 | 00024
10 | 88
11 | 00000000
11 | 88
12 | 00024
12 | 666
13 |
13 | 5
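The construction of this plot can be sketched in code. The following Python example is an illustration only, assuming whole-number observations whose ones digit is the leaf and with each stem split into leaves 0 to 4 and 5 to 9; the function name `split_stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# The 30 systolic blood pressures from the example.
bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124, 108, 110, 92, 98, 118,
      110, 102, 108, 126, 104, 110, 120, 110, 118, 100, 110, 120, 100, 120, 92]


def split_stem_and_leaf(values):
    """Return plot lines with each stem split into leaves 0-4 and 5-9."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting puts the leaves in ascending order
        stem, leaf = divmod(v, 10)      # stem = tens and above, leaf = ones digit
        half = 0 if leaf <= 4 else 1    # lower or upper half of the stem
        leaves[(stem, half)].append(str(leaf))
    stems = range(min(values) // 10, max(values) // 10 + 1)
    return [f"{stem:>2} | {''.join(leaves[(stem, half)])}"
            for stem in stems for half in (0, 1)]


for line in split_stem_and_leaf(bp):
    print(line)
```

Empty half-stems are still printed, so gaps in the data remain visible in the shape of the plot.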
Stem-and-Leaf … continued
• In this stem-and-leaf plot, because there would be only 5
stems if we used 9, 10, 11, 12, 13, we subdivided each
stem into two parts: one for leaf values <= 4 and one for
leaf values >= 5.
• Such a procedure usually produces better looking
distributions.
• Looking at this stem-and-leaf plot, notice that many of the
observations are in the range of 100-126.
• The exact values could be recovered from this plot.
• By arranging the leaves in ascending order, the plot also
becomes more informative.
Comparative Stem-and-Leaf Plots
• When comparing the distributions of two groups (e.g.,
when classified according to GENDER), side-by-side
stem-and-leaf plots (also side-by-side histograms) could be
used.
• To illustrate, consider 30 observations from the blood
pressure data set with Gender and Systolic Blood Pressure
being the observed variables.
• For the males (Sex = 0): 122, 120, 130, 110, 134, 136, 142,
100, 120, 162, 126, 132, 124, 130
• For the females (Sex = 1): 132, 94, 104, 100, 130, 110,
102, 110, 130, 92, 125, 108, 100, 130, 100, 100
Comparing Male/Female Systolic Blood
Pressures
        Females |      | Males
         Leaves | Stem | Leaves
            2 4 |  09  |
  0 0 0 0 2 4 8 |  10  | 0
            0 0 |  11  | 0
              5 |  12  | 0 0 2 4 6
        0 0 0 2 |  13  | 0 0 2 4 6
                |  14  | 2
                |  15  |
                |  16  | 2
Scatterplots:
Studying Relationship Between Poverty and Math
[Figure: scatterplot of ActualMath (vertical axis, 10 to 80) against Lunch (horizontal axis, 10 to 100)]
Question: What kind of relationship is there between Lunch
and PACT Math Scores?
Numerical Summary Measures
• Overview
• Why do we need numerical summary
measures?
• Measures of Location
• Measures of Variation
• Measures of Position
• Box Plots
Why Do We Need Summary Measures?
• “A picture is worth a thousand words, but beauty is
always in the eyes of the beholder!”
• Graphs or pictures are sometimes unwieldy
• One usually wants a small set of numbers that could
provide the important features of the data set
• When making decisions, objectivity is enhanced
when they are based on numbers!
• Numerical summaries and tabular/graphical
presentations complement each other
The Setting
• In defining and illustrating our summary
measures, assume that we have sample data
• Sample Data: X1, X2, X3, …, Xn
• Sample Size: n
• These summary measures are thus (sample)
statistics.
• If instead they are based on the population values,
they will be (population) parameters.
Measures of Location or Center
• These are summary measures that provide
information on the “center” of the data set
• Usually, these measures of location are where the
observations cluster, but not always
• In layman’s terms, these measures are what we
associate with “averages”
• We will discuss two measures: the sample mean and
the sample median
Sample Mean or Arithmetic Average
• The sample mean equals the sum of the
observations divided by the number of
observations.
• It is defined symbolically via
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
Properties of the Sample Mean
• “Center of Gravity”
• Sum of the deviations of the observations from the
mean is always zero (barring rounding errors)
• The sample mean could, however, be affected
drastically by extreme values or outliers
• The sample mean is very conducive to
mathematical analysis compared to other measures
of location
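The zero-sum property of the deviations follows directly from the definition of the sample mean. A short derivation, written in LaTeX and using only the symbols of this lecture (since the sum of the observations equals n times the sample mean):

```latex
\[
\sum_{i=1}^{n} (X_i - \bar{X})
  = \sum_{i=1}^{n} X_i - n\bar{X}
  = n\bar{X} - n\bar{X}
  = 0 .
\]
```

This is the sense in which the mean is a "center of gravity": the deviations above it exactly balance the deviations below it.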
Illustration
• Consider the systolic blood pressure data set
considered in Lecture 01
• Sample Size = n = 30
• Data: 122, 135, 110, 126, 100, 110, 110, 126, 94,
124, 108, 110, 92, 98, 118, 110, 102, 108, 126,
104, 110, 120, 110, 118, 100, 110, 120, 100, 120,
92
Sample Mean Computation
\[ \sum_{i=1}^{30} X_i = 122 + 135 + \cdots + 92 = 3333 \]
\[ \bar{X} = \frac{3333}{30} = 111.1 \]
• This value of 111.1 could be interpreted as the
balancing point of the 30 systolic blood pressure
observations.
• Locating this in the histogram we have:
Sample Mean in Histogram
[Figure: relative frequency histogram (in %, vertical axis 0 to 30) of Systolic Blood Pressure (horizontal axis 93 to 135)]
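The sample mean computation, and two of the properties of the mean listed earlier, can be checked with a few lines of Python. This is a sketch only: the variable names are illustrative, and the value 235 used for the outlier check is a hypothetical replacement that does not appear in the data.

```python
# The 30 systolic blood pressures used in the illustration.
bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124, 108, 110, 92, 98, 118,
      110, 102, 108, 126, 104, 110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

n = len(bp)
total = sum(bp)          # sum of the observations
mean = total / n         # sample mean = sum / number of observations
print(f"n = {n}, sum = {total}, mean = {mean:.1f}")

# Property: deviations from the mean sum to zero (up to rounding error).
deviation_sum = sum(x - mean for x in bp)
print(f"sum of deviations = {deviation_sum:.10f}")

# Property: sensitivity to outliers. Replace the largest value (135) with a
# hypothetical extreme value of 235 and recompute the mean.
with_outlier = [235 if x == 135 else x for x in bp]
mean_with_outlier = sum(with_outlier) / n
print(f"mean with hypothetical outlier = {mean_with_outlier:.2f}")
```

Changing a single observation out of 30 moves the mean by more than three units, which illustrates why extreme values deserve attention before the mean is reported.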
Sample Median
• Sample median (M) = value that divides the
arranged/ordered data set into two equal parts.
• At least 50% are <= M and at least 50% are >= M
• Not sensitive to outliers but harder to deal with
mathematically
• Appropriate when the histogram is left- or right-skewed
• Better to present both mean and median in practice
Illustration of Computation of Median
• Consider again the blood pressure data earlier.
• n=30: an even number.
• Median will be the average of the 15th and 16th
observations in arranged data.
• Arranged data: 92, 92, 94, 98, 100, 100, 100, 102,
104, 108, 108, 110, 110, 110, 110, 110, 110, 110,
110, 118, 118, 120, 120, 120, 122, 124, 126, 126,
126, 135
Continued ...
• The sample median is the average of 110 and 110,
which are the 15th and 16th observations in the
arranged data.
• The median equals 110.
• Note that it is very close to the sample mean value
of 111.1
• This closeness is because of the near symmetry of
the distribution
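The median computation for an even sample size can be sketched as follows. This Python example is illustrative; the branch for odd n is included for completeness, and the standard-library `statistics.median` function is used only as a cross-check.

```python
import statistics

# The 30 systolic blood pressures used in the illustration.
bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124, 108, 110, 92, 98, 118,
      110, 102, 108, 126, 104, 110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

ordered = sorted(bp)     # arranged data
n = len(ordered)

if n % 2 == 0:
    # Even n: average the two middle observations (here the 15th and 16th).
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Odd n: the single middle observation.
    median = ordered[n // 2]

print(f"15th = {ordered[14]}, 16th = {ordered[15]}, median = {median}")
print(f"cross-check with statistics.median: {statistics.median(bp)}")
```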
Relative Positions of Mean and Median
• For symmetric distributions, the mean and the
median coincide.
• For right-skewed distributions, the mean tends to be
larger than the median (mean pulled up by the large
extreme values)
• For left-skewed distributions, the mean tends to be
smaller than the median (mean pulled down by the
small extreme values)
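These relative positions can be seen on small examples. The three samples below are hypothetical, made up purely for illustration and not taken from the lecture data; the sketch uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical samples, invented only to illustrate the three cases.
symmetric = [1, 2, 3, 3, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]          # one large extreme value
left_skewed = [21 - x for x in right_skewed]      # mirror image of the above

for name, sample in [("symmetric", symmetric),
                     ("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    print(f"{name:<13} mean = {mean:<6} median = {median}")
```

In the right-skewed sample the single large value pulls the mean above the median, and in the mirrored sample the single small value pulls it below.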
