Exploring Data

• Displaying Distributions with Graphs
• Displaying Distributions with Numbers

Displaying Distributions with Graphs

• Introduction
• Displaying categorical variables: bar graphs
• Displaying quantitative variables: dotplots and
stemplots
• Displaying quantitative variables: histograms
• Relative frequency, cumulative frequency,
percentiles, and ogives
• Time plots

Introduction

Statistics is the branch of mathematics dealing with
the collection, analysis, interpretation, and
presentation of numerical data.
Individuals are the objects described by a set of data.
When the individual is human, it is called a subject.
A variable is any characteristic of an individual. A
variable can take different values for different
individuals.

Some variables simply place individuals (or
subjects) into categories. Other variables take
numerical values for which we can do arithmetic.
A categorical variable places an individual into a
group or category.
A quantitative variable takes numerical values for
which arithmetic operations such as adding and
averaging make sense.
The distribution of a variable tells us what values the
variable takes and how often it takes these values.

Displaying Categorical Variables: Bar
Graphs
A bar graph shows the distribution of a categorical
variable and gives either the count or percent of
observations that fall in each category.
The horizontal axis lists each category of the variable.
The vertical axis shows the number (or percent) of
observations.
Leave a space between each bar.
Always label axes and add a title.

Displaying Quantitative Variables:
Dotplots and Stemplots
A dotplot is the simplest display of quantitative
data. To create a dotplot, draw a horizontal line and
list each outcome in ascending order below the line.
Mark a dot above the number that corresponds to
each data value. Add a title.
For example, the number of goals scored per game
by the Boston Bruins during the NHL playoffs in
2011 is: 0, 1, 4, 5, 2, 1, 4, 7, 3, 5, 5, 2, 6, 2, 3, 3, 4, 1,
0, 2, 8, 4, 0, 5, 4. Create a dotplot of this data.
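
The steps above can be sketched in code. This is a minimal text-only version, assuming single-digit values so the columns line up; the function name dotplot_rows is made up for this sketch.

```python
from collections import Counter

# Goals per game from the example above (Boston Bruins, 2011 NHL playoffs).
goals = [0, 1, 4, 5, 2, 1, 4, 7, 3, 5, 5, 2, 6, 2, 3, 3, 4, 1,
         0, 2, 8, 4, 0, 5, 4]

def dotplot_rows(values):
    """Return the lines of a text dotplot: marks stacked above an axis.

    Assumes single-digit values so that every column is one character wide.
    """
    counts = Counter(values)
    lo, hi = min(values), max(values)
    tallest = max(counts.values())
    rows = []
    # Build from the top row down; a mark appears where the count reaches that height.
    for height in range(tallest, 0, -1):
        rows.append(" ".join("*" if counts[v] >= height else " "
                             for v in range(lo, hi + 1)))
    rows.append(" ".join(str(v) for v in range(lo, hi + 1)))  # axis labels
    return rows

print("Goals per game, Boston Bruins, 2011 NHL playoffs")
print("\n".join(dotplot_rows(goals)))
```

The tallest stack sits above 4, which occurs five times in the data.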

Refer to the handout for caffeine content (in mg) for
38 different soft drinks. For this data, a dotplot is not
ideal due to the large spread. Instead, construct a
stemplot.
Separate each observation into a stem consisting of
all digits except the rightmost digit. The rightmost
digit is the leaf. For example, 35 mg of caffeine will
have a stem of 3 and a leaf of 5.
Write the stems vertically in increasing order from
top to bottom.

Draw a vertical line to the right of the stems.
For each observation, write the leaf to the right of its
associated stem, making sure to space the leaves
equally. Then rewrite the stems and arrange the
leaves so they are in increasing order out from the
stem.
Add a title and key (3 | 5 = 35 mg).
Note: it may be necessary to split stems or truncate
observations.

After completing a dotplot or stemplot, describe the
overall pattern of the distribution. Give the center
and spread and determine if there are outliers. An
outlier is an individual observation that falls outside
the overall pattern of the graph.
Also comment on the shape of the distribution.
Distributions may be symmetric (the two sides are roughly
mirror images), skewed right (the right tail is much longer
than the left tail), or skewed left (the left tail is much
longer than the right tail).

Activity

Is Barack Obama a “young” president? Here are the
ages of all the U.S. presidents on inauguration day:
Washington 57, J. Adams 61, Jefferson 57, Madison 57,
Monroe 58, J.Q. Adams 57, Jackson 61, Van Buren 54, W.
Harrison 68, Tyler 51, Polk 49, Taylor 64, Fillmore 50, Pierce
48, Buchanan 65, Lincoln 52, A. Johnson 56, Grant 46, Hayes
54, Garfield 49, Arthur 51, Cleveland 47, B. Harrison 55,
Cleveland 55, McKinley 54, T. Roosevelt 42, Taft 51, Wilson
56, Harding 55, Coolidge 51, Hoover 54, F. Roosevelt 51,
Truman 60, Eisenhower 61, Kennedy 43, L. Johnson 55, Nixon
56, Ford 61, Carter 52, Reagan 69, G. Bush 64, Clinton 46,
G.W. Bush 54, Obama 47.
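
One way to start on the activity is a stemplot with split stems (leaves 0 to 4 on one row, 5 to 9 on the next). The sketch below is only one possible layout; the function name split_stemplot is made up for this example.

```python
# Ages at inauguration, in the order listed above (Washington through Obama).
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65, 52,
        56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51, 54, 51,
        60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54, 47]

def split_stemplot(values):
    """Return stemplot lines with each stem split into leaves 0-4 and 5-9."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # tens digit is the stem, ones digit the leaf
        rows.setdefault((stem, leaf >= 5), []).append(str(leaf))
    lines = []
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        for upper in (False, True):
            leaves = " ".join(rows.get((stem, upper), []))
            lines.append(f"{stem} | {leaves}".rstrip())
    return lines

print("Age at inauguration (key: 4 | 7 = 47 years)")
print("\n".join(split_stemplot(ages)))
```

Obama's age of 47 appears as a leaf on the second row of the plot.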

Histograms
Display the presidential age at inauguration using a
histogram. On a TI-83:
STAT EDIT 1:Edit and enter values into L1
2nd STAT PLOT 1: On, choose histogram,
XList: L1, Freq:1
Graph
Sketch the result from the calculator into your notes.
Always add axes labels and a title.

Unlike the bar graph, the bars of the histogram are
adjacent to account for continuity of the values on
the x-axis.
There is no “correct” number of classes on the
x-axis. As a rule of thumb, between 5 and 10 classes
is usually sufficient, and about 7 often makes the
histogram look “best.” Too few classes will result in a
skyscraper histogram while too many will result in a
pancake histogram.
In general, use the number of classes your calculator
chooses.

Relative Frequency, Cumulative
Frequency, Percentiles, and Ogives
Sometimes we are interested in describing the
relative position of an individual within a
distribution. For instance, a PSAT result may indicate
you were in the 80th percentile. This means about
80% of students scored at or below your score (and
about 20% scored higher).
The pth percentile of a distribution is the value such
that p percent of observations fall at or below it.

A histogram is good for displaying the overall
pattern of a distribution but is poor for determining
the percentile of an individual observation.
A relative cumulative frequency plot, or ogive, is
useful in determining percentiles.

From the presidential inauguration data, we know
there are 44 presidents (observations).
Fill in the table:

Class   | Frequency | Relative Frequency | Cumulative Frequency | Relative Cumulative Frequency
40 - 44 |           |                    |                      |
45 - 49 |           |                    |                      |
50 - 54 |           |                    |                      |
55 - 59 |           |                    |                      |
60 - 64 |           |                    |                      |
65 - 69 |           |                    |                      |
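
To check a completed table, the four columns can be computed directly from the ages. This is a rough sketch; the function name frequency_table and its parameters are choices made for this example, with the class boundaries taken from the table.

```python
# Ages at inauguration, in the order listed in the activity.
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65, 52,
        56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51, 54, 51,
        60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54, 47]

def frequency_table(values, start=40, width=5, n_classes=6):
    """Return (label, freq, rel_freq, cum_freq, rel_cum_freq) for each class."""
    n = len(values)
    rows, cumulative = [], 0
    for k in range(n_classes):
        low = start + k * width
        high = low + width - 1
        freq = sum(low <= v <= high for v in values)  # count values in the class
        cumulative += freq                            # running total so far
        rows.append((f"{low} - {high}", freq, freq / n,
                     cumulative, cumulative / n))
    return rows

for label, f, rf, cf, rcf in frequency_table(ages):
    print(f"{label:8} {f:9d} {rf:9.1%} {cf:10d} {rcf:10.1%}")
```

The last cumulative frequency should equal the number of observations, and the last relative cumulative frequency should be 100%.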

The relative cumulative frequency plot is a line
graph that plots relative cumulative frequency vs.
class. Create one using the table above
and don’t forget to label axes and add a title.
What percentile is Barack Obama? On the x-axis,
locate the class that contains 47. Move up until you
reach the line, then move left to read off the
approximate percentile.
What age corresponds to the 50th percentile?
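
The ogive gives approximate answers by eye. As a cross-check, the definition of the pth percentile (the percent of observations at or below a value) can be applied to the raw ages. The helper names below are made up for this sketch, and results from the raw data may differ slightly from a reading off the graph.

```python
# Ages at inauguration, in the order listed in the activity.
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65, 52,
        56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51, 54, 51,
        60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54, 47]

def percentile_rank(values, x):
    """Percent of observations that fall at or below x."""
    return 100 * sum(v <= x for v in values) / len(values)

def value_at_percentile(values, p):
    """Smallest observed value with at least p percent of observations at or below it."""
    for v in sorted(set(values)):
        if percentile_rank(values, v) >= p:
            return v
    return max(values)

print(f"Obama (47): {percentile_rank(ages, 47):.1f}th percentile")
print(f"50th percentile age: {value_at_percentile(ages, 50)}")
```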

Time Plots

A time plot of a variable plots each observation
against the time at which it was measured. Time is
always placed on the x-axis.
Civil disturbances in the United States between 1968
and 1972 were counted by quarter, as shown in the
table below. Using the data, construct a
time plot of the number of disturbances vs. time.
Remember to label axes and add a title.
Connect each observation with a line and comment
on the overall trend and the seasonal variation.

Year | Months    | Count
1968 | Jan - Mar | 6
1968 | Apr - Jun | 46
1968 | Jul - Sep | 25
1968 | Oct - Dec | 3
1969 | Jan - Mar | 5
1969 | Apr - Jun | 27
1969 | Jul - Sep | 19
1969 | Oct - Dec | 6
1970 | Jan - Mar | 26
1970 | Apr - Jun | 24
1970 | Jul - Sep | 20
1970 | Oct - Dec | 6
1971 | Jan - Mar | 12
1971 | Apr - Jun | 21
1971 | Jul - Sep | 5
1971 | Oct - Dec | 1
1972 | Jan - Mar | 3
1972 | Apr - Jun | 8
1972 | Jul - Sep | 5
1972 | Oct - Dec | 5
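
A rough way to see the trend and the seasonal variation before drawing the plot is to total the counts by year and by quarter. The sketch below also prints a crude text version of the time plot; the variable names are choices made for this example.

```python
# Counts per quarter from the table, in order Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec.
counts = {
    1968: [6, 46, 25, 3],
    1969: [5, 27, 19, 6],
    1970: [26, 24, 20, 6],
    1971: [12, 21, 5, 1],
    1972: [3, 8, 5, 5],
}
quarters = ["Jan - Mar", "Apr - Jun", "Jul - Sep", "Oct - Dec"]

# Points for the time plot, in time order: (label, count).
series = [(f"{year} {q}", c)
          for year, row in counts.items()
          for q, c in zip(quarters, row)]

# Yearly totals hint at the overall trend; quarter totals hint at seasonality.
yearly_totals = {year: sum(row) for year, row in counts.items()}
quarter_totals = {q: sum(row[i] for row in counts.values())
                  for i, q in enumerate(quarters)}

for label, c in series:
    print(f"{label:15} {'#' * c}")
print(yearly_totals)
print(quarter_totals)
```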

Displaying Distributions with Numbers

• Measuring center: the mean and the median
• Comparing the mean and median
• Measuring spread: the quartiles
• The five-number summary and modified boxplots
• Measuring spread: the standard deviation
• Choosing measures of center and spread
• Changing the unit of measurement
• Comparing distributions

Measuring Center: the Mean and Median

To find the mean (average) of a set of observations,
add their individual values and divide by the number
of observations.
If the n observations are x_1, x_2, …, x_n, then the mean
is:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]


Consider the set S = {1, 1, 2, 2, 3, 3, 4, 4}. The mean
of this set is 2.5.
Now consider the set T = {1, 1, 2, 2, 3, 3, 4, 40}.
Find the mean.
Notice the extreme observation strongly affects the
mean. Therefore, we say the mean is not resistant
to extreme observations.
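
The two sets can be compared with a short calculation. This is a minimal sketch of the add-and-divide rule; the function name mean is a choice made for this example.

```python
S = [1, 1, 2, 2, 3, 3, 4, 4]
T = [1, 1, 2, 2, 3, 3, 4, 40]   # same as S except for one extreme value

def mean(values):
    """Add the observations and divide by how many there are."""
    return sum(values) / len(values)

print("mean of S:", mean(S))
print("mean of T:", mean(T))
```

Changing a single observation from 4 to 40 is enough to pull the mean above every other value in the set.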


The median, M, is the midpoint of a distribution; the
number such that half of the observations are smaller
and half of the observations are larger. To find the
median, arrange the observations in order of size,
from smallest to largest.
If the number of observations, n, is odd, the median
is the center observation in the ordered list.
If the number of observations, n, is even, the median
is the mean of the two center observations in the
ordered list.


Consider the set S = {1, 1, 2, 2, 3, 3, 4, 4}. The
median of this set is 2.5.
Now consider the set T = {1, 1, 2, 2, 3, 3, 4, 40}.
Find the median.
Notice the extreme observation has little effect on
the median. Therefore, we say the median is resistant
to extreme observations.
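
The odd and even cases of the rule can be written out as follows. This is a minimal sketch; the function name median is a choice made for this example.

```python
S = [1, 1, 2, 2, 3, 3, 4, 4]
T = [1, 1, 2, 2, 3, 3, 4, 40]

def median(values):
    """Middle of the ordered list; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the center observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two centers

print("median of S:", median(S))
print("median of T:", median(T))
```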

Comparing the Mean and Median

If a distribution is approximately symmetric, the
mean and median are approximately equal.
In skewed distributions, the mean is farther out in the
longer tail (because it is not resistant).
Distributions skewed left will have a mean less than
the median.
Distributions skewed right will have a mean greater
than the median.

Measuring Spread: the Quartiles

The simplest measure of spread for any distribution
is range:
Range = maximum value - minimum value
Quartiles mark the boundaries of the middle half of our
observations. The first quartile, Q1, is the 25th
percentile. The third quartile, Q3, is the 75th
percentile.


To find Q1 and Q3, arrange the observations in order
of size from smallest to largest. Then find the overall
median.
Q1 is the median of the observations to the left of the
overall median in the ordered list.
Q3 is the median of the observations to the right of the
overall median in the ordered list.
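
The procedure can be sketched as below. It assumes the halves are split by position in the ordered list (so the middle observation is left out of both halves when n is odd) and that there are at least two observations. Calculators and software may use slightly different quartile rules.

```python
S = [1, 1, 2, 2, 3, 3, 4, 4]
# Goals per game from the dotplot example (25 observations).
goals = [0, 1, 4, 5, 2, 1, 4, 7, 3, 5, 5, 2, 6, 2, 3, 3, 4, 1,
         0, 2, 8, 4, 0, 5, 4]

def median(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-each-half method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # observations left of the median's position
    upper = ordered[n - half:]    # observations right of the median's position
    return median(lower), median(ordered), median(upper)

print("S:    ", quartiles(S))
print("goals:", quartiles(goals))
```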


The interquartile range, IQR, is the range covered by
the middle half of data:
IQR = Q3 - Q1
An observation between Q1 and Q3 is not unusually
small or large. This observation is between the 25th
and 75th percentile.


Using the IQR, we can now write a definition for an
outlier.
An observation is considered an outlier if it is
smaller than Q1 - 1.5 × IQR or larger than
Q3 + 1.5 × IQR.
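
As a quick check of the rule, take the set T = {1, 1, 2, 2, 3, 3, 4, 40} from the earlier examples. Finding the quartiles by hand from the ordered halves gives Q1 = 1.5 and Q3 = 3.5. The helper names below are made up for this sketch.

```python
T = [1, 1, 2, 2, 3, 3, 4, 40]
q1, q3 = 1.5, 3.5   # medians of the halves 1,1,2,2 and 3,3,4,40, found by hand

def outlier_fences(q1, q3):
    """Return the lower and upper cutoffs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Return the observations that fall outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

print("fences:", outlier_fences(q1, q3))
print("outliers:", find_outliers(T, q1, q3))
```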

Measuring Spread: the Five-Number
Summary and Modified Boxplots
The five-number summary combines a measure of
center (median) and measures of spread (range and
quartiles). It consists of five numbers written in order
from smallest to largest. The numbers are:
Minimum, Q1, M, Q3, Maximum

A modified boxplot is a graph of the five-number
summary with outliers plotted individually. Properties
of the modified boxplot are:
A central box spans Q1 and Q3;
A vertical line in the box marks M;
Horizontal lines extend from the box out to the
smallest and largest observations that are not
outliers;
Observations more than 1.5 × IQR outside the central
box are plotted individually.
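
The pieces needed to draw a modified boxplot can be computed before sketching it. The example below uses the presidential ages from the activity and assumes the quartiles are found by splitting the ordered list by position; the function name modified_boxplot_parts is made up for this sketch.

```python
# Ages at inauguration, in the order listed in the activity.
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48, 65, 52,
        56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51, 54, 51,
        60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54, 47]

def _median(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def modified_boxplot_parts(values):
    """Return the box (Q1, M, Q3), the whisker ends, and the outliers."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = _median(ordered[:half])
    m = _median(ordered)
    q3 = _median(ordered[n - half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    # Whiskers stop at the most extreme observations that are not outliers.
    return {"box": (q1, m, q3),
            "whiskers": (inside[0], inside[-1]),
            "outliers": outliers}

print(modified_boxplot_parts(ages))
```

Under this quartile rule the two oldest ages, 68 and 69, fall beyond the upper cutoff and would be plotted as individual points.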

Measuring Spread: the Standard
Deviation
The standard deviation, s, measures how far away
the observations in a distribution are from their
mean. To calculate standard deviation, first calculate
variance, s^2.
The variance, s^2, of a set of observations is the sum
of the squares of the deviations of the observations
from their mean, divided by n - 1:
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The standard deviation, s, is the square root of
variance:
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Why divide by n - 1 instead of n? Since the sum of
the deviations must equal zero, the last deviation can
be found once we know the other n - 1 deviations.
Only n - 1 of the squared deviations can vary freely
so we average by dividing the total by n - 1. The
number n - 1 is called the degrees of freedom.
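
A short calculation with the sets S and T from the earlier examples shows both ideas: the deviations always sum to zero, and the squared deviations are divided by n - 1. The function names are choices made for this sketch.

```python
import math

S = [1, 1, 2, 2, 3, 3, 4, 4]
T = [1, 1, 2, 2, 3, 3, 4, 40]

def deviations(values):
    """Distance of each observation from the mean (signed)."""
    xbar = sum(values) / len(values)
    return [x - xbar for x in values]

def sample_variance(values):
    """Sum of squared deviations divided by n - 1."""
    return sum(d ** 2 for d in deviations(values)) / (len(values) - 1)

def sample_sd(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

print("sum of deviations of S:", sum(deviations(S)))
print("s^2 and s for S:", sample_variance(S), sample_sd(S))
print("s^2 and s for T:", sample_variance(T), sample_sd(T))
```

The single extreme value in T makes its standard deviation many times larger than that of S.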

Properties of the standard deviation:
s measures spread about the mean and should be
used only when the mean is chosen as the measure of
center;
s = 0 when there is no spread. When there is spread,
s > 0. Larger spreads imply larger values of s.
Like the mean, the standard deviation (and variance)
is not resistant to outliers. Strong skewness or a few
outliers can make s very large.

Here are some TI-83 commands to find all the
summary statistics mentioned in these notes:
Enter data into L1
STAT CALC 1:1-Var Stats L1
Read off:
x̄, Sx, minX, Q1, Med, Q3, maxX

Choosing Measures of Center and Spread

If a distribution is strongly skewed or has outliers,
use the five-number summary to describe center and
spread.
If a distribution is reasonably symmetric and free
from outliers, use mean and standard deviation to
describe center and spread.

Changing the Unit of Measurement

The same variable can be recorded in different units
of measurement. Common examples are changing
distances from miles to kilometers and changing
temperature from °F to °C.
A linear transformation changes the original variable x
into a new variable x_new via an equation of the form:
\[ x_{new} = a + bx \]

The effects of a linear transformation on measures of
center and spread are:
Adding the same number a to each observation adds
a to the mean, median, and quartiles, but does not change
measures of spread.
Multiplying each observation by a positive number b
multiplies the mean, median, and quartiles by b and also
multiplies the standard deviation and IQR by b.
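
These effects can be checked numerically. The sketch below reuses the set S from the earlier examples; the constants a = 10 and b = 2 are illustrative values chosen for this example, not taken from the notes.

```python
import statistics

S = [1, 1, 2, 2, 3, 3, 4, 4]
a, b = 10, 2  # illustrative constants chosen for this sketch

def summary(values):
    """Return (mean, median, sample standard deviation)."""
    return (statistics.mean(values), statistics.median(values),
            statistics.stdev(values))

shifted = [x + a for x in S]   # add a to every observation
scaled = [b * x for x in S]    # multiply every observation by b

print("original:", summary(S))
print("shifted: ", summary(shifted))   # center moves by a, spread unchanged
print("scaled:  ", summary(scaled))    # center and spread both multiplied by b
```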
