2. Organizing Data
• Looking at data can be overwhelming
• There is a lot of raw (unsorted) data
• It’s important to organize the data for
summarization
• One way to summarize data is to use
tables and graphs
• To do this, we’ll need to consider the
qualities of the variable under
consideration
• Qualitative or Quantitative
• If Quantitative:
• What level of measurement?
• Discrete or continuous?
3. ZIP Codes
Recall that a ZIP Code is a
quantitative variable at the
nominal level of
measurement (almost like
it’s a categorical variable)
since there is no real order
to ZIP codes.
We’ll organize the ZIP codes
by reading down the list
and making a tally mark
next to the headers
(creating new headers as
we find new ZIP codes)
Class   Tally                                     Frequency (f)   Relative Frequency (f/n)
95458   ||                                          2              1.96%
95490   ||||| ||||                                  9              8.82%
95469   |||                                         3              2.94%
95482   ||||| ||||| ||||| ||||| ||||| |||||        62             60.78%
        ||||| ||||| ||||| ||||| ||||| ||||| ||
95461   |                                           1              0.98%
95470   ||||| |||||                                10              9.80%
95415   ||                                          2              1.96%
95428   |                                           1              0.98%
95485   |                                           1              0.98%
95449   ||                                          2              1.96%
95451   ||                                          2              1.96%
95425   |                                           1              0.98%
95454   |                                           1              0.98%
95463   |                                           1              0.98%
95437   ||                                          2              1.96%
95453   ||                                          2              1.96%
n = Σf = 102
Σ(f/n) = 99.98%
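The relative frequency column can be checked with a short Python sketch. The frequencies are copied from the table; the variable names are only illustrative. It also suggests why the percentages total 99.98% instead of 100%: each one is rounded to two decimal places before being added.

```python
# Frequencies copied from the ZIP code table.
frequencies = {
    "95458": 2, "95490": 9, "95469": 3, "95482": 62,
    "95461": 1, "95470": 10, "95415": 2, "95428": 1,
    "95485": 1, "95449": 2, "95451": 2, "95425": 1,
    "95454": 1, "95463": 1, "95437": 2, "95453": 2,
}

n = sum(frequencies.values())  # n is the sum of all the frequencies

# Relative frequency f/n as a percentage, rounded to two decimal places.
relative = {zip_code: round(100 * f / n, 2) for zip_code, f in frequencies.items()}

for zip_code, f in frequencies.items():
    print(f"{zip_code}  {f:>3}  {relative[zip_code]:>6.2f}%")

# Adding the rounded percentages carries the rounding error into the total.
total_percent = round(sum(relative.values()), 2)
print(f"n = {n}, total of rounded relative frequencies = {total_percent}%")
```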
5. Women’s Heights
It helps to look at the data in groups, instead of just ‘as it is’
◦ This is especially true since the order here has meaning
◦ Also, we don’t want to look at each height separately (too many numbers)
◦ We’ll group the heights into a small enough number of groups that we can see any patterns that exist
◦ We’ll do this by making a Grouped Frequency Distribution Table
How will we group them?
◦ Pick two numbers
◦ Lower Limit of the First Class and Class Width
◦ If we all use these numbers, and use them correctly, we’ll get identical tables.
6. Women’s Heights
Lower Limit of the First Class
◦ This is the smallest number we’re going to tally
◦ It must be either the height of the shortest person, or an even smaller height
Class Width
◦ This is how many separate heights are in each class
◦ It is also the difference between the lower limits of successive classes
Let’s use 57 as the lower limit of the first class and 2 as the class width
This gives us class limits of 57-58, 59-60, 61-62, 63-64, 65-66, 67-68, 69-70, 71-72 (we can stop
here because nobody was taller than 72 inches)
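As a rough sketch, the class limits can be generated from the two chosen numbers. The function name and parameters are illustrative, and the sketch assumes heights are recorded to the nearest whole inch, so a class of width 2 holds two whole-number heights.

```python
def class_limits(first_lower, width, largest_value):
    """Return (lower, upper) class limits for whole-number data."""
    limits = []
    lower = first_lower
    while lower <= largest_value:
        upper = lower + width - 1  # a class holds `width` whole-number values
        limits.append((lower, upper))
        lower += width             # successive lower limits differ by the class width
    return limits


# 57 and 2 are the values chosen on the slide; 72 is the tallest height mentioned.
heights_classes = class_limits(first_lower=57, width=2, largest_value=72)
print(heights_classes)
```

Changing either number changes every class, which is why everyone must agree on both to get identical tables.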
7. Women’s Heights
Before we tally the heights up, let’s address an issue that comes up when the variable is
continuous (as height is)
What if we were rounding to something finer than whole inches?
What if a person is actually 63.6 inches tall?
We don’t really want gaps between our classes, so we use something called Class Boundaries
Class Boundaries
◦ Split the difference between the upper class limit of one class and the lower class limit of the next
higher class
◦ The boundary between the first two classes is 58.5, halfway from 58 to 59, and it is both the upper
class boundary of the first class and the lower class boundary of the second class.
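A minimal sketch of the same idea in Python, using the class limits from the earlier slide. The helper `find_class` is hypothetical, and it makes one assumption the slides do not state: a height landing exactly on a boundary is placed in the higher class.

```python
# Class limits from the women's heights example.
limits = [(57, 58), (59, 60), (61, 62), (63, 64),
          (65, 66), (67, 68), (69, 70), (71, 72)]

# Gap between one class's upper limit and the next class's lower limit (1 inch here).
gap = limits[1][0] - limits[0][1]

# Splitting the difference moves each limit outward by half the gap.
boundaries = [(lower - gap / 2, upper + gap / 2) for lower, upper in limits]


def find_class(value):
    """Return the index of the class whose boundaries contain value.

    Assumption: a value exactly on a boundary goes to the higher class.
    """
    for i, (low, high) in enumerate(boundaries):
        if low <= value < high:
            return i
    return None


print(boundaries[0])              # boundaries of the first class
print(limits[find_class(63.6)])   # class limits for a 63.6-inch height
```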
9. Graphs
Graphs are a great way to demonstrate data
◦ They help us to look for patterns
◦ There are different ways of displaying the data
◦ Today, we’ll consider the following:
◦ Pareto chart
◦ Pie chart
◦ Histogram
◦ Scatterplot
10. Pareto Chart
If you are dealing with categorical data
or quantitative data at the nominal
level of measurement, the Pareto chart
gives a very good picture
◦ A Pareto chart is a bar graph whose bars:
◦ Do not touch (usually)
◦ Are arranged from the class of the largest
frequency to the smallest frequency
◦ Can be arranged vertically or horizontally
◦ Often relative frequencies are used, but here is
an example using Men’s Zip Codes just using
regular frequencies
[Figure: Pareto chart titled "Men's ZIP Codes". Vertical axis: Frequency, 0 to 25. Bars from left to right: 95482, 95490, 95470, 95458, 95469, 95461, 95415, 95428, 95485, 95453]
11. Pareto Chart
Looking at this image, what jumps out
at you?
◦ Can you see why this is a good choice to
present the data?
◦ It is obvious which is the most common
ZIP code
◦ You may note that there are two
different ZIP codes that have 4 men
living in them, and two with 2 men, and
5 with 1 man
◦ When categories are tied, their order doesn’t
matter, so long as the bars stay in descending
order of frequency
[Figure: Pareto chart titled "Men's ZIP Codes". Vertical axis: Frequency, 0 to 25. Bars from left to right: 95482, 95490, 95470, 95458, 95469, 95461, 95415, 95428, 95485, 95453]
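The ordering step can be sketched in Python. The frequencies are the ones the slides give for the men's ZIP codes (25 for 95482, 4 each for two codes, 2 each for two codes, 1 each for five codes); the text bars are only a stand-in for a real chart.

```python
# Men's ZIP code frequencies, entered in no particular order.
mens_zip_frequencies = {
    "95458": 2, "95482": 25, "95461": 1, "95490": 4, "95415": 1,
    "95469": 2, "95428": 1, "95470": 4, "95485": 1, "95453": 1,
}

# A Pareto chart orders the classes from largest to smallest frequency.
# sorted() is stable, so tied classes simply keep their original relative order.
pareto_order = sorted(mens_zip_frequencies.items(),
                      key=lambda item: item[1], reverse=True)

# Horizontal text bars: one '#' per man living in that ZIP code.
for zip_code, f in pareto_order:
    print(f"{zip_code} {'#' * f} {f}")
```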
12. Pie chart
Another way to present qualitative variables, and quantitative variables at the nominal level of
measurement, is with the Pie chart
◦ A Pie chart is a graph which represents 100% of the data being looked at
◦ You “slice” the “pie” so that the size of each piece indicates the size of the frequencies of the different
categories
◦ We determine the size of the piece by what is called the central angle
◦ The central angle is the angle made by the two edges of the slice (assuming you started from the exact
center of the pie)
[Figure: a pie slice with its central angle labeled]
13. Pie chart
We measure angles in degrees, and there are 360 of them
around the center of the pie
We need to determine how many degrees to make the central
angle for the slice representing each class
◦ Central angle = (f / n) × 360°
◦ This splits up the central angles precisely proportionately to the
frequency of the classes.
◦ 95482: (25 / 42) × 360° ≈ 214°
◦ 95490, 95470: (4 / 42) × 360° ≈ 34° each
◦ 95458, 95469: (2 / 42) × 360° ≈ 17° each
◦ 95461, 95415, 95428, 95485, 95453: (1 / 42) × 360° ≈ 9° each
◦ These angles total 361° rather than 360°, because each one was rounded to the nearest degree
[Figure: pie chart titled "Men's ZIP Codes". Slices: 95482, 95490, 95470, 95458, 95469, 95461, 95415, 95428, 95485, 95453]
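The central angle arithmetic can be reproduced with a few lines of Python. This is only a sketch of the calculation on this slide, with illustrative variable names; rounding each angle to the nearest degree is what pushes the total one degree past 360.

```python
# Men's ZIP code frequencies used in the central angle calculations.
class_frequencies = {
    "95482": 25, "95490": 4, "95470": 4, "95458": 2, "95469": 2,
    "95461": 1, "95415": 1, "95428": 1, "95485": 1, "95453": 1,
}

n = sum(class_frequencies.values())  # 42 men in total

# Central angle = (f / n) * 360, rounded to the nearest whole degree.
central_angles = {zip_code: round(f / n * 360)
                  for zip_code, f in class_frequencies.items()}

for zip_code, angle in central_angles.items():
    print(f"{zip_code}: {angle} degrees")

total_degrees = sum(central_angles.values())
print(f"Total of the rounded angles: {total_degrees} degrees")
```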
14. Histograms
Neither the Pareto chart nor the Pie chart
is suitable for variables at the interval and
ratio levels of measurement
Their classes have a numerical order, so you can’t rearrange them freely in the chart
The best way to convey the data from these
types of variables is with a Histogram
A Histogram is a bar graph in which the bars
touch
Thus we use the class boundaries when we mark off the
scale on the horizontal axis
As with the Pareto chart, the vertical axis
shows the frequencies
Be sure the frequency scale goes high enough to
accommodate the class with the greatest frequency, but
not too much higher
15. Histograms
One advantage of a histogram is that it can
readily display large data sets
The histogram can give you the shape of the
data, the center, and the spread of the data
Here is an example of a previous class’s
women’s heights divided into 8 classes
You’ll note the ‘break’ in the horizontal axis;
that is to show that the axis has been
interrupted
It is proper to show this break, though it is not always done
16. Histograms
You’ll also note that once you have shown
that the horizontal axis is not perfectly to
scale, you should choose where to start
(where to place the 56.5), and then
everything else is decided from
there!
17. Bivariate Data (Two variables)
Sometimes we want to look at two
variables at once
Bivariate – Two Variables
Suppose we want to study the connection
between people’s ages and the number of
pets they have
Here, the ordered pair is (age, # of pets)
(19, 2), (23, 2), (18, 4), (18, 2), (28, 0), (19, 3), (37, 1),
(20, 0), (34, 0), (40, 1), (18, 27), (19, 0), (18, 2), (18, 1),
(18, 4), (20, 1), (19, 3), (26, 2), (23, 2), (29, 1), (23, 0),
(19, 5), (19, 10), (29, 0), (19, 2), (19, 0)
Plotting these ordered pairs as points gives a graph called a Scatter Plot
Each point on a Scatter Plot gives us two
pieces of data about a single member of the
sample, one datum for each variable
Are there any data points that seem odd?
It’s easy to see that (18, 27) is an outlier; are there
others?
Let’s take this one out and see what things look like
now…
[Figure: scatter plot of number of pets (vertical axis, 0 to 30) against age (horizontal axis, 15 to 45), including the point (18, 27)]
18. Bivariate Data (Two variables)
With the outlier removed, it is a little easier to see what kind of variability is going on
Was this ‘OK’ to do?
Outliers happen, sometimes from mistakes, sometimes simply because they do exist
If you remove an outlier to look at the data, you should note that you have done so
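[Figure: scatter plot of number of pets (vertical axis, 0 to 12) against age (horizontal axis, 15 to 45), with the point (18, 27) removed]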
The horizontal variable is the x-variable, and
it’s sometimes called the independent
variable.
The vertical variable is the y-variable, and it’s
sometimes called the dependent variable
Note: This terminology is not meant to imply
that the one causes the other
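Here is a small Python sketch of the (age, # of pets) data and the removal of the one point the slides single out. It removes (18, 27) by name; it is not a general rule for detecting outliers, and the variable names are illustrative.

```python
# Ordered pairs (age, number of pets) from the slide.
pairs = [
    (19, 2), (23, 2), (18, 4), (18, 2), (28, 0), (19, 3), (37, 1),
    (20, 0), (34, 0), (40, 1), (18, 27), (19, 0), (18, 2), (18, 1),
    (18, 4), (20, 1), (19, 3), (26, 2), (23, 2), (29, 1), (23, 0),
    (19, 5), (19, 10), (29, 0), (19, 2), (19, 0),
]

# The point identified as an outlier; record that it was removed.
outlier = (18, 27)
trimmed = [pair for pair in pairs if pair != outlier]

# x-variable (horizontal axis) and y-variable (vertical axis).
ages = [age for age, _ in trimmed]
pets = [count for _, count in trimmed]

print(f"{len(pairs)} points originally, {len(trimmed)} after removing {outlier}")
print(f"Largest number of pets now: {max(pets)}")
```

The smaller maximum is why the second plot's vertical axis can stop at 12 instead of 30.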
19. Activity: Making a grouped frequency
distribution table
Construct a grouped frequency distribution table for the heights of the men in the Class Data
Base, using 60 as the lower limit of the first class and 3 inches as the class width. Have columns
for the Class Limits, the Class Boundaries, the Tally, the Frequency, and the Relative Frequency to
the nearest hundredth.
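The procedure for this activity can be sketched in Python. The heights below are made-up placeholder values, not the Class Data Base, so the output is not the answer to the activity. The sketch assumes heights recorded in whole inches and relative frequencies written as decimals.

```python
# Hypothetical heights in whole inches, invented for illustration only.
sample_heights = [61, 64, 66, 67, 68, 68, 69, 70, 70, 71, 72, 74]

first_lower, width = 60, 3  # lower limit of the first class and class width
n = len(sample_heights)

rows = []
lower = first_lower
while lower <= max(sample_heights):
    upper = lower + width - 1
    f = sum(1 for h in sample_heights if lower <= h <= upper)
    rows.append({
        "limits": (lower, upper),
        "boundaries": (lower - 0.5, upper + 0.5),  # split the 1-inch gap
        "tally": "|" * f,
        "frequency": f,
        "relative": round(f / n, 2),               # nearest hundredth
    })
    lower += width

print("Limits   Boundaries     Tally   f   f/n")
for row in rows:
    lo, hi = row["limits"]
    b_lo, b_hi = row["boundaries"]
    print(f"{lo}-{hi}    {b_lo}-{b_hi}    {row['tally']:<6}  "
          f"{row['frequency']}   {row['relative']:.2f}")
```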