2. Outliers
• An outlier is an extremely high or an extremely low
data value when compared with the rest of the data
values.
• Procedure for Identifying Outliers
• Step 1 Arrange the data in order and find Q1 and Q3.
• Step 2 Find the interquartile range: IQR = Q3 - Q1.
• Step 3 Multiply the IQR by 1.5 to get 1.5 * IQR.
• Step 4 Subtract the value obtained in step 3 from Q1
and add the value to Q3.
• Step 5 Check the data set for any data value that is
smaller than Q1 - 1.5*(IQR) or larger than Q3 +
1.5*(IQR).
3. • Check the following data set for outliers: 5, 6,
12, 13, 15, 18, 22, 50
• Solution
• The data value 50 is extremely suspect. These
are the steps in checking for an outlier.
• Step 1 Find Q1 and Q3.
• Q1 is 9 and Q3 is 20.
• Step 2 Find the interquartile range (IQR),
which is Q3 - Q1.
• IQR = Q3 - Q1 = 20 - 9 = 11
4. • Step 3 Multiply this value by 1.5.
• 1.5 * (11) = 16.5
• Step 4 Subtract the value obtained in step 3
from Q1, and add the value in step 3 to Q3.
• 9 - 16.5 = -7.5 and
• 20 + 16.5 = 36.5
• Step 5 Check the data set for any data values
that fall outside the interval from -7.5 to
36.5.
• The value 50 is outside this interval; hence, it
can be considered an outlier.
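The five steps above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the helper names `quartiles` and `find_outliers` are made up for this example, and Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, which is the convention that reproduces Q1 = 9 and Q3 = 20 here. Other software may use a different quartile convention and give slightly different fences.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two data values. For an odd number of values,
    the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                  # values below the median position
    upper = data[len(data) - half:]      # values above the median position
    return median(lower), median(upper)

def find_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    q1, q3 = quartiles(values)           # Step 1
    iqr = q3 - q1                        # Step 2
    step = 1.5 * iqr                     # Step 3
    low, high = q1 - step, q3 + step     # Step 4
    flagged = [x for x in values if x < low or x > high]  # Step 5
    return low, high, flagged

data = [5, 6, 12, 13, 15, 18, 22, 50]
low, high, outliers = find_outliers(data)
print(low, high, outliers)  # -7.5 36.5 [50]
```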
5. • Reasons why outliers may occur.
1. The data value may have resulted from a
measurement or observational error. The
variable is measured incorrectly.
2. The data value may have resulted from a
recording error.
3. The data value may have been obtained from
a subject that is not in the defined population.
4. The data value might be a legitimate value
that occurred by chance (although the
probability is extremely small).
6. Exploratory Data Analysis (EDA)
• In EDA, the data are organized using a stem and leaf plot.
• The measure of central tendency used in EDA is the
median. The measure of variation used in EDA is the
interquartile range, Q3 - Q1.
• In EDA the data are represented graphically using a
boxplot (sometimes called a box-and-whisker plot).
• The purpose of exploratory data analysis is to examine
data to find out what information can be discovered
about the data such as the center and the spread.
7. Stem and Leaf Plots
• The stem and leaf plot is a method of organizing
data and is a combination of sorting and graphing.
• It has the advantage over a grouped frequency
distribution of retaining the actual data while
showing them in graphical form.
• A stem and leaf plot is a data plot that uses part of
the data value as the stem and part of the data
value as the leaf to form groups or classes.
8. Example
• At an outpatient testing center, the number of
cardiograms performed each day for 20 days is
shown. Construct a stem and leaf plot for the
data.
• 25 31 20 32 13
• 14 43 02 57 23
• 36 32 33 32 44
• 32 52 44 51 45
9. • Step 1
• Arrange the data in order:
• 02, 13, 14, 20, 23, 25, 31, 32, 32, 32,
• 32, 33, 36, 43, 44, 44, 45, 51, 52, 57
• Step 2
• Separate the data according to the first digit.
• 02
• 13, 14
• 20, 23, 25
• 31, 32, 32, 32, 32, 33, 36
• 43, 44, 44, 45
• 51, 52, 57
• Step 3
• A display can be made by using the leading digit as the stem
and the trailing digit as the leaf.
• For example, for the value 32, the leading digit, 3, is the
stem and the trailing digit, 2, is the leaf. For the value 14,
the 1 is the stem and the 4 is the leaf.
10. The plot also shows that the testing center treated from a minimum of 2 patients to a
maximum of 57 patients in any one day. If there are no data values in a class, you
should write the stem number and leave the leaf row blank. Do not put a zero in the
leaf row.
Leading digit (stem) | Trailing digit (leaf)
0 | 2
1 | 3 4
2 | 0 3 5
3 | 1 2 2 2 2 3 6
4 | 3 4 4 5
5 | 1 2 7
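The construction can be sketched in Python for the cardiogram data. This is an illustrative sketch that assumes non-negative whole numbers with the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is made up for this example. The value written as 02 on the slide is entered as 2, because Python does not allow leading zeros in integer literals.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digits)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Keep every stem between the smallest and largest, even when it has
    # no leaves, so that an empty class shows up as a blank leaf row.
    return {stem: leaves.get(stem, [])
            for stem in range(min(leaves), max(leaves) + 1)}

cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

plot = stem_and_leaf(cardiograms)
for stem, leaf_row in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaf_row))
```

A stem with no data values prints with an empty leaf row rather than a zero, matching the rule stated above.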
11. Boxplot
• A boxplot is a graph of a data set obtained by
drawing a horizontal line from the minimum
data value to Q1, drawing a horizontal line
from Q3 to the maximum data value, and
drawing a box whose vertical sides pass
through Q1 and Q3 with a vertical line inside
the box passing through the median or Q2.
12. A boxplot can be used to graphically represent the data set.
These plots involve five specific values:
1. The lowest value of the data set (i.e., minimum)
2. Q1
3. The median
4. Q3
5. The highest value of the data set (i.e., maximum)
These values are called a five-number summary of the data
set.
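As a rough sketch, the five-number summary can be computed as follows. The function name `five_number_summary` is hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is an assumption about the convention in use. The data set is the one from the outlier example.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two data values."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])                 # median of the lower half
    q3 = median(data[len(data) - half:])     # median of the upper half
    return data[0], q1, median(data), q3, data[-1]

summary = five_number_summary([5, 6, 12, 13, 15, 18, 22, 50])
print(summary)  # (5, 9.0, 14.0, 20.0, 50)
```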
13. Procedure for constructing a boxplot
• 1. Find the five-number summary for the data
values, that is, the maximum and minimum data
values, Q1 and Q3, and the median.
• 2. Draw a horizontal axis with a scale such that it
includes the maximum and minimum data values.
• 3. Draw a box whose vertical sides go through Q1
and Q3, and draw a vertical line through the
median.
• 4. Draw a line from the minimum data value to
the left side of the box and a line from the
maximum data value to the right side of the box.
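14. [Figure: annotated boxplot. The box spans the first quartile (Q1) to the third quartile (Q3), with a line at the second quartile, the median (M). Min, Q1, M, Q3, and Max make up the five-number summary. The left whisker ends at the smallest data value greater than the lower fence (Min unless the minimum is an outlier); the right whisker ends at the largest data value less than the upper fence (Max unless the maximum is an outlier). An outlier beyond a fence is marked with *.]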
15. • Step 5 Draw a scale for the data on the x axis.
• Step 6 Locate the lowest value, Q1, median, Q3,
and the highest value on the scale.
• Step 7 Draw a box around Q1 and Q3, draw a
vertical line through the median, and connect the
upper value and the lower value to the box.
16. Information Obtained from a Boxplot
• If the median is near the center of the box and
each horizontal line is of approximately equal
length, then the distribution is roughly symmetric.
• If the median is to the left of the center of the box
or the right line is substantially longer than the
left line, then the distribution is skewed right.
• If the median is to the right of the center of the
box or the left line is substantially longer than the
right line, then the distribution is skewed left.
17. Why Use a Boxplot?
• A boxplot provides an alternative to a histogram, a dotplot, and a stem-and-
leaf plot. Among the advantages of a boxplot over a histogram are ease of
construction and convenient handling of outliers.
• In addition, the construction of a boxplot does not involve subjective
judgements, as does a histogram. That is, two individuals will construct the
same boxplot for a given set of data - which is not necessarily true of a
histogram, because the number of classes and the class endpoints must be
chosen. On the other hand, the boxplot lacks the details the histogram
provides.
• Dotplots and stemplots retain the identity of the individual observations; a
boxplot does not. Many sets of data are more suitable for display as
boxplots than as stemplots. Both boxplots and stemplots are useful for
making side-by-side comparisons.
18. Example 1
Consumer Reports published a study of ice cream bars (sigh, only vanilla flavored)
in its August 1989 issue.
Construct a boxplot for the data below.
342 377 319 353 295 234 294 286
377 182 310 439 111 201 182 197
209 147 190 151 131 151
20. Example 2
The weights of 20 randomly selected juniors at MSHS are recorded below:
a) Construct a boxplot of the data
b) Determine if there are any mild or extreme outliers.
121 126 130 132 143 137 141 144 148 205
125 128 131 133 135 139 141 147 153 213
22. Example 3
The following are the scores of 12 members of a women’s golf team in
tournament play:
a) Construct a boxplot of the data.
b) Are there any mild or extreme outliers?
c) Find the mean and standard deviation.
d) Based on the mean and median, describe the distribution.
89 90 87 95 86 81
111 108 83 88 91 79
23. Example 3 - Answer
Q1 = 84.5 Q2 = 88.5 Q3 = 93
Min = 79 Max = 111 Range = 32
IQR = 93 - 84.5 = 8.5 UF = 93 + 1.5(8.5) = 105.75 LF = 84.5 - 1.5(8.5) = 71.75
[Figure: boxplot of the golf scores]
Outliers: 108 and 111 exceed the upper fence (mild outliers, since both are below Q3 + 3(IQR) = 118.5)
Mean = 90.67 St Dev = 9.85
Distribution appears to be skewed right (mean > median, with two high outliers)
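The arithmetic for this example can be checked with a short Python sketch. It takes Q1 and Q3 as listed in the answer, uses IQR = Q3 - Q1 and the 1.5 * IQR fences from the outlier procedure, and assumes the reported standard deviation is the sample standard deviation (which is what `statistics.stdev` computes).

```python
from statistics import mean, median, stdev

scores = [89, 90, 87, 95, 86, 81, 111, 108, 83, 88, 91, 79]

q1, q3 = 84.5, 93                  # quartiles as listed in the answer
iqr = q3 - q1                      # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sorted(s for s in scores if s < lower_fence or s > upper_fence)

print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
print("mean:", round(mean(scores), 2), "median:", median(scores))
print("sample standard deviation:", round(stdev(scores), 2))
```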
24. Example 4
Comparative Boxplots: The scores of 18 first-year college women on the
Survey of Study Habits and Attitudes (this psychological test measures
motivation, study habits and attitudes toward school) are given below:
154 109 137 115 152 140 154 178 101
103 126 126 137 165 165 129 200 148
The college also administered the test to 20 first-year college men. Their
scores are also given:
108 140 114 91 180 115 126 92 169 146
109 132 75 88 113 151 70 115 187 104
Compare the two distributions by constructing boxplots. Are there any
outliers in either group? Are there any noticeable differences or
similarities between the two groups?
25. Example 4 - Answer
            Women   Men
Q1          126     98
Q2          138.5   114.5
Q3          154     143
Min         101     70
Max         200     187
Range       99      117
IQR         28      45
UF          196     210.5
LF          84      30.5
[Figure: side-by-side boxplots titled "Comparing Men and Women Study Habits and Attitudes", with the women's outlier marked *]
Women’s median is greater and they have less variability (spread) in their scores;
the women’s distribution is more symmetric while the men’s is skewed right.
Women have an outlier (200, above the upper fence of 196), while the men do not.
27. Probability experiments
• A probability experiment is a chance process that leads to well-
defined results called outcomes.
• An outcome is the result of a single trial of a probability experiment.
• A sample space is the set of all possible outcomes of a probability
experiment.
• Some sample spaces for various probability experiments are shown
here.
Experiment                      Sample space
Toss one coin                   Head, tail
Roll a die                      1, 2, 3, 4, 5, 6
Answer a true/false question    True, false
Toss two coins                  Head-head, tail-tail, head-tail, tail-head
28. Gender of Children
• Find the sample space for the gender of the
children if a family has three children. Use
B for boy and G for girl.
• Solution
• There are two genders, male and female, and
each child could be either gender.
• Hence, there are eight possibilities, as shown
here: BBB BBG BGB GBB GGG GGB GBG BGG
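The eight outcomes can also be enumerated mechanically. The sketch below uses `itertools.product` from the Python standard library to list every ordered choice of B or G for three children; the variable name `sample_space` is just for illustration.

```python
from itertools import product

# Each child is B or G, so the sample space is every ordered triple.
sample_space = ["".join(outcome) for outcome in product("BG", repeat=3)]

print(len(sample_space))  # 8
print(sample_space)
```

The outcomes come out in a different order from the slide, but the set is the same.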
29. A tree diagram
• A tree diagram is a device consisting of line segments
emanating from a starting point and also from each outcome
point. It is used to determine all possible outcomes of a
probability experiment.
30. Simple and compound events
• An outcome is the result of a single trial of a
probability experiment.
• An event consists of a set of outcomes of a
probability experiment.
31. An event can be one outcome or more than one outcome.
For example, if a die is rolled and a 6 shows, this result is
called an outcome, since it is a result of a single trial.
An event with one outcome is called a simple event.
The event of getting an odd number when a die is rolled is
called a compound event, since it consists of three
outcomes or three simple events (1, 3, 5).
A compound event consists of two or more outcomes or
simple events.
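One way to picture the distinction is to model an event as a set of outcomes drawn from the sample space. The sketch below is only an illustration; the function name `classify` is made up for this example.

```python
die = {1, 2, 3, 4, 5, 6}   # sample space for rolling one die

def classify(event, sample_space):
    """Label an event as 'simple' (one outcome) or 'compound' (two or more)."""
    if not event or not event <= sample_space:
        raise ValueError("event must be a non-empty subset of the sample space")
    return "simple" if len(event) == 1 else "compound"

print(classify({6}, die))        # simple
print(classify({1, 3, 5}, die))  # compound
```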
32. There are three basic interpretations of probability:
1. Classical probability
2. Empirical or relative frequency probability
3. Subjective probability
33. Classical probability uses sample spaces to
determine the numerical probability that an
event will happen.
Classical probability assumes that all outcomes in
the sample space are equally likely to occur.
Equally likely events are events that have the same
probability of occurring. For example, when a single die is
rolled, each outcome has the same probability of
occurring.
34.
35. Drawing Cards
Find the probability of getting a red ace when a card is drawn at random
from an ordinary deck of cards.
Solution
Since there are 52 cards and there are 2 red aces, namely, the ace of
hearts and the ace of diamonds, P(red ace) = 2/52 = 1/26.
Example 4–6 Gender of Children
If a family has three children, find the probability that two of the three
children are girls.
Solution
The sample space for the gender of the children for a family that has
three children has eight outcomes, that is, BBB, BBG, BGB, GBB,
GGG, GGB, GBG, and BGG.
Since there are three ways to have two girls, namely, GGB, GBG, and
BGG, P(two girls) = 3/8
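Both examples count the outcomes in the event and divide by the number of outcomes in the sample space. A minimal Python sketch of that calculation follows, assuming all outcomes are equally likely; the function name `classical_probability` and the card encoding are made up for this example.

```python
from fractions import Fraction
from itertools import product

def classical_probability(event, sample_space):
    """Number of outcomes in the event divided by the number in the sample space.

    Valid only when every outcome in the sample space is equally likely.
    """
    return Fraction(len(event), len(sample_space))

# Drawing cards: an ordinary 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))
red_aces = [card for card in deck
            if card[0] == "A" and card[1] in ("hearts", "diamonds")]
p_red_ace = classical_probability(red_aces, deck)

# Gender of children: exactly two girls among three children.
families = ["".join(f) for f in product("BG", repeat=3)]
two_girls = [f for f in families if f.count("G") == 2]
p_two_girls = classical_probability(two_girls, families)

print(p_red_ace)    # 1/26
print(p_two_girls)  # 3/8
```

`Fraction` reduces 2/52 to 1/26 automatically, which makes it convenient for checking hand-simplified answers.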
36. Probability rules
• Probability Rule 1
• The probability of any event E is a number (either a fraction or
decimal) between and including 0 and 1. This is denoted by 0 ≤
P(E) ≤ 1. Rule 1 states that probabilities cannot be negative or
greater than 1.
• Probability Rule 2
• If an event E cannot occur (i.e., the event contains no members in
the sample space), its probability is 0.
• Rolling a Die
• When a single die is rolled, find the probability of getting a 9.
• Since the sample space is 1, 2, 3, 4, 5, and 6, it is impossible to get
a 9. Hence, the probability is P(9) = 0.
37. • Probability Rule 3
• If an event E is certain, then the probability of E is 1.
• Rolling a Die
• When a single die is rolled, what is the probability of getting a
number less than 7?
• Solution
• Since all outcomes—1, 2, 3, 4, 5, and 6—are less than 7, the
probability is P(number less than 7) = 6/6= 1
• The event of getting a number less than 7 is certain.
• In other words, probability values range from 0 to 1. When the
probability of an event is close to 0, its occurrence is highly
unlikely.
• When the probability of an event is near 0.5, there is about a 50-
50 chance that the event will occur; and when the probability of
an event is close to 1, the event is highly likely to occur.
38. • Probability Rule 4
• The sum of the probabilities of all the
outcomes in the sample space is 1.
• For example, in the roll of a fair die, each
outcome in the sample space has a probability
of 1/6. Hence, the sum of the probabilities of the
outcomes is 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 6/6 = 1.
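The four rules can be checked numerically for a fair die. The sketch below is illustrative only: it assigns each face a probability of 1/6 and defines a hypothetical helper `probability` that adds up the probabilities of the outcomes in an event.

```python
from fractions import Fraction

# Fair die: each of the six faces has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def probability(event):
    """Sum the probabilities of the outcomes in the event that are in the sample space."""
    return sum((p for face, p in die.items() if face in event), Fraction(0))

rule_1 = all(0 <= p <= 1 for p in die.values())   # every probability is in [0, 1]
p_nine = probability({9})                         # Rule 2: impossible event
p_less_than_7 = probability(range(1, 7))          # Rule 3: certain event
total = sum(die.values())                         # Rule 4: probabilities sum to 1

print(rule_1, p_nine, p_less_than_7, total)  # True 0 1 1
```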