Statistics (GE 4 CLASS).pptx

Section 2. Mathematics as a Tool
(Part 1 – Data Management)
A Review
GenEd Second Generation Training
Mathematics in the Modern World

Some Statistical Terminologies…
Statistics involves the collection, organization, summarization,
presentation, and interpretation of data. (Aufmann et al, 2013)
Population – refers to an entire group that is being studied.
Parameter – is a value calculated using all the data from a
population.
Census – is a survey of an entire population.
Sample – is a smaller subset of the population, ideally one that is
fairly representative of the population.
Statistic – is a value calculated using the data from the sample.

Classifying variables
Variable – a particular characteristic or trait of the units of the
population that can take on different values.
Qualitative – when a characteristic can be placed into well-defined
groups or categories.
Quantitative – when a characteristic is expressed as a numerical
value.
Discrete – the set of possible values is finite or countably
infinite; for example, a count.
Continuous – can take any value within a range; that
is, a measurement.

Levels of Measurement – were first proposed by the American
psychologist Stanley Smith Stevens in 1946.
Nominal – is one in which the values of the variables are names
or labels.
Ordinal – uses categories (often coded as numbers) that convey a
meaningful order.
Interval – measurement shows order and the spaces between the
values also have significant meaning.
Ratio – the ratio between any two values has meaning, because
the data include an absolute zero value.

Measures of Center
Once the data are collected, it is useful to summarize the data
set by identifying a value around which the data are centered.
Mode – is the most frequently occurring number in a data set.
Median – is the middle number or the mean of the two middle
numbers in an ordered set of data.
Mean – is the numerical balancing point of the data set.

Measures of Center
Which measure of center is most useful?
• A teacher wants to know about her students' family situations.
She asks for the number of children in their families:
6 3 2 3 4 1 2 2 4 3 1 2 2 4
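As a quick check on the teacher's data, the three measures of center can be computed with Python's standard `statistics` module. This is only a sketch; the variable names `children` and `center` are chosen here for illustration.

```python
from statistics import mean, median, mode

# Number of children reported by each of the 14 students (from the slide).
children = [6, 3, 2, 3, 4, 1, 2, 2, 4, 3, 1, 2, 2, 4]

center = {
    "mean": mean(children),      # sum of the values divided by how many there are
    "median": median(children),  # mean of the two middle values, since n is even
    "mode": mode(children),      # most frequently occurring value
}
print(center)
```

Comparing the three values is one way to start the discussion of which measure best describes a typical family in this class.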

Measures of Center
Mean – Median – Mode
• The mean is easy to compute with a single formula. The
median, in contrast, requires ordering the data first.
• The mean is affected by outliers while the median is
resistant. In a sense, the median is able to resist the pull of a
far away value, but the mean is drawn to such values.
• A change in any of the numbers changes the mean, and the
mean can be changed drastically by changing an extreme
value.
• In contrast, the median and the mode of a set of data are
usually not changed by changing an extreme value.
• The mean, the median, and the mode are all averages;
however, they are generally not equal.

Measures of Center
Compare the mean, the median, and the mode for the
salaries of 5 employees of a small company.
Salaries: P370,000 P60,000 P36,000 P20,000 P20,000
Mean = P101,200
Median = P 36,000
Mode = P 20,000
Most of the employees of this company would probably
agree that the median of P36,000 better represents the average of
the salaries than does either the mean or the mode.
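The effect of an extreme value on the three averages can be sketched with the salary data above. The doubled top salary is a hypothetical change introduced only for this illustration; `summarize` is a helper name assumed here.

```python
from statistics import mean, median, mode

# Salaries in pesos, taken from the slide.
salaries = [370_000, 60_000, 36_000, 20_000, 20_000]

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(values), median(values), mode(values)

original = summarize(salaries)

# Hypothetical change: the top salary doubles; the other four stay the same.
changed = summarize([740_000] + salaries[1:])

print("original:", original)
print("changed: ", changed)
```

Only the mean moves when the extreme value changes, which is the sense in which the median and the mode are resistant.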

Measures of Center
Consider the data in the table below.
• In the first game, Barry has the best average.
• In the second game, Barry has the best average.
• If the statistics for the games are combined, Warren has the best
average.
In statistics, an example such as this is known as Simpson's
paradox.

Measures of Center
The Simpson-Yule paradox occurs when a comparison that holds
within each of several groups of data is reversed or disappears
when the groups are combined and viewed as a whole.
Consider the following data on the test scores for two
students.
Is this an example of Simpson’s paradox? Explain.
Maria
English: 84, 65, 70, 90, 99, 84
History: 89, 75, 85
English and History combined: Average: ?
Sarah
English: 66, 84, 75, 77, 94, 96, 81
History: 72, 78, 98, 81, 68, 92, 88, 86
English and History combined: Average: ?
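One way to fill in the missing averages is sketched below. The combined average pools all of a student's scores, so each subject is weighted by its number of scores; the names `scores`, `averages`, and `table` are assumptions made for this example.

```python
from statistics import mean

# Test scores from the slide, grouped by student and subject.
scores = {
    "Maria": {"English": [84, 65, 70, 90, 99, 84],
              "History": [89, 75, 85]},
    "Sarah": {"English": [66, 84, 75, 77, 94, 96, 81],
              "History": [72, 78, 98, 81, 68, 92, 88, 86]},
}

def averages(subjects):
    """Per-subject means plus the mean of all scores pooled together."""
    result = {name: mean(vals) for name, vals in subjects.items()}
    pooled = [v for vals in subjects.values() for v in vals]
    result["Combined"] = mean(pooled)
    return result

table = {student: averages(subjects) for student, subjects in scores.items()}
for student, row in table.items():
    print(student, {k: round(v, 2) for k, v in row.items()})
```

Comparing the two students column by column, and then in the combined column, shows whether the ranking within each subject survives pooling.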

Measures of Dispersion
Another important feature that can help us understand more
about a data set is the manner in which the data are distributed.
• Range is the difference between the largest value (maximum)
and the smallest value (minimum) in the data.
• Standard deviation is an extremely important measure of
spread that is based on the mean. It measures the typical
deviation of the data points from the mean.
• Variance is the square of the standard deviation of the data. It
is expressed in squared units, not in the units of the original data.

Measure of Dispersion
Consider the following data sets:
1. 5 5 5 5 5 5 5 5 5 5
2. 0 0 0 0 0 10 10 10 10 10
3. 4 4 4 5 5 5 6 6 6
4. 0 5 5 5 5 5 5 10
Summary for each data set (mean, median, range R, sample standard deviation s):
Data set Mean Median R s
1 5 5 0 0
2 5 5 10 5.27
3 5 5 2 0.87
4 5 5 10 2.67

Measure of Dispersion
Properties that determine the usefulness of the standard
deviation:
• It is used to describe the variability of the distribution only
when the mean is used to describe the center.
• It is equal to zero when there is no variability. This happens
only when all observations are of the same value.
• It has the same units of measurement as the original
observations.
• Like the mean, it can be influenced by outliers.
For a population: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{n}}$
For a sample: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
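The two formulas can be compared on the four data sets from the earlier slide. The sketch below uses `statistics.stdev` (divides by n - 1) and `statistics.pstdev` (divides by n); the dictionary layout and the name `spread` are assumptions made for this example.

```python
from statistics import pstdev, stdev, variance

# The four data sets from the "Measure of Dispersion" slide.
data_sets = {
    1: [5] * 10,
    2: [0] * 5 + [10] * 5,
    3: [4, 4, 4, 5, 5, 5, 6, 6, 6],
    4: [0, 5, 5, 5, 5, 5, 5, 10],
}

def spread(values):
    """Range, sample and population standard deviation, and sample variance."""
    return {
        "range": max(values) - min(values),
        "sample_sd": stdev(values),             # divides by n - 1
        "population_sd": pstdev(values),        # divides by n
        "sample_variance": variance(values),    # square of the sample sd
    }

summary = {label: spread(values) for label, values in data_sets.items()}
for label, row in summary.items():
    print(label, {k: round(v, 2) for k, v in row.items()})
```

All four sets share the same mean and median, so only the measures of spread tell them apart.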

Measures of Relative Position
z-score
The z-score for a given data value x is the number of standard
deviations that x is above or below the mean of the data.
z-score of $x_i$ in a population: $z_{x_i} = \dfrac{x_i - \mu}{\sigma}$
z-score of $x_i$ in a sample: $z_{x_i} = \dfrac{x_i - \bar{x}}{s}$

Measure of Relative Position
Percentiles and Quartiles
are useful when you want to know where a score is located
in reference to the other scores.
• A percentile is a data value for which the specified percentage
of the data is below that value.
• The median is the 50th percentile.
• The 25th, 50th, and 75th percentiles divide the data into lower
quartile Q1, middle quartile Q2, and upper quartile Q3,
respectively.
• In using quartiles, there are five numbers to be used
altogether: min value, Q1, median, Q3, and max value.
• Quartiles are useful for box plots.

Problem. (Task: Discuss your solutions to each of the 3 problems)
1. The mean time to download a file is 12 minutes with std.
deviation of 4 minutes. Your download time is 20 minutes.
Your friend's download time is 6 minutes. How can you
compare your download time with your friend's?
2. Raul takes 2 tests in Chemistry. He scored 72 in long test 1 for
which the class mean score was 65 with std. deviation of 8. He
received a score of 60 in long test 2 for which the mean was 45
and the std. deviation was 12. In comparison to other students,
did Raul do better in LT 1 or LT 2?
3. A consumer group tested a sample of 100 light bulbs. The
mean life expectancy of the light bulbs was 842 hours with std.
deviation of 90 hours. One particular light bulb from the
company has a life expectancy with a z-score of 1.2. What was
the life span of the bulb?
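A possible starting point for the discussion is to compute the z-scores directly. The helpers `z_score` and `value_from_z` below are names assumed for this sketch; the second simply solves the z-score formula for x.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Invert the z-score formula: x = mean + z * sd."""
    return mean + z * sd

# Problem 1: download times, mean 12 minutes, sd 4 minutes.
z_you = z_score(20, 12, 4)
z_friend = z_score(6, 12, 4)

# Problem 2: Raul's two long tests.
z_lt1 = z_score(72, 65, 8)
z_lt2 = z_score(60, 45, 12)

# Problem 3: bulb with z = 1.2, mean 842 hours, sd 90 hours.
bulb_hours = value_from_z(1.2, 842, 90)

print(z_you, z_friend, z_lt1, z_lt2, bulb_hours)
```

The larger z-score in Problem 2 marks the test on which Raul stood higher relative to his classmates.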

Normal Distribution and Probability
Normal Distribution
is an extremely important concept, because it occurs so often
in the data we collect from the natural world, as well as in many
of the more theoretical ideas that are the foundation of statistics.

Characteristics of a Normal Distribution
Shape
A normal distribution is a perfectly symmetric, mound-shaped
distribution. It is commonly referred to as a normal curve, or
bell curve.

Center
Due to the exact symmetry of a normal curve, the center of a
normal distribution is located at the highest point of the
distribution, and all the statistical measures of center are equal.

Center
It is also important to realize that this center peak divides the
data into two equal parts.

Spread
In an idealized normal distribution of a continuous random
variable, the distribution continues infinitely in both directions.

Area under the Curve
• Areas under the curve that are symmetric about the mean are
equal.
• The total area under the curve is 1.

Empirical Rule for a Normal Distribution
In a normal distribution, approximately
• 68% of the data lie within 1 standard deviation of the mean.
• 95% of the data lie within 2 standard deviations of the mean.
• 99.7% of the data lie within 3 standard deviations of the mean.

Empirical Rule for a Normal Distribution
Example. The heights of a large group of people are assumed to be
normally distributed. Their mean height is 66.5 inches, and the
standard deviation is 2.4 inches. Find and interpret the
intervals within one, two, and three standard deviations
of the mean.
Within one standard deviation of the mean: 66.5 ± 2.4, or 64.1 to 68.9.
Approximately 68% of the people are between 64.1 and 68.9 inches tall.
Within two standard deviations of the mean: 66.5 ± 2(2.4), or 61.7 to 71.3.
Approximately 95% of the people are between 61.7 and 71.3 inches tall.
Within three standard deviations of the mean: 66.5 ± 3(2.4), or 59.3 to 73.7.
Nearly all of the people (about 99.7%) are between 59.3 and 73.7 inches tall.
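The intervals in this example, and the percentages the Empirical Rule attaches to them, can be checked against the exact normal model. This sketch uses `statistics.NormalDist` from the Python standard library; the dictionary `intervals` is a name assumed here.

```python
from statistics import NormalDist

mu, sigma = 66.5, 2.4            # mean and standard deviation of the heights
heights = NormalDist(mu=mu, sigma=sigma)

intervals = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    share = heights.cdf(high) - heights.cdf(low)   # exact area between low and high
    intervals[k] = (round(low, 1), round(high, 1), share)
    print(f"within {k} sd: {low:.1f} to {high:.1f} inches, share = {share:.4f}")
```

The exact shares are close to, but not identical with, the rounded 68%, 95%, and 99.7% of the rule.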

Problem. (Use the Empirical rule)
A vegetable distributor knows that during the month of August, the
weights of its tomatoes are normally distributed with a mean of 0.61 kg and a
standard deviation of 0.15 kg.
a. What percent of the tomatoes weigh less than 0.76 kg?
b. In a shipment of 6000 tomatoes, how many tomatoes can be expected to
weigh more than 0.31 kg?
c. In a shipment of 4500 tomatoes, how many tomatoes can be expected to
weigh from 0.31 kg to 0.91 kg?
a. 0.76 kg is 1 standard deviation above the mean of 0.61 kg. In a normal distribution, 34% of all data
lie between the mean and 1 standard deviation above the mean, and 50% of all data lie below the
mean. Thus, 34% + 50% = 84% of the tomatoes weigh less than 0.76 kg.
b. 0.31 kg is 2 standard deviations below the mean of 0.61 kg. In a normal distribution, 47.5% of all
data lie between the mean and 2 standard deviations below the mean, and 50% of all data lie above
the mean. This gives a total of 47.5% + 50% = 97.5% of the tomatoes that weigh more than 0.31 kg.
Therefore 97.5% of 6000 = 5850 of the tomatoes can be expected to weigh more than 0.31 kg.
c. 0.31 kg is 2 standard deviations below the mean of 0.61 kg and 0.91 kg is 2 standard deviations
above the mean of 0.61 kg. In a normal distribution, 95% of all data lie within 2 standard deviations
of the mean. Therefore 95% of 4500 = 4275 of the tomatoes can be expected to weigh from 0.31 kg
to 0.91 kg.
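The Empirical Rule answers above can be set beside the exact normal probabilities. In this sketch the `rule` shares are the ones used in the solutions, and `exact` is computed with `statistics.NormalDist`; both dictionary names are assumptions for illustration.

```python
from statistics import NormalDist

tomatoes = NormalDist(mu=0.61, sigma=0.15)   # weights in kg

# Shares taken from the Empirical Rule, as in the worked solutions.
rule = {"a": 0.50 + 0.34, "b": 0.50 + 0.475, "c": 0.95}

# Exact areas under the normal curve for the same three questions.
exact = {
    "a": tomatoes.cdf(0.76),                        # less than 0.76 kg
    "b": 1 - tomatoes.cdf(0.31),                    # more than 0.31 kg
    "c": tomatoes.cdf(0.91) - tomatoes.cdf(0.31),   # from 0.31 kg to 0.91 kg
}

expected_b = rule["b"] * 6000   # tomatoes heavier than 0.31 kg in 6000
expected_c = rule["c"] * 4500   # tomatoes between 0.31 and 0.91 kg in 4500

for part in "abc":
    print(part, round(rule[part], 4), round(exact[part], 4))
print(round(expected_b), round(expected_c))
```

The differences between the two columns come only from the rounding built into the Empirical Rule.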

Standard Normal Distribution
If the original distribution of x values is a normal distribution,
then the corresponding distribution of z-scores will also be a
normal distribution. This normal distribution of z-scores is
called the standard normal distribution.
The standard normal distribution is the normal distribution
that has a mean of 0 and a standard deviation of 1.

Standard Normal Distribution
In the standard normal distribution, the area of the distribution
from z = a to z = b represents
• the percentage of z-values that lie in the interval from a to b.
• the probability that z lies in the interval from a to b.

Problem
1. A soda machine dispenses soda into 12-ounce cups. Tests show
that the actual amount of soda dispensed is normally distributed,
with a mean of 11.5 oz and a standard deviation of 0.2 oz.
a. What percent of cups will receive less than 11.25 oz of soda?
b. What percent of cups will receive between 11.2 oz and 11.55 oz
of soda?
c. If a cup is chosen at random, what is the probability that the
machine will overflow the cup?
2. The OnTheGo company manufactures laptop computers. A study
indicates that the life spans of their computers are normally
distributed, with a mean of 4.0 years and a standard deviation of
1.2 years. How long should the company warrant its computers if
the company wishes less than 4% of its computers to fail during
the warranty period?
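Both problems reduce to areas under a normal curve, so one way to check hand solutions from a z-table is the sketch below. It relies on `statistics.NormalDist` (`cdf` for areas, `inv_cdf` for the value at a given cumulative area); the variable names are assumptions made here.

```python
from statistics import NormalDist

# Problem 1: soda dispensed, mean 11.5 oz, sd 0.2 oz, cup holds 12 oz.
soda = NormalDist(mu=11.5, sigma=0.2)
p_under = soda.cdf(11.25)                      # a. less than 11.25 oz
p_between = soda.cdf(11.55) - soda.cdf(11.2)   # b. between 11.2 oz and 11.55 oz
p_overflow = 1 - soda.cdf(12)                  # c. more than the cup can hold

# Problem 2: laptop life spans, mean 4.0 years, sd 1.2 years.
laptops = NormalDist(mu=4.0, sigma=1.2)
warranty_years = laptops.inv_cdf(0.04)         # life span with 4% of laptops below it

print(round(p_under, 4), round(p_between, 4), round(p_overflow, 4))
print(round(warranty_years, 2))
```

Because the company wants fewer than 4% of failures, the warranty would need to be slightly shorter than the value of `warranty_years`.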

Statistical Hypotheses
A hypothesis is simply a conjecture about a characteristic or
set of facts.
When performing statistical analyses, our hypotheses provide
the general framework of what we are testing and how to perform
the test.
Hypothesis testing involves testing the difference between a
hypothesized value of a population parameter and the estimate of
that parameter which is calculated from a sample.

Overview of the Process
The hypothesis to be tested is called the null hypothesis and
is given the symbol H0. The alternative hypothesis is given the
symbol H1.

Sample null and alternative hypotheses
If H1 contains either > or <, the test is referred to as a one-sided test.
If H1 contains ≠, it is a two-sided test.
1. H0 : μ = μ0
(Mean is equal to a reference value)
H1 : μ ≠ μ0 or
H1 : μ > μ0 or H1 : μ < μ0
2. H0 : μ1 = μ2
(Two population means are equal)
H1 : μ1 ≠ μ2 or
H1 : μ1 > μ2 or H1 : μ1 < μ2
3. H0 : μ1 = μ2 = . . . = μk
(The k population means are equal)
H1 : at least two means are not equal
4. H0 : π1 = π2
(Two population proportions are
equal)
H1 : π1 ≠ π2 or
H1 : π1 > π2 or H1 : π1 < π2
5. H0 : ρ = 0
(There is no linear correlation
between the two variables)
H1 : ρ ≠ 0 or
H1 : ρ > 0 or H1 : ρ < 0

Tests Concerning the Mean
To test whether an observed difference between a population
mean and a reference value, or between two means, is
significant or can be attributed to chance, the following
statistical tests are used.
The z–test is used if the population standard deviation is
known. If it is not known, the sample standard deviation can be
used as an estimate of the population standard deviation provided
that the sample size is large; that is, n ≥ 30.
The t–test is used if the sample size is less than 30 and only
the sample standard deviation is known.
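The choice between the two tests can be sketched for a single mean compared with a reference value. The slides do not give the test statistic, so the standard form (sample mean minus reference value, divided by the standard deviation over the square root of n) is assumed here, and the sample data and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_test(sample, mu0, sigma=None):
    """Pick z or t by the rule above and return (kind, test statistic).

    sigma is the population standard deviation, if it is known.
    """
    n = len(sample)
    if sigma is not None:
        kind, sd = "z", sigma            # population sd known
    elif n >= 30:
        kind, sd = "z", stdev(sample)    # large sample: s estimates sigma
    else:
        kind, sd = "t", stdev(sample)    # small sample, only s available
    statistic = (mean(sample) - mu0) / (sd / sqrt(n))
    return kind, statistic

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under the standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: 8 measurements tested against a reference value of 50.
sample = [52, 48, 55, 51, 49, 53, 50, 54]
kind_t, t_stat = one_sample_test(sample, mu0=50)
kind_z, z_stat = one_sample_test(sample, mu0=50, sigma=2)
p_value = two_sided_p_from_z(z_stat)
print(kind_t, round(t_stat, 3), kind_z, round(z_stat, 3), round(p_value, 4))
```

A t statistic is compared with a t distribution with n - 1 degrees of freedom, which the standard library does not provide, so only the z case gets a p-value in this sketch.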

The purpose of Analysis of variance (ANOVA) is much the
same as the t–tests; however, if a series of several t–tests is
used to evaluate several mean differences, the risk of Type I error
increases; that is, the α-levels accumulate over a series of tests so
that the final experimentwise α-level can be quite large.
The ANOVA is necessary to protect researchers from
excessive risk of a Type I error.
The ANOVA allows researchers to evaluate all of the mean
differences in a single hypothesis test using a single α-level and,
thereby, keeps the risk of a Type I error under control no matter
how many different means are being compared.
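How quickly the α-levels accumulate can be illustrated with a small calculation. The sketch assumes the separate tests are independent, which pairwise t-tests on the same groups are not exactly, so the figures are only indicative; the function names are chosen for this example.

```python
from math import comb

def pairwise_tests(k_means):
    """Number of t-tests needed to compare every pair among k means."""
    return comb(k_means, 2)

def familywise_error(alpha, n_tests):
    """Chance of at least one Type I error in n_tests tests, assuming independence."""
    return 1 - (1 - alpha) ** n_tests

for k in (3, 4, 5):
    m = pairwise_tests(k)
    print(k, "means:", m, "tests, experimentwise alpha about",
          round(familywise_error(0.05, m), 3))
```

With five means, ten pairwise tests at α = 0.05 already push the experimentwise level to roughly 0.40 under this assumption.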

The ANOVA tests the homogeneity of a set of means. If
the null hypothesis is rejected in favor of the alternative
hypothesis that the means are not all equal, further (post hoc)
tests should be done to determine which pairs of means are
significantly different.
The following Post Hoc Tests are available in most statistical
software:
1. Duncan’s multiple range test
2. Tukey’s procedure
3. Scheffe test
4. Fisher’s least significant difference

Linear Regression and Correlation
Correlation measures the relationship between bivariate data.
Bivariate data are data sets in which each subject has two
observations associated with it.
A response variable measures an outcome or result of a study.
An explanatory variable is a variable that we think explains
or causes changes in the response variables.
Linear regression is an approach for modeling the relationship
between a dependent variable (outcome) and one or more
explanatory variables. The case of one explanatory variable is
called simple linear regression.

A scatterplot is a graph of plotted points showing the relationship
between two numerical variables.

Examining a Scatterplot
1. Describe the overall pattern of a scatterplot by the form,
direction, and strength of the relationship.
2. Then look for any striking deviations from the pattern. Identify
each occurrence of an outlier.

Linear Regression
– involves using data to calculate a line that best fits that data
and then using that line to predict scores.
Least-Squares Regression Line
– is the line that minimizes the sum of the squares of the
vertical deviations from each data point to the line.
The equation of the least-squares line is
$\hat{y} = ax + b$
where
$a = \dfrac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$ and $b = \bar{y} - a\bar{x}$

Linear Correlation Coefficient
– measures the strength of a linear relationship between two
variables and is denoted by r.
If the linear correlation coefficient r is positive, the
relationship between the variables has a positive correlation. In
this case, if one variable increases, the other variable also tends to
increase.
If r is negative, the linear relationship between the
variables has a negative correlation. In this case, if one variable
increases, the other variable tends to decrease.
$r = \dfrac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$

Happiness vs Life Expectancy
Source: CHED GenEd 1st Generation Training
What is the equation of the least-squares regression line?
Country Happiness Life Expectancy
Japan 6.8 80.80
South Korea 6.2 74.20
China 6.3 70.40
Taiwan 6.2 76.40
Indonesia 6.6 78.00
Philippines 6.4 69.00
Singapore 6.8 77.60
Vietnam 6.1 69.40
India 6.2 63.00
Bangladesh 5.7 59.50

Happiness vs Life Expectancy
a = 16.661
b = −33.635
$\hat{y} = 16.661x - 33.635$
Will the line give accurate predictions?
Correlation Coefficient r = 0.82
Predict the life expectancy for the following countries:
Actual LE
a) Zimbabwe: happiness = 4.2 35.40
b) Ghana: happiness = 5.4 57.90
c) Belarus: happiness = 6.1 68.60
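The slope, intercept, and correlation coefficient reported above can be reproduced from the table with the sum formulas for a, b, and r. The sketch below is a direct transcription of those formulas; the function names `least_squares` and `correlation` are assumptions for this example.

```python
from math import sqrt

# Happiness (x) and life expectancy (y) for the ten countries in the table.
happiness = [6.8, 6.2, 6.3, 6.2, 6.6, 6.4, 6.8, 6.1, 6.2, 5.7]
life_exp = [80.80, 74.20, 70.40, 76.40, 78.00, 69.00, 77.60, 69.40, 63.00, 59.50]

def least_squares(x, y):
    """Slope a and intercept b of the least-squares line y-hat = a*x + b."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = sy / n - a * sx / n          # b = y-bar - a * x-bar
    return a, b

def correlation(x, y):
    """Linear correlation coefficient r from the sum formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

a, b = least_squares(happiness, life_exp)
r = correlation(happiness, life_exp)
print(round(a, 3), round(b, 3), round(r, 2))

# Predictions for the three countries listed, next to their actual values.
predictions = {}
for country, x, actual in [("Zimbabwe", 4.2, 35.40),
                           ("Ghana", 5.4, 57.90),
                           ("Belarus", 6.1, 68.60)]:
    predictions[country] = a * x + b
    print(country, round(predictions[country], 1), "actual:", actual)
```

Zimbabwe's happiness score lies well outside the range of the fitted data, so its prediction is an extrapolation and deserves extra caution.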

Example. Unemployment and family income are undoubtedly related; we
would assume that as the annual unemployment rate increases,
average annual family income would decrease. The table below gives
the annual unemployment rate and the average annual family income for
each region of the Philippines, according to the Philippine Statistics
Authority.
a. Use linear regression to predict the average annual family income of
the Philippines if the annual unemployment rate is 6.3%.
b. Use linear regression to predict the annual unemployment rate if the
average annual family income of the Philippines is P267,000.
c. Are the predictions in parts (a) and (b) reliable? Why or why not?

Region / Annual Unemployment Rate (%) / Ave. Annual Family Income (in P100,000)
NCR 8.5 4.25
Cordillera 4.8 2.82
I - Ilocos Region 8.4 2.38
II - Cagayan Valley 3.2 2.37
III - Central Luzon 7.8 2.99
IVA - CALABARZON 8.0 3.12
IVB - MIMAROPA 3.3 2.22
V - Bicol Region 5.6 1.87
VI - Western Visayas 5.4 2.26
VII - Central Visayas 5.9 2.39
VIII- Eastern Visayas 5.4 1.97
IX - Zamboanga Peninsula 3.5 1.90
X - Northern Mindanao 5.6 2.21
XI - Davao Region 5.8 2.47
XII - SOCCSKSARGEN 3.5 1.88
Caraga 5.7 1.98
ARMM 3.5 1.39

References:
Aufmann et al. (2013). Mathematical Excursions, 3ed. Brooks/Cole, Cengage
Learning.
Bluman, A. G. (2012). Elementary statistics: a step by step approach, 8ed. New
York: McGraw-Hill.
COMAP, Inc. (2013). For all practical purposes: mathematical literacy in
today’s world. New York: W.H. Freeman and Company.
Johnson & Mowry (2012). Mathematics: a practical odyssey. Brooks/Cole,
Cengage Learning.
Lawsky et al. (2014). CK-12 advanced probability and statistics, 2ed. CK-12
Foundation.
Nocon, R. & Nocon, E. (2016). Essential mathematics for the modern world.
QC: C & E Publishing, Inc.
Vistru-Yu, C. and Gozon, A. (2016). Statistics a review ppt. CHED’s GE First
Generation Training.
