3. Basic Statistics
Measure of Central Tendency
(how to describe a set of data by identifying
the central position within that set of data)
4. Mean
The mean (or average) is the most popular and best-known measure
of central tendency. It can be used with both discrete and
continuous data, although it is most often used with continuous
data. For n values x_1, ..., x_n it is represented as
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median
The median is the middle score for a set of data that has been
arranged in order of magnitude. The median is less affected by
outliers and skewed data than the mean.
In order to calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange the data into order of magnitude:
14 35 45 55 55 56 56 65 87 89 92
In this case, 56 is the middle mark because there are 5 scores
before it and 5 scores after it. This works fine when you have an
odd number of scores.
But what will the median be if you have an even number of scores?
(Take the middle two scores and average them.)
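The two cases can be checked with a short sketch using Python's standard `statistics` module. The even-sized list is a hypothetical variation, made by dropping the largest score, and is not part of the original data.

```python
from statistics import median

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Odd number of scores (11): the median is the 6th value after sorting.
odd_median = median(scores)

# Hypothetical even case: drop the largest score so 10 values remain.
# The median is then the average of the 5th and 6th sorted values.
even_scores = sorted(scores)[:-1]
even_median = median(even_scores)

print(odd_median, even_median)
```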
Different Measure of Central Tendency
7. Mode
The mode is the most frequent score in our data set. It
corresponds to the highest bar in a bar chart or histogram.
However, one of the problems with the mode is that it is not
necessarily unique, so it leaves us with problems when we have
two or more values that share the highest frequency.
Empirical relation (for moderately skewed distributions):
Mean − Mode ≈ 3(Mean − Median)
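The non-uniqueness problem is easy to see in code. The sketch below reuses the scores from the median example and relies on `statistics.multimode` (Python 3.8+), which returns every value tied for the highest frequency.

```python
from statistics import multimode

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# multimode returns all values that share the highest frequency,
# so a data set with two modes yields a list of length two.
modes = multimode(scores)
print(modes)
```

Here both 55 and 56 occur twice, so the data set has two modes.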
8. Mode for Grouped Data
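Formula:
$\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2}\, c$
where L is the lower boundary of the modal class, ∆1 is the
difference between the frequency of the modal class and that of
the class before it, ∆2 is the difference between the frequency
of the modal class and that of the class after it, and c is the
class width.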
Steps in Computing the Mode for Grouped Data
1. Determine the modal class.
2. Get the value of ∆1.
3. Get the value of ∆2.
4. Get the lower boundary of the modal class.
5. Apply the formula by substituting the values
obtained in the preceding steps.
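The five steps can be sketched as a small function. This assumes the usual textbook formula Mode = L + ∆1/(∆1 + ∆2) × class width; the function name `grouped_mode` and the frequency table are hypothetical, chosen only for illustration.

```python
def grouped_mode(lower_boundaries, frequencies, class_width):
    """Mode for grouped data: L + d1 / (d1 + d2) * class_width."""
    k = frequencies.index(max(frequencies))   # step 1: modal class
    prev_f = frequencies[k - 1] if k > 0 else 0
    next_f = frequencies[k + 1] if k < len(frequencies) - 1 else 0
    d1 = frequencies[k] - prev_f              # step 2: delta 1
    d2 = frequencies[k] - next_f              # step 3: delta 2
    lower = lower_boundaries[k]               # step 4: lower boundary
    return lower + d1 / (d1 + d2) * class_width  # step 5: substitute

# Hypothetical table: classes 9.5-19.5, 19.5-29.5, 29.5-39.5, 39.5-49.5
example_mode = grouped_mode([9.5, 19.5, 29.5, 39.5], [4, 10, 6, 2], 10)
print(example_mode)
```

If several classes tie for the highest frequency, this sketch simply uses the first one.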
10. Example:
Example 1:
Ms. Sulit collects the data on the ages of Mathematics teachers
in Santa Rosa School, and her study yields the following:
38 35 28 36 35 33 40
Solution:
$\bar{x} = \frac{38+35+28+36+35+33+40}{7} = \frac{245}{7} = 35$
Based on the computed mean, 35 is the average age of Mathematics
teachers.
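The same arithmetic can be verified in a couple of lines with the standard library:

```python
from statistics import mean

ages = [38, 35, 28, 36, 35, 33, 40]

# Sum of the ages divided by the number of teachers.
average_age = mean(ages)
print(sum(ages), len(ages), average_age)
```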
11. Example
Below are Amaya’s subjects and the corresponding number of
units and grades she got for the previous grading period.
Compute her grade point average.
Subject          Units   Grade
Filipino          0.9      86
English           1.5      85
Mathematics       1.5      88
Science           1.8      87
Social Studies    0.9      86
TLE               1.2      83
MAPEH             1.2      87
Grade point average = Σ(units × grade) / Σ units
= 774.9 / 9.0 = 86.1
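The grade point average here is a weighted mean, with the units acting as weights. A minimal sketch of the computation:

```python
units = [0.9, 1.5, 1.5, 1.8, 0.9, 1.2, 1.2]
grades = [86, 85, 88, 87, 86, 83, 87]

# Weighted mean: each grade counts in proportion to its units.
weighted_sum = sum(u * g for u, g in zip(units, grades))
gpa = weighted_sum / sum(units)
print(round(gpa, 1))
```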
12. Percentile
A percentile provides information about how the data are spread
over the interval from the smallest value to the largest value.
The p-th percentile is a value such that at least p% of the items
take on this value or less and at least (100 − p)% of the items
take on this value or more.
Quartile
a. First Quartile (Q1) = 25th percentile
b. Second Quartile (Q2) = 50th percentile (median)
c. Third Quartile (Q3) = 75th percentile
Computation: the p-th percentile is the value in the i-th
position of the ordered data, where
i = (p/100) n
and n is the number of observations.
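A sketch of the position rule follows. The text only gives i = (p/100)n; how a fractional i is handled is an assumption here (a common textbook convention: round a fractional i up, and average positions i and i+1 when i is a whole number). The function name `percentile` is hypothetical.

```python
import math

def percentile(data, p):
    """p-th percentile (0 < p < 100) using the position i = (p/100) * n."""
    ordered = sorted(data)
    n = len(ordered)
    i = p * n / 100
    if i != int(i):
        # Assumed convention: a fractional position is rounded up.
        return ordered[math.ceil(i) - 1]
    # Assumed convention: a whole position averages items i and i+1.
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
q1, q2, q3 = (percentile(scores, p) for p in (25, 50, 75))
print(q1, q2, q3)
```

Other software packages interpolate between positions, so their quartiles may differ slightly from this rule.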
13. Measure of Dispersion
(how "spread out" a group of scores is)
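Range: difference between the largest and smallest values of the
data set.
Interquartile Range: difference between the third and first
quartiles (Q3 − Q1).
Variance: average of the squared differences between each data
value and the mean.
For a sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$;
for a population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Standard deviation: positive square root of the variance.
For a sample: $s = \sqrt{s^2}$; for a population: $\sigma = \sqrt{\sigma^2}$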
14. Quartile Deviation: half of the distance between the first
and third quartiles, (Q3 − Q1)/2.
Mean Absolute Deviation: average of the absolute deviations of
the values from a central value "a" (usually the mean or the
median).
Coefficient of variation: how large the standard deviation is in
relation to the mean, defined as (standard deviation / mean) × 100
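These measures can be computed with the standard `statistics` module. The sketch below reuses the teachers' ages from the earlier mean example as illustrative data and takes the mean as the central value for the mean absolute deviation.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

ages = [38, 35, 28, 36, 35, 33, 40]

data_range = max(ages) - min(ages)
sample_var = variance(ages)        # divides by n - 1
population_var = pvariance(ages)   # divides by N
sample_sd = stdev(ages)
population_sd = pstdev(ages)

# Mean absolute deviation about the mean (central value a = mean).
center = mean(ages)
mad = sum(abs(x - center) for x in ages) / len(ages)

# Coefficient of variation in percent.
cv = sample_sd / center * 100
print(data_range, sample_var, population_var, mad, cv)
```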
15. Distribution Shape
Skewness: a measure of the asymmetry of a frequency
distribution.
Zero indicates perfect symmetry; the normal distribution has
a skewness of zero.
Positive skewness indicates that the tail of the
distribution is more stretched on the side above the mean.
Negative skewness indicates that the tail of the
distribution is more stretched on the side below the mean.
Symmetric: Mean = Median = Mode
Positively skewed: Mode < Median < Mean
Negatively skewed: Mean < Median < Mode
Pearson's coefficient: Skewness = 3(Mean − Median) / Standard deviation, which lies in the interval [−3, 3]
Bowley's (quartile) coefficient: Skewness = (Q3 + Q1 − 2 Median) / (Q3 − Q1)
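As an illustration, both coefficients can be computed for the scores used in the median example. The quartile coefficient is written in its standard form (Q3 + Q1 − 2 Median)/(Q3 − Q1). The quartile values Q1 = 45 and Q3 = 87 are an assumption: they come from the position rule i = (p/100)n with fractional positions rounded up.

```python
from statistics import mean, median, stdev

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Pearson's coefficient: 3 * (mean - median) / standard deviation.
pearson = 3 * (mean(scores) - median(scores)) / stdev(scores)

# Quartile (Bowley) coefficient with assumed quartiles Q1 = 45, Q3 = 87.
q1, q3 = 45, 87
bowley = (q3 + q1 - 2 * median(scores)) / (q3 - q1)

print(round(pearson, 3), round(bowley, 3))
```

Both values are positive, which matches Mean (59) > Median (56) for these scores.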
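16. Kurtosis: a measure of the flatness or peakedness of a
distribution, measured by
$\beta_2 = \frac{\mu_4}{\mu_2^2}$
where μ2 and μ4 are the second and fourth central moments.
The normal distribution has a kurtosis of 3 (MESOKURTIC).
Kurtosis above 3 indicates a relatively peaked
distribution (LEPTOKURTIC).
Kurtosis below 3 indicates a relatively flat
distribution (PLATYKURTIC).
Another measure of kurtosis is the excess kurtosis
$\gamma_2 = \beta_2 - 3$
which is positive for leptokurtic, zero for mesokurtic and
negative for platykurtic distributions.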
[Figure: Distribution Shape, comparing platykurtic, mesokurtic
and leptokurtic curves]
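A minimal sketch of a kurtosis computation, assuming the usual moment-based definitions β2 = μ4/μ2² and γ2 = β2 − 3 (with population central moments). The helper names and the five-point data set are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment, dividing by the number of observations."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def kurtosis(data):
    """Return (beta2, gamma2) = (mu4 / mu2**2, beta2 - 3)."""
    beta2 = central_moment(data, 4) / central_moment(data, 2) ** 2
    return beta2, beta2 - 3

# Hypothetical evenly spread data: flatter than a normal curve.
beta2, gamma2 = kurtosis([1, 2, 3, 4, 5])
print(beta2, gamma2)
```

For this flat data set β2 is below 3, so it is classified as platykurtic.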
17. Probability Distribution
(How probabilities are distributed over the values of Random Variable)
Probability refers to the chance or likelihood of a particular
event taking place. It is represented by a real number in the
range from 0 to 1: an impossible event has a probability of 0 and
a certain event has a probability of 1.
For equally likely outcomes it is defined as
(number of favorable outcomes) / (total number of possible outcomes)
18. Discrete Distribution
(Binomial ,Poisson…)
Binomial
When does a variable follow a Binomial Distribution?
• Failure and success in an exam
• Probabilities of particular combinations of heads and tails
in coin tosses
• Defective and non-defective items from a manufacturing
process
Probability Mass Function of Binomial:
$p(x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, \dots, n$
where
p(x) = probability of x successes in n trials
n = number of trials
p = probability of success in one trial
q = 1 − p = probability of failure in one trial
Mean = np, Variance = npq, Standard Deviation = √(npq)
if p = 0.5, skewness is 0
if p < 0.5, positively skewed
if p > 0.5, negatively skewed
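The mean and variance formulas can be checked numerically. This sketch assumes the standard binomial mass function p(x) = C(n, x) p^x q^(n−x); the values n = 10 and p = 0.3 are hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # hypothetical number of trials and success probability
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
dist_mean = sum(x * px for x, px in enumerate(probs))
dist_var = sum((x - dist_mean) ** 2 * px for x, px in enumerate(probs))
print(sum(probs), dist_mean, dist_var)
```

The results agree with np = 3 and npq = 2.1 up to floating-point error.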
19. Poisson
A binomial distribution tends to a Poisson distribution if the
number of trials n → ∞ and the probability of success p → 0,
while np = λ remains finite.
Probability Mass Function of Poisson:
$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \dots$
Mean = Variance = λ
Positively skewed, coefficient of skewness = 1/√λ
Leptokurtic, as γ2 = 1/λ, which is > 0
Real-life examples:
• Number of printing mistakes per page in a large text
• Number of telephone calls per unit interval of time
• Number of deaths from a rare disease
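The limiting relationship with the binomial can be illustrated numerically. The sketch assumes the standard Poisson mass function e^(−λ) λ^x / x!; λ = 2 and n = 10,000 are hypothetical values.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """Poisson probability of observing x events with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 2.0        # hypothetical mean number of events
n = 10_000       # large number of trials
p = lam / n      # small success probability, so that n * p = lam

# Binomial probability of exactly 3 successes versus the Poisson value.
binom_prob = comb(n, 3) * p ** 3 * (1 - p) ** (n - 3)
poisson_prob = poisson_pmf(3, lam)
print(binom_prob, poisson_prob)
```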
20. The normal distribution is the most important and most widely
used distribution in statistics. It is sometimes called the
"bell curve."
Probability Density Function of Normal distribution:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Key features of normal distributions are listed below.
Normal distributions are symmetric around their mean.
The mean, median, and mode of a normal distribution are equal.
The area under the normal curve is equal to 1.0.
Normal distributions are defined by two parameters, the mean
(μ) and the standard deviation (σ).
68% of the area of a normal distribution is within one
standard deviation of the mean.
Approximately 95% of the area of a normal distribution is
within two standard deviations of the mean.
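The 68% and 95% figures can be checked with `statistics.NormalDist` from the standard library, shown here for the standard normal distribution (μ = 0, σ = 1).

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1

# Area within one and within two standard deviations of the mean.
within_one = z.cdf(1) - z.cdf(-1)
within_two = z.cdf(2) - z.cdf(-2)
print(round(within_one, 4), round(within_two, 4))
```

The same proportions hold for any μ and σ, because the areas depend only on the number of standard deviations from the mean.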
Continuous Distribution
(Normal…..)
21. Sampling Distribution
What is a sampling distribution?
A sampling distribution is created by repeated sampling.
It is defined as the frequency distribution of a statistic
computed over many samples.
If the statistic is the mean, it is called the sampling
distribution of the mean.
22. Sampling distribution of Mean
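Unbiased: the mean of the sampling distribution of the mean
equals the population mean, $\mu_{\bar{x}} = \mu$.
Variance of the sampling distribution of means based on N
observations:
$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{N}$
Large samples produce sample estimates very close to the
parameter.
If independent random samples are drawn from each of two normal
populations, then the sampling distribution of the difference
between the two sample means is normally distributed.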
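A small simulation can illustrate these properties, assuming the standard result that the variance of the sample mean is σ²/N. The population parameters, sample size and number of repetitions below are hypothetical.

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so the sketch is reproducible

mu, sigma, N = 50.0, 10.0, 25   # hypothetical population and sample size
repetitions = 5000

# Draw many samples of size N and record each sample mean.
sample_means = [
    sum(random.gauss(mu, sigma) for _ in range(N)) / N
    for _ in range(repetitions)
]

# Centre and spread of the simulated sampling distribution.
simulated_center = mean(sample_means)
simulated_variance = pvariance(sample_means)
print(simulated_center, simulated_variance, sigma ** 2 / N)
```

The simulated centre should be close to μ and the simulated variance close to σ²/N = 4.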
23. Difference between non parametric and
parametric Statistics
Parametric statistics – inferential test that assumes certain
characteristics are true of an underlying population,
especially the shape of its distribution. Commonly used for
normally distributed interval or ratio dependent variables.
Non-parametric statistics – inferential test that makes few or no
assumptions about the population from which observations were
drawn (distribution-free tests).
There is generally at least one non-parametric equivalent test
for each type of parametric test. Non-parametric tests are
generally less powerful than parametric tests.
Non-parametric tests are generally used when assumptions about
the underlying population are questionable (e.g.,
non-normality). They are commonly used to analyse DVs that are
non-normal or are nominal or ordinal.
24. Statistical Inference
Inference
•Use a sample to learn something about a population.
Hypothesis Testing
•A hypothesis is an assertion about the population.
•A statistical method that uses sample data to evaluate a
hypothesis about a population parameter.
•We answer a question such as, “If the hypothesis were true,
would it be unlikely to get data such as we obtained?”
Why Test?
•Statistics is an experimental science, not really a branch of
mathematics.
•It is a tool that can tell you whether an observed similarity or
difference in data is real or merely due to chance.
•It does not give you certainty.
25. Statistical Inference
Five Parts of a Test
• Assumptions about type of data (quantitative, categorical),
population distribution (e.g., normal, binomial)
•Hypotheses:
Null hypothesis(H0): A statement indicating “no effect”
Alternative hypothesis(Ha): A statement indicating “an effect”
•Test Statistic:
A function of data to measure discrepancy between the null and
alternative hypotheses.
•P-value (p):
A measure of evidence about the null hypothesis H0. The smaller
the P-value, the stronger the evidence against H0.
•Conclusion:
Select a significance level (such as 0.05 or 0.01) and reject H0
if P-value ≤ significance level. Otherwise, we fail to reject H0,
i.e. H0 is not necessarily true, but it is plausible.
26. Hypothesis Testing: P-value approach
The P-value approach involves determining "likely" or "unlikely"
by determining the probability, assuming the null hypothesis
were true, of observing a more extreme test statistic in the
direction of the alternative hypothesis than the one observed.
If the P-value is small, say less than (or equal to) α, then it
is "unlikely." And, if the P-value is large, say more than α,
then it is "likely."
Four steps involved in the P-value approach:
1> Specify the null and alternative hypotheses.
2> Calculate the value of the test statistic using the sample
data and assuming the null hypothesis is true.
3> Using the known distribution of the test statistic, calculate
the P-value, i.e. "If the null hypothesis is true, what is the
probability that we'd observe a more extreme test statistic in
the direction of the alternative hypothesis than we did?"
4> Set the significance level, α, the probability of making a
Type I error, to be small (0.01, 0.05, or 0.10). Compare the
P-value to α: if the P-value is less than or equal to α, reject
the null hypothesis; otherwise, do not reject it.
27. An Example :-
Suppose we want to know if a new drug had an influence on IQ.
Null hypothesis: the average IQ of the people who use the
drug is 100.
Alternative hypothesis: the average IQ of the people who use
the drug is not 100.
We need the following data in order to perform a z test:
a) population mean b) hypothesized mean c) sample size
d) sample mean e) population standard deviation
Then we calculate the one-sample z test statistic:
$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
where x̄ is the sample mean, μ0 the hypothesized mean, σ the
population standard deviation and n the sample size.
Suppose our z statistic is 2.17.
28. Example:
The z statistic assumes a normal probability distribution,
so we would find the P-value like this:
The area in red (both tails beyond z = ±2.17) is 0.015 + 0.015
= 0.030, i.e. 3 percent. If we had chosen a significance level
of 5 percent (0.025 in each tail), this would mean that we had
achieved statistical significance. We would reject the null
hypothesis in favour of the alternative hypothesis.
We conclude that we have evidence that the drug caused the
average IQ to deviate from 100 IQ points.
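The tail area for z = 2.17 can be reproduced with the standard library. The sketch assumes the usual one-sample z statistic (x̄ − μ0)/(σ/√n); the function names and the inputs used in the small `z_statistic` check are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """One-sample z statistic for a known population sigma."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

def two_sided_p(z):
    """Area in both tails beyond |z| under the standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical inputs, only to show how the statistic is formed.
z_check = z_statistic(sample_mean=105, mu0=100, sigma=15, n=9)

# P-value for the z statistic quoted in the example.
p_value = two_sided_p(2.17)
print(z_check, round(p_value, 3))
```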
29. Single Sample problems: Student's t test
If data are available on a single variable, then the t test is
an appropriate test, provided the following assumptions are
satisfied:
•The data is quantitative in nature
•Normality is satisfied
•Observations are independent
30. Examples:
•Cereal is sold in the market in packs of 300 grams. A
sample of 10 packs revealed the following weights:
296, 298, 300.1, 299, 302, 290, 289, 297, 296, 299.5
•The objective is to know whether the average weight is
actually maintained at 300 grams.
•Null Hypothesis: Mean weight is 300 grams
•Alternative Hypothesis: Mean weight is lower than 300 grams
•The data are quantitative and hence Student's t test is
appropriate.
•P value: 0.016
•Therefore, at the 5% significance level, the data support the
conclusion that the mean weight is below 300 grams, i.e. not
maintained at 300 grams.
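The test statistic behind this example can be sketched with the standard library. Python's standard library has no t distribution, so the sketch compares the statistic with a critical value instead of computing the P-value; the value −1.833 is the one-sided 5% point for 9 degrees of freedom taken from standard t tables, which is an assumption external to the slides.

```python
from math import sqrt
from statistics import mean, stdev

weights = [296, 298, 300.1, 299, 302, 290, 289, 297, 296, 299.5]
mu0 = 300  # hypothesized mean weight in grams

n = len(weights)
t_stat = (mean(weights) - mu0) / (stdev(weights) / sqrt(n))
df = n - 1

# Assumption: one-sided 5% critical value for 9 df from t tables.
critical = -1.833
reject_h0 = t_stat < critical
print(round(t_stat, 3), df, reject_h0)
```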
31. Two Sample Problems: An example
•Consider the data on Cadmium levels of persons categorized as
smokers and non-smokers:
Smokers: 10, 8.4, 3.5, 8.9, 9.0, 8.8, 7.9
Non-Smokers: 3.1, 3.5, 4.5, 4.3, 2.2, 2.7
•A Query: Is the mean Cadmium level higher among the smokers?
•Fisher's t-test is appropriate
•Null Hypothesis: Mean Cadmium levels are equal in the two groups
•Alternative Hypothesis: Mean Cadmium level is higher among the
smokers
•p-value = 0.0003045
•Conclusion: p-value < .05 implies the rejection of the null
hypothesis.
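A sketch of a two-sample t statistic for these data follows. It assumes the pooled-variance (equal variances) form of the test; the slides do not say which variant produced the quoted p-value, and the P-value itself is not computed here because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

smokers = [10, 8.4, 3.5, 8.9, 9.0, 8.8, 7.9]
non_smokers = [3.1, 3.5, 4.5, 4.3, 2.2, 2.7]

n1, n2 = len(smokers), len(non_smokers)

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * variance(smokers)
              + (n2 - 1) * variance(non_smokers)) / (n1 + n2 - 2)

t_stat = (mean(smokers) - mean(non_smokers)) / sqrt(
    pooled_var * (1 / n1 + 1 / n2)
)
df = n1 + n2 - 2
print(round(t_stat, 2), df)
```

A large positive statistic points in the direction of the alternative hypothesis (higher mean among smokers).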
32. Example
Suppose the scores of 6 students given special coaching are as
follows:
Before: 8, 3, 4, 5, 6, 7 After: 9, 6, 5, 5, 6, 8
Query: Is the special coaching beneficial?
Observe: The experimental units (i.e. students) are the same;
only the situations are different.
The assumption of independence is violated.
The paired t test is appropriate instead of Fisher's t test.
• Null Hypothesis: Mean scores are the same in the two situations
• Alternative Hypothesis: Mean score increases after the
special coaching.
Result: t = -2.2361, df = 5, p-value = 0.03779
What will the conclusion be here?
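The paired statistic is a one-sample t statistic applied to the differences within each pair. A minimal sketch, taking differences as Before minus After so that the sign matches the reported result:

```python
from math import sqrt
from statistics import mean, stdev

before = [8, 3, 4, 5, 6, 7]
after = [9, 6, 5, 5, 6, 8]

# One difference per student (before minus after).
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
df = n - 1
print(diffs, round(t_stat, 4), df)
```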
33. Inference in Categorical data:
Sometimes both the independent and dependent variables are
categorical.
(e.g. Treatment (Drug/placebo) versus survival (alive/dead)
or Smoke (Y/N) versus lung cancer (Y/N))
In these situations, we usually count the number (or
proportion) of patients or subjects that fall into each
possible category.
For categorical data, the statistical tests we commonly use are:
•Test for proportions
•Chi-square test
34. An Example
Suppose that 39 out of 80 people contacted in a survey of city
residents oppose a new tax.
Test whether the data are consistent with the hypothesis that
people accept new taxes.
•Number of successes x = 39
•Number of data points n = 80
•Null hypothesis: the theoretical proportion of people accepting
the new tax is 50%, against the alternative that it is higher
•Sample proportion: 39/80 = 0.4875
Conclusion: The sample proportion is close to 50% and the P-value
is higher than .05, so we fail to reject the null hypothesis.
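A normal-approximation sketch of a one-sample test for a proportion, using x = 39, n = 80 and a hypothesized proportion of 0.5 against the alternative "higher". The normal approximation without continuity correction is an assumption; other software may apply a correction and give a slightly different P-value.

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 39, 80, 0.5

p_hat = x / n  # sample proportion

# z statistic under the null hypothesis (standard error uses p0).
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# One-sided P-value for the alternative "proportion is higher than p0".
p_value_greater = 1 - NormalDist().cdf(z)
print(p_hat, round(z, 4), round(p_value_greater, 3))
```

Because the sample proportion is below 0.5, the one-sided P-value is above 0.5 and the null hypothesis is not rejected.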
35. Another Example:
Consider the following data on the smoking and drinking habits
of 500 individuals:
                   Heavy Smoker   Moderate Smoker   Non-Smoker
Heavy Drinker           20              62               6
Moderate Drinker        40               8             159
Non-Drinker             10              20             175
Query: Are smoking and drinking habits associated?
Null Hypothesis: Smoking and drinking are independent.
Alternative: They are associated (not independent).
Test: Chi-square test
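The chi-square statistic for this 3 × 3 table can be sketched as follows. Expected counts are (row total × column total) / grand total. The critical value 9.488 is the 5% point of the chi-square distribution with 4 degrees of freedom from standard tables, which is an assumption external to the slides.

```python
# Rows: heavy, moderate, non-drinker; columns: heavy, moderate, non-smoker.
observed = [
    [20, 62, 6],
    [40, 8, 159],
    [10, 20, 175],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical = 9.488  # assumption: tabulated 5% point for 4 df
print(round(chi_square, 2), df, chi_square > critical)
```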
36. ANOVA
• An extension of Fisher’s t test
• Number of groups is more than two
One Way Data
(Variation in one direction only(one way))
Categories Dose 1 Dose 2 Dose 3
Sources of variation in the data are
A. Variation within each dose group
B. Variation between dose groups
Main Question: Do the responses depend on which dose group the
subject is in?
The higher the variation between the dose groups relative to the
variation within the groups, the stronger the evidence that the
dose group influences the response.
[Figure: within-group and between-group variation]
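The comparison of between-group and within-group variation is summarised by the F statistic. The sketch below assumes the standard one-way ANOVA decomposition; the function name `one_way_f` and the three dose groups are hypothetical, since the slides give no data for this design.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way layout."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    # B. Variation between groups: group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # A. Variation within groups: values around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical responses for Dose 1, Dose 2 and Dose 3.
f_value, df_b, df_w = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_value, df_b, df_w)
```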
38. Two Way ANOVA
Two-way ANOVA is a type of study design with one numerical
outcome variable and two categorical explanatory variables.
Example:
Consider yield figures of three paddy varieties when three
Pesticides are used.
•Pesticides & Variety are both categorical
•The data is two way.
                Variety
Pesticide      A     B     C
1             33    99    67
2             56    87    65
3             78    89    77
Two hypotheses of interest:
• Whether there is any effect of the pesticides .
• Whether there is any effect of the paddy varieties.
Analysis of Variance Table
            Df    Sum Sq   Mean Sq   F value     Pr(>F)
variety      2   2225.17   1112.58    44.063   0.000259
pesticide    3   1191.00    397.00    15.723   0.003008
Residuals    6    151.50     25.25
Conclusion: Both P-values (.000259 and .003008) are below .05,
hence we reject the null hypotheses, i.e. both variety and
pesticide have a significant effect on yield.
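The arithmetic of the table can be retraced from its Df and Sum Sq columns: each mean square is the sum of squares divided by its degrees of freedom, and each F value is a mean square divided by the residual mean square. The sketch below only rechecks the printed numbers; it does not refit the model.

```python
# (degrees of freedom, sum of squares) as printed in the ANOVA table.
table = {
    "variety": (2, 2225.17),
    "pesticide": (3, 1191.00),
    "residuals": (6, 151.50),
}

# Mean square = sum of squares / degrees of freedom.
mean_sq = {name: ss / df for name, (df, ss) in table.items()}

# F value = effect mean square / residual mean square.
f_variety = mean_sq["variety"] / mean_sq["residuals"]
f_pesticide = mean_sq["pesticide"] / mean_sq["residuals"]
print(mean_sq, round(f_variety, 3), round(f_pesticide, 3))
```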
39. Summary of single Sample Test:
Which variable are we looking for?
• Qualitative: Frequency Chi-Square test / Test of Proportion
• Quantitative: Student's t test
40. Summary of Two sample Test:
Which variables are we looking for?
• Qualitative × Qualitative: Chi-Square test
• Qualitative × Quantitative: ANOVA
• Quantitative × Quantitative: Paired t test / Fisher's t test
41. ANCOVA
•A covariate Independent Variable is added to an ANOVA
(can be dichotomous or metric)
•Effect of the covariate on the Dependent Variable is removed .
•Of interest are:
Main effects of IVs and interaction terms
Contribution of CV (akin to Step 1 in HMLR)
•e.g., GPA is used as a CV, when analysing whether there is a
difference in Educational Satisfaction between Males and
Females.
•Reduces variance associated with covariate (CV) from the DV
error (unexplained variance) term
•Increases power of F-test
•May not be able to achieve experimental control over a
variable (e.g., randomisation), but can measure it and
statistically control for its effect.
42. Assumption of ANCOVA
•Normality
•Homogeneity of Variance
•Independence of observations
•Independence of IV and CV
•Multicollinearity - if more than one CV, they should not
be highly correlated - eliminate highly correlated CVs
•Reliability of CVs - not measured with error - only use
reliable CVs