A detailed demonstration of multiple-sample tests: Analysis of Variance (ANOVA), the one-way and two-way kinds of ANOVA, and chi-square, with their assumptions and applications using Excel, and much more.
Let me know if anything is needed. Happy to help. ping @ #bobrupakroy
2. ANOVA
Analysis of Variance (ANOVA) is a
statistical method used to compare the
means of two or more samples.
In real life we will often encounter more
than 2 samples.
ANOVA can also be referred to simply as a
Multiple Sample Test.
Rupak Roy
3. 2 kinds of ANOVA
One way
Two way
One way or two way refers to the number of
independent variables (factors) in the data.
One-way ANOVA has one independent variable
with two or more levels; two-way ANOVA has
two independent variables, each of which can
have multiple levels.
4. One-way ANOVA
Here are 10 days of scores
from athletes trained by different
trainers. We need to determine
whether the trainers have any
effect on the scores, i.e. are the
differences observed in the
sample means statistically
significant?
ANOVA works with two kinds of variation:
1) Within-group variation (sum of squared differences
between each observation and the mean of the group it
belongs to), i.e. Sum of Squares Within (SSW).
2) Between-group variation (sum of squared differences
between each group mean and the overall mean, weighted
by group size), i.e. Sum of Squares Between (SSB).
5. ANOVA takes the ratio of the between-group variance
(SSB divided by its degrees of freedom) to the within-group
variance (SSW divided by its degrees of freedom). If the
ratio is close to 1, it concludes that the means are not
different. If the ratio is much larger than 1, it concludes
that the means are different.
An alternative way to understand ANOVA's SSB and SSW:
+ve or -ve: how far above or below the overall mean
each group average is (between-group variation
(SSB))
+ve or -ve: how far above or below its group average
each observation is (within-group variation (SSW))
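The SSB, SSW and ratio described above can be sketched in a few lines of Python. This is only an illustration: the trainer names and scores are hypothetical, not the data from the slide, and the sketch stops at the F ratio (Excel's Anova: Single Factor output additionally reports the P-value).

```python
from statistics import mean

# Hypothetical scores for athletes under three trainers (not the slide's data).
groups = {
    "trainer_a": [12, 15, 11, 14, 13],
    "trainer_b": [13, 14, 12, 15, 16],
    "trainer_c": [11, 13, 14, 12, 15],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = mean(all_scores)

# SSB: squared distance of each group mean from the overall mean,
# weighted by the number of observations in the group.
ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())

# SSW: squared distance of each observation from its own group mean.
ssw = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# F is the ratio of the between-group to the within-group mean square.
f_ratio = (ssb / df_between) / (ssw / df_within)
print(f"SSB={ssb:.3f}  SSW={ssw:.3f}  F={f_ratio:.3f}")
```

With these made-up scores the ratio comes out below 1, the situation in which ANOVA finds no evidence that the group means differ.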
6. In Excel
go to the Data tab and
select Data Analysis,
then choose
Anova: Single Factor.
Fill in the input range
and alpha, i.e. the
level of significance.
7. In the output
check the
Between Groups
P-value.
The Between Groups P-value is 0.74, which is > 0.05, so we fail
to reject the null hypothesis. Therefore there is no evidence that
the trainers have an impact on the scores made by the athletes.
8. Assumptions for ANOVA
The samples must be independent of
each other, i.e. the data should be
randomly sampled.
Normality: the populations must be
normally distributed, at least
approximately.
All the populations should have a
common variance. If not, the resulting
P-value will not be reliable.
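As a rough, informal check of the common-variance assumption, the sample variances of the groups can be compared before running ANOVA. The sketch below uses hypothetical scores and only prints the variances and their largest-to-smallest ratio; it is a quick look at the data, not a formal test of equal variances.

```python
from statistics import variance

# Hypothetical scores per trainer (not the slide's data).
groups = {
    "trainer_a": [12, 15, 11, 14, 13],
    "trainer_b": [10, 16, 12, 18, 14],
}

# Sample variance (n - 1 in the denominator) for each group.
variances = {name: variance(scores) for name, scores in groups.items()}

# A ratio far from 1 suggests the groups may not share a common variance.
ratio = max(variances.values()) / min(variances.values())
print(variances, ratio)
```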
9. Two-way ANOVA
We use two-way ANOVA to compare the
effect of two factors, each with multiple levels.
Put simply, more than one factor
influences the outcome.
Excel offers two types of two-way ANOVA:
1. with replication
2. without replication
10. Two-way ANOVA with replication
applies when we have 2 groups and the individuals
within each group are doing more than one
thing (like two groups of students from two
colleges taking two tests), so that each
combination of the factors has more than one
observation. If we have only one group taking
two tests, we use two-way ANOVA
without replication.
11. Example: two-way ANOVA
In Excel:
go to the Data tab and
select Data Analysis.
Then
from the list select
Anova: Two-Factor
With Replication
12. Provide the input range
and Rows per sample: 3,
as there are 3 rows per sample.
Alpha: 0.05
13. Two-way ANOVA gives 3 P-values because it tests 3 null
hypotheses.
1st null hypothesis: Sample (rows): Cold, Hot and Humid have the same mean.
2nd null hypothesis: Columns: place1, place2 and place3 have the same mean.
3rd null hypothesis: Interaction: the effect of the 1st factor does not
depend on the 2nd factor. We need the P-value of the Interaction to
conclude whether the combination of factors has any effect on the
outcome, and here we have 0.12, which is not statistically significant.
However, the P-value of one of the individual factors (Sample) is
significant: 3.06538E-07.
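The three tests come from splitting the total variation into a Sample part, a Columns part, an Interaction part and a within-cell part. The sketch below shows that split for a two-way layout with replication. The numbers are hypothetical (they are not the slide's data) and were chosen so that the two factors act additively, which makes the interaction term zero; the sketch stops at the F ratios and does not compute P-values.

```python
from statistics import mean

# Hypothetical data: data[climate][place] holds 3 replicate observations
# (the "rows per sample" in Excel). These are not the slide's numbers.
data = {
    "cold":  {"place1": [4, 5, 6],  "place2": [5, 6, 7],   "place3": [4, 6, 5]},
    "hot":   {"place1": [8, 9, 10], "place2": [9, 10, 11], "place3": [8, 10, 9]},
    "humid": {"place1": [6, 7, 8],  "place2": [7, 8, 9],   "place3": [6, 8, 7]},
}

rows = list(data)                     # levels of the first factor (Sample)
cols = list(data[rows[0]])            # levels of the second factor (Columns)
n_rep = len(data[rows[0]][cols[0]])   # replicates per cell

everything = [x for r in rows for c in cols for x in data[r][c]]
grand = mean(everything)

row_mean = {r: mean([x for c in cols for x in data[r][c]]) for r in rows}
col_mean = {c: mean([x for r in rows for x in data[r][c]]) for c in cols}
cell_mean = {(r, c): mean(data[r][c]) for r in rows for c in cols}

ss_rows = n_rep * len(cols) * sum((row_mean[r] - grand) ** 2 for r in rows)
ss_cols = n_rep * len(rows) * sum((col_mean[c] - grand) ** 2 for c in cols)
ss_within = sum((x - cell_mean[r, c]) ** 2
                for r in rows for c in cols for x in data[r][c])
ss_total = sum((x - grand) ** 2 for x in everything)
# Interaction is what remains after both main effects and within-cell noise.
ss_inter = ss_total - ss_rows - ss_cols - ss_within

ms_within = ss_within / (len(rows) * len(cols) * (n_rep - 1))
f_rows = (ss_rows / (len(rows) - 1)) / ms_within
f_cols = (ss_cols / (len(cols) - 1)) / ms_within
f_inter = (ss_inter / ((len(rows) - 1) * (len(cols) - 1))) / ms_within
print(f"F(sample)={f_rows:.2f}  F(columns)={f_cols:.2f}  F(interaction)={f_inter:.2f}")
```

Each F ratio is then compared against its own F distribution, which is where the three separate P-values in the Excel output come from.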
14. Hence we conclude that the combinations of hot,
cold or humid climate with place1, place2, place3
have no significant effect on the population size.
But with the statistically significant P-value for the 1st
null hypothesis we can also conclude that the
climate type (hot, cold or humid) has a significant
effect on the population rise. And for the 2nd
null hypothesis, places have no significant effect on the
population rise.
If an ANOVA results in rejecting the null hypothesis,
the rejection tells us only that at least
one sample mean is different.
To determine which group means are different, we
use a Post Hoc Test.
15. Types of Post Hoc Test
LSD Test
Tukey Test
Scheffe Test
Very often we want to know which means are
different, and that is what Post Hoc Tests tell us.
16. Chi-square
A statistical method for multiple sample
tests.
2 common applications of chi-square:
* Test of association
* Goodness of fit
Remember that it is used only for
count or categorical data.
17. Chi-square test of association or
independence
Let's understand it with the help of an
example: does age have any impact
on the type of car chosen, i.e. is there
any association between age and type
of car?
18. Here
Null Hypothesis (Ho): there is no association between
age and car type.
Alternative Hypothesis (Ha): there is an association
between age and car type.
In order to run the chi-square test we need the expected
values, and if we don't have them, we have to calculate
them manually.
Expected Value = (Row Total * Column Total) / N
where N is the total number of observations in the sample.
19. Expected Table
We can also use
cell references
in the calculation,
like =(D3*F3)/F7
Refer to the lab video for better understanding.
20. Expected values
Now, in Excel
P-value = CHISQ.TEST(actual-range, expected-range)
= 1.01888E-42
CHISQ.TEST returns the P-value of the chi-square test. It is far
below 0.05, so we reject the null hypothesis and conclude that there is
an association between age and type of car.
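The expected-value formula and the test itself can be sketched in Python as follows. The observed counts are hypothetical (two age groups by three car types), not the table from the slide. The P-value line relies on the fact that a 2 x 3 table has 2 degrees of freedom, for which the chi-square upper-tail probability is exactly exp(-x/2); other table sizes would need a general chi-square distribution function.

```python
import math

# Hypothetical observed counts: rows are age groups, columns are car types.
observed = [
    [30, 10, 20],
    [10, 30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell = (row total * column total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid only because df == 2 here: upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi_sq / 2)
print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.3e}")
```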
22. Chi-square Test for Goodness of Fit
It describes whether or not the data
follow a particular distribution.
Example:
Whether or not the sample
follows a binomial distribution.
23. Take an example where we toss 2 coins at a time, and
we get 0 heads in 5 tosses, 1 head in 6 tosses and 2 heads in 19
tosses, 30 tosses in total.
Now we will calculate the expected binomial probability
of heads in the expected table using BINOM.DIST:
Number_s: E4, i.e. 0 for 0 heads. Trials: 2, i.e. two coins.
Probability_s (success): 2 (trials) / 6 (total trials) = 1/3 for 0
heads, 1 head and 2 heads.
Cumulative: FALSE, i.e. point probability.
24. Expected value = binomial probability * column total of the actual counts
Therefore,
expected tosses with 0 heads = 0.44 * column total of the actual
number of tosses (30).
Then we repeat the same
steps for 1 and
2 heads in the
expected table.
Hence, CHISQ.TEST(actual, expected) = 1.0090E-18
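The goodness-of-fit steps can be retraced in Python using the counts and the Probability_s of 1/3 given in the example. The helper name `binom_pmf` is just an illustrative stand-in for BINOM.DIST with cumulative set to FALSE. The P-value line assumes 2 degrees of freedom (3 categories minus 1), for which the chi-square upper-tail probability is exactly exp(-x/2); under that assumption the result lands close to the value reported by CHISQ.TEST above.

```python
import math

observed = [5, 6, 19]    # tosses showing 0, 1 and 2 heads (from the example)
trials = 2               # two coins per toss
p_success = 1 / 3        # Probability_s used in the example
total = sum(observed)    # 30 tosses

def binom_pmf(k, n, p):
    """Point probability, like BINOM.DIST(k, n, p, FALSE)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Expected count = binomial probability * total number of tosses.
probs = [binom_pmf(k, trials, p_success) for k in range(trials + 1)]
expected = [prob * total for prob in probs]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumes df = 2, where the upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi_sq / 2)
print(f"chi-square = {chi_sq:.3f}, p-value = {p_value:.4e}")
```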
25. The chi-square test result, 1.0090E-18,
is a P-value far smaller than 0.05. It provides strong
evidence against the null hypothesis, so we conclude
that the data do not follow the assumed
binomial distribution.
26. Practice
The time taken to assemble a laptop in a
repair shop is normally distributed with a
mean of 10 hours and a standard deviation of 1
hour. What is the probability that the shop
assembles a laptop in a given period of time?
a) in more than 6 hours?
b) in less than 10 hours?
Answers: a) ≈ 0.99997 =1-NORM.DIST(6,10,1,TRUE)
b) 0.5 =NORM.DIST(10,10,1,TRUE)
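The same two answers can be cross-checked outside Excel. The sketch below uses `NormalDist` from Python's standard library, whose `cdf` method plays the role of NORM.DIST with cumulative set to TRUE; the variable names are illustrative.

```python
from statistics import NormalDist

# Assembly time: mean 10 hours, standard deviation 1 hour.
assembly_time = NormalDist(mu=10, sigma=1)

# a) P(time > 6 hours) = 1 - P(time <= 6)
p_more_than_6 = 1 - assembly_time.cdf(6)

# b) P(time < 10 hours): half of the distribution lies below the mean.
p_less_than_10 = assembly_time.cdf(10)

print(f"a) {p_more_than_6:.5f}   b) {p_less_than_10:.1f}")
```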
27. Poisson distribution
Trains pass through a busy crossing at an average rate
of 150 per hour.
What is the probability that no train passes in 10
minutes?
In Excel:
=POISSON.DIST(X, mean, cumulative)
where X = 0 (no train passes)
Average rate per minute = 150/60 = 2.5, rounded up here to 3
Therefore the average rate over 10 minutes = 3 x 10 = 30
Hence,
=POISSON.DIST(0,30,FALSE) = 9.35762E-14
28. Find the probability that exactly 5 trains pass in
a given 10-minute period.
=POISSON.DIST(X, mean, cumulative)
where X = 5,
average rate per minute = 150/60 = 2.5, rounded up here to 3
therefore the average rate over 10 minutes = 3 x 10 = 30
Hence
Probability = POISSON.DIST(5,30,FALSE)
= 1.89492E-08
(With cumulative = TRUE, the probability of at most 5 trains is 2.25735E-08.)
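The Poisson probabilities can be reproduced from the formula P(X = k) = e^(-mean) * mean^k / k!. The sketch below defines two small helpers, `poisson_pmf` and `poisson_cdf`, which are illustrative names standing in for POISSON.DIST with cumulative set to FALSE and TRUE. It uses the mean of 30 from the example, and for comparison also evaluates a mean of 25, which is what 2.5 per minute gives over 10 minutes if the rate is not rounded up.

```python
import math

def poisson_pmf(k, mean):
    """P(X = k), like POISSON.DIST(k, mean, FALSE)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def poisson_cdf(k, mean):
    """P(X <= k), like POISSON.DIST(k, mean, TRUE)."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))

mean_10_min = 30  # 3 per minute (rounded up from 2.5) over 10 minutes

p_none = poisson_pmf(0, mean_10_min)
p_exactly_5 = poisson_pmf(5, mean_10_min)
p_at_most_5 = poisson_cdf(5, mean_10_min)

# Comparison only: the unrounded rate of 2.5 per minute gives a mean of 25.
p_none_unrounded = poisson_pmf(0, 25)

print(f"{p_none:.5e}  {p_exactly_5:.5e}  {p_at_most_5:.5e}  {p_none_unrounded:.5e}")
```

The rounding step matters: moving the mean from 25 to 30 changes the probability of seeing no train by roughly two orders of magnitude.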
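29. The average registration rate for an event is 15%. A mail
campaign promoting an event was sent to 10,000
customers and you received 1,250 registrations. Can you be
sure the registrations from the mail campaign were really better
than expected, or is it just randomness?
Variable outcome: registration, Yes or No.
Therefore, it follows a binomial probability distribution.
Probability of success (based on previous data): 15%
Observed success rate: 1250/10000 = 12.5% = 0.125
Probability of seeing exactly this outcome when the true rate is 12.5%:
=BINOM.DIST(1250,10000,0.125,FALSE) = 0.01
To check the outcome against what was expected, use the baseline rate of 15%:
=BINOM.DIST(1250,10000,0.15,TRUE)
which is practically zero, because 1,250 lies about 7 standard deviations
below the 1,500 registrations expected at 15%. Hence the result is very
unlikely to be due to randomness, and the campaign registered fewer
customers than expected rather than more.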
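The campaign numbers can be examined in Python as a rough cross-check. The helper `binom_pmf` is an illustrative stand-in for BINOM.DIST with cumulative set to FALSE, computed in log space because the individual factors underflow for n = 10000. The comparison against the 15% baseline uses a normal approximation to the binomial, which is an assumption of this sketch rather than what Excel computes.

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Point probability like BINOM.DIST(k, n, p, FALSE), computed in
    log space so that a large n does not underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

n, registrations, baseline = 10_000, 1_250, 0.15

# Probability of exactly 1250 registrations if the true rate were 12.5%.
point_prob = binom_pmf(registrations, n, registrations / n)

# How far the outcome sits from what the 15% baseline predicts.
expected = n * baseline
std_dev = math.sqrt(n * baseline * (1 - baseline))
z = (registrations - expected) / std_dev

# Normal approximation to P(X <= 1250) under the 15% baseline.
lower_tail = NormalDist().cdf(z)

print(f"point probability = {point_prob:.4f}, z = {z:.2f}, lower tail = {lower_tail:.2e}")
```

The point probability is small simply because 10,000 trials spread the probability over many possible counts; the z value is what shows how the observed count compares with the 15% baseline.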