A multiple regression analysis was conducted to determine what factors influence human body mass. The analysis found that body mass is best predicted by a model containing sex, body height, and hours spent weekly on physical exercise. Lower AIC values indicated this three-predictor model had the best fit compared to models with additional or fewer predictors.
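The AIC comparison described above can be sketched as follows. This is a minimal illustration, not the original analysis: the residual sums of squares, sample size, and the fourth candidate model ("+ age") are hypothetical numbers chosen only to show how the lowest-AIC model is picked.

```python
import math

def aic(rss, n, k):
    """AIC for a least-squares fit (up to an additive constant):
    n * ln(RSS/n) + 2k, where k counts estimated parameters."""
    return n * math.log(rss / n) + 2 * k

# Hypothetical fits of body mass on increasing predictor sets:
# (residual sum of squares, parameter count incl. intercept), n = 100
candidates = {
    "height only":                   (5200.0, 2),
    "height + sex":                  (3900.0, 3),
    "height + sex + exercise":       (3300.0, 4),
    "height + sex + exercise + age": (3290.0, 5),
}
scores = {name: aic(rss, 100, k) for name, (rss, k) in candidates.items()}
best = min(scores, key=scores.get)
```

Note how the four-predictor model barely reduces the RSS, so its extra parameter penalty (+2) leaves it with a higher AIC than the three-predictor model.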
1. The document discusses analyses of multiple location trials to evaluate genotype performance across environments. It describes factors to consider in determining the number and location of environments for trials and statistical analyses for multiple experiments.
2. Key analyses covered include the Bartlett test for homogeneity of error variances, analysis of variance models for sites and years, and joint regression analysis to evaluate genotype by environment interactions.
3. Joint regression analysis fits linear regressions between genotype performances in each environment and the mean performance across environments to identify which interactions are linear versus non-linear.
This document summarizes key concepts in building multiple regression models, including:
1) Analyzing nonlinear variables, qualitative variables, and building and evaluating regression models.
2) Transforming variables to improve model fit, including using indicator variables for qualitative data.
3) Common model building techniques like stepwise regression, forward selection, and backward elimination.
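One of the model-building techniques listed above, forward selection, can be sketched generically: start with no predictors and greedily add whichever predictor most improves the criterion, stopping when no addition helps. The `fit_aic` callback and the toy AIC table below are hypothetical stand-ins for an actual model-fitting routine.

```python
def forward_select(predictors, fit_aic):
    """Greedy forward selection: start empty, repeatedly add the
    predictor that lowers AIC the most; stop when no addition helps.
    `fit_aic` maps a frozenset of predictor names to that model's AIC."""
    selected = frozenset()
    current = fit_aic(selected)
    while True:
        trials = [(fit_aic(selected | {p}), p)
                  for p in predictors if p not in selected]
        if not trials:
            break
        best_aic, best_p = min(trials)
        if best_aic >= current:
            break
        selected, current = selected | {best_p}, best_aic
    return selected, current

# Toy AIC surface (hypothetical numbers, not from the document):
table = {
    frozenset(): 400.0,
    frozenset({"x1"}): 380.0, frozenset({"x2"}): 390.0,
    frozenset({"x1", "x2"}): 385.0,
}
sel, score = forward_select(["x1", "x2"], lambda s: table[s])
```

Backward elimination is the mirror image (start full, drop the worst predictor each round), and stepwise regression alternates the two moves.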
This document provides an overview of continuous probability distributions covered in Lecture 5, including:
- Continuous random variables can take on uncountably infinite values within an interval, unlike discrete variables. Probability density functions (PDFs) are used instead of probabilities.
- The uniform, normal, and exponential distributions are introduced as examples of continuous distributions. Key properties like expected value and variance are discussed.
- The standard normal distribution is especially important, and its probabilities are provided in tables. Examples show how to calculate probabilities for normal distributions using the tables.
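The table lookups described in the bullets above can be reproduced with the standard normal CDF, which is expressible through the error function. A small sketch (the interval 90 to 110 with mean 100 and standard deviation 10 is an illustrative example, not from the lecture):

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z) = (1 + erf(z / sqrt(2))) / 2 --
    the same values a standard normal table lists."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2), by standardizing to Z."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

p = normal_prob(90, 110, mu=100, sigma=10)  # within one sigma of the mean
```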
There is a significant difference in body mass between differently colored Norwegian rats (F(2,12) = 7.59, p = 0.0074). Post-hoc analysis shows white rats are heavier than black or brown rats. The ANOVA model explains 56% of the variation in mass. Assumptions of normality and homogeneity of variances are met based on examination of residual plots.
The document discusses the steps for conducting a response surface methodology (RSM) experiment using central composite design (CCD). It involves determining independent and dependent variables, selecting an appropriate CCD, conducting the experiment runs according to the design, analyzing the data using statistical methods to develop a mathematical model and check its adequacy, and using the model to optimize responses. Key aspects of RSM and CCD covered include developing the design, analyzing results through ANOVA and regression, and checking model validity.
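The CCD step mentioned above (selecting the design) amounts to laying out coded design points: a two-level factorial core, axial points, and replicated center runs. A sketch under common conventions (the rotatable alpha and four center points are typical defaults, not taken from the document):

```python
from itertools import product

def ccd_points(k, alpha=None, n_center=4):
    """Coded design points for a central composite design in k factors:
    a 2^k factorial core (+/-1), 2k axial points at +/-alpha, and
    replicated center points. alpha defaults to the rotatable choice
    (2^k)^(1/4)."""
    if alpha is None:
        alpha = (2 ** k) ** 0.25
    factorial = [list(p) for p in product((-1.0, 1.0), repeat=k)]
    axial = []
    for i in range(k):
        for a in (-alpha, alpha):
            pt = [0.0] * k
            pt[i] = a
            axial.append(pt)
    center = [[0.0] * k for _ in range(n_center)]
    return factorial + axial + center

design = ccd_points(2)  # 4 factorial + 4 axial + 4 center = 12 runs
```

The coded points are then mapped to the actual ranges of the independent variables before the runs are conducted.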
In preparation for the Geodetic Engineering Licensure Examination, BSGE students must memorize the fastest possible solution for the LEAST SQUARES ADJUSTMENT using a Casio fx-991ES PLUS calculator technique in order to save time during the examination. Note: for Lec 2 and above I did not include solutions so that my techniques will not be copied. Just add me on FB so I can teach you the solutions, since they are not on Google, YouTube, or calculator-technique books, and are not taught in review centers either.
The document discusses factorial analysis of variance (ANOVA) and provides an example to illustrate the steps. It analyzes the flavor acceptability of luncheon meat from different sources. The null hypothesis is that there is no significant difference between the sources. The two-way ANOVA calculations show that the computed F-values are greater than the critical values, so the null hypothesis is rejected, indicating there are significant differences between the sources of luncheon meat.
The document discusses factorial analysis of variance (ANOVA) and provides an example to illustrate the steps in a two-way ANOVA. Specifically, it presents a study on the flavor acceptability of luncheon meat from different sources. It provides the problem statement, hypotheses, assumptions, and 10 step-by-step computations to conduct a two-way ANOVA on the data. The results of the ANOVA show that the flavor acceptability significantly differs between the meat sources, leading to a rejection of the null hypothesis.
The document describes using SPSS to analyze soil science data through descriptive statistics, normality tests, and one-sample and independent-samples t-tests. Specifically, it provides examples of:
1) Using SPSS to calculate descriptive statistics like mean, median, mode, range and standard deviation for soil bulk density and exchangeable calcium data.
2) Performing a normality test and one-sample t-test to determine if the mean of exchangeable calcium data transformed using log(x+1) differs significantly from normal.
3) Demonstrating the use of one-sample t-test to analyze soil moisture levels assessed by nine soil scientists and number of earthworms sampled from different farms.
This document discusses properties of pseudo-random numbers and methods for generating random numbers computationally. It covers:
- Properties of pseudo-random numbers including being continuous between 0 and 1 and uniformly distributed.
- Common methods for generating pseudo-random numbers including table lookup, linear congruential generators (LCG), and feedback shift registers.
- Desirable properties for random number generators including being fast, requiring little memory, having a long cycle or period, and producing numbers that are close to uniform and independent.
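One of the methods listed above, the linear congruential generator, is short enough to sketch in full. The multiplier and increment below are the well-known Numerical Recipes constants, chosen here only as a concrete illustration:

```python
def lcg(seed, n, a=1664525, c=1013904223, m=2 ** 32):
    """Linear congruential generator: x_{k+1} = (a*x_k + c) mod m.
    Returns n pseudo-random floats in [0, 1) by dividing each state
    by the modulus, matching the 'continuous between 0 and 1' property."""
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)
    return out

u = lcg(seed=42, n=1000)
```

The same seed always reproduces the same stream, which is exactly why these are called pseudo-random; the period here is at most m, illustrating the "long cycle" criterion.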
Speaker: Eduardo Vallejos, Associate Professor, Molecular Biology & Physiology.
The talk will cover an overall perspective on both genetics and modeling, advanced methods for combining genetic and phenotypic data with crop models, and a perspective on promising future approaches.
Metaheuristic Tuning of Type-II Fuzzy Inference System for Data Mining (Varun Ojha)
The document proposes using metaheuristic optimization techniques to tune the parameters of an interval type-2 fuzzy inference system (IT2FIS) for data mining applications. Specifically, it aims to 1) create diverse rules in the IT2FIS, 2) reduce the number of fuzzy rules, 3) determine appropriate shapes for type-2 fuzzy sets, and 4) analyze the performance of proposed IT2FIS optimization methods. The proposed framework uses genetic algorithms to tune the IT2FIS knowledge base and swarm intelligence methods to tune rule parameters. Experimental results on four datasets show that differential evolution generally provides the best performance, though no single algorithm works best on all datasets.
ANOVA (analysis of variance) is a statistical technique used to compare differences between group means. It involves calculating the F ratio, which is the ratio of variance between groups to variance within groups. If the calculated F value is greater than the critical F value from statistical tables, then the difference between group means is considered statistically significant. The document provides steps for conducting a one-way ANOVA, including calculating sums of squares, mean squares, and the F ratio to determine if differences between three varieties of wheat are statistically significant based on per acre production data.
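The one-way ANOVA steps described above (sums of squares, mean squares, F ratio) can be computed by hand. The three groups of per-acre yields below are hypothetical numbers standing in for the wheat-variety data:

```python
def one_way_anova(groups):
    """One-way ANOVA by hand: partition total SS into between-group
    and within-group parts, form mean squares, and take their ratio F."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    ms_between = ss_between / (k - 1)   # between-groups mean square
    ms_within = ss_within / (n - k)     # within-groups (error) mean square
    return ms_between / ms_within, (k - 1, n - k)

# Hypothetical per-acre yields for three varieties:
F, df = one_way_anova([[6, 7, 3, 8], [5, 5, 3, 7], [5, 4, 3, 4]])
```

The computed F is then compared to the critical F value for the returned degrees of freedom at the chosen significance level.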
This document summarizes a lab on model selection and multi-model inference. It discusses fitting four linear models to predict species richness using variables from the Swiss data. Model selection is performed using AIC, with the top model having an elevation and forest term. Model-averaged predictions are also calculated by weighting predictions from each model by their AIC weights.
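The AIC-weighting step mentioned above uses Akaike weights, which follow directly from the AIC differences. A sketch with hypothetical AIC values and per-model predictions (not the Swiss-data results):

```python
import math

def aic_weights(aics):
    """Akaike weights: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
    where delta_i = AIC_i - min(AIC). Weights sum to 1 and serve as the
    averaging coefficients in multi-model inference."""
    lo = min(aics)
    rel = [math.exp(-(a - lo) / 2) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical AICs for four candidate models and each model's
# prediction of species richness at one new site:
aics = [210.3, 212.1, 215.7, 221.0]
preds = [34.2, 31.8, 36.0, 30.1]
w = aic_weights(aics)
avg_pred = sum(wi * p for wi, p in zip(w, preds))
```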
Genotype x environment interactions occur when genotypes respond differently to varying environmental conditions. Researchers conduct multi-location trials over multiple years to investigate these interactions and identify genotypes that perform well across different environments or are specifically adapted to certain environments. Analyzing the data from such trials involves testing for homogeneity of error variances between locations and years before performing analyses of variance to partition variance components and determine which genotypes interact least or perform best on average over environments.
The document discusses applying machine learning techniques to identify compiler optimizations that impact program performance. It used classification trees to analyze a dataset containing runtime measurements for 19 programs compiled with different combinations of 45 LLVM optimizations. The trees identified optimizations like SROA and inlining that generally improved performance across programs. Analysis of individual programs found some variations, but also common optimizations like SROA and simplifying the control flow graph. Precision, accuracy, and AUC metrics were used to evaluate the trees' ability to classify optimizations for best runtime.
This document discusses measures of dispersion such as standard deviation and variance. It provides formulas and examples of calculating standard deviation, variance, and coefficient of variation from data sets. It also describes steps for conducting a chi-square test on frequency data, including determining the appropriate test, establishing significance level, formulating hypotheses, calculating test statistics, determining degrees of freedom, and comparing the computed statistic to critical values. An example contingency table and chi-square calculation are also provided.
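The chi-square contingency-table calculation described above can be sketched directly: expected counts come from the row and column totals, and the statistic sums the scaled squared deviations. The 2x2 table below is hypothetical:

```python
def chi_square(table):
    """Pearson chi-square for an r x c contingency table:
    sum over cells of (observed - expected)^2 / expected, with
    expected = row_total * col_total / grand_total.
    Degrees of freedom: (r - 1) * (c - 1)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    grand = sum(rows)
    stat = sum((obs - rows[i] * cols[j] / grand) ** 2
               / (rows[i] * cols[j] / grand)
               for i, r in enumerate(table)
               for j, obs in enumerate(r))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical 2x2 table of observed frequencies:
stat, df = chi_square([[30, 10], [20, 40]])
```

As in the document's steps, the computed statistic is then compared to the critical chi-square value for df degrees of freedom.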
The document discusses various methods for modeling input distributions in simulation models, including trace-driven simulation, empirical distributions, and fitting theoretical distributions to real data. It provides examples of several continuous and discrete probability distributions commonly used in simulation, including the exponential, normal, gamma, Weibull, binomial, and Poisson distributions. Key parameters and properties of each distribution are defined. Methods for selecting an appropriate input distribution based on summary statistics of real data are also presented.
This document discusses various measures of central tendency and dispersion used to summarize collected data. It defines mean, median and mode as measures of central tendency, and how to calculate each one. It also covers variance, standard deviation and coefficient of variation as measures of dispersion. The document provides examples of calculating and interpreting these statistical concepts, and explains when to use each measure to best summarize a data set.
This document discusses probability distributions for random variables. It introduces discrete distributions like the binomial and Poisson distributions which are used for counting experiments. It also introduces continuous distributions like the normal distribution which are defined over continuous ranges of values. Key concepts covered include probability density functions, cumulative distribution functions, and how to relate random variables with specific parameters to standard distributions. Examples are provided to illustrate concepts like modeling the number of plant stems in a sampling area with a Poisson distribution.
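The Poisson example mentioned above reduces to the pmf P(X = k) = e^(-lam) * lam^k / k!. A sketch, where the mean rate of 2.5 stems per quadrat is an assumed illustrative value:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): e^(-lam) * lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical example: plant stems per quadrat, mean rate 2.5
p_empty = poisson_pmf(0, 2.5)                           # no stems at all
p_at_most_2 = sum(poisson_pmf(k, 2.5) for k in range(3))  # CDF at k = 2
```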
Simulation - Generating Continuous Random Variables (Martin Kretzer)
The document discusses various methods for generating continuous random variables in simulations, including the inverse transform method and acceptance-rejection method. It provides examples of how to generate random variables from important distributions like the exponential, normal, Poisson, and nonhomogeneous Poisson distributions. The agenda includes an introduction, overview of methods, generating specific distributions, summary, and exercises in R to apply the methods.
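The inverse transform method named above is easiest to see for the exponential distribution: inverting the CDF F(x) = 1 - e^(-lam*x) gives F^{-1}(u) = -ln(1 - u) / lam. A sketch (the document's exercises are in R; this is a Python transcription of the same idea):

```python
import math
import random

def exponential_inverse_transform(lam, n, seed=0):
    """Inverse transform method: if U ~ Uniform(0, 1), then
    X = -ln(1 - U) / lam is Exponential(lam), because plugging X into
    the exponential CDF recovers the uniform U."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

x = exponential_inverse_transform(lam=2.0, n=10000)
sample_mean = sum(x) / len(x)   # should sit near 1 / lam = 0.5
```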
The document describes a nested experimental design where sites are nested within study and control areas. There are multiple sites in each area with replications. A nested ANOVA can test for differences between areas while accounting for variability among sites within areas. The example shows calculations for a nested ANOVA with 3 areas, 4 sites each, and 3 replications. It finds a significant difference among sites within areas but not between the overall study and control areas.
FFT is an efficient algorithm to compute the discrete Fourier transform (DFT) and convert a time-domain signal to its frequency-domain representation. Radix-2 FFT is the most common algorithm, in which the input is divided into groups of 2 samples at each stage. FFT algorithms generally require a number of samples that is a power of 2, i.e. 2^N, to compute the DFT efficiently. The radix-2 FFT breaks the computation into "butterflies", using decimation-in-time (DIT) or decimation-in-frequency (DIF) structures to recursively compute the DFT. Twiddle factors, representing complex roots of unity, are used to compute the outputs of each butterfly operation.
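The radix-2 decimation-in-time structure described above fits in a few lines. This is a textbook sketch for illustration, not an optimized implementation; the length-8 rectangular pulse input is an assumed example:

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT. Input length must be a
    power of 2. Split into even/odd samples, transform each half, then
    combine with twiddle factors e^(-2*pi*i*k/N) in butterfly pairs."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle * odd half
        out[k] = even[k] + tw             # butterfly: top output
        out[k + n // 2] = even[k] - tw    # butterfly: bottom output
    return out

X = fft([1, 1, 1, 1, 0, 0, 0, 0])  # DFT of a length-8 rectangular pulse
```

The recursion halves the problem at each stage, giving the familiar N log N cost instead of the N^2 cost of the direct DFT sum.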
This document summarizes the symptoms and characteristics of various insect pests that affect plants. It describes pests such as the shoot and fruit borer, which causes withering of terminal shoots and bores holes in shoots and fruits. It also mentions stem borers that cause drooping and withering of young plant tops and stunting of older plants. Descriptions of other pests include the hadda beetle, leaf roller, lace wing bug, bud worm, hairy caterpillar, brinjal mealy bug, tobacco caterpillar, whitefly, spiraling whitefly, serpentine leaf miner, and striped mealybug. For each, it outlines the visual symptoms of damage caused and, in some cases, additional details.
The document summarizes various pests that affect millet crops, including shoot flies, stem borers, pink borers, white borers, white grubs, root aphids, caterpillars, ear head caterpillars, web worms, gall midges, ear head bugs, ear head beetles, weevils, leaf beetles, flea beetles, leaf rollers, slug caterpillars, chafer beetles, grasshoppers, aphids, and whiteflies. It describes the symptoms, appearance of larvae and adults, and in some cases the specific crops affected for each pest.
Genes located on the same chromosome tend to be inherited together due to their physical linkage on the chromosome. However, linkage can be broken during meiosis by recombination between homologous chromosomes through crossover. Recombination produces new combinations of parental alleles; under complete linkage, by contrast, only the parental gamete types are produced. Morgan's experiments with Drosophila provided evidence of this by demonstrating non-Mendelian inheritance ratios when genes were linked versus unlinked.
The document discusses various methods used for physical and transcript mapping including somatic cell hybrids, radiation hybrid maps, fluorescence in situ hybridization, flow sorting chromosomes, pulsed field electrophoresis, clone contig mapping, chromosome walking, inverse PCR, bubble PCR, and PCR-based screening. It also discusses genetic mapping techniques such as identifying genetic markers and recombinants, calculating genetic vs physical distances, multipoint mapping, and autozygosity mapping.
Mendel observed patterns of inheritance in pea plants through experimentation with traits such as flower color, seed shape, and pod color. His work provided evidence that heritable traits are specified by discrete units (later identified as genes) that are transmitted from parents to offspring in predictable patterns. Through experiments involving one trait (monohybrid crosses) and two traits (dihybrid crosses), Mendel deduced that genes assort and transmit independently during gamete formation and fertilization. Later work showed that traits are influenced not only by genes but also environmental factors and that variations exist in patterns of gene expression and dominance.
There are five main types of endogenous DNA damage including oxidation of bases from reactive oxygen species, alkylation of bases such as methylation, and hydrolysis of bases through deamination, depurination, and depyrimidination. DNA can also be damaged through alkylation and oxidation by exogenous sources like radiation exposure. Various mutagens including base analogs, intercalating agents, and UV light-induced thymine dimers further contribute to DNA damage that must be repaired.
The genetic code is a triplet code where each group of three nucleotides (codon) in mRNA specifies a single amino acid in the resulting polypeptide. The genetic code is almost universal across organisms, with 61 codons specifying 20 standard amino acids and 3 codons acting as termination signals. It exhibits several key properties including being comma-free, non-overlapping, degenerate, and containing start and stop signals. Wobble can occur in the third position of the codon-anticodon pairing to allow some redundancy.
Viruses contain genetic material surrounded by a protein capsid. They rely on host cells for replication and typically infect specific cell types of one host species. A viral genome can be DNA or RNA, single or double stranded, and circular or linear in size. Viruses enter host cells, hijack their machinery to produce viral components, and assemble new viral particles which are then released to infect more cells through either a lytic or lysogenic pathway.
Human DNA must be highly compressed to fit inside the nucleus. It achieves this by wrapping around proteins called nucleosomes, which act as spools. Each nucleosome contains an octamer of histone proteins around which 147 base pairs of DNA are wrapped. Multiple nucleosomes then coil further to form a 30nm fiber. This fiber is attached to a scaffold of RNA and proteins, forming loops that allow for further compaction of the DNA into chromosomes. The positioning of centromeres divides chromosomes into metacentric, submetacentric and acrocentric types in humans.
Polytene chromosomes are large chromosomes found in secretory cells like salivary glands that contain thousands of identical DNA strands aligned in parallel. This gives them a banded appearance with dark bands and clear interbands when viewed under a microscope. The bands represent regions of condensed and transcriptionally active DNA. B chromosomes are nonessential supernumerary chromosomes that are found in some populations but not others and can provide adaptive advantages in some species and environments.
The document discusses euploidy, or changes in the number of sets of chromosomes, in animals and plants. It notes that while most animal species are diploid, some natural variations exist, including polyploidy in certain tissues and endopolyploidy. Polyploidy is more common and tolerated in plants, with 30-35% of ferns and flowering plants being polyploid. Examples are given of polyploid crops like wheat, cotton, and strawberries that are important agricultural products.
This document discusses two topics: microRNAs and alternative splicing.
For microRNAs: Computational methods are used to predict microRNA genes by looking for evolutionarily conserved sequences that can form stem-loop hairpin structures. MicroRNAs regulate gene expression by binding to mRNA.
For alternative splicing: Splicing of pre-mRNA can result in different mRNA and protein isoforms through various combinations of exons. Bioinformatics methods aim to identify alternative splice variants by comparing cDNA and genomic sequences and analyzing microarray data. Splice graphs can model alternative splicing pathways.
This document discusses heat shock proteins and their relationship to cancer. Heat shock proteins help cells cope with stress and prevent protein damage. Studies show they play a role in cancer progression and drug resistance in breast cancer cells. Current research proposes investigating if plant extracts like Echinacea increase heat shock proteins in breast cancer cells. Understanding this relationship could help develop more effective cancer treatments. Further research on Echinacea's effects may provide insights into a possible cure for breast cancer.
DNA consists of a double helix structure made up of nucleotides. Each nucleotide contains a phosphate, pentose sugar, and one of four nitrogenous bases: adenine, thymine, guanine, or cytosine. The bases bond specifically with each other - adenine pairs with thymine and cytosine pairs with guanine. The sequence of these base pairs encodes genetic information.
Self-incompatibility is a plant's inability to set seed when self-pollinated due to morphological, genetic, physiological or biochemical causes controlled by the multi-allelic S locus. It is classified based on flower morphology, genes involved, site of expression, and pollen cytology. Two main types are distyly found in primula, controlled by two S alleles, and tristyly found in lythrum, controlled by S and M genes determining three style positions. Self-incompatibility prevents self-fertilization by arresting pollen tube growth when the pollen and pistil share the same S allele.
This document discusses different modes of reproduction in living organisms, including sexual reproduction, asexual reproduction, and their sub-types. Sexual reproduction involves the fusion of male and female gametes to produce offspring, while asexual reproduction does not. Some key sub-types are vegetative reproduction, apomixis (reproduction without fertilization), and gametophytic apomixis which allows embryo development without meiosis or fertilization. The document also notes the significance of different reproductive modes for plant breeding and genome evolution.
Genetics and Heredity defines key genetic terminology such as genes, alleles, chromosomes, loci, and inheritance. It discusses how Gregor Mendel conducted early experiments breeding pea plants over generations to develop the laws of inheritance and establish the foundations of genetics. Mendel demonstrated that traits are passed from parents to offspring through discrete factors, now known as genes, located on chromosomes. His work showed that some traits are dominant over others and that segregation and independent assortment of alleles allows for prediction of phenotypic ratios in offspring.
This document provides an overview of a course on basic plant breeding techniques. The course objectives are to understand how breeders meet breeding goals, learn classical and modern breeding methods, and see examples of genetics' importance in modern breeding. Key learning outcomes are to understand plant breeding developments, basics of genetics, and breeding concepts. The document then discusses the history and milestones of plant breeding, achievements in various crops, activities in plant breeding like creation of variation and selection, and breeding objectives like increasing yield and improving quality. It also covers concepts of centers of origin and diversity first proposed by Vavilov.
This document summarizes different patterns of inheritance including autosomal dominant, autosomal recessive, X-linked, and Y-linked traits. It describes the key characteristics of each type of inheritance such as affected family members, chance of passing the trait to offspring, and examples of related genetic disorders.
Sex determination in animals can occur through homogametic or heterogametic sex determination. In homogametic sex determination, the presence of a Y chromosome determines maleness in mammals and fish. In heterogametic sex determination, the presence of a second X chromosome determines femaleness in fruit flies, while the presence of two Z chromosomes determines maleness in birds, amphibians, and reptiles. For insects like grasshoppers and plants, the female is XX and the male is XO. In fruit flies, sex is determined by the ratio of X chromosomes to sets of autosomes, and the inactive X chromosome is inactivated through a process where Xist expression coats the inactive X chromosome in RNA
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Exposé invité Journées Nationales du GDR GPL 2024
hematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...Sérgio Sacani
Context. With a mass exceeding several 104 M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10−8 photons cm−2
s
−1
. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With increasing population, people need to rely on packaged food stuffs. Packaging of food materials requires the preservation of food. There are various methods for the treatment of food to preserve them and irradiation treatment of food is one of them. It is the most common and the most harmless method for the food preservation as it does not alter the necessary micronutrients of food materials. Although irradiated food doesn’t cause any harm to the human health but still the quality assessment of food is required to provide consumers with necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during the processing of the food. ESR spin trapping technique is useful for the detection of highly unstable radicals in the food. The antioxidant capability of liquid food and beverages in mainly performed by spin trapping technique.
BREEDING METHODS FOR DISEASE RESISTANCE.pptxRASHMI M G
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...Wasswaderrick3
In this book, we use conservation of energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/ velocity and then from this we derive the Pouiselle flow equation, the transition flow equation and the turbulent flow equation. In the situations where there are no viscous effects , the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes equation of terminal velocity and turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium. We also look at the general equation of terminal velocity.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
Nucleophilic Addition of carbonyl compounds.pptxSSR02
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
3. Interaction – test of additivity
H0: the effect of a factor is not affected by the other factor
– in plots: mean-connecting lines are parallel
[Two interaction plots of mowed*fertilized LS means: No. species vs. fertilized (0/1), one line per mowing level (mowed 0 / mowed 1), vertical bars denote 0.95 confidence intervals:
– left: no interaction, current effect F(1, 16) = 0.0000, p = 1.0000
– right: interaction, current effect F(1, 16) = 18.000, p = 0.00062]
no interaction: additive effect
– lines are parallel
– the effect of mowing is the same regardless of fertilization
interaction: non-additive effect
– lines are not parallel
– the effect of mowing is more pronounced in unfertilized plots
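The additivity test above can be sketched numerically. This is a minimal pure-Python sketch (not from the slides; the data, factor names, and 2×2 layout are made up for illustration) of a balanced two-way ANOVA in which the interaction F is MS_AB / MS_error:

```python
from itertools import product

# replicate measurements per cell: 2 levels of mowing x 2 of fertilization (made-up data)
data = {
    ("mowed", "fert"):     [7.0, 8.0, 9.0],
    ("mowed", "unfert"):   [4.0, 5.0, 6.0],
    ("unmowed", "fert"):   [10.0, 11.0, 12.0],
    ("unmowed", "unfert"): [5.0, 6.0, 7.0],
}
n = 3  # replicates per cell (balanced design)

all_vals = [v for vs in data.values() for v in vs]
grand = sum(all_vals) / len(all_vals)

a_levels = ["mowed", "unmowed"]
b_levels = ["fert", "unfert"]
cell_mean = {k: sum(v) / len(v) for k, v in data.items()}

def level_mean(factor_index, level):
    # mean over all observations at one level of one factor
    vals = [v for key, vs in data.items() if key[factor_index] == level for v in vs]
    return sum(vals) / len(vals)

# sums of squares: main effects, interaction, error, total
ss_a = sum(n * len(b_levels) * (level_mean(0, a) - grand) ** 2 for a in a_levels)
ss_b = sum(n * len(a_levels) * (level_mean(1, b) - grand) ** 2 for b in b_levels)
ss_ab = sum(n * (cell_mean[(a, b)] - level_mean(0, a) - level_mean(1, b) + grand) ** 2
            for a, b in product(a_levels, b_levels))
ss_err = sum((v - cell_mean[k]) ** 2 for k, vs in data.items() for v in vs)
ss_tot = sum((v - grand) ** 2 for v in all_vals)

# interaction F test: MS_AB / MS_error, here with (1, 8) degrees of freedom
f_ab = (ss_ab / 1) / (ss_err / 8)
```

With these toy numbers the interaction has 1 df and the error 8 df; a large F relative to the F(1, 8) distribution would indicate non-parallel lines.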
4. Chocolate rats heavier, music no effect

Number of observations – should be balanced in all groups:

             music
diet         Rock  Folk
Paper          15    15
Chocolate      15    15

Mean mass of rats ~ diet + music

Unbalanced numbers of observations (the factors become confounded):

             music                       music
diet         Rock  Folk     diet        Rock  Folk
Paper          20    40     Paper         30    10
Chocolate      10    20     Chocolate     10    30

Chocolate rats heavier, Folk rats heavier
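A one-line check can make the balance requirement concrete; a hypothetical helper (the cell counts are taken from the tables above):

```python
# a design is balanced when every diet x music cell has the same count
def is_balanced(cell_counts):
    return len(set(cell_counts.values())) == 1

balanced = {("Paper", "Rock"): 15, ("Paper", "Folk"): 15,
            ("Chocolate", "Rock"): 15, ("Chocolate", "Folk"): 15}
unbalanced = {("Paper", "Rock"): 20, ("Paper", "Folk"): 40,
              ("Chocolate", "Rock"): 10, ("Chocolate", "Folk"): 20}
```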
5. Fixed vs. random effects of predictors
– depends on how general your research question is
– fixed effect: you are interested in differences between particular factor levels only
  (fertilized / unfertilized, breed1 / breed2 / breed3)
– random effect: you want to generalize the results to all other possible levels
  (site1 / site2 / site3 / site4, breed1 / breed2 / breed3, sampling-unit identity)
Example: fertilization trial on site 1 – site 4
– fixed: fertilization; random: sites
– you are not interested in the effect of site
– you can generalize to all (comparable) sites
  (with sites as fixed effects, your results would be valid only for your sites)
[Diagram: each site shown as a checkerboard of fertilized (F) and unfertilized (U) plots]
6. Fixed vs. random effects of predictors – several computational approaches
(same example: fertilization fixed, sites random):
– simple: use different MS terms in the F test (see the table below)
– complicated: advanced model fitting in R
Effect tested        A fixed, B fixed    A random, B random    A fixed, B random
Factor A             MSA/MSe             MSA/MSAB              MSA/MSAB
Factor B             MSB/MSe             MSB/MSAB              MSB/MSe
A x B interaction    MSAB/MSe            MSAB/MSe              MSAB/MSe
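The table above can be encoded as a lookup. A hedged sketch (the model labels and the MS values are made up; only the numerator/denominator choices come from the table):

```python
# (model, effect) -> which mean square goes in the F-test denominator
DENOMINATOR = {
    ("both fixed", "A"): "MSe",
    ("both fixed", "B"): "MSe",
    ("both fixed", "AB"): "MSe",
    ("both random", "A"): "MSAB",
    ("both random", "B"): "MSAB",
    ("both random", "AB"): "MSe",
    ("A fixed, B random", "A"): "MSAB",
    ("A fixed, B random", "B"): "MSe",
    ("A fixed, B random", "AB"): "MSe",
}

def f_value(ms, model, effect):
    # numerator is always the effect's own MS; denominator comes from the table
    return ms["MS" + effect] / ms[DENOMINATOR[(model, effect)]]

# hypothetical mean squares from a two-way ANOVA
ms = {"MSA": 120.0, "MSB": 80.0, "MSAB": 40.0, "MSe": 10.0}
```

Note how the same MSA gives a different F depending on whether B is fixed (denominator MSe) or random (denominator MSAB).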
7. Hierarchical (nested) design
– not all combinations of factor levels are available
– the factor with more levels is "nested in" the factor with fewer levels
Example: 2 bedrocks – granite (sites 1–3) and limestone (sites 4–6)
Factors:
– fertilization
– bedrock
– sites (nested in bedrock) – it is impossible to have more bedrocks on one site
F_bedrock = MS_bedrock / MS_site
[Diagram: each site shown as a checkerboard of fertilized (F) and unfertilized (U) plots]
8. R – fairly complicated to fit proper models with nested factors, random effects
and interactions
– packages: lme4 (simpler) or nlme (advanced)
https://www.jaredknowles.com/journal/2013/11/25/getting-started-with-mixed-effect-models-in-r
Same example: 2 bedrocks – granite (sites 1–3) and limestone (sites 4–6)
Factors:
– fertilization (fixed)
– bedrock (fixed)
– sites (random, nested in bedrock) – it is impossible to have more bedrocks on one site
– fertilization:bedrock (the interaction of the fixed factors is interesting, the others usually not)
[Diagram: each site shown as a checkerboard of fertilized (F) and unfertilized (U) plots]
11. Multiple regression – main effects of two predictors:
> summary(lm(seedlings ~ productivity + temperature, data=seedl))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.286421   7.220427   2.948  0.00652 **
productivity -0.017128   0.007928  -2.160  0.03977 *
temperature   0.245520   0.681770   0.360  0.72156
...
Multiple R-squared: 0.192, Adjusted R-squared: 0.132
F-statistic: 3.207 on 2 and 27 DF, p-value: 0.0563
(ANOVA test of the whole model; explained variation:
(43+114)/(43+114+662)=0.192)

ANOVA tests of each predictor – mind the order of predictors
(simple vs. partial effects: each term is tested in addition to the previous predictors):

> anova(lm(seedlings~temperature+productivity,data=seedl))
Analysis of Variance Table
Response: seedlings
             Df Sum Sq Mean Sq F value  Pr(>F)
temperature   1  42.80  42.800  1.7454 0.19755
productivity  1 114.46 114.464  4.6678 0.03977 *
Residuals    27 662.10  24.522

> anova(lm(seedlings~productivity+temperature,data=seedl))
Analysis of Variance Table
Response: seedlings
             Df Sum Sq Mean Sq F value  Pr(>F)
productivity  1 154.08 154.084  6.2834 0.01851 *
temperature   1   3.18   3.180  0.1297 0.72156
Residuals    27 662.10  24.522
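The "explained variation" arithmetic above can be reproduced from the printed sequential sums of squares (values copied from the first anova() table; rounding gives the 0.192 reported by summary()):

```python
# sequential sums of squares from anova(lm(seedlings ~ temperature + productivity))
ss_temperature = 42.80
ss_productivity = 114.46
ss_residual = 662.10

ss_total = ss_temperature + ss_productivity + ss_residual     # 819.36
r_squared = (ss_temperature + ss_productivity) / ss_total     # ~0.192
```

Note that although the individual sums of squares depend on the order of the predictors, their total and the residual do not, so R-squared comes out the same for both orderings.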
12. General linear models
– ANOVA and Regression are equivalent
– same idea of testing variability explained by a model
– fitting model by least squares
13. Variance: mean of squared differences from the mean
– get the differences
– square them
[Figure: one observation, its group mean, and the overall mean; the deviations partition as:]
– difference from the overall mean, squared = TOTAL square
– difference from the group mean, squared = ERROR square
– difference of the group mean from the overall mean, squared = GROUP square
14. Sums of squares in regression
SS_TOT = Σ (Yi − Ȳ)²    total square: individual values of Y around the mean of Y
SS_REG = Σ (Ŷi − Ȳ)²    regression square: fitted values around the mean of Y
SS_e   = Σ (Yi − Ŷi)²   error square: individual values around the fitted values
– SS_e is the square that is minimized
– Ŷi are the individual fitted values of Y, calculated as Ŷ = a + bX
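These three sums of squares are easy to verify numerically. A minimal sketch (data made up) that fits a + bX by least squares and checks that SS_TOT = SS_REG + SS_e:

```python
# made-up data for a simple linear regression
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)

# least-squares slope and intercept
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
fitted = [a + b * x for x in xs]

ss_tot = sum((y - ybar) ** 2 for y in ys)                      # total square
ss_reg = sum((f - ybar) ** 2 for f in fitted)                  # regression square
ss_e = sum((y - f) ** 2 for y, f in zip(ys, fitted))           # error square (minimized)
```

For ordinary least squares with an intercept, the partition SS_TOT = SS_REG + SS_e holds exactly.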
15. General linear models
– ANOVA and Regression are equivalent
– same idea of testing variability explained by a model
– fitting model by least squares
–> both types of predictors (numeric, factor)
can be combined
– you can use any wild combination of predictor types,
interactions, nestedness, random effects...
– one more semester:
(P. Šmilauer: Modern Regression Methods, KBE/785E)
– simplest case – analysis of covariance
– 1 numeric predictor
– 1 categorical predictor
– no interaction
– model – parallel lines
17. Analyzing many predictors:
If you have too many predictors
(e.g. measurements of everything in a field observation),
do not include everything in your model!
–> fit a Minimal adequate model
– backward selection
– include everything in the first model,
remove all non-significant terms
– forward selection
– start with the null model
– add individual terms
– one by one (due to collinearity)
– based on p-value or AIC
– analyze the final model
18. AIC – Akaike information criterion
AIC = n log ( SSE / n ) + 2 k + C
k – number of model parameters
(i.e. model df)
SSE – residual sum of squares (RSS)
n – number of observations
C – constant (can be ignored)
Quantifies the information accounted for by a predictor
– lower AIC suggests a better fit,
absolute values of AIC are not informative
– allows comparisons between models with a different
number of df – penalization of complicated models
Can be combined with an F-test of significance
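The AIC values printed by add1() on the following slides can be reproduced with this formula (the same form R's extractAIC() uses for linear models, up to a constant); the RSS values below are copied from that output, with n = 21 people:

```python
import math

# AIC for a linear model, up to a constant: n*log(SSE/n) + 2k
def aic(n, sse, k):
    return n * math.log(sse / n) + 2 * k

aic_null = aic(21, 5318.7, 1)   # mass ~ 1    -> ~118.224 in the add1() table
aic_sex = aic(21, 1907.7, 2)    # mass ~ sex  -> ~98.692 in the add1() table
```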
19. Minimal adequate models – Forward selection
Question: What does the mass of the human body depend on?
Sampling design: 21 randomly chosen inhabitants of České Budějovice.
Recorded variables: body mass, sex, body height, hair colour, vegetarian (1/0),
hours spent weekly on physical exercise:

mass sex height colour vegetarian hours
95 M 185 blonde 1 3
96 M 165 blonde 0 2
91 M 178 blonde 1 1
82 M 186 blonde 0 2
87 M 196 black 1 4
75 M 178 black 1 6
81 M 186 black 0 2
84 M 187 black 0 6
95 M 196 red 1 1
100 M 201 red 0 8
69 M 169 red 0 12
52 F 156 blonde 0 1
58 F 168 blonde 0 8
62 F 178 blonde 0 5
61 F 168 blonde 1 6
45 F 155 black 0 4
55 F 164 black 1 3
71 F 181 black 0 1
83 F 185 red 1 2
62 F 175 red 0 4
64 F 171 red 1 2
20. Minimal adequate models – Forward selection
Question: What does the mass of the human body depend on?
Sampling design: 21 randomly chosen inhabitants of České Budějovice.
For each person, the following were recorded:
– body mass,
– sex,
– body height,
– hair colour,
– whether he/she is vegetarian,
– number of hours spent weekly on physical exercise.
Start with null model:
> lm.0<-lm(mass~+1, data=BM)
> add1(lm.0, .~.+sex*height*colour*vegetarian*hours, test="F")
Single term additions
Model:
mass ~ +1
Df Sum of Sq RSS AIC F value Pr(F)
<none> 5318.7 118.224
sex 1 3410.9 1907.7 98.692 33.9710 1.295e-05 ***
height 1 3194.1 2124.6 100.953 28.5649 3.704e-05 ***
colour 2 191.1 5127.6 121.455 0.3354 0.7194
vegetarian 1 224.8 5093.9 119.317 0.8384 0.3713
hours 1 98.6 5220.0 119.830 0.3591 0.5561
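The F value reported for adding sex can be reproduced by hand: the drop in RSS (over 1 df) divided by the residual mean square of the larger model (21 observations, 2 parameters, so 19 residual df). A small sketch using the RSS values from the table above:

```python
# RSS from the add1() table: null model vs. mass ~ sex
rss_null, rss_sex = 5318.7, 1907.7

# F = (drop in RSS per df added) / (residual MS of the larger model)
f_sex = (rss_null - rss_sex) / 1 / (rss_sex / 19)   # ~33.97, as printed
```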
21. next step
> lm.1<-update(lm.0, .~.+sex)
> add1(lm.1, .~.+sex*height*colour*vegetarian*hours, test="F")
Single term additions
Model:
mass ~ sex
Df Sum of Sq RSS AIC F value Pr(F)
<none> 1907.7 98.692
height 1 791.27 1116.5 89.441 12.7570 0.00218 **
colour 2 297.60 1610.1 99.131 1.5711 0.23655
vegetarian 1 139.13 1768.6 99.102 1.4160 0.24952
hours 1 289.43 1618.3 97.237 3.2193 0.08959 .
next step
> lm.2<-update(lm.1, .~.+height)
> add1(lm.2, .~.+sex*height*colour*vegetarian*hours, test="F")
Single term additions
Model:
mass ~ sex + height
Df Sum of Sq RSS AIC F value Pr(F)
<none> 1116.47 89.441
colour 2 245.787 870.68 88.220 2.2583 0.13681
vegetarian 1 45.693 1070.77 90.564 0.7254 0.40620
hours 1 192.420 924.05 87.469 3.5400 0.07714 .
sex:height 1 192.466 924.00 87.468 3.5410 0.07710 .
stop here (based on p) or include interaction (based on AIC)
22. Analysis of final model
> anova(lm.0, lm.1, lm.2, test="F")
Analysis of Variance Table
Model 1: mass ~ +1
Model 2: mass ~ sex
Model 3: mass ~ sex + height
Res.Df RSS Df Sum of Sq F Pr(>F)
1 20 5318.7
2 19 1907.7 1 3410.9 54.992 7.094e-07 ***
3 18 1116.5 1 791.3 12.757 0.00218 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(lm.2)
Call:
lm(formula = mass ~ sex + height, data = BM)
Residuals:
Min 1Q Median 3Q Max
-8.559 -5.865 -2.027 3.041 20.865
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.8184 28.9782 -1.443 0.166170
sexM 16.9264 4.1986 4.031 0.000783 ***
height 0.6062 0.1697 3.572 0.002180 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.876 on 18 degrees of freedom
Multiple R-squared: 0.7901, Adjusted R-squared: 0.7668
F-statistic: 33.87 on 2 and 18 DF, p-value: 7.914e-07
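The fitted model can be turned into a prediction rule. A sketch using the coefficients printed by summary(lm.2) above (the helper name and the example heights are made up):

```python
# mass = -41.8184 + 16.9264 * (sex is male) + 0.6062 * height(cm),
# coefficients copied from summary(lm.2)
def predict_mass(sex, height):
    return -41.8184 + 16.9264 * (sex == "M") + 0.6062 * height

mass_m180 = predict_mass("M", 180)   # about 84.2 kg
mass_f165 = predict_mass("F", 165)   # about 58.2 kg
```

The two sexes share one slope (0.6062 kg/cm) and differ only in the intercept, i.e. the parallel-lines model of the analysis of covariance.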
23. Conclusion: Body mass depends significantly on sex and body height;
these predictors have additive effects.
Men are on average heavier than women, and mass increases with height.
But be careful here, because sex and height are related to each other!
24. How to plot the figure

plot(mass~height, data=BM, type="n", ylab="Body mass", xlab="Body height")
### plots an empty plot, i.e. the axes (with appropriate ranges, so the data fit in –
### THIS IS IMPORTANT) and labels; this is specified by type="n"
points(mass~height, data=BM[BM$sex=="M",], pch=16)
### adds full points for males to the empty plot
points(mass~height, data=BM[BM$sex=="F",], pch=1)
### empty points for females
males.pred<-data.frame(sex="M", height=150:205)
### generates a range of the height predictor values for which the fitted
### values for males should be generated (later used by the predict function)
females.pred<-data.frame(sex="F", height=150:205)
### same for females
lines(150:205, predict(lm.2, newdata=males.pred))
### adds a solid line: the regression fit for males (y values predicted from the model)
lines(150:205, predict(lm.2, newdata=females.pred), lty=2)
### adds a dashed line: the regression fit for females
legend(x="bottomright", legend=c("Males", "Females"), pch=c(16,1), lty=c(1,2),
inset=0.05, bty="n")
### adds a legend to the plot
25. Overall conclusion
Statistics: numbers and formulas
– summary statistics – how big and how variable are the data
– hypothesis testing
– p – are the relationships larger than random?
– choose the test based on data type and data arrangement
Logic of discovery
– observation vs. experiment
– statistical vs. causal relationship
– avoid all possible bias
– random selection, proper control treatment
– enough replicates
26. Choosing a test: type of dependent variable × type of predictor

Type of dependent variable:
– Continuous (e.g. 0.3, 4, 7, 5.2 etc.)
– Ordinal (e.g. 1=little, 2=medium, 3=a lot)
– Categories: frequencies or percentages (e.g. germinated: 18, not germinated: 32)

Predictor: categories –> comparison of means
– continuous dependent: 2 groups: t-test (paired or not);
  >2 groups: one-way ANOVA; 2 or more predictors: two/more-way ANOVA
– ordinal dependent: 2 groups (not paired): Mann-Whitney test;
  2 groups (paired): Wilcoxon test; >2 groups: Kruskal-Wallis test
– categorical dependent: 1 grouping variable: goodness of fit (18 : 32);
  >1 grouping variable: contingency table (e.g. A/B × C/D: 18, 32 / 26, 24)

Predictor: continuous –> linear relationship
– continuous dependent: 2 variables, one cause and one effect: simple regression;
  2 variables, no cause / effect: Pearson correlation;
  >2 variables, more causes and one effect: multiple regression
– ordinal dependent: 2 variables: Spearman correlation

Predictors of both types, continuous dependent: General linear models

Summary statistics
– How big? Mean, median...
– How variable? Variance, quartile range, standard deviation, coef. of variation...
– How accurate is the estimate? Standard error, confidence interval
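The decision table above can be written as a small lookup; a hedged sketch (the key names and phrasing are made up, the test choices come from the table):

```python
# (predictor type, dependent type) -> tests from the decision table
TEST_TABLE = {
    ("categorical", "continuous"): "t-test / one-way ANOVA / two-way ANOVA",
    ("categorical", "ordinal"): "Mann-Whitney / Wilcoxon / Kruskal-Wallis",
    ("categorical", "categorical"): "goodness of fit / contingency table",
    ("continuous", "continuous"): "regression / Pearson correlation",
    ("continuous", "ordinal"): "Spearman correlation",
    ("mixed", "continuous"): "general linear models",
}

def choose_test(predictor, dependent):
    # look up the family of tests for a given combination of variable types
    return TEST_TABLE[(predictor, dependent)]
```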