Yusuf YIGINI, PhD - FAO, Land and Water Division (CBL)
GSP - Eurasian Soil Partnership
Digital Soil Mapping and Modelling Training
Izmir, Turkiye
21-25 August 2017
Linear Models
Simple linear regression uses one independent variable
to predict the outcome of a dependent variable. This
is the most basic form of regression, and many more
complex modeling techniques can be learned by
understanding this basic concept.
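As a minimal sketch of what "predicting a dependent variable from one independent variable" means, the block below computes the least-squares intercept and slope by hand. The data are made up for illustration and are not part of the soil dataset used later.

```r
# Toy data (made up for illustration): y is roughly 2 + 0.5 * x.
x <- 1:10
y <- 2 + 0.5 * x + rep(c(0.1, -0.1), times = 5)

# Ordinary least-squares estimates for a single predictor.
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Fitted values and residuals (observed minus fitted).
fitted_y <- intercept + slope * x
resid_y  <- y - fitted_y

print(c(intercept = intercept, slope = slope))
```

With an intercept in the model, the residuals sum to zero (up to floating-point error), which is a quick sanity check on the arithmetic.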
In R, several classical statistical models can be
implemented using the function lm (linear model).
The lm function can be used for both simple and multiple
linear regression. Its help page is opened with:
> ?lm
Usage
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE,
qr = TRUE, singular.ok = TRUE, contrasts = NULL,
offset, ...)
The first argument of the lm function (formula) is
where you specify the structure of the statistical
model. Common structures are:
y ~ x        Simple linear regression of y on x
y ~ x + z    Multiple regression of y on x and z
https://drive.google.com/file/d/0B4-nNA2Ua3DoUmJMQ3JJTFpzTDg/view?usp=sharing
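The two formula structures can be tried on any data frame. The sketch below uses mtcars, a dataset that ships with R, purely as a stand-in; the column names mpg, wt and hp belong to mtcars and have nothing to do with the soil data.

```r
# mtcars is a built-in data frame, used here only as a stand-in.
simple_fit   <- lm(mpg ~ wt, data = mtcars)       # y ~ x
multiple_fit <- lm(mpg ~ wt + hp, data = mtcars)  # y ~ x + z

# One coefficient per term, plus the intercept.
names(coef(simple_fit))    # "(Intercept)" "wt"
names(coef(multiple_fit))  # "(Intercept)" "wt" "hp"
```

The left-hand side of ~ is always the response; every term added with + on the right-hand side contributes one more coefficient.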
To demonstrate simple linear regression in R, we will
again use the Macedonian Soil Dataset. Here we will
regress Soil Organic Carbon (column Value) on elevation
(column dem, derived from the DEM).
> DSM_table <- read.csv("DSM_table2.csv")
> head(DSM_table)
  ID ProfID       X       Y UpperDepth LowerDepth     Value Lambda
1  4  P0004 7485085 4653725          0         30 11.878804    0.1
2  7  P0007 7486492 4653203          0         30  3.490205    0.1
3  8  P0008 7485564 4656242          0         30  2.317673    0.1
4  9  P0009 7495075 4652933          0         30  1.936148    0.1
5 10  P0010 7494798 4651945          0         30  1.339719    0.1
6 11  P0011 7492500 4651760          0         30  2.285384    0.1
         tsme slp     prec  dem
1 0.160096433  13  998.034 2327
2 0.002569598  35 1014.300 1986
3 0.002601836   6  779.994 1243
4 0.002841078  25  839.183 1120
5 0.002677120  30  843.919 1098
Summary statistics for SOC, slope, precipitation and elevation:
> summary(cbind(SOC = DSM_table$Value, Slope = DSM_table$slp,
+               Precipitation = DSM_table$prec, DEM = DSM_table$dem))
SOC Slope Precipitation DEM
Min. : 0.000 Min. : 0.000 Min. : 424.5 Min. : 45.0
1st Qu.: 1.005 1st Qu.: 0.000 1st Qu.: 532.3 1st Qu.: 404.2
Median : 1.493 Median : 3.000 Median : 564.3 Median : 592.0
Mean : 1.912 Mean : 7.414 Mean : 597.5 Mean : 642.3
3rd Qu.: 2.244 3rd Qu.:11.000 3rd Qu.: 641.8 3rd Qu.: 768.0
Max. :50.332 Max. :56.000 Max. :1180.3 Max. :2375.0
NA's :1
Our hypothesis here is that elevation is a good
predictor of SOC. To start, let's have a look at what
the data look like:
> plot(DSM_table$Value, DSM_table$dem)
The plot does not show an obvious relationship. To fit a linear
model, we can use the lm function:
> model1 <- lm(Value ~ dem, data = DSM_table, y = TRUE, x = TRUE)
> model1
Call:
lm(formula = Value ~ dem, data = DSM_table, x = TRUE, y = TRUE)
Coefficients:
(Intercept)          dem
   0.715117     0.001856
> summary(model1)
Call:
lm(formula = Value ~ dem, data = DSM_table, x = TRUE, y = TRUE)
Residuals:
   Min     1Q Median     3Q    Max
-3.917 -0.769 -0.224  0.389 48.895
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.151e-01  6.929e-02   10.32   <2e-16 ***
dem         1.856e-03  9.367e-05   19.82   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.954 on 3300 degrees of freedom
Multiple R-squared: 0.1063, Adjusted R-squared: 0.1061
F-statistic: 392.6 on 1 and 3300 DF, p-value: < 2.2e-16
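To make the coefficient table concrete, the sketch below plugs the reported estimates into the fitted line for the first profile (dem = 2327 in the head() output) and recomputes the t value as estimate divided by standard error. The numbers are copied from the output shown in this deck; the variable names are only illustrative.

```r
# Estimates and standard error as reported for model1 in this deck.
b0     <- 0.715116901     # intercept
b1     <- 0.001856089     # slope for elevation
se_b1  <- 9.367314e-05    # standard error of the slope

# Predicted SOC for the first profile, whose elevation is 2327.
dem_first <- 2327
soc_hat   <- b0 + b1 * dem_first

# The t value is the estimate divided by its standard error.
t_slope <- b1 / se_b1

print(c(soc_hat = soc_hat, t_slope = t_slope))
```

The result should agree, up to rounding, with the first value returned by predict(model1) and with the t value in the coefficient table.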
> class(model1)
[1] "lm"
The output from the lm function is an object of class
"lm". An object of class "lm" is a list containing at least
the following components:
coefficients - a named vector of coefficients.
residuals - the residuals, that is, response minus fitted values.
fitted.values - the fitted mean values.
rank - the numeric rank of the fitted linear model.
weights - (only for weighted fits) the specified weights.
df.residual - the residual degrees of freedom.
call - the matched call.
terms - the terms object used.
contrasts - (only where relevant) the contrasts used.
xlevels - (only where relevant) a record of the levels of the factors used in fitting.
offset - the offset used (missing if none were used).
y - if requested, the response used.
x - if requested, the model matrix used.
model - if requested (the default), the model frame used.
na.action - (where relevant) information returned by model.frame on the special handling of NAs.
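A few of these components can be inspected on a small built-in example. The sketch below uses the cars data frame that ships with R (50 rows, columns speed and dist) as a stand-in for the soil data.

```r
# cars is a built-in data frame with 50 rows, used only as a stand-in.
fit <- lm(dist ~ speed, data = cars)

# residuals are the response minus the fitted values.
manual_resid <- cars$dist - unname(fit$fitted.values)
resid_match  <- all.equal(unname(fit$residuals), manual_resid)

# rank counts the estimated coefficients; df.residual is rows minus rank.
fit_rank <- fit$rank
fit_df   <- fit$df.residual

print(list(resid_match = resid_match, rank = fit_rank, df = fit_df))
```

The same $ access works on model1, for example model1$df.residual.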
Components can be extracted with $ or with accessor functions:
> model1$coefficients
(Intercept)         dem
0.715116901 0.001856089
> formula(model1)
Value ~ dem
> head(residuals(model1))
         1          2          3          4          5          6
 6.8445691 -1.2674728 -0.7045624 -0.5960802 -1.1628119 -1.1990168
Here is a list of what is available from the summary
function for this model:
> names(summary(model1))
[1] "call"          "terms"         "residuals"     "coefficients"
[5] "aliased"       "sigma"         "df"            "r.squared"
[9] "adj.r.squared" "fstatistic"    "cov.unscaled"
The summary is a list, so its elements can be extracted
by position. Element 4 is the coefficient table and
element 7 holds the degrees of freedom:
> summary(model1)[[4]]
               Estimate   Std. Error  t value     Pr(>|t|)
(Intercept) 0.715116901 6.928677e-02 10.32112 1.333183e-24
dem         0.001856089 9.367314e-05 19.81452 1.188533e-82
> summary(model1)[[7]]
[1]    2 3300    2
What is the R-squared of model1? It can be extracted
from the summary by name or by position:
> summary(model1)[["r.squared"]]
[1] 0.1063245
> summary(model1)[[8]]
[1] 0.1063245
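R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. As a sketch, the block below recomputes it by hand on the built-in cars data (a stand-in, not the soil data) and compares it with the value stored in the summary.

```r
# cars is a built-in data frame, used only as a stand-in.
fit <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares.
sse <- sum(residuals(fit)^2)
sst <- sum((cars$dist - mean(cars$dist))^2)

# R-squared by hand versus the value stored in the summary.
r2_manual  <- 1 - sse / sst
r2_summary <- summary(fit)[["r.squared"]]

print(all.equal(r2_manual, r2_summary))
```

Read this way, the R-squared of about 0.106 reported for model1 means that elevation alone accounts for roughly 11% of the variance in SOC.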
Predicted values, observed values and residuals for the first
six profiles (each residual is observed minus predicted):
> head(predict(model1))
       1        2        3        4        5        6
5.034235 4.757678 3.022235 2.532228 2.502530 3.484401
> head(DSM_table$Value)
[1] 11.878804  3.490205  2.317673  1.936148  1.339719  2.285384
> head(residuals(model1))
         1          2          3          4          5          6
 6.8445691 -1.2674728 -0.7045624 -0.5960802 -1.1628119 -1.1990168
The same components for model2, the multiple regression
model fitted in the next section:
> head(model2$residuals)
         1          2          3          4          5          6
 7.3395541 -1.1690560 -0.5363049 -1.2854938 -1.9692882 -1.5173124
> head(model2$fitted.values)
       1        2        3        4        5        6
4.539250 4.659261 2.853978 3.221641 3.309007 3.802697
Let's plot the observed vs. predicted values from
the model:
> plot(model1$y, model1$fitted.values)
Multiple regression in R
We will regress SOC on precipitation,
slope and elevation. First let's subset these
columns, then get their summary statistics:
> model2subset <- DSM_table[, c("Value", "slp", "prec", "dem")]
> summary(model2subset)
Value slp prec dem
Min. : 0.000 Min. : 0.000 Min. : 424.5 Min. : 45.0
1st Qu.: 1.005 1st Qu.: 0.000 1st Qu.: 532.3 1st Qu.: 404.2
Median : 1.493 Median : 3.000 Median : 564.3 Median : 592.0
Mean : 1.912 Mean : 7.414 Mean : 597.5 Mean : 642.3
3rd Qu.: 2.244 3rd Qu.:11.000 3rd Qu.: 641.8 3rd Qu.: 768.0
Max. :50.332 Max. :56.000 Max. :1180.3 Max. :2375.0
NA's :1
A quick way to look for relationships between
variables in a data frame is the cor function.
Note the use of na.omit, which drops rows with
missing values before the correlations are computed:
> cor(na.omit(model2subset))
          Value       slp      prec       dem
Value 1.0000000 0.2730310 0.3155474 0.3317814
slp   0.2730310 1.0000000 0.5765489 0.6011170
prec  0.3155474 0.5765489 1.0000000 0.8158338
dem   0.3317814 0.6011170 0.8158338 1.0000000
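The reason na.omit matters is that cor, with its default settings, returns NA for any pair of columns that contains a missing value. A minimal sketch on a made-up data frame (the column names a and b are hypothetical):

```r
# Made-up data frame with one missing value in column a.
toy <- data.frame(a = c(1, 2, 3, 4, NA),
                  b = c(2, 4, 5, 9, 10))

# Default behaviour: the a-b correlation is NA.
cor_default <- cor(toy)

# Dropping incomplete rows first gives a usable matrix.
cor_omit <- cor(na.omit(toy))

# The use argument of cor achieves the same row-wise deletion.
cor_complete <- cor(toy, use = "complete.obs")

print(cor_default)
print(cor_omit)
```

Either approach drops the whole row, so every correlation in the matrix is computed on the same set of observations.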
To visualize these relationships, we can use pairs:
> pairs(na.omit(model2subset))
Fitting the multiple linear regression:
> model2 <- lm(Value ~ slp + prec + dem, data = model2subset)
> model2
Call:
lm(formula = Value ~ slp + prec + dem, data = model2subset)
Coefficients:
(Intercept)          slp         prec          dem
  -0.027413     0.020314     0.001868     0.001048
> summary(model2)
Call:
lm(formula = Value ~ slp + prec + dem, data = model2subset)
Residuals:
   Min     1Q Median     3Q    Max
-3.527 -0.717 -0.219  0.379 48.907
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0274134  0.2240410  -0.122 0.902622
slp          0.0203135  0.0041948   4.843 1.34e-06 ***
prec         0.0018682  0.0004956   3.770 0.000166 ***
dem          0.0010477  0.0001681   6.232 5.18e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.937 on 3297 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.1223, Adjusted R-squared: 0.1215
F-statistic: 153.2 on 3 and 3297 DF, p-value: < 2.2e-16
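Two numbers in this output can be checked by hand. The sketch below recomputes the fitted value for the first profile (slp = 13, prec = 998.034, dem = 2327 in the head() output) and the adjusted R-squared, using only figures printed in this deck. The sample size of 3301 is inferred from the 3297 residual degrees of freedom plus the 4 estimated coefficients.

```r
# Coefficients as printed by summary(model2).
b <- c(intercept = -0.0274134, slp = 0.0203135,
       prec = 0.0018682, dem = 0.0010477)

# Covariates of the first profile, taken from the head() output.
first <- c(slp = 13, prec = 998.034, dem = 2327)

# Fitted value: intercept plus each coefficient times its covariate.
fit_first <- unname(b["intercept"] + sum(b[names(first)] * first))

# Adjusted R-squared from R-squared, sample size n and predictors p.
r2 <- 0.1223
n  <- 3297 + 4   # residual df plus estimated coefficients (inferred)
p  <- 3
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(fit_first = fit_first, adj_r2 = adj_r2))
```

Small differences from the printed values (4.539250 and 0.1215) come from rounding in the displayed coefficients and R-squared.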
EXERCISE
TASK 1:
Regress SOC (Value) on slope (slp), elevation (dem), TWI
(twi), annual mean night-time temperature (tmpn), annual mean
daytime temperature (tmpd) and precipitation (prec).
DATA: https://goo.gl/ow7pL7
TASK 2: Share your results at the link below:
https://goo.gl/zlNcb5
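Applying the MLR model spatially to create a
multiple linear regression Soil Organic Carbon map
of Macedonia:
> # Assumes the raster package, whose predict method accepts these arguments.
> # covStack is not defined in these slides; it is assumed to be a raster stack
> # of the covariates whose layer names match the model terms (slp, prec, dem).
> library(raster)
> MLR.SOC.Map <- predict(covStack, model2,
+                        "SOCMap_0_30_MLR.tif", format = "GTiff",
+                        datatype = "FLT4S", overwrite = TRUE)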
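Conceptually, spatial prediction applies the fitted equation to the covariate values of every grid cell. The sketch below imitates this in base R with a data frame standing in for the cells; the training values, the cells data frame and the reduced set of covariates are all made up for illustration and are not the raster workflow itself.

```r
# Made-up training data: SOC (Value) with two covariates.
train <- data.frame(Value = c(1.2, 2.1, 2.9, 4.2, 4.8),
                    dem   = c(100, 300, 500, 800, 1000),
                    slp   = c(2, 5, 3, 10, 12))
fit <- lm(Value ~ dem + slp, data = train)

# Hypothetical grid cells: one row per cell, and the column names
# must match the covariate names used in the model formula.
cells <- data.frame(dem = c(200, 600), slp = c(4, 8))

# One predicted SOC value per cell.
pred <- predict(fit, newdata = cells)

# The same result by applying the coefficients directly.
manual <- as.vector(cbind(1, cells$dem, cells$slp) %*% coef(fit))

print(pred)
```

The requirement that names match carries over to the raster case: the layers of the covariate stack need the same names as the predictors in the model.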
