Linear Regression with R
1: Prepare data/specify model/read results

          2012-12-07 @HSPH
         Kazuki Yoshida, M.D.
           MPH-CLE student

Group Website is at:
http://rpubs.com/kaz_yos/useR_at_HSPH
Previously in this group
- Introduction
- Reading Data into R (1)
- Reading Data into R (2)
- Descriptive, continuous
- Descriptive, categorical
- Deducer
- Graphics
- Groupwise, continuous
Menu
- Linear regression
Ingredients
Statistics
- Data preparation
- Model formula
Programming
- within()
- factor(), relevel()
- lm()
- formula = Y ~ X1 + X2
- summary()
- anova(), car::Anova()
Open RStudio
Create a new script and save it.
http://www.umass.edu/statdata/statdata/data/
We will use the lowbwt dataset (lowbwt.dat) used in BIO213
http://www.umass.edu/statdata/statdata/data/lowbwt.txt
http://www.umass.edu/statdata/statdata/data/lowbwt.dat
Load dataset from web

lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat",
                  header = TRUE, skip = 4)

skip = 4: skip the first 4 rows
header = TRUE: pick up variable names from the header row
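To see what skip and header do without touching the network, here is a minimal sketch that reads a made-up text block through the text argument of read.table(). The toy contents (four description lines, then a header row and three data rows) are an assumption for illustration and are not the real lowbwt.dat.

```r
## Toy stand-in for a file with 4 description rows before the header
## (contents are made up for illustration)
raw <- "Description line 1
Description line 2
Description line 3
Description line 4
ID AGE BWT
1  19  2523
2  33  2551
3  20  2557"

## skip = 4 drops the description rows;
## header = TRUE takes variable names from the next row
toy <- read.table(text = raw, header = TRUE, skip = 4)
names(toy)
str(toy)
```

With header = FALSE the names row would be read as data, and every column would be read as text.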
“Fix” dataset

lbw[c(10,39), "BWT"] <- c(2655, 3035)

c(10,39): 10th and 39th rows
"BWT": BWT column
Replace data points to make the dataset identical to the BIO213 dataset
Lower case variable names

names(lbw) <- tolower(names(lbw))

tolower(names(lbw)): convert variable names to lower case
names(lbw) <- : put them back into variable names
See overview
library(gpairs)
gpairs(lbw)
Recoding
Changing and creating variables

dataset <- within(dataset, {
    _variable manipulations_
})

dataset <- : name of the newly created dataset (here replacing the original)
within(dataset, ...): take the dataset
{ ... }: perform variable manipulation. You can specify variables by name only; there is no need for dataset$var_name
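A minimal sketch of the within() pattern on a made-up data frame (the variable names weight.kg, height.m, bmi and heavy are hypothetical, chosen only for illustration):

```r
## Toy data frame (made-up values)
toy <- data.frame(weight.kg = c(50, 60, 70),
                  height.m  = c(1.5, 1.6, 1.7))

toy <- within(toy, {
    ## Variables are referred to by name only, no toy$ prefix needed
    bmi   <- weight.kg / height.m^2
    heavy <- weight.kg >= 60
})
toy
```

The result has to be assigned back (toy <- ...), otherwise the new variables are discarded.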
lbw <- within(lbw, {

     ## Relabel race
     race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

     ## Categorize ftv (frequency of visit)
     ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
     ftv.cat <- relevel(ftv.cat, ref = "Normal")

     ## Dichotomize ptl
     preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))

})
Numeric to categorical: element by element
lbw <- within(lbw, {

     ## Relabel race
     race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

     ## Categorize ftv (frequency of visit)
     ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
     ftv.cat <- relevel(ftv.cat, ref = "Normal")

     ## Dichotomize ptl
     preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))

})
Categorize race and label: 1 to White, 2 to Black, 3 to Other
The 1st level (White) will be the reference
Explained more in depth
factor() to create categorical variable
race.cat <- : create new variable named race.cat
factor(race, ...): take race variable
  lbw <- within(lbw, {

       ## Relabel race
       race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

  })


levels = 1:3: order levels 1, 2, 3; make 1 the reference level
labels = c("White","Black","Other"): label levels 1, 2, 3 as White, Black, Other
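A small sketch of the same factor() call on a made-up numeric vector, to check which code maps to which label and which level comes first:

```r
## Made-up race codes
race <- c(1, 3, 2, 1, 3)

race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))

levels(race.cat)        # level order; the first level is the reference in lm()
table(race, race.cat)   # cross-check codes against labels
```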
Numeric to categorical: range to element (1st will be reference)
lbw <- within(lbw, {
     ## Relabel race
     race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

     ## Categorize ftv (frequency of visit)
     ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
     ftv.cat <- relevel(ftv.cat, ref = "Normal")

     ## Dichotomize ptl
     preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))

})

How breaks work
(-Inf, 0]   None     ftv = 0
(0, 2]      Normal   ftv = 1, 2
(2, Inf]    Many     ftv = 3, 4, 5, 6, ...
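The interval logic can be checked on a made-up vector of visit counts. By default cut() uses intervals that are open on the left and closed on the right, so 0 falls in None and 2 falls in Normal:

```r
## Made-up visit counts covering every interval
ftv <- 0:6

ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf),
               labels = c("None", "Normal", "Many"))

data.frame(ftv, ftv.cat)
```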
Reset reference level
lbw <- within(lbw, {

     ## Relabel race
     race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

     ## Categorize ftv (frequency of visit)
     ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
     ftv.cat <- relevel(ftv.cat, ref = "Normal")

     ## Dichotomize ptl
     preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))

})

               Change reference level of ftv.cat variable
                       from None to Normal
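A minimal sketch of relevel() on a made-up factor: only the order of the levels changes, the data values stay the same.

```r
## Made-up factor with None as the first (reference) level
ftv.cat <- factor(c("None", "Normal", "Many", "Normal"),
                  levels = c("None", "Normal", "Many"))
levels(ftv.cat)

## Move Normal to the front so that it becomes the reference level
ftv.cat2 <- relevel(ftv.cat, ref = "Normal")
levels(ftv.cat2)
```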
Numeric to Boolean to Category
lbw <- within(lbw, {

     ## Relabel race
     race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

     ## Categorize ftv (frequency of visit)
     ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
     ftv.cat <- relevel(ftv.cat, ref = "Normal")

     ## Dichotomize ptl
     preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))

})

ptl >= 1: a TRUE/FALSE vector is created here
levels = c(FALSE,TRUE), labels = c("0","1+"):
  ptl < 1 to FALSE, then to “0”
  ptl >= 1 to TRUE, then to “1+”
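The two steps (numeric to logical, logical to factor) can be separated to see the intermediate vector. The values below are made up, and is.preterm is a hypothetical helper name used only in this sketch:

```r
## Made-up counts of previous preterm labor
ptl <- c(0, 1, 0, 3, 2)

is.preterm <- ptl >= 1   # step 1: numeric to logical
preterm <- factor(is.preterm, levels = c(FALSE, TRUE),
                  labels = c("0", "1+"))   # step 2: logical to factor

data.frame(ptl, is.preterm, preterm)
```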
Binary 0,1 to No,Yes
## One-by-one method
lbw <- within(lbw, {

     ## Categorize smoke ht ui
     smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
     ht    <- factor(ht,    levels = 0:1, labels = c("No","Yes"))
     ui    <- factor(ui,    levels = 0:1, labels = c("No","Yes"))
})

## Alternative to above: loop method
## (run only one of the two methods; the variables are no longer 0/1 after the first)
lbw[,c("smoke","ht","ui")] <-
  lapply(lbw[,c("smoke","ht","ui")],
       function(var) {
          factor(var, levels = 0:1, labels = c("No","Yes"))
       })
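A compact sketch of the loop method on a made-up data frame. Here the extra arguments are passed straight through lapply() to factor(), which is one possible variation on the anonymous-function form above; bin.vars is a hypothetical helper name.

```r
## Made-up data: three 0/1 variables and one continuous variable
toy <- data.frame(smoke = c(0, 1, 1), ht = c(0, 0, 1),
                  ui = c(1, 0, 0), age = c(19, 33, 20))

bin.vars <- c("smoke", "ht", "ui")

## Apply the same factor() call to each selected column
toy[bin.vars] <- lapply(toy[bin.vars], factor,
                        levels = 0:1, labels = c("No", "Yes"))

sapply(toy, class)   # the three binary variables are factors, age is untouched
```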
model formula
formula

 outcome ~ predictor1 + predictor2 + predictor3




               SAS equivalent:
model outcome = predictor1 predictor2 predictor3;
In the case of t-test

age ~ zyg

age: continuous variable to be compared (variable to be explained)
zyg: grouping variable to separate groups (variable used to explain)
linear sum

Y ~ X1 + X2

.       All variables except for the outcome
+ X2    Add X2 term
- 1     Remove intercept
X1:X2   Interaction term between X1 and X2
X1*X2   Main effects and interaction term
Interaction term



Y ~ X1 + X2 + X1:X2
     Main effects   Interaction
Interaction term



Y ~ X1 * X2
   Main effects & interaction
On-the-fly variable manipulation

Y ~ X1 + I(X2 * X3)

I(): inhibit formula interpretation, for math manipulation
I(X2 * X3): new variable (X2 times X3) created on-the-fly and used
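The difference between X2 * X3 as a formula operator and I(X2 * X3) as arithmetic shows up in the design matrix. A sketch on made-up data (mm.star and mm.I are hypothetical names):

```r
## Made-up data
toy <- data.frame(Y  = c(1, 2, 3, 4, 5, 6),
                  X1 = c(1, 2, 1, 2, 1, 2),
                  X2 = c(3, 1, 4, 1, 5, 9),
                  X3 = c(2, 7, 1, 8, 2, 8))

## Without I(): * is a formula operator (main effects and interaction)
mm.star <- model.matrix(Y ~ X1 + X2 * X3, data = toy)
colnames(mm.star)

## With I(): * is ordinary multiplication, giving one new variable
mm.I <- model.matrix(Y ~ X1 + I(X2 * X3), data = toy)
colnames(mm.I)
```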
Fit a model


lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui +
              ftv.cat + race.cat + preterm ,
             data = lbw)
See model object

lm.full

Reading the output:
- Call: command repeated
- Coefficients: coefficient for each variable
See summary

summary(lm.full)

Reading the output:
- Call: command repeated
- Residuals: residual distribution
- Coefficients: Coef/SE = t; dummy variables are created for categorical variables
- R^2 and adjusted R^2
- Model F-test
ftv.catNone: people with no 1st trimester visits compared to
    people with a normal number of 1st trimester visits (reference level)
ftv.catMany: people with many 1st trimester visits compared to
    people with a normal number of 1st trimester visits (reference level)
race.catBlack: Black people compared to
    White people (reference level)
race.catOther: Other people compared to
    White people (reference level)
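With a single categorical predictor, each dummy-variable coefficient is simply the difference between that group's mean and the reference group's mean. A sketch on made-up data (y and g are hypothetical names):

```r
## Made-up outcome for three groups; Normal is the reference level
toy <- data.frame(
    y = c(10, 12, 11,  20, 22, 21,  30, 33, 30),
    g = factor(rep(c("Normal", "None", "Many"), each = 3),
               levels = c("Normal", "None", "Many"))
)

fit <- lm(y ~ g, data = toy)
coef(fit)   # (Intercept) = mean of Normal; gNone, gMany = differences from Normal

## Same numbers from group means
group.means <- tapply(toy$y, toy$g, mean)
group.means - group.means["Normal"]
```

In a model with other covariates, the same coefficients are differences from the reference level adjusted for those covariates.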
Confidence intervals

confint(lm.full)

Output columns: lower boundary (2.5 %) and upper boundary (97.5 %)
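For an lm fit, the intervals from confint() are the estimate plus or minus a t quantile times the standard error. A sketch on simulated data (all names and values are made up) that reproduces them by hand:

```r
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 2 + 0.5 * toy$x + rnorm(20)

fit <- lm(y ~ x, data = toy)

est    <- coef(summary(fit))[, "Estimate"]
se     <- coef(summary(fit))[, "Std. Error"]
t.crit <- qt(0.975, df = df.residual(fit))   # two-sided 95%

manual <- cbind(lower = est - t.crit * se,
                upper = est + t.crit * se)
manual
confint(fit)   # should agree with the manual calculation
```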
ANOVA table (type I)

anova(lm.full)

Columns: Df = degrees of freedom; Sum Sq = sequential SS; Mean Sq = SS/DF
F = Mean SS / Mean SS of residual
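The F column can be rebuilt from the other columns of the table. A sketch on simulated data (names and values are made up):

```r
set.seed(2)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 1 + toy$x1 + 0.5 * toy$x2 + rnorm(30)

tab <- anova(lm(y ~ x1 + x2, data = toy))   # rows: x1, x2, Residuals

mean.sq   <- tab[["Sum Sq"]] / tab[["Df"]]   # Mean SS = SS / DF
f.by.hand <- mean.sq[1:2] / mean.sq[3]       # divide by Mean SS of residual

cbind(f.by.hand, f.from.table = tab[["F value"]][1:2])
```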
Type I = Sequential SS
(Diagram: three overlapping circles for 1 age, 2 lwt, 3 smoke)
1 age: 1st gets all in type I
2 lwt: 2nd gets all but the overlap with the 1st in type I
3 smoke: last gets only the remaining in type I
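Because the first term gets all of its overlap, sequential SS depend on the order of terms whenever predictors are correlated. A sketch on simulated data, where x2 is deliberately built to correlate with x1 (all names and values are made up):

```r
set.seed(3)
toy <- data.frame(x1 = rnorm(50))
toy$x2 <- toy$x1 + rnorm(50)              # correlated with x1
toy$y  <- 1 + toy$x1 + toy$x2 + rnorm(50)

tab.12 <- anova(lm(y ~ x1 + x2, data = toy))   # x1 entered first
tab.21 <- anova(lm(y ~ x2 + x1, data = toy))   # x1 entered last

## SS for x1 differs depending on its position
c(first = tab.12["x1", "Sum Sq"], last = tab.21["x1", "Sum Sq"])

## The residual SS and the total SS do not depend on the order
c(tab.12["Residuals", "Sum Sq"], tab.21["Residuals", "Sum Sq"])
```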
ANOVA table (type III)


library(car)
Anova(lm.full, type = 3)
ANOVA table (type III)
Columns: Sum Sq = marginal SS; Df = degrees of freedom
Multi-category variables are tested as one
F = Mean SS / Mean SS of residual
Type III = Marginal SS
(Diagram: three overlapping circles for 1 age, 2 lwt, 3 smoke)
1 age: 1st gets only marginal in type III
2 lwt: 2nd gets only marginal in type III
3 smoke: last gets only marginal in type III
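Marginal SS do not depend on the order of terms. The sketch below uses base R's drop1() instead of car::Anova() so that it runs without extra packages; for a model without interaction terms, dropping each term in turn tests it after all the others, which is the marginal idea described above. The data are simulated and all names are made up.

```r
set.seed(3)
toy <- data.frame(x1 = rnorm(50))
toy$x2 <- toy$x1 + rnorm(50)              # correlated with x1
toy$y  <- 1 + toy$x1 + toy$x2 + rnorm(50)

fit.12 <- lm(y ~ x1 + x2, data = toy)
fit.21 <- lm(y ~ x2 + x1, data = toy)

m.12 <- drop1(fit.12, test = "F")   # each term tested after all the others
m.21 <- drop1(fit.21, test = "F")

## Same marginal SS for x1 regardless of order
c(m.12["x1", "Sum of Sq"], m.21["x1", "Sum of Sq"])

## It equals the sequential SS for x1 when x1 is entered last
anova(fit.21)["x1", "Sum Sq"]
```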
Comparison

Type I            Type III
Effect plot

library(effects)
plot(allEffects(lm.full), ylim = c(2000,4000))

ylim: fix Y-axis values for all plots
Each plot shows the effect of a variable with the other covariates set at average
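The idea of "other covariates set at average" can be imitated in base R with predict(). This is only a sketch of the concept on simulated data, not a description of how the effects package computes its displays; the data, coefficients and the name new.data are made up.

```r
set.seed(4)
toy <- data.frame(age = round(runif(40, 15, 40)),
                  lwt = round(rnorm(40, 130, 25)))
toy$bwt <- 2500 + 10 * toy$age + 4 * toy$lwt + rnorm(40, sd = 300)

fit <- lm(bwt ~ age + lwt, data = toy)

## Vary age over a grid while holding lwt at its mean
new.data <- data.frame(age = seq(15, 40, by = 5),
                       lwt = mean(toy$lwt))
new.data$fitted.bwt <- predict(fit, newdata = new.data)
new.data
```

Plotting fitted.bwt against age gives a line whose slope is the age coefficient.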
Interaction
This model is for demonstration purposes.

lm.full.int <- lm(bwt ~ age*lwt + smoke +
    ht + ui + age*ftv.cat + race.cat*preterm,
    data = lbw)

age*lwt: Continuous * Continuous
age*ftv.cat: Continuous * Categorical
race.cat*preterm: Categorical * Categorical
Anova(lm.full.int, type = 3)
Columns: Sum Sq = marginal SS; Df = degrees of freedom
Interaction terms appear as their own rows
F = Mean SS / Mean SS of residual
Continuous * Continuous (panels by lwt level)
plot(effect("age:lwt", lm.full.int))

Continuous * Categorical
plot(effect("age:ftv.cat", lm.full.int), multiline = TRUE)

Categorical * Categorical
plot(effect(c("race.cat*preterm"), lm.full.int),
     x.var = "preterm", z.var = "race.cat", multiline = TRUE)