20130308 Preparing data for modeling in R
Transcript of "20130308 Preparing data for modeling in R"

  1. Preparing data for modeling in R. 2013-03-08 @HSPH. Kazuki Yoshida, M.D., MPH-CLE student
  2. Group website: http://rpubs.com/kaz_yos/useR_at_HSPH
  3. Open RStudio
  4. Create a new script and save it.
  5. http://www.umass.edu/statdata/statdata/data/
  6. We will use the lowbwt dataset used in BIO213 Applied Regression for Clinical Research.
     Description: http://www.umass.edu/statdata/statdata/data/lowbwt.txt
     Data (lowbwt.dat): http://www.umass.edu/statdata/statdata/data/lowbwt.dat
  7. NAME: LOW BIRTH WEIGHT DATA (LOWBWT.DAT)
     KEYWORDS: Logistic Regression
     SIZE: 189 observations, 11 variables
     SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.
     DESCRIPTIVE ABSTRACT: The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
     NOTE: This data set consists of the complete data. A paired data set created from this low birth weight data may be found in lowbwtm11.dat and a 3 to 1 matched data set created from the low birth weight data may be found in mlowbwt.dat.
     Source: http://www.umass.edu/statdata/statdata/data/lowbwt.txt
  8. LIST OF VARIABLES:
     Columns  Variable                                                 Abbreviation
     -----------------------------------------------------------------------------
     2-4      Identification Code                                      ID
     10       Low Birth Weight                                         LOW
              (0 = Birth Weight >= 2500g, 1 = Birth Weight < 2500g)
     17-18    Age of the Mother in Years                               AGE
     23-25    Weight in Pounds at the Last Menstrual Period            LWT
     32       Race (1 = White, 2 = Black, 3 = Other)                   RACE
     40       Smoking Status During Pregnancy (1 = Yes, 0 = No)        SMOKE
     48       History of Premature Labor (0 = None, 1 = One, etc.)     PTL
     55       History of Hypertension (1 = Yes, 0 = No)                HT
     61       Presence of Uterine Irritability (1 = Yes, 0 = No)       UI
     67       Number of Physician Visits During the First Trimester    FTV
              (0 = None, 1 = One, 2 = Two, etc.)
     73-76    Birth Weight in Grams                                    BWT
     -----------------------------------------------------------------------------
  9. PEDAGOGICAL NOTES: These data have been used as an example of fitting a multiple logistic regression model.
     STORY BEHIND THE DATA: Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman's behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight. The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected.
     References: 1. Hosmer and Lemeshow, Applied Logistic Regression, Wiley, (1989).
  10. Load dataset from web
        lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat",
                          header = TRUE, skip = 4)
      skip = 4: skip the first 4 rows. header = TRUE: pick up variable names from the header row.
  11. "Fix" dataset
        lbw[c(10,39), "BWT"] <- c(2655, 3035)
      c(10,39): the 10th and 39th rows. "BWT": the BWT column. The two data points are replaced to make the dataset identical to the BIO213 dataset.
  12. Lower case variable names
        names(lbw) <- tolower(names(lbw))
      tolower(names(lbw)) converts the variable names to lower case; the assignment puts them back into the variable names.
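The three loading steps above can be tried offline on a small stand-in. This is only a sketch: the preamble lines, the toy values, and the object names `raw` and `toy` are made up for illustration and are not the real lowbwt.dat contents.

```r
## Toy stand-in for a data file with 4 preamble lines before the header row
## (hypothetical content, not the real lowbwt.dat)
raw <- c("preamble line 1",
         "preamble line 2",
         "preamble line 3",
         "preamble line 4",
         "ID LOW AGE BWT",
         "85 0 19 2523",
         "86 0 33 2551",
         "87 0 20 2557")

## skip = 4 drops the preamble; header = TRUE uses the next row as variable names
toy <- read.table(text = raw, header = TRUE, skip = 4)

## Replace two values in the BWT column by row number (rows chosen arbitrarily here)
toy[c(1, 3), "BWT"] <- c(2500, 2600)

## Convert the variable names to lower case
names(toy) <- tolower(names(toy))
str(toy)
```

Spelling out `header = TRUE` rather than relying on an abbreviated argument name keeps the call readable and avoids depending on partial argument matching.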
  13. See overview
  14. library(gpairs)
      gpairs(lbw)
  15. Recoding: changing and creating variables
  16. Why?
  17. Different variable forms mean different modeling assumptions!
  18. Variable form and assumption
      - Continuous variables: linearity assumption
      - Categorical variables: no residual confounding assumption
  19. Relabel race: 1, 2, 3 to White, Black, Other
        lbw$race.cat <- factor(lbw$race, levels = 1:3, labels = c("White","Black","Other"))
      lbw$race: take the race variable. lbw$race.cat: create a new variable named race.cat. levels = 1:3: order the levels 1, 2, 3 and make 1 the reference level. labels: label levels 1, 2, 3 as White, Black, Other.
      Using this variable as continuous is meaningless!
  20. Dichotomize ptl
        lbw$preterm <- factor(ifelse(lbw$ptl >= 1, "1+", "0"))
      lbw$ptl >= 1 is the condition. If the condition is true, then "1+"; if not (else), "0". The ifelse function gives either one of two values, and factor() changes the result to categorical.
  21. Change 0,1 binary to No,Yes binary
        lbw$smoke <- factor(ifelse(lbw$smoke == 1, "Yes", "No"))
        lbw$ht    <- factor(ifelse(lbw$ht == 1, "Yes", "No"))
        lbw$ui    <- factor(ifelse(lbw$ui == 1, "Yes", "No"))
        lbw$low   <- factor(ifelse(lbw$low == 1, "Yes", "No"))
      Equality is tested by ==, not =. If 1, return "Yes"; if not, return "No".
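As a small check of the recoding idioms above, the sketch below applies them to made-up vectors (`smoke` and `ptl` here are toy values, not the lbw columns) and compares the `ifelse()` approach with the `factor(levels = , labels = )` approach.

```r
## Toy 0/1 variable with one missing value (hypothetical data)
smoke <- c(0, 1, 1, 0, NA)

## Approach 1: ifelse() then factor(); levels are sorted alphabetically ("No", "Yes")
a <- factor(ifelse(smoke == 1, "Yes", "No"))

## Approach 2: factor() with explicit levels and labels
b <- factor(smoke, levels = 0:1, labels = c("No", "Yes"))

## Cross-tabulate the two results, keeping the NA visible
table(a, b, useNA = "ifany")

## Dichotomizing a count: 0 stays "0", anything >= 1 becomes "1+"
ptl <- c(0, 1, 3, 0)
preterm <- factor(ifelse(ptl >= 1, "1+", "0"))
levels(preterm)   # first level is the reference level in a model
```

With explicit `levels` and `labels` the level order is under your control; with `ifelse()` it follows the alphabetical order of the resulting strings, which happens to give the same order here.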
  22. Cutting a continuous variable into categories
        lbw$ftv.cat <- cut(lbw$ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
      Breaks at breaks = c(-Inf, 0, 2, Inf): 4 bounds for 3 categories.
        (-Inf, 0]   None    (0)
        (0, 2]      Normal  (1, 2)
        (2, Inf]    Many    (3, 4, 5, 6)
      Label them as labels = c("None","Normal","Many").
  23. Make "Normal" the reference level
        lbw$ftv.cat <- relevel(lbw$ftv.cat, ref = "Normal")
      ref = "Normal": "Normal" as reference level.
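To see where each visit count lands, the cut and relevel steps can be run on the values 0 through 6. This is a sketch on a toy vector named `ftv`, not on the lbw data frame.

```r
## Possible visit counts 0..6 (toy vector)
ftv <- 0:6

## Intervals are right-closed by default: (-Inf, 0], (0, 2], (2, Inf]
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf),
               labels = c("None", "Normal", "Many"))

## Cross-tabulate raw values against categories
table(ftv, ftv.cat)

## Level order before releveling follows the labels argument
levels(ftv.cat)

## Move "Normal" to the front so it becomes the reference level
ftv.cat <- relevel(ftv.cat, ref = "Normal")
levels(ftv.cat)
```

`relevel()` only changes the order of the levels; the category assigned to each observation stays the same.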
  24. within() allows direct use of variable names
  25. within() method
        lbw <- within(lbw, {
            ## Relabel race
            race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
            ## Categorize ftv (frequency of visit)
            ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
            ftv.cat <- relevel(ftv.cat, ref = "Normal")
            ## Dichotomize ptl
            preterm <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0","1+"))
            ## Categorize smoke ht ui
            smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
            ht <- factor(ht, levels = 0:1, labels = c("No","Yes"))
            ui <- factor(ui, levels = 0:1, labels = c("No","Yes"))
        })
      You can specify variables with the variable name only. No need for lbw$.
  26. Model formula
  27. Formula: outcome ~ predictor1 + predictor2 + predictor3
      SAS equivalent: model outcome = predictor1 predictor2 predictor3;
  28. In the case of a t-test: age ~ zyg
      Left side (age): continuous variable to be compared (variable to be explained).
      Right side (zyg): grouping variable used to separate groups (variable used to explain).
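The formula form of the t-test can be illustrated with a tiny made-up data frame. The names `toy` and `group` and all values below are hypothetical; the deck's own example uses `age ~ zyg` from a dataset not shown here.

```r
## Toy data: a continuous variable and a two-level grouping variable (hypothetical)
toy <- data.frame(
  age   = c(21, 25, 30, 19, 33, 27, 24, 29),
  group = factor(rep(c("A", "B"), each = 4))
)

## Left of ~ : variable to be compared; right of ~ : variable that separates the groups
res <- t.test(age ~ group, data = toy)

## Group means estimated by the test
res$estimate
```

The same `outcome ~ grouping` pattern carries over to regression functions such as `lm()`, where the right-hand side can hold several predictors.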
  29. Linear sum: Y ~ X1 + X2
  30. Formula operators
        .        All variables except for the outcome
        + X2     Add X2 term
        - 1      Remove intercept
        X1:X2    Interaction term between X1 and X2
        X1*X2    Main effects and interaction term
  31. Interaction term: Y ~ X1 + X2 + X1:X2 (X1 + X2 are the main effects, X1:X2 is the interaction)
  32. Interaction term: Y ~ X1 * X2 (main effects and interaction)
  33. On-the-fly variable manipulation: Y ~ X1 + I(X2 * X3)
      I() inhibits formula interpretation so that its contents are treated as a math manipulation. A new variable (X2 times X3) is created on the fly and used.
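One way to see how R expands the formula operators above is to inspect the terms of a formula without fitting anything. This is a sketch: `Y`, `X1`, `X2`, `X3` are placeholder names and `d` is a made-up one-row data frame used only to expand the dot.

```r
## Explicit and shorthand interaction formulas
f.long  <- Y ~ X1 + X2 + X1:X2
f.short <- Y ~ X1 * X2

labels.long  <- attr(terms(f.long),  "term.labels")
labels.short <- attr(terms(f.short), "term.labels")
labels.long
labels.short    # same terms as the explicit version

## I() keeps X2 * X3 as arithmetic instead of a formula interaction
labels.I <- attr(terms(Y ~ X1 + I(X2 * X3)), "term.labels")
labels.I

## "- 1" removes the intercept (the intercept attribute becomes 0)
int.flag <- attr(terms(Y ~ X1 - 1), "intercept")
int.flag

## "." expands to all variables in the data except the outcome
d <- data.frame(Y = 1, X1 = 1, X2 = 1)
labels.dot <- attr(terms(Y ~ ., data = d), "term.labels")
labels.dot
```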
  34. Fit a model
        lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui +
                      ftv.cat + race.cat + preterm,
                      data = lbw)
  35. See model object
        lm.full
  36. Output of the model object: the call (command repeated) and the coefficient for each variable.
  37. See summary
        summary(lm.full)
  38. Output of summary(): the call (command repeated), the residual distribution, the coefficient table (Coef/SE = t, with the dummy variables created for factors), the model R^2 and adjusted R^2, and the F-test.
  39. ftv.catNone: people with no 1st trimester visits compared to people with a normal number of 1st trimester visits (reference level).
      ftv.catMany: people with many 1st trimester visits compared to people with a normal number of 1st trimester visits (reference level).
  40. race.catBlack: Black people compared to White people (reference level).
      race.catOther: Other people compared to White people (reference level).
  41. Confidence intervals
        confint(lm.full)
  42. Output of confint(): one row per coefficient, with the lower boundary and the upper boundary of the confidence interval.
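The fitting and confidence-interval steps in the last slides can be sketched end to end on simulated data. Everything below is made up (the data frame `toy`, the simulated coefficients, the object name `fit`); it only mirrors the shape of the workflow and says nothing about the lbw results.

```r
## Simulated data with one continuous and one categorical predictor (hypothetical)
set.seed(1)
n <- 60
toy <- data.frame(
  lwt     = rnorm(n, mean = 130, sd = 30),
  ftv.cat = factor(sample(c("None", "Normal", "Many"), n, replace = TRUE),
                   levels = c("Normal", "None", "Many"))  # "Normal" is the reference
)
toy$bwt <- 2500 + 4 * toy$lwt + rnorm(n, sd = 400)

## Fit a linear model; the factor is expanded into dummy variables
fit <- lm(bwt ~ lwt + ftv.cat, data = toy)

## Coefficient names: non-reference levels are appended to the variable name
coef.names <- names(coef(fit))
coef.names

## 95% confidence intervals by default: one row per coefficient
ci <- confint(fit)
ci
```

The reference level ("Normal" here) has no coefficient of its own; each dummy coefficient is the difference from that level, which is why choosing the reference level before fitting matters.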