Logistic Regression in Case-Control Study

  • 1,796 views
Uploaded on

This is a basic presentation about use of Logistic regression in case-control study of genetics data in R.

This is a basic presentation about use of Logistic regression in case-control study of genetics data in R.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
1,796
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
24
Comments
0
Likes
1

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide
  • Coeffcients are calculated my MLE
  • In order to test hypotheses in logistic regression, we have used the likelihood ratio test and the Wald test.
  • If the confidence interval includes 0 we can say that there is no significant difference between the means of the two populations, at a given level of confidence. The width of the confidence interval gives us some idea about how uncertain we are about the difference in the means. A very wide interval may indicate that more data should be collected before anything definite can be said. A confidence interval that includes 1.0 means that the association between the exposure and outcome could have been found by chance alone and that the association is not statistically significant.
  • Binomial is specifying a choice of variance and link functions. Variance is binomial and link is logit function.

Transcript

  • 1. Logistic Regression in Case- Control study using – A statistical tool Satish Gupta
  • 2. What is R?  The R statistical programming language is a free open source package.  The language is very powerful for writing programs.  Many statistical functions are already built in.  Contributed packages expand the functionality to cutting edge research.
  • 3. Getting Started  Go to www.r-project.org  Downloads: CRAN (Comprehensive R Archive Network)  Set your Mirror: location close to you.  Select Windows 95 or later, MacOS or UNIX platforms
  • 4. Getting Started
  • 5. Basic operators and calculations Comparison operators  equal: ==  not equal: !=  greater/less than: > <  greater/less than or equal: >= <= Example: 1 == 1 # Returns TRUE
  • 6. Basic operators and calculations Logical operators  AND: & x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'. x > y & x > 5 # Returns TRUE where both comparisons return TRUE.  OR: | x == y | x != y # Returns TRUE where at least one comparison is TRUE.  NOT: ! !x > y # The '!' sign returns the negation (opposite) of a logical vector.
  • 7. Basic operators and calculations Calculations  Four basic arithmetic functions: addition, subtraction, multiplication and division 1 + 1; 1 - 1; 1 * 1; 1 / 1 # Returns results of basic arithmetic calculations.  Calculations on vectors x <- 1:10; sum(x); mean(x), sd(x); sqrt(x) # Calculates for the vector x its sum, mean, standard deviation and square root. x <- 1:10; y <- 1:10; x + y # Calculates the sum for each element in the vectors x and y.
  • 8. R-Graphics R provides comprehensive graphics utilities for visualizing and exploring scientific data. It includes:  Scatter plots  Line plots  Bar plots  Pie charts  Heatmaps  Venn diagrams  Density plots  Box plots
  • 9. Data handling in R  Load data: mydata = read.csv(“/path/mydata.csv”)  See data on screen: data(mydata)  See top part of data: head(mydata)  Specific number of rows and column of data: mydata[1:10,1:3]  To get a type of data: class(mydata)  Changing class of data: newdata = as.matrix(mydata)  Summary of data: summary(mydata)  Selecting (KEEPING) variables (columns) newdata = mydata[c(1,3:5)]
  • 10. Data handling in R  Selecting observations newdata= subset(mydata, age>=20 | age <10, select=c(ID, weight) newdata= subset(mydata, sex==“Male” & age >25, select=weight:income)  Excluding (DROPPING) variables (columns) newdata = mydata[c(-3,-5)] mydata$v3 = NULL
  • 11. R-Library  There are many tools defined as “package” are present in R for different kind of analysis including data from genetics and genomics.  Depending upon the availability of library, it can be downloaded from two sources Using CRAN (Comprehensive R Archive Network) as: install.packages(“package_name”) Using Bioconductor as: source("http://bioconductor.org/biocLite.R") biocLite(“package_name”)
  • 12. R-Library  To load a package, library() #Lists all libraries/packages that are available on a system. library(genetics) #Package for genetics data analysis library(help=genetics) #Lists all functions/objects of “genetics” package ?function #Opens documentation of a function
  • 13. What is Logistic Regression?  Logistic regression describes the relationship between a dichotomous response variable and a set of explanatory variables.  Logistic regression is often used because the relationship between the DV (a discrete variable) and a predictor is non-linear.
  • 14.  A General Model: Logistic Regression JJ disease disease disease XX p p p βββ +++= − = 110) 1 log()logit( Where: pdisease is the probability that an individual has a particular disease. β0 is the intercept β1, β2 …βJ are the coefficients (effects) of genetic factors X1, X2 …XJ are the variables of genetic factors
  • 15. Assumptions  Logistic regression does not make any assumptions of normality, linearity, and homogeneity of variance for the independent variables.  Because it does not impose these requirements, it is preferred to discriminant analysis when the data does not satisfy these assumptions.
  • 16. Questions ??  What is the relative importance of each predictor variable?  How does each predictor variable affect the outcome?  Does a predictor variable make the solution better or worse or have no effect?  Are there interactions among predictors?  Does adding interactions among predictors (continuous or categorical) improve the model?  What is the strength of association between the outcome variable and a set of predictors?  Often in model comparison you want non-significant differences so strength of association is reported for even non-significant effects.
  • 17. Types of Logistic Regression  Unconditional logistic regression  Conditional logistic regression ** Rule of thumbs  Use conditional logistic regression if matching has been done, and unconditional if there has been no matching.  When in doubt, use conditional because it always gives unbiased results. The unconditional method is said to overestimate the odds ratio if it is not appropriate.
  • 18. Data Format Status Matset Se_Quartiles GPX1 GPX4 SEP15 TXN2 1 1 <60 CT TT AG AG 0 1 >60 – 70 CC CC GG GG 1 2 <60 TT CC AG AA 0 2 >70 – 80 CC CT GG GG 1 3 >80 CC CC AA AA 0 3 >60 – 70 CT TT GG GG 1 4 <60 CC CC AA AG 0 4 >70 – 80 TT TT GG GG 1 5 >80 CC CC AG AA 0 5 <60 CC CC GG GG 1 6 >70 – 80 CT TT AA AA 0 6 >80 CC CC GG AG 1 7 >60 – 70 TT CC AA AG
  • 19. Data and Library loading  Load and use data in R (Using Lung cancer data from PLoS One 2013, 8(3):e59051). lung = read.csv(“/path/lung.csv”, sep= “t”, header = TRUE)  Load the library and use data for analysis library(epicalc) use(lung)
  • 20. Data Analysis  Performing conditional logistic regression (Case vs. Control) clogit_lung = clogit(Status ~ Se_Quartiles + strata(Matset), data = .data) clogistic.display(clogit_lung) OR(95%CI) P(Wald's test) P(LR-test) Quartiles: ref.=<60 <0.001 >60 – 70 0.4(0.15 – 1.09) 0.074 >70 – 80 0.11(0.03 – 0.33) <0.001 >80 0.10(0.03 – 0.34) <0.001
  • 21. Data Analysis  Performing conditional logistic regression (Case vs. Control), clogit_lung = clogit(Status ~ GPX1+ strata(Matset), data = .data) clogistic.display(clogit_lung) OR(95%CI) P(Wald's test) P(LR-test) GPX1: ref.=CC 0.032 CT 0.44(0.22 – 0.86) 0.017 TT 0.42(0.13 – 1.38) 0.151
  • 22. Data Analysis  Performing conditional logistic regression (Case vs. Control), clogit_lung = clogit(Status ~ Se_Quartiles + GPX1+ strata(Matset), data = .data) clogistic.display(clogit_lung)   crude OR(95%CI) adj. OR(95%CI) P(Wald's test) P(LR-test) Quartiles: ref.=<60 <0.001 >60 – 70 0.4(0.15 – 1.09) 0.32(0.11 – 0.96) 0.042 >70 – 80 0.11(0.03 – 0.33) 0.09(0.02 – 0.3) <0.001 >80 0.1(0.03 – 0.34) 0.05(0.01 – 0.23) <0.001 GPX1:ref.=CC 0.006 CT 0.44(0.22 – 0.86) 0.26(0.11 – 0.65) 0.004 TT 0.42(0.13 – 1.38) 0.44(0.09 – 2.18) 0.313 Environmental Factor Genetic Factor
  • 23. Data Analysis  Performing unconditional logistic regression (Case vs. Control), ulogit_lung = glm(Status ~ Se_Quartiles , family=binomial, data = .data) logistic.display(ulogit_lung) OR(95%CI) P(Wald's test) P(LR-test) Quartiles: ref.=<60 <0.001 >60 – 70 0.41 (0.17 – 1.02) 0.054 >70 – 80 0.13 (0.05 – 0.34) <0.001 >80 0.17 (0.07 – 0.42) <0.001
  • 24. Data Analysis  Performing unconditional logistic regression (Case vs. Control), ulogit_lung = glm(Status ~ GPX1 , family=binomial, data = .data) logistic.display(ulogit_lung) OR(95%CI) P(Wald's test) P(LR-test) Quartiles: ref.=CC 0.034 CT 0.45 (0.24 – 0.85) 0.014 TT 0.44 (0.14 – 1.36) 0.156
  • 25. Data Analysis  Performing unconditional logistic regression (Case vs. Control), ulogit_lung = glm(Status ~ Se_Quartiles , family=binomial, data = .data) logistic.display(ulogit_lung) crude OR(95%CI) adj. OR(95%CI) P(Wald's test) P(LR-test) Quartiles: ref.=<60 <0.001 >60 – 70 0.41 (0.17 – 1.02) 0.43 (0.17 – 1.08) 0.074 >70 – 80 0.13 (0.05 – 0.34) 0.13 (0.05 – 0.34) <0.001 >80 0.17 (0.07 – 0.42) 0.15 (0.06 – 0.39) <0.001 GPX1:ref.=CC 0.024 CT 0.45 (0.24 – 0.85) 0.40(0.20 – 0.80) 0.01 TT 0.44 (0.14 – 1.36) 0.42 (0.12 – 1.41) 0.161
  • 26. Something More   Changing the default reference GPX1 = relevel(GPX1, ref = "TT") pack()  Saving the result result = clogistic.display(clogit_lung) write.csv(result$table, file=“path/result.csv“, sep = “t”) write.table(result$table, file=“path/result.xls“, sep = “t”)
  • 27. Summary: regression models  Regression models can be used to describe the average effect of predictors on outcomes in your data set.  They can tell how likely that the effect is just be due to chance.  They can look at each predictor “adjusting for” the others (estimating what would happen if all others were held constant.)
  • 28. Thanks to, Prof. Virasakdi Chongsuvivatwong Epidemiology Unit, Faculty of Medicine, Prince of Songkla University, Thailand