Logistic regression (Blyth 2006) (simplified)

An introduction to logistic regression for physicians, public health students and other health workers. Logistic regression is a way to look at the effect of a numeric independent variable on a binary (yes-no) dependent variable. For example, you can analyze or model the effect of birth weight on survival.

Transcript

  • 1. Logistic Regression
    Dr Mike Blyth, February 2006
  • 2. Logistic Regression
    A way to look at the effect of
    – a "numeric" (interval or ratio) independent variable
    on
    – a binary (yes-no) dependent variable
  • 3. Review: Linear Regression
    The dependent variable is continuous, interval or ratio (numeric).
    The independent variables are also interval or ratio.
    Examples
    – Effect of weight on blood pressure
    – Effect of drug dose on reticulocyte count
  • 4. Linear Regression
    (Figure: dependent variable plotted against the independent variable.)
  • 5. Logistic Regression
    (Figure: dependent variable plotted against the independent variable.)
  • 6. Logistic Regression
    The dependent variable is a binary (yes/no) outcome.
    The independent variables are continuous (interval).
    Examples:
    – Relation of weight and BP to 10-year risk of death
    – Relation of CD4 count to 1-year risk of AIDS diagnosis
  • 7. Why do we need it?
    Could use categorical analysis such as a frequency table:

                        AIDS    No AIDS
    CD4 > 350            80       20
    150 < CD4 < 350      50       50
    CD4 < 150            20       80

    Problems:
    a) Some information is lost when we collapse the numeric data into categories. This leads to loss of power.
    b) No estimate of the magnitude of the relation.
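For reference, the categorical analysis this slide describes amounts to computing the odds of AIDS within each CD4 group from the table above; a small illustrative sketch in Python:

```python
# Counts from the frequency table: (AIDS, no AIDS) per CD4 group
table = {
    "CD4 > 350":       (80, 20),
    "150 < CD4 < 350": (50, 50),
    "CD4 < 150":       (20, 80),
}

for group, (aids, no_aids) in table.items():
    print(group, "odds of AIDS =", aids / no_aids)
```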
  • 8. Odds Ratio
    Probability:
    – p = probability of the event
    – 1 - p = probability of not the event (also called q)
    – p varies from 0 to 1
    Odds
    – Ratio of the probability of the event to the probability of not having the event: odds = p/(1 - p)
    – When p = 0.5, odds = 1 (or "1:1 odds")
    – When p = 0.1, odds = 0.1/0.9 = 0.11
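A quick numeric check of these definitions (purely illustrative):

```python
# Odds from probability: odds = p / (1 - p)
def odds(p):
    return p / (1 - p)

print(odds(0.5))   # 1.0, i.e. "1:1 odds"
print(odds(0.1))   # 0.111..., roughly 0.11
```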
  • 9. Log Odds Ratio
    The log odds (also called the "logit") is simply the natural logarithm of the odds:
    logit = ln(odds) = ln(p/(1-p)) = ln(p) - ln(1-p)
    ln(1) = 0, so the logit is 0 when the odds are 1:1, or probability = 50%.
    The logit for an event of probability p is the opposite of the logit for the probability of not having the event.
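The sign-flip property mentioned on this slide can be checked the same way (illustrative sketch):

```python
import math

# logit = ln(p / (1 - p)) = ln(p) - ln(1 - p)
def logit(p):
    return math.log(p / (1 - p))

print(logit(0.5))              # 0.0: odds of 1:1
print(logit(0.9), logit(0.1))  # +2.197... and -2.197...: equal and opposite
```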
  • 10. Relation between probability p and logit
    (Figure: p on the vertical axis from 0 to 1, logit = ln[p/(1-p)] on the horizontal axis from -8 to 8, giving the characteristic S-shaped curve.)
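A curve like the one on this slide can be reproduced by inverting the logit, p = 1/(1 + e^(-logit)); the sketch below assumes numpy and matplotlib are available and is only an illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invert logit = ln(p/(1-p)) to express p as a function of the logit.
logit = np.linspace(-8, 8, 200)
p = 1 / (1 + np.exp(-logit))

plt.plot(logit, p)
plt.xlabel("logit = ln[p/(1-p)]")
plt.ylabel("probability p")
plt.title("Relation between probability p and logit")
plt.show()
```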
  • 11. Logistic regression model
    The linear regression model with one variable is
    y = a + bx + e
    The logistic regression model with one variable is
    logit = a + bx + e
    where logit = ln(p/(1-p))
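To make the model concrete, here is a minimal sketch of fitting a one-variable logistic regression with Python's statsmodels package. The data are simulated; the sample size, the "true" values of a and b, and the use of BP as the x variable are invented for illustration, not taken from the presentation.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a numeric risk factor x and a binary outcome y whose
# log odds follow logit = a + b*x (a = -7.2 and b = 0.058 are made up).
rng = np.random.default_rng(0)
x = rng.uniform(90, 200, size=500)        # e.g. systolic BP in mmHg
p = 1 / (1 + np.exp(-(-7.2 + 0.058 * x)))
y = rng.binomial(1, p)

# The statistics program estimates a (the constant) and b (the slope).
X = sm.add_constant(x)
result = sm.Logit(y, X).fit(disp=False)
print(result.params)          # estimated [a, b]
print(np.exp(result.params))  # e^a and e^b (odds ratio per unit of x)
```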
  • 12. Logistic regression model
    The logistic regression model with one variable is
    logit = a + bx
    where logit = ln(p/(1-p))
    In other words, the model says the log odds of the event happening is
    – a constant (a)
    – plus some other constant (b)
    – times a numeric risk factor (x) (for example, SBP)
  • 13. Logistic regression model
    Given the values of the independent variables, the regression equation predicts the log odds ratio.
  • 14. Logistic regression model
    The statistics program calculates the coefficient b.
    The coefficient b shows how much the odds ratio changes with a change in the independent variable.
    Positive b → higher risk with higher values
    Negative b → lower risk with higher values
  • 15. Logistic regression model
    Hypothetical example given above examining the relation of BP to risk of stroke/death. The model predicts:
    ln(odds ratio) = constant + b ∙ SBP
    e^(ln odds ratio) = e^(c + b∙SBP)
    Odds ratio = e^(c + b∙SBP) = e^c ∙ e^(b∙SBP)
  • 16. Logistic regression model
    The coefficient b shows how much the odds ratio changes with a change in the independent variable.
    Odds ratio = e^c ∙ e^(bx)
    In other words,
    Odds ratio = something ∙ (e^b)^x
  • 17. Logistic regression model
    Odds ratio = constant ∙ (e^b)^x
    So e^b is the factor indicating the effect of x on the event.
    Each one-unit change in x will multiply the odds ratio by a factor of e^b.
  • 18. Logistic regression model
    Odds ratio = constant ∙ (e^b)^x
    – Suppose b = 0.693, so e^b = 2. A one-unit change in x will double the odds ratio.
    – Suppose b = -0.693, so e^b = 0.5. A one-unit change in x will halve the odds ratio.
    – If b = 0, e^b = 1, and x has no effect on the OR.
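These three cases are easy to verify directly (illustrative arithmetic only):

```python
import math

for b in (0.693, -0.693, 0.0):
    print(b, "->", round(math.exp(b), 3))
# 0.693  -> 2.0   a one-unit change doubles the odds ratio
# -0.693 -> 0.5   a one-unit change halves the odds ratio
# 0.0    -> 1.0   x has no effect
```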
  • 19. Logistic regression model
    For the hypothetical example above, the report is given by Epi Info as:

    Term       Odds Ratio   95% CI          Coeff     S. E.    Z       P
    BP         1.0597       1.022 - 1.098   0.0579    0.0185   3.131   0.0017
    Constant   *            *               -7.201    2.2994   3.131   0.0017
  • 20. Logistic regression model

    Term       Odds Ratio   95% CI          Coefficient   S. E.   Z       P-value
    BP         1.0597       1.022 - 1.098   0.0579        0.018   3.131   0.0017
    Constant   *            *               -7.2014       2.299   3.131   0.0017

    The coefficient, or beta, or b, is the slope or magnitude of the effect.
  • 21. Logistic regression model

    Term       Odds Ratio   95% CI            Coefficient   S. E.    Z        P-value
    BP         1.0597       1.0220 - 1.0987   0.0579        0.0185   3.1319   0.0017
    Constant   *            *                 -7.2014       2.2994   3.1319   0.0017

    The odds ratio is for a one-unit change in the independent variable (e.g. BP). This is the calculated e^b. A one-unit change in BP multiplies the odds ratio by 1.0597.
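Because the odds ratio applies per unit of change, larger changes compound multiplicatively; a quick check using the coefficient reported above (the 10 mmHg change is chosen only as an example):

```python
import math

b = 0.0579                     # BP coefficient from the Epi Info output
or_per_unit = math.exp(b)      # ≈ 1.06, the reported odds ratio per 1 mmHg
or_per_10 = or_per_unit ** 10  # ≈ 1.78, odds ratio for a 10 mmHg increase
print(or_per_unit, or_per_10)
```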
  • 22. Logistic regression model

    Term       Odds Ratio   95% CI          Coeff     S. E.    Z        P-value
    BP         1.0597       1.022 - 1.098   0.0579    0.0185   3.1319   0.0017
    Constant   *            *               -7.2014   2.2994   3.1319   0.0017

    The 95% confidence interval is for that odds ratio. The confidence interval does not include 1, so the effect is statistically significant.
  • 23. Using more than one independent variable
    Single variable:
    logit = c + bx
    OR = c' ∙ (e^b)^x
    Multiple variables:
    logit = c + b1x1 + b2x2 + … + bnxn
    OR = c' ∙ (e^b1)^x1 ∙ (e^b2)^x2 ∙ … ∙ (e^bn)^xn
    Note that the terms multiply their effect on the odds ratio.
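The multiplicative form can be verified numerically; the constant, coefficients, and risk-factor values in this sketch are all made up for illustration.

```python
import math

c, b1, b2 = -4.0, 0.05, 0.8   # hypothetical constant and coefficients
x1, x2 = 140.0, 1.0           # hypothetical values of the two risk factors

# Odds computed from the linear predictor...
odds_from_logit = math.exp(c + b1 * x1 + b2 * x2)
# ...equal the product of one factor per variable.
odds_as_product = math.exp(c) * math.exp(b1) ** x1 * math.exp(b2) ** x2
print(odds_from_logit, odds_as_product)  # identical, up to floating point
```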
  • 24. Using more than one independent variable
    The analysis reports a b coefficient for each independent variable.
    That coefficient is the effect of the given independent variable, separated from the effects of all the other independent variables.
  • 25. Real Life Example
    Prospective cohort study of causes of cardiac disease: Evans County Study, 1965.
    Independent variables = age, gender, race, social index, SBP, diabetes, smoking, cholesterol, and an obesity index.
    Dependent variable = risk of dying during a 10-year period.
  • 26.
    Variable       Range        b coeff   SE       p
    Constant                    -6.376    1.634    <0.001
    Age            40-69 y      0.086     0.115    <0.001
    Gender         0=m, 1=f     1.500     0.967    0.121
    Age x gender                -0.043    0.017    0.011
    Social index   20-84        -0.056    0.040    0.160
    (Soc ind)²     400-7056     0.0006    0.0003   0.082
    SBP            88-310       0.019     0.002    <0.001
    Diabetes       0=n, 1=y     1.123     0.261    <0.001
    Smoking        0=n, 1=y     0.317     0.157    0.043
    Cholesterol    94-546       0.0031    0.0015   0.041
    Quetelet       2.11-8.76    -1.064    0.432    0.014
    (Quetelet)²    4.44-76.8    0.112     0.049    0.022

    Cited in Kelsey et al., Methods in Observational Epidemiology, 1986.
  • 27. (The same Evans County table as on slide 26, repeated.)
  • 28. Statistical Significance
    The p value indicates statistical significance.
    Age is positively correlated with risk of death.
    Gender has a positive b coefficient, but the p value is 0.12, indicating that we cannot say that there is a significant relationship.

    Variable   Range      b coeff   SE      p
    Age        40-69 y    0.086     0.115   <0.001
    Gender     0=m, 1=f   1.500     0.967   0.121
  • 29. Dichotomous (yes-no) variables
    Gender is coded as 0 for male, 1 for female.
    e^b [e^1.5 = 4.48] is the change in OR for a 1-unit change in gender, i.e. the OR for females relative to males.
    e^b for any dummy variable (coded 0-1) is the adjusted OR for that risk factor, since "1 unit of change" = presence vs. absence of the risk factor.

    Variable   Range      b coeff   SE      p
    Constant              -6.376    1.634   <0.001
    Age        40-69 y    0.086     0.115   <0.001
    Gender     0=m, 1=f   1.500     0.967   0.121
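A quick check of the arithmetic quoted on this slide (illustrative only):

```python
import math

# Adjusted odds ratio for female vs. male, from the gender coefficient b = 1.5
print(math.exp(1.5))  # ≈ 4.48
```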
  • 30. Squared terms
    Social index squared is included as well as social index itself.
    Squared terms allow for curvilinear relationships, just as in ordinary regression.

    Variable       Range      b coeff   SE       p
    Age x gender              -0.043    0.017    0.011
    Social index   20-84      -0.056    0.040    0.160
    (Soc ind)²     400-7056   0.0006    0.0003   0.082
  • 31. Interaction terms
    Age and gender are entered into the model as separate terms.
    Age x gender is included to see whether age has a different effect in males than in females.

    Variable       Range              b coeff   SE      p
    Age            40-69 y            0.086     0.115   <0.001
    Gender         0=m, 1=f           1.500     0.967   0.121
    Age x gender   M: 0-0, F: 40-69   -0.043    0.017   0.011
  • 32. Interpretation
    With binary, dummy variables, e^b is the odds ratio. You can compare the strength (slope) of the effect by comparing b.
    With numeric variables, b is not a direct measure of the strength of the effect.
    – Example: b is quite small in the effect of BP on mortality, because it is the effect of only one mmHg change in BP. BP is still an important factor in mortality because there is a wide range in BP.
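One way to see this is to look at a numeric variable's effect over its whole observed range rather than per unit; the sketch below uses the SBP row of the Evans County table, and the "odds ratio across the full range" framing is an illustration added here, not a calculation from the slides.

```python
import math

b_sbp = 0.019                       # per 1 mmHg, from the Evans County table
sbp_range = 310 - 88                # observed range of SBP in the study
print(math.exp(b_sbp))              # ≈ 1.02 per mmHg: looks small
print(math.exp(b_sbp * sbp_range))  # ≈ 68 across the whole range: a large effect
```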
  • 33. Interpretation
    In a prospective cohort study we can use the logistic regression model to predict the probability of the event given the independent variables. We can also derive the relative risk.
    In a cross-sectional study we only have the odds ratio.
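In a prospective setting the fitted model can be turned back into a predicted probability by inverting the logit; this sketch reuses the single-variable BP model reported earlier, and the BP value of 140 is an arbitrary choice for illustration.

```python
import math

c, b = -7.2014, 0.0579          # constant and BP coefficient from the Epi Info output
bp = 140                        # arbitrary BP value, for illustration only

logit = c + b * bp
p = 1 / (1 + math.exp(-logit))  # predicted probability of the event
print(p)                        # ≈ 0.71
```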
  • 34. Selection of variables
    Same principle as with ordinary regression.
    Forward selection: add one variable at a time until there are no more that make a significant difference.
    Backward selection: start with all the variables and remove them one at a time to see whether each makes a significant contribution.
    Epi Info has suggestions on how to do this.
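A minimal sketch of backward selection driven by p-values, assuming statsmodels and a pandas DataFrame of candidate predictors; the data frame df, the outcome column name, and the 0.05 cutoff are assumptions for illustration, and real variable selection usually involves more judgment than this loop.

```python
import statsmodels.api as sm

def backward_select(df, outcome, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value above alpha."""
    predictors = [c for c in df.columns if c != outcome]
    while predictors:
        X = sm.add_constant(df[predictors])
        fit = sm.Logit(df[outcome], X).fit(disp=False)
        pvals = fit.pvalues.drop("const")   # p-values for the predictors only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return fit, predictors          # every remaining term is significant
        predictors.remove(worst)            # drop the least significant term
    return None, []
```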