A personal statistical analysis conducted in S.P.S.S. and the R programming language. A logistic regression was performed to investigate which risk factors for myopia are the most significant.
Logistic regression in Myopia data
Achilleas Papatsimpas
Mathematician,
M.Sc. in Statistics and Operational Research
1 INTRODUCTION
Myopia is an eye condition in which a person has difficulty seeing things at a distance, primarily because the eyeball is too long. In an eye that sees normally, the image of what is being viewed is transmitted to the back portion of the eye, which is called the retina, and hits the retina to form a clear picture. In the myopic eye, the image focuses in front of the retina, so the resulting image on the retina itself is blurry. The blurry image creates problems with a variety of distance viewing tasks (e.g., reading the blackboard, doing homework, driving, playing sports) and requires wearing glasses or contact lenses to correct the problem (Hosmer, Lemeshow and Sturdivant, 2013).
The risk factors for the development of myopia include genetic factors (e.g., family history of myopia) and the amount and type of visual activity that a child performs (e.g., studying, reading, TV watching, computer or video game playing and sports/outdoor activity). There is strong evidence that having myopic parents increases the chance that a child will become myopic, and weaker evidence that certain types of visual activities (called “near work”, like reading) increase the chance that a child will become myopic (Hosmer, Lemeshow and Sturdivant, 2013).
The dataset used in this project comprises 618 subjects who had at least five years of follow-up and were not myopic when they entered the study. All data are from their initial exam and include 10 variables. In addition to the ocular data there is information on age at entry, year of entry, family history of myopia and hours of various visual activities. A subject was coded as myopic if they became myopic at any time during the first five years of follow-up. We refer to this data set as the MYOPIA data.
We will perform a logistic regression on the MYOPIA data in order to investigate which risk factors are the most significant.
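Before turning to the software output, it may help to state the model explicitly (this formulation is standard and only implicit in the text). Logistic regression models the log-odds of becoming myopic as a linear function of the nine predictors:

```latex
\ln\frac{p(x)}{1-p(x)} = b_0 + b_1 x_1 + \cdots + b_9 x_9,
\qquad
p(x) = \frac{e^{\,b_0 + b_1 x_1 + \cdots + b_9 x_9}}{1 + e^{\,b_0 + b_1 x_1 + \cdots + b_9 x_9}}
```

where p(x) is the probability that a child with predictor values x becomes myopic within the first five years of follow-up.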
No. | Variable Description                                                                           | Values/Labels        | Variable Name
1   | Myopia within the first five years of follow-up                                                | 0 = No, 1 = Yes      | MYOPIC
2   | Gender                                                                                         | 0 = Male, 1 = Female | GENDER
3   | Spherical Equivalent Refraction                                                                | diopter              | SPHEQ
4   | Hours per week outside of school spent engaging in sports/outdoor activities                   | Hours per week       | SPORTHR
5   | Hours per week outside of school spent reading for pleasure                                    | Hours per week       | READHR
6   | Hours per week outside of school spent playing video/computer games or working on the computer | Hours per week       | COMPHR
7   | Hours per week outside of school spent reading or studying for school assignments              | Hours per week       | STUDYHR
8   | Hours per week outside of school spent watching television                                     | Hours per week       | TVHR
9   | Was the subject’s mother myopic?                                                               | 0 = No, 1 = Yes      | MOMMY
10  | Was the subject’s father myopic?                                                               | 0 = No, 1 = Yes      | DADMY
Table 1.1: Variables in the Myopia study
2 LOGISTIC REGRESSION WITH S.P.S.S.
2.1 FULL MODEL
Suppose we are interested in investigating predictors of myopia within the first five years of follow-up. Note that the dependent variable, MYOPIC, is binary: 0 means no myopia within the first five years of follow-up and 1 means myopia. The predictor variables are GENDER, SPHEQ, SPORTHR, READHR, COMPHR, STUDYHR, TVHR, MOMMY and DADMY. SPHEQ, SPORTHR, READHR, COMPHR, STUDYHR and TVHR are quantitative, while GENDER, MOMMY and DADMY are categorical. We use the Enter method (all variables are added to the model simultaneously).
Now we are looking at the S.P.S.S. statistical output. We can see that there are 618 cases used
in the analysis.
Table 2.1: Case Processing Summary
The Block 0 output is for a model that includes only the intercept (which S.P.S.S. calls the constant). From the classification table we see that 537 of the 618 subjects (86.9%) did not become myopic, so the intercept-only model, which predicts “no myopia” for every case, classifies 86.9% of the cases correctly; the remaining 81 subjects (13.1%) became myopic.
Table 2.2: Classification Table: Block 0
In the “Variables in the Equation” table we see that the intercept-only model is

ln(odds) = −1.892

Exponentiating both sides, the predicted odds are Exp(B) = exp(−1.892) = 0.151. The table also reports the regression weights and a statistical test of H0 : B = 0 for each variable in the equation (only the constant for Block 0).
Table 2.3: Variables in the equation: Block 0
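The Block 0 numbers can be verified by hand: the intercept of the empty model is just the log of the observed odds of myopia. A quick cross-check in Python (a sketch outside the S.P.S.S. workflow; the counts of 81 myopic and 537 non-myopic subjects come from the Block 0 classification table):

```python
import math

# Counts from the Block 0 classification table: 537 of the 618
# subjects did not become myopic during follow-up; 81 did.
n_myopic, n_not_myopic = 81, 537

odds = n_myopic / n_not_myopic   # observed odds of myopia
intercept = math.log(odds)       # intercept of the intercept-only model

print(round(intercept, 3))  # -1.892, matching ln(odds) above
print(round(odds, 3))       #  0.151, matching Exp(B)
```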
In the “Variables not in the Equation” table we see the contribution of each predictor if it
was added alone into the equation.
Table 2.4: Variables not in the equation: Block 0
Now we look at the Block 1 output. Under Omnibus Tests of Model Coefficients we test the hypothesis

H0 : bi = 0 for all i = 1, ..., 9

vs

H1 : bi ≠ 0 for at least one coefficient
Therefore we conclude that H0 is rejected, since the p-value < .001.
Table 2.5: Omnibus Tests of Model Coefficients
Table 2.6: Model Summary
The Classification table shows that the model is 89.6% accurate.
Table 2.7: Classification Table: Block 1
The “Variables in the Equation” table contains the coefficients of the fitted model and other relevant information about the coefficients.
Table 2.8: Variables in the equation: Block 1
The equation of the fitted model from the output is

ln[ p̂(x) / (1 − p̂(x)) ] = 1.679 − 0.585·GENDER − 4.011·SPHEQ − 0.047·SPORTHR + 0.078·READHR + 0.042·COMPHR − 0.022·TVHR − 0.187·STUDYHR − 0.739·MOMMY − 0.809·DADMY
2.2 INTERPRETING THE FULL MODEL
GENDER does not contribute to the model. The negative B indicates that the target group (Yes) tends to contain more of those coded “0” (males) than of those coded “1” (females), but not significantly (p-value = 0.063). SPHEQ, SPORTHR, STUDYHR, MOMMY and DADMY do contribute to the model, as they are significant factors (p-value < 0.05).
Finally, READHR, COMPHR and TVHR do not contribute to the model as they are not
significant.
2.3 REDUCED MODEL
Now we conduct an analysis where the dependent variable is MYOPIC and the predictors are SPHEQ, SPORTHR, STUDYHR, MOMMY and DADMY, which we previously found to contribute to the model. We call this the reduced model. Furthermore, we can test the significance of the difference between the full and the reduced model, because the reduced model is nested within the full model. Our 9-predictor full model had a -2LogLikelihood statistic of 305.201 (Block 1, Model Summary table). Removing the four variables mentioned above produced an increase of 7.827, so the reduced model has a -2LogLikelihood statistic of 313.028. This difference follows a χ² distribution on 4 df (one df for each removed predictor variable).
Table 2.9: Omnibus Tests of Model Coefficients: Reduced model
Table 2.10: Model Summary: Reduced model
To determine the p-value associated with this χ², we compute the following in S.P.S.S.:

p = 1 − CDF.CHISQ(7.827, 4)

The calculation gives p = 0.10. We conclude that the reduced model is as effective as the full model, since χ²(4, N = 618) = 7.827 with p-value = 0.10 > .05.
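The S.P.S.S. computation 1 − CDF.CHISQ(7.827, 4) can also be reproduced without a statistics package; a minimal Python sketch, using the closed-form survival function of the χ² distribution that exists for an even number of degrees of freedom (here 4):

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of
    degrees of freedom: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k)
                                 for k in range(df // 2))

# Difference in -2LogLikelihood between the reduced and full models
p = chi2_sf_even_df(7.827, 4)
print(round(p, 3))  # 0.098, i.e. p-value = 0.10 > .05
```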
Note that our overall success rate in classification has improved from 89.6% to 90.1%, as we can see in the classification table below.
Table 2.11: Classification Table: Reduced model
The equation of the new model found from the output is:

ln[ p̂(x) / (1 − p̂(x)) ] = 1.438 − 3.969·SPHEQ − 0.047·SPORTHR − 0.148·STUDYHR − 0.651·MOMMY − 0.819·DADMY
Table 2.12: Variables in the Equation: Reduced model
EXAMPLE. Suppose we have a child with a Spherical Equivalent Refraction (SPHEQ) of 1.40. The child spends 4 hours per week engaging in sports and outdoor activities and 1 hour per week reading or studying for school assignments. Also, the child’s parents are not myopic.
Therefore, we have the following prediction:

ln[ p̂(x) / (1 − p̂(x)) ] = 1.438 − 3.969·1.40 − 0.047·4 − 0.148·1 − 0.651·0 − 0.819·0 = −4.4546

and

p̂(x) = exp(−4.4546) / (1 + exp(−4.4546)) = 0.0116249689 / 1.0116249689 = 0.011491382

That is, our model predicts a probability of about 1.1% that the child becomes myopic.
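The arithmetic above can be checked numerically; a small Python sketch using the reduced-model coefficients reported in Table 2.12:

```python
import math

# Reduced-model coefficients (Table 2.12)
intercept = 1.438
coefs = {"SPHEQ": -3.969, "SPORTHR": -0.047, "STUDYHR": -0.148,
         "MOMMY": -0.651, "DADMY": -0.819}

# The child in the example: SPHEQ = 1.40, 4 h/week of sports,
# 1 h/week of studying, neither parent myopic
child = {"SPHEQ": 1.40, "SPORTHR": 4, "STUDYHR": 1, "MOMMY": 0, "DADMY": 0}

logit = intercept + sum(coefs[v] * child[v] for v in coefs)
p_hat = math.exp(logit) / (1 + math.exp(logit))

print(round(logit, 4))  # -4.4546
print(round(p_hat, 4))  #  0.0115, about a 1.1% predicted probability
```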
3 LOGISTIC REGRESSION WITH R
3.1 FULL MODEL
Now we conduct the previous logistic regression in R. As before, the dependent variable is MYOPIC and the predictors are GENDER, SPHEQ, SPORTHR, READHR, COMPHR, STUDYHR, TVHR, MOMMY and DADMY. The Coefficients table, which contains the coefficients of the fitted model and other relevant information about them, is given below.
Table 3.1: Coefficients table in R
R also calculates the descriptive statistics (minimum and maximum statistics, median, first
and third quartiles).
Table 3.2: Descriptive statistics
Finally, we get the Analysis of Deviance table.
Table 3.3: Analysis of Deviance
3.2 REDUCED MODEL
Now, we conduct an analysis where the predictors are SPHEQ, SPORTHR, STUDYHR, MOMMY
and DADMY. The coefficients table is given below.
Table 3.4: Coefficients table: Reduced model
The Analysis of Deviance table is:
Table 3.5: Analysis of Deviance table: Reduced model
We can test the significance of the difference between the full and the reduced model, because the reduced model is nested within the full model. Our 9-predictor model had a -2LogLikelihood statistic of 305.2 (Coefficients table, Residual deviance). Removing the variables mentioned above produced an increase of 7.8267, so the reduced model has a -2LogLikelihood statistic of 313.03. This difference follows a χ² distribution on 4 df (one df for each removed predictor variable; see the table below).

Table 3.6: Analysis of Variance table: Reduced model
To determine the p-value associated with this χ², we compute the following expression in R:

p = 1 − pchisq(7.8264, 4)

The calculation gives p = 0.0981484. We conclude that the reduced model is as effective as the full model, since χ²(4, N = 618) = 7.8264 with p-value = 0.0981484 > .05.
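The same conclusion can be reached directly from the two residual deviances that R reports; a quick numerical cross-check in Python (a sketch, using the closed-form χ² survival function for even degrees of freedom rather than R's pchisq):

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of
    degrees of freedom: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k)
                                 for k in range(df // 2))

# Residual deviances (-2LogLikelihood) reported by R
deviance_full, deviance_reduced = 305.2, 313.03

lr = deviance_reduced - deviance_full   # likelihood-ratio statistic, 4 df
p = chi2_sf_even_df(lr, 4)

print(round(lr, 2))  # 7.83
print(round(p, 3))   # 0.098, matching 1 - pchisq(7.8264, 4)
```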
REFERENCES
1. Wuensch, K. L. (2014). Binary Logistic Regression with SPSS. East Carolina University.
2. Logistic Regression on SPSS, https://www.researchgate.net
3. Hosmer, D. W., Lemeshow, S. and Sturdivant, R. X. (2013). Applied Logistic Regression, Third Edition. John Wiley & Sons, Hoboken, NJ.
4. Binary Logistic Regression, Training in Quantitative Psychology at UNL, Courses in Research Methods, Design & Data Analysis.