Prediction Study on the AddHealth Dataset
James Mason, Joao Carreira, Tomofumi Ogawa, Yannik Pitcan
Abstract
The National Longitudinal Study of Adolescent to Adult
Health (AddHealth) is a longitudinal study of a nation-
ally representative sample of adolescents. The study
was conducted in 4 different waves, the earliest of which
occurred in 1994-1995 (grades 7-12) and the most re-
cent in 2008 (ages 24-32). In each wave an in-home in-
terview was conducted with the goal of understanding
what forces and behaviors may influence adolescents’
health during the lifecourse.
The public-use dataset contains about 4-6K respondents in each wave, and each respondent was surveyed on 2-3K items.
In this work, we leverage the AddHealth study to build a model that predicts depression from the information collected about participants during their adolescence.
There are several challenges involved in building such
a model. First, the ratio between participants and
survey variables is low (around 6K participants for 2-
3K variables). Second, the survey results are sparse –
many items are left unanswered. Third, as is normal
in a longitudinal study such as this one, some partici-
pants were lost in the later waves of the study. Last, in
the study dataset answers are represented with numer-
ical values. However, for many of the 2-3K items these
values don’t have a numerical meaning (e.g., value 4
means “refused to answer”).
In this report we apply different statistical and ma-
chine learning techniques to predict depression and ad-
dress these challenges.
1 Introduction
AddHealth sampled students in grades 7-12 in 1994-95
in Wave I, then conducted three additional waves of
follow-up. Wave II was in 1996, Wave III was in 2001-
02, and Wave IV was in 2008-09. An additional wave
is planned for 2016-18.
The public-use dataset contains about 4-6K respondents, depending on the wave, with about 2-3K variables collected in each. This dataset is survey-
sampled, with some demographic groups oversampled
compared to others. We ignore the external validity
issues presented by this (and thus do not use the sample
weights implied by the sampling design).
Our central interest is prediction. Since the environ-
ment of each person in their adolescence and young-
adulthood might affect their mental health as adults,
we decided to use the variables collected in Wave I to
predict a depression score in Wave IV. The depression
data in Wave IV are a series of categorical frequency
indicators (0=“never” through 3=“most of the time”)
in the “Social Psychology and Mental Health” section;
hence we propose a depression score constructed from
these responses.
One of the major potential problems we might face
in the analysis is that the number of possible predic-
tors in Wave I overwhelms the sample size available
to us; it will therefore be necessary to select the best
predictors to use in our analysis. Another issue has to
do with the problem of doing predictions when many
respondents may be missing data on some fraction of
the large number of predictor variables.
We have decided to approach this problem in the
following way. First, we chose a set of prediction tech-
niques that we thought were promising for a dataset
with these characteristics. Some of the prediction
methods were OLS, Random Forests and Support Vec-
tor Regression. Second, because some of the methods
(e.g., OLS) assume linearity of variables we one-hot en-
coded our input data set. Third, to reduce the number
of predictor variables we applied two dimensionality
reduction techniques: Random Projection and Princi-
pal Component Analysis. Fourth, because these meth-
ods are sensitive to the value of their hyperparameters
we trained these models with cross-validation. Finally,
because we are concerned with how well we are able
to predict depression in new/unseen data we split our
dataset into two sets: a training set and a test set.
We report the error of predictions when our models
(trained with the training set) are used to predict de-
pression of people in the test set.
This report is organized in the following way. In
Section 2 we present in detail the dataset we analyzed
and discuss some of the challenges of analyzing such a
dataset. In Section 3 we describe how we prepared the
dataset for analysis using prediction models. In Sec-
tion 4 we present the methodology we used to build our
prediction models. In Section 5 we compare the qual-
1
2. ity of the predictions from the various models. Finally,
in Section 6 we summarize our work.
2 Dataset
The data are split into four waves.
Wave I consisted of two stages. Stage 1 was a strat-
ified, random sample of all high schools in the United
States. Stage 2 was an in-home sample of 27,000
teenagers consisting of core samples from each commu-
nity. Some over-samples were also selected. In other
words, an adolescent could qualify for multiple sam-
ples. Wave II was identical to the Wave I sample with a few exceptions: (a) respondents who were in the 12th grade at Wave I and not part of the genetic sample were excluded, (b) respondents only in the Wave I disabled sample were excluded, and (c) 65 adolescents who were in the genetic sample but not interviewed in Wave I were interviewed in Wave II.
If Wave I respondents could be found and inter-
viewed again after six years, they were in the Wave III
sample. Urine and saliva samples were also collected
at this time.
Finally, the Wave IV sample consisted of all original
Wave I respondents. Readings of vitals were also taken.
There were some missing (NA) values in the data after collection. The Wave I dataset had the highest missingness, with over five percent of its values being NAs.
3 Data Processing
We wish to predict depression of participants 10 to 15
years after they have first participated in the survey.
In AddHealth, depression is measured using ten items,
which respondents answer using a four-point frequency
scale, as described in Section 3.2. The responses to
these ten items are then summarized into a single de-
pression score.
To predict depression we performed the following study. First, we preprocessed the data to 1) filter out individuals who were lost to follow-up during the study, 2) clean the data, and 3) generate for each Wave IV participant a general mental health score that we aim to predict.
Second, we built a linear prediction model regressed
on an expanded design matrix. In this expanded ma-
trix each predictor variable has a binary value indicat-
ing whether for a specific survey item the participant
chose a specific answer. Third, we used cross-validation
to estimate the hyperparameters used by the model to
provide the smallest prediction error.
Analysis was conducted using R [7] and Python. In
the rest of the section we explain each step in more
detail and explain the reasoning behind our approach.
3.1 Data Cleaning
Before building a prediction model we cleaned the raw
AddHealth dataset. The main cleaning tasks were filtering out individuals lost to follow-up and removing survey variables unimportant to our prediction task.
To address the first problem, loss to follow-up, we filtered out all participants who took part in Wave I of the study but not in Wave IV. We intend to analyze this subset of the population
in order to understand if they differ significantly from
the other participants.
Second, some variables in the dataset are likely to
not be important for prediction purposes. For instance,
the survey variables IMONTH, IDAY, and IYEAR indicate
the month, day and year the survey was conducted,
respectively. Such variables were removed and ignored
in the context of our analysis.
3.2 Depression Scores
Our outcome measure is depression, as measured in
Wave IV. The “Social Psychology and Mental Health”
section of Wave IV contains ten items, H4MH18 through
H4MH27, measuring depression. Most of these items
were taken from the Center for Epidemiologic Studies
Depression Scale (CES-D) [8]. These ask “How often was the following true during the past seven days”, with a rating scale of 0=“never” through 3=“most of the time”. The items were:
• “You were bothered by things that usually don’t
bother you”
• “You could not shake off the blues, even with help
from your family and your friends”
• “You felt you were just as good as other people” *
• “You had trouble keeping your mind on what you
were doing”
• “You felt depressed”
• “You felt that you were too tired to do things”
• “You felt happy” *
• “You enjoyed life” *
• “You felt sad”
• “You felt that people disliked you”
Three items (indicated by *) were positive, and were
reverse-coded before analysis (i.e., 3=“never” through
0=“most of the time”).
To generate a single outcome which could be modeled as a continuous variable, there are three commonly used methods. First, we could construct sum-scores by adding the codes (0-3) from each item, yielding a score range of 0-30. This would treat each item equally, and treat the spaces between the four levels within each item equally. Second, we could
could conduct an Exploratory Factor Analysis (EFA)
and generate factor scores for a single Principal Factor.
This would weight each item differently, but still treat
the spaces between the four levels within those items
equally. Third, we could fit an Item-Response Theory
(IRT) model, which is a kind of latent variable model,
and use the latent variable scores.
We chose to use IRT scores, because their distribution was closer to Normal than the sum-score or EFA distributions. Specifically, we fit Masters’ Partial Credit Model [5] using the TAM [4] package:
$$P(X_{pi} = c) = \frac{\exp\left(\sum_{k=0}^{c} (\theta_p - \delta_{ik})\right)}{\sum_{l=0}^{l_i} \exp\left(\sum_{k=0}^{l} (\theta_p - \delta_{ik})\right)}$$

where $X_{pi} \in \{0, 1, 2, 3\}$ is the response of person $p$ to item $i$, $\theta_p$ is the (latent) depression of person $p$, $\delta_{ik}$ is how high on the depression scale a person must be to answer item $i$ at level $k$ rather than $k - 1$, and $l_i$ is the maximum level for item $i$ (always 3).
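As a concrete illustration, here is a minimal Python sketch of the PCM response probabilities defined above (our actual estimation used the TAM package in R; the step difficulties below are hypothetical):

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial Credit Model response probabilities for one item.

    theta  : latent depression score of the person.
    deltas : step difficulties [delta_i1, ..., delta_i3]
             (delta_i0 is fixed at 0 by convention).
    Returns P(X = c) for c = 0, ..., len(deltas).
    """
    deltas = np.concatenate([[0.0], np.asarray(deltas)])  # prepend delta_i0 = 0
    # Cumulative sums of (theta - delta_ik) over k = 0..c give the numerator exponents
    logits = np.cumsum(theta - deltas)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    return probs / probs.sum()

# Example: a person at theta = 1.0 on an item with hypothetical step difficulties
print(pcm_probs(1.0, [-0.5, 0.3, 1.2]))
```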
We estimated the model parameters (the item parameters and the parameters of the $\theta$ distribution) using Marginal Maximum Likelihood. We predicted depression scores $\theta_p$ for each individual using Empirical Bayes: Expected a-Posteriori (EAP), with a $N(0, \sigma^2)$ prior.
Although EAP can generate scores for individuals
with missing data on the indicators, we chose not to
generate depression scores for those who answered fewer
than seven of the ten items. This resulted in the loss
of a single observation.
3.3 Predicting on categorical variables
There are 6504 participants and 2794 variables in
Wave I. Most of these variables are categorical,
whereas others are numerical. For instance, item 19 in
Wave I of the study is “(During the past seven days:)
You could not shake off the blues, even with help
from your family and your friends” with the following
answers:
• 0 – “never or rarely”
• 1 – “sometimes”
• 2 – “a lot of the time”
• 3 – “most of the time or all of the time”
• 4 – “refused”
For categorical variables like this we cannot use the value of the survey response directly in a linear model, because the values do not lie on a meaningful numerical scale. The answer “refused” is not indicative of more depression than the answer “sometimes”.
To address this problem, we decided to expand each
variable in the original dataset into a set of binary val-
ues, one for each possible response to the corresponding
survey item (one-hot encoding). For this item, we generated 5 variables, one for each of the 5 answers. This way we are able to pull apart the contribution of each particular answer to the final prediction.
This procedure expanded our 2794 predictor vari-
ables into 20086 variables.
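As an illustration of this expansion, here is a minimal sketch using pandas; the variable name and response codes are hypothetical stand-ins for a real survey item:

```python
import pandas as pd

# Toy slice of a Wave I item: the numeric codes are category labels, not quantities
df = pd.DataFrame({"ITEM19": [0, 1, 4, 2]})  # 4 = "refused"

# Treat each code as a category and expand into one binary column per answer
onehot = pd.get_dummies(df["ITEM19"].astype("category"), prefix="ITEM19")
print(onehot)  # columns ITEM19_0, ITEM19_1, ITEM19_2, ITEM19_4
```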
4 Prediction
4.1 Reducing number of predictor vari-
ables
The number of predictor variables largely exceeds the
number of observations. This is an issue especially when we use regression-based models such as OLS, and it is exacerbated by one-hot encoding. Thus, reducing the number of predictors is necessary.
To address this problem we have tried two ap-
proaches: Principal Component Analysis (PCA) and
Random Projection.
To perform PCA on the one-hot encoded matrix we
first computed the set of principal components that
accounts for more than 80% of the variance in the
dataset. This step resulted in 658 principal compo-
nents. We then projected our design matrix onto the
subspace spanned by these principal components.
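A sketch of this step with scikit-learn, under the assumption that the one-hot encoded design matrix is available as a dense array (a random stand-in is used here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_onehot = rng.integers(0, 2, size=(500, 2000)).astype(float)  # stand-in design matrix

# Keep the smallest number of components whose explained variance exceeds 80%;
# on the real data this step yielded 658 components.
pca = PCA(n_components=0.80, svd_solver="full")
X_pca = pca.fit_transform(X_onehot)
print(X_pca.shape, pca.explained_variance_ratio_.sum())
```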
Alternatively to the PCA approach, we have decided
to project the one-hot encoded design matrix into a
subspace spanned by random vectors (Random Pro-
jection). By the Johnson-Lindenstrauss lemma [9] we
know that when projecting our data vectors into a
lower-dimensional space the distance between points
is approximately preserved with high probability. This
approach has the advantage of being computationally
faster than other dimensionality reduction alternatives
(e.g., PCA).
To perform this projection we did the following.
First, we created a $D \times p$ projection matrix $V$, where $D$ is a parameter chosen by the user, $p$ is the number of predictor variables in the expanded matrix, and each entry $V_{ij}$ is a standard normal variable, $V_{ij} \sim N(0, 1)$.
Then, we projected our data vectors onto the space spanned by the rows of $V$, i.e., we computed

$$X_{\text{transformed}} = X V^{\top}.$$

The $n \times D$ projected matrix $X_{\text{transformed}}$ was used with several prediction methods explained in detail in the next subsections.
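A minimal NumPy sketch of this projection, following the definition above (stand-in data, not the exact code we ran):

```python
import numpy as np

def random_projection(X, D, seed=0):
    """Project the n x p matrix X onto D random directions: X @ V.T."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((D, X.shape[1]))  # D x p, entries ~ N(0, 1)
    return X @ V.T                            # n x D projected matrix

X = np.random.default_rng(1).standard_normal((100, 20086))  # stand-in design matrix
X_transformed = random_projection(X, D=150)
print(X_transformed.shape)  # (100, 150)
```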
4.2 Regression Based Model
We employed three regression-based models to predict
depression scores: 1) OLS, 2) the preshrunk estimator suggested by Copas [2], and 3) a preshrunk estimator with cross-validated shrinkage factor $\hat{K}$. Note that all three models require the number of predictor variables to be smaller than the number of observations, and hence require a reduction of the predictor variables.
For each of the three models, PCA was performed
where the subset of principal components was chosen
to explain 80% of total variance.
In addition, for each of the three, Random Projection was used with the size of the projected matrix chosen by 5-fold cross-validation. We chose candidate sizes $D$ beforehand: 5, 10, 25, 50, 100, 150, and 200 for the preshrunk models, with 500, 1000, and 2000 added for OLS. For each candidate $D$ we produced the projected matrix $X_{\text{transformed}}$, split the training data into 5 folds, and for each fold fit a model on the other 4 folds and predicted the depression scores in the held-out fold. In other words, each fold was used 4 times to fit models for the other folds and was held out exactly once to obtain its own predicted scores. From these held-out predictions we computed the mean prediction error on the training set and chose the best size $D_{\text{best}}$.
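A sketch of this cross-validation loop for the OLS case, with stand-in data; the projection and the candidate sizes follow the description above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5000))  # stand-in for the one-hot design matrix
y = rng.standard_normal(200)          # stand-in for the IRT depression scores

def cv_rmspe(D):
    """5-fold CV estimate of OLS prediction error after projecting to D dimensions."""
    V = rng.standard_normal((D, X.shape[1]))  # random projection matrix
    Xt = X @ V.T
    errs = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(Xt):
        ols = LinearRegression().fit(Xt[tr], y[tr])
        errs.append(np.sqrt(np.mean((y[te] - ols.predict(Xt[te])) ** 2)))
    return np.mean(errs)

candidates = [5, 10, 25, 50, 100, 150, 200, 500, 1000, 2000]
D_best = min(candidates, key=cv_rmspe)
print(D_best)
```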
4.3 Tree-based prediction
Prediction methods based on OLS have two limita-
tions. First, OLS cannot (directly) take into account
interactions between predictor variables. Second, OLS
imposes a strict condition on the ratio between observations and predictor variables.
Tree-based methods have been successfully used in
classification and regression settings and do not suffer
from these problems. Because of this, we have used two
prediction methods based on decision trees: random
forest regression and XGBoost regression.
4.3.1 Random Forests
Random forests is an ensemble method used for classifi-
cation and regression. Because we are trying to predict
a continuous value (depression) we focus on regression.
Random forest regression works by constructing mul-
tiple decision trees at training time and outputting the
mean of the values predicted by each tree.
We used the sklearn.ensemble.RandomForestRegressor Python class for this task. Because we applied this method to the one-hot encoded design matrix, each decision tree built during this process performs a simple “Yes/No” split on a random subset of the predictor variables.
Because training of Random Forests can be compu-
tationally expensive we made use of 10 cores to speed
up this process.
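A sketch of this setup with scikit-learn (stand-in data; the hyperparameter values shown are illustrative, the actual values were chosen by cross-validation as described in Section 5):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 1000)).astype(float)  # stand-in one-hot matrix
y = rng.standard_normal(300)                            # stand-in depression scores

# n_jobs=10 trains trees on 10 cores in parallel, as in our setup
rf = RandomForestRegressor(n_estimators=1000, max_depth=50, n_jobs=10, random_state=0)
rf.fit(X, y)
print(rf.predict(X[:5]))  # each prediction is the mean over the 1000 trees
```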
4.3.2 XGBoost
XGBoost [1] is a gradient tree boosting library used for
prediction. Boosted trees methods work by using en-
sembles of decision trees to build predictions. However,
unlike random forests, these methods build ensembles
of decision trees in an incremental fashion, minimiz-
ing the residual errors at each new decision tree. XG-
Boost in particular is an efficient library for performing
boosted trees prediction.
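A sketch using XGBoost's scikit-learn interface (stand-in data; the hyperparameter values shown are illustrative, ours were cross-validated as described in Section 5):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 1000)).astype(float)  # stand-in one-hot matrix
y = rng.standard_normal(300)                            # stand-in depression scores

# Trees are added incrementally, each one fitting the residual errors of the ensemble
model = xgb.XGBRegressor(n_estimators=200, max_depth=5, min_child_weight=3)
model.fit(X, y)
print(model.predict(X[:5]))
```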
4.4 Support Vector Regression
Support Vector Regression (SVR) is a prediction
method that casts problems into a convex optimiza-
tion framework of the form:
$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\|w\|^2 \\
\text{subject to} \quad & y_i - \langle w, x_i \rangle - b \le \varepsilon \\
& \langle w, x_i \rangle + b - y_i \le \varepsilon
\end{aligned}$$

This formulation can be used to find a hyperplane that approximates the data with an error of at most $\varepsilon$ for each data point.
We applied the SVR implementation in the scikit-learn Python library [6] to our prediction problem. Because we use a linear kernel, we decided to use the sklearn.svm.LinearSVR Python class. We found this class to perform significantly faster than the sklearn.svm.SVR class, because it is built on top of the LIBLINEAR [3] library rather than the slower LIBSVM.
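A sketch of this setup (stand-in data; the epsilon and C values shown are illustrative, ours were cross-validated, see Table 3):

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 1000))  # stand-in design matrix
y = rng.standard_normal(300)          # stand-in depression scores

# LinearSVR uses LIBLINEAR, which scales much better than kernel SVR here
svr = LinearSVR(epsilon=0.1, C=0.01, max_iter=10000)
svr.fit(X, y)
print(svr.predict(X[:5]))
```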
4.5 Prediction Models Used
From the strategies above, we selected nine prediction
models to evaluate:
1. Random projection (number of dimensions deter-
mined by cross-validation) followed by OLS,
2. Random projection (number of dimensions deter-
mined by cross-validation) followed by Copas’ pre-
shrunk regression,
3. Random projection (number of dimensions deter-
mined by cross-validation) followed by pre-shrunk
regression where the shrinkage factor was deter-
mined by cross-validation,
4. Principal Component Analysis followed by OLS,
5. Principal Component Analysis followed by Copas’
pre-shrunk regression,
6. Principal Component Analysis followed by pre-
shrunk regression where the shrinkage factor was
determined by cross-validation,
7. Random Forest Regression (using the original fea-
ture matrix) with maximum depth and number of
trees determined by cross-validation,
8. XGBoost Regression (using the original feature
matrix) with maximum depth and minimum child
weight determined by cross-validation,
9. Support Vector Regression (using the original
feature matrix) with C determined by cross-
validation.
In addition, we implemented a random predictor,
which predicted the depression score for an individ-
ual by randomly selecting another individual’s score.
This provides a baseline root mean squared prediction error to which the other models can be compared.
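One way to implement such a baseline is the following sketch, which predicts each test score by drawing a random score from the training set (the exact sampling scheme here is an assumption):

```python
import numpy as np

def baseline_rmspe(y_train, y_test, seed=0):
    """RMSPE when each test score is predicted by a randomly drawn training score."""
    rng = np.random.default_rng(seed)
    y_pred = rng.choice(y_train, size=len(y_test), replace=True)
    return np.sqrt(np.mean((y_test - y_pred) ** 2))

# Example with stand-in scores
rng = np.random.default_rng(1)
print(baseline_rmspe(rng.standard_normal(1000), rng.standard_normal(200)))
```

For independent predictions and targets, the expected squared error is twice the score variance; with scores of SD about 1.06 (Section 5.1), the expected baseline RMSPE is about 1.06 × √2 ≈ 1.50, consistent with the 1.508 we observed.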
5 Results
5.1 IRT Scores
The range of IRT scores was from about -1.90 to 4.21,
with a mean of about 0 (by design) and a standard de-
viation of about 1.06. The distribution of IRT scores is
presented in Figure 1. Note that since the IRT scores
have a SD of about 1, when evaluating root mean
squared prediction errors, the errors are essentially in
standard deviation units.
5.2 Cross Validation
We used 5-fold cross-validation to determine the best
value for the hyperparameters of the following models:
regression-based, tree-based and support vector regres-
sion.
[Figure 1: Distribution of IRT scores for depression (histogram; x-axis: Depression Scores (IRT); y-axis: Frequency)]
Regression When using Random Projection we
used CV to determine the best choice of D, the number
of dimensions of projection. Since the optimal choice
may vary based on the type of prediction model, this
parameter was independently cross-validated for for
OLS, OLS with Copas shrinkage, and OLS with cross-
validated shrinkage. The candidate values for D, along
with associated root mean square prediction errors are
listed in Table ??.
For OLS, $D_{\text{best}}$ was 100, whereas when Copas or cross-validated shrinkage was used, $D_{\text{best}}$ was 150. These values of $D_{\text{best}}$ were used in subsequent analyses.
Note that over-fitting is a problem to be concerned
with: in OLS with D = 2000, the root mean square
prediction error was almost twice as large as with the
optimal D = 100.
Tree-based methods We used cross-validation
with random forest regression and XGBoost.
For random forest regression we tried the parame-
ters and corresponding values shown in Table 1. For
XGBoost regression we tried the parameters and corresponding values in Table 2.
Parameter      Description       Values
max_depth      Max tree depth    10, 50, 100, 200
n_estimators   Number of trees   100, 1000, 1500, 2000, 4000, 8000

Table 1: Parameters and values used for cross-validation with Random Forest
Support Vector Regression We have cross-
validated our model using the parameters and respec-
tive values presented in Table 3.
Parameter          Description                                        Values
max_depth          Max tree depth                                     1, 3, 5, 7, 9
min_child_weight   Minimum number of observations in each leaf node   1, 3, 5, 7, 9

Table 2: Parameters and values used for cross-validation with XGBoost
Parameter   Description                                   Values
epsilon     Error margin                                  0.01, 0.1, 0.5, 1, 2, 4, 8, 16
C           Penalty given to data points far from the
            optimal hyperplane                            1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 0.01, 0.1, 1, 10, 100

Table 3: Parameters and values used for cross-validation with SVR
5.3 Prediction Quality
The prediction results from each prediction method,
using the best settings as determined by cross-
validation, are presented in Table ??.
Considering root mean squared prediction error
(RMSPE), the tree-based and support-vectors models
worked better than the PCA-based models, which in
turn worked better than the random-projection-based
models.
Among the former, the best predictions were from
Random Forest regression with RMSPE of 0.970, fol-
lowed closely by XGBoost regression with RMSPE of
0.975. Support Vector regression fared slightly worse,
with RMSPE of 0.991.
Among the PCA models, OLS with Copas shrinkage
did the best, with RMSPE of 0.996, followed by OLS
with cross-validated shrinkage factor, with RMSPE of
1.002, and OLS with no shrinkage, with RMSPE of
1.010.
Among Random Projection models, OLS with Co-
pas shrinkage did the best with RMSPE of 1.033, fol-
lowed by OLS with cross-validated shrinkage factor,
with RMSPE of 1.042, and OLS with no shrinkage,
with RMSPE of 1.049.
However, all of these models performed better than
the baseline RMSPE of 1.508 provided by random pre-
diction.
6 Conclusion
In this report we presented a methodology for the pre-
diction of depression in a sparse dataset from an ado-
lescent health study.
We identified challenges in doing prediction using
this dataset and addressed them with a combination
of feature reduction, statistical and machine learning
techniques.
All the prediction methods we used performed better than a random predictor. The machine learning techniques achieved slightly better predictions than the methods based on OLS (best RMSPE of 0.970 vs. 0.996). This might result from their ability to capture interactions between different variables.
Interestingly, the predictions after Random Projec-
tion were in the ballpark of the RMSPE of other predic-
tions without dimensionality reduction. This indicates
that Random Projection can work in practice.
The AddHealth dataset was not constructed with
prediction in mind, but rather taking a snapshot of
the state of adolescent health in the US. We believe
a dataset built for prediction would have taken into account from the outset some of the problems we addressed: the lack of observations, the format of the dataset, and the construction of depression scores. Despite these challenges we were able to build a predictor of depression scores with an RMSPE of about one standard deviation, significantly better than a random predictor.
References
[1] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016.
[2] J. B. Copas. Regression, prediction and shrinkage.
Journal of the Royal Statistical Society. Series B
(Methodological), 1983.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang,
and C.-J. Lin. LIBLINEAR: A library for large
linear classification. Journal of Machine Learning
Research, 9:1871–1874, 2008.
[4] T. Kiefer, A. Robitzsch, and M. Wu. TAM: Test
Analysis Modules, 2016. R package version 1.16-0.
[5] G. N. Masters. A Rasch model for partial credit scoring. Psychometrika, 47(2):149–174, June 1982.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[7] R Core Team. R: A Language and Environment for
Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria, 2015.
[8] L. S. Radloff. The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1(3):385–401, June 1977.
[9] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Conference in Modern Analysis and Probability, 1984.