Prediction Study on the AddHealth Dataset
James Mason, Joao Carreira, Tomofumi Ogawa, Yannik Pitcan
Abstract
The National Longitudinal Study of Adolescent to Adult
Health (AddHealth) is a longitudinal study of a nation-
ally representative sample of adolescents. The study
was conducted in 4 different waves, the earliest of which
occurred in 1994-1995 (grades 7-12) and the most re-
cent in 2008 (ages 24-32). In each wave an in-home in-
terview was conducted with the goal of understanding
what forces and behaviors may influence adolescents’
health during the lifecourse.
The public-use dataset contains about 4-6K respondents in each wave, and each respondent was surveyed on 2-3K items.
In this work, we leverage the information in the Ad-
dHealth study to build a prediction model for depres-
sion from the information we have from participants
in their adolescent period.
There are several challenges involved in building such
a model. First, the ratio between participants and
survey variables is low (around 6K participants for 2-
3K variables). Second, the survey results are sparse –
many items are left unanswered. Third, as is normal
in a longitudinal study such as this one, some partici-
pants were lost in the later waves of the study. Last, in
the study dataset answers are represented with numer-
ical values. However, for many of the 2-3K items these
values don’t have a numerical meaning (e.g., value 4
means “refused to answer”).
In this report we apply different statistical and ma-
chine learning techniques to predict depression and ad-
dress these challenges.
1 Introduction
AddHealth sampled students in grades 7-12 in 1994-95
in Wave I, then conducted three additional waves of
follow-up. Wave II was in 1996, Wave III was in 2001-
02, and Wave IV was in 2008-09. An additional wave
is planned for 2016-18.
The public-use dataset contains about 4-6K respondents, depending on the wave, with about 2-
3K variables collected in each. This dataset is survey-
sampled, with some demographic groups oversampled
compared to others. We ignore the external validity
issues presented by this (and thus do not use the sample
weights implied by the sampling design).
Our central interest is prediction. Since the environ-
ment of each person in their adolescence and young-
adulthood might affect their mental health as adults,
we decided to use the variables collected in Wave I to
predict a depression score in Wave IV. The depression
data in Wave IV are a series of categorical frequency
indicators (0=“never” through 3=“most of the time”)
in the “Social Psychology and Mental Health” section;
hence we propose a depression score constructed from
these responses.
One of the major potential problems we might face
in the analysis is that the number of possible predic-
tors in Wave I overwhelms the sample size available to us; therefore it is necessary to select the best predictors to use in our analysis. Another issue has to
do with the problem of doing predictions when many
respondents may be missing data on some fraction of
the large number of predictor variables.
We have decided to approach this problem in the
following way. First, we chose a set of prediction tech-
niques that we thought were promising for a dataset
with these characteristics. Some of the prediction
methods were OLS, Random Forests and Support Vec-
tor Regression. Second, because some of the methods
(e.g., OLS) assume linearity of variables we one-hot en-
coded our input data set. Third, to reduce the number
of predictor variables we applied two dimensionality
reduction techniques: Random Projection and Princi-
pal Component Analysis. Fourth, because these meth-
ods are sensitive to the value of their hyperparameters
we trained these models with cross-validation. Finally,
because we are concerned with how well we are able
to predict depression in new/unseen data we split our
dataset into two sets: a training set and a test set.
We report the error of predictions when our models
(trained with the training set) are used to predict de-
pression of people in the test set.
This report is organized in the following way. In
Section 2 we present in detail the dataset we analyzed
and discuss some of the challenges of analyzing such a
dataset. In Section 3 we describe how we prepared the
dataset for analysis using prediction models. In Sec-
tion 4 we present the methodology we used to build our
prediction models. In Section 5 we compare the quality of the predictions from the various models. Finally,
in Section 6 we summarize our work.
2 Dataset
The data are split into four waves.
Wave I consisted of two stages. Stage 1 was a strat-
ified, random sample of all high schools in the United
States. Stage 2 was an in-home sample of 27,000
teenagers consisting of core samples from each commu-
nity. Some over-samples were also selected. In other
words, an adolescent could qualify for multiple sam-
ples. The Wave II sample was identical to the Wave I sample with a few exceptions: (a) it excluded those who were in the 12th grade at Wave I and not part of the genetic sample, (b) it excluded respondents who were only in the Wave I disabled sample, and (c) it added 65 adolescents who were in the genetic sample but not interviewed in Wave I.
If Wave I respondents could be found and inter-
viewed again after six years, they were in the Wave III
sample. Urine and saliva samples were also collected
at this time.
Finally, the Wave IV sample consisted of all original
Wave I respondents. Readings of vitals were also taken.
There were a few NA values in the data after collection. The Wave I dataset had the highest proportion, with over five percent of its values missing.
3 Data Processing
We wish to predict depression of participants 10 to 15
years after they have first participated in the survey.
In AddHealth, depression is measured using ten items,
which respondents answer using a four-point frequency
scale, as described in Section 3.2. The responses to
these ten items are then summarized into a single de-
pression score.
To predict depression we performed the following study. First, we preprocessed the data: 1) we filtered out individuals that were lost to follow-up during the study, 2) we cleaned the data, and 3) we generated for each Wave IV participant a general mental health score that we aim to predict.
Second, we built a linear prediction model regressed
on an expanded design matrix. In this expanded ma-
trix each predictor variable has a binary value indicat-
ing whether for a specific survey item the participant
chose a specific answer. Third, we used cross-validation
to estimate the hyperparameters used by the model to
provide the smallest prediction error.
Analysis was conducted using R [7] and Python. In
the rest of the section we explain each step in more
detail and explain the reasoning behind our approach.
3.1 Data Processing
Before building a prediction model we cleaned the raw
AddHealth dataset. The main cleaning steps were filtering out individuals lost to follow-up and removing survey variables unimportant to our prediction task.
To address the first problem, loss to follow-up, we filtered out all participants who took part in Wave I of the study but not in Wave
IV. We intend to analyze this subset of the population
in order to understand if they differ significantly from
the other participants.
Second, some variables in the dataset are likely to
not be important for prediction purposes. For instance,
the survey variables IMONTH, IDAY, and IYEAR indicate
the month, day and year the survey was conducted,
respectively. Such variables were removed and ignored
in the context of our analysis.
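This cleaning step can be sketched with pandas. The dataframe below is a toy stand-in, and the substantive column name is illustrative; only IMONTH, IDAY, and IYEAR are the administrative variables named above.

```python
import pandas as pd

# Toy stand-in for the Wave I table; the substantive column name is illustrative.
wave1 = pd.DataFrame({
    "IMONTH": [5, 6], "IDAY": [12, 3], "IYEAR": [95, 95],
    "H1GI1M": [1, 2],  # an example substantive survey item
})

# Drop administrative variables that carry no predictive signal.
ADMIN_VARS = ["IMONTH", "IDAY", "IYEAR"]
wave1 = wave1.drop(columns=ADMIN_VARS)
```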
3.2 Depression Scores
Our outcome measure is depression, as measured in
Wave IV. The “Social Psychology and Mental Health”
section of Wave IV contains ten items, H4MH18 through
H4MH27, measuring depression. Most of these items
were taken from the Center for Epidemiologic Studies
Depression Scale (CES-D) [8]. These ask "How often was the following true during the past seven days" with a rating scale of 0="never" through 3="most of the time". The items were:
• “You were bothered by things that usually don’t
bother you”
• “You could not shake off the blues, even with help
from your family and your friends”
• "You felt you were just as good as other people" *
• “You had trouble keeping your mind on what you
were doing”
• “You felt depressed”
• “You felt that you were too tired to do things”
• “You felt happy” *
• “You enjoyed life” *
• “You felt sad”
• “You felt that people disliked you”
Three items (indicated by *) were positive, and were
reverse-coded before analysis (i.e., 3=“never” through
0=“most of the time”).
To generate a single outcome which could be
modeled as a continuous outcome, there are three
commonly-used methods. First, we could construct sum-scores by adding the codes (0:3) from each item, yielding a score range of 0:30. This would treat each item equally, and treat the spaces between the four levels within each item equally. Second, we
could conduct an Exploratory Factor Analysis (EFA)
and generate factor scores for a single Principal Factor.
This would weight each item differently, but still treat
the spaces between the four levels within those items
equally. Third, we could fit an Item-Response Theory
(IRT) model, which is a kind of latent variable model,
and use the latent variable scores.
We chose to use IRT scores, because their distribu-
tion was closer to Normal than sum-scores or EFA
scores. Specifically, we fit Masters’ Partial Credit
Model [5] using the TAM [4] package.
$$P(X_{pi} = c) = \frac{\exp\left(\sum_{k=0}^{c} (\theta_p - \delta_{ik})\right)}{\sum_{l=0}^{l_i} \exp\left(\sum_{k=0}^{l} (\theta_p - \delta_{ik})\right)}$$
where Xpi ∈ {0, 1, 2, 3} is the response of person p
to item i, θp is the (latent) depression of person p, δik
is how high on the depression scale a person must be
to answer item i at level k rather than k − 1, and li is
the maximum level for item i (always 3).
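As a concrete illustration of the model above, the category probabilities for a single item can be computed directly. This is an illustrative numpy sketch (using the usual convention that the empty sum for c = 0 is zero), not the TAM estimation code.

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial Credit Model category probabilities for one item.

    theta  : latent depression score theta_p of a person.
    deltas : step parameters delta_i1..delta_i3 for a 4-category item.
    Returns P(X_pi = c) for c = 0..3.
    """
    # Numerators exp(sum_{k<=c}(theta - delta_ik)); the empty sum (c = 0) is 0.
    logits = np.cumsum(np.concatenate(([0.0], theta - np.asarray(deltas))))
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    return probs / probs.sum()

p = pcm_probs(theta=0.5, deltas=[-1.0, 0.0, 1.0])  # probabilities for c = 0..3
```

A higher latent score θ shifts probability mass toward the higher response categories, as the model intends.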
We estimated the model parameters (item param-
eters, and parameters of the θ distribution) using
Marginal Maximum Likelihood. We predicted depression scores θp for each individual using Empirical Bayes: Expected a-Posteriori (EAP), with a N(0, σ²) prior.
Although EAP can generate scores for individuals
with missing data on the indicators, we chose not to generate depression scores for those who answered fewer than seven of the ten items. This resulted in the loss of a single observation.
3.3 Predicting on categorical variables
There are 6504 participants and 2794 variables in
Wave I. Most of these variables are categorical,
whereas others are numerical. For instance, item 19 in
Wave I of the study is “(During the past seven days:)
You could not shake off the blues, even with help
from your family and your friends” with the following
answers:
• 0 – "never or rarely"
• 1 – ”sometimes”
• 2 – ”a lot of the time”
• 3 – ”most of the time or all of the time”
• 4 – ”refused”
For categorical variables like this we cannot use the value of the survey response directly in a linear model, because the codes have no linear meaning: the answer 'refused' is not indicative of more depression than the answer 'sometimes'.
To address this problem, we decided to expand each
variable in the original dataset into a set of binary val-
ues, one for each possible response to the corresponding
survey item (one-hot encoding). For this item, we gen-
erated 5 variables, one for each of the 5 answers. This way we are able to pull apart the contribution of each particular answer to the final prediction.
This procedure expanded our 2794 predictor vari-
ables into 20086 variables.
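The expansion can be sketched with pandas; the item name below is illustrative, not the actual AddHealth variable name.

```python
import pandas as pd

# Toy responses to one categorical item (codes 0-4, with 4 = "refused");
# the column name is illustrative.
responses = pd.DataFrame({"ITEM19": [0, 1, 4, 2, 3]})

# One-hot encode: one binary column per distinct answer code.
expanded = pd.get_dummies(responses, columns=["ITEM19"], prefix="ITEM19")
# -> columns ITEM19_0 .. ITEM19_4, one per possible answer
```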
4 Prediction
4.1 Reducing number of predictor vari-
ables
The number of predictor variables largely exceeds the number of observations. This is an issue especially for regression-based models such as OLS, and it is exacerbated by one-hot encoding. Thus, reducing the
number of predictors is necessary.
To address this problem we have tried two ap-
proaches: Principal Component Analysis (PCA) and
Random Projection.
To perform PCA on the one-hot encoded matrix we
first computed the set of principal components that
accounts for more than 80% of the variance in the
dataset. This step resulted in 658 principal compo-
nents. We then projected our design matrix onto the
subspace spanned by these principal components.
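This two-step procedure corresponds to scikit-learn's PCA with a variance-fraction target; the matrix below is a random stand-in for the one-hot design matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # stand-in for the one-hot encoded design matrix

# A float n_components asks PCA for the smallest number of principal
# components explaining at least that fraction of the variance.
pca = PCA(n_components=0.80)
X_reduced = pca.fit_transform(X)  # rows projected onto the retained components
```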
Alternatively to the PCA approach, we have decided
to project the one-hot encoded design matrix into a
subspace spanned by random vectors (Random Pro-
jection). By the Johnson-Lindenstrauss lemma [9] we
know that when projecting our data vectors into a
lower-dimensional space the distance between points
is approximately preserved with high probability. This
approach has the advantage of being computationally
faster than other dimensionality reduction alternatives
(e.g., PCA).
To perform this projection we did the following.
First, we created a D × p projection matrix V where
D is a parameter chosen by users, p is the number of
predictors variables in the expanded matrix, and each
entry Vij is a standard normal variable, Vij ∼ N(0, 1).
Then, we projected our data vectors onto the space spanned by the rows of V , i.e., we computed
$X_{\mathrm{transformed}} = XV^{T}$.
The n × D projected matrix Xtransformed was used as input to several prediction methods, explained in detail in the next subsections.
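The projection can be sketched in a few lines of numpy; the sizes here are illustrative (the actual expanded matrix had p = 20086 columns).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, D = 300, 2000, 50           # illustrative sizes; D is chosen by the user
X = rng.normal(size=(n, p))       # stand-in for the expanded design matrix

# D x p projection matrix with i.i.d. standard normal entries.
V = rng.normal(size=(D, p))
X_transformed = X @ V.T           # the n x D projected matrix

# Scaling by 1/sqrt(D) makes pairwise distances approximately preserved
# (Johnson-Lindenstrauss).
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(X_transformed[0] - X_transformed[1]) / np.sqrt(D)
```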
4.2 Regression-Based Models
We employed three regression-based models to predict depression scores: 1) OLS, 2) the preshrunk estimator suggested by Copas [2], and 3) a preshrunk estimator with cross-validated shrinkage $\hat{K}$. Note that all three models require the number of predictor variables to be smaller than the number of observations, and hence require a prior reduction of predictor variables.
For each of the three models, PCA was performed
where the subset of principal components was chosen
to explain 80% of total variance.
In addition, for each of the three models, Random Projection was used with the size D of the projected matrix chosen by 5-fold cross-validation. We chose candidate sizes beforehand: 5, 10, 25, 50, 100, 150, and 200 for the preshrunk models, with 500, 1000, and 2000 added for OLS. For each candidate D we produced the projected matrix Xtransformed, split the training data into 5 folds, and for each fold used the other 4 folds to build a model and predict the depression scores in the held-out fold; each fold was thus used 4 times to build models and left out exactly once to obtain its predicted scores. From this procedure we obtained the mean prediction error on the training set, from which we chose the best size Dbest.
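A minimal sketch of this selection loop on synthetic data, using scikit-learn's KFold (the function and variable names are ours, not the project's code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 400))                    # toy stand-in for the design matrix
y = X[:, :5].sum(axis=1) + rng.normal(size=300)    # toy outcome

def cv_rmspe(X, y, D, n_splits=5):
    """Mean RMSPE of OLS after a random projection to D dimensions."""
    V = rng.normal(size=(D, X.shape[1]))
    Xt = X @ V.T
    errs = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(Xt):
        pred = LinearRegression().fit(Xt[train], y[train]).predict(Xt[test])
        errs.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return np.mean(errs)

candidates = [5, 10, 25, 50, 100]
D_best = min(candidates, key=lambda D: cv_rmspe(X, y, D))
```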
4.3 Tree-based prediction
Prediction methods based on OLS have two limita-
tions. First, OLS cannot (directly) take into account
interactions between predictor variables. Second, OLS imposes a strict condition on the ratio between observations and predictor variables.
Tree-based methods have been successfully used in
classification and regression settings and do not suffer
from these problems. Because of this, we used two prediction methods based on decision trees: random forest regression and XGBoost regression.
4.3.1 Random Forests
Random forests is an ensemble method used for classifi-
cation and regression. Because we are trying to predict
a continuous value (depression) we focus on regression.
Random forest regression works by constructing mul-
tiple decision trees at training time and outputting the
mean of the values predicted by each tree.
We used the sklearn.ensemble.RandomForestRegressor Python class for this task. Because we applied this method to the one-hot encoded design matrix, each decision tree built during this process performs a simple yes/no decision on a random subset of the predictor variables.
Because training of Random Forests can be compu-
tationally expensive we made use of 10 cores to speed
up this process.
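A minimal usage sketch on synthetic binary features (the hyperparameter values are placeholders; the text selects them by cross-validation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = (rng.random(size=(200, 30)) > 0.5).astype(float)   # binary, one-hot-style features
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# n_jobs=10 mirrors the 10 cores mentioned above; the forest's prediction
# is the mean of the individual trees' predictions.
forest = RandomForestRegressor(n_estimators=100, max_depth=10,
                               n_jobs=10, random_state=0)
forest.fit(X, y)
preds = forest.predict(X)
```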
4.3.2 XGBoost
XGBoost [1] is a gradient tree boosting library used for
prediction. Boosted trees methods work by using en-
sembles of decision trees to build predictions. However,
unlike random forests, these methods build ensembles
of decision trees in an incremental fashion, minimiz-
ing the residual errors at each new decision tree. XG-
Boost in particular is an efficient library for performing
boosted trees prediction.
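The incremental residual-fitting idea can be illustrated with scikit-learn's GradientBoostingRegressor, which stands in here for xgboost.XGBRegressor (the real library takes analogous max_depth and min_child_weight parameters); this is a sketch of boosted trees in general, not the project's XGBoost code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)   # nonlinear toy target

# Each new tree is fit to the residual errors of the ensemble built so far.
gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
gbr.fit(X, y)
preds = gbr.predict(X)
```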
4.4 Support Vector Regression
Support Vector Regression (SVR) is a prediction method that casts the problem into a convex optimization framework of the form:

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\lVert w \rVert^{2} \\ \text{subject to} \quad & y_i - \langle w, x_i \rangle - b \le \varepsilon \\ & \langle w, x_i \rangle + b - y_i \le \varepsilon \end{aligned}$$

This formulation can be used to find a hyperplane that approximates the data with an error of at most ε for each data point.
We applied the SVR implementation in the scikit-
learn Python library [6] to our prediction problem. Because we use a linear kernel, we used the LinearSVR Python class, which we found to perform significantly faster than the sklearn.svm.SVR class because it is built on top of the LIBLINEAR [3] library rather than the slower LIBSVM.
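A minimal LinearSVR sketch on synthetic data (the C and epsilon values are placeholders, to be chosen by cross-validation as described in Section 5.2):

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=300)  # toy linear target

# epsilon is the error margin of the tube; C penalizes points outside it.
svr = LinearSVR(C=1.0, epsilon=0.1, max_iter=10000, random_state=0)
svr.fit(X, y)
preds = svr.predict(X)
```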
4.5 Prediction Models Used
From the strategies above, we selected nine prediction
models to evaluate:
1. Random projection (number of dimensions deter-
mined by cross-validation) followed by OLS,
2. Random projection (number of dimensions deter-
mined by cross-validation) followed by Copas’ pre-
shrunk regression,
3. Random projection (number of dimensions deter-
mined by cross-validation) followed by pre-shrunk
regression where the shrinkage factor was deter-
mined by cross-validation,
4. Principal Component Analysis followed by OLS,
5. Principal Component Analysis followed by Copas’
pre-shrunk regression,
6. Principal Component Analysis followed by pre-
shrunk regression where the shrinkage factor was
determined by cross-validation,
7. Random Forest Regression (using the original fea-
ture matrix) with maximum depth and number of
trees determined by cross-validation,
8. XGBoost Regression (using the original feature
matrix) with maximum depth and minimum child
weight determined by cross-validation,
9. Support Vector Regression (using the original
feature matrix) with C determined by cross-
validation.
In addition, we implemented a random predictor,
which predicted the depression score for an individ-
ual by randomly selecting another individual’s score.
This provides a baseline root mean squared prediction error (RMSPE) to which the other models can be compared.
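This baseline amounts to permuting the observed scores. A quick sketch also shows why its RMSPE sits near √2 times the score SD (for independent draws, E[(X − X')²] = 2σ²), consistent with the 1.508 reported later for scores with SD about 1.06.

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.normal(size=1000)   # stand-in for held-out depression scores (SD ~ 1)

# Random predictor: each person receives another randomly chosen person's score.
y_pred = rng.permutation(y_test)
baseline_rmspe = np.sqrt(np.mean((y_pred - y_test) ** 2))   # close to sqrt(2) here
```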
5 Results
5.1 IRT Scores
The range of IRT scores was from about -1.90 to 4.21,
with a mean of about 0 (by design) and a standard de-
viation of about 1.06. The distribution of IRT scores is
presented in Figure 1. Note that since the IRT scores
have a SD of about 1, when evaluating root mean
squared prediction errors, the errors are essentially in
standard deviation units.
5.2 Cross Validation
We used 5-fold cross-validation to determine the best
value for the hyperparameters of the following models:
regression-based, tree-based and support vector regres-
sion.
Figure 1: Distribution of IRT scores for depression (histogram)
Regression When using Random Projection we
used CV to determine the best choice of D, the number
of dimensions of projection. Since the optimal choice
may vary based on the type of prediction model, this
parameter was independently cross-validated for OLS, OLS with Copas shrinkage, and OLS with cross-validated shrinkage. The candidate values for D, along
with associated root mean square prediction errors are
listed in Table ??.
For OLS, the Dbest was 100, whereas when Copas
or cross-validated shrinkage was used, Dbest was 150.
These values of Dbest were used in subsequent analyses.
Note that over-fitting is a problem to be concerned
with: in OLS with D = 2000, the root mean square
prediction error was almost twice as large as with the
optimal D = 100.
Tree-based methods We used cross-validation
with random forest regression and XGBoost.
For random forest regression we tried the parame-
ters and corresponding values shown in Table 1. For
XGBoost regression we tried the parameters and corresponding values in Table 2.
Parameter      Description       Values
max depth      Max tree depth    10, 50, 100, 200
n estimators   Number of trees   100, 1000, 1500, 2000, 4000, 8000

Table 1: Parameters and values used for cross-validation with Random Forest
Support Vector Regression We have cross-
validated our model using the parameters and respec-
tive values presented in Table 3.
Parameter         Description                                    Values
max depth         Max tree depth                                 1, 3, 5, 7, 9
min child weight  Minimum number of variables in each leaf node  1, 3, 5, 7, 9

Table 2: Parameters and values used for cross-validation with XGBoost
Parameter  Description                                               Values
epsilon    Error margin                                              0.01, 0.1, 0.5, 1, 2, 4, 8, 16
C          Penalty given to data points far from optimal hyperplane  1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100

Table 3: Parameters and values used for cross-validation with SVR
5.3 Prediction Quality
The prediction results from each prediction method,
using the best settings as determined by cross-
validation, are presented in Table ??.
Considering root mean squared prediction error (RMSPE), the tree-based and support-vector models worked better than the PCA-based models, which in turn worked better than the random-projection-based models.
Among the former, the best predictions were from
Random Forest regression with RMSPE of 0.970, fol-
lowed closely by XGBoost regression with RMSPE of
0.975. Support Vector regression fared slightly worse,
with RMSPE of 0.991.
Among the PCA models, OLS with Copas shrinkage
did the best, with RMSPE of 0.996, followed by OLS
with cross-validated shrinkage factor, with RMSPE of
1.002, and OLS with no shrinkage, with RMSPE of
1.010.
Among Random Projection models, OLS with Co-
pas shrinkage did the best with RMSPE of 1.033, fol-
lowed by OLS with cross-validated shrinkage factor,
with RMSPE of 1.042, and OLS with no shrinkage,
with RMSPE of 1.049.
However, all of these models performed better than
the baseline RMSPE of 1.508 provided by random pre-
diction.
6 Conclusion
In this report we presented a methodology for the pre-
diction of depression in a sparse dataset from an ado-
lescent health study.
We identified challenges in doing prediction using
this dataset and addressed them with a combination
of feature reduction, statistical and machine learning
techniques.
All the prediction methods we used performed better than a random predictor. The machine learning techniques we used achieved slightly better predictions than the methods based on OLS (0.970 vs. 0.996 RMSPE), which might result from their ability to capture interactions between different variables.
Interestingly, the predictions after Random Projec-
tion were in the ballpark of the RMSPE of other predic-
tions without dimensionality reduction. This indicates
that Random Projection can work in practice.
The AddHealth dataset was not constructed with prediction in mind, but rather to take a snapshot of the state of adolescent health in the US. We believe a dataset built for prediction would have taken into account from the outset some of the problems we addressed: the lack of observations, the format of the dataset, and the depression scores. Despite these challenges we were able to build a predictor of depression scores with an RMSPE of about one standard deviation, significantly better than a random predictor.
References
[1] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016.
[2] J. B. Copas. Regression, prediction and shrinkage.
Journal of the Royal Statistical Society. Series B
(Methodological), 1983.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang,
and C.-J. Lin. LIBLINEAR: A library for large
linear classification. Journal of Machine Learning
Research, 9:1871–1874, 2008.
[4] T. Kiefer, A. Robitzsch, and M. Wu. TAM: Test
Analysis Modules, 2016. R package version 1.16-0.
[5] G. N. Masters. A rasch model for partial credit
scoring. Psychometrika, 47(2):149–174, 1 June
1982.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[7] R Core Team. R: A Language and Environment for
Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria, 2015.
[8] L. S. Radloff. The CES-D scale: A Self-Report
depression scale for research in the general popula-
tion. Applied psychological measurement, 1(3):385–
401, 1 June 1977.
[9] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Conference in Modern Analysis and Probability, 1984.
Pg. 05Question FiveAssignment #Deadline Day 22.docx
 
Data analysis presentation by Jameel Ahmed Qureshi
Data analysis presentation by Jameel Ahmed QureshiData analysis presentation by Jameel Ahmed Qureshi
Data analysis presentation by Jameel Ahmed Qureshi
 
Statistics
StatisticsStatistics
Statistics
 
48  january 2  vol 27 no 18  2013  © NURSING STANDARD RC.docx
48  january 2  vol 27 no 18  2013  © NURSING STANDARD  RC.docx48  january 2  vol 27 no 18  2013  © NURSING STANDARD  RC.docx
48  january 2  vol 27 no 18  2013  © NURSING STANDARD RC.docx
 
WEEK 7 – EXERCISES Enter your answers in the spaces pr.docx
WEEK 7 – EXERCISES Enter your answers in the spaces pr.docxWEEK 7 – EXERCISES Enter your answers in the spaces pr.docx
WEEK 7 – EXERCISES Enter your answers in the spaces pr.docx
 
IRJET- Intelligent Depression Detection System
IRJET-  	  Intelligent Depression Detection SystemIRJET-  	  Intelligent Depression Detection System
IRJET- Intelligent Depression Detection System
 
Statistics Based On Ncert X Class
Statistics Based On Ncert X ClassStatistics Based On Ncert X Class
Statistics Based On Ncert X Class
 
Stat11t chapter1
Stat11t chapter1Stat11t chapter1
Stat11t chapter1
 
Stat11t Chapter1
Stat11t Chapter1Stat11t Chapter1
Stat11t Chapter1
 
Complete the Frankfort-Nachmias and Leon-Guerrero (2018) SPSS®.docx
Complete the Frankfort-Nachmias and Leon-Guerrero (2018) SPSS®.docxComplete the Frankfort-Nachmias and Leon-Guerrero (2018) SPSS®.docx
Complete the Frankfort-Nachmias and Leon-Guerrero (2018) SPSS®.docx
 

main

3K variables collected in each.
This dataset is survey-sampled, with some demographic groups oversampled relative to others. We ignore the external-validity issues this raises (and thus do not use the sample weights implied by the sampling design).

Our central interest is prediction. Since the environment of each person in their adolescence and young adulthood might affect their mental health as adults, we decided to use the variables collected in Wave I to predict a depression score in Wave IV. The depression data in Wave IV are a series of categorical frequency indicators (0 = “never” through 3 = “most of the time”) in the “Social Psychology and Mental Health” section; hence we propose a depression score constructed from these responses.

One of the major potential problems we might face in the analysis is that the number of possible predictors in Wave I overwhelms the sample size available to us; it is therefore necessary to select the best predictors to use in our analysis. Another issue is how to make predictions when many respondents may be missing data on some fraction of the large number of predictor variables.

We approached this problem in the following way. First, we chose a set of prediction techniques that we thought were promising for a dataset with these characteristics, among them OLS, Random Forests, and Support Vector Regression. Second, because some of the methods (e.g., OLS) assume linearity in the variables, we one-hot encoded our input dataset. Third, to reduce the number of predictor variables we applied two dimensionality-reduction techniques: Random Projection and Principal Component Analysis. Fourth, because these methods are sensitive to the values of their hyperparameters, we trained the models with cross-validation. Finally, because we are concerned with how well we can predict depression in new/unseen data, we split our dataset into two sets: a training set and a test set.
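This train/test evaluation setup can be sketched as follows; the arrays `X` and `y` are hypothetical stand-ins for the Wave I design matrix and Wave IV depression scores, not the actual AddHealth data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the Wave I predictors and Wave IV scores.
X = rng.normal(size=(500, 20))
y = rng.normal(size=500)

# Hold out a test set; models are fit on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Root mean squared prediction error on the held-out test set.
rmspe = np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2))
print(round(float(rmspe), 3))
```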
We report the error of predictions when our models (trained on the training set) are used to predict the depression of people in the test set.

This report is organized as follows. In Section 2 we present in detail the dataset we analyzed and discuss some of the challenges of analyzing it. In Section 3 we describe how we prepared the dataset for analysis with prediction models. In Section 4 we present the methodology we used to build our prediction models. In Section 5 we compare the quality of the predictions from the various models. Finally, in Section 6 we summarize our work.

2 Dataset

The data are split into four waves.

Wave I consisted of two stages. Stage 1 was a stratified, random sample of all high schools in the United States. Stage 2 was an in-home sample of 27,000 teenagers consisting of core samples from each community. Some over-samples were also selected; in other words, an adolescent could qualify for multiple samples.

Wave II was identical to the Wave I sample with a few exceptions: (a) those who were in the 12th grade at Wave I and not part of the genetic sample were excluded, (b) respondents only in the Wave I disabled sample were excluded, and (c) 65 adolescents who were in the genetic sample but not interviewed in Wave I were interviewed in Wave II.

If Wave I respondents could be found and interviewed again after six years, they were included in the Wave III sample. Urine and saliva samples were also collected at this time. Finally, the Wave IV sample consisted of all original Wave I respondents; readings of vitals were also taken.

There were some NA values in the data after collection. The Wave I dataset had the highest proportion, with over five percent of its values being NAs.

3 Data Processing

We wish to predict the depression of participants 10 to 15 years after they first participated in the survey. In AddHealth, depression is measured using ten items, which respondents answer on a four-point frequency scale, as described in Section 3.2. The responses to these ten items are then summarized into a single depression score.

To predict depression we performed the following study. First, we preprocessed the data: 1) we filtered out individuals who were lost to follow-up during the study, 2) we cleaned the data, and 3) we generated for each participant of Wave IV a general mental-health score that we aim to predict. Second, we built a linear prediction model regressed on an expanded design matrix.
In this expanded matrix, each predictor variable has a binary value indicating whether, for a specific survey item, the participant chose a specific answer. Third, we used cross-validation to estimate the hyperparameters that give the model the smallest prediction error. Analysis was conducted using R [7] and Python.

In the rest of the section we explain each step in more detail and the reasoning behind our approach.

3.1 Data Processing

Before building a prediction model we cleaned the raw AddHealth dataset. The problems we addressed while cleaning the data included filtering out individuals who were lost to follow-up and removing survey variables unimportant to our prediction task.

To address the first problem, participants lost to follow-up, we filtered out all participants who participated in Wave I of the study but not in Wave IV. We intend to analyze this subset of the population in order to understand whether they differ significantly from the other participants.

Second, some variables in the dataset are unlikely to be important for prediction purposes. For instance, the survey variables IMONTH, IDAY, and IYEAR indicate the month, day, and year the survey was conducted, respectively. Such variables were removed and ignored in our analysis.

3.2 Depression Scores

Our outcome measure is depression, as measured in Wave IV. The “Social Psychology and Mental Health” section of Wave IV contains ten items, H4MH18 through H4MH27, measuring depression. Most of these items were taken from the Center for Epidemiologic Studies Depression Scale (CES-D) [8]. They ask “How often was the following true during the past seven days” with a rating scale of 0 = “never” through 3 = “most of the time”. The items were:

• “You were bothered by things that usually don’t bother you”
• “You could not shake off the blues, even with help from your family and your friends”
• “You felt you were just as good as other people” *
• “You had trouble keeping your mind on what you were doing”
• “You felt depressed”
• “You felt that you were too tired to do things”
• “You felt happy” *
• “You enjoyed life” *
• “You felt sad”
• “You felt that people disliked you”

Three items (indicated by *) were positively worded and were reverse-coded before analysis (i.e., 3 = “never” through 0 = “most of the time”).

There are three commonly used methods to generate a single outcome that can be modeled as continuous. First, we could construct sum-scores by adding the codes (0–3) from each item, yielding a score range of 0–30. This would treat each item equally, and treat the spaces between the four levels within each item equally. Second, we could conduct an Exploratory Factor Analysis (EFA) and generate factor scores for a single principal factor. This would weight each item differently, but still treat the spaces between the four levels within those items equally. Third, we could fit an Item Response Theory (IRT) model, which is a kind of latent variable model, and use the latent variable scores.

We chose to use IRT scores, because their distribution was closer to Normal than sum-scores or EFA scores. Specifically, we fit Masters’ Partial Credit Model [5] using the TAM [4] package:

$$P(X_{pi} = c) = \frac{\exp\left(\sum_{k=0}^{c} (\theta_p - \delta_{ik})\right)}{\sum_{l=0}^{l_i} \exp\left(\sum_{k=0}^{l} (\theta_p - \delta_{ik})\right)}$$

where X_pi ∈ {0, 1, 2, 3} is the response of person p to item i, θ_p is the (latent) depression of person p, δ_ik is how high on the depression scale a person must be to answer item i at level k rather than k − 1, and l_i is the maximum level for item i (always 3); the empty sum for c = 0 is taken to be 0.

We estimated the model parameters (the item parameters and the parameters of the θ distribution) using Marginal Maximum Likelihood. We predicted depression scores θ_p for each individual using Empirical Bayes: Expected a Posteriori (EAP), with a N(0, σ²) prior.
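As a minimal numerical sketch of the Partial Credit Model probabilities above (the threshold values δ here are hypothetical, not the fitted AddHealth parameters):

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Masters' Partial Credit Model: P(X = c) for c = 0..len(deltas).

    theta: latent depression score for one person.
    deltas: thresholds (delta_i1, ..., delta_i3) for one item; the empty
    sum for c = 0 contributes exp(0) = 1 to the numerator.
    """
    # Cumulative sums of (theta - delta_k) are the log-numerators for c >= 1.
    logits = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    num = np.exp(logits)
    return num / num.sum()   # normalize over the four response levels

# Hypothetical thresholds for one item; probabilities over levels 0..3.
p = pcm_probs(theta=0.5, deltas=[-0.5, 0.8, 1.6])
print(p)  # four probabilities summing to 1
```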
Although EAP can generate scores for individuals with missing data on the indicators, we chose not to generate depression scores for those who answered fewer than seven of the ten items. This resulted in the loss of a single observation.

3.3 Predicting on categorical variables

There are 6,504 participants and 2,794 variables in Wave I. Most of these variables are categorical, while others are numerical. For instance, item 19 in Wave I of the study is “(During the past seven days:) You could not shake off the blues, even with help from your family and your friends”, with the following answers:

• 0 – “never or rarely”
• 1 – “sometimes”
• 2 – “a lot of the time”
• 3 – “most of the time or all of the time”
• 4 – “refused”

For categorical variables like this, we cannot use the value of the survey response directly in a linear model, because the values do not have a linear meaning: the answer “refused” is not indicative of more depression than the answer “sometimes”.

To address this problem, we expanded each variable in the original dataset into a set of binary values, one for each possible response to the corresponding survey item (one-hot encoding). For this item, we generated 5 variables, one for each of the 5 answers. This way we are able to pull apart the contribution of each particular answer to the final prediction.

This procedure expanded our 2,794 predictor variables into 20,086 variables.

4 Prediction

4.1 Reducing the number of predictor variables

The number of predictor variables greatly exceeds the number of observations. This is an issue especially when we use regression-based models such as OLS, and it is exacerbated by one-hot encoding. Reducing the number of predictors is therefore necessary.

To address this problem we tried two approaches: Principal Component Analysis (PCA) and Random Projection.

To perform PCA on the one-hot encoded matrix, we first computed the set of principal components that accounts for more than 80% of the variance in the dataset.
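This pipeline of one-hot expansion followed by variance-based component selection can be sketched as below; the tiny data frame and its item names are hypothetical stand-ins for the real survey matrix.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy stand-in for survey responses: numeric codes are really categories.
df = pd.DataFrame({
    "ITEM_A": [0, 1, 3, 4, 2, 0, 1, 3],  # hypothetical item names
    "ITEM_B": [1, 2, 1, 3, 2, 1, 3, 2],
})

# One-hot encode: one binary column per (item, answer) pair.
X = pd.get_dummies(df.astype("category"))

# Keep the smallest number of components explaining >= 80% of the variance.
pca = PCA(n_components=0.80)
Z = pca.fit_transform(X.to_numpy(dtype=float))
print(Z.shape, pca.explained_variance_ratio_.sum())
```

Passing a float in (0, 1) as `n_components` makes scikit-learn select the number of components automatically from the cumulative explained variance.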
This step resulted in 658 principal components. We then projected our design matrix onto the subspace spanned by these principal components.

As an alternative to the PCA approach, we projected the one-hot encoded design matrix onto a subspace spanned by random vectors (Random Projection). By the Johnson–Lindenstrauss lemma [9] we know that when projecting data vectors into a lower-dimensional space, the distances between points are approximately preserved with high probability. This approach has the advantage of being computationally faster than other dimensionality-reduction alternatives (e.g., PCA).

To perform this projection we did the following. First, we created a D × p projection matrix V, where D is a parameter chosen by the user, p is the number of predictor variables in the expanded matrix, and each entry V_ij is a standard normal variable, V_ij ∼ N(0, 1). Then, we projected our data vectors into the space spanned by V, i.e., we computed X_transformed = X Vᵀ. The n × D projected matrix X_transformed was used with several prediction methods, explained in detail in the next subsections.

4.2 Regression-Based Models

We employed three regression-based models to predict depression scores: 1) OLS, 2) the preshrunk estimator suggested by Copas [2], and 3) the preshrunk estimator with cross-validated shrinkage K̂. Note that all three models require the number of predictor variables to be smaller than the number of observations, and hence require a reduction of predictor variables.

For each of the three models, PCA was performed with the subset of principal components chosen to explain 80% of the total variance.

In addition, for each of the three models, Random Projection was used with the size of the projected matrix chosen by 5-fold cross-validation. That is, we chose candidate sizes D beforehand (5, 10, 25, 50, 100, 150, and 200 for the preshrunk models, plus 500, 1000, and 2000 for OLS) and produced the projected matrix X_transformed for each candidate D. Then, for each X_transformed, we split the training data into 5 folds; for each fold, we used the other 4 folds to build a model and predicted the depression scores in the held-out fold. In other words, each fold was used four times to build models predicting the scores of the other folds, and was left out exactly once to obtain its own predicted scores. From this procedure we obtained the mean prediction error on the training set, from which we chose the best size D_best.

4.3 Tree-based prediction

Prediction methods based on OLS have two limitations.
First, OLS cannot (directly) take into account interactions between predictor variables. Second, OLS imposes a strict condition on the ratio between observations and predictor variables.

Tree-based methods have been used successfully in classification and regression settings and do not suffer from these problems. Because of this, we used two prediction methods based on decision trees: random forest regression and XGBoost regression.

4.3.1 Random Forests

Random forests is an ensemble method used for classification and regression. Because we are trying to predict a continuous value (depression), we focus on regression. Random forest regression works by constructing multiple decision trees at training time and outputting the mean of the values predicted by the individual trees. We used the sklearn.ensemble.RandomForestRegressor Python class for this task. Because we applied this method to the one-hot encoded design matrix, each decision tree built during this process performs a simple “Yes/No” decision on a random subset of the predictor variables.

Because training Random Forests can be computationally expensive, we made use of 10 cores to speed up this process.

4.3.2 XGBoost

XGBoost [1] is a gradient tree boosting library used for prediction. Boosted-tree methods work by using ensembles of decision trees to build predictions. However, unlike random forests, these methods build the ensemble of decision trees incrementally, with each new decision tree minimizing the residual errors of the current ensemble. XGBoost in particular is an efficient library for performing boosted-tree prediction.

4.4 Support Vector Regression

Support Vector Regression (SVR) is a prediction method that casts the problem into a convex optimization framework of the form:

$$\text{minimize } \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i - \langle w, x_i \rangle - b \le \varepsilon, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon$$

This formulation can be used to find a hyperplane that approximates the data with an error of at most ε for each data point.
We applied the SVR implementation in the scikit-learn Python library [6] to our prediction problem. Because we use a linear kernel, we decided to use the LinearSVR Python class. We found this class to perform significantly faster than the sklearn.svm.SVR class, because it is built on top of the LIBLINEAR [3] library rather than the slower LIBSVM.
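A minimal sketch of this setup, cross-validating the penalty C over a coarse grid on synthetic data (the grid and data here are illustrative, not the values used in the study):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
# Synthetic regression problem standing in for the real design matrix.
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Cross-validate the penalty C over a coarse grid.
grid = GridSearchCV(
    LinearSVR(epsilon=0.1, max_iter=10000),
    param_grid={"C": [1e-4, 1e-2, 1, 100]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```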
4.5 Prediction Models Used

From the strategies above, we selected nine prediction models to evaluate:

1. Random Projection (number of dimensions determined by cross-validation) followed by OLS,
2. Random Projection (number of dimensions determined by cross-validation) followed by Copas’ preshrunk regression,
3. Random Projection (number of dimensions determined by cross-validation) followed by preshrunk regression where the shrinkage factor was determined by cross-validation,
4. Principal Component Analysis followed by OLS,
5. Principal Component Analysis followed by Copas’ preshrunk regression,
6. Principal Component Analysis followed by preshrunk regression where the shrinkage factor was determined by cross-validation,
7. Random Forest regression (using the original feature matrix) with maximum depth and number of trees determined by cross-validation,
8. XGBoost regression (using the original feature matrix) with maximum depth and minimum child weight determined by cross-validation,
9. Support Vector Regression (using the original feature matrix) with C determined by cross-validation.

In addition, we implemented a random predictor, which predicted the depression score for an individual by randomly selecting another individual’s score. This provides a baseline root mean squared prediction error to which the other models can be compared.

5 Results

5.1 IRT Scores

The IRT scores ranged from about −1.90 to 4.21, with a mean of about 0 (by design) and a standard deviation of about 1.06. The distribution of IRT scores is presented in Figure 1. Note that since the IRT scores have a standard deviation of about 1, root mean squared prediction errors are essentially in standard-deviation units.

5.2 Cross Validation

We used 5-fold cross-validation to determine the best values for the hyperparameters of the following models: regression-based, tree-based, and support vector regression.
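The dimension-selection procedure of Section 4.2 (random projection followed by 5-fold cross-validation over candidate sizes D) can be sketched as below, on synthetic data with an illustrative candidate list:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1000))   # n << p, as after one-hot encoding
y = rng.normal(size=300)

best_D, best_rmspe = None, np.inf
for D in [5, 10, 25, 50, 100]:                 # candidate projection sizes
    V = rng.standard_normal((D, X.shape[1]))   # D x p Gaussian projection
    XT = X @ V.T                               # n x D projected matrix
    scores = cross_val_score(LinearRegression(), XT, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    rmspe = -scores.mean()                     # mean CV prediction error
    if rmspe < best_rmspe:
        best_D, best_rmspe = D, rmspe
print(best_D, round(float(best_rmspe), 3))
```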
[Figure 1: Distribution of IRT scores for depression. Histogram; x-axis: Depression Scores (IRT), y-axis: Frequency.]

Regression. When using Random Projection, we used cross-validation to determine the best choice of D, the number of dimensions of the projection. Since the optimal choice may vary with the type of prediction model, this parameter was independently cross-validated for OLS, OLS with Copas shrinkage, and OLS with cross-validated shrinkage. The candidate values for D, along with the associated root mean squared prediction errors, are listed in Table ??. For OLS, D_best was 100, whereas when Copas or cross-validated shrinkage was used, D_best was 150. These values of D_best were used in subsequent analyses. Note that over-fitting is a real concern: for OLS with D = 2000, the root mean squared prediction error was almost twice as large as with the optimal D = 100.

Tree-based methods. We used cross-validation with random forest regression and XGBoost. For random forest regression we tried the parameters and corresponding values shown in Table 1. For XGBoost regression we tried the parameters and corresponding values in Table 2.

Parameter      Description       Values
max_depth      Max tree depth    10, 50, 100, 200
n_estimators   Number of trees   100, 1000, 1500, 2000, 4000, 8000

Table 1: Parameters and values used for cross-validation with Random Forest

Support Vector Regression. We cross-validated our model using the parameters and respective values presented in Table 3.
Parameter          Description                                     Values
max_depth          Max tree depth                                  1, 3, 5, 7, 9
min_child_weight   Minimum number of variables in each leaf node   1, 3, 5, 7, 9

Table 2: Parameters and values used for cross-validation with XGBoost

Parameter   Description                                            Values
epsilon     Error margin                                           0.01, 0.1, 0.5, 1, 2, 4, 8, 16
C           Penalty given to data points far from the hyperplane   1×10⁻⁸, 1×10⁻⁷, 1×10⁻⁶, 1×10⁻⁵, 1×10⁻⁴, 1×10⁻³, 1×10⁻², 1×10⁻¹, 1, 10, 100

Table 3: Parameters and values used for cross-validation with SVR

5.3 Prediction Quality

The prediction results from each method, using the best settings as determined by cross-validation, are presented in Table ??.

Considering root mean squared prediction error (RMSPE), the tree-based and support-vector models worked better than the PCA-based models, which in turn worked better than the Random-Projection-based models.

Among the former, the best predictions came from Random Forest regression, with an RMSPE of 0.970, followed closely by XGBoost regression, with an RMSPE of 0.975. Support Vector Regression fared slightly worse, with an RMSPE of 0.991.

Among the PCA models, OLS with Copas shrinkage did best, with an RMSPE of 0.996, followed by OLS with a cross-validated shrinkage factor (RMSPE 1.002) and OLS with no shrinkage (RMSPE 1.010).

Among the Random Projection models, OLS with Copas shrinkage did best, with an RMSPE of 1.033, followed by OLS with a cross-validated shrinkage factor (RMSPE 1.042) and OLS with no shrinkage (RMSPE 1.049).

All of these models performed better than the baseline RMSPE of 1.508 provided by the random predictor.

6 Conclusion

In this report we presented a methodology for the prediction of depression in a sparse dataset from an adolescent health study. We identified challenges in doing prediction with this dataset and addressed them with a combination of feature reduction and statistical and machine learning techniques.
All the prediction methods we used performed better than the random predictor. The machine learning techniques we used achieved slightly better predictions than the methods based on OLS (X RMSPE vs Y). This might result from their ability to capture interactions between variables.

Interestingly, the predictions after Random Projection were in the ballpark of the RMSPEs of the other predictions without dimensionality reduction. This indicates that Random Projection can work in practice.

The AddHealth dataset was not constructed with prediction in mind, but rather to take a snapshot of the state of adolescent health in the US. We believe a dataset built for prediction would have taken into account from the outset some of the problems we addressed: the small number of observations, the format of the dataset, and the depression scores. Despite these challenges, we were able to build a predictor of depression scores with an RMSPE of about one standard deviation, significantly better than the random predictor.
References

[1] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016.
[2] J. B. Copas. Regression, prediction and shrinkage. Journal of the Royal Statistical Society, Series B (Methodological), 1983.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[4] T. Kiefer, A. Robitzsch, and M. Wu. TAM: Test Analysis Modules, 2016. R package version 1.16-0.
[5] G. N. Masters. A Rasch model for partial credit scoring. Psychometrika, 47(2):149–174, June 1982.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[7] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015.
[8] L. S. Radloff. The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1(3):385–401, June 1977.
[9] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Conference in Modern Analysis and Probability, 1984.