Predicting breast cancer — Adrián Vallés
Performed and compared predictive modelling approaches (classification tree, logistic regression, and random forest) to predict benign vs. malignant breast cancers using R for the Data Mining class (BANA 4080).
Multiple regression and logistic regression performed in SPSS to evaluate the relationship between birth rate and abortion rate for males and females.
Performed statistical analysis on a chosen data table and examined relationships among different data fields using IBM SPSS software.
Methodologies: multiple linear regression, logistic regression
IBM SPSS
Performed statistical analysis with multiple and logistic regression in R Studio and SPSS on Gender Inequality ratio data and Employment to Population data, respectively.
This report includes information about:
1. Pre-Processing Variables
a. Treating Missing Values
b. Treating correlated variables
2. Selection of Variables using random forest weights
3. Building model to predict donors and amount expected to be donated.
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA — ijcsit
This article consolidates the idea that non-random pairing can promote the evolution of cooperation in a non-repeated version of the prisoner's dilemma. The idea is taken from [1], which presents experiments utilizing stochastic simulation. In the following, it is shown how the results from [1] are reproducible by numerical analysis. It is also demonstrated that some unexplained findings in [1] are due to the methods used.
The amount of information in the form of features and variables available to machine learning algorithms is ever increasing. This can lead to classifiers that are prone to overfitting in high dimensions; high-dimensional models do not lend themselves to interpretable results; and the CPU and memory resources necessary to run on high-dimensional datasets severely limit the applications of these approaches.
Variable and feature selection aim to remedy this by finding a subset of features that best captures the information provided.
In this paper we present the general methodology and highlight some specific approaches.
Effect of 3D Parameters on Antifungal Activities of Some Heterocyclic Compounds — IOSR Journals
Quantitative Structure-Activity Relationships (QSAR) of some heterocyclic compounds were studied using some 3D parameters. The QSAR models indicated that Dipole Y, dipole magnitude, Y length, and some indicator parameters are very effective in describing the antifungal activities of these compounds against Candida albicans in the training and external test sets. The multiple regression analysis produced well-predictive, statistically significant, and cross-validated QSAR models which help to explore some expectedly potent compounds.
A Moment Inequality for Overall Decreasing Life Class of Life Distributions w... — inventionjournals
A moment inequality is derived for a system whose life distribution is in the overall decreasing life (ODL) class of life distributions. A new nonparametric test statistic for testing exponentiality against ODL is investigated based on this inequality. The asymptotic normality of the proposed statistic is presented. Pitman's asymptotic efficiency, power, and critical values of this test are calculated to assess its performance. Real examples are given to elucidate the use of the proposed test statistic in reliability analysis. We also propose a test for testing exponentiality versus ODL for right-censored data, and the power estimates of this test are simulated for censored data for some commonly used distributions in reliability. Finally, real data are used as an example for practical problems.
A controversial genetic restoration mechanism has been proposed for the model organism Arabidopsis thaliana. This theory proposes that genetic material from non-parental ancestors is used to restore genetic information that was inadvertently corrupted during reproduction. We evaluate the effectiveness of this strategy by adapting it to an evolutionary algorithm solving two distinct benchmark optimization problems. We compare the performance of the proposed strategy with a number of alternate strategies, including the Mendelian alternative. Included in this comparison are a number of biologically implausible templates that help elucidate likely reasons for the relative performance of the different templates. Results show that the proposed non-Mendelian restoration strategy is highly effective across the range of conditions investigated, significantly outperforming the Mendelian alternative in almost every situation.
Module 05 – Hypothesis Tests Using Two Samples — IlonaThornburg83
Module 05 – Hypothesis Tests Using Two Samples
Class Objectives:
· Identify whether two samples are independent or dependent.
· Compare the testing procedures for two sample tests.
· Test hypotheses about two population parameters.
Module 05 - Part 1
Last week we took one sample to see if it supported our alternative hypothesis. This week we are going to increase to TWO samples and see if there is a significant difference between them.
When would we use this?
· Two samples are __________________________________ if the sample values from one population are not related to or somehow naturally paired or matched with the sample values from the other population.
· Example:
· Two samples are _____________________________ (or consist of ______________________________________) if the sample values are somehow matched, where the matching is based on some inherent relationship.
· Example:
Hint: If the two samples have different sample sizes with no missing data, they must be independent. If the two samples have the same sample size, the samples may or may not be independent.
Put the variables in for each population in the table below.

                                Population 1    Population 2
Population Mean                 ____________    ____________
Population Standard Deviation   ____________    ____________
Population Proportion           ____________    ____________
Sample Size                     ____________    ____________
Sample Mean                     ____________    ____________
Sample Standard Deviation       ____________    ____________
Sample Proportion               ____________    ____________
Note: We are going to approach the problem as if σ1 and σ2 are unknown. This is the most common case and means that we will be using the t test statistic.
· The test statistic is given by the formula below:

t = ((x̄1 − x̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2)

where we assume μ1 − μ2 = 0.
To calculate the degrees of freedom, pick the _______________________ n value and subtract 1.
We will be doing the same steps as before to test the hypothesis (either the critical value test or the p-value test); there are just different formulas.
· The null hypothesis is given as _____________________________.
· The alternative hypothesis will be either ____________________________, ___________________________, or _____________________________.
Example 1. Data Set 26 “Cola Weights and Volumes” in Appendix B includes weights (lb) of the contents of cans of Diet Coke (n = 36, x = 0.78479 lb, s = 0.00439 lb) and of the contents of cans of regular Coke (n = 36, x = 0.81682 lb, s = 0.00751 lb). Use a 0.05 significance level to test the claim that the contents of cans of Diet Coke have weights with a mean that is less than the mean for regular Coke.
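The arithmetic in Example 1 can be checked numerically. The following is a minimal Python sketch (purely illustrative; the class itself works from the worksheet formula and tables) that computes the two-sample t statistic and the conservative degrees of freedom from the summary statistics given in the example:

```python
import math

# Summary statistics from Example 1 (Diet Coke vs. regular Coke, weights in lb)
n1, xbar1, s1 = 36, 0.78479, 0.00439
n2, xbar2, s2 = 36, 0.81682, 0.00751

# Two-sample t statistic, assuming mu1 - mu2 = 0 under H0
t = (xbar1 - xbar2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# Conservative degrees of freedom: the smaller n value minus 1
df = min(n1, n2) - 1

print(round(t, 2), df)  # t ≈ -22.09 with df = 35
```

The statistic falls far below the one-tailed 0.05 critical value (about −1.69 at df = 35), so the claim that Diet Coke cans have a smaller mean weight is supported.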
Example 2. Researchers from the University of British Columbia conducted trials to investigate the effects of color on creativity. Subjects with a red background were asked to think of creative uses for a brick; other subjects with a blue background were given the same task. Responses were scored by a panel of judges and results from scores of creativity are given below. Higher scores correspond to more creativity. The researchers make the claim that “blue enhances performance on a creative task.” Use a 0.05 significance level to test the claim that blue enhances perform ...
Network analysis of cancer metabolism: A novel route to precision medicine — Varshit Dusad
Master's project presentation for MRes Systems and Synthetic Biology 2017-18, Imperial College London.
Study of cancer metabolism using constraint-based modeling and graph theory.
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS: RIDGE REGRESS... — ijaia
The work in this paper shows intensive empirical experiments using 13 datasets to understand the regularization effectiveness of ridge regression, the lasso estimate, and elastic net regularization methods. The study offers a deep understanding of how the datasets affect the prediction accuracy of each regularization method for a given problem, given the diversity of the datasets used. The results show that datasets play crucial roles in the performance of the regularization method and that prediction accuracy depends heavily on the nature of the sampled datasets.
Penalized Regressions with Different Tuning Parameter Choosing Criteria and t... — CSCJournals
Recently a great deal of attention has been paid to modern regression methods such as penalized regressions which perform variable selection and coefficient estimation simultaneously, thereby providing new approaches to analyze complex data of high dimension. The choice of the tuning parameter is vital in penalized regression. In this paper, we studied the effect of different tuning parameter choosing criteria on the performances of some well-known penalization methods including ridge, lasso, and elastic net regressions. Specifically, we investigated the widely used information criteria in regression models such as Bayesian information criterion (BIC), Akaike’s information criterion (AIC), and AIC correction (AICc) in various simulation scenarios and a real data example in economic modeling. We found that predictive performance of models selected by different information criteria is heavily dependent on the properties of a data set. It is hard to find a universal best tuning parameter choosing criterion and a best penalty function for all cases. The results in this research provide reference for the choices of different criteria for tuning parameter in penalized regressions for practitioners, which also expands the nascent field of applications of penalized regressions.
Using Artificial Neural Networks to Detect Multiple Cancers from a Blood Test — StevenQu1
Research paper drafted during my two-year internship with Oak Ridge National Laboratory, illustrating the potential of artificial intelligence in cancer research.
A New CPXR-Based Logistic Regression Method and Clinical Prognostic Modeling ... — Vahid Taslimitehrani
Presented at 15th International Conference on BioInformatics and BioEngineering (BIBE2014)
Prognostic modeling is central to medicine, as it is often used to predict patients' outcome and response to treatments and to identify important medical risk factors. Logistic regression is one of the most used approaches for clinical prediction modeling. Traumatic brain injury (TBI) is an important public health issue and a leading cause of death and disability worldwide. In this study, we adapt CPXR (Contrast Pattern Aided Regression, a recently introduced regression method) to develop a new logistic regression method called CPXR(Log) for general binary outcome prediction (including prognostic modeling), and we use the method to carry out prognostic modeling for TBI using admission-time data. The models produced by CPXR(Log) achieved AUC as high as 0.93 and specificity as high as 0.97, much better than those reported by previous studies. Our method produced interpretable prediction models for diverse patient groups for TBI, which show that different kinds of patients should be evaluated differently for TBI outcome prediction and that the odds ratios of some predictor variables differ significantly from those given by previous studies; such results can be valuable to physicians.
Learn SQL from Basic Queries to Advanced Queries — manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Analysis insight about a Flyball dog competition team's performance — roli9797
Insight from my analysis of a Flyball dog competition team's last-year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
State of Artificial Intelligence Report 2023 — kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Unleashing the Power of Data: Choosing a Trusted Analytics Platform — Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Global Situational Awareness of A.I. and Where It's Headed — vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Enhanced Enterprise Intelligence with your personal AI Data Copilot — GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
1 Linear Classification Models (Part I)
1.1 Classification Imbalance
The hepatic injury status response variable is a 3-class categorical variable with possible values "None", "Mild Severity", and "Severe". The distribution of this response variable is shown in Figure 1. As we can see, the distribution is highly imbalanced, which would pose a serious problem in model training. There are several ways to handle this problem; one of the most popular is to use sampling techniques to reconstruct a balanced training dataset. Ling and Li (1998)¹ provide an approach to up-sampling in which cases from the minority classes are sampled with replacement until each class has approximately the same number of observations. The reason I prefer up-sampling over down-sampling in this context is that the number of "Severe" observations is so limited that a down-sampled training dataset would be too small for the model to be well trained². Therefore, the training dataset is created as follows:
1) First, set the random seed to zj2160, and randomly assign every sample in the whole dataset to either the training dataset or the test dataset; the probability of being assigned to the training dataset is 80%.
2) Next, use the function upSample in the caret library to reconstruct the training dataset so that the new training dataset is balanced.
Do I need to use the up-sampling method to reconstruct the test dataset? The answer is no: while the training dataset is sampled to be balanced, the test dataset should be sampled to be consistent with the state of nature and should reflect the imbalance, so that honest estimates of future performance can be computed.
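The report performs this step with caret's upSample in R. As an illustration of the same idea, here is a small self-contained Python sketch (the dataset and helper names are made up): minority classes are resampled with replacement until every class matches the largest one.

```python
import random
from collections import Counter

random.seed(0)  # the report sets the seed to zj2160 in R; any seed works here

def up_sample(rows, label_of):
    """Sample minority classes with replacement until every class
    matches the size of the largest class (Ling and Li, 1998)."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)  # keep all original cases
        # resample the shortfall with replacement (k = 0 for the largest class)
        balanced.extend(random.choices(group, k=target - len(group)))
    return balanced

# Toy imbalanced dataset mimicking the hepatic injury classes
train = [("None", i) for i in range(50)] + \
        [("Mild", i) for i in range(20)] + \
        [("Severe", i) for i in range(5)]
balanced = up_sample(train, label_of=lambda row: row[0])
print(Counter(label for label, _ in balanced))  # 50 of each class
```

Note that only the training split is passed through `up_sample`; the test split keeps its natural imbalance, as argued above.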
1.2 Classification Statistic
There are 3 usual classification statistics: AUC (area under the ROC curve), Kappa, and Accuracy. For a 2-class classification problem, we usually use AUC as the classification statistic. However, in this context the response variable contains 3 classes. Two solutions are presented as follows:
1) Use Kappa or Accuracy as the classification statistic, since AUC is only appropriate for 2-class classification problems. However, some models are natively unsuitable for multi-class classification, such as the logistic regression model (although multinomial logistic regression can compensate).
¹ Ling C, Li C (1998). "Data Mining for Direct Marketing: Problems and Solutions." In "Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining," pp. 73–79.
² I have tested both the down-sampling and up-sampling methods, and the comparison can be found in a later part. In fact, the prediction performance of most models becomes better when up-sampling is substituted for down-sampling. The reason may be that, with so few "Severe" samples, down-sampling produces a small training dataset, so the model would not be well trained.
2) Still use Kappa or Accuracy as the final classification statistic, but build k sub-models for all k classes. More specifically, create k binary variables whose value is 1 if the sample belongs to the corresponding class and 0 otherwise. In this context, the binary variables are as follows:

None_i = 1 if sample i is in the "None" category, 0 otherwise
Mild_i = 1 if sample i is in the "Mild" category, 0 otherwise
Severe_i = 1 if sample i is in the "Severe" category, 0 otherwise
Then I would train 3 separate models using these 3 response variables. When selecting the tuning parameter, I would use AUC to select the optimal value. When predicting, I would combine the 3 probability predictions using the softmax transformation (Bridle 1990³), which is defined as

p*_k = exp(p_k) / Σ_{l=1}^{3} exp(p_l)

where p_k is the probability prediction for the k-th class and p*_k is the transformed value between 0 and 1. The final prediction is the class with the largest p*_k.
I decided to use the latter approach since it can accommodate all the models. The final classification statistic when measuring prediction performance is Kappa, and I also use Accuracy as a reference.
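The one-vs-rest combination described above can be sketched in a few lines. This is an illustrative Python version (the report itself works in R with caret), with made-up probability outputs standing in for the three fitted sub-models:

```python
import math

def softmax(scores):
    """Bridle (1990) softmax: map raw scores to values in (0, 1) that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(p_none, p_mild, p_severe):
    """Combine the three one-vs-rest probability predictions and
    return the class with the largest transformed value."""
    classes = ["None", "Mild", "Severe"]
    transformed = softmax([p_none, p_mild, p_severe])
    return classes[transformed.index(max(transformed))]

# Example: the "Severe" sub-model is most confident for this sample
print(predict_class(0.20, 0.35, 0.90))  # -> Severe
```

Since softmax is monotone, the class with the largest raw probability always wins; the transformation simply rescales the three sub-model outputs into a single comparable distribution.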
1.3 Comparison between Models Based Separately on Bio and Chem
There are in total 4 linear classification models discussed in Chapter 12: Logistic
Regression, Linear Discriminant Analysis, Partial Least Squares Discriminant
Analysis and Penalized Models. The results are shown in Table 1 and Table 2. As we can
see, when we only use biological predictors, the Penalized Model yields the best
performance, with a Kappa of 0.13 with up-sampling and 0.193 with down-sampling. When
we only use chemical fingerprint predictors, Partial Least Squares Discriminant Analysis
(PLSDA) yields the best performance, with a Kappa of 0.277 with up-sampling and 0.246
with down-sampling. Based on these results, the chemical fingerprint predictors appear to
contain the most information about hepatic toxicity, a point further demonstrated when
we consider nonlinear models.
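Since Kappa is the headline statistic throughout this report, a small self-contained sketch of how it is computed (observed agreement versus chance agreement) may be useful; the toy label vectors below are illustrative only:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: kappa = (p_o - p_e) / (1 - p_e), where p_o is the
    observed agreement and p_e the agreement expected by chance."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(y_true) | set(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in classes) / n ** 2
    return (p_o - p_e) / (1 - p_e)

y_true = ["None", "None", "Mild", "Severe", "Mild", "None"]
y_pred = ["None", "Mild", "Mild", "Severe", "Mild", "None"]
print(round(cohen_kappa(y_true, y_pred), 3))  # -> 0.739
```

A Kappa of 0 means no better than chance, and 1 means perfect agreement, which is why it is a fairer statistic than raw Accuracy on imbalanced classes such as these.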
1.4 Top Predictors
3 Bridle J (1990). "Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition." In "Neurocomputing: Algorithms, Architectures and Applications," pp. 227–236. Springer-Verlag.
For the optimal model for biological predictors, the Penalized Model (up-sampling), the
top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are Z130, Z118,
Z98, Z48, Z64. See Figure 2 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are Z20, Z38, Z99,
Z53, Z79. See Figure 3 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are Z100, Z83,
Z102, Z15, Z59. See Figure 4 for details.
For the optimal model for chemical fingerprint predictors, PLSDA (up-sampling),
the top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X134, X188,
X154, X83, X72. See Figure 5 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are X140, X147,
X31, X134, X67. See Figure 6 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are X72, X113,
X44, X136, X81. See Figure 7 for details.
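The importance rankings above presumably come from caret's variable-importance machinery in R. As a generic, model-agnostic alternative, permutation importance can be sketched in a few lines of Python; the toy model and data here are purely illustrative:

```python
import random

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Score each predictor by how much shuffling its column hurts
    accuracy (a model-agnostic stand-in for caret's varImp)."""
    rng = random.Random(seed)
    def accuracy(X_):
        return sum(predict(row) == yi for row, yi in zip(X_, y)) / len(y)
    base = accuracy(X)
    scores = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [c] + row[j + 1:] for row, c in zip(X, col)]
            drops.append(base - accuracy(X_perm))
        scores.append(sum(drops) / n_repeats)
    return scores

# Toy example: the model only uses feature 0, so feature 1 scores 0.
X = [[1, 7], [-1, 7], [2, 7], [-2, 7], [3, 7], [-3, 7]]
y = [1, 0, 1, 0, 1, 0]
predict = lambda row: 1 if row[0] > 0 else 0
imp = permutation_importance(predict, X, y)
print(imp[1])  # -> 0.0 (feature 1 is never used by the model)
```

Predictors whose shuffling barely changes the metric, like feature 1 here, would fall to the bottom of a top-5 list.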
1.5 Comparison between Models Based on Both Bio and Chem
The optimal model using both biological and chemical predictors is PLSDA, which yields a
Kappa of 0.372 with up-sampling and 0.186 with down-sampling. With both sets of
predictors, the PLSDA model performs noticeably better than the models using only one
set of predictors.
The top 5 predictors for the PLSDA model (up-sampling) are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X134, X154,
Z116, Z149, Z38. See Figure 8 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are Z116, Z93, X38,
X98, X155. See Figure 9 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are Z69, Z100,
X72, Z102, Z93. See Figure 10 for details.
Comparing these top 5 lists with the previous results, we can see that for "None" and
"Severe" the top predictors appear to be drawn from both the biological-only and the
chemical-only top-5 lists; for example, for "Severe", Z100 and Z102 are both among the
top 5 predictors in the earlier results. Another interesting observation is that
Z-predictors appear more often in these top-5 lists than X-predictors, which suggests
that the biological predictors also contribute useful information once the two sets are
combined, even though the chemical fingerprint predictors carry the most information on
their own.
1.6 Suggestion
I would recommend using both the biological and the chemical predictors and training a
PLSDA model with up-sampling, which yields a quite accurate prediction. Table 3 shows
that almost all the down-sampling results are worse than the corresponding up-sampling
results, so up-sampling should be used to train the model. Also, among all the linear
classification models, PLSDA outperforms the others with a Kappa of 0.372, which
qualifies as a reasonably good prediction.
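The up-sampling recommended here simply resamples the minority classes with replacement until all class counts match the majority class. A minimal Python sketch (the actual work was presumably done through caret's sampling option in R):

```python
import random
from collections import Counter

def up_sample(X, y, seed=0):
    """Up-sampling: resample each minority class with replacement
    until every class matches the majority class size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_max = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        resampled = rows + [rng.choice(rows) for _ in range(n_max - len(rows))]
        Xb.extend(resampled)
        yb.extend([label] * n_max)
    return Xb, yb

X = [[1], [2], [3], [4], [5], [6]]
y = ["None", "None", "None", "Mild", "Mild", "Severe"]
Xb, yb = up_sample(X, y)
print(sorted(Counter(yb).items()))  # -> [('Mild', 3), ('None', 3), ('Severe', 3)]
```

Unlike down-sampling, no training rows are discarded, which matches the observation above that down-sampling starves the models of "Severe" examples.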
2 Nonlinear Classification Models (Part II)
2.1 Comparison between Models Based Separately on Bio and Chem
The nonlinear classification models discussed in Chapter 13 are Regularized
Discriminant Analysis (with Quadratic Discriminant Analysis included by setting lambda
to 1), Neural Networks, Averaged Neural Networks, Flexible Discriminant Analysis,
Support Vector Machines, K-Nearest Neighbors and Naïve Bayes. The results are shown in
Table 4 and Table 5. As we can see, when we only use biological predictors, the
Averaged Neural Network (AvNNet) yields the best performance, with a Kappa of 0.368
with up-sampling and 0.119 with down-sampling. When we only use chemical fingerprint
predictors, the Support Vector Machine (SVM) yields the best performance, with a Kappa
of 0.328 with up-sampling and 0.235 with down-sampling.
Compared with the linear classification models, when we only use biological predictors
the nonlinear structure of these models greatly improves classification performance:
the best linear model could only yield a Kappa of 0.13, while almost all the nonlinear
models yield a higher Kappa, the highest being 0.368. When we only use chemical
predictors, however, the nonlinear structure helps, but not as much as with the
biological predictors: the highest Kappa with a nonlinear model is 0.328, versus 0.277
with a linear model.
2.2 Top Predictors
For the optimal model for biological predictors, AvNNet (up-sampling), the top 5
important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are Z130, Z118,
Z98, Z48, Z64. See Figure 11 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are Z20, Z38, Z99,
Z53, Z79. See Figure 12 for details.
3) When predicting whether it's "Severe", the top 5 important variables are Z100, Z83,
Z102, Z15, Z59. See Figure 13 for details.
For the optimal model for chemical fingerprint predictors, SVM (up-sampling), the
top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X132, X1, X95,
X133, X120. See Figure 14 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are X135, X1,
X132, X28, X125. See Figure 15 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are X144, X145,
X133, X139, X81. See Figure 16 for details.
2.3 Comparison between Models Based on Both Bio and Chem
The optimal model using both biological and chemical predictors is Naïve Bayes, which
yields a Kappa of 0.306 with up-sampling and 0.403 with down-sampling. With both sets
of predictors, the Naïve Bayes model performs slightly better than the models using
only one set of predictors.
The top 5 predictors for the Naïve Bayes model (up-sampling) are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X132, X1, X95,
X133, Z130. See Figure 17 for details.
2) When predicting whether it’s “Mild”, the top 5 important variables are X135, X1,
X132, X28, X125. See Figure 18 for details.
3) When predicting whether it’s “Severe”, the top 5 important variables are X144, X145,
X133, X139, X81. See Figure 19 for details.
Compared with the previous results, the top 5 important variables are almost identical
to those obtained using only the chemical fingerprint predictors. The only difference
is in predicting "None", where the 5th most important variable is Z130 rather than
X120. This again strongly supports the earlier conclusion that the chemical fingerprint
predictors contain most of the information about hepatic toxicity, since almost all the
important variables are X-predictors (chemical fingerprint predictors).
2.4 Suggestion
I would recommend using both the biological and the chemical predictors and training a
Naïve Bayes model with up-sampling. The nonlinear structure indeed helps to improve
performance over the linear models: with a Kappa of 0.306 with up-sampling and 0.403
with down-sampling, a well-trained Naïve Bayes model outperforms the optimal linear
model. Therefore, I would recommend Naïve Bayes for predicting hepatic toxicity.
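To make the recommended classifier concrete, here is a minimal Gaussian Naïve Bayes in Python: class priors plus per-class, per-feature normal densities, with features assumed independent. The actual models were trained in R with caret, so this sketch and its toy data are purely illustrative:

```python
import math
from collections import defaultdict

class GaussianNB:
    """Minimal Gaussian Naive Bayes classifier."""
    def fit(self, X, y):
        groups = defaultdict(list)
        for xi, yi in zip(X, y):
            groups[yi].append(xi)
        n = len(y)
        self.stats = {}
        for label, rows in groups.items():
            prior = len(rows) / n
            params = []
            for col in zip(*rows):  # one (mean, variance) pair per feature
                mu = sum(col) / len(col)
                var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
                params.append((mu, var))
            self.stats[label] = (prior, params)
        return self

    def predict(self, x):
        best, best_lp = None, -math.inf
        for label, (prior, params) in self.stats.items():
            lp = math.log(prior)  # log prior + sum of log normal densities
            for v, (mu, var) in zip(x, params):
                lp += -0.5 * (math.log(2 * math.pi * var) + (v - mu) ** 2 / var)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = GaussianNB().fit([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]],
                       ["Mild", "Mild", "Severe", "Severe"])
print(clf.predict([1.1, 1.0]))  # -> Mild
```

The independence assumption is what lets the model scale to the hundreds of X- and Z-predictors here: each feature contributes one log-density term, with no joint covariance to estimate.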
3 Tree-based Classification Models (Part III)
3.1 CART & Conditional Inference Trees
Random forests were built with both CART trees and conditional inference trees using
the chemistry predictors, with the Kappa statistic as the metric. When comparing
performance in predicting the whole dataset, the CART-based forest (tuning parameter
mtry = 100) has an Accuracy of 0.568 and a Kappa of 0.21, while the conditional
inference forest (mtry = 10) has an Accuracy of 0.534 and a Kappa of 0.0996. Clearly,
the random forest built from CART trees performs better.
3.2 Computation Time Comparison
The output of the computation time is as follows:
> ## Obtain the computation time for each model
> rfCART$times$everything
user system elapsed
492.665 2.582 171.341
> rfcForest$times$everything
user system elapsed
581.095 52.354 169.595
As we can see, the CART-based forest not only performs better but also requires less
CPU time than the conditional inference forest (the elapsed times are similar, but the
user and system times are lower). Therefore, I would prefer CART over conditional
inference trees.
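R's `user`/`system`/`elapsed` split corresponds roughly to CPU time versus wall-clock time. An equivalent measurement in Python (illustrative helper; the original timings came from caret's stored `$times`) might look like:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once, returning (result, cpu_seconds, wall_seconds).
    process_time() is roughly R's user+system; perf_counter() is 'elapsed'."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.process_time() - cpu0, time.perf_counter() - wall0

result, cpu, wall = timed(sum, range(10 ** 6))
print(result)  # -> 499999500000
```

When training is parallelized, CPU time can exceed wall-clock time, which is exactly the pattern in the R output above (user ≈ 492s vs elapsed ≈ 171s).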
3.3 Top Predictors
Figure 20 and Figure 21 show the top 10 important variables for the two models.
More specifically, for CART, the top 10 important variables are: X1, X132, X71, X28, X31,
X29, X147, X30, X11, X6.
For conditional inference tree, the top 10 important variables are: X132, X134, X1, X71,
X35, X95, X139, X38, X98, X160.
The top 10 most important variables differ substantially between CART and conditional
inference trees. Conditional inference trees use statistical hypothesis tests in an
exhaustive search across predictors and their possible split points: for every
candidate split, a test evaluates the association between the split and the response.
The CART model, by contrast, chooses split points with a different objective,
maximizing the reduction in node impurity (for regression, the reduction in squared
error). This difference in objective function may be the reason the two models produce
noticeably different rankings.
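The CART objective can be made concrete: for a classification tree, each candidate split is scored by the decrease in Gini impurity it produces. A small illustrative sketch:

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_reduction(parent, left, right):
    """Impurity decrease for a candidate split: G(parent) minus the
    size-weighted impurities of the two child nodes."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A perfect split of a balanced two-class node removes all impurity.
parent = ["None"] * 4 + ["Severe"] * 4
print(gini_reduction(parent, ["None"] * 4, ["Severe"] * 4))  # -> 0.5
```

A conditional inference tree would instead run an association test at each candidate split and pick the most significant one, which is why the two algorithms can rank the same predictors quite differently.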