Hepatic Injury Classification 
STAT W4240 Section 3 Data Mining 
Individual Project Two 
Michael Jiang 
zj2160 
1 Linear Classification Models (Part I) 
1.1 Classification Imbalance 
The hepatic injury status response variable is categorical with three classes: "None", "Mild", and "Severe". The distribution of this response variable is shown in Figure 1. As we can see, the distribution is highly imbalanced, which poses a serious problem for model training. There are several ways to handle this problem; one of the most popular is to use sampling techniques to reconstruct a balanced training dataset. Ling and Li (1998)1 provide an up-sampling approach in which cases from the minority classes are sampled with replacement until each class has approximately the same number of observations. I prefer up-sampling over down-sampling in this context because the number of "Severe" observations is so limited that a down-sampled training dataset would be too small for the models to be trained well2. Therefore,
the training dataset is created as follows:
1) First, set the random seed (taken from the UNI zj2160) and randomly assign every sample in the whole dataset to either the training dataset or the test dataset, with an 80% probability of assignment to the training dataset.
2) Next, use the upSample function from the caret library to reconstruct the training dataset so that the new training dataset is balanced (see the sketch below).
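A minimal R sketch of this procedure, under assumed object names (injury for the three-level response factor, bio for the biological predictor data frame) and with the seed written as the numeric part of the UNI:

library(caret)

## Step 1: an 80/20 random split of the full dataset
set.seed(2160)                            # numeric form of the UNI zj2160 (assumed)
inTrain <- runif(length(injury)) < 0.80   # each sample has an 80% chance of being training
trainX  <- bio[inTrain, , drop = FALSE]
trainY  <- injury[inTrain]
testX   <- bio[!inTrain, , drop = FALSE]
testY   <- injury[!inTrain]

## Step 2: balance only the training data; upSample() resamples the minority
## classes with replacement until all class counts are equal.
balanced <- upSample(x = trainX, y = trainY, yname = "injury")

The test set is deliberately left untouched, for the reason discussed next.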
Do I need to use the up-sampling method to reconstruct the test dataset as well? The answer is no: while the training dataset is sampled to be balanced, the test dataset should remain consistent with the state of nature and reflect the original imbalance, so that honest estimates of future performance can be computed.

1 Ling C, Li C (1998). "Data Mining for Direct Marketing: Problems and Solutions." In "Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining," pp. 73-79.
2 I tested both the down-sampling and the up-sampling method; the comparison can be found later in this report. In fact, the prediction performance of most models improves when down-sampling is replaced with up-sampling. The reason may be that, with so few "Severe" samples, down-sampling produces a small training dataset, so the models cannot be trained well.

1.2 Classification Statistic
There are three common classification statistics: AUC (area under the ROC curve), Kappa, and Accuracy. For a 2-class classification problem we would usually use AUC as the classification statistic; in this context, however, the response variable contains 3 classes. Two solutions present themselves:
1) Use Kappa or Accuracy as the classification statistic, since AUC is only defined for 2-class problems. However, some models are not natively suited to multi-class classification, such as the logistic regression model (although multinomial logistic regression compensates for this).
2) Still use Kappa or Accuracy as the final classification statistic, but build k sub-models, one for each of the k classes. More specifically, create k binary variables whose value is 1 if the sample belongs to the corresponding class and 0 otherwise. In this context, the binary variables are:
$$\mathrm{None}_i = \begin{cases} 1, & \text{if sample } i \text{ is of ``None'' severity} \\ 0, & \text{otherwise} \end{cases}$$

$$\mathrm{Mild}_i = \begin{cases} 1, & \text{if sample } i \text{ is of ``Mild'' severity} \\ 0, & \text{otherwise} \end{cases}$$

$$\mathrm{Severe}_i = \begin{cases} 1, & \text{if sample } i \text{ is of ``Severe'' severity} \\ 0, & \text{otherwise} \end{cases}$$
Then I would train 3 separate models using these 3 response variables. When selecting tuning parameters, I would use AUC to choose the optimal value. When predicting, I would combine the 3 probability predictions using the softmax transformation (Bridle 1990)3, defined as

$$\hat{y}_l^* = \frac{e^{\hat{y}_l}}{\sum_{k=1}^{3} e^{\hat{y}_k}}$$

where $\hat{y}_l$ is the probability prediction for the $l$th class and $\hat{y}_l^*$ is the transformed value between 0 and 1. The final prediction is the class with the largest $\hat{y}_l^*$.
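A minimal R sketch of this one-vs-all scheme, under the same assumed names as before (injury is the three-level response; pred below stands for the combined class predictions on the test set and is hypothetical):

library(caret)

## One-vs-all binary response factors
noneY   <- factor(ifelse(injury == "None",   "None",   "Other"))
mildY   <- factor(ifelse(injury == "Mild",   "Mild",   "Other"))
severeY <- factor(ifelse(injury == "Severe", "Severe", "Other"))

## Softmax combination of the three per-class probability predictions
softmax <- function(y) exp(y) / sum(exp(y))
p     <- c(None = 0.62, Mild = 0.55, Severe = 0.10)  # illustrative probabilities
pStar <- softmax(p)                                  # transformed values sum to 1
names(which.max(pStar))                              # final predicted class: "None"

## Final performance on the (imbalanced) test set: Accuracy and Kappa
## via caret::confusionMatrix, assuming factors pred and testY exist
cm <- confusionMatrix(data = pred, reference = testY)
cm$overall[c("Accuracy", "Kappa")]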
I decided to use the latter approach since it can accommodate all of the models. The final classification statistic for measuring prediction performance is Kappa, with Accuracy also reported as a reference.

3 Bridle J (1990). "Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition." In "Neurocomputing: Algorithms, Architectures and Applications," pp. 227-236. Springer-Verlag.
1.3 Comparison between Models Based Separately on Bio and Chem 
There are in total four linear classification models discussed in Chapter 12: the Logistic Regression Model (LRM), Linear Discriminant Analysis (LDA), Partial Least Squares Discriminant Analysis (PLSDA), and Penalized Models. The results are reported in Tables 1 and 2. As we can see, when we use only the biological predictors, the Penalized Models yield the best performance, with a Kappa of 0.13 under up-sampling and 0.193 under down-sampling. When we use only the chemical fingerprint predictors, PLSDA yields the best performance, with a Kappa of 0.277 under up-sampling and 0.246 under down-sampling. Based on these results, the chemical fingerprint predictors appear to contain the most information about hepatic toxicity, a point further demonstrated when we consider the non-linear models.
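As an illustration of the tuning scheme from Section 1.2, here is a hedged sketch of fitting PLSDA for the one-vs-all "None" response with caret, selecting the number of PLS components by cross-validated ROC (the balanced data frame comes from the earlier up-sampling sketch):

library(caret)

## 10-fold CV with class probabilities so ROC can serve as the tuning metric
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

## One-vs-all response for "None", built from the balanced training data
noneY <- factor(ifelse(balanced$injury == "None", "None", "Other"))

## PLSDA: caret fits partial least squares on a factor response;
## tuneLength controls how many component counts are tried.
plsFit <- train(x = balanced[, setdiff(names(balanced), "injury")],
                y = noneY, method = "pls", metric = "ROC",
                tuneLength = 10, trControl = ctrl,
                preProcess = c("center", "scale"))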
1.4 Top Predictors 
For the optimal model on the biological predictors, the Penalized Model (up-sampling), the top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are Z130, Z118, 
Z98, Z48, Z64. See Figure 2 for details. 
2) When predicting whether it’s “Mild”, the top 5 important variables are Z20, Z38, Z99, 
Z53, Z79. See Figure 3 for details. 
3) When predicting whether it’s “Severe”, the top 5 important variables are Z100, Z83, 
Z102, Z15, Z59. See Figure 4 for details. 
For the optimal model on the chemical fingerprint predictors, PLSDA (up-sampling), the top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X134, X188, 
X154, X83, X72. See Figure 5 for details. 
2) When predicting whether it’s “Mild”, the top 5 important variables are X140, X147, 
X31, X134, X67. See Figure 6 for details. 
3) When predicting whether it’s “Severe”, the top 5 important variables are X72, X113, 
X44, X136, X81. See Figure 7 for details. 
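The rankings above come from variable-importance plots; in caret these are typically produced with the varImp() generic. A hedged one-line illustration for the PLSDA fit sketched earlier (not necessarily the exact call behind Figures 2-7):

library(caret)
plot(varImp(plsFit), top = 5)   # top-5 variable importance for the fitted model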
1.5 Comparison between Models Based on Both Bio and Chem 
The optimal model using both the biological and chemical predictors is PLSDA, which yields a Kappa of 0.372 under up-sampling and 0.186 under down-sampling. With both sets of predictors, the PLSDA model performs considerably better than the models built on only one set of predictors.
The top 5 predictors for the PLSDA model (up-sampling) are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X134, X154, 
Z116, Z149, Z38. See Figure 8 for details. 
2) When predicting whether it’s “Mild”, the top 5 important variables are Z116, Z93, X38, 
X98, X155. See Figure 9 for details. 
3) When predicting whether it’s “Severe”, the top 5 important variables are Z69, Z100, 
X72, Z102, Z93. See Figure 10 for details. 
Comparing these top 5 important variables with the previous results, we can see that for "None" and "Severe" the top 5 predictors appear to be drawn from the separate top 5 lists for the biological and the chemical predictors; for example, for "Severe", Z100 and Z102 are both among the top 5 predictors in the earlier results. Another interesting observation is that Z-predictors make up a larger share of the top 5 lists than X-predictors, which sits somewhat at odds with the earlier finding that the chemical fingerprint predictors contain the most information about hepatic toxicity.
1.6 Suggestion 
I would recommend using both the biological and the chemical predictor information, and training a PLSDA model with up-sampling; this yields quite accurate predictions. Table 3 makes it easy to see that almost all of the down-sampling results are worse than their up-sampling counterparts, so the up-sampling method should be used to train the model. Also, among all the linear classification models, PLSDA outperforms the others with a Kappa of 0.372, which qualifies as a good prediction.
2 Nonlinear Classification Models (Part II) 
2.1 Comparison between Models Based Separately on Bio and Chem 
There are in total seven nonlinear classification models discussed in Chapter 13: Regularized Discriminant Analysis (RDA; I folded Quadratic Discriminant Analysis into it by setting lambda to 1), Neural Network (NNet), Averaged Neural Network (AvNNet), Flexible Discriminant Analysis (FDA), Support Vector Machine (SVM), K-Nearest Neighbors (kNN), and Naïve Bayes. The results are reported in Tables 4 and 5. As we can see, when we use only the biological predictors, AvNNet yields the best performance, with a Kappa of 0.368 under up-sampling and 0.119 under down-sampling. When we use only the chemical fingerprint predictors, the SVM yields the best performance, with a Kappa of 0.328 under up-sampling and 0.235 under down-sampling.
Compared with the linear classification models, when we use only the biological predictors the nonlinear structure greatly improves classification performance: the best linear model could only yield a Kappa of 0.13, while almost all the nonlinear models yield a higher Kappa, the highest being 0.368. When we use only the chemical predictors, the nonlinear structure does help, but not as much as in the biological case: the highest Kappa with a nonlinear model is 0.372, versus 0.277 for the best linear model.
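For concreteness, a hedged sketch of tuning one of these nonlinear models (the RBF-kernel SVM) for the one-vs-all "None" response, reusing ctrl, balanced, and noneY from the earlier sketches; tuneLength is an arbitrary choice:

library(caret)

## SVM with a radial basis kernel (kernlab backend); the cost parameter C
## is tuned over tuneLength values while sigma is estimated analytically.
svmFit <- train(x = balanced[, setdiff(names(balanced), "injury")],
                y = noneY, method = "svmRadial", metric = "ROC",
                tuneLength = 8, trControl = ctrl,
                preProcess = c("center", "scale"))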
2.2 Top Predictors 
For the optimal model on the biological predictors, AvNNet (up-sampling), the top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are Z130, Z118, 
Z98, Z48, Z64. See Figure 11 for details. 
2) When predicting whether it’s “Mild”, the top 5 important variables are Z20, Z38, Z99, 
Z53, Z79. See Figure 12 for details. 
3) When predicting whether it’s “Severe”, the top 5 important variables are Z100, Z83,
Z102, Z15, Z59. See Figure 13 for details. 
For the optimal model on the chemical fingerprint predictors, the SVM (up-sampling), the top 5 important predictors are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X132, X1, X95, 
X133, X120. See Figure 14 for details. 
2) When predicting whether it’s “Mild”, the top 5 important variables are X135, X1, 
X132, X28, X125. See Figure 15 for details. 
3) When predicting whether it’s “Severe”, the top 5 important variables are X144, X145, 
X133, X139, X81. See Figure 16 for details. 
2.3 Comparison between Models Based on Both Bio and Chem 
The optimal model using both the biological and chemical predictors is Naïve Bayes, which yields a Kappa of 0.306 under up-sampling and 0.403 under down-sampling. With both sets of predictors, the Naïve Bayes model performs slightly better than the models built on only one set of predictors.
The top 5 predictors for the Naïve Bayes model (up-sampling) are as follows:
1) When predicting whether it’s “None”, the top 5 important variables are X132, X1, X95, 
X133, Z130. See Figure 17 for details. 
2) When predicting whether it’s “Mild”, the top 5 important variables are X135, X1, 
X132, X28, X125. See Figure 18 for details. 
3) When predicting whether it’s “Severe”, the top 5 important variables are X144, X145, 
X133, X139, X81. See Figure 19 for details. 
Compared with the previous results, the top 5 important variables are almost identical to those obtained using only the chemical fingerprint predictors. The only difference is that, in predicting "None", the 5th most important variable is Z130 rather than X120. This again strongly confirms the earlier conclusion that the chemical fingerprint predictors contain most of the information about hepatic toxicity, since almost all of the important variables are X-predictors (chemical fingerprint predictors).
2.4 Suggestion 
I would recommend using both the biological and the chemical predictor information, and training a Naïve Bayes model with up-sampling. The nonlinear structure indeed helps to improve performance over the linear models: with a Kappa of 0.306 under up-sampling and 0.403 under down-sampling, the well-trained Naïve Bayes model outperforms the optimal linear model. Therefore, I would recommend Naïve Bayes for predicting hepatic toxicity.
3 Tree-based Classification Models (Part III) 
3.1 CART & Conditional Inference Trees 
Both models, a random forest of CART trees and a random forest of conditional inference trees, are built using the chemical predictors, with the Kappa statistic as the tuning metric. Comparing performance on the whole dataset, the CART forest (tuning parameter mtry = 100) has an Accuracy of 0.568 and a Kappa of 0.21, while the conditional inference forest (mtry = 10) has an Accuracy of 0.534 and a Kappa of 0.0996. The random forest of CART trees clearly performs better.
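A hedged sketch of how these two forests might be fit with caret; the object names rfCART and rfcForest match the timing output in the next subsection, chemX and trainY are assumed names, and the mtry values are the tuned ones quoted above:

library(caret)

## Random forest of CART trees (randomForest backend)
rfCART <- train(x = chemX, y = trainY, method = "rf",
                metric = "Kappa", tuneGrid = data.frame(mtry = 100))

## Random forest of conditional inference trees (party::cforest backend)
rfcForest <- train(x = chemX, y = trainY, method = "cforest",
                   metric = "Kappa", tuneGrid = data.frame(mtry = 10))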
3.2 Computation Time Comparison 
The computation-time output for each model is as follows:
> ## Obtain the computation time for each model 
> rfCART$times$everything 
user system elapsed 
492.665 2.582 171.341 
> rfcForest$times$everything 
user system elapsed 
581.095 52.354 169.595 
As we can see, the CART forest not only performs better but also requires less CPU time than the conditional inference forest (user time of 492.7 s versus 581.1 s; the elapsed times are nearly identical). Therefore, I would prefer CART over conditional inference trees.
3.3 Top Predictors 
Figures 20 and 21 show the top 10 important variables for the two models.
More specifically, for CART, the top 10 important variables are: X1, X132, X71, X28, X31, 
X29, X147, X30, X11, X6. 
For conditional inference tree, the top 10 important variables are: X132, X134, X1, X71, 
X35, X95, X139, X38, X98, X160. 
The top 10 most important variables are mostly different between CART and conditional inference trees. In a conditional inference tree, an exhaustive search is made across the predictors and their possible split points, and for every candidate split a statistical hypothesis test is used to evaluate the difference between the two groups created by the split. The CART model, in contrast, chooses split points with a different objective function: maximizing the reduction in node impurity (for classification, typically measured by the Gini index). This difference in objective function may be why the two models show such a noticeable difference.
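To make the contrast concrete, here is a minimal sketch fitting one tree of each kind; chemData is an assumed data frame holding the chemical predictors together with the injury factor:

## Single CART tree: greedy splits that maximize impurity reduction (Gini)
library(rpart)
cartTree <- rpart(injury ~ ., data = chemData, method = "class")

## Single conditional inference tree: splits chosen via permutation tests
library(party)
condTree <- ctree(injury ~ ., data = chemData)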
Table 1. Linear models, biological predictors. ROC values are computed on the training dataset; Kappa is computed on the test dataset. "Up" = up-sampling method, "Down" = down-sampling method.

| Model | Up ROC(None) | Up ROC(Mild) | Up ROC(Severe) | Up Kappa | Down ROC(None) | Down ROC(Mild) | Down ROC(Severe) | Down Kappa |
|---|---|---|---|---|---|---|---|---|
| LRM | 0.619 | 0.546 | 0.753 | 0.0977 | 0.578 | 0.552 | 0.628 | 0.0147 |
| LDA | 0.556 | 0.555 | 0.846 | 0.0749 | 0.554 | 0.567 | 0.567 | 0.0989 |
| PLSDA | 0.601 | 0.569 | 0.925 | 0.125 | 0.642 | 0.623 | 0.806 | 0.102 |
| Penalized Models | 0.622 | 0.577 | 0.891 | 0.13 | 0.612 | 0.61 | 0.811 | 0.193 |
Table 2. Linear models, chemical fingerprint predictors (same layout as Table 1).

| Model | Up ROC(None) | Up ROC(Mild) | Up ROC(Severe) | Up Kappa | Down ROC(None) | Down ROC(Mild) | Down ROC(Severe) | Down Kappa |
|---|---|---|---|---|---|---|---|---|
| LRM | 0.645 | 0.593 | 0.863 | 0.167 | 0.625 | 0.613 | 0.674 | 0.115 |
| LDA | 0.729 | 0.633 | 0.91 | 0.205 | 0.596 | 0.615 | 0.706 | 0.176 |
| PLSDA | 0.741 | 0.659 | 0.97 | 0.277 | 0.643 | 0.651 | 0.76 | 0.246 |
| Penalized Models | 0.704 | 0.672 | 0.922 | 0.166 | 0.642 | 0.621 | 0.717 | 0.212 |
Table 3. Linear models, biological and chemical predictors (same layout as Table 1).

| Model | Up ROC(None) | Up ROC(Mild) | Up ROC(Severe) | Up Kappa | Down ROC(None) | Down ROC(Mild) | Down ROC(Severe) | Down Kappa |
|---|---|---|---|---|---|---|---|---|
| LRM | 0.624 | 0.579 | 0.646 | 0.159 | 0.601 | 0.55 | 0.628 | 0.0287 |
| LDA | 0.647 | 0.619 | 0.776 | 0.0739 | 0.587 | 0.584 | 0.717 | 0.0695 |
| PLSDA | 0.783 | 0.648 | 0.983 | 0.372 | 0.634 | 0.627 | 0.785 | 0.186 |
| Penalized Models | 0.698 | 0.634 | 0.96 | 0.353 | 0.615 | 0.621 | 0.833 | 0.186 |
Table 4. Non-linear models, biological predictors (same layout as Table 1).

| Model | Up ROC(None) | Up ROC(Mild) | Up ROC(Severe) | Up Kappa | Down ROC(None) | Down ROC(Mild) | Down ROC(Severe) | Down Kappa |
|---|---|---|---|---|---|---|---|---|
| RDA | 0.81 | 0.589 | 0.979 | 0.0811 | 0.633 | 0.645 | 0.792 | 0.0926 |
| NNet | 0.75 | 0.62 | 0.971 | 0.2 | 0.666 | 0.62 | 0.733 | -0.0622 |
| AvNNet | 0.793 | 0.597 | 0.987 | 0.368 | 0.645 | 0.621 | 0.825 | 0.119 |
| FDA | 0.593 | 0.579 | 0.902 | 0.284 | 0.573 | 0.512 | 0.642 | 0.135 |
| SVM | 0.688 | 0.598 | 0.945 | 0.253 | 0.596 | 0.618 | 0.9 | -0.069 |
| kNN | 0.764 | 0.625 | 0.958 | 0.14 | 0.628 | 0.643 | 0.725 | 0.0353 |
| Naïve Bayes | 0.669 | 0.575 | 0.921 | 0.0245 | 0.597 | 0.583 | 0.711 | 0.162 |
Table 5. Non-linear models, chemical fingerprint predictors (same layout as Table 1).

| Model | Up ROC(None) | Up ROC(Mild) | Up ROC(Severe) | Up Kappa | Down ROC(None) | Down ROC(Mild) | Down ROC(Severe) | Down Kappa |
|---|---|---|---|---|---|---|---|---|
| RDA | 0.785 | 0.601 | 0.974 | 0.249 | 0.648 | 0.605 | 0.7 | 0.225 |
| NNet | 0.854 | 0.708 | 0.991 | 0.314 | 0.665 | 0.672 | 0.758 | 0.21 |
| AvNNet | 0.877 | 0.714 | 0.998 | 0.225 | 0.68 | 0.64 | 0.767 | 0.295 |
| FDA | 0.748 | 0.695 | 0.89 | 0.215 | 0.67 | 0.666 | 0.65 | 0.0911 |
| SVM | 0.821 | 0.659 | 0.98 | 0.328 | 0.648 | 0.628 | 0.7 | 0.235 |
| kNN | 0.786 | 0.629 | 0.96 | 0.372 | 0.655 | 0.592 | 0.718 | 0.0155 |
| Naïve Bayes | 0.77 | 0.561 | 0.829 | 0.247 | 0.627 | 0.617 | 0.625 | 0.222 |
Table 6. Non-linear models, biological and chemical fingerprint predictors (same layout as Table 1).

| Model | Up ROC(None) | Up ROC(Mild) | Up ROC(Severe) | Up Kappa | Down ROC(None) | Down ROC(Mild) | Down ROC(Severe) | Down Kappa |
|---|---|---|---|---|---|---|---|---|
| RDA | 0.801 | 0.597 | 0.999 | 0.274 | 0.63 | 0.596 | 0.769 | 0.252 |
| NNet | 0.803 | 0.612 | 0.957 | 0.176 | 0.622 | 0.617 | 0.747 | 0.242 |
| AvNNet | 0.856 | 0.679 | 0.995 | 0.248 | 0.62 | 0.625 | 0.747 | 0.208 |
| FDA | 0.746 | 0.637 | 0.931 | 0.213 | 0.62 | 0.641 | 0.75 | 0.0495 |
| SVM | 0.836 | 0.601 | 0.99 | 0.137 | 0.621 | 0.629 | 0.775 | -0.0088 |
| kNN | 0.789 | 0.628 | 0.933 | 0.289 | 0.644 | 0.617 | 0.8 | 0.0976 |
| Naïve Bayes | 0.778 | 0.547 | 0.896 | 0.306 | 0.601 | 0.61 | 0.667 | 0.403 |
Figure 1: distribution of the hepatic injury status response variable.
Figures 2-19: top-5 variable importance plots for the models discussed in Sections 1.4, 1.5, 2.2, and 2.3.
Figures 20-21: top-10 variable importance plots for the CART forest and the conditional inference forest.