University of Kent
School of Computing
Predicting volume of distribution for drug compounds
using decision trees
NITHYAKALYANI CHINNAIAH
MSc ADVANCED COMPUTER SCIENCE
2015/2016
Under the supervision of Professor Alex Freitas
Word count: About 10,300
Acknowledgements
Firstly, I am grateful to the Almighty for giving me strength, good health and perseverance
to complete this project successfully. My heartfelt thanks to my supervisor, Professor Alex
Freitas for his invaluable guidance and support throughout the project. His timely comments
and suggestions have aided my understanding significantly. I am indebted to my family and
friends. My parents and my sister gave me lots of confidence and encouragement while my
friends stood by me during difficult times. My husband and my daughter put up with me a
lot as I could not spend time with them for the past many months and they have been my
real motivation to complete this project on time.
Abstract
The project is about improving the accuracy of predicting the numerical value of the volume of distribution (VD), which is a proportionality constant for drug compounds. While there have been many approaches using regression with decision trees to predict VD, very few research studies have used a classification scheme. This project proposes a method in which the two approaches are combined, with the addition of a classification confidence measure. The two-stage classification-regression approach is studied with six different data divisions. In the proposed approach, a test drug's class is predicted using molecular descriptors as attributes and, if the prediction confidence is higher than a predetermined threshold, the log VD value is obtained using the regression decision tree built for that particular class. If the classification confidence is not high, the approach resorts to standard regression to obtain the log VD value. This two-stage approach appears promising, as the cross-validated performance improves over the standard regression approach for two of the studied data divisions (the equal one-third and one-fifth divisions), which gave geometric mean fold errors (GMFE) of 2.1502 and 2.2013 compared to the standard regression GMFE of 2.2373 and 2.2356, respectively.
Table of contents
ABSTRACT
ACKNOWLEDGEMENT
CHAPTER ONE: Introduction
1.1 Contributions
1.2 Structure of dissertation
CHAPTER TWO: Background
2.1 Volume of distribution
2.2 An example of a drug and molecular descriptors
2.3 The classification task of data mining
2.4 Random Forest
2.5 Regression
2.6 Discretisation
CHAPTER THREE: Literature Review on Data Mining for Predicting Volume of Distribution
3.1 Summary of literature review
CHAPTER FOUR: Proposed Methodology
4.1 Discretisation of classes
4.2 Two-stage classification regression approach
4.2.1 Dividing data into training, validation and testing folds
4.2.2 Two stage approach - threshold detection
4.2.3 Error computation
4.3 Fold creation to compare the proposed approach with standard regression
4.3.1 Standard regression
4.3.2 Proposed two-stage approach
CHAPTER FIVE: Computed Results and Analysis
5.1 Two stage approach
5.1.1 Deciding the threshold for two stage approach
CHAPTER SIX: Conclusion and Future Work Suggestions
6.1 Suggestions for future work
6.2 Reflection
REFERENCES
APPENDIX – Matlab program
Chapter 1: Introduction
This research project is about improving the prediction of the volume of distribution (VD) for drug compounds using classification approaches. VD is one of the most important pharmacokinetic parameters of a drug, relating the amount of drug in the body to the measured concentration in plasma. VD can be determined in-vivo (animal models, which are time consuming and also raise ethical concerns), in-vitro (which is also time consuming and requires costly biological assays) or with a third approach, known as in-silico. The advantages of theoretical in-silico approaches are high throughput, i.e. they are less time consuming and cheaper (Freitas et al, 2015). One can therefore use an in-silico model first and then do in-vivo or in-vitro experiments for more accurate results. Moreover, only this approach allows un-synthesised drugs to be evaluated, and reliable in-silico predictions of this pharmacokinetic parameter would benefit the early stages of drug discovery.
The in-silico predictions are usually performed using regression approaches, and very few studies have used a classification approach. Although several data mining algorithms have been used to predict VD, the prediction error is still high because the range of VD values is quite large. Hence, the hypothesis to be tested here is that the predictive performance can be improved by discretising the numerical VD values¹ into a few categorical classes and then applying regression. Obviously, the performance will vary depending on the type of discretisation used.
Hence, the goal of this work would be to predict numerical VD using the predictions
of discretised classes. This dissertation proposes a novel two-stage classification-regression
approach with confidence measure that improves the prediction of VD. The data set that will
be used is extracted from Obach’s database (Obach et al, 2008) and consists of 604 different
drugs (instances).
The motivation behind this work is to predict the VD values for new drugs using data
mining algorithms as mentioned. Extracted knowledge from the research could be used by
pharmaceutical scientists in determining the appropriate dosage regimen. This would benefit
early stages of drug discovery as experimental measurement is not feasible for screening
purposes.
¹ Actually, log VD rather than VD, due to the large range.
1.1 Contributions
The contributions of the work presented here are:
• Six different discretisations of the log VD values into categorical classes have been explored;
• A two-stage classification-regression approach has been developed with random forest (RF) classification models and RF regression models;
• A classification confidence score, measured from the distribution of class votes in the first-stage RF classification, is used to decide which specific RF regression model is applied in the second stage;
• The two-stage classification-regression approach using the confidence measure improves the VD prediction compared to standard regression.
1.2 Structure of dissertation
The next chapter gives background details for the work, such as VD and drug compounds, followed by a generic description of the classification and regression tasks in data mining. RF, which is the approach used in this dissertation, is also described. A literature review of relevant work on classification and regression for predicting VD is given in Chapter 3. Chapter 4 discusses the proposed methodology, which starts with class discretisation, and then the two-stage approach is described. Results and analysis are given in Chapter 5. The final chapter gives concluding remarks and reflections on the dissertation, with some suggestions for possible future work.
Chapter 2: Background
2.1 Volume of distribution
Jambhekar and Breen (2015) define VD as a proportionality constant and not a physiological volume. VD relates the plasma concentration to the mass of drug in the body at a given time and is one of the most important pharmacokinetic parameters of a drug. The magnitude of VD is unrelated to the actual plasma volume, extracellular volume or total body volume and can vary from 7 to 200 litres in a 70 kg subject. Even though VD has no direct physical meaning, it is important to determine VD correctly, as otherwise it will be difficult to give patients the appropriate dosage of a drug. For example, two different drugs could have different VD because of their physicochemical properties (i.e. chemical structure), and this causes different concentrations at steady state even if the drugs have the same initial concentrations and elimination half-lives. The higher the VD, the higher the ability of the drug to be absorbed by the tissues. Acidic drugs are highly water soluble and do not penetrate into tissues to a significant degree and hence have lower VD values. On the other hand, basic drugs bind easily to the extracellular tissues and hence have a larger VD. As a result, a large range of variation exists for VD, which makes its prediction a non-trivial task.
In general, VD can be determined using three approaches, namely in-vivo, in-vitro and in-silico (Freitas et al, 2015). In-vivo models give a lot of information about the pharmacokinetic properties of drugs, but they involve ethical issues because animals are used; furthermore, this approach is time consuming and costly. In-vitro is better than in-vivo in terms of cost and time but is nevertheless still costly and time consuming, as it involves biological assays. In-silico approaches are based on theoretical models and, though they lack the abundance of information available from the other, experimental approaches, they are less time consuming and less costly. So, it is possible to use an in-silico model first and then do in-vivo or in-vitro experiments for more accurate results. In addition, only this approach allows direct usage of human data, which can be used to evaluate the pharmacokinetics of un-synthesised drugs.
2.2 An example of a drug and molecular descriptors
Hydroxychloroquine, which is used to treat malaria, is an example of a drug. Its molecular formula is C18H26ClN3O and it has a VD value of 700 L/kg. Figure 1 shows its molecular structure.
Figure 1: Molecular structure for Hydroxychloroquine (image from
https://pubchem.ncbi.nlm.nih.gov/compound/3652#section=2D-Structure)
A drug can be described by a number of molecular descriptors, such as surface
tension, molar volume, the numbers of hydrogen (H) donors and H acceptors, etc. For
instance, for the hydroxychloroquine drug, the number of H donors is 2 and the number of H
acceptors is 4.
2.3 The Classification Task of Data Mining
There are, in general, two types of machine learning. The first is supervised learning, where each training example has a pre-defined class label; an example of supervised learning in data mining is the classification task. The second is unsupervised learning, where examples are not assigned pre-defined classes (on purpose or because they are unavailable); an example of unsupervised learning in data mining is the clustering task, which is generally useful as a first step in exploring the data space.
As the classification task is used in this project, it is briefly explained in this section. Classification depends on the predictive power of the features of the data and usually has a predefined set of classes or groups. The goal of any classification technique is to build a model, using the available training data, that automatically predicts the class of future unseen data based on a set of specific characteristics known as features. In other words, classification induces a model from the training set in order to predict the class of test examples. Examples of classification approaches are Bayesian learning, instance-based learning and decision trees (DT).
Bayesian classification uses the Naïve Bayes approach, which is based on probability theory. It assumes each attribute to be independent of the others given the class, and hence its classification accuracy is particularly sensitive to redundant attributes. On the other hand, it can handle missing attributes precisely because attributes are treated as conditionally independent given the class.
Decision tree classification builds a DT in a top-down fashion. It selects one attribute at a time, partitions the examples according to the values of the selected attribute and repeats this process recursively. Information gain, which is based on information theory, is a commonly used attribute selection criterion. A decision tree can easily be converted into a set of IF-THEN rules, and due to the comprehensibility of the classifier it is widely used.
Instance-based learning, such as k-nearest neighbour, is a straightforward approach that does not require a model to be built. A new instance is assigned the class of the closest instance(s) in the training data. Some applications may have a dataset too complex to allow the discovery of a simple abstract model, and in such cases this approach is useful.
It is common to use an ensemble of learning approaches to increase class prediction accuracy, where the aggregated output of the classifiers (e.g. via a majority voting strategy) improves predictive accuracy as long as the classifiers are diverse and make uncorrelated prediction errors. However, the use of such an ensemble requires more computation.
The performance of the classification task is usually evaluated using the predictive accuracy, defined as the ratio of correctly classified examples in the test data over the total number of examples in the test data. Other performance measures are the confusion matrix, sensitivity, precision, recall, etc.
It is common to divide the available dataset into training, validation and testing divisions in order to train and test the developed classifier. The validation data is used to obtain optimised values of certain parameters used in the classifiers. Obviously, how the data is divided will affect the performance, and to minimise the variance in performance resulting from the data division, cross-validation is typically performed. In a k-fold cross validation, the available data is divided into k non-overlapping subsets; k-1 folds are used as training data while the remaining fold is used as test data and classification is performed. This procedure is repeated k times and the average of the k performances is computed and used as the actual performance measure. The commonly used value of k is 10. In the extreme case, k is set to the number of available examples, which is known as the leave-one-out approach. The advantage of k-fold cross validation is that it is statistically more reliable than a single training/test partition, but it has the disadvantage of being computationally expensive.
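As an illustration, the sketch below shows one way such a stratified k-fold cross-validation could be coded in Matlab; the variable names (X for the attribute matrix, labels for the class labels) are hypothetical and the Statistics and Machine Learning Toolbox is assumed:

k = 10;                                        % number of folds
cvp = cvpartition(labels,'KFold',k);           % stratified folds over the class labels
acc = zeros(k,1);
for f = 1:k
    trainIdx = training(cvp,f);                % logical index of training examples
    testIdx  = test(cvp,f);                    % logical index of test examples
    model = TreeBagger(100,X(trainIdx,:),labels(trainIdx));
    pred  = predict(model,X(testIdx,:));
    acc(f) = mean(strcmp(pred,labels(testIdx))); % fold predictive accuracy
end
meanAccuracy = mean(acc);                      % averaged over the k folds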
2.4 Random Forest
Random forest (RF) is described by Touw et al (2012) as a versatile classification method that is appropriate for analysing large data sets. It is a classification technique that builds a model based on variables to separate the examples into different classes. The algorithm is popular because of its high prediction accuracy and its ability to provide information about the importance of variables for classification. RF is actually an ensemble of individual decision trees where each tree in the forest is built using a random subset of variables, hence the name RF.
RF uses, by default (this parameter can be changed), the square root of the total number of variables as the number of candidate features to be considered at each tree node. The performance of RF depends on the strength of each individual tree in the forest and the correlation between them (Breiman, 2001). RF works by building an ensemble of trees; each tree's prediction is taken as a vote, and a strategy such as majority voting is used to select the most popular class, which can significantly improve the overall classification accuracy. To minimise the correlation between the built trees, each tree is built from randomly selected training examples, as in the bagging method (Breiman, 1996), where each tree is grown from a bootstrap sample of training examples drawn with replacement from the training set. Another method is random split selection, where the split at each node is selected randomly from among the k best splits (Dietterich, 1998). Yet another approach is to select the training set from a random set of weights on the examples in the training set.
RF has a number of advantages. It uses randomly selected attributes to produce good results in classification, though less so in regression. It is also capable of extracting knowledge from omics data, for example when there are interactions between variables (Touw et al, 2012).
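As a minimal sketch (with hypothetical variable names), an RF classifier of this kind can be built in Matlab with TreeBagger, using a bootstrap sample per tree and sqrt(number of attributes) candidate variables at each split; the fraction of tree votes per class is also returned and is used later in this dissertation as a confidence measure:

p  = size(X,2);                                      % number of attributes
rf = TreeBagger(100,X,labels,'nvartosample',round(sqrt(p)));
[predClass,votes] = predict(rf,Xnew);                % votes: fraction of the 100 trees voting for each class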
2.5 Regression
Regression is a data mining task that predicts the value of a continuous (real-valued) variable using several independent attributes. Outputs such as age, weight, distance, temperature, income, sales, etc. can be predicted using regression techniques. RF decision trees can also be used for regression, predicting numeric quantities rather than categories. The same kind of trees can be used, but each leaf contains a numeric value that is the average of the output values across all the training set examples that reach the leaf (Witten et al, 2011).
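A corresponding RF regression sketch (again with hypothetical variable names) only differs in the 'Method' argument; the forest prediction for a new example is the average of the per-tree leaf values:

rfReg = TreeBagger(100,X,y,'Method','regression', ...
    'nvartosample',round(sqrt(size(X,2))));
yPred = predict(rfReg,Xnew);    % averaged numeric prediction over the 100 trees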
2.6 Discretisation
The basic idea of discretisation is to convert a continuous (i.e. real-valued) attribute into a
categorical (i.e. discrete or nominal) attribute. There are two approaches that can be
followed in general to do this: class-driven discretisation that chooses interval boundaries
taking into account the class distributions and class-blind discretisation that ignores the
classes of the examples.
Chapter 3: Literature Review on Data Mining for Predicting
Volume of Distribution
Many different algorithms, such as DT-based regression, multiple linear regression and neural networks, have been trained and tested on data sets containing molecular descriptors of various compounds along with VD values determined after intravenous administration of drugs to healthy people, in order to assess their prediction accuracy (Jones et al, 2011).
In the study by Freitas et al (2015), a Quantitative Structure-Pharmacokinetic Relationship (QSPKR) model using DTs with a feature selection method was used to predict the VD of chemical compounds (i.e. drugs). This model consists of two phases. In the first phase, Kt:p (the concentration ratio of tissue and plasma) for unknown tissues was predicted from molecular descriptors, using known Kt:p values of 110 compounds obtained via in vivo or in vitro experiments in animals (i.e. rats). In the second phase, the above model was used to predict Kt:p for 604 compounds. The predicted Kt:p values were used along with molecular descriptors to predict the VD of the compounds.
In the first phase, four types of DTs were used to predict log Kt:p (instead of Kt:p, as the distribution is skewed): a conventional regression tree, a model tree, if-then regression rules and bagging using model trees. The DTs were implemented using the M5P algorithm in WEKA. The mean absolute error (MAE) was calculated using 10-fold cross validation, which showed that different models gave the best prediction of log Kt:p for different types of tissues. Bagging with M5P gave the lowest error for 7 out of 13 tissues, but the authors used the best method for each tissue.
In the second phase, correlation feature selection (CFS) with genetic search was
used to obtain useful descriptors to predict log VD. This method assigns a higher score to
feature subsets where there is a higher correlation between the subset’s features and the
target variable (indicating the features have good predictive power) and a lower correlation
among the subset’s features (indicating less redundancy among those features). To test the
prediction of VD, the training set consisted of 402 compounds and the test set of 202
compounds. The geometric mean fold error (GMFE)² was used as the performance measure as it is less affected by extreme outliers (as mentioned, VD has a skewed distribution). Three different feature sets were used:
1. predicted log Kt:p plus molecular features (which gave a GMFE of 2.61);
2. only molecular features (which gave a GMFE of 2.33);
3. 56 descriptors selected by CFS (of which only two were Kt:p values, for adipose tissue and thymus, with the remaining 54 being molecular descriptors), which gave a GMFE of 2.29.

² GMFE = antilog10(MAE), i.e. GMFE = 10^MAE.
Though approach 3 improved GMFE slightly, the improvement was not statistically
significant using t-Test (p=0.577).
The analysis of the predicted Kt:p values showed that the error comes from the large/small VD values (e.g. the hydroxychloroquine drug, with a VD value of 100 L/kg) and not from the Kt:p prediction. The results were difficult to compare with other studies due to variations in the data sets, but GMFE errors of 1.56 to 2.78 reported for inter-species scaling prediction of VD suggest that the GMFE of 2.29 obtained here was acceptable, though not the best.
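As a point of reference for interpreting such figures, GMFE = 10^MAE with the MAE computed on log10-transformed values, so a GMFE of 2.29 corresponds to an MAE of about 0.36 log units, i.e. predictions that differ from the measured VD by a factor of roughly 2.3 on average.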
In the study by Amo et al (2013), anatomical volumes were used as rough guidance to classify VD into three classes, namely a volume of extracellular fluid class with a VD range of 0 – 0.3 L/kg, a distribution to the tissues class with a VD range of 0.3 – 1 L/kg and a binding to the cellular components class with VD higher than 1 L/kg. The initial dataset
collated by Obach and co-workers (Obach et al 2008) contains 670 compounds with VD and
fu values determined after intravenous administration to healthy people. In this study,
however, a dataset of 642 drugs with VD values ranging from 0.035 – 60 L/kg was used. Two approaches were used to build predictive models: linear and non-linear. The former used Partial Least Squares (PLS) with Principal Component Analysis (PCA), and all descriptors were transformed with unit variance scaling and mean centring before the PCA and PLS analysis; descriptors with unequal distributions were logarithmically transformed to obtain better distributions. The second, non-linear approach consisted of a recursive partitioning classification model that was used to build DTs. To generate the training set and test set, all compounds were clustered by similarity based on root mean square deviation. This was done to have compounds from each cluster in both data sets.
The division of training and test set compounds was based on two approaches: one based on VD and another based on VD and fu (which depends on the binding affinity and capacity of plasma proteins). The former gives three classes and the latter gives six classes (by further dividing each VD class as above according to whether fu is more than or less than 0.7).
Table 1 shows the division of the data set based on VD values. In addition to the VD values, the table also shows a row of log(VD) values, as this logarithmic transformation of VD will be used in this project to reduce the skew caused by some of the high VD values in the dataset. The six-class division based on VD and fu values is not shown here as fu will not be used in this project.
The performance was measured using sensitivity. As an example, Eq. (1) shows the
sensitivity computation for class 1:
$$\text{Sensitivity}_1 = \frac{T_1}{T_1 + F_2 + F_3} \qquad (1)$$
where T1 is the number of true positives for class 1, i.e. the number of examples in class 1 correctly predicted as class 1, F2 is the number of examples in class 1 wrongly (falsely) predicted as class 2 and F3 is the number of examples in class 1 wrongly (falsely) predicted as class 3. The results were good for Class 1, which gave a test set sensitivity of 0.71, and for Class 3 (test set sensitivity of 0.81), but not so good for Class 2, which only gave a test set sensitivity of 0.32.
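As a hypothetical illustration of Eq. (1): if 50 class 1 examples were predicted by a model as 40 class 1, 7 class 2 and 3 class 3, then T1 = 40, F2 = 7, F3 = 3 and the class 1 sensitivity would be 40/(40+7+3) = 0.8.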
Table 1: Division of data set based on VD values by Amo et al, 2013
Class 1 Class 2 Class 3 Total
VD 0 – 0.3 L/kg 0.3 – 1 L/kg ≥ 1 L/kg
log(VD) -1.4559 to -0.5228 -0.5086 to -0.0043 0.0414 to 2.8451
Training 105 96 181 382
Test 62 71 127 260
Total 167 167 308 642
Lombardo et al (2006) utilised a two stage approach to predict the value of VD. In
the first stage, mixture discriminant analysis (MDA) was used to classify into either high or
low class based on VD values: either low VD with VD < 10 L·kg⁻¹ or high VD with VD ≥ 10 L·kg⁻¹. In the next step, RF regression models trained for each class were used to predict the
VD value. The authors showed that, using 31 computed descriptors (examples include lipophilicity, ionisation, molecular volume and molecular fragments), MDA-RF gave a GMFE of 1.78±11.4. This was a clear improvement compared to using RF only, which gave a GMFE of 2.03±15.0. The proposed method was also better than multiple linear regression approaches using different inputs (31 combined descriptors: GMFE 2.01±11.4; 19 descriptors from simulated annealing: GMFE 2.05±10.1; 16 descriptors based on physicochemical intuition: GMFE 2.48±16.2). Furthermore, to show that MDA-RF performs well irrespective of the dataset division, a ten-fold cross validation approach was
used with 384 drugs with each fold either having 38 or 39 compounds, which gave GMFE
ranging from 1.83 to 2.24. Leave one out (LOO) approach was also utilised to study if any
class would be poorly predicted if it had not been included in the model using eight
structural classes with 72 analogues: Steroids (14 compounds), Beta blockers (16
compounds), Fluoroquinolone antibiotics (10 compounds), NSAIDs (7 compounds),
Cephalosporines (17 compounds), Benzodiazepines (15 compounds), Tricyclic
antidepressants (7 compounds) and Morphine-like (10 compounds). This was done as a
simulation of real-life prediction, when a new structural class that has no experimental data is tested. The LOO experiment gave GMFEs of 1.46 to 2.94 with an average of 1.91.
Berellini et al (2009) also analysed prediction of VD using linear and nonlinear
models using 669 compounds. Four models were used to predict VD: RF, PLS (principal
component (PC) 5), PLS-PC6 and consensus of the two PLS models. GMFE was used to
measure performance. A different number of descriptors was used in each approach. The RF model used 280 molecular descriptors (from the MOE and VolSurf+ software) while the PLS models had 95 and 11 descriptors for PC5 and PC6, respectively. Using the average of the logarithms of the predicted VD values, a consensus model was built from the best PLS and RF models. Results were
given for nine structural classes with 172 analogues (number of analogues shown in
brackets): Beta adrenergics (27), Benzodiazepines (18), Cephalosporines (27),
Fluoroquinolones (12), Morphinans (12), NSAIDs (16), Nucleosides/nucleotides (31),
Steroids (21) and Tricyclic antidepressants (8) using leave class out (LCO) approach. On
average, consensus model gave GMFE of 1.8 which was also given by PLS-PC5 model. The
other two models both gave GMFE of 2.0. When tested on an external test set of 29
compounds, GMFE from PLS-PC6 model was best with 1.8 while the other models gave
GMFE values from 1.9 to 2.2.
Zhivkova and Doytchinova (2012) looked at VD but only for acidic drugs totalling
132. Genetic algorithm (GA) was used as variable selection procedure from the 178
molecular descriptors available, and stepwise linear regression was used to design two models, one to estimate log VD and another to estimate log(VD/MW). The latter is VD per
unit weight, which eliminates the influence of molecular weight (MW). Descriptors with non-zero values for fewer than three molecules were eliminated. The models were assessed by the explained variance (r²) and the standard error of estimate (SEE). Two cross-validation approaches were used: leave-one-out cross-validation (LOO-CV) and leave-many-out cross-validation (LMO-CV, using an 80%:20% division for training:testing). The best performing descriptors for the log(VD) model selected by the GA were xch9, SdsssP_acnt, SssS_acnt, SdssS_acnt, SHsSH, Hmax, Gmin and knotpv, which gave r² = 0.661 and an SEE of 0.194. For the second model, the descriptors were xch7, xch9, SdsssP_acnt, SdssS_acnt, SsSS, Hmax and nelem. This second model had higher r² and SEE values of 0.687 and 0.221. An external test set (of 30 runs) was also used to validate the models' performance using the mean fold error (MFE). This external test set was compiled from Berellini et al (2009) and consisted of 10 acidic drugs. The first model gave an MFE of 2.04 while the second one gave an MFE of 2.25.
In a recent paper, Zhivkova et al (2015) used 216 basic drugs belonging to different
chemical and therapeutic classes extracted from Obach’s database (Obach et al, 2008). The
dataset was divided into six sets, each comprising 36 drugs: five for modelling and one for external validation. The five modelling subsets of data were subjected to five-fold cross validation. Logarithms of the VD values were used to obtain a closer approximation to a normal distribution (i.e. to avoid the skewness in the VD values). A total of 179 molecular descriptors were computed from the chemical structures by the ACD/LogD and MDL QSAR software. Models were validated using a randomisation test, leave-one-out cross-validation and leave-group-out validation, using the cross-validated coefficient q², the prediction coefficient r² for the test set, the mean fold error of prediction (MFEP) and accuracy. The accuracy (in percentage) was assessed as the proportion of predicted VD values within two- or three-fold error of the actual values. A consensus model using the 27 most frequently emerging descriptors for 180 basic drugs (some drugs were removed as outliers) gave r² = 0.529 to 0.593 (mean of 0.555), MFEP of 2.24 to 2.38 (mean of 2.31) and an accuracy of 53% for two-fold error (on average). To propose criteria for the prediction of VD of basic drugs, three groups were suggested: small (VD < 0.7 L/kg), moderate (0.7 L/kg ≤ VD ≤ 2 L/kg) and large (VD > 2 L/kg). Seven criteria were proposed based on logP, fB, Dipole, ncirc, Gmin, SdssC, aaaC_acnt, Cl_acnt and F_acnt: drugs that meet none of the criteria have a small VD (< 0.7 L/kg), while those meeting three or more criteria have VD > 2 L/kg.
Louis and Agrawal (2012) investigated the VD values of anti-infective agents from the J group of the ATC classification. Three approaches, namely multiple linear regression (MLR), artificial neural networks (ANN) and support vector machines (SVM), were used with theoretical molecular descriptors from 126 drugs (with VD ranging from 0.05 to 33 L·kg⁻¹). Initially, compounds were sorted according to their log VD values and every 5th compound was taken as an external test compound. The descriptors were selected using CFS, which estimated a subset of 20 descriptors from a pool of 1664 descriptors computed using the E-DRAGON software (including 1D, 2D and 3D descriptors). Two performance evaluation methods, root mean square error (RMSE) and MAE, were used, which showed that SVM gave the best performance with RMSE = 0.273 and MAE = 0.207 on the test dataset. SVM parameter optimisation was done using 10-fold cross validation.
Demir-Kavuk et al (2011) proposed a framework called DemQSAR combining
feature generation, feature selection, model building and overtraining control to predict VD
and also human clearance (CL). Although Obach’s database (Obach et al, 2008) was used,
only 584 compounds were selected, with compounds containing phosphorus, boron, metal atoms, macrocycles, and fragment-like compounds removed. This dataset was divided into 338 compounds for training and the remaining approximately 40% (i.e. 246) for testing. Furthermore, 10-fold cross validation of the whole dataset was also performed, along with external testing on the 29 compounds used by Berellini et al (2009). Several software packages were used, which gave 404 molecular descriptors for each compound. Combining these with four types of fingerprints gave a total of 3642 features. Lasso regularisation and recursive feature elimination (RFE) were used as feature selectors. The cross validation performance for VD prediction gave a GMFE of 2.01 with all features, and with a minimal set of 27 features a GMFE of 2.24 was obtained. With the same 27 features, an external GMFE of 1.81 was obtained.
3.1 Summary of literature review
There are many research studies that have looked into predicting VD using various algorithms based on regression approaches, but only Lombardo et al (2006) have attempted a two-stage classification and regression approach similar to the work proposed here. Nevertheless, that approach only used a simple two-class division based on VD values, on a smaller subset of 384 drugs from Obach's database (Obach et al, 2008). Furthermore, the approach by Lombardo et al did not attempt to measure the confidence of the classification before obtaining regressed VD values.
Chapter 4 Proposed Methodology
4.1 Discretisation of classes
The proposed methodology consists of two stages, where the first stage is classification and the second stage is regression. In the first stage, the VD values are discretised into several classes, and initially the number of classes to use has to be decided. The simplest approach is to divide the data set in certain proportions, and there are two ways to do this. One way is to divide the entire data set equally into a certain number of classes, such as three classes (LOW, MEDIUM and HIGH), each consisting of 33% of the data. For five classes, the proportion would be 20% per class, with the class set VERY LOW, LOW, MEDIUM, HIGH and VERY HIGH. Another way is to divide the data set unequally, giving some classes more instances, for example in proportions of 20%, 60%, 20% (LOW, MEDIUM, HIGH, respectively) for three classes; for five classes it could be VERY LOW (10%), LOW (20%), MEDIUM (40%), HIGH (20%) and VERY HIGH (10%). There are other approaches to divide the data set, such as the k-means algorithm, density based algorithms and so on. For the data set used in this research, preliminary analysis showed that some of these methods did not work very well. For example, using k-means for three classes resulted in LOW (90%), MEDIUM (9%), HIGH (1%), which is not useful as the LOW class has a very high number of instances and would skew the classification.
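As a minimal sketch of how such an equal-frequency (class-blind) discretisation could be produced in Matlab (the variable names are hypothetical; logVd holds the log VD value of each drug, and the quantile function from the Statistics and Machine Learning Toolbox is assumed):

edges = quantile(logVd,[1/3 2/3]);     % boundaries for a 33% 33% 33% split
classLabel = ones(size(logVd));        % 1 = LOW
classLabel(logVd > edges(1)) = 2;      % 2 = MEDIUM
classLabel(logVd > edges(2)) = 3;      % 3 = HIGH

Because instances that share the same log VD value at a boundary fall on the same side of it, the realised class sizes can deviate slightly from the nominal proportions, as discussed below.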
In addition to the proportionality divisions, two more divisions are considered from the literature. One is from Lombardo et al (2006), where the entire data set was divided into two classes with respect to VD values: instances with VD values < 10 L/kg were assigned to the LOW class (consisting of 92% of the data) and instances with VD values >= 10 L/kg were assigned to the HIGH class (consisting of 8%). The other is from Amo et al (2013), in which the data was divided into three classes: VD values from 0 to 0.3 L/kg as LOW (27%), from 0.3 to 1 L/kg as MEDIUM (27%) and VD values higher than 1 L/kg as HIGH (46%). Table 1 shows the class divisions based on VD values for two and three classes, while Table 2 shows the class divisions for five classes. As can be seen from the tables, the division of data is not exactly in the specified ratios. This is because at a division boundary there could be several instances with the same log VD value, and these have to be placed in the same class, causing slightly different ratios than the specified ones. For example, the 33% division has 201 instances for the LOW class, 198 instances for the MEDIUM class and 203 instances for the HIGH class. Only 602 drug compounds from Obach's database are used here, as the two extremely high VD values for hydroxychloroquine and chloroquine (700 L/kg and 140 L/kg, respectively) would skew the training badly, giving poor performance.
Table 1: Class divisions based on VD values for two and three classes

Division | LOW: VD (L/kg); log VD | MED: VD (L/kg); log VD | HIGH: VD (L/kg); log VD | No. of drugs (LOW / MED / HIGH)
Equal frequency (33% 33% 33%) | 0.035–0.41; -1.4559 to -0.3872 | 0.42–1.6; -0.3767 to 0.2041 | 1.7–700; 0.2304 to 2.8451 | 201 / 198 / 203
Equal frequency (20% 60% 20%) | 0.035–0.23; -1.4559 to -0.6382 | 0.24–3.3; -0.6197 to 0.5185 | 3.5–700; 0.5440 to 2.8451 | 123 / 362 / 117
27% 27% 46% (Amo et al, 2013) | 0–0.3; -1.4559 to -0.5228 | 0.3–1.0; -0.5086 to -0.0043 | >1; 0.0413 to 2.8450 | 159 / 164 / 279
92% 8% (Lombardo et al, 2006) | <10; -1.4559 to 0.9867 | – | >=10; 1.0000 to 2.8450 | 556 / – / 46
90% 9% 1% (k-means) | 0.035–7.1; -1.4559 to 0.8512 | 7.4–28; 0.8692 to 1.4472 | 33–60; 1.5185 to 1.7781 | 543 / 51 / 8
Table 2: Class divisions based on VD values for five classes

Class | Equal frequency (20% 20% 20% 20% 20%): VD (L/kg); log VD | No. of drugs | Equal frequency (10% 20% 40% 20% 10%): VD (L/kg); log VD | No. of drugs
VERY LOW | 0.035–0.23; -1.4559 to -0.6382 | 123 | 0.035–0.16; -1.4559 to -0.7959 | 61
LOW | 0.24–0.57; -0.6198 to -0.2441 | 117 | 0.17–0.34; -0.7695 to -0.4685 | 118
MED | 0.58–1.3; -0.2365 to 0.1139 | 126 | 0.35–1.8; -0.4559 to 0.2552 | 240
HIGH | 1.4–3.3; 0.1461 to 0.5185 | 119 | 1.9–6.6; 0.2787 to 0.8195 | 122
VERY HIGH | 3.5–700; 0.5441 to 2.8451 | 117 | 6.8–700; 0.8325 to 2.8451 | 61
4.2 Two-stage Classification Regression approach
4.2.1 Dividing data into training, validation and testing folds
Once the data is discretised into classes, it is further divided into 10 folds with nearly equal numbers of instances of each class. Next, with one fold kept as the test fold, the remaining 9 folds are used, with 8 folds for training and one fold for validation. The training data is used to build the RF classification and regression trees. The validation data is used to obtain the best performing threshold for the confidence measure that will be explained later. The data in the test fold is not used at this stage.
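A minimal sketch of this fold setup in Matlab (hypothetical variable names; classLabel holds the discretised class of each drug) is:

cvp = cvpartition(classLabel,'KFold',10);   % 10 stratified folds
testFold = 1;  valFold = 2;                 % one example choice of test/validation folds
testIdx  = test(cvp,testFold);              % held out, not used at this stage
valIdx   = test(cvp,valFold);               % used for threshold selection
trainIdx = ~(testIdx | valIdx);             % remaining 8 folds for training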
4.2.2 Two stage approach - threshold detection
At the start of the two-stage approach, classification DTs are built (with 100 trees and 17 randomly selected variables³, where Matlab's TreeBagger⁴ function is used to build the RF) using the training data as below:
Bstage1=TreeBagger(trees,dtrain(:,1:293),ctrain,'nvartosample',novar);
where trees is the number of trees in RF (trees=100), novar is the number of variables
(novar=17), ctrain contains the class labels as per the divisions decided earlier. Numeric
values are assigned to each class such as 1 for LOW, 2 for MEDIUM and 3 for HIGH.
Columns 1 to 293 in dtrain contain the 293 attributes of the training data in the fold.
Bstage1 contains the built 100 RF trees.
Next, regression decision trees are built for each class using the training data from that class (again with 100 RF trees and 17 variables). So, in the case of the 33% 33% 33% division, a regression RF is built three times, one for each class, as below (example shown for class 1):
regclass1=TreeBagger(trees,train_class1(:,1:293),train_class1(:,294),...
    'Method','regression','nvartosample',novar);
where train_class1 contains training data from class 1 only (with 293 attributes),
train_class1(:,294) contains the log VD values from training data and regclass1
contains the built RF for class 1. Similarly, RF regression tree is built for the rest of the
classes.
Using validation data, the classes of the validation instances are predicted using:
³ 17 is chosen based on the rule of thumb sqrt(number of attributes).
⁴ TreeBagger can be run for classification (default) or for regression.
[cstage1_val, cScore]=predict(Bstage1,dval(:,1:293));
where Bstage1 contains the built classification RF, dval is the validation data (with 293 attributes), cstage1_val contains the predicted classes and cScore contains the confidence of prediction for each instance of the validation data. For example, for a given instance in the validation data, 20 of the 100 trees could predict class 1, 50 trees class 2 and 30 trees class 3. This would give a prediction confidence vector of [0.2 0.5 0.3].
The classification accuracy is found by comparing the predicted classes in cstage1_val with the actual labels of the validation data. The overall accuracy is computed as the number of correctly predicted instances divided by the total number of instances; this is also done for each class, where the number of correctly predicted instances of the class is divided by the total number of instances of that class. The usefulness of doing this will become evident later in the discussion.
To decide on the confidence measure, i.e. how confidently the class of a particular instance is predicted, the following condition is used:

max(classification_confidence) > threshold × mean(classification_confidence)    (2)

where threshold varies from 1.1 to 4.0 in steps of 0.1, mean(classification_confidence) is the mean of the class vote fractions for that instance across all trees in the RF and max(classification_confidence) is the maximum of these vote fractions. Initially, the threshold is set at 1.1.
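For example, for the confidence vector [0.2 0.5 0.3] above, max(classification_confidence) = 0.5 and mean(classification_confidence) = 0.33, so the prediction is treated as a high-confidence one only for threshold values below 1.5 (since 0.5 > threshold × 0.33 requires threshold < 1.5); for larger thresholds the instance falls back to standard regression.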
If equation (2) is satisfied for the validation instance as below:
predict_ok = zeros(1,size(cScore,1)); % initialise: 0 = no high confidence prediction
for i=1:size(cScore,1)                % cScore has one row of class vote fractions per instance
    if(max(cScore(i,:)) > thres_factor*mean(cScore(i,:)))
        predict_ok(i)=1;              % this instance has a high confidence prediction
    end
end
then the two-stage classification-regression approach is used for that instance (predict_ok set to 1); otherwise the standard regression approach is used (predict_ok = 0).
So, for each validation instance, if the confidence condition is met, the regression tree for the predicted class is loaded. For example, for an instance in the validation data with a high-confidence prediction of class MEDIUM, the regression tree built for the MEDIUM class is loaded and the predicted log VD is obtained.
If the confidence is not high (i.e. equation (2) is not satisfied), then the standard regression approach is followed, building an RF from the training data as follows. Standard regression of the log VD values is performed with an ensemble of 100 random forest trees, each built with 17 randomly selected variables:
Bstdreg=TreeBagger(trees,dtrain(:,1:293),dtrain(:,294),...
    'Method','regression','nvartosample',novar);
where trees is the number of trees in the RF (trees=100), novar is the number of variables (novar=17), columns 1 to 293 of dtrain contain the 293 attributes of the training data in the fold and the 294th column of dtrain contains the actual log VD values. Bstdreg contains the built RF of 100 trees.
Hence, each validation instance is subjected to either the two-stage approach or standard regression by the following code (example shown for the three-class division):
%choose regression model based on predicted class
for i=1:length(cval)
    if(str2num(cstage1_val{i})==1 && predict_ok(i)==1)   %predicted class 1 with high confidence
        Vdregnew(i)=predict(regclass1,dval(i,1:293));
    elseif(str2num(cstage1_val{i})==2 && predict_ok(i)==1)
        Vdregnew(i)=predict(regclass2,dval(i,1:293));
    elseif(str2num(cstage1_val{i})==3 && predict_ok(i)==1)
        Vdregnew(i)=predict(regclass3,dval(i,1:293));
    else
        Vdregnew(i)=predict(Bstdreg,dval(i,1:293));       %use std regression tree if no high confidence prediction
    end
end
The above code loops for all the validation instances where Vdregnew contains the predicted
value of log VD.
4.2.3 Error computation
To obtain the error values of the prediction, RMSE, MAE and GMFE are calculated as
follows:
$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log V_{Di} - \log V'_{Di}\right)^{2}} \qquad (3)$$

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\log V_{Di} - \log V'_{Di}\right| \qquad (4)$$

$$\text{GMFE} = 10^{\text{MAE}} \qquad (5)$$
where log VDi is the actual value given in Obach's database for the particular validation drug instance i, log V'Di is the estimated value obtained by either the two-stage approach or standard regression (i.e. Vdregnew above) and N is the number of drug instances in the validation data.
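A minimal Matlab sketch of this error computation (hypothetical variable name logVd_actual for the Obach values; Vdregnew as above) is:

err  = logVd_actual(:) - Vdregnew(:);   % per-instance error in log10 units
RMSE = sqrt(mean(err.^2));
MAE  = mean(abs(err));
GMFE = 10^MAE;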
The above processes (4.2.1 to 4.2.3) are repeated for all 90 different fold combinations (using different training and validation data) and the average of the errors over the 90 folds is computed. This gives the error for threshold = 1.1.
The threshold is then changed to 1.2 (and so on, up to 4.0 in steps of 0.1) and all the steps in 4.2.1 to 4.2.3 are repeated for the 90 folds, with the average of the errors over the 90 folds computed for each threshold. It should be noted that in some cases a higher threshold has no effect. For example, with three classes the mean of the class vote fractions is 1/3, so for threshold values above 3.0 no prediction can satisfy equation (2), since the maximum vote fraction can only be 1.0.
The best threshold to use is decided not only by the lowest GMFE error value but also by requiring at least one confident prediction in each class. This is to avoid possible poor performance when using the testing data later.
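A sketch of this selection rule (with hypothetical variable names: thresholds is the 1.1–4.0 grid, gmfePerThreshold the averaged GMFE per threshold and confidentPerClass a thresholds-by-classes matrix of confident-prediction counts) could be:

candidates = gmfePerThreshold;
candidates(~all(confidentPerClass > 0,2)) = Inf;   % rule out thresholds lacking a confident prediction in some class
[~,idx] = min(candidates);
bestThreshold = thresholds(idx);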
4.3 Fold creation to compare the proposed approach with standard regression
Once the best threshold has been decided, the full data set is divided into 10 folds, where 9 folds are used for training and one fold for testing. The folds have nearly equal numbers of instances of each class. This division into folds is used to compare the performance of the proposed approach with standard regression.
4.3.1 Standard regression
Standard regression of the log VD values is performed with an ensemble of 100 RF trees built with 17 randomly selected variables, using the training data. This is for comparison with the proposed two-stage approach. The following code is used:
Bstdreg=TreeBagger(trees,dtrain(:,1:293),dtrain(:,294),...
    'Method','regression','nvartosample',novar);
where dtrain now contains the new training data (9 folds) with 293 attributes and the 294th column of dtrain contains the actual log VD values. Bstdreg contains the built RF of 100 trees. Testing data (from the remaining fold) is used to obtain the regressed log VD, and the error values are computed.
4.3.2 Proposed two-stage approach - testing
Similarly to section 4.2.2, for the two-stage approach a classification RF is built, except that now the new 9 folds of training data are used, and likewise a regression RF is built for each class. For each instance in the test data, a class prediction and confidence score are obtained. If the confidence exceeds the condition of equation (2) using the threshold decided earlier, then log VD is predicted using the RF regression model of the predicted class. Otherwise, log VD is predicted using the standard RF regression.
With these predicted log VD values for each test instance, the three errors (RMSE, MAE, GMFE) are computed. The whole process is then repeated with a different choice of 9 training folds and remaining test fold, and the average errors are computed for comparison.
All the above steps, from section 4.1 to 4.3, are carried out for each of the different discretisations of the data and the corresponding results are stored in a text file.
Chapter 5 Computed Results and Analysis
In this chapter, the obtained results are given along with analysis of the results.
5.1 Two stage approach
Table 1 shows the classification results for the training and the validation data, i.e. for the first stage of the two-stage approach (without any confidence measure). The five-class divisions use VERY LOW, LOW, MEDIUM, HIGH and VERY HIGH, represented by classes 1, 2, 3, 4 and 5, respectively, and similarly the three-class divisions use LOW, MEDIUM and HIGH as classes 1, 2 and 3. As can be seen, for all the data divisions the overall classification accuracy and the per-class accuracies on the training data are very good, except for the 92% 8% division, where class 1 has by far the most instances and thus training is skewed, resulting in high accuracy for class 1 and poor accuracy for the other class. For some classes (those with more training instances) the results are 100%, for example class 3 in the 10% 20% 40% 20% 10% data division.
For the validation data, the accuracies are not as good as for the training data, and the higher classification accuracies occur for the classes with more training instances. For example, for the 20% 60% 20% division, the validation accuracies for the three classes are 0.3674, 0.8856 and 0.2651.
The poor classification performance for certain classes causes poor prediction accuracy overall on the test data as well; this will lead to the wrong regression model being loaded in stage two and hence poorer VD prediction. To solve this problem, a suitable threshold with a confidence measure was explored, as explained in the previous chapter.
Table 1: Mean classification accuracy using training and validation data

Data division | Training accuracy: overall (per class) | Validation accuracy: overall (per class)
33% 33% 33% | 0.9952 (0.9984 / 0.9940 / 0.9931) | 0.6676 (0.7994 / 0.5312 / 0.6721)
20% 60% 20% | 0.9773 (0.9457 / 1.0000 / 0.9403) | 0.6584 (0.3674 / 0.8856 / 0.2651)
20% 20% 20% 20% 20% | 0.9974 (1.0000 / 0.9987 / 0.9993 / 0.9916 / 0.9970) | 0.4642 (0.6535 / 0.2694 / 0.3932 / 0.3963 / 0.6006)
10% 20% 40% 20% 10% | 0.9821 (0.9549 / 0.9870 / 1.0000 / 0.9750 / 0.9433) | 0.5215 (0.0757 / 0.4833 / 0.8269 / 0.3345 / 0.2233)
92% 8% | 0.9429 (1.0000 / 0.2534) | 0.9251 (1.0000 / 0.0161)
27% 27% 46% | 0.9778 (0.9839 / 0.9345 / 0.9998) | 0.6777 (0.6978 / 0.3301 / 0.8726)
5.1.1 Deciding the threshold for two stage approach
Table 2 shows the classification and two-stage regression results on the validation data for the equally divided three classes (33% 33% 33%), for varying threshold levels. Results are shown only up to a threshold of 2.7 as they did not change beyond this point: the threshold becomes too high for any data instance to satisfy equation (2), so all instances are processed by standard regression. The lowest error, a GMFE of 2.1423, is given by a threshold value of 1.5, and this is the threshold used with the testing data later.
Table 2: Performance on validation data with varying threshold for the 33% 33% 33% division

Columns: Threshold factor | Validation classification accuracy (Overall, Class 1, Class 2, Class 3) | Two-stage regression error with confidence measure (RMSE, MAE, GMFE)
1.1 0.6814 0.8048 0.5385 0.6770 0.5024 0.3537 2.2662
1.2 0.6976 0.8148 0.5450 0.7048 0.4908 0.3466 2.2284
1.3 0.7144 0.8383 0.5498 0.7254 0.4860 0.3429 2.2090
1.4 0.7429 0.8727 0.5404 0.7646 0.4735 0.3360 2.1747
1.5 0.7777 0.9095 0.5364 0.8093 0.4633 0.3295 2.1423
1.6 0.7973 0.9320 0.5046 0.8460 0.4614 0.3302 2.1459
1.7 0.8198 0.9494 0.4867 0.8743 0.4599 0.3320 2.1541
1.8 0.8425 0.9729 0.4672 0.9058 0.4595 0.3347 2.1671
1.9 0.8677 0.9770 0.4557 0.9427 0.4608 0.3384 2.1860
2.0 0.8902 0.9849 0 0.9504 0.4579 0.3378 2.1824
2.1 0.9156 0.9970 0 0.9666 0.4569 0.3387 2.1874
2.2 0.9330 1 0 0.9607 0.4595 0.3414 2.2019
2.3 0.9504 1 0 0 0.4590 0.3429 2.2097
2.4 0.9552 1 0 0 0.4592 0.3433 2.2118
2.5 0.9718 1 0 0 0.4614 0.3453 2.2224
2.6 0.9806 1 0 0 0.4634 0.3477 2.2354
2.7 0 0 0 0 0.4627 0.3468 2.2303
Table 3 shows the classification and two-stage regression results on the validation data for the unequally divided three classes with more instances in the second class (i.e. 20% 60% 20%), for varying threshold levels. Results are shown only up to a threshold of 2.7 as they did not change beyond this point, for the same reason as above. A threshold of 1.9 gives the lowest error with non-zero classification accuracy for all the individual classes, with a GMFE of 2.2834.
Table 3: Performance on validation data with varying threshold for the 20% 60% 20% division

Columns: Threshold factor | Validation classification accuracy (Overall, Class 1, Class 2, Class 3) | Two-stage regression error with confidence measure (RMSE, MAE, GMFE)
1.1 0.6593 0.3674 0.8859 0.2651 0.5062 0.3793 2.4027
1.2 0.6576 0.3643 0.8823 0.2612 0.4921 0.3729 2.3714
1.3 0.6620 0.3672 0.8863 0.2635 0.5088 0.3830 2.4249
1.4 0.6764 0.3737 0.8929 0.2806 0.5009 0.3767 2.3878
1.5 0.6873 0.3894 0.9048 0.2472 0.5001 0.3754 2.3809
1.6 0.7128 0.4114 0.9202 0.2318 0.4910 0.3692 2.3482
1.7 0.7364 0.4288 0.9406 0.2275 0.4839 0.3627 2.3118
1.8 0.7622 0.4546 0.9501 0.2136 0.4793 0.3593 2.2944
1.9 0.7879 0.4749 0.9583 0.2003 0.4765 0.3573 2.2834
2.0 0.8057 0.4671 0.9623 0 0.4724 0.3550 2.2704
2.1 0.8329 0 0.9665 0 0.4715 0.3537 2.2632
2.2 0.8477 0 0.9799 0 0.4695 0.3526 2.2577
2.3 0.8804 0 0.9848 0 0.4681 0.3509 2.2485
2.4 0.8879 0 0.9851 0 0.4676 0.3515 2.2516
2.5 0.8892 0 0.9825 0 0.4683 0.3511 2.2489
2.6 0.9151 0 0.9794 0 0.4679 0.3505 2.2458
2.7 0 0 0 0 0.4669 0.3499 2.2428
Table 4: Performance on validation data with varying threshold for the 20% 20% 20% 20% 20% division

Columns: Threshold factor | Validation classification accuracy (Overall, Class 1 to Class 5) | Two-stage regression error with confidence measure (RMSE, MAE, GMFE)
1.1 0.4652 0.6530 0.2688 0.3936 0.3968 0.6006 0.5384 0.3825 2.4281
1.2 0.4636 0.6461 0.2611 0.3869 0.4194 0.5891 0.5418 0.3855 2.4490
1.3 0.4709 0.6592 0.2708 0.3869 0.4080 0.6059 0.5303 0.3778 2.4036
1.4 0.4751 0.6627 0.2471 0.3982 0.4148 0.6145 0.5288 0.3779 2.4067
1.5 0.4835 0.6835 0.2591 0.4090 0.4126 0.6087 0.5190 0.3742 2.3861
1.6 0.5026 0.7040 0.2550 0.4131 0.4361 0.6398 0.5032 0.3645 2.3317
1.7 0.5279 0.7184 0.2486 0.4542 0.4612 0.6704 0.4905 0.3569 2.2899
1.8 0.5366 0.7296 0.2507 0.4410 0.4437 0.7087 0.4851 0.3536 2.2720
1.9 0.5580 0.7622 0.2630 0.4403 0.4716 0.7256 0.4787 0.3493 2.2487
2.0 0.5623 0.7697 0.2230 0.4674 0.4662 0.7471 0.4744 0.3487 2.2447
2.1 0.5754 0.7944 0.2113 0.4813 0.4704 0.7786 0.4706 0.3469 2.2356
2.2 0.5767 0.8044 0.1678 0 0.4372 0.8023 0.4706 0.3475 2.2388
2.3 0.5832 0.8163 0 0 0.4522 0.8374 0.4690 0.3475 2.2385
2.4 0.6113 0.8402 0 0 0 0 0.4666 0.3464 2.2313
2.5 0.6223 0.8700 0 0 0 0 0.4655 0.3464 2.2315
2.6 0.6388 0.8851 0 0 0 0 0.4642 0.3454 2.2261
2.7 0.6751 0.8978 0 0 0 0 0.4625 0.3444 2.2205
2.8 0.6895 0 0 0 0 0 0.4638 0.3456 2.2269
2.9 0.6966 0 0 0 0 0 0.4636 0.3457 2.2271
3.0 0.7186 0 0 0 0 0 0.4646 0.3466 2.2313
3.1 0.7470 0 0 0 0 0 0.4632 0.3457 2.2253
3.2 0.7587 0 0 0 0 0 0.4638 0.3461 2.2286
3.3 0.7472 0 0 0 0 0 0.4635 0.3469 2.2315
3.4 0 0 0 0 0 0 0.4641 0.3473 2.2331
Table 4 shows the classification and two-stage regression results on the validation data for the equally divided five classes, for varying threshold levels. Results are shown only up to a threshold of 3.4 as they did not change beyond this point. Although the lowest GMFE of 2.2205 is given by threshold 2.7, at that threshold there was not at least one high-confidence prediction in each class. So the threshold value of 2.1, which gave a GMFE of 2.2356, is used with the testing data later.
Table 5: Performance on validation data with varying threshold for the 10% 20% 40% 20% 10% division

Columns: Threshold factor | Validation classification accuracy (Overall, Class 1 to Class 5) | Two-stage regression error with confidence measure (RMSE, MAE, GMFE)
1.1 0.5153 0.0757 0.4833 0.8269 0.3345 0.1116 0.4942 0.3662 2.3322
1.2 0.5201 0.0791 0.4997 0.8323 0.3517 0.0963 0.4921 0.3639 2.3189
1.3 0.5178 0.0812 0.4981 0.8239 0.3554 0.1054 0.4940 0.3642 2.3208
1.4 0.5139 0.0697 0.4905 0.8241 0.3487 0.0946 0.4948 0.3671 2.3373
1.5 0.5216 0.0716 0.5081 0.8333 0.3541 0.1000 0.4877 0.3611 2.3047
1.6 0.5230 0.0691 0.5117 0.8291 0.3479 0.0947 0.4902 0.3642 2.3196
1.7 0.5307 0.0659 0.5145 0.8318 0.3513 0.0883 0.4863 0.3636 2.3172
1.8 0.5497 0.0631 0.5375 0.8434 0.3588 0.0926 0.4844 0.3611 2.3037
1.9 0.5578 0.0709 0.5442 0.8551 0.3748 0.0867 0.4826 0.3613 2.3041
2.0 0.5663 0.0519 0.5707 0.8512 0.3655 0.0679 0.4821 0.3608 2.3007
2.1 0.5875 0.0504 0.5859 0.8679 0.3825 0.0560 0.4772 0.3575 2.2828
2.2 0.6013 0.0393 0.5978 0.8800 0.3723 0.0575 0.4752 0.3548 2.2682
2.3 0.6189 0 0.6124 0.8910 0.3781 0.0466 0.4729 0.3538 2.2630
2.4 0.6404 0 0.6345 0.9112 0.3912 0 0.4718 0.3520 2.2526
2.5 0.6538 0 0.6572 0.9209 0 0 0.4699 0.3519 2.2526
2.6 0.6647 0 0.6835 0.9274 0 0 0.4676 0.3497 2.2407
2.7 0.6779 0 0.6846 0.9357 0 0 0.4689 0.3509 2.2468
2.8 0.6978 0 0.7128 0.9513 0 0 0.4676 0.3501 2.2424
2.9 0.7110 0 0 0.9589 0 0 0.4665 0.3478 2.2309
3.0 0.7234 0 0 0.9608 0 0 0.4662 0.3483 2.2331
3.1 0.7589 0 0 0.9762 0 0 0.4667 0.3485 2.2339
3.2 0.7545 0 0 0.9713 0 0 0.4672 0.3489 2.2363
3.3 0.7986 0 0 0 0 0 0.4681 0.3497 2.2408
3.4 0.8328 0 0 0.9881 0 0 0.4662 0.3486 2.2349
3.5 0 0 0 0 0 0 0.4670 0.3488 2.2359
Similar to the previous tables, Table 5 shows the results on the validation data for the unequally divided five classes with more instances in the third class (i.e. 10% 20% 40% 20% 10%), for varying threshold levels. Results are shown only up to a threshold of 3.5 as they did not change beyond this point. Although the lowest GMFE of 2.2309 is given by threshold 2.9, at that threshold there was not at least one high-confidence prediction in each class. So the threshold value of 2.2, which gives a GMFE of 2.2682, is used with the testing data later.
Table 6 shows the results on the validation data for the unequally divided three classes (27% 27% 46%), for varying threshold levels. Results are shown only up to a threshold of 2.7 as they did not change beyond this point; threshold 1.9 gives the lowest GMFE of 2.2077 and this is used with the testing data.
Table 6: Performance on validation data with varying threshold for the 27% 27% 46% division

Columns: Threshold factor | Validation classification accuracy (Overall, Class 1, Class 2, Class 3) | Two-stage regression error with confidence measure (RMSE, MAE, GMFE)
1.1 0.6909 0.7024 0.3326 0.8785 0.5155 0.3786 2.4117
1.2 0.7066 0.7222 0.3401 0.8901 0.5039 0.3721 2.3717
1.3 0.7219 0.7456 0.3187 0.9048 0.4899 0.3654 2.3326
1.4 0.7482 0.7751 0.3228 0.9251 0.4823 0.3593 2.2991
1.5 0.7706 0.8023 0.2868 0.9503 0.4741 0.3538 2.2699
1.6 0.7858 0.8258 0.2692 0.9595 0.4681 0.3485 2.2418
1.7 0.8037 0.8509 0.2431 0.9744 0.4660 0.3465 2.2308
1.8 0.8248 0.8727 0.2332 0.9786 0.4628 0.3437 2.2189
1.9 0.8432 0.9074 0.2155 0.9831 0.4613 0.3417 2.2077
2.0 0.8585 0.9020 0 0.9880 0.4613 0.3429 2.2151
2.1 0.8729 0.9196 0 0.9909 0.4615 0.3439 2.2207
2.2 0.8902 0.9563 0 0.9926 0.4594 0.3429 2.2159
2.3 0.8965 0.9793 0 0.9973 0.4592 0.3441 2.2226
2.4 0.9066 0.9904 0 0.9984 0.4578 0.3429 2.2168
2.5 0.9090 0.9963 0 0 0.4601 0.3454 2.2293
2.6 0.9347 0 0 0 0.4612 0.3465 2.2346
2.7 0 0 0 0 0.4635 0.3492 2.2488
Table 7: Performance of validation data with varying threshold for the 92% 8% division
Threshold factor | Validation classification accuracy: Overall  Class1  Class2 | Two stage regression error (with confidence measure): RMSE  MAE  GMFE
1.1 0.9044 1 0 0.4911 0.3618 2.3053
1.2 0.9056 1 0 0.4889 0.3610 2.3008
1.3 0.9071 1 0 0.4884 0.3612 2.3016
1.4 0.9100 1 0 0.4858 0.3597 2.2935
1.5 0.9118 1 0 0.4849 0.3591 2.2908
1.6 0.9213 1 0 0.4820 0.3572 2.2808
1.7 0.9353 1 0 0.4770 0.3550 2.2695
1.8 0.9451 1 0 0.4733 0.3542 2.2654
1.9 0.9681 1 0 0.4698 0.3526 2.2570
2.0 0 0 0 0.4677 0.3520 2.2540
Table 7 shows the validation results for the unequal two-class division (92% 8%), where class 1 has
the vast majority of instances, at varying threshold levels. Results are shown only up to a threshold
of 2.0 as they did not change beyond this value. Since the class 2 accuracy is 0 in all cases, this is
a special case, and the threshold value of 1.9, which gives a GMFE of 2.2570, was used with the
testing data.
Finally, Table 8 shows the standard regression errors for the various discretisations using ten
folds of data (nine folds for training and one fold for testing), the two-stage regression errors
where the instances are first classified and then regressed accordingly (without any confidence
measure), and the errors of the two-stage regression with confidence measure (i.e. the proposed
method). For standard regression, the class division of the data is irrelevant, so its performance
depends only on the training and testing folds; the GMFE errors ranged from 2.2319 to 2.2493, i.e.
they did not change much, as expected (the slight variation is due to the randomness of RF).
Table 8: Standard and two-stage classification-regression approach (with and without confidence measure) using testing data
Data division | Used threshold | Standard regression error: RMSE  MAE  GMFE | Two stage regression error (without confidence measure): RMSE  MAE  GMFE | Two stage regression error (with confidence measure): RMSE  MAE  GMFE
33% 33% 33%         | 1.5 | 0.4662 0.3485 2.2372 | 0.5195 0.3659 2.3416 | 0.4603 0.3306 2.1502
20% 60% 20%         | 1.9 | 0.4658 0.3498 2.2425 | 0.4999 0.3805 2.4139 | 0.4661 0.3502 2.2440
20% 20% 20% 20% 20% | 2.1 | 0.4635 0.3479 2.2356 | 0.5266 0.3746 2.3798 | 0.4597 0.3412 2.2013
10% 20% 40% 20% 10% | 2.2 | 0.4626 0.3476 2.2319 | 0.4883 0.3598 2.2987 | 0.4682 0.3481 2.2365
92% 8%              | 1.9 | 0.4673 0.3492 2.2493 | 0.4848 0.3559 2.2914 | 0.4646 0.3468 2.2373
27% 27% 46%         | 1.9 | 0.4645 0.3484 2.2375 | 0.5260 0.3828 2.4205 | 0.4672 0.3465 2.2279
The two-stage approach without the confidence measure gave GMFE errors ranging from 2.2914 (for
the 92% 8% division) to 2.4205 (27% 27% 46%), while adding the confidence measure improved the
performance for all the divisions compared to the version without it. The best result was given by
the three equal-class division (33% 33% 33%) with a GMFE of 2.1502, while the worst was the
20% 60% 20% division with a GMFE of 2.2440.
When comparing the three methods across the different data divisions, the two-stage approach with
the confidence measure gave clearly improved performance over standard regression for the equal
divisions (both three and five classes), slightly improved performance for the 92% 8% and
27% 27% 46% divisions, and slightly poorer results for the remaining two divisions (20% 60% 20%
and 10% 20% 40% 20% 10%).
Chapter 6 Conclusion and Future Work Suggestions
Accurate prediction of volume of distribution using in silico methods is very important for
estimating drug delivery and for the synthesis of new drugs. While in vivo and in vitro methods
exist, they are costly and time consuming. In silico methods also provide quick predictive VD
results that can later be refined with in vivo and in vitro approaches.
Most in silico studies attempt to predict the whole range of VD values (usually using the
Obach et al, 2008 database). However, due to the large range of VD values (with a skewed
distribution), the obtained GMFE errors are still relatively high. In this regard, this MSc
dissertation project attempted to refine the prediction by dividing the instances available in
Obach's database into six different equal and unequal size divisions, namely 33% 33% 33%,
20% 60% 20%, 20% 20% 20% 20% 20%, 10% 20% 40% 20% 10%, 92% 8% and 27% 27% 46%. The aim of the
divisions was to reduce the range of VD values covered by each model (in this case, an RF of
decision trees), which would hopefully increase the accuracy of VD prediction. This entails a
two-stage approach: first, a classification model is built to predict the class of the test
instance; then, in the second stage, a regression model built for that class is used to predict the
VD value. Decision trees were the approach suggested by the supervisor, as they are a popular
approach in the literature (the supervisor also suggested the M5P algorithm, but I was not able to
implement it successfully in Matlab). The Random Forest (RF) approach was chosen for its improved
predictive ability compared to a single decision tree. The number of available attributes was 293,
and the RF approach used 17 of these attributes, chosen randomly, as candidate attributes at each
node of a decision tree.
An RF with 100 trees was used; although it may be possible to achieve better performance with a
higher number of trees, this would be more computationally expensive. Within the class divisions,
folds of training, validation and testing data were created to implement this modified two-stage
approach with the confidence measure and to compare its performance against standard regression
using three error measures: RMSE, MAE and GMFE. Initially, the analysis was done in Weka, but as it
became difficult to handle the data divisions effectively, Matlab was chosen as the software to
perform the analysis.
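For clarity, the three error measures as used here can be written compactly in Matlab. This is a minimal sketch, assuming the predicted and actual values are log10 VD values (so that the fold error of a prediction is 10 raised to the absolute difference of the log values); it is not taken verbatim from the project code.
%minimal sketch of the three error measures on log10(VD) values
%(predicted and actual are column vectors of log10 VD)
function [rmse, mae, gmfe] = vd_errors(predicted, actual)
    d = predicted - actual;       %errors in log10 units
    rmse = sqrt(mean(d.^2));      %root mean squared error
    mae = mean(abs(d));           %mean absolute error
    gmfe = 10^mean(abs(d));       %geometric mean fold error
end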
When the two-stage method (classification-regression without confidence measure) was implemented,
it was realised that some classes had poor predictive ability in the first stage. This was mainly
due to unequal class sizes: the class with more training instances would perform well, but the
overall accuracy would be poor because the other classes performed very poorly. This resulted in
higher error values than standard regression for all six class divisions. Upon discussing this
finding with the supervisor, a modified two-stage approach was suggested, which differs by
utilising a confidence measure of the RF classification in the first stage. While many approaches
could have been explored, due to time limitations the readily available classification confidence
score from the TreeBagger function in Matlab was utilised. This score provides a measure for
deciding how reliable the classification is. Using this confidence measure, the two-stage approach
is only applied to instances where the confidence exceeds a certain threshold; otherwise, the
instance is subjected to standard regression only.
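A condensed sketch of this two-stage prediction with the confidence check is given below. It is illustrative rather than a copy of the project code: the variable names (dtrain, ctrain, ytrain, xtest, threshold, rf_reg_per_class) are assumptions, as is the exact form of the confidence factor, which is taken here as the ratio of the highest to the second-highest class score returned by TreeBagger (a ratio would explain threshold factors greater than 1, but the project may have computed it differently).
%stage 1: classification forest (100 trees, 17 candidate attributes per node)
rf_class = TreeBagger(100, dtrain, ctrain, 'Method', 'classification', ...
    'NumPredictorsToSample', 17);
%fallback model: standard regression over all training instances
rf_reg_all = TreeBagger(100, dtrain, ytrain, 'Method', 'regression');
%rf_reg_per_class{k} is assumed to be a regression forest trained only on class k
[labels, scores] = predict(rf_class, xtest);   %class votes for each test drug
yhat = zeros(size(xtest,1), 1);
for i = 1:size(xtest,1)
    s = sort(scores(i,:), 'descend');
    confidence = s(1) / max(s(2), eps);        %assumed confidence factor
    if confidence >= threshold
        k = str2double(labels{i});             %predicted class of the test drug
        yhat(i) = predict(rf_reg_per_class{k}, xtest(i,:));
    else
        yhat(i) = predict(rf_reg_all, xtest(i,:));  %resort to standard regression
    end
end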
The best threshold for each division was decided using the training and validation folds. When
tested with the testing data, the modified approach proved useful: it improved the error measures
mainly for the equal class divisions, while performing comparably to standard regression for the
other class divisions. The best performing division was 33% 33% 33%, which gave a Geometric Mean
Fold Error (GMFE) of 2.1502 for the two-stage approach with confidence measure, as compared to
2.3416 for the two-stage approach without confidence measure and 2.2373 for the standard regression
approach. For the 20% 20% 20% 20% 20% data division, the GMFE errors are 2.2013, 2.3798 and 2.2356
for the proposed two-stage approach with confidence measure, the two-stage approach without
confidence measure and standard regression, respectively.
An exact comparison with the results of others, such as Lombardo et al (2006), is not possible:
Lombardo et al used only 384 drugs and obtained GMFE errors of 1.83 to 2.24, whereas the work here
used 602 drugs. Berellini et al (2009) used 669 compounds and obtained a GMFE of 1.8, but it is not
clear whether cross validation of the dataset was used. Freitas et al (2015) used 402 compounds for
training and 202 for testing, with 56 attributes selected by correlation-based feature selection
with genetic search, and obtained a GMFE of 2.29.
In conclusion, the results obtained here are promising when compared with the current state of the
art, and with further work the prediction of VD values can be made more accurate.
6.1 Suggestions for future work
Research is of little use if it is not utilised in some form. In this regard, it is hoped that the
outcome of this research will be extended in future (and perhaps applied to real-life scenarios);
some possibilities are:
- Explore other class divisions, for example using density-based clustering. This could group VD
values that follow a similar pattern more closely, which could improve the prediction accuracy of
the first stage of the two-stage approach.
- Explore other classifiers for the first stage, for example deep learning, which is becoming very
popular and has been shown to give improved classification results provided there is sufficient
training data.
- Use other decision tree approaches in both the first and second stages, such as M5P. Increasing
the number of trees could also improve performance in both stages.
- Use feature selection (such as sequential forward floating selection) to select the most
appropriate attributes; this is likely to reduce errors if the selected features have high
inter-class variance and low intra-class variance.
- Use other, more advanced confidence measures in the first stage of classification, for example a
Bayesian or other probability-based approach.
6.2 Reflection
The project was challenging as I did not have any knowledge of pharmacokinetics. I thought the
volume of distribution of a drug would be the actual volume of drug delivered! However, my
supervisor assured me that in-depth knowledge of pharmacokinetics would not be required and
provided me with a good set of journal papers and a book to read to equip myself.
Although I had acquired some knowledge of Matlab (through self-study) for one of my assignments,
writing code for a full research problem was daunting. Breaking the task into smaller sections, as
advised by my supervisor, proved very useful, and small pieces of code were built on top of
previous ones a little at a time.
With a family to manage, time management was probably the most difficult aspect of my project.
However, pursuing the MSc taught me to divide my time so as to strike a balance between the project
and family.
My knowledge of machine learning, especially decision trees, greatly improved through implementing
the project. I learnt that acquiring knowledge to pass exams is one thing, but practical
implementation is another matter altogether.
References
E. M. del Amo, L. Ghemtio, H. Xhaard, M. Yliperttula, A. Urtti and H. Kidron, “Applying
linear and non-linear methods for parallel prediction of volume of distribution and fraction
of unbound drug,” PLoS One, vol. 8, no. 10, e74758, 2013.
G. Berellini, C. Springer, N. J. Waters and F. Lombardo, “In silico prediction of volume of
distribution in human using linear and nonlinear models on a 669 compound data set,” J.
Med. Chem., vol. 52, pp. 4488–4495, 2009.
L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
O. Demir-Kavuk, J. Bentzien, I. Muegge, and E-W. Knapp, “DemQSAR: predicting human
volume of distribution and clearance of drugs,” J. Comput. Aided Mol. Des., vol. 25, pp.
1121-1133, 2011.
T. G. Dietterich, “Approximate statistical tests for comparing supervised classification
learning algorithms,” Neural Computation, vol. 10, no. 7, pp. 1895-1923, 1998.
A. A. Freitas, K. Limbu, and T. Ghafourian, “Predicting volume of distribution with
decision tree-based regression methods using predicted tissue: plasma partition coefficients,”
J. Cheminformatics, vol. 7, no. 6, 17 pages, 2015.
S. S. Jambhekar, and P. J. Breen, Basic Pharmacokinetics, Second Edition, Pharmaceutical
Press, London, 2012.
R. D. Jones, H. M. Jones, M. Rowland, C. R. Gibson, J. W. T. Yates, J. Y. Chien et al.,
“PhRMA CPCDC initiative on predictive models of human pharmacokinetics part 2:
Comparative assessment of prediction methods of volume of distribution,” Journal of
Pharmaceutical Sciences, vol. 100, pp. 4074-4089, 2011.
F. Lombardo, R. S. Obach, F. M. DiCapua, G. A. Bakken, J. Lu, D. M. Potter, et al. “A
hybrid mixture discriminant analysis – random forest computational model for the prediction
of volume of distribution of drugs in human,” J. Med. Chem., vol. 49, pp. 2262–2267, 2006.
B. Louis and V. K. Agrawal, “Quantitative structure-pharmacokinetic relationship (QSPkR)
analysis of the volume of distribution values of anti-infective agents from J group of the
ATC classification in humans,” Acta. Pharm., vol. 62, pp. 305–323, 2012.
R. S. Obach, F. Lombardo, and N.J. Waters, “Trend analysis of a database of intravenous
pharmacokinetic parameters in humans for 670 drug compounds,” Drug. Metab. Dispos.,
vol. 36, pp.1385–1405, 2008.
W. G. Touw, J. R. Bayjanov, L. Overmars, L. Backus, J. Boekhorst, M. Wels, and S. A. van
Hijum. “Data mining in the life sciences with random forest: A walk in the park or lost in
the jungle?” Briefings in Bioinformatics, vol. 14, no. 3, pp. 315-326, 2013.
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques,
Third Edition, Elsevier/Morgan Kaufman, San Francisco, 2011.
Z. Zhivkova and I. Doytchinova, “Prediction of steady-state volume of distribution of acidic
drugs by quantitative structure-pharmacokinetics relationships,” J. Pharm Sci., vol. 101, pp.
1253–1266, 2012.
Z. Zhivkova, T. Mandova, and I. Doytchinova, “Quantitative structure–pharmacokinetics
relationships analysis of basic drugs: volume of distribution,” Journal of Pharmacy &
Pharmaceutical Sciences, vol. 18, no.3, pp.515-527, 2015.
Appendix – Matlab program (The MathWorks Inc., 2016)
%program to create ten folds of data for the 33%33%33% data division
%Nithyakalyani Chinnaiah, nc330@kent.ac.uk, August 2016
%due to the high number of lines of code, only the code for one data
%division is included in this Appendix. The full set of code can be
%found in the corpus.
clc; clear all; close all
%just for info, class divisions
%y33(1:201)=1; %y33(202:399)=2; %y33(400:602)=3;
x=xlsread('Pharma_data1_only_for_matlab.xlsx'); %read data, 293
%attributes, 1 log Vd
temp=randperm(201); %for first class
cls1_fd1=temp(1:20);
cls1_fd2=temp(21:40);
cls1_fd3=temp(41:60);
cls1_fd4=temp(61:80);
cls1_fd5=temp(81:100);
cls1_fd6=temp(101:120);
cls1_fd7=temp(121:140);
cls1_fd8=temp(141:160);
cls1_fd9=temp(161:180);
cls1_fd10=temp(181:201);
ytemp2=202:399; %for second class
temp2=ytemp2(randperm(198));
cls2_fd1=temp2(1:20);
cls2_fd2=temp2(21:40);
cls2_fd3=temp2(41:60);
cls2_fd4=temp2(61:80);
cls2_fd5=temp2(81:100);
cls2_fd6=temp2(101:120);
cls2_fd7=temp2(121:140);
cls2_fd8=temp2(141:160);
cls2_fd9=temp2(161:180);
cls2_fd10=temp2(181:198);
ytemp3=400:602; %for third class
temp3=ytemp3(randperm(203));
cls3_fd1=temp3(1:20);
cls3_fd2=temp3(21:40);
cls3_fd3=temp3(41:60);
cls3_fd4=temp3(61:80);
cls3_fd5=temp3(81:100);
cls3_fd6=temp3(101:120);
cls3_fd7=temp3(121:140);
cls3_fd8=temp3(141:160);
cls3_fd9=temp3(161:180);
cls3_fd10=temp3(181:203);
%splitting the indices for 10 folds of data division with 3 classes
fold10=[cls1_fd10,cls2_fd10,cls3_fd10];
fold9=[cls1_fd9,cls2_fd9,cls3_fd9];
fold8=[cls1_fd8,cls2_fd8,cls3_fd8];
fold7=[cls1_fd7,cls2_fd7,cls3_fd7];
fold6=[cls1_fd6,cls2_fd6,cls3_fd6];
fold5=[cls1_fd5,cls2_fd5,cls3_fd5];
fold4=[cls1_fd4,cls2_fd4,cls3_fd4];
fold3=[cls1_fd3,cls2_fd3,cls3_fd3];
fold2=[cls1_fd2,cls2_fd2,cls3_fd2];
fold1=[cls1_fd1,cls2_fd1,cls3_fd1];
%class labels
cfold1=[ones(size(cls1_fd1)),1+ones(size(cls2_fd1)),2+ones(size(cls3_fd1))];
cfold2=[ones(size(cls1_fd2)),1+ones(size(cls2_fd2)),2+ones(size(cls3_fd2))];
cfold3=[ones(size(cls1_fd3)),1+ones(size(cls2_fd3)),2+ones(size(cls3_fd3))];
cfold4=[ones(size(cls1_fd4)),1+ones(size(cls2_fd4)),2+ones(size(cls3_fd4))];
cfold5=[ones(size(cls1_fd5)),1+ones(size(cls2_fd5)),2+ones(size(cls3_fd5))];
cfold6=[ones(size(cls1_fd6)),1+ones(size(cls2_fd6)),2+ones(size(cls3_fd6))];
cfold7=[ones(size(cls1_fd7)),1+ones(size(cls2_fd7)),2+ones(size(cls3_fd7))];
cfold8=[ones(size(cls1_fd8)),1+ones(size(cls2_fd8)),2+ones(size(cls3_fd8))];
cfold9=[ones(size(cls1_fd9)),1+ones(size(cls2_fd9)),2+ones(size(cls3_fd9))];
cfold10=[ones(size(cls1_fd10)),1+ones(size(cls2_fd10)),2+ones(size(cls3_fd10))];
save datafold33.mat
%%%%%% creating folds for different training, validation and testing %%%%%%
% fold1 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold1,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fold6,:);x(fold7,:);x(fold8,:)]; %training data
dval=x(fold9,:); %validation data
%fold1 labels
ctrain=[cfold1,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold8]; %training data label
cval=cfold9; %validation data label
%fold 1 class 1
cls1_train=[cls1_fd1,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd7,cls1_fd8]; %training data
cls1_val= cls1_fd9; %validation fold
%fold 1 class 2
cls2_train=[cls2_fd1,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd7,cls2_fd8];
cls2_val= cls2_fd9;
%fold 1 class 3
cls3_train=[cls3_fd1,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cls3_fd7,cls3_fd8];
cls3_val= cls3_fd9;
save ('datafold_1.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fold2 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fold6,:);x(fold7,:);x(fold8,:)]; %training data
dval=x(fold1,:); %validation data
%fold2 labels
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold8]; %training data label
cval=cfold1; %validation data label
%fold 2 class 1
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd7,cls1_fd8]; %training data
cls1_val= cls1_fd1; %validation fold
%fold 2 class 2
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd7,cls2_fd8];
cls2_val= cls2_fd1;
%fold 2 class 3
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cls3_fd7,cls3_fd8];
cls3_val= cls3_fd1;
save ('datafold_2.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fold3 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold1,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold8,:)]; %training data
dval=x(fold2,:); %validation data
%fold1 labels
ctrain=[cfold9,cfold1,cfold3,cfold4,cfold5,cfold6,cfold7,cfold8];
%training data label
cval=cfold2; %validation data label
%fold 1 class 1
cls1_train=[cls1_fd9,cls1_fd1,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd8]; %training data
cls1_val= cls1_fd2; %validation fold
%fold 1 class 2
cls2_train=[cls2_fd9,cls2_fd1,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd8];
cls2_val= cls2_fd2;
%fold 1 class 3
cls3_train=[cls3_fd9,cls3_fd1,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd8];
cls3_val= cls3_fd2;
save ('datafold_3.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fold4 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold1,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold8,:)]; %training data
dval=x(fold3,:); %validation data
%fold1 labels
ctrain=[cfold9,cfold2,cfold1,cfold4,cfold5,cfold6,cfold7,cfold8];
%training data label
cval=cfold3; %validation data label
%fold 1 class 1
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd1,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd8]; %training data
cls1_val= cls1_fd3; %validation fold
%fold 1 class 2
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd1,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd8];
cls2_val= cls2_fd3;
%fold 1 class 3
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd1,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd8];
cls3_val= cls3_fd3;
save ('datafold_4.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fold5 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold1,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold8,:)]; %training data
dval=x(fold4,:); %validation data
%fold1 labels
ctrain=[cfold9,cfold2,cfold3,cfold1,cfold5,cfold6,cfold7,cfold8];
%training data label
cval=cfold4; %validation data label
%fold 1 class 1
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd1,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd8]; %training data
cls1_val= cls1_fd4; %validation fold
%fold 1 class 2
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd1,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd8];
cls2_val= cls2_fd4;
%fold 1 class 3
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd1,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd8];
cls3_val= cls3_fd4;
save ('datafold_5.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fold6 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold1,:);x(fol
d6,:);x(fold7,:);x(fold8,:)]; %training data
dval=x(fold5,:); %validation data
%fold1 labels
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold1,cfold6,cfold7,cfold8];
%training data label
cval=cfold5; %validation data label
%fold 1 class 1
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd1,cls1_fd6,cl
s1_fd7,cls1_fd8]; %training data
cls1_val= cls1_fd5; %validation fold
%fold 1 class 2
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd1,cls2_fd6,cl
s2_fd7,cls2_fd8];
cls2_val= cls2_fd5;
%fold 1 class 3
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd1,cls3_fd6,cl
s3_fd7,cls3_fd8];
cls3_val= cls3_fd5;
save ('datafold_6.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fold7 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d1,:);x(fold7,:);x(fold8,:)]; %training data
dval=x(fold6,:); %validation data
%fold1 labels
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold1,cfold7,cfold8];
%training data label
cval=cfold6; %validation data label
%fold 1 class 1
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd1,cl
s1_fd7,cls1_fd8]; %training data
cls1_val= cls1_fd6; %validation fold
%fold 1 class 2
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd1,cl
s2_fd7,cls2_fd8];
cls2_val= cls2_fd6;
%fold 1 class 3
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd1,cl
s3_fd7,cls3_fd8];
cls3_val= cls3_fd6;
save ('datafold_7.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fold8 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold1,:);x(fold8,:)]; %training data
dval=x(fold7,:); %validation data
%fold1 labels
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold1,cfold8];
%training data label
cval=cfold7; %validation data label
%fold 1 class 1
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd1,cls1_fd8]; %training data
cls1_val= cls1_fd7; %validation fold
%fold 1 class 2
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd1,cls2_fd8];
cls2_val= cls2_fd7;
%fold 1 class 3
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd1,cls3_fd8];
cls3_val= cls3_fd7;
save ('datafold_8.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% fold9 data %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold1,:)]; %training data
dval=x(fold8,:); %validation data
%fold1 labels
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold1];
%training data label
cval=cfold8; %validation data label
%fold 1 class 1
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd1]; %training data
cls1_val= cls1_fd8; %validation fold
%fold 1 class 2
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd1];
cls2_val= cls2_fd8;
%fold 1 class 3
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd1];
cls3_val= cls3_fd8;
save ('datafold_9.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold1,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold9,:);
ctrain=[cfold1,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10];
cval=cfold9;
cls1_train=[cls1_fd1,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd9;
cls2_train=[cls2_fd1,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd9;
cls3_train=[cls3_fd1,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd9;
save ('datafold_10.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold1,:);
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10];
cval=cfold1;
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd1;
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd1;
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd1;
save ('datafold_11.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold1,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold2,:);
ctrain=[cfold9,cfold1,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10];
cval=cfold2;
cls1_train=[cls1_fd9,cls1_fd1,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd2;
cls2_train=[cls2_fd9,cls2_fd1,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd2;
cls3_train=[cls3_fd9,cls3_fd1,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd2;
save ('datafold_12.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold1,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold3,:);
ctrain=[cfold9,cfold2,cfold1,cfold4,cfold5,cfold6,cfold7,cfold10];
cval=cfold3;
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd1,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd3
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd1,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd3;
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd1,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd3;
save ('datafold_13.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold1,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold4,:);
ctrain=[cfold9,cfold2,cfold3,cfold1,cfold5,cfold6,cfold7,cfold10];
cval=cfold4;
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd1,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd4;
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd1,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd4;
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd1,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd4;
save ('datafold_14.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold1,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold5,:);
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold1,cfold6,cfold7,cfold10];
cval=cfold5;
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd1,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd5;
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd1,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd5;
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd1,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd5;
save ('datafold_15.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d1,:);x(fold7,:);x(fold10,:)];
dval=x(fold6,:);
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold1,cfold7,cfold10];
cval=cfold6;
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd1,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd6;
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd1,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd6;
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd1,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd6;
save ('datafold_16.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold1,:);x(fold10,:)];
dval=x(fold7,:);
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold1,cfold10];
cval=cfold7;
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd1,cls1_fd10];
cls1_val= cls1_fd7;
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd1,cls2_fd10];
cls2_val= cls2_fd7;
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd1,cls3_fd10];
cls3_val= cls3_fd7;
save ('datafold_17.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold1,:)];
dval=x(fold10,:);
ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold1];
cval=cfold10;
cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd1];
cls1_val= cls1_fd10;
cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd1];
cls2_val= cls2_fd10;
cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd1];
cls3_val= cls3_fd10;
save ('datafold_18.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold1,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold8,:);
ctrain=[cfold1,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10];
cval=cfold8;
cls1_train=[cls1_fd1,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd8;
cls2_train=[cls2_fd1,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd8;
cls3_train=[cls3_fd1,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd8;
save ('datafold_19.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold1,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10];
cval=cfold1;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd1;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd1;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd1;
save ('datafold_20.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold1,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold2,:);
ctrain=[cfold8,cfold1,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10];
cval=cfold2;
cls1_train=[cls1_fd8,cls1_fd1,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd2;
cls2_train=[cls2_fd8,cls2_fd1,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd2;
cls3_train=[cls3_fd8,cls3_fd1,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd2;
save ('datafold_21.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold1,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold3,:);
ctrain=[cfold8,cfold2,cfold1,cfold4,cfold5,cfold6,cfold7,cfold10];
cval=cfold3;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd1,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd3
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd1,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd3;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd1,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd3;
save ('datafold_22.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold1,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold4,:);
ctrain=[cfold8,cfold2,cfold3,cfold1,cfold5,cfold6,cfold7,cfold10];
cval=cfold4;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd1,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd4;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd1,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd4;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd1,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd4;
save ('datafold_23.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold1,:);x(fol
d6,:);x(fold7,:);x(fold10,:)];
dval=x(fold5,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold1,cfold6,cfold7,cfold10];
cval=cfold5;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd1,cls1_fd6,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd5;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd1,cls2_fd6,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd5;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd1,cls3_fd6,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd5;
save ('datafold_24.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d1,:);x(fold7,:);x(fold10,:)];
dval=x(fold6,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold1,cfold7,cfold10];
cval=cfold6;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd1,cl
s1_fd7,cls1_fd10];
cls1_val= cls1_fd6;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd1,cl
s2_fd7,cls2_fd10];
cls2_val= cls2_fd6;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd1,cl
s3_fd7,cls3_fd10];
cls3_val= cls3_fd6;
save ('datafold_25.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold1,:);x(fold10,:)];
dval=x(fold7,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold1,cfold10];
cval=cfold7;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd1,cls1_fd10];
cls1_val= cls1_fd7;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd1,cls2_fd10];
cls2_val= cls2_fd7;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd1,cls3_fd10];
cls3_val= cls3_fd7;
save ('datafold_26.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold7,:);x(fold1,:)];
dval=x(fold10,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold1];
cval=cfold10;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd7,cls1_fd1];
cls1_val= cls1_fd10;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd7,cls2_fd1];
cls2_val= cls2_fd10;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd7,cls3_fd1];
cls3_val= cls3_fd10;
save ('datafold_27.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold1,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold9,:);x(fold10,:)];
dval=x(fold8,:);
ctrain=[cfold1,cfold2,cfold3,cfold4,cfold5,cfold6,cfold9,cfold10];
cval=cfold8;
cls1_train=[cls1_fd1,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd9,cls1_fd10];
cls1_val= cls1_fd8;
cls2_train=[cls2_fd1,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd9,cls2_fd10];
cls2_val= cls2_fd8;
cls3_train=[cls3_fd1,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd9,cls3_fd10];
cls3_val= cls3_fd8;
save ('datafold_28.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold9,:);x(fold10,:)];
dval=x(fold1,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold9,cfold10];
cval=cfold1;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd9,cls1_fd10];
cls1_val= cls1_fd1;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd9,cls2_fd10];
cls2_val= cls2_fd1;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd9,cls3_fd10];
cls3_val= cls3_fd1;
save ('datafold_29.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold1,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold9,:);x(fold10,:)];
dval=x(fold2,:);
ctrain=[cfold8,cfold1,cfold3,cfold4,cfold5,cfold6,cfold9,cfold10];
cval=cfold2;
cls1_train=[cls1_fd8,cls1_fd1,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd9,cls1_fd10];
cls1_val= cls1_fd2;
cls2_train=[cls2_fd8,cls2_fd1,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd9,cls2_fd10];
cls2_val= cls2_fd2;
cls3_train=[cls3_fd8,cls3_fd1,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd9,cls3_fd10];
cls3_val= cls3_fd2;
save ('datafold_30.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold1,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold9,:);x(fold10,:)];
dval=x(fold3,:);
ctrain=[cfold8,cfold2,cfold1,cfold4,cfold5,cfold6,cfold9,cfold10];
cval=cfold3;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd1,cls1_fd4,cls1_fd5,cls1_fd6,cl
s1_fd9,cls1_fd10];
cls1_val= cls1_fd3
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd1,cls2_fd4,cls2_fd5,cls2_fd6,cl
s2_fd9,cls2_fd10];
cls2_val= cls2_fd3;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd1,cls3_fd4,cls3_fd5,cls3_fd6,cl
s3_fd9,cls3_fd10];
cls3_val= cls3_fd3;
save ('datafold_31.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold1,:);x(fold5,:);x(fol
d6,:);x(fold9,:);x(fold10,:)];
dval=x(fold4,:);
ctrain=[cfold8,cfold2,cfold3,cfold1,cfold5,cfold6,cfold9,cfold10];
cval=cfold4;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd1,cls1_fd5,cls1_fd6,cl
s1_fd9,cls1_fd10];
cls1_val= cls1_fd4;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd1,cls2_fd5,cls2_fd6,cl
s2_fd9,cls2_fd10];
cls2_val= cls2_fd4;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd1,cls3_fd5,cls3_fd6,cl
s3_fd9,cls3_fd10];
cls3_val= cls3_fd4;
save ('datafold_32.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold1,:);x(fol
d6,:);x(fold9,:);x(fold10,:)];
dval=x(fold5,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold1,cfold6,cfold9,cfold10];
cval=cfold5;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd1,cls1_fd6,cl
s1_fd9,cls1_fd10];
cls1_val= cls1_fd5;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd1,cls2_fd6,cl
s2_fd9,cls2_fd10];
cls2_val= cls2_fd5;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd1,cls3_fd6,cl
s3_fd9,cls3_fd10];
cls3_val= cls3_fd5;
save ('datafold_33.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d1,:);x(fold9,:);x(fold10,:)];
dval=x(fold6,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold1,cfold9,cfold10];
cval=cfold6;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd1,cl
s1_fd9,cls1_fd10];
cls1_val= cls1_fd6;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd1,cl
s2_fd9,cls2_fd10];
cls2_val= cls2_fd6;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd1,cl
s3_fd9,cls3_fd10];
cls3_val= cls3_fd6;
save ('datafold_34.mat', 'dtrain', 'dval', 'ctrain', 'cval',
'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train',
'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol
d6,:);x(fold1,:);x(fold10,:)];
dval=x(fold9,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold1,cfold10];
cval=cfold9;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd1,cls1_fd10];
cls1_val= cls1_fd9;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd1,cls2_fd10];
cls2_val= cls2_fd9;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cls3_fd1,cls3_fd10];
cls3_val= cls3_fd9;
save ('datafold_35.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
LC-MS methods for regulated bioequivalenceBhaswat Chakraborty
 
Role of Biostatistics in Clinical Trials
Role of Biostatistics in Clinical TrialsRole of Biostatistics in Clinical Trials
Role of Biostatistics in Clinical TrialsClinosolIndia
 
Principles of drug trial in cardiology
Principles of drug trial in cardiology Principles of drug trial in cardiology
Principles of drug trial in cardiology Ramachandra Barik
 
A Brief Cognitive Assessment For Use With Schizophrenia Patients In Community...
A Brief Cognitive Assessment For Use With Schizophrenia Patients In Community...A Brief Cognitive Assessment For Use With Schizophrenia Patients In Community...
A Brief Cognitive Assessment For Use With Schizophrenia Patients In Community...Sarah Marie
 
Redefining the Gold Standard
Redefining the Gold StandardRedefining the Gold Standard
Redefining the Gold StandardCovance
 
Enhanced Detection System for Trust Aware P2P Communication Networks
Enhanced Detection System for Trust Aware P2P Communication NetworksEnhanced Detection System for Trust Aware P2P Communication Networks
Enhanced Detection System for Trust Aware P2P Communication NetworksEditor IJCATR
 
Comparative Study of Diabetic Patient Data’s Using Classification Algorithm i...
Comparative Study of Diabetic Patient Data’s Using Classification Algorithm i...Comparative Study of Diabetic Patient Data’s Using Classification Algorithm i...
Comparative Study of Diabetic Patient Data’s Using Classification Algorithm i...Editor IJCATR
 
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...Editor IJCATR
 
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...IRJET Journal
 

Similar to Predicting volume of distribution for drug compounds using decision trees two-stage approach (20)

Cadd
CaddCadd
Cadd
 
Applications Of Bioinformatics In Drug Discovery And Process
Applications Of Bioinformatics In Drug Discovery And ProcessApplications Of Bioinformatics In Drug Discovery And Process
Applications Of Bioinformatics In Drug Discovery And Process
 
Review on Bioanalytical Method Development in Human Plasma
Review on Bioanalytical Method Development in Human PlasmaReview on Bioanalytical Method Development in Human Plasma
Review on Bioanalytical Method Development in Human Plasma
 
LIVER DISEASE PREDICTION BY USING DIFFERENT DECISION TREE TECHNIQUES
LIVER DISEASE PREDICTION BY USING DIFFERENT DECISION TREE TECHNIQUESLIVER DISEASE PREDICTION BY USING DIFFERENT DECISION TREE TECHNIQUES
LIVER DISEASE PREDICTION BY USING DIFFERENT DECISION TREE TECHNIQUES
 
Applicationsofbioinformaticsindrugdiscoveryandprocess
ApplicationsofbioinformaticsindrugdiscoveryandprocessApplicationsofbioinformaticsindrugdiscoveryandprocess
Applicationsofbioinformaticsindrugdiscoveryandprocess
 
Innovative Strategies For Successful Trial Design - Webinar Slides
Innovative Strategies For Successful Trial Design - Webinar SlidesInnovative Strategies For Successful Trial Design - Webinar Slides
Innovative Strategies For Successful Trial Design - Webinar Slides
 
College Writing II Synthesis Essay Assignment Summer Semester 2017.docx
College Writing II Synthesis Essay Assignment Summer Semester 2017.docxCollege Writing II Synthesis Essay Assignment Summer Semester 2017.docx
College Writing II Synthesis Essay Assignment Summer Semester 2017.docx
 
VO2max Trainability and High Intensity Interval Training in Humans: A Meta-An...
VO2max Trainability and High Intensity Interval Training in Humans: A Meta-An...VO2max Trainability and High Intensity Interval Training in Humans: A Meta-An...
VO2max Trainability and High Intensity Interval Training in Humans: A Meta-An...
 
Bioinformatics and Drug Discovery
Bioinformatics and Drug DiscoveryBioinformatics and Drug Discovery
Bioinformatics and Drug Discovery
 
Delayed Randomized Design
Delayed Randomized DesignDelayed Randomized Design
Delayed Randomized Design
 
A Guide To Performing Systematic Literature Reviews In Bioinformatics
A Guide To Performing Systematic Literature Reviews In BioinformaticsA Guide To Performing Systematic Literature Reviews In Bioinformatics
A Guide To Performing Systematic Literature Reviews In Bioinformatics
 
LC-MS methods for regulated bioequivalence
LC-MS methods for regulated bioequivalenceLC-MS methods for regulated bioequivalence
LC-MS methods for regulated bioequivalence
 
Role of Biostatistics in Clinical Trials
Role of Biostatistics in Clinical TrialsRole of Biostatistics in Clinical Trials
Role of Biostatistics in Clinical Trials
 
Principles of drug trial in cardiology
Principles of drug trial in cardiology Principles of drug trial in cardiology
Principles of drug trial in cardiology
 
A Brief Cognitive Assessment For Use With Schizophrenia Patients In Community...
A Brief Cognitive Assessment For Use With Schizophrenia Patients In Community...A Brief Cognitive Assessment For Use With Schizophrenia Patients In Community...
A Brief Cognitive Assessment For Use With Schizophrenia Patients In Community...
 
Redefining the Gold Standard
Redefining the Gold StandardRedefining the Gold Standard
Redefining the Gold Standard
 
Enhanced Detection System for Trust Aware P2P Communication Networks
Enhanced Detection System for Trust Aware P2P Communication NetworksEnhanced Detection System for Trust Aware P2P Communication Networks
Enhanced Detection System for Trust Aware P2P Communication Networks
 
Comparative Study of Diabetic Patient Data’s Using Classification Algorithm i...
Comparative Study of Diabetic Patient Data’s Using Classification Algorithm i...Comparative Study of Diabetic Patient Data’s Using Classification Algorithm i...
Comparative Study of Diabetic Patient Data’s Using Classification Algorithm i...
 
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
C omparative S tudy of D iabetic P atient D ata’s U sing C lassification A lg...
 
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
 

Recently uploaded

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Recently uploaded (20)

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Predicting volume of distribution for drug compounds using decision trees two-stage approach

Chapter 1: Introduction

The research project is about improving the prediction of volume of distribution (VD) for drug compounds using classification approaches. VD is one of the most important pharmacokinetic parameters of a drug: it relates the amount of drug in the body to the measured concentration in plasma. VD can be determined in-vivo (animal models, which are time consuming and raise ethical concerns), in-vitro (also time consuming and requiring costly biological assays) or with a third approach known as in-silico. Theoretical in-silico approaches offer high throughput, i.e. they are computationally fast and cheap (Freitas et al, 2015), so they can be applied first and followed by in-vivo or in-vitro experiments for more accurate results. In-silico prediction is also the only approach that allows un-synthesised drugs to be evaluated, and reliable in-silico predictions of this pharmacokinetic parameter would benefit the early stages of drug discovery.

In-silico predictions are usually performed using regression approaches; very few studies have used a classification approach. Although several data mining algorithms have been used to predict VD, the prediction error remains high because the range of VD values is very large. The hypothesis tested here is therefore that predictive performance can be improved by discretising the numerical VD values(1) into a few categorical classes and then applying regression within the predicted class. The performance will clearly vary with the type of discretisation used, so the goal of this work is to predict numerical VD using the predictions of the discretised classes. This dissertation proposes a novel two-stage classification-regression approach with a confidence measure that improves the prediction of VD. The data set used is extracted from Obach's database (Obach et al, 2008) and consists of 604 different drugs (instances).

The motivation behind this work is to predict the VD values of new drugs using the data mining algorithms mentioned above. Knowledge extracted from the research could be used by pharmaceutical scientists to determine appropriate dosage regimens, which would benefit the early stages of drug discovery, where experimental measurement is not feasible for screening purposes.

(1) Actually log VD rather than VD, due to the large range.
1.1 Contributions

The contributions of this work are:
• Six different discretisations of log VD values into categorical classes have been explored;
• A two-stage classification-regression approach has been developed with random forest (RF) classification models and RF regression models;
• A classification confidence score, measured from the class prediction confidences produced by the first-stage RF classification, is used to decide which specific RF regression model to apply in the second stage;
• The two-stage classification-regression approach with the confidence measure improves VD prediction compared with standard regression.

1.2 Structure of dissertation

The next chapter gives background details of the work, such as VD and drug compounds, followed by a generic description of the classification and regression tasks in data mining. RF, the approach used in this dissertation, is also described. A literature review of relevant work on classification and regression for predicting VD is given in Chapter 3. Chapter 4 discusses the proposed methodology, which starts with class discretisation and then describes the two-stage approach. Results and analysis are given in Chapter 5. The final chapter gives concluding remarks and reflections on the dissertation, with some suggestions for possible future work.
Chapter 2: Background

2.1 Volume of distribution

Jambhekar and Breen (2015) define VD as a proportionality constant, not a physiological volume. VD relates the plasma concentration to the mass of drug in the body at a given time and is one of the most important pharmacokinetic parameters of a drug. The magnitude of VD is unrelated to the actual plasma, extracellular or total body volume and can vary from 7 to 200 litres in a 70 kg subject. Even though VD has no direct physical meaning, determining it correctly is important, as otherwise it is difficult to give patients the appropriate dosage of a drug. For example, two drugs could have different VD because of their physicochemical properties (i.e. chemical structure), which causes different concentrations at steady state even if the drugs have the same initial concentrations and elimination half-lives. The higher the VD, the greater the ability of the drug to be absorbed by the tissues. Acidic drugs are highly water soluble and do not penetrate tissues to a significant degree, and hence have lower VD values. Basic drugs, on the other hand, bind easily to extracellular tissues and hence have a larger VD. As a result, VD spans a large range of values, which makes its prediction a non-trivial task.

In general, VD can be determined using three approaches, namely in-vivo, in-vitro and in-silico (Freitas et al, 2015). In-vivo models give a lot of information about the pharmacokinetic properties of drugs, but they raise ethical issues as animals are involved, and the approach is time consuming and costly. In-vitro is better than in-vivo in terms of cost and time, but is still costly and time consuming as it involves biological assays. In-silico approaches are based on theoretical models; although they lack the abundance of information available from the experimental approaches, they are less time consuming and less costly. It is therefore possible to use an in-silico model first and then perform in-vivo or in-vitro experiments for more accurate results. In addition, only this approach allows the direct use of human data, which can be used to evaluate the pharmacokinetics of un-synthesised drugs.

2.2 An example of a drug and molecular descriptors

Hydroxychloroquine, which is used to treat malaria, is an example of a drug. Its molecular formula is C18H26ClN3O and it has a VD value of 700 L/kg. Figure 1 shows its molecular structure.
Figure 1: Molecular structure of hydroxychloroquine (image from https://pubchem.ncbi.nlm.nih.gov/compound/3652#section=2D-Structure)

A drug can be described by a number of molecular descriptors, such as surface tension, molar volume, the numbers of hydrogen (H) donors and H acceptors, etc. For instance, for hydroxychloroquine the number of H donors is 2 and the number of H acceptors is 4.

2.3 The Classification Task of Data Mining

In general, there are two types of machine learning. The first is supervised learning, where each training example has a pre-defined class label; the classification task in data mining is an example. The second is unsupervised learning, where examples are not assigned pre-defined classes (either on purpose or because they are unavailable); the clustering task is an example, and it is generally useful as a first step in exploring the data space. As the classification task is used in this project, it is briefly explained in this section.

Classification depends on the predictive characteristics of the features of the data and usually has a predefined class or group. The goal of any classification technique is to build a model, using the available training data, that automatically predicts the class of future unseen data based on a set of specific characteristics known as features. In other words, classification induces a model from the training set in order to predict the class of test examples. Examples of classification approaches are Bayesian learning, instance-based learning and decision trees (DT).

Bayesian classification uses the Naïve Bayes approach, which is based on probability theory. It assumes the attributes are independent of each other, and hence its classification accuracy is particularly sensitive to redundant attributes. On the other hand, it can handle missing attributes, since the attributes are treated as conditionally independent given the class.
Decision tree classification builds a DT in a top-down fashion. It selects one attribute at a time, partitions the examples according to the values of the selected attribute and repeats this process recursively. Information gain, which is based on information theory, is a commonly used attribute selection criterion. A decision tree can easily be converted into a set of IF-THEN rules, and the comprehensibility of the resulting classifier makes it popular.

Instance-based learning, such as k-nearest neighbour, is a straightforward approach that does not require a model to be built: a new instance is assigned the class of the closest instance(s) in the training data. Some applications have datasets that are too complex to allow the discovery of a simple abstract model, and in such cases this approach is useful.

It is common to use an ensemble of learners to increase class prediction accuracy, where the aggregated output of the classifiers (for example, via a majority voting strategy) improves predictive accuracy as long as the classifiers are diverse and make uncorrelated prediction errors. However, the use of such an ensemble requires more computation.

The performance of the classification task is usually evaluated using predictive accuracy, defined as the ratio of correctly classified examples in the test data to the total number of examples in the test data. Other performance measures are confusion matrices, sensitivity, precision, recall, etc. It is common to divide the available dataset into training, validation and test sets in order to train and test the developed classifier; the validation data is used to tune certain parameters of the classifier. Obviously, how the data is divided affects performance, and to minimise the variance in performance resulting from the data division, cross-validation is typically performed. In k-fold cross-validation, the available data is divided into k non-overlapping subsets; k-1 folds are used as training data while the remaining fold is used as test data. This procedure is repeated k times and the average of the k performances is used as the overall performance measure. A commonly used value of k is 10. In the extreme case, k is set to the number of available examples, which is known as the leave-one-out approach. K-fold cross-validation is statistically more reliable than a single training/test partition, but it is computationally more expensive.
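The cross-validation procedure described above can be sketched in a few lines of Matlab. This is an illustrative sketch only, not code from the project: the variable names (X, labels) and the use of cvpartition with TreeBagger are assumptions made for the example, with numeric class labels.

% Minimal sketch of 10-fold cross-validation for a classifier.
% X: N-by-D attribute matrix, labels: N-by-1 vector of numeric class labels.
k = 10;
cv = cvpartition(labels, 'KFold', k);          % partition the data into k folds
acc = zeros(k, 1);
for f = 1:k
    Xtr = X(training(cv, f), :);  ytr = labels(training(cv, f));
    Xte = X(test(cv, f), :);      yte = labels(test(cv, f));
    model = TreeBagger(100, Xtr, ytr);          % any classifier could be used here
    ypred = str2double(predict(model, Xte));    % predict returns the labels as strings
    acc(f) = mean(ypred == yte);                % fraction of test examples correctly classified
end
meanAccuracy = mean(acc);                       % cross-validated accuracy estimate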
2.4 Random Forest

Random forest (RF) is described by Touw et al (2012) as a versatile classification method appropriate for analysing large data sets. It is a classification technique that builds a model based on variables to separate the examples into different classes. The algorithm is popular because of its high prediction accuracy and its ability to provide information about the importance of variables for classification. RF is an ensemble of individual decision trees in which each tree is built using a random subset of variables, hence the name. By default (the parameter can be changed), RF uses the square root of the total number of variables as the number of candidate features considered at each tree node. The performance of RF depends on the strength of each individual tree in the forest and the correlation between the trees (Breiman, 2001). It works by building an ensemble of trees in which each tree's prediction is taken as a vote, and aggregation schemes such as majority voting select the most popular class, which can significantly improve the overall classification accuracy. To minimise correlation between the built trees, each tree is built from randomly selected training examples, as in the bagging method (Breiman, 1996), where each tree is grown from a bootstrap sample of training examples drawn randomly, with replacement, from the training set. Another method is random split selection, where the split at each node is chosen randomly among the k best splits (Dietterich, 1998). Yet another approach is to select the training set from a random set of weights on the training examples. RF has a number of advantages: it uses randomly selected attributes to produce good results in classification (less so in regression), and it is capable of extracting knowledge from omics data, for example when there are interactions between variables (Touw et al, 2012).

2.5 Regression

Regression is a data mining approach that predicts the value of a continuous (real-valued) variable using several independent attributes. Outputs such as age, weight, distance, temperature, income, sales, etc. can be predicted using regression techniques. RF decision trees can be used for regression, predicting numeric quantities rather than categories. The same kind of trees is used, but each leaf contains a numeric value that is the average of the output values across all training examples that reach that leaf (Witten et al, 2011).

2.6 Discretisation

The basic idea of discretisation is to convert a continuous (i.e. real-valued) attribute into a categorical (i.e. discrete or nominal) attribute. In general, two approaches can be followed: class-driven discretisation, which chooses interval boundaries taking the class distributions into account, and class-blind discretisation, which ignores the classes of the examples.
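As an illustration of class-blind discretisation, the sketch below bins a continuous variable (here log VD) into three equal-frequency classes using its tertiles. It is illustrative only; the variable name logvd and the choice of three bins are assumptions for the example.

% Equal-frequency (class-blind) discretisation of log VD into 3 classes.
% logvd: N-by-1 vector of log VD values (illustrative name).
edges = quantile(logvd, [1/3 2/3]);    % interval boundaries at the tertiles
class = ones(size(logvd));             % 1 = LOW by default
class(logvd > edges(1)) = 2;           % 2 = MEDIUM
class(logvd > edges(2)) = 3;           % 3 = HIGH
% Each class holds roughly one third of the instances; ties at the boundaries
% can make the proportions deviate slightly, as discussed in Chapter 4.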
Chapter 3: Literature Review on Data Mining for Predicting Volume of Distribution

Many different algorithms, such as DT-based regression, multiple linear regression and neural networks, have been trained and tested on data sets containing molecular descriptors of various compounds together with VD values determined after intravenous administration of the drugs to healthy people, in order to assess prediction accuracy (Jones et al, 2011).

The study by Freitas et al (2015) used a Quantitative Structure-Pharmacokinetic Relationship (QSPKR) model based on DTs with a feature selection method to predict the VD of chemical compounds (i.e. drugs). The model consists of two phases. In the first phase, Kt:p (the tissue-to-plasma concentration ratio) for unknown tissues was predicted from molecular descriptors, based on known Kt:p values of 110 compounds obtained via in-vivo or in-vitro experiments in animals (i.e. rats). In the second phase, this model was used to predict Kt:p for 604 compounds, and the predicted Kt:p values were used along with molecular descriptors to predict the VD of the compounds. In the first phase, four types of DTs were used to predict log Kt:p (rather than Kt:p, as the distribution of VD is skewed): a conventional regression tree, a model tree, if-then regression rules and bagging with model trees. The DTs were implemented using the M5P algorithm in WEKA. Mean absolute error (MAE) was calculated using 10-fold cross-validation, which showed that different models gave the best prediction of log Kt:p for different types of tissue. Bagging M5P gave the lowest error for 7 out of 13 tissues, but the authors used the best method for each tissue. In the second phase, correlation-based feature selection (CFS) with genetic search was used to obtain useful descriptors for predicting log VD. This method assigns a higher score to feature subsets with a higher correlation between the subset's features and the target variable (indicating the features have good predictive power) and a lower correlation among the subset's features (indicating less redundancy among those features). To test the prediction of VD, the training set consisted of 402 compounds and the test set of 202 compounds. The geometric mean fold error (GMFE, computed as the antilog10 of the MAE) was used as the performance measure, as it is less affected by extreme outliers (as mentioned, VD has a skewed distribution).
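Since GMFE is the main performance measure used throughout this dissertation, a small worked computation may help. This is an illustrative sketch; pred_logvd and true_logvd are assumed names for the predicted and actual log10 VD values of a set of test compounds.

% GMFE is the antilog10 of the mean absolute error computed on log10 VD values.
mae  = mean(abs(pred_logvd - true_logvd));   % MAE in log10 units
gmfe = 10^mae;                               % geometric mean fold error
% For example, an MAE of 0.36 log10 units gives a GMFE of about 2.3, i.e. the
% predictions are, on average, within a factor of roughly 2.3 of the true VD.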
Three different feature sets were used:
1. The predicted log Kt:p values plus the molecular features (GMFE of 2.61);
2. Only the molecular features (GMFE of 2.33);
3. The 56 descriptors selected by CFS (only two Kt:p values, for adipose tissue and thymus, with the remaining 54 being molecular descriptors), which gave a GMFE of 2.29.

Although approach 3 improved the GMFE slightly, the improvement was not statistically significant according to a t-test (p = 0.577). Analysis of the predicted Kt:p values showed that the error comes from compounds with very large or very small VD values (e.g. the hydroxychloroquine drug, with a VD value of 100 L/kg), rather than from the Kt:p prediction. The results were difficult to compare with other studies because of variations in the data sets, but GMFE values of 1.56 to 2.78 reported for inter-species scaling prediction of VD show that the GMFE of 2.29 obtained here was acceptable, though not the best.

In the study by Amo et al (2013), anatomical volumes were used as rough guidance to classify VD into three classes: the volume of extracellular fluid class, with a VD range of 0 – 0.3 L/kg; the distribution to the tissues class, with a VD range of 0.3 – 1 L/kg; and the binding to cellular components class, with VD higher than 1 L/kg. The initial dataset collated by Obach and co-workers (Obach et al, 2008) contains 670 compounds with VD and fu values determined after intravenous administration to healthy people; in that study, however, a dataset of 642 drugs with VD values ranging from 0.035 to 60 L/kg was used. Two kinds of predictive model were built: linear and non-linear. The former used Partial Least Squares (PLS) with Principal Component Analysis (PCA); all descriptors were transformed with unit variance scaling and mean centring before the PCA and PLS analysis, and descriptors with uneven distributions were logarithmically transformed to obtain better distributions. The second, non-linear approach consisted of a recursive partitioning classification model used to build DTs. To generate the training and test sets, all compounds were clustered by similarity based on root mean square deviation, so that compounds from each cluster appear in both sets. The division of compounds into training and test sets was based on two approaches: one based on VD alone and another based on VD and fu (which depends on the binding affinity and capacity of plasma proteins). The former gives three classes and the latter six (by further dividing each class according to whether fu is above or below 0.7). Table 1 shows the division of the data set based on VD values. In addition to the VD values, the table also shows a row of log(VD) values, as this logarithmic transformation of VD will be used in this project to reduce the skew caused by some of the high VD values in the dataset. The six-class division based on VD and fu is not shown here, as fu will not be used in this project. Performance was measured using sensitivity. As an example, Eq. (1) shows the sensitivity computation for class 1:

Sensitivity = T1 / (T1 + F2 + F3)    (1)

where T1 is the number of true positives for class 1, i.e. the number of examples in class 1 correctly predicted as class 1, F2 is the number of examples in class 1 wrongly (falsely) predicted as class 2, and F3 is the number of examples in class 1 wrongly (falsely) predicted as class 3.
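Eq. (1) generalises naturally to every class when written in terms of a confusion matrix, as in the short sketch below. This is illustrative only; actual and predicted are assumed names for vectors of true and predicted class indices.

% Per-class sensitivity from a confusion matrix.
cm = confusionmat(actual, predicted);    % cm(i,j): examples of class i predicted as class j
sensitivity = diag(cm) ./ sum(cm, 2);    % correctly predicted examples of each class
                                         % divided by all examples of that class
% For class 1 this is T1 / (T1 + F2 + F3), matching Eq. (1).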
The results were good for Class 1, which gave a test set sensitivity of 0.71, and for Class 3 (test set sensitivity of 0.81), but not for Class 2, which gave a test set sensitivity of only 0.32.

Table 1: Division of the data set based on VD values by Amo et al (2013)

              Class 1              Class 2              Class 3            Total
VD            0 – 0.3 L/kg         0.3 – 1 L/kg         ≥ 1 L/kg
log(VD)       -1.4559 to -0.5228   -0.5086 to -0.0043   0.0414 to 2.8451
Training      105                  96                   181                382
Test          62                   71                   127                260
Total         167                  167                  308                642

Lombardo et al (2006) used a two-stage approach to predict the value of VD. In the first stage, mixture discriminant analysis (MDA) was used to classify compounds into either a low class (VD < 10 L/kg) or a high class (VD ≥ 10 L/kg). In the next step, RF regression models trained for each class were used to predict the VD value. The authors showed that, using 31 computed descriptors (for example lipophilicity, ionisation, molecular volume and molecular fragments), the MDA-RF approach gave a GMFE of 1.78±11.4. This was a clear improvement over using RF alone, which gave a GMFE of 2.03±15.0. The proposed method was also better than multiple linear regression approaches using different inputs (31 combined descriptors: GMFE 2.01±11.4; 19 descriptors from simulated annealing: GMFE 2.05±10.1; 16 descriptors based on physicochemical intuition: GMFE 2.48±16.2). Furthermore, to show that MDA-RF performs well irrespective of the dataset division, a ten-fold cross-validation approach was used with 384 drugs, each fold containing 38 or 39 compounds, which gave GMFE values ranging from 1.83 to 2.24. A leave-one-out (LOO) approach was also used to study whether any class would be poorly predicted if it had not been included in the model, using eight structural classes with 72 analogues: Steroids (14 compounds), Beta blockers (16 compounds), Fluoroquinolone antibiotics (10 compounds), NSAIDs (7 compounds), Cephalosporines (17 compounds), Benzodiazepines (15 compounds), Tricyclic antidepressants (7 compounds) and Morphine-like (10 compounds). This simulates real-life prediction, in which a new structural class with no experimental data is tested.
The LOO experiment gave GMFE values of 1.46 to 2.94, with an average of 1.91.

Berellini et al (2009) also analysed the prediction of VD using linear and non-linear models, with 669 compounds. Four models were used to predict VD: RF, PLS with 5 principal components (PLS-PC5), PLS-PC6 and a consensus of the two PLS models. GMFE was used to measure performance. Different numbers of descriptors were used in each approach: the RF model used 280 molecular descriptors (from the MOE and VolSurf+ software), while the PLS models used 95 and 11 descriptors for PC5 and PC6 respectively. Using the average of the logarithms of VD, a consensus model was built from the best PLS and RF models. Results were given for nine structural classes with 172 analogues (number of analogues in brackets): Beta adrenergics (27), Benzodiazepines (18), Cephalosporines (27), Fluoroquinolones (12), Morphinans (12), NSAIDs (16), Nucleosides/nucleotides (31), Steroids (21) and Tricyclic antidepressants (8), using a leave-class-out (LCO) approach. On average, the consensus model gave a GMFE of 1.8, as did the PLS-PC5 model; the other two models both gave a GMFE of 2.0. When tested on an external test set of 29 compounds, the GMFE of the PLS-PC6 model was the best at 1.8, while the other models gave GMFE values from 1.9 to 2.2.

Zhivkova and Doytchinova (2012) looked at VD for acidic drugs only, totalling 132 compounds. A genetic algorithm (GA) was used as the variable selection procedure over the 178 available molecular descriptors, and stepwise linear regression was used to design two models, one to estimate log VD and another to estimate log (VD/MW). The latter is VD per unit weight, which eliminates the influence of molecular weight (MW). Descriptors with non-zero values for fewer than three molecules were eliminated. The models were assessed by explained variance (r2) and standard error of the estimate (SEE). Two cross-validation approaches were used: leave-one-out cross-validation (LOO-CV) and leave-many-out cross-validation (LMO-CV, using an 80%:20% training:testing division). The best-performing descriptors for the log (VD) model selected by the GA were xch9, SdsssP_acnt, SssS_acnt, SdssS_acnt, SHsSH, Hmax, Gmin and knotpv, giving r2 = 0.661 and SEE = 0.194. For the second model, the descriptors were xch7, xch9, SdsssP_acnt, SdssS_acnt, SsSS, Hmax and nelem; this model had higher r2 and SEE values of 0.687 and 0.221. An external test set (over 30 runs) was also used to validate the models' performance using the mean fold error (MFE). This external test set was compiled from Berellini et al (2009) and consisted of 10 acidic drugs. The first model gave an MFE of 2.04 while the second gave an MFE of 2.25.

In a recent paper, Zhivkova et al (2015) used 216 basic drugs belonging to different chemical and therapeutic classes extracted from Obach's database (Obach et al, 2008). The dataset was divided into six sets of 36 drugs each: five for modelling and one for external validation. The five modelling subsets were subjected to five-fold cross-validation.
Logarithms of the VD values were used to obtain a closer approximation to a normal distribution (i.e. to reduce the skewness of the VD values). A total of 179 molecular descriptors of chemical structure were computed with the ACD/LogD and MDL QSAR software. Models were validated using a randomisation test, leave-one-out cross-validation and leave-group-out validation, using the cross-validated coefficient q2, the prediction coefficient r2 for the test set, the mean fold error of prediction (MFEP) and accuracy. The accuracy (as a percentage) was assessed as the proportion of drugs whose predicted VD is within two-fold or three-fold error of the actual VD. A consensus model using the 27 most frequently emerging descriptors for 180 basic drugs (some drugs were removed as outliers) gave r2 = 0.529 to 0.593 (mean 0.555), MFEP of 2.24 to 2.38 (mean 2.31) and an average accuracy of 53% at two-fold error. To propose criteria for the prediction of VD of basic drugs, three groups were suggested: small (VD < 0.7 L/kg), moderate (0.7 L/kg ≤ VD ≤ 2 L/kg) and large (VD > 2 L/kg). Seven criteria were proposed, based on logP, fB, Dipole, ncirc, Gmin, SdssC, aaaC_acnt, Cl_acnt and F_acnt: drugs that meet none of the criteria have a small VD (< 0.7 L/kg), while those meeting three or more criteria have VD > 2 L/kg.

Louis and Agrawal (2012) investigated the VD values of anti-infective agents from the J group of the ATC classification. Three approaches, namely multiple linear regression (MLR), artificial neural networks (ANN) and support vector machines (SVM), were used with theoretical molecular descriptors of 126 drugs (with VD ranging from 0.05 to 33 L/kg). Initially, the compounds were sorted according to their log VD values and every 5th compound was taken as an external test compound. The descriptors were selected using CFS, which chose a subset of 20 descriptors from a pool of 1664 descriptors computed using the E-DRAGON software (including 1D, 2D and 3D descriptors). Two performance measures, root mean square error (RMSE) and MAE, were used, and showed that SVM gave the best performance, with RMSE = 0.273 and MAE = 0.207 on the test dataset. The SVM parameters were optimised using 10-fold cross-validation.

Demir-Kavuk et al (2011) proposed a framework called DemQSAR, combining feature generation, feature selection, model building and overtraining control, to predict VD and also human clearance (CL). Although Obach's database (Obach et al, 2008) was used, only 584 compounds were selected, after removing compounds containing phosphorus, boron, metal atoms, macrocycles and fragment-like compounds. The dataset was divided into 338 compounds for training, with the remaining approximately 40% (i.e. 246) used for testing. In addition, 10-fold cross-validation over the whole dataset was performed, along with external testing on the 29 compounds used by Berellini et al (2009). Several software packages were used, giving 404 molecular descriptors for each compound; combining these with four types of fingerprints gave a total of 3642 features. Lasso regularisation and recursive feature elimination (RFE) were used as feature selectors. The cross-validated performance for VD prediction gave a GMFE of 2.01 with all features; with a minimal set of 27 features a GMFE of 2.24 was obtained, and with the same 27 features the external GMFE was 1.81.
3.1 Summary of literature review

Many research studies have looked into predicting VD with various regression-based algorithms, but only Lombardo et al (2006) have attempted a two-stage classification and regression approach similar to the work proposed here. However, that approach only considered a simple two-class division based on VD values, on a smaller subset of 384 drugs from Obach's database (Obach et al, 2008). Furthermore, the approach by Lombardo et al did not attempt to measure the confidence of the classification before obtaining the regressed VD values.
Chapter 4: Proposed Methodology

4.1 Discretisation of classes

The proposed methodology consists of two stages, where the first stage is classification and the second stage is regression. For the first stage, the VD values are discretised into several classes, and the number of classes to use has to be decided first. The simplest approach is to divide the data set in certain proportions, and there are two ways to do this. One way is to divide the entire data set equally into a certain number of classes, for example three classes – LOW, MEDIUM and HIGH – each containing 33% of the data; for five classes, the proportion would be 20% each for VERY LOW, LOW, MEDIUM, HIGH and VERY HIGH. Another way is to divide the data set unequally, giving some classes more instances, for example in proportions of 20%, 60%, 20% (LOW, MEDIUM, HIGH respectively) for three classes, or VERY LOW (10%), LOW (20%), MEDIUM (40%), HIGH (20%) and VERY HIGH (10%) for five classes. There are other approaches to dividing the data set, such as the k-means algorithm, density-based algorithms and so on. For the data set used in this research, preliminary analysis showed that some of these methods did not work very well. For example, using k-means for three classes resulted in LOW (90%), MEDIUM (9%) and HIGH (1%), which is not useful as the LOW class has a very large number of instances and would skew the classification.

In addition to the proportional divisions, two more divisions are considered from the literature. One is from Lombardo et al (2006), where the entire data set was divided into two classes with respect to VD values: instances with VD < 10 L/kg were assigned to the LOW class (92% of the data) and instances with VD >= 10 L/kg to the HIGH class (8%). The other is from Amo et al (2013), in which the data was divided into three classes: VD values from 0 to 0.3 L/kg as LOW (27%), from 0.3 to 1 L/kg as MEDIUM (27%) and VD values higher than 1 L/kg as HIGH (46%).

Table 1 shows the class divisions based on VD values for two and three classes, while Table 2 shows the class divisions for five classes. As can be seen from the tables, the division of the data is not exactly in the specified ratios. This is because there can be several instances with the same log VD value at a division boundary, and these need to be placed in the same class, causing ratios that differ slightly from the specified ones. For example, the 33% division has 201 instances in the LOW class, 198 in the MEDIUM class and 203 in the HIGH class. Only 602 drug compounds from Obach's database are used here, as the two extremely high VD values of hydroxychloroquine (700 L/kg) and chloroquine (140 L/kg) would badly skew the training and give poor performance.
Table 1: Class divisions based on VD values for two and three classes
(for each class: VD range in L/kg, log VD range, number of drugs)

Equal frequency (33% / 33% / 33%):
  LOW: 0.035 – 0.41, -1.4559 to -0.3872, 201 drugs
  MED: 0.42 – 1.6, -0.3767 to 0.2041, 198 drugs
  HIGH: 1.7 – 700, 0.2304 to 2.8451, 203 drugs

Equal frequency (20% / 60% / 20%):
  LOW: 0.035 – 0.23, -1.4559 to -0.6382, 123 drugs
  MED: 0.24 – 3.3, -0.6197 to 0.5185, 362 drugs
  HIGH: 3.5 – 700, 0.5440 to 2.8451, 117 drugs

27% / 27% / 46% (Amo et al, 2013):
  LOW: 0 – 0.3, -1.4559 to -0.5228, 159 drugs
  MED: 0.3 – 1.0, -0.5086 to -0.0043, 164 drugs
  HIGH: > 1, 0.0413 to 2.8450, 279 drugs

92% / 8% (Lombardo et al, 2006):
  LOW: < 10, -1.4559 to 0.9867, 556 drugs
  HIGH: >= 10, 1.0000 to 2.8450, 46 drugs

90% / 9% / 1% (k-means):
  LOW: 0.035 – 7.1, -1.4559 to 0.8512, 543 drugs
  MED: 7.4 – 28, 0.8692 to 1.4472, 51 drugs
  HIGH: 33 – 60, 1.5185 to 1.7781, 8 drugs

Table 2: Class divisions based on VD values for five classes
(for each class: VD range in L/kg, log VD range, number of drugs)

Equal frequency (20% / 20% / 20% / 20% / 20%):
  VERY LOW: 0.035 – 0.23, -1.4559 to -0.6382, 123 drugs
  LOW: 0.24 – 0.57, -0.6198 to -0.2441, 117 drugs
  MED: 0.58 – 1.3, -0.2365 to 0.1139, 126 drugs
  HIGH: 1.4 – 3.3, 0.1461 to 0.5185, 119 drugs
  VERY HIGH: 3.5 – 700, 0.5441 to 2.8451, 117 drugs

Equal frequency (10% / 20% / 40% / 20% / 10%):
  VERY LOW: 0.035 – 0.16, -1.4559 to -0.7959, 61 drugs
  LOW: 0.17 – 0.34, -0.7695 to -0.4685, 118 drugs
  MED: 0.35 – 1.8, -0.4559 to 0.2552, 240 drugs
  HIGH: 1.9 – 6.6, 0.2787 to 0.8195, 122 drugs
  VERY HIGH: 6.8 – 700, 0.8325 to 2.8451, 61 drugs
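The proportion-based divisions above can be generated by sorting the log VD values, cutting at the target proportions and then extending each boundary so that tied log VD values stay in one class, as described in Section 4.1. The sketch below is illustrative only: the variable names and the 20%/60%/20% proportions are assumptions, and degenerate cases (e.g. a class emptied by the tie handling) are not handled.

% Sketch: proportion-based discretisation (e.g. 20%/60%/20%) that keeps instances
% with identical log VD values in the same class (illustrative names).
p = [0.20 0.80];                         % cumulative proportions after LOW and after MEDIUM
[sortedVd, order] = sort(logvd);
n = numel(logvd);
cut = round(p * n);                      % provisional boundary positions in the sorted data
for b = 1:numel(cut)
    % extend the boundary while the next value ties with the boundary value
    while cut(b) < n && sortedVd(cut(b) + 1) == sortedVd(cut(b))
        cut(b) = cut(b) + 1;
    end
end
class = zeros(n, 1);
class(order(1:cut(1)))        = 1;       % LOW
class(order(cut(1)+1:cut(2))) = 2;       % MEDIUM
class(order(cut(2)+1:end))    = 3;       % HIGH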
4.2 Two-stage Classification Regression approach

4.2.1 Dividing data into training, validation and testing folds

Once the data is discretised into classes, it is further divided into 10 folds with a nearly equal number of instances of each class. Next, with one fold kept as the test fold, the remaining 9 folds are used with 8 folds for training and one fold for validation. The training data is used to build the RF classification and regression trees. The validation data is used to obtain the best performing threshold to be used with the confidence measure explained later. The data in the test fold is not used at this point.

4.2.2 Two stage approach - threshold detection

At the start of the two-stage approach, a classification RF is built using the training data, with 100 trees and 17 randomly selected variables (17 is chosen by the rule of thumb sqrt(number of attributes)); Matlab's TreeBagger function, which can be run for classification (the default) or for regression, is used to build the RF:

Bstage1=TreeBagger(trees,dtrain(:,1:293),ctrain,'nvartosample',novar);

where trees is the number of trees in the RF (trees=100), novar is the number of randomly selected variables (novar=17) and ctrain contains the class labels as per the divisions decided earlier. Numeric values are assigned to each class, such as 1 for LOW, 2 for MEDIUM and 3 for HIGH. Columns 1 to 293 in dtrain contain the 293 attributes of the training data in the fold. Bstage1 contains the built RF of 100 trees. Next, regression trees are built for each class using the training data from that class only (again with 100 RF trees and 17 variables). So, in the case of the 33% 33% 33% division, a regression RF is built three times, one for each class, as below (example shown for class 1):

regclass1=TreeBagger(trees,train_class1(:,1:293),train_class1(:,294),'Method',...
    'regression','nvartosample',novar);

where train_class1 contains the training data from class 1 only (with 293 attributes), train_class1(:,294) contains the corresponding log VD values and regclass1 contains the built RF for class 1. An RF regression model is built in the same way for each of the remaining classes. Using the validation data, the classes of the validation instances are predicted using:
[cstage1_val, cScore]=predict(Bstage1,dval(:,1:293));

where Bstage1 contains the built classification RF, dval is the validation data (with 293 attributes), cstage1_val contains the predicted classes and cScore contains the confidence of prediction for each instance of the validation data. For example, for one instance in the validation data, 20 of the 100 trees could predict class 1, 50 trees class 2 and 30 trees class 3, giving a prediction confidence measure of [0.2 0.5 0.3] for that instance. The classification accuracy is found by comparing the predicted classes in cstage1_val with the actual labels of the validation data. The overall accuracy is the number of correctly predicted instances divided by the total number of instances; the same is also done per class, where the number of correctly predicted instances of a class is divided by the total number of instances of that class. The usefulness of doing this will become evident later in the discussion. To decide whether the class of a particular instance is predicted with high confidence, the following rule is used:

max(classification_confidence) > threshold * mean(classification_confidence)   (2)

where threshold varies from 1.1 to 4.0 in steps of 0.1, mean(classification_confidence) is the mean confidence over the predicted class outcomes for that instance and max(classification_confidence) is the maximum confidence value among these predicted class outcomes. Initially, the threshold is set at 1.1. Whether equation (2) is satisfied is checked for each validation instance as below:

predict_ok=zeros(1,size(cScore,1)); %0 means standard regression will be used
for i=1:size(cScore,1)
    if(max(cScore(i,:))>thres_factor*mean(cScore(i,:)))
        predict_ok(i)=1; %instance i has a high confidence prediction
    end
end

If equation (2) is satisfied, the two stage classification regression approach is used for that instance (predict_ok is set to 1); otherwise the standard regression approach is used (predict_ok remains 0). So, for each validation instance, if the confidence threshold is exceeded, the regression tree built for the predicted class is loaded. For example, for an instance in the
validation data with high confidence, if the predicted class is MEDIUM, then the regression tree built for the MEDIUM class is loaded and the predicted log VD is obtained from it. If the confidence is not high (i.e. equation (2) is not satisfied), then the standard regression approach is followed, building an RF on the training data as follows. Standard regression of the log VD values is performed with an ensemble of 100 random forest trees built with 17 randomly selected variables:

Bstdreg=TreeBagger(trees,dtrain(:,1:293),dtrain(:,294),...
    'Method','regression','nvartosample',novar);

where trees is the number of trees in the RF (trees=100), novar is the number of variables (novar=17), columns 1 to 293 of dtrain contain the 293 attributes of the training data in the fold and the 294th column of dtrain contains the actual log VD values. Bstdreg contains the built RF of 100 trees. Hence, each validation instance is subjected to either the two-stage approach or standard regression by the following code (example shown for the three-class division):

%choose regression model based on predicted class
for i=1:length(cval)
    if(str2num(cstage1_val{i})==1 && predict_ok(i)==1) %check predicted class and high confidence
        Vdregnew(i)=predict(regclass1,dval(i,1:293));
    elseif(str2num(cstage1_val{i})==2 && predict_ok(i)==1)
        Vdregnew(i)=predict(regclass2,dval(i,1:293));
    elseif(str2num(cstage1_val{i})==3 && predict_ok(i)==1)
        Vdregnew(i)=predict(regclass3,dval(i,1:293));
    else
        Vdregnew(i)=predict(Bstdreg,dval(i,1:293)); %use std reg tree if no high confidence prediction
    end
end

The above code loops over all the validation instances, and Vdregnew contains the predicted log VD values.

4.2.3 Error computation

To obtain the error values of the prediction, RMSE, MAE and GMFE are calculated as follows:
RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log V_{Di} - \log V'_{Di}\right)^{2}}   (3)

MAE = \frac{1}{N}\sum_{i=1}^{N}\left|\log V_{Di} - \log V'_{Di}\right|   (4)

GMFE = 10^{MAE}   (5)

where log VDi is the actual value given in Obach's database for validation drug instance i, log V'Di is the estimated value obtained by either the two-stage approach or standard regression (i.e. Vdregnew above) and N is the number of drug instances in the validation data. The processes in sections 4.2.1 to 4.2.3 are repeated for all 90 different folds (using different training and validation data) and the average of the errors over the 90 folds is computed. This gives the error for threshold=1.1. The threshold is then changed to 1.2 (and so on up to 4.0 in steps of 0.1) and all the steps in sections 4.2.1 to 4.2.3 are repeated over the 90 folds, with the average error computed for each threshold. It should be noted that beyond a certain point a higher threshold has no effect. For example, with three classes the mean confidence per instance is always 1/3, so for thresholds above about 3.0 equation (2) can never be satisfied because the maximum confidence can only be 1.0. The best threshold to use is decided not only by the lowest GMFE value, but also by requiring at least one high-confidence prediction in each class. This is to avoid possible poor performance when using the testing data later.

4.3 Fold creation to compare the proposed approach with standard regression

Once the best threshold has been decided, the full data is divided into 10 folds, where 9 folds are used for training and one fold for testing. The folds have a nearly equal number of instances of each class. This data division into folds is used to compare the performance of the proposed approach with standard regression.

4.3.1 Standard regression

Standard regression of the log VD values is performed with an ensemble of 100 RF trees built with 17 randomly selected variables using the training data. This is for comparison with the proposed two stage approach. The following code is used:
Bstdreg=TreeBagger(trees,dtrain(:,1:293),dtrain(:,294),...
    'Method','regression','nvartosample',novar);

where dtrain now contains the new training data (9 folds) with 293 attributes and the 294th column of dtrain contains the actual log VD values. Bstdreg contains the built RF of 100 trees. The testing data (from the remaining fold) is used to obtain the regressed log VD values and the error values are computed.

4.3.2 Proposed two-stage approach - testing

As in section 4.2.2, the classification RF for the two-stage approach is built, except that the new 9 folds of training data are used; similarly, a regression RF is built for each class. For each instance in the test data, the class prediction and confidence score are obtained. If equation (2) is satisfied using the threshold decided earlier, then log VD is predicted using the regression RF built for the predicted class. Otherwise, log VD is predicted using the standard regression RF. With these predicted log VD values for each test instance, the three errors (RMSE, MAE, GMFE) are computed. The whole process is then repeated with a different choice of 9 training folds and remaining test fold, and the average errors are computed for comparison. All the above steps from sections 4.1 to 4.3 are carried out for the different discretisations of the data, and the corresponding results are stored in a text file.
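As an illustration of the error computation in section 4.2.3, a minimal Matlab sketch of equations (3) to (5) could be written as follows. This is not the author's code; logVd_actual and logVd_pred are assumed vectors of the actual and predicted log VD values (e.g. Vdregnew) for the instances in a fold:

%minimal sketch (assumed variable names) of equations (3) to (5)
err  = logVd_actual(:) - logVd_pred(:);  %per-instance error in log VD
RMSE = sqrt(mean(err.^2));               %root mean squared error, equation (3)
MAE  = mean(abs(err));                   %mean absolute error, equation (4)
GMFE = 10^MAE;                           %geometric mean fold error, equation (5)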
Chapter 5: Computed Results and Analysis

In this chapter, the obtained results are presented along with an analysis of the results.

5.1 Two stage approach

Table 1 shows the classification results for the training and the validation data, i.e. the first stage of the two stage approach (without any confidence measure). For five classes, VERY LOW, LOW, MEDIUM, HIGH and VERY HIGH are represented by classes 1, 2, 3, 4 and 5, respectively; similarly, for three classes, LOW, MEDIUM and HIGH are represented by classes 1, 2 and 3. As can be seen, for all the data divisions the overall classification accuracy and the per-class accuracies on the training data are very good, except for the 92% 8% division, where class 1 has by far the most instances, so the training is skewed, resulting in high accuracy for class 1 and poor accuracy for the other class. For some classes with more training instances the training accuracy is 100%, for example class 3 in the 10% 20% 40% 20% 10% data division. For the validation data, the accuracies are not as good as for the training data, and the higher classification accuracies are obtained for the classes with more training instances. For example, for the 20% 60% 20% division, the validation accuracies for the three classes are 0.3674, 0.8856 and 0.2651. The poor classification performance for certain classes causes poor prediction accuracy overall in the test data as well; it leads to an erroneous regression model being loaded in stage two and hence a poorer VD prediction. To solve this problem, a suitable threshold on a confidence measure was explored, as explained in the previous chapter.

Table 1: Mean classification accuracy using training and validation data
(overall accuracy followed by the accuracy of each class)

33% 33% 33%:
  Training: overall 0.9952; class 1: 0.9984, class 2: 0.9940, class 3: 0.9931
  Validation: overall 0.6676; class 1: 0.7994, class 2: 0.5312, class 3: 0.6721

20% 60% 20%:
  Training: overall 0.9773; class 1: 0.9457, class 2: 1.0000, class 3: 0.9403
  Validation: overall 0.6584; class 1: 0.3674, class 2: 0.8856, class 3: 0.2651

20% 20% 20% 20% 20%:
  Training: overall 0.9974; class 1: 1.0000, class 2: 0.9987, class 3: 0.9993, class 4: 0.9916, class 5: 0.9970
  Validation: overall 0.4642; class 1: 0.6535, class 2: 0.2694, class 3: 0.3932, class 4: 0.3963, class 5: 0.6006

10% 20% 40% 20% 10%:
  Training: overall 0.9821; class 1: 0.9549, class 2: 0.9870, class 3: 1.0000, class 4: 0.9750, class 5: 0.9433
  Validation: overall 0.5215; class 1: 0.0757, class 2: 0.4833, class 3: 0.8269, class 4: 0.3345, class 5: 0.2233

92% 8%:
  Training: overall 0.9429; class 1: 1.0000, class 2: 0.2534
  Validation: overall 0.9251; class 1: 1.0000, class 2: 0.0161

27% 27% 46%:
  Training: overall 0.9778; class 1: 0.9839, class 2: 0.9345, class 3: 0.9998
  Validation: overall 0.6777; class 1: 0.6978, class 2: 0.3301, class 3: 0.8726
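The overall and per-class accuracies reported in Table 1 can be computed along the following lines (a minimal sketch, not the author's code; cstage1_val and cval are the predicted and actual validation labels as in Chapter 4, and the remaining variable names are assumptions):

%minimal sketch (assumed variable names): overall and per-class accuracy
pred   = str2double(cstage1_val(:));  %predicted labels (TreeBagger returns a cell array of strings)
actual = cval(:);                     %actual class labels of the validation data
overall_acc = mean(pred == actual);   %fraction of all instances predicted correctly
nClasses = max(actual);
class_acc = zeros(1,nClasses);
for c = 1:nClasses
    idx = (actual == c);                            %instances belonging to class c
    class_acc(c) = mean(pred(idx) == actual(idx));  %fraction of class c predicted correctly
end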
5.1.1 Deciding the threshold for two stage approach

Table 2 shows the classification and two stage regression results on the validation data for the equally divided three classes (33% 33% 33%) at varying threshold levels. Results are shown only up to threshold 2.7, as the results did not change beyond this threshold: the threshold becomes too high, no instance satisfies equation (2) and all instances are processed by standard regression. The lowest error, a GMFE of 2.1423, is given by a threshold value of 1.5, and this is the threshold used with the testing data later.

Table 2: Performance of validation data with varying threshold for 33% 33% 33% division
(class columns: validation classification accuracy; RMSE, MAE, GMFE: two stage regression error with confidence measure)
Threshold | Overall acc. | Class 1 | Class 2 | Class 3 | RMSE | MAE | GMFE
1.1 | 0.6814 | 0.8048 | 0.5385 | 0.6770 | 0.5024 | 0.3537 | 2.2662
1.2 | 0.6976 | 0.8148 | 0.5450 | 0.7048 | 0.4908 | 0.3466 | 2.2284
1.3 | 0.7144 | 0.8383 | 0.5498 | 0.7254 | 0.4860 | 0.3429 | 2.2090
1.4 | 0.7429 | 0.8727 | 0.5404 | 0.7646 | 0.4735 | 0.3360 | 2.1747
1.5 | 0.7777 | 0.9095 | 0.5364 | 0.8093 | 0.4633 | 0.3295 | 2.1423
1.6 | 0.7973 | 0.9320 | 0.5046 | 0.8460 | 0.4614 | 0.3302 | 2.1459
1.7 | 0.8198 | 0.9494 | 0.4867 | 0.8743 | 0.4599 | 0.3320 | 2.1541
1.8 | 0.8425 | 0.9729 | 0.4672 | 0.9058 | 0.4595 | 0.3347 | 2.1671
1.9 | 0.8677 | 0.9770 | 0.4557 | 0.9427 | 0.4608 | 0.3384 | 2.1860
2.0 | 0.8902 | 0.9849 | 0 | 0.9504 | 0.4579 | 0.3378 | 2.1824
2.1 | 0.9156 | 0.9970 | 0 | 0.9666 | 0.4569 | 0.3387 | 2.1874
2.2 | 0.9330 | 1 | 0 | 0.9607 | 0.4595 | 0.3414 | 2.2019
2.3 | 0.9504 | 1 | 0 | 0 | 0.4590 | 0.3429 | 2.2097
2.4 | 0.9552 | 1 | 0 | 0 | 0.4592 | 0.3433 | 2.2118
2.5 | 0.9718 | 1 | 0 | 0 | 0.4614 | 0.3453 | 2.2224
2.6 | 0.9806 | 1 | 0 | 0 | 0.4634 | 0.3477 | 2.2354
2.7 | 0 | 0 | 0 | 0 | 0.4627 | 0.3468 | 2.2303

Table 3 shows the classification and two stage regression results on the validation data for the unequally divided three classes with more instances in the second class (i.e. 20% 60% 20%) at varying threshold levels. Results are again shown only up to threshold 2.7, as the results did not change beyond this threshold for the same reason as above. A threshold of 1.9 gives the lowest error with non-zero classification accuracy for all the individual classes, with a GMFE of 2.2834.
Table 3: Performance of validation data with varying threshold for 20% 60% 20% division
(class columns: validation classification accuracy; RMSE, MAE, GMFE: two stage regression error with confidence measure)
Threshold | Overall acc. | Class 1 | Class 2 | Class 3 | RMSE | MAE | GMFE
1.1 | 0.6593 | 0.3674 | 0.8859 | 0.2651 | 0.5062 | 0.3793 | 2.4027
1.2 | 0.6576 | 0.3643 | 0.8823 | 0.2612 | 0.4921 | 0.3729 | 2.3714
1.3 | 0.6620 | 0.3672 | 0.8863 | 0.2635 | 0.5088 | 0.3830 | 2.4249
1.4 | 0.6764 | 0.3737 | 0.8929 | 0.2806 | 0.5009 | 0.3767 | 2.3878
1.5 | 0.6873 | 0.3894 | 0.9048 | 0.2472 | 0.5001 | 0.3754 | 2.3809
1.6 | 0.7128 | 0.4114 | 0.9202 | 0.2318 | 0.4910 | 0.3692 | 2.3482
1.7 | 0.7364 | 0.4288 | 0.9406 | 0.2275 | 0.4839 | 0.3627 | 2.3118
1.8 | 0.7622 | 0.4546 | 0.9501 | 0.2136 | 0.4793 | 0.3593 | 2.2944
1.9 | 0.7879 | 0.4749 | 0.9583 | 0.2003 | 0.4765 | 0.3573 | 2.2834
2.0 | 0.8057 | 0.4671 | 0.9623 | 0 | 0.4724 | 0.3550 | 2.2704
2.1 | 0.8329 | 0 | 0.9665 | 0 | 0.4715 | 0.3537 | 2.2632
2.2 | 0.8477 | 0 | 0.9799 | 0 | 0.4695 | 0.3526 | 2.2577
2.3 | 0.8804 | 0 | 0.9848 | 0 | 0.4681 | 0.3509 | 2.2485
2.4 | 0.8879 | 0 | 0.9851 | 0 | 0.4676 | 0.3515 | 2.2516
2.5 | 0.8892 | 0 | 0.9825 | 0 | 0.4683 | 0.3511 | 2.2489
2.6 | 0.9151 | 0 | 0.9794 | 0 | 0.4679 | 0.3505 | 2.2458
2.7 | 0 | 0 | 0 | 0 | 0.4669 | 0.3499 | 2.2428

Table 4: Performance of validation data with varying threshold for 20% 20% 20% 20% 20% division
(class columns: validation classification accuracy; RMSE, MAE, GMFE: two stage regression error with confidence measure)
Threshold | Overall acc. | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | RMSE | MAE | GMFE
1.1 | 0.4652 | 0.6530 | 0.2688 | 0.3936 | 0.3968 | 0.6006 | 0.5384 | 0.3825 | 2.4281
1.2 | 0.4636 | 0.6461 | 0.2611 | 0.3869 | 0.4194 | 0.5891 | 0.5418 | 0.3855 | 2.4490
1.3 | 0.4709 | 0.6592 | 0.2708 | 0.3869 | 0.4080 | 0.6059 | 0.5303 | 0.3778 | 2.4036
1.4 | 0.4751 | 0.6627 | 0.2471 | 0.3982 | 0.4148 | 0.6145 | 0.5288 | 0.3779 | 2.4067
1.5 | 0.4835 | 0.6835 | 0.2591 | 0.4090 | 0.4126 | 0.6087 | 0.5190 | 0.3742 | 2.3861
1.6 | 0.5026 | 0.7040 | 0.2550 | 0.4131 | 0.4361 | 0.6398 | 0.5032 | 0.3645 | 2.3317
1.7 | 0.5279 | 0.7184 | 0.2486 | 0.4542 | 0.4612 | 0.6704 | 0.4905 | 0.3569 | 2.2899
1.8 | 0.5366 | 0.7296 | 0.2507 | 0.4410 | 0.4437 | 0.7087 | 0.4851 | 0.3536 | 2.2720
1.9 | 0.5580 | 0.7622 | 0.2630 | 0.4403 | 0.4716 | 0.7256 | 0.4787 | 0.3493 | 2.2487
2.0 | 0.5623 | 0.7697 | 0.2230 | 0.4674 | 0.4662 | 0.7471 | 0.4744 | 0.3487 | 2.2447
2.1 | 0.5754 | 0.7944 | 0.2113 | 0.4813 | 0.4704 | 0.7786 | 0.4706 | 0.3469 | 2.2356
2.2 | 0.5767 | 0.8044 | 0.1678 | 0 | 0.4372 | 0.8023 | 0.4706 | 0.3475 | 2.2388
2.3 | 0.5832 | 0.8163 | 0 | 0 | 0.4522 | 0.8374 | 0.4690 | 0.3475 | 2.2385
2.4 | 0.6113 | 0.8402 | 0 | 0 | 0 | 0 | 0.4666 | 0.3464 | 2.2313
2.5 | 0.6223 | 0.8700 | 0 | 0 | 0 | 0 | 0.4655 | 0.3464 | 2.2315
2.6 | 0.6388 | 0.8851 | 0 | 0 | 0 | 0 | 0.4642 | 0.3454 | 2.2261
2.7 | 0.6751 | 0.8978 | 0 | 0 | 0 | 0 | 0.4625 | 0.3444 | 2.2205
2.8 | 0.6895 | 0 | 0 | 0 | 0 | 0 | 0.4638 | 0.3456 | 2.2269
2.9 | 0.6966 | 0 | 0 | 0 | 0 | 0 | 0.4636 | 0.3457 | 2.2271
3.0 | 0.7186 | 0 | 0 | 0 | 0 | 0 | 0.4646 | 0.3466 | 2.2313
3.1 | 0.7470 | 0 | 0 | 0 | 0 | 0 | 0.4632 | 0.3457 | 2.2253
3.2 | 0.7587 | 0 | 0 | 0 | 0 | 0 | 0.4638 | 0.3461 | 2.2286
3.3 | 0.7472 | 0 | 0 | 0 | 0 | 0 | 0.4635 | 0.3469 | 2.2315
3.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4641 | 0.3473 | 2.2331
Table 4 shows the classification and two stage regression results on the validation data for the equally divided five classes at varying threshold levels. Results are shown only up to threshold 3.4, as the results did not change beyond this threshold. Although a GMFE of 2.2205 is the lowest error, given by threshold 2.7, at that threshold there was not at least one instance in each class with a sufficiently high-confidence prediction. So a threshold value of 2.1, which gave a GMFE of 2.2356, is used with the testing data later.

Table 5: Performance of validation data with varying threshold for 10% 20% 40% 20% 10% division
(class columns: validation classification accuracy; RMSE, MAE, GMFE: two stage regression error with confidence measure)
Threshold | Overall acc. | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | RMSE | MAE | GMFE
1.1 | 0.5153 | 0.0757 | 0.4833 | 0.8269 | 0.3345 | 0.1116 | 0.4942 | 0.3662 | 2.3322
1.2 | 0.5201 | 0.0791 | 0.4997 | 0.8323 | 0.3517 | 0.0963 | 0.4921 | 0.3639 | 2.3189
1.3 | 0.5178 | 0.0812 | 0.4981 | 0.8239 | 0.3554 | 0.1054 | 0.4940 | 0.3642 | 2.3208
1.4 | 0.5139 | 0.0697 | 0.4905 | 0.8241 | 0.3487 | 0.0946 | 0.4948 | 0.3671 | 2.3373
1.5 | 0.5216 | 0.0716 | 0.5081 | 0.8333 | 0.3541 | 0.1000 | 0.4877 | 0.3611 | 2.3047
1.6 | 0.5230 | 0.0691 | 0.5117 | 0.8291 | 0.3479 | 0.0947 | 0.4902 | 0.3642 | 2.3196
1.7 | 0.5307 | 0.0659 | 0.5145 | 0.8318 | 0.3513 | 0.0883 | 0.4863 | 0.3636 | 2.3172
1.8 | 0.5497 | 0.0631 | 0.5375 | 0.8434 | 0.3588 | 0.0926 | 0.4844 | 0.3611 | 2.3037
1.9 | 0.5578 | 0.0709 | 0.5442 | 0.8551 | 0.3748 | 0.0867 | 0.4826 | 0.3613 | 2.3041
2.0 | 0.5663 | 0.0519 | 0.5707 | 0.8512 | 0.3655 | 0.0679 | 0.4821 | 0.3608 | 2.3007
2.1 | 0.5875 | 0.0504 | 0.5859 | 0.8679 | 0.3825 | 0.0560 | 0.4772 | 0.3575 | 2.2828
2.2 | 0.6013 | 0.0393 | 0.5978 | 0.8800 | 0.3723 | 0.0575 | 0.4752 | 0.3548 | 2.2682
2.3 | 0.6189 | 0 | 0.6124 | 0.8910 | 0.3781 | 0.0466 | 0.4729 | 0.3538 | 2.2630
2.4 | 0.6404 | 0 | 0.6345 | 0.9112 | 0.3912 | 0 | 0.4718 | 0.3520 | 2.2526
2.5 | 0.6538 | 0 | 0.6572 | 0.9209 | 0 | 0 | 0.4699 | 0.3519 | 2.2526
2.6 | 0.6647 | 0 | 0.6835 | 0.9274 | 0 | 0 | 0.4676 | 0.3497 | 2.2407
2.7 | 0.6779 | 0 | 0.6846 | 0.9357 | 0 | 0 | 0.4689 | 0.3509 | 2.2468
2.8 | 0.6978 | 0 | 0.7128 | 0.9513 | 0 | 0 | 0.4676 | 0.3501 | 2.2424
2.9 | 0.7110 | 0 | 0 | 0.9589 | 0 | 0 | 0.4665 | 0.3478 | 2.2309
3.0 | 0.7234 | 0 | 0 | 0.9608 | 0 | 0 | 0.4662 | 0.3483 | 2.2331
3.1 | 0.7589 | 0 | 0 | 0.9762 | 0 | 0 | 0.4667 | 0.3485 | 2.2339
3.2 | 0.7545 | 0 | 0 | 0.9713 | 0 | 0 | 0.4672 | 0.3489 | 2.2363
3.3 | 0.7986 | 0 | 0 | 0 | 0 | 0 | 0.4681 | 0.3497 | 2.2408
3.4 | 0.8328 | 0 | 0 | 0.9881 | 0 | 0 | 0.4662 | 0.3486 | 2.2349
3.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4670 | 0.3488 | 2.2359

Similar to the previous tables, Table 5 shows the results on the validation data for the unequally divided five classes with more instances in the third class (i.e. 10% 20% 40% 20% 10%) at varying threshold levels. Results are shown only up to threshold 3.5, as the results did not change beyond this threshold. Although a GMFE of 2.2309 is the lowest error, given by threshold 2.9, at that threshold there was not at least one instance in each class with a sufficiently high-confidence prediction. So a threshold value of 2.2, which gives a GMFE of 2.2682, is used with the testing data later.
Table 6 shows the results on the validation data for the unequally divided three classes (27% 27% 46%) at varying threshold levels. Results are shown only up to threshold 2.7, as the results did not change beyond this threshold; threshold 1.9 gives the lowest GMFE of 2.2077 and this is used with the testing data.

Table 6: Performance of validation data with varying threshold for 27% 27% 46% division
(class columns: validation classification accuracy; RMSE, MAE, GMFE: two stage regression error with confidence measure)
Threshold | Overall acc. | Class 1 | Class 2 | Class 3 | RMSE | MAE | GMFE
1.1 | 0.6909 | 0.7024 | 0.3326 | 0.8785 | 0.5155 | 0.3786 | 2.4117
1.2 | 0.7066 | 0.7222 | 0.3401 | 0.8901 | 0.5039 | 0.3721 | 2.3717
1.3 | 0.7219 | 0.7456 | 0.3187 | 0.9048 | 0.4899 | 0.3654 | 2.3326
1.4 | 0.7482 | 0.7751 | 0.3228 | 0.9251 | 0.4823 | 0.3593 | 2.2991
1.5 | 0.7706 | 0.8023 | 0.2868 | 0.9503 | 0.4741 | 0.3538 | 2.2699
1.6 | 0.7858 | 0.8258 | 0.2692 | 0.9595 | 0.4681 | 0.3485 | 2.2418
1.7 | 0.8037 | 0.8509 | 0.2431 | 0.9744 | 0.4660 | 0.3465 | 2.2308
1.8 | 0.8248 | 0.8727 | 0.2332 | 0.9786 | 0.4628 | 0.3437 | 2.2189
1.9 | 0.8432 | 0.9074 | 0.2155 | 0.9831 | 0.4613 | 0.3417 | 2.2077
2.0 | 0.8585 | 0.9020 | 0 | 0.9880 | 0.4613 | 0.3429 | 2.2151
2.1 | 0.8729 | 0.9196 | 0 | 0.9909 | 0.4615 | 0.3439 | 2.2207
2.2 | 0.8902 | 0.9563 | 0 | 0.9926 | 0.4594 | 0.3429 | 2.2159
2.3 | 0.8965 | 0.9793 | 0 | 0.9973 | 0.4592 | 0.3441 | 2.2226
2.4 | 0.9066 | 0.9904 | 0 | 0.9984 | 0.4578 | 0.3429 | 2.2168
2.5 | 0.9090 | 0.9963 | 0 | 0 | 0.4601 | 0.3454 | 2.2293
2.6 | 0.9347 | 0 | 0 | 0 | 0.4612 | 0.3465 | 2.2346
2.7 | 0 | 0 | 0 | 0 | 0.4635 | 0.3492 | 2.2488

Table 7: Performance of validation data with varying threshold for 92% 8% division
(class columns: validation classification accuracy; RMSE, MAE, GMFE: two stage regression error with confidence measure)
Threshold | Overall acc. | Class 1 | Class 2 | RMSE | MAE | GMFE
1.1 | 0.9044 | 1 | 0 | 0.4911 | 0.3618 | 2.3053
1.2 | 0.9056 | 1 | 0 | 0.4889 | 0.3610 | 2.3008
1.3 | 0.9071 | 1 | 0 | 0.4884 | 0.3612 | 2.3016
1.4 | 0.9100 | 1 | 0 | 0.4858 | 0.3597 | 2.2935
1.5 | 0.9118 | 1 | 0 | 0.4849 | 0.3591 | 2.2908
1.6 | 0.9213 | 1 | 0 | 0.4820 | 0.3572 | 2.2808
1.7 | 0.9353 | 1 | 0 | 0.4770 | 0.3550 | 2.2695
1.8 | 0.9451 | 1 | 0 | 0.4733 | 0.3542 | 2.2654
1.9 | 0.9681 | 1 | 0 | 0.4698 | 0.3526 | 2.2570
2.0 | 0 | 0 | 0 | 0.4677 | 0.3520 | 2.2540

Table 7 shows the results for the unequally divided two classes with the maximum number of instances in class 1 (the 92% 8% division) at varying threshold levels. Results are shown only up to threshold 2.0, as the results did not change beyond this threshold. Since
the class 2 accuracy is 0 for all thresholds, this was a special case and a threshold value of 1.9 was used, with a GMFE of 2.2570. Finally, Table 8 shows, for the various discretisations and using the ten folds of data (9 folds for training and 1 fold for testing), the standard regression errors, the two stage regression errors where the instances are first classified and then regressed accordingly (without any confidence measure), and the errors of the two stage regression with the confidence measure (i.e. the proposed method). For standard regression, the division of data is irrelevant, so its performance depends only on the training and testing folds; its GMFE ranged from 2.2319 to 2.2493, i.e. it did not change much, as expected (the slight changes are due to the randomness of the RF).

Table 8: Standard and two-stage classification regression approach (with and without confidence measure) using testing data
(for each method: RMSE | MAE | GMFE)

33% 33% 33% (threshold 1.5):
  Standard regression: 0.4662 | 0.3485 | 2.2372
  Two stage without confidence measure: 0.5195 | 0.3659 | 2.3416
  Two stage with confidence measure: 0.4603 | 0.3306 | 2.1502

20% 60% 20% (threshold 1.9):
  Standard regression: 0.4658 | 0.3498 | 2.2425
  Two stage without confidence measure: 0.4999 | 0.3805 | 2.4139
  Two stage with confidence measure: 0.4661 | 0.3502 | 2.2440

20% 20% 20% 20% 20% (threshold 2.1):
  Standard regression: 0.4635 | 0.3479 | 2.2356
  Two stage without confidence measure: 0.5266 | 0.3746 | 2.3798
  Two stage with confidence measure: 0.4597 | 0.3412 | 2.2013

10% 20% 40% 20% 10% (threshold 2.2):
  Standard regression: 0.4626 | 0.3476 | 2.2319
  Two stage without confidence measure: 0.4883 | 0.3598 | 2.2987
  Two stage with confidence measure: 0.4682 | 0.3481 | 2.2365

92% 8% (threshold 1.9):
  Standard regression: 0.4673 | 0.3492 | 2.2493
  Two stage without confidence measure: 0.4848 | 0.3559 | 2.2914
  Two stage with confidence measure: 0.4646 | 0.3468 | 2.2373

27% 27% 46% (threshold 1.9):
  Standard regression: 0.4645 | 0.3484 | 2.2375
  Two stage without confidence measure: 0.5260 | 0.3828 | 2.4205
  Two stage with confidence measure: 0.4672 | 0.3465 | 2.2279

The two stage approach without the confidence measure gave GMFE errors from 2.2914 (for the 92% 8% division) to 2.4205 (27% 27% 46%), while using the confidence measure improved the performance for all the divisions compared with the approach without it. The best result was given by the three equal classes division (33% 33% 33%) with a GMFE of 2.1502, while the worst was the 20% 60% 20% division with a GMFE of 2.2440. Comparing the three methods across the data divisions, the equal data divisions (both three and five classes) with the two stage confidence-measure approach gave improved performance compared to standard regression, the 92% 8% and 27% 27% 46% divisions gave slightly improved performance, and the remaining two divisions (20% 60% 20% and 10% 20% 40% 20% 10%) gave slightly poorer results.
Chapter 6: Conclusion and Future Work Suggestions

Accurate prediction of volume of distribution using in silico methods is very important for the estimation of drug delivery and for the synthesis of new drugs. While in vivo and in vitro methods exist, they are costly and time consuming. In silico methods also provide quick predictive VD results that can later be refined with in vivo and in vitro approaches. Most studies using in silico methods attempt to predict the whole range of VD values (usually using the Obach et al, 2008 database). However, due to the large range of VD values (with a skewed distribution), the obtained GMFE errors are still relatively high. In this regard, this MSc dissertation project attempted to refine the prediction by dividing the instances available in Obach's database into six different equal and unequal size divisions, namely 33% 33% 33%, 20% 60% 20%, 20% 20% 20% 20% 20%, 10% 20% 40% 20% 10%, 92% 8% and 27% 27% 46%. The aim of the divisions was to reduce the range of VD values used when building a model (in this case, an RF of decision trees), which would hopefully increase the accuracy of VD prediction. This entails a two stage approach: in the first stage a classification model is built to predict the class of a test instance, and in the second stage regression models built for each class are used to predict the VD value. Decision trees were the approach suggested by the supervisor, as they are a popular approach in the literature. The Random Forest (RF) approach was chosen due to its improved predictive ability compared with a single decision tree (the supervisor also suggested the M5P algorithm, but I was not able to implement it successfully in Matlab). The number of available attributes was 293, and the RF approach used 17 attributes chosen randomly from this set as candidate attributes at each node of a decision tree. An RF with 100 trees was used; although it is possible to achieve better performance with a higher number of trees, this would be more computationally expensive. Within the class divisions, folds of training, validation and testing data were created to implement this modified two stage approach with the confidence measure and to compare its performance with standard regression using three error measures: RMSE, MAE and GMFE. Initially, the analysis was done in Weka, but as it was becoming difficult to handle the data divisions effectively, Matlab was chosen as the software to perform the analysis. When the two stage method (classification-regression without a confidence measure) was implemented, it was realised that some classes had poor predictive ability in the first stage. This was mainly due to unequal class sizes, where the class that had more training
instances would perform well, but the overall accuracy would be poor because the other classes performed very poorly. This resulted in higher error values than standard regression for all six class divisions. Upon discussing this finding with the supervisor, a modified two stage approach was suggested, differing by using a confidence measure on the RF classification in the first stage. While many approaches could have been explored, due to time limitations the classification confidence score readily available from the TreeBagger function in Matlab was used. The classification confidence score provides a measure for deciding the effectiveness of the classification. Using this confidence measure, the two stage approach is only applied to instances where the confidence exceeds a certain threshold; otherwise, the instance is subjected to standard regression only. The best threshold for each division was decided using the training and validation folds. When tested with the testing data, the modified approach proved useful as it improved the error measures mainly for the equal class divisions, while performing comparably to standard regression for the other class divisions. The best performing division was 33% 33% 33%, which gave a Geometric Mean Fold Error (GMFE) of 2.1502 for the two-stage approach with confidence measure, compared to 2.3416 for the two-stage approach without confidence measure and 2.2373 for the standard regression approach. For the 20% 20% 20% 20% 20% data division, the GMFE errors are 2.2013, 2.3798 and 2.2356 for the proposed two-stage approach with confidence measure, the two-stage approach without confidence measure and standard regression, respectively. Exact comparison with the results of others, such as Lombardo et al (2006), is not possible: Lombardo et al used only 384 drugs and obtained GMFE errors of 1.83 to 2.24, whereas the work here used 602 drugs. Berellini et al (2009) used 669 compounds and obtained a GMFE of 1.8, but it is not apparent whether cross validation of the dataset was used. Freitas et al (2015) used 402 compounds for training and 202 for testing, with 56 attributes selected by correlation-based feature selection with genetic search, which gave a GMFE value of 2.29. In conclusion, the results obtained here are promising when compared with the current state of the art, and with further work the prediction of VD values can be made more accurate.

6.1 Suggestions for future work

Research is of no use if not utilised in some form. In this regard, it is hoped that the outcome of this research will be extended in future (and perhaps applied to real-life scenarios); some possibilities are:
- Explore other class divisions, for example using density based clustering. This could group VD values that follow a certain pattern more closely, which could improve the prediction accuracy of the first stage of the two stage approach.
- Explore other classifiers for the first stage, for example deep learning, which is becoming very popular and has been shown to give improved classification results as long as there is sufficient training data.
- Use other decision tree approaches in both the first and second stages, such as M5P. Increasing the number of trees could also improve performance in both stages.
- Use feature selection (such as sequential forward floating selection) to select the appropriate attributes; this is likely to reduce the errors if the selected features have high inter-class variance and low intra-class variance.
- Use other, more advanced confidence measures in the first stage of classification, for example Bayesian or other probability based approaches.

6.2 Reflection

The project was challenging as I did not have any knowledge of pharmacokinetics. I thought the volume of distribution of a drug would be the actual volume of drug delivered! However, my supervisor assured me that in-depth knowledge of pharmacokinetics would not be required and also provided me with a good set of journal papers and a book to read to equip myself. Although I had acquired some knowledge of Matlab (through self-study) for one of my assignments, writing code for a full research problem was daunting. But breaking the task into smaller sections, as advised by my supervisor, proved to be very useful, and the code was built up in small increments on top of what was already working. Having a family to manage, time management was probably the most difficult aspect of my project. However, the pursuit of MSc studies taught me how to divide my time to strike a balance between the project and family. My knowledge of machine learning, especially decision trees, improved greatly after implementing the project. I learnt that acquiring knowledge to pass exams is one thing, but practical implementation is another matter altogether.
References

E. M. del Amo, L. Ghemtio, H. Xhaard, M. Yliperttula, A. Urtti and H. Kidron, "Applying linear and non-linear methods for parallel prediction of volume of distribution and fraction of unbound drug," PLoS One, vol. 8, no. 10, e74758, 2013.

G. Berellini, C. Springer, N. J. Waters and F. Lombardo, "In silico prediction of volume of distribution in human using linear and nonlinear models on a 669 compound data set," J. Med. Chem., vol. 52, pp. 4488–4495, 2009.

L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

O. Demir-Kavuk, J. Bentzien, I. Muegge and E-W. Knapp, "DemQSAR: predicting human volume of distribution and clearance of drugs," J. Comput. Aided Mol. Des., vol. 25, pp. 1121–1133, 2011.

T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.

A. A. Freitas, K. Limbu and T. Ghafourian, "Predicting volume of distribution with decision tree-based regression methods using predicted tissue:plasma partition coefficients," J. Cheminformatics, vol. 7, no. 6, 17 pages, 2015.

S. S. Jambhekar and P. J. Breen, Basic Pharmacokinetics, Second Edition, Pharmaceutical Press, London, 2012.

R. D. Jones, H. M. Jones, M. Rowland, C. R. Gibson, J. W. T. Yates, J. Y. Chien et al., "PhRMA CPCDC initiative on predictive models of human pharmacokinetics, part 2: Comparative assessment of prediction methods of volume of distribution," Journal of Pharmaceutical Sciences, vol. 100, pp. 4074–4089, 2011.
F. Lombardo, R. S. Obach, F. M. DiCapua, G. A. Bakken, J. Lu, D. M. Potter, et al., "A hybrid mixture discriminant analysis – random forest computational model for the prediction of volume of distribution of drugs in human," J. Med. Chem., vol. 49, pp. 2262–2267, 2006.

B. Louis and V. K. Agrawal, "Quantitative structure-pharmacokinetic relationship (QSPkR) analysis of the volume of distribution values of anti-infective agents from J group of the ATC classification in humans," Acta Pharm., vol. 62, pp. 305–323, 2012.

R. S. Obach, F. Lombardo and N. J. Waters, "Trend analysis of a database of intravenous pharmacokinetic parameters in humans for 670 drug compounds," Drug Metab. Dispos., vol. 36, pp. 1385–1405, 2008.

W. G. Touw, J. R. Bayjanov, L. Overmars, L. Backus, J. Boekhorst, M. Wels and S. A. van Hijum, "Data mining in the life sciences with random forest: A walk in the park or lost in the jungle?" Briefings in Bioinformatics, vol. 14, no. 3, pp. 315–326, 2013.

I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, Elsevier/Morgan Kaufmann, San Francisco, 2011.

Z. Zhivkova and I. Doytchinova, "Prediction of steady-state volume of distribution of acidic drugs by quantitative structure-pharmacokinetics relationships," J. Pharm. Sci., vol. 101, pp. 1253–1266, 2012.

Z. Zhivkova, T. Mandova and I. Doytchinova, "Quantitative structure–pharmacokinetics relationships analysis of basic drugs: volume of distribution," Journal of Pharmacy & Pharmaceutical Sciences, vol. 18, no. 3, pp. 515–527, 2015.
Appendix – Matlab (Mathworks Inc., 2016) program

%program to create ten folds of data for 33%33%33% data division
%Nithyakalyani Chinnaiah, nc330@kent.ac.uk, August 2016
%due to the high number of lines of code, only the code for one data
%division is included in the Appendix. The full set of code can be
%found in the corpus.
clc; clear all; close all
%just for info, class divisions
%y33(1:201)=1;
%y33(202:399)=2;
%y33(400:602)=3;
x=xlsread('Pharma_data1_only_for_matlab.xlsx'); %read data, 293 attributes, 1 log Vd
temp=randperm(201); %for first class
cls1_fd1=temp(1:20);
cls1_fd2=temp(21:40);
cls1_fd3=temp(41:60);
cls1_fd4=temp(61:80);
cls1_fd5=temp(81:100);
cls1_fd6=temp(101:120);
cls1_fd7=temp(121:140);
cls1_fd8=temp(141:160);
cls1_fd9=temp(161:180);
cls1_fd10=temp(181:201);
ytemp2=202:399; %for second class
temp2=ytemp2(randperm(198));
cls2_fd1=temp2(1:20);
cls2_fd2=temp2(21:40);
cls2_fd3=temp2(41:60);
cls2_fd4=temp2(61:80);
cls2_fd5=temp2(81:100);
cls2_fd6=temp2(101:120);
cls2_fd7=temp2(121:140);
cls2_fd8=temp2(141:160);
cls2_fd9=temp2(161:180);
cls2_fd10=temp2(181:198);
ytemp3=400:602; %for third class
temp3=ytemp3(randperm(203));
cls3_fd1=temp3(1:20);
cls3_fd2=temp3(21:40);
cls3_fd3=temp3(41:60);
cls3_fd4=temp3(61:80);
cls3_fd5=temp3(81:100);
cls3_fd6=temp3(101:120);
cls3_fd7=temp3(121:140);
cls3_fd8=temp3(141:160);
cls3_fd9=temp3(161:180);
cls3_fd10=temp3(181:203);
  • 36. 32 %splitting the indices for 10 folds of data division with 3 classes fold10=[cls1_fd10,cls2_fd10,cls3_fd10]; fold9=[cls1_fd9,cls2_fd9,cls3_fd9]; fold8=[cls1_fd8,cls2_fd8,cls3_fd8]; fold7=[cls1_fd7,cls2_fd7,cls3_fd7]; fold6=[cls1_fd6,cls2_fd6,cls3_fd6]; fold5=[cls1_fd5,cls2_fd5,cls3_fd5]; fold4=[cls1_fd4,cls2_fd4,cls3_fd4]; fold3=[cls1_fd3,cls2_fd3,cls3_fd3]; fold2=[cls1_fd2,cls2_fd2,cls3_fd2]; fold1=[cls1_fd1,cls2_fd1,cls3_fd1]; %class labels cfold1=[ones(size(cls1_fd1)),1+ones(size(cls2_fd1)),2+ones(size(cls3 _fd1))]; cfold2=[ones(size(cls1_fd2)),1+ones(size(cls2_fd2)),2+ones(size(cls3 _fd2))]; cfold3=[ones(size(cls1_fd3)),1+ones(size(cls2_fd3)),2+ones(size(cls3 _fd3))]; cfold4=[ones(size(cls1_fd4)),1+ones(size(cls2_fd4)),2+ones(size(cls3 _fd4))]; cfold5=[ones(size(cls1_fd5)),1+ones(size(cls2_fd5)),2+ones(size(cls3 _fd5))]; cfold6=[ones(size(cls1_fd6)),1+ones(size(cls2_fd6)),2+ones(size(cls3 _fd6))]; cfold7=[ones(size(cls1_fd7)),1+ones(size(cls2_fd7)),2+ones(size(cls3 _fd7))]; cfold8=[ones(size(cls1_fd8)),1+ones(size(cls2_fd8)),2+ones(size(cls3 _fd8))]; cfold9=[ones(size(cls1_fd9)),1+ones(size(cls2_fd9)),2+ones(size(cls3 _fd9))]; cfold10=[ones(size(cls1_fd10)),1+ones(size(cls2_fd10)),2+ones(size(c ls3_fd10))]; save datafold33.mat %%%%%% creating folds for different training, validation and testing %%%%%% % fold1 data %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold1,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold8,:)]; %training data dval=x(fold9,:); %validation data %fold1 labels ctrain=[cfold1,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold8]; %training data label cval=cfold9; %validation data label %fold 1 class 1 cls1_train=[cls1_fd1,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd8]; %training data cls1_val= cls1_fd9; %validation fold %fold 1 class 2 cls2_train=[cls2_fd1,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd8]; cls2_val= cls2_fd9; %fold 1 class 3
  • 37. 33 cls3_train=[cls3_fd1,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd8]; cls3_val= cls3_fd9; save ('datafold_1.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % fold2 data %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold8,:)]; %training data dval=x(fold1,:); %validation data %fold1 labels ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold8]; %training data label cval=cfold1; %validation data label %fold 1 class 1 cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd8]; %training data cls1_val= cls1_fd1; %validation fold %fold 1 class 2 cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd8]; cls2_val= cls2_fd1; %fold 1 class 3 cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd8]; cls3_val= cls3_fd1; save ('datafold_2.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % fold3 data %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold1,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold8,:)]; %training data dval=x(fold2,:); %validation data %fold1 labels ctrain=[cfold9,cfold1,cfold3,cfold4,cfold5,cfold6,cfold7,cfold8]; %training data label cval=cfold2; %validation data label %fold 1 class 1 cls1_train=[cls1_fd9,cls1_fd1,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd8]; %training data cls1_val= cls1_fd2; %validation fold %fold 1 class 2 cls2_train=[cls2_fd9,cls2_fd1,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd8]; cls2_val= cls2_fd2; %fold 1 class 3 cls3_train=[cls3_fd9,cls3_fd1,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd8]; cls3_val= cls3_fd2;
  • 38. 34 save ('datafold_3.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % fold4 data %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold1,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold8,:)]; %training data dval=x(fold3,:); %validation data %fold1 labels ctrain=[cfold9,cfold2,cfold1,cfold4,cfold5,cfold6,cfold7,cfold8]; %training data label cval=cfold3; %validation data label %fold 1 class 1 cls1_train=[cls1_fd9,cls1_fd2,cls1_fd1,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd8]; %training data cls1_val= cls1_fd3; %validation fold %fold 1 class 2 cls2_train=[cls2_fd9,cls2_fd2,cls2_fd1,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd8]; cls2_val= cls2_fd3; %fold 1 class 3 cls3_train=[cls3_fd9,cls3_fd2,cls3_fd1,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd8]; cls3_val= cls3_fd3; save ('datafold_4.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % fold5 data %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold1,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold8,:)]; %training data dval=x(fold4,:); %validation data %fold1 labels ctrain=[cfold9,cfold2,cfold3,cfold1,cfold5,cfold6,cfold7,cfold8]; %training data label cval=cfold4; %validation data label %fold 1 class 1 cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd1,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd8]; %training data cls1_val= cls1_fd4; %validation fold %fold 1 class 2 cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd1,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd8]; cls2_val= cls2_fd4; %fold 1 class 3 cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd1,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd8]; cls3_val= cls3_fd4; save ('datafold_5.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  • 39. 35 % fold6 data %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold1,:);x(fol d6,:);x(fold7,:);x(fold8,:)]; %training data dval=x(fold5,:); %validation data %fold1 labels ctrain=[cfold9,cfold2,cfold3,cfold4,cfold1,cfold6,cfold7,cfold8]; %training data label cval=cfold5; %validation data label %fold 1 class 1 cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd1,cls1_fd6,cl s1_fd7,cls1_fd8]; %training data cls1_val= cls1_fd5; %validation fold %fold 1 class 2 cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd1,cls2_fd6,cl s2_fd7,cls2_fd8]; cls2_val= cls2_fd5; %fold 1 class 3 cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd1,cls3_fd6,cl s3_fd7,cls3_fd8]; cls3_val= cls3_fd5; save ('datafold_6.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % fold7 data %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d1,:);x(fold7,:);x(fold8,:)]; %training data dval=x(fold6,:); %validation data %fold1 labels ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold1,cfold7,cfold8]; %training data label cval=cfold6; %validation data label %fold 1 class 1 cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd1,cl s1_fd7,cls1_fd8]; %training data cls1_val= cls1_fd6; %validation fold %fold 1 class 2 cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd1,cl s2_fd7,cls2_fd8]; cls2_val= cls2_fd6; %fold 1 class 3 cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd1,cl s3_fd7,cls3_fd8]; cls3_val= cls3_fd6; save ('datafold_7.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % fold8 data %%%%%%%%%%%%%%%%%%%%
  • 40. 36 dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold1,:);x(fold8,:)]; %training data dval=x(fold7,:); %validation data %fold1 labels ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold1,cfold8]; %training data label cval=cfold7; %validation data label %fold 1 class 1 cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd1,cls1_fd8]; %training data cls1_val= cls1_fd7; %validation fold %fold 1 class 2 cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd1,cls2_fd8]; cls2_val= cls2_fd7; %fold 1 class 3 cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd1,cls3_fd8]; cls3_val= cls3_fd7; save ('datafold_8.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % fold9 data %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold1,:)]; %training data dval=x(fold8,:); %validation data %fold1 labels ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold1]; %training data label cval=cfold8; %validation data label %fold 1 class 1 cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd1]; %training data cls1_val= cls1_fd8; %validation fold %fold 1 class 2 cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd1]; cls2_val= cls2_fd8; %fold 1 class 3 cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd1]; cls3_val= cls3_fd8; save ('datafold_9.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% dtrain=[x(fold1,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold9,:);
  • 41. 37 ctrain=[cfold1,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10]; cval=cfold9; cls1_train=[cls1_fd1,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd9; cls2_train=[cls2_fd1,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd9; cls3_train=[cls3_fd1,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd9; save ('datafold_10.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold1,:); ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10]; cval=cfold1; cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd1; cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd1; cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd1; save ('datafold_11.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold1,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold2,:); ctrain=[cfold9,cfold1,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10]; cval=cfold2; cls1_train=[cls1_fd9,cls1_fd1,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd2; cls2_train=[cls2_fd9,cls2_fd1,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd2; cls3_train=[cls3_fd9,cls3_fd1,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd2; save ('datafold_12.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%
  • 42. 38 dtrain=[x(fold9,:);x(fold2,:);x(fold1,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold3,:); ctrain=[cfold9,cfold2,cfold1,cfold4,cfold5,cfold6,cfold7,cfold10]; cval=cfold3; cls1_train=[cls1_fd9,cls1_fd2,cls1_fd1,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd3 cls2_train=[cls2_fd9,cls2_fd2,cls2_fd1,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd3; cls3_train=[cls3_fd9,cls3_fd2,cls3_fd1,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd3; save ('datafold_13.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold1,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold4,:); ctrain=[cfold9,cfold2,cfold3,cfold1,cfold5,cfold6,cfold7,cfold10]; cval=cfold4; cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd1,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd4; cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd1,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd4; cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd1,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd4; save ('datafold_14.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold1,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold5,:); ctrain=[cfold9,cfold2,cfold3,cfold4,cfold1,cfold6,cfold7,cfold10]; cval=cfold5; cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd1,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd5; cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd1,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd5; cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd1,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd5; save ('datafold_15.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
  • 43. 39 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d1,:);x(fold7,:);x(fold10,:)]; dval=x(fold6,:); ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold1,cfold7,cfold10]; cval=cfold6; cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd1,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd6; cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd1,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd6; cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd1,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd6; save ('datafold_16.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold1,:);x(fold10,:)]; dval=x(fold7,:); ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold1,cfold10]; cval=cfold7; cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd1,cls1_fd10]; cls1_val= cls1_fd7; cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd1,cls2_fd10]; cls2_val= cls2_fd7; cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd1,cls3_fd10]; cls3_val= cls3_fd7; save ('datafold_17.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold9,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold1,:)]; dval=x(fold10,:); ctrain=[cfold9,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold1]; cval=cfold10; cls1_train=[cls1_fd9,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd1]; cls1_val= cls1_fd10; cls2_train=[cls2_fd9,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd1]; cls2_val= cls2_fd10; cls3_train=[cls3_fd9,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd1]; cls3_val= cls3_fd10;
  • 44. 40 save ('datafold_18.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% dtrain=[x(fold1,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold8,:); ctrain=[cfold1,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10]; cval=cfold8; cls1_train=[cls1_fd1,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd8; cls2_train=[cls2_fd1,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd8; cls3_train=[cls3_fd1,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd8; save ('datafold_19.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold1,:); ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10]; cval=cfold1; cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd1; cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd1; cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd1; save ('datafold_20.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold8,:);x(fold1,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold2,:); ctrain=[cfold8,cfold1,cfold3,cfold4,cfold5,cfold6,cfold7,cfold10]; cval=cfold2; cls1_train=[cls1_fd8,cls1_fd1,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd2; cls2_train=[cls2_fd8,cls2_fd1,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd2;
  • 45. 41 cls3_train=[cls3_fd8,cls3_fd1,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd2; save ('datafold_21.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% dtrain=[x(fold8,:);x(fold2,:);x(fold1,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold3,:); ctrain=[cfold8,cfold2,cfold1,cfold4,cfold5,cfold6,cfold7,cfold10]; cval=cfold3; cls1_train=[cls1_fd8,cls1_fd2,cls1_fd1,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd3 cls2_train=[cls2_fd8,cls2_fd2,cls2_fd1,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd3; cls3_train=[cls3_fd8,cls3_fd2,cls3_fd1,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd3; save ('datafold_22.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold1,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold4,:); ctrain=[cfold8,cfold2,cfold3,cfold1,cfold5,cfold6,cfold7,cfold10]; cval=cfold4; cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd1,cls1_fd5,cls1_fd6,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd4; cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd1,cls2_fd5,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd4; cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd1,cls3_fd5,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd4; save ('datafold_23.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold1,:);x(fol d6,:);x(fold7,:);x(fold10,:)]; dval=x(fold5,:); ctrain=[cfold8,cfold2,cfold3,cfold4,cfold1,cfold6,cfold7,cfold10]; cval=cfold5; cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd1,cls1_fd6,cl s1_fd7,cls1_fd10];
  • 46. 42 cls1_val= cls1_fd5; cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd1,cls2_fd6,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd5; cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd1,cls3_fd6,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd5; save ('datafold_24.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d1,:);x(fold7,:);x(fold10,:)]; dval=x(fold6,:); ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold1,cfold7,cfold10]; cval=cfold6; cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd1,cl s1_fd7,cls1_fd10]; cls1_val= cls1_fd6; cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd1,cl s2_fd7,cls2_fd10]; cls2_val= cls2_fd6; cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd1,cl s3_fd7,cls3_fd10]; cls3_val= cls3_fd6; save ('datafold_25.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold1,:);x(fold10,:)]; dval=x(fold7,:); ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold1,cfold10]; cval=cfold7; cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cl s1_fd1,cls1_fd10]; cls1_val= cls1_fd7; cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cl s2_fd1,cls2_fd10]; cls2_val= cls2_fd7; cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cl s3_fd1,cls3_fd10]; cls3_val= cls3_fd7; save ('datafold_26.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val'); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%% dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fol d6,:);x(fold7,:);x(fold1,:)]; dval=x(fold10,:); ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold7,cfold1];
cval=cfold10;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd7,cls1_fd1];
cls1_val= cls1_fd10;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd7,cls2_fd1];
cls2_val= cls2_fd10;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cls3_fd7,cls3_fd1];
cls3_val= cls3_fd10;
save ('datafold_27.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% validation fold = fold 8 (from here folds 1-6, 8, 9 and 10 are rotated; fold 7 is not used in this group)
dtrain=[x(fold1,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fold6,:);x(fold9,:);x(fold10,:)];
dval=x(fold8,:);
ctrain=[cfold1,cfold2,cfold3,cfold4,cfold5,cfold6,cfold9,cfold10];
cval=cfold8;
cls1_train=[cls1_fd1,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd9,cls1_fd10];
cls1_val= cls1_fd8;
cls2_train=[cls2_fd1,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd9,cls2_fd10];
cls2_val= cls2_fd8;
cls3_train=[cls3_fd1,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cls3_fd9,cls3_fd10];
cls3_val= cls3_fd8;
save ('datafold_28.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% validation fold = fold 1
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fold6,:);x(fold9,:);x(fold10,:)];
dval=x(fold1,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold9,cfold10];
cval=cfold1;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd9,cls1_fd10];
cls1_val= cls1_fd1;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd9,cls2_fd10];
cls2_val= cls2_fd1;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cls3_fd9,cls3_fd10];
cls3_val= cls3_fd1;
save ('datafold_29.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
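% --------------------------------------------------------------------
% Editor's sketch (not part of the original program): every combination
% in this group follows the same pattern, i.e. one fold is held out as
% the validation fold, the remaining eight are concatenated for
% training, and the ten variables are saved to the next datafold_NN.mat
% file. Assuming the per-fold feature matrices and class labels had
% first been gathered into hypothetical cell arrays xf{i}, cf{i},
% c1f{i}, c2f{i} and c3f{i} (i = 1..10), names that do not appear in
% the original code, the files datafold_28.mat to datafold_34.mat could
% equivalently be produced by a loop such as the one below. It is kept
% commented out so that the original listing remains the code that was
% actually run.
%
%   pool = [1 2 3 4 5 6 8 9 10];   % folds rotated in this group (fold 7 is not used here)
%   valOrder = [8 1 2 3 4 5 6];    % validation folds of datafold_28 .. datafold_34
%   for k = 1:numel(valOrder)
%       v = valOrder(k);
%       t = setdiff(pool, v);      % the other eight folds form the training set
%       dtrain = vertcat(xf{t});       dval = xf{v};
%       ctrain = horzcat(cf{t});       cval = cf{v};
%       cls1_train = horzcat(c1f{t});  cls1_val = c1f{v};
%       cls2_train = horzcat(c2f{t});  cls2_val = c2f{v};
%       cls3_train = horzcat(c3f{t});  cls3_val = c3f{v};
%       save(sprintf('datafold_%d.mat', 27 + k), 'dtrain', 'dval', ...
%            'ctrain', 'cval', 'cls1_train', 'cls1_val', ...
%            'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%   end
%
% The concatenation order produced by setdiff differs from the
% hand-written version, but each training and validation set contains
% the same folds.
% --------------------------------------------------------------------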
% validation fold = fold 2
dtrain=[x(fold8,:);x(fold1,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fold6,:);x(fold9,:);x(fold10,:)];
dval=x(fold2,:);
ctrain=[cfold8,cfold1,cfold3,cfold4,cfold5,cfold6,cfold9,cfold10];
cval=cfold2;
cls1_train=[cls1_fd8,cls1_fd1,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd9,cls1_fd10];
cls1_val= cls1_fd2;
cls2_train=[cls2_fd8,cls2_fd1,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd9,cls2_fd10];
cls2_val= cls2_fd2;
cls3_train=[cls3_fd8,cls3_fd1,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd6,cls3_fd9,cls3_fd10];
cls3_val= cls3_fd2;
save ('datafold_30.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% validation fold = fold 3
dtrain=[x(fold8,:);x(fold2,:);x(fold1,:);x(fold4,:);x(fold5,:);x(fold6,:);x(fold9,:);x(fold10,:)];
dval=x(fold3,:);
ctrain=[cfold8,cfold2,cfold1,cfold4,cfold5,cfold6,cfold9,cfold10];
cval=cfold3;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd1,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd9,cls1_fd10];
cls1_val= cls1_fd3;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd1,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd9,cls2_fd10];
cls2_val= cls2_fd3;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd1,cls3_fd4,cls3_fd5,cls3_fd6,cls3_fd9,cls3_fd10];
cls3_val= cls3_fd3;
save ('datafold_31.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% validation fold = fold 4
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold1,:);x(fold5,:);x(fold6,:);x(fold9,:);x(fold10,:)];
dval=x(fold4,:);
ctrain=[cfold8,cfold2,cfold3,cfold1,cfold5,cfold6,cfold9,cfold10];
cval=cfold4;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd1,cls1_fd5,cls1_fd6,cls1_fd9,cls1_fd10];
cls1_val= cls1_fd4;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd1,cls2_fd5,cls2_fd6,cls2_fd9,cls2_fd10];
cls2_val= cls2_fd4;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd1,cls3_fd5,cls3_fd6,cls3_fd9,cls3_fd10];
cls3_val= cls3_fd4;
save ('datafold_32.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% validation fold = fold 5
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold1,:);x(fold6,:);x(fold9,:);x(fold10,:)];
dval=x(fold5,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold1,cfold6,cfold9,cfold10];
cval=cfold5;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd1,cls1_fd6,cls1_fd9,cls1_fd10];
cls1_val= cls1_fd5;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd1,cls2_fd6,cls2_fd9,cls2_fd10];
cls2_val= cls2_fd5;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd1,cls3_fd6,cls3_fd9,cls3_fd10];
cls3_val= cls3_fd5;
save ('datafold_33.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% validation fold = fold 6
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fold1,:);x(fold9,:);x(fold10,:)];
dval=x(fold6,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold1,cfold9,cfold10];
cval=cfold6;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd1,cls1_fd9,cls1_fd10];
cls1_val= cls1_fd6;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd1,cls2_fd9,cls2_fd10];
cls2_val= cls2_fd6;
cls3_train=[cls3_fd8,cls3_fd2,cls3_fd3,cls3_fd4,cls3_fd5,cls3_fd1,cls3_fd9,cls3_fd10];
cls3_val= cls3_fd6;
save ('datafold_34.mat', 'dtrain', 'dval', 'ctrain', 'cval', 'cls1_train', 'cls1_val', 'cls2_train', 'cls2_val', 'cls3_train', 'cls3_val');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% validation fold = fold 9
dtrain=[x(fold8,:);x(fold2,:);x(fold3,:);x(fold4,:);x(fold5,:);x(fold6,:);x(fold1,:);x(fold10,:)];
dval=x(fold9,:);
ctrain=[cfold8,cfold2,cfold3,cfold4,cfold5,cfold6,cfold1,cfold10];
cval=cfold9;
cls1_train=[cls1_fd8,cls1_fd2,cls1_fd3,cls1_fd4,cls1_fd5,cls1_fd6,cls1_fd1,cls1_fd10];
cls1_val= cls1_fd9;
cls2_train=[cls2_fd8,cls2_fd2,cls2_fd3,cls2_fd4,cls2_fd5,cls2_fd6,cls2_fd1,cls2_fd10];
cls2_val= cls2_fd9;