QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemoinformatics Summer School. Strasbourg, France 25 – 29 June 2012. And ESOF EuroScience Open Forum, Dublin, Ireland 11-15 July 2012
The goal of this study was to predict the ready biodegradability of
chemicals by QSAR modeling. The dataset used for this purpose was
produced by the Japanese Ministry of International Trade and Industry
(MITI), with experimental results obtained according to OECD test guideline
301C. Molecular descriptors were calculated with Dragon 6. Variable
selection coupled with classification methods was applied to find the
most predictive models, i.e. those with the lowest cross-validation error rate. The best
models were then validated on the preselected test set to check
their predictive reliability and for further analysis.
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS
Kamel Mansouri, Tine Ringsted, Viviana Consonni,
Davide Ballabio, Roberto Todeschini
Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences,
University of Milano-Bicocca, P.za della Scienza 1 – 20126 Milano, Italy
Abstract:
Persistent organic pollutants are highly bioaccumulative and have toxic
effects on humans, wildlife and the environment. Their persistence has
been studied experimentally and theoretically so that new
chemicals can be evaluated and Persistent, Bioaccumulative and Toxic (PBT)
compounds avoided. To fill data gaps, QSARs are increasingly
used by the scientific community as an alternative to animal testing and
are being implemented in legislation (REACH).
Ready biodegradation data (MITI-I test) were collected for 1314 compounds [1].
A molecule was removed if:
+ it had a disconnected structure
+ the experimental value did not agree with the classification BOD threshold
of 60% (Fig1)
+ replicate values differed by more than 20%
+ the classification would change if nitrification was taken into account
After removal, 1055 molecules remained (356 ready biodegradable / 699 not ready
biodegradable) (Fig2).
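The exclusion rules listed above can be sketched as a simple filter. This is only an illustration, not the authors' code: the `Record` fields and the `keep` helper are hypothetical names, and the reading of each rule is an assumption (replicates are averaged before comparison with the 60% threshold, and the 20% difference is taken as the spread between the highest and lowest replicate).

```python
from dataclasses import dataclass
from typing import List

BOD_THRESHOLD = 60.0      # % BOD, classification threshold from the text
MAX_REPLICATE_GAP = 20.0  # maximum allowed spread between replicates (assumed reading)


@dataclass
class Record:
    """Hypothetical record layout; field names are illustrative only."""
    name: str
    n_fragments: int             # number of disconnected parts in the structure
    bod_replicates: List[float]  # replicate BOD values in percent
    labelled_ready: bool         # class reported in the source database
    nitrification_flips: bool    # True if accounting for nitrification changes the class


def keep(rec: Record) -> bool:
    """Return True if the record survives all four exclusion rules."""
    if rec.n_fragments > 1:
        return False  # disconnected structure
    mean_bod = sum(rec.bod_replicates) / len(rec.bod_replicates)
    if (mean_bod >= BOD_THRESHOLD) != rec.labelled_ready:
        return False  # value disagrees with the 60% classification threshold
    if max(rec.bod_replicates) - min(rec.bod_replicates) > MAX_REPLICATE_GAP:
        return False  # replicates too far apart
    if rec.nitrification_flips:
        return False  # class depends on the nitrification correction
    return True


# Made-up records, one per rule
records = [
    Record("A", 1, [72.0, 68.0], True, False),
    Record("B", 2, [80.0, 81.0], True, False),
    Record("C", 1, [45.0, 50.0], True, False),
    Record("D", 1, [10.0, 40.0], False, False),
    Record("E", 1, [5.0, 8.0], False, True),
]
kept = [r.name for r in records if keep(r)]
print(kept)
```

Only record "A" passes here; each of the other four trips exactly one rule.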
Descriptors:
Different blocks of molecular descriptors were
initially calculated using Dragon 6 [3]: 2D atom pairs,
topological indices, ring descriptors, constitutional
indices, functional groups, 2D matrix-based descriptors, atom-centered
fragments and atom-type E-state indices.
Highly correlated, constant and near-constant descriptors
were removed automatically using the same software.
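The kind of pre-filtering described here can be sketched in a few lines. The poster does not give the criteria Dragon applies, so both thresholds below (`max_share` for near-constant columns and `max_abs_r` for pairwise correlation) are assumptions chosen for illustration, as are the function names and toy data.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def prefilter(columns: Dict[str, List[float]],
              max_share: float = 0.8,
              max_abs_r: float = 0.95) -> List[str]:
    """Drop constant/near-constant columns, then columns that are highly
    correlated with one already kept (thresholds are assumed values)."""
    kept: List[str] = []
    for name, values in columns.items():
        # Share of the most frequent value: 1.0 means constant
        top_share = Counter(values).most_common(1)[0][1] / len(values)
        if top_share >= max_share:
            continue
        if any(abs(pearson(values, columns[k])) >= max_abs_r for k in kept):
            continue
        kept.append(name)
    return kept


# Toy descriptor matrix (column name -> values over five molecules)
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.5],  # almost collinear with d1
    "d3": [0.0, 0.0, 0.0, 0.0, 0.0],   # constant
    "d4": [0.0, 0.0, 0.0, 0.0, 1.0],   # near constant
    "d5": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = prefilter(columns)
print(selected)
```

Of each correlated pair, this sketch keeps the column met first; other tools may choose differently.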
Variable selection:
Genetic algorithms (GA) [4] were run in Matlab with each classification
method (SVM, KNN, PLSDA). Two filters were applied to select the best
descriptors:
+ GA selection was run first on each descriptor block separately, then on the merged set of descriptors retained from all blocks.
+ the frequency of selection over 100 GA runs was used to sort the
descriptors by importance, keeping only the 100 most relevant ones for
the last modeling step.
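The second filter, ranking descriptors by how often they are selected across GA runs, can be sketched as follows. The GA itself is not shown; `runs` stands in for the descriptor subsets returned by the individual runs, and the descriptor labels and function name are placeholders for illustration. The text uses 100 runs and keeps 100 descriptors; the toy example uses 4 runs and keeps 2.

```python
from collections import Counter
from typing import Iterable, List, Set


def rank_by_selection_frequency(runs: Iterable[Set[str]], top_n: int) -> List[str]:
    """Count how often each descriptor is selected and return the top_n names."""
    counts: Counter = Counter()
    for selected in runs:
        counts.update(selected)
    # Sort by decreasing frequency; ties are broken alphabetically so that
    # the result is reproducible (an arbitrary choice for this sketch)
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    return ranked[:top_n]


# Placeholder output of four GA runs
runs = [
    {"nN", "BLI"},
    {"nN", "AMW"},
    {"nN", "BLI", "nHet"},
    {"BLI", "nN"},
]
top = rank_by_selection_frequency(runs, top_n=2)
print(top)
```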
Validation of models:
+ 5-fold cross-validation.
+ An external test set, chosen by randomly splitting the initial data set into
20% test and 80% training set while keeping the balance between ready
biodegradable and not ready biodegradable molecules. The training set contained 837
molecules and the test set 218 molecules.
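A class-balanced 80/20 split followed by a 5-fold partition of the training set can be sketched as below. The exact procedure used for the poster is not given, so the per-class rounding, the seed and the function names are assumptions. With the class sizes from the text (356/699) this sketch yields 844 training and 211 test molecules, not the reported 837/218.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_fraction: float,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split indices into train/test, sampling each class separately."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices: Sequence[int], k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


# 1 = ready biodegradable (356), 0 = not ready biodegradable (699)
labels = [1] * 356 + [0] * 699
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
folds = k_folds(train_idx, k=5)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
```

In each cross-validation round, one fold is held out for prediction and the other four are used for fitting.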
[Bar chart: y-axis "Number of molecules" (0 to 80); legend "28 days" and "<28 days"]
QSAR: SVM, KNN, PLS-DA
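| Model ID | Descriptors | ER cv (5f-CV) | Spec. (5f-CV) | Sens. (5f-CV) | ER test | Spec. (test) | Sens. (test) |
|----------|-------------|---------------|---------------|---------------|---------|--------------|--------------|
| SVM_1    | 20          | 0.151         | 0.775         | 0.924         | 0.135   | 0.806        | 0.925        |
| SVM_2    | 23          | 0.153         | 0.785         | 0.910         | 0.131   | 0.806        | 0.932        |
| SVM_3    | 24          | 0.156         | 0.775         | 0.913         | 0.131   | 0.819        | 0.918        |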
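| Model ID | Descriptors | LVs | ER fit | Spec. (fit) | Sens. (fit) | ER cv (5f-CV) | Spec. (5f-CV) | Sens. (5f-CV) | ER test | Spec. (test) | Sens. (test) |
|----------|-------------|-----|--------|-------------|-------------|---------------|---------------|---------------|---------|--------------|--------------|
| PLSDA_1  | 26          | 9   | 0.140  | 0.887       | 0.834       | 0.141         | 0.891         | 0.826         | 0.145   | 0.861        | 0.849        |
| PLSDA_2  | 28          | 9   | 0.144  | 0.891       | 0.821       | 0.142         | 0.887         | 0.828         | 0.145   | 0.847        | 0.863        |
| PLSDA_3  | 23          | 5   | 0.144  | 0.880       | 0.832       | 0.141         | 0.884         | 0.834         | 0.148   | 0.833        | 0.870        |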
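| Model ID | Descriptors | Distance   | K | ER cv (5f-CV) | Spec. (5f-CV) | Sens. (5f-CV) | ER test | Spec. (test) | Sens. (test) |
|----------|-------------|------------|---|---------------|---------------|---------------|---------|--------------|--------------|
| KNN_1    | 17          | Euclidean  | 6 | 0.136         | 0.859         | 0.870         | 0.121   | 0.847        | 0.911        |
| KNN_2    | 17          | City block | 6 | 0.139         | 0.852         | 0.870         | 0.138   | 0.847        | 0.877        |
| KNN_3    | 15          | City block | 8 | 0.141         | 0.849         | 0.870         | 0.142   | 0.806        | 0.911        |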
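The poster does not define the error rate (ER). The tabulated values are, however, consistent with a class-balanced error rate, ER = 1 - (Spec. + Sens.)/2. The sketch below checks this assumption against a few rows copied from the tables; the function name is illustrative.

```python
def balanced_error_rate(spec: float, sens: float) -> float:
    """Assumed definition: one minus the mean of specificity and sensitivity."""
    return 1.0 - (spec + sens) / 2.0


# (Spec., Sens., reported ER) taken from the tables
rows = {
    "SVM_1 (5f-CV)": (0.775, 0.924, 0.151),
    "SVM_1 (test)": (0.806, 0.925, 0.135),
    "KNN_1 (5f-CV)": (0.859, 0.870, 0.136),
    "KNN_1 (test)": (0.847, 0.911, 0.121),
    "PLSDA_1 (test)": (0.861, 0.849, 0.145),
}

for name, (spec, sens, reported) in rows.items():
    er = balanced_error_rate(spec, sens)
    print(f"{name}: computed {er:.4f}, reported {reported:.3f}")
```

The computed and reported values agree to within rounding of the three-decimal inputs.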
Acknowledgements:
The research leading to these results has received funding from the European
Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement
n° 238701 of the project Marie Curie ITN Environmental Chemoinformatics (ECO-ITN).
http://www.eco-itn.eu
References:
[1] Chemical Risk Information Platform (CHRIP), National Institute of Technology and
Evaluation, Japan, http://www.safe.nite.go.jp/english/kizon/KIZON_start_hazkizon.htm
[2] Chih-Chung Chang and Chih-Jen Lin, LIBSVM 3.1,
http://www.csie.ntu.edu.tw/~cjlin/libsvm
[3] Dragon6. Talete srl, Milano, Italy, http://www.talete.mi.it
[4] Leardi, R., Lupianez, A., 1998. Genetic algorithms applied to feature selection in PLS
regression: how and when to use them. Chemometr. Intell. Lab. 41, 195–207.
$$\mathrm{BOD} = \frac{\text{mg O}_2\ \text{uptake by test substance} - \text{mg O}_2\ \text{uptake by blank}}{\text{mg test substance in vessel}}$$
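As a worked example of the BOD calculation and the 60% classification threshold: in the OECD 301 guideline the BOD is expressed as a percentage of the theoretical oxygen demand (ThOD) before it is compared with the pass level. ThOD does not appear on the poster, and all numbers and function names below are invented for illustration.

```python
def bod_mg_per_mg(o2_test_mg: float, o2_blank_mg: float, substance_mg: float) -> float:
    """BOD in mg O2 per mg test substance (blank-corrected oxygen uptake)."""
    return (o2_test_mg - o2_blank_mg) / substance_mg


def percent_degradation(bod: float, thod: float) -> float:
    """BOD as a percentage of the theoretical oxygen demand (same units)."""
    return 100.0 * bod / thod


def is_ready_biodegradable(pct: float, threshold: float = 60.0) -> bool:
    """Apply the 60% BOD classification threshold from the text."""
    return pct >= threshold


# Invented measurements: 52 mg O2 uptake with substance, 7 mg in the blank,
# 30 mg of substance in the vessel, ThOD of 2.0 mg O2 per mg substance
bod = bod_mg_per_mg(o2_test_mg=52.0, o2_blank_mg=7.0, substance_mg=30.0)
pct = percent_degradation(bod, thod=2.0)
print(bod, pct, is_ready_biodegradable(pct))
```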
Table1: Selected best models using GA-KNN
Table2: Selected best models using GA-SVM
Table3: Selected best models using GA-PLSDA
Fig1: Distribution of BOD values in ready biodegradable compounds.
Fig2: Multidimensional scaling plot
The number of nearest neighbors (K) was optimized during the GA calculations to
reach the lowest cross-validation error rate (ER cv). The most frequently selected descriptors
are: Kier benzene-likeliness index (BLI), nb. of atoms of type 'sssN', sum of 'dssC' E-states,
nb. of substituted benzene C(sp2) and nb. of ring tertiary C(sp3).
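A K-nearest-neighbor classifier with the two distance measures that appear in the KNN table (Euclidean and city block) can be sketched as follows. This is a generic illustration rather than the Matlab implementation used for the poster; the function names, class labels ("RB", "NRB") and toy data are placeholders, and the tie-breaking rule is an assumption.

```python
from collections import Counter
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def city_block(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_x: List[Vector], train_y: List[str], query: Vector, k: int,
                distance: Callable[[Vector, Vector], float] = euclidean) -> str:
    """Majority vote among the k training points closest to the query.
    With an even k a tie is possible; here it goes to the class whose
    member is nearest to the query (an assumed rule)."""
    order = sorted(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy two-descriptor training set
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["RB", "RB", "RB", "NRB", "NRB", "NRB"]

pred_a = knn_predict(train_x, train_y, (0.5, 0.5), k=3, distance=euclidean)
pred_b = knn_predict(train_x, train_y, (5.5, 5.5), k=3, distance=city_block)
print(pred_a, pred_b)
```

In practice the descriptors would usually be scaled before distances are computed; the poster does not state which scaling was used.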
The number of PLSDA latent variables (LVs) was optimized during the GA calculations to reach
the lowest cross-validation error rate (ER cv). The most frequently selected descriptors are: R-CX-R, nb. of
atoms of type 'sssN', spectral mean absolute deviation from Laplace matrix, presence of C-Cl at
topological distance 1, eccentricity, nb. of N atoms, nb. of (thio-)carbamates (aliphatic), average
Randic index from Burden matrix weighted by mass and Cl attached to C1(sp3).
The SVM results were obtained using the LIBSVM 3.1 C library
compiled in Matlab [2]. The kernel used was the radial basis function (RBF),
with the default parameters defined in the library. The most frequently selected
descriptors are: average MW, nb. of terminal primary C(sp3), mean
first ionisation potential, nb. of N atoms, sum of 'aasC' E-states, nb. of
heteroatoms, nb. of esters (aromatic), intrinsic state
pseudoconnectivity index and frequency of C-P at topological distance 2.
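For reference, the radial basis function kernel has the form K(x, y) = exp(-gamma * ||x - y||^2). LIBSVM's documented defaults are gamma = 1/(number of features) and cost C = 1. The sketch below only evaluates the kernel with that default gamma for a model with 20 descriptors (as in SVM_1); it does not train an SVM, and the function name and input vectors are invented for illustration.

```python
from math import exp
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


n_features = 20             # e.g. the number of descriptors in SVM_1
gamma = 1.0 / n_features    # LIBSVM default gamma
C = 1.0                     # LIBSVM default cost (listed for completeness, unused here)

# Invented descriptor vectors
x = [0.0] * n_features
y = [1.0] * n_features      # squared distance to x is 20

k_same = rbf_kernel(x, x, gamma)  # identical points give the maximum value
k_diff = rbf_kernel(x, y, gamma)  # equals exp(-1) for these inputs
print(k_same, k_diff)
```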