U.S. Environmental Protection Agency
Office of Research and Development
Techniques used:
Kohonen self-organizing maps (SOM):
A machine learning technique that employs artificial neural
networks (ANN) to collapse the data samples into a number of
clusters (winning neurons) in a two-dimensional space.
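The idea of a winning neuron can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used for the poster: the grid size, the learning rate, the neighbourhood radius and the function names (`make_som`, `winning_neuron`, `train_step`, `train`) are all assumptions, and the learning rate and radius are kept constant for brevity, whereas practical SOMs usually shrink both during training.

```python
import math
import random

def make_som(rows, cols, dim, seed=0):
    """Grid of neurons, each holding a random weight vector of length `dim`."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]

def winning_neuron(som, x):
    """Grid coordinates of the neuron whose weights are closest to sample x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(som):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_step(som, x, lr=0.5, radius=1.0):
    """Pull the winner and its grid neighbours towards x (Gaussian neighbourhood)."""
    br, bc = winning_neuron(som, x)
    for r, row in enumerate(som):
        for c, w in enumerate(row):
            grid_d2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-grid_d2 / (2 * radius ** 2))
            for i in range(len(w)):
                w[i] += lr * h * (x[i] - w[i])

def train(som, samples, epochs=20):
    """Present every sample once per epoch."""
    for _ in range(epochs):
        for x in samples:
            train_step(som, x)

# Hypothetical two-descriptor samples forming two loose groups.
samples = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.95], [0.85, 0.9]]
som = make_som(3, 3, dim=2)
train(som, samples)
print([winning_neuron(som, x) for x in samples])
```

Each sample ends up assigned to one grid cell, so molecules that share a cell (or neighbouring cells) can be read as one cluster on the 2D map.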
Partial least squares discriminant analysis (PLSDA):
PLSDA combines partial least squares regression (PLS2 algorithm)
with the discrimination power of a classification method. It finds
fundamental relations between the matrix of descriptors and
the class vector by calculating latent variables (LVs), which
are orthogonal linear combinations of the original variables.
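As a sketch of the latent-variable idea, the standard PLS decomposition can be written as below. The notation is not from the poster and is assumed here: X is the descriptor matrix (molecules by descriptors), Y the dummy-coded class matrix, T the score matrix whose columns are the LVs, W* the weight matrix, P and Q the loading matrices, and E and F the residuals. The class-assignment rule on the last line is one common choice, not necessarily the one used in this work.

```latex
\begin{aligned}
T &= X W^{*} \\
X &= T P^{\mathsf{T}} + E \\
Y &= T Q^{\mathsf{T}} + F \\
\hat{Y} &= T Q^{\mathsf{T}}, \qquad \hat{c}_i = \arg\max_{c} \hat{Y}_{ic}
\end{aligned}
```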
K-nearest neighbors (kNN):
A molecule is assigned to the class held by the majority of
its k nearest neighbors, i.e. the k closest molecules in the
training set. The Euclidean metric was used to measure
distances between molecules in the descriptor space.
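A minimal sketch of this rule in plain Python is shown below. The function names and the toy data are hypothetical; ties are broken by whichever class was seen first among the neighbours, which is an arbitrary choice made for this illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training molecules."""
    # Rank training molecules by distance to the query and keep the k closest.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-descriptor toy data: 1 = active, 0 = inactive.
train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, [0.95, 1.0], k=3))   # -> 1
```

Because the vote depends on raw distances, descriptors on very different scales would normally be scaled first; the poster does not say which scaling was applied.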
Support vector machines (SVM):
SVMs define a decision boundary that optimally separates two
classes by maximizing the margin between them. The
decision boundary can be described as a hyperplane that is
expressed in terms of a linear combination of functions
parameterized by support vectors, which consist of a subset
of training molecules.
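The "linear combination of functions parameterized by support vectors" can be made concrete with a small sketch of the decision function of an already trained SVM. Everything below is illustrative: the RBF kernel is an assumption (the poster does not name the kernel), and the support vectors, coefficients and bias are invented values rather than a fitted model.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; assumed here, the poster does not state the kernel."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def svm_decision(x, support_vectors, dual_coefs, bias, kernel=rbf_kernel):
    """f(x) = sum_i coef_i * K(sv_i, x) + bias, with coef_i = alpha_i * y_i."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias

def svm_predict(x, support_vectors, dual_coefs, bias):
    """Assign class +1 (active) or -1 (inactive) from the sign of f(x)."""
    return 1 if svm_decision(x, support_vectors, dual_coefs, bias) >= 0 else -1

# Hypothetical trained model: two support vectors, one per class.
svs = [[0.0, 0.0], [1.0, 1.0]]
coefs = [-1.0, 1.0]      # alpha_i * y_i, invented for illustration
b = 0.0
print(svm_predict([0.9, 0.8], svs, coefs, b))   # -> 1
print(svm_predict([0.1, 0.2], svs, coefs, b))   # -> -1
```

Only the support vectors enter the sum, which is why the fitted boundary depends on a subset of the training molecules rather than on all of them.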
Variable selection using genetic algorithms (GA):
GA is a nature-inspired technique applied here to find the
optimal subset of molecular descriptors. It starts from an
initial random population of chromosomes, each encoding
which molecular descriptors are present. An evolutionary
process is simulated to optimize a defined fitness function:
new chromosomes are obtained by coupling those of the
population through genetic operations (crossover and
mutation). This procedure was repeated for n runs.
Subsequently, a forward selection was performed based on
the descriptors most frequently included during the n runs
(Fig. 6).
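The procedure can be sketched as a toy GA in plain Python. This is an illustration under stated assumptions, not the authors' code: the fitness function is a stand-in that rewards three hypothetical "informative" descriptors (in the poster the fitness is a cross-validated NER), the population size, mutation rate, selection scheme and single-point crossover are invented, and only 20 descriptors are used instead of 470.

```python
import random
from collections import Counter

N_DESCRIPTORS = 20          # toy value; the poster starts from 470 descriptors
INFORMATIVE = {2, 5, 11}    # hypothetical useful descriptors for the toy fitness

def fitness(chrom):
    """Toy stand-in for a cross-validated NER: reward informative bits, penalise size."""
    hits = sum(1 for i in INFORMATIVE if chrom[i])
    return hits - 0.05 * sum(chrom)

def crossover(a, b, rng):
    """Single-point crossover of two parent chromosomes."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chrom]

def ga_run(rng, pop_size=30, generations=100):
    """One GA run; returns the best chromosome of the final population."""
    pop = [[1 if rng.random() < 0.2 else 0 for _ in range(N_DESCRIPTORS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]        # the fitter half survives unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

def selection_frequency(n_runs=50, seed=0):
    """Count how often each descriptor appears in the best chromosome of each run."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        best = ga_run(rng)
        counts.update(i for i, g in enumerate(best) if g)
    return counts

freq = selection_frequency()
ranked = [i for i, _ in freq.most_common()]   # most frequently selected first
print(ranked[:5])
```

The ranked list is what a subsequent forward selection would walk through, adding descriptors one at a time in order of selection frequency and keeping each addition only if the validated fitness improves.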
In this work, feature selection was performed in two steps.
First, GA was launched for 50 runs of 100 evolutions each
on the full set of 470 descriptors. Then, another 50 runs
were performed on the subset of descriptors most frequently
selected in the first step, which improved the results
(Fig. 5).
Goodness of fit measure and validation methods:
5-fold cross-validation was coupled with GA in order to
validate the models and avoid overfitting (chance
correlation). It consists of repeatedly fitting a model on
80% of the data and predicting the remaining 20%.
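A minimal sketch of the 80/20 splitting is given below, using the 1005 training compounds mentioned on the poster as the sample count. The function name and the plain random shuffling are assumptions; with few actives per endpoint, a stratified split that keeps the active/inactive ratio in every fold may be preferable, but the poster does not say which variant was used.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]     # k disjoint, nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

# Each fold holds out 20% of the compounds and fits on the other 80%.
for train, test in k_fold_indices(1005):
    print(len(train), len(test))   # prints "804 201" five times
```

Every compound is predicted exactly once, by a model that never saw it during fitting.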
The fitness function was the non-error rate (NER), also
called balanced accuracy, which is the mean of the
sensitivity (Sn) and the specificity (Sp).
In-silico structure activity relationship study of toxicity endpoints by QSAR modeling
Kamel Mansouri1,2, Nisha Sipes1,2 and Richard Judson2
1Oak Ridge Institute for Science and Education, Oak Ridge, TN, USA
2National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, RTP, NC, USA
Introduction
Several thousand chemicals have been tested in hundreds of toxicity-related in vitro high-throughput screening (HTS)
bioassays through the EPA’s ToxCast and Tox21 projects. However, this chemical set covers only a portion of the chemical
space of interest for environmental risk assessment, leading to a need to fill data gaps with other methods. A cost-effective
and reliable approach to this task is to build quantitative structure-activity relationships (QSARs).
In this work, a subset of 1877 chemicals from ToxCast was used to build QSAR models. These models will be applied
to predict values for multiple ToxCast assays in a larger environmental database of ~30K chemical structures.
Based on a clustering study by Sipes et al. (2013), the initial molecular targets of this effort consisted of a set of 18
NovaScreen G-protein coupled receptor (GPCR) assays. These assays are part of the aminergic category, which showed the
highest number of actives within the ToxCast portfolio. Classification methods including SOM, SVM, PLSDA and kNN were
tested. These methods were coupled with variable selection techniques, such as genetic algorithms, that select
the most representative molecular descriptors based on statistical fitness functions. The resulting models were
validated and their predictive ability measured. The models that showed good results will be applied within the limits of
their chemical space, as defined by the applicability domain.
Results
Kamel Mansouri | mansouri.kamel@epa.gov | 919-541-0545
Data preparation
Conclusions and future work
• The QSAR models showed good results, especially with the PLSDA method.
• Except for SVM, the fitting and 5-fold CV statistics of the models are similar, indicating little overfitting.
• Modeling all aminergic category assays together showed acceptable accuracy. This is a good approximation for
such similar assays. With SVM, the NER increased relative to the single model for hH1 (5-fold CV: 75.9 vs. 68.5).
• Future work:
  • Apply the same procedure to all 18 assays and build regression models to predict AC50 values.
  • Link the molecular descriptors selected in each model to the predicted biological activity.
  • Develop consensus models using the predictions of all the tested methods to increase accuracy.
  • Employ multi-criteria decision making (MCDM) techniques to improve and simplify the models.
Fig 5: The frequency of selection for each descriptor after 50 GA
runs (A) and the NER in 5-fold CV for each GA run (B) for hH1.
Disclaimer: The views expressed in this poster are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency.
Molecular descriptors calculation:
A total of 1022 molecular descriptors were calculated from the 2D chemical structures.
The tools used for descriptor calculation were Indigo, RDKit, CDK and MOE.
To reduce collinearity, a correlation filter with a threshold of 0.96 was applied. Descriptors that were
constant or near-constant, or that had at least one missing value, were removed. The remaining reduced set
consisted of 470 descriptors.
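The filtering step can be sketched as follows in plain Python. This is an illustration, not the workflow used for the poster: the column names and values are invented, near-constant descriptors are not handled because the poster gives no cutoff for them, and keeping the first descriptor of each highly correlated pair is an arbitrary choice.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def filter_descriptors(columns, threshold=0.96):
    """Drop columns with missing values or zero variance, then drop one column
    from every pair whose absolute correlation exceeds `threshold`."""
    kept = {}
    for name, values in columns.items():
        if any(v is None for v in values):
            continue                      # at least one missing value
        if max(values) == min(values):
            continue                      # constant descriptor
        if any(abs(pearson(values, other)) > threshold for other in kept.values()):
            continue                      # collinear with a descriptor already kept
        kept[name] = values
    return list(kept)

# Hypothetical descriptor table (column name -> values for five molecules).
columns = {
    "MW":      [100.0, 150.0, 200.0, 250.0, 300.0],
    "MW_copy": [101.0, 149.0, 202.0, 251.0, 299.0],   # nearly identical to MW
    "logP":    [1.2, 3.4, 0.5, 2.8, 1.9],
    "nRings":  [1, 1, 1, 1, 1],                       # constant
    "TPSA":    [20.0, None, 35.0, 40.0, 55.0],        # missing value
}
print(filter_descriptors(columns))   # -> ['MW', 'logP']
```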
Chemical structure curation:
The initial dataset considered for this study consisted of 1877 chemicals. In order to prepare a consistent QSAR-ready
dataset, the chemical structures were curated using a KNIME workflow developed for this purpose.
The main steps of this workflow were:
• Check the validity of the molecular file format and retrieve any missing structures
• Remove the inorganic and metallo-organic structures
• Remove salts and counterions and fill valences
• Convert stereoisomers and tautomers into a unique form to reduce redundancy
• Remove any duplicates
• Calculate descriptors
The resulting SDF file contained 1783 curated 2D chemical
structures. Only the 1005 compounds tested for all 18 endpoints were
considered in the training set. After selecting the best QSAR model
to be applied, the 778 entries with missing values will be filled
with predictions. A new vector flagging molecules that are active in at
least one of the 18 assays was created to model the whole group.
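The combined "active in at least one assay" vector can be sketched as below. The function name and the activity values are hypothetical; the real matrix would have one row per training compound and 18 assay columns.

```python
def any_active(activity_rows):
    """Collapse a molecules x assays 0/1 matrix into one
    'active in at least one assay' label per molecule."""
    return [1 if any(row) else 0 for row in activity_rows]

# Hypothetical activity calls for four molecules across three assays.
activity = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
print(any_active(activity))   # -> [0, 1, 1, 0]
```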
$$Sn \equiv TPR = \frac{TP}{TP + FN}$$
$$Sp \equiv TNR = \frac{TN}{TN + FP}$$
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
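These measures, together with the poster's NER = (Sn + Sp)/2, can be computed from confusion-matrix counts as in the sketch below. The counts are invented; only the class sizes (37 actives out of 1005 compounds, as for hH1) are taken from the poster. The example illustrates why NER is a stricter fitness function than accuracy when actives are rare.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and NER (balanced accuracy) as fractions."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    ner = (sn + sp) / 2
    return {"Sn": sn, "Sp": sp, "Accuracy": acc, "NER": ner}

# Hypothetical confusion matrix: 37 actives, 968 inactives (counts are invented).
m = classification_metrics(tp=20, tn=960, fp=8, fn=17)
print({k: round(v, 3) for k, v in m.items()})
```

With these counts the accuracy stays high because the many inactives dominate it, while the NER is pulled down by the missed actives.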
Fig 6: The final forward selection based on the frequency
of selection for each descriptor after 50 GA runs for hH1.
Endpoint (positives)  Method  Dsc  Parameter  Fitting NER  Fitting Ac  5-fold CV NER  5-fold CV Ac
hH1 (37)              PLSDA   26   LVs = 5    91.9         89.4        91.5           88.6
hH1 (37)              kNN     27   k = 1      81.1         96.2        81.2           96.3
hH1 (37)              SVM     30   C = 5      87.8         99.1        68.5           96.9
hM1-5 (76)            PLSDA   15   LVs = 3    84.1         82.9        85.7           83.6
hM1-5 (76)            kNN     19   k = 4      76.7         93.8        78.8           94.3
hM1-5 (76)            SVM     12   C = 10     94.7         99.2        76.6           94.7
All (115)             PLSDA   20   LVs = 5    84.0         85.8        82.3           84.1
All (115)             kNN     22   k = 1      76.8         90.3        78.1           90.1
All (115)             SVM     16   C = 10     94.3         98.7        75.9           92.1
The modeling procedure
Dsc: Number of descriptors; LVs: Number of latent variables; Ac: Accuracy; k: Number of nearest neighbors; C: Cost (SVM’s degree
of fitting); CV: Cross-validation.
Fig 4: ROC curves of the SOM map in fitting
for active and non-active compounds.
             NER     Ac
Fitting      81.79   95.4
5-fold CV    66.8    89.5
Abstract ID: 2273u
EPA @ SOT
Assay                       Symbol     Name
hM1-5                       CHRM1-5    cholinergic receptor, muscarinic 1-5
gMPeripheral_NonSelective   M1         Muscarinic receptor peripheral
hAdrb2                      ADRB2      adrenergic, beta-2-, receptor, surface
bDR_NonSelective            DRD1       Dopamine receptor D1
h5HT2A                      HTR2A      5-hydroxytryptamine (serotonin) receptor 2A
rAdra1A,B                   Adra1a,b   adrenergic, alpha-1A-B, receptor
rAdra1_NonSelective         Adra1a     adrenergic, alpha-1A-, receptor
hH1                         HRH1       Histamine receptor H1
gH2                         Hrh2       Guinea pig histamine receptor H2
rAdra2_NonSelective         Adra2a     adrenergic, alpha-2A-, receptor
hAdra2A                     ADRA2A     adrenergic, alpha-2A-, receptor
rmAdra2B                    Adra2b     adrenergic, alpha-2B-, receptor
Fig 1: KNIME workflow for chemical structure curation
Fig 2: Heatmap of the clustered 18 assays
$$NER\% = \frac{Sn + Sp}{2}$$
Fig 3: Supervised-learning SOM map of all active
compounds for the 18 assays with the full set of
470 descriptors.

Presented at SOT 2014, Phoenix, AZ, March 23-27, 2014.

ESR spectroscopy in liquid food and beverages.pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 

In-silico structure activity relationship study of toxicity endpoints by QSAR modeling. Presented at SOT 2014, Phoenix, AZ, March 23 - 27, 2014.

  • 1. U.S. Environmental Protection Agency, Office of Research and Development

Techniques used:

Kohonen self-organizing maps (SOM):
A machine learning technique that uses artificial neural networks (ANN) to collapse the data samples into a number of clusters (winning neurons) in a two-dimensional space.

Partial least squares discriminant analysis (PLSDA):
PLSDA combines partial least squares regression (PLS2-based) with the discrimination power of a classification method. It finds fundamental relations between the matrix of descriptors and the class vector by calculating latent variables (LVs), which are orthogonal linear combinations of the original variables.

K-nearest neighbors (kNN):
A molecule is assigned to the class held by the majority of its k nearest neighbors. The Euclidean metric was used to measure distances between molecules in the descriptor space.

Support vector machines (SVM):
SVMs define a decision boundary that optimally separates two classes by maximizing the margin (distance) between them. The decision boundary can be described as a hyperplane expressed as a linear combination of functions parameterized by support vectors, which are a subset of the training molecules.

Variable selection using genetic algorithms (GA):
GA is a nature-inspired technique applied to find the optimal subset of molecular descriptors. It starts from an initial random population of chromosomes (each representing a subset of molecular descriptors). An evolutionary process is simulated to optimize a defined fitness function. New chromosomes are obtained by combining those of the initial population through genetic operations (crossover and mutation). This procedure was repeated for n runs. Subsequently, a forward selection was performed based on the descriptors most frequently included during the n runs (Fig. 6).

In this work, the feature selection procedure was performed in two steps.
First, the GA was run 50 times, with 100 evolutions per run, on the full set of 470 descriptors. Then another 50 runs were performed on the subset of descriptors selected most often during the first step, which improved the results (Fig. 5).

Goodness of fit measure and validation methods:
5-fold cross-validation was coupled with the GA to validate the models and avoid overfitting (by-chance correlation). It consists of repeatedly fitting a model on 80% of the data and predicting the remaining 20%. The fitness function was the non-error rate (NER), also called balanced accuracy, which is computed from the sensitivity (Sn) and the specificity (Sp).

In-silico structure activity relationship study of toxicity endpoints by QSAR modeling
Kamel Mansouri (1,2), Nisha Sipes (1,2) and Richard Judson (2)
(1) Oak Ridge Institute for Science and Education, Oak Ridge, TN, USA
(2) National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, RTP, NC, USA

Introduction

Several thousand chemicals were tested in hundreds of toxicity-related in-vitro high-throughput screening (HTS) bioassays through the EPA's ToxCast and Tox21 projects. However, this chemical set covers only a portion of the chemical space of interest for environmental risk assessment, so the data gaps need to be filled with other methods. A cost-effective and reliable way to do this is to build quantitative structure-activity relationships (QSARs). In this work, a subset of 1877 chemicals from ToxCast was used to build QSAR models. These models will be applied to predict values for multiple ToxCast assays in a larger environmental database of ~30K chemical structures. Based on a clustering study by Sipes et al. (2013), the initial molecular targets of this effort were a set of 18 NovaScreen G-protein coupled receptor (GPCR) assays. These assays belong to the aminergic category, which showed the highest number of actives within the ToxCast portfolio.
Classification methods including SOM, SVM, PLSDA and kNN were tested. These methods were coupled with variable selection techniques, such as genetic algorithms, to select the most representative molecular descriptors based on statistical fitness functions. The resulting models were validated and their predictive ability was measured. The models that showed good results will be applied within the limits of their chemical space, as defined by the applicability domain.

Results

Fig 5: The frequency of selection for each descriptor after 50 GA runs (A) and the NER in 5-fold CV for each GA run (B) for hH1.

Conclusions and future work
• The QSAR models showed good results, especially with the PLSDA method.
• Except for SVM, the model statistics for fitting and 5-fold CV are balanced, indicating low overfitting.
• Modeling all aminergic category assays together showed acceptable accuracy. This is a good approximation for such similar assays. The accuracy increased with SVM compared to the single model for hH1.
• Future work:
  • Apply the same procedure to all 18 assays and build regression models to predict AC50 values.
  • Link the molecular descriptors selected in each model to the predicted biological activity.
  • Develop consensus models using the predictions of all the tested methods to increase accuracy.
  • Employ multi-criteria decision making (MCDM) techniques to improve and simplify the models.

Kamel Mansouri | mansouri.kamel@epa.gov | 919-541-0545

Disclaimer: The views expressed in this poster are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency.

Data preparation

Molecular descriptors calculation:
A total of 1022 molecular descriptors were calculated from the 2D chemical structures. The tools used for descriptor calculation were Indigo, RDKit, CDK and MOE.
To reduce collinearity, a correlation filter with a threshold of 0.96 was applied, and descriptors that were constant, near constant, or had at least one missing value were removed. The remaining set consisted of 470 descriptors.

Chemical structure curation:
The initial dataset considered for this study consisted of 1877 chemicals. To prepare a consistent QSAR-ready dataset, the chemical structures were curated using a KNIME workflow developed for this purpose. The main steps of this workflow were:
• Check the validity of the molecular file format and retrieve any missing structures
• Remove the inorganic and metallo-organic structures
• Remove salts and counterions and fill valences
• Convert stereoisomers and tautomers into a unique form to reduce redundancy
• Remove any duplicates
• Calculate descriptors

The resulting SDF file contained 1783 curated 2D chemical structures. Only the 1005 compounds tested for all 18 endpoints were considered in the training set. After the best QSAR model has been selected, the 778 entries with missing values will be filled with predictions. A new class vector marking the molecules that are active in at least one of the 18 assays was created to model the whole group.

Sensitivity (Sn), specificity (Sp) and accuracy are defined as:

$$Sn \equiv TPR = \frac{TP}{TP + FN}$$

$$Sp \equiv TNR = \frac{TN}{TN + FP}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

Fig 6: The final forward selection based on the frequency of selection for each descriptor after 50 GA runs for hH1.
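The descriptor filtering described under "Molecular descriptors calculation" could be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' workflow: the function names, the use of `None` to mark a missing value, and the greedy rule (keep the first descriptor of a highly correlated pair) are assumptions, and the "near constant" criterion is left out because the poster does not define it.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def filter_descriptors(columns, threshold=0.96):
    """Return the names of the descriptors that survive the filter.

    columns: dict mapping a descriptor name to its list of values,
             one value per molecule (None marks a missing value).
    threshold: absolute correlation above which a descriptor is dropped
               (0.96 in the poster; the strict ">" is an assumption).
    """
    kept = []
    for name, values in columns.items():
        if any(v is None for v in values):
            continue  # at least one missing value
        if len(set(values)) <= 1:
            continue  # constant descriptor
        # Assumed greedy rule: drop a descriptor that is too strongly
        # correlated with one that has already been kept.
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            continue
        kept.append(name)
    return kept


# Hypothetical toy data: five descriptors for four molecules.
toy = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.1],     # almost perfectly correlated with "a"
    "c": [1, 1, 1, 1],       # constant
    "d": [1, None, 2, 3],    # has a missing value
    "e": [4, 1, 3, 2],       # weakly correlated with "a"
}
print(filter_descriptors(toy))
```

With the toy data, only "a" and "e" remain. Which member of a correlated pair is kept depends on the column order, so a real workflow may prefer a different tie-breaking rule.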
Endpoint (Positives)  Method  Dsc  Parameter  Fitting NER  Fitting Ac  5-fold CV NER  5-fold CV Ac
hH1 (37)              PLSDA   26   LVs = 5    91.9         89.4        91.5           88.6
hH1 (37)              kNN     27   k = 1      81.1         96.2        81.2           96.3
hH1 (37)              SVM     30   C = 5      87.8         99.1        68.5           96.9
hM1-5 (76)            PLSDA   15   LVs = 3    84.1         82.9        85.7           83.6
hM1-5 (76)            kNN     19   k = 4      76.7         93.8        78.8           94.3
hM1-5 (76)            SVM     12   C = 10     94.7         99.2        76.6           94.7
All (115)             PLSDA   20   LVs = 5    84.0         85.8        82.3           84.1
All (115)             kNN     22   k = 1      76.8         90.3        78.1           90.1
All (115)             SVM     16   C = 10     94.3         98.7        75.9           92.1

Dsc: Number of descriptors; LVs: Latent variables; Ac: Accuracy; k: Number of nearest neighbors; C: Cost, SVM's degree of fitting; CV: Cross-validation.

Fig 4: ROC curves of the SOM map in fitting for active and non-active compounds.

            NER    Ac
Fitting     81.79  95.4
5-fold CV   66.8   89.5

EPA @ SOT, Abstract ID: 2273u

Assay                      Symbol    Name
hM1-5                      CHRM1-5   cholinergic receptor, muscarinic 1-5
gMPeripheral_NonSelective  M1        Muscarinic receptor peripheral
hAdrb2                     ADRB2     adrenergic, beta-2-, receptor, surface
bDR_NonSelective           DRD1      Dopamine receptor D1
h5HT2A                     HTR2A     5-hydroxytryptamine (serotonin) receptor 2A
rAdra1A,B                  Adra1a,b  adrenergic, alpha-1A-B, receptor
rAdra1_NonSelective        Adra1a    adrenergic, alpha-1A-, receptor
hH1                        HRH1      Histamine receptor H1
gH2                        Hrh2      Guinea pig histamine receptor H2
rAdra2_NonSelective        Adra2a    adrenergic, alpha-2A-, receptor
hAdra2A                    ADRA2A    adrenergic, alpha-2A-, receptor
rmAdra2B                   Adra2b    adrenergic, alpha-2B-, receptor

Fig 1: KNIME workflow for chemical structure curation
Fig 2: Heatmap of the clustered 18 assays
Fig 3: Supervised-learning SOM map of all active compounds for the 18 assays with the full set of 470 descriptors.

The modeling procedure

$$NER\% = \frac{Sn + Sp}{2}$$
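To make the NER definition and the validation scheme concrete, the sketch below computes Sn, Sp, NER and accuracy from binary labels and runs a 5-fold cross-validation of a Euclidean kNN classifier. It is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, ties in the kNN vote go to the inactive class, and the statistics are computed on the pooled out-of-fold predictions (the poster does not say whether folds were pooled or averaged). Values are returned as fractions rather than percentages.

```python
import random
from math import dist


def metrics(y_true, y_pred):
    """Sn, Sp, NER and accuracy for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sn = tp / (tp + fn)  # sensitivity, true positive rate
    sp = tn / (tn + fp)  # specificity, true negative rate
    return {"Sn": sn, "Sp": sp, "NER": (sn + sp) / 2,
            "Ac": (tp + tn) / (tp + tn + fp + fn)}


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean)."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0  # assumption: a tie counts as inactive


def cross_validated_metrics(x, y, k=1, n_folds=5, seed=0):
    """Predict every sample from a model fitted on the other folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    y_pred = [None] * len(x)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_x = [x[i] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            y_pred[i] = knn_predict(train_x, train_y, x[i], k)
    return metrics(y, y_pred)


# Hypothetical toy data: two well-separated clusters in a 2D descriptor space.
toy_x = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
         (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
toy_y = [0] * 5 + [1] * 5
print(cross_validated_metrics(toy_x, toy_y, k=1))

# A classifier that calls everything inactive, applied to a set with
# 37 actives among 1005 compounds (the hH1 class balance in the poster).
print(metrics([1] * 37 + [0] * 968, [0] * 1005))
```

The second call shows how the two measures diverge on an unbalanced endpoint: predicting every compound as inactive gives an accuracy of about 96.3% but an NER of only 50%, which is one reason a balanced measure is a reasonable fitness function for endpoints with few actives.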