SlideShare a Scribd company logo
1 of 47
Download to read offline
An examination of the impact
of data quality on QSAR
Modeling in regards to the
environmental sciences
Antony Williams & Kamel Mansouri
National Center for Computational Toxicology
April 12, 2016
The views expressed in this presentation are those of the author and do not necessarily reflect
the views or policies of the U.S. EPA
Office of Research and
Development
Recent Cheminformatics
development at NCCT
• We have been building a new cheminformatics architecture
• The first “app” provides PUBLIC access to curated chemistry
• Focus on integrating agency and external resources
• Aggregating and curating data, visualization elements and
“services” to underpin other internal efforts including:
• Read-across
• Predictive modeling
• Non-targeted screening
1
Office of Research and
Development
DSSTox (our quality content)
DSSTox_v2
PublicCurated
5. Low
2. Low
6. Untrusted
1. High
3. High
4. Med
7. Incomplete
QC Level Totals (12Jun2015)
validated
Public_Untrusted
Public_Low
Public_Medium
Public_High
DSSTox_Low
DSSTox_High 4535
16K
33K
101K
584K
~ 310K pending
~ 150K pending
Office of Research and
Development
Chemistry Software Architecture
Office of Research and
Development
The iCSS Chemistry Dashboard
at https://comptox.epa.gov
4
Office of Research and
Development
The iCSS Chemistry Dashboard
5
Office of Research and
Development
The iCSS Chemistry Dashboard
6
Office of Research and
Development
The iCSS Chemistry Dashboard
7
Office of Research and
Development
The iCSS Chemistry Dashboard
8
Office of Research and
Development
Simple Link Integration
Office of Research and
Development
Data sharing and access
Office of Research and
Development
Expt’l and Predicted PhysChem Data
11
Office of Research and
Development
Developing “NCCT Models”
• Interest in physicochemical properties to include
in exposure modeling, augmenting with Toxcast
data etc.
• Our approach to modeling:
– Obtain high quality training sets
– Apply appropriate modeling approaches
(modern machine learning)
– Validate performance of models and use to
predict properties across our full datasets
• Initiated work with available “EPI Suite” data
Office of Research and
Development
EPI Suite Properties
• Water solubility
• Melting Point
• Boiling Point
• LogP (KOWWIN: Octanol-water partition coefficient)
• Atmospheric Hydroxylation Rate
• LogBCF (Bioconcentration Factor)
• Biodegradation Half-life
• Ready biodegradability
• Henry's Law Constant
• Fish Biotransformation Half-life
• LogKOA (Octanol/Air Partition Coefficient)
• LogKOC (Soil Adsorption Coefficient)
• Vapor Pressure
Office of Research and
Development
Looking at EPISuite KOWWIN:
2447 Training Compounds
• KOWWIN model was trained
on 2447 chemicals
• Available experimental
dataset is 15,809 chemicals
• Previous experience of the
data necessitated review
Office of Research and
Development
Predicted vs. Experimental
considering >15k chemicals
Office of Research and
Development
Data Files
• The data files have FOUR representations of a chemical
plus the property value.
• CAS Number: 80-05-7
• Name: Bisphenol A
• SMILES: CC(C)(C1=CC=C(O)C=C1)C1=CC=C(O)C=C1
• Molfile:
Office of Research and
Development
The Approach
• To build models we need the set of chemicals and their
property series
• Our curation process
– Decide on the “chemical” by checking levels of consistency
– We did NOT validate each measured property value
– Perform initial analysis manually to understand how to
clean the data
– Automate the process (and test iteratively)
– Feed through all datasets
Office of Research and
Development
General Observations
• MANY mismatches across datasets – comparing structure
molfile, SMILES, Names and CAS Numbers
• Duplicate pairs for chemicals – same structures but different
values for properties. Some wildly different.
• Some SMILES cannot convert – what software was used??
• Invalid CAS Numbers (can be validated with a CheckSum)
• Some chemical names are neither systematic nor trivial.
Totally ambiguous and truncated
• “Stereochemistry” representations are not appropriate for
computers and commonly not even included
Office of Research and
Development
Truncated and obscure names
Office of Research and
Development
“Duplicate” Structures
Office of Research and
Development
Large Differences in Values?
6380-24-1 93-15-2
Office of Research and
Development
Salts
Office of Research and
Development
Some statistics: 15,809 chemicals
(KOWWIN dataset)
• CAS Checksum: 12163 valid, 3646 invalid (>23%)
• Invalid names: 555
• Invalid SMILES 133
• Valence errors: 322 Molfile, 3782 SMILES (>24%)
• Duplicates check:
–31 DUPLICATES MOLFILE
–626 DUPLICATES SMILES
–531 DUPLICATES NAMES
• SMILES vs. Molfiles (structure check)
–1279 differ in stereochemistry (~8%)
–362 “Covalent Halogens”
–191 differ as tautomers
–436 are different compounds (~3%)
Office of Research and
Development
Quality FLAGS
• 4 Star ENHANCED Stereochemistry – chemical name,
CAS Number, Molfile and SMILES all consistent PLUS
enhanced with correct stereochemistry
Office of Research and
Development
Quality FLAGS
• 4 Star – chemical name, CAS Number, Molfile and
SMILES all consistent. Stereo ignored.
• 3 Star ENHANCED Stereochemistry – consistency
between 3 of 4: CAS Number (DSSTox lookup), Molfile,
SMILES, Tautomer - PLUS enhanced with correct
stereochemistry
• 3 Star– consistency between 3 of 4: CAS Number, Molfile,
SMILES, Tautomer - stereo ignored
Office of Research and
Development
Quality FLAGS
Treated as “Junk Data” in terms of using as training data
• 2 Star PLUS – consistency between 2 of 4: chemical name,
CAS Number (DSSTox lookup), Molfile, SMILES, Tautomer.
Molfile is changed to be consistent if necessary
• 2 Star – only two fields internally consistent
• 1 star – what’s left
• AUTOMATION procedures used across all data – “KNIME”
pipelining software for processing SDF files
Office of Research and
Development
Cleaning the data
27
• Models built on “QSAR ready” forms of the curated structures –
standardize tautomers, remove stereochemistry, no salts
• Compare 3 STAR and BETTER models with entire dataset
• Exclusion of 1 and 2 star data removed SOME “outliers” only
• Definite cluster not being modeled appropriately - “Applicability
domain” – initial model based on ~2500 chemicals
Office of Research and
Development
How did it get this way?
• Data was assembled over many years (I assume), by
various people and likely heterogeneous processes
• Cheminformatics tools of today are leaps and bounds
ahead of tools from the 80s period.
• Data errors are COMMONPLACE! It took 3 years and a
team of 6 people to clean up Wikipedia of the 8% errors!
• Just because its commonplace doesn’t mean we shouldn’t
improve it!
Office of Research and
Development
Errors in Databases are
Common Place
Merck Index
DailyMedDrugBank
Wikipedia
Common Chemistry
ChEBI
Wolfram Alpha
KEGG
Office of Research and
Development
KNIME workflow to evaluate the dataset
Office of Research and
Development
Summary:
Property Initial file flagged Updated 3-4 STAR Curated QSAR ready
AOP 818 818 745
BCF 685 618 608
BioHC 175 151 150
Biowin 1265 1196 1171
BP 5890 5591 5436
HL 1829 1758 1711
KM 631 548 541
KOA 308 277 270
LogP 15809 14544 14041
MP 10051 9120 8656
PC 788 750 735
VP 3037 2840 2716
WF 5764 5076 4836
WS 2348 2046 2010
Office of Research and
Development
Development of a QSAR model
• Curation of the data
» Flagged and curated files available for sharing
• Preparation of training and test sets
» Inserted as a field in SDFiles and csv data files
• Calculation of an initial set of descriptors
» PaDEL 2D descriptors and fingerprints generated and shared
• Selection of a mathematical method
» Several approaches tested: KNN, PLS, SVM…
• Variable selection technique
» Genetic algorithm
• Validation of the model’s predictive ability
» 5-fold cross validation and external test set
• Define the Applicability Domain
» Local (nearest neighbors) and global (leverage) approaches
Office of Research and
Development
QSARs for regulatory purposes
Office of Research and
Development
The 5 OECD principles:
Principle Description
1) A defined endpoint Any physicochemical, biological or environmental effect
that can be measured and therefore modelled.
2) An unambiguous
algorithm
Ensure transparency in the description of the model
algorithm.
3) A defined domain of
applicability
Define limitations in terms of the types of chemical
structures, physicochemical properties and mechanisms of
action for which the models can generate reliable
predictions.
4) Appropriate measures
of goodness-of-fit,
robustness and
predictivity
a) The internal fitting performance of a model
b) the predictivity of a model, determined by using an
appropriate external test set.
5) Mechanistic
interpretation, if possible
Mechanistic associations between the descriptors used in
a model and the endpoint being predicted.
The conditions for the validity of QSARs
Office of Research and
Development
NCCT models
Prop Vars 5-fold CV (75%) Training (75%) Test (25%)
Q2 RMSE N R2 RMSE N R2 RMSE
BCF 10 0.84 0.55 465 0.85 0.53 161 0.73 0.65
BP 13 0.93 22.46 4077 0.93 22.06 1358 0.93 22.08
LogP 9 0.85 0.69 10531 0.86 0.67 3510 0.86 0.78
MP 15 0.72 51.8 6486 0.74 50.27 2167 0.73 52.72
VP 12 0.91 1.08 2034 0.91 1.08 679 0.92 1
WS 11 0.87 0.81 3158 0.87 0.82 1066 0.86 0.86
HL 9 0.84 1.96 441 0.84 1.91 150 0.85 1.82
Office of Research and
Development
NCCT models
Prop Vars 5-fold CV (75%) Training (75%) Test (25%)
Q2 RMSE N R2 RMSE N R2 RMSE
AOH 13 0.85 1.14 516 0.85 1.12 176 0.83 1.23
BioHL 6 0.89 0.25 112 0.88 0.26 38 0.75 0.38
KM 12 0.83 0.49 405 0.82 0.5 136 0.73 0.62
KOC 12 0.81 0.55 545 0.81 0.54 184 0.71 0.61
KOA 2 0.95 0.69 202 0.95 0.65 68 0.96 0.68
BA Sn-Sp BA Sn-Sp BA Sn-Sp
R-Bio 10 0.8 0.82-0.78 1198 0.8 0.82-0.79 411 0.79 0.81-0.77
Office of Research and
Development
LogP Model: Weighted kNN Model,
9 descriptors
37
Weighted 5-nearest neighbors
9 Descriptors
Training set: 10531 chemicals
Test set: 3510 chemicals
5 fold Cross-validation:
Q2=0.85 RMSE=0.69
Fitting:
R2=0.86 RMSE=0.67
Test:
R2=0.86 RMSE=0.78
Office of Research and
Development
Applicability Domain
38
EPISuite Predicted vs. Experimental EPISuite vs kNN Comparison of Cluster
Applicability Domain of original EPI Suite LogP Model
(accuracy of the global model is an issue)
Example 280 compound cluster outside applicability
domain of EPI Suite compared to new model predictions
Office of Research and
Development
Standalone application:
Input:
–MATLAB .mat file, an ASCII file with only a matrix of variables
–SDF file or SMILES strings of QSAR-ready structures. In this case
the program will calculate PaDEL 2D descriptors and make the
predictions.
• The program will extract the molecules names from the input csv or SDF
(or assign arbitrary names if not) As IDs for the predictions.
Output
• Depending on the extension, the can be text file or csv with
–A list of molecules IDs and predictions
–Applicability domain
–Accuracy of the prediction
–Similarity index to the 5 nearest neighbors
–The 5 nearest neighbors from the training set: Exp. value, Prediction,
InChi key
Office of Research and
Development
NCCT Model Predictions
• We want to provide WEB-access to the cleaned data, the
predicted data for all chemicals and ultimately real time models.
CLEAN data once and use many times!
• QSAR prediction models (kNN) produced for all properties
• 700k chemical structures pushed through NCCT_Models
• Each property now available on the dashboard
–Original EPISuite predicted value where available
–3 and 4 star quality experimental data displayed
–Predicted data displayed
40
Office of Research and
Development
Examples – Glyphosate (logP)
Office of Research and
Development
Examples - Bisphenol S (MP)
Office of Research and
Development
Reporting Outcome
• Presently authoring: “Examining the impact of Quality vs
Quantity on Predictive Models for PhysChem”
• Supplementary data will include appropriate files with flags
– full dataset plus QSAR ready form
• Full performance statistics available for all models
• Models will be deployed as prediction engines in the future
– one chemical at a time and batch processing (to be done
after RapidTox Project)
Office of Research and
Development
Small Data Sets limit
Applicability Domain
• Some EPI Suite data sets are quite small…a few
thousand points. There is a lot of Open Data now
• EXAMPLE: 200,000 melting points and 13,000 pyrolysis
data points
• Modeling this much data is a challenge! 44
Office of Research and
Development
Adding MORE Data
• Collect “Open Data” for various endpoints from:
 PubChem
 eChemPortal
 Open Data Sources
• Data already harvested and awaiting processing
• New models to be produced from expanded data sets
45
Office of Research and
Development
Conclusions
• We will rebuild existing prototypes
on this new architecture based on
priorities
• Componentized architecture approach will be
iterated as/if we hit limitations based on projects
• Much work to do with key activities on the API to
access data and algorithms
• Our Focus is “Build Once, Use Many Times” – for
our projects but made available to others to use also
• This will allow rapid prototyping as we progress

More Related Content

What's hot

OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELSOPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELSKamel Mansouri
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Kamel Mansouri
 
How to control variability in environmental data
How to control variability in environmental dataHow to control variability in environmental data
How to control variability in environmental dataChemistry Matters Inc.
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsKamel Mansouri
 
Virtual screening of chemicals for endocrine disrupting activity through CER...
Virtual screening of chemicals for endocrine disrupting activity through  CER...Virtual screening of chemicals for endocrine disrupting activity through  CER...
Virtual screening of chemicals for endocrine disrupting activity through CER...Kamel Mansouri
 
Senior Chemist 4.0
Senior Chemist 4.0Senior Chemist 4.0
Senior Chemist 4.0Evan Coss
 
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...Kamel Mansouri
 
Cheryl Soltis-Muth Resume 9-26-16 LI
Cheryl Soltis-Muth Resume 9-26-16 LICheryl Soltis-Muth Resume 9-26-16 LI
Cheryl Soltis-Muth Resume 9-26-16 LICheryl Soltis-Muth
 
Update on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectUpdate on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectIES / IAQM
 
Compound Management Focus 2010
Compound Management Focus 2010Compound Management Focus 2010
Compound Management Focus 2010GWendelSGC
 

What's hot (20)

OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELSOPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
 
The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...
 
How to control variability in environmental data
How to control variability in environmental dataHow to control variability in environmental data
How to control variability in environmental data
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology Problems
 
Virtual screening of chemicals for endocrine disrupting activity through CER...
Virtual screening of chemicals for endocrine disrupting activity through  CER...Virtual screening of chemicals for endocrine disrupting activity through  CER...
Virtual screening of chemicals for endocrine disrupting activity through CER...
 
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
Senior Chemist 4.0
Senior Chemist 4.0Senior Chemist 4.0
Senior Chemist 4.0
 
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
 
Resume Or
Resume OrResume Or
Resume Or
 
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
 
Cheryl Soltis-Muth Resume 9-26-16 LI
Cheryl Soltis-Muth Resume 9-26-16 LICheryl Soltis-Muth Resume 9-26-16 LI
Cheryl Soltis-Muth Resume 9-26-16 LI
 
How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...
 
Update on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectUpdate on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL Project
 
Compound Management Focus 2010
Compound Management Focus 2010Compound Management Focus 2010
Compound Management Focus 2010
 
Jennifer L. Brantley Resume
Jennifer L. Brantley ResumeJennifer L. Brantley Resume
Jennifer L. Brantley Resume
 
Organic chemist
Organic chemistOrganic chemist
Organic chemist
 

Similar to An examination of data quality on QSAR Modeling in regards to the environmental sciences. Presented at UNC-CH Talk, Chapel Hill, NC, April 12, 2016.

Susan Tao Resume 030816 linkdIn
Susan Tao Resume 030816 linkdInSusan Tao Resume 030816 linkdIn
Susan Tao Resume 030816 linkdInSusan Tao
 
What+Is+Six+Sigma
What+Is+Six+SigmaWhat+Is+Six+Sigma
What+Is+Six+SigmaTyg Lucas
 
High-throughput and Automated Process Development for Accelerated Biotherapeu...
High-throughput and Automated Process Development for Accelerated Biotherapeu...High-throughput and Automated Process Development for Accelerated Biotherapeu...
High-throughput and Automated Process Development for Accelerated Biotherapeu...KBI Biopharma
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Kamel Mansouri
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsKen Karapetyan
 
andrusyszyn_paul_m_resume_v01
andrusyszyn_paul_m_resume_v01andrusyszyn_paul_m_resume_v01
andrusyszyn_paul_m_resume_v01Paul Andrusyszyn
 
External quality assurance in clinical trace element labs
External quality assurance in clinical trace element labs External quality assurance in clinical trace element labs
External quality assurance in clinical trace element labs Chris Harrington
 
JeffWright_Resume_2015
JeffWright_Resume_2015JeffWright_Resume_2015
JeffWright_Resume_2015Jeffrey Wright
 

Similar to An examination of data quality on QSAR Modeling in regards to the environmental sciences. Presented at UNC-CH Talk, Chapel Hill, NC, April 12, 2016. (20)

Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
 
The influence of data curation on QSAR Modeling – examining issues of qualit...
 The influence of data curation on QSAR Modeling – examining issues of qualit... The influence of data curation on QSAR Modeling – examining issues of qualit...
The influence of data curation on QSAR Modeling – examining issues of qualit...
 
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
 
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted AnalysisThe US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
 
Prediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source toolsPrediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source tools
 
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
 
Susan Tao Resume 030816 linkdIn
Susan Tao Resume 030816 linkdInSusan Tao Resume 030816 linkdIn
Susan Tao Resume 030816 linkdIn
 
What+Is+Six+Sigma
What+Is+Six+SigmaWhat+Is+Six+Sigma
What+Is+Six+Sigma
 
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
 
High-throughput and Automated Process Development for Accelerated Biotherapeu...
High-throughput and Automated Process Development for Accelerated Biotherapeu...High-throughput and Automated Process Development for Accelerated Biotherapeu...
High-throughput and Automated Process Development for Accelerated Biotherapeu...
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
 
SummaryCV-NJW
SummaryCV-NJWSummaryCV-NJW
SummaryCV-NJW
 
Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
 
andrusyszyn_paul_m_resume_v01
andrusyszyn_paul_m_resume_v01andrusyszyn_paul_m_resume_v01
andrusyszyn_paul_m_resume_v01
 
External quality assurance in clinical trace element labs
External quality assurance in clinical trace element labs External quality assurance in clinical trace element labs
External quality assurance in clinical trace element labs
 
JeffWright_Resume_2015
JeffWright_Resume_2015JeffWright_Resume_2015
JeffWright_Resume_2015
 

More from Kamel Mansouri

Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)
Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)
Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)Kamel Mansouri
 
Scoring and ranking of metabolic trees to computationally prioritize chemical...
Scoring and ranking of metabolic trees to computationally prioritize chemical...Scoring and ranking of metabolic trees to computationally prioritize chemical...
Scoring and ranking of metabolic trees to computationally prioritize chemical...Kamel Mansouri
 
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityCoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityKamel Mansouri
 
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Kamel Mansouri
 
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...Kamel Mansouri
 
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...Kamel Mansouri
 
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...Kamel Mansouri
 
In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...Kamel Mansouri
 

More from Kamel Mansouri (8)

Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)
Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)
Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)
 
Scoring and ranking of metabolic trees to computationally prioritize chemical...
Scoring and ranking of metabolic trees to computationally prioritize chemical...Scoring and ranking of metabolic trees to computationally prioritize chemical...
Scoring and ranking of metabolic trees to computationally prioritize chemical...
 
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityCoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
 
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
 
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
 
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
 
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
 
In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...
 

Recently uploaded

Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 

Recently uploaded (20)

Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 

An examination of data quality on QSAR Modeling in regards to the environmental sciences. Presented at UNC-CH Talk, Chapel Hill, NC, April 12, 2016.

  • 1. An examination of the impact of data quality on QSAR Modeling in regards to the environmental sciences Antony Williams & Kamel Mansouri National Center for Computational Toxicology April 12, 2016 The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
  • 2. Office of Research and Development Recent Cheminformatics development at NCCT • We have been building a new cheminformatics architecture • The first “app” provides PUBLIC access to curated chemistry • Focus on integrating agency and external resources • Aggregating and curating data, visualization elements and “services” to underpin other internal efforts including: • Read-across • Predictive modeling • Non-targeted screening 1
  • 3. Office of Research and Development DSSTox (our quality content) DSSTox_v2 PublicCurated 5. Low 2. Low 6. Untrusted 1. High 3. High 4. Med 7. Incomplete QC Level Totals (12Jun2015) validated Public_Untrusted Public_Low Public_Medium Public_High DSSTox_Low DSSTox_High 4535 16K 33K 101K 584K ~ 310K pending ~ 150K pending
  • 4. Office of Research and Development Chemistry Software Architecture
  • 5. Office of Research and Development The iCSS Chemistry Dashboard at https://comptox.epa.gov 4
  • 6. Office of Research and Development The iCSS Chemistry Dashboard 5
  • 7. Office of Research and Development The iCSS Chemistry Dashboard 6
  • 8. Office of Research and Development The iCSS Chemistry Dashboard 7
  • 9. Office of Research and Development The iCSS Chemistry Dashboard 8
  • 10. Office of Research and Development Simple Link Integration
  • 11. Office of Research and Development Data sharing and access
  • 12. Office of Research and Development Expt’l and Predicted PhysChem Data 11
  • 13. Office of Research and Development Developing “NCCT Models” • Interest in physicochemical properties to include in exposure modeling, augmenting with Toxcast data etc. • Our approach to modeling: – Obtain high quality training sets – Apply appropriate modeling approaches (modern machine learning) – Validate performance of models and use to predict properties across our full datasets • Initiated work with available “EPI Suite” data
  • 14. Office of Research and Development EPI Suite Properties • Water solubility • Melting Point • Boiling Point • LogP (KOWWIN: Octanol-water partition coefficient) • Atmospheric Hydroxylation Rate • LogBCF (Bioconcentration Factor) • Biodegradation Half-life • Ready biodegradability • Henry's Law Constant • Fish Biotransformation Half-life • LogKOA (Octanol/Air Partition Coefficient) • LogKOC (Soil Adsorption Coefficient) • Vapor Pressure
  • 15. Office of Research and Development Looking at EPISuite KOWWIN: 2447 Training Compounds • KOWWIN model was trained on 2447 chemicals • Available experimental dataset is 15,809 chemicals • Previous experience of the data necessitated review
  • 16. Office of Research and Development Predicted vs. Experimental considering >15k chemicals
  • 17. Office of Research and Development Data Files • The data files have FOUR representations of a chemical plus the property value. • CAS Number: 80-05-7 • Name: Bisphenol A • SMILES: CC(C)(C1=CC=C(O)C=C1)C1=CC=C(O)C=C1 • Molfile:
  • 18. Office of Research and Development The Approach • To build models we need the set of chemicals and their property series • Our curation process – Decide on the “chemical” by checking levels of consistency – We did NOT validate each measured property value – Perform initial analysis manually to understand how to clean the data – Automate the process (and test iteratively) – Feed through all datasets
  • 19. Office of Research and Development General Observations • MANY mismatches across datasets – comparing structure molfile, SMILES, Names and CAS Numbers • Duplicate pairs for chemicals – same structures but different values for properties. Some wildly different. • Some SMILES cannot convert – what software was used?? • Invalid CAS Numbers (can be validated with a CheckSum) • Some chemical names are neither systematic nor trivial. Totally ambiguous and truncated • “Stereochemistry” representations are not appropriate for computers and commonly not even included
  • 20. Office of Research and Development Truncated and obscure names
  • 21. Office of Research and Development “Duplicate” Structures
  • 22. Office of Research and Development Large Differences in Values? 6380-24-1 93-15-2
  • 23. Office of Research and Development Salts
  • 24. Office of Research and Development Some statistics: 15,809 chemicals (KOWWIN dataset) • CAS Checksum: 12163 valid, 3646 invalid (>23%) • Invalid names: 555 • Invalid SMILES 133 • Valence errors: 322 Molfile, 3782 SMILES (>24%) • Duplicates check: –31 DUPLICATES MOLFILE –626 DUPLICATES SMILES –531 DUPLICATES NAMES • SMILES vs. Molfiles (structure check) –1279 differ in stereochemistry (~8%) –362 “Covalent Halogens” –191 differ as tautomers –436 are different compounds (~3%)
  • 25. Office of Research and Development Quality FLAGS • 4 Star ENHANCED Stereochemistry – chemical name, CAS Number, Molfile and SMILES all consistent PLUS enhanced with correct stereochemistry
  • 26. Office of Research and Development Quality FLAGS • 4 Star – chemical name, CAS Number, Molfile and SMILES all consistent. Stereo ignored. • 3 Star ENHANCED Stereochemistry – consistency between 3 of 4: CAS Number (DSSTox lookup), Molfile, SMILES, Tautomer - PLUS enhanced with correct stereochemistry • 3 Star– consistency between 3 of 4: CAS Number, Molfile, SMILES, Tautomer - stereo ignored
  • 27. Office of Research and Development Quality FLAGS Treated as “Junk Data” in terms of using as training data • 2 Star PLUS – consistency between 2 of 4: chemical name, CAS Number (DSSTox lookup), Molfile, SMILES, Tautomer. Molfile is changed to be consistent if necessary • 2 Star – only two fields internally consistent • 1 star – what’s left • AUTOMATION procedures used across all data – “KNIME” pipelining software for processing SDF files
  • 28. Office of Research and Development Cleaning the data 27 • Models built on “QSAR ready” forms of the curated structures – standardize tautomers, remove stereochemistry, no salts • Compare 3 STAR and BETTER models with entire dataset • Exclusion of 1 and 2 star data removed SOME “outliers” only • Definite cluster not being modeled appropriately - “Applicability domain” – initial model based on ~2500 chemicals
  • 29. Office of Research and Development How did it get this way? • Data was assembled over many years (I assume), by various people and likely heterogeneous processes • Cheminformatics tools of today are leaps and bounds ahead of tools from the 80s period. • Data errors are COMMONPLACE! It took 3 years and a team of 6 people to clean up Wikipedia of the 8% errors! • Just because its commonplace doesn’t mean we shouldn’t improve it!
  • 30. Office of Research and Development Errors in Databases are Common Place Merck Index DailyMedDrugBank Wikipedia Common Chemistry ChEBI Wolfram Alpha KEGG
  • 31. Office of Research and Development KNIME workflow to evaluate the dataset
  • 32. Office of Research and Development Summary: Property Initial file flagged Updated 3-4 STAR Curated QSAR ready AOP 818 818 745 BCF 685 618 608 BioHC 175 151 150 Biowin 1265 1196 1171 BP 5890 5591 5436 HL 1829 1758 1711 KM 631 548 541 KOA 308 277 270 LogP 15809 14544 14041 MP 10051 9120 8656 PC 788 750 735 VP 3037 2840 2716 WF 5764 5076 4836 WS 2348 2046 2010
  • 33. Office of Research and Development Development of a QSAR model • Curation of the data » Flagged and curated files available for sharing • Preparation of training and test sets » Inserted as a field in SDFiles and csv data files • Calculation of an initial set of descriptors » PaDEL 2D descriptors and fingerprints generated and shared • Selection of a mathematical method » Several approaches tested: KNN, PLS, SVM… • Variable selection technique » Genetic algorithm • Validation of the model’s predictive ability » 5-fold cross validation and external test set • Define the Applicability Domain » Local (nearest neighbors) and global (leverage) approaches
  • 34. Office of Research and Development QSARs for regulatory purposes
  • 35. Office of Research and Development The 5 OECD principles: Principle Description 1) A defined endpoint Any physicochemical, biological or environmental effect that can be measured and therefore modelled. 2) An unambiguous algorithm Ensure transparency in the description of the model algorithm. 3) A defined domain of applicability Define limitations in terms of the types of chemical structures, physicochemical properties and mechanisms of action for which the models can generate reliable predictions. 4) Appropriate measures of goodness-of-fit, robustness and predictivity a) The internal fitting performance of a model b) the predictivity of a model, determined by using an appropriate external test set. 5) Mechanistic interpretation, if possible Mechanistic associations between the descriptors used in a model and the endpoint being predicted. The conditions for the validity of QSARs
  • 36. Office of Research and Development NCCT models Prop Vars 5-fold CV (75%) Training (75%) Test (25%) Q2 RMSE N R2 RMSE N R2 RMSE BCF 10 0.84 0.55 465 0.85 0.53 161 0.73 0.65 BP 13 0.93 22.46 4077 0.93 22.06 1358 0.93 22.08 LogP 9 0.85 0.69 10531 0.86 0.67 3510 0.86 0.78 MP 15 0.72 51.8 6486 0.74 50.27 2167 0.73 52.72 VP 12 0.91 1.08 2034 0.91 1.08 679 0.92 1 WS 11 0.87 0.81 3158 0.87 0.82 1066 0.86 0.86 HL 9 0.84 1.96 441 0.84 1.91 150 0.85 1.82
  • 37. Office of Research and Development NCCT models Prop Vars 5-fold CV (75%) Training (75%) Test (25%) Q2 RMSE N R2 RMSE N R2 RMSE AOH 13 0.85 1.14 516 0.85 1.12 176 0.83 1.23 BioHL 6 0.89 0.25 112 0.88 0.26 38 0.75 0.38 KM 12 0.83 0.49 405 0.82 0.5 136 0.73 0.62 KOC 12 0.81 0.55 545 0.81 0.54 184 0.71 0.61 KOA 2 0.95 0.69 202 0.95 0.65 68 0.96 0.68 BA Sn-Sp BA Sn-Sp BA Sn-Sp R-Bio 10 0.8 0.82-0.78 1198 0.8 0.82-0.79 411 0.79 0.81-0.77
  • 38. Office of Research and Development LogP Model: Weighted kNN Model, 9 descriptors 37 Weighted 5-nearest neighbors 9 Descriptors Training set: 10531 chemicals Test set: 3510 chemicals 5 fold Cross-validation: Q2=0.85 RMSE=0.69 Fitting: R2=0.86 RMSE=0.67 Test: R2=0.86 RMSE=0.78
  • 39. Office of Research and Development Applicability Domain 38 EPISuite Predicted vs. Experimental EPISuite vs kNN Comparison of Cluster Applicability Domain of original EPI Suite LogP Model (accuracy of the global model is an issue) Example 280 compound cluster outside applicability domain of EPI Suite compared to new model predictions
  • 40. Office of Research and Development Standalone application: Input: –MATLAB .mat file, an ASCII file with only a matrix of variables –SDF file or SMILES strings of QSAR-ready structures. In this case the program will calculate PaDEL 2D descriptors and make the predictions. • The program will extract the molecules names from the input csv or SDF (or assign arbitrary names if not) As IDs for the predictions. Output • Depending on the extension, the can be text file or csv with –A list of molecules IDs and predictions –Applicability domain –Accuracy of the prediction –Similarity index to the 5 nearest neighbors –The 5 nearest neighbors from the training set: Exp. value, Prediction, InChi key
  • 41. Office of Research and Development NCCT Model Predictions • We want to provide WEB-access to the cleaned data, the predicted data for all chemicals and ultimately real time models. CLEAN data once and use many times! • QSAR prediction models (kNN) produced for all properties • 700k chemical structures pushed through NCCT_Models • Each property now available on the dashboard –Original EPISuite predicted value where available –3 and 4 star quality experimental data displayed –Predicted data displayed 40
  • 42. Office of Research and Development Examples – Glyphosate (logP)
  • 43. Office of Research and Development Examples - Bisphenol S (MP)
  • 44. Office of Research and Development Reporting Outcome • Presently authoring: “Examining the impact of Quality vs Quantity on Predictive Models for PhysChem” • Supplementary data will include appropriate files with flags – full dataset plus QSAR ready form • Full performance statistics available for all models • Models will be deployed as prediction engines in the future – one chemical at a time and batch processing (to be done after RapidTox Project)
  • 45. Office of Research and Development Small Data Sets limit Applicability Domain • Some EPI Suite data sets are quite small…a few thousand points. There is a lot of Open Data now • EXAMPLE: 200,000 melting points and 13,000 pyrolysis data points • Modeling this much data is a challenge! 44
  • 46. Office of Research and Development Adding MORE Data • Collect “Open Data” for various endpoints from:  PubChem  eChemPortal  Open Data Sources • Data already harvested and awaiting processing • New models to be produced from expanded data sets 45
  • 47. Office of Research and Development Conclusions • We will rebuild existing prototypes on this new architecture based on priorities • Componentized architecture approach will be iterated as/if we hit limitations based on projects • Much work to do with key activities on the API to access data and algorithms • Our Focus is “Build Once, Use Many Times” – for our projects but made available to others to use also • This will allow rapid prototyping as we progress