SlideShare a Scribd company logo
An examination of the impact
of data quality on QSAR
Modeling in regards to the
environmental sciences
Antony Williams & Kamel Mansouri
National Center for Computational Toxicology
April 12, 2016
The views expressed in this presentation are those of the author and do not necessarily reflect
the views or policies of the U.S. EPA
Office of Research and
Development
Recent Cheminformatics
development at NCCT
• We have been building a new cheminformatics architecture
• The first “app” provides PUBLIC access to curated chemistry
• Focus on integrating agency and external resources
• Aggregating and curating data, visualization elements and
“services” to underpin other internal efforts including:
• Read-across
• Predictive modeling
• Non-targeted screening
1
Office of Research and
Development
DSSTox (our quality content)
DSSTox_v2
PublicCurated
5. Low
2. Low
6. Untrusted
1. High
3. High
4. Med
7. Incomplete
QC Level Totals (12Jun2015)
validated
Public_Untrusted
Public_Low
Public_Medium
Public_High
DSSTox_Low
DSSTox_High 4535
16K
33K
101K
584K
~ 310K pending
~ 150K pending
Office of Research and
Development
Chemistry Software Architecture
Office of Research and
Development
The iCSS Chemistry Dashboard
at https://comptox.epa.gov
4
Office of Research and
Development
The iCSS Chemistry Dashboard
5
Office of Research and
Development
The iCSS Chemistry Dashboard
6
Office of Research and
Development
The iCSS Chemistry Dashboard
7
Office of Research and
Development
The iCSS Chemistry Dashboard
8
Office of Research and
Development
Simple Link Integration
Office of Research and
Development
Data sharing and access
Office of Research and
Development
Expt’l and Predicted PhysChem Data
11
Office of Research and
Development
Developing “NCCT Models”
• Interest in physicochemical properties to include
in exposure modeling, augmenting with Toxcast
data etc.
• Our approach to modeling:
– Obtain high quality training sets
– Apply appropriate modeling approaches
(modern machine learning)
– Validate performance of models and use to
predict properties across our full datasets
• Initiated work with available “EPI Suite” data
Office of Research and
Development
EPI Suite Properties
• Water solubility
• Melting Point
• Boiling Point
• LogP (KOWWIN: Octanol-water partition coefficient)
• Atmospheric Hydroxylation Rate
• LogBCF (Bioconcentration Factor)
• Biodegradation Half-life
• Ready biodegradability
• Henry's Law Constant
• Fish Biotransformation Half-life
• LogKOA (Octanol/Air Partition Coefficient)
• LogKOC (Soil Adsorption Coefficient)
• Vapor Pressure
Office of Research and
Development
Looking at EPISuite KOWWIN:
2447 Training Compounds
• KOWWIN model was trained
on 2447 chemicals
• Available experimental
dataset is 15,809 chemicals
• Previous experience of the
data necessitated review
Office of Research and
Development
Predicted vs. Experimental
considering >15k chemicals
Office of Research and
Development
Data Files
• The data files have FOUR representations of a chemical
plus the property value.
• CAS Number: 80-05-7
• Name: Bisphenol A
• SMILES: CC(C)(C1=CC=C(O)C=C1)C1=CC=C(O)C=C1
• Molfile:
Office of Research and
Development
The Approach
• To build models we need the set of chemicals and their
property series
• Our curation process
– Decide on the “chemical” by checking levels of consistency
– We did NOT validate each measured property value
– Perform initial analysis manually to understand how to
clean the data
– Automate the process (and test iteratively)
– Feed through all datasets
Office of Research and
Development
General Observations
• MANY mismatches across datasets – comparing structure
molfile, SMILES, Names and CAS Numbers
• Duplicate pairs for chemicals – same structures but different
values for properties. Some wildly different.
• Some SMILES cannot convert – what software was used??
• Invalid CAS Numbers (can be validated with a CheckSum)
• Some chemical names are neither systematic nor trivial.
Totally ambiguous and truncated
• “Stereochemistry” representations are not appropriate for
computers and commonly not even included
Office of Research and
Development
Truncated and obscure names
Office of Research and
Development
“Duplicate” Structures
Office of Research and
Development
Large Differences in Values?
6380-24-1 93-15-2
Office of Research and
Development
Salts
Office of Research and
Development
Some statistics: 15,809 chemicals
(KOWWIN dataset)
• CAS Checksum: 12163 valid, 3646 invalid (>23%)
• Invalid names: 555
• Invalid SMILES 133
• Valence errors: 322 Molfile, 3782 SMILES (>24%)
• Duplicates check:
–31 DUPLICATES MOLFILE
–626 DUPLICATES SMILES
–531 DUPLICATES NAMES
• SMILES vs. Molfiles (structure check)
–1279 differ in stereochemistry (~8%)
–362 “Covalent Halogens”
–191 differ as tautomers
–436 are different compounds (~3%)
Office of Research and
Development
Quality FLAGS
• 4 Star ENHANCED Stereochemistry – chemical name,
CAS Number, Molfile and SMILES all consistent PLUS
enhanced with correct stereochemistry
Office of Research and
Development
Quality FLAGS
• 4 Star – chemical name, CAS Number, Molfile and
SMILES all consistent. Stereo ignored.
• 3 Star ENHANCED Stereochemistry – consistency
between 3 of 4: CAS Number (DSSTox lookup), Molfile,
SMILES, Tautomer - PLUS enhanced with correct
stereochemistry
• 3 Star– consistency between 3 of 4: CAS Number, Molfile,
SMILES, Tautomer - stereo ignored
Office of Research and
Development
Quality FLAGS
Treated as “Junk Data” in terms of using as training data
• 2 Star PLUS – consistency between 2 of 4: chemical name,
CAS Number (DSSTox lookup), Molfile, SMILES, Tautomer.
Molfile is changed to be consistent if necessary
• 2 Star – only two fields internally consistent
• 1 star – what’s left
• AUTOMATION procedures used across all data – “KNIME”
pipelining software for processing SDF files
Office of Research and
Development
Cleaning the data
27
• Models built on “QSAR ready” forms of the curated structures –
standardize tautomers, remove stereochemistry, no salts
• Compare 3 STAR and BETTER models with entire dataset
• Exclusion of 1 and 2 star data removed SOME “outliers” only
• Definite cluster not being modeled appropriately - “Applicability
domain” – initial model based on ~2500 chemicals
Office of Research and
Development
How did it get this way?
• Data was assembled over many years (I assume), by
various people and likely heterogeneous processes
• Cheminformatics tools of today are leaps and bounds
ahead of tools from the 80s period.
• Data errors are COMMONPLACE! It took 3 years and a
team of 6 people to clean up Wikipedia of the 8% errors!
• Just because its commonplace doesn’t mean we shouldn’t
improve it!
Office of Research and
Development
Errors in Databases are
Common Place
Merck Index
DailyMedDrugBank
Wikipedia
Common Chemistry
ChEBI
Wolfram Alpha
KEGG
Office of Research and
Development
KNIME workflow to evaluate the dataset
Office of Research and
Development
Summary:
Property Initial file flagged Updated 3-4 STAR Curated QSAR ready
AOP 818 818 745
BCF 685 618 608
BioHC 175 151 150
Biowin 1265 1196 1171
BP 5890 5591 5436
HL 1829 1758 1711
KM 631 548 541
KOA 308 277 270
LogP 15809 14544 14041
MP 10051 9120 8656
PC 788 750 735
VP 3037 2840 2716
WF 5764 5076 4836
WS 2348 2046 2010
Office of Research and
Development
Development of a QSAR model
• Curation of the data
» Flagged and curated files available for sharing
• Preparation of training and test sets
» Inserted as a field in SDFiles and csv data files
• Calculation of an initial set of descriptors
» PaDEL 2D descriptors and fingerprints generated and shared
• Selection of a mathematical method
» Several approaches tested: KNN, PLS, SVM…
• Variable selection technique
» Genetic algorithm
• Validation of the model’s predictive ability
» 5-fold cross validation and external test set
• Define the Applicability Domain
» Local (nearest neighbors) and global (leverage) approaches
Office of Research and
Development
QSARs for regulatory purposes
Office of Research and
Development
The 5 OECD principles:
Principle Description
1) A defined endpoint Any physicochemical, biological or environmental effect
that can be measured and therefore modelled.
2) An unambiguous
algorithm
Ensure transparency in the description of the model
algorithm.
3) A defined domain of
applicability
Define limitations in terms of the types of chemical
structures, physicochemical properties and mechanisms of
action for which the models can generate reliable
predictions.
4) Appropriate measures
of goodness-of-fit,
robustness and
predictivity
a) The internal fitting performance of a model
b) the predictivity of a model, determined by using an
appropriate external test set.
5) Mechanistic
interpretation, if possible
Mechanistic associations between the descriptors used in
a model and the endpoint being predicted.
The conditions for the validity of QSARs
Office of Research and
Development
NCCT models
Prop Vars 5-fold CV (75%) Training (75%) Test (25%)
Q2 RMSE N R2 RMSE N R2 RMSE
BCF 10 0.84 0.55 465 0.85 0.53 161 0.73 0.65
BP 13 0.93 22.46 4077 0.93 22.06 1358 0.93 22.08
LogP 9 0.85 0.69 10531 0.86 0.67 3510 0.86 0.78
MP 15 0.72 51.8 6486 0.74 50.27 2167 0.73 52.72
VP 12 0.91 1.08 2034 0.91 1.08 679 0.92 1
WS 11 0.87 0.81 3158 0.87 0.82 1066 0.86 0.86
HL 9 0.84 1.96 441 0.84 1.91 150 0.85 1.82
Office of Research and
Development
NCCT models
Prop Vars 5-fold CV (75%) Training (75%) Test (25%)
Q2 RMSE N R2 RMSE N R2 RMSE
AOH 13 0.85 1.14 516 0.85 1.12 176 0.83 1.23
BioHL 6 0.89 0.25 112 0.88 0.26 38 0.75 0.38
KM 12 0.83 0.49 405 0.82 0.5 136 0.73 0.62
KOC 12 0.81 0.55 545 0.81 0.54 184 0.71 0.61
KOA 2 0.95 0.69 202 0.95 0.65 68 0.96 0.68
BA Sn-Sp BA Sn-Sp BA Sn-Sp
R-Bio 10 0.8 0.82-0.78 1198 0.8 0.82-0.79 411 0.79 0.81-0.77
Office of Research and
Development
LogP Model: Weighted kNN Model,
9 descriptors
37
Weighted 5-nearest neighbors
9 Descriptors
Training set: 10531 chemicals
Test set: 3510 chemicals
5 fold Cross-validation:
Q2=0.85 RMSE=0.69
Fitting:
R2=0.86 RMSE=0.67
Test:
R2=0.86 RMSE=0.78
Office of Research and
Development
Applicability Domain
38
EPISuite Predicted vs. Experimental EPISuite vs kNN Comparison of Cluster
Applicability Domain of original EPI Suite LogP Model
(accuracy of the global model is an issue)
Example 280 compound cluster outside applicability
domain of EPI Suite compared to new model predictions
Office of Research and
Development
Standalone application:
Input:
–MATLAB .mat file, an ASCII file with only a matrix of variables
–SDF file or SMILES strings of QSAR-ready structures. In this case
the program will calculate PaDEL 2D descriptors and make the
predictions.
• The program will extract the molecules names from the input csv or SDF
(or assign arbitrary names if not) As IDs for the predictions.
Output
• Depending on the extension, the can be text file or csv with
–A list of molecules IDs and predictions
–Applicability domain
–Accuracy of the prediction
–Similarity index to the 5 nearest neighbors
–The 5 nearest neighbors from the training set: Exp. value, Prediction,
InChi key
Office of Research and
Development
NCCT Model Predictions
• We want to provide WEB-access to the cleaned data, the
predicted data for all chemicals and ultimately real time models.
CLEAN data once and use many times!
• QSAR prediction models (kNN) produced for all properties
• 700k chemical structures pushed through NCCT_Models
• Each property now available on the dashboard
–Original EPISuite predicted value where available
–3 and 4 star quality experimental data displayed
–Predicted data displayed
40
Office of Research and
Development
Examples – Glyphosate (logP)
Office of Research and
Development
Examples - Bisphenol S (MP)
Office of Research and
Development
Reporting Outcome
• Presently authoring: “Examining the impact of Quality vs
Quantity on Predictive Models for PhysChem”
• Supplementary data will include appropriate files with flags
– full dataset plus QSAR ready form
• Full performance statistics available for all models
• Models will be deployed as prediction engines in the future
– one chemical at a time and batch processing (to be done
after RapidTox Project)
Office of Research and
Development
Small Data Sets limit
Applicability Domain
• Some EPI Suite data sets are quite small…a few
thousand points. There is a lot of Open Data now
• EXAMPLE: 200,000 melting points and 13,000 pyrolysis
data points
• Modeling this much data is a challenge! 44
Office of Research and
Development
Adding MORE Data
• Collect “Open Data” for various endpoints from:
 PubChem
 eChemPortal
 Open Data Sources
• Data already harvested and awaiting processing
• New models to be produced from expanded data sets
45
Office of Research and
Development
Conclusions
• We will rebuild existing prototypes
on this new architecture based on
priorities
• Componentized architecture approach will be
iterated as/if we hit limitations based on projects
• Much work to do with key activities on the API to
access data and algorithms
• Our Focus is “Build Once, Use Many Times” – for
our projects but made available to others to use also
• This will allow rapid prototyping as we progress

More Related Content

What's hot

Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
Kamel Mansouri
 
Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings Open PHACTS Chemistry Platform Update and Learnings
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...
Kamel Mansouri
 
Cheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural ProductsCheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural Products
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Development of a Tool for Systematic Integration of Traditional and New Appro...
Development of a Tool for Systematic Integration of Traditional and New Appro...Development of a Tool for Systematic Integration of Traditional and New Appro...
Development of a Tool for Systematic Integration of Traditional and New Appro...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Overview of open resources to support automated structure verification and e...
Overview of open resources to support automated structure verification  and e...Overview of open resources to support automated structure verification  and e...
Overview of open resources to support automated structure verification and e...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELSOPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
Kamel Mansouri
 
New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
SETAC Rome Non-Target Screening For Chemical Discovery
SETAC Rome Non-Target Screening For Chemical DiscoverySETAC Rome Non-Target Screening For Chemical Discovery
SETAC Rome Non-Target Screening For Chemical Discovery
Emma Schymanski
 
Crunching Molecules and Numbers in R
Crunching Molecules and Numbers in RCrunching Molecules and Numbers in R
Crunching Molecules and Numbers in RRajarshi Guha
 
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
Kamel Mansouri
 
Web-based access to data for >600 disinfection by-products via the EPA CompTo...
Web-based access to data for >600 disinfection by-products via the EPA CompTo...Web-based access to data for >600 disinfection by-products via the EPA CompTo...
Web-based access to data for >600 disinfection by-products via the EPA CompTo...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...
Kamel Mansouri
 

What's hot (20)

Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
 
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
 
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
 
Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings
 
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
 
The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...
 
Cheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural ProductsCheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural Products
 
Development of a Tool for Systematic Integration of Traditional and New Appro...
Development of a Tool for Systematic Integration of Traditional and New Appro...Development of a Tool for Systematic Integration of Traditional and New Appro...
Development of a Tool for Systematic Integration of Traditional and New Appro...
 
Overview of open resources to support automated structure verification and e...
Overview of open resources to support automated structure verification  and e...Overview of open resources to support automated structure verification  and e...
Overview of open resources to support automated structure verification and e...
 
Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...
 
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELSOPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
 
New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
 
Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...
 
SETAC Rome Non-Target Screening For Chemical Discovery
SETAC Rome Non-Target Screening For Chemical DiscoverySETAC Rome Non-Target Screening For Chemical Discovery
SETAC Rome Non-Target Screening For Chemical Discovery
 
Crunching Molecules and Numbers in R
Crunching Molecules and Numbers in RCrunching Molecules and Numbers in R
Crunching Molecules and Numbers in R
 
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
 
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
 
Web-based access to data for >600 disinfection by-products via the EPA CompTo...
Web-based access to data for >600 disinfection by-products via the EPA CompTo...Web-based access to data for >600 disinfection by-products via the EPA CompTo...
Web-based access to data for >600 disinfection by-products via the EPA CompTo...
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...
 

Viewers also liked

How One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online ChemistryHow One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online Chemistry
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Yue Liao
 
NSF Data Management Requirements 101
NSF Data Management Requirements 101NSF Data Management Requirements 101
NSF Data Management Requirements 101
Georgia Libraries Conference (formerly Ga COMO).
 
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Georgia Libraries Conference (formerly Ga COMO).
 
From Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki ProjectFrom Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki Project
Joel Gehman
 
SMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic ManagementSMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic Management
Joel Gehman
 
Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
Georgia Libraries Conference (formerly Ga COMO).
 
Social Media Tools for Scientists and Building an Online Profile
Social Media Tools for Scientists and Building an Online ProfileSocial Media Tools for Scientists and Building an Online Profile
Social Media Tools for Scientists and Building an Online Profile
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...
NASIG
 
Building an Online Profile Using Social Networking and Amplification Tools fo...
Building an Online Profile Using Social Networking and Amplification Tools fo...Building an Online Profile Using Social Networking and Amplification Tools fo...
Building an Online Profile Using Social Networking and Amplification Tools fo...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...
lisbk
 
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
Joel Gehman
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
c.titus.brown
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
c.titus.brown
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
c.titus.brown
 

Viewers also liked (16)

How One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online ChemistryHow One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online Chemistry
 
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
 
NSF Data Management Requirements 101
NSF Data Management Requirements 101NSF Data Management Requirements 101
NSF Data Management Requirements 101
 
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
 
From Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki ProjectFrom Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki Project
 
SMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic ManagementSMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic Management
 
Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...
 
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
 
Social Media Tools for Scientists and Building an Online Profile
Social Media Tools for Scientists and Building an Online ProfileSocial Media Tools for Scientists and Building an Online Profile
Social Media Tools for Scientists and Building an Online Profile
 
Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...
 
Building an Online Profile Using Social Networking and Amplification Tools fo...
Building an Online Profile Using Social Networking and Amplification Tools fo...Building an Online Profile Using Social Networking and Amplification Tools fo...
Building an Online Profile Using Social Networking and Amplification Tools fo...
 
Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...
 
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 

Similar to An examination of data quality on QSAR Modeling in regards to the environmental sciences

Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted AnalysisThe US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
Kamel Mansouri
 
Prediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source toolsPrediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source tools
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Susan Tao Resume 030816 linkdIn
Susan Tao Resume 030816 linkdInSusan Tao Resume 030816 linkdIn
Susan Tao Resume 030816 linkdInSusan Tao
 
What+Is+Six+Sigma
What+Is+Six+SigmaWhat+Is+Six+Sigma
What+Is+Six+SigmaTyg Lucas
 
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology Problems
Kamel Mansouri
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
High-throughput and Automated Process Development for Accelerated Biotherapeu...
High-throughput and Automated Process Development for Accelerated Biotherapeu...High-throughput and Automated Process Development for Accelerated Biotherapeu...
High-throughput and Automated Process Development for Accelerated Biotherapeu...
KBI Biopharma
 
Senior Chemist 4.0
Senior Chemist 4.0Senior Chemist 4.0
Senior Chemist 4.0Evan Coss
 
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Kamel Mansouri
 
Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 

Similar to An examination of data quality on QSAR Modeling in regards to the environmental sciences (20)

Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted AnalysisThe US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
 
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
 
Prediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source toolsPrediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source tools
 
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
 
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
 
Susan Tao Resume 030816 linkdIn
Susan Tao Resume 030816 linkdInSusan Tao Resume 030816 linkdIn
Susan Tao Resume 030816 linkdIn
 
What+Is+Six+Sigma
What+Is+Six+SigmaWhat+Is+Six+Sigma
What+Is+Six+Sigma
 
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology Problems
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
High-throughput and Automated Process Development for Accelerated Biotherapeu...
High-throughput and Automated Process Development for Accelerated Biotherapeu...High-throughput and Automated Process Development for Accelerated Biotherapeu...
High-throughput and Automated Process Development for Accelerated Biotherapeu...
 
Senior Chemist 4.0
Senior Chemist 4.0Senior Chemist 4.0
Senior Chemist 4.0
 
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
 
SummaryCV-NJW
SummaryCV-NJWSummaryCV-NJW
SummaryCV-NJW
 
Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
 

Recently uploaded

Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
muralinath2
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
Areesha Ahmad
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
alishadewangan1
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 

Recently uploaded (20)

Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 

An examination of data quality on QSAR Modeling in regards to the environmental sciences

  • 1. An examination of the impact of data quality on QSAR Modeling in regards to the environmental sciences Antony Williams & Kamel Mansouri National Center for Computational Toxicology April 12, 2016 The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
  • 2. Office of Research and Development Recent Cheminformatics development at NCCT • We have been building a new cheminformatics architecture • The first “app” provides PUBLIC access to curated chemistry • Focus on integrating agency and external resources • Aggregating and curating data, visualization elements and “services” to underpin other internal efforts including: • Read-across • Predictive modeling • Non-targeted screening 1
  • 3. Office of Research and Development DSSTox (our quality content) DSSTox_v2 PublicCurated 5. Low 2. Low 6. Untrusted 1. High 3. High 4. Med 7. Incomplete QC Level Totals (12Jun2015) validated Public_Untrusted Public_Low Public_Medium Public_High DSSTox_Low DSSTox_High 4535 16K 33K 101K 584K ~ 310K pending ~ 150K pending
  • 4. Office of Research and Development Chemistry Software Architecture
  • 5. Office of Research and Development The iCSS Chemistry Dashboard at https://comptox.epa.gov 4
  • 6. Office of Research and Development The iCSS Chemistry Dashboard 5
  • 7. Office of Research and Development The iCSS Chemistry Dashboard 6
  • 8. Office of Research and Development The iCSS Chemistry Dashboard 7
  • 9. Office of Research and Development The iCSS Chemistry Dashboard 8
  • 10. Office of Research and Development Simple Link Integration
  • 11. Office of Research and Development Data sharing and access
  • 12. Office of Research and Development Expt’l and Predicted PhysChem Data 11
  • 13. Office of Research and Development Developing “NCCT Models” • Interest in physicochemical properties to include in exposure modeling, augmenting with Toxcast data etc. • Our approach to modeling: – Obtain high quality training sets – Apply appropriate modeling approaches (modern machine learning) – Validate performance of models and use to predict properties across our full datasets • Initiated work with available “EPI Suite” data
  • 14. Office of Research and Development EPI Suite Properties • Water solubility • Melting Point • Boiling Point • LogP (KOWWIN: Octanol-water partition coefficient) • Atmospheric Hydroxylation Rate • LogBCF (Bioconcentration Factor) • Biodegradation Half-life • Ready biodegradability • Henry's Law Constant • Fish Biotransformation Half-life • LogKOA (Octanol/Air Partition Coefficient) • LogKOC (Soil Adsorption Coefficient) • Vapor Pressure
  • 15. Office of Research and Development Looking at EPISuite KOWWIN: 2447 Training Compounds • KOWWIN model was trained on 2447 chemicals • Available experimental dataset is 15,809 chemicals • Previous experience of the data necessitated review
  • 16. Office of Research and Development Predicted vs. Experimental considering >15k chemicals
  • 17. Office of Research and Development Data Files • The data files have FOUR representations of a chemical plus the property value. • CAS Number: 80-05-7 • Name: Bisphenol A • SMILES: CC(C)(C1=CC=C(O)C=C1)C1=CC=C(O)C=C1 • Molfile:
  • 18. Office of Research and Development The Approach • To build models we need the set of chemicals and their property series • Our curation process – Decide on the “chemical” by checking levels of consistency – We did NOT validate each measured property value – Perform initial analysis manually to understand how to clean the data – Automate the process (and test iteratively) – Feed through all datasets
  • 19. Office of Research and Development General Observations • MANY mismatches across datasets – comparing structure molfile, SMILES, Names and CAS Numbers • Duplicate pairs for chemicals – same structures but different values for properties. Some wildly different. • Some SMILES cannot convert – what software was used?? • Invalid CAS Numbers (can be validated with a CheckSum) • Some chemical names are neither systematic nor trivial. Totally ambiguous and truncated • “Stereochemistry” representations are not appropriate for computers and commonly not even included
  • 20. Office of Research and Development Truncated and obscure names
  • 21. Office of Research and Development “Duplicate” Structures
  • 22. Office of Research and Development Large Differences in Values? 6380-24-1 93-15-2
  • 23. Office of Research and Development Salts
  • 24. Office of Research and Development Some statistics: 15,809 chemicals (KOWWIN dataset) • CAS Checksum: 12163 valid, 3646 invalid (>23%) • Invalid names: 555 • Invalid SMILES 133 • Valence errors: 322 Molfile, 3782 SMILES (>24%) • Duplicates check: –31 DUPLICATES MOLFILE –626 DUPLICATES SMILES –531 DUPLICATES NAMES • SMILES vs. Molfiles (structure check) –1279 differ in stereochemistry (~8%) –362 “Covalent Halogens” –191 differ as tautomers –436 are different compounds (~3%)
  • 25. Office of Research and Development Quality FLAGS • 4 Star ENHANCED Stereochemistry – chemical name, CAS Number, Molfile and SMILES all consistent PLUS enhanced with correct stereochemistry
  • 26. Office of Research and Development Quality FLAGS • 4 Star – chemical name, CAS Number, Molfile and SMILES all consistent. Stereo ignored. • 3 Star ENHANCED Stereochemistry – consistency between 3 of 4: CAS Number (DSSTox lookup), Molfile, SMILES, Tautomer - PLUS enhanced with correct stereochemistry • 3 Star– consistency between 3 of 4: CAS Number, Molfile, SMILES, Tautomer - stereo ignored
  • 27. Office of Research and Development Quality FLAGS Treated as “Junk Data” in terms of using as training data • 2 Star PLUS – consistency between 2 of 4: chemical name, CAS Number (DSSTox lookup), Molfile, SMILES, Tautomer. Molfile is changed to be consistent if necessary • 2 Star – only two fields internally consistent • 1 star – what’s left • AUTOMATION procedures used across all data – “KNIME” pipelining software for processing SDF files
  • 28. Office of Research and Development Cleaning the data 27 • Models built on “QSAR ready” forms of the curated structures – standardize tautomers, remove stereochemistry, no salts • Compare 3 STAR and BETTER models with entire dataset • Exclusion of 1 and 2 star data removed SOME “outliers” only • Definite cluster not being modeled appropriately - “Applicability domain” – initial model based on ~2500 chemicals
  • 29. Office of Research and Development How did it get this way? • Data was assembled over many years (I assume), by various people and likely heterogeneous processes • Cheminformatics tools of today are leaps and bounds ahead of tools from the 80s period. • Data errors are COMMONPLACE! It took 3 years and a team of 6 people to clean up Wikipedia of the 8% errors! • Just because its commonplace doesn’t mean we shouldn’t improve it!
  • 30. Office of Research and Development Errors in Databases are Common Place Merck Index DailyMedDrugBank Wikipedia Common Chemistry ChEBI Wolfram Alpha KEGG
  • 31. Office of Research and Development KNIME workflow to evaluate the dataset
  • 32. Office of Research and Development Summary: Property Initial file flagged Updated 3-4 STAR Curated QSAR ready AOP 818 818 745 BCF 685 618 608 BioHC 175 151 150 Biowin 1265 1196 1171 BP 5890 5591 5436 HL 1829 1758 1711 KM 631 548 541 KOA 308 277 270 LogP 15809 14544 14041 MP 10051 9120 8656 PC 788 750 735 VP 3037 2840 2716 WF 5764 5076 4836 WS 2348 2046 2010
  • 33. Office of Research and Development Development of a QSAR model • Curation of the data » Flagged and curated files available for sharing • Preparation of training and test sets » Inserted as a field in SDFiles and csv data files • Calculation of an initial set of descriptors » PaDEL 2D descriptors and fingerprints generated and shared • Selection of a mathematical method » Several approaches tested: KNN, PLS, SVM… • Variable selection technique » Genetic algorithm • Validation of the model’s predictive ability » 5-fold cross validation and external test set • Define the Applicability Domain » Local (nearest neighbors) and global (leverage) approaches
  • 34. Office of Research and Development QSARs for regulatory purposes
  • 35. Office of Research and Development The 5 OECD principles: Principle Description 1) A defined endpoint Any physicochemical, biological or environmental effect that can be measured and therefore modelled. 2) An unambiguous algorithm Ensure transparency in the description of the model algorithm. 3) A defined domain of applicability Define limitations in terms of the types of chemical structures, physicochemical properties and mechanisms of action for which the models can generate reliable predictions. 4) Appropriate measures of goodness-of-fit, robustness and predictivity a) The internal fitting performance of a model b) the predictivity of a model, determined by using an appropriate external test set. 5) Mechanistic interpretation, if possible Mechanistic associations between the descriptors used in a model and the endpoint being predicted. The conditions for the validity of QSARs
  • 36. Office of Research and Development NCCT models Prop Vars 5-fold CV (75%) Training (75%) Test (25%) Q2 RMSE N R2 RMSE N R2 RMSE BCF 10 0.84 0.55 465 0.85 0.53 161 0.73 0.65 BP 13 0.93 22.46 4077 0.93 22.06 1358 0.93 22.08 LogP 9 0.85 0.69 10531 0.86 0.67 3510 0.86 0.78 MP 15 0.72 51.8 6486 0.74 50.27 2167 0.73 52.72 VP 12 0.91 1.08 2034 0.91 1.08 679 0.92 1 WS 11 0.87 0.81 3158 0.87 0.82 1066 0.86 0.86 HL 9 0.84 1.96 441 0.84 1.91 150 0.85 1.82
  • 37. Office of Research and Development NCCT models Prop Vars 5-fold CV (75%) Training (75%) Test (25%) Q2 RMSE N R2 RMSE N R2 RMSE AOH 13 0.85 1.14 516 0.85 1.12 176 0.83 1.23 BioHL 6 0.89 0.25 112 0.88 0.26 38 0.75 0.38 KM 12 0.83 0.49 405 0.82 0.5 136 0.73 0.62 KOC 12 0.81 0.55 545 0.81 0.54 184 0.71 0.61 KOA 2 0.95 0.69 202 0.95 0.65 68 0.96 0.68 BA Sn-Sp BA Sn-Sp BA Sn-Sp R-Bio 10 0.8 0.82-0.78 1198 0.8 0.82-0.79 411 0.79 0.81-0.77
  • 38. Office of Research and Development LogP Model: Weighted kNN Model, 9 descriptors 37 Weighted 5-nearest neighbors 9 Descriptors Training set: 10531 chemicals Test set: 3510 chemicals 5 fold Cross-validation: Q2=0.85 RMSE=0.69 Fitting: R2=0.86 RMSE=0.67 Test: R2=0.86 RMSE=0.78
  • 39. Office of Research and Development Applicability Domain 38 EPISuite Predicted vs. Experimental EPISuite vs kNN Comparison of Cluster Applicability Domain of original EPI Suite LogP Model (accuracy of the global model is an issue) Example 280 compound cluster outside applicability domain of EPI Suite compared to new model predictions
  • 40. Office of Research and Development Standalone application: Input: –MATLAB .mat file, an ASCII file with only a matrix of variables –SDF file or SMILES strings of QSAR-ready structures. In this case the program will calculate PaDEL 2D descriptors and make the predictions. • The program will extract the molecules names from the input csv or SDF (or assign arbitrary names if not) As IDs for the predictions. Output • Depending on the extension, the can be text file or csv with –A list of molecules IDs and predictions –Applicability domain –Accuracy of the prediction –Similarity index to the 5 nearest neighbors –The 5 nearest neighbors from the training set: Exp. value, Prediction, InChi key
  • 41. Office of Research and Development NCCT Model Predictions • We want to provide WEB-access to the cleaned data, the predicted data for all chemicals and ultimately real time models. CLEAN data once and use many times! • QSAR prediction models (kNN) produced for all properties • 700k chemical structures pushed through NCCT_Models • Each property now available on the dashboard –Original EPISuite predicted value where available –3 and 4 star quality experimental data displayed –Predicted data displayed 40
  • 42. Office of Research and Development Examples – Glyphosate (logP)
  • 43. Office of Research and Development Examples - Bisphenol S (MP)
  • 44. Office of Research and Development Reporting Outcome • Presently authoring: “Examining the impact of Quality vs Quantity on Predictive Models for PhysChem” • Supplementary data will include appropriate files with flags – full dataset plus QSAR ready form • Full performance statistics available for all models • Models will be deployed as prediction engines in the future – one chemical at a time and batch processing (to be done after RapidTox Project)
  • 45. Office of Research and Development Small Data Sets limit Applicability Domain • Some EPI Suite data sets are quite small…a few thousand points. There is a lot of Open Data now • EXAMPLE: 200,000 melting points and 13,000 pyrolysis data points • Modeling this much data is a challenge! 44
  • 46. Office of Research and Development Adding MORE Data • Collect “Open Data” for various endpoints from:  PubChem  eChemPortal  Open Data Sources • Data already harvested and awaiting processing • New models to be produced from expanded data sets 45
  • 47. Office of Research and Development Conclusions • We will rebuild existing prototypes on this new architecture based on priorities • Componentized architecture approach will be iterated as/if we hit limitations based on projects • Much work to do with key activities on the API to access data and algorithms • Our Focus is “Build Once, Use Many Times” – for our projects but made available to others to use also • This will allow rapid prototyping as we progress