Annual NCCT Retreat
Antony Williams*, Kamel Mansouri, Ann Richard and
Chris Grulke
The needs for chemistry standards,
database tools and data curation at
the chemical-biology interface
January, 2016
SLAS 2016, San Diego, CA
This work was reviewed by EPA and approved for presentation but does not necessarily reflect official Agency policy.
Speakers who “experience” data quality
• Ensuring Chemical Structure, Biological Data and Computational Model Quality - Sean Ekins
• Bioassay Variability and Reliability in the Published and Patent Literature - John Overington
• Machine Learning to Optimize Experiments: Why Have One Model When You Could Have Thousands? - Alex Clark
Why does Quality Matter?
What is the Correct Structure of Vitamin K?
Merck Index
DailyMed
DrugBank
Wikipedia
Common Chemistry
ChEBI
Wolfram Alpha
KEGG
Experiences over the years
Sustainable Chemistry: Human Health Hazard, Ecosystems Hazard, Environmental Persistence
Sustainable chemistry relates to the design and use of chemicals that minimize impacts to human health, ecosystems and the environment.
• How can predictive tools be used to screen for potential impacts of chemicals early in the product development process?
• How can chemical and bioassay data inputs be combined to screen and prioritize testing of thousands of existing chemicals lacking data?
• How do chemical transformations impact the hazard potential and environmental persistence?
ToxRefDB
ToxCast
DSSTox
ACToR
Tox21
Chemical Structure Annotation
Chemical Management
Chemical QC
NCCT Publicly Available Data
http://www.epa.gov/ncct
Chemistry as a Data Foundation
• Chemical Inventories & Data Sources (ACToR)
• Chemical Structures
• CASRN & Chemical Names
EPA SRS (TSCA, EcoTox)
EDSP21
ToxCast/Tox21
CPCAT
ToxRef
1:1 mapping CAS:Name:structure
Quality scores
>700K substances
CSS Dashboards
MOA & Knowledge-informed features & chemotypes
ToxCast/Tox21 HTS activities
CAS look-up
New list error-checking
Chemical search by CAS-names, SMILES, Structures
Structure file downloads
PK/ADME model inputs & outputs
Structure “cleaning”
SAR-ready structure files
Structure-based predictions
Similarity searching
Fingerprints, feature sets
QSAR models; phys-chem properties
Phys-chem calc & measured properties
Structure
Generic Substance
Test Sample
Chemical Name
CASRN
Supplier, Lot/Batch, physical description
Features
Properties
Chemotypes, fingerprints, charges, phys-chem properties, etc.
SMILES
InChI
Increasing data aggregation
Chemical Elements to Quality Data Integration – Chemical “Representations”
Diverse quality in databases
• Our challenges in assembling a database
– Sourcing high quality data – sorting wheat from chaff
– How to mesh data together – based on structure? On name? On CAS number? Other identifier?
– Checking self-consistency of data?
– Structure validation versus property value validation – very different challenges
iCSS Chemistry Dashboard
• Deliver a web-based platform hosting
prediction models – both in-house and
“community models”
• Initial set of models to be based on
EPISuite’s PHYSPROP datasets – logP,
BP, MP, Wsol etc.
• A question of “data
quality” - a project to
review data initiated.
KOWWIN: 2447 Training Compounds
Overall Predicted vs. Experimental
(>15k chemicals)
Original EPISuite Data
• Sourced the SDF files online as downloads
• Basic manual review, searching for errors:
– hypervalency, charge imbalance, undefined stereo
– deduplication
– mismatches between identifiers
• CAS Numbers not matching structure
• Names not matching structure
• Collisions between identifiers
Incorrect valences
Differences in Values (MP)
Salts
Collisions in Records
• Same structure depictions (Molfiles) -
different CAS, different names, different
SMILES
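A minimal sketch of how collisions of this kind might be flagged, assuming each record is a dictionary and that the structure has already been reduced to a canonical string key by some toolkit. The field names (`structure`, `casrn`, `name`), the `find_collisions` helper and the sample records are hypothetical and are not taken from the EPISuite files; the second CASRN is a deliberately mistyped value.

```python
from collections import defaultdict


def find_collisions(records, key_field, value_field):
    """Return keys that map to more than one distinct value.

    records: iterable of dicts (hypothetical record layout).
    key_field / value_field: names of the two identifiers to compare.
    """
    seen = defaultdict(set)
    for record in records:
        key = record.get(key_field)
        value = record.get(value_field)
        # Skip records where either identifier is missing or empty.
        if key and value:
            seen[key].add(value)
    # Keep only keys associated with two or more different values.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical sample: the same structure listed under two CASRNs and two names.
records = [
    {"structure": "CCO", "casrn": "64-17-5", "name": "ethanol"},
    {"structure": "CCO", "casrn": "64-17-6", "name": "ethyl alcohol"},
    {"structure": "C", "casrn": "74-82-8", "name": "methane"},
]

cas_collisions = find_collisions(records, "structure", "casrn")
name_collisions = find_collisions(records, "structure", "name")
print(cas_collisions)
print(name_collisions)
```

Swapping the two field arguments checks the opposite direction, for example one CASRN attached to several different structures.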
Identifiers
• Many chemical names are
truncated
• Many chemicals don’t
have CAS Numbers
KNIME workflow to evaluate dataset
15,809 chemicals in the KOWWIN dataset
• CAS Checksum: 12163 valid, 3646 invalid
• Invalid names: 555
• Invalid SMILES: 133
• Valence errors: 322 Molfile, 3782 SMILES
• Duplicates check:
– 31 duplicate Molfiles
– 626 duplicate SMILES
– 531 duplicate names
• SMILES vs. Molfiles (structure check)
– 1279 differ in stereochemistry
– 362 “Covalent Halogens”
– 191 differ as tautomers
– 436 are different compounds
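The CAS checksum test listed above can be illustrated with a short validator. This is a sketch in Python, not the KNIME node used in the project, and the function name `is_valid_casrn` is an assumption. It applies the standard CASRN check-digit rule: the digits before the check digit are weighted 1, 2, 3, ... from right to left, and the weighted sum modulo 10 must equal the check digit.

```python
import re

# Up to 7 digits, a hyphen, 2 digits, a hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is a well-formed CASRN with a correct check digit."""
    match = CAS_PATTERN.match(casrn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_casrn("7732-18-5"))  # water
print(is_valid_casrn("7732-18-4"))  # same number with a wrong check digit
```

A check like this only confirms that a number is internally consistent; it cannot confirm that the CASRN belongs to the structure or name in the same record.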
Statistics and data quality flags inserted
• 4 Stars ENHANCED: 4 levels of consistency with stereo information
• 4 Stars: 4 levels of consistency, stereo ignored.
• 3 Stars PLUS: 3 out of 4 levels. The 4th is a tautomer.
• 3 Stars ENHANCED: 3 levels of consistency with stereo information
• 3 Stars: 3 levels of consistency, stereo ignored.
• 2 Stars PLUS: 2 out of 4 levels. The 3rd is a tautomer.
• 1 Star: What's left.
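The flag scheme above could be expressed as a small decision function. This is a sketch under stated assumptions, not the project's implementation: the function name and its three inputs are hypothetical, "3 Stars ENHANCED" is read as three consistent levels with stereo information, the PLUS rule is checked before the ENHANCED rule simply because it is listed first, and two consistent levels without a tautomer match fall into "1 Star" because no plain two-star flag is listed.

```python
def quality_flag(levels_consistent: int,
                 stereo_consistent: bool,
                 next_is_tautomer: bool) -> str:
    """Assign a quality flag to one record.

    levels_consistent: how many of the 4 identifier levels agree (assumed input).
    stereo_consistent: True if the agreement holds with stereo information.
    next_is_tautomer: True if the first disagreeing level differs only as a tautomer.
    """
    if levels_consistent >= 4:
        return "4 Stars ENHANCED" if stereo_consistent else "4 Stars"
    if levels_consistent == 3:
        if next_is_tautomer:
            return "3 Stars PLUS"
        return "3 Stars ENHANCED" if stereo_consistent else "3 Stars"
    if levels_consistent == 2 and next_is_tautomer:
        return "2 Stars PLUS"
    # Everything else is "what's left".
    return "1 Star"


print(quality_flag(4, True, False))
print(quality_flag(3, False, True))
print(quality_flag(2, False, False))
```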
Quality flags inserted into data
Standardization of structures
• Explicit hydrogen removal
• Removal of chirality info, isotopes and pseudo-atoms
• Aromatization + add explicit hydrogen atoms
• Standardize Nitro groups
• Other tautomerization/mesomerization checking
• Neutralize (when possible)
Mesomerization/tautomerization
• Azide mesomers
• Exo-enol tautomers
• Enamine-Imine tautomers
• Ynol-ketene tautomers
• ….
Neutralize Structures
Building the Models
• Interested in “impact of quality vs. quantity of data on
prediction models”
• Is it worth the effort to clean and enhance data
underpinning the models?
• QSAR ready forms for modeling – standardize
tautomers, remove stereochemistry, no salts
• Compare 3 STAR and BETTER models with entire
dataset
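As a rough illustration of two of the "QSAR ready" steps named above (no salts, no stereochemistry), the sketch below works directly on SMILES strings. It is a crude string-level heuristic written for this example, not the workflow used in the project: the function name `qsar_ready_smiles` is hypothetical, the longest dot-separated fragment is assumed to be the parent compound, and tautomer standardization and neutralization are not attempted. A real pipeline would use a cheminformatics toolkit.

```python
def qsar_ready_smiles(smiles: str) -> str:
    """Return a simplified SMILES with counter-ions and stereo marks removed.

    Assumption: the longest dot-separated fragment is the parent compound.
    """
    # Salts and mixtures are written as fragments separated by dots.
    parent = max(smiles.split("."), key=len)
    # Remove chirality marks (@) and double-bond direction marks (/ and \).
    for mark in ("@", "/", "\\"):
        parent = parent.replace(mark, "")
    return parent


print(qsar_ready_smiles("C[C@H](N)C(=O)O.Cl"))  # amino acid hydrochloride
print(qsar_ready_smiles("F/C=C/F"))             # E/Z information dropped
```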
What about the “cluster”?
Weighted kNN Model, 10 descriptors
Weighted 5-nearest neighbors
Training: 11251 chemicals
Test set: 2798 chemicals
5-fold cross-validation:
R2: 0.87
RMSE: 0.67
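A minimal sketch of a weighted k-nearest-neighbors regression and the two statistics reported above. The slides do not state the weighting scheme or the descriptors, so inverse Euclidean distance weights, the function names and the one-descriptor toy data are all assumptions made for illustration.

```python
import math


def weighted_knn_predict(train_x, train_y, query, k=5):
    """Predict a value as the inverse-distance-weighted mean of the k nearest points."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    # An exact match has zero distance; return its value to avoid dividing by zero.
    for distance, value in nearest:
        if distance == 0.0:
            return value
    weights = [1.0 / distance for distance, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy data with a single descriptor per chemical (hypothetical values).
train_x = [(0.0,), (1.0,), (3.0,)]
train_y = [0.0, 1.0, 3.0]
prediction = weighted_knn_predict(train_x, train_y, (2.0,), k=2)
print(prediction)
```

In a cross-validation setting, the same prediction function would be applied to each held-out fold and the pooled predictions passed to `rmse` and `r_squared`.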
Global vs Local Models???
280 compound cluster
EPISuite Predicted vs. Experimental
EPISuite vs kNN: Comparison of Cluster
The Applicability Domain of the Original EPISuite
The global model was an issue…
Bringing together PhysChem Data
• Additional Experimental Data being assembled
• PubChem
• eChemPortal
• Open Data Sources
Open Data Set - example
• 200,000 melting points and 13,000 pyrolysis data points
• Modeling this much data is a challenge!
• Accessibility to data?
– Figshare.com
– DataDryad.com
– Institutional Repositories
– Publishers?
iCSS Chemistry Dashboard
Releasing in March 2016
• Links external resources: EPA, NIH, property predictors
Toxicity Estimation Software Tool
http://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test
Acknowledgements
Contributors to the iCSS Chemistry Dashboard
Dave Lyons
Kamel Mansouri
Ann Richard
Aria Smith
Jennifer Smith
Indira Thillainadarajah
Jeff Edwards
Jeremy Fitzpatrick
Jordan Foster
Chris Grulke
Jason Harris
Conclusion
• Data quality impacts modeling
• Sourcing high quality data is difficult
• Public domain and open data are valuable but require validation
• Modern cheminformatics approaches can help validate and prepare data for modeling
• There are increasing efforts to source and release high quality, open data to the world

More Related Content

What's hot

OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...Kamel Mansouri
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsSean Ekins
 

What's hot (20)

Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
 
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
 
Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...
 
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
 
Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings
 
Our dire need to mandate data standards and expectations for scientific publi...
Our dire need to mandate data standards and expectations for scientific publi...Our dire need to mandate data standards and expectations for scientific publi...
Our dire need to mandate data standards and expectations for scientific publi...
 
Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...
 
ChemSpider as an integration hub for interlinked chemistry data
ChemSpider as an integration hub for interlinked chemistry dataChemSpider as an integration hub for interlinked chemistry data
ChemSpider as an integration hub for interlinked chemistry data
 
Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
 
Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...
 
A chemistry data repository to serve them all
A chemistry data repository to serve them allA chemistry data repository to serve them all
A chemistry data repository to serve them all
 
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
 
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
Development of a Tool for Systematic Integration of Traditional and New Appro...
Development of a Tool for Systematic Integration of Traditional and New Appro...Development of a Tool for Systematic Integration of Traditional and New Appro...
Development of a Tool for Systematic Integration of Traditional and New Appro...
 
New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...
 
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
 
Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...
 

Viewers also liked

Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...Yue Liao
 
From Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki ProjectFrom Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki ProjectJoel Gehman
 
SMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic ManagementSMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic ManagementJoel Gehman
 
Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...NASIG
 
Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...lisbk
 
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...Joel Gehman
 

Viewers also liked (15)

How One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online ChemistryHow One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online Chemistry
 
NSF Data Management Requirements 101
NSF Data Management Requirements 101NSF Data Management Requirements 101
NSF Data Management Requirements 101
 
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
 
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
 
From Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki ProjectFrom Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki Project
 
SMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic ManagementSMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic Management
 
Social Media Tools for Scientists and Building an Online Profile
Social Media Tools for Scientists and Building an Online ProfileSocial Media Tools for Scientists and Building an Online Profile
Social Media Tools for Scientists and Building an Online Profile
 
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
 
Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...
 
Building an Online Profile Using Social Networking and Amplification Tools fo...
Building an Online Profile Using Social Networking and Amplification Tools fo...Building an Online Profile Using Social Networking and Amplification Tools fo...
Building an Online Profile Using Social Networking and Amplification Tools fo...
 
Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...
 
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 

Similar to The needs for chemistry standards, database tools and data curation at the chemical-biology interface

The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...Kamel Mansouri
 
Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Kamel Mansouri
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...Kamel Mansouri
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsKen Karapetyan
 
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...Kamel Mansouri
 
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...Kamel Mansouri
 

Similar to The needs for chemistry standards, database tools and data curation at the chemical-biology interface (20)

The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...
 
Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
 
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
 
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
 
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted AnalysisThe US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
 
Structure standardization approaches for mass spectrometry data integration
Structure standardization approaches for  mass spectrometry data integrationStructure standardization approaches for  mass spectrometry data integration
Structure standardization approaches for mass spectrometry data integration
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
eScience at the Royal Society of Chemistry and our current initiatives
eScience at the Royal Society of Chemistry and our current initiativeseScience at the Royal Society of Chemistry and our current initiatives
eScience at the Royal Society of Chemistry and our current initiatives
 
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
 
Cheminformatics approaches to support chemical identification delivered via t...
Cheminformatics approaches to support chemical identification delivered via t...Cheminformatics approaches to support chemical identification delivered via t...
Cheminformatics approaches to support chemical identification delivered via t...
 
ChemValidator – an online service for validating and standardizing chemical s...
ChemValidator – an online service for validating and standardizing chemical s...ChemValidator – an online service for validating and standardizing chemical s...
ChemValidator – an online service for validating and standardizing chemical s...
 
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
 
Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...
 
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
 
Connecting Chemistry Across the Internet Using ChemSpider
Connecting Chemistry Across the Internet Using ChemSpiderConnecting Chemistry Across the Internet Using ChemSpider
Connecting Chemistry Across the Internet Using ChemSpider
 
Serving the medicinal chemistry community with Royal Society of Chemistry che...
Serving the medicinal chemistry community with Royal Society of Chemistry che...Serving the medicinal chemistry community with Royal Society of Chemistry che...
Serving the medicinal chemistry community with Royal Society of Chemistry che...
 

Recently uploaded

9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)AkefAfaneh2
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
The needs for chemistry standards, database tools and data curation at the chemical-biology interface

  • 1. Annual NCCT Retreat. Antony Williams*, Kamel Mansouri, Ann Richard and Chris Grulke. The needs for chemistry standards, database tools and data curation at the chemical-biology interface. January, 2016. SLAS 2016, San Diego, CA. This work was reviewed by EPA and approved for presentation but does not necessarily reflect official Agency policy.
  • 2. Speakers who “experience” data quality • Ensuring Chemical Structure, Biological Data and Computational Model Quality - Sean Ekins • Bioassay Variability and Reliability in the Published and Patent Literature - John Overington • Machine Learning to Optimize Experiments: Why Have One Model When You Could Have Thousands? - Alex Clark
  • 4. What is the Correct Structure of Vitamin K? Merck Index, DailyMed, DrugBank, Wikipedia, Common Chemistry, ChEBI, Wolfram Alpha, KEGG
  • 7. Sustainable Chemistry: Human Health Hazard, Ecosystems Hazard, Environmental Persistence. Sustainable chemistry relates to the design and use of chemicals that minimize impacts to human health, ecosystems and the environment. • How can predictive tools be used to screen for potential impacts of chemicals early in the product development process? • How can chemical and bioassay data inputs be combined to screen and prioritize testing of thousands of existing chemicals lacking data? • How do chemical transformations impact the hazard potential and environmental persistence?
  • 8. NCCT Publicly Available Data (http://www.epa.gov/ncct): ToxRefDB, ToxCast, DSSTox, ACToR, Tox21. Chemical Structure Annotation, Chemical Management, Chemical QC.
  • 9. Chemistry as a Data Foundation. Chemical Inventories & Data Sources (ACTOR); Chemical Structures; CASRN & Chemical Names. EPA SRS (TSCA, EcoTox); EDSP21; ToxCast/Tox21; CPCAT; ToxRef. 1:1 mapping CAS:Name:structure; Quality scores; >700K substances. CSS Dashboards. MOA & Knowledge-informed features & chemotypes; ToxCast/Tox21 HTS activities. CAS look-up; New list error-checking; Chemical search by CAS-names; SMILES Structures; Structure file downloads; PK/ADME model inputs & outputs; Structure “cleaning”; SAR-ready structure files; Structure-based predictions; Similarity searching; Fingerprints, feature sets; QSAR models; phys-chem properties; Phys-chem calc & measured properties
  • 10. Chemical Elements to Quality Data Integration – Chemical “Representations”. Structure; Generic Substance; Test Sample; Chemical Name; CASRN; Supplier, Lot/Batch, physical description; Features; Properties; Chemotypes, fingerprints, charges, phys-chem properties, etc.; SMILES; InChI. Increasing data aggregation.
  • 11. Diverse quality in databases • Our challenges in assembling a database – Sourcing high quality data – sorting wheat from chaff – How to mesh data together – based on structure? On name? On CAS number? Other identifier? – Checking self-consistency of data? – Structure validation versus property value validation – very different challenges
  • 12. iCSS Chemistry Dashboard • Deliver a web-based platform hosting prediction models – both in-house and “community models” • Initial set of models to be based on EPISuite’s PHYSPROP datasets – logP, BP, MP, Wsol etc. • A question of “data quality” - a project to review data initiated.
  • 14. Overall Predicted vs. Experimental (>15k chemicals)
  • 15. Original EPISuite Data • Sourced the SDF files online as downloads • Basic Manual review searching for errors: – hypervalency, charge imbalance, undefined stereo – deduplication – mismatches between identifiers • CAS Numbers not matching structure • Names not matching structure • Collisions between identifiers
  • 18. Salts
  • 19. Collisions in Records • Same structure depictions (Molfiles) - different CAS, different names, different SMILES
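The collision problem on this slide (one structure carrying several different identifiers) can be illustrated with a small grouping routine. This is only a sketch under stated assumptions, not the workflow used in the talk: the record layout, the field names `structure` and `casrn`, and the function name `find_collisions` are hypothetical, and the structure strings are assumed to be already canonicalized so that equal structures compare equal as text.

```python
from collections import defaultdict


def find_collisions(records, key="structure", identifier="casrn"):
    """Return {structure: sorted identifiers} for structures with >1 identifier.

    Assumes `records` is a list of dicts and that the values under `key`
    are already canonical (e.g. canonical SMILES or InChI strings).
    """
    groups = defaultdict(set)
    for record in records:
        groups[record[key]].add(record[identifier])
    # Keep only structures that are attached to more than one identifier.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Illustrative records; "0000-00-0" is a made-up placeholder, not a real CASRN.
records = [
    {"structure": "CCO", "casrn": "64-17-5", "name": "ethanol"},
    {"structure": "CCO", "casrn": "0000-00-0", "name": "ethyl alcohol"},
    {"structure": "O", "casrn": "7732-18-5", "name": "water"},
]
collisions = find_collisions(records)
```

Swapping the `key` and `identifier` arguments gives the reverse check: one CAS number attached to several structures.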
  • 20. Identifiers • Many chemical names are truncated • Many chemicals don’t have CAS Numbers
  • 21. KNIME workflow to evaluate dataset
  • 22. 15,809 chemicals in the KOWWIN dataset • CAS Checksum: 12163 valid, 3646 invalid • Invalid names: 555 • Invalid SMILES: 133 • Valence errors: 322 Molfile, 3782 SMILES • Duplicates check: – 31 duplicate Molfiles – 626 duplicate SMILES – 531 duplicate names • SMILES vs. Molfiles (structure check): – 1279 differ in stereochemistry – 362 “Covalent Halogens” – 191 differ as tautomers – 436 are different compounds • Statistics and data quality flags inserted
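The "CAS Checksum" test counted above relies on the check digit built into every CAS Registry Number: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. A minimal Python sketch of that check follows. The function name `cas_checksum_ok` is hypothetical, and this is not the KNIME node used in the talk.

```python
import re

# CASRN layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(casrn: str) -> bool:
    """Return True if `casrn` is well formed and its check digit is correct."""
    match = CAS_PATTERN.match(casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit
```

For example, water (7732-18-5) gives 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 = 5, which matches the check digit. A passing checksum only shows the number is internally consistent; it does not show that the number belongs to the structure in the record.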
  • 23. Quality flags inserted into data • 4 Stars ENHANCED: 4 levels of consistency with stereo information • 4 Stars: 4 levels of consistency, stereo ignored • 3 Stars PLUS: 3 out of 4 levels; the 4th is a tautomer • 3 Stars ENHANCED: 3 levels of consistency with stereo information • 3 Stars: 3 levels of consistency, stereo ignored • 2 Stars PLUS: 2 out of 4 levels; the 3rd is a tautomer • 1 Star: what's left
  • 24. Standardization of structures • Explicit hydrogen removal • Removal of chirality info, isotopes and pseudo-atoms • Aromatization + add explicit hydrogen atoms • Standardize Nitro groups • Other tautomerization/mesomerization checking • Neutralize (when possible)
  • 25. Mesomerization/tautomerization • Azide mesomers • Exo-enol tautomers • Enamine-Imine tautomers • Ynol-ketene tautomers • ….
  • 27. Building the Models • Interested in “impact of quality vs. quantity of data on prediction models” • Is it worth the effort to clean and enhance data underpinning the models? • QSAR ready forms for modeling – standardize tautomers, remove stereochemistry, no salts • Compare 3 STAR and BETTER models with entire dataset
  • 28. What about the “cluster”?
  • 29. Weighted kNN Model, 10 descriptors. Weighted 5-nearest neighbors. Training: 11251 chemicals. Test set: 2798 chemicals. 5-fold cross validation: R2: 0.87, RMSE: 0.67
  • 30. Global vs Local Models??? 280 compound cluster. EPISuite Predicted vs. Experimental. EPISuite vs kNN Comparison of Cluster. The Applicability Domain of the Original EPISuite. The global model was an issue…
  • 31. Bringing together PhysChem Data • Additional Experimental Data being assembled • PubChem • eChemPortal • Open Data Sources
  • 32. Open Data Set - example • 200,000 melting points and 13,000 pyrolysis data points • Modeling this much data is a challenge! • Accessibility to data? – Figshare.com – DataDryad.com – Institutional Repositories – Publishers?
  • 35. iCSS Chemistry Dashboard • Links external resources: EPA, NIH, property predictors • Releasing in March 2016
  • 36. Toxicity Estimation Software Tool http://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test
  • 38. Acknowledgements: Contributors to the iCSS Chemistry Dashboard: Dave Lyons, Kamel Mansouri, Ann Richard, Aria Smith, Jennifer Smith, Indira Thillainadarajah, Jeff Edwards, Jeremy Fitzpatrick, Jordan Foster, Chris Grulke, Jason Harris
  • 39. Conclusion • Data quality impacts modeling • Sourcing high quality data is difficult • Public domain and open data is valuable but requires validation • Modern cheminformatics approaches can help validate and prepare data for modeling • There are increasing efforts to source and release high quality, open data to the world
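For readers unfamiliar with the approach named on this slide, a weighted k-nearest-neighbors regression and the two reported metrics (R2 and RMSE) can be sketched in plain Python. The slide does not state the distance measure or weighting scheme, so Euclidean distance and inverse-distance weights are assumptions here. The function names and the small `eps` constant are likewise hypothetical, and descriptors are assumed to be numeric tuples that are already scaled.

```python
import math


def weighted_knn_predict(train_X, train_y, query, k=5, eps=1e-9):
    """Predict a property as the inverse-distance-weighted mean of the k nearest neighbors."""
    # Pair each training point's Euclidean distance to the query with its value.
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    # eps avoids division by zero when the query coincides with a training point.
    weights = [1.0 / (distance + eps) for distance, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

In a 5-fold cross validation, the training set would be split into five parts, each part predicted from the other four, and the two metrics computed over the pooled held-out predictions.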