SlideShare a Scribd company logo
1 of 52
Chemistry data: Distortion and
dissemination in the Internet Era
Antony Williams1, Chris Grulke1, Katie Paul-Friedman1,
and Emma Schymanski2
1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC
2) Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, Luxembourg
August 2018
ACS Fall Meeting, Boston
http://www.orcid.org/0000-0002-2668-4821
The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
CompTox Dashboard
https://comptox.epa.gov/dashboard
1
Underneath the Dashboard
2
DSSTox_v2
PublicCurated
5. Low
2. Low
6. Untrusted
1. High
3. High
4. Med
7. Incomplete
validated
Public_Untrusted
Public_Low
Public_Medium
Public_High
DSSTox_Low
DSSTox_High 7269
16K
33K
101K
590K
~ 310K pending
~ 150K pending
~ 150K substances in
top 4 QC quality bins
Distribution of curated data
Record Information Quality Flags
4
ChemReg Curation
The ALANWOOD Pesticide Set
5
The CompTox Dashboard for
Structure Identification by MS
6
Collaborative Data Curation
• Mapping between our data (and websites)
has resulted in collaborative data curation
• Collaboration with Emma Schymanski re.
the NORMAN Suspects Exchange
https://www.norman-network.com/?q=node/236
• It has highlighted data quality issues – and
these are not unexpected!
7
NORMAN Suspect Exchange
8
http://www.norman-network.com/?q=node/236
Example: NORMAN Priority List
9
Mapping on Two Identifiers
Lookup Based on Name
10
Mapping on Two Identifiers
Lookup Based on CASRN
11
Mapping Quality Control
12
Example: NORMAN Priority List
13
Example: NORMAN Priority List
14
>23 NORMAN Lists Available
15
Aggregating Data
• We aggregate data from various sources
• Downloading is “easy”
• Databasing is “easy”
• Data curation is a laborious and painful task
16
Online Today
17
We Curated These Public Data to
Build Prediction Models
Public data should be curated prior to modeling
Different Compounds
Duplicate Structures
19
Covalent Halogens
20
Property Initial file Curated Data Curated QSAR ready
AOP 818 818 745
BCF 685 618 608
BioHC 175 151 150
Biowin 1265 1196 1171
BP 5890 5591 5436
HL 1829 1758 1711
KM 631 548 541
KOA 308 277 270
LogP 15809 14544 14041
MP 10051 9120 8656
PC 788 750 735
VP 3037 2840 2716
WF 5764 5076 4836
WS 2348 2046 2010
Curation to QSAR Ready Files
LogP dataset:
15,809 structures
• CAS Checksum: 12163 valid, 3646 invalid (>23%)
• Invalid names: 555
• Invalid SMILES 133
• Valence errors: 322 Molfile, 3782 SMILES (>24%)
• Duplicates check:
– 31 DUPLICATE MOLFILES
– 626 DUPLICATE SMILES
– 531 DUPLICATE NAMES
• SMILES vs. Molfiles (structure check)
– 1279 differ in stereochemistry (~8%)
– 362 “Covalent Halogens”
– 191 differ as tautomers
– 436 are different compounds (~3%)
Workflow Details and Data
23
OPERA Models: https://github.com/kmansouri/OPERA
Sourcing Chemical Hazard Data
24
Hazard Data from “ToxVal_DB”
• ToxVal Database contains following data:
–30,050 chemicals
–772,721 toxicity values
–29 sources of data
–21,507 sub-sources
–4585 journals cited
–69,833 literature citations
25
How can we curate our data?
• Crowdsourcing is well proven nowadays
• Comments can be added at a record level
• Submitted comments are reviewed by
administrators and responded to
26
Public Crowdsourced Comments
https://comptox.epa.gov/dashboard/comments/public_index
27
Reviewer comments are public
28
Comments to date
• The majority of comments to date:
– Structure and names/CASRN do not match
– Add additional synonyms
– Request to add specific property data
– Structure layout/depiction needs improving
29
Crowdsourcing Comments
Single Cell Commenting added
• Highlight an alphanumeric text string
30
Crowdsourcing Comments
31
In Vitro Bioassay Screening
ToxCast and Tox21
32
Bioactivity Data
• 100s of thousands of
bioactivity curves to
review
• Impossible to review
every one manually
• Now accepting public
Crowdsourced
Comments
• Public crowdsourcing
will not suffice!!!
33
Internal Review of 25,000 curves
34
Screenshot of entry page for Beta R Shiny Application for NCCT users
Brown & Paul-Friedman
Internal Review of 25,000 curves
A “good fit” bioactivity curve
35
Internal Review of 25,000 curves
36
…and gain-loss fit with a cell
viability assay (makes little sense)
Single-Point in middle of concentration range
drives ACTIVE Hit Call
Internal Review of 25,000 curves
Abnormally High-Noise
37
Internal Curve Review Results
• Internal curve review has resulted in:
– Instances of correction of fitting procedures in the ToxCast Pipeline
– Identification of issues with source data
– Identification of additional flags or filters that could be used,
depending on the application of ToxCast data
– a beta implementation of quality assurance for HTS data
– Brown & Paul-Friedman, Uncertainty in ToxCast Curve-Fitting: Quantitative and
Qualitative Descriptors Inform a Model to Predict Reproducible Fits (in
preparation)
38
Chemical structures as text
39
How do I extract structures?
• Copy-Paste doesn’t work
• This is not the way a publisher should deliver
chemistry. But this is on the AUTHOR!
40
Extract Structure Drawings???
41
Try hand-drawing Algal Toxins!
42
Think of files in multiple formats!
• SMILES are hyper-dependent on good
layout algorithms. It’s not easy!
43
[H][C@]12OC3(C[C@
@H](C)[C@H]1O)C[C
@@]1([H])CCCC4(CC
C5(O4)O[C@@]([H])(
CC[C@@]5(C)O)CC(=
C)CCCC4=NC[C@H](
C)[C@@H](C)CC44C
CC(=C[C@]4([H])[C@]
2([H])O3)C(O)=O)O1
Names and CASRNs
are NOT structures
• In our domain most chemicals are text –
chemical names and CAS Numbers
44
And generally problematic…
45
Active vs Deleted CASRN
46
Tricky mapping by CASRN
This one has 316 Deleted CASRN
47
PLEASE use our curated data
48
Thirty Years of Experience
• What has changed?
– Cheminformatics has progressed
– The internet proliferates data access
– Standards have progressed (InChI) and are improving
– Anybody can access/download millions of structures!
• What hasn’t changed?
– Minor progress presenting structures in publications
– Delivery via PDFs still dominates
– Mandates on scientists are very unlikely
49
Conclusion
• The CompTox Dashboard provides access to data for
~760,000 chemicals
• Sourced and aggregated data from multiple resources
• Data quality is very random, source-to-source.
• We work hard to curate data - PLEASE USE IT!
• Our data is not perfect - we have a long way to go!
50
Contact
Antony Williams
US EPA Office of Research and Development
National Center for Computational Toxicology (NCCT)
Williams.Antony@epa.gov
ORCID: https://orcid.org/0000-0002-2668-4821
51

More Related Content

What's hot

What's hot (20)

Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
 
Web-based access to data for >600 disinfection by-products via the EPA CompTo...
Web-based access to data for >600 disinfection by-products via the EPA CompTo...Web-based access to data for >600 disinfection by-products via the EPA CompTo...
Web-based access to data for >600 disinfection by-products via the EPA CompTo...
 
Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...
 
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
 
How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...
 
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
 
Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...
 
New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...
 
What chemicals constitute the Exposome? Accessing data via the US EPA’s Comp...
What chemicals constitute the Exposome? Accessing data via the US EPA’s  Comp...What chemicals constitute the Exposome? Accessing data via the US EPA’s  Comp...
What chemicals constitute the Exposome? Accessing data via the US EPA’s Comp...
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
 
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
 
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
 
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox Chemicals Dash...
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox  Chemicals Dash...PFAS Chemistry: Range, Complexity, Groupings, and the CompTox  Chemicals Dash...
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox Chemicals Dash...
 
Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
 
The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
 
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
 
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
 

Similar to Chemistry data: Distortion and dissemination in the Internet Era

The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...Kamel Mansouri
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...Kamel Mansouri
 
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...Andrew McEachran
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Kamel Mansouri
 

Similar to Chemistry data: Distortion and dissemination in the Internet Era (20)

The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
 
Delivering web-based access to data and algorithms to support computational t...
Delivering web-based access to data and algorithms to support computational t...Delivering web-based access to data and algorithms to support computational t...
Delivering web-based access to data and algorithms to support computational t...
 
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
 
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
 
Progress in delivering transparency in research data
Progress in delivering transparency in research dataProgress in delivering transparency in research data
Progress in delivering transparency in research data
 
Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards
 
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...
 
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
 
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
 
Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...
 
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...
 
US EPA CompTox Chemicals Dashboard Data Integration Hub to Support Environmen...
US EPA CompTox Chemicals Dashboard Data Integration Hub to Support Environmen...US EPA CompTox Chemicals Dashboard Data Integration Hub to Support Environmen...
US EPA CompTox Chemicals Dashboard Data Integration Hub to Support Environmen...
 
Non-targeted analysis supported by data and cheminformatics delivered via the...
Non-targeted analysis supported by data and cheminformatics delivered via the...Non-targeted analysis supported by data and cheminformatics delivered via the...
Non-targeted analysis supported by data and cheminformatics delivered via the...
 
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
 

Recently uploaded

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)DHURKADEVIBASKAR
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Recombination DNA Technology (Microinjection)
Recombination DNA Technology (Microinjection)Recombination DNA Technology (Microinjection)
Recombination DNA Technology (Microinjection)Jshifa
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 

Recently uploaded (20)

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Recombination DNA Technology (Microinjection)
Recombination DNA Technology (Microinjection)Recombination DNA Technology (Microinjection)
Recombination DNA Technology (Microinjection)
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 

Chemistry data: Distortion and dissemination in the Internet Era

  • 1. Chemistry data: Distortion and dissemination in the Internet Era Antony Williams1, Chris Grulke1, Katie Paul-Friedman1, and Emma Schymanski2 1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC 2) Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, Luxembourg August 2018 ACS Fall Meeting, Boston http://www.orcid.org/0000-0002-2668-4821 The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
  • 4. DSSTox_v2 PublicCurated 5. Low 2. Low 6. Untrusted 1. High 3. High 4. Med 7. Incomplete validated Public_Untrusted Public_Low Public_Medium Public_High DSSTox_Low DSSTox_High 7269 16K 33K 101K 590K ~ 310K pending ~ 150K pending ~ 150K substances in top 4 QC quality bins Distribution of curated data
  • 7. The CompTox Dashboard for Structure Identification by MS 6
  • 8. Collaborative Data Curation • Mapping between our data (and websites) has resulted in collaborative data curation • Collaboration with Emma Schymanski re. the NORMAN Suspects Exchange https://www.norman-network.com/?q=node/236 • It has highlighted data quality issues – and these are not unexpected! 7
  • 11. Mapping on Two Identifiers Lookup Based on Name 10
  • 12. Mapping on Two Identifiers Lookup Based on CASRN 11
  • 16. >23 NORMAN Lists Available 15
  • 17. Aggregating Data • We aggregate data from various sources • Downloading is “easy” • Databasing is “easy” • Data curation is a laborious and painful task 16
  • 19. We Curated These Public Data to Build Prediction Models Public data should be curated prior to modeling Different Compounds
  • 22. Property Initial file Curated Data Curated QSAR ready AOP 818 818 745 BCF 685 618 608 BioHC 175 151 150 Biowin 1265 1196 1171 BP 5890 5591 5436 HL 1829 1758 1711 KM 631 548 541 KOA 308 277 270 LogP 15809 14544 14041 MP 10051 9120 8656 PC 788 750 735 VP 3037 2840 2716 WF 5764 5076 4836 WS 2348 2046 2010 Curation to QSAR Ready Files
  • 23. LogP dataset: 15,809 structures • CAS Checksum: 12163 valid, 3646 invalid (>23%) • Invalid names: 555 • Invalid SMILES 133 • Valence errors: 322 Molfile, 3782 SMILES (>24%) • Duplicates check: – 31 DUPLICATE MOLFILES – 626 DUPLICATE SMILES – 531 DUPLICATE NAMES • SMILES vs. Molfiles (structure check) – 1279 differ in stereochemistry (~8%) – 362 “Covalent Halogens” – 191 differ as tautomers – 436 are different compounds (~3%)
  • 24. Workflow Details and Data 23 OPERA Models: https://github.com/kmansouri/OPERA
  • 26. Hazard Data from “ToxVal_DB” • ToxVal Database contains following data: –30,050 chemicals –772,721 toxicity values –29 sources of data –21,507 sub-sources –4585 journals cited –69,833 literature citations 25
  • 27. How can we curate our data? • Crowdsourcing is well proven nowadays • Comments can be added at a record level • Submitted comments are reviewed by administrators and responded to 26
  • 30. Comments to date • The majority of comments to date: – Structure and names/CASRN do not match – Add additional synonyms – Request to add specific property data – Structure layout/depiction needs improving 29
  • 31. Crowdsourcing Comments Single Cell Commenting added • Highlight an alphanumeric text string 30
  • 33. In Vitro Bioassay Screening ToxCast and Tox21 32
  • 34. Bioactivity Data • 100s of thousands of bioactivity curves to review • Impossible to review every one manually • Now accepting public Crowdsourced Comments • Public crowdsourcing will not suffice!!! 33
  • 35. Internal Review of 25,000 curves 34 Screenshot of entry page for Beta R Shiny Application for NCCT users Brown & Paul-Friedman
  • 36. Internal Review of 25,000 curves A “good fit” bioactivity curve 35
  • 37. Internal Review of 25,000 curves 36 …and gain-loss fit with a cell viability assay (makes little sense) Single-Point in middle of concentration range drives ACTIVE Hit Call
  • 38. Internal Review of 25,000 curves Abnormally High-Noise 37
  • 39. Internal Curve Review Results • Internal curve review has resulted in: – Instances of correction of fitting procedures in the ToxCast Pipeline – Identification of issues with source data – Identification of additional flags or filters that could be used, depending on the application of ToxCast data – a beta implementation of quality assurance for HTS data – Brown & Paul-Friedman, Uncertainty in ToxCast Curve-Fitting: Quantitative and Qualitative Descriptors Inform a Model to Predict Reproducible Fits (in preparation) 38
  • 41. How do I extract structures? • Copy-Paste doesn’t work • This is not the way a publisher should deliver chemistry. But this is on the AUTHOR! 40
  • 44. Think of files in multiple formats! • SMILES are hyper-dependent on good layout algorithms. It’s not easy! 43 [H][C@]12OC3(C[C@ @H](C)[C@H]1O)C[C @@]1([H])CCCC4(CC C5(O4)O[C@@]([H])( CC[C@@]5(C)O)CC(= C)CCCC4=NC[C@H]( C)[C@@H](C)CC44C CC(=C[C@]4([H])[C@] 2([H])O3)C(O)=O)O1
  • 45. Names and CASRNs are NOT structures • In our domain most chemicals are text – chemical names and CAS Numbers 44
  • 47. Active vs Deleted CASRN 46
  • 48. Tricky mapping by CASRN This one has 316 Deleted CASRN 47
  • 49. PLEASE use our curated data 48
  • 50. Thirty Years of Experience • What has changed? – Cheminformatics has progressed – The internet proliferates data access – Standards have progressed (InChI) and are improving – Anybody can access/download millions of structures! • What hasn’t changed? – Minor progress presenting structures in publications – Delivery via PDFs still dominates – Mandates on scientists are very unlikely 49
  • 51. Conclusion • The CompTox Dashboard provides access to data for ~760,000 chemicals • Sourced and aggregated data from multiple resources • Data quality is very random, source-to-source. • We work hard to curate data - PLEASE USE IT! • Our data is not perfect - we have a long way to go! 50
  • 52. Contact Antony Williams US EPA Office of Research and Development National Center for Computational Toxicology (NCCT) Williams.Antony@epa.gov ORCID: https://orcid.org/0000-0002-2668-4821 51

Editor's Notes

  1. Comments from Katie: In terms of verbal comments, you might note that we are simply reviewing the uncertainty related to the curve-fitting itself. Sometime people conflate uncertainty in response, e.g., assay interference due to autofluorescence or cytotoxicity, with curve-fitting uncertainty. These are two distinct sources of uncertainty. An autofluorescent compound should produce a beautiful curve in an assay that measures fluorescence as an indirect marker of some bioactivity. These curves will get an “active” score.  When filtering the data for use – say you only want to know the chemicals that definitively activate some receptor – then you might want to layer in not only the reproducibility of the curve fitting, but also other information such as the cytotoxicity burst, information on autofluorescence or nonspecific protein inhibition, etc. This is probably elementary to you, but it is a point I like to make because so many people think that we should auto-filter data for these different sources of assay interference. I am okay with pre-set options to select for filtering, but different applications will require different levels of stringency in terms of filtering.