Publishing chemical data in public
data repository
Jian Zhang*, Paul Thiessen, Asta Gindulyte, Evan Bolton
256th ACS National Meeting, Boston, August 2018
Outline ..
• PubChem overview
• What data PubChem has
• How to publish your data - case studies
• Automated pipeline
• How to access
• Summary
PubChem … overview and status
• A public chemical data repository – a public data sharing platform
• An open chemistry database
• A chemical information hub
• A data comparison center
• A chemical data index
Compounds: 96,478,070
Substances: 247,243,896
BioAssays: 1,252,901
Tested Compounds: 2,978,541
Tested Substances: 4,994,132
BioActivities: 236,790,496
Protein Targets: 10,854
Gene Targets: 22,108
Data submitters: 623
Countries: 40
Data … chemical-centered and beyond
• Chemical structure – 2D/3D, SMILES, InChI, SDF..
• Property -
• Drug and medication
• Agrochemicals
• Food additives
• Safety and hazards
• Toxicity
• Literature
• Patents
• Bioactivity
• Target
• Natural products
• Pathways
• … more
• Link back to original data
How to submit and publish data in PubChem
• Chemical substance – SDF, SMILES, …
• Bioactivity data – CSV, XML, ASN.1 …
• Annotation – data format varies
Data submission .. Chemical substances
• Data format: SDF, CSV, … submitted through PubChem Upload
• Convert your structure SDF/CSV into the PubChem standard
format
• Provide mapping information for your data file.
https://pubchemdocs.ncbi.nlm.nih.gov/upload-chemicals
Data submission .. Bioactivity data
• Data format: CSV, XML, ASN.1, … submitted through PubChem
Upload
• Use PubChem standard tags for your data (spreadsheet)
• Convert your data into PubChem standard XML/ASN.1
https://pubchemdocs.ncbi.nlm.nih.gov/upload-bioassays
Data submission .. Data specifications
Data submission .. Annotations
• Incoming data format varies … CSV, text, XML, json,
html, tables, images … special parser needed
Data submission .. Case study
Springer Nature submitted over 620k chemical substances
and more than 4 million literature articles/book chapters to
PubChem, yielding over 28 million compound-literature
links.
Data submission .. Case study
Springer Nature data in PubChem..
Data submission .. Case study
Substance data
Literature annotations
Data submission .. Case study
INFOCHEM 2D 1 1.00000 0.00000 0
13 13 0 0 0 0 0 0 0 0999 V2000
0.7894 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.5789 1.3684 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.4210 0.7368 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.1579 0.7368 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-1.5789 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.1579 2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.4210 2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0526 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.4210 -0.6316 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -1.3158 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.4210 -2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7894 -1.3158 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 0 0 0 0
1 3 1 0 0 0 0
3 4 2 0 0 0 0
3 8 1 0 0 0 0
4 5 1 0 0 0 0
4 9 1 0 0 0 0
5 6 2 0 0 0 0
6 7 1 0 0 0 0
7 8 2 0 0 0 0
9 10 2 0 0 0 0
10 11 1 0 0 0 0
11 12 1 0 0 0 0
11 13 1 0 0 0 0
M END
> <PUBCHEM_EXT_DATASOURCE_REGID>
13043826-37657822
> <PUBCHEM_SUBSTANCE_COMMENT>
This substance has the INFOCHEM hashcode: 13043826-37657822
> <PUBCHEM_SUBSTANCE_SYNONYM>
(1e)-N'-(3-cyano-2-pyridinyl)-N,N-dimethylmethanimidamide
$$$$
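The data items that follow `M END` in the record above can be read back with a few lines of Python. This is a minimal sketch, not PubChem's own parser: the function name `parse_sd_tags` is hypothetical, the tag names are the ones shown in the record, and the parser assumes the simple `> <TAG>` header form used there.

```python
import re
from typing import Dict, Optional

# Matches an SD data header such as "> <PUBCHEM_EXT_DATASOURCE_REGID>"
HEADER = re.compile(r">\s*<([^>]+)>")


def parse_sd_tags(record: str) -> Dict[str, str]:
    """Collect the `> <TAG>` data items of one SDF record into a dict."""
    tags: Dict[str, str] = {}
    current: Optional[str] = None
    for line in record.splitlines():
        stripped = line.strip()
        if stripped == "$$$$":  # end-of-record delimiter
            break
        match = HEADER.match(stripped)
        if match:
            current = match.group(1)
            tags[current] = ""
        elif current is not None:
            if stripped == "":  # a blank line ends the current data item
                current = None
            else:
                tags[current] = (tags[current] + "\n" + stripped).strip()
    return tags


# Tail of the record shown above (the atom and bond blocks are omitted here)
record = """M  END
> <PUBCHEM_EXT_DATASOURCE_REGID>
13043826-37657822

> <PUBCHEM_SUBSTANCE_SYNONYM>
(1e)-N'-(3-cyano-2-pyridinyl)-N,N-dimethylmethanimidamide

$$$$
"""

tags = parse_sd_tags(record)
print(tags["PUBCHEM_EXT_DATASOURCE_REGID"])
```

A check like this can help confirm that every record carries a registry ID before the file is uploaded.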
{
"substances": {
"100086048-91340494": [
{
"doi": "10.1007/7355_2017_9",
"relevance": "3.0595"
},
{
"doi": "10.1007/s12010-017-2680-4",
"relevance": "0.241"
},
{
"doi": "10.1007/s12562-018-1226-1",
"relevance": "0.1393"
},
{
"doi": "10.1007/s13197-018-3155-5",
"relevance": "0.1366"
},
{
"doi": "10.1007/s40278-018-48591-z",
"relevance": "2.923"
},
{
"doi": "10.1186/s12866-018-1205-9",
"relevance": "0.5178"
},
{
"citations": [
{
"10.1007/978-3-662-55935-2_34": {
"documentType": "book chapter",
"journal": "Periphere arterielle Interventionen",
"language": "De",
"openArticle": " ",
"publicationYear": "2018",
"subjectCollection": "Medicine",
"title": "Perkutane Angioplastie infrapoplitealer Arterien"
}
},
{
"10.1007/978-3-662-55935-2_33": {
"documentType": "book chapter",
"journal": "Periphere arterielle Interventionen",
"language": "De",
"openArticle": " ",
"publicationYear": "2018",
"subjectCollection": "Medicine",
"title": "Behandlungsstrategien bei kritischer Ischämie und chronischen femoropoplitealen Verschlüssen"
……..
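The substance-to-DOI links in the JSON excerpt above can be flattened into ranked rows before they are turned into annotations. The sketch below assumes the layout shown in the excerpt (a `substances` object keyed by registry ID, each holding a list of `doi`/`relevance` pairs). The function name `ranked_links` is hypothetical, and only three of the links are reproduced.

```python
import json
from typing import List, Tuple

# Three of the links from the excerpt above, for one registry ID
excerpt = """
{
  "substances": {
    "100086048-91340494": [
      {"doi": "10.1007/7355_2017_9", "relevance": "3.0595"},
      {"doi": "10.1007/s12010-017-2680-4", "relevance": "0.241"},
      {"doi": "10.1007/s40278-018-48591-z", "relevance": "2.923"}
    ]
  }
}
"""


def ranked_links(doc: dict) -> List[Tuple[str, str, float]]:
    """Return (registry ID, DOI, relevance) rows, most relevant first per substance."""
    rows = []
    for regid, links in doc["substances"].items():
        for link in links:
            if "doi" in link:  # skip entries that are not DOI links
                rows.append((regid, link["doi"], float(link["relevance"])))
    # Relevance arrives as a string, so it is converted to float before sorting
    rows.sort(key=lambda row: (row[0], -row[2]))
    return rows


rows = ranked_links(json.loads(excerpt))
for regid, doi, relevance in rows:
    print(regid, doi, relevance)
```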
<!DOCTYPE Annotations PUBLIC "-//NCBI//annotations/EN" "annotations.dtd">
<Annotations>
<Annotation>
<SourceName>Springer Nature</SourceName>
<SourceID>100086048-91340494</SourceID>
<Description>Literature references related to scientific contents from Springer Nature journals and
books. &lt;a href=&quot;https://link.springer.com/&quot;&gt;Read more ...&lt;/a&gt;</Description>
<LinkToPubChemBy>
<Source>
<SourceName>15745</SourceName>
<SourceID>100086048-91340494</SourceID>
</Source>
</LinkToPubChemBy>
<Data>
<TOCHeading>Springer Nature References</TOCHeading>
<Name>Springer Nature References</Name>
<Value>
<Table>
<ColumnName>Title</ColumnName>
<ColumnName>Journal/Book</ColumnName>
<ColumnName>PMID</ColumnName>
<ColumnName>Open Access</ColumnName>
<ColumnName>Year</ColumnName>
<ExternalTableName>springernature</ExternalTableName>
<ExternalTableNumRows>5</ExternalTableNumRows>
</Table>
</Value>
</Data>
Data submission .. Case study
DBs
APIs
Data submission .. Case study
• MassBank of North America (MoNA) submitted over 77k
chemical substances and about 200k MS annotations to
PubChem.
Data submission ..
MoNA data in PubChem ...
Data submission ..
MoNA substance data parsing ..
[{
"compound": [{
"inchi": "InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)",
"inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
"metaData": [
{
"category": "none",
"computed": false,
"hidden": false,
"name": "SMILES",
"value": "c1cc(ccc1N)S(=O)(=O)Nc2cncc(n2)Cl"
},
{
"category": "external id",
"computed": false,
"hidden": false,
"name": "cas",
"value": "102-65-8"
},
{
"category": "external id",
"computed": false,
"hidden": false,
Data submission ..
MoNA annotation data parsing ..
[{
"compound": [{
"inchi": "InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)",
"inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
"metaData": [
{
"category": "none",
"computed": false,
"hidden": false,
"name": "SMILES",
.... }],
"id": "AU100601",
……..
"splash": {
"block1": "splash10",
"block2": "0a4i",
"block3": "1900000000",
"block4": "d2bc1c887f6f99ed0f74",
"splash": "splash10-0a4i-1900000000-d2bc1c887f6f99ed0f74"
},
"submitter": {
"id": "ntho@chem.uoa.gr",
"emailAddress": "ntho@chem.uoa.gr",
"firstName": "Nikolaos",
"lastName": "Thomaidis",
"institution": "University of Athens"
},
"ruleBased": false,
"text": "MassBank"
}
}
},
....
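A parser for records like the one above mainly has to pull identifiers out of the `metaData` name/value list. The sketch below works on a trimmed copy of the excerpt. It assumes that `id` and `splash` sit at the top level of each record, which the elided excerpt suggests but does not fully show. The function name `summarize` and the output keys are hypothetical.

```python
from typing import Dict, Optional

# Trimmed copy of the MoNA record shown above
record = {
    "compound": [{
        "inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
        "metaData": [
            {"category": "none", "computed": False, "hidden": False,
             "name": "SMILES", "value": "c1cc(ccc1N)S(=O)(=O)Nc2cncc(n2)Cl"},
            {"category": "external id", "computed": False, "hidden": False,
             "name": "cas", "value": "102-65-8"},
        ],
    }],
    "id": "AU100601",
    "splash": {"splash": "splash10-0a4i-1900000000-d2bc1c887f6f99ed0f74"},
}


def summarize(rec: dict) -> Dict[str, Optional[str]]:
    """Pull the identifiers a substance or annotation record would need."""
    compound = rec["compound"][0]
    # Turn the name/value list into a lookup table, ignoring hidden entries
    meta = {
        item["name"]: item["value"]
        for item in compound.get("metaData", [])
        if not item.get("hidden")
    }
    return {
        "regid": rec["id"],
        "inchikey": compound["inchiKey"],
        "smiles": meta.get("SMILES"),
        "cas": meta.get("cas"),
        "splash": rec.get("splash", {}).get("splash"),
    }


summary = summarize(record)
print(summary)
```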
DBs
APIs
Annotation raw data formats from
PubChem data submitters
• CSV/spreadsheet - Pistoia Alliance Chemical Safety Library reactivity alerts, EPA
pesticides, USGS Env …
• HTML - ILO-ICSC, NIOSH, NCI cancer drugs, CAMEO …
• Text – FDA Orange Book
• XML – BioRad SpectraBase, HMDB, OSHA ..
• Images – CCDC, MoNA, BioRad …
• JSON – Springer Nature, …
• Tables – HSDB ..
Automated pipeline
• PubChem has set up an automated pipeline for data submission,
parsing, and standardization that periodically updates substance
data, bioactivity data, and annotations.
• Updates can run monthly, weekly, or be triggered by a watcher ..
• Data submission can be set to pull or push.
• The raw data can be in the form of SDF, CSV, text, JSON, XML,
HTML, images …
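One simple way a pull-mode watcher could decide whether a source needs re-processing is to compare a digest of the downloaded raw data with the digest stored from the previous run. This is only a sketch of that idea, not a description of PubChem's actual pipeline. The function name `needs_update` and the sample CSV bytes are hypothetical.

```python
import hashlib
from typing import Optional, Tuple


def needs_update(raw: bytes, last_digest: Optional[str]) -> Tuple[bool, str]:
    """Return (changed, digest) for freshly pulled raw data.

    `last_digest` is the value stored after the previous run,
    or None if the source has never been processed.
    """
    digest = hashlib.sha256(raw).hexdigest()
    return digest != last_digest, digest


# First run: nothing stored yet, so the data counts as changed
first_pull = b"regid,smiles\n1,CCO\n"
changed, digest = needs_update(first_pull, None)

# Second run with identical content: no re-processing needed
changed_again, _ = needs_update(first_pull, digest)

print(changed, changed_again)
```

A scheduled job (monthly or weekly) could call such a check after each download and skip parsing when nothing has changed.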
Automated pipeline
[Diagram labels: Cron jobs; Raw data (FTP / URL / …); DBs; APIs; Web pages; FTP]
Data access .. Open data, free access
• Website
• APIs (programmatic access)
• FTP download
• Widgets
• RDF
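For programmatic access, one option is PubChem's PUG REST interface, which the slide does not name explicitly. The sketch below only builds a request URL that looks up compound IDs by InChIKey, using the key from the MoNA example. The helper name `cids_by_inchikey_url` is hypothetical, and no network call is made.

```python
from urllib.parse import quote

# Base URL of PubChem's PUG REST service
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cids_by_inchikey_url(inchikey: str) -> str:
    """Build a PUG REST URL that returns the CIDs matching an InChIKey as JSON."""
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/cids/JSON"


url = cids_by_inchikey_url("QKLPUVXBJHRFQZ-UHFFFAOYSA-N")
print(url)
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen(url)`.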
Policies and disclaimers
https://www.ncbi.nlm.nih.gov/home/about/policies/
Who can send data to PubChem ....
• Chemical vendors
• Research institutes
• Government agencies
• Publishers
• Universities
• Pharma companies
• Individual scientists...
• ….
Summary
• PubChem provides an open public data repository where
submitters can upload chemical data.
• There are a total of 623 data submitters, including publishers,
government agencies, research institutes, chemical vendors,
universities, individual scientists...
• PubChem provides an automated data uploading pipeline.
• Data submission to PubChem can be set to pull or push.
Thank you ... This research was supported by the Intramural Research
Program of the NIH, National Library of Medicine.
Josef Eiblmaier
Dorothee Geppert
Zoila Meza-Renken
Jakob Ruhdorfer
Evan Bolton
Asta Gindulyte
Ben Shoemaker
Paul Thiessen
Siqian He
Bo Yu
Jie Chen
Tiejun Cheng
Jane He
Sunghwan Kim
Leon Li
Leonid Zaslavsky
Sajjan Singh Mehta
Gert Wohlgemuth
Oliver Fiehn

More Related Content

What's hot

PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biology
Chris Southan
 
When pharmaceutical companies publish large datasets an abundance of riches o...
When pharmaceutical companies publish large datasets an abundance of riches o...When pharmaceutical companies publish large datasets an abundance of riches o...
When pharmaceutical companies publish large datasets an abundance of riches o...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
PubChem as an Emerging Toxicological Information Resource
PubChem as an Emerging Toxicological Information ResourcePubChem as an Emerging Toxicological Information Resource
PubChem as an Emerging Toxicological Information Resource
Sunghwan Kim
 
Exploring Chemical and Biological Knowledge Spaces with PubChem
Exploring Chemical and Biological Knowledge Spaces with PubChemExploring Chemical and Biological Knowledge Spaces with PubChem
Exploring Chemical and Biological Knowledge Spaces with PubChem
Paul Thiessen
 
How can you access PubChem programmatically?
How can you access PubChem programmatically?How can you access PubChem programmatically?
How can you access PubChem programmatically?
Sunghwan Kim
 
PubChem and its application for cheminformatics education
PubChem and its application for cheminformatics educationPubChem and its application for cheminformatics education
PubChem and its application for cheminformatics education
Sunghwan Kim
 
Generating Biomedical Hypotheses Using Semantic Web Technologies
Generating Biomedical Hypotheses Using Semantic Web TechnologiesGenerating Biomedical Hypotheses Using Semantic Web Technologies
Generating Biomedical Hypotheses Using Semantic Web Technologies
Michel Dumontier
 
PubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingPubChem as a resource for chemical information training
PubChem as a resource for chemical information training
Sunghwan Kim
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural products
Sunghwan Kim
 
PubChem LCSS
PubChem LCSSPubChem LCSS
PubChem LCSS
Jian Zhang
 
PubChem: A Public Chemical Information Resource for Big Data Chemistry
PubChem: A Public Chemical Information Resource for Big Data ChemistryPubChem: A Public Chemical Information Resource for Big Data Chemistry
PubChem: A Public Chemical Information Resource for Big Data Chemistry
Sunghwan Kim
 
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
Frederik van den Broek
 
PubChem and Big Data Chemistry
PubChem and Big Data ChemistryPubChem and Big Data Chemistry
PubChem and Big Data Chemistry
Sunghwan Kim
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...
Sunghwan Kim
 
What chemicals constitute the Exposome? Accessing data via the US EPA’s Comp...
What chemicals constitute the Exposome? Accessing data via the US EPA’s  Comp...What chemicals constitute the Exposome? Accessing data via the US EPA’s  Comp...
What chemicals constitute the Exposome? Accessing data via the US EPA’s Comp...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox Chemicals Dash...
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox  Chemicals Dash...PFAS Chemistry: Range, Complexity, Groupings, and the CompTox  Chemicals Dash...
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox Chemicals Dash...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
PubChem and Its Applications for Drug Discovery
PubChem and Its Applications for Drug DiscoveryPubChem and Its Applications for Drug Discovery
PubChem and Its Applications for Drug Discovery
Sunghwan Kim
 
BioSHaRE Catalogue of tools and services for data sharing
BioSHaRE Catalogue of tools and services for data sharingBioSHaRE Catalogue of tools and services for data sharing
BioSHaRE Catalogue of tools and services for data sharing
Lisette Giepmans
 
Chemistry Resources Science Teachers
Chemistry Resources Science TeachersChemistry Resources Science Teachers
Chemistry Resources Science Teachers
Mary Markland
 

What's hot (20)

PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biology
 
When pharmaceutical companies publish large datasets an abundance of riches o...
When pharmaceutical companies publish large datasets an abundance of riches o...When pharmaceutical companies publish large datasets an abundance of riches o...
When pharmaceutical companies publish large datasets an abundance of riches o...
 
PubChem as an Emerging Toxicological Information Resource
PubChem as an Emerging Toxicological Information ResourcePubChem as an Emerging Toxicological Information Resource
PubChem as an Emerging Toxicological Information Resource
 
Exploring Chemical and Biological Knowledge Spaces with PubChem
Exploring Chemical and Biological Knowledge Spaces with PubChemExploring Chemical and Biological Knowledge Spaces with PubChem
Exploring Chemical and Biological Knowledge Spaces with PubChem
 
How can you access PubChem programmatically?
How can you access PubChem programmatically?How can you access PubChem programmatically?
How can you access PubChem programmatically?
 
PubChem and its application for cheminformatics education
PubChem and its application for cheminformatics educationPubChem and its application for cheminformatics education
PubChem and its application for cheminformatics education
 
Generating Biomedical Hypotheses Using Semantic Web Technologies
Generating Biomedical Hypotheses Using Semantic Web TechnologiesGenerating Biomedical Hypotheses Using Semantic Web Technologies
Generating Biomedical Hypotheses Using Semantic Web Technologies
 
PubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingPubChem as a resource for chemical information training
PubChem as a resource for chemical information training
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural products
 
PubChem LCSS
PubChem LCSSPubChem LCSS
PubChem LCSS
 
PubChem: A Public Chemical Information Resource for Big Data Chemistry
PubChem: A Public Chemical Information Resource for Big Data ChemistryPubChem: A Public Chemical Information Resource for Big Data Chemistry
PubChem: A Public Chemical Information Resource for Big Data Chemistry
 
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
De-siloing data and building knowledge graphs outside of drug discovery: Oppo...
 
PubChem and Big Data Chemistry
PubChem and Big Data ChemistryPubChem and Big Data Chemistry
PubChem and Big Data Chemistry
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
 
Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...
 
What chemicals constitute the Exposome? Accessing data via the US EPA’s Comp...
What chemicals constitute the Exposome? Accessing data via the US EPA’s  Comp...What chemicals constitute the Exposome? Accessing data via the US EPA’s  Comp...
What chemicals constitute the Exposome? Accessing data via the US EPA’s Comp...
 
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox Chemicals Dash...
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox  Chemicals Dash...PFAS Chemistry: Range, Complexity, Groupings, and the CompTox  Chemicals Dash...
PFAS Chemistry: Range, Complexity, Groupings, and the CompTox Chemicals Dash...
 
PubChem and Its Applications for Drug Discovery
PubChem and Its Applications for Drug DiscoveryPubChem and Its Applications for Drug Discovery
PubChem and Its Applications for Drug Discovery
 
BioSHaRE Catalogue of tools and services for data sharing
BioSHaRE Catalogue of tools and services for data sharingBioSHaRE Catalogue of tools and services for data sharing
BioSHaRE Catalogue of tools and services for data sharing
 
Chemistry Resources Science Teachers
Chemistry Resources Science TeachersChemistry Resources Science Teachers
Chemistry Resources Science Teachers
 

Similar to Publishing chemical data in public data repository

How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Cheminformatics Support for MS Supporting Exposomics
Cheminformatics Support for MS Supporting ExposomicsCheminformatics Support for MS Supporting Exposomics
Delivering chemical-associated data via EPA web applications
Delivering chemical-associated data via EPA web applicationsDelivering chemical-associated data via EPA web applications
Delivering chemical-associated data via EPA web applications
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Delivering web-based access to data and algorithms to support computational t...
Delivering web-based access to data and algorithms to support computational t...Delivering web-based access to data and algorithms to support computational t...
Delivering web-based access to data and algorithms to support computational t...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Integrating Mass Spectrometry Non-Targeted Analysis and Computational Toxico...
Integrating Mass Spectrometry  Non-Targeted Analysis and Computational Toxico...Integrating Mass Spectrometry  Non-Targeted Analysis and Computational Toxico...
Integrating Mass Spectrometry Non-Targeted Analysis and Computational Toxico...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data DashboardsAccessing Environmental Chemistry Data via Data Dashboards
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
Michel Dumontier
 
Success in decision making data relevance curation
Success in decision making data relevance curationSuccess in decision making data relevance curation
A chemistry data repository to serve them all
A chemistry data repository to serve them allA chemistry data repository to serve them all
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Bigfinite
 
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 

Similar to Publishing chemical data in public data repository (20)

How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...
 
Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards
 
Cheminformatics Support for MS Supporting Exposomics
Cheminformatics Support for MS Supporting ExposomicsCheminformatics Support for MS Supporting Exposomics
Cheminformatics Support for MS Supporting Exposomics
 
Delivering chemical-associated data via EPA web applications
Delivering chemical-associated data via EPA web applicationsDelivering chemical-associated data via EPA web applications
Delivering chemical-associated data via EPA web applications
 
Delivering web-based access to data and algorithms to support computational t...
Delivering web-based access to data and algorithms to support computational t...Delivering web-based access to data and algorithms to support computational t...
Delivering web-based access to data and algorithms to support computational t...
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
 
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
 
Integrating Mass Spectrometry Non-Targeted Analysis and Computational Toxico...
Integrating Mass Spectrometry  Non-Targeted Analysis and Computational Toxico...Integrating Mass Spectrometry  Non-Targeted Analysis and Computational Toxico...
Integrating Mass Spectrometry Non-Targeted Analysis and Computational Toxico...
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
 
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
 
Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data DashboardsAccessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards
 
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
 
Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
 
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
 
Success in decision making data relevance curation
Success in decision making data relevance curationSuccess in decision making data relevance curation
Success in decision making data relevance curation
 
A chemistry data repository to serve them all
A chemistry data repository to serve them allA chemistry data repository to serve them all
A chemistry data repository to serve them all
 
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
 
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
 
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
 

Publishing chemical data in public data repository

  • 1. Publishing chemical data in public data repository
      Jian Zhang*, Paul Thiessen, Asta Gindulyte, Evan Bolton
      256th ACS National Meeting, Boston, August 2018
  • 2. Outline ..
      • PubChem overview
      • What data PubChem has
      • How to publish your data - case studies
      • Automated pipeline
      • How to access
      • Summary
  • 3. PubChem … overview and status
      • A public chemical data repository – a public data sharing platform
      • An open chemistry database
      • A chemical information hub
      • A data comparison center
      • A chemical data index
      Compounds: 96,478,070
      Substances: 247,243,896
      BioAssays: 1,252,901
      Tested Compounds: 2,978,541
      Tested Substances: 4,994,132
      BioActivities: 236,790,496
      Protein Targets: 10,854
      Gene Targets: 22,108
      Data submitters: 623
      Countries: 40
  • 4.
  • 5. Data … chemical centralized and beyond
      • Chemical structure – 2D/3D, SMILES, InChI, SDF..
      • Property -
      • Drug and medication
      • Agrochemicals
      • Food additives
      • Safety and hazards
      • Toxicity
      • Literature
      • Patents
      • Bioactivity
      • Target
      • Natural products
      • Pathways
      • … more
      • Link back to original data
  • 6. How to submit data to PubChem and publishing
      • Chemical substance – SDF, SMILES, …
      • Bioactivity data – CSV, XML, ASN.1 …
      • Annotation – data format varies
  • 7. Data submission .. Chemical substances
      • Data format: SDF, CSV.. Through PubChem UpLoad
      • Convert your structure SDF/CSV into the PubChem standard
      • Provide mapping information for your data file.
      https://pubchemdocs.ncbi.nlm.nih.gov/upload-chemicals
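The "mapping information" step on slide 7 can be pictured as a lookup from a submitter's column names to SD data tags. The sketch below is only an illustration under stated assumptions: the three tag names are the ones that appear in the Springer Nature SDF excerpt on slide 15, while the CSV column names, the sample row, and the helper `sd_data_items` are hypothetical. It renders only the data-item portion of a record, not the structure block, and it is not PubChem's own converter.

```python
import csv
import io

# Hypothetical mapping from a submitter's CSV column names to SD data tags.
# The tag names appear in the SDF excerpt in this deck; the column names
# on the left are invented for illustration.
COLUMN_TO_TAG = {
    "registry_id": "PUBCHEM_EXT_DATASOURCE_REGID",
    "comment": "PUBCHEM_SUBSTANCE_COMMENT",
    "name": "PUBCHEM_SUBSTANCE_SYNONYM",
}


def sd_data_items(row, mapping=COLUMN_TO_TAG):
    """Render one CSV row as SD data items: "> <TAG>", value, blank line."""
    lines = []
    for column, tag in mapping.items():
        value = (row.get(column) or "").strip()
        if not value:
            continue  # skip empty cells rather than emit empty data items
        lines += [f"> <{tag}>", value, ""]
    return "\n".join(lines)


# Invented sample row, used only to exercise the mapping.
sample_csv = "registry_id,name,comment\nABC-001,ethanol,example record\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(sd_data_items(rows[0]))
```

Keeping the mapping in a plain dictionary means a new data source only needs a new table, not new code.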
  • 8. Data submission .. Bioactivity data
      • Data format: CSV, XML, ASN1.. Through PubChem UpLoad
      • Use PubChem standard tags for your data (spreadsheet)
      • Convert your data into PubChem standard XML/ASN1
      https://pubchemdocs.ncbi.nlm.nih.gov/upload-bioassays
  • 9. Data submission .. Data specifications
  • 10. Data submission .. Annotations
      • Incoming data format varies … CSV, text, XML, JSON, HTML, tables, images … special parser needed
  • 11. Data submission .. Case study
      Springer Nature submitted over 620k chemical substances and more than 4 million literature articles/book chapters to PubChem, which yielded over 28 million compound-literature links.
  • 12. Data submission .. Case study Springer Nature data in PubChem..
  • 13.
  • 14. Data submission .. Case study Substance data
  • 15. Data submission .. Case study: Literature annotations
      Substance record (SDF):
        INFOCHEM 2D 1 1.00000 0.00000 0
        13 13 0 0 0 0 0 0 0 0999 V2000
        0.7894 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        1.5789 1.3684 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
        0.0000 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        -0.4210 0.7368 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        -1.1579 0.7368 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
        -1.5789 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        -1.1579 2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        -0.4210 2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        0.0000 0.0526 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
        -0.4210 -0.6316 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        0.0000 -1.3158 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
        -0.4210 -2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        0.7894 -1.3158 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
        1 2 3 0 0 0 0
        1 3 1 0 0 0 0
        3 4 2 0 0 0 0
        3 8 1 0 0 0 0
        4 5 1 0 0 0 0
        4 9 1 0 0 0 0
        5 6 2 0 0 0 0
        6 7 1 0 0 0 0
        7 8 2 0 0 0 0
        9 10 2 0 0 0 0
        10 11 1 0 0 0 0
        11 12 1 0 0 0 0
        11 13 1 0 0 0 0
        M END
        > <PUBCHEM_EXT_DATASOURCE_REGID>
        13043826-37657822
        > <PUBCHEM_SUBSTANCE_COMMENT>
        This substance has the INFOCHEM hashcode: 13043826-37657822
        > <PUBCHEM_SUBSTANCE_SYNONYM>
        (1e)-N'-(3-cyano-2-pyridinyl)-N,N-dimethylmethanimidamide
        $$$$
      Literature links (JSON):
        { "substances": {
            "100086048-91340494": [
              { "doi": "10.1007/7355_2017_9", "relevance": "3.0595" },
              { "doi": "10.1007/s12010-017-2680-4", "relevance": "0.241" },
              { "doi": "10.1007/s12562-018-1226-1", "relevance": "0.1393" },
              { "doi": "10.1007/s13197-018-3155-5", "relevance": "0.1366" },
              { "doi": "10.1007/s40278-018-48591-z", "relevance": "2.923" },
              { "doi": "10.1186/s12866-018-1205-9", "relevance": "0.5178" },
              { "citations": [
                  { "10.1007/978-3-662-55935-2_34": {
                      "documentType": "book chapter",
                      "journal": "Periphere arterielle Interventionen",
                      "language": "De",
                      "openArticle": " ",
                      "publicationYear": "2018",
                      "subjectCollection": "Medicine",
                      "title": "Perkutane Angioplastie infrapoplitealer Arterien" } },
                  { "10.1007/978-3-662-55935-2_33": {
                      "documentType": "book chapter",
                      "journal": "Periphere arterielle Interventionen",
                      "language": "De",
                      "openArticle": " ",
                      "publicationYear": "2018",
                      "subjectCollection": "Medicine",
                      "title": "Behandlungsstrategien bei kritischer Ischämie und chronischen femoropoplitealen Verschlüssen"
        ……..
      Annotation record (XML):
        <!DOCTYPE Annotations PUBLIC "-//NCBI//annotations/EN" "annotations.dtd">
        <Annotations>
          <Annotation>
            <SourceName>Springer Nature</SourceName>
            <SourceID>100086048-91340494</SourceID>
            <Description>Literature references related to scientific contents from Springer Nature journals and books. &lt;a href=&quot;https://link.springer.com/&quot;&gt;Read more ...&lt;/a&gt;</Description>
            <LinkToPubChemBy>
              <Source>
                <SourceName>15745</SourceName>
                <SourceID>100086048-91340494</SourceID>
              </Source>
            </LinkToPubChemBy>
            <Data>
              <TOCHeading>Springer Nature References</TOCHeading>
              <Name>Springer Nature References</Name>
              <Value>
                <Table>
                  <ColumnName>Title</ColumnName>
                  <ColumnName>Journal/Book</ColumnName>
                  <ColumnName>PMID</ColumnName>
                  <ColumnName>Open Access</ColumnName>
                  <ColumnName>Year</ColumnName>
                  <ExternalTableName>springernature</ExternalTableName>
                  <ExternalTableNumRows>5</ExternalTableNumRows>
                </Table>
              </Value>
            </Data>
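The JSON excerpt on slide 15 pairs a substance identifier with DOIs and a relevance score stored as a string. A minimal sketch of turning such a file into compound-literature links could look like the following. The sample is a well-formed fragment assembled from three of the entries shown on the slide; the function name `literature_links` and the `min_relevance` cutoff are hypothetical and are not described in the deck.

```python
import json

# Well-formed fragment built from entries shown on the slide.
sample = """
{"substances": {"100086048-91340494": [
  {"doi": "10.1007/7355_2017_9", "relevance": "3.0595"},
  {"doi": "10.1007/s12010-017-2680-4", "relevance": "0.241"},
  {"doi": "10.1007/s40278-018-48591-z", "relevance": "2.923"}
]}}
"""


def literature_links(doc, min_relevance=0.0):
    """Return (substance_id, doi, relevance) tuples, most relevant first."""
    links = []
    for substance_id, entries in doc["substances"].items():
        for entry in entries:
            if "doi" not in entry:
                continue  # e.g. the "citations" entry seen in the excerpt
            score = float(entry["relevance"])  # relevance arrives as a string
            if score >= min_relevance:
                links.append((substance_id, entry["doi"], score))
    return sorted(links, key=lambda link: link[2], reverse=True)


# min_relevance=1.0 is an arbitrary illustrative threshold.
links = literature_links(json.loads(sample), min_relevance=1.0)
for link in links:
    print(link)
```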
  • 16. Data submission .. Case study DBs APIs
  • 17. Data submission .. Case study
      • MassBank of North America (MoNA) submitted over 77k chemical substances and about 200k MS annotations to PubChem.
  • 18. Data submission .. MoNA data in PubChem ...
  • 19. Data submission .. MoNA substance data parsing ..
        [{ "compound": [{
            "inchi": "InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)",
            "inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
            "metaData": [
              { "category": "none", "computed": false, "hidden": false, "name": "SMILES", "value": "c1cc(ccc1N)S(=O)(=O)Nc2cncc(n2)Cl" },
              { "category": "external id", "computed": false, "hidden": false, "name": "cas", "value": "102-65-8" },
              { "category": "external id", "computed": false, "hidden": false,
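Parsing a MoNA record into substance fields mostly means reading the InChIKey directly and looking up named entries in the `metaData` list. The sketch below assumes a single record shaped like the excerpt on slide 19 (trimmed to the two complete `metaData` entries shown there); the helper `compound_summary` and the choice to skip entries flagged `hidden` are assumptions, not PubChem's actual parser.

```python
import json

# One record, trimmed to the complete fields visible on the slide.
record_json = """
{"compound": [{
  "inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
  "metaData": [
    {"category": "none", "computed": false, "hidden": false,
     "name": "SMILES", "value": "c1cc(ccc1N)S(=O)(=O)Nc2cncc(n2)Cl"},
    {"category": "external id", "computed": false, "hidden": false,
     "name": "cas", "value": "102-65-8"}
  ]
}]}
"""


def compound_summary(record):
    """Collect the identifiers of the first compound in one MoNA-style record."""
    compound = record["compound"][0]
    # Index metaData by lower-cased name; skipping hidden entries is an assumption.
    meta = {
        item["name"].lower(): item["value"]
        for item in compound.get("metaData", [])
        if not item.get("hidden", False)
    }
    return {
        "inchikey": compound.get("inchiKey"),
        "smiles": meta.get("smiles"),
        "cas": meta.get("cas"),
    }


summary = compound_summary(json.loads(record_json))
print(summary)
```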
  • 21. Data submission .. MoNA annotation data parsing ..
        [{ "compound": [{
            "inchi": "InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)",
            "inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
            "metaData": [
              { "category": "none", "computed": false, "hidden": false, "name": "SMILES", .... }],
            "id": "AU100601",
            ……..
            "splash": {
              "block1": "splash10",
              "block2": "0a4i",
              "block3": "1900000000",
              "block4": "d2bc1c887f6f99ed0f74",
              "splash": "splash10-0a4i-1900000000-d2bc1c887f6f99ed0f74" },
            "submitter": {
              "id": "ntho@chem.uoa.gr",
              "emailAddress": "ntho@chem.uoa.gr",
              "firstName": "Nikolaos",
              "lastName": "Thomaidis",
              "institution": "University of Athens" },
            "ruleBased": false,
            "text": "MassBank" } } }, ....
      [Diagram labels: DBs, APIs]
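In the excerpt above, the `splash` object stores the full identifier alongside its four hyphen-separated blocks. One cheap sanity check a parser could run before creating an annotation is that the two agree. This is a sketch only: the helper names are hypothetical, and the check relies solely on the layout visible on the slide, not on the SPLASH specification.

```python
BLOCK_KEYS = ("block1", "block2", "block3", "block4")


def split_splash(splash):
    """Split a SPLASH string into its four hyphen-separated blocks."""
    parts = splash.split("-")
    if len(parts) != len(BLOCK_KEYS):
        raise ValueError(f"expected 4 blocks, got {len(parts)}: {splash!r}")
    return dict(zip(BLOCK_KEYS, parts))


def splash_is_consistent(splash_obj):
    """True when the stored blocks match the blocks of the full string."""
    stored = {key: splash_obj.get(key) for key in BLOCK_KEYS}
    return split_splash(splash_obj["splash"]) == stored


# Values copied from the record shown on the slide.
splash_obj = {
    "block1": "splash10",
    "block2": "0a4i",
    "block3": "1900000000",
    "block4": "d2bc1c887f6f99ed0f74",
    "splash": "splash10-0a4i-1900000000-d2bc1c887f6f99ed0f74",
}
print(splash_is_consistent(splash_obj))
```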
  • 22. Annotation raw data formats from PubChem data submitters
      • CSV/spreadsheet – Pistoia Alliance Chemical Safety Library reactivity alerts, EPA pesticides, USGS Env …
      • HTML – ILO-ICSC, NIOSH, NCI cancer drugs, CAMEO …
      • Text – FDA Orange Book
      • XML – BioRad SpectraBase, HMDB, OSHA ..
      • Images – CCDC, MoNA, BioRad …
      • JSON – Springer Nature, …
      • Tables – HSDB ..
  • 23. Automated pipeline
      • PubChem has set up an automated pipeline for data submission, parsing, and standardization that periodically updates substances, bioactivity data, and annotations.
      • Updates can run monthly, weekly, or be triggered by a watcher ..
      • Data submission can be set to pull or push.
      • The raw data can be in the form of SDF, CSV, text, JSON, XML, HTML, images …
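The slides do not say how the watcher decides that a source needs re-processing. One common approach, shown here purely as an assumption, is to keep a content digest per data source and re-run parsing only when the digest changes. The function names and the in-memory `seen` dictionary are hypothetical; a real pipeline would persist this state and fetch the bytes from FTP, a URL, or an API.

```python
import hashlib


def digest(raw: bytes) -> str:
    """Content fingerprint of a raw data file."""
    return hashlib.sha256(raw).hexdigest()


def needs_update(source_name: str, raw: bytes, seen: dict) -> bool:
    """Return True (and remember the new digest) when the raw data changed."""
    new_digest = digest(raw)
    if seen.get(source_name) == new_digest:
        return False  # unchanged since the last run, nothing to do
    seen[source_name] = new_digest
    return True


# Simulated polls of one hypothetical source across three scheduled runs.
seen = {}
first = needs_update("example-source", b"record v1", seen)
second = needs_update("example-source", b"record v1", seen)
third = needs_update("example-source", b"record v2", seen)
print(first, second, third)
```

A scheduled job (the cron jobs mentioned on the next slide) would call `needs_update` on each pull and hand changed files to the parser for that source.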
  • 24. Automated pipeline Cron jobs Raw data FTP / URL … DBs APIs Web pages FTP
  • 25. Data accessing .. Open data free access
      • Website
      • APIs (programmatic access)
      • FTP download
      • Widgets
      • RDF
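For the programmatic route, PubChem offers the PUG REST interface, which is not named on the slide but is its public HTTP API. The sketch below only builds a request URL for a property lookup by InChIKey and performs no network call; the helper name and the default property list are illustrative choices, and the InChIKey is the one from the MoNA example earlier in the deck.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(inchikey, properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST URL that returns selected compound properties as JSON."""
    props = ",".join(properties)
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/property/{props}/JSON"


url = property_url("QKLPUVXBJHRFQZ-UHFFFAOYSA-N")
print(url)
```

Fetching the URL with any HTTP client returns a JSON document; response handling and request throttling are left out of this sketch.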
  • 27. Who can send data to PubChem ....
      • Chemical vendors
      • Research institutes
      • Government agencies
      • Publishers
      • Universities
      • Pharma companies
      • Individual scientists...
      • ….
  • 28. Summary
      • PubChem provides an open public data repository that allows submitters to upload chemical data.
      • There are 623 data submitters in total, including publishers, government agencies, research institutes, chemical vendors, universities, individual scientists...
      • PubChem provides an automated data uploading pipeline.
      • Raw data delivery can be set to pull or push when submitting data to PubChem.
  • 29. Thank you ...
      This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
      Josef Eiblmaier, Dorothee Geppert, Zoila Meza-Renken, Jakob Ruhdorfer, Evan Bolton, Asta Gindulyte, Ben Shoemaker, Paul Thiessen, Siqian He, Bo Yu, Jie Chen, Tiejun Cheng, Jane He, Sunghwan Kim, Leon Li, Leonid Zaslavsky, Sajjan Singh Mehta, Gert Wohlgemuth, Oliver Fiehn