Publishing chemical data in public
data repository
Jian Zhang*, Paul Thiessen, Asta Gindulyte, Evan Bolton
256th ACS National Meeting, Boston, August 2018
Outline ..
• PubChem overview
• What data PubChem has
• How to publish your data - case studies
• Automated pipeline
• How to access
• Summary
PubChem … overview and status
• A public chemical data repository – a public data sharing platform
• An open chemistry database
• A chemical information hub
• A data comparison center
• A chemical data index
Compounds: 96,478,070
Substances: 247,243,896
BioAssays: 1,252,901
Tested Compounds: 2,978,541
Tested Substances: 4,994,132
BioActivities: 236,790,496
Protein Targets: 10,854
Gene Targets: 22,108
Data submitters: 623
Countries: 40
Data … chemical-centric and beyond
• Chemical structure – 2D/3D, SMILES, InChI, SDF..
• Properties
• Drug and medication
• Agrochemicals
• Food additives
• Safety and hazards
• Toxicity
• Literature
• Patents
• Bioactivity
• Target
• Natural products
• Pathways
• … more
• Link back to original data
How to submit and publish data in PubChem
• Chemical substance – SDF, SMILES, …
• Bioactivity data – CSV, XML, ASN.1 …
• Annotation – data format varies
Data submission .. Chemical substances
• Data format: SDF, CSV, … submitted through PubChem Upload
• Convert your structure data (SDF/CSV) into the PubChem standard format
• Provide mapping information for your data file.
https://pubchemdocs.ncbi.nlm.nih.gov/upload-chemicals
Data submission .. Bioactivity data
• Data format: CSV, XML, ASN.1, … submitted through PubChem Upload
• Use PubChem standard tags for your data (spreadsheet)
• Convert your data into PubChem standard XML/ASN.1
https://pubchemdocs.ncbi.nlm.nih.gov/upload-bioassays
Data submission .. Data specifications
Data submission .. Annotations
• Incoming data formats vary … CSV, text, XML, JSON, HTML, tables, images … a special parser is needed for each
Data submission .. Case study
Springer Nature submitted over 620k chemical substances
and more than 4 million literature articles/book chapters to
PubChem, which yielded over 28 million compound-literature
links.
Data submission .. Case study
Springer Nature data in PubChem..
Data submission .. Case study
• Substance data
• Literature annotations
INFOCHEM 2D 1 1.00000 0.00000 0
13 13 0 0 0 0 0 0 0 0999 V2000
0.7894 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.5789 1.3684 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.4210 0.7368 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.1579 0.7368 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-1.5789 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.1579 2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.4210 2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0526 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.4210 -0.6316 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 -1.3158 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.4210 -2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7894 -1.3158 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 0 0 0 0
1 3 1 0 0 0 0
3 4 2 0 0 0 0
3 8 1 0 0 0 0
4 5 1 0 0 0 0
4 9 1 0 0 0 0
5 6 2 0 0 0 0
6 7 1 0 0 0 0
7 8 2 0 0 0 0
9 10 2 0 0 0 0
10 11 1 0 0 0 0
11 12 1 0 0 0 0
11 13 1 0 0 0 0
M END
> <PUBCHEM_EXT_DATASOURCE_REGID>
13043826-37657822
> <PUBCHEM_SUBSTANCE_COMMENT>
This substance has the INFOCHEM hashcode: 13043826-37657822
> <PUBCHEM_SUBSTANCE_SYNONYM>
(1e)-N'-(3-cyano-2-pyridinyl)-N,N-dimethylmethanimidamide
$$$$
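The data items after the structure block are what tie the record to the submitter's own identifiers. A minimal sketch of reading them back out of one SDF record follows; the function name `parse_sd_data_items` is hypothetical, and the parser only handles the simple "> <TAG>" followed by value lines layout shown above, not every variant the SDF format allows.

```python
def parse_sd_data_items(record: str) -> dict:
    """Collect the '> <TAG>' data items of one SDF record into a dict."""
    items = {}
    tag = None
    for line in record.splitlines():
        stripped = line.strip()
        if stripped == "$$$$":  # record terminator
            break
        if stripped.startswith(">") and "<" in stripped and stripped.endswith(">"):
            # Header line such as "> <PUBCHEM_SUBSTANCE_SYNONYM>"
            tag = stripped[stripped.index("<") + 1 : -1]
            items[tag] = ""
        elif not stripped:
            tag = None  # a blank line closes the current data item
        elif tag is not None:
            # Value line; multi-line values are joined with newlines
            items[tag] = f"{items[tag]}\n{stripped}" if items[tag] else stripped
    return items


# Data items copied from the record above (structure block omitted).
record = """M  END
> <PUBCHEM_EXT_DATASOURCE_REGID>
13043826-37657822
> <PUBCHEM_SUBSTANCE_COMMENT>
This substance has the INFOCHEM hashcode: 13043826-37657822
> <PUBCHEM_SUBSTANCE_SYNONYM>
(1e)-N'-(3-cyano-2-pyridinyl)-N,N-dimethylmethanimidamide
$$$$
"""
items = parse_sd_data_items(record)
```

Lines of the atom and bond blocks are skipped because no tag is open until the first "> <...>" header is seen.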
{
"substances": {
"100086048-91340494": [
{
"doi": "10.1007/7355_2017_9",
"relevance": "3.0595"
},
{
"doi": "10.1007/s12010-017-2680-4",
"relevance": "0.241"
},
{
"doi": "10.1007/s12562-018-1226-1",
"relevance": "0.1393"
},
{
"doi": "10.1007/s13197-018-3155-5",
"relevance": "0.1366"
},
{
"doi": "10.1007/s40278-018-48591-z",
"relevance": "2.923"
},
{
"doi": "10.1186/s12866-018-1205-9",
"relevance": "0.5178"
},
{
"citations": [
{
"10.1007/978-3-662-55935-2_34": {
"documentType": "book chapter",
"journal": "Periphere arterielle Interventionen",
"language": "De",
"openArticle": " ",
"publicationYear": "2018",
"subjectCollection": "Medicine",
"title": "Perkutane Angioplastie infrapoplitealer
Arterien"
}
},
{
"10.1007/978-3-662-55935-2_33": {
"documentType": "book chapter",
"journal": "Periphere arterielle Interventionen",
"language": "De",
"openArticle": " ",
"publicationYear": "2018",
"subjectCollection": "Medicine",
"title": "Behandlungsstrategien bei kritischer
Ischämie und chronischen femoropoplitealen
Verschlüssen"
……..
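The literature JSON above maps each substance ID to a list of DOI/relevance pairs. As a rough sketch of how such a file could be flattened into substance-literature links, the snippet below uses a shortened copy of the excerpt; the function name `ranked_links` is hypothetical, and the assumption that "substances" is a top-level key holding only DOI/relevance entries is mine, since the excerpt is truncated.

```python
import json

# Shortened copy of the excerpt above; relevance values are strings in the source.
payload = json.loads("""
{
  "substances": {
    "100086048-91340494": [
      {"doi": "10.1007/7355_2017_9", "relevance": "3.0595"},
      {"doi": "10.1007/s12010-017-2680-4", "relevance": "0.241"},
      {"doi": "10.1007/s40278-018-48591-z", "relevance": "2.923"}
    ]
  }
}
""")


def ranked_links(data: dict) -> list:
    """Return (substance_id, doi, relevance) tuples, most relevant first per substance."""
    rows = []
    for substance_id, refs in data.get("substances", {}).items():
        for ref in refs:
            rows.append((substance_id, ref["doi"], float(ref["relevance"])))
    # Sort by substance, then by descending relevance
    rows.sort(key=lambda row: (row[0], -row[2]))
    return rows


links = ranked_links(payload)
```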
<!DOCTYPE Annotations PUBLIC "-//NCBI//annotations/EN" "annotations.dtd">
<Annotations>
<Annotation>
<SourceName>Springer Nature</SourceName>
<SourceID>100086048-91340494</SourceID>
<Description>Literature references related to scientific contents from Springer Nature journals and
books. &lt;a href=&quot;https://link.springer.com/&quot;&gt;Read more ...&lt;/a&gt;</Description>
<LinkToPubChemBy>
<Source>
<SourceName>15745</SourceName>
<SourceID>100086048-91340494</SourceID>
</Source>
</LinkToPubChemBy>
<Data>
<TOCHeading>Springer Nature References</TOCHeading>
<Name>Springer Nature References</Name>
<Value>
<Table>
<ColumnName>Title</ColumnName>
<ColumnName>Journal/Book</ColumnName>
<ColumnName>PMID</ColumnName>
<ColumnName>Open Access</ColumnName>
<ColumnName>Year</ColumnName>
<ExternalTableName>springernature</ExternalTableName>
<ExternalTableNumRows>5</ExternalTableNumRows>
</Table>
</Value>
</Data>
Data submission .. Case study
DBs
APIs
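To illustrate the last step of this case study, here is a hedged sketch that assembles an annotation element with Python's standard `xml.etree.ElementTree`. The element names are copied from the XML excerpt above; the helper `build_annotation` is hypothetical, the output is not validated against `annotations.dtd`, and elements that the excerpt does not show are not guessed at.

```python
import xml.etree.ElementTree as ET


def build_annotation(source_name, source_id, heading, columns):
    """Build one <Annotation> element using only tags seen in the excerpt."""
    annotation = ET.Element("Annotation")
    ET.SubElement(annotation, "SourceName").text = source_name
    ET.SubElement(annotation, "SourceID").text = source_id
    data = ET.SubElement(annotation, "Data")
    ET.SubElement(data, "TOCHeading").text = heading
    ET.SubElement(data, "Name").text = heading
    table = ET.SubElement(ET.SubElement(data, "Value"), "Table")
    for column in columns:
        ET.SubElement(table, "ColumnName").text = column
    return annotation


root = ET.Element("Annotations")
root.append(
    build_annotation(
        "Springer Nature",
        "100086048-91340494",
        "Springer Nature References",
        ["Title", "Journal/Book", "PMID", "Open Access", "Year"],
    )
)
xml_text = ET.tostring(root, encoding="unicode")
```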
Data submission .. Case study
• MassBank of North America (MoNA) submitted over 77k
chemical substances and about 200k MS annotations to
PubChem.
Data submission ..
MoNA data in PubChem ...
Data submission ..
MoNA substance data parsing ..
[{
"compound": [{
"inchi": "InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)",
"inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
"metaData": [
{
"category": "none",
"computed": false,
"hidden": false,
"name": "SMILES",
"value": "c1cc(ccc1N)S(=O)(=O)Nc2cncc(n2)Cl"
},
{
"category": "external id",
"computed": false,
"hidden": false,
"name": "cas",
"value": "102-65-8"
},
{
"category": "external id",
"computed": false,
"hidden": false,
Data submission ..
MoNA annotation data parsing ..
[{
"compound": [{
"inchi": "InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)",
"inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
"metaData": [
{
"category": "none",
"computed": false,
"hidden": false,
"name": "SMILES",
.... }],
"id": "AU100601",
……..
"splash": {
"block1": "splash10",
"block2": "0a4i",
"block3": "1900000000",
"block4": "d2bc1c887f6f99ed0f74",
"splash": "splash10-0a4i-1900000000-d2bc1c887f6f99ed0f74"
},
"submitter": {
"id": "ntho@chem.uoa.gr",
"emailAddress": "ntho@chem.uoa.gr",
"firstName": "Nikolaos",
"lastName": "Thomaidis",
"institution": "University of Athens"
},
"ruleBased": false,
"text": "MassBank"
}
}
},
....
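Parsing a MoNA record mostly means pulling a few fields out of nested JSON. The sketch below works on a trimmed copy of the excerpt above; the function name `summarize` and the output keys are hypothetical, and it assumes that "id" and "splash" sit at the record level, which the elided excerpt suggests but does not fully show.

```python
# Trimmed copy of the MoNA record excerpt above.
record = {
    "compound": [{
        "inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
        "metaData": [
            {"category": "none", "computed": False, "hidden": False,
             "name": "SMILES", "value": "c1cc(ccc1N)S(=O)(=O)Nc2cncc(n2)Cl"},
            {"category": "external id", "computed": False, "hidden": False,
             "name": "cas", "value": "102-65-8"},
        ],
    }],
    "id": "AU100601",
    "splash": {"splash": "splash10-0a4i-1900000000-d2bc1c887f6f99ed0f74"},
}


def summarize(rec: dict) -> dict:
    """Extract identifiers useful for a substance record and an MS annotation."""
    compound = rec["compound"][0]
    # Index visible metadata entries by name for easy lookup
    meta = {m["name"]: m["value"]
            for m in compound.get("metaData", []) if not m.get("hidden")}
    return {
        "regid": rec["id"],
        "inchikey": compound["inchiKey"],
        "smiles": meta.get("SMILES"),
        "cas": meta.get("cas"),
        # The SPLASH string is four hyphen-separated blocks
        "splash_blocks": rec["splash"]["splash"].split("-"),
    }


summary = summarize(record)
```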
DBs
APIs
Annotation raw data formats from
PubChem data submitters
• CSV/spreadsheet – Pistoia Alliance Chemical Safety Library reactivity alerts, EPA pesticides, USGS Env …
• HTML – ILO-ICSC, NIOSH, NCI cancer drugs, CAMEO …
• Text – FDA Orange Book
• XML – Bio-Rad SpectraBase, HMDB, OSHA …
• Images – CCDC, MoNA, Bio-Rad …
• JSON – Springer Nature, …
• Tables – HSDB …
Automated pipeline
• PubChem has set up an automated pipeline for data submission, parsing, and standardization that periodically updates substance data, bioactivity data, and annotations.
• Updates can run monthly, weekly, or be triggered by a watcher …
• Data submission can be set to pull or push.
• Raw data can be in the form of SDF, CSV, text, JSON, XML, HTML, images …
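One way to picture the pull mode with a watcher is a change check on the fetched raw file: re-parse only when the content differs from the last run. This is a minimal sketch under that assumption, not a description of PubChem's actual pipeline; `needs_update` is a hypothetical helper, and the byte strings stand in for files that a scheduled job would download.

```python
import hashlib
from typing import Optional, Tuple


def needs_update(raw: bytes, last_digest: Optional[str]) -> Tuple[bool, str]:
    """Return (changed, digest) for a freshly pulled raw data file."""
    digest = hashlib.sha256(raw).hexdigest()
    return digest != last_digest, digest


# Hypothetical usage: three consecutive pulls of a submitter's file.
first_changed, seen = needs_update(b"id,smiles\n1,CCO\n", None)
second_changed, _ = needs_update(b"id,smiles\n1,CCO\n", seen)
third_changed, _ = needs_update(b"id,smiles\n1,CCO\n2,CCN\n", seen)
```

Storing only the digest between runs keeps the check cheap, at the cost of re-downloading the file each time.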
Automated pipeline
[Pipeline diagram: cron jobs; raw data via FTP / URL / …; DBs; APIs; web pages; FTP]
Data access .. Open data, free access
• Website
• APIs (programmatic access)
• FTP download
• Widgets
• RDF
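For the programmatic route, PubChem's PUG REST interface exposes data through plain URLs. The sketch below only builds such a URL and makes no network call; the helper `compound_property_url` is hypothetical, and CID 2244 (aspirin) and the chosen property names are illustrative choices, not taken from the slides.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(cid: int, properties: list, fmt: str = "JSON") -> str:
    """Build a PUG REST URL that requests computed properties for one compound."""
    props = quote(",".join(properties), safe=",")
    return f"{PUG_REST}/compound/cid/{cid}/property/{props}/{fmt}"


url = compound_property_url(2244, ["MolecularFormula", "InChIKey"])
```

The resulting URL can be opened in a browser or fetched with any HTTP client.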
Policies and disclaimers
https://www.ncbi.nlm.nih.gov/home/about/policies/
Who can send data to PubChem ....
• Chemical vendors
• Research institutes
• Government agencies
• Publishers
• Universities
• Pharma companies
• Individual scientists...
• ….
Summary
• PubChem provides an open public data repository where submitters can upload chemical data.
• There are 623 data submitters in total, including publishers, government agencies, research institutes, chemical vendors, universities, individual scientists …
• PubChem provides an automated data uploading pipeline.
• Data submission to PubChem can be set to pull or push.
Thank you ...
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
Josef Eiblmaier
Dorothee Geppert
Zoila Meza-Renken
Jakob Ruhdorfer
Evan Bolton
Asta Gindulyte
Ben Shoemaker
Paul Thiessen
Siqian He
Bo Yu
Jie Chen
Tiejun Cheng
Jane He
Sunghwan Kim
Leon Li
Leonid Zaslavsky
Sajjan Singh Mehta
Gert Wohlgemuth
Oliver Fiehn

Publishing chemical data in public data repository

  • 1. Publishing chemical data in public data repository Jian Zhang*, Paul Thiessen, Asta Gindulyte, Evan Bolton 256th ACS National Meeting, Boston, August 2018
  • 2. Outline .. • PubChem overview • What data PubChem has • How to publish your data - case studies • Automated pipeline • How to access • Summary
  • 3. PubChem … overview and status • A public chemical data repository – a public data sharing platform • An open chemistry database • A chemical information hub • A data comparison center • A chemical data index Compounds: 96,478,070 Substances: 247,243,896 BioAssays: 1,252,901 Tested Compounds: 2,978,541 Tested Substances: 4,994,132 BioActivities: 236,790,496 Protein Targets: 10,854 Gene Targets: 22,108 Data submitters: 623 Countries: 40
  • 4.
  • 5. Data … chemical centralized and beyond • Chemical structure – 2D/3D, SMILES, InChI, SDF.. • Property - • Drug and medication • Agrochemicals • Food additives • Safety and hazards • Toxicity • Literature • Patents • Bioactivity • Target • Natural products • Pathways • … more • Link back to original data
  • 6. How to submit data to PubChem and publishing • Chemical substance – SDF, SMILES, … • Bioactivity data – CSV, XML, ASN.1 … • Annotation – data format varies
  • 7. Data submission .. Chemical substances • Data format: SDF, CSV.. Through PubChem Upload • Convert your structure SDF/CSV into the PubChem standard • Provide mapping information for your data file. https://pubchemdocs.ncbi.nlm.nih.gov/upload-chemicals
  • 8. Data submission .. Bioactivity data • Data format: CSV, XML, ASN.1.. Through PubChem Upload • Use PubChem standard tags for your data (spreadsheet) • Convert your data into PubChem standard XML/ASN.1 https://pubchemdocs.ncbi.nlm.nih.gov/upload-bioassays
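The conversion step on slide 7 can be illustrated with a minimal Python sketch that appends tagged data items to a structure block to form one SDF record. The three tag names are the ones visible in the Springer Nature record on slide 15 of this deck. The function name `sdf_record` and its parameters are hypothetical. Which tags a given submission actually needs is defined by the PubChem Upload documentation linked above, not by this sketch.

```python
from typing import List


def sdf_record(molblock: str, regid: str, synonyms: List[str], comment: str = "") -> str:
    """Build one SDF record: molblock, tagged data items, then the $$$$ delimiter.

    Hypothetical helper for illustration only; tag names are taken from slide 15.
    """
    items = [("PUBCHEM_EXT_DATASOURCE_REGID", [regid])]
    if comment:
        items.append(("PUBCHEM_SUBSTANCE_COMMENT", [comment]))
    if synonyms:
        items.append(("PUBCHEM_SUBSTANCE_SYNONYM", synonyms))

    parts = [molblock.rstrip("\n")]
    for tag, values in items:
        parts.append(f"> <{tag}>")   # SDF data header line
        parts.extend(values)         # one value per line
        parts.append("")             # a blank line closes the data item
    parts.append("$$$$")             # record delimiter
    return "\n".join(parts) + "\n"
```

A submitter converting a CSV file would call such a helper once per row, supplying the structure block from their own toolkit and the registry ID and names from the row's columns.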
  • 9. Data submission .. Data specifications
  • 10. Data submission .. Annotations • Incoming data format varies … CSV, text, XML, json, html, tables, images … special parser needed
  • 11. Data submission .. Case study Springer Nature submitted over 620k chemical substances and more than 4 million literature articles/book chapters to PubChem, which yielded over 28 million compound-literature links.
  • 12. Data submission .. Case study Springer Nature data in PubChem..
  • 13.
  • 14. Data submission .. Case study Substance data
  • 15. Data submission .. Case study Literature annotations
    INFOCHEM 2D 1 1.00000 0.00000 0
    13 13 0 0 0 0 0 0 0 0999 V2000
    0.7894 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    1.5789 1.3684 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
    0.0000 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    -0.4210 0.7368 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    -1.1579 0.7368 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
    -1.5789 1.3684 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    -1.1579 2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    -0.4210 2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    0.0000 0.0526 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
    -0.4210 -0.6316 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    0.0000 -1.3158 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
    -0.4210 -2.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    0.7894 -1.3158 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
    1 2 3 0 0 0 0
    1 3 1 0 0 0 0
    3 4 2 0 0 0 0
    3 8 1 0 0 0 0
    4 5 1 0 0 0 0
    4 9 1 0 0 0 0
    5 6 2 0 0 0 0
    6 7 1 0 0 0 0
    7 8 2 0 0 0 0
    9 10 2 0 0 0 0
    10 11 1 0 0 0 0
    11 12 1 0 0 0 0
    11 13 1 0 0 0 0
    M END
    > <PUBCHEM_EXT_DATASOURCE_REGID>
    13043826-37657822
    > <PUBCHEM_SUBSTANCE_COMMENT>
    This substance has the INFOCHEM hashcode: 13043826-37657822
    > <PUBCHEM_SUBSTANCE_SYNONYM>
    (1e)-N'-(3-cyano-2-pyridinyl)-N,N-dimethylmethanimidamide
    $$$$

    {
      "substances": {
        "100086048-91340494": [
          { "doi": "10.1007/7355_2017_9", "relevance": "3.0595" },
          { "doi": "10.1007/s12010-017-2680-4", "relevance": "0.241" },
          { "doi": "10.1007/s12562-018-1226-1", "relevance": "0.1393" },
          { "doi": "10.1007/s13197-018-3155-5", "relevance": "0.1366" },
          { "doi": "10.1007/s40278-018-48591-z", "relevance": "2.923" },
          { "doi": "10.1186/s12866-018-1205-9", "relevance": "0.5178" },
          { "citations": [
            { "10.1007/978-3-662-55935-2_34": {
                "documentType": "book chapter",
                "journal": "Periphere arterielle Interventionen",
                "language": "De",
                "openArticle": " ",
                "publicationYear": "2018",
                "subjectCollection": "Medicine",
                "title": "Perkutane Angioplastie infrapoplitealer Arterien" } },
            { "10.1007/978-3-662-55935-2_33": {
                "documentType": "book chapter",
                "journal": "Periphere arterielle Interventionen",
                "language": "De",
                "openArticle": " ",
                "publicationYear": "2018",
                "subjectCollection": "Medicine",
                "title": "Behandlungsstrategien bei kritischer Ischämie und chronischen femoropoplitealen Verschlüssen"
    ……..

    <!DOCTYPE Annotations PUBLIC "-//NCBI//annotations/EN" "annotations.dtd">
    <Annotations>
      <Annotation>
        <SourceName>Springer Nature</SourceName>
        <SourceID>100086048-91340494</SourceID>
        <Description>Literature references related to scientific contents from Springer Nature journals and books. &lt;a href=&quot;https://link.springer.com/&quot;&gt;Read more ...&lt;/a&gt;</Description>
        <LinkToPubChemBy>
          <Source>
            <SourceName>15745</SourceName>
            <SourceID>100086048-91340494</SourceID>
          </Source>
        </LinkToPubChemBy>
        <Data>
          <TOCHeading>Springer Nature References</TOCHeading>
          <Name>Springer Nature References</Name>
          <Value>
            <Table>
              <ColumnName>Title</ColumnName>
              <ColumnName>Journal/Book</ColumnName>
              <ColumnName>PMID</ColumnName>
              <ColumnName>Open Access</ColumnName>
              <ColumnName>Year</ColumnName>
              <ExternalTableName>springernature</ExternalTableName>
              <ExternalTableNumRows>5</ExternalTableNumRows>
            </Table>
          </Value>
        </Data>
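As a rough illustration of how the JSON on slide 15 could be turned into compound-literature links, the Python sketch below flattens the `substances` mapping into (registry ID, DOI, relevance) tuples. The key names come from the slide. The function name `literature_links`, the decision to skip the `citations` entry, and the sort by relevance are assumptions made for this example, not a description of PubChem's actual parser.

```python
import json
from typing import List, Tuple

# Small sample shaped like the slide's JSON (values copied from the slide).
SAMPLE = """
{ "substances": { "100086048-91340494": [
    { "doi": "10.1007/s12010-017-2680-4", "relevance": "0.241" },
    { "doi": "10.1007/7355_2017_9", "relevance": "3.0595" },
    { "citations": [] }
] } }
"""


def literature_links(payload: str) -> List[Tuple[str, str, float]]:
    """Flatten the payload into (regid, doi, relevance), highest relevance first."""
    data = json.loads(payload)
    links = []
    for regid, entries in data.get("substances", {}).items():
        for entry in entries:
            if "doi" not in entry:
                continue  # e.g. the "citations" metadata entry
            links.append((regid, entry["doi"], float(entry["relevance"])))
    links.sort(key=lambda link: link[2], reverse=True)
    return links


for regid, doi, relevance in literature_links(SAMPLE):
    print(regid, doi, relevance)
```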
  • 16. Data submission .. Case study DBs APIs
  • 17. Data submission .. Case study • MassBank of North America (MoNA) submitted over 77k chemical substances and about 200k MS annotations to PubChem.
  • 18. Data submission .. MoNA data in PubChem ...
  • 19. Data submission .. MoNA substance data parsing ..
    [{
      "compound": [{
        "inchi": "InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)",
        "inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
        "metaData": [
          { "category": "none", "computed": false, "hidden": false, "name": "SMILES", "value": "c1cc(ccc1N)S(=O)(=O)Nc2cncc(n2)Cl" },
          { "category": "external id", "computed": false, "hidden": false, "name": "cas", "value": "102-65-8" },
          { "category": "external id", "computed": false, "hidden": false,
  • 21. Data submission .. MoNA annotation data parsing ..
    [{
      "compound": [{
        "inchi": "InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)",
        "inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
        "metaData": [
          { "category": "none", "computed": false, "hidden": false, "name": "SMILES", .... }],
      "id": "AU100601",
      ……..
      "splash": { "block1": "splash10", "block2": "0a4i", "block3": "1900000000", "block4": "d2bc1c887f6f99ed0f74", "splash": "splash10-0a4i-1900000000-d2bc1c887f6f99ed0f74" },
      "submitter": { "id": "ntho@chem.uoa.gr", "emailAddress": "ntho@chem.uoa.gr", "firstName": "Nikolaos", "lastName": "Thomaidis", "institution": "University of Athens" },
      "ruleBased": false, "text": "MassBank" } } },
    ....
    DBs APIs
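The MoNA parsing shown on slides 19 and 21 could look roughly like the Python sketch below, which pulls the identifiers a substance record needs out of one MoNA-style entry. The key names (`compound`, `inchiKey`, `metaData`, `id`, `splash`) are the ones visible on the slides. The function name `summarize_record`, the assumption that `id` and `splash` sit at the top level of the record, and the choice of output fields are hypothetical.

```python
import json

# One record shaped like the slide's JSON (values copied from the slide).
SAMPLE_RECORD = json.loads("""
{
  "compound": [{
    "inchiKey": "QKLPUVXBJHRFQZ-UHFFFAOYSA-N",
    "metaData": [
      {"category": "none", "name": "SMILES", "value": "c1cc(ccc1N)S(=O)(=O)Nc2cncc(n2)Cl"},
      {"category": "external id", "name": "cas", "value": "102-65-8"}
    ]
  }],
  "id": "AU100601",
  "splash": {"splash": "splash10-0a4i-1900000000-d2bc1c887f6f99ed0f74"}
}
""")


def summarize_record(record: dict) -> dict:
    """Collect structure identifiers and the spectrum hash from one record."""
    compound = record["compound"][0]
    # Index metaData by lower-cased name so "SMILES" and "cas" look up uniformly.
    meta = {m["name"].lower(): m["value"] for m in compound.get("metaData", [])}
    return {
        "id": record.get("id"),
        "inchikey": compound.get("inchiKey"),
        "smiles": meta.get("smiles"),
        "cas": meta.get("cas"),
        "splash": record.get("splash", {}).get("splash"),
    }


print(summarize_record(SAMPLE_RECORD))
```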
  • 22. Annotation raw data formats from PubChem data submitters • CSV/spreadsheet - Pistoia Alliance Chemical Safety Library reactivity alerts, EPA pesticides, USGS Env … • HTML - ILO-ICSC, NIOSH, NCI cancer drugs, CAMEO … • Text – FDA Orangebook • XML – BioRad SpectraBase, HMDB, OSHA .. • Images – CCDC, MoNA, BioRad … • JSON – Springer Nature, … • Tables – HSDB ..
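Because annotation sources arrive in many formats, one plausible arrangement is a small registry that maps a format name to a parser. The Python sketch below covers only CSV and JSON using the standard library. The names `PARSERS` and `parse_raw` are hypothetical, and real sources such as HTML pages or images would each need their own dedicated parser, as slide 10 notes.

```python
import csv
import io
import json
from typing import Callable, Dict, List


def parse_csv(text: str) -> List[dict]:
    """Parse CSV text with a header row into one dict per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text: str) -> List[dict]:
    """Parse JSON text; wrap a single object so callers always get a list."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


# Hypothetical registry: format name -> parser function.
PARSERS: Dict[str, Callable[[str], List[dict]]] = {
    "csv": parse_csv,
    "json": parse_json,
}


def parse_raw(fmt: str, text: str) -> List[dict]:
    """Dispatch raw submitter text to the parser registered for its format."""
    try:
        parser = PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no parser registered for format: {fmt}") from None
    return parser(text)
```

Adding a new source format then means registering one more function, leaving the dispatch code untouched.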
  • 23. Automated pipeline • PubChem set up an automated pipeline for data submission, parsing, and standardization to update substance data, bioactivity data, and annotations periodically. • Updates can be monthly, weekly, or triggered by a watcher .. • Data submission can be set to pull or push. • The raw data can be in the form of SDF, CSV, text, JSON, XML, HTML, images …
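The slides do not say how the watcher decides that a source has changed. One common approach, shown here purely as an assumption, is to hash the pulled raw data and compare the digest with the one stored from the previous run. The function name `check_for_update` is hypothetical.

```python
import hashlib
from typing import Optional, Tuple


def check_for_update(raw: bytes, last_digest: Optional[str]) -> Tuple[bool, str]:
    """Return (changed, digest) for freshly pulled raw data.

    `last_digest` is the SHA-256 hex digest saved by the previous run,
    or None on the first run, in which case the data counts as changed.
    """
    digest = hashlib.sha256(raw).hexdigest()
    return digest != last_digest, digest
```

A scheduled job would fetch the raw file, call this function, and start parsing and standardization only when the first element of the result is true, storing the new digest for the next run.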
  • 24. Automated pipeline Cron jobs Raw data FTP / URL … DBs APIs Web pages FTP
  • 25. Data accessing .. Open data free access • Website • APIs (programmatic access) • FTP download • Widgets • RDF
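For the programmatic access route, PubChem's PUG REST service accepts URLs of the form `/rest/pug/compound/cid/<CID>/property/<properties>/<format>`. The Python sketch below only builds such a URL and makes no network call. The helper name `property_url` is hypothetical, and CID 2244 (aspirin) is used simply as an example input.

```python
from typing import Iterable
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid: int, properties: Iterable[str], fmt: str = "JSON") -> str:
    """Build a PUG REST URL requesting computed properties for one compound."""
    props = ",".join(properties)
    # Keep commas literal: they separate property names in the URL path.
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/property/{quote(props, safe=',')}/{fmt}"


print(property_url(2244, ["MolecularFormula", "InChIKey"]))
```

The resulting URL can be fetched with any HTTP client. Callers should respect PubChem's published usage policy when issuing many requests.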
  • 27. Who can send data to PubChem .... • Chemical vendors • Research institutes • Government agencies • Publishers • Universities • Pharma companies • Individual scientists... • ….
  • 28. Summary • PubChem provides an open public data repository to allow submitters to upload chemical data. • There are 623 data submitters in total, including publishers, government agencies, research institutes, chemical vendors, universities, individual scientists... • PubChem provides an automated data uploading pipeline. • Raw data can be set to pull or push when submitting data to PubChem.
  • 29. Thank you ... This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. Josef Eiblmaier Dorothee Geppert Zoila Meza-Renken Jakob Ruhdorfer Evan Bolton Asta Gindulyte Ben Shoemaker Paul Thiessen Siqian He Bo Yu Jie Chen Tiejun Cheng Jane He Sunghwan Kim Leon Li Leonid Zaslavsky Sajjan Singh Mehta Gert Wohlgemuth Oliver Fiehn