Publishing chemical data in public data repository
1. Publishing chemical data in public data repository
Jian Zhang*, Paul Thiessen, Asta Gindulyte, Evan Bolton
256th ACS National Meeting, Boston, August 2018
2. Outline
• PubChem overview
• What data PubChem has
• How to publish your data - case studies
• Automated pipeline
• How to access
• Summary
3. PubChem … overview and status
• A public chemical data repository – a public data sharing platform
• An open chemistry database
• A chemical information hub
• A data comparison center
• A chemical data index
Compounds: 96,478,070
Substances: 247,243,896
BioAssays: 1,252,901
Tested Compounds: 2,978,541
Tested Substances: 4,994,132
BioActivities: 236,790,496
Protein Targets: 10,854
Gene Targets: 22,108
Data submitters: 623
Countries: 40
5. Data … chemical-centric and beyond
• Chemical structure – 2D/3D, SMILES, InChI, SDF..
• Property
• Drug and medication
• Agrochemicals
• Food additives
• Safety and hazards
• Toxicity
• Literature
• Patents
• Bioactivity
• Target
• Natural products
• Pathways
• … more
• Link back to original data
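The structure and property data listed above can also be retrieved programmatically. The following is a minimal sketch, assuming PubChem's PUG REST interface and its compound/property URL pattern; the helper name `property_url` is hypothetical, and the example only builds the request URL rather than performing the request.

```python
from urllib.parse import quote

# Base address of the PUG REST service (assumed here; check the current
# PubChem documentation before relying on it).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties, output="JSON"):
    """Build a PUG REST URL requesting selected properties for one compound ID."""
    # Property names are joined with commas, as the URL pattern expects.
    prop_list = ",".join(quote(p) for p in properties)
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/property/{prop_list}/{output}"


if __name__ == "__main__":
    # CID 2244 is aspirin.
    print(property_url(2244, ["MolecularFormula", "InChIKey"]))
```

The resulting URL can be opened in a browser or passed to any HTTP client.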
6. How to submit and publish data in PubChem
• Chemical substance – SDF, SMILES, …
• Bioactivity data – CSV, XML, ASN.1 …
• Annotation – data format varies
7. Data submission .. Chemical substances
• Data format: SDF, CSV .. through PubChem Upload
• Convert your structure SDF/CSV into the PubChem standard format
• Provide mapping information for your data file.
https://pubchemdocs.ncbi.nlm.nih.gov/upload-chemicals
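The "mapping information" step can be pictured as a table that says which column of the submitter's file corresponds to which expected field. Below is a minimal sketch, assuming a CSV input; the column names on both sides of `COLUMN_MAP` and the helper `remap_rows` are hypothetical placeholders, not an official PubChem field list.

```python
import csv
import io

# Hypothetical mapping: submitter column name -> target field name.
# The target names are illustrative only.
COLUMN_MAP = {
    "catalog_no": "external_id",
    "smiles": "structure_smiles",
    "name": "synonym",
}


def remap_rows(csv_text, column_map, required="structure_smiles"):
    """Rename mapped columns, drop unmapped ones, and skip rows lacking the required field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        record = {new: (row.get(old) or "").strip() for old, new in column_map.items()}
        if record.get(required):
            records.append(record)
    return records


if __name__ == "__main__":
    sample = "catalog_no,smiles,name,price\nA1,CCO,ethanol,10\nA2,,unknown,5\n"
    for rec in remap_rows(sample, COLUMN_MAP):
        print(rec)
```

Keeping the mapping in one dictionary makes it easy to review with the repository staff before any data is converted.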
8. Data submission .. Bioactivity data
• Data format: CSV, XML, ASN.1 .. through PubChem Upload
• Use PubChem standard tags for your data (spreadsheet)
• Convert your data into PubChem standard XML/ASN.1
https://pubchemdocs.ncbi.nlm.nih.gov/upload-bioassays
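Before uploading a bioactivity spreadsheet, it can help to check that the expected tag columns are present. This is a minimal sketch; the tag names in `REQUIRED_TAGS` are given as an assumption for illustration, and the authoritative list should be taken from the upload documentation linked above. The helper `missing_tags` is hypothetical.

```python
import csv
import io

# Assumed tag names for illustration; confirm against the upload documentation.
REQUIRED_TAGS = ("PUBCHEM_EXT_DATASOURCE_REGID", "PUBCHEM_ACTIVITY_OUTCOME")


def missing_tags(csv_text, required=REQUIRED_TAGS):
    """Return the required tags that do not appear in the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {column.strip() for column in header}
    return [tag for tag in required if tag not in present]


if __name__ == "__main__":
    sheet = "PUBCHEM_EXT_DATASOURCE_REGID,IC50\nX1,0.5\n"
    print(missing_tags(sheet))
```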
10. Data submission .. Annotations
• Incoming data formats vary … CSV, text, XML, JSON, HTML, tables, images … special parsers are needed
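One common way to handle varying input formats is a small registry that routes each file to a format-specific parser and returns records in a single shape. The sketch below is illustrative only and covers just CSV and JSON; the function names and the record layout are hypothetical and do not describe PubChem's actual parsers.

```python
import csv
import io
import json


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def parse_json(text):
    """Parse JSON text; a single object is wrapped in a list for a uniform result."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


# Registry of format name -> parser; other formats would be added here.
PARSERS = {"csv": parse_csv, "json": parse_json}


def parse_annotation(text, fmt):
    """Dispatch to the parser registered for `fmt`, or fail loudly if none exists."""
    try:
        parser = PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no parser registered for format: {fmt}") from None
    return parser(text)


if __name__ == "__main__":
    print(parse_annotation("name,hazard\nacetone,flammable\n", "csv"))
```

Formats such as HTML or images would each need their own entry in the registry, which is where the source-specific work lies.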
11. Data submission .. Case study
Springer Nature submitted over 620k chemical substances and more than 4 million literature articles/book chapters to PubChem, which yielded over 28 million compound-literature links.
22. Annotation raw data formats from PubChem data submitters
• CSV/spreadsheet – Pistoia Alliance Chemical Safety Library reactivity alerts, EPA pesticides, USGS Env …
• HTML – ILO-ICSC, NIOSH, NCI cancer drugs, CAMEO …
• Text – FDA Orange Book
• XML – BioRad SpectraBase, HMDB, OSHA ..
• Images – CCDC, MoNA, BioRad …
• JSON – Springer Nature, …
• Tables – HSDB ..
23. Automated pipeline
• PubChem has set up an automated data submission, parsing, and standardization pipeline that periodically updates substance data, bioactivity data, and annotations.
• Updates can be monthly, weekly, or triggered by a watcher ..
• Data submission can be set to pull or push.
• The raw data can be in the form of SDF, CSV, text, JSON, XML, HTML, images …
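The watcher idea can be illustrated with a content-hash check: a file is reprocessed only when its contents differ from what was seen last time. This is a minimal sketch under that assumption; the helper names are hypothetical and it does not describe how PubChem's own pipeline detects changes.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, seen):
    """Return paths whose contents differ from the digest stored in `seen`.

    `seen` maps path -> last digest and is updated in place, so a second
    call with unchanged files returns an empty list.
    """
    changed = []
    for path in paths:
        key = str(path)
        digest = file_digest(path)
        if seen.get(key) != digest:
            seen[key] = digest
            changed.append(key)
    return changed
```

A scheduler (monthly, weekly, or a polling loop) would call `changed_files` and hand only the returned paths to the parsing and standardization steps.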
27. Who can send data to PubChem ....
• Chemical vendors
• Research institutes
• Government agencies
• Publishers
• Universities
• Pharma companies
• Individual scientists...
• ….
28. Summary
• PubChem provides an open public data repository that allows submitters to upload chemical data.
• There are 623 data submitters in total, including publishers, government agencies, research institutes, chemical vendors, universities, individual scientists...
• PubChem provides an automated data uploading pipeline.
• Raw data transfer can be set to pull or push when submitting data to PubChem.
29. Thank you ... This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
Josef Eiblmaier
Dorothee Geppert
Zoila Meza-Renken
Jakob Ruhdorfer
Evan Bolton
Asta Gindulyte
Ben Shoemaker
Paul Thiessen
Siqian He
Bo Yu
Jie Chen
Tiejun Cheng
Jane He
Sunghwan Kim
Leon Li
Leonid Zaslavsky
Sajjan Singh Mehta
Gert Wohlgemuth
Oliver Fiehn