Presented at the 258th American Chemical Society (ACS) National Meeting in San Diego, CA (August 26, 2019).
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical data repository that provides information on various chemical entities, including small molecules, siRNA, miRNA, peptides, lipids, carbohydrates, chemically modified biologics, etc. One of the most commonly requested tasks in PubChem is to search for a compound by chemical name (also commonly called “chemical synonym”). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. These name-structure associations are used to create links between chemicals and Medical Subject Headings (MeSH) terms, which in turn are used to generate associations between chemicals and PubMed articles. The accuracy of these depositor-provided synonym-structure associations is dependent upon two important quality control methods used in PubChem: (1) chemical structure standardization and (2) synonym filtering based on crowd voting. In this presentation, we will discuss the two quality control methods and their effects on the chemical synonym-structure associations.
Chemical Structure Standardization and Synonym Filtering in PubChem
1. Chemical Structure Standardization and Synonym Filtering in PubChem
Sunghwan Kim, Ph.D., M.Sc.
ACS National Meeting in San Diego, CA
(August 26, 2019)
3. 3
PubChem
Public chemical information resource
Collects data from 690+ sources
Disseminates data back to the public free of charge
Contains the largest amount of publicly available chemical information
Faces unique challenges in dealing with many big data issues on a daily basis:
• Chemical structure standardization
• Name-structure association clean-up
4. Data Organization in PubChem
690+ data contributors provide data through two kinds of deposition:
• Substance deposition: depositor-provided substance descriptions, identified by Substance ID (SID)
• Assay deposition: depositor-provided bioactivity test results (activity of tested “substances”), identified by Assay ID (AID)
Unique chemical structures are extracted from the substances through standardization and identified by Compound ID (CID).
Activity of “compounds” is derived from associated “substances”.
6. Data Organization in PubChem
690+ data contributors → substance deposition → depositor-provided substance descriptions (Substance ID, SID) → unique chemical structure extraction through standardization → unique chemical structures (Compound ID, CID)
Individual data depositors
provide PubChem with:
• Chemical structures
• Chemical names (synonyms)
They need to be
organized/cleaned up through:
• Structure standardization
• Synonym filtering
15. Standardization Statistics
• ~90% of the substances are subject to standardization.
• Mostly organic compounds.
• Standardization success rate: 99.64%
• Modification rate: 44.43%
J. Cheminform. (2018) 10:36
16. Standardized Structures
The standardized structure is not necessarily what one may expect.
[Figure: structures labeled “Most stable in vacuum”, “Most stable in water”, and “Standardized by PubChem”]
17. Standardized Structures
In most cases, tautomeric forms of a molecule are standardized into a single form.
There are a few exceptions.
[Figure: CID 18630 and CID 31261, related by tautomerization]
18. Standardization and Structure Identity Search
You can search PubChem using a structure as a query.
The input structure may be provided:
• using a line notation (e.g., SMILES, InChI)
• by drawing it in the PubChem Sketcher.
The input structure for an identity search is standardized before the search is performed.
Therefore, hits from an identity search may have structures that differ from the original input structure.
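As a rough illustration of a programmatic identity search, the sketch below only builds a request URL and does not contact the server. It assumes the PUG REST `fastidentity` operation with an `identity_type` option; the exact path layout used here is an assumption and should be checked against the PUG REST documentation. The helper name `identity_search_url` and the example SMILES are hypothetical.

```python
from urllib.parse import quote, urlencode

# Base address of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles, identity_type="same_stereo_isotope"):
    """Build (but do not send) an identity-search URL for a SMILES query.

    Assumption: the path follows compound/fastidentity/smiles/<SMILES>/cids/JSON.
    SMILES strings containing "/" may need to be sent another way.
    """
    encoded = quote(smiles, safe="")  # percent-encode characters such as "=" and "("
    options = urlencode({"identity_type": identity_type})
    return f"{PUG_REST}/compound/fastidentity/smiles/{encoded}/cids/JSON?{options}"


# Hypothetical query structure (aspirin written as SMILES).
url = identity_search_url("CC(=O)OC1=CC=CC=C1C(=O)O")
```

Because the query is standardized on the server before matching, the CIDs returned for such a request may correspond to a structure drawn differently from the input.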
25. 25
Unfiltered depositor-provided synonyms (page 1/3)
Synonym | # SIDs | # CIDs
N/A | 6,869 | 6,368
SPIRO COMPOUNDS WITH POLYCYCLIC COMPONENTS ARE NOT SUPPORTED IN CURRENT VERSION | 4,903 | 4,902
NULL | 4,610 | 4,599
ASSEMBLIES OF CYCLIC SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION | 2,554 | 2,554
NOT AVAILABLE | 1,867 | 1,816
LECITHIN | 1,157 | 1,142
DIACYLGLYCEROL | 847 | 842
DIGLYCERIDE | 841 | 841
MULTIPLICATIVE NOMENCLATURE IS NOT SUPPORTED IN CURRENT VERSION! | 797 | 794
VITASMLAB | 461 | 461
MIXTURE NAME | 419 | 413
CLA | 770 | 394
CHLOROPHYLL A | 749 | 393
NA | 7,081 | 371
26. 26
Various forms of
“Not Available”
Unfiltered Depositor-provided synonyms (page 1/3)
28. 28
Various forms of
“Not Available”
Great reduction in the structure count
after structure standardization
SIDs are standardized to Na (sodium)
Unfiltered Depositor-provided synonyms (page 1/3)
29. 29
Error messages from
name generation software
Unfiltered Depositor-provided synonyms (page 1/3)
30. 30
Names of
chemical classes
Unfiltered Depositor-provided synonyms (page 1/3)
31. 31
Unfiltered depositor-provided synonyms (page 2/3)
Synonym | # SIDs | # CIDs
(1-(5-CARBOXYPENTYL)-3,3-DIMETHYL-3H-INDOL-1-IUM-2-YL)METHANIDE HYDROBROMIDE | 405 | 345
ETHANONE,1- - | 328 | 328
CANNOT MAKE CHOICE: LIGANDS ARE COMPARED UP TO 10 SPHERES | 304 | 304
COMPLEX BRIDGED FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 302 | 302
TRIACYLGLYCEROL | 286 | 285
TRIGLYCERIDE | 286 | 285
QUINOLONE DER. | 280 | 279
UNABLE TO GENERATE VALUE | 274 | 264
UNL | 656 | 255
UNKNOWN LIGAND | 615 | 235
HEPT DERIV. | 213 | 211
MULTIPARENT NAMES FOR FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 208 | 208
ACHIRAL CENTER(S) | 187 | 187
32. 32
“Derivative” of
a chemical
Unfiltered Depositor-provided synonyms (page 2/3)
34. 34
Unfiltered depositor-provided synonyms (page 3/3)
Synonym | # SIDs | # CIDs
C9H11NO2 | 179 | 174
HEM | 4,645 | 165
BCR | 290 | 160
C10H13NO2 | 161 | 154
BETA-CAROTENE | 298 | 147
C8H10N2O2 | 149 | 144
C10H10N2O2 | 149 | 143
-ACETICACID | 141 | 141
C9H8N2O2 | 143 | 141
PROTOPORPHYRIN IX CONTAINING FE | 3,690 | 140
C8H9NO2 | 144 | 139
NAG | 9,599 | 130
METHANOL | 247 | 128
C8H9NO3 | 129 | 127
C10H9NO2 | 133 | 126
PYRIDINONE DERIV. | 130 | 126
N. A. | 128 | 125
Molecular formula
37. 37
Abbreviation for
chemical names
Unfiltered Depositor-provided synonyms (page 3/3)
Description
“Not available”
38. 38
Unfiltered Depositor-provided synonyms
Depositor-provided synonyms include:
• Real chemical names
• Abbreviations for chemical names
• “Derivatives” of some chemicals
• Names of chemical classes
• Molecular formulas
• N/A, NULL, Not Available, NA, N.A., etc.
• Error messages or comments
Not feasible to manually clean up.
PubChem uses crowd-voting-based synonym filtering.
40. 40
PubChem Synonym filtering
Crowd-voting approach
Check for a consensus on the name-structure association between depositors.
Consensus threshold: >60% of the total votes
When a consensus is reached, the synonym is added to the “filtered” synonym list of the corresponding compound (standardized structure).
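The consensus rule on this slide can be sketched in a few lines. This is a minimal illustration, not PubChem's implementation: the function name `find_consensus` and the representation of votes as a plain list of structure identifiers are assumptions made for the example. Only the ">60% of the total votes" threshold comes from the slide.

```python
from collections import Counter

CONSENSUS_THRESHOLD = 0.60  # from the slide: more than 60% of the total votes


def find_consensus(votes, threshold=CONSENSUS_THRESHOLD):
    """Return the structure whose vote share exceeds the threshold, else None.

    `votes` is an iterable of structure identifiers, one entry per vote
    (a hypothetical representation chosen for this sketch).
    """
    tally = Counter(votes)
    total = sum(tally.values())
    if total == 0:
        return None  # nobody voted, so there is nothing to agree on
    winner, count = tally.most_common(1)[0]
    # Strictly greater than the threshold, matching the ">60%" wording.
    return winner if count / total > threshold else None
```

With this rule, two votes out of three for one structure (about 67%) give a consensus, while an even split between two structures does not.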
41. 41
Synonyms that occur only once
[Diagram: Depositor 1 provides SID 1 with Synonym A; SID 1 is linked to CID 1]
There is no disagreement in the name-structure association.
Synonym A is therefore taken to mean CID 1 (although this may not be correct).
42. 42
Synonyms occurring multiple times
[Diagram: Synonym A is provided by Depositors 1-4 through SID 1 to SID 10, which are linked to CID 1, CID 2, and CID 3]
Which one is the best choice?
43. 43
Synonym filtering using crowd voting
Two potential approaches
• Multiple-votes-per-depositor
• Single-vote-per-depositor
44. 44
Multiple-Votes-per-Depositor Strategy
[Diagram: Synonym A is provided by Depositors 1-4 through SID 1 to SID 10, which are linked to CID 1, CID 2, and CID 3. Every SID counts as one vote.]
# votes
CID 1: 3 (30%)
CID 2: 5 (50%)
CID 3: 2 (20%)
Consensus Threshold = 60%
No CID exceeds the threshold, so no consensus is reached.
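The multiple-votes-per-depositor tally can be reproduced with a short sketch. The slide text does not preserve which SID is linked to which CID, so the assignments in `SUBSTANCES` are hypothetical, chosen only so that the totals match the slide (3, 5, and 2 votes). The function name is also an assumption.

```python
from collections import Counter

# Hypothetical (depositor, SID, CID) assignments; only the per-CID totals
# (3, 5, 2) are taken from the slide.
SUBSTANCES = [
    ("Depositor 1", 1, "CID 1"),
    ("Depositor 2", 2, "CID 2"),
    ("Depositor 2", 3, "CID 2"),
    ("Depositor 2", 4, "CID 2"),
    ("Depositor 2", 5, "CID 1"),
    ("Depositor 3", 6, "CID 2"),
    ("Depositor 3", 7, "CID 2"),
    ("Depositor 3", 8, "CID 3"),
    ("Depositor 4", 9, "CID 1"),
    ("Depositor 4", 10, "CID 3"),
]


def multiple_votes_per_depositor(substances, threshold=0.60):
    """Count one vote per SID and report the consensus CID, if any."""
    tally = Counter(cid for _depositor, _sid, cid in substances)
    total = sum(tally.values())
    winner, count = tally.most_common(1)[0]
    consensus = winner if count / total > threshold else None
    return tally, consensus


tally, consensus = multiple_votes_per_depositor(SUBSTANCES)
```

Under these assumed assignments the leading CID holds 5 of 10 votes, which is below the 60% threshold, so `consensus` is `None`. A depositor with many substances also carries more weight under this strategy.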
49. 49
Single-Vote-per-Depositor Strategy
[Diagram: Synonym A is provided by Depositors 1-4 through SID 1 to SID 10, which are linked to CID 1, CID 2, and CID 3. Every depositor casts at most one vote.]
# votes
CID 1: 1 (33%)
CID 2: 2 (67%)
CID 3: 0 (0%)
Consensus Threshold = 60%
Consensus has been reached!
Synonym A = CID 2
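A sketch of the single-vote-per-depositor tally follows. Two things here are assumptions rather than statements from the slides: the SID-to-CID assignments in `SUBSTANCES` are hypothetical, and the rule that a depositor votes for the CID holding more than 60% of its own substances (and abstains otherwise) is a guess that happens to reproduce the slide's three votes from four depositors.

```python
from collections import Counter, defaultdict

THRESHOLD = 0.60

# Hypothetical (depositor, SID, CID) assignments, not taken from the slides.
SUBSTANCES = [
    ("Depositor 1", 1, "CID 1"),
    ("Depositor 2", 2, "CID 2"),
    ("Depositor 2", 3, "CID 2"),
    ("Depositor 2", 4, "CID 2"),
    ("Depositor 2", 5, "CID 1"),
    ("Depositor 3", 6, "CID 2"),
    ("Depositor 3", 7, "CID 2"),
    ("Depositor 3", 8, "CID 3"),
    ("Depositor 4", 9, "CID 1"),
    ("Depositor 4", 10, "CID 3"),
]


def consensus_of(tally, threshold=THRESHOLD):
    """Return the entry whose share of the tally exceeds the threshold, else None."""
    total = sum(tally.values())
    if total == 0:
        return None
    winner, count = tally.most_common(1)[0]
    return winner if count / total > threshold else None


def single_vote_per_depositor(substances, threshold=THRESHOLD):
    """Let each depositor cast at most one vote, then look for a consensus."""
    per_depositor = defaultdict(Counter)
    for depositor, _sid, cid in substances:
        per_depositor[depositor][cid] += 1

    ballots = Counter()
    for own_tally in per_depositor.values():
        # Assumed rule: a depositor votes only if its own substances agree.
        choice = consensus_of(own_tally, threshold)
        if choice is not None:
            ballots[choice] += 1
    return ballots, consensus_of(ballots, threshold)


ballots, consensus = single_vote_per_depositor(SUBSTANCES)
```

With these assumptions, Depositor 4 is split evenly and abstains, the remaining ballots are one for CID 1 and two for CID 2, and CID 2 passes the threshold with two of three votes.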
51. 51
Additional consideration: Different contexts of chemical sameness
Abbr. | CACTVS hash code used | Description
CID | CID hash code | Connectivity + isotopes + stereochemistry
STE | CID stereo hash code | Connectivity + stereochemistry
CON | CID connectivity hash code | Connectivity
PCID | Parent CID hash code | CID of the parent compound
PSTE | Parent CID stereo hash code | STE of the parent compound
PCON | Parent CID connectivity hash code | CON of the parent compound
In practice, synonym filtering uses CACTVS hash codes (instead of CID) to determine whether a consensus is reached.
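The idea of several sameness contexts can be illustrated with a toy model. This sketch does not compute CACTVS hash codes: plain tuples stand in for them, the `Structure` class and its string fields are hypothetical, and the parent-compound contexts (PCID, PSTE, PCON) are left out. Only the mapping of CID, STE, and CON to their structural layers follows the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    """Toy stand-in for a standardized structure (labels are hypothetical)."""
    connectivity: str
    stereo: str
    isotopes: str


def sameness_key(structure, context):
    """Return the key under which two structures count as 'the same'."""
    if context == "CID":   # connectivity + isotopes + stereochemistry
        return (structure.connectivity, structure.isotopes, structure.stereo)
    if context == "STE":   # connectivity + stereochemistry
        return (structure.connectivity, structure.stereo)
    if context == "CON":   # connectivity only
        return (structure.connectivity,)
    raise ValueError(f"unsupported context: {context}")


structures = [
    Structure("alanine", stereo="L", isotopes="natural"),
    Structure("alanine", stereo="D", isotopes="natural"),
    Structure("alanine", stereo="L", isotopes="13C-labeled"),
]

# Number of distinct structures seen in each context.
distinct = {
    context: len({sameness_key(s, context) for s in structures})
    for context in ("CID", "STE", "CON")
}
```

In this toy example the three entries are all different in the CID context, collapse to two in the STE context, and to one in the CON context, so votes that disagree in one context may agree in a looser one.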
55. 55
Limitations of Synonym Filtering
1. Synonym filtering focuses on consistency, not correctness.
• It resolves the discrepancies in name-structure associations within and between depositors.
• It does not mean that filtered synonyms are correct.
Example: Fentin acetate (CID 16682804)
Its filtered synonyms include:
• m-Nitrobenzaldehyde 3-thio-4-o-tolylsemicarbazone
• Benzaldehyde, m-nitro-, 3-thio-4-o-tolylsemicarbazone
58. 58
Limitations of Synonym Filtering
1. Synonym filtering focuses on consistency, not correctness.
• Data sources integrate synonym data from other sources that are regarded as authoritative (e.g., government resources).
• Erroneous data in one source propagate into other sources.
• This practice helps incorrect name-chemical associations get more votes than they should during the synonym filtering process.
59. 59
Limitations of Synonym Filtering
2. More than 90% of depositor-provided synonyms occur only once.
• They are automatically assigned to the structures represented by their corresponding CIDs.
63. 63
Summary
PubChem contains a large amount of chemical information provided by 690+ data sources.
Through the chemical structure standardization process, PubChem standardizes depositor-provided chemical structures and extracts unique structures.
PubChem uses crowd-voting-based synonym filtering to clean up name-structure associations provided by depositors.
64. 64
Acknowledgements
Evan Bolton
Jie Chen
Tiejun Cheng
Asta Gindulyte
Jia He
Siqian He
Qingliang Li
Benjamin Shoemaker
Paul Thiessen
Bo Yu
Leonid Zaslavsky
Jian Zhang
The PubChem Team
PubChem depositors, users, and collaborators
Funded by the National Library of Medicine