Chemical Structure Standardization and Synonym Filtering in PubChem

Sunghwan Kim, Ph.D., M.Sc.
ACS National Meeting in San Diego, CA
(August 26, 2019)

PubChem
(https://pubchem.ncbi.nlm.nih.gov)

 Public chemical information resource
 Collects data from more than 690 sources
 Disseminates data back to the public free of charge
 Contains the largest amount of publicly available chemical information
 Faces unique challenges in dealing with many big data issues on a daily basis:
• Chemical structure standardization
• Name-structure association clean-up

Data Organization in PubChem
[Diagram: 690+ data contributors submit substance depositions and assay depositions.]
 Substance ID (SID): depositor-provided substance descriptions
 Compound ID (CID): unique chemical structures, extracted from substances through standardization
 Assay ID (AID): depositor-provided bioactivity test results
• Activity of tested “substances”
• Activity of “compounds” derived from associated “substances”

 Individual data depositors
provide PubChem with:
• Chemical structures
• Chemical names (synonyms)
 They need to be
organized/cleaned up through:
• Structure standardization
• Synonym filtering

Common Issues with
Chemical Structure Representations in
PubChem

Drawing conventions
Drawing conventions are often ignored in
structures deposited by original data sources.

Aromatic Compounds
[Figure: Kekulé 1, Kekulé 2, and aromatic representations of the same compound]
Many Kekulé structures for aromatic compounds
Which one should be used as a standard?

Different Forms of the Same Molecule
[Figure: forms of one molecule related by tautomerism, ionization, and mesomerism]
Different tautomers, resonance forms, protonation states!
Choose the most stable one?

Different Forms of the Same Molecule
[Figure: the form most stable in vacuum and the form most stable in water]
The stability depends upon the context.

PubChem
Chemical Structure Standardization

PubChem Standardization
 Detect components
• Isolate covalent units
• Neutralize (by H+ or e-)
• Reprocess
• Detect unique components
 Normalize representation
• Tautomer invariance
• Aromaticity detection
• Stereochemistry
• Explicit hydrogen
 Validate chemical contents
• Atoms defined/real
• Implicit hydrogen
• Functional group
• Atom valence
 Calculate
• Coordinates
• Properties
• Descriptors
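
As an illustration of the "Isolate covalent units" step, the sketch below treats a deposited structure as a graph of atoms and bonds and splits it into connected components. It is only a minimal sketch of the general idea, not PubChem's implementation: the function name `covalent_units` and the atom/bond representation are assumptions made for this example.

```python
from collections import defaultdict


def covalent_units(num_atoms, bonds):
    """Group atom indices into covalently connected units (graph components).

    Hypothetical helper: atoms are indices 0..num_atoms-1 and bonds are
    (atom, atom) index pairs. Bond orders are ignored on purpose.
    """
    neighbors = defaultdict(set)
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    units = []
    for start in range(num_atoms):
        if start in seen:
            continue
        # Depth-first walk over everything bonded to `start`.
        stack = [start]
        seen.add(start)
        unit = []
        while stack:
            atom = stack.pop()
            unit.append(atom)
            for nxt in neighbors[atom]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        units.append(sorted(unit))
    return units


# Sodium acetate, heavy atoms only: C0-C1, C1-O2, C1-O3, plus an unbonded Na4.
atoms = ["C", "C", "O", "O", "Na"]
bonds = [(0, 1), (1, 2), (1, 3)]
units = covalent_units(len(atoms), bonds)
print(units)  # [[0, 1, 2, 3], [4]]
```

Each resulting unit (here the acetate and the sodium) could then be neutralized and reprocessed on its own, as the component-detection bullets describe.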

J. Cheminform. (2018) 10:36

Standardization Statistics
• ~90% of the substances are subject to standardization.
• Mostly organic compounds.
• Standardization success rate: 99.64%
• Modification rate: 44.43%
J. Cheminform. (2018) 10:36

Standardized Structures
[Figure: the form most stable in vacuum, the form most stable in water, and the structure standardized by PubChem]
The standardized structure is not necessarily what one may expect.

Standardized Structures
 In most cases, tautomeric forms of a molecule are standardized into a single form.
 There are a few exceptions.
[Figure: CID 18630 and CID 31261, related by tautomerization]

Standardization and Structure Identity Search
 You can search PubChem using a structure as a query.
 The input structure may be provided:
• using a line notation (e.g., SMILES, InChI)
• by drawing it in the PubChem Sketcher.
 The input structure for an identity search is standardized before the search is performed.
 Therefore, hits from an identity search may have structures different from the original input structure.
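
For readers who query PubChem programmatically, an identity search can also be requested through PUG REST. The sketch below only builds the request URL and makes no network call. The helper name `identity_search_url` is hypothetical, and the endpoint layout (`fastidentity`, the `identity_type` option and its `same_stereo_isotope` value) is written from memory of the PUG REST documentation, so it should be checked there before use.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles, identity_type="same_stereo_isotope"):
    """Build a PUG REST identity-search URL that asks for matching CIDs as JSON.

    Hypothetical helper: the SMILES string is percent-encoded so that
    characters such as parentheses are safe inside the URL path.
    """
    return (
        f"{PUG_REST}/compound/fastidentity/smiles/{quote(smiles, safe='')}"
        f"/cids/JSON?identity_type={identity_type}"
    )


# 2,4-Dihydroxypyrimidine written as SMILES (a tautomer of uracil).
url = identity_search_url("Oc1ccnc(O)n1")
print(url)
```

Because the query is standardized before the search, the CIDs returned for such a request may show a different tautomer than the SMILES that was submitted.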

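Standardization and Structure Identity Search
[Figure: identity search with two depositor-provided tautomers as queries]
• Query structures: 2,4-Dihydroxypyrimidine (SID 377954591), 2-hydroxy-4(1h)-pyrimidinone (SID 341255477)
• Identity search hit: Uracil (CID 1174)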

Depositor-supplied synonyms &
MeSH Entry Terms

Two kinds of chemical names in PubChem

MeSH Entry Terms
 A set of “terms” related to ibuprofen.
 Used to index PubMed articles to help find articles
about ibuprofen.

Depositor-Supplied Synonyms
 Synonyms provided for “substance” records by depositors.
 “Filtered” synonyms are provided on the “Compound” Summary page.

Examples: raw (unfiltered) depositor-provided synonyms associated with the largest number of CIDs

Unfiltered Depositor-provided synonyms (page 1/3)
Synonym | # SIDs | # CIDs
N/A | 6,869 | 6,368
SPIRO COMPOUNDS WITH POLYCYCLIC COMPONENTS ARE NOT SUPPORTED IN CURRENT VERSION | 4,903 | 4,902
NULL | 4,610 | 4,599
ASSEMBLIES OF CYCLIC SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION | 2,554 | 2,554
NOT AVAILABLE | 1,867 | 1,816
LECITHIN | 1,157 | 1,142
DIACYLGLYCEROL | 847 | 842
DIGLYCERIDE | 841 | 841
MULTIPLICATIVE NOMENCLATURE IS NOT SUPPORTED IN CURRENT VERSION! | 797 | 794
VITASMLAB | 461 | 461
MIXTURE NAME | 419 | 413
CLA | 770 | 394
CHLOROPHYLL A | 749 | 393
NA | 7,081 | 371

Categories highlighted in the list above:
• Various forms of “Not Available” (e.g., N/A, NULL, NOT AVAILABLE, NA)
 For NA, the structure count is greatly reduced after structure standardization (7,081 SIDs vs. 371 CIDs): the SIDs are standardized to Na (sodium).
• Error messages from name generation software (e.g., SPIRO COMPOUNDS WITH POLYCYCLIC COMPONENTS ARE NOT SUPPORTED IN CURRENT VERSION)
• Names of chemical classes (e.g., LECITHIN, DIACYLGLYCEROL, DIGLYCERIDE)
Unfiltered Depositor-provided synonyms (page 2/3)
Synonym | # SIDs | # CIDs
(1-(5-CARBOXYPENTYL)-3,3-DIMETHYL-3H-INDOL-1-IUM-2-YL)METHANIDE HYDROBROMIDE | 405 | 345
ETHANONE,1- - | 328 | 328
CANNOT MAKE CHOICE: LIGANDS ARE COMPARED UP TO 10 SPHERES | 304 | 304
COMPLEX BRIDGED FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 302 | 302
TRIACYLGLYCEROL | 286 | 285
TRIGLYCERIDE | 286 | 285
QUINOLONE DER. | 280 | 279
UNABLE TO GENERATE VALUE | 274 | 264
UNL | 656 | 255
UNKNOWN LIGAND | 615 | 235
HEPT DERIV. | 213 | 211
MULTIPARENT NAMES FOR FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 208 | 208
ACHIRAL CENTER(S) | 187 | 187

Category highlighted in the list above:
• “Derivative” of a chemical (e.g., QUINOLONE DER., HEPT DERIV.)

Unfiltered Depositor-provided synonyms (page 3/3)
Synonym | # SIDs | # CIDs
C9H11NO2 | 179 | 174
HEM | 4,645 | 165
BCR | 290 | 160
C10H13NO2 | 161 | 154
BETA-CAROTENE | 298 | 147
C8H10N2O2 | 149 | 144
C10H10N2O2 | 149 | 143
-ACETICACID | 141 | 141
C9H8N2O2 | 143 | 141
PROTOPORPHYRIN IX CONTAINING FE | 3,690 | 140
C8H9NO2 | 144 | 139
NAG | 9,599 | 130
METHANOL | 247 | 128
C8H9NO3 | 129 | 127
C10H9NO2 | 133 | 126
PYRIDINONE DERIV. | 130 | 126
N. A. | 128 | 125

Categories highlighted in the list above:
• Molecular formula (e.g., C9H11NO2)
• Abbreviation for chemical names (e.g., HEM, BCR, NAG)
• Description
• “Not available” (N. A.)

Unfiltered Depositor-provided synonyms
 Depositor-provided synonyms include:
• Real chemical names
• Abbreviations for chemical names
• “Derivatives” of some chemicals
• Names of chemical classes
• Molecular formulas
• N/A, NULL, Not Available, NA, N.A., etc.
• Error messages or comments
 Manual clean-up is not feasible.
 PubChem uses crowd-voting-based synonym filtering.

PubChem Synonym filtering
 Crowd-voting approach
 Check for a consensus on the name-structure association between depositors.
 Consensus threshold: >60% of the total votes
 When a consensus is reached, the synonym is added to the “filtered” synonym list of the corresponding compound (standardized structure).
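
The consensus rule can be written down in a few lines. The following is a minimal sketch assuming votes have already been tallied per structure: the function name `consensus_target` is hypothetical, and integer arithmetic is used so that a share of exactly 60% does not pass the ">60%" test.

```python
def consensus_target(votes, threshold_percent=60):
    """Return the structure holding strictly more than the threshold share of votes.

    `votes` maps a structure label to its vote count (hypothetical layout).
    Returns None when no structure exceeds the threshold.
    """
    total = sum(votes.values())
    for structure, count in votes.items():
        # Same as count / total > threshold_percent / 100, without floats.
        if count * 100 > threshold_percent * total:
            return structure
    return None


single = consensus_target({"CID 1": 1})                # one vote, 100% share
borderline = consensus_target({"CID 1": 3, "CID 2": 2})  # exactly 60% share
print(single)      # CID 1
print(borderline)  # None
```

A synonym that appears only once always passes, because its single vote is 100% of the total.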

Synonyms that occur only “once”
[Diagram: Depositor 1 provides Synonym A for SID 1, which is associated with CID 1.]
 No disagreement in the name-structure association
 Synonym A is considered to mean CID 1 (although it may not be correct).

Synonyms occurring multiple times
[Diagram: several depositors supply Synonym A for multiple SIDs, which are associated with CID 1, CID 2, and CID 3.]
Which one is the best choice?

Synonym filtering using crowd voting
 Two potential approaches
• Multiple-votes-per-depositor
• Single-vote-per-depositor

Multiple-Votes-per-Depositor Strategy
[Diagram: the same depositors, SIDs, and CIDs as above; every SID carrying Synonym A counts as one vote.]
# votes
• CID 1: 3 (30%)
• CID 2: 5 (50%)
• CID 3: 2 (20%)
Consensus Threshold = 60%
No CID exceeds the threshold, so no consensus is reached.
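
The multiple-votes tally (3, 5, and 2 votes) can be reproduced with a simple counter. The exact SID-to-CID arrows of the diagram are not recoverable from the slide text, so the `records` list below is a hypothetical assignment chosen only so that the totals match the slide.

```python
from collections import Counter

# Hypothetical (depositor, SID, CID) records for "Synonym A".
# Assumption: this assignment is invented to reproduce the 3 / 5 / 2 tally.
records = [
    ("Depositor 1", 1, "CID 1"),
    ("Depositor 2", 2, "CID 1"),
    ("Depositor 2", 3, "CID 2"),
    ("Depositor 2", 4, "CID 2"),
    ("Depositor 2", 5, "CID 2"),
    ("Depositor 3", 6, "CID 2"),
    ("Depositor 3", 7, "CID 2"),
    ("Depositor 3", 8, "CID 3"),
    ("Depositor 4", 9, "CID 1"),
    ("Depositor 4", 10, "CID 3"),
]

# Multiple votes per depositor: every SID counts as one vote.
votes = Counter(cid for _depositor, _sid, cid in records)
total = sum(votes.values())
winner = next((cid for cid, n in votes.items() if n * 100 > 60 * total), None)

print(dict(votes))  # {'CID 1': 3, 'CID 2': 5, 'CID 3': 2}
print(winner)       # None
```

With this strategy a depositor that submits many substances has more influence on the outcome, and here the largest share (50%) stays below the threshold.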

Single-Vote-per-Depositor Strategy
[Diagram: the same depositors, SIDs, and CIDs as above; each depositor casts a single vote.]
# votes
• CID 1: 1 (33%)
• CID 2: 2 (67%)
• CID 3: 0 (0%)
Consensus has been reached!
Synonym A = CID 2
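
The single-vote tally (1, 2, and 0 votes) can be sketched as a two-stage vote. The slides do not spell out how a depositor's single vote is chosen, so two things below are assumptions: the `records` list is a hypothetical assignment of SIDs to CIDs, and a depositor is assumed to vote only when its own SIDs agree under the same >60% rule (otherwise it abstains).

```python
from collections import Counter, defaultdict

THRESHOLD_PERCENT = 60


def consensus(votes):
    """Return the key with strictly more than 60% of the votes, else None."""
    total = sum(votes.values())
    for key, n in votes.items():
        if n * 100 > THRESHOLD_PERCENT * total:
            return key
    return None


# Hypothetical (depositor, SID, CID) records for "Synonym A".
records = [
    ("Depositor 1", 1, "CID 1"),
    ("Depositor 2", 2, "CID 1"),
    ("Depositor 2", 3, "CID 2"),
    ("Depositor 2", 4, "CID 2"),
    ("Depositor 2", 5, "CID 2"),
    ("Depositor 3", 6, "CID 2"),
    ("Depositor 3", 7, "CID 2"),
    ("Depositor 3", 8, "CID 3"),
    ("Depositor 4", 9, "CID 1"),
    ("Depositor 4", 10, "CID 3"),
]

# Stage 1 (assumed): resolve disagreement within each depositor.
per_depositor = defaultdict(Counter)
for depositor, _sid, cid in records:
    per_depositor[depositor][cid] += 1

# Stage 2: one vote per depositor that reached an internal consensus.
depositor_votes = Counter()
for depositor, own_votes in per_depositor.items():
    choice = consensus(own_votes)
    if choice is not None:
        depositor_votes[choice] += 1

result = consensus(depositor_votes)
print(dict(depositor_votes))  # {'CID 1': 1, 'CID 2': 2}
print(result)                 # CID 2
```

Under these assumptions Depositor 4 abstains (its two SIDs split evenly), leaving three votes, and CID 2 wins with 2 of 3.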

Additional consideration: Different contexts of chemical sameness
• CID 6305 (L-Tryptophan)
• CID 1148 (Tryptophan)
• CID 9060 (D-Tryptophan)
• CID 12209747
• CID 58478580

Additional consideration: Different contexts of chemical sameness
In practice, synonym filtering uses CACTVS hash codes (instead of CID) to determine whether a consensus is reached or not.
Abbr. | CACTVS hash code used | Description
CID | CID hash code | Connectivity + isotopes + stereochemistry
STE | CID stereo hash code | Connectivity + stereochemistry
CON | CID connectivity hash code | Connectivity
PCID | Parent CID hash code | CID of the parent compound
PSTE | Parent CID stereo hash code | STE of the parent compound
PCON | Parent CID connectivity hash code | CON of the parent compound
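
To see why the context of sameness matters, the sketch below counts the same votes under two of the contexts listed above. Plain tuples stand in for CACTVS hash codes, which are not computed here. The vote list and the `STRUCTURES` annotations are hypothetical, and only the three tryptophan CIDs come from the slide.

```python
from collections import Counter

# Hypothetical structure annotations standing in for CACTVS hash inputs.
STRUCTURES = {
    "CID 6305": {"connectivity": "tryptophan", "isotopes": "", "stereo": "L"},
    "CID 9060": {"connectivity": "tryptophan", "isotopes": "", "stereo": "D"},
    "CID 1148": {"connectivity": "tryptophan", "isotopes": "", "stereo": ""},
}


def context_key(cid, context):
    """Build a stand-in for the hash code used in the given context."""
    s = STRUCTURES[cid]
    if context == "CID":
        return (s["connectivity"], s["isotopes"], s["stereo"])
    if context == "STE":
        return (s["connectivity"], s["stereo"])
    if context == "CON":
        return (s["connectivity"],)
    raise ValueError(f"unsupported context: {context}")


def consensus_in_context(voted_cids, context, threshold_percent=60):
    """Tally votes by context key and return the key above the threshold, if any."""
    votes = Counter(context_key(cid, context) for cid in voted_cids)
    total = sum(votes.values())
    for key, n in votes.items():
        if n * 100 > threshold_percent * total:
            return key
    return None


# Hypothetical votes for one synonym, spread over three stereo variants.
voted = ["CID 6305", "CID 6305", "CID 1148", "CID 9060"]
exact = consensus_in_context(voted, "CID")
by_connectivity = consensus_in_context(voted, "CON")
print(exact)            # None
print(by_connectivity)  # ('tryptophan',)
```

In this toy case the votes disagree on the exact structure but agree on connectivity, so a consensus exists only in the CON context.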

Filtered Depositor-provided synonyms with the largest number of CIDs
Synonym | # SIDs (before clustering) | # CIDs (before clustering) | # SIDs (after clustering) | # CIDs (after clustering)
124-07-2 (PARENT) | 27 | 25 | 27 | 25
VITAMIN B12 | 38 | 23 | 37 | 22
159351-69-6 | 50 | 23 | 48 | 21
64-18-6 (PARENT) | 25 | 23 | 22 | 20
1397-89-3 | 57 | 24 | 51 | 18
RIFAPENTINE | 59 | 18 | 59 | 18
7681-93-8 | 44 | 19 | 43 | 18
NYSTATIN | 61 | 28 | 34 | 17
50-14-6 | 61 | 17 | 61 | 17
104376-79-6 | 33 | 17 | 33 | 17
AMPHOTERICIN B | 67 | 21 | 63 | 17
68-19-9 | 37 | 21 | 33 | 17
ACONITINE | 47 | 19 | 45 | 17
QUININE SULFATE | 38 | 17 | 38 | 17

Categories highlighted in the list above:
• CAS numbers (e.g., 159351-69-6)
• CAS numbers for parent compounds (e.g., 124-07-2 (PARENT))
Limitations of Synonym Filtering
1. Synonym filtering focuses on consistency, not correctness.
• It resolves the discrepancies in name-structure associations within & between depositors.
• It does not mean that filtered synonyms are correct.
Example: Fentin acetate (CID 16682804)
Its filtered synonyms include:
• m-Nitrobenzaldehyde 3-thio-4-o-tolylsemicarbazone
• Benzaldehyde, m-nitro-, 3-thio-4-o-tolylsemicarbazone
• Data sources integrate synonym data from other sources that are regarded as authoritative (e.g., government resources).
• Erroneous data in one source propagate into other sources.
• This practice helps incorrect name-chemical associations get more votes than they should during the synonym filtering process.

2. More than 90% of depositor-provided synonyms occur only once.
• These synonyms are automatically assigned to the structures represented by their corresponding CIDs.

3. Different tautomers are merged into one standardized tautomeric structure.
 Their names are also merged with those of the standardized tautomer.
Example: Uracil (CID 1174)
• 2,4-Dihydroxypyrimidine (SID 377954591)
• 2-hydroxy-4(1h)-pyrimidinone (SID 341255477)

Summary
 PubChem contains a large amount of chemical information provided by 690+ data sources.
 Through the chemical structure standardization process, PubChem standardizes depositor-provided chemical structures and extracts unique structures.
 PubChem uses crowd-voting-based synonym filtering to clean up name-structure associations provided by depositors.

Acknowledgements
Evan Bolton
Jie Chen
Tiejun Cheng
Asta Gindulyte
Jia He
Siqian He
Qingliang Li
Benjamin Shoemaker
Paul Thiessen
Bo Yu
Leonid Zaslavsky
Jian Zhang
 The PubChem Team
 PubChem depositors, users, and collaborators
 Funded by the National Library of Medicine
