Chemical Structure Standardization and
Synonym Filtering in PubChem
Sunghwan Kim, Ph.D., M.Sc.
ACS National Meeting in San Diego, CA
(August 26, 2019)
PubChem
(https://pubchem.ncbi.nlm.nih.gov)
▪ Public chemical information resource
▪ Collects data from 690+ sources
▪ Disseminates data back to the public free of charge
▪ Contains the largest amount of publicly available chemical information
▪ Faces unique big-data challenges on a daily basis:
• Chemical structure standardization
• Name-structure association clean-up
Data Organization in PubChem
[Figure: 690+ data contributors submit data through substance deposition and assay deposition.
• Substance ID (SID): depositor-provided substance descriptions
• Compound ID (CID): unique chemical structures, extracted from substances through standardization
• Assay ID (AID): depositor-provided bioactivity test results, covering the activity of tested “substances” and the activity of “compounds” derived from associated “substances”]
▪ Individual data depositors provide PubChem with:
• Chemical structures
• Chemical names (synonyms)
▪ They need to be organized/cleaned up through:
• Structure standardization
• Synonym filtering
Common Issues with
Chemical Structure Representations in
PubChem
Drawing conventions
Drawing conventions are often ignored in
structures deposited by original data sources.
[Figure labels: Kekulé 1, Kekulé 2, aromatic]
Aromatic Compounds
Many Kekulé structures for aromatic compounds
Which one should be used as a standard?
[Figure labels: tautomerism, ionization, mesomerism]
Different Forms of the Same Molecule
Different tautomers, resonance forms, protonation states!
Choose the most stable one?
Most stable
in vacuum
Most stable
in water
The stability depends upon the context.
PubChem
Chemical Structure Standardization
[Figure: PubChem standardization workflow]
Detect components
• Isolate covalent units
• Neutralize (by ±H+ or e-)
• Reprocess
• Detect unique components
Normalize representation
• Tautomer invariance
• Aromaticity detection
• Stereochemistry
• Explicit hydrogen
Validate chemical contents
• Atoms defined/real
• Implicit hydrogen
• Functional group
• Atom valence
Calculate
• Coordinates
• Properties
• Descriptors
J. Cheminform. (2018) 10:36
Standardization Statistics
• ~90% of the substances are subject to standardization.
• Mostly organic compounds.
• Standardization success rate: 99.64%
• Modification rate: 44.43%
Most stable
in vacuum
Most stable
in water
It is not necessarily what one may expect
Standardized Structures
Standardized
by PubChem
▪ In most cases, tautomeric forms of a molecule are standardized into a single form.
▪ There are a few exceptions.
[Figure: CID 18630 and CID 31261, related by tautomerization]
Standardization and Structure Identity Search
▪ You can search PubChem using a structure as a query.
▪ The input structure may be provided:
• as a line notation (e.g., SMILES, InChI)
• by drawing it in the PubChem Sketcher.
▪ The input structure for an identity search is standardized before the search is performed.
▪ Therefore, hits from an identity search may have structures that differ from the original input structure.
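As a rough illustration of submitting a line notation for an identity search, the sketch below builds a request URL for PubChem's PUG REST interface. It assumes the `fastidentity` operation and its `identity_type` option (with `same_stereo_isotope` as the default) work as described in the PUG REST documentation. The helper name `identity_search_url` is made up for this example, and no request is actually sent.

```python
from urllib.parse import quote, urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles: str, identity_type: str = "same_stereo_isotope") -> str:
    """Build a PUG REST URL for an identity search with a SMILES query.

    The function name is hypothetical; the URL layout is assumed from the
    PUG REST documentation.
    """
    # Percent-encode the SMILES so characters such as '(' and ')' are URL-safe.
    encoded = quote(smiles, safe="")
    query = urlencode({"identity_type": identity_type})
    return f"{PUG_REST}/compound/fastidentity/smiles/{encoded}/cids/JSON?{query}"


# 2,4-Dihydroxypyrimidine, one tautomeric form of uracil, as the query structure.
url = identity_search_url("Oc1ccnc(O)n1")
print(url)
```

Fetching this URL should return a JSON list of matching CIDs. Because the query is standardized before matching, the hit may be drawn differently from the input.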
[Figure: identity search linking uracil (CID 1174) with 2,4-Dihydroxypyrimidine (SID 377954591) and 2-hydroxy-4(1H)-pyrimidinone (SID 341255477)]
Depositor-supplied synonyms &
MeSH Entry Terms
Two kinds of chemical names in PubChem
MeSH Entry Terms
▪ A set of “terms” related to ibuprofen.
▪ Used to index PubMed articles to help find articles about ibuprofen.
Depositor-Supplied Synonyms
▪ Synonyms provided for “substance” records by depositors.
▪ “Filtered” synonyms are provided on the “Compound” Summary page.
Examples: raw (unfiltered) depositor-provided synonyms
associated with the largest numbers of CIDs
Unfiltered Depositor-provided synonyms (page 1/3)
Synonym | # SIDs | # CIDs
N/A | 6,869 | 6,368
SPIRO COMPOUNDS WITH POLYCYCLIC COMPONENTS ARE NOT SUPPORTED IN CURRENT VERSION | 4,903 | 4,902
NULL | 4,610 | 4,599
ASSEMBLIES OF CYCLIC SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION | 2,554 | 2,554
NOT AVAILABLE | 1,867 | 1,816
LECITHIN | 1,157 | 1,142
DIACYLGLYCEROL | 847 | 842
DIGLYCERIDE | 841 | 841
MULTIPLICATIVE NOMENCLATURE IS NOT SUPPORTED IN CURRENT VERSION! | 797 | 794
VITASMLAB | 461 | 461
MIXTURE NAME | 419 | 413
CLA | 770 | 394
CHLOROPHYLL A | 749 | 393
NA | 7,081 | 371
Notes on the page 1/3 entries:
• Various forms of “Not Available” (e.g., N/A, NULL, NOT AVAILABLE, NA)
• NA: great reduction in the structure count after structure standardization
→ SIDs are standardized to Na (sodium)
• Error messages from name generation software (e.g., the entries ending in “NOT SUPPORTED IN CURRENT VERSION”)
• Names of chemical classes (e.g., LECITHIN, DIACYLGLYCEROL, DIGLYCERIDE)
Unfiltered Depositor-provided synonyms (page 2/3)
Synonym | # SIDs | # CIDs
(1-(5-CARBOXYPENTYL)-3,3-DIMETHYL-3H-INDOL-1-IUM-2-YL)METHANIDE HYDROBROMIDE | 405 | 345
ETHANONE,1- - | 328 | 328
CANNOT MAKE CHOICE: LIGANDS ARE COMPARED UP TO 10 SPHERES | 304 | 304
COMPLEX BRIDGED FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 302 | 302
TRIACYLGLYCEROL | 286 | 285
TRIGLYCERIDE | 286 | 285
QUINOLONE DER. | 280 | 279
UNABLE TO GENERATE VALUE | 274 | 264
UNL | 656 | 255
UNKNOWN LIGAND | 615 | 235
HEPT DERIV. | 213 | 211
MULTIPARENT NAMES FOR FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 208 | 208
ACHIRAL CENTER(S) | 187 | 187
Notes on the page 2/3 entries:
• “Derivative” of a chemical (e.g., QUINOLONE DER., HEPT DERIV.)
Unfiltered Depositor-provided synonyms (page 3/3)
Synonym | # SIDs | # CIDs
C9H11NO2 | 179 | 174
HEM | 4,645 | 165
BCR | 290 | 160
C10H13NO2 | 161 | 154
BETA-CAROTENE | 298 | 147
C8H10N2O2 | 149 | 144
C10H10N2O2 | 149 | 143
-ACETICACID | 141 | 141
C9H8N2O2 | 143 | 141
PROTOPORPHYRIN IX CONTAINING FE | 3,690 | 140
C8H9NO2 | 144 | 139
NAG | 9,599 | 130
METHANOL | 247 | 128
C8H9NO3 | 129 | 127
C10H9NO2 | 133 | 126
PYRIDINONE DERIV. | 130 | 126
N. A. | 128 | 125
Notes on the page 3/3 entries:
• Molecular formula (e.g., C9H11NO2, C10H13NO2)
• Abbreviation for chemical names (e.g., HEM, BCR, NAG)
• Description (e.g., PROTOPORPHYRIN IX CONTAINING FE)
• “Not available” (N. A.)
Unfiltered Depositor-provided synonyms
▪ Depositor-provided synonyms include:
• Real chemical names
• Abbreviations for chemical names
• “Derivatives” of some chemicals
• Names of chemical classes
• Molecular formulas
• N/A, NULL, Not Available, NA, N.A., etc.
• Error messages or comments
▪ Manual clean-up is not feasible.
▪ PubChem uses crowd-voting-based synonym filtering.
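To show how varied these categories are, here is a small rule-based classifier applied to entries from the tables above. It is only an illustrative sketch. The patterns and the `classify` function are assumptions made for this example and are not PubChem's method, which relies on crowd voting instead.

```python
import re

# Illustrative patterns only; none of these come from PubChem.
PLACEHOLDERS = {"N/A", "NA", "N.A.", "N. A.", "NULL", "NOT AVAILABLE"}
ERROR_HINTS = (
    "NOT SUPPORTED IN CURRENT VERSION",
    "UNABLE TO GENERATE",
    "CANNOT MAKE CHOICE",
)
DERIVATIVE = re.compile(r"\bDER(?:IV)?\.")
# A few common element symbols, each optionally followed by a count,
# with at least one digit somewhere in the string.
FORMULA = re.compile(r"(?=.*\d)(?:(?:Cl|Br|[CHNOSPFI])\d*)+")


def classify(synonym: str) -> str:
    """Assign a raw synonym to one of a few rough categories."""
    text = synonym.strip()
    upper = text.upper()
    if upper in PLACEHOLDERS:
        return "placeholder"
    if any(hint in upper for hint in ERROR_HINTS):
        return "error message"
    if DERIVATIVE.search(upper):
        return "derivative"
    if FORMULA.fullmatch(text):
        return "molecular formula"
    return "other"


for name in ["N/A", "C9H11NO2", "QUINOLONE DER.", "UNABLE TO GENERATE VALUE", "METHANOL", "NA"]:
    print(f"{name!r}: {classify(name)}")
```

Note that "NA" is flagged as a placeholder here even though it can also denote sodium. Hand-written rules of this kind are fragile, which is one reason a consensus-based approach is attractive.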
PubChem Synonym Filtering
▪ Crowd-voting approach
▪ Check for a consensus on the name-structure association between depositors.
▪ Consensus threshold: >60% of the total votes
▪ When a consensus is reached, the synonym is added to the “filtered” synonym list of the corresponding compound (standardized structure).
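The consensus rule above can be sketched as a small tally. This is a simplified illustration, not PubChem's implementation. The function name `filtered_synonyms`, the vote format, and the example data are all hypothetical. How votes are counted (per substance or per depositor) is left to the caller.

```python
from collections import Counter, defaultdict


def filtered_synonyms(votes, threshold=0.60):
    """Return {cid: [synonyms]} for synonyms whose top CID wins a consensus.

    votes: iterable of (synonym, cid) pairs, one pair per vote.
    A consensus requires strictly more than `threshold` of the total votes.
    """
    tallies = defaultdict(Counter)
    for synonym, cid in votes:
        tallies[synonym][cid] += 1

    result = defaultdict(list)
    for synonym, counts in tallies.items():
        cid, top = counts.most_common(1)[0]
        if top / sum(counts.values()) > threshold:
            result[cid].append(synonym)
    return dict(result)


# Hypothetical votes: A is 3-to-1 for CID 1, B is split evenly, C occurs once.
votes = [
    ("Synonym A", 1), ("Synonym A", 1), ("Synonym A", 1), ("Synonym A", 2),
    ("Synonym B", 1), ("Synonym B", 2),
    ("Synonym C", 3),
]
print(filtered_synonyms(votes))
```

In this example Synonym A (75% for CID 1) and Synonym C (a single, uncontested vote) pass the filter, while Synonym B (50%/50%) is dropped.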
Synonyms that occur only “once”
[Figure: Depositor 1 assigns Synonym A to SID 1, which maps to CID 1]
▪ No disagreement in the name-structure association
▪ Synonym A is taken to mean CID 1 (although this may not be correct)
Synonyms occurring multiple times
[Figure: four depositors assign Synonym A to ten substances: Depositor 1 (SID 1), Depositor 2 (SIDs 2-5), Depositor 3 (SIDs 6-8), and Depositor 4 (SIDs 9-10). The substances map to three compounds: CID 1, CID 2, and CID 3.]
Which one is the best choice?
Synonym filtering using crowd voting
▪ Two potential approaches
• Multiple-votes-per-depositor
• Single-vote-per-depositor
Multiple-Votes-per-Depositor Strategy
[Figure: the same ten substances (SID 1-10) from four depositors, with each substance counted as one vote]
# votes: CID 1: 3 (30%), CID 2: 5 (50%), CID 3: 2 (20%)
Consensus Threshold = 60%
No compound exceeds the threshold, so no consensus is reached.
Single-Vote-per-Depositor Strategy
[Figure: the same ten substances (SID 1-10) from four depositors, with each depositor casting a single vote]
# votes: CID 1: 1 (33%), CID 2: 2 (67%), CID 3: 0 (0%)
Consensus Threshold = 60%
Consensus has been reached!
Synonym A = CID 2
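The two strategies can be compared with a short sketch. The SID-to-CID assignment in `DEPOSITS` is hypothetical, chosen only so that the tallies match those on the slides. The rule for a depositor's single vote (the depositor votes only if one CID exceeds the same 60% threshold among its own substances) is likewise an assumption of this example.

```python
from collections import Counter

THRESHOLD = 0.60

# Hypothetical data: depositor -> {SID: CID}, chosen to reproduce the slide tallies.
DEPOSITS = {
    "Depositor 1": {1: 1},
    "Depositor 2": {2: 2, 3: 2, 4: 2, 5: 1},
    "Depositor 3": {6: 2, 7: 2, 8: 3},
    "Depositor 4": {9: 1, 10: 3},
}


def consensus(counts):
    """Return the CID holding more than THRESHOLD of the votes, else None."""
    total = sum(counts.values())
    if total == 0:
        return None
    cid, top = counts.most_common(1)[0]
    return cid if top / total > THRESHOLD else None


def multiple_votes(deposits):
    """Every substance (SID) counts as one vote for its CID."""
    return Counter(cid for sids in deposits.values() for cid in sids.values())


def single_vote(deposits):
    """Each depositor casts at most one vote (assumed within-depositor consensus)."""
    votes = Counter()
    for sids in deposits.values():
        winner = consensus(Counter(sids.values()))
        if winner is not None:
            votes[winner] += 1
    return votes


multi = multiple_votes(DEPOSITS)
single = single_vote(DEPOSITS)
print("multiple votes:", dict(multi), "consensus:", consensus(multi))
print("single vote:   ", dict(single), "consensus:", consensus(single))
```

Under these assumptions the multiple-votes tally is 3/5/2 with no consensus, while the single-vote tally is 1/2/0 and CID 2 wins with 67%. Counting one vote per depositor keeps a depositor with many substances from dominating the outcome.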
Additional consideration:
Different contexts of chemical sameness
[Figure: CID 6305 (L-Tryptophan), CID 1148 (Tryptophan), CID 9060 (D-Tryptophan), CID 12209747, CID 58478580]
Abbr. | CACTVS hash code used | Description
CID | CID hash code | Connectivity + isotopes + stereochemistry
STE | CID stereo hash code | Connectivity + stereochemistry
CON | CID connectivity hash code | Connectivity
PCID | Parent CID hash code | CID of the parent compound
PSTE | Parent CID stereo hash code | STE of the parent compound
PCON | Parent CID connectivity hash code | CON of the parent compound
In practice, synonym filtering uses CACTVS hash codes (instead of CID) to determine whether a consensus is reached.
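The sketch below illustrates why the level of sameness matters. It is a minimal example under stated assumptions: the strings in `HASHES` are placeholders standing in for real CACTVS hash codes, the votes are invented, and `consensus_at` is a hypothetical helper. Only the CID and CON levels from the table are modeled.

```python
from collections import Counter

# Placeholder hash values (not real CACTVS codes). The three tryptophan
# records differ in stereochemistry but share the same connectivity.
HASHES = {
    6305: {"CID": "hash-L-Trp", "CON": "hash-Trp-skeleton"},    # L-Tryptophan
    1148: {"CID": "hash-Trp", "CON": "hash-Trp-skeleton"},      # Tryptophan
    9060: {"CID": "hash-D-Trp", "CON": "hash-Trp-skeleton"},    # D-Tryptophan
}


def consensus_at(level, voted_cids, threshold=0.60):
    """Tally votes by the hash code at `level`; return the winning hash or None."""
    counts = Counter(HASHES[cid][level] for cid in voted_cids)
    key, top = counts.most_common(1)[0]
    return key if top / len(voted_cids) > threshold else None


# Invented votes for one synonym: two for the L-form, one each for the others.
voted = [6305, 6305, 1148, 9060]
print("CID level:", consensus_at("CID", voted))
print("CON level:", consensus_at("CON", voted))
```

With these invented votes there is no consensus at the CID level (the best candidate has 50%), but all four votes agree at the connectivity level.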
Filtered Depositor-provided synonyms with
the largest number of CIDs
Synonym | # SIDs (before clustering) | # CIDs (before clustering) | # SIDs (after clustering) | # CIDs (after clustering)
124-07-2 (PARENT) | 27 | 25 | 27 | 25
VITAMIN B12 | 38 | 23 | 37 | 22
159351-69-6 | 50 | 23 | 48 | 21
64-18-6 (PARENT) | 25 | 23 | 22 | 20
1397-89-3 | 57 | 24 | 51 | 18
RIFAPENTINE | 59 | 18 | 59 | 18
7681-93-8 | 44 | 19 | 43 | 18
NYSTATIN | 61 | 28 | 34 | 17
50-14-6 | 61 | 17 | 61 | 17
104376-79-6 | 33 | 17 | 33 | 17
AMPHOTERICIN B | 67 | 21 | 63 | 17
68-19-9 | 37 | 21 | 33 | 17
ACONITINE | 47 | 19 | 45 | 17
QUININE SULFATE | 38 | 17 | 38 | 17
Notes on the filtered synonyms:
• CAS numbers (e.g., 159351-69-6, 1397-89-3, 7681-93-8)
• CAS numbers for parent compounds (124-07-2 (PARENT), 64-18-6 (PARENT))
Limitations of Synonym Filtering
1. Synonym filtering focuses on consistency, not correctness.
• It resolves the discrepancies in name-structure associations within and between depositors.
• It does not mean that filtered synonyms are correct.
Example: Fentin acetate (CID 16682804). Its filtered synonyms include:
• m-Nitrobenzaldehyde 3-thio-4-o-tolylsemicarbazone
• Benzaldehyde, m-nitro-, 3-thio-4-o-tolylsemicarbazone
• Data sources integrate synonym data from other sources that are regarded as authoritative (e.g., government resources).
• Erroneous data in one source propagate into other sources.
• This practice helps incorrect name-chemical associations get more votes than they should during the synonym filtering process.
2. More than 90% of depositor-provided synonyms occur only once.
• These are automatically assigned to the structures represented by their corresponding CIDs.
3. Different tautomers are merged into one standardized tautomeric structure.
➢ Their names are also merged with those of the standardized tautomer.
[Figure: uracil (CID 1174), 2,4-Dihydroxypyrimidine (SID 377954591), 2-hydroxy-4(1H)-pyrimidinone (SID 341255477)]
Summary
▪ PubChem contains a large amount of chemical information provided by 690+ data sources.
▪ Through the chemical structure standardization process, PubChem standardizes depositor-provided chemical structures and extracts unique structures.
▪ PubChem uses crowd-voting-based synonym filtering to clean up name-structure associations provided by depositors.
Acknowledgements
Evan Bolton
Jie Chen
Tiejun Cheng
Asta Gindulyte
Jia He
Siqian He
Qingliang Li
Benjamin Shoemaker
Paul Thiessen
Bo Yu
Leonid Zaslavsky
Jian Zhang
▪ The PubChem Team
▪ PubChem depositors, users, and collaborators
▪ Funded by the National Library of Medicine

 
PubChem and its application for cheminformatics education
PubChem and its application for cheminformatics educationPubChem and its application for cheminformatics education
PubChem and its application for cheminformatics education
 
Cheminformatics Online Chemistry Course (OLCC): A Community Effort to Introdu...
Cheminformatics Online Chemistry Course (OLCC): A Community Effort to Introdu...Cheminformatics Online Chemistry Course (OLCC): A Community Effort to Introdu...
Cheminformatics Online Chemistry Course (OLCC): A Community Effort to Introdu...
 
Cheminformatics Education with PubChem
Cheminformatics Education with PubChemCheminformatics Education with PubChem
Cheminformatics Education with PubChem
 
PubChem as an Emerging Toxicological Information Resource
PubChem as an Emerging Toxicological Information ResourcePubChem as an Emerging Toxicological Information Resource
PubChem as an Emerging Toxicological Information Resource
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistry
 
PubChem as a resource for chemical information education
PubChem as a resource for chemical information educationPubChem as a resource for chemical information education
PubChem as a resource for chemical information education
 
Toxicological information in PubChem
Toxicological information in PubChemToxicological information in PubChem
Toxicological information in PubChem
 
Exploiting PubChem for Drug Discovery
Exploiting PubChem for Drug DiscoveryExploiting PubChem for Drug Discovery
Exploiting PubChem for Drug Discovery
 
PubChem and Its Applications for Drug Discovery
PubChem and Its Applications for Drug DiscoveryPubChem and Its Applications for Drug Discovery
PubChem and Its Applications for Drug Discovery
 
A Brief Overview of Cheminformatics
A Brief Overview of CheminformaticsA Brief Overview of Cheminformatics
A Brief Overview of Cheminformatics
 
Searching for chemical information using PubChem
Searching for chemical information using PubChemSearching for chemical information using PubChem
Searching for chemical information using PubChem
 
PubChem as a resource for chemical information training
PubChem as a resource for chemical information trainingPubChem as a resource for chemical information training
PubChem as a resource for chemical information training
 
Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...
 
Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...
 
Searching for patent information in PubChem
Searching for patent information in PubChem Searching for patent information in PubChem
Searching for patent information in PubChem
 
Exploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural productsExploiting PubChem for drug discovery based on natural products
Exploiting PubChem for drug discovery based on natural products
 

Recently uploaded

Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
alishadewangan1
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilityISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
SciAstra
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 

Recently uploaded (20)

Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilityISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 

Chemical Structure Standardization and Synonym Filtering in PubChem

  • 1. Chemical Structure Standardization and Synonym Filtering in PubChem Sunghwan Kim, Ph.D., M.Sc. ACS National Meeting in San Diego, CA (August 26, 2019)
  • 3. PubChem
      • Public chemical information resource.
      • Collects data from 690+ sources.
      • Disseminates data back to the public free of charge.
      • Contains the largest amount of publicly available chemical information.
      • Faces unique challenges in dealing with big data issues on a daily basis, including chemical structure standardization and name-structure association cleanup.
  • 4. Data Organization in PubChem (diagram). 690+ data contributors make substance depositions and assay depositions. Depositor-provided substance descriptions are stored under Substance IDs (SIDs), and depositor-provided bioactivity test results under Assay IDs (AIDs). Unique chemical structures are extracted from the substances through standardization and stored under Compound IDs (CIDs). Assays record the activity of tested "substances"; the activity of "compounds" is derived from the associated "substances".
  • 6. Data Organization in PubChem (continued)
      • Individual data depositors provide PubChem with chemical structures and chemical names (synonyms).
      • These need to be organized and cleaned up through structure standardization and synonym filtering.
  • 7. Common Issues with Chemical Structure Representations in PubChem
  • 8. Drawing conventions. Drawing conventions are often ignored in structures deposited by the original data sources.
  • 9. Aromatic Compounds (figure labels: Kekulé 1, Kekulé 2, aromatic). An aromatic compound can be drawn as many Kekulé structures. Which one should be used as the standard?
  • 10. Different Forms of the Same Molecule (figure labels: tautomerism, ionization, mesomerism). The same molecule can appear as different tautomers, resonance forms, and protonation states. Choose the most stable one?
  • 11. Different Forms of the Same Molecule (figure labels: most stable in vacuum, most stable in water). The stability depends upon the context.
  • 13. PubChem Standardization
      • Detect components: isolate covalent units; neutralize (by H+ or e-); reprocess; detect unique components.
      • Normalize representation: tautomer invariance; aromaticity detection; stereochemistry; explicit hydrogen.
      • Validate chemical contents: atoms defined/real; implicit hydrogen; functional group; atom valence.
      • Calculate: coordinates; properties; descriptors.
  • 15. Standardization Statistics (source: J. Cheminform. (2018) 10:36)
      • ~90% of the substances are subject to standardization.
      • These are mostly organic compounds.
      • Standardization success rate: 99.64%.
      • Modification rate: 44.43%.
  • 16. Standardized Structures (figure labels: most stable in vacuum, most stable in water, standardized by PubChem). The standardized structure is not necessarily what one may expect.
  • 17. Standardized Structures (figure: tautomerization between CID 18630 and CID 31261)
      • In most cases, the tautomeric forms of a molecule are standardized into a single form.
      • There are a few exceptions.
  • 18. Standardization and Structure Identity Search
      • You can search PubChem using a structure as a query.
      • The input structure may be provided as a line notation (e.g., SMILES, InChI) or drawn in the PubChem Sketcher.
      • The input structure for an identity search is standardized before the search is performed.
      • Therefore, hits from an identity search may have structures that differ from the original input structure.
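The lookup described on slide 18 can be tried programmatically. The sketch below is a minimal illustration and not part of the talk: it assumes that the PUG REST `compound/smiles/.../cids/TXT` lookup standardizes its input in the way the slide describes, and the helper names `identity_search_url` and `fetch_cids` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles: str) -> str:
    """Build a PUG REST URL that looks up CIDs for a SMILES string."""
    # Percent-encode characters such as '(' , ')' and '#'.
    # safe='' also encodes '/', which quote() leaves alone by default.
    return f"{PUG_REST}/compound/smiles/{quote(smiles, safe='')}/cids/TXT"


def fetch_cids(smiles: str, timeout: float = 30.0) -> list[int]:
    """Run the lookup (needs network access) and return the CIDs found."""
    with urlopen(identity_search_url(smiles), timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    return [int(token) for token in text.split()]


# 2,4-Dihydroxypyrimidine written as SMILES (an illustrative query).
url = identity_search_url("Oc1ccnc(O)n1")
print(url)
```

Because the query is standardized before matching, the record returned by `fetch_cids` may depict a different tautomer from the one submitted. For this input one would expect the uracil record (CID 1174, per slide 60), although that is not verified here.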
  • 21. Two kinds of chemical names in PubChem
  • 22. MeSH Entry Terms
      • A set of "terms" related to ibuprofen.
      • Used to index PubMed articles to help find articles about ibuprofen.
  • 23. Depositor-Supplied Synonyms
      • Synonyms provided for "substance" records by depositors.
      • "Filtered" synonyms are provided on the "Compound" Summary.
  • 24. Examples: raw (unfiltered) depositor-provided synonyms associated with the largest number of CIDs
  • 25. Unfiltered depositor-provided synonyms (page 1/3)
      Synonym | # SIDs | # CIDs
      N/A | 6,869 | 6,368
      SPIRO COMPOUNDS WITH POLYCYCLIC COMPONENTS ARE NOT SUPPORTED IN CURRENT VERSION | 4,903 | 4,902
      NULL | 4,610 | 4,599
      ASSEMBLIES OF CYCLIC SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION | 2,554 | 2,554
      NOT AVAILABLE | 1,867 | 1,816
      LECITHIN | 1,157 | 1,142
      DIACYLGLYCEROL | 847 | 842
      DIGLYCERIDE | 841 | 841
      MULTIPLICATIVE NOMENCLATURE IS NOT SUPPORTED IN CURRENT VERSION! | 797 | 794
      VITASMLAB | 461 | 461
      MIXTURE NAME | 419 | 413
      CLA | 770 | 394
      CHLOROPHYLL A | 749 | 393
      NA | 7,081 | 371
      Annotations added over slides 26-30:
      • Various forms of "Not Available".
      • Great reduction in the structure count after structure standardization (NA: 7,081 SIDs, 371 CIDs); SIDs are standardized to Na (sodium).
      • Error messages from name generation software.
      • Names of chemical classes.
  • 31. Unfiltered depositor-provided synonyms (page 2/3)
      Synonym | # SIDs | # CIDs
      (1-(5-CARBOXYPENTYL)-3,3-DIMETHYL-3H-INDOL-1-IUM-2-YL)METHANIDE HYDROBROMIDE | 405 | 345
      ETHANONE,1- - | 328 | 328
      CANNOT MAKE CHOICE: LIGANDS ARE COMPARED UP TO 10 SPHERES | 304 | 304
      COMPLEX BRIDGED FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 302 | 302
      TRIACYLGLYCEROL | 286 | 285
      TRIGLYCERIDE | 286 | 285
      QUINOLONE DER. | 280 | 279
      UNABLE TO GENERATE VALUE | 274 | 264
      UNL | 656 | 255
      UNKNOWN LIGAND | 615 | 235
      HEPT DERIV. | 213 | 211
      MULTIPARENT NAMES FOR FUSED SYSTEMS ARE NOT SUPPORTED IN CURRENT VERSION! | 208 | 208
      ACHIRAL CENTER(S) | 187 | 187
      Annotation added on slide 32:
      • "Derivative" of a chemical.
  • 33. Unfiltered depositor-provided synonyms (page 3/3)
      Synonym | # SIDs | # CIDs
      C9H11NO2 | 179 | 174
      HEM | 4,645 | 165
      BCR | 290 | 160
      C10H13NO2 | 161 | 154
      BETA-CAROTENE | 298 | 147
      C8H10N2O2 | 149 | 144
      C10H10N2O2 | 149 | 143
      -ACETICACID | 141 | 141
      C9H8N2O2 | 143 | 141
      PROTOPORPHYRIN IX CONTAINING FE | 3,690 | 140
      C8H9NO2 | 144 | 139
      NAG | 9,599 | 130
      METHANOL | 247 | 128
      C8H9NO3 | 129 | 127
      C10H9NO2 | 133 | 126
      PYRIDINONE DERIV. | 130 | 126
      N. A. | 128 | 125
      Annotations added over slides 34-37:
      • Molecular formula.
      • Abbreviation for chemical names.
      • Description.
      • "Not available".
  • 38. Unfiltered depositor-provided synonyms
      • Depositor-provided synonyms include:
          • Real chemical names
          • Abbreviations for chemical names
          • "Derivatives" of some chemicals
          • Names of chemical classes
          • Molecular formulas
          • N/A, NULL, Not Available, NA, N.A., etc.
          • Error messages or comments
      • Manual cleanup is not feasible.
      • PubChem uses crowd-voting-based synonym filtering.
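To make the categories on slide 38 concrete, the sketch below tags a few of the synonyms from the preceding tables with simple pattern rules. It is only an illustration and is not how PubChem cleans synonyms (the talk describes crowd voting instead); the rule sets, the category labels, and the function name `classify` are all assumptions made for this example.

```python
import re

# Hypothetical rule sets, assembled from the examples in the tables above.
PLACEHOLDERS = {"N/A", "NA", "N.A.", "N. A.", "NULL", "NOT AVAILABLE"}
ERROR_HINTS = (
    "NOT SUPPORTED IN CURRENT VERSION",
    "UNABLE TO GENERATE",
    "CANNOT MAKE CHOICE",
)
# Element-like tokens (capital letter, optional lowercase letter, optional count).
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
# Names ending in "DER." or "DERIV."
DERIVATIVE = re.compile(r"\bDER(?:IV)?\.$")


def classify(synonym: str) -> str:
    """Assign a rough category to a depositor-provided synonym."""
    s = synonym.strip()
    upper = s.upper()
    if upper in PLACEHOLDERS:
        return "placeholder"
    if any(hint in upper for hint in ERROR_HINTS):
        return "error message"
    if DERIVATIVE.search(upper):
        return "derivative"
    # Require at least one digit so that plain words are not read as formulas.
    if FORMULA.fullmatch(s) and any(ch.isdigit() for ch in s):
        return "molecular formula"
    return "unclassified"


for name in ["N/A", "C9H11NO2", "QUINOLONE DER.", "LECITHIN", "HEM"]:
    print(f"{name!r}: {classify(name)}")
```

Class names such as LECITHIN and abbreviations such as HEM fall through to "unclassified", because nothing in the string itself distinguishes them from real chemical names. That gap is one reason pattern rules alone cannot replace a consensus-based approach.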
  • 40. PubChem Synonym Filtering
      • Uses a crowd-voting approach.
      • Checks for a consensus among depositors on the name-structure association.
      • Consensus threshold: >60% of the total votes.
      • When a consensus is reached, the synonym is added to the "filtered" synonym list of the corresponding compound (standardized structure).
  • 41. Synonyms that occur only once (diagram: Depositor 1 supplies Synonym A for SID 1, which maps to CID 1)
      • There is no disagreement in the name-structure association.
      • Synonym A is therefore taken to mean CID 1, although this may not be correct.
  • 42. Synonyms occurring multiple times (diagram: four depositors supply Synonym A for ten substances, SID 1 to SID 10, which map to three compounds, CID 1, CID 2, and CID 3). Which one is the best choice?
  • 43. Synonym filtering using crowd voting
      • Two potential approaches:
          • Multiple-votes-per-depositor
          • Single-vote-per-depositor
  • 44. Multiple-Votes-per-Depositor Strategy (same diagram as slide 42; consensus threshold = 60%)
      Compound | # votes
      CID 1 | 3 (30%)
      CID 2 | 5 (50%)
      CID 3 | 2 (20%)
  • 45. Single-Vote-per-Depositor Strategy (same diagram, built up over slides 45-49; consensus threshold = 60%)
      Compound | # votes
      CID 1 | 1 (33%)
      CID 2 | 2 (67%)
      CID 3 | 0 (0%)
      Consensus is reached: Synonym A = CID 2.
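The two voting strategies can be compared with a short sketch. The slides give only the vote totals, so the SID-to-CID assignments below are hypothetical values chosen to reproduce those totals. The rule that a depositor casts a vote only when more than 60% of its own SIDs agree is likewise an assumption made for this example, as are the function names.

```python
from collections import Counter, defaultdict

THRESHOLD = 0.60  # consensus requires MORE than 60% of the votes

# (depositor, SID, CID) triples for "Synonym A".
# The SID-to-CID links are hypothetical; only the totals come from the slides.
RECORDS = [
    ("Depositor 1", 1, 1),
    ("Depositor 2", 2, 2), ("Depositor 2", 3, 2),
    ("Depositor 2", 4, 2), ("Depositor 2", 5, 1),
    ("Depositor 3", 6, 2), ("Depositor 3", 7, 2), ("Depositor 3", 8, 3),
    ("Depositor 4", 9, 1), ("Depositor 4", 10, 3),
]


def consensus(votes, threshold=THRESHOLD):
    """Return the CID holding more than `threshold` of the votes, else None."""
    total = sum(votes.values())
    if total == 0:
        return None
    cid, count = votes.most_common(1)[0]
    return cid if count / total > threshold else None


def multiple_votes(records):
    """Every SID counts as one vote for its CID."""
    return Counter(cid for _, _, cid in records)


def single_vote(records):
    """Each depositor casts at most one vote, for its own internal consensus."""
    per_depositor = defaultdict(Counter)
    for depositor, _, cid in records:
        per_depositor[depositor][cid] += 1
    votes = Counter()
    for own_votes in per_depositor.values():
        choice = consensus(own_votes)
        if choice is not None:  # internally split depositors abstain
            votes[choice] += 1
    return votes


multi = multiple_votes(RECORDS)
single = single_vote(RECORDS)
print("multiple votes:", dict(multi), "->", consensus(multi))
print("single vote:   ", dict(single), "->", consensus(single))
```

With these assignments the multiple-vote tally is 3/5/2 and no compound passes the threshold, while the single-vote tally is 1/2/0 and CID 2 wins with 67%. Limiting each depositor to one vote keeps a depositor with many SIDs from dominating the outcome.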
  • 50. Additional consideration: different contexts of chemical sameness (figure: CID 6305 (L-Tryptophan), CID 1148 (Tryptophan), CID 9060 (D-Tryptophan), CID 12209747, CID 58478580)
  • 51. Additional consideration: different contexts of chemical sameness
      Abbr. | CACTVS hash code used | Description
      CID | CID hash code | Connectivity + isotopes + stereochemistry
      STE | CID stereo hash code | Connectivity + stereochemistry
      CON | CID connectivity hash code | Connectivity
      PCID | Parent CID hash code | CID of the parent compound
      PSTE | Parent CID stereo hash code | STE of the parent compound
      PCON | Parent CID connectivity hash code | CON of the parent compound
      In practice, synonym filtering uses CACTVS hash codes (instead of CIDs) to determine whether a consensus is reached.
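The idea of checking consensus under different definitions of sameness can be sketched as follows. This is an illustration only: plain tuples stand in for the CACTVS hash codes, the parent-compound levels (PCID, PSTE, PCON) are left out, and trying the levels from strictest to loosest is an assumption of this example rather than something the slides state. The vote lists are made up.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Structure(NamedTuple):
    """Stand-in for a standardized structure (not a real hash code)."""
    connectivity: str
    stereo: str
    isotopes: str


# Each level keeps only the features named in the table above.
LEVELS = {
    "CID": lambda s: (s.connectivity, s.isotopes, s.stereo),
    "STE": lambda s: (s.connectivity, s.stereo),
    "CON": lambda s: (s.connectivity,),
}


def consensus_level(votes: list, threshold: float = 0.60) -> Optional[str]:
    """Return the strictest level at which one key holds > threshold of votes."""
    if not votes:
        return None
    for level, key in LEVELS.items():  # dicts preserve insertion order
        counts = Counter(key(s) for s in votes)
        top_count = counts.most_common(1)[0][1]
        if top_count / len(votes) > threshold:
            return level
    return None


# Hypothetical votes for the synonym "TRYPTOPHAN".
L_TRP = Structure("tryptophan", "L", "natural")
D_TRP = Structure("tryptophan", "D", "natural")
TRP = Structure("tryptophan", "unspecified", "natural")
L_TRP_LABELED = Structure("tryptophan", "L", "labeled")

print(consensus_level([L_TRP, L_TRP, L_TRP, D_TRP]))      # agree on full identity
print(consensus_level([L_TRP, L_TRP_LABELED, D_TRP]))     # agree once isotopes are ignored
print(consensus_level([L_TRP, L_TRP, TRP, TRP, D_TRP]))   # agree on connectivity only
```

A name such as "tryptophan" can therefore reach consensus on connectivity even when depositors disagree about stereochemistry.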
  • 52. Filtered depositor-provided synonyms with the largest number of CIDs
      Synonym | # SIDs (before clustering) | # CIDs (before clustering) | # SIDs (after clustering) | # CIDs (after clustering)
      124-07-2 (PARENT) | 27 | 25 | 27 | 25
      VITAMIN B12 | 38 | 23 | 37 | 22
      159351-69-6 | 50 | 23 | 48 | 21
      64-18-6 (PARENT) | 25 | 23 | 22 | 20
      1397-89-3 | 57 | 24 | 51 | 18
      RIFAPENTINE | 59 | 18 | 59 | 18
      7681-93-8 | 44 | 19 | 43 | 18
      NYSTATIN | 61 | 28 | 34 | 17
      50-14-6 | 61 | 17 | 61 | 17
      104376-79-6 | 33 | 17 | 33 | 17
      AMPHOTERICIN B | 67 | 21 | 63 | 17
      68-19-9 | 37 | 21 | 33 | 17
      ACONITINE | 47 | 19 | 45 | 17
      QUININE SULFATE | 38 | 17 | 38 | 17
      Annotations added on slides 53-54:
      • CAS numbers.
      • CAS numbers for parent compounds.
  • 55. Limitations of Synonym Filtering
      1. Synonym filtering focuses on consistency, not correctness.
        • It resolves discrepancies in name-structure associations within and between depositors.
        • Passing the filter does not mean that a synonym is correct.
      Example: Fentin acetate (CID 16682804). Its filtered synonyms include:
        • m-Nitrobenzaldehyde 3-thio-4-o-tolylsemicarbazone
        • Benzaldehyde, m-nitro-, 3-thio-4-o-tolylsemicarbazone
  • 58. Limitations of Synonym Filtering
      1. Synonym filtering focuses on consistency, not correctness.
        • Data sources integrate synonym data from other sources that are regarded as authoritative (e.g., government resources).
        • Erroneous data in one source propagate into other sources.
        • As a result, incorrect name-chemical associations can receive more votes than they should during synonym filtering.
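A toy tally shows how propagation can skew the vote. This sketch assumes one vote per depositor and a simple plurality winner; the depositor names, the structure labels "A" and "B", and the `tally` helper are hypothetical and do not describe PubChem's actual rule.

```python
from collections import Counter

def tally(associations):
    """Count one vote per (depositor, structure) pair for a single name."""
    return Counter(structure for _depositor, structure in associations)

# Three independent depositors: two link the name to structure "A", one to "B".
independent = [("dep1", "A"), ("dep2", "A"), ("dep3", "B")]

# Suppose dep3 is regarded as authoritative and three more sources copy
# its (erroneous) association without checking it.
propagated = independent + [("dep4", "B"), ("dep5", "B"), ("dep6", "B")]

winner_before = tally(independent).most_common(1)[0][0]
winner_after = tally(propagated).most_common(1)[0][0]
```

Before copying, "A" leads two votes to one; after copying, "B" leads four to two, although no new independent evidence was added.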
  • 59. Limitations of Synonym Filtering
      2. More than 90% of depositor-provided synonyms occur only once.
        • With a single occurrence there is nothing to vote on, so these synonyms are automatically assigned to the structures represented by their corresponding CIDs.
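The share of single-occurrence synonyms could be measured along these lines. This is a sketch under assumptions: the `(name, SID)` record layout, the toy data, and the choice to count over distinct names are illustrative and not taken from the talk.

```python
from collections import Counter

def singleton_fraction(synonym_records):
    """Fraction of distinct synonyms that occur exactly once.

    synonym_records: iterable of (name, sid) pairs.
    """
    counts = Counter(name for name, _sid in synonym_records)
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(counts)

# Toy records: one name deposited twice, two names deposited once each.
records = [("name-a", 1), ("name-a", 2), ("name-b", 3), ("name-c", 4)]
fraction = singleton_fraction(records)
```

Here two of the three distinct names are singletons, so no cross-depositor check is possible for them.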
  • 60. Limitations of Synonym Filtering
      3. Different tautomers are merged into one standardized tautomeric structure.
        • Their names are also merged with those of the standardized tautomer.
      Example: Uracil (CID 1174)
        • 2,4-Dihydroxypyrimidine (SID 377954591)
        • 2-hydroxy-4(1h)-pyrimidinone (SID 341255477)
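The effect on names can be sketched as a grouping step, using the slide's uracil example. The dictionary layout and the `names_by_compound` helper are assumptions for illustration; the SID-to-CID links are those implied by the slide.

```python
from collections import defaultdict

# SID -> (depositor-provided name, CID of the standardized structure).
substances = {
    377954591: ("2,4-Dihydroxypyrimidine", 1174),
    341255477: ("2-hydroxy-4(1h)-pyrimidinone", 1174),
}

def names_by_compound(substances):
    """Group depositor-provided names under the CID they standardize to."""
    merged = defaultdict(set)
    for _sid, (name, cid) in substances.items():
        merged[cid].add(name)
    return dict(merged)

merged = names_by_compound(substances)
```

Both tautomer-specific names end up attached to CID 1174, even though each strictly describes a different tautomeric form.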
  • 63. Summary
      • PubChem contains a large amount of chemical information provided by 690+ data sources.
      • Through the chemical structure standardization process, PubChem standardizes depositor-provided chemical structures and extracts unique structures.
      • PubChem uses crowd-voting-based synonym filtering to clean up the name-structure associations provided by depositors.
  • 64. Acknowledgements
      • The PubChem Team: Evan Bolton, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin Shoemaker, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
      • PubChem depositors, users, and collaborators
      • Funded by the National Library of Medicine