Chemical structure representation in PubChem
1. Chemical structure representation in PubChem
Roger Sayle
NextMove Software, Cambridge, UK
252nd ACS National Meeting, Philadelphia, PA, Tuesday 23rd August 2016
2. Selected PubChem publications
• Sunghwan Kim, Paul A. Thiessen, Evan E. Bolton, Jie Chen, Gang Fu, Asta
Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A. Shoemaker, Jiyao
Wang, Bo Yu, Jian Zhang and Stephen H. Bryant, “PubChem Substance and
Compound Databases”, Nucleic Acids Research, 2015.
• Volker D. Hähnke, Evan E. Bolton and Stephen H. Bryant, “PubChem atom
environments”, Journal of Cheminformatics, 7:41, 2015.
• Evan E. Bolton, Yanli Wang, Paul A. Thiessen, Stephen H. Bryant,
“PubChem: Integrated Platform of Small Molecules and Biological
Activities”, Annual Reports in Computational Chemistry, Volume 4.,
Chapter 12, pp. 217-241, 2008.
3. Substance and compound
• A unique and invaluable feature of PubChem’s
architecture is the distinction between the deposited
structures (substances) and the normalized
structures (compounds), and the retention of both.
• PubChem Substance contains ~209.6M structures.
• PubChem Compound contains ~91.7M structures.
4. Molecular identity
• When are two chemical structures the same?
– Alternate chemical representations.
– Aromaticity and conjugation.
– Protonation states and tautomerism.
– Errors and typographical mistakes.
6. Example 1: ethanol
• PubChem CID 702 has been deposited 1569 times
with six different explicit atom counts.
– 1311 have 9 atoms and 8 bonds.
– 249 have 3 atoms and 2 bonds.
– 4 have 0 atoms and 0 bonds.
– 2 have 4 atoms and 3 bonds.
– 2 have 5 atoms and 4 bonds.
– 1 has 7 atoms and 6 bonds.
• All have the same SMILES (“CCO”) and InChI.
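The two most common counts above (3 atoms/2 bonds and 9 atoms/8 bonds) are what one would expect from drawing ethanol without and with explicit hydrogens. The following is a minimal sketch of that arithmetic, assuming a simple default-valence model; `DEFAULT_VALENCE` and `count_atoms_and_bonds` are hypothetical names for illustration, not PubChem code.

```python
# Hypothetical sketch: count atoms and bonds with hydrogens implicit or explicit.
# Assumed normal valences for the elements used here.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}

def count_atoms_and_bonds(elements, bonds, explicit_h=False):
    """Return (atom_count, bond_count) for a heavy-atom graph.

    elements: list of element symbols for the heavy atoms.
    bonds: list of (i, j, order) tuples between heavy atoms.
    explicit_h: when True, hydrogens needed to fill each valence are
    counted as separate atoms, each with its own single bond.
    """
    n_atoms, n_bonds = len(elements), len(bonds)
    if explicit_h:
        used = [0] * len(elements)
        for i, j, order in bonds:
            used[i] += order
            used[j] += order
        hydrogens = sum(DEFAULT_VALENCE[e] - u for e, u in zip(elements, used))
        n_atoms += hydrogens
        n_bonds += hydrogens
    return n_atoms, n_bonds

# Ethanol, SMILES "CCO": two carbons and one oxygen joined by single bonds.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1, 1), (1, 2, 1)]

print(count_atoms_and_bonds(ethanol_atoms, ethanol_bonds))                   # (3, 2)
print(count_atoms_and_bonds(ethanol_atoms, ethanol_bonds, explicit_h=True))  # (9, 8)
```

The intermediate counts in the list (4, 5 and 7 atoms) would correspond to depositions where only some hydrogens are drawn explicitly.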
7. Explicit vs. implicit hydrogens
8. Example 2: nitrobenzene
• PubChem CID 7416 has 164 distinct substance
depositions (2 without structures).
9. MDL molfile-ageddon
• Biovia 2017 changed the interpretation of CT files.
• This affects 342,689 SIDs and 213,097 CIDs.
10. Hydrogens: easy come/easy go?
• PubChem is inconsistent on protonation/hydrogens.
• Common organic element radicals are hydrogenated:
– [C] → C, [Cl] → Cl, [P] → P, [S] → S, [H] → [HH]
– [Li], [Be], [B], [Si], [As], [Se], [At], etc. remain unchanged.
• Some groups get deprotonated
– c1ccccc1[N+](=O)O → c1ccccc1[N+](=O)[O-]
• But generally protonation state is preserved
– CC(=O)O, CC(=O)[O-], [NH4+], [NH3+]CC(=O)[O-]
– C[N+](C)(C)O
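The radical-handling rule in the list above can be pictured as a small lookup. This is only a toy sketch of the behaviour the slide describes, restricted to a single bare bracket atom; `HYDROGENATED` and `normalize_bare_atom` are hypothetical names, and nothing here is taken from PubChem's actual implementation.

```python
import re

# Elements the slide lists as being hydrogenated when written as bare radicals.
HYDROGENATED = {"C", "Cl", "P", "S"}

def normalize_bare_atom(smiles):
    """Apply the slide's rule to a one-atom SMILES such as "[C]".

    Anything that is not a single bare bracket atom is returned unchanged,
    which mirrors the slide's note that protonation state is generally kept.
    """
    match = re.fullmatch(r"\[([A-Z][a-z]?)\]", smiles)
    if not match:
        return smiles
    symbol = match.group(1)
    if symbol == "H":
        return "[HH]"      # atomic hydrogen becomes molecular hydrogen
    if symbol in HYDROGENATED:
        return symbol      # e.g. "[C]" becomes "C", i.e. methane
    return smiles          # e.g. "[Li]" and "[Si]" are left as written

for example in ["[C]", "[Cl]", "[H]", "[Li]", "[Si]", "CC(=O)[O-]"]:
    print(example, "->", normalize_bare_atom(example))
```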
11. Example 3: o-xylene
• A major challenge in chemical databases is
aromaticity: recognizing that two structures that differ
only in their Kekulé forms are the same molecule.
CID 7237
12. PubChem canonical Kekulé SMILES
• A significant innovation in cheminformatics
was Evan Bolton’s development of a “canonical”
Kekulé SMILES form of a molecule.
• Different chemistry toolkits (and chemists!) differ in
opinion on which ring systems are aromatic and
which are not, hence PubChem’s wish to remain
“neutral” by only providing non-aromatic SMILES.
13. Bolton’s algorithm
• Steps of Bolton’s Canonical Kekulé Form Algorithm:
14. Tricky case: 10b,10c-dihydropyrene
• An important aspect is to aromatize all conjugated
cycles, not just those in the SSSR (smallest set of smallest rings).
• Unfortunately, this computationally demanding
requirement is a source of pain at the NCBI.
15. Conjugated ring systems
• Does it make sense to distinguish 4n+2 Hückel
aromaticity from conjugated ring systems?
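For reference, the 4n+2 criterion in the question above is simple arithmetic on a ring's π-electron count. A minimal sketch follows; `is_huckel_count` is a hypothetical helper, and it deliberately ignores planarity and every other requirement of real aromaticity perception.

```python
def is_huckel_count(pi_electrons):
    """Return True when the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0

# Pi-electron counts of some fully conjugated monocycles.
examples = {
    "benzene": 6,
    "cyclobutadiene": 4,
    "cyclooctatetraene": 8,
}
for name, count in examples.items():
    print(name, count, is_huckel_count(count))
```

A toolkit that only asks "is the ring fully conjugated?" would treat all three rings alike, whereas the 4n+2 test singles out benzene; that difference is what the slide's question is about.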
21. Periodic table (circa 1997-2003)
• PubChem currently handles 109 of the 118 elements
in the periodic table [to be ratified in 2016].
• Hence “Mt” is the heaviest element at the moment.
• “Ds”, “Rg”, “Cn”, “Fl”, “Lv” already ratified.
• “Nh”, “Mc”, “Ts” and “Og” expected soon.
22. PubChem isotopes
• PubChem registration confirms that any specified
isotope has been observed experimentally.
• Hence [7CH4] is rejected, but [8CH4] is allowed.
• Interestingly, the [8CH4] of CID 11635947 has a half-life
of only two zeptoseconds (2×10⁻²¹ seconds).
• Another quirk is that PubChem doesn’t normalize
mononuclidic isotopes. Hence [19F]C (CID 58338844)
is registered separately from FC (CID 11638), even
though the two are the same substance.
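Fluorine has only one stable isotope, so the mass label in [19F]C carries no extra information. Below is a minimal sketch of the kind of normalization the slide says PubChem does not perform. The table `MONONUCLIDIC` is deliberately partial and `drop_redundant_isotopes` is a hypothetical helper; the regular expression is only an approximation of SMILES bracket-atom syntax.

```python
import re

# Partial, illustrative table: element -> mass number of its only stable isotope.
MONONUCLIDIC = {"F": 19, "Na": 23, "Al": 27, "P": 31, "I": 127, "Au": 197}

def drop_redundant_isotopes(smiles):
    """Remove isotope labels that match a mononuclidic element's sole isotope.

    Brackets are kept, so hydrogen counts and charges are not disturbed.
    """
    def replace(match):
        mass, symbol = int(match.group(1)), match.group(2)
        if MONONUCLIDIC.get(symbol) == mass:
            return "[" + symbol
        return match.group(0)

    return re.sub(r"\[(\d+)([A-Z][a-z]?)", replace, smiles)

print(drop_redundant_isotopes("[19F]C"))   # [F]C
print(drop_redundant_isotopes("[18F]C"))   # [18F]C
print(drop_redundant_isotopes("[13CH4]"))  # [13CH4]
```

The result "[F]C" denotes the same molecule as "FC"; the brackets are left in place only to keep the sketch from having to reason about implicit hydrogens.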
23. Disavowed by the government
• There are a number of species PubChem rejects:
– Chlorine dioxide O=[Cl]=O
– Carbide anions: [C-]#[C-] and [C-4]
• But there is hope…
– Disulfur dioxide: O=[S][S]=O → O=S=S=O
24. Related compounds/substances
• CID → CID
– Same Connectivity, Same Stereochemistry, Same Isotopes
– Same Parent Connectivity, Same Exact Parent
– Mixtures, Components and Neutralized Forms
– Unique Components
– Similar Compounds (90% Tanimoto), Similar Conformers
• CID → SID
– All, Same Structure, Mixture
• SID → SID
– Same Connectivity, Same Exact
• SID → CID
– PubChem SID
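The 90% Tanimoto threshold listed for Similar Compounds above refers to the standard Tanimoto coefficient between two fingerprints. A minimal sketch follows, assuming each fingerprint is given as a collection of on-bit indices; that representation, the empty-fingerprint convention and the name `tanimoto` are choices made for illustration only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)

# Made-up fingerprints: the second lacks one bit that the first has.
fp_query = range(10)     # bits 0..9
fp_close = range(9)      # bits 0..8
fp_far = [0, 1, 2, 50, 51, 52]

THRESHOLD = 0.90
print(tanimoto(fp_query, fp_close), tanimoto(fp_query, fp_close) >= THRESHOLD)
print(tanimoto(fp_query, fp_far), tanimoto(fp_query, fp_far) >= THRESHOLD)
```

Here the first pair shares 9 of 10 bits and just meets the threshold, while the second shares 3 of 13 and does not.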
25. PubChem bond encoding
• PubChem allows depositors to specify advanced
representations of molecular structures such as
inorganics and organometallics via SD tags.
• PUBCHEM_NONSTANDARDBOND
– 4 = Quadruple bond, 5 = Dative bond, 6 = Complex bond,
7 = Ionic bond.
• PUBCHEM_BONDANNOTATIONS
– 2 = Hydrogen bond, 9 = Resonance bond, 10 = Bold bond,
11 = Fischer bond, 12 = Close contact.
• Relatively few depositors make use of these.
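The codes listed above can be turned into a small lookup for reading such tags. The sketch below includes only the codes named on the slide, and it assumes each line of the tag value holds two atom indices followed by a code; that layout, and the names `CODE_TABLES` and `parse_bond_tag`, are assumptions for illustration rather than a description of the PubChem deposition format.

```python
# Code tables copied from the slide; other codes are not covered here.
CODE_TABLES = {
    "PUBCHEM_NONSTANDARDBOND": {
        4: "quadruple", 5: "dative", 6: "complex", 7: "ionic",
    },
    "PUBCHEM_BONDANNOTATIONS": {
        2: "hydrogen bond", 9: "resonance", 10: "bold",
        11: "Fischer", 12: "close contact",
    },
}

def parse_bond_tag(tag_name, tag_value):
    """Decode lines of "atom1 atom2 code" (assumed layout) into tuples."""
    table = CODE_TABLES[tag_name]
    records = []
    for line in tag_value.strip().splitlines():
        atom1, atom2, code = (int(field) for field in line.split())
        label = table.get(code, "unlisted code %d" % code)
        records.append((atom1, atom2, label))
    return records

print(parse_bond_tag("PUBCHEM_NONSTANDARDBOND", "1 2 5\n3 4 7"))
```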
26. Final thoughts: abstract
For all of the grief that I give Evan, often over corner cases of chemical semantics that
only one or two people care about, it is fair to say that PubChem represents the
current state-of-the-art in chemical structure representation. Nobody does it better.
Under the surface, unseen to most users, are a large number of technical and scientific
innovations that have enabled PubChem to scale over the past decade and a half to
now contain approaching 100 million compounds. From simple design decisions such
as the substance vs. compound distinction [that allows PubChem to avoid the early
mistakes of CAS] to breakthroughs such as canonical Kekulé SMILES [to avoid the early
mistakes of Daylight Chemical Information Systems], the architecture of PubChem
contains a treasure trove of cheminformatics innovations, covering normalization,
tautomers, mixtures, 2D fingerprints and similarity, substructure search, biopolymers,
text mining and much more. During this presentation I hope to share some of the cool
insights that the remarkable staff at the NCBI often forget to mention or are too
modest to point out.
Congratulations Evan and Steve.
27. Acknowledgements
• Evan Bolton, Steve Bryant, Paul Thiessen, Volker
Hähnke, David Lipman and the PubChem team at the
NCBI.
• John May, at NextMove Software, for the analysis of
PubChem atom types affected by Biovia changes.
• The rest of the team at NextMove Software.
• George Vacek and the team at OpenEye Scientific
Software.