MINE Databases for Metabolite Repair: A Workshop
James Jeffryes
10/16/14
REPRESENTING CHEMICALS DIGITALLY
Overview
• What constraints influence how we represent compounds digitally?
• A few common chemical data structures
• Canonicalization & Hashing
• Fingerprinting and Similarity measures
A central struggle in Computer Science
Computational Efficiency vs. Memory Efficiency
• Should hydrogen atoms be specified?
• How to represent resonance?
• How to provide material properties?
Another Tradeoff
Human Readability vs. Computational Utility
SMILES: OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O
InChIKey: WQZGKKKJIJFFOK-GASJEMHNSA-N
Computers ❤️ Graphs
• Graphs have nodes and edges
• So do molecules!
• These nodes may have spatial positions
• Hydrogen atoms can really get in the way!
[Figure: ethanol drawn with all hydrogen atoms explicit]
Encoding graphs
• Three ways with increasing subtlety (more CPU, less memory):
– Matrices
– Lists
– Strings
[Figure: ethanol, C-C-O]
Bond Electron Matrix
    C  C  O
C   0  1  0
C   1  0  1
O   0  1  0
[Figure: ethanol, C-C-O]
A symmetric matrix whose entries give the bond order between each pair of atoms
Not as space efficient, but very easy to manipulate computationally
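As a minimal sketch of the matrix above (not code from the workshop), the following Python builds the symmetric bond-order matrix for ethanol from a bond list. The atom ordering C, C, O and the helper names `bond_matrix` and `neighbors` are illustrative assumptions.

```python
# Ethanol heavy atoms in the same order as the matrix: C, C, O.
atoms = ["C", "C", "O"]
# Each bond is (atom index, atom index, bond order).
bonds = [(0, 1, 1), (1, 2, 1)]


def bond_matrix(n_atoms, bonds):
    """Return a symmetric n x n matrix of bond orders (0 means no bond)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        matrix[i][j] = order
        matrix[j][i] = order  # keep the matrix symmetric
    return matrix


def neighbors(matrix, i):
    """Indices of atoms bonded to atom i: one row scan, no searching."""
    return [j for j, order in enumerate(matrix[i]) if order > 0]


matrix = bond_matrix(len(atoms), bonds)
for symbol, row in zip(atoms, matrix):
    print(symbol, row)
```

The matrix stores n x n entries even though most are zero, which is the space cost; in exchange, looking up a bond or an atom's neighbors is a simple index operation.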
Chemical Markup Language (CML) 
is a list notation 
<cml><MDocument><MChemicalStruct> 
<molecule molID="m1"> 
<atomArray> 
<atom id="a1" elementType="C" x2="-4.208333333333333" y2="1.4583333333333333"/> 
<atom id="a2" elementType="C" x2="-2.801473328563728" y2="2.0847077636700657"/> 
<atom id="a3" elementType="O" x2="-2.325587157226309" y2="3.549334798764602"/> 
</atomArray> 
<bondArray> 
<bond atomRefs2="a1 a2" order="1"/> 
<bond atomRefs2="a2 a3" order="1"/> 
</bondArray> 
</molecule> 
</MChemicalStruct></MDocument></cml> 
[Figure: ethanol, C-C-O]
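To illustrate why CML counts as a list notation, here is a small sketch that reads the atom list and bond list out of the snippet above with Python's standard `xml.etree.ElementTree`. The function name `read_cml` is hypothetical, and the sketch assumes the attribute names shown on the slide (`id`, `elementType`, `atomRefs2`, `order`).

```python
import xml.etree.ElementTree as ET

# The CML snippet from the slide, with straight quotes throughout.
CML = """<cml><MDocument><MChemicalStruct>
<molecule molID="m1">
<atomArray>
<atom id="a1" elementType="C" x2="-4.208333333333333" y2="1.4583333333333333"/>
<atom id="a2" elementType="C" x2="-2.801473328563728" y2="2.0847077636700657"/>
<atom id="a3" elementType="O" x2="-2.325587157226309" y2="3.549334798764602"/>
</atomArray>
<bondArray>
<bond atomRefs2="a1 a2" order="1"/>
<bond atomRefs2="a2 a3" order="1"/>
</bondArray>
</molecule>
</MChemicalStruct></MDocument></cml>"""


def read_cml(text):
    """Return ({atom id: element}, [(atom id, atom id, bond order), ...])."""
    root = ET.fromstring(text)
    atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    bonds = []
    for bond in root.iter("bond"):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, int(bond.get("order"))))
    return atoms, bonds


atoms, bonds = read_cml(CML)
print(atoms)
print(bonds)
```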
Mol files are also lists 
3 2 0 0 0 0 0 0 0 0999 V2000 
22.1200 -15.8397 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 
20.9088 -16.5419 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 
23.3312 -16.5419 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 
1 2 1 0 0 0 
1 3 1 0 0 0 
[Figure: ethanol, C-C-O]
Atom lines: coordinates, then atom type. Bond lines: from, to, order.
SDF files are lists of molfiles and properties (Listception!)
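The atom and bond lists in the mol block above can be pulled apart with a few lines of Python. This is only a sketch: it splits on whitespace, which works for the snippet as shown here, whereas real V2000 molfiles use fixed-width columns and begin with a three-line header that the slide omits. The name `read_mol_block` is hypothetical.

```python
MOL_BLOCK = """\
3 2 0 0 0 0 0 0 0 0999 V2000
22.1200 -15.8397 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
20.9088 -16.5419 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
23.3312 -16.5419 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0
1 3 1 0 0 0
"""


def read_mol_block(text):
    """Parse the counts line, then the atom list, then the bond list."""
    lines = text.strip().splitlines()
    # Counts line: the first two numbers are the atom and bond counts.
    n_atoms, n_bonds = (int(tok) for tok in lines[0].split()[:2])
    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, z, element = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        # Atom numbers in the bond list are 1-based.
        start, end, order = (int(tok) for tok in line.split()[:3])
        bonds.append((start, end, order))
    return atoms, bonds


atoms, bonds = read_mol_block(MOL_BLOCK)
print(atoms)
print(bonds)
```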
Simplified molecular-input line-entry system (SMILES)
CCO … That’s it!
How about something a bit more tricky? 
O=Cc1ccc(O)c(OC)c1
[Figure: ethanol, C-C-O]
Try writing SMILES for Ethambutol 
• CCC(CO)NCCNC(CC)CO 
• What about: 
– OCC(CC)NCCNC(CO)CC 
– CCC(NCCNC(CO)CC)CO 
– And many more 
• What if we want to know if 2 compounds are the same?
Atoms that aren’t literal atoms
• R group – matches any group of atoms
• Query Atoms
– A – Matches any atom but hydrogen
– Q – Matches any atom but hydrogen or carbon
– M – Matches any metal
– X – Matches any halogen
– Atom lists – Match any of a specified set of elements
• Pseudoatoms – an atom not on the periodic table. Computers just treat them as text
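The query-atom rules can be written down as a small matching function. This is a sketch under stated assumptions: the function name `matches_query` and the `HALOGENS` set are illustrative, atom lists are represented as Python collections, and the metal query M is left out because it needs a full table of metallic elements.

```python
HALOGENS = {"F", "Cl", "Br", "I", "At"}


def matches_query(query, element):
    """Return True if the element symbol satisfies the query atom."""
    if isinstance(query, (set, frozenset, list, tuple)):
        return element in query            # atom list: any listed element
    if query == "A":
        return element != "H"              # any atom but hydrogen
    if query == "Q":
        return element not in ("H", "C")   # any atom but hydrogen or carbon
    if query == "X":
        return element in HALOGENS         # any halogen
    return query == element                # otherwise a literal element


print(matches_query("Q", "N"))
print(matches_query({"N", "O"}, "S"))
```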
Canonicalization
[O-]C(=O)c1cc(O)cc(c1)O
Establish a canonical form of the graph (can be tricky!):
• Dominant tautomer (resonance)
• Predominant chemical species (charge)
Enumerate the graph in a predictable way:
• Picking the starting atom
• Selecting which branch to follow at branch points
SMILES can be canonical; InChIs always are
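A toy version of the "enumerate the graph in a predictable way" step is sketched below. It is not the algorithm used by any real toolkit or by InChI. It assumes a tiny SMILES subset (single-letter atoms, single bonds, branches, no rings, charges or stereo), which is enough for the three ethambutol strings from the earlier slide. The names `parse_simple_smiles`, `rooted_string` and `canonical_form` are hypothetical.

```python
def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset into (atom symbols, adjacency lists)."""
    atoms, bonds = [], {}
    stack, prev = [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)       # remember the branch point
        elif ch == ")":
            prev = stack.pop()       # return to the branch point
        elif ch.isalpha() and ch.isupper():
            idx = len(atoms)
            atoms.append(ch)
            bonds[idx] = []
            if prev is not None:     # bond to the previous atom
                bonds[prev].append(idx)
                bonds[idx].append(prev)
            prev = idx
        else:
            raise ValueError("unsupported SMILES character: " + ch)
    return atoms, bonds


def rooted_string(atoms, bonds, node, parent):
    """Write the tree from `node`, always visiting branches in sorted order."""
    branches = sorted(
        rooted_string(atoms, bonds, nxt, node)
        for nxt in bonds[node] if nxt != parent
    )
    return atoms[node] + "".join("(" + b + ")" for b in branches)


def canonical_form(smiles):
    """Try every starting atom and keep the lexicographically smallest string."""
    atoms, bonds = parse_simple_smiles(smiles)
    return min(rooted_string(atoms, bonds, i, None) for i in range(len(atoms)))


variants = ["CCC(CO)NCCNC(CC)CO", "OCC(CC)NCCNC(CO)CC", "CCC(NCCNC(CO)CC)CO"]
print({canonical_form(s) for s in variants})
```

Both choices named on the slide appear here: the starting atom is fixed by taking the minimum over all candidates, and the branch order is fixed by sorting. Rings, bond orders, tautomers and charge states are what make real canonicalization tricky, and none of them are handled in this sketch.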
Identifying molecules
• Even a string representation can be a cumbersome way to refer to molecules
• For example, phospholipids:
– InChI=1S/C81H148O17P2/c1-5-9-13-17-21-25-29-33-37-41-45-49-53-57-61-65-78(83)91-71-76(97-80(85)67-63-59-55-51-47-43-39-35-31-27-23-19-15-11-7-3)73-95-99(87,88)93-69-75(82)70-94-100(89,90)96-74-77(98-81(86)68-64-60-56-52-48-44-40-36-32-28-24-20-16-12-8-4)72-92-79(84)66-62-58-54-50-46-42-38-34-30-26-22-18-14-10-6-2/h23,27,33-40,75-77,82H,5-22,24-26,28-32,41-74H2,1-4H3,(H,87,88)(H,89,90)/b27-23-,37-33-,38-34-,39-35-,40-36-/t75?,76-,77-/m1/s1
• What we need is an automatic name for this compound
Hashing to the rescue
• We want a function that is:
– Deterministic (always gives the same output for the same input)
– Fixed Length (usually)
– Uniform (makes good use of the space we allow it)
• There is no way to have a 1:1 mapping; collisions can happen (but are very unlikely)
• Example InChIKey:
– HGIKPGJCIWRORL-TVFZIFOYSA-N
– First block: connectivity. Second block: stereo etc. Final letter: protonation.
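The three properties can be seen in a toy hash built on Python's standard `hashlib`. This is not the InChIKey algorithm; the function name `short_key`, the 14-letter length and the byte-to-letter mapping are assumptions made for illustration.

```python
import hashlib
import string


def short_key(identifier, length=14):
    """Map any string to a fixed-length block of uppercase letters."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    letters = string.ascii_uppercase
    # Take the first `length` bytes (at most 32) and map each to a letter.
    # The modulo makes some letters slightly more likely than others,
    # which is acceptable for a sketch.
    return "".join(letters[byte % 26] for byte in digest[:length])


print(short_key("CCO"))
print(short_key("CCC(CO)NCCNC(CC)CO"))
```

The same input always gives the same key (deterministic), every key has 14 letters no matter how long the input is (fixed length), and SHA-256 spreads inputs evenly over the output space (uniform). The input must already be canonical, otherwise two spellings of the same molecule get different keys.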
Fragment-based Chemical Fingerprints
~400 chemical moieties which are either present or absent
Used extensively in Pharmaceutical Science
Atom Pair Chemical Fingerprints 
• Encode all atoms as a type 
– -OH = 14 
– -CH2- = 3 
– -CH3 = 1 
• Enumerate all distances between pairs 
– 14 – (2) – 3 
– 3 – (2) – 1 
– 14 – (3) – 1 
• Hash the result 
[Figure: ethanol, C-C-O]
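The three steps above can be sketched for ethanol as follows. The type codes (14, 3, 1) come from the slide; the distance is counted as the number of atoms on the shortest path including both ends, which reproduces the slide's values. Writing the larger code first, folding into 64 bits and using `zlib.crc32` as the hash are assumptions for illustration only.

```python
from collections import deque
import zlib

# Ethanol: atom 0 is -OH (14), atom 1 is -CH2- (3), atom 2 is -CH3 (1).
ATOM_TYPES = [14, 3, 1]
BONDS = {0: [1], 1: [0, 2], 2: [1]}


def path_lengths(bonds, start):
    """Breadth-first search: atoms on the shortest path, both ends counted."""
    dist = {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in bonds[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def atom_pairs(types, bonds):
    """List every pair as (larger type code, distance, smaller type code)."""
    pairs = []
    for i in range(len(types)):
        dist = path_lengths(bonds, i)
        for j in range(i + 1, len(types)):
            high, low = max(types[i], types[j]), min(types[i], types[j])
            pairs.append((high, dist[j], low))
    return sorted(pairs)


def fold(pairs, n_bits=64):
    """Hash each pair to one position in a fixed-length bit vector."""
    bits = [0] * n_bits
    for pair in pairs:
        bits[zlib.crc32(repr(pair).encode("utf-8")) % n_bits] = 1
    return bits


pairs = atom_pairs(ATOM_TYPES, BONDS)
fingerprint = fold(pairs)
print(pairs)
```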
Your Turn!
• Find the unique atom types and count unique atom pairs
– 5 unique atom types: -CH3, -CH2-, -CH<, -OH, -NH-
– ~23 unique atom pairs
Quantitative Chemical Similarity
We can quantitatively describe chemical similarity by computation.
Tanimoto Coefficient
(no similarity) 0 ≤ τ ≤ 1 (identical vectors)
[Figure: two chemical structures, one of them a diphosphate, each shown with an example fingerprint vector]
[ 0 1 0 0 1 ]
[ 0 1 0 1 1 ]
τ = 0.2
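For bit vectors, the Tanimoto coefficient is the number of bits set in both vectors divided by the number of bits set in either one. The sketch below applies that definition to the two five-bit vectors printed on the slide, for which it gives 2/3. The τ = 0.2 on the slide presumably comes from the full fingerprints of the pictured molecules rather than from these short example vectors; that reading is an assumption. The function name `tanimoto` and the choice to return 0.0 for two all-zero vectors are also illustrative.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length bit vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Two all-zero vectors share nothing; returning 0.0 is an arbitrary choice.
    return both / either if either else 0.0


first = [0, 1, 0, 0, 1]
second = [0, 1, 0, 1, 1]
print(tanimoto(first, second))
```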
QUESTIONS? 

Representing Chemicals Digitally: An overview of Cheminformatics
