RDkit Gems
Roger Sayle, PhD
NextMove Software LTD
cambridge, uk
introduction
• This presentation is intended as a short talktorial
describing real world code examples, recent
features an...
looping over molecule atoms
• Poor
for (RWMol::AtomIterator atomIt=mol->beginAtoms();
atomIt != mol->endAtoms(); atomIt++)...
looping over aTOM's bonds
• Documented idiom:
... molPtr is a const ROMol * ...
... atomPtr is a const Atom * ...
ROMol::O...
looping over aTOM's bonds
• Proposed replacement:
... molPtr is a const ROMol * ...
... atomPtr is a const Atom * ...
ROMo...
Accepting string arguments
• Good practice to accept both C++
constant literals and STL std::strings.
• Poor
void foo(cons...
isoelectronic valences
• Supporting PF6
-
in RDKit currently requires
adding 7 to the neutral valences of P.
Instead, P-1
...
representing reactions
• Two possible representations
– A new object containing vectors of reactant
and product (and agent...
common ancestry/base class
• A significant advantage of avoiding new
object types is the common requirement in
cheminforma...
Proposed rdkit interface
• Atom::setProp(“rxnRole”,(int)role)
– RXN_ROLE_NONE 0
– RXN_ROLE_REACTANT 1
– RXN_ROLE_AGENT 2
–...
and where this gets used...
• Code/GraphMol/FileParser/MolFileWriter.cpp:
• Function GetMolFileAtomLine contains
• Current...
inverted protein hierarchy
• Optimize data structures for common ops.
• Poor
for (modelIt mi=mol->getModels(); mi; ++mi)
f...
rdkit pdb file writer features
• By default
– Atoms are written as HETAM records
– Bonds are written as CONECT records
– B...
example pdb format output
COMPND benzene
HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 2 C2 UNL 1 0.000 0.000 0.0...
unique atom names
• Counts unique per element type
• C1..C99,CA0..CA9,CB0..CZ9,CAA...CZZ
• This allows for up to 1036 indi...
beware openbabel's lig residue
• PDB residue LIG is reserved for 3-pyridin-
4-yl-2,4-dihydro-indeno[1,2-c]pyrazole.
• Use ...
PDB file readers
• Not all connectivity is explicit
• Two options for determining bonds
– Template methods
– Proximity met...
template-based bond orders
• Need only specify the double bonds
• Most (11/20) amino acids only need C=O.
• Arg: C=O,CZ=NH...
switching on short strings
// These are macros to allow their use in C++ constants
#define BCNAM(A,B,C) (((A)<<16) | ((B)<...
proximity bonding
• The ideal distance between two bonded
atoms is the sum of their covalent radii.
• The ideal distance b...
compare the square
• A rate limiting step in many (poorly written)
proximity calculations is the calculation of
square roo...
fail early, fail fast (like pharma)
• Poor
return dx*dx + dy*dy + dz*dz <= radius2
• Better
dist2 = dx*dx
if (dist > radiu...
Proximity algorithms
• Brute Force [FreON] N2
• Z (Ordinate) Sort [OpenBabel] NlogN
• Binary Space Partition [CDK] NlogN.....
proximity comparison
proximity comparison
proximity comparison
Brute force
z (ordinate) sort
binary space partition (bsp)
fixed dimension grid
fixed spacing grid
finite fixed spacing grid
infinite fixed spacing grid
grid hashing
future work/wish list
• Atom reordering in RDKit.
• C++ compressed (gzip) file support.
• Non-forward PDB supplier.
• In-p...
Upcoming SlideShare
Loading in...5
×

RDKit Gems

468

Published on

Presentation by Roger at 2nd RDKit UGM at EMBL-EBI Oct 2013

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
468
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
8
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

RDKit Gems

  1. 1. RDkit Gems Roger Sayle, PhD NextMove Software LTD cambridge, uk
  2. 2. introduction • This presentation is intended as a short talktorial describing real world code examples, recent features and commonly used RDKit idioms. • Although many refect personal/NextMove Software experiences developing with the C++ APIs, most (though not all) observations apply equally to the Java and Python APIs.
  3. 3. looping over molecule atoms • Poor for (RWMol::AtomIterator atomIt=mol->beginAtoms(); atomIt != mol->endAtoms(); atomIt++) … • Better for (RWMol::AtomIterator atomIt=mol->beginAtoms(); atomIt != mol->endAtoms(); ++atomIt) … • The C++ post increment operator returns the original value prior to the increment, requiring a copy to made.
  4. 4. looping over aTOM's bonds • Documented idiom: ... molPtr is a const ROMol * ... ... atomPtr is a const Atom * ... ROMol::OEDGE_ITER beg,end; boost::tie(beg,end) = molPtr->getAtomBonds(atomPtr); while(beg!=end){ const Bond * bond=(*molPtr)[*beg].get(); ... do something with the Bond ... ++beg; }
  5. 5. looping over aTOM's bonds • Proposed replacement: ... molPtr is a const ROMol * ... ... atomPtr is a const Atom * ... ROMol::OBOND_ITER_PAIR bondIt; bondIt = molPtr->getAtomBonds(atomPtr); while(bondIt.first != bondIt.second){ const Bond * bond=(*molPtr)[*bondIt.first].get(); ... do something with the Bond ... ++bondIt; }
  6. 6. Accepting string arguments • Good practice to accept both C++ constant literals and STL std::strings. • Poor void foo(const char *ptr) { foo(std::string(ptr)); } void foo(const std::string str) { … } • Better void foo(const std::string &str) { foo(str.c_str()); } void foo(const char *ptr) { … }
  7. 7. isoelectronic valences • Supporting PF6 - in RDKit currently requires adding 7 to the neutral valences of P. Instead, P-1 should be isoelectronic with S. • Poor valence(atomicNo,charge) ~= valence(atomicNo,0)-charge • Better valence(atomicNo,charge) ~= valence(atomicNo-charge,0)
  8. 8. representing reactions • Two possible representations – A new object containing vectors of reactant and product (and agent?) molecules. – Annotation of existing molecule with reaction roles annotated on each atom. • Both are representations are equivalent and one may be converted to the other and vice versa.
  9. 9. common ancestry/base class • A significant advantage of avoiding new object types is the common requirement in cheminformatics of reading reactions, molecules, queries, peptides, transforms, and pharmacophores for the same file format. • OpenBabel and GGA's Indigo require you to know the contents of a file/record (mol vs. rxn) before reading it!?
  10. 10. Proposed rdkit interface • Atom::setProp(“rxnRole”,(int)role) – RXN_ROLE_NONE 0 – RXN_ROLE_REACTANT 1 – RXN_ROLE_AGENT 2 – RXN_ROLE_PRODUCT 3 • Atom::setProp(“rxnGroup”,(int)group) • Atom::setProp(“molAtomMapNumber”,idx) • RWMol::setProp(“isRxn”,isrxn?1:0)
  11. 11. and where this gets used... • Code/GraphMol/FileParser/MolFileWriter.cpp: • Function GetMolFileAtomLine contains • Currently unused local variables – rxnComponentType – rxnComponentNumber • This would allow support for CPSS-style reactions [i.e. reactions in SD files]
  12. 12. inverted protein hierarchy • Optimize data structures for common ops. • Poor for (modelIt mi=mol->getModels(); mi; ++mi) for (chainIt ci=(*mi)->getChains(); ci; ++ci) for (resIt ri=(*ci)->getResidues(); ri; ++ri) for (atomIt ai=(*ri)->getAtoms(); ai; ++ai) … /* Do something! */ • Better for (atomIt ai=mol->getAtoms(); ai; ++ai) … /* Do something! */
  13. 13. rdkit pdb file writer features • By default – Atoms are written as HETAM records – Bonds are written as CONECT records – Bond orders encoded by popular extension – All atoms are uniquely named – Molecule title written as COMPND record – Formal charges are preserved. – Default residue is “UNL” (not “LIG”).
  14. 14. example pdb format output COMPND benzene HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 2 C2 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 3 C3 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 4 C4 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 5 C5 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 6 C6 UNL 1 0.000 0.000 0.000 1.00 0.00 C CONECT 1 2 2 6 CONECT 2 3 CONECT 3 4 4 CONECT 4 5 CONECT 5 6 6 END
  15. 15. unique atom names • Counts unique per element type • C1..C99,CA0..CA9,CB0..CZ9,CAA...CZZ • This allows for up to 1036 indices.
  16. 16. beware openbabel's lig residue • PDB residue LIG is reserved for 3-pyridin- 4-yl-2,4-dihydro-indeno[1,2-c]pyrazole. • Use of residue UNL (for UNknown Ligand) is much preferred.
  17. 17. PDB file readers • Not all connectivity is explicit • Two options for determining bonds – Template methods – Proximity methods • Two options for determining bond orders – Template methods – 3D geometry methods • See Sayle, “PDB Cruft to Content”, MUG 2001
  18. 18. template-based bond orders • Need only specify the double bonds • Most (11/20) amino acids only need C=O. • Arg: C=O,CZ=NH2 • Asn/Asp: C=O,CG=OD1 • Gln/Glu: C=O,CD=OE1 • His: C=O,CG=CD2,ND1=CE1 • Phe/Tyr: C=O,CG=CD1,CD2=CE2,CE1=CZ • Trp: C=O,CG=CD1,CD2=CE2,CE3=CZ3,CZ2=CH2
  19. 19. switching on short strings // These are macros to allow their use in C++ constants #define BCNAM(A,B,C) (((A)<<16) | ((B)<<8) | (C)) #define BCATM(A,B,C,D) (((A)<<24) | ((B)<<16) | ((C)<<8) | (D)) switch (rescode) { case BCNAM('A','L','A'): case BCNAM('C','Y','S'): case BCNAM('G','L','Y'): case BCNAM('I','L','E'): case BCNAM('L','E','U'): ... if (atm1==BCATM(' ','C',' ',' ') && atm2==BCATM(' ','O',' ',' ')) return true; break;
  20. 20. proximity bonding • The ideal distance between two bonded atoms is the sum of their covalent radii. • The ideal distance between two non- bonded atoms is the sum of their van der Waal's radii. • Assume bonding between all atoms closer than the sum of Rcov plus a delta (0.45Å). • Impose minimum distance threshold (0.4Å)
  21. 21. compare the square • A rate limiting step in many (poorly written) proximity calculations is the calculation of square roots. • Poor if(sqrt(dx*dx+dy*dy+dz*dz) < radius) ... • Better if(dx*dx + dy*dy + dz*dz < radius*radius) … c.f. http://gcc.gnu.org/ml/gcc-patches/2003-03/msg01683.html
  22. 22. fail early, fail fast (like pharma) • Poor return dx*dx + dy*dy + dz*dz <= radius2 • Better dist2 = dx*dx if (dist > radius2) return false dist2 += dy*dy if (dist > radius2) return false dist2 += dz*dz return dist2 <= radius2
  23. 23. Proximity algorithms • Brute Force [FreON] N2 • Z (Ordinate) Sort [OpenBabel] NlogN • Binary Space Partition [CDK] NlogN..N2 • Fixed Dimension Grid [RasMol] ~N • Fixed Spacing Grid [Coot] ~N • Grid Hashing [RDKit/OEChem] ~N
  24. 24. proximity comparison
  25. 25. proximity comparison
  26. 26. proximity comparison
  27. 27. Brute force
  28. 28. z (ordinate) sort
  29. 29. binary space partition (bsp)
  30. 30. fixed dimension grid
  31. 31. fixed spacing grid
  32. 32. finite fixed spacing grid
  33. 33. infinite fixed spacing grid
  34. 34. grid hashing
  35. 35. future work/wish list • Atom reordering in RDKit. • C++ compressed (gzip) file support. • Non-forward PDB supplier. • In-place removeHs/addHs.
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×