SlideShare a Scribd company logo
RDkit Gems
Roger Sayle, PhD
NextMove Software LTD
cambridge, uk
introduction
• This presentation is intended as a short talktorial
describing real world code examples, recent
features and commonly used RDKit idioms.
• Although many refect personal/NextMove
Software experiences developing with the C++
APIs, most (though not all) observations apply
equally to the Java and Python APIs.
looping over molecule atoms
• Poor
for (RWMol::AtomIterator atomIt=mol->beginAtoms();
atomIt != mol->endAtoms(); atomIt++) …
• Better
for (RWMol::AtomIterator atomIt=mol->beginAtoms();
atomIt != mol->endAtoms(); ++atomIt) …
• The C++ post increment operator returns
the original value prior to the increment,
requiring a copy to made.
looping over aTOM's bonds
• Documented idiom:
... molPtr is a const ROMol * ...
... atomPtr is a const Atom * ...
ROMol::OEDGE_ITER beg,end;
boost::tie(beg,end) = molPtr->getAtomBonds(atomPtr);
while(beg!=end){
const Bond * bond=(*molPtr)[*beg].get();
... do something with the Bond ...
++beg;
}
looping over aTOM's bonds
• Proposed replacement:
... molPtr is a const ROMol * ...
... atomPtr is a const Atom * ...
ROMol::OBOND_ITER_PAIR bondIt;
bondIt = molPtr->getAtomBonds(atomPtr);
while(bondIt.first != bondIt.second){
const Bond * bond=(*molPtr)[*bondIt.first].get();
... do something with the Bond ...
++bondIt;
}
Accepting string arguments
• Good practice to accept both C++
constant literals and STL std::strings.
• Poor
void foo(const char *ptr) { foo(std::string(ptr)); }
void foo(const std::string str) { … }
• Better
void foo(const std::string &str) { foo(str.c_str()); }
void foo(const char *ptr) { … }
isoelectronic valences
• Supporting PF6
-
in RDKit currently requires
adding 7 to the neutral valences of P.
Instead, P-1
should be isoelectronic with S.
• Poor
valence(atomicNo,charge) ~= valence(atomicNo,0)-charge
• Better
valence(atomicNo,charge) ~= valence(atomicNo-charge,0)
representing reactions
• Two possible representations
– A new object containing vectors of reactant
and product (and agent?) molecules.
– Annotation of existing molecule with reaction
roles annotated on each atom.
• Both are representations are equivalent
and one may be converted to the other
and vice versa.
common ancestry/base class
• A significant advantage of avoiding new
object types is the common requirement in
cheminformatics of reading reactions,
molecules, queries, peptides, transforms,
and pharmacophores for the same file
format.
• OpenBabel and GGA's Indigo require you
to know the contents of a file/record (mol
vs. rxn) before reading it!?
Proposed rdkit interface
• Atom::setProp(“rxnRole”,(int)role)
– RXN_ROLE_NONE 0
– RXN_ROLE_REACTANT 1
– RXN_ROLE_AGENT 2
– RXN_ROLE_PRODUCT 3
• Atom::setProp(“rxnGroup”,(int)group)
• Atom::setProp(“molAtomMapNumber”,idx)
• RWMol::setProp(“isRxn”,isrxn?1:0)
and where this gets used...
• Code/GraphMol/FileParser/MolFileWriter.cpp:
• Function GetMolFileAtomLine contains
• Currently unused local variables
– rxnComponentType
– rxnComponentNumber
• This would allow support for CPSS-style
reactions [i.e. reactions in SD files]
inverted protein hierarchy
• Optimize data structures for common ops.
• Poor
for (modelIt mi=mol->getModels(); mi; ++mi)
for (chainIt ci=(*mi)->getChains(); ci; ++ci)
for (resIt ri=(*ci)->getResidues(); ri; ++ri)
for (atomIt ai=(*ri)->getAtoms(); ai; ++ai)
… /* Do something! */
• Better
for (atomIt ai=mol->getAtoms(); ai; ++ai)
… /* Do something! */
rdkit pdb file writer features
• By default
– Atoms are written as HETAM records
– Bonds are written as CONECT records
– Bond orders encoded by popular extension
– All atoms are uniquely named
– Molecule title written as COMPND record
– Formal charges are preserved.
– Default residue is “UNL” (not “LIG”).
example pdb format output
COMPND benzene
HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 2 C2 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 3 C3 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 4 C4 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 5 C5 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 6 C6 UNL 1 0.000 0.000 0.000 1.00 0.00 C
CONECT 1 2 2 6
CONECT 2 3
CONECT 3 4 4
CONECT 4 5
CONECT 5 6 6
END
unique atom names
• Counts unique per element type
• C1..C99,CA0..CA9,CB0..CZ9,CAA...CZZ
• This allows for up to 1036 indices.
beware openbabel's lig residue
• PDB residue LIG is reserved for 3-pyridin-
4-yl-2,4-dihydro-indeno[1,2-c]pyrazole.
• Use of residue UNL (for UNknown Ligand)
is much preferred.
PDB file readers
• Not all connectivity is explicit
• Two options for determining bonds
– Template methods
– Proximity methods
• Two options for determining bond orders
– Template methods
– 3D geometry methods
• See Sayle, “PDB Cruft to Content”, MUG 2001
template-based bond orders
• Need only specify the double bonds
• Most (11/20) amino acids only need C=O.
• Arg: C=O,CZ=NH2
• Asn/Asp: C=O,CG=OD1
• Gln/Glu: C=O,CD=OE1
• His: C=O,CG=CD2,ND1=CE1
• Phe/Tyr: C=O,CG=CD1,CD2=CE2,CE1=CZ
• Trp: C=O,CG=CD1,CD2=CE2,CE3=CZ3,CZ2=CH2
switching on short strings
// These are macros to allow their use in C++ constants
#define BCNAM(A,B,C) (((A)<<16) | ((B)<<8) | (C))
#define BCATM(A,B,C,D) (((A)<<24) | ((B)<<16) | ((C)<<8) | (D))
switch (rescode) {
case BCNAM('A','L','A'):
case BCNAM('C','Y','S'):
case BCNAM('G','L','Y'):
case BCNAM('I','L','E'):
case BCNAM('L','E','U'):
...
if (atm1==BCATM(' ','C',' ',' ') && atm2==BCATM(' ','O',' ',' '))
return true;
break;
proximity bonding
• The ideal distance between two bonded
atoms is the sum of their covalent radii.
• The ideal distance between two non-
bonded atoms is the sum of their van der
Waal's radii.
• Assume bonding between all atoms closer
than the sum of Rcov plus a delta (0.45Å).
• Impose minimum distance threshold (0.4Å)
compare the square
• A rate limiting step in many (poorly written)
proximity calculations is the calculation of
square roots.
• Poor
if(sqrt(dx*dx+dy*dy+dz*dz) < radius) ...
• Better
if(dx*dx + dy*dy + dz*dz < radius*radius) …
c.f. http://gcc.gnu.org/ml/gcc-patches/2003-03/msg01683.html
fail early, fail fast (like pharma)
• Poor
return dx*dx + dy*dy + dz*dz <= radius2
• Better
dist2 = dx*dx
if (dist > radius2)
return false
dist2 += dy*dy
if (dist > radius2)
return false
dist2 += dz*dz
return dist2 <= radius2
Proximity algorithms
• Brute Force [FreON] N2
• Z (Ordinate) Sort [OpenBabel] NlogN
• Binary Space Partition [CDK] NlogN..N2
• Fixed Dimension Grid [RasMol] ~N
• Fixed Spacing Grid [Coot] ~N
• Grid Hashing [RDKit/OEChem] ~N
proximity comparison
proximity comparison
proximity comparison
Brute force
z (ordinate) sort
binary space partition (bsp)
fixed dimension grid
fixed spacing grid
finite fixed spacing grid
infinite fixed spacing grid
grid hashing
future work/wish list
• Atom reordering in RDKit.
• C++ compressed (gzip) file support.
• Non-forward PDB supplier.
• In-place removeHs/addHs.

More Related Content

Viewers also liked

Real estate industry in chittagong
Real estate industry in chittagongReal estate industry in chittagong
Real estate industry in chittagong
Alexander Decker
 
Retailer 01-2013-preview
Retailer 01-2013-previewRetailer 01-2013-preview
Retailer 01-2013-preview
Cushman and Wakefield, Moscow
 
Mishimasyk141025
Mishimasyk141025Mishimasyk141025
Mishimasyk141025
Kazufumi Ohkawa
 
R -> Python
R -> PythonR -> Python
R -> Python
Kazufumi Ohkawa
 
Rdkitの紹介
Rdkitの紹介Rdkitの紹介
Rdkitの紹介
Takayuki Serizawa
 
10分でわかるRandom forest
10分でわかるRandom forest10分でわかるRandom forest
10分でわかるRandom forest
Yasunori Ozaki
 
機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論
Taiji Suzuki
 
Частотный преобразователь
Частотный преобразовательЧастотный преобразователь
Частотный преобразовательkulibin
 
AQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organismsAQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organisms
sherinshaju
 
Tech
TechTech
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonaJournalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonawww.patkane.global
 
Cómo hacer presentaciones exitosas
Cómo hacer presentaciones exitosasCómo hacer presentaciones exitosas
Cómo hacer presentaciones exitosas
Ismael Plascencia Nuñez
 
Mobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows PhoneMobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows Phone
Edgewater
 
Content Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored UpdatesContent Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored Updates
Rebecca Feldman
 
BIOSTER Technology Research Institute
BIOSTER Technology Research InstituteBIOSTER Technology Research Institute
BIOSTER Technology Research Institute
Data Science Institute - Imperial College London
 
22 of the best marketing quotes
22 of the best marketing quotes22 of the best marketing quotes
22 of the best marketing quotes
sherinshaju
 
PDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the policePDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the police
The Star Newspaper
 

Viewers also liked (18)

Real estate industry in chittagong
Real estate industry in chittagongReal estate industry in chittagong
Real estate industry in chittagong
 
Retailer 01-2013-preview
Retailer 01-2013-previewRetailer 01-2013-preview
Retailer 01-2013-preview
 
Mishimasyk141025
Mishimasyk141025Mishimasyk141025
Mishimasyk141025
 
R -> Python
R -> PythonR -> Python
R -> Python
 
Rdkitの紹介
Rdkitの紹介Rdkitの紹介
Rdkitの紹介
 
10分でわかるRandom forest
10分でわかるRandom forest10分でわかるRandom forest
10分でわかるRandom forest
 
機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論
 
Частотный преобразователь
Частотный преобразовательЧастотный преобразователь
Частотный преобразователь
 
AQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organismsAQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organisms
 
Tech
TechTech
Tech
 
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonaJournalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
 
Cómo hacer presentaciones exitosas
Cómo hacer presentaciones exitosasCómo hacer presentaciones exitosas
Cómo hacer presentaciones exitosas
 
Mobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows PhoneMobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows Phone
 
Content Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored UpdatesContent Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored Updates
 
Diana maria morales hernandez actividad1 mapa_c
Diana maria morales hernandez actividad1 mapa_cDiana maria morales hernandez actividad1 mapa_c
Diana maria morales hernandez actividad1 mapa_c
 
BIOSTER Technology Research Institute
BIOSTER Technology Research InstituteBIOSTER Technology Research Institute
BIOSTER Technology Research Institute
 
22 of the best marketing quotes
22 of the best marketing quotes22 of the best marketing quotes
22 of the best marketing quotes
 
PDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the policePDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the police
 

Similar to RDKit Gems

En528032
En528032En528032
En528032
stalinys
 
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...moiz89
 
Nptel cad2-06 capcitances
Nptel cad2-06 capcitancesNptel cad2-06 capcitances
Nptel cad2-06 capcitances
chenna_kesava
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structureBITS
 
Pot.ppt.pdf
Pot.ppt.pdfPot.ppt.pdf
Pot.ppt.pdf
ashwanikushwaha15
 
Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14cjfoss
 
Design and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applicationsDesign and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applications
cuashok07
 
Making a peaking filter by Julio Marqués
Making a peaking filter by Julio MarquésMaking a peaking filter by Julio Marqués
Making a peaking filter by Julio Marqués
Julio José Marqués Emán
 
Proposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and ApplicationsProposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and Applications
Blaise Mouttet
 
90981041 control-system-lab-manual
90981041 control-system-lab-manual90981041 control-system-lab-manual
90981041 control-system-lab-manualGopinath.B.L Naidu
 
MetroScientific Week 1.pptx
MetroScientific Week 1.pptxMetroScientific Week 1.pptx
MetroScientific Week 1.pptx
Bipin Saha
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
Dragos Sbîrlea
 
Basic execution
Basic executionBasic execution
Basic execution
Ashutosh Sharma
 
Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1
Kshitij Singh
 
Buck converter design
Buck converter designBuck converter design
Buck converter design
Võ Hồng Quý
 
Benchmark Calculations of Atomic Data for Modelling Applications
 Benchmark Calculations of Atomic Data for Modelling Applications Benchmark Calculations of Atomic Data for Modelling Applications
Benchmark Calculations of Atomic Data for Modelling Applications
AstroAtom
 
Wien2k getting started
Wien2k getting startedWien2k getting started
Wien2k getting started
ABDERRAHMANE REGGAD
 
PPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).pptPPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).ppt
tariqqureshi33
 
Structure prediction
Structure predictionStructure prediction
Structure prediction
Cesario Ajpi Condori
 

Similar to RDKit Gems (20)

En528032
En528032En528032
En528032
 
Lp seminar
Lp seminarLp seminar
Lp seminar
 
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
 
Nptel cad2-06 capcitances
Nptel cad2-06 capcitancesNptel cad2-06 capcitances
Nptel cad2-06 capcitances
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
 
Pot.ppt.pdf
Pot.ppt.pdfPot.ppt.pdf
Pot.ppt.pdf
 
Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14
 
Design and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applicationsDesign and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applications
 
Making a peaking filter by Julio Marqués
Making a peaking filter by Julio MarquésMaking a peaking filter by Julio Marqués
Making a peaking filter by Julio Marqués
 
Proposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and ApplicationsProposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and Applications
 
90981041 control-system-lab-manual
90981041 control-system-lab-manual90981041 control-system-lab-manual
90981041 control-system-lab-manual
 
MetroScientific Week 1.pptx
MetroScientific Week 1.pptxMetroScientific Week 1.pptx
MetroScientific Week 1.pptx
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
 
Basic execution
Basic executionBasic execution
Basic execution
 
Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1
 
Buck converter design
Buck converter designBuck converter design
Buck converter design
 
Benchmark Calculations of Atomic Data for Modelling Applications
 Benchmark Calculations of Atomic Data for Modelling Applications Benchmark Calculations of Atomic Data for Modelling Applications
Benchmark Calculations of Atomic Data for Modelling Applications
 
Wien2k getting started
Wien2k getting startedWien2k getting started
Wien2k getting started
 
PPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).pptPPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).ppt
 
Structure prediction
Structure predictionStructure prediction
Structure prediction
 

More from NextMove Software

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
NextMove Software
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
NextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
NextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
NextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
NextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
NextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
NextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
NextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
NextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
NextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
NextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
NextMove Software
 

More from NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 

Recently uploaded

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 

RDKit Gems

  • 1. RDkit Gems Roger Sayle, PhD NextMove Software LTD cambridge, uk
  • 2. introduction • This presentation is intended as a short talktorial describing real world code examples, recent features and commonly used RDKit idioms. • Although many refect personal/NextMove Software experiences developing with the C++ APIs, most (though not all) observations apply equally to the Java and Python APIs.
  • 3. looping over molecule atoms • Poor for (RWMol::AtomIterator atomIt=mol->beginAtoms(); atomIt != mol->endAtoms(); atomIt++) … • Better for (RWMol::AtomIterator atomIt=mol->beginAtoms(); atomIt != mol->endAtoms(); ++atomIt) … • The C++ post increment operator returns the original value prior to the increment, requiring a copy to made.
  • 4. looping over aTOM's bonds • Documented idiom: ... molPtr is a const ROMol * ... ... atomPtr is a const Atom * ... ROMol::OEDGE_ITER beg,end; boost::tie(beg,end) = molPtr->getAtomBonds(atomPtr); while(beg!=end){ const Bond * bond=(*molPtr)[*beg].get(); ... do something with the Bond ... ++beg; }
  • 5. looping over aTOM's bonds • Proposed replacement: ... molPtr is a const ROMol * ... ... atomPtr is a const Atom * ... ROMol::OBOND_ITER_PAIR bondIt; bondIt = molPtr->getAtomBonds(atomPtr); while(bondIt.first != bondIt.second){ const Bond * bond=(*molPtr)[*bondIt.first].get(); ... do something with the Bond ... ++bondIt; }
  • 6. Accepting string arguments • Good practice to accept both C++ constant literals and STL std::strings. • Poor void foo(const char *ptr) { foo(std::string(ptr)); } void foo(const std::string str) { … } • Better void foo(const std::string &str) { foo(str.c_str()); } void foo(const char *ptr) { … }
  • 7. isoelectronic valences • Supporting PF6 - in RDKit currently requires adding 7 to the neutral valences of P. Instead, P-1 should be isoelectronic with S. • Poor valence(atomicNo,charge) ~= valence(atomicNo,0)-charge • Better valence(atomicNo,charge) ~= valence(atomicNo-charge,0)
  • 8. representing reactions • Two possible representations – A new object containing vectors of reactant and product (and agent?) molecules. – Annotation of existing molecule with reaction roles annotated on each atom. • Both are representations are equivalent and one may be converted to the other and vice versa.
  • 9. common ancestry/base class • A significant advantage of avoiding new object types is the common requirement in cheminformatics of reading reactions, molecules, queries, peptides, transforms, and pharmacophores for the same file format. • OpenBabel and GGA's Indigo require you to know the contents of a file/record (mol vs. rxn) before reading it!?
  • 10. Proposed rdkit interface • Atom::setProp(“rxnRole”,(int)role) – RXN_ROLE_NONE 0 – RXN_ROLE_REACTANT 1 – RXN_ROLE_AGENT 2 – RXN_ROLE_PRODUCT 3 • Atom::setProp(“rxnGroup”,(int)group) • Atom::setProp(“molAtomMapNumber”,idx) • RWMol::setProp(“isRxn”,isrxn?1:0)
  • 11. and where this gets used... • Code/GraphMol/FileParser/MolFileWriter.cpp: • Function GetMolFileAtomLine contains • Currently unused local variables – rxnComponentType – rxnComponentNumber • This would allow support for CPSS-style reactions [i.e. reactions in SD files]
  • 12. inverted protein hierarchy • Optimize data structures for common ops. • Poor for (modelIt mi=mol->getModels(); mi; ++mi) for (chainIt ci=(*mi)->getChains(); ci; ++ci) for (resIt ri=(*ci)->getResidues(); ri; ++ri) for (atomIt ai=(*ri)->getAtoms(); ai; ++ai) … /* Do something! */ • Better for (atomIt ai=mol->getAtoms(); ai; ++ai) … /* Do something! */
  • 13. rdkit pdb file writer features • By default – Atoms are written as HETAM records – Bonds are written as CONECT records – Bond orders encoded by popular extension – All atoms are uniquely named – Molecule title written as COMPND record – Formal charges are preserved. – Default residue is “UNL” (not “LIG”).
  • 14. example pdb format output COMPND benzene HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 2 C2 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 3 C3 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 4 C4 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 5 C5 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 6 C6 UNL 1 0.000 0.000 0.000 1.00 0.00 C CONECT 1 2 2 6 CONECT 2 3 CONECT 3 4 4 CONECT 4 5 CONECT 5 6 6 END
  • 15. unique atom names • Counts unique per element type • C1..C99,CA0..CA9,CB0..CZ9,CAA...CZZ • This allows for up to 1036 indices.
  • 16. beware openbabel's lig residue • PDB residue LIG is reserved for 3-pyridin- 4-yl-2,4-dihydro-indeno[1,2-c]pyrazole. • Use of residue UNL (for UNknown Ligand) is much preferred.
  • 17. PDB file readers • Not all connectivity is explicit • Two options for determining bonds – Template methods – Proximity methods • Two options for determining bond orders – Template methods – 3D geometry methods • See Sayle, “PDB Cruft to Content”, MUG 2001
  • 18. template-based bond orders • Need only specify the double bonds • Most (11/20) amino acids only need C=O. • Arg: C=O,CZ=NH2 • Asn/Asp: C=O,CG=OD1 • Gln/Glu: C=O,CD=OE1 • His: C=O,CG=CD2,ND1=CE1 • Phe/Tyr: C=O,CG=CD1,CD2=CE2,CE1=CZ • Trp: C=O,CG=CD1,CD2=CE2,CE3=CZ3,CZ2=CH2
  • 19. switching on short strings // These are macros to allow their use in C++ constants #define BCNAM(A,B,C) (((A)<<16) | ((B)<<8) | (C)) #define BCATM(A,B,C,D) (((A)<<24) | ((B)<<16) | ((C)<<8) | (D)) switch (rescode) { case BCNAM('A','L','A'): case BCNAM('C','Y','S'): case BCNAM('G','L','Y'): case BCNAM('I','L','E'): case BCNAM('L','E','U'): ... if (atm1==BCATM(' ','C',' ',' ') && atm2==BCATM(' ','O',' ',' ')) return true; break;
  • 20. proximity bonding • The ideal distance between two bonded atoms is the sum of their covalent radii. • The ideal distance between two non- bonded atoms is the sum of their van der Waal's radii. • Assume bonding between all atoms closer than the sum of Rcov plus a delta (0.45Å). • Impose minimum distance threshold (0.4Å)
  • 21. compare the square • A rate limiting step in many (poorly written) proximity calculations is the calculation of square roots. • Poor if(sqrt(dx*dx+dy*dy+dz*dz) < radius) ... • Better if(dx*dx + dy*dy + dz*dz < radius*radius) … c.f. http://gcc.gnu.org/ml/gcc-patches/2003-03/msg01683.html
  • 22. fail early, fail fast (like pharma) • Poor return dx*dx + dy*dy + dz*dz <= radius2 • Better dist2 = dx*dx if (dist > radius2) return false dist2 += dy*dy if (dist > radius2) return false dist2 += dz*dz return dist2 <= radius2
  • 23. Proximity algorithms • Brute Force [FreON] N2 • Z (Ordinate) Sort [OpenBabel] NlogN • Binary Space Partition [CDK] NlogN..N2 • Fixed Dimension Grid [RasMol] ~N • Fixed Spacing Grid [Coot] ~N • Grid Hashing [RDKit/OEChem] ~N
  • 35. future work/wish list • Atom reordering in RDKit. • C++ compressed (gzip) file support. • Non-forward PDB supplier. • In-place removeHs/addHs.