Mike Lynch Award Lecture, ICCS 2022

  1. 1. RDKit: where did we come from and where are we going? Greg Landrum (@dr_greg_landrum) 12th International Conference on Chemical Structures 12 June, 2022
  2. 2. The Trustees of the CSA Trust are pleased to announce that Greg Landrum has been awarded the 2022 Mike Lynch Award, in recognition of his work on the development of RDKit and his fostering of the community around it, a transformative software resource for cheminformatics and machine learning. https://csa-trust.org/2022/05/13/mike-lynch-award-2022-greg-landrum/ The purpose of the Award is to recognise and encourage outstanding accomplishments in education, research and development activities that are related to the systems and methods used to store, process and retrieve information about chemical structures, reactions and properties. The Mike Lynch Award will be presented at a prestigious, relevant conference to be identified prior to each presentation and the awardee will be asked to give a presentation at the conference. https://csa-trust.org/awards-and-grants/awards/
  3. 3. 3 The RDKit
  4. 4. 4 Acknowledgements ● Everyone who has contributed code, questions, answers, bug reports, etc ● The people who manage RDKit packaging ● The organizers and sponsors of the RDKit UGMs ● People who have funded RDKit development (directly or indirectly) ● The others in our community who've been pushing the idea and adoption of open source
  5. 5. 5 An open source toolkit for cheminformatics ● Business-friendly BSD license ● Core data structures and algorithms in C++ ● Python 3.x wrapper generated using Boost.Python ● Java and C# wrappers generated with SWIG ● JavaScript wrappers ● CFFI wrapper for usage from other languages ● 2D and 3D molecular operations ● Descriptor generation for machine learning ● Molecular database cartridge for PostgreSQL ● Cheminformatics nodes for KNIME (distributed from the KNIME community site: http://www.knime.org/rdkit)
  6. 6. 6 Ecosystem: exactly the same implementation regardless of where you are using it from
  7. 7. 7 Releases, reproducibility, and citability ● 2 feature releases per year ● ~monthly patch releases with bug fixes ● Every release is assigned a DOI and archived on Zenodo https://zenodo.org/record/6483170
  8. 8. 8 Packaging - conda-forge: conda install -c conda-forge rdkit - pypi: pip install rdkit-pypi - npm: npm i @rdkit/rdkit - apt: apt install python3-rdkit postgresql-14-rdkit
  9. 9. 9 Sustainability: the bus problem https://commons.wikimedia.org/wiki/File:Postauto_susten.jpg
  10. 10. 10 Sustainability: the bus problem RDKit maintainers: - Greg - Brian Kelley (Relay Therapeutics) - Ricardo Rodriguez (Schrödinger) - Paolo Tosco (Novartis) Regular code contributors: - David Cosgrove - Peter Gedeck - Gareth Jones - Eisuke Kawashima - Dan Nealschneider - Sereina Riniker - Roger Sayle - Riccardo Vianello
  11. 11. The RDKit community How it started…
  12. 12. The RDKit community How it’s going…
  13. 13. Where we came from, where we’re going
  14. 14. 14 The early days ● 2000-2006: initial development work at Rational Discovery ● 2006: code open sourced and released on sourceforge.net
  15. 15. 15 Aside: some motivations for open-sourcing scientific code ● Recognition ● Helping the scientific community ● Feedback and help from others ● You get to keep using the code when you move on to your next position
  16. 16. 16 Some history ● 2000-2006: initial development work at Rational Discovery ● 2006: code open sourced and released on sourceforge.net ● 2007: First NIBR contribution (chemical reaction handling); Noel discovers the RDKit ● 2008: first POC of Java wrapper; Mac support added; SLN and Mol2 parsers ● 2009: Morgan fingerprints; switch to cmake; switch to VF2 for SSS ● 2010: PostgreSQL cartridge; First iteration of the KNIME nodes; $RDBASE/Contrib appears; SaltRemover and FunctionalGroups code ● 2011: New Java wrappers; more functionality moved to C++; InChI support; AvalonTools integration ● 2012: First UGM; Speed improvements; MCS implementation; IPython integration; “RDKit Cookbook” appears ● 2013: Move to github; Pandas integration; MMFF and Open3DAlign support; PDB support; rdkit blog started
  17. 17. 17 Some history, cont'd ● 2014: python3 support; conda integration; experimental lucene integration; MCS implementation in C++ ● 2015: new drawing code; improved canonicalization algorithm; ETKDG; reduced memory usage ● 2016: Regular patch releases; easier builds; performance improvements; KNIME nodes move to Github ● 2017: Modern C++; R-group decomposition, first GSoC participation, conda-forge packages ● 2018: CoordGen integration; molecular standardization ● 2019: Azure DevOps, substructure speedup, new molecule hashing code, Neo4J integration, new JS wrappers ● 2020: new CIP implementation, scaffold network, abbreviations, tautomer-insensitive substructure search ● 2021: rdkit-cffi, more drawing improvements, R-group decomposition improvements ● 2022: C++17, generics for searching, non-tetrahedral symmetry…
  18. 18. An aside…
  19. 19. 19 Looking forward
  20. 20. 20 Longer term RDKit objectives ● Improved support for other classes of molecules ■ Polymers ■ Organometallics ● Ensuring that the PostgreSQL cartridge is a plausible candidate for use in a corporate “data warehouse” (or whatever such things are called these days) ● Ensuring all the pieces are in place to make it easy to write a compound registration system
  21. 21. 21 Future directions: the cartridge Ensuring that the PostgreSQL cartridge is a plausible candidate for use in a corporate “data warehouse” - Integration of tautomer insensitive search - Integration of the MolStandardize code - Improvements to the chemical reaction handling - Integration of the generics for searching Further ideas - Adding some 3D search capabilities
  22. 22. 22 Future directions: registration systems First: what is a chemical registration system?
  23. 23. 23 Aside: Goals of a compound registration system We want to be able to answer these questions: - Have we seen this compound before? - Give me a key for this compound - Give me the structure for this key
  24. 24. 24 Aside: Goals of a compound registration system We want to be able to answer these questions: - Have we seen this compound before? - Give me a key for this compound - Give me the structure for this key So what do we need to be able to do? - Standardize molecules - Generate hashes/keys for standardized molecules - Store structures
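The three requirements on this slide (standardize, generate a key, store the structure) can be sketched as a toy in-memory registry. This is only an illustration under stated assumptions: `standardize` is a placeholder that trims whitespace rather than doing any chemistry, and the class name, function names, SHA-256 choice, and 16-character key length are hypothetical, not part of the RDKit.

```python
import hashlib


def standardize(structure):
    """Placeholder standardizer (an assumption): only trims whitespace.

    A real system would apply chemistry-aware cleanup here, for example
    the MolStandardize code mentioned in the slides.
    """
    return structure.strip()


def make_key(standardized):
    """Derive a short fixed-length key from the standardized structure."""
    return hashlib.sha256(standardized.encode("utf-8")).hexdigest()[:16]


class Registry:
    """Toy in-memory registry mapping key -> stored structure."""

    def __init__(self):
        self._store = {}

    def seen(self, structure):
        """Have we seen this compound before?"""
        return make_key(standardize(structure)) in self._store

    def register(self, structure):
        """Give me a key for this compound (storing it if it is new)."""
        std = standardize(structure)
        key = make_key(std)
        self._store.setdefault(key, std)
        return key

    def structure_for(self, key):
        """Give me the structure for this key."""
        return self._store[key]


registry = Registry()
key = registry.register(" CCO ")
```

Registering the same input again returns the same key, which is what lets the registry answer all three questions with plain dictionary lookups.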
  25. 25. 25 Using keys for registration Idea: use a hash to combine: - The molecular structure (via a fixed H InChI) - A stereo code - A stereo comment https://github.com/rdkit/UGM_2015/blob/8f562e70add17bab35f43823af0f03673f8a1f2d/Presentations/KeyToRegistration.GregLandrum.pdf
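A minimal sketch of the idea on this slide, combining three text fields into a single key. The SHA-1 hash, the separator character, the function name, and the example stereo code `"ACHIRAL"` are all assumptions made for illustration; the linked presentation describes the scheme that was actually used.

```python
import hashlib

# Field separator (an assumption): a control character that does not occur
# in the inputs, so ("ab", "c") and ("a", "bc") never share a payload.
SEPARATOR = "\x1f"


def registration_key(inchi, stereo_code, stereo_comment=""):
    """Combine structure, stereo code and stereo comment into one key."""
    payload = SEPARATOR.join([inchi, stereo_code, stereo_comment])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Sample input only: the standard InChI of ethanol. The slide calls for a
# fixed-H InChI, which a toolkit would generate in practice.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

key_plain = registration_key(ETHANOL, "ACHIRAL")
key_commented = registration_key(ETHANOL, "ACHIRAL", "second batch")
```

Because the comment is part of the hashed payload, two records with the same structure and stereo code still get different keys when their comments differ.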
  26. 26. 26 Future directions: registration systems Ensuring all the pieces are in place to make it easy to write a compound registration system - Improvements to MolStandardize code - Improvements to the molecular hashing code - Support for other classes of molecules
  27. 27. 27 Let’s talk about molecular identity This isn’t just a topic for standard compound registration systems.
  28. 28. 28 Molecular identity and computational questions ● Which molecules were used to generate this result? ● Have I already done a calculation using this molecule? ● Was this molecule part of my training set? All of these require us to be able to answer the question “are these two molecules the same?” Here be dragons…
  29. 29. 29 Some things making molecular identity nontrivial
  30. 30. 30 Some things making molecular identity nontrivial ● Counterions, solvents ● Resonance forms ● Charges ● Tautomers ● Stereochemistry (This is not a comprehensive list.) Sometimes we care about these differences, sometimes we don’t. It depends on the context in which we ask the question “are these two molecules the same?”
  31. 31. 31 Identity hashes for molecules Idea: convert the molecule into some form which allows us to test whether or not it’s identical to other molecules via a simple string (or numerical) comparison. What “identical” means will be determined by the identity hash used. Familiar examples: - Canonical SMILES - InChI
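As a small illustration of the canonical SMILES case, the sketch below uses the RDKit Python API (`Chem.MolFromSmiles`, `Chem.MolToSmiles`) to reduce the identity test to a string comparison. The helper name `canonical_smiles` is an assumption made for this example.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Parse a SMILES string and return the RDKit canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


# Two different ways of writing ethanol give the same canonical string.
same = canonical_smiles("OCC") == canonical_smiles("C(O)C")

# Ethanol and ethylamine are different molecules, so the strings differ.
different = canonical_smiles("CCO") != canonical_smiles("CCN")
```

Here "identical" means exactly what canonical SMILES captures; tautomers of one compound, for instance, still compare as different.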
  32. 32. 32 Contextual identity Instead of having a single key/hash for a molecule, store a collection of layers with different levels of detail/types of information. When searching, choose the layers which are relevant for the current use case ● Store molecules using some relatively lossless format (e.g. v3000 SDF) ● Use molecular hashes capturing different levels of information to establish whether or not duplicates exist Note: it’s possible to do a limited version of this via careful manipulation of InChI strings
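The layered lookup described here can be sketched with plain dictionaries: each record stores several layers, and the caller picks which layers count for the question at hand. The layer values below are copied from the ethambutol example later in this deck; the function name `same_molecule` and the use of string keys are assumptions made for illustration.

```python
# Layer values taken from the enhanced-stereochemistry slide of this deck.
PLAIN = {
    "CANONICAL_SMILES": "CC[C@@H](CO)NCCN[C@@H](CC)CO",
    "NO_STEREO_SMILES": "CCC(CO)NCCNC(CC)CO",
}
RACEMIC = {
    "CANONICAL_SMILES": "CC[C@@H](CO)NCCN[C@@H](CC)CO |&1:2,9|",
    "NO_STEREO_SMILES": "CCC(CO)NCCNC(CC)CO",
}


def same_molecule(a, b, layers):
    """Return True if records a and b agree on every requested layer."""
    return all(a[layer] == b[layer] for layer in layers)


# Context 1: stereochemistry matters, so compare the full canonical SMILES.
strict = same_molecule(PLAIN, RACEMIC, ["CANONICAL_SMILES"])

# Context 2: only the connectivity matters, so compare the no-stereo layer.
relaxed = same_molecule(PLAIN, RACEMIC, ["NO_STEREO_SMILES"])
```

The stored records never change; only the list of layers passed to the comparison does, which is the point of keeping the layers separate.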
  33. 33. 33 Some more identity hashes https://www.nextmovesoftware.com/talks/OBoyle_MolHash_ACS_201908.pdf Available in the RDKit since the 2019.09 release
  34. 34. 34 Some of the basic identity hashes in rdMolHash ● Molecular formula ● Anonymous graph ● Element graph ● Murcko scaffold ● Tautomer ● Canonical smiles There are many others
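A short sketch of calling these hashes through `rdkit.Chem.rdMolHash`. The mapping of the slide's names onto `HashFunction` members is my reading (in particular, `HetAtomTautomer` standing in for "Tautomer" is an assumption), and the helper name `basic_hashes` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import rdMolHash

# Slide names mapped onto rdMolHash.HashFunction members (an assumption
# for "tautomer", which is taken to mean HetAtomTautomer here).
HASH_FUNCTIONS = {
    "formula": rdMolHash.HashFunction.MolFormula,
    "anonymous_graph": rdMolHash.HashFunction.AnonymousGraph,
    "element_graph": rdMolHash.HashFunction.ElementGraph,
    "murcko_scaffold": rdMolHash.HashFunction.MurckoScaffold,
    "tautomer": rdMolHash.HashFunction.HetAtomTautomer,
    "canonical_smiles": rdMolHash.HashFunction.CanonicalSmiles,
}


def basic_hashes(smiles):
    """Compute each hash for one molecule, working on a fresh copy each time."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {
        name: rdMolHash.MolHash(Chem.Mol(mol), func)
        for name, func in HASH_FUNCTIONS.items()
    }


ethanol = basic_hashes("CCO")
acetaldehyde = basic_hashes("CC=O")
ethylamine = basic_hashes("CCN")
```

Ethanol and acetaldehyde share an element graph (bond orders are ignored) but differ in formula, while ethanol and ethylamine share only the anonymous graph; each hash answers a different version of "the same".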
  35. 35. 35 Hashes for registration The team at Schrödinger (Chris Von Bargen, Hussein Faara, Dan Nealschneider, Ricardo Rodriguez, Rachel Walker) have contributed a new RDKit module for calculating layered hashes which are useful for compound identity testing and registration. This will be in the 2022.09 release. Layers it currently supports: - Formula - Canonical SMILES: with and without stereo - Tautomer hash: with and without stereo - Sgroup data (for some help with polymers and things like atropisomers) - “Escape layer” (free text allowing a structure to be different even if everything else says it’s the same)
  36. 36. 36 Registration hash example {<HashLayer.CANONICAL_SMILES: 1>: 'COc1ccc2[nH]c([S@@](=O)Cc3ncc(C)c(OC)c3C)nc2c1', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C17H19N3O3S', <HashLayer.NO_STEREO_SMILES: 4>: 'COc1ccc2[nH]c(S(=O)Cc3ncc(C)c(OC)c3C)nc2c1', <HashLayer.NO_STEREO_TAUTOMER_HASH: 5>: 'CO[C]1[CH][CH][C]2[N][C]([S]([O])C[C]3[N][CH][C](C)[C](OC)[C]3C)[N][C]2[CH]1_1_0', <HashLayer.SGROUP_DATA: 6>: '[]', <HashLayer.TAUTOMER_HASH: 7>: 'CO[C]1[CH][CH][C]2[N][C]([S@@]([O])C[C]3[N][CH][C](C)[C](OC)[C]3C)[N][C]2[CH]1_1_0'}
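A dictionary like the one above can be generated with the `rdkit.Chem.RegistrationHash` module. The sketch below assumes a recent RDKit release that ships this module; the second SMILES is simply the slide's molecule with the sulfoxide stereo tag flipped, added here to show the effect of choosing a hash scheme.

```python
from rdkit import Chem
from rdkit.Chem import RegistrationHash
from rdkit.Chem.RegistrationHash import HashLayer, HashScheme

# The molecule from the slide, and the same molecule with the sulfoxide
# stereo tag flipped (the second string is an addition for this example).
MOL_A = "COc1ccc2[nH]c([S@@](=O)Cc3ncc(C)c(OC)c3C)nc2c1"
MOL_B = "COc1ccc2[nH]c([S@](=O)Cc3ncc(C)c(OC)c3C)nc2c1"


def layers_for(smiles):
    """Return the dictionary of registration layers for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return RegistrationHash.GetMolLayers(mol)


layers_a = layers_for(MOL_A)
layers_b = layers_for(MOL_B)

# Hash over all layers: sensitive to the stereo difference.
full_a = RegistrationHash.GetMolHash(layers_a)
full_b = RegistrationHash.GetMolHash(layers_b)

# Hash over the stereo-insensitive subset of layers.
flat_a = RegistrationHash.GetMolHash(layers_a, HashScheme.STEREO_INSENSITIVE_LAYERS)
flat_b = RegistrationHash.GetMolHash(layers_b, HashScheme.STEREO_INSENSITIVE_LAYERS)
```

The two stereoisomers get different full hashes but the same stereo-insensitive hash, so one set of stored layers serves both kinds of query.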
  37. 37. 37 Handling tautomers {<HashLayer.CANONICAL_SMILES: 1>: 'CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2c[nH]c3ncc(-c4ccc(Cl)cc4)cc23)c1F', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C23H18ClF2N3O3S', … <HashLayer.TAUTOMER_HASH: 7>: 'CCCS([O])([O])[N][C]1[CH][CH][C](F)[C]([C]([O])[C]2[CH][N][C]3[N][CH][C]([C]4[CH][CH][C](Cl)[CH][CH]4)[CH][C]32)[C]1F_2_0'} {<HashLayer.CANONICAL_SMILES: 1>: 'CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2cnc3[nH]cc(-c4ccc(Cl)cc4)cc2-3)c1F', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C23H18ClF2N3O3S', … <HashLayer.TAUTOMER_HASH: 7>: 'CCCS([O])([O])[N][C]1[CH][CH][C](F)[C]([C]([O])[C]2[CH][N][C]3[N][CH][C]([C]4[CH][CH][C](Cl)[CH][CH]4)[CH][C]32)[C]1F_2_0'}
  38. 38. 38 Handling atropisomers Structures from: https://doi.org/10.1016/j.xphs.2021.10.011
  39. 39. 39 Handling atropisomers Structures from: https://doi.org/10.1016/j.xphs.2021.10.011 The bold and hashed bonds are just drawing features and don’t survive translation to things like CXSMILES or mol files. But we can use S groups to indicate the stereochemistry
  40. 40. 40 Handling atropisomers Structures from: https://doi.org/10.1016/j.xphs.2021.10.011 {<HashLayer.CANONICAL_SMILES: 1>: 'COc1cc2ncc3c(c2cc1-c1cn(C)nc1C)n(-c1c(F)cncc1OC)c(=O)n3C', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C23H21FN6O3', … <HashLayer.SGROUP_DATA: 6>: '[{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": "M"}]', …} {<HashLayer.CANONICAL_SMILES: 1>: 'COc1cc2ncc3c(c2cc1-c1cn(C)nc1C)n(-c1c(F)cncc1OC)c(=O)n3C', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C23H21FN6O3', … <HashLayer.SGROUP_DATA: 6>: '[{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": "P"}]', …}
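The Sgroup layer shown above is a JSON string, so the part that distinguishes the two atropisomers can be pulled out with the standard library. This sketch uses the two layer values from the slide as input; the helper name `atropisomer_labels` is an assumption made for illustration.

```python
import json

# SGROUP_DATA layer values copied from the two records on the slide.
SGROUP_M = '[{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": "M"}]'
SGROUP_P = '[{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": "P"}]'


def atropisomer_labels(sgroup_layer):
    """Return the values of all 'atropisomer' data Sgroups in a layer."""
    return [
        entry["value"]
        for entry in json.loads(sgroup_layer)
        if entry.get("fieldName") == "atropisomer"
    ]


labels_m = atropisomer_labels(SGROUP_M)
labels_p = atropisomer_labels(SGROUP_P)

# The canonical SMILES layers are identical, so this layer is the only
# place where the two records differ.
layers_differ = SGROUP_M != SGROUP_P
```

Since the layer is compared as a string during hashing, the M and P records end up with different registration hashes even though every structural layer matches.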
  41. 41. 41 Handling polymers {<HashLayer.CANONICAL_SMILES: 1>: '*c1cnc(*)s1', …, <HashLayer.SGROUP_DATA: 6>: '[{"type": "SRU", "atoms": [1, 2, 3, 4, 6], "bonds": [[0, 1], [4, 5]], "index": 1, "connect": "HT", "label": "n"}]', …} {<HashLayer.CANONICAL_SMILES: 1>: '*c1cnc(*)s1', …, <HashLayer.SGROUP_DATA: 6>: '[{"type": "SRU", "atoms": [1, 2, 3, 4, 6], "bonds": [[0, 1], [4, 5]], "index": 1, "connect": "HH", "label": "n"}]', …}
  42. 42. 42 Handling enhanced stereochemistry Ethambutol: these two structures describe the same racemic mixture
  43. 43. 43 Handling enhanced stereochemistry {<HashLayer.CANONICAL_SMILES: 1>: 'CC[C@@H](CO)NCCN[C@@H](CC)CO', …, <HashLayer.NO_STEREO_SMILES: 4>: 'CCC(CO)NCCNC(CC)CO', …} {<HashLayer.CANONICAL_SMILES: 1>: 'CC[C@@H](CO)NCCN[C@@H](CC)CO |&1:2,9|', …, <HashLayer.NO_STEREO_SMILES: 4>: 'CCC(CO)NCCNC(CC)CO', …} We get the same hash if the molecule is drawn with wedged bonds.
  44. 44. 44 Using the escape layer Suppose I start with the racemic mixture, run it through a chiral column, and collect the two fractions. I want to register the two fractions separately without determining the absolute stereochemistry.
  45. 45. 45 Using the escape layer {<HashLayer.CANONICAL_SMILES: 1>: 'CC[C@@H](CO)NCCN[C@@H](CC)CO |o1:2,9|', <HashLayer.ESCAPE: 2>: 'first fraction', …} {<HashLayer.CANONICAL_SMILES: 1>: 'CC[C@@H](CO)NCCN[C@@H](CC)CO |o1:2,9|', <HashLayer.ESCAPE: 2>: 'second fraction', …}
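To see why the escape layer separates otherwise identical records, here is a minimal sketch that hashes a dictionary of layers. It is not the RDKit's implementation: the JSON serialization, the SHA-1 choice, and the function name `layered_hash` are assumptions made for illustration, and only two of the layers are included.

```python
import hashlib
import json


def layered_hash(layers):
    """Hash a dict of layers; key order does not affect the result."""
    payload = json.dumps(layers, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Structure layer copied from the slide; identical for both fractions.
STRUCTURE = {"CANONICAL_SMILES": "CC[C@@H](CO)NCCN[C@@H](CC)CO |o1:2,9|"}

first = layered_hash({**STRUCTURE, "ESCAPE": "first fraction"})
second = layered_hash({**STRUCTURE, "ESCAPE": "second fraction"})
unlabeled = layered_hash({**STRUCTURE, "ESCAPE": ""})
```

All three records share the same structure layer, yet each gets its own hash because the free-text escape value is part of what is hashed.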
  46. 46. 46 Aside: using the escape layer for comp chem Suppose I want to store multiple conformers/poses of the same molecule {… <HashLayer.ESCAPE: 2>: 'conformer 1', …} {… <HashLayer.ESCAPE: 2>: 'conformer 2', …}
  47. 47. 47 Wrapping up: molecular identity ● For many computational tasks we want to be able to figure out whether or not we have seen/used a particular molecule ● The definition of “same” for molecules depends on the context/question being asked ● Layered registration hashes make it easy (and cheap) to store sets of molecules and answer the context-dependent “are these the same?” question
  48. 48. 48 Thanks!
