1. Open-source from/in the enterprise: the RDKit
Gregory Landrum
NIBR Informatics
Novartis Institutes for BioMedical Research, Basel, Switzerland
2. Outline
§ What is the RDKit?
§ RDKit integration with other open-source projects
• Knime
• PostgreSQL
• IPython
• Pandas
• Lucene
§ RDKit in NIBR, some case studies
3. RDKit: What is it?
§ Open-source C++ toolkit for cheminformatics
§ Wrappers for Python (2.x), Java, C#
§ Functionality:
• 2D and 3D molecular operations
• Descriptor generation for machine learning
• PostgreSQL database cartridge for substructure and similarity searching
• Knime nodes
• IPython integration
• Lucene integration (experimental)
• Supports Mac/Windows/Linux
§ Releases every 6 months
§ business-friendly BSD license
§ Code: https://github.com/rdkit
§ http://www.rdkit.org
4. The community
§ Mailing lists hosted at sourceforge: https://sourceforge.net/p/rdkit/
mailman/
§ Active participants from academia, small and large pharma, software
companies, and service providers
§ 30+ attendees at each of the two user group meetings
5. Some features
§ Input/Output: SMILES/SMARTS, SDF, TDT, PDB,
SLN [1], Corina mol2 [1]
§ “Cheminformatics”:
• Substructure searching
• Canonical SMILES
• Chirality support (i.e. R/S or E/Z labeling)
• Chemical transformations (e.g. remove matching
substructures)
• Chemical reactions
§ 2D depiction, including constrained depiction
§ 2D->3D conversion/conformational analysis via
distance geometry
§ UFF and MMFF94 implementation for cleaning up
structures
§ Fingerprinting: Daylight-like, atom pairs, topological
torsions, Morgan algorithm, “MACCS keys”, etc.
§ Similarity/diversity picking
§ 2D pharmacophores [1]
§ Gasteiger-Marsili charges
§ Hierarchical subgraph/fragment analysis
§ Bemis and Murcko scaffold determination
§ RECAP and BRICS implementations
§ Multi-molecule maximum common substructure
§ Feature maps
§ Shape-based similarity
§ Fraggle similarity (from GSK)
§ Molecule-molecule alignment
§ Open3DAlign implementation
§ Integration with PyMOL for 3D visualization
§ Functional group filtering
§ Salt stripping
§ Molecular descriptor library:
Topological (κ3, Balaban J, etc.), Compositional (Number
of Rings, Number of Aromatic Heterocycles, etc.),
EState, SlogP/SMR (Wildman and Crippen approach),
“MOE like” VSA descriptors, Feature-map vectors
§ Machine Learning:
• Clustering (hierarchical)
• Information theory (Shannon entropy, information
gain, etc.)
§ Tight integration with the IPython notebook and
pandas
§ Integration with the InChI library
[1] These implementations are functional but are not necessarily
the best, fastest, or most complete.
6. The contrib dir
§ LEF (Anna Vulpetti, NIBR): Local Environment of Fluorine
§ PBF (Nicholas Firth, ICR): Plane of best fit descriptor
§ SA_Score (Peter Ertl, NIBR): synthetic-accessibility score
§ fraggle (Jameed Hussain, GSK): fragment-based similarity
§ mmpa (Jameed Hussain, GSK): molecular matched pairs
§ pzc (Paul Czodrowski, Merck KGaA): tools for building and validating
classifiers
§ ConformerParser (Sereina Riniker, NIBR): parser for Amber trajectory
files
7. C++ :
Core data structures and algorithms
Postgre
SQL
Java
SWIG
Python
Boost.Python
Knime
What is this all about?
script
inter-
active
Exact same algorithms/implementations accessible from
many different endpoints
C#
App
8. Knime integration
§ Open-source RDKit-based nodes for Knime providing cheminformatics
functionality
+
§ Trusted nodes distributed from
knime community site
§ Work in progress: more nodes being
added (new wizard makes it easy)
12. PostgreSQL integration
§ PostgreSQL (http://www.postgresql.org): a robust, flexible, and
extensible relational open-source database. Rich collection of
extensions available
§ RDKit “cartridge”:
• Fast substructure and similarity search
• Fingerprints (count-based and bit-vector):
Morgan (ECFP-like), FeatMorgan (FCFP-like), RDKit (Daylight like), atom pair,
topological torsion, MACCS
• Standard molecule properties and descriptors
§ Basis for myChEMBL (http://chembl.blogspot.co.uk/2013/10/chembl-
virtual-machine-aka-mychembl.html) Ochoa, R., Davies, M., Papadatos, G.,
Atkinson, F., & Overington, J. P. (2014). myChEMBL: a virtual machine implementation of
open data and cheminformatics tools. Bioinformatics, 30(2), 298–300.
+
17. IPython notebok integration
§ IPython: a very powerful interactive shell for python
http://www.ipython.org
§ IPython notebook: IPython in the browser, with graphics
• combines code and output in one place
• great tool for reproducible research
• Example notebook with graphics.
§ RDKit integration:
• Display molecules, substructure matches, reactions, graphics from PyMOL
+
21. Pandas integration
§ Pandas: library for working with data tables in Python. Integrates well
with matplotlib and ipython
http://pandas.pydata.org/
§ RDKit integration:
• Load smiles tables or SD files into Pandas data tables
• Adds molecule columns to existing tables with smiles/SD columns
• Enables substructure filters on tables
• Integration with IPython notebook to render molecules
+
23. Lucene integration
§ Still in the experimental stage
§ Adds substructure search functionality with fingerprint screenout to
Lucene
§ Includes demo app for testing
+
24. RDKit in NIBR
§ Extensive use by CADD, informaticians, and IT
§ Lots of convenience code/wrappers for accessing internal data sources
and tools
§ Combined with the Avalon toolkit (another NIBR-supported open-
source project), provides the underpinning for many of our global
chemistry-based applications
+
25. The Avalon toolkit
§ C/Java cheminformatics toolkit
§ Primary author: Bernd Rohde (NIBRIT Basel)
§ http://sourceforge.net/projects/avalontoolkit/
§ Functionality:
• Canonical SMILES
• Avalon fingerprint (highly optimized substructure fingerprint)
• Molecular standardization (STRUCHK)
• 2D Coordinate generation
• Tomcat webapp for 2D rendering
§ The RDKit has (optional) Python bindings for much of the functionality
+
26. RDKit in NIBR
Case study 1: CIx Framework
§ “Service bus” for cheminformatics/CADD services
§ Handles format conversions for input/output automatically
i.e. callers can provide SMILES input to a service/model wants CTABs with 3D
coordinates
§ Supports versioning of models/services
§ Tight integration with scientific tools (e.g. Tibco Spotfire, Knime, Instant
JChem, etc.)
§ Enables trivial addition of “chemical intelligence” to web apps
§ Makes it easy to globally deploy models: once a new model/service (or
new version of a model/service) is registered with the Framework, it is
instantly globally accessible
+
27. CIx Framework architecture
Translation service
- molecule format conversion
- name lookup
XML File exchange
between engine and the
Models
Database to store
Model information
Model registration and
Request service
Web Model
Registration
Portal Front
end
Cix Tools Framework:
Cix Tools Web
Service
-SOAP
-REST
Model
Script
Model
Model
Script
Model
Model
Script
Model
Model
Script
Model
CIX Tools Engine
Data
In one of the following
formats:
- TSV/CSV File
- SMILES/CPD_NO
- SD-File
- DART query
XML File exchange
between engine and the
Translation service
Get the Model info from the Database
Client
- web app
- KNIME
- Spotfire
- IJC
- Python
Java/Tomcat
Python/Django
Geographically diverse servers
Most models are Python/Django
+
28. RDKit in NIBR
Case study 2: Small-Molecule Registration
§ Internally developed web application for compound registration
§ C#-based web services writing to Oracle
§ RDKit + Avalon toolkit for structure standardization
§ RDKit + InChI used for structure-key calculation
§ Calls out to CIx Framework for standard computed properties
§ Independent (but validated) Python implementation of standardization
and structure-key calculation for standalone use
+
29. RDKit in NIBR
Case study 3: QSAR Toolkit
§ Descriptor calculator providing access to all available internal
descriptors
§ Tools for pulling assay data from our data warehouse
§ Standardized model-building
§ Standardized reporting for evaluation and peer review
§ Packaging for deployment via CIx Framework
§ Model Watchdog:
Pulls most recent data, generates predictions, creates report showing evolution
of model accuracy over time
+
30. RDKit in NIBR
Case study 4: Similarity Server
§ Central PostgreSQL database with easily available compounds
• in-house available
• available from reliable vendors
§ Kept up-to-date
§ Substructure search
§ Similarity search with various fingerprints:
• Avalon
• Morgan2, Morgan3, FeatMorgan2
• Atom Pairs, Topological Torsions
§ Web services interface
§ Available to chemists via one of their standard desktop tools
+
32. Acknowledgements
§ General:
• Remy Evard (NIBR/Informatics)
• Richard Lewis (NIBR/GDC)
• Tom Digby (NIBR/Legal)
• Peter Gedeck (NIBR/GDC)
• Nik Stiefl (NIBR/GDC)
§ RDKit Community
• Roger Sayle (NextMove): PDB Parser
• Andrew Dalke (Dalke Scientific): FMCS
• Paolo Tosco (University of Turin):
MMFF94, Open3DAlign
• Jameed Hussain (GSK): Fraggle,
mmpa
§ Pandas, scikit-learn:
• Sereina Riniker (NIBR/Informatics)
• Nikolas Fechner (NIBR/Informatics)
http://www.rdkit.org
§ Knime:
• Manuel Schwarze (NIBR/Informatics)
• Thorsten Meinl (knime.com)
• Bernd Wiswedel (knime.com)
§ SMR
• Thomas Mueller (NIBR/Informatics)
• Thomas Veith (NIBR/Informatics)
• Dave Cotter (NIBR/Informatics)
§ QSAR Toolkit:
• Peter Gedeck (NIBR/GDC)
• Nikolas Fechner (NIBR/Informatics)
§ CIx Framework
• Sandra Mueller (NIBR/Informatics)
• Joerg Muehlbacher (NIBR/CPC)
• Riccardo Vianello (NIBR/Informatics)
§ NIBR Open Source
• Ken Robbins (NIBR/Informatics)
• Dennis Jen (NIBR/Informatics)
• Mark Schreiber (NIBR/Informatics)
33. Advertising
33
3rd RDKit User Group Meeting
22-24 October 2014
Merck KGaA, Darmstadt, Germany
Talks, “talktorials”, lightning talks, social activities, and a hackathon on
the 24th.
Registration: http://goo.gl/z6QzwD
Full announcement: http://goo.gl/ZUm2wm
We’re looking for speakers. Please contact greg.landrum@gmail.com