Open Source Cheminformatics
1. Open Source Cheminformatics
Tools and Data
Rajarshi Guha
School of Informatics, Indiana University
Bio IT World
29th April, 2009
Sections: Open Source, Open Standards, Open Data
2. Open Source Cheminformatics
Been around for some time, niche field
OSS snippets/code based on closed source APIs versus fully open source tools
Why use OSS cheminformatics?
The case for is articulated nicely by Delano
The case against is articulated nicely by Stahl
Goal
Not to argue for or against Open Source
Show what’s there, how it fits in with other technologies
Delano, W. L., Drug Discovery Today, 2005, 10, 213–217
Stahl, M. T., Drug Discovery Today, 2005, 10, 219–222
4. Cheminformatics Software
The ecosystem is composed of developer- and
user-oriented software
Most applications will depend on lower level functionality
Choice of toolkit influences
robustness
performance
ease of distribution
integration with other libraries
Won’t be talking about user-oriented software
5. The Toolkit Ecosystem
Timeline of cheminformatics toolkits* (*runs on Unix and supports SMILES and SMARTS)
[Figure: timeline from 1995 and earlier through 2008 showing Daylight (C and Fortran), DayPerl, DaySWIG, PyDaylight (higher-level Python API), frowns (Python; API based on PyDaylight), Babel, OELib, OEChem, OpenBabel, Pybel (higher-level Python API), RDKit, cinfony (abstraction API), JOELib (Java; API based on OELib) and CDK (Java), with annotations for language bindings, wrappers, third-party packages and developers who moved between projects]
Guidelines
OEChem and its sister libraries for molecular modeling are fast, flexible, powerful and complete (except for fingerprints). It is designed for high-end users who know the nuances of cheminformatics. Expensive. My choice for C++, Java and Python.
CDK is the toolkit to use if you are on the JDK and OEChem is too pricey. It has a strong structure and structural biology component, close ties with 2D and 3D display programs, and integration with Bioclipse, Taverna, and Knime.
RDKit is relatively new and with a small user community. The software engineering skills are the best of the free projects. Includes 2D layout, 2D→3D, QSAR, forcefield, shape and machine learning components. Worth a look!
OpenBabel is the most community driven. Its strength is file format conversion, for both small molecules and biomolecules. It is expanding towards more modeling support, including several forcefield implementations. Often used as a test-bed for new algorithms. Code quality is variable, reflecting the diverse contributor base.
Do not use the Daylight toolkit for new code. It is expensive, there's very little new development, and you can get nearly all of its functionality elsewhere.
Andrew Dalke’s EuroQSAR 2008 poster
6. What’s available?
CDK (Java)
OpenBabel (C++)
RDKit (C++)
Licensing varies
A large degree of overlap
7. Toolkits - A Comparison
Feature     CDK        OpenBabel   RDKit
License     LGPL       GPL         new BSD
Language    Java       C++         C++ / Python
SLOC        188,554    194,358     173,219
Features also compared (per-toolkit entries not preserved in this transcript):
Fingerprints (hashed, substructure)
File format support
Aromaticity models
Stereochemistry
Canonicalization
Descriptors
2D coordinate generation
3D coordinate generation
2D depictions
Conformer generation
Rigid alignment
SMARTS searching
Pharmacophore searching
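As a rough illustration of what the "hashed fingerprint" feature in the comparison refers to, the sketch below enumerates short linear atom paths in a molecular graph, hashes each path to a bit position, and compares two molecules with the Tanimoto coefficient. It is a toy written for this transcript, not the algorithm used by CDK, OpenBabel or RDKit: the function names, the 64-bit length, the three-atom path limit and the choice of CRC32 as the hash are all assumptions, and bond orders are ignored.

```python
import zlib


def enumerate_paths(symbols, bonds, max_atoms=3):
    """Collect linear atom paths of up to max_atoms atoms as strings."""
    adjacency = {i: [] for i in range(len(symbols))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    paths = set()

    def walk(path):
        # Store one direction-independent spelling of each path.
        forward = "-".join(symbols[i] for i in path)
        backward = "-".join(symbols[i] for i in reversed(path))
        paths.add(min(forward, backward))
        if len(path) < max_atoms:
            for neighbor in adjacency[path[-1]]:
                if neighbor not in path:
                    walk(path + [neighbor])

    for start in adjacency:
        walk([start])
    return paths


def hashed_fingerprint(symbols, bonds, n_bits=64, max_atoms=3):
    """Map each path to a bit position; returns the set of bits that are on."""
    return {
        zlib.crc32(path.encode("ascii")) % n_bits
        for path in enumerate_paths(symbols, bonds, max_atoms)
    }


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two bit sets (1.0 when both are empty)."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0


# Heavy-atom graphs of ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol_fp = hashed_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol_fp = hashed_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
similarity = tanimoto(ethanol_fp, propanol_fp)
```

Because different paths can hash to the same bit, a fingerprint like this can only rule out a substructure match, never confirm one; real toolkits follow the screen with a full graph match.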
8. CDK Overview
Category: functionality
Input / Output: Support for various formats including SDF, SMILES, CML, PDB, InChI, PubChem XML formats; canonical SMILES support; pharmacophore serialization
Visualization: 2D coordinate generation and depiction
Properties: Fingerprinting; Gasteiger-Marsili and MMFF94 partial charges; atom, bond and molecular descriptors; NMR prediction via HOSE codes; aromaticity perception
Graph: Isomorphism and sub-graph isomorphism detection, SMARTS support, ring perception, pharmacophore searching. A variety of graph theoretical algorithms (including traversal, shortest paths, distance matrix)
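To make the graph-theoretical items concrete, here is a minimal sketch of a topological distance matrix computed by breadth-first traversal, where each entry is the number of bonds on the shortest path between two atoms. It is written in plain Python for illustration and does not use or mirror the CDK API; the function name and the edge-list input format are assumptions.

```python
from collections import deque


def distance_matrix(n_atoms, bonds):
    """Return the topological distance matrix (bond counts) of a molecular graph."""
    adjacency = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    matrix = []
    for start in range(n_atoms):
        dist = [-1] * n_atoms  # -1 marks atoms not reachable from start
        dist[start] = 0
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if dist[neighbor] == -1:
                    dist[neighbor] = dist[atom] + 1
                    queue.append(neighbor)
        matrix.append(dist)
    return matrix


# Heavy-atom graph of ethanol (C-C-O), atoms indexed 0, 1, 2.
ethanol = distance_matrix(3, [(0, 1), (1, 2)])
# Cyclopropane: a three-membered ring.
cyclopropane = distance_matrix(3, [(0, 1), (1, 2), (2, 0)])
```

In ethanol the carbon at index 0 and the oxygen are two bonds apart, while every pair of atoms in cyclopropane is one bond apart. Matrices of this kind are a common input to topological descriptors.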
9. Data Visualization
Lots of OSS molecular visualization tools available
Needs to be combined with data analysis tools
R is great for analytics, has powerful graphics
Not cheminformatics aware, not user-friendly
Possibilities
Rattle
GGobi
Processing - developer oriented, good for ad-hoc,
multiple data type visualizations
Bioclipse
11. Open Source Cheminformatics Workflows
Requirements
Core cheminformatics
Analytics
Database backends
Integration
Can it be done?
Yes, in various ways
For the non-expert user, pipeline tools provide a nice
platform for integrating all the above
For expert users, it’s useful to go lower level
Integration between R and the CDK provides a
cheminformatics enhanced modeling platform
12. CDK and R
R is oriented towards statistical modeling and computations
Cheminformatics agnostic
rcdk integrates the CDK into the R environment
Read and process molecular structure information
Descriptors
Fingerprints
General molecule manipulation
Provides access to CDK functionality in idiomatic R
http://cran.r-project.org/web/packages/rcdk/index.html
13. Accessing Chemical Information from R
rcdk is good for processing and manipulating molecules in R
Also useful to be able to access chemical information
directly from databases
rpubchem provides access to PubChem compound,
substance and bioassay collections
By compound, substance, assay IDs
By keyword searches
Packages assay information into a data.frame and
includes associated metadata
Supplements the rcdk package
http://cran.r-project.org/web/packages/rpubchem/index.html
14. Standards for Cheminformatics?
Open standards/specifications help everybody
Most refer to file formats
CML, JCAMP-DX
InChI, AniML
Who sets them? How are they constructed?
Are there usage restrictions?
15. Standards for Cheminformatics
Open definition
Public participation in defining the standard
Mailing lists, wikis for transparency
Possibility of forking the standard
FlexMol, OpenSmiles, JCAMP-DX
Open use
No royalties for usage
No patents, trademarks, copyrights etc
SMILES, SDF, InChI, SLN
16. Standards for Cheminformatics
De facto standard
In wide use, few or no variants
Data exchange is easy and reliable
SDF, SMILES, PDB
Formal standard
Endorsed by some sort of recognized group, academic, or
government body
InChI, OpenSMILES, JCAMP-DX
17. The Blue Obelisk
Umbrella for a variety of OSS projects
Covers code, data, standards
Open to everybody
OpenSMILES is a recent project aiming to provide an
explicit description of the SMILES grammar
http://blueobelisk.sourceforge.net/ http://www.opensmiles.org/
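One practical benefit of an explicitly described grammar is that tools can be written directly against it. The sketch below is a small tokenizer for a subset of the SMILES symbol classes (bracket atoms, organic-subset atoms, aromatic atoms, bonds, branches, ring closures, the dot and the wildcard). The token names and the `tokenize` function are illustrative assumptions and are not taken from the OpenSMILES document; the code only splits a string into tokens and does not check that the result is a chemically or syntactically valid SMILES.

```python
import re

# Token classes for a subset of SMILES; order matters (e.g. "Cl" before "C").
TOKEN_SPEC = [
    ("bracket_atom", r"\[[^\]]+\]"),
    ("organic_atom", r"Cl|Br|[BCNOPSFI]"),
    ("aromatic_atom", r"[bcnops]"),
    ("wildcard", r"\*"),
    ("bond", r"[-=#$:/\\]"),
    ("branch", r"[()]"),
    ("ring_closure", r"%\d\d|\d"),
    ("dot", r"\."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(smiles):
    """Split a SMILES string into (token_class, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {smiles[pos]!r} at position {pos}"
            )
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


acetyl_chloride = tokenize("CC(=O)Cl")
benzene = tokenize("c1ccccc1")
```

A full parser would go on to match ring-closure digits in pairs, balance branches and interpret the contents of bracket atoms, which is exactly the kind of detail a written grammar pins down.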
18. The Pistoia Alliance
. . . established to streamline non-competitive elements of the pharmaceutical
drug discovery workflow by the specification
of common business terms, relationships and
processes . . .
An opportunity for the Open Source cheminformatics
community to link with industrial users
ontology developments
web service interfaces
database schema
http://pistoiaalliance.sourceforge.net/
20. The Distributed Future
Web services, cloud computing, . . .
The OSS cheminformatics ecosystem integrates with these
scenarios very easily
Cost and licenses are one aspect
Redundancy is a big benefit
Data / functionality mashups can lead to innovative
solutions
Cheminformatics web services
CDK based services (hosted at various places)
Daylight web services
NCI, ChemSpider
21. There’s Data in Them Thar Internets
Many significant public resources of chemical
information
PubChem
ChemSpider
NMRShiftDB
Use anything to access them
Does OSS have a role to play here?
Open Access is likely more important in this case
22. Data Access
Good to have access to data in open fashion
What about adding value to the data?
Could replicate databases
Easier if the data source is built on an OSS stack
Raw data dumps obviate this need
But open, well-defined APIs are preferable
Avoiding hosting/update hassles
Easier to mash multiple data sources
Made easier when data sources support standards
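As an illustration of what access through an open, well-defined API can look like, the sketch below builds a request URL for PubChem's PUG REST interface and parses a response of the shape that service returns for property queries. PUG REST is not mentioned in the slides and is used here only as an example; the helper names are hypothetical, the sample payload is hand-written for a single compound (aspirin, CID 2244), and no network request is made. Check the URL pattern and response layout against the current PubChem documentation before relying on them.

```python
import json
from urllib.parse import quote

# Base URL of PubChem's PUG REST service (assumed; verify against PubChem docs).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties):
    """Build a PUG REST URL requesting the given properties for one compound ID."""
    prop_list = quote(",".join(properties), safe=",")
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{prop_list}/JSON"


def parse_properties(payload):
    """Index the rows of a property-table response by compound ID."""
    rows = json.loads(payload)["PropertyTable"]["Properties"]
    return {row["CID"]: row for row in rows}


url = property_url(2244, ["MolecularFormula"])

# Hand-written sample in the assumed response layout (not fetched from the service).
sample_payload = (
    '{"PropertyTable": {"Properties": '
    '[{"CID": 2244, "MolecularFormula": "C9H8O4"}]}}'
)
records = parse_properties(sample_payload)
```

Fetching `url` with any HTTP client and passing the body to `parse_properties` would complete the round trip; keeping URL construction and parsing separate makes it easier to combine several data sources in one script.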
23. Benchmark Datasets
Benchmarking is vital
Some sub-fields have collections of benchmark datasets
Docking (DUD)
Virtual screening (MUV)
No general datasets or attempts for benchmarking core
cheminformatics operations
Initial attempt at cheminfbenchmark on GitHub
Restricted to Java libraries at this point (CDK, MX)
Uses datasets taken from PubChem
Fingerprinting, SD parsing, SMARTS parsing,
substructure searching
Rohrer, S. G. et al., J. Chem. Inf. Model., 2009, 49, 169–184
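A benchmark of core operations needs little more than a fixed dataset and a repeatable timer. The sketch below shows one possible harness in plain Python; it is not the cheminfbenchmark code, and the `benchmark` function, its result keys and the stand-in `count_letters` operation are all hypothetical. A real run would substitute a toolkit call such as SMARTS parsing or fingerprinting and a dataset drawn from PubChem.

```python
import statistics
import time


def benchmark(operation, dataset, repeats=3):
    """Time `operation` over every record in `dataset`, repeated several times."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for record in dataset:
            operation(record)
        timings.append(time.perf_counter() - start)
    return {
        "records": len(dataset),
        "repeats": repeats,
        "best_seconds": min(timings),
        "median_seconds": statistics.median(timings),
    }


def count_letters(smiles):
    """Stand-in operation: counts alphabetic characters, not a real atom count."""
    return sum(ch.isalpha() for ch in smiles)


# Tiny illustrative dataset; real benchmarks need thousands of structures.
dataset = ["CCO", "c1ccccc1", "CC(=O)O"]
result = benchmark(count_letters, dataset, repeats=5)
```

Reporting the best and median of several repeats reduces the influence of warm-up and background load, which matters when comparing libraries that run on a virtual machine such as the JVM.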
24. Open Source Open Notebook Science
ONS is a paradigm whereby some or all experimental
results are published in an open form with little or no lag
time
Championed by Jean-Claude Bradley, Cameron Neylon,
Raf Aerts and others
Closed source versus open source cheminformatics
doesn’t necessarily hinder ONS practice
But open source cheminformatics makes life easier
25. ONS Solubility Challenge
Led by Jean-Claude Bradley (Drexel U.)
Solubility measurements in various non-aqueous solvents
Part of a larger project to identify anti-malarial
compounds
Very distributed
Multiple groups generating and modeling data
Data hosted on wikis and Google spreadsheets
Multiple views, enhanced via cheminformatics web
services
26. ONS Solubility Challenge
[Figure: components of the ONS Solubility Challenge: Data Generation, Data Storage, Data Views, Web Services, Data Modeling]
27. What’s Holding OSS Cheminformatics Back?
Niche field
Comprehensiveness, polish
Funding
28. Conclusions
The ecosystem is alive with activity
Distributed systems are important - OSS
cheminformatics fits in nicely
OSS projects should coordinate with users
industrial and academic
Quality and effectiveness will be the final arbiter