Open Source Cheminformatics
 


Upcoming presentation at BioIT world


    Open Source Cheminformatics: Presentation Transcript

    • Open Source Cheminformatics: Tools and Data. Rajarshi Guha, School of Informatics, Indiana University. Bio IT World, 29th April, 2009.
    • Open Source Cheminformatics: Been around for some time, but a niche field. OSS snippets/code based on closed source APIs versus fully open source tools. Why use OSS cheminformatics? Articulated nicely by DeLano; the reverse is also articulated nicely by Stahl. Goal: not to argue for or against Open Source, but to show what's there and how it fits in with other technologies. References: DeLano, W. L., Drug Discovery Today, 2005, 10, 213–217; Stahl, M. T., Drug Discovery Today, 2005, 10, 219–222.
    • Open Source Cheminformatics Software: The ecosystem is composed of developer- and user-oriented software. Most applications will depend on lower-level functionality. Choice of toolkit influences robustness, performance, ease of distribution, and integration with other libraries. Won't be talking about user-oriented software.
    • The Toolkit Ecosystem: Timeline of cheminformatics toolkits, from 1995 and earlier through 2008, limited to toolkits that run on Unix and support SMILES and SMARTS. From Andrew Dalke's EuroQSAR 2008 poster.
      Timeline entries: Daylight (C and Fortran); DayPerl; DaySWIG (Tcl, Python and more); PyDaylight (higher-level Python API); frowns (Python; API based on PyDaylight); Babel (not a library); OELib (OBabel) and OEChem (+Ogham & Lexichem), annotated C++, +Python, +Java; OpenBabel (+Python, Perl; +Java, Ruby); Pybel (higher-level Python API); RDKit (C++/Python; internal library, then public release on Sourceforge); cinfony (abstraction API); JOELib (for Java; API based on OELib); CDK (Java; part of JChemDraw).
      Legend: is a wrapper; developer moved between projects; third-party package; accessible from the C version of Python; accessible from the Java version of Python (Jython).
      Guidelines:
      OEChem and its sister libraries for molecular modeling are fast, flexible, powerful and complete (except for fingerprints). It is designed for high-end users who know the nuances of cheminformatics. Expensive. My choice for C++, Java and Python.
      CDK is the toolkit to use if you are on the JDK and OEChem is too pricey. It has a strong structure and structural biology component, close ties with 2D and 3D display programs, and integration with Bioclipse, Taverna, and Knime.
      RDKit is relatively new and with a small user community. The software engineering skills are the best of the free projects. Includes 2D layout, 2D→3D, QSAR, forcefield, shape and machine learning components. Worth a look!
      OpenBabel is the most community driven. Its strength is file format conversion, for both small molecules and biomolecules. It is expanding towards more modeling support, including several forcefield implementations. Often used as a test-bed for new algorithms. Code quality is variable, reflecting the diverse contributor base.
      Do not use the Daylight toolkit for new code. It is expensive, there's very little new development, and you can get nearly all of its functionality elsewhere.
    • What's available? CDK (Java), OpenBabel (C++), RDKit (C++). Licensing varies. A large degree of overlap.
    • Toolkits - A Comparison
      Feature    CDK        OpenBabel   RDKit
      License    LGPL       GPL         new BSD
      Language   Java       C++         C++ / Python
      SLOC       188,554    194,358     173,219
      Features also compared (the per-toolkit marks are missing from this transcript): fingerprints (hashed, substructure), file format support, aromaticity models, stereochemistry, canonicalization, descriptors, 2D coordinate generation, 3D coordinate generation, 2D depictions, conformer generation, rigid alignment, SMARTS searching, pharmacophore searching.
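The hashed fingerprints listed in the comparison can be illustrated with a toy sketch. This is not the algorithm of CDK, OpenBabel, or RDKit: it enumerates linear atom paths on a hand-built molecular graph, ignores bond orders, and hashes each path into a small bit set with CRC32. The function names (`paths`, `fingerprint`, `tanimoto`) and the 64-bit size are assumptions chosen for illustration.

```python
import zlib


def paths(atoms, bonds, max_len=3):
    """Enumerate linear atom paths (as element-symbol strings) of up to max_len atoms."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    found = set()

    def walk(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # Keep one direction-independent label so C-O and O-C count once.
        found.add(min(forward, backward))
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in adj:
        walk([start])
    return found


def fingerprint(atoms, bonds, n_bits=64, max_len=3):
    """Hash every path label to a bit position; return the set of bits that are on."""
    return {zlib.crc32(p.encode()) % n_bits for p in paths(atoms, bonds, max_len)}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two sets: |A & B| / |A | B|."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0


# Heavy-atom graphs written by hand (hydrogens omitted).
ETHANOL = (["C", "C", "O"], [(0, 1), (1, 2)])
DIMETHYL_ETHER = (["C", "O", "C"], [(0, 1), (1, 2)])

if __name__ == "__main__":
    fp1 = fingerprint(*ETHANOL)
    fp2 = fingerprint(*DIMETHYL_ETHER)
    print("bits on:", sorted(fp1), sorted(fp2))
    print("similarity:", tanimoto(fp1, fp2))
```

Because different paths can hash to the same bit, the similarity computed on bits may differ from the one computed on the raw path sets; production toolkits use longer bit strings and richer path labels.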
    • CDK Overview (category: functionality)
      Input / Output: Support for various formats including SDF, SMILES, CML, PDB, InChI, and PubChem XML formats; canonical SMILES support; pharmacophore serialization.
      Visualization: 2D coordinate generation and depiction.
      Properties: Fingerprinting; Gasteiger-Marsili and MMFF94 partial charges; atom, bond and molecular descriptors; NMR prediction via HOSE codes; aromaticity perception.
      Graph: Isomorphism and sub-graph isomorphism detection; SMARTS support; ring perception; pharmacophore searching; a variety of graph-theoretical algorithms (including traversal, shortest paths, distance matrix).
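The graph-theoretical items in the overview (traversal, shortest paths, distance matrix) can be sketched without any toolkit. The code below is an illustration in plain Python, not CDK code: it builds the topological distance matrix of a hand-written heavy-atom graph by breadth-first search and sums it into the Wiener index, a simple molecular descriptor. The function names are hypothetical.

```python
from collections import deque


def distance_matrix(n_atoms, bonds):
    """Topological distances (number of bonds) between all atom pairs via BFS."""
    adj = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[None] * n_atoms for _ in range(n_atoms)]
    for src in range(n_atoms):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if dist[src][nxt] is None:  # not reached yet from src
                    dist[src][nxt] = dist[src][cur] + 1
                    queue.append(nxt)
    return dist


def wiener_index(dist):
    """Sum of shortest-path distances over all unordered atom pairs."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n))


# n-butane heavy atoms as a chain: C0-C1-C2-C3
BUTANE_BONDS = [(0, 1), (1, 2), (2, 3)]

if __name__ == "__main__":
    matrix = distance_matrix(4, BUTANE_BONDS)
    for row in matrix:
        print(row)
    print("Wiener index:", wiener_index(matrix))
```

Entries stay `None` for atom pairs in disconnected fragments, so `wiener_index` as written assumes a connected graph.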
    • Data Visualization: Lots of OSS molecular visualization tools are available, but they need to be combined with data analysis tools. R is great for analytics and has powerful graphics, but it is not cheminformatics aware and not user-friendly. Possibilities: Rattle; GGobi; Processing (developer oriented, good for ad hoc, multiple data type visualizations); Bioclipse.
    • Data Visualization - Bioclipse
    • Open Source Cheminformatics Workflows. Requirements: core cheminformatics, analytics, database backends, integration. Can it be done? Yes, in various ways. For the non-expert user, pipeline tools provide a nice platform for integrating all the above. For expert users, it's useful to go lower level. Integration between R and the CDK provides a cheminformatics-enhanced modeling platform.
    • CDK and R: R is oriented towards statistical modeling and computations, and is cheminformatics agnostic. rcdk integrates the CDK into the R environment: read and process molecular structure information, descriptors, fingerprints, general molecule manipulation. Provides access to CDK functionality in idiomatic R. http://cran.r-project.org/web/packages/rcdk/index.html
    • Accessing Chemical Information from R: rcdk is good for processing and manipulating molecules in R. It is also useful to be able to access chemical information directly from databases. rpubchem provides access to the PubChem compound, substance and bioassay collections, by compound, substance or assay IDs and by keyword searches. It packages assay information into a data.frame and includes associated metadata. Supplements the rcdk package. http://cran.r-project.org/web/packages/rpubchem/index.html
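The idea of pulling compound records by ID into table-like rows can be sketched in Python as well. This is not how rpubchem works internally; it assumes PubChem's PUG REST URL scheme, which the slides do not mention, and the sample payload below is hand-written for illustration rather than captured from the service. The helper names are hypothetical, and no network request is made.

```python
import json
from urllib.parse import quote

# Assumption: PubChem's PUG REST base URL (not named anywhere in the slides).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids, properties):
    """Build a PUG REST URL asking for the given properties of the given compound IDs."""
    cid_part = ",".join(str(int(c)) for c in cids)
    prop_part = ",".join(quote(p) for p in properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def rows_from_response(payload):
    """Flatten a property-table response into a list of dicts, one per compound."""
    doc = json.loads(payload)
    return doc.get("PropertyTable", {}).get("Properties", [])


# Hand-written sample in the shape of a property-table response (illustrative only).
SAMPLE = '{"PropertyTable": {"Properties": [{"CID": 2244, "MolecularFormula": "C9H8O4"}]}}'

if __name__ == "__main__":
    print(property_url([2244], ["MolecularFormula"]))
    for row in rows_from_response(SAMPLE):
        print(row)
```

Each row is a plain dict, which plays roughly the role that a data.frame row plays in the R packages described above.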
    • Standards for Cheminformatics? Open standards/specifications help everybody. Most refer to file formats: CML, JCAMP-DX, InChI, AniML. Who sets them? How are they constructed? Are there usage restrictions?
    • Standards for Cheminformatics. Open definition: public participation in defining the standard; mailing lists and wikis for transparency; possibility of forking the standard. Examples: FlexMol, OpenSMILES, JCAMP-DX. Open use: no royalties for usage; no patents, trademarks, copyrights etc. Examples: SMILES, SDF, InChI, SLN.
    • Standards for Cheminformatics. De facto standard: in wide use, with few or no variants; data exchange is easy and reliable. Examples: SDF, SMILES, PDB. Formal standard: endorsed by some sort of recognized group, academic, or government body. Examples: InChI, OpenSMILES, JCAMP-DX.
    • The Blue Obelisk: Umbrella for a variety of OSS projects. Covers code, data, standards. Open to everybody. OpenSMILES is a recent project aiming to provide an explicit description of the SMILES grammar. http://blueobelisk.sourceforge.net/ http://www.opensmiles.org/
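To make concrete what an explicit grammar buys, here is a minimal tokenizer for a subset of SMILES. It is a sketch written for this transcript, not the OpenSMILES specification or any toolkit's parser: it only splits a string into bracket atoms, organic-subset atoms, bonds, ring-closure digits, branches and dots, and does no validation of valence, ring-closure pairing or bracket-atom contents. The token class names are assumptions.

```python
import re

# One alternative per token class; "Cl" and "Br" must be tried before "C" and "B".
TOKEN = re.compile(r"""
    (?P<bracket>\[[^\[\]]+\])                # bracket atom, e.g. [Na+] or [13CH4]
  | (?P<atom>Cl|Br|[BCNOPSFI]|[bcnops]|\*)   # organic subset, aromatic atoms, wildcard
  | (?P<bond>[-=#$:/\\])                     # bond symbols
  | (?P<ring>%\d{2}|\d)                      # ring-closure labels
  | (?P<branch>[()])                         # branch open / close
  | (?P<dot>\.)                              # disconnected fragments
""", re.VERBOSE)


def tokenize(smiles):
    """Split a SMILES string into (token_class, text) pairs; raise on unknown input."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for example in ("CC(=O)O", "c1ccccc1", "[Na+].[Cl-]"):
        print(example, tokenize(example))
```

A tokenizer like this is only the first stage; a full parser would still have to pair ring closures, track branches and interpret the contents of bracket atoms.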
    • The Pistoia Alliance: ". . . established to streamline non-competitive elements of the pharmaceutical drug discovery workflow by the specification of common business terms, relationships and processes . . ." An opportunity for the Open Source cheminformatics community to link with industrial users: ontology developments, web service interfaces, database schema. http://pistoiaalliance.sourceforge.net/
    • The Distributed Future
    • The Distributed Future: Web services, cloud computing, . . . The OSS cheminformatics ecosystem integrates with these scenarios very easily. Cost and licenses are one aspect; redundancy is a big benefit. Data / functionality mashups can lead to innovative solutions. Cheminformatics web services: CDK-based services (hosted at various places), Daylight web services, NCI, ChemSpider.
    • There's Data in Them Thar Internets: Many significant public resources of chemical information: PubChem, ChemSpider, NMRShiftDB. Use anything to access them. Does OSS have a role to play here? Open Access is likely more important in this case.
    • Data Access: Good to have access to data in an open fashion. What about adding value to the data? Could replicate databases, which is easier if the data source is built on an OSS stack; raw data dumps obviate this need. But open, well-defined APIs are preferable: they avoid hosting/update hassles and make it easier to mash up multiple data sources, which is easier still when the data sources support standards.
    • Benchmark Datasets: Benchmarking is vital. Some sub-fields have collections of benchmark datasets: docking (DUD), virtual screening (MUV). There are no general datasets or attempts for benchmarking core cheminformatics operations. Initial attempt at cheminfbenchmark on GitHub: restricted to Java libraries at this point (CDK, MX); uses datasets taken from PubChem; covers fingerprinting, SD parsing, SMARTS parsing, substructure searching. Reference: Rohrer, S. G. et al., J. Chem. Inf. Model., 2009, 49, 169–184.
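A benchmark of one core operation, SD parsing, can be sketched as follows. This is not the cheminfbenchmark code, which targets Java libraries and PubChem datasets; it is a plain-Python illustration that splits SD records on the `$$$$` delimiter, reads the title line and the `> <TAG>` data items, and times the parse with `timeit`. The records are synthetic one-atom placeholders generated in the script, and all function names are hypothetical.

```python
import re
import timeit


def make_record(i):
    """Build a synthetic one-atom SD record with a single data item (placeholder data)."""
    return "\n".join([
        f"mol{i}", "  synthetic", "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END",
        "> <ID>", str(i), "",
        "$$$$",
    ]) + "\n"


def parse_sdf(text):
    """Return one dict per record with its title line and data items (no connection table)."""
    records, block = [], []
    for line in text.splitlines():
        if line.strip() != "$$$$":
            block.append(line)
            continue
        if block:
            records.append(_parse_block(block))
        block = []
    return records


def _parse_block(lines):
    record = {"title": lines[0].strip(), "data": {}}
    i = 0
    while i < len(lines):
        header = re.match(r">.*<([^>]+)>", lines[i])
        i += 1
        if header:
            values = []
            # A data item runs until the next blank line.
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            record["data"][header.group(1)] = "\n".join(values)
    return record


def benchmark(n_records=100, repeats=10):
    """Return (records parsed per pass, average seconds per pass)."""
    text = "".join(make_record(i) for i in range(n_records))
    seconds = timeit.timeit(lambda: parse_sdf(text), number=repeats)
    return len(parse_sdf(text)), seconds / repeats


if __name__ == "__main__":
    count, per_pass = benchmark()
    print(f"{count} records, {per_pass * 1000:.3f} ms per pass")
```

Timings from a sketch like this only compare runs on the same machine; a shared dataset is what would make numbers comparable across toolkits.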
    • Open Source & Open Notebook Science: ONS is a paradigm whereby some or all experimental results are published in an open form with little or no lag time. Championed by Jean-Claude Bradley, Cameron Neylon, Raf Aerts and others. Closed source versus open source cheminformatics doesn't necessarily hinder ONS practice, but open source cheminformatics makes life easier.
    • ONS Solubility Challenge: Led by Jean-Claude Bradley (Drexel U.). Solubility measurements in various non-aqueous solvents. Part of a larger project to identify anti-malarial compounds. Very distributed: multiple groups generating and modeling data; data hosted on wikis and Google spreadsheets; multiple views, enhanced via cheminformatics web services.
    • ONS Solubility Challenge (diagram): Data Generation, Data Storage, Data Views, Web Services, Data Modeling.
    • What's Holding OSS Cheminformatics Back? Niche field. Comprehensiveness, polish. Funding.
    • Conclusions: The ecosystem is alive with activity. Distributed systems are important, and OSS cheminformatics fits in nicely. OSS projects should coordinate with users, both industrial and academic. Quality and effectiveness will be the final arbiter.