R & CDK





     Chemical Data Mining
    Open Source & Reproducible


              Rajarshi Guha

NIH Center for Advancing Translational Science


            August 21, 2012
            Philadelphia PA

Background




   Been using R since 2003, developed a number of R
   packages, mostly public
   Make extensive use of R at NCGC for small molecule &
   RNAi screening and high content analysis
   In parallel, need to manipulate and process chemical
   structure data


   How R is enhanced by other Open Source software
   How R enables and supports reproducible science

What is R?




    R is an environment for modeling
        Contains many prepackaged statistical and mathematical
        functions
        No need to implement anything (if you don’t want to)
    R is a matrix programming language that is good for
    statistical computing
        Full-fledged, interpreted language
        Well integrated with statistical functionality
        Easy to integrate with C, C++, Fortran
        Good for prototyping
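
    A minimal sketch of the points above, using only base R. The data are
    synthetic and the variable names (X, y, fit) are made up for the
    illustration.

```r
# Synthetic data: 50 observations of two predictors (illustrative only)
set.seed(1)
X <- matrix(rnorm(100), nrow = 50, ncol = 2,
            dimnames = list(NULL, c("x1", "x2")))
y <- 1.5 * X[, "x1"] - 0.5 * X[, "x2"] + rnorm(50, sd = 0.1)

# Matrix operations are built in: column means and a cross-product
col_means <- colMeans(X)
xtx <- t(X) %*% X

# Prepackaged statistics: fit a linear model without implementing anything
fit <- lm(y ~ X)
print(coef(fit))
```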

Why cheminformatics in R?




  Much of cheminformatics is data
  modeling and mining
  But the numeric data is derived
  from chemical structure
  Thus we want to work with
      molecules and their parts
      files containing molecules
      databases of molecules

Why cheminformatics in R?




    In contrast to bioinformatics (cf. Bioconductor), not a
    whole lot of cheminformatics support for R
    For cheminformatics and chemistry, relevant packages
    include
        rcdk, rpubchem, chemblr, fingerprint
        bio3d, ChemmineR, caret
    A lot of cheminformatics employs various forms of
    statistics and machine learning - R is exactly the
    environment for that
    We just need to add some chemistry capabilities to it

What does the CDK provide?



    Fundamental chemical objects
        atoms
        bonds
        molecules
    More complex objects are also available
        Sequences
        Reactions
        Collections of molecules
    Input/Output for a wide variety of molecular file formats
    Fingerprints and fragment generation
    Rigid alignments, pharmacophore searching
    Substructure searching, SMARTS support
    Molecular descriptors

Using the CDK in R


    Based on the rJava package
    Two R packages to install (not counting the
    dependencies)
    Provides access to a variety of CDK classes and methods
    Idiomatic R


    [Figure: package stack showing rcdk, rpubchem, CDK, Jmol, rJava,
    fingerprint and XML layered on the R Programming Environment]
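
    A minimal sketch of getting started, assuming rcdk is installed from
    CRAN and a Java runtime is available. The slide does not name the two
    packages, so the install line in the comment is an assumption. The
    SMILES string is just an example input.

```r
# One-time installation (assumed CRAN package name); needs a working Java
# runtime because rcdk talks to the CDK through rJava.
# install.packages("rcdk")

library(rcdk)

# Parse a SMILES string (benzoic acid) into a CDK molecule object
mol <- parse.smiles("c1ccccc1C(=O)O")[[1]]

# The object is a Java reference managed by rJava, not a native R structure
print(class(mol))

# Round-trip back to SMILES through rcdk
print(get.smiles(mol))
```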

Reading in data




    The CDK supports a variety of file formats
    rcdk loads all recognized formats, automatically
    Data can be local or remote


library(rcdk)

## local files and a remote URL can be mixed in one call
mols <- load.molecules(c("data/io/set1.sdf",
                         "data/io/set2.smi",
                         "http://rguha.net/rcdk/remote.sdf"))


    For large SDFs, use an iterating reader
    The loaded molecules are Java object references: can’t do
    much with them, except via rcdk functions
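
    A sketch of the iterating reader, following the pattern in the rcdk
    help page for iload.molecules. It assumes hasNext is available once
    rcdk is attached, which may depend on the installed version. A small
    temporary SMILES file stands in for a large SDF.

```r
library(rcdk)
library(iterators)  # provides nextElem

# Write a tiny SMILES file so the example is self-contained; in practice this
# would be a large SDF read with type = "sdf"
smi_file <- tempfile(fileext = ".smi")
writeLines(c("CCO", "c1ccccc1", "CC(=O)O"), smi_file)

# iload.molecules returns an iterator instead of loading everything at once
mol_iter <- iload.molecules(smi_file, type = "smi")

n_atoms <- integer(0)
while (hasNext(mol_iter)) {
  mol <- nextElem(mol_iter)
  n_atoms <- c(n_atoms, length(get.atoms(mol)))
}
print(n_atoms)
```

    Only one molecule is held in memory at a time, so the same loop works
    for files too large for load.molecules.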

Working with molecules




    Currently you can access atoms, bonds, get certain atom
    properties, 2D/3D coordinates
    Since rcdk doesn’t cover the entire CDK API, you might
    need to drop down to the rJava level and make calls to
    the Java code by hand
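
    A sketch of both levels, assuming rcdk and rJava are installed. The
    rcdk accessors come first, then the same information is fetched by
    calling the Java methods getAtomCount() and getBondCount() on the
    underlying CDK object. The molecule is an arbitrary example.

```r
library(rcdk)
library(rJava)

mol <- parse.smiles("CC(=O)Oc1ccccc1C(=O)O")[[1]]  # aspirin

# High-level rcdk accessors
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
symbols <- sapply(atoms, get.symbol)
print(table(symbols))

# Dropping down to rJava: call the CDK methods on the Java object directly.
# "I" is the JNI signature for a Java int return value.
n_atoms <- .jcall(mol, "I", "getAtomCount")
n_bonds <- .jcall(mol, "I", "getBondCount")
print(c(atoms = n_atoms, bonds = n_bonds))
```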

Accessing fingerprints




    CDK provides several fingerprints
        Path-based, MACCS, E-State, PubChem
    Access them via get.fingerprint(...)
    Works on one molecule at a time, use lapply to process a
    list of molecules
    This method works with the fingerprint package
        Separate package to represent and manipulate fingerprint
        data from various sources (CDK, BCI, MOE)
        Uses C to perform similarity calculations
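
    A sketch of the lapply pattern described above, assuming rcdk and
    fingerprint are installed. The three SMILES strings and the choice of
    MACCS keys are arbitrary examples.

```r
library(rcdk)
library(fingerprint)

smiles <- c("CCO", "CCCO", "c1ccccc1O")
mols <- parse.smiles(smiles)

# get.fingerprint handles one molecule at a time, so map it over the list
fps <- lapply(mols, get.fingerprint, type = "maccs")

# Pairwise Tanimoto similarity, computed by the fingerprint package
sim <- fp.sim.matrix(fps, method = "tanimoto")
print(round(sim, 2))
```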

Working with fingerprints



    The fingerprint package implements 28 similarity and
    dissimilarity metrics
    Easy to run enrichment studies
    We can compare datasets in O(n) time, using the “bit
    spectrum”

    [Figure: bit spectrum, Frequency (0.0 to 1.0) vs. Bit Position (0 to 150)]

    Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
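
    The bit spectrum idea can be sketched in plain R. This is not the
    fingerprint package's implementation: the 8-bit fingerprints are
    synthetic, the function name bit_spectrum is hypothetical, and the
    Euclidean distance between spectra is chosen only for illustration
    (the cited paper may use a different measure).

```r
# Each row is a molecule, each column a bit position (synthetic 8-bit data)
set_a <- rbind(c(1, 0, 1, 0, 0, 1, 0, 0),
               c(1, 0, 0, 0, 0, 1, 0, 1),
               c(1, 1, 1, 0, 0, 0, 0, 0))
set_b <- rbind(c(0, 0, 1, 0, 1, 1, 0, 0),
               c(1, 0, 1, 0, 1, 0, 0, 0))

# Bit spectrum: fraction of molecules in a dataset that set each bit.
# One pass over the fingerprints, so the cost grows linearly with dataset size.
bit_spectrum <- function(fp_matrix) colMeans(fp_matrix)

spec_a <- bit_spectrum(set_a)
spec_b <- bit_spectrum(set_b)

# Compare two datasets through their spectra instead of all pairs of molecules
spectrum_distance <- sqrt(sum((spec_a - spec_b)^2))
print(spec_a)
print(spectrum_distance)
```

    Comparing spectra avoids the all-pairs similarity matrix, which is
    where the linear scaling comes from.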

Visualization




    rcdk supports visualization of 2D structure images in
    two ways
    First, you can bring up a Swing window
    Second, you can obtain the depiction as a raster image


mols <- load.molecules("data/dhfr_3d.sd")

## view a single molecule in a Swing window
view.molecule.2d(mols[[1]])

## view a table of molecules
view.molecule.2d(mols[1:10])
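
    The second route, the raster image, might look like the sketch below.
    It assumes rcdk's view.image.2d, whose extra arguments have changed
    between rcdk versions, so only the molecule is passed here. The
    molecule is an arbitrary example.

```r
library(rcdk)

mol <- parse.smiles("c1ccccc1C(=O)O")[[1]]

# Obtain the 2D depiction as a raster array
img <- view.image.2d(mol)

# Draw it on an empty base-graphics canvas
plot(NA, xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, xlab = "", ylab = "")
rasterImage(img, 0, 0, 1, 1)
```

    Because the depiction is an ordinary array, it can be placed inside
    any base-graphics plot, for example next to a data point.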

The QSAR workflow




   Before model development you’ll need to clean the
   molecules, evaluate descriptors, generate subsets
   With the numeric data in hand, we can proceed to
   modeling
   Before building predictive models, we’d probably explore
   the dataset
       Normality of the dependent variable
       Correlations between descriptors and dependent variable
       Similarity of subsets
   Go wild and build all the models that R supports
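
    The exploration steps listed above can be sketched in base R. The
    descriptor table and activities are synthetic stand-ins, the column
    names are made up, and a linear model is just one arbitrary choice
    among the models R supports.

```r
set.seed(42)

# Synthetic stand-in for a descriptor table: 60 molecules x 3 descriptors.
# With rcdk this table would come from descriptor evaluation (eval.desc).
desc <- data.frame(d1 = rnorm(60), d2 = rnorm(60), d3 = rnorm(60))
activity <- 2 * desc$d1 - desc$d2 + rnorm(60, sd = 0.5)  # dependent variable

# Normality of the dependent variable
print(shapiro.test(activity)$p.value)

# Correlation of each descriptor with the dependent variable
print(round(sapply(desc, cor, y = activity), 2))

# Split into training and test subsets, then fit a first model
train_idx <- sample(nrow(desc), 45)
fit <- lm(activity ~ ., data = cbind(desc, activity = activity)[train_idx, ])
pred <- predict(fit, newdata = desc[-train_idx, ])
rmse <- sqrt(mean((pred - activity[-train_idx])^2))
print(rmse)
```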

Interacting with chemical databases




    A variety of databases containing structures, physical
    properties, biological activities
    Direct access within R lets us streamline our workflow
    Enabled by public APIs
        PubChem PUG and REST
        ChEMBL REST API (chemblr)
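
    As an illustration of direct access from R, the sketch below builds a
    PubChem PUG REST request with base R only. It does not use rpubchem or
    chemblr, the helper name pug_property_url is hypothetical, and the two
    compound IDs are arbitrary examples.

```r
# Build a PubChem PUG REST URL for compound properties, returned as CSV
pug_property_url <- function(cids, properties) {
  sprintf("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%s/property/%s/CSV",
          paste(cids, collapse = ","),
          paste(properties, collapse = ","))
}

url <- pug_property_url(c(2244, 3672), c("MolecularFormula", "MolecularWeight"))
print(url)

# Fetching needs network access, so it is left to interactive use
if (interactive()) {
  props <- read.csv(url)
  print(props)
}
```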

Reproducible chemical data mining




   The many toolkits and versions
   make reproducibility tough
   DB and HTTP access ensures
   that an analysis can always be
   up to date if required
   If the analysis is not based on a
   fixed snapshot of data,
   reproducibility cannot be
   guaranteed

   [Figure: Reproducible Bundle containing .Rda, .R, Sweave / knitr]

   Might actually make all those
   published QSAR models
   reusable!
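
   One way to pin an analysis to a fixed snapshot is sketched below, in
   base R only. The data frame, the list layout and the file name are
   hypothetical, not a prescribed bundle format.

```r
# Hypothetical analysis input, e.g. data pulled from a database on a given day
dataset <- data.frame(id = 1:3, activity = c(5.2, 6.1, 4.8))

# Freeze the data together with the retrieval date and package versions
snapshot <- list(data = dataset,
                 retrieved = Sys.Date(),
                 session = sessionInfo())
rda_file <- file.path(tempdir(), "analysis_snapshot.Rda")
save(snapshot, file = rda_file)

# Later, or on another machine: reload the snapshot instead of re-querying
rm(snapshot)
load(rda_file)
print(snapshot$data)
```

   Shipping the .Rda file alongside the .R script or Sweave / knitr
   source lets a reader rerun the analysis on exactly the same data.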

Acknowledgements




   rcdk
       Steffen Neumann
       Miguel Rojas
       Johannes Ranke
   CDK
       Egon Willighagen
       Christoph Steinbeck
       ...




http://sourceforge.net/projects/cdk/

  http://github.com/rajarshi/cdk

              @rguha

Chemical Data Mining: Open Source & Reproducible

  • 1. R & CDK 1/18 Chemical Data Mining Open Source & Reproducible Rajarshi Guha NIH Center for Advancing Translational Science August 21, 2012 Philadelphia PA
  • 2. Background: Been using R since 2003, developed a number of R packages, mostly public. Make extensive use of R at NCGC for small molecule & RNAi screening and high content analysis. In parallel, need to manipulate and process chemical structure data. How R is enhanced by other Open Source software. How R enables and supports reproducible science.
  • 3. What is R? R is an environment for modeling: contains many prepackaged statistical and mathematical functions; no need to implement anything (if you don’t want to). R is a matrix programming language that is good for statistical computing: full-fledged, interpreted language; well integrated with statistical functionality; easy to integrate with C, C++, Fortran; good for prototyping.
  • 4. Why cheminformatics in R? Much of cheminformatics is data modeling and mining. But the numeric data is derived from chemical structure. Thus we want to work with molecules and their parts, files containing molecules, and databases of molecules.
  • 5. Why cheminformatics in R? In contrast to bioinformatics (cf. Bioconductor), not a whole lot of cheminformatics support for R. For cheminformatics and chemistry, relevant packages include rcdk, rpubchem, chemblr, fingerprint, bio3d, ChemmineR, caret. A lot of cheminformatics employs various forms of statistics and machine learning; R is exactly the environment for that. We just need to add some chemistry capabilities to it.
  • 6. What does the CDK provide? Fundamental chemical objects: atoms, bonds, molecules. More complex objects are also available: sequences, reactions, collections of molecules. Input/output for a wide variety of molecular file formats. Fingerprints and fragment generation. Rigid alignments, pharmacophore searching. Substructure searching, SMARTS support. Molecular descriptors.
  • 7. Using the CDK in R: Based on the rJava package. Two R packages to install (not counting the dependencies). Provides access to a variety of CDK classes and methods. Idiomatic R. [Diagram labels: rcdk, CDK, Jmol, rpubchem, rJava, fingerprint, XML, R Programming Environment]
  • 8. Reading in data: The CDK supports a variety of file formats. rcdk loads all recognized formats automatically. Data can be local or remote: mols <- load.molecules(c("data/io/set1.sdf", "data/io/set2.smi", "http://rguha.net/rcdk/remote.sdf")). For large SDFs use an iterating reader. Can’t do much with these objects, except via rcdk functions.
  • 9. Working with molecules: Currently you can access atoms and bonds, and get certain atom properties and 2D/3D coordinates. Since rcdk doesn’t cover the entire CDK API, you might need to drop down to the rJava level and make calls to the Java code by hand.
  • 10. Accessing fingerprints: CDK provides several fingerprints: path-based, MACCS, E-State, PubChem. Access them via get.fingerprint(...). Works on one molecule at a time; use lapply to process a list of molecules. This method works with the fingerprint package, a separate package to represent and manipulate fingerprint data from various sources (CDK, BCI, MOE). Uses C to perform similarity calculations.
  • 11. Working with fingerprints: The fingerprint package implements 28 similarity and dissimilarity metrics. Easy to run enrichment studies. We can compare datasets in O(n) time, using the “bit spectrum”. [Figure: bit spectrum, Frequency vs. Bit Position] Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
  • 12. Visualization: rcdk supports visualization of 2D structure images in two ways. First, you can bring up a Swing window. Second, you can obtain the depiction as a raster image. Example: mols <- load.molecules("data/dhfr_3d.sd"); view.molecule.2d(mols[[1]]) views a single molecule in a Swing window; view.molecule.2d(mols[1:10]) views a table of molecules.
  • 13. The QSAR workflow
  • 14. The QSAR workflow: Before model development you’ll need to clean the molecules, evaluate descriptors, generate subsets. With the numeric data in hand, we can proceed to modeling. Before building predictive models, we’d probably explore the dataset: normality of the dependent variable; correlations between descriptors and dependent variable; similarity of subsets. Go wild and build all the models that R supports.
  • 15. Interacting with chemical databases: A variety of databases containing structures, physical properties, biological activities. Direct access within R lets us streamline our workflow. Enabled by public APIs: PubChem PUG and REST; ChEMBL REST API (chemblr).
  • 16. Reproducible chemical data mining: The many toolkits and versions make reproducibility tough. DB and HTTP access ensures that an analysis can always be up to date if required. If the analysis is not based on a fixed snapshot of data, reproducibility cannot be guaranteed. Might actually make all those published QSAR models reusable! [Diagram labels: .R, .Rda, Sweave / Knitr, Reproducible Bundle]
  • 17. Acknowledgements: rcdk: Steffen Neumann, Miguel Rojas, Ranke Johannes. CDK: Egon Willighagen, Christoph Steinbeck, ...
  • 18. http://sourceforge.net/projects/cdk/ http://github.com/rajarshi/cdk @rguha
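
The fingerprint comparisons from slides 10 and 11 can be sketched in base R, without rcdk or Java. This is a minimal sketch under stated assumptions, not the fingerprint package's implementation: fingerprints are assumed to be plain integer vectors of set bit positions, `tanimoto` and `bit_spectrum` are hypothetical helper names, the two datasets are made up for illustration, and Euclidean distance between spectra is an assumed choice of comparison.

```r
# Assumed representation: each fingerprint is an integer vector holding the
# positions of its "on" bits (the fingerprint package uses its own class).

# Tanimoto similarity between two fingerprints (hypothetical helper)
tanimoto <- function(a, b) {
  total <- length(union(a, b))
  if (total == 0) return(0)          # two empty fingerprints
  length(intersect(a, b)) / total
}

# Bit spectrum: fraction of molecules in a dataset that set each bit
# (hypothetical helper; assumes no repeated positions within a fingerprint)
bit_spectrum <- function(fps, nbit) {
  tabulate(unlist(fps), nbins = nbit) / length(fps)
}

# Two tiny made-up datasets of 8-bit fingerprints
set_a <- list(c(1L, 2L, 5L), c(1L, 5L, 8L), c(2L, 5L))
set_b <- list(c(1L, 2L, 5L), c(3L, 4L), c(4L, 7L, 8L))

# Pairwise similarity: 2 bits in common out of 4 in the union
sim <- tanimoto(set_a[[1]], set_a[[2]])

# Dataset-level comparison: one pass over each dataset, then one
# distance between the two spectra (Euclidean here, as an assumption)
spec_a <- bit_spectrum(set_a, nbit = 8)
spec_b <- bit_spectrum(set_b, nbit = 8)
spec_dist <- sqrt(sum((spec_a - spec_b)^2))

print(sim)
print(round(spec_a, 2))
print(spec_dist)
```

Each spectrum is built in a single pass over its dataset, so the cost of comparing two datasets grows linearly with the number of molecules, rather than quadratically as it would with an all-pairs similarity matrix.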