SlideShare a Scribd company logo
1 of 32
Download to read offline
Introduction R & CDK R & PubChem Conclusions




                      Integrating R and the CDK
                       Enhanced Chemical Data Mining


                                    Rajarshi Guha

                                   School of Informatics
                                    Indiana University


                                    21st May, 2007
                                    Covington, KY
Introduction R & CDK R & PubChem Conclusions

Outline



  1   Introduction


  2   R & CDK


  3   R & PubChem


  4   Conclusions
Introduction R & CDK R & PubChem Conclusions

Outline



  1   Introduction


  2   R & CDK


  3   R & PubChem


  4   Conclusions
Introduction R & CDK R & PubChem Conclusions

Cheminformatics & Modeling



        Chemical information is available from multiple sources
               Databases
               Algorithms
        The information must be used in some way
               Summarized in the form of a report
               Modeled (numerically / statistically)
        A variety of software is available for both cheminformatics and
        modeling
               Can we get something that allows us the best of both worlds?
Introduction R & CDK R & PubChem Conclusions

Cheminformatics & Modeling



        Chemical information is available from multiple sources
               Databases
               Algorithms
        The information must be used in some way
               Summarized in the form of a report
               Modeled (numerically / statistically)
        A variety of software is available for both cheminformatics and
        modeling
               Can we get something that allows us the best of both worlds?
Introduction R & CDK R & PubChem Conclusions

Modeling Environments


        Many cheminformatics applications include statistical
        functionality
        Useful, but has a number of drawbacks
               Their focus is on cheminformatics, not necessarily statistics
               Usually limited in the statistics offerings
        Makes sense to use software designed for statistics
               SAS, SPSS
               Minitab, R
               ...
        However, these focus on statistics and not cheminformatics

     Combine a statistics package with a cheminformatics package
Introduction R & CDK R & PubChem Conclusions

Modeling Environments


        Many cheminformatics applications include statistical
        functionality
        Useful, but has a number of drawbacks
               Their focus is on cheminformatics, not necessarily statistics
               Usually limited in the statistics offerings
        Makes sense to use software designed for statistics
               SAS, SPSS
               Minitab, R
               ...
        However, these focus on statistics and not cheminformatics

     Combine a statistics package with a cheminformatics package
Introduction R & CDK R & PubChem Conclusions

Why Use R?



        Open source statistical programming environment
        Full fledged programming language
        A wide variety of statistical methods - traditional, well tested
        methods as well as state of the art methods
        Extensive and active community
        Easy to integrate into other environments
               Workflow tools (Pipeline Pilot, Knime)
               Java programs (CDK)
Introduction R & CDK R & PubChem Conclusions

Why Use the CDK?




          Open source Java library for cheminformatics
          Covers basic cheminformatics functionality
          Also provides descriptors, 3D coordinate generation, rigid
          alignment etc.
          Active development community




  Steinbeck, C. et al., Curr. Pharm. Des., 2007, 12, 2110 – 2120
Introduction R & CDK R & PubChem Conclusions

Outline



  1   Introduction


  2   R & CDK


  3   R & PubChem


  4   Conclusions
Introduction R & CDK R & PubChem Conclusions

Goals of the Integration




        Allow one to use cheminformatics tools within R
        Ensure that cheminformatics functionality follows R idioms
        Provide access to PubChem data - compound and bioassay
Introduction R & CDK R & PubChem Conclusions

Using the CDK in R
  Overview
      Based on the rJava package
        Access to basic cheminformatics
        View 3D structures via Jmol
        View and edit 2D structures via JChemPaint
        Single R package to install - no extra downloads
Introduction R & CDK R & PubChem Conclusions

rcdk Functionality

                                 Function                   Description

        Editing and viewing      draw.molecule              Draw 2D structures
                                 view.molecule              View a molecule in 3D
                                 view.molecule.2d           View a molecule in 2D
                                 view.molecule.table        View 3D molecules along with data

        IO                       load.molecules             Load arbitrary molecular file formats
                                 write.molecules            Write molecules to MDL SD formatted files

        Descriptors              eval.desc                  Evaluate a single descriptor
                                 get.desc.engine            Get the descriptor engine which can be used to
                                                            automatically generate all descriptors
                                 get.desc.classnames        Get a list of available descriptors

        Miscellaneous            get.fingerprint            Generate binary fingerprints
                                 get.smiles                 Obtain SMILES representations
                                 parse.smiles               Parse a SMILES representation into a CDK
                                                            molecule object
                                 get.property               Retrieve arbitrary properties on molecules
                                 set.property               Set arbitrary properties on molecules
                                 remove.property            Remove a property from a molecule
                                 get.totalcharge            Get the total charge of a molecule
                                 get.total.hydrogen.count   Get the count of (implicit and explicit) hydro-
                                                            gens
                                 remove.hydrogens           Remove hydrogens from a molecule




  Guha, R., J. Stat. Soft. , 2007, 18 (6)
Introduction R & CDK R & PubChem Conclusions

Clustering Molecules



  Generating fingerprints
    > library(rcdk)
    > mols <- load.molecules(’dhfr_3d.sd’)
    > fp.list <- lapply(mols, get.fingerprint)

        The fingerprints are hashed, 1024-bit
        Gives us a list of numeric vectors, indicating the bit
        positions that are on
        The next step is to evaluate a distance matrix
Introduction R & CDK R & PubChem Conclusions

Clustering Molecules



  Fingerprint manipulation
    > library(fingerprint)
    > fp.dist <- fp.sim.matrix(fp.list)
    > fp.dist <- as.dist(1-fp.dist)

        The fingerprint package for R allows one to easily handle
        fingerprint data
        Recognizes CDK, BCI and MOE fingerprint formats
        We can now perform the clustering
Introduction R & CDK R & PubChem Conclusions

Clustering Molecules

  Hierarchical clustering
    > clus.hier <- hclust(fp.dist, method=’ward’)


                Hierarchical Clustering of the Sutherland DHFR Dataset
                                                                                               Average Disimilarity per Cluster
           35




                                                                                         0.7
           30




                                                                                         0.6
           25




                                                                                         0.5
                                                                         Dissimilarity
           20




                                                                                         0.4
  Height

           15




                                                                                         0.3
           10




                                                                                         0.2
                                                                                         0.1
           5




                                                                                         0.0
           0




                                                                                                 1        2        3       4

                                                                                                        Cluster Number
Introduction R & CDK R & PubChem Conclusions

QSAR Modeling
Introduction R & CDK R & PubChem Conclusions

QSAR Modeling



  Representation
      Molecules must be represented in ome machine readable
      format
        SD files are commonly used
        The draw.molecule function can be used to get a 2D
        structure editor
        The resultant molecule can be accessed in the R workspace
        Currently can’t generate 3D structures
Introduction R & CDK R & PubChem Conclusions

QSAR Modeling
  Calculating descriptors
   > mols <- load.molecules(’bp.sdf’)
   > desc.names <- c(
   + ’org.openscience.cdk.qsar.descriptors.molecular.BCUTDescriptor’,
   + ’org.openscience.cdk.qsar.descriptors.molecular.CPSADescriptor’,
   + ’org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor’,
   + ’org.openscience.cdk.qsar.descriptors.molecular.KappaShapeIndicesDescriptor’)

   >   desc.list <- list()
   >   for (i in 1:length(mols)) {
   +     tmp <- c()
   +     for (j in 1:length(desc.names)) {
   +       values <- eval.desc(mols.qsar[[i]], desc.names[j])
   +       tmp <- c(tmp, values)
   +       if (i == 1) data.names <- c(data.names, paste(prefix[j], 1:length(values), sep=’.’))
   +     }
   +     desc.list[[i]] <- tmp
   +   }
   >   desc.data <- as.data.frame(do.call(’rbind’, desc.list))



         Currently have to specify descriptors and generate names
         This will be automated in the next release
         The result is a data matrix which can then be modeled using
         the various methods available in R
Introduction R & CDK R & PubChem Conclusions

QSAR Modeling
                                                                                                   q       Training Set
                                                                                                           Prediction Set




                                                                                             600
                                                                                                                                                                   q q

                                                                                                                                                             q
                                                                                                                                                             q




                                                                Observed Boiling Point (K)
                                                                                                                                                                         q
                                                                                                                                                            q qqq




                                                                                             500
                                                                                                                                                           q
                                                                                                                                                          q
                                                                                                                                                      q qq q q      q
                                                                                                                                                           q
                                                                                                                                                        qq q
                                                                                                                                                          q
                                                                                                                                                        q qq q    q
                                                                                                                                                    q
                                                                                                                                                  qqq q
                                                                                                                                                      q
                                                                                                                                                               q
                                                                                                                                                 q q    q    q
                                                                                                                                                  q   q q
                                                                                                                                           qqq q q q
                                                                                                                                      q q qq
                                                                                                                                         q q q q qq
                                                                                                                                           qq
                                                                                                                                         qq q




                                                                                             400
                                                                                                                                q q qq q q      q
                                                                                                                                          q q
                                                                                                                                          q
                                                                                                                                    q q qqq
                                                                                                                                          q
                                                                                                                                     qq q
                                                                                                                                     q
                                                                                                                                     q
                                                                                                                                    qqqq q q
                                                                                                                                                     q
                                                                                                                                  q q qq q q q
                                                                                                                                       qq
                                                                                                                               q q q q q qq qq
                                                                                                                                   q
                                                                                                                                q qq q
                                                                                                                               q q q q
                                                                                                                                   q             q
                                                                                                                                      q q q q
                                                                                                                         q q q q qq q qq
                                                                                                                                       q q

The results for an OLS model                                                                                       q
                                                                                                                      q
                                                                                                                    q qq
                                                                                                                        qq q



                                                                                                                     q qq q
                                                                                                                      q qq
                                                                                                                           q
                                                                                                                             q q
                                                                                                                             q
                                                                                                                            q qq qqq q
                                                                                                                                qq
                                                                                                                                qq q
                                                                                                                                   qq
                                                                                                                                  qq q
                                                                                                                                   q
                                                                                                                                             q
                                                                                                                                                       q




                                                                                             300
                                                                                                                               q
                                                                                                                               q      q
                                                                                                                           q
                                                                                                                  q       q
                                                                                                                  qq   q     q
> summary(model)                                                                                           q         qq

Coefficients:                                                                                      q
                                                                                                           q
                                                                                                             q




                                                                                             200
                                                                                                                     q
               Estimate Std. Error t value Pr(>|t|)                                                    q
                                                                                                                         q


(Intercept)      25.611     13.743   1.864   0.0639 .
                                                                                                                             300                400         500                 600
BCUT.6           26.155      1.503 17.406 < 2e-16 ***
                                                                                                                                    Predicted Boiling Point (K)
CPSA.12        2565.552    181.668 14.122 < 2e-16 ***
                                                                                                                                                                                      q
WeightedPath.1    7.645      0.608 12.573 < 2e-16 ***                                                             q                                                 q




                                                                                             2
                                                                                                                                                                    q

MDE.11           91.355     17.916   5.099 8.05e-07 ***                                                              q
                                                                                                                               q
                                                                                                                                            q
                                                                                                                                                              q
                                                                                                                                                                    q
                                                                                                       q    q q
---                                                                                                                q q        q
                                                                                                                                          q
                                                                                                                                                  qq
                                                                                                                                                        q              qq
                                                                                                                                                                         q                   q
                                                                                                                                                                                             q




                                                                                             1
                                                                                                                             q         q                      q                   q
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1                                               q q

                                                                                                                q    qqq q
                                                                                                                            q
                                                                                                                             q
                                                                                                                                  q
                                                                                                                                     qq
                                                                                                                                        qq q
                                                                                                                                                  q
                                                                                                                                                 q q q        q
                                                                                                                                                               q qqq q q
                                                                                                                                                               q
                                                                                                                                                                     q
                                                                                                                                                                     q
                                                                                                                                                                               q
                                                                                                                                                                                 q q
                                                                                                                                                                                   q
                                                                                                                                                                                             qq
                                                                                                                                                                                              q
                                                                                                                    q q q                                          q                q




                                                                Standardized Residual
                                                                                                                                                                                          qq
                                                                                                               qq q
                                                                                                                 q                q          q
                                                                                                                                                                          q
                                                                                                                                                                        q q
                                                                                                                         q       q                                q                   q
                                                                                                              q          q          q       q          q q q              q q
                                                                                                               q                              q                 q           q        q      q
                                                                                                    q       q                         q q                         q                    q q
                                                                                                             q                                qq                                         q




                                                                                             0
Residual standard error: 28.88 on 195 degrees of freedom                                               qq
                                                                                                                  q
                                                                                                                  q     q
                                                                                                                          qq
                                                                                                                               q
                                                                                                                                   q
                                                                                                                                     q
                                                                                                                                         q q
                                                                                                                                        q q
                                                                                                                                           q          q q
                                                                                                                                                          q
                                                                                                                                                           q
                                                                                                                                                           q                 q q q
                                                                                                                                                                                    q      q

                                                                                                                     q              q                qq                            q      q qq
Multiple R-Squared: 0.8311,     Adjusted R-squared: 0.8276                                            q
                                                                                                           q                          q     q
                                                                                                                                                q           q
                                                                                                                                                                        q q q q
                                                                                                                                                                             qq
                                                                                                       qq                   q                      q                            q
                                                                                                                      q                                  q
F-statistic: 239.9 on 4 and 195 DF, p-value: < 2.2e-16




                                                                                             −1
                                                                                                          q                     qq
                                                                                                                        q                      q q                                      q
                                                                                                     q                     q qq          q                          q
                                                                                                                   q                             q     q     q
                                                                                                      q         q                                        q
                                                                                                     q                                                                      q
                                                                                                                                                                       q
                                                                                                                                                     q                          q      q
                                                                                                                                                             q                           q
                                                                                                   q                                                q           q                    q
                                                                                                    q




                                                                                             −2
                                                                                                                                                      q
                                                                                                                                        q

                                                                                                                 q




                                                                                             −3
                                                                                                            q
                                                                                                                                                               q



                                                                                                   0                               50                 100            150                   200

                                                                                                                                            Training Set Index
Introduction R & CDK R & PubChem Conclusions

Molecular Visualization
Introduction R & CDK R & PubChem Conclusions

Outline



  1   Introduction


  2   R & CDK


  3   R & PubChem


  4   Conclusions
Introduction R & CDK R & PubChem Conclusions

Accesing PubChem

  What can we get?
     2D structures for ∼10M compounds
        Some precomputed properties
        ∼500 bioassay datasets ranging from 100 observations to
        80,000 observations
        Searchable by keyword

  What’s the data like?
     Structure data can be retrieved in the form of SD or XML files
        Structures are represented as 2D coordinates as well as
        SMILES
        Bioassay data consists of XML or CSV files
               Column descriptions etc. are located in a separate XML file
Introduction R & CDK R & PubChem Conclusions

Accessing PubChem

  What’s the problem?
     Combining bioassay data & metadata by hand is painful
          The XML descriptions are readable but verbose
          Getting data can be a multi-step process, involving search,
          download and processing

  How does rpubchem help?
          Automated download of data & metadata files for compounds
          and bioassays
          Extracts data into a data.frame
          Assigns appropriate column names and adds extra information
          (such as units) as attributes
          Allows keyword searches for bioassays

  Guha, R., J. Stat. Soft. , 2007, 18 (6)
Introduction R & CDK R & PubChem Conclusions

Example Usage


  Getting structural data
    >   library(rpubchem)
    >   library(rcdk)
    >   structs <- get.cid( sample(1:10000, 10) )
    >   mols <- lapply(structs[,2], parse.smiles)
    >   view.molecule.2d(mols)

        We retrieve the SMILES along with some of the precomputed
        values
        Heavy atom count, formal charge, H count, H-bond donors
        and acceptors
Introduction R & CDK R & PubChem Conclusions

Example Usage



  Getting bioassay data
    > aids <- find.assay.id(quot;lung+cancer+carcinomaquot;)
    > sapply(aids, function(x) get.assay.desc(x)$assay.desc)
    > adata <- get.assay(175)


        Search for assays by keywords
        Get short descriptions
        Get the actual assay data
Introduction R & CDK R & PubChem Conclusions

Example Usage

  What do we get?
   Field Name                                    Meaning
   PUBCHEM.SID                                   Substance ID
   PUBCHEM.CID                                   Compound ID
   PUBCHEM.ACTIVITY.OUTCOME                      Activity outcome, which can be one
                                                 of active, inactive, inconclusive, un-
                                                 specified
   PUBCHEM.ACTIVITY.SCORE                        PubChem activity score, higher is
                                                 more active
   PUBCHEM.ASSAYDATA.COMMENT                     Test result specific comment


        A set of fixed fields
        Arbitrary number of assay specific fields
        Have to process a description file
Introduction R & CDK R & PubChem Conclusions

Example Usage
  What do we get?
  > names(adata)
  [1] quot;PUBCHEM.SIDquot;                 quot;PUBCHEM.CIDquot;
  [3] quot;PUBCHEM.ACTIVITY.OUTCOMEquot;    quot;PUBCHEM.ACTIVITY.SCOREquot;
  [5] quot;PUBCHEM.ASSAYDATA.COMMENTquot;   quot;concquot;
  [7] quot;giprctquot;                      quot;nbrtestquot;
  [9] quot;rangequot;

  > attr(adata, quot;commentsquot;)
  [1] quot;These data are a subset of the data from the NCI yeast anticancer
  drug screen. Compounds are identified by the NCI NSC number. In the
  Seattle Project numbering system, the mlh1 rad18 strain is SPY number
  50858. Compounds with inhibition of growth(giprct) >= 70% were
  considered active. Activity score was based on increasing values of giprct.quot;

  > attr(adata, ’types’)
  $conc
  [1] quot;concentration tested, unit: micromolarquot;
  [2] quot;uMquot;

  $giprct
  [1] quot;Inhibition of growth compared to untreated controls (%)quot;
  [2] quot;NAquot;

  $nbrtest
  [1] quot;Number of tests averaged for this data pointquot;
  [2] quot;NAquot;

  $range
  [1] quot;Range of values for this data pointquot; quot;NAquot;
Introduction R & CDK R & PubChem Conclusions

Outline



  1   Introduction


  2   R & CDK


  3   R & PubChem


  4   Conclusions
Introduction R & CDK R & PubChem Conclusions

Future Work




        Improve the coverage of the CDK functionality
        Improve descriptor generation and naming
        Include support for more chemical file formats
        2D Structure-data tables
        Avoid in-memory processing of large PubChem assay datasets
Introduction R & CDK R & PubChem Conclusions

Conclusions




        Integrating the CDK with R enhances the utility of the
        statistical environment for cheminformatics applications
        The integration is still at the programmer level
               Applications can be built up from the available functionality
        rpubchem provides easy access to large amounts of biological
        and chemical data
Introduction R & CDK R & PubChem Conclusions

Acknowledgements




        NIH
        CDK developers
        Simon Urbanek

More Related Content

Similar to RCDK PubChem Integrating R and CDK for Chemical Data Mining

R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}Rajarshi Guha
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleRajarshi Guha
 
Datascience Training with Hadoop, Python Machine Learning & Scala, Spark
Datascience Training with Hadoop, Python Machine Learning & Scala, SparkDatascience Training with Hadoop, Python Machine Learning & Scala, Spark
Datascience Training with Hadoop, Python Machine Learning & Scala, SparkSequelGate
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperDerek Diamond
 
Cinfony - Bring cheminformatics toolkits into tune
Cinfony - Bring cheminformatics toolkits into tuneCinfony - Bring cheminformatics toolkits into tune
Cinfony - Bring cheminformatics toolkits into tunebaoilleach
 
Structure mapping your way to better software
Structure mapping your way to better softwareStructure mapping your way to better software
Structure mapping your way to better softwarematthoneycutt
 
MongoDB for Java Developers with Spring Data
MongoDB for Java Developers with Spring DataMongoDB for Java Developers with Spring Data
MongoDB for Java Developers with Spring DataChris Richardson
 
Object Oriented Concepts and Principles
Object Oriented Concepts and PrinciplesObject Oriented Concepts and Principles
Object Oriented Concepts and Principlesdeonpmeyer
 
Web sphere application server performance tuning workshop
Web sphere application server performance tuning workshopWeb sphere application server performance tuning workshop
Web sphere application server performance tuning workshopRohit Kelapure
 
Cinfony - Combining disparate cheminformatics resources into a single toolkit
Cinfony - Combining disparate cheminformatics resources into a single toolkitCinfony - Combining disparate cheminformatics resources into a single toolkit
Cinfony - Combining disparate cheminformatics resources into a single toolkitbaoilleach
 
Mac ruby deployment
Mac ruby deploymentMac ruby deployment
Mac ruby deploymentThilo Utke
 
Java database programming with jdbc
Java database programming with jdbcJava database programming with jdbc
Java database programming with jdbcsriram raj
 
MV(C, mvvm) in iOS and ReactiveCocoa
MV(C, mvvm) in iOS and ReactiveCocoaMV(C, mvvm) in iOS and ReactiveCocoa
MV(C, mvvm) in iOS and ReactiveCocoaYi-Shou Chen
 
Framework Engineering_Final
Framework Engineering_FinalFramework Engineering_Final
Framework Engineering_FinalYoungSu Son
 
Reactive Relational Database Connectivity
Reactive Relational Database ConnectivityReactive Relational Database Connectivity
Reactive Relational Database ConnectivityVMware Tanzu
 
A tale of bug prediction in software development
A tale of bug prediction in software developmentA tale of bug prediction in software development
A tale of bug prediction in software developmentMartin Pinzger
 

Similar to RCDK PubChem Integrating R and CDK for Chemical Data Mining (20)

R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
 
Corba 2
Corba 2Corba 2
Corba 2
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & Reproducible
 
Datascience Training with Hadoop, Python Machine Learning & Scala, Spark
Datascience Training with Hadoop, Python Machine Learning & Scala, SparkDatascience Training with Hadoop, Python Machine Learning & Scala, Spark
Datascience Training with Hadoop, Python Machine Learning & Scala, Spark
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White Paper
 
Cinfony - Bring cheminformatics toolkits into tune
Cinfony - Bring cheminformatics toolkits into tuneCinfony - Bring cheminformatics toolkits into tune
Cinfony - Bring cheminformatics toolkits into tune
 
Structure mapping your way to better software
Structure mapping your way to better softwareStructure mapping your way to better software
Structure mapping your way to better software
 
Sdlc
SdlcSdlc
Sdlc
 
MongoDB for Java Developers with Spring Data
MongoDB for Java Developers with Spring DataMongoDB for Java Developers with Spring Data
MongoDB for Java Developers with Spring Data
 
Object Oriented Concepts and Principles
Object Oriented Concepts and PrinciplesObject Oriented Concepts and Principles
Object Oriented Concepts and Principles
 
Web sphere application server performance tuning workshop
Web sphere application server performance tuning workshopWeb sphere application server performance tuning workshop
Web sphere application server performance tuning workshop
 
OpenCV+Android.pptx
OpenCV+Android.pptxOpenCV+Android.pptx
OpenCV+Android.pptx
 
Cinfony - Combining disparate cheminformatics resources into a single toolkit
Cinfony - Combining disparate cheminformatics resources into a single toolkitCinfony - Combining disparate cheminformatics resources into a single toolkit
Cinfony - Combining disparate cheminformatics resources into a single toolkit
 
Mac ruby deployment
Mac ruby deploymentMac ruby deployment
Mac ruby deployment
 
Java database programming with jdbc
Java database programming with jdbcJava database programming with jdbc
Java database programming with jdbc
 
Openobject bi
Openobject biOpenobject bi
Openobject bi
 
MV(C, mvvm) in iOS and ReactiveCocoa
MV(C, mvvm) in iOS and ReactiveCocoaMV(C, mvvm) in iOS and ReactiveCocoa
MV(C, mvvm) in iOS and ReactiveCocoa
 
Framework Engineering_Final
Framework Engineering_FinalFramework Engineering_Final
Framework Engineering_Final
 
Reactive Relational Database Connectivity
Reactive Relational Database ConnectivityReactive Relational Database Connectivity
Reactive Relational Database Connectivity
 
A tale of bug prediction in software development
A tale of bug prediction in software developmentA tale of bug prediction in software development
A tale of bug prediction in software development
 

More from Rajarshi Guha

Pharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomePharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomeRajarshi Guha
 
Pharos: Putting targets in context
Pharos: Putting targets in contextPharos: Putting targets in context
Pharos: Putting targets in contextRajarshi Guha
 
Pharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomePharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomeRajarshi Guha
 
Pharos - Face of the KMC
Pharos - Face of the KMCPharos - Face of the KMC
Pharos - Face of the KMCRajarshi Guha
 
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformEnhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformRajarshi Guha
 
What can your library do for you?
What can your library do for you?What can your library do for you?
What can your library do for you?Rajarshi Guha
 
So I have an SD File … What do I do next?
So I have an SD File … What do I do next?So I have an SD File … What do I do next?
So I have an SD File … What do I do next?Rajarshi Guha
 
Characterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsCharacterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsRajarshi Guha
 
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATSFrom Data to Action: Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATSRajarshi Guha
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & RRajarshi Guha
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical StructuresRajarshi Guha
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Rajarshi Guha
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the partsRajarshi Guha
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Rajarshi Guha
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesRajarshi Guha
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Rajarshi Guha
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research DatabaseRajarshi Guha
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsRajarshi Guha
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Rajarshi Guha
 
Quantifying Text Sentiment in R
Quantifying Text Sentiment in RQuantifying Text Sentiment in R
Quantifying Text Sentiment in RRajarshi Guha
 

More from Rajarshi Guha (20)

Pharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomePharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark Genome
 
Pharos: Putting targets in context
Pharos: Putting targets in contextPharos: Putting targets in context
Pharos: Putting targets in context
 
Pharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomePharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark Genome
 
Pharos - Face of the KMC
Pharos - Face of the KMCPharos - Face of the KMC
Pharos - Face of the KMC
 
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformEnhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
 
What can your library do for you?
What can your library do for you?What can your library do for you?
What can your library do for you?
 
So I have an SD File … What do I do next?
So I have an SD File … What do I do next?So I have an SD File … What do I do next?
So I have an SD File … What do I do next?
 
Characterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsCharacterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network Models
 
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATSFrom Data to Action: Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & R
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical Structures
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the parts
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the Pipes
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research Database
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
 
Quantifying Text Sentiment in R
Quantifying Text Sentiment in RQuantifying Text Sentiment in R
Quantifying Text Sentiment in R
 

Recently uploaded

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

RCDK PubChem Integrating R and CDK for Chemical Data Mining

  • 1. Introduction R & CDK R & PubChem Conclusions Integrating R and the CDK Enhanced Chemical Data Mining Rajarshi Guha School of Informatics Indiana University 21st May, 2007 Covington, KY
  • 2. Introduction R & CDK R & PubChem Conclusions Outline 1 Introduction 2 R & CDK 3 R & PubChem 4 Conclusions
  • 3. Introduction R & CDK R & PubChem Conclusions Outline 1 Introduction 2 R & CDK 3 R & PubChem 4 Conclusions
  • 4. Introduction R & CDK R & PubChem Conclusions Cheminformatics & Modeling Chemical information is available from multiple sources Databases Algorithms The information must be used in some way Summarized in the form of a report Modeled (numerically / statistically) A variety of software is available for both cheminformatics and modeling Can we get something that allows us the best of both worlds?
  • 5. Introduction R & CDK R & PubChem Conclusions Cheminformatics & Modeling Chemical information is available from multiple sources Databases Algorithms The information must be used in some way Summarized in the form of a report Modeled (numerically / statistically) A variety of software is available for both cheminformatics and modeling Can we get something that allows us the best of both worlds?
  • 6. Introduction R & CDK R & PubChem Conclusions Modeling Environments Many cheminformatics applications include statistical functionality Useful, but has a number of drawbacks Their focus is on cheminformatics, not necessarily statistics Usually limited in the statistics offerings Makes sense to use software designed for statistics SAS, SPSS Minitab, R ... However, these focus on statistics and not cheminformatics Combine a statistics package with a cheminformatics package
  • 7. Introduction R & CDK R & PubChem Conclusions Modeling Environments Many cheminformatics applications include statistical functionality Useful, but has a number of drawbacks Their focus is on cheminformatics, not necessarily statistics Usually limited in the statistics offerings Makes sense to use software designed for statistics SAS, SPSS Minitab, R ... However, these focus on statistics and not cheminformatics Combine a statistics package with a cheminformatics package
  • 8. Introduction R & CDK R & PubChem Conclusions Why Use R? Open source statistical programming environment Full fledged programming language A wide variety of statistical methods - traditional, well tested methods as well as state of the art methods Extensive and active community Easy to integrate into other environments Workflow tools (Pipeline Pilot, Knime) Java programs (CDK)
  • 9. Introduction R & CDK R & PubChem Conclusions Why Use the CDK? Open source Java library for cheminformatics Covers basic cheminformatics functionality Also provides descriptors, 3D coordinate generation, rigid alignment etc. Active development community Steinbeck, C. et al., Curr. Pharm. Des., 2007, 12, 2110 – 2120
  • 10. Introduction R & CDK R & PubChem Conclusions Outline 1 Introduction 2 R & CDK 3 R & PubChem 4 Conclusions
  • 11. Introduction R & CDK R & PubChem Conclusions Goals of the Integration Allow one to use cheminformatics tools within R Ensure that cheminformatics functionality follows R idioms Provide access to PubChem data - compound and bioassay
  • 12. Introduction R & CDK R & PubChem Conclusions Using the CDK in R Overview Based on the rJava package Access to basic cheminformatics View 3D structures via Jmol View and edit 2D structures via JChemPaint Single R package to install - no extra downloads
  • 13. Introduction R & CDK R & PubChem Conclusions rcdk Functionality Function Description Editing and viewing draw.molecule Draw 2D structures view.molecule View a molecule in 3D view.molecule.2d View a molecule in 2D view.molecule.table View 3D molecules along with data IO load.molecules Load arbitrary molecular file formats write.molecules Write molecules to MDL SD formatted files Descriptors eval.desc Evaluate a single descriptor get.desc.engine Get the descriptor engine which can be used to automatically generate all descriptors get.desc.classnames Get a list of available descriptors Miscellaneous get.fingerprint Generate binary fingerprints get.smiles Obtain SMILES representations parse.smiles Parse a SMILES representation into a CDK molecule object get.property Retrieve arbitrary properties on molecules set.property Set arbitrary properties on molecules remove.property Remove a property from a molecule get.totalcharge Get the total charge of a molecule get.total.hydrogen.count Get the count of (implicit and explicit) hydro- gens remove.hydrogens Remove hydrogens from a molecule Guha, R., J. Stat. Soft. , 2007, 18 (6)
  • 14. Introduction R & CDK R & PubChem Conclusions Clustering Molecules Generating fingerprints > library(rcdk) > mols <- load.molecules(’dhfr_3d.sd’) > fp.list <- lapply(mols, get.fingerprint) The fingerprints are hashed, 1024-bit Gives us a list of numeric vectors, indicating the bit positions that are on The next step is to evaluate a distance matrix
  • 15. Introduction R & CDK R & PubChem Conclusions Clustering Molecules Fingerprint manipulation > library(fingerprint) > fp.dist <- fp.sim.matrix(fp.list) > fp.dist <- as.dist(1-fp.dist) The fingerprint package for R allows one to easily handle fingerprint data Recognizes CDK, BCI and MOE fingerprint formats We can now perform the clustering
  • 16. Introduction R & CDK R & PubChem Conclusions Clustering Molecules Hierarchical clustering > clus.hier <- hclust(fp.dist, method=’ward’) Hierarchical Clustering of the Sutherland DHFR Dataset Average Disimilarity per Cluster 35 0.7 30 0.6 25 0.5 Dissimilarity 20 0.4 Height 15 0.3 10 0.2 0.1 5 0.0 0 1 2 3 4 Cluster Number
  • 17. Introduction R & CDK R & PubChem Conclusions QSAR Modeling
  • 18. Introduction R & CDK R & PubChem Conclusions QSAR Modeling Representation Molecules must be represented in ome machine readable format SD files are commonly used The draw.molecule function can be used to get a 2D structure editor The resultant molecule can be accessed in the R workspace Currently can’t generate 3D structures
  • 19. Introduction R & CDK R & PubChem Conclusions QSAR Modeling Calculating descriptors > mols <- load.molecules(’bp.sdf’) > desc.names <- c( + ’org.openscience.cdk.qsar.descriptors.molecular.BCUTDescriptor’, + ’org.openscience.cdk.qsar.descriptors.molecular.CPSADescriptor’, + ’org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor’, + ’org.openscience.cdk.qsar.descriptors.molecular.KappaShapeIndicesDescriptor’) > desc.list <- list() > for (i in 1:length(mols)) { + tmp <- c() + for (j in 1:length(desc.names)) { + values <- eval.desc(mols.qsar[[i]], desc.names[j]) + tmp <- c(tmp, values) + if (i == 1) data.names <- c(data.names, paste(prefix[j], 1:length(values), sep=’.’)) + } + desc.list[[i]] <- tmp + } > desc.data <- as.data.frame(do.call(’rbind’, desc.list)) Currently have to specify descriptors and generate names This will be automated in the next release The result is a data matrix which can then be modeled using the various methods available in R
  • 20. Introduction R & CDK R & PubChem Conclusions QSAR Modeling q Training Set Prediction Set 600 q q q q Observed Boiling Point (K) q q qqq 500 q q q qq q q q q qq q q q qq q q q qqq q q q q q q q q q q qqq q q q q q qq q q q q qq qq qq q 400 q q qq q q q q q q q q qqq q qq q q q qqqq q q q q q qq q q q qq q q q q q qq qq q q qq q q q q q q q q q q q q q q q qq q qq q q The results for an OLS model q q q qq qq q q qq q q qq q q q q q qq qqq q qq qq q qq qq q q q q 300 q q q q q q qq q q > summary(model) q qq Coefficients: q q q 200 q Estimate Std. Error t value Pr(>|t|) q q (Intercept) 25.611 13.743 1.864 0.0639 . 300 400 500 600 BCUT.6 26.155 1.503 17.406 < 2e-16 *** Predicted Boiling Point (K) CPSA.12 2565.552 181.668 14.122 < 2e-16 *** q WeightedPath.1 7.645 0.608 12.573 < 2e-16 *** q q 2 q MDE.11 91.355 17.916 5.099 8.05e-07 *** q q q q q q q q --- q q q q qq q qq q q q 1 q q q q Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 q q q qqq q q q q qq qq q q q q q q q qqq q q q q q q q q q qq q q q q q q Standardized Residual qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q 0 Residual standard error: 28.88 on 195 degrees of freedom qq q q q qq q q q q q q q q q q q q q q q q q q q q qq q q qq Multiple R-Squared: 0.8311, Adjusted R-squared: 0.8276 q q q q q q q q q q qq qq q q q q q F-statistic: 239.9 on 4 and 195 DF, p-value: < 2.2e-16 −1 q qq q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q −2 q q q −3 q q 0 50 100 150 200 Training Set Index
  • 21. Introduction R & CDK R & PubChem Conclusions Molecular Visualization
  • 22. Introduction R & CDK R & PubChem Conclusions Outline 1 Introduction 2 R & CDK 3 R & PubChem 4 Conclusions
  • 23. Introduction R & CDK R & PubChem Conclusions Accesing PubChem What can we get? 2D structures for ∼10M compounds Some precomputed properties ∼500 bioassay datasets ranging from 100 observations to 80,000 observations Searchable by keyword What’s the data like? Structure data can be retrieved in the form of SD or XML files Structures are represented as 2D coordinates as well as SMILES Bioassay data consists of XML or CSV files Column descriptions etc. are located in a separate XML file
  • 24. Introduction R & CDK R & PubChem Conclusions Accessing PubChem What’s the problem? Combining bioassay data & metadata by hand is painful The XML descriptions are readable but verbose Getting data can be a multi-step process, involving search, download and processing How does rpubchem help? Automated download of data & metadata files for compounds and bioassays Extracts data into a data.frame Assigns appropriate column names and adds extra information (such as units) as attributes Allows keyword searches for bioassays Guha, R., J. Stat. Soft. , 2007, 18 (6)
  • 25. Introduction R & CDK R & PubChem Conclusions Example Usage Getting structural data > library(rpubchem) > library(rcdk) > structs <- get.cid( sample(1:10000, 10) ) > mols <- lapply(structs[,2], parse.smiles) > view.molecule.2d(mols) We retrieve the SMILES along with some of the precomputed values Heavy atom count, formal charge, H count, H-bond donors and acceptors
  • 26. Introduction R & CDK R & PubChem Conclusions Example Usage Getting bioassay data > aids <- find.assay.id(quot;lung+cancer+carcinomaquot;) > sapply(aids, function(x) get.assay.desc(x)$assay.desc) > adata <- get.assay(175) Search for assays by keywords Get short descriptions Get the actual assay data
  • 27. Introduction R & CDK R & PubChem Conclusions Example Usage What do we get? Field Name Meaning PUBCHEM.SID Substance ID PUBCHEM.CID Compound ID PUBCHEM.ACTIVITY.OUTCOME Activity outcome, which can be one of active, inactive, inconclusive, un- specified PUBCHEM.ACTIVITY.SCORE PubChem activity score, higher is more active PUBCHEM.ASSAYDATA.COMMENT Test result specific comment A set of fixed fields Arbitrary number of assay specific fields Have to process a description file
  • 28. Introduction R & CDK R & PubChem Conclusions Example Usage What do we get? > names(adata) [1] quot;PUBCHEM.SIDquot; quot;PUBCHEM.CIDquot; [3] quot;PUBCHEM.ACTIVITY.OUTCOMEquot; quot;PUBCHEM.ACTIVITY.SCOREquot; [5] quot;PUBCHEM.ASSAYDATA.COMMENTquot; quot;concquot; [7] quot;giprctquot; quot;nbrtestquot; [9] quot;rangequot; > attr(adata, quot;commentsquot;) [1] quot;These data are a subset of the data from the NCI yeast anticancer drug screen. Compounds are identified by the NCI NSC number. In the Seattle Project numbering system, the mlh1 rad18 strain is SPY number 50858. Compounds with inhibition of growth(giprct) >= 70% were considered active. Activity score was based on increasing values of giprct.quot; > attr(adata, ’types’) $conc [1] quot;concentration tested, unit: micromolarquot; [2] quot;uMquot; $giprct [1] quot;Inhibition of growth compared to untreated controls (%)quot; [2] quot;NAquot; $nbrtest [1] quot;Number of tests averaged for this data pointquot; [2] quot;NAquot; $range [1] quot;Range of values for this data pointquot; quot;NAquot;
  • 29. Introduction R & CDK R & PubChem Conclusions Outline 1 Introduction 2 R & CDK 3 R & PubChem 4 Conclusions
  • 30. Introduction R & CDK R & PubChem Conclusions Future Work Improve the coverage of the CDK functionality Improve descriptor generation and naming Include support for more chemical file formats 2D Structure-data tables Avoid in-memory processing of large PubChem assay datasets
  • 31. Introduction R & CDK R & PubChem Conclusions Conclusions Integrating the CDK with R enhances the utility of the statistical environment for cheminformatics applications The integration is still at the programmer level Applications can be built up from the available functionality rpubchem provides easy access to large amounts of biological and chemical data
  • 32. Introduction R & CDK R & PubChem Conclusions Acknowledgements NIH CDK developers Simon Urbanek