Online chemical modeling environment: models

    Iurii Sushko, Sergey Novotarskiy
        Thursday, August 13, 2009
Existing alternatives
Classical approach: Weka, R, Mathematica

Advantages:

       1. Most flexible
       2. Suitable for research and deep analysis

Disadvantages:

       1. Complex: suited to mathematicians, computer
          scientists and statisticians, but not to
          chemists and biologists
       2. Very tedious data preparation
Community-driven source   Authority-driven source
Collaboration in QSAR
Possibilities for collaboration in QSAR:

 1. Use others' data
      a. build models based on others' data
      b. validate your models against others' data
 2. Use others' models
      a. validate your data against published models
      b. use the output of published models
         as input for new ones
      c. compare the performance of published models
         with your own

 All existing modeling tools lack means of collaboration
OCHEM advantages
Collaboration-targeted features:
    1. Tight connection between database and
       modeling tools
    2. Wiki, discussion, comments, tags



Simplified modeling workflow:
    1.   Sensible defaults for most parameters
    2.   Only necessary parameters requested
    3.   Data representation is targeted at chemists
    4.   Fine-tuning is possible for experts
Modeling workflow

1. Data preparation
2. Building a model
3. Analysing the model (including applicability domain, AD)
4. Application of the model
Stage 1 – Data preparation
[Diagram: a data point, the items linked to it, and the operations on it]

   Data point
     Property:      logP = 0.5; Melting Point = 100 C
     Condition:     temperature, pH, species, tissue, method
     Structure:     Benzene, Urea, ...
     Article:       Garberg, P, “In vitro models for …”
     Introducer:    Bill G., Sergey B.
     Tags:          Toxicology, Biology, Partition coefficient
     Date of modification
     Information system
   Operations
     Filtering:     Toxicology, Biology, Partition coefficient
     Manipulation:  editing, organization, working sets
Stage 1 – Data preparation
[Diagram: operations on data points]

   Tags:          Toxicology, Biology, Partition coefficient
   Filtering:     Toxicology, Biology, Partition coefficient
   Manipulation:  editing, organization, working sets
Stage 1: Data preparation
Stage 2: Model building - input data
Stage 2: Model building - descriptors (I)
Stage 2: Model building - descriptors (II)
Stage 2: Model building – descriptors (manual)
Stage 3: Analysing the model (I)
Basic model statistics
Stage 3: Analysing the model (II)
Applicability domain assessment
Stage 4: Application of the model
Selection of the model of interest
[Screenshot labels: newly created model; model published by another user]
Stage 4: Application of the model
Provide target compounds
Stage 4: Application of the model
Prediction results
[Screenshot columns: target compound, prediction, accuracy assessment]
Stage 4: Application of the model
Assessment of the accuracy of predictions
[Screenshot label: target compound]
Need for distribution of calculations
Fact: QSAR modeling is calculation-intensive

Examples of calculations:
• Training of neural network ensembles
• Computing 3D conformations
• Computing complex molecular descriptors

Solution:
• Distributed calculation network
• Users can postpone or cancel tasks, or fetch task results later
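
The "submit now, fetch later" idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TaskClient` class, its method names and the `slow_descriptor_sum` placeholder are hypothetical, and a local thread pool stands in for the distributed calculation network. It is not OCHEM's actual interface, and postponing a task is not modeled.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class TaskClient:
    """Hypothetical client: submit a task now, come back for the result later."""

    def __init__(self, max_workers=2):
        # A local thread pool stands in for the remote calculation servers.
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._tasks = {}

    def submit(self, func, *args):
        """Queue a calculation and return a task id the user can keep."""
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = self._pool.submit(func, *args)
        return task_id

    def status(self, task_id):
        """Report whether a task is pending, done or cancelled."""
        future = self._tasks[task_id]
        if future.cancelled():
            return "cancelled"
        return "done" if future.done() else "pending"

    def cancel(self, task_id):
        """Cancel a task; this only succeeds if it has not started running."""
        return self._tasks[task_id].cancel()

    def fetch(self, task_id, timeout=None):
        """Block until the result is ready or the timeout expires."""
        return self._tasks[task_id].result(timeout=timeout)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def slow_descriptor_sum(values):
    # Placeholder for an expensive step such as descriptor calculation.
    return sum(v * v for v in values)


client = TaskClient()
task_id = client.submit(slow_descriptor_sum, [1.0, 2.0, 3.0])
# The user could disconnect here and return later with task_id.
result = client.fetch(task_id, timeout=10)
client.shutdown()
```

Returning an opaque task id rather than the result itself is what lets the user walk away from a long calculation and pick it up later.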
Automatic updates and testing

• Calculation servers are updated automatically when a
  new release becomes available
• Servers are tested automatically after each update
• Tasks that fail their tests are disabled, keeping
  the server functional
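
One way to picture "failed tasks are disabled, the server stays up" is the sketch below. It is an illustration under stated assumptions, not the actual OCHEM mechanism: the `run_self_tests` function, the task names and the test cases are all hypothetical.

```python
def run_self_tests(task_handlers, self_tests):
    """Return the set of task types that passed their self-test.

    task_handlers: dict mapping task type -> callable
    self_tests:    dict mapping task type -> (test input, expected output)
    """
    enabled = set()
    for task_type, handler in task_handlers.items():
        test_input, expected = self_tests[task_type]
        try:
            passed = handler(test_input) == expected
        except Exception:
            # A crashing handler counts as a failed test.
            passed = False
        if passed:
            enabled.add(task_type)
    return enabled


def count_atoms(formula_counts):
    # Hypothetical task: total number of atoms in a formula.
    return sum(formula_counts.values())


def broken_descriptor(_):
    # Hypothetical task that breaks after an update.
    raise RuntimeError("simulated regression in a new release")


handlers = {"atom_count": count_atoms, "descriptor_x": broken_descriptor}
tests = {
    "atom_count": ({"C": 6, "H": 6}, 12),
    "descriptor_x": ({"C": 1}, 1.0),
}

# Only task types that pass are offered; the rest are switched off.
enabled_tasks = run_self_tests(handlers, tests)
```

Because each task type is tested in isolation, one broken calculation after an update does not take the whole server out of service.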
Backend - distributed calculation
Central metaserver, distributed calculation servers
Automatic server updates, on-the-fly server testing
Basic facts

• About 50,000 experimental measurements of
  285 physicochemical properties, published in
  about 2,000 articles
• Implemented modeling methods:
  ANN, kNN, MLR, kernel ridge regression
• Integrated descriptors: Dragon, E-State,
  Fragments
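
As an illustration of the simplest of the listed methods, here is a minimal k-nearest-neighbours (kNN) regression sketch. The `knn_predict` function, the two-dimensional descriptor vectors and the property values are made up for the example; it does not reproduce the OCHEM implementation.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a property as the mean over the k nearest training compounds."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Euclidean distance in descriptor space, paired with the measured value.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k


# Toy descriptor vectors and property values (invented for illustration).
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

prediction = knn_predict(train_x, train_y, query=(0.1, 0.1), k=3)
```

The distant compound at (5.0, 5.0) does not influence the prediction, since only the three closest training points are averaged.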
Backend - basic facts

 Platform: Java EE
 Database: MySQL
 Server: Tomcat
 ORM: Hibernate
 MVC: Spring framework
 Client side: AJAX, HTML + JavaScript
