Automated Molecular Data Extraction
using Open Babel & ChemSpotlight:
       The Semantic Desktop

         Prof. Geoff Hutchison
         Department of Chemistry
         University of Pittsburgh
         geoffh@pitt.edu


         ACS CINF: Skolnik Symposium
         21 August 2012
         http://hutchison.chem.pitt.edu
“I can plug my iPod into any computer and it will recognize my music and give me all sorts of metadata: artist, title, type of music...

Why can’t I read the chemical metadata off my chemistry files?”
— Prof. Henry S. Rzepa (Imperial College)
  Spring 2005 ACS Meeting, San Diego, CA
Pre-History: Chem://Dig


                                Index files, websites
                                Based on Chem MIME
                                Find files by extension
                                Perceive chemistry
                                Database Store
                                Search, Filter
                                Retrieval

    H. Rzepa et al. New J. Chem (2002) 26 p. 656
Open Babel
              Open Babel (Started 2001)
                 Free, open source chemical toolbox
                 Cross-platform: Win, Mac, Linux...
                 Both user-tools & C++ library
                 Interfaces in Python, Perl, Ruby, Java, C#
                 Supports chemistry, bioinformatics, solid-state…
                 100+ file formats and variants

          http://openbabel.org/
    O’Boyle et al. J. Cheminf. 2011, 3:33
Chemical Database?


    1. Some way to store data
         (Organize it)
    2. Index it
    3. Search / filter
    4. Visualize results
ChemSpotlight: Indexing Architecture



  Spotlight + Open Babel + ~300 lines of code

    http://chemspotlight.openmolecules.net/
ChemSpotlight: “Un” Database


      Use the system-wide search database
      No (Visible) Database!
      Index files in-place
      Includes textual data
      (e.g., chemical names, formulas, etc.)
      Multiple retrieval and filtering interfaces
      (i.e., any third-party search tool works)

      http://chemspotlight.openmolecules.net/
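
The "index files in-place" idea can be illustrated with a short sketch. This is not ChemSpotlight's code (which plugs into Spotlight); it is a minimal stand-in, assuming a plain Python dictionary as the index and a hypothetical `build_index` helper. The extension set is an assumed subset of chemistry formats. The point is that files stay where they are and only a path-keyed record is stored.

```python
import os
import tempfile

# Assumed subset of chemistry file extensions, for illustration only.
CHEM_EXTENSIONS = {".sdf", ".pdb", ".cif"}


def build_index(root):
    """Walk `root` and record chemistry files by path, without moving or copying them."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in CHEM_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # The index holds only metadata; the file itself is untouched.
            index[path] = {
                "format": ext[1:],
                "size": info.st_size,
                "modified": info.st_mtime,
            }
    return index


if __name__ == "__main__":
    # Small demonstration on a throwaway directory.
    with tempfile.TemporaryDirectory() as demo_root:
        with open(os.path.join(demo_root, "example.sdf"), "w") as handle:
            handle.write("placeholder\n")
        for path, record in build_index(demo_root).items():
            print(path, record)
```

A real indexer would add perceived chemistry (formula, SMILES, and so on) to each record; here only file-system metadata is stored so the sketch needs nothing beyond the standard library.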
So What’s Stored / Perceived
       Formula, mass, SMILES, InChI
       net_sourceforge_openbabel_Formula = C21H36N7O8S

       Fingerprints, number of atoms, bonds, residues
       PDB, SDF keywords, properties
       Calculation keywords:
       kMDItemComment = "Gaussian 09 #n B3LYP/6-31G(d) Opt"

       Calculation results (HOMO, LUMO, Dipole Moment)
       net_sourceforge_chemspotlight_DipoleMoment = 3.5
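
The attribute listings on this slide follow a simple `key = value` layout. As a minimal sketch, assuming each attribute sits on one line in that layout, the values can be read into a dictionary for scripting. `parse_metadata` is a hypothetical helper written for this example, not part of ChemSpotlight or Open Babel; the sample text reuses the three attributes shown on the slide.

```python
def parse_metadata(text):
    """Parse 'key = value' lines into a dict.

    Quoted values lose their quotes, numeric values become floats,
    and everything else is kept as a string.
    """
    attributes = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if not separator:
            continue  # skip lines without a key/value pair
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        else:
            try:
                value = float(value)
            except ValueError:
                pass  # not a number, e.g. a molecular formula
        attributes[key] = value
    return attributes


# Sample text assembled from the attributes shown on the slide.
SAMPLE = """\
net_sourceforge_openbabel_Formula = C21H36N7O8S
kMDItemComment = "Gaussian 09 #n B3LYP/6-31G(d) Opt"
net_sourceforge_chemspotlight_DipoleMoment = 3.5
"""

if __name__ == "__main__":
    for name, value in parse_metadata(SAMPLE).items():
        print(name, "->", repr(value))
```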
ChemSpotlight “Un” Database
How Do We Visualize?

   “QuickLook” previews
   New code ~800 lines
   Generate SDF, PDB, CIF (if needed)
   Pass off to ChemDoodle Web Components
   Pseudo-3D, interactive JS + HTML5
   … or SVG generation from Open Babel

             http://web.chemdoodle.com/
Organic Heterojunction Solar Cells



  [Figure: device stack, from the illuminated side: transparent electrode, p-type material (+), n-type material (-), reflective electrode, wired to an external circuit]
Organic Heterojunction Solar Cells

  [Figure: the device stack (light, transparent electrode, p-type material (+), n-type material (-), reflective electrode, circuit) beside an energy-level diagram. Diagram labels: optical excitation (hν), ΔE ≥ exciton binding energy, hole conducting polymer, electron conductor (nanoparticle), effective heterojunction bandgap, cathode, anode, e-, h+]
Pipeline Model for Finding New Molecules

  [Figure: screening pipeline. Monomers, then >10^6 possible structures, filtered in turn by electronic properties, optical properties, and synthetic score. Timing label: ~9 minutes]


J Phys Chem C 2011 vol. 115 pp. 16200       ...
Pipeline Model for Finding New Molecules

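  [Figure: screening pipeline. Monomers, then >10^6 possible structures, filtered in turn by electronic properties, optical properties, and synthetic score. Timing label: ~9 minutes. The stages are annotated from "Fast Screening" at the top to "Slower" at the bottom]
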

J Phys Chem C 2011 vol. 115 pp. 16200       ...
New Genetic Algorithm Approach

      Rather than directly driving calculations & waiting for results:
      Check Spotlight for new results
        “What are top HOMO energies?”
      Update GA, generate new candidates, submit new jobs
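
The loop on this slide can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the `results` dictionary stands in for whatever a Spotlight query returns, candidates are modeled as tuples of monomer labels, "top" is taken to mean the highest HOMO energy, and the only genetic operator is a single-monomer swap. `top_candidates` and `next_generation` are hypothetical names.

```python
import random


def top_candidates(results, count):
    """Return the `count` candidates with the highest HOMO energy.

    `results` maps a candidate (tuple of monomer labels) to its HOMO
    energy in eV; it stands in for the answer to an index query.
    """
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _energy in ranked[:count]]


def next_generation(parents, monomers, size, rng):
    """Create `size` distinct children by swapping one monomer in a random parent.

    `size` must not exceed the number of distinct children reachable
    from `parents`, or the loop would never finish.
    """
    children = set()
    while len(children) < size:
        child = list(rng.choice(parents))
        child[rng.randrange(len(child))] = rng.choice(monomers)
        children.add(tuple(child))
    return sorted(children)


if __name__ == "__main__":
    # Made-up energies, for demonstration only.
    finished = {("A", "B"): -5.2, ("A", "C"): -6.1, ("B", "C"): -4.9}
    parents = top_candidates(finished, 2)
    new_jobs = next_generation(parents, ["A", "B", "C", "D"], 3, random.Random(0))
    print("parents:", parents)
    print("submit:", new_jobs)
```

Because the optimizer only reads finished results from the index, jobs can complete in any order and the GA never has to block on a particular calculation.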
Scaling Up the Polymer Solar Search


   2nd Gen. Search:
   680 Monomers
   2800+ Fragments
   Search Space: 500+ million oligomers
   ~9 minutes per core

   [Plot: LUMO Energy (eV), about -3 to 0, vs. HOMO Energy (eV), about -9.5 to -6.5]
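
A quick back-of-the-envelope calculation shows why exhaustive screening does not scale. The sketch below assumes exactly 500 million oligomers and exactly 9 minutes of one core per oligomer (the slide gives "500+ million" and "~9 minutes"), so the result is a rough lower bound rather than a measured figure. `core_years` is a hypothetical helper.

```python
MINUTES_PER_YEAR = 60 * 24 * 365


def core_years(structures, minutes_each):
    """Total compute, in core-years, to evaluate every structure once."""
    return structures * minutes_each / MINUTES_PER_YEAR


if __name__ == "__main__":
    first_generation = core_years(1_000_000, 9)      # >10^6 structures
    second_generation = core_years(500_000_000, 9)   # 500+ million oligomers
    print(f"10^6 structures:      about {first_generation:.0f} core-years")
    print(f"5 x 10^8 structures:  about {second_generation:.0f} core-years")
    # Even spread over 1,000 cores, the larger search takes years of wall-clock time.
    print(f"on 1,000 cores:       about {second_generation / 1000:.1f} years")
```

Under these assumptions the full second-generation space costs on the order of 8,600 core-years, which is why a guided search such as a genetic algorithm is used instead of enumerating every oligomer.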
Take-Home Messages

   “Big Data” is a Big Headache
   ChemSpotlight & Un-Databases Work!
   Keep data as native files w/separate index
   Integrate into user-friendly tools
   Sell to users: “What’s in it for me?”
    Indexing, retrieval
    Improved workflows
Marcus Hanwell, Pitt / Kitware
Dr. Noel O’Boyle, U.C. Cork, Ireland
Casey Campbell, Pitt (2010)
