Automated Molecular Data Extraction
using Open Babel & ChemSpotlight:
       The Semantic Desktop

         Prof. Geoff Hutchison
         Department of Chemistry
         University of Pittsburgh
         geoffh@pitt.edu


         ACS CINF: Skolnik Symposium
         21 August 2012
         http://hutchison.chem.pitt.edu
“I can plug my iPod into any
computer and it will recognize
my music and give me all sorts
of metadata: artist, title, type of
music...

Why can’t I read the chemical
metadata off my chemistry files?”
— Prof. Henry S. Rzepa (Imperial College)
  Spring 2005 ACS Meeting, San Diego, CA
Pre-History: Chem://Dig


                                Index files, websites
                                Based on Chem MIME
                                Find files on extension
                                Perceive chemistry
                                Database Store
                                Search, Filter
                                Retrieval

    H. Rzepa et al. New J. Chem (2002) 26 p. 656
Open Babel
              Open Babel (Started 2001)
                 Free, open source chemical toolbox
                 Cross-platform: Win, Mac, Linux...
                 Both user-tools & C++ library
                 Interfaces in Python, Perl, Ruby,
                 Java, C#
                 Supports chemistry, bioinformatics,
                 solid-state…
                 100+ file formats and variants

          http://openbabel.org/
    O’Boyle et al. J. Cheminf. 2011, 3:33
Chemical Database?


    1. Some way to store data
         (Organize it)
    2. Index it
    3. Search / filter
    4. Visualize results
ChemSpotlight: Indexing Architecture

  Spotlight + Open Babel + ~300 lines of code

    http://chemspotlight.openmolecules.net/
ChemSpotlight: “Un” Database


      Use the system-wide search database
      No (Visible) Database!
      Index files in-place
      Includes textual data
      (e.g., chemical names, formulas, etc.)
      Multiple retrieval and filtering interfaces
      (i.e., any third-party search tool works)

      http://chemspotlight.openmolecules.net/
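
The "index files in-place" idea can be illustrated with a minimal sketch. This is not ChemSpotlight's code (which plugs into Spotlight itself); it is a toy stand-in that assumes a plain dictionary as the separate index. The extension list and the function names `build_index` and `search` are hypothetical.

```python
import os

# Illustrative subset of chemistry file extensions (an assumption,
# not the list ChemSpotlight actually registers).
CHEM_EXTENSIONS = {".sdf", ".pdb", ".cif", ".mol", ".xyz"}


def build_index(root):
    """Walk `root` and return {path: metadata}; the files themselves are never modified."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in CHEM_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            index[path] = {
                "format": ext.lstrip("."),
                "size": info.st_size,
                "modified": info.st_mtime,
            }
    return index


def search(index, **criteria):
    """Return the sorted paths whose metadata equals every given criterion."""
    return sorted(
        path
        for path, meta in index.items()
        if all(meta.get(key) == value for key, value in criteria.items())
    )
```

For example, `search(build_index("."), format="sdf")` would list the SDF files under the current directory. The data stays as native files; only the index knows where they are.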
So What’s Stored / Perceived
       Formula, mass, SMILES, InChI
       net_sourceforge_openbabel_Formula = C21H36N7O8S

       Fingerprints, number of atoms, bonds, residues
       PDB, SDF keywords, properties
       Calculation keywords:
       kMDItemComment = "Gaussian 09 #n B3LYP/6-31G(d) Opt"

       Calculation results (HOMO, LUMO, Dipole Moment)
       net_sourceforge_chemspotlight_DipoleMoment = 3.5
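
As a rough sketch of how a script might consume these attributes, the snippet below parses "key = value" lines into a dictionary. It assumes the text has already been captured, for instance from the macOS `mdls` tool, whose output uses this general layout. The function name `parse_metadata` is hypothetical, and the sample reuses only the three attributes shown on this slide.

```python
def parse_metadata(text):
    """Parse 'key = value' lines into a dict.

    Quoted values become strings without the quotes, numeric values
    become floats, and anything else is kept as a plain string.
    """
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not key/value pairs
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        else:
            try:
                value = float(value)
            except ValueError:
                pass  # not a number, e.g. a molecular formula
        attrs[key] = value
    return attrs


SAMPLE = """\
net_sourceforge_openbabel_Formula = C21H36N7O8S
kMDItemComment = "Gaussian 09 #n B3LYP/6-31G(d) Opt"
net_sourceforge_chemspotlight_DipoleMoment = 3.5
"""

attrs = parse_metadata(SAMPLE)
```

Once parsed, numeric attributes such as the dipole moment can be compared or sorted directly, while the formula and calculation keywords remain searchable text.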
ChemSpotlight “Un” Database
How Do We Visualize?

   “QuickLook” previews
   New code ~800 lines
   Generate SDF, PDB, CIF
   (if needed)
   Pass off to ChemDoodle
   Web Components
   Pseudo-3D, interactive JS
   + HTML5
   … or SVG generation
   from Open Babel

             http://web.chemdoodle.com/
Organic Heterojunction Solar Cells

  [Figure: device layer stack. Labels, top to bottom: light, transparent electrode, p-type material (+), n-type material (-), reflective electrode; a circuit connects the layers.]
Organic Heterojunction Solar Cells

  [Figure: the device layer stack (light, transparent electrode, p-type material, n-type material, reflective electrode, circuit) beside an energy-level diagram. Diagram labels: optical excitation (hν), ΔE ≥ exciton binding energy, effective heterojunction bandgap, hole-conducting polymer, electron conductor (nanoparticle), cathode, anode, e-, h+.]
Pipeline Model for Finding New Molecules

  Monomers -> >10^6 possible structures
  Computed for each structure:
    Electronic properties (~9 minutes)
    Optical properties
    Synthetic score

J Phys Chem C 2011 vol. 115 pp. 16200       ...
Pipeline Model for Finding New Molecules

  Monomers -> >10^6 possible structures
  Stages, from fast to slower:
    Electronic properties (fast screening, ~9 minutes)
    Optical properties
    Synthetic score (slower)

J Phys Chem C 2011 vol. 115 pp. 16200       ...
New Genetic Algorithm Approach

      Rather than directly driving calculations
      and waiting for the results:
      Check Spotlight for new results
        “What are top HOMO energies?”
      Update GA, generate new candidates,
      submit new jobs
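
One iteration of this loop can be sketched as follows. This is an illustrative sketch, not the group's actual GA. It assumes the HOMO energies have already been fetched from the Spotlight index into a dictionary, that "top" means highest energy, and that each candidate is a tuple of monomer labels. The names `top_by_homo` and `next_generation` and the one-point crossover are all hypothetical choices.

```python
import random


def top_by_homo(records, n):
    """Return the n (candidate, homo_eV) pairs with the highest HOMO energy."""
    return sorted(records.items(), key=lambda item: item[1], reverse=True)[:n]


def next_generation(parents, size, rng):
    """Build `size` new candidates by one-point crossover of two distinct parents.

    Every parent is a tuple of monomer labels; all parents must have the
    same length (at least 2), and at least two parents are required.
    """
    children = []
    for _ in range(size):
        first, second = rng.sample(parents, 2)
        cut = rng.randrange(1, len(first))
        children.append(first[:cut] + second[cut:])
    return children


# Hypothetical results, as if read back from the metadata index.
records = {
    ("A", "B"): -8.1,
    ("C", "D"): -7.2,
    ("E", "F"): -9.0,
}

best = top_by_homo(records, 2)
parents = [candidate for candidate, _energy in best]
new_candidates = next_generation(parents, 4, random.Random(0))
```

The point of the design is decoupling: the GA only asks the index what has finished, so calculations can complete in any order and on any machine that writes its output files where Spotlight can see them.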
Scaling Up the Polymer Solar Search

   2nd Gen. Search:
   680 Monomers
   2800+ Fragments
   Search Space: 500+ million oligomers
   ~9 minutes per core

   [Figure: scatter plot of LUMO Energy (eV), 0 to −3, against HOMO Energy (eV), −9.5 to −6.5.]
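
A back-of-the-envelope estimate shows why exhaustive screening of this space is impractical. The sketch assumes exactly 500 million oligomers and reads "~9 minutes per core" as 9 core-minutes per oligomer; both are simplifications of the figures on the slide.

```python
OLIGOMERS = 500_000_000      # lower bound of the "500+ million" search space
MINUTES_PER_OLIGOMER = 9     # assumed cost of one calculation on one core

core_minutes = OLIGOMERS * MINUTES_PER_OLIGOMER
core_hours = core_minutes / 60
core_years = core_hours / (24 * 365)

print(f"{core_minutes:,} core-minutes")
print(f"{core_hours:,.0f} core-hours")
print(f"{core_years:,.0f} core-years")
```

Under these assumptions the full search costs 4.5 billion core-minutes, or 75 million core-hours, which is roughly 8,560 core-years. A guided search such as a genetic algorithm only has to evaluate a small fraction of the space.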
Take-Home Messages

   “Big Data” is a Big Headache
   ChemSpotlight & Un-Databases Work!
   Keep data as native files w/separate index
   Integrate into user-friendly tools
   Sell to users: “What’s in it for me?”
    Indexing, retrieval
    Improved workflows
Marcus Hanwell, Pitt / Kitware
Dr. Noel O’Boyle, U.C. Cork, Ireland
Casey Campbell, Pitt (2010)