  1. Automated Molecular Data Extraction using Open Babel & ChemSpotlight: The Semantic Desktop Prof. Geoff Hutchison Department of Chemistry University of Pittsburgh geoffh@pitt.edu ACS CINF: Skolnik Symposium 21 August 2012 http://hutchison.chem.pitt.edu
  2. “I can plug my iPod into any computer and it will recognize my music and give me all sorts of metadata: artist, title, type of music... Why can’t I read the chemical metadata off my chemistry files?” - Prof. Henry S. Rzepa (Imperial College) Spring 2005 ACS Meeting, San Diego, CA
  3. Pre-History: Chem://Dig Index files, websites Based on Chem MIME Find files on extension Perceive chemistry Database Store Search, Filter Retrieval H. Rzepa et al. New J. Chem (2002) 26 p. 656
  4. Open Babel Open Babel (Started 2001) Free, open source chemical toolbox Cross-platform: Win, Mac, Linux... Both user-tools & C++ library Interfaces in Python, Perl, Ruby, Java, C# Supports chemistry, bioinformatics, solid-state… 100+ file formats and variants http://openbabel.org/ O’Boyle et al. J. Cheminf. 2011, 3:33
  5. Chemical Database? 1. Some way to store data (Organize it) 2. Index it 3. Search / filter 4. Visualize results
  6. ChemSpotlight: Indexing Architecture Spotlight + Open Babel + ~300 lines of code http://chemspotlight.openmolecules.net/
  7. ChemSpotlight: “Un” Database Use the system-wide search database No (Visible) Database! Index files in-place Includes textual data (e.g., chemical names, formulas, etc.) Multiple retrieval and filtering interfaces (i.e., any third-party search tool works) http://chemspotlight.openmolecules.net/
  8. So What’s Stored / Perceived Formula, mass, SMILES, InChI net_sourceforge_openbabel_Formula = C21H36N7O8S Fingerprints, number of atoms, bonds, residues PDB, SDF keywords, properties Calculation keywords: kMDItemComment = "Gaussian 09 #n B3LYP/6-31G(d) Opt" Calculation results (HOMO, LUMO, Dipole Moment) net_sourceforge_chemspotlight_DipoleMoment = 3.5
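As one hedged illustration of how such attributes might be retrieved, the sketch below builds a Spotlight query for the macOS `mdfind` command-line tool. The attribute names are the ones shown on the slide; the helper names `spotlight_query` and `find_files` and the 3.0 threshold are hypothetical, and it is an assumption that the dipole moment is stored as a number that supports `>` comparisons.

```python
import subprocess

# Attribute name copied from the slide.
DIPOLE = "net_sourceforge_chemspotlight_DipoleMoment"


def spotlight_query(attribute, op, value):
    """Build one Spotlight query clause (hypothetical helper)."""
    if isinstance(value, str):
        # String values are quoted in Spotlight query syntax.
        return f'{attribute} {op} "{value}"'
    return f"{attribute} {op} {value}"


def find_files(query):
    """Run mdfind (macOS only) and return the matching file paths."""
    result = subprocess.run(
        ["mdfind", query], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


# Files whose comment mentions B3LYP and whose dipole moment exceeds 3.0.
query = " && ".join([
    spotlight_query("kMDItemComment", "==", "*B3LYP*"),
    spotlight_query(DIPOLE, ">", 3.0),
])
```

On a Mac with the importer installed, `find_files(query)` would return the matching paths; the query string itself can be inspected on any platform.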
  9. ChemSpotlight “Un” Database
  11. How Do We Visualize? “QuickLook” previews New code ~800 lines Generate SDF, PDB, CIF (if needed) Pass off to ChemDoodle Web Components Pseudo-3D, interactive JS + HTML5 … or SVG generation from Open Babel http://web.chemdoodle.com/
  12. Organic Heterojunction Solar Cells [Figure: device stack under illumination. Labels: light; transparent electrode; p-type material (+); n-type material (-); reflective electrode; circuit]
  13. Organic Heterojunction Solar Cells [Figure: energy-level diagram of the same device. Labels: optical excitation (light, hν); ΔE ≥ exciton binding energy; hole-conducting polymer (p-type material); electron conductor (n-type material, nanoparticle); effective heterojunction bandgap; transparent electrode; reflective electrode; cathode; anode; e-; h+; circuit]
  14. Pipeline Model for Finding New Molecules Monomers → >10⁶ Possible Structures → Electronic Properties (~9 minutes) → Optical Properties → Synthetic Score J Phys Chem C 2011 vol. 115 pp. 16200 ...
  15. Pipeline Model for Finding New Molecules Monomers → >10⁶ Possible Structures → Electronic Properties (~9 minutes) → Optical Properties → Synthetic Score; stages run from Fast Screening to Slower J Phys Chem C 2011 vol. 115 pp. 16200 ...
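The ordering from fast screening to slower steps can be sketched as a chain of filters, so that expensive stages only see candidates that survived the cheap ones. Everything in this sketch is hypothetical: the stage names mirror the slide, but the thresholds, candidate fields, and values are stand-ins, not the calculations used in the cited work.

```python
def staged_screen(candidates, stages):
    """Apply (name, keep) stages in order; return survivors and per-stage counts."""
    counts = {}
    survivors = list(candidates)
    for name, keep in stages:
        # Each stage only evaluates what the previous stage let through.
        survivors = [c for c in survivors if keep(c)]
        counts[name] = len(survivors)
    return survivors, counts


# Hypothetical precomputed values standing in for real calculations.
candidates = [
    {"id": "a", "homo": -5.2, "gap": 1.6, "synth": 0.8},
    {"id": "b", "homo": -6.4, "gap": 1.5, "synth": 0.9},
    {"id": "c", "homo": -5.0, "gap": 2.9, "synth": 0.7},
    {"id": "d", "homo": -5.3, "gap": 1.8, "synth": 0.2},
]

# Cheapest test first, slowest last (thresholds are illustrative only).
stages = [
    ("electronic", lambda c: -5.5 <= c["homo"] <= -4.9),
    ("optical", lambda c: c["gap"] <= 2.0),
    ("synthetic", lambda c: c["synth"] >= 0.5),
]

survivors, counts = staged_screen(candidates, stages)
```

The per-stage counts show how quickly the candidate pool shrinks, which is what makes it affordable to reserve the slower steps for the end.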
  16. New Genetic Algorithm Approach Rather than directly driving calculations & waiting for their results: Check Spotlight for new results “What are top HOMO energies?” Update GA, generate new candidates, submit new jobs
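A minimal sketch of this decoupled loop, under stated assumptions: the Spotlight lookup is replaced by a plain dictionary (`index`), candidates are tuples of made-up monomer labels, "top" is assumed to mean highest HOMO energy, and `mutate` and `ga_step` are hypothetical names. It is not the group's implementation; it only shows the GA reading whatever results have been indexed so far instead of blocking on individual calculations.

```python
import random

MONOMERS = ["A", "B", "C", "D"]  # hypothetical monomer labels


def mutate(candidate, rng):
    """Return a copy of candidate with one position set to a random monomer."""
    pos = rng.randrange(len(candidate))
    child = list(candidate)
    child[pos] = rng.choice(MONOMERS)
    return tuple(child)


def ga_step(index, submitted, rng, n_parents=2, n_children=4):
    """One GA update driven by whatever results are currently indexed.

    index: dict mapping finished candidates to HOMO energies (eV);
           a stand-in for a Spotlight query.
    submitted: set of candidates already submitted as jobs (updated in place).
    Returns the list of new candidates to submit.
    """
    # "What are top HOMO energies?": rank finished results, highest first
    # (the ranking direction is an assumption).
    parents = sorted(index, key=index.get, reverse=True)[:n_parents]
    new_jobs = []
    attempts = 0
    while parents and len(new_jobs) < n_children and attempts < 100:
        attempts += 1
        child = mutate(rng.choice(parents), rng)
        if child not in submitted:
            submitted.add(child)
            new_jobs.append(child)
    return new_jobs


rng = random.Random(0)
index = {("A", "A"): -6.1, ("A", "B"): -5.2, ("C", "D"): -5.6}
submitted = set(index)
jobs = ga_step(index, submitted, rng)
```

In a real workflow the `index` dictionary would be refreshed from the search index before each step, and `jobs` would be handed to a queueing system.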
  17. Scaling Up the Polymer Solar Search 2nd Gen. Search: 680 Monomers, 2800+ Fragments Search Space: 500+ million oligomers ~9 minutes per core [Plot: LUMO Energy (eV) vs. HOMO Energy (eV)]
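For a sense of scale, here is a back-of-the-envelope estimate of exhaustive screening cost using only the figures on the slide. It treats "500+ million" as exactly 500 million and "~9 minutes per core" as 9 minutes per oligomer on one core; both are simplifying assumptions.

```python
oligomers = 500_000_000   # "500+ million oligomers", taken as a lower bound
minutes_each = 9          # "~9 minutes per core", assumed to be per oligomer

total_minutes = oligomers * minutes_each
core_years = total_minutes / (60 * 24 * 365)
print(f"{core_years:,.0f} core-years")
```

Under these assumptions the total is on the order of several thousand core-years, which suggests why a guided search is preferable to exhaustive enumeration.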
  18. Take-Home Messages “Big Data” is a Big Headache ChemSpotlight & Un-Databases Work! Keep data as native files w/separate index Integrate into user-friendly tools Sell to users: “What’s in it for me?” Indexing, retrieval Improved workflows
  19. Marcus Hanwell, Pitt / Kitware; Dr. Noel O’Boyle, U.C. Cork, Ireland; Casey Campbell, Pitt (2010)