Exploring Large Chemical Data Sets

Exploring Large Chemical Data Sets Presentation Transcript

  • 1. Exploring Large Chemical Data Sets: Interactive Analysis and Visualization
    Kyle Lutz and Marcus D. Hanwell
    August 21, 2012, Skolnik Symposium
  • 2. Overview
    ● An open-source, cross-platform cheminformatics tool
    ● A general-purpose tool for chemical data exploration and analysis
    ● Interactive, editable and queryable database of chemical data on the desktop
    ● Part of the Open Chemistry application suite (Avogadro and MoleQueue)
    ● Leverages several open-source projects: Qt, VTK, Chemkit, Open Babel, MongoDB
  • 3. Architecture
    ● Native, cross-platform C++ application built with Qt
    ● Stores chemical data in a NoSQL MongoDB database
    ● Uses VTK for 2D and 3D data set visualization
  • 4. Main Window
  • 5. Molecule Details
  • 6. Queries
    Supports different queries:
    ● Name
    ● Formula
    ● InChI
    ● InChIKey
    ● Structure and Substructure
  • 7. Similarity Searching
  • 8. Charts and Plots
    ● Scatter Plot of Polar Surface Area (TPSA) against Volume (VABC)
    ● Histogram of logP
  • 9. Multidimensional Analysis
    ● Provide tools for viewing and analyzing large amounts of data with multiple dimensions
      ○ Scatter Plot Matrix
      ○ Parallel Coordinates
      ○ K-Means Clustering
    ● Interactive charts supporting selection
    ● Easy to add new chemical descriptors
  • 10. Scatter Plot Matrix: Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
  • 11. Parallel Coordinates: Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
  • 12. K-Means Clustering
    ● ~30 numeric molecular descriptors
    ● 1D, 2D, and 3D visualization
    ● Selection and extraction of molecules from clusters
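The clustering step on this slide can be illustrated with a minimal sketch of Lloyd's algorithm, the standard k-means iteration. This is not the application's own implementation (the slides do not show it); the function name `kmeans`, the two-descriptor rows, and their values are toy assumptions chosen only to make the grouping obvious.

```python
import random


def _nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def kmeans(points, k, iterations=50, seed=0):
    """Cluster equal-length numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input rows
    for _ in range(iterations):
        # Assignment step: group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[_nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]  # keep an empty cluster's centroid
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    labels = [_nearest(p, centroids) for p in points]
    return centroids, labels


# Toy (logP, polar surface area) rows: illustrative values, not measured data.
descriptors = [
    (1.0, 20.0), (1.2, 22.0), (0.8, 19.0),
    (4.0, 90.0), (4.2, 95.0), (3.9, 88.0),
]
centroids, labels = kmeans(descriptors, k=2)
print(labels)
```

With around 30 descriptors on very different scales, each column would normally be standardized before clustering so that no single descriptor dominates the distance.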
  • 13. Similarity Visualization
    ● Similarity Clustering
    ● Calculated from fingerprint similarity or structural similarity
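The slides do not say which fingerprint similarity measure is used. A common choice for binary fingerprints is the Tanimoto coefficient, the number of bits set in both fingerprints divided by the number set in either. The sketch below assumes that measure and stores each fingerprint as a Python integer; the function name and the example bit patterns are hypothetical.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit-vector fingerprints held as ints."""
    shared = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one
    # Two empty fingerprints have no defined ratio; returning 0.0 is a
    # convention chosen for this sketch.
    return shared / total if total else 0.0


# Made-up 6-bit fingerprints: 3 shared bits out of 5 set overall.
similarity = tanimoto(0b101101, 0b100111)
print(f"{similarity:.0%}")
```

A pairwise matrix of such scores is the usual input to a similarity clustering or to a graph view in which edges are labeled with percentages.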
  • 14. Similarity Visualization (figure; similarity labels shown: 60%, 30%, 45%)
  • 15. ChemicalJSON Example: ethane.cjson
    ● JSON (JavaScript Object Notation) is a "lightweight data-interchange format"
    ● Store molecular structure, geometry, identifiers and descriptors all as a single JSON object
    ● Benefits:
      ○ More compact than XML/CML
      ○ Native language of MongoDB and JSON-RPC
      ○ Easily converted to a binary representation (BSON)
    Specification available at: http://wiki.openchemistry.org/Chemical_JSON
  • 16. ChemicalJSON in MongoDB
    ● Nearly identical to what is stored in a file
      ○ A few extra fields stored
        ■ 2D diagram (as PNG)
        ■ Heavy atom count (for substructure searching)
        ■ Binary fingerprints (for similarity searching)
        ■ InChIKey for indexing and as a unique key
        ■ Mongo's OID ("_id") field
    ● Trivial to write out to a .cjson file:
      db.molecules.find({"name" : "ethanol"}, {"diagram" : 0, "heavyAtomCount" : 0, "fp2_fingerprint" : 0, "_id" : 0})
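The second argument of the `find` call above is an exclusion projection: fields mapped to 0 are left out of the returned document. The sketch below emulates that effect in plain Python, so it runs without a MongoDB server. The excluded field names come from the slide; the `stored` document, its `formula` field, its placeholder values, and the helper name `to_cjson` are assumptions for illustration.

```python
import json

# Database-only fields excluded by the projection shown on the slide.
EXCLUDED_FIELDS = ("diagram", "heavyAtomCount", "fp2_fingerprint", "_id")


def to_cjson(document: dict) -> str:
    """Drop the database-only fields and serialize the rest as JSON text."""
    exported = {
        key: value
        for key, value in document.items()
        if key not in EXCLUDED_FIELDS
    }
    return json.dumps(exported, indent=2)


# Hypothetical stored record; real records carry the full molecular structure.
stored = {
    "_id": "placeholder-object-id",
    "name": "ethanol",
    "formula": "C2H6O",
    "heavyAtomCount": 3,
    "fp2_fingerprint": "placeholder-fingerprint",
    "diagram": "placeholder-png",
}

print(to_cjson(stored))
```

Against a live database the same filtering happens server-side, which avoids transferring the PNG diagram and fingerprint bytes at all.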
  • 17. Open Chemistry with ParaViewWeb
    ● Uses ParaView's client-server architecture
    ● Interactive 3D rendering
    ● Runs in any modern web browser
    URL: http://paraviewweb.kitware.com/OpenChemistry/
  • 18. Open Chemistry with ParaViewWeb ChemData
  • 19. RPC / Avogadro Integration
    ● Uses JSON-RPC to communicate with other applications (most notably Avogadro)
    ● Visualize data directly from the database
    ● Uses ChemicalJSON to represent molecular structures and transfer molecular information
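As a rough illustration of the message shape involved, the sketch below builds a JSON-RPC 2.0 request whose parameters carry a molecule object and unpacks a reply. The slides specify neither the protocol version nor any method names, so the `loadMolecule` method, the parameter layout, and the molecule fields are all hypothetical; only the `jsonrpc`, `method`, `params`, `id`, `result`, and `error` members come from the JSON-RPC 2.0 specification.

```python
import itertools
import json

_request_ids = itertools.count(1)  # each request gets a fresh id


def make_request(method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request object."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": next(_request_ids),
    })


def parse_response(raw: str):
    """Return the result of a JSON-RPC 2.0 response or raise on error."""
    reply = json.loads(raw)
    if "error" in reply:
        error = reply["error"]
        raise RuntimeError(f"RPC error {error['code']}: {error['message']}")
    return reply["result"]


# Hypothetical call: method name and molecule layout are assumptions.
request = make_request("loadMolecule", {"molecule": {"name": "ethane"}})
print(request)
print(parse_response('{"jsonrpc": "2.0", "result": "ok", "id": 1}'))
```

Because both the transport payload and the stored documents are JSON, a molecule can pass from the database to another application without an intermediate file format.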
  • 20. Future Directions
    ● Direct integration with third-party databases (PubChem, PDB, ...)
    ● Broader support for storing and analyzing computational job results
      ○ Linked with molecular structures
      ○ Direct from CML or converted/parsed
    ● Plugins to facilitate extension
      ○ Descriptors
      ○ Visualization
      ○ Chemical file input/output
    ● Scaling studies, working with multiple data servers and terabytes of data
  • 21. Comments/Questions?
    ● Home Page: http://wiki.openchemistry.org/ChemData
    ● Source Code: https://github.com/OpenChemistry/chemdata
    ● ParaViewWeb Demo: http://paraviewweb.kitware.com/OpenChemistry