Exploring Large Chemical Data Sets

  1. Exploring Large Chemical Data Sets: Interactive Analysis and Visualization
     Kyle Lutz and Marcus D. Hanwell
     August 21, 2012, Skolnik Symposium
  2. Overview
     ● An open-source, cross-platform cheminformatics tool
     ● A general-purpose tool for chemical data exploration and analysis
     ● Interactive, editable and queryable database of chemical data on the desktop
     ● Part of the Open Chemistry application suite (Avogadro and MoleQueue)
     ● Leverages several open-source projects: Qt, VTK, Chemkit, Open Babel, MongoDB
  3. Architecture
     ● Native, cross-platform C++ application built with Qt
     ● Stores chemical data in a NoSQL MongoDB database
     ● Uses VTK for 2D and 3D data set visualization
  4. Main Window
  5. Molecule Details
  6. Queries
     Supports different queries:
     ● Name
     ● Formula
     ● InChI
     ● InChIKey
     ● Structure and Substructure
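
The field-based queries listed on this slide (name, formula, InChI) amount to exact-match filters on stored documents. A minimal sketch of that idea follows, using an in-memory list as a stand-in for the MongoDB collection. The `find` helper and the `formula` and `inchi` field names are assumptions made for illustration; only the `name` field appears in the slides. Structure and substructure queries need a cheminformatics toolkit and are not covered here.

```python
# Hypothetical in-memory stand-in for the molecules collection.
# Field names other than "name" are assumed for illustration.
MOLECULES = [
    {"name": "ethanol", "formula": "C2H6O",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"name": "ethane", "formula": "C2H6",
     "inchi": "InChI=1S/C2H6/c1-2/h1-2H3"},
    {"name": "dimethyl ether", "formula": "C2H6O",
     "inchi": "InChI=1S/C2H6O/c1-3-2/h1-2H3"},
]


def find(collection, query):
    """Return documents whose fields equal every key/value pair in query."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in query.items())
    ]


by_name = find(MOLECULES, {"name": "ethanol"})
by_formula = find(MOLECULES, {"formula": "C2H6O"})
by_inchi = find(MOLECULES, {"inchi": "InChI=1S/C2H6/c1-2/h1-2H3"})

print([doc["name"] for doc in by_formula])
```

A formula query can match several isomers, whereas an InChI query identifies a single structure, which is one reason to offer both.
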
  7. Similarity Searching
  8. Charts and Plots
     ● Scatter Plot of Polar Surface Area (TPSA) against Volume (VABC)
     ● Histogram of logP
  9. Multidimensional Analysis
     ● Provide tools for viewing and analyzing large amounts of data with multiple dimensions
       ○ Scatter Plot Matrix
       ○ Parallel Coordinates
       ○ K-Means Clustering
     ● Interactive charts supporting selection
     ● Easy to add new chemical descriptors
  10. Scatter Plot Matrix
      Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
  11. Parallel Coordinates
      Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
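
Descriptors such as polar surface area, logP and mass live on very different numeric ranges, so a parallel coordinates plot usually places each one on its own axis scaled to a common height. One common way to do that is per-column min-max scaling, sketched below. The slides do not say how the application scales its axes, and the descriptor values here are made-up illustrative numbers, not measurements.

```python
def min_max_scale(rows):
    """Scale each column of rows to the range [0, 1].

    A constant column has no spread, so its values are placed at 0.5.
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.5
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Illustrative values only: one row per molecule,
# columns standing in for (polar surface area, logP, mass).
rows = [
    [20.0, -0.5, 46.0],
    [0.0, 1.5, 30.0],
    [40.0, 0.5, 62.0],
]

scaled = min_max_scale(rows)
for row in scaled:
    print(row)
```

Each scaled row gives the heights at which one molecule's polyline crosses the axes.
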
  12. K-Means Clustering
      ● ~30 numeric molecular descriptors
      ● 1D, 2D, and 3D visualization
      ● Selection and extraction of molecules from clusters
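
For readers unfamiliar with k-means, the sketch below shows the standard Lloyd iteration on a handful of 2D points: assign each point to its nearest centroid, then move each centroid to the mean of its members. It is a generic illustration, not the application's implementation; the points are made-up values standing in for two descriptors, and seeding the centroids from the first `k` points is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points with Lloyd's algorithm; returns (assignments, centroids)."""
    # Simplified deterministic seeding: use the first k points.
    centroids = [list(point) for point in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Illustrative descriptor pairs forming two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
assignments, centroids = kmeans(points, k=2)

# "Selection and extraction": pull out the members of one cluster.
cluster_0 = [p for p, a in zip(points, assignments) if a == 0]
print(assignments, cluster_0)
```

With around 30 descriptors the same code applies unchanged, since `math.dist` works in any number of dimensions.
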
  13. Similarity Visualization
      ● Similarity Clustering
      ● Calculated from fingerprint similarity or structural similarity
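
The slides do not name the similarity measure. The Tanimoto coefficient is one widely used choice for binary fingerprints: the number of bits set in both fingerprints divided by the number set in either. The sketch below assumes fingerprints are held as Python integers, which is an illustrative representation rather than the application's storage format.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both
    total = bin(fp_a | fp_b).count("1")   # bits set in either
    # Two empty fingerprints are treated as identical here (a convention).
    return common / total if total else 1.0


# Made-up 8-bit fingerprints for illustration.
fp_a = 0b11110000
fp_b = 0b11001100

print(round(tanimoto(fp_a, fp_b), 3))
print(tanimoto(fp_a, fp_a))
```

A similarity threshold can then be applied to such scores to decide which molecules count as neighbours when clustering.
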
  14. Similarity Visualization
      (figure panels labeled 60%, 30%, and 45%)
  15. ChemicalJSON (example: ethane.cjson)
      ● JSON (JavaScript Object Notation) is a "lightweight data-interchange format"
      ● Stores molecular structure, geometry, identifiers and descriptors all as a single JSON object
      ● Benefits:
        ○ More compact than XML/CML
        ○ Native language of MongoDB and JSON-RPC
        ○ Easily converted to a binary representation (BSON)
      Specification available at: http://wiki.openchemistry.org/Chemical_JSON
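
The slide's ethane.cjson example did not survive in this transcript. The sketch below only illustrates the general idea of keeping structure, identifiers and descriptors in a single JSON object. Every key name in it is an assumption made for illustration and should not be read as the ChemicalJSON schema; the specification linked above is the authority on the real layout. Geometry is left out rather than invented.

```python
import json

# Hypothetical layout: key names are illustrative, not the ChemicalJSON schema.
ethane = {
    "name": "ethane",
    "formula": "C2H6",
    "inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
    "atoms": {
        # Two carbons followed by six hydrogens.
        "elements": ["C", "C", "H", "H", "H", "H", "H", "H"],
    },
    "bonds": {
        # Pairs of zero-based atom indices: one C-C bond and six C-H bonds.
        "connections": [[0, 1], [0, 2], [0, 3], [0, 4], [1, 5], [1, 6], [1, 7]],
        "order": [1, 1, 1, 1, 1, 1, 1],
    },
}

# Serialize to text (what a .cjson file would hold) and read it back.
text = json.dumps(ethane, indent=2)
restored = json.loads(text)

print(text)
```

Because the whole molecule is one object, the same text can be written to a file, stored as a database document, or sent in a message without conversion.
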
  16. ChemicalJSON in MongoDB
      ● Nearly identical to what is stored in a file
        ○ A few extra fields stored
          ■ 2D diagram (as PNG)
          ■ Heavy atom count (for substructure searching)
          ■ Binary fingerprints (for similarity searching)
          ■ InChIKey for indexing and as a unique key
          ■ MongoDB's OID ("_id") field
      ● Trivial to write out to a .cjson file:
        db.molecules.find({"name" : "ethanol"}, {"diagram" : 0, "heavyAtomCount" : 0, "fp2_fingerprint" : 0, "_id" : 0})
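
The second argument of the `find` call on this slide is a projection: fields set to 0 are dropped from the returned document, leaving the part that belongs in a .cjson file. The sketch below mimics that exclusion on a plain dictionary so it runs without a database. The four excluded field names come from the slide; the stored document's values and the `to_cjson` helper are placeholders invented for illustration.

```python
import json

# Database-only fields, as named in the slide's projection.
EXTRA_FIELDS = ("diagram", "heavyAtomCount", "fp2_fingerprint", "_id")


def to_cjson(document):
    """Return a copy of document without the database-only fields."""
    return {key: value for key, value in document.items()
            if key not in EXTRA_FIELDS}


# Placeholder stored document; values are stand-ins, not real data.
stored = {
    "_id": "<object id>",
    "name": "ethanol",
    "formula": "C2H6O",
    "diagram": "<PNG bytes>",
    "heavyAtomCount": 3,
    "fp2_fingerprint": "<binary fingerprint>",
}

exported = to_cjson(stored)
cjson_text = json.dumps(exported, indent=2)
print(cjson_text)
```

Writing `cjson_text` to a file would complete the export; the stored document itself is left unchanged.
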
  17. Open Chemistry with ParaViewWeb
      ● Uses ParaView's client-server architecture
      ● Interactive 3D rendering
      ● Runs in any modern web browser
      URL: http://paraviewweb.kitware.com/OpenChemistry/
  18. Open Chemistry with ParaViewWeb
      (screenshot: ChemData)
  19. RPC / Avogadro Integration
      ● Uses JSON-RPC to communicate with other applications (most notably Avogadro)
      ● Visualize data directly from the database
      ● Uses ChemicalJSON to represent molecular structures and transfer molecular information
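
A JSON-RPC exchange is a pair of small JSON messages: a request naming a method and its parameters, and a response carrying either a result or an error, matched by `id`. The sketch below builds and parses messages in the JSON-RPC 2.0 shape. The slides do not state which protocol version the applications use, and the `loadMolecule` method name and its parameters are hypothetical, not part of any documented Avogadro or ChemData interface.

```python
import json


def make_request(method, params, request_id):
    """Serialize a JSON-RPC 2.0 request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    })


def parse_response(text, request_id):
    """Return the result of a JSON-RPC 2.0 response, or raise on failure."""
    response = json.loads(text)
    if response.get("id") != request_id:
        raise ValueError("response id does not match request id")
    if "error" in response:
        raise RuntimeError(response["error"]["message"])
    return response["result"]


# Hypothetical method and parameters: the molecule travels as a JSON object.
request = make_request("loadMolecule", {"molecule": {"name": "ethane"}}, 1)
print(request)

# A response the receiving application might send back.
result = parse_response('{"jsonrpc": "2.0", "result": true, "id": 1}', 1)
print(result)
```

Since both the envelope and the molecule are JSON, the molecular data can be embedded in `params` directly without an extra encoding step.
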
  20. Future Directions
      ● Direct integration with 3rd party databases (PubChem, PDB, ...)
      ● Broader support for storing and analyzing computational job results
        ○ Linked with molecular structures
        ○ Direct from CML or converted/parsed
      ● Plugins to facilitate extension
        ○ Descriptors
        ○ Visualization
        ○ Chemical file input/output
      ● Scaling studies, working with multiple data servers and terabytes of data
  21. Comments/Questions?
      Home Page: http://wiki.openchemistry.org/ChemData
      Source Code: https://github.com/OpenChemistry/chemdata
      ParaViewWeb Demo: http://paraviewweb.kitware.com/OpenChemistry