Exploring Large Chemical Data Sets
Interactive Analysis and Visualization

Kyle Lutz and Marcus D. Hanwell

August 21, 2012
Skolnik Symposium
Overview
● An open-source, cross-platform
  cheminformatics tool
● A general-purpose tool for chemical data
  exploration and analysis
● Interactive, editable and queryable
  database of chemical data on the desktop
● Part of the Open Chemistry application
  suite (with Avogadro and MoleQueue)
● Leverages several open-source projects:
  Qt, VTK, Chemkit, Open Babel, MongoDB
Architecture
● Native, cross-platform C++ application built with Qt
● Stores chemical data in a NoSQL MongoDB database
● Uses VTK for 2D and 3D data set visualization
Main Window
Molecule Details
Queries
Supports several query types:
● Name
● Formula
● InChI
● InChIKey
● Structure and Substructure
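As a rough illustration of how the exact-match query types could map onto a document store, the sketch below builds MongoDB-style filter documents as plain Python dictionaries and applies them to an in-memory list. Only "name" appears as a field in this deck's own find() call; the other field names ("formula", "inchi", "inchikey") and both helper functions are assumptions. Structure and substructure queries need a cheminformatics toolkit and are not covered.

```python
# Hypothetical mapping from query type to document field (field names assumed).
QUERY_FIELDS = {
    "name": "name",
    "formula": "formula",
    "inchi": "inchi",
    "inchikey": "inchikey",
}

def build_filter(kind, value):
    """Return a MongoDB-style equality filter for one query type."""
    if kind not in QUERY_FIELDS:
        raise ValueError(f"unsupported query type: {kind}")
    return {QUERY_FIELDS[kind]: value}

def find(documents, query_filter):
    """Return the documents whose fields equal every value in the filter."""
    return [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in query_filter.items())
    ]

# Small in-memory stand-in for the molecules collection.
molecules = [
    {"name": "ethanol", "formula": "C2H6O",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"name": "ethane", "formula": "C2H6",
     "inchi": "InChI=1S/C2H6/c1-2/h1-2H3"},
    {"name": "dimethyl ether", "formula": "C2H6O",
     "inchi": "InChI=1S/C2H6O/c1-3-2/h1-2H3"},
]

by_name = find(molecules, build_filter("name", "ethanol"))
by_formula = find(molecules, build_filter("formula", "C2H6O"))
```

A formula query can return several isomers, while an InChI query identifies a single structure.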
Similarity Searching
Charts and Plots
Figure: Scatter Plot of Polar Surface Area (TPSA) against Volume (VABC)
Figure: Histogram of logP
Multidimensional Analysis
● Provide tools for viewing and analyzing large
  amounts of data with multiple dimensions
   ○ Scatter Plot Matrix
   ○ Parallel Coordinates
   ○ K-Means Clustering
● Interactive charts supporting selection
● Easy to add new chemical descriptors
Scatter Plot Matrix
Figure: Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
Parallel Coordinates
Figure: Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
K-Means Clustering
● ~30 numeric molecular descriptors
● 1D, 2D, and 3D visualization
● Selection and extraction of molecules from clusters
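For readers unfamiliar with the technique, here is a minimal sketch of k-means (Lloyd's algorithm) over descriptor vectors in plain Python. It is not the application's implementation, which the slides do not describe. The descriptor values are made up, and taking the first k points as initial centroids is a simplification chosen to keep the example deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster points (lists of floats) with Lloyd's algorithm.

    Initial centroids are simply the first k points, which keeps the
    sketch deterministic; real tools usually choose them more carefully.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up (logP, polar surface area) pairs, for illustration only.
descriptors = [
    [0.5, 20.0], [0.7, 22.0], [0.6, 19.0],
    [3.9, 80.0], [4.1, 85.0], [4.0, 78.0],
]
labels, centers = kmeans(descriptors, k=2)
```

In practice descriptors on very different scales are usually normalized first, so that no single descriptor dominates the distance.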
Similarity Visualization
● Similarity Clustering
● Calculated from fingerprint similarity or structural
  similarity
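The slides do not say which similarity measure is used. A common choice for binary fingerprints is the Tanimoto coefficient, sketched below under that assumption, with fingerprints held as Python integers and toy 8-bit values standing in for real fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two bit fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in either fingerprint
    # Two empty fingerprints are treated as identical in this sketch.
    return common / total if total else 1.0

# Toy 8-bit fingerprints; real fingerprints are far longer.
fp1 = 0b10110100
fp2 = 0b10010110
score = tanimoto(fp1, fp2)
```

The two toy fingerprints share 3 set bits out of 5 set in either, so the score is 0.6.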
Similarity Visualization
Figure: example similarity values of 30%, 45%, and 60%
ChemicalJSON
● JSON (JavaScript Object Notation) is
  a "lightweight data-interchange
  format"
● Stores molecular structure, geometry,
  identifiers and descriptors all as a
  single JSON object
● Benefits:
   ○ More compact than XML/CML
   ○ Native format of MongoDB and
     JSON-RPC
   ○ Easily converted to a binary
     representation (BSON)
Example shown: ethane.cjson
Specification available at: http://wiki.openchemistry.org/Chemical_JSON
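To make the single-object idea concrete, the sketch below builds a small JSON document for ethane and round-trips it through Python's json module. The key layout is an illustrative assumption and is not copied from the specification, which should be consulted for the actual schema. Geometry is omitted.

```python
import json

# Ethane: two carbons followed by six hydrogens (atomic numbers).
# Key names below are assumed for illustration only.
ethane = {
    "name": "ethane",
    "formula": "C2H6",
    "inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
    "atoms": {"elements": {"number": [6, 6, 1, 1, 1, 1, 1, 1]}},
    "bonds": {
        # Flat list of atom-index pairs: (0,1), (0,2), (0,3), ...
        "connections": {"index": [0, 1, 0, 2, 0, 3, 0, 4, 1, 5, 1, 6, 1, 7]},
        # One bond order per pair; all single bonds in ethane.
        "order": [1, 1, 1, 1, 1, 1, 1],
    },
}

text = json.dumps(ethane, indent=2)  # what would be written to a file
restored = json.loads(text)          # what a reader would get back
```

Because the whole molecule is one object, identifiers and descriptors can be added as extra keys without changing how the structure is stored.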
ChemicalJSON in MongoDB
● Nearly identical to what is stored in a file
   ○ A few extra fields stored
     ■ 2D diagram (as PNG)
     ■ Heavy atom count (for substructure searching)
     ■ Binary fingerprints (for similarity searching)
     ■ InChIKey for indexing and as a unique key
     ■ Mongo's OID ("_id") field
● Trivial to write out to a .cjson file:
     db.molecules.find({"name" : "ethanol"},
                       {"diagram" : 0,
                        "heavyAtomCount" : 0,
                        "fp2_fingerprint" : 0,
                        "_id" : 0})
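The projection in the query above excludes the database-only fields. The same step can be sketched without a database server, as below: a stored record is stripped of those fields and written to a .cjson file. The record contents are placeholders and to_cjson is a hypothetical helper; only the four excluded field names come from the query.

```python
import json
import os
import tempfile

# Fields named in the query above that exist only in the database copy.
DATABASE_ONLY_FIELDS = ("diagram", "heavyAtomCount", "fp2_fingerprint", "_id")

def to_cjson(document):
    """Return a copy of a stored document without the database-only fields."""
    return {k: v for k, v in document.items() if k not in DATABASE_ONLY_FIELDS}

# Hypothetical stored record; all values are placeholders.
stored = {
    "_id": "placeholder-object-id",
    "name": "ethanol",
    "formula": "C2H6O",
    "heavyAtomCount": 3,
    "fp2_fingerprint": "placeholder",
    "diagram": "placeholder",
}

# Write the stripped record to a temporary .cjson file and read it back.
path = os.path.join(tempfile.mkdtemp(), "ethanol.cjson")
with open(path, "w") as handle:
    json.dump(to_cjson(stored), handle, indent=2)
with open(path) as handle:
    written = json.load(handle)
```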
Open Chemistry with ParaViewWeb
● Uses ParaView's client-server architecture
● Interactive 3D rendering
● Runs in any modern web browser
URL: http://paraviewweb.kitware.com/OpenChemistry/
Open Chemistry with ParaViewWeb
    ChemData
RPC / Avogadro Integration
● Uses JSON-RPC to communicate with other
  applications (most notably Avogadro)
● Visualize data directly from the database
● Uses ChemicalJSON to represent molecular
  structures and transfer molecular information
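As a sketch of what such an exchange might look like, the code below wraps a molecule in a JSON-RPC request and builds the matching response. It assumes JSON-RPC 2.0 framing. The method name "loadMolecule" and the "chemicalJson" parameter are hypothetical and are not taken from Avogadro's actual interface.

```python
import json

def make_request(method, params, request_id):
    """Build a JSON-RPC 2.0 request object as a JSON string."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    )

def make_response(request_text, result):
    """Build the matching success response, echoing the request id."""
    request = json.loads(request_text)
    return json.dumps({"jsonrpc": "2.0", "result": result, "id": request["id"]})

molecule = {"name": "ethane", "inchi": "InChI=1S/C2H6/c1-2/h1-2H3"}

# "loadMolecule" is a hypothetical method name, not a documented Avogadro call.
request = make_request("loadMolecule", {"chemicalJson": molecule}, request_id=1)
response = make_response(request, result="ok")
```

Since the molecule is already JSON, it can be embedded in the request parameters without any conversion step.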
Future Directions
● Direct integration with 3rd party databases
  (PubChem, PDB, ...)
● Broader support for storing and analyzing
  computational job results
   ○ Linked with molecular structures
   ○ Direct from CML or converted/parsed
● Plugins to facilitate extension
   ○ Descriptors
   ○ Visualization
   ○ Chemical file input/output
● Scaling studies, working with multiple data
  servers and terabytes of data
Comments/Questions?
Home Page: http://wiki.openchemistry.org/ChemData
Source Code: https://github.com/OpenChemistry/chemdata
ParaViewWeb Demo: http://paraviewweb.kitware.com/OpenChemistry
