Connecting Chemistry Across the Internet Using ChemSpider

Antony J Williams and Valery Tkachenko
SERMACS, November 15th 2012
Chemistry Data and the Weeds
Tell me about Roundup
So what is Round Up?
The World’s Encyclopedia
Roundup
Where do we Round Up data?
   Where can I find the molfile for Roundup?
   Papers/Patents about Roundup?
   What are the side effects of Roundup?
   Where can I order Roundup?
   What are the physicochemical properties?
   Metabolic pathways?
   Different synonyms of Roundup?
   Synthesis of Roundup?
   Etc….
Where do I Round Up Data?
In an increasing LinkedData map….
But I want to aggregate data? So…
ChemSpider
 Takes on the role of a structure centric hub:

   Connecting, validating, qualifying data
   Enhancing data with connections to services
   Provides access to data and services for others
    to use (Thermo, Agilent, Bruker, Waters,
    ACD/Labs, Accelrys, etc.)
   Uses available services to integrate, connect
    and enhance the offering
Roundup on ChemSpider
What will ChemSpider give us?
ChemSpider is Collapsing Data???
For Glyphosate itself
How did we build it?
 We deal in Molfiles or SDF files – with coordinates
 Deposit anything that has an InChI – we support
  what InChI can handle, good and bad
 Standardization based on “InChI standardization”
 InChIs aggregate (certain) tautomers

 How much of ChemSpider is “on ChemSpider”?
Connecting Chemistry across the web
 So much of what is seen on ChemSpider is
  retrieved in real time using services
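
As a rough illustration of assembling a compound page from services at request time, the sketch below calls several services in parallel and tolerates one of them failing. The three service functions, their names, and the `build_page` helper are hypothetical stand-ins, not ChemSpider's actual services or API.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for remote services; real ones would make network calls.
def predicted_properties(compound_id):
    return f"predicted properties for {compound_id}"


def literature_links(compound_id):
    return f"literature links for {compound_id}"


def vendor_listing(compound_id):
    # Simulates a service that does not answer.
    raise TimeoutError("vendor service did not answer")


SERVICES = {
    "properties": predicted_properties,
    "literature": literature_links,
    "vendors": vendor_listing,
}


def build_page(compound_id, timeout_s=2.0):
    """Call every service in parallel and collect one panel per service."""
    page = {}
    with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
        futures = {
            name: pool.submit(call, compound_id)
            for name, call in SERVICES.items()
        }
        for name, future in futures.items():
            try:
                page[name] = future.result(timeout=timeout_s)
            except Exception as exc:
                # A failing service costs one panel, not the whole page.
                page[name] = f"unavailable ({exc})"
    return page


page = build_page("example-compound")
```

The point of the design is that nothing fetched this way has to be stored in the hub itself; the page is only as complete as the services that respond.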
Online Predictions
A Comment on Quality
 For >28 million chemical compounds there are
  some errors:

     “Incorrect” structure representations
     Mismatched name-structure relationships
     Experimental properties (the values, the units)
     Real vs. virtual compounds – text-mining and
      conversion

   We have deprecated a LOT of data…
Downsides of InChI

 Good for small molecules – but no polymers,
  issues with inorganics, organometallics, imperfect
  stereochemistry. ChemSpider is “small molecules”

 InChI used as the “deduplicator” – FIRST version
  of a compound into the database becomes THE
  structure to deduplicate against…
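
The "first in wins" behaviour described above can be sketched as a registry keyed by the InChI string. This is a minimal illustration under assumed names (`Registry`, `CompoundRecord`, `deposit`, and the source labels are hypothetical), not the actual ChemSpider deposition code; the molfile values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """One deduplicated compound: the first structure deposited, plus all sources."""
    inchi: str
    molfile: str  # the FIRST molfile seen is kept and drives the depiction
    sources: list = field(default_factory=list)


class Registry:
    """Hypothetical registry that deduplicates deposits by InChI string."""

    def __init__(self):
        self._by_inchi = {}

    def deposit(self, source, inchi, molfile):
        record = self._by_inchi.get(inchi)
        if record is None:
            # First deposit of this InChI becomes THE structure.
            record = CompoundRecord(inchi=inchi, molfile=molfile)
            self._by_inchi[inchi] = record
        # Later deposits only add a source; their molfile is not used.
        record.sources.append(source)
        return record

    def __len__(self):
        return len(self._by_inchi)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

registry = Registry()
registry.deposit("vendor_a", ETHANOL, "<molfile from vendor_a>")
merged = registry.deposit("vendor_b", ETHANOL, "<molfile from vendor_b>")
```

Here both deposits collapse into one record, and any layout problem in the first molfile is inherited by every later deposit of the same InChI.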
Side Effects of InChI Usage
SMILES by comparison…
Standardization Issues
Depiction based on molfile
Downsides of Overall Approach
 Meshing data together based on InChIs worked
  for simple molecules

 2D layout errors inherited or limited by algorithm

 Complex molecules that are meant to be the
  same thing were NOT deduplicated. Compounds
  differing by one stereocenter, named the same,
  meant to be the same, are not the same
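
To see why records differing by one stereocenter do not merge, compare the full InChI strings with and without their stereochemistry layers. The sketch below is only an illustration: `strip_stereo_layers` is a hypothetical helper, and dropping the `/b`, `/t`, `/m` and `/s` layers is one assumed way to flag candidates for review, not a method the presentation describes.

```python
# Layer prefixes that carry stereochemistry in an InChI string.
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi):
    """Return the InChI with its stereochemistry layers removed."""
    version, formula, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([version, formula, *kept])


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

# Exact InChI matching treats the two enantiomers as different compounds.
exact_match = L_ALANINE == D_ALANINE

# Without the stereo layers they share one connectivity skeleton.
same_skeleton = strip_stereo_layers(L_ALANINE) == strip_stereo_layers(D_ALANINE)
```

Exact matching keeps the two apart, which is correct for true enantiomers but also keeps apart two depositions of a complex molecule where one source simply drew a stereocenter wrongly.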
So much data online is “erroneous”
The confusion of name-structures
Collapsing Data – Standardization
What needs to happen?
 If we could validate
    Catch errors in databases (and clean)
    Proactively catch errors in publications/patents
    Reduce junk in the ether – improve QUALITY!

 If we collectively standardized
    Interlinking between databases should improve


   CVSP – a separate presentation….stick around
Crowdsourcing ChemSpider

 ChemSpider is crowdsourced

 Community deposition, annotation
  and curation

 Anyone can “Leave Feedback”

 Registered users can add data
ChemSpider and Global Chemistry Hub
                      Internet Data




 Small organic molecules              Commercial Software
 Undefined materials                  Pre-competitive Data
 Organometallics                            Open Science
 Nanomaterials                                 Open Data
 Polymers                                      Publishers
 Minerals                                      Educators
 Particle bound                           Open Databases
 Links to Biologicals                   Chemical Vendors
Delivering a Prediction Platform
 Experimental data will be used as the basis of
  model generation – a predictive platform…
The Future of ChemSpider
 Continued focus on quality over quantity –
  but more data is good too!
 ChemSpider Reactions – work in progress
  and includes >300,000 reactions
 Plugging in a validation and standardization
  platform
 Delivering personal and institutional
  repository capabilities
Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
