Where do we Round Up data?
- Where can I find the molfile for Roundup?
- Which papers and patents discuss Roundup?
- What are the side effects of Roundup?
- Where can I order Roundup?
- What are its physicochemical properties?
- What are its metabolic pathways?
- What synonyms does Roundup have?
- How is Roundup synthesized?
- Etc.
ChemSpider
Takes on the role of a structure-centric hub:
- Connecting, validating, and qualifying data
- Enhancing data with connections to services
- Providing access to data and services for others to use (Thermo, Agilent, Bruker, Waters, ACD/Labs, Accelrys, etc.)
- Using available services to integrate, connect, and enhance the offering
How did we build it?
- We deal in Molfiles or SDF files, with coordinates
- Deposit anything that has an InChI: we support what InChI can handle, good and bad
- Standardization based on “InChI standardization”
- InChIs aggregate (certain) tautomers
- How much of ChemSpider is “on ChemSpider”?
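As an illustration of what "Molfiles or SDF files, with coordinates" means in practice, here is a minimal sketch in plain Python. It is not ChemSpider's deposition code. It assumes V2000 molfiles inside an SDF whose records end with a `$$$$` line. The water record and the helper names (`split_sdf`, `counts`, `read_atoms`) are invented for this example.

```python
# Hand-written toy SDF with a single water record (V2000 molfile + "$$$$").
TOY_SDF = "\n".join([
    "water",
    "  toy-example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
    "$$$$",
])


def split_sdf(text):
    """Split SDF text into records; each record is terminated by a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks the terminator, if it has any content.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records


def counts(record):
    """Return (atom_count, bond_count) from the V2000 counts line (4th line)."""
    counts_line = record.splitlines()[3]
    return int(counts_line[0:3]), int(counts_line[3:6])


def read_atoms(record):
    """Return [(symbol, x, y, z), ...] from the fixed-width V2000 atom block."""
    lines = record.splitlines()
    n_atoms, _ = counts(record)
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))
    return atoms


if __name__ == "__main__":
    for rec in split_sdf(TOY_SDF):
        print(rec.splitlines()[0], counts(rec), read_atoms(rec))
```

The coordinates are part of the record itself, which is why the 2D layout a depositor supplies travels with the structure.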
Connecting Chemistry across the web
So much of what is seen on ChemSpider is retrieved in real time using services.
A Comment on Quality
For >28 million chemical compounds there are some errors:
- “Incorrect” structure representations
- Mismatched name-structure relationships
- Experimental properties (the values, the units)
- Real vs. virtual compounds: text-mining and conversion
We have deprecated a LOT of data…
Downsides of InChI
- Good for small molecules, but no polymers, issues with inorganics and organometallics, and imperfect stereochemistry
- ChemSpider is “small molecules”
- InChI is used as the “deduplicator”: the FIRST version of a compound into the database becomes THE structure to deduplicate against…
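The "first in wins" behavior described on this slide can be sketched as a registry keyed by InChI string. This is a hypothetical illustration, not ChemSpider's implementation: the `Registry` class, its record fields, and the source names are invented here, and the molfile values are placeholder strings. The ethanol InChI is the standard one.

```python
class Registry:
    """Toy compound registry that deduplicates deposits on the InChI string."""

    def __init__(self):
        self._by_inchi = {}

    def deposit(self, inchi, molfile, source):
        """Add a deposit; return (record, created).

        If the InChI is already registered, the existing record (and its
        original molfile) is kept and only the new source is linked to it.
        """
        record = self._by_inchi.get(inchi)
        if record is not None:
            record["sources"].append(source)
            return record, False
        record = {"inchi": inchi, "molfile": molfile, "sources": [source]}
        self._by_inchi[inchi] = record
        return record, True

    def __len__(self):
        return len(self._by_inchi)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

if __name__ == "__main__":
    reg = Registry()
    # Two depositors send the same compound with different 2D layouts.
    reg.deposit(ETHANOL, "<layout from depositor A>", "source-A")
    rec, created = reg.deposit(ETHANOL, "<cleaner layout from depositor B>", "source-B")
    print(created, rec["molfile"], rec["sources"])
```

In this sketch a later, better-drawn deposit never replaces the first one, which is how a poor initial depiction can persist as the reference structure.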
Standardization Issues
Depiction based on molfile
Downsides of Overall Approach
- Meshing data together based on InChIs worked for simple molecules
- 2D layout errors are inherited from depositors or limited by the layout algorithm
- Complex molecules that are meant to be the same thing were NOT deduplicated
- Compounds that differ by one stereocenter, share a name, and are meant to be the same are not treated as the same
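The stereocenter problem can be made concrete by comparing InChI strings layer by layer. The sketch below is a simplification for illustration only: it drops the main stereo layers (`/b`, `/t`, `/m`, `/s`) to get a connectivity-only key and ignores subtleties such as stereo sublayers that can follow the isotopic or fixed-H layers. The helper name `strip_stereo` is invented here. The alanine strings are standard InChIs.

```python
# Standard InChIs for alanine with and without defined stereochemistry.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ALANINE_NO_STEREO = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo(inchi):
    """Return the InChI with its /b, /t, /m and /s layers removed (simplified)."""
    parts = inchi.split("/")
    head, layers = parts[:2], parts[2:]  # head = version prefix + formula
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join(head + kept)


def same_skeleton(a, b):
    """True if two InChIs agree once stereo layers are ignored."""
    return strip_stereo(a) == strip_stereo(b)


if __name__ == "__main__":
    # Exact-string matching keeps these apart; the skeleton comparison
    # flags them as candidates for review.
    print(L_ALANINE == D_ALANINE, same_skeleton(L_ALANINE, D_ALANINE))
```

A comparison like this does not decide which stereo form a depositor intended. It only surfaces pairs that exact InChI matching would never bring together.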
What needs to happen?
- If we could validate:
  - Catch errors in databases (and clean them)
  - Proactively catch errors in publications/patents
  - Reduce junk in the ether and improve QUALITY!
- If we collectively standardized:
  - Interlinking between databases should improve
- CVSP is covered in a separate presentation… stick around
Crowdsourcing ChemSpider
- ChemSpider is crowdsourced
- Community deposition, annotation, and curation
- Anyone can “Leave Feedback”
- Registered users can add data
ChemSpider and Global Chemistry Hub
Sources and contributors:
- Internet Data
- Commercial Software
- Pre-competitive Data
- Open Science
- Open Data
- Publishers
- Educators
- Open Databases
- Chemical Vendors
Chemistry covered:
- Small organic molecules
- Undefined materials
- Organometallics
- Nanomaterials
- Polymers
- Minerals
- Particle bound
- Links to Biologicals
Delivering a Prediction Platform
Experimental data will be used as the basis of model generation: a predictive platform…
The Future of ChemSpider
- Continued focus on quality over quantity, but more data is good too!
- ChemSpider Reactions: a work in progress that includes >300,000 reactions
- Plugging in a validation and standardization platform
- Delivering personal and institutional repository capabilities