ChemSpider – A Platform to Gather, Host and Integrate Structure Based Data Across the Web


ChemSpider was developed to aggregate and index available sources of chemical structures and their associated information in a single searchable repository, available to everybody at no charge. There are many tens of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data and more, with no single way to search across them. Despite the diversity of databases available online, their quality, accuracy and completeness are lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of well over 20 million chemical substances integrated with over 300 disparate data sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for the semantic web for chemistry and to provide access to a set of online tools and services to support access to these data. I will also discuss how ChemSpider is being used to enhance Semantic Publishing in Chemistry at RSC.

  1. 1. ChemSpider – A Platform to Gather, Host and Integrate Structure Based Data Across the Web
  2. 2. Declaration <ul><li>ChemSpider does NOT do toxicity prediction, yet </li></ul><ul><li>We are building a content database for you to use </li></ul><ul><li>What ChemSpider does can be invaluable to those who do toxicity prediction </li></ul><ul><ul><li>Find “correct” chemical structures </li></ul></ul><ul><ul><li>Find associated data (experimental/predicted) </li></ul></ul><ul><ul><li>Link out to rich sources of information online </li></ul></ul><ul><ul><li>Engage the community in sharing data </li></ul></ul>
  3. 3. A Pragmatic Vision in 2006 <ul><ul><li>“Build a Structure Centric Community” </li></ul></ul><ul><li>December 2006 – A project initiated to connect chemistry on the web </li></ul><ul><ul><li>Integrate chemical structure data on the web </li></ul></ul><ul><ul><li>Create a “structure-based hub” to information and data </li></ul></ul><ul><ul><li>Provide access to structure-based “algorithms” </li></ul></ul><ul><ul><li>Let chemists contribute their own data </li></ul></ul><ul><ul><li>Allow the community to curate/correct data </li></ul></ul>
  4. 4. Three Years of Experience <ul><li>Internet-based chemistry is a mess! </li></ul><ul><li>Most public compound databases on the web are contaminated. Including ours! </li></ul><ul><li>The annotation/curation of data online is difficult </li></ul><ul><li>Most database hosts are non-responsive to feedback – “We are a host/repository of data” </li></ul><ul><li>Who cares? </li></ul>
  5. 5. Where is chemistry online? <ul><li>Encyclopedic articles (Wikipedia) </li></ul><ul><li>Chemical vendor databases </li></ul><ul><li>Metabolic pathway databases </li></ul><ul><li>Property databases </li></ul><ul><li>Patents with chemical structures </li></ul><ul><li>Drug Discovery data </li></ul><ul><li>Scientific publications </li></ul><ul><li>Compound aggregators </li></ul><ul><li>Blogs/Wikis and Open Notebook Science </li></ul>
  6. 6. What is the Structure of Vitamin K?
  7. 7. MeSH – Medical Subject Headings <ul><li>A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K. </li></ul>
  8. 8. What is the Structure of Vitamin K1?
  9. 9. What is the Structure of Vitamin K1?
  10. 10. Chemical Abstracts “Common Chemistry” Database
  11. 11. Wikipedia
  12. 13. Incorrect Structures
  13. 14. Wow!
  14. 15. Lack of Stereochemistry
  15. 16. Does stereochemistry matter? <ul><li>Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide </li></ul>
  16. 19. Comparative Toxicogenomics Database
  17. 21. PubChem
  18. 23. <ul><li>“ 2-methyl-3-(3,7,11,15-tetramethyl hexadec-2-enyl)naphthalene-1,4-dione” </li></ul><ul><li>Variants of systematic names on PubChem </li></ul><ul><li>2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl </li></ul><ul><li>2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl </li></ul><ul><li>2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl </li></ul><ul><li>2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl </li></ul><ul><li>2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl </li></ul><ul><li>2-methyl-3-[(E)-3,7,11,15-tetramethyl </li></ul><ul><li>2-methyl-3-(3,7,11,15-tetramethyl </li></ul><ul><li>2-methyl-3-[(E)-3,7,11,15-tetramethyl </li></ul>
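
The name variants on this slide differ mainly in how much stereochemistry they specify. As a rough illustration (not something shown in the talk), the sketch below groups the truncated name fragments exactly as they appear above by the stereodescriptors in their leading bracket. The helper names `stereo_descriptors` and `group_by_descriptors` are made up for this example, and the regular expression only handles names of this particular shape.

```python
import re

# Name fragments copied from the slide; they are truncated in the source
# and are deliberately left truncated here.
NAMES = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Matches the "[(E,7R,11R)" prefix of the substituent and captures "E,7R,11R".
DESCRIPTOR = re.compile(r"\[\(([^)]*)\)")


def stereo_descriptors(name):
    """Return the stereodescriptors in the leading bracket, or () if none."""
    match = DESCRIPTOR.search(name)
    if match is None:
        return ()
    return tuple(match.group(1).split(","))


def group_by_descriptors(names):
    """Map each distinct descriptor tuple to the names that carry it."""
    groups = {}
    for name in names:
        groups.setdefault(stereo_descriptors(name), []).append(name)
    return groups


if __name__ == "__main__":
    for descriptors, members in group_by_descriptors(NAMES).items():
        label = ",".join(descriptors) or "no stereo specified"
        print(f"{label}: {len(members)} name(s)")
```

Grouping like this makes it easy to see which records fully define the double bond and both stereocentres and which leave some or all of them undefined.
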
  19. 24. ChEBI – Manual Curation
  20. 27. What’s Methane?
  21. 28. What’s Methane?
  22. 29. What ELSE is Methane???
  23. 30. The EXPERTS must get it right?!
  24. 31. Wikipedia, C&E News, PubChem <ul><li>C&E News (from ACS) </li></ul>
  25. 32. Online Datasets
  26. 33. Online Datasets
  27. 34. Online Datasets
  28. 35. Online Datasets
  29. 36. Online Datasets
  30. 37. Online Datasets
  31. 38. What Sources Do You Trust?
  32. 39. QSAR World
  33. 40. Online Datasets <ul><li>The dataset for QSAR appears to have been generated with Name-to-Structure algorithms </li></ul><ul><li>Many systematic errors in the data – non-curated? </li></ul><ul><li>Using such data for modeling is risky </li></ul>
  34. 41. Online Datasets
  35. 42. Internet-Based Chemistry is a Mess <ul><li>Algorithms can only get you so far in data cleaning </li></ul><ul><li>Human curation is necessary </li></ul><ul><li>Only the crowds can help with big data… </li></ul><ul><li>But, if we DID have a highly curated dataset… </li></ul><ul><ul><li>Reference database/dictionary of chemicals </li></ul></ul><ul><ul><li>High quality data for modeling </li></ul></ul><ul><ul><li>Centralized repository for models/data? </li></ul></ul>
  36. 43.
  37. 44. We Answer Questions for Chemists <ul><li>Questions a chemist might ask… </li></ul><ul><ul><li>What is the melting point of n-heptanol? </li></ul></ul><ul><ul><li>What is the chemical structure of Xanax? </li></ul></ul><ul><ul><li>Chemically, what is phenolphthalein? </li></ul></ul><ul><ul><li>What are the stereocenters of cholesterol? </li></ul></ul><ul><ul><li>Where can I find publications about xylene? </li></ul></ul><ul><ul><li>What are the different trade names for Aspirin? </li></ul></ul><ul><ul><li>What is the NMR spectrum of Benzoic Acid? </li></ul></ul><ul><ul><li>What are the safety handling issues for toluene? </li></ul></ul>
  38. 45. Search for a Chemical…by name
  39. 46. Available Information… <ul><li>Linked to vendors, safety data, toxicity, metabolism </li></ul>
  40. 47. Available Information….
  41. 48. Search for chemicals
  42. 49. ChemSpider Today <ul><li>24.8 million structures </li></ul><ul><li>400 data sources </li></ul><ul><li>Grows daily </li></ul><ul><li>Community annotation and curation </li></ul><ul><li>We curate, edit, change, enhance data daily </li></ul>
  43. 50. Search “Vitamin H”
  44. 51. Search “Vitamin H”
  45. 52. “Curate” Identifiers
  46. 53. “Curate” Identifiers
  47. 54. “Curate” Identifiers
  48. 55. “Curate” Identifiers <ul><li>General curation activities </li></ul><ul><ul><li>Remove incorrect names </li></ul></ul><ul><ul><li>Correct spellings </li></ul></ul><ul><ul><li>Add multilingual names </li></ul></ul><ul><ul><li>Add alternative names </li></ul></ul><ul><li>In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually </li></ul><ul><li>130 people have participated in validation or annotation. “Crowds” can be quite small! </li></ul>
  49. 56. Crowdsourced “Annotations” <ul><li>Registered Users can add </li></ul><ul><ul><li>Descriptions/Syntheses/Commentaries </li></ul></ul><ul><ul><li>Links to articles, blogs, wikis etc </li></ul></ul><ul><ul><li>Add spectral data </li></ul></ul><ul><ul><li>Add photos </li></ul></ul><ul><ul><li>Add MP3 files </li></ul></ul><ul><ul><li>Add Videos </li></ul></ul>
  50. 57. Data Validation – ONE Cymarin Question Quality in Big Databases
  51. 58. Data Validation – Cortisol
  52. 59. Data Validation in Databases <ul><li>ADNPLDHMAVUMIW 509 </li></ul><ul><li>WQZGKKKJIJFFOK 119 </li></ul><ul><li>RUDATBOHQWOJDD 118 Ursodeoxycholic acid </li></ul><ul><li>GUBGYTABKSRVRQ 89 Lactose </li></ul><ul><li>BHQCQFFYRZLCQQ 80 Cholic acid </li></ul><ul><li>RCINICONZNJXQF 76 Taxol </li></ul><ul><li>KXGVEGMKQFWNSR 73 Deoxycholic acid </li></ul><ul><li>PXGPLTODNUVGFL 71 </li></ul><ul><li>HVYWMOMLDIMFJA 69 </li></ul><ul><li>QGXBDMJGAMFCBF 63 </li></ul>
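
The 14-letter strings on this slide look like the first block of an InChIKey, which hashes the connectivity of a structure, and the numbers appear to count records that share each block. A minimal sketch of that kind of tally follows, assuming the input is a plain list of InChIKeys. The function names are illustrative, and the second blocks in the sample data are fabricated placeholders, not real hashes.

```python
from collections import Counter


def skeleton_block(inchikey):
    """Return the first 14-letter block of an InChIKey (connectivity hash)."""
    first, _, _ = inchikey.partition("-")
    if len(first) != 14 or not (first.isalpha() and first.isupper()):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return first


def count_by_skeleton(inchikeys):
    """Count how many keys share each connectivity block."""
    return Counter(skeleton_block(key) for key in inchikeys)


# Illustrative input only: first blocks are taken from the slide, but the
# second blocks (AAAAAAAAAA, BBBBBBBBBB, ...) are made-up placeholders.
SAMPLE_KEYS = [
    "BHQCQFFYRZLCQQ-AAAAAAAAAA-N",
    "BHQCQFFYRZLCQQ-BBBBBBBBBB-N",
    "BHQCQFFYRZLCQQ-CCCCCCCCCC-N",
    "GUBGYTABKSRVRQ-AAAAAAAAAA-N",
    "GUBGYTABKSRVRQ-BBBBBBBBBB-N",
    "RCINICONZNJXQF-AAAAAAAAAA-N",
]

if __name__ == "__main__":
    for block, count in count_by_skeleton(SAMPLE_KEYS).most_common():
        print(block, count)
```

A connectivity block with many distinct full keys is a candidate for review: some of those records may be legitimate stereoisomers, while others may be the same compound drawn with missing or wrong stereochemistry.
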
  53. 60. First request to Database Hosts! <ul><li>Every public compound database host should add ONE feature – “Leave Comments” </li></ul>
  54. 61. Second request to Database Hosts! Show Comments
  55. 62. Linked Data on the Web Taken from: Rafael Sidi’s Blog
  56. 63. What is a compound?
  57. 64. The InChI Identifier
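
For readers unfamiliar with the identifier, an InChI string is a sequence of layers separated by `/`. The sketch below is a simplified illustration rather than a full parser: it splits a standard InChI into its version, formula and single-letter-prefixed layers. It ignores complications such as layers that repeat in more complex InChIs, and the function name `split_inchi` is chosen for this example.

```python
def split_inchi(inchi):
    """Split an InChI string into a dict of layers.

    Simplified: the first segment is the version, the second the formula,
    and each later segment is keyed by its one-letter prefix (c, h, ...).
    A prefix that repeats would overwrite the earlier value.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError(f"not an InChI string: {inchi!r}")
    version, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version}
    if rest:
        layers["formula"] = rest[0]
        for segment in rest[1:]:
            if segment:  # skip empty segments rather than fail
                layers[segment[0]] = segment[1:]
    return layers


if __name__ == "__main__":
    print(split_inchi("InChI=1S/CH4/h1H4"))                  # methane
    print(split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))  # ethanol
```

Because the string is derived from the structure itself, two databases that hold the same structure should produce the same InChI, which is what makes it useful as a linking key.
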
  58. 65. Linking and Modeling Bad Data <ul><li>What is the value of linking bad data? </li></ul><ul><li>How can we model suspect data efficiently? </li></ul><ul><li>Commonly data are incorrect </li></ul><ul><ul><li>Measured data are suspect </li></ul></ul><ul><ul><li>Structures associated with data are not correct </li></ul></ul><ul><ul><li>Identifiers are incorrectly associated </li></ul></ul>
  59. 66. Properties on the Database
  60. 67. Properties on the Database
  61. 68. Properties on the Database
  62. 69. Linked Out to Resources
  63. 70. Properties Linked Off the Database
  64. 71. SimBioSys LASSO <ul><li>LASSO uses 23 kinds of Interactive Surface Point Descriptors and </li></ul><ul><ul><li>is conformation independent </li></ul></ul><ul><ul><li>screens at 1 million structures/min </li></ul></ul><ul><ul><li>is proven to enrich screened databases </li></ul></ul><ul><ul><li>provides scaffold hopping </li></ul></ul><ul><li>Hbond Donors (5 kinds) </li></ul><ul><li>Acceptors (5 kinds) </li></ul><ul><li>Ambivalent H donor/acceptor </li></ul><ul><li>Aromatic Pi-stacking (5 kinds) </li></ul><ul><li>Hydrophobic (3 kinds) </li></ul><ul><li>Metal ions </li></ul><ul><li>Misc (Sulfur, Halogens) </li></ul>
  65. 72. SimBioSys LASSO
  66. 73. LASSO Linked Out
  67. 74. Present Activities <ul><li>Enhancing data model to manage more experimental properties – data available for download and modeling </li></ul><ul><li>Developing relationships with other software vendors and model developers for integration </li></ul><ul><li>Curating QSARWorld datasets for deposition </li></ul>
  68. 75. ChemSpider Tomorrow <ul><li>6 months: >1.2M compounds/month </li></ul><ul><li>6 months: >800,000 new uniques </li></ul><ul><li>6 months: >60 new data sources added </li></ul><ul><li>Continue the curation effort and keep cleaning </li></ul><ul><li>Finish depositions – millions left to deposit </li></ul><ul><li>Integrate RSC content – a massive archive! </li></ul><ul><li>Integrate RSC publishing workflows and databases </li></ul><ul><li>Enable the semantic web for chemistry – RDF </li></ul>
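
To make the RDF goal concrete, here is a small sketch of what a compound record might look like as N-Triples. It does not reflect ChemSpider's actual RDF output: every `example.org` URI, including the `inchi` predicate, is a hypothetical placeholder, and only the `rdfs:label` and `owl:sameAs` predicates are standard vocabulary terms.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
EX_INCHI = "http://example.org/vocab#inchi"  # hypothetical predicate


def literal(text):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def uri(value):
    """Wrap a URI in angle brackets as N-Triples requires."""
    return f"<{value}>"


def triple(subject, predicate, obj):
    """Build one N-Triples line; obj must already be formatted."""
    return f"{uri(subject)} {uri(predicate)} {obj} ."


def describe_compound(subject, name, inchi, same_as=()):
    """Return N-Triples lines for a name, an InChI and optional links."""
    lines = [
        triple(subject, RDFS_LABEL, literal(name)),
        triple(subject, EX_INCHI, literal(inchi)),
    ]
    lines += [triple(subject, OWL_SAME_AS, uri(other)) for other in same_as]
    return lines


if __name__ == "__main__":
    for line in describe_compound(
        "http://example.org/compound/methane",  # placeholder subject URI
        "methane",
        "InChI=1S/CH4/h1H4",
        same_as=["http://example.org/other-db/methane"],  # placeholder link
    ):
        print(line)
```

The `owl:sameAs` links are where data quality matters most: asserting that two records are the same compound is only as reliable as the structures behind them.
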
  69. 76. Future Activities – Data Management
  70. 77. Future Activities – Data Management <ul><li>Aggregating and managing data from publications </li></ul><ul><li>Specifically aggregating: </li></ul><ul><ul><li>Data from MedChemComm </li></ul></ul><ul><ul><li>Reaction Data (SyntheticPages) </li></ul></ul><ul><ul><li>Spectral Data </li></ul></ul>
  71. 78. Access Data Through Web Services
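
The transcript does not preserve the web service details from this slide, so the following is only a generic sketch of how a client might build a name-search request. The base URL and the `query` and `format` parameter names are hypothetical placeholders, not ChemSpider's real interface; consult the service documentation for the actual endpoints.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: a placeholder, not ChemSpider's real service URL.
BASE_URL = "https://example.org/search"


def build_search_url(query, fmt="json"):
    """Return a GET URL for a name search against the placeholder service."""
    params = {"query": query, "format": fmt}  # parameter names are assumed
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Questions from the talk, expressed as search requests.
    print(build_search_url("vitamin K1"))
    print(build_search_url("Xanax"))
```

Encoding the query with `urlencode` matters for chemical names in particular, since systematic names often contain brackets, commas and spaces.
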
  72. 79. Mobile Data Access
  73. 80. The Future of Linked Chemistry on the Internet? <ul><li>Public compound databases federate to build a truly linked environment of validated data! </li></ul><ul><li>Data validation needs are not ignored </li></ul><ul><li>Publishers layer on information to make publications discoverable </li></ul><ul><li>Public-Private databases can be linked </li></ul><ul><li>Open Data proliferate </li></ul><ul><li>RDF is everywhere </li></ul>
  74. 81. ChemSpider & Toxicity Prediction <ul><li>Continue the curation effort and keep cleaning </li></ul><ul><li>Web services allow integration and data download </li></ul><ul><li>Presently collaborating with groups to provide access to data for modeling </li></ul><ul><li>Intention is to provide the highest quality online database with associated data </li></ul>
  75. 82. Community Contribution and Innovation <ul><li>“Community contribution” </li></ul><ul><li> best practice award” </li></ul><ul><li>i-Expo Innovation Award: June 2010 </li></ul><ul><li>ALPSP Innovation Award: September 2010 </li></ul>
  76. 83. Thank you Email: Twitter: ChemConnector Blog: Personal Blog: SLIDES: