How the Web Has Weaved a Web of Interlinked Chemistry Data. Antony Williams, ACS Anaheim, March 29th, 2011
Data on the Web
Where is Chemistry Online? Property databases; compound aggregators; screening assay results; scientific publications; encyclopedic articles (Wikipedia); metabolic pathway databases; ADME/Tox data; blogs/wikis and Open Notebook Science; contributing Open Source code to projects.
How to Connect Chemicals…
Chemistry on the Internet: 100s of websites serve up chemistry data, including SDF files of structures and data. RSC’s ChemSpider “links” chemistry on the internet: over 25 million compounds from over 400 data sources. It allows community deposition, curation, and annotation; integrates properties, publications, patents, and media; and supports text, structure, substructure, and similarity searching.
www.chemspider.com
Search for a Chemical
We Have Delivered the Vision: “Build a Structure Centric Community to Serve Chemists”. Integrate chemical structure data on the web.
How Did We Build It? We deal in Molfiles or SDF files. We do rudimentary filtering prior to deposition (valence checking, charge imbalance, etc.). We have our own “business logic” to standardize structures. We use InChI to “aggregate tautomers”. We link out to external sites where possible using IDs.
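A rough sketch of what “rudimentary filtering” of this kind could look like, assuming V2000 molfiles as input. The function names, the tiny `MAX_VALENCE` table, and the decision to check only neutral atoms are all illustrative assumptions; this is not ChemSpider's actual business logic.

```python
# Illustrative pre-deposition checks on a V2000 molfile (assumed input format).
# MAX_VALENCE is a deliberately small, hypothetical table for neutral atoms only.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}

# V2000 atom-block charge codes (columns 37-39) mapped to formal charges.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}


def parse_molfile(text):
    """Return (symbols, charges, bonds) read from one V2000 molfile string."""
    lines = text.splitlines()
    counts = lines[3]  # counts line follows the three header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    symbols, charges = [], []
    for line in lines[4:4 + n_atoms]:
        symbols.append(line[31:34].strip())
        code = int(line[36:39].strip() or 0)
        charges.append(CHARGE_CODES.get(code, 0))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    # "M  CHG" property lines, when present, replace the atom-block charges.
    chg_lines = [l for l in lines if l.startswith("M  CHG")]
    if chg_lines:
        charges = [0] * n_atoms
        for l in chg_lines:
            fields = l.split()[3:]  # skip "M", "CHG" and the entry count
            for atom, value in zip(fields[0::2], fields[1::2]):
                charges[int(atom) - 1] = int(value)
    return symbols, charges, bonds


def check_structure(text):
    """Return a list of human-readable problems found in one molfile."""
    symbols, charges, bonds = parse_molfile(text)
    problems = []
    net = sum(charges)
    if net != 0:
        problems.append(f"charge imbalance: net charge {net:+d}")
    valence = [0] * len(symbols)
    for a, b, order in bonds:
        if order in (1, 2, 3):  # aromatic and query bond types are ignored here
            valence[a - 1] += order
            valence[b - 1] += order
    for i, (sym, chg, val) in enumerate(zip(symbols, charges, valence), start=1):
        limit = MAX_VALENCE.get(sym)
        # Charged atoms and elements missing from the table are not judged.
        if limit is not None and chg == 0 and val > limit:
            problems.append(f"atom {i} ({sym}): valence {val} exceeds {limit}")
    return problems
```

An SDF file could be handled by splitting it on its `$$$$` record separators and passing each record to `check_structure`; records that return a non-empty list would be held back for review rather than deposited.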
Inherited Errors: We have inherited errors. All public compound databases, including ours, have errors: “incorrect” structures (assertions, timelines, etc.), “incorrect” names associated with structures, and errors in properties, links, and publications. ENORMOUS CHALLENGE.
The Structure of Vitamin K?
MeSH: A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
The Structure of Vitamin K1?
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
 
 
ChEBI – Manual Curation
 
 
PubChem
 
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”. Variants of systematic names on PubChem: 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl; 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl; 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl; 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl; 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl; 2-methyl-3-[(E)-3,7,11,15-tetramethyl; 2-methyl-3-(3,7,11,15-tetramethyl; 2-methyl-3-[(E)-3,7,11,15-tetramethyl
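To see how these variants relate, the sketch below strips the parenthesised stereodescriptor prefixes from the name beginnings listed on the slide and groups what is left. The regular expression and function names are assumptions made for illustration, and working on name strings is crude; a real comparison would be made on the structures themselves.

```python
import re
from collections import defaultdict

# Name beginnings exactly as listed on the slide (they are truncated there too).
NAME_VARIANTS = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Matches a stereodescriptor prefix such as "(E)-" or "(E,7R,11S)-".
STEREO_PREFIX = re.compile(r"\((?:[EZ]|\d+[RS])(?:,(?:[EZ]|\d+[RS]))*\)-")


def strip_stereo(name):
    """Drop stereodescriptor prefixes and normalise square brackets."""
    flat = STEREO_PREFIX.sub("", name)
    return flat.replace("[", "(").replace("]", ")")


def group_by_skeleton(names):
    """Map each stereo-free name to the set of variants that reduce to it."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_stereo(name)].add(name)
    return dict(groups)


if __name__ == "__main__":
    for skeleton, variants in group_by_skeleton(NAME_VARIANTS).items():
        print(skeleton, "<-", len(variants), "distinct variants")
```

All eight entries reduce to the same stereo-free beginning, which is the point of the slide: one compound name hides several records that differ only in how much stereochemistry they assert.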
Public Domain Databases: Our databases are a mess… Non-curated databases are proliferating errors. We source and deposit data between databases. Original sources of errors are hard to determine. Curation is time-consuming and challenging.
 
Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
 
To report at Denver ACS… An examination of quality in databases: inter/intra lab comparison of processes for 150 drugs. Five separate organizations, 8 individuals. The Wikipedia List of the “200 Top Selling Drugs”.
 
Vytorin: Ezetimibe/Simvastatin
Vytorin: Ezetimibe/Simvastatin
Vytorin: Ezetimibe/Simvastatin
Vytorin: Ezetimibe/Simvastatin
Vytorin: Ezetimibe/Simvastatin
Taxol: Paclitaxel, 44 structures
Taxol: Paclitaxel Bioassay Data
Taxol: Paclitaxel Bioassay Data. Most bioassay data are associated with a structure with one ambiguous stereocenter.
| Drug Name    | Generic Name         | ChEBI   | ChemSpider | CAS Common Chemistry | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia    |
|--------------|----------------------|---------|------------|----------------------|------------|----------|----------|---------|--------------|
| Spiriva      | Tiotropium Bromide   | No Hits |            | No Hits              |            |          |          | 4/0     |              |
| Depakote     | Valproate semisodium |         |            |                      |            |          |          |         | No Structure |
| Basen        | Voglibose            |         |            | No Hits              |            | No Hits  |          | 2/1     |              |
| Symbicort 1) | Budesonide           |         |            |                      |            |          |          | 8/1     |              |
| Symbicort 2) | Formoterol           | WRONG   |            | No Hits              |            |          |          | 6/1     |              |
| Vytorin 1)   | Ezetimibe            |         |            | No Hits              |            |          |          |         |              |
| Vytorin 2)   | Simvastatin          |         |            |                      |            |          |          | 2/1     |              |
| Taxol        | Paclitaxel           |         |            |                      |            |          |          | 44/1    |              |
| Thalidomid   | Thalidomide          | No Hits |            |                      |            |          |          |         |              |
| Zocor        | Simvastatin          |         |            |                      |            |          |          | 2/1     |              |
| Crestor      | Rosuvastatin         |         |            | No Hits              |            |          |          | 2/1     |              |
Entity-Extraction and Mark-up
Entity-Extraction and Mark-up
Success Depends on Dictionaries
Nature Chemistry
RSC Prospect
Validated “Dictionaries”: The following resources do NOT have structures to link to ChemSpider… but are linked: Google Scholar, PubMed, DailyMed, RSC Databases and Backfile. How did we link these resources to ChemSpider? Validated Name Look-up!
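A minimal sketch of dictionary-driven name look-up and mark-up, assuming a small validated dictionary. The brand/generic pairs come from the drug table earlier in this talk; the dictionary layout, the function names, and the bracketed mark-up format are hypothetical. A production dictionary would presumably map each validated name to a structure record rather than to another name.

```python
import re

# Hypothetical mini-dictionary of validated names (lower-cased) mapped to a
# generic name. The pairs are taken from the drug table in this talk.
NAME_DICTIONARY = {
    "taxol": "Paclitaxel",
    "paclitaxel": "Paclitaxel",
    "zocor": "Simvastatin",
    "simvastatin": "Simvastatin",
    "crestor": "Rosuvastatin",
    "rosuvastatin": "Rosuvastatin",
}


def build_pattern(dictionary):
    """Compile one whole-word, case-insensitive pattern; longest names first."""
    names = sorted(dictionary, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def extract_entities(text, dictionary=NAME_DICTIONARY):
    """Return (matched text, dictionary value) pairs in order of appearance."""
    pattern = build_pattern(dictionary)
    return [(m.group(0), dictionary[m.group(0).lower()]) for m in pattern.finditer(text)]


def mark_up(text, dictionary=NAME_DICTIONARY):
    """Wrap every dictionary hit as "[name -> value]"."""
    pattern = build_pattern(dictionary)
    return pattern.sub(
        lambda m: f"[{m.group(0)} -> {dictionary[m.group(0).lower()]}]", text
    )


if __name__ == "__main__":
    print(mark_up("Patients switched from Zocor to Crestor."))
```

Only names present in the dictionary are ever marked up, which is why the quality of the mark-up depends entirely on how well the dictionary has been validated.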
Extend the Vision: “Build a Structure Centric Community to Serve Chemists”. Integrate chemical structure data on the web. Create a “structure-based hub” to information, data and algorithmic predictions.
Integrate other services… We will integrate with systems of value to the community. Many interfaces are now available for integration: NMRShiftDB, ACD/Labs Name Generation, ChemAxon Chemicalize. What others???
 
Extend the Vision: “Build a Structure Centric Community to Serve Chemists”. Integrate chemical structure data on the web. Create a “structure-based hub” to information, data and algorithmic predictions. Let chemists contribute their own data. Allow the community to curate/correct data.
How Did We Build It (cont.): Ask users to add… descriptions/syntheses/commentaries; links to PubMed articles; links to articles via DOIs; spectral data; Crystallographic Information Files; photos; MP3 files; videos.
Complex Data and Information
 
Kind Contributions!
Crowdsourcing “Vitamin H”
“Curate” Identifiers
“Curate” Identifiers
Crowdsourcing Works: >130 people have deposited data and participated in data curation. Curators at different levels check each other. More curators and depositors are encouraged!
Accessibility and Reuse: It’s a shame to go it alone!!! Can we “collectively” improve the quality of chemistry on the Internet?
All DBs should take comments!
Proof-of-concept curation sharing: Presently collaborating with DrugBank to enable “curation sharing”. Setting up services for monitoring curations and edits, starting with “identifiers”.
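One way to picture monitoring identifier curations is as a diff between the identifier sets on a record before and after an edit. The sketch below is only that: the `IdentifierChange` type, the record key, and the example edit are hypothetical and say nothing about how the ChemSpider or DrugBank services are actually built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierChange:
    """One curation event that could be shared with another database."""
    record: str       # hypothetical record key
    action: str       # "added" or "removed"
    identifier: str


def diff_identifiers(record, before, after):
    """List identifiers removed from and added to a record by one edit."""
    before, after = set(before), set(after)
    changes = [IdentifierChange(record, "removed", i) for i in sorted(before - after)]
    changes += [IdentifierChange(record, "added", i) for i in sorted(after - before)]
    return changes


if __name__ == "__main__":
    # Hypothetical edit: a curator removes a wrong synonym and adds a valid one.
    for change in diff_identifiers(
        "example-record", ["Biotin", "Vitamin K"], ["Biotin", "Vitamin H"]
    ):
        print(change)
```

A list of such change events is small and database-neutral, so a partner site could review each one and decide whether to apply the same correction to its own record.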
The Social Network: Career-wise, NOT having a personal presence online will be a detriment. Self-marketing; establishing a profile; getting on the record; collaborative science; demonstrating a skill set; measured using alternative metrics; contributing to the public peer review process.
Social Networking Tools: A growing number of social networking tools: Facebook, Twitter, LinkedIn, Flickr, YouTube, blogs, communities, collaborative environments.
Chemistry Social Networking: Methods of sharing MY chemistry online include: wikis or blogs; SlideShare for presentations; YouTube for videos; Flickr, Wikimedia etc. for images; PubChem for assay data; NMRShiftDB for NMR assignments; Google Docs for data.
Drivers in the Social Network: Anonymity is a choice in the social networks. Anonymity in peer review will likely become less important, and this may be generational. I may want acknowledgment if… I share my data; I review a paper; I share my expertise.
The Alt-Metrics Manifesto http://altmetrics.org/manifesto/
Enabled by ORCID…
What will enhance OUR network? The “semantic web”; mobile technologies; more participation; use of standards: JCAMP, InChI.
RDF and the semantic web: Using RDF permalinks: http://www.chemspider.com/Chemical-Structure.7787.rdf ; using a search term: http://www.chemspider.com/rdf.ashx?q=cyclohexane and http://rdf.chemspider.com/cyclohexane
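The three URL forms on this slide can be generated mechanically. The helper names below are invented for illustration, the patterns are copied from the slide as they stood in 2011 and may no longer resolve, and how the service expects spaces or other special characters to be escaped is an assumption.

```python
from urllib.parse import quote, urlencode

BASE = "http://www.chemspider.com"          # host as given on the slide
RDF_HOST = "http://rdf.chemspider.com"      # short-form host from the slide


def rdf_permalink(record_id):
    """RDF permalink for a numeric record ID."""
    return f"{BASE}/Chemical-Structure.{int(record_id)}.rdf"


def rdf_search_url(term):
    """RDF search URL using the q= query parameter."""
    return f"{BASE}/rdf.ashx?{urlencode({'q': term})}"


def rdf_short_url(term):
    """Short-form RDF URL with the search term as the path (escaping assumed)."""
    return f"{RDF_HOST}/{quote(term, safe='')}"


if __name__ == "__main__":
    print(rdf_permalink(7787))
    print(rdf_search_url("cyclohexane"))
    print(rdf_short_url("cyclohexane"))
```

Fetching and parsing the returned RDF is left out on purpose, since that would depend on the service still answering at these addresses.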
 
Enabled through InChIs
Mobile Support
Licensing “My Work” Online: The complex nature of licensing “my” chemistry. Blogs: copyrighted and Creative Commons. Wikis: mixed licensing, depends on the host(s). Data: much value in sharing data as “Open Data”. Often, people can make money from your work! Police your own “licensing”: how many people have read the Facebook and Twitter agreements?!
Who declares data as Open? Data licensing is very interesting and can spark “interesting” conversations. Opinions differ: Are images data? Are assertions data? What on a ChemSpider record is data? We allow people to declare their data as Open and to add an Open Data button at upload.
Acknowledgments: RSC|ChemSpider team; all data source providers; >100 curators and annotators, and growing… Service providers: ACD/Labs, ChemAxon, GGA Software Services, Google, PubMed, …
ChemSpider Training Session: “ChemSpider: A Community Resource for Chemical Data”. Wednesday, March 30th, 8:30-11:00 AM, Anaheim Convention Center, Room 211 A
