Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Antony Williams
Bio-IT World 2009

Building a Structure Centric Community
for Chemists
Linked Data Cloud

Chemistry on the Internet
• Much of the information online is "User Beware!"
• The quality of information is “diverse”
• Technologies can “link and connect” information, but validation and curation are key to providing quality
• The LinkedData web is of less value when the data linked are “wrong”

Quality Costs
• Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information
• 101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences
• But online…

What is “wrong”?

• A platform for:
  • Data deposition, curation and annotation
  • Supporting Open Notebook Science efforts
  • Chemistry document mark-up with ChemMantis
  • The Open Access ChemSpider Journal of Chemistry

Search Cholesterol

Complex Data and Information

Online Data
• Many websites host structure-based information
• Question quality!

Wikipedia, C&E News, PubChem
C&E News (from ACS)

Does one stereocenter matter?

Vancomycin
• Who will curate?
• PubChem is not resourced to clean these errors
• How would you clean such a large dataset?

Vancomycin
ChemSpider: 1 compound – 3 days

Question Everything
www.dhmo.org

DailyMed
“DailyMed provides high quality information about marketed drugs. This information includes FDA approved labels (package inserts).”

The FDA’s DailyMed

Structures on DailyMed
Poor Representations

Structures on DailyMed
Lack of Stereochemistry

Incorrect Structures
Scanning (?) Issues

Incorrect Structures

Does it Matter?
• Does it matter to the consumer that the structures are wrong? No… what matters is that what is in the bottle is the right medication!
• To make DailyMed structure-searchable it DOES matter
• To data mine DailyMed it matters
• To mark up DailyMed it matters

Collaborative Knowledge Management for Chemists

Wikipedia Links to DrugBank

Taxol on PubChem

Taxol on DailyMed

The InChI Identifier

Multiple Layers
Source: Unofficial InChI FAQ page
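The layered structure of an InChI string can be seen by splitting it on its "/" separators. The sketch below is a simplification, not a full parser: the function name `split_inchi_layers` is hypothetical, and it assumes the second segment is the molecular formula and that no layer prefix repeats (which is not true for every InChI, for example those with fixed-hydrogen sublayers). Ethanol is used as the example.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and one-letter-prefixed layers.

    Simplified sketch: assumes the second segment is the formula and that
    each layer prefix (c, h, t, m, s, ...) occurs at most once.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # skip empty segments
            # First character is the layer prefix, the rest is the layer body.
            layers[part[0]] = part[1:]
    return layers


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = split_inchi_layers(ethanol)
for name, value in layers.items():
    print(f"{name}: {value}")
```

Here the `c` layer carries the atom connections and the `h` layer the hydrogen positions; stereochemistry, when present, appears in further layers such as `t`, `m` and `s`.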

InChI Strings Hash to InChIKeys
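The idea of hashing a variable-length InChI string to a short, fixed-length key can be illustrated with a toy function. This is explicitly NOT the InChIKey algorithm: `toy_key` is a hypothetical helper that maps a SHA-256 digest onto uppercase letters, and its output will not match any real InChIKey. It only shows the properties that make a hashed key convenient for web search: fixed length, determinism, and different keys for different inputs.

```python
import hashlib


def toy_key(identifier: str, length: int = 14) -> str:
    """Toy fixed-length key: NOT the real InChIKey algorithm."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Map each of the first `length` digest bytes onto a letter A-Z.
    return "".join(chr(ord("A") + byte % 26) for byte in digest[:length])


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
methane = "InChI=1S/CH4/h1H4"

print(toy_key(ethanol))
print(toy_key(methane))
```

Because a hash is one-way, recovering the structure from a key requires a lookup service that stores key-to-structure pairs.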

InChIs for Taxol

Back to Taxol
• DrugBank: RCINICONZNJXQF-CLDWUXIMDD
• ChEBI: RCINICONZNJXQF-GXKQXQCDDN
• Wikipedia: RCINICONZNJXQF-MZXODVADBJ
• Which one is correct?

InChIKeys for Taxol
• DrugBank: RCINICONZNJXQF-CLDWUXIMDD
• ChEBI: RCINICONZNJXQF-GXKQXQCDDN
• Wikipedia: RCINICONZNJXQF-MZXODVADBJ
• ChEBI and Wikipedia are the SAME structure
• DrugBank is a DIFFERENT structure: it differs at ONE stereocenter
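A first-pass comparison of the three keys can be done with plain string handling. The sketch below assumes the usual reading of an InChIKey, in which the block before the hyphen is a hash of the connectivity (the molecular skeleton) and the remainder covers the other layers, including stereochemistry. The helper name `split_key` is hypothetical.

```python
# InChIKeys for Taxol as listed on the slide.
keys = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key: str) -> tuple:
    """Return (connectivity block, remainder) of an InChIKey."""
    first, _, rest = key.partition("-")
    return first, rest


skeletons = {source: split_key(key)[0] for source, key in keys.items()}
remainders = {source: split_key(key)[1] for source, key in keys.items()}

# True when every source agrees on the connectivity block.
same_skeleton = len(set(skeletons.values())) == 1

print("Same connectivity block:", same_skeleton)
for source, rest in remainders.items():
    print(f"{source}: {rest}")
```

All three sources share the same first block, so they agree on connectivity. A differing remainder only signals that something beyond connectivity differs (stereochemistry, or the options used when the key was generated); deciding which records describe the same structure requires comparing the full InChI strings.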

The InChI Resolver

Coming Soon… Linked Articles

How bad can it get?
And who is right?

ChemMantis
• Chemical Markup And Nomenclature Transformation Integrated System – ChemMantis
• A platform for entity extraction for chemistry documents, markup and integration to online information sources – Wikipedia, ChemSpider, Entrez…
• Web-based submission, markup and publishing platform now hosting the ChemSpider Journal of Chemistry

ChemMantis Markup

Enable Electronic Articles…
• Structures are the language of chemistry
• Show structures to chemists and search/link from there…

Species Markup

Dictionaries are Easily Enhanced
• Copy-Paste into appropriate Entity Dictionary
• Impacts all future markups
• Expanding knowledgebases of information
• Linked out to rich sources of information
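Why a new dictionary entry "impacts all future markups" can be shown with a minimal dictionary-based tagger. This is only an illustration of the general technique, not ChemMantis itself: the function `mark_up`, the bracketed tag format, and the sample dictionary are all hypothetical.

```python
import re


def mark_up(text: str, dictionary: dict) -> str:
    """Wrap every dictionary term found in `text` as [type:term].

    Hypothetical sketch of dictionary-based entity markup.
    """
    if not dictionary:
        return text
    # Longest terms first so multi-word names win over shorter overlaps.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): kind for term, kind in dictionary.items()}

    def tag(match):
        found = match.group(0)
        return f"[{lookup[found.lower()]}:{found}]"

    return pattern.sub(tag, text)


entities = {"cholesterol": "chemical", "vancomycin": "chemical"}
sentence = "Taxol and vancomycin were tested in Escherichia coli."

before = mark_up(sentence, entities)

# "Copy-paste" two new terms into the dictionary; later markups pick them up.
entities["Taxol"] = "chemical"
entities["Escherichia coli"] = "species"
after = mark_up(sentence, entities)

print(before)
print(after)
```

The tagger itself never changes; only the dictionary grows, which is what makes this style of curation easy to crowdsource.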

Build Dictionaries
Ontologies Next

Outlinks…

Publishers and Document Mark-Up

ChemSpider Everywhere
• Linked from Wikipedia
• Linked from Open Notebook Science sites using EMBED
• Linked from Blogs using Structure/Spectra EMBED
• Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets
• Integrated into software offerings from Thermo, Waters, Agilent, Bruker

Embed Functionality (like YouTube)

www.spectralgame.com

Crowdsourced Curation of Spectra

RSC Compounds

Nature Chemistry
Nature Chemistry articles are annotated to identify all of the chemical compounds mentioned throughout the text. Those compounds are linked out to other information resources including PubChem and ChemSpider.

ChemMobi

Structure RSS Feeds with InChIs

Acknowledgments
• Richard Kidd, Royal Society of Chemistry
• Jason Wilde, Nature Publishing Group
• Martin Walker and the Wikipedia Chemistry team
• Microsoft – Rudy Potenzone
• Symyx – Keith Taylor and James Jack
• SureChem – Nicko Goncharoff
• Spectral game – Andrew Lang and Jean-Claude Bradley
• “The InChI team and Advisory Group”

Conclusions
www.chemspider.com
www.chemspider.com/journal
InChIs and Internet Chemistry
http://inchis.chemspider.com
