RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider: Collecting and Curating the
World’s Chemistry with the Community

A Pragmatic Vision
“Build a Structure Centric Community”
 December 2006 – A hobby project initiated to
connect chemistry on the web
 Integrate chemical structure data on the web
 Create a “structure-based hub” to information and
data
 Provide access to structure-based “algorithms”
 Let chemists contribute their own data
 Allow the community to curate/correct data

Where is chemistry online?
 Encyclopedic articles (Wikipedia)
 Chemical vendor databases
 Metabolic pathway databases
 Property databases
 Patents with chemical structures
 Drug Discovery data
 Scientific publications
 Compound aggregators
 Blogs/Wikis and Open Notebook Science

Chemistry on the Internet TODAY
 Chemistry searches are generally limited to text-
based searches across the internet
 Data are dirty: sorting the wheat from the chaff.
Who can you trust?
 Too many searches required to source data

What do humans want?
As few interfaces as possible
media.obsessable.com

Chemistry on the Internet FUTURE
 The semantic web for chemistry is in place
 Crowdsourced contributions are commonplace
 Chemists will search by structure/substructure
 Chemistry articles indexed and searchable
 Reduced number of searches to find data
 Data are integrated – compounds, vendors,
syntheses, data, publications and patents
 A world of Open Access and Open Data
 Classical business models will have to morph

Getting it done
 March 2007 – A beta system opened online
 One purchased computer, two home-built
 Seeded with 10.5 million structures
 Structure/substructure searching
 June 2007
 A curating layer to flag data
 A deposition interface to add to the data
 And so it continued…

Kyoto Encyclopedia of Genes and Genomes

Links to Patents based on structure

Link off a structure in ChemSpider
 Chemical suppliers
 Other publications
 Analytical Data
 Related Reactions
 Wikipedia
 Patents
 “Everything”

Answering Questions for Chemists
 Questions a chemist might ask…
 What is the melting point of n-butanol?
 What is the chemical structure of Xanax?
 Chemically, what is phenolphthalein?
 What are the stereocenters of cholesterol?
 Where can I find publications about xylene?
 What are the different trade names for Ketoconazole?
 What is the NMR spectrum of Aspirin?
 What are the safety handling issues for Thymol Blue?

ChemSpider is a structure-centric hub
 ChemSpider aggregates and links out across the
internet
 Data aggregate based on “structures and links”
 What defines a chemical compound?
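
One way to picture a structure-centric hub is an index keyed on a structure identifier, with every source attaching its links to that key. The sketch below is illustrative only: the `StructureHub` class, its method names and the link categories are hypothetical and are not ChemSpider's actual design. The key used is the standard InChIKey for water.

```python
from collections import defaultdict


class StructureHub:
    """Toy index: one structure identifier -> links grouped by category."""

    def __init__(self):
        # identifier -> {category -> list of URLs}
        self._links = defaultdict(lambda: defaultdict(list))

    def add_link(self, structure_id, category, url):
        """Attach a link to a structure, ignoring exact duplicates."""
        urls = self._links[structure_id][category]
        if url not in urls:
            urls.append(url)

    def links_for(self, structure_id):
        """Return all links recorded for a structure as a plain dict."""
        found = self._links.get(structure_id, {})
        return {category: list(urls) for category, urls in found.items()}


WATER = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"  # standard InChIKey for water

hub = StructureHub()
hub.add_link(WATER, "encyclopedia", "https://en.wikipedia.org/wiki/Water")
hub.add_link(WATER, "encyclopedia", "https://en.wikipedia.org/wiki/Water")  # duplicate, ignored
hub.add_link(WATER, "other", "http://www.dhmo.org")

print(hub.links_for(WATER))
```

Keying on the structure rather than on a name means that "Water", "H2O" and "Di-Hydrogen Monoxide" would all resolve to the same entry once each name is mapped to the identifier.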

Linked Data on the Web
Taken from: Rafael Sidi’s Blog

Where Would You look?
What Do You Trust?

Question Everything online: www.dhmo.org

Di-Hydrogen Monoxide
H2O
Water

Chemistry on The Internet Is Messy

Vancomycin
 Who will curate?
 How would you clean such
a large dataset?
 Assertions!!!

Vancomycin on ChemSpider
1 compound – 3 days

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem
C&E News (from ACS)

InChI Strings Hash to InChIKeys

InChIKeys for Taxol
 DrugBank: RCINICONZNJXQF-CLDWUXIMDD
 ChEBI: RCINICONZNJXQF-GXKQXQCDDN
 Wikipedia: RCINICONZNJXQF-MZXODVADBJ
 ChEBI and Wikipedia are the SAME structure
 DrugBank is a DIFFERENT structure – it differs at ONE
stereocenter
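
An InChIKey is a fixed-length hash of the InChI string. Its first 14-character block is derived from the connectivity (the skeleton), and the block after the hyphen covers the remaining layers, including stereochemistry. The sketch below compares keys block by block, using the Taxol keys exactly as listed above. The helper names `split_key`, `compare_keys` and `skeleton_matches` are made up for illustration, and a shared first block is strong evidence of a shared skeleton rather than proof, since hashes can in principle collide.

```python
def split_key(inchikey):
    """Split an InChIKey into its hyphen-separated blocks."""
    return inchikey.strip().split("-")


def compare_keys(key_a, key_b):
    """Classify two keys as 'identical', 'same skeleton' or 'different skeleton'."""
    if key_a == key_b:
        return "identical"
    if split_key(key_a)[0] == split_key(key_b)[0]:
        return "same skeleton"
    return "different skeleton"


def skeleton_matches(query_key, keys_by_source):
    """Return the sources whose key shares the query's first (skeleton) block."""
    skeleton = split_key(query_key)[0]
    return sorted(
        source for source, key in keys_by_source.items()
        if split_key(key)[0] == skeleton
    )


# Keys exactly as listed on the slide.
taxol_keys = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}

print(compare_keys(taxol_keys["DrugBank"], taxol_keys["Wikipedia"]))
print(skeleton_matches(taxol_keys["DrugBank"], taxol_keys))
```

Matching on the first block alone finds every record with the same skeleton regardless of stereochemistry, which is why a single misassigned stereocenter shows up as a different key for what looks like the same compound.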

Does one stereocenter matter?
 Distaval, Talimol, Nibrol,
Sedimide, Quietoplex,
Contergan, Neurosedyn,
and Softenon

Assertion and Chemical Entities
 Who says what Taxol is?
 What is the “timeline” for a molecule?
 How do we clean up the Public data?
 The Quality source is Chemical Abstracts Service…

Vancomycin – Search the Internet

Full Skeleton Search: 104 Hits

Crowd-sourcing Chemistry Curation
 Crowd-sourced curation: identify/tag errors, edit
names, synonyms, identify records to deprecate
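
The curation actions listed above (tag an error, edit names and synonyms, deprecate a record) can be pictured as a small state model in which any user may flag a name and a curator then settles the flag. This is a rough sketch under that assumption only. The `Synonym` and `Record` classes, the status values and the user name are hypothetical and do not describe ChemSpider's real data model.

```python
from dataclasses import dataclass, field


@dataclass
class Synonym:
    name: str
    status: str = "unreviewed"  # "unreviewed", "flagged", "approved" or "rejected"
    flagged_by: list = field(default_factory=list)


@dataclass
class Record:
    record_id: int
    synonyms: dict = field(default_factory=dict)
    deprecated: bool = False

    def add_synonym(self, name):
        """Deposit a name; an existing entry is left untouched."""
        self.synonyms.setdefault(name, Synonym(name))

    def flag(self, name, user):
        """Any user can tag a synonym as a suspected error."""
        synonym = self.synonyms[name]
        synonym.status = "flagged"
        if user not in synonym.flagged_by:
            synonym.flagged_by.append(user)

    def review(self, name, accept_flag):
        """A curator settles a flag: reject the name or approve it."""
        synonym = self.synonyms[name]
        if synonym.status != "flagged":
            raise ValueError(f"{name!r} has not been flagged")
        synonym.status = "rejected" if accept_flag else "approved"

    def visible_names(self):
        """Names still shown on the record after curation."""
        return [s.name for s in self.synonyms.values() if s.status != "rejected"]


record = Record(record_id=1)  # arbitrary illustrative ID
record.add_synonym("Vancomycin")
record.add_synonym("Aspirin")  # deliberately wrong name for this record
record.flag("Aspirin", user="community_member")
record.review("Aspirin", accept_flag=True)

print(record.visible_names())
```

Separating `flag` from `review` keeps the crowd's input cheap to give while leaving the final decision with a second, more trusted level.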

Building a Structure Centric
Multi-level Curation and Approval

Semantic Markup: Project Prospect

Entity-Extraction, Mark-up, Annotate

Success Depends on Dictionaries

Species – linked to Wikipedia

Semantic Linking of Structures
 What would you want
to link off a structure?
 Chemical suppliers
 Other publications
 Analytical Data
 Related Reactions
 Wikipedia
 Patents
 “Everything”

ChemSpider Everywhere: Spectral Game

ChemSpider Everywhere
Crowdsourced Curation of Spectra

ChemSpider Everywhere:
What do computers want?
Web services
flickr.com/photos/microcosmos

ChemSpider Everywhere
 Linked from Wikipedia and many Public Databases
 Linked from Open Notebook Science sites
 Linked from Blogs using Structure/Spectra EMBED
 Integrated into structure drawing packages
 Integrated to software offerings from Thermo,
Waters, Agilent, Bruker

ChemSpider Everywhere: ChemMobi

There will always be gaps...
 What ChemSpider does not deal with, yet...
 Materials
 Minerals
 Polymers
 Biological macromolecules

Open Source, Access and Data
 ChemSpider is NOT Open Source but we do use
Open Source components (OpenBabel,
JSpecView, Jmol). Thanks Microsoft!
 ChemSpider is not an “Open Access Database” –
it’s a “free access” resource
 We do not assume copyright. Rights to the data
and the creative works remain with the depositor
 Is ChemSpider “Open Data”?

Who declares data as Open?
 Data licensing is very interesting and can spark
“interesting” conversations. Opinions differ:
 Are images data? Are assertions data?
 What on a ChemSpider record is data?
 Is PubChem or PubMed Open Data?
 We allow people to declare their data as Open and
add an Open Data button at upload
 A lot of data on ChemSpider are free but not Open
 Pragmatism: Our focus is a community resource

Conclusions: ChemSpider Today
 ChemSpider is an established community resource
 >23 million compounds from >300 data sources
 About 7000 unique users per day and up to ½ million
transactions per day
 A crowdsourced deposition and curation platform
 Grows daily – more depositions, more links, more data
 Web services provider
 Linked to commercial and open source software
 Supporting analytical companies: Agilent, Thermo, Waters, Bruker
 Serving ONS, providing games to students, ChemSpidey robot
 A publishing platform for the community

ChemSpider Tomorrow
 Continue the curation effort and keep cleaning
 Finish depositions – millions left to deposit
 Integrate RSC content – a massive archive!
 Integrate RSC publishing workflows and databases
 Enable the semantic web for chemistry

Acknowledgments
 Royal Society of Chemistry
 Valery Tkachenko and Sergey Shevelev
 Commercial Software: Microsoft, Advanced
Chemistry Development, OpenEye and Symyx
 Open Source Software: Jmol, OpenBabel,
JSpecView
 JC Bradley, Andrew Lang – The Spectral Game
and Open Notebook Science integration
 The “Crowd” of curators
 306 Data Source providers
 SyntheticPages.org

Thank you
antony.williams@chemspider.com
Twitter: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams
