ChemSpider: Collecting and Curating the
World’s Chemistry with the Community
A Pragmatic Vision
“Build a Structure Centric Community”
December 2006 – A hobby project initiated to
connect chemistry on the web
Integrate chemical structure data on the web
Create a “structure-based hub” to information and
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data
Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Patents with chemical structures
Drug Discovery data
Blogs/Wikis and Open Notebook Science
Chemistry on the Internet TODAY
Chemistry searches are generally limited to text-based
searches across the internet
Data are dirty: sorting the wheat from the chaff.
Who can you trust?
Too many searches required to find data
As few interfaces as possible
What do humans want?
Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors,
syntheses, data, publications and patents
A world of Open Access and Open Data
Classical business models will have to morph
Getting it done
March 2007 – A beta system opened online
One purchased computer, two home-built
Seeded with 10.5 million structures
A curating layer to flag data
A deposition interface to add to the data
And so it continued…
Link off a structure in ChemSpider
Answering Questions for Chemists
Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
InChIKeys for Taxol
ChEBI and Wikipedia are the SAME structure
Drugbank is a DIFFERENT structure – ONE
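The comparison on this slide can be sketched in code. A standard InChIKey has three hyphen-separated blocks: a 14-letter hash of the connectivity (the skeleton), a 10-letter block (8 letters hashing the remaining layers such as stereochemistry, followed by a standard flag and a version letter), and a single protonation letter. The sketch below is a minimal illustration, not ChemSpider's own method: the function names are hypothetical, and the keys in the usage example are made-up placeholders, not the real keys for Taxol or for any database record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InChIKeyParts:
    connectivity: str  # 14-letter hash of the molecular skeleton
    other_layers: str  # 8-letter hash of stereo, isotope and other layers
    flags: str         # standard/non-standard flag plus version letter
    protonation: str   # single protonation indicator


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks (hypothetical helper)."""
    blocks = key.strip().upper().split("-")
    if [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    if not all(b.isalpha() for b in blocks):
        raise ValueError(f"InChIKey must contain letters only: {key!r}")
    first, second, third = blocks
    return InChIKeyParts(first, second[:8], second[8:], third)


def compare_inchikeys(a: str, b: str) -> str:
    """Classify how two records relate, judged by their keys alone."""
    ka, kb = split_inchikey(a), split_inchikey(b)
    if ka == kb:
        return "same structure"
    if ka.connectivity == kb.connectivity:
        return "same skeleton, different in other layers"
    return "different skeleton"


# Placeholder keys (assumptions for illustration only).
record_a = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
record_b = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
record_c = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"

print(compare_inchikeys(record_a, record_b))
print(compare_inchikeys(record_a, record_c))
```

Two records that share the first block but differ in the second typically describe the same skeleton with different stereochemistry, which is the kind of mismatch a curator would want flagged.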
Crowdsourced Curation of Spectra
What do computers want?
Linked from Wikipedia and many Public Databases
Linked from Open Notebook Science sites
Linked from Blogs using Structure/Spectra EMBED
Integrated into structure drawing packages
Integrated into software offerings from Thermo,
Waters, Agilent, Bruker
There will always be gaps...
What ChemSpider does not deal with, yet...
Open Source, Access and Data
ChemSpider is NOT Open Source but we do use
Open Source components (OpenBabel,
JSpecView, Jmol). Thanks Microsoft!
ChemSpider is not an “Open Access Database” –
it’s a “free access” resource
We do not assume copyright. Rights to the data
and the creative works remain with the depositor
Is ChemSpider “Open Data”?
Who declares data as Open?
Data licensing is very interesting and can spark
“interesting” conversations. Opinions differ:
Are images data? Are assertions data?
What on a ChemSpider record is data?
Is PubChem or PubMed Open Data?
We allow people to declare their data as Open and
add an Open Data button at upload
A lot of data on ChemSpider are free but not Open
Pragmatism: Our focus is a community resource
Conclusions: ChemSpider Today
ChemSpider is an established community resource
>23 million compounds from >300 data sources
About 7000 unique users per day and up to ½ million
transactions per day
A crowdsourced deposition and curation platform
Grows daily – more depositions, more links, more data
Web services provider
Linked to commercial and open source software
Supporting analytical companies: Agilent, Thermo, Waters, Bruker
Serving ONS, providing games to students, ChemSpidey robot
A publishing platform for the community
Continue the curation effort and keep cleaning
Finish depositions – millions left to deposit
Integrate RSC content – a massive archive!
Integrate RSC publishing workflows and databases
Enable the semantic web for chemistry
Royal Society of Chemistry
Valery Tkachenko and Sergey Shevelev
Commercial Software: Microsoft, Advanced
Chemistry Development, OpenEye and Symyx
Open Source Software: Jmol, OpenBabel,
JC Bradley, Andrew Lang – The Spectral Game
and Open Notebook Science integration
The “Crowd” of curators
306 Data Source providers
SLIDES: www.slideshare.net/AntonyWilliams