The internet is searchable by chemical structure and substructure
Chemistry articles are indexed and searchable by a free online service
Publicly funded research data can be shared and discussed in the Open
Cheminformatics has as much of a public face as bioinformatics
ChemSpider - A Search Engine for Chemists
Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
ChemSpider can answer all of these questions
ChemSpider Data Content
Over 21.5 million unique chemical structures from ca. 150 data sources
Online Databases – PubChem, DrugBank, HMDB, Wikipedia
Chemical Vendors – over 40 different vendors and growing
Personal Depositions – individual contributions
Journal Publishers
Content database vendors
Analytical data collections
Patents (9 MILLION structures being deposited now)
Web scraping
Content is generally linked back to the original data sources
Tell me about Aspirin
Link outs
Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
Text Indexing and ChemSpider
ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
The collection is growing weekly, and more publishers have already agreed to be indexed
Open Access Literature Search
Search PubMed – ChemSpider
Other Searches
What compounds have a mass of 300 ± 0.001?
or search a combination of intrinsic/predicted properties
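A mass query of this kind amounts to a tolerance window around a target value. The following is a minimal sketch of that filter, not ChemSpider's implementation; the `Compound` record, the `search_by_mass` helper and the sample masses are hypothetical and exist only to exercise the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    # Hypothetical record type; a real database would carry many more fields.
    name: str
    monoisotopic_mass: float  # in Da


def search_by_mass(compounds, target, tolerance):
    """Return compounds whose mass lies within target ± tolerance (inclusive)."""
    return [c for c in compounds if abs(c.monoisotopic_mass - target) <= tolerance]


# Made-up masses, chosen only to fall inside and outside the window.
records = [
    Compound("compound A", 299.9985),
    Compound("compound B", 300.0004),
    Compound("compound C", 300.0020),
]

hits = search_by_mass(records, target=300.0, tolerance=0.001)
hit_names = [c.name for c in hits]
```

Only "compound B" falls inside the 300 ± 0.001 window here. Combining the mass filter with other intrinsic or predicted properties would mean adding further predicates to the same list comprehension.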
Complex Search
The Quality of Data Online…
Aggregating data opens up quality issues
Structure-identifier associations are “dirty”
Structures are COMMONLY incorrect – stereochem issues
Manual curation of small databases is enough work – what about millions of structures?
Structures are far from perfect. What is a “correct structure”?
Full stereochemistry?
Historical timeline of structure?
Who is the authority?
Who holds THE Quality Authority?
Chemical Abstracts Service is the structural authority today. 1400 (?) employees, world standard in chemistry information
101 years of knowledge, process and expertise. MANUAL curation is key. Robotic curation is enabling
How can an online, free access system peacefully co-exist with the authority?
Quality is a Major Issue – Search Butanol
Wikipedia – Crowdsourcing Chemistry
Wikipedia Chemistry Curation project
Only ca. 5000 organic structures, 7000 total structures
MONTHS of work so far for a team of 6 people
Many errors removed in the process. Curation process is a daily event for users/depositors
Slow and torturous process for stereo molecules.
Thymol Blue on ChemSpider
Data online includes:
UV-vis spectrum
Measured experimental properties
Link to Wikipedia article
Links to chromatography details
Multiple identifiers/trade names etc.
Links to vendors/suppliers/other databases
Safety information
Differences between ChemSpider/Wikipedia
Wikipedia | ChemSpider
No analytical data |
Active editors – about 50 (?) | Active depositors/curators – 30
No prediction of properties |
???? | 5000 people/day; 1100 registered
Detailed compound monographs | Compound monographs linked
Text | Complex queries – Properties, Text, structure/substructure, OA publishers, Data Sources, …
~5000 organics, 2000 others | >21 million unique structures
Differences between Wikipedia/ChemSpider
ChemSpider | Wikipedia
Growing reputation as focused on quality | Worldwide reputation as quality source
Chemistry is the focus of ‘Spider | Chemistry is a subset of the ‘Pedia
Mixed “licensing” | GFDL licensing for everything
Growing team of WP:Chem advocates, curators and admins | Strong team of WP:Chem advocates, curators and admins
“Out of a basement” on three servers and 5 volunteers | Established infrastructure and Wikimedia Foundation team
Primarily Microsoft .NET technologies with OS components | Supported by tried and tested MediaWiki platform
Crowd-sourcing Curation
How to curate data for millions of structures?
Robot processes can clean up depositions
Search for Chloride and check molecular formula for Cl
Check for stereochemistry and remove names with stereo
Provide a simple-to-use platform to curate, annotate and tag data
Provide curator administration to prevent vandalism (Veropedia)
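The two robot checks named above can be sketched as simple consistency rules between a record's names and its structure. This is an illustration under stated assumptions, not ChemSpider's code: the function names, the regular expression for stereo descriptors and the `structure_has_stereo` flag are all hypothetical, and the formula parser handles only simple formulas without brackets or charges.

```python
import re

# Assumed pattern for common stereo descriptors in names:
# (R), (S), (E), (Z), locant forms such as (2R,3S), and the signs (+) / (-).
STEREO_DESCRIPTOR = re.compile(r"\((?:\d*[RSEZ])(?:,\d*[RSEZ])*\)|\([+-]\)")


def formula_elements(formula):
    """Return the set of element symbols in a simple formula such as 'C6H5Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def chloride_name_mismatch(name, formula):
    """Flag a record whose name mentions 'chloride' but whose formula lacks Cl."""
    return "chloride" in name.lower() and "Cl" not in formula_elements(formula)


def stereo_names_to_remove(names, structure_has_stereo):
    """List names carrying stereo descriptors when the structure defines none."""
    if structure_has_stereo:
        return []
    return [n for n in names if STEREO_DESCRIPTOR.search(n)]


ok_record = chloride_name_mismatch("Sodium chloride", "NaCl")
bad_record = chloride_name_mismatch("Ethyl chloride", "C2H6O")
flagged = stereo_names_to_remove(["ibuprofen", "(S)-ibuprofen"], structure_has_stereo=False)
```

Rules like these only flag candidates; whether the name or the structure is the wrong half of the pair still needs a curator's judgment.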
Multi-level Curation and Approval
Post Comments
Anyone can “Post Comments” associated with a structure. To curate data we require a login so that changes can be tracked
Crowd-sourcing Chemistry
Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
ALSO
Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
But, when registered and logged in…
Ability to curate and add to the database
Add structures
“Clean” structures
Add data (spectra, CIFs, images)
Add links to other pages (URLs)
Add publication details
Adding to the Database - Structure
Adding New Text Data
Add Publication
Add Identifier
Add URL
Adding Supplementary Info to a Structure
ChemSpider TouchGraph
Structure-Centric
We want to search Open-Access articles by structure, substructure, similarity of structure
Standard approaches would be:
Identify chemical names “entity extraction”
Convert chemical names to structures and index
ChemSpider has a validated dictionary of structure-name pairs
Use name extraction, name-conversion and dictionary look-up. THEN curate.
Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
Name Recognition
Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%)
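The dictionary look-up step can be illustrated on a shortened version of this passage. The sketch below assumes a tiny, hypothetical name-to-SMILES dictionary and a hypothetical `find_known_names` helper; it stands in for the validated structure-name dictionary mentioned earlier and is not the actual extraction software.

```python
import re

# Hypothetical mini-dictionary of name -> SMILES pairs (keys are lowercase).
NAME_TO_SMILES = {
    "dichloromethane": "ClCCl",
    "ch2cl2": "ClCCl",
    "ethanol": "CCO",
    "mgso4": "[Mg+2].[O-]S(=O)(=O)[O-]",
}


def find_known_names(text, dictionary):
    """Return (matched text, offset, SMILES) for every dictionary name in text."""
    # Longest names first so the alternation prefers the longest match.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + "|".join(map(re.escape, names)) + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return [
        (m.group(1), m.start(), dictionary[m.group(1).lower()])
        for m in pattern.finditer(text)
    ]


passage = (
    "To a stirred solution of azo aldehyde 2 in dry CH2Cl2 (30.00 mL) were added "
    "(3,4-diaminophenyl)phenyl methanone 1 and an excess of anhydrous MgSO4. "
    "The mixture was filtered and washed with dichloromethane, and the product "
    "was recrystallized from ethanol 95%."
)

hits = find_known_names(passage, NAME_TO_SMILES)
found = [name for name, _, _ in hits]
```

The look-up finds CH2Cl2, MgSO4, dichloromethane and ethanol. Names absent from the dictionary, such as (3,4-diaminophenyl)phenyl methanone, are left for the name-to-structure conversion step, and document-local labels such as "azo aldehyde 2" are left for curation.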
How Many Chemical Names?
“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
Making Open Access Articles Searchable Proof of Concept
Can we HOST Chemistry Open Access articles on ChemSpider and add value?
Can we identify chemical names in Open Access articles in a user-friendly manner?
Can we convert names to structures in Open-Access articles and expand ChemSpider and provide structure searching of Open Access chemistry articles?
Can we provide an environment for chemists to mark-up their own articles and crowd-source markup of an archive?
Document markup
ChemSpider is now hosting Open Access articles from MDPI (Molecular Diversity Preservation International)
Hosting the Molbank collection at present
A Standard for Document Markup?
NLM-DTD: National Library of Medicine Document Type Definition
Approved markup definitions to apply to journal articles – extended as necessary for our purposes
NLM/DTD markup
Chemistry and Biology
Menus can be extended as necessary
Document markup
Searching from the Structure Balloon
A Platform for Markup
Can we provide a platform for document markup for chemists?
Workflow:
Upload Word documents or RTF files, or point to an HTML page and load it
Apply entity extraction, convert names to structures, mark-up automatically and ask for user participation
Publish final version with NLM-DTD markup
Deposit all structures on ChemSpider under embargo and wait for article DOI to release
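The automatic mark-up step in this workflow can be sketched as wrapping each recognized name in an XML element. The example uses the `named-content` element with a `content-type` attribute from the NLM journal DTD; the attribute value `chemical-name`, the `mark_up` helper and the sample sentence are assumptions for illustration, and the extended markup actually used on ChemSpider is not shown in these slides.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def mark_up(text, names, content_type="chemical-name"):
    """Wrap each recognized name in a named-content element inside one <p>."""
    if not names:
        return "<p>" + escape(text) + "</p>"
    # Longest names first so longer names win over their substrings.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in ordered))
    pieces, last = [], 0
    for m in pattern.finditer(text):
        pieces.append(escape(text[last:m.start()]))  # plain text before the name
        pieces.append(
            f'<named-content content-type="{content_type}">'
            f"{escape(m.group(0))}</named-content>"
        )
        last = m.end()
    pieces.append(escape(text[last:]))  # trailing plain text
    return "<p>" + "".join(pieces) + "</p>"


sentence = "The mixture was washed with dichloromethane; yield > 90 %."
fragment = mark_up(sentence, ["dichloromethane"])
parsed = ET.fromstring(fragment)  # raises ParseError if the fragment is malformed
```

Escaping the surrounding text before concatenation keeps characters such as `>` from breaking the XML, which matters when the input comes from uploaded Word, RTF or HTML files.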
Online Markup
Automated markup
Name to Structure Conversion
Conversion of Structure Images
Not all compounds have a “name”
Structure images can be converted to connection tables
Cryptomisrine
Structure Conversion from Images-CLiDE
Conversion depends on the zoom factor; at the right zoom it can give a perfect conversion!
Supports Word .DOC, HTML, RTF
Extensible Markup Process
Markup process is easily extendable
Configurable from one XML file
NLM/DTD is incorporated but is easy to extend
Tipping Point
Tipping point - the point at which a slow gradual change becomes irreversible and then proceeds with gathering pace
Our Challenges
There are “no employees”
ChemSpider is non-funded
System is hyper-dependent on ISP, power and limited compute power
We are upsetting some people – specifically “closed” data content providers
What’s Coming?
Agreement with Royal Society of Chemistry that we can add their structure-based RSS feeds to ChemSpider
Agreement with Nature Publishing Group to add their Nature Chemical Biology structure collections to ChemSpider as they issue
Presently indexing Acta Chemica Scandinavica, 1947-1999 PDF backfile – our first foray into OCR
Presently indexing PLoS journals directly
More publishers have agreed…
Conclusions
The quality of structure-based data online should always be questioned – that includes ChemSpider
Robots and software algorithms can help but eyeballs are necessary
Data on ChemSpider are being added and curated on a daily basis but we need more eyeballs helping always
ChemSpider now has a large validated structure-name dictionary
Further reading
www.chemspider.com/blog
Internet-based tools for communication and collaboration in chemistry, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015
A perspective of publicly accessible/open-access chemistry databases, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017
ChemSpider Forums/Blogs
Forum.chemspider.com
www.chemspider.com/blog
Acknowledgments
The ChemSpider team of volunteer developers
ChemSpider Advisory Group
Our curators, depositors and users
Suppliers of commercial software – Microsoft, ACD/Labs, OpenEye, ChemAxon, SimBioSys
SureChem – Structure Based Online Patent Searching