Looking for self-consistency across a Wikipedia Page
Primary key is the article TITLE
The chemical shown needs to match the title – then registry numbers, names, identifiers, outlinks need to match the chemical shown.
Cyclic self-consistency – and decisions must get made
Viagra
Viagra is sildenafil…
Viagra is Sildenafil Citrate
Sildenafil may be shown in the Wikipedia record but there might be a redirect from a search on Viagra
CAS registry Number: 139755-83-2, 171599-83-0 (as citrate). If structure box shows neutral compound then CAS Number must match neutral compound OR annotate
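A cheap first check on registry numbers like the two above is the CAS check digit. The sketch below assumes the standard CAS rule: the final digit equals the sum of the preceding digits, weighted 1, 2, 3, ... from the right, modulo 10. The function name is illustrative. It only catches mistyped numbers; it cannot tell whether a valid number belongs to the neutral compound or to the citrate, so that decision still needs a curator.

```python
import re

def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` is well-formed and its check digit is correct."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)  # every digit except the check digit
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check

if __name__ == "__main__":
    for number in ("139755-83-2", "171599-83-0"):
        print(number, is_valid_cas(number))
```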
Other issues…
If structure shows no stereo then don’t put stereo in name
Outlinks to external databases – links should be structure to structure not name to name
Tautomers
Charges
Sugars – machine-readable vs aesthetics: Fischer, stereo, Haworth
Wikipedia – Crowdsourcing Chemistry
Thymol Blue on ChemSpider
Data online includes:
UV-vis spectrum
Measured experimental properties
Link to Wikipedia article
Links to chromatography details
Multiple identifiers/trade names etc.
Links to vendors/suppliers/other databases
Safety information
http://www.chemspider.com/q/thymol%20blue
Differences between ChemSpider/Wikipedia
Wikipedia | ChemSpider
No, but links | Analytical data
Active editors: > 50 (?) | Active depositors/curators: 30
No | Prediction of properties
???? | 6000 people/day; 1900 registered
Detailed compound monographs | Compound monographs linked
Text | Complex queries: properties, text, structure/substructure, OA publishers, data sources, …
~5000 organics, 2000 others | >21 million unique structures
Differences between Wikipedia/ChemSpider
ChemSpider | Wikipedia
Growing reputation as focused on quality | Worldwide reputation as quality source, good and bad
Chemistry is the focus of the ‘Spider | Chemistry is a subset of the ‘Pedia
Mixed “licensing” | GFDL licensing for everything
Growing team of advocates, curators and users | Strong team of WP:Chem advocates, curators and admins
“Out of a basement” on three servers and 5 volunteers | Established infrastructure and Wikimedia Foundation team
Primarily Microsoft .NET technologies with OS components | Supported by the tried and tested MediaWiki platform
Usage Growth – Representative Plot
Crowd-sourcing Curation
How to curate data for millions of structures?
Robot processes can clean up depositions
Search for Chloride and check molecular formula for Cl
Check for stereochemistry and remove names with stereo
Provide a simple-to-use platform to curate, annotate and tag data
Provide curator administration to prevent vandalism (Veropedia)
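The two robot checks named above can be sketched in a few lines. This is a minimal illustration, not ChemSpider's actual implementation: the function names are made up, the formula parser assumes simple formulae without brackets, and the stereo pattern only recognises CIP-style prefixes such as (R)-, (2R,3S)- or (E)-.

```python
import re

ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")
# Matches prefixes such as "(R)-", "(2R,3S)-" or "(E)-"; other stereo
# notations (D/L, cis/trans, +/-) are deliberately left out of this sketch.
STEREO_RE = re.compile(r"\((?:\d*[RSEZ])(?:,\d*[RSEZ])*\)-")

def element_counts(formula):
    """Count atoms per element in a simple formula such as 'C7H7Cl'."""
    counts = {}
    for symbol, number in ELEMENT_RE.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

def chloride_without_chlorine(name, formula):
    """Flag a record whose name says 'chloride' but whose formula has no Cl."""
    return "chloride" in name.lower() and "Cl" not in element_counts(formula)

def stereo_name_on_flat_structure(name, structure_has_stereo):
    """Flag a stereo-specific name attached to a structure drawn without stereo."""
    return bool(STEREO_RE.search(name)) and not structure_has_stereo

if __name__ == "__main__":
    print(chloride_without_chlorine("benzyl chloride", "C7H8"))
    print(stereo_name_on_flat_structure("(2R,3S)-butane-2,3-diol", False))
```

Records flagged this way are candidates for review, not automatic deletion; the eyeballs come afterwards.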
Multi-level Curation and Approval
Post Comments
Anyone can “Post Comments” associated with a structure. To curate data we require a login so that changes can be tracked
Crowd-sourcing Chemistry
Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
ALSO
Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
Structure-Centric
We want to search Open-Access articles by structure, substructure, similarity of structure
Standard approaches would be:
Identify chemical names “entity extraction”
Convert chemical names to structures and index
ChemSpider has a validated dictionary of structure-name pairs
Use name extraction, name-conversion and dictionary look-up. THEN curate.
Massive look-up dictionary of validated identifiers on ChemSpider
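The dictionary look-up step can be sketched as below. The four-entry dictionary is a stand-in for the validated structure-name dictionary, the SMILES strings serve as the structure key for illustration only, and the function names are assumptions. Name-to-structure conversion for names missing from the dictionary is not shown.

```python
import re

# Toy stand-in for a validated name-to-structure dictionary (keys lower-case).
NAME_TO_SMILES = {
    "dichloromethane": "ClCCl",
    "ch2cl2": "ClCCl",
    "ethanol": "CCO",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}

def build_pattern(names):
    """Compile one regex that prefers longer names and respects word edges."""
    ordered = sorted(names, key=len, reverse=True)
    body = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"(?<![A-Za-z0-9])(?:" + body + r")(?![A-Za-z0-9])",
                      re.IGNORECASE)

def extract_structures(text, dictionary=NAME_TO_SMILES):
    """Return (name as written, structure) for every dictionary hit in text."""
    if not dictionary:
        return []
    pattern = build_pattern(dictionary)
    return [(m.group(0), dictionary[m.group(0).lower()])
            for m in pattern.finditer(text)]

if __name__ == "__main__":
    sample = ("in dry CH2Cl2 (30.00 mL), washed with dichloromethane "
              "and recrystallized from ethanol")
    for name, smiles in extract_structures(sample):
        print(name, "->", smiles)
```

The word-edge guards matter: without them "methanol" would be reported as a hit for "ethanol". The hits still need curation afterwards.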
Name-to-Structure and Lexemes
Name Recognition
Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol).
The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91%).
How Many Chemical Names?
“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
ChemMantis
Chemical Markup And Nomenclature Transformation Integrated System
Making Open Access Articles Searchable Proof of Concept
Can we HOST Chemistry Open Access articles on ChemSpider and add value?
Can we identify chemical names in Open Access articles in a user-friendly manner?
Can we convert names to structures in Open Access articles, expand ChemSpider and provide structure searching of Open Access chemistry articles?
Can we provide an environment for chemists to mark up their own articles and crowd-source markup of an archive?
Document markup
ChemSpider is now hosting Open Access articles from MDPI (Molecular Diversity Preservation International)
Hosting the Molbank collection at present
A Standard for Document Markup?
NLM-DTD: National Library of Medicine; Document Type Definition
Approved markup definitions to apply to journal articles – extended as necessary for our purposes
NLM/DTD markup
Chemistry and Biology
Menus can be extended as necessary
Document markup
Markup – 3 seconds!
On the fly conversion
Shorthand Formulae Supported
Curators Tools During Markup
One Click to more Info…
Coming Soon….
Structure Image Conversion
Two Seconds Later
Not Always Perfect….
A Platform for Markup
Can we provide a platform for document markup for chemists?
Workflow:
Upload Word documents or RTF files, or point to an HTML page and load it
Apply entity extraction, convert names to structures, mark-up automatically and ask for user participation
Publish final version with NLM-DTD markup
Deposit all structures on ChemSpider under embargo and wait for article DOI to release
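The automatic part of this workflow (find names, wrap them in markup, collect the structures for deposition) might look like the sketch below. It assumes the `named-content` inline element of the NLM journal DTDs with an arbitrary `content-type` value of "chemical"; that choice, the function name and the toy dictionary are assumptions, and document upload, user review and the embargo step are not shown.

```python
import re
from xml.sax.saxutils import escape

def mark_up(text, dictionary):
    """Wrap dictionary names in XML and list their structures for deposition.

    `dictionary` maps lower-case names to a structure key (SMILES here).
    Returns (xml_fragment, structures).
    """
    if not dictionary:
        return escape(text), []
    names = sorted(dictionary, key=len, reverse=True)  # prefer longer names
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:" + "|".join(re.escape(n) for n in names)
        + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    parts, structures, position = [], [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[position:match.start()]))  # untouched text
        parts.append('<named-content content-type="chemical">%s</named-content>'
                     % escape(match.group(0)))
        structures.append(dictionary[match.group(0).lower()])
        position = match.end()
    parts.append(escape(text[position:]))
    return "".join(parts), structures

if __name__ == "__main__":
    xml, found = mark_up("washed with dichloromethane & dried",
                         {"dichloromethane": "ClCCl"})
    print(xml)
    print(found)
```

Escaping the surrounding text keeps the fragment well-formed when the source contains characters such as & or <.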
Challenges
Computer software can generate chemical names better than the majority of chemists
The majority of chemical names are generated by humans and are often incorrect: they convert to the wrong structure or are ambiguous
One name, Multiple Structures
Names and Structures
Dichloroacetone
Trichloromethylsilane
Ambiguity
Ambiguity in Abbreviations - DPA
Ambiguity in Abbreviations - THF
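Names like these can be found automatically by grouping a name-structure list by name and reporting any name that points to more than one structure. The sketch below is illustrative: the function name is made up, and the SMILES readings of the two names (1,1- versus 1,3-dichloroacetone; trichloro(methyl)silane versus (trichloromethyl)silane) are supplied here as examples rather than taken from the slides. The same grouping applies to abbreviations such as DPA and THF.

```python
from collections import defaultdict

def ambiguous_names(pairs):
    """Return {name: sorted structures} for names linked to 2+ structures."""
    by_name = defaultdict(set)
    for name, structure in pairs:
        by_name[name.strip().lower()].add(structure)
    return {name: sorted(structures)
            for name, structures in by_name.items()
            if len(structures) > 1}

if __name__ == "__main__":
    pairs = [
        ("Dichloroacetone", "CC(=O)C(Cl)Cl"),            # 1,1-dichloroacetone
        ("Dichloroacetone", "ClCC(=O)CCl"),              # 1,3-dichloroacetone
        ("Trichloromethylsilane", "C[Si](Cl)(Cl)Cl"),    # trichloro(methyl)silane
        ("Trichloromethylsilane", "[SiH3]C(Cl)(Cl)Cl"),  # (trichloromethyl)silane
        ("Dichloromethane", "ClCCl"),
    ]
    for name, structures in ambiguous_names(pairs).items():
        print(name, structures)
```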
Import is Easy
Make articles Public/Private (embargo date soon)
Auto-markup and check by user
IUPAC PAC Articles
Supports Word .DOC, HTML, RTF
Drexel University Documents
Patents
Organism Markup
Extensible Markup Process
Markup process is easily extendable
Configurable from one XML file
NLM/DTD is incorporated but is easy to extend
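To illustrate what "configurable from one XML file" could mean in practice, the sketch below reads a list of entity types from a small configuration document. The schema (`markup-config`, `entity`, and the `type`, `element` and `dictionary` attributes) is entirely hypothetical and is not the real ChemMantis configuration format; the point is only that adding a new entity type, such as organisms, becomes a one-line change in the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: one <entity> line per kind of thing to mark up.
CONFIG_XML = """
<markup-config>
  <entity type="chemical" element="named-content" dictionary="chemicals.tsv"/>
  <entity type="organism" element="named-content" dictionary="organisms.tsv"/>
</markup-config>
"""

def load_entity_rules(xml_text):
    """Return {entity type: {'element': ..., 'dictionary': ...}}."""
    root = ET.fromstring(xml_text)
    rules = {}
    for entity in root.findall("entity"):
        rules[entity.get("type")] = {
            "element": entity.get("element"),
            "dictionary": entity.get("dictionary"),
        }
    return rules

if __name__ == "__main__":
    for entity_type, rule in load_entity_rules(CONFIG_XML).items():
        print(entity_type, rule)
```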
Markup Movie
DailyMed
DailyMed on ChemSpider
Quality of Structures!!!
Quality of Structures
DailyMed
Remaining Issues in Markup
Ongoing curation of look-up dictionaries – one name linked to multiple structures…which is right? Remember tautomers!
Import and markup of large documents – tested up to 2 MB. Needs extending for larger documents
Add comments/tags to documents where markup finds issues.
Oops…
Online document markup and indexing is a very disruptive offering and a natural extension for ChemSpider
What’s Coming?
Agreement with Royal Society of Chemistry that we can add their structure-based RSS feeds to ChemSpider
Agreement with Nature Publishing Group to add their Nature Chemical Biology structure collections to ChemSpider as they issue
Presently indexing the Acta Chemica Scandinavica 1947-1999 PDF backfile – our first foray into OCR
Presently indexing PLoS journals directly
More publishers have agreed…
What’s Coming?
Now working on “organism” extraction in document markup to identify organisms and link out to external resources – bacteria, fungi, viruses
Open platform for users to deposit and markup their documents
Export the markup in XML format and then map to the NLM-DTD
Extract machine-readable structures directly
Conclusions
The quality of structure-based data online should always be questioned – that includes ChemSpider
Robots and software algorithms can help but eyeballs are necessary
Data on ChemSpider are being added and curated on a daily basis but we always need more eyeballs helping
ChemSpider now has a large validated structure-name dictionary
Chemical name extraction and document markup is very enabling
Further reading
www.chemspider.com/blog
Internet-based tools for communication and collaboration in chemistry, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015
A perspective of publicly accessible/open-access chemistry databases, Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017
A Talk delivered at both UNC Chapel Hill and Drexel University
There is an increasing availability of free and open access resources for scientists to use on the internet. Coupled with the increasing availability of Open Source software tools, we are in the middle of a revolution in data availability and in tools to manipulate these data. However, freedom has a cost, and in many cases the cost is quality. ChemSpider is a free access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 150 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of the issue of quality in many chemistry-related databases, approaches to cleaning up the data, and how a curated platform can become the centralized hub for resourcing information about chemical entities. This includes experimental and predicted properties, analytical data, publications, suppliers and integrated databases. I will detail three efforts: 1) the curation of chemistry on Wikipedia; 2) an examination of structure integrity on the FDA DailyMed website, a web site of medication content and labeling as found in medication package inserts; 3) recognizing chemical names in documents and providing a platform for structure-based searching of Open Access chemistry literature.