A Presentation at Nature Publishing Group Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Crowdsourcing, Collaborations and
Text-Mining in a
World of Open Chemistry

Antony Williams

Imagine a time when ….

The internet is searchable by chemical structure and
substructure (e.g.Wikipedia, Google Scholar)
Chemistry articles are indexed and searchable by a free
online service
The web is linked together through the “language of
chemistry”
Publicly funded research data can be shared and
discussed in the Open, maybe as ONS?
Cheminformatics has as much of a public face as
bioinformatics

Building a Structure Centric Community for Chemists

ChemSpider - A Search Engine for Chemists

Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?

ChemSpider can answer all of these questions


What is a Structure?
Ask a computer…ask a chemist


Tell Me About Glutathione


Link outs


Links out to KEGG
Kyoto Encyclopedia of Genes and Genomes


How many names does a compound have?


ChemSpider Data Content

Over 21.5 million unique chemical structures from ca. 150 data
sources
Online Databases –PubChem, Drugbank, KEGG, Wikipedia
Literature – PubMed, J Het Chem, Nature, RSC, Open Access
Chemical Vendors – over 40 different vendors and growing
Personal Depositions – individual contributions
Content database vendors
Analytical data collections
Patents
Web scraping

Content is linked back to the original data sources


Other Searches

What compounds have a mass of 300+/-0.001?

or search a combination of intrinsic/predicted properties

Other Searches


Complex Search


The Quality of Data Online…
Aggregating data opens up quality issues
Structure-identifier associations are “dirty”
Structures are COMMONLY incorrect
Manual curation of small databases is enough work – what
about millions of structures?
Structures are far from perfect. What is a “correct structure”?
Full stereochemistry?
Historical timeline of structure?
Who is the authority?


Who holds THE Quality Authority?

Chemical Abstracts Service is the structural authority
today. 1400 employees, world standard in chemistry
information
101 years of knowledge, process and expertise.
How can an online, free access system peacefully co-
exist with the authority?


Quality is a Major Issue- Search Butanol
OLD EXAMPLE..now fixed


Wikipedia Chemistry Curation project

Only ca. 5000 organic structures, 7000 total
structures
Almost a year of work so far for a team of 6
people
Many errors removed in the process. Curation
process is a daily event for users/depositors
Slow and torturous process

http://en.wikipedia.org/wiki/Talk:Tacrolimus#
IUPAC_Name_and_structure


Wikipedia Curation

Looking for self-consistency
across a Wikipedia Page
Primary key is the article TITLE
The chemical shown needs to
match the title
Cyclic self-consistency – and
decisions must get made


Viagra or Sildenafil


Other issues…


Charges


Sugars – Machine Readable vs Aesthetics

Haworth Stereo Fischer


Wikipedia – Crowdsourcing Chemistry


Thymol Blue on ChemSpider

Data online includes:
UV-vis spectrum
Measured experimental properties
Link to Wikipedia article
Links to chromatography details
Multiple identifiers/trade names etc.
Links to vendors/suppliers/other databases
Safety information

http://www.chemspider.com/q/thymol%20blue


Differences between ChemSpider/Wikipedia

ChemSpider Wikipedia
>21 million unique structures ~5000 organics, 2000 others
Complex queries – Properties, Text
Text, structure/substructure, OA
publishers, Data Sources, …
Prediction of properties No
Analytical Data No, but links.
Active depositors/curators – 30 Active editors > 50 (?)
6000 people/day; 1900 registered ????
Compound monographs linked Detailed compound monographs


Differences between Wikipedia/ChemSpider

Wikipedia ChemSpider
Supported by tried and tested Primarily Microsoft .NET
Media-Wiki platform. technologies with OS components
Established infrastructure and “Out of a basement” on three
Wikipedia Foundation Team servers and 5 volunteers
Chemistry is a subset of the ‘Pedia Chemistry is the focus of ‘Spider
GFL licensing for everything Mixed “licensing”
Strong team of WP:Chem Growing team of advocates,
advocates, curators and admins curators and users
Worldwide reputation as quality Growing reputation as focused on
source – good and bad quality

Crowd-sourcing Curation

How to curate data for millions of structures?
Robot processes can clean up depositions
Search for Chloride and check molecular formula for Cl
Check for stereochemistry and remove names with stereo
Provide a simple-to-use platform to curate, annotate
and tag data
Provide curator administration to prevent vandalism
(Veropedia)


Post Comments
Anyone can “Post Comments” associated with a
structure. To curate data we require login to track


Multi-level Curation and Approval


Crowd-sourcing Chemistry

Crowd-sourced curation: identify and tag errors, edit
names, synonyms, identify records for deprecation

ALSO

Crowd-sourced deposition: anyone can deposit data
(structures, text, images, analytical data)


DailyMed


Quality of Structures


Quality of Structures!!!


Structure-Centric

We want to search “information” by structure, substructure,
similarity of structure
Specific focus on Open Chemistry at present
Standard approaches would be:
Identify chemical names “entity extraction”
Convert chemical names to structures and index
ChemSpider has a validated dictionary of structure-name
pairs
Use name extraction, name-conversion and dictionary look-
up. THEN curate.


“Entity Extraction”

Rule-based recognition of systematic names:
Use a lexeme of name fragments
Rules for identifying bounds of a name

Look-up dictionary:
Drug Names
Trivial Names
Numbers : Registry IDs, EINECS/ELINCS
Massive look-up dictionary of validated identifiers on
ChemSpider


Name Recognition

Azo aldehyde 2 was synthesized according to a
reported method [17]. To a stirred solution of azo aldehyde
2 (1.08 g, 3.76 mmol ) in dry CH2Cl2 (30.00 mL) at 0
oC were successively added (3,4-diaminophenyl)phenyl
methanone 1(0.40 g, 1.88 mmol) and a excces of anhydrous
MgSO4 (2.00 g,16.67 mmol) .
The resulting mixture was stirred for 6 hours at room
temperature [18]. The mixture was filtered and washed with
dichloromethane . Then the solvent was evaporated under
reduced pressure to give azo Schiff base 3 as a red solid which
was recrystalized from ethanol 95% (1.28 g, 91 %)


Name Recognition

Azo aldehyde 2 was synthesized according to a
reported method [17]. To a stirred solution of azo aldehyde
2 (1.08 g, 3.76 mmol ) in dry CH2Cl2 (30.00 mL) at 0
oC were successively added (3,4-diaminophenyl)phenyl
methanone 1(0.40 g, 1.88 mmol) and a excess of anhydrous
MgSO4 (2.00 g,16.67 mmol) .
The resulting mixture was stirred for 6 hours at room
temperature [18]. The mixture was filtered and washed with
dichloromethane . Then the solvent was evaporated under
reduced pressure to give azo Schiff base 3 as a red solid which
was recrystalized from ethanol 95% (1.28 g, 91 %)


How Many Chemical Names?
“She had the drive to derive success in any
venture and was well versed in Karate.
When the man in the tartan shirt
approached her with a dagger in his hand
she spat in his face, took the stance of a
commando and took advantage of his
shock to release the dagger from his grip
and causing him to recoil. He went home
and took an aspirin after the beating.”

ChemMantis

Chemical Markup And Nomenclature Transformation
Integrated System


Making Open Access Articles Searchable
Proof of Concept
Can we HOST Chemistry Open Access articles on
ChemSpider and add-value
Can we identify chemical names in Open Access articles
in a user-friendly manner
Can we convert names to structures in Open-Access
articles and expand ChemSpider and provide structure
searching of Open Access chemistry articles?
Can we provide an environment for chemists to mark-up
their own articles and crowd-source markup of an
archive?


Document markup

ChemSpider now hosting Open Access articles from
MDPI, Molecular Diversity Preservation International
Hosting the Molbank collection at present


A Standard for Document Markup?

NLM-DTD: National Library of Medicine; Document
Type Definition
Approved markup definitions to apply to journal
articles – extended as necessary for our purposes


NLM/DTD markup


Chemistry and Biology

Menus can be extended as necessary


Document markup


Markup – 3 seconds!


On the fly conversion


Shorthand Formulae Supported


One Click to more Info…


Structure Image Conversion


Two Seconds Later


Not Always Perfect….


A Platform for Markup

Can we provide a platform for document markup for
chemists?
Workflow:
Upload word docs, RTF files or point to HTML and load
Apply entity extraction, convert names to structures, mark-up
automatically and ask for user participation
Publish final version with NLM-DTD markup
Deposit all structures on ChemSpider under embargo and
wait for article DOI to release


Challenges

Computer software can generate chemical names better
than the majority of chemists
The majority of chemical names are generated by
humans, and Incorrect – convert to the wrong structure
or are ambiguous
One name, Multiple Structures


Names and Structures

Dichloroacetone

Trichloromethylsilane


Ambiguity


Ambiguity in Abbreviations - DPA


Ambiguity in Abbreviations - THF


Import is Easy

Make articles Public/Private (embargo date soon)
Auto-markup and check by user


IUPAC PAC Articles


Supports Word .DOC, HTML, RTF


Drexel University Documents


Patents


Single Configuration File defines entities
for markup
Algorithms can be built for certain
entities but the majority are dictionaries
– vendors, Phys Properties, Analytical
We can extend our system to support
your needs based on dictionaries – what
does NPG need/not need?


Nature Publications


Entity Balloons

Structures are the
language of chemistry
Show structures to
chemists and search/link
from there


Other Dictionaries - Species

We are considering
Bacteria
Fungi
Enzymes
Viruses
PDB codes….


Integrations Out to Other Sources


Reactions


Manual Curation is Always Necessary


Text-Indexing and ChemSpider?

ChemSpider text-indexes almost 500,000 Open Access
and Free Access articles

Collection is growing and more publishers have already
agreed. Including theses in the future.


Open Access Literature Search


Conclusions

The quality of structure-based data online should
always be questioned – that includes ChemSpider
Data on ChemSpider are being added and curated on a
daily basis but we need more eyeballs helping always
ChemSpider has a large validated structure-name
dictionary
Chemical name extraction and document markup is
very enabling


Oops…


A Presentation at Nature Publishing Group Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to A Presentation at Nature Publishing Group Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Similar to A Presentation at Nature Publishing Group Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry (20)

Recently uploaded

Recently uploaded (20)

A Presentation at Nature Publishing Group Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry