Progress in drug discovery and chemical biology is greatly enabled by curated document-assay-result-compound-protein (D-A-R-C-P) relationships in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure and biological data. As an outcome of these efforts, the National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. A series of software applications and databases has been produced over the past decade to deliver these data, but recent efforts have focused on a new software architecture that assembles the resources into a single platform. A new web application, the CompTox Chemistry Dashboard, provides access to data associated with ~720,000 chemical substances. These data include experimental and predicted physicochemical property data, bioassay screening data associated with the ToxCast program, product and functional use information and a myriad of related data of value to environmental scientists.
The dashboard supports chemical searching by chemical name, synonym and CAS Registry Number. Flexible search capabilities allow for chemical identification in non-targeted analysis studies using mass spectrometry. Both mass- and formula-based searching rank-order results via functional use statistics, thereby helping to prioritize chemicals for further review when detected in environmental media.
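The mass-based search with functional-use ranking described above can be sketched as follows. This is an illustrative outline only: the candidate records, monoisotopic masses and use counts below are placeholder data, not Dashboard content.

```python
# Illustrative sketch of mass-based candidate lookup with functional-use
# ranking. The records and use counts are hypothetical placeholders.
CANDIDATES = [
    # (name, monoisotopic mass in Da, number of reported functional uses)
    ("bisphenol A",        228.1150, 42),
    ("triphenylphosphine", 262.0911,  3),
    ("benzophenone",       182.0732, 17),
    ("2,4-D",              219.9694,  8),
]

def search_by_mass(query_mass, tolerance=0.005):
    """Return candidates within +/- tolerance Da of the query mass,
    rank-ordered by functional-use count (highest first)."""
    hits = [c for c in CANDIDATES if abs(c[1] - query_mass) <= tolerance]
    return sorted(hits, key=lambda c: c[2], reverse=True)

for name, mass, uses in search_by_mass(228.115):
    print(f"{name}: {mass} Da, {uses} functional uses")
```

The functional-use count serves as a simple prior: among candidates with indistinguishable masses, chemicals with more reported uses are more likely to occur in environmental media.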
This presentation will provide an overview of the CompTox Dashboard, its capabilities for delivering data to the environmental toxicology community and how the architecture provides a foundation for the development of additional applications to support chemical risk assessment. This abstract does not reflect U.S. EPA policy.
Web-based technologies coupled with a drive for improved communication between scientists have resulted in the proliferation of scientific opinion, data and knowledge at an ever-increasing rate. The increasing array of chemistry-related computer-based resources now available provides chemists with a direct path to the discovery of information that was previously accessed via library services and limited to commercial and costly resources. We propose that preclinical absorption, distribution, metabolism, excretion and toxicity data as well as pharmacokinetic properties from studies published in the literature (which use animal or human tissues in vitro or from in vivo studies) are precompetitive in nature and should be freely available on the web. This could be made possible by curating the literature and patents, data donations from pharmaceutical companies and by expanding the currently freely available ChemSpider database of over 21 million molecules with physicochemical properties. This will require linkage to PubMed, PubChem and Wikipedia as well as other frequently used public databases, and mining of the full-text publications to extract the pertinent experimental data. These data will need to be extracted using automated and manual methods, cleaned, and then published to ChemSpider or another database so that they are freely available to the biomedical research and clinical communities. Making these data accessible will improve the development of drug molecules with good ADME/Tox properties, facilitate computational model building for these properties and enable researchers to avoid repeating the failures of past drug discovery studies.
The EPA’s CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is a publicly accessible website providing access to data for ~875,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, product use information extracted from safety data sheets, and integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including a Per- and Polyfluoroalkyl Substances (PFAS) list containing thousands of chemicals. A procured testing library of hundreds of PFAS chemicals annotated into chemical categories has also been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. Several specific search types have been developed to directly support the mass spectrometry non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within the database. This presentation will provide an overview of the dashboard, the ongoing expansion of the PFAS chemical library with associated categorization, and new physicochemical property and environmental fate and transport QSAR prediction models developed for these chemicals. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases, and as a result the dashboard surfaces hundreds of thousands of data points. Other data include experimental and predicted physicochemical property data, in vitro bioassay data and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include real-time physicochemical and toxicity endpoint prediction and an integrated search of PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The increasing popularity of high mass accuracy non-target mass spectrometry methods has prompted extensive identification efforts based on chemical compound databases. Candidate structures are often retrieved by either exact mass or molecular formula from large resources such as PubChem, ChemSpider or the EPA CompTox Chemistry Dashboard. Additional data (e.g. fragmentation, physicochemical properties, reference and data source information) are then used to select potential candidates, depending on the experimental context. However, these strategies require that the substances of interest be present in these compound databases, which is often not the case, as no database can be fully inclusive. A prominent example with clear data gaps is surfactants, which are used in many products in our daily lives yet are often absent as discrete structures in compound databases. Linear alkylbenzene sulfonates (LAS) are a common, high-use and high-priority surfactant class with highly complex transformation behaviour in wastewater. Despite extensive reports in the environmental literature, few of the LAS and none of the related transformation products were present in any compound database during an investigation into Swiss wastewater effluents, despite these forming the most intense signals. The LAS surfactant class will be used to demonstrate how the coupling of environmental observations with high resolution mass spectrometry and detailed literature data (expert knowledge) on the transformation of these species can be used to progressively “fill the gaps” in compound databases. The LAS and their transformation products have been added to the CompTox Chemistry Dashboard (https://comptox.epa.gov/) using a combination of “representative structures” and “related structures” starting from the structural information contained in the literature.
By adding this information into a centralized open resource, future environmental investigations can now profit from the expert knowledge previously scattered throughout the literature. Note: This abstract does not reflect US EPA policy.
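Formula-based candidate retrieval of the kind described above depends on computing monoisotopic masses from molecular formulae. A minimal sketch (element masses from standard isotope tables; handles only simple formulae without parentheses or isotope labels):

```python
import re

# Monoisotopic masses of the most abundant isotopes (standard values,
# truncated for brevity; a real implementation covers the full periodic table).
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.0030740,
    "O": 15.9949146, "S": 31.9720707, "Na": 22.9897693,
}

def monoisotopic_mass(formula):
    """Compute the monoisotopic mass of a simple molecular formula, e.g. 'C2H6O'."""
    mass = 0.0
    # Match an element symbol (one uppercase letter, optional lowercase letter)
    # followed by an optional count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass

print(round(monoisotopic_mass("C2H6O"), 4))  # ethanol -> 46.0419
```

A database search then compares this computed mass against observed accurate masses within an instrument-dependent tolerance window.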
The iCSS CompTox Chemistry Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the U.S. EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g. ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. This poster reviews the benefits of the EPA’s platform and the underlying algorithms used for compound identification from high-resolution mass spectrometry data. Standard approaches for both mass and formula lookup are available, but the dashboard delivers a novel approach for hit ranking based on the functional use of the chemicals. The focus on high-quality data, novel ranking approaches and integration with other resources of value to mass spectrometrists makes the CompTox Dashboard a valuable resource for the identification of environmental chemicals. This abstract does not reflect U.S. EPA policy.
This presentation was made at the University of North Carolina at Chapel Hill on 9/20/21. It provided a general introduction to cheminformatics before demonstrating how to navigate the Dashboard, covering:
• An introduction to the dashboard
• Substances vs structures
• Structure formats for data exchange and connectivity (SMILES, InChIs, molfiles)
• Identifiers – CASRN, chemical names, systematic names
• Data curation approaches: substance-structure ambiguity
• ChemReg: substance registration
• Data gathering for systematic reviews
• Curated lists
• Properties/Fate and Transport
• Access to Exposure Data
• Hazard data in the dashboard – ToxVal data (sourced from >40 databases, >50,000 chemicals, >900,000 data points)
• The Executive Summary of data
• Single chemical searches vs Batch searches
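One of the curation steps implied by the identifier topics above is validating CAS Registry Numbers, which carry a built-in check digit: the digits before the check digit, read right to left, are weighted 1, 2, 3, ..., and the weighted sum modulo 10 must equal the check digit. A minimal validator:

```python
def valid_casrn(casrn):
    """Validate a CAS Registry Number (e.g. '7732-18-5') via its check digit.

    The digits before the final (check) digit, read right to left, are
    weighted 1, 2, 3, ...; the weighted sum modulo 10 must equal the check digit.
    """
    parts = casrn.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    digits, check = parts[0] + parts[1], int(parts[2])
    weighted = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return weighted % 10 == check

print(valid_casrn("50-00-0"))    # formaldehyde -> True
print(valid_casrn("7732-18-5"))  # water -> True
print(valid_casrn("50-00-1"))    # corrupted check digit -> False
```

A passing check digit does not prove the CASRN maps to the intended substance, but a failing one reliably flags transcription errors during registration and curation.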
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases within the DSSTox project and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for ~760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including, for example, scientists interested in Per- and Polyfluoroalkyl Substances (PFAS). Added lists include those sourced from the European Union as well as those developed in-house, now containing thousands of chemicals. A procured testing library of hundreds of PFAS chemicals annotated into chemical categories has been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. This presentation will provide an overview of the dashboard, the developing library of PFAS chemicals and associated categorization, and new physicochemical property and environmental fate and transport QSAR prediction models developed for these chemicals.
The application of the dashboard to support mass spectrometry non-targeted analysis studies for the identification of PFAS chemicals will also be reviewed. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The construction of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to disentangle the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago. Specific examples of quality issues in the EPISuite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS Registry Number for single records, SMILES strings that cannot be converted into chemical structures, hypervalency in the chemical structures and the absence of stereochemistry for thousands of data records. Relative to the era of EPISuite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. This includes approaches to cross-validate chemical structure representations (e.g. molfile and SMILES) against identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices and to examine to what extent efforts into the development of large, high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
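The first quality issue described above, multiple records for the same chemical structure with differing measured values, can be detected with a simple grouping pass. This is a schematic sketch with hypothetical records keyed by an InChIKey-style structure identifier, not the actual curation pipeline:

```python
from collections import defaultdict

# Hypothetical property records: (structure key, measured logP value).
# The keys below are illustrative InChIKey-prefix-style strings, not real ones.
RECORDS = [
    ("IISBACLAFKSPIT", 3.32),   # record 1 for a structure
    ("IISBACLAFKSPIT", 3.40),   # duplicate record, conflicting measured value
    ("YXFVVABEGXRONW", 2.73),   # a structure with a single record
]

def flag_conflicts(records, tolerance=0.05):
    """Group records by structure key and return the keys (with their values)
    whose measured values disagree by more than the tolerance."""
    by_key = defaultdict(list)
    for key, value in records:
        by_key[key].append(value)
    return {k: vs for k, vs in by_key.items()
            if max(vs) - min(vs) > tolerance}

print(flag_conflicts(RECORDS))
```

Flagged groups can then be routed to manual review, averaged, or excluded, and the resulting "data slices" used to test how quality affects model performance.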
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are utilized to identify emerging contaminants and chemical signatures of interest detected in various media. At the US Environmental Protection Agency, the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is an open chemistry resource and web-based application containing data for ~900,000 substances that supports non-targeted and suspect screening analyses. Searching functionality includes identifier searches (e.g. systematic names, trade names and CAS Registry Numbers) and mass- and formula-based searches, while prototype developments include combined substructure-mass/formula searches and searching experimental mass spectral data against predicted fragmentation spectra. A specific type of data mapping in the database uses “MS-Ready” structures, produced by processing all registered substances to separate multi-component chemicals into their individual components, remove stereochemical bonds, and desalt and neutralize the structures. This MS-Ready processing supports batch searching using either masses or formulae to identify candidate chemicals and their mapped substances. A number of chemical lists (https://comptox.epa.gov/dashboard/chemical_lists) have also been developed to support the identification of chemicals related to agrochemistry, specifically pesticides (both active and inert constituents), insecticides and their metabolites, and environmental breakdown products. This presentation will provide an overview of how the CompTox Chemicals Dashboard supports mass spectrometry-based structure identification and non-targeted analysis of chemicals in agrochemistry. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
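As a rough illustration of the MS-Ready idea (not the Dashboard's actual implementation, which uses full cheminformatics toolkits and also handles charge neutralization), the desalting and stereochemistry-removal steps can be mimicked at the SMILES-string level:

```python
def ms_ready(smiles):
    """Simplified illustration of 'MS-Ready' processing on a SMILES string:
    keep the largest component of a multi-component (e.g. salt) record and
    strip stereo markers. Real workflows parse the structure with a
    cheminformatics toolkit; this sketch only shows the general shape."""
    # Desalt: components of a SMILES are '.'-separated; keep the largest.
    largest = max(smiles.split("."), key=len)
    # Remove stereo markers: '@' (tetrahedral centers), '/' and '\' (double bonds).
    for marker in ("@", "/", "\\"):
        largest = largest.replace(marker, "")
    return largest

print(ms_ready("CC(=O)[O-].[Na+]"))   # sodium acetate -> acetate component kept
print(ms_ready("C[C@@H](N)C(=O)O"))   # alanine -> stereo marker stripped
```

Because the mass spectrometer observes the desolvated, often desalted form of a substance, searching against MS-Ready structures lets one observed mass map back to all registered substances (salts, stereoisomers, mixtures) that share that form.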
The internet has changed the way we access chemistry data, providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with numbers for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million structures respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of this noise, the value of the larger databases becomes highly dependent on the specific application; one example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to disentangle the influence of the quality versus the quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPISuite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy.
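As a toy illustration of the kind of property modeling described (not the actual models, which use many molecular descriptors and machine-learning methods), a single-descriptor ordinary least-squares fit looks like this; the boiling points are standard experimental values for n-alkanes:

```python
# Minimal single-descriptor QSAR-style fit via ordinary least squares:
# predict normal boiling point from carbon count for n-alkanes (C5-C9).
DATA = [(5, 36.1), (6, 68.7), (7, 98.4), (8, 125.6), (9, 150.8)]  # (nC, bp in C)

def fit_ols(points):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

a, b = fit_ols(DATA)
print(f"slope = {a:.2f} C per carbon, intercept = {b:.2f} C")
```

Even this trivial model shows why data quality matters: a single corrupted record (wrong structure-value linkage) shifts both fitted coefficients, and that sensitivity grows with model complexity.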
The patent literature has historically been complex and inaccessible to the searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form has allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner to deliver a turnkey patent informatics system with automatically extracted and searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. The product was launched and licensed to a user community under a freemium business model. Latterly, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc.). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with the remit of data dissemination and sharing, and deep experience of biodata, including the large ChEMBL database of structure-activity relationship data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
Presentation on the Chemical Analysis Metadata Platform (ChAMP), a new project to characterize and organize metadata about chemical analysis methods. The project will develop an ontology, controlled vocabularies, and design rules.
The Royal Society of Chemistry (RSC) is a major participant in providing access to chemistry-related data via the web. As an internationally renowned society for the chemical sciences, a scientific publisher and the host of the ChemSpider database for the community, the RSC continues to make dramatic strides in providing online access to data. ChemSpider provides access to over 30 million chemicals sourced from over 500 data suppliers and linked out to related information on the web. The platform is a crowdsourcing environment whereby members of the community can participate in validating and expanding the content of the database. Through a set of application programming interfaces, ChemSpider is used by various organizations and projects to serve up data for a range of purposes. These include structure identification for mass spectrometry instrument vendors, RSC databases such as the MarinLit natural products database and a European grant-based project funded by the Innovative Medicines Initiative. This presentation will provide an overview of various cheminformatics activities and projects that the RSC is involved with to serve the medicinal chemistry community. These include the Open PHACTS semantic web project, the PharmaSea project to identify new pharmaceutical leads from the ocean and the UK National Compound Collection to identify new lead compounds contained within PhD theses.
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are being made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for almost 760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, who are generating important data for detecting and assessing environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders including, for example, scientists interested in algal toxins and hydraulic fracturing chemicals. This presentation will provide an overview of the challenges associated with the curation of data from EPA’s December 2016 Hydraulic Fracturing Drinking Water Assessment Report, which covered chemicals reported to be used in hydraulic fracturing fluids and those found in produced water. The data have been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. The application of the dashboard to support mass spectrometry non-targeted analysis studies will also be reviewed. This abstract does not reflect U.S. EPA policy.
Presentation for Texas A&M Superfund Research Center virtual learning series, Big Data in Environmental Science and Toxicology. More details at https://superfund.tamu.edu/big-data-session-2-aug-18-2021/
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and, as a result, the dashboard surfaces hundreds of thousands of data points. Other data include experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and ~1500 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction, and an integrated search of PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are advancing the identification of emerging contaminants in environmental matrices, improving the means by which exposure analyses can be conducted. However, confidence in structure identification of unknowns in NTA presents challenges to analytical chemists. Structure identification requires integration of complementary data types such as reference databases, fragmentation prediction tools, and retention time prediction models. The goal of this research is to optimize and implement structure identification functionality within the US EPA’s CompTox Chemistry Dashboard, an open chemistry resource and web application containing data for ~760,000 substances. Rank-ordering the number of sources associated with chemical records within the Dashboard (Data Source Ranking) improves the identification of unknowns by bringing the most likely candidate structures to the top of a search results list. Database searching has been further optimized with the generation of MS-Ready structures, which are de-salted, stripped of stereochemistry, and mixture-separated to replicate the form of a chemical observed via HRMS. Functionality to conduct batch searching of molecular formulae and monoisotopic masses was designed and released to improve searching. Finally, a scoring-based identification scheme was developed, optimized, and surfaced via the Dashboard using multiple data streams contained within the underlying database. The scoring-based identification scheme improved the identification of unknowns over previous efforts using data source ranking alone. Combining these steps within an open chemistry resource provides a freely available software tool for structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
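The MS-Ready transformation described above (mixture separation, desalting, stereochemistry removal) can be illustrated with a deliberately simplified sketch that works directly on SMILES strings. The salt-fragment list below is an illustrative assumption, and a production pipeline would use a full cheminformatics toolkit rather than string edits:

```python
# Toy sketch of MS-Ready structure generation at the SMILES-string level.
# The counterion list is illustrative only, not the EPA's desalting rules.

SALT_FRAGMENTS = {"Cl", "Br", "[Cl-]", "[Na+]", "[K+]", "O"}

def ms_ready(smiles):
    # 1. Mixture separation: components of a multi-component record
    #    are dot-separated in SMILES.
    components = smiles.split(".")
    # 2. Desalting: drop known counterion/solvent fragments.
    components = [c for c in components if c not in SALT_FRAGMENTS]
    # 3. Strip stereochemistry markers (@ for tetrahedral centres,
    #    / and \ for double-bond geometry).
    return [c.replace("@", "").replace("/", "").replace("\\", "")
            for c in components]

# Alanine hydrochloride: the salt fragment is dropped and the
# stereocentre flattened to the achiral form seen by HRMS.
print(ms_ready("C[C@H](N)C(=O)O.Cl"))  # ['C[CH](N)C(=O)O']
```

Real implementations also neutralize charges and canonicalize the result, steps that genuinely require a cheminformatics toolkit.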
The US EPA’s CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is a freely available web-based application providing access to data for ~900,000 chemical substances, the majority of these represented as chemical structures. The Dashboard also provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders and, in particular, a list of hundreds of disinfection by-product (DBP) chemicals reported in the literature and detected in the laboratory using mass spectrometric techniques. Many of these chemicals are explicit chemical structures that have been confirmed using purchased or synthesized reference standards. However, some of these chemicals are ambiguous in nature: no explicit positional isomer can be defined, but the formula and mass spectral fragmentation are sufficient to define a class of chemicals (e.g. dichlorophenol). Such chemicals may be represented with ambiguous chemical structure forms, so-called Markush structures, and mapped to the individual class members. Chemical records accessible via the Dashboard can include a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, product use information extracted from safety data sheets, and integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. Since DBP chemicals are primarily identified using mass spectrometric techniques, specific search types have been developed to directly support the non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within the database. This presentation will provide an overview of the Dashboard, the ongoing expansion of the DBP chemical list and the specific functionality supporting identification of DBPs by mass spectrometry.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
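As an illustration of the class-level ambiguity described in the abstract above, a formula-defined class such as dichlorophenol can be expanded into its individual members by enumerating chlorine placements on the phenol ring and removing duplicates under the ring's mirror symmetry. This is a minimal combinatorial sketch, not the Dashboard's Markush machinery:

```python
from itertools import combinations

# Enumerate the distinct dichlorophenol positional isomers.
# Phenol ring positions 2-6 can carry Cl (position 1 bears the OH);
# the ring's mirror symmetry maps position p to 8 - p (2<->6, 3<->5,
# 4 fixed), so mirrored placements describe the same isomer.

def dichlorophenol_isomers():
    isomers = set()
    for a, b in combinations(range(2, 7), 2):
        mirrored = tuple(sorted((8 - a, 8 - b)))
        # keep one canonical representative per symmetry class
        isomers.add(min((a, b), mirrored))
    return sorted(isomers)

print(dichlorophenol_isomers())
# [(2, 3), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5)]
```

The six tuples correspond to the six real dichlorophenols (2,3-; 2,4-; 2,5-; 2,6-; 3,4-; 3,5-), the individual class members to which an ambiguous record could be mapped.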
The Open PHACTS project delivers an online platform integrating a wide variety of data from across chemistry and the life sciences and an ecosystem of tools and services to query this data in support of pharmacological research, turning the semantic web from a research project into something that can be used by practising medicinal chemists in both academia and industry. In the summer of 2015 it was the first winner of the European Linked Data Award. At the Royal Society of Chemistry we have provided the chemical underpinnings to this system and in this talk we review its development over the past five years. We cover both our early work on semantic modelling of chemistry data for the Open PHACTS triplestore and more recent work building an all-purpose data platform, for which the Open PHACTS data has been an important test case: what has worked well, what's missing and where this is likely to go in future.
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011 at the Linnean Society, Burlington House, London, run by the RSC CICAG group.
An Integrated Approach To Drug Discovery Using Parallel Synthesis (Graham Smith)
An Integrated Approach To Drug Discovery Using Parallel Synthesis: the history of parallel chemistry for lead discovery at Pfizer Sandwich, from beginning to outsourcing.
Identification of unknowns in mass spectrometry based non-targeted analyses (NTA) requires the integration of complementary pieces of data to arrive at a confident, consensus structure. Researchers use chemical reference databases, spectral matching, fragment prediction tools, retention time prediction tools, and a variety of other data to arrive at tentative, probable, and, where possible, confirmed identifications. With the diverse, robust data contained within the US EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov), the goal of this research is to identify and implement a harmonized identification tool and workflow using previously generated chemistry data. Data have been compiled from product use, functional use prediction models, environmental media occurrence prediction models, and PubMed references, among other sources. We will report on our development of a visualization tool whereby users can visualize the relative contribution of identification-based metrics across a list of candidate structures and identify those with the greatest likelihood of occurrence. These data and visualization tools support NTA identification via the Dashboard and demonstrate an open, accessible tool for all users of HRMS data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
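Combining heterogeneous identification metrics of the kind listed above into a consensus score can be sketched as a weighted sum of values normalized across the candidate list. The metric names and weights below are illustrative assumptions, not the Dashboard's actual scoring scheme:

```python
# Illustrative sketch of metadata-based candidate ranking for NTA.
# Metric names and weights are assumptions chosen for demonstration.

WEIGHTS = {"data_sources": 0.4, "pubmed_refs": 0.4, "product_uses": 0.2}

def rank_candidates(candidates):
    # Normalize each metric to [0, 1] across the candidate list so that
    # metrics on very different scales (e.g. source counts vs literature
    # counts) contribute comparably, then combine with a weighted sum.
    maxima = {m: max(c[m] for c in candidates) or 1 for m in WEIGHTS}
    def score(c):
        return sum(w * c[m] / maxima[m] for m, w in WEIGHTS.items())
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"name": "candidate A", "data_sources": 12, "pubmed_refs": 340, "product_uses": 2},
    {"name": "candidate B", "data_sources": 3,  "pubmed_refs": 10,  "product_uses": 5},
]
print([c["name"] for c in rank_candidates(candidates)])
# ['candidate A', 'candidate B']
```

Normalizing before weighting is the key design choice: without it, a metric with large raw counts (such as literature references) would dominate the score regardless of its assigned weight.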
Building linked data large-scale chemistry platform - challenges, lessons and... (Valery Tkachenko)
Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small in-house built proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements for database design and system architecture as well as the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is being caused by the linked nature of modern resources - individual databases are becoming nodes and hubs of a huge and truly distributed web of knowledge. This change has important aspects such as data and format standards, interoperability, provenance, security, quality control and metainformation standards.
ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control, introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to realize that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will talk about the challenges associated with building modern public and private chemical databases, as well as lessons that we have learned from our past and present experience. We will also talk about solutions for some common problems.
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r... (Dr. Haxel Consult)
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where disclosed compounds and associated data not only exceed those published in papers several-fold and surface years earlier, but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
• Outline the statistics of patent chemistry in various open sources
• Introduce a spectrum of open resources and tools
• Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
• Cover aspects of medicinal chemistry patent mining
• Include hands-on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
Looking at chemistry - protein - papers connectivity in ELIXIR (Chris Southan)
This is a poster for the UK ELIXIR meeting in Birmingham, UK, Nov 2018. It is the summary of a blog post, https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html, that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources.
This presentation was made to the University of North Carolina in Chapel Hill on 9/20/21. The presentation gave a general introduction to cheminformatics before covering how to navigate the Dashboard:
• An introduction to the dashboard
• Substances vs structures
• Structure formats for data exchange and connectivity (SMILES, InChIs, molfiles)
• Identifiers – CASRN, chemical names, systematic names
• Data curation approaches: substance-structure ambiguity
• ChemReg: substance registration
• Data gathering for systematic reviews
• Curated lists
• Properties/Fate and Transport
• Access to Exposure Data
• Hazard data in the dashboard – ToxVal data (sourced from >40 databases, >50,000 chemicals, >900,000 data points)
• The Executive Summary of data
• Single chemical searches vs Batch searches
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases within the DSSTox project, and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for ~760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectrometry non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders, including, for example, scientists interested in per- and polyfluoroalkyl substances (PFAS). Added lists include those sourced from the European Union as well as lists developed in-house, now containing thousands of chemicals. A procured testing library of hundreds of PFAS chemicals annotated into chemical categories has been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and links to the open literature. This presentation will provide an overview of the dashboard, the developing library of PFAS chemicals and associated categorization, and new physicochemical property and environmental fate and transport QSAR prediction models developed for these chemicals.
The application of the dashboard to support mass spectrometry non-targeted analysis studies for the identification of PFAS chemicals will also be reviewed. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The construction of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago. Specific examples of quality issues for the EPISuite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS registry number for single records, the inability to convert the SMILES strings into chemical structures, hypervalency in the chemical structures and the absence of stereochemistry for thousands of data records. Relative to the era of EPISuite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. This includes approaches to cross-validate chemical structure representations (e.g. molfile and SMILES) against identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices and to examine to what extent efforts into the development of large high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
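One identifier check of the kind described above can be fully automated: CAS Registry Numbers carry a built-in check digit, so mistyped or corrupted CASRNs in a dataset can be flagged without any external lookup. A minimal implementation of the published CAS checksum rule:

```python
import re

# Validate a CAS Registry Number via its check digit: each digit
# (excluding the check digit) is multiplied by its position counted
# from the right, and the sum modulo 10 must equal the check digit.

def valid_casrn(casrn):
    # CASRN format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit
    if not re.fullmatch(r"\d{2,7}-\d{2}-\d", casrn):
        return False
    digits = casrn.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * i for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check

print(valid_casrn("7732-18-5"))  # water: True
print(valid_casrn("7732-18-4"))  # corrupted check digit: False
```

A check like this only catches malformed registry numbers; detecting a *valid* CASRN attached to the wrong structure still requires the cross-validation against names and structure representations described in the abstract.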
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are utilized to identify emerging contaminants and chemical signatures of interest detected in various media. At the US Environmental Protection Agency, the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is an open chemistry resource and web-based application containing data for ~900,000 substances that supports non-targeted and suspect screening analyses. Searching functionality includes identifier searches (e.g. systematic names, trade names and CAS Registry Numbers) and mass- and formula-based searches, and prototype developments include combined substructure-mass/formula searches and searching experimental mass spectral data against predicted fragmentation spectra. A specific type of data mapping in the database uses “MS-Ready” structures: all registered substances are processed to separate multi-component chemicals into their individual components, remove stereochemical bonds, and desalt and neutralize the structures. This MS-Ready processing supports batch searching using either masses or formulae to identify candidate chemicals and their mapped substances. A number of chemical lists (https://comptox.epa.gov/dashboard/chemical_lists) have also been developed to support the identification of chemicals related to agrochemistry, specifically pesticides (both active and inert constituents), insecticides and their metabolites and environmental breakdown products. This presentation will provide an overview of how the CompTox Chemicals Dashboard supports mass spectrometry based structure identification and non-targeted analysis of chemicals in agrochemistry. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
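At its core, the formula- and mass-based batch searching described above reduces to computing monoisotopic masses and matching an observed mass within a tolerance window. The element table below is truncated to a few common elements and the candidate list is illustrative; a real search runs against the full MS-Ready database:

```python
import re

# Sketch of a formula-based batch search: compute monoisotopic masses
# from molecular formulae and match an observed mass within a tolerance.
# The element table covers only a few common elements for illustration.

MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074,
                "O": 15.994915, "S": 31.972071, "Cl": 34.968853}

def monoisotopic_mass(formula):
    # Parse element symbols with optional counts, e.g. "C6H4Cl2O".
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * int(count or 1)
    return mass

def batch_search(observed_mass, candidates, tol=0.005):
    # Return candidate names whose mass lies within +/- tol Da.
    return [name for name, formula in candidates
            if abs(monoisotopic_mass(formula) - observed_mass) <= tol]

candidates = [("phenol", "C6H6O"), ("aniline", "C6H7N"), ("benzene", "C6H6")]
print(batch_search(94.0419, candidates))  # ['phenol']
```

In practice the tolerance would be expressed in ppm rather than absolute daltons, and the observed mass would first be corrected for the adduct (e.g. [M+H]+) before matching.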
The internet has changed the way we access chemistry data, enabling data to proliferate quickly and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with the numbers for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of the noise in the larger databases, their value becomes highly dependent on the specific application; an example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The development of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago and, specifically, on the PHYSPROP dataset used to train the EPISuite prediction models. This presentation will review our approaches to examining key datasets, the delivery of curated data and the development of machine-learning models for thirteen separate property endpoints of interest to environmental science. We will also review how these data will be made freely accessible to the community via a new “chemistry dashboard”. This abstract does not reflect U.S. EPA policy
The patent literature has historically been complex and inaccessible to searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form have allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner to deliver a turnkey patent informatics system with automatically extracted, and searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. This product was launched and licensed by a user community with a freemium business model. Latterly, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with the remit of data dissemination and sharing, and deep experience of biodata, including the large ChEMBL database of Structure Activity Relationship Data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
Presentation on the Chemical Analysis Metadata Platform (ChAMP) as a new project to characterize and organize metadata about chemical analysis methods. The project will develop an ontology, controlled vocabularies, and design rules
The Royal Society of Chemistry (RSC) is a major participant in providing access to chemistry related data via the web. As an internationally renowned society for the chemical sciences, a scientific publisher and the host of the ChemSpider database for the community, RSC continues to make dramatic strides in providing online access to data. ChemSpider provides access to over 30 million chemicals sourced from over 500 data suppliers and linked out to related information on the web. The platform is a crowdsourcing environment whereby members of the community can participate in validating and expanding the content of the database. With a set of application programming interfaces ChemSpider is used by various organizations and projects to serve up data for various purposes. These include structure identification for mass spectrometry instrument vendors, RSC databases such as the Marinlit natural products database and a European grant-based project from the Innovative Medicines Initiative fund. This presentation will provide an overview of various cheminformatics activities and projects that RSC is involved with to serve the medicinal chemistry community. This will include the Open PHACTS semantic web project, the PharmaSea project to identify new pharmaceutical leads from the ocean and the UK National Compound Collection to identify new lead compounds contained within PhD theses.
EPA’s National Center for Computational Toxicology is developing automated workflows for curating large databases and providing accurate linkages of data to chemical structures, exposure and hazard information. The data are being made available via the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard), a publicly accessible website providing access to data for almost 760,000 chemical substances, the majority of these represented as chemical structures. The web application delivers a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data as well as integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. In addition, several specific search types are in development to directly support the mass spectroscopy non-targeted screening community, who are generating important data for detecting and assessing environmental exposures to chemicals contained within DSSTox. The application provides access to segregated lists of chemicals that are of specific interests to relevant stakeholders including, for example, scientists interested in algal toxins and hydraulic fracturing chemicals. This presentation will provide an overview of the challenges associated with the curation of data from EPA’s December 2016 Hydraulic Fracturing Drinking Water Assessment Report that represented chemicals reported to be used in hydraulic fracturing fluids and those found in produced water. The data have been integrated into the dashboard with a number of resulting benefits: a searchable database of chemical properties, with hazard and exposure predictions, and open literature. The application of the dashboard to support mass spectrometry non-targeted analysis studies will also be reviewed. This abstract does not reflect U.S. EPA policy.
Presentation for Texas A&M Superfund Research Center virtual learning series, Big Data in Environmental Science and Toxicology. More details at https://superfund.tamu.edu/big-data-session-2-aug-18-2021/
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and ~1500 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are advancing the identification of emerging contaminants in environmental matrices, improving the means by which exposure analyses can be conducted. However, confidence in structure identification of unknowns in NTA presents challenges to analytical chemists. Structure identification requires integration of complementary data types such as reference databases, fragmentation prediction tools, and retention time prediction models. The goal of this research is to optimize and implement structure identification functionality within the US EPA’s CompTox Chemistry Dashboard, an open chemistry resource and web application containing data for ~760,000 substances. Rank-ordering the number of sources associated with chemical records within the Dashboard (Data Source Ranking) improves the identification of unknowns by bringing the most likely candidate structures to the top of a search results list. Database searching has been further optimized with the generation of MS-Ready Structures. MS-Ready structures are de-salted, stripped of stereochemistry, and mixture separated to replicate the form of a chemical observed via HRMS. Functionality to conduct batch searching of molecular formulae and monoisotopic masses was designed and released to improve searching efforts. Finally, a scoring-based identification scheme was developed, optimized, and surfaced via the Dashboard using multiple data streams contained within the database underlying the Dashboard. The scoring-based identification scheme improved the identification of unknowns over previous efforts using data source ranking alone. Combining these steps within an open chemistry resource provides a freely available software tool for structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The US EPA’s CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is a freely available web-based application providing access to data for ~900,000 chemical substances, the majority of these represented as chemical structures. The Dashboard also provides access to segregated lists of chemicals that are of specific interest to relevant stakeholders and, in particular, a list of hundreds of disinfection by-product (DBP) chemicals reported in the literature and detected in the laboratory using mass spectrometric techniques. Many of these chemicals are explicit structures that have been confirmed using purchased or synthesized reference standards. However, some of these chemicals are ambiguous in nature: no explicit positional isomer can be defined, but the formula and mass spectral fragmentation are sufficient to define a class of chemicals (e.g. dichlorophenol). Such chemicals may be represented with ambiguous chemical structure forms, so-called Markush structures, and mapped to the individual class members. Chemical records in the Dashboard can include a wide array of computed and measured physicochemical properties, in vitro high-throughput screening data and in vivo toxicity data, product use information extracted from safety data sheets, and integrated chemical linkages to a growing list of literature, toxicology, and analytical chemistry websites. Since DBP chemicals are primarily identified using mass spectrometric techniques, specific search types have been developed to directly support the non-targeted screening community, enabling cohesive workflows to support data generation for the detection and assessment of environmental exposures to chemicals contained within the database. This presentation will provide an overview of the Dashboard, the ongoing expansion of the DBP chemical list and specific functionality supporting identification of DBPs by mass spectrometry.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The Open PHACTS project delivers an online platform integrating a wide variety of data from across chemistry and the life sciences and an ecosystem of tools and services to query this data in support of pharmacological research, turning the semantic web from a research project into something that can be used by practising medicinal chemists in both academia and industry. In the summer of 2015 it was the first winner of the European Linked Data Award. At the Royal Society of Chemistry we have provided the chemical underpinnings to this system and in this talk we review its development over the past five years. We cover both our early work on semantic modelling of chemistry data for the Open PHACTS triplestore and more recent work building an all-purpose data platform, for which the Open PHACTS data has been an important test case: what has worked well, what's missing and where this is likely to go in the future.
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011 at The Linnean Society, Burlington House, London, run by the RSC CICAG group.
An Integrated Approach To Drug Discovery Using Parallel Synthesis (Graham Smith)
An Integrated Approach To Drug Discovery Using Parallel Synthesis: the history of parallel chemistry for lead discovery at Pfizer Sandwich, from beginning to outsourcing.
Identification of unknowns in mass spectrometry-based non-targeted analyses (NTA) requires the integration of complementary pieces of data to arrive at a confident, consensus structure. Researchers use chemical reference databases, spectral matching, fragment prediction tools, retention time prediction tools, and a variety of other data to arrive at tentative, probable and, where possible, confirmed identifications. With the diverse, robust data contained within the US EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov), the goal of this research is to identify and implement a harmonized identification tool and workflow using previously generated chemistry data. Data have been compiled from product use, functional use prediction models, environmental media occurrence prediction models, and PubMed references, among other sources. We will report on our development of a visualization tool whereby users can visualize the relative contribution of identification-based metrics for a list of candidate structures and identify those with the greatest likelihood of occurrence. These data and visualization tools support NTA identification via the Dashboard and demonstrate an open, accessible tool for all users of HRMS data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
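A harmonized score over such data streams can be sketched as a weighted sum of normalized metrics per candidate structure; the field names and weights below are hypothetical illustrations, not the Dashboard's actual scheme.

```python
def rank_candidates(candidates, weights):
    """Rank candidate structures by a weighted sum of normalized
    (0-1) metadata metrics; missing metrics count as zero."""
    def score(c):
        return sum(w * c.get(k, 0.0) for k, w in weights.items())
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidates sharing one molecular formula
candidates = [
    {"name": "candidate_A", "data_sources": 0.9, "literature_refs": 0.2},
    {"name": "candidate_B", "data_sources": 0.5, "literature_refs": 0.9},
]
weights = {"data_sources": 0.6, "literature_refs": 0.4}
ranked = rank_candidates(candidates, weights)
```

Here candidate_B scores 0.66 against candidate_A's 0.62, so the literature-supported structure rises to the top even though it has fewer data sources; tuning the weights is exactly the optimization step the abstract describes.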
Building linked data large-scale chemistry platform - challenges, lessons and... (Valery Tkachenko)
Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small in-house built proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements for database design and system architecture, as well as requiring the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is being caused by the linked nature of modern resources: individual databases are becoming nodes and hubs of a huge and truly distributed web of knowledge. This change has important aspects such as data and format standards, interoperability, provenance, security, quality control and metainformation standards.
ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control by introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to realize that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will talk about challenges associated with building modern public and private chemical databases, as well as lessons that we have learned from our past and present experience. We will also talk about solutions for some common problems.
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r... (Dr. Haxel Consult)
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where disclosed compounds and associated data not only exceed those published in papers by several-fold and surface years earlier, but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
Cover aspects of medicinal chemistry patent mining
Include hands on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
Looking at chemistry - protein - papers connectivity in ELIXIR (Chris Southan)
This is a poster for the UK ELIXIR meeting in Birmingham UK, Nov 2018. It is the summary of a blog post https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources.
The open patent chemistry “big bang”: Implications, opportunities and caveats (Dr. Haxel Consult)
In 2012, after the first IBM deposition, few would have predicted that PubChem compounds that included patent-extracted structures would exceed 20 million within three years (i.e. 30% of the total). The current major open patent chemistry submitters (in size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. This “big bang” has a range of utilities and implications. Firstly, pharmaceutical companies must now integrate their exploitation of both public and commercial patent chemistry because capture is divergent. Secondly, the academic community and small companies can now patent-mine extensively without commercial sources. Thirdly, first-filings of most lead series and clinical candidates can now be tracked. Fourthly, drug targets in ChEMBL can be intersected with Structure Activity Relationship (SAR) data sets from patents, some of which are now target-mapped in other databases (doi:10.1016/j.ddtec.2014.12.001). However, while this patent chemistry “big bang” is generally welcomed by database users, there are significant caveats. In particular, both automated and manual extraction bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures (including isotopic analogues of approved drugs), c) an emerging trend of vendor “patent picking” for non-stocked compounds, d) the fact that 85% of public patent chemistry has no biological data links and e) that extractions from documents do not directly indicate IP status. These problems and some partial solutions will be discussed.
Learn how large-scale normalized data empowers the critical early phases of drug discovery.
To address the core concerns about data quality, comprehensiveness and comparability, the Reaxys product team has developed a completely new repository for bioactivity information. Reaxys Medicinal Chemistry stands as a unique source for normalized data on in vitro efficacy, in vivo animal models, compound metabolism, pharmacokinetics and toxicity. This presentation takes a look at how this approach to data supports critical early discovery methods such as in silico screening and target profiling.
Next Generation Data and Opportunities for Clinical Pharmacologists (Philip Bourne)
Presentation at the Pre-meeting Workshop Next-Generation Clinical Pharmacology: Integrating Systems Pharmacology, Data-Driven Therapeutics, and Personalized Medicine. American Society for Clinical Pharmacology and Therapeutics Annual Meeting Atlanta GA March 18, 2014.
PubChem as a resource for chemical information training (Sunghwan Kim)
Presented at the 257th American Chemical Society (ACS) National Meeting in Orlando, FL (March 31, 2019). [CINF 13]
==== Abstract ====
Libraries at many large academic institutions provide chemical information training programs for students. However, these programs are based on commercial chemical information resources, which come with non-trivial subscription fees. These fees are often too expensive for small organizations, including many primarily undergraduate institutions (PUIs) and community colleges (CCs). This leads to disparities in access to chemical information, as well as in learning opportunities, among students. This issue may be addressed at least in part by developing free online training programs based on public chemical databases, such as PubChem (https://pubchem.ncbi.nlm.nih.gov). PubChem has great potential as an online resource for chemical education, but it also has important issues that students and teachers should keep in mind, such as data accuracy, data provenance, structure standardization, terminologies and so on. In this presentation, we will discuss various aspects of PubChem as a resource for chemical information training.
Presented to David Gloriam's Group, Copenhagen, Feb 2020
**********************************
The theme will be presented from the perspective of both past involvement in peptide curation in the Guide to Pharmacology (GtoPdb) and current searching for bioactive peptides in the wider ecosystem that includes ChEMBL and PubChem. The core problem is that peptides hang in limbo between bioinformatics (BLAST) and cheminformatics (Tanimoto), neither of which provides optimal searching. Curating peptides in GtoPdb presents many challenges, including mapping endogenous peptides to Swiss-Prot cleavage annotations. For synthetic peptides, equivocal specification of modifications and exact positions of radiolabels are also problematic. However, target-mapped, citation-supported quantitative binding parameters are curated where possible. For those peptides falling below the PubChem CID SMILES limit of approximately 70 residues, GtoPdb has been using Sugar and Splice from NextMove Software to convert them into CIDs. Specific problems associated with finding bioactive peptides in databases will be outlined.
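The cheminformatics half of that limbo rests on fingerprint comparison: a Tanimoto coefficient over bit sets, which ignores residue order and so discriminates poorly between large, similar peptides. A minimal sketch of the coefficient itself (toy bit indices, not a real fingerprinting method):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two fingerprint
    bit sets: shared on-bits over total on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy fingerprints as sets of "on" bit indices
assert tanimoto({1, 2, 3, 4}, {2, 3, 4, 5}) == 0.6
```

Two peptides differing only by swapped residues can yield near-identical fingerprints, whereas a BLAST-style alignment would readily separate them; hence neither approach alone searches peptides optimally.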
Vicissitudes of target validation for BACE1 and BACE2 (Chris Southan)
Introduction/Background & Aims
The beta-site amyloid precursor protein (APP) cleaving enzyme 1 (BACE1) was implicated as a drug target for Alzheimer's Disease (AD) back in 1999. In 2011, the paralogue, BACE2, became a newly proposed target for type II diabetes (T2DM), having been reported to be the TMEM27 secretase regulating pancreatic beta-cell function [1]. By 2019 the accumulated evidence, including a swathe of failed clinical trials for BACE1 inhibitors, had produced a de facto de-validation of both targets in both diseases. As a learning exercise, the series of events leading up to this is reviewed here.
Method/Summary of work
Basic information about these two targets and the lead compounds against them were sourced via the IUPHAR/BPS Guide to Pharmacology (GtoPdb) as Target ids: 2330 and 2331, for BACE1 and 2, respectively. This was consolidated by a literature and patent review as well as following them in other databases. The most recent information on clinical trials was sourced from press releases.
Results/Discussion
GtoPdb annotates 24 lead compounds against BACE1 and 12 against BACE2. The corresponding counts mapped to these targets in ChEMBL are 8741 and 1377, making BACE1 one of the most actively pursued enzyme targets ever. Notwithstanding the massive global effort, during 2018 Merck’s verubecestat and J&J’s atabecestat BACE1 inhibitors not only failed their Phase III endpoints but even appeared to worsen cognition in prodromal patients. In 2019 Amgen/Novartis stopped Phase II/III trials of umibecestat, which also showed more cognitive decline in the treatment group compared to controls. BACE2 presented an anomalous situation in several ways. By 2016 both Novartis and Amgen had declared their inability to reproduce the TMEM27 secretase turnover reported in 2011. Nevertheless, Novartis and other companies have published patents on BACE2-specific inhibitors over several years, and paradoxically verubecestat is more potent against BACE2 than BACE1 but was never tested for glucose lowering. Equally puzzling is that one academic group is still publishing BACE2 inhibitors for T2D even post de-validation. One thing both targets have in common is the complete absence of genetic support from genome-wide disease association studies, but this warning sign went unheeded.
Conclusions
The massive waste of resources on the pursuit of BACE1 as an AD target over the last two decades is catastrophic. This tale of de-validation is compounded for this paralogous pair of enzymes by the fact that the original evidence for BACE2 as a T2D target was eventually refuted. The story of these targets highlights a range of crucial pharmacological pitfalls that must be avoided in the future.
Reference(s)
[1] Southan C, Hancock J.M. (2013) A tale of two drug targets: the evolutionary history of BACE1 and BACE2. Front Genet. 4:293.
In silico 360 Analysis for Drug Development (Chris Southan)
Introduction:
Consequent to a memorandum of understanding between the Karolinska Institutet and the International Union of Basic and Clinical Pharmacology (IUPHAR) in 2018, a report on academic drug development, including guidelines (ADEV), has been drafted [1]. As part of this exercise, we conceived a triage for comprehensive informatics profiling around the compound-target-disease axis. We have termed this “in silico 360” (INS360), the aim of which is to support ADEV teams, since they may lack either the internal expertise or the external support to do this on their own. Indeed, some past SciLifeLab Drug Discovery and Development Platform projects had been halted because of overlooked competitive impingements or insufficient target validation evidence.
Methods:
We assessed the current database landscape, mostly public but including commercial, for potential utility for INS360. We were guided primarily by content coverage, usability, and reputation. We also explored some open property prediction resources for assay interference and toxicological inferences.
Results:
As a first-stop-shop, we selected the IUPHAR/BPS Guide to PHARMACOLOGY with ~900 ligand-target relationships captured via expert curation of journal papers. Moving up in scale, we evaluated ChEMBL at 1.8 million compounds with 1.1 million assay descriptions and 7,000 targets. With yet another jump we could search the patent corpus with 18 million extracted compounds in SureChEMBL. We explored PubChem, which integrates these three with over 500 other sources linked to 96 million compounds, BioAssay results and connectivity into the NCBI Entrez system. The final jump in scale for document-to-chemistry navigation was represented by SciFinder with 155 million structures. On the target side, 360-exploration needs to encompass literature, structure, genetic variation, splicing, interactions, and disease pathways. From their UniProt links, both GtoPdb and ChEMBL provide these entry points. Navigating genetic association data in support of target validation was enabled by the OpenTargets portal and the GWAS Catalog. We also found servers that could produce prediction scores from chemical structures for a range of features important for de-risking development.
Conclusion:
This work scoped out initial resource choices for the INS360. We propose that not only ADEV operations but essentially any pharmacology research team has much to gain from this approach and many potential pitfalls can consequently be avoided when approaching key checkpoints, such as preparing a publication. However, support may be needed for both institutions and teams to get the best out of these complex and feature-rich databases.
[1] Southan C, (2019) Towards Academic Drug Development Guidelines, ChemRxiv pre-print no. 8869574
Will the correct BACE ORFs please stand up? (Chris Southan)
BACE1 and BACE2 are protease targets for Alzheimer's and diabetes, respectively, but their validation is now questioned
Phylogenetic analysis can add functional insights
This came up against two key problems:
A surprising prevalence of incorrect protein sequences predicted from genomes
Many BACE1 and BACE2 orthologues had truncation and/or indel errors.
Key phylogenetic representative genomes are languishing in an unfinished state
Some options for amelioration of these problems will be described
An update on the evolution of these enzymes will be shown
Look for new and potentially useful human 5HT2A-directed small-molecule chemistry surfaced since the last meeting; check for compounds with 5HT2A as the primary target but also combined inhibitors; poll round the key databases, literature and patents. Searching challenges arise from synonym soup, complex cross-reactivities (see PMID 29679900), in vitro data gaps and in vivo polypharmacology.
Quality and noise in big chemistry databases (Chris Southan)
Presented at Aug 2019 ACS by Antony Williams. Abstract: The internet has changed the way we access chemistry data, providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with numbers for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of the noise in the larger databases, their value becomes highly dependent on the specific application; one example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem.
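The overlap and circularity between sources can be probed by collapsing deposited records onto a shared structure key; the sketch below assumes records already carry such a key (InChIKeys are commonly used for this) and its field names are hypothetical.

```python
def merge_by_structure_key(records):
    """Collapse deposited records that share a structure key,
    accumulating the list of contributing sources per structure."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(rec["key"], {"sources": []})
        entry["sources"].append(rec["source"])
    return merged

# Hypothetical depositions: two sources deposit the same structure
records = [
    {"key": "AAA", "source": "vendor_X"},
    {"key": "AAA", "source": "curated_Y"},
    {"key": "BBB", "source": "vendor_X"},
]
merged = merge_by_structure_key(records)
```

Structures seen only from a single uncurated vendor are then easy to flag for closer scrutiny, whereas agreement across independent curated sources raises confidence.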
Poster for the World Congress of Pharmacology 2018, Kyoto
Introduction: The pharmacological literature and patents connect compound structures to their bioactivity. However, entombing these relationships for millions of compounds among millions of PDFs is acknowledged as massively problematic. The situation is ameliorated by resources that extract the entity and data relationships the authors and inventors put “in” to their PDFs back “out” into structured database records. The IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb) has been doing this by stringent curation of ligands and their quantitative activity against protein targets [1]. Our citations are submitted to PubChem (PC), who then link to PubMed (PM) [2]. This study presents an overview of this connectivity.
Methods: For GtoPdb entries in PC Substance we used the PC interface to count our submitted PM links. This gives the PC > PM mapping counts from which we analysed the PM links. We then performed reciprocal analyses (i.e. PM > PC) by selecting PM sets, and compared two journals by counting structure links by year and source.
Results: Of the 8988 GtoPdb-submitted ligand substances in PC (release 2017.5), 7309 are linked to 8980 PM entries. Of the 7309, there are 5632 links to chemical structures in PC, the rest being antibodies and larger peptides. From the 8980 PMIDs, the Journal of Medicinal Chemistry (JMC) accounted for 1003, making it our most frequently cited primary source of structure-to-activity mappings. For the British Journal of Pharmacology (BJP), most of the 345 cross-references were development compounds. Further analysis showed that from 2014 to 2017 the BJP-to-PC links of ~30 structures per year are mostly from GtoPdb and the Comparative Toxicogenomics Database. However, going back to 2010-12, this increased to 500-800 connections, mainly derived from the IBM automated chemical extraction from abstracts. A similar pattern was observed for JMC.
Conclusion: Navigation between documents and databases is an essential competence for pharmacologists and drug discovery but the NCBI Entrez system is daunting. GtoPdb is a major contributor of high-quality links and provides a first-stop to guide users into the PC/PM systems. However, our results indicated potentially serious specificity issues with automated chemistry-to-journal linking from non-GtoPdb sources.
References: [1] Harding et al. (2018). Nucl. Acids Res. 46 (Database Issue), doi: 10.1093/nar/gkx1121.
GtoPdb: A resource for cell-based perturbogens (Chris Southan)
Poster for ELRIG, Mölndal, 11/12 May 2017.
This poster will also be presented at BioITWorld, Boston, May 23-25
A resource for the selection and interpretation of cell-based perturbogens: the IUPHAR/BPS Guide to PHARMACOLOGY
Christopher Southan, Elena Faccenda, Joanna L. Sharman, Adam J. Pawson, Simon D. Harding, Jamie A Davies,
Translational research requires the integration of the in vitro molecular mechanisms of action (mmoa) of small molecules, cell-based screening studies, animal models and eventual clinical trials. The International Union of Basic and Clinical Pharmacology (IUPHAR)/British Pharmacological Society (BPS) database, GtoPdb http://www.guidetopharmacology.org/, provides expert-annotated molecular interactions between endogenous receptor ligands, probes, lead compounds, clinical drugs and their protein targets. It thus provides a core set of quantitative pharmacological relationships that can be interrogated for many purposes, including by those running cell-based screens, not only during result interpretation but also to identify key compounds for scoping and consolidation experiments. As described in [1], GtoPdb is populated by records extracted from pharmacology and medicinal chemistry journals, and released quarterly. Quality is ensured by curatorial stringency and our unique model of content selection based on recommendations from IUPHAR target class subcommittees of international experts collaborating with the in-house curators. The database now has over 14,000 binding values (mainly IC50, Ki or Kd) between 8000 ligands and 1500 human proteins (mainly primary but also secondary off-target interactions), representing ~7% of the proteome as druggable. Our coverage is complementary to other sources. For example, of the 6565 structures we recently submitted to PubChem as CIDs, 5206 were not in DrugBank and 1535 were not in ChEMBL. This includes recommended tool compounds with relatively defined mmoa (including 110 from the Structural Genomics Consortium Probe Portal). We also have 75% overlap with vendors for procurement and 80% with patent extractions, which in many cases allow mapping to SAR data sets from first-filings (some of which we point to). In a cell screening context, 1254 of our targets intersect with proteins in the Reactome pathway database.
This is one way to select chemical perturbation points that could be detected by assay readouts. From Nov 2015 we have been funded by the Wellcome Trust to extend into immunopharmacology (within the existing database schema), which is now driving overall GtoPdb content expansion. Parties engaged in cell-based assays that use, or could use, the compounds we cover are encouraged to use GtoPdb, contact us with queries or possible analogue expansions, and/or alert us to prospective new content. [1] Southan C et al. (2016) Nucleic Acids Res. 44(D1):D1054-68, PMID: 26464438
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventative health solutions, this industry is expanding quickly. Product formulation innovations and the use of cutting-edge technology for customized nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Richard's entangled adventures in wonderland (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... (University of Maribor)
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
(May 29th, 2024) Advancements in Intravital Microscopy - Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enable researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
What are greenhouse gases and how many gases affect the Earth? (moosaasad1975)
What are greenhouse gases, how do they affect the Earth and its environment, what is the future of the environment and the Earth, and how are weather and climate affected?
1. Why is connecting chemistry-to-biology in open sources more difficult than it should be?
Presented at UCL School of Pharmacy, London, 13 June 2019
Hosted by Professor Matthew Todd
Christopher Southan
2. Abstract
Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
4. The core of the problem
“We have spent millions putting chemistry into PDFs but now we are spending more millions taking it back out” (Anon)
5. The chemistry <-> biology join
• Chemistry that does something significant in vitro, in cellulo, in vivo or in clinic
• Major bioactivity domains from drug discovery, chemical biology and ecology
• Some cases not adequately covered by this simple relationship chain (e.g. heparin
as indirect inhibitor of thrombin or where P could be a bacteria or protozoan)
• The majority of data still primarily archived in papers and patent documents
• Upper limit statistics for quality publications essentially unknown
D – A – R – C – P
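As an illustration only, the D-A-R-C-P relationship chain can be sketched as a single linked record. The field names and values below are invented for this sketch; they are not the schema of GtoPdb, ChEMBL, or any other database.

```python
from dataclasses import dataclass

# Hypothetical, minimal record type for the D-A-R-C-P chain:
# Document -> Assay -> Result -> Compound -> Protein (target).
# Field names and example values are placeholders, not a real schema.

@dataclass
class Darcp:
    document: str   # e.g. a PubMed ID or patent number
    assay: str      # assay description or BioAssay identifier
    result: float   # activity value, e.g. a pIC50
    compound: str   # compound identifier, e.g. a SMILES string or CID
    protein: str    # target identifier, e.g. a UniProt accession

example = Darcp(
    document="PMID (placeholder)",
    assay="enzyme inhibition assay (placeholder)",
    result=6.5,      # dummy pIC50 value
    compound="CCO",  # dummy SMILES
    protein="UniProt accession (placeholder)",
)
```

Cases such as heparin's indirect inhibition of thrombin, or a whole-organism target, do not fit this flat shape and would need a richer model.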
6. So how much disinterred chemistry is out there?
8. Unsung Heroes
Expert extraction of D-A-R-C-P by biocurators is hard for many reasons, including:
• Poor continuity of funding and career support
• Entity disambiguation challenges
• Unintentional obfuscation, ambiguity and errors by authors (and occasionally deliberate obfuscation by patent applicants)
• Difficult to capture nuances and complexities of molecular mechanisms of
action (e.g. prodrugs or no molecular target)
• Even primary activity parameters (IC50, Ki, Kd) have ~ 10-fold variation
between publications for nominally the same assays
• Judging the quality and potential reproducibility of the publications selected
for extraction
• Publisher guidelines are only slowly beginning to address the above
• Authors' engagement with assay and target ontologies is limited
9. Disinterment from the PDF tomb (I)
Image extraction > structure
• Real chemists sketch images in a jiffy
• The rest of us can use OSRA: the Optical Structure Recognition Application
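A minimal sketch of driving OSRA from a script, assuming OSRA is installed and on the PATH. By default OSRA prints recognised structures as SMILES; check `osra --help` for your version's options before relying on this.

```python
import shutil
import subprocess

def osra_command(image_path: str) -> list[str]:
    # Basic invocation; real use may add options (e.g. output format).
    return ["osra", image_path]

def image_to_smiles(image_path: str) -> str:
    """Run OSRA on a structure image and return its text output
    (one SMILES per recognised structure, by default)."""
    if shutil.which("osra") is None:
        raise RuntimeError("osra not found on PATH")
    out = subprocess.run(osra_command(image_path),
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()
```

The returned SMILES is a starting point to edit in a structure editor, not a guaranteed-correct answer.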
11. Commercial biocuration of D-A-R-C-P
Excelra (formerly GVKBIO)
GOSTAR stats from 2015
• 1.3 million cpds from 112K papers (~15 per paper)
• 3.5 million cpds from 70K patents (~50 per patent)
• 3,882 human targets
17. Recent large-scale chem < > doc PubChem submissions
• Generally a good thing but with caveats
• Difficult to automate filtration to identify the “aboutness” of key compounds
• Issues with indexing of non-PubMed, DOI-only journal papers
• Quality concerns with CNER chemistry extraction
• Introduces another parallel document <> structure mapping system into PubChem
18. Reciprocal links > virtuous circles (I)
• GtoPdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
19. Reciprocal links > virtuous circles (II)
• GtoMdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
29. Conclusions
• The bioscience community (including big-data miners) still has its collective feet nailed to the floor by the five-decade backlog of scientifically valuable bioactive chemistry relationships entombed in PDF papers and patents
• Biocuration of D-A-R-C-P makes a crucial contribution, but at limited scale
• Automated entity extraction is advancing but remains well behind the specificity of mechanistic biocuration, and is publisher-constrained
• The existence of several parallel document <> chemistry systems (e.g. MeSH, IBM, ChEMBL, EPMC, Springer Nature, Thieme, Wikidata) is enabling but also confusing
• The spread of Open Science ELNs is good to see, but findability, searchability and database submissions still need to be optimised
• The need remains to facilitate a direct flow of published author-specified bioactive chemistry (including from preprints) to databases (even if the papers are FAIR)
30. Proposed core of the solution
“Mandating authors to explicitly connect chemical structures to
their experimental bioactivity results in a form (extrinsic to PDF)
that is FAIR, structured, includes metadata, machine readable,
ontologised, transferable to open database records and
reciprocally linked to their publications” (Southan 2019)
• This is, of course, a counsel of perfection
• In essence, authors should become biocurators
• Currently only a few papers with data sets submitted to PubChem BioAssay
by authors would conform
• Has been technically feasible for at least a decade
• Impediments are thus sociological and publishing models
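As a sketch of what an author-supplied, machine-readable deposition might look like, the record below uses invented keys and dummy values; PubChem BioAssay and other deposition systems each define their own submission formats, so this is illustrative only.

```python
import json

# Hypothetical author-deposition record connecting a document, an
# assay, a result, a compound and a target. Keys are invented for
# illustration, not any deposition system's actual schema.
deposition = {
    "document": {"doi": None, "preprint": False},     # fill with real DOI
    "assay": {"description": "enzyme inhibition", "ontology_term": None},
    "results": [
        {"compound_smiles": "CCO",   # dummy SMILES
         "parameter": "IC50",
         "value": 1.5,
         "units": "uM"},
    ],
    "target": {"uniprot": None, "name": "example target"},
}

record_json = json.dumps(deposition, indent=2)
```

The point is the shape, not the keys: structured, with metadata, machine readable, and transferable to an open database record alongside the publication.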
• The simplest of starting points: at least the press release had a structure diagram
• OSRA provides a good starting point to edit and obtain SMILES
• The structure does not have to be exactly right, because a database similarity match will show what it should have been
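That approximate-structure similarity check can be sketched against PubChem's PUG REST service. This builds the query URL for the 2D-similarity route as documented for PUG REST; verify the endpoint name and `Threshold` parameter against the current PubChem documentation before relying on it.

```python
from urllib.parse import quote

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def similarity_url(smiles: str, threshold: int = 90) -> str:
    """Build a PUG REST 2D-similarity query URL for a SMILES string.

    Fetching this URL (e.g. with urllib or requests) should return the
    CIDs of structures similar to the input at the given threshold.
    """
    return (f"{PUG}/compound/fastsimilarity_2d/smiles/"
            f"{quote(smiles, safe='')}/cids/JSON?Threshold={threshold}")
```

Even an imperfect SMILES from OSRA, fed through a query like this, is usually enough to surface the structure the document actually meant.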