This document provides an overview of connectivity between chemistry, biology, and published documents. It discusses the challenges of extracting this information ("D-A-R-C-P") from publications and patents. While some commercial and open-source efforts curate this data, most of it remains buried in documents. Automated extraction has limitations compared to expert curation. The document argues that authors should directly connect their results to databases to improve the flow of information.
Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
The patent literature has historically been complex and inaccessible to the searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form has allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner, to deliver a turnkey patent informatics system with automatically extracted, searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. This product was launched and licensed to a user community under a freemium business model. Latterly, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc.). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with the remit of data dissemination and sharing, and deep experience of biodata, including the large ChEMBL database of structure-activity relationship data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
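The "tournament of methods" idea can be sketched as running several independent converters on the same input and trusting only an answer that a majority of them agree on. A minimal sketch, in which the converter functions are hypothetical stand-ins for real name-to-structure or image-to-structure tools:

```python
from collections import Counter

def tournament(name, converters):
    """Run several independent name-to-structure converters and keep
    the structure most of them agree on (majority vote)."""
    candidates = [conv(name) for conv in converters]
    candidates = [c for c in candidates if c is not None]  # drop failed conversions
    if not candidates:
        return None
    structure, votes = Counter(candidates).most_common(1)[0]
    # Require agreement from at least two methods before trusting the result.
    return structure if votes >= 2 else None

# Toy converters returning SMILES strings; a real pipeline would wrap
# actual name-to-structure engines or optical structure recognition tools.
conv_a = lambda name: "CCO" if name == "ethanol" else None
conv_b = lambda name: "CCO" if name == "ethanol" else None
conv_c = lambda name: "C(C)O" if name == "ethanol" else None

print(tournament("ethanol", [conv_a, conv_b, conv_c]))  # CCO
```

In practice the vote would operate on standardised structures (so that "CCO" and "C(C)O" count as the same answer), which is one reason compound standardisation matters in such a pipeline.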
The internet has changed the way we access chemistry data, providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with the counts for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of the noise in the larger databases, their value becomes highly dependent on the specific application; one example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Enabling HTS Hit Follow-up via Cheminformatics, File Enrichment, and Outsourcing (Graham Smith): the history of parallel chemistry for lead discovery at Pfizer Sandwich, from beginning to outsourcing.
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery? (Dr. Haxel Consult)
Fernando Huerta (RISE Bioscience & Materials, SE)
Alexander Minidis (Collaborative Drug Discovery - CDD VAULT, Sweden)
How much information do scientists need to design new potential drugs?
A thorough overview of public scientific information sources (open access) and methods to collect, process, analyse and visualize this information will be presented. A direct application of such freely available information, in conjunction with freeware, will be described in relation to the efforts of the scientific community to find effective medicines for the Zika virus.
Tens of thousands of chemicals are currently in commerce, and hundreds more are introduced every year. Because current chemical testing is resource intensive, only a small fraction of chemicals have been adequately evaluated for potential human health effects. New technologies and computational tools have shown promise for closing this knowledge gap. In the U.S. EPA’s ToxCast effort, the use of ~700 high-throughput in vitro assays has broadly characterized the biological activity and potential mechanisms of ~1,800 chemicals. Coupling the high-throughput in vitro assays with additional in vitro pharmacokinetic assays and in vitro-to-in vivo extrapolation modeling allows conversion of in vitro bioactive concentrations to estimates of an administered dose (mg/kg/day). High-throughput exposure models are generating exposure estimates based on key aspects of chemical production, fate, transport, and personal use. The path for incorporating new approach methods and technologies for prioritization and assessment of chemical alternatives poses multiple scientific challenges. These challenges include sufficient coverage of toxicological mechanisms to meaningfully interpret negative test results, development of increasingly relevant test systems, computational modeling to integrate experimental data, characterizing uncertainty, and efficient validation of the test systems and computational models. The presentation will cover progress at the U.S. EPA in the development and application of these technologies and approaches in evaluating alternatives and systematically addressing each of these challenges. This abstract does not necessarily reflect U.S. EPA policy.
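In its simplest linear form, the in vitro-to-in vivo extrapolation step described above reduces to dividing a bioactive concentration by the steady-state plasma concentration predicted for a unit dose. A minimal sketch with made-up numbers; the function name and values are illustrative, not EPA code:

```python
def administered_equivalent_dose(bioactive_conc_uM, css_uM_per_mg_kg_day):
    """Reverse dosimetry sketch: scale an in vitro bioactive concentration
    (uM) by the steady-state plasma concentration predicted for a
    1 mg/kg/day dose (uM), assuming linear pharmacokinetics.
    Returns an administered equivalent dose in mg/kg/day."""
    return bioactive_conc_uM / css_uM_per_mg_kg_day

# Illustrative numbers: an assay AC50 of 3 uM and a predicted Css of
# 1.5 uM per 1 mg/kg/day give an estimated bioactive dose of 2 mg/kg/day.
aed = administered_equivalent_dose(3.0, 1.5)
print(aed)  # 2.0
```

The resulting dose estimate can then be compared against exposure model predictions to prioritize chemicals whose estimated exposures approach bioactive doses.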
Access to both experimental and predicted environmental fate and transport data is facilitated by the US-EPA CompTox Chemicals Dashboard. Providing access to various types of data associated with ~900,000 chemical substances, the dashboard is a web-based application supporting computational toxicology research in environmental chemistry. When experimental physicochemical and fate and transport data are not available, QSAR models developed using curated datasets are used for the prediction of properties. These include bioaccumulation factors, bioconcentration factors, and biodegradation and fish biotransformation half-lives. For chemicals of interest that are not already registered in the dashboard, real-time predictions based on structural inputs are available. This presentation will provide an overview of the dashboard with a focus on the availability of environmental fate and transport data, access to real-time predictions, and our ongoing efforts to harvest and curate available experimental data from the literature and online databases. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and 2000 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the latest release of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
FAIR Data and Model Management for Systems Biology (and SOPs too!) - Carole Goble
MultiScale Biology Network Springboard meeting, Nottingham, UK, 1 June 2015
FAIR Data and model management for Systems Biology
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Yes, data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure the reproducibility of results. Funding agencies expect data (and increasingly software) management, retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes pre- and post-publication. And the multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Data and model management for the Systems Biology community is a multi-faceted challenge, including: the development and adoption of appropriate community standards (and the navigation of the standards maze); the sustaining of international public archives capable of servicing quantitative biology; and the development of the necessary tools and know-how for researchers within their own institutes, so that they can steward their assets in a sustainable, coherent and credited manner while minimising burden and maximising personal benefit.
The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has grown out of several efforts in European programmes (the SysMO and ERASysAPP ERA-Nets and the ISBE ESFRI) and national initiatives (de.NBI, the German Virtual Liver Network, SystemsX, UK SynBio centres). It aims to support Systems Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges multi-scale biology presents.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Overview of the SureChEMBL system and web interface.
https://www.surechembl.org/search/
SureChEMBL is a freely available web resource for chemistry patent searching. It is based on a fully automatic and dynamic text and image mining pipeline.
ChemSpider is being built with the intention of being a chemical structure-centric community for chemists. With over 16 million chemical structures as of August 2007, and with data deposition and curation mechanisms in place for text, structures and spectra, ChemSpider intends to be a meeting place and collaborative environment for chemists to work together.
The US EPA’s CompTox Chemistry Dashboard provides access to various types of data associated with ~760,000 chemical substances. These data include experimental and predicted property data, high-throughput screening assay data and hazard and environmental exposure data. With millions of individual data points and annotations associated with hundreds of thousands of chemicals, data quality is a priority. With tens of thousands of individual users per month browsing the data on the dashboard, the ability of users to provide feedback has allowed us to identify, confirm and address issues in the data. This has required the implementation of novel approaches for data feedback via the user interface, ranging from general feedback on the dashboard down to individual data points contained in a table. We are presently investigating ways to garner feedback on our ToxCast bioassay data to facilitate the curation of tens of thousands of data points. This presentation will provide an overview of our existing capabilities in the CompTox Chemistry Dashboard for gathering crowdsourced data from the user base and its impact on assisting in the curation of data.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Non-targeted analysis (NTA) uses high-resolution mass spectrometry to better understand the identity of a wide variety of chemicals present in environmental samples (and other matrices). However, data processing remains challenging due to the vast number of chemicals detected in samples, software and computational requirements of data processing, and inherent uncertainty in confidently identifying chemicals from candidate lists. Analysis of the resultant mass spectrometry information relies on cheminformatics to identify and rank chemicals and the US EPA has developed functionality within the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) to address challenges related to this analysis. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will review how the CompTox Chemicals Dashboard via its flexible search capabilities, rich data for ~900,000 chemical substances, and visualization approaches within this open chemistry resource provides a freely available software tool to support structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
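Consensus ranking of candidate structures can be sketched as a composite score over candidate metadata, with data-source counts rewarding well-known chemicals and retention time error penalising poor chromatographic fits. The field names and weighting below are hypothetical, not the Dashboard's actual scheme:

```python
def rank_candidates(candidates):
    """Rank formula-match candidates by a composite score.
    Each candidate is a dict with hypothetical metadata fields:
      'sources'  - number of databases listing the chemical
      'rt_error' - absolute predicted-vs-observed retention time error (min)
    More data sources and smaller retention time error rank higher."""
    def score(c):
        return c["sources"] - 0.5 * c["rt_error"]  # illustrative weighting
    return sorted(candidates, key=score, reverse=True)

# Toy candidate list for a single molecular formula match.
hits = [
    {"name": "isomer A", "sources": 12, "rt_error": 0.4},
    {"name": "isomer B", "sources": 30, "rt_error": 2.0},
    {"name": "isomer C", "sources": 3,  "rt_error": 0.1},
]
print([c["name"] for c in rank_candidates(hits)])  # ['isomer B', 'isomer A', 'isomer C']
```

A real workflow would fold in further evidence, such as MS/MS fragmentation match scores, before assigning a final rank.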
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL... (ChemAxon)
SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full-text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword- and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline leverage ChemAxon technologies extensively for name-to-structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data interchange and exporting options, such as flat files and a data feed client. Furthermore, our future plans for the SureChEMBL system will be outlined. To date, such plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications. We also plan to enrich the chemical annotations with a relevance score indicating their importance in the patent document.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
This presentation was made to the University of North Carolina in Chapel Hill on 9/20/21. It gave a general introduction to cheminformatics before demonstrating how to navigate the Dashboard.
• An introduction to the dashboard
• Substances vs structures
• Structure formats for data exchange and connectivity (SMILES, InChIs, molfiles)
• Identifiers – CASRN, chemical names, systematic names
• Data curation approaches: substance-structure ambiguity
• ChemReg: substance registration
• Data gathering for systematic reviews
• Curated lists
• Properties/Fate and Transport
• Access to Exposure Data
• Hazard data in the dashboard – ToxVal data (sourced from >40 databases, >50,000 chemicals, >900,000 data points)
• The Executive Summary of data
• Single chemical searches vs Batch searches
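As a small worked example for the identifiers bullet above, a CAS Registry Number (CASRN) carries a check digit that can be validated in a few lines, which is useful during data curation for catching transcription errors. The function is a sketch, not part of the Dashboard:

```python
def casrn_is_valid(casrn):
    """Validate a CAS Registry Number (e.g. '7732-18-5') using the
    published check-digit rule: each digit except the last is multiplied
    by its position counted from the right, and the sum modulo 10 must
    equal the final (check) digit."""
    digits = casrn.replace("-", "")
    if not digits.isdigit() or len(digits) < 2:
        return False
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * i for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check

print(casrn_is_valid("7732-18-5"))  # True  (water)
print(casrn_is_valid("50-00-0"))    # True  (formaldehyde)
print(casrn_is_valid("7732-18-4"))  # False (corrupted check digit)
```

Note that a valid check digit only proves the number is well-formed; confirming that a CASRN actually refers to the intended substance still requires registry lookup and curation.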
Materials Data Facility as Community Database to Share Nano-manufacturing Rec... (Globus)
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Galewsky from the National Center for Supercomputing Applications (NCSA).
At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. Combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra etc., together with the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository, the challenge is significant. All this at a time when semantic technologies are touted as the fundamental technology to enhance integration and discoverability. Funding agencies are demanding change, especially a change towards access to open data to parallel their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository in terms of providing data to develop prediction models to further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, exposure and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure and biological data. As an outcome of these efforts the National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, legacy in vivo animal data, consumer use and production information, exposure models and chemical structure databases with associated properties. A series of software applications and databases have been produced over the past decade to deliver these data, but recent developments have focused on a new software architecture that assembles the resources into a single platform. Our web application, the CompTox Chemistry Dashboard, provides access to data associated with ~750,000 chemical substances. These data include experimental and predicted physicochemical property data, bioassay screening data associated with the ToxCast program, product and functional use information and a myriad of related data of value to environmental scientists.
The dashboard provides chemical-based searching based on chemical names, synonyms and CAS Registry Numbers. Flexible search capabilities allow for chemical identification based on non-targeted analysis studies using mass spectrometry. Chemical identification using both mass and formula-based searching utilizes rank-ordering of results via functional use statistics, thereby providing a solution to help prioritize chemicals for further review when detected in environmental media.
This presentation will provide an overview of the dashboard, its capabilities for delivering data to the environmental chemistry community and how the architecture provides a foundation for the development of additional applications to support chemical risk assessment. This abstract does not reflect U.S. EPA policy.
Communication of chemistry in the internet era, while improved, remains challenged in terms of lossless data exchange. While there are moves afoot within the publishing industry to produce “data journals”, including embracing some of the new approaches for making data available to the community, many challenges remain. Chemistry data sharing, at even the most basic level, remains a challenge for many chemistry journals. The vast majority of chemistry data is provided as PDF files or trapped on webpages, and is therefore not available for reuse and repurposing without a significant amount of effort to extract the data. Some of the responsibility resides with the scientists, who need to be educated and encouraged in the adoption of appropriate exchange formats and utilization of online platforms for data hosting and dissemination. There are certain practices which, if adopted, could increase both the availability and utility of data for the community. These include recognition that data, in itself, has value above and beyond inclusion in peer-reviewed publications, the adoption of standard (not necessarily open) formats, clear data licensing, and distribution of the data across multiple platforms. This presentation will provide an overview of ongoing efforts within the National Center for Computational Toxicology to publish chemistry data, both in databases and associated with peer-reviewed publications, in a manner that makes our data and models consumable by the community.
This abstract does not reflect U.S. EPA policy.
Presentation on the Chemical Analysis Metadata Platform (ChAMP), a new project to characterize and organize metadata about chemical analysis methods. The project will develop an ontology, controlled vocabularies, and design rules.
The construction of QSAR models is critically dependent on the quality of available data. As part of our efforts to develop public platforms to provide access to predictive models, we have attempted to discriminate the influence of the quality versus quantity of data available to develop and validate QSAR models. We have focused our efforts on the widely used EPISuite software that was initially developed over two decades ago. Specific examples of quality issues for the EPISuite data include multiple records for the same chemical structure with different measured property values, inconsistency between the structure, chemical name and CAS registry number for single records, the inability to convert the SMILES strings into chemical structures, hypervalency in the chemical structures and the absence of stereochemistry for thousands of data records. Relative to the era of EPISuite development, modern cheminformatics tools allow for more advanced capabilities in terms of chemical structure representation and storage, as well as enabling automated data validation and standardization approaches to examine data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. This includes approaches to validate between chemical structure representations (e.g. molfile and SMILES) and identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories to allow us to thoroughly investigate the resulting models that can be developed from these data slices and to examine to what extent efforts into the development of large high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
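One of the quality checks listed above, multiple records for the same chemical structure with conflicting measured values, can be sketched in a few lines (illustrative only; the structure keys and tolerance are hypothetical):

```python
# Sketch of one automated curation step: flag structures whose duplicate
# records disagree on the measured property value by more than a tolerance.
from collections import defaultdict

def conflicting_records(records, tolerance=0.01):
    """Group (structure_key, value) pairs and return groups whose measured
    values disagree by more than `tolerance`."""
    by_structure = defaultdict(list)
    for key, value in records:
        by_structure[key].append(value)
    return {k: vals for k, vals in by_structure.items()
            if max(vals) - min(vals) > tolerance}

records = [
    ("KEY-A", 1.230), ("KEY-A", 1.235),  # agree within tolerance
    ("KEY-B", 0.500), ("KEY-B", 2.100),  # conflict: needs manual review
]
flagged = conflicting_records(records)
```

Records that survive this filter can be merged; flagged groups are candidates for the manual review described in the abstract.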
Development of machine learning-based prediction models for chemical modulators of RXR-alpha
Sunghwan Kim
Presented at the 2018 Research Festival at the National Institutes of Health (NIH) in Bethesda, MD (September 13, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, public-domain bioactivity data available in PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop machine learning-based prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using popular supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The general applicability of the developed models was evaluated with external data sets from ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for bioactivity of small molecules.
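As a toy illustration of the supervised-learning setup (stdlib only; the study itself used established implementations of the listed methods on real Tox21 qHTS data, not this code), a minimal k-nearest-neighbors classifier over binary fingerprints might look like:

```python
# Toy k-NN activity classifier over binary fingerprints. The fingerprints
# and labels below are invented; real models use thousands of qHTS records.

def tanimoto(a, b):
    """Tanimoto similarity between two same-length binary fingerprints."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0

def knn_predict(train, fp, k=3):
    """Majority vote over the k most similar training compounds."""
    nearest = sorted(train, key=lambda t: -tanimoto(t[0], fp))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0  # 1 = active modulator, 0 = inactive

train = [
    ([1, 1, 0, 0], 1), ([1, 1, 1, 0], 1), ([1, 0, 0, 0], 1),
    ([0, 0, 1, 1], 0), ([0, 1, 1, 1], 0), ([0, 0, 0, 1], 0),
]
pred = knn_predict(train, [1, 1, 0, 1])  # resembles the active compounds
```

External validation, as with the ChEMBL and NCGC sets in the abstract, amounts to running `knn_predict` on compounds withheld from `train` and comparing predictions with measured outcomes.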
Background of the project and simple use cases of using the Open PHACTS API and KNIME to extract compound, target and indication entities from millions of patent documents and infer meaningful links among them. Open PHACTS Linked Data meeting in Vienna.
The CompTox Chemistry Dashboard was developed by the Environmental Protection Agency’s National Center for Computational Toxicology. This dashboard has been architected in a manner that allows for the deployment of multiple “applications”, both as publicly available databases, and for deployment under the constraints of confidential business information (CBI). The public dashboard provides access to multiple types of data for ~750,000 chemicals. This includes, when available for a chemical substance, physicochemical parameters, toxicity and bioassay data, consumer use and analytical data. Fate, exposure, and hazard calculations can benefit from access to the data aggregation and curation efforts that underpin the public dashboard. Also, regulators can benefit from the integration of their own data within their closed infrastructure environments. This presentation will provide a review of the chemistry dashboard architecture and its present application providing access to data to the research and regulatory communities. We will also review present developments in the area of delivering an application programming interface, web services, and software components for integration into third party applications providing access to the data exposed via the dashboard. This abstract does not reflect U.S. EPA policy.
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open resources
Dr. Haxel Consult
Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where disclosed compounds and associated data not only exceed what is published in papers several-fold and surface years earlier, but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
Cover aspects of medicinal chemistry patent mining
Include hands-on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and 2000 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the latest release of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
FAIR Data and Model Management for Systems Biology (and SOPs too!)
Carole Goble
MultiScale Biology Network Springboard meeting, Nottingham, UK, 1 June 2015
FAIR Data and model management for Systems Biology
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Yes, data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management, retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. And the multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Data and model management for the Systems Biology community is a multi-faceted challenge, including: the development and adoption of appropriate community standards (and the navigation of the standards maze); the sustaining of international public archives capable of servicing quantitative biology; and the development of the necessary tools and know-how for researchers within their own institutes so that they can steward their assets in a sustainable, coherent and credited manner while minimising burden and maximising personal benefit.
The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has grown out of several efforts in European programmes (the SysMO and ERASysAPP ERA-Nets and the ISBE ESFRI) and national initiatives (de.NBI, the German Virtual Liver Network, SystemsX, UK SynBio centres). It aims to support Systems Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges multi-scale biology presents.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Overview of the SureChEMBL system and web interface.
https://www.surechembl.org/search/
SureChEMBL is a freely available web resource for chemistry patent searching. It is based on a fully automatic and dynamic text and image mining pipeline.
ChemSpider is being built with the intention of being a chemical structure-centric community for chemists. With over 16 million chemical structures as of August 2007, and with data deposition and curation mechanisms in place for text, structures and spectra, ChemSpider intends to be a meeting place and collaborative environment for chemists to work together.
The US EPA’s CompTox Chemistry Dashboard provides access to various types of data associated with ~760,000 chemical substances. These data include experimental and predicted property data, high-throughput screening assay data and hazard and environmental exposure data. With millions of individual data points and annotations associated with hundreds of thousands of chemicals, data quality is a priority. With tens of thousands of individual users per month browsing the data on the dashboard, the ability of users to provide feedback has allowed us to identify, confirm and address issues in the data. This has required the implementation of novel approaches for data feedback via the user interface, ranging from general feedback on the dashboard down to individual data points contained in a table. We are presently investigating ways to garner feedback on our ToxCast bioassay data to facilitate the curation of tens of thousands of data points. This presentation will provide an overview of our existing capabilities in the CompTox Chemistry Dashboard for gathering crowdsourced data from the user base and its impact on assisting in the curation of data.
This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Non-targeted analysis (NTA) uses high-resolution mass spectrometry to better understand the identity of a wide variety of chemicals present in environmental samples (and other matrices). However, data processing remains challenging due to the vast number of chemicals detected in samples, software and computational requirements of data processing, and inherent uncertainty in confidently identifying chemicals from candidate lists. Analysis of the resultant mass spectrometry information relies on cheminformatics to identify and rank chemicals and the US EPA has developed functionality within the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) to address challenges related to this analysis. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will review how the CompTox Chemicals Dashboard via its flexible search capabilities, rich data for ~900,000 chemical substances, and visualization approaches within this open chemistry resource provides a freely available software tool to support structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
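One building block of the mass-based candidate searching described above is computing a monoisotopic mass from a molecular formula, so that measured masses can be matched against candidate structures. A minimal sketch (element table abbreviated for illustration; production code covers the full periodic table and adduct handling):

```python
# Compute a monoisotopic mass from a simple Hill-order molecular formula.
import re

MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074,
                "O": 15.994915, "S": 31.972071, "Cl": 34.968853}

def monoisotopic_mass(formula):
    """Parse a formula like 'C8H10N4O2' and sum the element masses."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass

mass = monoisotopic_mass("C8H10N4O2")  # caffeine, ~194.0804 Da
```

Matching then reduces to comparing the measured mass (after adduct correction) against such computed values within an instrument-dependent tolerance, as in the dashboard's formula and mass search.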
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL
ChemAxon
SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that annotates, extracts and indexes chemistry from full-text patent documents by means of continuous, automated text and image mining. SureChEMBL is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.
Since its launch last September, the SureChEMBL interface has provided sophisticated keyword and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline leverage ChemAxon technologies extensively for name-to-structure conversion, as well as compound standardisation, registration and searching.
In addition to providing an overview of the system, recent developments and improvements will be described. These include the introduction of various data interchange and exporting options, such as flat files and a data feed client. Our future plans for the SureChEMBL system will also be outlined. To date, such plans include complementing the chemical annotations with biological ones, covering genes, proteins, diseases and indications, and enriching the chemical annotations with a relevance score indicating their importance in the patent document.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data includes experimental and predicted physicochemical property data, in vitro bioassay data and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
This presentation was made to the University of North Carolina in Chapel Hill on 9/20/21. It gave a general introduction to cheminformatics before covering how to navigate the Dashboard.
• An introduction to the dashboard
• Substances vs structures
• Structure formats for data exchange and connectivity (SMILES, InChIs, molfiles)
• Identifiers – CASRN, chemical names, systematic names
• Data curation approaches: substance-structure ambiguity
• ChemReg: substance registration
• Data gathering for systematic reviews
• Curated lists
• Properties/Fate and Transport
• Access to Exposure Data
• Hazard data in the dashboard – ToxVal data (sourced from >40 databases, >50,000 chemicals, >900,000 data points)
• The Executive Summary of data
• Single chemical searches vs Batch searches
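One curation concept from the list above, substance-structure ambiguity, can be illustrated with a small sketch (all identifiers below are invented placeholders, not real CASRN/structure assignments):

```python
# Flag registry numbers that have been associated with more than one
# structure key (e.g. an InChIKey) across source databases.
from collections import defaultdict

def ambiguous_casrns(assignments):
    """Map each CASRN to its distinct structure keys; return ambiguous ones."""
    seen = defaultdict(set)
    for casrn, structure_key in assignments:
        seen[casrn].add(structure_key)
    return {c: keys for c, keys in seen.items() if len(keys) > 1}

assignments = [
    ("0000-00-1", "KEY-A"), ("0000-00-1", "KEY-A"),  # sources agree
    ("0000-00-2", "KEY-B"), ("0000-00-2", "KEY-C"),  # conflicting sources
]
flagged = ambiguous_casrns(assignments)
```

Flagged identifiers are exactly the cases where a registration system such as ChemReg needs a curator to decide which substance-structure mapping is authoritative.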
Materials Data Facility as Community Database to Share Nano-manufacturing Rec...
Globus
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Galewsky from the National Center for Supercomputing Applications (NCSA).
At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. Combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra etc., together with the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository, the challenge is significant. All this at a time when semantic technologies are touted as the fundamental technology to enhance integration and discoverability. Funding agencies are demanding change, especially a change towards access to open data to parallel their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository in terms of providing data to develop prediction models to further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, exposure and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure and biological data. As an outcome of these efforts, the National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, legacy in vivo animal data, consumer use and production information, exposure models and chemical structure databases with associated properties. A series of software applications and databases have been produced over the past decade to deliver these data, but recent developments have focused on a new software architecture that assembles the resources into a single platform. Our web application, the CompTox Chemistry Dashboard, provides access to data associated with ~750,000 chemical substances. These data include experimental and predicted physicochemical property data, bioassay screening data associated with the ToxCast program, product and functional use information and a myriad of related data of value to environmental scientists.
The open patent chemistry “big bang”: Implications, opportunities and caveats
Dr. Haxel Consult
In 2012, after the first IBM deposition, few would have predicted that PubChem compounds that included patent-extracted structures would exceed 20 million within three years (i.e. 30% of the total). The current major open patent chemistry submitters (in size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. This “big bang” has a range of utilities and implications. Firstly, pharmaceutical companies must now integrate their exploitation of both public and commercial patent chemistry because capture is divergent. Secondly, the academic community and small companies can now patent-mine extensively without commercial sources. Thirdly, first-filings of most lead series and clinical candidates can now be tracked. Fourthly, drug targets in ChEMBL can be intersected with Structure Activity Relationship (SAR) data sets from patents, some of which are now target-mapped in other databases (doi:10.1016/j.ddtec.2014.12.001). However, while this patent chemistry “big bang” is generally welcomed by database users, there are significant caveats. In particular, both automated and manual extraction bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures (including isotopic analogues of approved drugs), c) an emerging trend of vendor “patent picking” for non-stocked compounds, d) the fact that 85% of public patent chemistry has no biological data links, and e) the fact that extractions from documents do not directly indicate IP status. These problems and some partial solutions will be discussed.
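As a sketch of how such structural “noise” might be triaged automatically (a crude string heuristic for illustration, not the method used by any of the databases named above):

```python
# Flag extracted SMILES that look like mixtures/salts (dot-disconnected
# components) or carry isotope labels, two of the noise classes discussed.

def noise_flags(smiles):
    """Return a list of simple noise indicators for one extracted SMILES."""
    flags = []
    if "." in smiles:
        flags.append("mixture/salt")  # dot-disconnected components
    if any(i > 0 and ch.isdigit() and smiles[i - 1] == "["
           for i, ch in enumerate(smiles)):
        flags.append("isotope-labelled")  # digit right after '[', e.g. [2H]
    return flags

clean = noise_flags("CCO")                    # ethanol: no flags
salt = noise_flags("CCO.Cl")                  # HCl salt form
deuterated = noise_flags("[2H]C([2H])([2H])O")  # labelled analogue
```

Real pipelines would use a cheminformatics toolkit for this, but even heuristics like these can segregate the bulk of mixture permutations and isotopic analogues for separate handling.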
Quality and noise in big chemistry databases
Chris Southan
Presented at the Aug 2019 ACS meeting by Antony Williams. Abstract: The internet has changed the way we access chemistry data, as well as providing access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with structure counts for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million respectively (at the time of writing). A range of specialist databases small enough to be curated have stand-alone utility, and synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of some of the noise in the larger databases, their value becomes highly dependent on the specific application; an example is using the databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests at least some options to ameliorate the problem.
The US EPA’s National Center for Computational Toxicology (NCCT) has been both measuring and aggregating data to support our research efforts for over a decade. We have delivered these data via a number of publicly accessible websites, so-called dashboards, to provide transparent access to the outputs of the center. Since the inception of our research, software technologies have changed dramatically, as have expectations regarding the methods by which to access data. Our informatics efforts provide access to millions of dollars of high-throughput screening data in open, downloadable formats, via web services and through a rich web interface. Similarly, we provide access to experimental and predicted data associated with ~760,000 substances to serve the environmental chemistry community, and open source code for predictive models. This presentation will provide an overview of the efforts of NCCT to provide transparent access to our research and data via our publications (and accompanying supplementary data), via our Open Data policies, and through our databases, software tools and web services. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Chemical Databases and Open Chemistry on the Desktop (Marcus Hanwell)
The modern chemist has access to large databases containing both experimental and calculated data. The power of HPC resources continues to increase, with more practitioners having routine access to powerful computational chemistry tools. This places an increasingly high burden on users to assimilate these resources into their workflows in order to use them effectively. The creation of an open, extensible application framework that puts computational tools, data, and domain-specific knowledge at the fingertips of chemists is increasingly important. A data-centric approach to chemistry, storing all data in a searchable database, will empower users to efficiently collaborate, innovate, and push the frontiers of research. Providing an open, user-friendly and extensible application will open up new tools to experimental chemists, while giving computational chemists the ability to address greater challenges. Additionally, by distributing experimental and computational data across the research community, incorporating cheminformatics analytics techniques, and providing visual search for chemical structures, the workflow of both groups can be significantly improved. This requires suitable data formats for data exchange, and databases with appropriate APIs for querying and uploading data in order to share effectively. This talk will discuss recent progress made in developing a suite of open chemistry applications on the desktop. The applications can query online databases, such as the NIH structure resolver service, download and manipulate structures, and prepare input files for standalone computational chemistry codes. Another application developed to submit jobs and to monitor and retrieve results from HPC resources will also be shown, along with a desktop chemistry database browser. The Quixote project aims to establish standards for data exchange in computational chemistry, along with data repositories for organizations.
Establishing these standards is important to promote open, reproducible chemistry, and their integration into user-friendly desktop applications will promote their integration in the standard workflow of researchers.
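As a sketch of the kind of online query the desktop applications above perform, the NIH/NCI structure resolver (“CACTUS”) exposes a simple REST-style URL scheme. The identifier "aspirin" and the helper function below are illustrative assumptions based on the resolver's published URL pattern, not details from the talk:

```python
from urllib.parse import quote

# NCI/CADD chemical identifier resolver base URL; the path pattern
# /chemical/structure/<identifier>/<representation> follows its
# documented URL scheme.
CACTUS_BASE = "https://cactus.nci.nih.gov/chemical/structure"

def resolver_url(identifier, representation="smiles"):
    """Build a resolver URL for a chemical name, InChIKey, CAS number, etc."""
    return f"{CACTUS_BASE}/{quote(identifier)}/{representation}"

print(resolver_url("aspirin"))
# https://cactus.nci.nih.gov/chemical/structure/aspirin/smiles
```

Fetching that URL (e.g. with urllib.request) returns the requested representation as plain text, which an application can then hand to a structure editor or a computational chemistry input generator.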
PubChem: a public chemical information resource for big data chemistry (Sunghwan Kim)
Presented at the Joint Statistical Meetings (JSM) 2020 (virtual) on August 3, 2020.
==== Abstract ====
The idea of “big data” has recently been drawing much attention from the scientific community as well as the general public. An example of big data in chemistry is the data contained in PubChem, a public database of chemical substance descriptions and their biological activities at the National Institutes of Health. PubChem is a sizeable system with 235 million depositor-provided substance descriptions, 96 million unique chemical structures, 1.1 million biological assays, and 268 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents and more. PubChem resources have been used in many studies for developing bioactivity and toxicity prediction models, discovering multi-target ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of how PubChem’s data, tools, and services can be used for bioassay data analysis and virtual screening (VS) and discusses important aspects of exploiting PubChem for drug discovery.
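Programmatic access of the kind the presentation describes typically goes through PubChem's PUG REST interface. The sketch below only composes a request URL following the published input/operation/output pattern; CID 2244 and the property list are illustrative, and no network call is made:

```python
# PubChem PUG REST requests follow the pattern
# <base>/<input specification>/<operation>/<output format>.
PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def cid_property_url(cid, properties, fmt="JSON"):
    """URL requesting selected computed properties for one compound CID."""
    return f"{PUG_BASE}/compound/cid/{cid}/property/{','.join(properties)}/{fmt}"

url = cid_property_url(2244, ["MolecularFormula", "MolecularWeight"])
print(url)
```

The same base pattern covers substance, assay and structure-search inputs, which is what makes scripted mining of the inter-relationships mentioned above practical.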
This presentation was given at a Triangle Area Mass Spectrometry meeting on 01/29/2019 in Research Triangle Park, North Carolina to provide a general overview of the CompTox Chemicals Dashboard to an audience of mass spectrometrists and people interested in the capabilities of the dashboard for chemical forensics, structure identification, etc.
The Center for Computational Toxicology and Exposure (CCTE) is part of the Office of Research and Development at the US Environmental Protection Agency. As part of its mission the center delivers access to chemicals related data via web-based freely accessible online Dashboards to disseminate data generated within the center as well as harvested and integrated from open databases around the world. The CompTox Chemicals Dashboard (available at https://comptox.epa.gov/dashboard) provides access to >1.2 million chemicals and associated data including experimental and predicted property data, in vivo hazard data, in vitro bioactivity data, exposure data, and various other data types. The curation of the chemicals dataset has included the development of over 400 segregated lists of chemicals that represent specific research areas of interest including disinfectant by-products, per- and polyfluoroalkyl substances (PFAS), extractables and leachables, and chemicals of emerging concern. The chemicals collection, the associated data, the lists and searches for mass and formulae makes the Dashboard an ideal foundation technology to support our colleagues working in the field of mass spectrometry, especially in targeted and non-targeted analysis. This presentation will provide an overview of the Dashboard, its value to the community in terms of providing access to the integrated and highly curated data, and its utility to support researchers in the field of mass spectrometry. New proof-of-concept projects will also be introduced including the development of a cheminformatics enabled methods database. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
2010 CASCON - Towards an integrated network of data and services for the life sciences (Michel Dumontier)
Modern biological knowledge discovery requires access to machine-understandable data that can be searched, retrieved, and subsequently analyzed using a wide array of analytical software and services. The Semantic Automated Discovery and Integration (SADI) framework is a set of conventions to formalize web service inputs and outputs using OWL ontologies that enable the automatic discovery and invocation of Semantic Web services. In this talk, I will walk through a worked example in the design and deployment of chemical semantic web services using the Chemistry Development Kit, chemical descriptors from the Chemical Information Ontology (CHEMINF), and the Semanticscience Integrated Ontology (SIO) as a unifying, upper-level ontology of basic types and relations. I will discuss how one can make use of the SADI-enabled SHARE client to reason about data obtained from Bio2RDF, the largest linked open data project, and automatically invoke chemical semantic web services to determine a chemical's drug-likeness. If you want to see the potential of the Semantic Web being realized, this talk is for you.
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database (Nathan Olson)
"Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database" presentation at the Standards for Pathogen Identification via NGS (SPIN) workshop hosted by the National Institute of Standards and Technology, October 2014, by Heike Sichtig, PhD, from the FDA and Luke Tallon from IGS UMSOM.
At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. Combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra etc., together with the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository, the challenge is significant. All this at a time when semantic technologies are touted as the fundamental technology to enhance integration and discoverability. Funding agencies are demanding change, especially a change towards access to open data to parallel their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository in terms of providing data to develop prediction models to further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.
In recent years, the growth of scientific data and the increasing need for data sharing and collaboration in the field of environmental chemistry have led to the creation of various software and databases that facilitate research and development into the safety and toxicity of chemicals. The US EPA Center for Computational Toxicology and Exposure has been developing software and databases that serve the chemistry community for many years. This presentation will focus on several web-based software applications which have been developed at the US EPA and made available to the community. While the primary software application from the Center is the CompTox Chemicals Dashboard, almost a dozen proof-of-concept applications have been built serving various capabilities. The publicly accessible Cheminformatics Modules (https://www.epa.gov/chemicalresearch/cheminformatics) provide access to six individual modules allowing hazard comparison for sets of chemicals, structure-substructure-similarity searching, structure alerts, and batch QSAR prediction of both physicochemical and toxicity endpoints. A number of other applications in development include a chemical transformations database (ChET) and a database of analytical methods and open mass spectral data (AMOS). Each of these depends on the underlying DSSTox chemicals database, a rich source of chemistry data for over 1.2 million chemical substances. I will provide an overview of all tools in development and the integrated nature of the applications based on the underlying chemistry data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Presented to David Gloriam's Group, Copenhagen, Feb 2020
**********************************
The theme will be presented from the perspective of both past involvement in peptide curation in the Guide to Pharmacology (GtoPdb) and current searching for bioactive peptides in the wider ecosystem that includes ChEMBL and PubChem. The core problem is that peptides hang in a limbo land between bioinformatics (BLAST) and cheminformatics (Tanimoto), neither of which provides optimal searching. Curating peptides in GtoPdb presents many challenges, including mapping endogenous peptides to Swiss-Prot cleavage annotations. For synthetic peptides, equivocal specification of modifications and exact positions of radiolabels are also problematic. However, target-mapped, citation-supported quantitative binding parameters are curated where possible. For those peptides falling below the PubChem CID SMILES limit of approximately 70 residues, GtoPdb has been using Sugar and Splice from NextMove Software to convert them into CIDs. Specific problems associated with finding bioactive peptides in databases will be outlined.
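The Tanimoto side of that limbo can be illustrated with a toy calculation: cheminformatics similarity is typically the Tanimoto (Jaccard) coefficient over fingerprint feature sets. The dipeptide-style features below are invented for illustration only; real toolkits derive thousands of features from the full structure, which is why two long peptides differing in a single residue score as near-identical:

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Hypothetical substructure-feature sets for two related short peptides.
pep1 = {"ala-gly", "gly-ser", "ser-lys", "lys-arg"}
pep2 = {"ala-gly", "gly-ser", "ser-lys", "lys-his"}
print(round(tanimoto(pep1, pep2), 2))  # 0.6
```

A BLAST alignment would localise the single differing residue exactly; the fingerprint comparison only reports overall overlap, which is the crux of the peptide search problem described above.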
Vicissitudes of target validation for BACE1 and BACE2 (Chris Southan)
Introduction/Background & Aims
The beta-site amyloid precursor protein (APP) cleaving enzyme (BACE1) was implicated as a drug target for Alzheimer's Disease (AD) back in 1999. In 2011, the paralogue, BACE2, became a newly proposed target for type II diabetes (T2DM), having been reported to be the TMEM27 secretase regulating pancreatic beta-cell function [1]. By 2019 the accumulated evidence, including a swathe of failed clinical trials for BACE1 inhibitors, had produced a de facto de-validation of both targets in both diseases. As a learning exercise, the series of events leading up to this is reviewed here.
Method/Summary of work
Basic information about these two targets and the lead compounds against them were sourced via the IUPHAR/BPS Guide to Pharmacology (GtoPdb) as Target ids: 2330 and 2331, for BACE1 and 2, respectively. This was consolidated by a literature and patent review as well as following them in other databases. The most recent information on clinical trials was sourced from press releases.
Results/Discussion
GtoPdb annotates 24 lead compounds against BACE1 and 12 against BACE2. The corresponding counts mapped to these targets in ChEMBL are 8741 and 1377, making BACE1 one of the most actively pursued enzyme targets ever. Notwithstanding this massive global effort, during 2018 Merck’s verubecestat and J&J’s atabecestat BACE1 inhibitors not only failed their Phase III endpoints but even appeared to worsen cognition in prodromal patients. In 2019 Amgen/Novartis stopped Phase II/III trials of umibecestat, which also showed more cognitive decline in the treatment group compared to controls. BACE2 presented an anomalous situation in several ways. By 2016 both Novartis and Amgen had declared their inability to reproduce the TMEM27 secretase turnover reported in 2011. Notwithstanding this, Novartis and other companies have published patents on BACE2-specific inhibitors over several years and, paradoxically, verubecestat is more potent against BACE2 than BACE1 but was never tested for glucose-lowering. Equally puzzling is that one academic group is still publishing BACE2 inhibitors for T2D even post de-validation. One thing both targets have in common is the complete absence of genetic support from genome-wide disease association studies, but this warning sign went unheeded.
Conclusions
The massive waste of resources on the pursuit of BACE1 as an AD target over the last two decades is catastrophic. This tale of de-validation is compounded for this paralogous pair of enzymes by the fact that the original evidence for BACE2 as a T2D target was eventually refuted. The story of these targets highlights a range of crucial pharmacological pitfalls that must be avoided in the future.
Reference(s)
[1] Southan C, Hancock J.M. (2013) A tale of two drug targets: the evolutionary history of BACE1 and BACE2. Front Genet. 4:293.
In silico 360 Analysis for Drug Development (Chris Southan)
Introduction:
Consequent to a memorandum of understanding between the Karolinska Institutet and the International Union of Basic and Clinical Pharmacology (IUPHAR) in 2018, a report on academic drug development, including guidelines (ADEV), has been drafted [1]. As part of this exercise, we conceived a triage for comprehensive informatics profiling around the compound-target-disease axis. We have termed this “in silico 360” (INS360), the aim of which is to support ADEV teams, since they may lack either the internal expertise or the external support to do this on their own. Indeed, some past SciLifeLab Drug Discovery and Development Platform projects had been halted because of overlooked competitive impingements or insufficient target validation evidence.
Methods
We assessed the current database landscape, mostly public but also including commercial resources, for potential utility for INS360. We were guided primarily by content coverage, usability, and reputation. We also explored some open property prediction resources for assay interference and toxicological inferences.
Results:
As a first-stop-shop, we selected the IUPHAR/BPS Guide to PHARMACOLOGY with ~900 ligand-target relationships captured via expert curation of journal papers. Moving up in scale, we evaluated ChEMBL at 1.8 million compounds with 1.1 million assay descriptions and 7,000 targets. With yet another jump we could search the patent corpus with 18 million extracted compounds in SureChEMBL. We explored PubChem, which integrates these three with over 500 other sources linked to 96 million compounds, BioAssay results and connectivity into the NCBI Entrez system. The final jump in scale for document-to-chemistry navigation was represented by SciFinder with 155 million structures. On the target side, 360-exploration needs to encompass literature, structure, genetic variation, splicing, interactions, and disease pathways. From their UniProt links, both GtoPdb and ChEMBL provide these entry points. Navigating genetic association data in support of target validation was enabled by the OpenTargets portal and the GWAS Catalog. We also found servers that could produce prediction scores from chemical structures for a range of features important for de-risking development.
Conclusion:
This work scoped out initial resource choices for the INS360. We propose that not only ADEV operations but essentially any pharmacology research team has much to gain from this approach and many potential pitfalls can consequently be avoided when approaching key checkpoints, such as preparing a publication. However, support may be needed for both institutions and teams to get the best out of these complex and feature-rich databases.
[1] Southan C, (2019) Towards Academic Drug Development Guidelines, ChemRxiv pre-print no. 8869574
Will the correct BACE ORFs please stand up? (Chris Southan)
BACE1 and BACE2 are protease targets for Alzheimer's and diabetes, respectively, but their validation is now questioned
Phylogenetic analysis can add functional insights
This came up against two key problems:
A surprising prevalence of incorrect protein sequences predicted from genomes
Many BACE1 and BACE2 orthologues had truncation and/or indel errors.
Key phylogenetic representative genomes are languishing in an unfinished state
Some options for amelioration of these problems will be described
An update on the evolution of these enzymes will be shown
Look for new and potentially useful human 5HT2A-directed small molecule chemistry surfaced since the last meeting; check for compounds with 5HT2A as the primary target but also combined inhibitors; poll the key databases, literature and patents. Searching challenges arise from synonym soup, complex cross-reactivities (see PMID 29679900), in vitro data gaps and in vivo polypharmacology.
Looking at chemistry - protein - papers connectivity in ELIXIR (Chris Southan)
This is a poster for the UK ELIXIR meeting in Birmingham, UK, Nov 2018. It is the summary of a blog post https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources
Poster for the World Congress of Pharmacology 2018, Kyoto
Introduction: The pharmacological literature and patents connect compound structures to their bioactivity. However, entombing these relationships for millions of compounds among millions of PDFs is acknowledged as massively problematic. The situation is ameliorated by resources that extract the entity and data relationships the authors and inventors put “in” to their PDFs back “out” into structured database records. The IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb) has been doing this by stringent curation of ligands and their quantitative activity against protein targets [1]. Our citations are submitted to PubChem (PC), who then link to PubMed (PM) [2]. This study presents an overview of this connectivity.
Methods: For GtoPdb entries in PC Substance we used the PC interface to count our submitted PM links. This gives the PC > PM mapping counts from which we analysed the PM links. We then performed reciprocal analyses (i.e. PM > PC) by selecting PM sets. We then compared two journals by counting structure links by year and source.
Results: From 8988 GtoPdb-submitted ligand substances in PC (release 2017.5), 7309 are linked to 8980 PM entries. Of the 7309 there are 5632 links to chemical structures in PC, the rest being antibodies and larger peptides. From the 8980 PMIDs, the Journal of Medicinal Chemistry (JMC) accounted for 1003 as our most frequently cited primary source of structure-to-activity mappings. For the British Journal of Pharmacology (BJP) most of the 345 cross-references were development compounds. Further analysis showed that from 2014 to 2017 the BJP-to-PC links of ~30 structures per year are mostly from GtoPdb and the Comparative Toxicogenomics Database. However, going back to 2010-12, this increased to 500-800 connections, mainly derived from the IBM automated chemical extraction from abstracts. A similar pattern was observed for JMC.
Conclusion: Navigation between documents and databases is an essential competence for pharmacologists and drug discovery but the NCBI Entrez system is daunting. GtoPdb is a major contributor of high-quality links and provides a first-stop to guide users into the PC/PM systems. However, our results indicated potentially serious specificity issues with automated chemistry-to-journal linking from non-GtoPdb sources.
References: [1] Harding et al. (2018). Nucl. Acids Res. 46 (Database Issue), doi: 10.1093/nar/gkx1121.
Seminar on U.V. Spectroscopy (Samir Panda)
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-visible spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
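The measurement rests on the Beer-Lambert law, A = log10(I0/I) = εlc. A minimal worked example follows; the transmittance and molar absorptivity values are illustrative, not from any real analyte:

```python
import math

def absorbance(transmittance):
    """Absorbance A from fractional transmittance I/I0."""
    return -math.log10(transmittance)

def concentration(a, epsilon, path_cm=1.0):
    """Molar concentration c = A / (epsilon * l) for a given path length."""
    return a / (epsilon * path_cm)

a = absorbance(0.10)                   # 10% transmitted -> A = 1.0
c = concentration(a, epsilon=15000.0)  # mol/L for an assumed epsilon
```

Because absorbance is linear in concentration (within the law's validity range), a calibration series of known standards lets an unknown concentration be read off directly.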
A brief overview of the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
This PDF is about schizophrenia.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, and particulates.
1. An overview of connectivity between
documents, structures and bioactivity
Christopher Southan
Presented at University of Copenhagen, Feb 2020
Host: David Gloriam
3. The chemistry < - > biology join
• Chemistry “C” with significant bioactivity in vitro, in cellulo, in vivo, in clinico
• Applicability to drug discovery, pharmacology, chemical biology and enzymology
• Majority of primary quantitative data in papers and patent documents
• Does not cover all nuances and complexities of molecular mechanisms of action, e.g.
– indirect or complex targets, prodrugs, cellular assays, covalent inhibitors, activators
D – A – R – C – P
5. The “lost connectivity” problem
"We have spent millions putting chemistry into PDFs but
now we are spending millions more taking it back out”
(Anon)
Rough estimates of 50+ years of public legacy DARCP:
• “D” ~ 200K papers, ~50K patents
• “C” ~ 5 million structures
• “P” ~ 4000 human proteins, ~ 2000 other species
• Only a small proportion captured as DARCP in open sources
• Quality would be a key issue if “everything” was extracted
7. Unsung Heroes
Impediments to artisanal extraction of DARCP by Biocurators
• Entity disambiguation
• Unintentional obfuscation and errors by journal authors
• Occasional deliberate obfuscation by patent applicants
• Activity parameters (IC50, EC50, Ki, Kd) can have ~10-fold variation between publications for nominally the same assays
• Judging the reproducibility of the publications selected for extraction
• Variable publisher guidelines for entity specification and reporting standards
• Chemical structures often image-only
• Key data buried in supplementary data
• Limited author awareness of assay and target ontologies or gene naming
• Poor sustainability of funding and career structures
9. Commercial biocuration of DARCP
Excelra (formerly GVK BIO)
GOSTAR stats from 2015
• 1.3 million cpds from 112K papers (~15 per paper)
• 3.5 million cpds from 70K patents (~50 per patent)
• 3,882 human targets
• 9 million bioactivities
25. PubChem large-scale C-D submissions
• Generally a good thing (inc. 3 million patents) but with caveats
• Difficult to identify “aboutness” of key compounds
• Issues with indexing of non-PubMed, DOI-only journal papers
• Quality issues of automated CNER chemistry extraction
• Introduces parallel (//) c2d mappings into PubChem
• Massive ‘futile indexing’ of common chemistry
26. Automated entity look-ups on the fly from documents (including C-D)
• Being pushed in Europe PMC via EBI database look-ups
• PubMed/PubMed Central via NCBI databases
• Can be a gateway to DARCP but specificity caveats
31. Rounding off: so where do we go from here in terms of open DARCP capture?
32. Will this make a difference?
• This should increase the flow of A,R,C,P from D into repositories
• However, whether this will also extend to D-A-R-C-P flowing into major databases such as PubChem remains unclear
33. Proposed solution, but a counsel of perfection
“Mandating authors to explicitly connect their own DARCLP results
in a form that is FAIR, extrinsic to PDF, structured, with metadata,
machine readable, ontologised, transferable to open database
records and reciprocally linked to publications” (Southan 2019)
• Authors should become their own biocurators
• Has been technically feasible for over a decade
• Even in 2020 not one single journal insists on authors providing machine-readable DARCLP to flow into PubChem BioAssay
• Impediments include sociological factors and publishing models
34. Conclusions
• The bioscience community (including big data miners) still has its collective feet nailed to the floor by the five decades of valuable DARCP entombed behind firewalls and buried in patents
• Biocuration makes a crucial contribution but is limited in scale
• Automated extraction is advancing (e.g. via NLP) but is way behind the specificity of expert biocuration
• The existence of parallel (//) document <> chemistry systems (e.g. MeSH, IBM, SureChEMBL, Springer Nature, Thieme, Wikidata) in PubChem, and look-ups in EPMC, is enabling but also confusing
• The spread of Open Science ELNs is good to see, but findability, searchability and database submissions still need to be optimised
• The need remains to facilitate a flow of published (inc. preprints) author-specified bioactive chemistry direct to databases (even if the papers are FAIR)
37. Reciprocal links > virtuous circles (II)
• GtoMdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
38. Reciprocal links > virtuous circles (I)
• GtoPdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
47. Disinterment from the PDF tomb (I)
Image extraction > structure
• Real chemists sketch images in a jiffy
• The rest of us can use OSRA: Optical Structure Recognition Application
Editor's Notes
The simplest of starting points, at least the press release had a structure diagram
OSRA provides good starting points to edit and get SMILES
The structure does not have to be exactly right because a database similarity match is OK to see what it should have been