An overview of connectivity between
documents, structures and bioactivity
Christopher Southan
Presented at University of Copenhagen, Feb 2020
Host: David Gloriam
Outline
• D-A-R-C-P
• Chemistry stats
• Biocuration challenges
• GtoPdb curation examples
• Biocuration sources
• Chemistry-to-document
• Conclusions
The chemistry <-> biology join
• Chemistry “C” with significant bioactivity in vitro, in cellulo, in vivo, in clinico
• Applicability to drug discovery, pharmacology, chemical biology and enzymology
• Majority of primary quantitative data in papers and patent documents
• Does not cover all nuances and complexities of molecular mechanisms of action, e.g.
– indirect or complex targets, prodrugs, cellular assays, covalent inhibitors, activators
D – A – R – C – P
How much chemistry is out there?
The “lost connectivity” problem
"We have spent millions putting chemistry into PDFs but
now we are spending millions more taking it back out"
(Anon)
Rough estimates of 50+ years of public legacy DARCP:
• “D” ~ 200K papers, ~50K patents
• “C” ~ 5 million structures
• “P” ~ 4000 human proteins, ~ 2000 other species
• Only a small proportion captured as DARCP in open sources
• Quality would be a key issue if “everything” was extracted
Disinterring DARCP (and linkable data in general)
from document tombs is hard
Unsung Heroes
Impediments to artisanal extraction of DARCP by Biocurators
• Entity disambiguation
• Unintentional obfuscation and errors by journal authors
• Occasional deliberate obfuscation by patent applicants
• Activity parameters (IC50, EC50, Ki, Kd) can have ~10-fold variation between
publications for nominally the same assays
• Judging the reproducibility of the publications selected for extraction
• Variable publisher guidelines for entity specification and reporting standards
• Chemical structures often image-only
• Key data buried in supplementary data
• Limited author awareness of assay and target ontologies or gene naming
• Poor sustainability of funding and career structures
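To put the ~10-fold variation point in concrete terms, here is a minimal sketch that converts activity values to their negative log10 molar form (pIC50, pKi and so on) and computes the fold difference between two reports. The function names and the two example values are illustrative assumptions, not data from any publication.

```python
import math


def to_p_value(value_nm: float) -> float:
    """Convert an activity in nanomolar (IC50, EC50, Ki or Kd) to -log10(molar)."""
    if value_nm <= 0:
        raise ValueError("activity must be a positive number")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


def fold_difference(a_nm: float, b_nm: float) -> float:
    """Return how many fold the larger value exceeds the smaller one."""
    return max(a_nm, b_nm) / min(a_nm, b_nm)


# Illustrative values only: two hypothetical reports for the same compound.
report_a_nm = 10.0
report_b_nm = 100.0

fold = fold_difference(report_a_nm, report_b_nm)
log_gap = abs(to_p_value(report_a_nm) - to_p_value(report_b_nm))
```

A 10-fold spread corresponds to one log unit, which is why curators often compare values on the p-scale rather than as raw concentrations.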
D-A-R-C-P curated sources
Commercial biocuration of DARCP
Excelra (formerly GVKBIO)
GOSTAR stats from 2015
• 1.3 million cpds from 112K papers (~15 per paper)
• 3.5 million cpds from 70K patents (~50 per patent)
• 3,882 human targets
• 9 million bioactivities
Open biocuration of DARCLP
GtoPdb expanded to DARCLP
GtoPdb stats from November release 2019.5
PubChem substances > references > author
PubMed substances > GtoPdb
GtoPdb > DARCP
GtoPdb > DARCP for 30 BACE1 lead inhibitors
Other Open sources of DARCP
DARCLP curation from US patents by BindingDB
• 225,000 compounds
• 406,000 activities
• 1000 targets
• 3000 patents
Statistics of open D-A-R-C-P
Intersecting open DARCP: Publications “D”
Intersecting open DARCP: Chemistry “C”
Intersecting open DARCP: Targets “P”
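The three intersection slides compare what different open sources hold for D, C and P. As a rough sketch of how such overlaps can be counted once each source is reduced to a set of identifiers (PubMed IDs, InChIKeys or protein accessions), the code below computes every pairwise and higher-order intersection. The source names and identifiers are made-up placeholders, not real database content.

```python
from itertools import combinations


def overlap_counts(sources: dict) -> dict:
    """Count identifiers shared by every combination of two or more sources."""
    counts = {}
    names = sorted(sources)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            shared = set.intersection(*(sources[name] for name in combo))
            counts[combo] = len(shared)
    return counts


# Hypothetical toy data: each source is a set of identifier strings.
toy_sources = {
    "source_a": {"id1", "id2", "id3"},
    "source_b": {"id2", "id3", "id4"},
    "source_c": {"id3", "id5"},
}

toy_counts = overlap_counts(toy_sources)
for combo, count in toy_counts.items():
    print(" & ".join(combo), count)
```

The same function works for any of the three entity types, provided the identifiers have been normalised to one namespace first.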
716 Data Sources merged into
Chemistry to Document (C-D, c2d) connectivity
Potential gateway to DARCP but with limitations
PubChem < > PubMed (2016 snapshot)
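One way to follow a compound-to-document link programmatically is the NCBI E-utilities ELink service. The sketch below only builds the request URL; it assumes the `pccompound` and `pubmed` database names and the `elink.fcgi` endpoint as I understand them, so check the current NCBI documentation before relying on it. The helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed ELink endpoint; verify against current NCBI E-utilities documentation.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def compound_to_pubmed_url(cid: int) -> str:
    """Build an ELink URL asking for PubMed records linked to a PubChem CID."""
    params = {"dbfrom": "pccompound", "db": "pubmed", "id": str(cid)}
    return f"{EUTILS_ELINK}?{urlencode(params)}"


# CID 2244 is aspirin in PubChem.
example_url = compound_to_pubmed_url(2244)
print(example_url)
```

The returned links mix several C-D sources, so a long list of PubMed IDs says little about which papers are really "about" the compound.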
PubChem large-scale C-D submissions
• Generally a good thing (inc. 3 million patents) but with caveats
• Difficult to identify “aboutness” of key compounds
• Issues with indexing of non-PubMed DOI-only journal papers
• Quality issues of automated CNER chemistry extraction
• Introduces parallel c2d mappings into PubChem
• Massive ‘futile indexing’ of common chemistry
Automated entity look-ups on the fly
from documents (including C-D)
• Being pushed in Europe PubMed Central (EPMC) via EBI database look-ups
• PubMed/PubMed Central via NCBI databases
• Can be a gateway to DARCP but with specificity caveats
C-D Auto look-up ambiguities
Auto look-up ambiguities: sloppy synonyms
Auto look-up ambiguities: the wrong “bear”
Auto look-up in patents
Rounding off: so where do we go from here in
terms of open DARCP capture?
Will this make a difference?
• This should increase the flow of A,R,C,P from D into repositories
• However, whether this will also extend to D-A-R-C-P flowing into major
databases such as PubChem remains unclear
Proposed solution but a counsel of perfection
“Mandating authors to explicitly connect their own DARCLP results
in a form that is FAIR, extrinsic to PDF, structured, with metadata,
machine readable, ontologised, transferable to open database
records and reciprocally linked to publications” (Southan 2019)
• Authors should become their own biocurators
• Has been technically feasible for over a decade
• Even in 2020 not one single journal insists on authors providing
machine-readable DARCLP to flow into PubChem BioAssay
• Impediments include sociological factors and publishing models
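As a rough illustration of what "structured" and "machine readable" could mean for an author-supplied result, the sketch below models one D-A-R-C-P record and serialises it to JSON. The schema is hypothetical and not any database's submission format. It assumes A means assay and R means result, omits the L of DARCLP, and uses placeholder values throughout.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DarcpRecord:
    """Hypothetical minimal record linking one result to its document."""
    document: str   # D: e.g. a DOI or PubMed ID
    assay: str      # A: short assay description (assumed meaning of A)
    parameter: str  # R: activity type such as IC50, EC50, Ki or Kd
    value: float    # R: numeric result
    units: str      # R: units of the numeric result
    compound: str   # C: e.g. an InChIKey
    protein: str    # P: e.g. a protein accession

    def to_json(self) -> str:
        """Serialise the record with stable key order for easy diffing."""
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values only; nothing here is a real measurement.
record = DarcpRecord(
    document="doi:PLACEHOLDER",
    assay="placeholder enzyme inhibition assay",
    parameter="IC50",
    value=1.0,
    units="nM",
    compound="PLACEHOLDER-INCHIKEY",
    protein="PLACEHOLDER-ACCESSION",
)

print(record.to_json())
```

Keeping every field explicit means a database can validate and load the record without parsing the PDF at all.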
Conclusions
• The bioscience community (including big data miners) still have
their collective feet nailed to the floor by the 5 decades of
valuable DARCP entombed behind firewalls and buried in patents
• Biocuration makes a crucial contribution but is limited in scale
• Automated extraction is advancing (e.g. via NLP) but is way
behind the specificity of expert biocuration
• Existence of parallel document <> chemistry systems (e.g. MeSH, IBM,
SureChEMBL, Springer Nature, Thieme, Wikidata) in PubChem,
and look-ups in EPMC, is enabling but also confusing
• The spread of Open Science ELNs is good to see but findability,
searchability and database submissions still need to be optimised
• The need remains to facilitate a flow of published (inc. preprint)
author-specified bioactive chemistry direct to databases (even if
the papers are FAIR)
Further info
Extras
Reciprocal links > virtuous circles (II)
• GtoMdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
Reciprocal links > virtuous circles (I)
• GtoPdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
O_S_M
OSM-S-363 data links (I)
OSM-S-363 (II)
OSM-S-363 (III)
OSM open data sheet
Next slide shows results of uploading 782 InChIKeys to PubChem
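A batch InChIKey check of this kind can be prepared along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the URL pattern follows the PubChem PUG REST `compound/inchikey/.../cids/JSON` form as I understand it, and no network request is made here.

```python
import re
from urllib.parse import quote

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def is_inchikey(text: str) -> bool:
    """Return True when the text has the shape of an InChIKey."""
    return bool(INCHIKEY_PATTERN.match(text.strip()))


def cid_lookup_url(inchikey: str) -> str:
    """Build a PUG REST URL that asks PubChem for CIDs matching an InChIKey."""
    key = inchikey.strip()
    if not is_inchikey(key):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return f"{PUG_REST}/compound/inchikey/{quote(key)}/cids/JSON"


def split_valid(keys):
    """Separate well-formed keys from rejects before any upload."""
    valid = [k.strip() for k in keys if is_inchikey(k)]
    rejected = [k for k in keys if not is_inchikey(k)]
    return valid, rejected


# Aspirin's InChIKey plus one deliberately malformed entry.
valid_keys, rejected_keys = split_valid(["BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "not-a-key"])
lookup_url = cid_lookup_url(valid_keys[0])
```

Screening out malformed keys first keeps the match statistics honest, since a rejected key is a formatting problem rather than a compound missing from PubChem.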
Statistics of OSM PubChem matches
Emerging capture challenges for bioactivity
Chemistry disinterment from PDF tombs (II)
IUPAC name > structure
Disinterment from the PDF tomb (I)
Image extraction > structure
• Real chemists sketch images in a jiffy
• The rest of us can use OSRA: Optical Structure Recognition

Editor's Notes

  • #8 The simplest of starting points: at least the press release had a structure diagram. OSRA provides good starting points to edit and get SMILES. The structure does not have to be exactly right because a database similarity match is OK to see what it should have been.
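The note's point that an imperfect extracted structure can still be recovered by a database similarity match can be sketched with a Tanimoto comparison over fingerprint bits. This is a toy illustration: the fingerprints are invented sets of integers, and a real workflow would generate them with a cheminformatics toolkit.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query: set, database: dict):
    """Return (name, score) of the database entry most similar to the query."""
    name = max(database, key=lambda entry: tanimoto(query, database[entry]))
    return name, tanimoto(query, database[name])


# Invented fingerprints: the query stands for a slightly wrong OSRA output.
query_bits = {1, 2, 3, 4}
toy_database = {
    "entry_x": {1, 2, 3, 5},  # close to the query, one bit differs
    "entry_y": {7, 8},        # unrelated
}

match_name, match_score = best_match(query_bits, toy_database)
```

A high but imperfect score against a single entry is the cue to inspect that record and correct the sketch, which is the workflow the note describes.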