An overview of connectivity between
documents, structures and bioactivity
Christopher Southan
Presented at University of Copenhagen, Feb 2020
Host: David Gloriam
Outline
• D-A-R-C-P
• Chemistry stats
• Biocuration challenges
• GtoPdb curation examples
• Biocuration sources
• Chemistry-to-document
• Conclusions
The chemistry <-> biology join
• Chemistry “C” with significant bioactivity in vitro, in cellulo, in vivo, in clinico
• Applicability to drug discovery, pharmacology, chemical biology and enzymology
• The majority of primary quantitative data is in papers and patent documents
• Does not cover all nuances and complexities of molecular mechanisms of action, e.g.
– indirect or complex targets, prodrugs, cellular assays, covalent inhibitors, activators
D – A – R – C – P
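The D-A-R-C-P chain can be pictured as a single linked record. The sketch below is only an illustration: it assumes the letters stand for document, assay, result, compound and protein (the expansion used in the author's related work; the slides do not spell it out), and the class name, field names and example values are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DarcpRecord:
    """One hypothetical D-A-R-C-P link; all field names are illustrative."""
    document: str      # D: e.g. a PubMed ID, DOI or patent number
    assay: str         # A: free-text or ontology-coded assay description
    result_type: str   # R: activity parameter such as "IC50" or "Ki"
    result_nm: float   # R: the measured value, here assumed to be in nanomolar
    compound: str      # C: a structure identifier such as an InChIKey
    protein: str       # P: a target identifier such as a UniProt accession

    def is_complete(self) -> bool:
        """True only if every element of the chain is present."""
        text_fields = [self.document, self.assay, self.result_type,
                       self.compound, self.protein]
        return all(text_fields) and self.result_nm > 0


# Placeholder values only; none of these identifiers are real.
example = DarcpRecord(
    document="PMID:0000000",
    assay="enzyme inhibition, in vitro",
    result_type="IC50",
    result_nm=12.0,
    compound="AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    protein="UNIPROT:EXAMPLE",
)
print(example.is_complete())
```

A record that lacks any one element (for instance a compound with no document) is exactly the "lost connectivity" case: the data may exist, but the join cannot be made.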
How much chemistry is out there?
The “lost connectivity” problem
“We have spent millions putting chemistry into PDFs but
now we are spending millions more taking it back out”
(Anon)
Rough estimates of 50+ years of public legacy DARCP:
• “D” ~ 200K papers, ~50K patents
• “C” ~ 5 million structures
• “P” ~ 4000 human proteins, ~ 2000 other species
• Only a small proportion captured as DARCP in open sources
• Quality would be a key issue if “everything” was extracted
Disinterring DARCP (and linkable data in general)
from document tombs is hard
Unsung Heroes
Impediments to artisanal extraction of DARCP by Biocurators
• Entity disambiguation
• Unintentional obfuscation and errors by journal authors
• Occasional deliberate obfuscation by patent applicants
• Activity parameters (IC50, EC50, Ki, Kd) can have ~ 10-fold variation between
publications for nominally the same assays
• Judging the reproducibility of the publications selected for extraction
• Variable publisher guidelines for entity specification and reporting standards
• Chemical structures often image-only
• Key data buried in supplementary data
• Limited author awareness of assay and target ontologies or gene naming
• Poor sustainability of funding and career structures
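The ~ 10-fold variation noted above for IC50, EC50, Ki and Kd values is easier to reason about on a log scale, where a 10-fold difference is one log unit. The following is a minimal sketch, assuming values reported in nanomolar; the function names and the 10-fold threshold parameter are illustrative choices, not a curation rule taken from the slides.

```python
import math


def p_value(value_nm: float) -> float:
    """Convert an activity in nanomolar to -log10 of the molar value (e.g. pIC50)."""
    if value_nm <= 0:
        raise ValueError("activity must be positive")
    # value in M = value_nm * 1e-9, so -log10(M) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


def fold_difference(a_nm: float, b_nm: float) -> float:
    """Ratio of the larger to the smaller of two activity values."""
    if a_nm <= 0 or b_nm <= 0:
        raise ValueError("activities must be positive")
    return max(a_nm, b_nm) / min(a_nm, b_nm)


def within_expected_variation(a_nm: float, b_nm: float, max_fold: float = 10.0) -> bool:
    """True if two reported values differ by no more than max_fold."""
    return fold_difference(a_nm, b_nm) <= max_fold


# Two hypothetical reports for nominally the same assay.
print(p_value(5.0), p_value(50.0))
print(within_expected_variation(5.0, 50.0))
```

A curator comparing two publications could use a check like this to flag pairs of values that disagree by more than the variation usually seen between laboratories.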
D-A-R-C-P curated sources
Commercial biocuration of DARCP
Excelra (formerly GVKBIO)
GOSTAR stats from 2015
• 1.3 million cpds from 112K papers (~ 15 per paper)
• 3.5 million cpds from 70K patents (~ 50 per patent)
• 3,882 human targets
• 9 million bioactivities
Open biocuration of DARCLP
GtoPdb expanded to DARCLP
GtoPdb stats from November release 2019.5
PubChem substances > references > author
PubMed substances > GtoPdb
GtoPdb > DARCP
GtoPdb > DARCP for 30 BACE1 lead inhibitors
Other Open sources of DARCP
DARCLP curation from US patents by BindingDB
• 225,000 compounds
• 406,000 activities
• 1000 targets
• 3000 patents
Statistics of open D-A-R-C-P
Intersecting open DARCP: Publications “D”
Intersecting open DARCP: Chemistry “C”
Intersecting open DARCP: Targets “P”
716 Data Sources merged into
Chemistry to Document (C-D, c2d) connectivity
Potential gateway to DARCP but with limitations
PubChem < > PubMed (2016 snapshot)
PubChem large-scale C-D submissions
• Generally a good thing (inc. 3 million patents) but with caveats
• Difficult to identify “aboutness” of key compounds
• Issues with indexing of non-PubMed DOI-only journal papers
• Quality issues of automated CNER chemistry extraction
• Introduces parallel (//) c2d mappings into PubChem
• Massive ‘futile indexing’ of common chemistry
Automated entity look-ups on the fly
from documents (including C-D)
• Being pushed in Europe PubMed Central via EBI database look-ups
• PubMed/PubMed Central via NCBI databases
• Can be a gateway to DARCP but specificity caveats
C-D Auto look-up ambiguities
Auto look-up ambiguities: sloppy synonyms
Auto look-up ambiguities: the wrong “bear”
Auto look-up in patents
Rounding off: so where do we go from here in
terms of open DARCP capture?
Will this make a difference?
• This should increase the flow of A,R,C,P from D into repositories
• However, whether this will also extend to D-A-R-C-P flowing into major
databases such as PubChem remains unclear
Proposed solution but a counsel of perfection
“Mandating authors to explicitly connect their own DARCLP results
in a form that is FAIR, extrinsic to PDF, structured, with metadata,
machine readable, ontologised, transferable to open database
records and reciprocally linked to publications” (Southan 2019)
• Authors should become their own biocurators
• Has been technically feasible for over a decade
• Even in 2020 not one single journal insists on authors providing
machine-readable DARCLP to flow into PubChem BioAssay
• Impediments include sociological factors and publishing models
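What "structured, with metadata, machine readable" could look like in practice can be sketched with a plain JSON record. This is an assumption-laden illustration, not a proposed standard or any database's submission format: the field names, the required-field list and the placeholder values are all hypothetical, and the sketch covers only the five D-A-R-C-P elements.

```python
import json

# Hypothetical minimum set of fields an author would have to supply.
REQUIRED_FIELDS = ("document", "assay", "result", "compound", "protein")


def to_machine_readable(record: dict) -> str:
    """Serialise one author-supplied record to JSON, refusing incomplete ones."""
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    if missing:
        raise ValueError("incomplete record, missing: " + ", ".join(missing))
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder values only; none of these identifiers are real.
author_record = {
    "document": "doi:10.0000/example",
    "assay": "enzyme inhibition, in vitro",
    "result": {"type": "IC50", "value": 12.0, "unit": "nM"},
    "compound": "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    "protein": "UNIPROT:EXAMPLE",
}

print(to_machine_readable(author_record))
```

The point of the validation step is that the connection is made explicit by the author at submission time, rather than being reconstructed from the PDF later by a curator.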
Conclusions
• The bioscience community (including big data miners) still has
its collective feet nailed to the floor by the five decades of
valuable DARCP entombed behind firewalls and buried in patents
• Biocuration makes a crucial contribution but is limited in scale
• Automated extraction is advancing (e.g. via NLP) but is way
behind the specificity of expert biocuration
• The existence of parallel (//) document <> chemistry systems (e.g. MeSH, IBM,
SureChEMBL, Springer Nature, Thieme, Wikidata) in PubChem,
and look-ups in EPMC, is enabling but also confusing
• The spread of Open Science ELNs is good to see but findability,
searchability and database submissions still need to be optimised
• The need remains to facilitate a flow of published (inc. preprints)
author-specified bioactive chemistry directly to databases (even if
the papers are FAIR)
Further info
Extras
Reciprocal links > virtuous circles (II)
• GtoMdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
Reciprocal links > virtuous circles (I)
• GtoPdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
O_S_M
OSM-S-363 data links (I)
OSM-S-363 (II)
OSM-S-363 (III)
OSM open data sheet
Next slide shows results of uploading 782 InChIKeys to PubChem
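One way a batch of InChIKeys could be checked against PubChem programmatically is via the PUG REST interface. This is not necessarily how the 782 keys were uploaded for the slide; the sketch below only validates the key format and builds the look-up URL, without making any network call. The function names are illustrative, and the example key is the standard InChIKey for aspirin.

```python
import re
from urllib.parse import quote

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def is_inchikey(text: str) -> bool:
    """True if the text has the layout of a standard InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def cid_lookup_url(inchikey: str) -> str:
    """Build the PUG REST URL that returns the PubChem CIDs matching one InChIKey."""
    key = inchikey.strip()
    if not is_inchikey(key):
        raise ValueError("not a valid InChIKey: " + repr(inchikey))
    return f"{PUG_REST}/compound/inchikey/{quote(key)}/cids/TXT"


def split_valid(keys):
    """Separate a list of candidate keys into well-formed and malformed ones."""
    valid = [k.strip() for k in keys if is_inchikey(k)]
    invalid = [k for k in keys if not is_inchikey(k)]
    return valid, invalid


print(cid_lookup_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Fetching each URL (for example with `urllib.request.urlopen`) and counting the keys that return at least one CID would give match statistics of the kind summarised on the next slide.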
Statistics of OSM PubChem matches
Emerging capture challenges for bioactivity
Chemistry disinterment from PDF tombs (II)
IUPAC name > structure
Disinterment from the PDF tomb (I)
Image extraction > structure
• Real chemists sketch images in a jiffy
• The rest of us can use OSRA: Optical Structure Recognition Application

More Related Content

What's hot

US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
FAIR Data and Model Management for Systems Biology (and SOPs too!)
FAIR Data and Model Management for Systems Biology(and SOPs too!)FAIR Data and Model Management for Systems Biology(and SOPs too!)
FAIR Data and Model Management for Systems Biology (and SOPs too!)
Carole Goble
 
Presentation of ChemSPider at PubChem Public Meeting
Presentation of ChemSPider at PubChem Public MeetingPresentation of ChemSPider at PubChem Public Meeting
Overview of SureChEMBL
Overview of SureChEMBLOverview of SureChEMBL
Overview of SureChEMBL
George Papadatos
 
ChemSpider Overview SLides August 2007
ChemSpider Overview SLides August 2007ChemSpider Overview SLides August 2007
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Non-targeted analysis supported by data and cheminformatics delivered via the...
Non-targeted analysis supported by data and cheminformatics delivered via the...Non-targeted analysis supported by data and cheminformatics delivered via the...
Non-targeted analysis supported by data and cheminformatics delivered via the...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
ChemAxon
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Materials Data Facility as Community Database to Share Nano-manufacturing Rec...
Materials Data Facility as Community Database to Share Nano-manufacturing Rec...Materials Data Facility as Community Database to Share Nano-manufacturing Rec...
Materials Data Facility as Community Database to Share Nano-manufacturing Rec...
Globus
 
The UK National Chemical Database Service – an integration of commercial and ...
The UK National Chemical Database Service – an integration of commercial and ...The UK National Chemical Database Service – an integration of commercial and ...
The UK National Chemical Database Service – an integration of commercial and ...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Giab workshop update mar2019
Giab workshop update mar2019Giab workshop update mar2019
Giab workshop update mar2019
GenomeInABottle
 
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP Project
Stuart Chalk
 
The influence of data curation on QSAR Modeling – examining issues of qualit...
 The influence of data curation on QSAR Modeling – examining issues of qualit... The influence of data curation on QSAR Modeling – examining issues of qualit...
The influence of data curation on QSAR Modeling – examining issues of qualit...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...
Sunghwan Kim
 
SureChEMBL patent annotations in Open PHACTS
SureChEMBL patent annotations in Open PHACTSSureChEMBL patent annotations in Open PHACTS
SureChEMBL patent annotations in Open PHACTS
George Papadatos
 
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 

What's hot (20)

US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
 
FAIR Data and Model Management for Systems Biology (and SOPs too!)
FAIR Data and Model Management for Systems Biology(and SOPs too!)FAIR Data and Model Management for Systems Biology(and SOPs too!)
FAIR Data and Model Management for Systems Biology (and SOPs too!)
 
Presentation of ChemSPider at PubChem Public Meeting
Presentation of ChemSPider at PubChem Public MeetingPresentation of ChemSPider at PubChem Public Meeting
Presentation of ChemSPider at PubChem Public Meeting
 
Overview of SureChEMBL
Overview of SureChEMBLOverview of SureChEMBL
Overview of SureChEMBL
 
ChemSpider Overview SLides August 2007
ChemSpider Overview SLides August 2007ChemSpider Overview SLides August 2007
ChemSpider Overview SLides August 2007
 
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...Data Review and Clean-Up Using Crowdsourced Input via the  US EPA CompTox Das...
Data Review and Clean-Up Using Crowdsourced Input via the US EPA CompTox Das...
 
Non-targeted analysis supported by data and cheminformatics delivered via the...
Non-targeted analysis supported by data and cheminformatics delivered via the...Non-targeted analysis supported by data and cheminformatics delivered via the...
Non-targeted analysis supported by data and cheminformatics delivered via the...
 
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL:...
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
Materials Data Facility as Community Database to Share Nano-manufacturing Rec...
Materials Data Facility as Community Database to Share Nano-manufacturing Rec...Materials Data Facility as Community Database to Share Nano-manufacturing Rec...
Materials Data Facility as Community Database to Share Nano-manufacturing Rec...
 
The UK National Chemical Database Service – an integration of commercial and ...
The UK National Chemical Database Service – an integration of commercial and ...The UK National Chemical Database Service – an integration of commercial and ...
The UK National Chemical Database Service – an integration of commercial and ...
 
Giab workshop update mar2019
Giab workshop update mar2019Giab workshop update mar2019
Giab workshop update mar2019
 
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
 
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP Project
 
The influence of data curation on QSAR Modeling – examining issues of qualit...
 The influence of data curation on QSAR Modeling – examining issues of qualit... The influence of data curation on QSAR Modeling – examining issues of qualit...
The influence of data curation on QSAR Modeling – examining issues of qualit...
 
Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...
 
SureChEMBL patent annotations in Open PHACTS
SureChEMBL patent annotations in Open PHACTSSureChEMBL patent annotations in Open PHACTS
SureChEMBL patent annotations in Open PHACTS
 
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
 

Similar to Connectivity > documents > structures > bioactivity

ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
Dr. Haxel Consult
 
FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCP
Chris Southan
 
The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
Dr. Haxel Consult
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
Chris Southan
 
20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse
Chris Southan
 
Patents in PubChem
Patents in PubChemPatents in PubChem
Patents in PubChem
Chris Southan
 
Progress in delivering transparency in research data
Progress in delivering transparency in research dataProgress in delivering transparency in research data
Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
Bioinformatics and Computational Biosciences Branch
 
Chemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopChemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the Desktop
Marcus Hanwell
 
Mining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataMining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity Data
Chris Southan
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistry
Sunghwan Kim
 
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
Chris Southan
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...
Michel Dumontier
 
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference DatabaseDevelopment of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
Nathan Olson
 
Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases
Chris Southan
 
From Open Access to Open Data
From Open Access to Open DataFrom Open Access to Open Data
From Open Access to Open Data
Brian Hole
 

Similar to Connectivity > documents > structures > bioactivity (20)

ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCP
 
The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
 
20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse
 
Patents in PubChem
Patents in PubChemPatents in PubChem
Patents in PubChem
 
Progress in delivering transparency in research data
Progress in delivering transparency in research dataProgress in delivering transparency in research data
Progress in delivering transparency in research data
 
Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
 
Chemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopChemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the Desktop
 
Mining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataMining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity Data
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistry
 
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
 
2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...
 
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference DatabaseDevelopment of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Database
 
Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...Big data challenges associated with building a national data repository for c...
Big data challenges associated with building a national data repository for c...
 
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
 
Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases
 
From Open Access to Open Data
From Open Access to Open DataFrom Open Access to Open Data
From Open Access to Open Data
 

More from Chris Southan

Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulations
Chris Southan
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2
Chris Southan
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updae
Chris Southan
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug Development
Chris Southan
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?
Chris Southan
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCP
Chris Southan
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteins
Chris Southan
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFER
Chris Southan
 
Connectivity > documents > structures > bioactivity

  • 1. An overview of connectivity between documents, structures and bioactivity Christopher Southan Presented at University of Copenhagen, Feb 2020 Host: David Gloriam
  • 2. Outline • D-A-R-C-P • Chemistry stats • Biocuration challenges • GtoPdb curation examples • Biocuration sources • Chemistry-to-document • Conclusions
  • 3. The chemistry <-> biology join • Chemistry “C” with significant bioactivity in vitro, in cellulo, in vivo, in clinico • Applicability to drug discovery, pharmacology, chemical biology and enzymology • Majority of primary quantitative data in papers and patent documents • Does not cover all nuances and complexities of molecular mechanisms of action, e.g. indirect or complex targets, prodrugs, cellular assays, covalent inhibitors, activators • D-A-R-C-P
  • 4. How much chemistry is out there?
  • 5. The “lost connectivity” problem “We have spent millions putting chemistry into PDFs but now we are spending millions more taking it back out” (Anon) Rough estimates of 50+ years of public legacy DARCP: • “D” ~ 200K papers, ~ 50K patents • “C” ~ 5 million structures • “P” ~ 4000 human proteins, ~ 2000 other species • Only a small proportion captured as DARCP in open sources • Quality would be a key issue if “everything” was extracted
  • 6. Disinterring DARCP (and linkable data in general) from document tombs is hard
  • 7. Unsung Heroes Impediments to artisanal extraction of DARCP by Biocurators • Entity disambiguation • Unintentional obfuscation and errors by journal authors • Occasional deliberate obfuscation by patent applicants • Activity parameters (IC50, EC50, Ki, Kd) can have ~ 10-fold variation between publications for nominally the same assays • Judging the reproducibility of the publications selected for extraction • Variable publisher guidelines for entity specification and reporting standards • Chemical structures often image-only • Key data buried in supplementary data • Limited author awareness of assay and target ontologies or gene naming • Poor sustainability of funding and career structures
  • 9. Commercial biocuration of DARCP Excelra (formerly GVKBIO) GOSTAR stats from 2015 • 1.3 million compounds from 112K papers (~ 15 per paper) • 3.5 million compounds from 70K patents (~ 50 per patent) • 3,882 human targets • 9 million bioactivities
  • 11. GtoPdb stats from November release 2019.5
  • 12. PubChem substances > references > author
  • 15. GtoPdb > DARCP for 30 BACE1 lead inhibitors
  • 17. DARCLP curation from US patents by BindingDB • 225,000 compounds • 406,000 activities • 1000 targets • 3000 patents
  • 19. Intersecting open DARCP: Publications “D”
  • 20. Intersecting open DARCP: Chemistry “C”
  • 21. Intersecting open DARCP: Targets “P”
  • 22. 716 Data Sources merged into
  • 23. Chemistry to Document (C-D, c2d) connectivity Potential gateway to DARCP but with limitations
  • 24. PubChem < > PubMed (2016 snapshot)
  • 25. PubChem large-scale C-D submissions • Generally a good thing (inc. 3 million patents) but with caveats • Difficult to identify “aboutness” of key compounds • Issues with indexing of non-PubMed DOI-only journal papers • Quality issues of automated CNER chemistry extraction • Introduces parallel c2d mappings into PubChem • Massive ‘futile indexing’ of common chemistry
  • 26. Automated entity look-ups on the fly from documents (including C-D) • Being pushed in Europe PubMed Central via EBI database look-ups • PubMed/PubMed Central via NCBI databases • Can be a gateway to DARCP but with specificity caveats
  • 27. C-D Auto look-up ambiguities
  • 28. Auto look-up ambiguities: sloppy synonyms
  • 29. Auto look-up ambiguities: the wrong “bear”
  • 30. Auto look-up in patents
  • 31. Rounding off: so where do we go from here in terms of open DARCP capture?
  • 32. Will this make a difference? • This should increase the flow of A,R,C,P from D into repositories • However, whether this will also extend to D-A-R-C-P into major databases such as PubChem remains unclear
  • 33. Proposed solution but a counsel of perfection “Mandating authors to explicitly connect their own DARCLP results in a form that is FAIR, extrinsic to PDF, structured, with metadata, machine readable, ontologised, transferable to open database records and reciprocally linked to publications” (Southan 2019) • Authors should become their own biocurators • Has been technically feasible for over a decade • Even in 2020 not one single journal insists on authors providing machine-readable DARCLP to flow into PubChem BioAssay • Impediments include sociological factors and publishing models
  • 34. Conclusions • The bioscience community (including big data miners) still has its collective feet nailed to the floor by the five decades of valuable DARCP entombed behind firewalls and buried in patents • Biocuration makes a crucial contribution but is limited in scale • Automated extraction is advancing (e.g. via NLP) but is way behind the specificity of expert biocuration • The existence of parallel document <> chemistry systems (e.g. MeSH, IBM, SureChEMBL, Springer Nature, Thieme, Wikidata) in PubChem, and look-ups in EPMC, are enabling but also confusing • The spread of Open Science ELNs is good to see but findability, searchability and database submissions still need to be optimised • The need remains to facilitate a flow of published (inc. preprint) author-specified bioactive chemistry direct to databases (even if the papers are FAIR)
  • 37. Reciprocal links > virtuous circles (II) • GtoMdb users can navigate “out” via PubChem or PubMed • NCBI users can navigate “in” via PubChem or PubMed
  • 38. Reciprocal links > virtuous circles (I) • GtoPdb users can navigate “out” via PubChem or PubMed • NCBI users can navigate “in” via PubChem or PubMed
  • 43. OSM open data sheet Next slide shows results of uploading 782 InChIKeys to PubChem
  • 44. Statistics of OSM PubChem matches
  • 45. Emerging capture challenges for bioactivity
  • 46. Chemistry disinterment from PDF tombs (II) IUPAC name > structure
  • 47. Disinterment from the PDF tomb (I) Image extraction > structure • Real chemists sketch images in a jiffy • The rest of us can use OSRA (Optical Structure Recognition Application)
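
The InChIKey upload mentioned on slide 43 can be approximated programmatically. The following is a minimal sketch, assuming the public PubChem PUG REST service is used for the look-up; the slides do not say which route was taken for the 782 keys, and the helper names (`is_inchikey`, `cid_lookup_url`, `parse_cids`, `match_statistics`) are hypothetical. The code only builds request URLs and parses responses, so fetching is left to whatever HTTP client is preferred.

```python
import json
import re
from urllib.parse import quote

# Base address of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text):
    """Return True if the text has the shape of an InChIKey."""
    return INCHIKEY_RE.fullmatch(text.strip()) is not None


def cid_lookup_url(inchikey):
    """Build the PUG REST URL that returns the CIDs matching one InChIKey."""
    key = inchikey.strip()
    if not is_inchikey(key):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return f"{PUG_REST}/compound/inchikey/{quote(key)}/cids/JSON"


def parse_cids(response_text):
    """Extract the CID list from a JSON response; empty list if no match."""
    payload = json.loads(response_text)
    return payload.get("IdentifierList", {}).get("CID", [])


def match_statistics(results):
    """Summarise a {inchikey: [cids]} mapping into matched/unmatched counts."""
    matched = sum(1 for cids in results.values() if cids)
    return {
        "submitted": len(results),
        "matched": matched,
        "unmatched": len(results) - matched,
    }
```

For a real batch, each URL would be fetched in turn, with pauses to respect PubChem's usage limits, and the parsed results collected into the mapping that `match_statistics` expects.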

Editor's Notes

  1. The simplest of starting points: at least the press release had a structure diagram. OSRA provides good starting points to edit and get SMILES. The structure does not have to be exactly right, because a database similarity match is OK to see what it should have been.
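
The point about an imperfect structure still being useful through a similarity match can be illustrated with a small sketch. It assumes the Tanimoto coefficient as the similarity measure, which is common for chemical fingerprints but is not named in the notes, and it treats fingerprints as plain sets of integer feature indices. Real fingerprints would come from a cheminformatics toolkit, and the function names here are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def best_match(query_bits, database):
    """Return (name, score) for the database entry most similar to the query.

    `database` maps a compound name to its fingerprint (a set of on-bits).
    """
    scored = ((name, tanimoto(query_bits, bits)) for name, bits in database.items())
    return max(scored, key=lambda pair: pair[1])
```

A slightly wrong structure from image recognition would typically share most of its features with the intended compound, so the intended compound should still score highest even though the score is below 1.0.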