[1]
Closing the gap between chemistry and
biology: Joining between text tombs and
databases
Presentation for Uppsala University Department of Neuroscience, Sept 2013
By Christopher Southan
Curator for IUPHARdb, http://www.guidetopharmacology.org/
Queen's Medical Research Institute, University of Edinburgh
Email: cdsouthan@hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIn: http://www.linkedin.com/in/cdsouthan
TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
Presentations: http://www.slideshare.net/cdsouthan
[2]
Abstract
• Progress in the biomedical sciences is critically dependent on explicit chemical structures
and bioactivity results described in text. This applies across drug discovery, pharmacology,
chemical biology, and metabolomics. However, the entombing of the majority of these
structures and associated data within patents, papers, abstracts and web pages has been a
major barrier to progress. This presentation introduces the current public information flow
from documents and its associated barriers, such as inadequate author specification of
structures, journal paywalls precluding text mining and the patchiness of MeSH chemistry
annotation for PubMed-to-PubChem connectivity. It then reviews trends that are lowering
these barriers. These include the Google merge of over 50 million InChIKeys from
PubChem, ChemSpider and UniChem, ChEMBL containing SAR for 0.8 million structures
from 50K medicinal chemistry papers, over 20 million abstracts in PubMed, and full-text
open patent chemistry in SureChemOpen bringing PubChem patent-extracted structures to
15 million. In addition, options such as Open Lab Books and figshare are expanding the
choices for surfacing new structures. Methods will be outlined for establishing
document-to-document and document-to-database links via chemical structures. These include the
PubChem toolbox, protein targets in UniProt, PubChem BioAssay, ChEMBL indexing in UK
PMC, SureChemOpen, chemicalize.org for text name-to-structure conversion, OSRA for
image-to-structure conversion, Venny for set comparisons and InChIKey searching in
Google [1]. Combined use of these approaches to make joins between patents, papers,
abstracts, chemical database entries, SAR data and drug target protein sequences will be
illustrated with recent novel antimalarial lead compounds, patent-only BACE2 inhibitors and
company code numbers in the NCATS repurposing list.
[3]
The Chem <-> Bio Join
• Chemistry that does something: drug discovery, drug development,
toxicology, pharmacology, systems chemical biology (probes), structural
biology, metabolomics, chemical ecology, etc.
• With the exception of some PubChem BioAssays, the majority of data is still
primarily archived in documents
[4]
Getting chemistry out of text is difficult
[5]
That’s why we used to have to pay
73 million
4,059,232
5.1 million
~ 20,000
[6]
The Chemical Representational Hextet:
Different usage between documents and databases
?
[7]
A recent NRDD article
• Just images and code numbers
• No PubChem or ChemSpider IDs
• No SMILES or InChIs
• No molfiles for download
• No links in or out
• No MeSH > PubChem substances
• Some cited sources might have IUPAC names
[8]
You can dig out structures from text for free:
- but it's hard work
[9]
What’s out there for free
• InChIKey in Google ~ 50 million
• PubChem = 48 million
• PubChem ROF + 250-800 Mw (lead-like) = 31 million
• ChemSpider = 28 million
• PubChem all docs (papers & patents) = 16 million
• PubChem patents = 15 million
• SureChemOpen = 14.5 million
• PubChem journal sources (PubMed + ChEMBL) = 1 million
[10]
Medicinal chemistry patents (tombs with lids off)
• WO, C07D = 72,737 (assignee vs. year plots below)
• ~ 50 novel structures with SAR per patent = ~ 3.5 million bioactives
• Paradoxically now completely open for chemistry or any mining
[11]
PubMed: ~ 10% with chemistry (guarded tombs)
“Free full text” = 575,513
(24%)
[12]
Growth: (escaping the tombs)
• Patent “big bang” (SureChem & SCRIPDB in 2012)
• Literature “slow burn” (ChEMBL 2009 jump)
• Paradox - patents:papers 15:1 (both sets of CIDs cumulative)
[13]
Databases <> structures <> documents:
links, but few reciprocal
Abstracts
Patents
Papers
15 mill
0.2 mill (mainly MeSH)
0.8 mill
(ChEMBL)
12K
[14]
Triaging document or webpage chemistry
• Identify the structure specification types, e.g.
– Semantic names (all sources)
– Code names (press releases, papers and abstracts)
– IUPAC names (papers, patents and abstracts)
– Images (papers, patents, & Google images)
– SMILES (open lab books)
– InChI strings (open lab books)
– SDF files (open lab books, & github)
Convert these to a structure (e.g. SDF, SMILES, InChI) then:
– Search InChIKey in Google
– Search major databases
– Search SureChemOpen
– Compare extracted sets for intersects and diffs
– Extend exact match connectivity with similarity searching
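As a rough illustration of the "Search InChIKey in Google" and "Search major databases" steps, the sketch below checks that a string has the layout of a standard InChIKey and builds two lookup URLs. It only constructs the URLs and does not call any service. The helper names are hypothetical, the PubChem URL follows the PUG REST pattern as I understand it, and the example key is the well-known one for aspirin, used purely as a placeholder rather than a structure from these slides.

```python
import re
from urllib.parse import quote_plus

# A standard InChIKey is 14 letters, a hyphen, 10 letters, a hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

# Placeholder example only: the InChIKey of aspirin.
EXAMPLE_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"


def is_inchikey(text: str) -> bool:
    """Return True if the text has the layout of a standard InChIKey."""
    return bool(INCHIKEY_RE.match(text.strip()))


def google_query_url(inchikey: str) -> str:
    """Build a Google search URL for the key as an exact quoted phrase."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey}"')


def pubchem_cid_url(inchikey: str) -> str:
    """Build a PubChem PUG REST URL that asks for CIDs matching the key."""
    base = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    return f"{base}/compound/inchikey/{inchikey}/cids/JSON"


if is_inchikey(EXAMPLE_KEY):
    print(google_query_url(EXAMPLE_KEY))
    print(pubchem_cid_url(EXAMPLE_KEY))
```

Checking the layout first avoids sending code names or truncated keys to a search that expects an exact match.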
[15]
Triage example: a new antimalarial
The MMV390048 code name is linked to an image in press reports
but has no match in PubChem or PubMed
[16]
Images: convert and search
Real chemists sketch them in a jiffy;
the rest of us can use OSRA: Optical Structure Recognition Application
(after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
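Once OSRA has produced a SMILES string, it has to be URL-encoded before it can be sent to a web search, because characters such as "=", "(" and ")" are not safe in a query string. The following is a minimal sketch that encodes the SMILES shown on this slide into a PubChem PUG REST lookup URL. It assumes the PUG REST form that takes the SMILES as a "smiles" query argument, the function names are hypothetical, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# SMILES copied from the slide (OSRA output after manual editing).
OSRA_SMILES = "CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3"


def smiles_to_cid_url(smiles: str) -> str:
    """Build a PUG REST URL asking for CIDs that match the SMILES."""
    query = urlencode({"smiles": smiles})  # percent-encodes =, ( and )
    return f"{PUG_REST}/compound/smiles/cids/JSON?{query}"


def smiles_from_url(url: str) -> str:
    """Recover the SMILES from a URL built above (round-trip check)."""
    return parse_qs(urlparse(url).query)["smiles"][0]


url = smiles_to_cid_url(OSRA_SMILES)
print(url)
print(smiles_from_url(url) == OSRA_SMILES)
```

The round-trip check is a cheap way to confirm that encoding has not altered the structure string before it is used in a search.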
[17]
Making connections:
image > structure > database > documents
CID 53311393 > ChEMBL > PubMed
SureChem or chemicalize.org > patent
[18]
Patent SAR from WO2011086531:
Collating activities via SureChemOpen
CID 53311393 >
[19]
Patent SAR results: top-20 from 39 IC50s
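Selecting the most potent entries from a manually collated activity table is a simple sort. The sketch below shows one way to take the top 20 from 39 IC50 values. The records are synthetic placeholders generated in the code, not the values from WO2011086531, and the field and function names are hypothetical.

```python
from typing import List, NamedTuple


class Activity(NamedTuple):
    example_id: str   # patent example label (hypothetical naming)
    ic50_nm: float    # IC50 in nM; lower means more potent


def top_n_by_potency(records: List[Activity], n: int = 20) -> List[Activity]:
    """Return the n records with the lowest IC50, most potent first."""
    return sorted(records, key=lambda r: r.ic50_nm)[:n]


# 39 synthetic placeholder records, deliberately listed weakest first.
records = [Activity(f"example_{i}", float(10 * i)) for i in range(39, 0, -1)]

top = top_n_by_potency(records, n=20)
for rank, rec in enumerate(top, start=1):
    print(rank, rec.example_id, rec.ic50_nm)
```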
[20]
Results > figshare
http://figshare.com/articles/Patent_SAR_for_MMV390048/657979
[21]
Structures > MyNCBI
http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/
[22]
SAR Table: iOS app from Molecular Materials Informatics
SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF ->
Dropbox -> SAR Table -> edit in data, R-group decompose -> share
[23]
InChIKey in Google: instant orthogonal joining
[24]
Chemicalize.org: 413 strucs from WO2011086531
CID 53311393 ->
[25]
Using OPSIN and chemicalize.org to fix
recalcitrant IUPACs from WO2011086532
Can quasi-manually extract ~ 10 more “split IUPAC” examples
[26]
Clustering document extraction sets: CheS-Mapper
WO2011086531 -> chemicalize.org -> 413 cpds download ->
CheS-Mapper -> cluster 8 -> export 53 cpds
[27]
PubChem -> ChEMBL -> PMID -> assay -> strucs
• CHEMBL2041980 (structure)
• PMID 22390538 (paper)
• CHEMBL2045642 (assay for 32 strucs from paper)
• The 32 CIDs all have patent matches
[28]
Venny: intersects, diffs, de-dupes and merges
1) WO2011086531 matches in PubChem
2) CheS-Mapper cluster 8 from WO2011086532
3) ChEMBL assayed cpds from PMID 22390538
(handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey)
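The same intersects, diffs, de-dupes and merges that Venny gives can be reproduced with plain set operations when the lists are already on disk. The sketch below uses three short lists of made-up identifiers standing in for the three sets on this slide; the identifiers and variable names are placeholders, not real CIDs.

```python
from typing import Iterable, Set


def dedupe(ids: Iterable[str]) -> Set[str]:
    """Strip whitespace, drop empty strings and remove duplicates."""
    return {s.strip() for s in ids if s.strip()}


# Placeholder identifiers; any regular strings (db IDs, SMILES, InChIKeys) work.
patent_matches = dedupe(["ID_A", "ID_B", "ID_C", "ID_C ", ""])
cluster_8 = dedupe(["ID_B", "ID_C", "ID_D"])
chembl_assayed = dedupe(["ID_C", "ID_E"])

in_all_three = patent_matches & cluster_8 & chembl_assayed    # intersect
patent_only = patent_matches - cluster_8 - chembl_assayed     # diff
merged = patent_matches | cluster_8 | chembl_assayed          # merge

print(sorted(in_all_three))
print(sorted(patent_only))
print(sorted(merged))
```

Exact string comparison only works if every list uses the same identifier type and canonical form, which is one reason InChIKeys are convenient for this kind of join.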
[29]
[30]
NCATS/MRC: the joy of codes with no structures
http://cdsouthan.blogspot.se/2012/09/mrc-22-vs-ncats-58-repurposing-lists.html
[31]
Code name-to-structure mapping:
Dig out the code names
PubChem Substance
PubChem Compound
PubMed/MeSH
Google Scholar
Google Images
Google open (filtered)
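"Dig out the code names" can be partly automated with a pattern match before the manual lookups in the sources listed above. The sketch below uses a hypothetical heuristic (two to five capital letters, an optional hyphen, then three to seven digits); real company codes vary more than this, so it will miss some and should be treated as a first pass only. Apart from MMV390048, the codes in the sample text are invented.

```python
import re
from typing import List

# Heuristic only: 2-5 capitals, optional hyphen, 3-7 digits (e.g. MMV390048).
CODE_RE = re.compile(r"\b[A-Z]{2,5}-?\d{3,7}\b")


def extract_codes(text: str) -> List[str]:
    """Return candidate code names in order of first appearance, de-duplicated.

    Hyphens are removed so that 'ABC-12345' and 'ABC12345' count as one code.
    """
    seen: List[str] = []
    for match in CODE_RE.findall(text):
        code = match.replace("-", "")
        if code not in seen:
            seen.append(code)
    return seen


sample = "MMV390048 and ABC-12345 were listed; MMV390048 again, dated 2013."
print(extract_codes(sample))
```

Patent numbers such as WO2011086531 and ChEMBL IDs are not picked up by this pattern, because their letter and digit runs fall outside the ranges it allows.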
[32]
Sometimes the system works
[33]
PubMed > ChEMBL
[34]
Sometimes you get missing and cryptic links
[35]
NVP-Bxd552: Google results
[36]
BACE2: Almost no chemistry in papers
[37]
BACE2
1. WO2013054291 > chemicalize.org
2. Download 450 structures
3. Upload to PubChem search
[38]
Scibite > Alerts for new chemistry
[39]
Conclusions
• The ability to extract chemical structures from text and web sources
has been transformed by an expansion of the public toolbox
• The PubChem “big bang” increases the probability that extracted
structures have exact or similarity matches in databases
• Paradoxically, the patent corpus is now completely open while access
to journal text is still restricted
• However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and
target-mapped structures from ~ 50K papers
• The submission of ~ 15 mill. patent structures to PubChem ensures at
least some representation from the majority of medicinal chemistry patents
(many of which spawned the subsequent ChEMBL papers)
• Those who want to share their structures globally (e.g. OSDD) have an
expanding set of options for surfacing their results.
[40]
References

More Related Content

What's hot

Mercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_frameworkMercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_frameworkBOSC 2010
 
Math 225-spring-2012
Math 225-spring-2012Math 225-spring-2012
Math 225-spring-2012Bruce Slutsky
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research DataRoss Mounce
 
Making data sticky
Making data stickyMaking data sticky
Making data stickyRoderic Page
 
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS p...
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS p...Scientific Lenses over Linked Data: Identity Management in the Open PHACTS p...
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS p...Alasdair Gray
 
Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themRoss Mounce
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyChris Southan
 
Data exchange alternatives, GIGA TAG (2009)
Data exchange alternatives, GIGA TAG (2009)Data exchange alternatives, GIGA TAG (2009)
Data exchange alternatives, GIGA TAG (2009)Dag Endresen
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureRoss Mounce
 
Publication of raw and curated NMR spectroscopic data for organic molecules
Publication of raw and curated NMR spectroscopic data for organic moleculesPublication of raw and curated NMR spectroscopic data for organic molecules
Publication of raw and curated NMR spectroscopic data for organic moleculesChristoph Steinbeck
 
dkNET Poster ENDO 2019
dkNET Poster ENDO 2019dkNET Poster ENDO 2019
dkNET Poster ENDO 2019dkNET
 
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)Dag Endresen
 
Connecting antimalarial data
Connecting antimalarial dataConnecting antimalarial data
Connecting antimalarial dataChris Southan
 
Connecting the dots: drug information and Linked Data
Connecting the dots: drug information and Linked DataConnecting the dots: drug information and Linked Data
Connecting the dots: drug information and Linked DataTomasz Adamusiak
 
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)Dag Endresen
 

What's hot (20)

Mercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_frameworkMercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_framework
 
Math 225-spring-2012
Math 225-spring-2012Math 225-spring-2012
Math 225-spring-2012
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research Data
 
Making data sticky
Making data stickyMaking data sticky
Making data sticky
 
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS p...
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS p...Scientific Lenses over Linked Data: Identity Management in the Open PHACTS p...
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS p...
 
Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on them
 
Connecting Chemists To The Internet Training at Burlington House 2010
Connecting Chemists To The Internet Training at Burlington House 2010Connecting Chemists To The Internet Training at Burlington House 2010
Connecting Chemists To The Internet Training at Burlington House 2010
 
PubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biologyPubChem for drug discovery and chemical biology
PubChem for drug discovery and chemical biology
 
Data exchange alternatives, GIGA TAG (2009)
Data exchange alternatives, GIGA TAG (2009)Data exchange alternatives, GIGA TAG (2009)
Data exchange alternatives, GIGA TAG (2009)
 
ChemSpider – An Online Database and Registration System Linking the Web
ChemSpider – An Online Database and  Registration System Linking the WebChemSpider – An Online Database and  Registration System Linking the Web
ChemSpider – An Online Database and Registration System Linking the Web
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
Publication of raw and curated NMR spectroscopic data for organic molecules
Publication of raw and curated NMR spectroscopic data for organic moleculesPublication of raw and curated NMR spectroscopic data for organic molecules
Publication of raw and curated NMR spectroscopic data for organic molecules
 
How a Structure-Centric Community for Chemists Can Benefit Drug Discovery - V...
How a Structure-Centric Community for Chemists Can Benefit Drug Discovery - V...How a Structure-Centric Community for Chemists Can Benefit Drug Discovery - V...
How a Structure-Centric Community for Chemists Can Benefit Drug Discovery - V...
 
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
 
dkNET Poster ENDO 2019
dkNET Poster ENDO 2019dkNET Poster ENDO 2019
dkNET Poster ENDO 2019
 
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
 
Connecting antimalarial data
Connecting antimalarial dataConnecting antimalarial data
Connecting antimalarial data
 
Connecting the dots: drug information and Linked Data
Connecting the dots: drug information and Linked DataConnecting the dots: drug information and Linked Data
Connecting the dots: drug information and Linked Data
 
OpenTox Europe 2013
OpenTox Europe 2013OpenTox Europe 2013
OpenTox Europe 2013
 
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)
EURISCO and GBIF IPT, at the Vavilov Institute in St Petersburg (27 April 2010)
 

Similar to Closing the gap between chemistry and biology: Joining between text tombs and databases

Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Chris Southan
 
Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Sean Ekins
 
A Global Commons for Scientific Data: Molecules and Wikidata
A Global Commons for Scientific Data: Molecules and WikidataA Global Commons for Scientific Data: Molecules and Wikidata
A Global Commons for Scientific Data: Molecules and Wikidatapetermurrayrust
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...Dr. Haxel Consult
 
Mining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataMining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataChris Southan
 
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...ChemAxon
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityChris Southan
 
Chemoinformatic File Format.pptx
Chemoinformatic File Format.pptxChemoinformatic File Format.pptx
Chemoinformatic File Format.pptxwadhava gurumeet
 
20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horseChris Southan
 
Digging out Structures for Repurposing: Non-competitive Intelligence ...
Digging out Structures for Repurposing: Non-competitive Intelligence        ...Digging out Structures for Repurposing: Non-competitive Intelligence        ...
Digging out Structures for Repurposing: Non-competitive Intelligence ...Chris Southan
 
Overview of cheminformatics
Overview of cheminformaticsOverview of cheminformatics
Overview of cheminformaticsBenjamin Bucior
 
100505 koenig biological_databases
100505 koenig biological_databases100505 koenig biological_databases
100505 koenig biological_databasesMeetika Gupta
 
2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...Michel Dumontier
 
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsChris Southan
 
MADICES Mungall 2022.pptx
MADICES Mungall 2022.pptxMADICES Mungall 2022.pptx
MADICES Mungall 2022.pptxChris Mungall
 

Similar to Closing the gap between chemistry and biology: Joining between text tombs and databases (20)

Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases Connecting Bioactive Chemistry Across Documents and Databases
Connecting Bioactive Chemistry Across Documents and Databases
 
Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...
 
A Global Commons for Scientific Data: Molecules and Wikidata
A Global Commons for Scientific Data: Molecules and WikidataA Global Commons for Scientific Data: Molecules and Wikidata
A Global Commons for Scientific Data: Molecules and Wikidata
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
Mining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataMining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity Data
 
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivity
 
Chemoinformatic File Format.pptx
Chemoinformatic File Format.pptxChemoinformatic File Format.pptx
Chemoinformatic File Format.pptx
 
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
 
20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse
 
Digging out Structures for Repurposing: Non-competitive Intelligence ...
Digging out Structures for Repurposing: Non-competitive Intelligence        ...Digging out Structures for Repurposing: Non-competitive Intelligence        ...
Digging out Structures for Repurposing: Non-competitive Intelligence ...
 
Overview of cheminformatics
Overview of cheminformaticsOverview of cheminformatics
Overview of cheminformatics
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
Is 20TB really Big Data?
Is 20TB really Big Data?Is 20TB really Big Data?
Is 20TB really Big Data?
 
100505 koenig biological_databases
100505 koenig biological_databases100505 koenig biological_databases
100505 koenig biological_databases
 
Patents in PubChem
Patents in PubChemPatents in PubChem
Patents in PubChem
 
2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...
 
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
 
MADICES Mungall 2022.pptx
MADICES Mungall 2022.pptxMADICES Mungall 2022.pptx
MADICES Mungall 2022.pptx
 
Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...
 

More from Chris Southan

FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCPChris Southan
 
Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulationsChris Southan
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Chris Southan
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeChris Southan
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentChris Southan
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Chris Southan
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCPChris Southan
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteinsChris Southan
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFERChris Southan
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databasesChris Southan
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology Chris Southan
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 posterChris Southan
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagensChris Southan
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand upChris Southan
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide TribulationsChris Southan
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRChris Southan
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology updateChris Southan
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProtChris Southan
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbChris Southan
 
Pub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityPub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityChris Southan
 

More from Chris Southan (20)

FAIR connectivity for DARCP
FAIR  connectivity for DARCPFAIR  connectivity for DARCP
FAIR connectivity for DARCP
 
Peptide tribulations
Peptide tribulationsPeptide tribulations
Peptide tribulations
 
Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2 Vicissitudes of target validation for BACE1 and BACE2
Vicissitudes of target validation for BACE1 and BACE2
 
Guide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updaeGuide to Pharmacology database: ELIXIR updae
Guide to Pharmacology database: ELIXIR updae
 
In silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug DevelopmentIn silico 360 Analysis for Drug Development
In silico 360 Analysis for Drug Development
 
Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?Will the correct BACE ORFs please stand up?
Will the correct BACE ORFs please stand up?
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCP
 
Seeking glimmers of light in Pharos “Tdark” proteins
Seeking glimmers of light in  Pharos “Tdark” proteinsSeeking glimmers of light in  Pharos “Tdark” proteins
Seeking glimmers of light in Pharos “Tdark” proteins
 
5HT2A modulators update for SAFER
5HT2A modulators update for SAFER5HT2A modulators update for SAFER
5HT2A modulators update for SAFER
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
 
Connecting chemistry-to-biology
Connecting chemistry-to-biology Connecting chemistry-to-biology
Connecting chemistry-to-biology
 
GtoPdb June 2019 poster
GtoPdb June 2019 posterGtoPdb June 2019 poster
GtoPdb June 2019 poster
 
PubChem as a source of systems biology perturbagens
PubChem as a source of  systems biology perturbagensPubChem as a source of  systems biology perturbagens
PubChem as a source of systems biology perturbagens
 
Will the real proteins please stand up
Will the real proteins please stand upWill the real proteins please stand up
Will the real proteins please stand up
 
Peptide Tribulations
Peptide TribulationsPeptide Tribulations
Peptide Tribulations
 
Looking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIRLooking at chemistry - protein - papers connectivity in ELIXIR
Looking at chemistry - protein - papers connectivity in ELIXIR
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology update
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProt
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
Pub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityPub Med to PubChem Connectivity
Pub Med to PubChem Connectivity
 

Closing the gap between chemistry and biology: Joining between text tombs and databases

  • 1. [1] Closing the gap between chemistry and biology: Joining between text tombs and databases Presentation for Uppsala University Department of Neuroscience, Sept 2013 By Christopher Southan Curator for IUPHARdb, http://www.guidetopharmacology.org/ Queen's Medical Research Institute, University of Edinburgh Email: cdsouthan@hotmail.com Twitter: http://twitter.com/#!/cdsouthan Blog: http://cdsouthan.blogspot.com/ LinkedIn: http://www.linkedin.com/in/cdsouthan TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications Presentations: http://www.slideshare.net/cdsouthan
  • 2. [2] Abstract • Progress in the biomedical sciences is critically dependent on explicit chemical structures and bioactivity results described in text. This applies across drug discovery, pharmacology, chemical biology, and metabolomics. However, the entombing of the majority of these structures and associated data within patents, papers, abstracts and web pages has been a major barrier to progress. This presentation introduces the current public information flow from documents and its associated barriers, such as inadequate author specification of structures, journal pay walls precluding text mining and the patchiness of MeSH chemistry annotation for PubMed-to-PubChem connectivity. It then reviews trends that are lowering these barriers. These include the Google merge of over 50 million InChIKey(s) from PubChem, ChemSpider and UniChem, ChEMBL containing SAR for 0.8 million structures from 50K medicinal chemistry papers, over 20 million abstracts in PubMed, and full-text open patent chemistry in SureChemOpen bringing PubChem patent-extracted structures to 15 million. In addition, options such as Open Lab Books and figshare are expanding the choices for surfacing new structures. Methods will be outlined for establishing document-to-document and document-to-database links via chemical structures. These include the PubChem toolbox, protein targets in UniProt, PubChem BioAssay, ChEMBL indexing in UK PMC, SureChemOpen, chemicalize.org for text name-to-structure conversion, OSRA for image-to-structure conversion, Venny for set comparisons and InChIKey searching in Google [1]. Combined use of these approaches to make joins between patents, papers, abstracts, chemical database entries, SAR data and drug target protein sequences will be illustrated with recent novel antimalarial lead compounds, patent-only BACE2 inhibitors and company code numbers in the NCATS repurposing list.
  • 3. [3] The Chem < - > Bio Join • Chemistry that does something: drug discovery, drug development, toxicology, pharmacology, systems chemical biology (probes), structural biology, metabolomics, chemical ecology, etc. • With the exception of some PubChem Bioassays, the majority of data is still primarily archived in documents
  • 4. [4] Getting chemistry out of text is difficult
  • 5. [5] That’s why we used to have to pay 73 million 4,059,232 5.1 million ~ 20,000
  • 6. [6] The Chemical Representational Hextet: Different usage between documents and databases ?
  • 7. [7] A recent NRDD article • Just images and code numbers • No PubChem or ChemSpider IDs • No SMILES or InChIs • No molfiles for download • No links in or out • No MeSH > PubChem substances • Some cited sources might have IUPAC names
  • 8. [8] You can dig out structures from text for free, but it's hard work
  • 9. [9] What’s out there for free
      • InChIKey in Google ~ 50 million
      • PubChem = 48 million
      • PubChem ROF + 250-800 Mw (lead-like) = 31 million
      • ChemSpider = 28 million
      • PubChem all docs (papers & patents) = 16 million
      • PubChem patents = 15 million
      • SureChemOpen = 14.5 million
      • PubChem journal sources (PubMed + ChEMBL) = 1 million
  • 10. [10] Medicinal chemistry patents (tombs with lids off) • WO, C07D = 72,737 (assignee vs. year plots below) • ~ 50 novel structures with SAR per patent = ~ 3.5 million bioactives • Paradoxically now completely open for chemistry or any mining
  • 11. [11] PubMed: ~ 10% with chemistry (guarded tombs) “Free full text” = 575,513 (24%)
  • 12. [12] Growth: (escaping the tombs) • Patent “big bang” (SureChem & SCRIPDB in 2012) • Literature “slow burn” (ChEMBL 2009 jump) • Paradox - patents:papers 15:1 (both sets of CIDs cumulative)
  • 13. [13] Databases <> structures < > documents: links, but few reciprocal Abstracts Patents Papers 15 mill 0.2 mill (mainly MeSH) 0.8 mill (ChEMBL) 12K
  • 14. [14] Triaging document or webpage chemistry
      • Identify the structure specification types, e.g.
          – Semantic names (all sources)
          – Code names (press releases, papers and abstracts)
          – IUPAC names (papers, patents and abstracts)
          – Images (papers, patents, & Google images)
          – SMILES (open lab books)
          – InChI strings (open lab books)
          – SDF files (open lab books, & github)
      • Convert these to a structure (e.g. SDF, SMILES, InChI), then:
          – Search InChIKey in Google
          – Search major databases
          – Search SureChemOpen
          – Compare extracted sets for intersects and diffs
          – Extend exact match connectivity with similarity searching
  • 15. [15] Triage example: a new antimalarial The MMV390048 code name is linked to an image in press reports but is negative in PubChem and PubMed
  • 16. [16] Images: convert and search Real chemists sketch them in a jiffy; the rest of us can use OSRA: Optical Structure Recognition Application (after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
  • 17. [17] Making connections: image > structure > database > documents CID 53311393 > ChEMBL > PubMed SureChem or chemicalize.org > patent
  • 18. [18] Patent SAR from WO2011086531: Collating activities via SureChemOpen CID 53311393 >
  • 19. [19] Patent SAR results: top-20 from 39 IC50s
  • 22. [22] SAR Table: iOS app from Molecular Materials Informatics SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF -> Dropbox -> SAR Table -> edit in data, R-group decompose -> share
  • 23. [23] InChIKey in Google: instant orthogonal joining
  • 24. [24] Chemicalize.org: 413 strucs from WO2011086531 CID 53311393 ->
  • 25. [25] Using OPSIN and chemicalize.org to fix recalcitrant IUPACs from WO2011086532 Can quasi-manually extract ~ 10 more “split IUPAC” examples
  • 26. [26] Clustering document extraction sets: CheS-Mapper WO2011086531 -> chemicalize.org -> 413 cpds download -> CheS-Mapper -> cluster 8 -> export 53 cpds
  • 27. [27] PubChem -> ChEMBL -> PMID -> assay -> strucs • CHEMBL2041980 (structure) • PMID 22390538 (paper) • CHEMBL2045642 (assay for 32 strucs from paper) • The 32 CIDs all have patent matches
  • 28. [28] Venny: intersects, diffs, de-dupes and merges 1) WO2011086531 matches in PubChem 2) CheS-Mapper cluster 8 from WO2011086532 3) ChEMBL assayed cpds from PMID 22390538 (handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey)
  • 29. [29]
  • 30. [30] NCATS/MRC: the joy of codes with no structures http://cdsouthan.blogspot.se/2012/09/mrc-22-vs-ncats-58-repurposing-lists.html
  • 31. [31] Code name-to-structure mapping: Dig out the code names PubChem Substance PubChem Compound PubMed/MeSH Google Scholar Google Images Google open (filtered)
  • 34. [34] Sometimes you get missing and cryptic links
  • 36. [36] BACE2: Almost no chemistry in papers
  • 37. [37] BACE2 1. WO2013054291 > chemicalize.org 2. Download 450 structures 3. Upload to PubChem search
  • 38. [38] Scibite > Alerts for new chemistry
  • 39. [39] Conclusions
      • The ability to extract chemical structures from text and web sources has been transformed by an expansion of the public toolbox
      • The PubChem big-bang increases probability of extraction having database exact or similarity matches
      • Paradoxically, the patent corpus is now completely open while access to journal text is still restricted
      • However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target mapped structures from ~ 50K papers
      • The submission of ~15 mill. patent structures to PubChem ensures at least representation from the majority of medicinal chemistry patents (many of which spawned the subsequent ChEMBL papers)
      • Those who want to share their structures globally (e.g. OSDD) have an expanding set of options for surfacing their results.
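
The first triage step on slide 14 (work out which kind of structure specification you are holding before picking a tool) and the "InChIKey in Google" search can be sketched roughly as below. This is only an illustration, not the workflow used in the talk: the function names `classify_spec` and `google_query_url` are invented for this sketch, the key shown is a dummy placeholder rather than a real compound, and anything that is not an InChI string or an InChIKey is simply labelled "other" because names, codes and SMILES cannot be told apart reliably by pattern alone.

```python
import re
from urllib.parse import quote_plus

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, joined by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def classify_spec(text: str) -> str:
    """Return 'inchi', 'inchikey' or 'other' for one extracted string.

    Hypothetical helper: 'other' covers code names, IUPAC names and SMILES,
    which need a conversion tool before they can be searched as structures.
    """
    s = text.strip()
    if s.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(s):
        return "inchikey"
    return "other"


def google_query_url(inchikey: str) -> str:
    """Build an exact-phrase web search URL for an InChIKey."""
    key = inchikey.strip()
    if not INCHIKEY_RE.match(key):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    # Double quotes request an exact match on the whole key.
    return "https://www.google.com/search?q=" + quote_plus(f'"{key}"')


if __name__ == "__main__":
    placeholder_key = "AAAAAAAAAAAAAA-BBBBBBBBBB-C"  # dummy, not a real compound
    for item in ["InChI=1S/CH4/h1H4", placeholder_key, "MMV390048"]:
        print(item, "->", classify_spec(item))
    print(google_query_url(placeholder_key))
```

A code name such as MMV390048 falls into "other", which matches the point of the slides: it has to be resolved to a structure (via an image, a name or a patent) before any structure-based join is possible.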

Editor's Notes

  1. InChIKeys: estimate of PubChem + ChemSpider in Google, but PubChem currently has a backlog for Key scraping. The ROF + 250-800 is a very approximate circumscription of the property space that has some possibility of bioactivity. Probably a proportion of vendor structures may have never been committed to text. There are some virtuals “out there”, including some patent extractions, but they are difficult to estimate.
  2. Note the WO/PCT queries are non-redundant in the patent family sense. The medicinal chemistry corpus is actually quite small. Note the big pharma patent decline post-2008. The average number of exemplified cpds with activity data per patent (family) is unknown, but GVK's curation average is ~ 50.
  3. Using the top-level MeSH term as a filter for “PubMeds with some chemistry”. Free full text is ~ ¼, but there are a lot of biological journals in this set.
  4. Note that cumulative plots include an element of back-mapping, i.e. the 2005 matches are to the 2013 total, not just the 2005 documents.
  5. Only Nature Chemical Biology and Nature Chemistry have direct links from the journal document to PubChem. Given today's technology, the major patent offices could put links in the PDFs but are unlikely to do so.
  6. Need to assess what representational types are being used in the document, e.g. some patents are image-only (but SureChem is pulling most of these out). Then select tools and sources for the job. Decide how to store your structures locally. The default batch search is an upload to PubChem. The default individual search is the InChIKey against Google.
  7. Self-explanatory. Note my blog post was indexed.
  8. The simplest of starting points: at least the press release had a structure diagram. OSRA provides good starting points to edit and get SMILES. The structure does not have to be exactly right because a database similarity match is OK to see what it should have been.
  9. SMILES from the image hits the CID in PubChem. This links to patents via SureChem and chemicalize.org. ChEMBL provides a link to the paper. Note none of these sources have MMV390048 as a synonym, so all the connections are via structure.
  10. We can start off with patent links. Note in this case numbered image capture, as opposed to the IUPAC listing, was important to manually collate the structure against the correct IC50.
  11. From manual cross-checking between the individual example structures and the IC50 table, the Excel sheet can be populated.
  12. Useful way to share results that is citable. Indexed in Google but no live links in Excel sheet (yet).
  13. Can upload CID lists and download as a saved and public collection
  14. This is the Pistoia / Alex Clark SAR Table app. Dropped the CIDs out of PubChem into Dropbox and picked them up on the iPad. Nice, but it would be good to automate the decomposition.
  15. InChIKey search picks up instantly. This was just a choice of one of the actives. So this connects PubChem and figshare.
  16. The CID links straight through to chemicalize.org and will just re-extract the whole patent in a few seconds. The 413 gave 358 hits in PubChem.
  17. IUPAC names have a lot of usage variants and OCR mistakes. Typically gaps, line breaks, l instead of 1 and missing brackets. OPSIN is good for indicating where the break is. This can then be fixed for a series in chemicalize.org.
  18. Total extractions from patents can include a lot of low Mw common reagent chemistry. CheS-Mapper display makes it easy to pick out clusters of lead-like compounds. Clusters can then be downloaded. Flexibility is high because document sets can be split or merged at the input stage.
  19. ChEMBL extracts structure and data. Can't actually select a set of cpds via the PubMed ID, but can via the assay ID that is usually unique to that paper. In this case we got 32 structures, all of which came from that patent.
  20. Very useful utility for any kind of set operations, e.g. sets of extractions. Total flexibility, e.g. intersecting patents and papers with extractions from abstract sets. Sets can be de-duplicated and merged from multiple sets (e.g. 10 patent extractions in one box). Can combine with selected downloaded database records.
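
The set operations that the notes attribute to Venny (intersects, diffs, de-dupes and merges over any regular strings such as database IDs, SMILES or InChIKeys) can be approximated with plain Python sets. The sketch below is not Venny itself; the helper names `dedupe` and `compare` and the identifiers in the example are placeholders invented for illustration, not real extraction results.

```python
def dedupe(ids):
    """Strip whitespace, drop blanks and repeats, keep first-seen order."""
    seen = set()
    out = []
    for raw in ids:
        item = raw.strip()
        if item and item not in seen:
            seen.add(item)
            out.append(item)
    return out


def compare(named_lists):
    """Return the intersect, merge and per-list unique members.

    named_lists maps a label (e.g. 'patent') to a list of identifier
    strings; at least one list is required.
    """
    if not named_lists:
        raise ValueError("need at least one list")
    sets = {name: set(dedupe(values)) for name, values in named_lists.items()}
    common = set.intersection(*sets.values())
    merged = set().union(*sets.values())
    unique = {}
    for name, members in sets.items():
        # Everything found in any of the other lists.
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = members - others
    return {"common": common, "merged": merged, "unique": unique}


if __name__ == "__main__":
    # Placeholder identifiers standing in for three extraction sets.
    example = {
        "patent": ["ID_A", "ID_B", "ID_C", "ID_C", " "],
        "cluster": ["ID_B", "ID_C", "ID_D"],
        "paper": ["ID_C", "ID_E"],
    }
    result = compare(example)
    print("in all sets:", sorted(result["common"]))
    print("merged:", sorted(result["merged"]))
    for name, members in result["unique"].items():
        print(f"only in {name}:", sorted(members))
```

Exact string matching only works if every list uses the same identifier type, so in practice the sets would be normalised to one form (for example InChIKeys) before comparison.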