Closing the gap between chemistry and biology: Joining between text tombs and databases

Progress in the biomedical sciences is critically dependent on explicit chemical structures and bioactivity results described in text. This applies across drug discovery, pharmacology, chemical biology, and metabolomics. However, the entombing of the majority of these structures and associated data within patents, papers, abstracts and web pages has been a major barrier to progress. This presentation introduces the current public information flow from documents and its associated barriers, such as inadequate author specification of structures, journal paywalls precluding text mining and the patchiness of MeSH chemistry annotation for PubMed-to-PubChem connectivity. It then reviews trends that are lowering these barriers. These include the Google merge of over 50 million InChIKeys from PubChem, ChemSpider and UniChem, ChEMBL containing SAR for 0.8 million structures from 50K medicinal chemistry papers, over 20 million abstracts in PubMed, and full-text open patent chemistry in SureChemOpen bringing PubChem patent-extracted structures to 15 million. In addition, options such as Open Lab Books and figshare are expanding the choices for surfacing new structures. Methods will be outlined for establishing document-to-document and document-to-database links via chemical structures. These include the PubChem toolbox, protein targets in UniProt, PubChem BioAssay, ChEMBL indexing in UK PMC, SureChemOpen, chemicalize.org for text name-to-structure conversion, OSRA for image-to-structure conversion, Venny for set comparisons and InChIKey searching in Google [1]. Combined use of these approaches to make joins between patents, papers, abstracts, chemical database entries, SAR data and drug target protein sequences will be illustrated with recent novel antimalarial lead compounds, patent-only BACE2 inhibitors and company code numbers in the NCATS repurposing list.

Published in: Technology, Education

  1. 1. [1] Closing the gap between chemistry and biology: Joining between text tombs and databases Presentation for Uppsala University Department of Neuroscience, Sept 2013 By Christopher Southan Curator for IUPHARdb, http://www.guidetopharmacology.org/ Queen's Medical Research Institute, University of Edinburgh Email: cdsouthan@hotmail.com Twitter: http://twitter.com/#!/cdsouthan Blog: http://cdsouthan.blogspot.com/ LinkedIn: http://www.linkedin.com/in/cdsouthan TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications Presentations: http://www.slideshare.net/cdsouthan
  2. 2. [2] Abstract • Progress in the biomedical sciences is critically dependent on explicit chemical structures and bioactivity results described in text. This applies across drug discovery, pharmacology, chemical biology, and metabolomics. However, the entombing of the majority of these structures and associated data within patents, papers, abstracts and web pages has been a major barrier to progress. This presentation introduces the current public information flow from documents and its associated barriers, such as inadequate author specification of structures, journal paywalls precluding text mining and the patchiness of MeSH chemistry annotation for PubMed-to-PubChem connectivity. It then reviews trends that are lowering these barriers. These include the Google merge of over 50 million InChIKeys from PubChem, ChemSpider and UniChem, ChEMBL containing SAR for 0.8 million structures from 50K medicinal chemistry papers, over 20 million abstracts in PubMed, and full-text open patent chemistry in SureChemOpen bringing PubChem patent-extracted structures to 15 million. In addition, options such as Open Lab Books and figshare are expanding the choices for surfacing new structures. Methods will be outlined for establishing document-to-document and document-to-database links via chemical structures. These include the PubChem toolbox, protein targets in UniProt, PubChem BioAssay, ChEMBL indexing in UK PMC, SureChemOpen, chemicalize.org for text name-to-structure conversion, OSRA for image-to-structure conversion, Venny for set comparisons and InChIKey searching in Google [1]. Combined use of these approaches to make joins between patents, papers, abstracts, chemical database entries, SAR data and drug target protein sequences will be illustrated with recent novel antimalarial lead compounds, patent-only BACE2 inhibitors and company code numbers in the NCATS repurposing list.
  3. 3. [3] The Chem < - > Bio Join • Chemistry that does something: drug discovery, drug development, toxicology, pharmacology, systems chemical biology (probes), structural biology, metabolomics, chemical ecology, etc. • With the exception of some PubChem BioAssays, the majority of data is still primarily archived in documents
  4. 4. [4] Getting chemistry out of text is difficult
  5. 5. [5] That’s why we used to have to pay 73 million 4,059,232 5.1 million ~ 20,000
  6. 6. [6] The Chemical Representational Hextet: Different usage between documents and databases ?
  7. 7. [7] A recent NRDD article • Just images and code numbers • No PubChem or ChemSpider IDs • No SMILES or InChIs • No molfiles for download • No links in or out • No MeSH > PubChem substances • Some cited sources might have IUPAC names
  8. 8. [8] You can dig out structures from text for free: - but it's hard work
  9. 9. [9] What’s out there for free • InChIKey in Google ~ 50 million • PubChem = 48 million • PubChem ROF + 250-800 Mw (lead-like) = 31 million • ChemSpider = 28 million • PubChem all docs (papers & patents) = 16 million • PubChem patents = 15 million • SureChemOpen = 14.5 million • PubChem journal sources (PubMed + ChEMBL) = 1 million
  10. 10. [10] Medicinal chemistry patents (tombs with lids off) • WO, C07D = 72,737 (assignee vs. year plots below) • ~ 50 novel structures with SAR per patent = ~ 3.5 million bioactives • Paradoxically now completely open for chemistry or any mining
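
The estimate on slide 10 is a simple product, and a quick check may help readers see where the figure comes from. The sketch below uses only the two numbers quoted on the slide; the variable names are illustrative. The exact product is a little over 3.6 million, which is the same order as the rounded "~ 3.5 million" on the slide.

```python
# Both inputs are quoted on slide 10; the variable names are assumptions.
patents = 72_737             # WO patents classified under C07D
structures_per_patent = 50   # approximate novel structures with SAR per patent

estimated_bioactives = patents * structures_per_patent

print(f"{estimated_bioactives:,} structures")          # exact product
print(f"~{estimated_bioactives / 1e6:.1f} million")    # rounded for comparison
```
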
  11. 11. [11] PubMed: ~ 10% with chemistry (guarded tombs) “Free full text” = 575,513 (24%)
  12. 12. [12] Growth: (escaping the tombs) • Patent “big bang” (SureChem & SCRIPDB in 2012) • Literature “slow burn” (ChEMBL 2009 jump) • Paradox - patents:papers 15:1 (both sets of CIDs cumulative)
  13. 13. [13] Databases <> structures < > documents: links, but few reciprocal Abstracts Patents Papers 15 mill 0.2 mill (mainly MeSH) 0.8 mill (ChEMBL) 12K
  14. 14. [14] Triaging document or webpage chemistry • Identify the structure specification types, e.g. – Semantic names (all sources) – Code names (press releases, papers and abstracts) – IUPAC names (papers, patents and abstracts) – Images (papers, patents, & Google images) – SMILES (open lab books) – InChI strings (open lab books) – SDF files (open lab books, & github) Convert these to a structure (e.g. SDF, SMILES, InChI) then: – Search InChIKey in Google – Search major databases – Search SureChemOpen – Compare extracted sets for intersects and diffs – Extend exact match connectivity with similarity searching
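
The first triage step on slide 14, identifying which kind of structure specification a string is, can be roughed out with pattern matching. This is a minimal sketch and not a method from the presentation: the function name, the category labels and the loose SMILES character filter are all assumptions, and anything it cannot place falls through to "name_or_code" for manual inspection. The water InChI and the aspirin InChIKey in the demo are stand-in examples, not compounds from the slides.

```python
import re

# InChIKey layout: 14 + 10 + 1 uppercase letters separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# Characters commonly seen in SMILES; a loose filter, not a parser (assumption).
SMILES_CHARS_RE = re.compile(r"^[A-Za-z0-9@+\-\[\]\(\)=#$%/\\.:*]+$")


def classify_identifier(text: str) -> str:
    """Return a rough label for one extracted string (hypothetical helper)."""
    s = text.strip()
    if s.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(s):
        return "inchikey"
    if re.fullmatch(r"CHEMBL\d+", s):
        return "chembl_id"
    if re.fullmatch(r"CID\s*\d+", s):
        return "pubchem_cid"
    # Require at least one bond/branch/bracket symbol so plain code names
    # such as MMV390048 are not mistaken for SMILES.
    if " " not in s and SMILES_CHARS_RE.match(s) and any(c in s for c in "()=#[]"):
        return "smiles_like"
    return "name_or_code"


if __name__ == "__main__":
    examples = [
        "InChI=1S/H2O/h1H2",                # water, stand-in example
        "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",      # aspirin, stand-in example
        "CHEMBL2041980",
        "CID 53311393",
        "CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3",
        "MMV390048",
    ]
    for item in examples:
        print(f"{classify_identifier(item):14s} {item}")
```

A "smiles_like" label only says the string looks like SMILES; it still needs to be parsed by a cheminformatics toolkit before it is trusted as a structure.
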
  15. 15. [15] Triage example: a new antimalarial. The MMV390048 code name is linked to an image in press reports but gives no hits in PubChem or PubMed
  16. 16. [16] Images: convert and search Real chemists sketch them in a jiffy; the rest of us can use OSRA: Optical Structure Recognition Application (after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
  17. 17. [17] Making connections: image > structure > database > documents CID 53311393 > ChEMBL > PubMed SureChem or chemicalize.org > patent
  18. 18. [18] Patent SAR from WO2011086531: Collating activities via SureChemOpen CID 53311393 >
  19. 19. [19] Patent SAR results: top-20 from 39 IC50s
  20. 20. [20] Results > figshare http://figshare.com/articles/Patent_SAR_for_MMV390048/657979
  21. 21. [21] Structures > MyNCBI http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/
  22. 22. [22] SAR Table: iOS app from Molecular Materials Informatics SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF -> Dropbox -> SAR Table -> edit in data, R-group decompose -> share
  23. 23. [23] InChIKey in Google: instant orthogonal joining
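
Searching an InChIKey in Google amounts to an exact-phrase query on the 27-character key. A minimal sketch of building such a query URL follows; the function name is hypothetical, the quoted-phrase form is an assumption about how one might search, and the aspirin key is a stand-in rather than a compound from the slides.

```python
from urllib.parse import quote_plus


def google_inchikey_url(inchikey: str) -> str:
    """Build an exact-phrase Google search URL for one InChIKey (hypothetical helper)."""
    # Wrapping the key in double quotes asks for an exact match.
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey.strip()}"')


if __name__ == "__main__":
    # Aspirin InChIKey, used only as a stand-in example.
    print(google_inchikey_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Because the key is a fixed-length hash of the structure, the same query string can surface matching entries from different databases and documents without any structure-aware search engine.
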
  24. 24. [24] Chemicalize.org: 413 strucs from WO2011086531 CID 53311393 ->
  25. 25. [25] Using OPSIN and chemicalize.org to fix recalcitrant IUPACs from WO2011086532 Can quasi-manually extract ~ 10 more “split IUPAC” examples
  26. 26. [26] Clustering document extraction sets: CheS-Mapper WO2011086531 -> chemicalize.org -> 413 cpds download -> CheS-Mapper -> cluster 8 -> export 53 cpds
  27. 27. [27] PubChem -> ChEMBL -> PMID -> assay -> strucs • CHEMBL2041980 (structure) • PMID 22390538 (paper) • CHEMBL2045642 (assay for 32 strucs from paper) • The 32 CIDs all have patent matches
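
The chain on slide 27 can also be followed programmatically. The sketch below only assembles URLs for the identifiers quoted on slides 17 and 27 and makes no network calls. The URL patterns for PubChem PUG REST, the ChEMBL web services and PubMed are written from memory of the current public services, so treat them as assumptions to check against each provider's documentation; the function and dictionary key names are hypothetical.

```python
def build_join_links(cid: int, molecule_id: str, assay_id: str, pmid: int) -> dict:
    """Assemble lookup URLs for one structure-to-document chain (hypothetical helper)."""
    return {
        # PubChem PUG REST: fetch the InChIKey for a compound CID.
        "pubchem_inchikey": (
            "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
            f"{cid}/property/InChIKey/JSON"
        ),
        # ChEMBL web services: molecule record.
        "chembl_molecule": (
            f"https://www.ebi.ac.uk/chembl/api/data/molecule/{molecule_id}.json"
        ),
        # ChEMBL web services: activities recorded under one assay.
        "chembl_assay_activities": (
            "https://www.ebi.ac.uk/chembl/api/data/activity.json"
            f"?assay_chembl_id={assay_id}"
        ),
        # PubMed abstract page for the source paper.
        "pubmed": f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/",
    }


if __name__ == "__main__":
    links = build_join_links(
        cid=53311393,                  # slide 17
        molecule_id="CHEMBL2041980",   # slide 27 (structure)
        assay_id="CHEMBL2045642",      # slide 27 (assay)
        pmid=22390538,                 # slide 27 (paper)
    )
    for label, url in links.items():
        print(f"{label}: {url}")
```
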
  28. 28. [28] Venny: intersects, diffs, de-dupes and merges 1) WO2011086531 matches in PubChem 2) CheS-Mapper cluster 8 from WO2011086532 3) ChEMBL assayed cpds from PMID 22390538 (handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey)
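
The four Venny operations named on slide 28 map directly onto set operations on identifier strings. The sketch below is a stand-alone illustration of that idea, not Venny's own code: the function name and the returned keys are assumptions, and the `ID1`..`ID5` values are placeholders rather than real compound identifiers.

```python
def compare_sets(named_lists: dict) -> dict:
    """Intersect, diff, de-dupe and merge named lists of identifier strings."""
    # De-dupe: turn each list into a set, dropping blanks and stray whitespace.
    cleaned = {
        name: {item.strip() for item in items if item.strip()}
        for name, items in named_lists.items()
    }
    all_sets = list(cleaned.values())
    result = {
        "deduped_sizes": {name: len(members) for name, members in cleaned.items()},
        "common_to_all": set.intersection(*all_sets),   # intersect
        "merged": set.union(*all_sets),                 # merge
        "unique_to": {},                                # diffs
    }
    for name, members in cleaned.items():
        others = set().union(*(s for n, s in cleaned.items() if n != name))
        result["unique_to"][name] = members - others
    return result


if __name__ == "__main__":
    # Placeholder identifiers standing in for the three lists on the slide.
    report = compare_sets({
        "patent_pubchem_matches": ["ID1", "ID2", "ID3", "ID3"],
        "ches_mapper_cluster": ["ID2", "ID3", "ID4"],
        "chembl_assayed": ["ID3", "ID5"],
    })
    print(sorted(report["common_to_all"]))
    print(sorted(report["merged"]))
    print({k: sorted(v) for k, v in report["unique_to"].items()})
```

As on the slide, this works for any regular strings, so the same comparison applies whether the lists hold database IDs, SMILES, InChI strings or InChIKeys, provided each list uses one consistent representation.
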
  29. 29. [29]
  30. 30. [30] NCATS/MRC: the joy of codes with no structures http://cdsouthan.blogspot.se/2012/09/mrc-22-vs-ncats-58-repurposing-lists.html
  31. 31. [31] Code name-to-structure mapping: Dig out the code names PubChem Substance PubChem Compound PubMed/MeSH Google Scholar Google Images Google open (filtered)
  32. 32. [32] Sometimes the system works
  33. 33. [33] PubMed > ChEMBL
  34. 34. [34] Sometimes you get missing and cryptic links
  35. 35. [35] NVP-Bxd552: Google results
  36. 36. [36] BACE2: Almost no chemistry in papers
  37. 37. [37] BACE2 1. WO2013054291 > chemicalize.org 2. Download 450 structures 3. Upload to PubChem search
  38. 38. [38] Scibite > Alerts for new chemistry
  39. 39. [39] Conclusions • The ability to extract chemical structures from text and web sources has been transformed by an expansion of the public toolbox • The PubChem big-bang increases probability of extraction having database exact or similarity matches • Paradoxically, the patent corpus is now completely open while access to journal text is still restricted • However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target mapped structures from ~ 50K papers • The submission of ~15 mill. patent structures to PubChem ensures at least representation from the majority of medicinal chemistry patents (many of which spawned the subsequent ChEMBL papers) • Those who want to share their structures globally (e.g. OSDD) have an expanding set of options for surfacing their results.
  40. 40. [40] References
