Connecting Bioactive Chemistry Across Documents and Databases

Slides for BioIT Track 11, 2013 (presented elsewhere as well)

  • The 70 million substances in CAS suggest a 20-30 million shortfall (i.e. SciFinder only), but they include virtuals and libraries. SureChem will continue patent extraction, but expect it to reach an asymptote soon, adding only truly novel structures. PubMed capture is largely dependent on MeSH, but a lot of IUPAC chemistry is only annually updated, and some is not captured. SureChem, IBM and chemicalize.org all indicate that, including MeSH terms, at least 0.5 million structures could be extracted from PubMed. No idea how much web-unique chemistry (not in documents or databases) is out there, but open lab books will increase this.
  • InChIKeys: an estimate of PubChem + ChemSpider in Google, but PubChem currently has a backlog for Key scraping. The ROF + 250-800 Mw filter is a very approximate circumscription of the property space that has some possibility of bioactivity. A proportion of vendor structures have probably never been committed to text. There are some virtuals “out there”, including some patent extractions, but they are difficult to estimate.
  • Note that the WO/PCT queries are non-redundant in the patent family sense. The medicinal chemistry corpus is actually quite small. Note the big pharma patent decline post-2008. The average number of exemplified cpds with activity data per patent (family) is unknown, but GVK's curation average is ~ 50.
  • Using the top-level MeSH term as a filter for “PubMeds with some chemistry”. Free full text is ~ ¼, but there are a lot of biological journals in this set.
  • Select the core journals used for med chem extraction by GVKBIO and ChEMBL. Not a large corpus. Both extract ~ 15 cpds per paper. Note that the proportion of “free full text” is low.
  • Note that cumulative plots include an element of back-mapping, i.e. the 2005 matches are to the 2013 total, not just the 2005 documents.
  • PubChem hit 15 million patent-extracted structures in March 2013. The largest unique content is from SureChemOpen. Thomson uniqueness is low because a) they include at least 30% journal extractions and b) the Derwent WPI content (was) also in DiscoveryGate. IBM covers only pre-2000 patents, and the extracted content overlaps with other sources.
  • Citations are a core tradition, but they do not provide direct structure <-> structure links. Patents cite papers, but papers rarely cite patents (with the exception of patent reviews).
  • Only Nature Chemical Biology and Nature Chemistry have direct links from the journal document to PubChem. Given today's technology, the major patent offices could put links in the PDFs but are unlikely to do so.
  • The problem “how do I find the chemistry out there relevant to my interests” is a general search retrieval, recall and specificity challenge that cannot be addressed here. Beyond PubMed and Google it’s getting better (e.g. indexing of full-text patents), but there are still issues (e.g. text mining of chemical journals is still very restricted). Once you have found the documents or text, these are the typical questions you might want to address, especially in regard to choosing which tools are best for the job.
  • Need to assess what representational types are being used in the document, e.g. some patents are image-only (but SureChem is pulling most of these out). Then select tools and sources for the job. Decide how to store your structures locally. The default batch search is an upload to PubChem. The default individual search is the InChIKey against Google.
  • Self-explanatory. Note that my blog post was indexed.
  • The simplest of starting points: at least the press release had a structure diagram. OSRA provides good starting points to edit and get SMILES. The structure does not have to be exactly right, because a database similarity match is enough to see what it should have been.
  • The SMILES from the image hits the CID in PubChem. This links to patents via SureChem and chemicalize.org. ChEMBL provides a link to the paper. Note that none of these sources have MMV390048 as a synonym, so all the connections are via structure.
  • We can start off with patent links. Note that in this case numbered image capture, as opposed to the IUPAC listing, was important for manually collating each structure against the correct IC50.
  • From manual cross-checking between the individual example structures and the IC50 table, the Excel sheet can be populated.
  • A useful way to share results that is citable. Indexed in Google, but no live links in the Excel sheet (yet).
  • Can upload CID lists and download as a saved and public collection.
  • This is the Pistoia / Alex Clark SAR Table app. Dropped the CIDs out of PubChem into Dropbox and picked them up on the iPad. Nice, but it would be good to automate the decomposition.
  • The InChIKey search picks it up instantly. This was just a choice of one of the actives. So this connects PubChem and figshare.
  • The CID links straight through to chemicalize.org, which will just re-extract the whole patent in a few seconds. The 413 structures gave 358 hits in PubChem.
  • IUPAC names have a lot of usage variants and OCR mistakes: typically gaps, line breaks, character confusions (e.g. "l" instead of "1") and missing brackets. OPSIN is good for indicating where the break is. This can then be fixed for a series in chemicalize.org.
  • Total extractions from patents can include a lot of low-Mw common reagent chemistry. The CheS-Mapper display makes it easy to pick out clusters of lead-like compounds. Clusters can then be downloaded. Flexibility is high because document sets can be split or merged at the input stage.
  • ChEMBL extracts structure and data. Can't actually select a set of cpds via the PubMed ID, but can via the assay ID, which is usually unique to that paper. In this case we got 32 structures, all of which came from that patent.
  • A very useful utility for any kind of set operations, e.g. sets of extractions. Total flexibility, e.g. intersecting patents and papers with extractions from abstract sets. Sets can be de-duplicated and merged from multiple sets (e.g. 10 patent extractions in one box). Can combine with selected downloaded database records.
  • Connecting Bioactive Chemistry Across Documents and Databases

    1. Open, Collaborative, and Transformative: Exploring and Connecting Bioactive Chemistry Across Biomedical Documents and Databases with Public Tools. Christopher Southan, TW2Informatics, Göteborg, Sweden. BioIT Track 11, Boston, April 2013
    2. Dr Christopher Southan, Ph.D., M.Sc., B.Sc.
        • TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
        • Mobile: +46(0)702-530710
        • Skype: cdsouthan
        • Email: cdsouthan@hotmail.com
        • Twitter: http://twitter.com/#!/cdsouthan
        • Blog: http://cdsouthan.blogspot.com/
        • LinkedIn: http://www.linkedin.com/in/cdsouthan
        • Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
        • Presentations: http://www.slideshare.net/cdsouthan
    3. Abstract: Although there are ~ 50 million chemical structures in public databases, many millions of bioactive compounds are still entombed in documents. In addition, linking chemistry between patents, papers, abstracts and databases has been patchy. However, new tools such as chemicalize.org, OPSIN, OSRA, Venny, CheS-Mapper and InChIKey indexing by Google have transformed the extraction, analysis and connectivity of structures from text. Extractions can also be triaged against PubChem, which now contains 14.5 million patent-extracted compounds from SureChemOpen, SCRIPDB, Thomson and IBM, as well as 1 million from journals via ChEMBL and PubMed. These advances present new collaborative options such as sharing extracted neglected disease patents with SAR annotations on figshare.
    4. Getting chemistry out of text and linking to data: some is done but we have to dig for the rest
    5. Estimates for chemical text tombs
        • Journal chemistry public extraction: ~ 10 to 20 million entombed?
        • Majority of useful patent chemistry already publicly extracted, but ~ 5 to 10 million still to go?
        • PubMed abstracts and MeSH chemistry: ~ 0.5 million still entombed?
        • Other unique, useful, text-only (i.e. no database cross-references) chemistry on the web: ~ 0.1 to 0.5 million entombed?
    6. What’s out there: publicly disinterred structures
        • InChIKey in Google ~ 50 million
        • PubChem = 48 million
        • PubChem ROF + 250-800 Mw (lead-like) = 31 million
        • ChemSpider = 28 million
        • PubChem all docs (papers & patents) = 16 million
        • PubChem patents = 15 million
        • SureChemOpen = 13 million
        • PubChem journal sources (PubMed + ChEMBL) = 1 million
        ~ 90% of all structures in databases have their primary origin in text sources
    7. Medicinal chemistry patents (tombs with lids off)
        • 18,777,229 patents, 2,208,422 WO’s (i.e. ~ 9 per family)
        • WO, C07 or A61 = 469,856
        • WO, C07D or A61K = 235,854
        • WO, C07D = 72,737 (assignee vs. year plots below)
    8. PubMed at 22 mill: ~ 10% with chemistry (guarded tombs). “Free full text” = 575,513 (24%)
    9. Top-5 Med Chem journals (4% lids off tombs). “Free full text” = 2671 (4.3%)
    10. Growth (escaping the tombs)
        • Patent “big bang” (SureChem & SCRIPDB in 2012)
        • Literature “slow burn” (ChEMBL 2009 jump)
        • Paradox: patents:papers 15:1 (both sets of CIDs cumulative)
    11. Patents in PubChem: post-bang total vs. unique content. PubChem at 47.3 million CIDs, 32% include patents, 20% patent-only
    12. Citations: connections between tombs, but still need to disinter structures. [Diagram labels: Papers, Abstracts, PubMed, Patents, "relatedness" heuristics]
    13. Databases <> structures <> documents: links, but few reciprocal. [Diagram labels: Papers, Abstracts, 0.8 mill (ChEMBL), 12K, 0.2 mill (mainly MeSH), Patents 15 mill]
    14. Post-document retrieval: basic questions
        1. What is the name:IUPAC:image:other ratio in the document?
        2. Which tools might be appropriate for first-pass extractions?
        3. How many and what proportion of strucs can be extracted?
        4. Which SAR / in vivo / clinical data is linked to strucs?
        5. Which document sections include the key strucs?
        6. Which database entries have links (back) to this document?
        7. Which strucs have InChIKey matches in Google, and database entries?
        8. Which strucs have synthesis data?
        9. What other documents specify and/or cite this struc?
        10. Which database records for this struc have links to other documents?
        11. What relationship connections can be made using similarity searches?
        12. What intersects and differences are discernible within a document set?
    15. Triaging document or webpage chemistry
        • Identify the structure specification types, e.g.
            – Semantic names (all sources)
            – Code names (press releases, papers and abstracts)
            – IUPAC names (papers, patents and abstracts)
            – Images (papers, patents, & Google images)
            – SMILES (open lab books)
            – InChI strings (open lab books)
            – SDF files (open lab books, & GitHub)
        • Convert these to a structure (e.g. SDF, SMILES, InChI) then:
            – Search InChIKey in Google
            – Search major databases
            – Search SureChemOpen
            – Compare extracted sets for intersects and diffs
            – Extend exact match connectivity with similarity searching
    16. Triage example: antimalarial starting point. The MMV390048 codename is linked to an image in press reports but is PubChem and PubMed negative
    17. Images: convert and search. Real chemists sketch them in a jiffy; the rest of us can use OSRA: Optical Structure Recognition Application (after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
    18. Making connections: image > structure > database > documents. CID 53311393 > ChEMBL > PubMed; SureChem or chemicalize.org > patent
    19. Patent SAR from WO2011086531: collating activities via SureChemOpen. CID 53311393 >
    20. Patent SAR results: top-20 from 39 IC50s
    21. Results > figshare: http://figshare.com/articles/Patent_SAR_for_MMV390048/657979
    22. Structures > MyNCBI: http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/.
    23. SAR Table: iOS app from Molecular Materials Informatics. SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF -> Dropbox -> SAR Table -> edit in data, R-group decompose -> share
    24. InChIKey in Google: instant orthogonal joining
    25. Chemicalize.org: 413 strucs from WO2011086532. CID 53311393 ->
    26. Using OPSIN and chemicalize.org to fix recalcitrant IUPACs from WO2011086532. Can quasi-manually extract ~ 10 more “split IUPAC” examples
    27. Clustering document extraction sets: CheS-Mapper. WO2011086531 -> chemicalize.org -> 413 cpds download -> CheS-Mapper -> cluster 8 -> export 53 cpds
    28. PubChem -> ChEMBL -> PMID -> assay -> strucs
        • CHEMBL2041980 (structure)
        • PMID 22390538 (paper)
        • CHEMBL2045642 (assay for 32 strucs from paper)
        • The 32 CIDs all have patent matches
    29. Venny: intersects, diffs, de-dupes and merges (handles any regular strings, e.g. db IDs, SMILES, InChI or InChIKey)
        1) WO2011086531 matches in PubChem
        2) CheS-Mapper cluster 8 from WO2011086532
        3) ChEMBL assayed cpds from PMID 22390538
    30. OSDDMalaria: global sharing test-bed
        • Different options being explored
        • Team or personal URLs > chemicalize.org
        • GitHub for SD files
        • PubChem public collections
        • Direct feed to ChEMBL malaria
        • G+ for real-time exchange and feedback
    31. The open toolbox facilitates extraction and collation of 10 to 30 million structures entombed in text
    32. Conclusions
        • The ability to extract chemical structures from text and web sources has been transformed by an expansion of the public toolbox
        • The PubChem big bang increases the probability of extractions having exact or similarity matches in databases
        • Paradoxically, the patent corpus is now completely open while access to journal text is still restricted
        • However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target-mapped structures from ~ 50K papers
        • The submission of ~ 15 mill. patent structures to PubChem ensures at least representation from the majority of medicinal chemistry patents (many of which spawned the subsequent ChEMBL papers)
        • Those who want to share their structures globally (e.g. OSDD) have an expanding set of options for surfacing their results.
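
The set operations that slide 29 attributes to Venny (intersects, diffs, de-dupes and merges over plain identifier strings) can also be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Venny itself works: the function names `dedupe` and `compare_sets` are hypothetical, and the identifiers in the example are made-up placeholders rather than real CIDs or InChIKeys.

```python
def dedupe(ids):
    """Return identifiers in first-seen order, dropping blanks and repeats."""
    seen = {}
    for raw in ids:
        key = raw.strip()  # tolerate stray whitespace from copy/paste
        if key:
            seen.setdefault(key, None)
    return list(seen)


def compare_sets(named_lists):
    """Venny-like comparison of several named identifier lists (hypothetical helper).

    Returns the merged (union) list, the identifiers shared by every list,
    and the identifiers unique to each list.
    """
    sets = {name: set(dedupe(ids)) for name, ids in named_lists.items()}
    merged = set().union(*sets.values())
    shared = set.intersection(*sets.values()) if sets else set()
    unique = {}
    for name, members in sets.items():
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = sorted(members - others)
    return {"merged": sorted(merged), "shared": sorted(shared), "unique": unique}


# Placeholder identifiers only; real input would be CIDs, SMILES or InChIKeys
# exported from the three sources named on slide 29.
result = compare_sets({
    "patent_matches": ["KEY-A", "KEY-B", "KEY-B", " KEY-C "],
    "ches_mapper_cluster": ["KEY-B", "KEY-C", "KEY-D"],
    "chembl_assay": ["KEY-C", "KEY-E"],
})
print(result["shared"])
print(result["unique"])
```

Matching here is by exact string, so it is only meaningful when every list uses the same identifier type; InChIKeys are a reasonable choice because they are fixed-length and canonical, whereas SMILES for the same structure can differ between tools.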