Heavenly Conjunctions in Chemical Information

ChemAxon UGM 2013
The ChemAxon name-to-structure functionality is not only a component of the SureChem patent extraction pipeline but also powers chemicalize.org. Both operations are now submitting sources to PubChem. The former has deposited structures that bring the patent-extracted total in PubChem to 14.5 million CIDs. The deposition from chemicalize.org is ~0.3 million, but it has been actively selected by users and is 20% unique. The final conjunction is that all three sources generate the InChIKey (IK), which turns Google into a de facto merge of PubChem and ChemSpider covering ~50 million structures. Chemicalize.org users can convert new patents, other external or internal documents and web-based text. Individual results can be Googled or searched against SureChemOpen, and bulk extractions can be triaged against PubChem. It thus becomes possible to connect chemistry between patents, papers, abstracts and database records via exact match or similarity searching. When SureChem and chemicalize.org update their submissions, relationships with the other ~200 PubChem sources (including ChEMBL and vendor databases) are re-computed and new CID links are made. The synergy between SureChem and chemicalize.org is powerful because matches between them (~0.15 million), found via SureChemOpen, give occurrence statistics and the location of the structure within patents. The applications of chemicalize.org are extended by web tools such as Venny for determining intersects from multiple extractions and CheS-Mapper for cluster visualization. These utility expansions will be illustrated by documents specifying BACE1 inhibitors for Alzheimer’s disease.

Published in: Health & Medicine, Technology
  • 70 million substances in CAS suggest a 20-30 million shortfall (i.e. SciFinder only), but they include virtuals and libraries. SureChem will continue patent extraction, but expect an asymptote of true novels only soon. PubMed capture is largely dependent on MeSH, but a lot of IUPAC chemistry is only annually updated, and some is not captured. SureChem, IBM and chemicalize all indicate that, including MeSH terms, at least 0.5 million structures could be extracted from PubMed. No idea how much web-unique chemistry (not in documents or databases) is out there, but open lab books will increase this.
  • Only Nature Chemical Biology and Nature Chemistry have direct links from the journal document to PubChem. Given today's technology, the major patent offices could put links in the PDFs but are unlikely to do so.
  • Need to assess what representational types are being used in the document, e.g. some patents are image-only (but SureChem is pulling most of these out). Then select tools and sources for the job. Decide how to store your structures locally. The default batch search is an upload to PubChem. The default individual search is the InChIKey against Google.
  • Total extractions from patents can include a lot of low-Mw common reagent chemistry. The CheS-Mapper display makes it easy to pick out clusters of lead-like compounds. Clusters can then be downloaded. Flexibility is high because document sets can be split or merged at the input stage.
  • Very useful utility for any kind of set operations, e.g. on sets of extractions. Total flexibility, e.g. intersecting patents and papers with extractions from abstract sets. Sets can be de-duplicated and merged from multiple sets (e.g. 10 patent extractions in one box). Can combine with selected downloaded database records.
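The set operations described in the note above (de-duplicate, merge, intersect, diff) can also be sketched locally with plain Python sets. This is only an illustration of the idea, not how Venny works internally. The function names and document labels are hypothetical, and the keys are illustrative strings in InChIKey format rather than real extractions from the BACE documents.

```python
def dedupe(keys):
    """Normalise case/whitespace and drop duplicates from one extraction."""
    return {k.strip().upper() for k in keys if k.strip()}


def compare(sets_by_doc):
    """Return (merged, shared, unique_per_doc) for named InChIKey sets.

    merged: every key seen in any document
    shared: keys present in every document
    unique_per_doc: keys found in one document only
    """
    all_sets = list(sets_by_doc.values())
    merged = set().union(*all_sets)
    shared = set.intersection(*all_sets) if all_sets else set()
    unique_per_doc = {}
    for name, keys in sets_by_doc.items():
        # Union of every other document's keys, then take the difference.
        others = set().union(*(v for n, v in sets_by_doc.items() if n != name))
        unique_per_doc[name] = keys - others
    return merged, shared, unique_per_doc


# Hypothetical inputs: two small extraction lists with one overlap.
patent = dedupe([
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "bsynrymutxbxsq-uhfffaoysa-n ",   # same key, different case: de-duplicated
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
])
paper = dedupe([
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
])

merged, shared, unique_per_doc = compare({"patent": patent, "paper": paper})
print(len(merged), sorted(shared), unique_per_doc)
```

Because the comparison is on InChIKey strings, it only finds exact structure matches; similarity relationships still need a database search.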
  • Heavenly Conjunctions in Chemical Information

    1. Chemicalize.org, SureChemOpen, PubChem and the InChIKey: Heavenly conjunctions with transformative utility. Christopher Southan, TW2Informatics, Göteborg, Sweden. ChemAxon UGM, Budapest, May 2013. Video: http://www.youtube.com/watch?feature=player_embedded&v=OKLw9BaQzY0#t=0s Related posters: http://www.slideshare.net/cdsouthan/the-patent-chemistry-big-bang-in-pubchem and http://www.slideshare.net/cdsouthan/cs-cax-bioitchemicalizeposter03apr Image credit: http://www.eso.org/public/images/yb_vlt_moon_cnn_cc/
    2. Dr Christopher Southan, Ph.D., M.Sc., B.Sc. TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm Mobile: +46(0)702-530710 Skype: cdsouthan Email: cdsouthan@hotmail.com Twitter: http://twitter.com/#!/cdsouthan Blog: http://cdsouthan.blogspot.com/ LinkedIn: http://www.linkedin.com/in/cdsouthan Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications Slideshare: http://www.slideshare.net/cdsouthan Figshare: http://figshare.com/authors/Christopher%20Southan/97432
    3. Abstract (text as given at the top of this page)
    4. Auspicious Conjunctions 2012-13
       • PubChem: structures to slice 'n' dice (48 mill)
       • SureChemOpen: majority of patent chemistry opened up (14.5 mill)
       • Chemicalize.org: chemistry extractable from any text tombs (0.3 mill)
       • Chemical images: patents extracted in SureChemOpen, OSRA handles papers
       • InChIKey indexing in Google (50 mill +)
       • ChemSpider: crowdsourcing chemistry quality (28 mill)
       • Expanding toolbox, e.g. OPSIN, Venny, CheS-Mapper
       • SciBite alerts
       • Expanding preview and surfacing options, e.g. ChEMBLntd, Github, OSDD, Open Lab Books, figshare etc.
       • Rise of mobile chemistry
    5. Databases <> structures <> documents. [Diagram labels: Abstracts, Patents, Papers; 15 mill, 0.2 mill (MeSH), 0.8 mill (ChEMBL), 12K.] Google InChIKey: ~50 million (47m PubChem + 33m UniChem + 28m ChemSpider)
    6. Triaging chemistry from text
       • Identify the structure specification types, e.g.
         – Semantic names (all sources)
         – Code names (press releases, papers and abstracts)
         – IUPAC names (papers, patents and abstracts)
         – Images (papers, patents and Google images)
         – SMILES (open lab books)
         – InChI strings (open lab books)
         – SDF files (open lab books and Github)
       • Convert these to a structure (e.g. SDF, SMILES, InChI), then:
         – Search InChIKey in Google
         – Search major databases
         – Search SureChemOpen
         – Compare extracted sets for intersects and diffs
         – Extend exact match connectivity with similarity searching
    7. PubChem Composition
    8. SureChemOpen Composition (in PubChem)
    9. Chemicalize.org Composition (in PubChem)
    10. BACE2 Conjunctions
    11. BACE2 Conjunctions
    12. Chemicalize.org Triage
    13. BACE2 Conjunctions: (1) WO2013054291 > chemicalize.org; (2) download 450 structures; (3) upload to PubChem search
    14. Clustering document extraction sets: CheS-Mapper
    15. Venny: intersects, diffs, de-dupes and merges
    16. Conclusions
       • Transformative opening up of chemistry > biology via structure > document connectivity
       • Open mining of patent metadata and data
       • Expanding toolbox
       • Inexorable expansion of open-access publishing
       But:
       • Journal chemistry extraction > database records still slow
       • Text mining of journals still restricted
       • Author annotation and direct db submission rare
       • Pharmaceutical research publications are still blinding structures (see PMID: 23159359)
    17. References
       • http://www.slideshare.net/cdsouthan/the-patent-chemistry-big-bang-in-pubchem
       • http://www.slideshare.net/cdsouthan/cs-cax-bioitchemicalizeposter03apr
       • http://www.ncbi.nlm.nih.gov/pubmed/23399051
       • http://www.ncbi.nlm.nih.gov/pubmed/23618056
       • http://www.ncbi.nlm.nih.gov/pubmed/23506624
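
Part of the triage on slide 6 (spot machine-readable identifiers in text, then search the InChIKey in Google) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the chemicalize.org pipeline: the function names are hypothetical, the regular expressions only catch standard-format InChIKeys and whitespace-delimited InChI strings, and the Google URL is just the ordinary search endpoint with the key quoted as an exact phrase. Converting names or images to structures needs dedicated tools and is not attempted here.

```python
import re
from urllib.parse import quote_plus

# An InChIKey is 14 + 10 + 1 upper-case letters separated by hyphens.
INCHIKEY_RE = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")
# An InChI string starts with "InChI=1" (optionally "1S") and runs to the
# next whitespace; trailing punctuation would be captured as well.
INCHI_RE = re.compile(r"InChI=1S?/\S+")


def find_identifiers(text):
    """Return the unique InChIKeys and InChI strings found in free text."""
    return {
        "inchikey": sorted(set(INCHIKEY_RE.findall(text))),
        "inchi": sorted(set(INCHI_RE.findall(text))),
    }


def google_query_url(inchikey):
    """Build an exact-phrase Google search URL for one InChIKey."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey}"')


# Illustrative lab-book style text (not taken from the slides).
text = (
    "Compound InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 has key "
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N in the notes"
)
found = find_identifiers(text)
for key in found["inchikey"]:
    print(key, google_query_url(key))
```

Searching the quoted key is what makes the exact match work, because the InChIKey is a fixed-length string that search engines index as a single token.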
