
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOpen, PubChem and the InChIKey: A heavenly conjunction with transformative utility

The ChemAxon Name to Structure functionality is not only a component of the SureChem patent extraction pipeline but also powers chemicalize.org. Both operations are now submitting sources to PubChem. The former has deposited structures that bring the patent-extracted total in PubChem to 14.5 mill. CIDs. The deposition from chemicalize.org is ~0.3 mill., but has been actively selected by users and is 20% unique. The final conjunction is that all three sources generate the InChIKey, which turns Google into a de facto merge of PubChem and ChemSpider of ~50 mill. structures. Chemicalize.org users can convert new patents, other external or internal documents and web-based text. Individual results can be Googled or searched against SureChemOpen, and bulk extractions can be triaged against PubChem. It thus becomes possible to connect chemistry between patents, papers, abstracts and database records via exact match or similarity searching. When SureChem and chemicalize.org update their submissions, relationships with the other 47 million structures from ~200 PubChem sources (including ChEMBL and vendor databases) are re-computed and new CID links made. The synergy between SureChem and chemicalize.org is powerful because matches between them (~0.15 mill.), found via SureChemOpen, give occurrence statistics and the location of the structure within patents. The applications of chemicalize.org are extended by web tools such as Venny for determining intersects from multiple extractions and CheS-Mapper for cluster visualization. These utility expansions will be illustrated by documents specifying BACE1 inhibitors for Alzheimer’s disease.
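As a rough illustration of the "InChIKey as search handle" idea described above, the following Python sketch checks that a string has the standard InChIKey layout and builds two lookup URLs: a quoted Google search and a PubChem PUG REST query for matching CIDs. The function names are hypothetical, the aspirin key is an example that does not come from the talk, and no network request is made; this is a sketch of the idea rather than the workflow actually used by the author.

```python
import re
from urllib.parse import quote, quote_plus

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text: str) -> bool:
    """Return True if the text has the standard InChIKey layout."""
    return INCHIKEY_RE.fullmatch(text.strip()) is not None


def google_search_url(inchikey: str) -> str:
    """Build a Google query URL for the key as an exact (quoted) phrase."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey}"')


def pubchem_cid_url(inchikey: str) -> str:
    """Build a PubChem PUG REST URL that lists CIDs for the key."""
    return (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/"
        f"{quote(inchikey)}/cids/JSON"
    )


def connectivity_block(inchikey: str) -> str:
    """Return the first 14-letter block, which hashes the connectivity layer."""
    return inchikey.split("-")[0]


# Example key (aspirin); an illustration only, not taken from the slides.
example_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
if is_inchikey(example_key):
    print(google_search_url(example_key))
    print(pubchem_cid_url(example_key))
    print(connectivity_block(example_key))
```

Searching on the first block alone is a common way to loosen an exact match so that it also picks up records differing only in stereochemistry or isotopes.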

Published in: Technology, Education


  1. Chemicalize.org, SureChemOpen, PubChem and the InChIKey: A heavenly conjunction with transformative utility
     Christopher Southan, TW2Informatics, Göteborg, Sweden
     ChemAxon UGM, Budapest, May 2013
     Image credit: http://www.eso.org/public/images/yb_vlt_moon_cnn_cc/
  2. Dr Christopher Southan, Ph.D., M.Sc., B.Sc.
     TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
     Mobile: +46(0)702-530710
     Skype: cdsouthan
     Email: cdsouthan@hotmail.com
     Twitter: http://twitter.com/#!/cdsouthan
     Blog: http://cdsouthan.blogspot.com/
     LinkedIn: http://www.linkedin.com/in/cdsouthan
     Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
     Presentations: http://www.slideshare.net/cdsouthan
  3. Abstract
     The ChemAxon name-to-structure functionality is not only a component of the SureChem patent extraction pipeline but also powers chemicalize.org. Both operations are now submitting sources to PubChem (PC). The former has deposited structures that bring the patent-extracted total in PC to 14.5 mill. CIDs. The deposition from chemicalize.org is ~0.3 mill., but has been actively selected by users and is 20% unique. The final conjunction is that all three sources generate the InChIKey (IK), which turns Google into a de facto merge of PubChem and ChemSpider of ~50 mill. structures. Chemicalize.org users can convert new patents, other external or internal documents and web-based text. Individual results can be Googled or searched against SureChemOpen, and bulk extractions can be triaged against PubChem. It thus becomes possible to connect chemistry between patents, papers, abstracts and database records via exact match or similarity searching. When SureChem and chemicalize.org update their submissions, relationships with the other ~200 PubChem sources (including ChEMBL and vendor databases) are re-computed and new CID links made. The synergy between SureChem and chemicalize.org is powerful because matches between them (~0.15 mill.), found via SureChemOpen, give occurrence statistics and the location of the structure within patents. The applications of chemicalize.org are extended by web tools such as Venny for determining intersects from multiple extractions and CheS-Mapper for cluster visualization. These utility expansions will be illustrated by documents specifying BACE1 inhibitors for Alzheimer’s disease.
  4. Auspicious Conjunctions 2012-13
     • PubChem: global chemistry to slice 'n' dice
     • SureChemOpen: majority of patent chemistry opened up
     • Chemicalize.org: chemistry extractable from any text tombs
     • Chemical images: patents extracted in SureChemOpen, OSRA handles papers
     • InChIKey indexing in Google
     • ChemSpider: crowdsourcing chemistry quality
     • Expanding toolbox, e.g. OPSIN, Venny, CheS-Mapper
     • SciBite alerts
     • Expanding preview and surfacing options, e.g. ChEMBL-NTD, GitHub, OSDD, Open Lab Books, figshare etc.
     • Rise of mobile chemistry
  5. Databases <> structures <> documents
     [Diagram: Abstracts, Patents, Papers; counts shown: 15 mill, 0.2 mill (MeSH), 0.8 mill (ChEMBL), 12K]
     Google InChIKey ~50 million (47m PubChem + 33m UniChem + 28m ChemSpider)
  6. Triaging chemistry from text
     • Identify the structure specification types, e.g.
       – Semantic names (all sources)
       – Code names (press releases, papers and abstracts)
       – IUPAC names (papers, patents and abstracts)
       – Images (papers, patents and Google Images)
       – SMILES (open lab books)
       – InChI strings (open lab books)
       – SDF files (open lab books and GitHub)
     Convert these to a structure (e.g. SDF, SMILES, InChI) then:
       – Search InChIKey in Google
       – Search major databases
       – Search SureChemOpen
       – Compare extracted sets for intersects and diffs
       – Extend exact match connectivity with similarity searching
  7. PubChem Composition
  8. SureChemOpen Composition (in PubChem)
  9. Chemicalize.org Composition (in PubChem)
  10. BACE2 Conjunctions
  11. BACE2 Conjunctions
  12. Chemicalize.org Triage
  13. BACE2 Conjunctions
      1. WO2013054291 > chemicalize.org
      2. Download 450 structures
      3. Upload to PubChem search
  14. Clustering document extraction sets: CheS-Mapper
  15. Venny: intersects, diffs, de-dupes and merges
  16. Conclusions
      • Transformative opening up of chemistry > biology via structure > document connectivity
      • Open mining of patent metadata and data
      • Expanding toolbox
      • Inexorable expansion of open-access publishing
      But:
      • Journal chemistry extraction > database records still slow
      • Text mining of journals still restricted
      • Author annotation and direct db submission rare
      • Pharmaceutical research publications are still blinding structures (see PMID: 23159359)
  17. References
      http://www.slideshare.net/cdsouthan/the-patent-chemistry-big-bang-in-pubchem
      http://www.slideshare.net/cdsouthan/cs-cax-bioitchemicalizeposter03apr
      http://www.ncbi.nlm.nih.gov/pubmed/23399051
      http://www.ncbi.nlm.nih.gov/pubmed/23618056
      http://www.ncbi.nlm.nih.gov/pubmed/23506624
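
The set comparisons named on slides 6 and 15 (intersects, diffs, de-dupes and merges of extracted structure sets) can be approximated with ordinary set logic over lists of InChIKeys. The Python sketch below is not how Venny is implemented; it only mirrors the same four operations. The function names are hypothetical and the keys are placeholder strings standing in for real InChIKeys from two document extractions.

```python
from typing import Dict, Iterable, List, Set


def dedupe(keys: Iterable[str]) -> List[str]:
    """Drop blanks and repeats while keeping first-seen order."""
    seen: Set[str] = set()
    unique: List[str] = []
    for key in keys:
        key = key.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def compare(set_a: Iterable[str], set_b: Iterable[str]) -> Dict[str, List[str]]:
    """Return the intersect, both diffs and the merged list for two extractions."""
    a, b = dedupe(set_a), dedupe(set_b)
    in_a, in_b = set(a), set(b)
    return {
        "intersect": [k for k in a if k in in_b],
        "only_a": [k for k in a if k not in in_b],
        "only_b": [k for k in b if k not in in_a],
        "merged": dedupe(a + b),
    }


# Placeholder identifiers standing in for InChIKeys (not real data).
patent_keys = ["KEY-1", "KEY-2", "KEY-2", "KEY-3"]
paper_keys = ["KEY-3", "KEY-4"]

result = compare(patent_keys, paper_keys)
for name, keys in result.items():
    print(name, keys)
```

Keeping first-seen order, rather than returning bare sets, makes it easier to trace a shared structure back to its position in the original extraction.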
