ChemAxon UGM, Budapest
20/05/2015
SureChEMBL: Open Patent Data
George Papadatos, PhD
ChEMBL Group, EMBL-EBI
georgep@ebi.ac...
EUGM15 - George Papadatos, Mark Davies, Nathan Dedman (EMBL-EBI): SureChEMBL: An Open Patent Chemistry Resource

SureChEMBL is a new resource provided by the European Bioinformatics Institute (EMBL-EBI) that extracts, annotates and indexes chemistry from full-text patent documents by means of continuous, automated text and image mining. It is perhaps the only open, freely available, live patent chemistry resource in a field that has traditionally been commercial.

Since its launch last September, the SureChEMBL interface has provided sophisticated keyword- and chemistry-based querying and exporting functionality against a corpus of more than 16 million compounds extracted from 13 million patent documents. Both the interface and the underlying data pipeline make extensive use of ChemAxon technologies for name-to-structure conversion, as well as compound standardisation, registration and searching.

In addition to providing an overview of the system, recent developments and improvements will be described. These include new data exchange and export options, such as flat files and a data feed client. Our future plans for the SureChEMBL system will also be outlined. To date, these include complementing the chemical annotations with biological ones covering genes, proteins, diseases and indications, and enriching the chemical annotations with a relevance score that indicates their importance in the patent document.

Published in: Software

  1. ChemAxon UGM, Budapest, 20/05/2015. SureChEMBL: Open Patent Data. George Papadatos, PhD, ChEMBL Group, EMBL-EBI (georgep@ebi.ac.uk)
  2. EMBL-EBI Resources
     • Genes, genomes & variation: Ensembl, Ensembl Genomes, European Genome-phenome Archive, Metagenomics portal, European Nucleotide Archive, 1000 Genomes
     • Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, PRIDE
     • Protein sequences, families & motifs: InterPro, Pfam, UniProt
     • Chemical biology: ChEMBL, ChEBI
     • Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
     • Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
     • Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
     • Systems: BioModels, Enzyme Portal, BioSamples
  3. ChEMBL: Data for drug discovery
     • Bioactivity data: Compound, Assay/Target
     • Example target sequence:
       >Thrombin
       MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE
       RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT
       NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT
       TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT
       THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY
       CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF
       EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR
       WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR
       ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA
       NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG
       PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE
     • 1. Scientific facts: Ki = 4.5 nM, APTT = 11 min.
     • 2. Organization, integration, curation and standardization of pharmacology data
     • 3. Insight, tools and resources for translational drug discovery
  4. Why look at patent documents?
     • Patent filing and searching
     • Legal, financial and commercial incentives & interests
     • Prior art, novelty, freedom to operate searches
     • Competitive intelligence
     • Unprecedented wealth of knowledge
     • Most of this knowledge will never be disclosed anywhere else
     • For chemistry, an average lag of 2-3 years between disclosure in a patent document and in a journal publication
  5. From SureChem to SureChEMBL
     • Digital Science/Macmillan donated SureChem to EMBL-EBI
     • SureChem: commercial patent chemistry mining product
     • Wellcome Trust funds further development
     • EMBL-EBI provides an on-going, live service
     • Full functionality freely available to everyone
     • Query, view and export chemistry from patents
     • Complemented with biological annotations
  6. SureChEMBL data processing (pipeline diagram)
     • Patent Offices: WO, EP (Applications & Granted), US (Applications & Granted), JP (Abstracts)
     • SureChem IP: Patent PDFs (service), OCR, Processed patents (service)
     • Entity Recognition: Name to Structure (five methods), Image to Structure (one method)
     • Example recognised name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
     • SureChEMBL System: Chemistry Database, Database, API, Application, Users
  8. Homepage (www.surechembl.org)
     • Help
     • Search by keyword and meta-data
     • Search by chemical structure (sketch compound)
     • Search by SMILES, MOL, SMARTS, name
     • Search by patent number
     • Filter by authority (US, EP, WO and JP)
     • Filter by document section (title, claims, abstract, description and images)
     • Chemical search type (substructure, similarity, identical)
     • Filter by date
     • Filter by MW
  9. Data growth
     • ~80K novel compounds every month
     • ~800K novel compounds since EBI took over
     • 2–7 days for a published patent to be chemically annotated and searchable in SureChEMBL
     • Figure: cumulative growth of SureChEMBL compounds (compound count vs. time)
  10. EMBL-EBI chemistry resources (RDF and REST API interfaces)
      • Atlas: ligand-induced transcript response (750)
      • PDBe: ligand structures from protein complexes (15K)
      • ChEBI: nomenclature of primary and secondary metabolites; chemical ontology (24K)
      • SureChEMBL: chemical structures from patent literature (16M)
      • ChEMBL: bioactivity data from literature and depositions (1.5M)
      • UniChem: InChI-based chemical resolver with full and relaxed ‘lenses’ (>90M)
      • 3rd party data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, … (~65M)
      • REST API interface: https://www.ebi.ac.uk/unichem/
  11. Data access & exports
      • Full compound repository
        • FTP download, SDF and CSV format
        • Updates quarterly
      • Full compound-patent map
        • FTP download, flat file
        • Updates quarterly
      • Data feed client
        • Creates a local replica database of SureChEMBL
        • Updates daily
  12. Compound-patent map
      • Flat file with the fields: compound, global frequency, document, section, section frequency, publication date
      • Back file
        • 187,958,584 unique patent-compound pairs
        • 14,076,090 unique compound IDs
        • 3,585,233 EP, JP, WO and US patent docs
        • 1960-2014
      • Quarterly incremental updates
        • Q1 2015 is also now available on the FTP
      • http://chembl.blogspot.co.uk/2015/03/the-surechembl-map-file-is-out.html
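
As a rough illustration of working with a flat file of the shape described on slide 12, the sketch below counts distinct compounds per patent document. The tab delimiter, the absence of a header row, the column order (taken from the field list on the slide) and the names `MapRow`, `read_map` and `compounds_per_document` are all assumptions made for this example, not the documented file layout. The sample rows are invented.

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple


class MapRow(NamedTuple):
    """One line of the map file; the field order is assumed from the slide."""
    compound_id: str
    global_frequency: int
    document: str
    section: str
    section_frequency: int
    publication_date: str


def read_map(handle, delimiter="\t"):
    """Yield MapRow records from an open text handle (no header row assumed)."""
    for fields in csv.reader(handle, delimiter=delimiter):
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        cid, gfreq, doc, section, sfreq, date = fields
        yield MapRow(cid, int(gfreq), doc, section, int(sfreq), date)


def compounds_per_document(rows):
    """Map each patent document to the set of compound IDs found in it."""
    found = defaultdict(set)
    for row in rows:
        found[row.document].add(row.compound_id)
    return dict(found)


# Invented sample rows, for illustration only.
SAMPLE = (
    "SCHEMBL1\t120\tUS-0000001-A1\tclaims\t2\t2014-01-02\n"
    "SCHEMBL1\t120\tUS-0000001-A1\tdescription\t5\t2014-01-02\n"
    "SCHEMBL2\t3\tUS-0000001-A1\tdescription\t1\t2014-01-02\n"
    "SCHEMBL2\t3\tEP-0000002-B1\tabstract\t1\t2013-06-30\n"
)

rows = list(read_map(io.StringIO(SAMPLE)))
per_doc = compounds_per_document(rows)
for doc, compounds in sorted(per_doc.items()):
    print(doc, len(compounds))
```

Because `read_map` is a generator, the same code could in principle stream a much larger file from disk instead of the in-memory sample, provided the real layout matches the assumptions above.
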
  13. Data feed client: http://vartree.blogspot.co.uk/2015/01/how-to-create-your-own-replica-of.html
  14. Use cases with SureChEMBL
      • Chemoinformatics
      • Chemistry landscape for a particular biological target/disease
      • Novel chemistry & scaffolds
      • MDS, MCS and R-group analysis for the claimed chemistry of a particular patent family
      • (Negative) novelty checking with UniChem
      • Competitive intelligence
      • Reporting
      • Patent alerts
      • Per target/disease/company
  15. Bioactivity data extraction? Compounds, Target/Assay, Bioactivity
  16. Markush structure extraction? -alkyl, -aryl, -heteroaryl, -heterocyclyl, -cycloalkyl, …
  17. Biological annotations: bioannotations soon to be integrated into the SureChEMBL interface, using SciBite’s Termite text mining engine
  18. US-9012636-B2
  19. Future steps
      • OpenPHACTS ENSO
      • Biological tagging of targets, genes, indications and diseases
      • Development of integrated use-cases
      • Combine chemistry & biology from patents, literature, pathways, etc.
      • OpenPHACTS API
      • Accessible via KNIME nodes
      • Further improvements/added value
      • Data quality and accuracy
      • Target and compound relevance score
  20. Acknowledgements
      • ChEMBL team: John Overington, Anne Hersey, Anna Gaulton, Mark Davies, Nathan Dedman, Michal Nowotka
      • Collaborators: James Siddle, Richard Koks, Lee Harland, Kevin Clark
      • Support: surechembl-help@ebi.ac.uk
      • Webinar: http://www.ebi.ac.uk/training/online/course/surechembl-accessing-chemical-patent-data-webinar
  21. Technology partners
  23. Back-up slides
  24. ChEMBL-SureChEMBL compound overlap
      • Connectivity match on single components (UniChem)
      • Overlap: 21.4%
      • ChEMBL: 1.5M compounds; SureChEMBL: 16M compounds
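
Slide 24 does not say how the 21.4% was computed or which set is the denominator. The following is only a minimal sketch of a connectivity-level overlap, assuming each compound is represented by a Standard InChIKey and approximating a "connectivity match" by comparing the first block of the key (keys that differ only in stereochemistry or isotopes share that block). The function names are hypothetical, the keys are placeholders rather than real compounds, and splitting mixtures into single components is not attempted.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first hyphen-separated block of an InChIKey."""
    return inchikey.split("-", 1)[0]


def overlap_fraction(reference, other) -> float:
    """Fraction of `reference` whose first key block also occurs in `other`."""
    ref_blocks = {connectivity_block(k) for k in reference}
    other_blocks = {connectivity_block(k) for k in other}
    if not ref_blocks:
        return 0.0
    return len(ref_blocks & other_blocks) / len(ref_blocks)


# Placeholder keys (not real compounds), for illustration only.
set_a = [
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
    "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
    "DDDDDDDDDDDDDD-UHFFFAOYSA-N",
]
set_b = [
    "AAAAAAAAAAAAAA-JTQLQIEISA-N",  # same first block, different second block
    "EEEEEEEEEEEEEE-UHFFFAOYSA-N",
]

# Share of set_a that has a connectivity-level match in set_b.
print(f"{overlap_fraction(set_a, set_b):.1%}")
```

Note that the result depends on which set is used as the reference: swapping the arguments gives a different percentage whenever the two sets differ in size.
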
  25. Too granular? Try scaffolds instead
  26. Level 1 scaffold overlap: 57% (diagram labels: SureChEMBL, ChEMBL; counts: 61K, 298K)
  28. Can we have everything? Cost, Time, Quality
  29. Common sources of errors
      • Small, poor quality images
      • OCR errors in names (OCR done by IFI). There is an OCR correction step, but it cannot fix all errors, e.g. ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-3- vDbenzamide’
      • Reliability is better for US patents because mol files are included
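
The garbled name on slide 29 happens to leave a curly bracket unclosed, which suggests one cheap sanity check for OCR-damaged names. The sketch below is not the OCR correction step mentioned on the slide, whose workings are not described; it is a hypothetical filter (`brackets_balanced` is a made-up name) that only flags names whose brackets do not nest properly. The clean example is the name from slide 6 with the line-break spaces removed.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def brackets_balanced(name: str) -> bool:
    """True if (), [] and {} in a chemical name are properly nested."""
    stack = []
    for ch in name:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # leftover openers mean the name is unbalanced


garbled = ("2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 "
           "H-pyrazol-3- vDbenzamide")
clean = ("1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo"
         "[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine")

print(brackets_balanced(garbled))
print(brackets_balanced(clean))
```

A check like this catches only one class of damage. A name can have balanced brackets and still contain misread characters, so it would complement rather than replace any correction step.
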
