John P. Overington - EMBL-EBI 
Nicko Goncharoff – Digital Science 
SureChEMBL: Open patent chemistry data
EMBL-EBI’s Mission 
•Provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress 
•Contribute to the advancement of biology through basic investigator-driven research in bioinformatics 
•Provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators 
•Help disseminate cutting-edge technologies to industry 
•Coordinate biological data provision throughout Europe
EMBL Member States 
Austria, Belgium, Croatia, Czech Republic, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland and the United Kingdom 
Associate member states: Australia, Argentina
ChEMBL 
•The world’s largest primary public database of medicinal chemistry data 
•https://www.ebi.ac.uk/chembl 
•>1.4 million compounds, >9,000 targets, >12 million bioactivities 
•Truly Open Data 
•CC-BY-SA license 
•Many download/access formats 
•myChEMBL – Linux VM, PostgreSQL, RDKit, KNIME… 
•Semantic Web 
•RDF download, SPARQL endpoint at http://rdf.ebi.ac.uk/chembl
SAR Data 
Compound 
Assay 
Ki=4.5 nM 
>Thrombin 
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLERECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGTNYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGYCDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDRWVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE 
ED2=230 nM 
Inhibition of human Thrombin 
PTT (partial thromboplastin time) 
ChEMBL
SureChem = SureChEMBL 
•December 2013 EMBL-EBI ‘acquired’ SureChem 
•Existing SureChem user base 
•Free (SureChemOpen) 
•Paying (SureChemPro + API) 
•EMBL-EBI supported existing licensees during transition 
•EMBL-EBI provides an ongoing, free and open resource to entire community 
•Private, Secure, and Free 
•No login system 
•Rebranded as SureChEMBL 
•https://www.surechembl.org 
PDG Biotech Meeting
Rebranding Complete! 
EMBL-EBI Chemistry Resources 
•RDF and REST API interfaces 
•REST API interface 
•Atlas: ligand-induced transcript response (750) 
•PDBe: ligand structures from structurally defined protein complexes (15K) 
•ChEBI: nomenclature of primary and secondary metabolites; chemical ontology (24K) 
•SureChEMBL: chemical structures from patent literature (16M) 
•ChEMBL: bioactivity data from literature and depositions (1.5M) 
•UniChem: InChI-based chemical resolver (full + relaxed ‘lenses’) (>70M) 
•3rd-party data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, … (~55M)
SureChEMBL Data Pipeline 
•Patent offices: WO; EP (applications & granted); US (applications & granted); JP (abstracts) 
•Processed patents (IFI Claims) 
•OCR 
•Entity recognition, e.g. 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine 
•Name to structure (five methods) 
•Image to structure (one method) 
•Chemistry database 
•SureChEMBL system: database, application server, API, patent PDFs (service), users
SureChEMBL data coverage 
(Data: description & languages, years) 
•EP applications: bib. data (DocDB + Original) and full text (Original; EN, DE, FR), from 1978 
•EP granted: bib. data (DocDB + Original) and full text (Original; EN, DE, FR), from 1980 
•WO applications: bib. data (DocDB + Original), from 1978; full text (Original; EN, DE, FR, ES, RU), from 1978 
•US applications: bib. data (DocDB + Original), from 2001; full text (Original; EN), from 2001 
•US granted: bib. data (DocDB + Original), from 1920; full text (Original; EN), from 1976 
•JP applications: bib. data (DocDB), from 1973; PAJ English abstracts/titles, from 1976 
•JP granted: bib. data (DocDB), from 1994 
•90+ countries: bib. data (DocDB), from 1920
SureChEMBL Chemistry Data Coverage 
•Structures from text: 1976 onwards 
•Title, abstract, claims, description 
•SureChem Chemical Entity Recognition - proprietary algorithms 
•ACD/Labs, ChemAxon, OpenEye, OPSIN, PerkinElmer name-to-structure conversion 
•Structures from images: 2007 onwards 
•CLiDE image-to-structure conversion 
•Will extend image processing backwards using AWS Spot Pricing compute 
•USPTO offers ‘Complex Work Units’ since 2001 
•CWU file types include MOL and CDX 
•CWUs processed as part of pipeline: 2007 onwards
Chemical Entity Extraction 
SureChEMBL Content (September 2014) 
•15,668,225 compounds 
•12,888,125 patents 
•~80,000 new compounds extracted from ~50,000 patents monthly 
•1–7 days for published patent to become searchable in SureChEMBL 
•System provides search access to all patents (not just chemistry) 
Current System Capabilities 
•Searching capabilities 
•Free text keywords and Lucene fields 
•Patent IDs & bibliographic information 
•Patent authority & date 
•Chemical structure 
•Retrieval capabilities 
•Retrieve chemistry (with additional filters) 
•Retrieve patent family information 
•Retrieve annotated full patent text 
•Retrieve patent document as PDF 
Compound Report Page 
https://www.surechembl.org/chemical/SCHEMBL1895
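The compound report URL above appears to follow a simple pattern: a fixed base path followed by the SureChEMBL compound identifier. A minimal sketch of a helper that builds such links is shown below. The function name `compound_report_url` and the assumption that every identifier has the form `SCHEMBL` plus digits are illustrative, inferred only from the single example on this slide.

```python
import re

# Base path taken from the example URL on the slide.
BASE_URL = "https://www.surechembl.org/chemical/"

# Assumption: identifiers look like "SCHEMBL" followed by digits,
# as in the single example SCHEMBL1895.
_SCHEMBL_ID = re.compile(r"SCHEMBL\d+")


def compound_report_url(schembl_id: str) -> str:
    """Return the report-page URL for a SureChEMBL compound identifier."""
    cleaned = schembl_id.strip().upper()
    if not _SCHEMBL_ID.fullmatch(cleaned):
        raise ValueError(f"not a SureChEMBL compound ID: {schembl_id!r}")
    return BASE_URL + cleaned


if __name__ == "__main__":
    print(compound_report_url("SCHEMBL1895"))
```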
UniChem Integration 
On-the-fly integration with 71M structures from 25 data sources
SureChEMBL Data Access 
•UniChem 
•https://www.ebi.ac.uk/unichem 
•Weekly updates 
•Private, secure, live integration with >25 chemistry resources 
•UniChem will soon be the world’s largest chemical structure integration resource… 
•FTP Site 
•ftp://ebi.ac.uk/public 
•Quarterly updates 
•All SureChEMBL compounds in SDF and CSV format 
•Raw data – not filtered for ‘funnies’ 
•Further downloads planned in future 
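Since the FTP download is described as raw SDF, a consumer would typically stream it record by record rather than load it whole. The sketch below relies only on the general SDF convention that each record ends with a `$$$$` line and that data items are introduced by `> <TAG>` headers. The tag name `ID`, the sample records, and both function names are hypothetical; the actual fields in the SureChEMBL files are not stated here.

```python
import io
from typing import Iterator, Optional, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield one SDF record (as text) per '$$$$' delimiter line."""
    lines = []
    for line in handle:
        if line.strip() == "$$$$":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    # Tolerate a final record that lacks the closing delimiter.
    if any(l.strip() for l in lines):
        yield "".join(lines)


def data_item(record: str, tag: str) -> Optional[str]:
    """Return the first value line of a '> <tag>' data item, if present."""
    lines = record.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(">") and f"<{tag}>" in line:
            return lines[i + 1].strip() if i + 1 < len(lines) else None
    return None


# Two abbreviated, made-up records (empty mol blocks) for illustration only.
SAMPLE_SDF = (
    "example-1\n  sketch\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
    "> <ID>\nSCHEMBL1895\n\n$$$$\n"
    "example-2\n  sketch\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
    "> <ID>\nEXAMPLE-2\n\n$$$$\n"
)

if __name__ == "__main__":
    for rec in iter_sdf_records(io.StringIO(SAMPLE_SDF)):
        print(data_item(rec, "ID"))
```

Because the download is not filtered for ‘funnies’, a real consumer would probably also want to catch and log records that a chemistry toolkit fails to parse.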
OCR Errors 
•Small, poor quality images 
•OCR errors in names (OCR done by IFI). There is an OCR correction step, but it cannot fix all errors 
-> ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol- 3- vDbenzamide’ 
•Reliability better for US patents due to inclusion of mol files 
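To illustrate why a correction step cannot fix every error, here is a toy sketch of rule-based OCR clean-up applied to the garbled name above. The three rules are assumptions chosen for this example and are not the actual SureChEMBL or IFI correction logic. They repair the character confusion and stray spaces, but the remaining damage (`r(`, `methvn`, `vD`) is too ambiguous for simple rules.

```python
import re


def clean_ocr_name(name: str) -> str:
    """Apply a few illustrative OCR repair rules to a chemical name."""
    # Rule 1: 'Λ/' is a common misreading of an italic 'N' locant.
    name = name.replace("\u039b/", "N")
    # Rule 2: drop stray whitespace around hyphens.
    name = re.sub(r"\s*-\s*", "-", name)
    # Rule 3: rejoin indicated hydrogen such as '1 H' -> '1H'.
    name = re.sub(r"(\d) H\b", r"\1H", name)
    return name


GARBLED = (
    "2,6-Difluoro-\u039b/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol- 3- "
    "vDbenzamide"
)

if __name__ == "__main__":
    # The output still contains 'r(', 'methvn' and 'vD'.
    print(clean_ocr_name(GARBLED))
```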
Name Conversion Errors 
Pentyl 
Thiol 
2-(2-((3-chloro-6-methyl-5,5-dioxido-6,11-dihydrodibenzo[c,f][1,2]thiazepin-11-yl)amino)ethoxy)acetic acid
ChEMBL – SureChEMBL Overlap 
•InChI-based comparison using filtered parent compounds 
•ChEMBL: 1.3M 
•SureChEMBL: 12.2M 
•Overlap: 235K (18.4%) 
Filters 
•MW between 100 and 1200 
•#Atoms between 6 and 70 
•ALogP between -10 and 10 
•#C > 0 
•#Rings > 0 
•#C != #Atoms 
•RTB <= 20 
(ChEMBL 18)
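The filters and the InChI-based comparison can be sketched as a predicate plus a set intersection. This is a minimal illustration, not the code used for the analysis: it assumes descriptors have already been computed by some toolkit, treats "between" as inclusive, and uses made-up records with placeholder keys.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Compound:
    inchi: str            # standard InChI or InChIKey of the parent compound
    mol_weight: float     # MW
    num_atoms: int        # #Atoms
    alogp: float          # ALogP
    num_carbons: int      # #C
    num_rings: int        # #Rings
    rotatable_bonds: int  # RTB


def passes_filters(c: Compound) -> bool:
    """Apply the slide's filters (range bounds assumed inclusive)."""
    return (
        100 <= c.mol_weight <= 1200
        and 6 <= c.num_atoms <= 70
        and -10 <= c.alogp <= 10
        and c.num_carbons > 0
        and c.num_rings > 0
        and c.num_carbons != c.num_atoms
        and c.rotatable_bonds <= 20
    )


def overlap(a: Iterable[Compound], b: Iterable[Compound]) -> Set[str]:
    """Return identifiers present in both filtered sets."""
    keys_a = {c.inchi for c in a if passes_filters(c)}
    keys_b = {c.inchi for c in b if passes_filters(c)}
    return keys_a & keys_b


# Made-up toy records; keys and descriptor values are placeholders.
CHEMBL_TOY = [
    Compound("KEY-A", 180.2, 13, 1.3, 9, 1, 3),
    Compound("KEY-B", 78.1, 6, 2.0, 6, 1, 0),    # fails MW and #C != #Atoms
]
SURECHEMBL_TOY = [
    Compound("KEY-A", 180.2, 13, 1.3, 9, 1, 3),
    Compound("KEY-C", 450.5, 32, 3.1, 24, 3, 25),  # fails RTB <= 20
]

if __name__ == "__main__":
    print(overlap(CHEMBL_TOY, SURECHEMBL_TOY))
```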
Future Entity Extraction and Indexing 
•Identify new entity types e.g. proteins, diseases and cell lines 
•Extend using ChEMBL dictionaries + others 
•Ontology/synonym mapping - semantic tagging 
•Target-relevance assessment 
•Protein/biotherapeutic sequence extraction 
•Sequence-based patent searches 
•Enhanced cross-referencing 
•Tag up all commonly used identifiers (Company codes, CAS, ChEBI, ChEMBL, PubChem, ENSEMBL, RefSeq, UniProt,…)
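As a rough illustration of identifier tagging, the sketch below finds CAS Registry Numbers (validated with the standard CAS check digit), ChEMBL IDs and SureChEMBL IDs in free text. The regular expressions and the function names are assumptions made for this example; a production tagger would need dictionaries and context rules for the other identifier types listed.

```python
import re
from typing import Dict, List

CAS_RE = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")
CHEMBL_RE = re.compile(r"\bCHEMBL\d+\b")
SCHEMBL_RE = re.compile(r"\bSCHEMBL\d+\b")


def cas_checksum_ok(part1: str, part2: str, check: str) -> bool:
    """Validate a CAS number: weight digits 1, 2, 3... from the right, mod 10."""
    digits = (part1 + part2)[::-1]
    total = sum(int(ch) * weight for weight, ch in enumerate(digits, start=1))
    return total % 10 == int(check)


def tag_identifiers(text: str) -> Dict[str, List[str]]:
    """Return the identifiers found in text, grouped by type."""
    return {
        "CAS": [
            m.group(0)
            for m in CAS_RE.finditer(text)
            if cas_checksum_ok(*m.groups())
        ],
        "ChEMBL": CHEMBL_RE.findall(text),
        "SureChEMBL": SCHEMBL_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Water (CAS 7732-18-5), CHEMBL25, SCHEMBL1895, and a bogus 1234-56-7."
    print(tag_identifiers(sample))
```

The check digit rejects number-like strings such as page ranges or dates that happen to match the CAS pattern, which matters in patent text full of numeric references.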
EFO – http://www.ebi.ac.uk/efo
Far Future - Bioactivity Data Extraction? 
Target/Assay 
Bioactivity 
Far Future – Markush Extraction? 
-alkyl -aryl -heteroaryl -heterocyclyl -cycloalkyl …. 
Acknowledgements 
•ChEMBL team 
•John Overington 
•Jon Chambers 
•George Papadatos 
•Mark Davies 
•Nathan Dedman 
•Anna Gaulton 
•Digital Science 
•Nicko Goncharoff 
•James Siddle 
•Richard Koks 
Funding: 
•Wellcome Trust Strategic Award for ChEMBL database (WT086151/Z/08/Z & WT104104/Z/14/Z) 
•Open PHACTS - Innovative Medicines Initiative Joint Undertaking (grant no. 115191) 
•European Molecular Biology Laboratory 
•BioMedBridges - European Commission FP7 Capacities Specific Programme (grant no. 284209) 
•Technology Partners:

ICIC 2014 From SureChem to SureChEMBL
