Getting the Big Picture by Joining up the SAR dots



Published in: Technology

  1. Getting the Big Picture by Joining up the SAR dots
     Large-scale integration of structure and bioactivity data
     The 9th Annual Pharmaceutical IT Congress 2011
     Sorel Muresan, AstraZeneca R&D Mölndal, DECS Computational Sciences
  2. WO patents with the classification code C07D
     Query performed using the European Patent Office search interface
     DECS | CompSci
  3. Driver – explosion in SAR data
     • Chemical information landscape changing fast
     • Databases, journal articles, patents, internal docs
     [Figure labels: 2006, 2008]
     Southan, C.; Varkonyi, P.; Muresan, S., J. Cheminfo. 2009, 1:10
  4. The Challenge – Information deluge
     • Volume
     • Complexity
     • Unstructured content
  5. Since 2006: >1M chemistry publications per year
     [Figure: number of articles (diamonds) and patents (open boxes) abstracted annually by Chemical Abstracts]
     Bachrach, J. Cheminformatics 2009, 1:2
  6. Number of structures per year from J Med Chem
     W. Patrick Walters; Jeremy Green; Jonathan R. Weiss; Mark A. Murcko; J. Med. Chem. Article ASAP, DOI: 10.1021/jm200504p
     Copyright © 2011 American Chemical Society
  7. SAR key entities and relationships
     [Diagram: Unstructured Data from Documents -> Expert Extraction or Text Mining -> Structured Entries in Relational Databases]
     Southan, C.; Boppana, K.; Jagarlapudi, S.; Muresan, S., J. Cheminfo. 2011, 3:14
  8. Manually extracted SAR data (commercial)
     • GOSTAR (GVKBIO Online Structure Activity Relationship Database) is a comprehensive database that captures explicit relationships between the three entities of publications, compounds and sequences.
     • It includes 2.6 million compounds linked to 3,500 sequences, with 12.5M SAR points extracted from 43,000 patents and 67,000 articles from 125 journals.
  9. SAR data (public)
     • PubChem
       • The NCBI public informatics backbone for the NIH Molecular Libraries Initiative, focused on small molecules as systems biology probes and potential therapeutic agents. The statistics are 30.5 million compounds with 85.6 million links. Of the compounds, 1654K have been tested in 504K assays.
     • ChEMBL
       • Includes drugs, small molecules from the medicinal chemistry or biochemical literature, and their targets. It contains 1,060,258 distinct compounds extracted by expert manual curation from 42,516 publications, with 5,479,146 activities, including SAR and ADMET values. This data is mapped to 8,603 targets.
  10. Extracting chemical entities from text
     Collaboration with IBM Research Almaden to apply text analytics technology to analyze intellectual property and scientific literature
     - 10 million full text patents
     - 11 million structures
     - 12% out of 46M parent structures in Chemistry Connect
  11. Chemical Named Entity Recognition (NER)
     Name: 7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
     -> Name-to-Structure software ->
     SMILES: CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl
  12. Extracting chemical entities from text
     The biggest cause of missing compounds when extracting chemical entities from text is the presence of typographical errors: human errors, OCR failures, hyphenation and multiple-line issues, etc.
     • Automated spelling correction with CaffeineFix from NextMove Software
       • CaffeineFix significantly improves extraction rates (22% increase from D=0 to D=1)
       • Name-to-structure (n2s) tools are complementary (40% of the structures come from single n2s contributions)
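The D=0 to D=1 comparison above can be illustrated with a minimal sketch of dictionary-based spelling correction. It reads D as the maximum allowed edit distance, which is an assumption, and it is not CaffeineFix's actual implementation. The lexicon, `edits1` and `correct` are hypothetical names introduced only for this example.

```python
import string

# Hypothetical mini-lexicon of name fragments; a real system would rely on
# a much larger chemical dictionary.
LEXICON = {"chloro", "dihydro", "methyl", "phenyl", "benzodiazepin"}
ALPHABET = string.ascii_lowercase


def edits1(token):
    """Return every string at edit distance 1 from token (lower-case letters only)."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return (deletes | replaces | inserts) - {token}


def correct(token, max_distance=1):
    """Return the matching lexicon entry, or None if there is no unambiguous match."""
    token = token.lower()
    if token in LEXICON:  # D=0: exact match only
        return token
    if max_distance >= 1:  # D=1: one insertion, deletion or substitution
        candidates = edits1(token) & LEXICON
        if len(candidates) == 1:
            return candidates.pop()
    return None
```

With `max_distance=0` an OCR-damaged token such as `pheny1` is lost; with `max_distance=1` it is recovered as `phenyl`, which is the kind of gain the slide reports.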
  13. Structure standardisation
     “The big merge” requires:
     • A common set of chemistry and biology rules applied carefully & consistently across databases
     Muresan, S.; Sitzmann, M.; Southan, C., Biocomputing and Drug Discovery, 2011
  14. Chemistry Connect
  15. Technical Overview - ETL
     [Diagram stages: Data Sources, Extraction, Transformation, Loading]
     [Diagram labels: Text Files; Python Scripts; Structure Normalization; (chemistry); Property calc; Oracle PL/SQL; Oracle DB (ext tables); Pipeline Pilot; (biological results); Web Service]
  16. Technical Overview - Application
     [Diagram labels: HTML; Java; Oracle 11g; WebLogic Server; Direct 7; REST (and SOAP) services; .Net; PipelinePilot; Knime; Excel]
  17. Source content in Chemistry Connect

     Source         Structures   % unique   Cpd/Str   Syn/Str
     ChemSpider       18922316         50      1.07       1.8
     Reaxys           15535377         59      1.12       2.0
     IBM patents      11038533         51      1.00       n/a
     PubChemBE         4675643        n/a      1.03       n/a
     ACD               4452644         73      1.01       1.3
     eMolecules        4213813         19      1.01       n/a
     TRPharma          3268613        n/a      1.03       n/a
     GOSTAR            3128567         27      1.00       3.3
     ChEMBL             940905        n/a      1.05       1.6
     TRIntegrity        307685         27      1.00       1.3
     AZReagents          78265        3.4      1.73       3.4
     TRPartnering        17901         10      1.00       1.0
     ChEBI               13191        n/a      1.31       5.2
     HMDB                 7789         53      1.00      13.4
     DrugBank             6359        n/a      1.04       5.0
     TTD                  2663        4.9      1.27       n/a
     Bioprint             2481        n/a      1.00       n/a

     Muresan, S. et al., Drug Discovery Today 2011, in print
  18. Finding a common language
     Acetaminophen: >1000 synonyms. Those shown on the slide:
     [3H]Acetaminophen; 10066-90-7; 103-90-2; 1047-607-00; 1169-894-12; 16110-10-4; 222 AF; 222-AF; 3-(glutathion-S-yl)acetaminophen; 37519-14-5; 3-hydroxyacetaminophen; 4-(Acetylamino)phenol; 4-13-00-01091; 4-ACETAMIDOPHENOL; 4-Acetaminophenol; 4-ACETYLAMINOPHENOL; 4-Hydroxyacetanilide; 4-HYDROXYACETANILIDE; 4-HYDROXYANILID KYSELINY OCTOVE; 4-hydroxyphenolacetamide; 644/4046; 644/7502; 64889-81-2; 659/9501; 77097-85-9; 840-416-00; 872-667-00; 878-022-04; 878-022-09; 878-022-14; 878-022-19; 882-720-04; 882-720-07; 882-720-10; 882-720-13; 882-720-16; 882-720-20;
     A F ANACIN; A PER; A.F. ANACIN; AAP; aa-sulfate; AA-sulphate; Abenol; Abensanil; ABROL; ABROLET; AC112578; AC112579; Acamol; Accu-Tap; Acenol; Acenol (pharmaceutical); Acephen; Acertol; Aceta; Aceta Elixir; Aceta Tablets; Acetaco; Acetagesic; Acetalgin; ACETAMIDE, N-(4-HYDROXYPHENYL)-; ACETAMIDE, N-(P-HYDROXYPHENYL)-; Acetamidophenol; Acetaminofen; Acetaminophen; Acetaminophen (4-hydroxyacetanilide); Acetaminophen glucuronide(55%); acetaminophen sulfate; Acetaminophen sulfate(30%); acetaminophen sulphate; Acetaminophen Uniserts; acetaminophene; Acetamol; ACETANILIDE, 4-HYDROXY-; Acetavance; Acetofen; ACETOMINOPHEN;
     Actamin; Actamin Extra; Actamin Super; Actifed Plus; Actimol; Actimol Chewable Tablets; Actimol Childrens Suspension; Actimol Infants Suspension; Actimol Junior Strength Caplets; Actron; Afebrin; Afebryl; Aferadol; AG10223; AG12029; AG124687; AG12800; AG12948; Amadil; Aminofen; Aminofen Max; Anacin; Anacin-3; Anacin-3 Extra Strength; Anadin dla dzieci; Anaflon; Analter; Anapap; Andox; Anelix; Anexsia; Anexsia 10/660; Anexsia 5/325; Anexsia 7.5/325; Anexsia 7.5/650; Anhiba; Anoquan; Anti-Algos; Antidol; Apacet; Apacet Capsules
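One way to picture a synonym dictionary of this kind is a normalised lookup from name to structure record. The following is a rough sketch only: the structure IDs, the grouping of names under each ID, and the crude normalisation rule (lower-case, strip punctuation and spaces) are all assumptions made for illustration, not a description of how Chemistry Connect works.

```python
import re
from collections import defaultdict


def normalise(term):
    """Lower-case a term and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def build_index(records):
    """Map each normalised synonym to the set of structure IDs it names."""
    index = defaultdict(set)
    for structure_id, synonyms in records.items():
        for term in synonyms:
            index[normalise(term)].add(structure_id)
    return index


# Hypothetical structure IDs and grouping; the names themselves appear on the slide.
RECORDS = {
    "STR-1": ["Acetaminophen", "4-Hydroxyacetanilide", "4-ACETAMIDOPHENOL", "Acamol"],
    "STR-2": ["acetaminophen sulfate", "acetaminophen sulphate"],
}
INDEX = build_index(RECORDS)


def lookup(term):
    """Return the sorted structure IDs that a free-text name resolves to."""
    return sorted(INDEX.get(normalise(term), set()))
```

Here `lookup("4-hydroxyacetanilide")` and `lookup("ACETAMINOPHEN")` resolve to the same record despite differences in case and punctuation. A production dictionary would need more careful normalisation, since stripping punctuation can merge names that differ chemically.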
  19. Word of the Day: Crowdsourcing
  20. Exact match source comparisons
     [Figure labels: sources that include known drugs; predominantly patent-derived compounds]
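An exact-match comparison between sources can be sketched with plain set operations, assuming every structure has already been reduced to a standardised key. The source names and keys below are toy data, and `percent_unique` is only one plausible reading of the "% unique" column in the source-content table; the slides do not define how that figure was computed.

```python
from itertools import combinations


def pairwise_overlap(sources):
    """Count the structure keys shared by each pair of sources (exact match)."""
    return {
        (a, b): len(sources[a] & sources[b])
        for a, b in combinations(sorted(sources), 2)
    }


def percent_unique(sources):
    """Percentage of each source's keys that occur in no other source."""
    result = {}
    for name, keys in sources.items():
        others = set().union(*(k for n, k in sources.items() if n != name))
        result[name] = 100.0 * len(keys - others) / len(keys) if keys else 0.0
    return result


# Toy data: hypothetical structure keys, not real database content.
SOURCES = {
    "SourceA": {"K1", "K2", "K3", "K4"},
    "SourceB": {"K3", "K4", "K5"},
    "SourceC": {"K4", "K6"},
}
```

For the toy data, `pairwise_overlap(SOURCES)` reports two keys shared by SourceA and SourceB, and `percent_unique(SOURCES)["SourceA"]` is 50.0. The result is only as good as the keys, which is why consistent structure standardisation has to come first.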
  21. Chemistry Connect - Synonyms Searches
  22. Chemistry Connect - Structure Searches
  23. Chemistry Connect - Patent Searches
  24. Chemistry Connect - Test & Result Searches
  25. Different Questions, Common Language
     Questions:
     • What compounds have been described in document D?
     • What compounds bind target X with an affinity greater than A?
     • What targets does compound C bind with an affinity greater than A?
     • What compounds have AZ patented on target X?
     • What is the structure for this development compound?
     • How can I quickly get the SAR data from this patent?
     [Concept diagram labels: Target, Pathway, Institute, People, Disease, Compound, Bioprocess, MoA, Test, Study, Drug, Species, BMO (AE)]
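Two of the questions above map naturally onto queries against a compound-target-activity table. The sketch below uses an in-memory SQLite database as a stand-in for illustration; the table name, column names and rows are made up and do not reflect the actual Chemistry Connect schema. "Affinity greater than A" is read as a measured value (for example Ki or IC50 in nM) at or below a threshold, since lower values mean tighter binding.

```python
import sqlite3


def compounds_binding_target(conn, target, max_nm):
    """Return (compound, affinity_nm) rows for a target, most potent first."""
    return conn.execute(
        "SELECT compound, affinity_nm FROM activity "
        "WHERE target = ? AND affinity_nm <= ? ORDER BY affinity_nm",
        (target, max_nm),
    ).fetchall()


def targets_for_compound(conn, compound, max_nm):
    """Return (target, affinity_nm) rows for a compound, most potent first."""
    return conn.execute(
        "SELECT target, affinity_nm FROM activity "
        "WHERE compound = ? AND affinity_nm <= ? ORDER BY affinity_nm",
        (compound, max_nm),
    ).fetchall()


# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (compound TEXT, target TEXT, affinity_nm REAL)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [
        ("CPD-1", "TARGET-X", 12.0),
        ("CPD-2", "TARGET-X", 850.0),
        ("CPD-1", "TARGET-Y", 40.0),
        ("CPD-3", "TARGET-X", 3.5),
    ],
)
```

Both questions reduce to the same table with a different filter column, which is the sense in which a common data model lets different questions share a common language.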
  26. Take-home messages
     • Chemistry Connect is enabling AZ to intensify its exploitation of synergies between internal and external SAR estate and to shorten the time between hypothesis generation during DMTA cycles
     • Our Chemical Dictionary of 120 million chemical terms has become a crucial cross-mapping resource between chemistry and the scientific literature
     • We cannot wave a magic wand over data quality, provenance issues, drug name space, and the inherent challenges of chemistry representation, but Chemistry Connect gives us a unique overview and amelioration options for each source
  27. A Democracy of Ideas (Acknowledgements)
     • Plamen Petrov
     • Chris Southan
     • Paul Xie
     • Peter Varkonyi
     • Thierry Kogej
     • Christian Tyrchan
     • Magnus Kjellberg
     • Håkan Nilsson
     • Mats Ericsson
     • Jonas Ekengren
     • Marcus Gelderman
     • Ithipol Suriyawongkul
     • Niklas Blomberg
     • Kay Brickmann
     • Ola Engkvist
     • Yidong Yang
     • Hongming Chen
     • and many others…
  28. Thank you!