Challenges and successes in machine interpretation of Markush descriptions

Presented by Roger Sayle at the 254th ACS National Meeting in Washington DC Aug 2017

Published in: Science

  1. 254th ACS National Meeting, Washington DC, USA, Tuesday 22nd August 2017. Challenges and successes in machine interpretation of Markush descriptions. Daniel Lowe, John Mayfield and Roger Sayle, NextMove Software (and MineSoft), Cambridge, UK
  2. Topics • Specific compounds described by Markush structures – Simplifying the problem. • Progress towards capturing Markush structures – Tackling the problem head-on. • Performing generic structural searches (OntoGrep) – Finessing the problem.
  3. Introduction: Trivial Markush. Figure labels: Markush core, Definitions. US20150376170
  4. Approach 1: simplify • Be careful what you ask for… • Relatively few organizations or databases can capture and process Markush records: Markush registration and duplicate checking, and searching Markush vs. Markush, Compound vs. Markush, and Markush vs. Compound. • Lower-hanging fruit is to export regular compounds from fully exemplified Markush examples, such as those commonly found in R-group tables.
  5. R-group tables (example 1)
  6. R-group tables (example 2) US20160304465A1
  7. R-group tables (example 3)
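
The R-group table approach on the preceding slides can be sketched as a simple template substitution: each table row supplies one substituent per R-group label, and filling the core once per row yields one specific compound. This is a minimal string-level sketch, not the authors' implementation. The core SMILES, the `{R1}`-style placeholders and the table rows are all hypothetical, and each substituent is assumed to be written with its attachment atom first.

```python
def enumerate_rgroup_table(core, rows):
    """Fill the core template once per table row, giving one SMILES per row."""
    compounds = []
    for row in rows:
        smiles = core
        for label, fragment in row.items():
            # Braces around the label stop "R1" from also matching "R11".
            smiles = smiles.replace("{" + label + "}", fragment)
        compounds.append(smiles)
    return compounds


# Hypothetical core and R-group table, for illustration only.
core = "O=C(N{R1})c1ccc({R2})cc1"
rows = [
    {"R1": "C", "R2": "Cl"},
    {"R1": "CC", "R2": "F"},
    {"R1": "C1CC1", "R2": "OC"},
]

compounds = enumerate_rgroup_table(core, rows)
```

A real pipeline would still need to canonicalise the output (the deck uses StdInChI for identity checks) before counting unique compounds.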
  8. Results (2001-June 2016 patent applications)
     Data type | Unique compounds | Not in PubChem | Not in PubChem (SureChEMBL)
     Exemplified compound R-group tables | 621,140 | 496,831 (80.0%) | 532,166 (85.7%)
     Text | 4,759,009 | 564,886 (11.9%) | 911,976 (19.2%)
     Sketches | 4,479,113 | 886,991 (19.8%) | 1,179,229 (26.3%)
     Structural identity checks performed using StdInChI
  9. Approach 2A: encode sketches • Interpretation of Markush core sketch – Substituent variation – Position variation – Frequency variation • Features captured as ChemAxon extended SMILES (CXSMILES)
  10. Generic features. Generic structures contain some type of variation: • substituent • positional • frequency • homology (least conventions). Complexity from conventions, e.g. label for frequency variation: (CH2)n. Complexity from combinations, e.g. substituent + positional + frequency.
  11. Positional variation (figure panels: Original, Naïve Export, As CXSMILES). US20020016333A1-20020207-C00031. As CXSMILES: **.C=1C=C(C=CC1)C(C2=C(N(N=C2*)C3=CC=CC=C3)*)=O.**.**.**.** |$R12;;;;;;;;;;;;;;R2;;;;;;;R1;;;R11;;R13;;R14;;R15$,m:1:2.3.4.5.6.7,23:2.3.4.5.6.7,25:15.16.17.18.19.20,27:15.16.17.18.19.20,29:15.16.17.18.19.20|
  12. Repeated group detection. US20120309735A1-20121206-C00057. C1C(COC1)COC=2N=C(N=CC2)NC3=CC(=CC(=C3)C4=CN=C(S4)N5CC(NCCC5)=O)C |Sg:n:5:n:ht|
  13. Expansion of repeated groups
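
Expanding a repeated group amounts to writing the repeat unit out once for each allowed count. The sketch below shows the idea at string level only. The `{n}` placeholder, the template and the count range are hypothetical, and it assumes a simple chain unit (like the CH2 in (CH2)n) with no ring closures or branches crossing the repeat boundary.

```python
def expand_repeat(template, unit, counts):
    """Return {n: SMILES} with the repeat unit written out n times."""
    expanded = {}
    for n in counts:
        # Plain string repetition; only safe for linear, closure-free units.
        expanded[n] = template.replace("{n}", unit * n)
    return expanded


# Hypothetical example: a (CH2)n linker with n from 1 to 3.
variants = expand_repeat("OC(=O){n}N", "C", range(1, 4))
```

Working on the molecular graph rather than the string would be needed for anything beyond such simple linkers.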
  14. R-group sketch interpretation • ChemDraw files from US patents used • Substituent attachment points detected: (a) always interpreted as attachment points; (b) interpreted as attachment points when in an R-group table
  15. Formula (re)interpretation. Columns: Input, ChemDraw 15, This work. Inputs: HATU, C4F9, H3PO4, CON(cHex)2, III-2. Two result cells read "No result".
  16. Compared to previous efforts • SCRIPDB: CDX and Molfile with Open Babel; no (apparent) correction; fragments split • SureChEMBL: images with CLiDE; some correction; partial recognition (e.g. 2 of 4 in collection). Example US07714017_C00037, figure panels: SureChEMBL (no result), SCRIPDB, CID 59301147, OSRA, IMAGO, NextMove, PubChem. Panel captions: layout modified, layout modified, layout generated.
  17. Approach 2B: “Full” Markush • Traditionally, the “full” Markush problem is to capture definition lists, R-groups, which can be generic (homology groups) and recursive. • Definitions can now be generic – Lists of possible substituents – Substituents may be homology groups e.g. C1-6 alkyl, heteroaryl – Ranges of values for repeated linkers – Constraints e.g. R1 is not x if R2 is y
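
A homology group such as "C1-6 alkyl" carries a carbon-count range rather than a single structure. As a rough illustration, the sketch below parses that range and lists only the straight-chain members as SMILES. The regular expression and function names are hypothetical, and branched isomers, which a real C1-6 alkyl definition also covers, are deliberately left out.

```python
import re

# Hypothetical pattern covering spellings such as "C1-6 alkyl" and "C1-C6 alkyl".
ALKYL_RANGE = re.compile(r"C(\d+)\s*-\s*C?(\d+)\s*alkyl")


def parse_alkyl_range(phrase):
    """Return (low, high) carbon counts, or None if the phrase is not an alkyl range."""
    match = ALKYL_RANGE.fullmatch(phrase.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def straight_chain_alkyls(low, high):
    """Straight-chain members only; branched isomers are not enumerated."""
    return ["C" * n for n in range(low, high + 1)]


low, high = parse_alkyl_range("C1-6 alkyl")
members = straight_chain_alkyls(low, high)
```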
  18. Markush definition capture • Map all variants of common Markush phrase definitions to one of the following:
      Phrase type | Example
      rgroupAssignment* | “R1 is”, “R5 selected from”
      numberAssignment | “m is an integer from 1 to 4”
      precedingGroupSubstituted | “optionally substituted by”
      compoundVariants | “and stereoisomers”
      markushGroup | “not present”, “double bond”
      *Can be qualified to indicate that the R-groups are combined, e.g. to form a ring
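
One way to picture the phrase mapping above is a small ordered list of patterns, tried in turn. The sketch below is an assumption-laden illustration: the phrase type names come from the slide, but the regular expressions are hypothetical and cover only the example phrases shown, not the many variants a real grammar would need.

```python
import re

# Phrase types are from the slide; the patterns are illustrative guesses.
PHRASE_PATTERNS = [
    ("numberAssignment",
     re.compile(r"\b[a-z]\s+is\s+(?:an\s+integer\s+)?(?:from\s+)?\d+\s+to\s+\d+", re.I)),
    ("rgroupAssignment",
     re.compile(r"\bR\d+\s+(?:is|(?:is\s+)?selected\s+from)\b", re.I)),
    ("precedingGroupSubstituted",
     re.compile(r"\boptionally\s+substituted\s+(?:by|with)\b", re.I)),
    ("compoundVariants",
     re.compile(r"\band\s+stereoisomers\b", re.I)),
    ("markushGroup",
     re.compile(r"\b(?:not\s+present|double\s+bond)\b", re.I)),
]


def classify_phrase(text):
    """Return the first matching phrase type, or None if nothing matches."""
    for phrase_type, pattern in PHRASE_PATTERNS:
        if pattern.search(text):
            return phrase_type
    return None
```

The number pattern is tried before the R-group pattern so that "m is an integer from 1 to 4" is not mistaken for an R-group assignment.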
  19. US20150376166 (annotated example; labels: Rgroup assignment, Number assignment). Formula parsing: O(C(*)(*)CO)[*:1] |$;;R1'_p;R1'_p;;;_AP1$|. Name-to-structure: OC* |$;;_AP1$|
  20. Name-to-structure of generic chemical names • Modified version of OPSIN, the chemical name-to-structure program, supporting: – Positional variation – Homology groups
  21. Positional variation. Ar2 is tetrazolyl, triazolyl, oxadiazolyl, thiadiazolyl, pyrazolyl, imidazolyl, oxazolyl, thiazolyl, isoxazolyl, isothiazolyl, furanyl, thienyl, pyrrolyl, pyrimidinyl, pyrazinyl, pyridinyl, hydroxypyridinyl, quinolinyl, isoquinolinyl, or indolyl. O*.N1=C(C=CC=C1)* |$;;;;;;;;_AP1$,m:1:4.5.6.7|
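
In the CXSMILES strings on these slides, positional variation is carried by the `m:` field: `m:1:4.5.6.7` says that atom 1 (the `*` bonded to the hydroxy oxygen) may attach to any of atoms 4 to 7. The sketch below pulls those entries out of the extension block. It is a simplified reader written for the examples in this deck, not a full CXSMILES parser, and the function name is hypothetical.

```python
import re

ENTRY = re.compile(r"(\d+):(\d+(?:\.\d+)*)")


def parse_positional_variation(cx_extension):
    """Map each variable-attachment atom index to its candidate atom indices."""
    body = cx_extension.strip().strip("|")
    # Drop the $...$ atom-label block so its contents cannot be misread.
    body = re.sub(r"\$[^$]*\$", "", body)
    variation = {}
    in_m_field = False
    for field in body.split(","):
        field = field.strip()
        if field.startswith("m:"):
            in_m_field = True
            field = field[2:]
        elif not in_m_field:
            continue
        match = ENTRY.fullmatch(field)
        if match is None:
            # A field of another kind ends the run of m: entries.
            in_m_field = False
            continue
        center = int(match.group(1))
        variation[center] = [int(a) for a in match.group(2).split(".")]
    return variation


hydroxypyridinyl = parse_positional_variation("|$;;;;;;;;_AP1$,m:1:4.5.6.7|")
```

Entries after the first omit the `m:` prefix, as in the `m:1:...,23:...` string on the earlier positional-variation slide, which is why the reader keeps consuming fields until one stops looking like `center:atoms`.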
  22. Approach 3: OntoGrep • Form follows function… OntoGrep: Name=Search • Classical approaches to handling Markush start by defining an intermediate mathematical/graph representation, and decomposing the problem into (1) matching these representations and (2) entering/capturing these representations. • An approach inspired by artificial intelligence is for the semantics of the search to be defined by natural language (text representations).
  23. Motivating examples • “Alleged” PubChem examples – zinc compounds – boronic acids – C6H12O • Frequently text-mined entities – alkane – heterocycle – inorganic acid – solvent
  24. 24. 254th ACS National Meeting, Washington DC, USA, Tuesday 22nd August 2017 Radioactive smarts [!0,!#1;!1,!#1;!2,!#1;!0,!#2;!3,!#2;!4,!#2;!0,!#3;!6,!#3;!7,!#3;!0,!#4;!9,!#4;!0,!#5;!10,!#5;!11,!#5;!0,!#6;!12,!#6;!13,!#6;!0,!#7;!14,! #7;!15,!#7;!0,!#8;!16,!#8;!17,!#8;!18,!#8;!0,!#9;!19,!#9;!0,!#10;!20,!#10;!21,!#10;!22,!#10;!0,!#11;!23,!#11;!0,!#12;!24,!#12;!25,! #12;!26,!#12;!0,!#13;!27,!#13;!0,!#14;!28,!#14;!29,!#14;!30,!#14;!0,!#15;!31,!#15;!0,!#16;!32,!#16;!33,!#16;!34,!#16;!36,!#16;!0, !#17;!35,!#17;!37,!#17;!0,!#18;!36,!#18;!38,!#18;!40,!#18;!0,!#19;!39,!#19;!41,!#19;!0,!#20;!40,!#20;!42,!#20;!43,!#20;!44,!#20; !46,!#20;!0,!#21;!47,!#21;!0,!#22;!46,!#22;!47,!#22;!48,!#22;!49,!#22;!50,!#22;!0,!#23;!51,!#23;!0,!#24;!50,!#24;!52,!#24;!53,!# 24;!54,!#24;!0,!#25;!55,!#25;!0,!#26;!54,!#26;!56,!#26;!57,!#26;!58,!#26;!0,!#27;!59,!#27;!0,!#28;!58,!#28;!60,!#28;!61,!#28;!62, !#28;!64,!#28;!0,!#29;!63,!#29;!65,!#29;!0,!#30;!64,!#30;!66,!#30;!67,!#30;!68,!#30;!70,!#30;!0,!#31;!69,!#31;!71,!#31;!0,!#32;! 70,!#32;!72,!#32;!73,!#32;!74,!#32;!0,!#33;!75,!#33;!0,!#34;!74,!#34;!76,!#34;!77,!#34;!78,!#34;!80,!#34;!0,!#35;!79,!#35;!81,!# 35;!0,!#36;!79,!#36;!80,!#36;!82,!#36;!83,!#36;!84,!#36;!86,!#36;!0,!#37;!85,!#37;!0,!#38;!84,!#38;!86,!#38;!87,!#38;!88,!#38;!0, !#39;!89,!#39;!0,!#40;!90,!#40;!91,!#40;!92,!#40;!94,!#40;!96,!#40;!0,!#41;!93,!#41;!0,!#42;!92,!#42;!94,!#42;!95,!#42;!96,!#42; !97,!#42;!98,!#42;!0,!#44;!96,!#44;!98,!#44;!99,!#44;!100,!#44;!101,!#44;!102,!#44;!104,!#44;!0,!#45;!103,!#45;!0,!#46;!102,!#4 6;!104,!#46;!105,!#46;!106,!#46;!108,!#46;!110,!#46;!0,!#47;!107,!#47;!109,!#47;!0,!#48;!106,!#48;!108,!#48;!110,!#48;!111,!# 48;!112,!#48;!114,!#48;!0,!#49;!113,!#49;!0,!#50;!112,!#50;!114,!#50;!115,!#50;!116,!#50;!117,!#50;!118,!#50;!119,!#50;!120,! 
#50;!122,!#50;!124,!#50;!0,!#51;!121,!#51;!123,!#51;!0,!#52;!120,!#52;!122,!#52;!123,!#52;!124,!#52;!125,!#52;!126,!#52;!0,!#5 3;!127,!#53;!0,!#54;!124,!#54;!126,!#54;!128,!#54;!129,!#54;!130,!#54;!131,!#54;!132,!#54;!134,!#54;!136,!#54;!0,!#55;!133,!# 55;!0,!#56;!130,!#56;!132,!#56;!134,!#56;!135,!#56;!136,!#56;!137,!#56;!138,!#56;!0,!#57;!139,!#57;!0,!#58;!136,!#58;!138,!#5 8;!140,!#58;!142,!#58;!0,!#59;!141,!#59;!0,!#60;!142,!#60;!143,!#60;!145,!#60;!146,!#60;!148,!#60;!0,!#62;!144,!#62;!149,!#62; !150,!#62;!152,!#62;!154,!#62;!0,!#63;!153,!#63;!0,!#64;!154,!#64;!155,!#64;!156,!#64;!157,!#64;!158,!#64;!160,!#64;!0,!#65;!1 59,!#65;!0,!#66;!156,!#66;!158,!#66;!160,!#66;!161,!#66;!162,!#66;!163,!#66;!164,!#66;!0,!#67;!165,!#67;!0,!#68;!162,!#68;!16 4,!#68;!166,!#68;!167,!#68;!168,!#68;!170,!#68;!0,!#69;!169,!#69;!0,!#70;!168,!#70;!170,!#70;!171,!#70;!172,!#70;!173,!#70;!1 74,!#70;!176,!#70;!0,!#71;!175,!#71;!0,!#72;!176,!#72;!177,!#72;!178,!#72;!179,!#72;!180,!#72;!0,!#73;!180,!#73;!181,!#73;!0,! #74;!182,!#74;!183,!#74;!184,!#74;!186,!#74;!0,!#75;!185,!#75;!0,!#76;!184,!#76;!187,!#76;!188,!#76;!189,!#76;!190,!#76;!192, !#76;!0,!#77;!191,!#77;!193,!#77;!0,!#78;!192,!#78;!194,!#78;!195,!#78;!196,!#78;!198,!#78;!0,!#79;!197,!#79;!0,!#80;!196,!#80 ;!198,!#80;!199,!#80;!200,!#80;!201,!#80;!202,!#80;!204,!#80;!0,!#81;!203,!#81;!205,!#81;!0,!#82;!204,!#82;!206,!#82;!207,!#8 2;!208,!#82]
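
The long pattern above has a regular shape: for every stable nuclide it contains one term `!mass,!#Z`, and the terms are joined with `;` (low-precedence AND), so an atom matches only if it is none of the listed nuclides. Such a pattern is easier to generate than to type. The sketch below builds it from a table of stable masses per atomic number; the function name and the two-element table are illustrative, and mass 0 is added for every element in the same way the slide's pattern does for an unspecified isotope.

```python
def radioactive_smarts(stable_isotopes):
    """Build a SMARTS atom that excludes every (element, stable mass) pair."""
    terms = []
    for atomic_number in sorted(stable_isotopes):
        # Mass 0 stands for "no isotope specified", treated here as stable.
        for mass in [0] + sorted(stable_isotopes[atomic_number]):
            terms.append(f"!{mass},!#{atomic_number}")
    return "[" + ";".join(terms) + "]"


# Illustrative table for hydrogen and helium only.
pattern = radioactive_smarts({1: [1, 2], 2: [3, 4]})
```

With this two-element table the output reproduces the opening terms of the pattern on the slide.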
  25. Example OntoGrep queries • nitrogen containing heterocycles • cationic ring systems • cyclic alkanes • branched acyclic alkanes • transuranic elements • zinc compounds • binary compounds • Lewis acids • atropisomers • polyspiro ring systems • polyatomic elements • inorganic salts • radioactive compounds • radioactive elements • monocyclic ring systems • neutral compounds • uncharged compounds • zwitterionic compounds • carbon containing inorganics • zinc oxides • metal chlorides • iron halides
  26. Patent text mining • The classic example of pharmaceutical patent busting is the 2009 Bayer patent for Vardenafil (Levitra), entitled “2-phenyl substituted imidazotriazinones as phosphodiesterase inhibitors”, US7696206B2. • How much information can/could be mined from the title alone?
  27. 2-phenyl substituted triazinones
  28. Imidazotriazines: all 14 imidazotriazines
  29. Pistachio: Siri for chemists
  30. Conclusions • Specific compounds can be reconstructed from a generic structure and specific R-group definitions – Immediately useful for key-compound extraction and compound-property relationship extraction • Simple generic structural definitions can be captured, but many cases are still too complex • “OntoGrep” technology brings implicit Markush structure querying to high school student level.
  31. Acknowledgements • This work was made possible by funding from: – GlaxoSmithKline – Minesoft – NCBI, NIH – Novartis – Vertex Pharmaceuticals
  32. Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog roger@nextmovesoftware.com
