
Eugene Garfield: the father of chemical text mining and artificial intelligence (AI) in cheminformatics

Presented by Roger Sayle on 19th March 2018, at the 255th ACS National Meeting in New Orleans

Published in: Science

  1. Eugene Garfield: the father of chemical text mining and artificial intelligence (AI) in cheminformatics. Roger Sayle, NextMove Software, Cambridge, UK. Legacy of Eugene Garfield, 255th ACS National Meeting, New Orleans, Monday 19th March 2018
  2. Birth of a research field…
     • The computational challenge of locating and resolving or classifying chemicals in text is commonly called “Chemical Named Entity Recognition”.
     • Today this field is an active area of Artificial Intelligence research, a branch of Natural Language Processing, with international community-wide competitions to assess progress (e.g. BioCreative).
     • But this whole field started from a revolutionary Ph.D. thesis from the University of Pennsylvania’s Department of Linguistics in 1961…
  3. An Algorithm for Translating Chemical Names to Molecular Formulas. Eugene Garfield. Originally presented to the Faculty of the Graduate School of Arts and Sciences of the University of Pennsylvania in Partial Fulfilment of the Requirements for the Degree of Doctor of Philosophy, 1961. Reprinted in Essays of an Information Scientist, Vol. 7, pp. 441-513, 1984.
  4. Garfield’s list of morphemes for acyclic organic chemistry
      1. a       11. di      21. in     31. on**
      2. acid    12. e*      22. iod    32. ox**
      3. al      13. en      23. it     33. pent
      4. am      14. eth     24. ium    34. sulf***
      5. an      15. fluor   25. meth   35. tetr
      6. at      16. hept    26. nitr   36. thi***
      7. az      17. hex     27. o*     37. tri
      8. brom    18. hydr    28. oct    38. y*
      9. but     19. id      29. oic    39. yl
     10. chlor   20. im      30. ol     40. yn
  5. Garfield’s algorithm
     1. Ignore all locants.
     2. Retain all parentheses.
     3. Replace all morphemes by their dictionary values.
     4. Resolve the ambiguity of any penta- to octa- occurrences.
     5. Place + between all morphemes except multipliers.
     6. Carry out the multiplications and additions.
     7. Calculate hydrogen using the formula H = 2 + 2C + N - X - 2DB (X = halogen count, DB = double-bond count).
     • cf. Dijkstra’s “Shunting Yard Algorithm”, Nov 1961.
  6. Dictionary values of morphemes
     morpheme   C   N   O   DB
     al         -   -   1   1
     amino      -   1   -   -
     but        4   -   -   -
     en         -   -   -   1
     eth        2   -   -   -
     hydroxy    -   -   1   -
     meth       1   -   -   -
     nitrile    -   1   -   2
     nitro      -   1   2   1
     ol         -   -   1   -
     one        -   -   1   1
     prop       3   -   -   -
     yn         -   -   -   2

     morpheme   mult
     bis        2
     di         2
     hexa       6
     penta      5
     tetra      4
     tri        3
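A minimal sketch of steps 3 and 5 (dictionary substitution and placing + between morphemes), assuming the name has already been split into morphemes with locants removed. The column assignment of each value is an inference from the flattened slide table, the function names are hypothetical, and a flat morpheme list cannot express the parentheses that step 2 retains.

```python
# Dictionary values as read from the slide's table (column assignment inferred).
VALUES = {
    "al": {"O": 1, "DB": 1},
    "amino": {"N": 1},
    "but": {"C": 4},
    "en": {"DB": 1},
    "eth": {"C": 2},
    "hydroxy": {"O": 1},
    "meth": {"C": 1},
    "nitrile": {"N": 1, "DB": 2},
    "nitro": {"N": 1, "O": 2, "DB": 1},
    "ol": {"O": 1},
    "one": {"O": 1, "DB": 1},
    "prop": {"C": 3},
    "yn": {"DB": 2},
}
MULTIPLIERS = {"bis": 2, "di": 2, "hexa": 6, "penta": 5, "tetra": 4, "tri": 3}
ORDER = ("C", "N", "O", "DB")


def value_text(morpheme):
    """Render one dictionary value, e.g. 'nitro' -> 'N+2O+DB'."""
    counts = VALUES[morpheme]
    return "+".join(
        f"{counts[s] if counts[s] > 1 else ''}{s}" for s in ORDER if s in counts
    )


def to_expression(morphemes):
    """Join dictionary values with '+', applying multipliers to the next value."""
    terms = []
    pending = 1  # product of multipliers waiting for the next valued morpheme
    for m in morphemes:
        if m in MULTIPLIERS:
            pending *= MULTIPLIERS[m]
        elif m in VALUES:
            text = value_text(m)
            if pending > 1:
                bare = text in ORDER  # a single symbol needs no parentheses
                text = f"{pending}{text}" if bare else f"{pending}({text})"
            terms.append(text)
            pending = 1
        # Morphemes without a dictionary value (e.g. "yl", "an", "e") add nothing.
    return "+".join(terms)


print(to_expression(["meth", "yl", "amino", "eth", "an", "e"]))  # C+N+2C
print(to_expression(["tetra", "nitro", "meth", "an", "e"]))      # 4(N+2O+DB)+C
```

The first call reproduces the expression the deck gives for methylaminoethane; segmenting a raw name into morphemes, including the multiplier-versus-stem ambiguity of step 4, is deliberately left out.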
  7. Example calculations
     • methylaminoethane → C+N+2C → C3H9N
     • (3-(diethylamino)propyl)ethyl-3-amino-1,4-butanedioic acid → ((2(2C)+N)+3C)+2C+N+4C+2(2O+DB) → C13H26N2O4
     • bis(bis[diethylamino]propylamino)butane → 2(2(2(2C)+N)+3C+N)+4C → C26H60N6
     • hexanitrohexatriene → 6(N+2O+DB)+6C+3DB → C6H2N6O12
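The intermediate expressions in these examples can be evaluated mechanically. The following is a rough sketch of steps 6 and 7 only (multiplications, additions, then the hydrogen formula), not Garfield's original procedure: it uses a small recursive-descent parser, and the function names are hypothetical. It accepts the symbols C, N, O, X (halogen) and DB, integer multipliers, + and parentheses.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\d+|DB|[CNOX]|[()+]")


def count_symbols(expr):
    """Expand an expression such as '6(N+2O+DB)+6C+3DB' into symbol counts."""
    expr = expr.replace(" ", "")
    tokens = TOKEN.findall(expr)
    if "".join(tokens) != expr:
        raise ValueError(f"unexpected character in {expr!r}")
    pos = 0

    def parse_sum():
        nonlocal pos
        total = Counter()
        while True:
            total.update(parse_term())
            if pos < len(tokens) and tokens[pos] == "+":
                pos += 1
            else:
                return total

    def parse_term():
        nonlocal pos
        mult = 1
        if tokens[pos].isdigit():  # optional integer multiplier
            mult = int(tokens[pos])
            pos += 1
        if tokens[pos] == "(":
            pos += 1
            inner = parse_sum()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
        else:
            inner = Counter({tokens[pos]: 1})
            pos += 1
        return Counter({sym: n * mult for sym, n in inner.items()})

    counts = parse_sum()
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses")
    return counts


def molecular_formula(expr):
    """Apply H = 2 + 2C + N - X - 2DB and print the atoms in C, H, N, O, X order."""
    n = count_symbols(expr)
    hydrogens = 2 + 2 * n["C"] + n["N"] - n["X"] - 2 * n["DB"]
    atoms = [("C", n["C"]), ("H", hydrogens), ("N", n["N"]), ("O", n["O"]), ("X", n["X"])]
    return "".join(f"{sym}{k if k > 1 else ''}" for sym, k in atoms if k > 0)


print(molecular_formula("C+N+2C"))              # C3H9N
print(molecular_formula("6(N+2O+DB)+6C+3DB"))   # C6H2N6O12
```

Because the hydrogen count is derived rather than looked up, the dictionary only needs carbon, heteroatom and double-bond contributions for each morpheme.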
  8. Modern name-to-structure
     • Daniel Lowe’s OPSIN (NCI Services, ChemDoodle)
     • OpenEye Scientific Software’s LexiChem (Biovia)
     • ChemAxon’s Name To Structure
     • ACD/Labs’ ACD Name
     • Perkin-Elmer’s ChemDraw Name>Struct
     • InfoChem’s ICN2S
     • NextMove Software’s Sugar&Splice
  9. Automated chemical text mining
  10. Extracting melting points (mps) and reactions
  11. Spelling correction
      Input: di-terf-butyl (4S)-/V-(fert-butoxycarbonyl)-4-{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate
      CaffeineFix corrected to: di-tert-butyl (4S)-N-(tert-butoxycarbonyl)-4-{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate
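The deck does not describe how CaffeineFix works, so the following is only an illustration of dictionary-based correction in general, not CaffeineFix's method. It assumes a tiny hand-made vocabulary and a one-entry OCR confusion table (both hypothetical) and uses fuzzy matching from Python's standard `difflib` module.

```python
import difflib
import re

# Hypothetical vocabulary of name fragments; a real system would need far more.
VOCABULARY = [
    "di", "tert", "butyl", "butoxycarbonyl", "tosyloxy",
    "propyl", "benzyl", "glutamate",
]
# Hypothetical table of known OCR confusions that are not simple misspellings.
OCR_FIXES = {"/V": "N"}

# Split into candidate tokens ("/V" or alphabetic runs) and keep the delimiters.
SPLIT = re.compile(r"(/V|[A-Za-z]+)")


def correct_token(token):
    """Return the corrected token, or the token unchanged if no fix applies."""
    if token in OCR_FIXES:
        return OCR_FIXES[token]
    if token.isalpha() and len(token) >= 3 and token not in VOCABULARY:
        match = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.7)
        if match:
            return match[0]
    return token


def correct_name(name):
    """Correct each token in a chemical name, leaving punctuation untouched."""
    return "".join(correct_token(part) for part in SPLIT.split(name))


garbled = ("di-terf-butyl (4S)-/V-(fert-butoxycarbonyl)-4-"
           "{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate")
print(correct_name(garbled))
```

Short tokens such as the stereodescriptors "S" and "L" are left alone because fuzzy matching on one or two characters is unreliable.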
  13. Foreign language support
      • LeadMine can rapidly convert Korean, Chinese and Japanese chemical names to English as a pre-processing step
      N-(5-氯-噻唑-2-基)-2-(2,4-二氟-苯氧基)-2,2-二氟-乙酰胺
      N-(5-chloro-thiazol-2-yl)-2-(2,4-difluoro-phenoxy)-2,2-difluoro-acetamide
  14. Recent state-of-the-art
      • Oxytocin
        – L-cysteinyl-L-tyrosyl-L-isoleucyl-L-glutaminyl-L-asparagyl-L-cysteinyl-L-prolyl-L-leucyl-glycinamide (1->6)-disulphide
        – C43H66N12O12S2
      • Mipomersen
        – O2'-(2-methoxyethyl)-P-thio-guanylyl-(3'->5')-O2'-(2-methoxyethyl)-5-methyl-P-thio-cytidylyl-(3'->5')-O2'-(2-methoxyethyl)-5-methyl-P-thio-cytidylyl-(3'->5')-O2'-(2-methoxyethyl)-5-methyl-P-thio-uridylyl-(3'->5')-O2'-(2-methoxyethyl)-5-methyl-P-thio-cytidylyl-(3'->5')-2'-deoxy-P-thio-adenylyl-(3'->5')-2'-deoxy-P-thio-guanylyl-(3'->5')-P-thio-thymidylyl-(3'->5')-2'-deoxy-5-methyl-P-thio-cytidylyl-(3'->5')-P-thio-thymidylyl-(3'->5')-2'-deoxy-P-thio-guanylyl-(3'->5')-2'-deoxy-5-methyl-P-thio-cytidylyl-(3'->5')-P-thio-thymidylyl-(3'->5')-P-thio-thymidylyl-(3'->5')-2'-deoxy-5-methyl-P-thio-cytidylyl-(3'->5')-O2'-(2-methoxyethyl)-P-thio-guanylyl-(3'->5')-O2'-(2-methoxyethyl)-5-methyl-P-thio-cytidylyl-(3'->5')-O2'-(2-methoxyethyl)-P-thio-adenylyl-(3'->5')-O2'-(2-methoxyethyl)-5-methyl-P-thio-cytidylyl-(3'->5')-O2'-(2-methoxyethyl)-5-methyl-cytidine
        – C230H324N67O122P19S19
  15. Everything old is new again
      Recently attention has returned to Molecular Formulae (MF) and the challenges of turning them into chemical structures (connection tables)
      – Line formulae
        • CH3CH2CH2Cl (complete molecule)
        • CH2CH2 (linker)
        • CH3CH2 (substituent)
      – Inorganic salts
        • MgSO4
        • AuCl2
      – Sum formulae
        • C20H25NO6
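Turning a formula into a connection table is the hard part and is not attempted here. As a modest first step, this sketch (an assumption-laden illustration with a hypothetical function name) reads the element counts out of the flat formulae listed above; it handles neither parentheses nor charges.

```python
import re
from collections import Counter

# An element symbol (capital letter, optional lowercase letter) and an optional count.
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Sum element counts across a flat formula, e.g. 'CH3CH2CH2Cl'."""
    counts = Counter()
    for symbol, digits in ELEMENT.findall(formula):
        counts[symbol] += int(digits) if digits else 1
    return dict(counts)


print(element_counts("CH3CH2CH2Cl"))  # {'C': 3, 'H': 7, 'Cl': 1}
print(element_counts("C20H25NO6"))    # {'C': 20, 'H': 25, 'N': 1, 'O': 6}
```

A line formula and a sum formula collapse to the same kind of count, which is why the line formula carries strictly more information: the order of its groups also implies connectivity.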
  16. But sometimes with a new twist
      • Peptide formulae
        – Cys(1)-Tyr-Phe-Gln-Asn-Cys(1)-Pro-Arg-Gly-NH2
        – [N(Me)Leu15]orexin B (1-25)
      • Oligosaccharides
        – α-L-Fucp-(1→4)-[β-D-Galp-(1→3)]-β-D-GlcpNAc-(1→3)-β-D-Galp-(1→4)-D-Glc-ol
      • Oligonucleotides
        – 3'-AATG-5'
        – sP-cl2Ade-Ribf
  17. The changing meaning of A.I.
      • “I don’t know what I do any more, but it used to be called Artificial Intelligence”.
      • Once upon a time, AI covered many broad disciplines
        – Problem solving and planning (A*-search, proof-number search, Monte Carlo search, genetic algorithms).
        – Natural Language Processing (NLP).
        – Propositional logic and reasoning (PROLOG).
        – Optical Character Recognition (OCR).
      • Definitions evolve over time; AI is now almost synonymous with (Supervised) Machine Learning.
  18. Garfield’s insights (legacy)
      • Chemical Text Mining to pragmatically solve a real world problem.
        – Determining the probability that a word in text represents a chemical is far less useful than resolving it to a chemical formula for indexing.
        – A chemical formula is sufficiently useful for document indexing, and finesses issues of structural representation.
      • C10H10Fe will retrieve the multiple possible representations of Ferrocene.
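The indexing idea can be sketched as an inverted index keyed by molecular formula. This is a toy illustration under stated assumptions: the document identifiers are invented, the helper names are hypothetical, and formulae are normalised to Hill order (carbon, hydrogen, then the remaining elements alphabetically) so that differently ordered spellings share one key.

```python
import re
from collections import defaultdict


def hill_key(formula):
    """Normalise a flat formula to Hill order, e.g. 'FeC10H10' -> 'C10H10Fe'."""
    counts = defaultdict(int)
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(digits) if digits else 1
    if "C" in counts:
        # Carbon first, hydrogen second, everything else alphabetically.
        order = sorted(counts, key=lambda s: (s != "C", s != "H", s))
    else:
        order = sorted(counts)
    return "".join(f"{s}{counts[s] if counts[s] > 1 else ''}" for s in order)


index = defaultdict(set)  # formula key -> set of document identifiers


def add(doc_id, formula):
    """Record that a document mentions a compound with this formula."""
    index[hill_key(formula)].add(doc_id)


def lookup(formula):
    """Return the documents indexed under this formula, whatever its spelling."""
    return sorted(index.get(hill_key(formula), set()))


# Hypothetical documents that draw or name ferrocene in different ways.
add("doc-1", "C10H10Fe")
add("doc-2", "FeC10H10")
add("doc-3", "C20H25NO6")
print(lookup("C10H10Fe"))  # ['doc-1', 'doc-2']
```

The key discards connectivity on purpose: isomers collide, but no decision about how to draw the metal-ring bonding is ever needed.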
  19. Final words: a glass half full
      • “Certainly if we are to find methods of analysing chemical texts for indexing and other purposes, we cannot expect better than a 50% resolution of the indexing problem in chemistry.”
      • “We will have reaped a very poor harvest if we are able to describe the text of a chemical article grammatically without a corresponding ability to deal with the problem of synonymy”.
      – Eugene Eli Garfield (1925-2017)
  20. Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog roger@nextmovesoftware.com
