Advanced grammars for state-of-the-art named entity recognition (NER)

Presented by Roger Sayle at 253rd ACS National Meeting, San Francisco, 4th April 2017

Published in: Science

  1. 1. Advanced grammars for state-of-the-art named entity recognition (NER) Roger Sayle and Daniel Lowe, NextMove Software, Cambridge, UK. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  2. 2. Overview • NextMove Software’s LeadMine text-mining engine internally uses “CaffeineFix” (.cfx) technology for specifying and efficiently matching important terms. • In addition to case-sensitive and case-insensitive term matching, CaffeineFix/LeadMine also supports spelling correction (fuzzy matching). • The most common usage is simply to compile dictionaries into binary form for fast matching. • Advanced users can specify “regular expressions”. • In this presentation, we go beyond regular expressions.
  3. 3. LeadMine v2 entity types • 1. Chemicals: 1.1 Dictionary Names (1.1.1 Abbreviations, 1.1.2 CAS RN Numbers, 1.1.3 Registry Numbers); 1.2 Systematic Names (1.2.1 Functional Groups, 1.2.2 Elements, 1.2.3 Acids, 1.2.4 SMILES, 1.2.5 InChIs); 1.3 Generic Classes; 1.4 Polymers; 1.5 Formulae • 2. Biomolecules: 2.1 Proteins (2.1.1 Targets, 2.1.2 P450s); 2.2 Genes; 2.3 E.C. Numbers; 2.4 PDB Codes • 3. Anatomy: 3.1 Cell Types; 3.2 Cytogenetic Loci • 4. Cell Lines • 5. Diseases • 6. Symptoms • 7. Mechanisms of Action • 8. Species/Organisms • 9. Companies • 10. Named Reactions • 11. Regions • 12. Languages/Possessives
  4. 4. Named entity normal forms • Chemicals: SMILES and/or InChI • Proteins: UniProt • Genes: Entrez GeneID/HGNC • Targets: ChEMBL • Species/Organisms: NCBI Taxonomy ID • Diseases/Symptoms: ICD-10 • Named Reactions: RXNO • Mechanisms of Action: ATC • Many of these can also use NLM MeSH terms.
  5. 5. Example entity dictionary as DAG • Nitrogen-containing heterocycles as a minimal DFA: – Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine, Pyrimidine, Pyrazine • CaffeineFix supports (very large) user dictionaries.
  6. 6. OBO ontologies as dictionaries • In addition to regular TSV (tab-separated value) files for storing dictionaries, LeadMine’s obo2dict also supports OBO ontologies, a convenient method for tracking synonyms and foreign-language forms. [Term] id: RXNO:0000006 name: Diels-Alder reaction synonym: "Diels-Alder cycloaddition" EXACT [] synonym: "ディールス・アルダー反応" EXACT Japanese []
  7. 7. Plural form generation • LeadMine’s pluralize automatically generates English plural forms from singular dictionary entries. diels-alder couplings RXNO:0000006 diels-alder cycloadditions RXNO:0000006 diels-alder reactions RXNO:0000006 acridine syntheses RXNO:0000518 acyclic beckmann rearrangements RXNO:0000564 acyloin condensations RXNO:0000085 olefin metatheses RXNO:0000280
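The plural forms listed on this slide can be approximated with a handful of suffix rules. The sketch below is only an illustration in Python: the function name `pluralize` is borrowed from the slide, but the rules are assumptions and are not LeadMine's actual implementation, which is not shown in the deck.

```python
def pluralize(term):
    """Pluralize the last word of an English term (illustrative rules only)."""
    head, _, last = term.rpartition(" ")
    if last.endswith("sis"):
        # synthesis -> syntheses, metathesis -> metatheses
        plural = last[:-2] + "es"
    elif last.endswith(("s", "x", "z", "ch", "sh")):
        plural = last + "es"
    elif last.endswith("y") and len(last) > 1 and last[-2] not in "aeiou":
        # lorry -> lorries, but assay -> assays
        plural = last[:-1] + "ies"
    else:
        plural = last + "s"
    return f"{head} {plural}" if head else plural


# Singular dictionary entries (term, identifier) taken from the slide.
ENTRIES = [
    ("diels-alder reaction", "RXNO:0000006"),
    ("acridine synthesis", "RXNO:0000518"),
    ("olefin metathesis", "RXNO:0000280"),
]

for term, rxno_id in ENTRIES:
    print(f"{pluralize(term)}\t{rxno_id}")
```

Each output line pairs a generated plural with the identifier of its singular entry, so both forms normalize to the same concept.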
  8. 8. Unusual entities • ISBN, URL, PubMed • Roman Numerals, Date • ColorState, Zip codes • Katakana • HELM, InChI, SMILES, v2000 • Credit Card Numbers • Region • Person • Disease • Journal • SQL statement • Solvent Mixture • Hearst Patterns • Unknown acid • Unknown antibody • Unknown disease • Unknown INN • Ordinal numbers • Cardinal numbers • de, es, fr, it, sv
  9. 9. Grammars within grammars • LeadMine grammars are specified constructively, effectively producing even more entity types. • Region = City + Continent + Country + Island + Lake + Mountain + Ocean + River + Sea + State/Province + OtherFeature + OtherRegion. • City = CityAlbania + CityAndorra + CityAustralia + CityAustria + … + CityUS + … • CityUS = CityUS_AK + CityUS_AL + CityUS_AR + CityUS_AZ + CityUS_CA + CityUS_CO + …
  10. 10. Pharma registry numbers • CaffeineFix v2.0 supports sets of user-defined regular expressions as dictionaries. • One application is specifying the format of registry numbers, such as GSK204454A. • Prefix: “A” | “AZ” | “BMY” | “GSK” | “LY” | … • Number: \d{3-7} • Suffix: (“.” \d) | [“a” .. “z”] • RegistryNumber: Prefix [“ ” | “-”] Number [Suffix]
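As a rough illustration, the RegistryNumber grammar on this slide can be transcribed into a conventional regular expression. The Python sketch below makes two assumptions that the slide does not state: the prefix list is limited to the five prefixes shown (the slide's list continues with “…”), and the letter suffix is matched case-insensitively so that the slide's own example, GSK204454A, is accepted.

```python
import re

# Only the prefixes visible on the slide; the real list is longer.
PREFIXES = ["A", "AZ", "BMY", "GSK", "LY"]

# Longest prefixes first, so "AZ" is tried before "A".
_prefix = "|".join(sorted(map(re.escape, PREFIXES), key=len, reverse=True))

REGISTRY_NUMBER = re.compile(
    rf"\b(?:{_prefix})"      # Prefix
    r"[ -]?"                 # optional space or hyphen
    r"\d{3,7}"               # Number: three to seven digits
    r"(?:\.\d|[A-Za-z])?"    # optional Suffix (case-insensitive: an assumption)
    r"\b"
)

text = "Compounds GSK204454A and BMY 12345 were compared with DNA123."
print(REGISTRY_NUMBER.findall(text))
```

The leading word boundary keeps a one-letter prefix such as “A” from matching inside a longer token like “DNA123”.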
  11. 11. Cardinal numbers • English – One, ten, two thousand and forty eight, ten million • German – Eins, Zehn, Hundert, Million, Viermillion – Vierhundertsiebenundzwanzigtausendfünfhundertvierunddreißig • French – Trois cents, un mille, mille neuf cent quatre-vingts dix-huit • Italian – Uno, due, trenta, ottocentosessantamila settecentoottantanove • Swedish – en miljon trehundrasjuttiåtta tusen niohundrasjuttiett
  12. 12. CAS registry number grammar • Two to seven digits, followed by a hyphen, two digits, a hyphen and a final check digit – e.g. 7732-18-5 • Regular expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d
  13. 13. CAS check digit calculation • More generally, CaffeineFix’s finite state machines can do limited processing... • The final check digit of a CAS number is a weighted sum of the preceding digits, modulo 10. • Working right to left from the digit before the check digit, the last digit is multiplied by 1, the previous digit by 2, the one before that by 3, and so on; the check digit is the sum modulo 10. • The CAS number for water is 7732-18-5. • The checksum 5 is calculated as (1×8 + 2×1 + 3×2 + 4×3 + 5×7 + 6×7) mod 10 = 105 mod 10 = 5.
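The format check from the previous slide and the check digit rule described here can be combined in a few lines of ordinary Python. This is a minimal sketch, not the finite state machine the deck refers to: it uses the regular expression exactly as given on the slide (which, as written, accepts two to six leading digits), and the function names are made up for illustration.

```python
import re

# Regular expression from the slide, with non-capturing groups.
CAS_FORMAT = re.compile(r"(?:[1-9]\d{2,5}|[5-9]\d)-\d\d-\d")


def cas_check_digit(body_digits):
    """Weighted digit sum modulo 10; weights 1, 2, 3, ... run right to left."""
    weighted = (w * int(d) for w, d in enumerate(reversed(body_digits), start=1))
    return sum(weighted) % 10


def is_valid_cas(text):
    """True if text has the CAS shape and its final digit is the checksum."""
    if not CAS_FORMAT.fullmatch(text):
        return False
    head, middle, check = text.split("-")
    return cas_check_digit(head + middle) == int(check)


print(is_valid_cas("7732-18-5"))   # water, from the slide
print(is_valid_cas("7732-18-8"))   # same number with a wrong check digit
```

Doing the same check inside a finite state machine means tracking the running sum modulo 10 as part of the state, which is why the deck describes it as limited processing.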
  14. 14. FSM for matching CAS check digits
  15. 15. FSM for matching CAS check digits
  16. 16. CAS number correction example • 7732-18-8? Did you mean... – 7732-18-5 – 7732-11-8 – 77328-18-8 – 7733-18-8 – 77342-18-8 – 77392-18-8 – 71732-18-8 – 76732-18-8 – 97732-18-8
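The suggestions on this slide are all one edit away from the input and all have a valid check digit. The Python sketch below reproduces that idea by brute force: it tries every single-digit substitution, insertion and deletion, then keeps the candidates that have the CAS shape and pass the checksum. How CaffeineFix generates corrections internally is not described in the deck, so the approach and the names `checksum_ok` and `suggest` are assumptions.

```python
import re

# Two to seven digits, two digits, one check digit (as described in the deck).
CAS_SHAPE = re.compile(r"[1-9]\d{1,6}-\d\d-\d")


def checksum_ok(cas):
    """Check the final digit against the weighted sum of the other digits."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def suggest(cas):
    """All valid CAS numbers within one digit edit of the input string."""
    candidates = set()
    for i in range(len(cas) + 1):
        at_digit = i < len(cas) and cas[i].isdigit()
        for d in "0123456789":
            candidates.add(cas[:i] + d + cas[i:])          # insert a digit
            if at_digit:
                candidates.add(cas[:i] + d + cas[i + 1:])  # substitute a digit
        if at_digit:
            candidates.add(cas[:i] + cas[i + 1:])          # delete a digit
    candidates.discard(cas)
    return {c for c in candidates if CAS_SHAPE.fullmatch(c) and checksum_ok(c)}


for candidate in sorted(suggest("7732-18-8")):
    print(candidate)
```

For 7732-18-8 this enumeration yields the same nine candidates as the slide: three substitutions and six insertions.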
  17. 17. Roman numerals • One useful operator is NonEmpty, which removes the empty string from the set of valid matches, so that at least one character must match. • Thousands: M MM MMM • Hundreds: C CC CCC CD D DC DCC DCCC CM • Tens: X XX XXX XL L LX LXX LXXX XC • Units: I II III IV V VI VII VIII IX
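The need for NonEmpty is easy to see with an ordinary regular expression: each of the four groups (thousands, hundreds, tens, units) is optional, so their concatenation also matches the empty string. In the Python sketch below a lookahead stands in for the NonEmpty operator; that substitution is an assumption made for illustration, since the deck does not show how NonEmpty is written in a LeadMine grammar.

```python
import re

ROMAN = re.compile(
    r"(?=[MDCLXVI])"          # stand-in for NonEmpty: a numeral letter must follow
    r"M{0,3}"                 # thousands: M MM MMM
    r"(?:CM|CD|D?C{0,3})"     # hundreds:  C ... CM
    r"(?:XC|XL|L?X{0,3})"     # tens:      X ... XC
    r"(?:IX|IV|V?I{0,3})"     # units:     I ... IX
)


def is_roman(text):
    """True if the whole string is a well-formed Roman numeral (1 to 3999)."""
    return ROMAN.fullmatch(text) is not None


for token in ["XIV", "MMXVII", "", "IIII"]:
    print(repr(token), is_roman(token))
```

Without the lookahead, the empty string would be accepted, because every group can match nothing.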
  18. 18. Unknown acid • Another operator allows wildcards with exceptions, effectively a “not” operator. • An unknown acid is “[a-z’-]+ acid” where the first word excludes: – Stop words: a, the, and, any, is, in, was, etc. – Common qualifiers: acceptable, preferred, etc. – Adjectives: battery, free, inorganic, strong, etc. – Known acids: acetic, nitric, amino, carboxylic, etc.
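A wildcard with exceptions can be imitated in Python by matching the wildcard first and filtering the exceptions afterwards. The sketch below uses only the exclusion words that appear on the slide (the real lists are longer, as the “etc.” indicates) and a made-up function name; in CaffeineFix the exclusion is presumably compiled into the automaton itself rather than applied as a post-filter.

```python
import re

# Exclusions copied from the slide; each category is truncated there with "etc."
EXCLUDED = {
    "a", "the", "and", "any", "is", "in", "was",      # stop words
    "acceptable", "preferred",                        # common qualifiers
    "battery", "free", "inorganic", "strong",         # adjectives
    "acetic", "nitric", "amino", "carboxylic",        # known acids
}

CANDIDATE = re.compile(r"\b([a-z'-]+) acid\b")


def unknown_acids(text):
    """Return '<word> acid' phrases whose first word is not excluded."""
    return [
        m.group(0)
        for m in CANDIDATE.finditer(text.lower())
        if m.group(1) not in EXCLUDED
    ]


sentence = "The strong acid was replaced by boswellic acid and acetic acid."
print(unknown_acids(sentence))
```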
  19. 19. Unknown INN • A variation on this theme allows LeadMine to recognize novel (recently announced) kinase inhibitors and antibodies based on the structure of their INN names. • An unknown kinase inhibitor is “[a-z]+inib” and an unknown antibody is “[a-z]+mab” where the words exclude previously known/reported INN names and “colliding” English words, e.g. april != captopril, KappaB != rozrolimupab, yuletide != exenatide, triumvir != zanamivir, etc.
  20. 20. Person grammar • The named person grammar matches: 1. [Salutation] FirstName [Initials] Surname [Suffix] 2. [Salutation] FirstName [Initials] UnknownSurname [Suffix] 3. [Salutation] UnknownFirstName [Initials] Surname [Suffix] • where Salutation includes Mr., Mrs., Dr., Sir, His Highness, … FirstName includes David, John, Sarah, Tom, Angela, … Surname includes Smith, Jones, Overington, … UnknownFirstName excludes Big, Lake, The, Outer, etc. UnknownSurname excludes Avenue, Bridge, Street, etc.
  21. 21. List construction operator • Another frequently used idiom is the operator for constructing comma-separated lists. • This turns the grammar matching “X” into the grammar matching things like “X, X, X and X”. • More specifically: X [ “,” “ ”? X]* (“ and ” | “ or ” | “ and/or ”)? X • Another variation of this allows “other”, “similar” and “related” to qualify the final X if the list is non-empty.
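A list operator of this kind can be mimicked with a small function that wraps any regular expression for X in the list pattern. The Python sketch below follows one reading of the slide's expression: comma-separated items followed by an optional final conjunction and item. The helper name `list_of` and the decision to let a single X match on its own are assumptions, not part of LeadMine.

```python
import re


def list_of(x):
    """Turn a regex for X into a regex for 'X', 'X and X', 'X, X or X', ..."""
    item = f"(?:{x})"
    conjunction = r"(?: and/or | and | or )"
    return rf"{item}(?:, ?{item})*(?:{conjunction}{item})?"


# Items taken from the dermatitis example later in the deck.
CAUSE = "heat|cold|radiation|cosmetics|fungi|shellfish"
CAUSE_LIST = re.compile(list_of(CAUSE))

phrase = "heat, cold, radiation, cosmetics, fungi and shellfish"
print(bool(CAUSE_LIST.fullmatch(phrase)))
```

Because the operator takes a pattern and returns a pattern, it can be applied to any entity grammar, including one that was itself built with `list_of`.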
  22. 22. Hearst pattern grammars • An example use of list constructions is in the recognition of Hearst patterns. 1. X such as Y [“including”, “especially” etc.] 2. Y and other X [“and related”, “or similar” etc.] 3. such X as Y • Where X is a category or classification term; • And Y is a list of exemplified terms. • Marti A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France, July 1992.
  23. 23. Complex object builder • An application of the list construction operator is in our “complex object builder” construction operator. ComplexObjectBuilder cob; cob.insert("red", "lorry", "lorries"); cob.insert("yellow", "lorry", "lorries"); • Allows matching not only of “red lorry”, “red lorries”, “yellow lorry” and “yellow lorries” • But also of… “red and yellow lorries”, “yellow and red lorries”, etc.
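The behaviour described on this slide can be imitated with a small Python class that collects modifiers per head noun and emits a regular expression. This is a hypothetical re-implementation for illustration only: it borrows the class and method names from the slide, but the `pattern` method and the rule that a list of modifiers requires the plural noun are assumptions.

```python
import re
from collections import defaultdict


class ComplexObjectBuilder:
    """Toy analogue of the builder on the slide (not the LeadMine API)."""

    def __init__(self):
        # (singular, plural) -> set of modifiers seen with that noun
        self._modifiers = defaultdict(set)

    def insert(self, modifier, singular, plural):
        self._modifiers[(singular, plural)].add(modifier)

    def pattern(self):
        alternatives = []
        for (singular, plural), mods in self._modifiers.items():
            mod = "(?:" + "|".join(map(re.escape, sorted(mods))) + ")"
            sing, plur = re.escape(singular), re.escape(plural)
            # One modifier with either form: "red lorry", "red lorries".
            alternatives.append(rf"{mod} (?:{sing}|{plur})")
            # A list of modifiers with the plural: "red and yellow lorries".
            mod_list = rf"{mod}(?:, ?{mod})*(?: and/or | and | or ){mod}"
            alternatives.append(rf"{mod_list} {plur}")
        return re.compile("|".join(alternatives))


cob = ComplexObjectBuilder()
cob.insert("red", "lorry", "lorries")
cob.insert("yellow", "lorry", "lorries")
matcher = cob.pattern()
print(bool(matcher.fullmatch("red and yellow lorries")))
```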
  24. 24. Complex disease examples • Adenomatous polyps of the colon and rectum. • Fibroepithelial or epithelial hyperplasias. • Inherited spinocerebellar ataxia. • Stage II or stage III colorectal cancer. • Inherited breast and ovarian cancers. • Argentinian, Bolivian and Korean haemorrhagic fevers. • Dermatitis due to heat, cold, radiation, cosmetics, fungi and shellfish.
  25. 25. Grammars for Safety text mining • “May cause lung damage if swallowed” – “may” → “can”, “could”, “may”, “might”, “will”, etc. – “cause” → “lead to”, “result in”, “trigger”, “bring on”, … – “lung damage” → “explosion”, “cancer”, “injury”, … – “if” → “when”, “once”… – “swallowed” → “heated”, “shaken”, “dried”, “ignited”… • “Highly toxic” – “highly” → “very”, “extremely”, “unusually”, “intensely”… – “toxic” → “explosive”, “carcinogenic”, “poisonous”…
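The substitutions on this slide amount to a template with one slot per word, where each slot accepts any of a set of alternatives. A minimal Python sketch of that idea follows; it uses only the alternatives visible on the slide (each list is truncated there), and the helper name `slot_pattern` is invented for illustration.

```python
import re

# One list per slot of "May cause lung damage if swallowed".
SLOTS = [
    ["may", "can", "could", "might", "will"],
    ["cause", "lead to", "result in", "trigger", "bring on"],
    ["lung damage", "explosion", "cancer", "injury"],
    ["if", "when", "once"],
    ["swallowed", "heated", "shaken", "dried", "ignited"],
]


def slot_pattern(slots):
    """Build one regex that accepts any combination of slot alternatives."""
    parts = ["(?:" + "|".join(map(re.escape, alts)) + ")" for alts in slots]
    return re.compile(r"\b" + " ".join(parts) + r"\b", re.IGNORECASE)


HAZARD = slot_pattern(SLOTS)

for sentence in [
    "May cause lung damage if swallowed.",
    "Could result in explosion when heated.",
    "May cause delight if heated.",
]:
    print(sentence, bool(HAZARD.search(sentence)))
```

A handful of short word lists already covers well over a thousand distinct phrasings, which is the point of specifying the grammar rather than listing phrases.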
  26. 26. Efficient protein variant naming • CaffeineFix technology can also be applied to naming peptides and arbitrary protein variants/mutants. • Consider a database of the following 11 peptides: – CFFQNCPRG phenylpressin – CFVRNCPTG annetocin – CFWTSCPIG octopressin – CYFQNCPRG argipressin – CYFQNCPKG lypressin – CYFRNCPIG cephalotocin – CYIQNCPLG oxytocin – CYIQNCPPG prol-oxytocin – CYIQNCPRG vasotocin – CYIQSCPIG seritocin – CYISNCPIG isotocin
  27. 27. DAG representation of sequences These 11 peptides may be efficiently represented and searched as a “directed acyclic graph” [38 vs. 99 states]
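The saving comes from sharing both common prefixes and common suffixes between sequences. The Python sketch below shows one textbook way to get there: build a trie (shared prefixes), then merge subtrees that accept the same set of suffixes. It is an assumption that this matches how CaffeineFix builds its automata, and because state-counting conventions vary, the counts it prints are not claimed to equal the 38 and 99 quoted on the slide.

```python
PEPTIDES = {
    "CFFQNCPRG": "phenylpressin", "CFVRNCPTG": "annetocin",
    "CFWTSCPIG": "octopressin", "CYFQNCPRG": "argipressin",
    "CYFQNCPKG": "lypressin", "CYFRNCPIG": "cephalotocin",
    "CYIQNCPLG": "oxytocin", "CYIQNCPPG": "prol-oxytocin",
    "CYIQNCPRG": "vasotocin", "CYIQSCPIG": "seritocin",
    "CYISNCPIG": "isotocin",
}


def build_trie(words):
    """Prefix tree: each node records its children and whether a word ends."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["next"].values())


def minimize(node, registry):
    """Give identical subtrees the same state id (bottom-up merging)."""
    children = tuple(sorted(
        (ch, minimize(child, registry)) for ch, child in node["next"].items()
    ))
    signature = (node["final"], children)
    return registry.setdefault(signature, len(registry))


def accepts(states, start, word):
    """Walk the merged automaton; True only if the walk ends in a final state."""
    state = start
    for ch in word:
        state = dict(states[state][1]).get(ch)
        if state is None:
            return False
    return states[state][0]


trie = build_trie(PEPTIDES)
registry = {}
start = minimize(trie, registry)
states = {state_id: sig for sig, state_id in registry.items()}
print(count_nodes(trie), "trie nodes ->", len(states), "DAG states")
```

Merging never changes which sequences are accepted; it only removes duplicated tails such as the shared “NCP.G” endings.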
  28. 28. Entirety of UniProt/Swiss-Prot • Using this representation, all 540546 protein sequences in uniprot_sprot, which together contain over 192M amino acids, require 142M states (1.4Gb). • This data structure allows close analogues to be identified much faster than using NCBI blastp. • For example, all 540546 sequences can be queried against this database (i.e. all-against-all) in ~9m30s on a single core of a laptop. • The sequence from PDB 1CRN (crambin, 46 AA) is canonically named as [L25I]P01542 in 0.002s.
  29. 29. Application to precision medicine • A more realistic example is that the sequence of the gene “spastic paraplegia 4” with six mutations from OMIM:604277 can be canonically named as [I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0 • Run-time for this query is 0.2s. • By comparison, blastp 2.2.29+ takes about 6s. – With default arguments, NCBI blastp run time is 7s. – Only 6s with -num_descriptions 1 -num_alignments 1.
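The naming convention used in these examples, a bracketed list of substitutions followed by the identifier of the closest known sequence, is simple to reproduce once that closest sequence has been found. The Python sketch below is a deliberately naive illustration: it handles substitutions only (equal-length sequences), finds the nearest reference by a linear scan rather than by searching a DAG, and uses three peptides from the earlier slide in place of UniProt accessions. The function names are assumptions.

```python
REFERENCES = {
    "CYIQNCPLG": "oxytocin",
    "CYIQNCPRG": "vasotocin",
    "CYISNCPIG": "isotocin",
}


def name_variant(reference, query, identifier):
    """Name query as '[<from><position><to>,...]identifier' (1-based positions)."""
    if len(reference) != len(query):
        raise ValueError("this sketch only handles equal-length sequences")
    changes = [
        f"{ref_aa}{pos}{query_aa}"
        for pos, (ref_aa, query_aa) in enumerate(zip(reference, query), start=1)
        if ref_aa != query_aa
    ]
    return f"[{','.join(changes)}]{identifier}" if changes else identifier


def canonical_name(query, references):
    """Pick the reference with the fewest mismatches and name the query against it."""
    def mismatches(reference):
        return sum(a != b for a, b in zip(reference, query))

    same_length = [r for r in references if len(r) == len(query)]
    best = min(same_length, key=mismatches)
    return name_variant(best, query, references[best])


print(canonical_name("CYIQNCPLA", REFERENCES))
```

An exact match returns the bare identifier, so known sequences and their variants share one naming scheme.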
  30. 30. Summary • LeadMine’s .cfx files can do far more than efficiently match very large dictionaries of terms. • Indeed, many of the grammars used at NextMove Software potentially match an infinite number of terms. • Construction of domain-specific grammars can be done in collaboration with LeadMine customers.