Advanced grammars for state-of-the-art named entity recognition (NER)

Roger Sayle and Daniel Lowe
NextMove Software, Cambridge, UK
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017

Overview
• NextMove Software’s LeadMine text-mining engine
internally uses “CaffeineFix” (.cfx) technology for
specifying and efficiently matching important terms.
• In addition to case-sensitive and case-insensitive
term matching, CaffeineFix/LeadMine also supports
spelling correction (fuzzy matching).
• The most common usage is to simply compile
dictionaries into binary form for fast matching.
• Advanced users can specify “regular expressions”.
• In this presentation, we go beyond REGEXPs.

LeadMine v2 entity types
1. Chemicals
   1.1 Dictionary Names
       1.1.1 Abbreviations
       1.1.2 CAS RN Numbers
       1.1.3 Registry Numbers
   1.2 Systematic Names
       1.2.1 Functional Groups
       1.2.2 Elements
       1.2.3 Acids
       1.2.4 SMILES
       1.2.5 InChIs
   1.3 Generic Classes
   1.4 Polymers
   1.5 Formulae
2. Biomolecules
   2.1 Proteins
       2.1.1 Targets
       2.1.2 P450s
   2.2 Genes
   2.3 E.C. Numbers
   2.4 PDB Codes
3. Anatomy
   3.1 Cell Types
   3.2 Cytogenetic Loci
4. Cell Lines
5. Diseases
6. Symptoms
7. Mechanisms of Action
8. Species/Organisms
9. Companies
10. Named Reactions
11. Regions
12. Languages/Possessives

Named entity normal forms
• Chemicals: SMILES and/or InChI
• Proteins: UniProt
• Genes: Entrez GeneID/HGNC
• Targets: ChEMBL
• Species/Organism: NCBI Taxonomy ID
• Diseases/Symptoms: ICD-10
• Named Reactions: RXNO
• Mechanism of Action: ATC
• Many of these can also use NLM MeSH Terms.

Example entity dictionary as DAG
• Nitrogen-containing heterocycles as a minimal DFA:
– Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine,
Pyrimidine, Pyrazine
• CaffeineFix supports (very large) user dictionaries.
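
To illustrate why a minimal DFA is compact, here is a minimal sketch in Python that builds a plain trie for the seven heterocycle names and then merges states with identical suffix sets. It is not CaffeineFix's implementation; the function names and the nested-dict representation are assumptions made for this example.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key '' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root


def count_trie_states(node):
    """Count trie nodes, ignoring the end-of-word markers."""
    return 1 + sum(count_trie_states(child)
                   for ch, child in node.items() if ch != "")


def minimise(node, registry):
    """Assign one id per distinct suffix set, so equivalent states merge."""
    key = tuple(sorted(
        (ch, -1 if ch == "" else minimise(child, registry))
        for ch, child in node.items()
    ))
    if key not in registry:
        registry[key] = len(registry)
    return registry[key]


WORDS = ["pyrrole", "pyrazole", "imidazole", "pyridine",
         "pyridazine", "pyrimidine", "pyrazine"]

trie = build_trie(WORDS)
registry = {}
minimise(trie, registry)

trie_states = count_trie_states(trie)   # states before merging
dag_states = len(registry)              # states after merging shared suffixes
print(trie_states, dag_states)
```

Shared endings such as “-ole” and “-ine” collapse into single paths, which is where the saving over the trie comes from.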

OBO ontologies as dictionaries
• In addition to regular TSV (tab-separated value) files
for storing dictionaries, LeadMine’s obo2dict also
supports OBO ontologies, a convenient method for
tracking synonyms and foreign language forms.
[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
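
As a rough illustration of the conversion, the sketch below reads [Term] stanzas and emits (term, identifier) rows suitable for a TSV dictionary. This is an assumption about what such a conversion could look like, not the behaviour of LeadMine's obo2dict; the function name obo_to_rows is hypothetical.

```python
import re

OBO_TEXT = '''[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
'''

# Captures the quoted text of a synonym line, allowing escaped characters.
SYNONYM = re.compile(r'^synonym:\s*"((?:[^"\\]|\\.)*)"')


def obo_to_rows(text):
    """Return (term, id) pairs for names and synonyms in [Term] stanzas."""
    rows = []
    term_id = None
    in_term = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            in_term = (line == "[Term]")   # skip [Typedef] and other stanzas
            term_id = None
        elif not in_term:
            continue
        elif line.startswith("id:"):
            term_id = line[3:].strip()
        elif line.startswith("name:") and term_id:
            rows.append((line[5:].strip(), term_id))
        else:
            match = SYNONYM.match(line)
            if match and term_id:
                rows.append((match.group(1), term_id))
    return rows


rows = obo_to_rows(OBO_TEXT)
for term, identifier in rows:
    print(f"{term}\t{identifier}")
```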

Plural form generation
• LeadMine’s pluralize automatically generates English
plural forms from singular dictionary entries.
diels-alder couplings RXNO:0000006
diels-alder cycloadditions RXNO:0000006
diels-alder reactions RXNO:0000006
acridine syntheses RXNO:0000518
acyclic beckmann rearrangements RXNO:0000564
acyloin condensations RXNO:0000085
olefin metatheses RXNO:0000280
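
The kind of suffix rule involved can be sketched in a few lines of Python. The rules below are a small assumed subset chosen to reproduce the examples above; they are not the rules used by LeadMine's pluralize.

```python
def pluralize(term):
    """Pluralise the last word of an English term with a few suffix rules."""
    if term.endswith("sis"):
        # synthesis -> syntheses, metathesis -> metatheses
        return term[:-2] + "es"
    if term.endswith(("s", "x", "z", "ch", "sh")):
        return term + "es"
    if term.endswith("y") and len(term) > 1 and term[-2] not in "aeiou":
        # lorry -> lorries
        return term[:-1] + "ies"
    return term + "s"


for singular in ["diels-alder reaction", "acridine synthesis",
                 "olefin metathesis", "acyloin condensation"]:
    print(pluralize(singular))
```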

Unusual entities
• ISBN, URL, PubMed
• Roman Numerals, Date
• ColorState, Zip codes
• Katakana
• HELM, InChI, SMILES, v2000
• Credit Card Numbers
• Region
• Person
• Disease
• Journal
• SQL statement
• Solvent Mixture
• Hearst Patterns
• Unknown acid
• Unknown antibody
• Unknown disease
• Unknown INN
• Ordinal numbers
• Cardinal numbers
• de, es, fr, it, sv

Grammars within grammars
• LeadMine grammars are specified constructively,
effectively producing even more entity types.
• Region = City + Continent + Country + Island + Lake +
Mountain + Ocean + River + Sea + State/Province +
OtherFeature + OtherRegion.
• City = CityAlbania + CityAndorra + CityAustralia + CityAustria +
… + CityUS + …
• CityUS = CityUS_AK + CityUS_AL + CityUS_AR + CityUS_AZ +
CityUS_CA + CityUS_CO + …

Pharma registry numbers
• CaffeineFix v2.0 supports sets of user-defined
regular expressions as dictionaries.
• One application is specifying the format of
registry numbers, such as GSK204454A
• Prefix: “A” | “AZ” | “BMY” | “GSK” | “LY” | …
• Number: d{3-7}
• Suffix: (“.” d) | [“a” .. “z”]
• RegistryNumber: Prefix [“ ” | “-”] Number [Suffix]
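
For readers more familiar with conventional regular expressions, the grammar above can be approximated in Python as follows. This is a sketch under assumptions: only the five prefixes listed are included (the elided ones are left out), “d{3-7}” is read as three to seven digits, and matching is case-insensitive so that the suffix “A” in GSK204454A is accepted.

```python
import re

# Only the prefixes shown on the slide; the real list is longer.
PREFIXES = ["A", "AZ", "BMY", "GSK", "LY"]

# Longest prefixes first so that "AZ" is tried before "A".
prefix = "|".join(sorted(map(re.escape, PREFIXES), key=len, reverse=True))

REGISTRY_NUMBER = re.compile(
    rf"(?:{prefix})"        # Prefix
    r"[ -]?"                # optional space or hyphen
    r"\d{3,7}"              # Number: three to seven digits
    r"(?:\.\d|[a-z])?",     # optional Suffix: ".d" or a single letter
    re.IGNORECASE,
)


def is_registry_number(text):
    """True if the whole string has the registry number format."""
    return REGISTRY_NUMBER.fullmatch(text) is not None


print(is_registry_number("GSK204454A"))
```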

Cardinal numbers
• English
– One, ten, two thousand and forty eight, ten million
• German
– Eins, Zehn, Hundert, Million, Viermillion
– Vierhundertsiebenundzwanzigtausendfünfhundertvierunddreißig
• French
– Trois cents, un mille, mille neuf cent quatre-vingts dix-huit
• Italian
– Uno, due, trenta, ottocentosessantamila settecentoottantanove
• Swedish
– en miljon trehundrasjuttiåtta tusen niohundrasjuttiett

CAS registry number grammar
• Two to seven digits, followed by a hyphen, two digits,
a hyphen and a final check digit
– e.g. 7732-18-5
• Regular Expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d

CAS check digit calculation
• More generally CaffeineFix’s finite state machines
can do limited processing...
• The final check digit of a CAS number is calculated by
series term summation modulo 10.
• Multiply the last digit before the check digit by 1,
the digit before it by 2, the one before that by 3, and
so on, then take the sum modulo 10.
• The CAS number for water is 7732-18-5.
• The checksum 5 is calculated as (1x8 + 2x1 + 3x2 +
4x3 + 5x7 + 6x7) mod 10 = 5.
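
The calculation above can be written out directly. The sketch below combines the format check (the CAS regular expression, written with explicit \d digit classes) with the checksum; it is ordinary Python rather than a finite state machine, and the function names are chosen for this example.

```python
import re

CAS_PATTERN = re.compile(r"(?:[1-9]\d{2,5}|[5-9]\d)-\d\d-\d")


def cas_check_digit(body_digits):
    """Weighted sum modulo 10: the rightmost digit has weight 1."""
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body_digits), start=1))
    return total % 10


def is_valid_cas(text):
    """True if text has the CAS format and a correct check digit."""
    if not CAS_PATTERN.fullmatch(text):
        return False
    digits = text.replace("-", "")
    return cas_check_digit(digits[:-1]) == int(digits[-1])


print(cas_check_digit("773218"))   # water, 7732-18-5
print(is_valid_cas("7732-18-5"))
```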

FSM for matching CAS check digits

CAS number correction example
• 7732-18-8? Did you mean...
– 7732-18-5
– 7732-11-8
– 77328-18-8
– 7733-18-8
– 77342-18-8
– 77392-18-8
– 71732-18-8
– 76732-18-8
– 97732-18-8
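
Suggestions like these can be reproduced by brute force: generate every string one digit edit away and keep those that pass the format and checksum tests. The sketch below assumes single-digit insertions, substitutions and deletions only; CaffeineFix does this inside its finite state machine, which this example does not attempt to model.

```python
import re

CAS_PATTERN = re.compile(r"(?:[1-9]\d{2,5}|[5-9]\d)-\d\d-\d")


def is_valid_cas(text):
    """True if text has the CAS format and a correct check digit."""
    if not CAS_PATTERN.fullmatch(text):
        return False
    digits = text.replace("-", "")
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits[:-1]), start=1))
    return total % 10 == int(digits[-1])


def suggest_corrections(text):
    """Valid CAS numbers reachable by editing a single digit of text."""
    candidates = set()
    for i in range(len(text) + 1):
        at_digit = i < len(text) and text[i].isdigit()
        for d in "0123456789":
            candidates.add(text[:i] + d + text[i:])            # insertion
            if at_digit:
                candidates.add(text[:i] + d + text[i + 1:])    # substitution
        if at_digit:
            candidates.add(text[:i] + text[i + 1:])            # deletion
    candidates.discard(text)
    return sorted(c for c in candidates if is_valid_cas(c))


for suggestion in suggest_corrections("7732-18-8"):
    print(suggestion)
```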

Roman numerals
One useful operator is NonEmpty, which removes the empty string
from the set of valid matches and so requires at least one
character to match.
Thousands: M, MM, MMM
Hundreds: C, CC, CCC, CD, D, DC, DCC, DCCC, CM
Tens: X, XX, XXX, XL, L, LX, LXX, LXXX, XC
Units: I, II, III, IV, V, VI, VII, VIII, IX
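
A regular-expression analogue shows why NonEmpty is needed: each of the four components is optional, so their concatenation also matches the empty string. The sketch below is plain Python, with the empty string rejected explicitly in place of a NonEmpty operator.

```python
import re

THOUSANDS = r"M{0,3}"
HUNDREDS = r"(?:CM|CD|D?C{0,3})"
TENS = r"(?:XC|XL|L?X{0,3})"
UNITS = r"(?:IX|IV|V?I{0,3})"

# Every component may match nothing, so ROMAN on its own accepts "".
ROMAN = re.compile(THOUSANDS + HUNDREDS + TENS + UNITS)


def is_roman(text):
    """Roman numeral test with the empty string excluded (NonEmpty)."""
    return text != "" and ROMAN.fullmatch(text) is not None


print(is_roman("MMXVII"), is_roman(""))
```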

Unknown acid
• Another operator allows wildcards with exceptions,
effectively a not operator.
• An unknown acid is “[a-z’-]+ acid” where the first
word excludes:
– Stop words: a, the, and, any, is, in, was, etc.
– Common qualifiers: acceptable, preferred, etc.
– Adjectives: battery, free, inorganic, strong, etc.
– Known acids: acetic, nitric, amino, carboxylic, etc.
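
The wildcard-with-exceptions idea can be approximated with a regular expression plus an exclusion set. In this sketch the exclusion list contains only the example words given above (the real lists are longer), and an ASCII apostrophe stands in for the typographic one in the pattern.

```python
import re

EXCLUDED = {
    "a", "the", "and", "any", "is", "in", "was",      # stop words
    "acceptable", "preferred",                          # common qualifiers
    "battery", "free", "inorganic", "strong",           # adjectives
    "acetic", "nitric", "amino", "carboxylic",          # known acids
}

UNKNOWN_ACID = re.compile(r"\b([a-z'-]+) acid\b")


def find_unknown_acids(text):
    """Return '<word> acid' phrases whose first word is not excluded."""
    return [match.group(0)
            for match in UNKNOWN_ACID.finditer(text.lower())
            if match.group(1) not in EXCLUDED]


sentence = ("The mixture of acetic acid and pelargonic acid "
            "was added to a strong acid.")
print(find_unknown_acids(sentence))
```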

Unknown inn
• A variation on this theme allows LeadMine to
recognize novel (recently announced) kinase
inhibitors and antibodies based on the structure of
their INN names.
• An unknown kinase inhibitor is “[a-z]+inib” and an
unknown antibody is “[a-z]+mab” where the words
exclude previously known/reported INN names and
“colliding” English words.
april != captopril, KappaB != rozrolimupab, yuletide != exenatide,
triumvir != zanamivir, etc.

Person grammar
• The named person grammar matches:
1. [Salutation] FirstName [Initials] Surname [Suffix]
2. [Salutation] FirstName [Initials] UnknownSurname [Suffix]
3. [Salutation] UnknownFirstName [Initials] Surname [Suffix]
• where
Salutation includes Mr., Mrs., Dr., Sir, His Highness, …
FirstName includes David, John, Sarah, Tom, Angela, …
Surname includes Smith, Jones, Overington, …
UnknownFirstname excludes Big, Lake, The, Outer, etc.
UnknownSurname excludes Avenue, Bridge, Street, etc.

List construction operator
• Another frequently used idiom is the set of operators
for constructing comma-separated lists.
• These turn the grammar matching “X” into the
grammar matching things like “X, X, X and X”.
• More specifically:
(X [“,” “ ”? X]* (“ and ” | “ or ” | “ and/or ”))? X
• Another variation of this allows “other”, “similar”
and “related” before the final X if the list is non-empty.
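
One reading of the list operator, expressed as a function that wraps a regular-expression fragment, is sketched below. The function name list_of and the colour example are assumptions for illustration; the “other/similar/related” variation is not covered.

```python
import re


def list_of(x):
    """Turn a pattern for X into one for 'X', 'X and X', 'X, X or X', ..."""
    item = f"(?:{x})"
    conjunction = r"(?: and | or | and/or )"
    # Optional leading list (items, then a conjunction), then the final X.
    return f"(?:{item}(?:, ?{item})*{conjunction})?{item}"


COLOUR = "red|yellow|green"
COLOUR_LIST = re.compile(list_of(COLOUR))

for phrase in ["red", "red and yellow", "red, yellow and green",
               "red yellow"]:
    print(phrase, "->", COLOUR_LIST.fullmatch(phrase) is not None)
```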

Hearst pattern grammars
• An example use of list constructions is in the
recognition of Hearst patterns.
1. X such as Y [“including”, “especially” etc.]
2. Y and other X [“and related”, “or similar” etc.]
3. such X as Y
• Where X is a category or classification term;
• And Y is a list of exemplified terms.
• Marti A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, Proceedings
of the 14th International Conference on Computational Linguistics, Nantes, France, July 1992.
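
As a small worked example of pattern 1, the sketch below extracts (example, category) pairs from “X such as Y” where Y is a list. It is a simplification under assumptions: X and each list item are single lower-case words, whereas the real grammars use entity dictionaries for both.

```python
import re

ITEM = r"[a-z-]+"
ITEM_LIST = rf"(?:{ITEM}(?:, ?{ITEM})*(?: and | or | and/or ))?{ITEM}"
SUCH_AS = re.compile(rf"({ITEM}) such as ({ITEM_LIST})")


def hyponyms(text):
    """Return (member, category) pairs found by the 'X such as Y' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text.lower()):
        category = match.group(1)
        for member in re.split(r", ?| and/or | and | or ", match.group(2)):
            pairs.append((member, category))
    return pairs


pairs = hyponyms("Solvents such as methanol, ethanol and acetone were used.")
print(pairs)
```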

Complex object builder
• An application of the list construction operator is in
our “complex object builder” construction operator.
ComplexObjectBuilder cob;
cob.insert(“red”, “lorry”, “lorries”);
cob.insert(“yellow”, “lorry”, “lorries”);
• Allows matching not only of
“red lorry”, “red lorries”, “yellow lorry” and “yellow lorries”
• But also of…
“red and yellow lorries”, “yellow and red lorries”, etc.
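
A hypothetical Python port of this idea is sketched below. Only the class name and insert() come from the slide; the compile() method, the regular-expression back end, and the choice to require the plural head after a list of modifiers are assumptions made for this example.

```python
import re


class ComplexObjectBuilder:
    """Collects (modifier, singular, plural) triples and builds a matcher."""

    def __init__(self):
        self._modifiers = {}   # (singular, plural) -> list of modifiers

    def insert(self, modifier, singular, plural):
        self._modifiers.setdefault((singular, plural), []).append(modifier)

    def compile(self):
        alternatives = []
        for (singular, plural), mods in self._modifiers.items():
            mod = "(?:" + "|".join(map(re.escape, mods)) + ")"
            one, many = re.escape(singular), re.escape(plural)
            # A single modifier with either head: "red lorry", "red lorries".
            alternatives.append(f"{mod} (?:{one}|{many})")
            # A list of modifiers with the plural head.
            alternatives.append(
                f"{mod}(?:, ?{mod})*(?: and | or | and/or ){mod} {many}")
        return re.compile("|".join(alternatives))


cob = ComplexObjectBuilder()
cob.insert("red", "lorry", "lorries")
cob.insert("yellow", "lorry", "lorries")
pattern = cob.compile()
print(pattern.fullmatch("red and yellow lorries") is not None)
```

This sketch does not stop a modifier from repeating, so “red and red lorries” also matches.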

Complex disease examples
• Adenomatous polyps of the colon and rectum.
• Fibroepithelial or epithelial hyperplasias.
• Inherited spinocerebellar ataxia.
• Stage II or stage III colorectal cancer.
• Inherited breast and ovarian cancers.
• Argentinian, Bolivian and Korean haemorrhagic
fevers.
• Dermatitis due to heat, cold, radiation, cosmetics,
fungi and shellfish.

Grammars for Safety text mining
• “May cause lung damage if swallowed”
– “may” → “can”, “could”, “may”, “might”, “will”, etc.
– “cause” → “lead to”, “result in”, “trigger”, “bring on”, …
– “lung damage” → “explosion”, “cancer”, “injury”, …
– “if” → “when”, “once”…
– “swallowed” → “heated”, “shaken”, “dried”, “ignited”…
• “Highly toxic”
– “highly” → “very”, “extremely”, “unusually”, “intensely”…
– “toxic” → “explosive”, “carcinogenic”, “poisonous”…
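
The substitution slots above can be composed into a single pattern. The sketch below uses only the alternatives listed (the elided ones are left out) and treats every slot of the first phrase as mandatory; both are assumptions, as is the helper name alt.

```python
import re


def alt(*phrases):
    """Non-capturing alternation of literal phrases, longest first."""
    ordered = sorted(phrases, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(p) for p in ordered) + ")"


MODAL = alt("can", "could", "may", "might", "will")
CAUSE = alt("cause", "lead to", "result in", "trigger", "bring on")
HAZARD = alt("lung damage", "explosion", "cancer", "injury")
CONDITION = alt("if", "when", "once")
EVENT = alt("swallowed", "heated", "shaken", "dried", "ignited")
INTENSIFIER = alt("highly", "very", "extremely", "unusually", "intensely")
PROPERTY = alt("toxic", "explosive", "carcinogenic", "poisonous")

HAZARD_STATEMENT = re.compile(
    rf"{MODAL} {CAUSE} {HAZARD} {CONDITION} {EVENT}"
    rf"|{INTENSIFIER} {PROPERTY}",
    re.IGNORECASE,
)

for phrase in ["May cause lung damage if swallowed",
               "Might result in explosion when heated",
               "Extremely carcinogenic"]:
    print(phrase, "->", HAZARD_STATEMENT.fullmatch(phrase) is not None)
```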

Efficient protein variant naming
• CaffeineFix technology can also be applied to naming
peptides and arbitrary protein variants/mutants.
• Consider a database of the following 11 peptides:
– CFFQNCPRG phenylpressin
– CFVRNCPTG annetocin
– CFWTSCPIG octopressin
– CYFQNCPRG argipressin
– CYFQNCPKG lypressin
– CYFRNCPIG cephalotocin
– CYIQNCPLG oxytocin
– CYIQNCPPG prol-oxytocin
– CYIQNCPRG vasotocin
– CYIQSCPIG seritocin
– CYISNCPIG isotocin

Dag representation of sequences
These 11 peptides may be efficiently represented and
searched as a “directed acyclic graph” [38 vs. 99 states]

Entirety of UniProt/Swiss-Prot
• Using this representation, all 540546 protein
sequences in uniprot_sprot, which contains over
192M amino acids, require 142M states (1.4Gb).
• This data structure allows close analogues to be
identified much faster than using NCBI blastp.
• For example, all 540546 sequences can be queried
against this database (i.e. all-against-all) in ~9m30s
on a single core on a laptop.
• The sequence from PDB 1CRN (crambin 46AA) is
canonically named as [L25I]P01542 in 0.002s.
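
The naming convention itself (original residue, position, new residue, then the identifier of the reference) is easy to show on the 11 peptides listed earlier. The sketch below finds the closest reference by a brute-force Hamming-distance scan, which is not the DAG search described here, handles substitutions only, and uses peptide names in place of accession numbers; the query sequence is invented for the example.

```python
PEPTIDES = {
    "phenylpressin": "CFFQNCPRG", "annetocin": "CFVRNCPTG",
    "octopressin": "CFWTSCPIG", "argipressin": "CYFQNCPRG",
    "lypressin": "CYFQNCPKG", "cephalotocin": "CYFRNCPIG",
    "oxytocin": "CYIQNCPLG", "prol-oxytocin": "CYIQNCPPG",
    "vasotocin": "CYIQNCPRG", "seritocin": "CYIQSCPIG",
    "isotocin": "CYISNCPIG",
}


def name_variant(query, reference, identifier):
    """Name query as substitutions relative to an equal-length reference."""
    if len(query) != len(reference):
        raise ValueError("this sketch handles substitutions only")
    changes = [f"{ref}{pos}{obs}"
               for pos, (ref, obs) in enumerate(zip(reference, query), start=1)
               if ref != obs]
    return f"[{','.join(changes)}]{identifier}" if changes else identifier


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def canonical_name(query, database):
    """Name query against the closest sequence in the database."""
    identifier, reference = min(database.items(),
                                key=lambda item: hamming(query, item[1]))
    return name_variant(query, reference, identifier)


print(canonical_name("CYIQNCPLA", PEPTIDES))   # invented query sequence
```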

Application to precision medicine
• A more realistic example is the sequence of the
gene “spastic paraplegia 4” with six mutations from
OMIM:604277, which can be canonically named as
[I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0
• Run-time for this query is 0.2s.
• By comparison, blastp 2.2.29+ takes about 6s.
– With default arguments, NCBI blastp run time is 7s.
– Only 6s with -num_descriptions 1 -num_alignments 1.

Summary
• LeadMine’s .cfx files can do far more than efficiently
match very large dictionaries of terms.
• Indeed, many of the grammars used at NextMove
Software potentially match an infinite number of
terms.
• Construction of domain specific grammars can be
done in collaboration with LeadMine customers.
