Building Named Entity Recognition Models Efficiently using NERDS
Sujit Pal, Elsevier Labs
December 2019
About me
• Work at Elsevier Labs
• (Mostly self-taught) data scientist
• Mostly work with Deep Learning, Machine
Learning, Natural Language Processing, and
Search.
• Got interested in Named Entity Recognition
(NER) and NERDS as part of Search and
Knowledge Graph development.
I am NOT the author or maintainer of NERDS!
• Originally built by Panagiotis Eustratiadis.
• See CONTRIBUTORS.md for list of contributors.
• Open sourced by Elsevier July 3, 2018.
Agenda
• What can NER do for you?
• Evolution of NER techniques
• NERDS Architecture
• NERDS Usage
• Future Work
What can NER do for you?
• In general…
• Foundational task for NLP pipelines.
• Good NER models available out of the box (OOB) for “standard” named entities.
• Feeds downstream tasks such as Topic Modeling, Co-reference Resolution, etc.
• Information Retrieval (IR)
• Chunks entities into meaningful multi-word phrases.
• Helps understand query intent.
• Automated Knowledge Graph Construction (AKBC)
• NER extracts entities from incoming text.
• Relationship Extraction finds relationships between entity pairs.
• The resulting entity-relationship triple is inserted into the Knowledge Graph.
ConceptSearch!
Evolution of NER Techniques
• Traditional: rules, regular expressions, gazetteers.
• Statistical: word-based models (PMI, log-likelihood); sequence models (Conditional Random Fields).
• Neural: Bi-LSTM, Bi-LSTM+CRF, Transformer-based models.
Input Format – BIO Tagging
• BIO – Begin, Inside, Outside.
• Barack/B-PER Obama/I-PER is/O 44th/O United/B-LOC States/I-LOC President/O ./O
• BILOU – a tagging variant:
• U – Unit token (for single-token entities)
• L – Last token in an entity, e.g. Barack/B-PER Obama/L-PER

Barack B-PER
Obama I-PER
is O
44th O
United B-LOC
States I-LOC
President O
. O
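Decoding BIO tags back into entity spans is a common utility when working with this format. A generic sketch (not NERDS code):

```python
def bio_to_spans(tokens, tags):
    """Decode a BIO tag sequence into (entity_text, label) pairs."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open entity first
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)            # continue the open entity
        else:                                # "O" or inconsistent I- closes it
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Barack", "Obama", "is", "44th", "United", "States", "President", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))
# [('Barack Obama', 'PER'), ('United States', 'LOC')]
```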
Gazetteer – Aho-Corasick
• Create an in-memory automaton from the dictionary.
• Stream content against the automaton.
• Multiple matches found in a single pass.
Aho, A.V. and Corasick, M.J., 1975. Efficient String Matching: An Aid to Bibliographic Search.
[Diagram: Aho-Corasick automaton built from dictionary phrases such as “Barack Obama” (PER) and “United States” (LOC), matching all of them in one pass over the text.]
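NERDS wraps the pyAhoCorasick library for this. As a stand-in, here is a minimal pure-Python token-trie matcher that shows the idea; a real Aho-Corasick automaton additionally builds failure links so all matches are found in a single left-to-right pass:

```python
def build_trie(dictionary):
    """dictionary: {phrase: label}. Builds a token-level trie."""
    root = {}
    for phrase, label in dictionary.items():
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node["$label"] = label  # marks the end of a dictionary phrase
    return root

def match(tokens, trie):
    """Return (start, end, label) for every dictionary phrase in tokens."""
    hits = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break
            if "$label" in node:
                hits.append((start, end + 1, node["$label"]))
    return hits

gazetteer = {"Barack Obama": "PER", "United States": "LOC"}
tokens = ["Barack", "Obama", "is", "44th", "United", "States", "President", "."]
print(match(tokens, build_trie(gazetteer)))
# [(0, 2, 'PER'), (4, 6, 'LOC')]
```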
Sequence Modeling – CRF
• Sequence version of logistic regression.
• Computes the optimum labeling l = (y0, …, yn) over the entire sentence s.
• Builds multiple feature functions f_j on each token, each returning a real value in the range 0..1. Function parameters:
• the sentence s with tokens (x0, …, xn) – a feature can use any token, the entire sentence, or functions computed over the sentence (e.g. POS tags),
• the current position i,
• the previous and next labels y(i-1) and y(i+1).
• The optimum labeling maximizes score(l | s) = Σ_i Σ_j w_j · f_j(s, i, y(i), y(i-1)); scores are converted to probabilities with a softmax over all labelings.
• Weights w_j are learned using gradient descent.
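Toolkits like sklearn-crfsuite represent each token as a dictionary of feature values. The feature function below is a typical illustrative example, not the default NERDS feature set:

```python
def word2features(sent, i):
    """Features for token i of sent (a list of tokens).
    Real pipelines usually add POS tags and gazetteer flags."""
    word = sent[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
    }
    # Context features let the CRF use neighboring tokens.
    if i > 0:
        features["prev.lower"] = sent[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        features["next.lower"] = sent[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

sent = ["Barack", "Obama", "is", "44th"]
print(word2features(sent, 0)["word.istitle"])  # True
```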
Neural Model – BiLSTM
• Input is a sequence of tokens, output is a sequence of BIO tags.
• Weights trained end-to-end, no feature engineering needed.
• Bidirectional LSTM gets signal from neighboring words on both sides.
[Diagram: a Bi-LSTM maps “Barack Obama is 44th United States President .” to B-PER I-PER O O B-LOC I-LOC O O.]
Neural Model – BiLSTM-CRF
• Same as the previous model, with an additional CRF layer.
• No feature engineering for the CRF, unlike the CRF-only NER model.
• Pre-trained embeddings observed to improve performance.
[Diagram: Bi-LSTM followed by a CRF layer maps “Barack Obama is 44th United States President .” to B-PER I-PER O O B-LOC I-LOC O O.]
Neural Model – adding char embeddings
• Concatenate char embedding + word embedding and feed to the Bi-LSTM-CRF.
• All weights learned end-to-end.
• Handles rare / unknown words; exploits signal in prefixes / suffixes.
[Diagram: word embeddings and char LSTM/CNN outputs are concatenated and fed to the Bi-LSTM-CRF tagger.]
Neural Model – ELMo preprocessing
[Diagram: contextualized word embeddings from ELMo are concatenated with char LSTM/CNN outputs and fed to the Bi-LSTM-CRF tagger.]
Neural Model – Transformer based
• BERT = Bidirectional Encoder Representations from Transformers.
• Use as a source of embeddings (similar to ELMo) in standard BiLSTM + CRF models, OR
• Fine-tune LM-backed NERs such as HuggingFace’s BertForTokenClassification.
[Diagram: BERT tags “[CLS] Barack Obama is 44th United States President .” as B-PER I-PER O O B-LOC I-LOC O O.]
More Info on NER Techniques
• High-level overview of NER in a series of blog posts by Tobias Sterbak (https://bit.ly/2pNdgPG).
• Traditional NER techniques covered in the survey by Rahul Sharnagat (2014) – Named Entity Recognition: A Literature Survey (https://bit.ly/2NRaCAg).
• Introduction to neural models in the paper by Ronan Collobert and Jason Weston (2008) – A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning (https://bit.ly/32rRYnO).
• Other (more modern) papers mentioned in the slides.
NERDS Overview
• Framework that provides easy-to-use NER capabilities to Data Scientists.
• Wraps various popular third-party NER models.
• Extensible – new third-party NER tools can be added as needed.
• Software Engineering tooling to boost Data Science productivity.
• Looking for support, bug reports, contributions, and ideas.
Unification through I/O Format
• Wrapped models – pyAhoCorasick, CRFSuite, SpaCy NER, Anago BiLSTM – all share a common I/O representation:

AnnotatedDocument(
  doc: Document(“Barack Obama is 44th United States President .”),
  annotations: [
    Annotation(start_offset: 0, end_offset: 12, text: “Barack Obama”, label: “PER”),
    Annotation(start_offset: 22, end_offset: 35, text: “United States”, label: “LOC”)
  ])
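A minimal Python sketch of this representation, using hypothetical dataclasses for illustration rather than the actual NERDS classes (this sketch uses Python's exclusive-end offset convention, which may differ from the one shown on the slide):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Annotation:
    start_offset: int   # character offset where the entity starts
    end_offset: int     # character offset where it ends (exclusive here)
    text: str
    label: str

@dataclass
class AnnotatedDocument:
    content: str
    annotations: List[Annotation] = field(default_factory=list)

    def add_entity(self, text, label):
        """Locate text in content and record it as an annotation."""
        start = self.content.find(text)
        if start >= 0:
            self.annotations.append(
                Annotation(start, start + len(text), text, label))

doc = AnnotatedDocument("Barack Obama is 44th United States President .")
doc.add_entity("Barack Obama", "PER")
doc.add_entity("United States", "LOC")
print([(a.start_offset, a.end_offset, a.label) for a in doc.annotations])
# [(0, 12, 'PER'), (21, 34, 'LOC')]
```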
Benefits of Unification
• Consistent API – all models are subclasses of NERModel.
• Data prep done once per project and reused across multiple models.
• Reusable training and evaluation code.
• Familiar Scikit-Learn-like API, and access to Scikit-Learn utility functions.
• Duck typing allows us to build ensembles of NER models.
• Easy to benchmark NER models against labeled data.
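The shared fit/predict contract means driver code can be written once and looped over models. The stand-in models below are trivial toys (not real NERDS models) that illustrate the duck-typing pattern:

```python
class GazetteerNER:
    """Stand-in model: memorizes token -> tag mappings seen in training."""
    def fit(self, X, y):
        self.tags = {t: tag for sent, tags in zip(X, y)
                     for t, tag in zip(sent, tags)}
        return self
    def predict(self, X):
        return [[self.tags.get(t, "O") for t in sent] for sent in X]

class MajorityNER:
    """Stand-in model: predicts 'O' for every token."""
    def fit(self, X, y):
        return self
    def predict(self, X):
        return [["O"] * len(sent) for sent in X]

def token_accuracy(y_true, y_pred):
    pairs = [(t, p) for ts, ps in zip(y_true, y_pred)
             for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)

X = [["Barack", "Obama", "is", "44th"]]
y = [["B-PER", "I-PER", "O", "O"]]
# The same driver works for every model -- this is the duck typing at work.
for model in [GazetteerNER(), MajorityNER()]:
    model.fit(X, y)
    print(type(model).__name__, token_accuracy(y, model.predict(X)))
# GazetteerNER 1.0
# MajorityNER 0.5
```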
Can we do better?
• A simpler internal format – parallel lists of tokens and tags:
Data: [[“Barack”, “Obama”, “is”, “44th”, “United”, “States”, “President”, “.”]]
Labels and Predictions: [[“B-PER”, “I-PER”, “O”, “O”, “B-LOC”, “I-LOC”, “O”, “O”]]
[Diagram: DictionaryNER and SpacyNER need an I/O conversion step; CrfNER and BiLstmCrfNER consume this format directly.]
ELMo NER Model from Anago
[Diagram: ElmoNER joins DictionaryNER, CrfNER, SpacyNER, and BiLstmCrfNER; all models share the same token-list Data / Labels format, with I/O conversion only where needed.]
Dataset
• Bio-Entity Recognition task from BioNLP 2004.
• Training and test sets provided in BIO format.
• 511,097 training examples.
• 104,895 test examples.
• Entity distribution (training set):
• 25,307 DNA
• 2,481 RNA
• 11,217 cell_line
• 15,466 cell_type
• 55,117 protein
Dictionary NER
• Wraps the pyAhoCorasick Automaton.
• Improvements in this fork:
• Supports dictionary loading as well as fit(X, y) like other NER models.
• Handles multiple entity classes.
CRF NER
• Wraps sklearn_crfsuite.CRF.
• Improvements in this fork:
• Removes the NLTK dependency, replacing it with SpaCy.
• Allows non-default features to be passed in.
SpaCy NER
• Wraps the NER provided by the SpaCy toolkit.
• Improvements in this fork:
• More robust to large data sizes; uses mini-batches for training.
BiLSTM CRF NER
• Wraps Anago’s BiLSTMCRF.
• Improvements in this fork:
• Works against the latest release (1.0.5) of Anago.
• No more intermittent failures due to time-step mismatches.
ELMo NER
• Wraps Anago’s ELModel.
• New in this fork; available in the current (dev) version of Anago.
• Needs a (mandatory) base embedding for the ELMo preprocessor.
Ensemble NER
• Max Voting.
• Improvements in this fork:
• Unifies Max Voting and Weighted Max Voting NERs into a single model.
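Per-token max voting can be sketched in a few lines; this is illustrative only, and the actual EnsembleNER implementation may differ:

```python
from collections import Counter

def vote(predictions, weights=None):
    """predictions: per-model tag sequences for one sentence.
    weights: optional per-model weights (uniform when None)."""
    weights = weights or [1.0] * len(predictions)
    voted = []
    for position in zip(*predictions):   # tags for one token, across models
        scores = Counter()
        for tag, w in zip(position, weights):
            scores[tag] += w
        voted.append(scores.most_common(1)[0][0])
    return voted

preds = [
    ["B-PER", "I-PER", "O"],   # model 1
    ["B-PER", "O",     "O"],   # model 2
    ["B-PER", "I-PER", "O"],   # model 3
]
print(vote(preds))  # ['B-PER', 'I-PER', 'O']
```

With weights, e.g. `vote(preds, weights=[0.2, 0.9, 0.2])`, the heavily-weighted model 2 wins the middle token and the result becomes `['B-PER', 'O', 'O']`.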
Results (OOTB)
• Comparison across models:
• The ELMo-based CRF has the best performance.
• SpaCy and BiLSTM have comparable performance, but CRF is competitive.
• Model-based NERs outperform gazetteers.
• F1-scores range from 0.65 to 0.80.
• Comparison across entity types:
• Some correlation observed between data volume and F1-scores.
• F1-scores range from 0.61 to 0.81.
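Entity-level precision, recall, and F1 – the usual NER benchmark metric behind scores like those above – can be computed directly from BIO sequences. A minimal sketch of the standard metric, not the NERDS evaluation code:

```python
def extract_entities(tags):
    """Collect (start, end, label) spans from a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last entity
        if tag.startswith("I-") and label == tag[2:]:
            continue                         # entity continues
        if label is not None:
            spans.add((start, i, label))     # close the open entity
        start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

def entity_f1(true_tags, pred_tags):
    """F1 over exact entity matches (span and label must both agree)."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    if not gold or not pred:
        return 0.0
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)

true = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
print(round(entity_f1(true, pred), 2))  # 0.5
```

Note the strictness: the truncated LOC prediction counts as fully wrong, which is why entity-level F1 is a harsher (and more honest) measure than token accuracy.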
Future Work
• Current API is only superficially Scikit-Learn-like; convert models to fully conform to the Scikit-Learn Classifier API.
• Eliminate serialization issues reported by joblib.Parallel.
• Eliminate EnsembleNER in favor of Scikit-Learn’s VotingClassifier.
• Leverage Scikit-Learn’s model selection classes (RandomizedSearchCV and GridSearchCV).
• Add FLAIR- and BERT-based NERs to the supported model collection.
• BRAT annotation adapter.
Thank you
https://github.com/sujitpal/nerds
sujit.pal@elsevier.com
