Building Named Entity Recognition Models Efficiently using NERDS
Sujit Pal, Elsevier Labs
December 2019
About me
• Work at Elsevier Labs
• (Mostly self-taught) data scientist
• Mostly work with Deep Learning, Machine
Learning, Natural Language Processing, and
Search.
• Got interested in Named Entity Recognition
(NER) and NERDS as part of Search and
Knowledge Graph development.
I am NOT the author or maintainer of NERDS!
• Originally built by Panagiotis Eustratiadis.
• See CONTRIBUTORS.md for list of contributors.
• Open sourced by Elsevier July 3, 2018.
Agenda
• What can NER do for you?
• Evolution of NER techniques
• NERDS Architecture
• NERDS Usage
• Future Work
What can NER do for you?
• In general…
• Foundational task for NLP pipelines.
• Good NER models available out of the box (OOB) for “standard” named entities.
• Feeds downstream tasks such as Topic Modeling, Co-reference Resolution, etc.
• Information Retrieval (IR)
• Chunks entities into meaningful multi-word phrases.
• Helps understand query intent.
• Automated Knowledge Graph Construction (AKBC)
• NER extracts entities from incoming text.
• Relationship Extraction finds relationships between entity pairs.
• The resulting entity-relationship triple is inserted into the Knowledge Graph.
ConceptSearch!
Evolution of NER Techniques
• Traditional: rules, regular expressions, gazetteers.
• Statistical: word-based models (PMI, log-likelihood); sequence models (Conditional Random Fields).
• Neural: Bi-LSTM, Bi-LSTM+CRF, Transformer-based models.
Input Format – BIO Tagging
• BIO – Begin, Inside, Outside.
• Barack/B-PER Obama/I-PER is/O 44th/O United/B-LOC States/I-LOC President/O ./O
• BILOU – a tagging variant:
• U – Unit token (for single-token entities)
• L – Last token in an entity, e.g. Barack/B-PER Obama/L-PER

Barack B-PER
Obama I-PER
is O
44th O
United B-LOC
States I-LOC
President O
. O
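Decoding BIO tags back into entity spans is a common utility when working with this format. A generic sketch (not NERDS code):

```python
def bio_to_spans(tokens, tags):
    """Decode a BIO tag sequence into (entity_text, label) pairs."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open entity first
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)            # continue the open entity
        else:                                # "O" or inconsistent I- closes it
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Barack", "Obama", "is", "44th", "United", "States", "President", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))
# [('Barack Obama', 'PER'), ('United States', 'LOC')]
```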
Gazetteer – Aho-Corasick
• Create an in-memory automaton from the dictionary.
• Stream content against the automaton.
• Multiple matches found in a single pass.
Aho, A.V. and Corasick, M.J., 1975. Efficient String Matching: An Aid to Bibliographic Search.
[Diagram: Aho-Corasick automaton built from dictionary phrases such as “Barack Obama” (PER) and “United States” (LOC), matching all of them in one pass over the text.]
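NERDS wraps the pyAhoCorasick library for this. As a stand-in, here is a minimal pure-Python token-trie matcher that shows the idea; a real Aho-Corasick automaton additionally builds failure links so all matches are found in a single left-to-right pass:

```python
def build_trie(dictionary):
    """dictionary: {phrase: label}. Builds a token-level trie."""
    root = {}
    for phrase, label in dictionary.items():
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node["$label"] = label  # marks the end of a dictionary phrase
    return root

def match(tokens, trie):
    """Return (start, end, label) for every dictionary phrase in tokens."""
    hits = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break
            if "$label" in node:
                hits.append((start, end + 1, node["$label"]))
    return hits

gazetteer = {"Barack Obama": "PER", "United States": "LOC"}
tokens = ["Barack", "Obama", "is", "44th", "United", "States", "President", "."]
print(match(tokens, build_trie(gazetteer)))
# [(0, 2, 'PER'), (4, 6, 'LOC')]
```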
Sequence Modeling – CRF
• Sequence version of logistic regression.
• Computes the optimum labeling l = (y0, …, yn) over the entire sentence s.
• Builds multiple feature functions f_j on each token, each returning a real value in the range 0..1. Function parameters:
• the sentence s with tokens (x0, …, xn) – a feature can use any token, the entire sentence, or functions computed over the sentence (e.g. POS tags),
• the current position i,
• the previous and next labels y(i-1) and y(i+1).
• The optimum labeling maximizes score(l | s) = Σ_i Σ_j w_j · f_j(s, i, y(i), y(i-1)); scores are converted to probabilities with a softmax over all labelings.
• Weights w_j are learned using gradient descent.
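Toolkits like sklearn-crfsuite represent each token as a dictionary of feature values. The feature function below is a typical illustrative example, not the default NERDS feature set:

```python
def word2features(sent, i):
    """Features for token i of sent (a list of tokens).
    Real pipelines usually add POS tags and gazetteer flags."""
    word = sent[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
    }
    # Context features let the CRF use neighboring tokens.
    if i > 0:
        features["prev.lower"] = sent[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        features["next.lower"] = sent[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

sent = ["Barack", "Obama", "is", "44th"]
print(word2features(sent, 0)["word.istitle"])  # True
```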
Neural Model – BiLSTM
• Input is a sequence of tokens, output is a sequence of BIO tags.
• Weights trained end-to-end, no feature engineering needed.
• Bidirectional LSTM gets signal from neighboring words on both sides.
[Diagram: a Bi-LSTM maps “Barack Obama is 44th United States President .” to B-PER I-PER O O B-LOC I-LOC O O.]
Neural Model – BiLSTM-CRF
• Same as the previous model, with an additional CRF layer.
• No feature engineering for the CRF, unlike the CRF-only NER model.
• Pre-trained embeddings observed to improve performance.
[Diagram: Bi-LSTM followed by a CRF layer maps “Barack Obama is 44th United States President .” to B-PER I-PER O O B-LOC I-LOC O O.]
Neural Model – adding char embeddings
• Concatenate char embedding + word embedding and feed to the Bi-LSTM-CRF.
• All weights learned end-to-end.
• Handles rare / unknown words; exploits signal in prefixes / suffixes.
[Diagram: word embeddings and char LSTM/CNN outputs are concatenated and fed to the Bi-LSTM-CRF tagger.]
Neural Model – ELMo preprocessing
[Diagram: contextualized word embeddings from ELMo are concatenated with char LSTM/CNN outputs and fed to the Bi-LSTM-CRF tagger.]
Neural Model – Transformer based
• BERT = Bidirectional Encoder Representations from Transformers.
• Use as a source of embeddings (similar to ELMo) in standard BiLSTM + CRF models, OR
• Fine-tune LM-backed NERs such as HuggingFace’s BertForTokenClassification.
[Diagram: BERT tags “[CLS] Barack Obama is 44th United States President .” as B-PER I-PER O O B-LOC I-LOC O O.]
More Info on NER Techniques
• High-level overview of NER in a series of blog posts by Tobias Sterbak (https://bit.ly/2pNdgPG).
• Traditional NER techniques covered in the survey by Rahul Sharnagat (2014) – Named Entity Recognition: A Literature Survey (https://bit.ly/2NRaCAg).
• Introduction to neural models in the paper by Ronan Collobert and Jason Weston (2008) – A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning (https://bit.ly/32rRYnO).
• Other (more modern) papers mentioned in the slides.
NERDS Overview
• Framework that provides easy-to-use NER capabilities to Data Scientists.
• Wraps various popular third-party NER models.
• Extensible – new third-party NER tools can be added as needed.
• Software Engineering tooling to boost Data Science productivity.
• Looking for support, bug reports, contributions, and ideas.
Unification through I/O Format
• Wrapped models – pyAhoCorasick, CRFSuite, SpaCy NER, Anago BiLSTM – all share a common I/O representation:

AnnotatedDocument(
  doc: Document(“Barack Obama is 44th United States President .”),
  annotations: [
    Annotation(start_offset: 0, end_offset: 12, text: “Barack Obama”, label: “PER”),
    Annotation(start_offset: 22, end_offset: 35, text: “United States”, label: “LOC”)
  ])
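A minimal Python sketch of this representation, using hypothetical dataclasses for illustration rather than the actual NERDS classes (this sketch uses Python's exclusive-end offset convention, which may differ from the one shown on the slide):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Annotation:
    start_offset: int   # character offset where the entity starts
    end_offset: int     # character offset where it ends (exclusive here)
    text: str
    label: str

@dataclass
class AnnotatedDocument:
    content: str
    annotations: List[Annotation] = field(default_factory=list)

    def add_entity(self, text, label):
        """Locate text in content and record it as an annotation."""
        start = self.content.find(text)
        if start >= 0:
            self.annotations.append(
                Annotation(start, start + len(text), text, label))

doc = AnnotatedDocument("Barack Obama is 44th United States President .")
doc.add_entity("Barack Obama", "PER")
doc.add_entity("United States", "LOC")
print([(a.start_offset, a.end_offset, a.label) for a in doc.annotations])
# [(0, 12, 'PER'), (21, 34, 'LOC')]
```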
Benefits of Unification
• Consistent API – all models are subclasses of NERModel.
• Data prep done once per project and reused across multiple models.
• Reusable training and evaluation code.
• Familiar Scikit-Learn-like API, and access to Scikit-Learn utility functions.
• Duck typing allows us to build ensembles of NER models.
• Easy to benchmark NER models against labeled data.
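The shared fit/predict contract means driver code can be written once and looped over models. The stand-in models below are trivial toys (not real NERDS models) that illustrate the duck-typing pattern:

```python
class GazetteerNER:
    """Stand-in model: memorizes token -> tag mappings seen in training."""
    def fit(self, X, y):
        self.tags = {t: tag for sent, tags in zip(X, y)
                     for t, tag in zip(sent, tags)}
        return self
    def predict(self, X):
        return [[self.tags.get(t, "O") for t in sent] for sent in X]

class MajorityNER:
    """Stand-in model: predicts 'O' for every token."""
    def fit(self, X, y):
        return self
    def predict(self, X):
        return [["O"] * len(sent) for sent in X]

def token_accuracy(y_true, y_pred):
    pairs = [(t, p) for ts, ps in zip(y_true, y_pred)
             for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)

X = [["Barack", "Obama", "is", "44th"]]
y = [["B-PER", "I-PER", "O", "O"]]
# The same driver works for every model -- this is the duck typing at work.
for model in [GazetteerNER(), MajorityNER()]:
    model.fit(X, y)
    print(type(model).__name__, token_accuracy(y, model.predict(X)))
# GazetteerNER 1.0
# MajorityNER 0.5
```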
Can we do better?
• A simpler internal format – parallel lists of tokens and tags:
Data: [[“Barack”, “Obama”, “is”, “44th”, “United”, “States”, “President”, “.”]]
Labels and Predictions: [[“B-PER”, “I-PER”, “O”, “O”, “B-LOC”, “I-LOC”, “O”, “O”]]
[Diagram: DictionaryNER and SpacyNER need an I/O conversion step; CrfNER and BiLstmCrfNER consume this format directly.]
ELMo NER Model from Anago
[Diagram: ElmoNER joins DictionaryNER, CrfNER, SpacyNER, and BiLstmCrfNER; all models share the same token-list Data / Labels format, with I/O conversion only where needed.]
Dataset
• Bio-Entity Recognition task from BioNLP 2004.
• Training and test sets provided in BIO format.
• 511,097 training examples.
• 104,895 test examples.
• Entity distribution (training set):
• 25,307 DNA
• 2,481 RNA
• 11,217 cell_line
• 15,466 cell_type
• 55,117 protein
Dictionary NER
• Wraps the pyAhoCorasick Automaton.
• Improvements in this fork:
• Supports dictionary loading as well as fit(X, y) like other NER models.
• Handles multiple entity classes.
CRF NER
• Wraps sklearn_crfsuite.CRF.
• Improvements in this fork:
• Removes the NLTK dependency, replacing it with SpaCy.
• Allows non-default features to be passed in.
SpaCy NER
• Wraps the NER provided by the SpaCy toolkit.
• Improvements in this fork:
• More robust to large data sizes; uses mini-batches for training.
BiLSTM CRF NER
• Wraps Anago’s BiLSTMCRF.
• Improvements in this fork:
• Works against the latest release (1.0.5) of Anago.
• No more intermittent failures due to time-step mismatches.
ELMo NER
• Wraps Anago’s ELModel.
• New in this fork; available in the current (dev) version of Anago.
• Needs a (mandatory) base embedding for the ELMo preprocessor.
Ensemble NER
• Max Voting.
• Improvements in this fork:
• Unifies Max Voting and Weighted Max Voting NERs into a single model.
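Per-token max voting can be sketched in a few lines; this is illustrative only, and the actual EnsembleNER implementation may differ:

```python
from collections import Counter

def vote(predictions, weights=None):
    """predictions: per-model tag sequences for one sentence.
    weights: optional per-model weights (uniform when None)."""
    weights = weights or [1.0] * len(predictions)
    voted = []
    for position in zip(*predictions):   # tags for one token, across models
        scores = Counter()
        for tag, w in zip(position, weights):
            scores[tag] += w
        voted.append(scores.most_common(1)[0][0])
    return voted

preds = [
    ["B-PER", "I-PER", "O"],   # model 1
    ["B-PER", "O",     "O"],   # model 2
    ["B-PER", "I-PER", "O"],   # model 3
]
print(vote(preds))  # ['B-PER', 'I-PER', 'O']
```

With weights, e.g. `vote(preds, weights=[0.2, 0.9, 0.2])`, the heavily-weighted model 2 wins the middle token and the result becomes `['B-PER', 'O', 'O']`.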
Results (OOTB)
• Comparison across models:
• The ELMo-based CRF has the best performance.
• SpaCy and BiLSTM have comparable performance, but CRF is competitive.
• Model-based NERs outperform gazetteers.
• F1-scores range from 0.65 to 0.80.
• Comparison across entity types:
• Some correlation observed between data volume and F1-scores.
• F1-scores range from 0.61 to 0.81.
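Entity-level precision, recall, and F1 – the usual NER benchmark metric behind scores like those above – can be computed directly from BIO sequences. A minimal sketch of the standard metric, not the NERDS evaluation code:

```python
def extract_entities(tags):
    """Collect (start, end, label) spans from a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last entity
        if tag.startswith("I-") and label == tag[2:]:
            continue                         # entity continues
        if label is not None:
            spans.add((start, i, label))     # close the open entity
        start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

def entity_f1(true_tags, pred_tags):
    """F1 over exact entity matches (span and label must both agree)."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    if not gold or not pred:
        return 0.0
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)

true = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
print(round(entity_f1(true, pred), 2))  # 0.5
```

Note the strictness: the truncated LOC prediction counts as fully wrong, which is why entity-level F1 is a harsher (and more honest) measure than token accuracy.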
Future Work
• Current API is only superficially Scikit-Learn-like; convert models to fully conform to the Scikit-Learn Classifier API.
• Eliminate serialization issues reported by joblib.Parallel.
• Eliminate EnsembleNER in favor of Scikit-Learn’s VotingClassifier.
• Leverage Scikit-Learn’s model selection classes (RandomizedSearchCV and GridSearchCV).
• Add FLAIR- and BERT-based NERs to the supported model collection.
• BRAT annotation adapter.
Thank you
https://github.com/sujitpal/nerds
sujit.pal@elsevier.com
