The document describes dictionary-based named entity extraction from streaming text. It surveys named entity recognition approaches (regular-expression-based, dictionary-based, and model-based), then describes the SoDA v.2 architecture for scalable dictionary-based named entity extraction, including the Aho-Corasick algorithm, SolrTextTagger, and the services provided. Finally, it outlines future work on improving the system.
2. Agenda
• Introduction
• The Entity Resolution Problem
• Named Entity Recognition/Extraction (NER)
• SoDA v.2 Architecture
• SoDA v.2 Services
• Future Work
• Conclusion
3. Introduction
• About Me
• Work at Elsevier Labs
• Interested in Search, NLP and Machine Learning
• Email: sujit.pal@elsevier.com
• Twitter: @palsujit
• About Elsevier Labs
• Advanced Technology Group within Elsevier
• More info: https://labs.elsevier.com
• About Elsevier
• World’s largest publisher of STM books and journals
• Uses data to inform and enable consumers of STM Info
4. The Entity Resolution Problem
• Named Entity Recognition/Extraction – recognize mentions of named entities in text.
• Named Entity Resolution – resolve each mention to its root entity.
Hillary Clinton and Bill Clinton visited a diner during Clinton’s 2016 presidential campaign.
[Hillary Clinton]PERSON and [Bill Clinton]PERSON visited a [diner]LOCATION during [Clinton]PERSON’s [2016 presidential campaign]EVENT.
5. Approaches to NER
• Three major approaches
• Regular Expression (RegEx) Based
• Dictionary Based
• Model Based
• Hybrid approaches
• Combining Approaches
• Data Programming
• Active Learning
6. RegEx based NER
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
• PERSON: ([A-Z][a-z]+){2,3}
• AGE: (\d){1,3}\syears\sold
• DATE: ([A-Z][a-z]{2}(\.)*)\s(\d{2})
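To make this concrete, here is a minimal Python sketch of regex-based NER. The patterns are lightly adapted from the slide (whitespace added to PERSON so it fires on multi-word names in the tokenized example):

```python
import re

# Regex-based NER sketch: the label is whichever pattern matched the span.
patterns = {
    "PERSON": r"(?:[A-Z][a-z]+ ){2,3}",   # 2-3 capitalized words
    "AGE":    r"\d{1,3}\syears\sold",
    "DATE":   r"[A-Z][a-z]{2}(\.)*\s\d{2}",
}

text = ("Pierre Vinken , 61 years old , will join the board "
        "as a nonexecutive director Nov. 29 .")
for label, pat in patterns.items():
    for m in re.finditer(pat, text):
        print(label, repr(m.group(0)), m.span())
```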
7. Dictionary Based NER
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
• PERSON – dictionary of names of famous people
• DATE – dictionary of month names and abbreviations
8. Dictionary based NER – 3rd Party S/W
• Open Source
• GATE (General Architecture for Text Engineering)
• pyahocorasick
• SoDA (SOlr Dictionary Annotator)
• Commercial / Open Source
• LingPipe
10. Model based NER – Sequence Models
• Typical model structure
• Input – a sentence s, i.e. a sequence of words {x_0, x_1, …, x_n}.
• Output – a sequence Y = {y_0, y_1, …, y_n} of IOB tags.
• Hidden Markov Models – each IOB tag depends on the input variable and the previous label.
• Conditional Random Fields – each IOB tag depends on features {f_0, f_1, …, f_m} with learned weights {λ_0, λ_1, …, λ_m}, defined over the current word x_i, the current label y_i, the previous label y_{i-1}, and the entire sentence s.
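For reference, the standard linear-chain CRF formulation this describes (not shown on the slide) is:

$$ P(Y \mid s) = \frac{1}{Z(s)} \exp\Big( \sum_{i=1}^{n} \sum_{j=0}^{m} \lambda_j \, f_j(y_{i-1}, y_i, s, i) \Big) $$

where Z(s) normalizes over all possible tag sequences.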
11. Model based NER – Sequence Models (2)
• The family of Deep Learning sequence models has been used for POS tagging, phrase chunking, NER, and even language translation.
• Feature vectors for words are created using word embeddings (word2vec, GloVe, fastText, etc.).
• Performance can be improved with attention mechanisms.
• Represents the state of the art for Named Entity Recognition.
• Needs lots of data to train.
[Diagram: seq2seq tagger with an LSTM encoder reading inputs x_0, x_1, …, x_n (plus EOS) and an LSTM decoder emitting tags y_0, y_1, …, y_n.]
12. Model based NER – 3rd party S/W
• Open Source
• GATE
• Apache OpenNLP
• Stanford NER (has NLTK plugin)
• SpaCy NER
• NERDS
• Commercial
• Basis Technology Rosette Entity Extractor
• IBM Watson / Alchemy API
• Amazon Comprehend
• Azure Named Entity Recognition
13. Hybrid Approaches – combinations
• Create an initial labeled dataset by harvesting entities from large text corpora using one or more of the following:
• Weak supervision – RegEx and other pattern matching (e.g. Hearst patterns for phrases).
• Distant supervision – matching against dictionaries derived from industry-specific (public or private) ontologies.
• Unsupervised – legacy rule-based models.
• Supervised – predictions from weaker models.
• Crowdsourcing – using human experts.
• Train a powerful seq2seq model using the labeled dataset.
• Refine using human-in-the-loop active learning or other techniques.
14. Data Programming - Snorkel
• Start with noisy labels L from various sources.
• Train a generative model capable of generating probabilities P for each of the output classes based on a feature vector of the noisy labels.
• Train a final noise-aware discriminative model, with the output P of the generative model and the original data X, to predict class labels Q for the data.
• The Snorkel project (https://hazyresearch.github.io/snorkel/) pioneered this approach and provides tooling for all these steps.
Image Credit: Snorkel Project
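A toy Python sketch of the data-programming idea (illustrative only; this is not Snorkel's actual API, and the labeling functions are made up):

```python
import numpy as np

ABSTAIN, NEG, POS = -1, 0, 1

def lf_contains_md(x):   # noisy label source 1
    return POS if " MD" in x else ABSTAIN

def lf_starts_dr(x):     # noisy label source 2
    return POS if x.startswith("Dr.") else ABSTAIN

def lf_all_lower(x):     # noisy label source 3
    return NEG if x.islower() else ABSTAIN

lfs = [lf_contains_md, lf_starts_dr, lf_all_lower]

def label_matrix(examples):
    # Each row holds one example's noisy votes from all label sources.
    return np.array([[lf(x) for lf in lfs] for x in examples])

def soft_labels(L):
    """Crude stand-in for the generative model: per-example vote proportions."""
    probs = []
    for row in L:
        votes = row[row != ABSTAIN]
        p_pos = 0.5 if len(votes) == 0 else (votes == POS).mean()
        probs.append([1 - p_pos, p_pos])
    return np.array(probs)

examples = ["Dr. Jane Smith", "John Doe MD", "a diner in ohio"]
# These probabilities P become soft targets for the discriminative model.
print(soft_labels(label_matrix(examples)))
```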
15. SoDA v.2 Architecture
• Theoretical Foundations
• Aho-Corasick algorithm
• SolrTextTagger
• SoDA Architecture
• Scaling SoDA
16. Aho-Corasick Algorithm
• Implements a data structure called a “trie”, augmented with failure links.
• A state machine over characters.
• Dictionary-based NERs implement a similar state machine over the words in phrases.
Image Credit: ResearchGate
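A minimal sketch using the pyahocorasick library mentioned earlier (the lexicon entries are made up for illustration):

```python
import ahocorasick  # pip install pyahocorasick

# Build the automaton once from dictionary entries, then stream text through it.
lexicon = {"hillary clinton": "PERSON", "bill clinton": "PERSON",
           "presidential campaign": "EVENT"}

A = ahocorasick.Automaton()
for phrase, tag in lexicon.items():
    A.add_word(phrase, (tag, phrase))
A.make_automaton()  # compiles the trie plus failure links

text = ("Hillary Clinton and Bill Clinton visited a diner during "
        "Clinton's 2016 presidential campaign.").lower()
for end, (tag, phrase) in A.iter(text):  # yields (end index, stored value)
    start = end - len(phrase) + 1
    print(tag, phrase, (start, end))
# Note: this is pure substring matching; token-aware matching is one reason
# to prefer a SolrTextTagger-style approach for words-in-phrases dictionaries.
```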
17. SolrTextTagger
• Lucene’s TokenStreams are finite state automata (FSA).
• SolrTextTagger (https://github.com/OpenSextant/SolrTextTagger) dynamically builds FSAs from dictionary entries into a Finite State Transducer (FST) data structure.
• Provides a tag service to annotate incoming streaming text against the FST.
• Input is text; output is the matched dictionary entries and their offsets into the text.
• SolrTextTagger is OSS created by Lucene/Solr committer David Smiley.
Image Credit: Slides for Automata Invasion talk by Michael McCandless and Robert Muir
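A sketch of calling the tagger request handler directly over HTTP (the host, port, and collection name are assumptions; `overlaps`, `tagsLimit`, and `fl` are standard tagger parameters):

```python
import requests

# POST raw text to the tag handler; it returns matched entries with offsets.
resp = requests.post(
    "http://localhost:8983/solr/names/tag",  # hypothetical collection "names"
    params={"overlaps": "NO_SUB", "tagsLimit": 5000, "fl": "id,name", "wt": "json"},
    headers={"Content-Type": "text/plain"},
    data="Hillary Clinton and Bill Clinton visited a diner.",
)
for tag in resp.json()["tags"]:
    # Each tag carries start/end offsets and the ids of matching dictionary docs
    # (exact response layout abbreviated here).
    print(tag)
```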
18. Architecture
• Co-located with a standalone Solr server.
• Scala-based thin wrapper over SolrTextTagger.
• Provides the following services:
• unified JSON over HTTP request/response
• multiple matching styles
• multiple lexicons
• hides the details of managing SolrTextTagger
• Streaming (text) and non-streaming (phrase) matching services.
• Programmatic APIs for Scala and Python.
19. Scaling
• Install and configure Solr, SolrTextTagger, and SoDA, and create an AMI.
• Use CloudFormation (or Terraform) templates to instantiate a cluster of Solr+SoDA instances behind an Elastic Load Balancer.
• Autoscaling cluster, monitored by CloudWatch.
• New dictionaries are loaded by instantiating an EC2 instance from the AMI via Lambda, and saved back into the AMI for the next cluster build.
[Diagram: client and loader paths into the load-balanced Solr+SoDA cluster.]
20. Consuming Annotations at scale
• Synchronous – a Databricks notebook reads documents from S3, sends them to the SoDA cluster, and writes the annotations back to S3 as Parquet.
• Asynchronous – a producer reads documents from S3 onto Kafka/Kinesis streams; a consumer sends them to the SoDA cluster and writes the annotations back to S3 as Parquet.
21. SoDA Services
• Bulk Loader (backend)
• Client facing (front end)
• Index (status check)
• Add New Record into Lexicon
• Delete Lexicon or Entry
• Annotate Text against Lexicon
• List Available Lexicons
• Find coverage of incoming text against Lexicons
• Lookup by ID
• Reverse Lookup by Phrase
22. SoDA Bulk Loader
• Multithreaded loader for bulk-loading dictionaries into SoDA.
• Requires a tab-separated file, one line per dictionary entry, in the following format:
• id {TAB} primary-name {PIPE} alt-name-1 {PIPE} ... {PIPE} alt-name-n
• Script to run (on SoDA/Solr box).
• ./bulk_load.sh lexicon /path/to/input num_workers
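For illustration, a hypothetical two-entry lexicon file in that format (the separator after the id is a tab), with a corresponding loader invocation (lexicon name, path, and worker count are made up):

```
$ head -2 diseases.tsv
MESH:D003920	diabetes mellitus|diabetes|DM
MESH:D006973	hypertension|high blood pressure|HTN
$ ./bulk_load.sh diseases /path/to/diseases.tsv 4
```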
23. SoDA Health Check – index.json
• Returns a status message. Meant to be used for testing if the SoDA application is up.
• Python client code, Scala client code, and output (a Python sketch follows)
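A minimal Python client sketch for the health check (the host, port, and exact response payload are assumptions):

```python
import requests

# Health check: GET /soda/index.json; any 200 response means the service is up.
resp = requests.get("http://localhost:8080/soda/index.json")
resp.raise_for_status()
print(resp.json())  # assumed: a small JSON status payload
```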
24. Annotate Text against Lexicon – annot.json
• Annotates text against a specific lexicon and match type.
• Match types can be one of the following:
• exact – matches text spans against dictionary entries exactly.
• lower – same as exact, but matching is case-insensitive.
• stop – same as lower, but stop words are removed from both text and dictionary entries.
• stem1 – same as stop, but stemmed with the Solr minimal English stemmer.
• stem2 – same as stop, but stemmed with the Solr KStem stemmer.
• stem3 – same as stop, but stemmed with the Solr Porter stemmer.
• Input (HTTP POST; a Python sketch follows the next slide)
25. Annotate Text against Lexicon (2)
• Python client code, Scala client code, and output (sketched below)
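A Python sketch of annotating text against a lexicon (the payload field names and response shape are illustrative; consult the SoDA docs for the exact schema):

```python
import requests

payload = {
    "lexicon": "countries",   # hypothetical lexicon name
    "text": "Pierre Vinken will fly from France to the United States on Nov. 29.",
    "matchflag": "stem1",     # one of: exact, lower, stop, stem1, stem2, stem3
}
resp = requests.post("http://localhost:8080/soda/annot.json", json=payload)
for annot in resp.json():  # assumed: entries with id, begin/end offsets, covered text
    print(annot)
```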
26. List Available Lexicons – dicts.json
• Returns a list of lexicons available to annotate against.
• Python client, Scala client, and output (sketched below)
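A Python sketch for listing lexicons (response shape assumed):

```python
import requests

# List the lexicons hosted by this SoDA instance.
resp = requests.get("http://localhost:8080/soda/dicts.json")
print(resp.json())  # assumed: e.g. ["countries", "diseases", ...]
```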
27. Check Coverage – coverage.json
• This can be used to find out which lexicons are appropriate for annotating your text. The service lets you send a piece of text to all hosted lexicons and returns the number of matches found in each.
• Input (HTTP POST), Python client, and Scala client (a sketch follows the next slide)
28. Check Coverage (2)
• Output
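A Python sketch of the coverage check (payload field names and response shape are illustrative):

```python
import requests

# How many matches does each hosted lexicon find in this text?
payload = {"text": "Hypertension and diabetes are common comorbidities."}
resp = requests.post("http://localhost:8080/soda/coverage.json", json=payload)
print(resp.json())  # assumed: per-lexicon match counts, e.g. {"diseases": 2, ...}
```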
29. Lookup by ID – lookup.json
• Allows looking up a dictionary entry by lexicon and ID.
• Input (HTTP POST), Python client, and Scala client (a sketch follows the next slide)
30. Lookup by ID (2)
• Output
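A Python sketch of the lookup-by-ID service (the lexicon name, id value, and payload fields are hypothetical):

```python
import requests

# Look up a single dictionary entry by lexicon and entry id.
payload = {"lexicon": "countries", "id": "http://www.geonames.org/CHN"}
resp = requests.post("http://localhost:8080/soda/lookup.json", json=payload)
print(resp.json())  # assumed: the entry's primary name and alternate names
```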
31. Reverse Lookup by Phrase
• Matches phrases against a specific lexicon and match type.
• Match types can be one of the following:
• all match types supported by the annotation service (annot.json)
• lsort – case-insensitive matching against the phrase with its words sorted alphabetically.
• s3sort – case-insensitive matching against the phrase stemmed using the Porter stemmer (stem3) and with its words sorted alphabetically.
• Input (a Python sketch follows the next slide)
32. Reverse Lookup by Phrase (2)
• Python client, Scala client, and output (sketched below)
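A Python sketch of the reverse lookup (the endpoint name is assumed, since the slide does not give it, and the payload fields are illustrative):

```python
import requests

# Reverse lookup: find lexicon entries matching a phrase.
payload = {"lexicon": "diseases",
           "phrase": "diabetes mellitus type 2",
           "matchflag": "s3sort"}  # or any annot.json match type, or lsort
resp = requests.post("http://localhost:8080/soda/rlookup.json", json=payload)
print(resp.json())
```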
33. Future Work
• A list of open items is kept on the SoDA issues page and continuously updated as I find them (https://github.com/elsevierlabs-os/soda/issues).
• Please feel free to post issues and ideas for improvement.