Dictionary based Named Entity Extraction from streaming text
Sujit Pal
SWIFT Technology Center, July 16, 2018
Agenda
• Introduction
• The Entity Resolution Problem
• Named Entity Recognition/Extraction (NER)
• SoDA v.2 Architecture
• SoDA v.2 Services
• Future Work
• Conclusion
Introduction
• About Me
• Work at Elsevier Labs
• Interested in Search, NLP and Machine Learning
• Email: sujit.pal@elsevier.com
• Twitter: @palsujit
• About Elsevier Labs
• Advanced Technology Group within Elsevier
• More info: https://labs.elsevier.com
• About Elsevier
• World’s largest publisher of STM books and journals
• Uses data to inform and enable consumers of STM Info
The Entity Resolution Problem
• Named Entity Recognition/Extraction – recognize mentions of named
entities in text.
• Named Entity Resolution – resolve each mention to its root (canonical) entity.
Example: Hillary Clinton and Bill Clinton visited a diner during
Clinton’s 2016 presidential campaign.
Entity types marked on the slide: PERSON, LOCATION, EVENT.
Approaches to NER
• Three major approaches
• Regular Expression (RegEx) Based
• Dictionary Based
• Model Based
• Hybrid approaches
• Combining Approaches
• Data Programming
• Active Learning
RegEx based NER
Pierre Vinken , 61 years old , will join the board as a
nonexecutive director Nov. 29 .
PERSON
([A-Z][a-z]+){2,3}
AGE
(\d){1,3}\syears\sold
DATE
([A-Z][a-z]{2}(\.)*)\s(\d{2})
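A minimal sketch of the RegEx approach on the example sentence, using Python's standard `re` module. The patterns here are illustrative assumptions rather than the exact ones on the slide: the PERSON pattern is written to allow whitespace between name parts, and the DATE pattern accepts one or two digits.

```python
import re

# Illustrative patterns (assumptions, loosely following the slide).
PATTERNS = {
    "PERSON": re.compile(r"(?:[A-Z][a-z]+\s){1,2}[A-Z][a-z]+"),
    "AGE": re.compile(r"\d{1,3}\syears\sold"),
    "DATE": re.compile(r"[A-Z][a-z]{2}\.?\s\d{1,2}"),
}

def regex_ner(text):
    """Return sorted (start, end, label, surface) tuples for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)

text = ("Pierre Vinken , 61 years old , will join the board as a "
        "nonexecutive director Nov. 29 .")

for start, end, label, surface in regex_ner(text):
    print(label, start, end, surface)
```

Patterns like these are cheap to write but brittle: the PERSON pattern would also fire on any run of two or three capitalized words, which is one reason to combine it with the other approaches.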
Dictionary Based NER
Pierre Vinken , 61 years old , will join the board as a
nonexecutive director Nov. 29 .
PERSON: names of famous people
DATE: month names and abbreviations
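The dictionary idea can be sketched as a greedy longest-match lookup of token n-grams in a phrase lexicon. This is only a toy illustration, not how SoDA or any of the tools listed later work internally; the lexicon contents and the `max_len` window are assumptions.

```python
def dictionary_ner(tokens, lexicon, max_len=3):
    """Greedy longest-match lookup of token n-grams in a phrase -> label lexicon.

    Returns (start_token, end_token_exclusive, label) tuples.
    """
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                spans.append((i, i + n, lexicon[phrase]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return spans

# Hypothetical toy lexicon: famous people and month abbreviations.
lexicon = {"Pierre Vinken": "PERSON", "Nov.": "DATE", "Dec.": "DATE"}
tokens = ("Pierre Vinken , 61 years old , will join the board as a "
          "nonexecutive director Nov. 29 .").split()

print(dictionary_ner(tokens, lexicon))
```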
Dictionary based NER – 3rd Party S/W
• Open Source
• GATE (General Architecture for Text Engineering)
• pyahocorasick
• SoDA (SOlr Dictionary Annotator)
• Commercial / Open Source
• LingPipe
Model Based NER
Tokens and the IOB tags predicted by the Machine Learning model (token/TAG):
Pierre/B-PER Vinken/I-PER ,/O 61/B-AGE years/I-AGE old/O ,/O will/O join/O
the/O board/O as/O a/O non-executive/O director/O Nov./B-DATE 29/I-DATE ./O
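Whatever model produces the tags, its output still has to be grouped back into entity mentions. A minimal sketch of that decoding step follows; the function name is hypothetical, and the tags are copied from the slide in order (so "old" carries an O tag).

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans.

    An I- tag that does not continue an open span of the same type is dropped.
    """
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            # Close the open span on O, on a new B- tag, or on a type change.
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if starts_new:
            current_type, current_tokens = tag[2:], [token]
        elif continues:
            current_tokens.append(token)
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join",
          "the", "board", "as", "a", "non-executive", "director", "Nov.", "29", "."]
tags = ["B-PER", "I-PER", "O", "B-AGE", "I-AGE", "O", "O", "O", "O",
        "O", "O", "O", "O", "O", "O", "B-DATE", "I-DATE", "O"]

print(iob_to_spans(tokens, tags))
```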
Model based NER – Sequence Models
• Typical model structure
• Input – a sentence s or a sequence of words {x0, x1, …, xn}.
• Output – a sequence Y {y0, y1, …, yn} of IOB tags.
• Hidden Markov Models – IOB tag depends on input variable and
previous label.
• Conditional Random Fields – IOB tag depends on features {f0, f1, …,
fm} with learned weights {λ0, λ1, …, λm} defined over current word xi,
current label yi, previous label y(i-1), and the entire sentence s.
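For reference, the standard linear-chain CRF form that matches the description above can be written as follows. The notation is an assumption that follows the symbols used on the slide, with y_{-1} taken to be a fixed start label.

```latex
P(Y \mid s) = \frac{1}{Z(s)}
  \exp\!\left( \sum_{i=0}^{n} \sum_{j=0}^{m} \lambda_j \, f_j(y_{i-1}, y_i, s, i) \right),
\qquad
Z(s) = \sum_{Y'} \exp\!\left( \sum_{i=0}^{n} \sum_{j=0}^{m} \lambda_j \, f_j(y'_{i-1}, y'_i, s, i) \right)
```

Here Z(s) normalizes over all possible tag sequences Y' for the sentence s.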
Model based NER – Sequence Models (2)
• Family of Deep Learning Sequence Models – has been used for POS
tagging, phrase chunking, NER and even language translation.
• Feature vectors for words created using Word Embeddings (word2vec,
GloVe, fastText, etc.).
• Performance can be improved with Attention mechanisms.
• Represents state of the art for Named Entity Recognition.
• Needs lots of data to train.
[Figure: LSTM encoder-decoder diagram. Inputs x0, x1, …, xn, EOS feed the LSTM encoder; the LSTM decoder emits y0, y1, …, yn, EOS.]
Model based NER – 3rd party S/W
• Open Source
• GATE
• Apache OpenNLP
• Stanford NER (has NLTK plugin)
• spaCy NER
• NERDS
• Commercial
• Basis Technology Rosette Entity Extractor
• IBM Watson / Alchemy API
• Amazon Comprehend
• Azure Named Entity Recognition
Hybrid Approaches – combinations
• Create initial labeled dataset by harvesting entities from large text corpora
using one or more of the following:
• Weak Supervision – RegEx and other pattern matching (e.g. Hearst
Patterns for phrases).
• Distant Supervision – matching against dictionaries derived from
industry specific (public or private) ontologies.
• Unsupervised – legacy rule based models.
• Supervised – predictions from weaker models.
• Crowdsourcing – using human experts.
• Train powerful seq2seq model using labeled dataset.
• Refine using human-in-the-loop active learning or other techniques.
Data Programming - Snorkel
• Start with noisy labels L from various sources
• Train generative model capable of generating probabilities P for each of
the output classes based on feature vector of noisy labels.
• Train final noise-aware discriminative model with output of generative
model P and original data X to predict class label Q for data.
• The Snorkel project (https://hazyresearch.github.io/snorkel/) pioneered
this approach and provides tooling for all these steps.
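The first step, collecting noisy labels from several sources, can be sketched in plain Python. This sketch deliberately does not use the Snorkel API: the labeling functions are hypothetical toys, and the vote-share combination below is a much simpler stand-in for the generative model, which learns how much to trust each source.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

def lf_in_dictionary(token):
    # Distant supervision: hypothetical toy dictionary of month abbreviations.
    return "DATE" if token in {"Jan.", "Feb.", "Nov.", "Dec."} else ABSTAIN

def lf_is_number(token):
    # Weak supervision: a bare number is guessed to be part of a date.
    return "DATE" if token.isdigit() else ABSTAIN

def lf_lowercase_word(token):
    # Weak supervision: lower-case alphabetic words are guessed to be non-entities.
    return "O" if token.isalpha() and token.islower() else ABSTAIN

LABELING_FUNCTIONS = [lf_in_dictionary, lf_is_number, lf_lowercase_word]

def vote_probabilities(token, lfs=LABELING_FUNCTIONS):
    """Turn the non-abstaining votes into a per-class probability estimate."""
    votes = Counter(v for v in (lf(token) for lf in lfs) if v is not ABSTAIN)
    total = sum(votes.values())
    return {label: n / total for label, n in votes.items()} if total else {}

for token in ["director", "Nov.", "29", ","]:
    print(token, vote_probabilities(token))
```

The resulting probabilities play the role of P in the bullets above: soft labels on which a discriminative model can then be trained.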
Image Credit: Snorkel Project
SoDA v.2 Architecture
• Theoretical Foundations
• Aho-Corasick algorithm
• SolrTextTagger
• SoDA Architecture
• Scaling SoDA
Aho-Corasick Algorithm
• Builds on a data structure called a “trie”, extended with failure links between states
• State machine over characters
• Dictionary based NERs implement similar state machine over words in
phrases.
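A compact pure-Python sketch of the character-level Aho-Corasick automaton described above: a trie, failure links computed breadth-first, and a single pass over the text. The function names are hypothetical, and production code would more likely use a library such as pyahocorasick.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and per-state outputs."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_offset, pattern) for every dictionary match in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

The text is scanned once regardless of how many dictionary entries there are, which is what makes the approach attractive for large lexicons.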
Image Credit: ResearchGate
SolrTextTagger
• Lucene’s TokenStreams are finite state automata (FSA).
• SolrTextTagger (https://github.com/OpenSextant/SolrTextTagger)
dynamically builds FSAs from dictionary entries and stores them in a Finite State
Transducer (FST) data structure.
• Provides tag service to annotate incoming streaming text against FST.
• Input is text, output is matched dictionary entries and offsets into text.
• SolrTextTagger is OSS created by Lucene/Solr committer David Smiley.
Image Credit: Slides for Automata Invasion talk by Michael McCandless and Robert Muir
Architecture
• Co-located with standalone Solr server.
• Scala based thin wrapper over SolrTextTagger.
• Provides the following services:
• unified JSON over HTTP request/response
• multiple matching styles
• multiple lexicons
• hides details of managing SolrTextTagger.
• Streaming (text) and non-streaming (phrase) matching services.
• Programmatic APIs for Scala and Python.
Scaling
• Install and configure Solr, SolrTextTagger and SoDA and create AMI.
• Use CloudFormation (or Terraform) templates to instantiate cluster of Solr+SoDA instances behind Elastic Load Balancer.
• Autoscaling cluster
• Monitored by CloudWatch
• New dictionaries loaded by instantiating EC2 from AMI via Lambda and saved back into AMI for next cluster build.
[Figure: cluster diagram with client and loader components.]
Consuming Annotations at scale
• Synchronous (diagram components): Databricks Notebook, Documents on S3, SoDA cluster, Parquet Annotations on S3.
• Asynchronous (diagram components): Documents on S3, Producer, Kafka/Kinesis Streams, Consumer, SoDA cluster, Parquet Annotations on S3.
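As a rough illustration of the synchronous style, the sketch below fans documents out to an annotation function from a thread pool and collects the results by document ID. It is a stand-in, not the Databricks setup on the slide: `annotate` is a hypothetical callable that would wrap an HTTP call to the load-balanced SoDA cluster, and here it is replaced by a local stub so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor

def annotate_all(documents, annotate, max_workers=8):
    """Annotate {doc_id: text} concurrently; returns {doc_id: annotations}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = pool.map(annotate, documents.values())
        return dict(zip(documents.keys(), results))

def stub_annotate(text):
    # Hypothetical stand-in for a remote annotation call: returns title-cased tokens.
    return [token for token in text.split() if token.istitle()]

documents = {
    "doc-1": "Pierre Vinken will join the board",
    "doc-2": "no entities here",
}
print(annotate_all(documents, stub_annotate))
```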
SoDA Services
• Bulk Loader (backend)
• Client facing (front end)
• Index (status check)
• Add New Record into Lexicon
• Delete Lexicon or Entry
• Annotate Text against Lexicon
• List Available Lexicons
• Find coverage of incoming text against Lexicons
• Lookup by ID
• Reverse Lookup by Phrase
SoDA Bulk Loader
• Multithreaded loader for bulk loading dictionaries into SoDA.
• Requires tab-separated file in following format:
• id {TAB} primary-name {PIPE} alt-name-1 {PIPE} ... {PIPE} alt-name-n
• One line per dictionary entry
• Script to run (on SoDA/Solr box).
• ./bulk_load.sh lexicon /path/to/input num_workers
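A small sketch of producing input lines in the format above. It assumes that {TAB} and {PIPE} are literal tab and pipe characters with no surrounding spaces; the function name, the example ID and the example names are hypothetical.

```python
def format_entry(entry_id, primary_name, alt_names=()):
    """Format one dictionary entry as: id<TAB>primary|alt-1|...|alt-n."""
    names = [primary_name, *alt_names]
    for name in names:
        # Delimiter characters inside a name would corrupt the line.
        if any(ch in name for ch in "\t|\n"):
            raise ValueError(f"name contains a delimiter character: {name!r}")
    return f"{entry_id}\t{'|'.join(names)}"

def write_lexicon(path, entries):
    """Write (id, primary_name, alt_names) tuples, one line per entry."""
    with open(path, "w", encoding="utf-8") as handle:
        for entry_id, primary_name, alt_names in entries:
            handle.write(format_entry(entry_id, primary_name, alt_names) + "\n")

print(format_entry("C001", "heart attack", ["myocardial infarction", "MI"]))
```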
SoDA Health Check – index.json
• Returns a status message. Meant to be used for testing if the SoDA application is up.
• Python client code
• Scala client code
• Output
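The client code from the slide is not reproduced in this transcript. As a hedged stand-in, a health check could look like the sketch below, which treats any HTTP 200 response from `index.json` as "up". The base URL and function name are assumptions, and the actual response body format is not shown here.

```python
from urllib.request import urlopen

def soda_is_up(base_url, opener=urlopen, timeout=5):
    """Return True if GET {base_url}/index.json answers with HTTP 200."""
    try:
        with opener(f"{base_url}/index.json", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses (refused, DNS, 4xx/5xx).
        return False

# Example call (hypothetical host): soda_is_up("http://localhost:8080/soda")
```

The `opener` parameter exists only so the function can be exercised without a running server.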
Annotate Text against Lexicon – annot.json
• Annotates text against a specific lexicon and match type.
• Match types can be one of the following:
• exact – matches text spans with dictionary entries.
• lower – same as exact, but matches are case-insensitive
• stop – same as lower, but stop words removed from both text and dictionary entries
• stem1 – same as stop, but stemmed with Solr minimal English stemmer
• stem2 – same as stop, but stemmed with Solr Kstem stemmer
• stem3 – same as stop, but stemmed with Solr Porter stemmer.
• Input (HTTP POST)
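The first three match types can be pictured as progressively stronger normalizations applied to both the text and the dictionary entries before comparison. The sketch below models "lower" as lower-casing and "stop" as lower-casing plus stop word removal; the stop word list is a hypothetical toy, and the stemming variants are left out because they depend on Solr's stemmers.

```python
STOP_WORDS = {"a", "an", "and", "of", "the"}  # hypothetical toy list

def normalize(phrase, matching):
    """Normalize a phrase for the 'exact', 'lower' or 'stop' match type."""
    if matching == "exact":
        return phrase
    tokens = phrase.lower().split()  # also collapses repeated whitespace
    if matching == "lower":
        return " ".join(tokens)
    if matching == "stop":
        return " ".join(t for t in tokens if t not in STOP_WORDS)
    raise ValueError(f"match type not covered by this sketch: {matching}")

for matching in ("exact", "lower", "stop"):
    print(matching, "->", normalize("Cancer of the Lung", matching))
```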
Annotate Text against Lexicon (2)
• Python client code
• Scala client code
• Output
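The request and client code from the slide are not reproduced in this transcript. The sketch below shows one way a Python client might build the POST request with the standard library. The base URL and the JSON field names (`lexicon`, `text`, `matching`) are assumptions and should be checked against the SoDA documentation before use.

```python
import json
from urllib.request import Request, urlopen

SODA_URL = "http://localhost:8080/soda"  # hypothetical base URL

def build_annot_request(text, lexicon, matching="exact", base_url=SODA_URL):
    """Build (but do not send) a JSON POST request for annot.json."""
    payload = {"lexicon": lexicon, "text": text, "matching": matching}  # assumed field names
    return Request(
        f"{base_url}/annot.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def annotate(text, lexicon, matching="exact", base_url=SODA_URL):
    """Send the request and decode the JSON response (requires a running server)."""
    with urlopen(build_annot_request(text, lexicon, matching, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect and test offline.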
List Available Lexicons – dicts.json
• Returns a list of lexicons available to annotate against.
• Python client
• Scala client
• Output
Check Coverage – coverage.json
• This can be used to find which lexicons are appropriate for annotating your text.
The service sends a piece of text to all hosted lexicons and returns
the number of matches found in each.
• Input (HTTP POST)
• Python client
• Scala client
Check Coverage (2)
• Output
Lookup by ID – lookup.json
• Allows looking up a dictionary entry by lexicon and ID.
• Input (HTTP POST)
• Python client
• Scala client
Lookup by ID (2)
• Output
Reverse Lookup by Phrase
• Matches phrases against specific lexicon and match type.
• Match types can be one of the following:
• All match types supported by Annotation service (annot.json)
• lsort – case-insensitive matching against phrase with words sorted
alphabetically.
• s3sort – case-insensitive matching against phrase stemmed using
Porter Stemmer (stem3) and its words sorted alphabetically.
• Input
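The effect of the lsort match type can be sketched as a key function: lower-case the phrase, sort its words alphabetically, and compare keys. The function name is hypothetical; s3sort would additionally stem each word with a Porter stemmer, which is omitted here.

```python
def lsort_key(phrase):
    """Lower-case the phrase and sort its words alphabetically."""
    return " ".join(sorted(phrase.lower().split()))

# Word order and case no longer matter once both sides are keyed.
print(lsort_key("Lung Cancer"))
print(lsort_key("cancer lung") == lsort_key("Lung Cancer"))
```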
Reverse Lookup by Phrase (2)
• Python client
• Scala client
• Output
Future Work
• Open items are listed on the SoDA issues page
(https://github.com/elsevierlabs-os/soda/issues) and updated continuously as I find them.
• Please feel free to post issues and ideas for improvement.
Thank you
Contact Information
Email: sujit.pal@elsevier.com
Twitter: @palsujit
SoDA: https://github.com/elsevierlabs-os/soda

【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 

Recently uploaded (20)

一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 

SoDA v2 - Named Entity Recognition from streaming text

  • 1. Dictionary based Named Entity Extraction from streaming text Sujit Pal SWIFT Technology Center, July 16, 2018
  • 2. Agenda • Introduction • The Entity Resolution Problem • Named Entity Recognition/Extraction (NER) • SoDA v.2 Architecture • SoDA v.2 Services • Future Work • Conclusion
  • 3. Introduction • About Me • Work at Elsevier Labs • Interested in Search, NLP and Machine Learning • Email: sujit.pal@elsevier.com • Twitter: @palsujit • About Elsevier Labs • Advanced Technology Group within Elsevier • More info: https://labs.elsevier.com • About Elsevier • World’s largest publisher of STM books and journals • Uses data to inform and enable consumers of STM Info
  • 4. The Entity Resolution Problem • Named Entity Recognition/Extraction – recognize mentions of named entities in text. • Named Entity Resolution – resolve each mention to its root entity. • Example: Hillary Clinton and Bill Clinton visited a diner during Clinton’s 2016 presidential campaign. (Entity types marked on the slide: PERSON, LOCATION, EVENT.)
  • 5. Approaches to NER • Three major approaches • Regular Expression (RegEx) Based • Dictionary Based • Model Based • Hybrid approaches • Combining Approaches • Data Programming • Active Learning
  • 6. RegEx based NER • Example: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . • PERSON: ([A-Z][a-z]+){2,3} • AGE: (\d){1,3}\syears\sold • DATE: ([A-Z][a-z]{2}(\.)*)\s(\d{2})
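The RegEx approach on the slide above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the three patterns are simplified variants written for this example (the PERSON pattern in particular adds whitespace between name parts), and the function name `regex_ner` is hypothetical.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not the slide's exact expressions).
PATTERNS = {
    "PERSON": re.compile(r"(?:[A-Z][a-z]+\s){1,2}[A-Z][a-z]+"),  # 2-3 capitalized words
    "AGE": re.compile(r"\d{1,3}\syears\sold"),                   # e.g. "61 years old"
    "DATE": re.compile(r"[A-Z][a-z]{2}\.?\s\d{1,2}"),            # e.g. "Nov. 29"
}


def regex_ner(text):
    """Return (label, start, end, matched_text) tuples sorted by start offset."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((label, m.start(), m.end(), m.group()))
    return sorted(spans, key=lambda span: span[1])


text = ("Pierre Vinken , 61 years old , will join the board "
        "as a nonexecutive director Nov. 29 .")
entities = regex_ner(text)
for label, start, end, matched in entities:
    print(label, start, end, matched)
```

Patterns like these are cheap to write but brittle: any capitalized word pair would be tagged PERSON, which is one reason the deck goes on to dictionary and model based approaches.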
  • 7. Dictionary Based NER • Example: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . • PERSON: names of famous people • DATE: month names and abbreviations
  • 8. Dictionary based NER – 3rd Party S/W • Open Source • GATE (General Architecture for Text Engineering) • pyahocorasick • SoDA (SOlr Dictionary Annotator) • Commercial / Open Source • LingPipe
  • 10. Model based NER – Sequence Models • Typical model structure • Input – a sentence s or a sequence of words {x0, x1, …, xn}. • Output – a sequence Y {y0, y1, …, yn} of IOB tags. • Hidden Markov Models – IOB tag depends on input variable and previous label. • Conditional Random Fields – IOB tag depends on features {f0, f1, …, fm} with learned weights {λ0, λ1, …, λm} defined over current word xi, current label yi, previous label yi-1, and the entire sentence s.
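To make the IOB output format concrete, the sketch below converts a tag sequence back into entity spans. It assumes the common `B-TYPE` / `I-TYPE` / `O` tag spelling, which the slide does not specify; the function name `iob_to_spans` and the tag names `PER` and `AGE` are illustrative.

```python
def iob_to_spans(tokens, tags):
    """Group tokens into (entity_type, text) spans from parallel IOB tags."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Emit the span collected so far, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A B- tag, or an I- tag of a different type, starts a new span.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["Pierre", "Vinken", ",", "61", "years", "old"]
tags = ["B-PER", "I-PER", "O", "B-AGE", "I-AGE", "I-AGE"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

A sequence model (HMM, CRF or neural) predicts the `tags` list; a decoding step of this kind turns the prediction into the entity mentions a downstream consumer needs.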
  • 11. Model based NER – Sequence Models (2) • Family of Deep Learning Sequence Models – has been used for POS tagging, phrase chunking, NER and even language translation. • Feature vectors for words created using Word Embeddings (word2vec, GloVe, fasttext, etc). • Performance can be improved with Attention mechanisms. • Represents state of the art for Named Entity Recognition. • Needs lots of data to train. [Figure: LSTM encoder reading inputs x0, x1, …, xn, EOS, passing weights to an LSTM decoder that emits outputs y0, y1, …, yn, EOS.]
  • 12. Model based NER – 3rd party S/W • Open Source • GATE • Apache OpenNLP • Stanford NER (has NLTK plugin) • SpaCy NER • NERDS • Commercial • Basis Technology Rosette Entity Extractor • IBM Watson / Alchemy API • Amazon Comprehend • Azure Named Entity Recognition
  • 13. Hybrid Approaches – combinations • Create initial labeled dataset by harvesting entities from large text corpora using one or more of the following: • Weak Supervision – RegEx and other pattern matching (e.g. Hearst Patterns for phrases). • Distant Supervision – matching against dictionaries derived from industry specific (public or private) ontologies. • Unsupervised – legacy rule based models. • Supervised – predictions from weaker models. • Crowdsourcing – using human experts. • Train powerful seq2seq model using labeled dataset. • Refine using human-in-the-loop active learning or other techniques.
  • 14. Data Programming - Snorkel • Start with noisy labels L from various sources • Train generative model capable of generating probabilities P for each of the output classes based on feature vector of noisy labels. • Train final noise-aware discriminative model with output of generative model P and original data X to predict class label Q for data. • The Snorkel project (https://hazyresearch.github.io/snorkel/) pioneered this approach and provides tooling for all these steps. Image Credit: Snorkel Project
  • 15. SoDA v.2 Architecture • Theoretical Foundations • Aho-Corasick algorithm • SolrTextTagger • SoDA Architecture • Scaling SoDA
  • 16. Aho-Corasick Algorithm • Builds on a data structure called a “trie” (prefix tree) of the dictionary entries, extended with failure links. • State machine over characters. • Dictionary based NERs implement a similar state machine over words in phrases. Image Credit: ResearchGate
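The word-level state machine described on this slide can be sketched as follows. This is a compact, illustrative Aho-Corasick over tokens rather than characters; it is not how SoDA or SolrTextTagger is implemented, and the class name `WordAhoCorasick` and the sample phrases are assumptions made for the example.

```python
from collections import deque


class WordAhoCorasick:
    """Aho-Corasick automaton whose transitions are words, not characters."""

    def __init__(self, phrases):
        # Parallel arrays indexed by state id; state 0 is the trie root.
        self.goto = [{}]   # word -> next state
        self.fail = [0]    # failure link
        self.out = [[]]    # (label, phrase_length_in_words) ending at this state
        for label, phrase in phrases:
            self._add(phrase.lower().split(), label)
        self._build_failure_links()

    def _add(self, words, label):
        state = 0
        for word in words:
            if word not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][word] = len(self.goto) - 1
            state = self.goto[state][word]
        self.out[state].append((label, len(words)))

    def _build_failure_links(self):
        # Breadth-first so that shallower states are finished first.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for word, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and word not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(word, 0)
                # Inherit matches that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, tokens):
        """Return (label, start_token, end_token_exclusive) for every match."""
        state, matches = 0, []
        for i, token in enumerate(tokens):
            word = token.lower()
            while state and word not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(word, 0)
            for label, length in self.out[state]:
                matches.append((label, i - length + 1, i + 1))
        return matches


matcher = WordAhoCorasick([("A", "new york city"), ("B", "york city hall")])
matches = matcher.find("New York City Hall".split())
print(matches)
```

The text is scanned once, token by token, regardless of how many dictionary entries there are; overlapping entries such as the two above are both reported thanks to the failure links.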
  • 17. SolrTextTagger • Lucene’s TokenStreams are finite state automata (FSA). • SolrTextTagger (https://github.com/OpenSextant/SolrTextTagger) dynamically creates FSAs from dictionary entries into a Finite State Transducer (FST) data structure. • Provides tag service to annotate incoming streaming text against FST. • Input is text, output is matched dictionary entries and offsets into text. • SolrTextTagger is OSS created by Lucene/Solr committer David Smiley. Image Credit: Slides for Automata Invasion talk by Michael McCandless and Robert Muir
  • 18. Architecture • Co-located with standalone Solr server. • Scala based thin wrapper over SolrTextTagger. • Provides the following services. • unified JSON over HTTP request/response • multiple matching styles • multiple lexicons • hides details of managing SolrTextTagger. • Streaming (text) and non-streaming (phrase) matching services. • Programmatic APIs for Scala and Python.
  • 19. Scaling • Install and configure Solr, SolrTextTagger and SoDA and create AMI • Use CloudFormation (or Terraform) templates to instantiate cluster of Solr+SoDA instances behind Elastic Load Balancer. • Autoscaling cluster • Monitored by CloudWatch • New dictionaries loaded by instantiating EC2 from AMI via Lambda and saved back into AMI for next cluster build. [Diagram labels: client, loader.]
  • 20. Consuming Annotations at scale • Synchronous • Asynchronous [Diagram labels: Databricks Notebook, Documents on S3, SoDA cluster, Parquet Annotations on S3; Documents on S3, SoDA cluster, Parquet Annotations on S3, Kafka/Kinesis Streams, Producer, Consumer.]
  • 21. SoDA Services • Bulk Loader (backend) • Client facing (front end) • Index (status check) • Add New Record into Lexicon • Delete Lexicon or Entry • Annotate Text against Lexicon • List Available Lexicons • Find coverage of incoming text against Lexicons • Lookup by ID • Reverse Lookup by Phrase
  • 22. SoDA Bulk Loader • Multithreaded loader for bulk loading dictionaries into SoDA. • Requires tab-separated file in following format: • id {TAB} primary-name {PIPE} alt-name-1 {PIPE} ... {PIPE} alt-name-n • One line per dictionary entry • Script to run (on SoDA/Solr box). • ./bulk_load.sh lexicon /path/to/input num_workers
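The input format above is simple enough to validate before handing a file to the loader. The sketch below parses one line of that format; the function name `parse_lexicon_line`, the returned dictionary keys, and the sample ID `E001` are illustrative assumptions and are not part of SoDA.

```python
def parse_lexicon_line(line):
    """Parse 'id<TAB>primary-name|alt-name-1|...|alt-name-n' into a dict."""
    entry_id, names = line.rstrip("\n").split("\t", 1)
    # The first pipe-separated name is the primary name, the rest are alternates.
    primary, *alternates = [name.strip() for name in names.split("|")]
    return {"id": entry_id, "primary": primary, "alternates": alternates}


# Made-up sample entry for illustration.
entry = parse_lexicon_line("E001\tmyocardial infarction|heart attack|MI\n")
print(entry)
```

Running a check like this over every line (one dictionary entry per line) catches missing tabs or empty names before a long multithreaded load starts.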
  • 23. SoDA Health Check – index.json • Returns a status message. Meant to be used for testing if the SoDA application is up. • Python client code • Scala client code • Output
  • 24. Annotate Text against Lexicon – annot.json • Annotates text against a specific lexicon and match type. • Match types can be one of the following: • exact – matches text spans with dictionary entries. • lower – same as exact, but matching is case-insensitive. • stop – same as lower, but stop words removed from both text and dictionary entries. • stem1 – same as stop, but stemmed with Solr minimal English stemmer. • stem2 – same as stop, but stemmed with Solr Kstem stemmer. • stem3 – same as stop, but stemmed with Solr Porter stemmer. • Input (HTTP POST)
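The match types form a ladder of progressively looser normalizations applied to both the text and the dictionary entries. The sketch below illustrates the first three rungs in plain Python. It is an approximation for explanation only: SoDA performs this inside Solr analysis chains, the stop word list here is a made-up stand-in, and the stemmed variants (stem1 to stem3) are omitted because they depend on Solr's stemmers.

```python
# Hypothetical stop word list for this sketch; Solr's actual list differs.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "for"}


def normalize(phrase, match_type):
    """Normalize a phrase the way each match type roughly would."""
    if match_type == "exact":
        return phrase
    words = phrase.lower().split()
    if match_type == "lower":
        # Case-insensitive; splitting and rejoining also collapses whitespace.
        return " ".join(words)
    if match_type == "stop":
        # Lowercased, with stop words dropped.
        return " ".join(w for w in words if w not in STOP_WORDS)
    raise ValueError(f"unsupported match type in this sketch: {match_type}")


for match_type in ("exact", "lower", "stop"):
    print(match_type, "->", normalize("Cancer of the Lung", match_type))
```

Two strings match under a given match type when their normalized forms are equal, so looser types trade precision for recall.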
  • 25. Annotate Text against Lexicon (2) • Python client code • Scala client code • Output
  • 26. List Available Lexicons – dicts.json • Returns a list of lexicons available to annotate against. • Python client • Scala client • Output
  • 27. Check Coverage – coverage.json • This can be used to find which lexicons are appropriate for annotating your text. The service allows you to send a piece of text to all hosted lexicons and returns with the number of matches found in each. • Input (HTTP POST) • Python client • Scala client
  • 28. Check Coverage (2) • Output
  • 29. Lookup by ID – lookup.json • Allows looking up a dictionary entry by lexicon and ID. • Input (HTTP POST) • Python client • Scala client
  • 30. Lookup by ID (2) • Output
  • 31. Reverse Lookup by Phrase • Matches phrases against specific lexicon and match type. • Match types can be one of the following: • All match types supported by Annotation service (annot.json) • lsort – case-insensitive matching against phrase with words sorted alphabetically. • s3sort – case-insensitive matching against phrase stemmed using Porter Stemmer (stem3) and its words sorted alphabetically. • Input
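The idea behind lsort is that sorting the words makes matching independent of word order. A minimal sketch, with the hypothetical helper `lsort_key` and a made-up two-entry lexicon; s3sort would additionally stem each word before sorting, which is left out here because it needs a Porter stemmer.

```python
def lsort_key(phrase):
    """Lowercase the phrase and sort its words alphabetically."""
    return " ".join(sorted(phrase.lower().split()))


# Made-up lexicon entries (id -> name), indexed by their sorted-word key.
lexicon = {"E001": "Lung Cancer", "E002": "Heart Attack"}
index = {lsort_key(name): entry_id for entry_id, name in lexicon.items()}

# A query with the words in a different order and case still resolves.
found = index.get(lsort_key("cancer lung"))
print(found)
```

This suits the non-streaming (phrase) service, where the whole input is one candidate phrase rather than running text.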
  • 32. Reverse Lookup by Phrase (2) • Python client • Scala client • Output
  • 33. Future Work • Open items are listed on the SoDA issues page (https://github.com/elsevierlabs-os/soda/issues), which I update continuously as I find them. • Please feel free to post issues and ideas for improvement.
  • 34. Thank you Contact Information Email: sujit.pal@elsevier.com Twitter: @palsujit SoDA: https://github.com/elsevierlabs-os/soda