12 May 2020
Kfir Bar, Chief Scientist
Philip Blair, Senior Research Engineer
Carmel Eliav, Research Engineer
Understanding Names with Neural Networks
BASIS TECHNOLOGY
ABOUT BASIS TECHNOLOGY
We engineer a safer & more productive world by building proven AI solutions for analyzing
text, connecting data silos, & discovering digital evidence.
EVERY DAY WE ENABLE 140M WATCHLIST CHECKS
EVERY MONTH WE SUPPLY TOOLS TO 40,000 DIGITAL INVESTIGATORS
WE HAVE 25+ YEARS EXPERIENCE
Rosette Capabilities
OUR SPEAKERS
Kfir Bar, Philip Blair, Carmel Eliav
Names are extremely complex!
Names are a Challenge
● 4.4M unique names in the world (according to Alexa)
● 150,000 unique names in the US (according to some study on the US census)
● Names have a ton of variety
Names are vitally important data points in financial compliance, anti-fraud, government intelligence, law enforcement, and identity verification.
Task: Name Matching
Names: Broad Range of Match “Phenomena”
Searching Names in Watch Lists
Rosette Name Indexer
Searching Names in Watch Lists
● Two-pass algorithm to find the best match:
○ First pass: quickly generates a tractable set of candidate names for the second pass to consider
○ Second pass: compares each name returned by the first pass against the query and returns a score between 0 and 1 inclusive
● Supports matching for:
○ Names (Person, Organization, Location, etc.)
○ Dates
○ Addresses
18 languages
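The two-pass idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not how the Rosette Name Indexer is implemented: the character-bigram index used for the first pass, the `min_shared` threshold, and the use of `difflib.SequenceMatcher` as the second-pass scorer are all stand-ins chosen for brevity, and every function name is hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def bigrams(name):
    """Return the set of character bigrams of a lowercased, space-padded name."""
    padded = f" {name.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_index(watch_list):
    """Map each bigram to the set of watch-list names that contain it."""
    index = defaultdict(set)
    for name in watch_list:
        for gram in bigrams(name):
            index[gram].add(name)
    return index


def first_pass(query, index, min_shared=3):
    """Cheap recall step: keep names sharing at least min_shared bigrams."""
    counts = defaultdict(int)
    for gram in bigrams(query):
        for name in index.get(gram, ()):
            counts[name] += 1
    return [name for name, shared in counts.items() if shared >= min_shared]


def second_pass(query, candidates):
    """Precise step: score every candidate in [0, 1], best match first."""
    scored = [
        (SequenceMatcher(None, query.lower(), name.lower()).ratio(), name)
        for name in candidates
    ]
    return sorted(scored, reverse=True)


watch_list = ["John Titor", "Jon Titer", "Tupac Shakur", "Joan Tudor"]
index = build_index(watch_list)
candidates = first_pass("Jon Titor", index)
results = second_pass("Jon Titor", candidates)
for score, name in results:
    print(f"{score:.3f}  {name}")
```

The point of the split is that the expensive comparison only runs on the handful of names that survive the cheap filter, so the second pass can afford a slower, more accurate model.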
Name Matching Algorithms
● Edit Distance (Levenshtein distance)
● Rule Based Matching
● Phonetic Matching
○ Statistical Model - HMM
○ DNN Model - Sequence to Sequence encoder-decoder LSTM model
● Semantic Matching (Text Embeddings)
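As a reference point for the first item in the list, here is a minimal sketch of Levenshtein edit distance using the standard dynamic-programming recurrence. The `similarity` helper, which normalizes the distance by the longer string's length to get a score in [0, 1], is an assumption added for illustration; the slides do not say how edit distance is turned into a match score.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("John", "Jon"))
print(similarity("John Titor", "Jon Titer"))
```

Edit distance works on the surface string only, which is why it cannot compare names written in different scripts; that gap is what the phonetic models in the list address.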
HMM-Based Name Matching
Step One: Modeling Sequences of Characters
J o h n
e a
r t
Step Two: Modeling Transliterations
ジ
オ
ョ
ホ
ン
J o h n
Given a sequence of characters in the source language...
...what is the probability of the corresponding sequence of characters in the target language?
This probability
is our score!
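To make "this probability is our score" concrete, here is a deliberately simplified sketch. It is not a full HMM (there are no state-transition probabilities) and it is not the production model: each source character independently emits either nothing or exactly one target character, and the score sums over every alignment that produces the target. The emission table `emit` contains made-up numbers purely for illustration.

```python
def transliteration_probability(source, target, emit):
    """Probability of target given source, summed over all monotonic
    alignments in which each source character emits 0 or 1 target characters."""
    n, m = len(source), len(target)
    # forward[i][j]: probability that the first i source characters
    # produced exactly the first j target characters
    forward = [[0.0] * (m + 1) for _ in range(n + 1)]
    forward[0][0] = 1.0
    for i in range(1, n + 1):
        probs = emit.get(source[i - 1], {})
        for j in range(m + 1):
            # Case 1: this source character emits nothing
            total = forward[i - 1][j] * probs.get("", 0.0)
            # Case 2: this source character emits target[j - 1]
            if j > 0:
                total += forward[i - 1][j - 1] * probs.get(target[j - 1], 0.0)
            forward[i][j] = total
    return forward[n][m]


# Hypothetical emission probabilities P(output | source character)
emit = {
    "J": {"ジ": 0.9, "": 0.1},
    "o": {"オ": 0.6, "ョ": 0.3, "": 0.1},
    "h": {"ホ": 0.3, "": 0.7},
    "n": {"ン": 0.9, "": 0.1},
}

p_correct = transliteration_probability("John", "ジョン", emit)
p_wrong = transliteration_probability("John", "ジオン", emit)
print(p_correct, p_wrong)
```

With these toy numbers the incorrect "ジオン" scores higher than "ジョン", because the table only knows how "o" behaves on its own, not how it behaves after "J".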
Issues with HMM-Based Name Matching
ジ
オ
ョ
J o
English Character(s)    Japanese Equivalent
o                       オ
yo                      ヨ
ji                      ジ
jo                      ジョ
...but this represents just "o", not "o following a 'j'"
Problems with HMMs:
● Multi-character equivalents
● Morphological effects on pronunciation
○ Arabic
○ Similar: "photograph" vs "photography"
Common Thread: Missing Context!
Let's try something different...
How Would You Transliterate a Name?
John Titor
ジョ ン ・ タイ ター
What does an HMM Actually Do?
ジ
オ
ョ
ホ
ン
J o h n
For every input character...
...associate 0-1 output characters.
Sequence Tagging
What technology can we use to accomplish this?
The Antidote: Sequence-to-Sequence (seq2seq)
Source: https://d2l.ai/chapter_recurrent-modern/seq2seq.html
Neural Network of Choice: Long Short-Term Memory (LSTM) Cells
Source: http://dprogrammer.org/rnn-lstm-gru
Classification paradigms in NLP
T u p a c
Classifier
Spanish/American...
T u p a c
Tagger
T u p a c
Seq2seq
Long Short-Term Memory (LSTM) Cells: Encoding
Piece of input sequence
(e.g. a character in a name)
Understanding of the
sequence so far
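The encoding step can be sketched with a single LSTM cell written in plain Python. This is a minimal illustration under assumptions, not the model from the talk: weights are random rather than trained, bias terms are omitted, characters are fed as one-hot vectors, and the class and function names are hypothetical. It shows how each character updates the hidden state (the "understanding of the sequence so far") and the cell memory.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(rows, cols, rng):
    """Random weight matrix; a trained model would learn these values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def affine(weights, vector):
    """Matrix-vector product (no bias, for brevity)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class LSTMCell:
    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        width = input_size + hidden_size
        # One weight matrix per gate, applied to [input, previous hidden state]
        self.w_forget = make_layer(hidden_size, width, rng)
        self.w_input = make_layer(hidden_size, width, rng)
        self.w_output = make_layer(hidden_size, width, rng)
        self.w_candidate = make_layer(hidden_size, width, rng)

    def step(self, x, hidden, memory):
        joined = x + hidden
        forget = [sigmoid(v) for v in affine(self.w_forget, joined)]
        write = [sigmoid(v) for v in affine(self.w_input, joined)]
        expose = [sigmoid(v) for v in affine(self.w_output, joined)]
        candidate = [math.tanh(v) for v in affine(self.w_candidate, joined)]
        # Keep part of the old memory, add part of the new candidate
        new_memory = [f * c + w * g
                      for f, c, w, g in zip(forget, memory, write, candidate)]
        new_hidden = [o * math.tanh(c) for o, c in zip(expose, new_memory)]
        return new_hidden, new_memory


def encode(name, cell, alphabet, hidden_size):
    """Feed a name one character at a time and return the final hidden state."""
    hidden = [0.0] * hidden_size
    memory = [0.0] * hidden_size
    for char in name:
        one_hot = [1.0 if char == symbol else 0.0 for symbol in alphabet]
        hidden, memory = cell.step(one_hot, hidden, memory)
    return hidden


alphabet = "abcdefghijklmnopqrstuvwxyz"
cell = LSTMCell(len(alphabet), hidden_size=4)
summary = encode("tupac", cell, alphabet, hidden_size=4)
print(summary)
```

Because every step depends on the previous hidden state, the characters must be processed one after another, which is relevant to the speed discussion later in the deck.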
Long Short-Term Memory (LSTM) Cells: Decoding
Previously outputted character
(e.g. a character in a name)
Understanding of the
remaining sequence
Output Character
Putting it all together
Step One: Learning to Transliterate with seq2seq
"Tupac" English Name Reader
Japanese Name
Generator
"トゥパック"
Step One: Learning to Transliterate with seq2seq
T u p a c
ト ゥ ー パ ッ ク
First we "read" the English name...
...then we generate the transliteration
Step Two: Running the Transliterator in Reverse to Score
"Tupac" English Name Reader
Japanese Name
Generator
"トゥパック"
0.790
Step Two: Running the Transliterator in Reverse to Score
T u p a c
ト ゥ ー パ ッ ク
First we "read" the English name...
...then we pass in the Japanese name...
...to produce a score: 0.790
How Can We Produce a Score?
...
ド ト プ カ
ヨ ガ ゥ デ
Our model's prediction
All possible predictions
ト ゥ ...
Perplexity: "How surprising is this name?"
Higher 1/Perplexity = Better Match
75% Error Reduction on
Customer Data
10% Accuracy Boost for
English/Katakana (Japanese)
Name Pairs!
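The 1/perplexity score can be written out as a short calculation. The slides do not give the exact formula, so this sketch assumes the standard definition: perplexity is the exponential of the average negative log-probability that the decoder assigns to each character of the candidate name. Under that definition, 1/perplexity is the geometric mean of the per-character probabilities. The probability lists below are invented for illustration, and every probability is assumed to be greater than zero.

```python
import math


def perplexity(char_probabilities):
    """exp of the average negative log-probability per character."""
    total_log = sum(math.log(p) for p in char_probabilities)
    return math.exp(-total_log / len(char_probabilities))


def match_score(char_probabilities):
    """1 / perplexity, which lies in (0, 1]; higher means a better match."""
    return 1.0 / perplexity(char_probabilities)


# Hypothetical per-character probabilities from a decoder that has "read"
# an English name and is then shown each character of a Japanese candidate.
likely_candidate = [0.9, 0.8, 0.7, 0.9, 0.6]
unlikely_candidate = [0.9, 0.1, 0.05, 0.2, 0.1]

print(match_score(likely_candidate))
print(match_score(unlikely_candidate))
```

Averaging in log space means the score does not automatically shrink for longer names, unlike a raw product of probabilities.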
Source: http://www.comparatif-crm.com/un-crm-gratuit/there-is-no-free-lunch
One Problem: Speed
Processing Time on Name Pairs (seconds)
# Pairs    HMM        LSTM (Neural)    Slowdown
1,000      0.766      7.17             9.36x
10,000     3.591      64.3             17.91x
100,000    26.267     653.3            24.87x
500,000    135.17     3,569.6          26.41x
All hope is not lost...
Faster seq2seq with a Convolutional Neural Network (CNN)
T u p a c
ト ゥ ー パ ッ ク
Attention
Context vector
<p> <p>
T u p a c
Embeddings
Convolutions
Gated Linear Units
Convolutional Neural Net (CNN)
● A CNN is capable of learning local features, which can be replicated across the input
● CNNs are popular in computer vision. A CNN is typically structured in multiple layers and is very good at detecting image features like lines and curves and then combining them into objects and faces
Source: https://www.aspexit.com/en/neural-network-lets-try-to-demystify-all-this-a-little-bit-3-application-to-images/
CNN in Natural Language Processing
https://arxiv.org/abs/1408.5882
Why CNN?
● Faster - CNNs are able to process all characters in the input in parallel
Why didn’t we use it before?
● Context - While processing each character individually, we lose the context information
The compromise
● Positional encoding - A standard embedding layer where the input
is not the character itself but the position of the character within the
token
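The pieces named on these slides (embeddings, positional encoding, convolutions, gated linear units) can be combined in a small plain-Python sketch of one encoder layer. It is an illustration under assumptions rather than the model described in the talk: weights are random instead of trained, the sizes `EMBED`, `KERNEL`, and `MAX_LEN` are arbitrary, only lowercase ASCII tokens up to `MAX_LEN` characters are supported, and the attention mechanism and decoder are left out.

```python
import math
import random

rng = random.Random(0)
EMBED = 4      # embedding width (hypothetical)
KERNEL = 3     # convolution window in characters (hypothetical)
MAX_LEN = 16   # longest supported token (hypothetical)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def random_vector(size):
    return [rng.uniform(-0.5, 0.5) for _ in range(size)]


char_table = {char: random_vector(EMBED) for char in ALPHABET}
# Positional encoding: an embedding looked up by position, not by character
position_table = [random_vector(EMBED) for _ in range(MAX_LEN)]
# 2 * EMBED filters: the first half produces content, the second half gates
filters = [random_vector(KERNEL * EMBED) for _ in range(2 * EMBED)]


def embed(token):
    """Character embedding plus the embedding of the character's position."""
    return [
        [c + p for c, p in zip(char_table[char], position_table[i])]
        for i, char in enumerate(token)
    ]


def conv_glu(rows):
    """One convolution layer followed by a gated linear unit (GLU)."""
    pad = [[0.0] * EMBED] * (KERNEL // 2)
    padded = pad + rows + pad
    outputs = []
    # Each position depends only on its own window, so on suitable hardware
    # all positions can be computed at once; the loop here is just for clarity.
    for i in range(len(rows)):
        window = [value for row in padded[i:i + KERNEL] for value in row]
        raw = [sum(w * v for w, v in zip(f, window)) for f in filters]
        content, gate = raw[:EMBED], raw[EMBED:]
        # GLU: content multiplied elementwise by sigmoid(gate)
        outputs.append([a / (1.0 + math.exp(-b)) for a, b in zip(content, gate)])
    return outputs


encoded = conv_glu(embed("tupac"))
print(len(encoded), len(encoded[0]))
```

Without `position_table`, the two "a" characters in a token such as "aa" would get identical embeddings; adding the position embedding is what lets the convolution tell them apart.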
Source: http://www.comparatif-crm.com/un-crm-gratuit/there-is-no-free-lunch
Results
Experiment       Accuracy improvement    Speed
LSTM             14%                     25x slower
CNN version 1    8.8%                    1.6x slower
CNN version 2    9%                      1.6x slower
● The speed was measured in seconds for 1,000 name pairs
● The improvement is relative to the HMM model
What Does This Tell Us?
● Name matching is an open problem
○ Existing technology can always be improved
● Accuracy comes at a cost
○ Machine learning holds the answers, but not always in the way we expect
● Looking Forward:
○ Embracing data-driven approaches to NLP
○ Fully cross-lingual name matching?
It’s Q&A time!
SOURCES
www.rosette.com
Thank you
Kfir Bar: kfir@basistech.com
Philip Blair: pblair@basistech.com
Carmel Eliav: carmel@basistech.com

Understanding Names with Neural Networks - May 2020