Latest Techniques of Entity Matching in NLP
Avinash Pathak
Expert Data Scientist
TomTom
Agenda
What is entity matching?
Why is entity matching important?
History of entity matching
Entity matching models
How to measure the success of entity matching
Entity Embed – Open-source tool for Entity Matching
What is entity matching?
Entity Matching refers to the problem of determining whether two different data representations
refer to the same real-world entity.
Example
Id | Title | Album | Composer | Songwriter
1 | Me and Mrs Jones | Call me irresponsible | Michael Bublé | Michael Bublé
2 | Me and Mrs Jones[Remix] | Call me irresponsible | Michael Bublé | Michael Bublé
Use Cases
• Song Matching
• Address Matching
• Social Profile Matching (Facebook, Twitter)
• Clothes Matching (for that matter any item in retail)
Things get a little complicated
• Colloquial use of words
• Additional information
• Un-normalized data
• Unstructured data
• Missing data
• Dirty data
• Unavailability of supervised (labeled) data
• Scale
Id Title Album Composer Songwriter
1 Me and Mrs Jones Call me irresponsible Michael Bublé
2 Me and Mrs Jones[Remix] Michael Bublé
3 Blowin’ in the Wind The Freewheelin’ Bob Dylam Bob Dylan
4 Blowing in the Wind Bob Dylan
Problem of Scale, hence blocking
• An exhaustive pairwise comparison grows quadratically with the number of records, which is unaffordable for datasets of even moderate size. As a result, virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be considered for matching while, ideally, not ruling out true matches.
• A successful application of blocking to an entity matching task should fulfil the following four desiderata:
• First, blocking should ideally not leave out any true matches (i.e., high recall), since only the candidate record pairs generated by blocking will be examined in the downstream matching step.
• Second, the number of candidate pairs should be small, so that the cost of applying a usually computationally expensive matching algorithm is controlled.
• Third, human effort should not be overspent during the blocking process.
• Finally, the blocking algorithm should be scalable enough to handle millions of records.
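The first two desiderata (high recall, few candidate pairs) can be illustrated with a minimal sketch. It assumes a hypothetical blocking key, the first three characters of the lower-cased title; the toy records, the key and the helper names are illustrative and not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; 'cluster' marks the true entity (illustrative data only).
records = [
    {"id": 1, "title": "Me and Mrs Jones", "cluster": 0},
    {"id": 2, "title": "Me and Mrs Jones[Remix]", "cluster": 0},
    {"id": 3, "title": "Blowin' in the Wind", "cluster": 1},
    {"id": 4, "title": "Blowing in the Wind", "cluster": 1},
    {"id": 5, "title": "Berimbau", "cluster": 2},
]

def blocking_key(record):
    # Hypothetical cheap key: first three characters of the lower-cased title.
    return record["title"].lower()[:3]

def candidate_pairs(records):
    # Group records by key, then pair up records only within each block.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def true_pairs(records):
    # All pairs of records that share a cluster label.
    by_id = {r["id"]: r for r in records}
    return {
        (a, b)
        for a, b in combinations(sorted(by_id), 2)
        if by_id[a]["cluster"] == by_id[b]["cluster"]
    }

candidates = candidate_pairs(records)
truth = true_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2

# Recall of blocking: share of true matches that survive as candidates.
recall = len(candidates & truth) / len(truth)
print(f"{len(candidates)} candidate pairs instead of {all_pairs}, recall {recall:.2f}")
```

A key this crude would miss matches whose titles differ in the first characters, which is exactly the recall risk the first desideratum describes.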
Without blocking: nC2 rigorous, computationally expensive comparisons
With blocking: blocking (non-expensive comparisons), followed by kC2*M rigorous, computationally expensive comparisons
10C2 = 10 * 9 / 2 = 45 expensive comparisons without blocking
kC2*M comparisons with blocking, where k is the cluster size and M is the number of clusters
For k=2, 5 clusters of size 2: 2C2 * 5 = 1 * 5 = 5 comparisons
For k=3, 3 clusters of size 3 and 1 cluster of size 1: 3C2 * 3 + 1C2 = 3 * 3 + 0 = 9 comparisons
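The counts for the 10-record example can be checked with a few lines of Python. The helper names are illustrative; the only library call is the standard `math.comb`. Note that a block of size 3 holds 3 pairs and a block of size 1 holds none.

```python
from math import comb

def comparisons_without_blocking(n):
    # Every record is compared with every other record: n choose 2.
    return comb(n, 2)

def comparisons_with_blocking(block_sizes):
    # Expensive comparisons happen only inside each block.
    return sum(comb(size, 2) for size in block_sizes)

print(comparisons_without_blocking(10))         # 45
print(comparisons_with_blocking([2] * 5))       # 5
print(comparisons_with_blocking([3, 3, 3, 1]))  # 9
```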
History of EM
• Pattern Matching/Fuzzy Matching
• Deep Learning Based Blocking
• Self-supervised blocking
Pattern Matching/Fuzzy Matching
Example
Very specific solution
Needs incremental updates whenever new examples come in
Cross-attribute mismatches are present and would take enormous effort to cover
Id | Title | Album | Composer | Songwriter
1 | Me and Mrs Jones | Call me irresponsible | Michael Bublé | Michael Bublé
2 | Me and Mrs Jones[Remix] | Call me irresponsible | Michael Bublé | Michael Bublé
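A minimal sketch of what such a pattern/fuzzy rule might look like, using only the Python standard library. The normalization rule and the 0.9 threshold are hypothetical choices for illustration, not the speaker's implementation.

```python
import re
from difflib import SequenceMatcher

def normalize_title(title):
    # Hand-written rule: drop bracketed suffixes such as "[Remix]",
    # then lower-case and collapse whitespace.
    title = re.sub(r"\[[^\]]*\]", " ", title)
    return " ".join(title.lower().split())

def titles_match(a, b, threshold=0.9):
    # Fuzzy comparison of the normalized titles (threshold is a hypothetical choice).
    ratio = SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
    return ratio >= threshold

# Covered by the rule: the bracketed suffix is stripped before comparing.
print(titles_match("Me and Mrs Jones", "Me and Mrs Jones[Remix]"))
# Not covered: parentheses need yet another rule, hence the incremental updates.
print(titles_match("Me and Mrs Jones", "Me and Mrs Jones (Remix)"))
```

The second call shows why the approach is so specific: each new variant of the data tends to need its own rule.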
EM - Deep Learning Based Blocking
Autoblock
• Token embedding: A word-embedding model transforms each token to a token embedding
• Attribute embedding: For each attribute value of a tuple, an attention-based neural network encoder converts the input sequence of token embeddings to an attribute embedding
• Tuple signature: Multiple signature functions combine the attribute embeddings of each tuple and produce multiple tuple signatures (one per signature function)
• Model training: Equipped with the positive label set, the model is trained with an objective that maximizes the differences of the cosine similarities between the tuple signatures of matched pairs and between unmatched pairs
• Fast NN search: The learned model is applied to compute the signatures for all tuples, and an LSH family for cosine similarity is used to retrieve the nearest neighbors for each tuple to generate candidate pairs for blocking
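The last step can be sketched with random-hyperplane hashing, a standard LSH family for cosine similarity. This is a simplified illustration with a single hash table and hand-made vectors standing in for learned tuple signatures; it is not the Autoblock implementation, and all names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

def make_hyperplanes(num_planes, dim, seed=0):
    # Random Gaussian normal vectors; each defines a hyperplane through the origin.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def lsh_signature(vector, hyperplanes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0.0
        for plane in hyperplanes
    )

def lsh_candidate_pairs(vectors_by_id, hyperplanes):
    # Records whose vectors hash to the same bucket become candidate pairs.
    buckets = defaultdict(list)
    for record_id, vector in vectors_by_id.items():
        buckets[lsh_signature(vector, hyperplanes)].append(record_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Stand-ins for learned tuple signatures (illustrative values only).
vectors_by_id = {
    1: [1.0, 2.0, 3.0],
    2: [2.0, 4.0, 6.0],     # same direction as record 1 (cosine similarity 1)
    3: [-1.0, -2.0, -3.0],  # opposite direction (cosine similarity -1)
}
hyperplanes = make_hyperplanes(num_planes=8, dim=3)
pairs = lsh_candidate_pairs(vectors_by_id, hyperplanes)
print(pairs)
```

Vectors with a small angle between them agree on most bits and so tend to share a bucket; in practice several hash tables are combined to trade recall against the number of candidates.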
Architecture of Autoblock
Self-Supervised Blocking for entity matching - DeepBlock
• Encoder-Decoder for self-supervision
• Take a tuple 𝑡, feed it into a neural network (NN) to output a compact embedding vector u𝑡 ,
such that if we feed u𝑡 into a second NN, we can recover the original tuple 𝑡 (or a good
approximation of 𝑡). If this happens, u𝑡 can be viewed as a good compact summary of tuple 𝑡, and
can be used as the tuple embedding of 𝑡. The above two NNs are called encoder and decoder,
respectively.
• We can create u𝑡 in various ways. For example, one can feed in the original tuple 𝑡 with a few attributes missing and recover the original tuple with all its attributes (this helps the model match two entities even when one of them has missing attributes).
Figure: entity → encoder → u𝑡 → decoder → entity
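The "missing attributes" idea can be sketched as a data-preparation step: each record is paired with a corrupted copy, so that an encoder-decoder can be trained to reconstruct the original from the incomplete input. The sketch below only builds the (input, target) pairs; the neural network itself is out of scope, and the function names and the choice of an empty string as the "missing" marker are assumptions made for illustration.

```python
import random

def mask_attributes(record, num_missing, rng, protected=("id",)):
    # Return a copy of the record with `num_missing` randomly chosen
    # attributes blanked out; protected keys are never masked.
    maskable = [key for key in record if key not in protected]
    dropped = set(rng.sample(maskable, k=min(num_missing, len(maskable))))
    return {key: ("" if key in dropped else value) for key, value in record.items()}

def make_training_pairs(records, num_missing=1, seed=0):
    # Self-supervision: the corrupted record is the input, the original is the target.
    rng = random.Random(seed)
    return [(mask_attributes(record, num_missing, rng), record) for record in records]

records = [
    {"id": 0, "title": "Berimbau", "artist": "Astrud Gilberto", "album": "Look to the Rainbow (1966)"},
    {"id": 1, "title": "Marooned", "artist": "Pink Floyd", "album": "The Division Bell"},
]
training_pairs = make_training_pairs(records, num_missing=1)
for corrupted, original in training_pairs:
    print(corrupted, "->", original)
```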
After Blocking – The computationally expensive comparisons
• Edit distance between strings
• Embedding distance between entities
• Phonetic similarity
• Length
• Jaro Winkler
• Sequence Matcher
• Jaccard Similarity
• Entity specific Features
Reference: Location Matching Kaggle Competition https://www.kaggle.com/code/icfstat/lightgbm-feature-engineering-training-0-888-pv-lb
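A few of the listed features can be computed with the standard library alone, as in the sketch below. The feature names and the word-level tokenization for Jaccard similarity are illustrative assumptions; Jaro-Winkler and phonetic similarity need third-party packages and are left out.

```python
from difflib import SequenceMatcher

def edit_distance(a, b):
    # Levenshtein distance via dynamic programming over two rows.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]

def jaccard_similarity(a, b):
    # Overlap of lower-cased word sets: |intersection| / |union|.
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def pair_features(a, b):
    # Feature vector for one candidate pair, e.g. as input to a classifier.
    return {
        "edit_distance": edit_distance(a, b),
        "jaccard": jaccard_similarity(a, b),
        "sequence_ratio": SequenceMatcher(None, a, b).ratio(),
        "length_diff": abs(len(a) - len(b)),
    }

features = pair_features("Blowin' in the Wind", "Blowing in the Wind")
print(features)
```

Features like these are typically computed only for the candidate pairs that survive blocking and then fed to a downstream classifier.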
How to measure success of Entity Matching?
For Model verification in isolation – Precision, Recall
Business Metrics – Business specific
Can we know if we did the perfect entity matching?
• Take one use case: entity matching for deduplicating social media profiles. There is no perfect way of knowing whether all the duplicate/redundant profiles have been identified.
Human in the loop
• Sample the population and check whether entity matching has worked for your cases
• Be mindful to test with samples of various sizes and with various mixtures of populations
Note: You can also check these benchmarks defined by Machamp (https://github.com/megagonlabs/machamp)
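On a labeled sample, precision and recall are usually computed over record pairs. A minimal sketch, assuming the ground truth is a mapping from record id to cluster id and the predictions are a set of id pairs (the helper names are hypothetical):

```python
from itertools import combinations

def true_pair_set(cluster_by_id):
    # All pairs of record ids that share a cluster label.
    return {
        (a, b)
        for a, b in combinations(sorted(cluster_by_id), 2)
        if cluster_by_id[a] == cluster_by_id[b]
    }

def pairwise_precision_recall(found_pairs, true_pairs):
    # Normalize pair order so (a, b) and (b, a) count as the same pair.
    found = {tuple(sorted(pair)) for pair in found_pairs}
    true = {tuple(sorted(pair)) for pair in true_pairs}
    true_positives = len(found & true)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall

cluster_by_id = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
found_pairs = {(2, 1), (3, 5)}  # one correct pair, one wrong pair
precision, recall = pairwise_precision_recall(found_pairs, true_pair_set(cluster_by_id))
print(precision, recall)  # 0.5 0.5
```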
Applications of Entity Matching
Song Matching
Address Matching
Social Profile Matching (Facebook, Twitter)
Clothes Matching (for that matter any item in retail)
Matrimony, Dating sites profile matching
Entity Embed
Installation
pip install entity-embed
Preparing the data
Data needs to be a list of dict objects, each of which must contain 'id' and 'cluster' keys:
[{'id': 0,
  'cluster': 0,
  'title': '001-Berimbou',
  'artist': 'Astrud Gilberto',
  'album': 'Look to the Rainbow (2008)'},
 {'id': 1,
  'cluster': 0,
  'title': 'Berimbau',
  'artist': 'Astrud Gilberto',
  'album': 'Look to the Rainbow (1966)'},
 {'id': 2,
  'cluster': 1,
  'title': '4 - Marooned - Pink Floyd',
  'artist': '',
  'album': 'The Division Bell'}]
Entity Embed
Defining the fields
We need to define how record fields will be numericalized and encoded by Entity Embed’s deep neural network
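# Import path assumed from the Entity Embed README
from entity_embed.data_utils.field_config_parser import DEFAULT_ALPHABET

field_config_dict = {
    # Character-level view of the title
    'title': {
        'field_type': "MULTITOKEN",
        'tokenizer': "entity_embed.default_tokenizer",
        'alphabet': DEFAULT_ALPHABET,
        'max_str_len': None,  # compute
    },
    # Semantic view of the same 'title' field, using pretrained word vectors
    'title_semantic': {
        'key': 'title',
        'field_type': "SEMANTIC_MULTITOKEN",
        'tokenizer': "entity_embed.default_tokenizer",
        'vocab': "fasttext.en.300d",
    },
}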
Entity Embed
Building the model
Under the hood, Entity Embed uses pytorch-lightning, so we need to create a datamodule object:
from entity_embed import DeduplicationDataModule

# record_numericalizer and the train/valid/test record dicts are assumed
# to have been built beforehand (not shown on the slides)
datamodule = DeduplicationDataModule(
    train_record_dict=train_record_dict,
    valid_record_dict=valid_record_dict,
    test_record_dict=test_record_dict,
    cluster_field="cluster",
    record_numericalizer=record_numericalizer,
    batch_size=32,
    eval_batch_size=64,
    random_seed=42,
)
Entity Embed
Training the model
We must choose the K of the Approximate Nearest Neighbors, i.e., the top K neighbors our model will use
to find duplicates in the embedding space.
from entity_embed import EntityEmbed

model = EntityEmbed(
    record_numericalizer,
    ann_k=100,
)
Entity Embed
Finding candidate pairs
When running in production, you only have access to the trained model object and the production record_dict
(without the true clusters filled, of course). You can get the embedding vectors of a production record_dict using
the predict method:
vector_dict = model.predict(
    record_dict=test_record_dict,
    batch_size=64,
)
But what you usually want instead is the ANN pairs. You can get them with the predict_pairs method:
found_pair_set = model.predict_pairs(
    record_dict=test_record_dict,
    batch_size=64,
    ann_k=100,
    sim_threshold=0.3,
)
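Candidate pairs are often turned into groups of duplicates afterwards. The sketch below is not part of Entity Embed; it assumes the found pairs are a set of (id, id) tuples and merges them into connected components with a small union-find. The function name and the example ids are hypothetical.

```python
def cluster_pairs(pairs):
    # Union-find over record ids: ids linked by any chain of pairs end up together.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record_id in list(parent):
        clusters.setdefault(find(record_id), set()).add(record_id)
    return sorted(sorted(members) for members in clusters.values())

# Stand-in for the pairs returned by the model (illustrative ids only).
example_pairs = {(1, 2), (2, 3), (7, 8)}
print(cluster_pairs(example_pairs))  # [[1, 2, 3], [7, 8]]
```

Merging by connected components is transitive, so a single wrong pair can join two unrelated groups; a stricter similarity threshold or a verification step may be needed before merging.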
References
1) Deep Learning for Blocking in Entity Matching
2) Autoblock
3) Entity Embed
4) Location Matching

Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
testingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdftestingsdadadadaaddadadadadadadadaad.pdf
testingsdadadadaaddadadadadadadadaad.pdf
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsbaAdobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
Adobe Scan 06-Mar-2024 (1).pdfwvsbbsbsba
 
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Data Discovery With Power Query in excel
Data Discovery With Power Query in excelData Discovery With Power Query in excel
Data Discovery With Power Query in excel
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
prediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approachprediction of default payment next month using a logistic approach
prediction of default payment next month using a logistic approach
 
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdfRabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
Rabobank_Exploring the Impact of Graph Technology on Financial Services.pdf
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Adobe Scan 06-Mar-2024 (1).pdf shavashwvw
Adobe Scan 06-Mar-2024 (1).pdf shavashwvwAdobe Scan 06-Mar-2024 (1).pdf shavashwvw
Adobe Scan 06-Mar-2024 (1).pdf shavashwvw
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 

[DSC Europe 22] Latest Techniques of Entity Matching in NLP - Avinash Pathak

  • 1. Latest Techniques of Entity Matching in NLP. Avinash Pathak, Expert Data Scientist, TomTom
  • 2. Agenda
      • What is entity matching?
      • Why is entity matching important?
      • History of entity matching
      • Entity matching models
      • How to measure the success of entity matching
      • Entity Embed – open-source tool for entity matching
  • 3. What is entity matching? Entity matching refers to the problem of determining whether two different data representations refer to the same real-world entity.
      Example:
      Id | Title | Album | Composer | Songwriter
      1 | Me and Mrs Jones | Call me irresponsible | Michael Bublé | Michael Bublé
      2 | Me and Mrs Jones[Remix] | Call me irresponsible | Michael Bublé | Michael Bublé
      Use cases:
      • Song matching
      • Address matching
      • Social profile matching (Facebook, Twitter)
      • Clothes matching (or, for that matter, any retail item)
  • 4. Things get a little complicated
      • Colloquial use of words
      • Additional information
      • Un-normalized data
      • Unstructured data
      • Missing data
      • Dirty data
      • Unavailability of labeled (supervised) data
      • Scale
      Example rows (header: Id, Title, Album, Composer, Songwriter; some cells are missing):
      1 Me and Mrs Jones Call me irresponsible Michael Bublé
      2 Me and Mrs Jones[Remix] Michael Bublé
      3 Blowin’ in the Wind The Freewheelin’ Bob Dylam Bob Dylan
      4 Blowing in the Wind Bob Dylan
  • 5. Problem of scale, hence blocking
      • The cost of an exhaustive pairwise comparison grows quadratically with the number of records, which is unaffordable for datasets of even moderate size. As a result, virtually every entity matching task on large datasets requires blocking: a step that reduces the number of record pairs to be considered for matching, ideally without ruling out true matches.
      • A successful application of blocking to an entity matching task should fulfil the following four desiderata:
      • First, blocking should ideally not leave out any true matches (i.e., high recall), since only the candidate record pairs generated by blocking are examined further in the downstream matching step.
      • Second, the number of candidate pairs should be small, so that the cost of applying a usually computationally expensive matching algorithm stays under control.
      • Third, human effort should not be overspent during the blocking process.
      • Finally, the blocking algorithm should be scalable enough to handle millions of records.
  • 6. Without blocking vs. with blocking
      • Without blocking: nC2 rigorous, computationally expensive comparisons.
      • With blocking: a cheap blocking step (non-expensive comparisons), followed by kC2*M rigorous, computationally expensive comparisons, where k is the cluster size and M is the number of clusters.
      • Example with 10 records: without blocking, 10C2 = 10 * 9 / 2 = 45 expensive comparisons.
      • For k = 2 (5 clusters of size 2): kC2*M = 1 * 5 = 5 comparisons.
      • For k = 3 (3 clusters of size 3 and 1 cluster of size 1): 3C2 * 3 + 0 = 3 * 3 = 9 comparisons, since the cluster of size 1 needs none.
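The comparison counts on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming that every pair inside a block is compared once and that no pair is compared across blocks; the function names are illustrative and not taken from the talk. Evaluating the kC2 formula directly for the k = 3 layout (three blocks of three records plus one singleton) gives 9 expensive comparisons.

```python
from math import comb
from typing import Iterable


def comparisons_without_blocking(n_records: int) -> int:
    """Number of expensive comparisons when every pair of records is compared."""
    return comb(n_records, 2)


def comparisons_with_blocking(block_sizes: Iterable[int]) -> int:
    """Number of expensive comparisons when only pairs inside a block are compared."""
    return sum(comb(size, 2) for size in block_sizes)


print(comparisons_without_blocking(10))         # 45
print(comparisons_with_blocking([2] * 5))       # 5
print(comparisons_with_blocking([3, 3, 3, 1]))  # 9
```

The saving grows quickly with dataset size, because the unblocked count is quadratic in the number of records while the blocked count is roughly linear when block sizes stay bounded.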
  • 7. History of EM
      • Pattern matching / fuzzy matching
      • Deep-learning-based blocking
      • Self-supervised blocking
  • 8. Pattern matching / fuzzy matching
      • A very specific solution.
      • Needs incremental updates whenever new examples come in.
      • Mismatches across attributes occur, and covering them would take enormous effort.
      Example:
      Id | Title | Album | Composer | Songwriter
      1 | Me and Mrs Jones | Call me irresponsible | Michael Bublé | Michael Bublé
      2 | Me and Mrs Jones[Remix] | Call me irresponsible | Michael Bublé | Michael Bublé
  • 9. EM - Deep-learning-based blocking: Autoblock
      • Token embedding: a word-embedding model transforms each token into a token embedding.
      • Attribute embedding: for each attribute value of a tuple, an attention-based neural network encoder converts the input sequence of token embeddings into an attribute embedding.
      • Tuple signature: multiple signature functions combine the attribute embeddings of each tuple and produce multiple tuple signatures (one per signature function).
      • Model training: using the positive label set, the model is trained with an objective that maximizes the difference between the cosine similarity of tuple signatures for matched pairs and that for unmatched pairs.
      • Fast NN search: the learned model computes the signatures for all tuples, and an LSH family for cosine similarity retrieves the nearest neighbors of each tuple to generate candidate pairs for blocking.
      Figure: Architecture of Autoblock
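The "Fast NN search" step can be illustrated with random-hyperplane hashing, a standard LSH family for cosine similarity. This is a simplified sketch with a single hash table and hand-written toy embeddings; it is not Autoblock's implementation, and the names `make_hyperplanes`, `signature` and `candidate_pairs` are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector lies on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def candidate_pairs(embeddings, hyperplanes):
    """Records whose signatures collide in the same bucket become candidate pairs."""
    buckets = defaultdict(list)
    for record_id, vector in embeddings.items():
        buckets[signature(vector, hyperplanes)].append(record_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy tuple embeddings (assumed values, for illustration only).
embeddings = {
    "a": [1.0, 0.2, 0.0],
    "b": [2.0, 0.4, 0.0],    # same direction as "a": cosine similarity 1
    "c": [-1.0, -0.2, 0.0],  # opposite direction to "a": cosine similarity -1
}
planes = make_hyperplanes(dim=3, n_bits=8)
print(candidate_pairs(embeddings, planes))  # {('a', 'b')}
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes, so they collide with high probability. In practice several hash tables are usually combined to raise recall.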
  • 10. Self-supervised blocking for entity matching: DeepBlock
      • Encoder-decoder for self-supervision.
      • Take a tuple t and feed it into a neural network (NN) that outputs a compact embedding vector u_t, such that feeding u_t into a second NN recovers the original tuple t (or a good approximation of it). If this works, u_t can be viewed as a good compact summary of tuple t and can be used as the tuple embedding of t. The two NNs are called the encoder and the decoder, respectively.
      • u_t can be created in various ways. For example, one can feed in the original tuple t with a few attributes missing and train the model to recover the original tuple with all its attributes. This helps the model match two entities even when one of them has missing attributes.
      Figure: entity → encoder → u_t → decoder → entity
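The missing-attribute idea can be sketched as a data-preparation step: blank out some attributes of a tuple and keep the complete tuple as the reconstruction target. This sketch covers only how such (input, target) pairs might be generated, not the encoder and decoder networks. The function `mask_attributes` and its parameters are hypothetical and do not come from DeepBlock.

```python
import random


def mask_attributes(record, n_missing, rng, id_key="id"):
    """Return (corrupted, target): a copy with some attributes blanked, and the original."""
    # Only non-empty attributes other than the id are eligible for masking.
    maskable = [key for key in record if key != id_key and record[key]]
    dropped = set(rng.sample(maskable, min(n_missing, len(maskable))))
    corrupted = {key: ("" if key in dropped else value) for key, value in record.items()}
    return corrupted, dict(record)


rng = random.Random(42)
song = {
    "id": 1,
    "title": "Me and Mrs Jones",
    "album": "Call me irresponsible",
    "composer": "Michael Bublé",
}
corrupted, target = mask_attributes(song, n_missing=1, rng=rng)
print(corrupted)  # one of title/album/composer is now an empty string
print(target)     # the complete record, used as the reconstruction target
```

An encoder-decoder trained on such pairs has to place a partial tuple near its complete version in embedding space, which is the property blocking relies on.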
  • 11. After blocking – the computationally expensive comparisons
      • Edit distance between strings
      • Embedding distance between entities
      • Phonetic similarity
      • Length
      • Jaro-Winkler
      • Sequence Matcher
      • Jaccard similarity
      • Entity-specific features
      Reference: Location Matching Kaggle Competition, https://www.kaggle.com/code/icfstat/lightgbm-feature-engineering-training-0-888-pv-lb
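Several of these pairwise features can be computed with the Python standard library alone. This is a minimal sketch, assuming whitespace tokenization for Jaccard similarity and absolute length difference for the length feature. The function `pair_features` is illustrative and is not the feature set from the referenced Kaggle notebook.

```python
from difflib import SequenceMatcher


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity over lower-cased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pair_features(a, b):
    """Feature dict for one candidate pair, e.g. as input to a downstream classifier."""
    return {
        "edit_distance": edit_distance(a, b),
        "jaccard": jaccard(a, b),
        "sequence_ratio": SequenceMatcher(None, a, b).ratio(),
        "length_diff": abs(len(a) - len(b)),
    }


features = pair_features("Blowin' in the Wind", "Blowing in the Wind")
print(features["edit_distance"], features["jaccard"], features["length_diff"])  # 1 0.6 0
```

Because these features are computed only for the candidate pairs that survive blocking, their per-pair cost matters less than it would in an exhaustive comparison.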
  • 12. How to measure the success of entity matching?
      • Model verification in isolation: precision and recall.
      • Business metrics: business-specific.
      • Can we know whether we achieved perfect entity matching? Take one use case: deduplicating social media profiles. There is no perfect way of knowing whether all duplicate or redundant profiles have been identified.
      • Human in the loop:
          • Sample the population and check whether entity matching has worked for the sampled cases.
          • Be mindful to test with samples of various sizes and with various mixtures of populations.
      • Note: you can also check the benchmarks defined by Machamp (https://github.com/megagonlabs/machamp).
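Precision and recall for entity matching are commonly computed over record pairs. This is a minimal sketch, assuming ground-truth cluster labels are available for a labeled sample; the function names and the toy data are illustrative.

```python
from itertools import combinations


def true_pairs(cluster_of):
    """All record pairs that share a ground-truth cluster."""
    by_cluster = {}
    for record_id, cluster in cluster_of.items():
        by_cluster.setdefault(cluster, []).append(record_id)
    pairs = set()
    for ids in by_cluster.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def precision_recall(found, truth):
    """Pair-level precision and recall of the predicted pairs against the truth."""
    found = {tuple(sorted(pair)) for pair in found}
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


cluster_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}  # record id -> true cluster
found = {(0, 1), (2, 3), (1, 2)}             # pairs predicted by a matcher
precision, recall = precision_recall(found, true_pairs(cluster_of))
print(round(precision, 3), recall)  # 0.667 0.5
```

Here two of the three predicted pairs are correct, and two of the four true pairs were found. On unlabeled production data the same calculation can only be applied to a human-reviewed sample.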
  • 13. Applications of entity matching
      • Song matching
      • Address matching
      • Social profile matching (Facebook, Twitter)
      • Clothes matching (or, for that matter, any retail item)
      • Profile matching on matrimony and dating sites
  • 14. Entity Embed: installation and preparing the data
      Installation:
          pip install entity-embed
      Preparing the data: the data needs to be a list of dict objects, each of which must contain 'id' and 'cluster':
          [{'id': 0, 'cluster': 0, 'title': '001-Berimbou', 'artist': 'Astrud Gilberto', 'album': 'Look to the Rainbow (2008)'},
           {'id': 1, 'cluster': 0, 'title': 'Berimbau', 'artist': 'Astrud Gilberto', 'album': 'Look to the Rainbow (1966)'},
           {'id': 2, 'cluster': 1, 'title': '4 - Marooned - Pink Floyd', 'artist': '', 'album': 'The Division Bell'}]
  • 15. Entity Embed: defining the fields
      We need to define how record fields will be numericalized and encoded by Entity Embed’s deep neural network:
          field_config_dict = {
              'title': {
                  'field_type': "MULTITOKEN",
                  'tokenizer': "entity_embed.default_tokenizer",
                  'alphabet': DEFAULT_ALPHABET,
                  'max_str_len': None,  # compute
              },
              'title_semantic': {
                  'key': 'title',
                  'field_type': "SEMANTIC_MULTITOKEN",
                  'tokenizer': "entity_embed.default_tokenizer",
                  'vocab': "fasttext.en.300d",
              }
          }
  • 16. Entity Embed: building the model
      Under the hood, Entity Embed uses pytorch-lightning, so we need to create a datamodule object:
          from entity_embed import DeduplicationDataModule

          datamodule = DeduplicationDataModule(
              train_record_dict=train_record_dict,
              valid_record_dict=valid_record_dict,
              test_record_dict=test_record_dict,
              cluster_field="cluster",
              record_numericalizer=record_numericalizer,
              batch_size=32,
              eval_batch_size=64,
              random_seed=42,
          )
  • 17. Entity Embed: training the model
      We must choose the K of the Approximate Nearest Neighbors, i.e., the top K neighbors our model will use to find duplicates in the embedding space:
          from entity_embed import EntityEmbed

          model = EntityEmbed(
              record_numericalizer,
              ann_k=100,
          )
  • 18. Entity Embed: finding candidate pairs
      When running in production, you only have access to the trained model object and the production record_dict (without the true clusters filled in, of course). You can get the embedding vectors of a production record_dict using the predict method:
          vector_dict = model.predict(
              record_dict=test_record_dict,
              batch_size=64
          )
      But what you usually want instead is the ANN pairs. You can get them with the predict_pairs method:
          found_pair_set = model.predict_pairs(
              record_dict=test_record_dict,
              batch_size=64,
              ann_k=100,
              sim_threshold=0.3,
          )
  • 19. References
      1) Deep Learning for Blocking in Entity Matching
      2) Autoblock
      3) Entity Embed
      4) Location Matching