Latest Techniques of Entity
Matching in NLP
Avinash Pathak
Expert Data Scientist
TomTom
Agenda
What is entity matching?
Why is entity matching important?
History of Entity matching
Entity matching models
How to measure success of Entity Matching
Entity Embed – Open-source tool for Entity Matching
What is entity matching?
Entity Matching refers to the problem of determining whether two different data representations
refer to the same real-world entity.
Example
Id | Title | Album | Composer | Songwriter
1 | Me and Mrs Jones | Call me irresponsible | Michael Bublé | Michael Bublé
2 | Me and Mrs Jones[Remix] | Call me irresponsible | Michael Bublé | Michael Bublé
Use Cases
• Song Matching
• Address Matching
• Social Profile Matching (Facebook, Twitter)
• Clothes Matching (or, for that matter, any item in retail)
Things get a little complicated
• Colloquial use of words
• Additional information
• Un-normalized data
• Unstructured data
• Missing data
• Dirty data
• Unavailability of supervised (labeled) data
• Scale
Id Title Album Composer Songwriter
1 Me and Mrs Jones Call me irresponsible Michael Bublé
2 Me and Mrs Jones[Remix] Michael Bublé
3 Blowin’ in the Wind The Freewheelin’ Bob Dylam Bob Dylan
4 Blowing in the Wind Bob Dylan
Problem of Scale, hence blocking
• An exhaustive pairwise comparison grows quadratically with the number of records, which is unaffordable for datasets of even moderate size. As a result, virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs considered for matching while, ideally, not ruling out true matches.
• A successful application of blocking to an entity matching task should fulfil the following four desiderata:
• First, blocking should ideally not leave out any true matches (i.e., high recall), since only the candidate record pairs generated by blocking are examined in the downstream matching step.
• Second, the number of candidate pairs should be small, so that the cost of applying a usually computationally expensive matching algorithm is controlled.
• Third, human effort should not be overspent during the blocking process.
• Finally, the blocking algorithm should be scalable enough to handle millions of records.
Without Blocking
nC2 rigorous, computationally expensive comparisons
With Blocking
Blocking (inexpensive comparisons), followed by kC2*M rigorous, computationally expensive comparisons
k: cluster size, M: number of clusters
Example with n = 10 records
Without blocking: 10C2 = 10 * 9 / 2 = 45 expensive comparisons
For k=2, 5 clusters of size 2: kC2*M = 1 * 5 = 5 comparisons
For k=3, 3 clusters of size 3 and 1 of size 1: 3C2 * 3 + 0 = 3 * 3 = 9 comparisons (the cluster of size 1 contributes no pairs)
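The counting above can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for illustration. Note that a block of size 3 holds 3C2 = 3 pairs and a block of size 1 holds none.

```python
from math import comb


def pairs_without_blocking(n: int) -> int:
    """Number of pairwise comparisons over n records: nC2."""
    return comb(n, 2)


def pairs_with_blocking(block_sizes: list[int]) -> int:
    """Comparisons when records are only compared inside their own block."""
    return sum(comb(size, 2) for size in block_sizes)


print(pairs_without_blocking(10))         # 45
print(pairs_with_blocking([2] * 5))       # 5
print(pairs_with_blocking([3, 3, 3, 1]))  # 9
```

The saving grows quickly with n: the unblocked count is quadratic, while the blocked count is roughly linear in n when block sizes stay bounded.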
History of EM
• Pattern Matching/Fuzzy Matching
• Deep Learning Based Blocking
• Self-supervised blocking
Pattern Matching/Fuzzy Matching
• Very specific solution
• Needs incremental updates whenever new examples come in
• Cross-attribute mismatches occur and would take enormous effort to cover
Example
Id | Title | Album | Composer | Songwriter
1 | Me and Mrs Jones | Call me irresponsible | Michael Bublé | Michael Bublé
2 | Me and Mrs Jones[Remix] | Call me irresponsible | Michael Bublé | Michael Bublé
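To make the trade-off concrete, here is a minimal sketch of a hand-written fuzzy matcher using only the Python standard library. The normalization rule (dropping bracketed suffixes such as "[Remix]") and the 0.9 threshold are assumptions chosen for this example, not something prescribed by the talk.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, drop bracketed suffixes such as "[Remix]", collapse spaces."""
    title = re.sub(r"\[.*?\]", " ", title.lower())
    return " ".join(title.split())


def is_fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when the normalized strings are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


print(is_fuzzy_match("Me and Mrs Jones", "Me and Mrs Jones[Remix]"))
print(is_fuzzy_match("Blowin' in the Wind", "Blowing in the Wind"))
print(is_fuzzy_match("Me and Mrs Jones", "Blowing in the Wind"))
```

Every new kind of variation (a "(Live)" suffix, a title stored in the album column) needs another rule, which is why this approach stays very specific to the data it was written for.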
EM - Deep Learning Based Blocking
Autoblock
• Token embedding: A word-embedding model transforms each token to a token embedding
• Attribute embedding: For each attribute value of a tuple, an attention-based neural network encoder converts the input sequence of token embeddings to an attribute embedding
• Tuple signature: Multiple signature functions combine the attribute embeddings of each tuple and produce multiple tuple signatures (one per signature function)
• Model training: Equipped with the positive label set, the model is trained with an objective that maximizes the differences of the cosine similarities between the tuple signatures of matched pairs and between unmatched pairs
• Fast NN search: The learned model is applied to compute the signatures for all tuples, and an LSH family for cosine similarity is used to retrieve the nearest neighbors for each tuple to generate candidate pairs for blocking
Architecture of Autoblock
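The fast NN search step can be illustrated with random-hyperplane LSH, a standard LSH family for cosine similarity. The sketch below is a simplified stand-in, not Autoblock's implementation: the signatures are hand-written toy vectors rather than learned ones, and all function names and parameters (`n_bits`, `n_tables`) are assumptions for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0.0 for plane in hyperplanes
    )


def candidate_pairs(signatures, n_bits=4, n_tables=8, seed=0):
    """Union of same-bucket pairs across several independent hash tables."""
    dim = len(next(iter(signatures.values())))
    pairs = set()
    for table in range(n_tables):
        planes = make_hyperplanes(dim, n_bits, seed=seed + table)
        buckets = defaultdict(list)
        for record_id, vector in signatures.items():
            buckets[lsh_key(vector, planes)].append(record_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy tuple signatures (hypothetical values, not learned by a model).
signatures = {
    1: [0.9, 0.1, 0.0],
    2: [0.9, 0.1, 0.0],    # same direction as record 1
    3: [-0.9, -0.1, 0.0],  # opposite direction
}
pairs = candidate_pairs(signatures)
print(pairs)
```

Vectors pointing the same way always share a bucket, while opposite vectors fall on different sides of every hyperplane, so only records 1 and 2 become a candidate pair. More bits per key make buckets stricter; more tables recover recall.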
Self-Supervised Blocking for entity matching - DeepBlock
• Encoder-decoder for self-supervision
• Take a tuple 𝑡 and feed it into a neural network (NN) that outputs a compact embedding vector u𝑡, such that feeding u𝑡 into a second NN recovers the original tuple 𝑡 (or a good approximation of it). If this works, u𝑡 can be viewed as a good compact summary of 𝑡 and can be used as its tuple embedding. The two NNs are called the encoder and the decoder, respectively.
• We can train the model to produce u𝑡 in various ways. For example, one can feed in the original tuple 𝑡 with a few attributes missing and recover the original tuple with all its attributes (this helps the model match two entities even when one of them has missing attributes).
Figure: entity → encoder → u𝑡 → decoder → entity
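The "feed in a tuple with a few attributes missing" idea needs no labels, only the records themselves. The sketch below covers just that data-preparation step: it builds (corrupted input, original target) pairs that an encoder-decoder could be trained on. The neural networks are deliberately left out, and the function names and the choice of blanking values to an empty string are assumptions for illustration.

```python
import random


def mask_attributes(record, n_missing=1, rng=None, protected=("id",)):
    """Return a copy of `record` with n_missing attribute values blanked out."""
    rng = rng or random.Random()
    maskable = [key for key in record if key not in protected]
    dropped = set(rng.sample(maskable, min(n_missing, len(maskable))))
    return {key: ("" if key in dropped else value) for key, value in record.items()}


def training_pairs(records, n_missing=1, seed=0):
    """Pair each corrupted record (model input) with its original (target)."""
    rng = random.Random(seed)
    return [(mask_attributes(record, n_missing, rng), record) for record in records]


songs = [
    {"id": 1, "title": "Me and Mrs Jones", "album": "Call me irresponsible",
     "composer": "Michael Bublé"},
    {"id": 3, "title": "Blowin' in the Wind", "album": "The Freewheelin'",
     "composer": "Bob Dylan"},
]
pairs = training_pairs(songs)
for corrupted, original in pairs:
    print(corrupted, "->", original)
```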
After Blocking: The computationally expensive comparisons
• Edit distance between strings
• Embedding distance between entities
• Phonetic similarity
• Length
• Jaro-Winkler similarity
• Sequence Matcher
• Jaccard similarity
• Entity-specific features
Reference: Location Matching Kaggle Competition https://www.kaggle.com/code/icfstat/lightgbm-feature-engineering-training-0-888-pv-lb
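A few of the features listed above can be computed with the standard library alone. This is a minimal sketch under stated assumptions: the feature names and the whitespace tokenization used for Jaccard are choices made for this example, and phonetic, embedding and Jaro-Winkler features are left out because they need extra libraries or models.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pair_features(a: str, b: str) -> dict:
    """Feature vector for one candidate pair, ready for a downstream classifier."""
    return {
        "edit_distance": edit_distance(a, b),
        "jaccard": jaccard(a, b),
        "sequence_ratio": SequenceMatcher(None, a, b).ratio(),
        "length_diff": abs(len(a) - len(b)),
    }


features = pair_features("Blowin' in the Wind", "Blowing in the Wind")
print(features)
```

Because blocking has already cut the candidate set down, it is affordable to compute many such features per pair and let a classifier (for example gradient-boosted trees, as in the referenced Kaggle notebook) decide which pairs match.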
How to measure success of Entity Matching?
For model verification in isolation: precision and recall
Business metrics: business specific
Can we know if we did the perfect entity matching?
• Take one use case: entity matching for de-duplicating social media profiles. There is no perfect way of knowing whether all the duplicate/redundant profiles have been identified.
Human in the loop
• Sample the population and check whether entity matching has worked for those cases
• Be mindful to test with samples of various sizes and with various mixtures of populations
Note: You can also check the benchmarks defined by Machamp (https://github.com/megagonlabs/machamp)
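When a labeled sample is available, precision and recall can be computed directly over sets of record pairs. The sketch below assumes pairs are unordered, so (1, 2) and (2, 1) count as the same pair; the function names and the sample pairs are made up for illustration.

```python
def normalize_pairs(pairs):
    """Treat pairs as unordered: (2, 1) and (1, 2) are the same pair."""
    return {tuple(sorted(pair)) for pair in pairs}


def precision_recall(found_pairs, true_pairs):
    """Pairwise precision and recall of a matcher against labeled pairs."""
    found, truth = normalize_pairs(found_pairs), normalize_pairs(true_pairs)
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


true_pairs = {(1, 2), (3, 4)}   # labeled by a human reviewer
found_pairs = {(2, 1), (1, 3)}  # produced by the matcher
precision, recall = precision_recall(found_pairs, true_pairs)
print(precision, recall)  # 0.5 0.5
```

Recall measured this way is only as complete as the labeled sample, which is why the sampling advice above matters.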
Applications of Entity Matching
Song Matching
Address Matching
Social Profile Matching (Facebook, Twitter)
Clothes Matching (for that matter any item in retail)
Matrimonial and dating site profile matching
Entity Embed
Installation
pip install entity-embed
Preparing the data
Data needs to be a list of dict objects, each of which must contain 'id' and 'cluster'
[{'id': 0,
  'cluster': 0,
  'title': '001-Berimbou',
  'artist': 'Astrud Gilberto',
  'album': 'Look to the Rainbow (2008)'},
 {'id': 1,
  'cluster': 0,
  'title': 'Berimbau',
  'artist': 'Astrud Gilberto',
  'album': 'Look to the Rainbow (1966)'},
 {'id': 2,
  'cluster': 1,
  'title': '4 - Marooned - Pink Floyd',
  'artist': '',
  'album': 'The Division Bell'}]
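The 'cluster' field is what carries the ground truth: records that share a cluster value refer to the same real-world entity. The sketch below is plain Python, not part of the Entity Embed API; the helper name is hypothetical. It checks that each record has the two required fields and derives the true matching pairs from the cluster labels.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {'id': 0, 'cluster': 0, 'title': '001-Berimbou',
     'artist': 'Astrud Gilberto', 'album': 'Look to the Rainbow (2008)'},
    {'id': 1, 'cluster': 0, 'title': 'Berimbau',
     'artist': 'Astrud Gilberto', 'album': 'Look to the Rainbow (1966)'},
    {'id': 2, 'cluster': 1, 'title': '4 - Marooned - Pink Floyd',
     'artist': '', 'album': 'The Division Bell'},
]


def true_pairs_from_clusters(records, id_field="id", cluster_field="cluster"):
    """Validate required fields and list every pair of ids sharing a cluster."""
    clusters = defaultdict(list)
    for record in records:
        missing = {id_field, cluster_field} - set(record)
        if missing:
            raise ValueError(f"record is missing required fields: {sorted(missing)}")
        clusters[record[cluster_field]].append(record[id_field])
    return {
        pair for ids in clusters.values() for pair in combinations(sorted(ids), 2)
    }


print(true_pairs_from_clusters(records))  # {(0, 1)}
```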
Entity Embed
Defining the fields
We need to define how record fields will be numericalized and encoded by Entity Embed's deep neural network
field_config_dict = {
    'title': {
        'field_type': "MULTITOKEN",
        'tokenizer': "entity_embed.default_tokenizer",
        'alphabet': DEFAULT_ALPHABET,
        'max_str_len': None,  # compute
    },
    'title_semantic': {
        'key': 'title',
        'field_type': "SEMANTIC_MULTITOKEN",
        'tokenizer': "entity_embed.default_tokenizer",
        'vocab': "fasttext.en.300d",
    }
}
Entity Embed
Building the model
Under the hood, Entity Embed uses pytorch-lightning, so we need to create a datamodule object:
from entity_embed import DeduplicationDataModule

datamodule = DeduplicationDataModule(
    train_record_dict=train_record_dict,
    valid_record_dict=valid_record_dict,
    test_record_dict=test_record_dict,
    cluster_field="cluster",
    record_numericalizer=record_numericalizer,
    batch_size=32,
    eval_batch_size=64,
    random_seed=42,
)
Entity Embed
Training the model
We must choose the K of the Approximate Nearest Neighbors, i.e., the top K neighbors our model will use to find duplicates in the embedding space.
from entity_embed import EntityEmbed

model = EntityEmbed(
    record_numericalizer,
    ann_k=100,
)
Entity Embed
Finding candidate pairs
When running in production, you only have access to the trained model object and the production record_dict (without the true clusters filled, of course). You can get the embedding vectors of a production record_dict using the predict method:
vector_dict = model.predict(
    record_dict=test_record_dict,
    batch_size=64,
)
But what you usually want instead is the ANN pairs. You can get them with the predict_pairs method:
found_pair_set = model.predict_pairs(
    record_dict=test_record_dict,
    batch_size=64,
    ann_k=100,
    sim_threshold=0.3,
)
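To build intuition for what the two knobs do, here is a brute-force sketch of "top K neighbors, then similarity cutoff" over a small dictionary of embedding vectors. It is not Entity Embed's implementation, which relies on an approximate index rather than an exhaustive scan. The function names, the toy vectors, and the reading of `sim_threshold` as a minimum cosine similarity are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_pairs(vector_dict, k, sim_threshold):
    """For each record, keep its k most similar records above the threshold."""
    pairs = set()
    for id_a, vec_a in vector_dict.items():
        scored = sorted(
            ((cosine(vec_a, vec_b), id_b)
             for id_b, vec_b in vector_dict.items() if id_b != id_a),
            reverse=True,
        )
        for similarity, id_b in scored[:k]:
            if similarity >= sim_threshold:
                pairs.add(tuple(sorted((id_a, id_b))))
    return pairs


# Toy embeddings (hypothetical values standing in for model.predict output).
vector_dict = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
}
print(top_k_pairs(vector_dict, k=1, sim_threshold=0.3))  # {(0, 1)}
```

A larger K or a lower threshold yields more candidate pairs (higher recall, more work downstream); a smaller K or a higher threshold does the opposite.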
References
1) Deep Learning for Blocking in Entity Matching
2) Autoblock
3) Entity Embed
4) Location Matching

[DSC Europe 22] Latest Techniques of Entity Matching in NLP - Avinash Pathak
