Getting Started: Entity Resolution
Macêdo, Crislânio
Dieb, Felipe
Menezes, Clairton
05, Feb, 2020
Outline
1. Motivation
2. Record Deduplication & Record Linkage
3. Advantages
4. Hands on
5. Conclusion
6. References
What is Entity Resolution?
Entity Resolution is the task of disambiguating manifestations of real-world
entities in various records or mentions by linking and grouping.
For example, there could be different ways of addressing the same person in
text, different addresses for businesses, or photos of a particular object.
This has many applications, particularly in government and public health
data, web search, comparison shopping, law enforcement, and more.
Motivation
Real-world data is entered by people, and often it is:
● Not linked with related data
● Entered incorrectly because people make mistakes: typos, mishearing,
miscalculation, misinterpretation, etc.
This causes the following problems in the data:
● Duplications (e.g. the same person appears in more than one record)
● Bad formatting (e.g. birth dates appear in multiple formats)
● Inconsistencies (e.g. a person appears with multiple addresses)
Entities exist in the real world; records and mentions of those entities exist in the
digital world.
Databases frequently contain duplicate fields and records that refer to the same
real-world entity.
Data world is noisy
Data world is messy
Real World
Record Linkage & Record
Deduplication
Data Deduplication: a technique for detecting and
eliminating duplicate data in a dataset.
Record Linkage (RL): the task of finding records that
refer to the same entity across different data sources (e.g.,
books, websites, databases). When the task involves only one
data source, it is known as deduplication.
Canonicalization: converting data with more than one
possible representation into a standard form.
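As an illustration of canonicalization, the sketch below maps dates written in several formats to a single ISO form and normalizes the spacing and case of names. The list of accepted formats and the function names are assumptions made for this example, not something taken from the slides.

```python
import re
from datetime import datetime

# Hypothetical list of input formats; real data may need more patterns.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def canonical_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


def canonical_name(text):
    """Lowercase, trim, and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text.strip().lower())
```

Once every record uses the same representation, later comparison steps can treat "05/02/2020" and "2020-02-05" as equal values rather than as different strings.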
Deduplication
Record Linkage
Record Linkage is also known as Data Matching, Entity
Resolution, etc.
Canonicalization
Indexing in Record Linkage
[Figure: record linkage workflow. Database A and Database B each go through cleaning and
normalization, followed by indexing, record pair comparison, similarity vector classification
(matches, non-matches, review), and evaluation.]
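The workflow stages can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions: the field names, the first-letter blocking key, the use of `difflib.SequenceMatcher` as the comparison function, and both thresholds are choices made for the example, not part of the original slides.

```python
from difflib import SequenceMatcher
from itertools import product


def clean(record):
    """Cleaning and normalization: lowercase and collapse whitespace."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def block_key(record):
    """Indexing: only records that share this key are compared."""
    return record["name"][:1]


def similarity_vector(a, b, fields):
    """Record pair comparison: one similarity in [0, 1] per field."""
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]


def classify(vector, upper=0.85, lower=0.6):
    """Classification into match, non-match, or review (thresholds assumed)."""
    score = sum(vector) / len(vector)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"


def link(db_a, db_b, fields=("name", "city")):
    """Return (index_a, index_b, decision) for every compared pair."""
    a_clean = [clean(r) for r in db_a]
    b_clean = [clean(r) for r in db_b]
    results = []
    for (i, a), (j, b) in product(enumerate(a_clean), enumerate(b_clean)):
        if block_key(a) != block_key(b):
            continue  # skipped by the index, never compared
        results.append((i, j, classify(similarity_vector(a, b, fields))))
    return results
```

The evaluation stage is left out here; it would compare these decisions against pairs whose true status is known.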
Dedupe.io
Dedupe is a library that uses machine learning to perform deduplication and
entity resolution quickly on structured data. In addition to removing duplicate
entries from within a single dataset, Dedupe can also do record linkage across
disparate datasets.
How does it work?
Dedupe engages the user in labeling the data via a command line interface,
then uses machine learning on the resulting training data to predict similar
or matching records within unseen data. This process is called active learning.
source: https://pypi.org/project/dedupe/1.6.5/
Testing Out Dedupe
Getting started with Dedupe is easy, and the developers have provided a
convenient repo with examples that you can use and iterate on.
To get Dedupe running, we’ll need to install unidecode, future, and dedupe.
The challenges
How can computers know if names are similar?
How can computers know if similar addresses matter more or less than similar names
or similar employers?
How can computers cluster similar records quickly if there's a lot of data?
Advantages
● Improving data quality and integrity
● Reducing costs and efforts in data acquisition
● Duplicate data reduction or group analysis
● Identifying records that reference the same entity across different sources
Multiple domains
● Fraud detection
● Health systems
● Enterprise business systems
Proper identification of duplicated patient information remains an arduous problem for hospitals,
pharmacies and service providers.
Hands on
Rather than comparing whole records as single strings, Dedupe exploits the
structure of the input data and compares the records field by field.
Dedupe lets the user nominate the features they believe will be most useful.
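A field definition of this kind might look like the sketch below. The field names are hypothetical, and the dictionary style follows the variable definitions documented for the Dedupe 1.x and 2.x releases; other releases may expect a different form, so check the documentation for the installed version.

```python
# Hypothetical field names for a "people" dataset; adjust to your columns.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "phone", "type": "String", "has missing": True},
]

# Assumption: with Dedupe 1.x or 2.x installed, this list would be handed
# to the library, e.g. dedupe.Dedupe(fields). It is not imported here so
# that the sketch runs without the package.
field_names = [f["field"] for f in fields]
```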
Hands on
Dedupe scans the data and groups record pairs as matches, non-matches, or
possible matches.
The possible matches (uncertainPairs) are identified using a combination of
blocking, affine gap distance, and active learning.
Hands on: Blocking
Dedupe’s method of blocking involves engineering subsets of feature vectors (these
are called ‘predicates’).
In the case of our people dataset above, the predicates might be things like:
● the first three digits of the phone number
● the full name
● the first five characters of the name
● a random 4-gram within the city name
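The predicates listed above can be sketched as small functions that each return a set of block keys; two records become a candidate pair when any predicate gives them a key in common. This is an illustration only, not Dedupe's code: the record fields are assumed, and the slide's "random 4-gram" is replaced by all 4-grams of the city name so that the result is deterministic.

```python
from collections import defaultdict
from itertools import combinations


def first_three_phone_digits(rec):
    digits = "".join(ch for ch in rec["phone"] if ch.isdigit())
    return {digits[:3]} if len(digits) >= 3 else set()


def full_name(rec):
    return {rec["name"].lower()}


def first_five_name_chars(rec):
    return {rec["name"].lower()[:5]}


def city_four_grams(rec):
    city = rec["city"].lower()
    return {city[i:i + 4] for i in range(len(city) - 3)}


PREDICATES = [first_three_phone_digits, full_name,
              first_five_name_chars, city_four_grams]


def candidate_pairs(records):
    """Return index pairs of records that share at least one block key."""
    blocks = defaultdict(set)
    for idx, rec in enumerate(records):
        for pred in PREDICATES:
            for key in pred(rec):
                # Keys are namespaced by predicate so they cannot collide.
                blocks[(pred.__name__, key)].add(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Only the candidate pairs go on to the more expensive field-by-field comparison, which is what keeps the number of comparisons manageable on large datasets.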
Hamming Distance: https://www.tutorialspoint.com/what-is-hamming-distance
Hands on: Affine gap
Use affine gap distance: an edit-distance metric, related to Hamming distance, in
which each deletion or insertion that extends an existing gap costs less than the
one that opens it.
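A minimal sketch of an affine gap distance, using the standard three-table dynamic program. The cost values (1.0 to substitute, 1.0 to open a gap, 0.5 to extend it) are assumptions chosen for the example and are not Dedupe's actual weights.

```python
import math


def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Minimum cost to align a and b; a gap of length k costs
    gap_open + (k - 1) * gap_extend."""
    n, m = len(a), len(b)
    inf = math.inf
    # M: last step aligns two characters; X: last step deletes from a;
    # Y: last step inserts from b.
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Extending an open gap is cheaper than opening a new one.
            X[i][j] = min(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = min(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])
```

With these costs, an abbreviation such as "main st" for "main street" is penalized as one gap of four characters rather than as four independent deletions.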
Hamming Distance: https://www.tutorialspoint.com/what-is-hamming-distance
Dedupe types: https://docs.dedupe.io/en/latest/Variable-definition.html
Hands on: Active Learning
Uses all the processes above, then iteratively generates a result for each element of the data.
Dedupe prompts the user on the command line to engage in active learning
by showing pairs of records and asking whether they are the same or different.
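The idea of asking about the most informative pairs first can be shown with a toy example. This is a generic uncertainty-sampling sketch, not Dedupe's implementation: it assumes each candidate pair already has a single similarity score, learns only one decision threshold, and uses a hypothetical `ask_user` callback in place of the interactive prompt.

```python
def learn_threshold(scores, ask_user, rounds=4):
    """scores: dict of pair_id -> similarity in [0, 1].
    ask_user: callable returning True if the pair is the same entity."""
    unlabeled = dict(scores)
    low, high = 0.0, 1.0  # highest known non-match, lowest known match
    labels = {}
    for _ in range(min(rounds, len(unlabeled))):
        threshold = (low + high) / 2
        # Query the pair the current model is least sure about.
        pair_id = min(unlabeled, key=lambda p: abs(unlabeled[p] - threshold))
        score = unlabeled.pop(pair_id)
        labels[pair_id] = ask_user(pair_id)
        if labels[pair_id]:
            high = min(high, score)
        else:
            low = max(low, score)
    return (low + high) / 2, labels
```

Each answer narrows the range in which the threshold can lie, so a handful of well-chosen questions can replace labeling every pair by hand.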
Conclusion
Finding duplicates or matching data when you don't
have primary keys is one of the biggest challenges in
preparing data for data science.
https://developers.google.com/knowledge-graph
Conclusion
Entity Resolution is becoming an increasingly important task as linked data
grows, and the requirement for graph-based reasoning extends beyond
theoretical applications.
With the advent of big data computations, this need has become even more
prevalent.
https://developers.google.com/knowledge-graph
https://youtu.be/mmQl6VGvX-c
References
[1] Linking Data for Health Services Research: A Framework and Instructional Guide [Internet] - https://www.ncbi.nlm.nih.gov/books/NBK253312/
[2] Data Linkage: The Big Picture - https://hdsr.mitpress.mit.edu/pub/8fm8lo1e
[3] Deduplication & Record Linkage - https://www.kaggle.com/caesarlupum/deduping-record-linkage#Deduplication-&-Record-Linkage
[4] 1 + 1 = 1 or Record Deduplication with Python - https://youtu.be/McsTWXeURhA
[5] Indexing Techniques for Scalable Record Linkage and Deduplication - https://pt.slideshare.net/kkpradeeban/indexing-techniques-for-scalable-record-linkage-and-deduplication
[6] Deduplication detection - https://pt.slideshare.net/kirar/tutorial-4-duplicate-detection
[7] Basics of Entity Resolution with Python and Dedupe - https://medium.com/district-data-labs/basics-of-entity-resolution-with-python-and-dedupe-bc87440b64d4
[8] A Theory for Record Linkage - https://courses.cs.washington.edu/courses/cse590q/04au/papers/Felligi69.pdf
[9] Entity Resolution for Big Data - http://www.datacommunitydc.org/blog/2013/08/entity-resolution-for-big-data
[10] Google Knowledge Graph Search API - https://developers.google.com/knowledge-graph
[11] Generate Fake Data - https://mockaroo.com/
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

Record Deduplication and Record Linkage

  • 1. Getting Started: Entity Resolution. Crislânio Macêdo, Felipe Dieb, Clairton Menezes. 5 Feb 2020
  • 2. Outline: 1. Motivation 2. Record Deduplication & Record Linkage 3. Advantages 4. Hands on 5. Conclusion 6. References
  • 3. What is Entity Resolution? Entity Resolution is the task of disambiguating manifestations of real-world entities in various records or mentions by linking and grouping them. For example, there could be different ways of addressing the same person in text, different addresses for the same business, or several photos of a particular object. It has many applications, particularly in government and public health data, web search, comparison shopping, and law enforcement.
  • 4. Motivation. Real-world data is entered by people, and it is often: ● Not linked with related data ● Entered incorrectly because people make mistakes: typing errors, mishearing, miscalculation, misinterpretation, etc. This causes the following data problems: ● Duplications (e.g. the same person appears in more than one record) ● Bad formatting (e.g. birth dates appear in multiple formats) ● Inconsistencies (e.g. a person appears with multiple addresses)
  • 5. Entities exist in the real world; records and mentions of those entities exist in the digital world.
  • 6. The data world is noisy. Databases frequently contain duplicate fields and records that refer to the same real-world entity.
  • 7. Data world is messy Real World
  • 8. Record Linkage & Record Deduplication. Data Deduplication: a technique for detecting and eliminating duplicate data in a dataset. Record Linkage (RL): the task of finding records that refer to the same entity across different data sources (e.g. book websites, databases). When the task involves only one data source, it is known as Deduplication. Canonicalization: converting data that has more than one possible representation into a standard form.
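
Canonicalization as defined on slide 8 can be illustrated with a small sketch. The accepted date formats and the helper names `canonical_date` and `canonical_name` are assumptions made for this example; they do not come from the slides or from any library.

```python
import re
from datetime import datetime

# Hypothetical set of input formats; extend it to match your own data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def canonical_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None


def canonical_name(raw):
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", raw).strip().lower()
```

After this step, two records that wrote the same birth date as `05/02/2020` and `2020-02-05` carry identical values, which makes later comparison much simpler.
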
  • 10. Record Linkage. Record Linkage is also known as Data Matching, Entity Resolution, etc.
  • 12. Indexing in Record Linkage (pipeline diagram): Database A and Database B → Cleaning and Normalization (one step per database) → Indexing → Record pair comparison → Similarity vector classification → Matches / Non-matches / Review → Evaluation
  • 13. Dedupe.io. Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. In addition to removing duplicate entries within a single dataset, Dedupe can also perform record linkage across disparate datasets. How does it work? Dedupe asks the user to label pairs of records through a command-line interface, then applies machine learning to the resulting training data to predict similar or matching records in unseen data. This process is called active learning. Source: https://pypi.org/project/dedupe/1.6.5/
  • 14. Testing Out Dedupe Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. To get Dedupe running, we’ll need to install unidecode, future, and dedupe.
  • 15. The challenges. How can computers know if names are similar? How can computers know if similar addresses matter more or less than similar names or similar employers? How can computers cluster similar records quickly if there’s a lot of data?
  • 16. Advantages. ● Improving data quality and integrity ● Reducing costs and effort in data acquisition ● Duplicate data reduction or group analysis ● Identifying records that reference the same entity across different sources. Multiple domains: ● Fraud detection ● Health systems ● Enterprise business systems. Proper identification of duplicated patient information remains an arduous problem for hospitals, pharmacies and service providers.
  • 19. Hands on. Dedupe exploits the structure of the input data to compare records field by field. Dedupe lets the user nominate the features they believe will be most useful.
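
The field-by-field comparison described on slide 19 can be sketched with the standard library alone. This is not Dedupe's implementation: the field names in `FIELDS` are hypothetical, and `difflib.SequenceMatcher` stands in for whatever string comparator a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical fields that a user might nominate for comparison.
FIELDS = ("name", "address", "phone")


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values; a missing value scores 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comparison_vector(rec_a, rec_b, fields=FIELDS):
    """Return one similarity score per field, in the order given by `fields`."""
    return [field_similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields]
```

Each pair of records becomes a short numeric vector, which is the kind of input a classifier needs to decide whether the pair is a match.
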
  • 20. Hands on. Dedupe scans the data and groups record pairs as matches, non-matches, or possible matches. The possible matches (uncertainPairs) are identified using a combination of blocking, affine gap distance, and active learning.
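
The three-way grouping on slide 20 can be sketched as a pair of thresholds on a match score, with the least certain pair chosen for labeling. The threshold values and both function names are assumptions for illustration; picking the score closest to 0.5 is ordinary uncertainty sampling, which is not necessarily the selection rule Dedupe uses.

```python
def classify(score, lower=0.3, upper=0.7):
    """Label a record pair by its match score; the thresholds are illustrative."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"


def most_uncertain(scored_pairs):
    """Return the pair whose score is closest to 0.5.

    `scored_pairs` is a list of (pair, score) tuples. The returned pair is
    the one a human labeler would be asked about next.
    """
    return min(scored_pairs, key=lambda item: abs(item[1] - 0.5))[0]
```

Asking about the least certain pair first means each human answer gives the model the most new information.
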
  • 21. Hands on: Blocking. Dedupe’s method of blocking involves engineering subsets of feature vectors (called ‘predicates’). In the case of our people dataset, the predicates might be things like: ● the first three digits of the phone number ● the full name ● the first five characters of the name ● a random 4-gram within the city name. Hamming Distance: https://www.tutorialspoint.com/what-is-hamming-distance
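
A minimal sketch of predicate-based blocking, using two of the predicates listed on slide 21. The record layout, the predicate names and the `candidate_pairs` helper are assumptions for this example and are not Dedupe's API.

```python
from collections import defaultdict
from itertools import combinations


def phone_prefix(rec):
    """First three digits of the phone number, ignoring punctuation."""
    digits = "".join(ch for ch in rec.get("phone", "") if ch.isdigit())
    return digits[:3]


def name_prefix(rec):
    """First five characters of the lowercased name."""
    return rec.get("name", "").lower()[:5]


# Hypothetical predicate set; each predicate maps a record to a block key.
PREDICATES = {"phone3": phone_prefix, "name5": name_prefix}


def candidate_pairs(records):
    """Return the set of id pairs that share a block key under any predicate.

    `records` maps a record id to a dict of field values.
    """
    blocks = defaultdict(set)
    for rec_id, rec in records.items():
        for pred_name, pred in PREDICATES.items():
            key = pred(rec)
            if key:  # skip empty keys so blank fields never form a block
                blocks[(pred_name, key)].add(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the pairs returned here go on to the expensive field-by-field comparison, which avoids comparing every record with every other record.
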
  • 22. Hands on: Affine gap. Use a distance metric, such as a variation on Hamming distance, that makes each consecutive deletion or insertion after the first one cheaper. Hamming Distance: https://www.tutorialspoint.com/what-is-hamming-distance Dedupe types: https://docs.dedupe.io/en/latest/Variable-definition.html Hands on: Active Learning. Uses all of the processes above, then iteratively generates a result for each element of the data. Dedupe is a command-line application that prompts the user to engage in active learning by showing pairs of entities and asking whether they are the same or different.
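
The affine gap idea on slide 22 can be sketched as a small dynamic program. Hamming distance proper is defined only for strings of equal length, so this sketch uses an edit-distance-style alignment in which opening a gap costs more than extending one. The cost values are illustrative assumptions, and the code is not Dedupe's internal implementation.

```python
INF = float("inf")


def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Minimum alignment cost where a gap of length k costs
    gap_open + (k - 1) * gap_extend."""
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] paired with b[j-1]
    # X: alignment ends with a[i-1] facing a gap (a deletion)
    # Y: alignment ends with b[j-1] facing a gap (an insertion)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an existing gap is cheaper than opening a new one.
            X[i][j] = min(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = min(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])
```

With these costs, an abbreviation such as "main st" for "main street" is penalized once for the whole run of missing characters plus a small extension cost, instead of paying the full price for each missing character.
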
  • 23. Conclusion. Finding duplicates or matching data when you don't have primary keys is one of the biggest challenges in preparing data for data science. https://developers.google.com/knowledge-graph
  • 24. Conclusion. Entity Resolution is becoming an increasingly important task as linked data grows and the requirement for graph-based reasoning extends beyond theoretical applications. With the advent of big data computations, this need has become even more pressing. https://developers.google.com/knowledge-graph https://youtu.be/mmQl6VGvX-c
  • 26. References
      [1] Linking Data for Health Services Research: A Framework and Instructional Guide [Internet] - https://www.ncbi.nlm.nih.gov/books/NBK253312/
      [2] Data Linkage: The Big Picture - https://hdsr.mitpress.mit.edu/pub/8fm8lo1e
      [3] Deduplication & Record Linkage - https://www.kaggle.com/caesarlupum/deduping-record-linkage#Deduplication-&-Record-Linkage
      [4] 1 + 1 = 1 or Record Deduplication with Python - https://youtu.be/McsTWXeURhA
      [5] Indexing Techniques for Scalable Record Linkage and Deduplication - https://pt.slideshare.net/kkpradeeban/indexing-techniques-for-scalable-record-linkage-and-deduplication
      [6] Deduplication detection - https://pt.slideshare.net/kirar/tutorial-4-duplicate-detection
  • 27. References
      [7] Basics of Entity Resolution with Python and Dedupe - https://medium.com/district-data-labs/basics-of-entity-resolution-with-python-and-dedupe-bc87440b64d4
      [8] A Theory for Record Linkage - https://courses.cs.washington.edu/courses/cse590q/04au/papers/Felligi69.pdf
      [9] Entity Resolution for Big Data - http://www.datacommunitydc.org/blog/2013/08/entity-resolution-for-big-data
      [10] Google Knowledge Graph Search API - https://developers.google.com/knowledge-graph
      [11] Generate Fake Data - https://mockaroo.com/

Editor's Notes

  1. Let’s imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy.