SlideShare a Scribd company logo
1 of 22
Download to read offline
Towards Identity Resolution: The
Challenge Of Name Matching
About Me
Gil Irizarry - Director of Engineering for name
and identity resolution technology
Basis Technology - leading provider of software
solutions for extracting meaningful intelligence
from multilingual text and digital devices
Names
What's in a name? that which we call a rose
By any other name would smell as sweet;
So Romeo would, were he not Romeo call'd,
Retain that dear perfection which he owes
Without that title.
Romeo and Juliet, Act 2, Scene 2
First, An Exercise...
Ask your neighbor for his/her name
The Challenge
=문재인 文在寅?
Lies We Believe About Names
People have exactly one canonical full name.
People have exactly one full name which they go by.
People have, at this point in time, exactly one canonical full name.
People have, at this point in time, one full name which they go by.
People have exactly N names, for any value of N.
People’s names fit within a certain defined amount of space.
People’s names do not change.
People’s names change, but only at a certain enumerated set of events.
People’s names are written in ASCII.
People’s names are written in any single character set.
People’s names are all mapped in Unicode code points.
People’s names are case sensitive.
People’s names are case insensitive.
People’s names sometimes have prefixes or suffixes, but you can safely ignore those.
People’s names do not contain numbers.
People’s names are not written in ALL CAPS.
People’s names are not written in all lower case letters.
People’s names have an order to them. Picking any ordering scheme will automatically result in consistent
ordering among all systems, as long as both use the same ordering scheme for the same name.
People’s first names and last names are, by necessity, different.
...and more…
https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-na
mes/
A True Story
Mícheál MacDonncha
vs.
Ardmhéara Micheál MacDonncha
(vs. Micheál Mac Donncha)
https://www.irishpost.com/news/
misspelling-irish-lord-mayors-name
-leads-mishap-trip-israel-153193
Imagine That You Need To Enter Your Name
https://www.basistech.com/case-study/nyu-names-search/
Imagine we have a backend datastore of identities...
● This identity directory is stored in Elasticsearch. (Why
Elasticsearch? Because it has fuzzy search capability)
● To access data, you have to enter a name
● The terminal asks for Last Name, First Name (and the
system stores your name as Smith, John Armstrong)
● However, the user enters:
○ John Smith
○ John A. Smith
○ John Armstrong Smith
Matching Terms
Now imagine a user calls into a call center...
● User calls into a call center
● An operator hears the user’s
name and transcribes it
● This is prone to errors
● The operator enters
○ Jon Smyth
○ John ArmstrongSmith
○ John Armstrong-Smith
At The Call Center
Suppose Other Phenomena Are Entered
● Nicknames
○ Johnny Smith
● Gender mistakes
○ Joan Smith
○ Joanie Smith
● Differences in relative name frequencies
○ Shaun Smith
● Initials, especially for organization names
○ IBM vs. International Business Machines
Scoring The Different Phenomena
Scoring The Different Phenomena
If Our Users Cross National Borders
문재인 vs. Moon Jae-in
or
安倍 晋三 vs. Shinzo Abe
Multi-lingual Matching
Multi-lingual Matching
A Challenge Fulfilled
=문재인
Moon Jae-in in
Hangul
Hangul: Korean Alphabet
文在寅
Moon Jae-in
in Hanja
Hanja: Chinese characters
with Korean pronunciation
Conclusions
1. Matching text strings is straightforward; matching
names is not.
2. Text strings comprised of different characters
may have the same social or cultural meaning.
Conclusions
3. The situation gets more complex when combining a
name with other information, such as date of birth
or address. These types of data also have multiple
formats and are prone to transcription errors.
Finally...
What is your neighbor’s name?

More Related Content

More from Gil Irizarry

[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entit...
[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entit...[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entit...
[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entit...Gil Irizarry
 
Ai for Good: Bad Guys, Messy Data, & NLP
Ai for Good: Bad Guys, Messy Data, & NLPAi for Good: Bad Guys, Messy Data, & NLP
Ai for Good: Bad Guys, Messy Data, & NLPGil Irizarry
 
DevSecOps Orchestration of Text Analytics with Containers
DevSecOps Orchestration of Text Analytics with ContainersDevSecOps Orchestration of Text Analytics with Containers
DevSecOps Orchestration of Text Analytics with ContainersGil Irizarry
 
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...Gil Irizarry
 
Beginning Native Android Apps
Beginning Native Android AppsBeginning Native Android Apps
Beginning Native Android AppsGil Irizarry
 
From Silos to DevOps: Our Story
From Silos to DevOps:  Our StoryFrom Silos to DevOps:  Our Story
From Silos to DevOps: Our StoryGil Irizarry
 
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014Gil Irizarry
 
Graphics on the Go
Graphics on the GoGraphics on the Go
Graphics on the GoGil Irizarry
 
Make Mobile Apps Quickly
Make Mobile Apps QuicklyMake Mobile Apps Quickly
Make Mobile Apps QuicklyGil Irizarry
 
Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12Gil Irizarry
 
Agile The Kanban Way - Central MA PMI 2011
Agile The Kanban Way - Central MA PMI 2011Agile The Kanban Way - Central MA PMI 2011
Agile The Kanban Way - Central MA PMI 2011Gil Irizarry
 
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011Gil Irizarry
 
Transitioning to Kanban - Aug 11
Transitioning to Kanban - Aug 11Transitioning to Kanban - Aug 11
Transitioning to Kanban - Aug 11Gil Irizarry
 
Transitioning to Kanban
Transitioning to KanbanTransitioning to Kanban
Transitioning to KanbanGil Irizarry
 
Beyond Scrum of Scrums
Beyond Scrum of ScrumsBeyond Scrum of Scrums
Beyond Scrum of ScrumsGil Irizarry
 

More from Gil Irizarry (15)

[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entit...
[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entit...[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entit...
[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entit...
 
Ai for Good: Bad Guys, Messy Data, & NLP
Ai for Good: Bad Guys, Messy Data, & NLPAi for Good: Bad Guys, Messy Data, & NLP
Ai for Good: Bad Guys, Messy Data, & NLP
 
DevSecOps Orchestration of Text Analytics with Containers
DevSecOps Orchestration of Text Analytics with ContainersDevSecOps Orchestration of Text Analytics with Containers
DevSecOps Orchestration of Text Analytics with Containers
 
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
 
Beginning Native Android Apps
Beginning Native Android AppsBeginning Native Android Apps
Beginning Native Android Apps
 
From Silos to DevOps: Our Story
From Silos to DevOps:  Our StoryFrom Silos to DevOps:  Our Story
From Silos to DevOps: Our Story
 
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
 
Graphics on the Go
Graphics on the GoGraphics on the Go
Graphics on the Go
 
Make Mobile Apps Quickly
Make Mobile Apps QuicklyMake Mobile Apps Quickly
Make Mobile Apps Quickly
 
Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12
 
Agile The Kanban Way - Central MA PMI 2011
Agile The Kanban Way - Central MA PMI 2011Agile The Kanban Way - Central MA PMI 2011
Agile The Kanban Way - Central MA PMI 2011
 
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
 
Transitioning to Kanban - Aug 11
Transitioning to Kanban - Aug 11Transitioning to Kanban - Aug 11
Transitioning to Kanban - Aug 11
 
Transitioning to Kanban
Transitioning to KanbanTransitioning to Kanban
Transitioning to Kanban
 
Beyond Scrum of Scrums
Beyond Scrum of ScrumsBeyond Scrum of Scrums
Beyond Scrum of Scrums
 

Recently uploaded

Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapIshara Amarasekera
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
What is Mendix and the concept of low-code development.docx
What is Mendix and the concept of low-code development.docxWhat is Mendix and the concept of low-code development.docx
What is Mendix and the concept of low-code development.docxTechnogeeks
 
Revolutionize Your Video Editing with InVideo.io: A Comprehensive Review
Revolutionize Your Video Editing with InVideo.io: A Comprehensive ReviewRevolutionize Your Video Editing with InVideo.io: A Comprehensive Review
Revolutionize Your Video Editing with InVideo.io: A Comprehensive Reviewjw364beach
 
Business Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisBusiness Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisDEEPRAJ PATHAK
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
logical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxlogical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxRemote DBA Services
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxAS Design & AST.
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
oracle 23c new features for developer and dba
oracle 23c new features for developer and dbaoracle 23c new features for developer and dba
oracle 23c new features for developer and dbaRemote DBA Services
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flinkconfluent
 

Recently uploaded (20)

Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Key Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery RoadmapKey Steps in Agile Software Delivery Roadmap
Key Steps in Agile Software Delivery Roadmap
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
What is Mendix and the concept of low-code development.docx
What is Mendix and the concept of low-code development.docxWhat is Mendix and the concept of low-code development.docx
What is Mendix and the concept of low-code development.docx
 
Revolutionize Your Video Editing with InVideo.io: A Comprehensive Review
Revolutionize Your Video Editing with InVideo.io: A Comprehensive ReviewRevolutionize Your Video Editing with InVideo.io: A Comprehensive Review
Revolutionize Your Video Editing with InVideo.io: A Comprehensive Review
 
Business Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisBusiness Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business Analysis
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
logical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxlogical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptx
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
oracle 23c new features for developer and dba
oracle 23c new features for developer and dbaoracle 23c new features for developer and dba
oracle 23c new features for developer and dba
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 

Towards Identity Resolution: The Challenge of Name Matching

  • 1. Towards Identity Resolution: The Challenge Of Name Matching
  • 2. About Me Gil Irizarry - Director of Engineering for name and identity resolution technology Basis Technology - leading provider of software solutions for extracting meaningful intelligence from multilingual text and digital devices
  • 3. Names What's in a name? that which we call a rose By any other name would smell as sweet; So Romeo would, were he not Romeo call'd, Retain that dear perfection which he owes Without that title. Romeo and Juliet, Act 2, Scene 2
  • 4. First, An Exercise... Ask your neighbor for his/her name
  • 6. Lies We Believe About Names People have exactly one canonical full name. People have exactly one full name which they go by. People have, at this point in time, exactly one canonical full name. People have, at this point in time, one full name which they go by. People have exactly N names, for any value of N. People’s names fit within a certain defined amount of space. People’s names do not change. People’s names change, but only at a certain enumerated set of events. People’s names are written in ASCII. People’s names are written in any single character set. People’s names are all mapped in Unicode code points. People’s names are case sensitive. People’s names are case insensitive. People’s names sometimes have prefixes or suffixes, but you can safely ignore those. People’s names do not contain numbers. People’s names are not written in ALL CAPS. People’s names are not written in all lower case letters. People’s names have an order to them. Picking any ordering scheme will automatically result in consistent ordering among all systems, as long as both use the same ordering scheme for the same name. People’s first names and last names are, by necessity, different. ...and more… https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-na mes/
  • 7. A True Story Mícheál MacDonncha vs. Ardmhéara Micheál MacDonncha (vs. Micheál Mac Donncha) https://www.irishpost.com/news/ misspelling-irish-lord-mayors-name -leads-mishap-trip-israel-153193
  • 8. Imagine That You Need To Enter Your Name https://www.basistech.com/case-study/nyu-names-search/
  • 9. Imagine we have a backend datastore of identities... ● This identity directory is stored in Elasticsearch. (Why Elasticsearch? Because it has fuzzy search capability) ● To access data, you have to enter a name ● The terminal asks for Last Name, First Name (and the system stores your name as Smith, John Armstrong) ● However, the user enters: ○ John Smith ○ John A. Smith ○ John Armstrong Smith
  • 11. Now imagine a user calls into a call center... ● User calls into a call center ● An operator hears the user’s name and transcribes it ● This is prone to errors ● The operator enters ○ Jon Smyth ○ John ArmstrongSmith ○ John Armstrong-Smith
  • 12. At The Call Center
  • 13. Suppose Other Phenomena Are Entered ● Nicknames ○ Johnny Smith ● Gender mistakes ○ Joan Smith ○ Joanie Smith ● Differences in relative name frequencies ○ Shaun Smith ● Initials, especially for organization names ○ IBM vs. International Business Machines
  • 16. If Our Users Cross National Borders 문재인 vs. Moon Jae-in or 安倍 晋三 vs. Shinzo Abe
  • 19. A Challenge Fulfilled =문재인 Moon Jae-in in Hangul Hangul: Korean Alphabet 文在寅 Moon Jae-in in Hanja Hanja: Chinese characters with Korean pronunciation
  • 20. Conclusions 1. Matching text strings is straightforward; matching names is not. 2. Text strings comprised of different characters may have the same social or cultural meaning.
  • 21. Conclusions 3. The situation gets more complex when combining a name with other information, such as date of birth or address. These types of data also have multiple formats and are prone to transcription errors.
  • 22. Finally... What is your neighbor’s name?