Approximate string comparators

TvungenOne, 2012-06-15
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga




Approximate string comparators?

• Basically, measures of the similarity between
  two strings
• Useful in situations where exact match is
  insufficient
    – record linkage
    – search
    – ...
• Many of these are slow: O(n²)



Levenshtein

• Also known as edit distance
• Measures the number of edit operations
  necessary to turn s1 into s2
• Edit operations are
    – insert a character
    – remove a character
    – substitute a character




Levenshtein example

• Levenshtein -> Löwenstein
    – Levenstein (remove 'h')
    – Lövenstein (substitute 'ö' for 'e')
    – Löwenstein (substitute 'w' for 'v')
• Edit distance = 3
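
The edit distance described above can be computed with the usual dynamic-programming table. The following is a minimal sketch, assuming all three edit operations cost 1; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Count the inserts, removals and substitutions needed to turn s1 into s2."""
    # Row for the empty prefix of s1: reaching s2[:j] takes j inserts.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        # Reaching the empty prefix of s2 from s1[:i] takes i removals.
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # remove c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levenshtein", "Löwenstein"))  # prints 3
```

The two nested loops over the characters of s1 and s2 are where the quadratic cost comes from.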




Weighted Levenshtein

• Not all edit operations are equal
• Substituting “i” for “e” is a smaller edit than
  substituting “o” for “k”
• Weighted Levenshtein evaluates each edit
  operation as a number 0.0-1.0
• Difficult to implement
    – weights are also language-dependent
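
One way to sketch the weighted variant is to pass the substitution cost in as a function. The code below is an illustration only: `toy_cost` and its weights (0.3 for vowel-for-vowel, 1.0 otherwise) are made-up assumptions, not values from the talk, and inserts and removals are assumed to cost 1.0.

```python
from typing import Callable


def weighted_levenshtein(
    s1: str,
    s2: str,
    sub_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Edit distance where each substitution is priced by sub_cost(a, b)."""
    previous = [j * indel_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i * indel_cost]
        for j, c2 in enumerate(s2, start=1):
            substitution = 0.0 if c1 == c2 else sub_cost(c1, c2)
            current.append(min(
                previous[j] + indel_cost,      # remove c1
                current[j - 1] + indel_cost,   # insert c2
                previous[j - 1] + substitution,
            ))
        previous = current
    return previous[-1]


VOWELS = set("aeiouy")


def toy_cost(a: str, b: str) -> float:
    # Hypothetical weights: swapping one vowel for another is a "small" edit.
    return 0.3 if a in VOWELS and b in VOWELS else 1.0


print(weighted_levenshtein("pin", "pen", toy_cost))  # vowel swap, cost 0.3
print(weighted_levenshtein("cot", "ckt", toy_cost))  # 'o' -> 'k', cost 1.0
```

Real weights would have to be tuned per language, which is the hard part noted above.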




Jaro-Winkler

• Developed at the US Bureau of the Census
• For name comparisons
    – not well suited to long strings
    – best if given name/surname are separated
• Exists in a few variants
    – originally proposed by Jaro
    – then modified by Winkler
    – a few different versions of modifications etc



Jaro-Winkler definition

• Formula: sim = (m/|s1| + m/|s2| + (m − t/2)/m) / 3
    – m = number of matching characters
    – t = number of transposed characters
• A character from string s1 matches s2 if the
  same character is found in s2 less than half the
  length of the string away
• Levenshtein ~ Löwenstein = 0.8
• Axel ~ Aksel = 0.783
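
A minimal sketch of the comparison, under assumptions that the slide text does not confirm: the match window is taken as floor(max(|s1|, |s2|) / 2) − 1, t counts matched characters that appear in a different order (halved in the formula), and the Winkler step adds a bonus for a shared prefix of up to 4 characters with scaling factor p = 0.1. Published variants differ in these details.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; window size and t/2 are assumptions."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matched characters in their order of appearance in each string.
    seq1 = [c for c, ok in zip(s1, matched1) if ok]
    seq2 = [c for c, ok in zip(s2, matched2) if ok]
    t = sum(a != b for a, b in zip(seq1, seq2))  # out-of-order matches
    return (m / len(s1) + m / len(s2) + (m - t / 2) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for a common prefix (Winkler's modification)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


print(round(jaro("Axel", "Aksel"), 3))  # prints 0.783
```

With these choices the plain Jaro score for Axel and Aksel comes out at 0.783; `jaro_winkler` scores the same pair slightly higher because of the shared "A".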


Jaro-Winkler variant




Soundex

• A coarse scheme for matching names by sound
    – produces a key from the name
    – names match if key is the same
• In common use in many places
    – Nav's person register uses it for search
    – built into many databases
    – ...




Soundex definition




Examples

•    soundex(“Axel”) = 'A240'
•    soundex(“Aksel”) = 'A240'
•    soundex(“Levenshtein”) = 'L152'
•    soundex(“Löwenstein”) = 'L523'
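
The text of the definition slide did not survive, so the sketch below assumes the common American Soundex rules: keep the first letter, map the remaining consonants to digits, skip vowels, collapse adjacent equal codes (with 'h' and 'w' not breaking a run), and pad or cut to four characters. Treating non-ASCII letters such as 'ö' like vowels is a further assumption made here.

```python
# Consonant groups of American Soundex; letters not listed get no code.
CODES = {}
for letters, digit in [
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex key, or '' for input without letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
            if len(key) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return key.ljust(4, "0")


print(soundex("Axel"), soundex("Aksel"))  # prints A240 A240
```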




Metaphone

• Developed by Lawrence Philips
• Similar to Soundex, but much more complex
     – both more accurate and more sensitive
• Developed further into Double Metaphone
• Metaphone 3.0 also exists, but is only available
  commercially




Metaphone examples

•    metaphone(“Axel”) = 'AKSL'
•    metaphone(“Aksel”) = 'AKSL'
•    metaphone(“Levenshtein”) = 'LFNX'
•    metaphone(“Löwenstein”) = 'LWNS'




Dice coefficient

• A similarity measure for sets
     – set can be tokens in a string
     – or characters in a string
• Formula: 2·|A ∩ B| / (|A| + |B|)
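
As a small sketch, the coefficient is twice the size of the intersection divided by the sum of the two set sizes. The function name `dice` and the choice to score two empty sets as 1.0 are assumptions made for illustration.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(a & b) / (len(a) + len(b))


# Sets of tokens in a string.
print(dice(set("lars marius garshol".split()), set("lars garshol".split())))  # prints 0.8

# Sets of characters in a string.
print(round(dice(set("axel"), set("aksel")), 3))  # prints 0.667
```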




TF-IDF

• Compares strings as sets of tokens
     – à la Dice coefficient
• However, takes frequency of tokens in corpus
  into account
     – this matches how we evaluate matches mentally
• Has done well in evaluations
     – however, can be difficult to evaluate
     – results will change as corpus changes
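
A rough sketch of the idea, not a reference implementation: each string becomes a vector of token weights (term frequency times inverse document frequency) and two strings are compared by cosine similarity. The weighting `log(n / df) + 1`, the helper names, and the tiny company-name corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per doc."""
    n = len(docs)
    # Document frequency: in how many docs each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Assumed weighting; the +1 keeps tokens found in every doc above zero.
    idf = {tok: math.log(n / count) + 1.0 for tok, count in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


corpus = [["acme", "ltd"], ["bolt", "ltd"], ["acme", "as"], ["zenith", "ltd"]]
vectors = tfidf_vectors(corpus)

# Sharing the rarer token "acme" counts for more than sharing the common "ltd".
print(cosine(vectors[0], vectors[2]) > cosine(vectors[0], vectors[1]))  # prints True
```

Adding or removing records changes the document frequencies, which is why scores shift as the corpus changes.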



More comparators

• Smith-Waterman
     – originated in DNA sequencing
• Q-grams distance
     – breaks string into sets of pieces of q characters
     – then does set similarity comparison
• Monge-Elkan
     – similar to Smith-Waterman, but with affine gap distances
     – has done very well in evaluations
     – costly to evaluate
• Many, many more
     – ...
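
To illustrate the q-gram approach from the list above, here is a minimal sketch that breaks each string into overlapping pieces of q characters and compares the resulting sets. It returns a similarity rather than a distance, uses the Jaccard ratio as the set comparison, and does not pad the string ends; all three are choices made for this example.

```python
def qgrams(s: str, q: int = 2) -> set:
    """Set of all overlapping substrings of length q (empty if s is shorter)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    """Jaccard similarity of the two q-gram sets."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        return 1.0  # assumption: nothing to compare counts as identical
    return len(a & b) / len(a | b)


print(sorted(qgrams("aksel")))  # prints ['ak', 'el', 'ks', 'se']
print(round(qgram_similarity("axel", "aksel"), 3))  # prints 0.167
```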

