Approximate string comparators

A quick overview of some common approximate string comparators used in record linkage.

1. Approximate string comparators
TvungenOne, 2012-06-15
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga
2. Approximate string comparators?
• Basically, measures of the similarity between two strings
• Useful in situations where exact match is insufficient
  – record linkage
  – search
  – ...
• Many of these are slow: O(n²)
3. Levenshtein
• Also known as edit distance
• Measures the number of edit operations necessary to turn s1 into s2
• Edit operations are
  – insert a character
  – remove a character
  – substitute a character
4. Levenshtein example
• Levenshtein -> Löwenstein
  – Levenstein (remove 'h')
  – Lövenstein (substitute 'ö')
  – Löwenstein (substitute 'w')
• Edit distance = 3
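
As a rough illustration of edit distance, here is a minimal Python sketch of the usual dynamic-programming approach. The function name `levenshtein` is chosen for this example only. It keeps two rows of the table at a time, but the running time is still proportional to len(s1) × len(s2).

```python
def levenshtein(s1: str, s2: str) -> int:
    """Number of inserts, removals and substitutions needed to turn s1 into s2."""
    # previous[j] = distance between the first i-1 characters of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into "" takes i removals
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # remove c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levenshtein", "Löwenstein"))  # 3
```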
5. Weighted Levenshtein
• Not all edit operations are equal
• Substituting “i” for “e” is a smaller edit than substituting “o” for “k”
• Weighted Levenshtein evaluates each edit operation as a number 0.0-1.0
• Difficult to implement
  – weights are also language-dependent
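
One way to picture the weighted variant is to pass the substitution cost in as a function. The sketch below does that; the weights in `example_cost` (0.3 for vowel-to-vowel swaps, 1.0 otherwise) are invented for illustration and are not taken from any real language model.

```python
from typing import Callable

VOWELS = set("aeiouy")


def example_cost(a: str, b: str) -> float:
    """Hypothetical weights: swapping one vowel for another is a small edit."""
    if a in VOWELS and b in VOWELS:
        return 0.3
    return 1.0


def weighted_levenshtein(s1: str, s2: str,
                         substitution_cost: Callable[[str, str], float],
                         indel_cost: float = 1.0) -> float:
    """Edit distance where each substitution is weighted between 0.0 and 1.0."""
    previous = [j * indel_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i * indel_cost]
        for j, c2 in enumerate(s2, start=1):
            sub = 0.0 if c1 == c2 else substitution_cost(c1, c2)
            current.append(min(
                previous[j] + indel_cost,     # remove c1
                current[j - 1] + indel_cost,  # insert c2
                previous[j - 1] + sub,        # weighted substitution
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("bit", "bet", example_cost))  # small: vowel swap
print(weighted_levenshtein("bot", "bkt", example_cost))  # larger: vowel to consonant
```

The hard part in practice is choosing the weight table, not the algorithm itself.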
6. Jaro-Winkler
• Developed at the US Bureau of the Census
• For name comparisons
  – not well suited to long strings
  – best if given name/surname are separated
• Exists in a few variants
  – originally proposed by Jaro
  – then modified by Winkler
  – a few different versions of modifications etc
7. Jaro-Winkler definition
• Formula:
  – m = number of matching characters
  – t = number of transposed characters
• A character from string s1 matches s2 if the same character is found in s2 less than half the length of the string away
• Levenshtein ~ Löwenstein = 0.8
• Axel ~ Aksel = 0.783
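
A minimal sketch of one commonly cited formulation follows. It assumes a match window of floor(max(len(s1), len(s2)) / 2) - 1 and counts t as half the number of matched characters that appear in a different order. The Winkler step uses a prefix scale of 0.1 over at most 4 leading characters. These conventions are assumptions of this example, and other variants will give somewhat different scores.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1], using the conventions stated above."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare the matched characters in order; each out-of-order pair
    # counts as half a transposition.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("Axel", "Aksel"), 3))  # 0.783
```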
8. Jaro-Winkler variant
9. Soundex
• A coarse scheme for matching names by sound
  – produces a key from the name
  – names match if key is the same
• In common use in many places
  – Nav's person register uses it for search
  – built-in in many databases
  – ...
10. Soundex definition
11. Examples
• soundex(“Axel”) = 'A240'
• soundex(“Aksel”) = 'A240'
• soundex(“Levenshtein”) = 'L152'
• soundex(“Löwenstein”) = 'L523'
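
For reference, here is a minimal sketch of the widely used American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse runs of the same digit, and pad or cut to four characters. Treating 'H' and 'W' as transparent and treating any other unmapped letter (including 'ö') like a vowel are assumptions of this example. Database implementations may differ in such details.

```python
# Digit codes for consonant groups; vowels and other letters have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Four-character Soundex key, e.g. 'R163' for 'Robert'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


print(soundex("Axel"), soundex("Aksel"))  # A240 A240
```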
12. Metaphone
• Developed by Lawrence Philips
• Similar to Soundex, but much more complex
  – both more accurate and more sensitive
• Developed further into Double Metaphone
• Metaphone 3.0 also exists, but only available commercially
13. Metaphone examples
• metaphone(“Axel”) = 'AKSL'
• metaphone(“Aksel”) = 'AKSL'
• metaphone(“Levenshtein”) = 'LFNX'
• metaphone(“Löwenstein”) = 'LWNS'
14. Dice coefficient
• A similarity measure for sets
  – set can be tokens in a string
  – or characters in a string
• Formula: 2 × |A ∩ B| / (|A| + |B|), for sets A and B
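
A minimal sketch of the coefficient, applied once to token sets and once to character sets. The helper names and the choice to return 1.0 for two empty sets are assumptions of this example.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient: twice the overlap divided by the sum of the set sizes."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def dice_tokens(s1: str, s2: str) -> float:
    """Compare strings as sets of whitespace-separated, lower-cased tokens."""
    return dice(set(s1.lower().split()), set(s2.lower().split()))


def dice_chars(s1: str, s2: str) -> float:
    """Compare strings as sets of lower-cased characters."""
    return dice(set(s1.lower()), set(s2.lower()))


print(dice_tokens("Lars Marius Garshol", "Lars Garshol"))  # 0.8
print(round(dice_chars("Axel", "Aksel"), 3))               # 0.667
```

Because sets ignore order and repetition, character sets in particular are a very coarse comparison: anagrams score 1.0.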
15. TFIDF
• Compares strings as sets of tokens
  – a la Dice coefficient
• However, takes frequency of tokens in corpus into account
  – this matches how we evaluate matches mentally
• Has done well in evaluations
  – however, can be difficult to evaluate
  – results will change as corpus changes
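
To make the corpus dependence concrete, here is a small sketch that weights tokens by a smoothed inverse document frequency and compares the weighted vectors by cosine similarity. The smoothing formula is one common variant, and the four-name corpus is made up for this example. Both are assumptions, not something the slides specify.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Return (idf per token, idf for unseen tokens) for a list of strings."""
    n = len(corpus)
    df = Counter()
    for text in corpus:
        df.update(set(text.lower().split()))  # count each token once per string
    idf = {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}
    unseen = math.log(1 + n) + 1.0  # tokens not in the corpus are treated as rare
    return idf, unseen


def tfidf_vector(text, idf, unseen):
    """Map each token to term frequency times inverse document frequency."""
    tf = Counter(text.lower().split())
    return {tok: count * idf.get(tok, unseen) for tok, count in tf.items()}


def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v2.get(tok, 0.0) for tok, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def tfidf_similarity(s1, s2, idf, unseen):
    return cosine(tfidf_vector(s1, idf, unseen), tfidf_vector(s2, idf, unseen))


# Hypothetical corpus: "ltd" is frequent, so sharing it says little.
corpus = ["acme ltd", "bolt ltd", "crane ltd", "acme holdings"]
idf, unseen = build_idf(corpus)
common = tfidf_similarity("acme ltd", "bolt ltd", idf, unseen)     # share "ltd"
rarer = tfidf_similarity("acme ltd", "acme holdings", idf, unseen)  # share "acme"
print(common < rarer)  # True
```

Adding or removing strings from `corpus` changes the weights, which is why scores shift as the corpus changes.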
16. More comparators
• Smith-Waterman
  – originated in DNA sequencing
• Q-grams distance
  – breaks string into sets of pieces of q characters
  – then does set similarity comparison
• Monge-Elkan
  – similar to Smith-Waterman, but with affine gap distances
  – has done very well in evaluations
  – costly to evaluate
• Many, many more
  – ...
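
The q-gram idea can be sketched in a few lines. The slide does not say which set comparison is meant, so the use of Jaccard overlap here is an assumption, as are the function names and the decision not to pad the string ends.

```python
def qgrams(s: str, q: int = 2) -> set:
    """Set of all substrings of length q (the whole string if it is shorter)."""
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets: |A ∩ B| / |A ∪ B|."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        return 1.0  # convention chosen here for two empty strings
    return len(a & b) / len(a | b)


# "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}: one shared of seven
print(round(qgram_similarity("night", "nacht"), 3))  # 0.143
```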