Approximate string comparators

A quick overview of some common approximate string comparators used in record linkage.


Transcript

  • 1. Approximate string comparators
      TvungenOne, 2012-06-15
      Lars Marius Garshol, <larsga@bouvet.no>
      http://twitter.com/larsga
  • 2. Approximate string comparators?
      • Basically, measures of the similarity between two strings
      • Useful in situations where exact match is insufficient
          – record linkage
          – search
          – ...
      • Many of these are slow: O(n²)
  • 3. Levenshtein
      • Also known as edit distance
      • Measures the number of edit operations necessary to turn s1 into s2
      • Edit operations are
          – insert a character
          – remove a character
          – substitute a character
  • 4. Levenshtein example
      • Levenshtein -> Löwenstein
          – Levenstein (remove 'h')
          – Lövenstein (substitute 'ö')
          – Löwenstein (substitute 'w')
      • Edit distance = 3
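The edit distance from the example above can be computed with the usual dynamic-programming table. The following is a minimal sketch, not code from the slides; the function name `levenshtein` is chosen here for illustration, and it assumes both strings use precomposed characters (so 'ö' is a single character).

```python
def levenshtein(s1: str, s2: str) -> int:
    """Count the inserts, removals and substitutions needed to turn s1 into s2."""
    # previous[j] is the distance between the prefix of s1 handled so far
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into an empty string takes i removals
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # remove c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute c1 with c2, or keep it
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levenshtein", "Löwenstein"))  # 3
```

The two nested loops over the characters of both strings are where the O(n²) cost comes from.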
  • 5. Weighted Levenshtein
      • Not all edit operations are equal
      • Substituting "i" for "e" is a smaller edit than substituting "o" for "k"
      • Weighted Levenshtein assigns each edit operation a cost between 0.0 and 1.0
      • Difficult to implement
          – weights are also language-dependent
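One way to picture the weighting is to swap the fixed substitution cost for a lookup table. This is only a sketch under assumptions: the table `SUBSTITUTION_COST` and its values are hypothetical, made up for illustration, and inserts and removals are kept at a flat cost of 1.0.

```python
# Hypothetical weights: pairs judged "close" are cheaper to substitute.
# Real weights would have to be tuned per language.
SUBSTITUTION_COST = {
    frozenset("ie"): 0.3,
    frozenset("sz"): 0.2,
}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing a with b: 0.0 if equal, a table weight, else 1.0."""
    if a == b:
        return 0.0
    return SUBSTITUTION_COST.get(frozenset((a, b)), 1.0)


def weighted_levenshtein(s1: str, s2: str, edit_cost: float = 1.0) -> float:
    """Edit distance where substitutions are weighted by substitution_cost."""
    previous = [j * edit_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i * edit_cost]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + edit_cost,                       # remove c1
                current[j - 1] + edit_cost,                    # insert c2
                previous[j - 1] + substitution_cost(c1, c2),   # weighted substitute
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("bit", "bet"))  # cheap: "i" vs "e" is in the table
print(weighted_levenshtein("bot", "bkt"))  # full cost: "o" vs "k" is not
```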
  • 6. Jaro-Winkler
      • Developed at the US Bureau of the Census
      • For name comparisons
          – not well suited to long strings
          – best if given name/surname are separated
      • Exists in a few variants
          – originally proposed by Jaro
          – then modified by Winkler
          – a few different versions of modifications etc
  • 7. Jaro-Winkler definition
      • The formula is defined in terms of
          – m = number of matching characters
          – t = number of transposed characters
      • A character from string s1 matches s2 if the same character is found in s2 less than half the length of the string away
      • Levenshtein ~ Löwenstein = 0.8
      • Axel ~ Aksel = 0.783
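The definition above can be sketched in code. This follows the commonly published form of the measure, which may differ in detail from the variant on the slide: the match window is taken as floor(max(|s1|, |s2|) / 2) - 1, t is counted as half the number of matched characters that are out of order, and the Jaro similarity is (m/|s1| + m/|s2| + (m - t)/m) / 3. The Winkler step then adds a bonus for a shared prefix of up to 4 characters with scaling factor 0.1. All function and parameter names are chosen here for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between 0.0 (nothing in common) and 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matched characters, each sequence in its own string's order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("Axel", "Aksel"), 3))  # 0.783
```

Under these assumptions the plain Jaro function reproduces the Axel ~ Aksel value quoted above.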
  • 8. Jaro-Winkler variant
  • 9. Soundex
      • A coarse scheme for matching names by sound
          – produces a key from the name
          – names match if the key is the same
      • In common use in many places
          – Nav's person register uses it for search
          – built into many databases
          – ...
  • 10. Soundex definition
  • 11. Examples
      • soundex("Axel") = 'A240'
      • soundex("Aksel") = 'A240'
      • soundex("Levenshtein") = 'L152'
      • soundex("Löwenstein") = 'L523'
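The keys above can be reproduced with a short sketch of the standard American Soundex rules: keep the first letter, map the remaining consonants to digits, skip vowels, collapse adjacent letters with the same code, and pad or truncate to four characters. Treating non-ASCII letters such as 'ö' like vowels is an assumption made here, not something the slides specify.

```python
# Letter groups that share a Soundex digit.
_GROUPS = ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
_CODES = {letter: str(digit)
          for digit, letters in enumerate(_GROUPS, start=1)
          for letter in letters}


def soundex(name: str) -> str:
    """Four-character Soundex key, or an empty string if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code:
            if code != previous:   # collapse runs of the same code
                key += code
            previous = code
        elif c not in "HW":
            previous = ""          # vowels (and unknown letters) break a run
    return (key + "000")[:4]       # pad with zeros, keep four characters


for name in ["Axel", "Aksel", "Levenshtein", "Löwenstein"]:
    print(name, soundex(name))
```

"Axel" and "Aksel" collapse to the same key, while "Levenshtein" and "Löwenstein" do not, because "v" is coded and "w" is ignored.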
  • 12. Metaphone
      • Developed by Lawrence Philips
      • Similar to Soundex, but much more complex
          – both more accurate and more sensitive
      • Developed further into Double Metaphone
      • Metaphone 3.0 also exists, but only available commercially
  • 13. Metaphone examples
      • metaphone("Axel") = 'AKSL'
      • metaphone("Aksel") = 'AKSL'
      • metaphone("Levenshtein") = 'LFNX'
      • metaphone("Löwenstein") = 'LWNS'
  • 14. Dice coefficient
      • A similarity measure for sets
          – set can be tokens in a string
          – or characters in a string
      • Formula: 2 · |A ∩ B| / (|A| + |B|)
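As a small illustration of the set-based view, the sketch below computes the standard Dice coefficient, 2·|A ∩ B| / (|A| + |B|), once over tokens and once over characters. The function name and the choice to score two empty sets as 1.0 are assumptions made for this example.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient of two sets: 0.0 for disjoint sets, 1.0 for equal sets."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(a & b) / (len(a) + len(b))


# Sets of tokens
print(dice(set("lars marius garshol".split()), set("lars garshol".split())))  # 0.8

# Sets of characters
print(round(dice(set("Axel"), set("Aksel")), 3))  # 0.667
```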
  • 15. TFIDF
      • Compares strings as sets of tokens
          – à la Dice coefficient
      • However, takes frequency of tokens in corpus into account
          – this matches how we evaluate matches mentally
      • Has done well in evaluations
          – however, can be difficult to evaluate
          – results will change as corpus changes
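To make the effect of token frequency concrete, here is a rough sketch of one common formulation: TF-IDF weighted token vectors compared by cosine similarity. The smoothed IDF, log((1 + N) / (1 + df)) + 1, is an assumption chosen so that tokens missing from the corpus still get a finite weight; the slides do not say which weighting was used, and the tiny corpus is invented for illustration.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Return a function giving the smoothed inverse document frequency of a token."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))  # count each token once per document

    def idf(token):
        return math.log((1 + n) / (1 + df[token])) + 1

    return idf


def tfidf_vector(text, idf):
    """Map each token in text to term frequency times IDF."""
    counts = Counter(text.lower().split())
    return {token: tf * idf(token) for token, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def tfidf_similarity(a, b, idf):
    return cosine(tfidf_vector(a, idf), tfidf_vector(b, idf))


# Hypothetical corpus: "inc" is common, "acme" is rare.
idf = build_idf(["acme inc", "bolt inc", "cargo inc"])
print(tfidf_similarity("acme inc", "acme co", idf))    # shares the rare token
print(tfidf_similarity("bolt inc", "cargo inc", idf))  # shares only the common token
```

The first pair scores higher than the second, since agreeing on a rare token counts for more than agreeing on one that appears everywhere. Adding documents to the corpus changes the weights, which is why results shift as the corpus changes.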
  • 16. More comparators
      • Smith-Waterman
          – originated in DNA sequencing
      • Q-grams distance
          – breaks string into sets of pieces of q characters
          – then does set similarity comparison
      • Monge-Elkan
          – similar to Smith-Waterman, but with affine gap distances
          – has done very well in evaluations
          – costly to evaluate
      • Many, many more
          – ...
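Of the comparators listed here, the q-gram approach is the simplest to sketch: split each string into overlapping pieces of q characters, then compare the resulting sets. The example below is one possible reading, not a reference implementation; it reports a similarity rather than a distance, uses the Dice coefficient as the set comparison, and does not pad the string ends. All three choices are assumptions.

```python
def qgrams(s: str, q: int = 2) -> set:
    """Set of all overlapping substrings of length q (empty if s is shorter than q)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    """Dice coefficient over the q-gram sets of two strings."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    if not g1 and not g2:
        # Both strings are shorter than q; fall back to exact comparison.
        return 1.0 if s1 == s2 else 0.0
    return 2 * len(g1 & g2) / (len(g1) + len(g2))


print(sorted(qgrams("night")))             # ['gh', 'ht', 'ig', 'ni']
print(qgram_similarity("night", "nacht"))  # 0.25
```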