 

An Open-source Similar-name Finder


An Open-source Similar-name Finder presented by Dallan Quass at RootsTech 2012

An improvement on Soundex


  1. An Open-source Similar-name Finder. Dallan Quass
  2. What's the problem?
  3. People can't spell unusual names
      • Maybe a piece of mail addressed to Solverg Quast?
      • Envelope text: Solverg Quast, 5934 Phoenix Ave., Shoreview, MN 55126; Johnston Bros., 1256 Bristol St., Mapleton, MN 55126
      • Should be: Solveig Quass
  4. People use nicknames: John, Johnny, Jack
  5. Transcribers make typos: Jhon
  6. Most of our ancestors didn't know how to read or write [signature]
  7. What does it matter?
  8. How do you find records? Johnny Snith vs. John Smith
  9. How do you match people? John Smith vs. Johnny Smithe
  10. Not a new problem
  11. Lots of solutions
      • Soundex, NYSIIS, Double Metaphone, Refined Soundex, Daitch-Mokotoff, Caverphone
      • Levenshtein, Jaro-Winkler, Monge-Elkan, Needleman-Wunsch, Smith-Waterman
  12. No Bullseye
  13. Why is this so hard?
  14. How similar are two names?
      • Names: John, Jonny, Joe
      • "We’re neighbors"
      • "I don’t know those guys"
  15. First approach: Coders
      • Soundex, NYSIIS, Double Metaphone, Refined Soundex, Daitch-Mokotoff, Caverphone
      • General approach
      • Combine repeated letters
      • Remove vowels (except maybe for leading)
      • Unite similar-sounding letters
  16. First approach: Coders
      • Example names: Jim, John, Jane, Johan, Johannes
  17. Second approach: Distance functions
      • Levenshtein, Jaro-Winkler, Monge-Elkan, Needleman-Wunsch, Smith-Waterman
      • General approach
      • Align sequences of letters
      • Score based upon the number of matches, transpositions, differences
      • Monge-Elkan considers similar-sounding letters
  18. Second approach: Distance functions
      • Example names: Jim, John, Jane, Johan, Johannes
      • Better results, but doesn't scale well
  19. Can we do better?
  20. Warning: Machine learning ahead!
  21. Thank you Ancestry!
      • Ancestry.com paid someone to label 100,000 pairs of names
      • Name pairs were drawn from actual matching records at Ancestry
      • Labeled name pairs have been made freely available
  22. A closer look at Levenshtein
      • Jon, John, Bohn (diagram labels: -1, -1)
  23. Maximize your expectations
      • Expectation Maximization Algorithm
      • Expectation step: calculate the expected value of a function
      • Maximization step: find parameters that maximize the expected value
      • Iterate until convergence
      • Weighted Edit Distance diagram: Jon, John, Bohn (labels: high cost, low cost)
  24. Learn to classify
      • Positive and negative examples
      • Features
          • Coders
          • Distance functions
          • Weighted edit distance
      • Learn weights
          • Several algorithms to choose from
      • Results in a vector
      • Threshold separates matches from non-matches
  25. Wait, isn't this just another distance function? Distance functions don't scale, right?
  26. Right
  27. Back to the basics
      • Table of x → f(x): -5 → -1; -3 → 4.5; 0 → 7; 2 → 3.5; 4 → 2
  28. Long tail
  29. Long tail
      • 200,000 Surnames
      • 70,000 Given names
      • ≤ 1/5,000,000 names
  30. Long tail
      • Use distance function with table here
      • Use coder here
  31. Result: Table initialized by a function
      • Dallan: Dalana, Daleen, Dalen, Dalin, …, Talan, Tallon
      • Ryan: Aaran, Aran, Arrin, …, Rian, Riana, ...
  32. A nice thing about tables...
      • Dallan: Dalana, Daleen, Dalen, Dalin, …, Talan, Tallon
      • Ryan: Aaran, Aran, Arrin, …, Rian, Riana, ...
  33. Add to the table
      • Nicknames
      • BehindTheName.com
      • The New American Dictionary of Baby Names by Leslie Dunkling and William Gosling
      • A Dictionary of Surnames by Patrick Hanks and Flavia Hodges
      • WeRelate community
  34. Thank you BehindTheName.com!
      • Fascinating Family Trees for given names
  35. Result
      • Given names: Soundex vs. our approach, precision and recall; 28% decrease in false negatives
      • Surnames: Soundex vs. our approach, precision and recall; 28% decrease in false negatives
      • Chart values, in transcript order: 97, 65, 97, 74, 89, 68, 89, 77
  36. Who is using it?
  37. WeRelate.org
  38. Continuous improvement
  39. Continuous improvement
  40. Community oversight
  41. How do I use it?
      • Source code and table available on GitHub:
          • http://github.com/DallanQ/Names
      • Search
          • Normalize
          • Index
          • Search
      • Score
      • Eval
      • Service
  42. Roadmap
      • Jan 2011 Open-source project created
      • Jan 2011 Implemented at WeRelate
      • Feb 2011 Announce at RootsTech
      • Continued improvements
  43. Future work
  44. Future work
      • Reduce the number of name variants to look up
      • Look up multiple codes
          • Refined Soundex?
      • Cluster names
          • Mahout?
      • Remove “chaff” variants from common names
  45. Conclusion
      • Thank you Ancestry.com and BehindTheName.com!!!
      • Identifying name variants is hard
      • But getting it right is pretty important
          • Names are at the core of genealogical research
      • Open source algorithm is now freely available
          • http://github.com/DallanQ/Names
          • 28% reduction in false negatives compared to Soundex
          • Continuous improvement
      • Hopefully others will benefit from this effort
          • Goal is to improve genealogical searches across the Web
      • Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license
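
Slide 15 gives the general recipe behind coders: combine repeated letters, remove vowels except a leading one, and unite similar-sounding letters. The sketch below follows that recipe literally. The function name `simple_code` and the letter groups (borrowed from the familiar Soundex grouping) are assumptions made for illustration; this is not one of the named coders and not the project's code.

```python
# Illustrative letter groups (assumption: the classic Soundex grouping).
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
LETTER_TO_CLASS = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def simple_code(name: str) -> str:
    """Hypothetical coder that applies the three steps listed on slide 15."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    # Step 1: combine repeated letters ("NN" becomes "N").
    collapsed = [letters[0]]
    for ch in letters[1:]:
        if ch != collapsed[-1]:
            collapsed.append(ch)
    # Step 2: keep the leading letter, drop every later letter that has no
    # consonant class (vowels, plus H, W and Y in this sketch).
    first, rest = collapsed[0], collapsed[1:]
    kept = [ch for ch in rest if ch in LETTER_TO_CLASS]
    # Step 3: unite similar-sounding letters by mapping them to one class.
    return first + "".join(LETTER_TO_CLASS[ch] for ch in kept)


for name in ["Jim", "John", "Jane", "Johan", "Johannes"]:
    print(name, simple_code(name))
```

Under this sketch Jim, John, Jane, and Johan all receive the same code, which shows how coarse a coder can be: lookups are cheap because a code can be indexed, but quite different names collide.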
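
Slides 17 and 18 describe distance functions: align two letter sequences and score the differences. As a concrete case, here is a minimal sketch of plain Levenshtein distance, the first function named on slide 17. The helper `rank_by_distance` is a hypothetical addition that shows why the approach scales poorly: every query has to be scored against every candidate name.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions, or
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def rank_by_distance(query: str, candidates: list[str]) -> list[str]:
    """Brute-force ranking: one distance computation per candidate."""
    return sorted(candidates, key=lambda n: levenshtein(query.lower(), n.lower()))


print(levenshtein("Jon", "John"), levenshtein("John", "Bohn"))
print(rank_by_distance("John", ["Johannes", "Jon", "Jim"]))
```

Plain Levenshtein gives Jon/John and John/Bohn the same distance of one edit, which is the weakness slide 22 points at.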
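
Slides 22 and 23 contrast plain Levenshtein with a weighted edit distance, where some edits cost more than others. The sketch below shows only the weighted distance itself, assuming the costs are already known. The cost values are hand-picked for illustration, not learned with Expectation Maximization, and the reading that Jon/John is the low-cost pair and John/Bohn the high-cost pair is an assumption about the slide's diagram.

```python
def weighted_edit_distance(a, b, sub_costs, indel_costs, default_cost=1.0):
    """Edit distance in which each insertion, deletion, or substitution
    has its own cost. Costs not listed fall back to default_cost."""
    a, b = a.lower(), b.lower()
    # First row: build b from nothing by inserting its letters.
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel_costs.get(cb, default_cost))
    for ca in a:
        curr = [prev[0] + indel_costs.get(ca, default_cost)]
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else sub_costs.get((ca, cb), default_cost)
            curr.append(min(
                prev[j] + indel_costs.get(ca, default_cost),      # delete ca
                curr[j - 1] + indel_costs.get(cb, default_cost),  # insert cb
                prev[j - 1] + sub,                                # substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical costs: a silent "h" is cheap to insert or delete.
INDEL_COSTS = {"h": 0.2}
SUB_COSTS = {}  # every substitution uses the default cost in this sketch

print(weighted_edit_distance("Jon", "John", SUB_COSTS, INDEL_COSTS))
print(weighted_edit_distance("John", "Bohn", SUB_COSTS, INDEL_COSTS))
```

With these illustrative costs Jon/John scores lower than John/Bohn, whereas plain Levenshtein cannot tell the two pairs apart.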
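
Slide 24 outlines the classifier: labeled positive and negative pairs, a feature vector per pair, learned weights, and a threshold that separates matches from non-matches. The slide says there are several learning algorithms to choose from and does not name the one used, so the sketch below picks a perceptron purely as one simple option. The two features and all training values are hypothetical stand-ins, not the Ancestry data.

```python
# Hypothetical training data. Each example is (features, label) where
# features = (same_code, similarity) and label is +1 (match) or -1 (non-match).
TRAINING = [
    ((1.0, 0.9), +1),
    ((1.0, 0.8), +1),
    ((0.0, 0.9), +1),
    ((1.0, 0.2), -1),
    ((0.0, 0.3), -1),
    ((0.0, 0.1), -1),
]


def score(weights, features):
    """Weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features))


def train_perceptron(examples, max_epochs=1000):
    """Learn a weight vector and a threshold with the perceptron rule."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in examples:
            if label * (score(weights, features) + bias) <= 0:
                # Misclassified: nudge the weights toward the correct side.
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
                mistakes += 1
        if mistakes == 0:
            break  # every training example is on the correct side
    return weights, -bias  # a score above -bias counts as a match


def is_match(weights, threshold, features):
    return score(weights, features) > threshold


WEIGHTS, THRESHOLD = train_perceptron(TRAINING)
print(WEIGHTS, THRESHOLD)
```

The learned weight vector and threshold together form the scoring function: a new name pair is converted to features, scored, and compared against the threshold.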
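
Slides 27 through 32 describe the key idea: a function can be stored as a table, so the expensive scoring function is run ahead of time over the common names, and a cheap coder handles the long tail of rare names. The sketch below is a hypothetical illustration of that split. The names `build_table`, `rare_name_code`, and `names_match` and the stand-in similarity test are assumptions, not the API of the project on GitHub.

```python
from itertools import combinations


def build_table(common_names, is_similar):
    """Precompute variants for common names by running the (expensive)
    similarity function once per pair, ahead of search time."""
    table = {name: set() for name in common_names}
    for a, b in combinations(common_names, 2):
        if is_similar(a, b):
            table[a].add(b)
            table[b].add(a)
    return table


def rare_name_code(name):
    """Stand-in coder for names outside the table: keep the first letter,
    then drop vowels and h, w, y from the rest."""
    name = name.lower()
    return name[:1] + "".join(c for c in name[1:] if c not in "aeiouhwy")


def names_match(a, b, table):
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    if a in table or b in table:
        # Common name: a set lookup, no distance computation at search time.
        return b in table.get(a, ()) or a in table.get(b, ())
    # Long tail: fall back to comparing codes.
    return rare_name_code(a) == rare_name_code(b)


# Toy similarity test (assumption): same first letter.
table = build_table(["dallan", "dalen", "ryan", "rian"], lambda a, b: a[0] == b[0])
print(names_match("Dallan", "Dalen", table))
print(names_match("Solveig", "Solvieg", table))

# A nice thing about tables: a variant can be added by hand.
table["dallan"].add("talan")
print(names_match("Dallan", "Talan", table))
```

Because the table is plain data, variants from nickname lists, name dictionaries, or community edits can be added or removed without touching the scoring function.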