An Open-source Similar-name Finder

An Open-source Similar-name Finder presented by Dallan Quass at RootsTech 2012

An improvement on Soundex

Transcript

  • 1. An Open-source Similar-name Finder Dallan Quass [email_address]
  • 2. What's the problem?
  • 3. People can't spell unusual names
    • Maybe a piece of mail addressed to Solverg Quast?
    • Addresses on the envelope: Solverg Quast, 5934 Phoenix Ave., Shoreview, MN 55126; Johnston Bros., 1256 Bristol St., Mapleton, MN 55126
    • Should be: Solveig Quass
  • 4. People use nicknames: John, Johnny, Jack
  • 5. Transcribers make typos: Jhon
  • 6. Most of our ancestors didn't know how to read or write (image: signature)
  • 7. What does it matter?
  • 8. How do you find records? Johnny Snith vs. John Smith
  • 9. How do you match people? John Smith vs. Johnny Smithe
  • 10. Not a new problem
  • 11. Lots of solutions: Soundex, NYSIIS, Double Metaphone, Refined Soundex, Daitch-Mokotoff, Caverphone, Levenshtein, Jaro-Winkler, Monge-Elkan, Needleman-Wunsch, Smith-Waterman
  • 12. No Bullseye
  • 13. Why is this so hard?
  • 14. How similar are two names?
    • Names shown: John, Jonny, Joe
    • Captions: “We’re neighbors”; “I don’t know those guys”
  • 15. First approach: Coders (Soundex, NYSIIS, Double Metaphone, Refined Soundex, Daitch-Mokotoff, Caverphone)
    • General approach
      • Combine repeated letters
      • Remove vowels (except maybe a leading vowel)
      • Unite similar-sounding letters
  • 16. First approach: Coders (example names: Jim, John, Jane, Johan, Johannes)
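The coder recipe above (combine repeated letters, drop vowels, unite similar-sounding letters) can be illustrated with a minimal sketch of standard American Soundex. This is not taken from the talk's code base; it only shows how a coder treats the example names on the slide.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name."""
    # Letters that sound alike share one digit ("unite similar-sounding letters").
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""

    result = name[0]                  # the leading letter is always kept
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue                  # H and W do not separate equal codes
        digit = codes.get(ch, "")     # vowels (and Y) have no code
        if digit and digit != prev:   # "combine repeated letters"
            result += digit
        prev = digit                  # a vowel resets the repeat check
    return (result + "000")[:4]       # pad or truncate to 4 characters


for n in ["Jim", "John", "Jane", "Johan", "Johannes"]:
    print(n, soundex(n))
```

With this sketch, Jim, John, Jane and Johan all receive the code J500 while Johannes receives J520, so the coder groups Jim with John but separates Johan from Johannes.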
  • 17. Second approach: Distance functions (Levenshtein, Jaro-Winkler, Monge-Elkan, Needleman-Wunsch, Smith-Waterman)
    • General approach
      • Align sequences of letters
      • Score based upon the number of matches, transpositions, differences
      • Monge-Elkan considers similar-sounding letters
  • 18. Second approach: Distance functions (example names: Jim, John, Jane, Johan, Johannes)
    • Better results, but doesn't scale well
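As an illustration of the distance-function approach, here is a minimal sketch of plain Levenshtein distance, plus a lookup that has to compare the query against every known name. The `closest_names` helper and its sample name list are hypothetical and only meant to show why this approach scales poorly.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-letter insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_names(query: str, names: list[str], max_distance: int = 1) -> list[str]:
    """Hypothetical helper: one distance computation per known name."""
    return [n for n in names
            if n != query and levenshtein(query.lower(), n.lower()) <= max_distance]


names = ["Jim", "John", "Jane", "Johan", "Johannes"]
print(closest_names("John", names))
```

Because there is no code to index on, every search touches the whole name list, which is the scaling problem the slide refers to.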
  • 19. Can we do better?
  • 20. Warning: Machine learning ahead!
  • 21. Thank you Ancestry!
    • Ancestry.com paid someone to label 100,000 pairs of names
    • Name pairs were drawn from actual matching records at Ancestry
    • Labeled name pairs have been made freely available
  • 22. A closer look at Levenshtein: Jon, John, Bohn (each edit scored -1)
  • 23. Maximize your expectations
    • Expectation Maximization Algorithm
    • Expectation step: calculate the expected value of a function
    • Maximization step: find parameters that maximize the expected value
    • Iterate until convergence
    • Weighted Edit Distance: Jon, John, Bohn (edits labeled high cost and low cost)
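A weighted edit distance replaces the uniform cost of 1 per edit with per-letter costs. The sketch below uses hand-set, hypothetical costs (a cheap insertion or deletion of "h", cheaper vowel-for-vowel substitutions) purely for illustration; in the talk the costs are learned with Expectation Maximization from labeled name pairs, which this sketch does not attempt.

```python
VOWELS = set("aeiou")
# Hypothetical costs for illustration only; not the learned values from the talk.
CHEAP_INDELS = {"h": 0.2}


def indel_cost(ch: str) -> float:
    """Cost of inserting or deleting one letter."""
    return CHEAP_INDELS.get(ch, 1.0)


def sub_cost(x: str, y: str) -> float:
    """Cost of replacing letter x with letter y."""
    if x == y:
        return 0.0
    if x in VOWELS and y in VOWELS:
        return 0.5
    return 1.0


def weighted_edit_distance(a: str, b: str) -> float:
    a, b = a.lower(), b.lower()
    prev = [0.0]
    for cb in b:                              # cost of building b[:j] from ""
        prev.append(prev[-1] + indel_cost(cb))
    for ca in a:
        curr = [prev[0] + indel_cost(ca)]     # cost of deleting all of a[:i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + indel_cost(ca),        # delete ca
                            curr[j - 1] + indel_cost(cb),    # insert cb
                            prev[j - 1] + sub_cost(ca, cb))) # substitute or match
        prev = curr
    return prev[-1]


print(weighted_edit_distance("Jon", "John"))   # only a cheap "h" insertion
print(weighted_edit_distance("John", "Bohn"))  # a full-cost consonant substitution
```

Under these assumed costs, Jon and John end up much closer than John and Bohn, even though plain Levenshtein scores both pairs as one edit.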
  • 24. Learn to classify
    • Positive and negative examples
    • Features
      • Coders
      • Distance functions
      • Weighted edit distance
    • Learn weights
      • several algorithms to choose from
    • Results in a vector
    • Threshold separates matches from non-matches
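The classification step can be sketched as a weighted sum of features compared against a threshold. Everything below is an assumption made for illustration: the three features are simple stand-ins (not the coders and distance functions used in the project), and the weights are hand-set rather than learned from the labeled pairs.

```python
from difflib import SequenceMatcher


def features(a: str, b: str) -> list[float]:
    """Hypothetical feature vector for a pair of names."""
    a, b = a.lower(), b.lower()
    return [
        1.0 if a[:1] == b[:1] else 0.0,                        # crude stand-in for a coder feature
        SequenceMatcher(None, a, b).ratio(),                   # stand-in for a distance-function feature
        1.0 - abs(len(a) - len(b)) / max(len(a), len(b), 1),   # length similarity
    ]


# Hypothetical values; a learning algorithm would fit these to labeled pairs.
WEIGHTS = [1.0, 3.0, 0.5]
BIAS = -3.0
THRESHOLD = 0.0


def score(a: str, b: str) -> float:
    """Weighted sum of the features plus a bias term."""
    return BIAS + sum(w * f for w, f in zip(WEIGHTS, features(a, b)))


def is_match(a: str, b: str) -> bool:
    """The threshold separates matches from non-matches."""
    return score(a, b) > THRESHOLD


print(is_match("John", "Jon"), is_match("John", "Mary"))
```

Replacing the hand-set `WEIGHTS` with values fitted to positive and negative examples is what turns this scoring function into a learned classifier.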
  • 25. Wait, isn't this just another distance function? Distance functions don't scale, right?
  • 26. Right
  • 27. Back to the basics
    x:     -5    -3    0    2     4
    f(x):  -1    4.5   7    3.5   2
  • 28. Long tail
  • 29. Long tail
    • 200,000 Surnames
    • 70,000 Given names
    • ≤ 1/5,000,000 names
  • 30. Long tail
    • Use distance function with table here
    • Use coder here
  • 31. Result: Table initialized by a function
    • Dallan: Dalana, Daleen, Dalen, Dalin, …, Talan, Tallon
    • Ryan: Aaran, Aran, Arrin, …, Rian, Riana
    • ...
  • 32. A nice thing about tables...
    • Dallan: Dalana, Daleen, Dalen, Dalin, …, Talan, Tallon
    • Ryan: Aaran, Aran, Arrin, …, Rian, Riana
    • ...
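The idea of a table initialized by a function can be sketched as follows: run the expensive pairwise comparison once, offline, over the common names and store the results, then fall back to a coder for names that are too rare to be in the table. The reading that common names use the table and rare names use the coder is an assumption about the slide, and `crude_code`, `build_table`, `build_code_index` and `similar_names` are hypothetical names, not the project's API.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def crude_code(name: str) -> str:
    """Hypothetical stand-in coder: first letter plus de-duplicated consonants."""
    name = name.upper()
    out = name[:1]
    for ch in name[1:]:
        if ch not in "AEIOUYHW" and ch != out[-1]:
            out += ch
    return out


def build_table(common_names: list[str], max_distance: int = 1) -> dict[str, list[str]]:
    """Offline step: all-pairs comparison, done once, stored as a lookup table."""
    table = {n: [] for n in common_names}
    for i, a in enumerate(common_names):
        for b in common_names[i + 1:]:
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                table[a].append(b)
                table[b].append(a)
    return table


def build_code_index(all_names: list[str]) -> dict[str, list[str]]:
    index = defaultdict(list)
    for n in all_names:
        index[crude_code(n)].append(n)
    return index


def similar_names(name, table, code_index):
    """Search-time step: a dictionary lookup, with a coder fallback for rare names."""
    if name in table:
        return table[name]
    return [n for n in code_index.get(crude_code(name), []) if n != name]


common = ["Dalen", "Dalin", "Ryan", "Rian"]
rare = ["Solveig", "Solvieg"]
table = build_table(common)
index = build_code_index(common + rare)
print(similar_names("Ryan", table, index), similar_names("Solveig", table, index))
```

Because the table is plain data, entries can also be added or removed by hand after the function has filled it in.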
  • 33. Add to the table
    • Nicknames
    • BehindTheName.com
    • The New American Dictionary of Baby Names by Leslie Dunkling and William Gosling
    • A Dictionary of Surnames by Patrick Hanks and Flavia Hodges
    • WeRelate community
  • 34. Thank you BehindTheName.com! Fascinating Family Trees for given names
  • 35. Result (28% decrease in false negatives)
    • Given names
      • Soundex: precision 97, recall 65
      • Our approach: precision 97, recall 74
    • Surnames
      • Soundex: precision 89, recall 68
      • Our approach: precision 89, recall 77
  • 36. Who is using it?
  • 37. WeRelate.org
  • 38. Continuous improvement
  • 39. Continuous improvement
  • 40. Community oversight
  • 41. How do I use it?
    • Source code and table available on GitHub:
      • http://github.com/DallanQ/Names
    • Search
      • Normalize
      • Index
      • Search
    • Score
    • Eval
    • Service
  • 42. Roadmap
    • Jan 2011 Open-source project created
    • Jan 2011 Implemented at WeRelate
    • Feb 2011 Announce at RootsTech
    • Continued improvements
  • 43. Future work
  • 44. Future work
    • Reduce the number of name variants to look up
    • Look up multiple codes
      • Refined soundex?
    • Cluster names
      • Mahout?
    • Remove “chaff” variants from common names
  • 45. Conclusion
    • Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license
    • Thank you Ancestry.com and BehindTheName.com!!!
    • Identifying name variants is hard
    • But getting it right is pretty important
      • names are at the core of genealogical research
    • Open source algorithm is now freely available
      • http://github.com/DallanQ/Names
      • 28% reduction in false negatives compared to Soundex
      • continuous improvement
    • Hopefully others will benefit from this effort
      • goal is to improve genealogical searches across the Web