
An Open-source Place-finder for Genealogy


An Open-source Place-finder for Genealogy presented by Dallan Quass and Ryan Knight at RootsTech 2012

Translate place texts into fully qualified, standardized place names, including historical ones.

Published in: Technology

  1. An Open-source Place-finder for Genealogy. Dallan Quass, Ryan Knight
  2. What's the problem?
  3. Genealogists write places in many different ways:
       - Philadelphia PA
       - After leaving Marion he moved to Cambridge, MA.
       - Church Row Goudhurst Kent
       - Los Angles, California
       - Kanesville, (Council Bluffs), Pottawattamie, IA
       - Not stated, Ohio, Kentucky
       - Lathom, Yorkshire, England
       - Tranbylier, Bskrd. Norway
       - La Junta, CO
       - Of Cranbury, Middlesex, NJ, Germany
       - Greatford (near Marton), NZ
       - Kingston Surrey
       - Prestingol, St Agnes, Cornwall, England
       - Deirdorf, Rhineland, Prss, (Germany)
       - Farmer. & Butcher.
       - Labourer
       - Returned to Boston Mass. with parents after a visit to Nova Scotia.
  4. Some are misspelled: Los angles, California
  5. Others are abbreviated:
       - Tranbylier, Bskrd. Norway
       - Deirdorf, Rhineland, Prss, (Germany)
  6. Some leave out commas:
       - Philadelphia PA
       - Tranbylier, Bskrd. Norway
       - Church Row Goudhurst Kent
       - Kingston Surrey
  7. Others have extra words:
       - Not stated, Ohio, Kentucky
       - Of Cranbury, Middlesex, NJ, Germany
       - After leaving Marion he moved to Cambridge, MA.
       - Returned to Boston Mass. with parents after a visit to Nova Scotia.
  8. Some no longer exist or exist under different names or jurisdictions:
       - Kanesville, (Council Bluffs), Pottawattamie, IA
       - Deirdorf, Rhineland, Prss, (Germany)
  9. Others have an incorrect intermediate level: Lathom, Yorkshire, England
  10. Some can't be found anywhere: Prestingol, St Agnes, Cornwall, England
  11. Others can be found in multiple places:
       - La Junta, CO
       - Philadelphia PA
       - Kingston Surrey
  12. And finally, some aren't places at all:
       - Farmer. & Butcher.
       - Labourer
  13. Why does it matter?
  14. Search
  15. Match
  16. Maps
  17. How does it work?
  18. Steps (example input: Ramsey, Hennepin, MN United States)
       1. Work right-to-left, finding matching places
          - split on commas
          - back off if no matches
  19. Steps (example input: Ramsey, Hennepin, MN United States)
       1. Work right-to-left, finding matching places
          - split on commas
          - back off if no matches
       2. Keep only subordinate jurisdictions
          - if none are subordinate, try skipping a level
          - if still no matches, ignore this level
  20. Steps (example input: Ramsey, Hennepin, MN United States)
       1. Work right-to-left, finding matching places
          - split on commas
          - back off if no matches
       2. Keep only subordinate jurisdictions
          - if none are subordinate, try skipping a level
          - if still no matches, ignore this level
       3. If there are multiple matches (ambiguous)
          - filter on type
          - filter out subordinate places
          - rank remaining matches
       Matches for the example input:
          - Ramsey, Minnesota, United States
          - Ramsey, Anoka, Minnesota, United States
          - Ramsey, Mower, Minnesota, United States
  21. Database
       - WeRelate has a database of 435,000 places
       - Includes inhabited places and record-keeping jurisdictions
       - Excludes geographic entities like rivers, mountains, etc.
       - Not complete, but we've researched and added additional places that appear frequently in GEDCOMs
  22. Wiki as a Database
  23. Wiki as a Database
  24. How it began: Wikipedia; Getty Thesaurus of Geographic Names; Family History Catalog
  25. All of us are smarter than any of us
  26. Community input
  27. Community oversight
  28. Community oversight
  29. Result: Proof is in the pudding
  30. Compare to FamilySearch
       - Standardized 3736 place texts chosen at random from GEDCOMs using both algorithms
       - 1911 standardized the same
       - 1825 were different
  31. Let's look at the K's
       GEDCOM place text: kaiapoi, nz
          - This project: Kaiapoi, Canterbury, New Zealand
          - FamilySearch: Kaiapoi, Canterbury, Canterbury, New Zealand
          - Best guess: Kaiapoi, Waimakariri (district), Canterbury (region), New Zealand
       GEDCOM place text: kanesville, (council bluffs), pottawattamie, ia
          - This project: Council Bluffs, Pottawattamie, Iowa, United States
          - FamilySearch: Kanesville, Pottawattamie, Iowa, United States
          - Best guess: Council Bluffs (formerly Kanesville), Pottawattamie, Iowa, United States
       GEDCOM place text: kansas city, missouri
          - This project: Kansas City, Cass, Missouri, United States
          - FamilySearch: Kansas City, Jackson, Missouri, United States
          - Best guess: located in Jackson, Clay, Cass, and Platte counties
       GEDCOM place text: kelvin grove cemetary, palmerston north, (section s block 3 plot 38)
          - This project: Palmerston North, Manawatu-Wanganui, New Zealand
          - FamilySearch: Kelvin Grove, Barkly East, Cape of Good Hope, South Africa
          - Best guess: Kelvin Grove Cemetery, Palmerston North, Manawatu-Wanganui (region), New Zealand
  32. Let's look at the K's
       GEDCOM place text: kenny ?? cots altandhu lochbroom
          - This project: Lochbroom, Ross and Cromarty, Scotland
          - FamilySearch: Loch Broom, Pictou, Nova Scotia, Canada
          - Best guess: Altandhu, Lochbroom, Ross and Cromarty, Scotland
       GEDCOM place text: kincardine ross & cromarty
          - This project: Cromarty, Ross and Cromarty, Scotland
          - FamilySearch: Ross and Cromarty, Scotland
          - Best guess: Kincardine, Ross and Cromarty (county), Scotland
       GEDCOM place text: , king queen, virginia, usa
          - This project: King, Wetzel, West Virginia, United States
          - FamilySearch: ,King, Clay, Virginia, United States
          - Best guess: King and Queen (county), Virginia, United States
       GEDCOM place text: kingston surrey
          - This project: Kingston, Surrey, Jamaica
          - FamilySearch: Kingston, Surrey, England
          - Best guess: both places exist, but England is more likely
  33. Bottom line
       - Of the 38 place texts compared
       - 3 texts were either not a place or were truly ambiguous
       - 8 texts weren't matched correctly by either system
       - 10 texts were matched to the same place (just named differently) by both systems
       - 11 texts were matched better by this project
       - 9 texts were matched better by FamilySearch's project
       Interestingly, these results are similar to the Nature study comparing Wikipedia with Encyclopedia Britannica; both had about the same number of mistakes.
  34. Roadmap
       - 2005-2011: Place wiki pages under development at WeRelate
       - Jan 2011: Open-source project created
       - Feb 2011: Announce at RootsTech
       - Mar 2011: Incorporate new algorithm at WeRelate
       - Continued improvements
  35. Future work
       - Analyze differences with FamilySearch
       - Review frequent missing places
       - Use machine learning for better scoring of ambiguous places
  36. Demonstration of Places Server
       - Demonstrates Matching Places
       - Built with Play 1.2.4, a Java web framework
          - Allows for rapid development of web applications with a fully integrated stack
       - Deployed to Heroku, a cloud application platform
          - Heroku allows one-step deployment with git
  37. Demonstration of Places Server
  38. Demonstration of Places Server
  39. Demonstration of Places Server
  40. Demonstration of Labeler
       - Community feedback on places we couldn’t match
       - Provides the best guess from the Places Standardizers
  41. Demonstration of Labeler
  42. Conclusion
       - Matching places is hard
          - people record places in lots of different ways
       - But it’s important
          - useful in search, match, and mapping
       - Open source algorithm and database are now freely available
          - http://github.com/DallanQ/Places
       - Not perfect, but ongoing improvement
       - Hopefully others will benefit from this effort
       Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license
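
The first two steps on the "Steps" slides (18 to 20) can be illustrated with a small sketch. This is not the project's code (that lives in the repository linked on the Conclusion slide); it is a minimal Python approximation under stated assumptions. The toy gazetteer, the abbreviation table, and every function name (`lookup`, `match_segment`, `is_within`, `standardize`) are hypothetical. "Back off" is interpreted here as trying shorter runs of words when a comma-delimited segment has no match, and "skipping a level" as checking against the jurisdiction matched one step earlier. Step 3 (filtering on type and ranking) is left out, so ambiguous inputs simply return every surviving candidate in alphabetical order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    id: int
    name: str
    parent: Optional[int]  # id of the containing jurisdiction, None at the top


# Hypothetical toy gazetteer; the real database is far larger.
PLACES = [
    Place(1, "United States", None),
    Place(2, "Minnesota", 1),
    Place(3, "Hennepin", 2),
    Place(4, "Anoka", 2),
    Place(5, "Mower", 2),
    Place(6, "Ramsey", 2),  # directly under Minnesota
    Place(7, "Ramsey", 4),  # under Anoka
    Place(8, "Ramsey", 5),  # under Mower
]
ABBREVIATIONS = {"mn": "minnesota"}  # assumed abbreviation table
BY_ID = {p.id: p for p in PLACES}
BY_NAME = {}
for _p in PLACES:
    BY_NAME.setdefault(_p.name.lower(), []).append(_p)


def lookup(text):
    """Return all places whose name (or abbreviation) equals text."""
    key = text.strip().lower()
    return BY_NAME.get(ABBREVIATIONS.get(key, key), [])


def match_segment(segment):
    """Step 1 for one comma-delimited segment, working right to left.

    Tries the longest run of trailing words first and backs off to shorter
    runs; a trailing word that matches nothing is dropped.
    """
    words = segment.split()
    while words:
        for start in range(len(words)):
            found = lookup(" ".join(words[start:]))
            if found:
                yield found
                words = words[:start]
                break
        else:
            words.pop()


def is_within(place, ancestor_ids):
    """True if any jurisdiction above place is in ancestor_ids."""
    parent = place.parent
    while parent is not None:
        if parent in ancestor_ids:
            return True
        parent = BY_ID[parent].parent
    return False


def full_name(place):
    parts = [place.name]
    parent = place.parent
    while parent is not None:
        parts.append(BY_ID[parent].name)
        parent = BY_ID[parent].parent
    return ", ".join(parts)


def standardize(text):
    levels = []  # matched id sets, broadest jurisdiction first
    for segment in reversed(text.split(",")):
        for candidates in match_segment(segment):
            if not levels:
                levels.append({p.id for p in candidates})
                continue
            kept = []
            # Step 2: keep subordinate places; if none, skip one level.
            for skip in (1, 2):
                if len(levels) >= skip:
                    kept = [p for p in candidates if is_within(p, levels[-skip])]
                if kept:
                    break
            if kept:
                levels.append({p.id for p in kept})
            # Otherwise this level is ignored.
    if not levels:
        return []
    return sorted(full_name(BY_ID[i]) for i in levels[-1])


for name in standardize("Ramsey, Hennepin, MN United States"):
    print(name)
```

With this toy data, the slide's example input yields the same three candidates shown on slide 20: no Ramsey lies inside Hennepin, so the sketch skips that level and keeps every Ramsey inside Minnesota. A real implementation would then apply step 3 to rank them.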