Artificial Intelligence and the Coming Revolution of Family History - Presentation

  1. © 2013 by Intellectual Reserve, Inc. All rights reserved. Artificial Intelligence and the Coming Revolution of Family History Virtual Genealogical Association Webinar November 16, 2019 Ben Baker bakerb@familysearch.org
  2. Background • Nearly 10 years as a Software Engineer at FamilySearch • Currently on the Automated Content Extraction team • Try to do my own genealogy and help others • Hope I’ll be able to help you see a vision of the future and how you can be an active part of bringing it to pass • Go to https://www.slideshare.net/bakers84 or e-mail me (bakerb@familysearch.org) to get a copy of this presentation • Click here for the related printed handout materials
  3. First, Some Basics About the “Big 4” FamilySearch, Ancestry, MyHeritage and FindMyPast Good News • Each of the “Big 4” has published 5-10 billion indexed names of people from records • All of the “Big 4” use record hinting to help users source persons in their trees with records. (Ex. Over a billion sources have been attached to persons in FamilySearch Family Tree.) • Hundreds of cameras worldwide continue to digitize millions of images per day Bad News • Only a small fraction of historical records have been digitized • Only a small fraction of those digitized images have been indexed • Indexing isn’t keeping up with the ability to digitize images, especially in non-English languages • For-profit genealogy companies are mostly using offshore indexing • Only indexed records can be presented as record hints
  4. Accelerating Records Publication • All of the “Big 4” want to publish more records • All of the “Big 4” are already using automated technologies to accelerate records publication • Efforts are underway to provide abilities for homelands to better publish their own records • More responsibility is moving to users to correct errors in records
  5. FamilySearch • Partnership with GenealogyBank to extract data from born-digital obituaries • First run indexed 5M obituaries in 10 hours, saving about 150 man-years of indexing • 23M obituaries auto-indexed as of Nov 2019, more likely coming • Millions of historical newspaper death stories starting to be released • Uses recent advancements in machine learning and artificial intelligence (AI) • Can produce even more information than indexing (Ex. In-law couple relationships) • https://www.familysearch.org/search/collection/2333694
  6. https://blogs.ancestry.com/ancestry/2019/10/28/powered-by-cutting-edge-machine-learning-technology-ancestry-debuts-the-worlds-largest-searchable-online-obituary-collection-providing-members-with-even-more-details-about-the-ancest/
  7. https://medium.com/myheritage-engineering/face-recognition-and-ocr-processing-of-300-million-records-from-us-yearbooks-a95d55c6ac58
  8. https://abundantgenealogy.com/findmypast-announces-trial-of-revolutionary-new-newspaper-search/
  9. How are Artificial Intelligence, Machine Learning and Deep Learning Related? Artificial Intelligence: Machines exhibiting human intelligence • General AI – Still science fiction • Narrow AI – Technologies that perform specific tasks as well as or better than humans Machine Learning: Practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world Deep Learning: Using much larger machine learning neural networks requiring more training data and computational power [Diagram labels: Artificial Intelligence, Machine Learning, Deep Learning]
  10. How is Machine Learning different from traditional programming? Machine learning lets computers learn from data instead of relying on hand-written rules (i.e. code) to solve problems [Flowchart 1, traditional programming: Study the Problem, Write Rules, Evaluate, Launch!, Analyze Errors] [Flowchart 2, machine learning: Study the Problem, Train ML Algorithm (with Data), Evaluate, Launch!, Analyze Errors]
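The contrast on this slide can be sketched in a few lines of Python. This is a minimal illustration, not anything from the presentation: the toy messages, the function names, and the simple word-count scoring are all assumptions chosen to keep the example small. The first function is a rule a person wrote by hand; the second derives its behavior from labeled examples.

```python
from collections import Counter

# Hypothetical toy data, invented for illustration: (message, is_spam) pairs.
TRAINING_DATA = [
    ("win a free prize now", True),
    ("free money click now", True),
    ("claim your free prize", True),
    ("family reunion photos attached", False),
    ("census record for your grandfather", False),
    ("lunch at noon tomorrow", False),
]


def rule_based_is_spam(message: str) -> bool:
    """Traditional programming: a person studies the problem and writes the rule."""
    return "free" in message.split()


def train(examples):
    """Machine learning: count how often each word appears in spam vs. non-spam."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in examples:
        (spam_counts if is_spam else ham_counts).update(text.split())
    return spam_counts, ham_counts


def learned_is_spam(message: str, model) -> bool:
    """Score a message with the learned counts; a positive score means spam."""
    spam_counts, ham_counts = model
    score = sum(spam_counts[w] - ham_counts[w] for w in message.split())
    return score > 0


if __name__ == "__main__":
    model = train(TRAINING_DATA)
    msg = "click now for a prize"
    print(rule_based_is_spam(msg), learned_is_spam(msg, model))
```

The hand-written rule misses the message because it lacks the word "free", while the learned scorer flags it from words it saw in the spam examples. Improving the learned version means adding better data rather than writing more rules.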
  11. Machine Learning Isn’t Really New • Been around for decades • Spam filters in 1990s • OCR (Optical Character Recognition) • “Big 4” family history companies have also been using it for some time • Record hinting (person – record matching) • Possible duplicates (person – person matching) • Exploring new uses • Automated Content Extraction (aka Auto-Indexing) • Facial Recognition
  12. Necessary Technologies for Auto-Indexing • Text Recognition • Zoning/text tiling to separate into individual records • Segment text into lines • Recognize typed and/or handwritten characters • Use additional context to determine most likely words • Natural Language Processing (NLP) • Named entity recognition (NER) – identify names, dates, places, etc. • Relation extraction – identify relationships between the names, dates & places • Additional processing to get into format for publication, standardize data, etc. • Notice the steps are very similar to what a genealogist would do
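The stages listed on this slide can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the sample text is invented, and regular expressions stand in for the trained models a production system would use for entity recognition and relation extraction. The names `zone`, `extract`, `NAME` and `DATE` are hypothetical.

```python
import re

# Hypothetical, already-transcribed page text invented for illustration.
PAGE = """John Smith died March 3, 1885 in Buffalo.
He was the husband of Mary Smith.

Anna Jones died June 9, 1890 in Olean.
She was the wife of Peter Jones."""

# Toy patterns standing in for trained named entity recognition models.
NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"
DATE = (
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}"
)


def zone(page: str) -> list[str]:
    """Zoning: split the page into individual records (here, on blank lines)."""
    return [block.strip() for block in page.split("\n\n") if block.strip()]


def extract(record: str) -> dict:
    """Run toy entity recognition and relation extraction on one record."""
    text = " ".join(record.splitlines())  # line segmentation, then rejoin
    names = re.findall(NAME, text)
    dates = re.findall(DATE, text)
    places = re.findall(r"\bin ([A-Z][a-z]+)", text)
    # Relation extraction: link the first-named person to a spouse, if stated.
    relations = [
        (names[0], role, other)
        for role, other in re.findall(rf"(husband|wife) of ({NAME})", text)
        if names
    ]
    return {"names": names, "dates": dates, "places": places, "relations": relations}


if __name__ == "__main__":
    for record in zone(PAGE):
        print(extract(record))
```

Each stage consumes the previous stage's output, which mirrors how a genealogist reads a page: find the record, read the lines, pick out names, dates and places, then work out who is related to whom.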
  13. Zoning/Text Tiling Find record boundaries within/across image(s) • Articles on a newspaper page • Marriage records flowing across pages • Pages of a probate record
  14. Line Segmentation/Text Recognition --- 9 WARRANTY DEED. Baker, Jones & Co., Printers, in Washington St., Buffalo, N. Y. This Indenture, Made this Fourteenth day of February in the year of our Lord one Thousand Eight Hundred and Eighty - Five, Between Della Fuller of Oleare County of Cattaraugus and State of New York of the first part, and William Parsel of the same place of the second part, Witnesseth, that the said party of the first part, in consideration of the sum of Fifteen hundred Dollars to her duly paid has sold, and by these Presents, does grant and convey to the said party of the second part his heirs and assigns that tract or parcel of Land situate in the Town and lenty of Olean County of
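The recognized text above appears to read "Oleare" where the deed later says "Olean". One way additional context can help pick the most likely word is to compare a recognized token against a list of known places. The sketch below uses Python's standard `difflib` module for that comparison; the place list, the `correct_place` helper and the 0.6 cutoff are assumptions for illustration, not a description of how any of the "Big 4" do it.

```python
import difflib

# Hypothetical gazetteer of places expected in this record set.
KNOWN_PLACES = ["Olean", "Buffalo", "Albany", "Cattaraugus"]


def correct_place(token: str, known=KNOWN_PLACES, cutoff: float = 0.6) -> str:
    """Return the closest known place, or the token unchanged if none is close."""
    matches = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
    return matches[0] if matches else token


if __name__ == "__main__":
    for word in ["Oleare", "Buffalo", "Parsel"]:
        print(word, "->", correct_place(word))
```

A surname such as "Parsel" is left alone because it is not close to any listed place, which is the behavior you want when the list only covers one kind of entity.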
  15. Natural Language Processing
  16. Pros and Cons of Automated Technologies Good News • Can produce records much more quickly than human indexing • Can scale far beyond the number of available indexers • Provides searchable/hintable records much sooner than they’d otherwise be available • Is cheaper than paying for indexers and associated costs • Can be used on records not well suited to human indexing • Can extract more information than indexers may be able to • Once trained, can be applied to languages with very few indexers Bad News • Sometimes produces lower-quality results than human indexing • Requires more human judgment to properly attach as a source • May require different methods of searching to find records • Requires ability to correct errors in records
  17. Beware of Automated Computer Indexing “We all need to be aware that these types of errors will occur … Even with the indexing errors, searching in digitized collections is much easier these days than it was searching newsprint and/or microfilm of newspaper pages 20 years ago. I greatly appreciate the efforts by companies like Ancestry.com … I'm not complaining here - just making the point that we need to expect errors like this will be made, and we need to be flexible in our searches if we don't get results when we use an exact name or date or place.” Randy Seaver https://www.geneamusings.com/2019/10/beware-of-automated-computer-indexing.html
  18. User Corrections • Errors will be fixed fairly quickly by reporting via the Errors tab • Ability to correct names available in many collections • Ability to correct dates, places, relationships and more coming
  19. View Original Image/Text
  20. Collections to Watch The following collections on FamilySearch contain automatically indexed records: • United States, GenealogyBank Obituaries, 1980-2014 More recent “born digital” obituaries, at least 23M done by a computer • United States, GenealogyBank Historical Newspaper Obituaries, 1815-2011 Death stories taken from historical newspaper articles, hundreds of thousands published as of Nov 2019, millions more coming • New York, Wills and Deeds, ca. 1700s-2017 First foray into automatically indexing handwritten records Millions of records coming for all 50 states eventually
  21. Browse All Published Collections
  22. Find Recently Updated Collections
  23. What You Can Do • Keep Indexing on FamilySearch • It is still valuable, especially in non-English languages • Remember indexed data is the foundation for training machines to auto-index correctly • We’ll also likely keep using human indexing to measure how the machines are doing • Understand your role in correcting records that have been automatically indexed incorrectly • Be patient as solutions continue to expand, perhaps on collections that don’t benefit your research, remembering global users have far fewer records than many of us • Pray for God’s help to bless these efforts
  24. Bonus Tip: Image Search Impatient and want to search images yourself? Try https://www.familysearch.org/records/images/
  25. Image Search Results
  26. “We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten.” Bill Gates
  27. Tale of Three Decades 1998-2007 – Laying the technological foundation 1996 – GEDCOM 5.5 standard released (still supported) 1999 – PAF 4.0 – First Windows version 2002 – PAF 5.2 – Last major version 2004 – First vault microfilms converted to digital images 2006 – FamilySearch indexing began 2007 – FamilySearch Research wiki started 2007 – First digital images from the vault published on FamilySearch.org 2008-2017 – Single publicly available tree integrated with historical records 2010 – Launch of FamilySearch record search (>1B names, millions of images) 2009 – new.familysearch became available in Utah (limited rollout began in 2007) 2009 – I began to work at FamilySearch 2011 – RootsTech conference began 2013 – Family Tree added – made available to non-LDS patrons 2013 – Memories (photos & stories) initial rollout 2014 – Partnerships with Ancestry, MyHeritage and FindMyPast 2014 – Record hinting 2014 – First FamilySearch mobile app released 2015 – User to User Messaging 2015 – Printing temple cards from home in 44 languages 2016 – Family Tree moved to scalable servers 2017 – Web indexing 2018 – Family Tree Lite 2018-2027 – Worldwide explosion of records 2017 – Nordic Records – Year of the Viking Scandinavian (Sweden, Denmark, Finland) first 3 of top 50 countries 2018 – Mexican Civil Records project – 60M records ???? – Billions more indexed records made available via automatic indexing technologies ???? – User corrections of records supported ???? – DNA Features? ???? – ????
  28. Thank you! I hope you’ve learned some useful things I hope you’ve been inspired I hope you’re excited for what is coming I hope you’ll help evangelize why this revolution is good