The Coming Explosion of Records at FamilySearch - Presentation

  1. © 2013 by Intellectual Reserve, Inc. All rights reserved. The Coming Explosion of Records at FamilySearch BYU Conference on Family History and Genealogy July 31, 2018 Ben Baker
  2. Background • Over 8½ years as a Software Engineer at FamilySearch • Currently on the Automated Content Extraction team • Try to do my own genealogy and help others • Hope I’ll be able to help you see a vision of the future • Go to or e-mail me ( to get a copy of this presentation • Click here for the related printed handout materials
  3. First, Some Basics Good News • FamilySearch published its 2 billionth image in April 2018 • The 1 billionth image was published in June 2014 • FamilySearch continues to digitize nearly 1M images per day from microfilm and about 320 cameras worldwide • FamilySearch has nearly 6.4B indexed names of people in records • Record hinting has already made FamilySearch Family Tree the most well-sourced tree in the world, with over 1B sources attached to persons in the tree Bad News • Many records are only available as images via the catalog; only a fraction of records have been indexed • Indexing isn’t keeping up with the ability to digitize images, especially in non-English languages • Currently available record images do not match church membership in some areas • Only indexed records can be presented as record hints
  4. Historical Records Images by Region at FamilySearch North America Europe and Middle East Latin America Other Asia Africa/Pacific LDS Church Membership by Region North America Europe and Middle East Latin America Other Asia Africa/Pacific
  5. Changing the Records Publication Paradigm • Several teams at FamilySearch are dedicated to improving the records publication platform • The Goal: Provide more findable, relevant, curated records for gathering multi-generational families from around the world • Want to publish and make hintable 20% of the top tier records in 50 of the highest priority countries within 15 years • 58% coverage in North America as of 2017 • Crossed 20% in 3 more countries in 2017 (Denmark, Finland and Sweden) • Major release of Mexican records in 2018 • Seeking to allow homelands to be more involved in building local content • Will support user corrections to records and indexing on-the-fly • Will use automated technologies to accelerate publication
  6. International Conference on Document Analysis and Recognition (ICDAR) 2011 Beijing Friendship Hotel
  7. First Mini-Explosion • Partnership with GenealogyBank to extract data from born-digital obituaries • First run indexed 5M obituaries in 10 hours, saving about 150 man-years of indexing • 23M obituaries indexed as of May 2018, many more coming • Uses recent advancements in machine learning and artificial intelligence (AI) • Can produce even more information than indexing (e.g. in-law couple relationships)
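The figures on this slide can be sanity-checked with a little arithmetic. The sketch below uses only the numbers from the slide, plus one assumption that is not in the presentation: that a man-year is roughly 2,000 working hours. Under that assumption, it shows the human indexing pace the "150 man-years" figure implies.

```python
# Back-of-the-envelope check of the figures on the slide.
OBITUARIES = 5_000_000       # from the slide
MACHINE_HOURS = 10           # from the slide
MAN_YEARS_SAVED = 150        # from the slide
HOURS_PER_MAN_YEAR = 2_000   # ASSUMPTION: about 50 weeks x 40 hours

machine_rate_per_hour = OBITUARIES / MACHINE_HOURS
human_hours = MAN_YEARS_SAVED * HOURS_PER_MAN_YEAR
human_rate_per_hour = OBITUARIES / human_hours
minutes_per_obituary = 60 / human_rate_per_hour
speedup = machine_rate_per_hour / human_rate_per_hour

print(f"Automated rate: {machine_rate_per_hour:,.0f} obituaries/hour")
print(f"Implied human effort: {human_hours:,} hours")
print(f"Implied human pace: {minutes_per_obituary:.1f} minutes/obituary")
print(f"Speedup over one indexer: {speedup:,.0f}x")
```

With the 2,000-hour assumption, the slide's numbers imply a few minutes of human effort per obituary. A different definition of a man-year would scale the result proportionally.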
  8. Video 1
  9. • The GenealogyBank collection of obituaries is available now • Improvements to correcting data are coming
  10. What is Being Done Now • Refining research code and models to be more stable, reproducible and measurable • Support ability to publish 1M obituaries a month now, continuing to increase • Built on scalable Amazon Web Services to meet any future demands
  11. How are Artificial Intelligence, Machine Learning and Deep Learning Related? • Artificial Intelligence – Machines exhibiting human intelligence • General AI – still science fiction • Narrow AI – technologies that perform specific tasks as well as or better than humans • Machine Learning – The practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world • Deep Learning – Using much larger machine learning neural networks, which require more training data and computational power • [Diagram labels: Artificial Intelligence, Machine Learning, Deep Learning]
  12. Machine Learning Isn’t Really New • Been around for decades • Spam filters in the 1990s • OCR (Optical Character Recognition) • FamilySearch already uses it for some things • Match classifier • Possible duplicates (person – person) • Record hinting (person – record) • FamilySearch is beginning to explore new uses • Research Team -> Automated Content Extraction • Exploring Deep Learning and other methods to automatically understand historical documents
  13. How is Machine Learning different from traditional programming? Machine learning means having computers learn from data instead of having people write rules (i.e. code) to solve problems • Traditional approach: Study the Problem, Write Rules, Evaluate, Launch!, Analyze Errors • Machine learning approach: Study the Problem, Train ML Algorithm (using Data), Evaluate, Launch!, Analyze Errors
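The contrast on this slide can be illustrated with a toy example. The sketch below is not FamilySearch's method. It compares a hand-written rule with a very simple word-scoring model learned from labeled examples. The training sentences, function names, and scoring scheme are all invented for illustration.

```python
from collections import Counter

# Tiny labeled data set (invented for illustration only).
TRAINING_DATA = [
    ("beloved mother passed away peacefully on sunday", True),
    ("funeral services will be held at the chapel", True),
    ("he is survived by his wife and three children", True),
    ("city council approves new budget for roads", False),
    ("local team wins championship game on sunday", False),
    ("store opens new location downtown next week", False),
]


def rule_based_is_obituary(text):
    """Traditional programming: a person writes the rule by hand."""
    return "passed away" in text or "survived by" in text


def train(examples):
    """Machine learning: derive per-word scores from labeled examples."""
    obit_words, other_words = Counter(), Counter()
    for text, is_obit in examples:
        (obit_words if is_obit else other_words).update(text.split())
    vocabulary = set(obit_words) | set(other_words)
    # A positive score means the word appeared more often in obituaries.
    return {w: obit_words[w] - other_words[w] for w in vocabulary}


def learned_is_obituary(model, text):
    """Sum the learned word scores; unseen words contribute nothing."""
    return sum(model.get(word, 0) for word in text.split()) > 0


model = train(TRAINING_DATA)
sample = "funeral for beloved father held at the chapel"
print("Rule says obituary:   ", rule_based_is_obituary(sample))
print("Learned model says:   ", learned_is_obituary(model, sample))
```

The hand-written rule misses the sample because nobody anticipated its wording, while the learned scores pick up on words such as "funeral" and "chapel" that appeared in the labeled obituaries. Improving the rule means writing more code. Improving the model means supplying more labeled data.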
  14. Necessary Technologies • Natural Language Processing (NLP) • Named entity recognition (NER) – identify the names, dates, places, etc. • Relation extraction – identify relationships between the names, dates & places • Additional processing to get into format for publication, standardize data, etc. • Notice the steps are similar to what a genealogist would do
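The two steps named on this slide, entity recognition followed by relation extraction, can be sketched in a few lines. This is only a toy: it uses hand-written regular expressions rather than the machine-learned models the presentation describes. The obituary text, the patterns, and the function names are all hypothetical.

```python
import re

# Invented obituary text, for illustration only.
TEXT = (
    "John A. Smith, 78, died March 3, 2015 in Provo, Utah. "
    "He is survived by his wife, Mary Smith, and his son, Robert Smith."
)

# Simplistic patterns: a capitalized first name, an optional middle
# initial, and a capitalized surname; and a "Month D, YYYY" date.
NAME = r"[A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+"
DATE = (
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}"
)


def extract_entities(text):
    """Step 1 (NER): find the names and dates mentioned in the text."""
    return {"names": re.findall(NAME, text), "dates": re.findall(DATE, text)}


def extract_relations(text, subject):
    """Step 2 (relation extraction): link relatives to the subject."""
    pattern = r"(?:his|her) (wife|husband|son|daughter), (" + NAME + ")"
    return [(subject, role, name) for role, name in re.findall(pattern, text)]


entities = extract_entities(TEXT)
deceased = entities["names"][0]  # ASSUMPTION: first name found is the subject
relations = extract_relations(TEXT, deceased)

print(entities)
print(relations)
```

Patterns like these break as soon as the wording varies, which is one reason learned models are attractive for this task. The pipeline has the same shape either way: find the entities first, then connect them.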
  15. Identification and Extraction of Data
  16. Live Demos Lille E. Yeckley 1915-1980
  17. Document Type / Record Type / Language / Status in May 2018 • Digital text / Obituaries / English / Already published 23M, working to continuously publish • Typewritten newspaper text / Obituaries / English / Active research • Handwritten text / Wills and deeds / English / Active research • Handwritten calligraphy / Genealogies / Chinese / Preliminary research • Handwritten text / Church records / Spanish / Preliminary research • More document types / More record types / More languages / Expect future “explosions”
  18. Video 2
  19. What You Can Do • Keep Indexing • It is still valuable, especially in non-English languages • Remember indexed data is the foundation for training machines to auto-index correctly • We’ll also likely continue to use human indexing to continue to measure how the machines are doing • Understand your role in correcting records that have been automatically indexed incorrectly • Be patient as solutions continue to expand, perhaps on collections that don’t benefit your research, remembering we are a global church • Pray for the Lord’s help to bless these efforts
  20. Infinity Automated Technologies Truth / Training Data Indexing User Corrections
  21. “We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten.” Bill Gates
  22. Tale of Three Decades • 1998-2007 – Laying the technological foundation: 1996 – GEDCOM 5.5 standard released (still supported); 1999 – PAF 4.0 – First Windows version; 2002 – PAF 5.2 – Last major version; 2004 – First vault microfilms converted to digital images; 2007 – First digital images from the vault published on • 2008-2017 – Single publicly available tree integrated with historical records: 2010 – Launch of FamilySearch record search (>1B names, millions of images); 2006 – FamilySearch indexing began; 2007 – FamilySearch Research wiki started; 2009 – new.familysearch became available in Utah (limited rollout began in 2007); 2009 – I began to work at FamilySearch; 2011 – RootsTech conference began; 2013 – Family Tree added – made available to non-LDS patrons; 2013 – Memories (photos & stories) initial rollout; 2014 – Partnerships with Ancestry, MyHeritage and FindMyPast; 2014 – Record hinting; 2014 – First FamilySearch mobile app released; 2015 – User to User Messaging; 2015 – Printing temple cards from home in 44 languages; 2016 – Family Tree moved to scalable servers; 2017 – Web indexing; 2018 – Family Tree Lite • 2018-2027 – Worldwide explosion of records: 2017 – Nordic Records – Year of the Viking – Scandinavian (Sweden, Denmark, Finland) first 3 of top 50 countries; 2018 – Mexican Civil Records project – 60M records; ???? – Billions more indexed records made available via automatic indexing technologies; ???? – User corrections of records supported; ???? – DNA Features?; ???? – ????
  23. Thank you! I hope you’ve been inspired Keep an eye out for more explosions