Syllabus for the 2018 BYU Conference on Family History and Genealogy. While record hinting has greatly increased the number of record sources attached to persons in FamilySearch Family Tree, many records are still available only as images and are not yet indexed to be searchable. This is especially true for non-English records. This presentation shows how FamilySearch is working to provide more findable, relevant, curated records for gathering multi-generational families from around the world by using Artificial Intelligence (AI) and other cutting-edge technologies to greatly accelerate growth in the number of historical records available to patrons.
The Coming Explosion of Records at FamilySearch Syllabus
1. The Coming Explosion of Records at FamilySearch
Ben Baker – bakerb@familysearch.org
To view the presentation slides this handout accompanies, please go to:
https://www.slideshare.net/bakers84/the-coming-explosion-of-records-at-familysearch-presentation
Historical Records Basics
• FamilySearch published its 2 billionth image in April 2018; the 1 billionth image was published in June 2014
• Continue to digitize nearly 1M images per day from microfilm and over 320 cameras worldwide
• Many records are only available as images via the catalog
• Despite having 6.3B indexed names, only a fraction of records have been indexed
• Indexing isn’t keeping up with the ability to digitize images, especially in non-English languages
• Only indexed records can be presented as record hints
• Record hinting has already made FamilySearch Family Tree the best-sourced tree in the world, with over 931M sources attached to persons in the tree
• Currently available record images do not match the distribution of Church membership in some areas
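As a rough check on the pace implied by the two image milestones above, the following sketch computes the average number of images published per day between them. The handout gives only the month of each milestone, so the first day of each month is assumed here, and the function name is illustrative.

```python
from datetime import date

# Milestone months come from the handout; the day of the month is an assumption.
FIRST_BILLION = date(2014, 6, 1)
SECOND_BILLION = date(2018, 4, 1)
IMAGES_BETWEEN = 1_000_000_000


def average_images_per_day(start: date, end: date, images: int) -> float:
    """Average number of images published per day between two milestones."""
    return images / (end - start).days


rate = average_images_per_day(FIRST_BILLION, SECOND_BILLION, IMAGES_BETWEEN)
print(f"{rate:,.0f} images published per day on average")
```

This measures publication, which is a different quantity from the digitization rate quoted in the list above.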
Changing the Records Publication Paradigm
• Several teams at FamilySearch are dedicated to improving the records publication platform
• The Goal: Provide more findable, relevant, curated records for gathering multi-generational
families from around the world
• Want to publish and make hintable 20% of the top-tier records in 50 of the highest-priority countries within 15 years
• Seeking to allow homelands to be more involved in building local content
• Will support user corrections to records and indexing on the fly
• Will use automated technologies to accelerate publication
[Chart: Historical Records Images by Region at FamilySearch. Regions: North America, Latin America, Asia, Europe and Middle East, Africa/Pacific, Other]
[Chart: LDS Church Membership by Region. Regions: North America, Latin America, Asia, Europe and Middle East, Africa/Pacific, Other]
2. Investigations into Automated Indexing
• Personal Story - 2011 International Conference on Document Analysis and Recognition in Beijing
• Collaboration with other companies to explore handwriting recognition – “not ready yet”
• First “mini explosion” occurred a couple of years ago
o Partnership with GenealogyBank to extract data from born-digital obituaries
o First run indexed 5M obituaries in 10 hours, saving about 150 man-years of indexing
o 23M obituaries indexed as of May 2018, many more coming
o Uses recent advancements in machine learning and artificial intelligence (AI)
o Can produce even more information than indexing (Ex. In-law relationships)
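The figures quoted for the first run can be turned into rates with a little arithmetic. The sketch below assumes a man-year of 2,000 working hours; that value is an illustrative assumption, not a figure from the presentation, and the variable names are hypothetical.

```python
# Figures from the handout.
OBITUARIES = 5_000_000
RUN_HOURS = 10
MAN_YEARS_SAVED = 150

# Assumption for illustration only: one man-year is 2,000 working hours.
HOURS_PER_MAN_YEAR = 2_000

# Throughput of the automated run.
per_hour = OBITUARIES / RUN_HOURS
per_second = per_hour / 3600

# Manual effort implied by the "man-years saved" estimate.
manual_hours = MAN_YEARS_SAVED * HOURS_PER_MAN_YEAR
minutes_per_obituary = manual_hours * 60 / OBITUARIES

print(f"Automated: {per_hour:,.0f} obituaries per hour ({per_second:.0f} per second)")
print(f"Implied manual effort: {minutes_per_obituary:.1f} minutes per obituary")
```

Under that assumption, the estimate implies a few minutes of human indexing time per obituary, compared with a small fraction of a second per obituary for the automated run.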
3. What is Being Done Now
• Refining research code and models to be more stable, reproducible and measurable
• Can now publish 1M obituaries a month, and capacity continues to increase
• Built on scalable Amazon Web Services to meet any future demands
Basics of Artificial Intelligence / Machine Learning
• Artificial Intelligence – Machines exhibiting human intelligence
o General AI – still science fiction
o Narrow AI – technologies that perform specific tasks as well as or better than humans
• Machine Learning – A subset of AI. The practice of using algorithms to parse data, learn from it,
and then make a determination or prediction about something in the world
• Machine Learning lets computers learn from data to solve problems, instead of relying on people to write rules (i.e. code)
• FamilySearch has actually been using machine learning for a while
o Possible duplicates
o Record hints
• Technologies needed to successfully extract information from an obituary
o Natural Language Processing (NLP)
▪ Named entity recognition (NER) – identify the names, dates, places, etc.
▪ Relation extraction – identify relationships between the names, dates & places
o Additional processing to get the data into a format for publication, standardize data, etc.
o Notice the steps are similar to what a genealogist would do
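The extraction steps listed above can be sketched as a small pipeline. This is a toy illustration only: it uses hand-written regular expressions as stand-ins for the learned NER and relation-extraction models, and it is not FamilySearch's implementation. The sample text, patterns, and function names are all hypothetical.

```python
import re

# Hypothetical sample text, written for this sketch only.
OBITUARY = (
    "John Smith, 84, passed away on March 3, 2018 in Provo, Utah. "
    "He is survived by his wife Mary Smith and his son David Smith."
)

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")
RELATION_RE = re.compile(r"\bhis (wife|son|daughter) ([A-Z][a-z]+ [A-Z][a-z]+)")


def recognize_entities(text):
    """Step 1 (NER stand-in): find person names and dates in the text."""
    names = NAME_RE.findall(text)
    dates = [match.groups() for match in DATE_RE.finditer(text)]
    return names, dates


def extract_relations(text, principal):
    """Step 2 (relation extraction stand-in): link relatives to the principal."""
    return [(principal, rel, name) for rel, name in RELATION_RE.findall(text)]


def standardize_date(month, day, year):
    """Step 3 (standardization): normalize a date to YYYY-MM-DD."""
    return f"{int(year):04d}-{MONTHS.index(month) + 1:02d}-{int(day):02d}"


def process(text):
    names, dates = recognize_entities(text)
    principal = names[0]  # assumption: the first name found is the deceased
    return {
        "principal": principal,
        "dates": [standardize_date(*d) for d in dates],
        "relations": extract_relations(text, principal),
    }


record = process(OBITUARY)
print(record)
```

Hand-written patterns like these break as soon as the wording changes, which is the motivation for learning the extraction from previously indexed data instead.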
4. What is Coming in the Future
• Research already underway and looking very promising for
o Optical Character Recognition (OCR)
o Zoning (Ex. determining where newspaper articles are)
o Handwriting Recognition
• Expanding capabilities into more document and record types
• Beginning to investigate other languages
Document Type | Record Type | Language | Status in May 2018
Digital text | Obituaries | English | Already published 23M; working to continuously publish
Typewritten newspaper text | Obituaries | English | Active research
Handwritten text | Wills and deeds | English | Active research
Handwritten calligraphy | Genealogies | Chinese | Preliminary research
Handwritten text | Church records | Spanish | Preliminary research
More document types | More record types | More languages | Expect future “explosions”
What You Can Do
• Indexing is still valuable, especially in non-English languages
• Remember indexed data is the foundation for training machines to auto-index correctly
• Understand your role in correcting records that have been automatically indexed incorrectly
• Be patient as solutions continue to expand, perhaps on collections that don’t benefit your
research, remembering we are a global church
• Pray for the Lord’s help to bless these efforts
“We always overestimate the change that will occur in the next two years and
underestimate the change that will occur in the next ten.”
Bill Gates