The Coming Explosion of Records at FamilySearch - Presentation

© 2013 by Intellectual Reserve, Inc. All rights reserved.
The Coming Explosion of
Records at FamilySearch
BYU Conference on Family History and Genealogy
July 31, 2018
Ben Baker
bakerb@familysearch.org

Background
• Over 8½ years as a Software Engineer at FamilySearch
• Currently on the Automated Content Extraction team
• Try to do my own genealogy and help others
• Hope I’ll be able to help you see a vision of the future
• Go to https://www.slideshare.net/bakers84 or e-mail me
(bakerb@familysearch.org) to get a copy of this
presentation
• Related printed handout materials are also available

First, Some Basics
Good News
• FamilySearch published its 2
billionth image in April 2018
• The 1 billionth image was
published in June 2014
• FamilySearch continues to
digitize nearly 1M images per
day from microfilm and about 320
cameras worldwide
• FamilySearch has nearly 6.4B indexed
names of people in records
• Record hinting has already made
FamilySearch Family Tree the
best-sourced tree in the
world, with over 1B sources
attached to persons in the tree
Bad News
• Many records are only available
as images via the catalog. Only a
fraction of records have been
indexed
• Indexing isn’t keeping up with the
ability to digitize images,
especially in non-English
languages
• Current available record images
do not match church membership
in some areas
• Only indexed records can be
presented as record hints

Historical Records Images by Region at FamilySearch
[Pie chart. Legend: North America; Europe and Middle East; Latin America; Other; Asia; Africa/Pacific]
LDS Church Membership by Region
[Pie chart. Legend: North America; Europe and Middle East; Latin America; Other; Asia; Africa/Pacific]

Changing the Records Publication
Paradigm
• Several teams at FamilySearch are dedicated to improving the
records publication platform
• The Goal: Provide more findable, relevant, curated records for
gathering multi-generational families from around the world
• Want to publish and make hintable 20% of the top tier records
in 50 of the highest priority countries within 15 years
• 58% coverage in North America as of 2017
• Crossed 20% in 3 more countries in 2017 (Denmark, Finland and Sweden)
• Major release of Mexican records in 2018
• Seeking to allow homelands to be more involved in building
local content
• Will support user corrections to records and indexing on-the-fly
• Will use automated technologies to accelerate publication

International Conference on Document
Analysis and Recognition (ICDAR) 2011
Beijing Friendship Hotel

First Mini-Explosion
• Partnership with GenealogyBank to extract
data from born digital obituaries
• First run indexed 5M obituaries in 10 hours,
saving about 150 man-years of indexing
• 23M obituaries indexed as of May 2018,
many more coming
• Uses recent advancements in machine
learning and artificial intelligence (AI)
• Can produce even more information than
indexing (e.g., in-law couple relationships)
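
The first-run figures above imply a throughput that is easy to check. The sketch below is back-of-the-envelope arithmetic only. The 2,000 working hours per person-year is an assumption made for illustration, not a number from the presentation.

```python
# Figures stated on the slide.
OBITUARIES = 5_000_000        # obituaries indexed in the first run
MACHINE_HOURS = 10            # wall-clock hours for that run
PERSON_YEARS_SAVED = 150      # human indexing effort saved

# Assumption (not from the slide): one person-year is about 2,000 working hours.
HOURS_PER_PERSON_YEAR = 2_000

machine_rate = OBITUARIES / MACHINE_HOURS                   # obituaries per hour
human_hours = PERSON_YEARS_SAVED * HOURS_PER_PERSON_YEAR    # total human hours
minutes_per_obituary = human_hours * 60 / OBITUARIES        # implied human pace

print(f"Machine throughput: {machine_rate:,.0f} obituaries/hour")
print(f"Implied human pace: {minutes_per_obituary:.1f} minutes/obituary")
```

Under that assumption, the 150 man-years correspond to a few minutes of human effort per obituary, which seems plausible for reading and keying one record.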

GenealogyBank collection of obituaries is available now
Improvements to correcting data are coming

What is Being Done Now
• Refining research code and models to be more
stable, reproducible and measurable
• Can now publish 1M obituaries a month,
with capacity continuing to increase
• Built on scalable Amazon Web Services to meet
any future demands

How are Artificial Intelligence,
Machine Learning and Deep
Learning Related?
Artificial Intelligence – Machines exhibiting
human intelligence
• General AI – still science fiction
• Narrow AI – technologies that perform
specific tasks as well as or better than humans
Machine Learning – Practice of using algorithms
to parse data, learn from it, and then make a
determination or prediction about something
in the world
Deep Learning – Using much larger machine
learning neural networks requiring more
training data and computational power
[Diagram labels: Artificial Intelligence; Machine Learning; Deep Learning]

Machine Learning Isn’t Really New
• Been around for decades
• Spam filters in 1990s
• OCR (Optical Character Recognition)
• FamilySearch already uses it for some things
• Match classifier
• Possible duplicates (person – person)
• Record hinting (person – record)
• FamilySearch is beginning to explore new uses
• Research Team -> Automated Content Extraction
• Exploring Deep Learning and other methods to automatically
understand historical documents

How is Machine Learning different
from traditional programming?
Machine Learning means having computers learn
from data instead of hand-writing rules (i.e. code) to solve
problems
[Diagram, traditional programming: Study the Problem; Write Rules; Evaluate; Launch!; Analyze Errors]
[Diagram, machine learning: Study the Problem; Train ML Algorithm (using Data); Evaluate; Launch!; Analyze Errors]
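
The contrast between writing rules and learning from data can be shown with a toy spam filter, one of the older uses of machine learning. This is a minimal sketch: the messages are made up, and the "model" is a simple word-count vote, far simpler than a real spam filter or anything FamilySearch uses.

```python
from collections import Counter

# Tiny, made-up labeled examples (hypothetical data for illustration only).
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("family reunion photos attached", "ham"),
    ("census record for your grandfather", "ham"),
]

def rule_based(text):
    """Traditional programming: a person writes the rule by hand."""
    return "spam" if "free" in text.split() else "ham"

def train(examples):
    """Machine learning: count which words appear under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Label a message by which class has seen its words more often."""
    words = text.split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

model = train(TRAINING)
message = "click now to claim your prize"
print(rule_based(message))      # the hand-written rule only looks for "free"
print(predict(model, message))  # the learned counts use every word seen in training
```

Improving the rule-based version means editing code. Improving the learned version means adding more labeled examples to the training data and training again.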

Necessary Technologies
• Natural Language Processing (NLP)
• Named entity recognition (NER) – identify the names,
dates, places, etc.
• Relation extraction – identify relationships between the
names, dates & places
• Additional processing to get into format for
publication, standardize data, etc.
• Notice the steps are similar to what a
genealogist would do
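
The two NLP steps above can be sketched on a single sentence. This is a toy illustration under stated assumptions: the obituary text is invented, and regular expressions stand in for the trained models a real system would use. The function names `find_entities` and `find_relations` are hypothetical.

```python
import re

# Hypothetical obituary text, invented for illustration.
TEXT = ("Mary Ann Smith was born 3 March 1915 in Provo, Utah, and died "
        "12 June 1980. She is survived by her husband John Smith.")

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = rf"\d{{1,2}} (?:{MONTHS}) \d{{4}}"   # e.g. "3 March 1915"
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"      # two or more capitalized words

def find_entities(text):
    """Step 1 (NER): return (kind, value) pairs in order of appearance."""
    found = [("DATE", m.group(), m.start()) for m in re.finditer(DATE, text)]
    found += [("NAME", m.group(), m.start()) for m in re.finditer(NAME, text)]
    return [(kind, value) for kind, value, _ in sorted(found, key=lambda e: e[2])]

def find_relations(text):
    """Step 2 (relation extraction): link the entities with simple patterns."""
    subject = re.match(NAME, text)  # assumes the text opens with the deceased's name
    if not subject:
        return []
    person = subject.group()
    patterns = {
        "born": rf"born ({DATE})",
        "died": rf"died ({DATE})",
        "spouse": rf"survived by (?:his|her) (?:husband|wife) ({NAME})",
    }
    relations = []
    for label, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            relations.append((person, label, match.group(1)))
    return relations

print(find_entities(TEXT))
print(find_relations(TEXT))
```

The output of the second step is a set of (person, relation, value) triples, which is closer to the form that would then be standardized and prepared for publication.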

Identification and Extraction of Data

Live Demos
Lille E. Yeckley 1915-1980

Document Type | Record Type | Language | Status in May 2018
Digital text | Obituaries | English | Already published 23M; working to continuously publish
Typewritten newspaper text | Obituaries | English | Active research
Handwritten text | Wills and deeds | English | Active research
Handwritten calligraphy | Genealogies | Chinese | Preliminary research
Handwritten text | Church records | Spanish | Preliminary research
More document types | More record types | More languages | Expect future “explosions”

What You Can Do
• Keep Indexing
• It is still valuable, especially in non-English languages
• Remember indexed data is the foundation for training machines
to auto-index correctly
• We’ll also likely keep using human indexing to
measure how the machines are doing
• Understand your role in correcting records that
have been automatically indexed incorrectly
• Be patient as solutions continue to expand,
perhaps on collections that don’t benefit your
research, remembering we are a global church
• Pray for the Lord’s help to bless these efforts
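
One way human indexing could be used to measure the machines is to compare automatically indexed fields against the same records indexed by people. The sketch below assumes that setup. The records, field names, and the `field_accuracy` helper are all hypothetical and do not describe FamilySearch's actual evaluation.

```python
# Hypothetical field values for three records; not real FamilySearch data.
human_index = [
    {"name": "Mary Smith", "birth_year": "1915", "place": "Provo"},
    {"name": "John Baker", "birth_year": "1880", "place": "Leeds"},
    {"name": "Ana Lopez", "birth_year": "1902", "place": "Puebla"},
]
auto_index = [
    {"name": "Mary Smith", "birth_year": "1915", "place": "Provo"},
    {"name": "John Barker", "birth_year": "1880", "place": "Leeds"},
    {"name": "Ana Lopez", "birth_year": "1902", "place": ""},
]

def field_accuracy(truth, predicted):
    """Fraction of records where each field matches the human-indexed value."""
    fields = truth[0].keys()
    return {
        f: sum(t[f] == p[f] for t, p in zip(truth, predicted)) / len(truth)
        for f in fields
    }

scores = field_accuracy(human_index, auto_index)
for field, score in scores.items():
    print(f"{field}: {score:.0%} match")
```

Fields with low agreement point to where the automated output most needs user corrections or more training data.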

[Diagram labels: Infinity; Automated Technologies; Truth / Training Data; Indexing; User Corrections]

“We always overestimate the change that will
occur in the next two years and
underestimate the change that will occur in the
next ten.”
Bill Gates

Tale of Three Decades
1998-2007 – Laying the technological foundation
1996 – GEDCOM 5.5 standard released (still supported)
1999 – PAF 4.0 – First Windows version
2002 – PAF 5.2 – Last major version
2004 – First vault microfilms converted to digital images
2007 – First digital images from the vault published on FamilySearch.org
2008-2017 – Single publicly available tree integrated with historical records
2010 – Launch of FamilySearch record search (>1B names, millions of images)
2006 – FamilySearch indexing began
2007 – FamilySearch Research wiki started
2009 – new.familysearch became available in Utah (limited rollout began in 2007)
2009 – I began to work at FamilySearch
2011 – RootsTech conference began
2013 – Family Tree added – made available to non-LDS patrons
2013 – Memories (photos & stories) initial rollout
2014 – Partnerships with Ancestry, MyHeritage and FindMyPast
2014 – Record hinting
2014 – First FamilySearch mobile app released
2015 – User to User Messaging
2015 – Printing temple cards from home in 44 languages
2016 – Family Tree moved to scalable servers
2017 – Web indexing
2018 – Family Tree Lite
2018-2027 – Worldwide explosion of records
2017 – Nordic Records – Year of the Viking – Scandinavian countries (Sweden, Denmark, Finland) are the first 3 of the top 50 countries
2018 – Mexican Civil Records project – 60M records
???? – Billions more indexed records made available via automatic indexing technologies
???? – User corrections of records supported
???? – DNA Features?
???? – ????

Thank you!
I hope you’ve been inspired
Keep an eye out for more explosions
