Background
• Nearly 10 years as a Software Engineer at FamilySearch
• Currently on the Automated Content Extraction team
• Try to do my own genealogy and help others
• Hope I’ll be able to help you see a vision of the future and
how you can be an active part of bringing it to pass
• Go to https://www.slideshare.net/bakers84 or e-mail me
(bakerb@familysearch.org) to get a copy of this
presentation
First, Some Basics About the “Big 4”
FamilySearch, Ancestry, MyHeritage and FindMyPast
Good News
• Each of the “Big 4” has
published 5-10 billion indexed
names of people from records
• All the “Big 4” utilize record
hinting to help users source
persons in their trees with
records. (Ex. Over a billion
sources have been attached to
persons in FamilySearch Family
Tree.)
• Hundreds of cameras worldwide
continue to digitize millions of
images per day
Bad News
• Only a small fraction of historical
records have been digitized
• Only a small fraction of those
digitized images has been
indexed
• Indexing isn’t keeping up with the
ability to digitize images,
especially in non-English
languages
• For-profit genealogy companies
are mostly using offshore
indexing
• Only indexed records can be
presented as record hints
Accelerating Records Publication
• All of the “Big 4” want to publish more records
• All of the “Big 4” are already using automated
technologies to accelerate records publication
• Efforts are underway to help homelands
publish their own records more effectively
• More responsibility is moving to users to correct
errors in records
FamilySearch
• Partnership with GenealogyBank to extract data from
born-digital obituaries
• First run indexed 5M obituaries in 10 hours, saving about
150 man-years of indexing
• 23M obituaries auto-indexed as of Nov 2019, with more
likely coming
• Millions of historical newspaper death stories starting to
be released
• Uses recent advancements in machine learning and
artificial intelligence (AI)
• Can produce even more information than indexing (Ex.
In-law couple relationships)
• https://www.familysearch.org/search/collection/2333694
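As a rough sanity check on the first-run figures above, the arithmetic can be sketched as follows. The obituary count, run time and man-years saved come from the slide; the 2,000 working hours per man-year is an assumption added here for illustration, so the derived numbers are only approximate.

```python
# Figures from the slide.
obituaries = 5_000_000      # obituaries indexed in the first run
machine_hours = 10          # wall-clock hours the first run took
man_years_saved = 150       # estimated human indexing effort saved

# Assumption (not from the slide): one man-year is about 2,000 working hours.
HOURS_PER_MAN_YEAR = 2_000

human_hours = man_years_saved * HOURS_PER_MAN_YEAR
minutes_per_obituary = human_hours * 60 / obituaries
obituaries_per_machine_hour = obituaries // machine_hours
# Person-hours of indexing replaced per wall-clock hour of the run.
person_hours_per_machine_hour = human_hours / machine_hours

print(f"Implied human effort: {human_hours:,} hours")
print(f"Implied human time per obituary: {minutes_per_obituary:.1f} minutes")
print(f"Machine throughput: {obituaries_per_machine_hour:,} obituaries per hour")
print(f"Person-hours replaced per machine hour: {person_hours_per_machine_hour:,.0f}")
```

Under that assumption, the slide's estimate implies a few minutes of human indexing time per obituary, which seems plausible for reading and keying a short article.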
How are Artificial Intelligence,
Machine Learning and Deep
Learning Related?
Artificial Intelligence
Machines exhibiting human intelligence
• General AI – Still science fiction
• Narrow AI – Technologies that perform specific
tasks as well as or better than humans
Machine Learning
Practice of using algorithms to parse data, learn from it,
and then make a determination or prediction about
something in the world
Deep Learning
Machine learning with much larger neural networks,
which require more training data and computational power
[Diagram: nested circles, with Deep Learning inside Machine Learning inside Artificial Intelligence]
How is Machine Learning different
from traditional programming?
Machine Learning lets computers learn from data
instead of relying on hand-written rules (i.e. code) to
solve problems
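[Diagram, traditional approach: Study the Problem → Write Rules → Evaluate → Launch!, with an Analyze Errors feedback loop]
[Diagram, machine learning approach: Study the Problem → Train ML Algorithm (fed by Data) → Evaluate → Launch!, with an Analyze Errors feedback loop]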
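The contrast between writing rules and learning from data can be sketched with a toy spam filter. This is a minimal illustration, not how any production system works: the banned-word list, the six training messages and the function names are all made up for the example, and the learner is a tiny naive Bayes word-count model written from scratch.

```python
import math
from collections import Counter

def rule_based_is_spam(message: str) -> bool:
    """Traditional programming: a person writes the rules by hand."""
    banned = {"winner", "prize", "free"}  # hypothetical hand-picked words
    return any(word in banned for word in message.lower().split())

def train(examples):
    """Machine learning: word statistics are learned from labeled data."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, message: str) -> str:
    """Pick the label with the highest naive Bayes log-probability."""
    word_counts, label_counts = model
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_examples)
        for word in message.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data for illustration only.
TRAINING_DATA = [
    ("claim your free prize now", "spam"),
    ("you are a winner claim cash", "spam"),
    ("cheap loans click now", "spam"),
    ("family reunion photos attached", "ham"),
    ("found grandma in the census", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

model = train(TRAINING_DATA)
message = "click for cheap loans"
print("Hand-written rules say spam:", rule_based_is_spam(message))
print("Learned model says:", predict(model, message))
```

The hand-written rules miss this message because nobody listed "cheap" or "loans", while the learned model picks those words up from the examples. Improving the rules means writing more code; improving the model means supplying more labeled data.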
Machine Learning Isn’t Really New
• Been around for decades
• Spam filters in 1990s
• OCR (Optical Character Recognition)
• “Big 4” family history companies have also
been using it for some time
• Record hinting (person – record matching)
• Possible duplicates (person – person matching)
• Exploring new uses
• Automated Content Extraction (aka Auto-Indexing)
• Facial Recognition
Necessary Technologies for
Auto-Indexing
• Text Recognition
• Zoning/text tiling to separate into individual records
• Segment text into lines
• Recognize typed and/or handwritten characters
• Use additional context to determine most likely words
• Natural Language Processing (NLP)
• Named entity recognition (NER) – identify names, dates, places, etc.
• Relation extraction – identify relationships between the names, dates
& places
• Additional processing to get into format for
publication, standardize data, etc.
• Notice the steps are very similar to what a
genealogist would do
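The NLP steps listed above (named entity recognition, then relation extraction) can be sketched on a single sentence. This is a toy stand-in that uses hand-written regular expressions rather than the trained models an auto-indexing system would use; the sample sentence, the patterns and the function names are all invented for illustration.

```python
import re

# Invented sample sentence for illustration; not taken from any real record.
TEXT = ("John Smith died March 3, 1998 in Provo, Utah. "
        "He is survived by his wife Mary Smith.")

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
FULL_NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"

DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
NAME_RE = re.compile(rf"\b{FULL_NAME}\b")
PLACE_RE = re.compile(r"\bin ([A-Z][a-z]+, [A-Z][a-z]+)")
DEATH_RE = re.compile(rf"({FULL_NAME}) died")
KIN_RE = re.compile(rf"\bhis (wife|son|daughter) ({FULL_NAME})")

def extract_entities(text):
    """Step 1 (NER): find the names, dates and places mentioned."""
    return {
        "names": NAME_RE.findall(text),
        "dates": DATE_RE.findall(text),
        "places": PLACE_RE.findall(text),
    }

def extract_relations(text):
    """Step 2 (relation extraction): connect the entities to the deceased."""
    death = DEATH_RE.search(text)
    if death is None:
        return []
    deceased = death.group(1)
    relations = []
    date = DATE_RE.search(text)
    if date:
        relations.append((deceased, "death_date", date.group(0)))
    place = PLACE_RE.search(text)
    if place:
        relations.append((deceased, "death_place", place.group(1)))
    for kinship, relative in KIN_RE.findall(text):
        relations.append((deceased, kinship, relative))
    return relations

print(extract_entities(TEXT))
for relation in extract_relations(TEXT):
    print(relation)
```

Hand-written patterns like these break as soon as the wording changes, which is one reason trained models are attractive for this step; the two-stage shape (find entities, then relate them) is the part worth noticing.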
Zoning/Text Tiling
Find record boundaries within/across image(s)
• Articles on a newspaper page
• Marriage records flowing across pages
• Pages of a probate record
Line Segmentation/Text Recognition
---
9
WARRANTY DEED.
Baker, Jones & Co., Printers, in Washington St., Buffalo, N. Y.
This Indenture, Made this Fourteenth day of February in the year of
our Lord one Thousand Eight Hundred and Eighty - Five, Between
Della Fuller of Oleare County of Cattaraugus and State of New York
of the first part, and William Parsel of the same place
of the second part,
Witnesseth, that the said party of the first part, in consideration of the sum of
Fifteen hundred Dollars
to her duly paid has sold, and by these Presents, does grant and convey to
the said party of the second
part his heirs and assigns that tract or parcel of Land situate in the Town and
lenty of Olean County of
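The recognized text above reads "Oleare" in one place and "Olean" in another. One way to "use additional context to determine most likely words" is to compare uncertain tokens against a list of known place names. The sketch below uses Python's standard `difflib` for that; the gazetteer contents, the cutoff and the function names are assumptions made for this example, and treating "Oleare" as a misreading of "Olean" is itself an inference from the sample, not a confirmed correction.

```python
import difflib

# Hypothetical gazetteer of known place names (illustration only).
GAZETTEER = ["Olean", "Buffalo", "Cattaraugus", "Albany"]

def correct_place(token, gazetteer=GAZETTEER, cutoff=0.6):
    """Return the closest known place name, or the token unchanged."""
    if token in gazetteer:
        return token
    matches = difflib.get_close_matches(token, gazetteer, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_line(line):
    """Check only capitalized tokens, since place names are capitalized."""
    words = []
    for word in line.split():
        words.append(correct_place(word) if word[0].isupper() else word)
    return " ".join(words)

ocr_line = "Della Fuller of Oleare County of Cattaraugus and State of New York"
print(correct_line(ocr_line))
```

Personal names such as "Della" and "Fuller" are left alone because nothing in the gazetteer is close enough to them. A real system would weigh this kind of evidence together with the recognizer's own confidence rather than overwrite the text outright.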
Pros and Cons of Automated Technologies
Good News
• Can produce records much more
quickly than human indexing
• Can scale much larger than
number of indexers
• Provides searchable/hintable
records much sooner than they’d
be otherwise available
• Is cheaper than paying for indexers
and associated costs
• Can be used on records not well
suited to human indexing
• Can extract more information than
indexers may be able to
• Once trained, can be applied to
languages with very few indexers
Bad News
• Sometimes doesn’t produce results
as good as human indexing
• Requires more human judgment
to properly attach as a source
• May require different methods of
searching to find records
• Requires ability to correct errors
in records
Beware of Automated Computer Indexing
“We all need to be aware that these
types of errors will occur …
Even with the indexing errors,
searching in digitized collections is
much easier these days than it was
searching newsprint and/or microfilm
of newspaper pages 20 years ago.
I greatly appreciate the efforts by
companies like Ancestry.com …
I'm not complaining here - just making
the point that we need to expect
errors like this will be made, and we
need to be flexible in our searches if
we don't get results when we use an
exact name or date or place.”
Randy Seaver
https://www.geneamusings.com/2019/10/beware-of-automated-computer-indexing.html
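Being "flexible in our searches" can be as simple as replacing an exact name with a wildcard pattern. The sketch below shows the idea with Python's standard `fnmatch` module; the indexed names are invented spelling variants, and the function names are hypothetical. It is not a description of how any particular site implements search.

```python
import fnmatch

# Invented index entries, including plausible misspellings (illustration only).
INDEXED_NAMES = [
    "William Parsel",
    "Wm. Parsell",
    "William Pursel",
    "Della Fuller",
    "Delia Fuller",
]

def exact_search(query, names):
    """Return only the entries that match the query character for character."""
    return [name for name in names if name == query]

def wildcard_search(pattern, names):
    """Match with wildcards: * is any run of characters, ? is one character."""
    pattern = pattern.lower()
    return [name for name in names
            if fnmatch.fnmatchcase(name.lower(), pattern)]

print(exact_search("William Parsell", INDEXED_NAMES))
print(wildcard_search("*p?rsel*", INDEXED_NAMES))
print(wildcard_search("del?a fuller", INDEXED_NAMES))
```

The exact search finds nothing because no entry is spelled exactly that way, while the wildcard patterns pick up all of the variants.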
User Corrections
• Errors reported via
the Errors tab will be
fixed fairly quickly
• Ability to correct
names available in
many collections
• Ability to correct
dates, places,
relationships and
more coming
Collections to Watch
The following collections on FamilySearch contain
automatically indexed records:
• United States, GenealogyBank Obituaries, 1980-2014
More recent “born digital” obituaries, at least 23M indexed by
computer
• United States, GenealogyBank Historical Newspaper Obituaries,
1815-2011
Death stories taken from historical newspaper articles, hundreds of
thousands published as of Nov 2019, millions more coming
• New York, Wills and Deeds, ca. 1700s-2017
First foray into automatically indexing handwritten records
Millions of records coming for all 50 states eventually
What You Can Do
• Keep Indexing on FamilySearch
• It is still valuable, especially in non-English languages
• Remember indexed data is the foundation for training machines
to auto-index correctly
• We’ll also likely keep using human indexing to
measure how the machines are doing
• Understand your role in correcting records that
have been automatically indexed incorrectly
• Be patient as solutions continue to expand,
perhaps on collections that don’t benefit your
research, remembering that users elsewhere in the
world have far fewer records than many of us
• Pray for God’s help to bless these efforts
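The point above about using human indexing to measure how the machines are doing can be sketched as a precision and recall calculation, treating the human index as the reference. This is one common way to score extraction output, not necessarily the measure FamilySearch uses; the field values and the function name are made up for the example.

```python
def precision_recall(machine_fields, human_fields):
    """Score machine-indexed fields against a human-indexed reference."""
    machine, human = set(machine_fields), set(human_fields)
    agreed = len(machine & human)
    # Precision: share of the machine's fields that the human index confirms.
    precision = agreed / len(machine) if machine else 0.0
    # Recall: share of the human-indexed fields that the machine also found.
    recall = agreed / len(human) if human else 0.0
    return precision, recall

# Invented example: one record indexed by a person and by a machine.
human_index = {
    ("name", "Della Fuller"),
    ("name", "William Parsel"),
    ("place", "Olean"),
    ("place", "Cattaraugus"),
    ("date", "14 Feb 1885"),
}
machine_index = {
    ("name", "Della Fuller"),
    ("name", "William Parsel"),
    ("place", "Oleare"),        # misread place name
    ("date", "14 Feb 1885"),
}

precision, recall = precision_recall(machine_index, human_index)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here the machine got three of its four fields right and found three of the five fields the human recorded. Tracking numbers like these over time is only possible if people keep indexing.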
Bonus Tip: Image Search
Impatient and want to search images yourself?
Try https://www.familysearch.org/records/images/
“We always overestimate the change that will
occur in the next two years and
underestimate the change that will occur in the
next ten.”
Bill Gates
Tale of Three Decades
1998-2007 – Laying the technological foundation
1996 – GEDCOM 5.5 standard released (still supported)
1999 – PAF 4.0 – First Windows version
2002 – PAF 5.2 – Last major version
2004 – First vault microfilms converted to digital images
2006 – FamilySearch indexing began
2007 – FamilySearch Research wiki started
2007 – First digital images from the vault published on FamilySearch.org
2008-2017 – Single publicly available tree integrated with historical records
2010 – Launch of FamilySearch record search (>1B names, millions of images)
2009 – new.familysearch became available in Utah (limited rollout began in 2007)
2009 – I began to work at FamilySearch
2011 – RootsTech conference began
2013 – Family Tree added – made available to non-LDS patrons
2013 – Memories (photos & stories) initial rollout
2014 – Partnerships with Ancestry, MyHeritage and FindMyPast
2014 – Record hinting
2014 – First FamilySearch mobile app released
2015 – User to User Messaging
2015 – Printing temple cards from home in 44 languages
2016 – Family Tree moved to scalable servers
2017 – Web indexing
2018 – Family Tree Lite
2018-2027 – Worldwide explosion of records
2017 – Nordic Records – “Year of the Viking”: Scandinavian countries (Sweden, Denmark, Finland), first 3 of top 50 countries
2018 – Mexican Civil Records project – 60M records
???? – Billions more indexed records made available via automatic indexing technologies
???? – User corrections of records supported
???? – DNA Features?
???? – ????
Thank you!
I hope you’ve learned some useful things
I hope you’ve been inspired
I hope you’re excited for what is coming
I hope you’ll help evangelize why this revolution is good