A Comparison of Three Systems
Tom Creighton
CTO & Lead Architect
FamilySearch International
tc@familysearch.org
© DataStax, All Rights Reserved. Confidential
FamilySearch – Who We Are
● Non-profit organization dedicated to helping people make
joyful family discoveries.
● Founded as the Genealogical Society of Utah in 1894.
● More than 100 years gathering the world’s records.
● Free web tools, apps, and resources at
www.familysearch.org.
● Sponsored and funded by the Church of Jesus Christ of
Latter-day Saints
FamilySearch – What We Do
● Gather and preserve the world’s records – Everyone
deserves to be remembered
● Provide free access to records and tools
on www.familysearch.org that help people
discover who they are.
● Run the Family History Library in Salt
Lake City and more than 5,000 affiliated
Family History Centers.
● Enable people around the world to
gather and share their own stories and
connect with other family members.
FamilySearch – Why Do It?
● Gather and preserve the world’s records – Everyone
deserves to be remembered
● Family is the key unit of society. When we
strengthen families, we strengthen society.
● When we realize we are all part of the
human family we treat each other differently.
● Understanding the stories of our families
provides healing, strength, and resiliency.
● Family relationships are eternal in nature.
● Everyone deserves to be remembered as a member of
God’s family.
Comparing Three Systems
● Family Tree
− A wiki-like system for publishing and collaborating
on genealogical conclusions: a pedigree of
mankind
● Hinting
− ML-based recommendations
● Resource Metadata System (RMS)
− Metadata management for billions of source
records
Family Tree Application
● Enable collaboration among FH researchers
● Maintain a common pedigree
● Provide repository for FH research conclusions
● Provide multiple views of data
● Facilitate multiple parental branches
● Support tens of thousands of concurrent users
● Work well around the world/multiple languages
● Manage billions of ancestor records
Family Tree DB Overview
● DSE 5.1, Cassandra-only workload
● 21 nodes in main DC serving user reads/writes
● Peak day is Sunday, every week
● Peak months are Jan-Feb and Jul-Sep
● Reads: 250k/sec sustained, top node 15k/sec
● Writes: 9.3k/sec sustained
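As a rough back-of-the-envelope check on these figures, the sketch below computes the average read rate per node and how far the busiest node sits above that average. It assumes the 250k/sec figure is the aggregate across all 21 nodes in the main DC and is measured at the same level as the 15k/sec per-node figure; the slide does not say so explicitly. The function and constant names are illustrative only.

```python
# Figures taken from the slide. Treating 250k/sec as a cluster-wide aggregate
# measured at the same level as the per-node number is an assumption.
NODES = 21
AGGREGATE_READS_PER_SEC = 250_000
TOP_NODE_READS_PER_SEC = 15_000


def mean_per_node(aggregate: float, nodes: int) -> float:
    """Average rate per node if load were spread perfectly evenly."""
    return aggregate / nodes


def skew_ratio(top_node: float, mean: float) -> float:
    """How many times the mean rate the busiest node is handling."""
    return top_node / mean


mean_reads = mean_per_node(AGGREGATE_READS_PER_SEC, NODES)
skew = skew_ratio(TOP_NODE_READS_PER_SEC, mean_reads)

print(f"mean reads/node/sec: {mean_reads:,.0f}")
print(f"top node vs mean:    {skew:.2f}x")
```

Under these assumptions the busiest node carries roughly a quarter more read traffic than an evenly loaded node would.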
Family Tree DB Overview
Current Topology:
• Primary ring: 21 nodes, RF=3, i3.4xlarge, ephemeral disk
• Disk safety ring: 6 nodes, RF=1, r5.2xlarge, 4500GB EBS
• Remote, Secondary ring: 15 nodes, RF=2, r4.2xlarge,
2500GB EBS
• EBS volume snapshots on secondary ring
Near-term Topology Changes:
• Remote, Secondary ring: 14 nodes, RF=2, r5.2xlarge,
4500GB EBS
• Eliminate disk safety ring
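The replication factors in the current topology determine how many replicas must respond at quorum consistency. The sketch below applies Cassandra's standard quorum formula, floor(RF / 2) + 1, to each ring. Treating each ring as a separate datacenter, and the ring labels used as dictionary keys, are assumptions made for illustration; the slides do not say whether the application uses a per-ring or a cluster-wide quorum, so both are shown.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


# Replication factors from the "Current Topology" slide.
# Treating each ring as its own datacenter is an assumption.
rings = {"primary": 3, "disk_safety": 1, "secondary": 2}

for name, rf in rings.items():
    print(f"{name}: RF={rf} quorum={quorum(rf)} tolerates={tolerated_failures(rf)}")

# A cluster-wide quorum counts replicas across all rings together.
total_rf = sum(rings.values())
print(f"cluster-wide: RF={total_rf} quorum={quorum(total_rf)}")
```

With RF=3 on the primary ring, a per-ring quorum needs two replicas and survives the loss of one, which is one common reason RF=3 is chosen for rings that serve user traffic.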
Family Tree DB Details
• Read quorum latencies:
  − p50=1.6ms
  − p99=10ms
  − p999=75ms
  − max around 1000ms (timeout 3000ms)
• Read one latencies:
  − p50=0.15ms
  − p99=1.13ms
  − p999=5ms
  − max around 1000ms
• Write quorum latencies:
  − p50=1.13ms
  − p99=5.8ms
  − p999=62ms
  − max around 1000-2000ms (timeout 5000ms)
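For readers less familiar with the p50/p99/p999 notation, the sketch below shows one common way such figures are derived from raw samples, using the nearest-rank percentile definition. The slides do not say how FamilySearch computes its percentiles, and the sample data here is synthetic and purely illustrative. It is chosen to show how a single slow request can push the maximum far above p999.

```python
import math
from fractions import Fraction


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Fraction(str(pct)) keeps values like 99.9 exact, so the rank is not
    # shifted by binary floating-point rounding.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)  # 1-based
    return ordered[rank - 1]


# Synthetic latencies in milliseconds (NOT FamilySearch data):
# 1,000 requests, mostly fast, with a short slow tail and one outlier.
latencies_ms = [1.0] * 980 + [10.0] * 15 + [75.0] * 4 + [1000.0]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)

print(f"p50={p50}ms p99={p99}ms p999={p999}ms max={max(latencies_ms)}ms")
```

Here p999 is the 99.9th percentile: one request in a thousand is slower than that value, which is why the maximum can sit an order of magnitude above it.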
Hinting Application
• Manage results from a large ML job
• Precomputes likely connections between tree and
sources
• Trillions of comparisons
• Supports both ongoing and batch updates
• Enables users to more easily locate relevant data
• Tracks use of the hints
Hinting DB Overview
• All reads/writes are quorum
• Aggregate reads: 192K/sec
• Aggregate writes: 12K/sec
• Data model: 3 partitions, many columns + blobs
• One ring; 27 nodes: i3.2xlarge; ephemeral disk only
• RF=3; no volume backup
Records Metadata System (RMS) Application
• Manage metadata about digital artifacts (images)
together in a single searchable store
• Support publication workflow on these artifacts
• Manage entitlements (access permissions) on
these artifacts
RMS Application – Manage Metadata
• Artifact Date – for example: Belfast Ireland Death
Records from 1850-1878
• Artifact Type – for example: Belfast Ireland Death
Records from 1850-1878
• Artifact Place – for example: Belfast Ireland Death
Records from 1850-1878
• As of May 2019, ~3.7 billion records managed
RMS Application – Support Publication Workflow
• We capture digital images of paper records and
then process, store, and preserve those digital
artifacts.
• We transcribe (index) data from digital artifacts and
make that data searchable as well.
• The ability to search artifacts by date, place, and
record type is essential to determining what to put
through different parts of our publication workflow.
• As of May 2019, ~3.7 billion records managed
RMS DB Overview
• DSE 5.1, Cassandra and DSE Search workload
• Primary ring supports both read/write and search
• All reads/writes are quorum
• Data model: 3 partitions, many columns + blobs
• Reads: 60K/second
• Writes: 40K/second
RMS DB Overview
Current Topology:
• Primary ring: 24 nodes, RF=3, i3.4xlarge, ephemeral disk
• Remote, Secondary ring: 8 nodes, RF=2, r5.2xlarge,
4500GB EBS
• EBS volume snapshots on secondary ring
Near-term Topology Changes:
• Secondary ring to move to remote DC
RMS DB Details
• Read quorum latencies:
  − p50=1.95ms
  − p99=10ms
  − p999=90ms
  − max ~3000ms (timeout 5000ms)
• Write quorum latencies:
  − p50=1.13ms
  − p99=10ms
  − p999=150ms
  − max ~1000-2000ms (timeout 5000ms)
Summary Comparison
System Name  Read Rate    Write Rate   Canonical Data?  Multi-ring?  Node Count  Primary Node Type  Primary Backup?  Higher-level Services
Family Tree  250K/second  9.3K/second  yes              yes          21          i3.4xlarge         yes              none
Hinting      192K/second  12K/second   no               no           27          i3.2xlarge         no               none
RMS          60K/second   40K/second   yes              yes          24          i3.4xlarge         soon             DSE Search
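To make the differences between the three workloads easier to see, the sketch below derives each system's read-to-write ratio and a combined operations-per-node figure from the summary table. The class and field names are illustrative, and the per-node figure assumes load is spread evenly across the listed node count, which the slides do not claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemStats:
    name: str
    reads_per_sec: int
    writes_per_sec: int
    node_count: int  # matches the primary ring sizes given in the slides

    @property
    def read_write_ratio(self) -> float:
        """Reads per write."""
        return self.reads_per_sec / self.writes_per_sec

    @property
    def ops_per_node(self) -> float:
        """Combined reads and writes per node, assuming an even spread."""
        return (self.reads_per_sec + self.writes_per_sec) / self.node_count


# Values copied from the summary table.
SYSTEMS = [
    SystemStats("Family Tree", 250_000, 9_300, 21),
    SystemStats("Hinting", 192_000, 12_000, 27),
    SystemStats("RMS", 60_000, 40_000, 24),
]

for s in SYSTEMS:
    print(
        f"{s.name:<12} R:W = {s.read_write_ratio:5.1f}:1  "
        f"ops/node/sec = {s.ops_per_node:8,.0f}"
    )
```

Under these assumptions Family Tree is heavily read-dominated (about 27 reads per write), while RMS is close to balanced (1.5 reads per write). The per-node figures are not directly comparable across systems because Hinting runs on a smaller instance type than the other two.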
Conclusions
• Canonical data requires special care
• Architecture: DB topology, machine config,
application design and implementation
• Recovery Point/Recovery Time objectives drive
architecture
• Performance & Scale drive architecture
• Availability SLA drives architecture
• Cost trade-offs drive architecture
• Architecture == Balance
THANK YOU

Tc accelerate-2019-05

  • 1. A Comparison of Three Systems Tom Creighton CTO & Lead Architect Family Search, International tc@familysearch.org
  • 2. © DataStax, All Rights Reserved.ConfidentialConfidential © DataStax, All Rights Reserved. FamilySearch – Who We Are ● Non-profit organization dedicated to helping people make joyful family discoveries. ● Founded as the Genealogical Society of Utah in 1894. ● More than 100 years gathering the world’s records. ● Free web tools, apps, and resources at www.familysearch.org. ● Sponsored and funded by the Church of Jesus Christ of Latter-day Saints
  • 3. © DataStax, All Rights Reserved.ConfidentialConfidential © DataStax, All Rights Reserved. FamilySearch – What We Do ● Gather and preserve the world’s records – Everyone deserves to be remembered ● Provide free access to records and tools on www.familysearch.org that help people discover who they are. ● Run the Family History Library in Salt Lake City and more than 5,000 affiliated Family History Centers. ● Enable people around the world to gather and share their own stories and connect with other family members.
  • 4. © DataStax, All Rights Reserved.ConfidentialConfidential © DataStax, All Rights Reserved. FamilySearch – Why Do It? ● Gather and preserve the world’s records – Everyone deserves to be remembered ● Family is the key unit of society. When we strengthen families, we strengthen society. ● When we realize we are all part of the human family we treat each other differently. ● Understanding the stories of our families provides healing, strength, and resiliency. ● Family relationships are eternal in nature. ● Everyone deserves to be remembered as a member of God’s family.
  • 5. © DataStax, All Rights Reserved.ConfidentialConfidential © DataStax, All Rights Reserved. Comparing Three Systems ● Family Tree − A wiki-like system for publishing and collaborating on genealogical conclusions: a pedigree of mankind ● Hinting − ML-based recommendations ● Resource Metadata System (RMS) − Metadata management for billions of source records
  • 6. Family Tree Application
      ● Enable collaboration among family history (FH) researchers
      ● Maintain a common pedigree
      ● Provide a repository for FH research conclusions
      ● Provide multiple views of data
      ● Facilitate multiple parental branches
      ● Support tens of thousands of concurrent users
      ● Work well around the world and in multiple languages
      ● Manage billions of ancestor records
  • 15. Family Tree DB Overview
      ● DSE 5.1, Cassandra-only workload
      ● 21 nodes in main DC serving user reads/writes
      ● Peak day comes weekly, every Sunday
      ● Peak months are Jan-Feb and Jul-Sep
      ● Reads: 250k/sec sustained, top node 15k/sec
      ● Writes: 9.3k/sec sustained
  • 16. Family Tree DB Overview
      Current Topology:
      ● Primary ring: 21 nodes, RF=3, i3.4xlarge, ephemeral disk
      ● Disk safety ring: 6 nodes, RF=1, r5.2xlarge, 4500GB EBS
      ● Remote, Secondary ring: 15 nodes, RF=2, r4.2xlarge, 2500GB EBS
      ● EBS volume snapshots on secondary ring
      Near-term Topology Changes:
      ● Remote, Secondary ring: 14 nodes, RF=2, r5.2xlarge, 4500GB EBS
      ● Eliminate data safety ring
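
The multi-ring topology on this slide can be illustrated with a keyspace definition. The following is a minimal sketch, assuming each ring is modeled as a logical datacenter of a single cluster replicated with `NetworkTopologyStrategy`. The keyspace name `family_tree` and the datacenter names `primary`, `disk_safety`, and `secondary` are hypothetical; only the replication factors come from the slide.

```python
def create_keyspace_cql(keyspace: str, rf_by_dc: dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement with one replication factor per datacenter."""
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in rf_by_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical datacenter names; the RF values are the ones listed on the slide.
FAMILY_TREE_RINGS = {"primary": 3, "disk_safety": 1, "secondary": 2}

statement = create_keyspace_cql("family_tree", FAMILY_TREE_RINGS)

# Each row is stored once per replica in every ring.
total_copies = sum(FAMILY_TREE_RINGS.values())

print(statement)
print("copies of each row across all rings:", total_copies)
```

Under these assumptions, eliminating the safety ring would amount to dropping one entry from the replication map, which reduces the number of stored copies of each row by that ring's RF.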
  • 17. Family Tree DB Details
      ● read quorum latencies:
          − p50=1.6ms
          − p99=10ms
          − p999=75ms
          − max around 1000ms (timeout 3000ms)
      ● read one latencies:
          − p50=0.15ms
          − p99=1.13ms
          − p999=5ms
          − max around 1000ms
      ● write quorum latencies:
          − p50=1.13ms
          − p99=5.8ms
          − p999=62ms
          − max around 1000-2000ms (timeout 5000ms)
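
For readers unfamiliar with the p50/p99/p999 notation, the sketch below shows one common way to compute such figures: the nearest-rank percentile. It is only an illustration; the slides do not say how the latencies were measured, and the sample data is synthetic, not FamilySearch measurements.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: mostly fast, with a small slow tail.
latencies_ms = [1.0] * 980 + [10.0] * 18 + [75.0] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)

print(f"p50={p50}ms p99={p99}ms p999={p999}ms")
```

The point of reporting p99 and p999 alongside p50 is visible even in this toy data: a handful of slow requests barely moves the median but dominates the tail.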
  • 18. Hinting Application
      ● Manage results from a large ML job
      ● Precomputes likely connections between tree and sources
      ● Trillions of comparisons
      ● Supports both ongoing and batch updates
      ● Enables users to more easily locate relevant data
      ● Tracks use of the hints
  • 24. Hinting DB Overview
      ● All reads/writes are quorum
      ● Aggregate reads: 192K/sec
      ● Aggregate writes: 12K/sec
      ● Data model: 3 partitions, many columns + blobs
      ● One ring: 27 nodes, i3.2xlarge, ephemeral disk only
      ● RF=3; no volume backup
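
The combination of quorum reads, quorum writes, and RF=3 on this slide can be worked through as arithmetic. The sketch below uses the standard Cassandra quorum size, floor(RF / 2) + 1, and the usual overlap condition that a read is guaranteed to touch at least one replica holding the latest acknowledged write when read replicas + write replicas > RF. The function names are illustrative, not part of any driver API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_overlap_writes(rf: int, read_replicas: int, write_replicas: int) -> bool:
    """True when every read set must intersect every acknowledged write set."""
    return read_replicas + write_replicas > rf


RF = 3  # replication factor from the slide
q = quorum(RF)

# Replicas that can be down while quorum operations still succeed.
tolerated_failures = RF - q

quorum_overlap = reads_overlap_writes(RF, q, q)
one_one_overlap = reads_overlap_writes(RF, 1, 1)

print(f"quorum={q}, tolerated failures={tolerated_failures}")
print(f"QUORUM/QUORUM overlap: {quorum_overlap}, ONE/ONE overlap: {one_one_overlap}")
```

With RF=3, quorum operations keep working with one replica unavailable, which is one reason this setting is a common default for a single ring.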
  • 25. Records Metadata System (RMS) Application
      ● Manage metadata about digital artifacts (images) together in a single searchable store
      ● Support publication workflow on these artifacts
      ● Manage entitlements (access permissions) on these artifacts
  • 26. RMS Application: Manage Metadata
      ● Artifact Date, for example: Belfast Ireland Death Records from 1850-1878
      ● Artifact Type, for example: Belfast Ireland Death Records from 1850-1878
      ● Artifact Place, for example: Belfast Ireland Death Records from 1850-1878
      ● As of May 2019, ~3.7 billion records managed
  • 27. RMS Application: Support Publication Workflow
      ● We capture digital images of paper records and then process, store, and preserve those digital artifacts.
      ● We transcribe (index) data from digital artifacts and make that data searchable as well.
      ● The ability to search artifacts by date, place, and record type is essential to determining what to put through different parts of our publication workflow.
      ● As of May 2019, ~3.7 billion records managed
  • 34. RMS DB Overview
      ● DSE 5.1, Cassandra and DSE Search workload
      ● Primary ring supports both read/write and search
      ● All reads/writes are quorum
      ● Data model: 3 partitions, many columns + blobs
      ● Reads: 60K/second
      ● Writes: 40K/second
  • 35. RMS DB Overview
      Current Topology:
      ● Primary ring: 24 nodes, RF=3, i3.4xlarge, ephemeral disk
      ● Remote, Secondary ring: 8 nodes, RF=2, r5.2xlarge, 4500GB EBS
      ● EBS volume snapshots on secondary ring
      Near-term Topology Changes:
      ● Secondary ring to move to remote DC
  • 36. RMS DB Details
      ● read quorum latencies:
          − p50=1.95ms
          − p99=10ms
          − p999=90ms
          − max ~3000ms (timeout 5000ms)
      ● write quorum latencies:
          − p50=1.13ms
          − p99=10ms
          − p999=150ms
          − max ~1000-2000ms (timeout 5000ms)
  • 37. Summary Comparison

      | System Name | Read Rate   | Write Rate  | Canonical Data? | Multi-ring? | Node Count | Primary Node Type | Primary Backup? | Higher-level Services |
      |-------------|-------------|-------------|-----------------|-------------|------------|-------------------|-----------------|-----------------------|
      | Family Tree | 250K/second | 9.3K/second | yes             | yes         | 21         | i3.4xlarge        | yes             | none                  |
      | Hinting     | 192K/second | 12K/second  | no              | no          | 27         | i3.2xlarge        | no              | none                  |
      | RMS         | 60K/second  | 40K/second  | yes             | yes         | 24         | i3.4xlarge        | soon            | DSE Search            |
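
The summary figures lend themselves to a quick derived comparison. The sketch below computes an average read rate per primary node and the write share of total traffic for each system. It assumes the quoted rates are cluster-wide client request rates spread evenly over the primary ring; actual per-replica work is higher because each request touches multiple replicas. The `Cluster` class and its method names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    reads_per_sec: int
    writes_per_sec: int
    primary_nodes: int

    def reads_per_node(self) -> float:
        """Average client reads per second per primary node (assumes an even spread)."""
        return self.reads_per_sec / self.primary_nodes

    def write_share(self) -> float:
        """Fraction of all client requests that are writes."""
        return self.writes_per_sec / (self.reads_per_sec + self.writes_per_sec)


# Rates and node counts taken from the Summary Comparison slide.
CLUSTERS = [
    Cluster("Family Tree", 250_000, 9_300, 21),
    Cluster("Hinting", 192_000, 12_000, 27),
    Cluster("RMS", 60_000, 40_000, 24),
]

most_write_heavy = max(CLUSTERS, key=Cluster.write_share)

for c in CLUSTERS:
    print(f"{c.name}: {c.reads_per_node():.0f} reads/sec/node, "
          f"{c.write_share():.1%} writes")
print("Most write-heavy:", most_write_heavy.name)
```

Seen this way, Family Tree and Hinting are overwhelmingly read-dominated, while RMS carries a far larger proportion of writes on top of its DSE Search workload.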
  • 38. Conclusions
      ● Canonical data requires special care
      ● Architecture: DB topology, machine config, application design and implementation
      ● Recovery Point/Recovery Time objectives drive architecture
      ● Performance & Scale drive architecture
      ● Availability SLA drives architecture
      ● Cost trade-offs drive architecture
      ● Architecture == Balance