Scaling AncestryDNA
Using Hadoop and HBase
April 10th, 2014
Jeremy Pollack
Ancestry.com Mission
• Over 30,000 historical content collections
• 13 billion records and images
• Records dating back to 16th century
• 10 petabytes
We are the world's largest online family history resource
Discoveries are the Key
The “eureka” moment drives our business
Discoveries in Detail
Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 200,000 DNA samples
700,000 SNPs for each sample
10,000,000+ cousin matches
[Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). Source: http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism]
[Chart: Genotyped Samples, axis ticks up to 150,000]
Discoveries with DNA
Network Effect – Cousin Matches
[Chart: Cousin Matches (0 to 3,500,000) against Database Size (2,000 to 115,756)]
So how do we do it?
• GERMLINE is an algorithm that finds hidden
relationships within a pool of DNA
• GERMLINE also refers to the reference
implementation of that algorithm written in C++
• You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
Introducing … GERMLINE!
• GERMLINE (the implementation) was not meant to be
used in an industrial setting
  o Stateless
  o Single threaded
  o Prone to swapping (heavy memory usage)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would
slow to a crawl
• Put simply: GERMLINE couldn't scale
So What’s the Problem?
[Chart: GERMLINE Run Times (in hours). Hours (0 to 25) against Samples (2,500 to 60,000)]
[Chart: Projected GERMLINE Run Times (in hours). Hours (0 to 700) against Samples (from 2,500; the higher tick labels are truncated in the source). Series: GERMLINE run times; Projected GERMLINE run times]
The Mission: Create a Scalable Matching Engine
... and thus Jermline was born
(aka "Jermline with a J")
What is Hadoop?
• Hadoop is an open-source platform for processing large
amounts of data in a scalable, fault-tolerant, affordable
fashion, using commodity hardware
• Hadoop specifies a distributed file system called HDFS
• Hadoop supports a processing methodology known as
MapReduce
• Many tools are built on top of Hadoop, such as HBase,
Hive, and Flume
What is HDFS?
What happens when HDFS loses a server
What is MapReduce?
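The slide itself carries no text here, so what follows is only a minimal sketch of the general MapReduce idea, simulated in plain Python rather than on Hadoop. The helper names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count example are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Run the mapper on every record and collect the (key, value) pairs it emits."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    """Run the reducer once per key over all values collected for that key."""
    return {key: reducer(key, values) for key, values in grouped.items()}

# Hypothetical example job: count how often each word occurs.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    return sum(values)

lines = ["hadoop hbase", "hbase hdfs", "hbase"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

On a real cluster the map and reduce calls run on many machines at once; the single-process version only shows the data flow.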
What is HBase?
• HBase is an open-source NoSQL data store that runs on top of
HDFS
• HBase is columnar; you can think of it as a weird amalgam of a
hashtable and a spreadsheet
• HBase supports unlimited rows and columns
• HBase stores data sparsely; there is no penalty for empty cells
• HBase is gaining in popularity: Salesforce, Facebook, and Twitter
have all invested heavily in the technology, as well as many others
Game of Thrones Characters, in an HBase Table
KEY gender hair_color family_name is_evil
Joffrey male blonde Baratheon yes
Cersei female blonde Lannister kinda
Adding a Row to an HBase Table
KEY gender hair_color family_name is_evil
Joffrey male blonde Baratheon yes
Cersei female blonde Lannister kinda
Sansa female red Stark no
Adding a Column to an HBase Table
KEY gender hair_color family_name is_evil title
Joffrey male blonde Baratheon yes king
Cersei female blonde Lannister kinda
Sansa female red Stark no
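The three tables above can be mimicked with a dictionary of dictionaries. This is a rough sketch of the sparse row/column idea only: `SparseTable` is a hypothetical class, not the HBase client API, and it leaves out real HBase features such as column families and cell versions.

```python
from collections import defaultdict

class SparseTable:
    """Toy stand-in for an HBase table: row key -> {column -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, **cells):
        # Only the cells that are supplied get stored; nothing is reserved for the rest.
        self._rows[row_key].update(cells)

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self):
        """Every column name that appears in at least one row."""
        return sorted({name for cells in self._rows.values() for name in cells})

table = SparseTable()
table.put("Joffrey", gender="male", hair_color="blonde", family_name="Baratheon", is_evil="yes")
table.put("Cersei", gender="female", hair_color="blonde", family_name="Lannister", is_evil="kinda")

# Adding a row is just another put.
table.put("Sansa", gender="female", hair_color="red", family_name="Stark", is_evil="no")

# Adding a column needs no schema change: write the cell on the rows that have it.
table.put("Joffrey", title="king")
```

Cersei and Sansa simply have no `title` cell, which is the "no penalty for empty cells" point from the HBase slide.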
Cersei : ACTGACCTAGTTGAC
Joffrey : TTAAGCCTAGTTGAC
The Input
Cersei Baratheon
• Former queen of Westeros
• Machiavellian manipulator
• Mostly evil, but occasionally sympathetic
Joffrey Baratheon
• Pretty much the human embodiment of evil
• Needlessly cruel
• Kinda looks like Justin Bieber
DNA Matching : How it Works
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Separate into words
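A minimal sketch of the word-splitting step. The word size of 5 comes from the slide's toy example; the word size used in production is not stated, and `split_into_words` is a hypothetical helper name.

```python
def split_into_words(sequence, word_size=5):
    """Cut a sequence into consecutive, non-overlapping words of word_size characters."""
    return [sequence[i:i + word_size] for i in range(0, len(sequence), word_size)]

cersei_words = split_into_words("ACTGACCTAGTTGAC")
joffrey_words = split_into_words("TTAAGCCTAGTTGAC")
```

The index of each word in the returned list is the position (0, 1, 2) shown above the sequences.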
DNA Matching : How it Works
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
Build the hash table
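The hash table on this slide could be built roughly as follows. This is a sketch under assumptions: an in-memory dictionary keyed by `WORD_POSITION`, a word size of 5 as in the example, and a hypothetical function name `build_hash_table`.

```python
from collections import defaultdict

def build_hash_table(samples, word_size=5):
    """Map each "WORD_POSITION" key to the list of samples carrying that word there."""
    table = defaultdict(list)
    for name, sequence in samples.items():
        for position, start in enumerate(range(0, len(sequence), word_size)):
            word = sequence[start:start + word_size]
            table[f"{word}_{position}"].append(name)
    return dict(table)

samples = {"Cersei": "ACTGACCTAGTTGAC", "Joffrey": "TTAAGCCTAGTTGAC"}
hash_table = build_hash_table(samples)
```

Samples that land in the same bucket share an identical word at the same position, which is what the matching step looks for.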
DNA Matching : How it Works
Iterate through genome and find matches
Cersei and Joffrey match from position 1 to position 2
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
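One way to read "iterate through genome and find matches": collect the positions at which each pair of samples shares a bucket, then merge consecutive positions into ranges. The sketch below works on the toy hash table from the slide; `find_matches` and `merge_runs` are hypothetical names, and the real GERMLINE algorithm does more than this (the slides do not go into it).

```python
from collections import defaultdict
from itertools import combinations

def merge_runs(positions):
    """Merge sorted positions into inclusive (start, end) ranges of consecutive values."""
    runs = []
    for position in positions:
        if runs and position == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], position)
        else:
            runs.append((position, position))
    return runs

def find_matches(hash_table):
    """Return {(sample_a, sample_b): [(start, end), ...]} for every pair sharing a bucket."""
    shared = defaultdict(set)
    for key, names in hash_table.items():
        position = int(key.rsplit("_", 1)[1])
        for pair in combinations(sorted(names), 2):
            shared[pair].add(position)
    return {pair: merge_runs(sorted(positions)) for pair, positions in shared.items()}

hash_table = {
    "ACTGA_0": ["Cersei"],
    "TTAAG_0": ["Joffrey"],
    "CCTAG_1": ["Cersei", "Joffrey"],
    "TTGAC_2": ["Cersei", "Joffrey"],
}
matches = find_matches(hash_table)
```

For the two samples above this yields a single range from position 1 to position 2, as stated on the slide.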
DNA Matching : How it Works
Does that mean they're related?
...maybe
IBD to Relationship Estimation
• We use the total length of all shared segments to estimate the relationship between two genetic relatives
• This is basically a
classification problem
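As a rough illustration of "total shared length in, relationship class out", the sketch below sums segment lengths in centimorgans and looks the total up in a list of thresholds. Everything specific in it is an assumption: the thresholds and labels are placeholders with no genetic meaning, and the ERSA chart on this slide (probability against total_IBD in cM) suggests the real estimate is probabilistic rather than a fixed cut-off.

```python
from bisect import bisect_right

def total_ibd(segment_lengths_cm):
    """Total length of all shared segments, in centimorgans."""
    return sum(segment_lengths_cm)

def classify(total_cm, boundaries, labels):
    """Pick the label whose interval contains total_cm.

    boundaries must be ascending; labels needs exactly one more entry than boundaries.
    """
    if len(labels) != len(boundaries) + 1:
        raise ValueError("labels must have one more entry than boundaries")
    return labels[bisect_right(boundaries, total_cm)]

# Placeholder values for illustration only; NOT real relationship thresholds.
PLACEHOLDER_BOUNDARIES = [50.0, 500.0]
PLACEHOLDER_LABELS = ["class_a", "class_b", "class_c"]

shared_cm = total_ibd([30.0, 45.5])
label = classify(shared_cm, PLACEHOLDER_BOUNDARIES, PLACEHOLDER_LABELS)
```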
[Chart: ERSA. Probability (0.00 to 0.05) against total_IBD (cM, 5 to 5000), one curve per class m1 through m11]
Jaime : TTAAGCCTAGGGGCG
But Wait...What About Jaime?
Jaime Lannister
• Kind of a has-been
• Killed the Mad King
• Has the hots for his sister, Cersei
Adding a new sample, the GERMLINE way
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Jaime : TTAAG CCTAG GGGCG
ACTGA_0 : Cersei
TTAAG_0 : Joffrey, Jaime
CCTAG_1 : Cersei, Joffrey, Jaime
TTGAC_2 : Cersei, Joffrey
GGGCG_2 : Jaime
Step one: Rebuild the entire hash table from
scratch, including the new sample
The GERMLINE Way
Cersei and Joffrey match from position 1 to position 2
Joffrey and Jaime match from position 0 to position 1
Cersei and Jaime match at position 1
Step two: Find everybody's matches all over
again, including the new sample. (n x n comparisons)
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Jaime : TTAAG CCTAG GGGCG
ACTGA_0 : Cersei
TTAAG_0 : Joffrey, Jaime
CCTAG_1 : Cersei, Joffrey, Jaime
TTGAC_2 : Cersei, Joffrey
GGGCG_2 : Jaime
The GERMLINE Way
Cersei and Joffrey match from position 1 to position 2
Joffrey and Jaime match from position 0 to position 1
Cersei and Jaime match at position 1
Step three : Now, throw away the evidence!
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Jaime : TTAAG CCTAG GGGCG
ACTGA_0 : Cersei
TTAAG_0 : Joffrey, Jaime
CCTAG_1 : Cersei, Joffrey, Jaime
TTGAC_2 : Cersei, Joffrey
GGGCG_2 : Jaime
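The three steps can be pictured as a stateless batch function: every call rebuilds the table and recomputes every pair, and nothing survives between calls. This is a toy model of that behavior, not GERMLINE's actual code; `batch_match` is a hypothetical name and the word size of 5 is taken from the example.

```python
from collections import defaultdict
from itertools import combinations

WORD_SIZE = 5  # word size of the toy example

def batch_match(samples):
    """Rebuild the hash table from scratch and return shared positions for every pair."""
    table = defaultdict(list)
    for name, sequence in samples.items():
        for position, start in enumerate(range(0, len(sequence), WORD_SIZE)):
            table[(sequence[start:start + WORD_SIZE], position)].append(name)
    shared = defaultdict(list)
    # Visit buckets in position order so each pair's position list comes out sorted.
    for (_, position), names in sorted(table.items(), key=lambda item: item[0][1]):
        for pair in combinations(sorted(names), 2):
            shared[pair].append(position)
    return dict(shared)  # the table itself is discarded when the function returns

samples = {"Cersei": "ACTGACCTAGTTGAC", "Joffrey": "TTAAGCCTAGTTGAC"}
first_run = batch_match(samples)

# Adding Jaime means running everything again, including the Cersei/Joffrey pair.
samples["Jaime"] = "TTAAGCCTAGGGGCG"
second_run = batch_match(samples)
```

The second run repeats all the work of the first, which is why run time climbs so steeply as the sample pool grows.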
The GERMLINE Way
Step one: Update the hash table
Already stored in HBase:
            Cersei  Joffrey
2_ACTGA_0   1
2_TTAAG_0           1
2_CCTAG_1   1       1
2_TTGAC_2   1       1
New sample to add:
Jaime : TTAAG CCTAG GGGCG
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that
position on that chromosome
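A sketch of step one using the key, qualifier, and cell value layout described above. A nested dictionary stands in for the HBase table; `row_key` and `add_sample` are hypothetical helpers, and no real HBase client calls are shown.

```python
from collections import defaultdict

WORD_SIZE = 5  # word size of the toy example

def row_key(chromosome, word, position):
    """Key layout from the slide: [CHROMOSOME]_[WORD]_[POSITION]."""
    return f"{chromosome}_{word}_{position}"

def add_sample(hash_table, chromosome, user_id, sequence):
    """Write one cell per word of the new sample; existing cells are left untouched."""
    for position, start in enumerate(range(0, len(sequence), WORD_SIZE)):
        word = sequence[start:start + WORD_SIZE]
        # Qualifier is the user ID; the value is a single byte set to 1.
        hash_table[row_key(chromosome, word, position)][user_id] = b"\x01"

hash_table = defaultdict(dict)  # row key -> {user ID -> cell value}
add_sample(hash_table, 2, "Cersei", "ACTGACCTAGTTGAC")
add_sample(hash_table, 2, "Joffrey", "TTAAGCCTAGTTGAC")

# Adding Jaime only writes Jaime's three cells; nothing is rebuilt.
add_sample(hash_table, 2, "Jaime", "TTAAGCCTAGGGGCG")
```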
The Jermline Way
Already stored in HBase:
            2_Cersei        2_Joffrey
2_Cersei                    { (1, 2), ...}
2_Joffrey   { (1, 2), ...}
New matches to add:
Jaime and Joffrey match from position 0 to position 1
Jaime and Cersei match at position 1
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
Step two: Find matches, update the results table
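Step two might look roughly like this: read only the rows for the new sample's words, note which existing users appear in them, and write the resulting ranges into the results table in both directions. Plain dictionaries stand in for the two HBase tables, and `match_new_sample` and `to_ranges` are hypothetical names. A single-position match is written here as `(1, 1)` where the slide writes `(1)`.

```python
from collections import defaultdict

def to_ranges(positions):
    """Merge ascending positions into inclusive (start, end) ranges."""
    ranges = []
    for position in positions:
        if ranges and position == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], position)
        else:
            ranges.append((position, position))
    return ranges

def match_new_sample(hash_table, results, chromosome, user_id, words):
    """Compare the new sample against stored samples and update the results table."""
    shared = defaultdict(list)
    for position, word in enumerate(words):
        row = hash_table.get(f"{chromosome}_{word}_{position}", {})
        for other in row:
            if other != user_id:
                shared[other].append(position)
    new_key = f"{chromosome}_{user_id}"
    for other, positions in shared.items():
        other_key = f"{chromosome}_{other}"
        ranges = to_ranges(positions)
        # Store the match under both users; existing cells are not recomputed.
        results.setdefault(new_key, {})[other_key] = list(ranges)
        results.setdefault(other_key, {})[new_key] = list(ranges)
    return results

one = b"\x01"
hash_table = {
    "2_ACTGA_0": {"Cersei": one},
    "2_TTAAG_0": {"Joffrey": one, "Jaime": one},
    "2_CCTAG_1": {"Cersei": one, "Joffrey": one, "Jaime": one},
    "2_TTGAC_2": {"Cersei": one, "Joffrey": one},
    "2_GGGCG_2": {"Jaime": one},
}
results = {
    "2_Cersei": {"2_Joffrey": [(1, 2)]},
    "2_Joffrey": {"2_Cersei": [(1, 2)]},
}
match_new_sample(hash_table, results, 2, "Jaime", ["TTAAG", "CCTAG", "GGGCG"])
```

The Cersei/Joffrey cells are never touched, which is the contrast with the stateless approach.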
The Jermline Way
Results Table
            2_Cersei        2_Joffrey       2_Jaime
2_Cersei                    { (1, 2), ...}  { (1), ...}
2_Joffrey   { (1, 2), ...}                  { (0,1), ...}
2_Jaime     { (1), ...}     { (0,1), ...}
Hash Table
            Cersei  Joffrey  Jaime
2_ACTGA_0   1
2_TTAAG_0           1        1
2_CCTAG_1   1       1        1
2_TTGAC_2   1       1
2_GGGCG_2                    1
The Jermline Way
But wait ... what about
Daenerys, Tyrion, Arya, and Jon Snow?
Run them in parallel with Hadoop!
Parallelism with Hadoop
• Batches are usually about a thousand people
• Each mapper takes a single chromosome for a
single person
• MapReduce Jobs :
Job #1 : Match Words
o Updates the hash table
Job #2 : Match Segments
o Identifies areas where the samples match
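To make "one mapper per chromosome per person" concrete, the sketch below turns a batch into (person, chromosome) tasks and runs a Match Words style mapper over them with a local thread pool. The thread pool is only a stand-in for Hadoop, the sequences are made-up toy input, and `make_tasks` and `match_words_mapper` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def make_tasks(batch):
    """One task per (person, chromosome), mirroring one mapper per chromosome per person."""
    return [
        (person, chromosome, sequence)
        for person, chromosomes in batch.items()
        for chromosome, sequence in sorted(chromosomes.items())
    ]

def match_words_mapper(task, word_size=5):
    """Emit (hash table row key, person) for every word in one person's chromosome."""
    person, chromosome, sequence = task
    return [
        (f"{chromosome}_{sequence[start:start + word_size]}_{start // word_size}", person)
        for start in range(0, len(sequence), word_size)
    ]

# Made-up toy sequences; a real batch is about a thousand people.
batch = {
    "Daenerys": {1: "ACTGACCTAG", 2: "TTGACGGGCG"},
    "Tyrion": {1: "ACTGATTAAG", 2: "TTGACGGGCG"},
}
tasks = make_tasks(batch)
with ThreadPoolExecutor(max_workers=4) as pool:
    emitted = [pair for pairs in pool.map(match_words_mapper, tasks) for pair in pairs]
```

Each task depends only on its own input, so the tasks can be spread over as many workers as the cluster offers.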
How does Jermline perform?
A 1700% performance improvement
over GERMLINE!
[Chart: Run times for Matching (in hours). Hours (0 to 25) against Samples (2,500 to 117,500)]
[Chart: Run times for Matching (in hours). Hours (0 to 180) against Samples. Series: GERMLINE run times; Jermline run times; Projected GERMLINE run times]
Bottom line: Without Hadoop and HBase, this would have
been expensive and difficult.
Dramatically Increased our Capacity
And now for everybody's favorite part ....
Lessons Learned
Lessons Learned
What went right?
Lessons Learned : What went right?
• This project would not have been possible without TDD
• Two sets of test data : generated and public domain
• 89% coverage
• Corrected a bug in the reference implementation
• Has never failed in production
Lessons Learned
What would we do differently?
Lessons Learned : What would we do differently?
• Front-load some performance tests
  o HBase and Hadoop can have odd performance profiles
  o HBase in particular has some strange behavior if you're not familiar with its inner workings
• Allow a lot of time for live tests, dry runs, and deployment
  o These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"
Questions?
And yes, we are hiring!
You can contact me at jpollack@ancestry.com
or just talk to me here!
