Scaling AncestryDNA Using Hadoop and HBase

April 10th, 2014
Jeremy Pollack

• Over 30,000 historical content collections
• 13 billion records and images
• Records dating back to 16th century
• 10 petabytes
We are the world's largest online family history resource
Discoveries are the Key

The “eureka” moment drives our business
Discoveries in Detail

Spit in a tube, pay $99, learn your past
Autosomal DNA tests
200,000+ DNA samples
700,000 SNPs for each sample
10,000,000+ cousin matches
DNA molecule 1 differs from DNA molecule 2 at a single base-pair
location (a C/T polymorphism)
(http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)
[Chart: Genotyped Samples; axis ticks up to 150,000]
Discoveries with DNA

Network Effect – Cousin Matches
[Chart: Cousin Matches (0 to 3,500,000) vs. Database Size (2,000 to 115,756)]

• GERMLINE is an algorithm that finds hidden
relationships within a pool of DNA
• GERMLINE also refers to the reference
implementation of that algorithm written in C++
• You can find it here :
http://www1.cs.columbia.edu/~gusev/germline/
Introducing … GERMLINE!

• GERMLINE (the implementation) was not meant to be
used in an industrial setting
o Stateless
o Single threaded
o Prone to swapping (heavy memory usage)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would
slow to a crawl
• Put simply: GERMLINE couldn't scale
So What’s the Problem?

[Chart: GERMLINE Run Times (in hours); x-axis: Samples (2,500 to 60,000); y-axis: Hours (0 to 25)]

[Chart: Projected GERMLINE Run Times (in hours); x-axis: Samples; y-axis: Hours (0 to 700); series: GERMLINE run times, Projected GERMLINE run times]

The Mission : Create a Scalable Matching Engine
... and thus Jermline was born
(aka "Jermline with a J")

What is Hadoop?
• Hadoop is an open-source platform for processing large
amounts of data in a scalable, fault-tolerant, affordable
fashion, using commodity hardware
• Hadoop specifies a distributed file system called HDFS
• Hadoop supports a processing methodology known as
MapReduce
• Many tools are built on top of Hadoop, such as HBase,
Hive, and Flume

What happens when HDFS loses a server

What is HBase?
• HBase is an open-source NoSQL data store that runs on top of
HDFS
• HBase is columnar; you can think of it as a weird amalgam of a
hashtable and a spreadsheet
• HBase supports unlimited rows and columns
• HBase stores data sparsely; there is no penalty for empty cells
• HBase is gaining in popularity: Salesforce, Facebook, and Twitter
have all invested heavily in the technology, as well as many others

Game of Thrones Characters, in an HBase Table
KEY      gender  hair_color  family_name  is_evil
Joffrey  male    blonde      Baratheon    yes
Cersei   female  blonde      Lannister    kinda

Adding a Row to an HBase Table
KEY      gender  hair_color  family_name  is_evil
Joffrey  male    blonde      Baratheon    yes
Sansa    female  red         Stark        no

Adding a Column to an HBase Table
KEY      gender  hair_color  family_name  is_evil  title
Joffrey  male    blonde      Baratheon    yes      king
Sansa    female  red         Stark        no
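
The three table slides above can be mimicked in plain Python to show what "sparse" means in practice. This is a minimal sketch, not HBase client code: a dict of dicts stands in for the table, and `put` and `get` are hypothetical helper names chosen for illustration.

```python
# Toy model of a sparse table: row key -> {column: value}.
# Assumption: a plain dict stands in for HBase; no real HBase API is used.
table = {}

def put(row_key, column, value):
    """Store one cell; the row and the column are created on demand."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column):
    """Return the cell value, or None when the cell was never written."""
    return table.get(row_key, {}).get(column)

# Rows taken from the slides.
for column, value in [("gender", "male"), ("hair_color", "blonde"),
                      ("family_name", "Baratheon"), ("is_evil", "yes")]:
    put("Joffrey", column, value)
for column, value in [("gender", "female"), ("hair_color", "red"),
                      ("family_name", "Stark"), ("is_evil", "no")]:
    put("Sansa", column, value)

# Adding a column only touches the rows that have a value for it.
put("Joffrey", "title", "king")
```

Sansa's row never receives a `title` entry, so the empty cell costs nothing to store.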

Cersei : ACTGACCTAGTTGAC
Joffrey : TTAAGCCTAGTTGAC
The Input
Cersei Baratheon
• Former queen of Westeros
• Machiavellian manipulator
• Mostly evil, but occasionally sympathetic
Joffrey Baratheon
• Pretty much the human embodiment of evil
• Needlessly cruel
• Kinda looks like Justin Bieber
DNA Matching : How it Works

0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Separate into words

0 1 2
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
Build the hash table
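
The two slides above (split into words, then build the hash table) can be sketched in a few lines of Python. This is an illustration of the toy example only, not GERMLINE's actual code; the function names and the 5-letter word size are assumptions taken from the slide layout.

```python
from collections import defaultdict

WORD_SIZE = 5  # assumption: matches the 5-letter words shown on the slides

def split_into_words(sequence, word_size=WORD_SIZE):
    """Cut a sequence into consecutive, non-overlapping words."""
    return [sequence[i:i + word_size] for i in range(0, len(sequence), word_size)]

def build_hash_table(samples):
    """Map "WORD_POSITION" -> list of sample names that have that word there."""
    table = defaultdict(list)
    for name, sequence in samples.items():
        for position, word in enumerate(split_into_words(sequence)):
            table[f"{word}_{position}"].append(name)
    return dict(table)

samples = {"Cersei": "ACTGACCTAGTTGAC", "Joffrey": "TTAAGCCTAGTTGAC"}
hash_table = build_hash_table(samples)
```

Any key that maps to more than one name marks a position where those samples share a word.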

Iterate through genome and find matches
Cersei and Joffrey match from position 1 to position 2
0 1 2
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
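
The matching step can be sketched as follows: collect the positions where a pair of samples shares a word, then merge adjacent positions into ranges. This is a simplified reading of the slide, with a hypothetical `find_matches` helper; the hash table is written out literally so the block stands alone.

```python
from collections import defaultdict
from itertools import combinations

# Hash table from the slides: "WORD_POSITION" -> samples sharing that word.
hash_table = {
    "ACTGA_0": ["Cersei"],
    "TTAAG_0": ["Joffrey"],
    "CCTAG_1": ["Cersei", "Joffrey"],
    "TTGAC_2": ["Cersei", "Joffrey"],
}

def find_matches(table):
    """Return {(sample_a, sample_b): [(start, end), ...]} of shared word runs."""
    shared = defaultdict(set)  # pair -> positions where both have the same word
    for key, names in table.items():
        position = int(key.rsplit("_", 1)[1])
        for pair in combinations(sorted(names), 2):
            shared[pair].add(position)

    matches = {}
    for pair, positions in shared.items():
        ranges = []
        for position in sorted(positions):
            if ranges and position == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], position)  # extend the current run
            else:
                ranges.append((position, position))     # start a new run
        matches[pair] = ranges
    return matches

matches = find_matches(hash_table)
```

For the two samples on the slide, the only pair is Cersei and Joffrey, with a single range covering positions 1 to 2.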

Does that mean they're related?
...maybe

IBD to Relationship Estimation
• We use the total length of
all shared segments to
estimate the relationship
between two genetic
relatives
• This is basically a
classification problem
[Chart: ERSA; x-axis: total_IBD (cM), ticks from 5 to 5000; y-axis: probability, 0.00 to 0.05; curves m1 through m11]

Jaime : TTAAGCCTAGGGGCG
But Wait...What About Jaime?
Jaime Lannister
• Kind of a has-been
• Killed the Mad King
• Has the hots for his
sister, Cersei

Adding a new sample, the GERMLINE way

0 1 2
Jaime : TTAAG CCTAG GGGCG
ACTGA_0 : Cersei
TTAAG_0 : Joffrey, Jaime
CCTAG_1 : Cersei, Joffrey, Jaime
TTGAC_2 : Cersei, Joffrey
GGGCG_2 : Jaime
Step one: Rebuild the entire hash table from
scratch, including the new sample
The GERMLINE Way

Joffrey and Jaime match from position 0 to position 1
Cersei and Jaime match at position 1
Step two: Find everybody's matches all over
again, including the new sample. (n x n comparisons)
0 1 2
ACTGA_0 : Cersei
TTAAG_0 : Joffrey, Jaime
CCTAG_1 : Cersei, Joffrey, Jaime
TTGAC_2 : Cersei, Joffrey
GGGCG_2 : Jaime
The GERMLINE Way

Joffrey and Jaime match from position 0 to position 1
Cersei and Jaime match at position 1
Step three : Now, throw away the evidence!
The GERMLINE Way
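
The cost of the three steps above can be made concrete with a little arithmetic. The sketch below takes the slide's "n x n comparisons" at face value as pairwise work and counts unordered pairs; the function names and the sample counts in the usage note are illustrative assumptions, not measurements.

```python
def pairs_full_rerun(n_samples):
    """Unordered sample pairs examined when every sample is re-matched."""
    return n_samples * (n_samples - 1) // 2

def pairs_incremental(n_existing, n_new):
    """Pairs that involve at least one new sample (old-vs-old pairs are kept)."""
    return n_existing * n_new + n_new * (n_new - 1) // 2
```

With 60,000 stored samples and 1,000 new ones, a full rerun covers `pairs_full_rerun(61000)` = 1,860,469,500 pairs, while only `pairs_incremental(60000, 1000)` = 60,499,500 of them involve a new sample. The rest is work that was already done and then thrown away.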

Step one: Update the hash table
Already stored in HBase:
KEY        Cersei  Joffrey
2_ACTGA_0  1
2_TTAAG_0          1
2_CCTAG_1  1       1
2_TTGAC_2  1       1
New sample to add:
Jaime : TTAAG CCTAG GGGCG
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that
position on that chromosome
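
The key, qualifier, and cell-value layout described above can be sketched with a dict of dicts. This is not HBase client code: `row_key` and `add_sample` are hypothetical helpers, and the starting contents are copied from the slide.

```python
# Toy stand-in for the HBase hash table: row key -> {qualifier: cell value}.
# Assumption: a dict of dicts replaces HBase; no real HBase client is used.
hash_table = {
    "2_ACTGA_0": {"Cersei": 1},
    "2_TTAAG_0": {"Joffrey": 1},
    "2_CCTAG_1": {"Cersei": 1, "Joffrey": 1},
    "2_TTGAC_2": {"Cersei": 1, "Joffrey": 1},
}

def row_key(chromosome, word, position):
    """Build the [CHROMOSOME]_[WORD]_[POSITION] row key."""
    return f"{chromosome}_{word}_{position}"

def add_sample(table, chromosome, user_id, words):
    """Mark each of the user's words; existing rows and cells are untouched."""
    for position, word in enumerate(words):
        table.setdefault(row_key(chromosome, word, position), {})[user_id] = 1

add_sample(hash_table, 2, "Jaime", ["TTAAG", "CCTAG", "GGGCG"])
```

Adding Jaime writes three cells and creates one new row (`2_GGGCG_2`); nothing already stored is rebuilt.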
The Jermline Way

Already stored in HBase:
KEY        2_Cersei        2_Joffrey
2_Cersei                   { (1, 2), ...}
2_Joffrey  { (1, 2), ...}
New matches to add:
Jaime and Joffrey match from position 0 to position 1
Jaime and Cersei match at position 1
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
Step two: Find matches, update the results table
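
A results-table update under the same toy model might look like this. The helper names are hypothetical, and a single-position match such as Jaime and Cersei at position 1 is written as the range `(1, 1)` for uniformity, which is a choice made for this sketch.

```python
# Toy stand-in for the results table: row key -> {qualifier: list of ranges}.
# Assumption: a dict of dicts replaces HBase; no real HBase client is used.
results = {
    "2_Cersei": {"2_Joffrey": [(1, 2)]},
    "2_Joffrey": {"2_Cersei": [(1, 2)]},
}

def user_key(chromosome, user_id):
    """Build the [CHROMOSOME]_[USER ID] key used for both rows and qualifiers."""
    return f"{chromosome}_{user_id}"

def record_match(table, chromosome, user_a, user_b, ranges):
    """Store the match under both users so either row can be read on its own."""
    key_a, key_b = user_key(chromosome, user_a), user_key(chromosome, user_b)
    table.setdefault(key_a, {})[key_b] = list(ranges)
    table.setdefault(key_b, {})[key_a] = list(ranges)

record_match(results, 2, "Jaime", "Joffrey", [(0, 1)])
record_match(results, 2, "Jaime", "Cersei", [(1, 1)])
```

The existing Cersei and Joffrey match stays in place; only cells involving the new sample are written.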
The Jermline Way

Results Table
KEY        2_Cersei        2_Joffrey       2_Jaime
2_Cersei                   { (1, 2), ...}  { (1), ...}
2_Joffrey  { (1, 2), ...}                  { (0,1), ...}
2_Jaime    { (1), ...}     { (0,1), ...}
Hash Table
KEY        Cersei  Joffrey  Jaime
2_ACTGA_0  1
2_TTAAG_0          1        1
2_CCTAG_1  1       1        1
2_TTGAC_2  1       1
2_GGGCG_2                   1
The Jermline Way

But wait ... what about
Daenerys, Tyrion, Arya, and Jon Snow?

Run them in parallel with Hadoop!

Parallelism with Hadoop
• Batches are usually about a thousand people
• Each mapper takes a single chromosome for a
single person
• MapReduce Jobs :
Job #1 : Match Words
o Updates the hash table
Job #2 : Match Segments
o Identifies areas where the samples match
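
The shape of Job #1 can be imitated in-process with plain Python. This is a sketch of the map, shuffle, and reduce pattern only, not Hadoop code: the function names are hypothetical, the 5-letter word size comes from the earlier toy example, and how the real jobs divide work between mappers and reducers is not described in the slides.

```python
from collections import defaultdict

WORD_SIZE = 5  # assumption: same toy word size as the earlier slides

def match_words_mapper(user_id, chromosome, sequence):
    """One map task: a single chromosome for a single person."""
    for position in range(len(sequence) // WORD_SIZE):
        word = sequence[position * WORD_SIZE:(position + 1) * WORD_SIZE]
        yield f"{chromosome}_{word}_{position}", user_id

def run_match_words(batch):
    """Simulate map, shuffle, and reduce in-process for a batch of samples."""
    grouped = defaultdict(set)
    for user_id, chromosomes in batch.items():
        for chromosome, sequence in chromosomes.items():  # one mapper each
            for key, value in match_words_mapper(user_id, chromosome, sequence):
                grouped[key].add(value)                   # shuffle by key
    return {key: sorted(users) for key, users in grouped.items()}  # reduce

batch = {
    "Cersei": {2: "ACTGACCTAGTTGAC"},
    "Joffrey": {2: "TTAAGCCTAGTTGAC"},
    "Jaime": {2: "TTAAGCCTAGGGGCG"},
}
word_index = run_match_words(batch)
```

Because each map task depends only on its own (person, chromosome) input, the tasks can run on different machines without coordinating with each other.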

How does Jermline perform?
A 1700% performance improvement
over GERMLINE!

[Chart: Run times for Matching (in hours); x-axis: Samples (2,500 to 117,500); y-axis: Hours (0 to 25)]

[Chart: Run times for Matching (in hours); x-axis: Samples; y-axis: Hours (0 to 180); series: GERMLINE run times, Jermline run times, Projected GERMLINE run times]

Bottom line: Without Hadoop and HBase, this would have
been expensive and difficult.
Dramatically Increased our Capacity

And now for everybody's favorite part ....
Lessons Learned


Lessons Learned : What went right?
• This project would not have been possible without TDD
• Two sets of test data : generated and public domain
• 89% coverage
• Corrected a bug in the reference implementation
• Has never failed in production


Lessons Learned : What would we do differently?
• Front-load some performance tests
o HBase and Hadoop can have odd performance profiles
o HBase in particular has some strange behavior if you're not
familiar with its inner workings
• Allow a lot of time for live tests, dry runs, and
deployment
o These technologies are relatively new, and it isn't always
possible to find experienced admins. Be prepared to "get your
hands dirty"

And yes, we are hiring!
You can contact me at jpollack@ancestry.com
or just talk to me here!
