The document describes Ancestry's journey moving from a single machine DNA matching process to a scalable Hadoop and HBase solution. It details how they first parallelized the ethnicity prediction step using Hadoop as a job scheduler. This freed resources for the more challenging matching algorithm. It then explains how they developed "Jermline", storing matching data in HBase and using MapReduce to efficiently find new matches for incremental DNA samples. The new distributed solution allowed matching to scale to millions of DNA samples.
Utah Big Mountain Conference: AncestryDNA, HBase, Hadoop (9-7-2013), by William Yetman
This presentation was given at Adobe with support from Utah Geek Events. It is the story of creating a business in an Agile way and digs into the technology that we used to support that business. It was a large open room with no microphone. I had a great audience with experienced Big Data developers and people new to the technology.
This document discusses AncestryDNA's use of Hadoop to scale their DNA analysis pipeline as their database and processing needs grew rapidly over time. It describes how they initially ran the entire pipeline on a single machine, and then incrementally moved each step of the pipeline to run on Hadoop clusters, including running Admixture ethnicity processing with MapReduce, replacing GERMLINE matching with a new Jermline algorithm implemented in MapReduce, and moving phasing from Beagle to a new Underdog implementation in MapReduce. Each change significantly improved performance and allowed them to keep up with the growth of their DNA database and user base.
Cassandra data structures and algorithms, by Duyhai Doan
This document discusses Cassandra data structures and algorithms. It begins with an introduction and agenda, then covers Cassandra's use of CRDTs, bloom filters, and Merkle trees for its data model. It explains how Cassandra columns can be modeled as a CRDT join semilattice and proves their eventual convergence. The document also covers Cassandra's write path, read path optimized with bloom filters, and the math behind bloom filter probabilities.
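The bloom-filter math mentioned above follows a standard formula: with m bits, n inserted items, and k hash functions, the false-positive probability is approximately (1 − e^(−kn/m))^k, minimized at k ≈ (m/n)·ln 2. A minimal sketch in Python (illustrative only; not Cassandra's implementation):

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false-positive probability of a Bloom filter
    with m bits, n inserted items, and k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Number of hash functions that minimizes the false-positive rate."""
    return max(1, round((m / n) * math.log(2)))

# Example: 10,000 items in a 96,000-bit filter.
m, n = 96_000, 10_000
k = optimal_k(m, n)                  # 7 hash functions
p = false_positive_rate(m, n, k)     # roughly a 1% false-positive rate
```

This is why a read path backed by bloom filters can skip most SSTables that cannot contain a key, at the cost of a small, tunable rate of unnecessary disk reads.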
Know your platform: 7 things every Scala developer should know about the JVM, by Pawel Szulc
The document discusses the importance for Scala developers to understand the basics of the Java Virtual Machine (JVM) platform that Scala code runs on. It provides examples of Java bytecode produced from simple Scala code snippets to demonstrate how code is executed by the JVM. Key points made include that the JVM is a stack-based virtual machine that compiles source code to bytecode instructions, and that understanding the level below the code helps developers write more efficient, robust and performant code.
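The same "look at the level below your code" exercise can be tried in Python, whose virtual machine is also stack based: the stdlib `dis` module shows the bytecode the interpreter executes, analogous to inspecting JVM bytecode with `javap`. (This is an analogy, not the talk's Scala/JVM examples.)

```python
import dis

# CPython, like the JVM, is a stack-based virtual machine: source code
# compiles to bytecode that pushes operands and applies opcodes to them.
def add(a, b):
    return a + b

# Arguments are pushed onto the operand stack before the add opcode runs.
instructions = [ins.opname for ins in dis.get_instructions(add)]
```

Printing `instructions` shows load instructions for the two arguments, an addition opcode, and a return: the stack discipline that both VMs share.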
A brief introduction to the Bayesian analysis program PyRate for paleobiology colleagues. Given at a lab meeting, so the format is casual and a good chunk of prior knowledge is assumed.
Lucene 4.0 is on its way to delivering a tremendous number of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues (a.k.a. Column Stride Fields) is one of the "next generation" features.
The document discusses MacRuby, a Ruby implementation for Mac OS X. It describes creating objects like jobs, companies, and products using Ruby. It also covers Cocoa integration using HotCocoa, and default mappings between Ruby and Cocoa classes and methods. Finally, it provides information about the MacRuby project and the author's background and contact details.
Published by the Finnish Information Processing Association, the yearly IT Barometer charts the importance of IT to Finnish organizations. In the IT Barometer, we study Finnish IT and business management’s views on how IT is utilized in their organizations, how IT produces value for their business, and what factors and competences are seen to contribute to future success.
This is the fourth IT Barometer and during these four years, we have seen dramatic changes in IT and the role of IT in Finnish companies. During these four years, we have gone through one downturn and we are now potentially entering another. During 2009, 2010 and 2011, we have monitored the effect that the general economic trend has on IT and perceptions on IT. During these four years, we also have seen the rise of consumerization, including social media services and a new class of smart phones and tablet computers. IT has also undergone a process of consumerization – new services and devices now first come to consumer markets and move from there to corporate use – oftentimes after years of delay.
This document contains 16 questions about the main components and characteristics of computers. It describes internal components such as the processor, RAM and ROM, expansion cards, and storage devices. It also explains concepts such as how a computer works, the types of peripherals, units of memory measurement, and the types of monitors, printers, personal computers, and mice. Finally, it refers to the first computers and their characteristics.
Discrimination against transsexual people can cause depression. The document analyzes the effects of discrimination against transsexual people in Iguala, Guerrero, Mexico in 2012. Discrimination is defined as giving someone unfavorable treatment based on criteria such as gender identity. Transsexual people suffer a great deal of social and family rejection, which can lead to the feelings of sadness and isolation known as depression. The objective is to better understand the effects of discrimination in order to promote respect for all gender preferences.
Solano Ortega Luis Enrique, Unit 2, AA1, The Communicative Phenomenon, by Chitomix7812
This document defines the fundamental elements of the communicative phenomenon: the source or message, the sender, the medium or channel, and the receiver. It explains that communication is the process by which information is transmitted from a sender to a receiver through a channel, and that the key elements are the message encoded by the sender, the medium through which it travels, and the receiver who decodes it.
Various Views of Golden Dawn in the Centre of Athens, by dorethvanmanen
This document discusses various views of the far-right Greek political party Golden Dawn. It begins with an introduction providing context on Golden Dawn's rise in popularity since 2008 and gaining 7% of the parliamentary vote in 2012 elections.
The document is divided into two main sections. Section one discusses different views of Golden Dawn held by Greek citizens, immigrants, other countries, and various authorities in Greece. Section two analyzes how Golden Dawn sees itself versus how others within Greece view the party, including citizens, immigrants, former immigrants, a police officer, and a lawyer.
The conclusion reflects on the challenges of obtaining direct input from Golden Dawn, and notes that the views presented are based on interviews conducted in Athens over a one-week period in January.
Communication involves the transmission of information from a sender to a receiver through a channel or medium. The sender encodes a message into signs that are sent through a channel and received by the receiver, who decodes them to extract the information.
The document discusses a study that compares user preferences and perceptions of the Yahoo and MSN web portals. It provides background on the history and changes made to each portal over the years. The study aims to investigate how attributes like information architecture, aesthetics, and navigation impact user preference and evaluation. Fifteen participants performed tasks to assess their pre-use perceptions and test usability while using the portals. A questionnaire addressed how attributes influenced perceptions. The study seeks to understand why changes were made and how services and usability have evolved to inform portal design.
HBaseCon 2013: Apache HBase, Apache Hadoop, DNA and YOU! Cloudera, Inc.
Ancestry.com is the world's largest online family history resource with over 30,000 historical collections, 11 billion records, and 4 petabytes of data. It allows users to spit in a tube, pay $99, and learn about their family origins and find long-lost relatives through Ancestry DNA, which has over 120,000 samples in its database. However, the original DNA matching algorithm called GERMLINE did not scale well to large data sets. Ancestry.com developed its own algorithm called Jermline that uses Hadoop and HBase to match DNA samples in parallel across a cluster of servers, providing a 1700% performance improvement over GERMLINE.
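The key to making matching incremental is an inverted index from hashed genome segments ("words") to the samples that carry them, so a new sample is compared only against samples sharing at least one word instead of against everyone. A toy sketch of that idea (a plain dict stands in for the HBase table; the hashing and match-extension details of the real algorithm are omitted):

```python
from collections import defaultdict

# (window, word) -> ids of samples carrying that word in that window.
index = defaultdict(set)

def add_sample(sample_id, words):
    """Register a new sample; return the candidate matches seen so far."""
    candidates = set()
    for window, word in enumerate(words):
        candidates |= index[(window, word)]   # samples sharing this word
        index[(window, word)].add(sample_id)  # make this sample findable
    return candidates

# Incremental runs touch only the rows for the new sample's words.
add_sample("s1", ["AAB", "CCD", "EEF"])
add_sample("s2", ["AAB", "XXX", "EEF"])
matches = add_sample("s3", ["ZZZ", "CCD", "EEF"])  # shares words with s1 and s2
```

Because each lookup is a row read keyed by (window, word), the work per new sample stays roughly constant as the database grows, rather than scaling with the total number of samples.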
This week, the MBTS team welcomed a new intern, Sean Pinto. They purchased their own 3D printer to print faster and with multiple colors. Unfortunately, the tardigrades in their sample disappeared again without a trace. The team will examine archived petri dishes under a plate scanner. They are starting a new experiment to test how moss and water type affect tardigrade lifespan and reproduction. Progress was made on the plate scanner, including revising the 3D model and adding code. The team created a website to publicize their project. Their budget is lower after the 3D printer purchase. Team members rated the week between 8 to 9 out of 10, feeling they had a slow start but were happy to
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB, by Cody Ray
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Many startups collect and display stats and other time-series data for their users. A supposedly-simple NoSQL option such as MongoDB is often chosen to get started... which soon becomes 50 distributed replica sets as volume increases. This talk describes how we designed a scalable distributed stats infrastructure from the ground up. KairosDB, a rewrite of OpenTSDB built on top of Cassandra, provides a solid foundation for storing time-series data. Unfortunately, though, it has some limitations: millisecond time granularity and lack of atomic upsert operations which make counting (critical to any stats infrastructure) a challenge. Additionally, running KairosDB atop Cassandra inside AWS brings its own set of challenges, such as managing Cassandra seeds and AWS security groups as you grow or shrink your Cassandra ring. In this deep-dive talk, we explore how we've used a mix of open-source and in-house tools to tackle these challenges and build a robust, scalable, distributed stats infrastructure.
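One common workaround for the missing atomic upserts, hinted at above, is to pre-aggregate counts in the stream processor and flush a single summed data point per metric per interval, so the datastore never has to do read-modify-write. A minimal sketch with hypothetical names (the talk's actual Storm topology is more involved):

```python
import time
from collections import Counter

class CounterBuffer:
    """Buffer increments in memory; emit one summed point per flush."""

    def __init__(self, flush_interval_s=10):
        self.flush_interval_s = flush_interval_s
        self.counts = Counter()

    def increment(self, metric, n=1):
        self.counts[metric] += n

    def flush(self):
        """Return (metric, timestamp_ms, count) rows to write, then reset."""
        now_ms = int(time.time() * 1000)
        rows = [(m, now_ms, c) for m, c in self.counts.items()]
        self.counts.clear()
        return rows

buf = CounterBuffer()
for _ in range(3):
    buf.increment("api.requests")
rows = buf.flush()   # one row with count 3, not three racy writes of 1
```

The trade-off is bounded data loss on crash (at most one flush interval of counts), which is usually acceptable for stats pipelines.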
Vsevolod Polyakov (DevOps Team Lead at Grammarly), by Provectus
This document discusses Graphite metrics storage and summarizes performance testing of various Graphite components. It finds that go-carbon with carbon-c-relay provides the best performance at over 1 million requests per second. Various tuning options are discussed, including cache sizing, write strategies, and OS configuration to optimize performance. Alternative time series databases like Influx and OpenTSDB are also benchmarked.
This document discusses Graphite and options for optimizing its performance for high volumes of metrics data. It summarizes the default Graphite architecture using Carbon and Whisper and different approaches for scaling it up including using go-carbon, carbon-c-relay, and evaluating alternative time series databases like Influx and OpenTSDB. Various techniques for optimizing whisper and cache configurations, I/O performance, and system parameters are also explored. Overall the best performing combination found was go-carbon with carbon-c-relay to handle over 1 million requests per second.
This talk shows what is possible with the huge datasets that are becoming more prevalent in the era of big data. I will demonstrate this, together with 3d visualization, in the Jupyter notebook, by now the almost-standard environment of (data) scientists.
With large astronomical catalogues containing more than a billion stars becoming common, we are preparing methods to visualize and explore these large datasets. Data volumes of this size require different visualization techniques, since scatter plots become too slow and meaningless due to overplotting. We solve the performance and visualization issue using binned statistics, e.g. histograms, density maps, and volume rendering in 3d. The calculation of statistics on N-dimensional grids is handled by a Python library called vaex, which I will introduce. It can process at least a billion samples per second, to produce for instance the mean of a quantity on a regular grid. These statistics can be calculated for any mathematical expression on the data (numpy style), over the full dataset or over subsets specified by queries/selections.
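The binned-statistics idea described above can be sketched in a few lines of plain Python (vaex does the same thing vectorized over billions of rows): instead of scatter-plotting every point, compute the mean of a quantity on a regular grid.

```python
def binned_mean(xs, values, x_min, x_max, n_bins):
    """Mean of `values` per bin on a regular 1-d grid over [x_min, x_max)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    width = (x_max - x_min) / n_bins
    for x, v in zip(xs, values):
        if x_min <= x < x_max:
            i = int((x - x_min) / width)
            sums[i] += v
            counts[i] += 1
    # Empty bins yield None rather than a divide-by-zero.
    return [s / c if c else None for s, c in zip(sums, counts)]

xs = [0.1, 0.2, 0.6, 0.7, 0.9]
vs = [1.0, 3.0, 10.0, 20.0, 5.0]
grid = binned_mean(xs, vs, 0.0, 1.0, 2)   # bins [0.0, 0.5) and [0.5, 1.0)
```

The output size depends only on the grid resolution, not the number of input rows, which is what makes the approach scale where scatter plots cannot.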
However, to visualize higher dimensional data in the notebook interactively, no proper solution existed. This led to the development of ipyvolume, which can render 3d volumes and up to a million glyphs (scatter plots and quiver) in the Jupyter notebook as a widget. With the browser as a platform, and the release of ipywidgets 6.0, these 3d plots can also be embedded in static html files and render on nbviewer. This allows for sharing with colleagues, rendering on your tablet (paperless office), outreach, press release material, etc. Full screen stereo rendering allows for a virtual reality experience using your phone and Google Cardboard, a minor investment compared to other VR head-mounted displays. Overlaying 3d quiver plots on a 3d volume rendering allows exploring a 6d (or higher) space.
Vaex and ipyvolume can be used together to explore and visualize any large tabular data set, or separately to calculate statistics, and render 3d plots in the notebook and outside.
CTF3, Stripe's third Capture-the-Flag, focused on distributed systems engineering with a goal of learning to build fault-tolerant, performant software while playing around with a bunch of cool cutting-edge technologies.
More here: https://stripe.com/blog/ctf3-launch.
Abstract: Nowadays only the lazy haven't written their own metrics storage and aggregation system. I am lazy, which is why I had to choose what to use and how to use it. I don't want you to repeat that work, so I decided to share my considerations on architectures and my test results.
The document discusses using OpenCL to accelerate genomic analysis through parallelization. It introduces OpenCL and provides examples of using it to parallelize algorithms for copy number inference in tumors, computing relatedness between individuals, and performing variable selection in regression. Key applications discussed include hidden Markov models for copy number inference, principal component analysis on relatedness matrices, and coordinate descent algorithms for lasso regression. Performance gains of up to 155x are reported for the parallel implementations compared to serial code.
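The reason these analyses map well onto OpenCL is that they are embarrassingly parallel: the same kernel runs over many independent work items (e.g. one per pair of individuals). That shape can be sketched with a thread pool and a toy allele-sharing score (hypothetical data and scoring, not the talk's actual estimator or an OpenCL kernel):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy genotypes: allele counts (0, 1, or 2) at four markers.
genotypes = {
    "ind1": [0, 1, 2, 1],
    "ind2": [0, 1, 1, 1],
    "ind3": [2, 2, 0, 0],
}

def relatedness(pair):
    """Score one pair: 1.0 for identical genotypes, lower as they diverge."""
    a, b = pair
    ga, gb = genotypes[a], genotypes[b]
    score = sum(2 - abs(x - y) for x, y in zip(ga, gb)) / (2 * len(ga))
    return (a, b, score)

# Each pair is an independent work item, like one OpenCL kernel invocation.
pairs = [("ind1", "ind2"), ("ind1", "ind3"), ("ind2", "ind3")]
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(relatedness, pairs))
```

On a GPU the same per-pair computation would run across thousands of work items at once, which is where speedups of the magnitude reported above come from.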
Programming the Cell Processor: A simple raytracer from pseudo-code to SPU code, by Slide_N
This document provides an overview of programming the Cell processor by describing how to optimize a raytracing algorithm for parallel execution across the Synergistic Processing Elements (SPEs) of the Cell. It discusses strategies like partitioning the image into rows for each SPE to process, using vectors and SIMD instructions to perform the 3D math calculations efficiently, avoiding branches by restructuring code, and leveraging direct memory access (DMA) to transfer data efficiently between the SPEs' local stores and main memory. The document also notes more advanced techniques like adaptive work partitioning, object caching, and dynamic code loading that could further improve a complex raytracer's performance on the Cell architecture.
This document discusses high-throughput screening (HTS) workflows for identifying biologically active small molecules. It describes how robots are used to rapidly screen large libraries of compounds in assays and generate large datasets. Statistical and machine learning methods in R can then be used to build predictive models from these datasets to identify promising leads and guide the screening of additional compounds. Caveats regarding the applicability of models to new chemical spaces are also discussed.
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale, by Andy Petrella
A talk given at the BioBankCloud conference in Feb 2015 about distributed computing in the contexts of genomics and health.
In this talk, we presented the results we obtained exploring the 1000genomes data using ADAM, followed by an introduction to our scalable GA4GH server implementation built using ADAM, Apache Spark and Play Framework 2.
Top 5 mistakes when writing Spark applications, by hadooparchbook
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
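The salting technique recommended above can be sketched in plain Python (in Spark the same idea is applied to RDD or DataFrame keys; the names here are illustrative): a hot key is split into N sub-keys so its records spread across N partitions, and the partial aggregates are combined in a second pass.

```python
import random

N_SALTS = 4

def salt(key):
    """Append a random salt so one hot key becomes N_SALTS sub-keys."""
    return f"{key}#{random.randrange(N_SALTS)}"

def unsalt(salted_key):
    """Strip the salt to recover the original key."""
    return salted_key.rsplit("#", 1)[0]

records = [("hot_key", 1)] * 1000          # heavily skewed input

stage1 = {}                                 # first pass: reduce by salted key
for key, value in records:
    sk = salt(key)
    stage1[sk] = stage1.get(sk, 0) + value

stage2 = {}                                 # second pass: combine the partials
for sk, partial in stage1.items():
    k = unsalt(sk)
    stage2[k] = stage2.get(k, 0) + partial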
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv..., by MongoDB
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...MongoDB
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
Top 5 mistakes when writing Spark applicationshadooparchbook
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources.
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not properly managing the DAG to avoid shuffles and bring work to the data. Using ReduceByKey over GroupByKey and TreeReduce over Reduce when possible.
5) Classpath conflicts arising from mismatched library versions, which can be addressed using sh
This document discusses techniques for finding duplicate records in large datasets. It describes a machine learning framework with two steps: candidate selection and candidate scoring. The candidate selection step uses domain knowledge, information retrieval techniques like "more like this" queries, and approximate nearest neighbors to find candidate duplicates. The candidate scoring step then uses machine learning models trained on pairwise record comparisons to identify true duplicates among the candidates. Features for the models include differences in fields, text similarity measures, image hashes and embeddings. Approximate techniques like locality sensitive hashing allow scaling these methods to very large datasets.
Top 5 mistakes when writing Spark applicationsmarkgrover
This document discusses 5 common mistakes people make when writing Spark applications.
The first mistake is improperly sizing Spark executors by not considering factors like the number of cores, amount of memory, and overhead needed. The second mistake is running into the 2GB limit on Spark shuffle blocks, which can cause jobs to fail. The third mistake is not addressing data skew during joins and shuffles, which can cause some tasks to be much slower than others. The fourth mistake is poorly managing the DAG by overusing shuffles, not using techniques like ReduceByKey instead of GroupByKey, and not using complex data types. The fifth mistake is classpath conflicts between the versions of libraries used by Spark and those added by the user.
The document discusses quantum computing and its potential uses and limitations. It begins with an explanation of how far a person could count in their lifetime compared to what conventional computing and quantum computing are capable of. It then covers the basics of quantum computing including qubits, quantum gates, quantum circuits, and measurement. Examples of different approaches to building quantum computers are provided. While progress is being made, quantum computing is still in its early stages with devices currently having just tens of qubits. The document concludes with a discussion of the investments and doubling of quantum volume needed each year to achieve quantum advantage in the 2020s and potential early uses in areas like cryptography.
1. Ancestry DNA at Scale
Using Hadoop and HBase
September 7, 2013
2. What does this talk cover?
What does Ancestry do?
How did our journey with Hadoop start?
Using Hadoop as a Job Processor
DNA Matching with Hadoop and HBase
What’s next?
4. Discoveries Are the Key
We are the world's largest online family history resource.
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to the 16th century
• 4 petabytes
6. Discoveries With DNA
Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 120,000 DNA samples
700,000 SNPs for each sample
6,000,000+ 4th cousin matches
[Chart: genotyped samples over time, growing from zero toward 150,000]
A SNP: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism).
(http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)
10. What’s the Story?
Cast of Characters (Scientists and Software Engineers)
Scientists think they can code:
• Linux
• MySQL
• PERL and/or Python

Software Engineers think they are Scientists:
• Biology in HS and College
• Math/Statistics
• Read science papers

Pressures of a startup business – release a product, learn, and then scale
Sr. Manager, 5 developers, and a 4-member Science Team
11. DNA Input
Raw Data (A,C,T,G,0):
3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A
A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G
G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs)

Map File (tells you where each SNP sits on a particular chromosome; the remaining columns in this excerpt are all zero):
rs10005853, rs10015934, rs1004236, rs10059646, rs10085382, rs10123921, rs10127827, rs10155688, rs10162780, rs1017484, rs10188129, …
12. What Did “Get Something Running” Look Like?
[Diagram, old version: a Pipeline Control service creates runs; an Enqueuer performs DNA validation; a Watch Dog monitors via heart beat, polls status, and reruns failures; results processing, finalize, and disc management (V2) complete the flow. AdMixture (ethnicity), Beagle (phasing), and GERMLINE (matching) all run on the same machine, the "Beefy Box".]

Single Beefy Box – only option is to scale vertically
13. Measure Everything Principle
• Start time, end time, duration in seconds, and sample count for every step in the pipeline. Also the full end-to-end processing time
• Put the data in pivot tables and graph each step
• Normalize the data (sample size was changing)
#1
• Use the data collected to predict future performance
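The prediction idea behind "measure everything" can be sketched in a few lines: fit a curve to per-step timings, then extrapolate to future pool sizes. The sample counts and hours below are invented for illustration, not Ancestry's actual measurements; only the technique (fit, then predict) reflects the talk.

```python
# Measured (DNA pool size, hours per batch) pairs for a matching-like step.
# Numbers are illustrative; the talk's real data showed the matching steps
# growing quadratically with pool size.
data = [(2500, 0.1), (5000, 0.4), (10000, 1.5), (20000, 6.0), (40000, 24.0)]

# Least-squares fit of hours ~ a * n^2 (a pure quadratic model):
# a = sum(n^2 * h) / sum(n^4)
a = sum(n * n * h for n, h in data) / sum((n * n) ** 2 for n, _ in data)

def predict_hours(pool_size):
    """Extrapolate the fitted quadratic to a future DNA pool size."""
    return a * pool_size * pool_size

print(round(predict_hours(40000), 1))  # reproduces the measured point
print(round(predict_hours(120000)))    # hundreds of hours: the time bomb
```

With a fit like this, you can see the wall coming long before you hit it, which is exactly how the team knew when each quadratic step would blow up.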
14. Challenges and Pain Points
Performance degrades as the DNA pool grows:
• Static (by batch size)
• Linear (by DNA pool size)
• Quadratic (matching-related steps) – the time bomb

(Courtesy of Keith's plotting)
16. Why Attack Ethnicity First?
• Smart developers, little Hadoop experience
– Using Hadoop as a job scheduler and scaling the ethnicity step was easier than redesigning the matching step
• AdMixture is a self-contained application
– Reference panel, the user's DNA, and a seed value for inputs
– CPU-intensive job that writes to stdout
• Easy to split up the input
• Looked hard enough at the matching problem to realize an HBase, MapReduce solution was realistic
17. Parallel Ethnicity Jobs
Typical run of 1000 samples: queue up one Hadoop job with 40 tasks, 25 samples per task.

[Diagram: Hadoop cluster of 20 servers (4 slots x 96GB each); MapReduce fans the Admixture tasks out across the servers]
#2
19. Freed up the “Beefy Box”
• Moving AdMixture off left an additional 10 threads for phasing and matching
• Memory was freed up for phasing and matching
• Just moving AdMixture off saved over 6 hours of processing on the single box
– Bought us time
21. What is GERMLINE?
• GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
• GERMLINE also refers to the reference implementation of that algorithm written in C++
• You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
22. So what's the problem?
• GERMLINE (the implementation) was not meant to be used in an industrial setting
• Stateless
• Single threaded
• Prone to swapping (heavy memory usage)
• Generic – used for any DNA (fish, fruit fly, human, …)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to a crawl
• Put simply: GERMLINE couldn't scale
23. GERMLINE Run Times (in hours)
[Chart: run time in hours (0–25) vs. number of samples (2,500 to 60,000); run time climbs steeply as the pool grows]
24. Projected GERMLINE Run Times (in hours)
[Chart: measured GERMLINE run times with projections extrapolated out to 122,500 samples; projected run times approach 700 hours]
25. The Mission : Create a Scalable Matching Engine
... and thus was born Jermline (aka "Jermline with a J")
26. DNA Matching : How it Works
The Input:
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC

Kara Thrace, aka Starbuck:
• Ace viper pilot
• Has a special destiny
• Not to be trifled with

Admiral Adama:
• Admiral of the Colonial Fleet
• Routinely saves humanity from destruction
27. DNA Matching : How it Works
Separate into words (positions 0, 1, 2):
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
28. DNA Matching : How it Works
Build the hash table:
Starbuck : ACTGA CCTAG TTGAC (positions 0, 1, 2)
Adama : TTAAG CCTAG TTGAC
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
29. DNA Matching : How it Works
Iterate through the genome and find matches:
Starbuck : ACTGA CCTAG TTGAC (positions 0, 1, 2)
Adama : TTAAG CCTAG TTGAC

ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Starbuck and Adama match from position 1 to position 2
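The three steps above (split into words, build the hash table, read matches off shared entries) can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the actual GERMLINE or Jermline code:

```python
# Word-based matching sketch: split each genome into fixed-width "words",
# index (word, position) -> users, then any two users sharing an entry
# match at that position.
WORD = 5

def to_words(dna):
    """Split a DNA string into consecutive fixed-width words."""
    return [dna[i:i + WORD] for i in range(0, len(dna), WORD)]

samples = {
    "Starbuck": "ACTGACCTAGTTGAC",
    "Adama":    "TTAAGCCTAGTTGAC",
}

# Build the hash table: (word, position) -> set of users with that word there
table = {}
for user, dna in samples.items():
    for pos, word in enumerate(to_words(dna)):
        table.setdefault((word, pos), set()).add(user)

# Every (word, position) shared by two users is a match at that position
matches = {}
for (word, pos), users in table.items():
    for a in users:
        for b in users:
            if a < b:
                matches.setdefault((a, b), set()).add(pos)

print(matches)  # {('Adama', 'Starbuck'): {1, 2}}
```

The payoff of the hash table is that candidate pairs come straight from shared entries, instead of comparing every genome against every other base by base.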
33. The GERMLINE Way
Step one : Rebuild the entire hash table from scratch, including the new sample

Starbuck : ACTGA CCTAG TTGAC (positions 0, 1, 2)
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
34. The GERMLINE Way
Step two : Find everybody's matches all over again, including the new sample (n x n comparisons)

Starbuck : ACTGA CCTAG TTGAC (positions 0, 1, 2)
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar

Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
35. The GERMLINE Way
Step three : Now, throw away the evidence!

Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar

Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1

You have done this before, and you will have to do it ALL OVER AGAIN.

36. Not so good, right?
Now let's take a look at the Jermline way.
37. The Jermline way
Step one : Update the hash table.

Already stored in HBase:
2_ACTGA_0 : { Starbuck: 1 }
2_TTAAG_0 : { Adama: 1 }
2_CCTAG_1 : { Starbuck: 1, Adama: 1 }
2_TTGAC_2 : { Starbuck: 1, Adama: 1 }

New sample to add:
Baltar : TTAAG CCTAG GGGCG – add a column for the new sample to each word row it touches

Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
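The incremental update can be sketched with a plain dict standing in for the HBase table (a simplification: real HBase rows live in a distributed store and are written via MapReduce, not touched in-process):

```python
# Simulated Jermline hash table. Row key: [CHROMOSOME]_[WORD]_[POSITION];
# qualifiers are user IDs; the cell value is a byte set to 1.
hbase = {
    "2_ACTGA_0": {"Starbuck": 1},
    "2_TTAAG_0": {"Adama": 1},
    "2_CCTAG_1": {"Starbuck": 1, "Adama": 1},
    "2_TTGAC_2": {"Starbuck": 1, "Adama": 1},
}

def add_sample(table, chromosome, user, words):
    """Incremental update: touch only the rows for the new sample's words."""
    for pos, word in enumerate(words):
        row_key = f"{chromosome}_{word}_{pos}"
        table.setdefault(row_key, {})[user] = 1

add_sample(hbase, 2, "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
# Baltar now appears in 2_TTAAG_0 and 2_CCTAG_1, plus a new row 2_GGGCG_2;
# existing rows and users are untouched -- no full rebuild required.
```

This is the key contrast with the GERMLINE way: adding a sample costs work proportional to that one sample, not to the whole pool.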
38. The Jermline way
Step two : Find matches.

Already stored in HBase:
2_Starbuck x 2_Adama : { (1, 2), ... }

New matches to add:
Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1

"Fuzzy match" the consecutive words. Worst case: identical twins.
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
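Turning the shared word positions into the stored ranges amounts to merging consecutive positions into runs. A minimal sketch (this omits the "fuzzy" tolerance for occasional mismatched words that the real implementation needs):

```python
# Collapse the positions where two users share words into contiguous
# (start, end) match ranges -- the cell value stored per user pair.
def to_ranges(positions):
    """Merge sorted word positions into runs, e.g. {0, 1} -> [(0, 1)]."""
    ranges, start, prev = [], None, None
    for p in sorted(positions):
        if start is None:
            start = prev = p
        elif p == prev + 1:
            prev = p
        else:
            ranges.append((start, prev))
            start = prev = p
    if start is not None:
        ranges.append((start, prev))
    return ranges

# Baltar shares words with Adama at positions 0 and 1, with Starbuck at 1
print(to_ranges({0, 1}))  # [(0, 1)]
print(to_ranges({1}))     # [(1, 1)]
```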
40. But wait ... what about Zarek, Roslin, Hera, and Helo?
41. Run them in parallel with Hadoop!
(Photo by Benh Lieu Song)
42. Parallelism with Hadoop
• Batches are usually about a thousand people.
• Each mapper takes a single chromosome for a single person.
o Three samples per task means 22 jobs with 334 tasks (1000/3) each
• MapReduce Jobs :
Job #1 : Match Words – updates the hash table
Job #2 : Match Segments – identifies areas where the samples match
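The partitioning arithmetic above can be sanity-checked in a couple of lines (a sketch; the 22 jobs correspond to the 22 autosomal chromosomes):

```python
# One MapReduce job per chromosome; each task handles three samples.
batch_size = 1000
chromosomes = 22        # autosomal chromosomes, one job each
samples_per_task = 3

jobs = chromosomes
tasks_per_job = -(-batch_size // samples_per_task)  # ceil(1000 / 3)
print(jobs, tasks_per_job)  # 22 jobs of 334 tasks each
```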
43. How does Jermline perform?
A 1700% improvement over GERMLINE! Along with more accurate results.
#3
46. Incremental Changes Over Time
• Support the business, move incrementally and adjust
• After H2, pipeline speed stays flat

(Courtesy of Bill's plotting)
47. Dramatically Increased our Capacity
Bottom line : Without Hadoop and HBase, this would have been expensive and difficult.

• Previously, we ran GERMLINE on a single "beefy box".
• 12-core 2.2GHz Opteron 6174 with 256GB of RAM
• We had upgraded this machine until it couldn't be upgraded any more.
• Processing time was unacceptable, growth was unsustainable.
• To continue running GERMLINE on a single box, we would have required a vastly more powerful machine, probably at the supercomputer level – at considerable cost!

• Now, we run Jermline on a cluster.
• 20 x 12-core 2GHz Xeon E5-2620 with 96GB of RAM
• We can now run 16 batches per day, whereas before we could only run one.
• Most importantly, growth is sustainable. To add capacity, we need only add more nodes.
49. Continue to Evolve the Software
• Azkaban for job control
– Nearly complete
• Phasing
– Still runs on the "Beefy Box"; 1000 samples take over 11 hours
– Total run time for 1000 samples is about 14 hours
– Re-implement with HBase, MapReduce, Hadoop
• Version Updates
– New algorithms require us to re-run the entire DNA pool
– Burst capacity to the cloud
• Machine Learning
– Matching (V2) and Ethnicity (V3) would both benefit from a Machine Learning approach
Job Processor: As you will see, we started our Hadoop/DNA journey with something fairly basic, and then we moved to the matching problem. DNA Matching: We will walk through an example of how matching works, discuss how GERMLINE implemented the matching, and contrast that with the Hadoop/HBase implementation we created.
At Ancestry.com our mission is to help people discover, preserve and share their family history.
Everything from birth certificates, obituaries, immigration records, census records, voter registration, old phone books, everything.
Typically, the way it works is this: You search through our records to find one of your relatives. Once you've found enough records that you're satisfied you've found your relative, you attach them to your family tree. After that, Ancestry goes to work for you. Our search engine takes a look at your whole tree to find relatives that you may not know about yet, and presents these to you as hints (the shaky leaf). You can then examine these hints and see if they are, in fact, related to you. It's pretty cool! And the beauty of it is, say you've found a relative who's researched their family tree pretty extensively. Well, you get to piggyback on all that research by simply adding their family tree to yours. A fine example of crowdsourcing.
Spit in a tube, pay $99, and learn about your past. That is how Derrick Harris of GigaOm described what we do. DNA is found in every living cell; it is the genetic material that encodes all of the information required to create and maintain life. DNA is passed down from parent to child and is like breadcrumbs left by our ancestors, and changes in DNA across generations give us a view into history. We can take those breadcrumbs and determine with a large degree of accuracy what your ethnicity is and who else in our database might be your cousin. If we determine that you have a 4th cousin, then you likely share a common ancestor with that person between 7 and 10 generations ago, or 150-300 years ago. We have a team of data scientists and bioinformatics PhDs working on this effort and have very quickly acquired over 120,000 DNA samples from people that have family trees on our site. Each DNA sample is composed of over 700,000 SNPs, or location markers. In order to compare the 700,000 SNPs from each new sample with the 700,000 SNPs from each existing sample already in our database, we have a sophisticated pipeline of algorithms that runs using Hadoop, HBase and MapReduce for parallel distributed processing. What is our confidence rate for a 4th cousin match? The average customer has close to 30 fourth-cousin matches.
Top left our ethnicity chart. To the right, Tree view with cousin hints and surnames in another member’s public tree. Maps pinpointing birth locations. List of surnames that appear in both trees.
The bottom red line is the size of our DNA pool (i.e. each unique sample in our database). The black line is the number of cousin matches we've calculated at that particular DNA pool size. As you can see, the matches start to compound and grow quadratically as the pool size increases. This is a good thing: it means we can find genetic relatives for most customers who take the DNA test. The cousin matches are actually a Big Data problem for our front end. We are looking at different ways to handle the transfer, storage, and growth of the cousin-match data as the DNA pool size increases.
Every scientist thinks they can code, because they have been doing it for a long time on their own or in an academic environment. But they don't know what it means to build, deploy, and support "production" code. Software engineers understand production code; they just think they understand the math and statistics (after all, they are computer scientists), and they think they can understand the science behind DNA (after all, they took biology in high school). That is nowhere near the education of a Bioinformatics or Population Genetics PhD. The Science Team are the domain experts, and the engineers are required to build a production system that meets the domain experts' needs. We really started light: 3 developers and 2 scientists. In fact, for the first 3 months we "borrowed" engineers from other projects to get this started.
5 possible values, not 4: A, C, T, G, and zero. Zero indicates a "read" failure at that position. No sample is perfect, extraction can be off, and each run on the same sample will come up with zeros in different spots. We run QC checks on the sample: if there are too many zeros, we have the lab try the extraction again. If that fails 2 more times, we issue a recollect (send another kit to the customer and ask them to submit their DNA again). The map file tells you where each value is on a particular chromosome.
We ran AdMixture on 10 threads, and Phasing and Germline on 10 threads. AdMixture would usually finish before Beagle (Phasing), and that freed up more memory and threads for Germline. In all, a 500-sample run took about 24 hours to complete (pool size < 25K). IF WE STAYED IN THIS CONFIGURATION (WHICH MATCHED MANY ACADEMIC ENVIRONMENTS), THE ONLY OPTION WAS TO INCREASE THE HARDWARE: MORE CPUS, MORE MEMORY. SCALING VERTICALLY JUST PLAIN SUCKS!
Critically important. In software development you must measure your performance at every step. It does not matter what you are doing: if you are not measuring your performance, you can't improve. The last point is critical. We could determine the formula for the performance of key phases and used that formula to predict future performance at particular DNA pool sizes. We could see the problems coming and knew when we were going to have performance issues. Story #1: The first step that went out of control (going quadratic) was the first implementation of the relationship calculation, which happens just after matching. This step was basically two nested for loops that walked over the entire DNA pool for each input sample. Simple code; it worked with small numbers and fell over fast. Time was approaching 5 hours to run. Two of my developers rewrote this in PERL and got it down to 2 minutes 30 seconds. They were ecstatic. One of our DNA Scientists (PhD in Bioinformatics, MS in Computer Science, he knows how to code) wrote an AWK command (nasty regular expressions) that ran in less than 10 seconds. My devs were humbled. For the next week, whenever they ran into Keith, they formally bowed to his skills. (All in good nature, all fun.)
Static by batch size (Phasing): some steps took a long time but were very consistent. A worry, but not critical to fix up front. Linear by DNA pool size (Pipeline Initialization): we looked at ways to streamline and improve the performance of these steps. Quadratic: those are the time bombs (Germline, relationship processing, Germline results processing). The only way we knew this was coming was because we measured each step in the pipeline.
KEY POINT: We knew we wanted to move this to Hadoop to solve matching. With that end goal in mind, we attacked the AdMixture/Ethnicity step. Without that initial investigation and discovery step, we could have used an MPI Linux environment or some other way to scale AdMixture.
Story #2: The first job we put through was a single job with 500 tasks (sample size of 500). AdMixture is a C++, multithreaded app. When we kicked up all the tasks, it did not leave enough CPU time for the "task health check" to run in the background on the Hadoop node. So the Job Controller would reach out and kill some jobs because they were "misbehaving", when in fact they were running just fine. Remember, Hadoop is intimately aware of the JVM and how it is running, but it does not have a good view into other applications you choose to run. Since AdMixture was C++, Hadoop had no idea how much memory, how many threads, or how much CPU was being used per "slot". We had to back things off so there was enough room for the Job Controller to get an "ACK" indicating the jobs were running fine. THE ONLY WAY TO UNDERSTAND HADOOP'S CAPABILITIES AND LIMITATIONS IS TO USE IT! BE READY FOR SOME SURPRISES.
Really happy with this performance: 1,000 samples usually ran in 2 hours 30 minutes to 2 hours 45 minutes. There are two spikes to explain, and they come down to a bug. AdMixture (remember, created by someone finishing a CS Master's in an academic setting) had a bug that showed up occasionally: the program would literally "get lost" and never complete. It would not GPF or throw an error; it just swallowed the CPU and never finished. Even worse, it usually happened on chromosome 1 or 2, the biggest chromosomes we process. We put a timeout on our tasks: if a task did not finish in 2 hours, we killed it, changed the seed value, and resubmitted a new task. That fixed the problem, and it explains the spikes.
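The kill-and-reseed workaround is a generic pattern worth spelling out. A minimal sketch, assuming a command line that accepts a seed (the `cmd_template` callable and the retry count are hypothetical, not the actual job-control code):

```python
import random
import subprocess

def run_with_timeout(cmd_template, timeout_s=2 * 60 * 60, max_tries=3):
    """Run a task; if it exceeds the timeout, kill it and resubmit
    with a fresh random seed -- the workaround for a solver that can
    'get lost' and spin forever without erroring out.

    cmd_template: callable taking a seed and returning the argv list.
    Returns the seed of the run that completed.
    """
    for _ in range(max_tries):
        seed = random.randrange(1 << 31)
        try:
            subprocess.run(cmd_template(seed), check=True, timeout=timeout_s)
            return seed                      # finished within the window
        except subprocess.TimeoutExpired:
            continue                         # hung: new seed, try again
    raise RuntimeError("task never completed within the timeout")
```

The key design point is that the retry changes the seed: rerunning the identical inputs would just hit the same pathological path again.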
This was a great first step. We got valuable Hadoop, MapReduce, and job-control experience, and this first step BOUGHT US TIME! It gave us the confidence to start working on the GERMLINE matching problem.
Very smart people at Columbia University came up with GERMLINE.
Remember, for an academic, running a 1,000-sample set through GERMLINE was "large". I've talked to people who kept re-running the same 50 fish DNA samples through GERMLINE to clean up the variations between sample extractions (think of it as eliminating all the zeros). In a lot of ways, we were using GERMLINE in a way it was not built for.
Mention how we kept upgrading and tightening things up
Our projections showed how bad the execution time would get. As we approached 120K for the DNA pool size, each additional 500 sample set would require 700 hours to complete – over 4 weeks.
Jermline, with a "J" (the lead engineer's first name is Jeremy). This was a "clean room" implementation of the algorithm: read the reference paper, don't look at the C++ reference implementation. Work from the original (brilliant) paper.
Using Battlestar Galactica characters for the matching example.
For each person-to-person comparison, we add up the total length of their shared DNA and run that through a statistical model to see how closely they're related. This is the “Relationship Calculation” step that works on the GERMLINE output.
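A minimal sketch of that step, with illustrative numbers (the segment representation, units, and thresholds below are placeholders, not the actual statistical model):

```python
def total_shared_cm(segments):
    """Sum the lengths of all shared DNA segments for one pair of
    users. Each segment is a (start, end) pair in centimorgans."""
    return sum(end - start for start, end in segments)

def rough_relationship(shared_cm):
    """Toy stand-in for the statistical model: bucket total shared
    DNA into a coarse relationship estimate. Thresholds are
    illustrative only."""
    if shared_cm > 2300:
        return "parent/child or identical"
    if shared_cm > 1300:
        return "sibling"
    if shared_cm > 500:
        return "1st cousin range"
    if shared_cm > 90:
        return "2nd-3rd cousin range"
    return "distant or none"
```

The point is the two-stage shape: matching produces segments per pair, and the relationship calculation reduces them to one number and classifies it.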
Remind people that GERMLINE was stateless
Anytime you see an N-by-N comparison in a problem you are working on, it should send up huge red flags.
HBase holds the data (a mix between a spreadsheet and a hash table). Adding columns is easy, and a very sparse matrix is fine. The row key is the chromosome, the word value, and the position (which word). Each new sample adds a column to the table; a value of 1 in a cell indicates this user has this value at this location, so a row holds all the samples in our DNA pool with that same value. This is really a pretty simple implementation. Remember: SIMPLE SCALES.
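The table layout can be modeled in a few lines. Here the HBase table is stood in for by a plain dictionary (an illustration of the schema, not HBase client code): one row per (chromosome, position, word value), one entry per sample that carries that word.

```python
from collections import defaultdict

# Toy model of the Jermline match table:
# row key = (chromosome, word position, word value)
# row contents = the set of samples carrying that word there.
table = defaultdict(set)

def add_sample(sample_id, words):
    """Insert one new sample and return its match candidates.

    words: {(chromosome, position): word_value} for this sample.
    Everyone already present in a row shares that word with the new
    sample -- a row lookup replaces the old N x N scan of the pool.
    """
    candidates = set()
    for (chrom, pos), value in words.items():
        row = table[(chrom, pos, value)]
        candidates |= row        # existing samples sharing this word
        row.add(sample_id)       # then add the new sample as a "column"
    return candidates
```

This is why incremental batches stay cheap: each new sample only touches the rows for its own words, regardless of how large the total pool has grown.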
There is a second table for the fuzzy matching phase. It holds the list of ranges where two users match on a chromosome, and it is used to create the output of the matching phase: exactly where two individuals match on each chromosome.
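The cell values in that second table are ranges built from individual word hits. A small sketch of that collapse step, assuming word positions are integer indices (the function and representation are illustrative):

```python
def to_ranges(positions):
    """Collapse the word positions where two users match on one
    chromosome into contiguous (start, end) ranges -- the shape of
    the fuzzy-matching table's cell values."""
    ranges = []
    for p in sorted(positions):
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], p)   # extend the open range
        else:
            ranges.append((p, p))             # start a new range
    return ranges
```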
There were a whole bunch of characters on Battlestar Galactica!
On the first run we kicked off one job with 11,000 tasks (500 samples x 22 chromosomes) using HBase 0.92, and we panicked the HBase region server. That's where we came up with 22 jobs (one per chromosome) with about 334 tasks per job. (Moving to HBase 0.94 was much more stable.)
Story #3: We would run samples through both the old GERMLINE and the new Hadoop Jermline. For the most part, they always matched. We finally found a few runs with discrepancies and had to pull in the Science Team to check – we had actually found a bug in the original GERMLINE implementation for an edge case. The clean-room Hadoop implementation was "more correct" than the original C++ GERMLINE reference code. Very gratifying to see – but the truth is it had us concerned and confused for about 3 days. We had made the natural assumption that the base GERMLINE implementation (with a 'G') was 100% correct. That assumption was wrong.
This slide is a huge relief. We’ve been released and steady for a while. One note, the curve for H2 is not totally flat. It is going up ever so slightly. No worries. We can always add more nodes to the cluster and reduce the time.
This is an "Agile" development story. Point out a few colors: dark green, orange, light green at the top, and purple.
The darker green is AdMixture; you can see when we moved it to Hadoop (our H1 release).
Orange is the matching step (GERMLINE to Jermline, our H2 release).
The lighter green is the pipeline finalization step. We eliminated most of this step when we released H2. We had a failsafe way to fall back to the completed steps of the previous run; we never wanted to fail in the middle of a run, destroy everything, and then have to rerun the entire pool from scratch. That was a key part of finalization pre-Jermline.
The purple is phasing (Beagle). Static based on input size, very stable, and on our hit list.
The "beefy box" would be a good candidate for a large database server or a single node in a heavily used distributed cache (Memcached or Redis).