3. • Over 30,000 historical content collections
• 13 billion records and images
• Records dating back to the 16th century
• 10 petabytes
We are the world's largest online family history resource
Discoveries are the Key
5. Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 200,000 DNA samples
700,000 SNPs for each sample
10,000,000+ cousin matches
[Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). Source: http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism]
[Chart: Genotyped Samples; axis ticks up to 150,000]
Discoveries with DNA
8. • GERMLINE is an algorithm that finds hidden
relationships within a pool of DNA
• GERMLINE also refers to the reference
implementation of that algorithm written in C++
• You can find it here :
http://www1.cs.columbia.edu/~gusev/germline/
Introducing … GERMLINE!
9. • GERMLINE (the implementation) was not meant to be used in an industrial setting
o Stateless
o Single-threaded
o Prone to swapping (heavy memory usage)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to a crawl
• Put simply: GERMLINE couldn't scale
So What’s the Problem?
11. Projected GERMLINE Run Times (in hours)
[Chart: GERMLINE run times and projected GERMLINE run times; x-axis: Samples (from 2,500 upward); y-axis: Hours (0 to 700)]
12. The Mission : Create a Scalable Matching
Engine
... and thus Jermline was born
(aka "Jermline with a J")
13. What is Hadoop?
• Hadoop is an open-source platform for processing large
amounts of data in a scalable, fault-tolerant, affordable
fashion, using commodity hardware
• Hadoop specifies a distributed file system called HDFS
• Hadoop supports a processing methodology known as
MapReduce
• Many tools are built on top of Hadoop, such as HBase,
Hive, and Flume
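To make the MapReduce idea concrete, here is a toy, in-memory sketch of the map, shuffle, and reduce phases. It does not use Hadoop at all; `run_job`, `count_words_mapper`, and `sum_reducer` are hypothetical names chosen for illustration, and a real job would run these phases across many machines.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run a toy, in-memory MapReduce job (no Hadoop involved)."""
    # Map: each input record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]

    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Add up all the counts emitted for one word."""
    return sum(values)


counts = run_job(
    ["Hadoop stores data", "Hadoop processes data"],
    count_words_mapper,
    sum_reducer,
)
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```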
17. What is HBase?
• HBase is an open-source NoSQL data store that runs on top of
HDFS
• HBase is columnar; you can think of it as a weird amalgam of a
hashtable and a spreadsheet
• HBase supports unlimited rows and columns
• HBase stores data sparsely; there is no penalty for empty cells
• HBase is gaining in popularity: Salesforce, Facebook, and Twitter
have all invested heavily in the technology, as well as many others
18. Game of Thrones Characters, in an HBase Table
KEY      gender  hair_color  family_name  is_evil
Joffrey  male    blonde      Baratheon    yes
Cersei   female  blonde      Lannister    kinda
19. Adding a Row to an HBase Table
KEY      gender  hair_color  family_name  is_evil
Joffrey  male    blonde      Baratheon    yes
Cersei   female  blonde      Lannister    kinda
Sansa    female  red         Stark        no
20. Adding a Column to an HBase Table
KEY      gender  hair_color  family_name  is_evil  title
Joffrey  male    blonde      Baratheon    yes      king
Cersei   female  blonde      Lannister    kinda
Sansa    female  red         Stark        no
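The "hashtable plus spreadsheet" picture can be sketched with nested dictionaries. This is only an in-memory model of the idea, not the HBase client API; `SparseTable` and its methods are hypothetical names. Cells that were never written simply do not exist, which is why adding the `title` column costs nothing for Cersei and Sansa.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse table: row key -> {column -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Writing a cell creates the row and column on demand.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Missing cells are not stored; reading one returns the default.
        return self._rows.get(row_key, {}).get(column, default)

    def row(self, row_key):
        # Return a copy of the cells that exist for this row.
        return dict(self._rows.get(row_key, {}))


table = SparseTable()
for key, gender, hair, family, evil in [
    ("Joffrey", "male", "blonde", "Baratheon", "yes"),
    ("Cersei", "female", "blonde", "Lannister", "kinda"),
    ("Sansa", "female", "red", "Stark", "no"),
]:
    table.put(key, "gender", gender)
    table.put(key, "hair_color", hair)
    table.put(key, "family_name", family)
    table.put(key, "is_evil", evil)

# Adding a column only touches the rows that have a value for it.
table.put("Joffrey", "title", "king")
```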
21. Cersei : ACTGACCTAGTTGAC
Joffrey : TTAAGCCTAGTTGAC
The Input
Cersei Baratheon
• Former queen of Westeros
• Machiavellian manipulator
• Mostly evil, but occasionally sympathetic
Joffrey Baratheon
• Pretty much the human embodiment of evil
• Needlessly cruel
• Kinda looks like Justin Bieber
DNA Matching : How it Works
22. 0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Separate into words
DNA Matching : How it Works
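A minimal sketch of the word-splitting step. The word size of 5 is taken from the toy example on the slide and is an assumption here, not a real-world setting; `split_into_words` is a hypothetical helper name.

```python
def split_into_words(sequence, word_size=5):
    """Cut a sequence into consecutive, non-overlapping words."""
    return [
        sequence[start:start + word_size]
        for start in range(0, len(sequence), word_size)
    ]


cersei_words = split_into_words("ACTGACCTAGTTGAC")
joffrey_words = split_into_words("TTAAGCCTAGTTGAC")
# cersei_words  == ["ACTGA", "CCTAG", "TTGAC"]
# joffrey_words == ["TTAAG", "CCTAG", "TTGAC"]
```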
23. 0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
Build the hash table
DNA Matching : How it Works
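The hash table on this slide could be built roughly as follows. This is an illustrative sketch, not the GERMLINE source; `build_hash_table` is a hypothetical name, and the input is assumed to be already split into words.

```python
from collections import defaultdict


def build_hash_table(samples):
    """Map each 'WORD_POSITION' key to the samples carrying that word there.

    samples: dict of sample name -> list of words, in genome order.
    """
    table = defaultdict(list)
    for name, words in samples.items():
        for position, word in enumerate(words):
            table[f"{word}_{position}"].append(name)
    return dict(table)


table = build_hash_table({
    "Cersei": ["ACTGA", "CCTAG", "TTGAC"],
    "Joffrey": ["TTAAG", "CCTAG", "TTGAC"],
})
# table["CCTAG_1"] == ["Cersei", "Joffrey"]
```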
24. Iterate through genome and find matches
Cersei and Joffrey match from position 1 to position 2
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
DNA Matching : How it Works
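A sketch of the matching pass, under a simplifying assumption: two samples match at a position only when their words are identical, and consecutive matching positions are merged into one range. The real algorithm may be more permissive; `find_matches` and `merge_positions` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations


def merge_positions(positions):
    """Merge consecutive positions into inclusive (start, end) ranges."""
    ranges = []
    for position in sorted(positions):
        if ranges and position == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], position)  # extend current range
        else:
            ranges.append((position, position))     # start a new range
    return ranges


def find_matches(table):
    """Return {(sample_a, sample_b): [(start, end), ...]} from the hash table."""
    shared = defaultdict(set)
    for key, names in table.items():
        position = int(key.rsplit("_", 1)[1])
        # Every pair of samples listed under one key shares that word.
        for pair in combinations(sorted(names), 2):
            shared[pair].add(position)
    return {pair: merge_positions(pos) for pair, pos in shared.items()}


matches = find_matches({
    "ACTGA_0": ["Cersei"],
    "TTAAG_0": ["Joffrey"],
    "CCTAG_1": ["Cersei", "Joffrey"],
    "TTGAC_2": ["Cersei", "Joffrey"],
})
# matches == {("Cersei", "Joffrey"): [(1, 2)]}
```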
26. IBD to Relationship Estimation
• We use the total length of all shared segments to estimate the relationship between two genetic relatives
• This is basically a classification problem
[Plot: ERSA; x-axis: total_IBD (cM), 5 to 5000; y-axis: probability, 0.00 to 0.05; curves m1 through m11]
27. Jaime : TTAAGCCTAGGGGCG
But Wait...What About Jaime?
Jaime Lannister
• Kind of a has-been
• Killed the Mad King
• Has the hots for his sister, Cersei
29. 0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Jaime : TTAAG CCTAG GGGCG
ACTGA_0 : Cersei
TTAAG_0 : Joffrey, Jaime
CCTAG_1 : Cersei, Joffrey, Jaime
TTGAC_2 : Cersei, Joffrey
GGGCG_2 : Jaime
Step one: Rebuild the entire hash table from
scratch, including the new sample
The GERMLINE Way
30. Cersei and Joffrey match from position 1 to position 2
Joffrey and Jaime match from position 0 to position 1
Cersei and Jaime match at position 1
Step two: Find everybody's matches all over
again, including the new sample. (n x n comparisons)
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Jaime : TTAAG CCTAG GGGCG
ACTGA_0 : Cersei
TTAAG_0 : Joffrey, Jaime
CCTAG_1 : Cersei, Joffrey, Jaime
TTGAC_2 : Cersei, Joffrey
GGGCG_2 : Jaime
The GERMLINE Way
31. Cersei and Joffrey match from position 1 to position 2
Joffrey and Jaime match from position 0 to position 1
Cersei and Jaime match at position 1
Step three : Now, throw away the evidence!
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Jaime : TTAAG CCTAG GGGCG
ACTGA_0 : Cersei
TTAAG_0 : Joffrey, Jaime
CCTAG_1 : Cersei, Joffrey, Jaime
TTGAC_2 : Cersei, Joffrey
GGGCG_2 : Jaime
The GERMLINE Way
32. Step one: Update the hash table
Already stored in HBase:
             Cersei  Joffrey
2_ACTGA_0    1
2_TTAAG_0            1
2_CCTAG_1    1       1
2_TTGAC_2    1       1
New sample to add:
Jaime : TTAAG CCTAG GGGCG
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that
position on that chromosome
The Jermline Way
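The incremental update can be sketched with a plain dictionary standing in for the HBase table: row key `[CHROMOSOME]_[WORD]_[POSITION]`, qualifier `[USER ID]`, cell value 1. This is not HBase client code. `add_sample` is a hypothetical name, and returning the users who already hold each word is an assumption made so the next step has something to work with.

```python
def add_sample(word_table, chromosome, user_id, words):
    """Add one sample's words; existing rows are left in place.

    Returns {position: [users who already had the same word there]}.
    """
    already_there = {}
    for position, word in enumerate(words):
        row_key = f"{chromosome}_{word}_{position}"
        row = word_table.setdefault(row_key, {})
        if row:
            already_there[position] = sorted(row)
        row[user_id] = 1  # stand-in for the single byte set to 1
    return already_there


# State of the table before Jaime arrives (from the slide).
word_table = {
    "2_ACTGA_0": {"Cersei": 1},
    "2_TTAAG_0": {"Joffrey": 1},
    "2_CCTAG_1": {"Cersei": 1, "Joffrey": 1},
    "2_TTGAC_2": {"Cersei": 1, "Joffrey": 1},
}

candidates = add_sample(word_table, 2, "Jaime", ["TTAAG", "CCTAG", "GGGCG"])
# candidates == {0: ["Joffrey"], 1: ["Cersei", "Joffrey"]}
```

Only the three rows touched by Jaime's words change; nothing is rebuilt from scratch.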
33. New matches to add:
Jaime and Joffrey match from position 0 to position 1
Jaime and Cersei match at position 1
Already stored in HBase:
            2_Cersei        2_Joffrey
2_Cersei                    { (1, 2), ...}
2_Joffrey   { (1, 2), ...}
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
Step two: Find matches, update the results table
The Jermline Way
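A sketch of the results-table update, again with a dictionary standing in for HBase: row key and qualifier are both `[CHROMOSOME]_[USER ID]`, and the cell holds a list of matching ranges. Writing each match in both directions is an assumption inferred from the table on the slide; `record_matches` is a hypothetical name.

```python
def record_matches(results, chromosome, new_user, new_matches):
    """Write the new user's matches without touching existing cells.

    new_matches: dict of existing user -> list of (start, end) ranges.
    """
    new_key = f"{chromosome}_{new_user}"
    for other_user, ranges in new_matches.items():
        other_key = f"{chromosome}_{other_user}"
        # Store the match under both users so either can look it up.
        results.setdefault(new_key, {})[other_key] = list(ranges)
        results.setdefault(other_key, {})[new_key] = list(ranges)
    return results


# Results already stored before Jaime arrives (from the slide).
results = {
    "2_Cersei": {"2_Joffrey": [(1, 2)]},
    "2_Joffrey": {"2_Cersei": [(1, 2)]},
}

record_matches(results, 2, "Jaime", {
    "Joffrey": [(0, 1)],
    "Cersei": [(1, 1)],
})
# results["2_Jaime"] == {"2_Joffrey": [(0, 1)], "2_Cersei": [(1, 1)]}
```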
37. Parallelism with Hadoop
• Batches are usually about a thousand people
• Each mapper takes a single chromosome for a
single person
• MapReduce Jobs :
Job #1 : Match Words
o Updates the hash table
Job #2 : Match Segments
o Identifies areas where the samples match
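A rough sketch of how a batch breaks down into mapper inputs, one (person, chromosome) pair per mapper. It assumes only the 22 autosomes are processed, which is an assumption based on the test being autosomal; `mapper_inputs` and the user IDs are hypothetical.

```python
def mapper_inputs(user_ids, chromosomes=range(1, 23)):
    """List one (user_id, chromosome) work unit per mapper."""
    return [
        (user_id, chromosome)
        for user_id in user_ids
        for chromosome in chromosomes
    ]


batch = [f"user{i}" for i in range(1000)]  # a batch of about a thousand people
units = mapper_inputs(batch)
# len(units) == 22000, i.e. 1,000 people x 22 chromosomes
```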
44. Lessons Learned : What went right?
• This project would not have been possible without TDD
• Two sets of test data : generated and public domain
• 89% coverage
• Corrected a bug in the reference implementation
• Has never failed in production
46. Lessons Learned : What would we do differently?
• Front-load some performance tests
o HBase and Hadoop can have odd performance profiles
o HBase in particular has some strange behavior if you're not familiar with its inner workings
• Allow a lot of time for live tests, dry runs, and deployment
o These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"