Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1aNuQO8.
Bill Yetman and Jeremy Pollack discuss using several Agile techniques -start simple, get going, iterate- and the “measure everything” principle to create the architecture behind the Family History website. Filmed at qconsf.com.
Jeremy Pollack is a Senior Engineer at Ancestry.com, where he supports a team of scientists and makes their discoveries scale. Bill Yetman has served as Senior Director of Engineering at Ancestry.com since January 2011. He holds a B.S. in Computer Science and a B.A. in Psychology from San Diego State University.
2. Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/scaling-ancestry-dna
3. Presented at QCon San Francisco
www.qconsf.com
4. What Does This Talk Cover?
What does Ancestry do?
How does the science work?
How did our journey with Hadoop start?
DNA matching with Hadoop and HBase
Lessons Learned
What’s next?
6. Discoveries are the Key
We are the world's largest online family history resource
• Over 30,000 historical content collections
• 12 billion records and images
• Records dating back to 16th century
• 10 petabytes
8. Discoveries with DNA
Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 200,000 DNA samples
700,000 SNPs for each sample
10,000,000+ cousin matches
[Chart: Genotyped Samples, y-axis up to 150,000]
[Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism) (http://en.wikipedia.org/wiki/Singlenucleiotide_polymorphism)]
11. What’s the Story?
Cast of Characters (Scientists and Software Engineers)
Scientists
Think they can code:
• Linux
• MySQL
• PERL and/or Python
Software Engineers
Think they are Scientists:
• Biology in HS and College
• Math/Statistics
• Read science papers
Pressures of a new business
– Release a product, learn, and then scale
Team: a Sr. Manager, 3 developers, and a 2-member Science Team
12. What Did “Get Something Running” Look Like?
[Diagram: a single “Beefy Box” on which the Ethnicity step and Matching (GERMLINE) run]
Specifics:
1) Ran multiple threads for the two steps
2) Both steps were run in parallel
3) As the DNA Pool grew both steps required more memory
Single Beefy Box – Only option is to scale Vertically
13. Measure Everything Principle
• Record start time, end time, duration in seconds, and sample count for every step in the pipeline, plus the full end-to-end processing time
• Put the data in pivot tables and graph each step
• Normalize the data (sample size was changing)
• Use the data collected to predict future performance
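The bullets above can be sketched in a few lines of Python. This is only an illustration of the principle, not the team's actual tooling: the `measured` helper, the `metrics` list, and the field names are all hypothetical.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory log; a real pipeline would persist these rows
# somewhere they can be pivoted and graphed.
metrics = []

@contextmanager
def measured(step, sample_count):
    """Record start time, end time, duration, and sample count for one step."""
    start_wall = time.time()
    start = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        metrics.append({
            "step": step,
            "start": start_wall,
            "end": start_wall + duration,
            "duration_s": duration,
            "samples": sample_count,
            # Normalize so batches of different sizes can be compared.
            "seconds_per_sample": duration / sample_count,
        })

with measured("matching", sample_count=1000):
    sum(range(100_000))  # stand-in for the real work of a pipeline step
```

Each entry in `metrics` is one row of the kind of table that can then be pivoted and graphed per step.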
14. Challenges and Pain Points
Performance degrades when DNA pool grows
• Static (by batch size)
• Linear (by DNA pool size)
• Quadratic (Matching related steps) – Time bomb
(Plot courtesy of Keith)
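To see why the quadratic steps are a time bomb, here is a minimal sketch of projecting run time from collected measurements. The numbers are made up for illustration and are not Ancestry's data. The fit assumes a pure `hours = a * samples**2` model, which is a simplification.

```python
def fit_quadratic(samples, hours):
    """Least-squares fit of hours = a * samples**2 (one coefficient)."""
    numerator = sum((n ** 2) * h for n, h in zip(samples, hours))
    denominator = sum(n ** 4 for n in samples)
    return numerator / denominator

# Made-up measurements: (DNA pool size, matching run time in hours).
pool_sizes = [2_500, 5_000, 10_000, 20_000]
run_hours = [0.25, 1.0, 4.0, 16.0]

a = fit_quadratic(pool_sizes, run_hours)

def project(n):
    """Projected run time in hours for a pool of n samples."""
    return a * n ** 2

print(round(project(40_000), 1))
```

Under this model, every doubling of the pool quadruples the run time, which is what makes the curve look harmless early on.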
16. What is GERMLINE?
• GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
• GERMLINE also refers to the reference implementation of that algorithm written in C++
• You can find it here : http://www1.cs.columbia.edu/~gusev/germline/
17. So What’s the Problem?
• GERMLINE (the implementation) was not meant to be used in an industrial setting
– Stateless
– Single threaded
– Prone to swapping (heavy memory usage)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to a crawl
• Put simply: GERMLINE couldn't scale
19. Projected GERMLINE Run Times (in hours)
[Chart: GERMLINE run times and projected GERMLINE run times, in hours (y-axis 0 to 700), versus samples (x-axis 2,500 to 122,500)]
20. The Mission : Create a Scalable Matching Engine
... and thus Jermline was born (aka "Jermline with a J")
21. What is Hadoop?
• Hadoop is an open-source platform for processing large
amounts of data in a scalable, fault-tolerant, affordable
fashion, using commodity hardware
• Hadoop specifies a distributed file system called HDFS
• Hadoop supports a processing methodology known as
MapReduce
• Many tools are built on top of Hadoop, such as HBase,
Hive, and Flume
23. What is HBase?
• HBase is an open-source NoSQL data store that runs on top of
HDFS
• HBase is columnar; you can think of it as a weird amalgam of a
hashtable and a spreadsheet
• HBase supports unlimited rows and columns
• HBase stores data sparsely; there is no penalty for empty cells
• HBase is gaining in popularity: Salesforce, Facebook, and Twitter
have all invested heavily in the technology, as well as many others
24. Battlestar Galactica Characters, in an HBase Table
KEY | is_cylon | hair_color | gender | is_final_five | rank
Six | true | blonde | female | no |
Adama | false | brown | male | | admiral
25. Adding a Row to an HBase Table
KEY | is_cylon | hair_color | gender | is_final_five | rank
Six | true | blonde | female | no |
Adama | false | brown | male | | admiral
Baltar | false | brown | male | |
26. Adding a Column to an HBase Table
KEY | is_cylon | hair_color | gender | is_final_five | rank | friends
Six | true | blonde | female | no | |
Adama | false | brown | male | | admiral | Kara Thrace, Saul Tigh
Baltar | false | brown | male | | |
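The three table slides above can be mimicked with a plain Python dictionary of dictionaries. This is a toy model of the sparse row/column idea, not the HBase client API. The `put` helper is hypothetical, and placing the `friends` value in Adama's row is an assumption about the original slide.

```python
from collections import defaultdict

# Row key -> {column qualifier -> value}. A missing cell simply does not
# exist, which mirrors the "no penalty for empty cells" point.
table = defaultdict(dict)

def put(row_key, column, value):
    """Hypothetical helper: set one cell, creating the row if needed."""
    table[row_key][column] = value

# Initial rows
for column, value in [("is_cylon", "true"), ("hair_color", "blonde"),
                      ("gender", "female"), ("is_final_five", "no")]:
    put("Six", column, value)
for column, value in [("is_cylon", "false"), ("hair_color", "brown"),
                      ("gender", "male"), ("rank", "admiral")]:
    put("Adama", column, value)

# Adding a row: just write cells under a new key.
for column, value in [("is_cylon", "false"), ("hair_color", "brown"),
                      ("gender", "male")]:
    put("Baltar", column, value)

# Adding a column: just write a cell with a new qualifier.
# (Which row holds this value is an assumption.)
put("Adama", "friends", "Kara Thrace, Saul Tigh")

print(sorted(table["Adama"]))
```

No schema change is needed for either the new row or the new column; other rows are untouched.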
27. DNA Matching : How it Works
The Input
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC
Kara Thrace, aka Starbuck
• Ace viper pilot
• Has a special destiny
• Not to be trifled with
Admiral Adama
• Admiral of the Colonial Fleet
• Routinely saves humanity from destruction
28. DNA Matching : How it Works
Separate into words
Position : 0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
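A minimal sketch of the word-splitting step, using the toy sequences from the slide. The function name and the word length of 5 are illustrative choices taken from the example, not parameters of the real pipeline.

```python
def split_into_words(sequence, word_length=5):
    """Cut a sequence into fixed-length words; list index = word position."""
    return [sequence[i:i + word_length]
            for i in range(0, len(sequence), word_length)]

starbuck = split_into_words("ACTGACCTAGTTGAC")
adama = split_into_words("TTAAGCCTAGTTGAC")
print(starbuck)  # ['ACTGA', 'CCTAG', 'TTGAC']
print(adama)     # ['TTAAG', 'CCTAG', 'TTGAC']
```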
29. DNA Matching : How it Works
Build the hash table
Position : 0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
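The hash table on this slide can be sketched as a dictionary keyed by word and position. This is a toy reconstruction of the slide's example with a hypothetical function name, not the GERMLINE source.

```python
from collections import defaultdict

def build_hash_table(samples):
    """Map 'WORD_POSITION' to the list of people who have that word there."""
    table = defaultdict(list)
    for name, words in samples.items():
        for position, word in enumerate(words):
            table[f"{word}_{position}"].append(name)
    return dict(table)

samples = {
    "Starbuck": ["ACTGA", "CCTAG", "TTGAC"],
    "Adama": ["TTAAG", "CCTAG", "TTGAC"],
}
hash_table = build_hash_table(samples)
for key, names in hash_table.items():
    print(key, ":", ", ".join(names))
```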
30. DNA Matching : How it Works
Iterate through genome and find matches
Position : 0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Starbuck and Adama match from position 1 to position 2
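A sketch of the match-finding step for the toy example: collect the positions each pair of people shares, then merge consecutive positions into ranges. It covers only the exact-match case shown on the slide; the real algorithm is more involved, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_matches(hash_table):
    """Return {(a, b): [(start, end), ...]} of runs of shared positions."""
    shared = defaultdict(set)
    for key, names in hash_table.items():
        position = int(key.rsplit("_", 1)[1])
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)].add(position)

    matches = {}
    for pair, positions in shared.items():
        ranges = []
        for p in sorted(positions):
            if ranges and p == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], p)  # extend the current run
            else:
                ranges.append((p, p))            # start a new run
        matches[pair] = ranges
    return matches

hash_table = {
    "ACTGA_0": ["Starbuck"],
    "TTAAG_0": ["Adama"],
    "CCTAG_1": ["Starbuck", "Adama"],
    "TTGAC_2": ["Starbuck", "Adama"],
}
matches = find_matches(hash_table)
print(matches)  # {('Adama', 'Starbuck'): [(1, 2)]}
```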
32. IBD to Relationship Estimation
• This is basically a classification problem
• We use the total length of all shared segments to estimate the relationship between two genetic relatives
[Chart: ERSA; probability (y-axis, 0.00 to 0.05) versus total_IBD(cM) (x-axis ticks from 5 to 5000), with curves m1 to m11]
33. But Wait...What About Baltar?
Baltar : TTAAGCCTAGGGGCG
Gaius Baltar
• Handsome
• Genius
• Kinda evil
35. The GERMLINE Way
Step one: Rebuild the entire hash table from scratch,
including the new sample
Position : 0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
36. The GERMLINE Way
Step two: Find everybody's matches all over again,
including the new sample. (n x n comparisons)
Position : 0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
37. The GERMLINE Way
Step three: Now, throw away the evidence!
Position : 0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
You have done this before, and you will have to do it ALL OVER AGAIN.
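A rough way to quantify the waste is to count pairwise comparisons. This is a back-of-the-envelope model that counts unordered pairs of samples, not a measurement of GERMLINE itself. The pool and batch sizes are illustrative.

```python
def pairs(n):
    """Number of unordered pairs among n samples."""
    return n * (n - 1) // 2

def full_rebuild_comparisons(pool, batch):
    """Compare everyone against everyone after adding a batch."""
    return pairs(pool + batch)

def incremental_comparisons(pool, batch):
    """Compare only new samples: new vs. existing, plus new vs. new."""
    return pool * batch + pairs(batch)

pool, batch = 100_000, 1_000  # illustrative sizes only
full = full_rebuild_comparisons(pool, batch)
incremental = incremental_comparisons(pool, batch)
print(full, incremental, full - incremental)
```

The difference is exactly `pairs(pool)`: the comparisons among samples that were already matched in earlier runs.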
38. The Jermline Way
Step one: Update the hash table
Already stored in HBase:
KEY | Starbuck | Adama
2_ACTGA_0 | 1 |
2_TTAAG_0 | | 1
2_CCTAG_1 | 1 | 1
2_TTGAC_2 | 1 | 1
New sample to add:
Baltar : TTAAG CCTAG GGGCG
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
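A sketch of the incremental update, with the HBase table modeled as a nested Python dictionary rather than accessed through any real HBase API. The key and qualifier layout follows the slide; `add_sample` is a hypothetical name.

```python
def add_sample(hash_table, chromosome, user_id, words):
    """Write one cell per word: row [CHROMOSOME]_[WORD]_[POSITION],
    qualifier [USER ID], value 1. Existing rows are left untouched."""
    for position, word in enumerate(words):
        row_key = f"{chromosome}_{word}_{position}"
        hash_table.setdefault(row_key, {})[user_id] = 1

# State already stored (chromosome 2 in the slide's example).
hash_table = {
    "2_ACTGA_0": {"Starbuck": 1},
    "2_TTAAG_0": {"Adama": 1},
    "2_CCTAG_1": {"Starbuck": 1, "Adama": 1},
    "2_TTGAC_2": {"Starbuck": 1, "Adama": 1},
}

add_sample(hash_table, 2, "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
for row_key in sorted(hash_table):
    print(row_key, sorted(hash_table[row_key]))
```

Only three cells are written for the new sample; nothing is rebuilt.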
39. The Jermline Way
Step two: Find matches, update the results table
Already stored in HBase:
KEY | 2_Starbuck | 2_Adama
2_Starbuck | | { (1, 2), ...}
2_Adama | { (1, 2), ...} |
New matches to add:
Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
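A sketch of step two under the same toy model: look up only the new sample's words, merge consecutive shared positions into ranges, and write the result in both directions. Dictionaries stand in for the HBase tables, and `match_new_sample` is a hypothetical name.

```python
def match_new_sample(hash_table, results, chromosome, user_id, words):
    """Find matches for one new sample and add them to the results table."""
    shared = {}  # other user -> ascending list of shared positions
    for position, word in enumerate(words):
        row = hash_table.get(f"{chromosome}_{word}_{position}", {})
        for other in row:
            if other != user_id:
                shared.setdefault(other, []).append(position)

    new_key = f"{chromosome}_{user_id}"
    for other, positions in shared.items():
        ranges = []
        for p in positions:
            if ranges and p == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], p)  # extend the current run
            else:
                ranges.append((p, p))            # start a new run
        other_key = f"{chromosome}_{other}"
        results.setdefault(new_key, {})[other_key] = list(ranges)
        results.setdefault(other_key, {})[new_key] = list(ranges)

hash_table = {
    "2_ACTGA_0": {"Starbuck": 1},
    "2_TTAAG_0": {"Adama": 1, "Baltar": 1},
    "2_CCTAG_1": {"Starbuck": 1, "Adama": 1, "Baltar": 1},
    "2_TTGAC_2": {"Starbuck": 1, "Adama": 1},
    "2_GGGCG_2": {"Baltar": 1},
}
results = {
    "2_Starbuck": {"2_Adama": [(1, 2)]},
    "2_Adama": {"2_Starbuck": [(1, 2)]},
}

match_new_sample(hash_table, results, 2, "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
print(results["2_Baltar"])
```

The existing Starbuck/Adama match is never recomputed; only Baltar's rows are read.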
41. But wait ... what about Zarek, Roslin, Hera, and Helo?
42. Run them in parallel with Hadoop!
Photo by Benh Lieu Song
43. Parallelism with Hadoop
• Batches are usually about a thousand people
• Each mapper takes a single chromosome for a
single person
• MapReduce Jobs :
Job #1 : Match Words
o Updates the hash table
Job #2 : Match Segments
o Identifies areas where the samples match
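The "one mapper per person per chromosome" idea can be simulated in-process. This sketch uses the generic map/shuffle/reduce pattern and no Hadoop API. Whether the real Job #1 uses a reducer at all is not stated on the slide, so the reducer here is an assumption, and all names are hypothetical.

```python
from collections import defaultdict

def mapper(user_id, chromosome, words):
    """One call per (person, chromosome): emit (row key, user id) pairs."""
    for position, word in enumerate(words):
        yield f"{chromosome}_{word}_{position}", user_id

def reducer(row_key, user_ids):
    """Collect everyone who has this word at this position."""
    return row_key, sorted(user_ids)

def run_job(inputs):
    grouped = defaultdict(list)
    for user_id, chromosome, words in inputs:
        for key, value in mapper(user_id, chromosome, words):
            grouped[key].append(value)  # stands in for the shuffle step
    return dict(reducer(key, values) for key, values in grouped.items())

batch = [
    ("Starbuck", 2, ["ACTGA", "CCTAG", "TTGAC"]),
    ("Adama", 2, ["TTAAG", "CCTAG", "TTGAC"]),
    ("Baltar", 2, ["TTAAG", "CCTAG", "GGGCG"]),
]
word_table = run_job(batch)
print(word_table["2_CCTAG_1"])
```

Because each mapper call depends only on its own input, the calls can run on different machines, which is what lets a batch of about a thousand people be processed in parallel.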
44. How does Jermline perform?
A 1700% performance improvement over GERMLINE!
(Along with more accurate results)
45. Run times for Matching (in hours)
[Chart: matching run times in hours (y-axis 0 to 25) versus samples (x-axis 2,500 to 117,500)]
46. Run times for Matching (in hours)
[Chart: GERMLINE run times, Jermline run times, and projected GERMLINE run times, in hours (y-axis 0 to 180), versus samples]
47. Incremental Changes Over Time
• Support the business, move incrementally and adjust
• After H2, pipeline speed stays flat
(Plot courtesy of Bill)
48. Dramatically Increased our Capacity
Bottom line: Without Hadoop and HBase, this would have been expensive and difficult.
49. And now for everybody's favorite part ....
Lessons Learned
51. Lessons Learned : What went right?
• This project would not have been possible without TDD
• Two sets of test data : generated and public domain
• 89% coverage
• Corrected bugs found in the reference implementation
• Has never failed in production
53. Lessons Learned : What would we do differently?
• Front-load some performance tests
– HBase and Hadoop can have odd performance profiles
– HBase in particular has some strange behavior if you're not familiar with its inner workings
• Allow a lot of time for live tests, dry runs, and deployment
– These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"
58. Mapping Potential Birth Locations for Ancestors
Birth locations from 1750-1900 of individuals with large amounts of genetic ancestry from Senegal
[Maps: 1750-1850 and 1800-1900. Legend: over-represented birth location in individuals with large amounts of Senegalese ancestry; birth location common amongst individuals with W. African ancestry]
59. How will the engineering team enable these advances?
60. Engineering Improvements
• Implement algorithmic improvements to make our results
more accurate
• Recalculate data as needed to support new scientific
discoveries
• Utilize cloud computing for burst capacity
• Create asynchronous processes to continuously refine our
data
• Whatever science throws at us, we'll be there to turn
their discoveries into robust, scalable solutions
61. End of the Journey (for now)
Questions?
Tech Roots Blog: http://blogs.ancestry.com/techroots