Scaling AncestryDNA
Using Hadoop and HBase

November 11, 2013
Jeremy Pollack (Engineer) and Bill Yetman (Manager)

1
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/scaling-ancestry-dna

Presented at QCon San Francisco
www.qconsf.com
What Does This Talk Cover?

What does Ancestry do?
How does the science work?

How did our journey with Hadoop start?
DNA matching with Hadoop and HBase

Lessons Learned
What’s next?

2
Ancestry.com Mission

3
Discoveries are the Key
We are the world's largest online family history resource

• Over 30,000 historical content collections
• 12 billion records and images
• Records dating back to 16th century
• 10 petabytes

4
Discoveries in Detail

The “eureka” moment drives our business

5
Discoveries with DNA

Spit in a tube, pay $99, learn your past
Autosomal DNA tests
Over 200,000 DNA samples
700,000 SNPs for each sample
10,000,000+ cousin matches
[Chart: Genotyped Samples (y-axis 0 to 150,000)]

DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism)
(http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)

6
Network Effect – Cousin Matches
[Chart: Cousin Matches (y-axis, 0 to 3,500,000) versus Database Size (x-axis, 2,000 to 115,756)]

7
Where Did We Start?
The process before Hadoop

8
What’s the Story?
Cast of Characters (Scientists and Software Engineers)
Scientists
Think they can code:
• Linux
• MySQL

• PERL and/or Python

Software Engineers
Think they are Scientists:
• Biology in HS and College

• Math/Statistics
• Read science papers

Pressures of a new business
– Release a product, learn, and then scale

Team: a Sr. Manager, 3 developers, and a 2-member Science Team
9
What Did “Get Something Running” Look Like?

Ethnicity Step
and Matching (Germline)
runs here

“Beefy Box”

Specifics:
1) Ran multiple threads for the two steps
2) Both steps were run in parallel
3) As the DNA Pool grew both steps required more memory

Single Beefy Box – Only option is to scale Vertically
10
Measure Everything Principle
• Record start time, end time, duration in seconds, and sample count for every step in the pipeline, plus the full end-to-end processing time

• Put the data in pivot tables and graph each step

• Normalize the data (sample size was changing)

• Use the data collected to predict future performance
11
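The measurement principle above could be sketched roughly as follows. This is a minimal illustration, not the team's actual tooling: the names StepMetric, timed_step, and the clock parameter are hypothetical, and the slides only state which values were recorded (start, end, duration, sample count) and that they were normalized.

```python
import time
from dataclasses import dataclass


@dataclass
class StepMetric:
    """One timing record for one pipeline step (field names are illustrative)."""
    step: str
    start: float
    end: float
    sample_count: int

    @property
    def duration_seconds(self) -> float:
        return self.end - self.start

    @property
    def seconds_per_sample(self) -> float:
        # Normalize by sample count so runs with different batch sizes compare fairly.
        return self.duration_seconds / self.sample_count


def timed_step(name, func, samples, clock=time.time):
    """Run one pipeline step and return (result, StepMetric)."""
    start = clock()
    result = func(samples)
    end = clock()
    return result, StepMetric(name, start, end, len(samples))
```

Collecting one such record per step per run gives the rows that would feed the pivot tables and graphs; the injectable clock is only there to make the sketch easy to test.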
Challenges and Pain Points
Performance degrades as the DNA pool grows
• Static (by batch size)
• Linear (by DNA pool size)
• Quadratic (Matching related steps) – Time bomb

(Plot courtesy of Keith)
12
New Matching Algorithm
Hadoop and HBase

13
What is GERMLINE?
• GERMLINE is an algorithm that finds hidden
relationships within a pool of DNA
• GERMLINE also refers to the reference
implementation of that algorithm written in C++
• You can find it here :

http://www1.cs.columbia.edu/~gusev/germline/

14
So What’s the Problem?
• GERMLINE (the implementation) was not meant to be
used in an industrial setting
– Stateless
– Single threaded
– Prone to swapping (heavy memory usage)

• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would
slow to a crawl

• Put simply: GERMLINE couldn't scale

15
GERMLINE Run Times (in hours)

[Chart: y-axis Hours (0 to 25), x-axis Samples (2,500 to 60,000)]

16
Projected GERMLINE Run Times (in hours)

[Chart: y-axis Hours (0 to 700), x-axis Samples (2,500 to 122,500); series: GERMLINE run times, Projected GERMLINE run times]

17
The Mission : Create a Scalable Matching Engine

... and thus Jermline was born
(aka "Jermline with a J")

18
What is Hadoop?

• Hadoop is an open-source platform for processing large
amounts of data in a scalable, fault-tolerant, affordable
fashion, using commodity hardware
• Hadoop specifies a distributed file system called HDFS
• Hadoop supports a processing methodology known as
MapReduce
• Many tools are built on top of Hadoop, such as HBase,
Hive, and Flume
19
What is MapReduce?

20
What is HBase?

• HBase is an open-source NoSQL data store that runs on top of
HDFS
• HBase is columnar; you can think of it as a weird amalgam of a
hashtable and a spreadsheet

• HBase supports unlimited rows and columns
• HBase stores data sparsely; there is no penalty for empty cells
• HBase is gaining in popularity: Salesforce, Facebook, and Twitter
have all invested heavily in the technology, as well as many others
21
Battlestar Galactica Characters, in an HBase Table

KEY     is_cylon  hair_color  gender  is_final_five  rank
Six     true      blonde      female  no
Adama   false     brown       male                   admiral

22
Adding a Row to an HBase Table

KEY     is_cylon  hair_color  gender  is_final_five  rank
Six     true      blonde      female  no
Adama   false     brown       male                   admiral
Baltar  false     brown       male

23
Adding a Column to an HBase Table

KEY     is_cylon  hair_color  gender  is_final_five  rank     friends
Six     true      blonde      female  no
Adama   false     brown       male                   admiral  Kara Thrace, Saul Tigh
Baltar  false     brown       male

24
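One way to picture the "hashtable plus spreadsheet" idea from the HBase slides is a dictionary of dictionaries, where each row stores only the cells it actually has. The sketch below is a plain in-memory stand-in and does not use any HBase client API; the put helper is a hypothetical name chosen for illustration.

```python
from collections import defaultdict

# Each row key maps to only the columns that row actually has,
# so an empty cell costs nothing: it is simply absent.
table = defaultdict(dict)


def put(row_key, column, value):
    """Store a single cell; new rows and new columns need no schema change."""
    table[row_key][column] = value


# Rows from the slides.
for column, value in [("is_cylon", "true"), ("hair_color", "blonde"),
                      ("gender", "female"), ("is_final_five", "no")]:
    put("Six", column, value)

for column, value in [("is_cylon", "false"), ("hair_color", "brown"),
                      ("gender", "male"), ("rank", "admiral")]:
    put("Adama", column, value)

# Adding a row is just more puts under a new key.
for column, value in [("is_cylon", "false"), ("hair_color", "brown"),
                      ("gender", "male")]:
    put("Baltar", column, value)
```

Here only Adama carries a rank cell; Six and Baltar have no placeholder for it at all, which is the sparse-storage property the slides describe.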
DNA Matching : How it Works

The Input
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC

Kara Thrace, aka Starbuck
• Ace viper pilot
• Has a special destiny
• Not to be trifled with

Admiral Adama
• Admiral of the Colonial Fleet
• Routinely saves humanity from destruction

25
DNA Matching : How it Works

Separate into words

Position : 0     1     2
Starbuck : ACTGA CCTAG TTGAC
Adama    : TTAAG CCTAG TTGAC

26
DNA Matching : How it Works

Build the hash table
Position : 0     1     2
Starbuck : ACTGA CCTAG TTGAC
Adama    : TTAAG CCTAG TTGAC

ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama

27
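The word-splitting and hash-table steps above can be sketched in a few lines. This is a toy version under stated assumptions: the 5-letter word size comes from the slide example only, and split_into_words and build_hash_table are hypothetical helper names, not functions from the GERMLINE code base.

```python
from collections import defaultdict

WORD_SIZE = 5  # the slides use 5-letter words; production word sizes are not stated


def split_into_words(sequence, word_size=WORD_SIZE):
    """Cut a sequence into consecutive fixed-size words."""
    return [sequence[i:i + word_size] for i in range(0, len(sequence), word_size)]


def build_hash_table(samples):
    """Map 'WORD_POSITION' keys to the names of samples with that word there."""
    table = defaultdict(list)
    for name, sequence in samples.items():
        for position, word in enumerate(split_into_words(sequence)):
            table[f"{word}_{position}"].append(name)
    return dict(table)


samples = {"Starbuck": "ACTGACCTAGTTGAC", "Adama": "TTAAGCCTAGTTGAC"}
hash_table = build_hash_table(samples)
```

Including the position in the key is what keeps identical words at different places on the genome from being treated as a match.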
DNA Matching : How it Works

Iterate through genome and find matches
Position : 0     1     2
Starbuck : ACTGA CCTAG TTGAC
Adama    : TTAAG CCTAG TTGAC

ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Starbuck and Adama match from position 1 to position 2

28
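The matching pass over the hash table might look like the sketch below. It is a simplification based only on what the slides show: two samples match wherever they share a word at the same position, and adjacent shared positions are merged into one range. The find_matches name is hypothetical, and any further refinement performed by the real algorithm is left out.

```python
from collections import defaultdict
from itertools import combinations

hash_table = {
    "ACTGA_0": ["Starbuck"],
    "TTAAG_0": ["Adama"],
    "CCTAG_1": ["Starbuck", "Adama"],
    "TTGAC_2": ["Starbuck", "Adama"],
}


def find_matches(hash_table):
    """Return {(name_a, name_b): [(start, end), ...]} of contiguous shared positions."""
    shared = defaultdict(set)
    for key, names in hash_table.items():
        position = int(key.rsplit("_", 1)[1])
        # Every pair of samples listed under one key shares that word.
        for pair in combinations(sorted(names), 2):
            shared[pair].add(position)

    matches = {}
    for pair, positions in shared.items():
        ranges = []
        for pos in sorted(positions):
            if ranges and pos == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], pos)  # extend the current run
            else:
                ranges.append((pos, pos))  # start a new run
        matches[pair] = ranges
    return matches


matches = find_matches(hash_table)
```

For the two-sample example this yields a single range from position 1 to position 2 for the Adama and Starbuck pair.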
Does that mean they're related?

...maybe
29
IBD to Relationship Estimation

• This is basically a classification problem

• We use the total length of all shared segments to estimate the relationship between two genetic relatives

[Chart: ERSA; y-axis probability (0.00 to 0.05), x-axis total_IBD (cM, 5 to 5000); series m1 to m11]

30
But Wait...What About Baltar?
Baltar : TTAAGCCTAGGGGCG

Gaius Baltar
• Handsome
• Genius
• Kinda evil
31
Adding a new sample, the GERMLINE way

32
The GERMLINE Way
Step one: Rebuild the entire hash table from scratch,
including the new sample
Position : 0     1     2
Starbuck : ACTGA CCTAG TTGAC
Adama    : TTAAG CCTAG TTGAC
Baltar   : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar

33
The GERMLINE Way
Step two: Find everybody's matches all over again,
including the new sample. (n x n comparisons)
Position : 0     1     2
Starbuck : ACTGA CCTAG TTGAC
Adama    : TTAAG CCTAG TTGAC
Baltar   : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
34
The GERMLINE Way
Step three: Now, throw away the evidence!
Position : 0     1     2
Starbuck : ACTGA CCTAG TTGAC
Adama    : TTAAG CCTAG TTGAC
Baltar   : TTAAG CCTAG GGGCG

ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1

You have done this before, and you will have to do
it ALL OVER AGAIN.
35
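A rough count of sample pairs shows why redoing everything hurts. The sketch below uses pair counts as a proxy for matching work, which is an assumption made for illustration; the function names are hypothetical, and the pool size of 100,000 is an example figure, not a measurement from the talk.

```python
def full_rerun_pairs(pool_size):
    """Pairs considered when every sample is re-matched against every other."""
    return pool_size * (pool_size - 1) // 2


def incremental_pairs(existing, batch):
    """Pairs considered when only a new batch is matched against the pool."""
    # Each new sample against each existing one, plus new samples against each other.
    return batch * existing + batch * (batch - 1) // 2


existing, batch = 100_000, 1_000
rerun = full_rerun_pairs(existing + batch)
incremental = incremental_pairs(existing, batch)
print(rerun, incremental)
```

With these example figures the full rerun covers about 5.1 billion pairs while the incremental approach covers about 100 million, and the gap widens as the pool grows because the rerun cost is quadratic in pool size.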
The Jermline Way
Step one: Update the hash table

Already stored in HBase:

KEY         Starbuck  Adama
2_ACTGA_0   1
2_TTAAG_0             1
2_CCTAG_1   1         1
2_TTGAC_2   1         1

New sample to add:

Baltar : TTAAG CCTAG GGGCG

Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that
position on that chromosome
36
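The incremental update in step one can be sketched with a nested dictionary standing in for the HBase table. No HBase client calls are used here; row_key and add_sample are hypothetical names, and the integer 1 stands in for the single byte described on the slide.

```python
def row_key(chromosome, word, position):
    """Build a key in the [CHROMOSOME]_[WORD]_[POSITION] layout from the slide."""
    return f"{chromosome}_{word}_{position}"


def add_sample(hash_table, chromosome, user_id, words):
    """Add one user's words for one chromosome; existing cells are left untouched."""
    for position, word in enumerate(words):
        row = hash_table.setdefault(row_key(chromosome, word, position), {})
        row[user_id] = 1  # qualifier is the user ID, value marks presence


# State already stored before the new sample arrives.
hash_table = {
    "2_ACTGA_0": {"Starbuck": 1},
    "2_TTAAG_0": {"Adama": 1},
    "2_CCTAG_1": {"Starbuck": 1, "Adama": 1},
    "2_TTGAC_2": {"Starbuck": 1, "Adama": 1},
}

add_sample(hash_table, 2, "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
```

Only the rows Baltar touches are written, and one new row appears for the word nobody else has; nothing is rebuilt from scratch.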
The Jermline Way
Step two: Find matches, update the results table

Already stored in HBase:

KEY          2_Starbuck       2_Adama
2_Starbuck                    { (1, 2), ...}
2_Adama      { (1, 2), ...}

New matches to add:

Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1

Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
37
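Step two can be sketched the same way: only the new sample's words are looked up, and only its matches are written to the results table. Dictionaries again stand in for HBase, match_new_sample is a hypothetical name, and a single-position match is stored as a range such as (1, 1) for uniformity, which is a choice made for this sketch rather than something the slides specify.

```python
def match_new_sample(hash_table, results, chromosome, user_id, words):
    """Find the new user's matches and record them in both directions."""
    shared = {}
    for position, word in enumerate(words):
        row = hash_table.get(f"{chromosome}_{word}_{position}", {})
        for other in row:
            if other != user_id:
                shared.setdefault(other, []).append(position)

    new_key = f"{chromosome}_{user_id}"
    for other, positions in shared.items():
        ranges = []
        for pos in positions:  # positions were collected in ascending order
            if ranges and pos == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], pos)  # extend the current run
            else:
                ranges.append((pos, pos))  # start a new run
        other_key = f"{chromosome}_{other}"
        results.setdefault(new_key, {})[other_key] = ranges
        results.setdefault(other_key, {})[new_key] = ranges


# Hash table after the new sample's words were added in step one.
hash_table = {
    "2_ACTGA_0": {"Starbuck": 1},
    "2_TTAAG_0": {"Adama": 1, "Baltar": 1},
    "2_CCTAG_1": {"Starbuck": 1, "Adama": 1, "Baltar": 1},
    "2_TTGAC_2": {"Starbuck": 1, "Adama": 1},
    "2_GGGCG_2": {"Baltar": 1},
}
# Results already stored from earlier runs.
results = {
    "2_Starbuck": {"2_Adama": [(1, 2)]},
    "2_Adama": {"2_Starbuck": [(1, 2)]},
}

match_new_sample(hash_table, results, 2, "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
```

The existing Starbuck and Adama result is never recomputed; the work done is proportional to the new sample's words and the users who share them.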
The Jermline Way

Hash Table

KEY         Starbuck  Adama  Baltar
2_ACTGA_0   1
2_TTAAG_0             1      1
2_CCTAG_1   1         1      1
2_TTGAC_2   1         1
2_GGGCG_2                    1

Results Table

KEY          2_Starbuck       2_Adama          2_Baltar
2_Starbuck                    { (1, 2), ...}   { (1), ...}
2_Adama      { (1, 2), ...}                    { (0,1), ...}
2_Baltar     { (1), ...}      { (0,1), ...}

38
But wait ... what about Zarek, Roslin, Hera,
and Helo?

39
Run them in parallel with Hadoop!

Photo by Benh Lieu Song

40
Parallelism with Hadoop
• Batches are usually about a thousand people
• Each mapper takes a single chromosome for a
single person
• MapReduce Jobs :
Job #1 : Match Words
o Updates the hash table
Job #2 : Match Segments
o Identifies areas where the samples match

41
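The shape of Job #1 can be imitated in plain Python to show how the work splits per person and per chromosome. This is not Hadoop code and uses no Hadoop API: map_match_words and reduce_match_words are hypothetical names, the 5-letter word size comes from the toy slides, and whether the real job has a reduce phase at all is not stated in the talk.

```python
from collections import defaultdict

WORD_SIZE = 5  # matches the toy examples in the slides


def map_match_words(record):
    """Mapper stand-in: one (user, chromosome, sequence) record in, (row key, user) pairs out."""
    user_id, chromosome, sequence = record
    for position, start in enumerate(range(0, len(sequence), WORD_SIZE)):
        word = sequence[start:start + WORD_SIZE]
        yield f"{chromosome}_{word}_{position}", user_id


def reduce_match_words(pairs):
    """Reducer stand-in: group users by row key, as a hash table update would."""
    grouped = defaultdict(set)
    for key, user_id in pairs:
        grouped[key].add(user_id)
    return dict(grouped)


batch = [
    ("Starbuck", 2, "ACTGACCTAGTTGAC"),
    ("Adama", 2, "TTAAGCCTAGTTGAC"),
    ("Baltar", 2, "TTAAGCCTAGGGGCG"),
]
# Each record is independent, so the mapper calls could run on different machines.
pairs = [pair for record in batch for pair in map_match_words(record)]
hash_table = reduce_match_words(pairs)
```

Because a mapper needs nothing beyond its own record, a batch of about a thousand people fans out into many independent tasks, one per person per chromosome.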
How does Jermline perform?

A 1700% performance improvement
over GERMLINE!
(Along with more accurate results)

42
Run times for Matching (in hours)

[Chart: y-axis Hours (0 to 25), x-axis Samples (2,500 to 117,500)]

43
Run times for Matching (in hours)

[Chart: y-axis Hours (0 to 180), x-axis Samples; series: GERMLINE run times, Jermline run times, Projected GERMLINE run times]

44
Incremental Changes Over Time
• Support the business, move incrementally and adjust

• After H2, pipeline speed stays flat

(Plot courtesy of Bill)
45
Dramatically Increased our Capacity

Bottom line: Without Hadoop and HBase, this would have
been expensive and difficult.

46
And now for everybody's favorite part ....

Lessons Learned

47
Lessons Learned

What went right?

48
Lessons Learned : What went right?

• This project would not have been possible without TDD
• Two sets of test data : generated and public domain
• 89% coverage

• Corrected bugs found in the reference implementation
• Has never failed in production

49
Lessons Learned

What would we do differently?

50
Lessons Learned : What would we do differently?
• Front-load some performance tests
– HBase and Hadoop can have odd performance profiles
– HBase in particular has some strange behavior if you're not familiar with its inner workings

• Allow a lot of time for live tests, dry runs, and deployment
– These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"

51
What’s next for the Science Team?

52
Our new lab in Albuquerque, NM

53
Okay, for real this time. What’s next for the Science
Team?

54
More Accurate Results
Potential matches

Relevant matches

55
Mapping Potential Birth Locations for Ancestors
Birth locations from 1750-1900 of individuals with large amounts of genetic
ancestry from Senegal
1750-1850
1800-1900

Over-represented birth location in individuals with large amounts of Senegalese ancestry
Birth location common amongst individuals with W. African ancestry
56
How will the engineering team enable these
advances?

57
Engineering Improvements
• Implement algorithmic improvements to make our results
more accurate
• Recalculate data as needed to support new scientific
discoveries

• Utilize cloud computing for burst capacity
• Create asynchronous processes to continuously refine our
data
• Whatever science throws at us, we'll be there to turn
their discoveries into robust, scalable solutions

58
End of the Journey (for now)

Questions?
Tech Roots Blog: http://blogs.ancestry.com/techroots

59
Appendix

60
Appendix A. Who are the presenters?

Bill Yetman
Jeremy Pollack

61
Scaling AncestryDNA using Hadoop and HBase

  • 1. Scaling AncestryDNA Using Hadoop and HBase November 11, 2013 Jeremy Pollack (Engineer) and Bill Yetman (Manager) 1
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /scaling-ancestry-dna InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. What Does This Talk Cover? What does Ancestry do? How does the science work? How did our journey with Hadoop start? DNA matching with Hadoop and Hbase Lessons Learned What’s next? 2
  • 6. Discoveries are the Key We are the world's largest online family history resource • Over 30,000 historical content collections • 12 billion records and images • Records dating back to 16th century • 10 petabytes 4
  • 7. Discoveries in Detail The “eureka” moment drives our business 5
  • 8. Discoveries with DNA Spit in a tube, pay $99, learn your past Autosomal DNA tests Over 200,000+ DNA samples 700,000 SNPs for each sample 10,000,000+ cousin matches 150,000 Genotyped Samples 100,000 50,000 - DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism) (http://en.wikipedia.org/wiki/Singlenucleiotide_polymorphism) 6
  • 9. Network Effect – Cousin Matches 3,500,000 Cousin Matches 3,000,000 2,500,000 2,000,000 1,500,000 1,000,000 500,000 0 2,000 10,053 21,205 40,201 60,240 Database Size 7 80,405 115,756
  • 10. Where Did We Start? The process before Hadoop 8
  • 11. What’s the Story? Cast of Characters (Scientists and Software Engineers) Scientists Think they can code: • Linux • MySQL • PERL and/or Python Software Engineers Think they are Scientists: • Biology in HS and College • Math/Statistics • Read science papers Pressures of a new business – Release a product, learn, and then scale Sr. Manager and 3 developers and 2 member Science Team 9
  • 12. What Did “Get Something Running” Look Like? Ethnicity Step and Matching (Germline) runs here “Beefy Box” Specifics: 1) Ran multiple threads for the two steps 2) Both steps were run in parallel 3) As the DNA Pool grew both steps required more memory Single Beefy Box – Only option is to scale Vertically 10
  • 13. Measure Everything Principle • Start time, end time, duration in seconds, and sample count for every step in the pipeline. Also the full end-toend processing time • Put the data in pivot tables and graphed each step • Normalize the data (sample size was changing) • Use the data collected to predict future performance 11
  • 14. Challenges and Pain Points Performance degrades when DNA pool grows • Static (by batch size) • Linear (by DNA pool size) • Quadratic (Matching related steps) – Time bomb (Courtesy from Keith’s Plotting) 12
  • 16. What is GERMLINE? • GERMLINE is an algorithm that finds hidden relationships within a pool of DNA • GERMLINE also refers to the reference implementation of that algorithm written in C++ • You can find it here : http://www1.cs.columbia.edu/~gusev/germline/ 14
  • 17. So What’s the Problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting    Stateless Single threaded Prone to swapping (heavy memory usage) • GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply: GERMLINE couldn't scale 15
  • 18. Hours GERMLINE Run Times (in hours) 25 20 15 10 5 0 60,000 57,500 55,000 52,500 50,000 47,500 45,000 42,500 40,000 37,500 35,000 32,500 30,000 27,500 25,000 22,500 20,000 17,500 15,000 12,500 10,000 7,500 16 5,000 2,500 Samples
  • 19. Projected GERMLINE Run Times (in hours) 700 600 500 Hours 400 300 200 GERMLINE run times 100 Projected GERMLINE run times 0 122,500 112,500 102,500 92,500 82,500 Samples 72,500 62,500 52,500 42,500 32,500 22,500 12,500 2,500 17
  • 20. The Mission : Create a Scalable Matching Engine ... and thus was born (aka "Jermline with a J") 18
  • 21. What is Hadoop? • Hadoop is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion, using commodity hardware • Hadoop specifies a distributed file system called HDFS • Hadoop supports a processing methodology known as MapReduce • Many tools are built on top of Hadoop, such as HBase, Hive, and Flume 19
  • 23. What is HBase? • HBase is an open-source NoSQL data store that runs on top of HDFS • HBase is columnar; you can think of it as a weird amalgam of a hashtable and a spreadsheet • HBase supports unlimited rows and columns • HBase stores data sparsely; there is no penalty for empty cells • HBase is gaining in popularity: Salesforce, Facebook, and Twitter have all invested heavily in the technology, as well as many others 21
  • 24. Battlestar Galactica Characters, in an HBase Table KEY is_cylon hair_color gender is_final_five no Six blonde female Adama 22 true false brown male rank admiral
  • 25. Adding a Row to an HBase Table KEY is_cylon hair_color gender is_final_five no Six blonde female Adama false brown male Baltar 23 true false brown male rank admiral
  • 26. Adding a Column to an HBase Table KEY is_cylon hair_color gender is_final_five Six true blonde female no Adama false brown male Baltar false brown male 24 rank friends admiral Kara Thrace, Saul Tigh
  • 27. DNA Matching : How it Works The Input Starbuck : ACTGACCTAGTTGAC Adama : TTAAGCCTAGTTGAC Kara Thrace, aka Starbuck • Ace viper pilot • Has a special destiny • Not to be trifled with 25 Admiral Adama • Admiral of the Colonial Fleet • Routinely saves humanity from destruction
  • 28. DNA Matching : How it Works Separate into words 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC 26
  • 29. DNA Matching : How it Works Build the hash table 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama 27
  • 30. DNA Matching : How it Works Iterate through genome and find matches 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama Starbuck and Adama match from position 1 to position 2 28
  • 31. Does that mean they're related? ...maybe 29
  • 32. IBD to Relationship Estimation 0.02 0.03 0.04 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 0.00 0.01 • This is basically a classification problem probability • We use the total length of all shared segments to estimate the relationship between to genetic relatives 0.05 ERSA 5 10 20 50 100 200 total_IBD(cM) 30 500 1000 5000
  • 33. But Wait...What About Baltar? • Baltar : TTAAGCCTAGGGGCG • Gaius Baltar: handsome; genius; kinda evil
  • 34. Adding a new sample, the GERMLINE way
  • 35. The GERMLINE Way • Step one: Rebuild the entire hash table from scratch, including the new sample • Starbuck : ACTGA CCTAG TTGAC • Adama : TTAAG CCTAG TTGAC • Baltar : TTAAG CCTAG GGGCG • ACTGA_0 : Starbuck • TTAAG_0 : Adama, Baltar • CCTAG_1 : Starbuck, Adama, Baltar • TTGAC_2 : Starbuck, Adama • GGGCG_2 : Baltar
  • 36. The GERMLINE Way • Step two: Find everybody's matches all over again, including the new sample (n x n comparisons) • Starbuck and Adama match from position 1 to position 2 • Adama and Baltar match from position 0 to position 1 • Starbuck and Baltar match at position 1
  • 37. The GERMLINE Way • Step three: Now, throw away the evidence! • The hash table and all of the matches found in steps one and two are discarded • You have done this before, and you will have to do it ALL OVER AGAIN.
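A back-of-the-envelope count shows why redoing everything for each new batch hurts. The sketch below uses the number of sample pairs as a crude proxy for work, which is a simplification (hashing means not every pair is literally compared); the function names and the example sizes are illustrative assumptions.

```python
def pairs_full_rerun(total_samples):
    """Pairs considered when every sample is matched against every other one again."""
    return total_samples * (total_samples - 1) // 2

def pairs_incremental(existing_samples, batch_size):
    """Pairs that involve at least one new sample: new-vs-old plus new-vs-new."""
    return batch_size * existing_samples + batch_size * (batch_size - 1) // 2

# Illustrative sizes only: a database of 200,000 samples plus a batch of 1,000.
existing, batch = 200_000, 1_000
full = pairs_full_rerun(existing + batch)
incremental = pairs_incremental(existing, batch)
print(f"full rerun:  {full:,} pairs")
print(f"incremental: {incremental:,} pairs")
print(f"ratio:       {full / incremental:.0f}x")
```

The full rerun grows quadratically with the size of the database, while the incremental count grows only linearly in it for a fixed batch size.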
  • 38. The Jermline Way • Step one: Update the hash table • Already stored in HBase (row : cells): 2_ACTGA_0 : Starbuck = 1 • 2_TTAAG_0 : Adama = 1 • 2_CCTAG_1 : Starbuck = 1, Adama = 1 • 2_TTGAC_2 : Starbuck = 1, Adama = 1 • New sample to add: Baltar : TTAAG CCTAG GGGCG • Key : [CHROMOSOME]_[WORD]_[POSITION] • Qualifier : [USER ID] • Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
  • 39. The Jermline Way • Step two: Find matches, update the results table • Already stored in HBase: row 2_Starbuck, column 2_Adama = { (1, 2), ...}; row 2_Adama, column 2_Starbuck = { (1, 2), ...} • New matches to add: Baltar and Adama match from position 0 to position 1; Baltar and Starbuck match at position 1 • Key : [CHROMOSOME]_[USER ID] • Qualifier : [CHROMOSOME]_[USER ID] • Cell value : A list of ranges where the two users match on a chromosome
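  • 40. The Jermline Way • Hash Table (row : cells): 2_ACTGA_0 : Starbuck = 1 • 2_TTAAG_0 : Adama = 1, Baltar = 1 • 2_CCTAG_1 : Starbuck = 1, Adama = 1, Baltar = 1 • 2_TTGAC_2 : Starbuck = 1, Adama = 1 • 2_GGGCG_2 : Baltar = 1 • Results Table (row : cells): 2_Starbuck : 2_Adama = { (1, 2), ...}, 2_Baltar = { (1), ...} • 2_Adama : 2_Starbuck = { (1, 2), ...}, 2_Baltar = { (0,1), ...} • 2_Baltar : 2_Starbuck = { (1), ...}, 2_Adama = { (0,1), ...}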
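The incremental update on slides 38 to 40 can be sketched with two in-memory dictionaries standing in for the two HBase tables. This is an illustration under assumptions, not the real implementation: `hash_table`, `results`, and `add_sample` are hypothetical names, the chromosome is fixed to "2" as on the slides, and a single-position match is written as `(1, 1)` where the slide writes `(1)`.

```python
from collections import defaultdict

WORD_LEN = 5
CHROMOSOME = "2"  # the toy example lives on one chromosome

# In-memory stand-ins for the two HBase tables (NOT the HBase API).
# hash_table: "[CHROMOSOME]_[WORD]_[POSITION]" -> {user id: 1}
# results:    "[CHROMOSOME]_[USER ID]" -> {"[CHROMOSOME]_[USER ID]": [(start, end), ...]}
hash_table = defaultdict(dict)
results = defaultdict(dict)

def add_sample(user, genome):
    """Add one new sample, touching only the rows that its own words hit."""
    words = [genome[i:i + WORD_LEN] for i in range(0, len(genome), WORD_LEN)]
    shared = defaultdict(list)  # other user -> positions shared with the new user
    for pos, word in enumerate(words):
        row = hash_table[f"{CHROMOSOME}_{word}_{pos}"]
        for other in row:        # everyone already stored with this word here
            shared[other].append(pos)
        row[user] = 1            # step one: update the hash table
    for other, positions in shared.items():
        runs = []
        for pos in positions:    # positions were collected in ascending order
            if runs and pos == runs[-1][1] + 1:
                runs[-1] = (runs[-1][0], pos)
            else:
                runs.append((pos, pos))
        # Step two: record the match in both directions.
        results[f"{CHROMOSOME}_{user}"][f"{CHROMOSOME}_{other}"] = list(runs)
        results[f"{CHROMOSOME}_{other}"][f"{CHROMOSOME}_{user}"] = list(runs)

add_sample("Starbuck", "ACTGACCTAGTTGAC")
add_sample("Adama", "TTAAGCCTAGTTGAC")
add_sample("Baltar", "TTAAGCCTAGGGGCG")  # the existing Starbuck/Adama match is untouched
```

Adding Baltar reads and writes only the three rows keyed by his own words; nothing that was already computed for Starbuck and Adama is recalculated or thrown away.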
  • 41. But wait ... what about Zarek, Roslin, Hera, and Helo?
  • 42. Run them in parallel with Hadoop! (Photo by Benh Lieu Song)
  • 43. Parallelism with Hadoop • Batches are usually about a thousand people • Each mapper takes a single chromosome for a single person • MapReduce Jobs : Job #1 : Match Words o Updates the hash table Job #2 : Match Segments o Identifies areas where the samples match
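The "one mapper per person per chromosome" layout can be imitated in plain Python to show the data flow of a word-matching job. This is a single-process simulation, not Hadoop code: `map_words`, `reduce_words`, and `run_job` are hypothetical names, and the sort-then-group step stands in for Hadoop's shuffle.

```python
from itertools import groupby

WORD_LEN = 5

def map_words(record):
    """Mapper: one (user, chromosome, sequence) record -> (row key, user) pairs."""
    user, chromosome, sequence = record
    for pos in range(len(sequence) // WORD_LEN):
        word = sequence[pos * WORD_LEN:(pos + 1) * WORD_LEN]
        yield f"{chromosome}_{word}_{pos}", user

def reduce_words(key, users):
    """Reducer: gather every user that shares one word at one position."""
    return key, sorted(users)

def run_job(records):
    """Simulate map -> shuffle (sort and group by key) -> reduce in one process."""
    emitted = sorted(pair for record in records for pair in map_words(record))
    return dict(
        reduce_words(key, [user for _key, user in group])
        for key, group in groupby(emitted, key=lambda pair: pair[0])
    )

batch = [
    ("Starbuck", "2", "ACTGACCTAGTTGAC"),
    ("Adama", "2", "TTAAGCCTAGTTGAC"),
    ("Baltar", "2", "TTAAGCCTAGGGGCG"),
]
rows = run_job(batch)
```

Because each record is mapped independently, the mappers for different people and chromosomes never need to talk to each other, which is what lets a batch of about a thousand people spread across a cluster.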
  • 44. How does Jermline perform? A 1700% performance improvement over GERMLINE! (Along with more accurate results)
  • 45. Run times for Matching (in hours) • [Chart: hours (0 to 25) versus number of samples (2,500 to 117,500)]
  • 46. Run times for Matching (in hours) • [Chart: hours (0 to 180) versus number of samples; series: GERMLINE run times, Jermline run times, Projected GERMLINE run times]
  • 47. Incremental Changes Over Time • Support the business, move incrementally and adjust • After H2, pipeline speed stays flat (plot courtesy of Bill)
  • 48. Dramatically Increased our Capacity • Bottom line: Without Hadoop and HBase, this would have been expensive and difficult.
  • 49. And now for everybody's favorite part .... Lessons Learned
  • 51. Lessons Learned : What went right? • This project would not have been possible without TDD • Two sets of test data : generated and public domain • 89% coverage • Corrected bugs found in the reference implementation • Has never failed in production
  • 52. Lessons Learned • What would we do differently?
  • 53. Lessons Learned : What would we do differently? • Front-load some performance tests o HBase and Hadoop can have odd performance profiles o HBase in particular has some strange behavior if you're not familiar with its inner workings • Allow a lot of time for live tests, dry runs, and deployment o These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"
  • 54. What’s next for the Science Team?
  • 55. Our new lab in Albuquerque, NM
  • 56. Okay, for real this time. What’s next for the Science Team?
  • 57. More Accurate Results • [Figure labels: Potential matches, Relevant matches]
  • 58. Mapping Potential Birth Locations for Ancestors • Birth locations from 1750-1900 of individuals with large amounts of genetic ancestry from Senegal • [Maps: 1750-1850 and 1800-1900. Legend: over-represented birth location in individuals with large amounts of Senegalese ancestry; birth location common amongst individuals with W. African ancestry]
  • 59. How will the engineering team enable these advances?
  • 60. Engineering Improvements • Implement algorithmic improvements to make our results more accurate • Recalculate data as needed to support new scientific discoveries • Utilize cloud computing for burst capacity • Create asynchronous processes to continuously refine our data • Whatever science throws at us, we'll be there to turn their discoveries into robust, scalable solutions
  • 61. End of the Journey (for now) • Questions? • Tech Roots Blog: http://blogs.ancestry.com/techroots
  • 63. Appendix A. Who are the presenters? • Bill Yetman • Jeremy Pollack