HBaseCon 2013: Apache HBase, Apache Hadoop, DNA and YOU!

Presented by: Jeremy Pollack, Ancestry.com

  • Everything from birth certificates, obituaries, immigration records, census records, voter registration, old phone books, everything.
  • Typically, the way it works is this: You search through our records to find one of your relatives. Once you've found enough records that you're satisfied you've found your relative, you attach them to your family tree. After that, Ancestry goes to work for you. Our search engine takes a look at your whole tree to find relatives that you may not know about yet, and presents these to you as hints (the shaky leaf). You can then examine these hints and see if they are, in fact, related to you. I, myself, found all my great-grandparents as well as a few aunts and uncles that way. It's pretty cool! And the beauty of it is that if you've found a relative who has researched their family tree pretty extensively, you get to piggyback on all that research by simply adding their family tree to yours. A fine example of crowdsourcing.
  • However, this has its limitations. What if you don't know your extended family that well? What if your ancestors came to the country as slaves? What if your ancestors came into the country illegally? DNA to the rescue.
  • Mention how we kept upgrading and tightening things up
  • For each person-to-person comparison, we add up the total length of their shared DNA and run that through a statistical model to see how closely they're related. (A rough sketch of the summation follows these notes.)
  • Remind people that GERMLINE was stateless
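
The comparison note above boils down to summing shared-segment lengths and handing the total to a relatedness model. Here is a minimal sketch of just the summation, in plain Java with made-up segment data; the statistical model itself is not described in the talk, so it is only referenced in a comment.

    import java.util.Arrays;

    public class SharedDnaTotal {
        public static void main(String[] args) {
            // Hypothetical shared segments between two users, as (start, end) base-pair coordinates
            long[][] sharedSegments = {
                { 1_000_000L, 4_500_000L },
                { 20_000_000L, 26_000_000L }
            };

            long totalShared = 0;
            for (long[] seg : sharedSegments) {
                totalShared += seg[1] - seg[0];   // add up the length of each shared segment
            }

            System.out.println("Segments: " + Arrays.deepToString(sharedSegments));
            System.out.println("Total shared DNA: " + totalShared + " base pairs");
            // The total would then be fed into a statistical model (not shown here)
            // that estimates how closely the two people are related.
        }
    }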

HBaseCon 2013: Apache HBase, Apache Hadoop, DNA and YOU! Presentation Transcript

  • DNA, HBase, Hadoop, and YOU! by Jeremy Pollack
  • What does Ancestry.com do? We are the world's largest online family history resource. • Over 30,000 historical content collections • 11 billion records and images • Records dating back to the 16th century • 4 petabytes of data
  • It’s the “eureka” moment of discovery that drives our business!
  • What does Ancestry DNA do? "Spit in a tube, pay $99, learn about your past" • Decodes your family origins (ethnicity) • Finds your long-lost relatives • We have identified over four million fourth cousins. • The average customer has close to 30 fourth cousin matches. • By examining these matches, we can connect your family tree to those of your distant relatives. • Ancestry DNA has 120K+ samples, one of the largest DNA databases in the world. • About 690GB of data (uncompressed), or about 6.2 MB per sample • [Image caption: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)]
  • What is GERMLINE? • GERMLINE is an algorithm that finds hidden relationships within a pool of DNA. • GERMLINE also refers to the reference implementation of that algorithm. • You can find it here : http://www1.cs.columbia.edu/~gusev/germline/
  • So what's the problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting. • Stateless • Single threaded • Prone to swapping • GERMLINE performs poorly on large data sets. • We were running up against its limitations. • Put simply : GERMLINE couldn't scale.
  • [Chart] GERMLINE Run Times (in hours): run time in hours vs. number of samples, from 2,500 up to 60,000 samples
  • [Chart] Projected GERMLINE Run Times (in hours): actual and projected GERMLINE run times vs. number of samples, extended out to 122,500 samples
  • The Mission : Create a Scalable Matching Engine ... and thus was born Jermline (aka "Jermline with a J")
  • DNA Matching : How it Works
  • Starbuck : ACTGACCTAGTTGAC Adama : TTAAGCCTAGTTGAC The Input Kara Thrace, aka Starbuck • Ace viper pilot • Has a special destiny • Not to be trifled with Admiral Adama • Admiral of the Colonial Fleet • Routinely saves humanity from destruction • Not so great with model ships
  • 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Separate into words
  • 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama Build the hash table
  • Iterate through genome and find matches Starbuck and Adama match from position 1 to position 2 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama
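
The three steps above (separate into words, build the hash table, scan for matches) can be condensed into a small sketch. This is illustrative plain Java rather than the GERMLINE or Jermline source; the five-letter words and the Starbuck/Adama strings come straight from the slides, while the class and variable names are made up.

    import java.util.*;

    public class WordMatchSketch {
        static final int WORD_LEN = 5;

        public static void main(String[] args) {
            Map<String, String> samples = new LinkedHashMap<>();
            samples.put("Starbuck", "ACTGACCTAGTTGAC");
            samples.put("Adama",    "TTAAGCCTAGTTGAC");

            // Build the hash table: "WORD_POSITION" -> set of users holding that word at that position
            Map<String, Set<String>> index = new HashMap<>();
            for (Map.Entry<String, String> e : samples.entrySet()) {
                String user = e.getKey();
                String genome = e.getValue();
                for (int pos = 0; pos * WORD_LEN < genome.length(); pos++) {
                    String word = genome.substring(pos * WORD_LEN, (pos + 1) * WORD_LEN);
                    index.computeIfAbsent(word + "_" + pos, k -> new HashSet<>()).add(user);
                }
            }

            // Iterate through the positions and report where Starbuck and Adama share a word
            String starbuck = samples.get("Starbuck");
            int positions = starbuck.length() / WORD_LEN;
            for (int pos = 0; pos < positions; pos++) {
                String word = starbuck.substring(pos * WORD_LEN, (pos + 1) * WORD_LEN);
                Set<String> holders = index.get(word + "_" + pos);
                if (holders != null && holders.contains("Adama")) {
                    System.out.println("Starbuck and Adama match at position " + pos);
                }
            }
        }
    }

Running it prints matches at positions 1 and 2, which is the "Starbuck and Adama match from position 1 to position 2" result shown on the slide.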
  • Does that mean they're related? ...maybe...
  • Baltar : TTAAGCCTAGGGGCG But wait... what about Baltar? Gaius Baltar • Handsome • Genius • Kinda evil
  • Adding a new sample, the GERMLINE way
  • 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar Step one : Rebuild the entire hash table from scratch, including the new sample The GERMLINE Way
  • Starbuck and Adama match from position 1 to position 2 Adama and Baltar match from position 0 to position 1 Starbuck and Baltar match at position 1 Step two : Find everybody's matches all over again, including the new sample. (n x n comparisons) 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar The GERMLINE Way
  • Starbuck and Adama match from position 1 to position 2 Adama and Baltar match from position 0 to position 1 Starbuck and Baltar match at position 1 Step three : Now, throw away the evidence! 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar You have done this before, and you will have to do it ALL OVER AGAIN. The GERMLINE Way
  • Not so good, right? Now let's take a look at the Jermline way.
  • The Jermline way. Step one : Update the hash table.

    Already stored in HBase:
                  Starbuck   Adama
    2_ACTGA_0     1
    2_TTAAG_0                1
    2_CCTAG_1     1          1
    2_TTGAC_2     1          1

    New sample to add: Baltar : TTAAG CCTAG GGGCG

    Key : [CHROMOSOME]_[WORD]_[POSITION]
    Qualifier : [USER ID]
    Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
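
A minimal sketch of that hash-table update using the HBase Java client of that era (0.94-style API). Only the row key, qualifier, and cell value come from the slide; the table name "dna_word_hash" and column family "w" are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WordHashUpdate {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "dna_word_hash");   // hypothetical table name

            int chromosome = 2;
            String userId = "Baltar";
            String[] words = { "TTAAG", "CCTAG", "GGGCG" };     // Baltar's chromosome 2, split into words

            for (int position = 0; position < words.length; position++) {
                // Row key: [CHROMOSOME]_[WORD]_[POSITION], e.g. "2_TTAAG_0"
                byte[] rowKey = Bytes.toBytes(chromosome + "_" + words[position] + "_" + position);
                Put put = new Put(rowKey);
                // Qualifier: the user id; cell value: a single byte set to 1
                put.add(Bytes.toBytes("w"), Bytes.toBytes(userId), new byte[] { 1 });
                table.put(put);
            }
            table.close();
        }
    }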
  • The Jermline way. Step two : Find matches.

    New matches to add:
    Baltar and Adama match from position 0 to position 1
    Baltar and Starbuck match at position 1

    Already stored in HBase:
                  2_Starbuck        2_Adama
    2_Starbuck                      { (1, 2), ... }
    2_Adama       { (1, 2), ... }

    Key : [CHROMOSOME]_[USER ID]
    Qualifier : [CHROMOSOME]_[USER ID]
    Cell value : A list of ranges where the two users match on a chromosome
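
And a similar sketch for recording one discovered match in the match table. Again, only the key, qualifier, and the idea of a range list come from the slide; the table name "dna_matches", the column family "m", and the string encoding of the range list are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MatchTableUpdate {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable matches = new HTable(conf, "dna_matches");   // hypothetical table name

            // Row key and qualifier are both [CHROMOSOME]_[USER ID]
            byte[] rowKey    = Bytes.toBytes("2_Baltar");
            byte[] qualifier = Bytes.toBytes("2_Adama");
            // Cell value: the list of ranges where the two users match on chromosome 2;
            // the real serialization format is not described in the talk
            byte[] ranges = Bytes.toBytes("[(0,1)]");

            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("m"), qualifier, ranges);
            matches.put(put);
            matches.close();
        }
    }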
  • But wait ... what about Zarek, Roslin, Hera, and Helo?
  • Photo by Benh Lieu Song Run them in parallel with Hadoop!
  • • Batches are usually about a thousand people. • Each mapper takes a single chromosome for a single person. • MapReduce Jobs : • Job #1 : Match Words • Updates the hash table • Job #2 : Match Segments • Identifies areas where the samples match Parallelism with Hadoop
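
A rough sketch of what the Job #1 ("Match Words") mapper could look like, assuming each input record carries one chromosome for one person as a tab-separated line (user id, chromosome, genome string). The record layout, class names, and column family are assumptions; only the mapper granularity and the row-key scheme come from the slides. The driver (not shown) would point the job's output at the word-hash table, for example via TableOutputFormat.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Job #1 ("Match Words") sketch: one input line per (person, chromosome),
    // e.g. "Baltar<TAB>2<TAB>TTAAGCCTAGGGGCG"; emits Puts destined for the word-hash table.
    public class MatchWordsMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        private static final int WORD_LEN = 5;
        private static final byte[] FAMILY = Bytes.toBytes("w");   // hypothetical column family

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            String userId = fields[0];
            String chromosome = fields[1];
            String genome = fields[2];

            for (int pos = 0; pos * WORD_LEN < genome.length(); pos++) {
                String word = genome.substring(pos * WORD_LEN, (pos + 1) * WORD_LEN);
                // Row key: [CHROMOSOME]_[WORD]_[POSITION]; qualifier: user id; value: byte 1
                byte[] rowKey = Bytes.toBytes(chromosome + "_" + word + "_" + pos);
                Put put = new Put(rowKey);
                put.add(FAMILY, Bytes.toBytes(userId), new byte[] { 1 });
                context.write(new ImmutableBytesWritable(rowKey), put);
            }
        }
    }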
  • Okay, but how does Jermline perform?
  • Okay, but how does Jermline perform? A 1700% improvement over GERMLINE!
  • [Chart] Run Times For Matching (in hours): run time in hours vs. number of samples, up to 120,000 samples
  • [Chart] Run Times For Matching (in hours): GERMLINE run times, Jermline run times, and projected GERMLINE run times vs. number of samples, up to 120,000 samples
  • Bottom line : By leveraging Hadoop and HBase, we dramatically increased our processing capacity. Without Hadoop and HBase, this would have been hideously expensive and difficult. • Previously, we ran GERMLINE on a single "beefy box". • 12-core 2.2GHz Opteron 6174 with 256GB of RAM • We had upgraded this machine until it couldn't be upgraded any more. • Processing time was unacceptable, and growth was unsustainable. • To continue running GERMLINE on a single box, we would have required a vastly more powerful machine, probably at the supercomputer level. • Now, we run Jermline on a cluster. • 20 x 12-core 2GHz Xeon E5-2620 with 96GB of RAM • We can now run 16 batches per day, whereas before we could only run one. • Most importantly, growth is sustainable. To add capacity, we need only add more nodes.
  • Questions?