Utah Big Mountain Conference: AncestryDNA, HBase, Hadoop (9-7-2013)

This presentation was given at Adobe with support from Utah Geek Events. It tells the story of building a business in an Agile way and digs into the technology we used to support that business. The venue was a large open room with no microphone, and I had a great audience of experienced Big Data developers and people new to the technology.

  1. 1. 1 Ancestry DNA at Scale Using Hadoop and HBase September 7, 2013
  2. 2. What does this talk cover? • What does Ancestry do? • How did our journey with Hadoop start? • Using Hadoop as a Job Processor • DNA Matching with Hadoop and HBase • What’s next? 2
  3. 3. Ancestry.com Mission 3
  4. 4. Discoveries Are the Key • Over 30,000 historical content collections • 11 billion records and images • Records dating back to the 16th century • 4 petabytes We are the world's largest online family history resource.
  5. 5. Discoveries In Detail: The “eureka” moment drives our business
  6. 6. Discoveries With DNA • Spit in a tube, pay $99, learn your past • Autosomal DNA tests • Over 120,000 DNA samples • 700,000 SNPs for each sample • 6,000,000+ 4th cousin matches. (Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location, a C/T polymorphism – http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism. Chart: genotyped samples, 0 to 150,000.) 6
  7. 7. What does the customer see? 7
  8. 8. Network Effect – Cousin Matches (Chart: cousin matches, 0 to 3,500,000, versus database size, 2,000 to 115,756 genotyped samples.) 8
  9. 9. Where Did We Start? The process before Hadoop 9
  10. 10. What’s the Story? Cast of characters (scientists and software engineers) under the pressures of a startup business – release a product, learn, and then scale. The team: a Sr. Manager, 5 developers, and a 4-member science team. Scientists think they can code: • Linux • MySQL • Perl and/or Python. Software engineers think they are scientists: • Biology in high school and college • Math/statistics • Read science papers. 10
  11. 11. DNA Input Raw Data (A,C,T,G,0): 3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs) Map File: 0 rs10005853 0 0 0 rs10015934 0 0 0 rs1004236 0 0 0 rs10059646 0 0 0 rs10085382 0 0 0 rs10123921 0 0 0 rs10127827 0 0 0 rs10155688 0 0 0 rs10162780 0 0 0 rs1017484 0 0 0 rs10188129 0 0 11
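The raw data and map file above resemble a PLINK-style .ped/.map text layout: six sample-metadata columns followed by two allele calls (A/C/T/G, with 0 for a no-call) per SNP, plus one map row per SNP. Below is a minimal parsing sketch under that assumption; the field layout and the GenotypeRecord class are illustrative, not Ancestry's actual code.

```java
import java.util.Arrays;

// Minimal sketch of parsing one raw DNA input line, assuming a PLINK-style
// .ped layout: 6 metadata columns, then two allele calls (A/C/T/G/0) per SNP.
public class GenotypeLineParser {

    public static class GenotypeRecord {
        public final String sampleId;
        public final String[] alleles;   // two entries per SNP, "0" = no-call
        public GenotypeRecord(String sampleId, String[] alleles) {
            this.sampleId = sampleId;
            this.alleles = alleles;
        }
    }

    public static GenotypeRecord parse(String line) {
        String[] fields = line.trim().split("\\s+");
        // Columns 0-5: family id, sample id, father, mother, sex, phenotype (-9 = missing).
        String sampleId = fields[1];
        String[] alleles = Arrays.copyOfRange(fields, 6, fields.length);
        return new GenotypeRecord(sampleId, alleles);
    }

    public static void main(String[] args) {
        GenotypeRecord r = parse("3 123456789_SAMPLE 0 0 0 -9 C C G G A A");
        System.out.println(r.sampleId + " has " + r.alleles.length / 2 + " SNPs");
    }
}
```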
  12. 12. What Did “Get Something Running” Look Like? Single beefy box – the only option is to scale vertically. (Diagram of the old pipeline, all on one “beefy box”: pipeline control with an enqueuer (DNA validation), run watchdog, heartbeat, status polling, monitoring, reruns, results processing, finalize, and disk management; AdMixture (ethnicity), Beagle (phasing), and GERMLINE (matching) all run on the same machine.) 12
  13. 13. Measure Everything Principle • Start time, end time, duration in seconds, and sample count for every step in the pipeline, plus the full end-to-end processing time • Put the data in pivot tables and graph each step • Normalize the data (sample size was changing) • Use the data collected to predict future performance 13 #1
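A minimal sketch of the "measure everything" principle as code: wrap each pipeline step so that start time, end time, duration, and sample count get logged in a form that drops straight into a pivot table. The step name and CSV-style logging are assumptions for illustration only.

```java
import java.util.function.Supplier;

// Sketch: time a pipeline step and record duration plus sample count,
// so per-step metrics can be exported to a spreadsheet/pivot table later.
public class StepTimer {

    public static <T> T timeStep(String stepName, int sampleCount, Supplier<T> step) {
        long start = System.currentTimeMillis();
        try {
            return step.get();
        } finally {
            long end = System.currentTimeMillis();
            double seconds = (end - start) / 1000.0;
            // Normalize by sample count so runs of different batch sizes stay comparable.
            System.out.printf("%s,start=%d,end=%d,duration_sec=%.1f,samples=%d,sec_per_sample=%.3f%n",
                    stepName, start, end, seconds, sampleCount, seconds / Math.max(1, sampleCount));
        }
    }

    public static void main(String[] args) {
        timeStep("phasing", 1000, () -> { /* run the real step here */ return null; });
    }
}
```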
  14. 14. Challenges and Pain Points – Performance degrades as the DNA pool grows • Static (by batch size) • Linear (by DNA pool size) • Quadratic (matching-related steps) – a time bomb (Courtesy of Keith’s plotting) 14
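As a rough illustration of those three growth patterns (a hedged model, not a fit to the actual measurements), the run time of one pipeline execution with batch size B over a DNA pool of n samples behaves like:

```latex
% Illustrative cost model: per-batch work, linear pool-size work,
% and the quadratic matching term that eventually dominates.
T(B, n) \;\approx\; \underbrace{a\,B}_{\text{static (batch)}}
        \;+\; \underbrace{b\,n}_{\text{linear (pool size)}}
        \;+\; \underbrace{c\,n^{2}}_{\text{matching steps}}
```

The quadratic matching term is the "time bomb": negligible early on, then it swamps everything else as the pool grows.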
  15. 15. Parallel Ethnicity Jobs Use Hadoop as a job processor 15
  16. 16. Why Attack Ethnicity First? • Smart developers, little Hadoop experience – Using Hadoop as a job scheduler and scaling the ethnicity step was easier than redesigning the matching step • AdMixture is a self-contained application – Reference panel, the user’s DNA, and a seed value for inputs – CPU-intensive job that writes to stdout • Easy to split up the input • Looked hard enough at the matching problem to realize an HBase/MapReduce solution was realistic 16
  17. 17. Parallel Ethnicity Jobs – Typical run of 1,000 samples: queue up one Hadoop job with 40 tasks, 25 samples per task. (Diagram: a MapReduce job fans AdMixture tasks out across the Hadoop cluster – 20 servers x 4 slots x 96 GB.) 17 #2
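A sketch of what "Hadoop as a job processor" can look like for this step: each map task receives its share of sample IDs (about 25 per task in the run described above) and shells out to the AdMixture binary, capturing stdout. The command line, file paths, and arguments here are assumptions for illustration, not the actual pipeline invocation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: run AdMixture as an external process inside a map task.
// Input value: one sample ID per line; a task's input split holds its batch of samples.
public class AdmixtureMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String sampleId = value.toString().trim();
        // Hypothetical command line; the real job also takes a reference panel and a seed value.
        ProcessBuilder pb = new ProcessBuilder("admixture", "/data/" + sampleId + ".bed", "26");
        pb.redirectErrorStream(true);
        Process p = pb.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        p.waitFor();
        // Emit sample ID -> ethnicity estimate (captured stdout) for downstream processing.
        context.write(new Text(sampleId), new Text(output.toString()));
    }
}
```

Because AdMixture is CPU-bound and writes to stdout, treating it as a black-box process per sample is enough to spread the work across every slot in the cluster.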
  18. 18. Results – 1,000-sample runs in under 3 hours (one interesting bug). (Chart: AdMixture time in seconds and run size for each run from March 2012 through August 2013.) 18
  19. 19. Freed up the “Beefy Box” • Moving AdMixture off left an additional 10 threads for phasing and matching • Memory was freed up for phasing and matching • Just moving AdMixture off saved over 6 hours of processing on the single box – bought us time 19
  20. 20. New Matching Algorithm Hadoop and HBase 20
  21. 21. What is GERMLINE? • GERMLINE is an algorithm that finds hidden relationships within a pool of DNA • GERMLINE also refers to the reference implementation of that algorithm written in C++ • You can find it here : http://www1.cs.columbia.edu/~gusev/germline/
  22. 22. So what's the problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting • Stateless • Single threaded • Prone to swapping (heavy memory usage) • Generic • Used for any DNA (fish, fruit fly, human, …) • GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply : GERMLINE couldn't scale
  23. 23. (Chart: GERMLINE Run Times (in hours) – hours, 0 to 25, versus number of samples, 2,500 to 60,000.)
  24. 24. Projected GERMLINE Run Times (in hours) (Chart: actual and projected GERMLINE run times – hours, 0 to 700, versus number of samples, 2,500 to 122,500.)
  25. 25. The Mission : Create a Scalable Matching Engine ... and thus was born Jermline (aka "Jermline with a J")
  26. 26. Starbuck : ACTGACCTAGTTGAC Adama : TTAAGCCTAGTTGAC The Input Kara Thrace, aka Starbuck • Ace viper pilot • Has a special destiny • Not to be trifled with Admiral Adama • Admiral of the Colonial Fleet • Routinely saves humanity from destruction DNA Matching : How it Works
  27. 27. 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Separate into words DNA Matching : How it Works
  28. 28. 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama Build the hash table DNA Matching : How it Works
  29. 29. Iterate through genome and find matches Starbuck and Adama match from position 1 to position 2 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama DNA Matching : How it Works
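To make the word-hashing idea from the example above concrete, here is a toy sketch in plain Java: split each genome into fixed-length words, index users by (word, position), then walk the positions and report users who share a word there (the real engine extends consecutive shared words into match segments). The 5-letter word size and the two sample genomes come from the slides; everything else is illustrative.

```java
import java.util.*;

// Toy sketch of the GERMLINE-style matching idea: hash words by position,
// then find users who share a word at each position.
public class WordMatchDemo {

    static final int WORD = 5;

    public static void main(String[] args) {
        Map<String, String> genomes = new LinkedHashMap<>();
        genomes.put("Starbuck", "ACTGACCTAGTTGAC");
        genomes.put("Adama",    "TTAAGCCTAGTTGAC");

        // Build the hash table: key = word + "_" + position, value = users with that word there.
        Map<String, Set<String>> table = new HashMap<>();
        for (Map.Entry<String, String> e : genomes.entrySet()) {
            for (int pos = 0; pos * WORD < e.getValue().length(); pos++) {
                String word = e.getValue().substring(pos * WORD, (pos + 1) * WORD);
                table.computeIfAbsent(word + "_" + pos, k -> new HashSet<>()).add(e.getKey());
            }
        }

        // Iterate through positions and report users sharing a word at that position.
        int positions = genomes.get("Starbuck").length() / WORD;
        for (int pos = 0; pos < positions; pos++) {
            for (Map.Entry<String, Set<String>> e : table.entrySet()) {
                if (e.getKey().endsWith("_" + pos) && e.getValue().size() > 1) {
                    System.out.println("Match at position " + pos + ": " + e.getValue());
                }
            }
        }
    }
}
```

Running it reports shared words at positions 1 and 2 for Starbuck and Adama, which is exactly the match shown on the slide.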
  30. 30. Does that mean they're related? ...maybe
  31. 31. Baltar : TTAAGCCTAGGGGCG But wait... what about Baltar? Gaius Baltar • Handsome • Genius • Kinda evil
  32. 32. Adding a new sample, the GERMLINE way
  33. 33. 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar Step one : Rebuild the entire hash table from scratch, including the new sample The GERMLINE Way
  34. 34. Starbuck and Adama match from position 1 to position 2 Adama and Baltar match from position 0 to position 1 Starbuck and Baltar match at position 1 Step two : Find everybody's matches all over again, including the new sample. (n x n comparisons) 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar The GERMLINE Way
  35. 35. Starbuck and Adama match from position 1 to position 2 Adama and Baltar match from position 0 to position 1 Starbuck and Baltar match at position 1 Step three : Now, throw away the evidence! 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar You have done this before, and you will have to do it ALL OVER AGAIN. The GERMLINE Way
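A back-of-the-envelope comparison of why this hurts, as a hedged sketch (illustrative complexity only, with b new samples added to a pool of n existing samples):

```latex
% Full GERMLINE rerun: every pair in the enlarged pool is compared again.
\text{rerun cost} \;\sim\; \binom{n+b}{2} \;=\; \frac{(n+b)(n+b-1)}{2} \;=\; O\!\left(n^{2}\right)
% Incremental update: only the new samples are hashed and matched against stored state.
\text{incremental cost} \;\sim\; O\!\left(b \cdot n\right)
```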
  36. 36. Not so good, right? Now let's take a look at the Jermline way.
  37. 37. The Jermline Way – Step one : Update the hash table.
      Already stored in HBase:
                    Starbuck   Adama
      2_ACTGA_0        1
      2_TTAAG_0                   1
      2_CCTAG_1        1          1
      2_TTGAC_2        1          1
      New sample to add – Baltar : TTAAG CCTAG GGGCG
      Add a column for the new sample under every word it contains.
      Key : [CHROMOSOME]_[WORD]_[POSITION]
      Qualifier : [USER ID]
      Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
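A sketch of step one as an HBase write, following the key/qualifier/value scheme above. The table name, column family, and use of the newer Connection/Table client API are assumptions; the 2013-era code would have used the older HTable-style API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: add one word occurrence for a new sample to the hash table in HBase.
// Row key:   [CHROMOSOME]_[WORD]_[POSITION]   e.g. "2_TTAAG_0"
// Qualifier: [USER ID]                        e.g. "Baltar"
// Value:     a single byte set to 1
public class WordTableWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table words = conn.getTable(TableName.valueOf("dna_words"))) { // table name assumed
            Put put = new Put(Bytes.toBytes("2_TTAAG_0"));
            put.addColumn(Bytes.toBytes("d"),            // column family (assumed)
                          Bytes.toBytes("Baltar"),       // qualifier = user id
                          new byte[] { 1 });             // "this user has this word here"
            words.put(put);
        }
    }
}
```

Because each word occurrence is its own column, adding a new sample only appends columns to existing rows instead of rebuilding the whole table.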
  38. 38. The Jermline Way – Step two : Find matches.
      Already stored in HBase:
                      2_Starbuck        2_Adama
      2_Starbuck                        { (1, 2), ...}
      2_Adama         { (1, 2), ...}
      New matches to add: Baltar and Adama match from position 0 to position 1; Baltar and Starbuck match at position 1.
      “Fuzzy match” the consecutive words. Worst case: identical twins.
      Key : [CHROMOSOME]_[USER ID]
      Qualifier : [CHROMOSOME]_[USER ID]
      Cell value : A list of ranges where the two users match on a chromosome
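And a companion sketch for step two, recording a discovered match range. Again, the table name, column family, and the plain-text encoding of the range list are illustrative assumptions (the real cell value is some serialized list of ranges).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: record that Baltar and Adama match from position 0 to 1 on chromosome 2.
// Row key:   [CHROMOSOME]_[USER ID]   e.g. "2_Baltar"
// Qualifier: [CHROMOSOME]_[USER ID]   e.g. "2_Adama"
// Value:     a list of (start, end) ranges where the two users match
public class MatchTableWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table matches = conn.getTable(TableName.valueOf("dna_matches"))) { // table name assumed
            Put put = new Put(Bytes.toBytes("2_Baltar"));
            put.addColumn(Bytes.toBytes("m"),                 // column family (assumed)
                          Bytes.toBytes("2_Adama"),
                          Bytes.toBytes("[(0,1)]"));          // toy encoding of the range list
            matches.put(put);
            // The symmetric row ("2_Adama" -> "2_Baltar") would be written the same way.
        }
    }
}
```

Storing ranges per (chromosome, user pair) means existing matches never have to be recomputed when a new batch arrives.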
  39. 39. The Jermline Way – These are the updated tables after adding Baltar’s information. Only looking at 3 samples, chromosome #2, positions 0, 1, and 2 – a very simple example of how the matching process works.
      Hash table (words):
                    Starbuck   Adama   Baltar
      2_ACTGA_0        1
      2_TTAAG_0                   1       1
      2_CCTAG_1        1          1       1
      2_TTGAC_2        1          1
      2_GGGCG_2                           1
      Match table:
                      2_Starbuck        2_Adama           2_Baltar
      2_Starbuck                        { (1, 2), ...}    { (1), ...}
      2_Adama         { (1, 2), ...}                      { (0,1), ...}
      2_Baltar        { (1), ...}       { (0,1), ...}
  40. 40. But wait ... what about Zarek, Roslin, Hera, and Helo?
  41. 41. Run them in parallel with Hadoop! (Photo by Benh Lieu Song)
  42. 42. • Batches are usually about a thousand people. • Each mapper takes a single chromosome for a single person. o Three samples per task means 22 jobs with 334 tasks (1000/3) each • MapReduce Jobs : Job #1 : Match Words • Updates the hash table Job #2 : Match Segments • Identifies areas where the samples match Parallelism with Hadoop
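A sketch of that two-job structure using the standard MapReduce Job API: Job #1 updates the word hash table, Job #2 finds the match segments. The class names, stub mapper bodies, and input/output paths are placeholders, not the actual Jermline code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: run Job #1 (update the word hash table) then Job #2 (find match segments).
// The mapper bodies are stubs; the real mappers read from and write to HBase.
public class MatchingDriver {

    public static class MatchWordsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx) {
            // Real logic: split this (person, chromosome) input into words and update HBase.
        }
    }

    public static class MatchSegmentsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx) {
            // Real logic: scan the word table and write match ranges to HBase.
        }
    }

    private static Job makeJob(Configuration conf, String name,
                               Class<? extends Mapper> mapper, Path in, Path out) throws Exception {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(MatchingDriver.class);
        job.setMapperClass(mapper);
        job.setNumReduceTasks(0);                  // map-only: results go to HBase, not HDFS
        FileInputFormat.addInputPath(job, in);     // one input line per (person, chromosome)
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job words = makeJob(conf, "jermline-match-words", MatchWordsMapper.class,
                            new Path(args[0]), new Path(args[1]));
        if (!words.waitForCompletion(true)) System.exit(1);
        Job segments = makeJob(conf, "jermline-match-segments", MatchSegmentsMapper.class,
                               new Path(args[0]), new Path(args[2]));
        System.exit(segments.waitForCompletion(true) ? 0 : 1);
    }
}
```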
  43. 43. How does Jermline perform? A 1700% improvement over GERMLINE! Along with more accurate results #3
  44. 44. Run Times For Matching (in hours) (Chart: hours, 0 to 25, versus number of samples, 2,500 to 120,000.)
  45. 45. Run Times For Matching (in hours) (Chart: GERMLINE run times, Jermline run times, and projected GERMLINE run times – hours, 0 to 180, versus number of samples, 2,500 to 120,000.)
  46. 46. Incremental Changes Over Time • Support the business, move incrementally, and adjust • After H2, pipeline speed stays flat • (Courtesy of Bill’s plotting) 46
  47. 47. Dramatically Increased Our Capacity – Bottom line : Without Hadoop and HBase, this would have been expensive and difficult. • Previously, we ran GERMLINE on a single “beefy box” • A 12-core 2.2 GHz Opteron 6174 with 256 GB of RAM • We had upgraded this machine until it couldn’t be upgraded any more • Processing time was unacceptable and growth was unsustainable • Continuing to run GERMLINE on a single box would have required a vastly more powerful machine, probably at the supercomputer level – at considerable cost! • Now, we run Jermline on a cluster • 20 x 12-core 2 GHz Xeon E5-2620 with 96 GB of RAM each • We can now run 16 batches per day, whereas before we could only run one • Most importantly, growth is sustainable: to add capacity, we need only add more nodes.
  48. 48. What’s Next? Hadoop and HBase 48
  49. 49. Continue to Evolve the Software • Azkaban for job control – Nearly complete • Phasing – Still runs on the “Beefy Box”; 1,000 samples take over 11 hours – Total run time for 1,000 samples is about 14 hours – Re-implement with HBase, MapReduce, and Hadoop • Version Updates – New algorithms require us to re-run the entire DNA pool – Burst capacity to the cloud • Machine Learning – Matching (V2) and Ethnicity (V3) would both benefit from a machine learning approach 49
  50. 50. End of the Journey (for now) - Questions? 50
