Scaling AncestryDNA using Hadoop and HBase

455 views

Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1aNuQO8.

Bill Yetman and Jeremy Pollack discuss using several Agile techniques -start simple, get going, iterate- and the “measure everything” principle to create the architecture behind the Family History website. Filmed at qconsf.com.

Jeremy Pollack is a Senior Engineer at Ancestry.com, where he supports a team of scientists and makes their discoveries scale. Bill Yetman has served as Senior Director of Engineering at Ancestry.com since January 2011. He holds a B.S. in Computer Science and a B.A. in Psychology from San Diego State University.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
455
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Scaling AncestryDNA using Hadoop and HBase

  1. 1. Scaling AncestryDNA Using Hadoop and HBase November 11, 2013 Jeremy Pollack (Engineer) and Bill Yetman (Manager) 1
  2. 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /scaling-ancestry-dna InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  3. 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. What Does This Talk Cover? What does Ancestry do? How does the science work? How did our journey with Hadoop start? DNA matching with Hadoop and Hbase Lessons Learned What’s next? 2
  5. 5. Ancestry.com Mission 3
  6. 6. Discoveries are the Key We are the world's largest online family history resource • Over 30,000 historical content collections • 12 billion records and images • Records dating back to 16th century • 10 petabytes 4
  7. 7. Discoveries in Detail The “eureka” moment drives our business 5
  8. 8. Discoveries with DNA Spit in a tube, pay $99, learn your past Autosomal DNA tests Over 200,000+ DNA samples 700,000 SNPs for each sample 10,000,000+ cousin matches 150,000 Genotyped Samples 100,000 50,000 - DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism) (http://en.wikipedia.org/wiki/Singlenucleiotide_polymorphism) 6
  9. 9. Network Effect – Cousin Matches 3,500,000 Cousin Matches 3,000,000 2,500,000 2,000,000 1,500,000 1,000,000 500,000 0 2,000 10,053 21,205 40,201 60,240 Database Size 7 80,405 115,756
  10. 10. Where Did We Start? The process before Hadoop 8
  11. 11. What’s the Story? Cast of Characters (Scientists and Software Engineers) Scientists Think they can code: • Linux • MySQL • PERL and/or Python Software Engineers Think they are Scientists: • Biology in HS and College • Math/Statistics • Read science papers Pressures of a new business – Release a product, learn, and then scale Sr. Manager and 3 developers and 2 member Science Team 9
  12. 12. What Did “Get Something Running” Look Like? Ethnicity Step and Matching (Germline) runs here “Beefy Box” Specifics: 1) Ran multiple threads for the two steps 2) Both steps were run in parallel 3) As the DNA Pool grew both steps required more memory Single Beefy Box – Only option is to scale Vertically 10
  13. 13. Measure Everything Principle • Start time, end time, duration in seconds, and sample count for every step in the pipeline. Also the full end-toend processing time • Put the data in pivot tables and graphed each step • Normalize the data (sample size was changing) • Use the data collected to predict future performance 11
  14. 14. Challenges and Pain Points Performance degrades when DNA pool grows • Static (by batch size) • Linear (by DNA pool size) • Quadratic (Matching related steps) – Time bomb (Courtesy from Keith’s Plotting) 12
  15. 15. New Matching Algorithm Hadoop and HBase 13
  16. 16. What is GERMLINE? • GERMLINE is an algorithm that finds hidden relationships within a pool of DNA • GERMLINE also refers to the reference implementation of that algorithm written in C++ • You can find it here : http://www1.cs.columbia.edu/~gusev/germline/ 14
  17. 17. So What’s the Problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting    Stateless Single threaded Prone to swapping (heavy memory usage) • GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply: GERMLINE couldn't scale 15
  18. 18. Hours GERMLINE Run Times (in hours) 25 20 15 10 5 0 60,000 57,500 55,000 52,500 50,000 47,500 45,000 42,500 40,000 37,500 35,000 32,500 30,000 27,500 25,000 22,500 20,000 17,500 15,000 12,500 10,000 7,500 16 5,000 2,500 Samples
  19. 19. Projected GERMLINE Run Times (in hours) 700 600 500 Hours 400 300 200 GERMLINE run times 100 Projected GERMLINE run times 0 122,500 112,500 102,500 92,500 82,500 Samples 72,500 62,500 52,500 42,500 32,500 22,500 12,500 2,500 17
  20. 20. The Mission : Create a Scalable Matching Engine ... and thus was born (aka "Jermline with a J") 18
  21. 21. What is Hadoop? • Hadoop is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion, using commodity hardware • Hadoop specifies a distributed file system called HDFS • Hadoop supports a processing methodology known as MapReduce • Many tools are built on top of Hadoop, such as HBase, Hive, and Flume 19
  22. 22. What is MapReduce? 20
  23. 23. What is HBase? • HBase is an open-source NoSQL data store that runs on top of HDFS • HBase is columnar; you can think of it as a weird amalgam of a hashtable and a spreadsheet • HBase supports unlimited rows and columns • HBase stores data sparsely; there is no penalty for empty cells • HBase is gaining in popularity: Salesforce, Facebook, and Twitter have all invested heavily in the technology, as well as many others 21
  24. 24. Battlestar Galactica Characters, in an HBase Table KEY is_cylon hair_color gender is_final_five no Six blonde female Adama 22 true false brown male rank admiral
  25. 25. Adding a Row to an HBase Table KEY is_cylon hair_color gender is_final_five no Six blonde female Adama false brown male Baltar 23 true false brown male rank admiral
  26. 26. Adding a Column to an HBase Table KEY is_cylon hair_color gender is_final_five Six true blonde female no Adama false brown male Baltar false brown male 24 rank friends admiral Kara Thrace, Saul Tigh
  27. 27. DNA Matching : How it Works The Input Starbuck : ACTGACCTAGTTGAC Adama : TTAAGCCTAGTTGAC Kara Thrace, aka Starbuck • Ace viper pilot • Has a special destiny • Not to be trifled with 25 Admiral Adama • Admiral of the Colonial Fleet • Routinely saves humanity from destruction
  28. 28. DNA Matching : How it Works Separate into words 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC 26
  29. 29. DNA Matching : How it Works Build the hash table 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama 27
  30. 30. DNA Matching : How it Works Iterate through genome and find matches 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama Starbuck and Adama match from position 1 to position 2 28
  31. 31. Does that mean they're related? ...maybe 29
  32. 32. IBD to Relationship Estimation 0.02 0.03 0.04 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 0.00 0.01 • This is basically a classification problem probability • We use the total length of all shared segments to estimate the relationship between to genetic relatives 0.05 ERSA 5 10 20 50 100 200 total_IBD(cM) 30 500 1000 5000
  33. 33. But Wait...What About Baltar? Baltar : TTAAGCCTAGGGGCG Gaius Baltar • Handsome • Genius • Kinda evil 31
  34. 34. Adding a new sample, the GERMLINE way 32
  35. 35. The GERMLINE Way Step one: Rebuild the entire hash table from scratch, including the new sample 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar 33
  36. 36. The GERMLINE Way Step two: Find everybody's matches all over again, including the new sample. (n x n comparisons) 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar Starbuck and Adama match from position 1 to position 2 Adama and Baltar match from position 0 to position 1 Starbuck and Baltar match at position 1 34
  37. 37. The GERMLINE Way Step three: Now, throw away the evidence! 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar Starbuck and Adama match from position 1 to position 2 Adama and Baltar match from position 0 to position 1 Starbuck and Baltar match at position 1 You have done this before, and you will have to do it ALL OVER AGAIN. 35
  38. 38. The Way Step one: Update the hash table Starbuck 2_ACTGA_0 Adama 1 2_TTAAG_0 1 2_CCTAG_1 1 1 2_TTGAC_2 1 Already stored in HBase 1 Baltar : TTAAG CCTAG GGGCG New sample to add Key : [CHROMOSOME]_[WORD]_[POSITION] Qualifier : [USER ID] Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome 36
  39. 39. The Way Step two: Find matches, update the results table 2_Starbuck 2_Starbuck 2_Adama 2_Adama { (1, 2), ...} Already stored in HBase { (1, 2), ...} Baltar and Adama match from position 0 to position 1 Baltar and Starbuck match at position 1 New matches to add Key : [CHROMOSOME]_[USER ID] Qualifier : [CHROMOSOME]_[USER ID] Cell value : A list of ranges where the two users match on a chromosome 37
  40. 40. The Way Hash Table Starbuck 2_ACTGA_0 Adama Baltar 1 1 1 1 2_TTAAG_0 2_CCTAG_1 1 1 2_TTGAC_2 1 1 2_GGGCG_2 1 Results Table 2_Starbuck 2_Adama 38 { (1), ...} { (1), ...} { (1, 2), ...} 2_Baltar 2_Baltar { (1, 2), ...} 2_Starbuck 2_Adama { (0,1), ...} { (0,1), ...}
  41. 41. But wait ... what about Zarek, Roslin, Hera, and Helo? 39
  42. 42. Run them in parallel with Hadoop! Photo by Benh Lieu Song 40
  43. 43. Parallelism with Hadoop • Batches are usually about a thousand people • Each mapper takes a single chromosome for a single person • MapReduce Jobs : Job #1 : Match Words o Updates the hash table Job #2 : Match Segments o 41 Identifies areas where the samples match
  44. 44. How does perform? A 1700% performance improvement over GERMLINE! (Along with more accurate results) 42
  45. 45. Hours Run times for Matching (in hours) 25 20 15 10 5 0 117,500 112,500 107,500 102,500 97,500 92,500 87,500 82,500 77,500 72,500 67,500 62,500 57,500 52,500 47,500 42,500 37,500 32,500 27,500 22,500 17,500 12,500 43 7,500 2,500 Samples
  46. 46. Run times for Matching (in hours) 180 160 140 120 Hours 100 GERMLINE run times 80 Jermline run times 60 Projected GERMLINE run times 40 20 0 44 Samples
  47. 47. Incremental Changes Over Time • Support the business, move incrementally and adjust • After H2, pipeline speed stays flat (Courtesy from Bill’s plotting) 45
  48. 48. Dramatically Increased our Capacity Bottom line: Without Hadoop and HBase, this would have been expensive and difficult. 46
  49. 49. And now for everybody's favorite part .... Lessons Learned 47
  50. 50. Lessons Learned What went right? 48
  51. 51. Lessons Learned : What went right? • This project would not have been possible without TDD • Two sets of test data : generated and public domain • 89% coverage • Corrected bugs found in the reference implementation • Has never failed in production 49
  52. 52. Lessons Learned What would we do differently? 50
  53. 53. Lessons Learned : What would we do differently? • Front-load some performance tests   HBase and Hadoop can have odd performance profiles HBase in particular has some strange behavior if you're not familiar with its inner workings • Allow a lot of time for live tests, dry runs, and deployment  51 These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"
  54. 54. What’s next for the Science Team? 52
  55. 55. Our new lab in Albuquerque, NM 53
  56. 56. Okay, for real this time. What’s next for the Science Team? 54
  57. 57. More Accurate Results Potential matches Relevant matches 55
  58. 58. Mapping Potential Birth Locations for Ancestors Birth locations from 1750-1900 of individuals with large amounts of genetic ancestry from Senegal 1750-1850 1800-1900 Over-represented birth location in individuals with large amounts of Senegalese ancestry Birth location common amongst individuals with W. African ancestry 56
  59. 59. How will the engineering team enable these advances? 57
  60. 60. Engineering Improvements • Implement algorithmic improvements to make our results more accurate • Recalculate data as needed to support new scientific discoveries • Utilize cloud computing for burst capacity • Create asynchronous processes to continuously refine our data • Whatever science throws at us, we'll be there to turn their discoveries into robust, scalable solutions 58
  61. 61. End of the Journey (for now) Questions? Tech Roots Blog: http://blogs.ancestry.com/techroots 59
  62. 62. Appendix 60
  63. 63. Appendix A. Who are the presenters? Bill Yetman 61 Jeremy Pollack
  64. 64. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/scalingancestry-dna

×