Scaling AncestryDNA Using Hadoop and HBase

Learn more about how our team scales our AncestryDNA technology to match you within our database's DNA pool.

Published in: Technology

  1. Scaling AncestryDNA Using Hadoop and HBase
     April 10th, 2014
     Jeremy Pollack
  2. Ancestry.com Mission
  3. Discoveries are the Key
     We are the world's largest online family history resource
     • Over 30,000 historical content collections
     • 13 billion records and images
     • Records dating back to the 16th century
     • 10 petabytes
  4. Discoveries in Detail
     The “eureka” moment drives our business
  5. Discoveries with DNA
     Spit in a tube, pay $99, learn your past
     • Autosomal DNA tests
     • More than 200,000 DNA samples
     • 700,000 SNPs for each sample
     • 10,000,000+ cousin matches
     [Figure: a SNP. DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). Source: http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism]
     [Chart: Genotyped Samples, axis from 0 to 150,000]
  6. Network Effect: Cousin Matches
     [Chart: cousin matches (0 to 3,500,000) versus database size (2,000 to 115,756 samples)]
  7. So how do we do it?
  8. Introducing … GERMLINE!
     • GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
     • GERMLINE also refers to the reference implementation of that algorithm, written in C++
     • You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
  9. So What’s the Problem?
     • GERMLINE (the implementation) was not meant to be used in an industrial setting
       o Stateless
       o Single-threaded
       o Prone to swapping (heavy memory usage)
     • GERMLINE performs poorly on large data sets
     • Our metrics predicted exactly where the process would slow to a crawl
     • Put simply: GERMLINE couldn't scale
  10. GERMLINE Run Times (in hours)
      [Chart: run time in hours (0 to 25) versus number of samples (2,500 to 60,000)]
  11. Projected GERMLINE Run Times (in hours)
      [Chart: measured and projected GERMLINE run times in hours (0 to 700) versus number of samples, starting at 2,500]
  12. The Mission: Create a Scalable Matching Engine
      ... and thus was born Jermline (aka "Jermline with a J")
  13. What is Hadoop?
      • Hadoop is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion, using commodity hardware
      • Hadoop specifies a distributed file system called HDFS
      • Hadoop supports a processing methodology known as MapReduce
      • Many tools are built on top of Hadoop, such as HBase, Hive, and Flume
  14. What is HDFS?
  15. What happens when HDFS loses a server
  16. What is MapReduce?
  17. What is HBase?
      • HBase is an open-source NoSQL data store that runs on top of HDFS
      • HBase is columnar; you can think of it as a weird amalgam of a hashtable and a spreadsheet
      • HBase supports unlimited rows and columns
      • HBase stores data sparsely; there is no penalty for empty cells
      • HBase is gaining in popularity: Salesforce, Facebook, and Twitter have all invested heavily in the technology, as have many others
  18. Game of Thrones Characters, in an HBase Table

      KEY      gender  hair_color  family_name  is_evil
      Joffrey  male    blonde      Baratheon    yes
      Cersei   female  blonde      Lannister    kinda

  19. Adding a Row to an HBase Table

      KEY      gender  hair_color  family_name  is_evil
      Joffrey  male    blonde      Baratheon    yes
      Cersei   female  blonde      Lannister    kinda
      Sansa    female  red         Stark        no

  20. Adding a Column to an HBase Table

      KEY      gender  hair_color  family_name  is_evil  title
      Joffrey  male    blonde      Baratheon    yes      king
      Cersei   female  blonde      Lannister    kinda
      Sansa    female  red         Stark        no
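The three table slides above can be mimicked with a dictionary of dictionaries. This is only a toy in-memory model of the "hashtable plus spreadsheet" idea, not the HBase client API; the `put` and `get` helpers are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Toy stand-in for an HBase table: row key -> {column qualifier -> value}.
# Assumption: a plain in-memory dict, used only to illustrate sparse storage.
table = defaultdict(dict)

def put(row_key, column, value):
    """Store one cell; rows and columns come into being on first write."""
    table[row_key][column] = value

def get(row_key, column):
    """Return the cell value, or None when the cell was never written."""
    return table.get(row_key, {}).get(column)

# The two original rows.
for key, gender, hair, family, evil in [
    ("Joffrey", "male", "blonde", "Baratheon", "yes"),
    ("Cersei", "female", "blonde", "Lannister", "kinda"),
]:
    put(key, "gender", gender)
    put(key, "hair_color", hair)
    put(key, "family_name", family)
    put(key, "is_evil", evil)

# Adding a row is just writing cells under a new key.
put("Sansa", "gender", "female")
put("Sansa", "hair_color", "red")
put("Sansa", "family_name", "Stark")
put("Sansa", "is_evil", "no")

# Adding a column touches only the rows that have a value for it.
put("Joffrey", "title", "king")

print(get("Joffrey", "title"))  # the one populated cell
print(get("Cersei", "title"))   # empty cell: nothing is stored for it
```

Empty cells take no space in this model because they are simply absent from the inner dictionary, which is the property the "no penalty for empty cells" bullet describes.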
  21. DNA Matching: How it Works
      The Input
      Cersei  : ACTGACCTAGTTGAC
      Joffrey : TTAAGCCTAGTTGAC
      Cersei Baratheon
      • Former queen of Westeros
      • Machiavellian manipulator
      • Mostly evil, but occasionally sympathetic
      Joffrey Baratheon
      • Pretty much the human embodiment of evil
      • Needlessly cruel
      • Kinda looks like Justin Bieber
  22. DNA Matching: How it Works
      Separate into words
                0     1     2
      Cersei  : ACTGA CCTAG TTGAC
      Joffrey : TTAAG CCTAG TTGAC
  23. DNA Matching: How it Works
      Build the hash table
                0     1     2
      Cersei  : ACTGA CCTAG TTGAC
      Joffrey : TTAAG CCTAG TTGAC

      ACTGA_0 : Cersei
      TTAAG_0 : Joffrey
      CCTAG_1 : Cersei, Joffrey
      TTGAC_2 : Cersei, Joffrey
  24. DNA Matching: How it Works
      Iterate through genome and find matches
      Cersei and Joffrey match from position 1 to position 2
                0     1     2
      Cersei  : ACTGA CCTAG TTGAC
      Joffrey : TTAAG CCTAG TTGAC

      ACTGA_0 : Cersei
      TTAAG_0 : Joffrey
      CCTAG_1 : Cersei, Joffrey
      TTGAC_2 : Cersei, Joffrey
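The three steps on these slides (separate into words, build the hash table, find matches) can be sketched as follows. This is a simplified illustration using the slide data, assuming exact word matches and a fixed word size of five letters; it is not the GERMLINE reference implementation, and the function names are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

WORD_SIZE = 5  # assumption: the five-letter words used on the slides

def split_words(genome, size=WORD_SIZE):
    """Cut a genome string into consecutive fixed-size words."""
    return [genome[i:i + size] for i in range(0, len(genome), size)]

def build_hash_table(samples):
    """Map 'WORD_POSITION' -> set of sample names carrying that word there."""
    table = defaultdict(set)
    for name, genome in samples.items():
        for pos, word in enumerate(split_words(genome)):
            table[f"{word}_{pos}"].add(name)
    return table

def find_matches(table):
    """Return {(a, b): [(start, end), ...]}: runs of consecutive shared words."""
    shared = defaultdict(set)
    for key, names in table.items():
        pos = int(key.rsplit("_", 1)[1])
        for pair in combinations(sorted(names), 2):
            shared[pair].add(pos)
    matches = {}
    for pair, positions in shared.items():
        runs = []
        for pos in sorted(positions):
            if runs and runs[-1][1] == pos - 1:
                runs[-1] = (runs[-1][0], pos)  # extend the current run
            else:
                runs.append((pos, pos))        # start a new run
        matches[pair] = runs
    return matches

samples = {"Cersei": "ACTGACCTAGTTGAC", "Joffrey": "TTAAGCCTAGTTGAC"}
table = build_hash_table(samples)
matches = find_matches(table)
print(matches)  # {('Cersei', 'Joffrey'): [(1, 2)]}
```

The printed run `(1, 2)` corresponds to "Cersei and Joffrey match from position 1 to position 2" on the slide.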
  25. Does that mean they're related? ...maybe
  26. IBD to Relationship Estimation
      • We use the total length of all shared segments to estimate the relationship between two genetic relatives
      • This is basically a classification problem
      [Chart: ERSA. Probability (0.00 to 0.05) versus total_IBD in cM (5 to 5000), one curve each for m1 through m11]
  27. But Wait...What About Jaime?
      Jaime : TTAAGCCTAGGGGCG
      Jaime Lannister
      • Kind of a has-been
      • Killed the Mad King
      • Has the hots for his sister, Cersei
  28. Adding a new sample, the GERMLINE way
  29. The GERMLINE Way
      Step one: Rebuild the entire hash table from scratch, including the new sample
                0     1     2
      Cersei  : ACTGA CCTAG TTGAC
      Joffrey : TTAAG CCTAG TTGAC
      Jaime   : TTAAG CCTAG GGGCG

      ACTGA_0 : Cersei
      TTAAG_0 : Joffrey, Jaime
      CCTAG_1 : Cersei, Joffrey, Jaime
      TTGAC_2 : Cersei, Joffrey
      GGGCG_2 : Jaime
  30. The GERMLINE Way
      Step two: Find everybody's matches all over again, including the new sample (n x n comparisons)
      Cersei and Joffrey match from position 1 to position 2
      Joffrey and Jaime match from position 0 to position 1
      Cersei and Jaime match at position 1
      (Same sequences and hash table as in step one.)
  31. The GERMLINE Way
      Step three: Now, throw away the evidence!
      Cersei and Joffrey match from position 1 to position 2
      Joffrey and Jaime match from position 0 to position 1
      Cersei and Jaime match at position 1
      (Same sequences and hash table as in step one.)
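To put rough numbers on why a stateless rerun hurts, the sketch below counts sample pairs as a proxy for matching work. That proxy is an assumption made for illustration: the real cost depends on how the hash table is traversed, and the slides only say "n x n comparisons". The function names are hypothetical.

```python
from math import comb

def pairs_full_rerun(n_existing):
    """Sample pairs considered when all matches are recomputed with one new sample."""
    return comb(n_existing + 1, 2)

def pairs_incremental(n_existing):
    """Sample pairs considered when only the new sample is compared to the rest."""
    return n_existing

# n = 2 is the Cersei/Joffrey pool receiving Jaime; larger pools for scale.
for n in (2, 1_000, 100_000):
    print(n, pairs_full_rerun(n), pairs_incremental(n))
```

For the three-person example this gives three pairs for the full rerun (all three matches listed in step two) against two for an approach that keeps earlier results and only compares Jaime with Cersei and Joffrey.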
  32. The Jermline Way
      Step one: Update the hash table

      Already stored in HBase:
                   Cersei  Joffrey
      2_ACTGA_0    1
      2_TTAAG_0            1
      2_CCTAG_1    1       1
      2_TTGAC_2    1       1

      New sample to add:
      Jaime : TTAAG CCTAG GGGCG

      Key : [CHROMOSOME]_[WORD]_[POSITION]
      Qualifier : [USER ID]
      Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
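A minimal sketch of this incremental update, assuming the key layout given on the slide ([CHROMOSOME]_[WORD]_[POSITION], with the user ID as qualifier and 1 as cell value). A nested dict stands in for the HBase table, and `add_sample` is a hypothetical helper, not part of the actual system or of any HBase client.

```python
WORD_SIZE = 5  # assumption: five-letter words, as on the slides

# Stand-in for the HBase hash table: row key -> {user id -> 1}.
hash_table = {
    "2_ACTGA_0": {"Cersei": 1},
    "2_TTAAG_0": {"Joffrey": 1},
    "2_CCTAG_1": {"Cersei": 1, "Joffrey": 1},
    "2_TTGAC_2": {"Cersei": 1, "Joffrey": 1},
}

def add_sample(table, chromosome, user_id, genome):
    """Write one cell per word and report, per position, who already had that word."""
    hits = {}
    for pos in range(len(genome) // WORD_SIZE):
        word = genome[pos * WORD_SIZE:(pos + 1) * WORD_SIZE]
        row = table.setdefault(f"{chromosome}_{word}_{pos}", {})
        hits[pos] = sorted(user for user in row if user != user_id)
        row[user_id] = 1  # existing cells for other users are left untouched
    return hits

hits = add_sample(hash_table, 2, "Jaime", "TTAAGCCTAGGGGCG")
print(hits)  # {0: ['Joffrey'], 1: ['Cersei', 'Joffrey'], 2: []}
```

Only Jaime's three cells are written; nothing stored for Cersei or Joffrey is rebuilt, which is the contrast with step one of the GERMLINE way.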
  33. The Jermline Way
      Step two: Find matches, update the results table

      New matches to add:
      Jaime and Joffrey match from position 0 to position 1
      Jaime and Cersei match at position 1

      Already stored in HBase:
                   2_Cersei          2_Joffrey
      2_Cersei                       { (1, 2), ...}
      2_Joffrey    { (1, 2), ...}

      Key : [CHROMOSOME]_[USER ID]
      Qualifier : [CHROMOSOME]_[USER ID]
      Cell value : A list of ranges where the two users match on a chromosome
  34. The Jermline Way

      Results Table
                   2_Cersei          2_Joffrey         2_Jaime
      2_Cersei                       { (1, 2), ...}    { (1), ...}
      2_Joffrey    { (1, 2), ...}                      { (0,1), ...}
      2_Jaime      { (1), ...}       { (0,1), ...}

      Hash Table
                   Cersei  Joffrey  Jaime
      2_ACTGA_0    1
      2_TTAAG_0            1        1
      2_CCTAG_1    1       1        1
      2_TTGAC_2    1       1
      2_GGGCG_2                     1
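Step two can be sketched the same way. The example below assumes the per-position hits for Jaime are already known (who else has Jaime's word at each position) and folds them into a results table keyed by [CHROMOSOME]_[USER ID] on both axes. The dict-of-dicts table and the helper names are illustrative assumptions, and a single-position match is written as (1, 1) where the slide writes (1).

```python
from collections import defaultdict

# Assumed input: for each word position, the users who share Jaime's word there.
hits = {0: ["Joffrey"], 1: ["Cersei", "Joffrey"], 2: []}

# Stand-in for the HBase results table, pre-loaded with the existing match.
results = defaultdict(dict)
results["2_Cersei"]["2_Joffrey"] = [(1, 2)]
results["2_Joffrey"]["2_Cersei"] = [(1, 2)]

def merge_runs(positions):
    """Collapse word positions into (start, end) runs of consecutive positions."""
    runs = []
    for pos in sorted(positions):
        if runs and runs[-1][1] == pos - 1:
            runs[-1] = (runs[-1][0], pos)
        else:
            runs.append((pos, pos))
    return runs

def record_matches(results, chromosome, new_user, hits):
    """Add the new user's matches in both directions; older cells stay as they are."""
    by_user = defaultdict(list)
    for pos, users in hits.items():
        for user in users:
            by_user[user].append(pos)
    for user, positions in by_user.items():
        runs = merge_runs(positions)
        a, b = f"{chromosome}_{new_user}", f"{chromosome}_{user}"
        results[a][b] = runs
        results[b][a] = runs

record_matches(results, 2, "Jaime", hits)
print(results["2_Jaime"])
```

After the call, the Cersei/Joffrey cells are unchanged and the table has gained exactly the Jaime row and column shown on slide 34.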
  35. But wait ... what about Daenerys, Tyrion, Arya, and Jon Snow?
  36. Run them in parallel with Hadoop!
  37. Parallelism with Hadoop
      • Batches are usually about a thousand people
      • Each mapper takes a single chromosome for a single person
      • MapReduce Jobs :
        Job #1 : Match Words
          o Updates the hash table
        Job #2 : Match Segments
          o Identifies areas where the samples match
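The shape of Job #1 can be imitated in plain Python. This is a single-process sketch of the map and reduce roles only, assuming one mapper input per (person, chromosome) as the slide states; it does not use the Hadoop API, and the real job's handling of HBase writes is not described in the slides. `map_words` and `reduce_words` are hypothetical names.

```python
from collections import defaultdict

WORD_SIZE = 5  # assumption: five-letter words, as in the earlier slides

def map_words(user_id, chromosome, genome):
    """Mapper: one (person, chromosome) input -> (row key, user id) pairs."""
    for pos in range(len(genome) // WORD_SIZE):
        word = genome[pos * WORD_SIZE:(pos + 1) * WORD_SIZE]
        yield f"{chromosome}_{word}_{pos}", user_id

def reduce_words(pairs):
    """Reducer: group user ids by row key, as a shuffle-and-reduce step would."""
    grouped = defaultdict(set)
    for key, user_id in pairs:
        grouped[key].add(user_id)
    return dict(grouped)

# A tiny batch using the slide samples; each tuple is one independent mapper input.
batch = [
    ("Cersei", 2, "ACTGACCTAGTTGAC"),
    ("Joffrey", 2, "TTAAGCCTAGTTGAC"),
    ("Jaime", 2, "TTAAGCCTAGGGGCG"),
]

pairs = [pair for sample in batch for pair in map_words(*sample)]
word_table = reduce_words(pairs)
for key in sorted(word_table):
    print(key, sorted(word_table[key]))
```

Because each mapper input depends only on its own person and chromosome, the inputs can be processed in any order or on different machines, which is what makes the batch easy to parallelize.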
  38. How does Jermline perform?
      A 1700% performance improvement over GERMLINE!
  39. Run times for Matching (in hours)
      [Chart: run time in hours (0 to 25) versus number of samples (2,500 to 117,500)]
  40. Run times for Matching (in hours)
      [Chart: GERMLINE run times, Jermline run times, and projected GERMLINE run times in hours (0 to 180) versus number of samples]
  41. Dramatically Increased our Capacity
      Bottom line: Without Hadoop and HBase, this would have been expensive and difficult.
  42. Lessons Learned
      And now for everybody's favorite part ....
  43. Lessons Learned
      What went right?
  44. Lessons Learned : What went right?
      • This project would not have been possible without TDD
      • Two sets of test data : generated and public domain
      • 89% coverage
      • Corrected a bug in the reference implementation
      • Has never failed in production
  45. Lessons Learned
      What would we do differently?
  46. Lessons Learned : What would we do differently?
      • Front-load some performance tests
        o HBase and Hadoop can have odd performance profiles
        o HBase in particular has some strange behavior if you're not familiar with its inner workings
      • Allow a lot of time for live tests, dry runs, and deployment
        o These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"
  47. Questions?
  48. And yes, we are hiring! You can contact me at jpollack@ancestry.com or just talk to me here!
