HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

NextBio relies on HBase to store the world’s largest collection of continuously curated genomic knowledge. The HBase cluster is leveraged to store billions of correlations as well as processed genomic information. In this talk, we will describe how we use HBase, why we migrated from a large MySQL deployment to HBase, and the challenges along the way.

  1. Leveraging HBase for the World’s Largest Curated Genomic Data Collection
     Satnam Alag, Ph.D., VP of Engineering, satnam@nextbio.com
     © 2012 NextBio | All rights reserved | This information is proprietary and confidential.
     NEXTBIO 2008
  2. Technology Generating Exponential Data
  3. Genomic Big Data
     [Chart: time axis 2000, 2003, 2006, 2009, 2012; labels include Tumorscape+ and Internal Data]
  4. Use Case 1: HBase to Store Variant Data
     • Each genome has ~4 million variants
     • Immutable: write once, never change, read many times
     • Bloom filters are useful
     • Batch import of data (HFile)
     • Data to be accessed is collocated in a region
     • Separate HBase cluster from Hadoop
     • All the smarts are in the keys
     For the various tables in HBase:
     • 1 genome → 10 million rows
     • 100 genomes → 1 billion rows
     • 100K genomes → 1 trillion rows
     • 100M genomes → 1 quadrillion (1,000,000,000,000,000) rows
     Fortunately, HBase cluster access can be partitioned by the application when required.
  5. Accessing Data with Pagination
     Table 1: Key: Bioset Id + Display Order; Columns
     Pagination example: Page 5, Page Size = 100
     • Retrieve 100 rows from Display Order = 400-500
     Number of rows: 1 per SNP, on the order of 4 million
  6. Accessing Data with Keys
     Table 1: Key: Bioset Id + Display Order
     Keys returned by search index
  7. Filtering Data with Pagination
     Table 1: Key: Bioset Id + Display Order
     Table 2: Id + GeneId + MutationClass; Column: Counts, Keys to Table
     Example: Gene: ESR1, Class: Missense, Page Size = 100
     • Retrieve rows from Table 2
     • Retrieve rows by keys from Table 1
     Number of rows: on the order of 0.5 million per dataset (# genes × classes)
  8. Powering the Genome Browser
     Table 1: Key: Bioset Id + Display Order
     Table 2: Id + GeneId + MutationClass
     Table 3: Id + ChromosomeId + Range + DisplayOrder
     Example: Chr: 6, specified range
     • Retrieve all rows
     Number of rows: 1 row per SNP, ~4 million per dataset
  9. Use Case 2: Correlation Data
  10. Use Case 2
      • Each correlation score stored as a row
      • HFile created for new score
      • Over 20 billion correlations
      T1: scorebioset (base table), key: biosetid_1 [+] biosetid_2
      [Diagram: two series labeled B1, B2, …, Bn, Bn+1]
  11. Lessons Learnt
      • HBase Works Well For
        -- Immutable Data
        -- Insertions Using HFiles
        -- Billions of Rows
        -- Intelligence in Key Definition
      • Road to Production
        -- Redundant Data in Database
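
The key designs on slides 5 and 7 can be illustrated with a minimal sketch. It assumes fixed-width, zero-padded key fields, so that lexicographic order (HBase keeps rows sorted by row key bytes) matches numeric order, and it uses a sorted Python list as a stand-in for an HBase table. The helper names, the `:` separator, the field width, and the toy data are hypothetical and chosen for illustration; they are not NextBio's actual schema.

```python
from bisect import bisect_left

WIDTH = 10  # hypothetical fixed field width; padding keeps string order == numeric order


def table1_key(bioset_id, display_order):
    """Composite row key for Table 1: bioset id + display order."""
    return f"{bioset_id:0{WIDTH}d}:{display_order:0{WIDTH}d}"


def table2_key(bioset_id, gene, mutation_class):
    """Composite row key for Table 2: id + gene id + mutation class."""
    return f"{bioset_id:0{WIDTH}d}:{gene}:{mutation_class}"


def page_range(bioset_id, page, page_size):
    """Start (inclusive) and stop (exclusive) row keys for a 1-based page."""
    first = (page - 1) * page_size
    return table1_key(bioset_id, first), table1_key(bioset_id, first + page_size)


def scan(sorted_keys, start, stop):
    """Stand-in for a range scan: all keys k with start <= k < stop."""
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]


def filtered_page(table2, bioset_id, gene, mutation_class, page, page_size):
    """Look up the Table 2 row, then return one page of its Table 1 keys."""
    row = table2.get(table2_key(bioset_id, gene, mutation_class),
                     {"count": 0, "keys": []})
    first = (page - 1) * page_size
    return row["keys"][first:first + page_size]


# Toy data: two biosets with 1,000 rows each (real datasets hold ~4 million).
table1 = sorted(table1_key(b, i) for b in (7, 8) for i in range(1000))

# Pretend every 10th row of bioset 7 is an ESR1 missense variant.
esr1_keys = [table1_key(7, i) for i in range(0, 1000, 10)]
table2 = {table2_key(7, "ESR1", "Missense"): {"count": len(esr1_keys),
                                              "keys": esr1_keys}}

# Slide 5: page 5 with page size 100 covers display orders 400 up to 500.
start, stop = page_range(7, page=5, page_size=100)
page5 = scan(table1, start, stop)

# Slide 7: read the Table 2 row first, then fetch Table 1 rows by the keys it holds.
hits = filtered_page(table2, 7, "ESR1", "Missense", page=1, page_size=100)
```

Because the page boundaries are computed directly from the key, an unfiltered page is a single bounded range scan, while a filtered page costs one Table 2 lookup followed by point reads on Table 1.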
