HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection


NextBio relies on HBase to store the world’s largest collection of continuously curated genomic knowledge. The HBase cluster is leveraged to store billions of correlations as well as processed genomic information. In this talk, we will describe how we use HBase, why we migrated from a large MySQL deployment to HBase, and the challenges along the way.


  1. Leveraging HBase for the World's Largest Curated Genomic Data Collection
     Satnam Alag, Ph.D., VP of Engineering, satnam@nextbio.com
     © 2012 NextBio | All rights reserved | This information is proprietary and confidential. NEXTBIO 2008
  2. Technology Generating Exponential Data
  3. Genomic Big Data
     [Chart: time axis 2000, 2003, 2006, 2009, 2012; labels: Tumorscape+, Internal Data]
  4. Use Case 1: HBase to Store Variant Data
     • Each genome has ~4 million variants
     • Immutable – write once, never change, read many times
     • Bloom filters are useful
     • Batch import of data – HFile
     • Data to be accessed is collocated in a region
     • Separate HBase cluster from Hadoop
     • All the smarts are in the keys
     For the various tables in HBase:
     • 1 genome → 10 million rows
     • 100 genomes → 1 billion rows
     • 100K genomes → 1 trillion rows
     • 100M genomes → 1 quadrillion (1,000,000,000,000,000) rows
     Fortunately, HBase cluster access can be partitioned by the application when required
  5. Accessing Data with Pagination
     Table 1: Key: Bioset Id + Display Order; Columns
     Pagination example: Page 5, Page Size = 100
     • Retrieve 100 rows from Display Order = 400-500
     Number of rows: 1 per SNP, on the order of 4 million
  6. Accessing Data with Keys
     Table 1: Key: Bioset Id + Display Order
     • Keys returned by search index
  7. Filtering Data with Pagination
     Table 1: Key: Bioset Id + Display Order
     Table 2: Key: Id+GeneId+MutationClass; Column: Counts, Keys to Table
     Example: Gene: ESR1, Class: Missense, Page Size = 100
     • Retrieve rows from Table 2
     • Retrieve rows by keys from Table 1
     Number of rows: on the order of 0.5 million per dataset (# genes x classes)
  8. Powering the Genome Browser
     Table 1: Key: Bioset Id + Display Order
     Table 2: Id+GeneId+MutationClass
     Table 3: Id+ChromosomeId+Range+DisplayOrder
     Example: Chr: 6, specified range
     • Retrieve all rows
     1 row per SNP, ~4 million per dataset
  9. Use Case 2: Correlation Data
  10. Use Case 2
     • Each correlation score stored as a row
     • HFile created for new score
     • Over 20 billion correlations
     T1: scorebioset (base table), key: biosetid_1 [+] biosetid_2
     [Diagram: grid of biosets B1, B2, …, Bn, Bn+1 against B1, B2, …, Bn, Bn+1]
  11. Lessons Learnt
     • HBase Works Well For
       -- Immutable Data
       -- Insertions Using HFiles
       -- Billions of Rows
       -- Intelligence in Key Definition
     • Road to Production
       -- Redundant Data in Database
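
The paginated access described on slides 5 and 6 (row key = Bioset Id + Display Order, page 5 of size 100 mapping to display orders 400-500) can be sketched roughly as below. This is only an illustration, not NextBio's implementation: `SortedTable` is a hypothetical in-memory stand-in for an HBase table, and the zero-padded encoding in `row_key` is an assumption, chosen so that byte order of the keys matches numeric display order.

```python
import bisect


def row_key(bioset_id: int, display_order: int) -> bytes:
    # Assumed encoding: fixed-width, zero-padded fields so that
    # lexicographic byte order equals numeric order.
    return f"{bioset_id:010d}:{display_order:010d}".encode()


class SortedTable:
    """Hypothetical in-memory stand-in for a table whose rows are sorted by key."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> column values

    def put(self, key: bytes, value: dict) -> None:
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start: bytes, stop: bytes):
        # Half-open range [start, stop): start row included, stop row excluded.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


def fetch_page(table: SortedTable, bioset_id: int, page: int, page_size: int):
    # Page N (1-based) covers display orders (N-1)*size .. N*size - 1.
    first = (page - 1) * page_size
    return table.scan(row_key(bioset_id, first),
                      row_key(bioset_id, first + page_size))


# Write-once load: one row per SNP, in display order.
table = SortedTable()
for order in range(1000):
    table.put(row_key(42, order), {"snp": f"rs{order}"})
table.put(row_key(43, 0), {"snp": "other-bioset"})

page5 = fetch_page(table, bioset_id=42, page=5, page_size=100)
print(len(page5), page5[0][0], page5[-1][0])
```

Because every row of one bioset shares a key prefix and rows are sorted by key, a page is a single contiguous range read, which fits the slide 4 remark that all the smarts are in the keys.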

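
The two-step filtering flow on slide 7 (look up a Table 2 row by Id + GeneId + MutationClass, then fetch the matching rows from Table 1 by key) might look roughly like this. The sketch makes several assumptions not stated in the slides: plain dictionaries stand in for the two tables, the key encodings in `t1_key` and `t2_key` are hypothetical, and the index row is assumed to hold a count plus an ordered list of Table 1 keys.

```python
# Plain dicts stand in for the two HBase tables (assumption for illustration).
table1 = {}  # key: bioset id + display order -> variant columns
table2 = {}  # key: bioset id + gene + mutation class -> count and Table 1 keys


def t1_key(bioset_id: int, display_order: int) -> str:
    return f"{bioset_id:010d}:{display_order:010d}"


def t2_key(bioset_id: int, gene: str, mutation_class: str) -> str:
    return f"{bioset_id:010d}:{gene}:{mutation_class}"


def load_variants(bioset_id: int, variants) -> None:
    """Write-once load; `variants` yields (gene, mutation_class) in display order."""
    for order, (gene, mutation_class) in enumerate(variants):
        key = t1_key(bioset_id, order)
        table1[key] = {"gene": gene, "class": mutation_class}
        # Redundant index row: a count plus the Table 1 keys for this filter.
        index_row = table2.setdefault(
            t2_key(bioset_id, gene, mutation_class), {"count": 0, "keys": []}
        )
        index_row["count"] += 1
        index_row["keys"].append(key)


def filtered_page(bioset_id, gene, mutation_class, page, page_size):
    """Return (total matches, rows on the requested 1-based page)."""
    index_row = table2.get(t2_key(bioset_id, gene, mutation_class))
    if index_row is None:
        return 0, []
    first = (page - 1) * page_size
    keys = index_row["keys"][first:first + page_size]
    return index_row["count"], [table1[k] for k in keys]


load_variants(7, [
    ("ESR1", "missense"),
    ("TP53", "nonsense"),
    ("ESR1", "missense"),
    ("ESR1", "silent"),
    ("ESR1", "missense"),
])

total, rows = filtered_page(7, "ESR1", "missense", page=1, page_size=2)
print(total, rows)
```

Since the variant data is written once and never changed, the index rows in this sketch never need updating after the load, which is what makes the duplicated keys cheap to maintain.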