HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

NextBio relies on HBase to store the world’s largest collection of continuously curated genomic knowledge. The HBase cluster is leveraged to store billions of correlations as well as processed genomic information. In this talk, we will describe how we use HBase, why we migrated from a large MySQL deployment to HBase, and the challenges along the way.

Published in: Technology
Transcript of "HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection"

  1. Leveraging HBase for the World’s Largest Curated Genomic Data Collection
     Satnam Alag, Ph.D., VP of Engineering, satnam@nextbio.com
     © 2012 NextBio | All rights reserved | This information is proprietary and confidential.
     NEXTBIO 2008
  2. Technology Generating Exponential Data
  3. Genomic Big Data
     [Chart: time axis with ticks at 2000, 2003, 2006, 2009, 2012; labels: Tumorscape+, Internal Data]
  4. Use Case 1: HBase to Store Variant Data
     • Each genome has ~4 million variants
     • Immutable: write once, never change, read many times
     • Bloom filters are useful
     • Batch import of data: HFile
     • Data to be accessed is collocated in a region
     • Separate HBase cluster from Hadoop
     • All the smarts are in the keys for the various tables
     In HBase:
     • 1 genome → 10 million rows
     • 100 genomes → 1 billion rows
     • 100K genomes → 1 trillion rows
     • 100M genomes → 1 quadrillion (1,000,000,000,000,000) rows
     Fortunately, HBase cluster access can be partitioned by the application when required.
  5. Accessing Data with Pagination
     Table 1: Key: Bioset Id + Display Order; Columns
     Pagination example: Page 5, Page Size = 100. Retrieve 100 rows from Display Order = 400-500.
     Number of rows: 1 per SNP, on the order of 4 million
  6. Accessing Data with Keys
     Table 1: Key: Bioset Id + Display Order
     Keys returned by search index
  7. Filtering Data with Pagination
     Table 1: Key: Bioset Id + Display Order
     Table 2: Key: Id + GeneId + MutationClass; Column: Counts, Keys to Table
     Example: Gene: ESR1, Class: Missense, Page Size = 100. Retrieve rows from Table 2, then retrieve rows by keys from Table 1.
     Number of rows: on the order of 0.5 million per dataset (# genes x classes)
  8. Powering the Genome Browser
     Table 1: Key: Bioset Id + Display Order
     Table 2: Key: Id + GeneId + MutationClass
     Table 3: Key: Id + ChromosomeId + Range + DisplayOrder
     Example: Chr: 6, specified range. Retrieve all rows.
     Number of rows: 1 per SNP, ~4 million per dataset
  9. Use Case 2: Correlation Data
  10. Use Case 2
     • Each correlation score stored as a row
     • HFile created for new score
     • Over 20 billion correlations
     T1: scorebioset (base table); key: biosetid_1 + biosetid_2
     [Diagram: grid of biosets B1, B2, …, Bn, Bn+1 along both axes]
  11. Lessons Learnt
     • HBase works well for
       -- Immutable data
       -- Insertions using HFiles
       -- Billions of rows
       -- Intelligence in key definition
     • Road to production
       -- Redundant data in database
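
The pagination lookup on slide 5 can be sketched as a range computation over a composite row key. This is a minimal sketch, not NextBio's implementation: the fixed-width zero-padded key encoding, the `row_key`, `page_range`, and `scan` helpers, and the in-memory sorted list standing in for an HBase table are all assumptions made for illustration. It relies on one HBase property only: rows are stored sorted by key, and a scan takes an inclusive start row and an exclusive stop row. Display order is treated as zero-based, so page 5 with a page size of 100 covers display orders 400 up to, but not including, 500.

```python
import bisect
from typing import List, Tuple

ID_WIDTH = 10      # assumed fixed width for the bioset id
ORDER_WIDTH = 10   # assumed fixed width for the display order


def row_key(bioset_id: int, display_order: int) -> str:
    """Hypothetical composite key; zero-padding makes string order match numeric order."""
    return f"{bioset_id:0{ID_WIDTH}d}{display_order:0{ORDER_WIDTH}d}"


def page_range(bioset_id: int, page: int, page_size: int) -> Tuple[str, str]:
    """Return (start_row, stop_row) for a 1-based page; stop_row is exclusive."""
    first = (page - 1) * page_size
    return row_key(bioset_id, first), row_key(bioset_id, first + page_size)


def scan(sorted_keys: List[str], start: str, stop: str) -> List[str]:
    """Stand-in for a sorted-table scan: keys k with start <= k < stop."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Toy "Table 1": two biosets with 1,000 rows each.
table = sorted(row_key(b, d) for b in (7, 42) for d in range(1000))

start, stop = page_range(42, page=5, page_size=100)
rows = scan(table, start, stop)
```

Because the bioset id leads the key, every page of one bioset is a contiguous slice of the table, which is the sense in which "all the smarts are in the keys".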
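
The two-step filtered lookup on slide 7 (read an index row from Table 2, then fetch rows by key from Table 1) can be sketched as follows. This is an illustration under stated assumptions rather than the presented system: plain dictionaries stand in for the two HBase tables, the `|` separator and padding in `index_key` are invented, and the index row is assumed to hold a count plus an ordered list of Table 1 keys, which is one reading of "Column: Counts, Keys to Table".

```python
from typing import Dict, List, NamedTuple


class IndexRow(NamedTuple):
    """Assumed shape of a Table 2 row: a match count and the matching Table 1 keys."""
    count: int
    table1_keys: List[str]


def table1_key(bioset_id: int, display_order: int) -> str:
    """Hypothetical Table 1 key: Bioset Id + Display Order, zero-padded."""
    return f"{bioset_id:010d}{display_order:010d}"


def index_key(bioset_id: int, gene_id: str, mutation_class: str) -> str:
    """Hypothetical Table 2 key: Id + GeneId + MutationClass."""
    return f"{bioset_id:010d}|{gene_id}|{mutation_class}"


def filtered_page(table1: Dict[str, str], table2: Dict[str, IndexRow],
                  bioset_id: int, gene_id: str, mutation_class: str,
                  page: int, page_size: int) -> List[str]:
    """Step 1: read the index row. Step 2: fetch one page of Table 1 rows by key."""
    entry = table2.get(index_key(bioset_id, gene_id, mutation_class))
    if entry is None:
        return []
    first = (page - 1) * page_size
    wanted = entry.table1_keys[first:first + page_size]
    return [table1[k] for k in wanted]


# Toy data: bioset 42 has ten variants; three are ESR1 missense mutations.
table1 = {table1_key(42, d): f"variant-{d}" for d in range(10)}
esr1_keys = [table1_key(42, d) for d in (2, 5, 7)]
table2 = {index_key(42, "ESR1", "Missense"): IndexRow(len(esr1_keys), esr1_keys)}

page1 = filtered_page(table1, table2, 42, "ESR1", "Missense", page=1, page_size=2)
page2 = filtered_page(table1, table2, 42, "ESR1", "Missense", page=2, page_size=2)
```

Storing the count next to the keys lets a caller show the total number of matches without reading any Table 1 rows.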
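
The genome browser query on slide 8 (all rows for one chromosome within a specified range) maps naturally onto a single range scan over the Table 3 key. The sketch below is hedged heavily: the slides do not say what the "Range" key component contains, so it is assumed here to be a zero-padded start position, the chromosome id is assumed to be numeric, and a sorted in-memory list again stands in for the HBase table. `browser_key` and `region_scan` are hypothetical helper names.

```python
import bisect
from typing import List


def browser_key(bioset_id: int, chrom_id: int, position: int, display_order: int) -> str:
    """Hypothetical Table 3 key: Id + ChromosomeId + Range + DisplayOrder.

    Assumes "Range" is a start position and every component is zero-padded.
    """
    return f"{bioset_id:010d}{chrom_id:02d}{position:010d}{display_order:010d}"


def region_scan(sorted_keys: List[str], bioset_id: int, chrom_id: int,
                start_pos: int, end_pos: int) -> List[str]:
    """Return keys on one chromosome with start_pos <= position < end_pos."""
    prefix = f"{bioset_id:010d}{chrom_id:02d}"
    start = prefix + f"{start_pos:010d}"   # inclusive start row (a key prefix)
    stop = prefix + f"{end_pos:010d}"      # exclusive stop row (a key prefix)
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Toy "Table 3": bioset 1, chromosomes 5-7, one SNP every 100 positions.
table3 = sorted(
    browser_key(1, chrom, pos, order)
    for chrom in (5, 6, 7)
    for order, pos in enumerate(range(0, 1000, 100))
)

hits = region_scan(table3, bioset_id=1, chrom_id=6, start_pos=200, end_pos=500)
positions = [int(k[12:22]) for k in hits]   # position occupies characters 12-21 of the key
```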
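
For the correlation table on slide 10, a key of biosetid_1 + biosetid_2 means that all scores for one bioset sit next to each other and can be read with a prefix scan. The following is a minimal sketch under assumptions: the fixed-width encoding, the `pair_key` and `correlations_for` helpers, and the toy scores are all invented for illustration, and nothing is assumed about whether the real table stores both (B1, B2) and (B2, B1).

```python
import bisect
from typing import Dict, List, Tuple

WIDTH = 10  # assumed fixed width for each bioset id


def pair_key(bioset_1: int, bioset_2: int) -> str:
    """Hypothetical row key: biosetid_1 + biosetid_2, zero-padded."""
    return f"{bioset_1:0{WIDTH}d}{bioset_2:0{WIDTH}d}"


def correlations_for(keys: List[str], scores: Dict[str, float],
                     bioset_id: int) -> List[Tuple[int, float]]:
    """Prefix scan: every (other bioset, score) whose key starts with bioset_id."""
    start = f"{bioset_id:0{WIDTH}d}"       # inclusive start row
    stop = f"{bioset_id + 1:0{WIDTH}d}"    # exclusive stop row: the next id's prefix
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return [(int(k[WIDTH:]), scores[k]) for k in keys[lo:hi]]


# Toy data: one row per correlation score.
scores = {
    pair_key(1, 2): 0.9,
    pair_key(1, 3): -0.4,
    pair_key(2, 3): 0.1,
}
keys = sorted(scores)

result = correlations_for(keys, scores, 1)
```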