Leveraging HBase for the World's Largest
   Curated Genomic Data Collection

   Satnam Alag, Ph.D.
   VP of Engineering
   satnam@nextbio.com




© 2012 NextBio | All rights reserved | This information is proprietary and confidential.
     NEXTBIO 2008
Technology Generating Exponential Data




Genomic Big Data

[Figure: timeline with axis marks at 2000, 2003, 2006, 2009, and 2012; visible labels: Tumorscape, Internal Data]




Use Case 1: HBase to Store Variant Data
  • Each genome has ~4 million variants
  • Immutable – write once, never change, read many times
  • Bloom filters are useful
  • Batch import of data – HFile
  • Data to be accessed is collocated in a region
  • Separate HBase cluster from Hadoop
  • All the smarts are in the keys for the various tables

In HBase:
1 genome → 10 million rows
100 genomes → 1 billion rows
100K genomes → 1 trillion rows
100M genomes → 1 quadrillion rows (1,000,000,000,000,000)
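
The scaling figures above follow from a constant of 10 million rows per genome. A minimal sketch of that arithmetic, assuming the per-genome row count stays fixed as the collection grows (the function name is illustrative only):

```python
ROWS_PER_GENOME = 10_000_000  # from the slide: 1 genome -> 10 million rows


def total_rows(genomes: int) -> int:
    """Rows stored in HBase for a given number of genomes."""
    return genomes * ROWS_PER_GENOME


for genomes in (1, 100, 100_000, 100_000_000):
    print(f"{genomes:>11,} genomes -> {total_rows(genomes):,} rows")
```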

Fortunately, HBase cluster access can be partitioned by the application when required
Accessing Data with Pagination


Table 1:
Key: Bioset Id + Display Order

[Figure: Table 1 rows and their columns]

Pagination Example:
Page 5, Page Size = 100
Retrieve 100 rows starting at Display Order = 400 (400 up to, but not including, 500)

Number of rows = 1 per SNP, on the order of 4 million
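
The pagination example can be sketched as a range scan over sorted row keys. This is a hedged illustration, not NextBio's implementation: the fixed-width key encoding, the 1-based page numbering, and the names `row_key`, `page_range`, and `scan` are all assumptions, and a sorted Python list stands in for the HBase table.

```python
from bisect import bisect_left

WIDTH = 10  # assumed zero-padding width so byte order matches numeric order


def row_key(bioset_id: int, display_order: int) -> bytes:
    """Hypothetical encoding of 'Bioset Id + Display Order'."""
    return f"{bioset_id:0{WIDTH}d}{display_order:0{WIDTH}d}".encode()


def page_range(bioset_id: int, page: int, page_size: int) -> tuple[bytes, bytes]:
    """Return (start_row, stop_row) for a 1-based page; stop_row is exclusive."""
    first = (page - 1) * page_size
    return row_key(bioset_id, first), row_key(bioset_id, first + page_size)


def scan(sorted_keys: list[bytes], start_row: bytes, stop_row: bytes) -> list[bytes]:
    """Stand-in for a range scan over a table sorted by row key."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# Toy table: one bioset with 1000 rows, display orders 0..999.
table1_keys = sorted(row_key(7, order) for order in range(1000))
start_row, stop_row = page_range(7, page=5, page_size=100)
page_rows = scan(table1_keys, start_row, stop_row)
print(len(page_rows), page_rows[0], page_rows[-1])
```

Because the display order is part of the key, a page maps to one contiguous key range, so no rows outside the page need to be read.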


Accessing Data with Keys


Table 1:
Key: Bioset Id + Display Order


Keys returned by search index




Filtering Data with Pagination
Table 1:
Key: Bioset Id + Display Order

Table 2:
Key: Id+GeneId+MutationClass
Columns: Counts, Keys to Table 1

Example:
Gene: ESR1
Class: Missense
Page Size = 100

Retrieve rows from Table 2
Retrieve rows by keys from Table 1

Number of rows: on the order of 0.5 million per dataset (# genes x classes)
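
The two-step lookup can be sketched as follows. This is a rough illustration under stated assumptions: plain dictionaries stand in for the two HBase tables, the column names `count` and `table1_keys` are hypothetical, and the sample data is made up for the example.

```python
# Table 1: (bioset id, display order) -> row data. Toy data only.
table1 = {("bioset42", order): {"snp": f"rs{order}"} for order in range(1000)}

# Table 2: (id, gene id, mutation class) -> count plus keys into Table 1.
table2 = {
    ("bioset42", "ESR1", "Missense"): {
        "count": 250,
        "table1_keys": [("bioset42", order) for order in range(0, 1000, 4)],
    }
}


def filtered_page(bioset_id, gene, mutation_class, page, page_size):
    """Return (total matches, rows for the requested 1-based page)."""
    index_row = table2.get((bioset_id, gene, mutation_class))
    if index_row is None:
        return 0, []
    # Step 1: slice the stored Table 1 keys for this page.
    first = (page - 1) * page_size
    page_keys = index_row["table1_keys"][first:first + page_size]
    # Step 2: fetch the matching rows from Table 1 by key.
    return index_row["count"], [table1[key] for key in page_keys]


total, rows = filtered_page("bioset42", "ESR1", "Missense", page=1, page_size=100)
print(total, len(rows))
```

Storing the count alongside the keys lets the application show the total number of matches without touching Table 1.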
Powering the Genome Browser
Table 1:
Key: Bioset Id + Display Order

Table 2:
Key: Id+GeneId+MutationClass

Table 3:
Key: Id+ChromosomeId+Range+DisplayOrder

Example:
Chr: 6
Specified Range
Retrieve all rows

1 row per SNP, ~4 million per dataset
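
One way a key of the form Id+ChromosomeId+Range+DisplayOrder could serve a genome browser is sketched below. The slides do not define "Range"; this sketch assumes it is a fixed-size bin of the genomic position, and `BIN_SIZE`, the key layout, and the function names are hypothetical. A sorted list stands in for Table 3.

```python
from bisect import bisect_left

BIN_SIZE = 1_000_000  # hypothetical bin width in base pairs


def table3_key(bioset_id: int, chromosome: int, position: int, display_order: int) -> str:
    """Assumed fixed-width layout: id | chromosome | position bin | display order."""
    bin_index = position // BIN_SIZE
    return f"{bioset_id:08d}|{chromosome:02d}|{bin_index:06d}|{display_order:010d}"


def range_bounds(bioset_id: int, chromosome: int, start_pos: int, end_pos: int) -> tuple[str, str]:
    """Scan bounds covering every bin that overlaps [start_pos, end_pos]."""
    first_bin = start_pos // BIN_SIZE
    last_bin = end_pos // BIN_SIZE
    start_row = f"{bioset_id:08d}|{chromosome:02d}|{first_bin:06d}|"
    stop_row = f"{bioset_id:08d}|{chromosome:02d}|{last_bin + 1:06d}|"  # exclusive
    return start_row, stop_row


# Toy data: four SNPs on chromosome 6 and one on chromosome 7.
snps = [(6, 500_000), (6, 2_100_000), (6, 3_900_000), (6, 4_000_000), (7, 2_500_000)]
table3 = sorted(table3_key(1, chrom, pos, order) for order, (chrom, pos) in enumerate(snps))

start_row, stop_row = range_bounds(1, 6, 2_000_000, 3_999_999)
hits = table3[bisect_left(table3, start_row):bisect_left(table3, stop_row)]
print(hits)
```

With binned keys the scan returns whole bins, so a caller would still filter rows to the exact positions requested.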


Use Case 2: Correlation Data




Use Case 2
     • Each Correlation score stored as a row
     • HFile created for new score
     • Over 20 billion correlations
     T1: scorebioset (base table)
         key: biosetid_1 [+] biosetid_2


     [Figure: correlation matrix with biosets B1, B2, …, Bn, Bn+1 as both rows and columns]
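
A composite key of biosetid_1 followed by biosetid_2 keeps all scores for one bioset adjacent, so they can be read with a single prefix scan. The sketch below assumes a fixed-width concatenation and assumes a row exists for each ordered pair; neither detail is stated on the slide, and the function names are illustrative.

```python
from bisect import bisect_left

WIDTH = 10  # assumed zero-padding width for each bioset id


def score_key(biosetid_1: int, biosetid_2: int) -> str:
    """Hypothetical encoding of 'biosetid_1 [+] biosetid_2'."""
    return f"{biosetid_1:0{WIDTH}d}{biosetid_2:0{WIDTH}d}"


def scores_for(sorted_keys: list[str], biosetid_1: int) -> list[str]:
    """Prefix scan: every key whose first component is biosetid_1."""
    start_row = f"{biosetid_1:0{WIDTH}d}"
    stop_row = f"{biosetid_1 + 1:0{WIDTH}d}"  # exclusive
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# Toy table: one row per ordered pair among three biosets.
scorebioset = sorted(
    score_key(a, b) for a in range(1, 4) for b in range(1, 4) if a != b
)
print(scores_for(scorebioset, 2))
```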

Lessons Learnt
  • HBase Works Well For
           -- Immutable Data
           -- Insertions Using HFiles
           -- Billions of Rows
           -- Intelligence in Key Definition

  • Road to Production
           -- Redundant Data in Database





HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection
