Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010


Published on

Brian O'Connor's talk on UNC Lineberger Cancer Center's use of HBase in research. Delivered at the December 2010 Triangle Hadoop Users Group.

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010

  1. 1. Big Bio and Fun with HBase Brian OConnor Pipeline Architect UNC Lineberger Comprehensive Cancer Center
  2. 2. Overview● The era of “Big Data” in biology ● Sequencing and scientific queries ● Computational requirements and data growth● The appeal of HBase/Hadoop● The SeqWare project and the Query Engine (QE) ● Implementation of HBase QE ● How well it works● The future ● Adding indexing and search engine ● Why HBase/Hadoop are important to modern biology
  3. 3. Biology In Transition● Biologists are used to working on one gene their entire careers● In the last 15 years biology has been transitioning to a high-throughput, data-driven discipline● Now scientist study systems: Thousands of genes Millions of SNPs Billions of bases of sequence● Biology is not physics
  4. 4. The Beginnings Capillary Sequencer● The Human Genome with Sanger Sequencing ● 1990-2000 Output ~1000 bases ● $3 billion● Gave us the blueprints for a human being, let us find most of the genes in humans, understand variation in humans
  5. 5. The Era of “Big Data” in Biology SOLiD 3 Plus Illumina GAIIx 454 GS FLX HeliScope● 2x50 run ~14 days ● 2x76 run, ~8 days ● 400bp, ~10 hours ● ~35bp run, ~8 days● >= 99.94% accuracy ● >= 98.5% accuracy ● Q20 at 400 bases ● <5% raw error rate(colorspace corrected) at 76 bases ● ~1M reads / run ● ~700M reads/run● ~1G reads / run, ● ~300M reads / ~400MB high qual ~25GB/run~60GB high quality flowcell, ~20-25GB ● Homopolymer issue ● Cost ?● Conservatively maybe high qual ● Cost ~$14K for ● So ~25GB in40GB aligned 38M/lane, 2.5GB/lane, 70x75, ~$90K/GB ~8days● ~60GB in ~14 days about 1.25GB aligned ● So ~1GB in ~1day● Cost ~$20K, ● Cost ~$10K, about$333/GB $450/GB● A Human Genome in ● ~20GB in ~8 days2 Weeks!
  6. 6. Types of Questions Biologists Ask● Want to ask simple questions: ● “What SNVs are in 5UTR of phosphatases?” ● “What frameshifts affect PTEN in lung cancer?” ● “What genes include homozygous, non- synonymous SNVs in all ovarian cancers?”● Biologists want to see data across samples● Database natural choice for this data, many examples
  7. 7. Impact of Next Gen Sequencing● If the human genome gave us the generic blue prints, next gen sequencing lets us look at individuals blue prints● Applications Sequence individuals cancer and find distinct druggable targets Sequence many people tumor look at mutation patterns subtract mutations just in tumor Have disease normal Does not have disease
  8. 8. The Big Problem● Biologist think about reagents and tumor not hard drives and CPUs● The crazy growth exceeds the growth of all IT components● Dramatically different models for scalability are required● Were soon reaching a point where a genome will cost less to sequence than it does to look at the data!
  9. 9. Increase in Sequencer Output Illumina Sequencer Ouput Log scale Sequence File Sizes Per Lane 100000000000 10000000000File Size (Bytes) Moores Law: 1000000000 CPU Power Doubles Every 2 Years 100000000 10000000 08/10/06 02/26/07 09/14/07 04/01/08 10/18/08 05/06/09 11/22/09 06/10/10 12/27/10 Date Suggests Sequencer Output Increases by 10x Every 2 Years! Kryders Law: Far outpacing hard drive, Storage Quadruples CPU, and bandwidth growth Every 2 Years
  10. 10. Lowering Costs = Bigger Projects ● Falling costs increase scope of projects ● The Cancer Genome Atlas (TCGA) ● 20 tumor types, 200 samples each in 4 years ● Around 4 Petabytes of data ● Once costs < $1000 many predict the technology will move into clinical applications ● 1.5 million new cancer patients in 2010 ● 1,500 Petabytes per year!? ● 4 PB/day vs 0.2 PB/day Facebook
  11. 11. So We Need to Rethink Scaling Up Particularly for Databases● The old ways of growing cant keep up... cant just “buy a bigger box”● We need to embrace what the Peta-scale community is currently doing● The Googles, Facebooks, Twitters, etc of the world are solving their scalability problems, biology needs to learn from their solutions● Clusters can be expanded, distributed storage used, least scalable portion is the database which biologist need to ask questions
  12. 12. Technologies That Can Help● Many open source tools designed for Petascale ● Map/Reduce for data processing ● Hadoop for distributed file systems (HDFS) and robust process control ● Pig/HIVE/etc processing unstructured/semi- structured data ● HBase/Cassandra/etc for databasing● Could go unstructured but biological data is most useful when aggregated and a database is extremely good for this
  13. 13. HBase to the Rescue● Billions of rows x millions of columns!● Focus on random access (vs. HDFS)● Table is column oriented, sparse matrix● Versioning (timestamps) built in● Flexible storage of different data types● Splits DB across many nodes transparently● Locality of data, I can run map/reduce jobs that process the table rows present on a given node ● 22M variants processed <1 minute on 5 node cluster
  14. 14. Magically Scalable Databases● Talking about distributed databases, forces a huge shift in what you can and cant do● “NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularized in early 2009.”
  15. 15. What You Give Up● SQL queries● Well defined schema, normalized data structure● Relationships manged by DB● Flexible and easy indexing of table columns● Existing tools that query a SQL database must be re-written● Certain ACID aspects● Software maturity, most distributed NOSQL projects are very new
  16. 16. What You Gain● Scalability is the clear win, you can have many processes on many nodes hit the collection of database servers● Ability to look at very large datasets and do complex computations across a cluster● More flexibility in representing information now and in the future● HBase includes data timestamps/versions● Integration with Hadoop
  17. 17. HBase Architecture
  18. 18. HBase TablesConceptual ViewPhysical Storage View
  19. 19. HBase APIs● Basic API ● Connect to an HBase server like its a single server ● Lets you iterate over the contents of the DB, with flexible filtering of the results ● Can pull back/write key/values easily● Map/Reduce source/sink ● Can use HBase tables easily from Map/Reduce ● May be easier/faster just to Map/Reduce than filter with the API ● I want to use this more in the future
  20. 20. HBase Installation● Install Hadoop first, version 0.20.x● I installed HBase version 0.20.1● Documented:● Currently have a 7 node Hadoop/HBase cluster running at UNC with about 40TB of storage, 56 CPUs, 168GB RAM● RPMs from Cloudera (0.89.20100924)
  21. 21. The SeqWare Project Genome Browser SeqWare LIMS SeqWare Query Engine High-performance, distributed Hadoop HBase Genome browser and Cluster query engine frontend database and web query engine, powers both browser and interactive queries HBase Hadoop Central DB Cluster Central portal for users that links to the tools to upload samples, coordinates SeqWare Nodes trigger analysis, and view results all analysis and results MetaDB metadata SeqWare Pipeline SGE Cluster SeqWare API* Storage Java Perl Python S SGE thrift Cluster network S Controls analysis on Developer API, savvy users the cluster, models can control all of SeqWares tools S analysis workflows for SGE programmatically RNA-Seq and other Cluster S NGS experimental designs Fully open source Import Daemons Big Data Small Data Data import tool facilitates sequence delivery to the storage via network* future project
  22. 22. The SeqWare Query Engine Project● SeqWare Query Engine is our HBase database for next gen sequence data Interactive Web Forms SeqWare Query Engine High-performance, distributed database and web query engine, powers both browser and Hadoop interactive queries Cluster HBase Hadoop Cluster HBase Genome Browser Hadoop HBase Cluster Web API Service Genome browser and query engine frontend
  23. 23. Webservice Interfaces RESTful XML Client API or HTML FormsSeqWare Query Engine includes aRESTful web service that returns XMLdescribing variant DBs, a web form forquerying, and returns BED/WIG dataavailable via a well-defined URL track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009" chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
  24. 24. SeqWare QE Architecture Concepts● Types: variants, Indel chr1 translocations, coverage, * nonsynonmous consequence, and * is_dbSNP | rs10292192 generic “features”● Each can be “tagged” SNV chr12 with arbitrary key-value pairs, allows for the * nonsynonmous encoding of a surprising * is_dbSNP | rs10292192 amount of annotation!● Fundamental query is a chr3 translocation chr12 list of objects filtered by * gene_fusion these “tags”
  25. 25. Requirements for Query Engine BackendThe backend must: – Represent many types of data – Support a rich level of annotation – Support very large variant databases (~3 billion rows x thousands of columns) – Be distributed across a cluster – Support processing, annotating, querying & comparing samples (variants, coverage, annotations) – Support a crazy growth of data
  26. 26. HBase Query Engine Backend● Stores variants, translocations, coverage info, coding consequence reports, annotations as “tags”, and generic “features”● Common interface, uses HBase as backend ● Create a single database for all genomes ● Database is auto-sharded across cluster ● Can use Map/Reduce and other projects to do sophisticated queries ● Performance seems very good!
  27. 27. HBase SeqWare Query Engine Backend Webservice clientMapReduce HBase API ETL Map Region Analysis BED/WIG Job Server Node Files XML Metadata RESTful ETL Map Region Web Nodes Web Service Job Server ETL Analysis Reduce Job HMaster Node MetaDBETL Querying &jobs extract, HBase on HDFS Loadingtransform, &/or Variant & Coverage Nodes processload in parallel Database System queries via API
  28. 28. Underlying HBase Tableshg18Table family labelkey variant:genome4 variant:genome7 coverage:genome7chr15:00000123454 byte[] byte[] byte[] Variant object byte arrayGenome1102NTagIndexTablekey rowId:is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123 byte[] queries look up by tag thenDatabase on filesystem (HDFS) filter the variant results key timestamp column:variant chr15:00000123454 t1 genome7 byte[]
  29. 29. Backend Performance Comparison Pileup Load Time 1102N HBase vs. Berkeley DB 8000000 7000000 6000000 5000000 load time bdb variants 4000000 load time hbase 3000000 2000000 1000000 0 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 time (s)
  30. 30. Backend Performance Comparison BED Export Time 1102N HBase API vs. M/R vs. BerkeleyDB 8000000 7000000 6000000 5000000 dump time bdb variants 4000000 dump time hbase dump time m/r 3000000 2000000 1000000 0 0 1000 2000 3000 4000 5000 6000 7000 time
  31. 31. Querying with Map/ReduceFind all variants seen in ovarian cancer genomes tagged as “frameshift” Node 1 Reduce Map Iterate over every For each item in bin, Chr1 variant, bin if ovarian print out the variant Variants and tagged frameshift information Node 2 ● Finding variants with tag Chr2 or comparing at same Map Reduce Variants genomic position very efficient Node 3 ● Problem of overlapping Chr3 features not starting at the Map Reduce Variants same genomic location
  32. 32. Problematic Querying Map/Reduce Indel 1 chr1:1200-1500 ● Hard to do Indel 2 overlaps chr1:1400-1700In the genome SNV 1 queries which are really SNV 2 important for biological DBsIn the DB ● Here if I did a chr1:000001200 Indel 1 byte[] Map/Reduce query only SNV1 and chr1:000001400 Indel 2 byte[] SNV2 would “overlap” chr1:000001450 SNV 1 byte[] SNV 2 byte[]
  33. 33. SeqWare Query Engine Status● Open source, you can try it now!● Both BerkeleyDB & HBase backends● Multiple genomes stored in the same table, very Map/Reduce compatible for SNVs● Basic secondary indexing for “tags”● API used for queries via Webservice● Prototype Map/Reduce examples including “somatic” mutation detection in paired normal/cancer samples
  34. 34. SeqWare Query Engine Future● Dynamically building R-tree indexes or Nested Containment Lists with Map/Reduce “Experiences on Processing Spatial Data with MapReduce” by Cary et al.● Looking at using Katta or Solr for indexing free text data such as gene descriptions, OMIM entries, etc.● Queries across samples with simple logic● More testing, pushing our 7 node cluster, finding the max number of genomes this cluster can support
  35. 35. Final Thoughts● Era of Big Data for Biology is here!● CPU bound problems no doubt but as short reads become long reads and prices per GBase drop the problems seem to shift to handling/mining data● Tools designed for Peta-scale datasets are key Yahoos Hadoop Cluster
  36. 36. For More Information●●●● Brian OConnor <>● Article (12/21/2010): SeqWare Query Engine: storing and searching sequence data in the cloud Brian D O’Connor , Barry Merriman and Stanley F Nelson BMC Bioinformatics 2010, 11(Suppl 12)● We have job openings!!