O'Connor BOSC 2010


Transcript

  • 1. SeqWare Query Engine: Storing & Searching Sequence Data in the Cloud. Brian O'Connor, UNC Lineberger Comprehensive Cancer Center. BOSC, July 9th, 2010
  • 2. SeqWare Query Engine ● Want to ask simple questions: ● “What SNVs are in 5'UTR of phosphatases?” ● “What frameshift indels affect PTEN?” ● “What genes include homozygous, non-synonymous SNVs?” ● SeqWare Query Engine created to query data ● RESTful Webservice ● Scalable/Queryable Backend
  • 3. Variant Annotation with SeqWare (diagram): the SeqWare Pipeline takes whole-genome/exome data through alignment (BAM), variant calling (pileup), and variant, coverage, and consequence annotation against dbSNP; results are loaded into the SeqWare Query Engine backend (HBase or Berkeley DB stores behind a backend interface) and served by the SeqWare Query Engine webservice (RESTlet) as WIG and BED output.
  • 4. Webservice Interfaces (RESTful XML client API or HTML forms). The SeqWare Query Engine includes a RESTful web service that returns XML describing the variant DBs, offers a web form for querying, and serves BED/WIG data via well-defined URLs, e.g.:
    http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
    track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
    chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
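Since each query is just an HTTP GET on a well-defined URL, the BED output can also be pulled programmatically. Below is a minimal Java sketch of such a client, assuming the endpoint and filter.tag parameter shown on the slide above; the host name and database id are placeholders, not a live service.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class QueryEngineBedClient {
        public static void main(String[] args) throws Exception {
            // Endpoint pattern from the slide; host and database id are placeholders.
            String url = "http://server.ucla.edu/seqware/queryengine/realtime/"
                       + "variants/mismatches/1?format=bed&filter.tag=PTEN";
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Each line is standard BED, e.g.
                    // chr10 89675294 89675295 G->T(24:23:95.8%[...])
                    System.out.println(line);
                }
            }
        }
    }
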
  • 5. Loading in Genome Browsers SeqWare Query Engine URLs can be directly loaded into IGV & UCSC genome browsers
  • 6. Requirements for Query Engine Backend The backend must: – Represent many types of data – Support a rich level of annotation – Support very large variant databases (~3 billion rows x thousands of columns) – Be distributed across a cluster – Support processing, annotating, querying & comparing samples (variants, coverage, annotations) – Support a crazy growth of data
  • 7. Increase in Sequencer Output. Chart: Nelson Lab (UCLA) Illumina sequencer output, sequence file sizes per lane in bytes (log scale) vs. date, 2006-2010. Suggests sequencer output increases by 5-10x every 2 years, far outpacing hard drive, CPU, and bandwidth growth.
  • 8. HBase to the Rescue? ● Billions of rows x millions of columns! ● Focus on random access (vs. HDFS) ● Table is a column-oriented, sparse matrix ● Versioning (timestamps) built in ● Flexible storage of different data types ● Splits the DB across many nodes transparently ● Data locality: map/reduce jobs can process the table rows present on a given node ● 22M variants processed in <1 minute on a 5-node cluster
  • 9. Underlying HBase Tables. hg18Table: keyed by genomic position (e.g. chr15:00000123454), with family:label columns such as variant:genome4, variant:genome7, and coverage:genome7, each cell holding a Variant (or coverage) object serialized to a byte array; the database lives on the filesystem (HDFS) and each cell is versioned by timestamp (e.g. key chr15:00000123454, timestamp t1, column variant:genome7, value byte[]). Genome1102NTagIndexTable: a secondary index keyed by tag plus row id, e.g. is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123; queries look up by tag and then filter the variant results.
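A minimal sketch of writing and reading one cell of this layout with the classic HBase client API (HTable/Put/Get). The table name, row key, and variant:genome4 column follow the slide; the variantToBytes helper is a hypothetical stand-in for SeqWare's actual Variant serialization.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VariantRowSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "hg18Table");

            // Row key is a zero-padded genomic position, as on the slide.
            byte[] rowKey = Bytes.toBytes("chr15:00000123454");

            // Store a serialized Variant object under family "variant",
            // qualifier "genome4" (one column per genome).
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("variant"), Bytes.toBytes("genome4"),
                    variantToBytes(/* variant object */));
            table.put(put);

            // Read the same cell back.
            Get get = new Get(rowKey);
            get.addColumn(Bytes.toBytes("variant"), Bytes.toBytes("genome4"));
            Result result = table.get(get);
            byte[] serializedVariant =
                    result.getValue(Bytes.toBytes("variant"), Bytes.toBytes("genome4"));
            table.close();
        }

        // Hypothetical placeholder for SeqWare's Variant -> byte[] serialization.
        private static byte[] variantToBytes() {
            return new byte[0];
        }
    }
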
  • 10. HBase API & Map/Reduce Querying ● HBase API ● Powers the current backend & webservice ● Provides a familiar API: scanners, iterators, etc. ● Backend written against it; variants retrieved by tags ● Distributed database, but single-threaded when using the API ● Prototype somatic-mutation detection via Map/Reduce (see the sketch below) ● Every row is examined; variants in the tumor but not in the normal are retrieved ● Map/Reduce jobs run on the node holding the local data ● Highly parallel & faster than the single-threaded API
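A hedged sketch of the somatic-mutation Map/Reduce idea described above (emit positions where a variant is present in the tumor column but absent in the matched normal), written as a Hadoop TableMapper. The genomeTumor/genomeNormal qualifiers and the text output are illustrative assumptions, not the project's actual job.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    // Map side of a somatic-mutation scan: every row (genomic position) is examined
    // and positions with a tumor variant but no matched-normal variant are emitted.
    public class SomaticMutationMapper
            extends TableMapper<ImmutableBytesWritable, Text> {

        private static final byte[] VARIANT = Bytes.toBytes("variant");
        private static final byte[] TUMOR = Bytes.toBytes("genomeTumor");   // assumed qualifier
        private static final byte[] NORMAL = Bytes.toBytes("genomeNormal"); // assumed qualifier

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
                throws IOException, InterruptedException {
            byte[] tumorVariant = columns.getValue(VARIANT, TUMOR);
            byte[] normalVariant = columns.getValue(VARIANT, NORMAL);
            if (tumorVariant != null && normalVariant == null) {
                // Candidate somatic mutation: present in tumor, absent in normal.
                context.write(rowKey, new Text(Bytes.toString(tumorVariant)));
            }
        }
    }
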
  • 11. SeqWare Query Engine on HBase (architecture diagram): backend, webservice, and clients. ETL Map jobs on data nodes extract, transform, and load variant/coverage data in parallel into the HBase-on-HDFS Variant & Coverage Database System (data nodes plus a name node, with ETL Reduce jobs); querying and loading nodes process variant/coverage queries via the HBase API and/or MapReduce; the RESTful web service combines query results with metadata from the MetaDB and returns BED/WIG files and XML metadata to clients.
  • 12. Status of HBase Backend ● Both BerkeleyDB & HBase, Relational soon ● Multiple genomes stored in the same table, very Map/Reduce compatible ● Basic secondary indexing for “tags” ● API used for queries via Webservice ● Prototype Map/Reduce example for “somatic” mutation detection in paired normal/cancer samples ● Currently loading 1102 normal/tumor (GBM)
  • 13. Backend Performance Comparison. Chart: pileup load time for sample 1102N, HBase vs. Berkeley DB (number of variants loaded vs. load time in seconds).
  • 14. Backend Performance Comparison. Chart: BED export time for sample 1102N, HBase API vs. Map/Reduce vs. BerkeleyDB (number of variants dumped vs. time in seconds).
  • 15. HBase/Hadoop Have Potential! ● The era of Big Data for biology is here! ● CPU-bound problems remain, but as short reads become long reads and the price per gigabase drops, the problems shift to handling and mining data ● Tools designed for peta-scale datasets are key
  • 16. Next Steps ● Model other datatypes: copy number, RNAseq gene/exon/splice junction counts, isoforms etc ● Focus on porting analysis/querying to Map/Reduce ● Indexing beyond “tags” with Katta (distributed Lucene) ● Push scalability, what are the limits of an 8 node HBase/Hadoop cluster? ● Look at Cascading, Pig, Hive, etc as advanced workflow and data mining tools ● Standards for Webservice dialect (DAS?) ● Exposing Query Engine through GALAXY
  • 17. Acknowledgments. UCLA: Jordan Mendler, Michael Clark, Hane Lee, Bret Harry, Stanley Nelson. UNC: Sara Grimm, Matt Soloway, Jianying Li, Feri Zsuppan, Neil Hayes, Chuck Perou, Derek Chiang.
  • 18. Resources ● HBase & Hadoop: http://hadoop.apache.org ● When to use HBase: http://blog.rapleaf.com/dev/?p=26 ● NOSQL presentations: http://blog.oskarsson.nu/2009/06/nosql-debrief.html ● Other DBs: CouchDB, Hypertable, Cassandra, Project Voldemort, and more... ● Data mining tools: Pig and Hive ● SeqWare: http://seqware.sourceforge.net ● briandoconnor@gmail.com
  • 19. Extra Slides
  • 20. Overview ● SeqWare Query Engine background ● New tools for combating the data deluge ● HBase/Hadoop in SeqWare Query Engine ● HBase for backend ● Map/Reduce & HBase API for webservice ● Better performance and scalability? ● Next steps
  • 21. SeqWare Query Engine, BerkeleyDB backend (architecture diagram): backend, webservice, and clients. Per-genome BerkeleyDB Variant & Coverage Databases sit on a Lustre filesystem; web/analysis nodes process variant/coverage queries against the genome databases; the RESTful web service combines results with metadata from the MetaDB and returns BED/WIG files and XML metadata to clients.
  • 22. More ● Details on API vs. M/R ● Details on XML Restful API & web app including loading in UCSC browser ● Details on generic store object (BerkeleyDB, HBase, and Relational at Renci) ● Byte serialization from BerkeleyDB, custom secondary key creation
  • 23. Pressures of Sequencing ● A lot of data (50GB SRF file, 150GB alignment files, 60GB variants for a 20x human genome) ● PostgreSQL (2xquad core, 64GB RAM) died with the Celsius schema (microarray database) after loading ~1 billion rows ● Needs to be processed, annotated, and queryable & comparable (variants, coverage, annotations) ● ~3 billion rows x thousands of columns ● COMBINE WITH PREVIOUS SLIDE
  • 24. Thoughts on BerkeleyDB ● BerkeleyDB let me: ● Create a database per genome, independent from a single database daemon ● Provision database to cluster ● Adapt to key-value database semantics ● Limitations: ● Creation on single node only ● Not inherently distributed ● Performance issues with big DBs, high I/O wait ● Google to the rescue?
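A minimal sketch of the per-genome key-value layout using Berkeley DB Java Edition: one self-contained environment per genome, keyed by genomic position, with no central database daemon so the files can be copied to any cluster node. The directory path, database name, and key format are assumptions for illustration.

    import java.io.File;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;

    public class PerGenomeStoreSketch {
        public static void main(String[] args) throws Exception {
            // One environment/database per genome; no shared database server.
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(new File("/data/genome4_db"), envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database variants = env.openDatabase(null, "variants", dbConfig);

            // Key-value semantics: genomic position -> serialized variant record.
            DatabaseEntry key = new DatabaseEntry("chr15:00000123454".getBytes("UTF-8"));
            DatabaseEntry value = new DatabaseEntry(new byte[0]); // serialized Variant placeholder
            variants.put(null, key, value);

            variants.close();
            env.close();
        }
    }
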
  • 25. HBase Backend ● How the table(s) are actually structured ● Variants ● Coverage ● Etc ● How I do indexing currently (similar to indexing feature extension) ● Multiple secondary indexes
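A hedged sketch of how the tag-based secondary index described on the slides might be queried: prefix-scan the tag index table for keys starting with the tag; the remainder of each index key identifies the variant row to fetch and filter. The table name and example tag come from the slides; the scan and filter details are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TagIndexLookupSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable tagIndex = new HTable(conf, "Genome1102NTagIndexTable");

            // Index row keys start with the tag (e.g. "is_dbSNP|..."), so a
            // prefix scan returns every variant annotated with that tag; the
            // remainder of the key points back at the row in hg18Table.
            Scan scan = new Scan();
            scan.setFilter(new PrefixFilter(Bytes.toBytes("is_dbSNP|")));

            ResultScanner scanner = tagIndex.getScanner(scan);
            try {
                for (Result r : scanner) {
                    String indexKey = Bytes.toString(r.getRow());
                    // e.g. is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123
                    System.out.println(indexKey);
                }
            } finally {
                scanner.close();
                tagIndex.close();
            }
        }
    }
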
  • 26. Frontend ● RESTlet API ● What queries can you do? ● Examples ● URLs ● Potential for swapping out generic M/R for many of these queries (less reliance on indexes which will speed things up as DB grows)
  • 27. Ideas for a distributed future ● Federated DBs/datastores/clusters for computation rather than one giant datacenter ● Distribute software, not data
  • 28. Potential Questions ● How big is the DB to store whole human genome? ● How long does it take to M/R 3 billion positions on 5 node cluster? ● How does my stuff compare to other bioinf software? GATK, Crossbow, etc ● How did I choose HBase instead of Pig, Hive, etc?
  • 29. Current Prototyping Work ● Validate creation of U87 (genome resequencing at 20x) genome database ● SNVs ● Coverage ● Annotations ● Test fast querying of record subsets ● Test fast processing of whole DB using MapReduce ● Test stability, fault-tolerance, auto-balancing, and deployment issues along the way
  • 30. What About Fast Queries? ● I'm fairly convinced I can create a distributed HBase database on a Hadoop cluster ● I have a prototype HBase database running on two nodes ● But HBase shines when bulk-processing the DB ● The big question is how to make individual lookups fast ● A possible solution is HBase+Katta for indexes (distributed Lucene)
  • 31. SeqWare Query Engine, BerkeleyDB backend (architecture diagram): backend, webservice, and clients. Per-genome BerkeleyDB Variant & Coverage Databases on a Lustre filesystem, served from analysis/web nodes (8 CPU, 32GB RAM); the RESTful web service combines variant/coverage query results with an annotation database (hg18) and returns BED/WIG files and DAS/XML to clients.
  • 32. How Do We Scale Up the QE? ● Sequencers are increasing output by a factor of 10 every two years! ● Hard drives: 4x every 2 years ● CPUs: 2x every 2 years ● Bandwidth: 2x every 2 years (really?!) ● So there's a huge disconnect; we can't just throw more hardware at a single database server! ● Must look for better ways to scale
  • 33. Google to the Rescue? ● Companies like Google, Amazon, Facebook, etc have had to deal with massive scalability issues over the last 10+ years ● Solutions include: ● Frameworks like MapReduce ● Distributed file systems like HDFS ● Distributed databases like HBase ● Focus here on HBase
  • 34. What Do You Give Up? ● SQL queries ● Well-defined schema, normalized data structure ● Relationships managed by the DB ● Flexible and easy indexing of table columns ● Existing tools that query a SQL database must be re-written ● Certain ACID aspects ● Software maturity; most distributed NOSQL projects are very new
  • 35. What Do You Gain? ● Scalability is the clear win, you can have many processes on many nodes hit the collection of database servers ● Ability to look at very large datasets and do complex computations across a cluster ● More flexibility in representing information now and in the future ● HBase includes data timestamps/versions ● Integration with Hadoop
  • 36. SeqWare Query Engine on Hadoop (HDFS + MapReduce) (architecture diagram): backend, webservice, and clients. ETL MapReduce jobs on data nodes extract, transform, and load variant/coverage data in parallel into the HBase Variant & Coverage Database System (data nodes plus a name node, with ETL Reduce jobs); web/analysis nodes process variant/coverage queries; the RESTful web service combines results with the annotation database (hg18) and returns BED/WIG files and XML metadata to clients.
  • 37. What an HBase DB Looks Like: a record in the HBase table, keyed by genomic position (e.g. chr15:00000123454), with family:label columns such as variant:genome4, variant:genome7, and coverage:genome7, each cell a Variant object serialized to a byte array; the database lives on the filesystem (HDFS) and each cell is versioned (e.g. key chr15:00000123454, timestamp t1, column variant:genome7, value byte[]).
  • 38. Scalability and BerkeleyDB ● BerkeleyDB let me: ● Create a database per genome, independent from a single database daemon ● Provision databases to the cluster for distributed analysis ● Adapt to key-value database semantics with a nice API ● Limitations: ● Creation on a single node only ● Want to query easily across genomes ● Databases are not distributed ● I saw performance issues, high I/O wait
  • 39. Would 2,000 Genomes Kill SQL? ● Say each genome has 5M variants (not counting coverage!) ● 5M variant rows x 2,000 genomes = 10 billion rows ● Our DB server running PostgreSQL (2xquad core, 64GB RAM) died with the Celsius (Chado) schema after loading ~1 billion rows ● So maybe conservatively we would have issues with 150+ genomes ● That threshold is probably 1 year away with public datasets available via SRA, 1000 genomes, TCGA
  • 40. Related Projects
  • 41. My Abstract ● backend/frontend ● Traverse and query with Map/Reduce ● Java web service with RESTlet ● Deployment on 8 node cluster
  • 42. Background on Problem ● Why abandon PostgreSQL/MySQL/SQL? ● Experience with Celsius... ● What you give up ● What you gain
  • 43. First Solution: BerkeleyDB ● Good: ● key/value data store ● Easy to use ● Great for testing ● Bad: ● Not performant for multiple genomes ● Manual distribution across cluster ● Annoying phobia of shared filesystems
  • 44. Sequencers vs. Information Technology ● Sequencers are increasing output by a factor of 10 every two years! ● Hard drives: 4x every 2 years ● CPUs: 2x every 2 years ● Bandwidth: 2x every 2 years (really?!) ● So there's a huge disconnect, can't just throw more hardware at a single database server! ● Must look for better ways to scale
  • 45. ● What are we doing, what are the challenges? Big picture of the project (webservice, backend, etc.) ● How did people solve this problem before? How did I attempt to solve this problem? Where did it break down? ● “New” approach: looking to Google et al. for scalability for big-data problems ● What are HBase/Hadoop & what do they provide? ● How did I adapt HBase/Hadoop to my problem? ● Specifics of implementation: overall flow, tables, query engine search (API), example M/R task ● Is this performant, does this scale? Can I get billion x million? Fast arbitrary retrieval? ● Next steps: more data types, focus on M/R for analytical tasks, focus on Katta for rich querying, push scalability w/ 8 nodes (test with genomes), look at Cascading & other tools for data mining