O'Connor BOSC 2010 Presentation Transcript
(License: CC Attribution-ShareAlike)

    • SeqWare Query Engine: Storing & Searching Sequence Data in the Cloud
      Brian O'Connor, UNC Lineberger Comprehensive Cancer Center
      BOSC, July 9th, 2010
    • SeqWare Query Engine
      ● Want to ask simple questions:
        ● “What SNVs are in 5'UTR of phosphatases?”
        ● “What frameshift indels affect PTEN?”
        ● “What genes include homozygous, non-synonymous SNVs?”
      ● SeqWare Query Engine created to query data
      ● RESTful webservice
      ● Scalable/queryable backend
    • Variant Annotation with SeqWare
      [Diagram: whole genome/exome data flows through the SeqWare Pipeline (alignment producing BAM, variant calling producing pileup, dbSNP annotation) into the SeqWare Query Engine backend (a backend interface over HBase or Berkeley DB stores holding variant, coverage, & consequence data), which the SeqWare Query Engine webservice (RESTlet) exposes as WIG/BED.]
    • Webservice Interfaces
      ● RESTful XML client API or HTML forms
      ● SeqWare Query Engine includes a RESTful web service that returns XML describing variant DBs, provides a web form for querying, and returns BED/WIG data via a well-defined URL, e.g.:
        http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
        track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
        chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
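      Following the slide above, a minimal client sketch (plain Java, standard library only) of fetching that BED query URL over HTTP. The hostname and query string are taken verbatim from the slide; the class name is hypothetical, so treat this as an illustration rather than SeqWare's own client code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch: fetch PTEN mismatch variants as BED from the query engine
// webservice URL shown in the slide above and print each returned line.
public class FetchVariantsBed {
    public static void main(String[] args) throws Exception {
        String url = "http://server.ucla.edu/seqware/queryengine/realtime/"
                   + "variants/mismatches/1?format=bed&filter.tag=PTEN";
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // e.g. "chr10 89675294 89675295 G->T(24:23:95.8%[...])"
                System.out.println(line);
            }
        }
    }
}
```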
    • Loading in Genome Browsers
      ● SeqWare Query Engine URLs can be directly loaded into IGV & UCSC genome browsers
    • Requirements for Query Engine Backend
      The backend must:
      – Represent many types of data
      – Support a rich level of annotation
      – Support very large variant databases (~3 billion rows x thousands of columns)
      – Be distributed across a cluster
      – Support processing, annotating, querying & comparing samples (variants, coverage, annotations)
      – Support a crazy growth of data
    • Increase in Sequencer Output
      [Chart: Nelson Lab - UCLA, Illumina sequencer output; sequence file sizes per lane, file size in bytes (log scale) plotted by date from 08/10/06 to 12/27/10. Suggests sequencer output increases by 5-10x every 2 years, far outpacing hard drive, CPU, and bandwidth growth.]
    • HBase to the Rescue?
      ● Billions of rows x millions of columns!
      ● Focus on random access (vs. HDFS)
      ● Table is a column-oriented, sparse matrix
      ● Versioning (timestamps) built in
      ● Flexible storage of different data types
      ● Splits the DB across many nodes transparently
      ● Locality of data: I can run map/reduce jobs that process the table rows present on a given node
      ● 22M variants processed in <1 minute on a 5 node cluster
    • Underlying HBase Tables
      hg18Table (columns named family:label, cell values are Variant objects serialized to byte arrays):
        key               | variant:genome4 | variant:genome7 | coverage:genome7
        chr15:00000123454 | byte[]          | byte[]          | byte[]
      Genome1102NTagIndexTable (secondary index; queries look up by tag, then filter the variant results):
        key: is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123 -> byte[]
      The database lives on the filesystem (HDFS); each stored cell carries key, timestamp, and column, e.g.:
        key: chr15:00000123454, timestamp: t1, column: variant:genome7, value: byte[]
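      A minimal sketch, following the table layout above, of how a variant row and its tag-index row might be written with the older-style HBase client API (HTable/Put). The table names and row keys come from the slide; the index column family/qualifier, the placeholder value bytes, and the class name are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: write one serialized Variant into the genome table and one row
// into the tag index table, using the row-key formats from the slide above.
public class LoadVariantSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Main genome table: row key is the genomic position; the column is
        // family:label (e.g. variant:genome4); the value is the serialized
        // Variant object (empty placeholder here).
        byte[] serializedVariant = new byte[0];
        HTable hg18Table = new HTable(conf, "hg18Table");
        Put variantPut = new Put(Bytes.toBytes("chr15:00000123454"));
        variantPut.add(Bytes.toBytes("variant"), Bytes.toBytes("genome4"),
                serializedVariant);
        hg18Table.put(variantPut);
        hg18Table.close();

        // Secondary "tag" index table: the row key encodes tag, position, and
        // variant, so a tag lookup returns candidate rows to filter. The
        // family/qualifier names and the stored value are guesses.
        HTable tagIndex = new HTable(conf, "Genome1102NTagIndexTable");
        Put indexPut = new Put(Bytes.toBytes(
                "is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123"));
        indexPut.add(Bytes.toBytes("rowId"), Bytes.toBytes(""),
                Bytes.toBytes("chr15:00000123454"));
        tagIndex.put(indexPut);
        tagIndex.close();
    }
}
```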
    • HBase API & Map/Reduce Querying
      ● HBase API
        ● Powers the current backend & webservice
        ● Provides a familiar API: scanners, iterators, etc.
        ● Backend written using this; retrieve variants by tags
        ● Distributed database, but single thread using the API
      ● Prototype somatic mutations by Map/Reduce
        ● Every row is examined; variants in tumor not in normal are retrieved
        ● Map/Reduce jobs run on the node with local data
        ● Highly parallel & faster than the API with a single thread
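      A minimal sketch of the "variants in tumor not in normal" scan described above, written as an HBase TableMapper so the map tasks run where the table rows live. The column labels (genomeTumor/genomeNormal) and the presence/absence test are simplifications assumed for illustration, not the actual SeqWare prototype.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Sketch of a map-only somatic scan: every row of the genome table is
// examined; a variant present in the tumor column but absent in the matched
// normal column is emitted as a candidate somatic mutation.
public class SomaticMutationMapper
        extends TableMapper<ImmutableBytesWritable, Text> {

    private static final byte[] FAM    = Bytes.toBytes("variant");
    private static final byte[] TUMOR  = Bytes.toBytes("genomeTumor");
    private static final byte[] NORMAL = Bytes.toBytes("genomeNormal");

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
            throws IOException, InterruptedException {
        byte[] tumor  = value.getValue(FAM, TUMOR);
        byte[] normal = value.getValue(FAM, NORMAL);
        if (tumor != null && normal == null) {
            // candidate somatic variant: present in tumor, absent in normal
            ctx.write(row, new Text("somatic"));
        }
    }
}
```

      The job wiring (a full-table Scan handed to TableMapReduceUtil.initTableMapperJob) is omitted here.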
    • SeqWare Query Engine on HBase
      [Diagram: backend, webservice, and clients. ETL map and reduce jobs (MapReduce) extract, transform, &/or load variant & coverage data in parallel into HBase on HDFS (data nodes plus a name node); querying & loading nodes process queries via the HBase API; analysis/web nodes host the RESTful web service, which combines variant/coverage data with metadata from the MetaDB and returns BED/WIG files and XML metadata to clients.]
    • Status of HBase Backend
      ● Both BerkeleyDB & HBase, relational soon
      ● Multiple genomes stored in the same table, very Map/Reduce compatible
      ● Basic secondary indexing for “tags”
      ● API used for queries via the webservice
      ● Prototype Map/Reduce example for “somatic” mutation detection in paired normal/cancer samples
      ● Currently loading 1102 normal/tumor (GBM)
    • Backend Performance Comparison
      [Chart: pileup load time for 1102N, HBase vs. Berkeley DB; variants loaded (0 to ~7,000,000) vs. time in seconds (0 to ~18,000); series "load time bdb" and "load time hbase".]
    • Backend Performance Comparison
      [Chart: BED export time for 1102N, HBase API vs. Map/Reduce vs. BerkeleyDB; variants dumped (0 to ~7,000,000) vs. time (0 to ~7,000); series "dump time bdb", "dump time hbase", and "dump time m/r".]
    • HBase/Hadoop Have Potential!
      ● The era of Big Data for biology is here!
      ● CPU-bound problems remain, no doubt, but as short reads become long reads and prices per GBase drop, the problems seem to shift to handling/mining data
      ● Tools designed for peta-scale datasets are key
    • Next Steps
      ● Model other datatypes: copy number, RNA-seq gene/exon/splice junction counts, isoforms, etc.
      ● Focus on porting analysis/querying to Map/Reduce
      ● Indexing beyond “tags” with Katta (distributed Lucene)
      ● Push scalability: what are the limits of an 8 node HBase/Hadoop cluster?
      ● Look at Cascading, Pig, Hive, etc. as advanced workflow and data mining tools
      ● Standards for webservice dialect (DAS?)
      ● Exposing Query Engine through Galaxy
    • Acknowledgments
      UCLA: Jordan Mendler, Michael Clark, Hane Lee, Bret Harry, Stanley Nelson
      UNC: Sara Grimm, Matt Soloway, Jianying Li, Feri Zsuppan, Neil Hayes, Chuck Perou, Derek Chiang
    • Resources
      ● HBase & Hadoop: http://hadoop.apache.org
      ● When to use HBase: http://blog.rapleaf.com/dev/?p=26
      ● NOSQL presentations: http://blog.oskarsson.nu/2009/06/nosql-debrief.html
      ● Other DBs: CouchDB, Hypertable, Cassandra, Project Voldemort, and more...
      ● Data mining tools: Pig and Hive
      ● SeqWare: http://seqware.sourceforge.net
      ● briandoconnor@gmail.com
    • Extra Slides
    • Overview
      ● SeqWare Query Engine background
      ● New tools for combating the data deluge
      ● HBase/Hadoop in SeqWare Query Engine
      ● HBase for backend
      ● Map/Reduce & HBase API for webservice
      ● Better performance and scalability?
      ● Next steps
    • SeqWare Query Engine (BerkeleyDB backend)
      [Diagram: backend, webservice, and clients. BerkeleyDB variant & coverage genome databases on a Lustre filesystem; web/analysis nodes process variant/coverage queries; the RESTful web service on the analysis/web nodes combines the query results with metadata from the MetaDB and returns BED/WIG files and XML metadata to clients.]
    • More
      ● Details on API vs. M/R
      ● Details on XML RESTful API & web app, including loading in the UCSC browser
      ● Details on generic store object (BerkeleyDB, HBase, and relational at Renci)
      ● Byte serialization from BerkeleyDB, custom secondary key creation
    • Pressures of Sequencing
      ● A lot of data (50GB SRF file, 150GB alignment files, 60GB variants for a 20x human genome)
      ● PostgreSQL (2x quad core, 64GB RAM) died with the Celsius schema (microarray database) after loading ~1 billion rows
      ● Needs to be processed, annotated, and queryable & comparable (variants, coverage, annotations)
      ● ~3 billion rows x thousands of columns
      ● COMBINE WITH PREVIOUS SLIDE
    • Thoughts on BerkeleyDB
      ● BerkeleyDB let me:
        ● Create a database per genome, independent from a single database daemon
        ● Provision databases to the cluster
        ● Adapt to key-value database semantics
      ● Limitations:
        ● Creation on a single node only
        ● Not inherently distributed
        ● Performance issues with big DBs, high I/O wait
      ● Google to the rescue?
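      For the "database per genome, key-value semantics" point above, a minimal sketch using Berkeley DB Java Edition (com.sleepycat.je). The directory name, database name, and key/value encoding are illustrative assumptions, not SeqWare's actual serialization.

```java
import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

// Sketch: one Berkeley DB JE environment per genome, keyed by genomic
// position, with a serialized variant record as the value. No database
// daemon is involved; the store is just files in a directory.
public class PerGenomeStoreSketch {
    public static void main(String[] args) throws Exception {
        File dir = new File("genome4-db");
        dir.mkdirs();

        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(dir, envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database variants = env.openDatabase(null, "variants", dbCfg);

        DatabaseEntry key = new DatabaseEntry("chr15:00000123454".getBytes("UTF-8"));
        DatabaseEntry value = new DatabaseEntry("SNV,A,G".getBytes("UTF-8"));
        variants.put(null, key, value);   // no explicit transaction

        variants.close();
        env.close();
    }
}
```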
    • HBase Backend
      ● How the table(s) are actually structured
        ● Variants
        ● Coverage
        ● Etc.
      ● How I do indexing currently (similar to indexing feature extension)
      ● Multiple secondary indexes
    • Frontend
      ● RESTlet API
      ● What queries can you do?
      ● Examples
      ● URLs
      ● Potential for swapping out generic M/R for many of these queries (less reliance on indexes, which will speed things up as the DB grows)
    • Ideas for a distributed future
      ● Federated DBs/datastores/clusters for computation rather than one giant datacenter
      ● Distribute software, not data
    • Potential Questions
      ● How big is the DB to store a whole human genome?
      ● How long does it take to M/R 3 billion positions on a 5 node cluster?
      ● How does my stuff compare to other bioinf software? GATK, Crossbow, etc.
      ● How did I choose HBase instead of Pig, Hive, etc.?
    • Current Prototyping Work
      ● Validate creation of the U87 (genome resequencing at 20x) genome database
        ● SNVs
        ● Coverage
        ● Annotations
      ● Test fast querying of record subsets
      ● Test fast processing of the whole DB using MapReduce
      ● Test stability, fault-tolerance, auto-balancing, and deployment issues along the way
    • What About Fast Queries?
      ● I'm fairly convinced I can create a distributed HBase database on a Hadoop cluster
      ● I have a prototype HBase database running on two nodes
      ● But HBase shines when bulk processing the DB
      ● Big question is how to make individual lookups fast
      ● Possible solution is HBase+Katta for indexes (distributed Lucene)
    • SeqWare Query Engine
      [Diagram: backend, webservice, and clients. BerkeleyDB variant & coverage genome databases plus an annotation database (hg18) on a Lustre filesystem; web/analysis nodes (8 CPU, 32GB RAM each) process variant/coverage queries; the RESTful web service combines the results with annotations and returns BED/WIG files and DAS XML to clients.]
    • How Do We Scale Up the QE?
      ● Sequencers are increasing output by a factor of 10 every two years!
      ● Hard drives: 4x every 2 years
      ● CPUs: 2x every 2 years
      ● Bandwidth: 2x every 2 years (really?!)
      ● So there's a huge disconnect; we can't just throw more hardware at a single database server!
      ● Must look for better ways to scale
    • Google to the Rescue?
      ● Companies like Google, Amazon, Facebook, etc. have had to deal with massive scalability issues over the last 10+ years
      ● Solutions include:
        ● Frameworks like MapReduce
        ● Distributed file systems like HDFS
        ● Distributed databases like HBase
      ● Focus here on HBase
    • What Do You Give Up?
      ● SQL queries
      ● Well-defined schema, normalized data structure
      ● Relationships managed by the DB
      ● Flexible and easy indexing of table columns
      ● Existing tools that query a SQL database must be re-written
      ● Certain ACID aspects
      ● Software maturity; most distributed NOSQL projects are very new
    • What Do You Gain?
      ● Scalability is the clear win: you can have many processes on many nodes hit the collection of database servers
      ● Ability to look at very large datasets and do complex computations across a cluster
      ● More flexibility in representing information, now and in the future
      ● HBase includes data timestamps/versions
      ● Integration with Hadoop
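      To illustrate the timestamps/versions point, a minimal sketch of reading every stored version of one cell through the HBase client API; the table and column names reuse the hg18Table layout from the earlier slide, and the class name is hypothetical.

```java
import java.util.NavigableMap;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: fetch all stored versions of variant:genome7 at one position and
// print each version's timestamp (assumes the row and cell exist).
public class VersionedReadSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "hg18Table");
        Get get = new Get(Bytes.toBytes("chr15:00000123454"));
        get.setMaxVersions();  // return all stored versions, not just the latest
        Result result = table.get(get);

        NavigableMap<Long, byte[]> versions = result.getMap()
                .get(Bytes.toBytes("variant"))
                .get(Bytes.toBytes("genome7"));
        for (Long ts : versions.keySet()) {
            System.out.println("timestamp " + ts + ": "
                    + versions.get(ts).length + " bytes");
        }
        table.close();
    }
}
```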
    • SeqWare Query Engine
      [Diagram: backend, webservice, and clients on Hadoop HDFS and Hadoop MapReduce. ETL map and reduce jobs extract, transform, & load variant & coverage data in parallel into the HBase database system (data nodes plus a name node); web/analysis nodes process variant/coverage queries; the RESTful web service combines the results with annotations (hg18 annotation database) and returns BED/WIG files and XML metadata to clients.]
    • What an HBase DB Looks Like
      A record in my HBase (columns named family:label, cell values are Variant objects serialized to byte arrays):
        key               | variant:genome4 | variant:genome7 | coverage:genome7
        chr15:00000123454 | byte[]          | byte[]          | byte[]
      The database lives on the filesystem (HDFS); each stored cell carries key, timestamp, and column, e.g.:
        key: chr15:00000123454, timestamp: t1, column: variant:genome7, value: byte[]
    • Scalability and BerkeleyDB
      ● BerkeleyDB let me:
        ● Create a database per genome, independent from a single database daemon
        ● Provision databases to the cluster for distributed analysis
        ● Adapt to key-value database semantics with a nice API
      ● Limitations:
        ● Creation on a single node only
        ● Want to query easily across genomes
        ● Databases are not distributed
        ● I saw performance issues, high I/O wait
    • Would 2,000 Genomes Kill SQL?
      ● Say each genome has 5M variants (not counting coverage!)
      ● 5M variant rows x 2,000 genomes = 10 billion rows
      ● Our DB server running PostgreSQL (2x quad core, 64GB RAM) died with the Celsius (Chado) schema after loading ~1 billion rows
      ● So maybe, conservatively, we would have issues with 150+ genomes
      ● That threshold is probably 1 year away with public datasets available via SRA, 1000 Genomes, TCGA
    • Related Projects
    • My Abstract
      ● Backend/frontend
      ● Traverse and query with Map/Reduce
      ● Java web service with RESTlet
      ● Deployment on an 8 node cluster
    • Background on Problem
      ● Why abandon PostgreSQL/MySQL/SQL?
      ● Experience with Celsius...
      ● What you give up
      ● What you gain
    • First Solution: BerkeleyDB
      ● Good:
        ● Key/value data store
        ● Easy to use
        ● Great for testing
      ● Bad:
        ● Not performant for multiple genomes
        ● Manual distribution across cluster
        ● Annoying phobia of shared filesystems
    • Sequencers vs. Information Technology
      ● Sequencers are increasing output by a factor of 10 every two years!
      ● Hard drives: 4x every 2 years
      ● CPUs: 2x every 2 years
      ● Bandwidth: 2x every 2 years (really?!)
      ● So there's a huge disconnect; we can't just throw more hardware at a single database server!
      ● Must look for better ways to scale
    • ● What are we doing, and what are the challenges? Big picture of the project (webservice, backend, etc.)
      ● How did people solve this problem before? How did I attempt to solve this problem? Where did it break down?
      ● “New” approach: looking to Google et al. for scalability on big data problems
      ● What are HBase/Hadoop & what do they provide?
      ● How did I adapt HBase/Hadoop to my problem?
      ● Specifics of implementation: overall flow, tables, query engine search (API), example M/R task
      ● Is this performant, does this scale? Can I get billion x million? Fast arbitrary retrieval?
      ● Next steps: more data types, focus on M/R for analytical tasks, focus on Katta for rich querying, push scalability w/ 8 nodes (test with genomes), look at Cascading & other tools for data mining