O'Connor BOSC 2010 Presentation Transcript
(License: CC Attribution-ShareAlike)

    • SeqWare Query Engine: Storing & Searching Sequence Data in the Cloud
      Brian O'Connor, UNC Lineberger Comprehensive Cancer Center
      BOSC, July 9th, 2010
    • SeqWare Query Engine
      ● Want to ask simple questions:
        ● “What SNVs are in 5'UTR of phosphatases?”
        ● “What frameshift indels affect PTEN?”
        ● “What genes include homozygous, non-synonymous SNVs?”
      ● SeqWare Query Engine created to query data
      ● RESTful webservice
      ● Scalable/queryable backend
    • Variant Annotation with SeqWare
      [Diagram: whole genome/exome data flows through the SeqWare Pipeline (alignment producing BAM, variant calling producing pileup, dbSNP annotation) into the SeqWare Query Engine backend (a backend interface over HBase or Berkeley DB stores holding variant, coverage, & consequence data), which the SeqWare Query Engine webservice (RESTlet) exposes as WIG/BED.]
    • Webservice Interfaces
      ● RESTful XML client API or HTML forms
      ● SeqWare Query Engine includes a RESTful web service that returns XML describing variant DBs, provides a web form for querying, and returns BED/WIG data via a well-defined URL, e.g.:
        http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
        track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
        chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
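      Following the slide above, a minimal client sketch (plain Java, standard library only) of fetching that BED query URL over HTTP. The hostname and query string are taken verbatim from the slide; the class name is hypothetical, so treat this as an illustration rather than SeqWare's own client code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch: fetch PTEN mismatch variants as BED from the query engine
// webservice URL shown in the slide above and print each returned line.
public class FetchVariantsBed {
    public static void main(String[] args) throws Exception {
        String url = "http://server.ucla.edu/seqware/queryengine/realtime/"
                   + "variants/mismatches/1?format=bed&filter.tag=PTEN";
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // e.g. "chr10 89675294 89675295 G->T(24:23:95.8%[...])"
                System.out.println(line);
            }
        }
    }
}
```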
    • Loading in Genome Browsers
      ● SeqWare Query Engine URLs can be directly loaded into IGV & UCSC genome browsers
    • Requirements for Query Engine Backend
      The backend must:
      – Represent many types of data
      – Support a rich level of annotation
      – Support very large variant databases (~3 billion rows x thousands of columns)
      – Be distributed across a cluster
      – Support processing, annotating, querying & comparing samples (variants, coverage, annotations)
      – Support a crazy growth of data
    • Increase in Sequencer Output
      [Chart: Nelson Lab - UCLA, Illumina sequencer output; sequence file sizes per lane, file size in bytes (log scale) plotted by date from 08/10/06 to 12/27/10. Suggests sequencer output increases by 5-10x every 2 years, far outpacing hard drive, CPU, and bandwidth growth.]
    • HBase to the Rescue?
      ● Billions of rows x millions of columns!
      ● Focus on random access (vs. HDFS)
      ● Table is a column-oriented, sparse matrix
      ● Versioning (timestamps) built in
      ● Flexible storage of different data types
      ● Splits the DB across many nodes transparently
      ● Locality of data: I can run map/reduce jobs that process the table rows present on a given node
      ● 22M variants processed in <1 minute on a 5 node cluster
    • Underlying HBase Tables
      hg18Table (columns named family:label, cell values are Variant objects serialized to byte arrays):
        key               | variant:genome4 | variant:genome7 | coverage:genome7
        chr15:00000123454 | byte[]          | byte[]          | byte[]
      Genome1102NTagIndexTable (secondary index; queries look up by tag, then filter the variant results):
        key: is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123 -> byte[]
      The database lives on the filesystem (HDFS); each stored cell carries key, timestamp, and column, e.g.:
        key: chr15:00000123454, timestamp: t1, column: variant:genome7, value: byte[]
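      A minimal sketch, following the table layout above, of how a variant row and its tag-index row might be written with the older-style HBase client API (HTable/Put). The table names and row keys come from the slide; the index column family/qualifier, the placeholder value bytes, and the class name are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: write one serialized Variant into the genome table and one row
// into the tag index table, using the row-key formats from the slide above.
public class LoadVariantSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Main genome table: row key is the genomic position; the column is
        // family:label (e.g. variant:genome4); the value is the serialized
        // Variant object (empty placeholder here).
        byte[] serializedVariant = new byte[0];
        HTable hg18Table = new HTable(conf, "hg18Table");
        Put variantPut = new Put(Bytes.toBytes("chr15:00000123454"));
        variantPut.add(Bytes.toBytes("variant"), Bytes.toBytes("genome4"),
                serializedVariant);
        hg18Table.put(variantPut);
        hg18Table.close();

        // Secondary "tag" index table: the row key encodes tag, position, and
        // variant, so a tag lookup returns candidate rows to filter. The
        // family/qualifier names and the stored value are guesses.
        HTable tagIndex = new HTable(conf, "Genome1102NTagIndexTable");
        Put indexPut = new Put(Bytes.toBytes(
                "is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123"));
        indexPut.add(Bytes.toBytes("rowId"), Bytes.toBytes(""),
                Bytes.toBytes("chr15:00000123454"));
        tagIndex.put(indexPut);
        tagIndex.close();
    }
}
```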
    • HBase API & Map/Reduce Querying
      ● HBase API
        ● Powers the current backend & webservice
        ● Provides a familiar API: scanners, iterators, etc.
        ● Backend written using this; retrieve variants by tags
        ● Distributed database, but single thread using the API
      ● Prototype somatic mutations by Map/Reduce
        ● Every row is examined; variants in tumor not in normal are retrieved
        ● Map/Reduce jobs run on the node with local data
        ● Highly parallel & faster than the API with a single thread
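      A minimal sketch of the "variants in tumor not in normal" scan described above, written as an HBase TableMapper so the map tasks run where the table rows live. The column labels (genomeTumor/genomeNormal) and the presence/absence test are simplifications assumed for illustration, not the actual SeqWare prototype.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Sketch of a map-only somatic scan: every row of the genome table is
// examined; a variant present in the tumor column but absent in the matched
// normal column is emitted as a candidate somatic mutation.
public class SomaticMutationMapper
        extends TableMapper<ImmutableBytesWritable, Text> {

    private static final byte[] FAM    = Bytes.toBytes("variant");
    private static final byte[] TUMOR  = Bytes.toBytes("genomeTumor");
    private static final byte[] NORMAL = Bytes.toBytes("genomeNormal");

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
            throws IOException, InterruptedException {
        byte[] tumor  = value.getValue(FAM, TUMOR);
        byte[] normal = value.getValue(FAM, NORMAL);
        if (tumor != null && normal == null) {
            // candidate somatic variant: present in tumor, absent in normal
            ctx.write(row, new Text("somatic"));
        }
    }
}
```

      The job wiring (a full-table Scan handed to TableMapReduceUtil.initTableMapperJob) is omitted here.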
    • SeqWare Query Engine on HBase
      [Diagram: backend, webservice, and clients. ETL map and reduce jobs (MapReduce) extract, transform, &/or load variant & coverage data in parallel into HBase on HDFS (data nodes plus a name node); querying & loading nodes process queries via the HBase API; analysis/web nodes host the RESTful web service, which combines variant/coverage data with metadata from the MetaDB and returns BED/WIG files and XML metadata to clients.]
    • Status of HBase Backend
      ● Both BerkeleyDB & HBase, relational soon
      ● Multiple genomes stored in the same table, very Map/Reduce compatible
      ● Basic secondary indexing for “tags”
      ● API used for queries via the webservice
      ● Prototype Map/Reduce example for “somatic” mutation detection in paired normal/cancer samples
      ● Currently loading 1102 normal/tumor (GBM)
    • Backend Performance Comparison
      [Chart: pileup load time for 1102N, HBase vs. Berkeley DB; variants loaded (0 to ~7,000,000) vs. time in seconds (0 to ~18,000); series "load time bdb" and "load time hbase".]
    • Backend Performance Comparison
      [Chart: BED export time for 1102N, HBase API vs. Map/Reduce vs. BerkeleyDB; variants dumped (0 to ~7,000,000) vs. time (0 to ~7,000); series "dump time bdb", "dump time hbase", and "dump time m/r".]
    • HBase/Hadoop Have Potential!
      ● The era of Big Data for biology is here!
      ● CPU-bound problems remain, no doubt, but as short reads become long reads and prices per GBase drop, the problems seem to shift to handling/mining data
      ● Tools designed for peta-scale datasets are key
    • Next Steps
      ● Model other datatypes: copy number, RNA-seq gene/exon/splice junction counts, isoforms, etc.
      ● Focus on porting analysis/querying to Map/Reduce
      ● Indexing beyond “tags” with Katta (distributed Lucene)
      ● Push scalability: what are the limits of an 8 node HBase/Hadoop cluster?
      ● Look at Cascading, Pig, Hive, etc. as advanced workflow and data mining tools
      ● Standards for webservice dialect (DAS?)
      ● Exposing Query Engine through Galaxy
    • Acknowledgments
      UCLA: Jordan Mendler, Michael Clark, Hane Lee, Bret Harry, Stanley Nelson
      UNC: Sara Grimm, Matt Soloway, Jianying Li, Feri Zsuppan, Neil Hayes, Chuck Perou, Derek Chiang
    • Resources
      ● HBase & Hadoop: http://hadoop.apache.org
      ● When to use HBase: http://blog.rapleaf.com/dev/?p=26
      ● NOSQL presentations: http://blog.oskarsson.nu/2009/06/nosql-debrief.html
      ● Other DBs: CouchDB, Hypertable, Cassandra, Project Voldemort, and more...
      ● Data mining tools: Pig and Hive
      ● SeqWare: http://seqware.sourceforge.net
      ● briandoconnor@gmail.com
    • Extra Slides
    • Overview
      ● SeqWare Query Engine background
      ● New tools for combating the data deluge
      ● HBase/Hadoop in SeqWare Query Engine
      ● HBase for backend
      ● Map/Reduce & HBase API for webservice
      ● Better performance and scalability?
      ● Next steps
    • SeqWare Query Engine (BerkeleyDB backend)
      [Diagram: backend, webservice, and clients. BerkeleyDB variant & coverage genome databases on a Lustre filesystem; web/analysis nodes process variant/coverage queries; the RESTful web service on the analysis/web nodes combines the query results with metadata from the MetaDB and returns BED/WIG files and XML metadata to clients.]
    • More
      ● Details on API vs. M/R
      ● Details on XML RESTful API & web app, including loading in the UCSC browser
      ● Details on generic store object (BerkeleyDB, HBase, and relational at Renci)
      ● Byte serialization from BerkeleyDB, custom secondary key creation
    • Pressures of Sequencing
      ● A lot of data (50GB SRF file, 150GB alignment files, 60GB variants for a 20x human genome)
      ● PostgreSQL (2x quad core, 64GB RAM) died with the Celsius schema (microarray database) after loading ~1 billion rows
      ● Needs to be processed, annotated, and queryable & comparable (variants, coverage, annotations)
      ● ~3 billion rows x thousands of columns
      ● COMBINE WITH PREVIOUS SLIDE
    • Thoughts on BerkeleyDB
      ● BerkeleyDB let me:
        ● Create a database per genome, independent from a single database daemon
        ● Provision databases to the cluster
        ● Adapt to key-value database semantics
      ● Limitations:
        ● Creation on a single node only
        ● Not inherently distributed
        ● Performance issues with big DBs, high I/O wait
      ● Google to the rescue?
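      For the "database per genome, key-value semantics" point above, a minimal sketch using Berkeley DB Java Edition (com.sleepycat.je). The directory name, database name, and key/value encoding are illustrative assumptions, not SeqWare's actual serialization.

```java
import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

// Sketch: one Berkeley DB JE environment per genome, keyed by genomic
// position, with a serialized variant record as the value. No database
// daemon is involved; the store is just files in a directory.
public class PerGenomeStoreSketch {
    public static void main(String[] args) throws Exception {
        File dir = new File("genome4-db");
        dir.mkdirs();

        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(dir, envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database variants = env.openDatabase(null, "variants", dbCfg);

        DatabaseEntry key = new DatabaseEntry("chr15:00000123454".getBytes("UTF-8"));
        DatabaseEntry value = new DatabaseEntry("SNV,A,G".getBytes("UTF-8"));
        variants.put(null, key, value);   // no explicit transaction

        variants.close();
        env.close();
    }
}
```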
    • HBase Backend
      ● How the table(s) are actually structured
        ● Variants
        ● Coverage
        ● Etc.
      ● How I do indexing currently (similar to indexing feature extension)
      ● Multiple secondary indexes
    • Frontend
      ● RESTlet API
      ● What queries can you do?
      ● Examples
      ● URLs
      ● Potential for swapping out generic M/R for many of these queries (less reliance on indexes, which will speed things up as the DB grows)
    • Ideas for a distributed future
      ● Federated DBs/datastores/clusters for computation rather than one giant datacenter
      ● Distribute software, not data
    • Potential Questions
      ● How big is the DB to store a whole human genome?
      ● How long does it take to M/R 3 billion positions on a 5 node cluster?
      ● How does my stuff compare to other bioinf software? GATK, Crossbow, etc.
      ● How did I choose HBase instead of Pig, Hive, etc.?
    • Current Prototyping Work
      ● Validate creation of the U87 (genome resequencing at 20x) genome database
        ● SNVs
        ● Coverage
        ● Annotations
      ● Test fast querying of record subsets
      ● Test fast processing of the whole DB using MapReduce
      ● Test stability, fault-tolerance, auto-balancing, and deployment issues along the way
    • What About Fast Queries?
      ● I'm fairly convinced I can create a distributed HBase database on a Hadoop cluster
      ● I have a prototype HBase database running on two nodes
      ● But HBase shines when bulk processing the DB
      ● Big question is how to make individual lookups fast
      ● Possible solution is HBase+Katta for indexes (distributed Lucene)
    • SeqWare Query Engine
      [Diagram: backend, webservice, and clients. BerkeleyDB variant & coverage genome databases plus an annotation database (hg18) on a Lustre filesystem; web/analysis nodes (8 CPU, 32GB RAM each) process variant/coverage queries; the RESTful web service combines the results with annotations and returns BED/WIG files and DAS XML to clients.]
    • How Do We Scale Up the QE?
      ● Sequencers are increasing output by a factor of 10 every two years!
      ● Hard drives: 4x every 2 years
      ● CPUs: 2x every 2 years
      ● Bandwidth: 2x every 2 years (really?!)
      ● So there's a huge disconnect; we can't just throw more hardware at a single database server!
      ● Must look for better ways to scale
    • Google to the Rescue?
      ● Companies like Google, Amazon, Facebook, etc. have had to deal with massive scalability issues over the last 10+ years
      ● Solutions include:
        ● Frameworks like MapReduce
        ● Distributed file systems like HDFS
        ● Distributed databases like HBase
      ● Focus here on HBase
    • What Do You Give Up?
      ● SQL queries
      ● Well-defined schema, normalized data structure
      ● Relationships managed by the DB
      ● Flexible and easy indexing of table columns
      ● Existing tools that query a SQL database must be re-written
      ● Certain ACID aspects
      ● Software maturity; most distributed NOSQL projects are very new
    • What Do You Gain?
      ● Scalability is the clear win: you can have many processes on many nodes hit the collection of database servers
      ● Ability to look at very large datasets and do complex computations across a cluster
      ● More flexibility in representing information, now and in the future
      ● HBase includes data timestamps/versions
      ● Integration with Hadoop
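      To illustrate the timestamps/versions point, a minimal sketch of reading every stored version of one cell through the HBase client API; the table and column names reuse the hg18Table layout from the earlier slide, and the class name is hypothetical.

```java
import java.util.NavigableMap;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: fetch all stored versions of variant:genome7 at one position and
// print each version's timestamp (assumes the row and cell exist).
public class VersionedReadSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "hg18Table");
        Get get = new Get(Bytes.toBytes("chr15:00000123454"));
        get.setMaxVersions();  // return all stored versions, not just the latest
        Result result = table.get(get);

        NavigableMap<Long, byte[]> versions = result.getMap()
                .get(Bytes.toBytes("variant"))
                .get(Bytes.toBytes("genome7"));
        for (Long ts : versions.keySet()) {
            System.out.println("timestamp " + ts + ": "
                    + versions.get(ts).length + " bytes");
        }
        table.close();
    }
}
```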
    • SeqWare Query Engine
      [Diagram: backend, webservice, and clients on Hadoop HDFS and Hadoop MapReduce. ETL map and reduce jobs extract, transform, & load variant & coverage data in parallel into the HBase database system (data nodes plus a name node); web/analysis nodes process variant/coverage queries; the RESTful web service combines the results with annotations (hg18 annotation database) and returns BED/WIG files and XML metadata to clients.]
    • What an HBase DB Looks Like
      A record in my HBase (columns named family:label, cell values are Variant objects serialized to byte arrays):
        key               | variant:genome4 | variant:genome7 | coverage:genome7
        chr15:00000123454 | byte[]          | byte[]          | byte[]
      The database lives on the filesystem (HDFS); each stored cell carries key, timestamp, and column, e.g.:
        key: chr15:00000123454, timestamp: t1, column: variant:genome7, value: byte[]
    • Scalability and BerkeleyDB
      ● BerkeleyDB let me:
        ● Create a database per genome, independent from a single database daemon
        ● Provision databases to the cluster for distributed analysis
        ● Adapt to key-value database semantics with a nice API
      ● Limitations:
        ● Creation on a single node only
        ● Want to query easily across genomes
        ● Databases are not distributed
        ● I saw performance issues, high I/O wait
    • Would 2,000 Genomes Kill SQL?
      ● Say each genome has 5M variants (not counting coverage!)
      ● 5M variant rows x 2,000 genomes = 10 billion rows
      ● Our DB server running PostgreSQL (2x quad core, 64GB RAM) died with the Celsius (Chado) schema after loading ~1 billion rows
      ● So maybe, conservatively, we would have issues with 150+ genomes
      ● That threshold is probably 1 year away with public datasets available via SRA, 1000 Genomes, TCGA
    • Related Projects
    • My Abstract
      ● Backend/frontend
      ● Traverse and query with Map/Reduce
      ● Java web service with RESTlet
      ● Deployment on an 8 node cluster
    • Background on Problem
      ● Why abandon PostgreSQL/MySQL/SQL?
      ● Experience with Celsius...
      ● What you give up
      ● What you gain
    • First Solution: BerkeleyDB
      ● Good:
        ● Key/value data store
        ● Easy to use
        ● Great for testing
      ● Bad:
        ● Not performant for multiple genomes
        ● Manual distribution across cluster
        ● Annoying phobia of shared filesystems
    • Sequencers vs. Information Technology
      ● Sequencers are increasing output by a factor of 10 every two years!
      ● Hard drives: 4x every 2 years
      ● CPUs: 2x every 2 years
      ● Bandwidth: 2x every 2 years (really?!)
      ● So there's a huge disconnect; we can't just throw more hardware at a single database server!
      ● Must look for better ways to scale
    • ● What are we doing, and what are the challenges? Big picture of the project (webservice, backend, etc.)
      ● How did people solve this problem before? How did I attempt to solve this problem? Where did it break down?
      ● “New” approach: looking to Google et al. for scalability on big data problems
      ● What are HBase/Hadoop & what do they provide?
      ● How did I adapt HBase/Hadoop to my problem?
      ● Specifics of implementation: overall flow, tables, query engine search (API), example M/R task
      ● Is this performant, does this scale? Can I get billion x million? Fast arbitrary retrieval?
      ● Next steps: more data types, focus on M/R for analytical tasks, focus on Katta for rich querying, push scalability w/ 8 nodes (test with genomes), look at Cascading & other tools for data mining