Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.
Analyzing Big Data in R and Scala using Apache Spark (17-7-19), by Ahmed Elsayed
Data mining lets us make predictions about future data from historical data, especially Big Data, using machine learning algorithms running on two clusters: Hadoop, which manages the distributed file system for Big Data, and Apache Spark, which performs fast analysis of that data. To achieve this we will use R (via RStudio) or Scala (via Zeppelin).
Studies of HPCC Systems from Machine Learning Perspectives, by HPCC Systems
Ying Xie & Pooja Chenna, Kennesaw State University, present at the 2015 HPCC Systems Engineering Summit Community Day. Deep learning has emerged as a breakthrough in machine learning and has revitalized research in Artificial Intelligence (AI). Deep learning techniques are widely used in image recognition and natural language processing. In this presentation, we show our ECL implementation of an important deep neural network architecture, the Deep Belief Network. We further illustrate how to apply this implementation to network intrusion detection on massive data sets on HPCC Systems. Additionally, we report our study on how to optimally configure a stacked autoencoder, another deep learning architecture, on HPCC Systems for both supervised and unsupervised learning on different types of data sets.
Matching Data Intensive Applications and Hardware/Software Architectures, by Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
Modern applications, often called “big-data” analysis, require us to manage immense amounts of data quickly. To deal with applications such as these, a new software stack has evolved.
This presentation describes the company where I did my summer training, and covers what big data is, why we use it, big data challenges and issues, solutions to those issues, Hadoop, Docker, Ansible, etc.
BioTeam Bhanu Rekepalli Presentation at BICoB 2015, by The BioTeam Inc.
Adapting life sciences applications to next generation supercomputers with Intel Xeon Phi coprocessors to accelerate large scale data analysis and discovery.
Big Data raises challenges about how to process such a vast pool of raw data and how to extract value from it. To address these demands, an ecosystem of tools named Hadoop was conceived.
Top Hadoop Big Data Interview Questions and Answers for Freshers, by JanBask Training
Covers different types of big data benchmarking and the different benchmark suites, with details on TeraSort and a demo with TPCx-HS.
Meetup Details of presentation:
http://www.meetup.com/lspe-in/events/203918952/
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
I have studied Big Data analysis and found Hadoop to be the most capable and most popular technology for its distributed data processing approach. In this slide show I have gathered information about the various Hadoop distributions available in the market and describe the most important tools in the Hadoop ecosystem and their functionality. I also discuss connectivity with the R language from a data analysis and visualization perspective. I hope you enjoy it!
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
The Hadoop Ecosystem for developers session in DevGeekWeek in Israel.
This was a day-long session about big data problems and the Hadoop solution. We also talked about Spark and NoSQL.
Rapid Cluster Computing with Apache Spark 2016, by Zohar Elkayam
This is the presentation I used for Oracle Week 2016 session about Apache Spark.
In the agenda:
- The Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Working with Spark Cluster and Parallel programming
- Spark modules: Spark SQL and Spark Streaming
- Performance and Troubleshooting
We provide Hadoop training in Hyderabad and Bangalore, including corporate training, delivered by faculty with 12+ years of experience.
Real-time industry experts from MNCs
Resume preparation by expert professionals
Lab exercises
Interview preparation
Expert advice
This is a presentation on Apache Hadoop technology that may be helpful for beginners: it introduces Hadoop terminology and includes diagrams showing how the technology works.
Thank you.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to complement them. In this presentation you will hear what Big Data and a Data Lake are and which technologies are most popular in the Big Data world. We will also speak about Hadoop and Spark, how they integrate with traditional systems, and their benefits.
Teradata Partners Conference Oct 2014: Big Data Anti-Patterns, by Douglas Moore
Big Data Anti-Patterns: Lessons from the Front Lines
Drawn from over 50 client engagements, big data anti-patterns are common practices that make for bad solutions.
Similar to Hadoop for Bioinformatics: Building a Scalable Variant Store
Genomics Is Not Special: Towards Data Intensive Biology, by Uri Laserson
Genomics and the life sciences are using antiquated technology for processing data. As data volumes increase in the life sciences, many in the biology community are reinventing the wheel without realizing that a rich ecosystem of tools for processing large data sets already exists: Hadoop.
Description of the API concept for engineering and how it can be useful. Particularly how it should be used with respect to genomics data. Finally, an analogy of the API concept in synthetic biology and how evolution allows encapsulation.
Python in the Hadoop Ecosystem (Rock Health presentation), by Uri Laserson
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
5. Indexing the Web
- Web is huge: hundreds of millions of pages in 1999
- How do you index it?
  - Crawl all the pages
  - Rank pages based on relevance metrics
  - Build search index of keywords to pages
  - Do it in real time!
7. Databases in 1999
1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn't fail
5. Ambitious: buy another big machine as backup
11. Google does something different
- Designed their own storage and processing infrastructure: Google File System (GFS) and MapReduce (MR)
- Goals: KISS
  - Cheap
  - Scalable
  - Reliable
12. Google does something different
- It worked!
- Powered Google Search for many years
- General framework for large-scale batch computation tasks
- Still used internally at Google to this day
14. Birth of Hadoop at Yahoo!
- 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR.
- 2006: Spun out as Apache Hadoop
- Named after Doug's son's yellow stuffed elephant
15. Open-source proliferation
Google            Open-source (Hadoop)   Function
GFS               HDFS                   Distributed file system
MapReduce         MapReduce              Batch distributed data processing
Bigtable          HBase                  Distributed DB/key-value store
Protobuf/Stubby   Thrift or Avro         Data serialization/RPC
Pregel            Giraph                 Distributed graph processing
Dremel/F1         Cloudera Impala        Scalable interactive SQL (MPP)
FlumeJava         Crunch                 Abstracted data pipelines on Hadoop
17. HDFS design assumptions
- Based on Google File System
- Files are large (GBs to TBs)
- Failures are common
  - Massive scale means failures very likely
  - Disk, node, or network failures
- Accesses are large and sequential
- Files are append-only
20. MapReduce computation
- Structured as:
  1. Embarrassingly parallel "map stage"
  2. Cluster-wide distributed sort ("shuffle")
  3. Aggregation "reduce stage"
- Data-locality: process the data where it is stored
- Fault-tolerance: failed tasks automatically detected and restarted
- Schema-on-read: data need not be stored conforming to a rigid schema
(A toy sketch of the map/shuffle/reduce pattern follows below.)
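To make the three stages concrete, here is a toy, single-process sketch of the map / shuffle / reduce pattern in Python. It is illustrative only and does not use the Hadoop APIs; a real job would be written against Hadoop Streaming or the Java Mapper/Reducer interfaces.

# Toy single-process illustration of the map -> shuffle -> reduce data flow.
# This is NOT the Hadoop API; it only mimics the three stages described above.
from collections import defaultdict

def map_phase(record):
    # Embarrassingly parallel map stage: emit (key, value) pairs per input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # In Hadoop this is a cluster-wide distributed sort; here we just group by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregation reduce stage: one output record per key.
    return key, sum(values)

if __name__ == "__main__":
    records = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = (pair for record in records for pair in map_phase(record))
    for key, values in sorted(shuffle(pairs).items()):
        print(reduce_phase(key, values))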
30. Serialization/RPC formats
struct Observation {
  // can be general contig too
  1: required string chromosome,
  // python-style 0-based slicing
  2: required i64 start,
  3: required i64 end,
  // unique identifier for data set
  // (like UCSC genome browser track)
  4: required string track,
  // these are likely derived from the
  // track; separated for convenient join
  5: optional string experiment,
  6: optional string sample,
  // one of these should be non-null,
  // depending on the type of data
  7: optional string valueStr,
  8: optional i64 valueInt,
  9: optional double valueDouble
}
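For comparison, the same Observation record could be expressed as an Avro schema and serialized from Python. The sketch below uses the fastavro library and my own guess at a Thrift-to-Avro type mapping (i64 to long, optional fields as nullable unions); neither the library choice nor the mapping comes from the talk.

# Sketch (assumptions noted above): the Observation record as an Avro schema,
# written to a container file with fastavro.
from fastavro import parse_schema, writer

observation_schema = parse_schema({
    "name": "Observation",
    "type": "record",
    "fields": [
        {"name": "chromosome", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "track", "type": "string"},
        {"name": "experiment", "type": ["null", "string"], "default": None},
        {"name": "sample", "type": ["null", "string"], "default": None},
        {"name": "valueStr", "type": ["null", "string"], "default": None},
        {"name": "valueInt", "type": ["null", "long"], "default": None},
        {"name": "valueDouble", "type": ["null", "double"], "default": None},
    ],
})

records = [{
    "chromosome": "chr20", "start": 14369, "end": 14370, "track": "dbSNP",
    "experiment": None, "sample": None,
    "valueStr": "rs6054257", "valueInt": None, "valueDouble": None,
}]

with open("observations.avro", "wb") as out:
    writer(out, observation_schema, records)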
33. Parquet format advantages
- Columnar format
  - read fewer bytes
  - compression more efficient (incl. dictionary encodings)
- Thrift/Avro/Protobuf-compatible data model
  - Support for nested data structures
- Binary encodings
- Hadoop-friendly ("splittable"; implemented in Java)
- Predicate pushdown
- http://parquet.io/
(A small sketch of column pruning and predicate pushdown follows below.)
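As a small illustration of the columnar and predicate-pushdown points, here is a sketch using the pyarrow library (my choice for the example; the talk does not prescribe a client library). Column selection and the filters argument are the knobs that let the reader skip bytes.

# Sketch with pyarrow (library choice is an assumption): write a small table to
# Parquet, then read back only two columns with a simple pushed-down predicate.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "chromosome":  ["chr20", "chr20", "chr5"],
    "start":       [14370, 17330, 1110696],
    "track":       ["dbSNP", "study_A", "study_A"],
    "valueDouble": [0.5, 0.017, 0.333],
})

pq.write_table(table, "observations.parquet", compression="snappy")

# Column pruning: only the named columns are read and decoded.
# filters= pushes the predicate down so non-matching row groups can be skipped.
subset = pq.read_table(
    "observations.parquet",
    columns=["chromosome", "start"],
    filters=[("chromosome", "=", "chr20")],
)
print(subset)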
35. Core paradigm shifts with Hadoop
- Colocation of storage and compute
- Fault tolerance with cheap hardware
36. Benefits of Hadoop ecosystem
- Inexpensive commodity compute/storage
  - Tolerates random hardware failure
  - Decreased need for high-bandwidth network pipes
- Co-locate compute and storage
  - Exploit data locality
- Simple horizontal scalability by adding nodes
  - MapReduce jobs effectively guaranteed to scale
- Fault-tolerance/replication built-in. Data is durable
- Large ecosystem of tools
- Flexible data storage. Schema-on-read. Unstructured data.
38. HPC separates compute from storage
HPC is about compute. Hadoop is about data.
Compute cluster:
- High-performance hardware
- Low failure rate
- Expensive
Connected by a big network pipe ($$$) to the storage infrastructure:
- Proprietary, distributed file system
- Expensive
User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)
39. Hadoop colocates compute and storage
HPC is about compute. Hadoop is about data.
Compute cluster and storage infrastructure are the same nodes:
- Commodity hardware
- Data-locality
- Reduced networking needs
User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)
40. HPC is lower-level than Hadoop
- HPC only exposes job scheduling
- Parallelization typically occurs through MPI
  - Very low-level communication primitives
  - Difficult to horizontally scale by simply adding nodes
- Large data sets must be manually split
- Failures must be dealt with manually
- Hadoop has fault-tolerance, data locality, horizontal scalability
41. File system as DB; text file as LCD
- Broad joint caller with 25k genomes hits file handle limits
- Files streamed over network (HPC architecture)
- Large files split manually
- Sharing data/collaborating involves copying large files
42. Job scheduler as workflow tool
- Submitting jobs to a scheduler is very low level
- Workflow engines/execution models provide high-level execution graphs with fault-tolerance
  - e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive (a minimal Luigi sketch follows below)
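To give a flavor of the higher-level style, here is a minimal sketch using Luigi, one of the engines named above. The task names, file paths, and contents are invented for illustration; it is not the pipeline from the talk.

# Minimal Luigi sketch (task names and paths are hypothetical): the engine
# builds the dependency graph, tracks completed outputs, and reruns only what
# is missing, instead of hand-written scheduler submissions.
import luigi

class ExtractVariants(luigi.Task):
    sample = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("variants_%s.tsv" % self.sample)

    def run(self):
        with self.output().open("w") as out:
            out.write("chr20\t14370\trs6054257\n")  # stand-in for real extraction

class CountVariants(luigi.Task):
    sample = luigi.Parameter()

    def requires(self):
        return ExtractVariants(sample=self.sample)

    def output(self):
        return luigi.LocalTarget("variant_count_%s.txt" % self.sample)

    def run(self):
        with self.input().open() as inp, self.output().open("w") as out:
            out.write("%d\n" % sum(1 for _ in inp))

if __name__ == "__main__":
    luigi.build([CountVariants(sample="NA00001")], local_scheduler=True)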
43. Poor security/access models
- Deal with complex set of constraints from a variety of consents/redactions
  - Certain individuals redact certain parts of their genomes
  - Certain samples can only be used as controls for particular studies
  - Different research groups want to control access to the data they generate
  - Clinical trial data must have more rigorous access restrictions
44. Treating computation as free
- Many institutions make large clusters available for "free" to the average researcher
- Focus of dropping sequencing cost has been on biochemistry
45. Treating computation as free
- Stein, L. D. The case for cloud computing in genome informatics. Genome Biol (2010).
46. Treating computation as free
- Sboner et al. "The real cost of sequencing: higher than you think". Genome Biology (2011).
47. Lack of benchmarks for tracking progress
- Need to benchmark whether the quality of methods is improving
- http://www.nist.gov/mml/bbd/ppgenomeinabottle2.cfm
48. Lack of benchmarks for tracking progress
- Bradnam et al. "Assemblathon 2", Gigascience 2, 10 (2013).
49. Academic code
- Unreproducible, unbuildable, undocumented, unmaintained, unavailable, backward-incompatible, shitty code
- Most developers self-taught. Only one-third think formal training is important. [1, 2]
- "…people in my lab have requested code from authors and received source code with syntax errors in it" [3]
[1]: Haussler et al. "A Million Cancer Genome Warehouse" (2012)
[2]: Hannay et al. "How do scientists develop and use scientific software?" (2009)
[3]: http://ivory.idyll.org/blog/on-code-review-of-scientific-code.html
55. Move to Hadoop-style environment
- Data centralization on HDFS
- Data-local execution to avoid moving terabytes
- Higher-level execution engines to abstract away computations from details of execution
- Hadoop-friendly, evolvable serialization formats for:
  - Storage- and compute-efficiency
  - Abstracting data model from data storage details
- Built-in horizontal scalability and fault-tolerance
56. APIs instead of file formats
- Service-oriented architectures ensure stable contracts
- Allows for implementation changes with new technologies
- Software community has lots of experience with this type of architecture, along with mature tools.
- Can be implemented as language-independent.
(A minimal hypothetical sketch of such an API follows below.)
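To make the idea concrete, here is a minimal hypothetical HTTP endpoint written with Flask. The route, field names, and in-memory data are all invented for illustration; the point is only that clients code against a stable contract rather than a file format.

# Hypothetical sketch of "APIs instead of file formats": a tiny read-only
# variant endpoint. Route and field names are invented; the backing store is a
# stand-in for whatever engine sits behind the service.
from flask import Flask, jsonify, request

app = Flask(__name__)

VARIANTS = [
    {"chrom": "20", "pos": 14370, "id": "rs6054257", "ref": "G", "alt": "A"},
    {"chrom": "20", "pos": 17330, "id": None, "ref": "T", "alt": "A"},
]

@app.route("/variants")
def variants():
    # Clients depend on this contract, not on the storage behind it, so the
    # backend can change (VCF -> Parquet -> something else) without breaking them.
    chrom = request.args.get("chrom")
    start = int(request.args.get("start", 0))
    end = int(request.args.get("end", 2**31))
    hits = [v for v in VARIANTS if v["chrom"] == chrom and start <= v["pos"] < end]
    return jsonify(hits)

if __name__ == "__main__":
    app.run(port=8080)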
57. High-granularity access/common consent
1. Use technologies with highly-granular access control
   - e.g., Apache Accumulo, cell-based access control
2. Create common consents for patients to "donate" their data to research
   - e.g., Personal Genome Project, SAGE Portable Legal Consent, NCI "information donor"
58. Tools for open-source/reproducibility
- Software and computations should be open-sourced, e.g., on GitHub
- Release VMs or IPython notebooks with publications
  - "executable paper" to generate figures
  - Allow others to easily recompute all analyses
63. ADAM
- Defining alternative to BAM format that's Hadoop-friendly, splittable, designed for distributed computing
- Format built as Avro objects
- Data stored as Parquet format (columnar)
- Attempting to reimplement GATK pipeline to function on Hadoop/Parquet
- Currently run out of the AMPLab at UC Berkeley
65. Querying large, integrated variant data
- Biotech client has thousands of genomes
- Want to expose ad hoc querying functionality on a large scale
  - e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
- Integrating data with public data sets (e.g., ENCODE, UCSC tracks, dbSNP, etc.)
  - Terabyte-scale annotation sets
66. Conventional approaches: manual
- Manually parsing flat files
  - Write ad hoc scripts in perl or python
  - Build data structures in memory for histograms/aggregations
  - Custom script per query

counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try: counts_dict[chain.junction] += 1
    except KeyError: counts_dict[chain.junction] = 1
for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)
67. Conventional approaches: database
- Very feature rich and mature
  - Common analytical tasks (e.g., joins, group-by, etc.)
  - Access control
  - Very mature
- Scalability issues
- Indices can be prohibitive
- RDBMS: schemas can be annoyingly rigid
- NoSQL: adolescent implementations (but easy to start)
69. Hadoop sol'n: storage
- Impala/Hive metastore provide a unified, flexible data model
  - Define Avro types for all data
  - Data stored as Parquet format to maximize compression and query performance
70. Hadoop sol'n: available analytics engines
- Analytical operations implemented by experts in distributed systems
- Impala implements RDBMS-style operations
- Search offers metadata indexing
- Spark offers in-memory processing for ML
- HDFS-based analytical engines designed for horizontal scalability
72. Example schema
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
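Before the denormalization discussion on the following slides, here is a pure-Python sketch (simplified; not the production pipeline) that flattens the first VCF record above into one row per sample, roughly the wide shape the variant store queries against. Column names echo the SQL shown later but the parsing itself is my own illustration.

# Sketch (pure Python, simplified): flatten one VCF data line into one row per
# sample, the denormalized shape discussed on the following slides.
line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\t"
        "NS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t"
        "0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")
samples = ["NA00001", "NA00002", "NA00003"]

fields = line.split("\t")
chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]
fmt_keys = fmt.split(":")

rows = []
for sample, call in zip(samples, fields[9:]):
    row = {"vcf_chrom": chrom, "vcf_pos": int(pos), "id": var_id,
           "ref": ref, "alt": alt, "sample_id": sample}
    # Prefix genotype fields the way the warehouse columns do (vcf_call_gt, ...).
    row.update({"vcf_call_" + k.lower(): v
                for k, v in zip(fmt_keys, call.split(":"))})
    rows.append(row)

for r in rows:
    print(r)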
73. Example schema
(same VCF excerpt as slide 72, repeated)
74. Example schema
(same VCF excerpt as slide 72, repeated)
75. Why denormalization is good
- Replace joins with filters
  - For query engines with efficient scans, this simplifies queries and can improve performance
  - Parquet format supports predicate pushdowns, reducing necessary I/O
- Because storage is cheap, amortize cost of the up-front join over simpler queries going forward (a toy sketch of this trade follows below)
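A toy pandas sketch (illustrative only) of the trade described above: pay for the join once while building the wide table, then answer questions with plain filters.

# Toy pandas sketch: one up-front join produces the denormalized table; later
# queries are filters with no join at query time.
import pandas as pd

calls = pd.DataFrame({
    "snp_id":    ["rs6054257", "rs6040355", "rs6054257"],
    "sample_id": ["NA00001", "NA00002", "NA00003"],
    "genotype":  ["0|0", "1|2", "1/1"],
})
annotations = pd.DataFrame({
    "snp_id":   ["rs6054257", "rs6040355"],
    "chrom":    ["20", "20"],
    "in_dbsnp": [True, True],
})

# One-time, up-front join builds the wide table (amortized over many queries).
wide = calls.merge(annotations, on="snp_id", how="left")

# Subsequent queries are simple scans/filters over the wide table.
hits = wide[(wide["chrom"] == "20") & wide["in_dbsnp"]]
print(hits[["snp_id", "sample_id", "genotype"]])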
77. Example variant-filtering query
- "Give me all SNPs that are:
  - on chromosome 5
  - absent from dbSNP
  - present in COSMIC
  - observed in breast cancer samples
  - absent from prostate cancer samples"
- On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds
78. Example variant-filtering query
SELECT cosmic as snp_id,
       vcf_chrom as chr,
       vcf_pos as pos,
       sample_id as sample,
       vcf_call_gt as genotype,
       sample_affection as phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE COSMIC IS NOT NULL AND
      dbSNP IS NULL AND
      sample_study = "breast_cancer" AND
      VCF_CHROM = "16";
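One way to submit this query from Python is the impyla DB-API client, sketched below; the host name is a placeholder, the table name is copied from the slide, and the client choice is mine rather than something specified in the talk.

# Sketch: running the query above through the impyla DB-API client
# (host is a placeholder; 21050 is the usual Impala daemon HiveServer2 port).
from impala.dbapi import connect

QUERY = """
SELECT cosmic AS snp_id, vcf_chrom AS chr, vcf_pos AS pos,
       sample_id AS sample, vcf_call_gt AS genotype,
       sample_affection AS phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE cosmic IS NOT NULL
  AND dbsnp IS NULL
  AND sample_study = 'breast_cancer'
  AND vcf_chrom = '16'
"""

conn = connect(host="impala-daemon.example.org", port=21050)
cur = conn.cursor()
cur.execute(QUERY)
for row in cur.fetchall():
    print(row)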
79. Impala execution
- Query compiled into an execution tree, chopped up across all nodes (if possible)
- Two join implementations (a toy sketch follows below):
  1. Broadcast: each node gets a copy of the full right table
  2. Shuffle: both sides of the join are partitioned
- Partitioned tables vastly reduce the amount of I/O
- File formats make an enormous difference in query performance
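To illustrate the two join strategies named above, here is a toy pure-Python sketch; it mimics the idea only, not Impala's actual execution engine, joining call records to COSMIC annotations on snp_id.

# Toy sketch of broadcast vs. shuffle joins (conceptual; not Impala internals).
from collections import defaultdict

left = [("rs6054257", "NA00001"), ("rs6040355", "NA00002")]   # (snp_id, sample)
right = [("rs6054257", "COSM123"), ("rs999", "COSM456")]      # (snp_id, cosmic_id)

# Broadcast join: every node receives the full right table, builds a hash map,
# and probes it with its local slice of the left table.
right_by_key = defaultdict(list)
for key, cosmic in right:
    right_by_key[key].append(cosmic)
broadcast_result = [(key, sample, cosmic)
                    for key, sample in left
                    for cosmic in right_by_key.get(key, [])]

# Shuffle join: both sides are hash-partitioned on the join key so matching
# keys land on the same node; two "nodes" are simulated with buckets here.
def partition(rows, n=2):
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[0]) % n].append(row)
    return buckets

shuffle_result = []
for l_bucket, r_bucket in zip(partition(left), partition(right)):
    lookup = defaultdict(list)
    for key, cosmic in r_bucket:
        lookup[key].append(cosmic)
    shuffle_result.extend((key, sample, cosmic)
                          for key, sample in l_bucket
                          for cosmic in lookup.get(key, []))

print(broadcast_result)
print(shuffle_result)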
80. Other desirable query examples
- "How do the mutations in a given subject compare to the mutations in other phenotypically similar subjects?"
- "For a given gene, in what pathways and cancer subtypes is it involved?" (connecting phenotypes to annotations)
- "How common are an observed set of mutations?"
- "For a given type of cancer, what are the characteristic disruptions?"
81. Types of queries desired
- Lots of these queries can be simply translated into SQL queries
- Similar to functionality provided by PLINK/SEQ, but designed to scale to much larger data sets
82. All-vs-all eQTL
- Possible to generate trillions of hypothesis tests
  - 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
  - Tested below on 120 billion associations
- Example queries:
  - "Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)"
    - Finishes in several seconds
  - "Find all cis-eQTLs across the entire genome"
    - Finishes in a couple of minutes
    - Limited by disk throughput
83. All-vs-all eQTL
- "Find all SNPs that are:
  - in LD with some lead SNP or eQTL of interest
  - align with some functional annotation of interest"
- Still in testing, but likely finishes in seconds
- Schaub et al, Genome Research, 2012
84. Conclusions
- Hadoop ecosystem provides a centralized, scalable repository for data
- An abundance of tools for providing views/analytics into the data store
  - Separate implementation details from data pipelines
- Software quality/data structures/file formats matter
- Genomics has much to gain from moving away from HPC architecture toward Hadoop ecosystem architecture
85. Cloud-based implementation
- Hadoop-ecosystem architecture easily translates to the cloud (AWS, OpenStack)
- Provides elastic capacity; no large initial CAPEX
- Risk of vendor lock-in once data set is large
- Allows simple sharing of data via public S3 buckets, for example
86. Future work
- Broad Institute has experimented with Google's BigQuery for a variant store
  - BigQuery is Google's Dremel exposed to the public on Google's cloud
  - Closed-source, only Google cloud
- Developed API for working with variant data
- Soon develop Impala-backed implementation of Broad API
  - To be open-sourced
87. Future work
- Drive towards several large data warehouses; storage backend optimized for particular access patterns
- Each can expose one or more APIs for different applications/access levels.
- Haussler, D. et al. A Million Cancer Genome Warehouse. (2012). Tech Report.
Industry had at least some of these problems. Here’s how they solved them.
Already mature technologies at this point. DB community thought it was silly. Non-Google companies were not yet at this scale. Google not in the business of releasing infrastructure software. They sell ads.
Mostly through the Apache Software Foundation
Talk HDFS and MapReduce. Then some other tools.
Community is coalescing around HDFS
Large blocks. Blocks replicated around.
Two functions required.
Only need to supply 2 functions
Software: Cloudera Enterprise – The Platform for Big Data
- A complete data management solution powered by Apache Hadoop. A collection of open source projects forms the foundation of the platform; Cloudera has wrapped the open source core with additional software for system and data management as well as technical support.
5 Attributes of Cloudera Enterprise:
- Scalable: storage and compute in a single system – brings computation to data (rather than the other way around); scale capacity and performance linearly – just add nodes; proven at massive scale – tens of PB of data, millions of users.
- Flexible: store any type of data (structured, unstructured, semi-structured) in its native format – no conversion required, no loss of data fidelity due to ETL. Fluid structuring: no single model or schema that the data must conform to; determine how you want to look at data at the time you ask the question – if the attribute exists in the raw data, you can query against it; alter structure to optimize query performance as desired (not required) – multiple open source file formats like Avro and Parquet. Multiple forms of computation – bring different tools to bear on the data, depending on your skillset and what you want to do: batch processing (MapReduce, Hive, Pig, Java), interactive SQL (Impala, BI tools), interactive search (for non-technical users, or helping to identify datasets for further analysis), machine learning (apply algorithms to large datasets using libraries like Apache Mahout), math (tools like SAS and R for data scientists and statisticians), and more to come.
- Cost-Effective: scale out on inexpensive, industry-standard hardware (vs. highly tuned, specialized hardware); fault tolerance built in; leverage cost structures with existing vendors; reduced data movement – can perform more operations in a single place due to flexible tooling; fewer redundant copies of data; less time spent migrating/managing; open source software is easy to acquire and to prove the value/ROI.
- Open: rapid innovation; large development communities; the most talented engineers from across the world; easy to acquire and prove value – free to download and deploy, demonstrate the value of the technology before you make a large-scale investment; no vendor lock-in – choose your vendor based solely on merit. Cloudera's open source strategy: if it stores or processes data, it's open source; big commitment to open source; leading contributor to the Apache Hadoop ecosystem – defining the future of the platform together with the community.
- Integrated: works with all your existing investments – databases and data warehouses, analytics and BI solutions, ETL tools, platforms and operating systems, hardware and networking equipment; over 700 partners including all of the leaders in the market segments above; complements those investments by allowing you to align data and processes to the right solution.
Up until a couple years ago, Hadoop was just MapReduce.
Designed to do the task like BigQuery or Teradata
Interface description language
Compare with UCSC file formats:
Interface description language
Interface description language
Show more general version later: ADAM
I don’t mean to indict anyone in particular.
Not so much a sin, but 15 year old architecture.
If you have a gun-fight in the data center, your MapReduce job will still finish.
Log scale.
Define ETL
Trying to work within the Global Alliance, along with MSSM, Broad, others.
Once you have a VCF, what to do with it? True that the data has compressed massively, but it can still get large.
Data here is already denormalized. Kicked off the trajectory that landed me at Cloudera.
My first MapReduce job was written in MongoDB’s JavaScript aggregation engine.