Genomics Is Not Special: Data Challenges in Biology

Genomics Is Not Special
Uri Laserson // laserson@cloudera.com // 13 November 2014
Toward Data-Intensive Biology

© 2014 Cloudera, Inc. All rights reserved.
http://omicsmaps.com/
>25 Pbp / year

Carr and Church, Nat. Biotech. 27: 1151 (20

For every “-ome” there’s a “-seq”
Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq
http://liorpachter.wordpress.com/seq/

Based on IMGT/LIGM release 201111

Developer/computational efficiency becoming paramount
Genome Biology 12: 125 (2011)

Software and data management have been around since the 1970s
• Version control/reproducibility
• Testing/automation/integration
• Databases and data formats
• API design
• Lots (most?) of big data innovation happening in industry

Example query
For each variant that is
• overlapping a DNase HS site
• predicted to be deleterious
• absent from dbSNP
compute the MAF by subpopulation
using samples in the Framingham Heart Study
CHR  POS       REF  ALT  POP           MAF    POLYPHEN
7    12289237  A    G    Plain         0.01   possibly damaging
7    12289237  A    G    Star-bellied  0.03   possibly damaging
12   2288332   T    C    Plain         0.003  probably damaging
12   2288332   T    C    Star-bellied  0.09   probably damaging
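
The MAF column in a result like this can be derived from per-sample genotype calls. A minimal sketch, assuming biallelic sites and VCF-style GT strings such as `0|1`; `compute_maf` is an illustrative helper written for this example, not a function from any library:

```python
import re

def compute_maf(genotypes):
    """Minor allele frequency for a biallelic site.

    genotypes: iterable of VCF-style GT strings such as '0|1' or '1/1'.
    Missing alleles ('.') are skipped. Returns None if nothing was called.
    """
    alt = 0
    total = 0
    for gt in genotypes:
        for allele in re.split(r"[|/]", gt):
            if allele == ".":
                continue
            total += 1
            if allele != "0":
                alt += 1
    if total == 0:
        return None
    freq = alt / total
    # The minor allele is whichever of ref/alt is rarer.
    return min(freq, 1.0 - freq)
```

Grouping the calls by subpopulation before calling the function gives one MAF per (variant, population) pair, which is the shape of the table above.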

Available data
Data set                  Format            Size
Population genotypes      VCF               10-100s of billions
DNase HS sites (ENCODE)   narrowPeak (BED)  <1 million
dbSNP                     CSV               10s of millions
Sample phenotypes         JSON              thousands

Why text data is a bad idea
• Text is highly inefficient
• Compresses poorly
• Values must be parsed
• Text is semi-structured at best
• Flexible schemas make parsing difficult
• Difficult to make assumptions on data structure
• Text poorly separates the roles of delimiters and data
• Requires escaping of control characters
• (ASCII actually includes dedicated separators, record separator RS 0x1E and unit separator US 0x1F, but they’re never used)
• But still almost always better than Excel
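
To illustrate the delimiter point, here is a small sketch that uses the ASCII unit and record separators as delimiters. The function names and the sample rows are made up for this example. Because commas, quotes, and newlines are ordinary data here, no quoting or escaping layer is needed; the only restriction is that fields may not contain the two separator bytes themselves.

```python
US = "\x1f"  # ASCII unit separator: placed between fields
RS = "\x1e"  # ASCII record separator: placed between records

def dumps(rows):
    """Serialize rows of string fields without quoting or escaping."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)

def loads(text):
    """Inverse of dumps for a non-empty list of rows."""
    return [record.split(US) for record in text.split(RS)]

rows = [
    ["id1", "possibly damaging, per prediction"],  # comma inside a field
    ["id2", 'line one\nline "two"'],               # newline and quotes
]
round_tripped = loads(dumps(rows))
```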

Some reasons VCF in particular is bad
• Number of records (variants) grows with new variants, rather than new genotypes
• difficult to write data
• adding a sample requires rewrite of entire file
• Data must be sorted
• Semi-structured: need to build a parser for each file
• Conflates two functions:
• catalogue of variation
• repository of actual observed genotypes
• If gzipped, it’s not splittable
• Variants are not encoded uniquely by the VCF spec
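
The uniqueness problem can be made concrete: the same substitution may be written with or without padding bases, so two files can describe one variant with different POS/REF/ALT values. A minimal sketch of allele trimming, assuming a single ALT allele; `trim_alleles` is a name chosen for this example, and the sketch does not left-align indels against a reference sequence, which full normalization would also need.

```python
def trim_alleles(pos, ref, alt):
    """Strip bases shared by REF and ALT, keeping at least one base in each."""
    # Trim the shared suffix first; the position is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, advancing the position for each base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

padded = trim_alleles(100, "CAG", "CTG")   # same SNV, written with padding
minimal = trim_alleles(101, "A", "T")      # already minimal
deletion = trim_alleles(10, "GCA", "GA")   # deletion keeps its anchor base
```

After trimming, the padded and minimal forms compare equal, so they can serve as a join key.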

Manually executing query in Python
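import json

# parse_feature and is_framingham are helpers assumed to be defined elsewhere.

class IntervalTree(object):
    def update(self, feature):
        pass  # ...implement tree update

    def overlaps(self, feature):
        # ...implement overlap query; should return True or False
        raise NotImplementedError

# Index the DNase HS sites for overlap queries
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
    for line in ip:
        feature = parse_feature(line)
        dnase_sites.update(feature)

# Load the Framingham samples (one JSON object per line)
samples = {}
with open('path/to/samples.json', 'r') as ip:
    for line in ip:
        sample = json.loads(line)
        if is_framingham(sample):
            samples[sample['name']] = sample

# Load dbSNP keys; the file is CSV, so split on commas
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
    for line in ip:
        snp = tuple(line.rstrip('\n').split(',')[:3])
        dbsnp.add(snp)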

Additional metadata must fit in memory

Can only read from POSIX filesystem

import vcf  # PyVCF

# is_deleterious, in_dbsnp, and compute_maf are helpers assumed to be defined elsewhere.
genotype_data = {}
reader = vcf.Reader(filename='path/to/genotypes.vcf')
for variant in reader:
    # Filter on the variant itself before looking at per-sample calls
    if (dnase_sites.overlaps(variant) and is_deleterious(variant)
            and not in_dbsnp(variant)):
        for call in variant.samples:
            if call.sample in samples:
                pop = samples[call.sample]['population']
                genotype_data.setdefault((variant, pop), []).append(call)

# Aggregate the collected calls per (variant, population)
mafs = {}
for (variant, pop), calls in genotype_data.items():
    mafs[(variant, pop)] = compute_maf(calls)

Genotype data may be split across files

• If file is gzipped, cannot split file without decompressing (use Snappy)
• Reading files requires access to a POSIX-style file system
• Probably want to split VCF file into pieces to parallelize
• Requires manual scatter-gather
• Samples may be scattered among multiple VCF files (difficult to append to VCF)
• Manually implementing broadcast join
• Build side must fit into memory
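
The broadcast join mentioned above can be sketched in a few lines: the small ("build") side is indexed in memory and the large side is streamed past it. This is an illustrative sketch with made-up rows; `broadcast_join` is a name chosen here, not a library function.

```python
def broadcast_join(stream, build_rows, key):
    """Hash join: index the small side in memory, stream the large side."""
    index = {}
    for row in build_rows:  # the build side must fit into memory
        index.setdefault(row[key], []).append(row)
    for row in stream:      # the large side is never held in memory
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)
            yield merged

samples = [
    {"sample": "s1", "pop": "Plain"},
    {"sample": "s2", "pop": "Star-bellied"},
]
calls = [
    {"sample": "s1", "gt": "0|1"},
    {"sample": "s3", "gt": "1|1"},  # no matching sample: dropped (inner join)
    {"sample": "s2", "gt": "0|0"},
]
joined = list(broadcast_join(calls, samples, "sample"))
```

In the manual script the `samples` dict and the `dbsnp` set play the role of `index`, which is why they are limited by the memory of a single machine.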

Manually executing query in Python on HPC
$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h python merge_maf.py

Manually executing query in Python on HPC
• How to serialize intermediate output?
• Manually specify requested resources
• Manually split and merge
• Babysit and check for errors/failures
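
A sketch of what the split and merge steps have to do by hand. It assumes each record is a (variant key, alt allele count, total allele count) triple, and it uses threads as a stand-in for cluster jobs; all names are illustrative. The point about intermediate output is that each piece should emit counts rather than frequencies, because counts can be summed across pieces while MAFs cannot.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split step: cut the input into fixed-size pieces."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def count_alleles(chunk):
    """Scatter step: partial alt/total allele counts per variant key."""
    counts = Counter()
    for key, alt, total in chunk:
        counts[(key, "alt")] += alt
        counts[(key, "total")] += total
    return counts

def scatter_gather(records, chunk_size=2, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_alleles, chunked(records, chunk_size)))
    merged = Counter()
    for partial in partials:  # gather step: counts add, frequencies do not
        merged.update(partial)
    keys = {key for key, _ in merged}
    return {key: merged[(key, "alt")] / merged[(key, "total")] for key in keys}

records = [("v1", 1, 2), ("v1", 0, 2), ("v2", 2, 2), ("v1", 1, 2)]
frequencies = scatter_gather(records)
```

Nothing here handles a failed piece; on a scheduler that bookkeeping is also left to the user.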

HPC separates compute from storage
HPC is about compute.
Hadoop is about data.
Storage infrastructure
• Proprietary, distributed file system
• Expensive
Compute cluster
• High-perf, reliable hardware
• Expensive
Big network pipe ($$$) between storage and compute
User typically works by manually submitting jobs to scheduler (e.g., LSF, Grid Engine, etc.)

HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically through MPI
• Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually

HPC uses file system as DB; text file as lowest common denominator
• All tools assume flat files with POSIX semantics
• Sharing data/collaboration involves copying large files
• Broad joint caller with 25k genomes hits file handle limits
• Files always streamed over network (HPC architecture)

HPC uses job scheduler as workflow tool
• Submitting jobs to scheduler is low level
• Workflow engines/execution models provide high-level execution graphs with built-in fault tolerance
• e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
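
To show what an execution graph with fault tolerance adds over raw job submission, here is a toy runner that orders tasks by their dependencies and retries a task that fails. It is a sketch only: it runs tasks serially in one process, `run_workflow` and the task names are invented for this example, and it relies on `graphlib` from the Python 3.9+ standard library. The engines listed above do far more (distribution, data locality, recovery of partial results).

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_workflow(tasks, deps, max_attempts=3):
    """Run callables in dependency order, retrying each failed task.

    tasks: name -> zero-argument callable
    deps:  name -> set of task names that must finish first
    """
    order = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
        order.append(name)
    return order

attempts = {"query": 0}

def flaky_query():
    """Fails once, then succeeds, to simulate a transient node failure."""
    attempts["query"] += 1
    if attempts["query"] < 2:
        raise RuntimeError("simulated node failure")

tasks = {"split": lambda: None, "query": flaky_query, "merge": lambda: None}
deps = {"split": set(), "query": {"split"}, "merge": {"query"}}
order = run_workflow(tasks, deps)
```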

Prepping data for local analysis in R/Python
• Manual script to prepare CSV file for working locally
• Same issues as above
• Requires working set of data to fit into memory of a single machine
• Visualization

Domain-specific tools (e.g., PLINK/Seq)
$ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp
• one of a limited set of specific, useful tasks
• (yet another) custom query specification

Domain-specific tools (e.g., PLINK/Seq)
• Works great if your problem fits into the pre-designed computations
• Only works if your problem fits into the pre-designed computations
• How to do stats by subpopulation?
• Probably possible, but need to learn new notation
• Must do work to get data in to begin with
• Not obviously parallelizable for performance on large data sets
• Built on SQLite underneath

RDBMS and SQL (e.g., MySQL)
-- MAF() is assumed to be a user-defined aggregate function
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
FROM genotypes g
INNER JOIN samples s
    ON g.sample = s.sample
INNER JOIN dnase d
    ON g.chr = d.chr
    AND g.pos >= d.start
    AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
    ON g.chr = p.chr
    AND g.pos = p.pos
    AND g.ref = p.ref
    AND g.alt = p.alt
WHERE
    s.study = "framingham" AND
    p.pos IS NULL AND
    g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop

RDBMS and SQL (e.g., MySQL)
• Feature-rich and very mature
• Highly optimized and allows indexing
• Declarative (and abstracted) language for data
• Hassle to get data in; data end up formatted one way
• No clear scalability story
• SQL-only

Problems with old way
• Expensive
• No fault-tolerance
• No horizontal scalability
• Poor separation of data modeling and storage formats
• File format proliferation
• Inefficient text formats

Indexing the web
• Web is huge
• Hundreds of millions of pages in 1999
• How do you index it?
• Crawl all the pages
• Rank pages based on relevance metrics
• Build search index of keywords to pages
• Do it in real time!

Databases in 1999
• Buy a really big machine
• Install expensive DBMS on it
• Point your workload at it
• Hope it doesn’t fail
• Ambitious: buy another big machine as backup

Database limitations
• Didn’t scale horizontally
• High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
• Complex analysis (PageRank)
• Unstructured data

Google does something different
• Designed their own storage and processing infrastructure
• Google File System (GFS) and MapReduce (MR)
• Goals: cheap, scalable, reliable
• General framework for large-scale batch computation
• Powered Google Search for many years
• Still used internally to this day (millions of jobs)

Google benevolent enough to publish
• 2003: Google File System paper
• 2004: MapReduce paper

Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant

Open-source proliferation
Google           Open-source     Function
GFS              HDFS            Distributed file system
MapReduce        MapReduce       Batch distributed data processing
Bigtable         HBase           Distributed DB/key-value store
Protobuf/Stubby  Thrift or Avro  Data serialization/RPC
Pregel           Giraph          Distributed graph processing
Dremel/F1        Impala          Scalable interactive SQL (MPP)
FlumeJava        Crunch          Abstracted data pipelines on Hadoop

Hadoop provides:
• Data centralization on HDFS
• No rewriting data for each tool/application
• Data-local execution to avoid moving terabytes
• High-level execution engines
• SQL (Impala, Hive)
• Relational algebra (Spark, MapReduce)
• Bulk synchronous parallel (GraphX)
• Distributed in-memory
• Built-in horizontal scalability and fault-tolerance
• Hadoop-friendly, evolvable serialization formats/RPC

Hadoop provides serialization/RPC formats (Avro)
• Specify schemas/services in user-friendly IDLs
• Code-generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Support for schema evolution
• Like binary JSON

record Feature {
  union { null, string } featureId = null;
  union { null, string } featureType = null;  // e.g., DNase HS
  union { null, string } source = null;       // e.g., BED, GFF file
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, Strand } strand = null;
  union { null, double } value = null;
  array<Dbxref> dbxrefs = [];
  array<string> parentIds = [];
  map<string> attributes = {};
}
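
The defaults in a schema like this are what make schema evolution work: a reader with a newer schema can still consume records written before a field existed. The following is a simplified pure-Python model of that resolution rule, not the Avro library itself; `resolve`, `REQUIRED`, and the example records are invented for illustration.

```python
REQUIRED = object()  # marks a reader field that has no default

def resolve(record, reader_fields):
    """Project a decoded record onto the reader's fields.

    reader_fields: field name -> default value, or REQUIRED.
    Fields unknown to the reader are dropped; missing fields take the default.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is REQUIRED:
            raise ValueError("no value or default for field %r" % name)
        else:
            resolved[name] = default
    return resolved

# Written with an older schema that had no 'strand' field,
# plus a field the current reader no longer knows about.
old_record = {"featureId": "peak1", "start": 100, "end": 250, "legacy": "x"}
reader_v2 = {"featureId": None, "start": None, "end": None, "strand": None}
upgraded = resolve(old_record, reader_v2)
```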

APIs instead of file formats
• Service-oriented architectures (SOA) ensure stable contracts
• Allows for implementation changes with new technologies
• Software community has lots of experience with SOA, along with mature tools
• Can be implemented in language-independent fashion

Current file format hairball

API-oriented architecture

Hadoop provides columnar storage (Parquet)
• Designed for general data storage
• Columnar format
• read fewer bytes
• compression more efficient
• Splittable
• Avro/Thrift-compatible
• Predicate pushdown
• RLE, dictionary-encoding
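
Dictionary encoding and run-length encoding (RLE) explain why columnar data compresses well: values within one column are similar and often repeated. A toy sketch of both techniques on a made-up chromosome column; this shows the idea only and is not Parquet's actual on-disk encoding.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code."""
    codes = {}
    encoded = []
    for value in values:
        # First occurrence gets the next free code; repeats reuse it.
        encoded.append(codes.setdefault(value, len(codes)))
    return list(codes), encoded

def run_length_encode(codes):
    """Collapse consecutive repeats into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs

def decode(dictionary, runs):
    """Expand the runs and map codes back to the original values."""
    return [dictionary[code] for code, length in runs for _ in range(length)]

column = ["chr7", "chr7", "chr7", "chr12", "chr12", "chr7"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
```

Sorted or clustered columns produce long runs, which is one reason sorted genomic data suits this layout.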

(Parquet)
Vertical partitioning (projection pushdown) + horizontal partitioning (predicate pushdown) = read only the data you need!
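
The two kinds of pushdown can be sketched with an in-memory stand-in for a columnar file. The sketch assumes data is stored as row groups, each holding one list per column; the values are made up and `scan` is a name chosen here. A real Parquet reader would take the min/max statistics from file metadata instead of computing them, and would skip the bytes on disk.

```python
row_groups = [
    {"chrom": ["7", "7"], "pos": [100, 200], "maf": [0.01, 0.03]},
    {"chrom": ["12", "12"], "pos": [300, 400], "maf": [0.003, 0.09]},
]

def scan(row_groups, columns, column, low, high):
    """Return the requested columns for rows with low <= row[column] <= high."""
    rows = []
    groups_read = 0
    for group in row_groups:
        values = group[column]
        # Predicate pushdown: skip the whole group using min/max statistics.
        if max(values) < low or min(values) > high:
            continue
        groups_read += 1
        for i, value in enumerate(values):
            if low <= value <= high:
                # Projection pushdown: touch only the requested columns.
                rows.append(tuple(group[name][i] for name in columns))
    return rows, groups_read

rows, groups_read = scan(row_groups, ["chrom", "maf"], "pos", 250, 350)
```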

Hadoop provides abstractions for data processing
• HDFS (scalable, distributed storage)
• YARN (resource management)
• MapReduce, Impala (SQL), Solr (search), Spark
• ADAM, quince, guacamole, …
• bdg-formats (Avro/Parquet)

Hadoop examples: filesystem
[laserson@bottou01-10g ~]$ hadoop fs -ls /user/laserson
Found 16 items
drwx------ - laserson laserson 0 2014-11-12 16:00 .Trash
drwxr-xr-x - laserson laserson 0 2014-11-12 00:29 .sparkStaging
drwx------ - laserson laserson 0 2014-06-07 13:27 .staging
drwxr-xr-x - laserson laserson 0 2014-10-30 14:15 1kg
drwxr-xr-x - laserson laserson 0 2014-05-08 17:29 bigml
drwxr-xr-x - laserson laserson 0 2014-10-30 14:14 book
drwxrwxr-x - laserson laserson 0 2014-06-16 12:59 editing
drwxr-xr-x - laserson laserson 0 2014-06-06 13:49 gdelt
-rw-r--r-- 3 laserson laserson 0 2014-10-27 16:24 hg19_text
drwxr-xr-x - laserson laserson 0 2014-06-12 19:53 madlibport
drwxr-xr-x - laserson laserson 0 2014-03-20 18:09 rock-health-python
drwxr-xr-x - laserson laserson 0 2014-05-15 13:25 test-udf
drwxr-xr-x - laserson laserson 0 2014-08-21 17:58 test_pymc
drwxr-xr-x - laserson laserson 0 2014-10-27 22:25 tmp
drwxr-xr-x - laserson laserson 0 2014-10-07 20:30 udf-scratch
drwxr-xr-x - laserson laserson 0 2014-03-02 13:50 udfs

Hadoop examples: batch MapReduce job
hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar \
    com.cloudera.science.vcf2parquet.VCFtoParquetDriver \
    hdfs:///path/to/variants.vcf \
    hdfs:///path/to/output.parquet

Hadoop examples: interactive Spark shell
[laserson@bottou01-10g ~]$ spark-shell --master yarn
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
[...]
scala>

Hadoop examples: interactive Spark shell
// isFramingham, getPopulation, computeMAF, and parseJson are assumed to be defined elsewhere
def inDbSnp(g: Genotype): Boolean = ???       // membership test against dbsnp
def isDeleterious(g: Genotype): Boolean = ??? // e.g., based on g.getPolyPhen

// Small side tables are collected to the driver
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()

// Large data sets stay distributed as RDDs
val genotypesRDD = sc.adamLoad("path/to/genotypes")
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")

val filteredRDD = genotypesRDD
  .filter(!inDbSnp(_))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))
  .saveAsNewAPIHadoopFile("path/to/output")


Genomics ETL
.fastq → short read alignment → .bam → genotype calling → .vcf (plus .bed/.gtf/etc) → analysis

Hadoop variant store architecture
• Clients: Impala shell (SQL), REST API, JDBC
• SQL query → Impala engine (with Hive metastore) → result set
• ETL: .vcf → .parquet

Data denormalization
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
• Amortize join cost up-front
• Replace joins with predicates (allowing predicate pushdown)
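
One way to picture the denormalization step is to flatten a VCF data line into one row per (variant, sample), carrying annotations along as ordinary columns. A minimal sketch, assuming tab-delimited input and a single FORMAT string per line; `flatten` is an illustrative helper, and treating the ID column as the dbSNP annotation is an assumption made for this example.

```python
def flatten(header_line, data_line):
    """Yield one flat dict per (variant, sample) from tab-delimited VCF lines."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = data_line.rstrip("\n").split("\t")
    fixed = dict(zip(columns[:9], fields[:9]))  # CHROM .. FORMAT
    keys = fixed["FORMAT"].split(":")
    for sample, value in zip(columns[9:], fields[9:]):
        call = dict(zip(keys, value.split(":")))
        yield {
            "chrom": fixed["CHROM"],
            "pos": int(fixed["POS"]),
            "ref": fixed["REF"],
            "alt": fixed["ALT"],
            # Assumption: a missing ID ('.') means the variant is not in dbSNP.
            "dbsnp": None if fixed["ID"] == "." else fixed["ID"],
            "sample": sample,
            "gt": call.get("GT"),
        }

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003"
line = ("20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT:GQ:DP:HQ"
        "\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3")
rows = list(flatten(header, line))
novel = [row for row in rows if row["dbsnp"] is None]
```

With the annotation stored on every row, "absent from dbSNP" becomes the predicate `dbsnp is None` rather than a join against a second table.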

Hadoop solution characteristics
• Data stored as Parquet columnar format for performance and compression
• Impala/Hive metastore provide unified, flexible data model
• Impala implements RDBMS-style operations (by experts in distributed systems)
• Spark offers flexible relational algebra operators (and in-memory computing)
• Built-in fault tolerance for computations and horizontal scalability

Example variant-filtering query
• “Give me all SNPs that are:
• chromosome 16
• absent from dbSNP
• present in COSMIC
• observed in breast cancer samples”
• On full 1000 Genomes data set
• ~37 billion genotypes
• 14 node cluster
• query completion in several seconds
SELECT cosmic as snp_id,
    vcf_chrom as chr,
    vcf_pos as pos,
    sample_id as sample,
    vcf_call_gt as genotype,
    sample_affection as phenotype
FROM
    hg19_parquet_snappy_join_cached_partitioned
WHERE
    COSMIC IS NOT NULL AND
    dbSNP IS NULL AND
    sample_study = "breast_cancer" AND
    VCF_CHROM = "16";

Other queries/use cases
• All-vs-all eQTL integrated with ENCODE
• >120 billion p-values
• “Top 20 eQTLs for 5 genes of interest”: interactive
• “Find all cis-eQTLs”: several minutes
• Population genetics queries (e.g., backend for PLINK)
• Interval arithmetic on large ENCODE data sets
• Duke CHGV
• ATAV DSL for preparing data for GWAS
• Week-long queries now take a few hours by parallelizing on Spark

Computational biologists are reinventing the wheel
• e.g., CRAM (columnar storage)
• e.g., workflow managers (Galaxy)
• e.g., GATK (scatter-gather)

Large-scale data analysis has been solved*
• Cheaper in terms of hardware
• Easier in terms of productivity
• Built-in horizontal scaling
• Built-in fault tolerance
• Layered abstractions for data modeling
• Hadoop!

Science on Hadoop
• ADAM project for genomics on Spark
• http://bdgenomics.org/
• Guacamole for somatic variation on Spark
• https://github.com/hammerlab/guacamole/
• Thunder project for neuroimaging on Spark
• http://thefreemanlab.com/thunder/
• Quince for variant store on Impala
• currently barebones, but with examples
• https://github.com/laserson/quince

Suggestions/resources
• Everyone should learn Python
• (also, everyone should try some experiments)
• Everyone should use version control (e.g., git)
• GitHub enables easy collaboration
• See Titus Brown’s blog
• Use the IPython Notebook (Jupyter) for productivity
• Big data is often about engineering; use the best tools
• For getting industry jobs:
• Show people you know how to code: put your projects on GitHub
• You should feel lucky if others start using your code

Acknowledgements
• Cloudera
• Sandy Ryza (Spark development)
• Nong Li (Impala)
• Skye Wanderman-Milne (Impala)
• Impala genomics collaborators
• Kiran Mukhyala
• Slaton Lipscomb
• ADAM project
• Matt Massie
• Frank Nothaft
• Timothy Danford
• Mount Sinai School of Medicine
• Jeff Hammerbacher (+ lab)
• Duke CHGV
• Jonathan Keebler
