SlideShare a Scribd company logo
1 of 70
Download to read offline
Genomics Is
Not Special
Uri Laserson // laserson@cloudera.com // 13 November 2014
Toward Data-Intensive Biology
2© 2014 Cloudera, Inc. All rights reserved.
http://omicsmaps.com/
>25 Pbp / year
3© 2014 Cloudera, Inc. All rights reserved.
Carr and Church, Nat. Biotech. 27: 1151 (20
4© 2014 Cloudera, Inc. All rights reserved.
For every “-ome” there’s a “-seq”
Genome DNA-seq
Transcriptome
RNA-seq
FRT-seq
NET-seq
Methylome Bisulfite-seq
Immunome Immune-seq
Proteome
PhIP-seq
Bind-n-seq
http://liorpachter.wordpress.com/seq/
5© 2014 Cloudera, Inc. All rights reserved.
For every “-ome” there’s a “-seq”
Genome DNA-seq
Transcriptome
RNA-seq
FRT-seq
NET-seq
Methylome Bisulfite-seq
Immunome Immune-seq
Proteome
PhIP-seq
Bind-n-seq
http://liorpachter.wordpress.com/seq/
6© 2014 Cloudera, Inc. All rights reserved.
Based on IMGT/LIGM release 201111
7© 2014 Cloudera, Inc. All rights reserved.
8© 2014 Cloudera, Inc. All rights reserved.
9© 2014 Cloudera, Inc. All rights reserved.
Developer/computational efficiency becoming
paramount
Genome Biology 12: 125 (2011)
10© 2014 Cloudera, Inc. All rights reserved.
Software and data management around
since 1970s
• Version control/reproducibility
• Testing/automation/integration
• Databases and data formats
• API design
• Lots (most?) of big data innovation happening in industry
11© 2014 Cloudera, Inc. All rights reserved.
Example query
For each variant that is
• overlapping a DNase HS site
• predicted to be deleterious
• absent from dbSNP
compute the MAF by subpopulation
using samples in Framing Heart Study
PARTNER LOGO
CH
R
POS RE
F
AL
T
POP MAF POLYPHEN
7 122892
37
A G Plain 0.01 possibly
damaging
7 122892
37
A G Star-
bellied
0.03 possibly
damaging
12 228833
2
T C Plain 0.00
3
probably
damaging
12 228833
2
T C Star-
bellied
0.09 probably
damaging
12© 2014 Cloudera, Inc. All rights reserved.
Available data
Data set Format Size
Population genotypes VCF 10-100s of
billions
Dnase HS sites
(ENCODE)
narrowPeak
(BED)
<1 million
dbSNP CSV 10s of millions
Sample phenotypes JSON thousands
13© 2014 Cloudera, Inc. All rights reserved.
Why text data is a bad idea
• Text is highly inefficient
• Compresses poorly
• Values must be parsed
• Text is semi-structured at best
• Flexible schemas make parsing difficult
• Difficult to make assumptions on data structure
• Text poorly separates the roles of delimiters and data
• Requires escaping of control characters
• (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used)
• But still almost always better than Excel
14© 2014 Cloudera, Inc. All rights reserved.
Some reasons VCF in particular is bad
• Number of records (variants) grows with new variants, rather than new
genotypes
• difficult to write data
• adding a sample requires rewrite of entire file
• Data must be sorted
• Semi-structured: need to build a parser for each file
• Conflates two functions:
• catalogue of variation
• repository actual observed genotypes
• If gzipped, it’s not splittable
• Variants are not encoded uniquely by the VCF spec
15© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python
class IntervalTree(object):
def update(self, feature):
pass # ...implement tree update
def overlaps(self, feature):
return True or False
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
for line in ip:
feature = parse_feature(line)
dnase_sites.update(feature)
samples = {}
with open('path/to/samples.json', 'r') as ip:
for line in ip:
sample = json.loads(line)
if is_framingham(sample):
samples[sample['name']] = sample
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
for line in ip:
snp = tuple(line.split()[:3])
dbsnp.add(snp)
16© 2014 Cloudera, Inc. All rights reserved.
Additional metadata must fit in memory
class IntervalTree(object):
def update(self, feature):
pass # ...implement tree update
def overlaps(self, feature):
return True or False
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
for line in ip:
feature = parse_feature(line)
dnase_sites.update(feature)
samples = {}
with open('path/to/samples.json', 'r') as ip:
for line in ip:
sample = json.loads(line)
if is_framingham(sample):
samples[sample['name']] = sample
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
for line in ip:
snp = tuple(line.split()[:3])
dbsnp.add(snp)
17© 2014 Cloudera, Inc. All rights reserved.
Can only read from POSIX filesystem
class IntervalTree(object):
def update(self, feature):
pass # ...implement tree update
def overlaps(self, feature):
return True or False
dnase_sites = IntervalTree()
with open('path/to/dnase.narrowPeak', 'r') as ip:
for line in ip:
feature = parse_feature(line)
dnase_sites.update(feature)
samples = {}
with open('path/to/samples.json', 'r') as ip:
for line in ip:
sample = json.loads(line)
if is_framingham(sample):
samples[sample['name']] = sample
dbsnp = set()
with open('path/to/dbsnp.csv', 'r') as ip:
for line in ip:
snp = tuple(line.split()[:3])
dbsnp.add(snp)
18© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python
genotype_data = {}
reader = vcf.Reader('path/to/genotypes.vcf')
for variant in reader:
if (dnase_sites.overlaps(variant) and is_deleterious(call)
and not in_dbsnp(variant)):
for call in variant.samples:
if call.sample in samples:
pop = samples[call.sample]['population']
genotype_data.setdefault((variant, pop), []).append(call)
mafs = {}
for (variant, pop) in genotype_data.iter_keys():
mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])
19© 2014 Cloudera, Inc. All rights reserved.
Genotype data may be split across files
genotype_data = {}
reader = vcf.Reader('path/to/genotypes.vcf')
for variant in reader:
if (dnase_sites.overlaps(variant) and is_deleterious(call)
and not in_dbsnp(variant)):
for call in variant.samples:
if call.sample in samples:
pop = samples[call.sample]['population']
genotype_data.setdefault((variant, pop), []).append(call)
mafs = {}
for (variant, pop) in genotype_data.iter_keys():
mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])
20© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python
• If file is gzipped, cannot split file without decompressing (use
Snappy)
• Reading files required access to POSIX-style file system
• Probably want to split VCF file into pieces to parallelize
• Requires manual scatter-gather
• Samples may be scattered among multiple VCF files (difficult to
append to VCF)
• Manually implementing broadcast join
• Build side must fit into memory
21© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python on HPC
$ bsub –q shared_12h python split_genotypes.py
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub –q shared_12h python merge_maf.py
22© 2014 Cloudera, Inc. All rights reserved.
Manually executing query in Python on HPC
$ bsub –q shared_12h python split_genotypes.py
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub –q shared_12h python merge_maf.py
How to serialize
intermediate
output?
Manually specify
requested
resources
Manually
split and
mergeBabysit and
check for
errors/failures
23© 2014 Cloudera, Inc. All rights reserved.
HPC separates compute from storage
HPC is about compute.
Hadoop is about data.
Storage infrastructure
• Proprietary,
distributed file
system
• Expensive
Compute cluster
• High-perf, reliable
hardware
• Expensive
Big
network
pipe ($$$)
User typically works by manually submitting jobs to scheduler
(e.g., LSF, Grid Engine, etc.)
24© 2014 Cloudera, Inc. All rights reserved.
HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically through MPI
• Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
25© 2014 Cloudera, Inc. All rights reserved.
HPC uses file system as DB; text file as LCD
• All tools assume flat files with POSIX semantics
• Sharing data/collaboration involves copying large files
• Broad joint caller with 25k genomes hits file handle limits
• Files always streamed over network (HPC architecture)
26© 2014 Cloudera, Inc. All rights reserved.
HPC uses job scheduler as workflow tool
• Submitting jobs to scheduler is low level
• Workflow engines/execution models provide high level execution
graphs with built-in fault tolerance
• e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
27© 2014 Cloudera, Inc. All rights reserved.
Prepping data for local analysis in R/Python
• Manual script to prepare CSV file for working locally
• Same issues as above
• Requires working set of data to fit into memory of a single machine
• Visualization
28© 2014 Cloudera, Inc. All rights reserved.
Domain-specific tools (e.g., PLINK/Seq)
$ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp
one of a limited set
of specific, useful
tasks
(yet another)
custom query
specification
29© 2014 Cloudera, Inc. All rights reserved.
Domain-specific tools (e.g., PLINK/Seq)
• Works great if your problem fits into the pre-designed computations
• Only works if your problem fits into the pre-designed computations
• How to do stats by subpopulation?
• Probably possible, but need to learn new notation
• Must work to get data in to begin-with
• Not obviously parallelizable for performance on large data sets
• Built on SQLite underneath
30© 2014 Cloudera, Inc. All rights reserved.
RDBMS and SQL (e.g., MySQL)
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
FROM genotypes g
INNER JOIN samples s
ON g.sample = s.sample
INNER JOIN dnase d
ON g.chr = d.chr
AND g.pos >= d.start
AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
ON g.chr = p.chr
AND g.pos = p.pos
AND g.ref = p.ref
AND g.alt = p.alt
WHERE
s.study = "framingham"
p.pos IS NULL AND
g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
31© 2014 Cloudera, Inc. All rights reserved.
RDBMS and SQL (e.g., MySQL)
• Feature-rich and very mature
• Highly optimized and allows indexing
• Declarative (and abstracted) language for data
• Hassle to get data in; data end up formatted one way
• No clear scalability story
• SQL-only
32© 2014 Cloudera, Inc. All rights reserved.
Problems with old way
• Expensive
• No fault-tolerance
• No horizontal scalability
• Poor separation of data modeling and storage formats
• File format proliferation
• Inefficient text formats
33© 2014 Cloudera, Inc. All rights reserved.
34© 2014 Cloudera, Inc. All rights reserved.
Indexing the web
• Web is Huge
• Hundreds of millions of pages in 1999
• How do you index it?
• Crawl all the pages
• Rank pages based on relevance metrics
• Build search index of keywords to pages
• Do it in real time!
35© 2014 Cloudera, Inc. All rights reserved.
36© 2014 Cloudera, Inc. All rights reserved.
Databases in 1999
• Buy a really big machine
• Install expensive DBMS on it
• Point your workload on it
• Hope it doesn’t fail
• Ambitious: buy another big machine as backup
37© 2014 Cloudera, Inc. All rights reserved.
38© 2014 Cloudera, Inc. All rights reserved.
Database limitations
• Didn’t scale horizontally
• High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
• Complex analysis (PageRank)
• Unstructured data
39© 2014 Cloudera, Inc. All rights reserved.
40© 2014 Cloudera, Inc. All rights reserved.
Google does something different
• Designed their own storage and processing infrastructure
• Google File System (GFS) and MapReduce (MR)
• Goals: cheap, scalable, reliable
• General framework for large-scale batch computation
• Powered Google Search for many years
• Still used internally to this day (millions of jobs)
41© 2014 Cloudera, Inc. All rights reserved.
Google benevolent enough to publish
2003 2004
42© 2014 Cloudera, Inc. All rights reserved.
Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant
43© 2014 Cloudera, Inc. All rights reserved.
Open-source proliferation
Google Open-source Function
GFS HDFS Distributed file system
MapReduce MapReduce Batch distributed data
processing
Bigtable HBase Distributed DB/key-value store
Protobuf/Stubb
y
Thrift or Avro Data serialization/RPC
Pregel Giraph Distributed graph processing
Dremel/F1 Impala Scalable interactive SQL (MPP)
FlumeJava Crunch Abstracted data pipelines on
Hadoop
44© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides:
• Data centralization on HDFS
• No rewriting data for each tool/application
• Data-local execution to avoid moving terabytes
• High-level execution engines
• SQL (Impala, Hive)
• Relational algebra (Spark, MapReduce)
• Bulk synchronous parallel (GraphX)
• Distributed in-memory
• Built-in horizontal scalability and fault-tolerance
• Hadoop-friendly, evolvable serialization formats/RPC
45© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides serialization/RPC formats
(Avro)
• Specify schemas/services in user-friendly IDLs
• Code-generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Support for schema evolution
• Like binary JSON record Feature {
union { null, string } featureId = null;
union { null, string } featureType = null; // e.g., DNase HS
union { null, string } source = null; // e.g., BED, GFF file
union { null, Contig } contig = null;
union { null, long } start = null;
union { null, long } end = null;
union { null, Strand } strand = null;
union { null, double } value = null;
array<Dbxref> dbxrefs = [];
array<string> parentIds = [];
map<string> attributes = {};
}
46© 2014 Cloudera, Inc. All rights reserved.
APIs instead of file formats
• Service-oriented architectures (SOA) ensure stable contracts
• Allows for implementation changes with new technologies
• Software community has lots of experience with SOA, along with
mature tools
• Can be implemented in language-independent fashion
47© 2014 Cloudera, Inc. All rights reserved.
Current file format hairball
48© 2014 Cloudera, Inc. All rights reserved.
API-oriented architecture
49© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides columnar storage
(Parquet)
• Designed for general data storage
• Columnar format
• read fewer bytes
• compression more efficient
• Splittable
• Avro/Thrift-compatible
• Predicate pushdown
• RLE, dictionary-encoding
50© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides columnar storage
(Parquet)
51© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides columnar storage
(Parquet)
Vertical partitioning
(projection pushdown)
Horizontal partitioning
(predicate pushdown)
Read only
the data you
need!
+ =
52© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides abstractions for data
processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduc
e
Impala
(SQL)
Solr
(search)
Spark
ADAMquince guacamole …
bdg-formats(Avro/Parquet)
53© 2014 Cloudera, Inc. All rights reserved.
Hadoop examples: filesystem
[laserson@bottou01-10g ~]$ hadoop fs –ls /user/laserson
Found 16 items
drwx------ - laserson laserson 0 2014-11-12 16:00 .Trash
drwxr-xr-x - laserson laserson 0 2014-11-12 00:29 .sparkStaging
drwx------ - laserson laserson 0 2014-06-07 13:27 .staging
drwxr-xr-x - laserson laserson 0 2014-10-30 14:15 1kg
drwxr-xr-x - laserson laserson 0 2014-05-08 17:29 bigml
drwxr-xr-x - laserson laserson 0 2014-10-30 14:14 book
drwxrwxr-x - laserson laserson 0 2014-06-16 12:59 editing
drwxr-xr-x - laserson laserson 0 2014-06-06 13:49 gdelt
-rw-r--r-- 3 laserson laserson 0 2014-10-27 16:24 hg19_text
drwxr-xr-x - laserson laserson 0 2014-06-12 19:53 madlibport
drwxr-xr-x - laserson laserson 0 2014-03-20 18:09 rock-health-python
drwxr-xr-x - laserson laserson 0 2014-05-15 13:25 test-udf
drwxr-xr-x - laserson laserson 0 2014-08-21 17:58 test_pymc
drwxr-xr-x - laserson laserson 0 2014-10-27 22:25 tmp
drwxr-xr-x - laserson laserson 0 2014-10-07 20:30 udf-scratch
drwxr-xr-x - laserson laserson 0 2014-03-02 13:50 udfs
54© 2014 Cloudera, Inc. All rights reserved.
Hadoop examples: batch MapReduce job
hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar 
com.cloudera.science.vcf2parquet.VCFtoParquetDriver 
hdfs:///path/to/variants.vcf 
hdfs:///path/to/output.parquet
55© 2014 Cloudera, Inc. All rights reserved.
Hadoop examples: interactive Spark shell
[laserson@bottou01-10g ~]$ spark-shell --master yarn
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 1.1.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
[...]
scala>
56© 2014 Cloudera, Inc. All rights reserved.
Hadoop examples: interactive Spark shell
def inDbSnp(g: Genotype): Boolean = true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val genotypesRDD = sc.adamLoad("path/to/genotypes")
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val filteredRDD = genotypesRDD
.filter(!inDbSnp(_))
.filter(isDeleterious(_))
.filter(isFramingham(_))
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)
val maf = joinedRDD
.keyBy(x => (x.getVariant, getPopulation(x)))
.groupByKey()
.map(computeMAF(_))
.saveAsNewAPIHadoopFile("path/to/output")
57© 2014 Cloudera, Inc. All rights reserved.
Hadoop provides abstractions for data
processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduc
e
Impala
(SQL)
Solr
(search)
Spark
ADAMquince guacamole …
bdg-formats(Avro/Parquet)
58© 2014 Cloudera, Inc. All rights reserved.
Genomics ETL
.fastq .bam .vcf
.bed/.gtf/etc
short
read
alignme
nt
genotyp
e calling analysis
59© 2014 Cloudera, Inc. All rights reserved.
Hadoop variant store architecture
Impala shell (SQL)
REST API
JDBC
SQL
query
Impala engine
Hive metastore
Result
set
.parquet.vcf
ETL
60© 2014 Cloudera, Inc. All rights reserved.
Data denormalization
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
• Amortize join cost up-front
• Replace joins with predicates
(allowing predicate
pushdown)
61© 2014 Cloudera, Inc. All rights reserved.
Hadoop solution characteristics
• Data stored as Parquet columnar format for performance and
compression
• Impala/Hive metastore provide unified, flexible data model
• Impala implements RDBMS-style operations (by experts in
distributed systems)
• Spark offers flexible relational algebra operators (and in-memory
computing)
• Built-in fault tolerance for computations and horizontal scalability
62© 2014 Cloudera, Inc. All rights reserved.
Example variant-filtering query
• “Give me all SNPs that are:
• chromosome 16
• absent from dbSNP
• present in COSMIC
• observed in breast cancer samples”
• On full 1000 Genome data set
• ~37 billion genotypes
• 14 node cluster
• query completion in several seconds
SELECT cosmic as snp_id,
vcf_chrom as chr,
vcf_pos as pos,
sample_id as sample,
vcf_call_gt as genotype,
sample_affection as phenotype
FROM
hg19_parquet_snappy_join_cached_partitioned
WHERE
COSMIC IS NOT NULL AND
dbSNP IS NULL AND
sample_study = ”breast_cancer" AND
VCF_CHROM = "16";
PARTNER LOGO
63© 2014 Cloudera, Inc. All rights reserved.
Other queries/use cases
• All-vs-all eQTL integrated with
ENCODE
• >120 billion p-values
• “Top 20 eQTLs for 5 genes of interest”:
interactive
• “Find all cis-eQTLs”: several minutes
• Population genetics queries (e.g.,
backend for PLINK)
• Interval arithmetic on large ENCODE
data sets
• Duke CHGV
• ATAV DSL for preparing data for GWAS
• Week-long queries now take a few hours
by parallelizing on Spark
64© 2014 Cloudera, Inc. All rights reserved.
Computational biologists are reinventing the
wheel
• e.g., CRAM (columnar storage)
• e.g., workflow managers (Galaxy)
• e.g., GATK (scatter-gather)
65© 2014 Cloudera, Inc. All rights reserved.
Large-scale data analysis has been solved*
• Cheaper in terms of hardware
• Easier in terms of productivity
• Built-in horizontal scaling
• Built-in fault tolerance
• Layered abstractions for data modeling
• Hadoop!
66© 2014 Cloudera, Inc. All rights reserved.
Science on Hadoop
• ADAM project for genomics on Spark
• http://bdgenomics.org/
• Guacamole for somatic variation on Spark
• https://github.com/hammerlab/guacamole/
• Thunder project for neuroimaging on Spark
• http://thefreemanlab.com/thunder/
• Quince for variant store on Impala
• currently barebones, but with examples
• https://github.com/laserson/quince
67© 2014 Cloudera, Inc. All rights reserved.
Suggestions/resources
• Everyone should learn Python
• (also, everyone should try some experiments)
• Everyone should use version control (e.g., git)
• GitHub enables easy collaboration
• See Titus Brown’s blog
• Use the IPython Notebook (Jupyter) for productivity
• Big data is often about engineering; use the best tools
• For getting industry jobs:
• Show people you know how to code: put your projects on GitHub
• You should feel lucky if others will start using your code
68© 2014 Cloudera, Inc. All rights reserved.
69© 2014 Cloudera, Inc. All rights reserved.
Acknowledgements
• Cloudera
• Sandy Ryza (Spark development)
• Nong Li (Impala)
• Skye Wanderman-Milne (Impala)
• Impala genomics collaborators
• Kiran Mukhyala
• Slaton Lipscomb
• ADAM project
• Matt Massie
• Frank Nothaft
• Timothy Danford
• Mount Sinai School of Medicine
• Jeff Hammerbacher (+ lab)
• Duke CHGV
• Jonathan Keebler
Thank you.

More Related Content

What's hot

"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMfnothaft
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Alex Levenson
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMfnothaft
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 

What's hot (20)

"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
Scalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAMScalable up genomic analysis with ADAM
Scalable up genomic analysis with ADAM
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
HDFS
HDFSHDFS
HDFS
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Drill dchug-29 nov2012
Drill dchug-29 nov2012Drill dchug-29 nov2012
Drill dchug-29 nov2012
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 

Viewers also liked

Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMMatt Massie
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't SpecialAllen Day, PhD
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadofnothaft
 
Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1MapR Technologies
 
Big Data & the networked future of Science (at Ignite Seattle 7)
Big Data & the networked future of Science (at Ignite Seattle 7)Big Data & the networked future of Science (at Ignite Seattle 7)
Big Data & the networked future of Science (at Ignite Seattle 7)Deepak Singh
 
Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Gen...
Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Gen...Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Gen...
Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Gen...Matthieu Schapranow
 
Human Genome and Big Data Challenges
Human Genome and Big Data ChallengesHuman Genome and Big Data Challenges
Human Genome and Big Data ChallengesPhilip Bourne
 
Interviewing - why some questions are off limits
Interviewing - why some questions are off limitsInterviewing - why some questions are off limits
Interviewing - why some questions are off limitsAnn Loraine
 
Visualizing the genome: Techniques for presenting genome data and annotations
Visualizing the genome: Techniques for presenting genome data and annotationsVisualizing the genome: Techniques for presenting genome data and annotations
Visualizing the genome: Techniques for presenting genome data and annotationsAnn Loraine
 
Bioinformatics & Genomics December Newsletter
Bioinformatics & Genomics December NewsletterBioinformatics & Genomics December Newsletter
Bioinformatics & Genomics December NewsletterKelly Wickham
 
Daily Snapshot - 2nd March 2017
Daily Snapshot - 2nd March 2017Daily Snapshot - 2nd March 2017
Daily Snapshot - 2nd March 2017Tracxn
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
Hannes Smarason: Genomics: Forging Patient-Centric Communities
Hannes Smarason: Genomics: Forging Patient-Centric CommunitiesHannes Smarason: Genomics: Forging Patient-Centric Communities
Hannes Smarason: Genomics: Forging Patient-Centric CommunitiesHannes Smárason
 
Hannes Smarason: Progress & Prospects in Genomics
Hannes Smarason: Progress & Prospects in GenomicsHannes Smarason: Progress & Prospects in Genomics
Hannes Smarason: Progress & Prospects in GenomicsHannes Smárason
 
Hannes Smarason: 2015 = An Inflection Point in Genomics
Hannes Smarason: 2015 = An Inflection Point in GenomicsHannes Smarason: 2015 = An Inflection Point in Genomics
Hannes Smarason: 2015 = An Inflection Point in GenomicsHannes Smárason
 
Tracxn Research — Home Improvements Landscape, December 2016
Tracxn Research — Home Improvements Landscape, December 2016Tracxn Research — Home Improvements Landscape, December 2016
Tracxn Research — Home Improvements Landscape, December 2016Tracxn
 
Tracxn Research — New Space Landscape, December 2016
Tracxn Research — New Space Landscape, December 2016Tracxn Research — New Space Landscape, December 2016
Tracxn Research — New Space Landscape, December 2016Tracxn
 
Tracxn Research - PR Tech Landscape, January 2017
Tracxn Research - PR Tech Landscape, January 2017Tracxn Research - PR Tech Landscape, January 2017
Tracxn Research - PR Tech Landscape, January 2017Tracxn
 

Viewers also liked (20)

Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocado
 
Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1
 
Big Data & the networked future of Science (at Ignite Seattle 7)
Big Data & the networked future of Science (at Ignite Seattle 7)Big Data & the networked future of Science (at Ignite Seattle 7)
Big Data & the networked future of Science (at Ignite Seattle 7)
 
Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Gen...
Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Gen...Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Gen...
Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Gen...
 
Human Genome and Big Data Challenges
Human Genome and Big Data ChallengesHuman Genome and Big Data Challenges
Human Genome and Big Data Challenges
 
Interviewing - why some questions are off limits
Interviewing - why some questions are off limitsInterviewing - why some questions are off limits
Interviewing - why some questions are off limits
 
Visualizing the genome: Techniques for presenting genome data and annotations
Visualizing the genome: Techniques for presenting genome data and annotationsVisualizing the genome: Techniques for presenting genome data and annotations
Visualizing the genome: Techniques for presenting genome data and annotations
 
Bioinformatics & Genomics December Newsletter
Bioinformatics & Genomics December NewsletterBioinformatics & Genomics December Newsletter
Bioinformatics & Genomics December Newsletter
 
Daily Snapshot - 2nd March 2017
Daily Snapshot - 2nd March 2017Daily Snapshot - 2nd March 2017
Daily Snapshot - 2nd March 2017
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Hannes Smarason: Genomics: Forging Patient-Centric Communities
Hannes Smarason: Genomics: Forging Patient-Centric CommunitiesHannes Smarason: Genomics: Forging Patient-Centric Communities
Hannes Smarason: Genomics: Forging Patient-Centric Communities
 
Hannes Smarason: Progress & Prospects in Genomics
Hannes Smarason: Progress & Prospects in GenomicsHannes Smarason: Progress & Prospects in Genomics
Hannes Smarason: Progress & Prospects in Genomics
 
Hannes Smarason: 2015 = An Inflection Point in Genomics
Hannes Smarason: 2015 = An Inflection Point in GenomicsHannes Smarason: 2015 = An Inflection Point in Genomics
Hannes Smarason: 2015 = An Inflection Point in Genomics
 
Genome Big Data
Genome Big DataGenome Big Data
Genome Big Data
 
Tracxn Research — Home Improvements Landscape, December 2016
Tracxn Research — Home Improvements Landscape, December 2016Tracxn Research — Home Improvements Landscape, December 2016
Tracxn Research — Home Improvements Landscape, December 2016
 
Tracxn Research — New Space Landscape, December 2016
Tracxn Research — New Space Landscape, December 2016Tracxn Research — New Space Landscape, December 2016
Tracxn Research — New Space Landscape, December 2016
 
Tracxn Research - PR Tech Landscape, January 2017
Tracxn Research - PR Tech Landscape, January 2017Tracxn Research - PR Tech Landscape, January 2017
Tracxn Research - PR Tech Landscape, January 2017
 

Similar to Genomics Is Not Special: Data Challenges in Biology

Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
Enhancing Domain Specific Language Implementations Through Ontology
Enhancing Domain Specific Language Implementations Through OntologyEnhancing Domain Specific Language Implementations Through Ontology
Enhancing Domain Specific Language Implementations Through OntologyChunhua Liao
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
[MathWorks] Versioning Infrastructure
[MathWorks] Versioning Infrastructure[MathWorks] Versioning Infrastructure
[MathWorks] Versioning InfrastructurePerforce
 
White Paper: Scaling Servers and Storage for Film Assets
White Paper: Scaling Servers and Storage for Film AssetsWhite Paper: Scaling Servers and Storage for Film Assets
White Paper: Scaling Servers and Storage for Film AssetsPerforce
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operationsgrim_radical
 
Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonPyData
 
Chef - industrialize and automate your infrastructure
Chef - industrialize and automate your infrastructureChef - industrialize and automate your infrastructure
Chef - industrialize and automate your infrastructureMichaël Lopez
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Ian Huston
 
HAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged DataHAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged DataDataWorks Summit
 
Improved Developer Productivity In JDK8
Improved Developer Productivity In JDK8Improved Developer Productivity In JDK8
Improved Developer Productivity In JDK8Simon Ritter
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017Alex Diachenko
 
Puppet getting started by Dirk Götz
Puppet getting started by Dirk GötzPuppet getting started by Dirk Götz
Puppet getting started by Dirk GötzNETWAYS
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache PulsarTimothy Spann
 

Similar to Genomics Is Not Special: Data Challenges in Biology (20)

Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
Enhancing Domain Specific Language Implementations Through Ontology
Enhancing Domain Specific Language Implementations Through OntologyEnhancing Domain Specific Language Implementations Through Ontology
Enhancing Domain Specific Language Implementations Through Ontology
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
[MathWorks] Versioning Infrastructure
[MathWorks] Versioning Infrastructure[MathWorks] Versioning Infrastructure
[MathWorks] Versioning Infrastructure
 
White Paper: Scaling Servers and Storage for Film Assets
White Paper: Scaling Servers and Storage for Film AssetsWhite Paper: Scaling Servers and Storage for Film Assets
White Paper: Scaling Servers and Storage for Film Assets
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operations
 
Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
 
Chef - industrialize and automate your infrastructure
Chef - industrialize and automate your infrastructureChef - industrialize and automate your infrastructure
Chef - industrialize and automate your infrastructure
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
 
HAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged DataHAWQ Meets Hive - Querying Unmanaged Data
HAWQ Meets Hive - Querying Unmanaged Data
 
Improved Developer Productivity In JDK8
Improved Developer Productivity In JDK8Improved Developer Productivity In JDK8
Improved Developer Productivity In JDK8
 
Data provenance in Hopsworks
Data provenance in HopsworksData provenance in Hopsworks
Data provenance in Hopsworks
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
big data ppt.ppt
big data ppt.pptbig data ppt.ppt
big data ppt.ppt
 
Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017Hawq meets Hive - DataWorks San Jose 2017
Hawq meets Hive - DataWorks San Jose 2017
 
Puppet getting started by Dirk Götz
Puppet getting started by Dirk GötzPuppet getting started by Dirk Götz
Puppet getting started by Dirk Götz
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
 

More from Uri Laserson

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Uri Laserson
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic BiologyUri Laserson
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Uri Laserson
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesUri Laserson
 

More from Uri Laserson (6)

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic Biology
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
 

Recently uploaded

chapter 28 Reproductive System Power Point
chapter 28 Reproductive System Power Pointchapter 28 Reproductive System Power Point
chapter 28 Reproductive System Power Pointchelseakland
 
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...Chiheb Ben Hammouda
 
Anatomy of Duodenum- detailed ppt presentation for science students
Anatomy of Duodenum- detailed ppt presentation for science studentsAnatomy of Duodenum- detailed ppt presentation for science students
Anatomy of Duodenum- detailed ppt presentation for science studentsgargipatil3210
 
Pharmaceutical warehouse audit using eAuditor Audits & Inspections .pptx
Pharmaceutical warehouse audit using eAuditor Audits & Inspections .pptxPharmaceutical warehouse audit using eAuditor Audits & Inspections .pptx
Pharmaceutical warehouse audit using eAuditor Audits & Inspections .pptxeAuditor Audits & Inspections
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsDobusch Leonhard
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasChayanika Das
 
Rabies ,a deadly viral disease transmitted by the most beautiful beings. by ...
Rabies ,a deadly viral disease transmitted by the most beautiful beings.  by ...Rabies ,a deadly viral disease transmitted by the most beautiful beings.  by ...
Rabies ,a deadly viral disease transmitted by the most beautiful beings. by ...uzmashireenmbe01
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsDanielBaumann11
 
Introduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsIntroduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsCreative-Biolabs
 
ULTRA STRUCTURE OF FUNGAL CELL AND GROWTH
ULTRA STRUCTURE OF FUNGAL CELL AND GROWTHULTRA STRUCTURE OF FUNGAL CELL AND GROWTH
ULTRA STRUCTURE OF FUNGAL CELL AND GROWTHKARTHIK REDDY C A
 
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer ZahanaEGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer ZahanaDr.Mahmoud Abbas
 
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...jana861314
 
Role of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptxRole of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptxjana861314
 
DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx201bo007
 
Density and Buoyancy for Grade 8 Science Class
Density and Buoyancy for Grade 8 Science ClassDensity and Buoyancy for Grade 8 Science Class
Density and Buoyancy for Grade 8 Science Class276318345hao
 
Mycobacterium Mycobacterium tuberculosis
Mycobacterium Mycobacterium tuberculosisMycobacterium Mycobacterium tuberculosis
Mycobacterium Mycobacterium tuberculosisKARTHIK REDDY C A
 
Food_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyFood_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyHemantThakare8
 

Recently uploaded (20)

chapter 28 Reproductive System Power Point
chapter 28 Reproductive System Power Pointchapter 28 Reproductive System Power Point
chapter 28 Reproductive System Power Point
 
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
 
Anatomy of Duodenum- detailed ppt presentation for science students
Anatomy of Duodenum- detailed ppt presentation for science studentsAnatomy of Duodenum- detailed ppt presentation for science students
Anatomy of Duodenum- detailed ppt presentation for science students
 
Pharmaceutical warehouse audit using eAuditor Audits & Inspections .pptx
Pharmaceutical warehouse audit using eAuditor Audits & Inspections .pptxPharmaceutical warehouse audit using eAuditor Audits & Inspections .pptx
Pharmaceutical warehouse audit using eAuditor Audits & Inspections .pptx
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and Pitfalls
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
 
Rabies ,a deadly viral disease transmitted by the most beautiful beings. by ...
Rabies ,a deadly viral disease transmitted by the most beautiful beings.  by ...Rabies ,a deadly viral disease transmitted by the most beautiful beings.  by ...
Rabies ,a deadly viral disease transmitted by the most beautiful beings. by ...
 
Biopreservation .pptx
Biopreservation                    .pptxBiopreservation                    .pptx
Biopreservation .pptx
 
Battery Reasearch Reagents from TCI Chemicals
Battery Reasearch Reagents from TCI ChemicalsBattery Reasearch Reagents from TCI Chemicals
Battery Reasearch Reagents from TCI Chemicals
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
 
TOPIC OF ELECTROMAGNETISM.PHYSICS NOTES FORM 4
TOPIC OF ELECTROMAGNETISM.PHYSICS NOTES FORM 4TOPIC OF ELECTROMAGNETISM.PHYSICS NOTES FORM 4
TOPIC OF ELECTROMAGNETISM.PHYSICS NOTES FORM 4
 
Introduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsIntroduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative Biolabs
 
ULTRA STRUCTURE OF FUNGAL CELL AND GROWTH
ULTRA STRUCTURE OF FUNGAL CELL AND GROWTHULTRA STRUCTURE OF FUNGAL CELL AND GROWTH
ULTRA STRUCTURE OF FUNGAL CELL AND GROWTH
 
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer ZahanaEGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
 
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
 
Role of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptxRole of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptx
 
DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx
 
Density and Buoyancy for Grade 8 Science Class
Density and Buoyancy for Grade 8 Science ClassDensity and Buoyancy for Grade 8 Science Class
Density and Buoyancy for Grade 8 Science Class
 
Mycobacterium Mycobacterium tuberculosis
Mycobacterium Mycobacterium tuberculosisMycobacterium Mycobacterium tuberculosis
Mycobacterium Mycobacterium tuberculosis
 
Food_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyFood_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiology
 

Genomics Is Not Special: Data Challenges in Biology

  • 1. Genomics Is Not Special Uri Laserson // laserson@cloudera.com // 13 November 2014 Toward Data-Intensive Biology
  • 2. 2© 2014 Cloudera, Inc. All rights reserved. http://omicsmaps.com/ >25 Pbp / year
  • 3. 3© 2014 Cloudera, Inc. All rights reserved. Carr and Church, Nat. Biotech. 27: 1151 (20
  • 4. 4© 2014 Cloudera, Inc. All rights reserved. For every “-ome” there’s a “-seq” Genome DNA-seq Transcriptome RNA-seq FRT-seq NET-seq Methylome Bisulfite-seq Immunome Immune-seq Proteome PhIP-seq Bind-n-seq http://liorpachter.wordpress.com/seq/
  • 5. 5© 2014 Cloudera, Inc. All rights reserved. For every “-ome” there’s a “-seq” Genome DNA-seq Transcriptome RNA-seq FRT-seq NET-seq Methylome Bisulfite-seq Immunome Immune-seq Proteome PhIP-seq Bind-n-seq http://liorpachter.wordpress.com/seq/
  • 6. 6© 2014 Cloudera, Inc. All rights reserved. Based on IMGT/LIGM release 201111
  • 7. 7© 2014 Cloudera, Inc. All rights reserved.
  • 8. 8© 2014 Cloudera, Inc. All rights reserved.
  • 9. 9© 2014 Cloudera, Inc. All rights reserved. Developer/computational efficiency becoming paramount Genome Biology 12: 125 (2011)
  • 10. 10© 2014 Cloudera, Inc. All rights reserved. Software and data management around since 1970s • Version control/reproducibility • Testing/automation/integration • Databases and data formats • API design • Lots (most?) of big data innovation happening in industry
  • 11. 11© 2014 Cloudera, Inc. All rights reserved. Example query For each variant that is • overlapping a DNase HS site • predicted to be deleterious • absent from dbSNP compute the MAF by subpopulation using samples in Framing Heart Study PARTNER LOGO CH R POS RE F AL T POP MAF POLYPHEN 7 122892 37 A G Plain 0.01 possibly damaging 7 122892 37 A G Star- bellied 0.03 possibly damaging 12 228833 2 T C Plain 0.00 3 probably damaging 12 228833 2 T C Star- bellied 0.09 probably damaging
  • 12. 12© 2014 Cloudera, Inc. All rights reserved. Available data Data set Format Size Population genotypes VCF 10-100s of billions Dnase HS sites (ENCODE) narrowPeak (BED) <1 million dbSNP CSV 10s of millions Sample phenotypes JSON thousands
  • 13. 13© 2014 Cloudera, Inc. All rights reserved. Why text data is a bad idea • Text is highly inefficient • Compresses poorly • Values must be parsed • Text is semi-structured at best • Flexible schemas make parsing difficult • Difficult to make assumptions on data structure • Text poorly separates the roles of delimiters and data • Requires escaping of control characters • (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used) • But still almost always better than Excel
  • 14. 14© 2014 Cloudera, Inc. All rights reserved. Some reasons VCF in particular is bad • Number of records (variants) grows with new variants, rather than new genotypes • difficult to write data • adding a sample requires rewrite of entire file • Data must be sorted • Semi-structured: need to build a parser for each file • Conflates two functions: • catalogue of variation • repository actual observed genotypes • If gzipped, it’s not splittable • Variants are not encoded uniquely by the VCF spec
  • 15. 15© 2014 Cloudera, Inc. All rights reserved. Manually executing query in Python class IntervalTree(object): def update(self, feature): pass # ...implement tree update def overlaps(self, feature): return True or False dnase_sites = IntervalTree() with open('path/to/dnase.narrowPeak', 'r') as ip: for line in ip: feature = parse_feature(line) dnase_sites.update(feature) samples = {} with open('path/to/samples.json', 'r') as ip: for line in ip: sample = json.loads(line) if is_framingham(sample): samples[sample['name']] = sample dbsnp = set() with open('path/to/dbsnp.csv', 'r') as ip: for line in ip: snp = tuple(line.split()[:3]) dbsnp.add(snp)
  • 16. 16© 2014 Cloudera, Inc. All rights reserved. Additional metadata must fit in memory class IntervalTree(object): def update(self, feature): pass # ...implement tree update def overlaps(self, feature): return True or False dnase_sites = IntervalTree() with open('path/to/dnase.narrowPeak', 'r') as ip: for line in ip: feature = parse_feature(line) dnase_sites.update(feature) samples = {} with open('path/to/samples.json', 'r') as ip: for line in ip: sample = json.loads(line) if is_framingham(sample): samples[sample['name']] = sample dbsnp = set() with open('path/to/dbsnp.csv', 'r') as ip: for line in ip: snp = tuple(line.split()[:3]) dbsnp.add(snp)
  • 17. 17© 2014 Cloudera, Inc. All rights reserved. Can only read from POSIX filesystem class IntervalTree(object): def update(self, feature): pass # ...implement tree update def overlaps(self, feature): return True or False dnase_sites = IntervalTree() with open('path/to/dnase.narrowPeak', 'r') as ip: for line in ip: feature = parse_feature(line) dnase_sites.update(feature) samples = {} with open('path/to/samples.json', 'r') as ip: for line in ip: sample = json.loads(line) if is_framingham(sample): samples[sample['name']] = sample dbsnp = set() with open('path/to/dbsnp.csv', 'r') as ip: for line in ip: snp = tuple(line.split()[:3]) dbsnp.add(snp)
  • 18. 18© 2014 Cloudera, Inc. All rights reserved. Manually executing query in Python genotype_data = {} reader = vcf.Reader('path/to/genotypes.vcf') for variant in reader: if (dnase_sites.overlaps(variant) and is_deleterious(call) and not in_dbsnp(variant)): for call in variant.samples: if call.sample in samples: pop = samples[call.sample]['population'] genotype_data.setdefault((variant, pop), []).append(call) mafs = {} for (variant, pop) in genotype_data.iter_keys(): mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])
  • 19. 19© 2014 Cloudera, Inc. All rights reserved. Genotype data may be split across files genotype_data = {} reader = vcf.Reader('path/to/genotypes.vcf') for variant in reader: if (dnase_sites.overlaps(variant) and is_deleterious(call) and not in_dbsnp(variant)): for call in variant.samples: if call.sample in samples: pop = samples[call.sample]['population'] genotype_data.setdefault((variant, pop), []).append(call) mafs = {} for (variant, pop) in genotype_data.iter_keys(): mafs[(variant, pop)] = compute_maf(genotype_data[(variant, pop)])
  • 20. 20© 2014 Cloudera, Inc. All rights reserved. Manually executing query in Python • If file is gzipped, cannot split file without decompressing (use Snappy) • Reading files required access to POSIX-style file system • Probably want to split VCF file into pieces to parallelize • Requires manual scatter-gather • Samples may be scattered among multiple VCF files (difficult to append to VCF) • Manually implementing broadcast join • Build side must fit into memory
  • 21. 21© 2014 Cloudera, Inc. All rights reserved. Manually executing query in Python on HPC $ bsub –q shared_12h python split_genotypes.py $ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_1.vcf agg1.csv $ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_2.vcf agg2.csv $ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_3.vcf agg3.csv $ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_4.vcf agg4.csv $ bsub –q shared_12h python merge_maf.py
  • 22. 22© 2014 Cloudera, Inc. All rights reserved. Manually executing query in Python on HPC $ bsub –q shared_12h python split_genotypes.py $ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_1.vcf agg1.csv $ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_2.vcf agg2.csv $ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_3.vcf agg3.csv $ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_4.vcf agg4.csv $ bsub –q shared_12h python merge_maf.py How to serialize intermediate output? Manually specify requested resources Manually split and mergeBabysit and check for errors/failures
  • 23. 23© 2014 Cloudera, Inc. All rights reserved. HPC separates compute from storage HPC is about compute. Hadoop is about data. Storage infrastructure • Proprietary, distributed file system • Expensive Compute cluster • High-perf, reliable hardware • Expensive Big network pipe ($$$) User typically works by manually submitting jobs to scheduler (e.g., LSF, Grid Engine, etc.)
  • 24. 24© 2014 Cloudera, Inc. All rights reserved. HPC is lower-level than Hadoop • HPC only exposes job scheduling • Parallelization typically through MPI • Very low-level communication primitives • Difficult to horizontally scale by simply adding nodes • Large data sets must be manually split • Failures must be dealt with manually
  • 25. 25© 2014 Cloudera, Inc. All rights reserved. HPC uses file system as DB; text file as LCD • All tools assume flat files with POSIX semantics • Sharing data/collaboration involves copying large files • Broad joint caller with 25k genomes hits file handle limits • Files always streamed over network (HPC architecture)
  • 26. 26© 2014 Cloudera, Inc. All rights reserved. HPC uses job scheduler as workflow tool • Submitting jobs to scheduler is low level • Workflow engines/execution models provide high level execution graphs with built-in fault tolerance • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
  • 27. 27© 2014 Cloudera, Inc. All rights reserved. Prepping data for local analysis in R/Python • Manual script to prepare CSV file for working locally • Same issues as above • Requires working set of data to fit into memory of a single machine • Visualization
  • 28. 28© 2014 Cloudera, Inc. All rights reserved. Domain-specific tools (e.g., PLINK/Seq) $ pseq path/to/project v-stats --mask phe=framingham locset=dnase ref.ex=dbsnp one of a limited set of specific, useful tasks (yet another) custom query specification
  • 29. 29© 2014 Cloudera, Inc. All rights reserved. Domain-specific tools (e.g., PLINK/Seq) • Works great if your problem fits into the pre-designed computations • Only works if your problem fits into the pre-designed computations • How to do stats by subpopulation? • Probably possible, but need to learn new notation • Must work to get data in to begin-with • Not obviously parallelizable for performance on large data sets • Built on SQLite underneath
  • 30. 30© 2014 Cloudera, Inc. All rights reserved. RDBMS and SQL (e.g., MySQL) SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call) FROM genotypes g INNER JOIN samples s ON g.sample = s.sample INNER JOIN dnase d ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d.end LEFT OUTER JOIN dbsnp p ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt WHERE s.study = "framingham" p.pos IS NULL AND g.polyphen IN ( "possibly damaging", "probably damaging" ) GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
  • 31. 31© 2014 Cloudera, Inc. All rights reserved. RDBMS and SQL (e.g., MySQL) • Feature-rich and very mature • Highly optimized and allows indexing • Declarative (and abstracted) language for data • Hassle to get data in; data end up formatted one way • No clear scalability story • SQL-only
  • 32. 32© 2014 Cloudera, Inc. All rights reserved. Problems with old way • Expensive • No fault-tolerance • No horizontal scalability • Poor separation of data modeling and storage formats • File format proliferation • Inefficient text formats
  • 33. 33© 2014 Cloudera, Inc. All rights reserved.
  • 34. 34© 2014 Cloudera, Inc. All rights reserved. Indexing the web • Web is Huge • Hundreds of millions of pages in 1999 • How do you index it? • Crawl all the pages • Rank pages based on relevance metrics • Build search index of keywords to pages • Do it in real time!
  • 35. 35© 2014 Cloudera, Inc. All rights reserved.
  • 36. 36© 2014 Cloudera, Inc. All rights reserved. Databases in 1999 • Buy a really big machine • Install expensive DBMS on it • Point your workload on it • Hope it doesn’t fail • Ambitious: buy another big machine as backup
  • 37. 37© 2014 Cloudera, Inc. All rights reserved.
  • 38. 38© 2014 Cloudera, Inc. All rights reserved. Database limitations • Didn’t scale horizontally • High marginal cost ($$$) • No real fault-tolerance story • Vendor lock-in ($$$) • SQL unsuited for search ranking • Complex analysis (PageRank) • Unstructured data
  • 39. 39© 2014 Cloudera, Inc. All rights reserved.
  • 40. 40© 2014 Cloudera, Inc. All rights reserved. Google does something different • Designed their own storage and processing infrastructure • Google File System (GFS) and MapReduce (MR) • Goals: cheap, scalable, reliable • General framework for large-scale batch computation • Powered Google Search for many years • Still used internally to this day (millions of jobs)
  • 41. 41© 2014 Cloudera, Inc. All rights reserved. Google benevolent enough to publish 2003 2004
  • 42. 42© 2014 Cloudera, Inc. All rights reserved. Birth of Hadoop at Yahoo! • 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR • 2006: Spun out as Apache Hadoop • Named after Doug’s son’s yellow stuffed elephant
  • 43. 43© 2014 Cloudera, Inc. All rights reserved. Open-source proliferation Google Open-source Function GFS HDFS Distributed file system MapReduce MapReduce Batch distributed data processing Bigtable HBase Distributed DB/key-value store Protobuf/Stubb y Thrift or Avro Data serialization/RPC Pregel Giraph Distributed graph processing Dremel/F1 Impala Scalable interactive SQL (MPP) FlumeJava Crunch Abstracted data pipelines on Hadoop
  • 44. 44© 2014 Cloudera, Inc. All rights reserved. Hadoop provides: • Data centralization on HDFS • No rewriting data for each tool/application • Data-local execution to avoid moving terabytes • High-level execution engines • SQL (Impala, Hive) • Relational algebra (Spark, MapReduce) • Bulk synchronous parallel (GraphX) • Distributed in-memory • Built-in horizontal scalability and fault-tolerance • Hadoop-friendly, evolvable serialization formats/RPC
  • 45. 45© 2014 Cloudera, Inc. All rights reserved. Hadoop provides serialization/RPC formats (Avro) • Specify schemas/services in user-friendly IDLs • Code-generation to multiple languages (wire-compatible/portable) • Compact, binary formats • Support for schema evolution • Like binary JSON record Feature { union { null, string } featureId = null; union { null, string } featureType = null; // e.g., DNase HS union { null, string } source = null; // e.g., BED, GFF file union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, Strand } strand = null; union { null, double } value = null; array<Dbxref> dbxrefs = []; array<string> parentIds = []; map<string> attributes = {}; }
  • 46. 46© 2014 Cloudera, Inc. All rights reserved. APIs instead of file formats • Service-oriented architectures (SOA) ensure stable contracts • Allows for implementation changes with new technologies • Software community has lots of experience with SOA, along with mature tools • Can be implemented in language-independent fashion
  • 47. 47© 2014 Cloudera, Inc. All rights reserved. Current file format hairball
  • 48. 48© 2014 Cloudera, Inc. All rights reserved. API-oriented architecture
  • 49. 49© 2014 Cloudera, Inc. All rights reserved. Hadoop provides columnar storage (Parquet) • Designed for general data storage • Columnar format • read fewer bytes • compression more efficient • Splittable • Avro/Thrift-compatible • Predicate pushdown • RLE, dictionary-encoding
  • 50. 50© 2014 Cloudera, Inc. All rights reserved. Hadoop provides columnar storage (Parquet)
  • 51. 51© 2014 Cloudera, Inc. All rights reserved. Hadoop provides columnar storage (Parquet) Vertical partitioning (projection pushdown) Horizontal partitioning (predicate pushdown) Read only the data you need! + =
  • 52. 52© 2014 Cloudera, Inc. All rights reserved. Hadoop provides abstractions for data processing HDFS (scalable, distributed storage) YARN (resource management) MapReduc e Impala (SQL) Solr (search) Spark ADAMquince guacamole … bdg-formats(Avro/Parquet)
  • 53. 53© 2014 Cloudera, Inc. All rights reserved. Hadoop examples: filesystem [laserson@bottou01-10g ~]$ hadoop fs –ls /user/laserson Found 16 items drwx------ - laserson laserson 0 2014-11-12 16:00 .Trash drwxr-xr-x - laserson laserson 0 2014-11-12 00:29 .sparkStaging drwx------ - laserson laserson 0 2014-06-07 13:27 .staging drwxr-xr-x - laserson laserson 0 2014-10-30 14:15 1kg drwxr-xr-x - laserson laserson 0 2014-05-08 17:29 bigml drwxr-xr-x - laserson laserson 0 2014-10-30 14:14 book drwxrwxr-x - laserson laserson 0 2014-06-16 12:59 editing drwxr-xr-x - laserson laserson 0 2014-06-06 13:49 gdelt -rw-r--r-- 3 laserson laserson 0 2014-10-27 16:24 hg19_text drwxr-xr-x - laserson laserson 0 2014-06-12 19:53 madlibport drwxr-xr-x - laserson laserson 0 2014-03-20 18:09 rock-health-python drwxr-xr-x - laserson laserson 0 2014-05-15 13:25 test-udf drwxr-xr-x - laserson laserson 0 2014-08-21 17:58 test_pymc drwxr-xr-x - laserson laserson 0 2014-10-27 22:25 tmp drwxr-xr-x - laserson laserson 0 2014-10-07 20:30 udf-scratch drwxr-xr-x - laserson laserson 0 2014-03-02 13:50 udfs
  • 54. 54© 2014 Cloudera, Inc. All rights reserved. Hadoop examples: batch MapReduce job hadoop jar vcf2parquet-0.1.0-jar-with-dependencies.jar com.cloudera.science.vcf2parquet.VCFtoParquetDriver hdfs:///path/to/variants.vcf hdfs:///path/to/output.parquet
  • 55. 55© 2014 Cloudera, Inc. All rights reserved. Hadoop examples: interactive Spark shell [laserson@bottou01-10g ~]$ spark-shell --master yarn Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.1.0 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67) Type in expressions to have them evaluated. [...] scala>
  • 56. 56© 2014 Cloudera, Inc. All rights reserved. Hadoop examples: interactive Spark shell def inDbSnp(g: Genotype): Boolean = true or false def isDeleterious(g: Genotype): Boolean = g.getPolyPhen val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect() val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect() val genotypesRDD = sc.adamLoad("path/to/genotypes") val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase") val filteredRDD = genotypesRDD .filter(!inDbSnp(_)) .filter(isDeleterious(_)) .filter(isFramingham(_)) val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD) val maf = joinedRDD .keyBy(x => (x.getVariant, getPopulation(x))) .groupByKey() .map(computeMAF(_)) .saveAsNewAPIHadoopFile("path/to/output")
  • 57. 57© 2014 Cloudera, Inc. All rights reserved. Hadoop provides abstractions for data processing HDFS (scalable, distributed storage) YARN (resource management) MapReduc e Impala (SQL) Solr (search) Spark ADAMquince guacamole … bdg-formats(Avro/Parquet)
  • 58. 58© 2014 Cloudera, Inc. All rights reserved. Genomics ETL .fastq .bam .vcf .bed/.gtf/etc short read alignme nt genotyp e calling analysis
  • 59. 59© 2014 Cloudera, Inc. All rights reserved. Hadoop variant store architecture Impala shell (SQL) REST API JDBC SQL query Impala engine Hive metastore Result set .parquet.vcf ETL
  • 60. 60© 2014 Cloudera, Inc. All rights reserved. Data denormalization ##fileformat=VCFv4.1 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 • Amortize join cost up-front • Replace joins with predicates (allowing predicate pushdown)
  • 61. 61© 2014 Cloudera, Inc. All rights reserved. Hadoop solution characteristics • Data stored as Parquet columnar format for performance and compression • Impala/Hive metastore provide unified, flexible data model • Impala implements RDBMS-style operations (by experts in distributed systems) • Spark offers flexible relational algebra operators (and in-memory computing) • Built-in fault tolerance for computations and horizontal scalability
  • 62. 62© 2014 Cloudera, Inc. All rights reserved. Example variant-filtering query • “Give me all SNPs that are: • chromosome 16 • absent from dbSNP • present in COSMIC • observed in breast cancer samples” • On full 1000 Genome data set • ~37 billion genotypes • 14 node cluster • query completion in several seconds SELECT cosmic as snp_id, vcf_chrom as chr, vcf_pos as pos, sample_id as sample, vcf_call_gt as genotype, sample_affection as phenotype FROM hg19_parquet_snappy_join_cached_partitioned WHERE COSMIC IS NOT NULL AND dbSNP IS NULL AND sample_study = ”breast_cancer" AND VCF_CHROM = "16"; PARTNER LOGO
  • 63. 63© 2014 Cloudera, Inc. All rights reserved. Other queries/use cases • All-vs-all eQTL integrated with ENCODE • >120 billion p-values • “Top 20 eQTLs for 5 genes of interest”: interactive • “Find all cis-eQTLs”: several minutes • Population genetics queries (e.g., backend for PLINK) • Interval arithmetic on large ENCODE data sets • Duke CHGV • ATAV DSL for preparing data for GWAS • Week-long queries now take a few hours by parallelizing on Spark
  • 64. 64© 2014 Cloudera, Inc. All rights reserved. Computational biologists are reinventing the wheel • e.g., CRAM (columnar storage) • e.g., workflow managers (Galaxy) • e.g., GATK (scatter-gather)
  • 65. 65© 2014 Cloudera, Inc. All rights reserved. Large-scale data analysis has been solved* • Cheaper in terms of hardware • Easier in terms of productivity • Built-in horizontal scaling • Built-in fault tolerance • Layered abstractions for data modeling • Hadoop!
  • 66. 66© 2014 Cloudera, Inc. All rights reserved. Science on Hadoop • ADAM project for genomics on Spark • http://bdgenomics.org/ • Guacamole for somatic variation on Spark • https://github.com/hammerlab/guacamole/ • Thunder project for neuroimaging on Spark • http://thefreemanlab.com/thunder/ • Quince for variant store on Impala • currently barebones, but with examples • https://github.com/laserson/quince
  • 67. 67© 2014 Cloudera, Inc. All rights reserved. Suggestions/resources • Everyone should learn Python • (also, everyone should try some experiments) • Everyone should use version control (e.g., git) • GitHub enables easy collaboration • See Titus Brown’s blog • Use the IPython Notebook (Jupyter) for productivity • Big data is often about engineering; use the best tools • For getting industry jobs: • Show people you know how to code: put your projects on GitHub • You should feel lucky if others will start using your code
  • 68. 68© 2014 Cloudera, Inc. All rights reserved.
  • 69. 69© 2014 Cloudera, Inc. All rights reserved. Acknowledgements • Cloudera • Sandy Ryza (Spark development) • Nong Li (Impala) • Skye Wanderman-Milne (Impala) • Impala genomics collaborators • Kiran Mukhyala • Slaton Lipscomb • ADAM project • Matt Massie • Frank Nothaft • Timothy Danford • Mount Sinai School of Medicine • Jeff Hammerbacher (+ lab) • Duke CHGV • Jonathan Keebler

Editor's Notes

  1. Must start with cliche Grad school in math and biology NOT computer science
  2. NOTE: All TEXT files
  3. I’ll discuss some alternatives a bit later
  4. What if some data from DB? Need to rewrite data access.
  5. Different samples in different files.
  6. MongoDB similar
  7. If one process wants to communicate to another process (python script to filesystem) Why another file format? Because this solves all the problems
  8. David Haussler’s story
  9. What is Hadoop? Jars
  10. What is Hadoop? Jars
  11. Learn more about Spark Shameless plug Peregrine falcon