Hadoop ecosystem for genomics
Uri Laserson
Mount Sinai School of Medicine
29 October 2013


Hadoop overview


Scalable variant store


Historical context
Hadoop overview
Some sins in bioinformatics
Possible conventional solutions
Hadoop/Impala implementation
Historical Context

Indexing the Web

Web is Huge


How do you index it?


Hundreds of millions of pages in 1999
Crawl all the pages
Rank pages based on relevance metrics
Build search index of keywords to pages
Do it in real time!
Databases in 1999


Buy a really big machine
Install expensive DBMS on it
Point your workload at it
Hope it doesn’t fail
Ambitious: buy another big machine as backup
Database Limitations

Didn’t scale horizontally

High marginal cost ($$$)

No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking



Complex analysis (PageRank)
Unstructured data
Google does something different

Designed their own storage and processing


Google File System (GFS) and MapReduce (MR)

Goals: KISS
• Scalable
• Reliable

Google does something different
It worked!
• Powered Google Search for many years
• General framework for large-scale batch computation
• Still used internally at Google to this day

Google benevolent enough to publish


Birth of Hadoop at Yahoo!
2004-2006: Doug Cutting and Mike Cafarella
implement GFS/MR.
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant

Open-source proliferation





Distributed file system



Batch distributed data processing



Distributed DB/key-value store


Thrift or Avro

Data serialization/RPC



Distributed graph processing


Cloudera Impala

Scalable interactive SQL (MPP)



Abstracted data pipelines on Hadoop


Overview of core technology

HDFS design assumptions
Based on Google File System
• Files are large (GBs to TBs)
• Failures are common


Massive scale means failures very likely
Disk, node, or network failures

Accesses are large and sequential
• Files are append-only

HDFS properties



Horizontally scalable


Gracefully responds to node/disk/network failures
Low marginal cost

HDFS storage distribution

[Figure: Input File stored across Node A, Node B, Node C, Node D, and Node E]
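
The storage picture can be sketched in a few lines of Python. This is a toy model only, not HDFS's actual placement policy: the block size, the replication factor of 3, the round-robin assignment, and the names `place_blocks` and `surviving_blocks` are all assumptions made for illustration.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes, round-robin (toy placement policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Take `replication` consecutive nodes, shifting the start per block.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    failed = set(failed_nodes)
    return {b for b, replicas in placement.items() if set(replicas) - failed}


if __name__ == "__main__":
    nodes = ["A", "B", "C", "D", "E"]
    placement = place_blocks(file_size=1000, block_size=128, nodes=nodes)
    for block, replicas in placement.items():
        print(block, replicas)
    # With 3 replicas per block, losing two nodes leaves every block readable.
    print(surviving_blocks(placement, ["A", "B"]) == set(placement))
```

The point of the sketch is that a failed node costs a replica, not the data, which is what lets the file system run on cheap hardware.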

MapReduce computation

Structured as

Embarrassingly parallel “map stage”
Cluster-wide distributed sort (“shuffle”)
Aggregation “reduce stage”

Data-locality: process the data where it is stored
• Fault-tolerance: failed tasks automatically detected
and restarted
• Schema-on-read: data need not be stored conforming
to a rigid schema

WordCount example
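
A minimal single-process sketch of WordCount, assuming nothing about the Hadoop API: the three functions below simply mirror the map, shuffle, and reduce stages, and their names are illustrative. On a real cluster each stage would run in parallel across nodes.

```python
from collections import defaultdict


def map_stage(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as the cluster-wide sort would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_stage(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over an iterable of lines."""
    mapped = [pair for line in lines for pair in map_stage(line)]
    return dict(reduce_stage(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

Because `map_stage` looks at one line at a time and `reduce_stage` at one key at a time, both can be spread over many machines; only the shuffle needs to move data between them.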

Cloudera Hadoop Stack





Cloudera Impala
Modern MPP
database built on top
Designed for
interactive queries
on terabyte-scale
data sets.

Cloudera Search
• Interactive search queries on top of
• Built on Solr and SolrCloud
• Near-realtime indexing of new documents

Serialization/RPC formats

Specify schemas/services in user-friendly IDLs
Code-generation to multiple languages (wire-compatible/portable)
Compact, binary formats
Natural support for schema evolution
Multiple implementations:


Apache Thrift, Apache Avro, Google’s Protocol Buffers
Serialization/RPC formats

struct Tweet {
  1: required i32 userId;
  2: required string userName;
  3: required string text;
  4: optional Location loc;
  16: optional string language = "english"
}

service Twitter {
  void ping();
  bool postTweet(1:Tweet tweet);
  TweetSearchResult searchTweets(1:string query);
}
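
Schema evolution rests on the numeric field ids. The sketch below is a toy model in plain Python, not Thrift's wire protocol or generated code: records are "serialized" as dicts keyed by field id, and `TWEET_V1`, `TWEET_V2`, `encode`, and `decode` are hypothetical names. It shows a reader with a newer schema filling in a default, and a reader with an older schema skipping a field it does not know.

```python
# (field id, name, required, default); ids mirror the Tweet struct.
TWEET_V1 = [
    (1, "userId", True, None),
    (2, "userName", True, None),
    (3, "text", True, None),
]
# A later version adds an optional field with a default value.
TWEET_V2 = TWEET_V1 + [(16, "language", False, "english")]


def encode(record, schema):
    """Serialize to {field id: value}; field names never go on the wire."""
    wire = {}
    for field_id, name, required, _default in schema:
        if name in record:
            wire[field_id] = record[name]
        elif required:
            raise ValueError(f"missing required field {name!r}")
    return wire


def decode(wire, schema):
    """Read with the reader's schema: skip unknown ids, fill in defaults."""
    record = {}
    for field_id, name, required, default in schema:
        if field_id in wire:
            record[name] = wire[field_id]
        elif required:
            raise ValueError(f"missing required field {name!r}")
        else:
            record[name] = default
    return record


if __name__ == "__main__":
    old = encode({"userId": 1, "userName": "uri", "text": "hi"}, TWEET_V1)
    print(decode(old, TWEET_V2))  # new reader, old data
    new = encode({"userId": 1, "userName": "uri", "text": "hi",
                  "language": "fr"}, TWEET_V2)
    print(decode(new, TWEET_V1))  # old reader, new data
```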

Serialization/RPC formats
struct Observation {
  // can be general contig too
  1: required string chromosome,
  // python-style 0-based slicing
  2: required i64 start,
  3: required i64 end,
  // unique identifier for data set
  // (like UCSC genome browser track)
  4: required string track,

  // these are likely derived from the
  // track; separated for convenient join
  5: optional string experiment,
  6: optional string sample,
  // one of these should be non-null,
  // depending on the type of data
  7: optional string valueStr,
  8: optional i64 valueInt,
  9: optional double valueDouble
}
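
To make the coordinate convention concrete, here is a hedged Python mirror of the struct. It is not generated Thrift code: the dataclass, the snake_case field names, and the `overlaps` helper are assumptions for illustration, and the check that exactly one value field is set is a stricter reading of the "one of these should be non-null" comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    chromosome: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive (Python-style slicing)
    track: str
    experiment: Optional[str] = None
    sample: Optional[str] = None
    value_str: Optional[str] = None
    value_int: Optional[int] = None
    value_double: Optional[float] = None

    def __post_init__(self):
        if not 0 <= self.start <= self.end:
            raise ValueError("expected 0 <= start <= end")
        values = (self.value_str, self.value_int, self.value_double)
        if sum(v is not None for v in values) != 1:
            raise ValueError("exactly one value field should be set")

    def __len__(self):
        # Half-open intervals make the length a plain subtraction.
        return self.end - self.start

    def overlaps(self, other):
        """Half-open intervals overlap when each starts before the other ends."""
        return (
            self.chromosome == other.chromosome
            and self.start < other.end
            and other.start < self.end
        )


if __name__ == "__main__":
    a = Observation("chr1", 100, 200, "dnase", value_double=0.5)
    b = Observation("chr1", 200, 300, "dnase", value_int=3)
    print(len(a), a.overlaps(b))  # adjacent intervals do not overlap
```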
Parquet format
Row-major format

Parquet format
Column-major format

Parquet format advantages

Columnar format


read fewer bytes
compression more efficient (incl. dictionary encodings)

Thrift/Avro/Protobuf-compatible data model

Support for nested data structures

Binary encodings
• Hadoop-friendly (“splittable”; implemented in Java)
• Predicate pushdown
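
A rough Python sketch of why a columnar layout reads fewer bytes and compresses better. It does not use the Parquet format or any Parquet library; `to_columns`, `dictionary_encode`, and `scan` are hypothetical helpers. Real predicate pushdown also skips whole chunks using stored statistics, whereas this sketch only shows that a filter needs to touch one column.

```python
# Row-major: each record is stored together.
ROWS = [
    {"chrom": "20", "pos": 14370, "ref": "G", "alt": "A"},
    {"chrom": "20", "pos": 17330, "ref": "T", "alt": "A"},
    {"chrom": "20", "pos": 1110696, "ref": "A", "alt": "G,T"},
    {"chrom": "20", "pos": 1230237, "ref": "T", "alt": "."},
]


def to_columns(rows):
    """Column-major: pivot the records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup table."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def scan(columns, wanted, predicate_column, predicate):
    """Evaluate the predicate on one column, then read only the wanted columns."""
    keep = [i for i, v in enumerate(columns[predicate_column]) if predicate(v)]
    return [{name: columns[name][i] for name in wanted} for i in keep]


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(dictionary_encode(columns["chrom"]))
    print(scan(columns, ["pos", "alt"], "ref", lambda ref: ref == "T"))
```

A column with few distinct values, such as the chromosome, collapses to a tiny dictionary plus a run of identical codes, which is where the compression advantage comes from.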

[Figure: query times on TPC-DS queries, comparing Seq w/ Snappy, RC w/ Snappy, and Parquet w/ Snappy]













Core paradigm shifts with Hadoop

Colocation of storage and compute

Fault tolerance with cheap hardware

Benefits of Hadoop ecosystem

Inexpensive commodity compute/storage


Tolerates random hardware failure

Decreased need for high-bandwidth network pipes
Co-locate compute and storage
• Exploit data locality


Simple horizontal scalability by adding nodes



MapReduce jobs effectively guaranteed to scale

Fault-tolerance/replication built-in. Data is durable
Large ecosystem of tools
Flexible data storage. Schema-on-read. Unstructured data.
Some sins in genomics data infrastructure

HPC separates compute from storage
HPC is about compute.
Hadoop is about data.
Storage infrastructure
• Proprietary, distributed
file system
• Expensive

Compute cluster
Big network
pipe ($$$)

• High-performance
• Low failure rate
• Expensive

User typically works by manually submitting jobs to scheduler
e.g., LSF, Grid Engine, etc.

Hadoop colocates compute and storage
HPC is about compute.
Hadoop is about data.
Compute cluster
Storage infrastructure
• Commodity hardware
• Data-locality
• Reduced networking

User typically works by manually submitting jobs to scheduler
e.g., LSF, Grid Engine, etc.

HPC is lower-level than Hadoop
HPC only exposes job scheduling
• Parallelization typically occurs through MPI


Very low-level communication primitives
Difficult to horizontally scale by simply adding nodes

Large data sets must be manually split
• Failures must be dealt with manually



Hadoop has fault-tolerance, data locality, horizontal scalability
File system as DB; text file as LCD
Broad joint caller with 25k genomes hits file handle limits
• Files streamed over network (HPC architecture)
• Large files split manually
• Sharing data/collaborating involves copying large files

Job scheduler as workflow tool
Submitting jobs to scheduler is very low level
• Workflow engines/execution models provide high
level execution graphs with fault-tolerance



e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
Poor security/access models

Deal with complex set of constraints from a variety of


Certain individuals redact certain parts of their genomes
Certain samples can only be used as controls for particular
Different research groups want to control access to the
data they generate
Clinical trial data must have more rigorous access
Treating computation as free
Many institutions make large clusters available for
“free” to the average researcher
• Focus of dropping sequencing cost has been on

Treating computation as free

Stein, L. D. The case for cloud computing in genome informatics. Genome Biol (2010).
Treating computation as free

Sboner et al. “The real cost of sequencing: higher than you think”. Genome Biology (2011).
Lack of benchmarks for tracking progress

Need to benchmark whether quality of methods are
Lack of benchmarks for tracking progress

Bradnam et al. “Assemblathon 2”, Gigascience 2, 10 (2013).
Academic code
Unreproducible, unbuildable, undocumented, unmaintained,
unavailable, backward-incompatible, shitty code

Most developers self-taught. Only one-third
think formal training is important. [1, 2]
“…people in my lab have requested code from authors
and received source code with syntax errors in it” [3]
[1]: Haussler et al. “A Million Cancer Genome Warehouse” (2012)
[2]: Hannay et al. “How do scientists develop and use scientific software?” (2009)
Fundamentally a barrier to scaling.

NCBI Sequence Read Archive (SRA)
1.14 petabytes

One year ago…
609 terabytes

Every ‘ome has a -seq



Transcriptome FRT-seq




Prescriptions for the future

Move to Hadoop-style environment
Data centralization on HDFS
• Data-local execution to avoid moving terabytes
• Higher-level execution engines to abstract away
computations from details of execution
• Hadoop-friendly, evolvable, serialization formats for:




Storage- and compute-efficiency
Abstracting data model from data storage details

Built-in horizontal scalability and fault-tolerance
APIs instead of file formats
Service-oriented architectures ensure stable contracts
• Allows for implementation changes with new
• Software community has lots of experience with this
type of architecture, along with mature tools.
• Can be implemented as language-independent.

High-granularity access/common consent

Use technologies with highly-granular access


Create common consents for patients to “donate”
their data to research


e.g., Apache Accumulo, cell-based access control

e.g., Personal Genome Project, SAGE Portable Legal
Consent, NCI “information donor”
Tools for open-source/reproducibility
Software and computations should be open-sourced, e.g., on GitHub
• Release VMs or IPython notebooks with publications




“executable paper” to generate figures

Allow others to easily recompute all analyses
Building scalable variant store

Genomics ETL




short read





Short read alignment is embarrassingly parallel
Pileup/variant calling requires distributed sort
GATK is a reimplementation of MapReduce; could run on Hadoop
Early Hadoop tools
• Crossbow: short read alignment/variant calling
• Hadoop-BAM: distributed bamtools
• BioPig: manipulating large fasta/q
• Contrail: de-novo assembly
Genomics ETL

GATK best practices

• Defining alternative to BAM format that’s
• Hadoop-friendly, splittable, designed for
distributed computing
• Format built as Avro objects
• Data stored as Parquet format (columnar)
• Attempting to reimplement GATK pipeline to function
on Hadoop/Parquet
• Currently run out of the AMPLab at UC Berkeley

Genomics ETL



short read




Querying large, integrated variant data
Biotech client has thousands of genomes
• Want to expose ad hoc querying functionality on large



Integrating data with public data sets
(e.g., ENCODE, UCSC tracks, dbSNP, etc.)


e.g., vcftools/PLINK-SEQ on terabyte-scale data sets

Terabyte-scale annotation sets
Conventional approaches: manual

Manually parsing flat files

Write ad hoc scripts in perl or python
Build data structures in memory for
Custom script per query

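import numpy as np
import vdj  # the module used on the slide

# inhandle/outhandle: already-open input and output file handles
# Tally how many chains share each junction.
counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try: counts_dict[chain.junction] += 1
    except KeyError: counts_dict[chain.junction] = 1

# Write one count per line.
for count in counts_dict.values():
    print(np.int_(count), file=outhandle)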

Conventional approaches: database

Very feature rich and mature

Common analytical tasks (e.g., joins, group-by, etc.)
Access control
Very mature

Scalability issues
• Indices can be prohibitive
• RDBMS: schemas can be annoyingly rigid
• NoSQL: adolescent implementations (but easy to

Conventional approaches: domain-specific


Designed for specific use-cases
Workflows are highly opinionated/rigid
Requires learning another language
Scalability issues
Hadoop sol’n: storage

Impala/Hive metastore provide a unified, flexible data



Define Avro types for all data

Data stored as Parquet format to maximize
compression and query performance
Hadoop sol’n: available analytics engines


Analytical operations implemented by experts in
distributed systems
Impala implements RDBMS-style operations
Search offers metadata indexing
Spark offers in-memory processing for ML
HDFS-based analytical engines designed for horizontal
Variant store architecture




Avro schema

Thrift service
Impala shell



Hive metastore
Impala query engine

Example schema
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2


Why denormalization is good

Replace joins with filters



For query engines with efficient scans, this simplifies
queries and can improve performance
Parquet format supports predicate pushdowns, reducing
necessary I/O

Because storage is cheap, amortize cost of up-front
join over simpler queries going forward
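
As an illustration of the up-front flattening, the sketch below turns one VCF data line into one flat row per sample call, so later questions become filters over rows instead of joins. It is a simplified assumption of how such an ETL step might look, not the pipeline behind these slides: `flatten_vcf_line` is a hypothetical helper, the sample ids are placeholders, `SAMPLE_ID` is an assumed field name, and fields are split on whitespace because the tabs were lost in the slide text.

```python
def flatten_vcf_line(line, sample_ids):
    """Turn one VCF data line into one flat dict per sample call."""
    fields = line.split()
    chrom, pos, variant_id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")  # e.g. GT:GQ:DP:HQ
    rows = []
    for sample_id, call in zip(sample_ids, fields[9:]):
        # Variant-level columns are repeated on every row (denormalized).
        row = {
            "VCF_CHROM": chrom,
            "VCF_POS": int(pos),
            "VCF_ID": variant_id,
            "VCF_REF": ref,
            "VCF_ALT": alt,
            "SAMPLE_ID": sample_id,
        }
        for key, value in zip(format_keys, call.split(":")):
            row["VCF_CALL_" + key] = value
        rows.append(row)
    return rows


if __name__ == "__main__":
    line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
            "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")
    rows = flatten_vcf_line(line, ["S1", "S2", "S3"])
    # A question about non-reference calls is now a plain filter.
    for row in rows:
        if row["VCF_CALL_GT"] != "0|0":
            print(row["SAMPLE_ID"], row["VCF_POS"], row["VCF_CALL_GT"])
```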
"default": null,
"doc": "Genotype",
"type": [
"name": "VCF_CALL_GT"
"default": null,
"doc": "Genotype Quality",
"type": [
"name": "VCF_CALL_GQ"
"default": null,
"doc": "Read Depth",
"type": [
"name": "VCF_CALL_DP"
"default": [],
"doc": "Haplotype Quality",
"type": "string",
"name": "VCF_CALL_HQ"

Example schema
"name": "VCF",
"type": "record"
"fields": [
"type": "string",
"name": "VCF_CHROM"
"type": "int",
"name": "VCF_POS"
"type": "string",
"name": "VCF_ID"
"type": "string",
"name": "VCF_REF"
"type": "string",
"name": "VCF_ALT"


Example variant-filtering query

“Give me all SNPs that are:



on chromosome 5
absent from dbSNP
present in COSMIC
observed in breast cancer samples
absent from prostate cancer samples”

On full 1000 genome data set (~37 billion
variants), query finishes in a couple seconds
Example variant-filtering query
SELECT cosmic as snp_id,
vcf_chrom as chr,
vcf_pos as pos,
sample_id as sample,
vcf_call_gt as genotype,
sample_affection as phenotype
sample_study = "breast_cancer" AND
VCF_CHROM = "16";
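
The shape of such a query can be tried locally. The sketch below uses Python's built-in `sqlite3` as a stand-in for Impala, so it says nothing about Impala's performance or SQL dialect. The table name `variants`, the `dbsnp` column, and the four rows are made up for illustration; the other column names follow the query above.

```python
import sqlite3

# Made-up toy rows for one denormalized table:
# chrom, pos, sample, genotype, affection, study, cosmic id, dbSNP id
ROWS = [
    ("16", 1001, "S1", "0/1", "case", "breast_cancer", "COSM1", None),
    ("16", 1001, "S2", "0/0", "control", "breast_cancer", "COSM1", None),
    ("16", 2002, "S1", "1/1", "case", "breast_cancer", None, "rs42"),
    ("5", 3003, "S3", "0/1", "case", "prostate_cancer", "COSM2", None),
]

QUERY = """
    SELECT cosmic AS snp_id, vcf_chrom AS chr, vcf_pos AS pos,
           sample_id AS sample, vcf_call_gt AS genotype,
           sample_affection AS phenotype
    FROM variants
    WHERE cosmic IS NOT NULL
      AND dbsnp IS NULL
      AND sample_study = 'breast_cancer'
      AND vcf_chrom = '16'
    ORDER BY sample_id
"""


def run_query():
    """Load the toy rows into an in-memory table and run the filter query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE variants (vcf_chrom TEXT, vcf_pos INTEGER, "
            "sample_id TEXT, vcf_call_gt TEXT, sample_affection TEXT, "
            "sample_study TEXT, cosmic TEXT, dbsnp TEXT)"
        )
        conn.executemany(
            "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)", ROWS
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_query():
        print(row)
```

Every condition is a filter on a column of the same wide table, which is the payoff of the denormalized layout: no join is needed at query time.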

Impala execution
Query compiled into execution tree, chopped up
across all nodes (if possible)
• Two join implementations


Broadcast: each node gets copy of full right table
Shuffle: both sides of join are partitioned

Partitioned tables vastly reduce amount of I/O
• File formats make enormous difference in query

Other desirable query-examples
“How do the mutations in a given subject compare to
the mutations in other phenotypically similar
• “For a given gene, in what pathways and cancer
subtypes is it involved?” (connecting phenotypes to
• “How common are an observed set of mutations?”
• “For a given type of cancer, what are the
characteristic disruptions?”

Types of queries desired
Lots of these queries can be simply translated into
SQL queries
• Similar to functionality provided by PLINK/SEQ, but
designed to scale to much larger data sets

All-vs-all eQTL

Possible to generate trillions of hypothesis tests


10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
Tested below on 120 billion associations

Example queries:

“Given 5 genes of interest, find top 20 most significant
eQTLs (cis and/or trans)”


“Find all cis-eQTLs across the entire genome”


Finishes in several seconds
Finishes in a couple of minutes
Limited by disk throughput
All-vs-all eQTL

“Find all SNPs that are:


in LD with some lead SNP
or eQTL of interest
align with some functional
annotation of interest”

Still in testing, but likely
finishes in seconds

Schaub et al, Genome Research, 2012
Hadoop ecosystem provides centralized, scalable
repository for data
• An abundance of tools for providing views/analytics
into the data store


Separate implementation details from data pipelines

Software quality/data structures/file formats matter
• Genomics has much to gain from moving away from
HPC architecture toward Hadoop ecosystem

Cloud-based implementation
Hadoop-ecosystem architecture easily translates to
the cloud (AWS, OpenStack)
• Provides elastic capacity; no large initial CAPEX
• Risk of vendor lock-in once data set is large
• Allows simple sharing of data via public S3
buckets, for example

Future work

Broad Institute has experimented with Google’s
BigQuery for a variant store

BigQuery is Google’s Dremel exposed to public on Google’s
Closed-source, only Google cloud

Developed API for working with variant data
• Soon develop Impala-backed implementation of
Broad API



To be open-sourced
Future work
Drive towards several large data warehouses; storage
backend optimized for particular access patterns
• Each can expose one or more APIs for different
applications/access levels.
• Haussler, D. et al. A Million Cancer Genome
Warehouse. (2012). Tech Report.

Josh Wills
Jeff Hammerbacher
Impala team (Nong Li)
Sandy Ryza

Julien Le Dem (Twitter)
Our biotech client
Mike Schatz (CSHL)
Matt Massie

More Related Content

What's hot

Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
Geoffrey Fox
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 BioTeam Bhanu Rekepalli Presentation at BICoB 2015 BioTeam Bhanu Rekepalli Presentation at BICoB 2015
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
The BioTeam Inc.
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
Lessons Learned on Benchmarking Big Data Platforms
Lessons Learned on Benchmarking  Big Data PlatformsLessons Learned on Benchmarking  Big Data Platforms
Lessons Learned on Benchmarking Big Data Platforms
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Flavio Vit
Hadoop Hadoop
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
Dzung Nguyen
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
JanBask Training
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
Venkata Naga Ravi
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
Geoffrey Fox
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
Brendan Tierney
Python in big data world
Python in big data worldPython in big data world
Python in big data world
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
Xuan-Chao Huang
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
Bart Vandewoestyne
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
Nicolas Poggi
R Hadoop integration
R Hadoop integrationR Hadoop integration
R Hadoop integrationDzung Nguyen
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
Ajit Koti
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Atul Kushwaha
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman

What's hot (20)

Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 BioTeam Bhanu Rekepalli Presentation at BICoB 2015 BioTeam Bhanu Rekepalli Presentation at BICoB 2015
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
Lessons Learned on Benchmarking Big Data Platforms
Lessons Learned on Benchmarking  Big Data PlatformsLessons Learned on Benchmarking  Big Data Platforms
Lessons Learned on Benchmarking Big Data Platforms
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Hadoop Hadoop
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
Python in big data world
Python in big data worldPython in big data world
Python in big data world
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
R Hadoop integration
R Hadoop integrationR Hadoop integration
R Hadoop integration
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem

Similar to Hadoop for Bioinformatics: Building a Scalable Variant Store

Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
Zohar Elkayam
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for Developers
Zohar Elkayam
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and Basics
Oleg Magazov
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
Tom Rogers
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
praveen bhat
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
P.Maharajothi, science),Bon secours college for women,thanjavur.
P.Maharajothi, science),Bon secours college for women,thanjavur.P.Maharajothi, science),Bon secours college for women,thanjavur.
P.Maharajothi, science),Bon secours college for women,thanjavur.
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
Konstantin V. Shvachko
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
Hadoop Maharajathi,,Computer Science,Bonsecours college for women
Hadoop Maharajathi,,Computer Science,Bonsecours college for womenHadoop Maharajathi,,Computer Science,Bonsecours college for women
Hadoop Maharajathi,,Computer Science,Bonsecours college for women
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
Institute of Contemporary Sciences
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Douglas Moore

Similar to Hadoop for Bioinformatics: Building a Scalable Variant Store (20)

Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for Developers
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and Basics
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata

Hadoop for Bioinformatics: Building a Scalable Variant Store

  • 1. Hadoop ecosystem for genomics
      Uri Laserson, Mount Sinai School of Medicine, 29 October 2013
  • 2. Agenda
      1. Hadoop overview
          • Historical context
          • Hadoop overview
          • Some sins in bioinformatics
      2. Scalable variant store
          • Possible conventional solutions
          • Hadoop/Impala implementation
  • 5. Indexing the Web
      • Web is huge
          • Hundreds of millions of pages in 1999
      • How do you index it?
          • Crawl all the pages
          • Rank pages based on relevance metrics
          • Build search index of keywords to pages
          • Do it in real time!
  • 7. Databases in 1999
      1. Buy a really big machine
      2. Install expensive DBMS on it
      3. Point your workload at it
      4. Hope it doesn’t fail
      5. Ambitious: buy another big machine as backup
  • 9. Database Limitations
      • Didn’t scale horizontally
          • High marginal cost ($$$)
      • No real fault-tolerance story
      • Vendor lock-in ($$$)
      • SQL unsuited for search ranking
          • Complex analysis (PageRank)
          • Unstructured data
  • 11. Google does something different
      • Designed their own storage and processing infrastructure
          • Google File System (GFS) and MapReduce (MR)
      • Goals: KISS
          • Cheap
          • Scalable
          • Reliable
  • 12. Google does something different
      • It worked!
          • Powered Google Search for many years
          • General framework for large-scale batch computation tasks
          • Still used internally at Google to this day
  • 13. Google benevolent enough to publish
      • 2003: Google File System paper
      • 2004: MapReduce paper
  • 14. Birth of Hadoop at Yahoo!
      • 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR.
      • 2006: Spun out as Apache Hadoop
      • Named after Doug’s son’s yellow stuffed elephant
  • 15. Open-source proliferation
      Google            Open-source          Function
      GFS               HDFS (Hadoop)        Distributed file system
      MapReduce         MapReduce (Hadoop)   Batch distributed data processing
      Bigtable          HBase                Distributed DB/key-value store
      Protobuf/Stubby   Thrift or Avro       Data serialization/RPC
      Pregel            Giraph               Distributed graph processing
      Dremel/F1         Cloudera Impala      Scalable interactive SQL (MPP)
      FlumeJava         Crunch               Abstracted data pipelines on Hadoop
  • 16. Overview of core technology
  • 17. HDFS design assumptions
      • Based on Google File System
      • Files are large (GBs to TBs)
      • Failures are common
          • Massive scale means failures very likely
          • Disk, node, or network failures
      • Accesses are large and sequential
      • Files are append-only
  • 18. HDFS properties
      • Fault-tolerant
          • Gracefully responds to node/disk/network failures
      • Horizontally scalable
          • Low marginal cost
      • High-bandwidth
      HDFS storage distribution
      [Figure: an input file is split into blocks 1-5; each block is stored on three of the five nodes (Node A to Node E), three blocks per node.]
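The storage-distribution figure can be illustrated with a small sketch. It assumes a simple round-robin placement, which is a simplification chosen here for illustration; real HDFS uses a rack-aware placement policy, and the function names `place_blocks` and `surviving_blocks` are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption for illustration only;
    it is not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(1, num_blocks + 1):
        start = (block - 1) % len(nodes)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items()
            if set(holders) - failed}


nodes = ["A", "B", "C", "D", "E"]
placement = place_blocks(5, nodes)

# With three replicas per block, losing any two nodes loses no data.
print(sorted(surviving_blocks(placement, {"A", "B"})))
```

With a replication factor of three, a block only becomes unavailable when all three nodes holding it fail at once, which is why cheap, failure-prone hardware is acceptable.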
  • 20. MapReduce computation
      • Structured as
          1. Embarrassingly parallel “map stage”
          2. Cluster-wide distributed sort (“shuffle”)
          3. Aggregation “reduce stage”
      • Data-locality: process the data where it is stored
      • Fault-tolerance: failed tasks automatically detected and restarted
      • Schema-on-read: data need not be stored conforming to a rigid schema
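The three stages can be sketched in a single process. This is a toy model of the programming pattern, not Hadoop's API: the helper names (`map_stage`, `shuffle`, `reduce_stage`) and the tab-separated input records are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_stage(records, mapper):
    """Apply the mapper to each record independently (embarrassingly parallel)."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Sort all (key, value) pairs by key and group the values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_stage(grouped, reducer):
    """Aggregate the values collected for each key."""
    return {key: reducer(key, values) for key, values in grouped}


def count_mapper(line):
    """Emit (chromosome, 1) for one tab-separated variant line."""
    chrom = line.split("\t")[0]
    yield chrom, 1


def sum_reducer(key, values):
    return sum(values)


lines = ["20\t14370\tG\tA", "20\t17330\tT\tA", "5\t1000\tC\tT"]
counts = reduce_stage(shuffle(map_stage(lines, count_mapper)), sum_reducer)
print(counts)
```

On a cluster, the map calls run on the nodes that hold the input blocks, the sort is distributed across the cluster, and each reducer receives all values for its keys.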
  • 26. Cloudera Impala
      • Modern MPP database built on top of HDFS
      • Designed for interactive queries on terabyte-scale data sets
  • 27. Cloudera Search
      • Interactive search queries on top of HDFS
      • Built on Solr and SolrCloud
      • Near-realtime indexing of new documents
  • 28. Serialization/RPC formats
      • Specify schemas/services in user-friendly IDLs
      • Code-generation to multiple languages (wire-compatible/portable)
      • Compact, binary formats
      • Natural support for schema evolution
      • Multiple implementations:
          • Apache Thrift, Apache Avro, Google’s Protocol Buffers
  • 29. Serialization/RPC formats
      struct Tweet {
        1: required i32 userId;
        2: required string userName;
        3: required string text;
        4: optional Location loc;
        16: optional string language = "english"
      }

      service Twitter {
        void ping();
        bool postTweet(1:Tweet tweet);
        TweetSearchResult searchTweets(1:string query);
      }
  • 30. Serialization/RPC formats
      struct Observation {
        // can be general contig too
        1: required string chromosome,

        // python-style 0-based slicing
        2: required i64 start,
        3: required i64 end,

        // unique identifier for data set
        // (like UCSC genome browser track)
        4: required string track,

        // these are likely derived from the
        // track; separated for convenient join
        5: optional string experiment,
        6: optional string sample,

        // one of these should be non-null,
        // depending on the type of data
        7: optional string valueStr,
        8: optional i64 valueInt,
        9: optional double valueDouble
      }
  • 33. Parquet format advantages
      • Columnar format
          • read fewer bytes
          • compression more efficient (incl. dictionary encodings)
      • Thrift/Avro/Protobuf-compatible data model
          • Support for nested data structures
      • Binary encodings
      • Hadoop-friendly (“splittable”; implemented in Java)
      • Predicate pushdown
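A rough sketch of why a columnar layout with predicate pushdown reads fewer bytes. It models row groups as in-memory column lists and computes min/max on the fly; this is an assumption for illustration, since real Parquet files store such statistics in their metadata. The `RowGroup` class and `scan_equals` function are hypothetical names, not part of any Parquet library.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A horizontal slice of a table, stored column by column."""
    columns: dict  # column name -> list of values

    def stats(self, column):
        # Computed here for simplicity; a real file would keep these in metadata.
        values = self.columns[column]
        return min(values), max(values)


def scan_equals(row_groups, filter_col, target, project_col):
    """Return project_col values where filter_col == target.

    Row groups whose [min, max] range excludes the target are skipped
    without touching their data, and only two columns are ever read.
    """
    results, groups_read = [], 0
    for group in row_groups:
        low, high = group.stats(filter_col)
        if target < low or target > high:
            continue  # pushdown: the whole row group is skipped
        groups_read += 1
        for f, p in zip(group.columns[filter_col], group.columns[project_col]):
            if f == target:
                results.append(p)
    return results, groups_read


row_groups = [
    RowGroup({"pos": [100, 200, 300], "ref": ["A", "C", "G"], "alt": ["T", "T", "A"]}),
    RowGroup({"pos": [1000, 1100], "ref": ["T", "A"], "alt": ["C", "G"]}),
]
refs, groups_read = scan_equals(row_groups, "pos", 1100, "ref")
print(refs, groups_read)
```

The `alt` column is never read, and the first row group is skipped entirely, which is the I/O saving the slide refers to.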
  • 34. Query Times on TPCDS Queries
      [Figure: query time in seconds (axis 0-500) for queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79, Q96; series: Text, Seq w/ Snappy, RC w/ Snappy, Parquet w/ Snappy.]
  • 35. Core paradigm shifts with Hadoop
      • Colocation of storage and compute
      • Fault tolerance with cheap hardware
  • 36. Benefits of Hadoop ecosystem
      • Inexpensive commodity compute/storage
          • Tolerates random hardware failure
      • Decreased need for high-bandwidth network pipes
          • Co-locate compute and storage
          • Exploit data locality
      • Simple horizontal scalability by adding nodes
          • MapReduce jobs effectively guaranteed to scale
      • Fault-tolerance/replication built-in. Data is durable
      • Large ecosystem of tools
      • Flexible data storage. Schema-on-read. Unstructured data.
  • 37. Some sins in genomics data infrastructure
  • 38. HPC separates compute from storage
      HPC is about compute. Hadoop is about data.
      • Storage infrastructure
          • Proprietary, distributed file system
          • Expensive
      • Big network pipe ($$$)
      • Compute cluster
          • High-performance hardware
          • Low failure rate
          • Expensive
      User typically works by manually submitting jobs to scheduler, e.g., LSF, Grid Engine, etc.
  • 39. Hadoop colocates compute and storage
      HPC is about compute. Hadoop is about data.
      • Compute cluster / Storage infrastructure
          • Commodity hardware
          • Data-locality
          • Reduced networking needs
      User typically works by manually submitting jobs to scheduler, e.g., LSF, Grid Engine, etc.
  • 40. HPC is lower-level than Hadoop
      • HPC only exposes job scheduling
      • Parallelization typically occurs through MPI
          • Very low-level communication primitives
          • Difficult to horizontally scale by simply adding nodes
      • Large data sets must be manually split
      • Failures must be dealt with manually
      • Hadoop has fault-tolerance, data locality, horizontal scalability
  • 41. File system as DB; text file as LCD
      • Broad joint caller with 25k genomes hits file handle limits
      • Files streamed over network (HPC architecture)
      • Large files split manually
      • Sharing data/collaborating involves copying large files
  • 42. Job scheduler as workflow tool
      • Submitting jobs to scheduler is very low level
      • Workflow engines/execution models provide high-level execution graphs with fault-tolerance
          • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive
  • 43. Poor security/access models
      • Deal with complex set of constraints from a variety of consents/redactions
          • Certain individuals redact certain parts of their genomes
          • Certain samples can only be used as controls for particular studies
          • Different research groups want to control access to the data they generate
          • Clinical trial data must have more rigorous access restrictions
  • 44. Treating computation as free
      • Many institutions make large clusters available for “free” to the average researcher
      • Focus of dropping sequencing cost has been on biochemistry
  • 45. Treating computation as free
      Stein, L. D. The case for cloud computing in genome informatics. Genome Biol (2010).
  • 46. Treating computation as free
      Sboner et al. “The real cost of sequencing: higher than you think”. Genome Biology (2011).
  • 47. Lack of benchmarks for tracking progress
      • Need to benchmark whether quality of methods is improving
  • 48. Lack of benchmarks for tracking progress
      Bradnam et al. “Assemblathon 2”, Gigascience 2, 10 (2013).
  • 49. Academic code
      Unreproducible, unbuildable, undocumented, unmaintained, unavailable, backward-incompatible, shitty code
      Most developers self-taught. Only one-third think formal training is important. [1, 2]
      “…people in my lab have requested code from authors and received source code with syntax errors in it” [3]
      [1]: Haussler et al. “A Million Cancer Genome Warehouse” (2012)
      [2]: Hannay et al. “How do scientists develop and use scientific software?” (2009)
      [3]:
  • 50. Fundamentally a barrier to scaling.
  • 52. NCBI Sequence Read Archive (SRA)
      • Today… 1.14 petabytes
      • One year ago… 609 terabytes
  • 53. Every ‘ome has a -seq
      Genome          DNA-seq
      Transcriptome   RNA-seq, FRT-seq, NET-seq
      Methylome       Bisulfite-seq
      Immunome        Immune-seq
      Proteome        PhIP-seq, Bind-n-seq
  • 55. Move to Hadoop-style environment
      • Data centralization on HDFS
      • Data-local execution to avoid moving terabytes
      • Higher-level execution engines to abstract away computations from details of execution
      • Hadoop-friendly, evolvable, serialization formats for:
          • Storage- and compute-efficiency
          • Abstracting data model from data storage details
      • Built-in horizontal scalability and fault-tolerance
  • 56. APIs instead of file formats
      • Service-oriented architectures ensure stable contracts
      • Allows for implementation changes with new technologies
      • Software community has lots of experience with this type of architecture, along with mature tools.
      • Can be implemented as language-independent.
  • 57. High-granularity access/common consent
      1. Use technologies with highly-granular access control
          • e.g., Apache Accumulo, cell-based access control
      2. Create common consents for patients to “donate” their data to research
          • e.g., Personal Genome Project, SAGE Portable Legal Consent, NCI “information donor”
  • 58. Tools for open-source/reproducibility
      • Software and computations should be open-sourced, e.g., on GitHub
      • Release VMs or IPython notebooks with publications
          • “executable paper” to generate figures
          • Allow others to easily recompute all analyses
  • 60. Genomics ETL
      biochemistry → .fastq → short read alignment → .bam → genotype calling → .vcf → analysis
      • Short read alignment is embarrassingly parallel
      • Pileup/variant calling requires distributed sort
      • GATK is a reimplementation of MapReduce; could run on Hadoop
      • Early Hadoop tools
          • Crossbow: short read alignment/variant calling
          • Hadoop-BAM: distributed bamtools
          • BioPig: manipulating large fasta/q
          • Contrail: de-novo assembly
  • 61. Genomics ETL
      GATK best practices
  • 63. ADAM
      • Defining alternative to BAM format that’s
          • Hadoop-friendly, splittable, designed for distributed computing
      • Format built as Avro objects
      • Data stored as Parquet format (columnar)
      • Attempting to reimplement GATK pipeline to function on Hadoop/Parquet
      • Currently run out of the AMPLab at UC Berkeley
  • 65. Querying large, integrated variant data
      • Biotech client has thousands of genomes
      • Want to expose ad hoc querying functionality on large scale
          • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
      • Integrating data with public data sets (e.g., ENCODE, UCSC tracks, dbSNP, etc.)
          • Terabyte-scale annotation sets
  • 66. Conventional approaches: manual
      • Manually parsing flat files
          • Write ad hoc scripts in perl or python
          • Build data structures in memory for histograms/aggregations
          • Custom script per query

      counts_dict = {}
      for chain in vdj.parse_VDJXML(inhandle):
          try:
              counts_dict[chain.junction] += 1
          except KeyError:
              counts_dict[chain.junction] = 1

      for count in counts_dict.itervalues():
          print >>outhandle, np.int_(count)
  • 67. Conventional approaches: database
      • Very feature rich and mature
          • Common analytical tasks (e.g., joins, group-by, etc.)
          • Access control
          • Very mature
      • Scalability issues
      • Indices can be prohibitive
      • RDBMS: schemas can be annoyingly rigid
      • NoSQL: adolescent implementations (but easy to start)
  • 68. Conventional approaches: domain-specific
      • e.g., PLINK/SEQ
      • Designed for specific use-cases
      • Workflows are highly opinionated/rigid
      • Requires learning another language
      • Scalability issues
  • 69. Hadoop sol’n: storage
      • Impala/Hive metastore provide a unified, flexible data model
          • Define Avro types for all data
          • Data stored as Parquet format to maximize compression and query performance
  • 70. Hadoop sol’n: available analytics engines
      • Analytical operations implemented by experts in distributed systems
      • Impala implements RDBMS-style operations
      • Search offers metadata indexing
      • Spark offers in-memory processing for ML
      • HDFS-based analytical engines designed for horizontal scalability
  • 71. Variant store architecture
      [Diagram components: .vcf and .csv (external annotations) inputs; ETL; Avro schema; .parquet storage; Hive metastore; Impala query engine; query interfaces (Thrift service, JDBC, REST API, Impala shell); results.]
  • 72. Example schema
      ##fileformat=VCFv4.1
      ##fileDate=20090805
      ##source=myImputationProgramV3.1
      ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
      ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
      ##phasing=partial
      ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
      ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
      ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
      ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
      ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
      ##FILTER=<ID=q10,Description="Quality below 10">
      ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
      ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
      ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
      #CHROM  POS      ID         REF  ALT  QUAL  FILTER  INFO                       FORMAT       NA00001         NA00002         NA00003
      20      14370    rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2    GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
      20      17330    .          T    A    3     q10     NS=3;DP=11;AF=0.017        GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
      20      1110696  rs6040355  A    G,T  67    PASS    NS=2;DP=10;AF=0.333,0.667  GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
      20      1230237  .          T    .    47    PASS    NS=3;DP=13;AA=T            GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
  • 75. Why denormalization is good
      • Replace joins with filters
          • For query engines with efficient scans, this simplifies queries and can improve performance
          • Parquet format supports predicate pushdowns, reducing necessary I/O
          • Because storage is cheap, amortize cost of up-front join over simpler queries going forward
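As a minimal sketch of the idea, the join between calls and annotations is paid once at load time, and every later query becomes a plain filter over the wide rows. The field names and the identifiers `COSM1`, `COSM2`, and `rs1` are made up for illustration and are not taken from the talk's data.

```python
def denormalize(calls, annotations):
    """Join each call with its annotation by (chrom, pos), once, up front."""
    rows = []
    for call in calls:
        ann = annotations.get((call["chrom"], call["pos"]), {})
        rows.append({**call,
                     "dbsnp": ann.get("dbsnp"),
                     "cosmic": ann.get("cosmic")})
    return rows


# Hypothetical genotype calls and annotation lookups.
calls = [
    {"chrom": "16", "pos": 100, "sample": "S1", "gt": "0/1"},
    {"chrom": "16", "pos": 200, "sample": "S1", "gt": "1/1"},
    {"chrom": "5", "pos": 300, "sample": "S2", "gt": "0/1"},
]
annotations = {
    ("16", 100): {"dbsnp": None, "cosmic": "COSM1"},
    ("16", 200): {"dbsnp": "rs1", "cosmic": "COSM2"},
}

wide = denormalize(calls, annotations)  # cost paid once

# Every later query is a filter over the wide table; no join is needed.
hits = [row for row in wide
        if row["chrom"] == "16"
        and row["cosmic"] is not None
        and row["dbsnp"] is None]
print(hits)
```

The wide table repeats annotation values on every matching row, which costs storage; a columnar format with dictionary encoding keeps that overhead modest.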
  • 76. Example schema
      {
        "name": "VCF",
        "type": "record",
        "fields": [
          { "type": "string", "name": "VCF_CHROM" },
          { "type": "int", "name": "VCF_POS" },
          { "type": "string", "name": "VCF_ID" },
          { "type": "string", "name": "VCF_REF" },
          { "type": "string", "name": "VCF_ALT" },
          ...
          { "default": null, "doc": "Genotype", "type": [ "null", "string" ], "name": "VCF_CALL_GT" },
          { "default": null, "doc": "Genotype Quality", "type": [ "null", "int" ], "name": "VCF_CALL_GQ" },
          { "default": null, "doc": "Read Depth", "type": [ "null", "int" ], "name": "VCF_CALL_DP" },
          { "default": [], "doc": "Haplotype Quality", "type": "string", "name": "VCF_CALL_HQ" }
          ...
        ]
      }
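One way the ETL step might map a VCF record onto this flat schema is to emit one row per sample. The sketch below is an assumption about that step, not the implementation from the talk; the `SAMPLE_ID` field and the function names are hypothetical, while the `VCF_*` names follow the schema above.

```python
def _as_int(call, key):
    """Parse an integer FORMAT value; missing or '.' becomes None."""
    value = call.get(key)
    return int(value) if value not in (None, ".") else None


def flatten_vcf_record(line, sample_names):
    """Turn one tab-separated VCF data line into one flat row per sample."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    rows = []
    for name, sample in zip(sample_names, fields[9:]):
        # Trailing FORMAT values may be omitted, so zip stops early for them.
        call = dict(zip(format_keys, sample.split(":")))
        rows.append({
            "VCF_CHROM": chrom,
            "VCF_POS": int(pos),
            "VCF_ID": vid,
            "VCF_REF": ref,
            "VCF_ALT": alt,
            "SAMPLE_ID": name,  # hypothetical field, not shown in the schema
            "VCF_CALL_GT": call.get("GT"),
            "VCF_CALL_GQ": _as_int(call, "GQ"),
            "VCF_CALL_DP": _as_int(call, "DP"),
            "VCF_CALL_HQ": call.get("HQ"),
        })
    return rows


record = "\t".join([
    "20", "17330", ".", "T", "A", "3", "q10", "NS=3;DP=11;AF=0.017",
    "GT:GQ:DP:HQ", "0|0:49:3:58,50", "0|1:3:5:65,3", "0/0:41:3",
])
rows = flatten_vcf_record(record, ["NA00001", "NA00002", "NA00003"])
for row in rows:
    print(row["SAMPLE_ID"], row["VCF_CALL_GT"], row["VCF_CALL_GQ"])
```

Each row repeats the site-level fields, which is the denormalized layout that lets later queries filter on sample and genotype without a join.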
  • 77. Example variant-filtering query • “Give me all SNPs that are: on chromosome 5; absent from dbSNP; present in COSMIC; observed in breast cancer samples; absent from prostate cancer samples” • On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds
  • 78. Example variant-filtering query
    SELECT cosmic as snp_id, vcf_chrom as chr, vcf_pos as pos, sample_id as sample, vcf_call_gt as genotype, sample_affection as phenotype
    FROM hg19_parquet_snappy_join_cached_partitioned
    WHERE COSMIC IS NOT NULL AND dbSNP IS NULL AND sample_study = "breast_cancer" AND VCF_CHROM = "16";
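The shape of this query can be tried on a toy table without a cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Impala; the table name `variants` and the five sample rows are invented for illustration, and only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE variants (
           cosmic TEXT, dbsnp TEXT, vcf_chrom TEXT, vcf_pos INTEGER,
           sample_id TEXT, vcf_call_gt TEXT,
           sample_affection TEXT, sample_study TEXT)"""
)

# Toy denormalized rows: annotation and sample columns sit next to each call.
toy_rows = [
    ("COSM1", None, "16", 1000, "S1", "0/1", "affected", "breast_cancer"),    # matches
    ("COSM2", "rs1", "16", 2000, "S1", "1/1", "affected", "breast_cancer"),   # in dbSNP
    (None, None, "16", 3000, "S2", "0/1", "affected", "breast_cancer"),       # not in COSMIC
    ("COSM3", None, "5", 4000, "S2", "0/1", "affected", "breast_cancer"),     # other chromosome
    ("COSM4", None, "16", 5000, "S3", "0/1", "affected", "prostate_cancer"),  # other study
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?, ?)", toy_rows)

QUERY = """
SELECT cosmic AS snp_id, vcf_chrom AS chr, vcf_pos AS pos,
       sample_id AS sample, vcf_call_gt AS genotype,
       sample_affection AS phenotype
FROM variants
WHERE cosmic IS NOT NULL
  AND dbsnp IS NULL
  AND sample_study = ?
  AND vcf_chrom = ?
"""
result = conn.execute(QUERY, ("breast_cancer", "16")).fetchall()
```

Every condition is a filter on a column of the single wide table, so no join appears in the query. That is the payoff of the denormalized layout.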
  • 79. Impala execution • Query compiled into execution tree, chopped up across all nodes (if possible) • Two join implementations: 1. Broadcast: each node gets copy of full right table; 2. Shuffle: both sides of join are partitioned • Partitioned tables vastly reduce amount of I/O • File formats make enormous difference in query performance
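The two join strategies can be contrasted with a small single-process simulation. This is a conceptual sketch, not Impala's code: "nodes" are plain Python lists, and the function names `hash_join`, `broadcast_join`, and `shuffle_join` are made up for the example.

```python
from collections import defaultdict


def hash_join(left, right):
    """Local hash join on the first tuple element: build on right, probe with left."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lval, rval) for key, lval in left for rval in table.get(key, [])]


def broadcast_join(left_parts, right):
    """Each node keeps its slice of the left table and receives the full right table."""
    out = []
    for part in left_parts:
        out.extend(hash_join(part, right))
    return out


def shuffle_join(left, right, n_nodes):
    """Hash-partition both tables on the join key so matching keys meet on one node."""
    left_parts = [[] for _ in range(n_nodes)]
    right_parts = [[] for _ in range(n_nodes)]
    for row in left:
        left_parts[hash(row[0]) % n_nodes].append(row)
    for row in right:
        right_parts[hash(row[0]) % n_nodes].append(row)
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        out.extend(hash_join(lpart, rpart))
    return out


calls = [("rs1", "S1"), ("rs2", "S1"), ("rs1", "S2"), ("rs3", "S2")]
annotations = [("rs1", "COSM1"), ("rs3", "COSM3")]

broadcast_result = broadcast_join([calls[:2], calls[2:]], annotations)
shuffle_result = shuffle_join(calls, annotations, n_nodes=3)
```

Both strategies return the same rows; they differ in what moves over the network. Broadcast copies the whole right table to every node, which suits a small right table, while shuffle moves each row of both tables once, which suits two large tables.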
  • 80. Other desirable query examples • “How do the mutations in a given subject compare to the mutations in other phenotypically similar subjects?” • “For a given gene, in what pathways and cancer subtypes is it involved?” (connecting phenotypes to annotations) • “How common is an observed set of mutations?” • “For a given type of cancer, what are the characteristic disruptions?”
  • 81. Types of queries desired • Lots of these queries can be simply translated into SQL queries • Similar to functionality provided by PLINK/SEQ, but designed to scale to much larger data sets
  • 82. All-vs-all eQTL • Possible to generate trillions of hypothesis tests: 10^7 loci × 10^4 phenotypes × 10s of tissues = 10^12 p-values • Tested below on 120 billion associations • Example queries: “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)” (finishes in several seconds); “Find all cis-eQTLs across the entire genome” (finishes in a couple of minutes; limited by disk throughput)
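The count of hypothesis tests and the "top 20 for a handful of genes" query can be sketched in a few lines of Python. The tissue count of 10 is an assumed lower bound for "10s of tissues", and the association tuples, gene names, and `top_eqtls` helper are invented for illustration.

```python
import heapq

n_loci = 10**7
n_phenotypes = 10**4
n_tissues = 10  # assumption: lower bound for "10s of tissues"
n_tests = n_loci * n_phenotypes * n_tissues  # 10**12 p-values


def top_eqtls(associations, genes, k=20):
    """Return the k associations with the smallest p-value among the genes of interest."""
    wanted = set(genes)
    candidates = (a for a in associations if a[1] in wanted)
    return heapq.nsmallest(k, candidates, key=lambda a: a[2])


# Each association is (snp, gene, p_value).
associations = [
    ("snp1", "GENE_A", 1e-8),
    ("snp2", "GENE_B", 0.03),
    ("snp3", "GENE_A", 1e-12),
    ("snp4", "GENE_C", 1e-20),
]
top = top_eqtls(associations, ["GENE_A", "GENE_B"], k=2)
```

In SQL the same selection would be a `WHERE gene IN (...)` filter followed by `ORDER BY p_value LIMIT 20`, which is why such queries map naturally onto an engine like Impala.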
  • 83. All-vs-all eQTL • “Find all SNPs that are: in LD with some lead SNP or eQTL of interest; aligned with some functional annotation of interest” • Still in testing, but likely finishes in seconds • Schaub et al., Genome Research, 2012
  • 84. Conclusions • Hadoop ecosystem provides centralized, scalable repository for data • An abundance of tools for providing views/analytics into the data store • Separate implementation details from data pipelines • Software quality/data structures/file formats matter • Genomics has much to gain from moving away from HPC architecture toward Hadoop ecosystem architecture
  • 85. Cloud-based implementation • Hadoop-ecosystem architecture easily translates to the cloud (AWS, OpenStack) • Provides elastic capacity; no large initial CAPEX • Risk of vendor lock-in once data set is large • Allows simple sharing of data via public S3 buckets, for example
  • 86. Future work • Broad Institute has experimented with Google’s BigQuery for a variant store • BigQuery is Google’s Dremel exposed to public on Google’s cloud • Closed-source, only Google cloud • Developed API for working with variant data • Soon develop Impala-backed implementation of Broad API • To be open-sourced
  • 87. Future work • Drive towards several large data warehouses; storage backend optimized for particular access patterns • Each can expose one or more APIs for different applications/access levels • Haussler, D. et al. A Million Cancer Genome Warehouse. (2012). Tech Report.
  • 88. Acknowledgements • Cloudera • Josh Wills • Jeff Hammerbacher • Impala team (Nong Li) • Sandy Ryza • Julien Le Dem (Twitter) • Our biotech client • Mike Schatz (CSHL) • Matt Massie

Editor's Notes

  1. Industry had at least some of these problems. Here’s how they solved them.
  2. Already mature technologies at this point. DB community thought it was silly. Non-Google companies were not yet at this scale. Google is not in the business of releasing infrastructure software. They sell ads.
  3. Mostly through the Apache Software Foundation
  4. Talk HDFS and MapReduce. Then some other tools.
  5. Community is coalescing around HDFS
  6. Large blocks. Blocks replicated around.
  7. Two functions required.
  8. Only need to supply 2 functions
  9. Software: Cloudera Enterprise, the platform for big data. A complete data management solution powered by Apache Hadoop. A collection of open source projects forms the foundation of the platform. Cloudera has wrapped the open source core with additional software for system and data management, as well as technical support. Five attributes of Cloudera Enterprise:
     Scalable: Storage and compute in a single system, bringing computation to data (rather than the other way around). Scale capacity and performance linearly: just add nodes. Proven at massive scale: tens of PB of data, millions of users.
     Flexible: Store any type of data (structured, unstructured, semi-structured) in its native format: no conversion required, no loss of data fidelity due to ETL. Fluid structuring: no single model or schema that the data must conform to. Determine how you want to look at data at the time you ask the question; if the attribute exists in the raw data, you can query against it. Alter structure to optimize query performance as desired (not required), using multiple open source file formats like Avro and Parquet. Multiple forms of computation: bring different tools to bear on the data, depending on your skillset and what you want to do. Batch processing: MapReduce, Hive, Pig, Java. Interactive SQL: Impala, BI tools. Interactive search: for non-technical users, or helping to identify datasets for further analysis. Machine learning: apply algorithms to large datasets using libraries like Apache Mahout. Math: tools like SAS and R for data scientists and statisticians. More to come…
     Cost-effective: Scale out on inexpensive, industry-standard hardware (vs. highly tuned, specialized hardware). Fault tolerance built in. Leverage cost structures with existing vendors. Reduced data movement: can perform more operations in a single place due to flexible tooling. Fewer redundant copies of data. Less time spent migrating/managing. Open source software is easy to acquire and prove the value/ROI.
     Open: Rapid innovation. Large development communities. The most talented engineers from across the world. Easy to acquire and prove value: free to download and deploy; demonstrate the value of the technology before you make a large-scale investment. No vendor lock-in: choose your vendor based solely on merit. Cloudera’s open source strategy: if it stores or processes data, it’s open source. Big commitment to open source. Leading contributor to the Apache Hadoop ecosystem, defining the future of the platform together with the community.
     Integrated: Works with all your existing investments: databases and data warehouses, analytics and BI solutions, ETL tools, platforms and operating systems, hardware and networking equipment. Over 700 partners, including all of the leaders in the market segments above. Complements those investments by allowing you to align data and processes to the right solution.
  10. Up until a couple years ago, Hadoop was just MapReduce.
  13. Designed to do the task like BigQuery or Teradata
  14. Interface description language
  15. Compare with UCSC file formats:
  16. Interface description language
  17. Interface description language
  18. Show more general version later: ADAM
  19. I don’t mean to indict anyone in particular.
  20. Not so much a sin, but a 15-year-old architecture.
  21. If you have a gun-fight in the data center, your MapReduce job will still finish.
  22. If you have a gun-fight in the data center, your MapReduce job will still finish.
  23. If you have a gun-fight in the data center, your MapReduce job will still finish.
  24. If you have a gun-fight in the data center, your MapReduce job will still finish.
  25. If you have a gun-fight in the data center, your MapReduce job will still finish.
  26. If you have a gun-fight in the data center, your MapReduce job will still finish.
  27. If you have a gun-fight in the data center, your MapReduce job will still finish.
  28. Log scale.
  29. If you have a gun-fight in the data center, your MapReduce job will still finish.
  30. If you have a gun-fight in the data center, your MapReduce job will still finish.
  31. If you have a gun-fight in the data center, your MapReduce job will still finish.
  32. If you have a gun-fight in the data center, your MapReduce job will still finish.
  33. Define ETL
  34. Define ETL
  35. Define ETL
  36. Trying to work within the Global Alliance. Along with MSSM, Broad, others.
  37. Once you have a VCF, what to do with it? True that data has compressed massively, but can still get large.
  38. Data here is already denormalized. Kicked off the trajectory that landed me at Cloudera.
  39. My first MapReduce job was written in MongoDB’s JavaScript aggregation engine.
  40. ETL: denormalize, dictionary-encode, Snappy-compress
  41. Show more general version later: ADAM
  42. Show more general version later: ADAM
  43. Show more general version later: ADAM
  44. Show more general version later: ADAM
  45. Comment on cached join. Comment on join order and join strategies.