1© Cloudera, Inc. All rights reserved.
The Redemptive Power of Hadoop
Uri Laserson | @laserson | 14 November 2015
Scaling Up Genomics with Spark
We come in peace.
Pioneer plaque
What is genomics?
Organism Cell Genome
Reference chromosome
Location
“… decoding the Book of Life”
Ortelius, 1570
Google Maps, 2015
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
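The four reads above are laid out in FASTA style: a `>` header line followed by the sequence. As a minimal sketch of the first thing a tool has to do with such input, the helper below (`parse_fasta` is a hypothetical name, not something from the talk) turns that text into name/sequence pairs using only the Python standard library.

```python
from typing import Iterator, Tuple

# The four reads shown on the slide.
READS = """\
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
"""


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


reads = dict(parse_fasta(READS))
```

Sequences that wrap over several lines are joined, so the same helper would also cope with longer records.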
Bioinformatics!
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Pipelines!
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
Compressed text files (non-splittable)
Semi-structured
Poorly specified
Global sort order
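To make the "semi-structured" point concrete, here is a minimal sketch of what it takes to read a single VCF data line: fixed columns, a semicolon-separated INFO field with both key=value pairs and bare flags, and per-sample fields whose meaning depends on the FORMAT column. The function names (`parse_info`, `parse_vcf_line`) are hypothetical, and the line is the first record from the example above, rebuilt with the tab separators that VCF uses.

```python
from typing import Dict, List, Union

SAMPLES = ["NA00001", "NA00002", "NA00003"]

# First data line of the example; VCF columns are tab-separated.
LINE = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])


def parse_info(field: str) -> Dict[str, Union[str, bool]]:
    """Split INFO into a dict; bare keys such as DB are treated as flags."""
    info: Dict[str, Union[str, bool]] = {}
    for item in field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf_line(line: str, samples: List[str]) -> dict:
    """Parse one data line into fixed columns, INFO, and per-sample genotypes."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = cols[:9]
    keys = fmt.split(":")
    # Trailing per-sample fields may be omitted; zip simply stops early.
    genotypes = {
        sample: dict(zip(keys, call.split(":")))
        for sample, call in zip(samples, cols[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": flt,
        "info": parse_info(info), "genotypes": genotypes,
    }


record = parse_vcf_line(LINE, SAMPLES)
```

Every value other than `pos` is left as a string here; a real parser would also need the `##INFO` and `##FORMAT` header lines to know each field's type, which is part of why the format is awkward to process at scale.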
C, HPC (scheduler), POSIX filesystem
Java, HPC (queue), POSIX filesystem
C++, single-node, SQLite
It’s file formats all the way down!
Dedup
/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);
    final SAMFileWriter out =
        new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
Method
Code
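Stripped of file handling, the method described in the comment above comes down to grouping reads by where their 5' ends fall and flagging all but one read per group. The following is a simplified sketch of that idea only, not the implementation shown on the slide: the `Read` class, the `score` field, and the rule "keep the highest score" are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    name: str
    chrom: str
    five_prime_pos: int
    strand: str
    score: int                 # assumed tie-breaker, e.g. summed base qualities
    is_duplicate: bool = False


def mark_duplicates(reads: List[Read]) -> List[Read]:
    """Flag every read that shares a 5' end signature with a better-scoring read."""
    # Pass 1: bucket reads by (chromosome, 5' position, strand).
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.five_prime_pos, read.strand)].append(read)
    # Pass 2: keep the best-scoring read in each bucket and flag the rest.
    for bucket in groups.values():
        best = max(bucket, key=lambda r: r.score)
        for read in bucket:
            read.is_duplicate = read is not best
    return reads


reads = mark_duplicates([
    Read("r1", "20", 14370, "+", 60),
    Read("r2", "20", 14370, "+", 75),
    Read("r3", "20", 14370, "-", 50),
    Read("r4", "20", 17330, "+", 40),
])
flagged = [r.name for r in reads if r.is_duplicate]
```

Expressed this way, the method is a group-by followed by a per-group reduction, which is independent of how the reads are stored or which machine holds them.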
@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
Dedup
Method
Code
Platform
It’s pipelines all the way down!
Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 3: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Manually running pipelines on HPC
$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py
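The six submissions above follow a split, apply, merge pattern that the user has to drive by hand. As a rough sketch of the same pattern inside one program, the example below uses a thread pool in place of the batch queue. The function names echo the script names but their bodies are invented for illustration (counting genotype calls stands in for whatever `query_agg.py` really computes).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_genotypes(records: List[str], n_chunks: int) -> List[List[str]]:
    """Scatter: deal the records round-robin into n_chunks pieces."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def query_agg(chunk: List[str]) -> Counter:
    """Per-chunk work: count how often each genotype call appears."""
    return Counter(chunk)


def merge(partials: List[Counter]) -> Counter:
    """Gather: combine the per-chunk counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


calls = ["0|0", "1|0", "1/1", "0|0", "0|1", "0/0", "1|2", "2|1", "2/2", "0|0"]
chunks = split_genotypes(calls, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(query_agg, chunks))   # one task per chunk
result = merge(partials)
```

The chunk count, the scheduling, and the merge are all explicit here, just as they are in the shell session; that bookkeeping is what a framework such as Spark takes over.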
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Alignment → Dedup → Recalibrate → QC/Filter
Alignment → Dedup → Recalibrate → QC/Filter
[Diagram: the three pipelines above laid out across Node 1, Node 2, Node 3, and Node 4]
[Diagram: Alignment → Dedup → QC/Filter chains across Node 1 to Node 4, feeding Variant Calling → Variant Annotation, with Recalibrate drawn as a separate step]
How now, brown cow?
Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model
• Improvements over existing formats
  • ~20% for BAM
  • ~90% for VCF
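The idea behind the bullets above is that one declared data model can drive the on-disk layout, rather than each tool inventing its own format. The sketch below illustrates that idea with the Python standard library only; it is not Parquet or Avro. `SCHEMA` is a made-up stand-in for an Avro schema, and JSON plus zlib stands in for a real columnar encoding.

```python
import json
import zlib
from typing import Dict, List

# Hypothetical data model: field name -> type (a stand-in for an Avro schema).
SCHEMA = {"chrom": str, "pos": int, "ref": str, "alt": str, "sample": str}


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot row records into one list per schema field, checking types."""
    columns: Dict[str, list] = {name: [] for name in SCHEMA}
    for rec in records:
        for name, typ in SCHEMA.items():
            value = rec[name]
            if not isinstance(value, typ):
                raise TypeError(f"{name} must be {typ.__name__}")
            columns[name].append(value)
    return columns


def to_rows(columns: Dict[str, list]) -> List[dict]:
    """Inverse of to_columns: rebuild row records from the column lists."""
    names = list(SCHEMA)
    return [dict(zip(names, values))
            for values in zip(*(columns[n] for n in names))]


def encode(columns: Dict[str, list]) -> bytes:
    """Serialize and compress the column-oriented layout."""
    return zlib.compress(json.dumps(columns).encode("utf-8"))


def decode(blob: bytes) -> Dict[str, list]:
    return json.loads(zlib.decompress(blob).decode("utf-8"))


records = [
    {"chrom": "20", "pos": 14370 + i, "ref": "G", "alt": "A",
     "sample": f"NA0000{1 + i % 3}"}
    for i in range(6)
]
blob = encode(to_columns(records))
restored = to_rows(decode(blob))
```

Storing each field contiguously puts similar values next to each other, which is the property columnar formats exploit for compression and for reading only the columns a query needs.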
ContEst Algorithm
YARN-managed Hadoop cluster
Spark executors (partial sums): $\sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$, one per executor
Driver (application code): $\sum_{i=1}^{N} \sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
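Taking the "Partial sums" label at face value, the diagram has each executor add up the terms P(b_ij | e_ij, f_i) for the sites it holds, and the driver add up the results. The sketch below mimics that split in plain Python without Spark; the probability values are placeholders, and `partition` and `partial_sum` are hypothetical helper names.

```python
import math
from typing import List


def partition(items: list, n: int) -> List[list]:
    """Deal the sites round-robin into n partitions, one per executor."""
    return [items[k::n] for k in range(n)]


def partial_sum(site_probs: List[List[float]]) -> float:
    """Executor side: sum the terms over reads j for every site i in the partition."""
    return sum(sum(reads) for reads in site_probs)


# probs[i][j] is a placeholder for P(b_ij | e_ij, f_i); site i has d_i reads.
probs = [[0.9, 0.8], [0.7], [0.95, 0.85, 0.75], [0.6, 0.5]]

partials = [partial_sum(p) for p in partition(probs, 3)]  # one value per executor
total = sum(partials)                                      # combined on the driver
```

Only one number per executor travels back to the driver, so the amount of data moved stays small however many reads each site has.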
Hadoop provides layered abstractions for data processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduce, Impala (SQL), Solr (search), Spark
ADAM, quince, guacamole, …
bdg-formats (Avro/Parquet)
Executing query in Hadoop: interactive Spark shell (ADAM)
// predicate helpers (isFramingham, parseJson, getPopulation and computeMAF are not shown on the slide)
def inDbSnp(g: Genotype): Boolean = ???  // true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

// load data
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")

// apply predicates
val filteredRDD = genotypesRDD
  .filter(!inDbSnp(_))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

// join data
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

// group-by, then aggregate (MAF)
val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))

// persist data
maf.saveAsNewAPIHadoopFile("path/to/output")
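The slide does not show what `computeMAF` does. As a hedged illustration of the group-by and aggregate steps, the sketch below computes a minor allele frequency per (variant, population) key in plain Python. It assumes biallelic sites and genotype strings such as "0|1"; the function name `compute_maf` and the sample data are invented for the example.

```python
import re
from collections import defaultdict
from typing import Iterable


def compute_maf(calls: Iterable[str]) -> float:
    """Minor allele frequency for a biallelic site from GT strings such as "0|1"."""
    alt = total = 0
    for call in calls:
        for allele in re.split(r"[|/]", call):
            if allele == ".":          # missing allele: leave it out of the count
                continue
            total += 1
            alt += allele != "0"       # any non-reference allele counts as alternate
    if total == 0:
        return 0.0
    freq = alt / total
    return min(freq, 1.0 - freq)


# (variant, population, genotype call); the values are illustrative only.
genotypes = [
    ("20:14370:G:A", "popA", "0|0"),
    ("20:14370:G:A", "popA", "1|0"),
    ("20:14370:G:A", "popB", "1/1"),
    ("20:14370:G:A", "popB", "1|1"),
    ("20:14370:G:A", "popB", "0|1"),
]

grouped = defaultdict(list)                        # plays the role of keyBy + groupByKey
for variant, pop, call in genotypes:
    grouped[(variant, pop)].append(call)
maf = {key: compute_maf(calls) for key, calls in grouped.items()}
```

In popB the alternate allele is the common one (5 of 6), so the reported frequency is that of the reference allele, which is why the function returns the smaller of the two values.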
Executing query in Hadoop: distributed SQL
-- "load" and join data
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)  -- aggregate (UDAF)
FROM genotypes g
INNER JOIN samples s
  ON g.sample = s.sample
INNER JOIN dnase d
  ON g.chr = d.chr
  AND g.pos >= d.start
  AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
  ON g.chr = p.chr
  AND g.pos = p.pos
  AND g.ref = p.ref
  AND g.alt = p.alt
-- apply predicates
WHERE
  s.study = "framingham" AND
  p.pos IS NULL AND
  g.polyphen IN ( "possibly damaging", "probably damaging" )
-- group-by
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
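The shape of this query can be tried on toy data with Python's built-in `sqlite3` module, which supports user-defined aggregates through `Connection.create_aggregate`. This is only a sketch: SQLite is not a distributed engine, the table contents are invented, the `Maf` class is one possible reading of the `MAF` aggregate, and the `study` and `polyphen` predicates are left out to keep the example short.

```python
import sqlite3


class Maf:
    """User-defined aggregate: minor allele frequency from GT strings like "0|1"."""

    def __init__(self):
        self.alt = 0
        self.total = 0

    def step(self, call):
        for allele in call.replace("/", "|").split("|"):
            if allele != ".":
                self.total += 1
                self.alt += allele != "0"

    def finalize(self):
        if self.total == 0:
            return None
        freq = self.alt / self.total
        return min(freq, 1.0 - freq)


conn = sqlite3.connect(":memory:")
conn.create_aggregate("MAF", 1, Maf)
conn.executescript("""
CREATE TABLE genotypes (chr TEXT, pos INT, ref TEXT, alt TEXT, sample TEXT, call TEXT);
CREATE TABLE samples (sample TEXT, pop TEXT);
CREATE TABLE dnase (chr TEXT, start INT, "end" INT);
CREATE TABLE dbsnp (chr TEXT, pos INT, ref TEXT, alt TEXT);
INSERT INTO genotypes VALUES
  ('20', 14370, 'G', 'A', 'NA00001', '0|0'),
  ('20', 14370, 'G', 'A', 'NA00002', '1|0'),
  ('20', 17330, 'T', 'A', 'NA00001', '0|0'),
  ('20', 17330, 'T', 'A', 'NA00002', '0|1'),
  ('20', 90000, 'C', 'T', 'NA00001', '1|1');
INSERT INTO samples VALUES ('NA00001', 'popA'), ('NA00002', 'popA');
INSERT INTO dnase VALUES ('20', 10000, 20000);
INSERT INTO dbsnp VALUES ('20', 14370, 'G', 'A');
""")

rows = conn.execute("""
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
FROM genotypes g
INNER JOIN samples s ON g.sample = s.sample
INNER JOIN dnase d ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d."end"
LEFT OUTER JOIN dbsnp p
  ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt
WHERE p.pos IS NULL
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
""").fetchall()
```

The variant at position 14370 is dropped because it has a match in `dbsnp`, and the one at 90000 is dropped because it falls outside the single DNase region, leaving one row for position 17330.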
Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
Core Genomics Primitives: Spatial Join
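A spatial (region) join pairs records whose genomic coordinates overlap, for example variants that fall inside DNase regions. The sketch below shows the simplest single-machine form of that operation; it is not ADAM's implementation, which has to partition the data across a cluster. The name `region_join`, the half-open interval convention, and the sample coordinates are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]      # (chrom, start, end), half-open [start, end)
Variant = Tuple[str, int]          # (chrom, pos)


def region_join(variants: List[Variant],
                regions: List[Region]) -> List[Tuple[Variant, Region]]:
    """Pair each variant with every region on its chromosome that contains it."""
    # Bucket regions by chromosome so each variant only sees candidates
    # that could possibly overlap it.
    by_chrom: Dict[str, List[Region]] = defaultdict(list)
    for region in regions:
        by_chrom[region[0]].append(region)
    joined = []
    for chrom, pos in variants:
        for region in by_chrom.get(chrom, []):
            if region[1] <= pos < region[2]:
                joined.append(((chrom, pos), region))
    return joined


variants = [("20", 14370), ("20", 17330), ("20", 1110696), ("21", 14370)]
dnase = [("20", 10000, 15000), ("20", 14000, 18000), ("21", 20000, 30000)]
pairs = region_join(variants, dnase)
```

Unlike an equality join, one variant can match several regions (position 14370 matches two here), and the keys cannot simply be hashed, which is why the join is treated as its own primitive.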
ADAM preliminary performance
Acknowledgements
UC Berkeley
Matt Massie
Frank Nothaft
Michael Heuer
Tamr
Timothy Danford
MSSM
Jeff Hammerbacher
Ryan Williams
Cloudera
Tom White
Sandy Ryza
Thank you
@laserson
laserson@cloudera.com

More Related Content

What's hot

Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Spark Summit
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Spark Summit
 
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Spark Summit
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Databricks
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
Revolution Analytics
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
Databricks
 
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
Spark Summit
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
Spark Summit
 
Lighthouse
LighthouseLighthouse
Lighthouse
Kris Peeters
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin François
Paris Data Engineers !
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Databricks
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Spark Summit
 
Monitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogMonitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadog
Seth Rosenblum
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Spark Summit
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Spark Summit
 

What's hot (20)

Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
 
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
 
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Lighthouse
LighthouseLighthouse
Lighthouse
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin François
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
 
File Context
File ContextFile Context
File Context
 
Monitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogMonitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadog
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 

Viewers also liked

Arizona General Education Curriculum (AGEC-A) with Distinction_Morgan Deann D...
Arizona General Education Curriculum (AGEC-A) with Distinction_Morgan Deann D...Arizona General Education Curriculum (AGEC-A) with Distinction_Morgan Deann D...
Arizona General Education Curriculum (AGEC-A) with Distinction_Morgan Deann D...Morgan Davis
 
Electricidad siiiiiiiiii
Electricidad siiiiiiiiiiElectricidad siiiiiiiiii
Electricidad siiiiiiiiiibedwell222
 
MDAVIS_HEALTHCARE CHANGE PROJECT
MDAVIS_HEALTHCARE CHANGE PROJECTMDAVIS_HEALTHCARE CHANGE PROJECT
MDAVIS_HEALTHCARE CHANGE PROJECTMorgan Davis
 
10分で分かるTDD
10分で分かるTDD10分で分かるTDD
10分で分かるTDDtaketi
 
MCC SNA_Active Member Certificate
MCC SNA_Active Member CertificateMCC SNA_Active Member Certificate
MCC SNA_Active Member CertificateMorgan Davis
 
How could a loving God send people to hell?
How could a loving God send people to hell?How could a loving God send people to hell?
How could a loving God send people to hell?
Bill Drewett
 
Cartel publicidad ciclo 2015
Cartel publicidad ciclo 2015Cartel publicidad ciclo 2015
Cartel publicidad ciclo 2015
marlucasfe
 
evolución de seguridad
evolución de seguridadevolución de seguridad
evolución de seguridad
Jaime Cruz
 
Tạo lịch hẹn trong khách hàng
Tạo lịch hẹn trong khách hàngTạo lịch hẹn trong khách hàng
Tạo lịch hẹn trong khách hàng
Getfly CRM
 
Big Data and Spark Streaming. Oil production sensors data monitoring
Big Data and Spark Streaming. Oil production sensors data monitoringBig Data and Spark Streaming. Oil production sensors data monitoring
Big Data and Spark Streaming. Oil production sensors data monitoring
SoftElegance
 
#12 Chiến dịch kinh doanh - Hướng dẫn sử dụng phần mềm GetFly CRM
#12 Chiến dịch kinh doanh - Hướng dẫn sử dụng phần mềm GetFly CRM #12 Chiến dịch kinh doanh - Hướng dẫn sử dụng phần mềm GetFly CRM
#12 Chiến dịch kinh doanh - Hướng dẫn sử dụng phần mềm GetFly CRM
Getfly CRM
 
Tema 5 música iberoamericana
Tema 5 música iberoamericanaTema 5 música iberoamericana
Tema 5 música iberoamericana
jopape72
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
Dataconomy Media
 
NAVITAS - cata jan 2016
NAVITAS - cata jan 2016NAVITAS - cata jan 2016
NAVITAS - cata jan 2016Khaled Nukho
 
Rozen 2016-10-05-ieee-cibcb-big-genome-data-to-share
Rozen 2016-10-05-ieee-cibcb-big-genome-data-to-shareRozen 2016-10-05-ieee-cibcb-big-genome-data-to-share
Rozen 2016-10-05-ieee-cibcb-big-genome-data-to-share
Steve Rozen
 
Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & S...
Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & S...Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & S...
Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & S...
Deborah Siegel
 
Healthcare Analytics Market Categorization
Healthcare Analytics Market CategorizationHealthcare Analytics Market Categorization
Healthcare Analytics Market Categorization
Dale Sanders
 

Viewers also liked (20)

Arizona General Education Curriculum (AGEC-A) with Distinction_Morgan Deann D...
Arizona General Education Curriculum (AGEC-A) with Distinction_Morgan Deann D...Arizona General Education Curriculum (AGEC-A) with Distinction_Morgan Deann D...
Arizona General Education Curriculum (AGEC-A) with Distinction_Morgan Deann D...
 
Electricidad siiiiiiiiii
Electricidad siiiiiiiiiiElectricidad siiiiiiiiii
Electricidad siiiiiiiiii
 
MDAVIS_HEALTHCARE CHANGE PROJECT
MDAVIS_HEALTHCARE CHANGE PROJECTMDAVIS_HEALTHCARE CHANGE PROJECT
MDAVIS_HEALTHCARE CHANGE PROJECT
 
10分で分かるTDD
10分で分かるTDD10分で分かるTDD
10分で分かるTDD
 
MCC SNA_Active Member Certificate
MCC SNA_Active Member CertificateMCC SNA_Active Member Certificate
MCC SNA_Active Member Certificate
 
Cumple coso!
Cumple coso!Cumple coso!
Cumple coso!
 
Ewrt 1 c class 29
Ewrt 1 c class 29Ewrt 1 c class 29
Ewrt 1 c class 29
 
How could a loving God send people to hell?
How could a loving God send people to hell?How could a loving God send people to hell?
How could a loving God send people to hell?
 
السيرة الذاتية
السيرة الذاتيةالسيرة الذاتية
السيرة الذاتية
 
Cartel publicidad ciclo 2015
Cartel publicidad ciclo 2015Cartel publicidad ciclo 2015
Cartel publicidad ciclo 2015
 
evolución de seguridad
evolución de seguridadevolución de seguridad
evolución de seguridad
 
Tạo lịch hẹn trong khách hàng
Tạo lịch hẹn trong khách hàngTạo lịch hẹn trong khách hàng
Tạo lịch hẹn trong khách hàng
 
Big Data and Spark Streaming. Oil production sensors data monitoring
Big Data and Spark Streaming. Oil production sensors data monitoringBig Data and Spark Streaming. Oil production sensors data monitoring
Big Data and Spark Streaming. Oil production sensors data monitoring
 
#12 Chiến dịch kinh doanh - Hướng dẫn sử dụng phần mềm GetFly CRM
#12 Chiến dịch kinh doanh - Hướng dẫn sử dụng phần mềm GetFly CRM #12 Chiến dịch kinh doanh - Hướng dẫn sử dụng phần mềm GetFly CRM
#12 Chiến dịch kinh doanh - Hướng dẫn sử dụng phần mềm GetFly CRM
 
Tema 5 música iberoamericana
Tema 5 música iberoamericanaTema 5 música iberoamericana
Tema 5 música iberoamericana
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
NAVITAS - cata jan 2016
NAVITAS - cata jan 2016NAVITAS - cata jan 2016
NAVITAS - cata jan 2016
 
Rozen 2016-10-05-ieee-cibcb-big-genome-data-to-share
Rozen 2016-10-05-ieee-cibcb-big-genome-data-to-shareRozen 2016-10-05-ieee-cibcb-big-genome-data-to-share
Rozen 2016-10-05-ieee-cibcb-big-genome-data-to-share
 
Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & S...
Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & S...Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & S...
Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & S...
 
Healthcare Analytics Market Categorization
Healthcare Analytics Market CategorizationHealthcare Analytics Market Categorization
Healthcare Analytics Market Categorization
 

Similar to DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with Hadoop and Spark

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)
Uri Laserson
 
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
Dataconomy Media
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005dflexer
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operationsgrim_radical
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with Hadoop and Spark

  • 1. © Cloudera, Inc. All rights reserved. The Redemptive Power of Hadoop. Uri Laserson | @laserson | 14 November 2015. Scaling Up Genomics with Spark.
  • 2. We come in peace. Pioneer plaque
  • 3. What is genomics?
  • 4. Organism
  • 5. Organism, Cell
  • 6. Organism, Cell, Genome
  • 9. Reference chromosome
  • 10. Reference chromosome, Location
  • 11. “… decoding the Book of Life”
  • 12. Ortelius, 1570
  • 14. Google Maps, 2015
  • 20.
      >read1
      TTGGACATTTCGGGGTCTCAGATT
      >read2
      AATGTTGTTAGAGATCCGGGATTT
      >read3
      GGATTCCCCGCCGTTTGAGAGCCT
      >read4
      AGGTTGGTACCGCGAAAAGCGCAT
  • 21.
      >read1
      TTGGACATTTCGGGGTCTCAGATT
      >read2
      AATGTTGTTAGAGATCCGGGATTT
      >read3
      GGATTCCCCGCCGTTTGAGAGCCT
      >read4
      AGGTTGGTACCGCGAAAAGCGCAT
      Bioinformatics!
  • 22.
      >read1
      TTGGACATTTCGGGGTCTCAGATT
      >read2
      AATGTTGTTAGAGATCCGGGATTT
      >read3
      GGATTCCCCGCCGTTTGAGAGCCT
      >read4
      AGGTTGGTACCGCGAAAAGCGCAT
      Bioinformatics!
  • 23. Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation. Pipelines!
  • 24.
      ##fileformat=VCFv4.1
      ##fileDate=20090805
      ##source=myImputationProgramV3.1
      ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
      ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
      ##phasing=partial
      ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
      ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
      ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
      ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
      ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
      ##FILTER=<ID=q10,Description="Quality below 10">
      ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
      ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
      ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
      20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
      Compressed text files (non-splittable); Semi-structured; Poorly specified
  • 25.
      ##fileformat=VCFv4.1
      ##fileDate=20090805
      ##source=myImputationProgramV3.1
      ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
      ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
      ##phasing=partial
      ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
      ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
      ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
      ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
      ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
      ##FILTER=<ID=q10,Description="Quality below 10">
      ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
      ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
      ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
      20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
      Compressed text files (non-splittable); Semi-structured; Poorly specified; Global sort order
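The "semi-structured" remark on the VCF slides can be made concrete with a small sketch. It is an illustration only: the class and method names (`VcfLineSketch`, `parseInfo`, `parseSample`) are hypothetical, it uses no genomics library, and it assumes tab-delimited columns as in real VCF files. It shows how the INFO column mixes key=value pairs with bare flags, and how a sample column may carry fewer fields than the FORMAT column declares (as in `0/0:41:3` above).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any VCF library.
final class VcfLineSketch {

    /** Splits an INFO column such as "NS=3;DP=14;AF=0.5;DB;H2". Bare flags map to "". */
    static Map<String, String> parseInfo(String info) {
        Map<String, String> out = new LinkedHashMap<>();
        if (info.equals(".")) {
            return out; // "." means the column is empty
        }
        for (String entry : info.split(";")) {
            int eq = entry.indexOf('=');
            if (eq < 0) {
                out.put(entry, ""); // flag such as DB or H2
            } else {
                out.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
        return out;
    }

    /** Pairs FORMAT keys with one sample's values; dropped trailing fields are simply absent. */
    static Map<String, String> parseSample(String format, String sample) {
        String[] keys = format.split(":");
        String[] values = sample.split(":");
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < keys.length && i < values.length; i++) {
            out.put(keys[i], values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Second data line from the slide, with tabs restored between columns.
        String line = "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT:GQ:DP:HQ"
                + "\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3";
        String[] cols = line.split("\t");
        System.out.println(parseInfo(cols[7]));
        System.out.println(parseSample(cols[8], cols[11]));
    }
}
```

Every consumer of the format has to repeat this kind of ad hoc string handling, which is one reason a schema-based record format is attractive for distributed processing.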
  • 26. It’s file formats all the way down!
    C; HPC (scheduler); POSIX filesystem
    Java; HPC (queue); POSIX filesystem
    C++; single-node; SQLite
  • 27. Dedup
  • 28. /**
     * Main work method. Reads the BAM file once and collects sorted information about
     * the 5' ends of both ends of each read (or just one end in the case of pairs).
     * Then makes a pass through those determining duplicates before re-reading the
     * input file and writing it out with duplication flags set correctly.
     */
    protected int doWork() {
        // build some data structures
        buildSortedReadEndLists(useBarcodes);
        generateDuplicateIndexes(useBarcodes);
        final SAMFileWriter out = new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
        final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
        while (iterator.hasNext()) {
            final SAMRecord rec = iterator.next();
            if (!rec.isSecondaryOrSupplementary()) {
                if (recordInFileIndex == nextDuplicateIndex) {
                    rec.setDuplicateReadFlag(true);
                    // Now try and figure out the next duplicate index
                    if (this.duplicateIndexes.hasNext()) {
                        nextDuplicateIndex = this.duplicateIndexes.next();
                    } else {
                        // Only happens once we've marked all the duplicates
                        nextDuplicateIndex = -1;
                    }
                } else {
    Callouts: Method; Code
  • 29. [Same doWork() listing as slide 28] Callouts: Method; Code
  • 30. @Option(shortName = "MAX_FILE_HANDLES",
            doc = "Maximum number of file handles to keep open when spilling " +
                  "read ends to disk. Set this number a little lower than the " +
                  "per-process maximum number of file that may be open. This " +
                  "number can be found by executing the 'ulimit -n' command on " +
                  "a Unix system.")
    public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
  • 31. [Same MAX_FILE_HANDLES option as slide 30] Dedup. Callouts: Method; Code; Platform
  • 32. Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
  • 33. It’s pipelines all the way down!
    Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
    Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
    Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
  • 34. It’s pipelines all the way down!
    Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
    Node 2: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
    Node 3: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
  • 35. Manually running pipelines on HPC
    $ bsub -q shared_12h python split_genotypes.py
    $ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
    $ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
    $ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
    $ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
    $ bsub -q shared_12h python merge_maf.py
  • 36. 36© Cloudera, Inc. All rights reserved.
  • 37. Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
    Alignment → Dedup → Recalibrate → QC/Filter
    Alignment → Dedup → Recalibrate → QC/Filter
  • 38. [Diagram: the pipelines from slide 37 assigned to Node 1, Node 2, Node 3, and Node 4]
  • 39. [Diagram: Node 1 to Node 4 as in slide 38, but each pipeline now reads Alignment → Dedup → QC/Filter, with a single Recalibrate step drawn separately]
  • 40. How now, brown cow?
  • 41. Why Are We Still Defining File Formats By Hand?
      • Instead of defining custom file formats for each data type and access pattern…
      • Parquet creates a compressed format for each Avro-defined data model
      • Improvements over existing formats
          • ~20% for BAM
          • ~90% for VCF
  • 42. ContEst Algorithm
    YARN-managed Hadoop cluster
    Spark executors (three shown), each computing a partial sum: $\sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
    Driver (application code), combining the partial sums: $\sum_{i=1}^{N} \sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
  • 43. 43© Cloudera, Inc. All rights reserved.
  • 44. Hadoop provides layered abstractions for data processing
    HDFS (scalable, distributed storage)
    YARN (resource management)
    MapReduce; Impala (SQL); Solr (search); Spark
    ADAM; quince; guacamole; …
    bdg-formats (Avro/Parquet)
  • 45. Executing query in Hadoop: interactive Spark shell (ADAM)
    def inDbSnp(g: Genotype): Boolean = true or false
    def isDeleterious(g: Genotype): Boolean = g.getPolyPhen
    // load data
    val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
    val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
    val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
    val genotypesRDD = sc.adamLoad("path/to/genotypes")
    // apply predicates
    val filteredRDD = genotypesRDD
      .filter(!inDbSnp(_))
      .filter(isDeleterious(_))
      .filter(isFramingham(_))
    // join data
    val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)
    // group-by, then aggregate (MAF)
    val maf = joinedRDD
      .keyBy(x => (x.getVariant, getPopulation(x)))
      .groupByKey()
      .map(computeMAF(_))
    // persist data
    maf.saveAsNewAPIHadoopFile("path/to/output")
  • 46. Executing query in Hadoop: distributed SQL
    SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)   -- aggregate (UDAF)
    FROM genotypes g                                        -- “load” and join data
    INNER JOIN samples s
      ON g.sample = s.sample
    INNER JOIN dnase d
      ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d.end
    LEFT OUTER JOIN dbsnp p
      ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt
    WHERE s.study = "framingham"                            -- apply predicates
      AND p.pos IS NULL
      AND g.polyphen IN ( "possibly damaging", "probably damaging" )
    GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop              -- group-by
  • 47. Spark + Genomics = ADAM
      • Hosted at Berkeley and the AMPLab
      • Apache 2 License
      • Contributors from both research and commercial organizations
      • Core spatial primitives, variant calling
      • Avro and Parquet for data models and file formats
  • 48. Core Genomics Primitives: Spatial Join
  • 49. ADAM preliminary performance
  • 50. 50© Cloudera, Inc. All rights reserved.
  • 51. Acknowledgements
    UC Berkeley: Matt Massie, Frank Nothaft, Michael Heuer
    Tamr: Timothy Danford
    MSSM: Jeff Hammerbacher, Ryan Williams
    Cloudera: Tom White, Sandy Ryza
  • 52. Thank you @laserson laserson@cloudera.com
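
Slide 42 shows executors computing partial sums that a driver then combines. The sketch below imitates that shape in plain Python on one machine; it is not ADAM or ContEst code. The names `toy_model`, `split`, `partial_sum`, and `total` are hypothetical, the double index (i, j) is flattened into one list of observations, and `toy_model` is only a placeholder for P(b | e, f), not the real ContEst probability model.

```python
from typing import Callable, List, Sequence, Tuple

# One observation: (base b, error rate e, allele frequency f).
Observation = Tuple[str, float, float]
Model = Callable[[str, float, float], float]


def toy_model(base: str, error: float, freq: float) -> float:
    """Placeholder for P(b | e, f); NOT the ContEst model (assumption)."""
    expected = freq if base == "ALT" else 1.0 - freq
    return (1.0 - error) * expected + error * (1.0 - expected)


def split(data: Sequence[Observation], n: int) -> List[Sequence[Observation]]:
    """Deal the observations round-robin into n partitions."""
    return [data[k::n] for k in range(n)]


def partial_sum(partition: Sequence[Observation], model: Model) -> float:
    """Work done by one 'executor': sum the model over its own partition."""
    return sum(model(b, e, f) for b, e, f in partition)


def total(data: Sequence[Observation], model: Model, n_partitions: int = 3) -> float:
    """Work done by the 'driver': combine the per-partition partial sums."""
    partials = [partial_sum(part, model) for part in split(data, n_partitions)]
    return sum(partials)
```

Because addition is associative, the partitioned total matches a single pass over the data up to floating-point rounding, which is why the partitions can be evaluated independently.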

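The VCF listing on slides 24 and 25 is called semi-structured. A minimal Python sketch of what a consumer has to do with one data line makes that concrete. The function names `parse_info` and `parse_record` are hypothetical, real VCF data lines are tab-separated (the slide shows them with spaces), and a production parser would also need the header's type declarations, which this sketch ignores.

```python
from typing import Dict, List, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Parse an INFO column such as 'NS=3;DP=14;DB' into a dict.

    Keys without '=' are flags and map to True. Values stay strings because
    their types are only declared in the ##INFO header lines.
    """
    if info == ".":
        return {}
    out: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_record(line: str) -> dict:
    """Parse one tab-separated VCF data line into nested Python structures."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
    keys = fmt.split(":")
    # Trailing per-sample fields may be omitted, so zip() truncates silently.
    samples: List[Dict[str, str]] = [
        dict(zip(keys, sample.split(":"))) for sample in fields[9:]
    ]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "filter": flt,
        "info": parse_info(info),
        "samples": samples,
    }
```

Even in this small sketch the parser needs three different delimiters (tab, semicolon, colon), flag keys with no value, and samples that silently drop trailing fields, as the third sample on the slide's second data line does.
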
Editor's Notes

  1. Before we dive in, let me ask a couple of questions: Biologists? Spark experts? Gonna tell you a lot of lies today. There are always at least three different constituencies in the room: * biologists * programmers * someone thinking about how to build a business around this Won’t satisfy everyone. Where I skip over the truth, maybe there will be at least a breadcrumb of truth left over. This will not be a very technical talk.
  2. Scared/pissed off some bio people in the past. Bioinformatics is a field with a long history, thirty or more years as a separate discipline. At the same time, the fundamental technology is changing. So if I talk about ‘problems of bioinformatics’ today, it’s OK because WE COME IN PEACE! Bioinformatics software development has been *remarkably* effective, for decades. If there are problems to be solved, these are the result of new technologies, new ambitions of scale.
  3. What even is genomics? Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference? So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
  4. Fundamentally, we’re interested in studying individuals (and populations of individuals) [ADVANCE]
  5. But each individual is actually a population: of cells [ADVANCE]
  6. But each of those cells has, ideally, an identical genome. The genome is a collection of 23 linear molecules. These are called ‘polymers,’ they’re built (like Legos) out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about. The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
  7. Without losing much, assume that our genomes are contained on just a single chromosome. Now, not only do all the cells in your body have identical genomes… [ADVANCE]
  8. But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means… [ADVANCE]
  9. That we can define a ‘base’ or a ‘reference’ chromosome. Now that there is a reference that all of us adhere to… [ADVANCE]
  10. We can define a concept of ‘location’ across chromosomes. This is possibly the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This also means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  11. Here is Bill Clinton (and Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project. Took >10 years and $2 billion. What did this actually do?
  12. 1570: Theatrum Orbis Terrarum: “Theater of the world”. First modern atlas. A direct byproduct of the first 100 years of PRINTING, and a tool for describing and exploring the world around us. Its direct descendants are still with us today!
  13. Google maps! So how is the map created/used?
  14. Anyone recognize this? Genome analogy: a text file of part of the linear sequence of ACGTs. Difficult to understand.
  15. Mapmakers work to add ANNOTATIONS to the map.
  16. And often, it’s only the annotations that are interesting, so mapmakers focus on *annotation* of the maps themselves. The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes. What does the annotated map of the genome look like?
  17. Chromosome on top. Highlighted red portion is what we’re zoomed in on. See the scale: total of about 600,000 bases (ACGTs) arranged from left to right. Multiple annotation “tracks” are overlaid on the genome sequence, marking functional elements, positions of observed human differences, similarity to other animals. In part it’s the product of numerous additional large biology annotation projects (e.g., HapMap project, 1000 Genomes, ENCODE). How are these annotations actually generated? Shift gears and talk about the technology.
  18. DNA SEQUENCING. If satellites provide images of the world for cartography, sequencers are the microscopes that give you “images” of the genome. Over the past decade, massive EXPONENTIAL increase in throughput (much faster than Moore’s law).
  19. Get sample. Extract DNA (possibly other manipulations). Dump into sequencer. Spits out text file (actually looks just like that). But how to get from the text file to an annotation track that reconstructs a genome or shows position of certain functional elements?
  20. Bioinformatics is the computational process to reconstruct the genomic information. But… [ADVANCE]
  21. Often considered simply a black box. What does it actually look like inside?
  22. Pipelines, of course. Example pipeline: raw sequencing data => a single individual’s “diff” from the reference. How are these typically structured? Each step is typically written as a standalone program, passing files from stage to stage. These are written as part of a globally distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem. What does one of these files look like?
  23. Text is highly inefficient: compresses poorly; values must be parsed. Text is semi-structured: flexible schemas make parsing difficult; difficult to make assumptions on data structure. Text poorly separates the roles of delimiters and data: requires escaping of control characters (ASCII actually includes RS 0x1E and US 0x1F, but they’re never used). But still almost always better than Excel.
  24. Imposes severe constraint: global sort invariant. => Many impls depend on this, even if it’s not necessary or conducive to distributed computing.
  25. Bioinformaticians LOVE hand-coded file formats. But only store several fundamental data types. Strong assumptions in the formats. Inconsistent implementations in multiple languages. Doesn’t allow different storage backends. OK, we discussed what the data/files are like that are passed around. What about the computation itself?
  26. Let’s take one of the transformations in the pipeline. Basically a more complex version of a DISTINCT operation.
  27. Actual code from the standard Picard implementation of MarkDuplicates. Two things should be going on: the algorithm/method overall, and the actual code implementation. Start by building some data structures from the input files. Then iterate over the file and rewrite it as necessary.
  28. But what if we jump into one of these functions. You’ll find a dependence on… [ADVANCE]
  29. An input option related to Unix file handle limits? WTF? Why should this METHOD need to know anything about the platform that it is running on? LEAKY ABSTRACTIONS
  30. Most bioinformatics tools make strong assumptions about their environments, and also the structure of the data (e.g., global sort), when it shouldn’t be necessary. Ok, but that’s not all… [ADVANCE]
  31. We’ve looked at the data and a bit of code for one of these tools. But this runs the pipeline on a single individual. But of course, it’s never one pipeline… [ADVANCE]
  32. It’s a pipeline per person! But since each pipeline runs (essentially) serially, scaling it up is easy… [ADVANCE]
  33. Scale out! Typically managed with a pretty low-level job scheduler.
  34. MANUAL split and merge. MANUAL resource request. BABYSIT for failures/errors. CUSTOM intermediate ser/de. But this basically works and the parallelism is pretty simple. This architecture has kept up with the pace of sequencing for some time now. So why am I even up here talking? Two reasons…
  35. SCALE! New levels of ambition for large biology projects. 100k genomes at Genomics England in collaboration with National Health Service. Raw data for a single individual can be in the hundreds of GB
  36. But even before we hit that huge scale (which is soon)… We don’t want to analyze each sample separately. We want to use ALL THE DATA we generate. Well, these pipelines often include lots of aggregation, perhaps we can just… [ADVANCE]
  37. Do the easy thing! Not ideal, especially as the amount of data goes up (data transfer), number of files increases (we saw file handles). May start hitting the cracks. But even worse… [ADVANCE]
  38. God help you if you want to jointly use all the data in earlier part of the pipeline.
  39. So what do we do? Two things
  40. Things like global sort order are overly restrictive and lead to algos relying on them when it’s not necessary.
  41. Example of an algo. Bioinformatics loves evaluating probabilistic models on the chromosomes. We can easily extract parallelism at different parts of our pipelines. Use higher level distributed computing primitives and let the system figure out all the platform issues for you: storage, job scheduling, fault tolerance, shuffles.
  42. Layered abstractions. Use multiple storage engines with different characteristics. Multiple execution engines. Application code/algos should only touch the top of the abstraction layer.
  43. Cheap scalable STORAGE at the bottom. Resource management in the middle. EXECUTION engines that can run your code on the cluster and provide parallelism. Consistent SERIALIZATION framework. Scientist should NOT WORRY about lower levels (coordination, file formats, storage details, fault tolerance).
  44. Another computation for a statistical aggregate on genome variant data. Details not important. Spark data flow: distributed data load, then high-level joins/spatial computations that are parallelized as necessary. But the really nice thing is that because our data is stored using the Avro data model… [ADVANCE]
  45. You can execute the exact same computation using, for example, SQL! Pick the best tool for the job.
  46. We’ve implemented this vision with Spark, starting from the AMPLab (same people that gave you Spark), in a project called ADAM. The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…
  47. In addition to some of the standard pipeline transformations, implemented the core spatial join operations (analogous to a geospatial library).
  48. Single-node performance improvements. Free scalability: fixed price, significant wall-clock improvements. See most recent SIGMOD.
  49. Not to be outdone, Craig Venter proposes 1 million genomes at Human Longevity Inc.
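
Note 26 describes duplicate marking as a more complex version of a DISTINCT operation. A minimal Python sketch of that idea groups reads by a key and flags all but one read per group. This is a deliberate simplification and not Picard's MarkDuplicates algorithm: the `Read` fields, the grouping key (chromosome, 5' position, strand), and the "keep the highest score" rule are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Read:
    """Hypothetical, pared-down read record (not the SAMRecord API)."""
    name: str
    chrom: str
    five_prime_pos: int
    strand: str
    score: int
    duplicate: bool = False


def mark_duplicates(reads: List[Read]) -> List[Read]:
    """Flag every read except the best-scoring one in each key group."""
    groups: Dict[Tuple[str, int, str], List[Read]] = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.five_prime_pos, read.strand)].append(read)
    for group in groups.values():
        # max() returns the first best-scoring read when scores tie.
        best = max(group, key=lambda r: r.score)
        for read in group:
            read.duplicate = read is not best
    return reads
```

The grouping is keyed rather than positional, so this version does not need the input to be globally sorted, and nothing in it refers to file handles or other platform details. In a distributed setting the same shape would map onto a group-by-key followed by a per-group reduction.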