"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

1© Cloudera, Inc. All rights reserved.
The Redemptive Power of Hadoop
Sean Owen | @sean_r_owen
Scaling Up Genomics with Spark

What is genomics?

Organism Cell Genome

Reference chromosome
Location

“… decoding the Book of Life”

>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
Bioinformatics!

Alignment Dedup Recalibrate QC/Filter
Variant
Calling
Variant
Annotation
Pipelines!

C
HPC (scheduler)
POSIX filesystem
Java
HPC (Queue)
POSIX filesystem
C++
Single-node
SQLite
It’s file formats all the way down!

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
Compressed text files (non-splittable)
Semi-structured
Poorly specified
Global sort order

/**
* Main work method. Reads the BAM file once and collects sorted information about
* the 5' ends of both ends of each read (or just one end in the case of pairs).
* Then makes a pass through those determining duplicates before re-reading the
* input file and writing it out with duplication flags set correctly.
*/
protected int doWork() {
// build some data structures
buildSortedReadEndLists(useBarcodes);
generateDuplicateIndexes(useBarcodes);
final SAMFileWriter out =
new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
while (iterator.hasNext()) {
final SAMRecord rec = iterator.next();
if (!rec.isSecondaryOrSupplementary()) {
if (recordInFileIndex == nextDuplicateIndex) {
rec.setDuplicateReadFlag(true);
// Now try and figure out the next duplicate index
if (this.duplicateIndexes.hasNext()) {
nextDuplicateIndex = this.duplicateIndexes.next();
} else {
// Only happens once we've marked all the duplicates
nextDuplicateIndex = -1;
}
} else {
Method
Code

@Option(shortName = "MAX_FILE_HANDLES",
doc = "Maximum number of file handles to keep open when spilling " +
"read ends to disk. Set this number a little lower than the " +
"per-process maximum number of file that may be open. This " +
"number can be found by executing the 'ulimit -n' command on " +
"a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
Dedup
Method
Code
Platform

Variant
Calling
Variant
Annotation

It’s pipelines all the way down!
Variant
Calling
Variant
Annotation
Node 1
Variant
Calling
Variant
Annotation
Node 2
Variant
Calling
Variant
Annotation
Node 3

Variant
Calling
Variant
Annotation

Node 1
Variant
Calling
Variant
Annotation
Node 2
Node 3
Node 4

Node 1
Alignment Dedup QC/Filter
Variant
Calling
Variant
Annotation
Node 2
Node 3
Node 4
Recalibrate

Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file
formats for each data type and
access pattern…
• Parquet creates a compressed
format for each Avro-defined
data model
• Improvements over existing
formats
• ~20% for BAM
• ~90% for VCF

YARN-managed
Hadoop cluster
Spark
executors
𝑗=1
𝑑 𝑖
𝑃(𝑏𝑖𝑗|𝑒𝑖𝑗, 𝑓𝑖)
𝑗=1
𝑑 𝑖
𝑗=1
𝑑 𝑖
𝑃(𝑏𝑖𝑗|𝑒𝑖𝑗, 𝑓𝑖)Partial sums
𝑖=1
𝑁
𝑗=1
𝑑 𝑖
Driver
Application
code
ContEst Algorithm

Hadoop provides layered abstractions for data processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduce Impala (SQL) Solr (search) Spark
ADAMquince guacamole …
bdg-formats(Avro/Parquet)

Executing query in Hadoop: interactive Spark shell (ADAM)
def inDbSnp(g: Genotype): Boolean = true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase”)
val genotypesRDD = sc.adamLoad("path/to/genotypes")
val filteredRDD = genotypesRDD
.filter(!inDbSnp(_))
.filter(isDeleterious(_))
.filter(isFramingham(_))
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)
val maf = joinedRDD
.keyBy(x => (x.getVariant, getPopulation(x)))
.groupByKey()
.map(computeMAF(_))
maf.saveAsNewAPIHadoopFile("path/to/output")
apply predicates
load data
join data
group-by
aggregate (MAF)
persist data

• Hosted at Berkeley and the
AMPLab
• Apache 2 License
• Contributors from both
research and commercial
organizations
• Core spatial primitives, variant
calling
• Avro and Parquet for data
models and file formats
Spark + Genomics = ADAM

Core Genomics Primitives: Spatial Join

ADAM preliminary performance

Acknowledgements
UCBerkeley
Matt Massie
Frank Nothaft
Michael Heuer
Tamr
Timothy Danford
MSSM
Jeff Hammerbacher
Ryan Williams
Cloudera
Uri Laserson
Tom White
Sandy Ryza

Thank you
@sean_r_owen
sowen@cloudera.com

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to "Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

Similar to "Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera (20)

More from Dataconomy Media

More from Dataconomy Media (20)

Recently uploaded

Recently uploaded (20)

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera

Editor's Notes