Karen Feng, Databricks
From Genomics to Medicine:
Advancing Healthcare at Scale
#UnifiedAnalytics #SparkAISummit
Genomics in the real world
• 6-year-old Nic Volker
• Intestinal inflammation
– Unknown cause
– 100+ surgeries
• Whole exome sequencing:
mutation in XIAP gene
• Cured through stem cell
transplant
https://www.washingtonpost.com/opinions/a-boys-mysterious-illness-a-bold-gamble-and-a-breakthrough-in-genetic-medicine/2016/04/20/13f20b16-e638-11e5-bc08-3e03a5b41910_story.html
https://www.researchgate.net/publication/318420329_Health_technology_assessment_of_next-generation_sequencing
XIAP protein sequence, positions 200 to 205:
Nic’s XIAP   CFCYG
Human        CFCCG
Chicken      AFCCG
Zebra fish   CFCCG
Frog         CFHCD
House fly    CVWCN
The fourth residue shown is C in every species listed and Y in Nic’s XIAP.
http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html
http://archive.jsonline.com/news/health/young-patient-faces-new-struggles-years-after-dna-sequencing-b99602505z1-336977681.html
Genomics is a big data problem
Cost of sequencing a human genome: from $2.7B to <$1,000
https://www.genome.gov/27541954/dna-sequencing-costs-data/
Projected genomic data: 40,000 petabytes / year by 2025
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Agenda
• Genomics overview
– Big data problem
– Real-world applications
– Pain points at industrial scale
• Joint genotyping
– Existing approach
– Databricks approach
• Genomics on Databricks
The power of big genomic data
Accelerate Target Discovery
Motivation: clinical trials with genomic evidence are 2x more likely to be approved by the FDA
Goal: identify a biological target (e.g., a protein) that can be mediated with a drug
Approach: large-scale regressions to correlate DNA variants and the trait
The power of big genomic data
Reduce Costs via Precision Prevention
Motivation: propose personalized lifestyle changes to decrease disease risk
Goal: calculate an individual’s disease risk
Approach: large-scale regressions to identify contributing genetic variants
The power of big genomic data
Improve Survival with Optimized Treatment
Motivation: decrease ER admissions
Goal: personalize dosage based on genetic variants
Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants
https://jamanetwork.com/journals/jama/fullarticle/2585977
http://www.bloodjournal.org/content/106/7/2329
The power of big genomic data
Accelerate Target Discovery
Reduce Costs via Precision Prevention
Improve Survival with Optimized Treatment
Genomic analysis on big data is hard!
• Existing tools are often
– Inflexible
– Single-node
– Stitched together
https://www.biostars.org/p/98582/
“GATK or VCFtools … have different chromosomal notation, one has Chr, the other does not.”
awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf
“Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime...”
vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf
https://academic.oup.com/bioinformatics/article/31/13/2202/196142
Row in a TSV file
Pipeline stages: Raw Data, Alignment, Variant Calling, Quality Control, Annotation, Analysis
Tools: BWA, Picard, plink
https://s.apache.org/existing-workflow-systems
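The awk one-liner quoted on this slide can also be written in the Scala used later in the deck. This is only a sketch: `ChromosomeNotation`, `stripChrPrefix`, and the sample lines are illustrative names and made-up data, not part of any tool mentioned here.

```scala
object ChromosomeNotation {
  // Mirrors awk's gsub(/^chr/, ""): drop a leading "chr" from a line.
  // VCF header lines start with '#', so they pass through unchanged.
  def stripChrPrefix(line: String): String =
    line.replaceFirst("^chr", "")

  def main(args: Array[String]): Unit = {
    // Made-up VCF-like lines, for illustration only.
    val lines = Seq(
      "#CHROM\tPOS\tID\tREF\tALT",
      "chr20\t17988568\t.\tC\tT",
      "20\t17988568\t.\tC\tT"
    )
    lines.map(stripChrPrefix).foreach(println)
  }
}
```

Like the awk version, this runs on one machine and has to be repeated for every tool pairing that disagrees on notation.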
DNA Variants
Position    012
Reference   ATC   Average human genome
Sample 1    AGC   Single-nucleotide polymorphism (SNP)
Confidence in variant call?
Variant calling: GATK best practices
https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145
Variant calling: joint genotyping
Joint genotyping: motivation
Improve variant calls by accumulating evidence across all samples
Based on a single sample alone, variant calls can be ambiguous: C/C, C/T, T/T?
Chrom  Pos       Ref  Alt  Qual    HG00096
20     17988568  C    T    186.77  AD:13,8 GT:C/T
Adding more samples increases confidence in the C/T variant call
Chrom  Pos       Ref  Alt  Qual    HG00096          HG00268          NA19625
20     17988568  C    T    698.90  AD:13,8 GT:C/T   AD:30,0 GT:C/C   AD:0,17 GT:T/T
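To make the AD (allelic depth: reference reads, alternate reads) and GT (genotype) fields concrete, here is a minimal sketch of how read counts can favor one diploid genotype over another. It assumes a simple binomial model with a fixed 1% per-read error rate; that model, the object name, and the function names are illustrative assumptions and not GATK's genotyping algorithm.

```scala
object GenotypeSketch {
  // Log10 likelihood of the observed read counts under each diploid genotype,
  // up to a constant shared by all three genotypes.
  // Textbook simplification with a fixed per-read error rate, not GATK's model.
  def log10Likelihoods(refReads: Int, altReads: Int,
                       errorRate: Double = 0.01): Map[String, Double] = {
    // pAlt is the chance that a single read shows the alternate allele.
    def ll(pAlt: Double): Double =
      refReads * math.log10(1 - pAlt) + altReads * math.log10(pAlt)
    Map(
      "hom-ref" -> ll(errorRate),     // alternate reads arise only from errors
      "het"     -> ll(0.5),           // each read shows either allele equally often
      "hom-alt" -> ll(1 - errorRate)  // reference reads arise only from errors
    )
  }

  // Genotype with the highest likelihood for the given read counts.
  def mostLikely(refReads: Int, altReads: Int): String =
    log10Likelihoods(refReads, altReads).maxBy(_._2)._1

  def main(args: Array[String]): Unit = {
    // AD:13,8 from the single-sample table: 13 reference reads, 8 alternate reads.
    println(mostLikely(13, 8))
  }
}
```

Joint genotyping goes further than this per-sample view by pooling evidence for the alternate allele across samples, which is why the Qual value rises when samples are added.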
DNA Variants
Position    01   2
Reference   AT---C   Average human genome
Sample 1    AG---C   Single-nucleotide polymorphism (SNP)
Sample 2    ATAAAC   Insertion
Sample 3    A----C   Deletion
Indel: an insertion or a deletion
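The three variant types above can be told apart from the reference and alternate alleles alone. The sketch below assumes VCF-style alleles, where an insertion or deletion is written together with the base before it; `VariantKind` and `classify` are illustrative names.

```scala
object VariantKind {
  // Classify a variant by comparing allele lengths.
  def classify(ref: String, alt: String): String =
    if (ref.length == 1 && alt.length == 1 && ref != alt) "SNP"
    else if (alt.length > ref.length) "insertion"
    else if (alt.length < ref.length) "deletion"
    else "other"

  def main(args: Array[String]): Unit = {
    // The three samples from the slide, written as (ref, alt) pairs.
    println(classify("T", "G"))    // Sample 1: T replaced by G
    println(classify("T", "TAAA")) // Sample 2: AAA inserted after T
    println(classify("AT", "A"))   // Sample 3: T deleted after A
  }
}
```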
Joint genotyping: motivation
Recall: TP/(TP+FN)
Precision: TP/(TP+FP)
HG002
        Recall   Precision
Indel   96.25%   98.32%
SNP     99.72%   99.40%
HG002 with HG003, HG004
        Recall   Precision
Indel   98.21%   98.98%
SNP     99.78%   99.34%
Indel miss rate (1 − recall): 3.75% → 1.79% = ~50% improvement
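The arithmetic behind the "~50% improvement" figure can be checked directly. This sketch uses the recall and precision definitions from the slide; the object and function names are illustrative.

```scala
object CallMetrics {
  // Recall: share of true variants that were called.
  def recall(tp: Long, fn: Long): Double = tp.toDouble / (tp + fn)

  // Precision: share of called variants that are true.
  def precision(tp: Long, fp: Long): Double = tp.toDouble / (tp + fp)

  // Relative reduction of an error rate, here the indel miss rate (1 - recall).
  def relativeReduction(before: Double, after: Double): Double =
    (before - after) / before

  def main(args: Array[String]): Unit = {
    val missRateAlone = 100.0 - 96.25 // HG002 alone, in percent
    val missRateTrio  = 100.0 - 98.21 // HG002 with HG003 and HG004, in percent
    val pct = relativeReduction(missRateAlone, missRateTrio) * 100
    println(f"$pct%.1f%% fewer missed indels")
  }
}
```

The reduction works out to roughly 52%, which the slide rounds to ~50%.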
Joint genotyping: motivation
What if we added even more samples?
Chrom  Pos       Ref  Alt  Qual    HG00096          HG00268          NA19625          ...
20     17988568  C    T    698.90  AD:13,8 GT:C/T   AD:30,0 GT:C/C   AD:0,17 GT:T/T   ...
Joint genotyping: GATK architecture
gVCF, gVCF (flat files) → GATK CombineGVCFs → gVCF (flat file) → GATK GenotypeGVCFs → pVCF (flat file)
• GATK CombineGVCFs: exponential runtime, N+1 problem, single node
• GATK GenotypeGVCFs: single node
Scatter-gather parallelism
https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism
Joint genotyping: GATK architecture
• Split by chromosome
– Data skew
– < 24x speedup
• Split by interval
– Hand-curated list to pass as CLI argument
• Boilerplate in Workflow Description Language
https://software.broadinstitute.org/wdl/
Scatter-gather corresponds to map-reduce: scatter is the map step, gather is the reduce step.
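The "< 24x speedup" bound for splitting by chromosome can be illustrated with a small calculation. With one task per chromosome (presumably 22 autosomes plus X and Y), the job finishes only when the largest task does, so uneven chromosome sizes pull the speedup below the task count. The work units below are invented for illustration and are not real chromosome lengths; `ScatterByChromosome` and `maxSpeedup` are illustrative names.

```scala
object ScatterByChromosome {
  // Upper bound on speedup over a serial run when each piece is one task:
  // total work divided by the largest single piece.
  def maxSpeedup(workPerPiece: Seq[Long]): Double =
    workPerPiece.sum.toDouble / workPerPiece.max

  def main(args: Array[String]): Unit = {
    // Invented, skewed work sizes for five pieces.
    val skewed = Seq(250L, 240L, 200L, 60L, 50L)
    // 24 equal pieces, the best case for a per-chromosome split.
    val even = Seq.fill(24)(100L)
    println(maxSpeedup(skewed))
    println(maxSpeedup(even))
  }
}
```

Only perfectly even pieces reach the full task count; any skew lowers the bound, which is the motivation for finer-grained splits.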
Joint genotyping: GATK architecture
Stage 1: Ingest
gVCF, gVCF (flat files) → GATK CombineGVCFs → gVCF (flat file)
Exponential runtime, N+1 problem, single node
Joint genotyping: Databricks architecture
Stage 1: Ingest
gVCF, gVCF → rows → partitionBy → Chromosome 1, bin 1 ... Chromosome 22, bin 49000
• Linear runtime
• Algorithm-aware fine-grained parallelism
• Incremental
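Joint genotyping: Databricks architecture
import spark.implicits._ // enables the $"column" syntax
// Read each gVCF file into rows
val gvcfRowsDf = spark.read
  .format("com.databricks.vcf")
  .load("dbfs:/mnt/gvcfs")
// Algorithm-aware fine-grained parallelism: assign each row to a genomic bin
// (bin is a helper that is not shown on the slide)
val binnedGvcfRowsDf = gvcfRowsDf.select($"*",
  bin($"start", $"end", 500000))
// Incremental: append to a Delta table partitioned by chromosome and bin
binnedGvcfRowsDf.write.mode("append").format("delta")
  .partitionBy("chromosome", "binId")
  .save("dbfs:/mnt/gvcf-rows")
IngestVariants.scala: 131 lines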
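The `bin` helper is not shown in the deck. One plausible reading, sketched here in plain Scala, is that it maps a row's position range to the fixed-width bins it overlaps, so that a gVCF row covering a long range can land in more than one bin. `Binning` and `binsFor` are hypothetical names, and the half-open `[start, end)` convention is an assumption.

```scala
object Binning {
  // Hypothetical stand-in for bin(start, end, binSize):
  // ids of every fixed-width bin that the half-open range [start, end) overlaps.
  def binsFor(start: Long, end: Long, binSize: Long): Vector[Long] = {
    require(binSize > 0 && end > start, "need a positive bin size and a non-empty range")
    ((start / binSize) to ((end - 1) / binSize)).toVector
  }

  def main(args: Array[String]): Unit = {
    val binSize = 500000L // same width as on the slide
    println(binsFor(0L, 500000L, binSize))      // fits entirely in the first bin
    println(binsFor(499999L, 500001L, binSize)) // straddles a bin boundary
  }
}
```

Fixed-width bins give many similarly sized partitions per chromosome, which avoids the skew of a per-chromosome split.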
Joint genotyping: GATK architecture
Stage 2: Regenotype
gVCF (flat file) → GATK GenotypeGVCFs → pVCF (flat file)
Single node
Joint genotyping: Databricks architecture
Stage 2: Regenotype
gVCF rows → mapPartitions: joint genotyping algorithm (GATK GenotypeGVCFs) → pVCF rows → pVCF
• Incremental
• Fine-grained parallelism
• Fast querying
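Joint genotyping: Databricks architecture
// Read the binned gVCF rows written by the ingest stage
val binnedGvcfRowsDf = spark.read.format("delta")
  .load("dbfs:/mnt/gvcf-rows")
// Build one genotyper per partition; genotype is assumed to take the
// partition's rows and return an iterator of pVCF rows
val pvcfRows = binnedGvcfRowsDf.mapPartitions { iter =>
  val jointGenotyper = new GatkJointGenotyper()
  jointGenotyper.genotype(iter)
}
// Write the jointly genotyped rows out as a VCF
pvcfRows.write.format("com.databricks.vcf")
  .save("dbfs:/mnt/jointly-genotyped.vcf")
JointlyCallVariants.scala: 280 lines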
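`mapPartitions` hands its function an iterator over one partition's rows and expects an iterator back, which lets expensive setup happen once per partition instead of once per row. The plain-Scala sketch below shows that shape without Spark. Everything in it is hypothetical: `SampleCall`, `Site`, and `ToyJointGenotyper` are stand-ins, and the toy genotyper only gathers calls by position, whereas a real one would recompute genotypes and qualities.

```scala
object PerPartition {
  // Hypothetical input row: one sample's call at one position.
  final case class SampleCall(position: Long, sample: String, genotype: String)
  // Hypothetical output row: every sample's genotype at one position.
  final case class Site(position: Long, genotypes: Map[String, String])

  // Toy stand-in for a joint genotyper: groups calls by position, nothing more.
  final class ToyJointGenotyper {
    def genotype(rows: Iterator[SampleCall]): Iterator[Site] =
      rows.toVector.groupBy(_.position).toSeq.sortBy(_._1).iterator.map {
        case (pos, calls) => Site(pos, calls.map(c => c.sample -> c.genotype).toMap)
      }
  }

  // Same shape as a function passed to mapPartitions: Iterator in, Iterator out,
  // with the genotyper constructed once for the whole partition.
  def processPartition(rows: Iterator[SampleCall]): Iterator[Site] = {
    val genotyper = new ToyJointGenotyper()
    genotyper.genotype(rows)
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator(
      SampleCall(2L, "sampleA", "C/T"),
      SampleCall(1L, "sampleA", "C/C"),
      SampleCall(2L, "sampleB", "T/T")
    )
    processPartition(partition).foreach(println)
  }
}
```

Because each partition is processed independently, this pattern only gives correct joint calls if all rows for a site end up in the same partition, which is what the binning in the ingest stage is for.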
Joint genotyping: scales across samples
[Figure; one annotation reads "Spot termination"]
Joint genotyping: scales across workers
[Figure]
Joint genotyping: GATK-concordant
           Recall     Precision
Site       99.9982%   99.9985%
Genotype   99.9992%   99.9988%
Call       99.9989%   99.9983%
Joint genotyping: the future
• Replace GATK joint genotyping algorithm
– Deflates rare variants
• More incremental updates
– Use Delta OPTIMIZE to deal with small files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1212-4
https://software.broadinstitute.org/gatk/documentation/article?id=7870
UAP for Genomics
BAM, VCF, CRAM, FASTQ
Dashboarding
Accelerate time to impact
Real-time visualizations
Machine Learning
Rapid Pipelines
✓ GATK4 best practices
✓ DNA, RNA, Cancer Seq
✓ Custom pipelines
✓ Joint-Genotyping
✓ Parallelize legacy tools
✓ GWAS
Scalable Tertiary Analytics
Unified Analytics Platform for Genomics
Databricks Notebooks
UAP for Genomics: DNA-seq
30x Coverage Whole Genome
Platform     Reference confidence mode   Cluster                      Runtime
Databricks   GVCF                        13 c5.9xlarge (416 cores)    39m23s
Edico        GVCF                        1 f1.2xlarge (FPGA)          2h29m
300x Coverage Whole Genome
Platform     Reference confidence mode   Cluster                      Runtime
Databricks   GVCF                        50 c5.9xlarge (1600 cores)   2h34m
https://databricks.com/blog/2018/09/10/building-the-fastest-dnaseq-pipeline-at-scale.html
UAP for Genomics: dashboarding
https://databricks.com/blog/2019/03/07/simplifying-genomics-pipelines-at-scale-with-databricks-delta.html
UAP for Healthcare and Life Sciences
Pharma Payers
Government
Providers / Diagnostics/ Suppliers
Patients
Disease Prediction
Pharmacist Alerts
Smart Text Search
MRI Imaging Analysis
Rare Variant Validation
Polygenic Risk Scoring
Claims Processing; Medicare Improvement; Provider Intelligence
Biobanking
Drug Discovery
Clinical Trial Simulation
Commercial Analytics
Sequencing Annotation
Disease Prediction
Claims Risk Analysis
Health Plan Recommendations
Learn more
Read our blog series or sign up for a private preview:
www.databricks.com/genomics

From Genomics to Medicine: Advancing Healthcare at Scale

  • 1.
    WIFI SSID:SparkAISummit |Password: UnifiedAnalytics
  • 2.
    Karen Feng, Databricks FromGenomics to Medicine: Advancing Healthcare at Scale #UnifiedAnalytics #SparkAISummit
  • 3.
    Genomics in thereal world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant 3 https://www.washingtonpost.com/opinions/a-boys-mysterious-illness-a-bold-ga mble-and-a-breakthrough-in-genetic-medicine/2016/04/20/13f20b16-e638-11e5 -bc08-3e03a5b41910_story.html
  • 4.
    Genomics in thereal world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant 4 https://www.researchgate.net/publication/318420329_Health_t echnology_assessment_of_next-generation_sequencing
  • 5.
    Genomics in thereal world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant 5 Human CFCCG 200 205 http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html
  • 6.
    Genomics in thereal world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant 6 Human Chicken Zebra fish Frog House fly CFCCG AFCCG CFCCG CFHCD CVWCN 200 205 http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html
  • 7.
    Genomics in thereal world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant 7 Nic’s XIAP Human Chicken Zebra fish Frog House fly CFCYG CFCCG AFCCG CFCCG CFHCD CVWCN 200 205 http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html
  • 8.
    Genomics in thereal world • 6-year-old Nic Volker • Intestinal inflammation – Unknown cause – 100+ surgeries • Whole exome sequencing: mutation in XIAP gene • Cured through stem cell transplant 8 http://archive.jsonline.com/news/health/young-patient-faces-new-struggles-yea rs-after-dna-sequencing-b99602505z1-336977681.html
  • 9.
    Genomics is abig data problem 9 40,000 Petabytes / year by 2025From $2.7B to <$1,000 https://www.genome.gov/27541954/dna-sequencing-costs-data/ https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  • 10.
    Agenda • Genomics overview –Big data problem – Real-world applications – Pain points at industrial scale • Joint genotyping – Existing approach – Databricks approach • Genomics on Databricks 10
  • 11.
    Agenda • Genomics overview –Big data problem – Real-world applications – Pain points at industrial scale • Joint genotyping – Existing approach – Databricks approach • Genomics on Databricks 11
  • 12.
    The power ofbig genomic data 12 Accelerate Target Discovery Motivation: clinical trials with genomic evidence are 2x more likely to be approved by the FDA Goal: identify a biological target (eg. protein) that can be mediated with a drug Approach: large-scale regressions to correlate DNA variants and the trait
  • 13.
    The power ofbig genomic data 13 Accelerate Target Discovery Motivation: clinical trials with genomic evidence are 2x more likely to be approved by the FDA Goal: identify a biological target (eg. protein) that can be mediated with a drug Approach: large-scale regressions to correlate DNA variants and the trait
  • 14.
    The power ofbig genomic data 14 Accelerate Target Discovery Motivation: clinical trials with genomic evidence are 2x more likely to be approved by the FDA Goal: identify a biological target (eg. protein) that can be mediated with a drug Approach: large-scale regressions to correlate DNA variants and the trait
  • 15.
    The power ofbig genomic data 15 Reduce Costs via Precision Prevention Motivation: propose personalized lifestyle changes to decrease disease risk Goal: calculate individual’s disease risk Approach: large-scale regressions to identify contributing genetic variants
  • 16.
    The power ofbig genomic data 16 Reduce Costs via Precision Prevention Motivation: propose personalized lifestyle changes to decrease disease risk Goal: calculate individual’s disease risk Approach: large-scale regressions to identify contributing genetic variants
  • 17.
    The power ofbig genomic data 17 Reduce Costs via Precision Prevention Motivation: propose personalized lifestyle changes to decrease disease risk Goal: calculate individual’s disease risk Approach: large-scale regressions to identify contributing genetic variants
  • 18.
    The power ofbig genomic data 18 Improve Survival with Optimized Treatment Motivation: decrease ER admissions Goal: personalize dosage based on genetic variants Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants https://jamanetwork.com/journals/jama/fullarticle/2585977
  • 19.
    The power ofbig genomic data 19 Improve Survival with Optimized Treatment Motivation: decrease ER admissions Goal: personalize dosage based on genetic variants Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants http://www.bloodjournal.org/content/106/7/2329
  • 20.
    The power ofbig genomic data 20 Improve Survival with Optimized Treatment Motivation: decrease ER admissions Goal: personalize dosage based on genetic variants Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants
  • 21.
    The power ofbig genomic data 21 Accelerate Target Discovery Reduce Costs via Precision Prevention Improve Survival with Optimized Treatment
  • 22.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 22 https://www.biostars.org/p/98582/
  • 23.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 23 https://www.biostars.org/p/98582/ “GATK or VCFtools … have different chromosomal notation, one has Chr, the other does not.”
  • 24.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 24 https://www.biostars.org/p/98582/ awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf
  • 25.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 25 https://www.biostars.org/p/98582/ “Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime...”
  • 26.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 26 vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf
  • 27.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 27 vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf https://academic.oup.com/bioinformatics/article/31/13/2202/196142
  • 28.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 28 vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf https://academic.oup.com/bioinformatics/article/31/13/2202/196142 Row in a TSV file
  • 29.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 29 Annotation Alignment Variant Calling Quality Control BWA Analysis Raw Data plink Picard
  • 30.
    Genomic analysis onbig data is hard! • Existing tools are often – Inflexible – Single-node – Stitched together 30 ... https://s.apache.org/existing-workflow-systems
  • 31.
    Agenda • Genomics overview –Big data problem – Real-world applications – Pain points at industrial scale • Joint genotyping – Existing approach – Databricks approach • Genomics on Databricks 31
  • 32.
  • 33.
    DNA Variants ATC AGC 33 Reference Sample 1 Averagehuman genome Single-nucleotide polymorphism (SNP) 012
  • 34.
    DNA Variants ATC AGC 34 Reference Sample 1 Averagehuman genome Single-nucleotide polymorphism (SNP) 012 Confidence in variant call?
  • 35.
    Variant calling: GATKbest practices 35 https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145
  • 36.
  • 37.
    Joint genotyping: motivation Improvevariant calls by accumulating evidence across all samples 37
  • 38.
    Joint genotyping: motivation Basedon a single sample alone, variant calls can be ambiguous: C/C, C/T, T/T? 38 Chrom Pos Ref Alt Qual HG00096 20 17988568 C T 186.77 AD:13,8 GT:C/T
  • 39.
    Joint genotyping: motivation Addingmore samples increases confidence in the C/T variant call 39 Chrom Pos Ref Alt Qual HG00096 HG00268 NA19625 20 17988568 C T 698.90 AD:13,8 GT:C/T AD:30,0 GT:T/T AD:0,17 GT:C/C
  • 40.
    DNA Variants ATC AGC 40 Reference Sample 1 Averagehuman genome Single-nucleotide polymorphism (SNP) 012
  • 41.
    DNA Variants AT---C AG---C ATAAAC 41 Reference Sample 1 Sample2 Average human genome Single-nucleotide polymorphism (SNP) Insertion 01 2
  • 42.
    DNA Variants AT---C AG---C ATAAAC A----C 42 Reference Sample 1 Sample2 Sample 3 Average human genome Single-nucleotide polymorphism (SNP) Insertion Deletion 01 2
  • 43.
    DNA Variants AT---C AG---C ATAAAC A----C 43 Reference Sample 1 Sample2 Sample 3 Average human genome Single-nucleotide polymorphism (SNP) Insertion Deletion Indel 01 2
  • 44.
    Joint genotyping: motivation Recall:TP/(TP+FN) Precision: TP/(TP+FP) 44 Recall Precision Indel 96.25% 98.32% SNP 99.72% 99.40% HG002
  • 45.
    Joint genotyping: motivation 3.75%→ 1.79% error rate = ~50% improvement 45 Recall Precision Indel 98.21% 98.98% SNP 99.78% 99.34% HG002 with HG003, HG004 Recall Precision Indel 96.25% 98.32% SNP 99.72% 99.40% HG002
  • 46.
    Joint genotyping: motivation Whatif we added even more samples? 46 ...Chrom Pos Ref Alt Qual HG00096 HG00268 NA19625 20 17988568 C T 698.90 AD:13,8 GT:C/T AD:30,0 GT:T/T AD:0,17 GT:C/C
  • 47.
    Joint genotyping: GATKarchitecture 47 gVCF GATK GenotypeGVCFs pVCFGATK CombineGVCFs gVCF
  • 48.
    Joint genotyping: GATKarchitecture 48 gVCF GATK GenotypeGVCFs pVCFGATK CombineGVCFs gVCF Flat file Flat fileFlat file
  • 49.
    49 gVCF GATK GenotypeGVCFs pVCFGATK CombineGVCFs Exponential runtime, N+1 problem, singlenode gVCF Joint genotyping: GATK architecture Flat file Flat fileFlat file
  • 50.
    Joint genotyping: GATKarchitecture 50 gVCF GATK GenotypeGVCFs pVCFGATK CombineGVCFs gVCF Single node Exponential runtime, N+1 problem, single node Flat file Flat fileFlat file
  • 51.
    Joint genotyping: GATKarchitecture 51 https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism scatter gather
  • 52.
    Joint genotyping: GATKarchitecture 52 • Split by chromosome – Data skew – < 24x speedup • Split by interval – Hand curated list to pass as CLI argument • Boiler-plate in Workflow Description Language
  • 53.
    Joint genotyping: GATKarchitecture 53 • Split by chromosome – Data skew – < 24x speedup • Split by interval – Hand curated list to pass as CLI argument • Boiler-plate in Workflow Description Language https://software.broadinstitute.org/wdl/
  • 54.
    Joint genotyping: GATKarchitecture 54 https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism scatter gather
  • 55.
    Joint genotyping: GATKarchitecture 55 https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism map reduce
  • 56.
    Joint genotyping: GATKarchitecture 56 gVCF GATK GenotypeGVCFs pVCFGATK CombineGVCFs gVCF Stage 1: Ingest
  • 57.
    Joint genotyping: GATKarchitecture 57 Stage 1: Ingest gVCF GATK CombineGVCFs Exponential runtime, N+1 problem, single node gVCF Flat file Flat file
  • 58.
    Joint genotyping: Databricksarchitecture 58 gVCF gVCF Rows Chromosome 1, bin 1 ... Chromosome 22, bin 49000 partitionBy Stage 1: Ingest Linear runtime
  • 59.
    Joint genotyping: Databricksarchitecture 59 gVCF gVCF Rows val gvcfRowsDf = spark.read .format(“com.databricks.vcf”) .load(“dbfs:/mnt/gvcf-files”)
  • 60.
    Joint genotyping: Databricksarchitecture 60 Algorithm-aware fine-grained parallelism partitionBy Stage 1: Ingest Linear runtime gVCF gVCF Rows Chromosome 1, bin 1 ... Chromosome 22, bin 49000
  • 61.
    Joint genotyping: Databricksarchitecture 61 val binnedGvcfRowsDf = gvcfRowsDf.select( $“*”, bin($“start”, $“end”, 500000)) gVCF Rows Chromosome 1, bin 1 ... Chromosome 22, bin 49000
  • 62.
    Joint genotyping: Databricksarchitecture 62 IncrementalpartitionBy Stage 1: Ingest Linear runtime Algorithm-aware fine-grained parallelism gVCF gVCF Rows Chromosome 1, bin 1 ... Chromosome 22, bin 49000
  • 63.
    Joint genotyping: Databricksarchitecture 63 binnedGvcfRowsDf.write.mode(“append”).format(“delta”) .partitionBy(“chromosome”, “binId”) .save(“dbfs:/mnt/gvcf-rows”) Chromosome 1, bin 1 ... Chromosome 22, bin 49000
  • 64.
    Joint genotyping: Databricksarchitecture 64 IncrementalpartitionBy Stage 1: Ingest Linear runtime Algorithm-aware fine-grained parallelism gVCF gVCF Rows Chromosome 1, bin 1 ... Chromosome 22, bin 49000
  • 65.
    Joint genotyping: Databricksarchitecture val gvcfRowsDf = spark.read .format(“com.databricks.vcf”) .load(“dbfs:/mnt/gvcfs”) val binnedGvcfRowsDf = gvcfRowsDf.select($“*”, bin($“start”, $“end”, 500000)) binnedGvcfRowsDf.write.mode(“append”).format(“delta”) .partitionBy(“chromosome”, “binId”) .save(“dbfs:/mnt/gvcf-rows”) 65
  • 66.
    Joint genotyping: Databricksarchitecture val gvcfRowsDf = spark.read .format(“com.databricks.vcf”) .load(“dbfs:/mnt/gvcfs”) val binnedGvcfRowsDf = gvcfRowsDf.select($“*”, bin($“start”, $“end”, 500000)) binnedGvcfRowsDf.write.mode(“append”).format(“delta”) .partitionBy(“chromosome”, “binId”) .save(“dbfs:/mnt/gvcf-rows”) 66 IngestVariants.scala: 131 lines
  • 67.
    Joint genotyping: GATK architecture
    Stage 2: Regenotype
    gVCF -> GATK CombineGVCFs -> gVCF -> GATK GenotypeGVCFs -> pVCF
  • 68.
    Joint genotyping: GATK architecture
    Stage 2: Regenotype
    gVCF (flat file) -> GATK GenotypeGVCFs (single node) -> pVCF (flat file)
  • 69.
    Joint genotyping: Databricks architecture
    Stage 2: Regenotype
    gVCF rows -> mapPartition: joint genotyping algorithm (GATK GenotypeGVCFs) -> pVCF rows -> pVCF
    Incremental
  • 70.
    Joint genotyping: Databricks architecture
    gVCF rows
    val binnedGvcfRowsDf = spark.read
      .format("delta")
      .load("dbfs:/mnt/gvcf-rows")
  • 71.
    Joint genotyping: Databricks architecture
    Stage 2: Regenotype
    gVCF rows -> mapPartition: joint genotyping algorithm (GATK GenotypeGVCFs) -> pVCF rows -> pVCF
    Incremental
    Fine-grained parallelism
  • 72.
    Joint genotyping: Databricks architecture
    gVCF rows -> joint genotyping algorithm (GATK GenotypeGVCFs) -> pVCF rows
    val pvcfRows = binnedGvcfRowsDf.mapPartitions { iter =>
      val jointGenotyper = new GatkJointGenotyper()
      iter.flatMap { jointGenotyper.genotype(iter) }
    }
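The slide's snippet is abbreviated, and `GatkJointGenotyper` is not shown in the deck. The plain-Scala sketch below illustrates only the per-partition pattern: one genotyper is created per partition and consumes that partition's row iterator. Everything here is a stand-in under assumptions: `GvcfRow`, `PvcfRow`, `ToyJointGenotyper` and `genotypePartition` are hypothetical, and the toy genotyper merely collects per-sample calls by site instead of running the GATK algorithm.

```scala
object PerPartitionGenotyping {
  // Hypothetical, simplified row types; real gVCF/pVCF rows carry far more fields.
  final case class GvcfRow(sample: String, chromosome: String, start: Long, genotype: String)
  final case class PvcfRow(chromosome: String, start: Long, genotypes: Map[String, String])

  // Stand-in for the joint genotyper: it only gathers the per-sample calls
  // at each site and performs no statistical regenotyping.
  final class ToyJointGenotyper {
    def genotype(rows: Iterator[GvcfRow]): Iterator[PvcfRow] =
      rows.toSeq
        .groupBy(r => (r.chromosome, r.start))
        .toSeq
        .sortBy { case ((chrom, start), _) => (chrom, start) }
        .map { case ((chrom, start), siteRows) =>
          PvcfRow(chrom, start, siteRows.map(r => r.sample -> r.genotype).toMap)
        }
        .iterator
  }

  // Same shape as a function passed to mapPartitions: the genotyper is built
  // once per partition and receives the whole partition iterator.
  def genotypePartition(iter: Iterator[GvcfRow]): Iterator[PvcfRow] = {
    val jointGenotyper = new ToyJointGenotyper()
    jointGenotyper.genotype(iter)
  }
}
```

Because each partition holds all rows for one chromosome bin, a function of this shape can run on every bin independently, which is where the fine-grained parallelism comes from.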
  • 73.
    Joint genotyping: Databricks architecture
    Stage 2: Regenotype
    gVCF rows -> mapPartition: joint genotyping algorithm (GATK GenotypeGVCFs) -> pVCF rows -> pVCF
    Incremental
    Fine-grained parallelism
    Fast querying
  • 74.
    Joint genotyping: Databricks architecture
    pVCF rows -> pVCF
    pvcfRows.write
      .format("com.databricks.vcf")
      .save("dbfs:/mnt/jointly-genotyped.vcf")
  • 77.
    Joint genotyping: Databricks architecture
    val binnedGvcfRowsDf = spark.read.format("delta")
      .load("dbfs:/mnt/gvcf-rows")
    val pvcfRows = binnedGvcfRowsDf.mapPartitions { iter =>
      val jointGenotyper = new GatkJointGenotyper()
      iter.flatMap { jointGenotyper.genotype(iter) }
    }
    pvcfRows.write.format("com.databricks.vcf")
      .save("dbfs:/mnt/jointly-genotyped.vcf")
    JointlyCallVariants.scala: 280 lines
  • 78.
    Joint genotyping: scales across samples
  • 79.
    Joint genotyping: scales across samples
    Spot termination
  • 80.
    Joint genotyping: scales across workers
  • 81.
    Joint genotyping: GATK-concordant
              Recall     Precision
    Site      99.9982%   99.9985%
    Genotype  99.9992%   99.9988%
    Call      99.9989%   99.9983%
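The deck does not say how these concordance figures were computed. The sketch below uses the standard definitions of precision and recall, assuming the GATK output is treated as the baseline and each site, genotype or call is counted as a true positive, false positive or false negative against it. The object name `Concordance` is hypothetical, and no counts from the actual evaluation are implied.

```scala
object Concordance {
  /** Share of the test pipeline's records that the baseline also contains. */
  def precision(truePositives: Long, falsePositives: Long): Double = {
    require(truePositives + falsePositives > 0, "no records reported")
    truePositives.toDouble / (truePositives + falsePositives)
  }

  /** Share of the baseline's records that the test pipeline recovered. */
  def recall(truePositives: Long, falseNegatives: Long): Double = {
    require(truePositives + falseNegatives > 0, "no records in baseline")
    truePositives.toDouble / (truePositives + falseNegatives)
  }
}
```

The same two functions would apply at each of the three levels in the table; only what counts as a matching record changes.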
  • 82.
    Joint genotyping: the future
    • Replace GATK joint genotyping algo
    – Deflates rare variants
    • More incremental updates
    – Use Delta optimize to deal with small files
    https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1212-4
  • 83.
    Joint genotyping: the future
    • Replace GATK joint genotyping algo
    – Deflates rare variants
    • More incremental updates
    – Use Delta optimize to deal with small files
    (Figure label: GATK algo)
  • 84.
    Joint genotyping: the future
    • Replace GATK joint genotyping algo
    – Deflates rare variants
    • More incremental updates
    – Use Delta optimize to deal with small files
    https://software.broadinstitute.org/gatk/documentation/article?id=7870
  • 85.
    Agenda
    • Genomics overview
    – Big data problem
    – Real-world applications
    – Pain points at industrial scale
    • Joint genotyping
    – Existing approach
    – Databricks approach
    • Genomics on Databricks
  • 86.
    UAP for Genomics: Unified Analytics Platform for Genomics
    Accelerate time to impact
    BAM, VCF, CRAM, FASTQ
    Dashboarding; real-time visualizations; machine learning; Databricks Notebooks
    Rapid pipelines; scalable tertiary analytics
    ✓ GATK4 best practices
    ✓ DNA, RNA, cancer seq
    ✓ Custom pipelines
    ✓ Joint genotyping
    ✓ Parallelize legacy tools
    ✓ GWAS
  • 87.
    UAP for Genomics: DNA-seq
    30x coverage whole genome
    Platform    Reference confidence mode  Cluster                     Runtime
    Databricks  GVCF                       13 c5.9xlarge (416 cores)   39m23s
    Edico       GVCF                       1 f1.2xlarge (FPGA)         2h29m
    300x coverage whole genome
    Platform    Reference confidence mode  Cluster                     Runtime
    Databricks  GVCF                       50 c5.9xlarge (1600 cores)  2h34m
    https://databricks.com/blog/2018/09/10/building-the-fastest-dnaseq-pipeline-at-scale.html
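As a reading aid for the 30x figures on this slide, the sketch below turns the two runtimes into a wall-clock ratio and a core-hours figure. It uses only the numbers shown on the slide. The object `RuntimeComparison` and its helpers are hypothetical, and the comparison ignores hardware differences (CPU cores versus FPGA) and cost.

```scala
object RuntimeComparison {
  def seconds(hours: Int, minutes: Int, secs: Int): Int =
    hours * 3600 + minutes * 60 + secs

  /** How many times faster the candidate finished than the baseline (wall clock). */
  def speedup(baselineSeconds: Int, candidateSeconds: Int): Double =
    baselineSeconds.toDouble / candidateSeconds

  /** Total core time consumed by a cluster run. */
  def coreHours(cores: Int, runtimeSeconds: Int): Double =
    cores * runtimeSeconds / 3600.0

  // Runtimes from the 30x coverage table on the slide.
  val databricks30x: Int = seconds(0, 39, 23)
  val edico30x: Int = seconds(2, 29, 0)
}
```

With these inputs, `speedup(edico30x, databricks30x)` comes out at roughly 3.8x in wall-clock time, achieved with a much larger cluster.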
  • 88.
    UAP for Genomics: dashboarding
    https://databricks.com/blog/2019/03/07/simplifying-genomics-pipelines-at-scale-with-databricks-delta.html
  • 89.
    UAP for Healthcare and Life Sciences
    Segments: Pharma; Payers; Government; Providers / Diagnostics / Suppliers; Patients
    Use cases: Disease Prediction; Pharmacist Alerts; Smart Text Search; MRI Imaging Analysis; Rare Variant Validation; Polygenic Risk Scoring; Claims Processing; Medicare Improvement; Provider Intelligence; Biobanking; Drug Discovery; Clinical Trial Simulation; Commercial Analytics; Sequencing Annotation; Disease Prediction; Claims Risk Analysis; Health Plan Recommendations
  • 90.
    Learn more
    Read our blog series or sign up for a private preview: www.databricks.com/genomics