From Genomics to Medicine: Advancing Healthcare at Scale

Karen Feng, Databricks
#UnifiedAnalytics #SparkAISummit

Genomics in the real world
• 6-year-old Nic Volker
• Intestinal inflammation
– Unknown cause
– 100+ surgeries
• Whole exome sequencing: mutation in XIAP gene
• Cured through stem cell transplant
https://www.washingtonpost.com/opinions/a-boys-mysterious-illness-a-bold-gamble-and-a-breakthrough-in-genetic-medicine/2016/04/20/13f20b16-e638-11e5-bc08-3e03a5b41910_story.html

https://www.researchgate.net/publication/318420329_Health_technology_assessment_of_next-generation_sequencing

XIAP sequence, positions 200 to 205
Human       CFCCG
http://cbm.msoe.edu/markMyweb/genomicJmols/xiap.html


XIAP sequence, positions 200 to 205, across species
Nic’s XIAP  CFCYG
Human       CFCCG
Chicken     AFCCG
Zebra fish  CFCCG
Frog        CFHCD
House fly   CVWCN

http://archive.jsonline.com/news/health/young-patient-faces-new-struggles-years-after-dna-sequencing-b99602505z1-336977681.html

Genomics is a big data problem
• Sequencing cost: from $2.7B to <$1,000
https://www.genome.gov/27541954/dna-sequencing-costs-data/
• Data volume: 40,000 Petabytes / year by 2025
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Agenda
• Genomics overview
– Big data problem
– Real-world applications
– Pain points at industrial scale
• Joint genotyping
– Existing approach
– Databricks approach
• Genomics on Databricks


The power of big genomic data
Accelerate Target Discovery
• Motivation: clinical trials with genomic evidence are 2x more likely to be approved by the FDA
• Goal: identify a biological target (e.g. protein) that can be mediated with a drug
• Approach: large-scale regressions to correlate DNA variants and the trait


Reduce Costs via Precision Prevention
• Motivation: propose personalized lifestyle changes to decrease disease risk
• Goal: calculate individual’s disease risk
• Approach: large-scale regressions to identify contributing genetic variants


Improve Survival with Optimized Treatment
• Motivation: decrease ER admissions
• Goal: personalize dosage based on genetic variants
• Approach: large-scale regressions between ineffective/effective/toxic dosages and genetic variants
https://jamanetwork.com/journals/jama/fullarticle/2585977

http://www.bloodjournal.org/content/106/7/2329

• Accelerate Target Discovery
• Reduce Costs via Precision Prevention
• Improve Survival with Optimized Treatment

Genomic analysis on big data is hard!
• Existing tools are often
– Inflexible
– Single-node
– Stitched together
https://www.biostars.org/p/98582/

“GATK or VCFtools … have different chromosomal notation, one has Chr, the other does not.”

awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf
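The same prefix fix can be sketched in Scala. This is a minimal illustration and not code from the talk: `ChromNotation` and `stripChrPrefix` are hypothetical names. Like the awk one-liner, it only touches the start of data lines, so header lines (including any `##contig` entries that still say `chr`) pass through unchanged.

```scala
// Hypothetical helper, not from the talk: normalizes chromosome notation in VCF text lines.
object ChromNotation {
  def stripChrPrefix(line: String): String =
    if (line.startsWith("#")) line       // header lines are left as they are
    else line.replaceFirst("^chr", "")   // CHROM is the first tab-separated column

  def main(args: Array[String]): Unit = {
    val lines = Seq("#CHROM\tPOS\tREF\tALT", "chr20\t17988568\tC\tT")
    lines.map(stripChrPrefix).foreach(println)
  }
}
```

A one-off helper like this has to be repeated for every pair of tools that disagree on notation.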

“Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime...”

vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf

https://academic.oup.com/bioinformatics/article/31/13/2202/196142

Row in a TSV file

[Figure: pipeline stitched together from separate tools. Stage labels: Raw Data, Alignment, Quality Control, Variant Calling, Annotation, Analysis. Tool labels: BWA, Picard, plink.]

...
https://s.apache.org/existing-workflow-systems


DNA Variants
Position    012
Reference   ATC   Average human genome
Sample 1    AGC   Single-nucleotide polymorphism (SNP)
Confidence in variant call?

Variant calling: GATK best practices
https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145

Variant calling: joint genotyping

Joint genotyping: motivation
Improve variant calls by accumulating evidence across all samples

Based on a single sample alone, variant calls can be ambiguous: C/C, C/T, T/T?
Chrom  Pos       Ref  Alt  Qual    HG00096
20     17988568  C    T    186.77  AD:13,8 GT:C/T
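To make the ambiguity concrete: in VCF, the AD field lists read depth for the reference allele first and the alternate allele second, so AD:13,8 means 13 reads support C and 8 support T. The sketch below only computes the alternate-allele fraction from that field. It is an illustration with hypothetical names (`AllelicDepth`, `altFraction`) and is not how GATK scores genotypes.

```scala
// Hypothetical helper, not from the talk: reads an "AD:ref,alt" field from the slide's table.
object AllelicDepth {
  // Parses a field such as "AD:13,8" into (referenceReads, alternateReads).
  def parse(field: String): (Int, Int) = {
    val Array(ref, alt) = field.stripPrefix("AD:").split(",").map(_.trim.toInt)
    (ref, alt)
  }

  // Fraction of reads that support the alternate allele.
  def altFraction(field: String): Double = {
    val (ref, alt) = parse(field)
    alt.toDouble / (ref + alt)
  }

  def main(args: Array[String]): Unit = {
    val fraction = altFraction("AD:13,8")
    println(f"alternate allele fraction = $fraction%.2f")
  }
}
```

A clean C/C, C/T, or T/T sample would sit near 0.0, 0.5, or 1.0. Here 8 of 21 reads is in between, which is why evidence from one sample alone leaves room for doubt.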

Adding more samples increases confidence in the C/T variant call
Chrom  Pos       Ref  Alt  Qual    HG00096          HG00268          NA19625
20     17988568  C    T    698.90  AD:13,8 GT:C/T   AD:30,0 GT:T/T   AD:0,17 GT:C/C

DNA Variants
Position    01   2
Reference   AT---C
Sample 1    AG---C   SNP
Sample 2    ATAAAC   Insertion
Sample 3    A----C   Deletion
Insertion or deletion: Indel
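The three variant kinds above can be told apart from allele lengths alone. The sketch below is a minimal illustration, not code from the talk: `VariantKind.classify` is a hypothetical helper, and it assumes VCF-style alleles in which an indel keeps one anchoring base (so Sample 2 is T to TAAA and Sample 3 is AT to A).

```scala
// Hypothetical helper, not from the talk: classifies a variant by reference and alternate allele length.
object VariantKind {
  def classify(ref: String, alt: String): String =
    if (ref.length == 1 && alt.length == 1) "SNP"
    else if (alt.length > ref.length) "insertion"
    else if (alt.length < ref.length) "deletion"
    else "other" // e.g. multi-base substitutions, outside this sketch

  def main(args: Array[String]): Unit = {
    // Alleles for Sample 1, Sample 2, and Sample 3 from the alignment above.
    val variants = Seq(("T", "G"), ("T", "TAAA"), ("AT", "A"))
    variants.foreach { case (ref, alt) => println(s"$ref>$alt: ${classify(ref, alt)}") }
  }
}
```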

Recall: TP/(TP+FN)
Precision: TP/(TP+FP)
HG002   Recall   Precision
Indel   96.25%   98.32%
SNP     99.72%   99.40%
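As a worked example of the two formulas, the sketch below computes recall and precision from true-positive, false-positive, and false-negative counts. The counts in `main` are made up for illustration and are not the HG002 numbers; `CallMetrics` is a hypothetical name.

```scala
// Hypothetical helper, not from the talk: the slide's two formulas as functions.
object CallMetrics {
  // Recall: share of true variants that were called.
  def recall(tp: Long, fn: Long): Double = tp.toDouble / (tp + fn)

  // Precision: share of called variants that are true.
  def precision(tp: Long, fp: Long): Double = tp.toDouble / (tp + fp)

  def main(args: Array[String]): Unit = {
    // Illustrative counts only, not taken from the benchmark.
    val (tp, fp, fn) = (90L, 10L, 30L)
    println(s"recall = ${recall(tp, fn)}, precision = ${precision(tp, fp)}")
  }
}
```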

Indel recall error rate: 3.75% → 1.79% = ~50% improvement
HG002                     Recall   Precision
Indel                     96.25%   98.32%
SNP                       99.72%   99.40%
HG002 with HG003, HG004   Recall   Precision
Indel                     98.21%   98.98%
SNP                       99.78%   99.34%
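The ~50% figure follows from the indel recall values: the error rate is taken as 100% minus recall, and the improvement is the relative drop between the two rates, (3.75 - 1.79) / 3.75, or about 52%. A small sketch of that arithmetic, with hypothetical names:

```scala
// Hypothetical helper, not from the talk: reproduces the slide's error-rate arithmetic.
object ErrorRateImprovement {
  // Error rate in percent, taken here as 100 minus recall.
  def errorRate(recallPct: Double): Double = 100.0 - recallPct

  // Relative reduction between two error rates, as a fraction of the starting rate.
  def relativeReduction(before: Double, after: Double): Double = (before - after) / before

  def main(args: Array[String]): Unit = {
    val before = errorRate(96.25) // indel recall, HG002 alone
    val after = errorRate(98.21)  // indel recall, HG002 with HG003, HG004
    val reduction = relativeReduction(before, after)
    println(f"$before%.2f%% -> $after%.2f%%, relative reduction ${reduction * 100}%.1f%%")
  }
}
```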

What if we added even more samples?
Chrom  Pos       Ref  Alt  Qual    HG00096          HG00268          NA19625          ...
20     17988568  C    T    698.90  AD:13,8 GT:C/T   AD:30,0 GT:T/T   AD:0,17 GT:C/C

Joint genotyping: GATK architecture
gVCF (flat file) → GATK CombineGVCFs → gVCF (flat file) → GATK GenotypeGVCFs → pVCF (flat file)
• GATK CombineGVCFs: exponential runtime, N+1 problem, single node
• GATK GenotypeGVCFs: single node

Scatter-gather
https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism
• Split by chromosome
– Data skew
– < 24x speedup
• Split by interval
– Hand-curated list to pass as CLI argument
• Boilerplate in Workflow Description Language
https://software.broadinstitute.org/wdl/
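Why splitting by chromosome stays below the ideal speedup can be sketched with simple arithmetic. If each shard runs as one task, wall-clock time is at least the time for the largest shard, so speedup is bounded by total work divided by the largest shard. The bound equals the number of shards only when all shards are the same size, which chromosomes are not. The shard sizes below are made-up relative units, and `ScatterSpeedup` is a hypothetical name.

```scala
// Hypothetical helper, not from the talk: speedup bound for one-task-per-shard scatter-gather.
object ScatterSpeedup {
  // Upper bound on speedup: total work divided by the largest shard.
  def maxSpeedup(shardSizes: Seq[Double]): Double = shardSizes.sum / shardSizes.max

  def main(args: Array[String]): Unit = {
    val skewed = Seq(8.0, 6.0, 4.0, 2.0)   // illustrative sizes, not real chromosome lengths
    val balanced = Seq(5.0, 5.0, 5.0, 5.0) // same total work, evenly split
    println(s"skewed shards:   at most ${maxSpeedup(skewed)}x")
    println(s"balanced shards: at most ${maxSpeedup(balanced)}x")
  }
}
```

Finer, evenly sized units of work are what close the gap between the two cases.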

scatter gather → map reduce

Stage 1: Ingest (GATK)
gVCF (flat file) → GATK CombineGVCFs → gVCF (flat file)
• GATK CombineGVCFs: N+1 problem, single node

Joint genotyping: Databricks architecture
Stage 1: Ingest
gVCFs → Rows → partitionBy → Chromosome 1, bin 1 ... Chromosome 22, bin 49000
• Linear runtime
gVCFs → Rows
// Read every gVCF under the mount point into a DataFrame of variant rows
val gvcfRowsDf = spark.read
  .format("com.databricks.vcf")
  .load("dbfs:/mnt/gvcf-files")

Stage 1: Ingest (partitionBy)
• Linear runtime
• Algorithm-aware fine-grained parallelism

Rows → Chromosome 1, bin 1 ...
// The $"col" syntax needs import spark.implicits._; bin(...) is not defined on the slide
val binnedGvcfRowsDf = gvcfRowsDf.select(
  $"*", bin($"start", $"end", 500000))
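The slide does not show how `bin` works. One plausible reading is that a row covering a genomic interval is assigned to every fixed-width bin it overlaps, so each bin holds all the rows it needs. The sketch below illustrates that idea only. `GenomicBins.binIds` is a hypothetical helper, and it assumes 0-based, half-open intervals [start, end) with end greater than start, which the talk does not state.

```scala
// Hypothetical helper, not the talk's bin function: fixed-width binning of a genomic interval.
object GenomicBins {
  // Ids of every bin of width binSize that the interval [start, end) overlaps.
  def binIds(start: Long, end: Long, binSize: Long): List[Long] =
    ((start / binSize) to ((end - 1) / binSize)).toList

  def main(args: Array[String]): Unit = {
    val binSize = 500000L
    // A row that straddles a bin boundary lands in both neighbouring bins.
    println(binIds(499990L, 500010L, binSize))
    // A row inside one bin lands only there.
    println(binIds(17988568L, 17988569L, binSize))
  }
}
```

Under this reading, a row that crosses a boundary is duplicated into both bins, which is what lets each bin be processed without looking at its neighbours.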

Stage 1: Ingest (partitionBy)
• Linear runtime
• Algorithm-aware
• Incremental

Rows → Chromosome 1, bin 1 ...
// Append mode lets new samples be added later; one Delta partition per chromosome and bin
binnedGvcfRowsDf.write.mode("append").format("delta")
  .partitionBy("chromosome", "binId")
  .save("dbfs:/mnt/gvcf-rows")

IngestVariants.scala: 131 lines

Stage 2: Regenotype (GATK)
gVCF (flat file) → GATK GenotypeGVCFs → pVCF (flat file)
• GATK GenotypeGVCFs: single node

Stage 2: Regenotype (Databricks)
gVCF Rows → mapPartition: joint genotyping algorithm (GATK GenotypeGVCFs) → pVCF Rows → pVCF
• Incremental

gVCF Rows
// Load the binned rows written by the ingest stage
val binnedGvcfRowsDf = spark.read
  .format("delta")
  .load("dbfs:/mnt/gvcf-rows")

Stage 2: Regenotype (mapPartition)
• Incremental
• Fine-grained parallelism

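gVCF Rows → pVCF Rows
// One joint genotyper per partition
val pvcfRows = binnedGvcfRowsDf.mapPartitions { iter =>
  val jointGenotyper = new GatkJointGenotyper()
  // Assumption: genotype takes the partition's gVCF rows and returns an iterator of pVCF rows
  jointGenotyper.genotype(iter)
}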
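The per-partition pattern can be illustrated without a cluster. In the sketch below, plain Scala collections stand in for Spark partitions, and a toy function stands in for the joint genotyper: it only gathers every sample's call at a site into one output row. None of these names (`PerPartitionSketch`, `GvcfRow`, `PvcfRow`, `genotypePartition`) come from the talk, and the toy function does no statistical genotyping.

```scala
// Hypothetical sketch, not from the talk: per-partition processing with plain collections.
object PerPartitionSketch {
  final case class GvcfRow(pos: Long, sample: String, genotype: String)
  final case class PvcfRow(pos: Long, genotypes: Map[String, String])

  // Toy stand-in for a joint genotyper: one output row per site, holding all samples' calls.
  def genotypePartition(rows: Iterator[GvcfRow]): Iterator[PvcfRow] =
    rows.toList
      .groupBy(_.pos)
      .toSeq
      .sortBy(_._1)
      .iterator
      .map { case (pos, site) => PvcfRow(pos, site.map(r => r.sample -> r.genotype).toMap) }

  def main(args: Array[String]): Unit = {
    // Two partitions, standing in for two (chromosome, bin) groups.
    val partitions = Seq(
      Seq(GvcfRow(100L, "s1", "C/T"), GvcfRow(100L, "s2", "C/C")),
      Seq(GvcfRow(900L, "s1", "T/T"))
    )
    // Each partition is handled on its own, as mapPartitions would do on a cluster.
    val pvcf = partitions.flatMap(p => genotypePartition(p.iterator))
    pvcf.foreach(println)
  }
}
```

Because every row for a site sits in the same partition, no partition needs data from another, which is what makes the fine-grained parallelism possible.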

Stage 2: Regenotype (mapPartition)
• Incremental
• Fast querying

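pVCF Rows → pVCF
// Write the jointly genotyped rows back out through the VCF data source
pvcfRows.write
  .format("com.databricks.vcf")
  .save("dbfs:/mnt/jointly-genotyped.vcf")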

JointlyCallVariants.scala: 280 lines

Joint genotyping: scales across samples
[Chart not recovered; annotation: Spot termination]

Joint genotyping: scales across workers
[Chart not recovered]

Joint genotyping: GATK-concordant
           Recall     Precision
Site       99.9982%   99.9985%
Genotype   99.9992%   99.9988%
Call       99.9989%   99.9983%

Joint genotyping: the future
• Replace GATK joint genotyping algo
– Deflates rare variants
• More incremental updates
– Use Delta optimize to deal with small files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1212-4

GATK algo: https://software.broadinstitute.org/gatk/documentation/article?id=7870


UAP for Genomics
Unified Analytics Platform for Genomics
• Data formats: BAM, VCF, CRAM, Fastq
• Rapid Pipelines and Scalable Tertiary Analytics
✓ GATK4 best practices
✓ DNA, RNA, Cancer Seq
✓ Custom pipelines
✓ Joint-Genotyping
✓ Parallelize legacy tools
✓ GWAS
• Machine Learning
• Dashboarding and real-time visualizations
• Databricks Notebooks
Accelerate time to impact

UAP for Genomics: DNA-seq
30x Coverage Whole Genome
Platform     Reference confidence mode   Cluster                      Runtime
Databricks   GVCF                        13 c5.9xlarge (416 cores)    39m23s
Edico        GVCF                        1 f1.2xlarge (FPGA)          2h29m
300x Coverage Whole Genome
Platform     Reference confidence mode   Cluster                      Runtime
Databricks   GVCF                        50 c5.9xlarge (1600 cores)   2h34m
https://databricks.com/blog/2018/09/10/building-the-fastest-dnaseq-pipeline-at-scale.html

UAP for Genomics: dashboarding
https://databricks.com/blog/2019/03/07/simplifying-genomics-pipelines-at-scale-with-databricks-delta.html

UAP for Healthcare and Life Sciences
Segments: Pharma; Payers; Government; Providers / Diagnostics / Suppliers; Patients
Use cases:
• Disease Prediction
• Pharmacist Alerts
• Smart Text Search
• MRI Imaging Analysis
• Rare Variant Validation
• Polygenic Risk Scoring
• Claims Processing; Medicare Improvement; Provider Intelligence
• Biobanking
• Drug Discovery
• Clinical Trial Simulation
• Commercial Analytics
• Sequencing Annotation
• Claims Risk Analysis
• Health Plan Recommendations

Learn more
Read our blog series or sign up for a private preview:
www.databricks.com/genomics
