Spark Accelerated Genomics
Processing
Accelerating Personal Genomics with Apache Spark
Kartik Thakore
Agenda
1. Intro to Genomics
2. Genomics processing
3. Traditional GATK Pipelines
4. Doc.ai App Ready
Intro to (Personal) Genomics
● DNA code
○ DNA sequence
■ sequence of A, C, T or G
○ Human Genome
■ 3 billion bases (letters)
● SNP (single nucleotide polymorphism)
○ Represents a difference in a single DNA
building block (called a nucleotide)
○ 4-5 million SNPs in a person’s genome
○ More than 100 million SNPs across
populations
● Analyzing and interpreting the genome of a single
individual
● Defining unique genetic characteristics
○ Non-medically relevant traits
(taste preference, exercise, behavior, etc)
○ Ancestry information
○ Understand risk for various diseases
○ Carrier status
○ Disease diagnosis
○ Response to certain treatments & drugs
(pharmacogenomics)
Intro to (Personal) Genomics
Obtaining personal genomic information
● From biological sample to raw data
● “Decoding” the DNA sequence
○ Technological revolution
○ Multiple different platforms (and providers)
○ Two technologies
■ Genotyping (23andme, Ancestry)
■ Sequencing (full genome (WGS), exome)
Obtaining personal genomic information
Genotyping
● Looks at specific known variants of interest in the DNA
● Technology: SNP arrays/chips
○ Efficient and cost-effective (~$50-$100)
○ Straightforward analysis
● BUT:
○ Requires prior identification of variants of interest
○ Limited information, no novel information
● Examples: 23andme, Ancestry
Sequencing
● Reads whole sequences
● Technology: Next-Generation Sequencing (NGS)
○ More data and context: total information content
○ No prior knowledge needed: discover variation you
don’t know about beforehand
○ Full picture: deeper discovery about the genetic
underpinnings (rare diseases, ...)
● BUT:
○ Some information is not useful (much of the human
genome does not vary between individuals, so it is
redundant)
○ Big datasets → more elaborate analysis and storage
○ A bit more expensive (~$1,000 range)
■ Still, significant advances in technology
keep decreasing the time and labor costs
Doc.ai - genotyping SNPs Processing MVP
[Diagram] User → genetic testing → download raw data → sign up in the Doc.ai app → upload raw data → data analysis pipeline: ancestry analysis and variant annotation (dbSNP, ClinVar, PolyPhen, ...) → data interpretation
WGS Variant Calling: Traditional GATK
GATK Best practices
GATK Pipeline Data Engineering
● Setup of tools
○ Usually involves downloading several
tools
● Pipeline runs
○ Manually executing scripts
○ Manual verification of results
● Reproducibility and reliability
○ Particularly difficult
○ Need to hire specialist to run
● Future proofing
○ Updating data sets manually
○ Updating tools manually
● Can be automated with significant work
GATK Pipeline User Experience
● File verification against real-world data
○ Pipeline assumes VCFs are valid
● Job failures are not trackable
● Development and dockerization is hard
● Scaling also becomes difficult
Product and Design Considerations
● Transparent Processing
○ Actual progress
● Feedback as fast as possible
○ File verification
● Job run results and failure inspectability
● How the heck do we do this with an
App?
● Jobs must be able to handle large
variations in file sizes
● Data store cannot lose files
○ High friction for the user to upload again
● Ability to painlessly scale
Doc.ai Genomics Pipeline: Ingestion
[Diagram] Mobile App → VCFs uploaded (TLS terminated) → Doc.ai Unimatrix: stored on GCS and verified → Pipeline
1. Variant Call Formats from 23andMe and Ancestry.com are
uploaded via the mobile app
2. Files are verified by the Unimatrix (Borg shout-out :D )
a. This is done without processing the file!
3. If the files are invalid, we can tell the user right away
4. Additionally, if the files are valid, we can terminate the TLS
connection, which allows the user to move on
5. The UX is significantly smoother
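The "verify without processing" step above can be sketched as header-only validation: inspect just the leading comment lines and the first data row, never the whole file. This is a hypothetical Python sketch, not Doc.ai's actual Unimatrix code; the column layout follows the 23andMe raw-data format (rsid, chromosome, position, genotype), and the accepted genotype characters are an assumption.

```python
def verify_raw_genotype_file(lines):
    """Return (ok, reason) after checking only the leading lines of a raw file."""
    first_data = None
    for line in lines:
        if line.startswith("#"):   # header comments, e.g. "# rsid chromosome ..."
            continue
        first_data = line
        break
    if first_data is None:
        return False, "no data rows"
    fields = first_data.rstrip("\n").split("\t")
    if len(fields) != 4:
        return False, f"expected 4 tab-separated columns, got {len(fields)}"
    rsid, chrom, pos, genotype = fields
    if not pos.isdigit():
        return False, "position is not an integer"
    # A/C/G/T calls, "--" no-calls, D/I indel codes are the expected alphabet
    if any(ch not in "ACGTDI-." for ch in genotype):
        return False, f"unexpected genotype characters: {genotype!r}"
    return True, "ok"
```

Because only the first data row is touched, an invalid upload can be rejected immediately, which is what makes the early user feedback in step 3 cheap.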
Doc.ai - VCF to Dataframe
val PARQUET_FILE_NAME = INPUT_FILE.stripSuffix(".txt") + ".parquet"

val schema = StructType(
  List(
    StructField("rsid", StringType, false),
    StructField("chromosome", StringType, false),
    StructField("position", IntegerType, false),
    StructField("genotype", StringType, false)
  )
)

// Skip header comments, split on tabs, and build typed rows
val sc_file = sc.textFile(INPUT_FILE)
val rdd = sc_file
  .filter(line => !line.startsWith("#"))
  .map(line => line.split("\t"))
  .map(entries => Row(entries(0), entries(1), entries(2).toInt, entries(3)))

val df = spark.createDataFrame(rdd, schema)

// Normalize no-call genotypes ("--" -> "..") and sort by chromosome, position
val updated_df = df
  .withColumn("genotype",
    when(col("genotype").equalTo("--"), "..").otherwise(col("genotype")))
  .sort(asc("chromosome"), asc("position"))
Doc.ai Genomics Pipeline: Processing
[Diagram] Doc.ai Unimatrix (Google Cloud, files stored on GCS) → Databricks Jobs REST API trigger → Genomics / Ancestry Analysis jobs (AWS Cloud)
● Ancestry analysis is a notebook as
a job
● Parameters and file locations are
sent via the Job REST API
● Results are stored back on GCS
● Results are then transformed and
prepared for mobile application
(Mongo)
● Progress from job runs (run times) is
also periodically sent back
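Triggering the notebook as a job boils down to one call to the Databricks Jobs REST API (`run-now`). The sketch below only builds the request; the job id, parameter names, and GCS paths are illustrative, not Doc.ai's actual configuration.

```python
import json

def build_run_now_request(host, token, job_id, vcf_uri, results_uri):
    """Build the URL, headers, and JSON body for a Jobs API 2.0 run-now call."""
    url = f"{host}/api/2.0/jobs/run-now"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "job_id": job_id,
        "notebook_params": {       # surfaced to the notebook as widget values
            "input_vcf": vcf_uri,
            "results_path": results_uri,
        },
    })
    return url, headers, body
```

The returned triple can be POSTed with any HTTP client; the response carries a `run_id` that can be polled via `/api/2.0/jobs/runs/get`, which is how run progress and completion can be reported back to the app.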
Doc.ai - Reference Mapping
// Reference data is used that helps map ancestry to an individual's SNP rsids
val df = spark.read
  .parquet("doc_ai_ancestry_frequencies_50_hgdp_pca_filtered_v3.parquet")
  .createTempView("ANCESTRY_FREQS")

// A SNP hashing lookup table prepares the user's SNPs to join with ANCESTRY_FREQS
val ancestryDF = spark.sql("select population_freq, snps, … from JOINED_USER_POPULATION_DF")

// Finally, results are verified by a suite of assertions to detect potential issues
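Once user genotypes are joined against per-population allele frequencies, one standard way to score ancestry is a per-population log-likelihood under Hardy-Weinberg proportions. This Python sketch is a simplified stand-in for that scoring step, not the deck's actual notebook logic; the population names and frequencies in the test are made up.

```python
import math

def ancestry_log_likelihoods(user_genotypes, pop_freqs):
    """user_genotypes: {rsid: count of reference allele (0, 1, or 2)}
    pop_freqs: {population: {rsid: reference-allele frequency}}
    Returns {population: log-likelihood of the user's genotypes}."""
    scores = {}
    for pop, freqs in pop_freqs.items():
        ll = 0.0
        for rsid, k in user_genotypes.items():
            p = freqs.get(rsid)
            if p is None:
                continue                        # SNP absent from reference panel
            p = min(max(p, 1e-6), 1 - 1e-6)     # clamp to avoid log(0)
            if k == 2:
                ll += 2 * math.log(p)           # homozygous reference: p^2
            elif k == 1:
                ll += math.log(2 * p * (1 - p)) # heterozygous: 2p(1-p)
            else:
                ll += 2 * math.log(1 - p)       # homozygous alternate: (1-p)^2
        scores[pop] = ll
    return scores
```

The best-matching population is then `max(scores, key=scores.get)`, and assertion suites like the one mentioned above can sanity-check that the scores are finite and cover every reference population.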
Doc.ai - What it looks and feels like
What’s next
https://databricks.com/blog/2018/09/10/building-the-fastest-dnaseq-pipeline-at-scale.html
● Testing the limits of our approach
● Askew VCFs
● Genotyped Samples over Sequenced
● Helps account for issues with vendor-
specific VCFs

More Related Content

Similar to Accelerating Genomics SNPs Processing and Interpretation with Apache Spark

Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
Ian Foster
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
Ian Foster
 
Accelerate pharmaceutical r&d with mongo db
Accelerate pharmaceutical r&d with mongo dbAccelerate pharmaceutical r&d with mongo db
Accelerate pharmaceutical r&d with mongo db
MongoDB
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
Li Shen
 
Accelerate Pharmaceutical R&D with Big Data and MongoDB
Accelerate Pharmaceutical R&D with Big Data and MongoDBAccelerate Pharmaceutical R&D with Big Data and MongoDB
Accelerate Pharmaceutical R&D with Big Data and MongoDB
MongoDB
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
Ian Foster
 
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant scienceVenice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
GigaScience, BGI Hong Kong
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQL
Databricks
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
David Ruau
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
ARPUTHA SELVARAJ A
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
Andy Petrella
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
HARSHITHA EBBALI
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
Ian Foster
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
Guy Coates
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
Paul Groth
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
Andy Petrella
 
ProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easyProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easy
Juan Antonio Vizcaino
 
2015 bioinformatics databases_wim_vancriekinge
2015 bioinformatics databases_wim_vancriekinge2015 bioinformatics databases_wim_vancriekinge
2015 bioinformatics databases_wim_vancriekinge
Prof. Wim Van Criekinge
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
Ian Foster
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit
 

Similar to Accelerating Genomics SNPs Processing and Interpretation with Apache Spark (20)

Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
Accelerate pharmaceutical r&d with mongo db
Accelerate pharmaceutical r&d with mongo dbAccelerate pharmaceutical r&d with mongo db
Accelerate pharmaceutical r&d with mongo db
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Accelerate Pharmaceutical R&D with Big Data and MongoDB
Accelerate Pharmaceutical R&D with Big Data and MongoDBAccelerate Pharmaceutical R&D with Big Data and MongoDB
Accelerate Pharmaceutical R&D with Big Data and MongoDB
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant scienceVenice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQL
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
ProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easyProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easy
 
2015 bioinformatics databases_wim_vancriekinge
2015 bioinformatics databases_wim_vancriekinge2015 bioinformatics databases_wim_vancriekinge
2015 bioinformatics databases_wim_vancriekinge
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 

Recently uploaded (20)

Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 

Accelerating Genomics SNPs Processing and Interpretation with Apache Spark

  • 1. Spark Accelerated Genomics Processing Accelerating Personal Genomics with Apache Spark Kartik Thakore
  • 2. Agenda 1. Intro to Genomics 2. Genomics processing 3. Traditional GATK Pipelines 4. Doc.ai App Ready
  • 3. Intro to (Personal) Genomics ● DNA code ○ DNA sequence ■ sequence of A, C, T or G ○ Human Genome ■ 3 billion bases (letters) ● SNP (single nucleotide polymorphisms) ○ Represents a difference in a single DNA building block (called a nucleotide) ○ 4-5 million SNPs in a person’s Genome ○ More than 100 million SNPs in populations
  • 4. ● Analyzing and interpreting the genome of a single individual ● Defining unique genetic characteristics ○ Non-medically relevant traits (taste preference, exercise, behavior, etc) ○ Ancestry information ○ Understand risk for various diseases ○ Carrier status ○ Disease diagnosis ○ Response to certain treatments & drugs (pharmacogenomics) Intro to (Personal) Genomics
  • 5. Obtaining personal genomic information ● From biological sample to raw data ● “Decoding” the DNA sequence ○ Technological revolution ○ Multiple different platforms (and providers) ○ Two technologies ■ Genotyping (23andme, Ancestry) ■ Sequencing (full genome (WGS), exome)
  • 6. Obtaining personal genomic information Genotyping ● Looks at specific, interesting known variants in the DNA ● Technology: SNP arrays/chips ○ Efficient and cost-effective (~$50-$100) ○ Straightforward analysis ● BUT: ○ Requires prior identification of variants of interest ○ Limited information, no novel information ● Examples: 23andme, Ancestry Sequencing ● Reads whole sequences ● Technology: Next-Generation Sequencing (NGS) ○ More data and context: total information content ○ No prior knowledge needed: discover variation you don’t know about beforehand ○ Full picture: deeper discovery about the genetic underpinnings (rare diseases, ...) ● BUT: ○ Some information is not useful (much of the human genome does not vary between individuals, so it is redundant) ○ Big datasets → more elaborate analysis and storage ○ A bit more expensive (~$1,000 range) ■ Though advances in technology continue to decrease the time and labor costs
  • 7. Doc.ai - genotyping SNPs Processing MVP User Genetic testing Download raw data Sign up Doc.ai app Upload raw data Ancestry analysis Variant annotation dbSNP ClinVar Polyphen ... Data Interpretation Data analysis pipeline
  • 8. WGS Variant Calling: Traditional GATK GATK Best practices
  • 9. GATK Pipeline Data Engineering ● Setup of tools ○ Usually involves downloading several tools ● Pipeline runs ○ Manually executing scripts ○ Manual verification of results ● Reproducibility and reliability ○ Particularly difficult ○ Need to hire specialist to run ● Future proofing ○ Updating data sets manually ○ Updating tools manually ● Can be automated with significant work
  • 10. GATK Pipeline User Experience ● File verification with real-world data ○ The pipeline assumes VCFs are valid ● Job failures are not trackable ● Development and dockerization are hard ● Scaling also becomes difficult
  • 11. Product and Design Considerations ● Transparent Processing ○ Actual progress ● As fast as possible feedback ○ File verification ● Job run results and failure inspectability ● How the heck do we do this with an App? ● Jobs must be able to handle large variations in file sizes ● Data store cannot lose files ○ High Friction for the User to upload again ● Ability to painlessly scale
  • 12. Doc.ai Genomics Pipeline: Ingestion Doc.ai Unimatrix Pipeline VCFs Mobile App uploaded Stored and verified TLS terminated Stored on GCS 1. Variant Call Format (VCF) files from 23andMe and Ancestry.com are uploaded via the mobile app 2. Files are verified by the Unimatrix (Borg shout-out :D ) a. This is done without processing the whole file! 3. If a file is invalid, we can tell the user right away 4. If a file is valid, we can also terminate the TLS connection, which allows the user to move on 5. The UX is significantly smoother
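The ingestion slide's key trick is verifying an upload without processing the whole file: inspecting only the first few lines is enough to decide whether it looks like a 23andMe/Ancestry raw-data export. A minimal Python sketch of that idea; the function name and line limit are our own assumptions, since the deck does not show Unimatrix internals:

```python
# Sketch: cheap structural verification of an uploaded genotype file.
# Only the first few lines are read, so invalid files can be rejected
# immediately without streaming or parsing the full upload.

import io

def looks_like_raw_genotype_file(stream, max_lines: int = 20) -> bool:
    """Return True if the first data line has the expected 4-column shape."""
    for i, line in enumerate(stream):
        if i >= max_lines:
            break
        if line.startswith("#") or not line.strip():
            continue  # skip the comment/blank header block
        fields = line.rstrip("\n").split("\t")
        # rsid, chromosome, numeric position, genotype
        return len(fields) == 4 and fields[2].isdigit()
    return False

ok = looks_like_raw_genotype_file(io.StringIO("# header\nrs1\t1\t100\tAA\n"))
```

Because the check touches only the stream's head, the server can fail fast and give the user immediate feedback, which is the smoother UX the slide describes.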
  • 13. Doc.ai - VCF to Dataframe

val PARQUET_FILE_NAME = INPUT_FILE.stripSuffix(".txt") + ".parquet"

// Schema for the four tab-separated columns of a raw genotype file
val schema = StructType(
  List(
    StructField("rsid", StringType, false),
    StructField("chromosome", StringType, false),
    StructField("position", IntegerType, false),
    StructField("genotype", StringType, false)
  )
)

val sc_file = sc.textFile(INPUT_FILE)

// Drop header comments, split on tabs, and coerce position to Int
val rdd = sc_file
  .filter(line => !line.startsWith("#"))
  .map(line => line.split("\t"))
  .map(entries => Row(entries(0), entries(1), entries(2).toInt, entries(3)))

val df = spark.createDataFrame(rdd, schema)

// Normalize no-call genotypes ("--" -> "..") and sort by chromosome, position
val updated_df = df
  .withColumn("genotype", when(col("genotype").equalTo("--"), "..").otherwise(col("genotype")))
  .sort(asc("chromosome"), asc("position"))
  • 14. Doc.ai Genomics Pipeline: Processing Doc.ai Unimatrix Pipeline Stored on GCS Databricks Job REST API DB Jobs trigger Genomics Ancestry Analysis Google Cloud AWS Cloud Progress and job completion signalled back ● Ancestry analysis is a notebook run as a job ● Parameters and file locations are sent via the Jobs REST API ● Results are stored back on GCS ● Results are then transformed and prepared for the mobile application (Mongo) ● Progress from job runs (times) is also periodically sent
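Triggering a notebook job with parameters, as the slide describes, goes through the Databricks Jobs REST API's run-now endpoint. A hedged Python sketch that assembles (but does not send) such a request; the host, job id, and parameter name `input_path` are placeholders, not doc.ai's actual values:

```python
# Sketch: building a Databricks Jobs "run-now" request that passes a
# GCS file location to a notebook job as a notebook parameter.
# Nothing is sent over the network here; the request is only assembled.

import json

def build_run_now_request(host: str, job_id: int, vcf_gcs_path: str):
    """Assemble the URL and JSON body for POST /api/2.0/jobs/run-now."""
    url = f"{host}/api/2.0/jobs/run-now"
    body = {
        "job_id": job_id,
        # notebook_params are exposed to the notebook as widget values
        "notebook_params": {"input_path": vcf_gcs_path},
    }
    return url, json.dumps(body)

url, body = build_run_now_request(
    "https://example.cloud.databricks.com", 123, "gs://bucket/user-snps.txt"
)
```

In practice the POST would carry an authorization token, and the returned `run_id` is what you would poll to report the per-job progress the slide mentions.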
  • 15. Doc.ai - Reference Mapping

// Reference data is used that helps map ancestry to an individual's SNP rsids
spark.read.parquet("doc_ai_ancestry_frequencies_50_hgdp_pca_filtered_v3.parquet")
  .createTempView("ANCESTRY_FREQS")

// A SNP hashing lookup table is used to prepare the user's SNPs to join with ANCESTRY_FREQS
val ancestryDF = spark.sql("select population_freq, snps, … from JOINED_USER_POPULATION_DF")

// Finally, results are verified by a suite of assertions to detect potential issues
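The reference-mapping step boils down to joining a user's SNPs against a table of per-population allele frequencies keyed by rsid. A minimal in-memory Python sketch of that join (the real pipeline does it in Spark SQL over Parquet); the rsids, population codes, and frequency values below are invented for illustration:

```python
# Sketch: join user SNPs with an ancestry-frequency reference by rsid.
# SNPs absent from the reference are dropped; matched SNPs carry their
# per-population frequency record forward for downstream ancestry scoring.

def join_user_snps(user_snps, ancestry_freqs):
    """Keep user SNPs present in the reference, attaching their frequencies."""
    return [
        {**snp, "population_freq": ancestry_freqs[snp["rsid"]]}
        for snp in user_snps
        if snp["rsid"] in ancestry_freqs
    ]

joined = join_user_snps(
    [{"rsid": "rs1", "genotype": "AA"}, {"rsid": "rs9", "genotype": "CT"}],
    {"rs1": {"EUR": 0.42, "EAS": 0.13}},  # hypothetical reference rows
)
```

The suite of assertions the slide mentions would then sanity-check the joined result, e.g. that a non-trivial fraction of the user's SNPs matched the reference.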
  • 16. Doc.ai - What it looks and feels like
  • 17. What’s next https://databricks.com/blog/2018/09/10/building-the-fastest-dnaseq-pipeline-at-scale.html ● Testing the limits of our approach ● Handling askew VCFs ● Genotyped samples over sequenced ones ● Helps account for issues with vendor-specific VCFs