High-throughput Genomics at Your Fingertips with Apache Spark
Erwin Datema
Roeland van Ham
Spark Summit EU 2016, October 26, Brussels
Overview
High-throughput Genomics at Your Fingertips with Apache Spark
• Disclaimer: I am a scientist in computational biology (bioinformatics)
• I am not a computer scientist
• I am not a data scientist
• Scope
• KeyGene’s journey into Spark to analyze genomics data
• Goal: enable interactive genomics data processing and querying
• Told from a user’s perspective
• Contents
• Introduction to KeyGene
• Crash Course Genomics
• Big Data Challenges
The crop innovation company 2
Global trends
World population grows from 7 to 9 billion people by 2050
Climate change
Limited/bad land, water and fossil fuels
More obese people
Malnourished people
Agricultural challenges
YIELD: Producing more food on less land
• Population 3.0 billion: 4.3 hectares arable land per person
• Population 4.4 billion: 3.0 hectares arable land per person
• Population 6.0 billion: 2.2 hectares arable land per person
• Population 7.5 billion: 1.8 hectares arable land per person
Genetic improvement of crops
Molecular breeding
Molecular mutagenesis
Our strategy:
Use of natural genetic
variation in crops
Not GM:
At this moment too
costly (20-100 mil €)
Regulatory, societal
and technical hurdles
About KeyGene
Working for the future of global agriculture
Founded in 1989
Current shareholders
Go-to Ag Biotech company for higher crop yield & quality
R&D strategy
Stages: Fundamental research → Developing technologies & traits → Applying molecular breeding of crops → Breeding → Seed products → Market
Actors: Universities, KeyGene, Partners breeding industry
Big Data in Genomics
Relevance
image source: https://www.genome.gov/images/content/costpermb2015_4.jpg datasource: https://www.ncbi.nlm.nih.gov/genbank/statistics/
• Genomic data is being produced on an unprecedented scale
• The cost to sequence a genome is now a few thousand dollars
• We can routinely sequence tens to hundreds of individual plants
Total data volume (# bases) in NCBI GenBank
Plant Genomics
Different from human genomics!
Human genomics versus plant genomics:
• one genome vs. many genomes
• genomics as a diagnostic tool vs. genomics as a tool to direct breeding
• “simple” genetics vs. complex genetics
• high quality, shared resources vs. variable quality, fragmented resources
• understand and cure diseases vs. improve crop yield and quality
Crash Course Genomics
Plant genome sizes
• Onion: ~16 Gb, diploid
• Wheat: ~5.5 Gb, hexaploid
• Pepper: ~3.5 Gb, diploid
• Potato: ~850 Mb, tetraploid
• Melon: ~450 Mb, diploid
Mb = megabases
Gb = gigabases
diploid = two sets of chromosomes
tetraploid = four sets of chromosomes
Crash Course Genomics
DNA, chromosomes, nucleotides
• DNA consists of four different elements (nucleotides or bases)
• We represent DNA as strings of characters from the alphabet A C G T
• DNA is organized into chromosomes
• Each chromosome contains millions of nucleotides (characters)
• We can only ‘read’ short pieces of DNA (hundreds to thousands of nucleotides)
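The string representation above can be sketched in a few lines of plain Scala. This is only an illustration: the object name `DnaStrings`, the helper `toyReads`, and the fixed-length sliding windows are assumptions made for the example, and real sequencing reads are sampled very differently.

```scala
object DnaStrings {
  // The four-letter DNA alphabet named on the slide.
  val Alphabet: Set[Char] = Set('A', 'C', 'G', 'T')

  // True when every character of the string is one of A, C, G, T.
  def isDna(s: String): Boolean = s.forall(Alphabet.contains)

  // Cut a long sequence into short overlapping pieces.
  // A toy stand-in for "reading" short pieces of DNA (hypothetical helper).
  def toyReads(sequence: String, readLength: Int, step: Int): Seq[String] =
    sequence.sliding(readLength, step).toSeq

  def main(args: Array[String]): Unit = {
    val chromosome = "GATTTGGGGTTCAAAGCAGT"
    println(isDna(chromosome))
    toyReads(chromosome, 8, 4).foreach(println)
  }
}
```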
Crash Course Genomics
Genes and traits
drought tolerance fruit size
genome
genes
• Genes are (some of) the functional elements in the genome
• We represent genes as an interval on a chromosome
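A gene as an interval on a chromosome might be modelled as below. This is a minimal sketch: the case class `Gene`, the 1-based closed coordinates, and the gene names in the example are assumptions for illustration, not data from the talk.

```scala
object GeneIntervals {
  // A gene as a closed interval [start, end] on a named chromosome.
  // 1-based inclusive coordinates are an assumption of this sketch.
  final case class Gene(name: String, chromosome: String, start: Long, end: Long) {
    def covers(chrom: String, position: Long): Boolean =
      chrom == chromosome && position >= start && position <= end
  }

  // Names of all genes whose interval covers the given position.
  def genesAt(genes: Seq[Gene], chrom: String, position: Long): Seq[String] =
    genes.filter(_.covers(chrom, position)).map(_.name)

  def main(args: Array[String]): Unit = {
    // Hypothetical genes, named after the traits on the slide.
    val genes = Seq(
      Gene("droughtTolerance", "chr1", 1000L, 2500L),
      Gene("fruitSize", "chr1", 4000L, 5200L)
    )
    println(genesAt(genes, "chr1", 4100L))
  }
}
```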
Crash Course Genomics
Polymorphisms and their impact
large fruits
small fruits
no impact
Crash Course Genomics
Read alignment and variant calling
• The ‘reference genome’ represents the known sequence of a species
• Reads are aligned against the reference genome (string similarity search)
• Complex: sequence variation, repetitive regions and data errors
• Variants are called from differences observed in ‘pile-ups’ of reads
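The idea of calling variants from a pile-up can be shown with a toy example. The sketch below assumes reads that are already aligned without gaps and lie entirely within the reference, and it uses a simple read-count threshold (`minSupport`). All names here are hypothetical, and production variant callers use statistical models rather than a plain threshold.

```scala
object PileupSketch {
  // A read already placed on the reference: 0-based start, no gaps (assumption).
  final case class AlignedRead(start: Int, bases: String)
  final case class Variant(position: Int, ref: Char, alt: Char, support: Int, depth: Int)

  // For every reference position, count how often each base was observed.
  def pileup(reads: Seq[AlignedRead]): Map[Int, Map[Char, Int]] =
    reads
      .flatMap(r => r.bases.zipWithIndex.map { case (b, i) => (r.start + i, b) })
      .groupBy(_._1)
      .map { case (pos, obs) => pos -> obs.groupBy(_._2).map { case (b, xs) => b -> xs.size } }

  // Report each non-reference base seen in at least minSupport reads.
  def callVariants(reference: String, reads: Seq[AlignedRead], minSupport: Int): Seq[Variant] = {
    val calls = for {
      (pos, counts) <- pileup(reads).toSeq
      ref = reference.charAt(pos)
      (alt, support) <- counts.toSeq
      if alt != ref && support >= minSupport
    } yield Variant(pos, ref, alt, support, counts.values.sum)
    calls.sortBy(v => (v.position, v.alt))
  }

  def main(args: Array[String]): Unit = {
    val reference = "ACGTACGT"
    val reads = Seq(
      AlignedRead(0, "ACGAAC"),   // A instead of T at position 3
      AlignedRead(2, "GAACGT"),   // A instead of T at position 3
      AlignedRead(1, "CGTACG"),   // matches the reference
      AlignedRead(3, "AACGT"),    // A instead of T at position 3
      AlignedRead(0, "ACGTACTT")  // single mismatch at position 6, likely an error
    )
    callVariants(reference, reads, minSupport = 2).foreach(println)
  }
}
```

With `minSupport = 2`, the mismatch seen in three reads at position 3 is reported, while the single mismatch at position 6 is treated as noise.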
Crash Course Genomics
Population-scale genome sequencing
[Figure: a 5 gigabase genome with its genes; below it, 1 billion reads each for individual 1, individual 2, … individual n]
• Align a billion reads x 1,000 individuals to a 5 Gb genome
• Call hundreds of millions (up to potentially billions) of sequence variants
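A quick back-of-the-envelope calculation makes the scale concrete. The read counts, population size, and genome size come from the slide. The read length of 150 bases is an assumption chosen for illustration, as the slides only say that reads span hundreds to thousands of nucleotides.

```scala
object ScaleEstimate {
  // Total number of reads across the whole population.
  def totalReads(readsPerIndividual: Long, individuals: Long): Long =
    readsPerIndividual * individuals

  // Average number of times each genome position is covered by reads of one individual.
  def coverage(reads: Long, readLength: Long, genomeSize: Long): Double =
    reads.toDouble * readLength / genomeSize

  def main(args: Array[String]): Unit = {
    val readsPerIndividual = 1000000000L // 1 billion reads (from the slide)
    val individuals = 1000L              // from the slide
    val genomeSize = 5000000000L         // 5 Gb (from the slide)
    val assumedReadLength = 150L         // assumption, not stated in the talk

    println(totalReads(readsPerIndividual, individuals))
    println(coverage(readsPerIndividual, assumedReadLength, genomeSize))
  }
}
```

Under these assumptions the population yields 10^12 reads, and each individual is sequenced to roughly 30-fold coverage.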
Crash Course Genomics
Recap
• Genome sequences are represented as strings of A C G T
• Variation between genome sequences underlies differences in traits
• We can “read” the genome in little pieces
• High throughput, massively parallel sequencing technologies
• Up to thousands of individual plants from a given species
• Genomics data analysis is challenging
• Rapid increase in data generation
• Rapid turnover of sequencing technologies and their outputs
• Scientific software is (often) bad (Nature News, Oct 13 2010)
Genomics data analysis
High Performance Computing
• Computational challenges
• Align billions of reads to the reference genome
• Call millions or billions of sequence variants
• Determine the small number of variants that impact a given trait
• HPC infrastructure (e.g. SGE clusters) is the de facto standard
• Manually split large datasets (to accommodate the job scheduler)
• Manually deal with failures: check logs, resubmit jobs…
• Many software tools are in fact large, monolithic “pipelines”
• No fine-grained control over resource usage
• A single error often implies a complete re-run of the analysis
High Performance Computing
Big Data technologies
Conventional Compute Cluster
• Submit host and compute nodes (compute cluster), with a separate storage server
• Expensive, proprietary storage
• Expensive network connections
• Expensive, high-reliability hardware
Spark Cluster
• Admin / name node and nodes that each combine compute + storage
• Commodity hardware
• Linear scalability
• Fault tolerance
Genomics application landscape
Opportunities for Big Data technologies
[Figure: where should a compute tool run? Conventional Compute Cluster; High Memory Hardware; Accelerated (GPU / FPGA); Spark (and Hadoop)]
Big Data in Genomics
Challenges
DNA sequencer → reads (FASTQ) → alignments (SAM, BAM) → variants (VCF)
• Genomics is a dynamic, rapidly changing field
• Data generators and analysis algorithms are in constant flux
• Tools are generally built around flat, text-based file formats
• Workflow is file-centric (POSIX file system; no streaming…)
Big Data in Genomics
Our ‘Sparkified’ pipeline
Stages: read alignment → processing → variant calling
Tools: Scala, BWA, GATK4 Mark Duplicates, ADAM, Guacamole
Data: FASTQ (HDFS) → SAM (HDFS) → BAM (HDFS) → VCF (local)
Genomics on Spark
Solutions to legacy designs
DNA sequencer → reads (FASTQ)
A FASTQ read is a four-line record:
identifier: @SEQ_ID
sequence: GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
separator: +
quality: !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Reads are held as RDD[String]
‘Interleaved’ reads: f1.zip(f2).map(e => e._1 + e._2)
class FastQRecordReader extends RecordReader
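The four-line record structure and the interleaving of paired reads can be tried out on plain Scala collections. The sketch below is not the `FastQRecordReader` from the talk (which plugs into Hadoop's `RecordReader`); `FastqRecord`, `parse`, and `interleave` are hypothetical names, and `Seq` stands in for `RDD`.

```scala
object FastqSketch {
  // One FASTQ record: identifier, sequence, separator, quality.
  final case class FastqRecord(id: String, sequence: String, separator: String, quality: String) {
    // Basic sanity checks on the four lines.
    def isValid: Boolean =
      id.startsWith("@") && separator.startsWith("+") && sequence.length == quality.length
    def asText: String = Seq(id, sequence, separator, quality).mkString("\n")
  }

  // Group lines four at a time; an incomplete trailing group is silently dropped.
  def parse(lines: Seq[String]): Seq[FastqRecord] =
    lines.grouped(4).collect {
      case Seq(id, seq, sep, qual) => FastqRecord(id, seq, sep, qual)
    }.toSeq

  // Pair the i-th record of each file and emit the two mates next to each other,
  // analogous to f1.zip(f2) on the slide.
  def interleave(f1: Seq[FastqRecord], f2: Seq[FastqRecord]): Seq[FastqRecord] =
    f1.zip(f2).flatMap { case (a, b) => Seq(a, b) }

  def main(args: Array[String]): Unit = {
    val forward = parse(Seq("@r1/1", "ACGT", "+", "IIII", "@r2/1", "GGCC", "+", "IIII"))
    val reverse = parse(Seq("@r1/2", "TTAA", "+", "IIII", "@r2/2", "CCGG", "+", "IIII"))
    interleave(forward, reverse).foreach(r => println(r.asText))
  }
}
```

As with `zip` on Spark RDDs, this only produces correct pairs when both inputs hold the mates in the same order.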
Genomics on Spark
Read alignment
[Figure: FASTQ files on HDFS are loaded in Scala into a FASTQ RDD (class FastQRecordReader extends RecordReader) with partitions 1 … n; each partition is passed to BWA through a perl wrapper called by rdd.pipe(); the resulting SAM RDD (partitions 1 … n) is held in memory; Scala code then sorts the output, adds the header, and writes SAM to HDFS]
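The final "sort output and add header" step could look roughly like this. The sketch works on plain collections that stand in for the per-partition aligner output; it does not call BWA or Spark, and `mergePartitions` and `toyAlignment` are hypothetical helpers. It relies only on the SAM conventions that column 3 is the reference name and column 4 the 1-based position.

```scala
object SamSortSketch {
  // Sort key: order of the reference sequence, then position on it.
  // Reads on unknown references (e.g. unmapped, "*") sort last.
  def sortKey(samLine: String, refOrder: Map[String, Int]): (Int, Int) = {
    val fields = samLine.split("\t")
    (refOrder.getOrElse(fields(2), Int.MaxValue), fields(3).toInt)
  }

  // Merge the alignment lines of all partitions, sort them by coordinate,
  // and prepend a SAM header built from (name, length) pairs.
  def mergePartitions(partitions: Seq[Seq[String]], references: Seq[(String, Int)]): Seq[String] = {
    val refOrder = references.map(_._1).zipWithIndex.toMap
    val header = "@HD\tVN:1.6\tSO:coordinate" +:
      references.map { case (name, len) => s"@SQ\tSN:$name\tLN:$len" }
    header ++ partitions.flatten.sortBy(sortKey(_, refOrder))
  }

  // Hypothetical helper that builds a minimal 11-column alignment line for the demo.
  def toyAlignment(name: String, ref: String, pos: Int): String =
    Seq(name, "0", ref, pos.toString, "60", "4M", "*", "0", "0", "ACGT", "IIII").mkString("\t")

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(toyAlignment("r1", "chr2", 5)),
      Seq(toyAlignment("r2", "chr1", 9), toyAlignment("r3", "chr1", 2))
    )
    mergePartitions(partitions, Seq("chr1" -> 100, "chr2" -> 80)).foreach(println)
  }
}
```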
Genomics on Spark
Processing and variant calling
Broad Institute, Cambridge
Broad Institute, Cambridge
AMPlab, University of California
• Alignment processing
• Variant calling (in development)
• Data schemas + APIs
• Variant calling (Guacamole)
• Variant analysis
• You’ve all attended the Keynote talk…
Genomics on Spark
Variant selection and analysis
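[Figure: a VCF file (local) is converted with Scala to parquet (HDFS), loaded as an RDD and registered as a tempTable (memory); an SQL selection on that table is written back with Scala to VCF (local). The variant selection, together with phenotypes, feeds GWAS]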
• Interactive, “real-time” selection of variant data with simple SQL queries
• GWAS analysis on Spark (e.g. hail) or conventional infra (e.g. PLINK)
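The kind of selection meant here can be illustrated without a cluster. The sketch below filters an in-memory collection in plain Scala and shows the corresponding SQL only as a comment. The `VariantRow` fields follow the fixed VCF columns, while the table name `variants` and the thresholds are assumptions for the example, not the schema used in the talk.

```scala
object VariantSelection {
  // A subset of the fixed VCF columns: CHROM, POS, REF, ALT, QUAL.
  final case class VariantRow(chrom: String, pos: Long, ref: String, alt: String, qual: Double)

  // Plain-collections equivalent of a query such as:
  //   SELECT * FROM variants
  //   WHERE chrom = 'chr1' AND pos BETWEEN 1000 AND 2000 AND qual >= 30
  // (table and column names are hypothetical)
  def select(variants: Seq[VariantRow], chrom: String, from: Long, to: Long, minQual: Double): Seq[VariantRow] =
    variants.filter(v => v.chrom == chrom && v.pos >= from && v.pos <= to && v.qual >= minQual)

  def main(args: Array[String]): Unit = {
    val variants = Seq(
      VariantRow("chr1", 1200L, "A", "G", 50.0),
      VariantRow("chr1", 1500L, "C", "T", 12.0),
      VariantRow("chr1", 2500L, "G", "A", 60.0),
      VariantRow("chr2", 1300L, "T", "C", 45.0)
    )
    select(variants, "chr1", 1000L, 2000L, 30.0).foreach(println)
  }
}
```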
Big Data in Genomics
KeyGene’s ambition
Wrap-up
Conclusions and lessons learnt
• Initial success in applying Spark to plant genomics
• Proof-of-concept for enabling interactive GWAS analysis on Spark
• Spark appears to be a good fit for (some of our current) Genomics problems
• Developer community needed to translate core Genomics applications!
• Paradigm shift required to move away from flat POSIX files…
• Opportunities for streaming data analysis
The End
High-throughput Genomics at Your Fingertips with Apache Spark
• Erwin Datema erwin.datema@keygene.com
• Roeland van Ham roeland.van-ham@keygene.com

Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 

Spark Summit EU talk by Erwin Datema and Roeland van Ham

  • 1. High-throughput Genomics at Your Fingertips with Apache Spark. Erwin Datema, Roeland van Ham. Spark Summit EU 2016, October 26, Brussels
  • 2. Overview: High-throughput Genomics at Your Fingertips with Apache Spark
      • Disclaimer: I am a scientist in computational biology (bioinformatics)
          • I am not a computer scientist
          • I am not a data scientist
      • Scope
          • KeyGene’s journey into Spark to analyze genomics data
          • Goal: enable interactive genomics data processing and querying
          • Told from a user’s perspective
      • Contents
          • Introduction to KeyGene
          • Crash Course Genomics
          • Big Data Challenges
  • 3. Global trends
      • World population grows from 7 to 9 billion people by 2050
      • Climate change
      • Limited/bad land, water and fossil fuels
      • More obese people
      • Malnourished people
  • 4. Agricultural challenges
      • YIELD: Producing more food on less land
      • Population 3.0 billion: 4.3 hectares arable land per person
      • Population 4.4 billion: 3.0 hectares arable land per person
      • Population 6.0 billion: 2.2 hectares arable land per person
      • Population 7.5 billion: 1.8 hectares arable land per person
  • 5. Genetic improvement of crops
      • Our strategy: use of natural genetic variation in crops
          • Molecular breeding
          • Molecular mutagenesis
      • Not GM
          • At this moment too costly (20-100 million €)
          • Regulatory, societal and technical hurdles
  • 6. About KeyGene
      • Working for the future of global agriculture
      • Founded in 1989
      • Current shareholders
      • Go-to Ag Biotech company for higher crop yield & quality
  • 7. R&D strategy
      • Stages: fundamental research; developing technologies & traits; applying molecular breeding of crops; breeding; seed products; market
      • Actors: universities; KeyGene; partners in the breeding industry
  • 8. Big Data in Genomics: Relevance
      • Genomic data is being produced on an unprecedented scale
      • The cost to sequence a genome is now a few thousand dollars
      • We can routinely sequence tens to hundreds of individual plants
      • Figure: Total data volume (# bases) in NCBI GenBank
      • Image source: https://www.genome.gov/images/content/costpermb2015_4.jpg
      • Data source: https://www.ncbi.nlm.nih.gov/genbank/statistics/
  • 9. Plant Genomics: Different from human genomics!
      • Human: one genome. Plant: many genomes
      • Human: genomics as a diagnostic tool. Plant: genomics as a tool to direct breeding
      • Human: “simple” genetics. Plant: complex genetics
      • Human: high quality, shared resources. Plant: variable quality, fragmented resources
      • Human: understand and cure diseases. Plant: improve crop yield and quality
  • 10. Crash Course Genomics: Plant genome sizes
      • Onion: ~16 Gb, diploid
      • Wheat: ~5.5 Gb, hexaploid
      • Pepper: ~3.5 Gb, diploid
      • Potato: ~850 Mb, tetraploid
      • Melon: ~450 Mb, diploid
      • Mb = megabases; Gb = gigabases
      • diploid = two sets of chromosomes; tetraploid = four sets of chromosomes
  • 11. Crash Course Genomics: DNA, chromosomes, nucleotides
      • DNA consists of four different elements (nucleotides or bases)
      • We represent DNA as strings of characters from the alphabet A C G T
      • DNA is organized into chromosomes
      • Each chromosome contains millions of nucleotides (characters)
      • We can only ‘read’ short pieces of DNA (hundreds to thousands of nucleotides)
  • 12. Crash Course Genomics: Genes and traits
      • Figure labels: genome, genes, drought tolerance, fruit size
      • Genes are (some of) the functional elements in the genome
      • We represent genes as an interval on a chromosome
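
The interval representation of genes mentioned on this slide can be sketched in a few lines of Scala. The names `Gene` and `genesOverlapping`, the 1-based inclusive coordinates and the example genes are assumptions made for illustration; none of them come from the talk.

```scala
object GeneIntervals {
  // A gene as a closed interval on a chromosome.
  // Assumption: coordinates are 1-based and inclusive.
  final case class Gene(name: String, chrom: String, start: Long, end: Long) {
    require(start <= end, "start must not exceed end")

    def length: Long = end - start + 1

    // Two intervals overlap when they are on the same chromosome
    // and neither one ends before the other starts.
    def overlaps(c: String, s: Long, e: Long): Boolean =
      chrom == c && start <= e && s <= end
  }

  // Return all genes that overlap the query region.
  def genesOverlapping(genes: Seq[Gene], chrom: String, start: Long, end: Long): Seq[Gene] =
    genes.filter(_.overlaps(chrom, start, end))

  // Hypothetical example genes, named after the traits shown on the slide.
  val genes: Seq[Gene] = Seq(
    Gene("droughtTolerance", "chr1", 1000L, 5000L),
    Gene("fruitSize", "chr1", 20000L, 26000L),
    Gene("other", "chr2", 1000L, 5000L)
  )
}
```

A query such as `GeneIntervals.genesOverlapping(GeneIntervals.genes, "chr1", 4000L, 21000L)` returns the two chr1 genes. A linear scan is enough for a sketch; a genome-scale gene set would more likely use an interval index.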
  • 13. Crash Course Genomics: Polymorphisms and their impact
      • Figure labels: large fruits, small fruits, no impact
  • 14. Crash Course Genomics: Read alignment and variant calling
      • The ‘reference genome’ represents the known sequence of a species
      • Reads are aligned against the reference genome (string similarity search)
      • Complex: sequence variation, repetitive regions and data errors
      • Variants are called from differences observed in ‘pile-ups’ of reads
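
To make the ‘pile-up’ idea concrete, here is a toy sketch in Scala. It assumes the reads are already aligned without gaps and lie fully within the reference, and it calls a variant wherever enough reads disagree with the reference base. The names `AlignedRead`, `pileup` and `callVariants`, the thresholds and the example data are all illustrative assumptions; production variant callers use far more elaborate statistical models.

```scala
object PileupSketch {
  // An aligned read: 0-based start offset on the reference plus its bases.
  // Assumption: no gaps, no clipping, read lies within the reference.
  final case class AlignedRead(start: Int, bases: String)
  final case class Variant(pos: Int, ref: Char, alt: Char, depth: Int, altCount: Int)

  // Collect, for every reference position, the bases of all reads covering it.
  def pileup(reads: Seq[AlignedRead]): Map[Int, Seq[Char]] =
    reads
      .flatMap(r => r.bases.zipWithIndex.map { case (b, i) => (r.start + i, b) })
      .groupBy(_._1)
      .map { case (pos, hits) => pos -> hits.map(_._2) }

  // Report positions where the most frequent non-reference base reaches
  // the given fraction of a sufficiently deep pile-up column.
  def callVariants(reference: String, reads: Seq[AlignedRead],
                   minDepth: Int, minAltFraction: Double): Seq[Variant] =
    pileup(reads).toSeq.sortBy(_._1).flatMap { case (pos, column) =>
      val ref = reference.charAt(pos)
      val nonRef = column.filter(_ != ref)
      if (column.size < minDepth || nonRef.isEmpty) None
      else {
        val (alt, altCount) =
          nonRef.groupBy(identity).map { case (b, bs) => (b, bs.size) }.maxBy(_._2)
        if (altCount.toDouble / column.size >= minAltFraction)
          Some(Variant(pos, ref, alt, column.size, altCount))
        else None
      }
    }

  // Hypothetical example: three of the four reads covering position 3 carry A instead of T.
  val exampleReference = "ACGTACGT"
  val exampleReads: Seq[AlignedRead] = Seq(
    AlignedRead(0, "ACGTA"),
    AlignedRead(2, "GAACG"),
    AlignedRead(3, "AACGT"),
    AlignedRead(1, "CGAAC")
  )
}
```

With a minimum depth of 3 and a minimum alternative fraction of 0.5, the example yields a single T to A variant at position 3. A lone mismatching read, as in the first example read, is treated as a possible data error until other reads support it.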
  • 15. Crash Course Genomics: Population-scale genome sequencing
      • Figure labels: genome (5 gigabases), genes, reads for individuals 1, 2, … n (1 billion reads each)
      • Align a billion reads x 1,000 individuals to a 5 Gb genome
      • Call hundreds of millions (up to potentially billions) of sequence variants
  • 16. Crash Course Genomics: Recap
      • Genome sequences are represented as strings of A C G T
      • Variation between genome sequences underlies differences in traits
      • We can “read” the genome in little pieces
          • High throughput, massively parallel sequencing technologies
          • Up to thousands of individual plants from a given species
      • Genomics data analysis is challenging
          • Rapid increase in data generation
          • Rapid turnover of sequencing technologies and their outputs
          • Scientific software is (often) bad (Nature News, Oct 13 2010)
  • 17. Genomics data analysis: High Performance Computing
      • Computational challenges
          • Align billions of reads to the reference genome
          • Call millions or billions of sequence variants
          • Determine the small number of variants that impact a given trait
      • HPC infrastructure (e.g. SGE clusters) is the de facto standard
          • Manually split large datasets (to accommodate the job scheduler)
          • Manually deal with failures: check logs, resubmit jobs…
      • Many software tools are in fact large, monolithic “pipelines”
          • No fine-grained control over resource usage
          • A single error often implies a complete re-run of the analysis
  • 18. High Performance Computing: Big Data technologies
      • Conventional compute cluster: a submit host, compute nodes and a separate storage server
          • Expensive, proprietary storage
          • Expensive network connections
          • Expensive, high-reliability hardware
      • Spark cluster: an admin / name node and nodes that combine compute + storage
          • Commodity hardware
          • Linear scalability
          • Fault tolerance
  • 19. Genomics application landscape: Opportunities for Big Data technologies
      • Figure labels: compute tool; conventional compute cluster; high memory; hardware accelerated (GPU / FPGA); Spark (and Hadoop); ?
  • 20. Big Data in Genomics: Challenges
      • Data flow: DNA sequencer, reads (FASTQ), alignments (SAM, BAM), variants (VCF)
      • Genomics is a dynamic, rapidly changing field
      • Data generators and analysis algorithms are in constant flux
      • Tools are generally built around flat, text-based file formats
      • Workflow is file-centric (POSIX file system; no streaming…)
  • 21. Big Data in Genomics: Our ‘Sparkified’ pipeline
      • Stages: read alignment, processing, variant calling
      • Tools: Scala, BWA, GATK4 Mark Duplicates, ADAM, Guacamole
      • Data formats and locations: FASTQ (HDFS), SAM (HDFS), BAM (HDFS), VCF (local)
  • 22. Genomics on Spark: Solutions to legacy designs
      • DNA sequencer produces reads in FASTQ format
      • A FASTQ record consists of four lines
          • identifier: @SEQ_ID
          • sequence: GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
          • separator: +
          • quality: !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
      • Reads are held as RDD[String], loaded with class FastQRecordReader extends RecordReader
      • ‘Interleaved’ reads: f1.zip(f2).map(e => e._1 + e._2)
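
The four-line FASTQ record and the `f1.zip(f2).map(e => e._1 + e._2)` interleaving step from this slide can be sketched with plain Scala collections standing in for RDDs. `FastqRecord`, `parse` and `interleave` are hypothetical names; the talk's own `FastQRecordReader` is not shown in the slides and is not reproduced here.

```scala
object FastqSketch {
  // One FASTQ record: identifier line, sequence, and per-base qualities.
  // The separator line ('+') carries no information and is not stored.
  final case class FastqRecord(id: String, sequence: String, quality: String) {
    require(sequence.length == quality.length, "sequence and quality must have equal length")

    // Render the record back into its four-line text form.
    def render: String = s"$id\n$sequence\n+\n$quality\n"
  }

  // Group the lines of a FASTQ file into records of four lines each.
  def parse(lines: Seq[String]): Seq[FastqRecord] =
    lines.grouped(4).map {
      case Seq(id, seq, sep, qual) if id.startsWith("@") && sep.startsWith("+") =>
        FastqRecord(id, seq, qual)
      case other =>
        throw new IllegalArgumentException(s"malformed FASTQ record: $other")
    }.toList

  // Pair each record of file 1 with its mate from file 2 and concatenate them,
  // mirroring f1.zip(f2).map(e => e._1 + e._2) from the slide.
  def interleave(f1: Seq[FastqRecord], f2: Seq[FastqRecord]): Seq[String] = {
    require(f1.size == f2.size, "both files must contain the same number of reads")
    f1.zip(f2).map(e => e._1.render + e._2.render)
  }
}
```

A record spans four lines while default text input is split line by line, which is presumably why the pipeline uses a custom record reader. On Spark itself, `RDD.zip` additionally requires both RDDs to have the same number of partitions and the same number of elements in each partition.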
  • 23. Genomics on Spark: Read alignment
      • FASTQ files on HDFS are loaded into an RDD from Scala (class FastQRecordReader extends RecordReader)
      • The RDD is split into partitions 1, 2, … n
      • BWA runs on each partition: a perl wrapper for BWA is called by rdd.pipe()
      • The aligned partitions form an RDD in memory
      • Scala code sorts the output and adds a header
      • The resulting SAM is written to HDFS
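
The final “sort output and add header” step can be sketched as follows, again with plain Scala collections in place of RDD partitions. The slide does not say how the output is sorted; this sketch assumes coordinate order (reference name, then position), and `SamSortSketch` and `sortAndAddHeader` are hypothetical names. It covers only the post-processing, not the `rdd.pipe()` call to BWA.

```scala
object SamSortSketch {
  // Sort key of a SAM alignment line: reference name (column 3)
  // and 1-based leftmost position (column 4).
  private def sortKey(line: String): (String, Int) = {
    val fields = line.split("\t")
    (fields(2), fields(3).toInt)
  }

  // Drop any header lines emitted per partition, sort the alignments,
  // and prepend a single header.
  // Assumption: reference names are ordered lexicographically here, whereas
  // real tools follow the order of the @SQ lines in the header.
  def sortAndAddHeader(header: Seq[String], partitions: Seq[Seq[String]]): Seq[String] = {
    val alignments = partitions.flatten.filterNot(_.startsWith("@"))
    header ++ alignments.sortBy(sortKey)
  }
}
```

Each aligner process sees only its own partition, so header lines and ordering have to be reconciled once across all partitions before the result is a single valid SAM file.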
  • 24. Genomics on Spark: Processing and variant calling
      • Affiliations shown: Broad Institute, Cambridge; Broad Institute, Cambridge; AMPLab, University of California
      • Alignment processing
      • Variant calling (in development)
      • Data schemas + APIs
      • Variant calling (Guacamole)
      • Variant analysis
      • You’ve all attended the Keynote talk…
  • 25. Genomics on Spark: Variant selection and analysis
      • Diagram labels: VCF (local), Scala, Parquet (HDFS), RDD, tempTable, SQL selection, Scala, VCF (local), memory
      • Variant selection and phenotypes are the inputs to GWAS
      • Interactive, “real-time” selection of variant data with simple SQL queries
      • GWAS analysis on Spark (e.g. Hail) or conventional infrastructure (e.g. PLINK)
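
The kind of selection this slide describes can be illustrated without a cluster. The sketch below parses the fixed leading columns of VCF data lines and filters them by region and quality, which is roughly what a simple SQL query against the temp table would express. The names `Variant`, `parseLine` and `select`, and the choice of filter criteria, are assumptions for illustration rather than the queries used in the talk.

```scala
object VariantSelectionSketch {
  // The first six fixed columns of a VCF data line: CHROM, POS, ID, REF, ALT, QUAL.
  final case class Variant(chrom: String, pos: Int, id: String,
                           ref: String, alt: String, qual: Double)

  // Parse one tab-separated VCF data line; header lines (starting with '#') yield None.
  def parseLine(line: String): Option[Variant] =
    if (line.startsWith("#")) None
    else {
      val f = line.split("\t")
      // VCF writes a missing QUAL as "."; mapping it to 0.0 is an assumption of this sketch.
      val qual = if (f(5) == ".") 0.0 else f(5).toDouble
      Some(Variant(f(0), f(1).toInt, f(2), f(3), f(4), qual))
    }

  // Equivalent in spirit to the SQL query:
  //   SELECT * FROM variants
  //   WHERE chrom = ? AND pos BETWEEN ? AND ? AND qual >= ?
  def select(variants: Seq[Variant], chrom: String,
             from: Int, to: Int, minQual: Double): Seq[Variant] =
    variants.filter(v => v.chrom == chrom && v.pos >= from && v.pos <= to && v.qual >= minQual)
}
```

Once the variants are stored in a columnar format and registered as a table, a query like this reads only the columns it filters on, which is what makes interactive selection plausible at this scale.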
  • 26. Big Data in Genomics: KeyGene’s ambition
  • 27. Wrap-up: Conclusions and lessons learnt
      • Initial success in applying Spark to plant genomics
      • Proof-of-concept for enabling interactive GWAS analysis on Spark
      • Spark appears to be a good fit for (some of our current) Genomics problems
      • Developer community needed to translate core Genomics applications!
      • Paradigm shift required to move away from flat POSIX files…
      • Opportunities for streaming data analysis
  • 28. The End: High-throughput Genomics at Your Fingertips with Apache Spark
      • Erwin Datema: erwin.datema@keygene.com
      • Roeland van Ham: roeland.van-ham@keygene.com