Implementing Smith-Waterman in Spark
Elizabeth Fong
Insight Data Engineering NY
September - October 2015
Motivation
- What is your genomic sequence?
- Gene Therapy: personalised treatment based on your genomic sequence
- Many gene alignment algorithms are available
- Few of them are distributed
The Problem
Where does this align?
ACTCA
ATGCAGAC
ACTC_A
A_TGCAGAC
About the Data
Dataset
- Downloaded from the NCBI FTP site
- Total size used = 26.5 GB out of approx. 200 TB
- Dataset contains 518 files holding 12,321,160 mRNA sequences
- mean (average) number of base pairs per sequence = 2,160
- median number of base pairs per sequence = 1,609
Input Files
- Randomly selected sequences from the dataset
[Figure: file layout. Reference file: many reference sequences, each with metadata and the sequence. Input: the sequence.]
[Figure: output. Best-aligned reference sequence; where the sequences match (reference, input); optimal alignment.]
Legend: a = alignment, i = insertion, d = deletion
Scaling up: Tests and Data
Reference file   # Reference   Input file   # Inputs   # Comparisons   Time (distribute
size / MB        sequences     size / KB                               reference) / h
49.0             17,864        2            19         339,416         0.75
69.0             23,398        3            31         725,338         2.05
83.8             35,751        3            32         1,144,032       2.64
97.9             30,096        2            19         571,824         1.50
118.6            ?             2            ?          N/A
234.9            ?             3            ?          N/A
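The comparison counts in the table are consistent with one alignment per (reference sequence, input) pair, i.e. # comparisons = # reference sequences x # inputs. The short check below is only a sketch of that arithmetic on the four completed runs. The function names are made up for illustration, and the per-second rate is derived here from the reported hours rather than taken from the slides.

```python
# Rows copied from the table: (reference sequences, inputs, comparisons, hours).
runs = [
    (17864, 19, 339416, 0.75),
    (23398, 31, 725338, 2.05),
    (35751, 32, 1144032, 2.64),
    (30096, 19, 571824, 1.50),
]


def comparisons(n_refs, n_inputs):
    """One alignment per (reference sequence, input) pair."""
    return n_refs * n_inputs


def per_second(n_comparisons, hours):
    """Average alignments per second over the whole run."""
    return n_comparisons / (hours * 3600)


for n_refs, n_inputs, reported, hours in runs:
    # The reported count should equal the product of the two sizes.
    assert comparisons(n_refs, n_inputs) == reported
    print(f"{n_refs:>6} refs x {n_inputs:>2} inputs = {reported:>9,} comparisons, "
          f"about {per_second(reported, hours):.0f} per second")
```

Because the work grows with the product of the two sizes, adding either more references or more inputs increases the run time.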
Scaling up: Additional Tests
read length 80
# reference sequences 10
# reference files 1
Scaling up: Additional Tests
# reads 5
read length 80
reference sequence length 400
# reference files 1
Scaling up: Additional Tests
# reads 5
read length 80
# reference sequences 1
# reference files 1
Scaling up: Parallelization
Three things could be parallelized: the algorithm, the reads, or the reference set
- Algorithm
  - Dynamic programming algorithm
  - Fills a matrix in which each cell depends on 3 of its neighbouring cells
- Reads
  - Reads are much shorter
  - But all the scores for each read must be stored on reduce
- Reference Set
  - Much longer than the reads
  - # references per file ranges from 1 to 73,030
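One way to picture distributing the reference set is as a map over independent (read, reference) pairs followed by a per-read reduce that keeps the best score. The sketch below uses plain Python in place of Spark, so it is not the code from this project and does not use any Spark API. The names sw_score and best_reference_per_read are hypothetical, and the scoring values are the ones from the worked example in this deck.

```python
from itertools import product

MATCH, MISMATCH, GAP = 5, -3, -4  # assumed: scores from the worked example


def sw_score(read, ref):
    """Best local-alignment score, keeping only the previous matrix row."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for a in read:
        curr = [0]
        for j, b in enumerate(ref, start=1):
            diag = prev[j - 1] + (MATCH if a == b else MISMATCH)
            score = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_reference_per_read(reads, references):
    """Map every (read, reference) pair to a score, then reduce per read."""
    # "Map" step: each pair is independent, so it could run on any worker.
    scored = (
        (read_id, (sw_score(read, ref), ref_id))
        for (read_id, read), (ref_id, ref)
        in product(reads.items(), references.items())
    )
    # "Reduce" step: keep only the highest-scoring reference for each read.
    best = {}
    for read_id, candidate in scored:
        if read_id not in best or candidate > best[read_id]:
            best[read_id] = candidate
    return best


reads = {"r1": "GACTTAC"}
references = {"ref1": "CGTGAATTCAT", "ref2": "TTTT"}
print(best_reference_per_read(reads, references))
```

Keeping only the running best per read in the reduce step avoids holding every score for a read in memory at once, which is the cost noted under "Reads" above.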
Challenges Encountered
- Understanding the algorithm
- Object-Oriented Java → Functional Spark
- OutOfMemory: Java heap space
- Distributing the algorithm
- Distributing either dataset
- ftp → S3
BA Computer Science, minor in Molecular Biology
Mount Holyoke College
http://vlab.amrita.edu/?sub=3&brch=274&sim=1433&cnt=1
Let the following scores be:
Match = 5
Mismatch = -3
Gap = -4
    -   C   G   T   G   A   A   T   T   C   A   T
-
G
A
C
T
T                                      13   9
A                                      14
C
score = max{ 0, alignment, insertion, deletion }
Alignment (diagonal)
Score = 13 + (match or mismatch)
= 13 + 5
= 18
Insertion (vertical)
Score = 9 + (gap)
= 9 - 4
= 5
Deletion (horizontal)
Score = 14 + (gap)
= 14 - 4
= 10
score = max{ 0 , 18 , 5 , 10 }
= 18
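The single-cell update worked through above can be written as a small function. This is a minimal sketch: cell_score and its parameter names are chosen here for illustration, and the scores are the Match, Mismatch and Gap values given earlier.

```python
MATCH, MISMATCH, GAP = 5, -3, -4  # scores given on the earlier slide


def cell_score(diagonal, above, left, is_match):
    """Smith-Waterman update for one cell from its three neighbours."""
    alignment = diagonal + (MATCH if is_match else MISMATCH)  # diagonal move
    insertion = above + GAP                                   # vertical move
    deletion = left + GAP                                     # horizontal move
    # The 0 lets a local alignment restart instead of going negative.
    return max(0, alignment, insertion, deletion)


# Neighbours from the example: diagonal 13, above 9, left 14, and A matches A.
print(cell_score(diagonal=13, above=9, left=14, is_match=True))  # prints 18
```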
    -   C   G   T   G   A   A   T   T   C   A   T
-
G
A
C
T
T                                      13   9
A                                      14  18
C
align align align align align del align
G A A T T C A
G A C T T _ A
align align align align align ins align
G A A T T _ C
G A C T T A C
Both alignment scores = 18
Therefore both are optimal alignments
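The tie can be checked by filling the whole matrix for the two sequences in the example and listing every cell that holds the maximum. The sketch below assumes the same scores as above (Match = 5, Mismatch = -3, Gap = -4). fill_matrix and best_cells are names chosen here for illustration, and traceback is left out.

```python
MATCH, MISMATCH, GAP = 5, -3, -4  # scores given on the earlier slide


def fill_matrix(top, side):
    """Fill the Smith-Waterman matrix; row 0 and column 0 stay at 0."""
    rows, cols = len(side) + 1, len(top) + 1
    h = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            step = MATCH if side[i - 1] == top[j - 1] else MISMATCH
            h[i][j] = max(
                0,
                h[i - 1][j - 1] + step,  # alignment (diagonal)
                h[i - 1][j] + GAP,       # insertion (vertical)
                h[i][j - 1] + GAP,       # deletion (horizontal)
            )
    return h


def best_cells(h):
    """Return the maximum score and every (row, column) where it occurs."""
    best = max(max(row) for row in h)
    cells = [(i, j) for i, row in enumerate(h)
             for j, value in enumerate(row) if value == best]
    return best, cells


h = fill_matrix(top="CGTGAATTCAT", side="GACTTAC")
best, cells = best_cells(h)
print(best, cells)  # prints 18 [(6, 10), (7, 9)]
```

Two different cells hold the maximum of 18, so tracing back from each one gives a different alignment with the same optimal score.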
Insight Data Engineering - Demo
