This document describes implementing the Smith-Waterman algorithm for genomic sequence alignment in Spark. It motivates the problem of scaling gene alignment to large genomic datasets and describes testing the algorithm on reference datasets ranging from 49MB to 234MB, each containing from 1 to over 30,000 reference sequences. Key challenges included understanding the dynamic programming algorithm, adapting it from object-oriented Java to Spark's functional programming model, and distributing either the algorithm or the large reference datasets across a cluster.
Enabling Biobank-Scale Genomic Processing with Spark SQL (Databricks)
With the size of genomic data doubling every seven months, existing tools in the genomic space designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while being flexible to ad-hoc analysis, Databricks and Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
Lightning fast genomics with Spark, Adam and Scala (Andy Petrella)
We are at a time when biotechnology allows us to get a personal genome for $1,000. Tremendous progress has been made in DNA sequencing since the 70s, e.g. more samples per experiment and greater genomic coverage at higher speeds. The genomic analysis standards developed over the years weren't designed with scalability and adaptability in mind. In this talk, we'll present a game-changing technology in this area: ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and Parquet storage. We'll see how it can speed up a sequence reconstruction by a factor of 150.
How to use Impala query plan and profile to fix performance issues (Cloudera, Inc.)
Apache Impala is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses, explains how Impala optimizes queries, and shows how to identify performance bottlenecks through the query plan and profile and how to drive Impala to its full potential.
Under the Hood of Alignment Algorithms for NGS Researchers (Golden Helix Inc)
Most NGS analysis is founded on a very simple and powerful principle: look only at the differences between your data and a reference genome of your species. Alignment algorithms are the workhorse of this approach and account for the vast majority of the compute time necessary in a secondary analysis workflow. In this webcast, Gabe Rudy covers the history of alignment algorithms for short-read, high-throughput sequencing data and the set of tools that represent the state of the art.
We will use the newly launched GenomeBrowse 2.0 visualization engine to review examples of different alignment artifacts, false-positive variant calls, and other alignment and variant meta-data.
What you can expect to learn:
- How all alignment algorithms are a trade-off of speed versus accuracy, and what those trade-offs can mean with your data.
- How the human reference sequence causes alignment artifacts, and how you can spot them.
- How BWA, BWA-MEM and BWA-SW differ.
- How local re-alignment works to improve variant calling, and when you will see it and won't see it in action in your data.
- How to read a CIGAR string and other per-alignment data to investigate alignments at a particular locus.
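As background for that last point: the CIGAR field of the SAM/BAM format encodes how a read aligns against the reference as run-length (length, operation) pairs, per the SAM specification. A minimal decoder (plain Python; a sketch, not part of the webcast) might look like this:

```python
import re

# CIGAR operations per the SAM specification:
# M/=/X consume read and reference, I/S consume read only,
# D/N consume reference only, H/P consume neither.
CONSUMES_READ = set("MIS=X")
CONSUMES_REF = set("MDN=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def aligned_lengths(cigar):
    """Return (read bases used, reference bases spanned) for a CIGAR string."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len

# A 100bp read with 5 soft-clipped bases and a 2-base deletion:
print(parse_cigar("5S70M2D25M"))     # [(5, 'S'), (70, 'M'), (2, 'D'), (25, 'M')]
print(aligned_lengths("5S70M2D25M"))  # (100, 97)
```

Reading the pairs against the reference coordinates is what lets you investigate an alignment at a particular locus.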
This talk covers Kafka cluster sizing, instance type selections, scaling operations, replication throttling and more. Don’t forget to check out the Kafka-Kit repository.
https://www.youtube.com/watch?time_continue=2613&v=7uN-Vlf7W5E
Lucas Waye of TiVo talks about how the company uses Presto for SQL analytics. Meetup co-sponsored by Starburst (www.starburstdata.com) and Qubole (www.qubole.com).
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016 (MLconf)
Comparing TensorFlow NLP Options: word2vec, GloVe, RNN/LSTM, SyntaxNet, and Penn Treebank: Through code samples and demos, we'll compare the architectures and algorithms of the various TensorFlow NLP options. We'll explore both feed-forward and recurrent neural networks such as word2vec, GloVe, RNN/LSTM, SyntaxNet, and Penn Treebank using the latest TensorFlow libraries.
Liu R, Hu J. DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16.
2. Motivation
- What is your genomic sequence?
- Gene Therapy: personalised treatment based on your genomic sequence
- Many gene alignment algorithms available
- Not many distributed
The Problem
Where does this align?

Read:      ACTCA
Reference: ATGCAGAC

(The slide repeats the pair while sliding the read along the reference, then shows a gapped placement:)

Read:      ACTC_A
Reference: A_TGCAGAC
4. About the Data
Dataset
- Downloaded from the NCBI FTP site
- Total size = 26.5GB (out of approx. 200TB available)
- 518 files containing 12,321,160 mRNA sequences
- mean number of base pairs per sequence = 2,160
- median number of base pairs per sequence = 1,609
Input Files
- Randomly selected sequences from the dataset
16. Scaling up: Parallelization
Parallelize the algorithm, the reads, or the reference set
- Algorithm
- Dynamic programming algorithm
- Fills a matrix where each cell depends on 3 of its neighbouring cells (left, above, and upper-left diagonal)
- Reads
- reads are much shorter
- but all scores for each read must be stored at the reduce stage
- Reference Set
- much longer than the reads
- # references per file ranges from 1 to 73,030
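The matrix-fill structure described above can be sketched as follows (a single-machine Python sketch using the Match = 5 / Mismatch = -3 / Gap = -4 scores from the later slides; not the talk's actual Spark implementation):

```python
# Minimal Smith-Waterman fill: each cell H[i][j] depends on its
# upper-left (alignment), left (insertion), and upper (deletion) neighbours.
MATCH, MISMATCH, GAP = 5, -3, -4

def smith_waterman_matrix(read, ref):
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # first row/column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (MATCH if read[i-1] == ref[j-1] else MISMATCH)
            left = H[i][j-1] + GAP   # insertion (gap in the read)
            up   = H[i-1][j] + GAP   # deletion (gap in the reference)
            H[i][j] = max(diag, left, up, 0)  # local alignment floors at 0
    return H

# The read/reference pair from the "Problem" slide:
H = smith_waterman_matrix("ACTCA", "ATGCAGAC")
# The best local alignment score is the maximum cell in the matrix:
print(max(max(row) for row in H))  # → 12
```

The diagonal dependency chain is what makes the fill hard to parallelize naively: cell (i, j) cannot be computed before (i-1, j-1), (i-1, j), and (i, j-1), which is why the talk instead considers distributing the reads or the reference set.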
17. Challenges Encountered
- Understanding the algorithm
- Object-Oriented Java → Functional Spark
- OutOfMemory: Java heap space errors
- Distributing the algorithm
- Distributing either dataset (reads or references)
- ftp → S3
21. Let the following scores be:
Match = 5
Mismatch = -3
Gap = -4

Each cell takes the maximum of three moves, floored at 0 for local alignment:

score = max { alignment: diagonal + (match or mismatch),
              insertion: left + gap,
              deletion:  up + gap,
              0 }

Partially filled matrix (reference CGTGAATTCAT across the top; the first row and column are initialised to 0):

    -  C  G  T  G  A  A  T  T  C  A  T
-
G
A
C
T
T             13  9
A             14
C

The cells shown (13 and 9 in row T, 14 in row A) are the neighbours used on the next slide.
22. Computing the new cell, whose diagonal neighbour is 13:

    -  C  G  T  G  A  A  T  T  C  A  T
-
G
A
C
T
T             13  9
A             14

Alignment (diagonal):
Score = 13 + (match or mismatch)
      = 13 + 5
      = 18
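The worked example above can be checked with a small helper (plain Python; a sketch, not the talk's code — the cell positions are as shown on the slide):

```python
# Scores from slide 21.
MATCH, MISMATCH, GAP = 5, -3, -4

def cell_score(diag, up, left, read_base, ref_base):
    """Score one Smith-Waterman cell from its three neighbours."""
    alignment = diag + (MATCH if read_base == ref_base else MISMATCH)
    deletion = up + GAP     # gap in the read
    insertion = left + GAP  # gap in the reference
    return max(alignment, insertion, deletion, 0)

# Slide 22: diagonal = 13, up = 9, left = 14, and the bases match,
# so the alignment move (13 + 5) beats insertion (14 - 4) and deletion (9 - 4):
print(cell_score(diag=13, up=9, left=14, read_base="A", ref_base="A"))  # → 18
```

With a mismatch instead, the alignment move would score 13 - 3 = 10, tying the insertion move, so the cell would be 10 rather than 18.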